On the Statistical Estimation of Asymmetrical Relationship Between Two Climate Variables

Two simple methods commonly used to detect asymmetry in climate research, composite analysis, and asymmetric linear regression, are discussed and compared using mathematical derivation and synthetic data. Asymmetric regression is shown to provide unbiased estimates only when the respective mean of positive and negative events is removed from both independent and dependent variables (i.e., non‐zero y‐intercepts). Composite analysis always provides biased results and strongly underestimates the asymmetry, albeit less so for very larger thresholds, which cannot be used with limited observational data. Hence, the unbiased asymmetric regression should be used, even though uncertainties can be large for small samples. Differences in estimated asymmetry are illustrated for the sea surface temperature and winter sea level pressure signals associated with El Niño and La Niña.


An Asymmetric Example
To easily compare how asymmetry is estimated from observations, we introduce a model that mimics in a very simple way the asymmetry of the atmospheric response to oceanic anomalies.Let N observations of an oceanic anomaly (e.g., an El Niño SST index) be represented by the time series x(j) for j = 1, N. The sample mean of the oceanic anomalies,  () = , is by definition of an anomaly equal to zero.Here  p is positive and  n negative, and there are  p positive and  n negative x-values, N =  p + n .To represent an asymmetric atmospheric response, we assume that an atmospheric variable y is given by a term proportional to the oceanic anomalies plus an uncorrelated noise ε with zero mean that represents intrinsic atmospheric variability and has the same variance in positive and negative cases, p() = p() +  (𝑘𝑘) (1) where k corresponds to positive x's and r to negative x's.The true response to negative x-values is taken for simplicity to be  n while that to positive x-values is given by    p .The true asymmetry is given by the ratio of the slopes (3) if one assumes for simplicity that there are as many positive  p and negative  n events.Then, atmospheric anomalies Y(k), as used in observational analyses, are defined as departures from the sample mean and thus have zero mean.This is obtained by subtracting

Zero-Intercept Regression (Method 1)
If one assumes that positive and negative events are represented by two regression models with fixed-zero intercepts FRANKIGNOUL AND KWON 10.1029/2022GL100777 3 of 9 where e is the residual, the a p estimator of the true slope   is, using Equation 4, If   > 1, the estimator a p of the true slope   is negatively biased, while the estimator a n of the true slope 1 is positively biased as one has from Equation 5n ≈ Hence, the true asymmetry

Non-Zero Intercept Regression (Method 2)
For this second approach, we apply the asymmetric regressions with fixed y-non-zero intercepts, that is, after removing the means for positive and negative values of the independent (oceanic) variable, respectively, and correspondingly for the dependent (atmospheric) variable.For positive events, let where A n denotes the estimate of the true slope 1.One has Therefore, the estimated slopes are unbiased and model 2 should be preferred.Note that the method 2 is quickly asymptotically equivalent in sample size to fitting a regression with slope and intercept parameters to both the positive and negative   values, and that for  p =  n = 0, one has Hence, there is no discontinuity in the linear fit. .To detect asymmetry, composites must be based on large absolute values of x generally obtained by selecting x-values larger or smaller than a given threshold.How it compares to the true solutions depends on the probability density function (pdf) of x and ε.We consider two simple cases and illustrate them for a = 3, a relatively strong asymmetry that is used in the synthetic examples below.

Uniform x-Distribution
Because it is analytically simple, we first consider the case where x is uniformly distributed in the interval (−50, 50).The pdf f(x) is given by:  , where the error is given by the standard error (SE).
For very large sample, say N = 10,000, the estimation errors are very small,  p = 25 ± 0.41 .Composites based on the 25% largest and smallest values yield  p25 = 37.5 ± 0.58 ,  n25 = −37.5 ± 0.58 .Neglecting the small averaging errors and using Equations 4 and 5 with   = 3, the estimated asymmetry is given by  15, as in the synthetic example below, one has  p25 ≈ n25 ≈ ±5.8 , and the approximate ranges become (2.02-2.61)and (1.35-2.02).For uniform x-distributions, composites are always biased and underestimate the asymmetry.

Gaussian Distribution
The Gaussian case is more relevant to observations.In the present example, one can get insights from Equations 4 and 5.For very large absolute x-values (e.g.,  p ≫ p ), one has  p() ≈ p() + () and  n() ≈ n() + () , so that unbiased estimates could be obtained.However, in observations or even climate model simulations, such very large values of x never occur, and for reasonable thresholds the estimated asymmetry degrades rapidly.For example, for positive and negative x-values twice their mean and , and  n() = 3  n + () , hence the expected asymmetry is substantially biased (i.e., 5:3 instead of 3:1 for the true values).The expected biases could be derived from the pdf of x and e by integrating over all x above or below an assumed threshold, but instead we simply use synthetic data in Section 5 to show that the composites are biased even with large thresholds.

Uniform x-Distribution
Synthetic data with zero mean and N = 100 are randomly generated from Equations 1 and 2 for uniform predictor x and noise ε distribution in (−50, 50) with   = 3.A realization of the x and Y =  −  data for is shown in Figure 1a, where x should be viewed as the oceanic anomaly and y as the atmospheric response to x.In Figure 1a, three different estimates of the asymmetry are represented and the results of 1,000 random simulations are in Figure 1b.For N = 100, the regression method 1 (Section 3.1) is biased and underestimates the slope for positive FRANKIGNOUL AND KWON 10.1029/2022GL100777 5 of 9 events and overestimates the slope for negative ones by 0.75, as is predicted by Equations 8 and 9, thus strongly underestimating asymmetry (i.e., 2.25:1.75instead of 3:1).Note that the standard (symmetric) regression (blue star, method 0) based on all x-values provides even more biased estimates of the true slopes since it does not distinguish between positive and negative x-values.On the other hand, the slope estimates are unbiased with the method 2 (Section 3.2), as expected.However, the error bar with the method 2 is rather large, more than twice that with the method 1.Note that the error bars are highly similar for positive and negative x-values since they only depend on N and ε.
Composite analysis (method 3) is also biased.As discussed in Section 4.1, composite based on all positive or negative values are unable to detect asymmetry.Consistent with our analysis, the biases in the slopes are larger for the 25% composites than for the 5% composites that indeed get close to the limit obtained for infinite sample (2.5:1.5),albeit with large uncertainty (nearly as large as with the regression method 2) since the sample is more limited.Interestingly, the 25% composites and the regression method 1 yield comparable estimates, but it occurs because they turned out to be similarly biased.

Gaussian Distribution
A second set of synthetic data was generated for   = 3 using standardized Gaussian distribution for the predictor variable x and the noise ε, so that expected values are  p = 0.8, n = −0.8, ∑p 1  p () p ()  =  0.5 (half the unit variance).Figure 2 compares the different estimates of the slope for (left) N = 100 and (right) N = 10,000 random x and ε samples.As above, the synthetic data are randomly generated 1,000 times to document the uncertainty of each estimate.The mean values of the asymmetric regressions do not significantly depend on N, but the uncertainties are of course much larger for N = 100 (they coarsely scale as N −1/2 ).In fact, the 95% confidence intervals are essentially negligible for N = 10,000.
From Equations 8 and 9 the mean biases of the asymmetric regression method 1 are large  (≈ 0.64), consistent with Figure 2.For N = 100, the 95% confidence intervals are small, so there is negligible chance to obtain the true values.As predicted, the method 2 is unbiased, and the 95% confidence interval, albeit larger, is somewhat smaller than for uniform distributions.In our example for N = 100, the ranges of estimated slopes in 1,000 simulations were (2.35-3.75)for positive x and (0.25-1.8) for negative x. Hence, relative estimation errors can be large, in particular if the true slope is small as was the case for negative x. Composite based on a 25% threshold are also strongly biased, and again comparable to those of the regression method 1, albeit with slightly larger uncertainties.For N = 100, the 95% confidence interval is (2.1-2.6) for positive x and (1.4-1.9) for negative x, while the full ranges were (1.9-2.9) and (1.1-2.1),respectively, so they did not even overlap with the true values.Composites based on the 5% largest or smallest x-values provide better, but still biased estimates of the asymmetry, and the 95% confidence intervals remain far from the true values.However, estimates could occasionally match or even exceed the true slope since the full ranges in 1,000 simulations were (2.0-3.3) for positive x and (0.6-2.0) for negative x.For N = 10,000, an unreasonable case for observed climate time series, composites based on a more extreme 1% threshold could be considered, but they remain biased, nearly as much as for the 5% threshold (not shown).Composite analysis thus always underestimates the asymmetry, as shown in Section 4.

Application to ENSO
To illustrate the differences between the three methods using observations, we consider the SST signature of El Niño and La Niña events and their teleconnections in boreal winter (DJF), even though ENSO time series have some skewness.We use the HadISST (Rayner et al., 2003) for SST and the NCEP-NCAR R1 (Kistler et al., 2001) for SLP in the period 1948-2018.After removing the mean seasonal cycle and a cubic trend from the monthly values, time series of the DJF mean Niño3.4 index as well as SST and SLP anomalies are constructed.
Regressions are on the DJF Niño3.4 index.Composites are based on 10% percentile, a larger threshold than most analyses, which are based on values larger or smaller than 1 standard deviation of the Niño3.4index (e.g., Deser et al., 2017;Mezzina et al., 2022).Our choice results in seven El Niño and seven La Niña events.Note that the units in Figure 3 are per positive or negative index (i.e., °C/°C), so that the sign of the negative case should be inverted to display La Niña signals.
The asymmetry in the SST anomalies between positive and negative cases is most noticeable in the location of the maximum anomalies, which is found in the central equatorial Pacific in the negative (La Niña) case but shifted eastward in the positive (El Niño) case.However, the local amplitudes exhibit the asymmetry consistent with our mathematical derivation and synthetic examples, as the amplitudes in the eastern Pacific is larger for the positive phase and the asymmetry is largest for regression method 2 and smallest for regression method 1.For example, at 90°W on Equator, the regression method 2 gives 1.30°C/°C and 0.38°C/°C for the positive and negative phases, respectively, while the regression method 1 and the composite give 0.93°C/°C versus 0.60°C/°C and 1.05°C/°C versus 0.49°C/°C, respectively.On the other hand, the SST amplitude in the central Pacific is slightly larger for the negative phase, but the asymmetry is again largest for regression method 2. For example, the amplitudes at 165°W on Equator for the positive versus negative phases are 1.17°C/°C versus 1.25°C/°C, 1.06°C/°C versus 1.29°C/°C, and 1.14°C/°C versus 1.27°C/°C for regression model 1, regression model 2, and composite, respec- The SLP regressions are more contrasted in amplitude and pattern, showing a strong Aleutian Low for El Niño and a much weaker, southwestward shifted low for La Niña (Figure 4).Consistent with our analysis, the strongest Aleutian Low deepening is given by regression method 2 and the weakest by regression method 1.The composites are based on a high threshold and are thus in-between, consistent with our mathematical derivation and synthetic examples (Figure 2).

Summary and Discussion
We have shown that asymmetry is best estimated by asymmetric regression models with fixed non-zero y-intercepts (method 2), which provide unbiased slope estimates.However, for limited samples as often found in observational analyses, errors in the estimated slopes can be large, in particular if the true slopes are small and the noise is larger than considered here.Asymmetric regression models with zero intercepts (method 1) are always biased and underestimate the asymmetry, and they should not be used.While method 2 removes means separately for the positive and negative values of the independent and corresponding dependent variables, method 1 removes means from independent and dependent variables for the entire data set.In the ocean-atmosphere coupling framework, method 1 thus implicitly assumes that a neutral ocean leads to no atmospheric anomaly, which does not hold if there is asymmetry.The single degree of freedom lost in setting the intercept in method 2 is well made up by avoiding the biased slope estimates of method 2. Composites are also biased and underestimate the asymmetry, strongly for small thresholds but less so for very large thresholds (which may be unattainable with observations).However, the errors are then quite large if the sample is limited.
To illustrate the differences between the three methods investigated here using observations, we have estimated the asymmetry of the tropical Pacific SST anomaly associated with positive and negative values of the Niño3.4index in DJF in the period 1948-2018, as done in many studies.All three methods nicely show the westward shift of maximum SST amplitude in La Niña events and the substantial amplitude along the South American coast in El Niño events.While the differences between methods are rather small, the local SST amplitudes consistently show the largest asymmetry for regression method 2. The associated SLP anomalies show a strong asymmetry in all cases, with a strong Aleutian Low strengthening during El Niño events and a much smaller, southwestward shifted low during La Niña events.Consistent with our mathematical derivation and synthetic example, the Aleutian Low response to El Niño is strongest for regression method 2 and weakest for regression method 1, FRANKIGNOUL AND KWON 10.1029/2022GL100777 8 of 9 while compositing based on a high 10% threshold provide intermediate results.The latter are consistent with the composites in Mezzina et al. (2022).Although the true asymmetry is not known in the observational data, the results confirm that compositing tend to underestimate asymmetry and suggest that asymmetric regression with non-zero intercepts (unbiased method 2) should be preferred.For data that can be highly skewed, however, such as precipitation signals, the linear assumption in asymmetric regression is limiting, and a probabilistic approach for detecting asymmetric ENSO teleconnections may be preferable, such as using contingency tables (e.g., Lenssen et al., 2020;Mason & Goddard 2012).
In recent studies of the atmospheric response to SST or other oceanic anomalies, multiple regressions have been increasingly used to distinguish their impacts from the influence of concomitant factors such as sea-ice concentration, or snow cover (e.g., Liu et al., 2008;Révelard et al., 2018;Simon et al., 2020).Asymmetry in the responses could not be estimated by applying method 2 to multiple linear regression, since the positive and negative values of each predictor would not occur simultaneously, thus preventing an appropriate definition of demeaned variables Y p and Y n .If asymmetry is only expected in the response to one predictor, method 2 could be applied to its positive and negative values, while including in each case the additional concomitant predictors, but two different slopes and significance might be obtained for these other predictors, despite the implicit assumption of symmetry.However, if only weak correlation between positive and negative values of multiple predictors is expected, as could be the case if the predictors are (orthogonal) principal components of the same variable (e.g., tropical SST anomalies), sequential estimation with method 2 may provide acceptable results.Using method 1 and separating each predictor time series into positive and negative sequences is feasible and was used by Frankignoul et al. (2011) andRévelard et al. (2016) to asymmetrically remove ENSO teleconnections based on two ENSO indices, but such estimates are biased, as shown here.In addition, if there are many regressors, the increased dimensionality might result in overfitting or collinearity, although series of positive and negative phases would be pairwise uncorrelated.Hence, the generalization to multiple asymmetric regression requires further investigation.

Figure 1 .
Figure 1.(a) Scatter plot of a realization of the synthetic x and Y series with uniform x and ε distribution in (−50, 50) and   = 3 with N = 100.The symmetric regression by method 0 and the asymmetric regressions by methods 1 and 2 are represented, as well as composite based on a 25% threshold (method 3).(b) Results of 1,000 random simulations.The colored symbols indicate the mean values of the estimated slopes and the composite threshold, and the vertical bars indicate 95% interval from the 1,000 iterations.Crosses correspond to positive x-values and circles to negative ones, and the horizontal lines indicate the true slopes.

Figure 2 .
Figure 2. As in Figure 1, but using synthetic data generated using x and e from standardized Gaussian distribution and   = 3 with (a) N = 100 and (b) N = 10,000.Note that (a) uses 5% threshold for the most extreme composites, while (b) uses 1%.Crosses correspond to positive x-values and circles to negative ones.

Figure 3 .
Figure 3. Sea surface temperature (in °C/°C with contour interval of 0.3°C/°C) associated with positive and negative values of the Niño3.4index in DJF.Estimates were made by (a) standard regression, (b and c) asymmetric regression 1, (d and e) asymmetric regression 2, and (f and g) composites based on a 10% threshold (seven El Niño and seven La Niña events).

Figure 4 .
Figure 4. Sea level pressure in JFM (in hPa/°C, contour interval 0.5 hPa/°C) associated with positive and negative values of the Niño3.4index in DJF.Estimates were made by (a) standard regression, (b and c) asymmetric regression 1, (d and e) asymmetric regression 2, and (f and g) composites based on a 10% threshold (seven El Niño and seven La Niña events).
Composites are built by selecting positive and negative values of x and comparing the corresponding values of Y, usually based on sample means.The simplest composites would consider all values of the same sign, thus