Tuesday, November 21, 2017

Scoring Multiple Variables, Too Many Variables and Too Few Observations: Data Reduction

This post will grow to cover questions about data reduction methods, also known as unsupervised learning methods. These are intended primarily for two purposes:

• collapsing correlated variables into an overall score so that one does not have to disentangle correlated effects, which is a difficult statistical task
• reducing the effective number of variables to use in a regression or other predictive model, so that fewer parameters need to be estimated
The latter example is the "too many variables too few subjects" problem.  Data reduction methods are covered in Chapter 4 of my book Regression Modeling Strategies, and in some of the book's case studies.

Sacha Varin writes 2017-11-19:

I want to add/sum some variables having different units. I decide to standardize (Z-scores) the values and then, once they are transformed to Z-scores, I can sum them.  The problem is that my variables' distributions are non-Gaussian: they are not symmetrical (skewed), they are long-tailed, and I have all types of weird distributions; I guess we can say the distributions are intractable. I know that my distributions don't need to be Gaussian to calculate Z-scores. However, if the distributions are not close to Gaussian or at least symmetrical enough, I guess the classical Z-score transformation, (Value - Mean)/SD, is not valid. That's why I decided, because my distributions are skewed and long-tailed, to use Gini's mean difference (a robust and efficient estimator).
1. If the distributions are skewed and long-tailed, can I standardize the values using the formula (Value - Mean)/GiniMd?  Or is the mean not a good estimator in the presence of skewed and long-tailed distributions?  What about (Value - Median)/GiniMd?  Or what else with GiniMd for a formula to standardize?
2. In the presence of outliers and skewed, long-tailed distributions, which formula is better for standardization: (Value - Median)/MAD (MAD = median absolute deviation) or (Value - Mean)/GiniMd?  And why?
My situation is not the predictive modeling case, but I want to sum the variables.

These are excellent questions and touch on an interesting side issue.  My opinion is that standard deviations (SDs) are not very applicable to asymmetric (skewed) distributions, and that they are not very robust measures of dispersion.  I'm glad you mentioned Gini's mean difference, which is the mean of all absolute differences of pairs of observations.  It is highly robust and is surprisingly efficient as a measure of dispersion when compared to the SD, even when normality holds.
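To make the definition concrete, here is a minimal Python sketch of Gini's mean difference as the mean of all absolute differences over distinct pairs of observations. The function name and the data are illustrative assumptions, and the brute-force loop over pairs is chosen for clarity rather than speed.

```python
from itertools import combinations
from statistics import stdev


def gini_mean_difference(values):
    """Mean of |a - b| over all distinct pairs of observations."""
    pairs = list(combinations(values, 2))
    if not pairs:
        raise ValueError("need at least two observations")
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


# Hypothetical data, for illustration only.
x = [1.0, 2.0, 4.0, 7.0]
print("Gini's mean difference:", gini_mean_difference(x))
print("Standard deviation:    ", stdev(x))
```

Both quantities are in the units of the data, so they can be read side by side as measures of spread.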

The questions also touch on the fact that when normalizing more than one variable so that the variables may be combined, there is no magic normalization method in statistics.  I believe that Gini's mean difference is as good as any and better than the SD.  It is also more precise than the mean absolute difference from the mean or median, and the mean may not be robust enough in some instances.  But we have a rich history of methods, such as principal components (PCs), that use SDs.

What I'm about to suggest is a bit more applicable to the case where you ultimately want to form a predictive model, but it can also apply when the goal is to just combine several variables.  When the variables are continuous and are on different scales, scaling them by SD or Gini's mean difference will allow one to create unitless quantities that may possibly be added.  But the fact that they are on different scales raises the question of whether they are already "linear" or whether they need separate nonlinear transformations to be "combinable".
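As a rough illustration of the scaling step, the sketch below centers each variable and divides by its Gini's mean difference, then sums the unitless results. Centering on the median is an assumption made here (the mean is the other candidate raised in the question), the variable names and values are made up, and the sketch takes linearity of each variable for granted.

```python
from itertools import combinations
from statistics import median


def gini_mean_difference(values):
    """Mean of |a - b| over all distinct pairs of observations."""
    pairs = list(combinations(values, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


def scale_by_gmd(values):
    """Return (value - median) / GMD for each value (assumed centering choice)."""
    center = median(values)
    gmd = gini_mean_difference(values)
    return [(v - center) / gmd for v in values]


# Hypothetical skewed variables measured in different units.
weight_kg = [60.0, 72.0, 81.0, 95.0, 140.0]
crp_mg_l = [0.5, 1.0, 2.0, 4.0, 40.0]

# Unitless composite score: one value per subject.
composite = [a + b for a, b in zip(scale_by_gmd(weight_kg), scale_by_gmd(crp_mg_l))]
print(composite)
```

After scaling, each variable has a median of zero and a Gini's mean difference of one, which is what makes the sum unitless.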

I think that nonlinear PCs may be a better choice than just adding scaled variables.  When the predictor variables are correlated, nonlinear PCs learn from the interrelationships, even occasionally learning how to optimally transform each predictor to ultimately better predict Y.  The transformations (e.g., fitted spline functions) are solved for so as to maximize the predictability of each predictor from the other predictors or PCs of them.  Sometimes the way the predictors move together is the same way they relate to some ultimate outcome variable that this unsupervised learning method does not have access to.  An example of this is in Section 4.7.3 of my book.

With a little bit of luck, the transformed predictors have more symmetric distributions, so ordinary PCs computed on these transformed variables, with their implied SD normalization, work pretty well.  PCs take into account that some of the component variables are highly correlated with each other, and so are partially redundant and should not receive the same weights ("loadings") as other variables.

The R transcan function in the Hmisc package has various options for nonlinear PCs, and these ideas are generalized in the R homals package.

How do we handle the case where the number of candidate predictors p is large in comparison to the effective sample size n?  Penalized maximum likelihood estimation (e.g., ridge regression) and Bayesian regression typically have the best performance, but data reduction methods are competitive and sometimes more interpretable.  For example, one can use variable clustering and redundancy analysis as detailed in the RMS book and course notes.  Principal components (linear or nonlinear) can also be an excellent approach to lowering the number of variables that need to be related to the outcome variable Y.  Two example approaches are:

1. Use the 15:1 rule of thumb to estimate how many predictors can reliably be related to Y.  Suppose that number is k.  Use the first k principal components to predict Y.
2. Enter PCs in decreasing order of variation (of the system of Xs) explained and choose the number of PCs to retain using AIC.  This is far from stepwise regression, which enters variables according to their p-values with Y.  We are effectively entering variables in a pre-specified order with incomplete principal component regression.
Once the PC model is formed, one may attempt to interpret the model by studying how raw predictors relate to the principal components or to the overall predicted values.
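The two approaches can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the workflow from the RMS book: it assumes NumPy is available, uses ordinary linear PCs on standardized predictors, assumes a continuous Y fitted by least squares with the Gaussian AIC (up to an additive constant), and runs on simulated data. The function name pcr_aic is hypothetical.

```python
import numpy as np


def pcr_aic(X, y):
    """Incomplete PC regression: add PCs in decreasing order of variance
    explained and record the AIC of each nested model."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    # Standardize columns: the SD normalization implied by ordinary PCs.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # SVD returns components already ordered by decreasing singular value.
    U, s, _ = np.linalg.svd(Z, full_matrices=False)
    scores = U * s  # PC scores, one column per component
    aic = {}
    for k in range(1, min(p, n - 2) + 1):
        design = np.column_stack([np.ones(n), scores[:, :k]])
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        rss = float(np.sum((y - design @ beta) ** 2))
        # k slopes + intercept + residual variance = k + 2 parameters.
        aic[k] = n * np.log(rss / n) + 2 * (k + 2)
    best_k = min(aic, key=aic.get)
    return best_k, aic


# Simulated data: 12 correlated predictors driven by 2 latent factors.
rng = np.random.default_rng(0)
n_obs, n_pred = 150, 12
factors = rng.normal(size=(n_obs, 2))
X = factors @ rng.normal(size=(2, n_pred)) + 0.5 * rng.normal(size=(n_obs, n_pred))
y = factors[:, 0] - 0.5 * factors[:, 1] + rng.normal(size=n_obs)

rule_k = n_obs // 15          # approach 1: 15:1 rule of thumb
best_k, aic = pcr_aic(X, y)   # approach 2: AIC over nested PC models
print("15:1 rule allows about", rule_k, "components; AIC selects", best_k)
```

Because the PCs are computed without reference to Y and are entered in a fixed order, the only data-driven choice involving Y is where to stop.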

Returning to Sacha's original setting, if linearity is assumed for all variables, then scaling by Gini's mean difference is reasonable.  But psychometric properties should be considered, and often the scale factors need to be derived from subject matter rather than statistical considerations.

1. there is a good paper by Sun et al. where the z-score is used to form a composite of an assortment of variables including binary and time-to-event in phase II trials to enhance power: http://circheartfailure.ahajournals.org/content/5/6/742.long
but it depends of course on what the purpose is. Paul

1. I like that paper. Can be a powerful approach though hard to interpret the results.

2. yes. we suggested using the probability index and a forest plot to aid interpretation: http://circheartfailure.ahajournals.org/content/10/1/e003222.long
cheers

2. This discussion on CrossValidated mentions GiniMD as being limited by having a '0 breakdown point' (see the first answer to the query... https://stats.stackexchange.com/questions/200595/comparing-spread-dispersion-between-samples). The concept of 'breakdown points' is discussed in this paper by Davies and Gather, The Breakdown Point-examples and counterexamples, (here ... https://www.ine.pt/revstat/pdf/rs070101.pdf).

1. I didn't see Gini's mean difference addressed in any of that. Note that if Y is binary with proportion of ones equal to p, Gini's mean difference is nicely equal to 2p(1-p)n/(n-1).
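The closed form follows from counting pairs: with n1 ones and n0 zeros, n1·n0 of the n(n-1)/2 pairs differ by exactly 1 and the rest differ by 0. A quick numerical check in Python, using made-up data and a brute-force helper written for this illustration:

```python
from itertools import combinations


def gini_mean_difference(values):
    """Mean of |a - b| over all distinct pairs of observations."""
    pairs = list(combinations(values, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


# Hypothetical binary outcome: 3 ones and 7 zeros.
y = [1] * 3 + [0] * 7
n = len(y)
p = sum(y) / n

brute_force = gini_mean_difference(y)
closed_form = 2 * p * (1 - p) * n / (n - 1)
print(brute_force, closed_form)
```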

2. This comment has been removed by the author.

3. Here's the quote, again, from the first and only response (not a 'comment') to the OPs query, "Answering the second question, I would recommend to use Qn or Sn. Both of them have nice properties. Somebody can recommend to use Gini's means difference, but it has 0 breakdown point (but it is somewhat "robust" and also has a lot of good properties)." As noted, the paper discusses the 'concept of breakdown points.'

3. I'm not convinced that Gini's mean difference has a zero breakdown point. It's not like a quantile. But it has so many other good properties I might not be concerned anyway. Can you describe what Qn and Sn are?

1. With all due respect, I urge you to revert to the CV link for the full discussion and explanation. It may be that you will want to follow up with the specific CV participant who was quoted.

4. I read the entire CrossValidated page for a second time and still do not see any useful information about Gini's mean difference there.

1. Fair enough. It appears to me that we're all on a learning curve wrt the uses, advantages and limitations of GiniMD. You are correct that the CV thread does not consider GiniMD in any depth. Making reference to it seemed worthwhile only insofar as there was a hint of skepticism, a possible limitation to the use of the metric in the zero breakdown point. You've stated that you're 'not convinced' that this is correct. I would be interested in any evidence you can provide in support of this belief.

5. I just read http://web.ipac.caltech.edu/staff/fmasci/home/astro_refs/BetterThanMAD.pdf which was indirectly referenced in stats.stackexchange.com (CrossValidated), but oddly this paper did not address Gini's mean difference at all other than acknowledging its existence.

6. Thanks Thomas. An expert, HA David, has written much about Gini's mean difference and its high efficiency with regard to the ordinary SD even if normality holds. So I'm doubtful about the 0 breakdown comment. Gini's index is completely continuous. It does not involve any sample quantiles, just taking the mean of all absolute differences. It may not be as robust as the median of all absolute differences from the median, but the efficiency gain it has over that probably offsets the little bit of non-robustness. I would rely on Gini's mean difference until some reference shows it is inefficient or meaningless in a situation that occurs in practice.

1. Using David Donoho's finite sample breakdown definition, it is clear that the breakdown point of GiniMD is 1/n. As a single value goes to infinity, the GiniMD will go to infinity. An example of a high breakdown, high efficiency variance estimate is the minimum Hellinger distance estimator of R. Beran.
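The 1/n breakdown point can be seen numerically. In the sketch below, one observation in a made-up sample is pushed toward very large values: Gini's mean difference grows with it, while the median absolute deviation from the median (MAD), shown for contrast, does not move. The helper functions are written for this illustration, and the MAD is left unscaled.

```python
from itertools import combinations
from statistics import median


def gini_mean_difference(values):
    """Mean of |a - b| over all distinct pairs of observations."""
    pairs = list(combinations(values, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


def mad(values):
    """Median absolute deviation from the median (no consistency constant)."""
    m = median(values)
    return median([abs(v - m) for v in values])


# Hypothetical sample; the last observation is replaced by ever larger values.
base = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
for contaminant in (10.0, 1e3, 1e6):
    sample = base + [contaminant]
    print(contaminant, gini_mean_difference(sample), mad(sample))
```

A single wild value is enough to carry Gini's mean difference with it, which is the same behavior the mean shows.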

2. I have few worries about an estimator with breakdown point of 1/n (I still use the mean occasionally), and Gini's mean difference is very efficient, easy to interpret, and fast to compute.

7. One thing that I don't see discussed in the GMD literature is its utility in hypothesis testing. This is very different from the SD (or SE) and suggests that, as a measure of dispersion for non-normally distributed information, it's closer to the coefficient of variation, a scale invariant measure of dispersion for more normally distributed information.

1. I'm not sure how something that is unitless and depends on the mean not being near zero (CV) can be compared to Gini's mean difference, which is in data units. Personally I don't use the CV because of the strong location away from zero requirement that makes it almost assume the distribution is a log-type distribution. Also, what made you mention hypothesis testing? I don't like hypothesis testing in general, and especially when it comes to measuring variation.

8. Good points and agreement on Neyman-Pearson approaches to hypothesis testing. I was using that as a 'straw man' example in an effort to clarify my ignorance and distinguish GMDs from SDs. Another example could be ANOVA-type contrasts, and so on. But your point about the GMD being 'in data units' was both helpful and interesting. The CV does have value in a comparison, e.g., of systolic and diastolic blood pressure. Since these two metrics are in differing units, direct comparison of their SDs is not meaningful, whereas comparing their CVs would tell you which metric has more variability. To me, your comment suggests that the GMD would not be useful in determining which blood pressure metric has greater dispersion.

1. On the contrary, I think that Gini's mean difference would be more applicable to (though not perfect for) the comparison of variability of DBP and SBP. The CV might make one conclude that DBP was more variable just because it had a lower mean.

9. So, just to be clear, the GMDs for DBP and SBP are directly comparable?

1. Even though I believe Gini mean differences for the two blood pressures are more directly comparable than other measures, I still don't think they are fully comparable.

10. I don't mean to be pedantic but there is literature that, in this specific instance wrt DBP and SBP, supports the CV as being 'fully comparable' and, therefore, enabling a direct comparison of variability. E.g., p. 14 of Levy and Lemeshow's book Sampling for Health Professionals (1980 edition). It doesn't sound like the GMD is 'fully comparable' in the same sense.

1. The CV is not a measure of variability. It is a joint measure of the mean and squared-difference-from-the-mean variability. DBP by definition has a lower mean than SBP, and if variability of the two were to be the same, the CV for SBP would by necessity be lower than the CV for DBP. For reasons stated earlier, I don't use the CV. I want to separate location and spread statistically.
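A small numerical illustration of this point, using made-up blood pressure values in which SBP is simply DBP shifted up by 40 mmHg, so the two have identical spread by construction:

```python
from itertools import combinations
from statistics import mean, stdev


def gini_mean_difference(values):
    """Mean of |a - b| over all distinct pairs of observations."""
    pairs = list(combinations(values, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


def coefficient_of_variation(values):
    """SD divided by the mean."""
    return stdev(values) / mean(values)


# Hypothetical readings in mmHg; SBP = DBP + 40, so spread is identical.
dbp = [70.0, 75.0, 80.0, 85.0, 90.0]
sbp = [v + 40.0 for v in dbp]

print("SD :", stdev(dbp), stdev(sbp))
print("GMD:", gini_mean_difference(dbp), gini_mean_difference(sbp))
print("CV :", coefficient_of_variation(dbp), coefficient_of_variation(sbp))
```

The SD and Gini's mean difference agree across the two measurements, while the CV is lower for SBP only because its mean is higher.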

11. This has been an interesting, useful exchange, at least for me. Thank you.

1. Thanks Thomas. It's interesting to me how controversial many elements of statistics are. Unlike math, different statisticians have very different opinions about fundamentals because there are very few unique solutions in stats.