- collapsing correlated variables into an overall score so that one does not have to disentangle correlated effects, which is a difficult statistical task
- reducing the effective number of variables to use in a regression or other predictive model, so that fewer parameters need to be estimated
Sacha Varin writes 2017-11-19:
- If the distributions are skewed and long-tailed, can I standardize the values using the formula (Value - Mean)/GiniMd? Or is the mean not a good estimator in the presence of skewed and long-tailed distributions? What about (Value - Median)/GiniMd? Or is there another formula using GiniMd for standardization?
- In the presence of outliers and skewed, long-tailed distributions, which formula is better for standardization: (Value - Median)/MAD (MAD = median absolute deviation) or (Value - Mean)/GiniMd? And why?
What I'm about to suggest is a bit more applicable to the case where you ultimately want to form a predictive model, but it can also apply when the goal is just to combine several variables. When the variables are continuous and are on different scales, scaling them by SD or Gini's mean difference will allow one to create unitless quantities that may be added. But the fact that they are on different scales raises the question of whether they are already "linear" or whether they need separate nonlinear transformations to be "combinable".
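To make the scaling options in this exchange concrete, here is a minimal Python sketch that computes Gini's mean difference and the MAD and applies the two standardizations from the question to a small made-up, long-tailed sample. The function names (`gini_md`, `mad`, `scale`) and the data are assumptions for illustration only, and the MAD here is the raw median absolute deviation with no normal-consistency constant. It only shows how the quantities are computed; it does not settle which one is preferable.

```python
from statistics import mean, median

def gini_md(x):
    """Gini's mean difference: mean of |xi - xj| over all pairs with i != j."""
    n = len(x)
    xs = sorted(x)
    # Order-statistic form: sum over i of (2i - n - 1) * x_(i), times 2 / (n (n - 1))
    total = sum((2 * i - n - 1) * v for i, v in enumerate(xs, start=1))
    return 2.0 * total / (n * (n - 1))

def mad(x):
    """Raw median absolute deviation from the median (no scaling constant)."""
    m = median(x)
    return median([abs(v - m) for v in x])

def scale(x, center, spread):
    """Return (value - center(x)) / spread(x) for every value in x."""
    c, s = center(x), spread(x)
    return [(v - c) / s for v in x]

# Made-up long-tailed sample, for illustration only
x = [1, 2, 3, 4, 100]

by_median_mad = scale(x, median, mad)      # (Value - Median) / MAD
by_mean_gini = scale(x, mean, gini_md)     # (Value - Mean) / GiniMd
by_median_gini = scale(x, median, gini_md) # (Value - Median) / GiniMd

print(by_median_mad)
print(by_mean_gini)
print(by_median_gini)
```

Whatever center and spread are used, the result is unitless, which is what makes variables on different scales candidates for being added together.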
- Use the 15:1 rule of thumb to estimate how many predictors can reliably be related to Y. Suppose that number is k. Use the first k principal components to predict Y.
- Enter PCs in decreasing order of variation (of the system of Xs) explained and choose the number of PCs to retain using AIC. This is very different from stepwise regression, which enters variables according to their p-values with Y. We are effectively entering variables in a pre-specified order with incomplete principal component regression.
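
The two strategies above can be sketched in Python with NumPy. This is only an illustration under stated assumptions: Y is continuous and modeled with ordinary least squares, the Xs are standardized before computing PCs, the effective sample size for the 15:1 rule is taken to be n, and the data are simulated. The helper names (`pc_scores`, `aic_linear`, `choose_num_pcs`) are hypothetical, not from any package.

```python
import numpy as np

def pc_scores(X):
    """PC scores of standardized X, ordered by decreasing variance explained."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Singular values come back sorted in descending order; rows of Vt are loadings
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt.T, s**2 / np.sum(s**2)

def aic_linear(y, cols):
    """AIC of a Gaussian linear model of y on an intercept plus the given columns."""
    n = len(y)
    D = np.column_stack([np.ones(n), cols])
    beta, *_ = np.linalg.lstsq(D, y, rcond=None)
    rss = float(np.sum((y - D @ beta) ** 2))
    p = D.shape[1] + 1  # regression coefficients plus the error variance
    return n * np.log(rss / n) + 2 * p

def choose_num_pcs(X, y, max_k):
    """Fit nested models on the first 0, 1, ..., max_k PCs; return the AIC-minimizing k."""
    scores, _ = pc_scores(X)
    max_k = min(max_k, scores.shape[1])
    aics = [aic_linear(y, scores[:, :k]) for k in range(max_k + 1)]
    return int(np.argmin(aics)), aics

# Simulated data, for illustration only: four Xs share a latent factor, two are noise
rng = np.random.default_rng(0)
n = 150
f = rng.normal(size=n)
X = np.column_stack(
    [f + 0.3 * rng.normal(size=n) for _ in range(4)]
    + [rng.normal(size=n) for _ in range(2)]
)
y = 2 * f + rng.normal(size=n)

max_k = n // 15  # 15:1 rule of thumb, assuming effective sample size n
k, aics = choose_num_pcs(X, y, max_k)
print("PCs retained by AIC:", k)
```

Note that the order of entry is fixed by the Xs alone (variance explained), so Y is only used to decide where to stop, not which components to consider.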