Transformation in Regression
Six Sigma – iSixSigma › Forums › Old Forums › General › Transformation in Regression
 This topic has 11 replies, 3 voices, and was last updated 13 years, 2 months ago by newbie.

AuthorPosts

September 16, 2008 at 7:01 pm #50954
I am studying some material on regression modeling and the need to validate the underlying assumptions to ensure the precision and accuracy of the parameter estimates. But I don’t get this part...
If I am fitting a regression model and the residual analysis indicates the need for a possible transformation, how do I know which variable to transform (I understand one can transform the Y and/or the Xs)? In addition, once the variable for transformation is determined, do I then simply save the transformed data in a separate column, plug it into the analysis in place of the original data, then rerun and recheck?

September 17, 2008 at 12:27 pm #175844
Robert Butler (Participant, @rbutler)
As was stated in another recent thread, it depends. The two basic residual shapes that suggest a need for a transformation are the < and > shapes. In the case of <, the first choice is to take the log of Y. After logging Y you need to rerun the regression on the logged values, and if the log solves the problem then the final regression model will be on the logged values.
The first choice for transforms when the residual pattern resembles > is some form of an inverse Y i.e. 1/Y. Usually, you will look at the residuals and identify the fitted Y value associated with the tip of the > shape. If it is say 100 then the transform would be 1/(100Yi)**2 where Yi are the individual Y values.
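To make the first case concrete, here is a minimal sketch in Python with NumPy (an editorial illustration, not code from the thread). It assumes synthetic data with multiplicative error, chosen because it gives residuals that fan out as the fitted value grows (the < shape). The helper names `fit_line` and `spread_trend` are hypothetical, and the correlation between |residual| and fitted value is just one rough numeric stand-in for looking at the plot.

```python
import numpy as np

def fit_line(x, y):
    """Least squares fit of y = b0 + b1*x; returns fitted values and residuals."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    return fitted, y - fitted

def spread_trend(fitted, resid):
    """Correlation between |residual| and fitted value; near 0 means even spread."""
    return float(np.corrcoef(fitted, np.abs(resid))[0, 1])

rng = np.random.default_rng(0)
x = np.linspace(1, 10, 200)
# Synthetic assumption: multiplicative error, so the scatter grows with Y.
y = np.exp(0.3 * x + rng.normal(0, 0.3, x.size))

# Regression on the raw Y, then rerun on log(Y) as described above.
fitted_raw, resid_raw = fit_line(x, y)
fitted_log, resid_log = fit_line(x, np.log(y))

trend_raw = spread_trend(fitted_raw, resid_raw)
trend_log = spread_trend(fitted_log, resid_log)
print(f"|residual| vs fitted correlation: raw Y {trend_raw:.2f}, log Y {trend_log:.2f}")
```

In practice the residual plots themselves should be checked after the rerun; the single number here only summarizes whether the funnel has flattened.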
If these simple approaches don’t work then you have your work cut out for you and it will usually entail a lot of plotting of residuals against X’s and against time order to see what kinds of patterns emerge. You may find that the final version requires a transform of both the Y and some of the X’s. To the best of my knowledge there aren’t any simple rules if you have to move to this level of investigation.
There is one other pattern that does show up from time to time. When the residuals are plotted against the fitted Y’s you can get a plot that looks like a series of parallel lines resembling rock strata. This occurs when the Y response is constrained to a finite number of distinct values and in the course of taking measurements these values are repeated in the data. The parallel lines will have a slope of -1, because for a fixed observed Y the residual is that Y minus the fitted value.

September 17, 2008 at 1:45 pm #175847
Robert,
That was very helpful, thank you! So a < or > shape on the residual vs. fitted value plot indicates a potential need to transform the Y using log Y or 1/Y (or another power transformation based on the optimal lambda, I assume?).
In addition, I have a colleague who keeps putting the different predictors and/or the response through various arbitrary transformations during a stepwise regression exercise, looking for a higher Rsq. Is this an acceptable approach, to simply start manipulating terms arbitrarily? This doesn’t sound like the soundest strategy...
Thanks!

September 17, 2008 at 3:24 pm #175850
Newbie:

First a little review: the assumption of linear regression is that the results are adequate when the RESIDUALS are normally distributed. Deviations from normality are only seen in the residual plots AFTER attempting to refine a regression model.

There may be a nonlinear relationship between X and Y. This can be addressed by the careful construction of a nonlinear model, for example Y=m*log(X)+b. In this case neither X nor Y needs to be transformed before conducting the regression. An alternate method would be to take the logarithm of X to construct another variable, i.e. L = log(X), then do the regression using the equation Y=m*L+b.

Some response variables are nonnormally distributed and
entirely new methods are used to construct and refine a model. Binary logistic
regression is one such method where the response is either 0/1 or pass/fail.
Some people make the mistake of converting the data to a series of percentages
and conducting regression on those data.

There are a number of well-known nonnormal distributions
that arise under different circumstances and I would expect the response (Y) to
show such a nonnormal distribution. In these cases, in order to use linear
regression, one or more of the variables should be mathematically transformed.
Some examples I have seen involve the number of defects per unit area, airborne
contaminant concentration, noncentricity of drilled holes, lengths of cut
metal, fill height of filled containers, deviations from planarity, most
financially based data, computer memory usage, and disk storage usage.

I like to use the Box-Cox transformation and look at the graph of the effect of different lambda values. I look to see if the optimal values are close to those involving specific transformations.

Lambda    Transformation
1         None
1/2       Square root
0         Logarithm
-1/2      Reciprocal square root
-1        Reciprocal
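As an editorial illustration of scanning lambda values, the sketch below computes a Box-Cox profile log-likelihood over a grid for a straight-line model. It is written in Python with NumPy under stated assumptions: the data are synthetic (generated so that the log scale is the appropriate one), the grid spacing of 0.25 is arbitrary, and `boxcox` and `profile_loglik` are hypothetical helper names, not the routine of any particular statistics package.

```python
import numpy as np

def boxcox(y, lam):
    """Box-Cox transform of strictly positive y for a given lambda."""
    return np.log(y) if abs(lam) < 1e-12 else (y**lam - 1.0) / lam

def profile_loglik(x, y, lam):
    """Profile log-likelihood of lambda for a straight-line fit to transformed Y."""
    z = boxcox(y, lam)
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    sse = np.sum((z - X @ beta) ** 2)
    n = y.size
    # The second term is the Jacobian of the transform, so lambdas are comparable.
    return -0.5 * n * np.log(sse / n) + (lam - 1.0) * np.sum(np.log(y))

rng = np.random.default_rng(1)
x = np.linspace(1, 10, 300)
# Synthetic assumption: log(Y) is linear in X with constant-variance error.
y = np.exp(0.3 * x + rng.normal(0, 0.2, x.size))

grid = np.arange(-2.0, 2.01, 0.25)
loglik = np.array([profile_loglik(x, y, lam) for lam in grid])
best_lambda = float(grid[np.argmax(loglik)])
print("lambda with the highest profile log-likelihood:", best_lambda)
```

If the best grid value sits close to one of the tabulated lambdas, the corresponding named transformation is the natural candidate to try, followed by a fresh look at the residual plots.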
I would caution anyone who would start transforming data to
get a better mathematical fit for a regression model without having a reason to
do so. Conduct a few experiments on the system to get a feeling for the
factors influencing the response. When using historical, happenstance data
there are usually many other reasons for nonlinear or nonnormal behaviour.

Cheers, Alastair

September 17, 2008 at 5:26 pm #175853
BTDT,
Thanks for the feedback, and I would ask for your patience as I summarize my understanding of your and Robert B’s counsel:
So the stepwise or best subsets techniques are used to determine the desired subgroup of predictors for the future model (via Mallows Cp, R^2, S, etc.), which are then taken forward for regression, the type of which is in turn determined by the data type of the terms and one’s understanding of the process. Once the model has been fitted, it is assessed for statistical significance (ANOVA and p-values for individual predictors), usability (R^2, R^2 adj, S, magnitude and direction of coefficients, etc.), multicollinearity (VIF), and finally, validity of the underlying assumptions (normality, independence, and equal variance or identical distribution of residuals).
Nonrandom patterns in the residual plots will indicate when and where the model should be adjusted to improve the accuracy and predictability of the coefficients, including a need to transform the Y (and/or X) or a need to change the predictors (i.e. add another single or higher order term). Yes?
And finally: I read that regression does not require the Y to be normal, but requires the residuals to approximate normality. Does this mean you can have a Y data set that is nonnormal and still get approximately normal residuals?
Do curvilinear effects appear in the residual analysis and, if so, what should be plotted to reveal their presence?
Thank you, thank you, thank you!

September 17, 2008 at 5:38 pm #175856
Robert Butler (Participant, @rbutler)
Blindly transforming data to push the value of R2 around is a waste of time. Transformations are used to change the distribution of the residuals and, as BTDT noted,
“the assumption of linear regression is that the results are adequate when the RESIDUALS are normally distributed. Deviations from normality are only seen in the residual plots AFTER attempting to refine a regression model.”
…and the reason you want this is so that you can bring all of the standard tests of significance to bear when you examine the results of your attempt to regress Y on one or more X’s.
If your only object is to push R2 as high as possible a far simpler solution is to just run a gross polynomial overfit. This approach makes as much sense as blind transformations and it is a lot simpler. In addition, if there are no repeat points, you can build an equation with an R2 = 1.
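The point about a gross polynomial overfit can be checked with a short editorial sketch in Python with NumPy. The data here are pure random noise by construction (an assumption made so that there is nothing real to model), and the sample size of 8 is arbitrary; `r_squared` is a hypothetical helper name.

```python
import numpy as np

def r_squared(y, fitted):
    """Coefficient of determination for a set of fitted values."""
    ss_res = np.sum((y - fitted) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return float(1.0 - ss_res / ss_tot)

rng = np.random.default_rng(2)
n = 8
x = np.linspace(-1, 1, n)   # n distinct X values, no repeat points
y = rng.normal(size=n)      # pure noise: Y has no relationship to X

r2_by_degree = {}
for degree in range(1, n):
    coefs = np.polynomial.polynomial.polyfit(x, y, degree)
    fitted = np.polynomial.polynomial.polyval(x, coefs)
    r2_by_degree[degree] = r_squared(y, fitted)
    print(f"degree {degree}: R2 = {r2_by_degree[degree]:.4f}")
```

With n distinct points, the polynomial of degree n - 1 passes through every point, so R2 reaches 1 even though the response is noise.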

September 17, 2008 at 5:57 pm #175858
Robert Butler (Participant, @rbutler)
This is the approach I’d recommend:
1. Don’t jump in and start generating regression statistics. First – plot the data – Y against all of the X’s of choice so that you have some idea of what the data looks like and thus what you have to work with.
2. If it is possible, check your X matrix for collinearity. At a minimum, run VIFs and plot the X’s against one another. I prefer 3D to 2D plots in this case since you can see the relationship between 4 X’s (one X on each of the X, Y, and Z axes, with the points color coded by the values of the 4th X). Run the stepwise regression with the subset of X’s that are sufficiently independent of one another.
3. Run both stepwise and backward elimination on the data to check for consistency with respect to convergence to a type of model.
4. Look at the models – and look at the progression of the statistics in the development of the models (simultaneous changes in R2, Sy.x, and Mallows Cp). You may find an earlier model and not the final model is the better fit to the data.
5. Examine the model(s): residual plots, lack of fit (lof), etc.
6. If there are issues with the residual plots or other aspects of the regression analysis make the appropriate changes (i.e. transforms, runs with and without apparent influential points, etc.) and try again.
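The VIF check in step 2 can be sketched as follows. This is an editorial illustration in Python with NumPy, not the output of any specific package: each predictor is regressed on the others and VIF = 1 / (1 - R2) is reported. The three synthetic predictors and the function name `vif` are assumptions made for the example, and the thread does not state what VIF value counts as too high.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n rows, k predictors)."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        # Regress predictor j on an intercept plus all the other predictors.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(3)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)               # unrelated to x1
x3 = x1 + 0.1 * rng.normal(size=200)    # nearly a copy of x1
vifs = vif(np.column_stack([x1, x2, x3]))
print("VIF for x1, x2, x3:", np.round(vifs, 1))
```

Here x1 and x3 inflate each other while x2 stays near 1, which is the kind of result that would argue for keeping only one of x1 and x3 in the stepwise run.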
To your final points
Yes, normality applies to the residuals not to X or Y.
If there is a curvilinear/linear effect that is missing (unaccounted for by the X’s you used) it will most likely show up in the plot of the residuals against the fitted Y’s in the form of a curve/straight line pattern in the scatterplot.

September 17, 2008 at 6:02 pm #175860
Newbie:

It looks like you have done a fair amount of reading. Keep it up; fitting regression and other models to data is an iterative and involved process. As you keep selecting and removing factors, changing the equation terms and assessing progress, keep looking at the residual plots for any nonrandom patterns in the residual error in the model. The ideal case is one where the errors in the model are smoothly distributed over the prediction interval in time and response.

Curvilinear effects will be seen in the fits vs. residuals plot. For example, all the residuals are high for large and small values of Y, but low for intermediate values. In other words the plot will look like a boomerang.

Minitab has an option to generate all residual plots each time you run the regression. All of them should look random.

As Robert says, you can always increase your R^2 by adding more factors. Don’t make a model more complex without a decent reason to do so. The object is to fit the simplest model that adequately fits your data.

Cheers, Alastair
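The boomerang pattern can be reproduced with a small editorial sketch in Python with NumPy. It assumes synthetic data whose true relationship is quadratic, fits a straight line and then a model with a squared term, and summarizes each residual-vs-fits plot by the mean residual in the lower, middle, and upper third of the fitted values. The helper names `residuals` and `thirds_means` are hypothetical.

```python
import numpy as np

def residuals(X, y):
    """Least squares fit; returns fitted values and residuals."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    return fitted, y - fitted

def thirds_means(fitted, resid):
    """Mean residual in the low, middle, and high third of the fitted values."""
    order = np.argsort(fitted)
    return [float(part.mean()) for part in np.array_split(resid[order], 3)]

rng = np.random.default_rng(4)
x = np.linspace(0, 10, 150)
# Synthetic assumption: the true curve has a squared term the line will miss.
y = 2.0 + 0.5 * x + 0.3 * (x - 5.0) ** 2 + rng.normal(0, 1.0, x.size)

ones = np.ones_like(x)
fit_lin, res_lin = residuals(np.column_stack([ones, x]), y)
fit_quad, res_quad = residuals(np.column_stack([ones, x, x**2]), y)

lin_means = thirds_means(fit_lin, res_lin)
quad_means = thirds_means(fit_quad, res_quad)
print("straight line, mean residual by thirds:", np.round(lin_means, 2))
print("with x^2 term, mean residual by thirds:", np.round(quad_means, 2))
```

The straight-line fit leaves residuals that are high at both ends and low in the middle, while adding the squared term leaves residual means near zero throughout. A plot is still the better diagnostic; the thirds are only a compact stand-in for it.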

September 17, 2008 at 6:33 pm #175863
BTDT and Robert,
Thank you so much! That was tremendously helpful. I will take all your advice moving forward and again, thank you for taking the time to answer a bunch of questions!
Regards,
Newbie

September 19, 2008 at 3:19 pm #175949
Hey Alastair,
I was rereading your response and realized I didn’t fully understand the last point: keeping a model as simple as possible (i.e. a minimal number of terms and straightforward mathematics, I am assuming) while still having it prove useful. So where is the line between maxing out the R^2 value and keeping the number of terms to a reasonable level? Is this where Mallows Cp comes in? Thanks again!

September 22, 2008 at 12:31 pm #176015
Robert Butler (Participant, @rbutler)
If you are running stepwise regression there will be times, especially with nondesigned data, when the machine will grind on, adding terms and providing you with a model that is an overfit of the data.
One way to guard against this is to watch R2, Sy.x, and Mallows Cp and see how they behave as a unit. The “symptoms” of an overfit can be many and varied but the two I’ve seen most often are as follows:
1. For each term up to the Nth term, Sy.x sees a substantial (yes, it is a judgment call) decrease as each term is added, R2 increases, and the Cp will either decline with each addition, or decline but at a more gradual pace. At the (N+1)th term R2 will again increase but the reduction in Sy.x will be much less than was seen in previous steps. Mallows Cp will continue to drop. With each addition of another significant term past the Nth you will see the same thing, the key being that the changes in Sy.x are much smaller than before, until the stepwise process finally grinds to a halt.
2. Same as above but now it is the Cp which seems to be running things. After the Nth term Sy.x and R2 see only very small decreases and increases, respectively; however, Cp continues to see big changes with each new term past the Nth term.
When it comes to regression the issue is the amount of variation in the data that is explained by the model. Consequently, I watch the above statistics in the order of Sy.x, R2, Cp.
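To show how these three statistics move together, here is an editorial sketch in Python with NumPy. It is not a stepwise procedure: terms are simply added in a fixed column order, which is an assumption made to keep the example short. The data are synthetic, with only the first two of six candidate predictors actually affecting Y. Sy.x is computed as sqrt(SSE / (n - p)) and Mallows Cp as SSE_p / MSE_full - n + 2p, where p counts the intercept.

```python
import numpy as np

def fit_sse(X, y):
    """Residual sum of squares from a least squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

rng = np.random.default_rng(5)
n = 60
X_all = rng.normal(size=(n, 6))   # six candidate predictors
# Synthetic assumption: only the first two predictors matter.
y = 3.0 + 2.0 * X_all[:, 0] - 1.5 * X_all[:, 1] + rng.normal(0, 1.0, n)

ones = np.ones((n, 1))
sse_full = fit_sse(np.hstack([ones, X_all]), y)
mse_full = sse_full / (n - 7)     # 7 parameters: intercept + 6 predictors
ss_tot = float(np.sum((y - y.mean()) ** 2))

rows = []
for k in range(1, 7):             # add predictors in column order
    p = k + 1                     # parameters including the intercept
    sse = fit_sse(np.hstack([ones, X_all[:, :k]]), y)
    s_yx = float(np.sqrt(sse / (n - p)))   # Sy.x, standard error of the regression
    r2 = 1.0 - sse / ss_tot
    cp = sse / mse_full - n + 2 * p        # Mallows Cp
    rows.append((k, s_yx, r2, cp))
    print(f"{k} terms  Sy.x={s_yx:.3f}  R2={r2:.3f}  Cp={cp:.1f}")
```

In a run like this, the reduction in Sy.x is large for the terms that matter and small afterwards, while R2 keeps creeping upward, which is the shift from major to minor change described above.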
When either 1 or 2 happens I will go back to the model where the reduction in Sy.x shifted from major to minor, and I will take that model and the final model and run a regression analysis on both. I will compare their respective residual plots, looking for all of the usual suspects (trends, influential data points, lof, etc.), both in general and with respect to subsets of the data that might be of interest to me, and I will compare predictive ability across the range of the data.
If my regression analysis turns up things like data points that appear to be highly influential, independent variables whose missing data change the structure of the data, etc., I will take appropriate actions and rerun everything.
At the end of the effort, if I still have situations like 1 or 2 and if I can’t find any major difference between the two models, I will opt for the simpler version.

September 22, 2008 at 2:22 pm #176020
Robert,
You were reading my mind: how do you choose between simplicity and model effectiveness? You explained very well how you balance those various measures while still keeping an eye on simplicity, which was very helpful. Thanks guys!!