Regression – Part 1
Regression is a technique that we use when we want to predict something such as people’s opinions, attitudes, or even behaviors. Based on knowledge of one characteristic, we can predict with some accuracy another characteristic. How accurate our prediction is will depend on the strength of the relationship between the two characteristics, or variables.
The stronger the relationship between the variables, the more accurate the prediction will be, and the weaker the relationship, the less accurate the prediction will be. For example, suppose our goal is to predict how many batters will be hit by a pitcher’s ball in any given game based on the temperature of the day. First, we would need to explore whether these two variables were even related, as Reifman and colleagues did in their 1991 article (http://deepblue.lib.umich.edu/bitstream/handle/2027.42/68476/10.1177_0146167291175013.pdf?sequence=2&isAllowed=y). They found a positive relationship (r = .11) between heat during the game and the number of batters hit with the ball. Since the two variables are related, we could conceivably use temperature to predict the number of batters hit in any baseball game.
However, one very important thing to remember is that since the correlation wasn’t strong (it was quite weak, actually), the accuracy of the prediction will be somewhat lacking. The reason for this is related to something called the coefficient of determination. Remember what this is? Right! It’s the amount of variance in one variable that is associated with the variance in the other variable. In this case, it’s the amount of variance in batters being hit associated with the variance in temperature. In squaring the correlation coefficient (r² = .11²), we find that the coefficient of determination is only .012. This means that only about 1.2% of the variance in being hit is associated with the variance in temperature, which also means that 98.8% of that variance is associated with things other than temperature. In other words, there are a lot of other reasons batters are hit in baseball games besides how hot it is.
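If you'd like to check that arithmetic yourself, here is a minimal Python sketch. The function and variable names are made up for illustration; the only number taken from the study is r = .11.

```python
def coefficient_of_determination(r):
    """Square a correlation coefficient to get the proportion of shared variance."""
    return r ** 2


# r = .11 is the heat / hit-batter correlation reported above.
r_heat = 0.11
r_squared = coefficient_of_determination(r_heat)

# Express the shared and leftover variance as percentages, rounded to one decimal.
shared_pct = round(r_squared * 100, 1)
other_pct = round((1 - r_squared) * 100, 1)

print(f"r squared = {r_squared:.4f}")
print(f"Variance associated with temperature: {shared_pct}%")
print(f"Variance associated with other things: {other_pct}%")
```

Try swapping in other values of r to see how quickly (or slowly) the shared variance grows.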
Stronger correlations between variables produce much more accurate predictions. For example, if the correlation between washing your hands regularly and the number of colds you get a year is -.70 (this is purely a guess!), then we should be able to fairly accurately predict how many colds people will get based on how often they wash their hands. Since the coefficient of determination for this relationship would be .49, we know that 49% of the variance in cold-catching is associated with the variance in hand-washing, which will lead to a more accurate prediction with regression.
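To put a rough number on "more accurate," here is a small Python sketch comparing the two examples. It leans on a standard result that this post hasn't covered: in simple linear regression, the typical prediction error, measured as a fraction of the outcome's standard deviation, is about the square root of (1 - r²), ignoring small-sample corrections. The function name is hypothetical, and the hand-washing correlation is still just a guess.

```python
import math


def relative_prediction_error(r):
    """Approximate typical prediction error as a fraction of the outcome's SD.

    Assumes simple linear regression and ignores small-sample corrections.
    A value of 1.0 means no better than always guessing the mean.
    """
    return math.sqrt(1 - r ** 2)


examples = [
    ("heat and batters hit", 0.11),            # correlation from the study
    ("hand-washing and colds", -0.70),         # made-up correlation
]

for label, r in examples:
    error = relative_prediction_error(r)
    print(f"{label}: r = {r:+.2f}, r squared = {r ** 2:.2f}, "
          f"relative error = {error:.3f}")
```

Under these assumptions, the weak correlation leaves the prediction error almost as large as if we had simply guessed the average every time, while the stronger one trims it noticeably. Note that the sign of r doesn't matter here, only its size.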
In the next post, we’ll take a look at how size of the correlation affects the slope of the regression line. “What’s that??” you say? Patience….