Correlation

by

Adam J. McKee
Using
F. J. Gravetter and L. B. Wallnau's Essentials of Statistics for the Behavioral Sciences (4th Ed.).

 

What is Correlation?

  • Correlation is a statistical technique that is used to measure and describe a relationship between two variables.
  • Usually the two variables are simply observed as they exist naturally in the environment—there is no attempt to control or manipulate the variables.
  • It is important to note that correlation requires two separate scores for each individual—one score for each of two variables.
  • These scores are generically identified as X and Y.
  • Correlational data can be presented in the form of a table, or a graph known as a scatterplot (study figure 15.1 in the text).

The Direction of the Relationship

In a positive correlation, the two variables tend to change in the same direction: When the X variable increases, Y also increases; if the X variable decreases, the Y variable also decreases.

In a negative correlation, the two variables tend to go in opposite directions. As the X variable increases, the Y variable decreases.

The Form of the Relationship

  • Carefully consider example 15.1 in the text.
  • The relationship between beer sales and temperature at football games is linear.
  • That is, a scatter plot of the data forms a roughly straight line on the graph (see figure 15.2).
  • What we will consider in this class are linear relationships.
  • But remember, not every relationship is linear, and when a relationship is not linear, linear statistics are not appropriate. (Think about the relationship between age and agility.)

Degree of Relationship

  • Finally, a correlation measures how well the data fit the specific form being considered.
  • A perfect correlation is always identified by a correlation of +1.00 or -1.00 and indicates a perfect fit; the sign shows only the direction of the relationship.
  • At the other extreme, a correlation of zero indicates no relationship at all.

The Pearson Correlation

The Pearson correlation measures the degree and direction of the linear relationship between two variables.

It is identified by the letter r.

The Sum of Products of Deviations

  • To calculate the Pearson correlation, it is necessary to introduce one new concept: the sum of products of deviations.
  • In the past, we’ve used a similar concept with the sum of squared deviations (SS).
  • The sum of products (SP) provides a very similar procedure for measuring the amount of covariability (changing together) between two variables.

Computing SP
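
A minimal sketch of the calculation, assuming the standard definitional formula SP = Σ(X - MX)(Y - MY) and its computational equivalent SP = ΣXY - (ΣX)(ΣY)/n. The function names and the four pairs of scores are made up for illustration and are not taken from the text.

```python
def sp_definitional(xs, ys):
    """Sum of products of deviations: sum of (X - M_X) * (Y - M_Y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))


def sp_computational(xs, ys):
    """Same quantity from raw sums: sum(XY) - sum(X) * sum(Y) / n."""
    n = len(xs)
    return sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n


# Hypothetical scores: one X and one Y for each of four individuals.
x_scores = [1, 2, 4, 5]
y_scores = [3, 6, 4, 7]

print(sp_definitional(x_scores, y_scores))   # 6.0
print(sp_computational(x_scores, y_scores))  # 6.0
```

Both forms give the same value. A positive SP means X and Y tend to vary together; a negative SP means they tend to vary in opposite directions.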

 

Preliminary Data Analysis

Calculation of Pearson’s r

  • The Pearson correlation consists of a ratio comparing the covariability of X and Y (numerator) with the variability of X and Y separately (denominator).
  • In the formula for the Pearson r, SP is used to measure the covariability of X and Y.
  • The variability of X and Y will be measured by computing SS for the X scores and SS for the Y Scores separately.
  • Since your data are arranged in a new way, it is easier to use the computational formula to get SS:
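
As a sketch, the usual computational formula is SS = ΣX² - (ΣX)²/n, applied once to the X scores and once to the Y scores. The function name and the scores below are illustrative assumptions.

```python
def ss_computational(scores):
    """Sum of squared deviations from raw sums: sum(X^2) - (sum(X))^2 / n."""
    n = len(scores)
    return sum(s * s for s in scores) - sum(scores) ** 2 / n


# Hypothetical scores for four individuals.
x_scores = [1, 2, 4, 5]
y_scores = [3, 6, 4, 7]

ss_x = ss_computational(x_scores)  # 46 - 144/4 = 10.0
ss_y = ss_computational(y_scores)  # 110 - 400/4 = 10.0
print(ss_x, ss_y)
```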

Once you have SS for X and Y and SP, you can calculate r as follows:
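
A minimal sketch, assuming the standard formula r = SP / √(SS_X · SS_Y). The helper name `pearson_r` and the scores are hypothetical; the function simply chains the SP and SS computational formulas.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: SP divided by the square root of SS_X * SS_Y."""
    n = len(xs)
    sp = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    ss_x = sum(x * x for x in xs) - sum(xs) ** 2 / n
    ss_y = sum(y * y for y in ys) - sum(ys) ** 2 / n
    return sp / math.sqrt(ss_x * ss_y)


# Hypothetical scores: SP = 6, SS_X = 10, SS_Y = 10.
x_scores = [1, 2, 4, 5]
y_scores = [3, 6, 4, 7]

r = pearson_r(x_scores, y_scores)
print(round(r, 2))  # 0.6
```

Note that the function fails with a division by zero if either variable has no variability (SS = 0), since r is undefined in that case.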

Using and Interpreting r

1: Prediction

If two variables are known to be related in some systematic way, it is possible to use one of the variables to make accurate predictions about the other.

For example, college admissions officials often use SAT or ACT scores to predict success in college.

2: Validity

In research, validity asks the question "Am I measuring what I think I am measuring?"

For example, if an intelligence test was based on weight and shoe size, it wouldn’t be valid at all.

One common technique of validating tests is to use correlation—if a new test is highly correlated with another test regarded as valid, then the validity of the new test is supported.

3: Reliability

Closely related to validity, reliability asks the question "How precisely am I measuring whatever I am measuring?"

The key to reliability is consistency of results—if I give you a test and you make 100 on it, then I give it to you again and you make a 57, then the test is not reliable.

4: Theory Verification

Many social scientific theories make specific predictions about relationships between variables.

For example, a theory may predict a relationship between brain size and learning ability.

Correlation Caveats: #1

Correlation simply describes a relationship between two variables. It does not explain why the two variables are related. Specifically, a correlation cannot be used as proof of a cause-and-effect relationship between two variables. What if a third variable, Z, caused both?

#2

The value of a correlation can be affected greatly by the range of scores represented in the data. Be careful in your interpretation of data that does not include a full range of scores—using senior English majors in a study of reading ability, for example.
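
The effect of a restricted range can be illustrated with a small sketch. The data are invented for the demonstration, and `pearson_r` is a hypothetical helper built on the usual SP and SS formulas.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation from the computational SP and SS formulas."""
    n = len(xs)
    sp = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    ss_x = sum(x * x for x in xs) - sum(xs) ** 2 / n
    ss_y = sum(y * y for y in ys) - sum(ys) ** 2 / n
    return sp / math.sqrt(ss_x * ss_y)


# Made-up data covering the full range of X (1 through 10).
x_full = list(range(1, 11))
y_full = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]
r_full = pearson_r(x_full, y_full)

# Keep only the individuals at the top of the X range (X >= 7).
top_pairs = [(x, y) for x, y in zip(x_full, y_full) if x >= 7]
x_top = [x for x, _ in top_pairs]
y_top = [y for _, y in top_pairs]
r_restricted = pearson_r(x_top, y_top)

print(round(r_full, 2))        # 0.94
print(round(r_restricted, 2))  # 0.6
```

The same underlying relationship looks much weaker when only a narrow slice of scores is sampled.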

#3

One or two extreme data points, often called outliers, can have a dramatic effect on the value of correlations.
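
A short sketch of how a single outlier can change r. The scores are invented so that the first five points show no linear relationship at all; `pearson_r` is a hypothetical helper using the usual SP and SS formulas.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation from the computational SP and SS formulas."""
    n = len(xs)
    sp = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    ss_x = sum(x * x for x in xs) - sum(xs) ** 2 / n
    ss_y = sum(y * y for y in ys) - sum(ys) ** 2 / n
    return sp / math.sqrt(ss_x * ss_y)


# Five made-up points with SP = 0, so r is 0.
x_scores = [1, 2, 3, 4, 5]
y_scores = [3, 1, 5, 1, 3]
r_without = pearson_r(x_scores, y_scores)

# Add one extreme point far from the rest of the data.
r_with = pearson_r(x_scores + [20], y_scores + [20])

print(round(r_without, 2))  # 0.0
print(round(r_with, 2))     # 0.96
```

One added point moves the correlation from zero to nearly perfect, which is why it is worth inspecting a scatter plot before trusting the number.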

#4

When judging how "good" a relationship is, it is tempting to focus on the number value.

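Be careful not to interpret a correlation as a proportion. That interpretation belongs to r² (the coefficient of determination), which gives the proportion of variability in one variable that can be predicted from its relationship with the other.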
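
A quick sketch of the difference between r and its square. The function name is an illustrative assumption, and the r values are arbitrary examples.

```python
def proportion_explained(r):
    """Square of the correlation: proportion of variability accounted for."""
    return r ** 2


for r in (0.3, 0.5, 0.8):
    print(f"r = {r:.2f} -> r squared = {proportion_explained(r):.2f}")
# r = 0.30 -> r squared = 0.09
# r = 0.50 -> r squared = 0.25
# r = 0.80 -> r squared = 0.64
```

A correlation of 0.50 is therefore not "halfway to perfect": it accounts for only 25% of the variability.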

Hypothesis Testing with r

  • The basic question for this hypothesis test is whether or not a correlation exists in the population.
  • These hypotheses can be stated symbolically as follows:
  • H0: ρ = 0
  • HA: ρ ≠ 0
  • To test this, you will need to know the observed correlation, the sample size (n), and the alpha level.
  • If the absolute value of the calculated r is equal to or larger than the table value of r (for df = n - 2), then you reject the null hypothesis and conclude that the sample provides evidence of a nonzero correlation between the two variables in the population.
  • Study example 15.6 in the text.
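
As a sketch of the decision rule, the code below uses the equivalent t statistic, t = r·√((n - 2)/(1 - r²)) with df = n - 2, rather than a table of critical r values. The t conversion is an alternative to the table lookup described above, the sample values (n = 30, r = 0.40 and 0.30) are hypothetical, and the critical value 2.048 is the standard two-tailed t for α = .05 with df = 28.

```python
import math


def r_to_t(r, n):
    """t statistic for testing H0: rho = 0, with df = n - 2."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


def critical_r(t_critical, df):
    """Critical r implied by a critical t value for the same df."""
    return t_critical / math.sqrt(t_critical ** 2 + df)


T_CRITICAL_DF28 = 2.048  # two-tailed, alpha = .05, df = 28 (standard t table)

n = 30  # hypothetical sample size, so df = 28
decisions = {}
for r in (0.40, 0.30):
    t = r_to_t(r, n)
    reject = abs(t) >= T_CRITICAL_DF28
    decisions[r] = reject
    print(f"r = {r:.2f}, t = {t:.2f}, reject H0: {reject}")

# The same cutoff expressed as a critical r, as it would appear in an r table.
print(round(critical_r(T_CRITICAL_DF28, n - 2), 3))  # 0.361
```

With 30 participants, a sample correlation of 0.40 clears the cutoff and 0.30 does not.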

Regression

When we talked about scatter plots, we saw that you can draw a line through the middle of the data points.

This line serves several purposes:

The line makes the relationship between the variables easier to see.

The line identifies the center, or central tendency, of the relationship—just like the mean, it provides a ‘snapshot’ of the relationship between the two variables.

Finally, the line can be used for prediction. The line establishes a one-to-one relationship between X and Y: each X value has exactly one predicted Y value.

Drawing the line on a graph is useful, but how do we know we are in the middle of all those dots? We can get a lot more precise about the line if we remember from algebra that straight lines can be drawn on a graph using simple equations.

Linear Equations

In general, a linear relationship between variables X and Y can be expressed by the equation

Y = bX + a, where b and a are fixed constants.

For example, a local tennis club charges a fee of $5 per hour and an annual membership fee of $25. With this information, we can determine the total cost of playing tennis by using a linear equation that describes the relationship between total cost (Y) and the number of hours (X).

Y = 5X + 25

In the general linear equation, the value of b is called the slope.

The slope determines how much the Y variable will change when X is increased by 1 point.

The value of a is called the Y-intercept because it determines the value of Y when X = 0.
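
The tennis club example can be written as a small function. This is only an illustration; the function and constant names are assumptions.

```python
HOURLY_FEE = 5       # slope b: dollars added per extra hour
MEMBERSHIP_FEE = 25  # Y-intercept a: cost when hours = 0


def total_cost(hours):
    """Linear equation Y = bX + a for the tennis club example."""
    return HOURLY_FEE * hours + MEMBERSHIP_FEE


print(total_cost(0))   # 25  (the Y-intercept)
print(total_cost(10))  # 75
print(total_cost(11) - total_cost(10))  # 5  (the slope)
```

Increasing X by one hour always changes Y by the slope, no matter where on the line you start.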

Finding the Equation

The statistical technique for finding the best-fitting straight line for a set of data is called regression, and the resulting straight line is called the regression line.

The Least Squares Solution

  • For every X value in the data, the linear equation will determine a Y value on the line.
  • This value is the predicted Y and is called Y hat or Y prime.
  • The distance between the predicted value of Y and the actual value of Y is called error.
  • By minimizing this error, we can come up with a line that fits the best.
  • The most common way to do this uses the squared errors, so it is called the least squared error solution.
  • This is also why the most common type of regression analysis is called least squares regression, or Ordinary Least Squares (OLS).
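
A minimal sketch of the least squares solution, assuming the standard formulas b = SP / SS_X and a = M_Y - b·M_X. The function names and the scores are hypothetical. The last lines check that nudging the slope away from the computed value makes the total squared error larger.

```python
def least_squares_line(xs, ys):
    """Return slope b and intercept a of the least squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    b = sp / ss_x
    a = mean_y - b * mean_x
    return b, a


def sum_squared_error(xs, ys, b, a):
    """Total squared distance between actual Y and predicted Y (Y hat)."""
    return sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))


# Hypothetical scores for four individuals.
x_scores = [1, 2, 4, 5]
y_scores = [3, 6, 4, 7]

b, a = least_squares_line(x_scores, y_scores)
print(round(b, 2), round(a, 2))  # 0.6 3.2

best_error = sum_squared_error(x_scores, y_scores, b, a)
other_error = sum_squared_error(x_scores, y_scores, b + 0.1, a)
print(round(best_error, 2), round(other_error, 2))  # 6.4 6.86
```

Any other line through these points produces a larger sum of squared errors than the least squares line.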

Uses of Regression

  • Regression is a central part of what is known as the General Linear Model, and it is a very useful technique because of its versatility.
  • It can do the same thing as the mean difference tests we’ve discussed and has a built-in measure of effect size, r.
  • It is very useful in prediction, like college admission standards.
