Correlation Coefficient: Simple Definition, Formula, Easy Steps
Correlation coefficients are used to measure how strong a relationship is between two variables. There are several types of correlation coefficient, but the most popular is Pearson’s. Pearson’s correlation (also called Pearson’s R) is a correlation coefficient commonly used in linear regression. If you’re starting out in statistics, you’ll probably learn about Pearson’s R first. In fact, when anyone refers to the correlation coefficient, they are usually talking about Pearson’s.
Correlation Coefficient Formula: Definition
Correlation coefficient formulas are used to find how strong a relationship is between data. The formulas return a value between −1 and 1, where:
 1 indicates a strong positive relationship.
 −1 indicates a strong negative relationship.
 A result of zero indicates no relationship at all.
Meaning
 A correlation coefficient of 1 means that for every positive increase in one variable, there is a positive increase of a fixed proportion in the other. For example, shoe sizes go up in (almost) perfect correlation with foot length.
 A correlation coefficient of −1 means that for every positive increase in one variable, there is a decrease of a fixed proportion in the other. For example, the amount of gas in a tank decreases in (almost) perfect correlation with speed.
 Zero means that an increase in one variable is not matched by any consistent increase or decrease in the other. The two just aren’t related.
The absolute value of the correlation coefficient gives us the relationship strength. The larger the number, the stronger the relationship. For example, |−.75| = .75, which is a stronger relationship than .65.
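In code, this comparison is just an absolute value (a minimal Python sketch; the two r values are illustrative):

```python
# Strength of a correlation is the absolute value of r;
# the sign only gives the direction of the relationship.
r_negative, r_positive = -0.75, 0.65
print(abs(r_negative))                    # 0.75
print(abs(r_negative) > abs(r_positive))  # True: -0.75 is the stronger one
```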
Types of correlation coefficient formulas.
There are several types of correlation coefficient formulas.
One of the most commonly used formulas is Pearson’s correlation coefficient formula. If you’re taking a basic stats class, this is the one you’ll probably use:
Two other formulas are commonly used: the sample correlation coefficient and the population correlation coefficient.
Sample correlation coefficient
s_{x} and s_{y} are the sample standard deviations, and s_{xy} is the sample covariance.
Population correlation coefficient
The population correlation coefficient uses σ_{x} and σ_{y} as the population standard deviations, and σ_{xy} as the population covariance.
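As a sketch of how the sample version works, the snippet below computes r as the sample covariance divided by the product of the sample standard deviations. It reuses the age/glucose data from the worked example later in this article, so the result should match that calculation:

```python
from math import sqrt

x = [43, 21, 25, 42, 57, 59]   # ages, from the worked example below
y = [99, 65, 79, 75, 87, 81]   # glucose levels

n = len(x)
mx, my = sum(x) / n, sum(y) / n

# sample covariance s_xy and sample standard deviations s_x, s_y
# (all divide by n - 1, so the factors cancel in the ratio)
s_xy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
s_x = sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
s_y = sqrt(sum((b - my) ** 2 for b in y) / (n - 1))

r = s_xy / (s_x * s_y)
print(round(r, 4))  # 0.5298
```

The same value comes out of the raw-score formula used in the step-by-step example, since the two formulas are algebraically equivalent.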
What is Pearson Correlation?
Correlation between sets of data is a measure of how well they are related. The most common measure of correlation in stats is the Pearson correlation. The full name is the Pearson Product Moment Correlation (PPMC). It shows the linear relationship between two sets of data. In simple terms, it answers the question: can I draw a line graph to represent the data? Two letters are used to represent the Pearson correlation: the Greek letter rho (ρ) for a population and the letter “r” for a sample.
Potential problems with Pearson correlation.
The PPMC is not able to tell the difference between dependent variables and independent variables. For example, if you are trying to find the correlation between a high calorie diet and diabetes, you might find a high correlation of .8. However, you could also get the same result with the variables switched around. In other words, you could say that diabetes causes a high calorie diet. That obviously makes no sense. Therefore, as a researcher you have to be aware of the data you are plugging in. In addition, the PPMC will not give you any information about the slope of the line; it only tells you whether there is a relationship.
Real Life Example
Pearson correlation is used in thousands of real life situations. For example, scientists in China wanted to measure how genetically different weedy rice populations are, with the goal of finding out the evolutionary potential of the rice. Pearson’s correlation between the two groups was analyzed. It showed a positive Pearson Product Moment correlation of between 0.783 and 0.895 for the weedy rice populations. This figure is quite high, suggesting a fairly strong relationship.
If you’re interested in seeing more examples of PPMC, you can find several studies on the National Institutes of Health’s Openi website, which shows results from studies as varied as breast cyst imaging and the role that carbohydrates play in weight loss.
How to Find Pearson’s Correlation Coefficients
Example question: Find the value of the correlation coefficient from the following table:
Subject  Age x  Glucose Level y 

1  43  99 
2  21  65 
3  25  79 
4  42  75 
5  57  87 
6  59  81 
Step 1: Make a chart. Use the given data, and add three more columns: xy, x^{2}, and y^{2}.
Subject  Age x  Glucose Level y  xy  x^{2}  y^{2} 

1  43  99  
2  21  65  
3  25  79  
4  42  75  
5  57  87  
6  59  81 
Step 2: Multiply x and y together to fill the xy column. For example, row 1 would be 43 × 99 = 4,257.
Subject  Age x  Glucose Level y  xy  x^{2}  y^{2} 

1  43  99  4257  
2  21  65  1365  
3  25  79  1975  
4  42  75  3150  
5  57  87  4959  
6  59  81  4779 
Step 3: Take the square of the numbers in the x column, and put the result in the x^{2} column.
Subject  Age x  Glucose Level y  xy  x^{2}  y^{2} 

1  43  99  4257  1849  
2  21  65  1365  441  
3  25  79  1975  625  
4  42  75  3150  1764  
5  57  87  4959  3249  
6  59  81  4779  3481 
Step 4: Take the square of the numbers in the y column, and put the result in the y^{2} column.
Subject  Age x  Glucose Level y  xy  x^{2}  y^{2} 

1  43  99  4257  1849  9801 
2  21  65  1365  441  4225 
3  25  79  1975  625  6241 
4  42  75  3150  1764  5625 
5  57  87  4959  3249  7569 
6  59  81  4779  3481  6561 
Step 5: Add up all of the numbers in the columns and put the result at the bottom of the column. The Greek letter sigma (Σ) is a short way of saying “sum of” or summation.
Subject  Age x  Glucose Level y  xy  x^{2}  y^{2} 

1  43  99  4257  1849  9801 
2  21  65  1365  441  4225 
3  25  79  1975  625  6241 
4  42  75  3150  1764  5625 
5  57  87  4959  3249  7569 
6  59  81  4779  3481  6561 
Σ  247  486  20485  11409  40022 
Step 6: Use the following correlation coefficient formula. Working through it with our sums gives 2,868 / 5,413.27 = 0.529809.
From our table:
 Σx = 247
 Σy = 486
 Σxy = 20,485
 Σx^{2} = 11,409
 Σy^{2} = 40,022
 n is the sample size, in our case = 6
The correlation coefficient =
r = [6(20,485) – (247 × 486)] / √([6(11,409) – 247^{2}] × [6(40,022) – 486^{2}])
= 0.5298
The range of the correlation coefficient is from −1 to 1. Our result is 0.5298, which means the variables have a moderate positive correlation.
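The hand calculation above can be checked with a short script that implements the same raw-score formula from Step 6 (a sketch, using the six subjects from the table):

```python
from math import sqrt

ages = [43, 21, 25, 42, 57, 59]
glucose = [99, 65, 79, 75, 87, 81]

n = len(ages)
sx, sy = sum(ages), sum(glucose)                 # Σx = 247, Σy = 486
sxy = sum(a * g for a, g in zip(ages, glucose))  # Σxy = 20485
sx2 = sum(a * a for a in ages)                   # Σx² = 11409
sy2 = sum(g * g for g in glucose)                # Σy² = 40022

# r = [nΣxy − ΣxΣy] / √([nΣx² − (Σx)²][nΣy² − (Σy)²])
r = (n * sxy - sx * sy) / sqrt((n * sx2 - sx**2) * (n * sy2 - sy**2))
print(round(r, 4))  # 0.5298
```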
Correlation Formula: TI 83
If you’re taking AP Statistics, you won’t actually have to work the correlation formula by hand. You’ll use your graphing calculator. Here’s how to find r on a TI-83.
Step 1: Type your data into a list and make a scatter plot to ensure your variables are roughly correlated. In other words, look for a straight line. Not sure how to do this? See: TI 83 Scatter plot.
Step 2: Press the STAT button.
Step 3: Scroll right to the CALC menu.
Step 4: Scroll down to 4:LinReg(ax+b), then press ENTER. The output will show “r” at the very bottom of the list.
Tip: If you don’t see r, turn diagnostics on (select DiagnosticOn from the CATALOG menu), then perform the steps again.
How to Compute the Pearson Correlation Coefficient in Excel
Step 1: Type your data into two columns in Excel. For example, type your “x” data into column A and your “y” data into column B.
Step 2: Select any empty cell.
Step 3: Click the function button on the ribbon.
Step 4: Type “correlation” into the ‘Search for a function’ box.
Step 5: Click “Go.” CORREL will be highlighted.
Step 6: Click “OK.”
Step 7: Type the location of your data into the “Array 1” and “Array 2” boxes. For this example, type “A2:A10” into the Array 1 box and then type “B2:B10” into the Array 2 box.
Step 8: Click “OK.” The result will appear in the cell you selected in Step 2. For this particular data set, the correlation coefficient (r) is 0.1316.
Caution: The results for this test can be misleading unless you have made a scatter plot first to ensure your data roughly fits a straight line. The correlation coefficient in Excel 2007 will always return a value, even if your data is something other than linear (e.g. if the data fits an exponential model).
That’s it!
Correlation Coefficient SPSS: Overview.
Step 1: Click “Analyze,” then click “Correlate,” then click “Bivariate.” The Bivariate Correlations window will appear.
Step 2: Click one of the variables in the left-hand window of the Bivariate Correlations pop-up window. Then click the center arrow to move the variable to the “Variables:” window. Repeat this for a second variable.
Step 3: Click the “Pearson” check box if it isn’t already checked. Then click either the “one-tailed” or “two-tailed” test radio button. If you aren’t sure whether your test is one-tailed or two-tailed, the two-tailed test is the usual default.
Step 4: Click “OK” and read the results. Each box in the output gives you a correlation between two variables. For example, the PPMC for Number of older siblings and GPA is .098, which means practically no correlation. You can find this information in two places in the output. Why? Cross-referencing columns and rows is very useful when you are comparing PPMCs for dozens of variables.
Tip #1: It’s always a good idea to make an SPSS scatter plot of your data set before you perform this test. That’s because SPSS will always give you some kind of answer and will assume that the data is linearly related. If you have data that might be better suited to another correlation (for example, exponentially related data) then SPSS will still run Pearson’s for you and you might get misleading results.
Tip #2: Click on the “Options” button in the Bivariate Correlations window if you want to include descriptive statistics like the mean and standard deviation.
Minitab
The Minitab correlation coefficient will return a value for r from −1 to 1.
Example question: Find the Minitab correlation coefficient based on age vs. glucose level from the following table from a prediabetic study of 6 participants:
Subject  Age x  Glucose Level y 

1  43  99 
2  21  65 
3  25  79 
4  42  75 
5  57  87 
6  59  81 
Step 1: Type your data into a Minitab worksheet. I entered this sample data into three columns.
Data entered into three columns in a Minitab worksheet.
Step 2: Click “Stat”, then click “Basic Statistics” and then click “Correlation.”
Step 3: Click a variable name in the left window and then click the “Select” button to move the variable name to the Variable box. For this example question, click “Age,” then click “Select,” then click “Glucose Level” then click “Select” to transfer both variables to the Variable window.
Step 4: (Optional) Check the “P-Value” box if you want to display a P-Value for r.
Step 5: Click “OK”. The Minitab correlation coefficient will be displayed in the Session Window. If you don’t see the results, click “Window” and then click “Tile.” The Session window should appear.
For this dataset:
 Value of r: 0.530
 P-Value: 0.280
That’s it!
Tip: Give your columns meaningful names (in the first row of the column, right under C1, C2 etc.). That way, when it comes to choosing variable names in Step 3, you’ll easily see what it is you are trying to choose. This becomes especially important when you have dozens of columns of variables in a data sheet!
Meaning of the Linear Correlation Coefficient.
Pearson’s Correlation Coefficient is a linear correlation coefficient that returns a value between −1 and +1. A −1 means there is a strong negative correlation and +1 means that there is a strong positive correlation. A 0 means that there is no correlation (this is also called zero correlation).
This can initially be a little hard to wrap your head around (who likes to deal with negative numbers?). The Political Science Department at Quinnipiac University posted this useful list of the meaning of Pearson’s Correlation coefficients. They note that these are “crude estimates” for interpreting strengths of correlations using Pearson’s Correlation:
r value  Interpretation 
+.70 or higher  Very strong positive relationship 
+.40 to +.69  Strong positive relationship 
+.30 to +.39  Moderate positive relationship 
+.20 to +.29  Weak positive relationship 
+.01 to +.19  No or negligible relationship 
0  No relationship [zero correlation] 
−.01 to −.19  No or negligible relationship 
−.20 to −.29  Weak negative relationship 
−.30 to −.39  Moderate negative relationship 
−.40 to −.69  Strong negative relationship 
−.70 or lower  Very strong negative relationship 
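The guideline table above can be turned into a small helper function. This is a sketch of the Quinnipiac cutoffs in Python; the function name and the coding of the ranges are our own framing:

```python
def describe_r(r):
    """Map an r value to the 'crude estimate' labels in the table above."""
    if r == 0:
        return "No relationship"
    sign = "positive" if r > 0 else "negative"
    a = abs(r)                       # strength is the absolute value
    if a < 0.20:
        return "No or negligible relationship"
    if a < 0.30:
        return "Weak " + sign + " relationship"
    if a < 0.40:
        return "Moderate " + sign + " relationship"
    if a < 0.70:
        return "Strong " + sign + " relationship"
    return "Very strong " + sign + " relationship"

print(describe_r(0.5298))   # Strong positive relationship
print(describe_r(-0.75))    # Very strong negative relationship
```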
It may be helpful to see graphically what these correlations look like:
The images show that a strong negative correlation means that the graph has a downward slope from left to right: as the xvalues increase, the yvalues get smaller. A strong positive correlation means that the graph has an upward slope from left to right: as the xvalues increase, the yvalues get larger.
Cramer’s V Correlation
Cramer’s V Correlation is similar to the Pearson Correlation coefficient. While the Pearson correlation is used to test the strength of linear relationships, Cramer’s V is used to calculate correlation in contingency tables larger than 2 × 2 (i.e. with more than two rows or columns). Cramer’s V varies between 0 and 1. A value close to 0 means that there is very little association between the variables. A Cramer’s V close to 1 indicates a very strong association.
Cramer’s V  Interpretation 
.25 or higher  Very strong relationship 
.15 to .25  Strong relationship 
.11 to .15  Moderate relationship 
.06 to .10  Weak relationship 
.01 to .05  No or negligible relationship 
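For completeness, here is a sketch of how Cramér’s V is computed from a contingency table of counts: a chi-square statistic divided by n times (k − 1), where k is the smaller of the number of rows and columns, then square-rooted. The 3 × 2 table of counts is made up for illustration:

```python
from math import sqrt

table = [[10, 20], [30, 40], [50, 60]]   # 3 rows x 2 columns of counts

n = sum(sum(row) for row in table)
row_tot = [sum(row) for row in table]
col_tot = [sum(col) for col in zip(*table)]

# chi-square: sum of (observed - expected)^2 / expected over all cells
chi2 = sum(
    (table[i][j] - row_tot[i] * col_tot[j] / n) ** 2
    / (row_tot[i] * col_tot[j] / n)
    for i in range(len(table))
    for j in range(len(table[0]))
)

k = min(len(table), len(table[0]))       # smaller of rows, columns
v = sqrt(chi2 / (n * (k - 1)))           # Cramér's V, always in [0, 1]
print(round(v, 3))
```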
Where did the Correlation Coefficient Come From?
A correlation coefficient gives you an idea of how well data fits a line or curve. Pearson wasn’t the original inventor of the term correlation but his use of it became one of the most popular ways to measure correlation.
Francis Galton (who was also involved with the development of the interquartile range) was the first person to measure correlation, originally termed “corelation,” which actually makes sense considering you’re studying the relationship between a couple of different variables. In CoRelations and Their Measurement, he said
“The statures of kinsmen are corelated variables; thus, the stature of the father is correlated to that of the adult son,..and so on; but the index of corelation … is different in the different cases.”
It’s worth noting though that Galton mentioned in his paper that he had borrowed the term from biology, where “Corelation and correlation of structure” was being used but until the time of his paper it hadn’t been properly defined.
In 1892, British statistician Francis Ysidro Edgeworth published a paper called “Correlated Averages” (Philosophical Magazine, 5th Series, 34, pp. 190-204) in which he used the term “Coefficient of Correlation.” It wasn’t until 1896 that British mathematician Karl Pearson used “Coefficient of Correlation” in two papers: Contributions to the Mathematical Theory of Evolution and Mathematical Contributions to the Theory of Evolution. III. Regression, Heredity and Panmixia. It was the second paper that introduced the Pearson product-moment correlation formula for estimating correlation.
Correlation Coefficient Hypothesis Test
If you can read a table, you can test the significance of a correlation coefficient. Note that correlations should only be calculated for an entire range of data. If you restrict the range, r will be weakened.
Sample problem: test the significance of the correlation coefficient r = 0.565 using the critical values for PPMC table. Test at α = 0.01 for a sample size of 9.
Step 1: Subtract two from the sample size to get df, the degrees of freedom.
9 – 2 = 7
Step 2: Look the value up in the PPMC table. With df = 7 and α = 0.01, the table value is 0.798.
Step 3: Draw a graph, so you can more easily see the relationship.
r = 0.565 does not fall into the rejection region (above 0.798), so there isn’t enough evidence to state that a strong linear relationship exists in the data.
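The comparison above reduces to a couple of lines of code (a sketch of the same critical-value test, with the sample problem’s numbers):

```python
r, critical = 0.565, 0.798       # sample r and the PPMC table value

df = 9 - 2                       # degrees of freedom = n - 2 = 7
significant = abs(r) > critical  # reject only if |r| exceeds the table value
print(df, significant)           # 7 False: not enough evidence
```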
Relationship to cosine
It’s rare to use trigonometry in statistics (you’ll never need to find the derivative of tan(x) for example!), but the relationship between correlation and cosine is an exception. Correlation can be expressed in terms of angles:
 Positive correlation = acute angle (less than 90°),
 Negative correlation = obtuse angle (more than 90°),
 Uncorrelated = orthogonal (a right angle, exactly 90°).
More specifically, correlation is the cosine of an angle between two vectors defined as follows (Knill, 2011):
If X, Y are two random variables with zero mean, then the covariance Cov[X, Y] = E[X · Y] is the dot product of X and Y. The standard deviation of X is the length of X.
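This identity is easy to verify numerically. The sketch below centers the age/glucose data from the earlier worked example (so each vector has zero mean), then computes the cosine of the angle between the two vectors; the result equals Pearson’s r for that data:

```python
from math import sqrt, acos, degrees

x = [43, 21, 25, 42, 57, 59]
y = [99, 65, 79, 75, 87, 81]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
cx = [v - mx for v in x]     # center x so it has zero mean
cy = [v - my for v in y]     # center y

dot = sum(a * b for a, b in zip(cx, cy))   # dot product = covariance (scaled)
len_x = sqrt(sum(a * a for a in cx))       # "length" of centered x
len_y = sqrt(sum(b * b for b in cy))       # "length" of centered y

cos_angle = dot / (len_x * len_y)          # cosine of the angle = Pearson's r
angle = degrees(acos(cos_angle))
print(round(cos_angle, 4))  # 0.5298, an acute angle (under 90 degrees)
```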
Pearson correlation coefficient: Introduction, formula, calculation, and examples
What is the Pearson correlation coefficient?
The Pearson correlation coefficient (also called Pearson’s correlation coefficient, or Pearson’s r) is defined in statistics as a measurement of the strength of the relationship between two variables and of their association with each other.
In simple words, Pearson’s correlation coefficient calculates the effect of change in one variable when the other variable changes.
For example: Up till a certain age, (in most cases) a child’s height will keep increasing as his/her age increases. Of course, his/her growth depends upon various factors like genes, location, diet, lifestyle, etc.
This approach is based on covariance and is a widely used method for measuring the linear relationship between two variables.
What does the Pearson correlation coefficient test do?
The Pearson correlation test looks at the relationship between two variables. It seeks to draw a line through the data of the two variables to show their relationship, which can be measured with a Pearson correlation coefficient calculator. This linear relationship can be positive or negative.
For example:
 Positive linear relationship: In most cases, universally, the income of a person increases as his/her age increases.
 Negative linear relationship: If the vehicle increases its speed, the time taken to travel decreases, and vice versa.
From the example above, it is evident that the Pearson correlation coefficient, r, tries to find out two things: the strength and the direction of the relationship in the given sample.
Pearson correlation coefficient formula
The correlation coefficient formula finds out the relation between the variables. It returns values between −1 and 1.
Pearson correlation coefficient formula:
Where:
N = the number of pairs of scores
Σxy = the sum of the products of paired scores
Σx = the sum of x scores
Σy = the sum of y scores
Σx^{2} = the sum of squared x scores
Σy^{2} = the sum of squared y scores
Pearson correlation coefficient calculator
Here is a step by step guide to calculating Pearson’s correlation coefficient:
Step one: Create a Pearson correlation coefficient table. Make a data chart, including both the variables. Label these variables ‘x’ and ‘y.’ Add three additional columns – (xy), (x^2), and (y^2). Refer to this simple data chart.
Step four: Use the correlation formula to plug in the values.
If the result is negative, there is a negative correlation relationship between the two variables. If the result is positive, there is a positive correlation relationship between the variables. Results can also define the strength of a linear relationship i.e., strong positive relationship, strong negative relationship, medium positive relationship, and so on.
Determining the strength of the Pearson productmoment correlation coefficient
The Pearson product-moment correlation coefficient, or simply the Pearson correlation coefficient, r, determines the strength of the linear relationship between two variables. The stronger the association between the two variables, the closer r gets to +1 or −1. A value of exactly +1 or −1 means that all the data points fall on the straight line of best fit; the relationship between the variables is perfectly linear. The closer r lies to 0, the more scatter there is around the line.
How to interpret the Pearson correlation coefficient
Below are the proposed guidelines for the Pearson coefficient correlation interpretation:
Note that the strength of the association of the variables depends on what you measure and sample sizes.
On a graph, one can notice the relationship between the variables and make assumptions before even calculating it. If the plotted points lie close to the line, they show a strong relationship between the variables. The closer the points lie to the line, the stronger the relationship; the further they move from the line, the weaker the relationship gets. If the line is nearly parallel to the x-axis because the points are placed randomly on the graph, it’s safe to assume that there is no correlation between the two variables.
What do the terms strength and direction mean?
The terms ‘strength’ and ‘direction’ have a statistical significance. Here’s a straightforward explanation of the two words:
 Strength: Strength signifies how strong the relationship between two variables is, i.e., how consistently one variable changes with a change in the other. Values close to +1 or −1 indicate a strong relationship. These values are attained if the data points fall on or very close to the line. The further the data points move away from the line, the weaker the strength of the linear relationship. When there is no practical way to draw a straight line because the data points are scattered, the strength of the linear relationship is at its weakest.
 Direction: The direction of the line indicates a positive or negative linear relationship between the variables. If the line has an upward slope, the variables have a positive relationship: an increase in the value of one variable leads to an increase in the value of the other. A negative correlation depicts a downward slope: an increase in the value of one variable leads to a decrease in the value of the other.
Examples of Pearson’s correlation coefficient
Let’s look at some visual examples to help you interpret a Pearson correlation coefficient table:
 Large positive correlation:
The above figure depicts a correlation of almost +1.
The points are plotted nearly on the straight line.
The slope is positive, which means that if one variable increases, the other variable also increases, forming an uphill linear pattern.
This denotes that a change in one variable is directly proportional to the change in the other variable.
An example of a large positive correlation would be – As children grow, so do their clothes and shoe sizes.
 Medium positive correlation:
The figure above depicts a positive correlation.
The correlation is above +0.8 but below +1.
It shows a pretty strong linear uphill pattern.
An example of a medium positive correlation would be: as the number of automobiles increases, so does the demand for fuel.
 Small negative correlation
In the figure above, the plotted points are not as close to the straight line as in the earlier examples.
It shows a negative linear correlation of approximately −0.5.
The change in one variable is inversely proportional to the change of the other variable as the slope is negative.
An example of a small negative correlation would be – The more somebody eats, the less hungry they get.
 Weak / no correlation
The scatterplots are far away from the line.
It is tough to practically draw a line.
The correlation is approximately +0.15
It can’t be judged whether the change in one variable is directly or inversely proportional to the other variable.
An example of a weak/no correlation would be – An increase in fuel prices leads to lesser people adopting pets.
Correlation Coefficient Formula
The correlation coefficient formula is given and explained here for all of its types. There are various formulas to calculate the correlation coefficient, and the ones covered here include Pearson’s Correlation Coefficient Formula, Linear Correlation Coefficient Formula, Sample Correlation Coefficient Formula, and Population Correlation Coefficient Formula. Before going to the formulas, it is important to understand what correlation and the correlation coefficient are. A brief introduction is given below.
About Correlation Coefficient
The correlation coefficient is a measure of the association between two variables. It is used to find the relationship between data and to check how strong that relationship is. The formulas return a value between −1 and 1, where −1 shows a negative correlation and +1 shows a positive correlation.
A positive value shows that the two variables move together, while a negative value shows that they move in opposite directions.
Types of Correlation Coefficient Formula
There are several types of correlation coefficient formulas. But one of the most commonly used formulas in statistics is Pearson’s Correlation Coefficient Formula. The formulas for all of the correlation coefficients are discussed below.
Pearson’s Correlation Coefficient Formula
Also known as bivariate correlation, Pearson’s correlation coefficient is the most widely used correlation method across the sciences. The correlation coefficient is denoted by “r”.
To find r, let us suppose the two variables as x & y, then the correlation coefficient r is calculated as:
Notations:
n  Quantity of information (number of paired values) 
Σx  Total of the first variable’s values 
Σy  Total of the second variable’s values 
Σxy  Sum of the products of the first & second values 
Σx^{2}  Sum of the squares of the first values 
Σy^{2}  Sum of the squares of the second values 
Linear Correlation Coefficient Formula
The linear correlation coefficient formula is given by the following formula
Practice Questions from Coefficient of Correlation Formula
 Question 1: Find the linear correlation coefficient for the following data. X = 4, 8 ,12, 16 and Y = 5, 10, 15, 20.
 Question 2: Calculate correlation coefficient for x = 100, 106, 112, 98, 87, 77, 67, 66, 49 and y = 28, 33, 26, 27, 24, 24, 21, 26, 22.
 Question 3: What will be the correlation coefficient for X and Y for the given values: X = (1, 2, 3, 4, 5) and Y = (11, 22, 34, 43, 56)?
How to Calculate a Pearson Correlation Coefficient
A Pearson Correlation Coefficient measures the linear association between two variables.
It always takes on a value between −1 and 1 where:
 −1 indicates a perfectly negative linear correlation between two variables
 0 indicates no linear correlation between two variables
 1 indicates a perfectly positive linear correlation between two variables
The formula to calculate a Pearson Correlation Coefficient, denoted r, is:
This tutorial provides a stepbystep example of how to calculate a Pearson Correlation Coefficient by hand for the following dataset:
Step 1: Calculate the Mean of X and Y
First, we’ll calculate the mean of both the X and Y values:
Step 2: Calculate the Difference Between Means
Next, we’ll calculate the difference between each of the individual X and Y values and their respective means:
Step 3: Calculate the Remaining Values
Next, we’ll calculate the remaining values needed to complete the Pearson Correlation Coefficient formula:
Step 4: Calculate the Sums
Next, we’ll calculate the sums of the last three columns:
Step 5: Calculate the Pearson Correlation Coefficient
Now we’ll simply plug in the sums from the previous step into the formula for the Pearson Correlation Coefficient:
The Pearson Correlation Coefficient turns out to be 0.947.
Since this value is close to 1, this is an indication that X and Y are strongly positively correlated.
In other words, as the value for X increases the value for Y also increases in a highly predictable fashion.
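The mean-deviation method in Steps 1 through 5 can be sketched in a few lines of Python. The tutorial’s own dataset isn’t reproduced here, so the (X, Y) pairs below are made-up illustrative values, not the data behind the 0.947 result:

```python
from math import sqrt

X = [1, 2, 3, 4, 5]        # hypothetical x values
Y = [2, 4, 5, 4, 6]        # hypothetical y values

mean_x = sum(X) / len(X)   # Step 1: means of X and Y
mean_y = sum(Y) / len(Y)

dx = [x - mean_x for x in X]   # Step 2: deviations from the means
dy = [y - mean_y for y in Y]

# Steps 3-4: products of deviations, squared deviations, and their sums
sum_dxdy = sum(a * b for a, b in zip(dx, dy))
sum_dx2 = sum(a * a for a in dx)
sum_dy2 = sum(b * b for b in dy)

# Step 5: plug the sums into the formula
r = sum_dxdy / sqrt(sum_dx2 * sum_dy2)
print(round(r, 3))  # 0.853 for this made-up data
```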
Pearson Correlations – Quick Introduction
A Pearson correlation is a number between −1 and +1 that indicates the extent to which two variables are linearly related. The Pearson correlation is also known as the “product moment correlation coefficient” (PMCC) or simply “correlation”.
Pearson correlations are only suitable for quantitative variables (including dichotomous variables).
 For ordinal variables, use the Spearman correlation or Kendall’s tau and
 for nominal variables, use Cramér’s V.
Correlation Coefficient – Example
We asked 40 freelancers for their yearly incomes over 2010 through 2014. Part of the raw data are shown below.
Today’s question is: is there any relation between income over 2010
and income over 2011? Well, a splendid way for finding out is inspecting a scatterplot for these two variables: we’ll represent each freelancer by a dot. The horizontal and vertical positions of each dot indicate a freelancer’s income over 2010 and 2011. The result is shown below.
Our scatterplot shows a strong relation between income over 2010 and 2011: freelancers who had a low income over 2010 (leftmost dots) typically had a low income over 2011 as well (lower dots) and vice versa. Furthermore, this relation is roughly linear; the main pattern in the dots is a straight line.
The extent to which our dots lie on a straight line indicates the strength of the relation. The Pearson correlation is a number that indicates the exact strength of this relation.
Correlation Coefficients and Scatterplots
A correlation coefficient indicates the extent to which dots in a scatterplot lie on a straight line. This implies that we can usually estimate correlations pretty accurately from nothing more than scatterplots. The figure below nicely illustrates this point.
Correlation Coefficient – Basics
Some basic points regarding correlation coefficients are nicely illustrated by the previous figure. The least you should know is that
 Correlations are never lower than −1. A correlation of −1 indicates that the data points in a scatter plot lie exactly on a straight descending line; the two variables are perfectly negatively linearly related.
 A correlation of 0 means that two variables don’t have any linear relation whatsoever. However, some nonlinear relation may exist between the two variables.
 Correlation coefficients are never higher than 1. A correlation coefficient of 1 means that two variables are perfectly positively linearly related; the dots in a scatter plot lie exactly on a straight ascending line.
Correlation Coefficient – Interpretation Caveats
When interpreting correlations, you should keep some things in mind. An elaborate discussion deserves a separate tutorial but we’ll briefly mention two main points.
 Correlations may or may not indicate causal relations. Conversely, causal relations from one variable to another may or may not result in a correlation between the two variables.
 Correlations are very sensitive to outliers; a single unusual observation may have a huge impact on a correlation. Such outliers are easily detected by a quick inspection of a scatterplot.
Correlation Coefficient – Software
Most spreadsheet editors such as Excel, Google Sheets and OpenOffice can compute correlations for you. The illustration below shows an example in Google Sheets.
Correlation Coefficient – Correlation Matrix
Keep in mind that correlations apply to pairs of variables. If you’re interested in more than 2 variables, you’ll probably want to take a look at the correlations between all the different variable pairs. These correlations are usually shown in a square table known as a correlation matrix. Statistical software packages such as SPSS create correlation matrices in the blink of an eye. An example is shown below.
Note that the diagonal elements (in red) are the correlations between each variable and itself. This is why they are always 1.
Also note that the correlations beneath the diagonal (in grey) are redundant because they’re identical to the correlations above the diagonal. Technically, we say that this is a symmetrical matrix.
Finally, note that the pattern of correlations makes perfect sense: correlations between yearly incomes become lower insofar as these years lie further apart.
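A correlation matrix is nothing more than Pearson’s r for every pair of columns. The sketch below builds one from scratch; the income columns are made-up numbers, not the freelancer data, but the diagonal of 1’s and the symmetry described above hold for any dataset:

```python
from math import sqrt

def pearson(x, y):
    """Pearson's r via mean deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

data = {
    "income_2010": [30, 35, 28, 50, 41],   # illustrative values only
    "income_2011": [32, 37, 27, 52, 43],
    "income_2012": [29, 40, 30, 48, 47],
}

cols = list(data)
matrix = [[round(pearson(data[a], data[b]), 2) for b in cols] for a in cols]
for row in matrix:
    print(row)   # 1.0 on the diagonal; matrix is symmetric
```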
Pearson Correlation – Formula
If we want to inspect correlations, we’ll have a computer calculate them for us. You’ll rarely (probably never) need the actual formula. However, for the sake of completeness, a Pearson correlation between variables X and Y is calculated by
The formula basically comes down to dividing the covariance by the product of the standard deviations. Since a coefficient is a number divided by some other number, our formula shows why we speak of a correlation coefficient.
Correlation – Statistical Significance
The data we have available are often, but not always, a small sample from a much larger population. If so, we may find a nonzero correlation in our sample even if it’s zero in the population. The figure below illustrates how this could happen.
If we ignore the colors for a second, all 1,000 dots in this scatterplot visualize some population. The population correlation denoted by ρ is zero between test 1 and test 2.
Now, we could draw a sample of N = 20 from this population for which the correlation r = 0.95. Conversely, this means that a sample correlation of 0.95 doesn’t prove with certainty that there’s a nonzero correlation in the entire population. However, finding r = 0.95 with N = 20 is extremely unlikely if ρ = 0. But precisely how unlikely? And how do we know?
Correlation – Test Statistic
If the population correlation ρ is zero, then the probability of finding a given sample correlation (its statistical significance) depends on the sample size. We therefore combine the sample size and r into a single number, our test statistic t:
t = r √(n − 2) / √(1 − r²)
Now, t itself is not interesting. However, we need it for finding the significance level of some correlation. t follows a t-distribution with ν = n − 2 degrees of freedom, but only if some assumptions are met.
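The test statistic is a one-liner in code. This sketch plugs in the article’s sample (r = 0.95, N = 20, so df = 18):

```python
from math import sqrt

r, n = 0.95, 20
df = n - 2                               # degrees of freedom = 18
t = r * sqrt(n - 2) / sqrt(1 - r ** 2)   # t = r*sqrt(n-2)/sqrt(1-r^2)
print(df, round(t, 2))                   # a very large t, hence a tiny p-value
```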
Correlation Test – Assumptions
The statistical significance test for a Pearson correlation requires 3 assumptions:
 independent observations;
 the population correlation, ρ = 0;
 normality: the 2 variables involved are bivariately normally distributed in the population. However, this is not needed for a reasonable sample size (say, N ≥ 20 or so).
Pearson Correlation – Sampling Distribution
In our example, the sample size N was 20. So if we meet our assumptions, t follows a t-distribution with df = 18 as shown below.
This distribution tells us that there’s a 95% probability that −2.1 < t < 2.1, corresponding to −0.44 < r < 0.44. Conclusion: if N = 20, there’s a 95% probability of finding −0.44 < r < 0.44. There’s only a 5% probability of finding a correlation outside this range. That is, such correlations are statistically significant at α = 0.05 or lower: they are (highly) unlikely and thus refute the null hypothesis of a zero population correlation.
Last, our sample correlation of 0.95 has a p-value of 1.55 × 10^{−10}, odds of about one to 6,467,334,654. We can safely conclude there’s a nonzero correlation in our entire population.