
## Introduction

Have you ever noticed in a shop how the size of an object affects its price? A relation exists between two quantities when a change in one is accompanied by a change in the other: both may increase or decrease together, or one may increase while the other decreases. When such quantities are plotted on a graph and the points fall close to a straight line, the relation between them is linear. The linear regression formula describes this linear relation and how the two quantities depend on each other.

Linear regression is known to be the most basic and commonly used predictive analysis. In this concept, one variable is considered to be an explanatory variable, and the other variable is considered to be a dependent variable. For example, a modeller might want to relate the weights of individuals to their heights using the concept of linear regression.

**Simple Linear Regression**

- One is the dependent variable (that is interval or ratio).
- One is the independent variable (that is interval or ratio or dichotomous).

**Multiple Linear Regression**

- One is the dependent variable (that is interval or ratio).
- Two or more independent variables (that is interval or ratio or dichotomous).

**Logistic Regression**

- One is the dependent variable (that is binary).
- Two or more independent variable(s) (that is interval or ratio or dichotomous).

**Ordinal Regression**

- One is the dependent variable (that is ordinal).
- One or more independent variable(s) (that is nominal or dichotomous).

**Multinomial Regression**

- One is the dependent variable (that is nominal).
- One or more independent variable(s) (that is interval or ratio or dichotomous).

**Discriminant Analysis**

- One is the dependent variable (that is nominal).
- One or more independent variable(s) (that is interval or ratio).

**What is Linear Regression?**

Linear regression is a linear method for modelling the relationship between independent variables and a dependent variable, and it makes the dependency between two variables easy to analyse. One variable is considered to be an explanatory variable, while the other is considered to be a dependent variable. The linearity of the learned relationship makes the interpretation very easy. Linear regression models have long been used by statisticians, computer scientists, and others who tackle quantitative problems. For example, a statistician might want to relate the weights of individuals to their heights using a linear regression model.

**The Formula of Linear Regression**

The linear regression equation is given by:

y = a + bx

a and b can be computed by the following formulas:

where x and y are the variables for which we will make the regression line, and

- b = slope of the line.
- a = y-intercept of the line.
- x = values of the first data set.
- y = values of the second data set.
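The formulas for a and b are not shown above. Assuming the usual least-squares fit is what is meant, and writing n for the number of (x, y) pairs (a symbol that is an assumption here, not defined in the list above), they would read:

$$
b = \frac{n\sum xy - \sum x \sum y}{n\sum x^{2} - \left(\sum x\right)^{2}},
\qquad
a = \frac{\sum y - b\sum x}{n}
$$

The second expression is just the statement that the line passes through the point of means, since it rearranges to a = ȳ – b·x̄.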

**Note:** The first step in finding a linear regression equation is to determine if there is a relationship between the two variables. This is often a judgment call for the researcher. You’ll also need a list of your data in an x–y format (i.e. two columns of data – independent and dependent variables).

**Simple Linear Regression Formula Plotting**

**Table 1. Example data.**

The concept of linear regression consists of finding the best-fitting straight line through the given points. The best-fitting line is known as a regression line. The black diagonal line in Figure 2 is the regression line and consists of the predicted score on Y for each possible value of the variable X. The vertical lines from the points to the regression line represent the errors of prediction. The red point is very near the regression line, so its error of prediction is small. By contrast, the yellow point is much higher than the regression line, and therefore its error of prediction is large.

In short, the black line in the figure consists of the predictions, the points are the actual data, and the vertical lines between the points and the black line represent the errors of prediction.

**Properties of Linear Regression**

For the regression line where the regression parameters b_{0} and b_{1} are defined, the properties are given as below:

- The line minimizes the sum of squared differences between observed values and predicted values.
- The regression line passes through the mean of the X and Y variable values.
- The regression constant (b_{0}) is equal to the y-intercept of the regression line.
- The regression coefficient (b_{1}) is the slope of the regression line, which is equal to the average change in the dependent variable (Y) for a unit change in the independent variable (X).
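Two of these properties can be checked numerically. The following is a minimal sketch, assuming the usual least-squares estimates for b_{0} and b_{1}; the helper name `fit_line` is hypothetical, and the four data points are the ones used in the solved example later in this article.

```python
# Data from the solved example in this article (x: 2, 4, 6, 8; y: 3, 7, 5, 10).
xs = [2, 4, 6, 8]
ys = [3, 7, 5, 10]


def fit_line(xs, ys):
    """Return (b0, b1): intercept and slope of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


b0, b1 = fit_line(xs, ys)
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

# Property: the line passes through the point of means (mean_x, mean_y).
predicted_at_mean = b0 + b1 * mean_x

# Property: a one-unit change in X changes the prediction by b1.
unit_step = (b0 + b1 * 5) - (b0 + b1 * 4)

print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")
print(f"prediction at mean of X = {predicted_at_mean:.2f}, mean of Y = {mean_y:.2f}")
print(f"change in prediction per unit of X = {unit_step:.2f}")
```

Here b_{0} is also the value the line predicts at X = 0, which is what "y-intercept" means in the list above.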

**What is Linear Regression Used for?**

Linear regression is used for:

- Studying engine performance from test data in automobiles.
- Market research studies and customer survey results analysis.
- Observational astronomy. A number of statistical tools and methods are used in astronomical data analysis, and there are entire libraries in languages like Python meant for data analysis in astrophysics.
- Analyzing the effect of marketing, pricing, and promotions on sales of a product.

**Questions to be Solved**

**Question** 1) Find out the linear regression equation from the given set of data.

**Solution:**

Using the simple linear regression formula,

### Solved Examples

**Question:** Find the linear regression equation for the following two sets of data:

x | 2 | 4 | 6 | 8 |
---|---|---|---|---|
y | 3 | 7 | 5 | 10 |

**Solution:**

Construct the following table:
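The table itself is not reproduced here. As a sketch of what such a working table usually contains (columns for x, y, xy and x², an assumption about the intended layout), the following Python builds it from the data in the question and applies the standard least-squares sum formulas:

```python
xs = [2, 4, 6, 8]
ys = [3, 7, 5, 10]

# One row per observation: x, y, x*y, x^2.
rows = [(x, y, x * y, x * x) for x, y in zip(xs, ys)]

n = len(rows)
sum_x = sum(r[0] for r in rows)
sum_y = sum(r[1] for r in rows)
sum_xy = sum(r[2] for r in rows)
sum_x2 = sum(r[3] for r in rows)

print(f"{'x':>5}{'y':>5}{'xy':>6}{'x^2':>6}")
for x, y, xy, x2 in rows:
    print(f"{x:>5}{y:>5}{xy:>6}{x2:>6}")
print(f"{sum_x:>5}{sum_y:>5}{sum_xy:>6}{sum_x2:>6}  (column sums)")

# Standard least-squares estimates of slope b and intercept a.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n

print(f"y = {a:.2f} + {b:.2f}x")
```

For this data the column sums are Σx = 20, Σy = 25, Σxy = 144 and Σx² = 120, which gives the line y = 1.50 + 0.95x.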

**Standard Error in Linear Regression Formula:**

The standard error about the regression line is a measure of the average amount by which the regression equation over- or under-predicts. It is denoted by SE. The higher the coefficient of determination, the lower the standard error, and hence the more accurate the predictions.
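No formula for SE is given here. A common definition, assumed in this sketch, is the square root of the sum of squared residuals divided by n – 2. The example reuses the data from the solved example above (x: 2, 4, 6, 8; y: 3, 7, 5, 10).

```python
import math

xs = [2, 4, 6, 8]
ys = [3, 7, 5, 10]
n = len(xs)

# Least-squares fit (deviation-from-the-mean form).
mean_x = sum(xs) / n
mean_y = sum(ys) / n
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
a = mean_y - b * mean_x

# Residual = observed y minus predicted y.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
sum_squared_residuals = sum(r * r for r in residuals)

# Assumed definition: SE = sqrt(SSE / (n - 2)); two degrees of freedom
# are used up by estimating a and b.
standard_error = math.sqrt(sum_squared_residuals / (n - 2))

print(f"sum of squared residuals = {sum_squared_residuals:.2f}")
print(f"standard error = {standard_error:.4f}")
```

A smaller value means the observed points sit closer, on average, to the fitted line.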

## FAQs (Frequently Asked Questions)

**1. What is a linear regression with an example?**

Linear regression quantifies the relationship between one or more predictor variable(s) and one outcome variable. For example, it can be used to quantify the relative impacts of age, gender, and diet (the predictor variables) on height (the outcome variable).

**2. How do you calculate linear regression?**

The linear regression equation has the form Y = a + bX, where Y is the dependent variable (the variable that goes on the Y-axis), X is the independent variable (plotted on the X-axis), b is the slope of the line, and a is the y-intercept.

**3. How do you Calculate the Y-Intercept?**

Using the “slope-intercept” form of the line’s equation (y = mx + b), you solve for b (which is the y-intercept you’re looking for). Substitute the known slope for m and the coordinates of a known point for x and y in the slope-intercept equation, then solve for b.

**4. What is a Regression Model Example?**

One example is a simple linear regression plot of the amount of rainfall per year. Regression analysis is also used in statistics to find trends in data. For example, you might guess that there’s a connection between how much you eat and how much you weigh; regression analysis can help you quantify that.

**5. What are the prerequisites needed for regression analysis using the Linear Regression Formula?**

The regression analysis using the linear regression formula is valid only when the following conditions have been satisfied:

1. The dependent variable Y should have a linear relationship to the independent variable X. To check this, make sure that the XY scatter plot is linear and that the residual plot shows a random pattern.

2. For each value of X, the probability distribution of Y has the same standard deviation. When this condition is satisfied, the variability of the residuals is relatively constant across all values of X, which can be checked with a residual plot.

**6. What is the coefficient of determination for a linear regression model?**

The coefficient of determination is one of the main results of regression analysis. Its properties are as follows:

1. The coefficient of determination ranges from 0 to 1.

2. A coefficient of determination of 0 means that the dependent variable cannot be predicted from the independent variable.

3. A coefficient of determination of 1 means that the dependent variable can be predicted without error from the independent variable.

4. A value between 0 and 1 indicates the extent to which the dependent variable is predictable.
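As an illustration, the coefficient of determination can be computed as one minus the ratio of unexplained to total variation. This is a minimal sketch under that assumed definition, again using the four data points from the solved example (x: 2, 4, 6, 8; y: 3, 7, 5, 10).

```python
xs = [2, 4, 6, 8]
ys = [3, 7, 5, 10]
n = len(xs)

mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least-squares slope and intercept.
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
a = mean_y - b * mean_x

# Unexplained variation: squared distances from the points to the line.
ss_residual = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
# Total variation: squared distances from the points to the mean of y.
ss_total = sum((y - mean_y) ** 2 for y in ys)

r_squared = 1 - ss_residual / ss_total
print(f"coefficient of determination = {r_squared:.4f}")
```

A value near 0.67, as here, would mean that roughly two thirds of the variation in y is accounted for by the line.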

### Why use Linear Relationships?

Linear relationships, i.e. lines, are easier to work with, and many phenomena are approximately linearly related. If variables *aren’t* linearly related, then a mathematical transformation can often turn that relationship into a linear one, so that it’s easier for the researcher (i.e. you) to understand.

### What is Simple Linear Regression?

You’re probably familiar with plotting line graphs with one X axis and one Y axis. The X variable is sometimes called the independent variable and the Y variable is called the dependent variable. Simple linear regression plots one independent variable X against one dependent variable Y. Technically, in regression analysis, the independent variable is usually called the predictor variable and the dependent variable is called the criterion variable. However, many people just call them the independent and dependent variables. More advanced regression techniques (like multiple regression) use multiple independent variables.

Regression analysis can result in *linear* or *nonlinear* graphs. A linear regression is where the relationships between your variables can be described with a straight line. Nonlinear regressions produce curved lines.(\*\*)

Simple linear regression for the amount of rainfall per year.

Regression analysis is almost always performed by a computer program, as the equations are extremely time-consuming to perform by hand.

\*\*As this is an introductory article, I kept it simple. But there’s actually an important technical difference between linear and nonlinear that will become more important if you continue studying regression. For details, see the article on nonlinear regression.

### How to Find a Linear Regression Equation: Overview

**Regression analysis** is used to find equations that fit data. Once we have the regression equation, we can use the model to make predictions. One type of regression analysis is linear regression. When a **correlation coefficient** shows that data is likely to be able to predict future outcomes and a scatter plot of the data appears to form a straight line, you can use simple linear regression to find a predictive function. If you recall from elementary algebra, the equation for a line is **y = mx + b**. This article shows you how to take data, calculate linear regression, and find the equation **y’ = a + bx**. **Note**: If you’re taking AP statistics, you may see the equation written as b_{0} + b_{1}x, which is the same thing (you’re just using the variables b_{0} and b_{1} instead of a and b).

### How to Find a Linear Regression Equation: Steps

**Step 1:** *Make a chart of your data, filling in the columns in the same way as you would fill in the chart if you were finding the Pearson’s Correlation Coefficient.*

Subject | Age x | Glucose Level y | xy | x^{2} | y^{2} |
---|---|---|---|---|---|
1 | 43 | 99 | 4257 | 1849 | 9801 |
2 | 21 | 65 | 1365 | 441 | 4225 |
3 | 25 | 79 | 1975 | 625 | 6241 |
4 | 42 | 75 | 3150 | 1764 | 5625 |
5 | 57 | 87 | 4959 | 3249 | 7569 |
6 | 59 | 81 | 4779 | 3481 | 6561 |
Σ | 247 | 486 | 20485 | 11409 | 40022 |

From the above table, Σx = 247, Σy = 486, Σxy = 20485, Σx^{2} = 11409, Σy^{2} = 40022. n is the sample size (6, in our case).

**Step 2:** Use the following equations to find a and b.
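The equations for this step are not shown. Assuming they are the standard least-squares sum formulas, b = (nΣxy – ΣxΣy) / (nΣx² – (Σx)²) and a = (Σy – bΣx) / n, the calculation with the sums from the table can be sketched as:

```python
# Column sums taken from the age / glucose table above.
n = 6
sum_x = 247
sum_y = 486
sum_xy = 20485
sum_x2 = 11409

# Slope: b = (n*Σxy - Σx*Σy) / (n*Σx² - (Σx)²)
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

# Intercept: a = (Σy - b*Σx) / n
a = (sum_y - b * sum_x) / n

print(f"y' = {a:.4f} + {b:.6f}x")
```

With these sums the slope works out to 2868 / 7445, about 0.385, and the intercept to about 65.14.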

### Linear Regression Equation Microsoft Excel: Steps

Step 1: **Install the Data Analysis Toolpak**, if it isn’t already installed.

Step 2: **Type your data into two columns in Excel.** For example, type your “x” data into column A and your “y” data into column B. Do not leave any blank cells between your entries.

Step 3: **Click “Data Analysis”** on the Data tab of the Excel ribbon.

Step 4: **Click “Regression”** in the pop-up window and then click “OK.”

Step 5: **Select your input Y range.** You can do this two ways: either select the data in the worksheet or type the location of your data into the “Input Y Range” box. For example, if your Y data is in B2 through B10, then type “B2:B10” into the “Input Y Range” box.

Step 6: **Select your input X range** by selecting the data in the worksheet or typing the location of your data into the “Input X Range” box.

Step 7: **Select the location where you want your output range** to go by selecting a blank area in the worksheet or typing the location of where you want your data to go in the “Output Range” box.

Step 8: **Click “OK”.** Excel will calculate the linear regression and populate your worksheet with the results.

Tip: The linear regression equation information is given in the last output set (the “Coefficients” column). The first entry in the “Intercept” row is “a” (the y-intercept) and the first entry in the “X Variable” row is “b” (the slope).

### How to Find a Linear Regression Slope: Overview

Remember from algebra that the slope is the “m” in the formula **y = mx + b**.

In the linear regression formula, the slope is the b in the equation **y’ = a + bx**.

They are basically the same thing. So if you’re asked to find the linear regression slope, all you need to do is find **b** in the same way that you would find **m**.

Calculating linear regression by hand is tricky, to say the least. There’s a *lot* of summation (that’s the Σ symbol, which means to add up). The basic steps are given in the sections above; finding the equation will also give you the slope. If you don’t want to find the slope by hand (or if you want to check your work), you can also use Excel.

### How to Find the Regression Coefficient

A regression coefficient is the same thing as the **slope of the line of the regression equation**. The equation for the regression coefficient that you’ll find on the AP Statistics test is: B_{1} = b_{1} = Σ [ (x_{i} – x̄)(y_{i} – ȳ) ] / Σ [ (x_{i} – x̄)^{2} ]. In this equation, ȳ is the mean of y and x̄ is the mean of x.

You could find the regression coefficient by hand (as outlined in the steps earlier in this article).

However, you won’t have to calculate the regression coefficient by hand in the AP test; you’ll use your TI-83 calculator. Why? Calculating linear regression by hand is very time consuming (allow yourself about 30 minutes to do the calculations and check them), and because of the *huge* number of calculations involved, you’re very likely to make mathematical errors. When you find a linear regression equation on the TI-83, you get the regression coefficient as part of the answer.

**Sample problem**: Find the regression coefficient for the following set of data:

x: 1, 2, 3, 4, 5.

y: 3, 9, 27, 64, 102.

**Step 1:** Press STAT, then press ENTER to enter LISTS. You may need to clear data if you already have numbers in L1 or L2. To clear the data: move the cursor onto L1, press CLEAR and then ENTER. Repeat for L2 if you need to.

**Step 2:** *Enter your x-data into a list.* Press the ENTER key after each entry.

1 ENTER

2 ENTER

3 ENTER

4 ENTER

5 ENTER

**Step 3:** Scroll across to the next column, L2, using the arrow keys at the top right of the keypad.

**Step 4:** Enter the y-data:

3 ENTER

9 ENTER

27 ENTER

64 ENTER

102 ENTER

**Step 5:** Press the STAT button, then scroll to highlight “CALC.” Press ENTER.

**Step 6:** Press 4 to choose “LinReg(ax+b)”. Press ENTER. The TI-83 will return the variables needed for the linear regression equation. The value you’re looking for, the regression coefficient (the slope), is reported as a in the ax+b form, and is **25.3** for this set of data.

*That’s it!*
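As a cross-check on the calculator result, the deviation-from-the-mean formula for the regression coefficient given earlier in this section can be sketched in Python with the same sample data. The variable names are illustrative only.

```python
xs = [1, 2, 3, 4, 5]
ys = [3, 9, 27, 64, 102]

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

# Numerator: sum of products of deviations from the means.
numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
# Denominator: sum of squared deviations of x from its mean.
denominator = sum((x - mean_x) ** 2 for x in xs)

slope = numerator / denominator
intercept = mean_y - slope * mean_x

print(f"slope = {slope:.1f}, intercept = {intercept:.1f}")
```

The numerator is 253 and the denominator is 10, so the slope agrees with the 25.3 found on the calculator; the matching intercept is –34.9.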

### Linear Regression Test Value: Steps

**Sample question**: Given a set of data with sample size 8 and r = 0.454, find the linear regression test value.

**Note**: r is the correlation coefficient.

**Step 1:** *Find r, the correlation coefficient,* unless it has already been given to you in the question. In this case, r is given (r = .454). Not sure how to find r? See: Correlation Coefficient for steps on how to find r.

**Step 2:** *Use the following formula to compute the test value (n is the sample size):* T = r√((n – 2)/(1 – r^{2}))

### How to solve the formula

1. Replace the variables with your numbers: T = .454√((8 – 2)/(1 – [.454]^{2}))
2. Subtract 2 from n: 8 – 2 = 6
3. Square r: .454 × .454 = .206116
4. Subtract step (3) from 1: 1 – .206116 = .793884
5. Divide step (2) by step (4): 6 / .793884 = 7.557779
6. Take the square root of step (5): √7.557779 = 2.74914154
7. Multiply r by step (6): .454 × 2.74914154 = **1.24811026**

The Linear Regression Test value, **T = 1.24811026**

That’s it!
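The same arithmetic can be wrapped in a small function. This is a sketch of the formula T = r√((n – 2)/(1 – r²)) as worked through above; the function name `regression_test_value` and the input checks are illustrative additions.

```python
import math


def regression_test_value(r, n):
    """Test value T = r * sqrt((n - 2) / (1 - r^2)) for correlation r and sample size n."""
    if n <= 2:
        raise ValueError("n must be greater than 2")
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    return r * math.sqrt((n - 2) / (1 - r * r))


t_value = regression_test_value(0.454, 8)
print(round(t_value, 4))  # 1.2481
```

The checks reflect where the formula breaks down: with n = 2 the numerator is zero, and with r = ±1 the denominator is zero.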

### Leverage in Linear Regression: How it Affects Graphs

In linear regression, the influential point (outlier) will try to pull the linear regression line toward itself. The graph below shows what happens to a linear regression line when outlier A is included:

Outliers with **extreme X values** (values that aren’t within the range of the other data points) have more leverage in linear regression than points with less extreme x values. In other words, **extreme x-value outliers will move the line more** than less extreme values.

The following graph shows a data point outside of the range of the other values. The values range from 0 to about 70,000. This one point has an x-value of about 80,000 which is outside the range. It affects the regression line a lot more than the point in the first image above, which was inside the range of the other values.

In general, outliers that have x-values close to the mean of x will have less leverage than outliers towards the edges of the range. Outliers with values of x outside of the range will have more leverage. Values that are extreme on the y-axis (compared to the other values) will have more influence than values closer to the other y-values.
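One way to put a number on this is the leverage (hat value) of each point. For simple linear regression it is commonly defined as h = 1/n + (x – x̄)² / Σ(x – x̄)², which is the definition assumed in this sketch. The x-values below are made up for illustration, and the function name `leverages` is hypothetical.

```python
def leverages(xs):
    """Leverage of each point in a simple linear regression on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return [1 / n + (x - mean_x) ** 2 / sxx for x in xs]


# Hypothetical x-values: five clustered points and one far to the right.
xs = [10, 20, 30, 40, 50, 120]
hat_values = leverages(xs)

for x, h in zip(xs, hat_values):
    print(f"x = {x:>3}  leverage = {h:.3f}")
```

Under this definition leverage depends only on the x-values: the point at x = 120 gets by far the largest value, while the points nearest the mean of x get the smallest.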

### Least-Squares Regression

The most common method for fitting a regression line is the method of least-squares. This method calculates the best-fitting line for the observed data by minimizing the sum of the squares of the vertical deviations from each data point to the line (if a point lies on the fitted line exactly, then its vertical deviation is 0). Because the deviations are first squared, then summed, there are no cancellations between positive and negative values.
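To make the minimization concrete, the sketch below fits a line to the age and glucose data from earlier in this article and compares its sum of squared vertical deviations with that of a few nearby lines. The helper names `sum_squared_errors` and `least_squares` and the size of the nudges are illustrative choices.

```python
ages = [43, 21, 25, 42, 57, 59]
glucose = [99, 65, 79, 75, 87, 81]


def sum_squared_errors(a, b, xs, ys):
    """Sum of squared vertical deviations from the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


def least_squares(xs, ys):
    """Return (a, b), the intercept and slope of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return mean_y - b * mean_x, b


a, b = least_squares(ages, glucose)
best = sum_squared_errors(a, b, ages, glucose)
print(f"fitted line: a = {a:.4f}, b = {b:.6f}, SSE = {best:.2f}")

# Nudging the intercept or slope in either direction increases the SSE.
nearby_lines = [(a + 1, b), (a - 1, b), (a, b + 0.05), (a, b - 0.05)]
for a2, b2 in nearby_lines:
    sse = sum_squared_errors(a2, b2, ages, glucose)
    print(f"a = {a2:.4f}, b = {b2:.6f}, SSE = {sse:.2f}")
```

Every nudged line gives a larger sum of squares than the fitted one, which is what "best-fitting" means under the least-squares criterion.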

**Example**

The dataset “Televisions, Physicians, and Life Expectancy” contains, among other variables, the number of people per television set and the number of people per physician for 40 countries. Since both variables probably reflect the level of wealth in each country, it is reasonable to assume that there is some positive association between them. After removing 8 countries with missing values from the dataset, the remaining 32 countries have a correlation coefficient of 0.852 for number of people per television set and number of people per physician. The *r²* value is 0.726 (the square of the correlation coefficient), indicating that 72.6% of the variation in one variable may be explained by the other. *(Note: see correlation for more detail.)* Suppose we choose to consider number of people per television set as the explanatory variable, and number of people per physician as the dependent variable. Using the MINITAB “REGRESS” command gives the following results:

The regression equation is People.Phys. = 1019 + 56.2 People.Tel.

To view the fit of the model to the observed data, one may plot the computed regression line over the actual data points to evaluate the results. For this example, the plot appears to the right, with number of individuals per television set (the explanatory variable) on the x-axis and number of individuals per physician (the dependent variable) on the y-axis. While most of the data points are clustered towards the lower left corner of the plot (indicating relatively few individuals per television set and per physician), there are a few points which lie far away from the main cluster of the data. These points are known as **outliers**, and depending on their location may have a major impact on the regression line (see below).

### Outliers and Influential Observations

After a regression line has been computed for a group of data, a point which lies far from the line (and thus has a large residual value) is known as an **outlier**. Such points may represent erroneous data, or may indicate a poorly fitting regression line. If a point lies far from the other data in the horizontal direction, it is known as an *influential observation*. The reason for this distinction is that these points may have a significant impact on the slope of the regression line. Notice, in the above example, the effect of removing the observation in the upper right corner of the plot:

With this influential observation removed, the regression equation is now

People.Phys = 1650 + 21.3 People.Tel.

The correlation between the two variables has dropped to 0.427, which reduces the *r²* value to 0.182. With this influential observation removed, less than 20% of the variation in number of people per physician may be explained by the number of people per television. Influential observations are also visible in the new model, and their impact should also be investigated.
