How to do regression analysis. Regression analysis is a statistical method for studying the dependence of a random variable on one or more other variables. Solution using an Excel spreadsheet.

Regression and correlation analysis are statistical research methods. They are the most common ways to show the dependence of a parameter on one or more independent variables.

Below, using specific practical examples, we consider these two analyses, which are very popular among economists. We also give an example of obtaining results when they are combined.

Regression Analysis in Excel

Regression analysis shows the influence of some values (independent variables) on a dependent variable. For example, how the number of the economically active population depends on the number of enterprises, the level of wages and other parameters. Or: how foreign investments, energy prices, etc. affect the level of GDP.

The result of the analysis allows you to set priorities and, based on the main factors, to predict and plan the development of priority areas and make management decisions.

Regression happens:

  • linear (y = a + bx);
  • parabolic (y = a + bx + cx²);
  • exponential (y = a·e^(bx));
  • power (y = a·x^b);
  • hyperbolic (y = b/x + a);
  • logarithmic (y = b·ln(x) + a);
  • exponential (y = a·b^x).

Consider the example of building a regression model in Excel and interpreting the results. Let's take linear type regression.

Task. At 6 enterprises, the average monthly salary and the number of employees who quit were analyzed. It is necessary to determine the dependence of the number of employees who quit on the average salary.

The linear regression model has the following form:

Y = a₀ + a₁x₁ + … + aₖxₖ.

Here the aᵢ are the regression coefficients, the xᵢ are the influencing variables, and k is the number of factors.

In our example, Y is the number of employees who quit. The influencing factor is the salary (x).

Excel has built-in functions that can be used to calculate the parameters of a linear regression model. But the Analysis ToolPak add-in will do it faster.

Activate this powerful analytical tool: go to File → Options → Add-ins, select Excel Add-ins in the Manage box, click Go, check Analysis ToolPak and click OK.

Once activated, the add-on will be available under the Data tab.

Now we will deal directly with the regression analysis.



First of all, we pay attention to the R-square and coefficients.

R-square is the coefficient of determination. In our example, it is 0.755, or 75.5%. This means that the calculated parameters of the model explain 75.5% of the relationship between the studied parameters. The higher the coefficient of determination, the better the model: above 0.8 is good; below 0.5 is poor (such an analysis can hardly be considered reasonable). In our example, it is "not bad".

The coefficient 64.1428 shows what Y will be if all the variables in the model under consideration equal 0. That is, the value of the analyzed parameter is also affected by other factors not described in the model.

The coefficient -0.16285 shows the weight of variable X on Y. That is, within this model, the average monthly salary affects the number of employees who quit with a weight of -0.16285 (a small degree of influence). The minus sign indicates a negative influence: the higher the salary, the fewer employees quit. Which is fair.
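The same figures can be reproduced outside Excel. Below is a minimal Python sketch; the salary/quits numbers for the 6 enterprises are hypothetical stand-ins (the original worksheet is not shown), and the sketch fits the same one-factor linear model and prints the intercept, the slope and R-square:

import numpy as np
from scipy import stats

# Hypothetical data: average monthly salary (x) and number of quits (y) at 6 enterprises
salary = np.array([300, 350, 400, 450, 500, 550], dtype=float)
quits  = np.array([25, 20, 16, 12, 14, 10], dtype=float)

res = stats.linregress(salary, quits)
print(f"intercept a0 = {res.intercept:.4f}")   # analog of the 64.1428 coefficient
print(f"slope a1     = {res.slope:.5f}")       # analog of the -0.16285 coefficient
print(f"R-square     = {res.rvalue**2:.3f}")   # coefficient of determination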



Correlation analysis in Excel

Correlation analysis helps to establish whether there is a relationship between indicators in one or two samples. For example, between the operating time of the machine and the cost of repairs, the price of equipment and the duration of operation, the height and weight of children, etc.

If there is a relationship, the next question is whether an increase in one parameter leads to an increase (positive correlation) or a decrease (negative correlation) in the other. Correlation analysis helps the analyst determine whether the value of one indicator can predict the possible value of another.

The correlation coefficient is denoted r and varies from +1 to -1. The classification of correlation strength differs from one field to another. With a coefficient value of 0, there is no linear dependence between the samples.

Consider how to use Excel to find the correlation coefficient.

The CORREL function is used to find the paired coefficients.

Task: Determine if there is a relationship between the operating time of a lathe and the cost of its maintenance.

Put the cursor in any cell and press the fx button.

  1. In the "Statistical" category, select the CORREL function.
  2. Argument "Array 1" is the first range of values, the operating time of the machine: A2:A14.
  3. Argument "Array 2" is the second range of values, the cost of repairs: B2:B14. Click OK.

To determine the type of connection, look at the absolute value of the coefficient (each field of activity has its own scale).

For correlation analysis of several parameters (more than 2), it is more convenient to use "Data Analysis" (the Analysis ToolPak add-in). Select Correlation in the list and specify the data range. That's all.

The resulting coefficients will be displayed in the correlation matrix. Like this one:

Correlation-regression analysis

In practice, these two techniques are often used together.

Example:


The regression analysis output for these data is now visible.

Modern political science proceeds from the premise that all phenomena and processes in society are interrelated. It is impossible to understand events and processes, or to predict and manage the phenomena of political life, without studying the connections and dependencies that exist in the political sphere of society. One of the most common tasks of political research is to study the relationship between observable variables. A whole class of statistical methods of analysis, combined under the common name "regression analysis" (or, as it is also called, "correlation-regression analysis"), helps to solve this problem. However, while correlation analysis makes it possible to assess the strength of the relationship between two variables, regression analysis makes it possible to determine the type of this relationship and to predict the value of one variable from the value of another.

First, let's remember what a correlation is. A correlational relationship is the most important special case of statistical relationship; it consists in the fact that different values of one variable correspond to different average values of another. With a change in the value of the attribute x, the average value of the attribute y changes in a regular way, while in each individual case the attribute y can (with different probabilities) take on many different values.

The appearance of the term "correlation" in statistics (political science draws on the achievements of statistics for solving its problems, which makes statistics a discipline related to political science) is associated with the name of the English biologist and statistician Francis Galton, who in the 19th century proposed the theoretical basis of correlation and regression analysis. The term "correlation" was known in science before that. In particular, in paleontology back in the 18th century it was applied by the French scientist Georges Cuvier. He introduced the so-called correlation law, with the help of which it was possible to reconstruct the appearance of animals from the remains found during excavations.

There is a well-known story associated with the name of this scientist and his law of correlation. On a university holiday, students who decided to play a trick on the famous professor pulled a goat skin with horns and hooves over one of them. He climbed into the window of Cuvier's bedroom and shouted: "I'll eat you!" The professor woke up, looked at the silhouette and replied: "If you have horns and hooves, then you are a herbivore and cannot eat me. And for ignorance of the law of correlation you will get a deuce." Then he turned over and fell asleep. A joke is a joke, but in this example we see a special case of using multiple correlation-regression analysis. Here the professor, based on knowledge of the values of two observed traits (the presence of horns and hooves), derived, using the law of correlation, the average value of a third trait (the class to which this animal belongs: herbivore). In this case, we are not talking about a specific value of this variable (i.e., on a nominal scale this animal could take various values: it could be a goat, a ram, or a bull...).

Now let's move on to the term "regression". Strictly speaking, it is not connected with the meaning of the statistical problems that are solved with the help of this method. An explanation of the term can only be given from the history of the development of methods for studying relationships between features. One of the first examples of studies of this kind was the work of the statisticians F. Galton and K. Pearson, who tried to find a pattern between the heights of fathers and their children according to two observable traits (where X is the father's height and Y is the children's height). In their study, they confirmed the initial hypothesis that, on average, tall fathers have tall children, and the same holds for short fathers and their children. However, if the scientists had stopped there, their works would never have been mentioned in statistics textbooks. Within the already confirmed hypothesis, the researchers found another pattern. They showed that very tall fathers have children who are tall on average but do not differ much in height from children whose fathers, although above average, do not deviate much from the average height. The same is true for fathers of very small stature (deviating far from the average of the short group): their children, on average, did not differ in height from peers whose fathers were simply short. They called the function describing this regularity a regression function. After this study, all equations describing similar functions and constructed in a similar way began to be called regression equations.

Regression analysis is one of the methods of multivariate statistical data analysis, combining a set of statistical techniques designed to study or model relationships between one dependent and several (or one) independent variables. The dependent variable, according to the tradition accepted in statistics, is called the response and is denoted y. The independent variables are called predictors and are denoted x. During the course of the analysis, some variables will turn out to be weakly related to the response and will eventually be excluded from the analysis. The remaining variables associated with the dependent one may also be called factors.

Regression analysis makes it possible to predict the values of one variable depending on another variable (for example, the propensity for unconventional political behavior depending on the level of education) or on several variables. It is calculated on a PC. To compile a regression equation that allows measuring the degree of dependence of the controlled feature on the factor features, professional mathematician-programmers are usually involved. Regression analysis can provide an invaluable service in building predictive models of the development of a political situation, in assessing the causes of social tension, and in conducting theoretical experiments. Regression analysis is actively used to study the impact on citizens' electoral behavior of a number of socio-demographic parameters: gender, age, profession, place of residence, nationality, level and nature of income.

In regression analysis, the concepts of independent and dependent variables are used. An independent variable is a variable that explains or causes a change in another variable. A dependent variable is a variable whose value is explained by the influence of the first variable. For example, in the presidential elections in 2004, the determining factors, i.e. independent variables, were indicators such as the stabilization of the financial situation of the country's population, the level of popularity of the candidates, and the incumbency factor. In this case, the percentage of votes cast for the candidates can be considered the dependent variable. Similarly, in the pair of variables "age of the voter" and "level of electoral activity", the first is independent and the second is dependent.

Regression analysis allows you to solve the following problems:

  • 1) establish the very fact of the presence or absence of a statistically significant relationship between y and x;
  • 2) build the best (in the statistical sense) estimates of the regression function;
  • 3) for given values of x, build a prediction for the unknown y;
  • 4) evaluate the specific weight of the influence of each factor x on y and, accordingly, exclude insignificant features from the model;
  • 5) by identifying causal relationships between the variables, partially control the values of y by adjusting the values of the explanatory variables x.

Regression analysis is associated with the need to select mutually independent variables that affect the value of the indicator under study, determine the form of the regression equation, and estimate the parameters using statistical methods for processing primary sociological data. This type of analysis is based on the idea of the form, direction and closeness (density) of the relationship. Paired and multiple regression are distinguished, depending on the number of features studied. In practice, regression analysis is usually performed together with correlation analysis. The regression equation describes a numerical relationship between quantities, expressed as a tendency of one variable to increase or decrease while another increases or decreases. Linear and non-linear regression are also distinguished. When describing political processes, both variants of regression occur equally often.

A scatterplot of the distribution of the interdependence of interest in political articles (Y) and respondents' education (X) is an example of linear regression (Fig. 30).

Fig. 30.

A scatterplot of the distribution of the level of electoral activity (Y) and the age of the respondent (X) (conditional example) is an example of non-linear regression (Fig. 31).

Fig. 31.

To describe the relationship between two features (X and Y) in a paired regression model, the linear equation

y = ax + b + e

is used, where e is the random error of the equation under variation of the features, i.e. the deviation of the equation from "linearity".

To estimate the coefficients a and b, the least squares method is used; it assumes that the sum of the squared deviations of each point on the scatter plot from the regression line should be minimal. The coefficients a and b can be calculated from the system of normal equations:

a·Σx² + b·Σx = Σxy,
a·Σx + b·n = Σy.

The least squares method gives estimates of the coefficients a and b for which the line passes through the point with coordinates (x̄, ȳ), i.e. the relation ȳ = a·x̄ + b holds. The graphical representation of the regression equation is called the theoretical regression line. With a linear dependence, the regression coefficient is the tangent of the angle of inclination of the theoretical regression line to the x-axis. The sign of the coefficient shows the direction of the relationship: if it is greater than zero, the relationship is direct; if less, it is inverse.
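As an illustration, here is a short Python sketch (with invented x, y data) that solves the normal equations above and checks that the fitted line passes through the point of means (x̄, ȳ):

import numpy as np

# Invented sample data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

n = len(x)
# Solve the normal equations for the model y = a*x + b
A = np.array([[np.sum(x**2), np.sum(x)],
              [np.sum(x),    n       ]])
rhs = np.array([np.sum(x * y), np.sum(y)])
a, b = np.linalg.solve(A, rhs)

print(f"a = {a:.3f}, b = {b:.3f}")
# The least squares line passes through the point of means: y_bar == a*x_bar + b
assert np.isclose(y.mean(), a * x.mean() + b)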

The following example from the study "Political Petersburg-2006" (Table 56) shows a linear relationship between citizens' perceptions of the degree of satisfaction with their lives in the present and their expectations of changes in the quality of life in the future. The relationship is direct and linear (the standardized regression coefficient is 0.233, the significance level is 0.000). In this case, the regression coefficient is not high, but it exceeds the lower limit of the statistically significant indicator (the lower limit of the square of the statistically significant Pearson coefficient).

Table 56

The impact of the quality of life of citizens in the present on expectations

(St. Petersburg, 2006)

* Dependent variable: "How do you think your life will change in the next 2-3 years?"

In political life, the value of the variable under study most often depends on several features simultaneously. For example, the level and nature of political activity are simultaneously influenced by the political regime of the state, political traditions, the peculiarities of the political behavior of people in a given area and the respondent's social microgroup, age, education, income level, political orientation, etc. In this case, you need to use the multiple regression equation, which has the following form:

y = a + b₁x₁ + b₂x₂ + … + bₘxₘ + ε,

where the coefficients bᵢ are partial regression coefficients. A partial regression coefficient shows the contribution of each independent variable to determining the values of the dependent (outcome) variable. If a partial regression coefficient is close to 0, we can conclude that there is no direct relationship between that independent variable and the dependent one.

The calculation of such a model can be performed on a PC using matrix algebra. Multiple regression makes it possible to reflect the multifactorial nature of social ties and to clarify the degree of influence of each factor individually and of all of them together on the resulting feature.
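For illustration, here is a short Python sketch of that matrix-algebra calculation, the standard OLS estimate b = (XᵀX)⁻¹Xᵀy, on made-up data (the factor names in the comments are arbitrary):

import numpy as np

# Made-up data: two factors (say, age and income) plus an intercept column of ones
X = np.array([[1, 23, 1.2],
              [1, 35, 2.4],
              [1, 41, 2.0],
              [1, 52, 3.1],
              [1, 60, 2.8]], dtype=float)
y = np.array([0.3, 0.5, 0.6, 0.8, 0.9])

# OLS estimate beta = (X'X)^(-1) X'y; lstsq is the numerically stable route
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("intercept and partial regression coefficients:", beta)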

The coefficient denoted b is called the linear regression coefficient; it shows the strength of the relationship between the variation of the factor feature X and the variation of the effective feature Y. This coefficient measures the strength of the relationship in the absolute units of measurement of the features. However, the closeness of the correlation of features can also be expressed in units of the standard deviation of the resulting feature (such a coefficient is called the correlation coefficient). Unlike the regression coefficient b, the correlation coefficient does not depend on the accepted units of measurement of the features and is therefore comparable across any features. Usually the connection is considered strong if r > 0.7, of medium tightness if 0.5 < r < 0.7, and weak if r < 0.5.

As is known, the closest connection is a functional one, when each individual value of x can be uniquely assigned a value of Y. Thus, the closer the correlation coefficient is to 1, the closer the relationship is to a functional one. The significance level for regression analysis should not exceed 0.001.

The correlation coefficient was long considered the main indicator of the closeness of the relationship between features. Later, however, the coefficient of determination became such an indicator. The meaning of this coefficient is as follows: it reflects the share of the total variance of the resulting feature y that is explained by the variance of the feature x. It is found by simply squaring the correlation coefficient and, for a linear relationship, reflects the share, from 0 (0%) to 1 (100%), of the values of the feature Y determined by the values of the feature x. It is written as R²; in the output tables of regression analysis in the SPSS package it appears without the square.

Let us denote the main problems of constructing a multiple regression equation.

  • 1. Choice of factors included in the regression equation. At this stage, the researcher first compiles a general list of the main causes that, according to theory, determine the phenomenon under study, and then selects the features for the regression equation. The main selection rule: the factors included in the analysis should correlate with each other as little as possible; only then can a quantitative measure of influence be attributed to a particular factor feature.
  • 2. Choice of the form of the multiple regression equation (in practice, the linear or linear-logarithmic form is used most often). To use multiple regression, the researcher must first build a hypothetical model of the influence of several independent variables on the resulting one. For the results to be reliable, the model must exactly match the real process: the relationship between the variables must be linear, not a single significant independent variable may be ignored, and, likewise, not a single variable unrelated to the process under study may be included in the analysis. In addition, all measurements of the variables must be extremely accurate.

From the above description follows a number of conditions for the application of this method, without which it is impossible to proceed to the procedure of multiple regression analysis (MRA). Only compliance with all of the following points allows you to correctly carry out regression analysis.

During their studies, students very often encounter a variety of equations. One of them, the regression equation, is considered in this article. This type of equation is used specifically to describe the relationship between mathematical parameters. This type of equality is used in statistics and econometrics.

Definition of regression

In mathematics, regression is understood as a quantity that describes the dependence of the average value of one data set on the values of another quantity. The regression equation shows, as a function of a particular feature, the average value of another feature. The regression function has the form of a simple equation y = f(x), in which y is the dependent variable and x is the independent variable (the factor feature).

What are the types of relationships between variables

In general, two opposite types of relationship are distinguished: correlation and regression.

The first is characterized by the equal standing of the variables: here it is not known for certain which variable depends on which.

If there is no equality between the variables and the conditions say which variable is explanatory and which is dependent, then we can talk about the presence of a connection of the second type. In order to build a linear regression equation, it will be necessary to find out what type of relationship is observed.

Types of regressions

To date, there are 7 different types of regression: hyperbolic, linear, multiple, nonlinear, pairwise, inverse, logarithmically linear.

Hyperbolic, linear and logarithmic

The linear regression equation is used in statistics to clearly explain the parameters of the equation. It looks like y = c + m·x + E. The hyperbolic equation has the form of a regular hyperbola: y = c + m/x + E. The logarithmically linear equation expresses the relationship using a logarithmic function: ln y = ln c + m·ln x + ln E.

Multiple and non-linear

Two more complex types of regression are multiple and non-linear. The multiple regression equation is expressed by the function y = f(x₁, x₂, …, xₘ) + E. In this situation, y is the dependent variable and the x's are explanatory variables. The variable E is stochastic and includes the influence of other factors in the equation. The non-linear regression equation is somewhat ambivalent: it is non-linear with respect to the indicators taken into account, but linear with respect to the estimated parameters.

Inverse and Pairwise Regressions

An inverse regression is a kind of function that needs to be converted to a linear form. In most traditional application programs it has the form y = 1/(c + m·x + E). The paired regression equation shows the relationship between the data as a function y = f(x) + E. Just as in the other equations, y depends on x, and E is a stochastic parameter.

The concept of correlation

This is an indicator that demonstrates the existence of a relationship between two phenomena or processes. The strength of the relationship is expressed by the correlation coefficient, whose value lies in the interval [-1; +1]. A negative value indicates an inverse relationship, a positive one a direct relationship. A coefficient equal to 0 means there is no relationship. The closer the value is to 1, the stronger the relationship between the parameters; the closer to 0, the weaker.

Methods

Parametric methods of correlation analysis can estimate the tightness of the relationship. They are based on distribution estimates and are used to study parameters that obey the normal distribution law.

The parameters of the linear regression equation are necessary to identify the type of dependence, the function of the regression equation and evaluate the indicators of the chosen relationship formula. The correlation field is used as a method for identifying a relationship. To do this, all existing data must be represented graphically. In a rectangular two-dimensional coordinate system, all known data must be plotted. This is how the correlation field is formed. The value of the describing factor is marked along the abscissa, while the values ​​of the dependent factor are marked along the ordinate. If there is a functional relationship between the parameters, they line up in the form of a line.

If the correlation coefficient of such data is less than 30%, we can talk about the almost complete absence of a connection. If it is between 30% and 70%, then this indicates the presence of links of medium closeness. A 100% indicator is evidence of a functional connection.

A non-linear regression equation, just like a linear one, must be supplemented with a correlation index (R).

Correlation for Multiple Regression

The coefficient of determination is the square of the multiple correlation indicator. It describes the tightness of the relationship between the presented set of indicators and the feature under study, and can also describe the nature of the influence of the parameters on the result. The multiple regression equation is evaluated using this indicator.

To calculate the multiple correlation index, the coefficient of determination must be calculated first.

Least squares method

This method is a way of estimating regression factors. Its essence lies in minimizing the sum of the squared deviations of the actual values from the values given by the function.

A paired linear regression equation can be estimated by this method. This type of equation is used when a paired linear relationship is detected between the indicators.

Equation Parameters

Each parameter of the linear regression function has a specific meaning. The paired linear regression equation contains two parameters: c and m. The parameter m shows the average change in the final indicator of the function y when the variable x increases (decreases) by one conventional unit. If the variable x is zero, the function equals the parameter c. If the variable x cannot be zero, the factor c has no economic meaning; the only thing that matters then is the sign in front of it: a minus indicates a slower change in the result compared to the factor, a plus an accelerated change.

Each parameter of the regression equation can be expressed in terms of the others. For example, the factor c has the form c = ȳ − m·x̄.

Grouped data

There are task conditions in which all information is grouped according to the attribute x, while for each group the corresponding average values of the dependent indicator are given. In this case, the average values characterize how the indicator depends on x; thus, grouped information helps to find the regression equation and is used as a relationship analysis. However, this method has its drawbacks: averages are often subject to external fluctuations, which do not reflect the patterns of the relationship but merely mask its "noise". Averages show the patterns of a relationship much worse than a linear regression equation, but they can serve as a basis for finding the equation. By multiplying the size of a particular group by the corresponding average, you can get the sum of y within the group; next, add up all the resulting sums to find the overall total of y. It is a little more difficult to calculate the sum indicator xy. If the intervals are small, the indicator x can conditionally be taken as the same for all units within a group; multiply it by the sum of y to find the sum of the products of x and y. Then all these sums are added together and the total sum xy is obtained.

Multiple Regression: Assessing the Significance of the Relationship

As discussed earlier, multiple regression has a function of the form y = f(x₁, x₂, …, xₘ) + E. Most often, such an equation is used to solve problems of supply and demand for goods, of interest income on repurchased shares, and to study the causes and form of the production cost function. It is also actively used in a wide variety of macroeconomic studies and calculations; at the level of microeconomics, this equation is used somewhat less frequently.

The main task of multiple regression is to build a model for data containing a huge amount of information in order to determine what influence each of the factors, individually and in their totality, has on the indicator being modeled and on its coefficients. The regression equation can take on a variety of forms. In this case, two types of functions are usually used to assess the relationship: linear and nonlinear.

A linear function is depicted in the form of the following relationship: y = a₀ + a₁x₁ + a₂x₂ + … + aₘxₘ. Here a₁, a₂, …, aₘ are the "pure" regression coefficients. They characterize the average change in the parameter y with a change (decrease or increase) in each corresponding parameter x by one unit, provided the other indicators remain stable.

Nonlinear equations have, for example, the form of a power function y = a·x₁^b₁·x₂^b₂·…·xₘ^bₘ. In this case, the exponents b₁, b₂, …, bₘ are called elasticity coefficients; they show how the result will change (by what percentage) with an increase (decrease) of the corresponding indicator x by 1%, with the other factors held stable.
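A power model like this is usually fitted by linearization: take logarithms of both sides and run an ordinary linear regression on the logs. A hedged Python sketch follows (the data for the two factors are invented for illustration):

import numpy as np

# Invented positive-valued data for y = a * x1^b1 * x2^b2
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 2.5, 3.0, 3.5, 4.0])
y  = np.array([3.1, 5.9, 8.8, 11.7, 14.9])

# Linearize: ln y = ln a + b1*ln x1 + b2*ln x2
X = np.column_stack([np.ones_like(x1), np.log(x1), np.log(x2)])
coef, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
ln_a, b1, b2 = coef
print(f"a = {np.exp(ln_a):.3f}, elasticity b1 = {b1:.3f}, elasticity b2 = {b2:.3f}")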

What factors should be considered when building a multiple regression

In order to correctly construct a multiple regression, it is necessary to find out which factors should be paid special attention to.

It is necessary to have some understanding of the nature of the relationship between economic factors and the modeled. The factors to be included must meet the following criteria:

  • They must be measurable. In order to use a factor describing the quality of an object, that quality must in any case be given a quantitative form.
  • There should be no intercorrelation of the factors and no functional relationship between them; otherwise the system of normal equations becomes ill-conditioned, which entails unreliable and blurred estimates.
  • With a very high correlation between factors, it is impossible to isolate the influence of each factor on the final result, and the coefficients become uninterpretable.

Construction Methods

There are a huge number of methods and ways to explain how you can choose the factors for the equation. However, all these methods are based on the selection of coefficients using the correlation index. Among them are:

  • Exclusion method.
  • Turn on method.
  • Stepwise regression analysis.

The first method involves sifting out coefficients from the aggregate set. The second involves introducing additional factors one by one. The third is the elimination of factors that were previously included in the equation. Each of these methods has a right to exist; they have their pros and cons, but they all solve the issue of screening out unnecessary indicators in their own way. As a rule, the results obtained by each individual method are quite close.
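As a rough sketch of the exclusion (backward-elimination) idea, here is one possible Python loop using statsmodels; the 0.05 cutoff and the randomly generated data are arbitrary choices for illustration, not a prescribed procedure:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))               # four candidate factors
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(size=100)

cols = list(range(X.shape[1]))
while cols:
    model = sm.OLS(y, sm.add_constant(X[:, cols])).fit()
    pvals = model.pvalues[1:]               # p-values of the factors, skip intercept
    worst = int(pvals.argmax())
    if pvals[worst] <= 0.05:                # every remaining factor is significant
        break
    del cols[worst]                         # exclude the least significant factor

print("retained factors:", cols)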

Methods of multivariate analysis

Such methods for determining factors are based on the consideration of individual combinations of interrelated features. These include discriminant analysis, pattern recognition, principal component analysis, and cluster analysis. In addition, there is also factor analysis, however, it appeared as a result of the development of the component method. All of them are applied in certain circumstances, under certain conditions and factors.

1. The term "regression" was first introduced by the founder of biometrics F. Galton (19th century), whose ideas were developed by his follower K. Pearson.

Regression analysis is a method of statistical data processing that allows you to measure the relationship between one or more causes (factor features) and a consequence (the effective feature).

A feature is the main distinguishing characteristic, a property of the phenomenon or process being studied.

An effective feature is the indicator under investigation.

A factor feature is an indicator that affects the value of the effective feature.

The purpose of regression analysis is to evaluate the functional dependence of the average value of the effective feature (y) on the factor features (x₁, x₂, …, xₙ), expressed as the regression equation

y = f(x₁, x₂, …, xₙ). (6.1)

There are two types of regression: paired and multiple.

Paired (simple) regression is an equation of the form:

y = f(x). (6.2)

The resultant feature in pairwise regression is considered as a function of one argument, i.e. one factor.

Regression analysis includes the following steps:

· determining the type of function;

· determining the regression coefficients;

· calculating the theoretical values of the effective feature;

· checking the statistical significance of the regression coefficients;

· checking the statistical significance of the regression equation.

Multiple regression is an equation of the form:

y = f(x₁, x₂, …, xₙ). (6.3)

The resultant feature is considered as a function of several arguments, i.e. many factors.

2. In order to correctly determine the type of function, it is necessary to find the direction of the connection based on theoretical data.

According to the direction of the connection, the regression is divided into:

· direct regression, which arises when, as the independent quantity "x" increases or decreases, the values of the dependent quantity "y" also increase or decrease accordingly;

· inverse regression, which arises when, as the independent quantity "x" increases or decreases, the dependent quantity "y" decreases or increases accordingly.

To characterize the relationships, the following types of paired regression equations are used:

· y = a + bx – linear;

· y = e^(ax+b) – exponential;

· y = a + b/x – hyperbolic;

· y = a + b₁x + b₂x² – parabolic;

· y = a·b^x – exponential, etc.,

where a, b, b₁, b₂ are the coefficients (parameters) of the equation; y is the effective feature; x is the factor feature.

3. The construction of the regression equation reduces to estimating its coefficients (parameters); for this, the least squares method (OLS) is used.

The least squares method makes it possible to obtain estimates of the parameters for which the sum of the squared deviations of the actual values of the effective feature "y" from the theoretical values "y_x" is minimal, that is

Σ(y − y_x)² → min.

The parameters of the regression equation y = a + bx are estimated by the least squares method using the formulas

b = (Σxy/n − x̄·ȳ) / (Σx²/n − x̄²),  a = ȳ − b·x̄,

where a is the free coefficient and b is the regression coefficient, which shows by how much the resultant feature "y" changes when the factor feature "x" changes by one unit of measurement.

4. To assess the statistical significance of the regression coefficients, Student's t-test is used.

Scheme for checking the significance of regression coefficients:

1) H₀: a = 0, b = 0 – the regression coefficients do not differ significantly from zero.

H₁: a ≠ 0, b ≠ 0 – the regression coefficients differ significantly from zero.

2) P = 0.05 – significance level.

3) t_a = a/m_a, t_b = b/m_b,

where m_a, m_b are the random errors of the estimates:

m_b = √(Σ(y − y_x)²/(n − 2)) / √(Σ(x − x̄)²);  m_a = m_b·√(Σx²/n). (6.7)

4) t_table(P; f),

where f = n − k − 1 is the number of degrees of freedom (a table value), n is the number of observations, and k is the number of factor features "x".

5) If |t_calc| > t_table, then H₀ is rejected, i.e. the coefficient is significant.

If |t_calc| < t_table, then H₀ is accepted, i.e. the coefficient is insignificant.
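A small Python sketch of this check (invented data; the formulas are those above, with a two-sided test at P = 0.05):

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 2.9, 4.1, 4.6, 6.2, 6.8])

res = stats.linregress(x, y)            # gives b, a and their standard errors
t_b = res.slope / res.stderr            # t-statistic for the slope
t_a = res.intercept / res.intercept_stderr

f = len(x) - 1 - 1                      # f = n - k - 1, one factor (k = 1)
t_table = stats.t.ppf(1 - 0.05 / 2, f)  # two-sided critical value at P = 0.05

print(f"t_b = {t_b:.2f}, t_a = {t_a:.2f}, t_table = {t_table:.2f}")
print("b significant:", abs(t_b) > t_table)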

5. To check the correctness of the constructed regression equation, the Fisher criterion is used.

Scheme for checking the significance of the regression equation:

1) H₀: the regression equation is not significant.

H₁: the regression equation is significant.

2) P = 0.05 – significance level.

3) F_calc = (R²/(1 − R²))·((n − k − 1)/k), (6.8)

where n is the number of observations, k is the number of parameters attached to the variables "x", y is the actual value of the effective feature, y_x is the theoretical value of the effective feature, and R is the pair correlation coefficient.

4) F_table(P; f₁; f₂),

where f₁ = k and f₂ = n − k − 1 are the numbers of degrees of freedom (table values).

5) If F_calc > F_table, the regression equation is chosen correctly and can be applied in practice.

If F_calc < F_table, the regression equation is chosen incorrectly.
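The same check in Python, continuing the invented data from the previous sketch:

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 2.9, 4.1, 4.6, 6.2, 6.8])

r = np.corrcoef(x, y)[0, 1]
R2 = r**2
n, k = len(x), 1                          # one factor feature

F_calc = (R2 / (1 - R2)) * (n - k - 1) / k
F_table = stats.f.ppf(1 - 0.05, k, n - k - 1)

print(f"F_calc = {F_calc:.2f}, F_table = {F_table:.2f}")
print("equation significant:", F_calc > F_table)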

6. The main indicator reflecting the quality of the regression analysis is the coefficient of determination (R²).

The coefficient of determination shows what proportion of the variation of the dependent variable "y" is taken into account in the analysis and is caused by the influence of the factors included in the analysis.

The coefficient of determination (R²) takes values in the range [0; 1]. The regression equation is of good quality if R² ≥ 0.8.

The coefficient of determination is equal to the square of the correlation coefficient, i.e. R² = r².

Example 6.1. Based on the following data, construct and analyze the regression equation:

Solution.

1) Calculate the correlation coefficient: r_xy = 0.47. The relationship between the features is direct and moderate.

2) Build a paired linear regression equation.

2.1) Make a calculation table.

(Calculation table with columns x, y, xy, x², y_x and (y − y_x)²; the totals include Σ(y − y_x)² = 558.55.)

From the table totals, the least squares estimates are a ≈ 25.17 and b ≈ 0.087.

Paired linear regression equation: y_x = 25.17 + 0.087x.

3) Find the theoretical values "y_x" by substituting the actual values of "x" into the regression equation.

4) Plot the actual values "y" and the theoretical values "y_x" of the effective feature (Figure 6.1). The visible scatter of points around the line is explained by the weak relationship (r_xy = 0.47) and the small number of observations.

7) Calculate the coefficient of determination: R² = (0.47)² ≈ 0.22. The constructed equation is of poor quality.

Because the calculations in regression analysis are quite voluminous, it is recommended to use specialized programs ("Statistica 10", SPSS, etc.).
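If a Python environment is closer to hand than Statistica or SPSS, a comparable summary table can be obtained with statsmodels; the data below are placeholders for Example 6.1, whose original table is not reproduced here:

import numpy as np
import statsmodels.api as sm

# Placeholder data standing in for Example 6.1
x = np.array([102, 115, 120, 133, 141, 150, 168], dtype=float)
y = np.array([33, 35, 36, 37, 38, 38, 40], dtype=float)

model = sm.OLS(y, sm.add_constant(x)).fit()
print(model.summary())   # coefficients, t- and F-statistics, R², p-values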

Figure 6.2 shows a table with the results of the regression analysis carried out using the program "Statistica 10".

Figure 6.2. The results of the regression analysis carried out using the program "Statistica 10"


The main goal of regression analysis is to determine the analytical form of the relationship in which the change in the resultant feature is due to the influence of one or more factor features, while the set of all other factors that also affect the resultant feature is taken as constant and average values.
Tasks of regression analysis:
a) Establishing the form of the dependence. Regarding the nature and form of the relationship between phenomena, positive linear and non-linear and negative linear and non-linear regression are distinguished.
b) Determining the regression function in the form of a mathematical equation of one type or another and establishing the influence of the explanatory variables on the dependent variable.
c) Estimating unknown values of the dependent variable. Using the regression function, one can reproduce the values of the dependent variable within the interval of given values of the explanatory variables (i.e., solve the interpolation problem) or evaluate the course of the process outside this interval (i.e., solve the extrapolation problem). The result is an estimate of the value of the dependent variable.

Paired regression is an equation of the relationship between two variables y and x: y = f(x), where y is the dependent variable (resultant feature) and x is the independent, explanatory variable (factor feature).

There are linear and non-linear regressions.
Linear regression: y = a + bx + ε
Nonlinear regressions are divided into two classes: regressions that are non-linear with respect to the explanatory variables included in the analysis, but linear with respect to the estimated parameters, and regressions that are non-linear with respect to the estimated parameters.
Regressions that are non-linear in the explanatory variables, for example:

  • polynomials of different degrees: y = a + b₁x + b₂x² + b₃x³ + ε;
  • equilateral hyperbola: y = a + b/x + ε.
Regressions that are non-linear in the estimated parameters:

  • power: y = a·x^b·ε
  • exponential: y = a·b^x·ε
  • exponential: y = e^(a+bx)·ε
The construction of the regression equation reduces to estimating its parameters. To estimate the parameters of regressions that are linear in the parameters, the method of least squares (OLS) is used. OLS makes it possible to obtain estimates of the parameters for which the sum of the squared deviations of the actual values of the effective feature y from the theoretical values y_x is minimal, i.e.

Σ(y − y_x)² → min.

For linear equations and non-linear equations reducible to linear, the following system is solved for a and b:

n·a + b·Σx = Σy,
a·Σx + b·Σx² = Σxy.

You can use the ready-made formulas that follow from this system:

a = ȳ − b·x̄,  b = (Σxy/n − x̄·ȳ)/(Σx²/n − x̄²).
The closeness of the connection between the studied phenomena is estimated by the linear pair correlation coefficient r_xy for linear regression (−1 ≤ r_xy ≤ 1):

r_xy = b·(σ_x/σ_y),

and by the correlation index p_xy for non-linear regression (0 ≤ p_xy ≤ 1):

p_xy = √(1 − Σ(y − y_x)²/Σ(y − ȳ)²).
An assessment of the quality of the constructed model is given by the coefficient (index) of determination, as well as by the average approximation error.
The average approximation error is the average deviation of the calculated values from the actual ones:

A = (1/n)·Σ|(y − y_x)/y|·100%.

The permissible limit of values of A is no more than 8-10%.
The average coefficient of elasticity E shows by how many percent, on average, the result y will change from its average value when the factor x changes by 1% from its average value:

E = f′(x)·(x̄/ȳ); for the linear model, E = b·(x̄/ȳ).
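As a cross-check of these two formulas, here is a small Python sketch using the x and y data of Task No. 1 from the worked example later in this text (linear model fitted by OLS); A should come out near the 8% reported there:

import numpy as np
from scipy import stats

x = np.array([45.1, 59.0, 57.2, 61.8, 58.8, 47.2, 55.2])
y = np.array([68.8, 61.2, 59.9, 56.7, 55.0, 54.3, 49.3])  # data of Task No. 1 below

res = stats.linregress(x, y)
y_hat = res.intercept + res.slope * x

A = np.mean(np.abs((y - y_hat) / y)) * 100    # average approximation error, %
E = res.slope * x.mean() / y.mean()           # average elasticity for the linear model
print(f"A = {A:.1f}%  E = {E:.3f}")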

The task of analysis of variance is to decompose the variance of the dependent variable:

Σ(y − ȳ)² = Σ(y_x − ȳ)² + Σ(y − y_x)²,

where Σ(y − ȳ)² is the total sum of squared deviations;
Σ(y_x − ȳ)² is the sum of squared deviations due to the regression ("explained" or "factorial");
Σ(y − y_x)² is the residual sum of squared deviations.
The share of the variance explained by the regression in the total variance of the effective feature y is characterized by the coefficient (index) of determination R²:

R² = Σ(y_x − ȳ)²/Σ(y − ȳ)².

The coefficient of determination is the square of the correlation coefficient or index.

The F-test, an assessment of the quality of the regression equation, consists in testing the hypothesis H₀ that the regression equation and the indicator of closeness of connection are statistically insignificant. For this, the actual value F_fact is compared with the critical (tabular) value F_table of Fisher's F-criterion. F_fact is determined from the ratio of the factorial and residual variances, each calculated per degree of freedom:

F_fact = (R²/(1 − R²))·((n − m − 1)/m),

where n is the number of population units and m is the number of parameters attached to the variables x.
F_table is the maximum possible value of the criterion under the influence of random factors for the given degrees of freedom and significance level a. The significance level a is the probability of rejecting a correct hypothesis, provided that it is true. Usually a is taken equal to 0.05 or 0.01.
If F_table < F_fact, then H₀, the hypothesis about the random nature of the estimated characteristics, is rejected, and their statistical significance and reliability are recognized. If F_table > F_fact, then H₀ is not rejected, and the statistical insignificance and unreliability of the regression equation are recognized.
To assess the statistical significance of the regression and correlation coefficients, Student's t-test is used and confidence intervals for each of the indicators are calculated. The hypothesis H₀ about the random nature of the indicators is put forward, i.e. that they differ insignificantly from zero. The significance of the regression and correlation coefficients is assessed using Student's t-test by comparing their values with the magnitude of the random error:

t_a = a/m_a;  t_b = b/m_b;  t_r = r_xy/m_r.

The random errors of the linear regression parameters and of the correlation coefficient are determined by the formulas:

m_b = √(Σ(y − y_x)²/(n − 2)) / √(Σ(x − x̄)²);

m_a = m_b·√(Σx²/n);

m_r = √((1 − r²_xy)/(n − 2)).
Comparing the actual and critical (tabular) values of the t-statistics, t_table and t_fact, we accept or reject the hypothesis H₀.
The relationship between Fisher's F-test and Student's t-statistic is expressed by the equality

t_r² = t_b² = F.

If t_table < t_fact, then H₀ is rejected, i.e. a, b and r_xy do not differ from zero by chance but were formed under the influence of the systematically acting factor x. If t_table > t_fact, the hypothesis H₀ is not rejected, and the random nature of the formation of a, b or r_xy is recognized.
To calculate a confidence interval, we determine the marginal error Δ for each indicator:

Δ_a = t_table·m_a, Δ_b = t_table·m_b.

The formulas for calculating the confidence intervals are as follows:

γ_a = a ± Δ_a; γ_a min = a − Δ_a; γ_a max = a + Δ_a;
γ_b = b ± Δ_b; γ_b min = b − Δ_b; γ_b max = b + Δ_b.

If zero falls within the boundaries of the confidence interval, i.e. the lower limit is negative and the upper limit is positive, then the estimated parameter is taken to be zero, since it cannot simultaneously take on both positive and negative values.
The forecast value y_p is determined by substituting the corresponding (forecast) value x_p into the regression equation y_x = a + b·x. The average standard error of the forecast m_y_x is calculated:

m_y_x = σ_res·√(1 + 1/n + (x_p − x̄)²/Σ(x − x̄)²),

where σ_res = √(Σ(y − y_x)²/(n − m − 1)),

and a confidence interval of the forecast is built:

γ_y_x = y_p ± Δ_y_p; γ_y_x min = y_p − Δ_y_p; γ_y_x max = y_p + Δ_y_p,

where Δ_y_x = t_table·m_y_x.
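The forecast and its interval can be sketched in Python as follows (same invented data as in the earlier sketches; t_table from Student's distribution at a = 0.05):

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 2.9, 4.1, 4.6, 6.2, 6.8])

res = stats.linregress(x, y)
n, m = len(x), 1
resid = y - (res.intercept + res.slope * x)
sigma_res = np.sqrt(np.sum(resid**2) / (n - m - 1))

x_p = 7.0                                 # forecast point
y_p = res.intercept + res.slope * x_p
m_yx = sigma_res * np.sqrt(1 + 1/n + (x_p - x.mean())**2 / np.sum((x - x.mean())**2))
t_table = stats.t.ppf(1 - 0.05 / 2, n - m - 1)
delta = t_table * m_yx
print(f"forecast: {y_p:.2f}, interval: [{y_p - delta:.2f}, {y_p + delta:.2f}]")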

Solution example

Task number 1. For seven territories of the Ural region in 199X, the values of two features are known.
Table 1.

Required: 1. To characterize the dependence of y on x, calculate the parameters of the following functions:
a) linear;
b) power (first perform the linearization of the variables by taking logarithms of both sides);
c) exponential;
d) equilateral hyperbola (you also need to figure out how to pre-linearize this model).
2. Evaluate each model through the average approximation error A and Fisher's F-test.

Solution (Option #1)

To calculate the parameters a and b of the linear regression y = a + b·x (the calculation can be done with a calculator), we solve the system of normal equations with respect to a and b. Based on the initial data, we calculate Σy, Σx, Σyx, Σx², Σy²:
№      y       x       y·x        x²         y²         y_x     y−y_x   A_i
1      68.8    45.1    3102.88    2034.01    4733.44    61.3     7.5    10.9
2      61.2    59.0    3610.80    3481.00    3745.44    56.5     4.7     7.7
3      59.9    57.2    3426.28    3271.84    3588.01    57.1     2.8     4.7
4      56.7    61.8    3504.06    3819.24    3214.89    55.5     1.2     2.1
5      55.0    58.8    3234.00    3457.44    3025.00    56.5    -1.5     2.7
6      54.3    47.2    2562.96    2227.84    2948.49    60.5    -6.2    11.4
7      49.3    55.2    2721.36    3047.04    2430.49    57.8    -8.5    17.2
Total  405.2   384.3   22162.34   21338.41   23685.76   405.2    0.0    56.7
Mean   57.89   54.90   3166.05    3048.34    3383.68    XX       XX      8.1
σ      5.74    5.86    XX         XX         XX         XX       XX      XX
σ²     32.92   34.34   XX         XX         XX         XX       XX      XX


b = (Σxy/n − x̄·ȳ)/(Σx²/n − x̄²) = (3166.05 − 54.90·57.89)/(3048.34 − 54.90²) ≈ −0.35;

a = ȳ − b·x̄ = 57.89 + 0.35·54.9 ≈ 76.88.

Regression equation: y = 76.88 − 0.35x. With an increase in the average daily wage by 1 rub., the share of spending on the purchase of food products decreases by an average of 0.35 percentage points.
Calculate the linear coefficient of pair correlation:

r_xy = b·(σ_x/σ_y) = −0.35·(5.86/5.74) ≈ −0.357.

The connection is moderate and inverse.
Let us determine the coefficient of determination: r²_xy = (−0.357)² = 0.127.
The variation of the factor x explains 12.7% of the variation of the result. Substituting the actual values of x into the regression equation, we determine the theoretical (calculated) values y_x. Let us find the value of the average approximation error A:

A = (1/n)·Σ|(y − y_x)/y|·100% = 56.7/7 ≈ 8.1%.

On average, the calculated values deviate from the actual ones by 8.1%.
Let's calculate the F-criterion:

F_fact = (r²/(1 − r²))·(n − 2) = (0.127/0.873)·5 ≈ 0.7, which is below the tabular value F_table(0.05; 1; 5) = 6.61.

The obtained value indicates the need to accept the hypothesis H₀ about the random nature of the revealed dependence and the statistical insignificance of the parameters of the equation and of the indicator of closeness of connection.
1b. The construction of the power model y = a·x^b is preceded by the procedure of linearization of the variables. In this example, linearization is done by taking logarithms of both sides of the equation:

lg y = lg a + b·lg x,
Y = C + b·X,

where Y = lg(y), X = lg(x), C = lg(a).

For the calculations, we use the data in Table 1.3.
Table 1.3

№      Y        X        Y·X      Y²       X²        y_x    y−y_x   (y−y_x)²   A_i
1      1.8376   1.6542   3.0398   3.3768   2.7364    61.0    7.8     60.8      11.3
2      1.7868   1.7709   3.1642   3.1927   3.1361    56.3    4.9     24.0       8.0
3      1.7774   1.7574   3.1236   3.1592   3.0885    56.8    3.1      9.6       5.2
4      1.7536   1.7910   3.1407   3.0751   3.2077    55.5    1.2      1.4       2.1
5      1.7404   1.7694   3.0795   3.0290   3.1308    56.3   -1.3      1.7       2.4
6      1.7348   1.6739   2.9039   3.0095   2.8019    60.2   -5.9     34.8      10.9
7      1.6928   1.7419   2.9487   2.8656   3.0342    57.4   -8.1     65.6      16.4
Total  12.3234  12.1587  21.4003  21.7078  21.1355   403.5   1.7    197.9      56.3
Mean   1.7605   1.7370   3.0572   3.1011   3.0194    XX      XX      28.27      8.0
σ      0.0425   0.0484   XX       XX       XX        XX      XX      XX         XX
σ²     0.0018   0.0023   XX       XX       XX        XX      XX      XX         XX

Calculate C and b:

b = (mean(Y·X) − Ȳ·X̄)/σ²_X ≈ −0.298;
C = Ȳ − b·X̄ = 1.7605 + 0.298·1.7370 = 2.278.

We get the linear equation: Y = 2.278 − 0.298·X.
After potentiating it, we get:

y = 10^2.278 · x^(−0.298).

Substituting the actual values of x into this equation, we obtain the theoretical values of the result y_x. Based on them, we calculate the indicators of the tightness of the connection (the correlation index p_xy) and the average approximation error A (from the table, A ≈ 8.0%).

The characteristics of the power model indicate that it describes the relationship somewhat better than the linear function.

1c. The construction of the exponential curve equation y = a·b^x is preceded by the procedure of linearizing the variables by taking logarithms of both sides of the equation:

lg y = lg a + x·lg b,
Y = C + B·x.

For the calculations, we use the table data.

№      Y        x      Y·x        Y²       x²        y_x    y−y_x   (y−y_x)²   A_i
1      1.8376   45.1    82.8758   3.3768   2034.01   60.7    8.1     65.61     11.8
2      1.7868   59.0   105.4212   3.1927   3481.00   56.4    4.8     23.04      7.8
3      1.7774   57.2   101.6673   3.1592   3271.84   56.9    3.0      9.00      5.0
4      1.7536   61.8   108.3725   3.0751   3819.24   55.5    1.2      1.44      2.1
5      1.7404   58.8   102.3355   3.0290   3457.44   56.4   -1.4      1.96      2.5
6      1.7348   47.2    81.8826   3.0095   2227.84   60.0   -5.7     32.49     10.5
7      1.6928   55.2    93.4426   2.8656   3047.04   57.5   -8.2     67.24     16.6
Total  12.3234  384.3  675.9974   21.7078  21338.41  403.4  -1.8    200.78     56.3
Mean   1.7605   54.9    96.5711   3.1011   3048.34   XX      XX      28.68      8.0
σ      0.0425   5.86    XX        XX       XX        XX      XX      XX         XX
σ²     0.0018   34.339  XX        XX       XX        XX      XX      XX         XX

The values of the regression parameters A and B are:

B = (mean(Y·x) − Ȳ·x̄)/σ²_x = (96.5711 − 1.7605·54.9)/34.339 ≈ −0.0023;
A = Ȳ − B·x̄ = 1.7605 + 0.0023·54.9 = 1.887.

The linear equation obtained is Y = 1.887 − 0.0023x. Potentiating it, we write it in the usual form:

y_x = 10^1.887 · 10^(−0.0023x) = 77.1 · 0.9947^x.
We estimate the tightness of the relationship through the correlation index p_xy and the average approximation error; for the exponential model, A ≈ 8.0% (from the table above).

1d. The equilateral hyperbola y = a + b/x is pre-linearized by the substitution z = 1/x. The analogous calculation table yields Σ(y − y_x)² = 194.90 and an average approximation error A ≈ 8.1%.