Study of the interdependence between statistical indicators: the statistical study of relationships and the establishment of quantitative estimates of the closeness of the connection, characterizing the strength of the influence of factor attributes on the effective attribute.

8.1. Basic concepts of correlation and regression analysis

Exploring nature, society, and the economy, it is necessary to take into account the interrelation of the observed processes and phenomena. The completeness of the description is largely determined by the quantitative characteristics of the cause-and-effect relationships between them. Evaluating the most significant of these relationships, as well as the impact of some factors on others, is one of the main tasks of statistics.

The forms in which interrelations manifest themselves are very diverse. The two most common types are the functional (complete) and the correlation (incomplete) connection. In the first case, a value of the factor attribute corresponds strictly to one or several values of the function. Functional connections appear quite often in physics and chemistry. In economics, an example is the directly proportional relationship between labor productivity and the increase in output.

A correlation connection (also called incomplete, or statistical) appears on average, over mass observations, when given values of the independent variable correspond to a set of probable values of the dependent variable. The explanation is the complexity of the relationships between the analyzed factors, whose interaction is affected by unaccounted-for random variables. Therefore, the relationship between the attributes manifests itself only on average, in the mass of cases. With a correlation, each value of the argument corresponds to values of the function randomly distributed within a certain interval.

For example, an increase in the argument entails only an average increase or decrease (depending on the direction) of the function, while the specific values for individual units of observation differ from that average. Such dependencies are ubiquitous. In agriculture, for instance, there is a relationship between yield and the amount of fertilizer applied. Fertilizer obviously participates in forming the crop, but for each specific field or plot the same amount of applied fertilizer causes a different increase in yield, since a whole range of other factors (weather, soil conditions, etc.) also takes part in the interaction and shapes the final result. On average, however, the relationship holds: an increase in the mass of applied fertilizer leads to an increase in yield.

By direction, connections are direct, when the dependent variable grows as the factor attribute increases, and inverse, when growth of the factor attribute is accompanied by a decrease in the function. Such connections can also be called positive and negative, respectively.

With regard to their analytical form, connections are linear and non-linear. In the first case, linear relationships appear between the attributes on average. A non-linear relationship is expressed by a non-linear function, and the variables are related on average non-linearly.

There is one more rather important characteristic of connections from the point of view of the interacting factors. If a relationship between two attributes is characterized, it is called paired. If more than two variables are studied, the relationship is multiple.

The classification features listed above are most often encountered in statistical analysis. Beyond them, there are also direct, indirect and false connections. The essence of each is evident from its name. In the first case, the factors interact with each other directly. An indirect connection involves some third variable, which mediates the relationship between the studied attributes. A false connection is one established formally and, as a rule, confirmed only by quantitative estimates; it has no qualitative basis or is meaningless.

By strength, connections are divided into weak and strong. This formal characteristic is expressed by specific values and is interpreted according to generally accepted criteria of connection strength for the specific indicators.

In the most general form, the task of statistics in the field of studying relationships is to quantify their presence and direction, as well as to characterize the strength and form of the influence of some factors on others. Two groups of methods are used to solve it: the methods of correlation analysis and those of regression analysis. A number of researchers combine these methods into correlation-regression analysis, which has some grounds: the presence of a number of common computational procedures, complementarity in interpreting the results, etc.

Therefore, in this context we can speak of correlation analysis in the broad sense, when the relationship is characterized comprehensively. At the same time, correlation analysis in the narrow sense examines the strength of the connection, while regression analysis evaluates its form and the impact of some factors on others.

The tasks of correlation analysis proper reduce to measuring the closeness of the connection between varying attributes, identifying unknown causal relationships, and assessing the factors that have the greatest impact on the resulting attribute.

The tasks of regression analysis lie in establishing the form of the dependence, determining the regression function, and using the equation to estimate unknown values of the dependent variable.

The solution of these problems is based on appropriate techniques, algorithms, indicators, the use of which gives reason to talk about the statistical study of relationships.

It should be noted that the traditional methods of correlation and regression are widely available in various statistical software packages. The researcher only has to prepare the information properly, choose a package that satisfies the requirements of the analysis, and be ready to interpret the results obtained. There are many algorithms for calculating connection parameters, and today it is hardly advisable to carry out such a complex type of analysis manually. The computational procedures are of independent interest, but knowledge of the principles of studying relationships, and of the possibilities and limitations of particular methods of interpreting the results, is a prerequisite for research.
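By way of illustration, here is a minimal sketch in Python with numpy of how the basic quantities of this chapter are typically computed; the data are hypothetical, chosen only to mirror the equipment-cost example later in this section.

import numpy as np

# Hypothetical paired observations: factor attribute X and effective attribute Y
x = np.array([10.0, 12.5, 15.0, 17.5, 20.0, 22.5, 25.0])
y = np.array([8.0, 14.1, 19.5, 23.9, 29.4, 34.0, 40.2])
n = len(x)

# Linear (Pearson) correlation coefficient
r = np.corrcoef(x, y)[0, 1]

# Least-squares estimates of the pair regression Y = a0 + a1*X
a1, a0 = np.polyfit(x, y, deg=1)   # polyfit returns the highest power first

# Calculated t-value for testing the significance of r (n - 2 degrees of freedom)
t_calc = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)

print(f"r = {r:.3f}, a0 = {a0:.2f}, a1 = {a1:.2f}, t_calc = {t_calc:.2f}")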

Methods for assessing the tightness of the connection are divided into correlation (parametric) and non-parametric. Parametric methods are based on the use, as a rule, of normal distribution estimates and are used in cases where the population under study consists of quantities that obey the normal distribution law. In practice, this position is most often taken a priori. Actually, these methods are parametric and are commonly called correlation methods.

Nonparametric methods do not impose restrictions on the law of distribution of the studied quantities. Their advantage is also the simplicity of calculations.

8.2. Pair Correlation and Pair Linear Regression

The simplest technique for identifying a relationship between two attributes is to build a correlation table:

X \ Y      y1      y2      ...     yz      Total    ȳi
x1         f11     f12     ...     f1z     f1.      ȳ1
x2         f21     f22     ...     f2z     f2.      ȳ2
...        ...     ...     ...     ...     ...      ...
xk         fk1     fk2     ...     fkz     fk.      ȳk
Total      f.1     f.2     ...     f.z     n        -

The grouping is based on the two attributes studied in the relationship, X and Y. The frequencies f_ij show the number of corresponding combinations of X and Y. If the f_ij are scattered randomly over the table, one can speak of an absence of connection between the variables. If the f_ij form some characteristic combination, it is permissible to assert a connection between X and Y. If the f_ij are concentrated near one of the two diagonals, a direct or inverse linear connection is present.

A visual representation of the correlation table is the correlation field: a graph on which the X values are plotted along the abscissa axis, the Y values along the ordinate axis, and the combinations of X and Y are shown by dots. From the location of the dots and their concentration in a certain direction, one can judge the presence of a connection.

In the margins of the correlation table, two distributions are given for the rows and columns - one for X, the other for Y. Let us calculate for each X_i the average value of Y, i.e. the conditional mean ȳ_i:
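In the standard notation for a correlation table, this conditional mean is

\[ \bar{y}_i = \frac{\sum_{j=1}^{z} y_j\, f_{ij}}{\sum_{j=1}^{z} f_{ij}} . \]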

The sequence of points (X_i, ȳ_i) gives a graph that illustrates the dependence of the average value of the effective attribute Y on the factor X - the empirical regression line, which shows how Y changes as X changes.

In essence, the correlation table, the correlation field and the empirical regression line give a preliminary characterization of the relationship once the factor and effective attributes have been chosen and assumptions about the form and direction of the relationship are to be formulated. A quantitative assessment of the closeness of the connection, however, requires additional calculations.

In practice, the linear correlation coefficient is used to quantify the closeness of the connection; it is sometimes referred to simply as the correlation coefficient. If the values of the variables X and Y are given, it is calculated by the formula
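Presumably the classical expression is meant:

\[ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}} . \]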

You can use other formulas, but the result should be the same for all calculation options.

The correlation coefficient takes values in the range from -1 to +1. It is generally accepted that if |r| < 0.30, the connection is weak; for |r| = 0.3-0.7 it is moderate; for |r| > 0.70 it is strong, or close. When |r| = 1 the connection is functional. If r takes a value near 0, there are grounds to speak of the absence of a linear relationship between Y and X; a non-linear interaction is, however, possible in this case, which requires additional verification with the other measures discussed below.

To characterize the influence of changes in X on the variation in Y, the methods of regression analysis are used. In the case of a paired linear dependence, the regression model is built as
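In standard form:

\[ y_i = a_0 + a_1 x_i + e_i, \qquad i = 1, 2, \ldots, n, \]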

where n is the number of observations;
a_0, a_1 are the unknown parameters of the equation;
e_i is the random error of the variable Y.

The regression equation is written as
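In standard form:

\[ Y_i^{\text{theor}} = a_0 + a_1 x_i , \]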

where Y_i^theor is the calculated (aligned) value of the effective attribute obtained by substituting X_i into the equation.

The parameters a_0 and a_1 are estimated by special procedures, of which the least squares method is the most widely used. Its essence is that the best estimates of a_0 and a_1 are obtained when
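\[ \sum_{i=1}^{n} \left( y_i - Y_i^{\text{theor}} \right)^2 \to \min , \]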

i.e. when the sum of the squared deviations of the empirical values of the dependent variable from the values calculated by the regression equation is minimal. The sum of squared deviations is a function of the parameters a_0 and a_1; it is minimized by solving the system of normal equations
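The standard system is

\[ \begin{cases} n a_0 + a_1 \sum x = \sum y, \\ a_0 \sum x + a_1 \sum x^2 = \sum xy. \end{cases} \]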

You can use other formulas that follow from the least squares method, for example:
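For instance, the standard closed-form estimates:

\[ a_1 = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - \left(\sum x\right)^2}, \qquad a_0 = \bar{y} - a_1 \bar{x} . \]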

The apparatus of linear regression is quite well developed and, as a rule, is available in standard statistical software. The meaning of the parameters is important: a_1 is the regression coefficient, which characterizes the effect a change in X has on Y. It shows by how many units Y changes on average when X changes by one unit. If a_1 > 0, a positive connection is observed; if a_1 is negative, an increase in X by one entails a decrease in Y by a_1 on average. The parameter a_1 has the dimension of the ratio of Y to X.

The parameter a_0 is the constant term of the regression equation. In our opinion, it has no economic meaning, although in some cases it is interpreted as the initial value of Y.

For example, from data on the cost of equipment X and labor productivity Y, the least squares method yielded the equation

Y = -12.14 + 2.08X.

The coefficient a_1 = 2.08 means that an increase in the cost of equipment by 1 million rubles leads on average to an increase in labor productivity by 2.08 thousand rubles.

The value of the function Y = a_0 + a_1X is called the calculated value and forms the theoretical regression line on the graph.

The meaning of theoretical regression is that it is an estimate of the mean value of the variable Y for a given value of X.

Pair correlation and pair regression can be considered a special case reflecting the relationship between some dependent variable and one of many independent variables. When it is required to characterize the relationship of the entire set of independent variables with the effective attribute, one speaks of multiple correlation or multiple regression.

8.3. Assessing the significance of relationship parameters

Having obtained the correlation and regression estimates, it is necessary to check how well they correspond to the true parameters of the relationship.

Existing computer programs include, as a rule, several of the most common criteria. To assess the significance of the pair correlation coefficient, the standard error of the correlation coefficient is calculated:
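In standard form:

\[ \sigma_r = \sqrt{\frac{1 - r_{xy}^2}{n - 2}} . \]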

As a first approximation, it is necessary that |r_xy| exceed its standard error several times; in practice the rule |r_xy| ≥ 3σ_r is often applied. The significance of r_xy is then checked by comparing it with σ_r, and one obtains
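\[ t_{\text{calc}} = \frac{r_{xy}}{\sigma_r} = \frac{r_{xy}\sqrt{n - 2}}{\sqrt{1 - r_{xy}^2}} , \]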

where t calc is the so-called calculated value of the t-criterion.

If t_calc is greater than the theoretical (tabular) value of Student's t-test (t_tabl) for the chosen probability level and (n-2) degrees of freedom, it can be asserted that r_xy is significant.

Similarly, using the corresponding formulas, the standard errors of the parameters of the regression equation are calculated, and then the t-tests for each parameter. Here, too, it is important that the condition t_calc > t_tabl be satisfied; otherwise there is no reason to trust the obtained parameter estimate.

The conclusion about the correct choice of the type of relationship and about the significance of the entire regression equation is obtained with the F-criterion, by computing its calculated value:
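For pair regression the standard form is

\[ F_{\text{calc}} = \frac{r^2}{1 - r^2} \cdot \frac{n - m}{m - 1} , \]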

where n is the number of observations;
m is the number of parameters of the regression equation.

F_calc should be greater than F_theor at v_1 = (m-1) and v_2 = (n-m) degrees of freedom. Otherwise, the form of the equation, the list of variables, etc. should be revised.

8.4. Nonparametric Methods for Estimating Relationships

The methods of correlation and variance analysis are not universal: they can be applied if all the characteristics under study are quantitative. When using these methods, one cannot do without calculating the main distribution parameters (averages, variances), so they are called parametric methods.

Meanwhile, in statistical practice, one has to deal with the problems of measuring the relationship between qualitative features, to which parametric methods of analysis in their usual form are not applicable. Statistical science has developed methods that can be used to measure the relationship between phenomena without using the quantitative values ​​of the attribute, and hence the distribution parameters. Such methods are called nonparametric.

If the relationship between two qualitative attributes is studied, the combinational distribution of population units is used in the form of so-called contingency tables.

Let us consider the method of analyzing contingency tables using the specific example of social mobility as the process of overcoming the isolation of particular social and professional groups of the population. Below are data on the distribution of secondary school graduates by sphere of employment, with the corresponding social groups of their parents identified.

The distribution of frequencies over the rows and columns of the contingency table makes it possible to identify the main patterns of social mobility: 42.9% of the children of parents in group 1 ("Industry and construction") are employed in the sphere of intellectual labor (39 out of 91); 38.9% of the children whose parents work in agriculture work in industry (34 out of 88), etc.

One can also notice a clear heredity in the transfer of professions. Thus, of those who went into agriculture, 29 people, or 64.4%, are children of agricultural workers; more than 50% of those in the sphere of intellectual labor have parents from the same social group, etc.

However, it is important to obtain a generalizing indicator that characterizes the closeness of the connection between the attributes and allows comparison of the connection across different populations. The contingency coefficients of Pearson (C) and Chuprov (K), for example, serve this purpose:
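In standard notation these coefficients are

\[ C = \sqrt{\frac{\varphi^2}{1 + \varphi^2}}, \qquad K = \sqrt{\frac{\varphi^2}{\sqrt{(K_1 - 1)(K_2 - 1)}}} , \]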

where φ² is the mean-square contingency index, determined by subtracting one from the sum of the ratios of the squared frequencies of each cell of the table to the product of the total frequencies of the corresponding column and row:
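\[ \varphi^2 = \sum_{i}\sum_{j} \frac{f_{ij}^2}{f_{i\cdot}\, f_{\cdot j}} - 1 . \]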

K_1 and K_2 are the numbers of groups for each of the attributes. The value of the coefficient of mutual contingency, which reflects the closeness of the connection between qualitative attributes, varies within the range usual for these indicators, from 0 to 1.

In socio-economic studies there are often situations when an attribute is not expressed quantitatively, but the units of the population can be ordered. Such an ordering of the units of the population by the value of an attribute is called ranking. Examples are the ranking of students (pupils) by ability, or of any set of people by level of education, profession, capacity for creativity, etc.

When ranking, each unit of the population is assigned a rank, i.e. a serial number. If different units have the same value of the attribute, they are assigned the averaged serial number. For example, if the 5th and 6th units of the population have identical attribute values, both receive a rank equal to (5 + 6) / 2 = 5.5.

The relationship between ranked attributes is measured with the rank correlation coefficients of Spearman (ρ) and Kendall (τ). These methods are applicable not only to qualitative but also to quantitative indicators, especially for small populations, since the non-parametric methods of rank correlation impose no restrictions on the character of the distribution of the attribute.
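In standard notation, Spearman's coefficient is

\[ \rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} , \]

where d_i is the difference between the ranks of the i-th unit by the two attributes and n is the number of ranked units; Kendall's coefficient is commonly written as

\[ \tau = \frac{2S}{n(n - 1)} , \]

where S is the balance of concordant and discordant rank pairs.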


METHODOLOGICAL INSTRUCTIONS FOR SOLVING TYPICAL TASKS

To identify features in the development of phenomena, to detect trends, to establish dependencies, it is necessary to group statistical data. For this purpose, a grouping attribute is selected and a summary indicator system is developed that will characterize the selected groups, for which a table layout is compiled.

A table layout is a table that consists of rows and columns that are not filled with numbers. Each statistical table (or layout) has a subject and a predicate. The subject is the object of study. The predicate is a system of indicators that characterizes the object of study. The subject is located on the left in the form of the name of horizontal lines, and the predicate is on the right, in the form of the name of vertical columns.

Depending on the construction of the subject, the following types of tables are distinguished: simple, group, combination.

Group tables are those whose subject contains a grouping of population units according to one attribute.

In social production, all processes are closely interconnected. Between attributes there are functional and correlation relationships. Functional relationships are those in which the value of the attribute under study is determined by one or several factors, and with a change in the factor attributes the effective attribute always changes by a strictly determined amount. In social production, however, this kind of dependence is rare.

The interrelations of the attributes of economic phenomena are, as a rule, correlational in character. In correlation relationships, one value of the attribute under study may correspond to many values of another attribute or attributes, and with a change in one attribute the others vary in different directions.

Correlations are distinguished as: simple and multiple (by the number of connected attributes); positive and negative (by direction); rectilinear and curvilinear (by analytical expression).

Pair correlation displays the relationship between two attributes. With multiple correlation, an economic phenomenon is considered as the result of the combined influence of many factors.

A positive correlation reflects changes in the attributes in direct proportion. Relationships in which an increase (decrease) in one attribute is accompanied by a decrease (increase) in the other are called negative.

A rectilinear relationship is one that can be expressed by the equation of a linear function. A curvilinear connection, expressed by the equation of a curved line, is characterized by the fact that as one attribute increases, the second first increases and then, after reaching a certain level of development, decreases.


In the process of correlation analysis the following coefficients are used: linear correlation (r), correlation ratio (η), association (r_a), mutual contingency (r_c), rank correlation (r_p), multiple correlation (r_xyz), the correlation index (I_r), and regression (R).

The linear correlation coefficient is an indicator that reflects the direction and the measure of the closeness of the connection between attributes under linear (or nearly linear) relationships.

For small samples, the linear correlation coefficient is calculated by the formula:
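A standard small-sample form consistent with the notation below is

\[ r = \frac{\overline{xy} - \bar{x} \cdot \bar{y}}{\sigma_x \sigma_y} , \]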

where x, y are the values of the studied attributes;

x̄, ȳ are the average values of each attribute;

(xy)‾ is the average value of the product of the attributes x and y;

n is the number of units in the series.

The most convenient formula for calculating the correlation coefficient is the following:
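In standard form:

\[ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right] \left[n\sum y^2 - \left(\sum y\right)^2\right]}} . \]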

The correlation coefficient varies from -1 to +1. The closer the correlation coefficient is to one, the closer the relationship between the features.

The strength of the connection can be assessed roughly using the Chaddock scale, but often a more precise assessment of significance is needed, based either on the t-test (for small samples) or on Fisher's F-test. A probabilistic assessment of the significance of the correlation coefficient for a small sample is preferably based on calculating Student's t-criterion
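In standard form:

\[ t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}} , \]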

where r is the correlation coefficient;

n is the number of matched pairs of observations.

The resulting calculated value of Student's t-criterion is compared with its theoretical value for the 5% or 1% significance level and n-2 degrees of freedom (Appendix B).

If t_calc > t_tabl, the relationship between the factor and the result is significant; conversely, if t_calc < t_tabl, the relationship is insignificant and the factor is excluded from further study.

If the sample size is more than 30, then the random error of the sample correlation coefficient is first determined by the formula:
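A plausible form, consistent with the quantities defined below and equivalent to the familiar expression through r, is

\[ m_r = \sqrt{\frac{S^2}{n\,\sigma^2}} = \sqrt{\frac{1 - r^2}{n}} , \]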

where σ² is the total variance of the effective attribute;

S² is the variance of the differences between the empirical data and the regression line (the residual variance).
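In standard notation (here with division by n; some texts divide by n - 2):

\[ S^2 = \frac{\sum (y - \hat{y})^2}{n} , \]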

where y are the empirical values of the effective attribute;

ŷ are the calculated values of the effective attribute.

The calculated values ​​of t - Student's criterion will be determined:

The correlation coefficient accurately estimates the closeness of the connection only when the relationship between the attributes is linear. If the dependence is curvilinear, the empirical correlation ratio or the correlation index is used to assess the closeness of the connection. The correlation ratio is determined by the formula:
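In the notation defined below:

\[ \eta = \sqrt{\frac{\sigma^2_{\text{fact}}}{\sigma^2_{\text{total}}}} ; \]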

the correlation index is calculated:
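Correspondingly:

\[ I_r = \sqrt{1 - \frac{\sigma^2_{\text{rest}}}{\sigma^2_{\text{total}}}} , \]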

where σ²_fact is the variation of the effective attribute under the influence of the factor;

σ²_total is the variation of the effective attribute under the influence of all factors;

σ²_rest is the variation of the effective attribute under the influence of the other factors.

The significance of the calculated correlation ratio is determined on the basis of Fisher's F-test:
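The standard form is

\[ F_{\text{calc}} = \frac{\eta^2}{1 - \eta^2} \cdot \frac{n - m}{m - 1} , \]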

m is the number of parameters in the regression equation.

The calculated value of the F-criterion is compared with the theoretical value from the F-distribution tables with v_1 = k-1 degrees of freedom for the numerator and v_2 = n-k for the denominator at the chosen significance level (α = 0.05 or α = 0.01) (Appendix E).

If F_calc > F_tabl, the relationship between the attributes is significant (essential); if F_calc < F_tabl, the relationship is not significant and the factor should be excluded from further study.

In the process of studying a phenomenon it is important not only to establish the closeness of the connection, but also to calculate the indicators that characterize the relationship between the attributes. This is done by solving regression equations. For the analytical expression of rectilinear regression, the equation of a straight line is used:
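In the notation defined below:

\[ \hat{y}_x = a + bx , \]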

where ŷ_x is the aligned (theoretical) value of the effective attribute;

a, b are the parameters of the equation.

The parameters a and b of the equation are determined by the least squares method, for which a system of normal equations is solved:
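The standard system is

\[ \begin{cases} na + b\sum x = \sum y, \\ a\sum x + b\sum x^2 = \sum xy. \end{cases} \]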

Calculations are made in tabular form, in which the values of Σx, Σy, Σx², Σxy are computed.

After the parameters a and b are found, the parametrized equation of the straight line is written down.

But the linear form does not always reflect the essence of the phenomenon, although it is preferable because it is easy to interpret. Therefore, when choosing the form of the connection, curvilinear dependencies are also necessarily considered (their typical equations are given after this list):

parabolic

hyperbolic

mixed

exponential

semi-logarithmic

and others.
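In the usual textbook notation, these dependencies take the following forms (the mixed and semi-logarithmic forms are given in one common variant each):

\[ \text{parabolic: } \hat{y} = a + bx + cx^2; \qquad \text{hyperbolic: } \hat{y} = a + \frac{b}{x}; \qquad \text{mixed: } \hat{y} = a + bx + \frac{c}{x}; \]

\[ \text{exponential: } \hat{y} = a\,b^x; \qquad \text{semi-logarithmic: } \hat{y} = a + b \lg x. \]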

The parameters of these equations are also found by the least squares method. Thus, for a parabola the following system of equations is solved:
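For the parabola ŷ = a + bx + cx², the standard system is

\[ \begin{cases} na + b\sum x + c\sum x^2 = \sum y, \\ a\sum x + b\sum x^2 + c\sum x^3 = \sum xy, \\ a\sum x^2 + b\sum x^3 + c\sum x^4 = \sum x^2 y. \end{cases} \]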

The researcher must consider the possible mathematical models and then, from the parametrized equations found, choose the approximating one (the one that most accurately reproduces the empirical two-dimensional distribution series). This is done on the basis of the approximation error:
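A common definition of the (mean relative) approximation error is

\[ \varepsilon_a = \frac{1}{n} \sum \left| \frac{y - \hat{y}}{y} \right| \cdot 100\% . \]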

The approximating equation is the parametrized equation for which the error is minimal; for practical purposes an equation with ε_a ≤ 5% is used.

Then the parameters of the approximating equation must be checked for significance.

The parameters a and b should be evaluated by statistical criteria (Student's t-test, Fisher's F-test). Special attention must be given to the parameter b, called the regression coefficient, since this indicator, as a measure of the change in the dependent attribute per unit change in the factor, serves as the basis for extrapolation.

The significance of the parameter b is assessed on the basis of the error of the regression coefficient:
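In standard notation:

\[ m_b = \sqrt{\frac{S^2}{\sum (x - \bar{x})^2}} , \]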

where S² is the residual variance;

x are the values of the series (the factor attribute);

x̄ is the average value of the series.

The calculated value of the t-criterion is determined by:
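Presumably as the ratio of the coefficient to its error:

\[ t_b = \frac{b}{m_b} . \]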

The calculated value of the t-criterion is compared with its theoretical value from Student's tables (Appendix B) at n-2 degrees of freedom and the 5% and 1% significance levels. If t_calc > t_tabl, the parameter b is significant.

The parameter a is assessed according to the formula:
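A standard expression for the error of the constant term is

\[ m_a = \sqrt{\frac{S^2 \sum x^2}{n \sum (x - \bar{x})^2}} . \]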

The calculated value of the t-criterion for the parameter a is determined:
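\[ t_a = \frac{a}{m_a} . \]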

As above, it is compared with the theoretical value, a conclusion is drawn about the significance of the parameter a, and a conclusion is made about the practical use of the resulting model for the purposes of planning and forecasting.

If it is necessary to determine the influence of several factors on the effective attribute, a multiple regression model is built:
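In general form:

\[ \hat{y} = a_0 + a_1 x_1 + a_2 x_2 + \ldots + a_m x_m . \]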

In the case of a three-dimensional distribution, the regression equation takes the form:
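That is, with two factor attributes:

\[ \hat{y}_{x_1 x_2} = a_0 + a_1 x_1 + a_2 x_2 . \]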

The parameters of the equation can be found using the simplex method, among other approaches.

Annotation: For most statistical studies it is important to identify the existing relationships between ongoing phenomena and processes. Almost all observed phenomena of the economic life of society, however independent they may seem at first glance, are as a rule the result of the action of certain factors. For example, the profit received by an enterprise is associated with many indicators: the number of employees, their education, the cost of fixed production assets, etc.

12.1. The concept of functional and correlation relationships

There are two main types of connection between social and economic phenomena - functional and statistical (also called stochastic, probabilistic or correlation). Before considering them in more detail, we introduce the concepts of independent and dependent features.

Independent, or factor, attributes are those that cause changes in other, related attributes. Attributes whose change under the influence of certain factors is to be traced are called dependent, or effective.

With a functional relationship, a change in independent variables leads to obtaining precisely defined values ​​of the dependent variable.

Functional relationships appear most often in the natural sciences; in mechanics, for example, the dependence of the distance traveled by an object on the speed of its movement is functional.

With a statistical relationship, each value of the independent variable X corresponds to a set of values of the dependent variable Y, and it is not known in advance which one. For example, we know that the profit of a commercial bank is related in a certain way to the size of its authorized capital (this fact is beyond doubt). Nevertheless, the exact amount of profit cannot be calculated from a given value of authorized capital, since it also depends on many other factors, some of them random. In our case, most likely, we will determine only the average value of the profit received over the aggregate of banks with a similar amount of authorized capital. Thus, a statistical relationship differs from a functional one by the presence of a large number of factors acting on the dependent variable.

Note that a statistical relationship manifests itself only "in general and on average" with a large number of observations. Intuitively, we can assume a relationship between the volume of an enterprise's fixed assets and the profit it receives, namely that profit grows as fixed assets increase. But one can object and cite an enterprise that has a sufficient amount of modern production equipment yet suffers losses. Here we have a good example of a statistical connection, which manifests itself only in large populations containing tens and hundreds of units, in contrast to a functional one, which is confirmed for every observation.

A correlation is a statistical relationship between attributes in which a change in the values of the independent variable X leads to a regular change in the mathematical expectation of the random variable Y.

Example 12.1. Suppose that data are available for enterprises on the amount of retained earnings of the previous year, the volume of investments in fixed capital, and the amounts allocated for the purchase of securities (thousand monetary units):

Table 12.1.

Company    Retained earnings      Acquisition      Investments in
number     of the previous year   of securities    fixed assets
1          3 010                  190              100
2          3 100                  182              250
3          3 452                  185              280
4          3 740                  170              270
5          3 980                  172              330
6          4 200                  160              420
7          4 500                  145              606
8          5 020                  120              690
9          5 112                  90               800
10         5 300                  30               950

The table shows a direct correspondence between the retained earnings of an enterprise and its investments in fixed capital: as retained earnings increase, the volume of investment also grows. The relationship between retained earnings and the volume of purchased securities has quite a different character: an increase in the first indicator produces the opposite effect - the value of purchased securities, with rare exceptions (which already clearly rules out a functional connection), decreases. Such visual analysis, in which the observations are ranked in ascending or descending order of the independent value x and the change in the values of the dependent variable y is then examined, is called the parallel data reduction method.

In the example considered, the connection in the first case is direct, i.e. an increase (decrease) in one indicator entails an increase (decrease) in the other (the changes in the indicators correspond), and in the second case inverse, i.e. a decrease in one indicator causes an increase in the other, or an increase in one corresponds to a decrease in the other.

Direct and inverse dependencies characterize the direction of the connection between attributes, which can be illustrated graphically with a correlation field. It is built in a rectangular coordinate system: the values of the independent variable x are plotted on the abscissa axis, and those of the dependent variable y on the ordinate axis; the observations are marked by dots at the corresponding coordinates. From the shape of the scatter of dots in the correlation field one judges the form and closeness of the relationship. Figure 12.1 shows correlation fields corresponding to different forms of connection.


Fig. 12.1. Correlation fields: a - direct (positive) connection; b - inverse (negative) connection; c - absence of connection

The branch of statistical science dealing with the study of causal relationships between socio-economic phenomena and processes that have a quantitative expression is correlation-regression analysis. In essence these are two separate areas of analysis, correlation and regression; but since in practice they are most often applied together (regression analysis is carried out on the basis of the results of correlation analysis), they are combined into one type.

Carrying out correlation-regression analysis involves solving the following tasks:

1. establishing the presence of a connection between the selected attributes and its direction;
2. measuring the closeness of the connection;
3. establishing the analytical form of the connection and constructing the regression equation;
4. assessing the significance of the regression equation and its parameters;
5. using the equation to estimate values of the effective attribute.

Of the listed tasks, the first two belong directly to correlation analysis, the next three to regression analysis, and only in relation to quantitative indicators.

12.1.1. Requirements for statistical information studied by methods of correlation and regression analysis

Methods of correlation and regression analysis can not be applied to all statistical data. We list the main requirements for the analyzed information:

  1. the observations used for the study should be randomly selected from the general population of objects. Otherwise the initial data, being a certain sample from the general population, will not reflect its nature, and the conclusions drawn from them about the patterns of development will be meaningless and of no practical value;
  2. the observations must be independent of each other. Dependence of observations on one another is called autocorrelation; special methods have been created in the theory of correlation-regression analysis to eliminate it;
  3. the initial data set should be homogeneous, without anomalous observations. Indeed, a single outlying observation can have catastrophic consequences for the regression model: its parameters will turn out to be biased, the conclusions absurd;
  4. it is desirable that the initial data for analysis obey the normal distribution law. The normal law is needed so that the significance of the correlation coefficients can be checked and interval boundaries constructed for them using certain criteria. If significance testing and interval estimation are not required, the variables may have any distribution law. In regression analysis, when constructing the equation, the requirement of normal distribution applies only to the resulting variable Y; the independent factors are considered non-random variables and may in fact have any distribution law. As in correlation analysis, the requirement of normality is needed to check the significance of the regression equation and its coefficients and to find confidence intervals;
  5. the number of observations from which the relationship of the attributes is established and the regression model is built should exceed the number of factor attributes by at least 3-4 times (preferably 8-10 times). As noted above, a statistical connection manifests itself only with a significant number of observations, by virtue of the law of large numbers; the weaker the connection, the more observations are required to establish it, and the stronger, the fewer;
  6. the factor attributes X must not be functionally related to each other. A significant relationship among the independent (factor, explanatory) attributes indicates multicollinearity. Its presence leads to unstable regression models and "false" regressions.

12.1.2. Linear and non-linear connections

A linear connection is expressed by a straight line, a non-linear one by a curved line. A linear connection is described by the equation of a straight line: y = a_0 + a_1·x. The straight line is the most attractive from the point of view of the simplicity of calculating the parameters of the equation, and it is resorted to even in cases of non-linear connections when there is no threat of significant loss in the accuracy of the estimates. For some dependencies, however, representation in linear form leads to large errors (approximation errors) and, as a consequence, to false conclusions. In these cases non-linear regression functions are used, which in general may have any form, especially since modern software makes it possible to build them quickly. Most often, the following non-linear equations are used: power, parabolic, hyperbolic, and logarithmic.

The parameters of these models, as in the cases of linear dependencies, are also estimated based on the least squares method (see Section 12.3.1).

12.2. Correlation-regression analysis

The main tasks of correlation analysis are to determine the presence of a connection between the selected features, to establish its direction, and to quantify the closeness of the connection. To do this, in the correlation analysis, the matrix of paired correlation coefficients is first estimated, then, on its basis, partial and multiple correlation coefficients and determination coefficients are determined. After finding the values ​​of the coefficients, their significance is checked. The end result of the correlation analysis is the selection of factor signs X for further construction of a regression equation that allows one to quantitatively describe the relationship.

Let us consider the stages of correlation analysis in more detail.

12.2.1. Paired (linear) correlation coefficients

Correlation analysis begins with the calculation of paired (linear) correlation coefficients.

The pair correlation coefficient is a measure of the linear relationship between two variables against the background of the action of the other variables included in the model.

Depending on which calculation order is more convenient for the researcher, this coefficient is computed by one of the following formulas:
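In standard notation, the two equivalent computational forms are

\[ r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}} = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} . \]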

The pair correlation coefficient varies from -1 to +1. An absolute value equal to one indicates that the connection is functional: -1 inverse (negative), +1 direct (positive). A zero value of the coefficient indicates the absence of a linear connection between the attributes.

A qualitative assessment of the obtained quantitative values of the pair correlation coefficients can be given on the basis of the scale presented in Table 12.2.
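The scale commonly used for this purpose is the Chaddock scale, approximately as follows:

Table 12.2. Qualitative assessment of the closeness of the connection

Value of |r|:   0.1-0.3   0.3-0.5    0.5-0.7      0.7-0.9   0.9-0.99
Closeness:      weak      moderate   noticeable   high      very high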

Note: a positive value of the coefficient indicates that the relationship between the signs is direct, a negative value is inverse.

12.2.2. Communication materiality assessment

After the values of the coefficients are obtained, their significance should be checked. Since the initial data from which the relationship of the attributes is established are a certain sample from some general population of objects, the pair correlation coefficients calculated from them are sample estimates: they assess the relationship only on the basis of the information carried by the selected units of observation. If the initial data reflect the structure and patterns of the general population well, the correlation coefficient calculated from them will show the real connection inherent in the entire studied population of objects. If the data do not "copy" the relationships of the population as a whole, the calculated correlation coefficient will give a false picture of the relationship. Ideally, to establish this, one would calculate the correlation coefficient from the data of the entire population and compare it with the one calculated from the selected observations. In practice, as a rule, this cannot be done, since the entire population is often unknown or too large. Therefore, how realistically the coefficient represents reality can be judged only approximately. On the basis of logic it is easy to conclude that confidence in the calculated coefficient grows as the number of observations increases.

The significance of pairwise correlation coefficients is tested in one of two ways: using the Fisher-Yates table or Student's t-test. Consider the verification method using the Fisher-Yates table as the simplest.

At the beginning of the test, a significance level is set (most often denoted by the Greek letter α), which indicates the probability of making an erroneous decision. The possibility of error arises because not the entire population but only part of it is used to determine the relationship. Usually α takes the following values: 0.05; 0.02; 0.01; 0.001. For example, if α = 0.05, then on average in five cases out of a hundred the decision about the significance (or insignificance) of the pair correlation coefficients will be erroneous; at α = 0.001, in one case out of a thousand, etc.

The second parameter in checking significance is the number of degrees of freedom v, which in this case is calculated as v = n - 2. From the Fisher-Yates table the critical value of the correlation coefficient r_cr(α = 0.05, v = n - 2) is found. Coefficients whose modulus is greater than the critical value found are considered significant.

Example 12.2. Suppose that in the first case there are 12 observations, and the pair correlation coefficient calculated from them turned out to be 0.530; in the second, 92 observations, and the calculated pair correlation coefficient was 0.36. If we check their significance, in the first case the coefficient turns out to be insignificant, and in the second significant, even though it is much smaller in magnitude. The point is that in the first case there are too few observations, which raises the requirements: the critical value of the pair correlation coefficient at significance level α = 0.05 is 0.576 (v = 12 - 2), whereas in the second case there are far more observations and it suffices to exceed the critical value 0.205 (v = 92 - 2) for the correlation coefficient at the same level to be significant. Thus, the fewer the observations, the higher the critical value of the coefficient.

Significance testing essentially decides whether the calculated results are random or not.

12.2.3. Determining the Multiple Correlation Coefficient

The next stage of the correlation analysis is associated with the calculation of the multiple (cumulative) correlation coefficient.

The multiple correlation coefficient characterizes the tightness of the linear relationship between one variable and a set of other variables considered in the correlation analysis.

If the relationship between the resulting attribute y and only two factor attributes x_1 and x_2 is studied, the multiple correlation coefficient can be calculated by the following formula, whose components are the pair correlation coefficients:
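The standard formula is

\[ R_{y x_1 x_2} = \sqrt{\frac{r_{yx_1}^2 + r_{yx_2}^2 - 2\, r_{yx_1} r_{yx_2} r_{x_1 x_2}}{1 - r_{x_1 x_2}^2}} , \]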

where r are pairwise correlation coefficients.


1. Types of relationships and the concept of correlation dependence

2. Methods for identifying correlations

3. One-way correlation-regression analysis

4. Multivariate correlation and regression analysis

5. Non-parametric indicators of connection

1. Types of relationships and the concept of correlation dependence

All statistical indicators are interconnected by definite relationships.

The task of statistical research is to determine the nature of this relationship.

There are the following types of relationships:

1. Factorial. Here the connections are manifested in the coordinated variation of different attributes within the same population, one attribute acting as the factor and the other as the consequence. These relationships are studied by the method of groupings and by the theory of correlation.

2. Component. This type includes interrelations in which the change in some complex phenomenon is entirely determined by the change in the components included in it as factors (X = x·f). The index method is used for this.

For example, with the help of a system of interrelated indices, they learn how the turnover has changed due to changes in the number of goods sold and prices.

3. Balance. These are used in the analysis of relationships and proportions in the formation of resources and their distribution. A balance is a system of indicators consisting of two sums of absolute values connected by an equals sign:

a + b = c + d.

For example, the balance of material resources:

opening balance + receipts = expenditure + closing balance.

Attributes (indicators) in the study of relationships are divided into two types:

attributes that cause changes in others are called factor attributes, or simply factors;

attributes that change under the influence of factor attributes are called effective.

There are two types of relationships: functional and stochastic.

A relationship is called functional if a given value of the factor attribute corresponds to one and only one value of the effective attribute.

If a causal relationship does not appear in every case, but in general, on average, with a large number of observations, then such a relationship is called stochastic.

A special case of a stochastic connection is a correlation connection, in which the change in the average value of the effective attribute is caused by a change in the factor attribute.

Features of stochastic (correlation) relationships:

- they are found not in isolated cases, but in general and on average, with a large number of observations;

- they are incomplete: they take into account not all acting factors, but only the essential ones;

- they are irreversible: a functional relationship can be transformed into another functional relationship, but if we say that the yield of agricultural products depends on the amount of fertilizer applied, the converse statement is meaningless.

By direction, connections are direct and inverse. With a direct connection, an increase in the factor attribute is accompanied by an increase in the effective attribute; with an inverse connection, an increase in the factor attribute is accompanied by a decrease in the effective attribute.

By analytical expression, connections are linear (rectilinear) and non-linear (curvilinear). If the connection between the phenomena is expressed by the equation of a straight line, it is linear; if by the equation of a curved line (parabola, hyperbola, power, exponential, etc.), it is non-linear.

By the number of factors acting on the effective attribute, connections are single-factor and multifactor. If there is one factor attribute and one effective attribute, the connection is single-factor (pair regression); if there are two or more factor attributes, it is multifactor (multiple regression).

Connections are also distinguished by the degree of closeness (see the Chaddock scale).