Pearson distribution (chi-square distribution) and its use in problems of statistical data analysis

Until the late 19th century, the normal distribution was considered the universal law of data variation. However, K. Pearson noticed that empirical frequencies can differ greatly from the normal distribution. The question was how to prove this. It required not just a graphical comparison, which is subjective, but a rigorous quantitative justification.

Thus the χ² (chi-square) test was invented, which assesses the significance of the discrepancy between empirical (observed) and theoretical (expected) frequencies. This happened back in 1900, but the test is still in use today. Moreover, it has been adapted to a wide range of tasks. First of all, this is the analysis of categorical data, i.e. data expressed not by a quantity but by membership in a category: the class of a car, the gender of a participant in an experiment, the type of plant, and so on. Mathematical operations such as addition and multiplication cannot be applied to such data; only frequencies can be calculated for them.

We denote the observed frequencies by O (Observed) and the expected ones by E (Expected). As an example, take the result of throwing a die 60 times. If it is symmetrical and uniform, the probability of any side coming up is 1/6, and therefore the expected number of occurrences of each side is 10 (1/6 ∙ 60). We write the observed and expected frequencies in a table and draw a histogram.

The null hypothesis is that the frequencies are consistent, that is, the actual data do not contradict the expected ones. The alternative hypothesis is that the deviations in frequencies go beyond random fluctuations, i.e. the discrepancies are statistically significant. To draw a rigorous conclusion, we need:

  1. A generalized measure of the discrepancy between observed and expected frequencies.
  2. The distribution of this measure under the validity of the hypothesis that there are no differences.

Let's start with the distance between frequencies. If we simply take the difference O − E, such a measure will depend on the scale of the data (frequencies). For example, 20 − 5 = 15 and 1020 − 1005 = 15. In both cases the difference is 15, but in the first case the deviation is three times the expected frequency (300% of it), while in the second case it is only about 1.5% of it. We need a relative measure that does not depend on the scale.

Let's note the following facts. In general, the number of categories in which frequencies are measured can be large, so the probability that a single observation falls into any particular category is quite small. If so, the distribution of such a random variable obeys the law of rare events, known as Poisson's law. In the Poisson law, as is known, the mathematical expectation and the variance coincide (the parameter λ). Hence, the expected frequency E for a given category of the nominal variable is at the same time its variance. Further, with a large number of observations the Poisson law tends to the normal one. Combining these two facts, we get: if the hypothesis about the agreement between the observed and expected frequencies is true, then, with a large number of observations, the expression (O − E)/√E for each category has approximately a standard normal distribution.

It is important to remember that normality appears only at sufficiently high frequencies. In statistics it is generally accepted that the total number of observations (the sum of frequencies) should be at least 50 and the expected frequency in each category should be at least 5. Only in this case does the quantity shown above have approximately a standard normal distribution. Let's assume this condition is met.
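As a quick sanity check (not from the original article), one can simulate die throws in R and verify that the standardized deviation (O − E)/√E behaves like a standard normal variable; the sample sizes below are chosen arbitrarily.

set.seed(1)
N <- 600                    # number of throws, chosen large enough for normality
E <- N / 6                  # expected count of any single face
z <- replicate(10000, {
  O <- sum(sample(1:6, N, replace = TRUE) == 1)   # observed count of face 1
  (O - E) / sqrt(E)                               # standardized deviation
})
c(mean(z), sd(z))           # close to 0 and 1, as for a standard normal variable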

The standard normal distribution has almost all of its values within ±3 (the three-sigma rule). So we have obtained a relative difference in frequencies for one category. Now we need a generalized measure. We cannot simply add up all the raw deviations O − E: we would get 0 (guess why). Pearson suggested summing the squares of these deviations.

This gives Pearson's chi-square statistic: χ² = Σ (O − E)²/E, summed over all categories. If the frequencies really correspond to the expected ones, the value of the statistic will be relatively small (most deviations are near zero). But if the statistic turns out to be large, this testifies in favor of significant differences between the frequencies.

The Pearson statistic is considered "large" when the occurrence of such or an even larger value becomes unlikely. And to calculate that probability, one needs to know the distribution of the statistic over many repetitions of the experiment when the hypothesis of frequency agreement is true.

As you can see, the value of chi-square also depends on the number of terms: the more of them, the greater the value of the statistic tends to be, because each term contributes to the total. Therefore, for each number of independent terms there is a separate distribution. It turns out that χ² is a whole family of distributions.

And here we come to a ticklish point. What is the number of independent terms? It would seem that every term (i.e. deviation) is independent. K. Pearson thought so too, but he turned out to be wrong. In fact, the number of independent terms is one less than the number of categories of the nominal variable, n. Why? Because for a sample in which the sum of frequencies is already fixed, one of the frequencies can always be determined as the difference between the total number and the sum of all the others. Hence the variation is somewhat smaller. Ronald Fisher noticed this fact 20 years after Pearson developed his criterion. Even the tables had to be redone.

On this occasion, Fisher introduced a new concept into statistics - the degree of freedom (degrees of freedom), which is the number of independent terms in the sum. The concept of degrees of freedom has a mathematical explanation and appears only in distributions associated with the normal one (Student's, Fisher-Snedecor's, and the chi-square itself).

To better grasp the meaning of degrees of freedom, let's turn to a physical analogue. Imagine a point moving freely in space. It has 3 degrees of freedom, because it can move in any direction of three-dimensional space. If the point moves along a surface, it has two degrees of freedom (forward-backward, right-left), although it is still located in three-dimensional space. A point moving along a spring is again in three-dimensional space, but has only one degree of freedom, because it can move either forward or backward. As you can see, the space where the object is located does not always correspond to its real freedom of movement.

In roughly the same way, the distribution of a statistical criterion may depend on fewer elements than the number of terms needed to calculate it. In general, the number of degrees of freedom is less than the number of observations by the number of dependencies present.

So the chi-squared (χ²) distribution is a family of distributions, each of which depends on the degrees-of-freedom parameter. The formal definition is as follows: the χ² (chi-squared) distribution with k degrees of freedom is the distribution of the sum of squares of k independent standard normal random variables.

Next, we could move on to the formula itself, according to which the chi-square distribution function is calculated, but, fortunately, everything has long been calculated for us. To get the probability of interest, you can use either the corresponding statistical table or a ready-made function in Excel.

It is interesting to see how the shape of the chi-squared distribution changes depending on the number of degrees of freedom.
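A small R sketch (the degrees of freedom are chosen for illustration) that draws the chi-square density for several values of k:

x <- seq(0, 30, by = 0.1)
plot(x, dchisq(x, df = 2), type = "l", ylim = c(0, 0.5),
     xlab = "x", ylab = "density")                     # k = 2
for (k in c(5, 10, 20)) lines(x, dchisq(x, df = k))    # k = 5, 10, 20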

As the number of degrees of freedom increases, the chi-square distribution tends to the normal one. This is explained by the central limit theorem: the sum of a large number of independent random variables (here, the squares of standard normal variables, each of which has a finite mean and variance) has an approximately normal distribution.
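The normal approximation can be checked numerically; in this illustrative sketch the number of degrees of freedom is chosen arbitrarily and the chi-square quantiles are compared with those of N(n, 2n):

n <- 50
p <- c(0.05, 0.25, 0.5, 0.75, 0.95)
round(qchisq(p, df = n), 2)                       # chi-square quantiles
round(qnorm(p, mean = n, sd = sqrt(2 * n)), 2)    # normal approximation, quite close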

Pearson's Chi-Squared Hypothesis Test

So we come to testing hypotheses using the chi-square method. The general technique remains the same. A null hypothesis is put forward that the observed frequencies correspond to the expected ones (i.e. there is no difference between them, since they are taken from the same general population). If this is the case, the spread will be relatively small, within the limits of random fluctuations. The measure of spread is the chi-square statistic. Then either the statistic itself is compared with the critical value (for the corresponding significance level and degrees of freedom), or, more correctly, the p-value is calculated, i.e. the probability of obtaining such or an even larger value of the statistic when the null hypothesis is true.

Since we are interested in the agreement of frequencies, the hypothesis is rejected when the statistic exceeds the critical level, i.e. the test is one-tailed (right-tailed). However, sometimes it is required to test the left-tailed hypothesis: for example, when the empirical data are suspiciously similar to the theoretical ones. Then the statistic can fall into the improbable region on the left. The point is that under natural conditions it is unlikely to obtain frequencies that practically coincide with the theoretical ones; there is always some randomness that produces an error. If there is no such error, the data may have been falsified. Still, it is the right-tailed hypothesis that is usually tested.

Let's return to the problem with the die and calculate the value of the chi-square statistic from the available data.
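The article's frequency table is given as a figure, so the R sketch below uses hypothetical observed counts chosen to be consistent with the χ² = 3.4 reported further on:

observed <- c(13, 7, 12, 8, 12, 8)       # hypothetical counts for 60 throws
expected <- rep(60 / 6, 6)               # 10 expected for each face
sum((observed - expected)^2 / expected)  # 3.4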

Now let's find the critical value for 5 degrees of freedom (k) and a significance level of 0.05 (α) from the table of critical values of the chi-square distribution.

That is, the 0.05 quantile of the chi-square distribution (right tail) with 5 degrees of freedom is χ²(0.05; 5) = 11.1.

Let's compare the actual and tabular values: 3.4 (χ²) < 11.1 (χ²(0.05; 5)). The calculated statistic turned out to be smaller, which means that the hypothesis of equality (agreement) of frequencies is not rejected. In the figure, the situation looks like this.

If the calculated value fell into the critical region, then the null hypothesis would be rejected.

It would be more correct to also calculate the p-value. To do this, you would find the nearest value in the table for the given number of degrees of freedom and read off the corresponding significance level. But that is last-century practice. We will use a computer, in particular MS Excel. Excel has several functions related to chi-square.

Below is a brief description of them.

CHISQ.INV - the critical value of the statistic for a given probability on the left (as in statistical tables).

CHISQ.INV.RT - the critical value of the statistic for a given probability on the right. The function essentially duplicates the previous one, but here you can specify the level α directly instead of subtracting it from 1. This is more convenient, because in most cases it is the right tail of the distribution that is needed.

CHISQ.DIST - the p-value on the left (the density can also be calculated).

CHISQ.DIST.RT - the p-value on the right.

CHISQ.TEST - performs the chi-square test on two frequency ranges at once. The number of degrees of freedom is taken as one less than the number of frequencies in the column (as it should be); the function returns a p-value.

For now, let's calculate for our experiment the critical (tabular) value for 5 degrees of freedom and alpha 0.05. The Excel formula will look like this:

CHISQ.INV(0.95; 5)

CHISQ.INV.RT(0.05; 5)

The result will be the same - 11.0705. It is this value that we see in the table (rounded to 1 decimal place).

Finally, let's calculate the p-value for 5 degrees of freedom and the statistic value χ² = 3.4. We need the probability on the right, so we take the function with the .RT suffix (right tail):

CHISQ.DIST.RT(3.4; 5) = 0.63857

So, with 5 degrees of freedom, the probability of obtaining a value of the statistic χ² = 3.4 or greater is almost 64%. Naturally, the hypothesis is not rejected (the p-value is greater than 5%); the frequencies agree very well.
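For readers who prefer R to Excel, the same numbers can be reproduced with the built-in chi-square functions (a cross-check, not part of the original article):

qchisq(0.95, df = 5)                         # critical value, about 11.07
pchisq(3.4, df = 5, lower.tail = FALSE)      # right-tail p-value, about 0.64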

Now let's test the frequency-agreement hypothesis using the Excel function CHISQ.TEST.

No tables, no cumbersome calculations. By specifying the columns with the observed and expected frequencies as the function's arguments, we immediately get the p-value. Beauty.

Imagine now that you are playing dice with a suspicious type. The distribution of points from 1 to 5 remains the same, but he rolls 26 sixes (the number of all rolls becomes 78).

The p-value in this case turns out to be 0.003, which is much less than 0.05. There are serious reasons to doubt the fairness of the die. Here is what that probability looks like on the chi-square distribution diagram.

The chi-square statistic itself here turns out to be 17.8, which, naturally, is greater than the tabular value (11.1).
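Continuing the hypothetical counts used above, with 26 sixes the calculation looks like this (the figures match the 17.8 and 0.003 quoted in the text):

observed <- c(13, 7, 12, 8, 12, 26)          # same hypothetical counts, now 26 sixes
expected <- rep(sum(observed) / 6, 6)        # 13 expected for each of 78 throws
chi2 <- sum((observed - expected)^2 / expected)
chi2                                         # about 17.8
pchisq(chi2, df = 5, lower.tail = FALSE)     # about 0.003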

I hope I was able to explain what Pearson's χ² (chi-square) goodness-of-fit test is and how statistical hypotheses are tested with it.

Finally, once again about an important condition! The chi-square test works properly only when the total of all frequencies exceeds 50 and the expected value for each category is not less than 5. If in some category the expected frequency is less than 5 but the sum of all frequencies exceeds 50, then this category is combined with the nearest one so that their total expected frequency is at least 5. If this is not possible, or if the sum of the frequencies is less than 50, more accurate methods of hypothesis testing should be used. We'll talk about them another time.



Chi-squared distribution and its application

Kolmykova Anna Andreevna


Introduction

How are the approaches, ideas and results of probability theory used in our lives?

The basis is a probabilistic model of a real phenomenon or process, i.e. a mathematical model in which objective relationships are expressed in terms of probability theory. Probabilities are used primarily to describe the uncertainties that must be taken into account when making decisions. This refers both to undesirable possibilities (risks) and to attractive ones ("lucky chance"). Sometimes randomness is deliberately introduced into a situation, for example, when drawing lots, randomly selecting units for inspection, or conducting lotteries or consumer surveys.

Probability theory allows one to calculate other probabilities that are of interest to the researcher.

A probabilistic model of a phenomenon or process is the foundation of mathematical statistics. Two parallel series of concepts are used - those related to theory (a probabilistic model) and those related to practice (a sample of observational results). For example, the theoretical probability corresponds to the frequency found from the sample. The mathematical expectation (theoretical series) corresponds to the sample arithmetic mean (practical series). As a rule, sample characteristics are estimates of theoretical ones. At the same time, the quantities related to the theoretical series "are in the minds of researchers", refer to the world of ideas (according to the ancient Greek philosopher Plato), and are not available for direct measurement. Researchers have only selective data, with the help of which they try to establish the properties of a theoretical probabilistic model that are of interest to them.

Why do we need a probabilistic model? Because only with its help can the properties established from the analysis of a particular sample be transferred to other samples and to the entire so-called general population. The term "general population" refers to a large but finite collection of units being studied, for example, the totality of all residents of Russia or of all consumers of instant coffee in Moscow. The purpose of marketing or sociological surveys is to transfer statements obtained from a sample of hundreds or thousands of people to general populations of several million people. In quality control, a batch of products acts as the general population.

To transfer inferences from a sample to a larger population, some assumptions are needed about the relationship of sample characteristics with the characteristics of this larger population. These assumptions are based on an appropriate probabilistic model.

Of course, it is possible to process sample data without using one or another probabilistic model. For example, you can calculate the sample arithmetic mean, calculate the frequency of fulfillment of certain conditions, etc. However, the results of the calculations will apply only to a specific sample; transferring the conclusions obtained with their help to any other set is incorrect. This activity is sometimes referred to as "data analysis". Compared to probabilistic-statistical methods, data analysis has limited cognitive value.

So, the use of probabilistic models based on estimation and testing of hypotheses with the help of sample characteristics is the essence of probabilistic-statistical decision-making methods.

1. Chi-squared distribution

Three distributions closely related to the normal distribution are now commonly used in statistical data processing: the Pearson ("chi-square"), Student and Fisher distributions.

We will focus on the χ² ("chi-square") distribution. It was first studied by the astronomer F. Helmert in 1876: in connection with the Gaussian theory of errors, he studied the sums of squares of n independent standard normally distributed random variables. Later, Karl Pearson named this distribution "chi-square", and now the distribution bears his name.

Due to its close relationship with the normal distribution, the χ² distribution plays an important role in probability theory and mathematical statistics. The χ² distribution, and many other distributions defined through it (for example, Student's t-distribution), describe the sampling distributions of various functions of normally distributed observations and are used to construct confidence intervals and statistical tests.

The Pearson ("chi-square") distribution with n degrees of freedom is the distribution of the sum of squares

χ² = X₁² + X₂² + … + Xₙ²,

where X₁, X₂, …, Xₙ are independent normal random variables, each with zero mathematical expectation and unit standard deviation.

The number of terms, n, is called the "number of degrees of freedom" of the chi-square distribution. As the number of degrees of freedom increases, the distribution slowly approaches the normal one.

The density of this distribution is

f(x) = x^(n/2 − 1) · e^(−x/2) / (2^(n/2) · Γ(n/2)), x ≥ 0,

where Γ(·) is the gamma function. So the χ² distribution depends on one parameter, n, the number of degrees of freedom. The distribution function has the form

F(x) = ∫₀ˣ f(t) dt, x ≥ 0.

Figure 1 shows graphs of the probability density and the χ² distribution function for different numbers of degrees of freedom.

Figure 1. Probability density φ(x) of the χ² (chi-square) distribution for different numbers of degrees of freedom.

Moments of the "chi-square" distribution:

The chi-square distribution is used in estimating variance (via a confidence interval), in testing hypotheses of agreement, homogeneity and independence, primarily for qualitative (categorical) variables that take a finite number of values, and in many other problems of statistical data analysis.

2. "Chi-square" in problems of statistical data analysis

Statistical methods of data analysis are used in almost all areas of human activity. They are used whenever it is necessary to obtain and substantiate any judgments about a group (objects or subjects) with some internal heterogeneity.

The modern stage in the development of statistical methods can be counted from 1900, when the Englishman K. Pearson founded the journal Biometrika. The first third of the 20th century passed under the sign of parametric statistics: methods based on the analysis of data from parametric families of distributions described by the Pearson family of curves were studied. The most popular was the normal distribution. The Pearson, Student and Fisher criteria were used to test hypotheses. The maximum likelihood method and analysis of variance were proposed, and the main ideas of experimental design were formulated.

The chi-square distribution is one of the most widely used in statistics for testing statistical hypotheses. On the basis of the "chi-square" distribution, one of the most powerful goodness-of-fit tests, Pearson's "chi-square" test, was constructed.

A goodness-of-fit test is a test of the hypothesis that an unknown distribution follows a proposed law.

The χ2 ("chi-square") test is used to test the hypothesis of different distributions. This is his merit.

The statistic is calculated as

χ² = Σ (m − m′)² / m′,

where m and m′ are, respectively, the empirical and theoretical frequencies of the distribution under consideration, and n is the number of degrees of freedom.

For verification, we need to compare empirical (observed) and theoretical (calculated under the assumption of a normal distribution) frequencies.

If the empirical frequencies coincide exactly with the calculated (expected) ones, every difference E − T is zero and the criterion χ² is also zero. If the differences are not all zero, this indicates a discrepancy between the calculated and empirical frequencies of the series. In such cases it is necessary to evaluate the significance of the χ² statistic, which can theoretically vary from zero to infinity. This is done by comparing the actually obtained value χ²_fact with its critical value χ²_crit. The null hypothesis, i.e. the assumption that the discrepancy between the empirical and theoretical (expected) frequencies is random, is rejected if χ²_fact is greater than or equal to χ²_crit for the accepted significance level α and number of degrees of freedom n.

The chi-square test is thus a general-purpose method for checking the agreement between the results of an experiment and the statistical model used.
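As an illustration of such a check (the frequencies below are made up for the sketch, not taken from the text), R's chisq.test() can compare observed counts with the theoretical probabilities of a hypothesized distribution:

observed <- c(8, 22, 41, 20, 9)              # hypothetical binned counts
p_theor  <- c(0.10, 0.20, 0.40, 0.20, 0.10)  # hypothetical theoretical probabilities
chisq.test(observed, p = p_theor)            # chi-squared statistic, df and p-value

Note that chisq.test() uses k − 1 degrees of freedom; if the theoretical probabilities were themselves estimated from the data, the degrees of freedom would have to be reduced accordingly.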

The Pearson distance X²

Pyatnitsky A.M.

Russian State Medical University

In 1900, Karl Pearson proposed a simple, universal and effective method for checking the agreement between model predictions and experimental data. His "chi-square test" is the most important and most commonly used statistical test. Most problems associated with estimating unknown model parameters and checking the agreement between a model and experimental data can be solved with its help.

Let there be an a priori ("pre-experimental") model of the object or process being studied (in statistics one speaks of the "null hypothesis" H0), and results of an experiment with this object. It is necessary to decide whether the model is adequate (does it correspond to reality)? Do the results of the experiment contradict our ideas of how reality works; in other words, should H0 be rejected? Often this task can be reduced to comparing the observed (O_i = Observed) and expected under the model (E_i = Expected) average frequencies of occurrence of certain events. It is assumed that the observed frequencies were obtained in a series of N independent (!) observations made under constant (!) conditions. As a result of each observation, one of M events is registered. These events cannot occur simultaneously (they are pairwise incompatible) and one of them necessarily occurs (their union forms a certain event). The totality of all observations is reduced to a table (vector) of frequencies (O_i) = (O_1, …, O_M), which fully describes the results of the experiment. The value O_2 = 4 means that event number 2 occurred 4 times. The sum of the frequencies is O_1 + … + O_M = N. It is important to distinguish two cases: N fixed (non-random) and N a random value. For a fixed total number of observations N, the frequencies have a multinomial distribution. Let us explain this general scheme with a simple example.

Application of the chi-square test to test simple hypotheses.

Let the model (null hypothesis H0) be that the die is fair: all faces come up equally often, with probability p_i = 1/6, i = 1, …, 6, M = 6. An experiment was carried out in which the die was thrown 60 times (N = 60 independent trials were performed). Under the model, we expect all observed frequencies O_i of occurrence of 1, 2, …, 6 points to be close to their mean values E_i = N·p_i = 60·(1/6) = 10. According to H0, the vector of mean frequencies is (E_i) = (N·p_i) = (10, 10, 10, 10, 10, 10). (Hypotheses in which the mean frequencies are fully known before the start of the experiment are called simple.) If the observed vector (O_i) were (34, 0, 0, 0, 0, 26), it would immediately be clear that the model is wrong: the die cannot be fair, since only 1s and 6s came up in 60 throws. The probability of such an event for a fair die is negligible: P = (2/6)^60 ≈ 2.4·10⁻²⁹. However, the appearance of such obvious discrepancies between model and experiment is an exception. Let the vector of observed frequencies (O_i) be (5, 15, 6, 14, 4, 16). Does this agree with H0? So, we need to compare two frequency vectors, (E_i) and (O_i). The vector of expected frequencies (E_i) is not random, but the vector of observed frequencies (O_i) is: in the next experiment (a new series of 60 throws) it will be different. It is useful to introduce a geometric interpretation of the problem and assume that two points with coordinates (5, 15, 6, 14, 4, 16) and (10, 10, 10, 10, 10, 10) are given in the frequency space (here 6-dimensional). Are they far enough apart to be considered incompatible with H0? In other words, we need to:

  1. learn how to measure distances between frequencies (points in frequency space),
  2. have a criterion for what distance should be considered too (“improbably”) large, that is, inconsistent with H 0 .

The square of the usual Euclidean distance would be:

X²_Euclid = Σ(O_i − E_i)² = (5 − 10)² + (15 − 10)² + (6 − 10)² + (14 − 10)² + (4 − 10)² + (16 − 10)²

The surfaces X²_Euclid = const are always spheres if we fix the values of E_i and vary O_i. Karl Pearson noted that one should not use the Euclidean distance in frequency space. Thus, it is wrong to assume that the points (O = 1030, E = 1000) and (O = 40, E = 10) are at an equal distance from each other, although in both cases the difference is O − E = 30. After all, the greater the expected frequency, the greater the deviations from it that should be considered possible. Therefore, the points (O = 1030, E = 1000) should be considered "close" and the points (O = 40, E = 10) "far" from each other. It can be shown that if the hypothesis H0 is true, then the fluctuations of the frequency O_i around E_i are of the order of the square root (!) of E_i. Therefore, Pearson suggested squaring not the differences (O_i − E_i) but the normalized differences (O_i − E_i)/E_i^(1/2). So, here is the formula for calculating the Pearson distance (in fact, it is the square of a distance):

X²_Pearson = Σ((O_i − E_i)/E_i^(1/2))² = Σ(O_i − E_i)²/E_i

In our example:

X²_Pearson = (5 − 10)²/10 + (15 − 10)²/10 + (6 − 10)²/10 + (14 − 10)²/10 + (4 − 10)²/10 + (16 − 10)²/10 = 15.4
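The same arithmetic in R (the observed and expected frequencies are those of the example):

O <- c(5, 15, 6, 14, 4, 16)    # observed frequencies
E <- rep(10, 6)                # expected frequencies under H0
sum((O - E)^2 / E)             # Pearson distance: 15.4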

For a fair die all expected frequencies E_i are the same, but in general they differ, so the surfaces on which the Pearson distance is constant (X²_Pearson = const) turn out to be ellipsoids rather than spheres.

Now that the formula for calculating distances has been chosen, it is necessary to find out which distances should be considered "not too large" (consistent with H0). For example, what can be said about the distance of 15.4 that we calculated? In what fraction of cases (or with what probability) would we obtain a distance greater than 15.4 when experimenting with a fair die? If this fraction is small (< 0.05), H0 must be rejected. In other words, we need to find the distribution of the Pearson distance. If all the expected frequencies E_i are not too small (≥ 5) and H0 is true, then the normalized differences (O_i − E_i)/E_i^(1/2) are approximately equivalent to standard Gaussian random variables: (O_i − E_i)/E_i^(1/2) ≈ N(0, 1). This means, for example, that in 95% of cases |(O_i − E_i)/E_i^(1/2)| < 1.96 ≈ 2 (the "two sigma" rule).

Explanation. The number of observations O_i falling into cell number i of the table has a binomial distribution with parameters m = N·p_i = E_i, σ = (N·p_i·(1 − p_i))^(1/2), where N is the number of observations (N ≫ 1) and p_i is the probability of a single observation falling into this cell (recall that the observations are independent and are performed under constant conditions). If p_i is small, then σ ≈ (N·p_i)^(1/2) = E_i^(1/2) and the binomial distribution is close to a Poisson distribution, in which the mean number of observations is E_i = λ and the standard deviation is σ = λ^(1/2) = E_i^(1/2). For λ ≥ 5 the Poisson distribution is close to the normal N(m = E_i = λ, σ = E_i^(1/2) = λ^(1/2)), and the normalized value (O_i − E_i)/E_i^(1/2) ≈ N(0, 1).

Pearson defined the random variable χ²_n, "chi-square with n degrees of freedom", as the sum of squares of n independent standard normal random variables:

χ²_n = T_1² + T_2² + … + T_n², where all T_i ~ N(0, 1) are independent identically distributed standard normal random variables.

Let us try to understand visually the meaning of this most important random variable in statistics. To do this, on a plane (for n = 2) or in space (for n = 3) we depict a cloud of points whose coordinates are independent and have the standard normal distribution f_T(x) ~ exp(−x²/2). On the plane, by the "two sigma" rule applied independently to both coordinates, 90% (0.95 × 0.95 ≈ 0.90) of the points are contained within the square −2 < x < 2, −2 < y < 2. The squared distance of a point from the origin, a = T_1² + T_2², then has the χ² distribution with two degrees of freedom, whose density is

f_{χ²_2}(a) = C·exp(−a/2) = 0.5·exp(−a/2).

For a sufficiently large number of degrees of freedom n (n > 30), the chi-square distribution approaches the normal one N(m = n, σ = (2n)^(1/2)). This is a consequence of the central limit theorem: the sum of identically distributed quantities having finite variance approaches the normal law as the number of terms increases.

In practice, it is useful to remember that the mean of the squared distance is m(χ²_n) = n and its variance is σ²(χ²_n) = 2n. From this it is easy to conclude which chi-square values should be considered too small or too large: most of the distribution lies in the range from n − 2·(2n)^(1/2) to n + 2·(2n)^(1/2).

So, Pearson distances substantially exceeding n + 2·(2n)^(1/2) should be considered implausibly large (inconsistent with H0). If the result is close to n + 2·(2n)^(1/2), you should use tables, from which you can find out exactly in what fraction of cases such or larger chi-square values may appear.

It is important to know how to choose the right number of degrees of freedom (abbreviated n.d.f.). It seemed natural to assume that n is simply equal to the number of bins: n = M. Pearson assumed this in his article. In the dice example this would mean n = 6. However, a few years later it was shown that Pearson was wrong: the number of degrees of freedom is always less than the number of bins if there are relations among the random variables O_i. For the dice example the sum of the O_i is 60, so only 5 of the frequencies can be changed independently, and the correct value is n = 6 − 1 = 5. For this value of n we get n + 2·(2n)^(1/2) = 5 + 2·(10)^(1/2) ≈ 11.3. Since 15.4 > 11.3, the hypothesis H0 that the die is fair should be rejected.

After the error was clarified, the existing χ² tables had to be supplemented, since initially they did not include the case n = 1 (the smallest number of bins being 2). Now it turned out that there can be cases when the Pearson distance has the distribution χ²_{n=1}.

Example. In 100 tosses of a coin the number of heads is O_1 = 65 and of tails O_2 = 35. The number of bins is M = 2. If the coin is symmetrical, the expected frequencies are E_1 = 50, E_2 = 50.

X²_Pearson = Σ(O_i − E_i)²/E_i = (65 − 50)²/50 + (35 − 50)²/50 = 2·225/50 = 9.

The resulting value should be compared with the values the random variable χ²_{n=1} can take, defined as the square of a standard normal value: χ²_{n=1} = T_1² ≥ 9 ⇔ T_1 ≥ 3 or T_1 ≤ −3. The probability of such an event is very small: P(χ²_{n=1} ≥ 9) ≈ 0.003. Therefore, the coin cannot be considered symmetrical: H0 should be rejected. That the number of degrees of freedom cannot be equal to the number of bins is seen from the fact that the sum of the observed frequencies is always equal to the sum of the expected ones, for example O_1 + O_2 = 65 + 35 = E_1 + E_2 = 50 + 50 = 100. Therefore, random points with coordinates O_1 and O_2 lie on the straight line O_1 + O_2 = E_1 + E_2 = 100, and the distance to the center turns out to be smaller than it would be without this restriction, when the points could lie anywhere in the plane. Indeed, for two independent random variables with mathematical expectations E_1 = 50, E_2 = 50, the sum of their realizations would not always equal 100; for example, the values O_1 = 60, O_2 = 55 would be admissible.
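A short R check of the coin example (chisq.test() applied to a single vector of counts performs exactly this goodness-of-fit calculation):

O <- c(65, 35)                               # heads and tails in 100 tosses
sum((O - c(50, 50))^2 / c(50, 50))           # Pearson distance: 9
pchisq(9, df = 1, lower.tail = FALSE)        # about 0.0027
chisq.test(O, p = c(0.5, 0.5))               # same statistic and p-value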

Explanation. Let us compare the result of the Pearson criterion for M = 2 with what the de Moivre–Laplace formula gives when estimating random fluctuations of the frequency ν = K/N of an event having probability p in a series of N independent Bernoulli trials (K is the number of successes):

χ²_{n=1} = Σ(O_i − E_i)²/E_i = (O_1 − E_1)²/E_1 + (O_2 − E_2)²/E_2 = (Nν − Np)²/(Np) + (N(1 − ν) − N(1 − p))²/(N(1 − p)) =

= (Nν − Np)²·(1/p + 1/(1 − p))/N = (Nν − Np)²/(Np(1 − p)) = ((K − Np)/(Npq)^(1/2))² = T²

The value T = (K − Np)/(Npq)^(1/2) = (K − m(K))/σ(K) ≈ N(0, 1) when σ(K) = (Npq)^(1/2) ≥ 3. We see that in this case Pearson's result coincides exactly with what the normal approximation to the binomial distribution gives.

So far we have considered simple hypotheses, for which the expected mean frequencies E_i are completely known in advance. How to choose the correct number of degrees of freedom for complex hypotheses is described below.

Applying the Chi-Square Test to Test Complex Hypotheses

In the examples with the fair die and the fair coin, the expected frequencies could be determined before (!) the experiment. Such hypotheses are called "simple". In practice, "complex hypotheses" are more common: to find the expected frequencies E_i, one or several quantities (model parameters) must first be estimated, and this can only be done from the experimental data. As a result, for "complex hypotheses" the expected frequencies E_i turn out to depend on the observed frequencies O_i and therefore themselves become random variables, varying with the results of the experiment. In the process of fitting the parameters the Pearson distance decreases: the parameters are chosen so as to improve the agreement between model and experiment. Therefore, the number of degrees of freedom should decrease.

How are model parameters estimated? There are many estimation methods: the "maximum likelihood method", the "method of moments", the "substitution method". However, one can do without any additional tools and find parameter estimates by minimizing the Pearson distance. In the pre-computer era this approach was rarely used: it is inconvenient for manual calculation and, as a rule, does not admit an analytical solution. When computing on a computer, numerical minimization is usually easy to carry out, and the advantage of this method is its universality. So, according to the "chi-square minimization method", we choose the values of the unknown parameters so that the Pearson distance becomes smallest. (Incidentally, by studying how this distance changes under small shifts around the found minimum, one can estimate the accuracy of the estimate, i.e. construct confidence intervals.) After the parameters and this minimal distance itself have been found, it is again necessary to answer the question of whether it is small enough.

The general sequence of actions is as follows:

  1. Choice of the model (hypothesis H0).
  2. Choice of bins and determination of the vector of observed frequencies O_i.
  3. Estimation of the unknown model parameters and construction of confidence intervals for them (for example, by searching for the minimum of the Pearson distance).
  4. Calculation of the expected frequencies E_i.
  5. Comparison of the found value of the Pearson distance X² with the critical chi-square value χ²_crit: the largest value that is still considered plausible, i.e. compatible with H0. We find χ²_crit from the tables by solving the equation

P(χ²_n > χ²_crit) = α,

where α is the “significance level” or “test size” or “Type I error value” (typical value α=0.05).

Usually the number of degrees of freedom n is calculated by the formula

n = (number of bins) − 1 − (number of estimated parameters)

If X² > χ²_crit, the hypothesis H0 is rejected; otherwise it is accepted. In α·100% of cases (that is, quite rarely) this way of checking H0 will lead to an "error of the first kind": the hypothesis H0 will be rejected erroneously.

Example. In a study, 10 series of 100 seeds each were examined and the number of seeds infested by the green-eyed fly was counted. The data obtained: O_i = (16, 18, 11, 18, 21, 10, 20, 18, 17, 21);

Here the vector of expected frequencies is unknown in advance. If the data are homogeneous and come from a binomial distribution, then one parameter is unknown: the proportion p of infested seeds. Note that the original table in fact contains not 10 but 20 frequencies, which satisfy 10 constraints: 16 + 84 = 100, …, 21 + 79 = 100.

X² = (16 − 100p)²/(100p) + (84 − 100(1 − p))²/(100(1 − p)) + … +

(21 − 100p)²/(100p) + (79 − 100(1 − p))²/(100(1 − p))

Combining the terms in pairs (as in the coin example), we obtain the form in which the Pearson criterion is usually written straight away:

X² = (16 − 100p)²/(100p(1 − p)) + … + (21 − 100p)²/(100p(1 − p)).

Now, if we use the minimum Pearson distance as the method for estimating p, then we need to find the p for which X² = min. (The model tries, as far as possible, to "adjust" to the experimental data.)
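A minimal R sketch of this minimization (the data are those of the example; optimize() does the numerical search):

infested <- c(16, 18, 11, 18, 21, 10, 20, 18, 17, 21)  # infested seeds per 100
pearson_dist <- function(p) sum((infested - 100 * p)^2 / (100 * p * (1 - p)))
fit <- optimize(pearson_dist, interval = c(0.01, 0.99))
fit$minimum     # minimum chi-square estimate of p (close to 170/1000 = 0.17)
fit$objective   # the minimized Pearson distance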

The Pearson criterion is the most universal of all used in statistics. It can be applied to one-dimensional and multidimensional data, quantitative and qualitative features. However, it is precisely because of the universality that one must be careful not to make mistakes.

Important Points

1. Choice of bins.

  • If the distribution is discrete, there is usually no arbitrariness in the choice of bins.
  • If the distribution is continuous, arbitrariness is inevitable. One can use statistically equivalent blocks (all O_i the same, for example = 10); the lengths of the intervals are then different. In manual calculations one used to try to make the intervals equal. Should the intervals be equal when studying the distribution of a one-dimensional feature? No.
  • The bins must be combined so that the expected (not the observed!) frequencies are not too small (≥ 5). Recall that it is they (the E_i) that stand in the denominators when calculating X²! When analyzing one-dimensional features, this rule may be violated in the two extreme bins, where E may be as small as 1. If the number of bins is large and the expected frequencies are close to each other, then X² is well approximated by χ² even for E_i = 2.

2. Parameter estimation. The use of ad hoc, inefficient estimation methods can lead to inflated values of the Pearson distance.

3. Choosing the right number of degrees of freedom. If parameter estimates are made not from the frequencies but directly from the data (for example, if the arithmetic mean is taken as an estimate of the mean), then the exact number of degrees of freedom n is unknown. We only know that it satisfies the inequality:

(number of bins − 1 − number of estimated parameters) < n < (number of bins − 1)

Therefore, X² has to be compared with the critical values χ²_crit calculated over this whole range of n.

4. How should implausibly small chi-square values be interpreted? Should a coin be considered symmetrical if, in 10,000 tosses, it came up heads exactly 5,000 times? Previously, many statisticians believed that H0 should be rejected in this case as well. Now another approach is proposed: accept H0, but subject the data and the method of their analysis to additional checking. There are two possibilities: either a too small Pearson distance means that the increase in the number of model parameters was not accompanied by a proper decrease in the number of degrees of freedom, or the data themselves were falsified (perhaps unintentionally adjusted to the expected result).

Example. Two researchers, A and B, calculated the proportion of recessive homozygotes aa in the second generation of an AA × aa monohybrid cross. According to Mendel's laws this proportion is 0.25. Each researcher carried out 5 experiments, and in each experiment 100 organisms were studied.

Results A: 25, 24, 26, 25, 24. Researcher's conclusion: Mendel's law is valid (?).

Results B: 29, 21, 23, 30, 19. Researcher's conclusion: Mendel's law is not valid (?).

However, Mendel's law is statistical in nature, and a quantitative analysis of the results reverses the conclusions! Combining five experiments into one, we arrive at a chi-square distribution with 5 degrees of freedom (a simple hypothesis is being tested):

X²_A = ((25 − 25)² + (24 − 25)² + (26 − 25)² + (25 − 25)² + (24 − 25)²)/(100·0.25·0.75) = 0.16

X²_B = ((29 − 25)² + (21 − 25)² + (23 − 25)² + (30 − 25)² + (19 − 25)²)/(100·0.25·0.75) = 5.17

The mean value is m[χ²_{n=5}] = 5, the standard deviation σ[χ²_{n=5}] = (2·5)^(1/2) ≈ 3.2.

Therefore, even without consulting tables it is clear that the value X²_B is typical, while the value X²_A is implausibly small. According to the tables, P(χ²_{n=5} < 0.16) < 0.001.

This example is an adapted version of a real case that occurred in the 1930s (see Kolmogorov's work “On Another Proof of Mendel's Laws”). Curiously, researcher A was in favor of genetics, while researcher B was against it.

5. Confusion of notation. One must distinguish the Pearson distance, whose calculation requires additional conventions, from the mathematical notion of the chi-square random variable. Under certain conditions the Pearson distance has a distribution close to chi-square with n degrees of freedom. Therefore it is desirable NOT to denote the Pearson distance by χ²_n, but to use the similar yet different notation X².

6. The Pearson criterion is not omnipotent. There are infinitely many alternatives to H0 that it is unable to take into account. Suppose you test the hypothesis that a feature is uniformly distributed, you have 10 bins, and the vector of observed frequencies is (130, 125, 121, 118, 116, 115, 114, 113, 111, 110). The Pearson criterion cannot "notice" that the frequencies decrease monotonically, and H0 will not be rejected. If it were supplemented by a runs test, then yes!
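This is easy to verify in R: the chi-square test applied to these monotonically decreasing frequencies gives a large p-value and does not reject uniformity.

O <- c(130, 125, 121, 118, 116, 115, 114, 113, 111, 110)
chisq.test(O)   # H0: all 10 bins equally likely; the p-value is large, H0 not rejected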

The quantitative study of biological phenomena necessarily requires the creation of hypotheses with which to explain these phenomena. To test a particular hypothesis, a series of special experiments is set up and the actual data obtained are compared with those theoretically expected under the hypothesis. If there is agreement, this may be sufficient reason to accept the hypothesis. If the experimental data agree poorly with the theoretically expected ones, there arises great doubt about the correctness of the proposed hypothesis.

The degree of agreement of the actual data with the expected (hypothetical) data is measured by the chi-square goodness-of-fit criterion:

χ² = Σ (O_i − E_i)² / E_i,

where O_i is the actually observed value of the feature in the i-th group, E_i is the theoretically expected number for that group, and k is the number of groups of data.

The criterion was proposed by K. Pearson in 1900 and is sometimes called Pearson's criterion.

Task. Among 164 children who inherited one factor from one parent and the other factor from the other, there were 46 children with the first factor, 50 with the second factor, and 68 with both. Calculate the expected frequencies for a 1 : 2 : 1 ratio between the groups and determine the degree of agreement of the empirical data with the theoretical ones using Pearson's test.

Solution: The ratio of observed frequencies is 46 : 68 : 50, the theoretically expected one is 41 : 82 : 41. The statistic is χ² = (46 − 41)²/41 + (68 − 82)²/82 + (50 − 41)²/41 ≈ 4.98.

Let's set the significance level to 0.05. The tabular value of the Pearson criterion for this significance level and the number of degrees of freedom 2 (three groups minus one) is 5.99. Therefore, the hypothesis that the experimental data correspond to the theoretical ones can be accepted, since χ² ≈ 4.98 < 5.99.
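The same check in R (observed counts against the 1 : 2 : 1 expectation):

chisq.test(c(46, 68, 50), p = c(1, 2, 1) / 4)   # X-squared about 4.98, df = 2, p > 0.05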

Note that when calculating the chi-square test, we no longer set the condition for the indispensable normality of the distribution. The chi-square test can be used for any distributions that we are free to choose in our assumptions. There is some universality in this criterion.

Another application of Pearson's criterion is the comparison of an empirical distribution with the Gaussian normal distribution. In this sense it can be assigned to the group of criteria for checking normality of a distribution. The only restriction is that the total number of values (variants) when using this criterion must be large enough (at least 40), and the number of values in individual classes (intervals) must be at least 5; otherwise adjacent intervals should be combined. When checking normality, the number of degrees of freedom should be calculated as n = (number of classes) − 3, since two parameters of the normal distribution (the mean and the variance) are estimated from the sample.

    1. Fisher's criterion.

This parametric test serves to check the null hypothesis that the variances of normally distributed populations are equal: H0: σ₁² = σ₂².

For small sample sizes, the application of the Student's t-test can be correct only if the variances are equal. Therefore, before testing the equality of sample means, it is necessary to make sure that the Student's t-test is valid.

The criterion is calculated as the ratio F = s₁²/s₂² of the larger sample variance to the smaller one, where N₁, N₂ are the sample sizes and ν₁ = N₁ − 1, ν₂ = N₂ − 1 are the numbers of degrees of freedom for these samples.

When using tables, it should be noted that the number of degrees of freedom for a sample with a larger variance is chosen as the column number of the table, and for a smaller variance, as the row number of the table.

For the significance level according to the tables of mathematical statistics, we find a tabular value. If, then the hypothesis of equality of variances is rejected for the chosen level of significance.

Example. The effect of cobalt on the body weight of rabbits was studied. The experiment was carried out on two groups of animals: experimental and control. The experimental group received a dietary supplement in the form of an aqueous solution of cobalt chloride, and the weight gains of both groups during the experiment were recorded in grams.

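Purely hypothetical weight gains are used in the sketch below to show how such a comparison of variances is carried out in R (var.test() computes the F ratio and its p-value):

experimental <- c(120, 135, 128, 140, 132, 125, 138)   # hypothetical gains, g
control      <- c(118, 122, 119, 125, 121, 117, 123)   # hypothetical gains, g
var.test(experimental, control)    # F statistic, degrees of freedom and p-value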

The \(\chi^2\) test ("chi-square", also "Pearson's goodness-of-fit test") has an extremely wide application in statistics. In general terms, we can say that it is used to test the null hypothesis about the obedience of an observed random variable to a certain theoretical distribution law (for more details, see, for example,). The specific formulation of the hypothesis being tested will vary from case to case.

In this post, I will describe how the \(\chi^2\) test works using a (hypothetical) example from immunology. Imagine that we have performed an experiment to determine the effectiveness of suppressing the development of a microbial disease when the appropriate antibodies are introduced into the body. In total, 111 mice were involved in the experiment, which we divided into two groups, including 57 and 54 animals, respectively. The first group of mice was injected with pathogenic bacteria, followed by the introduction of blood serum containing antibodies against these bacteria. Animals from the second group served as controls - they received only bacterial injections. After some time of incubation, it turned out that 38 mice died, and 73 survived. Of the dead, 13 belonged to the first group, and 25 belonged to the second (control). The null hypothesis tested in this experiment can be formulated as follows: the administration of serum with antibodies has no effect on the survival of mice. In other words, we argue that the observed differences in the survival of mice (77.2% in the first group versus 53.7% in the second group) are completely random and are not associated with the action of antibodies.

The data obtained in the experiment can be presented in the form of a table:

                     Dead   Survived   Total
Bacteria + serum       13         44      57
Only bacteria          25         29      54
Total                  38         73     111

Tables like this one are called contingency tables. In this example, the table has a dimension of 2x2: there are two classes of objects ("Bacteria + serum" and "Bacteria only"), which are examined according to two criteria ("Dead" and "Survived"). This is the simplest case of a contingency table: of course, both the number of classes under study and the number of features can be larger.

To test the null hypothesis formulated above, we need to know what the situation would look like if the antibodies really had no effect on the survival of the mice. In other words, we need to calculate the expected frequencies for the corresponding cells of the contingency table. How is this done? In total, 38 mice died in the experiment, which is 34.2% of the total number of animals involved. If the introduction of antibodies does not affect survival, the same mortality percentage, 34.2%, should be observed in both experimental groups. Taking 34.2% of 57 and of 54, we get 19.5 and 18.5. These are the expected mortality counts in our experimental groups. The expected survival counts are calculated similarly: since a total of 73 mice survived, or 65.8% of the total, the expected survival counts are 37.5 and 35.5. Let's make a new contingency table, now with the expected frequencies:

                     Dead   Survived   Total
Bacteria + serum     19.5       37.5      57
Only bacteria        18.5       35.5      54
Total                  38         73     111

As you can see, the expected frequencies are quite different from the observed ones, i.e. administration of antibodies does seem to have an effect on the survival of mice infected with the pathogen. We can quantify this impression using Pearson's goodness-of-fit test \(\chi^2\):

\[\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e},\]


where \(f_o\) and \(f_e\) are the observed and expected frequencies, respectively. The summation is performed over all cells of the table. So, for the example under consideration, we have

\[\chi^2 = (13 - 19.5)^2/19.5 + (44 - 37.5)^2/37.5 + (25 - 18.5)^2/18.5 + (29 - 35.5)^2/35.5 \approx 6.79\]

(the value 6.79 corresponds to using the expected frequencies without rounding; with the rounded values shown, the sum is about 6.77).

Is \(\chi^2\) large enough to reject the null hypothesis? To answer this question we need to find the corresponding critical value of the criterion. The number of degrees of freedom for \(\chi^2\) is calculated as \(df = (R - 1)(C - 1)\), where \(R\) and \(C\) are the numbers of rows and columns in the contingency table. In our case \(df = (2 - 1)(2 - 1) = 1\). Knowing the number of degrees of freedom, we can now easily find the critical value of \(\chi^2\) using the standard R function qchisq():

qchisq(p = 0.95, df = 1)
[1] 3.841459
Thus, for one degree of freedom, the value of the criterion \(\chi^2\) exceeds 3.841 only in 5% of cases. The value we obtained, 6.79, significantly exceeds this critical value, which gives us the right to reject the null hypothesis that there is no relationship between the administration of antibodies and the survival of infected mice. Rejecting this hypothesis, we risk being wrong with a probability of less than 5%.

It should be noted that the above formula for the \(\chi^2\) criterion gives somewhat inflated values when working with 2x2 contingency tables. The reason is that the distribution of the \(\chi^2\) criterion itself is continuous, while the frequencies of binary features ("died"/"survived") are by definition discrete. In this connection, when calculating the criterion it is customary to introduce the so-called continuity correction, or Yates' correction:

\[\chi^2_Y = \sum \frac{(|f_o - f_e| - 0.5)^2}{f_e}.\]
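In R the whole test can be run with chisq.test(); below is a sketch assuming the counts are arranged in a 2x2 matrix called mice (the name matches the output that follows):

mice <- matrix(c(13, 44,
                 25, 29),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("Bacteria + serum", "Only bacteria"),
                               c("Dead", "Survived")))
chisq.test(mice)   # Yates' continuity correction is applied automatically for 2x2 tables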

Pearson "s Chi-squared test with Yates" continuity correction data : mice X-squared = 5.7923 , df = 1 , p-value = 0.0161


As you can see, R automatically applies the Yates continuity correction (Pearson's Chi-squared test with Yates' continuity correction). The value of \(\chi^2\) calculated by the program is 5.7923. We can reject the null hypothesis of no antibody effect at the risk of being wrong with a probability of just over 1% (p-value = 0.0161).