How to calculate confidence interval. Questions for self-examination of students

Konstantin Kravchik clearly explains what a confidence interval is in medical research and how to use it

"Katren-Style" continues to publish Konstantin Kravchik's series on medical statistics. In the two previous articles, the author explained other basic concepts of medical statistics.

Konstantin Kravchik

Mathematician and analyst. Specialist in statistical research in medicine and the humanities

Moscow

Articles on clinical trials very often contain a mysterious phrase: "confidence interval" (95% CI). For example, an article might say: "Student's t-test was used to assess the significance of differences, with a 95% confidence interval calculated."

What does the "95% confidence interval" tell us, and why calculate it?

What is a confidence interval? It is the range in which the true mean of the population is expected to lie. Are there, then, "untrue" means? In a sense, yes. As explained in a previous article, it is impossible to measure the parameter of interest in the entire population, so researchers settle for a limited sample. In this sample (for example, of body weights) there is one mean value (a certain weight), from which we judge the mean in the entire general population. However, the mean weight in the sample (especially a small one) is unlikely to coincide with the mean weight in the general population. It is therefore more correct to calculate and report a range of plausible values for the population mean.

For example, suppose the 95% confidence interval (95% CI) for hemoglobin is 110 to 122 g/L. This means that, with 95% probability, the true mean hemoglobin in the general population lies between 110 and 122 g/L. In other words, we do not know the mean hemoglobin in the general population, but we can state a range that contains it with 95% probability. (Strictly speaking, the 95% describes the procedure: intervals constructed this way cover the true mean in about 95% of repeated samples.)

Confidence intervals are particularly relevant to the difference in means between groups, or what is called the effect size.

Suppose we compared the effectiveness of two iron preparations: one that has been on the market for a long time and one that has just been registered. After a course of therapy, the hemoglobin concentration in the studied groups of patients was assessed, and using a statistical program we calculated that the difference between the mean values of the two groups lies, with 95% probability, in the range from 1.72 to 14.36 g/L (Table 1).

Table 1. Test for independent samples (groups are compared by hemoglobin level)

This should be interpreted as follows: in the general population, patients who take the new drug will have hemoglobin that is higher on average by 1.72–14.36 g/L than patients who take the already known drug.

In other words, in the general population, the difference in mean hemoglobin between the groups lies within these limits with 95% probability. It is up to the researcher to judge whether this is a lot or a little. The point is that we work not with a single mean value but with a range of values, and therefore estimate the difference in a parameter between groups more reliably.

In statistical packages, the researcher can narrow or widen the boundaries of the confidence interval at their own discretion. Lowering the confidence level narrows the range of means. For example, at 90% CI, the range of means (or mean differences) will be narrower than at 95% CI.

Conversely, increasing the confidence level to 99% widens the range of values. When comparing groups, the lower limit of the CI may then cross zero. For example, if we raised the confidence level to 99%, the interval would run from –1 to 16 g/L. This means that a difference of zero between the group means (M = 0) is among the values compatible with the data.

Confidence intervals can be used to test statistical hypotheses. If the confidence interval for the difference includes zero, then the null hypothesis, which assumes that the groups do not differ in the studied parameter, cannot be rejected at the corresponding significance level. An example is described above, where we raised the confidence level to 99%: at that level, the data no longer rule out the absence of any difference between the groups.
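The effect of the confidence level on the interval, and the check for zero, can be sketched in Python. This is only an illustration under a normal approximation (an assumption: a statistical package would normally use the t-distribution, so the 99% limits differ somewhat from the –1 to 16 g/L quoted above). The standard error is backed out of the 95% interval of 1.72 to 14.36 g/L, and the function name `difference_interval` is made up for this example.

```python
from statistics import NormalDist

# 95% CI for the difference in mean hemoglobin quoted in the article (g/L).
LOW_95, HIGH_95 = 1.72, 14.36

# Point estimate and standard error implied by that interval
# (assumption: normal approximation rather than the t-distribution).
diff = (LOW_95 + HIGH_95) / 2
std_err = (HIGH_95 - LOW_95) / 2 / NormalDist().inv_cdf(0.975)


def difference_interval(level):
    """Two-sided CI for the mean difference at the given confidence level."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return diff - z * std_err, diff + z * std_err


for level in (0.90, 0.95, 0.99):
    low, high = difference_interval(level)
    verdict = "includes zero" if low <= 0 <= high else "excludes zero"
    print(f"{level:.0%} CI: {low:6.2f} to {high:6.2f} g/L ({verdict})")
```

Under these assumptions the 90% and 95% intervals exclude zero, while the 99% interval includes it.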

Figure: 95% confidence interval of the difference in hemoglobin (g/L)


The figure shows the 95% confidence interval of the mean hemoglobin difference between the two groups as a line. The line crosses zero, so a difference of zero between the means is compatible with the data, and the null hypothesis that the groups do not differ cannot be rejected. The difference between the groups ranges from –2 to 5 g/L, which means that hemoglobin may be lower by up to 2 g/L or higher by up to 5 g/L.

The confidence interval is a very important indicator. It shows whether a statistically significant result reflects a meaningful difference in the means or merely a large sample, because with a large sample the chances of detecting a difference are greater than with a small one.

In practice, it might look like this. We took a sample of 1000 people, measured the hemoglobin level, and found that the confidence interval for the difference in the means runs from 1.2 to 1.5 g/L. The difference is statistically significant in this case.

We see that the hemoglobin concentration increased, but almost imperceptibly, therefore, the statistical significance appeared precisely due to the sample size.
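How strongly the sample size drives the width of the interval can be sketched as follows. The standard deviation of 15 g/L is a hypothetical value chosen only for illustration (it is not given in the article), and a normal approximation is assumed.

```python
from math import sqrt
from statistics import NormalDist

SD = 15.0  # hypothetical standard deviation of hemoglobin, g/L
Z_95 = NormalDist().inv_cdf(0.975)  # two-sided critical value, about 1.96


def half_width(n, sd=SD):
    """Half-width of the 95% CI for a mean estimated from n observations."""
    return Z_95 * sd / sqrt(n)


for n in (10, 100, 1000, 10000):
    print(f"n = {n:>5}: mean ± {half_width(n):.2f} g/L")
```

Because the half-width shrinks in proportion to 1/√n, a large enough sample makes even a clinically negligible difference statistically distinguishable from zero.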

Confidence intervals can be calculated not only for averages, but also for proportions (and risk ratios). For example, we are interested in the confidence interval of the proportions of patients who achieved remission while taking the developed drug. Assume that the 95% CI for proportions, i.e. for the proportion of such patients, is in the range 0.60–0.80. Thus, we can say that our medicine has a therapeutic effect in 60 to 80% of cases.

One of the methods for solving statistical problems is the calculation of a confidence interval. It is used as a preferred alternative to point estimation when the sample size is small. The calculation itself is rather complicated, but Excel's tools simplify it somewhat. Let's find out how this is done in practice.

This method is used in the interval estimation of various statistical quantities. The main task of this calculation is to get rid of the uncertainties of the point estimate.

In Excel, there are two main ways to perform this calculation: when the variance is known and when it is unknown. In the first case, the CONFIDENCE.NORM function is used, and in the second, CONFIDENCE.T.

Method 1: CONFIDENCE.NORM function

The CONFIDENCE.NORM operator, which belongs to the statistical group of functions, first appeared in Excel 2010. Earlier versions of the program use its counterpart CONFIDENCE. The operator calculates the half-width (margin) of a confidence interval for the population mean, assuming a normal distribution.

Its syntax is as follows:

CONFIDENCE.NORM(alpha, standard_dev, size)

"Alpha" is an argument indicating the level of significance that is used to calculate the confidence level. The confidence level is equal to the following expression:

(1-"Alpha")*100

"Standard deviation" is an argument whose meaning is clear from the name: the standard deviation, which this function assumes to be known.

"Size" is an argument that determines the size of the sample.

All arguments to this operator are required.

The CONFIDENCE function has exactly the same arguments and capabilities as the previous one. Its syntax is:

CONFIDENCE(alpha, standard_dev, size)

As you can see, the only difference is the name of the operator. This function has been kept in Excel 2010 and newer versions for compatibility reasons, in a special category called "Compatibility". In Excel 2007 and earlier, it is in the main group of statistical operators.

The confidence interval boundary is determined using the formula of the following form:

X ± CONFIDENCE.NORM

where X is the sample mean, which lies at the center of the interval.

Now let's look at how to calculate the confidence interval using a specific example. Twelve tests were carried out, and their results are listed in the table. This is our population. The standard deviation is 8. We need to calculate the confidence interval at the 97% confidence level.

  1. Select the cell where the result will be displayed and click the "Insert Function" button.
  2. The Function Wizard appears. Go to the "Statistical" category and select "CONFIDENCE.NORM". Then click OK.
  3. The arguments window opens. Its fields correspond to the names of the arguments.
    Place the cursor in the first field, "Alpha". Here we specify the significance level. As we remember, our confidence level is 97%. Since the confidence level equals (1 − alpha) × 100, alpha is calculated as:

    (100 − confidence level)/100

    That is, substituting the value, we get:

    (100 − 97)/100

    A simple calculation shows that the "Alpha" argument equals 0.03. Enter this value in the field.

    As we know, the standard deviation is 8. Therefore, in the "Standard deviation" field, simply enter that number.

    In the "Size" field, enter the number of tests performed. As we remember, there are 12. But to automate the formula so that it does not have to be edited every time a new test is run, let's set this value not as a plain number but with the COUNT operator. Place the cursor in the "Size" field, and then click the triangle to the left of the formula bar.

    A list of recently used functions appears. If you have used COUNT recently, it should be on this list; in that case, just click its name. Otherwise, go to the item "More functions...".

  4. The already familiar Function Wizard appears. Go back to the "Statistical" group, select "COUNT" there, and click OK.
  5. The arguments window for this operator appears. The function counts the cells in the specified range that contain numeric values. Its syntax is:

    COUNT(value1, value2,…)

    The "Value" argument group is a reference to the range in which you want to count the cells filled with numeric data. There can be up to 255 such arguments, but in our case we need only one.

    Place the cursor in the "Value1" field and, holding down the left mouse button, select the range on the sheet that contains our population. Its address will then be displayed in the field. Click OK.

  6. The application then performs the calculation and displays the result in the cell that contains the formula. In our case, the formula turned out like this:

    =CONFIDENCE.NORM(0.03,8,COUNT(B2:B13))

    The result of the calculation was 5.011609.

  7. But that is not all. As we remember, the boundaries of the confidence interval are obtained by adding the result of CONFIDENCE.NORM to the sample mean and by subtracting it from the sample mean. This gives the right and left boundaries of the confidence interval, respectively. The sample mean itself can be calculated with the AVERAGE operator.

    This operator is designed to calculate the arithmetic mean of the selected range of numbers. It has the following rather simple syntax:

    AVERAGE(number1, number2,…)

    Argument "Number" can be either a single numeric value or a reference to cells or even entire ranges that contain them.

    So, select the cell in which the mean will be displayed, and click the "Insert Function" button.

  8. The Function Wizard opens. Go back to the "Statistical" category and select "AVERAGE" from the list. As always, click OK.
  9. The arguments window opens. Place the cursor in the "Number1" field and, holding down the left mouse button, select the entire range of values. After the coordinates appear in the field, click OK.
  10. AVERAGE then outputs the result of the calculation to the cell.
  11. Now we calculate the right boundary of the confidence interval. To do this, select a separate cell, type "=", and add together the contents of the cells that hold the results of AVERAGE and CONFIDENCE.NORM. To perform the calculation, press Enter.

    Calculation result: 6.953276

  12. In the same way, we calculate the left boundary of the confidence interval, only this time we subtract the result of CONFIDENCE.NORM from the result of AVERAGE.

    Calculation result: -3.06994

  13. We have described each step and each formula in detail, but all the actions can be combined in one formula. The right boundary of the confidence interval can be written as follows:

    =AVERAGE(B2:B13)+CONFIDENCE.NORM(0.03,8,COUNT(B2:B13))

  14. A similar calculation of the left boundary would look like this:

    =AVERAGE(B2:B13)-CONFIDENCE.NORM(0.03,8,COUNT(B2:B13))
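For readers who want to check the spreadsheet result outside Excel, the same calculation can be sketched in Python. The helper name `confidence_norm` is made up for this example; it computes z × standard deviation / √n, which is the margin that CONFIDENCE.NORM returns. The list `values` is hypothetical and only stands in for cells B2:B13, whose contents are not given in the article.

```python
from math import sqrt
from statistics import NormalDist, mean


def confidence_norm(alpha, standard_dev, size):
    """Margin of error z * sd / sqrt(n) for a two-sided interval."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * standard_dev / sqrt(size)


# Hypothetical test results standing in for cells B2:B13.
values = [4, -3, 12, 0, 7, -6, 9, 1, -2, 5, 3, -7]

margin = confidence_norm(0.03, 8, len(values))  # len() plays the role of COUNT
center = mean(values)                           # mean() plays the role of AVERAGE
print(f"margin = {margin:.6f}")
print(f"left = {center - margin:.6f}, right = {center + margin:.6f}")
```

With alpha = 0.03, a standard deviation of 8, and 12 observations, the margin agrees with the value of 5.011609 reported in step 6; the boundaries depend on the data actually used.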

Method 2: CONFIDENCE.T function

In addition, Excel has another function related to the calculation of the confidence interval: CONFIDENCE.T. It has been available only since Excel 2010. This operator calculates the margin of the confidence interval for the population mean using Student's t-distribution. It is very convenient when the variance and, accordingly, the standard deviation are unknown. The operator syntax is:

CONFIDENCE.T(alpha,standard_dev,size)

As you can see, the names of the arguments remain unchanged.

Let's see how to calculate the boundaries of the confidence interval with an unknown standard deviation, using the same population that we considered in the previous method. As last time, we take a confidence level of 97%.

  1. Select the cell in which the calculation will be made. Click the "Insert Function" button.
  2. In the Function Wizard that opens, go to the "Statistical" category. Select "CONFIDENCE.T" and click OK.
  3. The arguments window for this operator opens.

    In the "Alpha" field, given that the confidence level is 97%, enter 0.03. We will not dwell on how this parameter is calculated a second time.

    After that, place the cursor in the "Standard deviation" field. This time the value is unknown to us and has to be calculated. This is done with a special function, STDEV.S. To open this operator's window, click the triangle to the left of the formula bar. If the desired name is not in the list that opens, go to the item "More functions...".

  4. The Function Wizard starts. Go to the "Statistical" category and select "STDEV.S". Then click OK.
  5. The arguments window opens. The STDEV.S operator estimates the standard deviation from a sample. Its syntax is:

    STDEV.S(number1,number2,…)

    It is easy to guess that the "Number" argument is the address of a sample element. If the sample is placed in a single array, one argument referring to this range is enough.

    Place the cursor in the "Number1" field and, as always, holding down the left mouse button, select the population. After the coordinates appear in the field, do not rush to click OK, because the result would be incorrect. First we need to return to the arguments window of CONFIDENCE.T to enter the last argument. To do this, click its name in the formula bar.

  6. The arguments window of the already familiar function opens again. Place the cursor in the "Size" field. Again, click the familiar triangle to go to the choice of operators. As you understand, we need "COUNT". Since we used this function in the previous method, it is present in the list, so just click it. If you do not find it, follow the algorithm described in the first method.
  7. In the COUNT arguments window, place the cursor in the "Value1" field and, holding down the mouse button, select the population. Then click OK.
  8. The program then performs the calculation and displays the margin of the confidence interval.
  9. To determine the boundaries, we again need the sample mean. Since the calculation with AVERAGE is the same as in the previous method, and the result has not changed, we will not dwell on it a second time.
  10. Adding the results of AVERAGE and CONFIDENCE.T gives the right boundary of the confidence interval.
  11. Subtracting the result of CONFIDENCE.T from the result of AVERAGE gives the left boundary of the confidence interval.
  12. Written as a single formula, the right boundary in our case looks like this:

    =AVERAGE(B2:B13)+CONFIDENCE.T(0.03,STDEV.S(B2:B13),COUNT(B2:B13))

  13. Accordingly, the formula for the left boundary looks like this:

    =AVERAGE(B2:B13)-CONFIDENCE.T(0.03,STDEV.S(B2:B13),COUNT(B2:B13))
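A rough Python counterpart of Method 2 can be written with the standard library alone. Since the standard library has no Student's t quantile function, this sketch integrates the t density numerically (Simpson's rule) and inverts the result by bisection; that is an illustrative choice, not a description of how Excel computes its t-based margin. All function names are made up for the example, and the data list is hypothetical, standing in for cells B2:B13.

```python
import math
from statistics import mean, stdev


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x, df):
    """CDF for x >= 0: 0.5 plus a Simpson-rule integral of the density on [0, x]."""
    if x == 0:
        return 0.5
    steps = 2 * max(50, math.ceil(x / 0.02))  # even number of sub-intervals
    h = x / steps
    total = t_pdf(0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_quantile(p, df):
    """Quantile for 0.5 < p < 1, found by bisection on the CDF."""
    low, high = 0.0, 1.0
    while t_cdf(high, df) < p:  # widen the bracket until it contains the quantile
        high *= 2
    for _ in range(60):
        mid = (low + high) / 2
        if t_cdf(mid, df) < p:
            low = mid
        else:
            high = mid
    return (low + high) / 2


def confidence_t(alpha, standard_dev, size):
    """Margin of error t * sd / sqrt(n) with n - 1 degrees of freedom."""
    return t_quantile(1 - alpha / 2, size - 1) * standard_dev / math.sqrt(size)


values = [4, -3, 12, 0, 7, -6, 9, 1, -2, 5, 3, -7]  # hypothetical data
margin = confidence_t(0.03, stdev(values), len(values))  # stdev() divides by n - 1
print(f"left = {mean(values) - margin:.4f}, right = {mean(values) + margin:.4f}")
```

In everyday work a statistics library would normally supply the t quantile; the hand-written version is used here only to keep the example free of dependencies.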

As you can see, Excel's tools make it much easier to calculate the confidence interval and its boundaries. Separate operators are used for samples whose variance is known and for those whose variance is unknown.

An example of interval estimation is the confidence interval. A confidence interval is a segment, centered on the point estimate of a numerical characteristic, that contains the true value of this characteristic with a given probability. This probability is called the confidence probability. Thus, the confidence interval is a measure of the accuracy of the estimate, and the confidence probability characterizes its reliability. The size of the confidence interval depends on the confidence probability chosen by the experimenter: the higher the confidence probability, the wider the interval must be in order to contain the true value of the numerical characteristic with that probability. Often a confidence probability of $P_d = 0.95$ is chosen, on the view that this is large enough for the confidence interval to cover the true value "almost always". Only occasionally, in critical and highly critical studies, are $P_d = 0.99$ and $P_d = 0.999$ used, respectively.

The procedure for constructing a confidence interval includes two steps:

1) A probabilistic statement is written about some random function that includes the difference or ratio of the estimate and the numerical characteristic. Such a function carries information about how close these values are. The distribution law of the function must be known.

2) The probabilistic statement is transformed into a form in which the boundaries of the confidence interval for the numerical characteristic appear explicitly.

Examples of functions with a known distribution that satisfy these requirements are the following:

1) $Z = \dfrac{\bar{X} - M[X]}{\sigma[X]/\sqrt{N}}$ (3.24)

which has a normal distribution (with zero mean and unit variance) if X is normally distributed and the value of $\sigma[X]$ is known;

2) $t = \dfrac{\bar{X} - M[X]}{\hat{\sigma}[X]/\sqrt{N}}$ (3.25)

which has Student's distribution with $m = N-1$ if X is normally distributed and the value of $\sigma[X]$ is not known in advance, but its estimate $\hat{\sigma}[X]$ can be obtained from experimental data using formula (3.23);

3) $\chi^2 = \dfrac{(N-1)\,\hat{\sigma}^2[X]}{\sigma^2[X]}$ (3.26)

which has a Pearson (chi-square) distribution with $m = N-1$ if X is normally distributed.

Recall that the distribution parameter m is the number of degrees of freedom. The following notation is also used here: $\bar{X}$ is the arithmetic mean, $\sigma[X]$ is the standard deviation, equal to the square root of the variance, $\hat{\sigma}[X]$ is the estimate of the standard deviation, defined as the square root of the unbiased estimate of the variance, and N is the sample size.

The Z and t functions can be used to construct a confidence interval for the mathematical expectation, while the $\chi^2$ function is used to construct a confidence interval for the variance.


Let us construct a confidence interval for the mathematical expectation, given the results of N observations of a normally distributed quantity X whose standard deviation $\sigma[X]$ is known in advance from independent observations. Since the function Z is normally distributed, the corresponding table gives a value $z_\alpha$ such that the area under the distribution curve outside $-z_\alpha$ and $+z_\alpha$ totals $\alpha$, while the area within $[-z_\alpha, +z_\alpha]$ equals $1-\alpha$. This corresponds to the following probabilistic statement:

$P\left\{-z_\alpha \le \dfrac{\bar{X} - M[X]}{\sigma[X]/\sqrt{N}} \le +z_\alpha\right\} = 1-\alpha.$ (3.27)

(Here $\bar{X}$ is the arithmetic mean of the N observations. The probability that the inequality enclosed in curly brackets holds is $1-\alpha$.) Let's transform the expression in brackets:

$P\left\{\bar{X} - z_\alpha \dfrac{\sigma[X]}{\sqrt{N}} \le M[X] \le \bar{X} + z_\alpha \dfrac{\sigma[X]}{\sqrt{N}}\right\} = 1-\alpha.$ (3.28)

We call the value $1-\alpha = P_d$ the confidence probability. According to (3.28), with this confidence probability the confidence interval for M[X] is given by the limits:

$\bar{X} \pm z_\alpha \dfrac{\sigma[X]}{\sqrt{N}}.$ (3.29)
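The meaning of this probabilistic statement can be checked by simulation: if samples are drawn repeatedly from a normal population with known standard deviation, about a share 1 − α of the resulting intervals should cover the true mean. The population parameters, sample size, and number of trials below are hypothetical values chosen for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

random.seed(0)  # fixed seed so the run is repeatable

TRUE_MEAN, SIGMA, N = 50.0, 10.0, 25  # hypothetical population and sample size
ALPHA = 0.05
z_alpha = NormalDist().inv_cdf(1 - ALPHA / 2)  # two-sided critical value
half_width = z_alpha * SIGMA / sqrt(N)

TRIALS = 4000
covered = 0
for _ in range(TRIALS):
    sample = [random.gauss(TRUE_MEAN, SIGMA) for _ in range(N)]
    x_bar = mean(sample)
    # Count the trial if the interval around the sample mean covers the true mean.
    if x_bar - half_width <= TRUE_MEAN <= x_bar + half_width:
        covered += 1

coverage = covered / TRIALS
print(f"share of intervals covering the true mean: {coverage:.3f}")
```

The printed share should be close to 0.95, which is the confidence probability the interval was built for.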

Comment: Unfortunately, tables of the normal distribution are constructed differently in different books. Sometimes the probability integral is given

Ф(z) =

Suppose we have a large number of items whose characteristics follow a normal distribution (for example, a warehouse full of vegetables of the same type that vary in size and weight). You want to know the average characteristics of the entire batch, but you have neither the time nor the inclination to measure and weigh each vegetable, and you understand that this is not necessary. But how many items would you need to take for a random check?

Before giving some formulas useful for this situation, let us recall some notation. First, if we did measure the entire warehouse of vegetables (this set of elements is called the general population), we would find the average weight of the entire batch with all the accuracy available to us. Let's call this average $\bar{X}_{\text{gen}}$, the general mean. We already know that a normal distribution is completely determined if its mean and standard deviation $\sigma$ are known. True, so far we know neither $\bar{X}_{\text{gen}}$ nor $\sigma$ of the general population. We can only take a sample, measure the values we need, and calculate for this sample both the mean $\bar{X}_{\text{sample}}$ and the standard deviation $S_{\text{sample}}$.

It is known that if our sample contains a large number of elements (usually n greater than 30) and they are chosen truly at random, then $\sigma$ of the general population will hardly differ from $S_{\text{sample}}$. In addition, for the case of a normal distribution, we can use the following formulas:

With a probability of 95%:

$\bar{X}_{\text{sample}} - 1.96\,\dfrac{\sigma}{\sqrt{n}} \le \bar{X}_{\text{gen}} \le \bar{X}_{\text{sample}} + 1.96\,\dfrac{\sigma}{\sqrt{n}}$

With a probability of 99%:

$\bar{X}_{\text{sample}} - 2.58\,\dfrac{\sigma}{\sqrt{n}} \le \bar{X}_{\text{gen}} \le \bar{X}_{\text{sample}} + 2.58\,\dfrac{\sigma}{\sqrt{n}}$

In general form, with probability P(t):

$\bar{X}_{\text{sample}} - t\,\dfrac{\sigma}{\sqrt{n}} \le \bar{X}_{\text{gen}} \le \bar{X}_{\text{sample}} + t\,\dfrac{\sigma}{\sqrt{n}}$ (1)

The relationship between the value of t and the value of the probability P(t), with which we want to know the confidence interval, can be taken from the following table:

P(t)   0.683   0.950   0.954   0.990   0.997
t      1.00    1.96    2.00    2.58    3.00
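The pairs in this table can be reproduced from the standard normal distribution: P(t) is the probability that a standard normal value falls between −t and +t. A short sketch using Python's standard library (the function name `coverage_probability` is made up for this example):

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def coverage_probability(t):
    """Probability that a standard normal value lies within [-t, +t]."""
    return 2 * std_normal.cdf(t) - 1


for t in (1.00, 1.96, 2.00, 2.58, 3.00):
    print(f"t = {t:.2f}  ->  P(t) = {coverage_probability(t):.3f}")
```

Going the other way, `std_normal.inv_cdf((1 + P) / 2)` gives the value of t for a chosen probability P.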

Thus, we have determined in what range the average value for the general population is (with a given probability).

If the sample is not large enough, we cannot claim that $\sigma$ of the general population equals $S_{\text{sample}}$. In addition, it is then harder to check that the sample is close to a normal distribution. In this case we still use $S_{\text{sample}}$ instead of $\sigma$ in the formula:

$\bar{X}_{\text{sample}} - t\,\dfrac{S_{\text{sample}}}{\sqrt{n}} \le \bar{X}_{\text{gen}} \le \bar{X}_{\text{sample}} + t\,\dfrac{S_{\text{sample}}}{\sqrt{n}}$

but the value of t for a fixed probability P(t) now depends on the number of elements n in the sample. The larger n, the closer the resulting confidence interval is to the one given by formula (1). The t values in this case are taken from another table (Student's t-test), given below:

Student's t values for probabilities 0.95 and 0.99

n     P=0.95   P=0.99   |   n      P=0.95   P=0.99
2     12.71    63.66    |   18     2.11     2.90
3     4.30     9.93     |   19     2.10     2.88
4     3.18     5.84     |   20     2.093    2.861
5     2.78     4.60     |   25     2.064    2.797
6     2.57     4.03     |   30     2.045    2.756
7     2.45     3.71     |   35     2.032    2.720
8     2.37     3.50     |   40     2.022    2.708
9     2.31     3.36     |   45     2.016    2.692
10    2.26     3.25     |   50     2.009    2.679
11    2.23     3.17     |   60     2.001    2.662
12    2.20     3.11     |   70     1.996    2.649
13    2.18     3.06     |   80     1.991    2.640
14    2.16     3.01     |   90     1.987    2.633
15    2.15     2.98     |   100    1.984    2.627
16    2.13     2.95     |   120    1.980    2.617
17    2.12     2.92     |   >120   1.960    2.576

Example 3. 30 people were randomly selected from the employees of a company. In the sample, the average monthly salary was 10 thousand rubles, with a standard deviation of 3 thousand rubles. Determine the average salary in the company with a probability of 0.99. Solution: by the conditions, n = 30, $\bar{X}_{\text{sample}}$ = 10000, S = 3000, P = 0.99. To find the confidence interval, we use the formula corresponding to Student's criterion. From the table, for n = 30 and P = 0.99 we find t = 2.756, therefore

$10000 - 2.756 \cdot \dfrac{3000}{\sqrt{30}} \le \bar{X}_{\text{gen}} \le 10000 + 2.756 \cdot \dfrac{3000}{\sqrt{30}}$,

i.e. the desired confidence interval is 8490 < $\bar{X}_{\text{gen}}$ < 11510.

So, with a probability of 0.99, it can be stated that the interval (8490; 11510) contains the average salary in the company.
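
The arithmetic of this example can be checked with a few lines of Python, using the inputs stated in the problem (n = 30, a mean of 10000 rubles, S = 3000 rubles) and the table value t = 2.756. The variable names are illustrative.

```python
from math import sqrt

n = 30
sample_mean = 10_000   # rubles
sample_sd = 3_000      # rubles
t_value = 2.756        # from the table: n = 30, P = 0.99

margin = t_value * sample_sd / sqrt(n)
low, high = sample_mean - margin, sample_mean + margin
print(f"{low:.0f} < population mean < {high:.0f}")
```

The margin is about 1510 rubles on either side of the sample mean.
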
We hope that you will be able to use this method even without the table at hand, because the calculation can be done automatically in Excel. In an Excel file, click the fx button on the top menu. Then select the "Statistical" category and, from the list offered, the TINV function (СТЬЮДРАСПОБР in the Russian-language version). At the prompt, place the cursor in the "Probability" field and enter the complementary probability (that is, in our case, instead of the probability 0.95, enter 0.05). Apparently, the spreadsheet is designed so that the result answers the question of how likely we are to be wrong. Similarly, in the "Degrees of freedom" field, enter the value (n − 1) for your sample.

The confidence interval for the mathematical expectation is an interval calculated from the data that contains the mathematical expectation of the general population with a known probability. The natural estimate of the mathematical expectation is the arithmetic mean of the observed values. Therefore, throughout the lesson we will use the terms "mean" and "mean value". In problems on calculating a confidence interval, the answer most often required has the form "The confidence interval for the mean [value in the specific problem] is from [lower value] to [higher value]". With the help of a confidence interval, it is possible to estimate not only mean values but also the proportion of a particular feature in the general population. Mean values, variance, standard deviation, and standard error, through which we will arrive at new definitions and formulas, are covered in the lesson Sample and Population Characteristics.

Point and interval estimates of the mean

If the mean of the general population is estimated by a number (a point), then a specific mean calculated from a sample of observations is taken as the estimate of the unknown population mean. The sample mean is a random variable and does not coincide with the population mean. Therefore, when reporting the sample mean, the sampling error must be reported as well. The standard error, which is expressed in the same units as the mean, is used as the measure of sampling error. The following notation is therefore often used: $\bar{x} \pm s_{\bar{x}}$ (the mean plus or minus its standard error).

If the estimate of the mean must be associated with a certain probability, the population parameter of interest has to be estimated not by a single number but by an interval. A confidence interval is an interval that, with a certain probability P, contains the value of the estimated population parameter. The confidence interval that contains the population mean with probability P = 1 − α is calculated as follows:

$\left(\bar{x} - z\,\dfrac{\sigma}{\sqrt{n}};\ \bar{x} + z\,\dfrac{\sigma}{\sqrt{n}}\right)$,

where $z$ is the two-sided critical value of the standard normal distribution for the significance level α = 1 − P, which can be found in the appendix to almost any book on statistics.

In practice, the population mean and variance are not known, so the population variance is replaced by the sample variance, and the population mean by the sample mean. Thus, in most cases the confidence interval is calculated as follows:

$\left(\bar{x} - z\,\dfrac{s}{\sqrt{n}};\ \bar{x} + z\,\dfrac{s}{\sqrt{n}}\right)$.

The confidence interval formula can be used to estimate the population mean if

  • the standard deviation of the general population is known;
  • or the standard deviation of the population is not known, but the sample size is greater than 30.

The sample mean is an unbiased estimate of the population mean. The sample variance, in turn, is not an unbiased estimate of the population variance. To obtain an unbiased estimate of the population variance, the sample size n in the denominator of the sample variance formula should be replaced with n − 1.
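The difference between the two divisors is easy to see with Python's standard `statistics` module, where `pvariance` divides by n and `variance` divides by n − 1. The data below are a hypothetical sample used only for illustration.

```python
from statistics import pvariance, variance

data = [8, 12, 9, 14, 7, 10]  # hypothetical sample
n = len(data)

biased = pvariance(data)    # sum of squared deviations divided by n
unbiased = variance(data)   # sum of squared deviations divided by n - 1

print(f"divide by n:     {biased:.3f}")
print(f"divide by n - 1: {unbiased:.3f}")
```

The two estimates differ by the factor n / (n − 1), which matters for small samples and becomes negligible for large ones.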

Example 1. Information collected from 100 randomly selected cafes in a certain city shows that the average number of employees is 10.5, with a standard deviation of 4.6. Determine the 95% confidence interval for the number of cafe employees.

$10.5 \pm 1.96 \cdot \dfrac{4.6}{\sqrt{100}} = 10.5 \pm 0.90$,

where $z = 1.96$ is the critical value of the standard normal distribution for the significance level α = 0.05.

Thus, the 95% confidence interval for the average number of cafe employees is from 9.6 to 11.4.
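The calculation in Example 1 can be reproduced in Python with the standard library. The critical value is taken from `NormalDist` rather than typed in as 1.96; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

n, x_bar, s = 100, 10.5, 4.6       # data from Example 1
z = NormalDist().inv_cdf(0.975)    # two-sided critical value for alpha = 0.05
margin = z * s / sqrt(n)
low, high = x_bar - margin, x_bar + margin
print(f"95% CI: {low:.1f} to {high:.1f}")
```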

Example 2 For a random sample from a general population of 64 observations, the following total values ​​were calculated:

sum of values ​​in observations ,

sum of squared deviations of values ​​from the mean .

Calculate the 95% confidence interval for the expected value.

Calculate the standard deviation:

$s = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n-1}}$,

calculate the mean:

$\bar{x} = \dfrac{\sum x_i}{n}$.

Substitute the values into the expression for the confidence interval:

$\bar{x} \pm z\,\dfrac{s}{\sqrt{n}}$,

where $z = 1.96$ is the critical value of the standard normal distribution for the significance level α = 0.05.

Thus, the 95% confidence interval for the mathematical expectation of this sample ranged from 7.484 to 11.266.

Example 3. For a random sample of 100 observations from a general population, a mean of 15.2 and a standard deviation of 3.2 were calculated. Calculate the 95% confidence interval for the expected value, then the 99% confidence interval. If the sample size and its variation remain the same but the confidence level increases, will the confidence interval narrow or widen?

We substitute these values into the expression for the confidence interval:

$15.2 \pm 1.96 \cdot \dfrac{3.2}{\sqrt{100}} = 15.2 \pm 0.627$,

where $z = 1.96$ is the critical value of the standard normal distribution for the significance level α = 0.05.

Thus, the 95% confidence interval for the mean of this sample is from 14.57 to 15.83.

Again, we substitute these values into the expression for the confidence interval:

$15.2 \pm 2.576 \cdot \dfrac{3.2}{\sqrt{100}} = 15.2 \pm 0.824$,

where $z = 2.576$ is the critical value of the standard normal distribution for the significance level α = 0.01.

Thus, the 99% confidence interval for the mean of this sample is from 14.38 to 16.02.

As you can see, as the confidence level increases, the critical value of the standard normal distribution also increases. The start and end points of the interval therefore lie further from the mean, and the confidence interval for the mathematical expectation widens.
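The comparison in Example 3 can be sketched in Python. The function `mean_interval` is a made-up helper that applies the normal-approximation formula used in this lesson at any confidence level.

```python
from math import sqrt
from statistics import NormalDist


def mean_interval(x_bar, s, n, level):
    """Normal-approximation CI for a mean at the given confidence level."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    margin = z * s / sqrt(n)
    return x_bar - margin, x_bar + margin


for level in (0.95, 0.99):
    low, high = mean_interval(15.2, 3.2, 100, level)
    print(f"{level:.0%} CI: {low:.2f} to {high:.2f} (width {high - low:.2f})")
```

The 99% interval comes out wider than the 95% interval, as the example concludes.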

Point and interval estimates of the proportion

The share of some feature in the sample, $\hat{p}$, can be interpreted as a point estimate of the share p of the same feature in the general population. If this value needs to be associated with a probability, the confidence interval for the proportion p of the feature in the general population should be calculated with probability P = 1 − α:

$\hat{p} \pm z\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$,

where $z$ is the two-sided critical value of the standard normal distribution for the significance level α.

Example 4. In a certain city, two candidates, A and B, are running for mayor. 200 residents of the city were randomly polled, of whom 46% answered that they would vote for candidate A, 26% for candidate B, and 28% do not know who they will vote for. Determine the 95% confidence interval for the proportion of city residents who support candidate A.
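One way to carry out this calculation is sketched below in Python. It uses the normal-approximation (Wald) interval for a proportion, which is an assumption about the intended method; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

n = 200        # residents polled
p_hat = 0.46   # share supporting candidate A

z = NormalDist().inv_cdf(0.975)             # two-sided critical value, alpha = 0.05
margin = z * sqrt(p_hat * (1 - p_hat) / n)  # standard error of a proportion times z
low, high = p_hat - margin, p_hat + margin
print(f"95% CI for candidate A: {low:.3f} to {high:.3f}")
```

Under this approximation, the share of residents supporting candidate A lies between roughly 39% and 53%.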