Features and limitations of methods of mathematical statistics (abstract)

1. Basic concepts and definitions


The concept of statistics

Statistics, or rather its research methods, is widely used in various fields of human knowledge. However, like any science, it requires the definition of the subject of its study. In this regard, a distinction is made between statistics dealing with the study of socio-economic phenomena, which belongs to the cycle of social sciences, and statistics dealing with the laws of natural phenomena, which belongs to the natural sciences.

The authors of most modern domestic university textbooks on the theory of statistics (the general theory of statistics) understand statistics as a social science, i.e. a science that has its own special subject and method of cognition.

Statistics is a social science that studies the quantitative side of qualitatively defined mass socio-economic phenomena and processes, their structure, distribution and placement in space, their movement in time, revealing the existing quantitative dependencies, trends and patterns under specific conditions of place and time.

The subject of statistics

Statistics as a science does not study individual facts but mass socio-economic phenomena and processes, which act as a multitude of individual facts possessing both individual and general characteristics.

The object of statistical research in statistics is called a statistical population.

A statistical population is a set of units possessing mass character, homogeneity, a certain integrity, interdependence of the states of individual units, and the presence of variation.

For example, special objects of statistical research, i.e. statistical populations, may be the set of commercial banks registered in the territory of the Russian Federation, the set of joint-stock companies, the set of citizens of some country, etc. It is important to remember that a statistical population consists of really existing material objects.

Each individual element of this set is called a unit of the statistical population.

The units of a statistical population are characterized by common properties, referred to in statistics as features. Qualitative homogeneity of a population is understood as the similarity of its units (objects, phenomena, processes) in certain essential features, while they may differ in others.

Along with the features common to all units, which determine the qualitative certainty of the population, the units of the population also have individual characteristics and differences that distinguish them from one another, i.e. there exists variation of a feature. It is due to different combinations of the conditions that determine the development of the elements of the population.

For example, the level of labor productivity of bank employees is determined by their age, qualifications, attitude to work, etc.

It is the presence of variation that predetermines the need for statistics. The variation of a feature is reflected in the statistical distribution of the population units.

Statistics as a science studies, first of all, the quantitative side of social phenomena and processes in the specific conditions of place and time, i.e. the subject of statistics is the size and quantitative correlations of socio-economic phenomena, the patterns of their connection and development.

Statistics expresses quantitative characteristics through numbers of a particular kind, which are called statistical indicators.

A statistical indicator reflects the result of measurement for the units of the population and for the population as a whole.

Theoretical foundations of statistics as a science

The theoretical basis of any science, including statistics, is made up of concepts and categories, in the aggregate of which the basic principles of this science are expressed.

Statistical aggregates have certain properties, the carriers of which are units of the population (phenomena) that have certain characteristics. According to the form of external expression, signs are divided into attributive (descriptive, qualitative) and quantitative. Attributive (qualitative) signs are not amenable to quantitative (numerical) expression.

Quantitative signs can be divided into discrete and continuous.

Another important category of statistics is statistical regularity.

A statistical regularity is a form of manifestation of a causal relationship, expressed in the sequence, regularity and repetition of events with a sufficiently high degree of probability, provided that the causes (conditions) giving rise to the events do not change or change only slightly.

A statistical regularity is established on the basis of the analysis of mass data. This determines its relationship with the law of large numbers.

The essence of the law of large numbers is that in the numbers summarizing the results of mass observations certain regularities appear that cannot be detected in a small number of facts. The law of large numbers is generated by the properties of mass phenomena. The tendencies and regularities revealed with its help are valid only as mass tendencies, not as laws for each separate, individual case.

Statistics Method

Statistics as a science has developed techniques and methods for studying mass social phenomena, depending on the characteristics of its subject and the tasks that are posed in its study. The techniques and methods by which statistics studies its subject form the statistical methodology.

Statistical methodology means a system of techniques, methods and approaches aimed at studying the quantitative patterns that manifest themselves in the structure, dynamics and interrelations of socio-economic phenomena.

The task of statistical research consists in obtaining generalizing characteristics and identifying patterns in social life in specific conditions of place and time, which manifest themselves only in a large mass of phenomena through overcoming the randomness inherent in its individual elements.

Statistical research consists of three stages:

statistical observation;

summary and grouping of observation results;

analysis of the obtained generalizing indicators.

All three stages are interconnected, and at each of them special methods are used, explained by the content of the work performed.

The concept of selective observation

The statistical methodology for studying mass phenomena distinguishes, as is known, two methods of observation depending on the completeness of coverage of the object: continuous and non-continuous. A variety of non-continuous observation is sample (selective) observation.

Sample observation is understood as non-continuous observation in which the units of the studied population selected at random are subjected to statistical examination (observation).

The task of sample observation is to characterize the entire population of units on the basis of the examined part, subject to all the rules and principles of statistical observation and a scientifically organized selection of units.

The sampling method provides information of the necessary accuracy when considerations of time and cost make a complete (continuous) survey impractical.

Characteristics of the sample and general population

The set of units selected for the survey is usually called the sample population, and the set of units from which the selection is made is called the general population.

The main characteristics of the parameters of the general and sample populations are indicated by certain symbols (Table 1.1).

Table 1.1. Symbols of the main characteristics of the parameters of the general and sample populations

Characteristic | General population | Sample population
Volume of the population (number of units) | N | n
Number of units with the examined trait | M | m
Proportion of units with the examined trait | P = M/N | w = m/n
Average value of the trait | x̄ | x̃
Variance of a quantitative trait | σ² | s²
Variance of the proportion | P(1 - P) | w(1 - w)

In the process of conducting sample observation, as in general when analyzing the data of any survey, statistics distinguish two types of errors: registration and representativeness.

Registration errors may be random (unintentional) or systematic (tendentious) in nature. They can be avoided with proper organization and conducting surveillance.

Representativeness errors are organically inherent in selective observation and arise due to the fact that the sample does not fully reproduce the general one.

It is impossible to avoid representativeness errors; however, using the methods of probability theory based on the limit theorems of the law of large numbers, these errors can be reduced to minimal values whose boundaries are determined with sufficiently high accuracy.

The sampling error is the difference between the value of a parameter in the general population and its value calculated from the results of the sample observation.

For the mean value the error is defined as follows:

Δ_x̄ = |x̃ - x̄|, (1.1)

where x̃ is the sample mean and x̄ is the general mean.

The value Δ_x̄ is called the marginal sampling error.

The marginal sampling error is random. Limit theorems of the law of large numbers are devoted to the study of patterns of random sampling errors.

These patterns are most fully disclosed in the theorems of P. L. Chebyshev and A. M. Lyapunov.

Theorem of P. L. Chebyshev : with a sufficiently large number of independent observations, it is possible with a probability close to one (i.e., almost with certainty) to assert that the deviation of the sample mean from the general one will be arbitrarily small.

The theorem proves that the magnitude of the error should not exceed t·μ.

In turn, the value μ, expressing the standard deviation of the sample mean from the general mean, depends on the variability of the trait in the general population and on the number of selected units.

This dependence is expressed by the formula

μ = √( σ² / n ),

where μ is the mean sampling error (it also depends on the sampling method);

σ² is the general variance;

n is the sample size.

It is easy to see that when a large number of units are selected, the discrepancies between the means will be smaller, i.e. there is an inverse relationship between the mean sampling error and the number of selected units.

It can be proved that an increase in the variability of a feature entails an increase in the standard deviation, and, consequently, errors.

The relationship between the variances of the general and the sample population is expressed by the formula

σ² = s² · n / (n - 1).

Since the factor n/(n - 1) is close to 1 for sufficiently large n, we may approximately assume that the sample variance equals the general variance, i.e. σ² ≈ s².

Hence, the mean sampling error shows what deviations of the characteristics of the sample population from the corresponding characteristics of the general population are possible. The magnitude of this error, however, can be judged only with a certain probability, which is indicated by the multiplier t (the confidence coefficient).

A. M. Lyapunov proved that the distribution of sample means (and, consequently, their deviations from the general mean) with a sufficiently large number of independent observations is approximately normal, provided that the general population has a finite mean and limited variance.

Mathematically, Lyapunov's theorem can be written as

P(|x̃ - x̄| ≤ Δ) = F(t) = 2/√(2π) · ∫₀ᵗ e^(-z²/2) dz,

where Δ = t·μ is the marginal sampling error.

The values of this integral for various values of the confidence coefficient t have been calculated and are given in special mathematical tables.

For example:

t = 1, F(t) = 0.683; t = 1.5, F(t) = 0.866;

t = 2, F(t) = 0.954; t = 2.5, F(t) = 0.988;

t = 3, F(t) = 0.997; t = 3.5, F(t) = 0.999.

This can be read as follows: with probability 0.683 it can be asserted that the difference between the sample mean and the general mean does not exceed one mean sampling error (1μ).

In other words, in 95.4% of cases the representativeness error will not exceed 2μ, in 99.7% of cases it will not exceed 3μ, and so on.

Knowing the sample mean value of the feature x̃ and the marginal sampling error Δ, it is possible to determine the boundaries (limits) that contain the general mean:

x̃ - Δ ≤ x̄ ≤ x̃ + Δ.
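As a rough illustration of the formulas above, here is a minimal Python sketch. Repeated proper random selection is assumed; the sample data and the confidence coefficient t = 2 (i.e. F(t) = 0.954) are hypothetical assumptions, not values from the text.

```python
import math

def mean_confidence_bounds(sample, t=2.0):
    """Bounds for the general mean: x_tilde - t*mu <= x_bar <= x_tilde + t*mu.

    t is the confidence coefficient (t = 2 corresponds to F(t) = 0.954).
    Repeated (with replacement) proper random selection is assumed.
    """
    n = len(sample)
    x_tilde = sum(sample) / n                          # sample mean
    s2 = sum((x - x_tilde) ** 2 for x in sample) / n   # sample variance
    mu = math.sqrt(s2 / n)                             # mean sampling error
    delta = t * mu                                     # marginal sampling error
    return x_tilde - delta, x_tilde + delta

# Hypothetical data: total floor area per person (sq. m) for 10 surveyed residents
low, high = mean_confidence_bounds([18.2, 21.5, 19.0, 23.1, 17.8, 20.4, 22.0, 19.6, 18.9, 21.2])
print(f"The general mean lies between {low:.2f} and {high:.2f} with probability 0.954")
```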

Bernoulli's theorem considers the sampling error for an alternative feature, i.e. a feature that has only two possible outcomes: the presence of the feature (1) and its absence (0).

Bernoulli's theorem states that, with a sufficiently large sample size, the probability that the discrepancy between the share of the feature in the sample population (w) and the share of the feature in the general population (p) is arbitrarily small tends to unity:

P(|w - p| < ε) → 1 as n → ∞,

i.e. with a probability arbitrarily close to one it can be asserted that, for a sufficiently large sample size, the frequency of the feature (the sample share w) will differ arbitrarily little from the share of the feature p in the general population.

Since the discrepancy between the sample frequency and the general share follows the normal distribution law, this probability can be found from the function F(t) for a given value of t.

The mean sampling error for an alternative feature is determined by the formula

μ_w = √( p(1 - p) / n ).

Since the share of the feature in the general population, p, is unknown, it is replaced by the share of the same feature in the sample, i.e. p ≈ w, and the variance of the alternative feature is taken as w(1 - w).

The mean sampling error is then expressed by the formula

μ_w = √( w(1 - w) / n ).

The limiting value of the difference between the sample frequency and the general share is called the marginal sampling error Δ_w.

The magnitude of the marginal error can be judged with a certain probability, which depends on the multiplier t, since Δ_w = t·μ_w.

Knowing the sample share of the trait w and the marginal sampling error Δ_w, it is possible to determine the boundaries that contain the general share:

w - Δ_w ≤ p ≤ w + Δ_w.
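A similar sketch for the share of an alternative trait. The counts m and n in the usage example are hypothetical; the optional argument N applies the correction for non-repetitive selection discussed later in the text.

```python
import math

def share_confidence_bounds(m, n, t=2.0, N=None):
    """Bounds for the general share p: w - delta <= p <= w + delta.

    m -- number of sample units possessing the trait, n -- sample size,
    t -- confidence coefficient, N -- general population size
    (if given, the correction for non-repetitive selection is applied).
    A sketch of the formulas above, not a definitive implementation.
    """
    w = m / n                                   # sample share
    var = w * (1 - w)                           # variance of the alternative trait
    if N is None:
        mu = math.sqrt(var / n)                 # repeated selection
    else:
        mu = math.sqrt(var / n * (1 - n / N))   # non-repetitive selection
    delta = t * mu
    return w - delta, w + delta

print(share_confidence_bounds(m=120, n=400, t=2.0))   # hypothetical counts
```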

The results of a selective statistical study largely depend on the level of preparation of the observation process.

The level of preparation here implies compliance with certain rules and principles of designing a sample survey. The most important element of the design is drawing up an organizational plan for the sample observation.

The organizational plan includes the following items:

  • 1. Setting the goal and objectives of observation.
  • 2. Determining the boundaries of the object of study.
  • 3. Development of the observation program (compilation of a questionnaire, questionnaire, report form, etc.) and development of its materials.
  • 4. Determination of the selection procedure, selection method and sample size.
  • 5. Training of personnel for observation, reproduction of forms, instructional documents, etc.
  • 6. Calculation of sample characteristics and determination of sampling errors.
  • 7. Distribution of sample data to the entire population.
2. The main methods of forming a sample population

The reliability of the characteristics calculated from sample data is largely determined by the representativeness of the sample, which, in turn, depends on the method of selecting units from the general population.

According to the type of units selected, one distinguishes individual, group and combined selection.

In individual selection, individual units of the general population are selected; in group selection, groups of units are selected; combined selection is a combination of group and individual selection.

The selection method determines whether the selected unit can continue to participate in the selection procedure.

Non-repetitive selection is selection in which a unit that has entered the sample is not returned to the population from which further selection is carried out.

In repeated selection, a unit that has entered the sample is returned, after its characteristics have been recorded, to the original (general) population to take part in further selection.

With this method the size of the general population remains unchanged, so the probability of being included in the sample is the same for all units throughout the selection.

In the practice of sample surveys, the following samples are most widely used:

actually random;

mechanical;

typical;

serial;

combined.

Self-random sampling

With such a sample, units are selected from the general population entirely at random, without any element of system. At the same time, all units of the general population, without exception, must have an equal chance of being included in the sample.

Technically, proper random selection is carried out by drawing lots or according to a table of random numbers.

Self-random selection can be both repeated and non-repeated.

Suppose that, as a result of a sample survey of the living conditions of city residents carried out on the basis of repeated proper random selection, the following distribution series was obtained (Table 2.1).

Table 2.1 Results of a sample survey of the living conditions of city residents

To determine the mean sampling error, it is necessary to calculate the sample mean and the variance of the trait under study (Table 2.2).

Table 2.2. Calculation of the average total (useful) area of dwellings per person and of the variance

Total (useful) area of dwellings per person, m² | Interval midpoint x | Number of inhabitants f | xf | x²f
5.0-10.0 | 7.5 | 95 | 712.5 | 5343.75
10.0-15.0 | 12.5 | 204 | 2550.0 | 31875.0
15.0-20.0 | 17.5 | 270 | 4725.0 | 82687.5
20.0-25.0 | 22.5 | 210 | 4725.0 | 106312.5
25.0-30.0 | 27.5 | 130 | 3575.0 | 98312.5
30.0 and over | 32.5 | 83 | 2697.5 | 87668.75
Total | | 992 | 18985.0 | 412200.0

The mean sampling error is:

Let us determine the marginal sampling error for the chosen confidence probability:

Let us establish the boundaries of the general mean:

Thus, on the basis of the sample survey it can be concluded, with the chosen probability, that the average total area per person in the city as a whole lies within the computed boundaries.

When calculating the mean error of a proper random non-repetitive sample, the correction for non-repetitive selection must be taken into account:

μ = √( (σ² / n) · (1 - n / N) ).

If we assume that the data presented in Table 2.1 are the result of non-repetitive selection (from a general population of N units), then the mean sampling error will be somewhat smaller:

Accordingly, the marginal sampling error will also decrease, which will cause a narrowing of the boundaries of the general average.
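A minimal Python sketch of the calculation behind Tables 2.1-2.2. The frequencies used here are those implied by the xf and x²f columns of Table 2.2; the confidence coefficient t and the general population size N are placeholders, since the text does not specify them.

```python
import math

# Interval midpoints x and frequencies f from Table 2.2
midpoints = [7.5, 12.5, 17.5, 22.5, 27.5, 32.5]
freqs     = [95, 204, 270, 210, 130, 83]

n = sum(freqs)                                               # sample size
mean = sum(x * f for x, f in zip(midpoints, freqs)) / n
var = sum(x * x * f for x, f in zip(midpoints, freqs)) / n - mean ** 2

mu_repeated = math.sqrt(var / n)                             # repeated selection
t = 2.0                                                      # placeholder: F(t) = 0.954
print(f"mean = {mean:.2f}, variance = {var:.2f}, mu = {mu_repeated:.3f}")
print(f"bounds for the general mean: {mean - t*mu_repeated:.2f} ... {mean + t*mu_repeated:.2f}")

# Correction for non-repetitive selection (N is not given in the text,
# so a placeholder value is used here)
N = 100_000
mu_nonrep = math.sqrt(var / n * (1 - n / N))
print(f"mu (non-repetitive) = {mu_nonrep:.3f}")
```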

Let us use the data of Table 2.1 again in order to determine the boundaries of the proportion of persons whose housing provision is below the given level.

According to the results of the survey, the number of such persons in the sample was m.

Let us determine the sample share and its variance, w = m/n and σ_w² = w(1 - w):

Let us calculate the mean sampling error:

The marginal sampling error for the given probability is:

Let us determine the boundaries of the general share:

Therefore, it can be asserted with the chosen probability that the proportion of persons in the city as a whole whose housing provision is below the given level lies within the computed boundaries.

Mechanical sampling

Mechanical sampling is used in cases where the population is somehow ordered, i.e. there is a certain sequence in the arrangement of units (lists of voters, telephone numbers of respondents, numbers of houses and apartments, etc.).

To conduct mechanical sampling, a selection proportion is established, determined as the ratio of the sample size to the size of the general population.

The selection of units is carried out in accordance with the established proportion at equal intervals. For example, with a selection proportion of 1 : k, every k-th unit of the ordered population is selected.

The general population during mechanical selection can be ranked or ordered according to the value of the trait being studied or correlated with it, which will increase the representativeness of the sample.

However, in this case, the risk of a systematic error increases, associated with an underestimation of the value of the studied trait (if the first value is recorded from each interval) or its overestimation (if the last value is recorded from each interval).

It is advisable to start the selection from the middle of the first interval and then to select the subsequent units at the same interval.

To determine the average error of mechanical sampling, the formula of the average error for self-random non-repetitive selection is used.
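A sketch of mechanical (systematic) selection as described above; the population list and the proportion 1 : 50 in the usage example are hypothetical.

```python
def mechanical_sample(units, k, start=None):
    """Select every k-th unit from an ordered list (mechanical selection).

    By default selection starts from the middle of the first interval,
    as recommended above. 'units' can be any ordered sequence.
    """
    if start is None:
        start = k // 2
    return units[start::k]

population = list(range(1, 1001))              # e.g. an ordered list of 1000 apartment numbers
sample = mechanical_sample(population, k=50)   # 2% sample: every 50th unit
print(len(sample), sample[:5])                 # -> 20 [26, 76, 126, 176, 226]
```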

Typical selection

This method of selection is used in cases where all units of the general population can be divided into several typical groups.

Typical selection involves the selection of units from each typical group in a purely random or mechanical way.

The selection of units in a typical sample can be organized either in proportion to the volume of typical groups, or in proportion to the intragroup differentiation of a trait.

When sampling in proportion to the size of the typical groups, the number of units to be selected from each group is determined as follows:

n_i = n · N_i / N,

where N_i is the volume of the i-th group;

n_i is the size of the sample from the i-th group.

The mean error of such a sample is found by the formulas:

μ = √( σ̄² / n ) (repeated selection); (2.1)

μ = √( (σ̄² / n) · (1 - n / N) ) (non-repetitive selection), (2.2)

where σ̄² is the average of the intragroup variances.
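A sketch of allocation proportional to group size and of the error formulas (2.1)-(2.2); the workshop sizes and intragroup variances in the usage example are hypothetical.

```python
import math

def proportional_allocation(group_sizes, n):
    """n_i = n * N_i / N -- allocation proportional to the sizes of the typical groups."""
    N = sum(group_sizes)
    return [round(n * Ni / N) for Ni in group_sizes]

def typical_sample_error(within_vars, sample_sizes, n, N=None):
    """Mean error of a typical sample, formulas (2.1)-(2.2).

    sigma_bar2 is the average of the intragroup variances
    (weighted here by the group sample sizes)."""
    sigma_bar2 = sum(s2 * ni for s2, ni in zip(within_vars, sample_sizes)) / sum(sample_sizes)
    if N is None:
        return math.sqrt(sigma_bar2 / n)             # repeated selection
    return math.sqrt(sigma_bar2 / n * (1 - n / N))   # non-repetitive selection

# Hypothetical workshops with 500, 300 and 200 workers; a 10% sample
sizes = proportional_allocation([500, 300, 200], n=100)
print(sizes)                                          # -> [50, 30, 20]
print(typical_sample_error([4.0, 6.5, 5.2], sizes, n=100, N=1000))
```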

When sampling in proportion to the differentiation of the trait, the number of observations for each group is calculated by the formula

n_i = n · N_i σ_i / Σ(N_i σ_i), (2.3)

where σ_i is the standard deviation of the trait in the i-th group.

The mean error of such a selection is defined as follows:

μ = Σ(N_i σ_i) / (N·√n) (repeated selection); (2.4)

μ = √( (Σ N_i σ_i)² / (N²·n) - Σ(N_i σ_i²) / N² ) (non-repetitive selection). (2.5)

Let's consider both variants of a typical sample using a conditional example.

Suppose that a non-repetitive typical selection of the workers of an enterprise, proportional to the size of the workshops and carried out in order to assess losses due to temporary disability, led to the following results (Table 2.3).

Table 2.3 Results of the survey of the workers of the enterprise

Let us determine the mean and the marginal sampling errors (for the chosen confidence probability):

Let us calculate the sample mean:

With the chosen probability we can conclude that the average number of days of temporary disability per worker for the enterprise as a whole lies within the computed boundaries:

Let us use the obtained intragroup variances to carry out a selection proportional to the differentiation of the trait.

Determine the required sample size for each workshop:

Taking into account the obtained values, we calculate the average sampling error:

In this case, the average, and, consequently, the marginal error will be somewhat smaller, which will also affect the boundaries of the general average.

Serial selection

This method of selection is convenient when the population units are grouped into small groups or series. Such series may be packages containing a certain quantity of finished products, batches of goods, student groups, work brigades and other associations.

The essence of serial sampling lies in the actual random or mechanical selection of series, within which a continuous survey of units is carried out.

The mean error of serial sampling (when equal series are selected) depends only on the magnitude of the intergroup (interseries) variance and is determined by the following formulas:

μ = √( δ² / r ) (repeated selection); (2.6)

μ = √( (δ² / r) · (1 - r / R) ) (non-repetitive selection), (2.7)

where r is the number of selected series;

R is the total number of series.

The intergroup variance is calculated as follows:

δ² = Σ( x̄_i - x̄ )² / r,

where x̄_i is the mean of the i-th series;

x̄ is the overall mean for the entire sample.
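A sketch of formulas (2.6)-(2.7); the series means, r and R in the usage example are hypothetical.

```python
import math

def serial_sample_error(series_means, r, R=None):
    """Mean error of serial sampling with equal series, formulas (2.6)-(2.7).

    series_means -- sample means of the r selected series,
    R -- total number of series (if given, non-repetitive selection is assumed).
    The interseries variance is delta^2 = sum((x_i - x_bar)^2) / r.
    """
    x_bar = sum(series_means) / r
    delta2 = sum((xi - x_bar) ** 2 for xi in series_means) / r
    if R is None:
        return math.sqrt(delta2 / r)               # repeated selection
    return math.sqrt(delta2 / r * (1 - r / R))     # non-repetitive selection

# Hypothetical: 5 series selected out of 40, with these series means
print(serial_sample_error([12.1, 11.8, 12.6, 12.0, 11.5], r=5, R=40))
```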

Combined selection

In the practice of statistical surveys, in addition to the selection methods discussed above, their combination is also used.

It is possible to combine typical and serial sampling, when series are selected in the prescribed manner from several typical groups. A combination of serial and proper random selection is also possible, in which individual units are selected within the series in a proper random manner.

The error of such a sample is determined stage by stage, according to the steps of selection.

Multistage selection is selection in which enlarged groups are first drawn from the general population, then smaller groups, and so on, until the units to be surveyed are reached.

Multi-phase sampling involves the preservation of the same sampling unit at all stages of its implementation, while the units selected at each stage are subject to examination (at each subsequent stage of selection, the survey program is expanded).

Based on the foregoing, we present formulas for the marginal sampling error for the methods of forming a sample population most commonly used in practice (Table 2.4).

Table 2.4 Marginal sampling error for some sampling methods

Mathematical statistics is a modern branch of mathematical science that deals with the statistical description of the results of experiments and observations, as well as with the construction of mathematical models containing the concept of probability. Its theoretical basis is probability theory.

In the structure of mathematical statistics, two main sections are traditionally distinguished: descriptive statistics and statistical inference (Figure 1.1).

Fig. 1.1. Main sections of mathematical statistics

Descriptive statistics is used for:

o generalization of indicators of one variable (statistics of a random sample);

o identifying relationships between two or more variables (correlation-regression analysis).

Descriptive statistics makes it possible to obtain new information, quickly understand and comprehensively evaluate it, that is, it performs the scientific function of describing the objects of study, which justifies its name. The methods of descriptive statistics are designed to turn a set of individual empirical data into a system of forms and numbers that are visual for perception: frequency distributions; indicators of trends, variability, communication. These methods calculate the statistics of a random sample, which serve as the basis for the implementation of statistical inferences.

Statistical inference makes it possible:

o to evaluate the accuracy, reliability and efficiency of sample statistics and to find the errors that arise in the course of statistical research (statistical estimation);

o to draw conclusions about the parameters of the general population on the basis of sample statistics (testing of statistical hypotheses).

The main objective of scientific research is the acquisition of new knowledge about a large class of phenomena, persons or events, which is commonly called the general population.

The general population is the totality of the objects of study; the sample is that part of it which is formed in a certain, scientifically substantiated way.

The term "general population" is used when speaking of a large but finite set of objects under study, for example the totality of university applicants in Ukraine in 2009 or the totality of preschool children in the city of Rivne. General populations can reach considerable sizes and can be finite or infinite. In practice, as a rule, one deals with finite populations. If the ratio of the size of the general population to the size of the sample is greater than 100, then, according to Glass and Stanley, the estimation methods for finite and infinite populations give essentially the same results. The general population can also be understood as the complete set of values of some attribute. The fact that the sample belongs to the general population is the main basis for assessing the characteristics of the general population from the characteristics of the sample.

The main idea of mathematical statistics rests on the conviction that a complete study of all objects of the general population is, in most scientific problems, either practically impossible or economically impractical, since it requires much time and considerable material costs. Therefore mathematical statistics uses the sampling approach, the principle of which is shown in the diagram in Fig. 1.2.

For example, according to the technology of their formation, samples are randomized (simple and systematic), stratified, or clustered (see Section 4).

Fig. 1.2. Scheme of application of the methods of mathematical statistics

According to the sampling approach, mathematical and statistical methods can be applied in the following sequence (see Fig. 1.2):

o from the general population whose properties are to be investigated, a sample is formed by certain methods: a typical but limited number of objects to which the research methods are applied;

o as a result of observational methods, experimental actions and measurements on sample objects, empirical data are obtained;

o processing of the empirical data by the methods of descriptive statistics yields sample indicators, which are called statistics, the same word, incidentally, as the name of the discipline;

o applying the methods of statistical inference to these statistics, one obtains parameters that characterize the properties of the general population.

Example 1.1. In order to assess the stability of the level of knowledge (variable X), a randomized sample³ of n students was tested. The tests contained m tasks, each of which was scored dichotomously: "completed" = 1, "not completed" = 0. Has the average current achievement of the students, X̄, remained at the level of previous years, μ? (A minimal code sketch of such a test is given after the solution sequence below.) Solution sequence:

³ A randomized sample (from the English random) is a representative sample formed according to the strategy of random trials.

o state a meaningful hypothesis of the type: "if the current test results do not differ from the past ones, then the level of the students' knowledge can be considered unchanged and the educational process stable";

o formulate an adequate statistical hypothesis, for example the null hypothesis H0 that "the current mean score X̄ does not differ statistically from the average of previous years μ", i.e. H0: X̄ = μ, against the corresponding alternative hypothesis H1: X̄ ≠ μ;

o construct the empirical distributions of the variable X under study;

o determine (if necessary) correlations, for example between the variable X and other indicators, and build regression lines;

o check the correspondence of the empirical distribution to the normal law;

o estimate point indicators and confidence intervals of the parameters, for example of the mean;

o choose criteria for testing the statistical hypotheses;

o test the statistical hypotheses on the basis of the selected criteria;

o formulate a decision on the statistical null hypothesis at a certain significance level;

o pass from the decision to accept or reject the statistical null hypothesis to the interpretation of the conclusions regarding the meaningful hypothesis;

o formulate meaningful conclusions.
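Example 1.1 does not fix a specific criterion; as one possible sketch, a one-sample Student's t test is assumed here, with hypothetical student scores and a critical value taken from a t table for the chosen significance level.

```python
import math

def one_sample_t(scores, mu0, t_crit=2.20):
    """Test H0: the current mean does not differ from the past mean mu0.

    scores -- test results of the sampled students (share of completed tasks),
    mu0 -- the average of previous years, t_crit -- the critical value of
    Student's t for the chosen significance level and n-1 degrees of freedom
    (the default 2.20 corresponds roughly to alpha = 0.05 and n = 12).
    """
    n = len(scores)
    mean = sum(scores) / n
    s2 = sum((x - mean) ** 2 for x in scores) / (n - 1)   # corrected sample variance
    t = (mean - mu0) / math.sqrt(s2 / n)
    return mean, t, abs(t) > t_crit                       # True -> reject H0

# Hypothetical scores of n = 12 students and a past-years average of 0.70
print(one_sample_t([0.8, 0.7, 0.6, 0.9, 0.75, 0.65, 0.7, 0.8, 0.85, 0.6, 0.7, 0.75], 0.70))
```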

So, if we summarize the above procedures, the application of statistical methods consists of three main blocks:

The transition from an object of reality to an abstract mathematical and statistical scheme, that is, the construction of a probabilistic model of a phenomenon, process, property;

Carrying out computational actions by proper mathematical means within the framework of a probabilistic model based on the results of measurements, observations, experiments and the formulation of statistical conclusions;

Interpretation of statistical conclusions about the real situation and making an appropriate decision.

Statistical methods for processing and interpreting data are based on probability theory. The theory of probability is the basis of the methods of mathematical statistics. Without the use of fundamental concepts and laws of probability theory, it is impossible to generalize the conclusions of mathematical statistics, and hence their reasonable use for scientific and practical purposes.

Thus, the task of descriptive statistics is to transform a set of sample data into a system of indicators - statistics - frequency distributions, measures of central tendency and variability, coupling coefficients, and the like. However, statistics are characteristics, in fact, of a particular sample. Of course, it is possible to calculate sample distributions, sample means, variances, etc., but such "data analysis" is of limited scientific and educational value. The "mechanical" transfer of any conclusions drawn on the basis of such indicators to other populations is not correct.

In order to be able to transfer sample indicators to other or broader populations, one needs mathematically justified statements about the correspondence of sample characteristics to the characteristics of these broader, so-called general populations. Such statements are based on theoretical approaches and schemes connected with probabilistic models of reality, for example on the axiomatic approach or on the law of large numbers. Only with their help is it possible to transfer properties established from limited empirical information to other or wider sets. Thus the construction, the laws of functioning and the use of probabilistic models, the subject of the mathematical field called "probability theory", become the essence of statistical methods.

Thus, mathematical statistics uses two parallel lines of indicators: the first relates to practice (sample indicators), the second is based on theory (indicators of a probabilistic model). For example, the empirical frequencies determined from a sample correspond to the theoretical concept of probability; the sample mean (practice) corresponds to the mathematical expectation (theory), and so on. Moreover, in research, sample characteristics are, as a rule, primary. They are calculated on the basis of observations, measurements and experiments, after which they undergo statistical assessment of consistency and efficiency and are used for testing statistical hypotheses in accordance with the objectives of the research; in the end they are accepted, with a certain probability, as indicators of the properties of the populations under study.

Questions and tasks

1. Describe the main sections of mathematical statistics.

2. What is the main idea of ​​mathematical statistics?

3. Describe the ratio of the general and sample populations.

4. Explain the scheme for applying the methods of mathematical statistics.

5. Specify the list of the main tasks of mathematical statistics.

6. What are the main blocks of the application of statistical methods? Describe them.

7. Expand the connection between mathematical statistics and probability theory.

Mathematical statistics is a branch of mathematics that studies approximate methods of collecting and analysing data from the results of an experiment in order to identify existing patterns, i.e. to find the distribution laws of random variables and their numerical characteristics.

In mathematical statistics, it is customary to distinguish two main areas of research:

1. Estimation of the parameters of the general population.

2. Testing statistical hypotheses (some a priori assumptions).

The basic concepts of mathematical statistics are the general population, the sample, and the theoretical distribution function.

The general population is the set of all conceivable statistical data obtainable in observations of a random variable:

X_G = {x_1, x_2, x_3, ..., x_N} = {x_i; i = 1, N}.

The observed random variable X is called a feature or sampling factor. The general population is the statistical analogue of a random variable; its volume N is usually large, so a part of the data is selected from it, called the sample population or simply the sample:

X_B = {x_1, x_2, x_3, ..., x_n} = {x_i; i = 1, n},

X_B ⊂ X_G, n ≤ N.

Sample is a collection of randomly selected observations (objects) from the general population for direct study. The number of objects in the sample is called the sample size and is denoted by n. Typically, the sample is 5% -10% of the general population.

The use of a sample to construct patterns to which an observed random variable is subject allows avoiding its continuous (mass) observation, which is often a resource-intensive process, or even simply impossible.

For example, a population is a set of individuals. The study of an entire population is laborious and expensive, therefore, data are collected on a sample of individuals who are considered representatives of this population, allowing to draw a conclusion about this population.

However, the sample must satisfy the condition of representativeness, i.e. give a well-founded idea of the general population. How is a representative sample formed? Ideally, a random (randomized) sample is sought: a list of all individuals in the population is compiled and individuals are selected from it at random. But sometimes the cost of compiling such a list is unacceptable, and then an available (convenience) sample is taken, for example one clinic or hospital, and all patients with the given disease in that clinic are examined.

Each element of the sample is called a variant. The number of repetitions of a variant in the sample is called its frequency of occurrence n_i. The relative frequency ω_i of a variant is the ratio of its absolute frequency to the sample size, ω_i = n_i / n. A sequence of variants written in ascending order is called a variational series.


Let us consider three forms of variational series: ranked, discrete and interval.

ranked row- this is a list of individual units of the population in ascending order of the trait under study.

A discrete variational series is a table consisting of columns or rows: the specific values of the attribute x_i and the absolute frequencies n_i (or relative frequencies ω_i) with which the i-th value of the attribute x occurs.

An example of a variational series is given by the following table.

Task: write down the distribution of relative frequencies.

Solution: find the relative frequencies by dividing each frequency by the sample size. The distribution of relative frequencies has the form:

0.15 0.5 0.35

Check: 0.15 + 0.5 + 0.35 = 1.

A discrete series can be represented graphically. In a rectangular Cartesian coordinate system the points with coordinates (x_i, n_i) or (x_i, ω_i) are plotted and connected by straight line segments. Such a broken line is called a frequency polygon.

Construct a discrete variational series (DVR) and draw the distribution polygon for 45 applicants according to the number of points they received in the entrance examinations:

39 41 40 42 41 40 42 44 40 43 42 41 43 39 42 41 42 39 41 37 43 41 38 43 42 41 40 41 38 44 40 39 41 40 42 40 41 42 40 43 38 39 41 41 42.

Solution: to build the variational series, we arrange the different values of the feature x (the variants) in ascending order and write the frequency of each value under it:

x:   37, 38, 39, 40, 41, 42, 43, 44
n_i:  1,  3,  5,  8, 12,  9,  5,  2   (total n = 45)

Let's build a polygon of this distribution:

Fig. 13.1. Frequency polygon
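The same discrete variational series can be obtained programmatically; a minimal Python sketch using the applicants' scores listed above.

```python
from collections import Counter

scores = [39, 41, 40, 42, 41, 40, 42, 44, 40, 43, 42, 41, 43, 39, 42,
          41, 42, 39, 41, 37, 43, 41, 38, 43, 42, 41, 40, 41, 38, 44,
          40, 39, 41, 40, 42, 40, 41, 42, 40, 43, 38, 39, 41, 41, 42]

series = sorted(Counter(scores).items())      # (value, absolute frequency), ascending
n = len(scores)
for x, ni in series:
    print(f"x = {x}: n_i = {ni}, omega_i = {ni / n:.3f}")
```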

An interval variational series is used for a large number of observations. To build such a series one must choose the number of intervals of the feature and set the length of the interval. With a large number of groups the interval length will be minimal. The number of groups in a variational series can be found from the Sturges formula k = 1 + 3.322 · lg n (k is the number of groups, n is the sample size), and the interval width is

h = (x_max - x_min) / k,

where x_max and x_min are the maximum and minimum values of the variants; their difference R = x_max - x_min is called the range of variation.

We study a sample of 100 people drawn from the population of all students of a medical university.

Solution: calculate the number of groups: k = 1 + 3.322 · lg 100 = 1 + 3.322 · 2 ≈ 7.6. Thus, to compile an interval series, it is better to divide this sample into 7 or 8 groups. The set of groups into which the results of observations are divided, together with the frequencies of the observations falling into each group, is called a statistical aggregate.

A histogram is used to visualize a statistical distribution.

A frequency histogram is a stepped figure consisting of adjacent rectangles built on one straight line; their bases are equal to the interval width, and the height of each rectangle is equal either to the frequency n_i of falling into the interval or to the relative frequency ω_i.

Observations of the number of particles that hit the Geiger counter for a minute gave the following results:

21 30 39 31 42 34 36 30 28 30 33 24 31 40 31 33 31 27 31 45 31 34 27 30 48 30 28 30 33 46 43 30 33 28 31 27 31 36 51 34 31 36 34 37 28 30 39 31 42 37.

Based on these data, build an interval variation series with equal intervals (I interval 20-24; II interval 24-28, etc.) and draw a histogram.

Solution: n = 50.

The histogram of this distribution looks like:

Fig. 13.2. Distribution histogram
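A minimal Python sketch of the grouping for this task. The convention that a value equal to an interval boundary (e.g. 24) belongs to the interval on its right is an assumption, since the text does not specify it.

```python
counts = [21, 30, 39, 31, 42, 34, 36, 30, 28, 30, 33, 24, 31, 40, 31,
          33, 31, 27, 31, 45, 31, 34, 27, 30, 48, 30, 28, 30, 33, 46,
          43, 30, 33, 28, 31, 27, 31, 36, 51, 34, 31, 36, 34, 37, 28,
          30, 39, 31, 42, 37]

edges = list(range(20, 56, 4))          # interval boundaries 20, 24, ..., 52
freq = [0] * (len(edges) - 1)           # 8 intervals: 20-24, 24-28, ..., 48-52
for value in counts:
    # a value equal to a boundary (e.g. 24) is placed in the interval to its right
    j = min((value - edges[0]) // 4, len(freq) - 1)
    freq[j] += 1

n = len(counts)
for lo, hi, ni in zip(edges[:-1], edges[1:], freq):
    print(f"{lo}-{hi}: n_i = {ni}, omega_i = {ni / n:.2f}")
```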

Task options

№ 13.1. Every hour the voltage in the mains was measured, and the following values were obtained (V):

227 219 215 230 232 223 220 222 218 219 222 221 227 226 226 209 211 215 218 220 216 220 220 221 225 224 212 217 219 220.

Build a statistical distribution and draw a polygon.

№ 13.2. Observations of blood sugar in 50 people gave the following results:

3.94 3.84 3.86 4.06 3.67 3.97 3.76 3.61 3.96 4.04

3.82 3.94 3.98 3.57 3.87 4.07 3.99 3.69 3.76 3.71

3.81 3.71 4.16 3.76 4.00 3.46 4.08 3.88 4.01 3.93

3.92 3.89 4.02 4.17 3.72 4.09 3.78 4.02 3.73 3.52

3.91 3.62 4.18 4.26 4.03 4.14 3.72 4.33 3.82 4.03

Based on these data, build an interval variation series with equal intervals (I - 3.45-3.55; II - 3.55-3.65, etc.) and depict it graphically, draw a histogram.

№ 13.3. Construct the frequency distribution series of erythrocyte sedimentation rate (ESR) for 100 people.


Introduction

Mathematical statistics is the science of mathematical methods for systematizing and using statistical data for scientific and practical conclusions. In many of its sections, mathematical statistics is based on the theory of probability, which makes it possible to assess the reliability and accuracy of conclusions drawn on the basis of limited statistical material (for example, to estimate the required sample size to obtain the results of the required accuracy in a sample survey).

In probability theory, random variables with a given distribution or random experiments are considered, the properties of which are completely known. The subject of probability theory is the properties and relationships of these quantities (distributions).

But often an experiment is a black box that gives only some results for which it is required to draw a conclusion about the properties of the experiment itself. The observer has a set of numerical (or they can be made numerical) results obtained by repeating the same random experiment under the same conditions.

In this case, for example, the following question arises: if we observe one random variable, how can we draw the most accurate conclusion about its distribution from the set of its values obtained in several experiments?

An example of such a series of experiments is a sociological survey, a set of economic indicators or, finally, a sequence of heads and tails in a thousand-fold coin toss. All of the above determines the relevance and significance of the subject of this work, aimed at a deep and comprehensive study of the basic concepts of mathematical statistics.

1. Subject and method of mathematical statistics

Depending on the mathematical nature of the specific results of observations, mathematical statistics is divided into statistics of numbers, multivariate statistical analysis, the analysis of functions (processes) and time series, and the statistics of objects of non-numerical nature. A significant part of mathematical statistics is based on probabilistic models. The general tasks of data description, estimation and hypothesis testing are distinguished, as well as more particular tasks connected with conducting sample surveys, restoring dependencies, and building and using classifications (typologies).

To describe data, tables, diagrams and other visual representations are constructed, for example correlation fields. Probabilistic models are usually not used here. Some data description methods rely on advanced theory and the capabilities of modern computers. These include, in particular, cluster analysis, aimed at identifying groups of mutually similar objects, and multidimensional scaling, which allows objects to be visualized on a plane with the least distortion of the distances between them.

Methods of estimation and hypothesis testing rely on probabilistic data-generation models. These models are divided into parametric and non-parametric. In parametric models it is assumed that the objects under study are described by distribution functions depending on a small number (1-4) of numerical parameters. In non-parametric models the distribution functions are assumed to be arbitrary continuous functions. Mathematical statistics estimates the parameters and characteristics of a distribution (mathematical expectation, median, variance, quantiles, etc.), the density and the distribution function, and the dependences between variables (on the basis of linear and non-parametric correlation coefficients, as well as parametric or non-parametric estimates of functions expressing dependencies), and so on. Both point estimates and interval estimates (giving bounds for the true values) are used.

Mathematical statistics contains a general theory of hypothesis testing and a large number of methods devoted to testing specific hypotheses. Hypotheses are considered about the values of parameters and characteristics, about homogeneity (that is, about the coincidence of characteristics or distribution functions in two samples), about the agreement of the empirical distribution function with a given distribution function or with a parametric family of such functions, about the symmetry of a distribution, and so on.

Of great importance is the section of mathematical statistics connected with conducting sample surveys, with the properties of various sampling schemes and with the construction of adequate methods for estimation and hypothesis testing.

Dependence recovery problems have been actively studied for more than 200 years since the development of the method of least squares by K. Gauss in 1794. Currently, the methods of searching for an informative subset of variables and non-parametric methods are the most relevant.

The development of methods of data approximation and dimensionality reduction of descriptions began more than 100 years ago, when K. Pearson created the method of principal components. Later, factor analysis and numerous non-linear generalizations were developed.

Various methods of constructing (cluster analysis) and using (discriminant analysis) classifications (typologies) are also called methods of pattern recognition (with or without a teacher), automatic classification, and so on.

Mathematical methods in statistics are based either on the use of sums (on the basis of the central limit theorem of probability theory) or on measures of difference (distances, metrics), as in the statistics of objects of non-numerical nature. Usually only asymptotic results are rigorously substantiated. Nowadays computers play a large role in mathematical statistics. They are used both for calculations and for simulation modelling (in particular, in resampling methods and in studying the applicability of asymptotic results).

1.1 Basic concepts of mathematical statistics

An exceptionally important role in the analysis of many psychological and pedagogical phenomena is played by mean values, which are a generalized characteristic of a qualitatively homogeneous population with respect to a certain quantitative attribute. It is impossible, for example, to calculate the average specialty or the average nationality of university students, since these are qualitatively heterogeneous phenomena. On the other hand, it is possible and necessary to determine, on average, the numerical characteristic of their progress (the average grade), the effectiveness of methodological systems and techniques, and so on.

In psychological and pedagogical research, different kinds of mean values are usually used: the arithmetic mean, the geometric mean, the median, the mode, and others. The most common are the arithmetic mean, the median and the mode.

The arithmetic mean is used in cases where there is a directly proportional relationship between the defining property and this feature (for example, when the performance of the study group improves, the performance of each member improves).

The arithmetic mean is the quotient of the sum of the values divided by their number and is calculated by the formula

X̄ = (X1 + X2 + X3 + ... + Xn) / n = ΣXi / n,

where X̄ is the arithmetic mean; X1, X2, X3, ..., Xn are the results of individual observations (techniques, actions);

n is the number of observations (techniques, actions);

ΣXi is the sum of the results of all observations (techniques, actions).

The median (Me) is a measure of central position that characterizes the value of a feature on an ordered scale (arranged in increasing or decreasing order) corresponding to the middle of the population under study. The median can be determined for ordinal and quantitative features. The position of this value is determined by the formula:

Median place = (n + 1) / 2

For example. According to the results of the study, it was found that:

On “excellent” study - 5 people from participating in the experiment;

On "good" study - 18 people;

On "satisfactory" - 22 people;

On "unsatisfactory" - 6 people.

Since a total of N = 54 people took part in the experiment, the middle of the sample is (54 + 1) / 2 = 27.5, i.e. it falls between the 27th and 28th participants. Hence it is concluded that more than half of the students study below the "good" mark, that is, the median is more than "satisfactory" but less than "good".

The mode (Mo) is the most frequently occurring, typical value of a feature among the other values. It corresponds to the class with the highest frequency, which is called the modal class.

For example.

If, to the questionnaire question "indicate your degree of proficiency in a foreign language", the answers were distributed as follows:

1 - fluent - 25

2 - I know enough to communicate - 54

3 - I know, but I have difficulties in communication - 253

4 - understand with difficulty - 173

5 - do not own - 28

Obviously, the most typical answer here is "I know the language, but I have difficulties in communication"; this is the modal value, and its frequency is 253.

When using mathematical methods in psychological and pedagogical research, great importance is given to the calculation of variance and root-mean-square (standard) deviations.

The variance is equal to the mean square of the deviations of the values of the variants from their mean. It serves as one of the characteristics of the scatter of the individual values of the studied variable (for example, student grades) around the mean value. The variance is calculated by determining: the deviations from the mean value; the squares of these deviations; the sum of the squared deviations; and the mean of the squared deviations.

The dispersion value is used in various statistical calculations, but is not directly observable. The quantity directly related to the content of the observed variable is the standard deviation.

The standard deviation confirms the typicality and representativeness of the arithmetic mean; it reflects the measure of fluctuation of the numerical values of the feature from which the mean is derived. It is equal to the square root of the variance and is determined by the formula

σ = √( Σ(Xi - X̄)² / N ), (2)

where σ is the root-mean-square (standard) deviation. With a small number of observations (actions), fewer than 100, the denominator of the formula should be not N but N - 1.
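A minimal Python sketch of formula (2), with the choice between the N and N - 1 denominators mentioned above; the grades in the usage example are hypothetical.

```python
import math

def variance_and_std(values, corrected=False):
    """Variance and standard (root-mean-square) deviation.

    With corrected=True the denominator N - 1 is used, as recommended above
    for small numbers of observations (fewer than 100).
    """
    n = len(values)
    mean = sum(values) / n
    denom = n - 1 if corrected else n
    var = sum((x - mean) ** 2 for x in values) / denom
    return var, math.sqrt(var)

grades = [3, 4, 5, 4, 3, 5, 4, 4, 2, 5]          # hypothetical student grades
print(variance_and_std(grades))                  # formula with denominator N
print(variance_and_std(grades, corrected=True))  # corrected formula with N - 1
```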

The arithmetic mean and the standard deviation are the main characteristics of the results obtained in the course of research. They make it possible to summarize the data, to compare them, and to establish the advantages of one psychological and pedagogical system (programme) over another.

The root mean square (standard) deviation is widely used as a measure of dispersion for various characteristics.

When evaluating the results of the study, it is important to determine the dispersion of a random variable around the mean value. This dispersion is described using the Gauss law (the law of the normal distribution of the probability of a random variable). The essence of the law is that when measuring a certain attribute in a given set of elements, there are always deviations in both directions from the norm due to many uncontrollable reasons, and the larger the deviations, the less often they occur.

In further data processing the following can be determined: the coefficient of variation (stability) of the phenomenon under study, which is the standard deviation expressed as a percentage of the arithmetic mean; a measure of skewness, showing in which direction the predominant number of deviations lies; a measure of kurtosis, which shows the degree of concentration of the values of the random variable around the mean, and so on. All these statistics help to reveal the features of the phenomena under study more fully.

Measures of association between variables. Relationships (dependencies) between two or more variables in statistics are called correlations. It is estimated using the value of the correlation coefficient, which is a measure of the degree and magnitude of this relationship.

There are many correlation coefficients. Let us consider only a part of them, which take into account the presence of a linear relationship between the variables. Their choice depends on the scales for measuring the variables, the relationship between which must be assessed. The Pearson and Spearman coefficients are most often used in psychology and pedagogy.

1.2 Basic concepts of sampling

Let X be a random variable observed in a random experiment. It is assumed that a probability space is given (it will not concern us here).

We will assume that, after this experiment has been conducted n times under identical conditions, we have obtained the numbers x_1, x_2, ..., x_n, the values of this random variable in the first, second, and subsequent experiments. The random variable X has some distribution that is partially or completely unknown to us.

Let us take a closer look at the set x_1, ..., x_n, which is called a sample.

In a series of experiments that has already been performed, the sample is a set of numbers. But if this series of experiments is repeated, then instead of this set we obtain a new set of numbers. In place of the number x_1 another number appears, one of the values of the random variable X. That is, x_1 (and x_2, x_3, etc.), considered before the experiment, is a variable quantity that can take the same values as the random variable X, and just as often (with the same probabilities). Thus, before the experiment x_i is a random variable identically distributed with X, and after the experiment it is the number we observe in the i-th experiment, i.e. one of the possible values of the random variable X.

A sample of size n is a set X_1, ..., X_n of independent, identically distributed random variables ("copies" of X) that have the same distribution as X.

What does it mean to "draw a conclusion about the distribution from a sample"? A distribution is characterized by its distribution function, its density or probability table, and by a set of numerical characteristics (the expectation, the variance, higher moments, and so on). From the sample one must be able to construct approximations to all these characteristics.

1.3 Sampling

Consider the realization of the sample at one elementary outcome, i.e. a set of numbers x_1, ..., x_n. On a suitable probability space we introduce a random variable X* taking the values x_1, ..., x_n with probabilities 1/n each (if some of the values coincide, their probabilities are added the corresponding number of times).

The distribution of the quantity X* is called the empirical or sample distribution. Let us calculate the mathematical expectation and variance of X* and introduce notation for these quantities:

E X* = (1/n) · Σ x_i = x̄,   D X* = (1/n) · Σ (x_i - x̄)² = s².

In the same way we calculate the moment of order k:

E (X*)^k = (1/n) · Σ x_i^k.

In the general case, for a function g we denote

E g(X*) = (1/n) · Σ g(x_i).

If, when constructing all the characteristics introduced above, we regard the sample X_1, ..., X_n as a set of random variables, then these characteristics themselves, x̄, s², and the sample moments, become random variables. These characteristics of the sample distribution are used to estimate (approximate) the corresponding unknown characteristics of the true distribution.

The reason for using the characteristics of the empirical distribution to estimate the characteristics of the true distribution is the closeness of these distributions for large n.

Consider, for example, the tossing of a fair die. Let X_i be the number of points obtained on the i-th throw. Suppose that in the sample a one occurs k_1 times, a two k_2 times, and so on. Then the random variable X* takes the values 1, ..., 6 with probabilities k_1/n, ..., k_6/n respectively. By the law of large numbers these proportions approach 1/6 as n grows; that is, the distribution of X* in a certain sense approaches the true distribution of the number of points obtained when a fair die is tossed.

1.4 The empirical distribution function. Histogram

Since the unknown distribution can be described, for example, by its distribution function, we will construct an “estimate” for this function based on the sample.

Definition 1. The empirical distribution function built from a sample of size n is the random function F*_n(y), for each y equal to

F*_n(y) = (1/n) · Σ I(X_i < y),   i.e. the proportion of sample elements less than y.

Reminder: the random function

I(A) = 1 if the event A occurs, and 0 otherwise,

is called the indicator of the event A. For each y, I(X_i < y) is a random variable having the Bernoulli distribution with parameter p = P(X_1 < y) = F(y).

In other words, for any y the value F(y), equal to the true probability that the random variable is less than y, is estimated by the proportion of sample elements smaller than y.

If the sample elements $X_1, \dots, X_n$ are sorted in ascending order (on each elementary outcome), a new set of random variables is obtained, called the variational series:

$$X_{(1)} \le X_{(2)} \le \dots \le X_{(n)}.$$

The element $X_{(k)}$ is called the $k$-th member of the variational series or the $k$-th order statistic.

The empirical distribution function has jumps at the sample points; the jump at a point $X_i$ equals $m/n$, where $m$ is the number of sample elements coinciding with $X_i$.

From the variational series the empirical distribution function can be constructed as

$$F_n^*(y) = \begin{cases} 0, & y \le X_{(1)}, \\ k/n, & X_{(k)} < y \le X_{(k+1)}, \\ 1, & y > X_{(n)}. \end{cases}$$
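For a concrete (added, not original) illustration, the sketch below builds $F_n^*$ in Python by sorting the sample into the variational series and counting, for each $y$, how many elements are strictly less than $y$.

```python
import numpy as np

def empirical_cdf(sample):
    """Return F_n*(y) = (number of X_i < y) / n as a function of y."""
    order = np.sort(sample)          # the variational series X_(1) <= ... <= X_(n)
    n = len(order)

    def F(y):
        # side='left' counts elements strictly less than y
        return np.searchsorted(order, y, side='left') / n

    return F

F = empirical_cdf(np.array([2.0, 1.0, 3.0, 1.0, 2.0]))
print(F(0.5), F(1.5), F(2.5), F(10.0))   # 0.0, 0.4, 0.8, 1.0
```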

Another characteristic of a distribution is the table (for discrete distributions) or the density (for absolutely continuous ones). The empirical, or sample, analogue of the table or density is the so-called histogram. The histogram is built from grouped data. The assumed range of values of the random variable $X$ (or the actual range of the sample data) is divided, independently of the sample, into a certain number of intervals (not necessarily equal). Let $A_1, \dots, A_k$ be intervals on the line, called grouping intervals. For $j = 1, \dots, k$ denote by $\nu_j$ the number of sample elements that fall into the interval $A_j$:

$$\nu_j = \#\{i : X_i \in A_j\} = \sum_{i=1}^{n} I(X_i \in A_j).$$

On each interval $A_j$ a rectangle is built whose area is proportional to $\nu_j$; the total area of all the rectangles must equal one. Let $l_j$ be the length of the interval $A_j$. The height $f_j$ of the rectangle above $A_j$ is

$$f_j = \frac{\nu_j}{n\, l_j}.$$

The resulting figure is called a histogram.

For example, divide the range of a sample of size $n = 15$ into 4 equal segments. Suppose 4 sample elements fall into the first segment, 6 into the second, 3 into the third, and 2 into the fourth. We build the histogram (Fig. 2). Fig. 3 shows a histogram for the same sample, but with the range divided into 5 equal segments.
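As an added illustration (the 15 numbers below are made up so that their counts per interval match the 4, 6, 3, 2 of the example, and the grouping range $[0, 4]$ is an assumption of this sketch), the code computes the heights $f_j = \nu_j/(n\,l_j)$ and checks that the total area of the rectangles equals one.

```python
import numpy as np

def histogram_heights(sample, edges):
    """Heights f_j = nu_j / (n * l_j) over the grouping intervals given by edges."""
    nu, _ = np.histogram(sample, bins=edges)   # nu_j: counts per interval A_j
    lengths = np.diff(edges)                   # l_j: interval lengths
    heights = nu / (len(sample) * lengths)
    return nu, heights

sample = np.array([0.2, 0.5, 0.9, 0.7, 1.1, 1.3, 1.6, 1.8, 1.9, 1.4,
                   2.1, 2.5, 2.8, 3.3, 3.7])
edges = np.linspace(0.0, 4.0, 5)               # 4 equal grouping intervals on [0, 4]
nu, heights = histogram_heights(sample, edges)

print(nu)                                      # [4 6 3 2]
print(heights)                                 # rectangle heights f_j
print(np.sum(heights * np.diff(edges)))        # total area of the rectangles: 1.0
```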

The Econometrics course states that the best number of grouping intervals ("Sturges' formula") is

$$k = 1 + 3.322\,\lg n.$$

Here $\lg n$ is the decimal logarithm, so

$$k = 1 + 3.322\,\lg n \approx 1 + \log_2 10 \cdot \lg n = 1 + \log_2 n,$$

i.e. when the sample size is doubled, the number of grouping intervals increases by 1. Note that the more grouping intervals, the better. But if we take the number of intervals, say, of order $n$, then as $n$ grows the histogram will not approach the density.
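A small sketch (added here) of how the rule behaves; `sturges_bins` is a hypothetical helper implementing $k = 1 + 3.322\,\lg n$ and rounding to an integer.

```python
import math

def sturges_bins(n):
    """Number of grouping intervals by Sturges' formula: k = 1 + 3.322 * lg(n)."""
    return round(1 + 3.322 * math.log10(n))

for n in (50, 100, 200, 400):
    print(n, sturges_bins(n))   # doubling n adds roughly one interval
```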

The following statement is true:

If the density of the distribution of the sample elements is a continuous function, and $k \to \infty$ in such a way that $k/n \to 0$, then pointwise convergence in probability of the histogram to the density takes place.

So a choice of $k$ of logarithmic order is reasonable, but it is not the only one possible.


Mathematical statistics is one of the main branches of mathematics; it studies methods and rules for processing observational data. In other words, it explores ways of uncovering the patterns inherent in large collections of objects of the same kind, based on a sample survey of them.

The task of this branch is to develop methods for estimating probabilities or making decisions about the nature of unfolding events, based on the results obtained. Tables, charts, and correlation fields are used to describe the data.

Mathematical statistics is used in various fields of science. For example, in economics it is important to process information about homogeneous sets of phenomena and objects: products manufactured by industry, personnel, profit data, and so on. Depending on the mathematical nature of the observation results, one can distinguish the statistics of numbers, multivariate analysis, and the analysis of functions and objects of a non-numerical nature. In addition, both general and particular problems are considered (related to the recovery of dependencies, the use of classifications, and sample surveys).

The authors of some textbooks believe that the theory of mathematical statistics is only a section of the theory of probability, while others believe that it is an independent science with its own goals, objectives and methods. However, in any case, its use is very extensive.

Mathematical statistics is perhaps most visibly applied in psychology. Its use allows the specialist to substantiate conclusions correctly, find relationships between data, generalize them, avoid many logical errors, and much more. It should be noted that it is often simply impossible to measure a given psychological phenomenon or personality trait without computational procedures, which shows that the basics of this science are necessary. Probability theory, in turn, can be called its source and foundation.

The research method that relies on the consideration of statistical data is also used in other areas. However, it should be noted at once that its features, when applied to objects of a different nature of origin, are always specific. Therefore it makes no sense to merge, say, the statistics of physical and of social phenomena into a single science. The general features of the method come down to counting the number of objects in a particular group, studying the distribution of quantitative features, and applying probability theory to obtain conclusions.

Elements of mathematical statistics are also used in fields such as physics, astronomy, etc. Here one can estimate the values of characteristics and parameters, test hypotheses about the coincidence of characteristics in two samples, about the symmetry of a distribution, and much more.

Mathematical statistics plays an important role in carrying out such research; its goal is most often to construct adequate methods of estimation and hypothesis testing. Computer technology is now of great importance in this science: it not only greatly simplifies computation, but also makes it possible to generate samples for replication and to study how well the results obtained hold up in practice.

In general, the methods of mathematical statistics help to draw one of two conclusions: either to make the desired judgment about the nature or properties of the data under study and their relationships, or to show that the results obtained are insufficient for drawing conclusions.