Statistics is a science that studies the quantitative side of mass phenomena (not single phenomena, but phenomena occurring in a large number of cases). Depending on the subject area in which we apply it, we distinguish biostatistics, economic statistics, agricultural statistics, etc. Biostatistics, therefore, is a science studying the quantitative side of mass phenomena in the field of medicine and health care in order to reveal regularities and characterize their specific manifestations.
Statistical population. The objects or cases through or in which a given phenomenon can manifest itself can be united, in the process of their study, into homogeneous groups called statistical populations. This unification is not done formally, but on the basis of the existing mutual relationship between the individual cases, which arises from their qualitative uniformity. Statistical populations can be of several types:
Statistical units. The units (cases, events) that make up the statistical populations are called statistical units.
The general population (population). Covers all units of the study population. The cases do not have to be human individuals; they can be laboratory animals, hospitals, pharmacies, etc. If, for example, we want to study the spread of hepatitis B in our country, the general population will include all people in Bulgaria. If we are interested in how many people suffer from prostate cancer in the country, the general population will include only the men in Bulgaria.
The sample. It represents a certain part of the general population.
Statistical signs (variables) are the basic qualities, features and characteristics on which the statistical study is carried out. They are also called variables because their values can change from one unit of the study to another. The term "variable" is used in contrast to "constant" – a value that does not change. For example, age and blood type are variables, because their values vary among the subjects.
Depending on the nature of their values, variables are classified into two main types: qualitative (categorical) and quantitative.
The type of variable significantly determines the choice of correct statistical analyses.
Four types of measuring scales are used to measure the variables:
Qualitative (categorical) data are presented on the weak scales (nominal and ordinal), and quantitative data on the strong scales (interval and absolute).
The main stages through which a statistical study passes are four, namely: description (planning and organization); measurement of the planned indicators (field phase); statistical processing (description and analysis); and discussion (interpretation and generalization) of the obtained results. The interconnectedness of these stages means that each stage imposes certain requirements on the others.
In the course of each specific study, taking into account the specifics of the intended goal and tasks, some of the components of this general scheme may be dropped, as well as new elements may be added.
First stage – description (planning and organization) of the study
The reason for conducting a statistical study in the field of occupational medicine is the occurrence of an occupational medical problem. This stage begins by describing the problem that has arisen (for example, frequent clinical complaints when working in a specific work environment), i.e. observed phenomena are described. After that, those categories, qualities, properties and peculiarities of the studied phenomena are selected, which correspond to the described problem to the highest degree. During this stage, the assessment is also made of what new information can be obtained from this research. For this purpose, it is necessary to consider the results and conclusions of similar studies carried out to date.
After these steps, the aims and objectives of the study, as well as the research hypotheses, are formulated.
The next steps in this stage are: determining the object of the study (population, defined population at risk, sample, etc.); depending on the presence of the described phenomena, the scope of the statistical population is also determined; determining the units of observation, hypothesized factors and dependent variables (traits).
The planning scheme also requires choosing an appropriate type of survey – comprehensive or representative; retrospective or prospective; one-time, periodic or ongoing. The observation period is determined, taking into account the development and spread of diseases, the duration of exposure, the time interval of the tested therapeutic or prophylactic interventions and medicinal products.
During this stage, the methodology for data collection is chosen and described (survey, interview, observation, experiment, medical examinations, biological samples, laboratory tests, screening methods, etc.).
In preparing the primary documents for data collection, data standardization and a system for coding the received information are provided for. It is of particular importance to also provide a system for technical and logical control and verification of the reliability of the data.
It is planned to use adequate statistical methods and analyses, which are consistent with the nature and character of the investigated phenomena, since the scientific validity of the obtained results largely depends on this.
If a representative study is planned, an appropriate sampling model is chosen and the rules for forming the sample itself are determined. This choice is related to the characteristics and volume of the general population, the required accuracy of the survey, the available resources, and the possibilities for controlling factors that can negatively affect the obtained data (confounding factors).
On the basis of everything described so far, a document known as the research protocol of the study is prepared. Depending on the nature of the study, in some cases this protocol must be approved by the relevant Ethics Committee.
In some types of studies, it is necessary to conduct a so-called pilot study. It is performed on a small sample, over a short time, and its purpose is to validate the described research protocol. Any mistakes found are then corrected, and the final version of the protocol is prepared.
This stage of the study ends with the preparation of a detailed financial plan, provision of the necessary resources (human and technical) and time intervals for the implementation of each step of the protocol, as well as the persons responsible for the implementation of the tasks.
Second stage – measurement (field phase)
During this stage, the measurements themselves are carried out in real conditions. Its duration varies depending on the type of survey. This stage must be conducted in full compliance with the instructions prescribed in the research protocol; otherwise, systematic error may be introduced. Such error can also result from inaccurate measurement, incorrect classification of factor and dependent variables (exposures, health status), imprecise measuring devices, etc. Systematic error increases or decreases the reported values, which in turn compromises the accuracy of the estimates. The aim in conducting this stage is therefore to reduce the possibility of systematic error.
The collected data is entered into a computer, usually in a pre-made spreadsheet. With this input, technical and logical control criteria can be set in order to limit unacceptable (wrongly reported) data, and hence reduce the possibility of making a systematic error. This stage ends with exactly this type of reliability checks (validation) of the collected data. Errors are corrected if possible, and data of questionable and unacceptable value is removed.
Third stage – statistical processing (description and analysis) of the data
Following the protocol of the statistical study, at this stage the statistical processing of the data is carried out, which is related to the nature of the study and the type of variables under consideration. This includes data grouping, tabular and graphical presentation, as well as biostatistical analysis.
The statistical grouping is not done for its own sake, but on the basis of characteristics of medical importance, such as health and social status, demographic indicators, physico-chemical factors, intensity of exposure, dose of the drug taken, etc.
Biostatistical analysis includes: comparing the frequencies of health events or phenomena, testing statistical hypotheses related to factor dependence in specific diseases, quantitative assessment of established relationships and dependencies, modeling and forecasting of established trends, etc.
Essentially, the biostatistical analysis should proceed in unity with the qualitative analysis in the subject area (medico-biological, clinical and medico-social).
Fourth stage – discussion (interpretation and summarization) of the results
At this stage of the study, the results are subjected to a thorough discussion and interpretation from the point of view of epidemiological understanding, with the focus on the most important facts established in the research process.
Significant importance is attached to the reliability and evidential strength of the results, as well as to their representativeness and the various aspects of the validity of the statistical study.
When discussing the validity of the results, randomness, bias, and confounding factors should be considered as additional alternative explanations.
At the end of this stage, which is also the end of the research, the conclusions are formulated. New hypotheses related to the problems may be defined, requiring new studies, and recommendations are made for the introduction of measures and programs for disease control – therapeutic schemes, curative and prophylactic means, etc. One of the most important points in a statistical study is the interpretation of the results, because incorrect interpretation inevitably leads to incorrect conclusions.
These are summary statistical characteristics that reflect what is general and typical for the given sample. There are many measures of central tendency, but the arithmetic mean, the median and the mode are the indicators most commonly used in practice.
The arithmetic mean is calculated by the formula:

x̄ = Σxi / n, where:
xi – the sign value for the i-th unit in the set
n – the total number of cases
For example, if we calculate the average age, then xi is the age of each individual and n is the number of such individuals.
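As a minimal sketch of this calculation (in Python, with hypothetical ages not taken from the text):

```python
def arithmetic_mean(values):
    """Sum of the observed values divided by the number of cases n."""
    return sum(values) / len(values)

# Hypothetical ages of five study participants
ages = [24, 31, 28, 45, 22]
mean_age = arithmetic_mean(ages)  # (24+31+28+45+22)/5 = 30.0
```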
If the quantitative feature is grouped in intervals, it is a mandatory condition that they have the same width. In this case, the arithmetic mean can be calculated using another formula:

x̄ = Σ(xi · fi) / Σfi, where:
xi – the middle of the corresponding interval;
fi – the frequency (number of cases) in the corresponding interval
n – the number of intervals
Of course, calculating any estimates from grouped data reduces their accuracy (they are more imprecise). It is preferable to calculate summary statistics from ungrouped data whenever possible.
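The grouped (weighted) mean can be sketched as follows; the intervals and frequencies are hypothetical:

```python
def grouped_mean(midpoints, frequencies):
    """Weighted mean: each interval midpoint xi weighted by its frequency fi."""
    total = sum(frequencies)
    return sum(x * f for x, f in zip(midpoints, frequencies)) / total

# Hypothetical age intervals 20-30, 30-40, 40-50: midpoints and counts
mids = [25, 35, 45]
freqs = [10, 6, 4]
m = grouped_mean(mids, freqs)  # (250 + 210 + 180) / 20 = 32.0
```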
The median (Me) is the positional middle of the units. To determine it, all cases must be sorted by the size of the characteristic of interest (for example, by age). The value that falls exactly in the middle of the ordered series is the median. A single middle value exists only when the number of cases is odd; if the number of units is even, the median is taken to be the arithmetic mean of the two values in the middle. The sequence number of the unit that is the median is determined by the formula:

nMe = (n + 1) / 2, where:
nMe is the sequence number of the unit whose value is the median.
The median can also be calculated from grouped data. Here, too, the condition that the intervals have equal width applies. The formula is as follows:

Me = LMe + e · (Σf/2 − CMe-1) / fMe, where:
LMe – the lower limit of the median interval
∑f – the total number of cases
CMe-1 – the cumulative frequency accumulated up to (but not including) the median interval
e – width of intervals
fMe – the frequency in the median interval
To determine the median interval, we need to calculate the cumulative frequencies. The interval that contains the unit with sequence number n/2 is the median interval.
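The two median calculations above can be sketched in Python; the grouped version follows the formula Me = LMe + e·(Σf/2 − CMe-1)/fMe, and the example data are hypothetical:

```python
def median(values):
    """Middle value of the sorted series; mean of the two middle values if n is even."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

def grouped_median(lower_limits, frequencies, width):
    """Median from equal-width intervals: L_Me + e * (n/2 - C_Me-1) / f_Me."""
    n = sum(frequencies)
    cum = 0
    for lower, f in zip(lower_limits, frequencies):
        if cum + f >= n / 2:              # this interval contains unit n/2
            return lower + width * (n / 2 - cum) / f
        cum += f
```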
The mode (Mo) is the most frequently occurring value of the variable. There may be more than one mode in a given statistical series, i.e. more than one value (e.g. ages 25 and 26) may occur the greatest number of times in the study sample.
For data grouped into equal-width intervals, the mode can also be calculated, according to the formula:

Mo = LMo + e · (fMo − fMo-1) / [(fMo − fMo-1) + (fMo − fMo+1)], where:
LMo – lower limit of the modal interval
fMo – frequency in the modal interval
fMo-1 – frequency in the premodal interval
fMo+1 – frequency in the postmodal interval
The modal interval is the interval containing the greatest number of cases.
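Both mode calculations can be sketched as follows; the interval data are hypothetical, and the boundary frequencies (before the first and after the last interval) are assumed to be zero:

```python
from collections import Counter

def modes(values):
    """All values occurring the maximum number of times (there may be more than one)."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

def grouped_mode(lower_limits, frequencies, width):
    """Mode from equal-width intervals, using the premodal and postmodal frequencies."""
    i = frequencies.index(max(frequencies))                    # modal interval
    f_mo = frequencies[i]
    f_prev = frequencies[i - 1] if i > 0 else 0                # assumed 0 at the edge
    f_next = frequencies[i + 1] if i < len(frequencies) - 1 else 0
    d1, d2 = f_mo - f_prev, f_mo - f_next
    return lower_limits[i] + width * d1 / (d1 + d2)
```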
As for categorical variables, the mode (the value that occurs most often) can be determined for any of them. For variables measured on the ordinal scale, the median can also be determined, since the categories can be arranged in ascending order. There is no arithmetic mean in the true sense for categorical variables. Conventionally, the relative share of the most common value (p) is taken as the average, while the sum of the shares of the remaining values is taken as its alternative (q). This usually applies to variables measured on a dichotomous scale (a special case of a categorical variable measured on a nominal scale with only two values – see Chapter One), but in principle any qualitative variable can be converted into one with fewer values.
The most commonly used measures of the dispersion of units in a given sample are the range, the standard deviation and the variance. They show how compact the given set is – whether the values are close together or vary widely in the size of the studied characteristic. In addition to these, the coefficient of variation and the interquartile range can also be calculated.
The range is the most basic measure of dispersion. It is calculated by the formula: d = xmax – xmin, where:
xmax – the largest value of the characteristic observed in the sample;
xmin – the smallest value of the characteristic.
This measure is very imprecise, because it is calculated from only two units of the study sample. In practice, the standard deviation (also known as the mean square deviation) and the variance are used more often. The standard deviation shows the average deviation of the values of the studied characteristic from their arithmetic mean. The formula for calculating this statistic (the sample standard deviation) is:

s = √( Σ(xi − x̄)² / (n − 1) )
The variance is the standard deviation squared, or:

s² = Σ(xi − x̄)² / (n − 1)
The standard deviation can be expressed in relative form by calculating the so-called coefficient of variation (V). It is calculated by the formula:

V = (s / x̄) · 100, %
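The dispersion measures above can be sketched together in Python; the sample (n − 1) convention for the standard deviation is an assumption here, and the data are hypothetical:

```python
import math

def dispersion_measures(values):
    """Range, sample standard deviation (n-1), variance, coefficient of variation (%)."""
    n = len(values)
    mean = sum(values) / n
    rng = max(values) - min(values)
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    sd = math.sqrt(variance)
    cv = sd / mean * 100
    return rng, sd, variance, cv

r, sd, var, cv = dispersion_measures([1, 2, 3, 4, 5])  # range 4, variance 2.5
```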
The coefficient of asymmetry (skewness) measures whether a distribution is symmetric or not. Indirectly, asymmetry can be suspected when the three measures of central tendency – the arithmetic mean, the median and the mode – are far apart. The coefficient of asymmetry is determined by the formula:

As = M3 / s³, where:
M3 – the third central moment, found by the formula:

M3 = Σ(xi − x̄)³ / n
When the skewness coefficient is 0, the distribution is perfectly symmetric. For values between −0.25 and 0.25, the distribution is moderately asymmetric. With positive values of the asymmetry coefficient greater than 0.25, we speak of right (positive) asymmetry, and with values below −0.25, of left (negative) asymmetry.
The coefficient of kurtosis measures how peaked or flat the top of the distribution curve is. It is determined by the formula:

Ex = M4 / s⁴ − 3, where M4 = Σ(xi − x̄)⁴ / n is the fourth central moment.
When a frequency distribution is symmetric and no kurtosis is observed (i.e. the coefficients of asymmetry and kurtosis fall within the limits −0.2 to 0.2), its shape is close to that of the normal distribution. In other words, the variable can be said to be approximately normally distributed.
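The moment coefficients can be sketched as follows; as an assumption, population moments (division by n) are used consistently throughout, including for s:

```python
import math

def skewness_kurtosis(values):
    """Moment coefficients: As = M3/s^3 and Ex = M4/s^4 - 3 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    s = math.sqrt(m2)
    return m3 / s ** 3, m4 / s ** 4 - 3

# A symmetric series gives As = 0
sk, ku = skewness_kurtosis([1, 2, 3, 4, 5])
```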
The normal distribution is a theoretical distribution that is symmetric about its central axis (bell-shaped). It is not a single distribution, but a family of normal distributions, each uniquely defined by its arithmetic mean and variance.
In essence, a hypothesis is a reasonable assumption about the course of a certain process, phenomenon or event. Scientific arguments are needed to confirm or reject it.
A statistical hypothesis is usually associated with an assumption about some unknown population parameters or about the type of frequency distribution in the population being studied.
In statistics, there are two types of hypotheses – the null hypothesis and the alternative hypothesis. These are two mutually exclusive (opposite) statements, i.e. if one is true, the other is automatically false. In this situation, it is sufficient to test only one of them. Only the null hypothesis is tested against data from representative studies.
Null hypothesis (H0) – a statement that is related to the presence of a null difference, null relationship, null effect (no difference, no relationship, no effect), i.e. the observed difference, relationship, effect, etc. is due only to chance, not to purposeful influence.
An alternative hypothesis (H1) – the opposite of the null hypothesis, or it is a statement related to the presence of a significant difference, relationship, or effect. The observed difference, relationship or effect is due not only to chance, but also to systematically (lawfully) operating causes.
The formulation of the alternative hypothesis is an important point in hypothesis testing. This statement is in most cases the researcher's desired result, and it can be accepted only if the null hypothesis is rejected. Alternative hypotheses can be either non-directional or directional. They are non-directional when they claim a significant difference without specifying its direction. If the direction of the difference is specified, we speak of a directional alternative.
Example 1. Let µ denote the average value of systolic blood pressure in a given population. If we want to test whether this value is 120 mm Hg, then the relevant hypotheses will be:
H0: µ=120
H1: µ≠120 (non-directional alternative)
H1: µ<120 or H1: µ>120 (directional alternative)
Example 2. Let µ1 and µ2 denote the population mean values of systolic blood pressure of patients from two groups. If we want to compare these pressures between the two groups, then the hypotheses will be:
H0: µ1=µ2
H1: µ1≠µ2 (non-directional alternative)
H1: µ1>µ2 or H1: µ1<µ2 (directional alternative)
Accepting or rejecting the null hypothesis are the two possible decisions the researcher can make on statistical grounds. Each of these decisions carries some probability of error.
Error of the first kind (α – error). Rejecting the null hypothesis when it is in fact valid is an error of the first kind. In other words, an error of the first kind means claiming a non-existent effect.
Error of the second kind (β – error). Accepting the null hypothesis when it is not valid is an error of the second kind, i.e. denying an existing effect.
Both types of errors are unintended consequences of the researcher's decision. An error of the first kind is the more undesirable consequence, for two main reasons. First, in most cases the researcher's goal is to demonstrate the existence of some effect, which corresponds to the decision to reject the null hypothesis; it is precisely this action that carries the possibility of an error of the first kind. Second, the theory of statistical hypothesis testing is built around controlling the probability of an error of the first kind in a relatively elementary way from the point of view of the mathematical apparatus, while controlling the error of the second kind is a complex and difficult mathematical task.
Statistical hypotheses are tested using specific statistical criteria. The values of these criteria are obtained from previously known theoretical distributions. When testing any specific statistical hypothesis, the chosen criterion has two values - theoretical and empirical. The theoretical value can be determined from tables with the theoretical values of the relevant criterion, and the empirical value is calculated according to a specific mathematical formula with the data of the representative sample. At the heart of every statistical hypothesis test is the statistical criterion.
In choosing an appropriate criterion, it is necessary to know the conditions that the variables must satisfy and the nature of the samples.
The probability of making an error of the first kind is related to the statistical significance of the obtained results: if this probability is small enough, the alternative hypothesis can be accepted, i.e. a statistically significant effect can be assumed. In this sense, the probability of making an error of the first kind represents the level of statistical significance. The choice of a critical (threshold) level of significance (α) is related to the statistical probability with which the researcher supports his claims (e.g. at a statistical probability of 0.95, α = 0.05; at 0.99, α = 0.01, etc.). The theoretical value of the chosen criterion divides the numerical axis into two regions – the region of acceptance of H0 and the region of rejection of H0. The region of rejection of the null hypothesis is also called the critical region for H0. This region can be two-sided or one-sided (left or right), depending on the formulated alternative hypothesis.
The probability of accepting H1, when it is the valid hypothesis, is related to the power of the chosen criterion. The criterion power (γ) is directly related to the error of the second kind and is expressed by the equality: γ=1-β
As became clear earlier, it is H0 that is tested. The procedure for this test goes through several stages.
The decision to accept or reject H0 depends on the ratio between the theoretical and empirical value of the selected criterion.
If Kemp > KT, the null hypothesis is rejected in favor of the alternative, and the conclusion is that there is a statistically significant effect. In the opposite case, when Kemp ≤ KT, the null hypothesis is accepted and the conclusion is that no statistically significant effect is observed. The decision to accept or reject H0 can also be made by comparing the set critical level of significance α with the empirical significance level (p) calculated from the data using statistical software. The check proceeds as follows: if p < α, then H0 is rejected in favor of H1; if p ≥ α, then H0 is accepted. In statistical software products that perform hypothesis testing, the p-value is usually denoted in the output as sig. or p-value.