Statistics is a science that studies the quantitative side of mass phenomena (not single phenomena, but phenomena occurring in a large number of cases). Depending on the subject area in which we apply it, we distinguish biostatistics, economic statistics, agricultural statistics, etc. Biostatistics, therefore, is a science studying the quantitative side of mass phenomena in the field of medicine and health care in order to reveal regularities and characterize their specific manifestations.
Statistical population. The objects or cases through or in which a given phenomenon can manifest itself can be united, in the process of their study, into homogeneous groups called statistical populations. This unification is not done formally, but on the basis of the existing mutual relationship between the individual cases, which arises from their qualitative uniformity. Statistical populations can be of several types:
Statistical units. The units (cases, events) that make up the statistical populations are called statistical units.
The general population (population). Covers all units of the study population. The cases do not have to be human individuals; they can be laboratory animals, hospitals, pharmacies, etc. If, for example, we want to study the spread of hepatitis B in our country, the general population will include all people in Bulgaria. If we are interested in how many people suffer from prostate cancer in the country, the general population will include only the men in Bulgaria.
The sample. It represents a certain part of the general population.
Statistical signs (variables) are the basic qualities, features and characteristics on which the statistical study is carried out. They are also called variables because their values can change from one unit of the study to another. The term "variable" is used in contrast to "constant" – a value that does not change. For example, age and blood type are variables, because their values vary among the subjects.
Depending on the nature of their values, variables are classified into two main types: qualitative (categorical) and quantitative.
The type of variable significantly determines the choice of correct statistical analyses.
Four types of measuring scales are used to measure the variables:
Qualitative (categorical) data are presented on the weak scales (nominal and ordinal), and quantitative data on the strong scales (interval and absolute).
The main stages through which a statistical study passes are four, namely: description (planning and organization); measurement of the planned indicators (field phase); statistical processing (description and analysis); and discussion (interpretation and generalization) of the obtained results. The interconnectedness of these stages means that each stage imposes certain requirements on the others.
In the course of each specific study, taking into account the specifics of the intended goal and tasks, some of the components of this general scheme may be dropped, as well as new elements may be added.
First stage – description (planning and organization) of the study
The reason for conducting a statistical study in the field of occupational medicine is the occurrence of an occupational medical problem. This stage begins by describing the problem that has arisen (for example, frequent clinical complaints when working in a specific work environment), i.e. observed phenomena are described. After that, those categories, qualities, properties and peculiarities of the studied phenomena are selected, which correspond to the described problem to the highest degree. During this stage, the assessment is also made of what new information can be obtained from this research. For this purpose, it is necessary to consider the results and conclusions of similar studies carried out to date.
After these steps, the aims and objectives of the study, as well as the research hypotheses, are formulated.
The next steps in this stage are: determining the object of the study (population, defined population at risk, sample, etc.); depending on the presence of the described phenomena, the scope of the statistical population is also determined; determining the units of observation, hypothesized factors and dependent variables (traits).
The planning scheme also requires choosing an appropriate type of survey – comprehensive or representative; retrospective or prospective; one-time, periodic or ongoing. The observation period is determined, taking into account the development and spread of diseases, the duration of exposure, the time interval of the tested therapeutic or prophylactic interventions and medicinal products.
During this stage, the methodology for data collection is chosen and described (survey, interview, observation, experiment, medical examinations, biological samples, laboratory tests, screening methods, etc.).
In preparing the primary documents for data collection, data standardization and a system for coding the received information are provided for. It is of particular importance to also provide a system for technical and logical control and verification of the reliability of the data.
It is planned to use adequate statistical methods and analyses, which are consistent with the nature and character of the investigated phenomena, since the scientific validity of the obtained results largely depends on this.
If a representative study is planned, an appropriate sampling model is chosen and the rules for forming the sample itself are determined. This choice is related to the characteristics and volume of the general population, the required accuracy of the survey, the available resources, and the possibilities for controlling factors that can negatively affect the obtained data (confounding factors).
On the basis of everything described so far, a document known as the research protocol of the study is prepared. Depending on the nature of the study, in some cases this protocol must be approved by the relevant Ethics Committee.
In some types of studies, it is necessary to conduct a so-called pilot study. It is performed on a small sample, over a short time, and its purpose is to validate the described research protocol. Any mistakes found are then corrected, and the final version of the protocol is prepared.
This stage of the study ends with the preparation of a detailed financial plan, provision of the necessary resources (human and technical) and time intervals for the implementation of each step of the protocol, as well as the persons responsible for the implementation of the tasks.
Second stage – measurement (field phase)
During this stage, the measurements themselves are carried out in real conditions. Its duration varies depending on the type of survey. This stage must be conducted in full compliance with the instructions prescribed in the research protocol; otherwise, systematic error may be introduced. Such error can also result from inaccurate measurement, incorrect classification of factor and dependent variables (exposures, health status), imprecise measuring devices, etc. Systematic error increases or decreases the reported values, which in turn compromises the accuracy of the estimates. The aim in conducting this stage is therefore to reduce the possibility of systematic error.
The collected data is entered into a computer, usually in a pre-made spreadsheet. With this input, technical and logical control criteria can be set in order to limit unacceptable (wrongly reported) data, and hence reduce the possibility of making a systematic error. This stage ends with exactly this type of reliability checks (validation) of the collected data. Errors are corrected if possible, and data of questionable and unacceptable value is removed.
Third stage – statistical processing (description and analysis) of the data
Following the protocol of the statistical study, at this stage the statistical processing of the data is carried out, which is related to the nature of the study and the type of variables under consideration. This includes data grouping, tabular and graphical presentation, as well as biostatistical analysis.
The statistical grouping is not done for its own sake, but on the basis of characteristics of medical importance, such as health and social status, demographic indicators, physico-chemical factors, intensity of exposure, dose of the drug taken, etc.
Biostatistical analysis includes: comparing the frequencies of health events or phenomena, testing statistical hypotheses related to factor dependence in specific diseases, quantitative assessment of established relationships and dependencies, modeling and forecasting of established trends, etc.
Essentially, the biostatistical analysis should proceed in unity with the qualitative analysis in the subject area (medico-biological, clinical and medico-social).
Fourth stage – discussion (interpretation and summarization) of the results
At this stage of the study, the results are subjected to a thorough discussion and interpretation from the point of view of epidemiological understanding, with the focus on the most important facts established in the research process.
Significant importance is attached to the reliability and evidential strength of the results, as well as to their representativeness and the various aspects of the validity of the statistical study.
When discussing the validity of the results, randomness, bias, and confounding factors should be considered as additional alternative explanations.
At the end of this stage, which is also the end of the research, the conclusions are formulated. New hypotheses related to the problems may be defined, requiring new studies, and recommendations are made for the introduction of measures and programs for disease control – therapeutic schemes, curative and prophylactic means, etc. One of the most important points in a statistical study is the interpretation of the results, because incorrect interpretation inevitably leads to incorrect conclusions.
These are summary statistical characteristics that reflect what is general and typical for the given sample. There are many measures of central tendency, but the arithmetic mean, the median and the mode are the indicators most commonly used in practice.
The arithmetic mean is calculated by the formula:

x̄ = Σxi / n, where:
xi – the sign value for the i-th unit in the set
n – the total number of cases
For example, if we calculate the average age, then xi is the age of each individual and n is the number of such individuals.
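As a minimal sketch of this calculation (in Python, with hypothetical ages not taken from the text):

```python
def arithmetic_mean(values):
    """Sum of the observed values divided by the number of cases n."""
    return sum(values) / len(values)

# Hypothetical ages of five study participants
ages = [24, 31, 28, 45, 22]
mean_age = arithmetic_mean(ages)  # (24+31+28+45+22)/5 = 30.0
```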
If the quantitative feature is grouped in intervals, it is a mandatory condition that they have the same width. In this case, the arithmetic mean can be calculated using another formula:

x̄ = Σ(xi · fi) / Σfi, where:
xi – the middle of the corresponding interval;
fi – the frequency (number of cases) in the corresponding interval
n – the number of intervals
Of course, calculating any estimates from grouped data reduces their accuracy (they are more imprecise). It is preferable to calculate summary statistics from ungrouped data whenever possible.
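The grouped (weighted) mean can be sketched as follows; the intervals and frequencies are hypothetical:

```python
def grouped_mean(midpoints, frequencies):
    """Weighted mean: each interval midpoint xi weighted by its frequency fi."""
    total = sum(frequencies)
    return sum(x * f for x, f in zip(midpoints, frequencies)) / total

# Hypothetical age intervals 20-30, 30-40, 40-50: midpoints and counts
mids = [25, 35, 45]
freqs = [10, 6, 4]
m = grouped_mean(mids, freqs)  # (250 + 210 + 180) / 20 = 32.0
```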
The median (Me) is the positional middle of the units. To determine it, all cases must be sorted by the size of the characteristic of interest (for example, by age). The value that falls exactly in the middle of the ordered series is the median. A single middle value exists only when the number of cases is odd; if the number of units is even, the median is taken to be the arithmetic mean of the two values in the middle. The sequence number of the unit that is the median is determined by the formula:

nMe = (n + 1) / 2, where:
nMe is the sequence number of the unit whose value is the median.
The median can also be calculated from grouped data. Here, too, the condition that the intervals have equal width applies. The formula is as follows:

Me = LMe + e · (Σf/2 − CMe-1) / fMe, where:
LMe – the lower limit of the median interval
∑f – the total number of cases
CMe-1 – the cumulative frequency accumulated up to (but not including) the median interval
e – width of intervals
fMe – the frequency in the median interval
To determine the median interval, we need to calculate the cumulative frequencies. The interval that contains the unit with sequence number n/2 is the median interval.
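The two median calculations above can be sketched in Python; the grouped version follows the formula Me = LMe + e·(Σf/2 − CMe-1)/fMe, and the example data are hypothetical:

```python
def median(values):
    """Middle value of the sorted series; mean of the two middle values if n is even."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

def grouped_median(lower_limits, frequencies, width):
    """Median from equal-width intervals: L_Me + e * (n/2 - C_Me-1) / f_Me."""
    n = sum(frequencies)
    cum = 0
    for lower, f in zip(lower_limits, frequencies):
        if cum + f >= n / 2:              # this interval contains unit n/2
            return lower + width * (n / 2 - cum) / f
        cum += f
```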
The mode (Mo) is the most frequently occurring value of the variable. There may be more than one mode in a given statistical series, i.e. more than one value (e.g. ages 25 and 26) may occur the greatest number of times in the study sample.
For data grouped into equal-width intervals, the mode can also be calculated, according to the formula:

Mo = LMo + e · (fMo − fMo-1) / [(fMo − fMo-1) + (fMo − fMo+1)], where:
LMo – lower limit of the modal interval
fMo – frequency in the modal interval
fMo-1 – frequency in the premodal interval
fMo+1 – frequency in the postmodal interval
The modal interval is the interval containing the greatest number of cases.
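Both mode calculations can be sketched as follows; the interval data are hypothetical, and the boundary frequencies (before the first and after the last interval) are assumed to be zero:

```python
from collections import Counter

def modes(values):
    """All values occurring the maximum number of times (there may be more than one)."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

def grouped_mode(lower_limits, frequencies, width):
    """Mode from equal-width intervals, using the premodal and postmodal frequencies."""
    i = frequencies.index(max(frequencies))                    # modal interval
    f_mo = frequencies[i]
    f_prev = frequencies[i - 1] if i > 0 else 0                # assumed 0 at the edge
    f_next = frequencies[i + 1] if i < len(frequencies) - 1 else 0
    d1, d2 = f_mo - f_prev, f_mo - f_next
    return lower_limits[i] + width * d1 / (d1 + d2)
```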
As for categorical variables, the mode (the value that occurs most often) can be determined for any of them. For variables measured on the ordinal scale, the median can also be determined, since the categories can be arranged in ascending order. There is no arithmetic mean in the true sense for categorical variables. Conventionally, the relative share of the most common value (p) is taken as the average, while the sum of the shares of the remaining values is taken as its alternative (q). This usually applies to variables measured on a dichotomous scale (a special case of a categorical variable measured on a nominal scale with only two values – see Chapter One), but in principle any qualitative variable can be converted into one with fewer values.
The most commonly used measures of the dispersion of units in a given sample are the range, the standard deviation and the variance. They show how compact the given set is – whether the values are close together or vary widely in the size of the studied characteristic. In addition to these, the coefficient of variation and the interquartile range can also be calculated.
The range is the most basic measure of dispersion. It is calculated by the formula: d = xmax – xmin, where:
xmax – the largest value of the characteristic observed in the sample;
xmin – the smallest value of the characteristic.
This measure is very imprecise, because it is calculated from only two units of the study sample. In practice, the standard deviation (also known as the mean square deviation) and the variance are used more often. The standard deviation shows the average deviation of the values of the studied characteristic from their arithmetic mean. The formula for calculating this statistic (the sample standard deviation) is:

s = √( Σ(xi − x̄)² / (n − 1) )
The variance is the standard deviation squared, or:

s² = Σ(xi − x̄)² / (n − 1)
The standard deviation can be expressed in relative form by calculating the so-called coefficient of variation (V). It is calculated by the formula:

V = (s / x̄) · 100, %
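The dispersion measures above can be sketched together in Python; the sample (n − 1) convention for the standard deviation is an assumption here, and the data are hypothetical:

```python
import math

def dispersion_measures(values):
    """Range, sample standard deviation (n-1), variance, coefficient of variation (%)."""
    n = len(values)
    mean = sum(values) / n
    rng = max(values) - min(values)
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    sd = math.sqrt(variance)
    cv = sd / mean * 100
    return rng, sd, variance, cv

r, sd, var, cv = dispersion_measures([1, 2, 3, 4, 5])  # range 4, variance 2.5
```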
The coefficient of asymmetry (skewness) measures whether a distribution is symmetric or not. Indirectly, asymmetry can be suspected when the three measures of central tendency – the arithmetic mean, the median and the mode – are far apart. The coefficient of asymmetry is determined by the formula:

As = M3 / s³, where:
M3 – the third central moment, found by the formula:

M3 = Σ(xi − x̄)³ / n
When the skewness coefficient is 0, the distribution is perfectly symmetric. For values between −0.25 and 0.25, the distribution is moderately asymmetric. With positive values of the asymmetry coefficient greater than 0.25, we speak of right (positive) asymmetry, and with values below −0.25, of left (negative) asymmetry.
The coefficient of kurtosis measures how peaked or flat the top of the distribution curve is. It is determined by the formula:

Ex = M4 / s⁴ − 3, where M4 = Σ(xi − x̄)⁴ / n is the fourth central moment.
When a frequency distribution is symmetric and no kurtosis is observed (i.e. the coefficients of asymmetry and kurtosis fall within the limits −0.2 to 0.2), its shape is close to that of the normal distribution. In other words, the variable can be said to be approximately normally distributed.
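The moment coefficients can be sketched as follows; as an assumption, population moments (division by n) are used consistently throughout, including for s:

```python
import math

def skewness_kurtosis(values):
    """Moment coefficients: As = M3/s^3 and Ex = M4/s^4 - 3 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    s = math.sqrt(m2)
    return m3 / s ** 3, m4 / s ** 4 - 3

# A symmetric series gives As = 0
sk, ku = skewness_kurtosis([1, 2, 3, 4, 5])
```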
The normal distribution is a theoretical distribution that is symmetric about its central axis (bell-shaped). It is not a single distribution, but a family of normal distributions, each uniquely defined by its arithmetic mean and variance.
In essence, a hypothesis is a reasonable assumption about the course of a certain process, phenomenon or event. Scientific arguments are needed to confirm or reject it.
A statistical hypothesis is usually associated with an assumption about some unknown population parameters or about the type of frequency distribution in the population being studied.
In statistics, there are two types of hypotheses – the null hypothesis and the alternative hypothesis. These are two mutually exclusive (opposite) statements, i.e. if one is true, the other is automatically false. In this situation, it is sufficient to test only one of them. Only the null hypothesis is tested against data from representative studies.
Null hypothesis (H0) – a statement that is related to the presence of a null difference, null relationship, null effect (no difference, no relationship, no effect), i.e. the observed difference, relationship, effect, etc. is due only to chance, not to purposeful influence.
An alternative hypothesis (H1) – the opposite of the null hypothesis, or it is a statement related to the presence of a significant difference, relationship, or effect. The observed difference, relationship or effect is due not only to chance, but also to systematically (lawfully) operating causes.
The formulation of the alternative hypothesis is an important point in hypothesis testing. This statement is in most cases the researcher's desired result, and it can be accepted only if the null hypothesis is rejected. Alternative hypotheses can be either non-directional or directional. They are non-directional when they claim a significant difference without specifying its direction. If the direction of the difference is specified, we speak of a directional alternative.
Example 1. Let µ denote the average value of systolic blood pressure in a given population. If we want to test whether this value is 120 mm Hg, then the relevant hypotheses will be:
H0: µ=120
H1: µ≠120 (non-directional alternative)
H1: µ<120 or H1: µ>120 (directional alternative)
Example 2. Let µ1 and µ2 denote the population mean values of systolic blood pressure of patients from two groups. If we want to compare these pressures between the two groups, then the hypotheses will be:
H0: µ1=µ2
H1: µ1≠µ2 (non-directional alternative)
H1: µ1>µ2 or H1: µ1<µ2 (directional alternative)
Accepting or rejecting the null hypothesis are the two possible decisions the researcher can make on statistical grounds. Each of these decisions carries some probability of error.
Error of the first kind (α – error). Rejecting the null hypothesis when it is in fact valid is an error of the first kind. In other words, an error of the first kind means claiming a non-existent effect.
Error of the second kind (β – error). Accepting the null hypothesis when it is not valid is an error of the second kind, i.e. denying an existing effect.
Both types of errors are unintended consequences of the researcher's decision. An error of the first kind is the more undesirable consequence, for two main reasons. First, in most cases the researcher's goal is to demonstrate the existence of some effect, which corresponds to the decision to reject the null hypothesis; it is precisely this action that carries the possibility of an error of the first kind. Second, the theory of statistical hypothesis testing is built around controlling the probability of an error of the first kind in a relatively elementary way from the point of view of the mathematical apparatus, while controlling the error of the second kind is a complex and difficult mathematical task.
Statistical hypotheses are tested using specific statistical criteria. The values of these criteria are obtained from previously known theoretical distributions. When testing any specific statistical hypothesis, the chosen criterion has two values - theoretical and empirical. The theoretical value can be determined from tables with the theoretical values of the relevant criterion, and the empirical value is calculated according to a specific mathematical formula with the data of the representative sample. At the heart of every statistical hypothesis test is the statistical criterion.
In choosing an appropriate criterion, it is necessary to know the conditions that the variables must satisfy and the nature of the samples.
The probability of making an error of the first kind is related to the statistical significance of the obtained results: if this probability is small enough, the alternative hypothesis can be accepted, i.e. a statistically significant effect can be assumed. In this sense, the probability of making an error of the first kind represents the level of statistical significance. The choice of a critical (threshold) level of significance (α) is related to the statistical probability with which the researcher supports his claims (e.g. at a statistical probability of 0.95, α = 0.05; at 0.99, α = 0.01, etc.). The theoretical value of the chosen criterion divides the numerical axis into two regions – the region of acceptance of H0 and the region of rejection of H0. The region of rejection of the null hypothesis is also called the critical region for H0. This region can be two-sided or one-sided (left or right), depending on the formulated alternative hypothesis.
The probability of accepting H1, when it is the valid hypothesis, is related to the power of the chosen criterion. The criterion power (γ) is directly related to the error of the second kind and is expressed by the equality: γ=1-β
As became clear earlier, it is H0 that is tested. The procedure for this test goes through several stages.
The decision to accept or reject H0 depends on the ratio between the theoretical and empirical value of the selected criterion.
If Kemp > KT, the null hypothesis is rejected in favor of the alternative, and the conclusion is that there is a statistically significant effect. In the opposite case, when Kemp ≤ KT, the null hypothesis is accepted and the conclusion is that no statistically significant effect is observed. The decision to accept or reject H0 can also be made by comparing the set critical level of significance α with the empirical significance level (p) calculated from the data using statistical software. The check proceeds as follows: if p < α, then H0 is rejected in favor of H1; if p ≥ α, then H0 is accepted. In statistical software products that perform hypothesis testing, the p-value is usually denoted in the output as sig. or p-value.