Methods of determining the most important variables in the modelling of social and economic research

Authors
Affiliations

T. Kmytiuk

PhD in Economics, Associate Professor, Associate Professor of the Department of Mathematical Modeling and Statistics, Kyiv National Economic University named after Vadym Hetman

kmytiuk.tetiana@kneu.edu.ua

S. Kmytiuk

II category specialist, assistant of educational fields “Natural”, “Informatics”, “Technology”, “Center for Professional Development of Pedagogical Staffs”, communal institution of Slavuta municipal council

svitlanakmytyuk@gmail.com

The main task of economic-mathematical research is the construction of models that make it possible to identify and formally describe the most important and essential regularities in the functioning of socio-economic processes and phenomena, and to establish relationships between the main economic indicators. Any socio-economic process or phenomenon is, as a rule, considered as a complex system characterized by a set of factors. In modelling and forecasting socio-economic processes, the research task reduces to finding the dependence of one explained variable on several explanatory variables. This problem can be solved using multiple regression analysis, described by the equation:

\[ \begin{aligned} y &= f(x) + e \\ y &= f(x_1, x_2, \ldots, x_m) + e \end{aligned} \tag{1}\]

where \(f(x)\) is the regression function of \(y\) on \(x\); \(x_1, x_2, \ldots, x_m\) are the independent (explanatory) variables (regressors); and \(y\) is the dependent (explained) variable (regressand).
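As a minimal illustration of equation (1), the sketch below fits a linear multiple regression to synthetic data by ordinary least squares. The data, coefficients, and noise level are all invented for the example.

```python
import numpy as np

# Illustrative data: y depends on two explanatory variables (x1, x2)
# plus a random error e, matching the form y = f(x1, ..., xm) + e.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))                       # regressors x1, x2
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.1, size=n)

# Ordinary least squares: add an intercept column and solve min ||Xb·b - y||.
Xb = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
print(np.round(coef, 2))   # estimated intercept and slopes, close to 3, 2, -1.5
```

With a low noise level and enough observations, the estimated coefficients recover the values used to generate the data.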

The complexity of economic processes requires including many influencing factors (variables) in the regression model. A relevant problem is therefore determining the most important variables, i.e. which to include in (or exclude from) the model so that they contribute most to modelling and forecasting. The factors of a multiple regression must meet two main requirements:

  1. be quantitatively measurable (a qualitative measurement is replaced by a quantitative evaluation, for example in the form of points, ranks, or binary indicators);
  2. not be correlated with each other (correlation among regressors in regression modelling is called multicollinearity).

Selection of factors can be carried out on the basis of qualitative theoretical and economic analysis. However, theoretical analysis often does not allow an unambiguous answer to the question of the quantitative relationship of the considered features and the feasibility of including factors in the model.

Therefore, the selection of factors is usually carried out in combination with the use of statistical and mathematical methods.

Among the most common statistical and mathematical methods for selecting the best factors are stepwise regression and all-possible subsets.

The main approaches to stepwise regression are forward stepwise selection, backward stepwise elimination, and bidirectional elimination [1].

The forward stepwise method involves the gradual inclusion of factors into the model, with simultaneous verification of the significance of the newly formed feature set using partial correlation coefficients, the t-criterion, and p-values. The process of expanding the model stops when the next candidate variable does not increase the total \(R^2\), and the set of variables formed before its inclusion remains unchanged.

The backward stepwise method works in the reverse order. The procedure begins with an expanded feature set containing all the main factors selected during the theoretical analysis. Then, at each step, factors that have an insignificant effect on the explained variable are excluded. Elimination continues until only significant variables remain in the model, or until a specified value of the multiple correlation coefficient is reached.

The bidirectional approach is essentially a forward selection procedure, but with the option of removing an already selected variable at each step, as in backward elimination, when correlations between the variables arise. Put simply, it is a combination of both stepwise regression methods.
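Forward and backward stepwise selection can be sketched with scikit-learn's `SequentialFeatureSelector`. Note one difference from the procedure described above: this implementation adds or drops variables based on a cross-validated model score rather than on t-tests and p-values. The data are synthetic, with only two truly relevant regressors.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data: of 6 candidate regressors, only features 0 and 3
# actually influence the explained variable y.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))
y = 4.0 * X[:, 0] - 3.0 * X[:, 3] + rng.normal(scale=0.1, size=300)

# Forward selection: start empty, add the feature that most
# improves the cross-validated score at each step.
forward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="forward"
).fit(X, y)

# Backward elimination: start with all features, drop the least useful.
backward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="backward"
).fit(X, y)

print(forward.get_support(indices=True))   # → [0 3]
print(backward.get_support(indices=True))  # → [0 3]
```

With a strong signal, both directions converge on the same subset; on real socio-economic data with correlated regressors they may disagree, which is precisely why the bidirectional variant exists.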

The all-possible-subsets method tests every subset of the set of potential independent variables [4]. This means that all univariate (one-variable) models are built first, then all bivariate (two-variable) models, and so on, until the model containing all variables is generated. The best models are selected using criteria such as \(R\)-squared, adjusted \(R\)-squared, or Mallows' \(C_p\).
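The exhaustive search over subsets can be sketched directly with `itertools.combinations`, here scored by adjusted \(R^2\) (one of the criteria named above). The data are synthetic, with two relevant regressors out of four; note that the cost grows as \(2^m\), so the method is only practical for modest numbers of candidate variables.

```python
import itertools
import numpy as np

def adjusted_r2(X, y):
    """Fit OLS with an intercept and return the adjusted R-squared."""
    n, k = X.shape
    Xb = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    resid = y - Xb @ coef
    r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Synthetic data: only regressors 1 and 2 influence y.
rng = np.random.default_rng(2)
X = rng.normal(size=(150, 4))
y = 2.0 * X[:, 1] + 1.0 * X[:, 2] + rng.normal(scale=0.1, size=150)

# Enumerate every non-empty subset of the 4 candidate regressors
# and keep the one with the highest adjusted R-squared.
best = max(
    (s for r in range(1, 5) for s in itertools.combinations(range(4), r)),
    key=lambda s: adjusted_r2(X[:, list(s)], y),
)
print(best)  # the winning subset contains the two relevant regressors
```

Swapping `adjusted_r2` for Mallows' \(C_p\) or an information criterion only changes the `key` function; the enumeration itself is identical.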

The rapid development of information technologies has contributed to the use of intelligent data analysis (IDA) methods in the modelling of socio-economic processes. Among the popular approaches to determining the most important variables are the following statistical IDA methods: factor analysis, cluster analysis, and analysis of variance.

Factor analysis is a technique used to search for and classify the factors that affect economic phenomena and processes, identifying the cause-and-effect relationships behind changes in specific indicators of economic activity. It is used to reduce a large number of variables to a smaller number of factors [2].

The essence of the analysis is to combine variables that are highly correlated into one factor. The variance is redistributed among the variables, and the simplest and most visible structure of the new factors (latent variables) is created. When combined, each latent variable contains factors that are highly correlated with each other. Conversely, there is minimal or no correlation between the factors of individual latent variables.

There are different types of factor methods: principal component analysis, common factor analysis, image factoring, the maximum likelihood method, etc.
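The grouping of correlated variables into latent factors can be illustrated with scikit-learn's `FactorAnalysis` on invented data: six observed indicators are generated from two hidden factors, so the method should recover two groups of three intercorrelated variables.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Synthetic data: 6 observed indicators driven by 2 latent factors,
# so the variables within each group of three are highly correlated
# with each other and only weakly with the other group.
rng = np.random.default_rng(3)
f = rng.normal(size=(500, 2))                    # latent factors
loadings = np.array([[1.0, 0.9, 0.8, 0.0, 0.0, 0.0],
                     [0.0, 0.0, 0.0, 1.0, 0.9, 0.8]])
X = f @ loadings + rng.normal(scale=0.1, size=(500, 6))

fa = FactorAnalysis(n_components=2).fit(X)
# Each column of components_ shows how strongly an observed
# variable loads on each latent factor.
print(np.round(fa.components_, 2))
```

Each observed variable loads strongly on exactly one of the two recovered factors, mirroring the redistribution of variance described above.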

The most important variables can be identified using clustering methods [3]. Clustering methods group data into clusters based on the similarity of objects. Most clustering methods begin by choosing a central point for each cluster, after which the set of elements is distributed across the clusters. The centers are then adjusted and the elements redistributed. The accuracy of clustering algorithms can be further improved with the help of machine learning.
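The center-assign-recompute loop described above is exactly the k-means algorithm; a minimal sketch with scikit-learn's `KMeans` on two invented, well-separated groups of observations:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated groups of 2-dimensional observations.
rng = np.random.default_rng(4)
a = rng.normal(loc=0.0, scale=0.3, size=(100, 2))
b = rng.normal(loc=5.0, scale=0.3, size=(100, 2))
X = np.vstack([a, b])

# k-means follows the scheme described above: pick initial centers,
# assign each point to the nearest center, recompute the centers,
# and repeat until the assignment stabilizes.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(np.round(km.cluster_centers_, 1))  # centers near (0, 0) and (5, 5)
```

The fitted centers land on the two group means, and `km.labels_` gives the cluster membership of each observation.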

Analysis of variance (ANOVA) is a statistical method used to identify differences between two or more groups, for example between planned and actual figures. To do this, an ANOVA test splits the variability found within a data set into two parts: systematic factors, which have a statistical influence on the data set, and random factors, which do not.
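The split into systematic and random variability can be sketched with a one-way ANOVA via `scipy.stats.f_oneway` on invented data: three groups, one of which has a shifted mean (the systematic factor).

```python
import numpy as np
from scipy import stats

# Three invented groups: two share a mean of 10, the third is
# shifted to 13, acting as the systematic factor.
rng = np.random.default_rng(5)
g1 = rng.normal(10.0, 1.0, size=50)
g2 = rng.normal(10.0, 1.0, size=50)
g3 = rng.normal(13.0, 1.0, size=50)

# One-way ANOVA compares between-group (systematic) variability
# with within-group (random) variability; a small p-value means
# the group means differ by more than chance would explain.
f_stat, p_value = stats.f_oneway(g1, g2, g3)
print(p_value < 0.05)  # → True: the shifted group is detected
```

Dropping the shift (generating `g3` from the same mean of 10) would leave only random variability, and the p-value would typically exceed 0.05.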

Conclusion. The selection of the most significant factors affecting the resulting feature is carried out on the basis of qualitative, theoretical analysis in combination with the use of statistical methods in the construction of a multiple regression model. The selected factors should reflect only the most important and characteristic properties of the investigated processes or phenomena. At the same time, in the process of such research, it is possible to return several times to the model specification stage, clarifying the list of independent variables and the type of function used, as well as to carry out a combination of methods for selecting the main factors.

References

  1. Chowdhury, MZI., Turin, TC. (2020). Variable selection strategies and its importance in clinical prediction modelling. Fam Med Com Health, 8:e000262. https://doi.org/10.1136/fmch-2019-000262
  2. Kmytiuk, T. (2023). Factors of pricing of agricultural products. Scientia·fructuosa, 147(1), 88–105. https://doi.org/10.31617/1.2023(147)07
  3. Oyewole, G.J., Thopil, G.A. (2022). Data clustering: application and trends. Artif Intell Rev. https://doi.org/10.1007/s10462-022-10325-y
  4. Ratner, B. (2010). Variable selection methods in regression: Ignorable problem, outing notable solution. Journal of Targeting, Measurement and Analysis for Marketing, 18, 65–75. https://doi.org/10.1057/jt.2009.26