Missing Categorical Data in Sociological Surveys: An Experimental Evaluation of Imputation Techniques

DOI:

https://doi.org/10.29038/2306-3971-2025-01-32-32

Keywords:

Data Quality, Missing Data, Data Imputation, Multiple Imputation

Abstract

Missing categorical data presents a persistent challenge to data quality in quantitative sociological research, where simpler approaches to handling missing values can lead to biased estimates and incorrect conclusions. This article provides an empirically grounded evaluation of multiple imputation (MI) strategies for categorical survey data, focusing on the complex, multi-category nominal variable "party voted for" in European Social Survey data from Sweden and Norway. We developed a simulation framework that introduces missingness under three mechanisms: Missing Completely at Random (MCAR); Missing at Random (MAR), derived from patterns of item nonresponse on auxiliary variables; and Missing Not at Random (MNAR), linked to the undisclosed party choice itself. We systematically compared six imputation methods (Multinomial Logistic Regression, Random Forest, CART, KNN, Hot Deck, and Mode) across four predictor set sizes, evaluating them with Accuracy, Cohen’s Kappa, and Macro F1-score using m=20 imputations. Results indicate that, although imputing party choice is challenging, model-based MI techniques significantly outperform naive approaches. Multinomial Logistic Regression consistently emerged as the most robust and highest-performing method, often benefiting from larger predictor sets within the MI framework. K-Nearest Neighbors (KNN) showed promise with smaller predictor sets, offering a computationally efficient alternative. The work emphasizes the importance of principled imputation and provides practical recommendations for sociologists on method selection, predictor set construction, and computational cost when addressing missing categorical data.
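
The evaluation logic summarized in the abstract (mask known values, impute them, score the imputations against the truth) can be illustrated with a minimal Python sketch. This is not the authors' code: the party labels, vote shares, and missingness rates are invented for illustration, all function names are hypothetical, and only the naive Mode baseline is shown as a single imputation rather than the m=20 multiple imputation used in the study. The MAR mechanism is omitted because it requires auxiliary variables.

```python
import random
from collections import Counter

def mask_mcar(values, rate, rng):
    """MCAR: every case has the same probability of being set to missing."""
    return [i for i in range(len(values)) if rng.random() < rate]

def mask_mnar(values, rate_by_category, default_rate, rng):
    """MNAR: the probability of missingness depends on the hidden value itself."""
    return [i for i, v in enumerate(values)
            if rng.random() < rate_by_category.get(v, default_rate)]

def impute_mode(values, missing):
    """Naive baseline: fill every missing case with the most frequent observed category."""
    missing_set = set(missing)
    observed = [v for i, v in enumerate(values) if i not in missing_set]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode for _ in missing]

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def cohen_kappa(y_true, y_pred):
    """Agreement corrected for the agreement expected by chance."""
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return 0.0 if p_e == 1.0 else (p_o - p_e) / (1.0 - p_e)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-category F1 over all categories seen in either list."""
    scores = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def run_demo(seed=0, n=2000):
    """Score the Mode baseline under MCAR and MNAR on synthetic (made-up) data."""
    rng = random.Random(seed)
    # Hypothetical parties and vote shares, not taken from the ESS data.
    votes = rng.choices(["A", "B", "C", "D"], weights=[40, 30, 20, 10], k=n)
    masks = {
        "MCAR": mask_mcar(votes, 0.2, rng),
        # Assumed for illustration: party "D" voters conceal their choice more often.
        "MNAR": mask_mnar(votes, {"D": 0.6}, 0.15, rng),
    }
    results = {}
    for name, missing in masks.items():
        truth = [votes[i] for i in missing]
        guess = impute_mode(votes, missing)
        results[name] = {
            "accuracy": accuracy(truth, guess),
            "kappa": cohen_kappa(truth, guess),
            "macro_f1": macro_f1(truth, guess),
        }
    return results

if __name__ == "__main__":
    for mechanism, scores in run_demo().items():
        print(mechanism, scores)
```

Because Mode imputation predicts a single category, its Cohen's Kappa is zero by construction and its Macro F1 is low even when Accuracy looks acceptable, which is one reason to report all three metrics for a multi-category variable.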

Published

27.06.2025

Section

METHODOLOGY AND METHODS OF SOCIOLOGICAL RESEARCH