Chitkara University Publications

Estimation of Missing Values in the Data Mining and Comparison of Imputation Methods

Abstract:

Many existing industrial and research data sets contain missing values (MVs). They arise for various reasons, such as manual data entry procedures, equipment errors, and incorrect measurements. The presence of such imperfections usually requires a preprocessing stage in which the data are prepared and cleaned so that they are useful and sufficiently clear for the knowledge extraction process. MVs make data analysis difficult and can pose serious problems for researchers. In fact, inappropriate handling of MVs in the analysis may introduce bias, can lead to misleading conclusions being drawn from a research study, and can limit the generalizability of the research findings. The problems usually associated with MVs in data mining are (1) loss of efficiency; (2) complications in handling and analyzing the data; and (3) bias resulting from differences between missing and complete data. We focus our attention on the use of imputation methods. A fundamental advantage of this approach is that the MV treatment is independent of the learning algorithm used. For this reason, the user can select the most appropriate method for each situation. In this paper, different methods for estimating missing values are discussed, and the imputation methods are compared using non-parametric methods.
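
As a minimal illustration of why imputation is independent of the learning algorithm, the sketch below fills each missing entry with the mean of the observed values in its column and returns a completed data set that could then be passed to any learner. Column-mean imputation is chosen here only as an assumed example; the abstract does not say which imputation methods the paper compares, and the function name `mean_impute`, the use of `None` to mark an MV, and the sample data are all hypothetical.

```python
from statistics import mean


def mean_impute(rows):
    """Return a copy of rows with each None replaced by its column's observed mean.

    Hypothetical helper for illustration; None marks a missing value (MV).
    """
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        # A column with no observed values cannot be imputed this way.
        if not observed:
            raise ValueError(f"column {j} has no observed values")
        col_means.append(mean(observed))
    # Build new rows so the original data set is left untouched.
    return [
        [col_means[j] if v is None else v for j, v in enumerate(r)]
        for r in rows
    ]


# Made-up sample data: two numeric attributes, two missing entries.
data = [
    [1.0, 10.0],
    [2.0, None],
    [None, 30.0],
    [3.0, 20.0],
]
completed = mean_impute(data)
```

Because the imputation step only produces a completed table, the choice of classifier or other learner applied afterwards is left entirely to the user.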

Author(s):

  • Shamsher Singh, Dept. Computer Science, A.B College Pathankot (Punjab)
  • Prof. Jagdish Prasad, HOD. Dept. Of Stat. Univ. of Raj. Jaipur

DOI: 

Keywords: 

Missing values, imputation methods, non-parametric, data mining, missing values in data mining


 

 

