Due to the rapid development of information technologies, abundant data have become readily available. Data mining techniques have been used for process optimization in many manufacturing processes in automotive, LCD, semiconductor, and steel production, among others. However, a large number of missing values often occur in the data set due to several causes (e.g., data discarded because of gross measurement errors, measurement machine breakdown, routine maintenance, sampling inspection, and sensor failure), which frequently complicates the application of data mining to the data set. This study proposes a new procedure for optimizing processes called missing values-Patient Rule Induction Method (m-PRIM), which handles the missing-values problem systematically and yields considerable process improvement, even if a significant portion of the data set has missing values. A case study in a semiconductor manufacturing process is conducted to illustrate the proposed procedure.
The use of data mining techniques in manufacturing industries began in the 1990s and has gradually received attention in many manufacturing processes in automotive, LCD, semiconductor, and steel manufacturing for predictive maintenance, fault detection, diagnosis, and scheduling (Harding, Shahbaz, Srinivas, & Kusiak, 2006). Data mining techniques have also been used for process optimization in order to find optimum conditions for input variables that maximize (or minimize) output variables (Braha and Shmilovici, 2002; Kim and Ding, 2005).
Among many data mining techniques, the Patient Rule Induction Method (PRIM), originally proposed by Friedman and Fisher (1999), has been successfully applied for process optimization despite its recent emergence (Chong et al., 2007; Chong and Jun, 2008; Kwak et al., 2010; Lee and Kim, 2008). This method directly seeks a set of sub-regions for input variables, in which higher quality values are observed from the historical data.
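To make the idea of seeking a sub-region with higher quality values concrete, the following is a minimal sketch of greedy top-down peeling for a single box, assuming a larger-is-better quality variable and a complete data set. The function name `prim_peel` and the parameters `alpha` (fraction peeled per step) and `min_support` (smallest allowed fraction of observations in the box) are illustrative choices, not taken from the cited works. The sketch also stops as soon as no peel raises the box mean, which is a simplification; it omits the pasting, covering, and box-selection steps of the full method.

```python
import numpy as np


def prim_peel(X, y, alpha=0.1, min_support=0.1):
    """Greedy top-down peeling: a simplified sketch of one PRIM box search.

    X: (n, p) array of input variables; y: (n,) array of quality values
    (larger is better). Returns (lower, upper, inside_mask).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    lower = X.min(axis=0)            # box starts as the full data range
    upper = X.max(axis=0)
    inside = np.ones(n, dtype=bool)  # observations currently in the box

    while True:
        current_mean = y[inside].mean()
        best = None  # (mean, variable index, side, cut value, new mask)
        for j in range(p):
            xj = X[inside, j]
            candidates = (
                ("lower", np.quantile(xj, alpha)),
                ("upper", np.quantile(xj, 1.0 - alpha)),
            )
            for side, cut in candidates:
                if side == "lower":
                    keep = inside & (X[:, j] >= cut)
                else:
                    keep = inside & (X[:, j] <= cut)
                # Skip peels that remove nothing or leave too few observations.
                if keep.sum() == inside.sum() or keep.sum() < min_support * n:
                    continue
                mean = y[keep].mean()
                if best is None or mean > best[0]:
                    best = (mean, j, side, cut, keep)
        # Simplification (assumption): stop when no admissible peel helps.
        if best is None or best[0] <= current_mean:
            break
        _, j, side, cut, inside = best
        if side == "lower":
            lower[j] = cut
        else:
            upper[j] = cut
    return lower, upper, inside
```

The returned `lower` and `upper` vectors describe the box as one interval per input variable, which is the form in which such a sub-region would be read as a candidate operating window.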
An embedded assumption in existing PRIM works meant for process optimization is that missing values do not exist in the data sets, or that their amount is negligible. Although abundant data are readily available due to the rapid development of information technologies, missing values are a common occurrence in various industrial process data sets due to several causes (e.g., data discarded because of gross measurement errors, measurement machine breakdown, routine maintenance, sampling inspection, and sensor failure) (Arteaga and Ferrer, 2002; Muteki et al., 2005; Nelson et al., 1996). A large number of missing values frequently complicates the application of data mining algorithms (including PRIM) to the data set, because most data mining algorithms have not been designed to handle them. Moreover, if missing values are not handled in principled ways, they can lead to biased, distorted, and unreliable conclusions (Dasu and Johnson, 2003; Feelders, 1999). Thus, for the successful application of PRIM in process optimization, it is necessary to enhance the existing works by systematically treating the missing-values problem.
The purpose of this paper is to develop a new PRIM-based method for optimizing processes when a significant portion of the data set has missing values. This method will be referred to as the missing values-PRIM (m-PRIM). The remainder of the paper is organized as follows: PRIM is briefly reviewed in the next section; the proposed method is then introduced, and the results of a case study are presented; finally, the conclusion and discussion are given.
To optimize a process using data mining techniques, it is important to consider the occurrence of missing values in the process data set. This work proposed a procedure, called m-PRIM, for optimizing a process based on the existing PRIM when the amount of missing values is not negligible.
Using a real data set from a semiconductor manufacturing process, the study demonstrated that m-PRIM yielded considerable improvements on the process compared with the current level. As expected, however, the degree of process improvement did not reach that obtained when process optimization was conducted on a data set without missing values.
In the case study, it was tentatively assumed that the joint distribution of all variables in the etching process followed a multivariate normal distribution, and that missing values were missing at random (MAR). It is difficult to test the multivariate normality assumption of an incomplete data set in practice; thus, the domain knowledge of engineers and the aid of statistical tools (e.g., Mahalanobis distance plots, Mardia’s test, etc.) are required. Additionally, although joint normality is rarely realistic, multiple imputation (MI) based on this assumption has been known to be useful for a wide variety of problems (Schafer & Graham, 2002). Finally, the assumption of MAR in the case study could be justified because the probability of missing values depends on the sampling inspection scheme and not on the missing values themselves.
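As an illustration of the kind of statistical aid mentioned above, the following sketch computes Mardia's multivariate skewness and kurtosis statistics with NumPy. It is not the procedure used in the case study. The function name `mardia_statistics` is hypothetical, and restricting the calculation to complete cases (rows without missing entries) is an assumption made here for simplicity; with a large share of missing values that restriction can itself distort the result, which is one reason such checks are hard in practice. Under multivariate normality, the skewness statistic is approximately chi-square distributed with the returned degrees of freedom, and the kurtosis statistic is approximately standard normal.

```python
import numpy as np


def mardia_statistics(X):
    """Mardia's skewness and kurtosis statistics on complete cases.

    X: (n, p) array, possibly containing NaN for missing entries.
    Returns (skew_stat, skew_df, kurt_z).
    """
    X = np.asarray(X, dtype=float)
    # Assumption: drop every row that has at least one missing value.
    X = X[~np.isnan(X).any(axis=1)]
    n, p = X.shape

    centered = X - X.mean(axis=0)
    S = centered.T @ centered / n                  # ML covariance estimate
    G = centered @ np.linalg.inv(S) @ centered.T   # n x n Mahalanobis products

    b1 = (G ** 3).sum() / n ** 2                   # multivariate skewness
    b2 = (np.diag(G) ** 2).mean()                  # multivariate kurtosis

    skew_stat = n * b1 / 6.0                       # approx. chi-square
    skew_df = p * (p + 1) * (p + 2) // 6           # its degrees of freedom
    kurt_z = (b2 - p * (p + 2)) / np.sqrt(8.0 * p * (p + 2) / n)
    return skew_stat, skew_df, kurt_z
```

The diagonal of `G` holds the squared Mahalanobis distances of the observations, so the same intermediate quantity could also feed a Mahalanobis distance plot. Large values of `skew_stat` relative to a chi-square distribution with `skew_df` degrees of freedom, or a `kurt_z` far from zero, would cast doubt on the normality assumption.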