Article title (translated from Persian)

A data mining framework for detecting subscription fraud in telecommunications

English title

A data mining framework for detecting subscription fraud in telecommunication

Article code	Publication year	Pages in English article
17728	2011	13 pages (PDF)

Source

Publisher: Elsevier - Science Direct

Journal : Engineering Applications of Artificial Intelligence, Volume 24, Issue 1, February 2011, Pages 182–194

Translated keywords

Fraud detection - Data mining - Neural networks - Decision tree - Support vector machines - Ensembles - Telecommunication

English keywords

Fraud detection, Data mining, Neural networks, Decision tree, Support vector machines, Ensembles, Telecommunication

Article preview

English abstract

Service providers, including telecommunication companies, often suffer substantial losses from customers’ fraudulent behavior. One common type is subscription fraud, in which the usage type contradicts the subscription type. This study aimed to identify customers’ subscription fraud by employing data mining techniques within a knowledge discovery process. To this end, a hybrid approach consisting of preprocessing, clustering, and classification phases was applied, with tools suited to each phase. Specifically, in the clustering phase SOM and K-means were combined, and in the classification phase decision tree (C4.5), neural networks, and support vector machines were examined as single classifiers, and bagging, boosting, stacking, majority voting, and consensus voting as ensembles. In addition to using clustering to identify outlier cases, it was also possible, by defining new features, to carry the results of the clustering phase over to the classification phase. This, in turn, contributed to better classification results. A real dataset provided by Telecommunication Company of Tehran was used to demonstrate the effectiveness of the proposed method. The efficient use of synergy among these techniques significantly increased prediction accuracy. The performance of all single and ensemble classifiers was evaluated on various metrics and compared by statistical tests. The results showed that support vector machines among single classifiers and boosted trees among all classifiers had the best performance in terms of various metrics. The research findings show that the proposed model has high accuracy, and the resulting outcomes are significant both theoretically and practically.
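
The majority and consensus voting ensembles named in the abstract can be illustrated with a minimal sketch. It assumes that consensus means unanimous agreement and that a subscription without the required agreement falls back to a default class; the abstract states neither detail, and the function names and class labels are hypothetical.

```python
from collections import Counter


def majority_vote(predictions, fallback="residential"):
    """Return the label chosen by more than half of the classifiers.

    If no label has a strict majority, return the fallback label.
    """
    label, count = Counter(predictions).most_common(1)[0]
    return label if count > len(predictions) / 2 else fallback


def consensus_vote(predictions, fallback="residential"):
    """Return a label only when every classifier agrees (assumed meaning
    of "consensus"); otherwise return the fallback label."""
    return predictions[0] if len(set(predictions)) == 1 else fallback


# Hypothetical outputs of three single classifiers (DT, NN, SVM)
# for one residential subscription.
votes = ["commercial", "commercial", "residential"]
print(majority_vote(votes))   # commercial
print(consensus_vote(votes))  # residential
```

Under these assumptions consensus voting flags fewer subscriptions than majority voting, trading missed cases for fewer false accusations.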

English introduction

Telecommunication businesses all over the world produce and store huge amounts of data. These data are of great interest for data mining applications. The main feature of these databases is their extraordinary size. AT&T alone, for example, stores more than 300 million records per day for long-distance calls (Cortes and Pregibon, 2001). Although these companies own a great source of information, only a few of them are aware of the knowledge hidden in these databases, and so they seldom use it in their decision-making processes. A challenge that confronts not only telecommunication companies but also other service institutions such as banks, water and energy suppliers, and credit companies is the detection of customer fraud. Fraud in telecom services causes a substantial loss of annual revenue for many telecommunication companies throughout the world (Paredes, 2005; Xing and Girolami, 2007). There are different types of fraud in the telecommunication business (Shawe-Taylor et al., 1999). Shawe-Taylor et al. (2000) present six different fraud types: subscription fraud, the manipulation of Private Branch Exchange (PBX) facilities or dial-through fraud, free phone fraud, premium rate service fraud, handset theft, and roaming fraud. A common type of fraud is subscription fraud (Estevez et al., 2006). Many companies offer lower tariffs for residential subscribers than for commercial ones, so customers may ask for a residential subscription but use it for commercial purposes. In wireline telephone service, fraudulent subscribers can be identified by checking the place of installation and usage. However, identifying all fraudulent customers by checking every residential customer in a company like Telecommunication Company of Tehran (TCI), which has millions of residential customers, takes a great deal of money and time. Therefore, reducing the number of customers to be checked is highly desirable.
This study proposes a method to detect different patterns of residential and commercial subscribers’ behavior based on their call detail records (CDR) and billing data, in order to single out residential subscriptions whose behavior resembles that of fraudulent customers. We have tried to recognize the true subscription type with the highest accuracy. Detecting subscription fraud can prevent a large part of telecommunication income loss. The remainder of this paper is organized as follows: Section 2 reviews the previous literature on techniques for customer fraud detection. The proposed method is then described in Section 3. In Section 4, a real dataset provided by Telecommunication Company of Iran (hereinafter called TCI) is used as a case study to demonstrate the effectiveness of the proposed method. Finally, concluding remarks are offered in Section 5.

English conclusion

In this study we used data mining tools to solve one of the challenging problems in business. A framework for detecting fraudulent telecommunication subscribers was proposed, covering different techniques and algorithms for data cleaning, dimension reduction, clustering, and customer classification. We introduced a hybrid approach consisting of preprocessing, clustering, and classification phases, with tools suited to each phase. In the data preparation phase, in addition to common tasks such as data cleaning and transformation, PCA was used as a dimension reduction technique. In the second phase we used SOM and K-means in sequence to improve the clustering results. Two popular indices, the Dunn and DB indices, were employed to evaluate the clustering results. DT, NN, and SVM were examined as single classifiers, and bagged trees, bagged NN, boosted trees, boosted NN, stacked generalization, majority voting, and consensus voting as ensembles. The parameter space of the models was explored by a systematic procedure, and the models were calibrated appropriately. TCI subscribers’ data were used as our case study to evaluate the proposed methodology. The performance of all single and ensemble classifiers was evaluated on various metrics including accuracy, F-score, lift, and AUC. In addition, many comparisons were made by estimating appropriate confidence intervals. The results showed that SVM among single classifiers and boosted trees among all classifiers have the best performance in terms of various metrics. Narrow estimated 95% confidence intervals showed that the single and ensemble classifiers have low variances. We introduced three new features based on the clustering results in order to retain what was learned in the clustering phase and carry it over to the classification phase. This is a new idea in this context that needs more investigation.
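
As a rough illustration of carrying clustering results into classification, the sketch below appends cluster-derived columns to each feature vector. The text above does not say what the three new features are, so the three used here (index of the nearest cluster, distance to its centroid, and the ratio of nearest to second-nearest distance) are stand-ins chosen for illustration. The centroids are assumed to come from an earlier clustering step, and all names are hypothetical.

```python
import math


def cluster_features(point, centroids):
    """Derive three illustrative features from fixed cluster centroids.

    Returns (nearest_index, distance_to_nearest, nearest_to_second_ratio).
    Requires at least two centroids.
    """
    distances = [math.dist(point, c) for c in centroids]
    order = sorted(range(len(centroids)), key=distances.__getitem__)
    nearest, second = order[0], order[1]
    # A ratio near 0 means the point sits firmly in one cluster;
    # a ratio near 1 means it lies between two clusters.
    if distances[second] > 0:
        ratio = distances[nearest] / distances[second]
    else:
        ratio = 1.0
    return nearest, distances[nearest], ratio


def augment(rows, centroids):
    """Append the cluster-derived features to each original feature vector."""
    return [list(row) + list(cluster_features(row, centroids)) for row in rows]


# Hypothetical two-dimensional data with two centroids.
centroids = [(0.0, 0.0), (10.0, 0.0)]
print(augment([(3.0, 4.0)], centroids))
```

A classifier trained on the augmented rows sees both the original attributes and a summary of where each subscriber sits relative to the clusters.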
However, the effectiveness of the new features was investigated experimentally, and the experiments showed that adding the newly constructed features can significantly improve the performance of DT and SVM. We also tested simplified versions of the proposed process by removing the PCA dimension reduction step and the clustering step. The results provided evidence for the effectiveness of both steps. The type of subscription fraud investigated in this study occurs over a relatively long usage period. In fact, there is no difference between a commercial and a residential call per se; the fraud is the result of using a residential subscription for commercial purposes over a period of time. The presented models need the historical data of subscribers to classify them as commercial or residential. Therefore, the proposed method has some limitations and tradeoffs. The main hurdle is data limitation: the model cannot be used to classify new customers without enough historical data. As for other applications, the model cannot be used for real-time purposes because of its computational cost, although this was not a problem in our case, since in subscription fraud enough time is available for inference. Hence, the proposed model is applicable to telecommunication companies; it can also be used for other non-real-time applications. The proposed method ultimately provides a list of residential subscriptions suspected of being used for commercial purposes, yet the company still needs to investigate further to certify that fraud has occurred. Therefore, though the model is able to drastically reduce the effort needed to detect subscription fraud, it does not eliminate the need for a final check by the company. This is because we cannot accuse subscribers of fraudulent behavior solely on the basis of the model’s outputs.
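
Because the output is a ranked list of suspects that still has to be checked by hand, lift is a natural way to express how much checking effort the model saves. The sketch below computes lift for the top fraction of a ranked list. The scores and labels are invented for illustration and the function name is hypothetical; this is not the evaluation code used in the study.

```python
def lift_at(scores, labels, fraction):
    """Lift of the top `fraction` of subscribers ranked by descending score.

    labels: 1 if the subscription was confirmed as misused, else 0.
    Lift is the hit rate in the inspected group divided by the overall hit rate.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    base_rate = sum(labels) / len(labels)
    if base_rate == 0:
        raise ValueError("lift is undefined when there are no positive labels")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, round(len(ranked) * fraction))
    top_rate = sum(label for _, label in ranked[:top_n]) / top_n
    return top_rate / base_rate


# Invented example: 10 subscribers, 3 of them confirmed cases.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05, 0.02, 0.01]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
print(lift_at(scores, labels, 0.2))
```

In this invented example, inspecting the top 20% of the list finds confirmed cases at roughly 3.3 times the rate of inspecting subscribers at random.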
The research findings showed that the proposed process performed well, and the resulting outcomes were significant both theoretically and practically. In other words, a large part of the income that TCI and other telecommunication companies lose to customers’ subscription fraud can be saved by the proposed process. As a last remark, while data mining techniques help businesses address more questions than ever before, this capability may increase the risk of invading customer privacy. Although in this research subscribers’ personal information was given to the researchers in anonymized form, it should be noted that customer privacy rights are sometimes disregarded in data mining projects. Customer privacy should therefore be taken into account in future implementations and similar projects.