Lessons learned in credit card fraud detection from a practitioner's perspective
|Article code||Publication year||English article||Persian translation||Word count|
|17806||2014||35-page PDF||Available to order||7,840 words|
Publisher: Elsevier - Science Direct
Journal : Expert Systems with Applications, Available online 22 February 2014
Fraudulent credit card transactions cause billions of dollars of losses every year. The design of efficient fraud detection algorithms is key to reducing these losses, and more and more algorithms rely on advanced machine learning techniques to assist fraud investigators. The design of fraud detection algorithms is, however, particularly challenging due to the non-stationary distribution of the data, highly imbalanced class distributions and continuous streams of transactions. At the same time, public data are scarcely available due to confidentiality issues, leaving many questions unanswered about the best strategy to deal with these challenges. In this paper we provide some answers from the practitioner's perspective by focusing on three crucial issues: unbalancedness, non-stationarity and assessment. The analysis is made possible by a real credit card dataset provided by our industrial partner.
Nowadays, enterprises and public institutions face a growing number of fraud attempts and need automatic systems to implement fraud detection (Delamaire, Abdou, & Pointon, 2009). Automatic systems are essential since it is not always possible or easy for a human analyst to detect fraudulent patterns in transaction datasets, which are often characterized by a large number of samples, many dimensions and online updates. Also, cardholders cannot be relied upon to report the theft, loss or fraudulent use of a card (Pavía, Veres-Ferrer, & Foix-Escura, 2012). Since fraudulent transactions are far fewer than legitimate ones, the data distribution is unbalanced, i.e. skewed towards non-fraudulent observations. It is well known that many learning algorithms underperform when used on unbalanced datasets (Japkowicz & Stephen, 2002), and methods such as resampling have been proposed to improve their performance. Unbalancedness is not the only factor that determines the difficulty of a classification/detection task. Another influential factor is the amount of overlap between the classes of interest, which is due to the limited information that transaction records provide about the nature of the process (Holte, Acker, & Porter, 1989). Detection problems are typically addressed in two different ways. In the static learning setting, a detection model is periodically relearnt from scratch (e.g. once a year or once a month). In the online learning setting, the detection model is updated as soon as new data arrives. Though the online strategy is the most adequate to deal with non-stationarity (e.g. due to the evolution of the spending behavior of the regular cardholder or of the fraudster), little attention has been devoted in the literature to the unbalanced problem in a changing environment.
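As an illustration of the resampling idea mentioned above, the following is a minimal sketch of random undersampling, one common rebalancing method. The function name `undersample`, the `ratio` parameter and the toy data are assumptions made for this example; they are not taken from the paper or its dataset.

```python
import random

def undersample(transactions, labels, ratio=1.0, seed=0):
    """Randomly drop legitimate transactions (label 0) until at most
    `ratio` legitimate samples remain per fraudulent one (label 1)."""
    rng = random.Random(seed)
    fraud = [i for i, y in enumerate(labels) if y == 1]
    legit = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(legit), int(ratio * len(fraud)))
    # Keep every fraud, plus a random subset of the legitimate samples.
    kept = sorted(fraud + rng.sample(legit, n_keep))
    return [transactions[i] for i in kept], [labels[i] for i in kept]

# Toy data (illustrative only): 1000 transactions, 1% of them fraudulent.
labels = [1 if i % 100 == 0 else 0 for i in range(1000)]
transactions = [{"id": i, "amount": 10.0 + i % 7} for i in range(1000)]

X_bal, y_bal = undersample(transactions, labels, ratio=1.0)
n_fraud = sum(y_bal)
n_legit = len(y_bal) - n_fraud
print(n_fraud, n_legit)
```

With `ratio=1.0` the training set becomes balanced, at the cost of discarding most legitimate transactions. Whether that trade-off helps is one of the questions the paper sets out to examine.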
Another problematic issue in credit card fraud detection is the scarcity of available data: confidentiality constraints give the community little chance to share real datasets and assess existing techniques.
Conclusion (English)
This paper aims to make an experimental comparison of several state-of-the-art algorithms and modeling techniques on one real dataset, focusing in particular on some open questions: Which machine learning algorithm should be used? Is it enough to learn a model once a month, or is it necessary to update the model every day? How many transactions are sufficient to train the model? Should the data be analyzed in their original unbalanced form? If not, which is the best way to rebalance them? Which performance measure is the most adequate to assess results? In this paper we address these questions with the aim of assessing their importance on real data and from a practitioner's perspective. These are just some of the potential questions that could arise during the design of a detection system. We do not claim to be able to give a definite answer to the problem, but we hope that our work serves as a guideline for other people in the field. Our goal is to show what worked and what did not in a real case study. In this paper we give a formalisation of the learning problem in the context of credit card fraud detection. We present a way to create new features in the datasets that can trace the cardholder's spending habits. By doing this it is possible to present the transactions to the learning algorithm without providing the cardholder identifier. We then argue that traditional classification metrics are not suited for a detection task and present existing alternative measures. We propose and compare three approaches for online learning in order to identify what is important to retain or to forget in a changing and non-stationary environment. We show the impact of the rebalancing technique on the final performance when the class distribution is skewed. In doing this we merge techniques developed for unbalanced static datasets with online learning strategies. The resulting frameworks are able to deal with unbalanced and evolving data streams.
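The idea of features that trace spending habits can be sketched as follows. This is only one possible reading of that idea: the 24-hour window, the field names (`card_id`, `time`, `amount`) and the two aggregates (`n_prev`, `sum_prev`) are assumptions made for this example, not the features actually used in the paper.

```python
from collections import defaultdict, deque

def add_spending_features(transactions, window=24 * 3600):
    """For each transaction, add the count and total amount of the same
    card's transactions in the preceding `window` seconds, then omit the
    card identifier from the output record."""
    history = defaultdict(deque)  # card_id -> recent (time, amount) pairs
    out = []
    for tx in sorted(transactions, key=lambda t: t["time"]):
        recent = history[tx["card_id"]]
        # Forget transactions that fall outside the time window.
        while recent and recent[0][0] < tx["time"] - window:
            recent.popleft()
        out.append({
            "time": tx["time"],
            "amount": tx["amount"],
            "n_prev": len(recent),
            "sum_prev": sum(amount for _, amount in recent),
        })
        recent.append((tx["time"], tx["amount"]))
    return out

# Toy transactions (illustrative only); times are in seconds.
txs = [
    {"card_id": "A", "time": 0, "amount": 20.0},
    {"card_id": "A", "time": 3600, "amount": 35.0},
    {"card_id": "B", "time": 4000, "amount": 500.0},
    {"card_id": "A", "time": 100000, "amount": 15.0},
]
features = add_spending_features(txs)
for row in features:
    print(row)
```

Each output record summarizes the card's recent behavior but no longer carries `card_id`, so a learning algorithm sees the spending context without the identifier itself.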
All the results are obtained by experimentation on a dataset of real credit card transactions provided by our industrial partner.
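To illustrate why a traditional classification metric can mislead on a detection task, the sketch below compares accuracy with precision among the top-k highest-scored transactions. Precision at k is used here only as one example of a ranking-oriented measure; this excerpt does not say which alternative measures the paper adopts. The two scoring models and all numbers are invented for the example.

```python
def accuracy(labels, scores, threshold=0.5):
    """Share of transactions classified correctly at a fixed threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def precision_at_k(labels, scores, k):
    """Share of true frauds among the k highest-scored transactions."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(y for _, y in ranked[:k]) / k

# Toy data (illustrative only): 10 frauds followed by 990 legitimate samples.
labels = [1] * 10 + [0] * 990

# Hypothetical model 1: uninformative scores, all below the 0.5 threshold,
# so it never raises an alert.
uninformative = [((i * 37) % 1000) / 2000 for i in range(1000)]

# Hypothetical model 2: ranks 8 of the 10 frauds on top, but also raises
# 20 false alarms and misses 2 frauds.
useful = [0.9] * 8 + [0.1] * 2 + [0.6] * 20 + [0.05] * 970

acc_uninformative = accuracy(labels, uninformative)
acc_useful = accuracy(labels, useful)
p10_uninformative = precision_at_k(labels, uninformative, 10)
p10_useful = precision_at_k(labels, useful, 10)
print(acc_uninformative, acc_useful, p10_uninformative, p10_useful)
```

In this toy setting accuracy prefers the model that never raises an alert, whereas precision at k prefers the model whose top alerts are mostly real frauds, which is closer to what an investigator with limited time needs.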