Detecting online auction shilling frauds using supervised learning
Article code | Publication year | Length of English article |
---|---|---|
17807 | 2014 | 14 pages (PDF) |
Publisher : Elsevier - Science Direct
Journal : Expert Systems with Applications, Volume 41, Issue 6, May 2014, Pages 3027–3040
Abstract
Online auction sites are a target for fraud due to their anonymity, number of potential targets and low likelihood of identification. Researchers have developed methods for identifying fraud. However, these methods must be individually tailored to each type of fraud, since each type differs in the characteristics important for its identification. Using supervised learning methods, it is possible to produce classifiers for specific types of fraud by providing a dataset where instances with behaviours of interest are assigned to a separate class. However, this requires multiple labelled datasets: one for each fraud type of interest. It is difficult to use real-world datasets for this purpose since they are difficult to label, often limited in size, and contain zero or multiple suspicious behaviours that may or may not be under investigation. The aims of this work are to: (1) demonstrate the approach of using supervised learning together with a validated synthetic data generator to create fraud detection models that are experimentally more accurate than existing methods and that are effective over real data, and (2) evaluate a set of features for use in general fraud detection, which is shown to further improve the performance of the created detection models. The approach is as follows: the data generator is an agent-based simulation modelled on users in commercial online auction data. The simulation is extended using fraud agents which model a known type of online auction fraud called competitive shilling. These agents are added to the simulation to produce the synthetic datasets. Features extracted from this data are used as training data for supervised learning. Using this approach, we optimise an existing fraud detection algorithm, and produce classifiers capable of detecting shilling fraud. Experimental results with synthetic data show the new models have significant improvements in detection accuracy. 
Results with commercial data show the models identify users with suspicious behaviour.
Introduction
Online auction sites such as eBay and TradeMe allow goods and services to be bought and sold online anonymously. The most common type of online auction is the English auction (Menezes & Monteiro, 2005), where bids are placed in ascending order, are publicly observable, and the winner is the final bidder with the highest bid. In 2011, there were 90 million active users on eBay (Shen & Sundaresan, 2011), with more than 170 million concurrent auctions (Auction Count Charts). The anonymity and the simplicity of creating multiple aliases allow unsuspecting users to be exploited by dishonest users. This exploitation can take many forms, including shilling, non-delivery, misrepresentation, or the sale of stolen goods (Dong, Shatz, & Xu, 2009). Dishonest users will also disguise themselves to avoid detection by imitating normal behaviours (Chang & Chang, 2011), making fraudulent behaviour difficult to define. Previous work has noted that legitimate users often appear to behave irrationally (Mizuta & Steiglitz, 2000), and previous attempts at clustering users into predefined types according to their bidding behaviour have failed to label the majority of users (Shah, Joshi, Sureka, & Wurman, 2003). The range of potential fraudulent behaviours, together with the number and range of legitimate behaviours, makes it difficult to differentiate between fraudulent and legitimate users. The class imbalance in auction data, where legitimate actions outnumber fraudulent ones, makes the accurate classification of users as fraudulent or legitimate non-trivial. 
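To illustrate why the class imbalance noted above makes classification non-trivial, the following minimal sketch uses hypothetical counts (990 legitimate users and 10 fraudulent ones, not figures from the paper). A classifier that labels every user as legitimate scores high accuracy while detecting no fraud at all.

```python
def accuracy_and_recall(labels, predictions):
    """Return (accuracy, recall on the fraud class) for binary labels, 1 = fraud."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    fraud_total = sum(labels)
    fraud_found = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    accuracy = correct / len(labels)
    recall = fraud_found / fraud_total if fraud_total else 0.0
    return accuracy, recall

# Hypothetical population: 990 legitimate users (0) and 10 fraudulent users (1).
labels = [0] * 990 + [1] * 10

# A trivial "classifier" that never flags anyone.
always_legitimate = [0] * len(labels)

acc, rec = accuracy_and_recall(labels, always_legitimate)
print(f"accuracy={acc:.2f}, fraud recall={rec:.2f}")  # accuracy=0.99, fraud recall=0.00
```

Accuracy alone is therefore a poor guide under imbalance; measures that look at the fraud class separately are more informative.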
Past research in online auction fraud has focused on detecting specific fraudulent behaviours using a range of techniques, including decision trees (Chang & Chang, 2011; Almendra, 2013), clustering (Chang & Chang, 2010), regression models (Kauffman & Wood, 2003; Chae et al., 2007), statistical methods (Trevathan & Read, 2007a), model checking (Xu & Cheng, 2007; Xu et al., 2009), and graph mining methods (Pandit, Chau, Wang, & Faloutsos, 2007). The general approach in these works involves identifying the type of behaviour or fraud of interest, then selecting a set of features that are hypothesised to differentiate between users with normal and suspicious behaviour. A fraud detection algorithm is then developed using the selected feature set. The algorithm is evaluated either using commercial auction datasets without knowledge of ground truth, or using synthetic datasets without any guarantee of their similarity to real data. Both types of dataset reduce the reliability of any conclusions drawn about method accuracy or effectiveness (Tsang, Dobbie, & Koh, 2012a). In this work, synthetic data is used together with supervised learning methods to develop classification models for fraud detection. Supervised learning methods allow classifiers, which can detect different types of fraud or behaviours of interest, to be trained given an appropriate training set. The synthetic data used in this work is generated using a validated agent-based simulation (Tsang, Dobbie, & Koh, 2012b, chap. 11), which has been extended to generate data containing specific fraudulent behaviours. The types of fraudulent behaviour added determine the types of fraud the resulting model can detect. 
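The idea of training a classifier on labelled synthetic data can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' simulation or feature set: the behaviour model in `simulate_user`, the two features in `extract_features`, and the one-level decision tree in `train_stump` are all hypothetical stand-ins chosen to keep the example self-contained.

```python
import random

def simulate_user(is_shill, rng, n_auctions=20):
    """Return (bids_placed, won) records, one per auction the user joined.

    Hypothetical behaviour model: shills bid often but almost never win;
    legitimate users bid less often and win a fair share of auctions.
    """
    records = []
    for _ in range(n_auctions):
        if is_shill:
            records.append((rng.randint(3, 8), rng.random() < 0.02))
        else:
            records.append((rng.randint(1, 4), rng.random() < 0.30))
    return records

def extract_features(records):
    """Turn one user's auction records into (average bids per auction, win rate)."""
    n = len(records)
    avg_bids = sum(bids for bids, _ in records) / n
    win_rate = sum(1 for _, won in records if won) / n
    return (avg_bids, win_rate)

def make_dataset(n_users, rng, shill_every=10):
    """Generate labelled feature rows; ground truth is known by construction."""
    rows, labels = [], []
    for i in range(n_users):
        is_shill = (i % shill_every == 0)
        rows.append(extract_features(simulate_user(is_shill, rng)))
        labels.append(is_shill)
    return rows, labels

def train_stump(rows, labels):
    """Fit a one-level decision tree by exhaustive search over feature, threshold, direction."""
    best = None
    for f in range(len(rows[0])):
        for threshold in sorted({r[f] for r in rows}):
            for shill_if_above in (True, False):
                preds = [(r[f] > threshold) == shill_if_above for r in rows]
                errors = sum(p != y for p, y in zip(preds, labels))
                if best is None or errors < best[0]:
                    best = (errors, f, threshold, shill_if_above)
    _, f, threshold, shill_if_above = best
    return lambda row: (row[f] > threshold) == shill_if_above

rng = random.Random(0)  # fixed seed so the run is repeatable
train_rows, train_labels = make_dataset(300, rng)
classify = train_stump(train_rows, train_labels)

# Evaluate on freshly generated synthetic users.
test_rows, test_labels = make_dataset(300, rng)
accuracy = sum(classify(r) == y for r, y in zip(test_rows, test_labels)) / len(test_rows)
print(f"held-out accuracy on synthetic users: {accuracy:.2f}")
```

Because the generator assigns the labels, the ground truth is known exactly, which is the property that real commercial datasets lack. Changing the fraud behaviour in `simulate_user` changes what the trained classifier can detect.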
An appropriate training set for supervised learning methods is created in three steps: first, define an agent type that represents the fraud type of interest; second, generate synthetic data using the defined agent; and third, transform the generated synthetic data, which is a sequence of auctions and bids, into values for a set of user-defined features. This transformed synthetic dataset is then used as a training set for the selected supervised learning technique. This approach allows models for detecting specific types of fraud to be constructed more easily than in previous work, and with improved accuracy, as shown by the experimental results in Section 4. To our knowledge, no previous work has combined the use of a data generator and supervised learning methods to develop fraud detection methods.
This work focuses on a type of fraud called competitive shilling. Competitive shilling occurs when a user submits bids to a collaborating seller’s auction to elevate the final auction price, without the intention of winning. The legitimate bidder is cheated by paying more than they otherwise would when winning the item. For example, suppose there were only two bidders in an auction: one legitimate (L), and one fraudulent (F), with bidding proceeding like so: L: $10, F: $11, L: $12, F: $13, L: $14. If there are no additional bids, L, the legitimate bidder, wins at $14 rather than at the opening bid of $10, and so pays an additional $4 due to bids by F.
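The arithmetic of this two-bidder example can be checked with a short sketch. The function name `overpayment` is hypothetical, and it relies on the simplifying assumption from the example: with the shill's bids removed, the winner's opening bid would have stood unchallenged.

```python
# Bid sequence from the example: (bidder, amount in dollars), in order placed.
bids = [("L", 10), ("F", 11), ("L", 12), ("F", 13), ("L", 14)]

def overpayment(bids, shill):
    """Extra amount the winner pays because of the shill's bids.

    Assumes a two-bidder English auction where, without the shill,
    the winner's first bid would have been the final price.
    """
    winner, final_price = max(bids, key=lambda bid: bid[1])
    if winner == shill:
        raise ValueError("the shill won the auction; no legitimate buyer overpaid")
    opening_bid = next(amount for bidder, amount in bids if bidder == winner)
    return final_price - opening_bid

print(overpayment(bids, shill="F"))  # 4
```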
Conclusion
Development and evaluation of fraud detection methods have been difficult in the past for several reasons: (1) the variety of possible fraudulent behaviours, (2) changes in fraudulent behaviours to avoid detection, and (3) the difficulties with using synthetic and commercial datasets. Our proposed approach mitigates these problems by allowing classification models to be created for arbitrary types of fraud by defining a corresponding fraud agent and generating a set of synthetic data. This approach has advantages over previous work using real data, since in synthetic data ground truth is known, and multiple datasets can be generated for different types of fraud. In addition, the proposed feature set allows the created models to work across multiple fraud types, and not just shilling fraud. Using our approach, an existing shill detection algorithm was improved, and two new decision tree classifiers were created. Evaluation results over synthetic and commercial data validate our approach. Results over synthetic data show that both decision tree models perform significantly better than the original and optimised SSA algorithms in identifying two types of shilling behaviour: simple and delayed-start shilling. DT (features), which used our proposed feature set, performed significantly better than DT (ratings) over low values of FPR. Results over a commercial dataset show that the decision tree models, which were trained on synthetic data only, can identify users exhibiting characteristics consistent with shilling.
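Comparing detectors "over low values of FPR" means asking how many fraudulent users each one catches while keeping the false positive rate (FPR, the share of legitimate users wrongly flagged) under a small budget. The sketch below shows one way to compute this; the function name `tpr_at_fpr` and the eight scores are hypothetical and are not results from the paper.

```python
def tpr_at_fpr(scores, labels, max_fpr):
    """Highest true positive rate reachable by thresholding scores with FPR <= max_fpr.

    scores: higher means more suspicious; labels: True for fraudulent users.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    best_tpr = 0.0
    for threshold in sorted(set(scores)):
        flagged = [s >= threshold for s in scores]
        true_pos = sum(f and y for f, y in zip(flagged, labels))
        false_pos = sum(f and not y for f, y in zip(flagged, labels))
        if false_pos / negatives <= max_fpr:
            best_tpr = max(best_tpr, true_pos / positives)
    return best_tpr

# Hypothetical suspicion scores for eight users: three fraudulent, five legitimate.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [True, True, False, True, False, False, False, False]

print(tpr_at_fpr(scores, labels, max_fpr=0.0))  # two of three shills caught with no false alarms
print(tpr_at_fpr(scores, labels, max_fpr=0.2))  # all three caught if one false alarm is allowed
```

A detector that reaches a higher true positive rate at the same small FPR budget is the better choice when legitimate users vastly outnumber fraudulent ones, since even a small FPR translates into many wrongly flagged users.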