Download English ISI Article No. 22280
Article title

Feature evaluation for web crawler detection with data mining techniques
Article code: 22280
Publication year: 2012
English article length: 11 pages (PDF)
Source

Publisher: Elsevier - Science Direct

Journal: Expert Systems with Applications, Volume 39, Issue 10, August 2012, Pages 8707–8717

Keywords

Web crawler detection, Web server access logs, Data mining, Classification

Abstract

Distributed Denial of Service (DDoS) is one of the most damaging attacks on Internet security today. Recently, malicious web crawlers have been used to execute automated DDoS attacks on web sites across the WWW. In this study we examine the effect of applying seven well-established data mining classification algorithms to static web server access logs in order to: (1) classify user sessions as belonging to either automated web crawlers or human visitors and (2) identify which of the automated web crawler sessions exhibit ‘malicious’ behavior and are potential participants in a DDoS attack. The classification performance is evaluated in terms of classification accuracy, recall, precision and F1 score. Seven of the nine vector (i.e. web-session) features employed in our work are borrowed from earlier studies on classifying user sessions as belonging to web crawlers. However, we also introduce two novel web-session features: the consecutive sequential request ratio and the standard deviation of page request depth. The effectiveness of the new features is evaluated in terms of the information gain and gain ratio metrics. The experimental results demonstrate the potential of the new features to improve the accuracy of data mining classifiers in identifying malicious and well-behaved web crawler sessions.
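The abstract names two novel web-session features but does not define them here. Purely as an illustration of how such per-session features could be computed from parsed log entries, here is a minimal Python sketch; the Request fields, the link-following reading of "consecutive sequential request ratio", and the path-depth reading of "standard deviation of page request depth" are assumptions for illustration, not the paper's exact definitions.

```python
from dataclasses import dataclass
from statistics import pstdev
from typing import Dict, List


@dataclass
class Request:
    """One HTTP request in a user session (hypothetical fields)."""
    path: str      # requested page, e.g. "/docs/api/usage.html"
    referrer: str  # Referer header value, "" if absent


def page_depth(path: str) -> int:
    """Depth of a requested page: number of path segments below the site root."""
    return len([seg for seg in path.strip("/").split("/") if seg])


def session_features(requests: List[Request]) -> Dict[str, float]:
    """Compute the two features named in the abstract (assumed definitions).

    - consecutive sequential request ratio: fraction of consecutive request
      pairs in which the later request's referrer is the earlier request's
      page, i.e. the visitor appears to follow links rather than jump around.
    - standard deviation of page request depth: spread of the directory depth
      of the pages requested within the session.
    """
    if len(requests) < 2:
        return {"seq_ratio": 0.0, "depth_std": 0.0}
    sequential = sum(
        1 for prev, curr in zip(requests, requests[1:]) if curr.referrer == prev.path
    )
    depths = [page_depth(r.path) for r in requests]
    return {
        "seq_ratio": sequential / (len(requests) - 1),
        "depth_std": pstdev(depths),
    }


# Example: a short session that follows links and goes progressively deeper.
session = [
    Request("/index.html", ""),
    Request("/docs/intro.html", "/index.html"),
    Request("/docs/api/usage.html", "/docs/intro.html"),
]
print(session_features(session))  # {'seq_ratio': 1.0, 'depth_std': 0.816...}
```

Under these assumed definitions, a crawler that jumps between unrelated, deeply nested URLs would tend to show a low seq_ratio and a high depth_std, which is the kind of separation a classifier can exploit.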

Introduction

Today, the world is highly dependent on the Internet, the main infrastructure of the global information society, so the availability of the Internet is critical to economic growth. For instance, the phenomenal growth and success of the Internet has transformed the way traditional essential services such as banking, transportation, medicine, education and defence are operated. These services are now being actively replaced by cheaper and more efficient Internet-based applications. However, the inherent vulnerabilities of the Internet architecture provide opportunities for various attacks on the security of Internet-based applications. For example, denial-of-service (DoS) is a type of security attack that poses an immense threat to the availability of any Internet-based service and application. The DoS effect is achieved by sending a flood of messages to the target (e.g., a machine hosting a web site) with the aim of interfering with the target's operation and making it hang, crash, reboot, or do useless work (see Fig. 1). In general, single-source DoS attacks can be easily prevented by locating and disabling the source of the malicious traffic. However, distributed DoS (DDoS) attacks, launched from hundreds to tens of thousands of compromised zombies, present a much more complex challenge. Unlike in single-source DoS attack scenarios, locating the malicious hosts responsible for a DDoS attack becomes extremely difficult due to the sheer number of hosts participating in the attack. Furthermore, the larger collection of malicious hosts can generate an enormous amount of traffic towards the victim. The result is a substantial loss of service and revenue for businesses under attack. According to the United States Department of Defence report from 2008 presented in Wilson et al. (2008), cyber attacks from individuals and countries targeting economic, political, and military organizations may increase in the future and cost billions of dollars.

Attackers launching traditional DDoS attacks with illegal Network Layer packets can be easily detected (though not easily stopped) by signature detection systems such as intrusion detection systems. However, an emerging (and increasingly prevalent) class of DDoS attacks, known as Application Layer or Layer-7 attacks, has proven particularly challenging to detect. Traditional network measurement systems often fail to identify the presence of Layer-7 DDoS attacks because, in an Application Layer attack, the attacker utilizes a legitimate network session. More specifically, the attacker employs a web crawler program that performs a clever semi-random walk of the web site links, intended to resemble the web site traversal of an actual human user. Since such attack signatures look very much like legitimate traffic, it is difficult to construct an effective metric to detect and defend against Layer-7 attacks. Numerous studies have been published on the topic of Layer-7 DDoS attacks. Given that the key challenge of Layer-7 DDoS attacks is their close similarity to the patterns of legitimate user traffic, researchers studying Layer-7 defence mechanisms are mostly interested in devising effective techniques of attack detection.
More specifically, the research works in this field fall into two main groups: (1) detection of application-layer DDoS attacks during a flash crowd event based on aggregate-traffic analysis (Oikonomou and Mirkovic, 2009; Xie and Yu, 2009) and (2) differentiation between well-behaved and malicious web crawlers based on web-log analysis (Bomhardt et al., 2005; Hayati et al., 2010; Park et al., 2006). (A more detailed overview of the works from the latter group is provided in Section 2.)

In this study, we pursue the line of research of the second group further, through two sets of experiments. The goal of the first set of experiments is to: (1) examine the effectiveness of seven selected classification algorithms in detecting the presence of known well-behaved web crawlers, i.e. distinguishing them from human visitors, and (2) evaluate the potential of two newly proposed web-session features to improve the classification accuracy of the examined algorithms. The goal of the second experiment is to: (1) examine the effectiveness of the seven classification algorithms in distinguishing between four visitor groups to a web site (malicious web crawlers, well-behaved web crawlers, human visitors and unknown visitors, either human or robot) and (2) evaluate the potential of the two newly proposed web-session features to improve the classification accuracy of the examined algorithms in this case. The datasets used in the experiments are generated by pre-processing web server access log files. The implementations of the classification algorithms are provided by the WEKA data mining software (WEKA, 2010).

The novelty of our research is twofold. Firstly, to the best of our knowledge, this is the first study that looks into the detection of so-called malicious web crawlers, i.e. crawlers used to conduct Layer-7 attacks, and ways of distinguishing them from well-behaved web robots (such as Googlebot and MSNbot, among others). Secondly, in addition to employing traditional web-session features in our classification, we introduce two new features and show that their utilization can improve the classification accuracy of the examined algorithms.

The paper is organized as follows: In Section 2, we discuss previous works on web crawler detection. In Section 3, we discuss the advantages of utilizing supervised learning for the purpose of web visitor detection over using a simple rule-based web-log analyzer. In Section 4, we present an overview of our web-log analyzer and the process of dataset generation and labelling. In Section 5, we outline the design of the experiments and the performance metrics that were utilized. In Section 6, we present and discuss the results obtained from the classification study. In Section 7, we conclude the paper with our final remarks.
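The experiments described above use WEKA's classifier implementations on datasets derived from web server access logs. As a rough, non-authoritative sketch of that workflow (labelled session vectors fed to several classifiers and scored with precision, recall and F1), the following uses Python and scikit-learn in place of WEKA; the file name labelled_sessions.csv, its column layout and the particular classifiers shown are illustrative assumptions.

```python
# Sketch of the session-classification workflow, using scikit-learn in place
# of WEKA. The CSV name, its columns and the classifier choices are assumed.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier    # k-Nearest Neighbor
from sklearn.tree import DecisionTreeClassifier       # loose stand-in for C4.5
from sklearn.metrics import classification_report

# One row per pre-processed web session; 'label' holds the visitor class
# (e.g. human, well-behaved crawler, malicious crawler, unknown).
sessions = pd.read_csv("labelled_sessions.csv")
X = sessions.drop(columns=["label"])
y = sessions["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

for name, clf in [
    ("decision tree", DecisionTreeClassifier(random_state=0)),
    ("3-NN", KNeighborsClassifier(n_neighbors=3)),
]:
    clf.fit(X_train, y_train)
    # Per-class precision, recall and F1, the metrics reported in the paper.
    print(name)
    print(classification_report(y_test, clf.predict(X_test)))
```

Note that scikit-learn's DecisionTreeClassifier (CART) is only a loose stand-in for C4.5, and the paper's other algorithms (e.g. RIPPER or the Neural Network classifier mentioned in the conclusion) would be swapped in the same way.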

Conclusion

The detection of malicious web crawlers is one of the most active research areas in network security. In this paper, we study the problem of detecting known well-behaved web crawlers, known malicious web crawlers, unknown visitors and human visitors to a web site using existing data mining classification algorithms. The following three general conclusions were derived from our study:

• The classification accuracy of algorithms such as Neural Networks, C4.5, RIPPER and k-Nearest Neighbor is close to 100%. In the case of the Neural Network algorithm, the F1 scores of the underrepresented class (class 1) are 73% and 82% in Experiments 1 and 2, respectively.

• The two new features proposed, the consecutive sequential request ratio and the standard deviation of page request depths, are ranked highly among the other features used in the study by the information gain and gain ratio metrics. The new features are also explicitly shown to improve the classification accuracy, recall, precision and F1 score of most evaluated algorithms in both conducted experiments.

• As evident in our study, the browsing behaviours of web crawlers (both malicious and well-behaved) and human users are significantly different. Therefore, from the data mining perspective, their identification/classification is very much a feasible task. However, the identification/classification of crawlers that attempt to mimic human users will remain the most difficult future classification challenge.

We believe that with customization of either C4.5 or RIPPER, the misclassification rates of known well-behaved and malicious web crawlers could be further reduced.
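The feature ranking referred to above relies on the standard information gain and gain ratio measures (the same criteria behind C4.5's split selection). For reference, a small self-contained sketch of how these measures are computed over a discretised feature is given below; the toy feature/label values at the bottom are invented for illustration and are not data from the paper.

```python
import math
from collections import Counter
from typing import Sequence


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (in bits) of a label distribution."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature: Sequence[str], labels: Sequence[str]) -> float:
    """Reduction in label entropy obtained by splitting on a discretised feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [lab for val, lab in zip(feature, labels) if val == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder


def gain_ratio(feature: Sequence[str], labels: Sequence[str]) -> float:
    """Information gain normalised by the entropy of the feature's own values."""
    split_info = entropy(feature)
    return information_gain(feature, labels) / split_info if split_info else 0.0


# Toy example: a binned sequential-request-ratio feature vs. session labels.
feature = ["high", "high", "low", "low", "high", "low"]
labels = ["crawler", "crawler", "human", "human", "crawler", "human"]
print(information_gain(feature, labels), gain_ratio(feature, labels))  # 1.0 1.0
```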