A research case study: Problems and recommendations when using a textual data mining tool
Article code | Publication year | Length of English article |
---|---|---|
22303 | 2013 | 12 pages (PDF) |
Publisher : Elsevier - Science Direct
Journal : Information & Management, Volume 50, Issue 7, November 2013, Pages 540–552
English abstract
Although many interesting results have been reported by researchers using numeric data mining methods, there are still questions that need answering before textual data mining tools will be considered generally useful due to the effort needed to learn and use them. In 2011, we generated a dataset from the legal statements (mainly privacy policy and terms of use) on the websites of 475 of the US Fortune 500 Companies and used it as input to see what we could detect about the organizational relationships between the companies by using a textual data mining tool. We hoped to find that the tool would cluster similar corporations into the same industrial sector, as validated by the company's self-reported North American Industry Classification System code (NAICS). Unfortunately, this proved only marginally successful, leading us to ask why and to pose our research question: What problems occur when a data-mining tool is used to analyze large textual datasets that are unstructured, complex, duplicative, and contain many homonyms and synonyms? In analyzing our large dataset we learned a great deal about the problem and fortunately, after significant effort, determined how to “massage” the raw dataset to improve the process and learn how the tool can be better used in research situations. We also found that NAICS, as self-reported by companies, are of dubious value to a researcher—a matter briefly discussed.
English introduction
Electronic commerce is now an important part of national and international trade, and thus more controls are needed to ensure effective website design and an efficient way of servicing customers. Today a person needing to buy a product will, on his or her own behalf or as a purchasing agent for an organization, search the websites of vendors to find a satisfactory and cost-effective product that is available and guaranteed by a vendor with whose products the buyer is familiar. However, as the electronic marketplace expands worldwide, the buyer needs to learn more about the organization and how it operates, because the customer may live in a different country or be accessing the website of a small and relatively little-known company. Thus the material on a company website should satisfy the needs of worldwide customers, whose search should be easy to perform; the data, of course, should be accurate and easy to understand. In our attempt to assess the “value” of a website, we decided to use a textual data mining tool. This led us to ask questions about the problems and potential of mining the contents of websites and to try to determine the difficulty of mining rather sparse and yet complex data. We therefore initially hoped to prove the following research hypothesis: The material on an organization's website discloses its sector of industry, where the industry is known by downloading the corporation's self-defined NAICS (which is normally included on its website). Since almost every corporation uses its website as a way to advertise its wares, we felt that mining the whole site to determine whether the clusters would form into sets of industries would prove too simple a task and that the result of such a research effort would be trivial. Instead, we chose to use a data mining tool on only part of the website: the legal attachment statements.
We therefore downloaded these attachments from the websites of 475 of the US Fortune 500 companies, along with their NAICS codes. Specifically, we used a data mining tool (CLUTO) on the dataset consisting of all the available downloads, hoping to find the results clumped into corporations considered to be in the same industry (i.e., performing business activities that have been categorized into easily understandable sectors, such as the computer industry). Governments and international bodies are interested in such categorization, and the best known schemes today are the Standard Industrial Classification (SIC) and the North American Industry Classification System (NAICS). Our attempt to determine the relationship between the legal attachment statements of Fortune 500 corporations and their (self-defined) industry code (NAICS) required some form of cluster analysis. At this point, we attempted to validate the results by checking to what extent the companies within a cluster had the same NAICS codes, and found that the clustering did not perform as we expected. On examining the NAICS codes, we realized that they were not what we expected, a surprising finding that led us to ask several questions about the process that a corporation follows to decide on its set of codes.

1.1. Our purpose and the research questions

We wished to determine the value of textual data mining by clustering the datasets formed by downloading only the legal portions of the websites of major corporations, in the hope of finding that they would be grouped according to their industrial classification, as stated by their self-defined NAICS. This led to one major and one minor research question: Is it possible or reasonable to evaluate the effectiveness of the textual data mining process by finding how closely the clumps resulting from the use of the data mining tool on data downloaded from corporate websites are explained by each corporation's self-reported NAICS code?
And, because of our answer to this, it was necessary to add: What has to be done to the downloaded data to allow a tool to clump the data meaningfully?

1.2. The significance of our results

The results of our work on the major question led us to a discussion of how to reduce the time and effort expended in obtaining useful information using a textual data mining tool on a complex and unformatted set of downloaded data. The second or minor question led us to ask further: What were the problems in stating a company's SIC or NAICS codes? And: Are the data produced for international and local export/import analysis accurate (given the lack of breakdown of the information delivered by individual corporations)? These two seemed important questions and led us to consider them as questions for our next major research project.

1.3. The structure of the paper

In Section 2, we briefly discuss the portion of a typical website that deals with the legal aspects. This is followed (in Section 3) by a description of the NAICS coding system and a discussion of textual data mining (Section 4), leading to a discussion of our overall research methodology (Section 5). Section 6 provides an analysis of our results and Section 7 our conclusions. Our references and eight appendices complete the paper.
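The major research question above asks how closely the clusters are explained by self-reported NAICS codes. One common way to quantify that kind of agreement is cluster purity; the following is a minimal sketch of that measure, not the authors' validation procedure. The company names, cluster assignments, and codes are hypothetical, and comparing only the first two digits of a code is a simplifying assumption (for example, NAICS manufacturing spans sectors 31 to 33).

```python
from collections import Counter, defaultdict


def naics_prefix(code: str, digits: int = 2) -> str:
    """Return the leading digits of a NAICS code (2 digits approximates the sector)."""
    return code[:digits]


def cluster_purity(assignments: dict, naics_codes: dict, digits: int = 2) -> float:
    """Fraction of companies whose NAICS prefix equals the majority prefix of their cluster."""
    prefixes_by_cluster = defaultdict(list)
    for company, cluster in assignments.items():
        prefixes_by_cluster[cluster].append(naics_prefix(naics_codes[company], digits))
    # For each cluster, count the companies that share its most common prefix.
    matched = sum(
        Counter(prefixes).most_common(1)[0][1]
        for prefixes in prefixes_by_cluster.values()
    )
    return matched / len(assignments)


# Hypothetical example: five companies assigned to two clusters.
assignments = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1}
naics_codes = {"A": "334111", "B": "334112", "C": "541511", "D": "424320", "E": "424330"}
purity = cluster_purity(assignments, naics_codes)
```

A purity near 1.0 would mean each cluster is dominated by one sector; a value near the share of the largest sector would suggest the clusters carry little industry information. Because companies may report several codes, a fuller treatment would need a rule for choosing or combining them.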
English conclusion
One of our reviewers commented: “This paper is essentially a case study of a research project; the authors describe the progress of the project and the changes that were required as they proceeded. It is quite illuminating for those considering similar projects; it highlights some of the difficulties they may encounter and offers some solutions.” Another said: “I think one can learn much more from this study about NAICS and organizational self-reporting than about relationships between the organizations.” From our extensive analysis of the outputs of CLUTO, we became convinced that it could be effective in providing results that can be valuable in addressing research questions about large textual databases. The complementary use of CLUTO with manual content analysis supports the use of data mining tools in analyzing large textual databases. However, because of our difficulty in determining the accuracy of the self-defined NAICS codes, we were led to ask some questions that had not seemed important when we started the research:
• Who (or at what level) in the company is told to select and report on the codes?
• How many classification codes should a company report?
• How do the corporate responders decide on the classification?
• Is there a team or official with ultimate responsibility to check and agree to the set of codes?
• Does the set of codes selected represent the image that the corporation wishes to project?
• Whose interests may the classification represent?

This problem is exacerbated by the fact that there is no objective external validation of the classifications selected.

7.2. Comments on the process

We were interested in understanding the problems in using data mining techniques on large and rather sparsely populated datasets. Specifically, we downloaded the Privacy and Terms of Use statements made by members of the Fortune 500 companies on their websites in 2011.
Our hope was that we would be able to cluster the companies into an approximation of their industry group (as defined by their NAICS codes). At first, we found poor grouping: some of the discriminating words showed little similarity between corporations in a cluster, so we felt we were finding only random clustering. However, when the discriminating words were examined, it was apparent that several of them (like children, copyright, link, and cookie) appeared in almost all websites, and thus did not discriminate but should be considered stop words. We therefore went through several cycles of investigation, resulting in improved clustering but still finding some overlaps because of common words in the company name that did not correlate with industry categories (like electric, general, or corporation). A problem we faced while analyzing results was our inability to make a claim with high confidence. As a result, we feared that some hidden factor had been missed or was not fully considered. Unfortunately, our research seemed to produce fuzzy results whose quality depended on the reasoning of the analyst. Examples of decisions that had to be made included:
• What constitutes a legal attachment statement (i.e., what to extract from the website)?
• How did the company select its NAICS?

We saw that companies varied substantially in their choice of NAICS, even though we felt they performed similar activities. Indeed, as we analyzed their NAICS, we realized that self-reported NAICS were inconsistent. For example, when we ran a search for the industry for NAICS ID 33 using the United States Census Bureau's NAICS for 2007 [11], we found that the multi-digit NAICS included 331, 3311, 3312, and 332, but that 338 did not exist and 339999 represented all miscellaneous manufacturing.
Moreover, company VF defined itself as 42432006 (Men's & Boys Clothing Merchant Wholesale), 54151109 (Custom Computer Programming Svcs), and 54161303 (Marketing Consulting Svcs); we could not understand how these activities, which are not homogeneous and seem unrelated, represent its industrial niche. However, we did find that for one or two major clusters of industries constructed by the data mining tool, there were several NAICS that represented the group at a significant level. It is possible, then, that the differences were due to the lack of a standard way of assessing and therefore defining the NAICS, and/or to the fact that we chose a dataset that is relatively sparse from the standpoint of industry distinction: the Legal Attachments.

7.3. Major findings

Initially we posed the question: What problems occur when a data mining tool is used to analyze large textual datasets that are unstructured, complex, duplicative, and contain many homonyms and synonyms? During our analysis of the rather large textual dataset we encountered many “strange” findings, which we were able to clarify by applying manual analysis. The unstructured, complex, duplicative nature of the large datasets produced results that seemed inconsistent until we could tell why the software had produced them. We found it necessary to make adjustments to the datasets to reduce errors caused by:
• Similar words, synonyms, and phrases, such as Cookies and Links, which acted as discriminating words and led to the clustering of unrelated industries.
• Common words occurring in the names of organizations that also resulted in clustering of unrelated industries (e.g., General Electric, General Foods, and General Motors), requiring elimination of company names from the datasets.
• Companies that we considered to be in the same industry were indeed clustered together, but on examination of the results we found that their (self-defined) NAICS were surprisingly different.
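The first two adjustments in the list above can be sketched in a few lines. This is an illustrative sketch only, not the authors' procedure (they describe manual review): it treats any word that appears in nearly every document as a stop word, and removes each company's own name words from its document. The function names, the 0.9 document-frequency threshold, and the sample texts are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def ubiquitous_words(docs: dict, min_fraction: float = 0.9) -> set:
    """Words that occur in at least min_fraction of the documents."""
    doc_freq = Counter()
    for text in docs.values():
        doc_freq.update(set(tokenize(text)))  # count each word once per document
    threshold = min_fraction * len(docs)
    return {word for word, count in doc_freq.items() if count >= threshold}


def clean_documents(docs: dict, min_fraction: float = 0.9) -> dict:
    """Drop near-universal words and each company's own name words."""
    stop_words = ubiquitous_words(docs, min_fraction)
    cleaned = {}
    for company, text in docs.items():
        name_words = set(tokenize(company))
        cleaned[company] = [
            word for word in tokenize(text)
            if word not in stop_words and word not in name_words
        ]
    return cleaned


# Hypothetical legal-statement snippets, keyed by company name.
docs = {
    "General Electric": "General Electric uses cookies and links to improve turbine services",
    "General Motors": "General Motors uses cookies and links on vehicle pages",
    "Acme Bank": "Acme Bank uses cookies and links for account security",
}
cleaned = clean_documents(docs)
```

In this toy input, "cookies" and "links" are dropped because every document contains them, while "general" is dropped only as part of a company name. A frequency threshold cannot resolve synonyms or ambiguous acronyms, which is where the manual reading described in the text is still needed.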
Such problems can be avoided simply, with some manual intervention at the start of the analysis. As to the NAICS part of this problem, the “self reported nature” of those codes allows errors when comparing two industries, and a controlled process should be provided by a standardizing agency. Having refined their analysis of the data through multiple iterations, what is the value of the authors’ experience? Through the analysis we became convinced that textual mining requires careful understanding of the meaning of the words in the datasets and that the way the words are used in the sentences plays an important part in the accuracy of the results; i.e., ontology plays an important role in textual data mining and new tools should incorporate ways to exploit ontology. The authors believe that textual data mining is still in its infancy and that some stages still need human intervention. Thus we read all the Legal Attachment Statements and removed common words, repeated words in the dataset, and acronyms that were common but had different meanings, such as PHI (Personal Health Information or Pepco Holding Inc.). We cannot, however, claim that the application of our research model to a different dataset would be valuable, though we hope it would be useful, especially in Big Data and social media analysis.

7.4. Suggestions for future research

Our effort was obviously limited in that it used only one of many data mining tools. The study should probably be repeated with several data mining tools and the same datasets to provide a comparative analysis and see whether the results vary across tools. Also, it would be interesting to replicate the experiment with other datasets, such as those downloaded from a social website (e.g., Facebook), or to replicate the effort using the entire set of elements of the Fortune 500 websites, which would be a very large database and an exhausting task.
Clustering and numeric data mining methods have been used in business and related research to find statistical patterns for marketing, but it seems that more textual data could be used in such studies (and ultimately in practice) if the method were more automated (without the effort we had to make to reduce the problems of synonyms, for example). Perhaps the addition of an ontology tool could remove some of the problems; this possible solution is worth investigating. The problem of deciding whether a particular set of NAICS or SIC codes properly describes the industry of a particular corporation has already been discussed. This was not the prime reason for our effort, though we are planning to make it the subject of future work.