A research case study: Problems and recommendations when using a textual data mining tool
|Article code||Publication year||English article||Persian translation||Word count|
|22303||2013||12-page PDF||Order||8623 words|
Publisher: Elsevier - ScienceDirect
Journal: Information & Management, Volume 50, Issue 7, November 2013, Pages 540–552
Electronic commerce is now an important part of national and international trade, and more controls are therefore needed to ensure effective website design and efficient customer service. Today a person who needs to buy a product, whether on his or her own behalf or as a purchasing agent for an organization, will search vendors' websites to find a satisfactory and cost-effective product that is available and guaranteed by a vendor with whose products the buyer is familiar. However, as the electronic marketplace expands worldwide, the buyer needs to learn more about the organization and how it operates, because the customer may live in a different country or be accessing the website of a small and relatively little-known company. Thus the material on a company website should satisfy the needs of worldwide customers: searches should be easy to perform, and the data should be accurate and easy to understand.
In our attempt to assess the “value” of a website, we decided to use a textual data mining tool. This led us to ask questions about the problems and potential of mining the contents of websites and to try to determine the difficulty of mining rather sparse and yet complex data. We therefore initially hoped to prove the following research hypothesis: The material on an organization's website discloses its sector of industry, where the industry is known by downloading the corporation's self-defined NAICS code (which is normally included on its website).
Since almost every corporation uses its website as a way to advertise its wares, we felt that mining the whole site to determine whether the clusters would form into sets of industries would prove too simple a task and that the result of such a research effort would be trivial. Instead, we chose to use a data mining tool on only part of the website: the legal attachment statements.
We therefore downloaded these parts of the websites of 475 of the US Fortune 500 companies, together with their NAICS codes. Specifically, we used a data mining tool (CLUTO) on the dataset consisting of all the available downloads, hoping to find the results clumped into corporations considered to be in the same industry (i.e., performing business activities that have been categorized into easily understandable sectors, such as the computer industry). Governments and international bodies are interested in such categorization, and the best-known schemes today are the Standard Industrial Classification (SIC) and the North American Industry Classification System (NAICS).
Our attempt to determine the relationship between the legal attachment statements of Fortune 500 corporations and their (self-defined) industry codes (NAICS) required some form of cluster analysis. We then attempted to validate the results by checking to what extent the companies within a cluster had the same NAICS codes, and found that the clusters did not perform as we expected. On examination of the NAICS codes, we realized that they too were not what we expected, a surprising finding that led us to ask several questions about the process by which a corporation decides on its set of codes.
1.1. Our purpose and the research questions
We wished to determine the value of textual data mining by clustering the datasets formed by downloading only the legal portions of the websites of major corporations, in the hope of finding that they would be grouped according to their industrial classification, as stated by their self-defined NAICS codes. This led to one major and one minor research question: Is it possible or reasonable to evaluate the effectiveness of the textual data mining process by finding how closely the clumps resulting from the use of the data mining tool on data downloaded from a corporate website are explained by the corporation's self-reported NAICS code?
And, because of our answer to this, it was necessary to add: What has to be done to the downloaded data to allow a tool to clump the data meaningfully?
1.2. The significance of our results
The results of our work on the major question led us to a discussion of how to reduce the time and effort expended in obtaining useful information using a textual data mining tool on a complex and unformatted set of downloaded data. The second or minor question led us to ask further: What are the problems in stating a company's SIC or NAICS codes? And: Are the data produced for international and local export/import analysis accurate (given the lack of breakdown of the information delivered by individual corporations)? These seemed important questions, and we decided to treat them as questions for our next major research project.
1.3. The structure of the paper
In Section 2, we briefly discuss the portion of a typical website that deals with the legal aspects. This is followed (in Section 3) by a description of the NAICS coding system and a discussion of textual data mining (Section 4), leading to a discussion of our overall research methodology (Section 5). Section 6 provides an analysis of our results and Section 7 our conclusions. Our references and eight appendices complete the paper.
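The validation step described in the introduction (checking to what extent the companies within a cluster share the same NAICS code) can be illustrated with a simple cluster purity calculation. The following is a minimal sketch and not the authors' procedure: the function names, the company names, and the toy data are hypothetical, and purity is only one plausible way to quantify the agreement. It also assumes that the first two digits of a NAICS code identify the sector, which is a simplification because some sectors (for example Manufacturing, 31-33) span several two-digit prefixes.

```python
from collections import Counter, defaultdict


def naics_sector(code: str) -> str:
    """Return the two-digit prefix of a NAICS code (simplified notion of sector)."""
    return code.strip()[:2]


def cluster_purity(assignments, naics_codes):
    """Measure how well cluster membership agrees with NAICS sectors.

    assignments: dict mapping company name -> cluster id (e.g. from a clustering tool)
    naics_codes: dict mapping company name -> self-reported NAICS code (string)

    Returns (per_cluster, overall), where per_cluster maps each cluster id to
    (majority sector, share of the cluster in that sector) and overall is the
    fraction of all companies that fall in their cluster's majority sector.
    """
    # Group the sector of every company by the cluster it was assigned to.
    sectors_by_cluster = defaultdict(list)
    for company, cluster_id in assignments.items():
        sectors_by_cluster[cluster_id].append(naics_sector(naics_codes[company]))

    per_cluster = {}
    majority_total = 0
    for cluster_id, sectors in sectors_by_cluster.items():
        # The most common sector in the cluster and how many members have it.
        sector, count = Counter(sectors).most_common(1)[0]
        per_cluster[cluster_id] = (sector, count / len(sectors))
        majority_total += count

    overall = majority_total / len(assignments)
    return per_cluster, overall


# Hypothetical toy data: five made-up companies in two clusters.
example_assignments = {
    "AlphaCo": 0, "BetaCo": 0, "GammaCo": 0,
    "DeltaCo": 1, "EpsilonCo": 1,
}
example_naics = {
    "AlphaCo": "334111", "BetaCo": "334112", "GammaCo": "522110",
    "DeltaCo": "522110", "EpsilonCo": "522120",
}

if __name__ == "__main__":
    per_cluster, overall = cluster_purity(example_assignments, example_naics)
    for cluster_id, (sector, share) in sorted(per_cluster.items()):
        print(f"cluster {cluster_id}: majority sector {sector}, purity {share:.2f}")
    print(f"overall purity: {overall:.2f}")
```

A purity close to 1 would mean that the clusters line up with the self-reported sectors; a low value would mirror the mismatch reported above. A low value cannot by itself show whether the clustering or the self-reported codes are at fault.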
Conclusion (English)