Managing data resources at high quality is usually viewed as axiomatic. However, we suggest that, since the process of improving data quality should also attempt to maximize economic benefits, high data quality is not necessarily economically optimal. We demonstrate this argument by evaluating a microeconomic model that links the handling of data quality defects, such as outdated data and missing values, to economic outcomes: utility, cost, and net-benefit. The evaluation is set in the context of Customer Relationship Management (CRM) and uses large samples from a real-world data resource used for managing alumni relations. Within this context, our evaluation shows that all model parameters can be measured and that the model-related assumptions are largely well supported. The evaluation confirms the assumption that the optimal quality level, in terms of maximizing net-benefit, is not necessarily the highest possible. Further, the evaluation process contributes some important insights for revising current data acquisition and maintenance policies.
Maintaining data resources at a high quality level is a critical task in managing organizational information systems (IS). Data quality (DQ) significantly affects IS adoption and the success of data utilization [10] and [26]. Data quality management (DQM) has been examined from a variety of technical, functional, and organizational perspectives [22]. Achieving high quality is the primary objective of DQM efforts, and much research in DQM focuses on methodologies, tools, and techniques for improving quality. Recent studies (e.g., [14] and [19]) have suggested that high DQ, although it has clear merits, should not necessarily be the only objective to consider when assessing DQM alternatives, particularly in an IS that manages large datasets. As shown in these studies, maximizing economic benefits, based on the value gained from improving quality and the costs involved in improving it, may conflict with the target of achieving a high data quality level. Such findings motivate the need to link DQM decisions to economic outcomes and tradeoffs, with the goal of identifying more cost-effective DQM solutions.
The quality of organizational data is rarely perfect because data, when captured and stored, may suffer from such defects as inaccuracies and missing values [22]. Its quality may deteriorate further as the real-world items that the data describes change over time (e.g., a customer changing address, profession, and/or marital status). A plethora of studies have underscored the negative effect of low DQ on decision performance (e.g., [7], [9], [16] and [29]) and have identified the need to develop data refreshing policies [23], to measure DQ [13] and [19], and to communicate DQ assessments to decision makers [29] and [31]. However, maintaining data at a high quality level involves significant costs [12]. These costs are associated with efforts to detect and correct defects, set governance policies, redesign processes, and invest in monitoring tools. From an economic perspective, one would try to reach a certain quality level at the minimum possible cost. Targeting a higher DQ level improves the utility of the data. (We use the term “utility” as a synonym for “value” or “benefit”, to be consistent with the use of this term in prominent prior literature. This has nothing to do with “utility theory”.) Yet, at the same time, targeting a higher DQ level increases DQM costs [14]. Although some DQM decisions involve significant utility/cost tradeoffs, economics-driven assessments of DQM alternatives are under-examined, barring a few exceptions. Some works (e.g., [3], [4] and [5]) use utility-driven assessments to understand tradeoffs between different DQ dimensions, optimize their configuration accordingly, and use the results for improving data processes. An algorithm that minimizes the cost of retrieving data that meets certain quality requirements has been proposed in [2]. A policy for optimizing the cost of synchronizing the contents of a data warehouse (DW) with the source systems from which data is retrieved has been examined in [11]. A similar issue is examined from the standpoint of refreshing distributed data views [28] and from the standpoint of the data retrieved by query execution in DW environments [15]. Other research has also used economic assessments for developing superior DQ measurements (e.g., [13] and [19]).
A framework for optimally configuring a tabular dataset, considering economic perspectives, has been described in [14]. In this study, we develop and evaluate that model further to examine two key questions for defining optimal quality improvement policies: a) Within a large data resource, what subset of records (defined by the time-span coverage, as explained later) should be targeted for improvement? b) Within that chosen subset, what should be the targeted quality level? The model in [14] has been evaluated analytically, using closed-form solutions and numerical approximations to assess applicability, given certain assumptions and constraints. In this study, we describe a rigorous and comprehensive empirical evaluation, which examines the applicability and usefulness of the model in a real-world setting. We show that, within our evaluation context, all model variables can be operationalized and all parameters estimated. Further, our evaluation confirms our modeling assumptions about associations between the decision variables (time span and quality level) and the economic outcomes (utility, cost, and net-benefit). We show that improvements to current data acquisition and maintenance policies, identified by applying the model, can significantly increase the overall benefit. The evaluation also highlights enhancements to the model that would address similar design decisions in other data management contexts. Our evaluation illustrates the importance of quantitatively assessing and understanding the cost–benefit tradeoffs, particularly in large datasets where such tradeoffs can be very significant.
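To make the central argument concrete, the following minimal sketch shows how a net-benefit-maximizing quality level can fall below the highest attainable level. It is an illustration only and is not the model of [14]: the functional forms (a concave utility curve and a convex cost curve over a quality level q in [0, 1]) and all names and parameter values (`utility`, `cost`, `net_benefit`, `best_quality`, `scale`, `rate`, `power`) are assumptions chosen for the example, and the time-span decision variable is omitted for brevity.

```python
import math


def utility(q, scale=100.0, rate=4.0):
    """Hypothetical concave utility: diminishing returns as quality q rises.

    Normalized so that utility(0) == 0 and utility(1) == scale.
    """
    return scale * (1.0 - math.exp(-rate * q)) / (1.0 - math.exp(-rate))


def cost(q, scale=60.0, power=3.0):
    """Hypothetical convex cost: the last increments of quality cost the most."""
    return scale * q ** power


def net_benefit(q):
    """Net-benefit is utility minus cost at quality level q."""
    return utility(q) - cost(q)


def best_quality(steps=100):
    """Grid search over q in [0, 1] for the level that maximizes net-benefit."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=net_benefit)


q_star = best_quality()
print(f"net-benefit-maximizing quality level: {q_star:.2f}")
print(f"net-benefit at that level:            {net_benefit(q_star):.2f}")
print(f"net-benefit at perfect quality (q=1): {net_benefit(1.0):.2f}")
```

Under these assumed curves, the maximizing quality level lies strictly inside the interval and yields a higher net-benefit than perfect quality does. Whether the same holds for a real data resource depends on utility and cost curves estimated from that resource.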
We evaluate the model in a CRM context. Several studies (e.g., [8], [17], [21] and [27]) have underscored the importance of managing customer data at a high quality level. DQ defects (e.g., missing, inaccurate, and/or outdated data values) might prevent managers and analysts from having the right picture of customers and their purchase preferences and, hence, might damage marketing efforts significantly. Some studies (e.g., [19] and [23]) have also discussed methodologies and techniques for improving the quality of customer data. For our evaluation, we use large data samples from a real-world system that helps manage alumni relationships in a large university. This system helps segment and categorize donors, predict donor behavior, and manage solicitation campaigns, much as a traditional CRM system helps manage customers [6], [23] and [27]. Though we focus on CRM, our model and evaluation methodology apply, in general, to data environments that manage large data resources, such as data warehouses (DW) and enterprise resource planning (ERP) systems. Such environments execute business processes, support decision making, and generate revenue through the sale of data products (e.g., [18], [20] and [32]). We see this plethora of data usages as ways of gaining benefits from the data resource. Such benefits can be conceptualized as “utility” [1], a measure of the value gained through enhancements to business performance, improvements to decision outcomes, or the data consumer's willingness to pay. We posit that assessing utility-cost tradeoffs, with the aim of maximizing the net-benefit gained from using data resources, must be an important goal in managing these resources.
In the remainder of this paper, we first briefly review the dataset optimization model and state our evaluation objectives. We then describe our process for evaluating the model with the alumni data, present and analyze the results, and highlight important insights gained through such analyses. To conclude, we restate the contributions of this study, discuss implications for DQM research and practice, and suggest directions for future research.
This study is motivated by the notion that DQM can benefit from understanding the link between quality configuration decisions and economic outcomes such as utility, costs, and net-benefits. In this study, we examine this notion in the context of managing alumni data, a context in which utility/cost tradeoffs are significantly affected by DQM configuration choices. We evaluated a model for optimizing the configuration of a tabular dataset within this context. Our evaluation shows that all model variables can be operationalized and that most model assumptions are supported. Our evaluation also confirms that the model can have a strong impact on associated economic benefits.
This study offers insights into, and its results have implications for, other contexts. IS environments are complex and involve many decisions that are linked to utility-cost tradeoffs. A key challenge in economics-driven evaluation is the conceptualization of economic outcomes. The monetary utility measurement used in this study is applicable to CRM. Other business domains (e.g., finance, healthcare, insurance, human resources) may need different ways of conceptualizing and measuring utility. With appropriate utility measurements, the proposed model may be used for optimizing design configurations in these areas. Conceptualizing and estimating costs can be challenging as well. As highlighted by our evaluation, the cost of data management activities that have not been previously implemented is difficult to assess. Further, IS implementation often involves intangible costs, such as those associated with shifts in motivation, technology adoption, and/or political struggles.
Our evaluation highlights the need for a broader perspective on DQM. So far, research in this field has tended to emphasize functional and technical perspectives. These are, no doubt, critical to successful DQM. However, the results of our evaluation indicate the need to methodically address the economic aspects involved in DQM configuration and ongoing maintenance. Our study also emphasizes that DQM is a continuous process and not a one-time effort. We view our study and its results as a step that justifies the inclusion of economic perspectives in managing the quality of data resources.