RDFProv: A relational RDF store for querying and managing scientific workflow provenance
Article code | Publication year | Pages in the English article |
---|---|---|
21849 | 2010 | 30 pages (PDF) |
Publisher: Elsevier - Science Direct
Journal : Data & Knowledge Engineering, Volume 69, Issue 8, August 2010, Pages 836–865
Abstract
Provenance metadata has become increasingly important to support scientific discovery reproducibility, result interpretation, and problem diagnosis in scientific workflow environments. The provenance management problem concerns the efficiency and effectiveness of the modeling, recording, representation, integration, storage, and querying of provenance metadata. Our approach to provenance management seamlessly integrates the interoperability, extensibility, and inference advantages of Semantic Web technologies with the storage and querying power of an RDBMS to meet the emerging requirements of scientific workflow provenance management. In this paper, we elaborate on the design of a relational RDF store, called RDFProv, which is optimized for scientific workflow provenance querying and management. Specifically, we propose: i) two schema mapping algorithms to map an OWL provenance ontology to a relational database schema that is optimized for common provenance queries; ii) three efficient data mapping algorithms to map provenance RDF metadata to relational data according to the generated relational database schema, and iii) a schema-independent SPARQL-to-SQL translation algorithm that is optimized on-the-fly by using the type information of an instance available from the input provenance ontology and the statistics of the sizes of the tables in the database. Experimental results are presented to show that our algorithms are efficient and scalable. The comparison with two popular relational RDF stores, Jena and Sesame, and two commercial native RDF stores, AllegroGraph and BigOWLIM, showed that our optimizations result in improved performance and scalability for provenance metadata management. Finally, our case study for provenance management in a real-life biological simulation workflow showed the production quality and capability of the RDFProv system. 
Although presented in the context of scientific workflow provenance management, many of our proposed techniques apply to general RDF data management as well.
Introduction
With recent advances in the development of Scientific Workflow Management Systems [34], [39], [46], [69], [70], [80] and [123], scientists from various domains are able to automate their experiments using scientific workflows to achieve significant scientific discoveries via complex and distributed scientific computations. As a result, scientific workflow has emerged as a new field to address the new requirements from scientists [70] and [75]. One such important requirement is provenance management which is essential for scientific workflows to support scientific discovery reproducibility, result interpretation, and problem diagnosis [3], [20] and [96]. This support is enabled via provenance metadata that captures the origin and derivation history of a data product, including the original data sources, intermediate data products, and the steps that were applied to produce the data product. The provenance management problem concerns the efficiency and effectiveness of the modeling, recording, representation, integration, storage, and querying of provenance metadata. While there is an ongoing community effort on standardizing provenance modeling via the Open Provenance Model (OPM) [3], it is still not clear which storage and query model is most suitable for provenance management. Recently, the Semantic Web [16] and [94] technologies have been increasingly used for provenance management due to their flexibility and semantics support [48], [50], [65], [88] and [121], such that provenance metadata is represented and captured via Resource Description Framework (RDF) [111] and [114], RDF Schema (RDFS) [113], and Web Ontology Language (OWL) [110], and queried using the SPARQL [115] query language. 
This technological suite, enhanced with Semantic Web inference support, was shown to address [88] the four functional requirements for provenance identified by the Open Provenance Model: (1) provenance information interoperability, (2) ease of application development, (3) precise description of provenance information, and (4) inference capability and digital representation of provenance. In addition, in our work, we choose a Semantic Web approach for provenance management because of several advantages. First, a flexible and extensible data model is needed for provenance representation, because the provenance information that should be recorded can differ from one system to another and from one domain to another, and can evolve over time; the RDF data model satisfies this requirement. Second, it is important to interpret and reason about provenance using domain knowledge via domain-specific provenance ontologies; therefore, an inference engine that supports user-defined inference rules is needed, as domain-specific provenance ontologies can contain various inference rules (such as “a peptide is derived from a protein”) that cannot be known in advance, and these ontologies can evolve rapidly over time. Third, provenance interoperability is becoming more and more important because of the need to integrate provenance across different provenance models, domains, and organizations in collaborative scientific projects. The RDF model facilitates such integration and interoperability. Finally, as RDF serializes graphs, it is naturally suitable for representing provenance graphs with no further adaptation, even though the mapping does not have to be one-to-one (e.g., the OPM implementation as RDF/OWL by the Tupelo project [6]).
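To make the role of user-defined inference rules more concrete, the following is a minimal sketch in Python of rule-based augmentation of an RDF dataset. It is not RDFProv's inference engine. Triples are assumed to be plain string tuples, and the names `ex:cleavedFrom`, `ex:derivedFrom`, `peptide_rule`, `transitive_rule`, and `augment` are hypothetical, introduced only to encode the example rule that a peptide is derived from a protein.

```python
from typing import Callable, Iterable, Set, Tuple

Triple = Tuple[str, str, str]
Rule = Callable[[Set[Triple]], Iterable[Triple]]


def peptide_rule(triples: Set[Triple]) -> Iterable[Triple]:
    """Hypothetical rule: a peptide cleaved from a protein is derived from it."""
    for s, p, o in triples:
        if (p == "ex:cleavedFrom"
                and (s, "rdf:type", "ex:Peptide") in triples
                and (o, "rdf:type", "ex:Protein") in triples):
            yield (s, "ex:derivedFrom", o)


def transitive_rule(triples: Set[Triple]) -> Iterable[Triple]:
    """Hypothetical rule: ex:derivedFrom is treated as transitive."""
    derived = [(s, o) for s, p, o in triples if p == "ex:derivedFrom"]
    for a, b in derived:
        for c, d in derived:
            if b == c:
                yield (a, "ex:derivedFrom", d)


def augment(triples: Iterable[Triple], rules: Iterable[Rule]) -> Set[Triple]:
    """Apply all rules repeatedly until no new triple is produced."""
    closed = set(triples)
    rules = list(rules)
    while True:
        # Collect every inferred triple first, then keep only the new ones.
        new = {t for rule in rules for t in rule(closed)} - closed
        if not new:
            return closed
        closed |= new


dataset = {
    ("ex:pep1", "rdf:type", "ex:Peptide"),
    ("ex:prot1", "rdf:type", "ex:Protein"),
    ("ex:pep1", "ex:cleavedFrom", "ex:prot1"),
    ("ex:prot1", "ex:derivedFrom", "ex:sample1"),
}
augmented = augment(dataset, [peptide_rule, transitive_rule])
```

In this sketch the augmented set is what would be handed on for storage, so queries over `ex:derivedFrom` also see the inferred triples. Because the rules are ordinary functions, a domain ontology that evolves only needs its rule list changed, not the storage code.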
In this paper, we propose an approach to provenance management that seamlessly integrates the interoperability, extensibility, and inference advantages of Semantic Web technologies with the storage and querying power of an RDBMS to meet the emerging requirements of scientific workflow provenance management. Our motivation for using mature relational database technology is that the growth rate of provenance metadata is potentially very high, since provenance is generated automatically for every scientific experiment. On the Semantic Web, large volumes of RDF data are managed with so-called RDF stores, and the majority of them, including Jena [118] and [119], Sesame [23], 3store [56] and [57], KAON [107], RStar [71], OpenLink Virtuoso [42], DLDB [81], RDFSuite [9] and [105], DBOWL [77], PARKA [101], and RDFBroker [100], use an RDBMS as a backend to manage RDF data. Although a general-purpose relational RDF store (see [15] for a survey) can be used for provenance metadata management, the following provenance-specific requirements bring about several optimization strategies for schema design, data mapping, and query mapping, enabling us to develop a provenance metadata management system that is more efficient and flexible than one that is simply based on an existing RDF store.
• As provenance metadata is generated incrementally, each time a scientific workflow executes, provenance systems should emphasize optimizations for efficient incremental data mapping. As we show in this work, one such optimization, a join-elimination optimization strategy, can be developed for provenance based on the property that workflow definition metadata is generated before workflow execution metadata.
• As the performance of provenance storage and that of provenance querying often conflict, it may be preferable for a provenance management system to trade data ingest performance for query performance. For example, for long-running scientific workflows, trading data ingest performance for query performance might be a good strategy.
• The identification of common provenance queries has the potential to lead to an optimized database schema design to support efficient provenance browsing, visualization, and analysis.
• Update and delete are not a concern of provenance management since it works in an append fashion, similarly to log management. Therefore, we can apply some denormalization and redundancy strategies for database schema design, leading to improved query performance.
These provenance-specific metadata properties cannot be assumed by a general-purpose RDF store, which rules out several interesting data management optimizations that would yield better performance for data ingest and querying. While conducting a case study for a real-life scientific workflow in the biological simulation field (see Section 7 for detailed information) to illustrate and verify the validity of our research, we observed that two popular general-purpose RDF stores, Jena and Sesame, could not completely satisfy the provenance management requirements of the workflow. While Sesame could not keep up with the data ingest rate, Jena could not perform as well as Sesame on queries. Both systems lacked support for some provenance queries. Therefore, by exploiting the above provenance characteristics, we design a relational RDF store, called RDFProv, which is optimized for scientific workflow provenance querying and management. RDFProv has a three-layer architecture (see Fig. 1) that complies with the architectural requirements defined for the reference architecture for scientific workflow management systems [68]. The provenance model layer is responsible for managing provenance ontologies and rule-based inference to augment to-be-stored RDF datasets with new triples.
The model mapping layer employs three mappings: (1) schema mapping to generate a relational database schema based on a provenance ontology, (2) data mapping to map RDF triples to relational tuples, and (3) query mapping to translate RDF queries expressed in the SPARQL language into relational queries expressed in the SQL language. These mappings bridge the provenance model layer and the relational model layer, where the latter is represented by a relational database management system that serves as an efficient relational provenance storage backend. This paper elaborates on the design of RDFProv and has the following main contributions: i) we propose two schema mapping algorithms to map a provenance ontology encoded with OWL to a relational database schema that is optimized for common provenance queries; ii) we propose three efficient data mapping algorithms to map provenance RDF metadata to relational data according to the generated relational database schema, and iii) we propose a schema-independent SPARQL-to-SQL translation algorithm that is optimized on-the-fly by using the type information of an instance available from the input provenance ontology and the statistics of the sizes of the tables in the database. At each design step, we contribute novel ideas which are not available in existing RDF stores, such as new kinds of relations for schema mapping, optimized incremental strategies for data mapping, and two query optimization techniques for query translation. When combined together, our algorithms provide a competitive solution to the provenance management problem. We compare our techniques with open-source relational RDF stores, Jena [118] and [119] and Sesame [23], and commercial native RDF stores, AllegroGraph [1] and BigOWLIM [2], to show that our optimizations result in improved performance and scalability for Semantic Web enabled provenance metadata management. 
We also show how SPARQL can be extended with negation, aggregation, and set operations (e.g., division) to support additional important provenance queries. Last, but not least, we provide a case study for provenance management in the TangoInSilico [43] scientific workflow, exploring the production quality and capability of RDFProv for this real-life provenance application.
1.1. Organization
The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 discusses the provenance model layer of RDFProv and introduces a sample provenance ontology. Sections 4, 5 and 6 present the model mapping layer of RDFProv, elaborating on provenance ontology to database schema mapping, provenance metadata to relational data mapping, and SPARQL-to-SQL query translation, respectively. Section 7 provides our case study for provenance management in a real-life scientific workflow from the biological simulations field. Section 8 empirically compares RDFProv with two commercial relational RDF stores. Finally, Section 9 concludes the paper and discusses possible future work directions.
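The three mappings of the model mapping layer described in the introduction (schema mapping, data mapping, and query mapping) can be pictured with a small sketch. The Python code below is an illustration under simplifying assumptions, not the paper's algorithms: it assumes one two-column table per ontology property, handles only basic graph patterns with constant predicates, and applies none of the type-based or statistics-based optimizations. The mini-ontology `ONTOLOGY` and the functions `schema_mapping`, `data_mapping`, and `query_mapping` are hypothetical names.

```python
import sqlite3

# Hypothetical mini-ontology: RDF property -> relational table name.
ONTOLOGY = {
    "ex:used": "used",
    "ex:wasGeneratedBy": "was_generated_by",
}


def schema_mapping(conn, ontology):
    """Schema mapping: create one (s, o) table per ontology property."""
    for table in ontology.values():
        conn.execute(f"CREATE TABLE {table} (s TEXT NOT NULL, o TEXT NOT NULL)")


def data_mapping(conn, ontology, triples):
    """Data mapping: append each triple to the table of its property.
    A property missing from the ontology raises KeyError in this sketch."""
    for s, p, o in triples:
        conn.execute(f"INSERT INTO {ontology[p]} (s, o) VALUES (?, ?)", (s, o))


def query_mapping(ontology, select, patterns):
    """Query mapping: translate a basic graph pattern into one SQL join.
    Terms starting with '?' are variables; predicates must be constants.
    Returns the SQL string and its positional parameters."""
    bound = {}  # variable -> first column it was bound to
    tables, conds, params = [], [], []
    for i, (s, p, o) in enumerate(patterns):
        alias = f"t{i}"
        tables.append(f"{ontology[p]} AS {alias}")
        for term, col in ((s, f"{alias}.s"), (o, f"{alias}.o")):
            if not term.startswith("?"):
                conds.append(f"{col} = ?")   # constant: filter condition
                params.append(term)
            elif term in bound:
                conds.append(f"{col} = {bound[term]}")  # shared variable: join
            else:
                bound[term] = col
    sql = "SELECT " + ", ".join(f"{bound[v]} AS {v[1:]}" for v in select)
    sql += " FROM " + ", ".join(tables)
    if conds:
        sql += " WHERE " + " AND ".join(conds)
    return sql, params


conn = sqlite3.connect(":memory:")
schema_mapping(conn, ONTOLOGY)
data_mapping(conn, ONTOLOGY, [
    ("ex:run1", "ex:used", "ex:input"),
    ("ex:result", "ex:wasGeneratedBy", "ex:run1"),
    ("ex:other", "ex:wasGeneratedBy", "ex:run2"),
])

# SPARQL: SELECT ?src WHERE { ex:result ex:wasGeneratedBy ?run . ?run ex:used ?src }
sql, params = query_mapping(
    ONTOLOGY,
    ["?src"],
    [("ex:result", "ex:wasGeneratedBy", "?run"), ("?run", "ex:used", "?src")],
)
rows = conn.execute(sql, params).fetchall()
```

Here `rows` holds the inputs that the run producing `ex:result` consumed, which is a typical lineage question. The translator never inspects the concrete tables beyond the ontology lookup, which is one simple reading of "schema-independent"; the optimizations the paper describes would sit on top of such a baseline.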
Conclusion
In this work, we designed RDFProv, a relational RDF store and Semantic Web driven system optimized for querying and managing scientific workflow provenance metadata. The architecture of RDFProv seamlessly integrates the interoperability, extensibility, and reasoning advantages of Semantic Web technologies with the storage and querying power of an RDBMS. To support this integration, three model mappings are described in detail. Our schema mapping, data mapping, and SPARQL-to-SQL query translation algorithms are optimized to efficiently support (1) common provenance queries, (2) incremental data loading that exploits the order in which the various kinds of provenance metadata are inserted, and (3) schema-independent query translation that is optimized on-the-fly by using the type information of an instance and the statistics of the sizes of the tables in the database. The RDFProv system design provides two alternative database representations, SchemaMapping-V and SchemaMapping-T, giving the flexibility to set up a provenance repository based on specific scientific workflow needs. In particular, SchemaMapping-V supports very fast schema and data mappings, while SchemaMapping-T supports very efficient query processing. Our query translation allows transparent switching between these two representations. The experimental study showed that our algorithms are efficient and scalable. The comparison with the existing general-purpose RDF stores Jena, Sesame, AllegroGraph, and BigOWLIM showed that our optimizations provide improved efficiency and scalability for provenance metadata management. Finally, our case study for provenance management in the TangoInSilico scientific workflow showed the production quality and capability of the RDFProv system. Our provenance storage and querying techniques are orthogonal to the scientific workflow model. Therefore, we support the storage and querying of provenance generated from both long-duration and short-duration activities.
However, one can take advantage of the characteristics of scientific workflows (long-duration or short-duration) to trade between data ingest performance and querying performance. For long-duration activities, in which provenance ingest performance is less important, we can choose a database schema with a slower data mapping strategy but a faster query response time. On the other hand, for short-duration activities, in which the provenance ingest rate becomes critical, a faster data mapping strategy can be chosen to speed up data ingest. As shown in our case study, such a tradeoff between data mapping performance and query performance might be a desirable feature for some scientific workflow applications. In the future, we would like to continue to explore further optimizations for database schema design, data ingest, and querying, with the main focus on semantic query optimization. We are also interested in column-oriented databases [7] and [95], which can be customized for provenance management, and in provenance reduction techniques [28], which can be used to decrease storage requirements via duplicate elimination and provenance inheritance. Finally, we would like to consider querying and managing scientific workflow provenance in distributed environments with multiple computing nodes to enable processing of huge datasets with billions of triples.
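The tradeoff between the two alternative database representations, and the idea of switching between them transparently, can be sketched as follows. This Python example rests on a loud assumption: purely for illustration, the fast-ingest layout exposes each property as a view over a single triple table, and the fast-query layout stores each property in its own indexed table. The text above does not spell out how SchemaMapping-V and SchemaMapping-T are actually laid out, so `build_view_layout`, `build_table_layout`, and the table names are hypothetical.

```python
import sqlite3

TRIPLES = [
    ("ex:run1", "ex:used", "ex:input"),
    ("ex:result", "ex:wasGeneratedBy", "ex:run1"),
]
# Hypothetical property -> relation name mapping shared by both layouts.
PROPERTIES = {"ex:used": "used", "ex:wasGeneratedBy": "was_generated_by"}


def build_view_layout(triples):
    """Assumed fast-ingest layout: one generic triple table plus views.
    Loading is a plain append; each query pays for the view's filter."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE triple (s TEXT, p TEXT, o TEXT)")
    for prop, name in PROPERTIES.items():
        # CREATE VIEW cannot take bound parameters, so the constant is inlined.
        conn.execute(
            f"CREATE VIEW {name} AS SELECT s, o FROM triple WHERE p = '{prop}'"
        )
    conn.executemany("INSERT INTO triple VALUES (?, ?, ?)", triples)
    return conn


def build_table_layout(triples):
    """Assumed fast-query layout: one indexed table per property.
    Loading routes every triple and maintains indexes; queries join directly."""
    conn = sqlite3.connect(":memory:")
    for name in PROPERTIES.values():
        conn.execute(f"CREATE TABLE {name} (s TEXT, o TEXT)")
        conn.execute(f"CREATE INDEX {name}_s ON {name} (s)")
    for s, p, o in triples:
        conn.execute(f"INSERT INTO {PROPERTIES[p]} VALUES (?, ?)", (s, o))
    return conn


# The same translated SQL runs unchanged against either layout.
QUERY = """
SELECT u.o
FROM was_generated_by AS g JOIN used AS u ON u.s = g.o
WHERE g.s = ?
"""

results = [
    build(TRIPLES).execute(QUERY, ("ex:result",)).fetchall()
    for build in (build_view_layout, build_table_layout)
]
```

Because both layouts expose relations with the same names and columns, the query text does not need to know which one is in use. Under these assumptions, a short-duration workflow with a high ingest rate would favor the first layout, and a long-duration workflow that is queried heavily would favor the second.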