Evolution and quality management in dynamic data integration systems


Motivation.

Data management in P2P systems is a challenging problem given the large number of peers, their autonomous nature, and the potential heterogeneity of their schemas. A peer data management system (PDMS) [Arenas et al. 2003, Ng et al. 2003, Bernstein et al. 2002] is one such application, enabling users to transparently query several heterogeneous and autonomous data sources. A PDMS can be seen as an evolution of traditional data integration systems [Katchaounov 2003, Bernstein et al. 2002], in which the notion of a single mediation schema is replaced by a set of semantic mappings between peers' schemas. Queries are formulated over a particular peer schema and sent to other peers of the system. Thus, query processing considers local data, i.e., data stored at the peer where the query is submitted, and remote data, i.e., data stored at other peers reachable through schema mappings [Tatarinov et al. 2004].

Traditional approaches for data sharing in mediated and other distributed database environments make assumptions that no longer hold in P2P systems, or suffer from scalability problems when applied to systems with a massive number of participants. Furthermore, data integration techniques developed for P2P systems need to cope with the volatile nature of the peers [Löser et al. 2003, Ng et al. 2003].

The development of data integration systems, independently of the architecture they are based on (e.g., mediation systems, P2P architectures or GRID infrastructures), poses two major problems: evaluating the quality of the information offered by these systems and maintaining the semantic mappings which connect the data sources. Ensuring the quality of the information delivered to users is an important problem on which the success of information systems depends. In the case of a data integration system, the problem is particularly difficult because data are integrated from multiple sources of varying quality and because these sources evolve autonomously.

In data integration systems, the evolution problem is mainly related to changes raised at the data source level, for example, adding or removing an element of a source schema as well as adding or removing a data source from the system [Lóscio 2003, McBrien et al. 2002, Lee et al. 2002b]. Such evolution may cause changes in the mappings or semantic links which relate data sources. These semantic links constitute the baseline of any data integration system and, generally, they are defined statically at the design phase at a high development cost. However, the environment over which a data integration system is built is not static; it may evolve frequently. Consequently, in order to keep the data integration system alive, it is necessary to dynamically reconsider the semantic links and adapt them to the new changes. Otherwise, the data integration system becomes progressively useless. The additional cost of this dynamic maintenance of semantic links may increase dramatically with the volume and frequency of change events. Therefore, the maintenance problem becomes a bottleneck within the data integration system [Velegrakis et al. 2004].

Handling source evolution is an essential feature, as it implies modularity and scalability of the data integration system. Moreover, as the system evolves it becomes necessary to re-evaluate the quality criteria so as to take the changes into account. The absence of techniques for evaluating the quality of the information offered by the system and for maintaining the semantic mappings can therefore make data integration systems inoperative and obsolete [Altareva et al. 2005, Wang et al. 1996, Pipino et al. 2002, Marotta et al. 2005].

Related work.

Different approaches have been proposed for managing evolution in systems involving multiple data sources. The ToMAS system [Velegrakis et al. 2004] considers mappings between a source schema and a target schema and produces a set of new mappings when changes occur in the source schemas; the generated mappings are ranked based on their similarity with the initial mapping. [McBrien et al. 2002] proposes a set of primitives representing graph operations that transform a source schema into a target schema; if a source schema changes, a new mapping is generated as a set of transformations from the new source schema to the target schema. In [Lee et al. 2002b], a framework is proposed to support schema evolution in relational systems; an extension of SQL is proposed to define views including users' preferences about the way the views should evolve under source changes.

In previous work [Bouzeghoub et al. 2003], we have proposed an approach to mapping evolution for relational data sources; we assume that the mappings are generated following the methodology described in [Kedad et al. 1999]. The propagation of source changes is done using a set of event-condition-action (ECA) rules, as sketched below. Some problems remain unsolved by the existing approaches, such as capturing change events and evaluating their impact, defining different strategies for the global evolution process, and evaluating the propagation algorithms in the presence of a large number of source change events.
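To give the flavor of such a rule, the following sketch is purely illustrative and is not the actual rule set of [Bouzeghoub et al. 2003]; the event structure, the mapping representation and all names are assumptions made for the example. The rule reacts to the removal of a source schema element, checks which mappings depend on it, and flags those mappings for revision.

    # Illustrative ECA rule for mapping maintenance (hypothetical names).
    from dataclasses import dataclass

    @dataclass
    class ChangeEvent:
        source: str    # data source where the change occurred, e.g. "S1"
        element: str   # affected schema element, e.g. an attribute name
        kind: str      # "add" or "remove"

    # Each mapping records the (source, element) pairs it depends on.
    mappings = {
        "M1": {("S1", "price"), ("S2", "cost")},
        "M2": {("S1", "title")},
    }

    def propagate(event, mappings):
        """EVENT: a source change; CONDITION: a mapping references the
        removed element; ACTION: drop the reference and report the
        mapping as needing revision."""
        affected = []
        if event.kind != "remove":
            return affected
        for name, deps in mappings.items():
            if (event.source, event.element) in deps:
                deps.discard((event.source, event.element))
                affected.append(name)
        return affected

    # Removing S1.price invalidates M1 but leaves M2 untouched.
    print(propagate(ChangeEvent("S1", "price", "remove"), mappings))  # ['M1']

A realistic rule base would also handle additions and renamings and would choose between repairing a mapping, regenerating it, or notifying an administrator, which is where the different evolution strategies mentioned above come into play.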

Detecting changes at the data source level may be difficult because of uncertainty and imprecision in the data. Many approaches have been proposed for the management of vague and imprecise data. Rough set (RS) theory [Pawlak 1982] has proven to be very useful for managing fuzzy or uncertain data, without requiring additional information on the data such as a belief ratio or a priori probabilities. RS is well adapted to data-centric problems such as attribute or instance selection, indistinguishable instances (a group of instances that cannot be told apart under a representation in a given language), or concept approximation. RS relies on a mathematical model that represents vague concepts by means of upper and lower approximations based on instance similarities; a small illustration is given below. RS theory has been used in various domains. In [Ahlqvist et al. 2003], a combination of RS and fuzzy set theories has been used in a spatial data integration application. It seems very promising to apply RS for defining data quality parameters, maintaining mappings and evaluating changes.
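As a minimal sketch of these approximations, using the standard rough set definitions: instances with identical attribute descriptions form indiscernibility classes; the lower approximation of a concept X collects the classes entirely contained in X, while the upper approximation collects the classes that intersect X. The data and the concept "instance has changed" below are assumptions made only for the example.

    # Minimal rough set sketch: indiscernibility classes, lower and upper
    # approximations of a concept X. Data are purely illustrative.
    from collections import defaultdict

    # Instances described by attribute values; instances with identical
    # descriptions are indiscernible from each other.
    instances = {
        "i1": ("a", 1), "i2": ("a", 1),
        "i3": ("b", 2), "i4": ("b", 2),
        "i5": ("c", 3),
    }
    X = {"i1", "i3", "i5"}  # target concept, e.g. "instance has changed"

    classes = defaultdict(set)
    for inst, desc in instances.items():
        classes[desc].add(inst)

    # Lower approximation: classes entirely contained in X (certainly in X).
    lower = set().union(*[c for c in classes.values() if c <= X])
    # Upper approximation: classes intersecting X (possibly in X).
    upper = set().union(*[c for c in classes.values() if c & X])

    print(lower)  # {'i5'}
    print(upper)  # {'i1', 'i2', 'i3', 'i4', 'i5'}

Instances in the upper but not in the lower approximation form the boundary region, where the status of the instance remains uncertain; this is precisely the kind of imprecision that change detection, quality evaluation and mapping maintenance have to cope with.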