Evolution and quality management in dynamic data integration systems
Motivation.
Data management in P2P systems is a challenging problem considering the
large number of peers, their autonomous nature and the potential
heterogeneity of their schemas. A peer data management system (PDMS)
[Arenas et al. 2003, Ng et al. 2003, Bernstein et al. 2002] is an
application that enables users to transparently query several
heterogeneous and autonomous data sources. A PDMS can be seen as an
evolution of traditional data integration systems [Katchaounov 2003,
Bernstein et al. 2002], in which the notion of a single mediation schema
is replaced by a set of semantic mappings between peer schemas. Queries
are formulated according to a particular peer schema and sent to other
peers of the system. Thus, query processing considers local data, i.e.,
data stored at the peer where the query is submitted, and remote data,
i.e., data stored at other peers associated through schema mappings
[Tatarinov et al. 2004].
Traditional approaches for data sharing in mediated and other
distributed database environments make assumptions that no longer hold
in P2P systems, or suffer from scalability problems when applied to
systems with a massive number of participants. Furthermore, data
integration techniques developed for P2P systems need to consider and
cope with the volatile nature of the peers [Löser et al. 2003, Ng et
al. 2003].
The development of data integration systems, independently of the
architecture they are based on (i.e., mediation systems, P2P
architectures or GRID infrastructures), poses two major problems: the
evaluation of the quality of the information offered by these systems
and the maintenance of the semantic mappings which connect the data
sources. Ensuring the quality of the information delivered to users is
an important problem which conditions the success of these systems. In
the case of a data integration system, the problem is particularly
difficult because data coming from multiple sources with various levels
of quality must be integrated while these sources evolve autonomously.
In data integration systems, the evolution problem is mainly related to
changes raised at the data source level, for example, adding or
removing an element of a source schema as well as adding or removing a
data source from the system [Lóscio 2003, McBrien et al. 2002, Lee et
al. 2002b]. Such evolution may cause changes in the mappings, or
semantic links, which relate data sources. These semantic links
constitute the baseline of any data integration system and, generally,
they are defined statically at design time, at a high development cost.
However, the environment over which a data integration system is built
is not static; it may evolve frequently. Consequently, to keep the data
integration system alive, it is necessary to dynamically reconsider the
semantic links and adapt them to these changes; otherwise, the data
integration system becomes progressively useless. The additional cost
of this dynamic maintenance of semantic links may increase dramatically
with the volume and frequency of change events. Therefore, the
maintenance problem becomes a bottleneck within the data integration
system [Velegrakis et al. 2004].
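To make the maintenance problem concrete, the sketch below models a semantic mapping as a query over a source schema and shows how a single source change invalidates it. The names (Source, Mapping, the Book example) are hypothetical and only illustrate the general idea, not a specific system.

```python
from dataclasses import dataclass


@dataclass
class Source:
    name: str
    attributes: set


@dataclass
class Mapping:
    target_element: str
    source: str
    used_attributes: set

    def is_valid(self, sources: dict) -> bool:
        # A mapping stays usable only if its source still exists and still
        # provides every attribute the mapping refers to.
        src = sources.get(self.source)
        return src is not None and self.used_attributes <= src.attributes


sources = {"S1": Source("S1", {"isbn", "title", "price"})}
m = Mapping("Book(isbn, title, price)", "S1", {"isbn", "title", "price"})

print(m.is_valid(sources))                  # True: the semantic link holds
sources["S1"].attributes.discard("price")   # source evolution: an attribute is removed
print(m.is_valid(sources))                  # False: the link must be adapted or dropped
```

The cost issue mentioned above follows directly: every such change event triggers a validity check and a possible rewrite of all affected mappings, and this effort grows with the volume and frequency of changes.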
Handling source evolution is an essential feature, as it underpins the
modularity and scalability of the data integration system. Moreover, as
the system evolves, it becomes necessary to re-evaluate the quality
criteria in order to take the changes into account. Thus, the absence
of techniques for evaluating the quality of the information offered by
the system and for maintaining the semantic mappings can make data
integration systems inoperative and obsolete [Altareva et al. 2005,
Wang et al. 1996, Pipino et al. 2002, Marotta et al. 2005].
Related work.
Different approaches have been proposed for managing evolution in
systems involving multiple data sources. The ToMAS system [Velegrakis
et al. 2004] considers mappings between a source schema and a target
schema and produces a set of new mappings when changes occur in the
source schemas; the generated mappings are ranked based on their
similarity with the initial mapping. [McBrien et al. 2002] proposes a
set of primitives representing graph operations that transform a source
schema into a target schema; if a source schema changes, a new mapping
is generated as a set of transformations from the new source schema to
the target schema. In [Lee et al. 2002b], a framework is proposed to
support schema evolution in relational systems; an extension of SQL is
proposed to define views including users' preferences about the way
these views should evolve under source changes.
In previous work [Bouzeghoub et al. 2003], we have proposed an approach
for mapping evolution for relational data sources; we assume that the
mappings are generated following the methodology described in [Kedad et
al. 1999]. The propagation of source changes is done using a set of
event-condition-action (ECA) rules. Some problems remain unsolved by
the existing approaches, such as capturing change events and evaluating
their impact, defining different strategies for the global evolution
process, and evaluating the propagation algorithms in the case of a
large number of source change events.
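The sketch below illustrates the general event-condition-action pattern for change propagation. It is not the actual rule set of [Bouzeghoub et al. 2003]; the event, the dict-based mapping representation and the repair action are assumptions chosen for illustration.

```python
# A minimal ECA-style propagation rule (illustrative only).
# Each mapping is a dict: {"target": ..., "source": ..., "attributes": set(...)}.

def on_attribute_removed(source_name: str, attribute: str, mappings: list) -> None:
    """EVENT: an attribute has been removed from a source schema."""
    for m in list(mappings):
        # CONDITION: the mapping is defined over that source and uses the attribute.
        if m["source"] == source_name and attribute in m["attributes"]:
            # ACTION: rewrite the mapping without the attribute if possible,
            # otherwise discard it.
            m["attributes"].discard(attribute)
            if not m["attributes"]:
                mappings.remove(m)


mappings = [{"target": "Book", "source": "S1", "attributes": {"isbn", "title"}}]
on_attribute_removed("S1", "title", mappings)
print(mappings)  # [{'target': 'Book', 'source': 'S1', 'attributes': {'isbn'}}]
```

In this pattern, the open problems listed above correspond to deciding which events are observable, which conditions are worth checking, and which repair strategy the action should apply when many such rules fire at once.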
Detecting changes at the data source level may be difficult because of
uncertainty and imprecision in the data. Many approaches have been
proposed for the management of vague and imprecise data. Rough set
theory (RS) [Pawlak 1982] has proven very useful for managing fuzzy or
uncertain data without requiring additional information about the data,
such as a belief ratio or a priori probabilities. RS is well adapted to
data-centric problems such as attribute or instance selection,
indiscernible instances (a group of instances that cannot be
distinguished under a given representation language), or concept
approximation. RS relies on a mathematical model that represents vague
concepts by means of lower and upper approximations based on instance
similarities. RS theory has been used in various domains; in [Ahlqvist
et al. 2003], a combination of RS and fuzzy set theories is used in a
spatial data integration application. It therefore seems promising to
apply RS for defining data quality parameters, maintaining mappings and
evaluating changes.
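As an illustration of the rough set machinery referred to above, the sketch below computes the lower and upper approximations of a target concept from the equivalence classes of an indiscernibility relation (here, equality on a chosen set of attributes). The instances and attributes are invented for the example.

```python
from collections import defaultdict


def approximations(instances, attributes, target):
    """Return the (lower, upper) approximations of the set `target`."""
    # Partition the instances into equivalence classes of the
    # indiscernibility relation induced by `attributes`.
    classes = defaultdict(set)
    for name, values in instances.items():
        key = tuple(values[a] for a in attributes)
        classes[key].add(name)

    lower, upper = set(), set()
    for eq_class in classes.values():
        if eq_class <= target:   # wholly inside the concept: certainly members
            lower |= eq_class
        if eq_class & target:    # overlaps the concept: possibly members
            upper |= eq_class
    return lower, upper


instances = {
    "p1": {"format": "xml", "size": "small"},
    "p2": {"format": "xml", "size": "small"},   # indiscernible from p1
    "p3": {"format": "csv", "size": "large"},
}
print(approximations(instances, ["format", "size"], {"p1", "p3"}))
# lower = {'p3'}, upper = {'p1', 'p2', 'p3'}
```

The gap between the two approximations is what makes RS attractive here: it quantifies how uncertain a detected change or a quality judgement is, without any prior probabilities on the data.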