Technological Choices of the ResearchSpace Project

Martin Doerr,

FORTH-ICS, Centre for Cultural Informatics,

 August 2010

 

Introduction

In this note we comment on the technological choices of the Andrew W. Mellon Foundation Officer’s Grant Project “ResearchSpace”, as presented to us by the British Museum. We focus in particular on:

  • the choice to transform collection information into RDF,
  • the choice to integrate RDF information in an RDF triple store, and
  • the choice of the CIDOC CRM as the global schema,

in the light of the characteristics of the scholarly cultural-historical discourse and research.

Relevance of RDF

RDF is not just another technology for encoding and managing data. Behind the RDF/OWL data model stands a long tradition of knowledge representation and knowledge management tools, which, however, did not achieve any wider industrial significance for decades. Three main factors account for this:

  • For decades, computer scientists and implementers concentrated on complex reasoning (“artificial intelligence”), rather than on scale.
  • Until recently, industry was preoccupied with installing systems rather than integrating information from many sources. In cultural heritage, this is still the case. In industry, “data warehousing”, “enterprise knowledge access” and “business intelligence” have shifted the focus to information integration.
  • Until recently, there were no scalable platforms – database management systems – available, a consequence of the low demand.

It is the Semantic Web that introduced the utility of very large-scale information integration, and hence triggered both the development of an efficient, simple data model for information integration and the scalable technology to manage it. The latter has only become available in the last few years.

The breakthrough of the RDF data model is that all information is broken down into aggregates of independent binary links, the so-called “RDF triples”. Each triple links two uniform resource identifiers (URIs). It therefore stands without context: information from different sources, once brought into this form, can be thrown into one pool and is merged automatically into one large graph. RDF is strikingly simple, much simpler than XML Schema. This simplicity is particularly well suited to merging information, since it makes minimal assumptions about constraints. Numerous large projects in the 1990s have shown that there is no other data model for information integration: any other method must internally reinvent more or less RDF.
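The merging property described above can be illustrated with a minimal sketch. It assumes that triples are modelled as plain Python tuples and that the two sources already use the same identifiers; all "ex:" names are invented for illustration and do not come from any real collection.

```python
# Sketch only: every "ex:" identifier below is hypothetical.
# A triple is a (subject, predicate, object) tuple of URIs.
museum_a = {
    ("ex:vase1", "ex:made_by", "ex:painterX"),
    ("ex:vase1", "ex:found_at", "ex:athens"),
}
archive_b = {
    ("ex:painterX", "ex:worked_in", "ex:athens"),
    ("ex:vase1", "ex:found_at", "ex:athens"),  # same fact as in museum_a
}


def merge(*sources):
    """Merge any number of triple sets by plain set union."""
    merged = set()
    for source in sources:
        merged |= source
    return merged


def statements_about(graph, uri):
    """All triples in which `uri` occurs as subject or object."""
    return {t for t in graph if uri in (t[0], t[2])}


graph = merge(museum_a, archive_b)
painter_facts = statements_about(graph, "ex:painterX")
```

Because each triple carries no context, the union needs no schema negotiation: the duplicated fact collapses into one, and the painter becomes a node connecting statements from both sources.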

We therefore regard providing a “research space” in RDF as currently the most adequate choice. XML documents or relational database management systems (RDBMS) are not alternatives. Using them would mean not integrating information at all, but only providing the same kind of access to disparate information aggregates, because both XML Schema and RDBMS rely on context-dependent information elements. On the other hand, transformation from XML Schema and RDBMS to RDF is simple and cheap.
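To indicate why the purely syntactic part of such a transformation is cheap, the following sketch turns one relational row into triples. It is an assumption-laden illustration, not the ResearchSpace procedure: the function name, the base URI and the table and column names are hypothetical, literal values are kept as plain strings, and no mapping to a core schema is attempted.

```python
def row_to_triples(table, key_column, row, base="http://example.org/"):
    """Naive direct mapping of one relational row to triples.

    The row's primary key becomes the subject URI, each remaining
    column becomes a predicate URI, and each cell value an object.
    NULL cells (None) produce no triple.
    """
    subject = f"{base}{table}/{row[key_column]}"
    triples = []
    for column, value in row.items():
        if column == key_column or value is None:
            continue
        triples.append((subject, f"{base}{table}#{column}", value))
    return triples


# Hypothetical row from a hypothetical "objects" table.
row = {"id": 7, "title": "Amphora", "maker": None}
triples = row_to_triples("objects", "id", row)
```

The context that a relational cell takes from its row and table is made explicit in the subject and predicate of each triple, which is what allows the result to be pooled with other sources.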

Integrating Cultural Information

Libraries and digital libraries have lived perfectly well without information integration, just providing good, homogeneous finding aids. A cultural-historical research space, however, provides access to primary knowledge about objects and in archival material. This information is prior to having a subject in the library sense. A museum object is more an illustration or witness of the past than information in its own right. Cultural-historical research means understanding “possible pasts”: the facts, events, and material, social and psychological influences and motivations. It thrives on understanding contexts by pulling together bits and pieces of related facts from disparate resources, which typically cannot be classified under subjects in an obvious way. It depends on taking into account all known facts.

Choice of the CIDOC CRM and Co-Reference

Therefore the primary information offered should be tightly linked together and comprehensive, in contrast to supporting the discovery of a sufficiently comprehensive piece of literature. “Throwing” all accepted facts into an RDF triple store provides a global network of the “latest stage” of knowledge, which fulfills these requirements. In order to make sense of it, one must (1) apply a common global core schema, representing the most relevant forms of relationships in and across data assets, into which all sources can be mapped for the purpose of homogeneous access, and (2) try to resolve co-references.

Under these conditions, the global network of knowledge can reveal deep “stories” built out of an immense number of concatenated primary facts, something impossible for a traditional library.

The CIDOC CRM has proven over the last decade to be the most adequate core schema for that purpose, in particular for the museum world. It has been successfully extended to represent the library conceptualization of FRBR and its relationships, as well as other domains. It has been accepted by Europeana as an “application profile” for the next release of Europeana. The Europeana data model “EDM” is a far simpler core schema for global querying, generalizing even over Dublin Core and the CRM. It is, however, too abstract to serve as an alternative representation into which source data are transformed. Nevertheless, due to the modularity of RDF, the ResearchSpace could easily overlay the EDM on top of the CRM without affecting the format of the underlying data at all.
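One way such an overlay could work is sketched below, assuming that the more general schema is related to the more specific one by subproperty declarations. The property names are illustrative stand-ins, not actual CRM or EDM identifiers, and the function is hypothetical.

```python
# Sketch only: "specific:" and "generic:" names are invented stand-ins.
# Each specific property is declared a specialization of a generic one.
SUBPROPERTY_OF = {
    "specific:painted_by": "generic:related_agent",
    "specific:owned_by": "generic:related_agent",
}

# The stored data use only the specific schema and are never rewritten.
DATA = frozenset({
    ("ex:vase1", "specific:painted_by", "ex:painterX"),
    ("ex:vase1", "specific:owned_by", "ex:collectorY"),
    ("ex:vase1", "specific:found_at", "ex:athens"),
})


def query_generic(data, generic_property):
    """Answer a query phrased in the generic schema over specific data."""
    return {
        (s, o)
        for s, p, o in data
        if p == generic_property or SUBPROPERTY_OF.get(p) == generic_property
    }


agents = query_generic(DATA, "generic:related_agent")
```

The overlay lives entirely in the mapping table, so it can be added or withdrawn without touching the stored triples.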

Even if data are under the same schema, methods of co-reference resolution should be among the tools used to maintain the triple store. They find and relate URIs referring to the same things, thus connecting previously disconnected facts. Such tools greatly improve the performance of the integrated research space. Besides detecting co-reference with automated tools, we recommend treating co-reference as scholarly knowledge in its own right.
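The effect of relating co-referring URIs can be sketched with a small union-find structure. This is an illustration under stated assumptions rather than a description of any particular tool: the class, the "a:", "b:" and "ex:" identifiers are hypothetical, and the co-reference statement itself is supplied by hand, as a scholar might assert it.

```python
class CoReference:
    """Union-find over URIs asserted to denote the same thing."""

    def __init__(self):
        self._parent = {}

    def find(self, uri):
        """Return the canonical representative of `uri`."""
        self._parent.setdefault(uri, uri)
        root = uri
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every visited node at the root.
        while self._parent[uri] != root:
            self._parent[uri], uri = root, self._parent[uri]
        return root

    def same_as(self, a, b):
        """Record that `a` and `b` refer to the same thing."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            # Deterministic choice: the smaller string becomes the root.
            self._parent[max(root_a, root_b)] = min(root_a, root_b)


def canonical_view(triples, coref):
    """Rewrite subjects and objects to canonical URIs in a new set."""
    return {(coref.find(s), p, coref.find(o)) for s, p, o in triples}


source_a = {("a:rembrandt", "ex:painted", "a:nightwatch")}
source_b = {("b:rembrandt_van_rijn", "ex:born_in", "b:leiden")}

coref = CoReference()
coref.same_as("a:rembrandt", "b:rembrandt_van_rijn")
view = canonical_view(source_a | source_b, coref)
```

The co-reference assertions are held separately from the source triples and only applied when the view is computed, so they can themselves be attributed, disputed and revised.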

Integrated Knowledge and Sources

This sounds too good to be true, and indeed, it is only part of the complete story. Cultural-historical knowledge, even from primary sources, may be unreliable, incomplete or contradictory, and may even contain deliberate lies. Even observation of material evidence is error-prone. Information cannot be evaluated, criticized and improved if the source and context of its creation are unknown. The truth is in general undecidable. Every new fact may revive an old hypothesis.

Therefore the integrated global network of knowledge does not represent the truth, and in its integrated form it does not even represent the opinion of anybody in particular. This has to be understood. What it should and does represent in an RDF encoding is a comprehensive representation of all alternative facts and their integrated consequences, including all contradictions arising from them. As such, it has much more utility as a research tool, provided it is correctly understood and leads to the original source of each fact. It must be seen as a sophisticated index, not the “brave new” knowledge base.

Indeed, trying to exclude “impossible worlds” by constraints such as those expressible in OWL is counterproductive. In a demonstration, the naive user may be frustrated if the “smart” information system presents an obviously wrong fact, whether caused by a wrong co-reference or by contradicting sources. Demonstrations should make clear that this is how the information system leads the user to contradicting, or possibly contradicting, sources. Trying to “clean up” the integrated knowledge is counterproductive to supporting research and a general mistake. The cleaning up must come from curators updating their sources, or from secondary literature derived from the network. We also regard trying to aggregate (see ORE:aggregation) instead of merging alternative knowledge as a mistake, originating in a confusion about the role of the network. It pleases the eye with large displays of aggregated opinions, but makes it immensely complex to represent the effects of contradictory knowledge, or even impossible to detect it.

Most current triple stores actually represent the source of each triple, and we regard this feature as mandatory for the research space. Some platforms implement “contexts” or “named graphs” for that purpose.
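How keeping the source with each triple lets the network act as an index to contradicting sources can be sketched as follows. The sketch assumes that each statement is stored as a quadruple whose fourth element names its source; the identifiers, the dates and the notion of a property "expected to be single-valued" are hypothetical and are used only to report candidates, never to reject data.

```python
from collections import defaultdict

# Sketch only: (subject, predicate, object, source) with invented values.
quads = [
    ("ex:vase1", "ex:production_date", "-0450", "source:museum_a"),
    ("ex:vase1", "ex:production_date", "-0430", "source:archive_b"),
    ("ex:vase1", "ex:found_at", "ex:athens", "source:museum_a"),
]


def conflicting_statements(quads, single_valued):
    """Report, per (subject, predicate), rival values and their sources.

    Nothing is deleted or corrected: the result only points the
    researcher to the sources that disagree.
    """
    by_key = defaultdict(lambda: defaultdict(set))
    for s, p, o, source in quads:
        if p in single_valued:
            by_key[(s, p)][o].add(source)
    return {
        key: {value: sorted(srcs) for value, srcs in values.items()}
        for key, values in by_key.items()
        if len(values) > 1
    }


conflicts = conflicting_statements(quads, {"ex:production_date"})
```

Both dates remain in the store; the report merely tells the user which sources to consult, leaving any correction to the curators of those sources.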

Source Representation

As argued above, information sources must be available as units of knowledge, together with their own metadata about the context of creation. Those metadata may or may not be particularly useful as finding aids for the facts in them, but they are mandatory in order to evaluate and curate information. The content of the sources may be left in their original encoding or be transformed into a common representation.

If the content is left in the original encoding, displaying the information at a user workstation becomes highly complex. If it is transformed to a common format, it may lose some details of the original, but it is easy to update the knowledge network from it. If the original is a relational database, there is no generic unit of knowledge, so information must in any case be extracted in adequate units on a case-by-case basis.

Based on this thought, we regard the strategy of the ResearchSpace to transform all sources to XML-RDF as adequate, and currently the cheapest method. Wrapped in XML, each source retains the nature of a unit of knowledge, which can be traced to its creation and can be curated and updated by named authorities. Containing RDF under a CRM schema (or an extension of it), it provides the welcome homogeneity needed to merge it into the global network. Once the researcher can communicate with the curators of the original information, all scholarly quality requirements can be fulfilled. In the long term, curators of digital cultural information should come together and find ways to cite digital content in scholarly publications. This is not only a question of units of knowledge and provenance, but also of fixity and long-term preservation. Obviously, an integrated network of knowledge cannot play this role.

The project should be aware that transformation of information is a) inevitable in any form of information aggregation or integration and b) a process that requires a combination of curatorial knowledge and IT skills and is not to be underestimated. On the other hand, the computer science literature refers to manual schema mapping as too expensive. This is actually not true; the cost simply must be understood. If a schema with 500 fields, belonging to a database containing a hundred thousand or a million records, is manually mapped once, the cost of working a couple of weeks on it is negligible per transformed data record in comparison to the cost of producing the records. Only manual correction of content can become prohibitive.
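The arithmetic behind this argument can be made explicit in a few lines. The figures for database size are taken from the paragraph above; reading “a couple of weeks” as two weeks of 40 working hours each is our assumption, and the function name is hypothetical.

```python
def mapping_seconds_per_record(weeks, records, hours_per_week=40.0):
    """One-off schema mapping effort, spread over all transformed records."""
    return weeks * hours_per_week * 3600 / records


# Assumption: "a couple of weeks" = 2 weeks at 40 working hours each.
per_record_small = mapping_seconds_per_record(2, 100_000)    # about 2.88 s
per_record_large = mapping_seconds_per_record(2, 1_000_000)  # about 0.29 s
```

Under these assumptions the mapping effort amounts to roughly three seconds of expert time per record or less, and it does not grow with the number of records, whereas producing or manually correcting a record costs effort for every single record.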

Summary

In summary, we regard the technological choices of the ResearchSpace Project as adequate and as representing the state of the art. We have made similar arguments in [Doerr 2008], and other recent projects, such as CultureSampo in Finland, CLAROS in the UK and even Europeana, take a similar route. The choice of RDF technology has a deep bearing on functionality, and is not a question of industrial domination. The necessity to hold information both in an integrated form and as distinct units of knowledge is a consequence of the scholarly knowledge flow (its structure of producing and consuming knowledge), and possibly of the limitations of current technology. However, distribution of information has its own benefits, in terms of scale, preservation and curation. The project may wish to pay particular attention to the following points:

  • not compromising the utility of integrating contradictory information as a “research index”, while making the user aware of this feature;
  • providing adequate display of, or access to, the original sources in distinct form;
  • not underestimating the data transformation tools and skills required, and not overestimating the cost of manual schema mapping by experts.

We wish the project good progress and great success.

References

[Doerr 2008] Martin Doerr, D. Iorizzo, The dream of a global knowledge network: A new approach, ACM Journal on Computing and Cultural Heritage, Vol. 1, No. 1, Article 5, June 2008.

