---+ Introduction

Many datasets are transferred as XML, which provides a tree-based data
model that is purely syntactic in nature. Semantic processing is
standardised around RDF, which provides a graph-based model. In the
transformation process we must identify syntactic artifacts such as
meaningless ordering in the XML data, missing structure (e.g., the
_creator_ of an artwork is not a literal string but a person,
identified by a resource with properties) and overly structured data
(e.g., the dimension of an object is a property of the object itself,
not of some placeholder that combines physical properties of the
object). These syntactic artifacts must be translated into a proper
semantic model where objects and properties are typed and semantically
related to common vocabularies such as SKOS and Dublin Core.

This document describes our toolkit for supporting this transformation
process, together with examples taken from actual translations. The
toolkit is implemented in SWI-Prolog and can be downloaded using GIT
from one of the addresses below. Running the toolkit requires
SWI-Prolog, which can be downloaded from http://www.swi-prolog.org for
Windows, MacOS and Linux, or in source form for many other platforms.

    * git://eculture.cs.vu.nl/home/git/econnect/xmlrdf.git
    * http://eculture.cs.vu.nl/git/econnect/xmlrdf.git

The graph-rewrite engine is written in Prolog. This document does not
assume any knowledge about Prolog. The rule-language is, as far as
possible, a clean declarative graph-rewrite language. The
transformation process for actual data can be complicated, however. For
these cases the rule-system allows mixing rules with arbitrary Prolog
code, providing an unconstrained transformation system. We provide a
(dynamically extended) library of Prolog routines for typical
conversion tasks.

---+ Converting XML into RDF

The core idea behind converting `data-xml' into RDF is that every
complex XML element maps to a resource (often a bnode) and every atomic
attribute maps to an attribute of this bnode. Such a translation gives
a valid RDF document, which is much easier to access for further
processing.
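For example, a hypothetical record such as the one below (element names
and values are purely illustrative)

==
<record>
  <title>Still life with flowers</title>
  <creator>R. Jansz.</creator>
</record>
==

is converted, schematically, into something like

==
[ a           ahm:Record ;
  ahm:title   "Still life with flowers" ;
  ahm:creator "R. Jansz."
] .
==

where the exact property- and type-names depend on the name mapping
described below.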
There are a few places where we must be more subtle in the initial
conversion. First, the XML reserved attributes:

    * The xml:lang attribute is kept around and, if we create an RDF
      literal, it is used to create a literal in the current language.
    * xmlns declarations are ignored (they make the declarations
      available to the application, but the namespaces are already
      processed by the XML parser).

Second, we may wish to map some of our properties into rdfs:XMLLiteral
or RDF dataTypes. In particular the first _must_ be done in the first
pass to avoid all the complexities of turning the RDF back into XML
(think of the above-mentioned declarations, while ordering requirements
can make this fundamentally impossible).

Because this step needs type information about properties, we might as
well allow for some very simple transformations in the first phase.
These transformations are guided by the target RDF schema: the
properties and classes of the target schema may be annotated with an
additional property that steers the conversion. This property is called
map:xmlname, where the =map= prefix is currently defined as
http://cs.vu.nl/eculture/map/. If this property is associated with a
class, an XML element with the defined name is translated into an
instance of this class. If it is associated with a property, it affects
XML attribute or atomic element translation in two ways:

    * It uses the RDF property name rather than the XML name for the
      property.
    * The rdfs:range of the property affects the value translation:
        * If it is rdfs:XMLLiteral, the sub-element is translated to an
          RDF XMLLiteral.
        * If it is an XSD datatype, the sub-element is translated into
          a typed RDF literal.
        * If it is a proper class and the value is a valid URI, the URI
          is used as value without translation into a literal.

Below is an example that maps XML elements =record= into vra:Work
instances and maps the XML attribute =title= into the vra:title
property. Note that it is not required (and not desirable) to add the
=|map:xmlname|= properties to the actual schema files. Instead, put
them in a separate file and load both into the conversion engine.

==
@prefix vra: <http://www.vraweb.org/vracore/vracore3#> .
@prefix map: <http://cs.vu.nl/eculture/map/> .

# Map element-names to rdf:type
vra:Work map:xmlname "record" .

# Map xml attribute and sub-element names to properties
vra:title map:xmlname "title" .
==

---+ Default XML name mapping

The initial XML to RDF mapper uses the XML attribute and tag-names for
creating RDF properties. It provides two optional processing steps that
make identifiers fit better with RDF practice.

    1. It can add a _prefix_ to each XML name to create a fully
       qualified URI. E.g.,

       ==
       ?- rdf_current_ns(ahm, Prefix),
          load_xml_as_rdf('data.xml',
                          [ prefix(Prefix)
                          ]).
       ==

    2. It `restyles' XML identifiers. Notably, identifiers that contain
       a dot (.) are hard to process using Turtle. The library
       identifies alphanumerical substrings of the XML name and
       constructs new identifiers from these parts. By default,
       predicates start with a lowercase letter and each new part
       starts with an uppercase letter, as in =oneTwo=. Types (classes)
       start with an uppercase letter, as in =OneTwo=. This behaviour
       can be controlled with the options =predicate_style= and
       =class_style= of load_xml_as_rdf/2, as sketched below.
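For instance, assuming the style names mirror the exemplars above (an
assumption; consult the load_xml_as_rdf/2 documentation for the exact
option values), restyling could be requested as:

==
?- rdf_current_ns(ahm, Prefix),
   load_xml_as_rdf('data.xml',
                   [ prefix(Prefix),
                     predicate_style(oneTwo),  % assumed style name
                     class_style('OneTwo')     % assumed style name
                   ]).
==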
---+ Subsequent meta-data mapping

Further mapping of the meta-data consists of the following steps:

    1. Fix the node-structure.
    2. Re-establish internal links.
    3. Re-establish external links.
    4. Create a mapping schema that links the classes and predicates of
       the source to the target schema (e.g., Dublin Core).
    5. Assign URIs to blank nodes where applicable.

---++ Fix the node-structure

Source-data generally uses a record structure. Sometimes, each record
is a simple flat list of properties, while in other cases it has a
deeply nested structure. We distinguish three types of properties:

    1. Properties with a clear single literal value, such as a
       collection-identifier. Such properties are directly mapped to
       RDF literals.
    2. Properties with instance-specific scope that may have
       annotations. Typical examples are the title (multiple,
       translations, who has given the work a title, etc.) or a
       dimension (unit, which dimension, etc.). In this case, we create
       a new RDF node for each instance.
    3. Properties that link to external resources: persons (creator),
       material (linking to a controlled vocabulary), etc. In this case
       the mapper unites multiple values that have the same properties.
       E.g., we create a single creator node for all creators found in
       a collection that have the same name and date of birth.
       Additional (biographical) information is accumulated in the RDF
       node. PROBLEM: sometimes the additional information clarifies
       the relation of the shared resource to a specific work and
       sometimes it provides more information about the resource (e.g.,
       place of birth).

For cases (2) and (3) above, each metadata field has zero or more RDF
nodes that act as value. The principal value is represented by
rdf:value, while the others use the original property name. E.g., the
AHM data contains

==
Record title Title .
Record title.type Type .
==

This is translated into

==
Record title
    [ a ahm:Title ;
      rdf:value "Some title" ;
      ahm:titleType Type
    ] .
==

If the work has multiple titles, each title is represented by a
separate node. Because this step may involve using ordering information
of the initial XML data that is still present in the raw converted RDF
graph, this step must be performed before the data is saved.

---++ Re-establish internal links

This step is generally trivial. Some properties represent links to
other works in the collection. The property value is typically a
literal representing a unique identifier of the target object, such as
the collection identifier or a database key. This step replaces the
predicate-value with an actual link to the target resource.

---++ Re-establish external links

This step re-establishes links to external resources, such as
vocabularies which we know to have been used during the annotation. In
this step we only make mappings of which we are _absolutely_ sure.
I.e., if there is any ambiguity, which is not uncommon, we maintain the
value as a blank node created in step (1).

---++ Create a mapping schema

It is advised to maintain the original property- and type-names
(classes) in the RDF, because this

    1. Allows reasoning about possible subtle differences between the
       source-specific properties and properties that come from generic
       schemas such as Dublin Core. E.g., a creator as listed for a
       work in a museum for architecture is typically an architect, and
       the work in the museum is some form of reproduction of the real
       physical object. If we had replaced the original creator
       property by =|dcterms:creator|=, this information would be lost.
    2. Makes it much easier to relate the RDF to the original
       collection data. One of the advantages of this is that it
       becomes easier to reuse the result of semantic enrichment in the
       original data-source.

This implies that the converted data is normally accompanied by a
schema that lists the properties and types in the data and relates them
using rdfs:subPropertyOf or rdfs:subClassOf to one or more generic
schemas (e.g., Dublin Core). ClioPatria provides a facility to compute
a schema for a graph from the actual data. This schema can be used as a
starting point. To get this schema, open the ClioPatria web-interface,
use *|Places/Graphs|* to locate the graph and choose the option
_|Compute a schema for this graph and *Show* the result as *Turtle*|_.
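Such a mapping schema is itself plain RDF. As a sketch (the =ahm= URI
and the choice of target terms are illustrative), it could relate
source properties and classes to Dublin Core as follows:

==
@prefix ahm:     <http://example.org/ahm/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .

# Relate source-specific terms to a generic schema
ahm:title   rdfs:subPropertyOf dcterms:title .
ahm:creator rdfs:subPropertyOf dcterms:creator .
ahm:Record  rdfs:subClassOf    dcterms:PhysicalResource .
==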
---++ Assign URIs to blank nodes where applicable

Any blank node we may wish to link to from the outside world needs to
be given a real URI. The record-URIs are typically created from the
collection-identifier. For other blank nodes, we look for
distinguishing (short) literals.

---+ Enriching the crude RDF

The obtained RDF is generally rather crude. Typical `flaws' are:

    * It contains literals where it should have references to other RDF
      instances.
    * One probably wants proper resources for many of the blank nodes.
    * Some blank nodes provide no semantic organization and must be
      removed.
    * At other places, intermediate instances must be created (as blank
      nodes or named instances).
    * In addition to the above, some literal fields need to be
      rewritten, sometimes to (multiple) new literals and sometimes to
      a named or bnode instance.

Our rewrite language is a production-rule system, where the syntax is
modelled after CHR (a committed-choice language for constraint
programming) and the triple notation is based on Turtle/SPARQL. There
are three types of production rules:

    * *|Propagation rules|* add triples.
    * *|Simplification rules|* delete triples and add new triples.
    * *|Simpagation rules|* are in between. They match triples, delete
      triples and add triples.

The overall syntax for the three rule-types is (in the order above):

==
<name>? @@ <triple>* ==> <guard>? , <triple>*.
<name>? @@ <triple>* <=> <guard>? , <triple>*.
<name>? @@ <triple>* \ <triple>* <=> <guard>? , <triple>*.
==

Here, <name> is an arbitrary Prolog term and <triple> is a triple in a
Turtle-like, but Prolog native, syntax:

==
{ <subject>, <predicate>, <object> }
==

Any of these fields may contain a variable, written as a Prolog
variable: an uppercase letter followed by zero or more letters, digits
or underscores. E.g., =Hello=, =Hello_world=, =A9=. Resources are
either fully (single-)quoted Prolog atoms (e.g.,
'http://example.com/me'), or terms of the form <prefix>:<local>, where
<prefix> is a defined prefix (see rdf_register_ns/2) and <local> is a
possibly quoted Prolog atom. E.g., =|vra:title|= or =|ulan:'Person'|=
(note the quotes to avoid interpretation as a variable). Literals can
use a more elaborate syntax:

==
<string>^^<type>
<string>@<lang>
literal(Atom)
==

Here, <string> is a double-quoted Prolog string and <type> is a
resource. The form literal(Atom) can be used to match the text of an
otherwise unqualified literal with a variable. I.e.,

==
{ S, vra:title, literal(A) }
==

has the same meaning as the SPARQL expression =|?S vra:title ?A FILTER
isLiteral(?A)|=.

Triples on the _condition_ side can be postfixed using '?', in which
case they are optional matches. If the triple cannot be matched,
triples on the production-side that use the variable are ignored.
Triples in the _condition_ can also be enclosed in a Prolog list
([...]). In this case, the triples are required to be in the *order*
specified. Ordering is not an official part of the RDF specs, but the
SWI-Prolog RDF store maintains the order of triples as generated in the
XML conversion process. An ordered set can match multiple times on a
given subject, where e.g. AB can match both AAABBB and ABABAB. Both
forms appear in real-world XML data.

Finally, on the _production_ side, the _object_ can take this form:

==
bnode([ {<predicate> = <object>} ], [ {
==
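As an illustration of the rule syntax above, the following simpagation
rule re-establishes internal links as described earlier. It is a sketch
only: the property names are invented for the example. The triple
before the =|\|= is kept, the one after it is deleted, and the triple
after =|<=>|= is added.

==
link_related @@
{ Target, ahm:objectNumber, literal(Id) }
\ { S, ahm:relatedObject, literal(Id) }
<=>
{ S, ahm:relatedObject, Target }.
==

Here a simpagation rule is the right choice: a simplification rule
would also delete the =|ahm:objectNumber|= triple of the target work,
which must remain in the graph.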