xmlrdf/commit
Copied files from original xmlrdf
author | Jan Wielemaker | Mon Nov 29 12:44:54 2010 +0100 |
---|---|---|
committer | Jan Wielemaker | Mon Nov 29 12:44:54 2010 +0100 |
commit | 41d0b95bc1499d15cd253cd1bc6a1ecd8b57335a | |
tree | dfc7e3b7b42c39cad27dce9af05275ab9b46455d | |
parent | 1cb1db337eb3f0f288c379592b98ece661311482 | |
diff --git a/lib/xmlrdf/README.txt b/lib/xmlrdf/README.txt new file mode 100644 index 0000000..b9fd0ce --- /dev/null +++ b/lib/xmlrdf/README.txt @@ -0,0 +1,566 @@ +---+ Introduction + +Many datasets are transferred as XML, providing a tree-based datamodel +that is purely syntactic in nature. Semantic processing is standardised +around RDF, which provides a graph-based model. In the transformation +process we must identify syntactic artifacts such as meaningless +ordering in the XML data, lacking structure (e.g., the _creator_ of an +artwork is not a literal string but a person identified by a resource +with properties) and overly structured data (e.g. the dimension of an +object is a property of the object, not of some placeholder that +combines physical properties of the object). These syntactic artifacts +must be translated into a proper semantic model where objects and +properties are typed and semantically related to common vocabularies +such as SKOS and Dublin Core. + +This document describes our toolkit for supporting this transformation +process, together with examples taken from actual translations. The +toolkit is implemented in SWI-Prolog and can be downloaded using GIT +from one of the addresses below. Running the toolkit requires +SWI-Prolog, which can be downloaded from http://www.swi-prolog.org +for Windows, MacOS and Linux or in source for many other platforms. + + * git://eculture.cs.vu.nl/home/git/econnect/xmlrdf.git + * http://eculture.cs.vu.nl/git/econnect/xmlrdf.git + +The graph-rewrite engine is written in Prolog. This document does not +assume any knowledge about Prolog. The rule-language is, as far as +possible, a clean declarative graph-rewrite language. The transformation +process for actual data however can be complicated. For these cases the +rule-system allows mixing rules with arbitrary Prolog code, providing an +unconstrained transformation system.
We provide a (dynamically extended) +library of Prolog routines for typical conversion tasks. + + +---+ Converting XML into RDF + +The core idea behind converting `data-xml' into RDF is that every +complex XML element maps to a resource (often a bnode) and every atomic +attribute maps to an attribute of this bnode. Such a translation gives a +valid RDF document, which is much easier to access for further +processing. + +There are a few places where we must be more subtle in the initial +conversion. First, the XML reserved attributes: + + * The xml:lang attribute is kept around and if we create an RDF + literal, it is used to create a literal in the current language. + + * xmlns declarations are ignored (they make the declarations + available to the application, but the namespaces are already + processed by the XML parser). + +Second, we may wish to map some of our properties into rdfs:XMLLiteral +or RDF dataTypes. In particular the first _must_ be done in the first +pass to avoid all the complexities of turning the RDF back into XML +(think of the above mentioned declarations, but ordering requirements +can make this fundamentally impossible). + +Because this step needs type information about properties, we might as +well allow for some very simple transformations in the first phase. +These transformations are guided by the target RDF schema. The +transformation process can add additional properties to the target RDF +properties and RDF classes. The property is called map:xmlname, where +the =map= prefix is currently defined as http://cs.vu.nl/eculture/map/. +If this property is associated to a class, an XML element with the +defined name is translated into an instance of this class. 
If it is +associated to a property, it affects XML attribute or atomic element +translation in two ways: + + * It uses the RDF property name rather than the XML name for + the property + + * The rdfs:range of the property affects the value translation: + + * If it is rdfs:XMLLiteral, the sub-element is translated + to an RDF XMLLiteral. + + * If it is an XSD datatype, the sub-element is translated + into a typed RDF literal + + * If it is a proper class and the value is a valid URI, the + URI is used as value without translation into a literal. + +Below is an example that maps XML elements =record= into vra:Work +instances and maps the XML attribute =title= into the vra:title +property. Note that it is not required (and not desirable) to add the +=|map:xmlname|= properties to the actual schema files. Instead, put them +in a separate file and load both into the conversion engine. + + == + @prefix vra: <http://www.vraweb.org/vracore/vracore3#> . + @prefix map: <http://cs.vu.nl/eculture/map/> . + + # Map element-names to rdf:type + + vra:Work map:xmlname "record" . + + # Map xml attribute and sub-element names to properties + + vra:title map:xmlname "title" . + == + +---+ Default XML name mapping + +The initial XML to RDF mapper uses the XML attribute and tag-names for +creating RDF properties. It provides two optional processing steps that +make identifiers fit better with the RDF practice. + + 1. It can add a _prefix_ to each XML name to create a fully + qualified URI. E.g., + + == + ?- rdf_current_ns(ahm, Prefix), + load_xml_as_rdf('data.xml', + [ prefix(Prefix) + ]). + == + + 2. It `restyles' XML identifiers. Notably identifiers that contain a + dot (.) are hard to process using Turtle. The library identifies + alphanumerical substrings of the XML name and constructs new + identifiers from these parts. By default, predicates start with + a lowercase letter and each new part starts with an uppercase + letter, as in =oneTwo=.
Types (classes) start with an uppercase + letter, as in =OneTwo=. This behaviour can be controlled with the + options =predicate_style= and =class_style= of load_xml_as_rdf/2. + + +---+ Subsequent meta-data mapping + +Further mapping of meta-data consists of the following steps: + + 1. Fix the node-structure. + 2. Re-establish internal links. + 3. Re-establish external links. + 4. Create a mapping schema that links the classes and predicates + of the source to the target schema (e.g., Dublin Core). + 5. Assign URIs to blank nodes where applicable. + +---++ Fix the node-structure + +Source-data generally uses a record structure. Sometimes, each record is +a simple flat list of properties, while in other cases it has a deeply +nested structure. We distinguish three types of properties: + + 1. Properties with a clear single literal value, such as a + collection-identifier. Such properties are directly mapped to + RDF literals. + + 2. Properties with instance-specific scope that may have + annotations. Typical examples are the title (multiple, translations, + who has given the work a title, etc.) or a dimension (unit, which + dimension, etc.). In this case, we create a new RDF node for each + instance. + + 3. Properties that link to external resources: persons (creator), + material (linking to a controlled vocabulary), etc. In this case + the mapper unites multiple values that have the same properties. + E.g., we create a single creator node for all creators found in + a collection that have the same name and date of birth. + + Additional (bibliographical) information is accumulated in the RDF + node. + + PROBLEM: sometimes the additional information clarifies the relation + of the shared resource to a specific work and sometimes it provides + more information about the resource (e.g. place of birth). + +For cases (2) and (3) above, each metadata field has zero or more RDF +nodes that act as value.
The principal value is represented by +rdf:value, while the others use the original property name. E.g., the +AHM data contains + + == + Record title Title . + Record title.type Type . + == + +This is translated into + + == + Record title [ a ahm:Title ; + rdf:value "Some title" ; + ahm:titleType Type ; + ] . + == + +If the work has multiple titles, each title is represented by a +separate node. + +Because this step may involve using ordering information of the initial +XML data that is still present in the raw converted RDF graph, this step +must be performed before the data is saved. + + +---++ Re-establish internal links + +This step is generally trivial. Some properties represent links to other +works in the collection. The property value is typically a literal +representing a unique identifier to the target object such as the +collection identifier or a database key. This step replaces the +predicate-value with an actual link to the target resource. + + +---++ Re-establish external links + +This step re-establishes links from external resources such as +vocabularies which we know to be used during the annotation. In this +step we only make mappings for which we are _absolutely_ sure. I.e., if +there is any ambiguity, which is not uncommon, we maintain the value as +a blank node created in step (1). + + +---++ Create a mapping schema + +It is advised to maintain the original property- and type-names +(classes) in the RDF because this + + 1. It allows reasoning about possible subtle differences between the + source-specific properties and properties that come from generic + schemas such as Dublin Core. E.g., a creator as listed for a work in + a museum for architecture is typically an architect and the work in + the museum is some form of reproduction of the real physical object. + If we had replaced the original creator property by + =|dcterms:creator|=, this information is lost. + + 2. It makes it much easier to relate the RDF to the original + collection data.
One of the advantages of this is that it becomes + easier to reuse the result of semantic enrichment in the original + data-source. + +The toolkit provides a predicate to derive the initial schema from the +converted data using the predicate make_schema/2: + + * [[make_schema/2]] + +After running this predicate, the schema can be downloaded from the +target graph through the web-interface, or it can be saved using +rdf_save_turtle/2, as in + + == + ?- make_schema(data, schema). + ?- rdf_save_turtle('schema.ttl', [graph(schema)]). + == + + +---++ Assign URIs to blank nodes where applicable. + +Any blank node we may wish to link to from the outside world needs to be +given a real URI. The record-URIs are typically created from the +collection-identifier. For other blank nodes, we look for distinguishing +(short) literals. + + +---+ Enriching the crude RDF + +The obtained RDF is generally rather crude. Typical `flaws' are: + + * It contains literals where it should have references to other RDF + instances + + * One probably wants proper resources for many of the blank nodes. + + * Some blank nodes provide no semantic organization and must be + removed. + + * At other places, intermediate instances must be created (as + blank nodes or named instances). + + * In addition to the above, some literal fields need to be + rewritten, sometimes to (multiple) new literals and sometimes + to a named or bnode instance. + +Our rewrite language is a production-rule system, where the syntax is +modelled after CHR (a committed-choice language for constraint +programming) and the triple notation is based on Turtle/SPARQL. There +are 3 types of production rules: + + * *|Propagation rules|* add triples + + * *|Simplification rules|* delete triples and add new triples. + + * *|Simpagation rules|* are in between. They match triples, delete + triples and add triples. + +The overall syntax for the three rule-types is (in the order above): + + == + <name>? @@ <triple>* ==> <guard>? , <triple>*.
+ <name>? @@ <triple>* <=> <guard>? , <triple>*. + <name>? @@ <triple>* \ <triple>* <=> <guard>? , <triple>*. + == + +Here, <guard> is an arbitrary Prolog term. <triple> is a triple +in a Turtle-like, but Prolog native, syntax: + + == + { <subject> , <predicate> , <object> } + == + +Any of these fields may contain a variable, written as a Prolog +variable: an uppercase letter followed by zero or more letters, digits +or the underscore. E.g., =Hello=, =Hello_world=, =A9=. Resources are +either fully (single-)quoted Prolog atoms (e.g. 'http://example.com/me'), +or terms of the form <prefix> : <local>, where <prefix> is a defined +prefix (see rdf_register_ns/2) and <local> is a possibly quoted Prolog +atom. E.g., =|vra:title|= or =|ulan:'Person'|= (note the quotes to avoid +interpretation as a variable). Literals can use a more elaborate syntax: + + == + <string> ^^ <type> + <string> @ <lang> + <string> + literal(Atom) + == + +Here, <string> is a double-quoted Prolog string and <type> is a +resource. The form literal(Atom) can be used to match the text of an +otherwise unqualified literal with a variable. I.e., + + == + { S, vra:title, literal(A) } + == + +has the same meaning as the SPARQL expression =|?S vra:title ?A FILTER +isLiteral(?A)|=. + +Triples in the _condition_ side can be postfixed using '?', in which +case they are optional matches. If the triple cannot be matched, triples +on the production-side that use the variable are ignored. + +Triples in the _condition_ can also be enclosed in a Prolog list +([...]). In this case, the triples are requested to be in the *order* +specified. Ordering is not an official part of the RDF specs, but the +SWI-Prolog RDF store maintains the order of triples generated in the +XML conversion process. An ordered set can match multiple times on a +given subject, where AB can match both AAABBB and ABABAB. Both forms +appear in real-world XML data.
+ +Finally, on the _production_ side, the _object_ can take this form: + + == + bnode([ {<predicate> = <object>} + ], + [ {<option>} + ]) + == + +This means, `for the object, create a bnode from the given <predicate> = +<object> pairs'. The <option>s guide the process. At this moment, there +is only one option with two values: + + == + share_if(equal) + share_if(equal([<predicate>*])) + == + +Without any option, each execution of the rule creates a new bnode. With +the =share_if= option =equal=, it uses the same bnode-id for all +productions that produce the same predicate-object list (in canonical +order, after removing duplicates). Using the last form, it considers two +blank nodes equal if they have the same triples on the given predicates. +All other predicates are simply added to the blank-node. + + +---++ Renaming resources (or naming blank-nodes) + +The construct =|{X}|= can be used on the condition and action side of a +rule. If used, there must be exactly one such construct on each side: one for the +resource to be deleted and one for the resource to be added. All +resources for which the condition matches are renamed. Below is an +example rule. The first triple extracts the identifier. This triple must +remain in the database. The =|\ {A}|= binds the (blank node) identifier +to be renamed. The two Prolog guards verify that the resource is a blank +node and generate an identifier (URI). The _action_ (_|{S}|_) gives the +rule engine the URI that must be given to the matched =|{A}|=. + + == + work_uris @@ + { A, vra:'idNumber.currentRepository', ID } \ {A} <=> + rdf_is_bnode(A), + literal_to_id(ID, ahm, S), + {S}. + == + + +---++ Putting triples in another graph + +Triples created by the _action_ side of a rule are added to the graph +that is being rewritten.
It is also possible to add them to another +graph using the syntax below: + + == + { S,P,O } >> Graph + == + +E.g., if we want to store the information about person resources that +we create in a graph named =persons=, we can do so using a rule like +this: + + == + person @@ + {S, creator, Name}, + {S, 'creator.date_of_birth', Born} ?, + {S, 'creator.date_of_death', Died} ?, + {S, 'creator.role', Role} ? + <=> + Name \== "onbekend", + name_to_id(Name, ahm, Creator), + { S, vra:creator, Creator }, + { Creator, rdf:type, ulan:'Person' } >> persons, + { Creator, vp:labelPreferred, Name } >> persons, + { Creator, ulan:birthDate, Born } >> persons, + { Creator, ulan:deathDate, Died } >> persons, + { Creator, ulan:role, Role } >> persons. + == + + +---++ Utility predicates + +The rewriting process is often guided by a _guard_ which is, as already +mentioned, an arbitrary Prolog goal. Because translation of repositories +shares a lot of common tasks, we plan to develop a library for these. +This section documents the available predicates. + + * [[find_in_vocabulary/3]] + * [[literal_to_id/3]] + * [[name_to_id/3]] + + +---+ Putting it all together (examples) + +Below we give some rules that we wrote to convert real data. + +---++ Deleting a triple + +Sometimes XML contains data that simply means `nothing'. We want to +delete this data: + + == + {_, creator, "onbekend" } <=> + true. + == + +Now, in the data from which this was extracted, this is a bit too crude +because some records keep data about the creator even though his/her +name is not known. Therefore, we precede the rule with the rule of the +next section. Note that the order of rules matters: a rule is executed +before the next one. In this particular case we could have removed the +=|{S, creator, "onbekend"}|= triple from the example below to make it +match after the rule above is executed.
+ + +---++ Preserving info about unknown creators + +The example below deals with entries in the database where the `creator' +is unknown (Dutch: _onbekend_), but some properties are known about him +or her. The remainder of the condition matches possible information +about this creator using an `optional' match. The _guard_ verifies there +is at least some information about our unknown creator. The _action_ +part of the rule associates a new blank node as a creator. + + == + creator_onbekend @@ + {S, creator, "onbekend"}, + {S, 'creator.date_of_birth', Born} ?, + {S, 'creator.date_of_death', Died} ?, + {S, 'creator.role', Role} ? + <=> + at_least_one_given([Born, Died, Role]), + { S, vra:creator, + bnode([ ulan:birthDate = Born, + ulan:deathDate = Died, + ulan:role = Role + ]) + }. + + at_least_one_given(Values) :- + member(V, Values), + ground(V), !. + == + +---++ Negation + +Negation is only provided as Prolog negation-by-failure in the guard. +This implies that we cannot use the =|{...}|= triple notation to test on +the absence of a triple, but instead we need to use the SWI-Prolog +RDF-DB primitive rdf/3. For example, to delete all person records that +have no name, we can use the rule below. The first triple verifies the +record-type. The second matches all triples on that record and the guard +verifies that the subject has no triples for the property ahm:name. + + == + delete_no_name @@ + { S, rdf:type, ahm:'Person' }, + { S, _, _ } + <=> + \+ rdf(S, ahm:name, _). + == + +---+ Running the toolkit + +Currently, there is no well-defined workflow for running the tools. The +files run.pl and rewrite.pl contain a skeleton that I use to convert the +data from AHM (Amsterdams Historisch Museum). The file run.pl loads +relevant background data and defines run/0 to call the initial +converter. The relevant steps of the initial converter are to load VRA +and mapping.ttl that contains the map:xmlname declarations discussed +above.
Next, we load the XML into crude RDF using the call below. The +options specify that the input is XML without namespaces (dialect =xml= +rather than =xmlns=) and that the file contains XML elements named +=record= as the desired unit of data for conversion. + + == + run(File) :- + load_xml_as_rdf(File, + [ dialect(xml), + unit(record) + ]). + == + +The result can be browsed by typing =|?- triple20.|= + +The file rewrite.pl scripts the rewrite phase. It sets up namespaces, +calls the rewrite predicates with the proper arguments and finally +provides the rules. Here are the toplevel predicates: + + * [[rewrite/0]] + * [[rewrite/1]] + * [[rewrite/2]] + * [[list_rules/0]] + +Below is an example run, showing all available rules and running a +single rule. The example demonstrates that rules are applied until a +fixed-point is reached (i.e., the RDF database does not change by +applying the rules). + + == + ?- [rewrite]. + true. + + ?- list_rules. + Defined RDF mapping rules: + + title_translations + dimension + work_uris + creator_sequence + creator_onbekend + delete_unknown_creator + delete_empty_literal + creator + material_aat + related_object + + true. + + ?- rewrite(delete_empty_literal). + % Applying ... delete_empty_literal (1) + % 0.100 seconds; 23,456 changes; 2,008,860 --> 1,985,404 triples + % Step 1: generation 2,020,746 --> 2,044,202 + % Applying ... delete_empty_literal (1) + % 0.000 seconds; no change + % Step 2: generation 2,044,202 --> 2,044,202 + true. + == diff --git a/lib/xmlrdf/rdf_convert_util.pl b/lib/xmlrdf/rdf_convert_util.pl new file mode 100644 index 0000000..821b5cc --- /dev/null +++ b/lib/xmlrdf/rdf_convert_util.pl @@ -0,0 +1,130 @@ +:- module(rdf_convert_util, + [ rdf_literal/1, % @Term + literal_to_id/3, % +Literal, +NameSpace, -Id + name_to_id/3, % +Literal, +NameSpace, -Id + edm_identifier/4 % +URI, +Orig, +New, -NewURI + ]). +:- use_module(library(semweb/rdf_db)). +:- use_module(library(apply)). + + +%% rdf_literal(Term) is semidet.
+% +% True if Term is an RDF literal + +rdf_literal(Term) :- + compound(Term), + Term = literal(_). + +%% name_to_id(+Literal, +NS, -ID) +% +% Similar to literal_to_id/3, but intended to deal with person +% names. +% +% @tbd Now simply the same as literal_to_id/3 + +name_to_id(Literal, NS, ID) :- + literal_to_id(Literal, NS, ID). + +%% literal_to_id(+LiteralOrList, +NS, -ID) is det. +% +% Generate an identifier from a literal by mapping all characters +% that are not allowed in a (Turtle) identifier to _. +% LiteralOrList can be a list. In this case we generate an id for +% each element in LiteralOrList and append these. A typical usage +% scenario is to add a type: +% +% == +% literal_to_id(['book-', Literal], NS, ID) +% == +% +% Another is to add the label of the parent: +% +% == +% literal_to_id([ParentLit, '-', Literal], NS, ID) +% == +% +% @tbd Verify that the generated URI is unique! +% @tbd Remove diacritics for non-iso-latin-1 text + +literal_to_id(Literals, NS, URI) :- + is_list(Literals), !, + maplist(literal_to_id, Literals, IDs), + atomic_list_concat(IDs, ID), + rdf_current_ns(NS, Prefix), + atom_concat(Prefix, ID, URI). +literal_to_id(Literal, NS, URI) :- + literal_to_id(Literal, ID), + rdf_current_ns(NS, Prefix), + atom_concat(Prefix, ID, URI). + +literal_to_id(Literal, ID) :- + text_of_literal(Literal, Text), + text_to_id(Text, ID). + +text_of_literal(Var, _) :- + var(Var), !, + instantiation_error(Var). +text_of_literal(literal(Lit), Text) :- !, + text_of_literal(Lit, Text). +text_of_literal(type(_, Text), Text). +text_of_literal(lang(_, Text), Text). +text_of_literal(Text, Text) :- + atomic(Text). + +text_to_id(Text, Id) :- + unaccent_atom(Text, T1), + atom_codes(T1, Codes), + maplist(map_non_id_char, Codes, Codes1), + normalize_underscores(Codes1, Codes2), + atom_codes(Id, Codes2). + +map_non_id_char(0'_, 0'_) :- !. +map_non_id_char(0'-, 0'-) :- !. +map_non_id_char(C, C) :- + code_type(C, csym), !. +map_non_id_char(_, 0'_). 
+ +normalize_underscores([0'_|T0], T) :- !, + normalize_underscores(T0, T). +normalize_underscores([], [0'_]) :- !. +normalize_underscores(In, Out) :- + normalize_underscores_2(In, Out). + +normalize_underscores_2([], []). +normalize_underscores_2([0'_|T0], Can) :- !, + normalize_underscores(T0, T), + ( T == [0'_] + -> Can = [] + ; Can = [0'_|T] + ). +normalize_underscores_2([H|T0], [H|T]) :- + normalize_underscores_2(T0, T). + + +%% edm_identifier(+URI, +Orig, +New, -NewURI) +% +% Translate between the various EDM identifiers. E.g.: +% +% == +% edm_identifier(Proxy, proxy, aggregate, Aggregate) +% == +% +% @error domain_error(edm_uri, URI) if the URI doesn't contain +% =|/<orig>-|= or contains it multiple times. + +edm_identifier(URI, Orig, New, NewURI) :- + subst_pattern(Orig, OP), + ( sub_atom(URI, B, _, A, OP), + sub_atom(URI, _, A, 0, End), + sub_atom(URI, 0, B, _, Start), + \+ sub_atom(End, _, _, _, OP), + \+ sub_atom(Start, _, _, _, OP) + -> subst_pattern(New, NP), + atomic_list_concat([Start, NP, End], NewURI) + ; domain_error(edm_uri, URI) + ). + +subst_pattern(Text, Pattern) :- + atomic_list_concat([/, Text, -], Pattern). + diff --git a/lib/xmlrdf/rdf_name_bnodes.pl b/lib/xmlrdf/rdf_name_bnodes.pl new file mode 100644 index 0000000..527f3ce --- /dev/null +++ b/lib/xmlrdf/rdf_name_bnodes.pl @@ -0,0 +1,199 @@ +/* File: rdf_name_bnodes.pl + Author: Jan Wielemaker + Created: Jan 14 2010 + Purpose: Create URIs to blank nodes +*/ + +:- module(rdf_name_bnodes, + [ name_bnodes/3, % +Set, -Names, +Graph + name_instances/4 % +Class, -P, -Pairs, +Graph + ]). +:- use_module(library(semweb/rdf_db)). +:- use_module(library(pairs)). +:- use_module(library(apply)). +:- use_module(library(debug)). + +/** <module> Establish a URI for a set of RDF blank-nodes + +This library can propose and implement a naming schema for a set of RDF +blank-nodes.
+ +Naming schemas: + + * Create from a unique property + In this case, give preference to word-like properties over + database keys. Make the property-literal turtle friendly. + We call this key <ID>. Use <base><ID> as identifier. + + * Create from a semi-unique property + property of parent. + If there is a property that is nearly unique and the nodes + are organised in a hierarchy, use the label of the parent + (recursively) to make the names unique. Use + <base><parent-ID>-<ID> + + * If the set can be split into multiple classes, each of + which can have unique names attached based on one of the + other schemas, use <base><class-ID>-<ID> + + * If the blank-nodes are used exactly once as a property + of another resource, use <other>-<property-ID>. If the property + appears multiple times, try <other>-<property-ID>-<ID> or, + if all fails, <other>-<property-ID>-<N> + +Steps: + + * Find shared (literal) properties + * Split-by-class + +@see literal_to_id/3 for generating Turtle-friendly identifiers. +*/ + +:- rdf_meta + name_instances(r, r, -, r). + +%% name_instances(+Class, ?P, -Pairs, +Graph) + +name_instances(Class, P, Pairs, Graph) :- + findall(R, rdf(R, rdf:type, Class), Rs), + sort(Rs, Set), + name_bnodes(Set, P, Names, Graph), + pairs_keys_values(Pairs, Set, Names). + + +%% name_bnodes(+Set, -Names, ?Graph) + +name_bnodes(Set, Names, Graph) :- + name_bnodes(Set, _, Names, Graph). + +name_bnodes(Set, P, Names, Graph) :- + length(Set, RCount), + shared_property(Set, P, Graph), + debug(name_bnodes, 'Trying property ~q', [P]), + maplist(local_name(Graph, P), Set, Names), + sort(Names, Sorted), + length(Sorted, NCount), + ( RCount == NCount + -> true + ; NU is RCount - NCount, + debug(name_bnodes, '~D of ~D non-unique', [NU, RCount]), + fail + ). + +%% local_name(+Graph, +P, +R, -Name) is nondet. +% +% Propose a local name for R based on P.
+% +% @tbd Add 'n' if the results starts with a digit + +local_name(Graph, P, R, Name) :- + findall(T, property_text(R, P, T, Graph), Ts), + Ts \== [], !, + maplist(text_to_id, Ts, IDL), + sort(IDL, SIDL), + atomic_list_concat(SIDL, -, Name). + + +%% property_text(+R, +P, -Text, ?Graph) is nondet. +% +% Fetch a textual value for the property P. + +property_text(R, P, Text, Graph) :- + rdf(R, P, Value, Graph), + text_of(Value, Text). + +text_of(literal(X), Text) :- !, + text_of_literal(X, Text). +text_of(R, Text) :- + rdf_is_bnode(R), + rdf_has(R, rdf:value, V), + text_of(V, Text). + +text_of_literal(Text, Text) :- + atom(Text), !. +text_of_literal(lang(_, Text), Text). +text_of_literal(type(_, Text), Text). + +%% shared_property(+Set, -P, +Graph) is nondet. +% +% True if P is a property that appears on all instances of Set. +% +% @tbd Should we also allow for super-properties? + +shared_property(Set, P, Graph) :- + map_list_to_pairs(property_count(Graph), Set, Keyed), + keysort(Keyed, KeySorted), + pairs_values(KeySorted, [H|T]), + property_of(P, Graph, H), + ( maplist(property_of(P, Graph), T) + -> true + ). + +%% property_count(+Graph, +R, -Count) is det. +% +% Count is the number of distinct properties on the resource R. + +property_count(Graph, R, Count) :- + findall(P, rdf(R, P, _, Graph), Ps), + sort(Ps, Set), + length(Set, Count). + +%% property_of(?P, +Graph, +Resource) is nondet. +% +% True if P is a property on Resource in Graph. + +property_of(P, Graph, R) :- + atom(P), !, + ( rdf(R, P, _, Graph) + -> true + ). +property_of(P, Graph, R) :- + findall(P, rdf(R, P, _, Graph), Ps), + sort(Ps, Set), + member(P, Set). + + + /******************************* + * UTIL * + *******************************/ + +:- dynamic + text_id_cache/2. + +text_to_id(Text, Id) :- + ( text_id_cache(Text, Id0) + -> Id = Id0 + ; text_to_id_raw(Text, Id0) + -> assertz(text_id_cache(Text, Id0)), + Id = Id0 + ; debug(name_bnodes, 'No id from ~q', [Text]), + fail + ). 
+ +text_to_id_raw(Text, Id) :- + unaccent_atom(Text, T1), + atom_codes(T1, Codes), + maplist(map_non_id_char, Codes, Codes1), + normalize_underscores(Codes1, Codes2), + atom_codes(Id, Codes2). + +map_non_id_char(0'_, 0'_) :- !. +map_non_id_char(0'-, 0'-) :- !. +map_non_id_char(C, C) :- + code_type(C, csym), !. +map_non_id_char(_, 0'_). + +normalize_underscores([0'_|T0], T) :- !, + normalize_underscores(T0, T). +normalize_underscores([], [0'_]) :- !. +normalize_underscores(In, Out) :- + normalize_underscores_2(In, Out). + +normalize_underscores_2([], []). +normalize_underscores_2([0'_|T0], Can) :- !, + normalize_underscores(T0, T), + ( T == [0'_] + -> Can = [] + ; Can = [0'_|T] + ). +normalize_underscores_2([H|T0], [H|T]) :- + normalize_underscores_2(T0, T). diff --git a/lib/xmlrdf/rdf_rename.pl b/lib/xmlrdf/rdf_rename.pl new file mode 100644 index 0000000..537a1a3 --- /dev/null +++ b/lib/xmlrdf/rdf_rename.pl @@ -0,0 +1,22 @@ +:- module(rdf_rename, + [ rdf_rename/3 % +Old, +New, ?Graph + ]). +:- use_module(library(semweb/rdf_db)). + +%% rdf_rename(+OldResource, +NewResource, ?Graph) is det. +% +% Rename a resource, changing all references in all three +% positions of the triple. If Graph is given, renaming is limited +% to triples that are associated to the matching graph. + +rdf_rename(Old, Old, _) :- !. +rdf_rename(Old, New, Graph) :- + rdf_transaction(rename(Old, New, Graph), rename(Old, New)). + +rename(Old, New, G) :- + forall(rdf(Old, P, O, G), + rdf_update(Old, P, O, G, subject(New))), + forall(rdf(S, Old, O, G), + rdf_update(S, Old, O, G, predicate(New))), + forall(rdf(S, P, Old, G), + rdf_update(S, P, Old, G, object(New))).
diff --git a/lib/xmlrdf/rdf_rewrite.pl b/lib/xmlrdf/rdf_rewrite.pl new file mode 100644 index 0000000..ca236d1 --- /dev/null +++ b/lib/xmlrdf/rdf_rewrite.pl @@ -0,0 +1,926 @@ +:- module(rdf_rewrite, + [ op(1200, xfx, (@@)), % Name @@ Rule + op(1180, xfx, ==>), % Head ==> Body + op(1180, xfx, <=>), % Head <=> Body + op(1100, xfx, \), % Head \ Del <=> Body + op(200, fx, ^), % ^Predicate + op(700, xfx, ^^), % Text^^Type + op(700, xfx, @), % Text@Lang + op(200, xf, ?), % Triple ? + op(200, xfx, >>), % Triple >> Graph + % Toplevel converter + rdf_rewrite_rules/0, + rdf_list_rule/1, % +Name + rdf_rewrite/2, % +Graph, Rule + rdf_rewrite/1, % +Graph + % Runtime support + subject_triple_sequence/3, % +Pattern, -Data, +Graph + rdf_assert_new/4, % +S,+P,+O,+Graph + rdf_retract_if_ground/4, % +S,+P,+O,+Graph + rdf_assert_if_ground/4, % +S,+P,+O,+Graph + rdf_set_lang/3, % +LitIn, +Lang, -LitOut + rdf_set_type/3, % +LitIn, +Type, -LitOut + % Re-exporting + rdf_rename/3 % +Old,+New,?Graph + ]). +:- use_module(library(semweb/rdf_db)). +:- use_module(library(semweb/rdfs)). +:- use_module(library(error)). +:- use_module(library(apply)). +:- use_module(library(lists)). +:- use_module(library(debug)). +:- use_module(library(uri)). +:- use_module(library(option)). +:- use_module(library(pairs)). +:- use_module(rdf_rename). + + +/** <module> A generic RDF rewrite engine + +Triple notation: + + {Subject, Predicate, Object} + +Object is one of: + + * URI, written as ns:local or 'full URI' + * "literal" + * "literal"^^URI + * "literal"@lang + * Variable + +Rename URIs: {X} is a shorthand for {X,P,O},{S,X,O},{S,P,X} + +{S, ^vra:idNumber, ID} \ {S} <=> + a vra:'Work', + make_identifier(ID, URI), + {URI}. 
+ +Map into + +rdf_mapping_rule(Id, Name, Graph, Actions, Options) :- + Code + +Where Actions is a conjunction of the following statements: + + * rdf_retractall(S,P,O,Graph) + * rdf_retract_if_ground(S,P,O,Graph) + * rdf_assert_new(S,P,O,Graph) + * rdf_assert_if_ground(S,P,O,Graph) +*/ + + /******************************* + * THE REWRITER * + *******************************/ + +:- meta_predicate + rdf_rewrite(:), + rdf_rewrite(:, +). + +rdf_rewrite(Graph) :- + rdf_rewrite(Graph, _). + +rdf_rewrite(Module:Graph, Rule) :- + rdf_generation(G0), + rewrite_step(Module, Graph, Rule), + rdf_generation(G1), + debug(rdf_rewrite, 'Rewrite: generation ~D --> ~D', + [G0,G1]), + G0 \== G1, !. + +rewrite_step(Module, Graph, Rule) :- + ( mapping_rules(Module, Rules), + \+ \+ member(Rule-_, Rules) + -> true + ; existence_error(rule, Rule) + ), + ( member(Rule-Pairs, Rules), + nth1(I, Pairs, Id-Options), + debug(rdf_rewrite, 'Applying ... ~q (~d)', [Rule, I]), + rdf_generation(G0), + rdf_statistics(triples(TC0)), + statistics(cputime, T0), + ( option(transaction(true), Options, true) + -> rdf_transaction(call_rewrite_rule(Id, Module, Graph, Options), + Rule) + ; call_rewrite_rule(Id, Module, Graph, Options) + ), + statistics(cputime, T1), + rdf_statistics(triples(TC1)), + rdf_generation(G1), + T is T1 - T0, + GDiff is G1-G0, + ( GDiff == 0 + -> debug(rdf_rewrite, '~3f seconds; no change', [T]) + ; debug(rdf_rewrite, '~3f seconds; ~D changes; ~D --> ~D triples', + [T, GDiff, TC0, TC1]) + ), + fail + ; true + ). + +call_rewrite_rule(Rule, Module, Graph, Options) :- + bnode_terms(Options, BNodes, BNodeOptions, _RestOptions), + BNodes \== [], !, + Template =.. [v,Actions|BNodes], + findall(Template, + Module:rdf_mapping_rule(Rule, _Name, Graph, Actions, Options), + Bag), + create_bnodes(BNodeOptions, 2, Bag, Graph), + call_actions(Bag). 
+call_rewrite_rule(Rule, Module, Graph, Options) :- + findall(Actions, + Module:rdf_mapping_rule(Rule, _Name, Graph, Actions, Options), + Goals), + maplist(call, Goals). + +%% rdf_rewrite_rules +% +% List available rules + +rdf_rewrite_rules :- + format('Defined RDF mapping rules:~n~n', []), + ( mapping_rules(_, Rules), + forall(append(Seen, [Rule-Ids|_], Rules), + list_rule(Rule, Ids, Seen)), + fail + ; true + ), + format('~n', []). + +list_rule(Rule, [_Id], Seen) :- + memberchk(Rule-_, Seen), !, + format('\t~q ~t~40|(DISCONTIGUOUS)~n', [Rule]). +list_rule(Rule, [_Id], _) :- !, + format('\t~q~n', [Rule]). +list_rule(Rule, Ids, Seen) :- + memberchk(Rule-_, Seen), !, + length(Ids, Len), + format('\t~q ~t~40|(~d rules, DISCONTIGUOUS)~n', [Rule, Len]). +list_rule(Rule, Ids, _) :- + length(Ids, Len), + format('\t~q ~t~40|(~d rules)~n', [Rule, Len]). + + +%% mapping_rules(?Module, -Rules) is nondet. +% +% @param Rules is a list Name-IdOptionPairs + +mapping_rules(Module, Rules) :- + current_module(Module), + current_predicate(Module:rdf_mapping_rule/5), + findall(Name-(Id-Options), + clause(Module:rdf_mapping_rule(Id, Name, _, _, Options), _), + Pairs), + group_pairs_by_key(Pairs, Rules). + + +%% rdf_list_rule(+Name) is det. +% +% Produce a listing of the generated Prolog for the named rule. + +:- meta_predicate + rdf_list_rule(:). + +rdf_list_rule(M:Name) :- + ( ( M == user + ; M == rdf_rewrite + ) + -> true + ; Module = M + ), + ( current_module(Module), + current_predicate(Module:rdf_mapping_rule/5), + Head = Module:rdf_mapping_rule(_, Name, _, _, _), + forall(clause(Head, Body), + portray_clause((Head :- Body))), + fail + ; true + ). + + + /******************************* + * BNODE MAGIC * + *******************************/ + +%% bnode_terms(+RuleOptions, -BNodeTemplates, -BNodeOptions, -RestOpts) +% +% Split the option-list + +bnode_terms([], [], [], []). 
+bnode_terms([bnode(BN, Props, Options)|T0], + [bnode(BN, Props)|BNT], + [Options|OT], + Rest) :- + bnode_terms(T0, BNT, OT, Rest). +bnode_terms([H|T0], BN, O, [H|T]) :- + bnode_terms(T0, BN, O, T). + + +%% create_bnodes(+BNOptions, +Index, +Bag, +Graph) +% + +create_bnodes([], _, _, _). +create_bnodes([BNOptions|OT], I, Bag, Graph) :- + create_bnodes_arg(Bag, I, BNOptions, Graph), + I2 is I + 1, + create_bnodes(OT, I2, Bag, Graph). + +%% create_bnodes(+BNTerms, +Options, +Graph) +% +% Share blank nodes. +% +% @param BNTerms is a list bnode(Id, Properties) +% @param Options describes the sharing. Currently supports +% +% * equal +% All properties must be equal +% * equal(ListOfProperties) +% Only the indicated properties must be equal + +create_bnodes_arg(BNTerms, I, Options, Graph) :- + option(share_if(Share), Options, equal), + key_bnodes(BNTerms, Share, I, KTerms), + group_pairs_by_key(KTerms, Grouped), + maplist(make_bnode(Graph), Grouped). + +make_bnode(Graph, _Key-[bnode(BN, P0)|BNodes]) :- + merge_bnodes(BNodes, BN, P0, Properties), + rdf_bnode(BN), + forall(member(P=O, Properties), + rdf_assert_if_ground(BN, P, O, Graph)). + +merge_bnodes([], _, PL, PL). +merge_bnodes([bnode(BN, P2)|T], BN, PL0, PL) :- + union(PL0, P2, PL1), + merge_bnodes(T, BN, PL1, PL). + + +key_bnodes([], _, _, []). +key_bnodes([Templ|T0], Share, I, [Keyed|T]) :- + key_bnode(Share, I, Templ, Keyed), + key_bnodes(T0, Share, I, T). + + +key_bnode(Equal, I, Template, Key-BNode) :- !, + arg(I, Template, BNode), + arg(2, BNode, Properties), + ( Equal == equal + -> sort(Properties, Key) + ; Equal = equal(L) + -> maplist(pvalues(Properties), L, Key) + ; domain_error(share_if, Equal) + ). + +pvalues([], _, []). +pvalues([P=V|T0], P, [V|T]) :- + ground(V), !, + pvalues(T0, P, T). +pvalues([_|T0], P, T) :- + pvalues(T0, P, T). + + +%% call_actions(+Templates) +% +% Call the actions associated with each template-instantiation of +% the findall. + +call_actions([]). 
+call_actions([Template|T]) :- + arg(1, Template, Actions), + Actions, + call_actions(T). + + + /******************************* + * TERM-EXPANSION * + *******************************/ + +%% expand_rule(+Rule, -Clause) is det. +% +% Expand the rule-language into proper Prolog rules. Rules are +% clauses for rdf_mapping_rule/5. + +expand_rule(Name@@Rule, Clause) :- + rule_id(Id), + expand_rule(Rule, Name, Id, Clause). +expand_rule(Rule, Clause) :- + rule_term(Rule), !, + ( rule_id(Id), + expand_rule(Rule, Id, Id, Clause) + -> true + ; print_message(warning, illegal_rdf_rule) + ). +expand_rule(Term0, Term) :- + expand_rdf(Term0, Term), + Term0 \== Term. + + +expand_rule((Keep \ Delete <=> Body), Name, Id, + (rdf_mapping_rule(Id, Name, Graph, Actions, Options) :- + Rule)) :- !, + expand_body(Body, Guard, Add, Options0), + actions(Graph, Delete, Add, Actions), + ( Actions = rdf_rename(_,_,_), + Options0 == [] + -> Options = [transaction(false)] + ; Options = Options0 + ), + rule_body(Graph, Keep, Delete, Guard, Rule). +expand_rule((Delete <=> Body), Name, Id, + (rdf_mapping_rule(Id, Name, Graph, Actions, Options) :- + Rule)) :- !, + expand_body(Body, Guard, Add, Options), + actions(Graph, Delete, Add, Actions), + rule_body(Graph, true, Delete, Guard, Rule). +expand_rule((Keep ==> Body), Name, Id, + (rdf_mapping_rule(Id, Name, Graph, Actions, Options) :- + Rule)) :- !, + expand_body(Body, Guard, Add, Options), + actions(Graph, true, Add, Actions), + rule_body(Graph, Keep, true, Guard, Rule). + +rule_term(_<=>_). +rule_term(_==>_). + +%% rule_id(-Id) +% +% Give an identifier to the rule. Currently we use the +% source-location. We probably need some way to name the rule, but +% a good syntax is hard. (Name @ Rule) as used by CHR is not +% possible because @ is already used for language-tagged literals. + +rule_id(Id) :- + source_location(File, Line), + uri_file_name(URI, File), + atomic_list_concat([URI, #, Line], Id). 
+ +%% expand_body(+Body, -Guard, -Add, -Options) is det. +% +% Split the body into two goal-lists; one describing the +% guard and one adding data. + +expand_body(Body, Guard, Add, Options) :- + comma_list(Body, Members), + partition(is_triple, Members, Add0, Guard0), + phrase(expand_add(Add0, MoreGuard, Options), Add), + append(Guard0, MoreGuard, Guard1), + expand_rdf(Guard1, Guard). + +is_triple(V) :- + var(V), !, fail. +is_triple({}(_)). +is_triple({}(_)>>_). + +%% expand_add(+In, -Goal, -Options)// +% +% Expand object-lists in triples to generate a blank-node + +expand_add([], [], []) --> []. +expand_add([Triple|T0], [rdf_bnode(BN)|Goals], Options) --> + { triple(Triple, S,P,O,G), + nonvar(O), + O = bnode(Properties), ! + }, + g_triple(G,S,P,BN), + bnode_triples(Properties, BN), + expand_add(T0, Goals, Options). +expand_add([Triple|T0], Goals, + [bnode(BN, Properties, BNOptions)|Options]) --> + { triple(Triple, S,P,O,G), + nonvar(O), + O = bnode(Properties0, BNOptions0), !, + expand_rdf(Properties0, Properties), + expand_rdf(BNOptions0, BNOptions) + }, + g_triple(G,S,P,BN), + expand_add(T0, Goals, Options). +expand_add([Triple|T0], [Fix|Goals], Options) --> + { triple(Triple, S,P,O,G), + nonvar(O), + ( O = (Var@Lan) + -> Fix = rdf_set_lang(Var,Lan,O2) + ; O = (Var^^Type) + -> Fix = rdf_set_type(Var,Type,O2) + ) + }, !, + g_triple(G,S,P,O2), + expand_add(T0, Goals, Options). +expand_add([X|T0], Goals, Options) --> + [X], + expand_add(T0, Goals, Options). + +g_triple(-, S,P,O) --> !, + [ {S,P,O} ]. +g_triple(G, S,P,O) --> + [ {S,P,O} >> G ]. + +bnode_triples([], _) --> []. +bnode_triples([P=O|T], S) --> + [ {S,P,O} ], + bnode_triples(T, S). + + +%% actions(+Graph, +Delete, +AddList, -Actions) is det. +% +% Create an action-goal from the list of RDF objects to add and +% delete. 
+ +actions(Graph, Delete, AddList, Actions) :- + comma_list(Delete, DelList0), + flatten(DelList0, DelList), % Deal with sequences + join_actions(Graph, DelList, AddList, ActionList), + comma_list(Actions, ActionList). + +join_actions(Graph, DelList, AddList, Actions) :- + select(Del, DelList, RDel), + single_resource(Del, R0), + select(Add, AddList, RAdd), + single_resource(Add, R1), !, + no_more_single_updates(RDel), + no_more_single_updates(RAdd), + Actions = [rdf_rename(R0, R1, Graph)|RActions], + join_actions(Graph, RDel, RAdd, RActions). +join_actions(Graph, DelList, AddList, Actions) :- + delete_actions(DelList, Graph, CondVars, Actions, AddActions), + add_actions(AddList, Graph, CondVars, AddActions, []). + +single_resource({R}, R) :- + var(R), !. +single_resource({R}, R) :- + nonvar(R), + R \= (_,_). + +no_more_single_updates(List) :- + member(X, List), + single_resource(X, _), !, + representation_error(multiple_single_resources). +no_more_single_updates(_). + + +delete_actions([], _, [], L, L). +delete_actions([X|_], _, _, _, _) :- + var(X), !, + instantiation_error(X). +delete_actions([Triple?|T0], Graph, + CondVars, + [rdf_retract_if_ground(S,P,O,Graph)|T], L) :- !, + expanded_triple(Triple, S,P,O), + term_variables(Triple, CondVars, CVTail), + delete_actions(T0, Graph, CVTail, T, L). +delete_actions([Triple|T0], Graph, CondVars, + [rdf_retractall(S,P,O,Graph)|T], L) :- + expanded_triple(Triple, S,P,O), + delete_actions(T0, Graph, CondVars, T, L). + + +%% add_actions(+Triples, +Graph, +CondVars, +Actions, ?ActionTail) +% +% @tbd conditional-variable computation in ==> rules is missing, +% which is why disabled this and always use +% rdf_assert_if_ground/4 for now. This works fine, but +% makes it harder to track bugs in rules. + +add_actions([], _, _, L, L). +add_actions([Triple>>Graph|T0], Graph0, CV, Actions, L) :- !, + add_actions([Triple], Graph, CV, Actions, Tail), + add_actions(T0, Graph0, CV, Tail, L). 
+add_actions([Triple|T0], Graph, CV, [Action|T], L) :- + Action = rdf_assert_if_ground(S,P,O, Graph), + expanded_triple(Triple, S,P,O), + add_actions(T0, Graph, CV, T, L). + + +%% rule_body(+Graph, +Keep, +Delete, +Guard:list, -Rule) +% +% Construct the actual body for our mapping rule. +% +% @tbd: Use the RDF query optimizer to finish the job. + +rule_body(Graph, Keep, Delete, GuardList, Rule) :- + comma_list(Guard, GuardList), + make_goal((Keep, Delete, Guard), Graph, Rule0), + expand_goal(Rule0, Rule). + +make_goal(G, _, G) :- + var(G), !. +make_goal((A0,B0), Graph, (A,B)) :- !, + make_goal(A0, Graph, A), + make_goal(B0, Graph, B). +make_goal((A0;B0), Graph, (A;B)) :- !, + make_goal(A0, Graph, A), + make_goal(B0, Graph, B). +make_goal((A0->B0), Graph, (A->B)) :- !, + make_goal(A0, Graph, A), + make_goal(B0, Graph, B). +make_goal(List, Graph, Goal) :- + is_list(List), !, + ( same_subject_triples(List, Subject, Pairs) + -> Goal = subject_triple_sequence(Subject, Pairs, Graph) + ; type_error(same_subject_triples, List) + ). +make_goal(X, Graph, Goal) :- + expanded_triple(X, S,P,O), !, + make_rdf_goal(S,P,O, Graph, Goal). +make_goal(T, _, true) :- + single_resource(T, _), !. +make_goal(X?, Graph, (G*->true;true)) :- !, + make_goal(X, Graph, G). +make_goal(G, _, G). + +make_rdf_goal(S,SP,O, _, Goal) :- + nonvar(SP), + SP = ^P, !, + Goal = rdf_has(S, P, O). +make_rdf_goal(S,P,O, _, Goal) :- + Goal = rdf(S, P, O). + + +%% same_subject_triples(+List, -Subject, -PredObjPairs) is semidet. +% +% Matches [{S,P,O}, {S,P2,O2}, ...] + +same_subject_triples([H|T0], S, [P-O|T]) :- + expanded_triple(H, S,P,O), + same_subject_triples_2(T0, S, T). + +same_subject_triples_2([], _, []). +same_subject_triples_2([H|T0], S, [P-O|T]) :- + expanded_triple(H, S1,P,O), + S1 == S, + same_subject_triples_2(T0, S, T). + + +%% expanded_triple(+Term, -S,-P,-O) is semidet. +% +% As triple/4, expanding the 3 arguments. 
+ +expanded_triple(Triple, S,P,O) :- + triple(Triple, S0, P0, O0), + expand_resource(S0, S), + expand_predicate(P0, P), + expand_object(O0, O). + +expand_predicate(P, P) :- + var(P), !. +expand_predicate(^P0, ^P) :- + expand_resource(P0, P). +expand_predicate(P0, P) :- + expand_resource(P0, P). + + +%% triple(+Term, -S,-P,-O) is semidet. +% +% True if Term is of the form {S,P,O}. Note that all {...} terms +% are mapped to the canonical Prolog term {}(Arg), so we must be +% careful when matching. In particular, {_} = {_,_,_}!. +% Alternatively, we could use subsumes/2 to do the checking +% properly. + +triple({}(X), S,P,O) :- + nonvar(X), + X = (S,X2), + nonvar(X2), + X2 = (P,O). + +triple(T>>G, S,P,O,G) :- !, + triple(T,S,P,O). +triple(T, S,P,O,-) :- !, + triple(T,S,P,O). + + +%% expand_resource(+In, -Out) is det. + +expand_resource(X, X) :- + var(X), !. +expand_resource(X, X) :- + atom(X), !. +expand_resource(NS:Local, Global) :- + must_be(atom, NS), + must_be(atom, Local), + ( rdf_current_ns(NS, Full) + -> atom_concat(Full, Local, Global) + ; existence_error(namespace, NS) + ). + +expand_object(O, O) :- + var(O), !. +expand_object(literal(X), literal(X)) :- !. +expand_object("", literal('')) :- !. +expand_object(O, O) :- + atom(O), !. +expand_object(NS:Local, O) :- !, + expand_resource(NS:Local, O). +expand_object(String^^R, literal(type(Type, Text))) :- !, + to_literal_text(String, Text), + expand_resource(R, Type). +expand_object(String@Lang, literal(lang(Lang, Text))) :- !, + to_literal_text(String, Text). +expand_object(String, literal(Text)) :- + to_literal_text(String, Text). + +to_literal_text(Var, Var) :- + var(Var), !. +to_literal_text(String, Text) :- + string(String), !, + atom_concat(String, '', Text). +to_literal_text(String, Text) :- + atom_codes(Text, String). + +%% comma_list(+Conjunction, -List) is det. +%% comma_list(-Conjunction, +List) is det. +% +% Translate between a Prolog conjunction and a list. 
Elements +% =true= are removed from both translations. The empty list is +% mapped to a single =true=. + +comma_list(Conj, List) :- + is_list(List), + list_comma(List, Conj). +comma_list(Conj, List) :- + phrase(comma_list(Conj), List). + +comma_list(A) --> {var(A)}, !, [A]. +comma_list((A,B)) --> !, comma_list(A), comma_list(B). +comma_list(true) --> !, []. +comma_list(A) --> [A]. + +list_comma([], true). +list_comma([H|T], C) :- + ( T == [] + -> C = H + ; H == true + -> list_comma(T, C) + ; C = (H,B), + list_comma(T, B) + ). + + + /******************************* + * GENERAL CLAUSES * + *******************************/ + +%% expand_rdf(+Term0, -Term) is det. +% +% Expand our symbolic representation to RDF structures anywhere in +% the code. This is somewhat dubious because mapping "..." into a +% literal can be ambiguous and so is mapping ns:local for RDF +% namespaces. +% +% @tbd Should we introduce `Term to quote terms? + +expand_rdf(Term0, Term) :- + compound(Term0), !, + ( expand_to_literal(Term0, Term) + -> true + ; Term0 = NS:Local, + atom(NS), atom(Local) + -> expand_resource(NS:Local, Term) + ; Term0 =.. [F|Args0], + maplist(expand_rdf, Args0, Args), + Term =.. [F|Args] + ). +expand_rdf(Term, Term). + +expand_to_literal(Text^^Type, literal(type(URI, Value))) :- !, + to_literal_text(Text, Value), + expand_resource(Type, URI). +expand_to_literal(Text@Lang, literal(lang(Lang, Value))) :- !, + to_literal_text(Text, Value). +expand_to_literal(Text, literal(Value)) :- + is_list(Text), + maplist(sensible_char, Text), + atom_codes(Value, Text). + +sensible_char(C) :- + integer(C), + sensible_char_2(C). + +sensible_char_2(C) :- + between(32, 127, C), !. +sensible_char_2(0'\t). +sensible_char_2(0'\n). + + + /******************************* + * RUNTIME * + *******************************/ + +%% rdf_assert_new(+S,+P,+O,+Graph) is det. + +rdf_assert_new(S,P,O,Graph) :- + rdf(S,P,O,Graph), !. +rdf_assert_new(S,P,O,Graph) :- + rdf_assert(S,P,O,Graph). 
+ +%% rdf_assert_if_ground(+S,+P,+O,+Graph) is det. +% +% Assert if the goal is instantiated. Used to deal with +% optional assertions + +rdf_assert_if_ground(S,P,O,Graph) :- + nonvar(S), nonvar(P), nonvar(O), !, + ( rdf(S,P,O,Graph) + -> true + ; rdf_assert(S,P,O,Graph) + ). +rdf_assert_if_ground(_,_,_,_). + +%% rdf_retract_if_ground(+S,+P,+O,+Graph) is det. +% +% Retract if the goal is instantiated. Used to deal with +% optional retraction + +rdf_retract_if_ground(S,P,O,Graph) :- + nonvar(S), nonvar(P), nonvar(O), !, + rdf_retractall(S,P,O,Graph). +rdf_retract_if_ground(_,_,_,_). + + +%% subject_triple_sequence(?S, +Pattern, ?Graph) is nondet. +% +% True if S has the P-O pairs from Pattern in the same order as in +% Pattern. + +subject_triple_sequence(S, Pattern, Graph) :- + best_guard(Pattern, S, Guard), + findall(S, Guard, SList), + sort(SList, SSet), + assertion(maplist(atom, SSet)), + member(S, SSet), + findall(P-O, rdf(S, P, O, Graph), Data), + sequence_in(Pattern, Data). + +best_guard([], S, rdf_subject(S)). +best_guard([P-O|T], S, Guard) :- + copy_term(P-O, P1-O1), + estimate(rdf(S,P1,O1), C0), + best_guard(T, S, C0, rdf(S,P1,O1), Guard). + +best_guard([], _, _, G, G). +best_guard([P-O|T], S, C0, G0, G) :- + copy_term(P-O, P1-O1), + estimate(rdf(S,P1,O1), C1), + ( C1 < C0 + -> best_guard(T, S, C1, rdf(S,P1,O1), G) + ; best_guard(T, S, C0, G0, G) + ). + +estimate(rdf(_,P,O), C) :- + atom(P), var(O), !, + rdf_predicate_property(P, triples(C)). +estimate(rdf(S,P,O), C) :- + rdf_estimate_complexity(S,P,O,C). + + + + +%% sequence_in(+ListPattern, -Data) is nondet. +% +% True if ListPattern appears in Data. There are two patterns of +% interest to us, for which we give an example if the pattern +% appears three times: +% +% a,b,a,b,a,b +% a,a,a,b,b,b +% +% In both cases, the pattern [a,b] must match 3 times. To deal +% with this, we first try to match the 2nd type. +% +% @param ListPattern is a list of Pred-Object pairs. 
+ +sequence_in(Pattern, Data) :- + maplist(indices(Data), Pattern, Places), + maplist(arity, Places, [Len|Lens]), + all_same(Lens, Len), + Array =.. [d|Data], + between(1, Len, I), + data(Pattern, I, Places, Array). + +data([], _, _, _). +data([H|T], I, [P|PT], Data) :- + arg(I, P, Place), + arg(Place, Data, H), + data(T, I, PT, Data). + +indices(Data, H, Indices) :- + findall(I, nth1(I, Data, H), IndexList), + Indices =.. [i|IndexList]. + +arity(Term, Arity) :- + functor(Term, _, Arity). + +all_same([], _). +all_same([H|T], H) :- + all_same(T, H). + + +%% rdf_set_lang(+O0, +Lang, -O) +% +% Set/change the language of a literal + +rdf_set_lang(Lit, Lang, literal(lang(Lang, Text))) :- + text_of_literal(Lit, Text). + +%% rdf_set_type(+O0, +Type, -O) +% +% Set/change the type of a literal + +rdf_set_type(Lit, Type, literal(type(Type, Text))) :- + text_of_literal(Lit, Text). + +text_of_literal(Var, _) :- + var(Var), !, + fail. +text_of_literal(literal(Lit), Text) :- !, + text_of_literal(Lit, Text). +text_of_literal(lang(_, Text), Text) :- !. +text_of_literal(type(_, Text), Text) :- !. +text_of_literal(Text, Text). + + + /******************************* + * EXPANSION * + *******************************/ + +user:term_expansion(In, Out) :- + prolog_load_context(module, Module), + predicate_property(Module:rdf_rewrite(_), + imported_from(rdf_rewrite)), + expand_rule(In, Out). + + + /******************************* + * PCE EMACS SUPPORT * + *******************************/ + +:- multifile + emacs_prolog_colours:term_colours/2, + emacs_prolog_colours:goal_colours/2, + emacs_prolog_colours:style/2, + emacs_prolog_colours:identify/2, + prolog:called_by/2. + +term_colours((_Name@@Rule), + expanded - [ identifier, RuleColours ]) :- + term_colours(Rule, RuleColours). +term_colours((Head <=> Body), + expanded - [ HeadColours, BodyColours ]) :- + head_colours(Head, HeadColours), + body_colours(Body, BodyColours). 
+term_colours((Head ==> Body), + expanded - [ HeadColours, BodyColours ]) :- + head_colours(Head, keep, HeadColours), + body_colours(Body, BodyColours). + +head_colours((Keep \ Del), + expanded - [ KeepColours, DelColours ]) :- !, + head_colours(Keep, keep, KeepColours), + head_colours(Del, del, DelColours). +head_colours(Del, Colours) :- + head_colours(Del, del, Colours). + +%% head_colours(+BodyTerm, +KeepDel, -Colours) is det. + +head_colours(Var, _, classify) :- + var(Var). +head_colours((A,B), Keep, control - [AC, BC]) :- !, + head_colours(A, Keep, AC), + head_colours(B, Keep, BC). +head_colours(X?, Keep, optional - [Colours ]) :- !, + head_colours(X, Keep, Colours). +head_colours(List, Keep, sequence - Colours) :- + is_list(List), !, + head_sequence(List, Keep, Colours). +head_colours({_}, keep, keep_triple - [classify ]) :- !. +head_colours({_}, del, del_triple - [classify ]) :- !. +head_colours(_, _, classify). + +head_sequence([], _, []). +head_sequence([H|T0], Keep, [C|T]) :- + head_colours(H, Keep, C), + head_sequence(T0, Keep, T). + +%% body_colours(+BodyTerm, -Colours) is det. + +body_colours(Var, classify) :- + var(Var), !. +body_colours((A,B), control - [AC, BC]) :- !, + body_colours(A, AC), + body_colours(B, BC). +body_colours(X>>_, redirect - [Colours, graph]) :- !, + body_colours(X, Colours). +body_colours({_}, add_triple - [classify ]) :- !. +body_colours(_, body). + +emacs_prolog_colours:term_colours(Term, Colours) :- + term_colours(Term, Colours). + +:- op(990, xfx, :=). % allow compiling without XPCE +:- op(200, fy, @). + +emacs_prolog_colours:style(add_triple, style(background := '#a2ffa1')). +emacs_prolog_colours:style(del_triple, style(background := '#ffb3b3')). +emacs_prolog_colours:style(optional, style(bold := @on)). +emacs_prolog_colours:style(sequence, style(bold := @on)). +emacs_prolog_colours:style(graph, style(bold := @on)). 
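The matching done by sequence_in/2 in the file above is easier to follow on concrete data. The sketch below copies sequence_in/2 and its helpers from the patch so that it runs standalone; demo/2 and the sample a-N / b-N pairs are hypothetical and stand in for the Pred-Object pairs of one subject. It covers both orderings named in the comment of sequence_in/2: interleaved (a,b,a,b) and grouped (a,a,b,b).

```prolog
:- use_module(library(lists)).
:- use_module(library(apply)).

% sequence_in/2 and helpers, copied from rdf_rewrite.pl in this patch.
sequence_in(Pattern, Data) :-
    maplist(indices(Data), Pattern, Places),   % positions of each pattern element
    maplist(arity, Places, [Len|Lens]),
    all_same(Lens, Len),                       % every element must occur Len times
    Array =.. [d|Data],
    between(1, Len, I),                        % enumerate the I-th occurrence
    data(Pattern, I, Places, Array).

data([], _, _, _).
data([H|T], I, [P|PT], Data) :-
    arg(I, P, Place),
    arg(Place, Data, H),
    data(T, I, PT, Data).

indices(Data, H, Indices) :-
    findall(I, nth1(I, Data, H), IndexList),
    Indices =.. [i|IndexList].

arity(Term, Arity) :-
    functor(Term, _, Arity).

all_same([], _).
all_same([H|T], H) :-
    all_same(T, H).

% demo(-Interleaved, -Grouped): collect all matches of the pattern
% [a-X, b-Y] against an interleaved and a grouped data list.
demo(Interleaved, Grouped) :-
    findall(X-Y, sequence_in([a-X, b-Y], [a-1, b-2, a-3, b-4]), Interleaved),
    findall(X-Y, sequence_in([a-X, b-Y], [a-1, a-3, b-2, b-4]), Grouped).
```

In both cases the I-th occurrence of `a` is paired with the I-th occurrence of `b`, so each list yields two matches. The match fails outright when the pattern elements do not occur equally often, because all_same/2 then rejects the index vectors.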
diff --git a/lib/xmlrdf/rdf_schema.pl b/lib/xmlrdf/rdf_schema.pl new file mode 100644 index 0000000..adb390b --- /dev/null +++ b/lib/xmlrdf/rdf_schema.pl @@ -0,0 +1,132 @@ +:- module(rdf_schema, + [ make_schema/2 % +DataGraph, +SchemaGraph + ]). +:- use_module(library(semweb/rdf_db)). +:- use_module(library(semweb/rdfs)). + +%% make_schema(+Graph, +SchemaGraph) is det. +% +% Create an initial schema by providing definitions for all +% predicates and types (classes) used in Graph. The schema is +% dumped into the graph SchemaGraph. +% +% This predicate is typically used _after_ running the +% rewrite-rules to reflect renamed types and properties. + +make_schema(Data, Schema) :- + rdf_retractall(_,_,_,Schema), + rdf_transaction(make_schema_(Data, Schema), make_schema). + +make_schema_(Data, Schema) :- + forall(predicate_in_graph(Data, P), + define_predicate(P, Data, Schema)), + forall(type_in_graph(Data, Class), + define_type(Class, Schema)). + +define_predicate(P, _, _) :- + rdf_global_id(rdf:_, P), !. +define_predicate(P, _, _) :- + rdf_global_id(rdfs:_, P), !. +define_predicate(P, DataGraph, Graph) :- + copy_data(P, Graph), + rdf_assert(P, rdf:type, rdf:'Property', Graph), + assign_label(P, Graph), + predicate_statistics(DataGraph, P, _C, + _Subjects, _Objects, + Domains, Ranges), + ( Domains = [Dom] + -> rdf_assert(P, rdfs:domain, Dom, Graph) + ; true + ), + ( Ranges = [Range] + -> rdf_assert(P, rdfs:range, Range, Graph) + ; true + ). + + +define_type(C, Graph) :- + copy_data(C, Graph), + rdf_assert(C, rdf:type, rdfs:'Class', Graph), + assign_label(C, Graph). + + +assign_label(S, Graph) :- + ( rdf(S, rdfs:label, _) + -> true + ; rdfs_label(S, Label), + Label \== S + -> rdf_assert(S, rdfs:label, literal(Label), Graph) + ; true + ). + + +copy_data(S, Graph) :- + rdf_retractall(S,_,_,Graph), + forall((rdf(S,P,O,G), G \== Graph), + rdf_assert(S,P,O,Graph)). 
+ + + /******************************* + * QUERY * + *******************************/ + +predicate_in_graph(Graph, P) :- + rdf_current_predicate(P), + once(rdf(_,P,_,Graph)). + +%% type_in_graph(+Graph, -Class) +% +% Generate the unique types in Graph + +:- thread_local + type_seen/1. + +type_in_graph(Graph, Class) :- + call_cleanup(type_in_graph2(Graph, Class), + retractall(type_seen(_))). + +type_in_graph2(Graph, Class) :- + subject_in_graph(Graph, S), + ( rdf(S, rdf:type, Class) + *-> true + ; rdf_equal(Class, rdfs:'Resource') + ), + ( type_seen(Class) + -> fail + ; assert(type_seen(Class)) + ). + + +subject_in_graph(Graph, S) :- + rdf_subject(S), + once(rdf(S, _, _, Graph)). + +predicate_statistics(Graph, P, C, Subjects, Objects, Domains, Ranges) :- + findall(S-O, rdf(S,P,O,Graph), Pairs), + length(Pairs, C), + pairs_keys_values(Pairs, Ss, Os), + sort(Ss, Subjects), + sort(Os, Objects), + resources_types(Subjects, Graph, Domains), + resources_types(Objects, Graph, Ranges). + +resources_types(URIs, Graph, Types) :- + findall(T, resource_type_in(URIs, Graph, T), TList), + sort(TList, Types). + +resource_type_in(List, Graph, T) :- + member(URI, List), + resource_type(URI, Graph, T). + +%% resource_type(+URI, +Graph, -Type) is det. + +resource_type(URI, Graph, T) :- + ( URI = literal(Lit) + -> ( Lit = type(T, _) + -> true + ; rdf_equal(T, rdfs:'Literal') + ) + ; rdf(URI, rdf:type, T, Graph) + *-> true + ; rdf_equal(T, rdfs:'Resource') + ). diff --git a/lib/xmlrdf/xmlrdf.pl b/lib/xmlrdf/xmlrdf.pl new file mode 100644 index 0000000..689fac1 --- /dev/null +++ b/lib/xmlrdf/xmlrdf.pl @@ -0,0 +1,615 @@ +/* File: xmlrdf.pl + Author: Jan Wielemaker + Created: Oct 26 2009 + Purpose: Generic translation from XML to RDF +*/ + +:- module(xmlrdf, + [ load_xml_as_rdf/2 % +Input, +Options + ]). +:- use_module(library(semweb/rdf_db)). +:- use_module(library(semweb/rdfs)). +:- use_module(library(semweb/rdf_turtle)). +:- use_module(library(semweb/rdf_turtle_write)). 
+:- use_module(library(http/http_open)). +:- use_module(library(sgml)). +:- use_module(library(uri)). +:- use_module(library(option)). +:- use_module(library(debug)). +:- use_module(library(xsdp_types)). +:- use_module(library(record)). +:- use_module(library(apply)). + +:- rdf_register_ns(map, 'http://cs.vu.nl/eculture/map/'). + +:- record + option(units:list=[], + dialect:oneof([xml,xmlns])=xmlns, + graph:atom=data, + prefix:atom=(-), + class_style:oneof(['OneTwo','oneTwo', + 'one_two','One_Two',keep])='OneTwo', + predicate_style:oneof(['OneTwo','oneTwo', + 'one_two','One_Two',keep])='oneTwo'). + +%% load_xml_as_rdf(From, Options) +% +% Convert an XML file into `crude' RDF. From is either a filename, +% a URL (using either =file= or =http= scheme) or a term +% stream(Stream). Options is a list of the following options: +% +% * unit(+Elements) +% If provided, consider elements whose names match one of +% the members of the list Elements a toplevel structure +% and process the file one element at a time. If there +% is just one toplevel structure, this may be passed without +% using a list. +% +% * dialect(Dialect) +% One of =xml= or =xmlns=. Use =xmlns= if the file contains +% xmlns= attributes and XML names of the form ns:local. If +% neither is present, the file must be processed using the +% =xml= dialect. +% +% * graph(+Graph) +% RDF Graph for storing the output. Default is =data= +% +% * prefix(+Prefix) +% Create a URI from an XML name by putting Prefix in front +% of it. If we are processing _xmlns_ (see dialect), the +% XML namespace declaration is ignored. I.e., the URI is +% formed from Prefix followed by the XML local name. +% +% If this option is not present the URI is simply the XML +% name if the dialect is =xml= or the XML namespace followed +% by the local name if the dialect is =xmlns=. +% +% * predicate_style(+Style) +% Change the `identifier style' for RDF predicates created +% from XML names. The default is 'oneTwo'. 
Other values are +% 'OneTwo', 'one_two' or 'One_Two'. The value =keep= uses +% the XML name directly as RDF name. This is valid, but +% often leads to names that cannot be written using the Turtle +% and SPARQL shorthand notation. +% +% * class_style(+Style) +% Same as predicate_style, but used when generating a +% class-name. The default is 'OneTwo'. + +load_xml_as_rdf(From, Options) :- + canonical_unit_option(Options, COptions), + make_option(COptions, Record, _Rest), + flush_name_uri_cache, + flush_property_map, + rdf_statistics(triples(C0)), + statistics(cputime, T0), + setup_call_cleanup(open_input(From, In, Cleanup), + process(In, Record), + Cleanup), + statistics(cputime, T1), + rdf_statistics(triples(C1)), + T is T1-T0, + C is C1-C0, + print_message(informational, xmlrdf(loaded(From, T, C))). + +canonical_unit_option(Options, COptions) :- + select_option(unit(Unit), Options, Rest), !, + to_list(Unit, Units), + COptions = [units(Units)|Rest]. +canonical_unit_option(Options, Options). + +to_list(List, List) :- + is_list(List), !. +to_list(Elem, [Elem]). + +%% open_input(+Spec, -Stream, -Close) +% +% Open the input Spec, returning Stream and a closure Close to +% revert the side-effects. + +open_input(stream(In), In, true) :- !. +open_input(URL, In, Cleanup) :- + atom(URL), + uri_file_name(URL, File), !, + open(File, read, In, [type(binary)]), + Cleanup = close(In). +open_input(URL, In, Cleanup) :- + atom(URL), + uri_components(URL, Data), + uri_data(scheme, Data, http), !, + http_open(URL, In, []), + set_stream(In, file_name(URL)), + Cleanup = close(In). +open_input(Spec, In, Cleanup) :- + absolute_file_name(Spec, Path, [access(read)]), + open(Path, read, In, [type(binary)]), + Cleanup = close(In). 
+ +%% process(+Stream, +Options) + +process(Stream, Options) :- + option_units(Options, Units), + Units \== [], !, + b_setval(xmlrdf_unit, Units), + b_setval(xmlrdf_options, Options), + setup_call_cleanup(new_sgml_parser(Parser, []), + ( configure_parser(Parser, Options), + sgml_parse(Parser, + [ source(Stream), + call(begin, on_begin) + ]) + ), + free_sgml_parser(Parser)). +process(Stream, Options) :- + setup_call_cleanup(new_sgml_parser(Parser, []), + ( configure_parser(Parser, Options), + sgml_parse(Parser, + [ source(Stream), + document(Document) + ]) + ), + free_sgml_parser(Parser)), + Document = [Element], + convert(Element, Options). + + +configure_parser(Parser, Options) :- + option_dialect(Options, Dialect), + set_sgml_parser(Parser, dialect(Dialect)), + set_sgml_parser(Parser, space(sgml)). + + +on_begin(Element, Attr, Parser) :- + b_getval(xmlrdf_unit, Elements), + memberchk(Element, Elements), !, + b_getval(xmlrdf_options, Options), + sgml_parse(Parser, + [ document(Content), + parse(content) + ]), + convert(element(Element, Attr, Content), Options). + + + /******************************* + * RDF CONVERSION * + *******************************/ + +%% convert(+Element, +Options) is det. + +convert(Element, Options) :- + option_graph(Options, Graph), + element_uri(Element, URI), + element_type(Element, Type, Options), + rdf_assert(URI, rdf:type, Type, Graph), + set_properties(URI, Element, Options). + +element_uri(_Element, URI) :- + rdf_bnode(URI). + +element_type(element(Name, _, _), Class, _) :- + rdf(Class, map:xmlname, literal(Name)), + rdfs_individual_of(Class, rdfs:'Class'), !. +element_type(element(EName, _, _), Name, Options) :- + name_to_uri(EName, class, Name, Options), + debug(xmlrdf(type), 'No element type for element-name ~p', [Name]). + + + /******************************* + * PROPERTIES * + *******************************/ + +%% set_properties(+URI, +Element, +Options) is det. 
+ +set_properties(URL, element(_, Attrs, Content), Options) :- + set_properties(URL, element(_, Attrs, Content), -, Options). + +set_properties(URL, element(_, Attrs, Content), Lang, Options) :- + setp_from_attributes(Attrs, URL, Lang, Lang1, Options), + setp_from_content(Content, URL, Lang1, Options). + +setp_from_attributes([], _, Lang, Lang, _). +setp_from_attributes([xmlns:_=_|T], URL, Lang0, Lang, Options) :- !, + setp_from_attributes(T, URL, Lang0, Lang, Options). +setp_from_attributes([xmlns=_|T], URL, Lang0, Lang, Options) :- !, + setp_from_attributes(T, URL, Lang0, Lang, Options). +setp_from_attributes([xml:_=_|T], URL, Lang0, Lang, Options) :- !, + setp_from_attributes(T, URL, Lang0, Lang, Options). +setp_from_attributes([AttName=Value|T], URL, Lang0, Lang, Options) :- + map_literal_property(URL, AttName, Prop, Options), + option_graph(Options, Graph), + ( Lang0 == (-) + -> rdf_assert(URL, Prop, literal(Value), Graph) + ; rdf_assert(URL, Prop, literal(lang(Lang0, Value)), Graph) + ), + setp_from_attributes(T, URL, Lang0, Lang, Options). + + +%% setp_from_content(+Content, +URL, +Lang, +Options) is det. +% +% Create attributes for URL from the given content. If we +% encounter CDATA, this is typically from an element that has +% attributes. We use rdf:value for the property in this case. + +setp_from_content([], _, _, _). +setp_from_content([element(Name, Attrs0, Content)|T], URL, Lang, Options) :- !, + exclude(xmlns_property, Attrs0, Attrs), + setp_from_content_element(element(Name, Attrs, Content), URL, + Lang, Options), + setp_from_content(T, URL, Lang, Options). +setp_from_content([Text|T], URL, Lang, Options) :- + make_literal_value(Lang, Text, Value), + option_graph(Options, Graph), + rdf_assert(URL, rdf:value, Value, Graph), + setp_from_content(T, URL, Lang, Options). + + +xmlns_property(xmlns=_) :- !. +xmlns_property(xmlns:_=_) :- !. +xmlns_property(P=_) :- + atom(P), + sub_atom(P, 0, _, _, 'xmlns:'), !. 
+ + +%% setp_from_content_element(+Element, +URL, +Lang, +Options) is det. +% +% Create a property for URL from the XML element Element. If the +% property is mapped, we know the property and target datatype. If +% not, we must decide whether to go for a literal or a bnode. If +% all data can be expressed as a literal, we use a literal; +% otherwise we create a bnode. We can only express the data as a +% literal if it has no attributes or the only attribute is +% xml:lang. + +setp_from_content_element(element(EName, AL, CL), URL, Lang, Options) :- + mapped_property(URL, EName, Prop, Type, Options), !, + debug(xmlrdf(pmap), '~p ~p', [URL, Prop]), + make_value(EName, AL, CL, Type, Value, Lang, Options), + option_graph(Options, Graph), + rdf_assert(URL, Prop, Value, Graph). +setp_from_content_element(element(EName, [], [Text]), URL, Lang, Options) :- + atom(Text), !, + name_to_uri(EName, predicate, Prop, Options), + make_literal_value(Lang, Text, Value), + option_graph(Options, Graph), + rdf_assert(URL, Prop, Value, Graph). +setp_from_content_element(element(EName, [], []), URL, _, Options) :- !, + name_to_uri(EName, predicate, Prop, Options), + option_graph(Options, Graph), + rdf_assert(URL, Prop, literal(''), Graph). +setp_from_content_element(element(EName, [xml:lang=Lang], [Text]), + URL, _, Options) :- + atom(Text), !, + name_to_uri(EName, predicate, Prop, Options), + make_literal_value(Lang, Text, Value), + option_graph(Options, Graph), + rdf_assert(URL, Prop, Value, Graph). +setp_from_content_element(element(EName, Attrs0, Content), + URL, Lang, Options) :- + name_to_uri(EName, predicate, Prop, Options), + name_to_uri(EName, class, Type, Options), + ( select(xml:lang=Lang1, Attrs0, Attrs) + -> true + ; Lang1 = Lang, + Attrs = Attrs0 + ), + make_value(EName, Attrs, Content, Type, Value, Lang1, Options), + option_graph(Options, Graph), + rdf_assert(URL, Prop, Value, Graph). + +make_literal_value(-, Text, literal(Text)) :- !. 
+make_literal_value(Lang, Text, literal(lang(Lang, Text))). + + +%% make_value(+Element, +Attributes, +Content, +Type, -Value, +%% +Lang, +Options) + +make_value(_, Atts, Content, Literal, literal(Value), Lang, _) :- + rdf_equal(rdfs:'Literal', Literal), + ( Content = [Text], + atom(Text) + -> true + ; Content == [] + -> Text = '' + ), !, + ( memberchk(xml:lang=TheLang, Atts) + -> Value = lang(TheLang, Text) + ; Lang = (-) + -> Value = Text + ; Value = lang(Lang, Text) + ). +make_value(_, _, [Text], Type, literal(type(Type, Text)), _, _) :- + atom(Text), + datatype_type(Type), !. +make_value(_, _, [], Type, literal(type(Type, '')), _, _) :- + datatype_type(Type), !. +make_value(_, Attrs, Content, Type, literal(type(XMLLit, Content)), _, _) :- + rdf_equal(rdf:'XMLLiteral', XMLLit), + maplist(xml_attribute, Attrs), + ( Type = XMLLit + ; rdf_equal(rdfs:'Literal', Type) + ), !. +make_value(Element, Attrs, Content, Type, ValueURI, Lang, Options) :- + rdf_equal(rdf:'XMLLiteral', XMLLit), + ( Type = XMLLit + ; rdf_equal(rdfs:'Literal', Type) + ), !, + element_uri(element(Element, Attrs, Content), ValueURI), + option_graph(Options, Graph), + rdf_assert(ValueURI, rdf:type, Type, Graph), + setp_from_attributes(Attrs, ValueURI, Lang, _Lang1, Options), + rdf_assert(ValueURI, rdf:value, literal(type(XMLLit, Content)),Graph). +make_value(_, [], [URL], Type, URL, _, _) :- + atom(URL), + rdfs_subclass_of(Type, rdfs:'Resource'), !. +make_value(Element, Attrs, Content, Type, ValueURI, Lang, Options) :- + element_uri(element(Element, Attrs, Content), ValueURI), + option_graph(Options, Graph), + rdf_assert(ValueURI, rdf:type, Type, Graph), + setp_from_attributes(Attrs, ValueURI, Lang, Lang1, Options), + setp_from_content(Content, ValueURI, Lang1, Options). + +datatype_type(URI) :- + xsdp_uri_type(URI, _Type). + +xml_attribute(xml:_=_). +xml_attribute(xmlns:_=_). 
+ + + /******************************* + * MAP * + *******************************/ + +%% map_literal_property(+Subject, +XMLName, -RDFProperty, +Options) +%% is det. +% +% RDFProperty is a URI for the property indicated in the source +% with XMLName. The property will be added to Subject. Subject is +% guaranteed to have an rdf:type when this predicate is called. + +:- thread_local + literal_property_map/3, + property_map/5. + +flush_property_map :- + retractall(literal_property_map(_,_,_)), + retractall(property_map(_,_,_,_,_)). + +map_literal_property(Subject, XMLName, RDFProperty, Options) :- + rdf(Subject, rdf:type, Class), + ( literal_property_map(XMLName, Class, RDFProperty) + -> true + ; map_lprop_class(Subject, XMLName, RDFProperty, Options), + assert(literal_property_map(XMLName, Class, RDFProperty)) + ). + +map_lprop_class(Subject, XMLName, RDFProperty, Options) :- + mapped_property(Subject, XMLName, RDFProperty, _Type, Options), !. +map_lprop_class(_Subject, XMLName, Prop, Options) :- + name_to_uri(XMLName, predicate, Prop, Options), + rdf_equal(rdfs:'Literal', Literal), + update_schema(XMLName, Prop, Literal). + +update_schema(XMLName, Prop, Type) :- + name_to_atom(XMLName, Atom), + ( rdfs_individual_of(Prop, rdf:'Property') + -> true + ; rdf_assert(Prop, rdf:type, rdf:'Property', schema) + ), + ( rdf_has(Prop, rdfs:range, _Range) + -> true + ; rdf_assert(Prop, rdfs:range, Type, schema) + ), + ( rdf_has(Prop, map:xmlname, _) + -> true + ; rdf_assert(Prop, map:xmlname, literal(Atom), schema) + ). + +%% mapped_property(+Subject, +XMLName, +%% -RDFProperty, -RDFType, +Options) is semidet. +% +% True if XMLName is mapped to RDFProperty with the given RDF +% type. There are three ways to decide that we deal with an +% established mapping: +% +% 1. There is a property with map:xmlname with a literal value +% that matches the XML Name. Namespace qualified names are +% written as <prefix><local> +% +% 2. 
There is a property with map:xmlname that matches the +% default translation. +% +% 3. There is a property with a URI that matches the default +% translation that has a type. + +mapped_property(Subject, XMLName, Prop, Type, Options) :- + rdf(Subject, rdf:type, Class), + ( property_map(XMLName, Class, Prop, Type, Mapped) + -> Mapped == true + ; mapped_property_nc(Subject, XMLName, Prop, Type, Options) + -> assert(property_map(XMLName, Class, Prop, Type, true)) + ; assert(property_map(XMLName, Class, _, _, false)), + fail + ). + +mapped_property_nc(_Subject, XMLName, Prop, Type, Options) :- + name_to_uri(XMLName, predicate, Prop0, Options), + ( ( name_to_atom(XMLName, Atom), + rdf(Prop, map:xmlname, literal(Atom)) + ; rdf(Prop, map:xmlname, Prop0) + ), + rdfs_individual_of(Prop, rdf:'Property') + -> ( rdf(Prop, rdfs:range, Type) + -> true + ; rdf_equal(rdfs:'Literal', Type) + ) + ; Prop = Prop0, + rdf(Prop0, rdfs:range, Type) + -> true + ). + +name_to_atom(Prefix:Local, Name) :- + atom_concat(Prefix, Local, Name). +name_to_atom(Name, Name). + + + /******************************* + * UTIL * + *******************************/ + +%% name_to_uri(+XMLName, +Type, -URI, +Options) is det. +% +% @param XMLName is an atom for dialect =xml= or a term +% Prefix:Local when using the =xmlns= dialect. +% @param Type is one of =class= or =predicate= +% @param URI is an RDF resource (atom) + +:- thread_local + name_uri_cache/4. + +flush_name_uri_cache :- + retractall(name_uri_cache(_,_,_,_)). + +name_to_uri(NS:Local, Type, URI, Options) :- !, + ( name_uri_cache(Local, NS:Local, Type, URI) + -> true + ; name_to_uri_nc(NS:Local, Type, URI, Options), + assert(name_uri_cache(Local, NS:Local, Type, URI)) + ). +name_to_uri(Name, Type, URI, Options) :- + ( name_uri_cache(Name, Name, Type, URI) + -> true + ; name_to_uri_nc(Name, Type, URI, Options), + assert(name_uri_cache(Name, Name, Type, URI)) + ). 
+ +name_to_uri_nc(NS:Local, Type, URI, Options) :- !, + restyle(Type, Local, Local1, Options), + ( option_prefix(Options, Prefix), + Prefix \== (-) + -> atom_concat(Prefix, Local1, URI) + ; atom_concat(NS, Local1, URI) + ). +name_to_uri_nc(Name, Type, URI, Options) :- + restyle(Type, Name, Name1, Options), + ( option_prefix(Options, Prefix), + Prefix \== (-) + -> atom_concat(Prefix, Name1, URI) + ; URI = Name1 + ). + +restyle(predicate, Name0, Name, Options) :- + option_predicate_style(Options, Style), + ( Style == keep + -> Name = Name0 + ; restyle_identifier(Style, Name0, Name) + ). +restyle(class, Name0, Name, Options) :- + option_class_style(Options, Style), + ( Style == keep + -> Name = Name0 + ; restyle_identifier(Style, Name0, Name) + ). + + + /******************************* + * IDENTIFIERS * + *******************************/ + +%% restyle_identifier(+Style, +In, -Out) is det. +% +% Restyle an identifier by extracting the alnum substrings and +% joining them together according to Style. +% +% @param Style is described with join_name_parts/3. + +restyle_identifier(Style, In, Out) :- + name_parts(In, Parts), + join_name_parts(Style, Parts, Out). + + +%% name_parts(+Identifier, -Parts) is det. +% +% Parts is a list of atoms that make up Identifier. The parts +% found are turned into lowercase, unless all its characters are +% uppercase. E.g., +% +% == +% ?- name_parts('sourceCodeURI', X). +% X = [source, code, 'URI']. +% == + +name_parts(Name, Parts) :- + atom_codes(Name, Codes), + phrase(name_parts(Parts), Codes). + +name_parts([H|T]) --> + name_part(H), !, + name_parts(T). +name_parts([]) --> []. + +name_part(H) --> + string(Codes, Tail), + sep(Tail), !, + { Codes = [_|_], + atom_codes(H0, Codes), + ( maplist(is_upper, Codes) + -> H = H0 + ; downcase_atom(H0, H) + ) + }. + +string(T,T) --> []. +string([H|T], L) --> [H], string(T, L). + +sep([]) --> sep_char, !, sep_chars. +sep([T]), [N] --> + [T,N], + { code_type(T, lower), + code_type(N, upper) + }. 
+sep([],[],[]). + +sep_char --> + [H], + { \+ code_type(H, alnum) }. + +sep_chars --> sep_char, !, sep_chars. +sep_chars --> []. + +%% join_name_parts(+Style, +Parts, -Identifier) +% +% Join parts of an identifier according to Style. Style is +% one of: +% +% * 'OneTwo' +% * oneTwo +% * one_two +% * 'One_Two' + +join_name_parts(Style, [First|Parts], Identifier) :- + style(Style, CapFirst, CapRest, Sep), + capitalise(CapFirst, First, H), + maplist(capitalise(CapRest), Parts, T), + atomic_list_concat([H|T], Sep, Identifier). + +style('OneTwo', true, true, ''). +style(oneTwo, false, true, ''). +style(one_two, false, false, '_'). +style('One_Two', true, true, '_'). + +capitalise(false, X, X) :- !. +capitalise(true, X, Y) :- + atom_codes(X, [H0|T]), + code_type(H0, to_lower(H)), + atom_codes(Y, [H|T]). + + + /******************************* + * MESSAGES * + *******************************/ + +:- multifile + prolog:message//1. + +prolog:message(xmlrdf(loaded(From, Time, Count))) --> + [ 'Loaded ~D triples in ~3f seconds from ~p'-[Count, Time, From] ]. diff --git a/rdf/cpack/xmlrdf.ttl b/rdf/cpack/xmlrdf.ttl index a640ba5..ffcbb44 100644 --- a/rdf/cpack/xmlrdf.ttl +++ b/rdf/cpack/xmlrdf.ttl @@ -11,7 +11,7 @@ # this. Otherwise you can specify the information inline as done below. # See http://xmlns.com/foaf/spec/ for defines fields. -<> a cpack:Package ; +<> a cpack:Library ; cpack:packageName "xmlrdf" ; dcterms:title "XML to RDF conversion" ; cpack:author [ a foaf:Person ; @@ -24,7 +24,17 @@ ] ; cpack:description -"""The package description goes here. You can use PlDoc Wiki markup. +"""The XML to RDF convertor allows for importing XML as RDF using two steps: + + 1. A generic translation from XML to RDF. This generally needs no + configuration but it allows for a few parameters: + + - Define the primary record element + - Define the target namespace + - Define elements that must be kept as XMLLiteral + + 2. 
Perform a rewrite on the generated graph using a Prolog-based + rewrite language. """ .
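
The four identifier styles listed for join_name_parts/3 in the Prolog source above ('OneTwo', oneTwo, one_two and 'One_Two') can be illustrated with a small standalone sketch. This is not the committed code: the names capitalise_first/2 and join_parts/3 are hypothetical, the sketch assumes SWI-Prolog, and it assumes the identifier has already been split into parts (the job of name_parts/2 in the patch).

```prolog
:- use_module(library(apply)).   % maplist/3

% capitalise_first(+Atom, -Capitalised)
% Hypothetical helper: upcase only the first character of a
% non-empty atom and leave the remaining characters untouched.
capitalise_first(Atom, Capitalised) :-
    sub_atom(Atom, 0, 1, After, First),
    sub_atom(Atom, 1, After, 0, Rest),
    upcase_atom(First, Upper),
    atom_concat(Upper, Rest, Capitalised).

% join_parts(+Style, +Parts, -Identifier)
% Hypothetical stand-in for join_name_parts/3: one clause per style.
join_parts('OneTwo', Parts, Id) :-
    maplist(capitalise_first, Parts, Caps),
    atomic_list_concat(Caps, Id).
join_parts(oneTwo, [First|Rest], Id) :-
    maplist(capitalise_first, Rest, Caps),
    atomic_list_concat([First|Caps], Id).
join_parts(one_two, Parts, Id) :-
    atomic_list_concat(Parts, '_', Id).
join_parts('One_Two', Parts, Id) :-
    maplist(capitalise_first, Parts, Caps),
    atomic_list_concat(Caps, '_', Id).
```

With the parts from the name_parts/2 example in the source, the query `?- join_parts(oneTwo, [source, code, 'URI'], X).` binds X to sourceCodeURI. Writing one clause per style trades the compact style/4 table of the patch for readability; both express the same mapping.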