SWI-Prolog -- XML documents

3.3 XML documents

The parser can operate in two modes: sgml mode and xml mode, as defined by the dialect(Dialect) option. Regardless of this option, if the first line of the document reads as below, the parser is switched automatically into XML mode.

<?xml ... ?>

Currently switching to XML mode implies:

XML empty elements
The construct <element [attribute...] /> is recognised as an empty element.
Predefined entities
The following entities are predefined: lt (<), gt (>), amp (&), apos (') and quot (").
Case sensitivity
In XML mode, names are treated as case-sensitive, except for the DTD reserved names (e.g. ELEMENT).
Character classes
In XML mode, underscores (_) and colons (:) are allowed in names.
White-space handling
White space mode is set to preserve. In addition to setting white-space handling at the toplevel, the XML reserved attribute xml:space is honoured. It may appear both in the document and in the DTD. The remove extension is also honoured as an xml:space value. For example, the DTD statement below ensures that the pre element preserves space, regardless of the default processing mode.
```
<!ATTLIST pre xml:space nmtoken #fixed preserve>
```
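
As a minimal sketch of the XML-mode behaviour listed above, the following parses a short document from a string. The helper name parse_xml_text/2 and the sample document are illustrative assumptions and not part of the package; only load_structure/3 and the dialect(xml) option come from library(sgml).

```prolog
:- use_module(library(sgml)).

% parse_xml_text(+Text, -DOM)
% Hypothetical helper: parse Text in XML mode and return the
% resulting document structure. The string stream is always closed.
parse_xml_text(Text, DOM) :-
    setup_call_cleanup(
        open_string(Text, In),
        load_structure(stream(In), DOM, [dialect(xml)]),
        close(In)).

% Example query:
% ?- parse_xml_text("<doc><Item/><item>a &lt; b</item></doc>", DOM).
```

With this input, Item and item should be returned as distinct element names (case sensitivity), <Item/> as an element with empty content, and &lt; expanded to < in the character data.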

3.3.1 XML Namespaces

Using the dialect xmlns, the parser will interpret XML namespaces. In this case, the names of elements are returned as a term of the format

URL:LocalName

If an identifier has no namespace and there is no default namespace it is returned as a simple atom. If an identifier has a namespace but this namespace is undeclared, the namespace name rather than the related URL is returned.

Attributes declaring namespaces (xmlns:<ns>=<url>) are reported as if xmlns were not a defined resource.
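
The naming rules above can be tried with a small sketch such as the one below. The helpers parse_xmlns_text/2 and element_names/3, the prefix a and the URL http://example.org/ns# are assumptions made for illustration; only load_structure/3 and dialect(xmlns) are taken from the package.

```prolog
:- use_module(library(sgml)).
:- use_module(library(lists)).

% parse_xmlns_text(+Text, -DOM)
% Hypothetical helper: parse Text with namespace processing enabled.
parse_xmlns_text(Text, DOM) :-
    setup_call_cleanup(
        open_string(Text, In),
        load_structure(stream(In), DOM, [dialect(xmlns)]),
        close(In)).

% element_names(+DOM, -Root, -Children)
% Hypothetical helper: name of the root element and the names of
% its direct child elements.
element_names([element(Root, _Attrs, Content)], Root, Children) :-
    findall(Name, member(element(Name, _, _), Content), Children).

% Example query:
% ?- parse_xmlns_text("<a:doc xmlns:a='http://example.org/ns#'><a:item/><plain/></a:doc>", DOM),
%    element_names(DOM, Root, Children).
```

Following the description above, Root and the first child should be terms of the form 'http://example.org/ns#':LocalName, while plain, which has no namespace and no default namespace in scope, should be a simple atom.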

In many cases, getting attribute-names as url:name is not desirable. Such terms are hard to unify and sometimes multiple URLs may be mapped to the same identifier. This may happen due to poor version management, poor standardisation or because the application doesn't care too much about versions. This package defines two call-backs that can be set using set_sgml_parser/2 to deal with this problem.

The xmlns call-back is called whenever an XML namespace declaration is noticed. It can be used to extend a canonical mapping for later use by the urlns call-back. The following illustrates this behaviour. Any namespace that contains rdf-syntax in its URL, or that is bound to the rdf prefix, is canonicalised to rdf. This implies that any attribute and element name from the RDF namespace appears as rdf:<name>.

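```
% xmlns(URL, CanonicalName, Parser): mapping consulted by the urlns call-back.
:- dynamic
        xmlns/3.

% A namespace declared with the prefix rdf is mapped to rdf.
on_xmlns(rdf, URL, _Parser) :- !,
        asserta(xmlns(URL, rdf, _)).
% Any namespace whose URL contains rdf-syntax is also mapped to rdf.
on_xmlns(_, URL, _Parser) :-
        sub_atom(URL, _, _, _, 'rdf-syntax'), !,
        asserta(xmlns(URL, rdf, _)).

load_rdf_xml(File, Term) :-
        load_structure(File, Term,
                       [ dialect(xmlns),
                         call(xmlns, on_xmlns),
                         call(urlns, xmlns)
                       ]).
```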

The library provides iri_xml_namespace/3 to break down an IRI into its namespace and localname:

[det] iri_xml_namespace(+IRI, -Namespace, -Localname)

Split an IRI (Unicode URI) into its Namespace (an IRI) and Localname (a Unicode XML name, see xml_name/2). The Localname is defined as the longest last part of the IRI that satisfies the syntax of an XML name. With IRI schemes that are designed to work with XML namespaces, this will typically break the IRI on the last # or /. Note, however, that this can produce unexpected results. For example, in the query below one might expect the namespace to be http://example.com/images#, but an XML name cannot start with a digit.

```
?- iri_xml_namespace('http://example.com/images#12345', NS, L).
NS = 'http://example.com/images#12345',
L = ''.
```

As the example above shows, the Localname can be the empty atom. Similarly, Namespace can be the empty atom if IRI is an XML name. Applications will often have to check for either or both of these conditions. We decided against failing in these cases because the application typically wants to know which of the two conditions (empty namespace or empty localname) holds. This predicate is often used for generating RDF/XML from an RDF graph.
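
One way an application might check for these conditions is sketched below. The predicate split_kind/2 and its result terms are hypothetical names chosen for this illustration; only iri_xml_namespace/3 comes from the library.

```prolog
:- use_module(library(sgml)).

% split_kind(+IRI, -Kind)
% Hypothetical helper: classify the result of iri_xml_namespace/3.
%   no_localname(NS)    - no usable XML name at the end of IRI
%   no_namespace(Local) - IRI as a whole is an XML name
%   split(NS, Local)    - both parts are non-empty
split_kind(IRI, Kind) :-
    iri_xml_namespace(IRI, NS, Local),
    (   Local == ''
    ->  Kind = no_localname(NS)
    ;   NS == ''
    ->  Kind = no_namespace(Local)
    ;   Kind = split(NS, Local)
    ).

% Example queries:
% ?- split_kind('http://example.com/images#12345', Kind).
% ?- split_kind('http://www.w3.org/1999/02/22-rdf-syntax-ns#type', Kind).
```

For the first query, the result documented above implies Kind = no_localname('http://example.com/images#12345'). A generator of RDF/XML could use such a classification to decide whether an IRI can be written as a qualified name at all.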

[det] iri_xml_namespace(+IRI, -Namespace)

Same as iri_xml_namespace/3, but avoids creating an atom for the Localname.
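
A possible use, sketched under the assumption that only the namespaces are of interest, is collecting the distinct namespaces of a list of IRIs, for instance to decide which xmlns declarations a generated document needs. The predicate iris_namespaces/2 and the sample IRIs are hypothetical.

```prolog
:- use_module(library(sgml)).
:- use_module(library(lists)).

% iris_namespaces(+IRIs, -Namespaces)
% Hypothetical helper: Namespaces is the sorted, duplicate-free list
% of namespaces of IRIs as computed by iri_xml_namespace/2. An IRI
% that is itself an XML name contributes the empty atom.
iris_namespaces(IRIs, Namespaces) :-
    findall(NS,
            ( member(IRI, IRIs),
              iri_xml_namespace(IRI, NS)
            ),
            NSs),
    sort(NSs, Namespaces).

% Example query:
% ?- iris_namespaces([ 'http://example.org/ns#a',
%                      'http://example.org/ns#b',
%                      'http://purl.org/dc/elements/1.1/title'
%                    ], Namespaces).
```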