Monday, November 02, 2009

Magic Properties with SPIN

Magic Properties (aka property functions) are a popular extension point supported by many SPARQL engines. Looking like normal triple matches, magic properties can be used to dynamically compute property values even if there are no corresponding triples in the actual model. For example, in the kennedys ontology, a magic property called :grandParent could be used to dynamically find grandparent relationships even though only the direct parent relationships are asserted in the model:


In the example above, the SPARQL engine will use some internal "magic" to figure out which values need to be returned for the ?grandParent variable. How this internal magic is implemented is left to the particular engine, and in the case of Jena this is simply a Java method call in which people can add their own custom code.

These magic properties can be extremely powerful, and often are the only escape mechanism if the default expressivity of SPARQL isn't good enough for a task. However, the current mechanisms for defining such magic properties require low-level (Java) programming and lead to queries that are neither transparent to the end user nor platform independent.

The new version of SPIN, 1.1, addresses those shortcomings and provides a powerful mechanism for defining magic properties entirely in RDF. Let's look at an example first. The following TopBraid Composer 3.2 screenshot shows the definition of the magic property :grandParent:

A magic SPIN property is an instance of the property metaclass spin:MagicProperty. Like regular SPIN functions, magic properties can define arguments that represent the left hand side of the magic property. Here, the argument sp:arg1 has type kennedys:Person, indicating that this property can be applied to subjects of type Person. The results of the right hand side of the triple are computed by the nested SPARQL query specified as spin:body. In this body, the variable ?arg1 represents the Person whose grandparents we are asking for. When a SPIN-aware SPARQL engine hits the triple pattern with the :grandParent predicate in it, it will execute the nested body query and bind the variable on the right with the results of the SELECT query. The results in the first screenshot (JosephKennedy and RoseFitzgerald) have been computed this way: the grandparents are the parents of the parents.

An interesting characteristic of magic properties is that they may be applied in any "direction", i.e. with a variable in the subject position, a variable as object, or even in both places. For example, the :grandParent relationship can be queried to find all grandchildren of RoseFitzgerald (including JohnKennedyJr):

In this case, the nested SPARQL query is simply executed with different bindings (?arg1 is left blank and ?grandParent is pre-bound with RoseFitzgerald). We can also leave both sides blank and the system will return all existing grandparent/grandchild relationships in the whole model.
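
The idea of one definition answering queries in either direction can be sketched outside of SPARQL. The following Python snippet is only an illustration, not how SPIN or Jena implement it: the triples are a tiny made-up excerpt (the intermediate name JohnKennedy is an assumption), and None stands for an unbound variable.

```python
# Illustrative (child, parent) pairs; the intermediate name is an assumption.
PARENT = {
    ("JohnKennedyJr", "JohnKennedy"),
    ("JohnKennedy", "JosephKennedy"),
    ("JohnKennedy", "RoseFitzgerald"),
}


def match_grand_parent(subject=None, obj=None):
    """Yield (grandchild, grandparent) pairs; None means "unbound variable"."""
    for child, parent in PARENT:
        if subject is not None and child != subject:
            continue
        for middle, grand in PARENT:
            if middle != parent:
                continue
            if obj is not None and grand != obj:
                continue
            yield child, grand


# Subject bound, object unbound: who are the grandparents?
print(sorted(match_grand_parent(subject="JohnKennedyJr")))
# Object bound, subject unbound: who are the grandchildren?
print(sorted(match_grand_parent(obj="RoseFitzgerald")))
# Both unbound: all grandchild/grandparent pairs.
print(sorted(match_grand_parent()))
```

The same body is evaluated in all three calls; only the pre-bound values differ, which mirrors how the nested query is executed with different bindings.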

Now comes the really interesting bit. Think about it: this is a mechanism that can be used to define new (SPARQL) predicates entirely based on other SPARQL queries. These SPARQL queries may be recursive and may use other magic properties. The SPARQL engine can walk through the results delivered by one magic property and, while doing this, can consider any other kind of background knowledge that it needs to compute to fulfill its task. If a solution does not lead to new matches, the system will backtrack and try the next one. In traditional rule languages (incl. Prolog) this technique is called backward chaining. Each user-defined magic SPIN property can be regarded as a rule that is fired on demand to support the task of solving a goal (finding matches).

Magic SPIN properties greatly extend the power of SPARQL, and do so in a very transparent and Semantic Web compliant way.

SPARQL Debugger and Profiler

SPARQL queries may get complex and often contain a series of operators such as triple matches, filter clauses, optional statements and unions. In many cases, building correct and efficient queries using those elements requires multiple attempts. This is often because it is difficult to see what the SPARQL engine is doing under the hood.

As of version 3.2, TopBraid Composer (Maestro Edition) introduces a Debugger View that has been designed to help query designers by providing an interactive view into a query at execution time. The Debugger can be used to:
  • Display the internal data structure (algebra) that the SPARQL engine is using to execute the query
  • Walk through the execution of the query
  • Display intermediate variable bindings
  • Execute a query until it reaches a certain break point
  • Collect statistics about the number of query steps involved in every operator
This is a sophisticated feature that can be extremely powerful, but it also requires a decent understanding of how SPARQL works. People with some programming background may find the debugger more useful than beginners.

Background

Let's start with some background on how SPARQL works. The key feature of SPARQL is the WHERE clause, which contains conditions that return variable bindings based on the query graph. Depending on the type of query, these variable bindings are then returned in the SELECT clause, or used to CONSTRUCT new triples. In order to understand the characteristics of a query, we therefore need to focus on the WHERE clause, and modifiers such as ORDER BY.

When a query is processed, the SPARQL engine will first create an internal data structure, called the Algebra. The engine executes this data structure, while the textual representation with keywords such as SELECT and WHERE only serves as the user interface and query exchange language. It is important to understand that different query syntaxes may be converted into the same algebra data structure, and that the same query syntax may be rendered into different algebras depending on the choices of the SPARQL engine. In particular, the engine may decide to optimize certain patterns. Looking at the internal data structure often reveals surprises.

For example, the query
SELECT ?subject ?object
WHERE {
    ?subject rdfs:subClassOf ?object .
    ?subject rdfs:label ?label .
    FILTER (fn:starts-with(?label, "C")) .
}
is converted into a SPARQL Algebra data structure that can be rendered into textual form as follows:
(project (?subject ?object)
  (filter (fn:starts-with ?label "C")
    (bgp
      (triple ?subject rdfs:subClassOf ?object)
      (triple ?subject rdfs:label ?label)
    )))
In this notation, you can see that the SPARQL engine is working with a nested tree structure of operators such as project, filter and bgp. These operators typically have attributes or child elements, for example bgp (Basic Graph Pattern) contains a list of triple patterns to evaluate. Note that it is up to the SPARQL engine to optimize those operators so that they lead to fewer database requests. In particular, multiple triple matches may be combined into a single operation to benefit from server-side joins.

The Algebra operators form a tree structure, in which each node returns an iterator of variable bindings. For example, the basic graph pattern above may return the binding ?label="Matriarch"^^xsd:string, ?object=Person, ?subject=Matriarch. These bindings are served up to the next operator in the processing pipeline. For example, the filter operator will evaluate whether the ?label starts with "C" and, only if it does, pass the same bindings to its parent operation, the project step. Thus, each operator receives a stream of input bindings and may create new bindings for its parent. The final results of the query are the output bindings of the root operator (here, project).

The algorithm of the SPARQL engine (at least the Jena ARQ engine used by Composer) follows the evaluation model above, and builds a nested hierarchy of iterators. Each of those iterators has two main functions:
  1. hasNext to check whether there is any additional result binding available
  2. next to return the next binding and move to the next step
The SPARQL Debugger allows users to trace the evaluation of the SPARQL engine to see when hasNext and next are invoked, and whether they return new variable bindings. When you step through the query manually, the user interface will display arrows in different colors to illustrate the current position, and display variable bindings in a table.
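
To make the evaluation model more tangible, here is a small Python sketch of the same operator tree built from generators. It is an analogy only and not the Jena ARQ code: the triples are made up, and Python's iterator protocol (next plus StopIteration) stands in for the hasNext/next pair.

```python
# Toy triple store; class names and labels are made up for illustration.
TRIPLES = [
    ("Child", "rdfs:subClassOf", "Person"),
    ("Child", "rdfs:label", "Child"),
    ("Matriarch", "rdfs:subClassOf", "Person"),
    ("Matriarch", "rdfs:label", "Matriarch"),
]


def bgp():
    """Join the two triple patterns on ?subject and yield bindings."""
    for s, p, o in TRIPLES:
        if p != "rdfs:subClassOf":
            continue
        for s2, p2, label in TRIPLES:
            if s2 == s and p2 == "rdfs:label":
                yield {"subject": s, "object": o, "label": label}


def filter_op(bindings):
    """Pass on only the bindings whose ?label starts with "C"."""
    for b in bindings:
        if b["label"].startswith("C"):
            yield b


def project(bindings, names):
    """Keep only the projected variables."""
    for b in bindings:
        yield {n: b[n] for n in names}


# The nesting mirrors (project (filter (bgp ...))).
root = project(filter_op(bgp()), ["subject", "object"])
results = list(root)
print(results)
```

Nothing is computed until the root iterator is asked for its next binding; each request then ripples down the tree, which is the behavior the Debugger lets you step through.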


Getting Started with the Debugger View

TopBraid Composer's SPARQL Debugger is accessible from multiple places:
  • In the SPARQL View, use the Debug query button to open the currently edited query in the Debugger view.
  • In the context menu of spin:rule property values on the form, use Debug SPIN query to debug the SPIN rule. Note that this may add a clause ?this rdf:type [Class] to the query, just like the SPIN engine would do.


As shown above, the Debugger is split into multiple areas. The main area on the left displays the query. Two different renderings are available:
  • Algebra displays the internal Jena ARQ data structure
  • SPARQL displays the same query in pseudo-SPARQL, which is derived from the Algebra (reverse engineered if you want) and therefore may look different from the original SPARQL syntax.
You can switch between those two renderings at any time and will get the same features. Depending on your experience, the SPARQL view might be easier to get started, but the Algebra view is much more informative as it provides the real underlying data structure.

Both textual views have a control bar on their left border. This bar displays the current step position with an arrow, and also displays any break points. Double-click on the bar to set or remove break points.

On the right hand side, a table displays current variable bindings. These are only filled if the current execution step is "green", i.e. is returning the next operation with some values.

The tool bar contains buttons to control the behavior of the SPARQL view. The following buttons are available:
  • Run continues the execution of the query until the next break point.
  • Step Over continues the execution of the query to the next step.
  • Profiler activates or de-activates profiling. When activated, the engine will record the triple match queries (find(SPO)) made against the triple source, and display the resulting statistics behind each operation.
  • Layout vertically switches between vertical and horizontal layout. Horizontal layout is better suited for "wide" queries, vertical layout for lots of variable bindings.
  • Preferences opens up the SPARQL Preferences page, which may impact how the SPARQL engine optimizes the query algebra. Changes take effect after restarting the query.

Profiling

If profiling mode has been activated (and, if needed, the query restarted), then the query engine will operate on a modified triple store which, as a side effect of running queries, will also collect statistical information. In particular, this records which find(SPO) queries have been made, and how many triples have been iterated over. The accumulated numbers of those calls (find/triples) will be displayed for each operation in cyan color on the right edge of the text display. In the following example, there are two triple patterns that lead to low-level queries.

In the above screen shot, the upper triple match just leads to a single query (for all rdfs:subClassOf triples), and this query has (so far) returned 15 matches. The second (optional) triple match has been executed 15 times (for each of the former matches), but only five results were found so far. From these numbers you can see that the optional operation is far more "expensive" in terms of number of queries than the subClassOf query.
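
The bookkeeping behind those numbers can be imitated with a thin wrapper around a triple list. This Python sketch is a simplification under assumptions: the CountingStore class, its find signature and the rdfs:label predicate of the optional pattern are invented for illustration, and the data is generated to reproduce the 1/15 and 15/5 figures from the example.

```python
class CountingStore:
    """Wraps a list of triples and counts find calls and triples returned."""

    def __init__(self, triples):
        self.triples = triples
        self.stats = {}  # operator label -> [find calls, triples iterated]

    def find(self, label, s=None, p=None, o=None):
        counts = self.stats.setdefault(label, [0, 0])
        counts[0] += 1
        for t in self.triples:
            # None acts as a wildcard for that position of the pattern.
            if all(want is None or want == got for want, got in zip((s, p, o), t)):
                counts[1] += 1
                yield t


# 15 subclass triples, but only 5 of the subjects have a label (made-up data).
triples = [(f"C{i}", "rdfs:subClassOf", "Thing") for i in range(15)]
triples += [(f"C{i}", "rdfs:label", f"Class {i}") for i in range(5)]
store = CountingStore(triples)

for s, _, _ in store.find("subClassOf", p="rdfs:subClassOf"):
    # OPTIONAL-style lookup: one find call per match of the outer pattern.
    labels = list(store.find("optional label", s=s, p="rdfs:label"))

print(store.stats)
```

The outer pattern costs a single find call, while the inner one is called once per outer match, which is why the optional part dominates the query count.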

To summarize, TopBraid's new SPARQL debugger is a powerful low-level tool for people who want to understand the execution behavior of a query or SPIN rule in the SPARQL engine. The debugger and its profiler can provide valuable insights that may lead to much more efficient queries.

Friday, October 30, 2009

Fixing Constraint Violations with spin:fix

Constraint checking is a popular feature of many Semantic Web tools to ensure that instance data meets the design objectives attached to classes and properties in an ontology. Data entry tools like TopBraid Composer and Ensemble use the SPARQL-based constraint checking language SPIN to make sure that users get warnings if the data they are entering is violating constraints. If a violation is reported, the user would read through the violation message and then change the data on the form.

One of the new features of SPIN 1.1 is the property spin:fix which can be used to let the system suggest operations that would repair a constraint violation automatically. The basic idea is that spin:fix can be attached to the spin:ConstraintViolation produced by a spin:constraint query to link the violation with one or more SPARQL Update requests. Those SPARQL Update requests may INSERT or DELETE triples to create a state in which the constraint is no longer violated. If spin:fixes are created, then the user interface may suggest them to the user with a single click. In the TopBraid Composer 3.2 screenshot below, the resource InvalidSquare violates the constraint that instances of Square must have equal width and height:

The following spin:constraint (attached to the Square class) implements the suggestions above:

You can see that the constraint creates a spin:ConstraintViolation that is attached to two instances of :SetObject via spin:fix. The class :SetObject is a SPIN template with the following definition:

The SPIN templates suggested as fixes could come from a library of re-usable building blocks. I assume that most constraint violations can be repaired by replacing, adding or deleting certain triples, and these cases can be generalized easily. Furthermore, the spin:constraints themselves can be generalized into templates, so that the ontology designer in the end just needs to pick the correct template to get a really powerful and convenient constraint checking mechanism.
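
As a rough analogy for how a violation can carry its own repair suggestions, consider the following Python sketch. It does not use the SPIN vocabulary: the property names, the numeric values and the representation of a fix as a (delete, insert) pair are assumptions chosen to resemble SPARQL Update requests.

```python
def check_square(triples, square):
    """Return a list of violations, each carrying suggested fixes."""
    values = {p: o for s, p, o in triples if s == square}
    width, height = values.get("width"), values.get("height")
    if width == height:
        return []
    return [{
        "message": "Width and height of a Square must be equal",
        "fixes": [
            # Each fix is a pair of (triples to delete, triples to insert).
            ([(square, "height", height)], [(square, "height", width)]),
            ([(square, "width", width)], [(square, "width", height)]),
        ],
    }]


def apply_fix(triples, fix):
    """Apply one suggested fix and return the new set of triples."""
    deletes, inserts = fix
    return (set(triples) - set(deletes)) | set(inserts)


data = {("InvalidSquare", "width", 10), ("InvalidSquare", "height", 20)}
violation = check_square(data, "InvalidSquare")[0]
repaired = apply_fix(data, violation["fixes"][0])
print(check_square(repaired, "InvalidSquare"))  # no violations remain
```

Either of the two suggested fixes leads to a state in which the check passes; which one is applied is left to the user, just as with the single-click suggestions in the user interface.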

Also note that the constraint fixing mechanism above could be useful beyond user interfaces, for example to automatically repair incoming data streams in web service calls.

Tuesday, October 27, 2009

OWL 2 Support in TopBraid Composer

Following today's announcement that OWL 2 has become an official W3C recommendation, I am pleased to announce that TopBraid Composer 3.2 (to be published by the end of this week) has comprehensive OWL 2 support. Here is a sampler of some of the new capabilities.

Property Chain Axioms can be used to define relationships between multiple properties, for example to define that an uncle is the brother of a parent. In OWL 2 mode, TopBraid's Properties form contains a new widget for entering such property chains using owl:propertyChainAxiom. Another new kind of property axiom, owl:propertyDisjointWith, can be edited on the same page.
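
For readers new to property chains, the uncle example boils down to composing two relations. The Python sketch below only illustrates that reading; the individuals are invented and no OWL reasoner is involved.

```python
def chain(first, second):
    """Compose two relations given as sets of (subject, object) pairs."""
    return {(a, c) for a, b in first for b2, c in second if b == b2}


# Made-up individuals: Alice's parent is Bob, and Bob's brother is Charlie.
has_parent = {("Alice", "Bob")}
has_brother = {("Bob", "Charlie")}

# hasUncle is implied by the chain hasParent followed by hasBrother.
has_uncle = chain(has_parent, has_brother)
print(has_uncle)
```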


User-Defined Datatypes are a mechanism for narrowing down datatype properties to specific value ranges, such as integer > 0. In typical cases, such datatypes are entered as allValuesFrom restrictions on the class form. We use the Manchester Syntax for that purpose:

In addition to class axioms, user-defined datatypes can also be used as global rdfs:ranges:


OWL 2 Class Axioms including qualified cardinality restrictions and all other features supported by the Manchester Syntax can be entered on the class forms:


All other OWL 2 extensions such as new property meta-classes, keys and syntactic sugar can be edited through the generic RDF editing capabilities of TopBraid - the extended OWL 2 system vocabulary has been very helpful for this. Of course, TopBraid Composer can load and save any OWL 2 file in formats such as RDF/XML or Turtle.

At the time of writing this, I am not aware of any OWL 2 compliant inference engines that we could freely distribute with TopBraid Composer. Currently available options include OWL RL engines such as SPIN or Oracle 11g RDF. I am sure more will follow, and anyone in the community is invited to contribute plug-ins to those inference engines that we cannot legally ship with our platform, as separate downloads.

The new OWL 2 support is available in all editions of TopBraid Composer, including the Free Edition.

Monday, October 19, 2009

RDFex: Partial ontology imports

One of the overall design goals of Linked Data and the Semantic Web is vocabulary re-use. Instead of having thousands of "Person" classes, new ontologies should attempt to re-use existing Person definitions, such as those found in the FOAF namespace. This schema re-use makes it easier for Semantic Web agents to link data together, and potentially reduces the maintenance costs as it becomes possible to benefit from the whole infrastructure and community around those shared ontologies.

However, there are a couple of well-known reasons why such a re-use is not always feasible or desirable, leading to situations in which developers feel they need to reinvent the wheel. One particular problem is that the OWL construct for linking vocabularies (owl:imports) has all-or-nothing semantics: if my ontology owl:imports the FOAF namespace, then I would suck all definitions of FOAF into my own model, even though I just care about one or two concepts. The result is that in Semantic Web inference engines, browsers or editors, my ontology will be full of definitions that are just distracting, or unnecessarily increase the complexity of my model. Since owl:imports is not the ideal mechanism, people sometimes simply extract term definitions from other files and paste them into their own files - look for example at the bottom of this file. This, of course, leads to other maintenance problems and is generally not a clean approach.

For an internal project in which I wanted to re-use parts of the SIOC namespace, I have implemented a new web service called rdfex.org. RDFex is a very simple, yet IMHO elegant, mechanism for using owl:imports to import snippets of other namespaces without having to copy and paste them. The basic idea is that the rdfex.org server can be used as a proxy for various popular ontologies (including DC, FOAF and SIOC), so that users can specify which classes, properties and individuals from those namespaces they would like to import (using owl:imports). For example, the proxy ontology http://rdfex.org/foaf/Person,firstName represents all triples defining the class foaf:Person and the property foaf:firstName, including their rdf:type, rdfs:labels, rdfs:comments and any relationships between those terms (such as rdfs:domain and rdfs:range). Any combination of those resources is available because the result will be dynamically assembled at request time.
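
Assembling such a proxy URI is plain string work. The helper below is a hypothetical convenience function, and its URI pattern is inferred solely from the single example above, so treat it as an assumption rather than a specification of the service.

```python
def rdfex_import_uri(vocabulary, terms):
    """Build a proxy ontology URI such as http://rdfex.org/foaf/Person,firstName."""
    if not terms:
        raise ValueError("at least one term is required")
    # Pattern assumed from the example: base / vocabulary / comma-separated terms.
    return "http://rdfex.org/" + vocabulary + "/" + ",".join(terms)


print(rdfex_import_uri("foaf", ["Person", "firstName"]))
```

The resulting string would then be used as the object of an owl:imports statement in your own ontology.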

The upcoming release of the TopBraid platform 3.2 also has native support for those rdfex imports, so that the system can do this extraction from local copies instead of having to go to the web. TopQuadrant is committed to supporting this service in the future, so please feel free to use it if you find it useful.

Friday, September 04, 2009

Currency conversion with the Units Ontology, SPARQLMotion and SPIN

As the Linked Data cloud is steaming ahead, the first large online stores (recently: BestBuy.com) are publishing their products in a machine-friendly RDF format. In order to exchange product information in a meaningful way, the various product vendors should use a shared (or at least mappable) vocabulary to represent prices, so that internet search engines and crawlers can better compare values. In the real world, three-letter currency codes such as USD, EUR and AUD are being used. These abbreviations are a standard vocabulary, but having just a string representation is an error-prone strategy - e.g. the string "AMD" could be either the Armenian Dram currency, or a computer chip manufacturer.

We have recently published a comprehensive units ontology (QUDT) that also includes all of the world's currencies. For each currency, a globally unique identifier (URI) is used. These URI resources also point back to the currency abbreviation, using a property called qud:abbreviation. A good strategy for ontology designers would be to use those URIs as ranges of their properties, instead of string codes. Using those standard URIs will have some benefits down the road...

This morning I have created a SPIN library of currency conversion functions that can be used to convert between various currencies, using the very latest conversion factors. These functions can be used in SPARQL queries or SPIN rules. For example, we have a function currencies:getRateByCodes that gets the latest exchange rate between two currencies, specified by the three-letter codes:


This particular SPIN function is backed by a SPARQLMotion script that takes two currency codes (arg1 and arg2) as input and then calls an external web service to retrieve the latest exchange rate. The web service delivers XML, and a simple XPath expression is used to extract the value into an xsd:float RDF literal:

We can drill into the Call web service module to see that it is simply calling a REST web service, inserting the two arguments into the URL string:


Based on this low-level function, we can define additional higher-level functions. The following screenshot shows the complete definition of a SPIN function that takes two qud:CurrencyUnits as input, then gets their respective abbreviations, and finally calls the getRateByCodes function:


Whenever the function is called, any SPIN-aware SPARQL engine will simply execute the SELECT query specified as the body of the function. This means that new functions are built by combining other functions. The same mechanism was used to define the convertLiteral function, as shown below:

The function above makes good use of the QUDT units ontology to shield the user from any low-level details. All the user needs to do is to specify the range of a property to be, say, unit:USDollar, and then all future values of that property will be stored as literals with that unit as datatype. The SPIN function can then look at the literal to find out about its currency. Then, the function can look up the unit's abbreviation and make the corresponding web service calls to fetch the current exchange rate. Finally, a simple multiplication returns the desired new value.
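
The chain of steps (read the unit from the literal, look up the abbreviations, fetch the rate, multiply) can be outlined in a few lines of Python. This is a sketch under assumptions: the exchange rate is a made-up constant instead of a web service call, literals are modeled as (value, unit) pairs, and unit:AustralianDollar is a hypothetical identifier.

```python
def get_rate_by_codes(from_code, to_code):
    """Hypothetical stand-in for the web service call; the rate is made up."""
    rates = {("USD", "AUD"): 1.25}
    return rates[(from_code, to_code)]


# Unit -> three-letter abbreviation, as the ontology would provide it.
# unit:AustralianDollar is an assumed name used only for this illustration.
ABBREVIATION = {"unit:USDollar": "USD", "unit:AustralianDollar": "AUD"}


def convert_literal(literal, target_unit):
    """Convert a (value, unit) pair into the target currency unit."""
    value, source_unit = literal
    rate = get_rate_by_codes(ABBREVIATION[source_unit], ABBREVIATION[target_unit])
    return (value * rate, target_unit)


print(convert_literal((100.0, "unit:USDollar"), "unit:AustralianDollar"))
```

Because the unit travels with the value, the caller never has to mention the source currency explicitly.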

Once these functions are defined, they can be used in various places. For example, here is a class Product, which has two properties: one stores the price in US Dollars, and another represents the price in Australian Dollars. Only the USD price needs to be entered, because the class also has a SPIN rule attached to it which will automatically do this computation as an inference:

An example Product instance, with the inferred (blue) AUD price is shown below:

These rules can of course be generalized further into SPIN templates, or the range of the target property could be used instead of hard-coding the target currency anywhere. There are really endless possibilities here.

Monday, August 31, 2009

Units ontology with SPIN support published

My co-workers at TopQuadrant have just published a new OWL ontology about Quantities, Units, Dimensions and Datatypes (QUDT). This is the result of a long-term, ongoing project with NASA Ames, and our friends at NASA have permitted us to publish those ontologies to encourage wider use outside of NASA.

The QUDT ontology is very carefully designed and provides comprehensive coverage of almost every unit of measurement that is known to humankind. For example, it defines the unit Centimeter as follows:

Each unit has a stable URI, making it possible to link to it from your own domain models in a reliable way. For each unit, the ontology defines some useful metadata including abbreviation, a link to DBpedia and a categorization of units into groups, such as length units.

I think this units ontology can fill an important gap in the current Semantic Web and Linked Data efforts. Numeric data without any formalized units is pretty useless for machines, and sometimes even for humans. Currently, a unit may be mentioned somewhere hidden in a comment or not at all, but the QUDT ontology allows ontology designers to clearly specify these implicit assumptions. With explicitly modeled units, linked data can be processed and transformed in much more useful ways. For example, if a height is specified in Centimeters, then a smart linked data browser can automatically translate it into Feet for US American readers.

There are two main ways of using the units ontology. You can use the unit resources to "annotate" your properties with a dedicated property such as qud:units. The values of your property would then use built-in datatypes such as xsd:double. The alternative is to embed the unit directly into the literals. For this use case, all units have also been declared to be rdfs:Datatypes. This makes it possible to assign units as rdfs:ranges of a property as shown below:

Here, the property height has the range unit:Centimeter. An example instance would then show up like this:


The specific height will then be stored in RDF literals such as "8380"^^unit:Centimeter. (The upcoming version 3.2 of TopBraid Composer will show the unit in parentheses behind the property name, but I didn't want to play tricks here.)

Now that the units have been formalized in an ontology, new ways of working with numeric data become possible. As described earlier, the SPIN framework can be used to define new SPARQL functions which can then be used to do things like unit conversion. We have published a SPIN Library which contains some generic unit conversion functions. For example, the qudspin:convert function can be used to convert any numeric value from a source unit (here: unit:Centimeter) to a target unit (unit:Foot):


If the units are used as datatypes, then the function qudspin:convertLiteral can be used, saving one argument in the function call:


In the following example, we iterate over all instances with a :height property and display the height (in cm) as well as the converted height (in feet) using the SPIN function:


Such unit conversion tasks have been made possible by adding conversion multipliers to the QUDT ontology. SPIN functions can use this extra metadata to drive mathematical computations. The function qudspin:convert is backed by a SPARQL query as shown below:

The SPIN framework makes it possible to define such SPARQL functions (and rules and constraints) in a completely declarative way. No extra hard-coding of anything is needed. Any SPIN-aware SPARQL engine can simply look up the definition of the qudspin:convert function on the web and learn about the underlying mathematics. Likewise, there is no need for humans to worry about the calculations themselves - they can treat the SPIN function as a black box.
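
To give a feel for the arithmetic that conversion multipliers enable, here is a small Python sketch. It assumes that each unit carries a multiplier to a common base unit (metres here) and ignores offsets; the dictionary and function names are invented and are not the QUDT or qudspin definitions.

```python
# Assumed conversion multipliers to the base unit (metres).
MULTIPLIER = {"unit:Centimeter": 0.01, "unit:Foot": 0.3048}


def convert(value, source_unit, target_unit):
    """Convert via the base unit: scale up by the source, down by the target."""
    return value * MULTIPLIER[source_unit] / MULTIPLIER[target_unit]


def convert_unit_literal(literal, target_unit):
    """Same conversion, with the source unit taken from a (value, unit) pair."""
    value, source_unit = literal
    return convert(value, source_unit, target_unit)


# The height literal "8380"^^unit:Centimeter from above, expressed in feet.
print(round(convert_unit_literal((8380, "unit:Centimeter"), "unit:Foot"), 2))
```

Adding a new unit only requires a new multiplier entry; the conversion logic itself stays untouched, which is the point of keeping this metadata in the ontology.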

The easiest way to play with this (for example using TopBraid Composer Free Edition) is to add an owl:imports statement to http://topbraid.org/spin/qudspin

Friday, August 21, 2009

Ontology Mapping with SPIN Templates

The question of how to transform data from one ontology to another comes up again and again, most recently in a question on the W3C Semantic Web list. The requirement is very real: for example, assume you have a class Person (with firstName/lastName) and a class Member (with fullName), and you want to construct one Member for each Person, so that the fullName is derived by concatenating firstName + " " + lastName. So basically you want to transform some (legacy) data into a format that some other application can understand.

Ideally, there should be a reusable standard mapping ontology for this purpose, which is also executable and user-friendly in visual editing tools. I am not aware of such a standard ontology, but I know how it could be built. Clearly, the typical complexity of such mapping tasks goes beyond what is provided by modeling languages like OWL. A graph matching language like SPARQL with rich built-in functions will be better suited. SPARQL CONSTRUCT queries can be used to define such mappings, as described on this blog three years ago.

The SPARQL Inferencing Notation (SPIN) provides a framework for organizing such SPARQL CONSTRUCT queries in a way that is easy to maintain and efficient to execute. In the following example I will walk through the steps needed to create a generic mapping ontology for tasks such as the one above, using SPIN Templates. The example is intentionally kept very simple. The resulting file can be downloaded here and you can use TopBraid Composer (even the Free Edition) to execute it.

Let's assume we have two ontologies, person and member with the following classes:

An example instance of the source ontology may look like the following, with values for firstName and lastName filled in:

SPARQL can be used to create a mapping so that all instances of Person become Members, with a fullName derived from firstName and lastName. We would need two CONSTRUCT queries: one that adds the rdf:type triple to make the Persons also Members, and one that concatenates the firstName and lastName values into the fullName. You could attach those CONSTRUCT queries as SPIN rules to the classes as shown below. Note that the variable ?this means "for every instance of the class Person".

This mechanism works fine: we can press the inferences button to run the SPIN rule engine and it will create the new RDF triples:

We can see that the Person is now also a Member with a fullName:

However, the solution above requires that the person creating the mapping is familiar with SPARQL. Additionally, the transformations cannot easily be reused, and similar SPARQL queries need to be entered the next time a string concatenation is required.

SPIN Templates can be used to encapsulate SPARQL queries so that they can be reused and edited easily. In the screenshot below I have replaced the hard-coded SPARQL queries with two SPIN template calls, which actually do the same but in a much nicer way:

Another way of visualizing these is using TopBraid's Diagram facility:


Let's look behind the scenes. The two entries under the spin:rule property are now SPIN Template Calls. A Template Call is an instance of a SPIN Template, but with arguments filled in. Here is the definition of the first SPIN Template, the concatenation rule:

The SPIN Template above is wrapping a SPARQL CONSTRUCT query (under spin:body). Templates can take arguments (under spin:constraint), which define how the template can be invoked. The values of the arguments will be "inserted" as variable bindings into the SPARQL query. In the example above, there are three arguments (sourceProperty1, sourceProperty2 and targetProperty) which are referenced in the body query as variables ?sourceProperty1 etc. In order to use such a template, the user simply needs to select the source class, go to "Create from SPIN Template..." under spin:rule, and fill in the arguments, as shown below.

The resulting Template Call will be associated with the class Person as a spin:rule, so that the SPIN (mapping) engine will infer the same new triples. The main achievement though is that the string concatenation module has now been generalized and could be reused in other ontologies. Since SPIN Templates are represented entirely in RDF, they can be shared on the web. Creating a library of such mapping modules would be a great topic for a Master's Thesis...
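
The relationship between a template, its arguments and the resulting rule can be mimicked with a function that returns a function. The Python sketch below is an analogy only: the dictionary representation of instances, the example names and the run_rules helper are assumptions, and real SPIN templates are RDF resources wrapping SPARQL queries.

```python
def concat_template(source_property1, source_property2, target_property):
    """Return a rule that fills target_property from two source properties."""
    def rule(instance):
        full = instance[source_property1] + " " + instance[source_property2]
        return {target_property: full}
    return rule


# Two "template calls" attached to Person, analogous to the two spin:rules.
person_rules = [
    concat_template("firstName", "lastName", "fullName"),
    lambda instance: {"rdf:type": "Member"},
]


def run_rules(instance, rules):
    """Collect the values inferred by all rules for one instance."""
    inferred = {}
    for rule in rules:
        inferred.update(rule(instance))
    return inferred


print(run_rules({"firstName": "John", "lastName": "Doe"}, person_rules))
```

The concatenation logic is written once and parameterized by property names, so another ontology could reuse it with different arguments.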