Monday, November 02, 2009

Magic Properties with SPIN

Magic Properties (aka property functions) are a popular extension point supported by many SPARQL engines. Looking like normal triple matches, magic properties can be used to dynamically compute property values even if there are no corresponding triples in the actual model. For example, in the kennedys ontology, a magic property called :grandParent could be used to dynamically find grandparent relationships even though only the direct parent relationships are asserted in the model:
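For instance, a query against this magic property might look like the following sketch (the resource name kennedys:JohnKennedyJr is taken from the example discussed below and is illustrative):

```sparql
# Sketch: a magic property used like a normal predicate.
# No :grandParent triples are asserted - a SPIN-aware engine
# computes the bindings for ?grandParent dynamically.
SELECT ?grandParent
WHERE {
    kennedys:JohnKennedyJr :grandParent ?grandParent .
}
```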


In the example above, the SPARQL engine will use some internal "magic" to figure out which values need to be returned for the ?grandParent variable. How this internal magic is implemented is left to the particular engine, and in the case of Jena this is simply a Java method call in which people can add their own custom code.

These magic properties can be extremely powerful, and are often the only escape mechanism if the default expressivity of SPARQL isn't good enough for a task. However, the current mechanisms for defining such magic properties require low-level (Java) programming and lead to queries that are neither transparent to the end user nor platform independent.

The new version of SPIN 1.1 addresses those shortcomings and provides a powerful mechanism for defining magic properties entirely in RDF. Let's look at an example first. The following TopBraid Composer 3.2 screenshot shows the definition of the magic property :grandParent:

A magic SPIN property is an instance of the property metaclass spin:MagicProperty. Like regular SPIN functions, magic properties can define arguments that represent the left hand side of the magic property. Here, the argument sp:arg1 has type kennedys:Person, indicating that this property can be applied to subjects of type Person. The results of the right hand side of the triple are computed by the nested SPARQL query specified as spin:body. In this body, the variable ?arg1 represents the Person whose grandparents we are asking for. When a SPIN-aware SPARQL engine hits a triple pattern with the :grandParent predicate in it, it will execute the nested body query and bind the variable on the right with the results of the SELECT query. The results in the first screenshot (JosephKennedy and RoseFitzgerald) have been computed this way: the grandparents are the parents of the parents.
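The nested body query might look roughly like the following sketch (the property name kennedys:parent is an assumption about the kennedys ontology):

```sparql
# Possible spin:body of :grandParent. When the magic property is
# evaluated, ?arg1 is pre-bound to the subject of the triple pattern.
SELECT ?grandParent
WHERE {
    ?arg1 kennedys:parent ?parent .
    ?parent kennedys:parent ?grandParent .
}
```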

An interesting characteristic of magic properties is that they can be applied in any "direction", i.e. with a variable in the subject position, in the object position, or even in both. For example, the :grandParent relationship can be queried to find all grandchildren of RoseFitzgerald (including JohnKennedyJr):
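Such a query might look like this sketch, with the variable now on the subject position:

```sparql
# The magic property queried in the "reverse" direction:
# ?grandChild is unbound, the object is a concrete resource.
SELECT ?grandChild
WHERE {
    ?grandChild :grandParent kennedys:RoseFitzgerald .
}
```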

In this case, the nested SPARQL query is simply executed with different bindings (?arg1 is left unbound and ?grandParent is pre-bound with RoseFitzgerald). We can also leave both sides unbound, and the system will return all existing grandparent/grandchild relationships in the whole model.

Now comes the really interesting bit. Think about it: this is a mechanism that can be used to define new (SPARQL) predicates entirely based on other SPARQL queries. These SPARQL queries may be recursive and may use other magic properties. The SPARQL engine can walk through the results delivered by one magic property and, while doing so, can draw on any other background knowledge it needs to compute along the way. If a solution does not lead to new matches, the system will backtrack and try the next one. In traditional rule languages (including Prolog) this technique is called backward chaining. Each user-defined magic SPIN property can be regarded as a rule that is fired on demand to support the task of solving a goal (finding matches).
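To illustrate the recursive case, here is a sketch of what the spin:body of a hypothetical :ancestor magic property could look like (kennedys:parent is again an assumed property name; this property is not part of the kennedys example):

```sparql
# An ancestor is either a parent, or an ancestor of a parent.
# The body refers back to the magic property :ancestor itself,
# so the engine resolves it recursively, backtracking as needed.
SELECT ?ancestor
WHERE {
    {
        ?arg1 kennedys:parent ?ancestor .
    }
    UNION
    {
        ?arg1 kennedys:parent ?parent .
        ?parent :ancestor ?ancestor .
    }
}
```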

Magic SPIN properties greatly extend the power of SPARQL, and do so in a very transparent and Semantic Web compliant way.

SPARQL Debugger and Profiler

SPARQL queries may get complex and often contain a series of operators such as triple matches, filter clauses, optional statements and unions. In many cases, building correct and efficient queries using those elements requires multiple attempts. This is often because it is difficult to see what the SPARQL engine is doing under the hood.

As of version 3.2, TopBraid Composer (Maestro Edition) introduces a Debugger View that has been designed to help query designers by providing an interactive view into a query at execution time. The Debugger can be used to:
  • Display the internal data structure (algebra) that the SPARQL engine is using to execute the query
  • Walk through the execution of the query
  • Display intermediate variable bindings
  • Execute a query until it reaches a certain break point
  • Collect statistics about the number of query steps involved in every operator
This is a sophisticated feature that can be extremely powerful, but it also requires a decent understanding of how SPARQL works. People with some programming background may find the debugger more useful than beginners.

Background

Let's start with some background on how SPARQL works. The key feature of SPARQL is the WHERE clause, which contains graph patterns that are matched against the queried graph to produce variable bindings. Depending on the type of query, these variable bindings are then returned via the SELECT clause, or used to CONSTRUCT new triples. In order to understand the characteristics of a query, we therefore need to focus on the WHERE clause, and modifiers such as ORDER BY.
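As a schematic illustration, the same WHERE clause can feed either query form; only the head differs (the property names kennedys:parent and kennedys:child are assumptions):

```sparql
# SELECT returns the variable bindings as a result table ...
SELECT ?person ?parent
WHERE {
    ?person kennedys:parent ?parent .
}

# ... while CONSTRUCT uses the same bindings to build new triples.
CONSTRUCT {
    ?parent kennedys:child ?person .
}
WHERE {
    ?person kennedys:parent ?parent .
}
```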

When a query is processed, the SPARQL engine will first create an internal data structure, called the algebra. It is this data structure that the engine actually executes; the textual representation with keywords such as SELECT and WHERE only serves as the user interface and query exchange language. It is important to understand that different query syntaxes may be converted into the same algebra data structure, and that the same query syntax may be rendered into different algebras depending on the choices of the SPARQL engine. In particular, the engine may decide to optimize certain patterns. Seeing the internal data structure can therefore often lead to surprising results.

For example, the query
SELECT ?subject ?object
WHERE {
    ?subject rdfs:subClassOf ?object .
    ?subject rdfs:label ?label .
    FILTER (fn:starts-with(?label, "C")) .
}
is converted into a SPARQL Algebra data structure that can be rendered into textual form as follows:
(project (?subject ?object)
  (filter (fn:starts-with ?label "C")
    (bgp
      (triple ?subject rdfs:subClassOf ?object)
      (triple ?subject rdfs:label ?label)
    )))
In this notation, you can see that the SPARQL engine is working with a nested tree structure of operators such as project, filter and bgp. These operators typically have attributes or child elements; for example, bgp (Basic Graph Pattern) contains a list of triple patterns to evaluate. Note that it is up to the SPARQL engine to optimize those operators so that they lead to fewer database requests. In particular, multiple triple matches may be combined into a single operation to benefit from server-side joins.

The algebra operators form a tree structure, in which each node returns an iterator of variable bindings. For example, the basic graph pattern above may return the binding ?label="Matriarch"^^xsd:string, ?object=Person, ?subject=Matriarch. These bindings are served up to the next operator in the processing pipeline. For example, the filter operator will evaluate whether the ?label starts with "C": bindings that pass are handed on unchanged to its parent operation, the project step, while bindings that fail (such as "Matriarch" above) are discarded. Thus, each operator receives a stream of input bindings and may create new bindings for its parent. The final results of the query are the output bindings of the root operator (here, project).

The algorithm of the SPARQL engine (at least the Jena ARQ engine used by Composer) follows the evaluation model above, and builds a nested hierarchy of iterators. Each of those iterators has two main functions:
  1. hasNext to check whether there is any additional result binding available
  2. next to return the next binding and move to the next step
The SPARQL Debugger allows users to trace the evaluation of the SPARQL engine to see when hasNext and next are invoked, and whether they return new variable bindings. When you step through the query manually, the user interface will display arrows in different colors to illustrate the current position, and display variable bindings in a table.


Getting Started with the Debugger View

TopBraid Composer's SPARQL Debugger is accessible from multiple places:
  • In the SPARQL View, use the Debug query button to open the currently edited query in the Debugger view.
  • In the context menu of spin:rule property values on the form, use Debug SPIN query to debug the SPIN rule. Note that this may add a clause ?this rdf:type [Class] to the query, just like the SPIN engine would do.


As shown above, the Debugger is split into multiple areas. The main area on the left displays the query. Two different renderings are available:
  • Algebra displays the internal Jena ARQ data structure
  • SPARQL displays the same query in pseudo-SPARQL, which is derived from the algebra (reverse-engineered, if you will) and therefore may look different from the original SPARQL syntax.
You can switch between those two renderings at any time and get the same features. Depending on your experience, the SPARQL view might be easier to get started with, but the Algebra view is much more informative as it shows the real underlying data structure.

Both textual views have a control bar on their left border. This bar displays the current step position with an arrow, and also displays any break points. Double-click on the bar to set or remove break points.

On the right hand side, a table displays the current variable bindings. These are only filled if the current execution step is "green", i.e. the current operation is returning its next binding with some values.

The tool bar contains buttons to control the behavior of the SPARQL view. The following buttons are available:
  • Run continues the execution of the query until the next break point.
  • Step Over continues the execution of the query to the next step.
  • Profiler activates or de-activates profiling. When activated, the engine will record the triple match (find(SPO)) queries made against the triple store, and display the resulting statistics next to each operation.
  • Layout vertically switches between vertical and horizontal layout. Horizontal layout is better suited for "wide" queries, vertical layout for queries with many variable bindings.
  • Preferences opens up the SPARQL Preferences page, which may impact how the SPARQL engine optimizes the query algebra. Changes will take effect after restarting the query.

Profiling

If profiling mode has been activated (and, if needed, the query restarted), then the query engine will operate on a modified triple store which, as a side effect of running queries, will also collect statistical information. In particular, this records which find(SPO) queries have been made, and how many triples have been iterated over. The accumulated numbers of those calls (find/triples) will be displayed for each operation in cyan color on the right edge of the text display. In the following example, there are two triple patterns that lead to low-level queries.
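For example, a query of the following shape (a sketch, not the actual query from the screenshot) would produce two such find/triples counters, one per triple pattern:

```sparql
# The OPTIONAL triple match is re-executed once for every
# binding produced by the subClassOf match, which the profiler
# makes visible in the find/triples counters.
SELECT ?subClass ?superClass ?label
WHERE {
    ?subClass rdfs:subClassOf ?superClass .
    OPTIONAL {
        ?subClass rdfs:label ?label .
    }
}
```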

In the above screenshot, the upper triple match leads to just a single query (for all rdfs:subClassOf triples), and this query has (so far) returned 15 matches. The second (optional) triple match has been executed 15 times (once for each of the former matches), but only five results have been found so far. From these numbers you can see that the optional operation is far more "expensive", in terms of the number of queries, than the subClassOf match.

To summarize, TopBraid's new SPARQL debugger is a powerful low-level tool for people who want to understand the execution behavior of a query or SPIN rule in the SPARQL engine. The debugger and its profiler can provide valuable insights that may lead to much more efficient queries.