Friday, June 27, 2008

SPARQL Functions in Motion

SPARQL is well established as the standard query language for the Semantic Web. Comparable to SQL, SPARQL provides the SELECT keyword to extract information out of an RDF/OWL repository. SPARQL also provides the CONSTRUCT keyword to construct new triples from existing ones, making SPARQL an attractive solution to defining ontology mappings or rule bases.

As so often with W3C standards, the official specifications take you 80% of the way to where you really want to be, while the remaining 20% are often non-standard extensions that make the technology really useful in real-world applications. In the case of SPARQL, many implementations already support some form of SPARQL Update Language with keywords such as INSERT and DELETE, leading to de-facto standards that will hopefully be folded into the official standard in the next iterations. Another extremely useful extension has recently been implemented by Andy Seaborne in the Jena ARQ SPARQL engine: LET Assignments. Here is an example derived from Andy's blog:

SELECT ?area
WHERE {
    ?x rdf:type :Rectangle ;
       :height ?h ;
       :width ?w .
    LET (?area := ?h * ?w) .
}

The LET command can be used to create new values out of existing values. The syntax of the right hand side of the assignment provides the same expressivity as FILTER expressions, which are well covered by the standard. What makes this LET command so attractive is that it greatly extends the expressiveness of SPARQL, especially when using the CONSTRUCT or INSERT keywords. We can slightly modify the example above to define a rule that automatically infers an area triple from height and width:

CONSTRUCT {
    ?x :area ?area .
}
WHERE {
    ?x rdf:type :Rectangle ;
       :height ?h ;
       :width ?w .
    LET (?area := ?h * ?w) .
}
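To make the effect of this rule concrete, here is a minimal Python sketch that mimics it over an in-memory list of triples. It is not how ARQ evaluates the query. The tuple representation, the sample data and the name infer_areas are assumptions made purely for illustration, and the sketch assumes at most one height and one width per resource.

```python
# Hypothetical illustration: triples are plain (subject, predicate, object) tuples.
RDF_TYPE = "rdf:type"

def infer_areas(triples):
    """Mimic the CONSTRUCT rule: emit (?x, ':area', ?h * ?w) for every rectangle."""
    heights = {s: o for s, p, o in triples if p == ":height"}
    widths = {s: o for s, p, o in triples if p == ":width"}
    rectangles = [s for s, p, o in triples if p == RDF_TYPE and o == ":Rectangle"]
    # The multiplication plays the role of LET (?area := ?h * ?w).
    return [(x, ":area", heights[x] * widths[x])
            for x in rectangles if x in heights and x in widths]

data = [
    (":r1", RDF_TYPE, ":Rectangle"),
    (":r1", ":height", 3),
    (":r1", ":width", 4),
    (":r2", RDF_TYPE, ":Rectangle"),
    (":r2", ":height", 5),  # no width, so the graph pattern does not match
]
inferred = infer_areas(data)
```

Only :r1 yields an area triple here, because :r2 lacks a width and therefore never matches the WHERE pattern.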


In addition to simple arithmetic expressions such as the one above, SPARQL also defines a collection of built-in functions such as bound, isBlank, lang, and str. The Jena ARQ library adds many more, including string functions.

As of version 2.6.0, our RDF/OWL development platform TopBraid Suite, which is based on Jena, includes support for LET assignments and greatly extends this mechanism. We have added a comprehensive library of additional SPARQL functions, the SPARQLMotion Functions. Among other things, these functions can be used to build URIs from other names, cast values between datatypes, analyze the class structure, extract sub-strings and convert resources into human-readable names. Many more such functions will be added in future versions, in response to the use cases that we encounter in practice. TopBraid Composer provides a convenient auto-complete and context help feature to use these functions as shown in the next screenshot.



Java programmers can use the Jena API to add new functions if the provided functions are insufficient. TopBraid Composer users can also use SPARQLMotion to define new functions and make them available to any SPARQL query. I have recently uploaded an example SPARQLMotion function definition. The following screenshot, taken from this example, shows that the new function takes some input string and extracts the text in parentheses:
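Independently of the screenshot, the behavior described for this function can be sketched in a few lines of Python. This is not the SPARQLMotion definition itself. The name text_in_parentheses and the choice to return the first non-nested pair of parentheses are assumptions for illustration.

```python
import re

def text_in_parentheses(value):
    """Return the text inside the first pair of parentheses, or None if there is none."""
    match = re.search(r"\(([^()]*)\)", value)
    return match.group(1) if match else None

result = text_in_parentheses("Canberra (Australia)")
missing = text_in_parentheses("no brackets here")
```

For the sample input the sketch returns the string Australia, and it returns None when the input contains no parentheses.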


Using the declarative visual scripting language SPARQLMotion, average RDF/OWL experts can custom-tailor SPARQL to their individual needs without having to work with Java.

Monday, April 14, 2008

Linking to DBpedia with TopBraid

The semantic Web is coming. After at least a decade of preparation in its research community, the technology around RDF seems to be finally taking off. Re-branded as a web of linked data, the semantic Web is bootstrapping itself around a growing network of online databases, ontologies, SPARQL end-points, RDFa files and RDF-compliant web services.

A promising central hub in this linked data network is DBpedia, an RDF repository based on Wikipedia. DBpedia provides machine-readable RDF data for each of the pages in Wikipedia. Each Wikipedia page is represented by a corresponding RDF resource, and these resources are associated with RDF property values to provide descriptions, images, cross-references and tons of useful background knowledge. For example, the DBpedia pages for cities (e.g., Canberra) contain geographical information, the number of inhabitants, population density, links to famous inhabitants and average temperatures, all in machine-processable form. While these property values may not be totally stable and reliable, they are at least a good start.

However, the main benefit of DBpedia is that it provides relatively stable URIs for all relevant real-world concepts. This makes it a natural place to connect specific domain models with each other. If I publish my RDF files with links to DBpedia and you do the same, then we can automatically find cross-references and might more easily find mappings between our domain models. All I need to do is to add links such as { my:Canberra owl:sameAs dbpedia:Canberra }.
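As a rough sketch of what such a link looks like when written out, the following Python snippet builds an owl:sameAs statement in N-Triples form. The example.org URI and the function name same_as_triple are hypothetical, and the percent-encoding applied to unusual characters is an assumption. What the sketch relies on is only that DBpedia resource names follow Wikipedia page titles with spaces replaced by underscores.

```python
from urllib.parse import quote

DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

def same_as_triple(local_uri, wikipedia_title):
    """Build an N-Triples owl:sameAs statement linking local_uri to DBpedia."""
    # Spaces in the Wikipedia title become underscores in the resource name.
    name = quote(wikipedia_title.replace(" ", "_"))
    return "<{}> <{}> <{}{}> .".format(local_uri, OWL_SAME_AS, DBPEDIA_RESOURCE, name)

# The local URI below is a made-up example namespace.
triple = same_as_triple("http://example.org/my#Canberra", "Canberra")
```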

In order to support linking domain models with DBpedia and to encourage our users to link their domain models into the semantic Web, TopBraid Composer 2.5.3 contains some new features that semi-automatically suggest missing links. We have integrated a Wikipedia web service that takes a string (here, an rdfs:label or a local resource name) and tries to find a matching Wikipedia page for it. From the resulting page, TopBraid can derive the DBpedia page and display it in a Wizard as shown below. The wizard can then be used to preview and assign DBpedia links to one or more domain resources.




I have made a short video about all this.

Saturday, March 15, 2008

Extending your tools with SPARQLMotion

A few days ago I wrote about how to create Web Services with SPARQLMotion. The basic idea is that SPARQLMotion scripts can take parameters as external input and then process these parameters in SPARQL queries etc. With today's release of TopBraid Composer Maestro 2.5.2, we have applied this idea to create a new mechanism that can be used to extend the tool itself.

Here is an example. This simple SPARQLMotion script (SetCreatorService.sms.n3) takes the currently selected resource as input and adds the triple ?resource dc:creator "John Doe" using a SPARQL Update language call.




From top to bottom the steps are:

  1. Take the selected resource as input and bind it to the variable ?resource.
  2. Run the SPARQL call INSERT { ?resource dc:creator "John Doe" }
  3. Done

Download the script and put it into your TopBraid workspace, then open the context menu (Resource menu) of any resource. The menu will now contain a new item "Set creator" that executes the script on the selected resource:


When the menu item is executed, the SPARQLMotion script edits the selected resource automatically. Needless to say, more complex SPARQLMotion services could be run as well.

The trick is that TopBraid is scanning your workspace for all files ending with .sms.xyz (where xyz might be n3). If these files contain a service that takes the selected resource as input (sml:BindWithSelectedResource), then TopBraid will add a menu item for the corresponding sml:ReturnXYZ module.
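The file-name convention can be illustrated with a small Python sketch that walks a directory and collects files ending in .sms.xyz. This is only a guess at the matching logic, not TopBraid's implementation. The function name find_script_files and the temporary workspace are assumptions, and the sketch does not inspect the file contents for sml:BindWithSelectedResource.

```python
import os
import tempfile

def find_script_files(workspace):
    """Return workspace-relative paths of files named like 'Something.sms.xyz'."""
    found = []
    for root, _dirs, files in os.walk(workspace):
        for name in files:
            stem, ext = os.path.splitext(name)  # "Foo.sms.n3" -> ("Foo.sms", ".n3")
            if ext and stem.endswith(".sms"):
                found.append(os.path.relpath(os.path.join(root, name), workspace))
    return sorted(found)

# Build a throwaway workspace with one script file and two unrelated files.
with tempfile.TemporaryDirectory() as workspace:
    for name in ("SetCreatorService.sms.n3", "notes.txt", "plain.n3"):
        open(os.path.join(workspace, name), "w").close()
    scripts = find_script_files(workspace)
```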

Here is another example, which is also explained in depth on our web site.


This service can actually be used to send an email from TopBraid Composer. It demonstrates that these services can also be supplied with a pre-condition, so that the menu item only shows up for certain resources. This is implemented by means of an ASK query in the sml:BindWithSelectedResource module.

It is easy to see that this feature is potentially yet another disruptive move in the direction of model-driven applications. SPARQLMotion can be used to extend the TopBraid Composer tool itself to provide convenient short-cuts to frequently needed activities. Instead of having to rely on your IT department to write plug-ins in a programming language like Java, users of the tool can now do such things themselves. There is no more need to learn complex APIs or fiddle with extension mechanisms; plain modeling is enough. The tricky parts of connecting the dots are already solved by the SPARQLMotion engine.

Tuesday, March 11, 2008

Creating Web Services with SPARQLMotion

A week ago we officially launched SPARQLMotion 1.0 as part of the latest TopBraid Suite release. SPARQLMotion is a visual scripting language based on Semantic Web standards. The language is particularly useful to automate all kinds of data integration tasks because SPARQLMotion has built-in facilities to merge, map and transform data from various sources and formats. Furthermore, because it is a visual language, few programming skills beyond SPARQL are required to use SPARQLMotion and its tools.

One of the new features of SPARQLMotion 1.0 is that it can be used to create customized Web Services. TopBraid users can visually define REST-style web services and execute them within Maestro or the TopBraid Live server. I am describing a small example SPARQLMotion web service on our web page, but here is a screenshot of the script for your convenience.



This small SPARQLMotion script takes a calling code such as "61" as input, sends it in a SELECT query to the DBpedia SPARQL end point, and then sends a string response such as "61 is the calling code of Australia." back to the client. (Thanks to Henry Story for a variation of this scenario!)
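The request and response logic of such a service can be sketched in Python as follows. To keep the sketch self-contained, the SELECT query against the DBpedia end point is replaced by a stub lookup. The function name calling_code_service, the stub data and the fallback message are assumptions, not part of the actual script.

```python
def calling_code_service(code, lookup):
    """Answer a request for a calling code; `lookup` stands in for the SPARQL SELECT."""
    country = lookup(code)
    if country is None:
        # Hypothetical fallback; the original script's behavior here is not documented.
        return "No country found for calling code {}.".format(code)
    return "{} is the calling code of {}.".format(code, country)

# Stub data instead of a live query against the DBpedia end point.
stub = {"61": "Australia"}
response = calling_code_service("61", stub.get)
```

Passing the lookup in as a parameter mirrors the wrapper idea: the service shell stays the same while the query in the middle can be swapped.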

This particular example only highlights one aspect of the possibilities of SPARQLMotion, namely the ability to create wrappers of arbitrary SPARQL calls. However, imagine that you can also define any sequence of processing steps in the middle (between the green start and the red end module). You could mash up data from multiple newsfeeds, databases, spreadsheets or XML sources, include data from external web services, construct new triples, define iterations and if-then-else branches, apply inference engines, send emails, construct complex web pages using JSPs or other templates, etc. With SPARQLMotion, defining useful Web Services becomes a matter of drag and drop - at least that's our goal at TopQuadrant.

There are many other SPARQLMotion improvements in the latest TopBraid Composer release, and we are incrementally adding examples and documentation - finally also including a user's guide. The tool now also provides a visual script debugger. Realistically there are still some rough edges that need improvement, but there are also quite a lot of opportunities to discover in this new semantic programming paradigm. Just make a mark and see where it takes you.

Monday, March 03, 2008

Editing Oracle 11g RDF Rules in TopBraid

Oracle has offered native RDF/OWL support since version 10g of their database. Now in its second release as 11g, Oracle has become a very serious option for projects that operate on large amounts of data. Many of our own customers already have Oracle installed in their enterprise and trust the infrastructure that Oracle provides. However, Oracle does not yet offer complete solutions that would also involve ontology editors, rule editors, semantic data browsers, semantic information integration tools, etc. This is where TopBraid Suite is well established, so that using TopBraid on top of Oracle is an attractive option. In fact, in the last couple of months, the majority of our customers have explicitly asked for this combination.

In response to this increasing interest in using Oracle in conjunction with TopBraid, we have added some native Oracle capabilities to our platform, based on the Oracle Jena API for the connection. In particular, TopBraid Composer 2.5.1 includes a feature to edit user-defined native Oracle rules, and to run server-side inferences. The following screenshot shows TopBraid's basic Oracle rules editor.




With this rule editing support, TopBraid users can select one or more user-defined rule bases and also combine them with the pre-defined Oracle rule sets RDFS, RDFS++, OWLSIF and OWLPRIME. The rule language supported by Oracle includes triple patterns. This is enough for the most common needs like the infamous "uncle" relationship which goes beyond the expressivity of OWL.
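To illustrate what a triple-pattern rule of this kind computes, here is a minimal Python sketch of the "uncle" rule: if ?x has parent ?y and ?y has brother ?z, then ?x has uncle ?z. This is not Oracle's rule syntax, and the property names :hasParent, :hasBrother and :hasUncle as well as the sample data are assumptions for illustration.

```python
def infer_uncles(triples):
    """Join two triple patterns: (?x :hasParent ?y) and (?y :hasBrother ?z)."""
    parents = [(s, o) for s, p, o in triples if p == ":hasParent"]
    brothers = [(s, o) for s, p, o in triples if p == ":hasBrother"]
    # The shared variable ?y is what links the two patterns.
    return sorted({(child, ":hasUncle", uncle)
                   for child, parent in parents
                   for sibling, uncle in brothers
                   if sibling == parent})

family = [
    (":alice", ":hasParent", ":bob"),
    (":bob", ":hasBrother", ":carl"),
    (":dave", ":hasParent", ":erin"),  # :erin has no brother, so nothing is inferred
]
uncles = infer_uncles(family)
```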

The major difference between using Oracle's native rule support and any of the other inference engines built into TopBraid is that Oracle rules are executed server-side. This makes execution significantly faster than executing them inside a TopBraid client: if the rules were executed in TopBraid, the rule engine would need to fetch lots of query results from the database, process the results and then continue with the next loop. Native Oracle avoids this communication overhead. Furthermore, Oracle rules are executed incrementally on each database commit, making it easier to maintain up-to-date inferences. There are several disadvantages as well, though: TopBraid cannot distinguish inferred triples from asserted triples (at least not yet), making it unclear, for example, which triples can be deleted. A more critical limitation is that the rules operating on Oracle would not "see" any imported triples, or triples that were inferred in other ways. This favors scenarios in which all RDF data can reside in a single Oracle model.

By the way, TopBraid Composer now also has features to execute arbitrary SQL commands and SELECT queries directly on the Oracle database, displaying the results in a console together with other status reports. With these capabilities, TopBraid Composer can be used to set up and maintain Oracle databases including rule bases, so that the Oracle database can be accessed in the desired way by end-user applications (such as TopBraid Live or Ensemble). In a sense, in addition to all its other Semantic Web features, TopBraid is becoming an admin tool for Oracle RDF...

Friday, February 08, 2008

Text Analysis with OpenCalais in TopBraid

OpenCalais is an amazing Web Service that was recently made publicly available by Reuters. In a nutshell, OpenCalais takes arbitrary text or HTML documents as input and tries to extract semantic web entities from them. For example, it can identify persons, companies and countries and return them as machine-readable RDF data structures. Needless to say, this extraction does not work perfectly because understanding human languages requires (artificial) intelligence and a lot of implicit background knowledge. Even so, it can produce astonishing results. We ran it over the TopQuadrant management web site and it correctly identified all five people, their respective roles as well as parts of their former companies.

Such text-entity-extraction services have in the past been (very expensive) niche products. The recent announcement by Reuters (who acquired the text mining company ClearForest last year) to make OpenCalais available for free came as a great surprise. After all, many customers of ours have requested features to import text into their ontologies in the past. Given all this, it was an obvious next step for us to include OpenCalais in TopBraid. TopBraid Composer 2.5 now includes several features that seamlessly integrate the Calais web service into data processing tasks. For example, you can extract RDF from arbitrary HTML files from the web and save the results into files. Or you can put .txt files into your workspace and directly import them into some other RDF/OWL project - Calais will be called automatically.

However, the real power of OpenCalais is exposed when used in data processing pipelines such as SPARQLMotion scripts. The following TopBraid screenshot shows a SPARQLMotion script that

  • loads the latest business news from a New York Times RSS feed
  • sends the text of the news items to OpenCalais (OpenCalais will identify all countries mentioned in the news)
  • iterates over all countries to request their geo coordinates from the geonames web service
  • displays all countries on a Google Map



This script is of course just one possibility of using information delivered by Calais. A more comprehensive solution would probably include a countries ontology that already has background information (including coordinates, capitals, financial details) about each country. Then SPARQLMotion could be used to create an intelligent agent that analyzes newsfeeds (or any other textual data source) against semantic query patterns such as "Alert me if there are any news about a company merger located in an oil-exporting country". If you want to play with all this, please download TopBraid Composer Maestro 2.5.0 but keep in mind that SPARQLMotion is work in progress and not complete (Matt Fischer recently wrote an independent review of an even older version of SPARQLMotion that illustrates some of the open issues).

Note that OpenCalais seems to be part of a larger roadmap at Reuters aiming at making "all the world's content more accessible and valuable". It is great to see a world-leading information company embrace the Semantic Web vision so directly! As a comprehensive information integration and ontology design tool, TopBraid Composer and its SPARQLMotion language seem to be ideal platforms to process, analyze and visualize the information that OpenCalais delivers.

Beyond Social Networking with FOAF and TopBraid

The recent announcement that Google is now systematically scanning FOAF files is potentially a large leap toward building the Semantic Web as a linked network of distributed data sources. FOAF files contain personal and work-related information such as name, acquaintances, publications, projects and contact information in RDF/OWL format. In contrast to most of the well-known commercial social networking services, FOAF files are maintained in a decentralized network, in which each user can publish and edit his or her own profile without being locked into any vendor's private database.

I personally haven't followed the FOAF project closely in the past, but Google's announcement prompted me to take a second look. While there are certainly several historical and questionable design decisions in the current FOAF ontology version, it is nevertheless a very important domain model because it is widely used. In my opinion, the best thing about FOAF is that it defines stable URIs for concepts such as Person, name, mbox and img. Even if we don't like all aspects of FOAF (e.g. the redundant and inconsistent naming of properties like surname/family_name and firstName/givenname), FOAF at least provides shared URIs that ensure that users talk about the same things.

Semantic Web technology users can reuse those shared URIs for their own purposes. For example, we can import the FOAF namespace into a project management ontology so that we can reuse and leverage the facts from FOAF files for all team members. For that purpose we may choose to use only certain FOAF properties, and we are not forced to use the official specification ontology in all its details.

In our case, I wanted to see how FOAF files look in TopBraid Composer, so that I could update my own FOAF profile that was sitting neglected on the web. I made a few adjustments to the original FOAF spec so that it is less confusing to average users. In particular, I removed a couple of subclass relationships that pointed to other namespaces, removed redundant rdf:types of many properties, and removed redundant domains and ranges that would otherwise clutter up forms with unhelpful widgets. The result is an editor-friendly FOAF which has been made part of TopBraid's standard ontology library as of TopBraid Composer version 2.5.0.

I also made a couple of extensions to TopBraid to better support typical usage patterns of FOAF files. In particular I added a "follow-your-nose" feature that allows users to dynamically import a namespace that is mentioned in a URI. This is needed to explore details of FOAF profiles that are linked via the foaf:knows property. I also added imaging support to the graph editor so that from now on any value of foaf:depiction (and its sub-properties) will be rendered as an image. See the TopBraid Composer 2.5 screenshot below.
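The first step of such a "follow-your-nose" lookup is to work out which namespace a URI belongs to. A common heuristic, sketched below in Python, is to cut the URI after the '#' if there is one and otherwise after the last '/'. Whether TopBraid uses exactly this heuristic is an assumption, and the function name namespace_of is hypothetical.

```python
def namespace_of(uri):
    """Split off the local name: cut after '#', else after the last '/'."""
    if "#" in uri:
        return uri[:uri.index("#") + 1]
    return uri[:uri.rfind("/") + 1]

foaf_ns = namespace_of("http://xmlns.com/foaf/0.1/Person")
owl_ns = namespace_of("http://www.w3.org/2002/07/owl#sameAs")
```

The resulting namespace URI is what a tool would then try to load and import.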



So in principle you can now use TopBraid to edit your personal profile. This of course isn't cool by itself, but then a personal profile alone isn't cool either. The added value is that TopBraid (like similar ontology editors) is generic, which means you can use the full range of components like graphs, forms, maps and query builders as well as the various inference and data processing engines (e.g. SPARQLMotion) on your model. Even more importantly, you can make FOAF models part of other domain models and do things that cross the borders of what could be done with a traditional social networking service. For example, I could link my FOAF profile with a music ontology that contains background knowledge about the style of music that I prefer to listen to.

If Google notices that a growing number of FOAF files out there also reference related namespaces, then it may want to scan those other namespaces as well. Google may soon find out that instead of maintaining a dedicated internal social networking data structure and providing a specific social networking web service, it will be easier for them to store the original data in a flexible and self-describing format such as RDF/OWL and simply publish a SPARQL endpoint to their API users. FOAF files could therefore be the start of a long friendship...

Thursday, November 15, 2007

SparqlMotion: A visual semantic web scripting language

The open architecture of semantic web languages like RDF, OWL and SPARQL makes them an excellent choice for data integration problems, aka mash-ups. Semantic technology tools can be used to bring together heterogeneous data sources, to post-process and filter them, and to query the resulting aggregated data models. One of those tools, TopBraid Composer, provides import capabilities for legacy data in XML, UML, relational databases, spreadsheets, news feeds, HTML pages etc. Users can edit ontologies to bridge the various data items, and run inference or query engines to get information out. However, going through these steps is typically a manual process that needs to be repeated for each new data source.

SparqlMotion is a new visual language that enables average users to define scripts to import, post-process, query and visualize data using semantic web technology. Users can define and share those scripts as OWL models, based on a dedicated SparqlMotion ontology and module library. The graph editor of Composer's Maestro Edition (or any other OWL editor) can be used to define the data and execution flow of these scripts using drag and drop:




Here is a screencam video (15 minutes) that shows how to create the above SparqlMotion script with TopBraid Composer. Here is the example script in N3 notation. The script loads data from a news feed, post-processes the resulting triples, asks the user to enter a keyword, and then displays all events that contain the keyword in a calendar. The output of the script could also be another file, a spreadsheet, a database or a dynamic model that can be imported into other ontologies.



Each of the nodes in the above diagram represents a data processing step, which must be an instance of a SparqlMotion module class such as sml:LoadNewsFeed. The sm:next relationship specifies the information flow between two modules. For example, the resulting output of the newsfeed loader (RDF triples) is used as input for the data type conversion module below it. The latter module can process/filter the RDF input and pass it on to the next node etc. Scripts can branch their data flow and merge RDF input of multiple modules into a single node at any time.



Two information formats are currently supported: RDF and XML. We provide translation modules based on our Semantic XML algorithm that can convert between RDF and XML at any time. In addition to these formats, modules can bind variables. For example, a user input module such as "Enter keyword" above can prompt the user to enter a string and then pass that string literal to the following modules in a variable such as "keyword". Succeeding modules can reference this variable, for example, in SPARQL queries.
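The data flow described above, with modules passing RDF triples along and binding variables for later modules, can be approximated by a short Python sketch. This is not the SparqlMotion engine. The function names, the fixed sample triples and the hard-coded keyword (standing in for the user prompt) are assumptions, and only a linear chain is modeled, without branching or merging.

```python
def load_feed(triples, variables):
    # Stand-in for a news feed loader: ignores its input and returns fixed triples.
    events = [(":e1", ":title", "SPARQL workshop"), (":e2", ":title", "Board meeting")]
    return events, variables

def enter_keyword(triples, variables):
    # Stand-in for a user prompt: binds "keyword" and passes the triples through.
    return triples, dict(variables, keyword="SPARQL")

def filter_by_keyword(triples, variables):
    # Later modules can read variables bound earlier in the chain.
    kept = [t for t in triples if variables["keyword"] in t[2]]
    return kept, variables

def run_script(modules):
    """Run modules in order; the list order plays the role of the sm:next links."""
    triples, variables = [], {}
    for module in modules:
        triples, variables = module(triples, variables)
    return triples, variables

events, bindings = run_script([load_feed, enter_keyword, filter_by_keyword])
```

Each module takes the current triples and variable bindings and returns new ones, which is the simplest way to show that both kinds of information travel along the same chain.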



SPARQL is the central language of SparqlMotion. Many modules (such as those that display data on a calendar or a Google map) use a SPARQL query to select which resources to display. There is also an iteration module that repeats other modules for each result row of a SPARQL select clause. Finally, SPARQL's CONSTRUCT keyword is used heavily to transform and filter RDF data.



The SparqlMotion modules library has been growing rapidly since we started using it in customer projects. We are also working on a web-based graph editor in Flex based on TopBraid Ensemble's graphing capabilities. This will remind some people of Yahoo Pipes. The current version included in TopBraid Composer Maestro is still alpha software, and we keep refining it as we better understand SparqlMotion design patterns and add support for best practices. We expect to incrementally roll out many more features over the next few months. In any case, the system is available for download if you want to give it a try. Make sure to watch the video before exploring this exciting space.