Friday, February 08, 2008

Text Analysis with OpenCalais in TopBraid

OpenCalais is an amazing Web Service that was recently made publicly available by Reuters. In a nutshell, OpenCalais takes arbitrary text or HTML documents as input and tries to extract semantic web entities from them. For example, it can identify persons, companies and countries and return them as machine-readable RDF data structures. Needless to say, this extraction does not work perfectly because understanding human languages requires (artificial) intelligence and a lot of implicit background knowledge. Even so, it can produce astonishing results. We ran it over the TopQuadrant management web site and it correctly identified all five people, their respective roles as well as some of their former companies.

Such text-entity-extraction services have in the past been (very expensive) niche products. The recent announcement by Reuters (who acquired the text mining company ClearForest last year) to make OpenCalais available for free came as a great surprise. Many of our customers have asked in the past for features to import text into their ontologies, so it was an obvious next step for us to integrate OpenCalais into TopBraid. TopBraid Composer 2.5 now includes several features that seamlessly integrate the Calais web service into data processing tasks. For example, you can extract RDF from arbitrary HTML files from the web and save the results into files. Or you can put .txt files into your workspace and directly import them into some other RDF/OWL project - Calais will be called automatically.

However, the real power of OpenCalais is exposed when used in data processing pipelines such as SPARQLMotion scripts. The following TopBraid screenshot shows a SPARQLMotion script that

  • loads the latest business news from a New York Times RSS feed
  • sends the text of the news items to OpenCalais (OpenCalais will identify all countries mentioned in the news)
  • iterates over all countries to request their geo coordinates from the geonames web service
  • displays all countries on a Google Map
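
The data flow of the steps above can be sketched in a few lines of Python. This is only an illustration of the pipeline shape, not the SPARQLMotion script itself: the function `countries_on_map` and the two stand-in services `fake_extract` and `fake_lookup` are hypothetical names, and the coordinates are rough illustrative values. In a real setup the two callables would wrap OpenCalais and the geonames web service.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Coordinates = Tuple[float, float]  # (latitude, longitude)


def countries_on_map(
    news_texts: Iterable[str],
    extract_countries: Callable[[str], Iterable[str]],
    lookup_coordinates: Callable[[str], Coordinates],
) -> Dict[str, Coordinates]:
    """Collect every country mentioned in the news with its coordinates."""
    markers: Dict[str, Coordinates] = {}
    for text in news_texts:
        for country in extract_countries(text):
            if country not in markers:  # look each country up only once
                markers[country] = lookup_coordinates(country)
    return markers


# Hypothetical offline stand-ins for the entity extraction and geo lookup.
def fake_extract(text: str) -> List[str]:
    known = ["France", "Japan", "Brazil"]
    return [name for name in known if name in text]


def fake_lookup(country: str) -> Coordinates:
    table = {
        "France": (46.0, 2.0),
        "Japan": (36.0, 138.0),
        "Brazil": (-10.0, -55.0),
    }
    return table[country]


if __name__ == "__main__":
    news = ["Talks between France and Japan", "Japan exports rise"]
    for name, (lat, lon) in countries_on_map(news, fake_extract, fake_lookup).items():
        print(name, lat, lon)
```

Passing the services in as callables keeps the pipeline independent of any particular web service, which is roughly what the module boxes in a SPARQLMotion script do as well.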



This script is of course just one way of using the information delivered by Calais. A more comprehensive solution would probably include a countries ontology that already has background information (including coordinates, capitals, financial details) about each country. Then SPARQLMotion could be used to create an intelligent agent that analyzes newsfeeds (or any other textual data source) against semantic query patterns such as "Alert me if there is any news about a company merger located in an oil-exporting country". If you want to play with all this, please download TopBraid Composer Maestro 2.5.0, but keep in mind that SPARQLMotion is a work in progress and not complete (Matt Fischer recently wrote an independent review of an even older version of SPARQLMotion that illustrates some of the open issues).
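
To make the idea of such a query pattern more concrete, here is a minimal Python sketch of the join it implies: extracted news events on one side, background facts from a countries ontology on the other. Everything in it is made up for illustration (the `NewsEvent` class, the `"merger"` and `"oil-exporting"` labels, the placeholder country names and headlines); in TopBraid the same pattern would more naturally be written as a SPARQL query over RDF data.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set


@dataclass(frozen=True)
class NewsEvent:
    headline: str
    event_type: str  # hypothetical label, e.g. "merger"
    country: str     # country the event is located in


def merger_alerts(
    events: Iterable[NewsEvent],
    country_facts: Dict[str, Set[str]],
) -> List[str]:
    """Return headlines of mergers located in oil-exporting countries."""
    alerts = []
    for event in events:
        facts = country_facts.get(event.country, set())
        if event.event_type == "merger" and "oil-exporting" in facts:
            alerts.append(event.headline)
    return alerts


# Made-up background knowledge and news items (placeholder names).
COUNTRY_FACTS = {"Country A": {"oil-exporting"}, "Country B": set()}
EVENTS = [
    NewsEvent("Two energy firms agree to merge", "merger", "Country A"),
    NewsEvent("Retailers announce merger", "merger", "Country B"),
    NewsEvent("Central bank holds rates", "policy", "Country A"),
]

if __name__ == "__main__":
    for headline in merger_alerts(EVENTS, COUNTRY_FACTS):
        print("ALERT:", headline)
```

The point of the sketch is that the alert needs both sources: the text analysis supplies the event and its location, while the ontology supplies the fact that the country exports oil.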

Note that OpenCalais seems to be part of a larger roadmap at Reuters aiming at making "all the world's content more accessible and valuable". It is great to see a world-leading information company embrace the Semantic Web vision so directly! As a comprehensive information integration and ontology design tool, TopBraid Composer and its SPARQLMotion language seem to be ideal platforms to process, analyze and visualize the information that OpenCalais delivers.

Beyond Social Networking with FOAF and TopBraid

The recent announcement that Google is now systematically scanning FOAF files is potentially a large leap toward building the Semantic Web as a linked network of distributed data sources. FOAF files contain personal and work-related information such as name, acquaintances, publications, projects and contact information in RDF/OWL format. In contrast to most of the well-known commercial social networking services, FOAF files are maintained in a decentralized network, in which each user can publish and edit his or her own profile without being locked into any vendor's private database.

I personally haven't followed the FOAF project closely in the past, but Google's announcement prompted me to take a second look. While the current version of the FOAF ontology certainly contains several historical and questionable design decisions, it is nevertheless a very important domain model because it is widely used. In my opinion, the best thing about FOAF is that it defines stable URIs for concepts such as Person, name, mbox and img. Even if we don't like all aspects of FOAF (e.g. the redundant and inconsistent naming of properties like surname/family_name and firstName/givenname), FOAF at least provides shared URIs that ensure that users talk about the same things.

Semantic Web technology users can reuse those shared URIs for their own purposes. For example, we can import the FOAF namespace into a project management ontology so that we can reuse and leverage the facts from FOAF files for all team members. For that purpose we may choose to use only certain FOAF properties, and we are not forced to use the official specification ontology in all its details.
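
As a small illustration of reusing only a handful of FOAF terms, the following Python sketch writes a person description as N-Triples lines using just foaf:Person, foaf:name, foaf:mbox and foaf:knows. The helper names (`person_triples`, `uri`, `literal`) and the example.org addresses are hypothetical; only the FOAF and RDF namespace URIs are the real ones.

```python
from typing import Iterable, List

FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def uri(value: str) -> str:
    """Format a URI reference for N-Triples."""
    return f"<{value}>"


def literal(value: str) -> str:
    """Format a plain string literal, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def person_triples(
    person_uri: str,
    name: str,
    mbox: str,
    knows: Iterable[str] = (),
) -> List[str]:
    """Describe one person with a small subset of the FOAF vocabulary."""
    subject = uri(person_uri)
    lines = [
        f"{subject} {uri(RDF_TYPE)} {uri(FOAF + 'Person')} .",
        f"{subject} {uri(FOAF + 'name')} {literal(name)} .",
        f"{subject} {uri(FOAF + 'mbox')} {uri('mailto:' + mbox)} .",
    ]
    for friend in knows:
        lines.append(f"{subject} {uri(FOAF + 'knows')} {uri(friend)} .")
    return lines


if __name__ == "__main__":
    for line in person_triples(
        "http://example.org/people#alice",  # hypothetical team member
        "Alice Example",
        "alice@example.org",
        knows=["http://example.org/people#bob"],
    ):
        print(line)
```

Because the property URIs are the shared FOAF ones, any tool that understands FOAF can read these statements, even though the rest of the project ontology is our own.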

In our case, I wanted to see how FOAF files look in TopBraid Composer, so that I could update my own FOAF profile that was sitting neglected on the web. I made a few adjustments to the original FOAF spec so that it is less confusing to average users. In particular, I removed a couple of subclass relationships that pointed to other namespaces, removed redundant rdf:types of many properties, and removed redundant domains and ranges that would otherwise clutter up forms with unhelpful widgets. The result is an editor-friendly FOAF which has been made part of TopBraid's standard ontology library as of TopBraid Composer version 2.5.0.

I also made a couple of extensions to TopBraid to better support typical usage patterns of FOAF files. In particular, I added a "follow-your-nose" feature that allows users to dynamically import the namespace that a URI belongs to. This is needed to explore details of FOAF profiles that are linked via the foaf:knows property. I also added image support to the graph editor so that from now on any value of foaf:depiction (and its sub-properties) will be rendered as an image. See the TopBraid Composer 2.5 screenshot below.
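
For readers wondering how a namespace can be derived from a URI in the first place, here is a minimal Python sketch of one common heuristic: cut the URI after the "#" if it has one, otherwise after the last "/". This is an assumption about how such a feature could work, not a description of TopBraid's actual implementation, and the names `namespace_of` and `namespaces_to_import` are hypothetical.

```python
from typing import Iterable, List, Set


def namespace_of(uri: str) -> str:
    """Guess the namespace part of a resource URI."""
    if "#" in uri:
        # Hash namespace: everything up to and including the "#".
        return uri[: uri.index("#") + 1]
    # Slash namespace: everything up to and including the last "/".
    return uri[: uri.rfind("/") + 1]


def namespaces_to_import(uris: Iterable[str], already_loaded: Set[str]) -> List[str]:
    """List namespaces of the given URIs that are not loaded yet, in order."""
    pending: List[str] = []
    for uri in uris:
        namespace = namespace_of(uri)
        if namespace not in already_loaded and namespace not in pending:
            pending.append(namespace)
    return pending


if __name__ == "__main__":
    friends = [
        "http://example.org/people#bob",      # hypothetical foaf:knows targets
        "http://example.org/people#carol",
        "http://example.net/profiles/dave",
    ]
    print(namespaces_to_import(friends, {"http://xmlns.com/foaf/0.1/"}))
```

Applied to the values of foaf:knows, a helper like this yields the documents worth fetching next, which is the essence of following your nose through linked FOAF profiles.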



So in principle you can now use TopBraid to edit your personal profile. This of course isn't cool by itself, but then a personal profile alone isn't cool either. The added value is that TopBraid (like similar ontology editors) is generic, which means you can use the full range of components like graphs, forms, maps and query builders as well as the various inference and data processing engines (e.g. SPARQLMotion) on your model. Even more importantly, you can make FOAF models part of other domain models and do things that go beyond what could be done with a traditional social networking service. For example, I could link my FOAF profile with a music ontology that contains background knowledge about the style of music that I prefer to listen to.

If Google notices that a growing number of FOAF files out there also reference related namespaces, it may want to scan those other namespaces as well. Google may soon find out that instead of maintaining a dedicated internal social networking data structure and providing a specific social networking web service, it will be easier for them to store the original data in a flexible and self-describing format such as RDF/OWL and simply publish a SPARQL endpoint to their API users. FOAF files could therefore be the start of a long friendship...