How to write SPARQL queries against Freebase data

Overview

Freebase RDF data is clean and well-organized, so writing queries is straightforward once you know the conventions. Although a "cookbook" on the subject doesn't yet exist, this post describes the minimum you need to know to write SPARQL queries against Freebase data.

  1. Writing SPARQL queries against Freebase
  2. Introspecting the Freebase Schema
  3. Compound Value Types in RDF
  4. Freebase Keys

What to load

Although it is possible to load Freebase data directly into a triple store, it is a difficult process because the Freebase RDF dump is not entirely compatible with RDF standards, and many tools will crash or otherwise fail to load the data. The Freebase RDF dump also contains hundreds of millions of redundant or uninteresting triples that greatly increase both loading and query times.

We use the open source Infovore framework to produce :BaseKB, a purified data product that is compatible with standard RDF tools.

We've heard reports of people loading :BaseKB Gold into a number of triple stores including AllegroGraph, BigData, and OpenLink Virtuoso. :BaseKB Gold is a free download via BitTorrent. The ideal hardware for loading this data is a quad core machine with at least 32GB of RAM and SSD storage.

If you'd like to skip the loading step, which can take hours, you can use the RDFeasy Complete Edition in the AWS Marketplace, which combines OpenLink Virtuoso, :BaseKB data and perfectly matched hardware for a low hourly price. This is an excellent option for evaluation, research, and development, because (1) you can get started in ten minutes and (2) you only need to pay for the time when you're using it.

Prefix declaration

:BaseKB rewrites URIs from the http://rdf.freebase.com/ns/ namespace to http://rdf.basekb.com/ns/. Since nearly all of the entities, types, and predicates you'll use live in this namespace, we write

prefix : <http://rdf.basekb.com/ns/>

at the beginning of all queries. If you're using raw data from Freebase, you can write

prefix : <http://rdf.freebase.com/ns/>

and get similar results.

Looking up entities and predicates

Let's try a query I was asked about, which is to find the longest river entirely contained in Russia.

In a perfect world we'd have a :BaseKB-powered schema browser, but for now, we can use the Freebase web interface. Go to

http://freebase.com/

and type the word Russia into the autosuggest at the top. You'll see something like

[Screenshot: autosuggest dropdown for Russia]

If you click on the first link, you'll get to the country page for Russia, which is

https://www.freebase.com/m/06bnz

and if you look at the head of the page you will see a mid identifier

[Screenshot: header of the Russia detail page]

You can read the mid /m/06bnz either from the header at the top of the page or from the URL of the page. Either way, to use this as an RDF identifier you replace the first slash with a colon and the second slash with a period to get

:m.06bnz
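
A quick way to confirm that the mid was converted correctly is to list a few triples about it. The query below is a minimal sketch that assumes :BaseKB is loaded and uses the :BaseKB namespace from the prefix declaration above; which predicates and values come back depends on the data you loaded.

```sparql
prefix : <http://rdf.basekb.com/ns/>

# List a sample of the facts asserted about Russia (:m.06bnz).
select ?p ?o {
   :m.06bnz ?p ?o .
} limit 20
```

If this returns no rows, the identifier or the prefix is probably wrong; against raw Freebase data, the prefix would be the rdf.freebase.com one instead.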

To write this query, we also need to find two properties:

  1. We need a property that states that a location is completely contained in another location, and
  2. We need a property to find the length of a river.

We can start at the Freebase home page, which lists "bases" that contain common types and properties.

[Screenshot: top topics on the Freebase home page]

We click on location, and see

[Screenshot: types in the location base]

We click again on "Location" and then we see a list of properties, the first few of which are

[Screenshot: first few properties of the Location type]

The one we want is Contained-by, which we write in RDF as

:location.location.containedby

Note that most properties in Freebase have this structure. The first part is the name of a 'base', which organizes groups of related types. The second part is the name of a type, which can be referred to as

:location.location

and the last part is the name of the property, containedby. The exception to this rule is that some bases belong to users or are contained in other bases, in which case the name of the base can have multiple parts.
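
To see the property in action before building the full query, you can ask for a handful of locations that point at Russia through it. This is a sketch rather than a definitive recipe: it assumes the :BaseKB namespace, and it spells the property containedby, without an underscore, as in the complete query in this post.

```sparql
prefix : <http://rdf.basekb.com/ns/>

# A sample of locations recorded as contained by Russia (:m.06bnz).
select ?place {
   ?place :location.location.containedby :m.06bnz .
} limit 20
```

The subject position holds the contained location and the object position holds the container, so swapping them would instead ask what contains Russia.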

The river length property takes a little more digging, because rivers are not under location; they are under geography, which shows up as "Physical Geography" on the top page. You'd first look at

https://www.freebase.com/geography

and then at

https://www.freebase.com/geography/river?schema=

and eventually find the property you want is

:geography.river.length
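
As a sanity check on this property, the following sketch lists the rivers with the largest recorded lengths, without any location constraint. It assumes the :BaseKB namespace; the lengths are in whatever unit the data records, which this query does not inspect.

```sparql
prefix : <http://rdf.basekb.com/ns/>

# The ten rivers with the largest recorded length, anywhere in the data.
select ?river ?length {
   ?river :geography.river.length ?length .
} order by desc(?length) limit 10
```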

Putting it all together

Now that we know the properties we need, we can write the following query

prefix : <http://rdf.basekb.com/ns/>

select ?river ?length {
   ?river :geography.river.length ?length .
   ?river :location.location.containedby :m.06bnz .
} ORDER BY DESC(?length) LIMIT 1

If you're using RDFeasy, you can run this in the "Database/Interactive SQL" tab by putting the command 'sparql' in front of the SPARQL, which looks like

[Screenshot: query entered in the Interactive SQL tab]

and then you get this result

[Screenshot: query result]

If we convert that mid back to a Freebase detail page we get

https://www.freebase.com/m/0203mm

which is the right answer.
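
To avoid the round trip through the Freebase site, the query can also fetch a human-readable name. The variant below is a sketch that assumes your copy of the data carries English-tagged rdfs:label values for topics; if it does not, the optional clause simply leaves ?name unbound.

```sparql
prefix : <http://rdf.basekb.com/ns/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Longest river contained in Russia, with an English label when one exists.
select ?river ?name ?length {
   ?river :geography.river.length ?length .
   ?river :location.location.containedby :m.06bnz .
   optional {
      ?river rdfs:label ?name .
      filter(langMatches(lang(?name), "en"))
   }
} order by desc(?length) limit 1
```

The label lookup is wrapped in optional so that a river with no matching label still appears in the result.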

Thinking in RDFS

Note that we don't need to put

    ?river a :geography.river .

into the query because only a :geography.river can be the subject of :geography.river.length. This isn't just because Freebase types are organized like base -> type -> property, but because RDFS can infer the above statement based on

:geography.river.length rdfs:domain :geography.river .
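
You can look up this kind of domain declaration yourself. The sketch below uses the length property from the query above, :geography.river.length, and assumes that the schema triples (the rdfs:domain statements) were loaded along with the instance data.

```sparql
prefix : <http://rdf.basekb.com/ns/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Which type does the river length property declare as its domain?
select ?type {
   :geography.river.length rdfs:domain ?type .
}
```

Replacing the property with a variable and fixing the type instead turns this into a way to list the properties declared for a type.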

Just as computer programs (particularly in Java) can grow verbose, so can SPARQL queries, and it's wise to leave out any unnecessary constraints.

Further reading

It makes sense to read the SPARQL 1.1 specification cover-to-cover, as well as the Metaweb architecture documentation.

:BaseKB is a product of Ontology2. :BaseKB contains data from Freebase.