Compound Value Types in RDF

Prelude

  1. Writing SPARQL queries against Freebase
  2. Introspecting the Freebase Schema
  3. Compound Value Types in RDF
  4. Freebase Keys

Most facts can be expressed as a single triple of the form

:subject :predicate :object .

for instance,

:Colorado :partOf :United_States .

On the other hand, some relationships fundamentally involve more than two things, and some values change rapidly as a function of time. In those cases we extend the triple model by adding an extra node to model the relationship; this node plays a role similar to a table row in a database, and is called either a Compound Value Type (CVT) or a Mediator in Freebase.

In RDF it is possible to create a node without a unique identifier, known as a "blank node", but in a community database like Freebase, giving every node a global name ensures that anyone really can say anything about anything. This article will teach you how to work with Compound Value Types when querying Freebase data with SPARQL.

This post is a followup to How to write SPARQL queries against Freebase data and is part of a series. Subscribe to the RSS feed and to the :BaseKB mailing list for future episodes.

The data set I'm using is the 2014-03-02 edition of :BaseKB Gold. You can download this via BitTorrent and load it into any standards-compliant triple store, but it's even faster to use the pre-loaded Compact Edition, which can deploy perfectly matched hardware, software and data in just one click.

Dated measurements

The :measurement_unit base contains a number of "mediators" that represent measurements taken at a moment in time. :measurement_unit.dated_money_value is just a touch more complicated than average because it contains a currency unit:

Dated Money Value

To see how this works, I looked up the Eastman Kodak Corporation via the Freebase web UI and got the following id: :m.0ftdd. I can then look up revenue numbers like

prefix : <http://rdf.basekb.com/ns/>

select ?valid_date ?amount {
   :m.0ftdd :business.business_operation.revenue ?that .
   ?that 
      :measurement_unit.dated_money_value.amount ?amount ;
      :measurement_unit.dated_money_value.valid_date ?valid_date
} order by (?valid_date)

I get the following results back

Kodak Results

and piping the result into R I see one of the great American rise and fall stories:

Kodak Revenue

Note that in this case the currency and source attributes are boring: the currency is always USD and the source attribute is missing.
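
To check this for yourself, the currency and source can be pulled into the same result. This is only a sketch: the .currency and .source property names are assumed by analogy with .amount and .valid_date and are not confirmed above. Both are wrapped in OPTIONAL so that a row still comes back when an attribute is missing.

```sparql
prefix : <http://rdf.basekb.com/ns/>

# Same revenue lookup, plus the currency and source of each measurement.
# ASSUMPTION: the .currency and .source property names follow the same
# pattern as .amount and .valid_date.
select ?valid_date ?amount ?currency ?source {
   :m.0ftdd :business.business_operation.revenue ?that .
   ?that
      :measurement_unit.dated_money_value.amount ?amount ;
      :measurement_unit.dated_money_value.valid_date ?valid_date .
   # OPTIONAL keeps the row even when the attribute is absent
   optional { ?that :measurement_unit.dated_money_value.currency ?currency }
   optional { ?that :measurement_unit.dated_money_value.source ?source }
} order by ?valid_date
```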

Adding a dimension

Although it is common to look at compound value types that are associated with a topic, it is also easy with SPARQL to approach CVTs from other angles. For instance, this query turns up the top sources of dated integers:

prefix : <http://rdf.basekb.com/ns/>

select ?source (count(*) as ?cnt) {
   ?value :measurement_unit.dated_integer.source ?source .
} group by ?source ORDER BY DESC(?cnt) LIMIT 5

Annotated table

If you need to know, the text labels were put in by hand in OneNote

The World Bank (known as :m.02vk52z) is a data provider to Freebase, and we can find the data types it provides like so:

prefix : <http://rdf.basekb.com/ns/>

select ?p (count(*) as ?cnt) {
     ?source :dataworld.information_source.provider :m.02vk52z .
     ?item ?p ?source .
} group by ?p order by desc(?cnt)

with the following results:

World Bank Types

It is also possible to pull up a list of the predicates asserted by the World Bank:

prefix : <http://rdf.basekb.com/ns/>

select ?predicate (count(*) as ?cnt) {
     ?source :dataworld.information_source.provider :m.02vk52z .
     ?fact ?sourceP ?source .
     ?something ?predicate ?fact .
     filter(regex(str(?sourceP),"measurement_unit"))
} group by ?predicate order by desc(?cnt)

World Bank Predicates

How many CVT nodes are there?

We can count the subjects that appear in Freebase like so:

select (count(distinct(?s)) as ?count) {
   ?s ?p ?o .
}

We get a total count of 108,743,069, which greatly exceeds the 43,453,748 topics and the 7,962,738 schema objects; the latter are counted with the following query:

prefix : <http://rdf.basekb.com/ns/>

select (count(distinct(?s)) as ?count) {
   ?s a ?o .
   FILTER(REGEX(str(?o),'/type[.]'))
}
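
No query is shown for the topic figure. A minimal sketch of how it could be obtained, assuming that "topics" means exactly the subjects typed as :common.topic (the type that the non-topic query further down excludes):

```sparql
prefix : <http://rdf.basekb.com/ns/>

# Count distinct subjects typed as :common.topic.
# ASSUMPTION: this is how the topic count quoted in the text was defined.
select (count(distinct ?s) as ?count) {
   ?s a :common.topic .
}
```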

This leaves 57,326,583 non-topics. We can get some insight into these by looking up the most common non-topic types, some of which are CVTs:

prefix : <http://rdf.basekb.com/ns/>

select ?nonTopicType (count(*) as ?count) {
    ?that a ?nonTopicType .
    minus {?that a :common.topic }
} ORDER BY DESC(?count) LIMIT 50

Popular non-topics

Using the Summary data

Compound value types are marked like

?cvType :freebase.type_hints.mediator 1 .

If you try to join this against

?that a ?cvType .

you get poor results because ?cvType is a mid identifier, whereas the RHS of the a statement is a 'friendly' identifier like :film.performance. These don't match up, so the join doesn't work. Freebase doesn't contain statements like

:m.020w ?alsoKnownAs :measurement_unit.dated_integer .

but rather you find

:m.020w :freebase.object_hints.best_hrid "/measurement_unit/dated_integer" .

This join could be completed by the use of SPARQL functions to convert slashes into periods and strings into URIs, but there's an easier way.
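
For reference, the longer route might look roughly like the sketch below. It assumes that every best_hrid begins with a slash and that the friendly type identifier is simply the hrid with that leading slash dropped and the remaining slashes turned into periods, under the http://rdf.basekb.com/ns/ namespace; ?friendlyType is a variable name introduced here for illustration.

```sparql
prefix : <http://rdf.basekb.com/ns/>

# Turn "/measurement_unit/dated_integer" into
# <http://rdf.basekb.com/ns/measurement_unit.dated_integer>, then join on it.
select ?name (count(?that) as ?cnt) {
   ?cvType :freebase.type_hints.mediator 1 ;
           :freebase.object_hints.best_hrid ?name .
   # drop the leading slash, swap the other slashes for periods, build the IRI
   bind(iri(concat("http://rdf.basekb.com/ns/",
                   replace(substr(str(?name), 2), "/", "."))) as ?friendlyType)
   ?that a ?friendlyType .
} group by ?name order by desc(?cnt) limit 30
```

Because this counts every instance at query time, it is likely to be much slower than reading a precomputed summary.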

We can get a list of the CVTs, however, by using the summary information in :freebase.type_profile.instance_count, which results in a simple query:

prefix : <http://rdf.basekb.com/ns/>

select ?name ?count {
   ?mediatorType :freebase.type_hints.mediator 1 ;
       :freebase.type_profile.instance_count ?count ;
       :freebase.object_hints.best_hrid ?name .
} order by desc(?count) LIMIT 30

Only cvt

and then by mixing up the query a little we get a count of all instances that are tagged as mediators:

prefix : <http://rdf.basekb.com/ns/>

select (sum(?count) AS ?sum) {
   ?mediatorType :freebase.type_hints.mediator 1 ;
      :freebase.type_profile.instance_count ?count ;
      :freebase.object_hints.best_hrid ?name .
} 

This turns out to be 13,761,088, a small fraction (much less than 10%) of the relationships in the system.

People and roles

At this point you should know enough to look up CVT facts, even if it is a little awkward to format your results. (That's the subject of a future post.) I'm going to show a few more screenshots from the Freebase Web UI that illustrate some interesting uses of CVTs.

Employment histories are a good example of something that transcends the triple, since employment is (i) a temporary condition, and (ii) involves both an employer and a job title. If we look up, say, Steve Jobs, we get

Jobs Career
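
The same kind of information can be reached with SPARQL. The sketch below goes in from the employer side, using Kodak (:m.0ftdd) because it is the only company id given in this post. The property names are assumptions based on the usual Freebase schema naming, and the variable names are made up for illustration.

```sparql
prefix : <http://rdf.basekb.com/ns/>

# People who have held a job at Kodak (:m.0ftdd), with title and dates.
# ASSUMPTION: the employment_history and employment_tenure property names
# below follow the Freebase schema; they are not shown elsewhere in this post.
select ?person ?title ?start ?end {
   ?person :people.person.employment_history ?tenure .
   ?tenure :business.employment_tenure.company :m.0ftdd .
   # title and dates are optional because tenures are often incomplete
   optional { ?tenure :business.employment_tenure.title ?title }
   optional { ?tenure :business.employment_tenure.from ?start }
   optional { ?tenure :business.employment_tenure.to ?end }
} order by ?start
```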

If, on the other hand, you look at good Captain Kirk, you see that Freebase knows he's been played by three different actors.
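
Characters with several actors can also be found with SPARQL, since each performance is a CVT that ties together a character and an actor. A sketch, assuming that :film.performance nodes carry .character and .actor properties (assumed names, not shown elsewhere in this post):

```sparql
prefix : <http://rdf.basekb.com/ns/>

# Characters ranked by how many distinct actors have portrayed them.
# ASSUMPTION: .character and .actor are the property names on :film.performance.
select ?character (count(distinct ?actor) as ?actors) {
   ?performance :film.performance.character ?character ;
                :film.performance.actor ?actor .
} group by ?character order by desc(?actors) limit 10
```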

Conclusion

Compound Value Types are a useful extension that Freebase makes to the RDF data model; an understanding of CVTs is essential if you want to access all of the data that exists in Freebase.


:BaseKB is a product of Ontology2. :BaseKB contains data from Freebase.