Identifiers in Freebase

Prelude

  1. Writing SPARQL queries against Freebase
  2. Introspecting the Freebase Schema
  3. Compound Value Types in RDF
  4. Freebase Keys

Most Linked Data sets represent links to the outside world in a format like

:internalResource owl:sameAs external:thatResource .

where owl:sameAs could be replaced by some other predicate whose definition is less problematic. When data is linked this way, you have many options for integration, such as loading everything into the same triple store or dereferencing URIs one at a time.

Freebase was conceived before the time of Linked Data and SPARQL so it developed its own method of mapping identifiers to concepts; this information is expressed in two different ways in RDF.

For the purpose of concision, the :BaseKB Compact Edition supports only one of these mechanisms, the :type.object.key predicate, while the :BaseKB Complete Edition supports both.

This article teaches you how to look up internal and external identifiers using the :type.object.key predicate and the special key: namespace.

Inventor of the Traffic Light

Let's take the case of Garret Morgan, who is :m.01tp2v in Freebase and who comes about as close to a real-life Tony Stark as anyone. If we look up the identifiers that Freebase knows for him with this query

prefix : <http://rdf.basekb.com/ns/> 

select ?key {
   :m.01tp2v  :type.object.key ?key .
} ORDER BY ?key

we get

Morgan Results

Note that Freebase keys are structured like paths in Unix. The :type.object.key predicate spells them out completely, while the alternative representation encodes the directed acyclic graph directly.

Note that some of these identifiers have been inserted by external entities (e.g. /base/ranker/ and /user/avh/ellerdale). We also see a key in the /en/ namespace, which means you can refer to this entity as /en/garret_a_morgan in MQL queries. In the early days, Freebase created human-readable identifiers for all topics, but this policy did not scale well, and Freebase eventually converged on the consistent use of mids for everything that is not a type or a property.
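Because keys read like Unix paths, splitting one at its last slash separates the namespace from the identifier inside that namespace. The following is a minimal sketch in Java, assuming the key is given in its slash-separated form; the class and method names (KeyPath, namespaceOf, localNameOf) are made up for illustration and are not part of any Freebase or :BaseKB API.

```java
/** Illustrative helper (hypothetical, not a Freebase API) for path-like keys. */
final class KeyPath {

    /** Returns everything before the last slash: the namespace of the key. */
    static String namespaceOf(String key) {
        return key.substring(0, lastSlash(key));
    }

    /** Returns everything after the last slash: the identifier within the namespace. */
    static String localNameOf(String key) {
        return key.substring(lastSlash(key) + 1);
    }

    /** Locates the last slash, rejecting strings that are not path-like. */
    private static int lastSlash(String key) {
        int index = key.lastIndexOf('/');
        if (index < 0) {
            throw new IllegalArgumentException("Not a path-like key: " + key);
        }
        return index;
    }

    public static void main(String[] args) {
        String key = "/en/garret_a_morgan";
        System.out.println(namespaceOf(key));  // prints /en
        System.out.println(localNameOf(key));  // prints garret_a_morgan
    }
}
```

Splitting at the last slash rather than the first matters because namespaces can themselves be several levels deep, as in /user/avh/ellerdale.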

Unicode character encoding in keys

An important convention is that Freebase encodes non-plaintext characters in identifiers as $xxxx, where xxxx is the four-digit hexadecimal value of a 16-bit Unicode (UTF-16) code unit. You can see this used above, where "Garret A. Morgan" is spelled out as

Garret_A$002E_Morgan

Hexadecimal 2E is decimal 46, which represents a period in both ASCII and Unicode. The same encoding is used for the Korean variant "개릿 모건", which is Morgan's name spelled out phonetically:

/wikipedia/ko/$AC1C$B9BF_$BAA8$AC74

Note that characters outside the Basic Multilingual Plane (with code points greater than $FFFF) are encoded as a pair of $ sequences using surrogate characters. The following Java function decodes the $ sequences in Freebase keys:

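/**
 * Decodes the $xxxx escape sequences in a Freebase key.
 * Each escape is one UTF-16 code unit, so a surrogate pair arrives as
 * two consecutive escapes and is reassembled by appending both in order.
 */
public static String unescapeFreebaseKey(String in) {
    StringBuilder out = new StringBuilder(in.length());
    // Limit -1 keeps trailing empty parts, so a dangling "$" is detected below.
    String[] parts = in.split("[$]", -1);
    out.append(parts[0]);
    for (int i = 1; i < parts.length; i++) {
        // Every part after the first must start with four hex digits.
        if (parts[i].length() < 4) {
            throw new IllegalArgumentException("Truncated $ escape in key: " + in);
        }
        String hexSymbols = parts[i].substring(0, 4);
        String remainder = parts[i].substring(4);

        // parseInt throws NumberFormatException if the digits are not hexadecimal.
        out.append((char) Integer.parseInt(hexSymbols, 16));
        out.append(remainder);
    }

    return out.toString();
}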
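Going the other way, from plain text to an escaped key, can be sketched in the same style. This is not code from Freebase: the class name FreebaseKeyEscaper is hypothetical, and the set of characters left unescaped (ASCII letters, digits, underscore and hyphen) is an assumption made for illustration, so check it against the keys you actually see before relying on it.

```java
/** Hypothetical inverse of the decoder above: escapes characters as $xxxx. */
final class FreebaseKeyEscaper {

    /**
     * Assumption: ASCII letters, digits, '_' and '-' pass through unchanged;
     * every other UTF-16 code unit becomes '$' plus four uppercase hex digits.
     */
    static String escape(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            boolean plain = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                    || (c >= '0' && c <= '9') || c == '_' || c == '-';
            if (plain) {
                out.append(c);
            } else {
                // Iterating by char rather than by code point means a
                // surrogate pair is emitted as two separate escapes.
                out.append('$').append(String.format("%04X", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("Garret_A._Morgan")); // prints Garret_A$002E_Morgan
    }
}
```

Working on UTF-16 code units instead of code points is a deliberate choice here: it reproduces the surrogate-pair behaviour described above without any special-casing.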

The key(s) to Wikipedia

Let's take a look at the conventions used in Wikipedia keys. Wikipedia keys come in several kinds:

/wikipedia/{lang}/
/wikipedia/{lang}_title/
/wikipedia/{lang}_id/

where {lang} is an ISO 639-1 code or a variation of one. Wikipedia keys are derived from Wikipedia titles by replacing the space character with an underscore, and escaping punctuation and non-ASCII characters with the $-convention described above.

A page in Wikipedia has a "real" title, but may appear under different names because of redirect records that point to the real page. The real title is encoded in the /wikipedia/{lang}_title/ namespaces, whereas the titles that redirect to the real title are encoded in the /wikipedia/{lang}/ namespaces.

Generally, systems should accept any Wikipedia title, redirects included, when taking input from an outside system, but should use the official title when exporting data to the outside.
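The accept-everything, export-the-official-form policy can be sketched as follows. This is only an illustration: the class WikipediaKeys and its methods are hypothetical, the sample keys in main are made up, and the code assumes a topic's keys are available as slash-separated strings.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Hypothetical helper applying the accept-all, export-official policy. */
final class WikipediaKeys {

    /** On input, accept a title key from either the redirect or the real-title namespace. */
    static boolean isAcceptedTitleKey(String key, String lang) {
        return key.startsWith("/wikipedia/" + lang + "/")
                || key.startsWith("/wikipedia/" + lang + "_title/");
    }

    /** On output, use only the key from /wikipedia/{lang}_title/, if the topic has one. */
    static Optional<String> officialTitleKey(List<String> keys, String lang) {
        String prefix = "/wikipedia/" + lang + "_title/";
        return keys.stream().filter(k -> k.startsWith(prefix)).findFirst();
    }

    public static void main(String[] args) {
        // Made-up keys for a single topic, for illustration only.
        List<String> keys = Arrays.asList(
                "/wikipedia/en/Example_Redirect",
                "/wikipedia/en_title/Example_Page",
                "/wikipedia/en_id/12345");
        System.out.println(officialTitleKey(keys, "en").orElse("(none)"));
        // prints /wikipedia/en_title/Example_Page
    }
}
```

Matching on the full prefix, trailing slash included, keeps /wikipedia/en/ from accidentally matching /wikipedia/en_title/ or /wikipedia/en_id/.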

Wikipedia titles have the special property of being unique, unlike Freebase titles, which can be shared by many objects. Wikipedia titles are disambiguated in a rather ad hoc manner. Sometimes Wikipedians choose names to avoid conflict, but frequently they add something to the title to disambiguate it, such as a few words in parentheses giving the type of the object, for example

Battle of Los Angeles (album)

Note that the /wikipedia/{lang}_id/ namespace contains numeric identifiers, which are the internal primary keys in the database tables behind Wikipedia. These identifiers are supposed to remain stable when titles change, so they provide one more interconnection between Freebase and Wikipedia.

Keys in the complete edition

The Compact Edition of :BaseKB contains only the :type.object.key identifiers. I believe these are sufficient for almost any task, but the Complete Edition provides a different view of Freebase keys. It turns out that any Freebase namespace, like

/authority/iso/3166-1/alpha-2

can be converted to a URI

<http://rdf.freebase.com/key/key.authority.iso.3166-1.alpha-2>

and Freebase uses this as a predicate like so

?subject ?keyPredicate "String_Value_Of_Key".

In this case, ISO 3166-1 Alpha 2 is the fancy name for the commonly used two-letter country abbreviations, and by searching this namespace, we can make a list of current countries, together with their codes and labels.

prefix : <http://rdf.basekb.com/ns/>
prefix key: <http://rdf.basekb.com/key/key.>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>

select ?country ?code ?label {
   ?country key:authority.iso.3166-1.alpha-2 ?code .
   ?country rdfs:label ?label .
   FILTER(lang(?label)='en')
}

The first few results look like

Country Codes
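The conversion from a namespace path to a key predicate URI described in this section is mechanical: drop the leading slash, turn the remaining slashes into dots, and append the result to a base URI. A minimal sketch follows; the class KeyPredicates and its method are hypothetical, and the base URI is passed in as a parameter rather than hard-coded, so the caller supplies whichever prefix their edition of the data uses.

```java
/** Hypothetical helper that maps a Freebase namespace path to a key predicate URI. */
final class KeyPredicates {

    /**
     * @param baseUri   prefix ending in "key.", supplied by the caller
     * @param namespace slash-separated namespace such as /authority/iso/3166-1/alpha-2
     */
    static String predicateUri(String baseUri, String namespace) {
        if (!namespace.startsWith("/")) {
            throw new IllegalArgumentException("Namespace must start with '/': " + namespace);
        }
        String path = namespace.substring(1);
        // Tolerate a trailing slash so "/wikipedia/en/" and "/wikipedia/en" agree.
        if (path.endsWith("/")) {
            path = path.substring(0, path.length() - 1);
        }
        return baseUri + path.replace('/', '.');
    }

    public static void main(String[] args) {
        System.out.println(predicateUri(
                "http://rdf.freebase.com/key/key.",
                "/authority/iso/3166-1/alpha-2"));
        // prints http://rdf.freebase.com/key/key.authority.iso.3166-1.alpha-2
    }
}
```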

Conclusion

Freebase has a mechanism for representing internal or external identifiers that is expressed in two different ways. When you learn how to use this mechanism, you'll find it easy to link up Freebase with other data sources.

:BaseKB is a product of Ontology2. :BaseKB contains data from Freebase.