Getting Started

What is YAGO?

YAGO is a knowledge base, i.e., a database with knowledge about the real world. YAGO contains both entities (such as movies, people, cities, countries, etc.) and relations between these entities (who played in which movie, which city is located in which country, etc.). All in all, YAGO contains more than 50 million entities and 90 million facts.

YAGO arranges its entities into classes: Elvis Presley belongs to the class of people, Paris belongs to the class of cities, and so on. These classes are arranged in a taxonomy: the class of cities is a subclass of the class of populated places, which is in turn a subclass of geographical locations, and so on.

YAGO also defines which relations can hold between which entities: birthPlace, for example, is a relation that can hold between a person and a place. The definition of these relations, together with the taxonomy, is called the ontology.
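For illustration, here is a hedged sketch of how such a relation definition looks in schema.org's own vocabulary (schema.org states the admissible subject and object types with schema:domainIncludes and schema:rangeIncludes; the Turtle below is simplified):

    @prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix schema: <http://schema.org/> .

    # birthPlace is a property that connects a person (its domain)
    # to a place (its range).
    schema:birthPlace a rdf:Property ;
        schema:domainIncludes schema:Person ;
        schema:rangeIncludes  schema:Place .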

What is so special about YAGO?

YAGO combines two great resources:

  1. Wikidata is the largest general-purpose knowledge base on the Semantic Web. It is a great repository of entities, but its taxonomy is difficult to use and its entity identifiers are not human-readable.
  2. schema.org is a standard ontology of classes and relations, which is maintained by Google and others — but it does not have any entities.

YAGO combines these two resources, thus getting the best of both worlds: a huge repository of facts, together with an ontology that is simple and used as a standard by a large community. In addition, all identifiers in YAGO are human-readable, all entities belong to at least one class, and only classes and properties with enough instances are kept. To this, YAGO adds a system of logical constraints. These constraints do not just keep the data clean, but also allow for reasoning on the data. YAGO is thus a simplified, cleaned, and “reasonable” version of Wikidata.

What are the logical constraints of YAGO?

Logical constraints are conditions that the data must fulfill. For example, a logical constraint can say that no entity can be at the same time a person and a place. These constraints serve to root out errors in the data and establish the logical coherence of the knowledge base. The constraints also allow for making deductions: if someone asks whether Elvis is a place, we can answer “no”, because we know he is a person. While this may sound trivial, such reasoning is not possible without the logical constraint. YAGO currently enforces several such constraints, including the disjointness of classes (as in the person/place example) and restrictions on which kinds of entities a relation may connect.
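For illustration, the person/place example above corresponds to a class disjointness statement. In standard OWL vocabulary it could be written as follows (a sketch only; YAGO's actual constraints are not necessarily stored in this exact form):

    @prefix owl:    <http://www.w3.org/2002/07/owl#> .
    @prefix schema: <http://schema.org/> .

    # Nothing can be an instance of both Person and Place.
    schema:Person owl:disjointWith schema:Place .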

What is the data model of YAGO?

YAGO is stored in RDF, the standard Resource Description Framework. This means that YAGO is a set of facts, each of which consists of a subject, a predicate (also called “relation” or “property”), and an object, as in <Elvis> <birthPlace> <Tupelo>.

We use different vocabularies for the components of such a fact. For the predicates, for example, we use the relations defined by schema.org. Since these predicates live in the schema.org namespace, they are written with the prefix schema: (as in schema:birthPlace). This method allows us to refer to standard vocabulary without re-inventing the wheel.
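In Turtle syntax, the example fact from above might then look as follows (the yago: namespace IRI and the exact entity identifiers are assumptions for illustration; the downloadable data contains the actual identifiers):

    @prefix schema: <http://schema.org/> .
    @prefix yago:   <http://yago-knowledge.org/resource/> .  # assumed namespace

    # Subject, predicate, object: Elvis Presley was born in Tupelo.
    yago:Elvis_Presley schema:birthPlace yago:Tupelo .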

For “facts about facts” (such as time stamps for facts or other types of annotations), we use the RDF* format.
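In RDF* (also called RDF-star), a triple can itself appear as the subject of another triple, which is how such annotations are attached. A minimal sketch, again with assumed identifiers and an annotation property chosen purely for illustration:

    @prefix schema: <http://schema.org/> .
    @prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .
    @prefix yago:   <http://yago-knowledge.org/resource/> .  # assumed namespace

    # The embedded triple << ... >> is the subject of the annotation.
    << yago:Elvis_Presley schema:spouse yago:Priscilla_Presley >>
        schema:startDate "1967-05-01"^^xsd:date .  # illustrative annotation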

What are the relations in YAGO?

The relations in YAGO come from schema.org. We have manually mapped the original Wikidata relations to these schema.org relations, and we discard all Wikidata relations that do not have a schema.org equivalent. This cuts away a large number of predicates that have very few facts.
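For example, Wikidata's property P19 (“place of birth”) has the schema.org counterpart birthPlace. The statement below only illustrates what such a manual mapping amounts to; it is not YAGO's internal mapping format:

    @prefix wdt:    <http://www.wikidata.org/prop/direct/> .
    @prefix schema: <http://schema.org/> .
    @prefix owl:    <http://www.w3.org/2002/07/owl#> .

    # Wikidata's "place of birth" (P19) corresponds to schema.org's birthPlace.
    wdt:P19 owl:equivalentProperty schema:birthPlace .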

What is the taxonomy of YAGO?

The top-level taxonomy of YAGO is taken from schema.org. In this way, we have a simple hierarchy of classes that has proven to work well in practice. However, these classes are not fine-grained enough: they have no class for electric cars, for example. Therefore, we have carefully integrated selected parts of the Wikidata taxonomy into YAGO.
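The result is a class hierarchy whose upper levels come from schema.org and whose finer-grained classes come from Wikidata. A sketch of the electric-car example in Turtle (the yago:Electric_car identifier is an assumption; schema:Car and schema:Vehicle are real schema.org classes):

    @prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix schema: <http://schema.org/> .
    @prefix yago:   <http://yago-knowledge.org/resource/> .  # assumed namespace

    # A fine-grained class taken from Wikidata, attached below the
    # schema.org top-level taxonomy.
    yago:Electric_car rdfs:subClassOf schema:Car .
    schema:Car        rdfs:subClassOf schema:Vehicle .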

How can I access YAGO?

There are several ways to access YAGO:

  1. You can browse the knowledge base yourself in our Web Interface.
  2. You can launch SPARQL queries at our SPARQL endpoint (see the example query below).
  3. You can programmatically send queries to our SPARQL endpoint (see the Python sketch below).
  4. You can download the data and load it into an RDF triple store (e.g., BlazeGraph or Jena).
    This is the preferred method if you plan to launch a larger number of queries.
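For options 2 and 3, queries are written in SPARQL. The following example query (the class and property come from schema.org; adapt it to the actual YAGO identifiers as needed) asks for ten people together with their birth places:

    PREFIX schema: <http://schema.org/>

    SELECT ?person ?place WHERE {
      ?person a schema:Person ;
              schema:birthPlace ?place .
    } LIMIT 10

For option 3, any HTTP client that speaks the standard SPARQL protocol can send such a query. A minimal Python sketch using the requests library; the endpoint URL below is a placeholder, so use the address given on the YAGO website:

    import requests

    # Placeholder endpoint URL; replace it with the address from the YAGO website.
    ENDPOINT = "https://yago-knowledge.org/sparql/query"

    QUERY = """
    PREFIX schema: <http://schema.org/>
    SELECT ?person ?place WHERE {
      ?person a schema:Person ;
              schema:birthPlace ?place .
    } LIMIT 10
    """

    # The SPARQL protocol accepts the query as an HTTP parameter and can
    # return the results as JSON.
    response = requests.get(
        ENDPOINT,
        params={"query": QUERY},
        headers={"Accept": "application/sparql-results+json"},
    )
    response.raise_for_status()

    for row in response.json()["results"]["bindings"]:
        print(row["person"]["value"], "born in", row["place"]["value"])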