Entity Profiles

Bootleg uses Wikipedia and Wikidata to collect and generate an entity database of metadata associated with each entity. We support both non-structural data (e.g., the title of an entity) and structural data (e.g., the type or relationship of an entity). We now describe how to generate entity profile data from scratch for training, and the structure of the profile data we already provide.

Generating Profiles

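The database of entity data starts with a simple jsonl file of data associated with each entity. Specifically, each line is a JSON object of the following form (pretty-printed here for readability; in the file, each object occupies a single line):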

{
    "entity_id": "Q16240866",
    "mentions": [["benin national u20 football team",1],["benin national under20 football team",1]],
    "title": "Forbidden fruit",
    "description": "A fruit that once was considered not to be eaten",
    "types": {"hyena": ["<wordnet_football_team_108080025>"],
              "wiki": ["national association football team"],
              "relations":["country for sport","sport"]},
    "relations": [
        {"relation":"P1532","object":"Q962"}
    ]
}

The entity_id gives a unique string identifier of the entity. It does not have to start with a Q; because we normalize to Wikidata, we refer to our entity ids as QIDs. The mentions field provides a list of known aliases for the entity, each with a prior score indicating the strength of association between the alias and the entity. The score is used to order the candidates. The types field lists the types an entity has and supports multiple type systems. In the example above, the two type systems are hyena and wiki. We also have a relations type system, which treats the relationships an entity participates in as types. The relations JSON field provides the actual KG relationship triples where entity_id is the head.
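As a minimal sketch of producing such a file, the snippet below serializes one record per line using only the Python standard library and reads it back. The record values, the helper names (to_jsonl_line, write_jsonl, read_jsonl), and the choice to treat every field of the example as required are assumptions made for illustration; they are not part of Bootleg.

```python
import json
import os
import tempfile

# Field names come from the example record above. Treating all of them as
# required is an assumption of this sketch, not a documented Bootleg rule.
EXPECTED_KEYS = {"entity_id", "mentions", "title", "description", "types", "relations"}


def to_jsonl_line(record):
    """Serialize one entity record to a single line of JSON."""
    missing = EXPECTED_KEYS - set(record)
    if missing:
        raise ValueError(f"record is missing keys: {sorted(missing)}")
    # json.dumps without indent produces no newlines, so the record stays on one line.
    return json.dumps(record, ensure_ascii=False)


def write_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(to_jsonl_line(record) + "\n")


def read_jsonl(path):
    """Read the file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical record: the id deliberately does not start with "Q".
example_record = {
    "entity_id": "my_entity_1",
    "mentions": [["example alias", 3], ["another alias", 3]],
    "title": "Example entity",
    "description": "A made-up entity used for illustration",
    "types": {"wiki": ["example type"], "relations": ["example relation"]},
    "relations": [{"relation": "P1532", "object": "Q962"}],
}

# Round-trip the record through a temporary JSONL file.
with tempfile.TemporaryDirectory() as tmp_dir:
    raw_path = os.path.join(tmp_dir, "raw_profile.jsonl")
    write_jsonl([example_record], raw_path)
    loaded = read_jsonl(raw_path)
```

Validating keys before writing keeps malformed records out of the file, so problems surface when the raw data is generated instead of when the profile is loaded.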

Note

By default, Bootleg sets the score of each mention to the global entity count in Wikipedia. We empirically found this was a better scoring method for incorporating Wikidata “also known as” aliases that did not appear in Wikipedia. This means that all mentions of a single entity have the same score.

We provide a more complete sample of raw profile data to look at.

Once the data is ready, we provide an EntityProfile API to build and interact with the profile data. To create an entity profile for the model from the raw jsonl data, run

from bootleg.symbols.entity_profile import EntityProfile
path_to_file = "data/sample_raw_entity_data/raw_profile.jsonl"
# edit_mode means you are allowed to modify the profile
ep = EntityProfile.load_from_jsonl(path_to_file, edit_mode=True)

Note

By default, we assume that each alias can have a maximum of 30 candidates, 10 types, and 100 connections. You can change these by adding max_candidates, max_types, and max_connections as keyword arguments to load_from_jsonl. Note that increasing the number of maximum candidates increases the memory required for training and inference.
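To make the effect of a candidate cap concrete, here is a minimal pure-Python sketch of how mention scores could be used to order an alias's candidates before truncating to a maximum. It is not Bootleg's implementation: the function name build_alias2qids, the tie-breaking rule, and the toy records are hypothetical.

```python
from collections import defaultdict


def build_alias2qids(records, max_candidates=30):
    """Map each alias to its top-scoring [entity_id, score] pairs."""
    alias2qids = defaultdict(list)
    for record in records:
        for alias, score in record["mentions"]:
            alias2qids[alias].append([record["entity_id"], score])

    truncated = {}
    for alias, candidates in alias2qids.items():
        # Highest score first. Breaking ties by entity id is an assumption
        # made here so that the output is deterministic.
        candidates.sort(key=lambda pair: (-pair[1], pair[0]))
        truncated[alias] = candidates[:max_candidates]
    return truncated


# Toy records with hypothetical ids and scores.
records = [
    {"entity_id": "Q1", "mentions": [["lincoln", 10]]},
    {"entity_id": "Q2", "mentions": [["lincoln", 50], ["abe", 50]]},
    {"entity_id": "Q3", "mentions": [["lincoln", 5]]},
]
alias2qids = build_alias2qids(records, max_candidates=2)
```

With a cap of 2, the lowest-scoring candidate for "lincoln" is dropped, which is why a low cap can exclude rare entities that share a popular alias.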

Profile API

Now that the profile is loaded, you can interact with the metadata and change it. For example, to get the title and add a type mapping, you’d run

ep.get_title("Q16240866")
# This adds the type "sports team" to the "wiki" type system
ep.add_type("Q16240866", "sports team", "wiki")

Once ready to train or run a model with the profile data, simply save it

ep.save("data/sample_entity_db")

We have already provided the saved dump at data/sample_entity_data.

See our entity profile tutorial for a more complete walkthrough notebook of the API.

Training with a Profile

Inside the saved folder for the profile, all the mappings needed to run a Bootleg model are provided. There are three subfolders, described below. Note that we use the words alias and mention interchangeably.

  • entity_mappings: This folder contains non-structural entity data.
    • qid2eid: This is a folder containing a Trie mapping from entity id (we refer to this as QID) to an entity index used internally to extract embeddings. Note that these entity indices start at 1 (index 0 is reserved for a “not in candidate list” entity). We use Wikidata QIDs in our tutorials and documentation, but any string identifier will work.

    • qid2title.json: This is a mapping from entity QID to entity Wikipedia title.

    • qid2desc.json: This is a mapping from entity QID to entity Wikipedia description.

    • alias2qids: This is a folder containing a RecordTrie mapping from possible mentions (or aliases) to a list of possible candidates. We restrict our candidate lists to a predefined maximum length, typically 30. Each item in the list is a pair of [QID, QID score] values. The QID score is used for sorting candidates before filtering to the top 30. The scores are otherwise not used in Bootleg. This mapping is mined from both Wikipedia and Wikidata (reach out with a GitHub issue if you want to know more).

    • alias2id: This is a folder containing a Trie mapping from alias to alias index used internally by the model.

    • config.json: This gives metadata associated with the entity data, specifically the maximum number of candidates.

  • type_mappings: This folder contains type entity data for each type system subfolder. Inside each subfolder are the following files.
    • qid2typenames: Folder containing a RecordTrie mapping from entity QID to a list of type names.

    • config.json: Contains metadata of the maximum number of types allowed for an entity.

  • kg_mappings: This folder contains relationship entity data.
    • qid2relations: Folder containing a RecordTrie mapping from entity QID to its relations, and from each relation to the list of tail QIDs connected to that entity QID.

    • config.json: Contains metadata of the maximum number of tail connections allowed for a particular head entity and relation.
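The layout in the list above can be checked before training with a short script. The following is a sketch using only the standard library: the helper name missing_profile_paths is hypothetical, the entry names are copied from the list, and whether a given Bootleg version requires every one of them has not been verified here.

```python
import tempfile
from pathlib import Path

# Entry names are taken from the list above.
FIXED_LAYOUT = {
    "entity_mappings": [
        "qid2eid", "qid2title.json", "qid2desc.json",
        "alias2qids", "alias2id", "config.json",
    ],
    "kg_mappings": ["qid2relations", "config.json"],
}
# Each type system subfolder under type_mappings is expected to hold these.
TYPE_SYSTEM_ENTRIES = ["qid2typenames", "config.json"]


def missing_profile_paths(entity_dir):
    """Return expected paths (relative to entity_dir) that do not exist."""
    root = Path(entity_dir)
    expected = [root / folder / entry
                for folder, entries in FIXED_LAYOUT.items()
                for entry in entries]
    type_root = root / "type_mappings"
    if type_root.is_dir():
        # One subfolder per type system, e.g. type_mappings/wiki.
        for type_system in sorted(p for p in type_root.iterdir() if p.is_dir()):
            expected.extend(type_system / entry for entry in TYPE_SYSTEM_ENTRIES)
    else:
        expected.append(type_root)
    return [p.relative_to(root).as_posix() for p in expected if not p.exists()]


# Demo on a throwaway directory that contains only part of the layout.
with tempfile.TemporaryDirectory() as tmp_dir:
    root = Path(tmp_dir)
    (root / "entity_mappings" / "qid2eid").mkdir(parents=True)
    (root / "entity_mappings" / "config.json").write_text("{}")
    (root / "type_mappings" / "wiki" / "qid2typenames").mkdir(parents=True)
    missing = missing_profile_paths(root)
```

An empty return value means every listed entry was found; anything else names the paths to regenerate or copy into place.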

Note

In Bootleg, we add types from a selected type system and add KG relationship triples to our entity encoder.

Note

In our public entity_db provided to run Bootleg models, we also provide alias2qids_unfiltered.json which provides our unfiltered, raw candidate mappings. We filter noisy aliases before running mention extraction.

Given this metadata, you simply need to specify the types, relation mappings, and correct folder structure in a Bootleg training config. Specifically, the following config parameters must be set to point a model at an entity profile.

data_config:
  entity_dir: data/sample_entity_data
  use_entity_desc: true
  entity_type_data:
    use_entity_types: true
    type_symbols_dir: type_mappings/wiki
  entity_kg_data:
    use_entity_kg: true
    kg_symbols_dir: kg_mappings

See our example config for a full reference, and see our entity profile tutorial for some methods to help modify configs to map to the entity profile correctly.
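As a quick sanity check on the config values shown above, the sketch below joins the configured subfolders onto entity_dir to show which directories they would point at. That type_symbols_dir and kg_symbols_dir are resolved relative to entity_dir is an assumption inferred from the folder layout described earlier, and resolve_profile_dirs is a hypothetical helper, not a Bootleg function.

```python
from pathlib import PurePosixPath

# Python mirror of the data_config section shown above.
data_config = {
    "entity_dir": "data/sample_entity_data",
    "use_entity_desc": True,
    "entity_type_data": {
        "use_entity_types": True,
        "type_symbols_dir": "type_mappings/wiki",
    },
    "entity_kg_data": {
        "use_entity_kg": True,
        "kg_symbols_dir": "kg_mappings",
    },
}


def resolve_profile_dirs(data_config):
    """Join the configured subfolders onto entity_dir (assumed to be the base)."""
    entity_dir = PurePosixPath(data_config["entity_dir"])
    resolved = {}
    type_cfg = data_config.get("entity_type_data", {})
    if type_cfg.get("use_entity_types"):
        resolved["types"] = str(entity_dir / type_cfg["type_symbols_dir"])
    kg_cfg = data_config.get("entity_kg_data", {})
    if kg_cfg.get("use_entity_kg"):
        resolved["kg"] = str(entity_dir / kg_cfg["kg_symbols_dir"])
    return resolved


resolved = resolve_profile_dirs(data_config)
```

Switching type systems under this assumption only requires changing the last path component of type_symbols_dir to another subfolder of type_mappings.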