Configuring Bootleg

Bootleg loads its default configuration from bootleg/utils/parser/bootleg_args.py. When running a Bootleg model, the user may pass in a custom JSON or YAML config via:

python3 bootleg/run.py --config_script <path_to_config>

Values set in the custom config override the defaults. Further, if a user wishes to overwrite a param from the command line, they can pass in the value using the dotted path of the argument. For example, to overwrite the data directory (the param data_config.data_dir), the user can enter:

python3 bootleg/run.py --config_script <path_to_config> --data_config.data_dir <path_to_data>

Bootleg will save the run config (as well as a fully parsed version with all defaults) in the log directory.

Finally, when evaluating Bootleg using the annotator, Bootleg preprocesses possible mentions in text according to three environment flags: BOOTLEG_STRIP, BOOTLEG_LOWER, and BOOTLEG_LANG_CODE. BOOTLEG_STRIP controls whether punctuation is stripped from mentions (False by default), BOOTLEG_LOWER controls whether .lower() is called on mentions (True by default), and BOOTLEG_LANG_CODE sets the language to use for spaCy.
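
These flags can also be set from Python before the annotator is constructed. The sketch below is a minimal example rather than canonical usage: the import path and the label_mentions call reflect recent Bootleg releases, and the exact string values the flags accept are an assumption, so check them against your installed version.

import os

# The flags are read when mentions are preprocessed, so set them before
# creating the annotator. The "true"/"false" string values are an assumption.
os.environ["BOOTLEG_LANG_CODE"] = "en"   # language code passed to spaCy
os.environ["BOOTLEG_STRIP"] = "true"     # strip punctuation from mentions
os.environ["BOOTLEG_LOWER"] = "false"    # do not call .lower() on mentions

from bootleg.end2end.bootleg_annotator import BootlegAnnotator

ann = BootlegAnnotator()
print(ann.label_mentions("How many people live in Lincoln"))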

Emmental Config

As Bootleg uses Emmental, the training parameters (e.g., learning rate) are set and handled by Emmental. We provide all Emmental params, as well as our defaults, at bootleg/utils/parser/emm_parse_args.py. All Emmental params are under the emmental configuration group. For example, to change the learning rate and number of epochs in a config, add

emmental:
  lr: 1e-4
  n_epochs: 10
run_config:
  ...

You can also change Emmental params from the command line with --emmental.<emmental_param> <value>.
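
For example, to set the same learning rate and number of epochs without editing the config:

python3 bootleg/run.py --config_script <path_to_config> --emmental.lr 1e-4 --emmental.n_epochs 10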

Example Training Config

An example training config is shown below

emmental:
  lr: 2e-5
  n_epochs: 3
  evaluation_freq: 0.2
  warmup_percentage: 0.1
  lr_scheduler: linear
  log_path: logs/wiki
  l2: 0.01
  grad_clip: 1.0
  fp16: true
run_config:
  eval_batch_size: 32
  dataloader_threads: 4
  dataset_threads: 50
train_config:
  batch_size: 32
model_config:
  hidden_size: 200
data_config:
  data_dir: bootleg-data/data/wiki_title_0122
  data_prep_dir: prep
  use_entity_desc: true
  entity_type_data:
    use_entity_types: true
    type_symbols_dir: type_mappings/wiki
  entity_kg_data:
    use_entity_kg: true
    kg_symbols_dir: kg_mappings
  entity_dir: bootleg-data/data/wiki_title_0122/entity_db
  max_seq_len: 128
  max_seq_window_len: 64
  max_ent_len: 128
  overwrite_preprocessed_data: false
  dev_dataset:
    file: dev.jsonl
    use_weak_label: true
  test_dataset:
    file: test.jsonl
    use_weak_label: true
  train_dataset:
    file: train.jsonl
    use_weak_label: true
  train_in_candidates: true
  word_embedding:
    cache_dir: bootleg-data/embs/pretrained_bert_models
    bert_model: bert-base-uncased

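If you want to inspect a config programmatically, Bootleg's parser utilities can merge a YAML file like the one above with the defaults. The sketch below is illustrative only: the parse_boot_and_emm_args helper and its module path follow the bootleg/utils/parser layout referenced earlier, the filename train_config.yaml is hypothetical, and the attribute-style access should be verified against your installed version.

# Assumes the example config above was saved as train_config.yaml (hypothetical name).
from bootleg.utils.parser.parser_utils import parse_boot_and_emm_args

config = parse_boot_and_emm_args("train_config.yaml")

# The merged config mirrors the dotted groups used on the command line.
print(config.data_config.data_dir)
print(config.emmental.lr)
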
Default Config

The default Bootleg config is shown below

"""Bootleg default configuration parameters.

In the json file, everything is a string or number. In this python file,
if the default is a boolean, it will be parsed as such. If the default
is a dictionary, True and False strings will become booleans. Otherwise
they will stay string.
"""
import multiprocessing

config_args = {
    "run_config": {
        "spawn_method": (
            "forkserver",
            "multiprocessing spawn method. forkserver will save memory but have slower startup costs.",
        ),
        "eval_batch_size": (128, "batch size for eval"),
        "dump_preds_accumulation_steps": (
            1000,
            "number of eval steps to accumulate the output tensors for before saving results to file",
        ),
        "dump_preds_num_data_splits": (
            1,
            "number of chunks to split the input file; helps with OOM issues",
        ),
        "overwrite_eval_dumps": (False, "overwrite dumped eval data"),
        "dataloader_threads": (16, "data loader threads to feed gpus"),
        "log_level": ("info", "logging level"),
        "dataset_threads": (
            int(multiprocessing.cpu_count() * 0.9),
            "data set threads for prepping data",
        ),
        "result_label_file": (
            "bootleg_labels.jsonl",
            "file name to save predicted entities in",
        ),
        "result_emb_file": (
            "bootleg_embs.npy",
            "file name to save contextualized embs in",
        ),
    },
    # Parameters for hyperparameter tuning
    "train_config": {
        "batch_size": (32, "batch size"),
    },
    "model_config": {
        "hidden_size": (300, "hidden dimension for the embeddings before scoring"),
        "normalize": (False, "normalize embeddings before dot product"),
        "temperature": (1.0, "temperature for softmax in loss"),
    },
    "data_config": {
        "eval_slices": ([], "slices for evaluation"),
        "train_in_candidates": (
            True,
            "Train in candidates (if False, this means we include NIL entity)",
        ),
        "data_dir": ("data", "where training, testing, and dev data is stored"),
        "data_prep_dir": (
            "prep",
            "directory where data prep files are saved inside data_dir",
        ),
        "entity_dir": (
            "entity_data",
            "where entity profile information and prepped embedding data is stored",
        ),
        "entity_prep_dir": (
            "prep",
            "directory where prepped embedding data is saved inside entity_dir",
        ),
        "entity_map_dir": (
            "entity_mappings",
            "directory where entity json mappings are saved inside entity_dir",
        ),
        "alias_cand_map": (
            "alias2qids",
            "name of alias candidate map file, should be saved in entity_dir/entity_map_dir",
        ),
        "alias_idx_map": (
            "alias2id",
            "name of alias index map file, should be saved in entity_dir/entity_map_dir",
        ),
        "qid_cnt_map": (
            "qid2cnt.json",
            "name of alias index map file, should be saved in data_dir",
        ),
        "max_seq_len": (128, "max token length sentences"),
        "max_seq_window_len": (64, "max window around an entity"),
        "max_ent_len": (128, "max token length for entire encoded entity"),
        "context_mask_perc": (
            0.0,
            "mask percent for context tokens in addition to tail masking",
        ),
        "popularity_mask": (
            True,
            "whether to use popularity masking for training in the entity and context encoders",
        ),
        "overwrite_preprocessed_data": (False, "overwrite preprocessed data"),
        "print_examples_prep": (True, "whether to print examples during prep or not"),
        "use_entity_desc": (True, "whether to use entity descriptions or not"),
        "entity_type_data": {
            "use_entity_types": (False, "whether to use entity type data"),
            "type_symbols_dir": (
                "type_mappings/wiki",
                "directory to type symbols inside entity_dir",
            ),
            "max_ent_type_len": (20, "max WORD length for type sequence"),
        },
        "entity_kg_data": {
            "use_entity_kg": (False, "whether to use entity type data"),
            "kg_symbols_dir": (
                "kg_mappings",
                "directory to kg symbols inside entity_dir",
            ),
            "max_ent_kg_len": (60, "max WORD length for kg sequence"),
        },
        "train_dataset": {
            "file": ("train.jsonl", ""),
            "use_weak_label": (True, "Use weakly labeled mentions"),
        },
        "dev_dataset": {
            "file": ("dev.jsonl", ""),
            "use_weak_label": (True, "Use weakly labeled mentions"),
        },
        "test_dataset": {
            "file": ("test.jsonl", ""),
            "use_weak_label": (True, "Use weakly labeled mentions"),
        },
        "word_embedding": {
            "bert_model": ("bert-base-uncased", ""),
            "context_layers": (12, ""),
            "entity_layers": (12, ""),
            "cache_dir": (
                "pretrained_bert_models",
                "Directory where word embeddings are cached",
            ),
        },
    },
}
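
Each leaf in config_args above is a (default value, description) tuple, and nested dictionaries form the dotted groups used on the command line. The helper below is a small hypothetical illustration (it is not part of Bootleg) of how a default can be read out of that structure:

def get_default(*path):
    """Walk the nested config_args dict and return the default value of a leaf."""
    node = config_args
    for key in path:
        node = node[key]
    default, _description = node
    return default

# run_config.eval_batch_size defaults to 128.
print(get_default("run_config", "eval_batch_size"))
# data_config.word_embedding.bert_model defaults to "bert-base-uncased".
print(get_default("data_config", "word_embedding", "bert_model"))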