bootleg package

Subpackages

Submodules

bootleg.data module

Bootleg data creation.

bootleg.data.bootleg_collate_fn(batch: Union[List[Tuple[Dict[str, Any], Dict[str, torch.Tensor]]], List[Dict[str, Any]]]) → Union[Tuple[Dict[str, Any], Dict[str, torch.Tensor]], Dict[str, Any]][source]

Collate function (modified from the Emmental collate function).

The main difference is that our collate function merges candidates from across the batch for disambiguation.

Parameters
  • batch – The batch to collate.

Returns

The collated batch.
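A minimal usage sketch (with hypothetical objects): plugging bootleg_collate_fn into a standard PyTorch DataLoader so candidates are merged across the batch. Here train_dataset is assumed to be an already-built BootlegDataset whose items match the batch type above:

    from torch.utils.data import DataLoader

    from bootleg.data import bootleg_collate_fn

    loader = DataLoader(
        train_dataset,                  # assumed: an already-built BootlegDataset
        batch_size=32,
        shuffle=True,
        collate_fn=bootleg_collate_fn,  # merges candidates across the batch
    )
    for X_dict, Y_dict in loader:       # items are (X_dict, Y_dict) tuples
        break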

bootleg.data.get_dataloaders(args, tasks, use_batch_cands, load_entity_data, splits, entity_symbols, tokenizer, dataset_offsets: Optional[Dict[str, List[int]]] = None)[source]

Get the dataloaders.

Parameters
  • args – main args

  • tasks – task names

  • use_batch_cands – whether to use candidates across a batch (train and eval_batch_cands)

  • load_entity_data – whether to load entity data

  • splits – data splits to generate dataloaders for

  • entity_symbols – entity symbols

  • dataset_offsets – [start, end] offsets for each split to index into the dataset. The dataset length is end - start. If end is None, end is set to the length of the dataset.

Returns: list of dataloaders
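A hedged sketch of calling get_dataloaders from a training script. The args, tasks, entity_symbols, and tokenizer objects are assumed to have been built elsewhere; only the call signature follows the documentation above:

    from bootleg.data import get_dataloaders

    dataloaders = get_dataloaders(
        args=args,                      # main config, assumed parsed already
        tasks=tasks,                    # task names, e.g. the NED task
        use_batch_cands=True,           # merge candidates across the batch
        load_entity_data=True,
        splits=["train", "dev"],
        entity_symbols=entity_symbols,  # entity database, assumed loaded
        tokenizer=tokenizer,            # e.g. a HuggingFace BERT tokenizer
    )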

bootleg.data.get_entity_dataloaders(args, tasks, entity_symbols, tokenizer)[source]

Get the entity dataloaders.

Parameters
  • args – main args

  • tasks – task names

  • entity_symbols – entity symbols

  • tokenizer – tokenizer

Returns: list of dataloaders

bootleg.data.get_slicedatasets(args, splits, entity_symbols)[source]

Get the slice datasets.

Parameters
  • args – main args

  • splits – splits to get datasets for

  • entity_symbols – entity symbols

Returns: Dict of slice datasets
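A companion sketch for get_slicedatasets, reusing the assumed args and entity_symbols objects from the example above:

    from bootleg.data import get_slicedatasets

    # Returns a dict of slice datasets, per the documentation above.
    slice_datasets = get_slicedatasets(
        args=args,
        splits=["dev", "test"],
        entity_symbols=entity_symbols,
    )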

bootleg.dataset module

Bootleg NED Dataset.

class bootleg.dataset.BootlegDataset(main_args, name, dataset, use_weak_label, load_entity_data, tokenizer, entity_symbols, dataset_threads, split='train', is_bert=True, dataset_range=None)[source]

Bases: emmental.data.EmmentalDataset

Bootleg Dataset class.

Parameters
  • main_args – input config

  • name – internal dataset name

  • dataset – dataset file

  • use_weak_label – whether to use weakly labeled mentions or not

  • load_entity_data – whether to load entity data or not

  • tokenizer – sentence tokenizer

  • entity_symbols – entity database class

  • dataset_threads – number of threads to use

  • split – data split

  • is_bert – whether the tokenizer is a BERT tokenizer or not

  • dataset_range – offset into dataset

classmethod build_data_dicts(save_dataset_name, save_labels_name, X_storage, Y_storage)[source]

Return the X_dict and Y_dict of inputs and labels.

Parameters
  • save_dataset_name – memmap file name with inputs

  • save_labels_name – memmap file name with labels

  • X_storage – memmap storage for inputs

  • Y_storage – memmap storage for labels

Returns: X_dict of inputs and Y_dict of labels for Emmental datasets

classmethod build_data_entity_dicts(save_dataset_name, X_storage)[source]

Return the X_dict for the entity data.

Parameters
  • save_dataset_name – memmap file name with entity data

  • X_storage – memmap storage type

Returns: X_dict of entity data

get_sentidx_to_rowids()[source]

Get mapping from sent idx to row id in X_dict.

Returns: Dict of sent idx to row id
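A hedged construction sketch for BootlegDataset; in normal use, bootleg.data.get_dataloaders builds this for you. The args, tokenizer, and entity_symbols objects are assumed, and train.jsonl is a hypothetical dataset file:

    from bootleg.dataset import BootlegDataset

    dataset = BootlegDataset(
        main_args=args,
        name="Bootleg_train",
        dataset="train.jsonl",          # hypothetical dataset file
        use_weak_label=True,
        load_entity_data=True,
        tokenizer=tokenizer,
        entity_symbols=entity_symbols,
        dataset_threads=4,
        split="train",
        is_bert=True,
    )
    # Map each sentence index to its row ids in X_dict.
    sentidx_to_rowids = dataset.get_sentidx_to_rowids()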

class bootleg.dataset.BootlegEntityDataset(main_args, name, dataset, tokenizer, entity_symbols, dataset_threads, split='test')[source]

Bases: emmental.data.EmmentalDataset

Bootleg Dataset class for entities.

Parameters
  • main_args – input config

  • name – internal dataset name

  • dataset – dataset file

  • tokenizer – sentence tokenizer

  • entity_symbols – entity database class

  • dataset_threads – number of threads to use

  • split – data split

classmethod build_data_entity_dicts(save_dataset_name, X_storage)[source]

Return the X_dict for the entity data.

Parameters
  • save_dataset_name – memmap file name with entity data

  • X_storage – memmap storage type

Returns: X_dict of entity data

class bootleg.dataset.InputExample(sent_idx, subsent_idx, alias_list_pos, alias_to_predict, span, phrase, alias, qid, qid_cnt_mask_score)[source]

Bases: object

A single training/test example for prediction.

classmethod from_dict(in_dict)[source]

Create object from dictionary.

to_dict()[source]

Return dictionary of object.
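A serialization round-trip sketch for InputExample. The field values are hypothetical; the field names come from the constructor signature above:

    from bootleg.dataset import InputExample

    example = InputExample(
        sent_idx=0,
        subsent_idx=0,
        alias_list_pos=0,
        alias_to_predict=0,
        span=[10, 16],               # character span of "Berlin"
        phrase="I live in Berlin",
        alias="berlin",
        qid="Q64",
        qid_cnt_mask_score=1.0,
    )
    # to_dict/from_dict are documented above; assuming they are inverses.
    restored = InputExample.from_dict(example.to_dict())
    assert restored.to_dict() == example.to_dict()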

class bootleg.dataset.InputFeatures(alias_idx, word_input_ids, word_token_type_ids, word_attention_mask, word_qid_cnt_mask_score, gold_eid, for_dump_gold_eid, gold_cand_K_idx, for_dump_gold_cand_K_idx_train, alias_list_pos, sent_idx, subsent_idx, guid)[source]

Bases: object

A single set of features of data.

classmethod from_dict(in_dict)[source]

Create object from dictionary.

to_dict()[source]

Return dictionary of object.

bootleg.dataset.build_and_save_entity_inputs(save_entity_dataset_name, X_entity_storage, data_config, dataset_threads, tokenizer, entity_symbols)[source]

Create entity features.

Parameters
  • save_entity_dataset_name – memmap filename to save the entity data

  • X_entity_storage – storage type for memmap file

  • data_config – data config

  • dataset_threads – number of threads

  • tokenizer – tokenizer

  • entity_symbols – entity symbols

bootleg.dataset.build_and_save_entity_inputs_hlp(input_qids)[source]

Create entity features multiprocessing helper.

bootleg.dataset.build_and_save_entity_inputs_initializer(constants, data_config, save_entity_dataset_name, X_entity_storage, tokenizer)[source]

Create entity features multiprocessing initializer.

bootleg.dataset.build_and_save_entity_inputs_single(input_qids, constants, memfile, type_symbols, kg_symbols, tokenizer, entity_symbols)[source]

Create entity features.

bootleg.dataset.convert_examples_to_features_and_save(meta_file, guid_dtype, data_config, dataset_threads, use_weak_label, split, is_bert, save_dataset_name, save_labels_name, X_storage, Y_storage, tokenizer, entity_symbols)[source]

Create features from examples.

Converts the prepped examples into input features and saves them in memmap files. These are used in the __getitem__ method.

Parameters
  • meta_file – metadata file where input file paths are saved

  • guid_dtype – unique identifier dtype

  • data_config – data config

  • dataset_threads – number of threads

  • use_weak_label – whether to use weak labeling or not

  • split – data split

  • is_bert – whether the tokenizer is a BERT tokenizer

  • save_dataset_name – data features file name to save

  • save_labels_name – data labels file name to save

  • X_storage – data features storage type (for memmap)

  • Y_storage – data labels storage type (for memmap)

  • tokenizer – tokenizer

  • entity_symbols – entity symbols

bootleg.dataset.convert_examples_to_features_and_save_hlp(input_dict)[source]

Convert examples to features multiprocessing helper.

bootleg.dataset.convert_examples_to_features_and_save_initializer(tokenizer, data_config, save_dataset_name, save_labels_name, X_storage, Y_storage)[source]

Convert examples to features multiprocessing initializer.

bootleg.dataset.convert_examples_to_features_and_save_single(input_dict, tokenizer, entitysymbols, mmap_file, mmap_label_file)[source]

Convert examples to features multiprocessing helper.

bootleg.dataset.create_examples(dataset, create_ex_indir, create_ex_outdir, meta_file, data_config, dataset_threads, use_weak_label, split, is_bert, tokenizer)[source]

Create examples from the raw input data.

Parameters
  • dataset – data file to read

  • create_ex_indir – temporary directory where input files are stored

  • create_ex_outdir – temporary directory to store output files from method

  • meta_file – metadata file to save the file names/paths for the next step in prep pipeline

  • data_config – data config

  • dataset_threads – number of threads

  • use_weak_label – whether to use weak labeling or not

  • split – data split

  • is_bert – whether the tokenizer is a BERT tokenizer

  • tokenizer – tokenizer

bootleg.dataset.create_examples_hlp(args)[source]

Create examples multiprocessing helper.

bootleg.dataset.create_examples_initializer(constants_dict, tokenizer)[source]

Create examples multiprocessing initializer.

bootleg.dataset.create_examples_single(in_file_idx, in_file_name, in_file_lines, out_file_name, constants_dict, tokenizer)[source]

Create examples.

bootleg.dataset.extract_context(span, sentence, max_seq_window_len, tokenizer)[source]

Extract the left and right context window around a span.

Parameters
  • span – character span (left and right values)

  • sentence – sentence

  • max_seq_window_len – maximum window length around a span

  • tokenizer – tokenizer

Returns: context window
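A usage sketch for extract_context, assuming a HuggingFace tokenizer stands in for the expected tokenizer object:

    from transformers import AutoTokenizer

    from bootleg.dataset import extract_context

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    sentence = "The capital of Germany is Berlin and it is large."
    context = extract_context(
        span=[26, 32],               # character span of "Berlin"
        sentence=sentence,
        max_seq_window_len=16,
        tokenizer=tokenizer,
    )
    print(context)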

bootleg.dataset.get_entity_string(qid, constants, entity_symbols, kg_symbols, type_symbols)[source]

Get string representation of entity.

For each entity, generates a string that is fed into a language model to generate an entity embedding. Also returns all tokens that make up the title of the entity (even if they appear in the description).

Parameters
  • qid – QID

  • constants – Dict of constants

  • entity_symbols – entity symbols

  • kg_symbols – kg symbols

  • type_symbols – type symbols

Returns: entity strings, number of types over max length, number of relations over max length

bootleg.dataset.get_structural_entity_str(items, max_tok_len, sep_tok)[source]

Return the items of structural resources joined by sep_tok.

Parameters
  • items – list of structural resources

  • max_tok_len – maximum token length

  • sep_tok – token to separate out resources

Returns

result string, number of items that went beyond max_tok_len
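A small sketch for get_structural_entity_str with hypothetical type names; per the Returns entry above, the call yields the joined string and an overflow count:

    from bootleg.dataset import get_structural_entity_str

    ent_str, num_over_max = get_structural_entity_str(
        items=["city", "capital", "municipality of Germany"],
        max_tok_len=10,
        sep_tok="[SEP]",
    )
    print(ent_str, num_over_max)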

bootleg.extract_all_entities module

Bootleg command for extracting all entity embeddings.

bootleg.extract_all_entities.parse_cmdline_args()[source]

Parse command line.

Takes an input config file and parses it into the correct subdictionary groups for the model.

Returns

  • model run mode of train, eval, or dumping

  • parsed Dict config

  • path to the original config

bootleg.extract_all_entities.run_model(config, run_config_path=None)[source]

Run Emmental Bootleg model.

Parameters
  • config – parsed model config

  • run_config_path – original config path (for saving)

bootleg.extract_all_entities.setup(config, run_config_path=None)[source]

Set distributed backend and save configuration files.

Parameters
  • config – config

  • run_config_path – path for original run config

bootleg.run module

Bootleg run command.

bootleg.run.configure_optimizer()[source]

Configure the optimizer for Bootleg.

Parameters
  • config – config

bootleg.run.parse_cmdline_args()[source]

Take an input config file and parse it into the correct subdictionary groups for the model.

Returns

  • model run mode of train, eval, or dumping

  • parsed Dict config

  • path to the original config

bootleg.run.run_model(mode, config, run_config_path=None, entity_emb_file=None)[source]

Run Emmental Bootleg models.

Parameters
  • mode – run mode (train, eval, dump_preds)

  • config – parsed model config

  • run_config_path – original config path (for saving)

  • entity_emb_file – file for dumped entity embeddings
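A hedged end-to-end sketch for bootleg.run, assuming parse_cmdline_args yields the three values listed in its Returns entry above:

    from bootleg.run import parse_cmdline_args, run_model

    mode, config, run_config_path = parse_cmdline_args()
    run_model(mode=mode, config=config, run_config_path=run_config_path)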

bootleg.run.setup(config, run_config_path=None)[source]

Set distributed backend and save configuration files.

Parameters
  • config – config

  • run_config_path – path for original run config

bootleg.scorer module

Bootleg scorer.

class bootleg.scorer.BootlegSlicedScorer(train_in_candidates, slices_datasets=None)[source]

Bases: object

Sliced NED scorer init.

Parameters
  • train_in_candidates – whether we train under the assumption that all gold QIDs are among the candidates

  • slices_datasets – slice dataset (see slicing/slice_dataset.py)

bootleg_score(golds: numpy.ndarray, probs: numpy.ndarray, preds: Optional[numpy.ndarray], uids: Optional[List[str]] = None) → Dict[str, float][source]

Scores the predictions using the gold labels and slices.

Parameters
  • golds – gold labels

  • probs – probabilities

  • preds – predictions (max prob candidate)

  • uids – unique identifiers

Returns: dictionary of tensorboard compatible keys and metrics
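A minimal scoring sketch with toy numpy arrays, assuming overall (non-sliced) metrics can be computed with slices_datasets=None and default uids:

    import numpy as np

    from bootleg.scorer import BootlegSlicedScorer

    scorer = BootlegSlicedScorer(train_in_candidates=True, slices_datasets=None)
    golds = np.array([0, 1, 2])            # gold candidate indices
    probs = np.array([
        [0.7, 0.2, 0.1],
        [0.1, 0.8, 0.1],
        [0.2, 0.2, 0.6],
    ])
    preds = probs.argmax(axis=1)           # max-prob candidate per mention
    metrics = scorer.bootleg_score(golds=golds, probs=probs, preds=preds)
    print(metrics)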

get_slices(uid)[source]

Get slice incidence matrices.

Get slice incidence matrices for the uid. The uid has dtype np.dtype([('sent_idx', 'i8', 1), ('subsent_idx', 'i8', 1), ('alias_orig_list_pos', 'i8', max_aliases)]), where alias_orig_list_pos gives the mentions' original positions in the sentence.

Parameters
  • uid – unique identifier of sentence

Returns: dictionary of slice_name -> matrix of 0/1 indicating whether each alias is in the slice (-1 for no alias)
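For illustration, a uid with the documented dtype can be built directly (max_aliases is assumed to be 10 here):

    import numpy as np

    max_aliases = 10  # assumption for illustration
    uid_dtype = np.dtype(
        [
            ("sent_idx", "i8", 1),
            ("subsent_idx", "i8", 1),
            ("alias_orig_list_pos", "i8", max_aliases),
        ]
    )
    uid = np.zeros(1, dtype=uid_dtype)[0]
    uid["sent_idx"] = 3
    # slices = scorer.get_slices(uid)  # dict of slice_name -> 0/1 incidence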

bootleg.task_config module

Emmental task constants.

Module contents

Print functions for distributed computation.

bootleg.log_rank_0_debug(logger, message)[source]

If distributed is initialized, log debug messages only on rank 0.

bootleg.log_rank_0_info(logger, message)[source]

If distributed is initialized, log info messages only on rank 0.
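A short usage sketch for the rank-0 logging helpers:

    import logging

    from bootleg import log_rank_0_debug, log_rank_0_info

    logger = logging.getLogger(__name__)
    log_rank_0_info(logger, "Starting data prep")
    log_rank_0_debug(logger, "Loaded entity symbols")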