bootleg package¶
Subpackages¶
- bootleg.end2end package
- bootleg.layers package
- bootleg.slicing package
- bootleg.symbols package
- bootleg.tasks package
- bootleg.utils package
- Subpackages
- bootleg.utils.classes package
- bootleg.utils.parser package
- bootleg.utils.preprocessing package
- Submodules
- bootleg.utils.preprocessing.compute_statistics module
- bootleg.utils.preprocessing.count_body_part_size module
- bootleg.utils.preprocessing.gen_alias_cand_map module
- bootleg.utils.preprocessing.gen_entity_mappings module
- bootleg.utils.preprocessing.get_train_qid_counts module
- bootleg.utils.preprocessing.sample_eval_data module
- Module contents
- Submodules
- bootleg.utils.data_utils module
- bootleg.utils.eval_utils module
- bootleg.utils.model_utils module
- bootleg.utils.utils module
- Module contents
Submodules¶
bootleg.data module¶
Bootleg data creation.
- bootleg.data.bootleg_collate_fn(batch: Union[List[Tuple[Dict[str, Any], Dict[str, torch.Tensor]]], List[Dict[str, Any]]]) → Union[Tuple[Dict[str, Any], Dict[str, torch.Tensor]], Dict[str, Any]] [source]¶
Collate function (modified from emmental collate fn).
The main difference is that our collate function merges candidates from across the batch for disambiguation.
- Parameters
batch – the batch to collate
- Returns
The collated batch.
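For orientation, a minimal sketch of wiring this collate function into a standard PyTorch DataLoader; the dataset variable and batch size are illustrative assumptions:

```python
from torch.utils.data import DataLoader

from bootleg.data import bootleg_collate_fn

# `train_dataset` is assumed to be an already-built BootlegDataset.
train_loader = DataLoader(
    train_dataset,
    batch_size=32,
    shuffle=True,
    collate_fn=bootleg_collate_fn,  # merges candidates across the batch
)
```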
- bootleg.data.get_dataloaders(args, tasks, use_batch_cands, load_entity_data, splits, entity_symbols, tokenizer, dataset_offsets: Optional[Dict[str, List[int]]] = None)[source]¶
Get the dataloaders.
- Parameters
args – main args
tasks – task names
use_batch_cands – whether to use candidates across a batch (train and eval_batch_cands)
load_entity_data – whether to load entity data
splits – data splits to generate dataloaders for
entity_symbols – entity symbols
dataset_offsets – [start, end] offsets for each split to index into the dataset. Dataset len is end-start. If end is None, end is the length of the dataset.
Returns: list of dataloaders
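A hedged usage sketch; `args` and `entity_symbols` are assumed to be built elsewhere (the parsed config and the entity database), and the task name is illustrative:

```python
from transformers import AutoTokenizer

from bootleg.data import get_dataloaders

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
dataloaders = get_dataloaders(
    args,                           # parsed config (assumed)
    tasks=["NED"],                  # illustrative task name
    use_batch_cands=True,           # merge candidates across the batch
    load_entity_data=True,
    splits=["train", "dev"],
    entity_symbols=entity_symbols,  # entity database (assumed)
    tokenizer=tokenizer,
)
```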
bootleg.dataset module¶
Bootleg NED Dataset.
- class bootleg.dataset.BootlegDataset(main_args, name, dataset, use_weak_label, load_entity_data, tokenizer, entity_symbols, dataset_threads, split='train', is_bert=True, dataset_range=None)[source]¶
Bases: emmental.data.EmmentalDataset
Bootleg Dataset class.
- Parameters
main_args – input config
name – internal dataset name
dataset – dataset file
use_weak_label – whether to use weakly labeled mentions or not
load_entity_data – whether to load entity data or not
tokenizer – sentence tokenizer
entity_symbols – entity database class
dataset_threads – number of threads to use
split – data split
is_bert – whether the tokenizer is a BERT tokenizer or not
dataset_range – offset into dataset
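A hedged construction sketch; `main_args`, `tokenizer`, and `entity_symbols` are assumed to come from the config parser, a BERT tokenizer, and the entity database, and the name and file path are illustrative:

```python
from bootleg.dataset import BootlegDataset

train_data = BootlegDataset(
    main_args=main_args,
    name="Bootleg",              # illustrative internal name
    dataset="data/train.jsonl",  # illustrative dataset file
    use_weak_label=True,
    load_entity_data=True,
    tokenizer=tokenizer,
    entity_symbols=entity_symbols,
    dataset_threads=4,
    split="train",
    is_bert=True,
)
```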
- classmethod build_data_dicts(save_dataset_name, save_labels_name, X_storage, Y_storage)[source]¶
Return the X_dict and Y_dict of inputs and labels.
- Parameters
save_dataset_name – memmap file name with inputs
save_labels_name – memmap file name with labels
X_storage – memmap storage for inputs
Y_storage – memmap storage labels
Returns: X_dict of inputs and Y_dict of labels for Emmental datasets
- class bootleg.dataset.BootlegEntityDataset(main_args, name, dataset, tokenizer, entity_symbols, dataset_threads, split='test')[source]¶
Bases: emmental.data.EmmentalDataset
Bootleg Dataset class for entities.
- Parameters
main_args – input config
name – internal dataset name
dataset – dataset file
tokenizer – sentence tokenizer
entity_symbols – entity database class
dataset_threads – number of threads to use
split – data split
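The entity variant is built the same way, minus the weak-label and BERT flags; a sketch under the same assumptions as above:

```python
from bootleg.dataset import BootlegEntityDataset

entity_data = BootlegEntityDataset(
    main_args=main_args,
    name="Bootleg_entities",    # illustrative internal name
    dataset="data/test.jsonl",  # illustrative dataset file
    tokenizer=tokenizer,
    entity_symbols=entity_symbols,
    dataset_threads=4,
    split="test",
)
```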
- class bootleg.dataset.InputExample(sent_idx, subsent_idx, alias_list_pos, alias_to_predict, span, phrase, alias, qid, qid_cnt_mask_score)[source]¶
Bases: object
A single training/test example for prediction.
- class bootleg.dataset.InputFeatures(alias_idx, word_input_ids, word_token_type_ids, word_attention_mask, word_qid_cnt_mask_score, gold_eid, for_dump_gold_eid, gold_cand_K_idx, for_dump_gold_cand_K_idx_train, alias_list_pos, sent_idx, subsent_idx, guid)[source]¶
Bases: object
A single set of features of data.
- bootleg.dataset.build_and_save_entity_inputs(save_entity_dataset_name, X_entity_storage, data_config, dataset_threads, tokenizer, entity_symbols)[source]¶
Create entity features.
- Parameters
save_entity_dataset_name – memmap filename to save the entity data
X_entity_storage – storage type for memmap file
data_config – data config
dataset_threads – number of threads
tokenizer – tokenizer
entity_symbols – entity symbols
- bootleg.dataset.build_and_save_entity_inputs_hlp(input_qids)[source]¶
Create entity features multiprocessing helper.
- bootleg.dataset.build_and_save_entity_inputs_initializer(constants, data_config, save_entity_dataset_name, X_entity_storage, tokenizer)[source]¶
Create entity features multiprocessing initializer.
- bootleg.dataset.build_and_save_entity_inputs_single(input_qids, constants, memfile, type_symbols, kg_symbols, tokenizer, entity_symbols)[source]¶
Create entity features.
- bootleg.dataset.convert_examples_to_features_and_save(meta_file, guid_dtype, data_config, dataset_threads, use_weak_label, split, is_bert, save_dataset_name, save_labels_name, X_storage, Y_storage, tokenizer, entity_symbols)[source]¶
Create features from examples.
Converts the prepped examples into input features and saves them in memmap files. These are used in the __getitem__ method.
- Parameters
meta_file – metadata file where input file paths are saved
guid_dtype – unique identifier dtype
data_config – data config
dataset_threads – number of threads
use_weak_label – whether to use weak labeling or not
split – data split
is_bert – whether the tokenizer is a BERT tokenizer or not
save_dataset_name – data features file name to save
save_labels_name – data labels file name to save
X_storage – data features storage type (for memmap)
Y_storage – data labels storage type (for memmap)
tokenizer – tokenizer
entity_symbols – entity symbols
- bootleg.dataset.convert_examples_to_features_and_save_hlp(input_dict)[source]¶
Convert examples to features multiprocessing helper.
- bootleg.dataset.convert_examples_to_features_and_save_initializer(tokenizer, data_config, save_dataset_name, save_labels_name, X_storage, Y_storage)[source]¶
Convert examples to features multiprocessing initializer.
- bootleg.dataset.convert_examples_to_features_and_save_single(input_dict, tokenizer, entitysymbols, mmap_file, mmap_label_file)[source]¶
Convert examples to features.
- bootleg.dataset.create_examples(dataset, create_ex_indir, create_ex_outdir, meta_file, data_config, dataset_threads, use_weak_label, split, is_bert, tokenizer)[source]¶
Create examples from the raw input data.
- Parameters
dataset – data file to read
create_ex_indir – temporary directory where input files are stored
create_ex_outdir – temporary directory to store output files from method
meta_file – metadata file to save the file names/paths for the next step in prep pipeline
data_config – data config
dataset_threads – number of threads
use_weak_label – whether to use weak labeling or not
split – data split
is_bert – whether the tokenizer is a BERT tokenizer or not
tokenizer – tokenizer
- bootleg.dataset.create_examples_initializer(constants_dict, tokenizer)[source]¶
Create examples multiprocessing initializer.
- bootleg.dataset.create_examples_single(in_file_idx, in_file_name, in_file_lines, out_file_name, constants_dict, tokenizer)[source]¶
Create examples.
- bootleg.dataset.extract_context(span, sentence, max_seq_window_len, tokenizer)[source]¶
Extract the left and right context window around a span.
- Parameters
span – character span (left and right values)
sentence – sentence
max_seq_window_len – maximum window length around a span
tokenizer – tokenizer
Returns: context window
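A hedged sketch of extracting the window around one mention; the tokenizer model and window length are assumptions:

```python
from transformers import AutoTokenizer

from bootleg.dataset import extract_context

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
sentence = "Lincoln is the capital of Nebraska."
span = [0, 7]  # character offsets of the mention "Lincoln"
context = extract_context(span, sentence, max_seq_window_len=64, tokenizer=tokenizer)
```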
- bootleg.dataset.get_entity_string(qid, constants, entity_symbols, kg_symbols, type_symbols)[source]¶
Get string representation of entity.
For each entity, this generates a string that is fed into a language model to produce an entity embedding, and returns all tokens that form the title of the entity (even if they appear in the description).
- Parameters
qid – QID
constants – Dict of constants
entity_symbols – entity symbols
kg_symbols – kg symbols
type_symbols – type symbols
Returns: entity strings, number of types over max length, number of relations over max length
- bootleg.dataset.get_structural_entity_str(items, max_tok_len, sep_tok)[source]¶
Return a sep_tok-joined string of structural resource items.
- Parameters
items – list of structural resources
max_tok_len – maximum token length
sep_tok – token to separate out resources
- Returns
result string, and the number of items that exceeded max_tok_len
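For illustration, a hedged call that joins a hypothetical list of type names with a separator token; the two return values follow the Returns description above:

```python
from bootleg.dataset import get_structural_entity_str

types = ["politician", "lawyer", "statesperson"]  # hypothetical resources
type_str, num_over = get_structural_entity_str(types, max_tok_len=10, sep_tok="[SEP]")
```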
bootleg.extract_all_entities module¶
Bootleg command to extract all entity embeddings.
- bootleg.extract_all_entities.parse_cmdline_args()[source]¶
Parse command line.
Takes an input config file and parses it into the correct subdictionary groups for the model.
- Returns
model run mode of train, eval, or dumping
parsed Dict config
path to the original config file
bootleg.run module¶
Bootleg run command.
- bootleg.run.configure_optimizer()[source]¶
Configure the optimizer for Bootleg.
- Parameters
config – config
- bootleg.run.parse_cmdline_args()[source]¶
Take an input config file and parse it into the correct subdictionary groups for the model.
- Returns
model run mode of train, eval, or dumping
parsed Dict config
path to the original config file
- bootleg.run.run_model(mode, config, run_config_path=None, entity_emb_file=None)[source]¶
Run Emmental Bootleg models.
- Parameters
mode – run mode (train, eval, dump_preds)
config – parsed model config
run_config_path – original config path (for saving)
entity_emb_file – file for dumped entity embeddings
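A hedged end-to-end sketch chaining the two entry points above; the exact return tuple of parse_cmdline_args is an assumption based on its Returns description:

```python
from bootleg.run import parse_cmdline_args, run_model

# Parse sys.argv into (mode, config, original config path); the
# tuple order here is an assumption, not a documented guarantee.
mode, config, run_config_path = parse_cmdline_args()
run_model(mode, config, run_config_path=run_config_path)
```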
bootleg.scorer module¶
Bootleg scorer.
- class bootleg.scorer.BootlegSlicedScorer(train_in_candidates, slices_datasets=None)[source]¶
Bases: object
Sliced NED scorer.
- Parameters
train_in_candidates – whether training assumes that all gold QIDs are among the candidates
slices_datasets – slice dataset (see slicing/slice_dataset.py)
- bootleg_score(golds: numpy.ndarray, probs: numpy.ndarray, preds: Optional[numpy.ndarray], uids: Optional[List[str]] = None) → Dict[str, float] [source]¶
Scores the predictions using the gold labels and slices.
- Parameters
golds – gold labels
probs – probabilities
preds – predictions (max prob candidate)
uids – unique identifiers
Returns: dictionary of tensorboard compatible keys and metrics
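A hedged sketch with toy arrays (4 mentions, 5 candidates each); without slice datasets, only overall metrics would be expected:

```python
import numpy as np

from bootleg.scorer import BootlegSlicedScorer

golds = np.array([0, 2, 1, 0])  # gold candidate indices
probs = np.random.rand(4, 5)    # per-candidate probabilities
preds = probs.argmax(axis=1)    # max-prob candidate per mention

scorer = BootlegSlicedScorer(train_in_candidates=True)
metrics = scorer.bootleg_score(golds, probs, preds, uids=None)
```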
- get_slices(uid)[source]¶
Get slice incidence matrices.
Get the slice incidence matrices for a uid. The uid has dtype np.dtype([('sent_idx', 'i8', 1), ('subsent_idx', 'i8', 1), ('alias_orig_list_pos', 'i8', max_aliases)]), where alias_orig_list_pos gives each mention's original position in the sentence.
- Parameters
uid – unique identifier of sentence
Returns: dictionary of slice_name -> 0/1 matrix indicating whether each alias is in the slice (-1 where there is no alias)
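The structured uid dtype quoted above can be constructed directly with NumPy; a hedged sketch where max_aliases and the index values are illustrative, and `scorer` is assumed to have been built with slice datasets:

```python
import numpy as np

max_aliases = 10  # illustrative
uid_dtype = np.dtype(
    [
        ("sent_idx", "i8", 1),
        ("subsent_idx", "i8", 1),
        ("alias_orig_list_pos", "i8", max_aliases),
    ]
)
# Sentence 3, sub-sentence 0, two mentions; -1 pads empty alias slots.
uid = np.array([(3, 0, [0, 1] + [-1] * (max_aliases - 2))], dtype=uid_dtype)[0]
slices = scorer.get_slices(uid)  # slice_name -> 0/1 incidence vector
```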
bootleg.task_config module¶
Emmental task constants.
Module contents¶
Print functions for distributed computation.