bootleg package

Subpackages

Submodules

bootleg.data module

Bootleg data creation.

bootleg.data.bootleg_collate_fn(batch: Union[List[Tuple[Dict[str, Any], Dict[str, torch.Tensor]]], List[Dict[str, Any]]]) → Union[Tuple[Dict[str, Any], Dict[str, torch.Tensor]], Dict[str, Any]][source]

Collate function (modified from the Emmental collate function).

The main difference is that our collate function merges candidates from across the batch for disambiguation.

Parameters
  • batch – The batch to collate.

Returns

The collated batch.
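A minimal usage sketch (with hypothetical objects): plugging bootleg_collate_fn into a standard PyTorch DataLoader so candidates are merged across the batch. Here train_dataset is assumed to be an already-built BootlegDataset whose items match the batch type above:

    from torch.utils.data import DataLoader

    from bootleg.data import bootleg_collate_fn

    loader = DataLoader(
        train_dataset,                  # assumed: an already-built BootlegDataset
        batch_size=32,
        shuffle=True,
        collate_fn=bootleg_collate_fn,  # merges candidates across the batch
    )
    for X_dict, Y_dict in loader:       # items are (X_dict, Y_dict) tuples
        break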

bootleg.data.get_dataloaders(args, tasks, use_batch_cands, load_entity_data, splits, entity_symbols, tokenizer, dataset_offsets: Optional[Dict[str, List[int]]] = None)[source]

Get the dataloaders.

Parameters
  • args – main args

  • tasks – task names

  • use_batch_cands – whether to use candidates across a batch (train and eval_batch_cands)

  • load_entity_data – whether to load entity data

  • splits – data splits to generate dataloaders for

  • entity_symbols – entity symbols

  • dataset_offsets – [start, end] offsets for each split to index into the dataset. The dataset length is end - start. If end is None, end is set to the length of the dataset.

Returns: list of dataloaders
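A hedged sketch of calling get_dataloaders from a training script. The args, tasks, entity_symbols, and tokenizer objects are assumed to have been built elsewhere; only the call signature follows the documentation above:

    from bootleg.data import get_dataloaders

    dataloaders = get_dataloaders(
        args=args,                      # main config, assumed parsed already
        tasks=tasks,                    # task names, e.g. the NED task
        use_batch_cands=True,           # merge candidates across the batch
        load_entity_data=True,
        splits=["train", "dev"],
        entity_symbols=entity_symbols,  # entity database, assumed loaded
        tokenizer=tokenizer,            # e.g. a HuggingFace BERT tokenizer
    )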

bootleg.data.get_entity_dataloaders(args, tasks, entity_symbols, tokenizer)[source]

Get the entity dataloaders.

Parameters
  • args – main args

  • tasks – task names

  • entity_symbols – entity symbols

  • tokenizer – tokenizer

Returns: list of dataloaders

bootleg.data.get_slicedatasets(args, splits, entity_symbols)[source]

Get the slice datasets.

Parameters
  • args – main args

  • splits – splits to get datasets for

  • entity_symbols – entity symbols

Returns: Dict of slice datasets
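A companion sketch for get_slicedatasets, reusing the assumed args and entity_symbols objects from the example above:

    from bootleg.data import get_slicedatasets

    # Returns a dict of slice datasets, per the documentation above.
    slice_datasets = get_slicedatasets(
        args=args,
        splits=["dev", "test"],
        entity_symbols=entity_symbols,
    )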

bootleg.dataset module

Bootleg NED Dataset.

class bootleg.dataset.BootlegDataset(main_args, name, dataset, use_weak_label, load_entity_data, tokenizer, entity_symbols, dataset_threads, split='train', is_bert=True, dataset_range=None)[source]

Bases: emmental.data.EmmentalDataset

Bootleg Dataset class.

Parameters
  • main_args – input config

  • name – internal dataset name

  • dataset – dataset file

  • use_weak_label – whether to use weakly labeled mentions or not

  • load_entity_data – whether to load entity data or not

  • tokenizer – sentence tokenizer

  • entity_symbols – entity database class

  • dataset_threads – number of threads to use

  • split – data split

  • is_bert – whether the tokenizer is a BERT tokenizer or not

  • dataset_range – offset into dataset

classmethod build_data_dicts(save_dataset_name, save_labels_name, X_storage, Y_storage)[source]

Return the X_dict and Y_dict of inputs and labels.

Parameters
  • save_dataset_name – memmap file name with inputs

  • save_labels_name – memmap file name with labels

  • X_storage – memmap storage for inputs

  • Y_storage – memmap storage for labels

Returns: X_dict of inputs and Y_dict of labels for Emmental datasets

classmethod build_data_entity_dicts(save_dataset_name, X_storage)[source]

Return the X_dict for the entity data.

Parameters
  • save_dataset_name – memmap file name with entity data

  • X_storage – memmap storage type

Returns: X_dict of entity data

get_sentidx_to_rowids()[source]

Get mapping from sent idx to row id in X_dict.

Returns: Dict of sent idx to row id
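A hedged construction sketch for BootlegDataset; in normal use, bootleg.data.get_dataloaders builds this for you. The args, tokenizer, and entity_symbols objects are assumed, and train.jsonl is a hypothetical dataset file:

    from bootleg.dataset import BootlegDataset

    dataset = BootlegDataset(
        main_args=args,
        name="Bootleg_train",
        dataset="train.jsonl",          # hypothetical dataset file
        use_weak_label=True,
        load_entity_data=True,
        tokenizer=tokenizer,
        entity_symbols=entity_symbols,
        dataset_threads=4,
        split="train",
        is_bert=True,
    )
    # Map each sentence index to its row ids in X_dict.
    sentidx_to_rowids = dataset.get_sentidx_to_rowids()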

class bootleg.dataset.BootlegEntityDataset(main_args, name, dataset, tokenizer, entity_symbols, dataset_threads, split='test')[source]

Bases: emmental.data.EmmentalDataset

Bootleg Dataset class for entities.

Parameters
  • main_args – input config

  • name – internal dataset name

  • dataset – dataset file

  • tokenizer – sentence tokenizer

  • entity_symbols – entity database class

  • dataset_threads – number of threads to use

  • split – data split

classmethod build_data_entity_dicts(save_dataset_name, X_storage)[source]

Return the X_dict for the entity data.

Parameters
  • save_dataset_name – memmap file name with entity data

  • X_storage – memmap storage type

Returns: X_dict of entity data

class bootleg.dataset.InputExample(sent_idx, subsent_idx, alias_list_pos, alias_to_predict, span, phrase, alias, qid, qid_cnt_mask_score)[source]

Bases: object

A single training/test example for prediction.

classmethod from_dict(in_dict)[source]

Create object from dictionary.

to_dict()[source]

Return dictionary of object.
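A serialization round-trip sketch for InputExample. The field values are hypothetical; the field names come from the constructor signature above:

    from bootleg.dataset import InputExample

    example = InputExample(
        sent_idx=0,
        subsent_idx=0,
        alias_list_pos=0,
        alias_to_predict=0,
        span=[10, 16],               # character span of "Berlin"
        phrase="I live in Berlin",
        alias="berlin",
        qid="Q64",
        qid_cnt_mask_score=1.0,
    )
    # to_dict/from_dict are documented above; assuming they are inverses.
    restored = InputExample.from_dict(example.to_dict())
    assert restored.to_dict() == example.to_dict()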

class bootleg.dataset.InputFeatures(alias_idx, word_input_ids, word_token_type_ids, word_attention_mask, word_qid_cnt_mask_score, gold_eid, for_dump_gold_eid, gold_cand_K_idx, for_dump_gold_cand_K_idx_train, alias_list_pos, sent_idx, subsent_idx, guid)[source]

Bases: object

A single set of features of data.

classmethod from_dict(in_dict)[source]

Create object from dictionary.

to_dict()[source]

Return dictionary of object.

bootleg.dataset.build_and_save_entity_inputs(save_entity_dataset_name, X_entity_storage, data_config, dataset_threads, tokenizer, entity_symbols)[source]

Create entity features.

Parameters
  • save_entity_dataset_name – memmap filename to save the entity data

  • X_entity_storage – storage type for memmap file

  • data_config – data config

  • dataset_threads – number of threads

  • tokenizer – tokenizer

  • entity_symbols – entity symbols

bootleg.dataset.build_and_save_entity_inputs_hlp(input_qids)[source]

Create entity features multiprocessing helper.

bootleg.dataset.build_and_save_entity_inputs_initializer(constants, data_config, save_entity_dataset_name, X_entity_storage, tokenizer)[source]

Create entity features multiprocessing initializer.

bootleg.dataset.build_and_save_entity_inputs_single(input_qids, constants, memfile, type_symbols, kg_symbols, tokenizer, entity_symbols)[source]

Create entity features.

bootleg.dataset.convert_examples_to_features_and_save(meta_file, guid_dtype, data_config, dataset_threads, use_weak_label, split, is_bert, save_dataset_name, save_labels_name, X_storage, Y_storage, tokenizer, entity_symbols)[source]

Create features from examples.

Converts the prepped examples into input features and saves them in memmap files. These are used in the __getitem__ method.

Parameters
  • meta_file – metadata file where input file paths are saved

  • guid_dtype – unique identifier dtype

  • data_config – data config

  • dataset_threads – number of threads

  • use_weak_label – whether to use weak labeling or not

  • split – data split

  • is_bert – whether the tokenizer is a BERT tokenizer

  • save_dataset_name – data features file name to save

  • save_labels_name – data labels file name to save

  • X_storage – data features storage type (for memmap)

  • Y_storage – data labels storage type (for memmap)

  • tokenizer – tokenizer

  • entity_symbols – entity symbols

bootleg.dataset.convert_examples_to_features_and_save_hlp(input_dict)[source]

Convert examples to features multiprocessing helper.

bootleg.dataset.convert_examples_to_features_and_save_initializer(tokenizer, data_config, save_dataset_name, save_labels_name, X_storage, Y_storage)[source]

Convert examples to features multiprocessing initializer.

bootleg.dataset.convert_examples_to_features_and_save_single(input_dict, tokenizer, entitysymbols, mmap_file, mmap_label_file)[source]

Convert examples to features multiprocessing helper.

bootleg.dataset.create_examples(dataset, create_ex_indir, create_ex_outdir, meta_file, data_config, dataset_threads, use_weak_label, split, is_bert, tokenizer)[source]

Create examples from the raw input data.

Parameters
  • dataset – data file to read

  • create_ex_indir – temporary directory where input files are stored

  • create_ex_outdir – temporary directory to store output files from method

  • meta_file – metadata file to save the file names/paths for the next step in prep pipeline

  • data_config – data config

  • dataset_threads – number of threads

  • use_weak_label – whether to use weak labeling or not

  • split – data split

  • is_bert – whether the tokenizer is a BERT tokenizer

  • tokenizer – tokenizer

bootleg.dataset.create_examples_hlp(args)[source]

Create examples multiprocessing helper.

bootleg.dataset.create_examples_initializer(constants_dict, tokenizer)[source]

Create examples multiprocessing initializer.

bootleg.dataset.create_examples_single(in_file_idx, in_file_name, in_file_lines, out_file_name, constants_dict, tokenizer)[source]

Create examples.

bootleg.dataset.extract_context(span, sentence, max_seq_window_len, tokenizer)[source]

Extract the left and right context window around a span.

Parameters
  • span – character span (left and right values)

  • sentence – sentence

  • max_seq_window_len – maximum window length around a span

  • tokenizer – tokenizer

Returns: context window
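A usage sketch for extract_context, assuming a HuggingFace tokenizer stands in for the expected tokenizer object:

    from transformers import AutoTokenizer

    from bootleg.dataset import extract_context

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    sentence = "The capital of Germany is Berlin and it is large."
    context = extract_context(
        span=[26, 32],               # character span of "Berlin"
        sentence=sentence,
        max_seq_window_len=16,
        tokenizer=tokenizer,
    )
    print(context)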

bootleg.dataset.get_entity_string(qid, constants, entity_symbols, kg_symbols, type_symbols)[source]

Get string representation of entity.

For each entity, generates a string that is fed into a language model to generate an entity embedding. Also returns all tokens that make up the title of the entity (even if they appear in the description).

Parameters
  • qid – QID

  • constants – Dict of constants

  • entity_symbols – entity symbols

  • kg_symbols – kg symbols

  • type_symbols – type symbols

Returns: entity strings, number of types over max length, number of relations over max length

bootleg.dataset.get_structural_entity_str(items, max_tok_len, sep_tok)[source]

Return the items of structural resources joined by sep_tok.

Parameters
  • items – list of structural resources

  • max_tok_len – maximum token length

  • sep_tok – token to separate out resources

Returns

result string, number of items that went beyond max_tok_len
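A small sketch for get_structural_entity_str with hypothetical type names; per the Returns entry above, the call yields the joined string and an overflow count:

    from bootleg.dataset import get_structural_entity_str

    ent_str, num_over_max = get_structural_entity_str(
        items=["city", "capital", "municipality of Germany"],
        max_tok_len=10,
        sep_tok="[SEP]",
    )
    print(ent_str, num_over_max)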

bootleg.extract_all_entities module

Bootleg command for extracting all entity embeddings.

bootleg.extract_all_entities.parse_cmdline_args()[source]

Parse command line.

Takes an input config file and parses it into the correct subdictionary groups for the model.

Returns

  • model run mode of train, eval, or dumping

  • parsed Dict config

  • path to the original config

bootleg.extract_all_entities.run_model(config, run_config_path=None)[source]

Run Emmental Bootleg model.

Parameters
  • config – parsed model config

  • run_config_path – original config path (for saving)

bootleg.extract_all_entities.setup(config, run_config_path=None)[source]

Set distributed backend and save configuration files.

Parameters
  • config – config

  • run_config_path – path for original run config

bootleg.run module

Bootleg run command.

bootleg.run.configure_optimizer()[source]

Configure the optimizer for Bootleg.

Parameters
  • config – config

bootleg.run.parse_cmdline_args()[source]

Take an input config file and parse it into the correct subdictionary groups for the model.

Returns

  • model run mode of train, eval, or dumping

  • parsed Dict config

  • path to the original config

bootleg.run.run_model(mode, config, run_config_path=None, entity_emb_file=None)[source]

Run Emmental Bootleg models.

Parameters
  • mode – run mode (train, eval, dump_preds)

  • config – parsed model config

  • run_config_path – original config path (for saving)

  • entity_emb_file – file for dumped entity embeddings
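A hedged end-to-end sketch for bootleg.run, assuming parse_cmdline_args yields the three values listed in its Returns entry above:

    from bootleg.run import parse_cmdline_args, run_model

    mode, config, run_config_path = parse_cmdline_args()
    run_model(mode=mode, config=config, run_config_path=run_config_path)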

bootleg.run.setup(config, run_config_path=None)[source]

Set distributed backend and save configuration files.

Parameters
  • config – config

  • run_config_path – path for original run config

bootleg.scorer module

Bootleg scorer.

class bootleg.scorer.BootlegSlicedScorer(train_in_candidates, slices_datasets=None)[source]

Bases: object

Sliced NED scorer init.

Parameters
  • train_in_candidates – whether we train under the assumption that all gold QIDs are among the candidates

  • slices_datasets – slice dataset (see slicing/slice_dataset.py)

bootleg_score(golds: numpy.ndarray, probs: numpy.ndarray, preds: Optional[numpy.ndarray], uids: Optional[List[str]] = None) → Dict[str, float][source]

Scores the predictions using the gold labels and slices.

Parameters
  • golds – gold labels

  • probs – probabilities

  • preds – predictions (max prob candidate)

  • uids – unique identifiers

Returns: dictionary of tensorboard compatible keys and metrics
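A minimal scoring sketch with toy numpy arrays, assuming overall (non-sliced) metrics can be computed with slices_datasets=None and default uids:

    import numpy as np

    from bootleg.scorer import BootlegSlicedScorer

    scorer = BootlegSlicedScorer(train_in_candidates=True, slices_datasets=None)
    golds = np.array([0, 1, 2])            # gold candidate indices
    probs = np.array([
        [0.7, 0.2, 0.1],
        [0.1, 0.8, 0.1],
        [0.2, 0.2, 0.6],
    ])
    preds = probs.argmax(axis=1)           # max-prob candidate per mention
    metrics = scorer.bootleg_score(golds=golds, probs=probs, preds=preds)
    print(metrics)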

get_slices(uid)[source]

Get slice incidence matrices.

Get slice incidence matrices for the uid. The uid has dtype np.dtype([('sent_idx', 'i8', 1), ('subsent_idx', 'i8', 1), ('alias_orig_list_pos', 'i8', max_aliases)]), where alias_orig_list_pos gives the mentions' original positions in the sentence.

Parameters
  • uid – unique identifier of sentence

Returns: dictionary of slice_name -> matrix of 0/1 indicating whether each alias is in the slice (-1 for no alias)
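For illustration, a uid with the documented dtype can be built directly (max_aliases is assumed to be 10 here):

    import numpy as np

    max_aliases = 10  # assumption for illustration
    uid_dtype = np.dtype(
        [
            ("sent_idx", "i8", 1),
            ("subsent_idx", "i8", 1),
            ("alias_orig_list_pos", "i8", max_aliases),
        ]
    )
    uid = np.zeros(1, dtype=uid_dtype)[0]
    uid["sent_idx"] = 3
    # slices = scorer.get_slices(uid)  # dict of slice_name -> 0/1 incidence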

bootleg.task_config module

Emmental task constants.

Module contents

Print functions for distributed computation.

bootleg.log_rank_0_debug(logger, message)[source]

If distributed is initialized, log debug messages only on rank 0.

bootleg.log_rank_0_info(logger, message)[source]

If distributed is initialized, log info messages only on rank 0.
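A short usage sketch for the rank-0 logging helpers:

    import logging

    from bootleg import log_rank_0_debug, log_rank_0_info

    logger = logging.getLogger(__name__)
    log_rank_0_info(logger, "Starting data prep")
    log_rank_0_debug(logger, "Loaded entity symbols")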