bootleg.utils package

Subpackages

Submodules

bootleg.utils.data_utils module

Bootleg data utils.

bootleg.utils.data_utils.add_special_tokens(tokenizer)[source]

Add special tokens.

Parameters

tokenizer – tokenizer

bootleg.utils.data_utils.correct_not_augmented_dict_values(gold, dict_values)[source]

Correct gold label dict values in data prep.

Modifies the dict_values to only contain those mentions that are gold labels. The new dictionary has the alias indices corrected to start at 0 and end at the number of gold mentions.

Parameters
  • gold – List of T/F values if mention is gold label or not

  • dict_values – Dict of slice_name -> Dict[alias_idx] -> slice probability

Returns: adjusted dict_values such that only gold = True aliases are kept (dict is reindexed to start at 0)
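
A hypothetical illustration of the reindexing described above (key and value types are simplified for readability):

    # Suppose mentions 0 and 2 are gold labels and mention 1 is weakly labeled.
    gold = [True, False, True]
    dict_values = {"slice_a": {0: 0.9, 1: 0.4, 2: 0.7}}
    # After correction, only the gold mentions remain and are reindexed from 0:
    # {"slice_a": {0: 0.9, 1: 0.7}}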

bootleg.utils.data_utils.generate_slice_name(data_args, slice_names, use_weak_label, dataset)[source]

Generate name for slice datasets, taking into account the config eval slices.

Parameters
  • data_args – data args

  • slice_names – slice names

  • use_weak_label – if using weak labels or not

  • dataset – dataset name

Returns: dataset name for saving slice data

bootleg.utils.data_utils.get_chunk_dir(prep_dir)[source]

Get directory for saving data chunks.

Parameters

prep_dir – prep directory

Returns: directory path

bootleg.utils.data_utils.get_data_prep_dir(data_config)[source]

Get data prep directory for saving prep files.

Parameters

data_config – data config

Returns: directory path

bootleg.utils.data_utils.get_emb_prep_dir(data_config)[source]

Get embedding prep directory for saving prep files.

Parameters

data_config – data config

Returns: directory path

bootleg.utils.data_utils.get_eval_slices(eval_slices)[source]

Get eval slices in data prep.

Given input eval slices (passed in config), ensure FINAL_LOSS is in the eval slices. FINAL_LOSS gives overall metrics.

Parameters

eval_slices – list of input eval slices

Returns: list of eval slices to use in the model
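
A minimal sketch of the behavior described above, assuming FINAL_LOSS is a string constant (the exact constant lives in Bootleg's internals):

    def get_eval_slices_sketch(eval_slices, final_loss="final_loss"):
        # Ensure the overall FINAL_LOSS slice is always evaluated,
        # even if the config did not list it.
        slices = list(eval_slices)
        if final_loss not in slices:
            slices.insert(0, final_loss)
        return slices

    get_eval_slices_sketch(["my_slice"])  # ['final_loss', 'my_slice']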

bootleg.utils.data_utils.get_save_data_folder(data_args, use_weak_label, dataset)[source]

Get save data folder for the prepped data.

Parameters
  • data_args – data config

  • use_weak_label – whether to use weak labelling or not

  • dataset – dataset name

Returns: folder string path

bootleg.utils.data_utils.get_save_data_folder_candgen(data_args, use_weak_label, dataset)[source]

Get save data folder for the prepped data.

Parameters
  • data_args – data config

  • use_weak_label – whether to use weak labelling or not

  • dataset – dataset name

Returns: folder string path

bootleg.utils.data_utils.read_in_akas(entitysymbols)[source]

Read in alias-to-QID mappings and generate a mapping from QID to a list of alternate names.

Parameters

entitysymbols – entity symbols

Returns: dictionary of QID to list of alternate names

bootleg.utils.eval_utils module

Bootleg eval utils.

bootleg.utils.eval_utils.batched_pred_iter(model, dataloader, dump_preds_accumulation_steps, sent_idx2num_mens)[source]

Predict from dataloader.

Predict from dataloader taking into account eval accumulation steps. Will yield a new prediction set after each set of accumulation steps for writing out.

If a sentence or batch doesn’t have any mentions, it will not be returned by this method.

Recall that we split up sentences that are too long to feed to the model. We use the sent_idx2num_mens dict to ensure we have full sentences evaluated before returning, otherwise we’ll have incomplete sentences to merge together when dumping.

Parameters
  • model – model

  • dataloader – The dataloader to predict

  • dump_preds_accumulation_steps – Number of eval steps to run before returning

  • sent_idx2num_mens – Dict of sentence index to number of mentions

Returns

Iterator over result dict.

bootleg.utils.eval_utils.check_and_create_alias_cand_trie(save_folder, entity_symbols)[source]

Create an mmap-backed trie object for storing the alias-candidate mappings.

Parameters
  • save_folder – save folder for alias trie

  • entity_symbols – entity symbols

bootleg.utils.eval_utils.collect_and_merge_results(unmerged_entity_emb_file, emb_file_config, config, sent_idx2num_mens, sent_idx2row, save_folder, entity_symbols)[source]

Merge mentions, filter non-gold labels, and save to file.

Parameters
  • unmerged_entity_emb_file – memmap file from dump step

  • emb_file_config – config file for loading memmap file

  • config – model config

  • sent_idx2num_mens – Dict sentence idx to number of mentions

  • sent_idx2row – Dict sentence idx to row of eval data

  • save_folder – folder to save results

  • entity_symbols – entity symbols

Returns: saved prediction file, total mentions seen

bootleg.utils.eval_utils.dump_model_outputs(model, dataloader, config, sentidx2num_mentions, save_folder, entity_symbols, task_name, overwrite_data)[source]

Dump model outputs.

Parameters
  • model – model

  • dataloader – data loader

  • config – config

  • sentidx2num_mentions – Dict from sentence idx to number of mentions

  • save_folder – save folder

  • entity_symbols – entity symbols

  • task_name – task name

  • overwrite_data – overwrite saved mmap files

Returns: memmap file name for saved outputs, dtype file name for loading memmap file

bootleg.utils.eval_utils.get_emb_file(save_folder)[source]

Get the embedding numpy file for the batch.

Parameters

save_folder – save folder

Returns: string

bootleg.utils.eval_utils.get_eval_folder(file)[source]

Return eval folder for the given evaluation file.

Stored in log_path/filename/model_name.

Parameters

file – eval file

Returns: eval folder

bootleg.utils.eval_utils.get_result_file(save_folder)[source]

Get the jsonl label file for the batch.

Parameters

save_folder – save folder

Returns: string

bootleg.utils.eval_utils.get_sent_idx2num_mens(data_file)[source]

Get the map from sentence index to number of mentions and to data.

Used for calculating offsets and chunking file.

Parameters

data_file – eval file

Returns: Dict of sentence index -> number of mention per sentence, Dict of sentence index -> input line

bootleg.utils.eval_utils.get_sental2embid(merged_entity_emb_file, merged_storage_type)[source]

Get sent_idx, alias_idx mapping to emb idx for quick lookup.

Parameters
  • merged_entity_emb_file – memmap file after merge sentences

  • merged_storage_type – file storage type

Returns: Dict of f"{sent_idx}_{alias_idx}" -> index in merged_entity_emb_file

bootleg.utils.eval_utils.map_aliases_to_candidates(train_in_candidates, max_candidates, alias_cand_map, aliases)[source]

Get list of QID candidates for each alias.

Parameters
  • train_in_candidates – whether the model has a NC entity or not (assumes all gold QIDs are in candidate lists)

  • max_candidates – maximum number of candidates

  • alias_cand_map – alias -> candidate qids in dict or TwoLayerVocabularyScoreTrie format

  • aliases – list of aliases

Returns: List of lists QIDs

bootleg.utils.eval_utils.map_candidate_qids_to_eid(candidate_qids, qid2eid)[source]

Get list of EID candidates for each alias.

Parameters
  • candidate_qids – list of list of candidate QIDs

  • qid2eid – mapping of qid to entity id

Returns: List of lists EIDs

bootleg.utils.eval_utils.masked_class_logsoftmax(pred, mask, dim=2, temp=1.0, zero_delta=1e-45)[source]

Masked logsoftmax.

Mask of 0/False means mask value (ignore it)

Parameters
  • pred – input tensor

  • mask – mask

  • dim – softmax dimension

  • temp – softmax temperature

  • zero_delta – small value to add so that vector + (mask+zero_delta).log() is not NaN when the mask is all 0s

Returns: masked log softmax tensor
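
A minimal sketch of the masking trick described by the zero_delta parameter, using PyTorch (not necessarily Bootleg's exact implementation):

    import torch.nn.functional as F

    def masked_class_logsoftmax_sketch(pred, mask, dim=2, temp=1.0, zero_delta=1e-45):
        # Adding log(mask + zero_delta) sends masked (0/False) positions to a very
        # large negative value without producing NaN, then log-softmax is applied.
        return F.log_softmax(pred / temp + (mask + zero_delta).log(), dim=dim)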

bootleg.utils.eval_utils.merge_subsentences(num_processes, subset_sent_idx2num_mens, cache_folder, to_save_file, to_save_storage, to_read_file, to_read_storage)[source]

Merge and flatten sentences over sub-sentences.

Flatten all sentences back together over sub-sentences, removing the PAD aliases from the data. I.e., convert from sent_idx -> array of values to (sent_idx, alias_idx) -> value, with varying numbers of aliases per sentence.

Parameters
  • num_processes – number of processes

  • subset_sent_idx2num_mens – Dict of sentence index to number of mentions for this batch

  • cache_folder – cache directory

  • to_save_file – memmap file to save results to

  • to_save_storage – save file storage type

  • to_read_file – memmap file to read predictions from

  • to_read_storage – read file storage type

bootleg.utils.eval_utils.merge_subsentences_hlp(args)[source]

Merge subsentences multiprocessing subprocess helper.

bootleg.utils.eval_utils.merge_subsentences_initializer(to_write_file, to_write_storage, to_read_file, to_read_storage, sentidx2offset_file)[source]

Merge subsentences initializer for multiprocessing.

Parameters
  • to_write_file – file to write

  • to_write_storage – mmap storage type

  • to_read_file – file to read

  • to_read_storage – mmap storage type

  • sentidx2offset_file – sentence index to offset in mmap data

bootleg.utils.eval_utils.merge_subsentences_single(K, hidden_size, r_idx_set, filt_emb_data, full_pred_data, sentidx2offset)[source]

Merge subsentences single process.

Will flatten out the results from full_pred_data so that each line of filt_emb_data is one alias prediction.

Parameters
  • K – number candidates

  • hidden_size – hidden size

  • r_idx_set – batch result index

  • filt_emb_data – mmap embedding file to write

  • full_pred_data – mmap result file to read

  • sentidx2offset – sentence to emb data offset

bootleg.utils.eval_utils.write_data_labels(num_processes, merged_entity_emb_file, merged_storage_type, sent_idx2row, cache_folder, out_file, entity_dump, train_in_candidates, max_candidates, trie_candidate_map_folder=None, trie_qid2eid_file=None)[source]

Take the flattened data from merge_sentences and write out predictions.

Parameters
  • num_processes – number of processes

  • merged_entity_emb_file – input memmap file after merge sentences

  • merged_storage_type – input file storage type

  • sent_idx2row – Dict of sentence idx to row relevant to this subbatch

  • cache_folder – folder to save temporary outputs

  • out_file – final output file for predictions

  • entity_dump – entity dump

  • train_in_candidates – whether the model assumes all gold QIDs are in candidate lists (if False, a NC entity is added)

  • max_candidates – maximum number of candidates

  • trie_candidate_map_folder – folder where trie of alias->candidate map is stored for parallel processing

  • trie_qid2eid_file – file where trie of qid->eid map is stored for parallel processing

bootleg.utils.eval_utils.write_data_labels_hlp(args)[source]

Write data labels multiprocess helper function.

bootleg.utils.eval_utils.write_data_labels_initializer(merged_entity_emb_file, merged_storage_type, sental2embid_file, train_in_candidates, max_cands, trie_candidate_map_folder, trie_qid2eid_file)[source]

Write data labels multiprocessing initializer.

Parameters
  • merged_entity_emb_file – flattened embedding input file

  • merged_storage_type – mmap storage type

  • sental2embid_file – sentence, alias -> embedding id mapping

  • train_in_candidates – train in candidates flag

  • max_cands – max candidates

  • trie_candidate_map_folder – alias trie folder

  • trie_qid2eid_file – qid to eid trie file

bootleg.utils.eval_utils.write_data_labels_single(sentidx2row, output_file, filt_emb_data, sental2embid, alias_cand_map, qid2eid, train_in_cands, max_cands)[source]

Write data labels single subprocess function.

Will take the alias predictions and merge them back by sentence to be written out.

Parameters
  • sentidx2row – sentence index to raw eval data row

  • output_file – output file

  • filt_emb_data – mmap embedding data (one prediction per row)

  • sental2embid – sentence index, alias index -> embedding row id

  • alias_cand_map – alias to candidate map

  • qid2eid – qid to entity id map

  • train_in_cands – training in candidates flag

  • max_cands – maximum candidates

bootleg.utils.eval_utils.write_disambig_metrics_to_csv(file_path, dictionary)[source]

Save disambiguation metrics in the dictionary to file_path.

Parameters
  • file_path – file path

  • dictionary – dictionary of scores (output of Emmental score)

bootleg.utils.model_utils module

Model utils.

bootleg.utils.model_utils.count_parameters(model, requires_grad, logger)[source]

Count the number of parameters.

Parameters
  • model – model to count

  • requires_grad – whether to look at grad or no grad params

  • logger – logger

bootleg.utils.model_utils.get_max_candidates(entity_symbols, data_config)[source]

Get max candidates.

Returns the maximum number of candidates used in the model, taking into account train_in_candidates. If train_in_candidates is False, we add a NC entity candidate (for the null candidate).

Parameters
  • entity_symbols – entity symbols

  • data_config – data config
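
A one-line sketch of the rule described above; the attribute names entity_symbols.max_candidates and data_config.train_in_candidates are assumptions for illustration:

    def get_max_candidates_sketch(entity_symbols, data_config):
        # Add one extra slot for the NC (null) candidate when the gold entity
        # is not guaranteed to be in the candidate list.
        return entity_symbols.max_candidates + (0 if data_config.train_in_candidates else 1)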

bootleg.utils.utils module

Bootleg utils.

bootleg.utils.utils.chunk_file(in_file, out_dir, num_lines, prefix='out_')[source]

Chunk a file into num_lines chunks.

Parameters
  • in_file – input file

  • out_dir – output directory

  • num_lines – number of lines in each chunk

  • prefix – prefix for output files in out_dir

Returns: total number of lines read, dictionary of output file path -> number of lines in that file (for tqdms)

bootleg.utils.utils.chunks(iterable, n)[source]

Chunk data.

chunks(ABCDE,2) => AB CD E.

Parameters
  • iterable – iterable input

  • n – chunk size

Returns: generator over chunks
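
A small sketch of the chunking behavior, assuming n is the chunk size as in the ABCDE example above:

    from itertools import islice

    def chunks_sketch(iterable, n):
        # Yield successive pieces of length n; the last piece may be shorter.
        it = iter(iterable)
        while True:
            piece = list(islice(it, n))
            if not piece:
                return
            yield piece

    ["".join(c) for c in chunks_sketch("ABCDE", 2)]  # ['AB', 'CD', 'E']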

bootleg.utils.utils.create_single_item_trie(in_dict, out_file='')[source]

Create marisa trie.

Creates a marisa trie from the input dictionary. We assume the dictionary has string keys and integer values.

Parameters
  • in_dict – Dict[str] -> Int

  • out_file – marisa file to save (useful for reading as memmap) (optional)

Returns: marisa trie of in_dict
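
A minimal sketch of building and saving such a trie with the marisa_trie package (the record format "<l" for a single integer value per key is an assumption for illustration):

    import marisa_trie

    in_dict = {"Q123": 7, "Q456": 11}
    # RecordTrie stores fixed-width values; here each string key maps to one integer.
    trie = marisa_trie.RecordTrie("<l", [(k, (v,)) for k, v in in_dict.items()])
    trie["Q123"]                 # [(7,)]
    trie.save("example.marisa")  # optional; enables mmap-style loading later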

bootleg.utils.utils.dump_json_file(filename, contents, ensure_ascii=False)[source]

Dump dictionary to json file.

Parameters
  • filename – file to write to

  • contents – dictionary to save

  • ensure_ascii – ensure ascii

bootleg.utils.utils.dump_yaml_file(filename, contents)[source]

Dump dictionary to yaml file.

Parameters
  • filename – file to write to

  • contents – dictionary to save

bootleg.utils.utils.ensure_dir(d)[source]

Check if a directory exists; if not, create it.

Parameters

d – path

bootleg.utils.utils.exists_dir(d)[source]

Check if directory exists.

Parameters

d – path

bootleg.utils.utils.get_lnrm(s, strip=1, lower=1)[source]

Convert to lnrm form.

Convert a string to its lnrm form. We form the lower-cased normalized version l(s) of a string s by canonicalizing its UTF-8 characters, eliminating diacritics, lower-casing the UTF-8, and throwing out all ASCII-range characters that are not alphanumeric.

From http://nlp.stanford.edu/pubs/subctackbp.pdf, Section 2.3.

Parameters
  • s – input string

  • strip – boolean for stripping alias or not

  • lower – boolean for lowercasing alias or not

Returns: the lnrm form of the string
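
A rough sketch of the normalization described above, using unicodedata to drop diacritics (not necessarily Bootleg's exact code):

    import unicodedata

    def lnrm_sketch(s, strip=1, lower=1):
        # Canonicalize, drop combining marks (diacritics), lower-case, and remove
        # ASCII-range characters that are not alphanumeric.
        text = unicodedata.normalize("NFD", s)
        text = "".join(ch for ch in text if not unicodedata.combining(ch))
        if lower:
            text = text.lower()
        if strip:
            text = "".join(ch for ch in text if ch.isalnum() or ord(ch) > 127)
        return text

    lnrm_sketch("Héllo, World!")  # 'helloworld'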

bootleg.utils.utils.load_json_file(filename)[source]

Load dictionary from json file.

Parameters

filename – file to read from

Returns: Dict

bootleg.utils.utils.load_single_item_trie(file)[source]

Load a marisa trie with integer values from memmap file.

Parameters

file – marisa input file

Returns: marisa trie

bootleg.utils.utils.load_yaml_file(filename)[source]

Load dictionary from yaml file.

Parameters

filename – file to read from

Returns: Dict

bootleg.utils.utils.recurse_redict(d)[source]

Cast all DottedDict values in a dictionary to be dictionaries.

Useful for YAML dumping.

Parameters

d – Dict

Returns: Dict with no DottedDicts

bootleg.utils.utils.strip_nan(input_list)[source]

Replace float('nan') with nulls.

Used for ujson loading/dumping.

Parameters

input_list – list of items to remove the Nans from

Returns: list or nested list where NaN is replaced with None
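
A small sketch of the recursive replacement described above:

    import math

    def strip_nan_sketch(input_list):
        # Walk nested lists and replace float NaN values with None.
        return [
            strip_nan_sketch(x) if isinstance(x, list)
            else (None if isinstance(x, float) and math.isnan(x) else x)
            for x in input_list
        ]

    strip_nan_sketch([1.0, float("nan"), [float("nan"), 2.0]])  # [1.0, None, [None, 2.0]]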

bootleg.utils.utils.try_rmtree(rm_dir)[source]

Try to remove a directory tree.

If a resource is open, rmtree will fail. This retries rmtree up to 5 times, waiting 1 second between attempts.

Parameters

rm_dir – directory to remove
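
A sketch of the retry loop described above:

    import shutil
    import time

    def try_rmtree_sketch(rm_dir):
        # Retry up to 5 times with 1-second waits, in case a resource is still open.
        for attempt in range(5):
            try:
                shutil.rmtree(rm_dir)
                return
            except OSError:
                if attempt == 4:
                    raise
                time.sleep(1)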

bootleg.utils.utils.write_jsonl(filepath, values, ensure_ascii=False)[source]

Write List[Dict] data to jsonlines file.

Parameters
  • filepath – file to write to

  • values – list of dictionary data to write

  • ensure_ascii – ensure_ascii for json
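
A minimal sketch using the standard json module (Bootleg may use a different JSON backend):

    import json

    def write_jsonl_sketch(filepath, values, ensure_ascii=False):
        # One JSON object per line.
        with open(filepath, "w") as f:
            for row in values:
                f.write(json.dumps(row, ensure_ascii=ensure_ascii) + "\n")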

bootleg.utils.utils.write_to_file(filename, value)[source]

Write generic value to a file.

If value is not a string, it will be cast with str().

Parameters
  • filename – file to write to

  • value – content to write

Module contents

Util init.