bootleg.utils package
Subpackages
- bootleg.utils.classes package
- bootleg.utils.parser package
- bootleg.utils.preprocessing package
- Submodules
- bootleg.utils.preprocessing.compute_statistics module
- bootleg.utils.preprocessing.count_body_part_size module
- bootleg.utils.preprocessing.gen_alias_cand_map module
- bootleg.utils.preprocessing.gen_entity_mappings module
- bootleg.utils.preprocessing.get_train_qid_counts module
- bootleg.utils.preprocessing.sample_eval_data module
- Module contents
Submodules
bootleg.utils.data_utils module
Bootleg data utils.
- bootleg.utils.data_utils.add_special_tokens(tokenizer)
Add special tokens.
- Parameters
tokenizer – tokenizer
- bootleg.utils.data_utils.correct_not_augmented_dict_values(gold, dict_values)
Correct gold label dict values in data prep.
Modifies dict_values to contain only those mentions that are gold labels. In the new dictionary, the alias indices are renumbered to start at 0 and end at the number of gold mentions.
- Parameters
gold – List of True/False values indicating whether each mention is a gold label
dict_values – Dict of slice_name -> Dict[alias_idx] -> slice probability
Returns: adjusted dict_values such that only gold = True aliases are kept (dict is reindexed to start at 0)
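The reindexing described above can be sketched as follows. This is a minimal illustration, not Bootleg's implementation: the function name keep_gold_values is hypothetical, and it assumes the alias indices are integer keys of each inner dict.

```python
def keep_gold_values(gold, dict_values):
    """Keep only gold mentions and renumber their alias indices from 0."""
    # Map each original alias index that is gold to its new, compacted index.
    new_idx = {}
    for old_idx, is_gold in enumerate(gold):
        if is_gold:
            new_idx[old_idx] = len(new_idx)
    # Rebuild every slice dict, dropping non-gold aliases (assumes int keys).
    return {
        slice_name: {
            new_idx[old]: prob
            for old, prob in alias2prob.items()
            if old in new_idx
        }
        for slice_name, alias2prob in dict_values.items()
    }


gold = [True, False, True]
dict_values = {"slice_a": {0: 0.9, 1: 0.2, 2: 0.7}, "slice_b": {2: 1.0}}
result = keep_gold_values(gold, dict_values)
```

Here the mention at index 1 is dropped, so the mention originally at index 2 becomes index 1 in every slice.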
- bootleg.utils.data_utils.generate_slice_name(data_args, slice_names, use_weak_label, dataset)
Generate name for slice datasets, taking into account the config eval slices.
- Parameters
data_args – data args
slice_names – slice names
use_weak_label – if using weak labels or not
dataset – dataset name
Returns: dataset name for saving slice data
- bootleg.utils.data_utils.get_chunk_dir(prep_dir)
Get directory for saving data chunks.
- Parameters
prep_dir – prep directory
Returns: directory path
- bootleg.utils.data_utils.get_data_prep_dir(data_config)
Get data prep directory for saving prep files.
- Parameters
data_config – data config
Returns: directory path
- bootleg.utils.data_utils.get_emb_prep_dir(data_config)
Get embedding prep directory for saving prep files.
- Parameters
data_config – data config
Returns: directory path
- bootleg.utils.data_utils.get_eval_slices(eval_slices)
Get eval slices in data prep.
Given input eval slices (passed in config), ensure FINAL_LOSS is in the eval slices. FINAL_LOSS gives overall metrics.
- Parameters
eval_slices – list of input eval slices
Returns: list of eval slices to use in the model
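A minimal sketch of the behavior described above, under stated assumptions: the string value of FINAL_LOSS and the position at which it is added are placeholders chosen for illustration, and with_final_loss is a hypothetical name.

```python
FINAL_LOSS = "final_loss"  # assumed placeholder; the real constant is defined in Bootleg


def with_final_loss(eval_slices):
    """Return a copy of eval_slices that is guaranteed to contain FINAL_LOSS."""
    slices = list(eval_slices)  # copy so the caller's config list is not mutated
    if FINAL_LOSS not in slices:
        # Appending at the end is an assumption; only membership is documented.
        slices.append(FINAL_LOSS)
    return slices


configured = ["unif_NS_TS", "unif_NS_TL"]  # hypothetical slice names
model_slices = with_final_loss(configured)
```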
- bootleg.utils.data_utils.get_save_data_folder(data_args, use_weak_label, dataset)
Get save data folder for the prepped data.
- Parameters
data_args – data config
use_weak_label – whether to use weak labelling or not
dataset – dataset name
Returns: folder string path
bootleg.utils.eval_utils module
Bootleg eval utils.
- bootleg.utils.eval_utils.batched_pred_iter(model, dataloader, dump_preds_accumulation_steps, sent_idx2num_mens)
Predict from dataloader.
Predict from dataloader taking into account eval accumulation steps. Will yield a new prediction set after each set accumulation steps for writing out.
If a sentence or batch doesn’t have any mentions, it will not be returned by this method.
Recall that we split up sentences that are too long to feed to the model. We use the sent_idx2num_mens dict to ensure we have full sentences evaluated before returning, otherwise we’ll have incomplete sentences to merge together when dumping.
- Parameters
model – model
dataloader – The dataloader to predict
dump_preds_accumulation_steps – Number of eval steps to run before returning
sent_idx2num_mens – Dict of sentence index to number of mentions
- Returns
Iterator over result dict.
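The accumulate-then-yield idea can be illustrated with a simplified sketch. It is not the library's implementation: it has no model or dataloader, and it assumes each batch is a plain list of (sent_idx, prediction) pairs. The name iter_complete_sentences is hypothetical.

```python
def iter_complete_sentences(batches, sent_idx2num_mens, accumulation_steps):
    """Yield dicts of sent_idx -> predictions, holding back incomplete sentences."""
    pending = {}  # sent_idx -> predictions collected so far
    steps = 0
    for batch in batches:
        for sent_idx, pred in batch:
            pending.setdefault(sent_idx, []).append(pred)
        steps += 1
        if steps % accumulation_steps == 0:
            # Only release sentences whose mentions have all been seen.
            done = {
                s: preds
                for s, preds in pending.items()
                if len(preds) == sent_idx2num_mens[s]
            }
            for s in done:
                del pending[s]
            if done:
                yield done
    if pending:
        # Flush whatever remains once the data is exhausted.
        yield pending


batches = [[(0, "a"), (1, "b")], [(1, "c")], [(2, "d")]]
num_mens = {0: 1, 1: 2, 2: 1}
yielded = list(iter_complete_sentences(batches, num_mens, accumulation_steps=2))
```

Sentence 1 spans two batches, so it is only released once both of its mentions have been collected.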
- bootleg.utils.eval_utils.check_and_create_alias_cand_trie(save_folder, entity_symbols)
Create a memory-mapped trie object for storing the alias-candidate mappings.
- Parameters
save_folder – save folder for alias trie
entity_symbols – entity symbols
- bootleg.utils.eval_utils.collect_and_merge_results(unmerged_entity_emb_file, emb_file_config, config, sent_idx2num_mens, sent_idx2row, save_folder, entity_symbols)
Merge mentions, filter out non-gold labels, and save to file.
- Parameters
unmerged_entity_emb_file – memmap file from dump step
emb_file_config – config file for loading memmap file
config – model config
sent_idx2num_mens – Dict sentence idx to number of mentions
sent_idx2row – Dict sentence idx to row of eval data
save_folder – folder to save results
entity_symbols – entity symbols
Returns: saved prediction file, total mentions seen
- bootleg.utils.eval_utils.dump_model_outputs(model, dataloader, config, sentidx2num_mentions, save_folder, entity_symbols, task_name, overwrite_data)
Dump model outputs.
- Parameters
model – model
dataloader – data loader
config – config
sentidx2num_mentions – Dict from sentence idx to number of mentions
save_folder – save folder
entity_symbols – entity symbols
task_name – task name
overwrite_data – overwrite saved mmap files
Returns: memmap file name for saved outputs, dtype file name for loading memmap file
- bootleg.utils.eval_utils.get_emb_file(save_folder)
Get the embedding numpy file for the batch.
- Parameters
save_folder – save folder
Returns: string
- bootleg.utils.eval_utils.get_eval_folder(file)
Return eval folder for the given evaluation file.
Stored in log_path/filename/model_name.
- Parameters
file – eval file
Returns: eval folder
- bootleg.utils.eval_utils.get_result_file(save_folder)
Get the jsonl label file for the batch.
- Parameters
save_folder – save folder
Returns: string
- bootleg.utils.eval_utils.get_sent_idx2num_mens(data_file)
Get the map from sentence index to number of mentions and to data.
Used for calculating offsets and chunking file.
- Parameters
data_file – eval file
Returns: Dict of sentence index -> number of mentions per sentence, Dict of sentence index -> input line
- bootleg.utils.eval_utils.get_sental2embid(merged_entity_emb_file, merged_storage_type)
Get sent_idx, alias_idx mapping to emb idx for quick lookup.
- Parameters
merged_entity_emb_file – memmap file after merge sentences
merged_storage_type – file storage type
Returns: Dict of f"{sent_idx}_{alias_idx}" -> index in merged_entity_emb_file
- bootleg.utils.eval_utils.map_aliases_to_candidates(train_in_candidates, max_candidates, alias_cand_map, aliases)
Get list of QID candidates for each alias.
- Parameters
train_in_candidates – whether the model has a NC entity or not (assumes all gold QIDs are in candidate lists)
max_candidates – maximum number of candidates
alias_cand_map – alias -> candidate qids in dict or TwoLayerVocabularyScoreTrie format
aliases – list of aliases
Returns: List of lists of QIDs
- bootleg.utils.eval_utils.map_candidate_qids_to_eid(candidate_qids, qid2eid)
Get list of EID candidates for each alias.
- Parameters
candidate_qids – list of list of candidate QIDs
qid2eid – mapping of qid to entity id
Returns: List of lists of EIDs
- bootleg.utils.eval_utils.masked_class_logsoftmax(pred, mask, dim=2, temp=1.0, zero_delta=1e-45)
Masked logsoftmax.
A mask value of 0/False marks a position to be masked out (ignored).
- Parameters
pred – input tensor
mask – mask
dim – softmax dimension
temp – softmax temperature
zero_delta – small value to add so that vector + (mask+zero_delta).log() is not NaN when the mask is all 0s
Returns: masked log-softmax tensor
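The masking trick in the zero_delta description can be sketched in pure Python on a single vector. This is an illustration under assumptions, not the library code: it works on lists rather than tensors, ignores dim, and dividing by temp before masking is an assumed placement of the temperature.

```python
import math


def masked_log_softmax(pred, mask, temp=1.0, zero_delta=1e-45):
    """Log-softmax over pred where positions with mask == 0 are ignored."""
    # log(mask + zero_delta) is ~0 for kept entries and a very negative but
    # finite number for masked entries, so an all-zero mask never yields NaN.
    shifted = [p / temp + math.log(m + zero_delta) for p, m in zip(pred, mask)]
    # Numerically stable log-softmax: subtract the max before exponentiating.
    top = max(shifted)
    log_norm = top + math.log(sum(math.exp(x - top) for x in shifted))
    return [x - log_norm for x in shifted]


scores = masked_log_softmax([2.0, 1.0, 0.5], [1, 0, 1])
probs = [math.exp(s) for s in scores]
all_masked = masked_log_softmax([1.0, 1.0], [0, 0])
```

The masked position receives a vanishingly small probability, and the fully masked input still returns finite values.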
- bootleg.utils.eval_utils.merge_subsentences(num_processes, subset_sent_idx2num_mens, cache_folder, to_save_file, to_save_storage, to_read_file, to_read_storage)
Merge and flatten sentence over sub-sentences.
Flatten all sentences back together over sub-sentences, removing the PAD aliases from the data. That is, it converts from sent_idx -> array of values to (sent_idx, alias_idx) -> value, with varying numbers of aliases per sentence.
- Parameters
num_processes – number of processes
subset_sent_idx2num_mens – Dict of sentence index to number of mentions for this batch
cache_folder – cache directory
to_save_file – memmap file to save results to
to_save_storage – save file storage type
to_read_file – memmap file to read predictions from
to_read_storage – read file storage type
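The conversion from sent_idx -> array of values to (sent_idx, alias_idx) -> value can be pictured with plain Python containers. This sketch makes several assumptions that are not in the documentation: padding slots are marked with -1, each sub-sentence row carries its own alias indices, and memmap files are replaced by in-memory lists. The name flatten_subsentences is hypothetical.

```python
PAD = -1  # assumed marker for unused (padded) alias slots


def flatten_subsentences(sub_rows):
    """Flatten sub-sentence rows into a (sent_idx, alias_idx) -> value dict.

    sub_rows is a list of (sent_idx, alias_indices, values) tuples, one per
    sub-sentence; long sentences contribute several rows with the same sent_idx.
    """
    flat = {}
    for sent_idx, alias_indices, values in sub_rows:
        for alias_idx, value in zip(alias_indices, values):
            if alias_idx == PAD:
                continue  # drop padded alias slots
            flat[(sent_idx, alias_idx)] = value
    return flat


rows = [
    (0, [0, 1, PAD], ["Q1", "Q2", None]),  # first part of sentence 0
    (0, [2, PAD, PAD], ["Q3", None, None]),  # second part of sentence 0
    (1, [0, PAD, PAD], ["Q4", None, None]),
]
flat = flatten_subsentences(rows)
```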
- bootleg.utils.eval_utils.merge_subsentences_hlp(args)
Merge subsentences multiprocessing subprocess helper.
- bootleg.utils.eval_utils.merge_subsentences_initializer(to_write_file, to_write_storage, to_read_file, to_read_storage, sentidx2offset_file)
Merge subsentences initializer for multiprocessing.
- Parameters
to_write_file – file to write
to_write_storage – mmap storage type
to_read_file – file to read
to_read_storage – mmap storage type
sentidx2offset_file – sentence index to offset in mmap data
- bootleg.utils.eval_utils.merge_subsentences_single(K, hidden_size, r_idx_set, filt_emb_data, full_pred_data, sentidx2offset)
Merge subsentences single process.
Flattens the results from full_pred_data so that each line of filt_emb_data is one alias prediction.
- Parameters
K – number candidates
hidden_size – hidden size
r_idx_set – batch result index
filt_emb_data – mmap embedding file to write
full_pred_data – mmap result file to read
sentidx2offset – sentence to emb data offset
- bootleg.utils.eval_utils.write_data_labels(num_processes, merged_entity_emb_file, merged_storage_type, sent_idx2row, cache_folder, out_file, entity_dump, train_in_candidates, max_candidates, trie_candidate_map_folder=None, trie_qid2eid_file=None)
Take the flattened data from merge_subsentences and write out predictions.
- Parameters
num_processes – number of processes
merged_entity_emb_file – input memmap file after merge sentences
merged_storage_type – input file storage type
sent_idx2row – Dict of sentence idx to row relevant to this subbatch
cache_folder – folder to save temporary outputs
out_file – final output file for predictions
entity_dump – entity dump
train_in_candidates – whether NC entities are not in candidate lists
max_candidates – maximum number of candidates
trie_candidate_map_folder – folder where trie of alias->candidate map is stored for parallel processing
trie_qid2eid_file – file where trie of qid->eid map is stored for parallel processing
- bootleg.utils.eval_utils.write_data_labels_hlp(args)
Write data labels multiprocess helper function.
- bootleg.utils.eval_utils.write_data_labels_initializer(merged_entity_emb_file, merged_storage_type, sental2embid_file, train_in_candidates, max_cands, trie_candidate_map_folder, trie_qid2eid_file)
Write data labels multiprocessing initializer.
- Parameters
merged_entity_emb_file – flattened embedding input file
merged_storage_type – mmap storage type
sental2embid_file – sentence, alias -> embedding id mapping
train_in_candidates – train in candidates flag
max_cands – max candidates
trie_candidate_map_folder – alias trie folder
trie_qid2eid_file – qid to eid trie file
- bootleg.utils.eval_utils.write_data_labels_single(sentidx2row, output_file, filt_emb_data, sental2embid, alias_cand_map, qid2eid, train_in_cands, max_cands)
Write data labels single subprocess function.
Will take the alias predictions and merge them back by sentence to be written out.
- Parameters
sentidx2row – sentence index to raw eval data row
output_file – output file
filt_emb_data – mmap embedding data (one prediction per row)
sental2embid – sentence index, alias index -> embedding row id
alias_cand_map – alias to candidate map
qid2eid – qid to entity id map
train_in_cands – training in candidates flag
max_cands – maximum candidates
bootleg.utils.model_utils module
Model utils.
- bootleg.utils.model_utils.count_parameters(model, requires_grad, logger)
Count the number of parameters.
- Parameters
model – model to count
requires_grad – whether to look at grad or no grad params
logger – logger
- bootleg.utils.model_utils.get_max_candidates(entity_symbols, data_config)
Get max candidates.
Returns the maximum number of candidates used in the model, taking into account train_in_candidates. If train_in_candidates is False, we add a NC entity candidate (for null candidate).
- Parameters
entity_symbols – entity symbols
data_config – data config
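The rule above reduces to adding one slot when train_in_candidates is False. In this sketch, max_real_candidates is a hypothetical stand-in for whatever value the entity symbols provide, and train_in_candidates stands in for the flag read from the data config.

```python
def max_candidates_sketch(max_real_candidates, train_in_candidates):
    """Return the number of candidates the model scores per mention."""
    # When gold QIDs are not guaranteed to be in the candidate list,
    # one extra slot is reserved for the NC (null candidate) entity.
    return max_real_candidates + (0 if train_in_candidates else 1)


with_nc = max_candidates_sketch(30, train_in_candidates=False)
without_nc = max_candidates_sketch(30, train_in_candidates=True)
```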
bootleg.utils.utils module
Bootleg utils.
- bootleg.utils.utils.chunk_file(in_file, out_dir, num_lines, prefix='out_')
Chunk a file into chunks of num_lines lines each.
- Parameters
in_file – input file
out_dir – output directory
num_lines – number of lines in each chunk
prefix – prefix for output files in out_dir
Returns: total number of lines read, dictionary of output file path -> number of lines in that file (for tqdm progress bars)
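A rough sketch of line-based chunking with the same parameters and return shape. The output file naming (prefix plus a running chunk number and a .txt suffix) is an assumption made for the example, and chunk_file_sketch is a hypothetical name rather than the library function.

```python
import os
import tempfile


def chunk_file_sketch(in_file, out_dir, num_lines, prefix="out_"):
    """Split in_file into files of at most num_lines lines each."""
    os.makedirs(out_dir, exist_ok=True)
    total = 0
    counts = {}  # output file path -> number of lines written to it
    out_f = None
    out_path = None
    with open(in_file, encoding="utf-8") as in_f:
        for line in in_f:
            if total % num_lines == 0:
                # Start a new chunk; the file naming scheme is an assumption.
                if out_f is not None:
                    out_f.close()
                out_path = os.path.join(out_dir, f"{prefix}{total // num_lines}.txt")
                out_f = open(out_path, "w", encoding="utf-8")
                counts[out_path] = 0
            out_f.write(line)
            counts[out_path] += 1
            total += 1
    if out_f is not None:
        out_f.close()
    return total, counts


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.txt")
    with open(src, "w", encoding="utf-8") as f:
        f.write("a\nb\nc\nd\ne\n")
    total_lines, line_counts = chunk_file_sketch(src, os.path.join(tmp, "chunks"), 2)
    chunk_sizes = sorted(line_counts.values())
```

The per-file line counts are what a caller would pass as totals to progress bars when processing the chunks in parallel.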
- bootleg.utils.utils.chunks(iterable, n)
Chunk data.
chunks(ABCDE,2) => AB CD E.
- Parameters
iterable – iterable input
n – size of each chunk
Returns: next chunk
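The example in the docstring corresponds to slicing the input into pieces of length n. A minimal sketch, assuming the input supports len() and slicing (a string or list rather than an arbitrary iterator); chunks_sketch is a hypothetical name.

```python
def chunks_sketch(sequence, n):
    """Yield successive pieces of length n; the last piece may be shorter."""
    for start in range(0, len(sequence), n):
        yield sequence[start:start + n]


pieces = list(chunks_sketch("ABCDE", 2))
print(pieces)  # ['AB', 'CD', 'E']
```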
- bootleg.utils.utils.create_single_item_trie(in_dict, out_file='')
Create marisa trie.
Creates a marisa trie from the input dictionary. We assume the dictionary has string keys and integer values.
- Parameters
in_dict – Dict[str] -> Int
out_file – marisa file to save (useful for reading as memmap) (optional)
Returns: marisa trie of in_dict
- bootleg.utils.utils.dump_json_file(filename, contents, ensure_ascii=False)
Dump dictionary to json file.
- Parameters
filename – file to write to
contents – dictionary to save
ensure_ascii – ensure ascii
- bootleg.utils.utils.dump_yaml_file(filename, contents)
Dump dictionary to yaml file.
- Parameters
filename – file to write to
contents – dictionary to save
- bootleg.utils.utils.ensure_dir(d)
Check if a directory exists. If not, create it.
- Parameters
d – path
- bootleg.utils.utils.get_lnrm(s, strip=1, lower=1)
Convert to lnrm form.
Convert a string to its lnrm form. We form the lower-cased normalized version l(s) of a string s by canonicalizing its UTF-8 characters, eliminating diacritics, lower-casing the UTF-8, and throwing out all ASCII-range characters that are not alphanumeric.
From http://nlp.stanford.edu/pubs/subctackbp.pdf, Section 2.3.
- Parameters
s – input string
strip – boolean for stripping alias or not
lower – boolean for lowercasing alias or not
Returns: the lnrm form of the string
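The normalization steps in the description can be sketched with the standard unicodedata module. This follows the prose literally and is not the library's code: it omits the strip and lower flags, and it drops every non-alphanumeric ASCII character, including spaces, which the real function may treat differently. The name lnrm_sketch is hypothetical.

```python
import unicodedata


def lnrm_sketch(s):
    """Lower-cased, diacritic-free form of s without ASCII punctuation."""
    # NFD splits accented characters into a base character plus combining marks.
    decomposed = unicodedata.normalize("NFD", s)
    # Drop the combining marks, which removes the diacritics.
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    lowered = no_marks.lower()
    # Remove ASCII-range characters that are not alphanumeric; keep non-ASCII.
    return "".join(c for c in lowered if ord(c) > 127 or c.isalnum())


normalized = lnrm_sketch("Café-Olé!")
```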
- bootleg.utils.utils.load_json_file(filename)
Load dictionary from json file.
- Parameters
filename – file to read from
Returns: Dict
- bootleg.utils.utils.load_single_item_trie(file)
Load a marisa trie with integer values from memmap file.
- Parameters
file – marisa input file
Returns: marisa trie
- bootleg.utils.utils.load_yaml_file(filename)
Load dictionary from yaml file.
- Parameters
filename – file to read from
Returns: Dict
- bootleg.utils.utils.recurse_redict(d)
Cast all DottedDict values in a dictionary to be dictionaries.
Useful for YAML dumping.
- Parameters
d – Dict
Returns: Dict with no DottedDicts
- bootleg.utils.utils.strip_nan(input_list)
Replace float('nan') with nulls.
Used for ujson loading/dumping.
- Parameters
input_list – list of items to remove the NaNs from
Returns: list or nested list where NaN is now None
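A minimal recursive sketch of this replacement, assuming nesting happens only through lists; strip_nan_sketch is a hypothetical name, not the library function.

```python
import math


def strip_nan_sketch(items):
    """Return a copy of items with every float NaN replaced by None."""
    cleaned = []
    for item in items:
        if isinstance(item, list):
            cleaned.append(strip_nan_sketch(item))  # recurse into nested lists
        elif isinstance(item, float) and math.isnan(item):
            cleaned.append(None)  # None serializes as JSON null
        else:
            cleaned.append(item)
    return cleaned


cleaned = strip_nan_sketch([1.0, float("nan"), [float("nan"), "x"]])
```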
- bootleg.utils.utils.try_rmtree(rm_dir)
Try to remove a directory tree.
If a resource is still open, rmtree will fail. This retries rmtree up to 5 times, waiting 1 second between attempts.
- Parameters
rm_dir – directory to remove
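The retry behavior can be sketched with shutil.rmtree and time.sleep. The attempts and wait_seconds parameters and the boolean return value are additions for this illustration; the documented function takes only rm_dir.

```python
import os
import shutil
import tempfile
import time


def try_rmtree_sketch(rm_dir, attempts=5, wait_seconds=1.0):
    """Try to remove rm_dir, retrying on failure. Returns True on success."""
    for attempt in range(attempts):
        try:
            shutil.rmtree(rm_dir)
            return True
        except OSError:
            # For example, a file inside the tree is still open elsewhere.
            if attempt < attempts - 1:
                time.sleep(wait_seconds)
    return False


scratch = tempfile.mkdtemp()
removed = try_rmtree_sketch(scratch)
```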
Module contents
Util init.