bootleg.slicing package
Submodules
bootleg.slicing.slice_dataset module
Bootleg slice dataset.
- class bootleg.slicing.slice_dataset.BootlegSliceDataset(main_args, dataset, use_weak_label, entity_symbols, dataset_threads, split='train')
Bases: object
Slice dataset class.
Our dataset class for holding data slices (or subpopulations).
Each mention can belong to zero or more slices. During evaluation, the SliceDataset determines which mentions belong to which slices. Importantly, although the model “sees” all mentions, only GOLD anchor links are scored on the evaluation splits (test/dev).
- Parameters
main_args – main arguments
dataset – dataset file
use_weak_label – whether to use weak labeling or not
entity_symbols – entity symbols
dataset_threads – number of processes to use
split – data split
- classmethod build_data_dict(save_dataset_name, storage)
Build the slice dataset from saved file.
Loads the memmap slice dataset and creates a mapping from sentence index to row index.
- Parameters
save_dataset_name – saved memmap file name
storage – storage type of memmap file
Returns: numpy memmap data, Dict of sentence index to row in data
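The sentence-index lookup described above can be illustrated with a minimal sketch. This is not Bootleg's implementation: the rows list stands in for the memmap data, and the sent_idx field name and the sketch_sent_idx_map helper are hypothetical. It also assumes each sentence index appears in exactly one row.

```python
def sketch_sent_idx_map(rows):
    """Map each sentence index to the position of its row (hypothetical helper)."""
    sent_idx_to_row = {}
    for row_idx, row in enumerate(rows):
        # Assumption: every row carries its sentence index under "sent_idx".
        sent_idx_to_row[row["sent_idx"]] = row_idx
    return sent_idx_to_row


# Stand-in for the loaded memmap rows (illustrative data only).
rows = [{"sent_idx": 10}, {"sent_idx": 42}, {"sent_idx": 7}]
mapping = sketch_sent_idx_map(rows)

# A membership check like contains_sentidx reduces to a dictionary lookup.
contains_42 = 42 in mapping
contains_99 = 99 in mapping
```

With such a mapping, fetching the row for a sentence is a constant-time lookup rather than a scan over the stored data.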
- contains_sentidx(sent_idx)
Return true if the sentence index is in the dataset.
- Parameters
sent_idx – sentence index
Returns: bool whether in dataset or not
- get_slice_incidence_arr(sent_idx, alias_orig_list_pos)
Get slice incidence array.
Given a sentence index and a list of alias positions (where -1 indicates no alias), return a dictionary mapping each slice name to a 0/1 incidence array that records whether each alias in alias_orig_list_pos is in the slice. Positions with no alias are marked -1.
- Parameters
sent_idx – sentence index
alias_orig_list_pos – list of alias positions in input data list (due to sentence splitting, aliases may be split up)
Returns: Dict of slice name -> 0/1 incidence array
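The shape of the returned incidence arrays can be sketched as follows. This is an illustration under assumptions, not the library's code: slice_membership is a hypothetical stand-in for the stored per-sentence slice data, and sketch_incidence is a hypothetical helper name.

```python
def sketch_incidence(slice_membership, alias_orig_list_pos):
    """Build slice_name -> incidence list, keeping -1 where there is no alias."""
    return {
        name: [
            # -1 marks a padded position that holds no alias.
            -1 if pos == -1 else int(member[pos])
            for pos in alias_orig_list_pos
        ]
        for name, member in slice_membership.items()
    }


# Hypothetical stored data: 0/1 membership for each alias in one sentence.
slice_membership = {"tail": [1, 0, 1], "head": [0, 1, 0]}

# Aliases 0 and 2 landed in this sub-sentence; the last slot has no alias.
result = sketch_incidence(slice_membership, [0, 2, -1])
```

Here result["tail"] is [1, 1, -1] and result["head"] is [0, 0, -1], so a caller can mask out the -1 entries before computing per-slice metrics.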
- class bootleg.slicing.slice_dataset.InputExample(sent_idx, subslice_idx, anchor, num_alias2pred, slices)
Bases: object
A single training/test example.
- class bootleg.slicing.slice_dataset.InputFeatures(sent_idx, subslice_idx, alias_slice_incidence, alias2pred_probs)
Bases: object
A single set of features of data.
- bootleg.slicing.slice_dataset.convert_examples_to_features_and_save(meta_file, dataset_threads, slice_names, save_dataset_name, storage)
Convert the prepped examples into input features.
Saves the features in memmap files, which are later read by the __getitem__ method.
- Parameters
meta_file – metadata file where input file paths are saved
dataset_threads – number of threads
slice_names – list of slice names to evaluate on
save_dataset_name – data file name to save
storage – data storage type (for memmap)
- bootleg.slicing.slice_dataset.convert_examples_to_features_and_save_hlp(input_dict)
Convert to features helper.
- bootleg.slicing.slice_dataset.convert_examples_to_features_and_save_initializer(save_dataset_name, storage)
Convert to features multiprocessing initializer.
- bootleg.slicing.slice_dataset.convert_examples_to_features_and_save_single(input_dict, mmap_file)
Convert examples to features multiprocessing helper.
- bootleg.slicing.slice_dataset.create_examples(dataset, create_ex_indir, create_ex_outdir, meta_file, data_config, dataset_threads, slice_names, use_weak_label, split)
Create examples from the raw input data.
- Parameters
dataset – dataset file
create_ex_indir – temporary directory where input files are stored
create_ex_outdir – temporary directory to store output files from method
meta_file – metadata file to save the file names/paths for the next step in prep pipeline
data_config – data config
dataset_threads – number of threads
slice_names – list of slices to evaluate on
use_weak_label – whether to use weak labeling or not
split – data split
- bootleg.slicing.slice_dataset.create_examples_initializer(data_config, slice_names, use_weak_label, split, train_in_candidates)
Create examples multiprocessing initializer.
- bootleg.slicing.slice_dataset.create_examples_single(in_file_name, in_file_lines, out_file_name, constants_dict)
Create examples multiprocessing helper.
- bootleg.slicing.slice_dataset.get_slice_values(slice_names, line)
Returns a dictionary of all slice values for an input example.
Any mention with a slice value of > 0.5 gets assigned that slice. If some slices are missing from the input, we assign all mentions as not being in that slice (getting a 0 label value). We also check that slices are formatted correctly.
- Parameters
slice_names – slice names to evaluate on
line – input data json line
Returns: Dict of slice name -> alias index string -> float value indicating whether the mention is in the slice
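The thresholding and missing-slice behavior described above can be sketched as follows. This is a minimal illustration, not the library's implementation. The "aliases" and "slices" keys of the input line, the sketch_slice_values helper, and the choice to emit 1.0/0.0 labels are all assumptions made for the example. The format checks mentioned above are omitted.

```python
def sketch_slice_values(slice_names, line, threshold=0.5):
    """Assign each mention a 1.0/0.0 label per slice (hypothetical helper)."""
    # Assumption: the line lists its mentions under "aliases".
    num_aliases = len(line["aliases"])
    # Assumption: slice scores live under "slices", keyed by alias index string.
    provided = line.get("slices", {})
    result = {}
    for name in slice_names:
        # A slice missing from the input yields 0 for every mention.
        values = provided.get(name, {})
        result[name] = {
            str(i): 1.0 if float(values.get(str(i), 0.0)) > threshold else 0.0
            for i in range(num_aliases)
        }
    return result


# Illustrative input line with two mentions and one provided slice.
example_line = {
    "aliases": ["lincoln", "ford"],
    "slices": {"tail": {"0": 0.9, "1": 0.2}},
}
slice_values = sketch_slice_values(["tail", "unseen"], example_line)
```

In this sketch the first mention is placed in the "tail" slice because 0.9 exceeds 0.5, while every mention gets 0.0 for "unseen" because that slice is absent from the input.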
Module contents
Slicing initializer.