bootleg.slicing package

Submodules

bootleg.slicing.slice_dataset module

Bootleg slice dataset.

class bootleg.slicing.slice_dataset.BootlegSliceDataset(main_args, dataset, use_weak_label, entity_symbols, dataset_threads, split='train')[source]

Bases: object

Slice dataset class.

Our dataset class for holding data slices (or subpopulations).

Each mention can be part of zero or more slices. When running eval, we use the SliceDataset to determine which mentions belong to which slices. Importantly, although the model “sees” all mentions, only GOLD anchor links are evaluated during eval (on the dev/test splits).

Parameters
  • main_args – main arguments

  • dataset – dataset file

  • use_weak_label – whether to use weak labeling or not

  • entity_symbols – entity symbols

  • dataset_threads – number of processes to use

  • split – data split
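
A minimal eval-time usage sketch, assuming the Bootleg config and entity symbols are built elsewhere in the pipeline; the file path and thread count below are hypothetical placeholders, not values from this module.

    from bootleg.slicing.slice_dataset import BootlegSliceDataset

    def build_dev_slice_dataset(args, entity_symbols):
        # args: parsed Bootleg config; entity_symbols: loaded entity symbols object.
        return BootlegSliceDataset(
            main_args=args,
            dataset="dev.jsonl",            # hypothetical eval file
            use_weak_label=True,
            entity_symbols=entity_symbols,
            dataset_threads=8,
            split="dev",
        )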

classmethod build_data_dict(save_dataset_name, storage)[source]

Build the slice dataset from a saved file.

Loads the memmap slice dataset and creates a mapping from sentence index to row index.

Parameters
  • save_dataset_name – saved memmap file name

  • storage – storage type of memmap file

Returns: numpy memmap data, Dict of sentence index to row in data
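
A minimal sketch of the loading step described above, assuming the stored records expose a sent_idx field in the structured dtype (an assumption for illustration, not the exact Bootleg storage layout):

    from collections import defaultdict
    import numpy as np

    def build_data_dict_sketch(save_dataset_name, storage):
        # Open the saved memmap without reading it fully into memory.
        data = np.memmap(save_dataset_name, dtype=storage, mode="r")
        # Map each sentence index to the row(s) holding its features.
        sent_idx_to_rows = defaultdict(list)
        for row_idx in range(len(data)):
            sent_idx_to_rows[int(data[row_idx]["sent_idx"])].append(row_idx)  # assumed field
        return data, dict(sent_idx_to_rows)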

contains_sentidx(sent_idx)[source]

Return true if the sentence index is in the dataset.

Parameters
  • sent_idx – sentence index

Returns: bool, whether the sentence index is in the dataset or not

get_slice_incidence_arr(sent_idx, alias_orig_list_pos)[source]

Get slice incidence array.

Given the sentence index and the list of alias positions to get slice incidence for (entries may be -1, indicating no alias), return a dictionary mapping slice_name -> 0/1 incidence array indicating whether each alias in alias_orig_list_pos is in the slice (-1 for no alias).

Parameters
  • sent_idx – sentence index

  • alias_orig_list_pos – list of alias positions in input data list (due to sentence splitting, aliases may be split up)

Returns: Dict of slice name -> 0/1 incidence array
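
A hypothetical helper showing how the returned incidence arrays might be consumed at eval time; slice_dataset is a BootlegSliceDataset built as in the sketch above, and the indices are made up.

    import numpy as np

    def mentions_per_slice(slice_dataset, sent_idx, alias_orig_list_pos):
        incidence = slice_dataset.get_slice_incidence_arr(sent_idx, alias_orig_list_pos)
        # Each array entry is 1 (alias in slice), 0 (not in slice), or -1 (no alias).
        return {
            slice_name: int((np.asarray(arr) == 1).sum())
            for slice_name, arr in incidence.items()
        }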

class bootleg.slicing.slice_dataset.InputExample(sent_idx, subslice_idx, anchor, num_alias2pred, slices)[source]

Bases: object

A single training/test example.

classmethod from_dict(in_dict)[source]

Create object from dictionary.

to_dict()[source]

Turn object into dictionary.

class bootleg.slicing.slice_dataset.InputFeatures(sent_idx, subslice_idx, alias_slice_incidence, alias2pred_probs)[source]

Bases: object

A single set of features of data.

classmethod from_dict(in_dict)[source]

Create object from dictionary.

to_dict()[source]

Turn object into dictionary.

bootleg.slicing.slice_dataset.convert_examples_to_features_and_save(meta_file, dataset_threads, slice_names, save_dataset_name, storage)[source]

Convert the prepped examples into input features.

Saves in memmap files. These are used in the __getitem__ method.

Parameters
  • meta_file – metadata file where input file paths are saved

  • dataset_threads – number of threads

  • slice_names – list of slice names to evaluate on

  • save_dataset_name – data file name to save

  • storage – data storage type (for memmap)
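
A minimal sketch of the memmap-writing pattern described above, assuming storage is a numpy structured dtype and the features are dict-shaped; the field names are placeholders rather than the actual Bootleg schema.

    import numpy as np

    def save_features_sketch(save_dataset_name, features, storage):
        # Allocate a writable memmap with one row per feature record.
        mmap = np.memmap(save_dataset_name, dtype=storage, mode="w+", shape=(len(features),))
        for i, feat in enumerate(features):
            mmap[i]["sent_idx"] = feat["sent_idx"]                            # assumed field
            mmap[i]["alias_slice_incidence"] = feat["alias_slice_incidence"]  # assumed field
        mmap.flush()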

bootleg.slicing.slice_dataset.convert_examples_to_features_and_save_hlp(input_dict)[source]

Convert to features helper.

bootleg.slicing.slice_dataset.convert_examples_to_features_and_save_initializer(save_dataset_name, storage)[source]

Convert to features multiprocessing initializer.

bootleg.slicing.slice_dataset.convert_examples_to_features_and_save_single(input_dict, mmap_file)[source]

Convert examples to features multiprocessing helper.

bootleg.slicing.slice_dataset.create_examples(dataset, create_ex_indir, create_ex_outdir, meta_file, data_config, dataset_threads, slice_names, use_weak_label, split)[source]

Create examples from the raw input data.

Parameters
  • dataset – dataset file

  • create_ex_indir – temporary directory where input files are stored

  • create_ex_outdir – temporary directory to store output files from method

  • meta_file – metadata file to save the file names/paths for the next step in the prep pipeline

  • data_config – data config

  • dataset_threads – number of threads

  • slice_names – list of slices to evaluate on

  • use_weak_label – whether to use weak labeling or not

  • split – data split

bootleg.slicing.slice_dataset.create_examples_hlp(args)[source]

Create examples wrapper helper.

bootleg.slicing.slice_dataset.create_examples_initializer(data_config, slice_names, use_weak_label, split, train_in_candidates)[source]

Create examples multiprocessing initializer.

bootleg.slicing.slice_dataset.create_examples_single(in_file_name, in_file_lines, out_file_name, constants_dict)[source]

Create examples multiprocessing helper.

bootleg.slicing.slice_dataset.get_slice_values(slice_names, line)[source]

Return a dictionary of all slice values for an input example.

Any mention with a slice value greater than 0.5 is assigned to that slice. If a slice is missing from the input, all mentions are marked as not being in that slice (a 0 label value). We also check that slices are formatted correctly.

Parameters
  • slice_names – slice names to evaluate on

  • line – input data json line

Returns: Dict of slice name -> alias index string -> float value indicating whether the mention is in the slice.
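
A minimal sketch of the thresholding behavior described above; the JSON field names ("aliases", "slices") and the nested layout are assumptions for illustration, not the authoritative Bootleg input schema.

    def get_slice_values_sketch(slice_names, line):
        num_aliases = len(line["aliases"])      # assumed field name
        raw_slices = line.get("slices", {})     # assumed field name
        slice_values = {}
        for slice_name in slice_names:
            raw = raw_slices.get(slice_name, {})
            # Missing slices mean no mention belongs to the slice (label 0.0);
            # otherwise a value > 0.5 assigns the mention to the slice.
            slice_values[slice_name] = {
                str(i): (1.0 if float(raw.get(str(i), 0.0)) > 0.5 else 0.0)
                for i in range(num_aliases)
            }
        return slice_values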

Module contents

Slicing initializer.