bootleg.slicing package

Submodules

bootleg.slicing.slice_dataset module

Bootleg slice dataset.

class bootleg.slicing.slice_dataset.BootlegSliceDataset(main_args, dataset, use_weak_label, entity_symbols, dataset_threads, split='train')[source]

Bases: object

Slice dataset class.

Our dataset class for holding data slices (or subpopulations).

Each mention can be part of zero or more slices. When running eval, we use the SliceDataset to determine which mentions belong to which slices. Importantly, although the model “sees” all mentions, only GOLD anchor links are evaluated during eval (on the dev/test splits).

Parameters
  • main_args – main arguments

  • dataset – dataset file

  • use_weak_label – whether to use weak labeling or not

  • entity_symbols – entity symbols

  • dataset_threads – number of processes to use

  • split – data split
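
A minimal eval-time usage sketch, assuming the Bootleg config and entity symbols are built elsewhere in the pipeline; the file path and thread count below are hypothetical placeholders, not values from this module.

    from bootleg.slicing.slice_dataset import BootlegSliceDataset

    def build_dev_slice_dataset(args, entity_symbols):
        # args: parsed Bootleg config; entity_symbols: loaded entity symbols object.
        return BootlegSliceDataset(
            main_args=args,
            dataset="dev.jsonl",            # hypothetical eval file
            use_weak_label=True,
            entity_symbols=entity_symbols,
            dataset_threads=8,
            split="dev",
        )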

classmethod build_data_dict(save_dataset_name, storage)[source]

Build the slice dataset from a saved file.

Loads the memmap slice dataset and creates a mapping from sentence index to row index.

Parameters
  • save_dataset_name – saved memmap file name

  • storage – storage type of memmap file

Returns: numpy memmap data, Dict of sentence index to row in data
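
A minimal sketch of the loading step described above, assuming the stored records expose a sent_idx field in the structured dtype (an assumption for illustration, not the exact Bootleg storage layout):

    from collections import defaultdict
    import numpy as np

    def build_data_dict_sketch(save_dataset_name, storage):
        # Open the saved memmap without reading it fully into memory.
        data = np.memmap(save_dataset_name, dtype=storage, mode="r")
        # Map each sentence index to the row(s) holding its features.
        sent_idx_to_rows = defaultdict(list)
        for row_idx in range(len(data)):
            sent_idx_to_rows[int(data[row_idx]["sent_idx"])].append(row_idx)  # assumed field
        return data, dict(sent_idx_to_rows)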

contains_sentidx(sent_idx)[source]

Return true if the sentence index is in the dataset.

Parameters
  • sent_idx – sentence index

Returns: bool, whether the sentence index is in the dataset or not

get_slice_incidence_arr(sent_idx, alias_orig_list_pos)[source]

Get slice incidence array.

Given the sentence index and the list of alias positions to get slice incidence for (entries may be -1, indicating no alias), return a dictionary mapping slice_name -> 0/1 incidence array indicating whether each alias in alias_orig_list_pos is in the slice (-1 for no alias).

Parameters
  • sent_idx – sentence index

  • alias_orig_list_pos – list of alias positions in input data list (due to sentence splitting, aliases may be split up)

Returns: Dict of slice name -> 0/1 incidence array
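
A hypothetical helper showing how the returned incidence arrays might be consumed at eval time; slice_dataset is a BootlegSliceDataset built as in the sketch above, and the indices are made up.

    import numpy as np

    def mentions_per_slice(slice_dataset, sent_idx, alias_orig_list_pos):
        incidence = slice_dataset.get_slice_incidence_arr(sent_idx, alias_orig_list_pos)
        # Each array entry is 1 (alias in slice), 0 (not in slice), or -1 (no alias).
        return {
            slice_name: int((np.asarray(arr) == 1).sum())
            for slice_name, arr in incidence.items()
        }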

class bootleg.slicing.slice_dataset.InputExample(sent_idx, subslice_idx, anchor, num_alias2pred, slices)[source]

Bases: object

A single training/test example.

classmethod from_dict(in_dict)[source]

Create object from dictionary.

to_dict()[source]

Turn object into dictionary.

class bootleg.slicing.slice_dataset.InputFeatures(sent_idx, subslice_idx, alias_slice_incidence, alias2pred_probs)[source]

Bases: object

A single set of features of data.

classmethod from_dict(in_dict)[source]

Create object from dictionary.

to_dict()[source]

Turn object into dictionary.

bootleg.slicing.slice_dataset.convert_examples_to_features_and_save(meta_file, dataset_threads, slice_names, save_dataset_name, storage)[source]

Convert the prepped examples into input features.

Saves in memmap files. These are used in the __getitem__ method.

Parameters
  • meta_file – metadata file where input file paths are saved

  • dataset_threads – number of threads

  • slice_names – list of slice names to evaluate on

  • save_dataset_name – data file name to save

  • storage – data storage type (for memmap)
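
A minimal sketch of the memmap-writing pattern described above, assuming storage is a numpy structured dtype and the features are dict-shaped; the field names are placeholders rather than the actual Bootleg schema.

    import numpy as np

    def save_features_sketch(save_dataset_name, features, storage):
        # Allocate a writable memmap with one row per feature record.
        mmap = np.memmap(save_dataset_name, dtype=storage, mode="w+", shape=(len(features),))
        for i, feat in enumerate(features):
            mmap[i]["sent_idx"] = feat["sent_idx"]                            # assumed field
            mmap[i]["alias_slice_incidence"] = feat["alias_slice_incidence"]  # assumed field
        mmap.flush()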

bootleg.slicing.slice_dataset.convert_examples_to_features_and_save_hlp(input_dict)[source]

Convert to features helper.

bootleg.slicing.slice_dataset.convert_examples_to_features_and_save_initializer(save_dataset_name, storage)[source]

Convert to features multiprocessing initializer.

bootleg.slicing.slice_dataset.convert_examples_to_features_and_save_single(input_dict, mmap_file)[source]

Convert examples to features multiprocessing helper.

bootleg.slicing.slice_dataset.create_examples(dataset, create_ex_indir, create_ex_outdir, meta_file, data_config, dataset_threads, slice_names, use_weak_label, split)[source]

Create examples from the raw input data.

Parameters
  • dataset – dataset file

  • create_ex_indir – temporary directory where input files are stored

  • create_ex_outdir – temporary directory to store output files from method

  • meta_file – metadata file to save the file names/paths for the next step in the prep pipeline

  • data_config – data config

  • dataset_threads – number of threads

  • slice_names – list of slices to evaluate on

  • use_weak_label – whether to use weak labeling or not

  • split – data split

bootleg.slicing.slice_dataset.create_examples_hlp(args)[source]

Create examples wrapper helper.

bootleg.slicing.slice_dataset.create_examples_initializer(data_config, slice_names, use_weak_label, split, train_in_candidates)[source]

Create examples multiprocessing initializer.

bootleg.slicing.slice_dataset.create_examples_single(in_file_name, in_file_lines, out_file_name, constants_dict)[source]

Create examples multiprocessing helper.

bootleg.slicing.slice_dataset.get_slice_values(slice_names, line)[source]

Return a dictionary of all slice values for an input example.

Any mention with a slice value greater than 0.5 is assigned to that slice. If a slice is missing from the input, all mentions are marked as not being in that slice (a 0 label value). We also check that slices are formatted correctly.

Parameters
  • slice_names – slice names to evaluate on

  • line – input data json line

Returns: Dict of slice name -> alias index string -> float value indicating whether the mention is in the slice.
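
A minimal sketch of the thresholding behavior described above; the JSON field names ("aliases", "slices") and the nested layout are assumptions for illustration, not the authoritative Bootleg input schema.

    def get_slice_values_sketch(slice_names, line):
        num_aliases = len(line["aliases"])      # assumed field name
        raw_slices = line.get("slices", {})     # assumed field name
        slice_values = {}
        for slice_name in slice_names:
            raw = raw_slices.get(slice_name, {})
            # Missing slices mean no mention belongs to the slice (label 0.0);
            # otherwise a value > 0.5 assigns the mention to the slice.
            slice_values[slice_name] = {
                str(i): (1.0 if float(raw.get(str(i), 0.0)) > 0.5 else 0.0)
                for i in range(num_aliases)
            }
        return slice_values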

Module contents

Slicing initializer.