bootleg.utils.preprocessing package¶
Submodules¶
bootleg.utils.preprocessing.compute_statistics module¶
Compute statistics over data.
Helper file for computing various statistics over our data, such as mention frequency, mention text frequency in the data (even if not labeled as an anchor), etc.
- bootleg.utils.preprocessing.compute_statistics.chunk_text_data(input_src, chunk_files, chunk_size, num_lines)[source]¶
Chunk text data.
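The exact chunking logic lives in the source, but a minimal sketch of line-based chunking with this signature might look like the following (illustrative only, not the library's implementation; `num_lines` is kept for signature parity and unused here):

```python
import os
import tempfile

def chunk_text_data(input_src, chunk_files, chunk_size, num_lines):
    """Split input_src into len(chunk_files) pieces of ~chunk_size lines each.

    Sketch only; the real Bootleg helper may differ. num_lines is kept
    for signature parity and is unused here.
    """
    chunk_idx = 0
    out = open(chunk_files[chunk_idx], "w", encoding="utf-8")
    with open(input_src, encoding="utf-8") as f:
        for i, line in enumerate(f):
            # Roll over to the next chunk file every chunk_size lines,
            # as long as another chunk file is available.
            if i > 0 and i % chunk_size == 0 and chunk_idx + 1 < len(chunk_files):
                out.close()
                chunk_idx += 1
                out = open(chunk_files[chunk_idx], "w", encoding="utf-8")
            out.write(line)
    out.close()

# Demo on a synthetic 10-line jsonl file split into 3 chunks.
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "data.jsonl")
with open(src, "w", encoding="utf-8") as f:
    for i in range(10):
        f.write('{"sentence": "line %d"}\n' % i)
chunks = [os.path.join(tmpdir, "chunk_%d.jsonl" % j) for j in range(3)]
chunk_text_data(src, chunks, chunk_size=4, num_lines=10)
chunk_lens = [sum(1 for _ in open(c, encoding="utf-8")) for c in chunks]
```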
- bootleg.utils.preprocessing.compute_statistics.compute_histograms(save_dir, entity_symbols)[source]¶
Compute histograms.
- bootleg.utils.preprocessing.compute_statistics.compute_occurrences(save_dir, data_file, entity_dump, lower, strip, num_workers=8)[source]¶
Compute statistics.
- bootleg.utils.preprocessing.compute_statistics.compute_occurrences_single(args, max_alias_len=6)[source]¶
Compute statistics single process.
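At its core, occurrence counting tallies aliases and QIDs over one chunk of data. A hedged sketch of what a single-process pass might do (illustrative only; the parallel `aliases`/`qids` record fields and the token-length filter are assumptions, not the library's implementation):

```python
from collections import Counter

def compute_occurrences_single(records, max_alias_len=6):
    """Tally alias and QID frequencies over one chunk of records.

    Illustrative sketch only; assumes each record carries parallel
    "aliases" and "qids" lists, and skips aliases longer than
    max_alias_len whitespace-separated tokens.
    """
    alias_counts, qid_counts = Counter(), Counter()
    for rec in records:
        for alias, qid in zip(rec["aliases"], rec["qids"]):
            if len(alias.split()) > max_alias_len:
                continue
            alias_counts[alias] += 1
            qid_counts[qid] += 1
    return alias_counts, qid_counts

# Two toy records sharing one QID.
records = [
    {"aliases": ["barack obama", "usa"], "qids": ["Q76", "Q30"]},
    {"aliases": ["obama"], "qids": ["Q76"]},
]
alias_counts, qid_counts = compute_occurrences_single(records)
```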
bootleg.utils.preprocessing.count_body_part_size module¶
bootleg.utils.preprocessing.gen_alias_cand_map module¶
bootleg.utils.preprocessing.gen_entity_mappings module¶
bootleg.utils.preprocessing.get_train_qid_counts module¶
Compute QID counts.
Helper function that computes a dictionary of QID -> count in training data.
If a QID is not in this dictionary, it has a count of zero.
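The zero-for-missing-QIDs behavior falls out naturally from `collections.Counter`, whose lookups return 0 for absent keys. A minimal sketch (the `qids` record field is an assumption; not the library's implementation):

```python
from collections import Counter

def get_train_qid_counts(records):
    """Map QID -> count in training records; missing QIDs read as zero.

    Sketch only; assumes each record has a "qids" list of gold labels.
    """
    counts = Counter()
    for rec in records:
        counts.update(rec["qids"])
    return counts

train = [{"qids": ["Q76", "Q30"]}, {"qids": ["Q76"]}]
qid_counts = get_train_qid_counts(train)
```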
bootleg.utils.preprocessing.sample_eval_data module¶
Sample eval data.
This will sample jsonl train or eval data based on the slices in the data. This is useful for subsampling a smaller eval dataset.
The output of this file is a file with a subset of sentences from the input file, sampled such that for each slice in --args.slice, a minimum of --args.min_sample_size mentions are in the slice (if possible). Once that is satisfied, we sample to get approximately --args.sample_perc of the mentions from each slice.
- bootleg.utils.preprocessing.sample_eval_data.get_slice_stats(num_processes, file)[source]¶
Get true anchor slice counts.
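The sampling procedure described above can be sketched at sentence level. This is a simplification, not the library's implementation: it counts sentences rather than mentions, and the helper name, `slices` field, and deterministic seed are all illustrative.

```python
import random

def sample_by_slices(sentences, slice_names, min_sample_size, sample_perc, seed=0):
    """Pick sentence ids so each slice keeps at least min_sample_size
    sentences (when possible), then roughly sample_perc of each slice.

    Simplified sketch: operates on sentences rather than mentions.
    """
    rng = random.Random(seed)
    keep = set()
    for name in slice_names:
        ids = [i for i, s in enumerate(sentences) if name in s["slices"]]
        rng.shuffle(ids)
        # Floor at min_sample_size; slicing past the end just keeps everything.
        target = max(min_sample_size, int(sample_perc * len(ids)))
        keep.update(ids[:target])
    return sorted(keep)

# Toy data: six common sentences plus two in a rare slice.
sents = [{"slices": ["all"]} for _ in range(6)]
sents += [{"slices": ["all", "rare"]}, {"slices": ["rare"]}]
kept = sample_by_slices(sents, ["all", "rare"], min_sample_size=2, sample_perc=0.5)
```

Because the rare slice has only two sentences, the minimum-size floor keeps both, while the common slice is downsampled toward `sample_perc`.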
Module contents¶
Preprocessing init.