bootleg.end2end package

Submodules

bootleg.end2end.annotator_utils module

Annotator utils.

class bootleg.end2end.annotator_utils.DownloadProgressBar

Bases: object

Progress bar.

bootleg.end2end.bootleg_annotator module

BootlegAnnotator.

class bootleg.end2end.bootleg_annotator.BootlegAnnotator(config: Optional[Union[str, Dict[str, Any]]] = None, device: Optional[int] = None, min_alias_len: int = 1, max_alias_len: int = 6, threshold: float = 0.0, cache_dir: Optional[str] = None, model_name: Optional[str] = None, entity_emb_file: Optional[str] = None, return_embs: bool = False, return_ctx_embs: bool = False, extract_method: str = 'spacy', verbose: bool = False)

Bases: object

Bootleg on-the-fly annotator.

BootlegAnnotator is a convenience wrapper around preprocessing and model evaluation. It annotates single sentences at a time for quick experimentation, e.g. in notebooks.

Parameters

config – model config or path to config (default None)
device – model device, -1 for CPU (default None)
min_alias_len – minimum alias length (default 1)
max_alias_len – maximum alias length (default 6)
threshold – probability threshold (default 0.0)
cache_dir – cache directory (default None)
model_name – model name (default None)
entity_emb_file – entity embedding file (default None)
return_embs – whether to return entity embeddings or not (default False)
return_ctx_embs – whether to return context embeddings or not (default False)
extract_method – mention extraction method (default 'spacy')
verbose – verbose boolean (default False)

extract_mentions(text)

Mention extraction wrapper.

Parameters: text – text to extract mentions from

Returns: JSON object of sentence to be used in eval

get_entity_tokens(qid)

Get entity tokens.

Parameters: qid – entity QID
Returns: Dict of input tokens for forward pass.

get_forward_batch(input_ids, token_type_ids, attention_mask, entity_token_ids, entity_type_ids, entity_attention_mask, entity_cand_eid, generate_entity_inputs)

Generate Emmental batch.

Parameters

input_ids – word token ids
token_type_ids – word token type ids
attention_mask – word attention mask
entity_token_ids – entity token ids
entity_type_ids – entity type ids
entity_attention_mask – entity attention mask
entity_cand_eid – entity candidate eids
generate_entity_inputs – whether to generate entity id inputs

Returns: X_dict for emmental

get_sentence_tokens(sample, men_idx)

Get context tokens.

Parameters

sample – Dict sample after extraction
men_idx – mention index to select

Returns: Dict of tokenized outputs

label_mentions(text_list=None, extracted_examples=None)

Extract mentions and run disambiguation.

If the user provides extracted_examples, text_list is ignored.

Parameters

text_list – list of texts to disambiguate, or a single string (can be None if extracted_examples is not None)
extracted_examples – list of Dicts with keys “sentence”, “aliases”, “spans”, “cands” (QIDs) (optional)

Returns: Dict of

qids – final predicted QIDs
probs – final predicted probs
titles – final predicted titles
cands – all entity candidates
cand_probs – probabilities of all candidates
char_spans – final extracted char spans
aliases – final extracted aliases
embs – final entity contextualized embeddings (if return_embs is True)
cand_embs – final candidate entity contextualized embeddings (if return_embs is True)
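The returned Dict holds parallel lists, so a common first step is to zip them into one record per mention. The following is a minimal sketch, assuming each key maps to one inner list per input sentence and that the inner lists are aligned by mention position. That nesting is an assumption and is not stated above; flatten_predictions is a hypothetical helper, and the sample values are hand-written placeholders rather than model output.

```python
from typing import Any, Dict, List


def flatten_predictions(result: Dict[str, List[List[Any]]]) -> List[Dict[str, Any]]:
    """Turn the parallel per-sentence lists into one record per mention."""
    keys = ("aliases", "char_spans", "qids", "titles", "probs")
    records = []
    for sent_idx in range(len(result["qids"])):
        # Assumption: every key holds one inner list per input sentence,
        # and the inner lists line up by mention position.
        per_sentence = [result[key][sent_idx] for key in keys]
        for alias, span, qid, title, prob in zip(*per_sentence):
            records.append(
                {
                    "sentence_index": sent_idx,
                    "alias": alias,
                    "char_span": span,
                    "qid": qid,
                    "title": title,
                    "prob": prob,
                }
            )
    return records


# Hand-written placeholder values, NOT real model output.
example_result = {
    "aliases": [["lincoln"]],
    "char_spans": [[[0, 7]]],
    "qids": [["Q0"]],
    "titles": [["Placeholder title"]],
    "probs": [[0.9]],
}

records = flatten_predictions(example_result)
for record in records:
    print(record)
```

With a real annotator, the same helper would be applied to the Dict returned by label_mentions, provided the nesting assumption holds for the installed version.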

set_threshold(value)

Set threshold.

Parameters: value – threshold value

bootleg.end2end.bootleg_annotator.create_config(model_path, data_path, model_name)

Create Bootleg config.

Parameters

model_path – model directory
data_path – data directory
model_name – model name

Returns: updated config

bootleg.end2end.bootleg_annotator.create_sources(model_path, data_path, model_name)

Download Bootleg data and save it in the log dir.

Parameters

model_path – model directory
data_path – data directory
model_name – model name to download

bootleg.end2end.bootleg_annotator.get_default_cache()

Get default cache directory for saving Bootleg data.

bootleg.end2end.extract_mentions module

Extract mentions.

This file takes in a jsonlines file with sentences and extracts aliases and spans using a pre-computed alias table.

bootleg.end2end.extract_mentions.chunk_text_data(input_src, chunk_files, chunk_size, num_lines)

Split the text input file into chunks of chunk_size lines.

Parameters

input_src – input file
chunk_files – list of chunk file names
chunk_size – chunk size in number of lines
num_lines – total number of lines
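As an illustration of line-based chunking, here is a minimal sketch that writes consecutive runs of chunk_size lines into a list of chunk files. It is not the library's implementation: chunk_lines is a hypothetical stand-in, the file names are made up, and num_lines is used here only to decide how many chunk files to create.

```python
import math
import os
import tempfile
from typing import List


def chunk_lines(input_src: str, chunk_files: List[str], chunk_size: int) -> None:
    """Write consecutive runs of chunk_size lines from input_src to chunk_files."""
    with open(input_src, encoding="utf-8") as src:
        for chunk_path in chunk_files:
            with open(chunk_path, "w", encoding="utf-8") as out:
                for _ in range(chunk_size):
                    line = src.readline()
                    if not line:  # input exhausted; last chunk may be short
                        break
                    out.write(line)


with tempfile.TemporaryDirectory() as tmp:
    # Build a small placeholder jsonlines input with 5 lines.
    src_path = os.path.join(tmp, "input.jsonl")
    with open(src_path, "w", encoding="utf-8") as f:
        for i in range(5):
            f.write('{"sentence": "placeholder %d"}\n' % i)

    num_lines, chunk_size = 5, 2
    num_chunks = math.ceil(num_lines / chunk_size)
    chunk_files = [os.path.join(tmp, f"chunk_{i}.jsonl") for i in range(num_chunks)]
    chunk_lines(src_path, chunk_files, chunk_size)

    # Count the lines that landed in each chunk file.
    chunk_counts = []
    for path in chunk_files:
        with open(path, encoding="utf-8") as f:
            chunk_counts.append(sum(1 for _ in f))

print(chunk_counts)  # [2, 2, 1]
```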

bootleg.end2end.extract_mentions.create_out_line(sent_obj, final_aliases, final_spans, found_char_spans)

Create JSON output line.

Parameters

sent_obj – input sentence JSON
final_aliases – list of final aliases
final_spans – list of final spans
found_char_spans – list of final char spans

Returns: JSON object
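To make the shape of such an output line concrete, here is a minimal sketch that attaches the extracted fields to a copy of the input sentence object. The key names "aliases", "spans", and "char_spans" are borrowed from the label_mentions description and are an assumption about the output format. build_out_line is a hypothetical helper, and the span convention (word-level, end-exclusive) is also assumed.

```python
import json
from typing import Any, Dict, List


def build_out_line(
    sent_obj: Dict[str, Any],
    final_aliases: List[str],
    final_spans: List[List[int]],
    found_char_spans: List[List[int]],
) -> Dict[str, Any]:
    """Return a copy of sent_obj with the extraction results attached."""
    out = dict(sent_obj)  # shallow copy so the input object is left untouched
    out["aliases"] = final_aliases          # assumed key name
    out["spans"] = final_spans              # assumed key name, word-level spans
    out["char_spans"] = found_char_spans    # assumed key name, character spans
    return out


# Placeholder input; the span values are illustrative only.
sent = {"sentence": "Lincoln was president"}
out_line = build_out_line(sent, ["lincoln"], [[0, 1]], [[0, 7]])
print(json.dumps(out_line))
```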

bootleg.end2end.extract_mentions.extract_mentions(in_filepath, out_filepath, entity_db_dir, extract_method='ngram_spacy', min_alias_len=1, max_alias_len=6, num_workers=8, num_chunks=None, verbose=False)

Extract mentions from file.

Parameters

in_filepath – input file
out_filepath – output file
entity_db_dir – path to entity db
extract_method – mention extraction method
min_alias_len – minimum alias length (in words)
max_alias_len – maximum alias length (in words)
num_workers – number of multiprocessing workers
num_chunks – number of subchunks to feed to workers
verbose – verbose boolean
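The effect of min_alias_len and max_alias_len can be illustrated with a simple n-gram lookup against an alias table. The sketch below is a generic greedy longest-match scan and is not claimed to be the algorithm Bootleg uses. find_alias_ngrams, the whitespace tokenization, the lowercasing, and the tiny alias table are all assumptions made for illustration.

```python
from typing import List, Set, Tuple


def find_alias_ngrams(
    tokens: List[str],
    alias_table: Set[str],
    min_alias_len: int = 1,
    max_alias_len: int = 6,
) -> List[Tuple[str, int, int]]:
    """Greedy longest-match scan; returns (alias, start, end) with end exclusive."""
    found = []
    i = 0
    while i < len(tokens):
        match = None
        longest = min(max_alias_len, len(tokens) - i)
        # Try the longest allowed n-gram first, down to min_alias_len words.
        for n in range(longest, min_alias_len - 1, -1):
            candidate = " ".join(tokens[i : i + n]).lower()
            if candidate in alias_table:
                match = (candidate, i, i + n)
                break
        if match is not None:
            found.append(match)
            i = match[2]  # skip past the matched words (no overlapping mentions)
        else:
            i += 1
    return found


# Toy alias table and sentence, for illustration only.
alias_table = {"abraham lincoln", "lincoln", "new york", "york"}
tokens = "Abraham Lincoln visited New York".split()

mentions = find_alias_ngrams(tokens, alias_table)
short_mentions = find_alias_ngrams(tokens, alias_table, max_alias_len=1)
print(mentions)
print(short_mentions)
```

Lowering max_alias_len to 1 restricts the scan to single-word aliases, so multi-word aliases such as "new york" are no longer found and only their one-word sub-aliases match.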

bootleg.end2end.extract_mentions.main()

Run.

bootleg.end2end.extract_mentions.merge_files(chunk_outfiles, out_filepath)

Merge output files.

Parameters

chunk_outfiles – list of chunk files
out_filepath – final output file path
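Merging chunk outputs amounts to concatenating the chunk files in order. The following is a minimal sketch under that assumption. merge_line_files is a hypothetical stand-in rather than the library function, and the file names and contents are placeholders.

```python
import os
import tempfile
from typing import List


def merge_line_files(chunk_outfiles: List[str], out_filepath: str) -> None:
    """Concatenate chunk files, in the given order, into one output file."""
    with open(out_filepath, "w", encoding="utf-8") as out:
        for path in chunk_outfiles:
            with open(path, encoding="utf-8") as chunk:
                for line in chunk:
                    out.write(line)


with tempfile.TemporaryDirectory() as tmp:
    # Two placeholder chunk outputs with jsonlines content.
    chunk_contents = [
        ['{"id": 0}\n', '{"id": 1}\n'],
        ['{"id": 2}\n'],
    ]
    chunk_outfiles = []
    for idx, lines in enumerate(chunk_contents):
        path = os.path.join(tmp, f"chunk_out_{idx}.jsonl")
        with open(path, "w", encoding="utf-8") as f:
            f.writelines(lines)
        chunk_outfiles.append(path)

    out_filepath = os.path.join(tmp, "merged.jsonl")
    merge_line_files(chunk_outfiles, out_filepath)

    with open(out_filepath, encoding="utf-8") as f:
        merged = f.readlines()

print(merged)
```

Passing the chunk files in their original order keeps the merged output aligned with the order of the input sentences.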

bootleg.end2end.extract_mentions.parse_args()

Generate args.

bootleg.end2end.extract_mentions.subprocess(args)

Extract mentions in a single process.

Parameters: args – subprocess args

Module contents

End2End init.