bootleg.end2end package
Submodules
bootleg.end2end.annotator_utils module
Annotator utils.
bootleg.end2end.bootleg_annotator module
BootlegAnnotator.
- class bootleg.end2end.bootleg_annotator.BootlegAnnotator(config: Optional[Union[str, Dict[str, Any]]] = None, device: Optional[int] = None, min_alias_len: int = 1, max_alias_len: int = 6, threshold: float = 0.0, cache_dir: Optional[str] = None, model_name: Optional[str] = None, entity_emb_file: Optional[str] = None, return_embs: bool = False, return_ctx_embs: bool = False, extract_method: str = 'spacy', verbose: bool = False)
Bases: object
Bootleg on-the-fly annotator.
BootlegAnnotator is a convenience wrapper around preprocessing and model evaluation that annotates single sentences at a time for quick experimentation, e.g. in notebooks.
- Parameters
config – model config or path to config (default None)
device – model device, -1 for CPU (default None)
min_alias_len – minimum alias length (default 1)
max_alias_len – maximum alias length (default 6)
threshold – probability threshold (default 0.0)
cache_dir – cache directory (default None)
model_name – model name (default None)
entity_emb_file – entity embedding file (default None)
return_embs – whether to return entity embeddings or not (default False)
return_ctx_embs – whether to return context embeddings or not (default False)
extract_method – mention extraction method (default 'spacy')
verbose – verbose boolean (default False)
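A minimal sketch of how the constructor arguments above might be assembled in a notebook. The helpers `annotator_kwargs` and `make_annotator` are hypothetical and not part of Bootleg, and treating a non-negative `device` as a GPU index is an assumption; the documentation above only states that -1 selects the CPU. The Bootleg import is deferred so the settings can be inspected without the package or a model being available.

```python
def annotator_kwargs(use_gpu=False, threshold=0.0, verbose=False):
    """Collect constructor arguments (hypothetical helper, not part of Bootleg)."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold is a probability and must lie in [0, 1]")
    return {
        # -1 selects the CPU; using 0 for the first GPU is an assumption.
        "device": 0 if use_gpu else -1,
        "min_alias_len": 1,
        "max_alias_len": 6,
        "threshold": threshold,
        "return_embs": False,
        "verbose": verbose,
    }


def make_annotator(**kwargs):
    """Build a BootlegAnnotator; requires Bootleg to be installed."""
    from bootleg.end2end.bootleg_annotator import BootlegAnnotator

    return BootlegAnnotator(**kwargs)


kwargs = annotator_kwargs(threshold=0.3)
print(kwargs["device"], kwargs["threshold"])

# With Bootleg installed, the annotator could then be created and used as:
# annotator = make_annotator(**kwargs)
# result = annotator.label_mentions("some text")
```

Keeping the arguments in a plain dict makes it easy to log or tweak the configuration before the model is loaded.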
- extract_mentions(text)
Mention extraction wrapper.
- Parameters
text – text to extract mentions from
Returns: JSON object of sentence to be used in eval
- get_entity_tokens(qid)
Get entity tokens.
- Parameters
qid – entity QID
- Returns
Dict of input tokens for forward pass.
- get_forward_batch(input_ids, token_type_ids, attention_mask, entity_token_ids, entity_type_ids, entity_attention_mask, entity_cand_eid, generate_entity_inputs)
Generate emmental batch.
- Parameters
input_ids – word token ids
token_type_ids – word token type ids
attention_mask – word attention mask
entity_token_ids – entity token ids
entity_type_ids – entity type ids
entity_attention_mask – entity attention mask
entity_cand_eid – entity candidate eids
generate_entity_inputs – whether to generate entity id inputs
Returns: X_dict for emmental
- get_sentence_tokens(sample, men_idx)
Get context tokens.
- Parameters
sample – Dict sample after extraction
men_idx – mention index to select
Returns: Dict of tokenized outputs
- label_mentions(text_list=None, extracted_examples=None)
Extract mentions and run disambiguation.
If the user provides extracted_examples, text_list is ignored.
- Parameters
text_list – list of text to disambiguate (or single string) (can be None if extracted_examples is not None)
extracted_examples – List of Dicts of keys “sentence”, “aliases”, “spans”, “cands” (QIDs) (optional)
Returns: Dict of
qids: final predicted QIDs
probs: final predicted probs
titles: final predicted titles
cands: all entity candidates
cand_probs: probabilities of all candidates
char_spans: final extracted char spans
aliases: final extracted aliases
embs: final entity contextualized embeddings (if return_embs is True)
cand_embs: final candidate entity contextualized embeddings (if return_embs is True)
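The returned dict can be awkward to read key by key, so the following sketch flattens it into one record per mention. It assumes each value is a list with one entry per input sentence and that each entry holds one item per mention; that nesting is an assumption and is not stated above. The function `flatten_predictions` is hypothetical, and the sample dict is hand-written placeholder data, not real model output.

```python
def flatten_predictions(result, min_prob=0.0):
    """Turn a label_mentions-style dict into one record per mention.

    Assumes result[key][i][j] describes mention j of sentence i.
    """
    records = []
    per_sentence = zip(
        result["qids"],
        result["probs"],
        result["titles"],
        result["aliases"],
        result["char_spans"],
    )
    for sent_idx, (qids, probs, titles, aliases, spans) in enumerate(per_sentence):
        for qid, prob, title, alias, span in zip(qids, probs, titles, aliases, spans):
            if prob >= min_prob:
                records.append(
                    {
                        "sentence": sent_idx,
                        "alias": alias,
                        "char_span": span,
                        "qid": qid,
                        "title": title,
                        "prob": prob,
                    }
                )
    return records


# Hand-written placeholder data in the assumed layout (one sentence, two mentions).
sample = {
    "qids": [["Q1", "Q2"]],
    "probs": [[0.91, 0.42]],
    "titles": [["Title one", "Title two"]],
    "aliases": [["alias one", "alias two"]],
    "char_spans": [[[0, 9], [14, 23]]],
}

for record in flatten_predictions(sample, min_prob=0.5):
    print(record)
```

Filtering on `min_prob` after the fact is a separate step from the `threshold` constructor argument and is shown here only for illustration.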
- bootleg.end2end.bootleg_annotator.create_config(model_path, data_path, model_name)
Create Bootleg config.
- Parameters
model_path – model directory
data_path – data directory
model_name – model name
Returns: updated config
bootleg.end2end.extract_mentions module
Extract mentions.
This file takes in a jsonlines file with sentences and extracts aliases and spans using a pre-computed alias table.
- bootleg.end2end.extract_mentions.chunk_text_data(input_src, chunk_files, chunk_size, num_lines)
Chunk text input file into chunk_size chunks.
- Parameters
input_src – input file
chunk_files – list of chunk file names
chunk_size – chunk size in number of lines
num_lines – total number of lines
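To illustrate the idea of line-based chunking, here is a minimal sketch that splits a jsonlines file into files of at most `chunk_size` lines. It is not Bootleg's implementation: `chunk_lines` is a hypothetical function, the file names are made up, and the total line count is used only to work out how many chunk files are needed.

```python
import math
import os
import tempfile


def chunk_lines(input_path, chunk_paths, chunk_size):
    """Write consecutive runs of up to chunk_size lines to each chunk path."""
    with open(input_path, encoding="utf-8") as src:
        for chunk_path in chunk_paths:
            with open(chunk_path, "w", encoding="utf-8") as out:
                for _ in range(chunk_size):
                    line = src.readline()
                    if not line:  # end of input reached
                        break
                    out.write(line)


with tempfile.TemporaryDirectory() as tmp:
    # Build a small five-line jsonlines input file.
    input_path = os.path.join(tmp, "input.jsonl")
    with open(input_path, "w", encoding="utf-8") as f:
        for i in range(5):
            f.write('{"sentence": "sentence %d"}\n' % i)

    num_lines, chunk_size = 5, 2
    num_chunks = math.ceil(num_lines / chunk_size)
    chunk_paths = [os.path.join(tmp, f"chunk_{i}.jsonl") for i in range(num_chunks)]
    chunk_lines(input_path, chunk_paths, chunk_size)

    # Count the lines that ended up in each chunk file.
    chunk_sizes = []
    for path in chunk_paths:
        with open(path, encoding="utf-8") as f:
            chunk_sizes.append(sum(1 for _ in f))

print(chunk_sizes)  # prints [2, 2, 1]
```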
- bootleg.end2end.extract_mentions.create_out_line(sent_obj, final_aliases, final_spans, found_char_spans)
Create JSON output line.
- Parameters
sent_obj – input sentence JSON
final_aliases – list of final aliases
final_spans – list of final spans
found_char_spans – list of final char spans
Returns: JSON object
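As a rough illustration of what an output line could look like, the sketch below attaches the mention fields to a copy of the input sentence object. The function `build_out_line` is hypothetical, and the output keys "aliases", "spans", and "char_spans" are assumptions modeled on the key names used elsewhere on this page; the real function may use different keys or add more fields. Word spans and char spans are treated as end-exclusive, which is also an assumption.

```python
import json


def build_out_line(sent_obj, aliases, spans, char_spans):
    """Return a copy of sent_obj with mention fields attached (assumed keys)."""
    if not (len(aliases) == len(spans) == len(char_spans)):
        raise ValueError("aliases, spans, and char_spans must be the same length")
    out = dict(sent_obj)  # shallow copy so the input object is left untouched
    out["aliases"] = list(aliases)
    out["spans"] = [list(s) for s in spans]
    out["char_spans"] = [list(s) for s in char_spans]
    return out


line = build_out_line(
    {"sentence": "the quick brown fox"},
    aliases=["quick brown fox"],
    spans=[[1, 4]],  # word indices, end-exclusive (assumption)
    char_spans=[[4, 19]],  # character offsets, end-exclusive (assumption)
)
print(json.dumps(line))
```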
- bootleg.end2end.extract_mentions.extract_mentions(in_filepath, out_filepath, entity_db_dir, extract_method='ngram_spacy', min_alias_len=1, max_alias_len=6, num_workers=8, num_chunks=None, verbose=False)
Extract mentions from file.
- Parameters
in_filepath – input file
out_filepath – output file
entity_db_dir – path to entity db
extract_method – mention extraction method
min_alias_len – minimum alias length (in words)
max_alias_len – maximum alias length (in words)
num_workers – number of multiprocessing workers
num_chunks – number of subchunks to feed to workers
verbose – verbose boolean
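The roles of `min_alias_len` and `max_alias_len` can be illustrated with a simple n-gram lookup against an alias table. This is a sketch of the general technique, a greedy longest-first match over whitespace-separated words, and not Bootleg's actual extraction logic, which may tokenize with spaCy and apply further filtering. The function `find_aliases` and the toy alias table are hypothetical.

```python
def find_aliases(sentence, alias_table, min_alias_len=1, max_alias_len=6):
    """Greedy longest-first n-gram lookup.

    Returns (aliases, spans), where each span is an end-exclusive
    [start, end] pair of word indices.
    """
    words = sentence.split()
    lowered = [w.lower() for w in words]
    aliases, spans = [], []
    start = 0
    while start < len(words):
        matched = False
        longest = min(max_alias_len, len(words) - start)
        # Try the longest candidate first so longer aliases win over shorter ones.
        for length in range(longest, min_alias_len - 1, -1):
            candidate = " ".join(lowered[start:start + length])
            if candidate in alias_table:
                aliases.append(candidate)
                spans.append([start, start + length])
                start += length  # skip past the match to avoid overlaps
                matched = True
                break
        if not matched:
            start += 1
    return aliases, spans


# Toy alias table for illustration only.
table = {"new york", "new york city", "paris"}
aliases, spans = find_aliases("She flew from New York City to Paris", table)
print(aliases, spans)
```

Lowering `max_alias_len` to 2 in this example would match "new york" instead of "new york city", since three-word candidates would no longer be considered.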
Module contents
End2End init.