bootleg.end2end package

Submodules

bootleg.end2end.annotator_utils module

Annotator utils.

class bootleg.end2end.annotator_utils.DownloadProgressBar[source]

Bases: object

Progress bar.

bootleg.end2end.bootleg_annotator module

BootlegAnnotator.

class bootleg.end2end.bootleg_annotator.BootlegAnnotator(config: Optional[Union[str, Dict[str, Any]]] = None, device: Optional[int] = None, min_alias_len: int = 1, max_alias_len: int = 6, threshold: float = 0.0, cache_dir: Optional[str] = None, model_name: Optional[str] = None, entity_emb_file: Optional[str] = None, return_embs: bool = False, return_ctx_embs: bool = False, extract_method: str = 'spacy', verbose: bool = False)[source]

Bases: object

Bootleg on-the-fly annotator.

BootlegAnnotator is a convenience wrapper around preprocessing and model evaluation that annotates individual sentences on the fly, for quick experimentation (e.g. in notebooks).

Parameters
  • config – model config or path to config (default None)

  • device – model device, -1 for CPU (default None)

  • min_alias_len – minimum alias length (default 1)

  • max_alias_len – maximum alias length (default 6)

  • threshold – probability threshold (default 0.0)

  • cache_dir – cache directory (default None)

  • model_name – model name (default None)

  • entity_emb_file – entity embedding file (default None)

  • return_embs – whether to return entity embeddings or not (default False)

  • return_ctx_embs – whether to return context embeddings or not (default False)

  • extract_method – mention extraction method (default 'spacy')

  • verbose – verbose boolean (default False)
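The constructor arguments above can be collected in one place and passed through. The following is a minimal sketch that uses only parameters from the documented signature. The helper name `make_annotator` and the chosen values are illustrative assumptions, and the import is deferred so that nothing is loaded until the helper is called.

```python
# Keyword arguments mirror the documented BootlegAnnotator signature.
# The values are illustrative choices, not recommendations.
ANNOTATOR_KWARGS = {
    "device": -1,          # -1 selects CPU, per the parameter list
    "min_alias_len": 1,    # documented default
    "max_alias_len": 6,    # documented default
    "threshold": 0.3,      # probability threshold (documented default is 0.0)
    "return_embs": False,
    "verbose": False,
}


def make_annotator(**overrides):
    """Build a BootlegAnnotator (hypothetical helper, not part of Bootleg)."""
    # Deferred import: requires the bootleg package to be installed.
    from bootleg.end2end.bootleg_annotator import BootlegAnnotator

    kwargs = {**ANNOTATOR_KWARGS, **overrides}
    return BootlegAnnotator(**kwargs)
```

Calling `make_annotator(threshold=0.5)` would override a single value while keeping the rest. Constructing the annotator may be slow on first use if model data still has to be fetched into the cache directory.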

extract_mentions(text)[source]

Mention extraction wrapper.

Parameters

text – text to extract mentions from

Returns: JSON object for the sentence, to be used in evaluation

get_entity_tokens(qid)[source]

Get entity tokens.

Parameters

qid – entity QID

Returns: Dict of input tokens for forward pass.

get_forward_batch(input_ids, token_type_ids, attention_mask, entity_token_ids, entity_type_ids, entity_attention_mask, entity_cand_eid, generate_entity_inputs)[source]

Generate an Emmental batch.

Parameters
  • input_ids – word token ids

  • token_type_ids – word token type ids

  • attention_mask – word attention mask

  • entity_token_ids – entity token ids

  • entity_type_ids – entity type ids

  • entity_attention_mask – entity attention mask

  • entity_cand_eid – entity candidate eids

  • generate_entity_inputs – whether to generate entity id inputs

Returns: X_dict for Emmental

get_sentence_tokens(sample, men_idx)[source]

Get context tokens.

Parameters
  • sample – Dict sample after extraction

  • men_idx – mention index to select

Returns: Dict of tokenized outputs

label_mentions(text_list=None, extracted_examples=None)[source]

Extract mentions and run disambiguation.

If extracted_examples is provided, text_list is ignored.

Parameters
  • text_list – list of texts to disambiguate, or a single string (may be None if extracted_examples is provided)

  • extracted_examples – list of dicts with keys “sentence”, “aliases”, “spans”, “cands” (QIDs) (optional)

Returns: Dict of

  • qids: final predicted QIDs,

  • probs: final predicted probs,

  • titles: final predicted titles,

  • cands: all entity candidates,

  • cand_probs: probabilities of all candidates,

  • char_spans: final extracted char spans,

  • aliases: final extracted aliases,

  • embs: final entity contextualized embeddings (if return_embs is True)

  • cand_embs: final candidate entity contextualized embeddings (if return_embs is True)
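The returned dict can be flattened into one row per mention for inspection. The following is a sketch only: it assumes each value is a list with one inner list per input sentence, which the description above does not state explicitly. `flatten_predictions` is a hypothetical helper, and `mock_result` is a hand-written stand-in rather than real model output.

```python
def flatten_predictions(result, min_prob=0.0):
    """Return (sentence_idx, alias, qid, title, prob) rows from a result dict.

    Assumes each value holds one inner list per input sentence.
    """
    rows = []
    per_sentence = zip(
        result["aliases"], result["qids"], result["titles"], result["probs"]
    )
    for sent_idx, (aliases, qids, titles, probs) in enumerate(per_sentence):
        for alias, qid, title, prob in zip(aliases, qids, titles, probs):
            if prob >= min_prob:  # drop low-confidence predictions
                rows.append((sent_idx, alias, qid, title, prob))
    return rows


# Hand-written stand-in for a label_mentions result; probabilities are made up.
mock_result = {
    "aliases": [["lincoln"], ["paris", "france"]],
    "qids": [["Q91"], ["Q90", "Q142"]],
    "titles": [["Abraham Lincoln"], ["Paris", "France"]],
    "probs": [[0.92], [0.55, 0.97]],
}

rows = flatten_predictions(mock_result, min_prob=0.6)
for row in rows:
    print(row)
```

With a real annotator, `mock_result` would be replaced by the value returned from `label_mentions`, provided the per-sentence layout assumed here holds.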

set_threshold(value)[source]

Set threshold.

Parameters

value – threshold value

bootleg.end2end.bootleg_annotator.create_config(model_path, data_path, model_name)[source]

Create Bootleg config.

Parameters
  • model_path – model directory

  • data_path – data directory

  • model_name – model name

Returns: updated config

bootleg.end2end.bootleg_annotator.create_sources(model_path, data_path, model_name)[source]

Download Bootleg data and save it in the log dir.

Parameters
  • model_path – model directory

  • data_path – data directory

  • model_name – model name to download

bootleg.end2end.bootleg_annotator.get_default_cache()[source]

Get default cache directory for saving Bootleg data.

bootleg.end2end.extract_mentions module

Extract mentions.

This module takes a JSON Lines file of sentences and extracts aliases and spans using a precomputed alias table.

bootleg.end2end.extract_mentions.chunk_text_data(input_src, chunk_files, chunk_size, num_lines)[source]

Split the input text file into chunks of chunk_size lines.

Parameters
  • input_src – input file

  • chunk_files – list of chunk file names

  • chunk_size – chunk size in number of lines

  • num_lines – total number of lines
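Line-based chunking of this kind can be sketched as follows. This illustrates the general technique and is not Bootleg's implementation: `chunk_lines` is a hypothetical helper with a simplified signature, and the file names are made up.

```python
import os
import tempfile


def chunk_lines(input_path, chunk_paths, chunk_size):
    """Write consecutive groups of chunk_size lines to the given chunk files."""
    with open(input_path, encoding="utf-8") as src:
        for chunk_path in chunk_paths:
            with open(chunk_path, "w", encoding="utf-8") as out:
                for _ in range(chunk_size):
                    line = src.readline()
                    if not line:  # input exhausted
                        break
                    out.write(line)


# Build a small JSON Lines input file with five sentences.
tmp_dir = tempfile.mkdtemp()
input_path = os.path.join(tmp_dir, "input.jsonl")
with open(input_path, "w", encoding="utf-8") as f:
    f.writelines(f'{{"sentence": "sentence {i}"}}\n' for i in range(5))

num_lines, chunk_size = 5, 2
num_chunks = -(-num_lines // chunk_size)  # ceiling division
chunk_paths = [
    os.path.join(tmp_dir, f"chunk_{i}.jsonl") for i in range(num_chunks)
]
chunk_lines(input_path, chunk_paths, chunk_size)

# Count the lines that ended up in each chunk file.
chunk_sizes = []
for path in chunk_paths:
    with open(path, encoding="utf-8") as f:
        chunk_sizes.append(sum(1 for _ in f))
print(chunk_sizes)
```

Knowing the total line count up front is what lets the caller decide how many chunk files to create before any data is written.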

bootleg.end2end.extract_mentions.create_out_line(sent_obj, final_aliases, final_spans, found_char_spans)[source]

Create JSON output line.

Parameters
  • sent_obj – input sentence JSON

  • final_aliases – list of final aliases

  • final_spans – list of final spans

  • found_char_spans – list of final char spans

Returns: JSON object

bootleg.end2end.extract_mentions.extract_mentions(in_filepath, out_filepath, entity_db_dir, extract_method='ngram_spacy', min_alias_len=1, max_alias_len=6, num_workers=8, num_chunks=None, verbose=False)[source]

Extract mentions from file.

Parameters
  • in_filepath – input file

  • out_filepath – output file

  • entity_db_dir – path to entity db

  • extract_method – mention extraction method

  • min_alias_len – minimum alias length (in words)

  • max_alias_len – maximum alias length (in words)

  • num_workers – number of multiprocessing workers

  • num_chunks – number of subchunks to feed to workers

  • verbose – verbose boolean
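The role of min_alias_len and max_alias_len can be illustrated with a simplified n-gram lookup against an alias table. This sketch is not Bootleg's extraction method: it tokenizes on whitespace, ignores punctuation, and uses a plain Python set as a stand-in for the alias table. `find_aliases` and the table contents are hypothetical.

```python
def find_aliases(sentence, alias_table, min_alias_len=1, max_alias_len=6):
    """Return (aliases, spans); spans are [start, end) word indices."""
    words = sentence.split()
    aliases, spans = [], []
    i = 0
    while i < len(words):
        match_len = 0
        # Try the longest n-gram first so "new york city" beats "new york".
        longest = min(max_alias_len, len(words) - i)
        for n in range(longest, min_alias_len - 1, -1):
            candidate = " ".join(words[i:i + n]).lower()
            if candidate in alias_table:
                aliases.append(candidate)
                spans.append([i, i + n])
                match_len = n
                break
        # Skip past a match so mentions do not overlap; else advance one word.
        i += match_len if match_len else 1
    return aliases, spans


# Toy alias table; a real one would be loaded from the entity database.
alias_table = {"new york", "new york city", "lincoln"}
aliases, spans = find_aliases(
    "Lincoln visited New York City in 1861", alias_table
)
print(aliases, spans)
```

Raising min_alias_len to 2 in this sketch would drop the single-word match, and lowering max_alias_len to 2 would match "new york" instead of the three-word alias.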

bootleg.end2end.extract_mentions.main()[source]

Run.

bootleg.end2end.extract_mentions.merge_files(chunk_outfiles, out_filepath)[source]

Merge output files.

Parameters
  • chunk_outfiles – list of chunk files

  • out_filepath – final output file path
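Merging per-chunk outputs back into one file amounts to concatenating them in order. A minimal sketch of that idea follows; `merge_chunks` is a hypothetical helper, not Bootleg's implementation, and the chunk contents are made up.

```python
import os
import shutil
import tempfile


def merge_chunks(chunk_paths, out_path):
    """Concatenate chunk files, in the order given, into out_path."""
    with open(out_path, "wb") as out:
        for chunk_path in chunk_paths:
            with open(chunk_path, "rb") as src:
                shutil.copyfileobj(src, out)  # streams without loading whole file


# Create two small chunk files to merge.
tmp_dir = tempfile.mkdtemp()
chunk_contents = [
    ['{"sentence": "a"}\n', '{"sentence": "b"}\n'],
    ['{"sentence": "c"}\n'],
]
chunk_paths = []
for i, lines in enumerate(chunk_contents):
    path = os.path.join(tmp_dir, f"chunk_out_{i}.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(lines)
    chunk_paths.append(path)

merged_path = os.path.join(tmp_dir, "merged.jsonl")
merge_chunks(chunk_paths, merged_path)

with open(merged_path, encoding="utf-8") as f:
    merged_lines = f.read().splitlines()
print(merged_lines)
```

Passing the chunk files in their original order keeps the merged output aligned with the input sentences.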

bootleg.end2end.extract_mentions.parse_args()[source]

Generate args.

bootleg.end2end.extract_mentions.subprocess(args)[source]

Extract mentions in a single process.

Parameters

args – subprocess args

Module contents

End2End init.