Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog and this project adheres to Semantic Versioning 2.0.0 conventions. The maintainers will create a git tag for each release and increment the version number found in bootleg/_version.py accordingly. We release tagged versions to PyPI automatically using GitHub Actions.

Note

Bootleg is still under active development and APIs may still change rapidly. Until we release v1.0.0, a change in the MINOR version indicates backward-incompatible changes.

Unreleased 1.1.1dev0

Fixed

  • Corrected training with train_in_candidates set to False.

1.1.0 - 2022-04-12

Changed

  • We made an architectural change and switched to a biencoder model. This changes our task flow and data prep. The new model uses less CPU storage and uses the standard BERT architecture. Our entity encoder now takes a textual input for each entity that contains its title, description, KG relationships, and types.

  • To support dumping predictions over larger files, we support adding an entity_emb_file to the model (extracted from extract_all_entities.py). This will make evaluation faster. Further, we added dump_preds_num_data_splits to split a file before dumping. As each file pass gets a new dataloader object, this can mitigate any torch dataloader memory issues that happen over large files.

  • Renamed eval_accumulation_steps to dump_preds_accumulation_steps.

  • Removed option to dump_embs. Users should use dump_preds instead. The output file will have entity_ids attribute that will index into the extracted entity embeddings.

  • Restructured our entity_db data for faster loading. It uses tries rather than JSON files to store the data for read-only mode. The KG relations are not backwards compatible.

  • Moved to character spans for input data. Added utils.preprocessing.convert_to_char_spans as a helper function to convert from word offsets to character offsets.
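
As a rough illustration of the word-offset to character-offset conversion mentioned in the last entry, the sketch below converts word spans into character spans. It is not the implementation of utils.preprocessing.convert_to_char_spans; the function name word_spans_to_char_spans, its signature, whitespace tokenization, and end-exclusive spans are all assumptions made for this example.

```python
from typing import List, Tuple


def word_spans_to_char_spans(
    sentence: str, word_spans: List[Tuple[int, int]]
) -> List[Tuple[int, int]]:
    """Convert [start, end) word offsets to [start, end) character offsets.

    Hypothetical helper: assumes whitespace-delimited words and
    end-exclusive spans, which may differ from Bootleg's own helper.
    """
    # Record the character start and end of every whitespace-delimited word.
    starts, ends = [], []
    pos = 0
    for word in sentence.split():
        begin = sentence.index(word, pos)  # search from the previous word's end
        starts.append(begin)
        pos = begin + len(word)
        ends.append(pos)
    # A word span (s, e) covers words s..e-1, so it ends where word e-1 ends.
    return [(starts[s], ends[e - 1]) for s, e in word_spans]


sentence = "Lincoln was born in Kentucky"
char_spans = word_spans_to_char_spans(sentence, [(0, 1), (4, 5)])
print(char_spans)  # [(0, 7), (20, 28)]
```

Slicing the sentence with a returned span, for example sentence[20:28], gives back the mention text, which is a quick way to sanity-check converted data.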

Added

  • BOOTLEG_STRIP and BOOTLEG_LOWER environment variables for get_lnrm.

  • extract_all_entities.py as a way to extract all entity embeddings. These entity embeddings can be used in eval and downstream. Users can use get_eid from the EntityProfile to extract the row id for a specific entity.

1.0.5 - 2021-08-20

Fixed

  • Fixed -1 command line argparse error

  • Adjusted requirements

1.0.4 - 2021-07-12

Added

  • Tutorial to generate contextualized entity embeddings that perform better downstream

Fixed

  • Bump version of Pydantic to 1.7.4

1.0.3 - 2021-06-29

Fixed

  • Corrected how custom candidates were handled in the BootlegAnnotator when using extracted_examples

  • Fixed memory leak in BootlegAnnotator due to missing torch.no_grad()

1.0.2 - 2021-04-28

Added

  • Support for min_alias_len to extract_mentions and the BootlegAnnotator.

  • return_embs flag to pass into BootlegAnnotator that will return the contextualized embeddings of the entity (using key embs) and entity candidates (using key cand_embs).

Changed

  • Removed condition that aliases for eval must appear in candidate lists. We now allow for eval to not have known aliases and always mark these as incorrect. When dumping predictions, these get “-1” candidates and null probabilities.

Fixed

  • Corrected fit_to_profile to rebuild the title embeddings for the new entities.

1.0.1 - 2021-03-22

Note

If upgrading to 1.0.1 from 1.0.0, you will need to re-download our models given the links in the README.md. We altered what keys were saved in the state dict, but the model weights are unchanged.

Added

  • data_config.print_examples_prep flag to toggle data example printing during data prep.

  • data_config.dump_preds_accumulation_steps to support subbatching when dumping predictions. We save outputs to separate files of size approximately data_config.dump_preds_accumulation_steps*data_config.eval_batch_size and merge them into a final file at the end.

  • Entity Profile API. See the docs. This allows for modifying entity metadata as well as adding and removing entities. We provide methods for refitting a model with a new profile for immediate inference, no finetuning needed.
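
To make the file-size arithmetic in the data_config.dump_preds_accumulation_steps entry concrete, here is a small sketch that estimates how many intermediate files a dump would produce. The function estimate_dump_files and the example numbers are hypothetical, and the result is only an estimate, since the entry describes the file size as approximate.

```python
import math
from typing import Tuple


def estimate_dump_files(
    num_examples: int, eval_batch_size: int, dump_preds_accumulation_steps: int
) -> Tuple[int, int]:
    """Return (approximate examples per file, approximate number of files).

    Hypothetical helper for illustration only; not part of Bootleg.
    """
    # Each intermediate file holds roughly accumulation_steps batches.
    examples_per_file = dump_preds_accumulation_steps * eval_batch_size
    num_files = math.ceil(num_examples / examples_per_file)
    return examples_per_file, num_files


# Assumed values: 100,000 examples, batch size 32, 1,000 accumulation steps.
print(estimate_dump_files(100_000, 32, 1_000))  # (32000, 4)
```

Under this reading, a smaller dump_preds_accumulation_steps gives more, smaller intermediate files before the final merge.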

Changed

  • Support for not using multiprocessing if the user sets data_config.dataset_threads to 1.

  • Added better argument parsing to check for arguments that were misspelled or otherwise wouldn’t trigger anything.

  • Code is now Flake8 compatible.

Fixed

  • Fixed readthedocs so the BootlegAnnotator was loaded correctly.

  • Fixed logging in BootlegAnnotator.

  • Fixed use_exact_path argument in Emmental.

1.0.0 - 2021-02-15

We did a major rewrite of our entire codebase and moved to using Emmental for training. Emmental allows for easy multi-task training, FP16, and support for both DataParallel and DistributedDataParallel.

The overall functionality of Bootleg remains unchanged. We still support the use of an annotator and bulk mention extraction and evaluation. The core Bootleg model has remained largely unchanged. Check out our documentation for more information on getting started. We have new models trained as described in our README.

Note

This branch is not backwards compatible with our old models or codebase.

Some more subtle changes are below:

Added

  • Support for data parallel and distributed data parallel training (through Emmental)

  • FP16 (through Emmental)

  • Easy install with BootlegAnnotator

Changed

  • Mention extraction code and alias map have been updated

  • Models trained on October 2020 save of Wikipedia

  • Have uncased and cased models

Removed

  • Support for slice-based learning

  • Support for batch prepped KG embeddings (only use batch on the fly)