Enso: An Open Source Library for Benchmarking Embeddings + Transfer Learning Methods
June 26, 2018 / Announcements, Data Science, indico, Machine Learning
Because Indico has benefited so much from hard work in the open-source community, we like to make sure a portion of our time is spent giving back. As part of this initiative, we’re releasing Enso, an open-source Python library for benchmarking document embeddings and transfer learning methods.
Enso was created in part to help measure and address industry-wide overfitting to a small number of academic datasets. There needs to be a simple way to separate generic progress from advances that take advantage of dataset-specific features. The former is “publication-worthy”; the latter may not be — but it’s often hard to distinguish between the two because papers fail to evaluate across a broad enough domain. The number of classes, number of training examples, level of class imbalance, average document length, and other dataset attributes can have an enormous influence on the viability of an approach, and evaluation across a broad range of tasks helps ascertain where a given method is appropriate and where it is not. Through Enso, we hope to make evaluating across a broad range of datasets painless, in an effort to make this practice more common.
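To make the dataset attributes above concrete, here is a small sketch of how one might summarize them for a corpus. The `dataset_stats` helper is hypothetical (not part of Enso); it just computes the statistics the paragraph mentions:

```python
from collections import Counter

def dataset_stats(texts, labels):
    """Summarize dataset attributes that can affect how well a method transfers.

    `texts` is a list of documents, `labels` their class labels.
    (Hypothetical helper for illustration -- not Enso's API.)
    """
    counts = Counter(labels)
    n = len(texts)
    return {
        "num_examples": n,
        "num_classes": len(counts),
        # Ratio of largest class to smallest class: 1.0 means perfectly balanced.
        "class_imbalance": max(counts.values()) / min(counts.values()),
        # Mean whitespace-token count per document.
        "avg_doc_length": sum(len(t.split()) for t in texts) / n,
    }

# Toy example
texts = ["the cat sat", "dogs bark loudly at night", "cats purr", "a dog"]
labels = ["cat", "dog", "cat", "dog"]
stats = dataset_stats(texts, labels)
```

Two corpora with the same task can score very differently on these axes, which is exactly why single-dataset evaluations can mislead.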
In addition to providing a framework for benchmarking embedding quality, we’ve also included 24 open-source datasets for you to use in your own experiments.
Enso is compatible with Python 3.4+.
You can install enso via
pip install enso
or directly via setup.py:
git clone git@github.com:IndicoDataSolutions/Enso.git
cd Enso
python setup.py install
After installation, you’ll probably also want to download the provided datasets:
python3 -m enso.download
Usage and Workflow
Although there are other effective approaches to applying transfer learning to natural language processing, the current version of Enso assumes that “transfer learning” adheres to the flow listed below. This approach is designed to replicate a scenario where a pool of unlabeled data is available, and labelers with subject matter expertise have a limited amount of time to provide labels for a subset of the unlabeled data.
- All examples in the dataset are “featurized” via a pre-trained source model (python -m enso.featurize)
- Re-represented data is separated into train and test sets
- A fixed number of examples from the train set is selected to use as training data via the selected sampling strategy
- The training data subset is optionally over or under-sampled to account for variation in class balance
- A target model is trained using the featurized training examples as inputs (python -m enso.experiment)
- The target model is benchmarked on all featurized test examples
- The process is repeated for all combinations of featurizers, dataset sizes, target model architectures, etc.
- Results are visualized and manually inspected (python -m enso.visualize)
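The steps above can be sketched end-to-end in a few lines. This is a toy illustration, not Enso’s API: the fake Gaussian “embeddings”, the nearest-centroid target model, and the random sampling strategy are all stand-ins for the real featurizers, experiments, and samplers Enso provides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: "featurize" -- fake pre-trained embeddings as class-dependent
# Gaussian blobs (a stand-in for a real source model's outputs).
n_per_class, dim = 100, 16
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(n_per_class, dim))
               for c in (0.0, 2.0)])
y = np.repeat([0, 1], n_per_class)

# Step 2: split the featurized data into train and test sets.
idx = rng.permutation(len(X))
train_idx, test_idx = idx[:150], idx[150:]

# Step 3: select a fixed number of training examples via a sampling
# strategy (plain random sampling here).
train_size = 40
sampled = rng.choice(train_idx, size=train_size, replace=False)

# Step 4: train a target model on the sampled embeddings -- a simple
# nearest-centroid classifier standing in for Enso's target models.
centroids = np.array([X[sampled][y[sampled] == c].mean(axis=0)
                      for c in (0, 1)])

# Step 5: benchmark the target model on all featurized test examples.
dists = np.linalg.norm(X[test_idx][:, None, :] - centroids[None, :, :], axis=2)
preds = dists.argmin(axis=1)
accuracy = (preds == y[test_idx]).mean()
```

In Enso itself, this loop is repeated over every combination of featurizer, training-set size, sampler, and target model named in your configuration, with results written out for the visualization step.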
Full documentation and configuration information is available at enso.readthedocs.org.
Currently, Enso is limited to benchmarking tasks that rely on static representations output by pre-trained models. We’d eventually like to extend Enso to support benchmarking for model fine-tuning approaches as well. We’re in the process of incorporating the model fine-tuning work of our advisor, Alec Radford, into Enso, so stay tuned!
In addition to supporting the model fine-tuning workflows, we’d also like to add support for benchmarking tasks other than classification (comparison, multiple-choice, textual-entailment, etc.) and a broader range of input types (image, audio).
We’ve used Enso internally to test whether adding new optimizers or new embeddings to the Indico platform is worthwhile, and we hope Enso also enables others to test which methods are good fits for industry applications.
If you’d like to help add any of this functionality or are looking for a machine learning project to hack on, check out the Enso wishlist or reach out to <firstname.lastname@example.org> for more information.