Adapters: Faster Prediction with Fewer Parameters

Training machine learning models is hard, and training models that perform well is one of the top priorities of any data scientist. However, an often overlooked fact is that once a model is trained, deploying it at scale for inference can be another huge challenge in itself! Consider some of the problems that models created with finetune, indico’s Python library for finetuning, face once deployed to production:

Large Files: Saved models can exceed 250MB, even after float precision reduction. These large files are difficult and expensive to store, and they have to be transferred over the network for use in production.

Latency: Before a model can be used, all of its parameters must be loaded into memory, which forces TensorFlow to perform a slow graph recompilation. Combined with transfer and loading of large files noted above, these factors contribute to up to 20 seconds of overhead when trying to predict using a finetuned model!

Caching Issues: We want our deployed models to be able to handle requests quickly, and that means parameters have to remain cached in memory for each model. Since switching out models automatically triggers graph recompilation, each model has to remain cached on a separate processor. Serving a dozen different models would require a dozen GPUs running constantly, which is prohibitively expensive, so inference requests have to be handled with less performant CPUs, further exacerbating latency and throughput issues.

Parameter inefficiency: Even if finetuned models use the same original featurizer, such as BERT or GPT, nearly every weight in each model is changed during training. This means that the vast majority of parameters in a model have to be saved, stored, transferred, and loaded – even if they only differ slightly from their starting point.

Given these problems, it would be very convenient to have a method of finetuning models that trains only some of the weights. That way, saved files would be smaller and loading times faster, since only the small number of weights changed during training would have to be handled.

One solution to this problem might be to train only a subset of the layers of the Transformer. However, this approach still runs into graph recompilation, and in practice it significantly degrades prediction quality. Fortunately, a more elegant solution is possible with finetune’s new DeploymentModel, which leverages the adapters idea from Parameter-Efficient Transfer Learning for NLP along with some clever software engineering to speed up model inference and reduce saved model sizes at a minimal cost to model accuracy.

The adapter is a small block of feedforward layers that are mixed into each layer of the Transformer architecture. In our case, it downprojects its input to dimension 64 by default, before reprojecting to its original size. As described in the paper, the adapter uses a skip-connection so it can initialize to identity before training; otherwise, the model will not converge.
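To make the shape of the block concrete, here is a minimal sketch of a single adapter in plain Python. It is an illustration under stated assumptions, not finetune's implementation: the GELU nonlinearity, the 1e-3 initialization scale, and the names init_adapter and adapter_forward are all hypothetical choices made for this example. It shows the downprojection to 64 dimensions, the projection back up, and the skip-connection that keeps the block close to the identity at initialization.

```python
import math
import random


def gelu(v):
    # Tanh approximation of GELU; the choice of nonlinearity is an assumption.
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))


def init_adapter(hidden_size, adapter_size=64, scale=1e-3, seed=0):
    # Small random weights and zero biases, so the adapter's contribution
    # starts near zero and the block behaves almost like the identity.
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

    return {
        "w_down": matrix(hidden_size, adapter_size),
        "b_down": [0.0] * adapter_size,
        "w_up": matrix(adapter_size, hidden_size),
        "b_up": [0.0] * hidden_size,
    }


def linear(x, w, b):
    # x has len(w) entries; w is (inputs x outputs); b has one entry per output.
    return [sum(x[i] * w[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]


def adapter_forward(x, params):
    # Downproject, apply the nonlinearity, project back up, then add the input.
    h = [gelu(v) for v in linear(x, params["w_down"], params["b_down"])]
    up = linear(h, params["w_up"], params["b_up"])
    return [xi + ui for xi, ui in zip(x, up)]


hidden = [0.5] * 768  # a stand-in for one token's hidden state
out = adapter_forward(hidden, init_adapter(hidden_size=768))
```

Because the projection weights start out tiny, out is almost identical to hidden, so inserting the block into a pretrained network does not disturb its behavior before training begins.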

The adapter architecture, and its place within the Transformer layer. Figure from Houlsby et al.

During finetuning, we modify only the adapter blocks and layer normalization weights – since they have very few parameters – leaving everything else unchanged. Amazingly, the shift from finetuning the entire model to finetuning with orders of magnitude fewer parameters causes nearly no harm to accuracy:

Adapter performance compared with traditional finetuning across a number of tasks.
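A rough count helps show why the trainable set is so small. The sketch below is back-of-the-envelope arithmetic, not a measurement of any finetune model: it assumes a hidden size of 768, 12 Transformer layers, a feedforward width of four times the hidden size, and two adapters plus two layer normalizations per layer. The function names are hypothetical.

```python
def adapter_param_count(hidden_size, adapter_size=64):
    # Down-projection weights and bias, plus up-projection weights and bias.
    down = hidden_size * adapter_size + adapter_size
    up = adapter_size * hidden_size + hidden_size
    return down + up


def trainable_params(hidden_size, num_layers, adapter_size=64,
                     adapters_per_layer=2, layer_norms_per_layer=2):
    # Only adapters and layer normalization (gain and bias) are trained.
    adapters = num_layers * adapters_per_layer * adapter_param_count(hidden_size, adapter_size)
    layer_norms = num_layers * layer_norms_per_layer * 2 * hidden_size
    return adapters + layer_norms


def dense_layer_params(hidden_size, ff_multiplier=4):
    # Attention: query, key, value and output projections, each with a bias.
    attention = 4 * (hidden_size * hidden_size + hidden_size)
    # Feedforward: hidden -> ff_multiplier * hidden -> hidden, with biases.
    ff_size = ff_multiplier * hidden_size
    feedforward = hidden_size * ff_size + ff_size + ff_size * hidden_size + hidden_size
    return attention + feedforward


trained = trainable_params(hidden_size=768, num_layers=12)
frozen = 12 * dense_layer_params(hidden_size=768)
fraction = trained / frozen
```

Under these assumptions, about 2.4 million weights are trained while roughly 85 million weights in the Transformer layers stay frozen, before even counting the embeddings.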

When combined with our custom DeploymentModel, adapters provide numerous other advantages. Since we only need to track the changed weights, saved model files become much smaller. The DeploymentModel leverages this fact by holding the large static featurizer loaded in memory, while selectively switching out adapter and layernorm weights whenever inference with a different model is desired. This method preserves the featurizer graph, which bypasses TensorFlow’s expensive recompilation. Thus, we only have to deal with the overhead of graph compilation once, and we can then amortize that cost over multiple prediction runs with multiple models.

Let’s check out how anyone can take advantage of adapters and the DeploymentModel using finetune. We train a simple classifier, making sure to enable adapters in its configuration:

from finetune import Classifier
from finetune.base_models import GPTModel

# X and Y are assumed to be already defined: training texts and their labels
model = Classifier(adapter_size=64, base_model=GPTModel)
model.fit(X, Y)
model.save('classifier_using_adapters.jl')

The DeploymentModel allows us to use the weights from our trained classifier for fast loading and prediction. Note that we specify the base model used and load the featurizer before loading in the custom model, to incur the one-time overhead of graph compilation up front:

from finetune import DeploymentModel
from finetune.base_models import GPTModel

deployment_model = DeploymentModel(featurizer=GPTModel)
deployment_model.load_featurizer()  # one-time graph compilation cost
deployment_model.load_custom_model('classifier_using_adapters.jl')
preds = deployment_model.predict(X)

As described previously, the DeploymentModel can also swap out weights without requiring a reload of its featurizer. See this in the example below, assuming there is also a previously trained regressor on file:

from finetune import DeploymentModel
from finetune.base_models import GPTModel

deployment_model = DeploymentModel(featurizer=GPTModel)

# Load the base featurizer and incur a one-time cost
deployment_model.load_featurizer()

# Quickly load a target model and corresponding adapter weights
deployment_model.load_custom_model('classifier_using_adapters.jl')
classifier_preds = deployment_model.predict(classifier_X)

# Quickly swap in the target model and adapter weights of another model
deployment_model.load_custom_model('regressor_using_adapters.jl')
regressor_preds = deployment_model.predict(regressor_X)

With the DeploymentModel, the time from loading to the end of predicting is a mere 2 seconds (compared to over 20 seconds before), and it works with a file that is nearly 10x smaller!

The DeploymentModel works by splitting the model into two separate graphs, using two estimators from the TensorFlow Estimator API: one estimator serves the large featurizer, and the second loads the much smaller target model that is customized for each task. The target estimator is reloaded on each call to predict, but its overhead is trivial due to its size. The key advantage is that the featurizer estimator remains cached between all calls to predict and load_custom_model, and only the necessary subset of its weights is edited when a new model is loaded. The source code is available in the finetune repository.
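The caching idea can be illustrated with a toy, framework-free sketch. This is not finetune's code and involves no TensorFlow estimators: the ToyDeployment class, its attributes, and the weight names are hypothetical, and the method names merely mirror the calls shown earlier. It only demonstrates the bookkeeping, in which the large base weights are loaded once and each task contributes a small set of adapter and layer normalization weights that is swapped in on demand.

```python
class ToyDeployment:
    """Toy stand-in: caches large base weights, swaps small per-task weights."""

    def __init__(self):
        self.featurizer_weights = None
        self.task_weights = {}
        self.featurizer_loads = 0  # counts how often the expensive step runs

    def load_featurizer(self, base_weights):
        # The expensive step; in this scheme it should happen only once.
        self.featurizer_weights = dict(base_weights)
        self.featurizer_loads += 1

    def load_custom_model(self, saved_task_weights):
        if self.featurizer_weights is None:
            raise RuntimeError("call load_featurizer() first")
        # The cheap step: replace only the small task-specific weights.
        self.task_weights = dict(saved_task_weights)

    def effective_weights(self):
        # Task weights add adapter entries and override layer norm entries.
        merged = dict(self.featurizer_weights)
        merged.update(self.task_weights)
        return merged


deployment = ToyDeployment()
deployment.load_featurizer({"block0/attention": 1.0, "block0/layer_norm": 1.0})

# Two tasks take turns using the same cached featurizer.
deployment.load_custom_model({"block0/adapter": 0.1, "block0/layer_norm": 1.2})
classifier_view = deployment.effective_weights()
deployment.load_custom_model({"block0/adapter": -0.3, "block0/layer_norm": 0.9})
regressor_view = deployment.effective_weights()
```

In this toy, featurizer_loads stays at 1 no matter how many task models are swapped in, which is the property that lets the compilation cost be amortized across models.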

With customizable adapter sizes and support for several base models, prediction is more efficient with no loss of freedom. By bundling an academic advance with some clever software engineering, production model finetuning has the potential to be cleaner, faster and dramatically more practical than before.
