Exploiting Text Embeddings for Industry Contexts

From Synonyms to Object Properties

It’s well known that word embeddings are excellent for finding similarities between words, and synonyms in particular. We achieve this using supervised machine learning techniques by showing a neural net a dataset of hundreds of millions of pieces of text. The algorithm looks at the contexts in which particular words appear and how often they appear together, and maps the words to vectors in a multi-dimensional space (these vectors are what we refer to as word embeddings).
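To make "similar words map to nearby vectors" concrete, here is a minimal sketch of a cosine-similarity lookup. The three-dimensional vectors are invented for illustration only; real embeddings have hundreds of dimensions and come from a trained model, not from a hand-written dictionary.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, made up for this example (not real embeddings).
toy_embeddings = {
    "big": [0.9, 0.1, 0.0],
    "large": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}


def most_similar(word, embeddings):
    """Return every other word, sorted from most to least similar."""
    target = embeddings[word]
    others = [w for w in embeddings if w != word]
    return sorted(
        others,
        key=lambda w: cosine_similarity(target, embeddings[w]),
        reverse=True,
    )


print(most_similar("big", toy_embeddings))
```

With these toy values, "large" ranks above "banana" for the query "big", which is the behavior we rely on when we say embeddings find synonyms.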

A couple of weeks ago, our CTO Slater Victoroff gave a talk on “Using and abusing text embeddings” for the recent AI with the Best conference, where he presented an interesting and surprising finding. Not only do word embeddings successfully find synonyms, they also appear to pick up on object properties. For instance, “grass” and “green” seem to map closely to one another, as do “ocean” and “blue”, “obsidian” and “black”, and “sulfur” and “yellow”. These connections are surprising given the broad range of contexts in the training dataset: we’d expect that the model could only gain this kind of information through rare events like metaphors and analogies. It’s intriguing that the algorithm still managed to learn these relationships, but it’s not clear to us how or why they were learned, or how they’re represented in the underlying neural network.

From a practical perspective, however, we don’t necessarily need to understand that structure. If these general embeddings can actually pick up on object properties or definitions as “weak synonyms” (and it’s not just an odd statistical anomaly), we expect that retraining our embeddings on domain-specific text should deliver a similar result for industry-focused cases. We could then use them to build more robust fuzzy search tools or chatbots for specific industries.

So, let’s test our hypothesis. As part of his presentation, Slater wrote a script that allows you to play Twenty Questions with your machine using indico’s general text feature embeddings. Let’s tweak it to work with our finance-specific text embeddings. (Or, skip ahead to the Results section.)

Updating the Code

There are only a few pieces of the code that we need to change here. Instead of importing the NOUNS library, we’ll just add a small list of financial terms to the featurize_nouns function.

finance_nouns = [
    "accounts payable", "accounts receivable", "accrual basis", "amortization", "arbitrage", "asset", "bankruptcy", "bond", "boom", "capital",
    "cash basis", "certificate of deposit", "commodity", "cost of capital", "cumulative", "debt", "deficit", "depreciation", "dividend", "economy", "equity",
    "exchange traded fund", "fiduciary", "fund", "gross domestic product", "growth stock", "hedge fund", "internal revenue", "intrinsic", "invest", "invoice",
    "leverage", "liability", "margin account", "margin call", "money market", "mortgage", "mutual fund", "paycheck", "portfolio", "premium", "profit",
    "real estate", "recession", "return", "revenue", "savings", "short selling", "stock", "trade", "Treasury bill", "treasury stock", "value", "volatility", "wager",
]

Underneath that, specify that you want to use indico’s finance feature embeddings by updating vectorize for the features variable.

features = vectorize(finance_nouns, domain="finance", batch_size=200)

You will need to do the same for the answer_features variable in the ask_question function.

answer_features = vectorize(options[int(response)], domain="finance")

Aaaand once again in the closest_difference function.

features = vectorize(list(args), domain="finance")

Now, change all instances of NOUNS across the script to finance_nouns.
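Since the same domain argument is repeated at three call sites, one option is to bind it once. The sketch below uses functools.partial for that. The vectorize function here is only a stand-in so the example runs without an API key: its body, its "standard" default, and the finance_vectorize name are assumptions for illustration, not the script's actual helper.

```python
from functools import partial


def vectorize(texts, domain="standard", batch_size=200):
    """Stand-in for the script's helper (hypothetical body).

    Returns fake 'features' that simply record which domain was requested.
    batch_size is accepted but unused in this stand-in.
    """
    if isinstance(texts, str):
        texts = [texts]
    return [(domain, text) for text in texts]


# Bind the domain once so every call site uses the finance embeddings.
finance_vectorize = partial(vectorize, domain="finance")

features = finance_vectorize(["bond", "dividend"], batch_size=200)
answer_features = finance_vectorize("gain")

print(features)
print(answer_features)
```

The trade-off is one more name to keep track of, in exchange for a single place to change if you later switch to a different domain.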

You’re almost there! The last thing to change is the set of keywords that the program offers as answer options for each question. These are the keywords we used:

twenty_questions([
    ["macro", "micro"],
    ["accounting", "investing", "economics"],
    ["gain", "loss"]
])
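For intuition about how answers like these can narrow down a guess, here is one possible scoring scheme: rank each candidate term by its total cosine similarity to the keywords chosen so far. This is a sketch under our own assumptions, not a reconstruction of Slater’s script. The rank_guesses function and the three-dimensional toy vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_guesses(candidates, chosen_keywords, embed, top_n=5):
    """Rank candidates by summed similarity to every chosen keyword."""
    def score(term):
        return sum(cosine(embed[term], embed[k]) for k in chosen_keywords)
    return sorted(candidates, key=score, reverse=True)[:top_n]


# Toy vectors, invented for illustration.
# Axes are roughly: accounting-ness, investing-ness, gain (+) vs. loss (-).
toy = {
    "accounting": [1.0, 0.0, 0.0],
    "investing": [0.0, 1.0, 0.0],
    "gain": [0.0, 0.0, 1.0],
    "revenue": [0.8, 0.1, 0.6],
    "invoice": [0.9, 0.0, 0.1],
    "dividend": [0.1, 0.8, 0.6],
    "debt": [0.7, 0.1, -0.6],
}

candidates = ["revenue", "invoice", "dividend", "debt"]
print(rank_guesses(candidates, ["accounting", "gain"], toy, top_n=2))
```

Summing similarities rewards terms that are close to all of the chosen keywords at once, so a term that matches only one answer strongly can still be outranked by one that matches every answer moderately well.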

Results

When we run the program, we can see that the embeddings do pick up on “object properties” for these more abstract finance domain-specific terms. For instance, selecting “accounting”, “macro”, and “gain” results in the following guesses: internal revenue, asset, accounts receivable, margin account, and revenue.

That’s pretty good — “internal revenue” and “asset” are both macro accounting terms (the only ones in the list, in fact). “Internal revenue” doesn’t really relate to “gain”, but perhaps the existence of “revenue” in the term moved it closer to that point. “Accounts receivable”, “margin account”, and “revenue” are micro accounting terms, but that’s because we ran out of macro accounting terms to choose from.

Note that our embeddings were trained on quarterly financial reports, which is why they perform better on accounting terminology than on investing and economics concepts. It’s fairly easy to update the embeddings to deal better with investing and economics nomenclature, though. Using Custom Collection and CrowdLabel, you can customize indico’s finance embeddings by retraining them with labeled investment documents to better suit the task at hand.

Once trained, the embeddings can be used in fuzzy search tools, FAQ bots, and numerous other applications. If you need help with training or want to learn more, feel free to reach out at contact@indico.io!
