In speech and writing, how often do we use one term, and only that term, to describe an idea? For example, if you were searching a document for information about a business’s current assets, looking up only “current assets” would mean missing anything that discusses cash, short-term assets, receivables, inventory, or prepaid expenses. Yet in too many of our search interactions today, searching for information is limited to keyword lookups. Some newer techniques augment strict keyword-based approaches by automatically including synonyms from pre-built dictionaries. While this can pay dividends, it can be brittle and isn’t as comprehensive as concept-based search. Effective concept-based search, which accounts not just for synonyms but also for context in language, can lead to a very different search experience. Imagine a scenario where a search for “wealth inequality” also draws hits such as “the gap between the rich and the poor”, “unfair distribution of wealth”, “income inequality”, and so forth.

Pure keyword-to-keyword search is unintuitive to human speech and expression. It limits us — and with today’s deep learning capabilities, it’s a limitation that we can avoid.

Remember when, a few months ago, a judge ordered the then-nominee for the EPA, Scott Pruitt, to release thousands of emails so that the CMD watchdog organization could inspect them for ties to fossil fuel companies? Nearly 7,000 pages of emails were handed over, and the following day, CMD revealed that Pruitt was indeed friendly with various fossil fuel businesses. Now, we don’t know whether CMD used a keyword search to find the relevant documents, but if that was all they did, they would have had to brainstorm every possible keyword and its variations, and they still would have fallen short of the complete set of relevant results.

With a well-built concept search system, entering “fossil fuels” into the search bar should return not only all mentions of “fossil fuels”, but also anything related to oil and gas, from fracking plans to oil companies.

For legal aides who spend days poring through hundreds of documents, emails, and other content to discover useful evidence, such a system would save a significant amount of time and surface more of the relevant material. It is applicable to other industries too, from finance to medicine: anything that requires combing through reams of text.

So, how does fuzzy search work?

indico’s Text Features API creates rich feature vector representations for a given text input, learned using deep learning techniques. These feature vectors (numerical representations in multi-dimensional space) are a computer’s way of assigning meaning to language. We can use these representations to calculate the similarity of concepts between each sentence and a search query. This is why you can search a broad concept like “free market” and get results about money, competition, and demand.
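To make the idea of comparing feature vectors concrete, here is a minimal sketch in plain Python. It is not indico’s implementation: cosine similarity is assumed as the similarity measure, the function names are hypothetical, and the three-dimensional vectors are invented for illustration (real feature vectors would come from a model and have far more dimensions).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, sentence_vecs, top_k=3):
    """Return the top_k (sentence, score) pairs, most similar first."""
    scored = [
        (sentence, cosine_similarity(query_vec, vec))
        for sentence, vec in sentence_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy vectors, made up for this example only.
TOY_VECTORS = {
    "Prices fall when competitors enter the market.": [0.9, 0.1, 0.0],
    "Demand for natural gas rose last quarter.": [0.7, 0.3, 0.1],
    "Lunch is at noon in the cafeteria.": [0.0, 0.1, 0.9],
}
QUERY_VECTOR = [0.8, 0.2, 0.0]  # stand-in for the vector of "free market"

results = rank_by_similarity(QUERY_VECTOR, TOY_VECTORS, top_k=2)
for sentence, score in results:
    print(f"{score:.3f}  {sentence}")
```

Neither of the two top-ranked toy sentences contains the words “free market”; they rank highest only because their vectors point in nearly the same direction as the query vector, which is the property a concept search relies on.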

An experiment

We decided to compare concept vs. keyword search on another public email dataset, the Enron emails, with a simple concept search model built with indico’s Text Features API. Specifically, we explored the 1000+ emails of some randomly chosen users and determined which concepts to search for based on Zichen Wang’s word cloud analysis of the entire dataset. We broke the emails down to the sentence level using the Text Features API’s automatic sentence splitting function. The top results of a concept search for the phrase “effect of economic downturn” were:
Results for query "effect of economic downturn"

These are all excellent examples of the concerns about, and impact of, a recession. Note that our search query is not explicitly stated in any of these sentences. We also found interesting results for human resources:
Results for query "Human Resources"

These results are particularly intriguing because they don’t mention “HR” or a specific task that we would associate with HR, like “careers” or “hiring”, yet the content chosen by the Text Features API is a clear example of these functions in action.

Speaking of HR, while poking around through the database, we noticed an email that clearly indicated some kind of (failed) tryst had taken place between two co-workers. So, we pulled a sentence directly from that email to see if we could pinpoint any other communication that revealed a similar pattern…

Search query: “Nothing has changed that nor do I think we need to act weird around each other going forward.”

Ooh lala.


Note how, of the top three fuzzy search results, only one email contained the term we were searching for. If we consider this on a grander scale, how much information are we missing out on by simply using keyword searches? How many legal cases, business deals, and other decisions may have been affected by incomplete information?

If you’d like to learn more or are looking to implement a machine learning solution for your business, reach out to us at contact@indico.io.
