Building a Bot to Answer FAQs: Predicting Text Similarity

In our previous tutorial on customer support bots, we trained a bot using the Custom Collection API to direct customers to the team member who is best suited to assist them with their problem or query. The bot improved our team’s response times as we no longer had to rely on a human facilitator (who also plays many other roles in our company #startuplife) to do the job. However, we’re generally only able to respond during our office hours of 11am-7pm EST, so there’s still lag for inquiries outside of that period. How can we improve this? Build a bot to answer frequently asked questions, reducing lag time for more customers and ensuring our engineers don’t need to spend more time than necessary away from the products we’re building for you :).

The Task

We’ll conduct a nearest neighbour search in Python, comparing a user input question to a list of FAQs. To do this, we’ll use indico’s Text Features API to find the feature vectors for the text data, and calculate the distance between each of these vectors and that of the user’s input question in 300-dimensional space. Then we’ll return the appropriate answer based on the FAQ that the user’s question is most similar to (if it meets a certain confidence threshold).

Getting Started

First, get the skeleton code from our SuperCell GitHub repo.

You’ll need to install all necessary packages if you don’t have them: texttable, tqdm (for the progress bar used below), and, of course, indicoio.
If you haven’t already set up your indico account, follow our Quickstart Guide. It will walk you through the process of getting your API key and installing the indicoio Python library. If you run into any problems, check the Installation section of the docs. If all else fails, you can also reach out to us through that little chat bubble. Assuming your account is all set up and you’ve installed everything, let’s get started!

Go to the top of your file and import indicoio. Don’t forget to set your API key. There are a number of ways you can do it; I like to put mine in a configuration file.

import indicoio
indicoio.config.api_key = 'YOUR_API_KEY'
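
If you'd like to keep the key out of the script, here's a minimal sketch of the configuration-file approach, written for Python 3. The file name `settings.ini`, the `[auth]` section, and the `api_key` option are all hypothetical names chosen for illustration; they are not part of indicoio.

```python
import configparser

def load_api_key(path, section="auth", option="api_key"):
    """Read an API key from an INI-style file (section/option names are hypothetical)."""
    parser = configparser.ConfigParser()
    # parser.read() silently skips missing files, so check what was actually loaded
    if not parser.read(path):
        raise IOError("Could not read config file: %s" % path)
    return parser.get(section, option).strip()

# Usage (the file name is an assumption):
# indicoio.config.api_key = load_api_key("settings.ini")
```

Keeping the key in a separate file also makes it easier to leave it out of version control.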

Using indico’s Text Features API

You’ll need to store your FAQs and their respective answers in a dictionary. For simplicity’s sake, I’ve created a dictionary, faqs, of five questions and answers in the script itself. This will be our starting dataset. We only need to find the text features for the questions and not the answers, so we extract faqs.keys() and then feed that data into our make_feats() function.

def make_feats(data):
    """
    Send our text data through the indico API and return each text example's text vector representation
    """
    chunks = [data[x:x+100] for x in xrange(0, len(data), 100)]
    feats = []
    # just a progress bar to show us how much we have left
    for chunk in tqdm(chunks):
        feats.extend(indicoio.text_features(chunk))
    return feats
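
To see what the batching in make_feats() is doing, here's the slicing step on its own, as a small Python 3 sketch. The `chunk` helper and the placeholder questions are made up for illustration; only the batch size of 100 comes from the function above.

```python
def chunk(items, size=100):
    """Split a list into consecutive slices of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# 250 placeholder questions -> batches of 100, 100, and 50
questions = ["question %d" % i for i in range(250)]
batch_sizes = [len(batch) for batch in chunk(questions)]
```

With only five FAQs, everything fits in a single batch, so the API is called once. The batching starts to matter when your FAQ list grows past 100 entries.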

Next, let’s update the run() function. Save out feats to a Pickle file so you don’t have to keep re-running the Text Features API on the static list of FAQs every time you want to compare a user’s question to it.

def run():
    data = faqs.keys()
    print "FAQ data received. Finding features."
    feats = make_feats(data)
    # cache the features so we don't have to call the API again for the same FAQs
    with open('faq_feats.pkl', 'wb') as f:
        pickle.dump(feats, f)
    print "FAQ features found!"
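
If you'd rather not switch run() between "compute" and "load" by hand, one option is a small caching helper. This is a Python 3 sketch, not part of the skeleton code: `load_or_compute` is a hypothetical name, and it assumes the cached file is still valid whenever it exists.

```python
import os
import pickle

def load_or_compute(path, compute):
    """Return the pickled object at `path`, computing and saving it first if the file is missing."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result

# Usage with the tutorial's names:
# feats = load_or_compute('faq_feats.pkl', lambda: make_feats(data))
```

Keep in mind that the cache goes stale if you edit your FAQs; delete the Pickle file whenever the list changes so the features are recomputed.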

Comparing FAQs to User Input

Now that we’ve got the feature representations for the FAQ text data, let’s move on to the next phase: collecting and comparing user questions to our FAQs. So that everyone can run this script locally, no matter what customer support chat service you plan to hook this up to, we’ll just use raw_input(). You’ll need to set up your own webhook according to your messaging app’s docs.

First, let’s get an input and add it to the list of FAQs, then find the text features for the input and add them to the main feats list. This will simplify things later when we need to calculate the distances for all the feature representations. Update the input_question() function:

def input_question(data, feats):
    # input a question
    question = raw_input("What is your question? ")
    # add the user question and its vector representations to the corresponding lists, `data` and `feats`
    # insert them at index 0 so you know exactly where they are for later distance calculations
    if question is not None:
        # update both lists together so their indices stay aligned
        data.insert(0, question)
        new_feats = indicoio.text_features(question)
        feats.insert(0, new_feats)
    return data, feats

Time to update the run() function again. This time, you can just load the Pickle file with the FAQ features you found earlier.

def run():
    data = faqs.keys()
    print "FAQ data received. Finding features."
    with open('faq_feats.pkl', 'rb') as f:
        feats = pickle.load(f)
    print "Features found -- success! Calculating similarities..."

So now we’ve got a list of feature vectors for all the FAQs and the user’s question! How will this help us figure out which FAQ the input is most similar to? Similarity between pieces of text is measured by similarity between their corresponding feature vectors. We predict their similarity in the calculate_distances function, which calculates the distance between these vectors in cosine space. Cosine is generally the comparison metric of choice when you’re dealing with points in high dimensional space.
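
As a rough illustration of the idea, here's a pure-Python (3) sketch of cosine distance and a pairwise distance matrix. It is not the skeleton's calculate_distances implementation, which may differ; `cosine_distance` and `pairwise_distances` are hypothetical names, the toy vectors are made up, and the sketch assumes no vector is all zeros.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity: 0 means same direction, 1 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def pairwise_distances(vectors):
    """Square matrix where entry [i][j] is the cosine distance between vectors i and j."""
    return [[cosine_distance(u, v) for v in vectors] for u in vectors]

# Toy 2-dimensional vectors standing in for 300-dimensional text features
toy = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
matrix = pairwise_distances(toy)
# matrix[0][1] is 0.0 (same direction), matrix[0][2] is 1.0 (orthogonal)
```

Notice that the first two toy vectors have different lengths but a distance of zero: cosine distance only looks at direction, which is part of why it's popular for comparing text of different lengths.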

calculate_distances produces an n by n matrix (for n documents) that stores the distance between document i and document j at distance_matrix[i][j].
Update run() once again:

def run():
    data = faqs.keys()
    print "FAQ data received. Finding features."
    with open('faq_feats.pkl', 'rb') as f:
        feats = pickle.load(f)
    print "Features found -- success! Calculating similarities..."
    input_results = input_question(data, feats)
    new_data = input_results[0]
    new_feats = input_results[1]
    distance_matrix = calculate_distances(new_feats)
    print "Similarities found. Generating table."

Finally, let’s see how well our nearest neighbour search performs! The similarity_text() function will sort through the distance_matrix and order each piece of text according to level of similarity, and then print it out in a table. We don’t want our bot to give an answer if it’s not very confident that it’s found an FAQ match though, so we need to set a confidence threshold. Add the following code to similarity_text(), just below print t.draw():

        # set a confidence threshold
        if similar_idx == most_sim_idx and similarity >= 0.75:
            faq_match = data[most_sim_idx]
        else:
            sorry = "Sorry, I'm not sure how to respond. Let me find someone who can help you."
If the bot’s confidence level meets the threshold, it should return the appropriate FAQ answer. Otherwise, it should notify your customer support manager (you’ll have to hook that up based on your messaging app’s docs):

    # print the appropriate answer to the FAQ, or bring in a human to respond
    if faq_match is not None:
        print "A: %r" % faqs[faq_match]
    else:
        print sorry
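
The nearest-neighbour-plus-threshold logic can also be sketched as one standalone function (Python 3). This is an illustration rather than the skeleton's similarity_text(): `best_match` is a hypothetical helper, the questions and distances are made up, and it assumes similarity is defined as 1 minus the cosine distance.

```python
def best_match(distances_to_input, candidates, threshold=0.75):
    """
    Return the candidate closest to the input, or None if its similarity
    (assumed here to be 1 - cosine distance) falls below `threshold`.
    """
    if not candidates:
        return None
    best_idx = min(range(len(candidates)), key=lambda i: distances_to_input[i])
    similarity = 1.0 - distances_to_input[best_idx]
    return candidates[best_idx] if similarity >= threshold else None

# Hypothetical FAQ questions and distances from one user question
candidates = ["How do I reset my password?", "Where can I find my API key?"]
confident = best_match([0.60, 0.10], candidates)   # similarity 0.90 -> match
unsure = best_match([0.60, 0.40], candidates)      # similarity 0.60 -> None
```

Returning None for the "not confident" case gives the caller a single value to check before deciding whether to answer or hand off to a human. The 0.75 threshold is the one used above; you'll likely want to tune it against real customer questions.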

Update run() one last time and then, well, run the code!

def run():
    data = faqs.keys()
    print "FAQ data received. Finding features."
    with open('faq_feats.pkl', 'rb') as f:
        feats = pickle.load(f)
    print "Features found -- success! Calculating similarities..."
    input_results = input_question(data, feats)
    new_data = input_results[0]
    new_feats = input_results[1]
    distance_matrix = calculate_distances(new_feats)
    print "Similarities found. Generating table."
    idx = 0
    similarity_text(idx, distance_matrix, new_data)
    print '\n' + '-' * 80

How well did it perform? Here’s an example:

Woo! It performed quite well. Even though the input question’s word choice differed from, and was more concise than, the FAQ match, our rich text features were still able to capture its meaning. So, what’s happening here? Why does this work?

indico’s Text Features API creates hundreds of thousands of rich feature vector representations for a given text input, learned using deep learning techniques. These feature vectors (numerical representations in multi-dimensional space) are a computer’s way of assigning meaning to a word.
When we fed the list of FAQs into the Text Features algorithm, it essentially identified certain words in each question as “significant”, and determined where they could be found in multi-dimensional space. Text Features then combined these word representations to produce a representation of the entire “document” (in this case, an FAQ — you could also pass in an entire article to find its feature representation). It did the same thing when we showed it our new input question. We then used these document representations to calculate the distance between them to determine how conceptually similar the two “documents” are. The closer the features in multi-dimensional space, the closer they are in meaning.
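
To make that intuition concrete, here's a toy Python 3 sketch. It combines word vectors by simple averaging, which is one common approach and not necessarily what the Text Features API does internally. The word vectors are made up by hand, and `document_vector` and `cosine_similarity` are hypothetical helpers.

```python
import math

# Made-up 3-dimensional word vectors; real learned features have hundreds of dimensions
WORD_VECTORS = {
    "reset":    [0.9, 0.1, 0.0],
    "change":   [0.8, 0.2, 0.1],
    "password": [0.1, 0.9, 0.0],
    "invoice":  [0.0, 0.1, 0.9],
}

def document_vector(text):
    """Average the vectors of the known words in `text` (unknown words are skipped)."""
    vectors = [WORD_VECTORS[w] for w in text.lower().split() if w in WORD_VECTORS]
    if not vectors:
        raise ValueError("no known words in %r" % text)
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

faq = document_vector("reset password")
close = cosine_similarity(faq, document_vector("change password"))
far = cosine_similarity(faq, document_vector("invoice"))
```

Here "change password" ends up much closer to "reset password" than "invoice" does, even though "change" and "reset" are different words, because their (made-up) vectors point in similar directions.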

Next Steps

At the heart of it, this was an exercise in predicting text similarity. What other applications can you imagine for this task? If you create something cool with our APIs, definitely let us know at contact@indico.io!
