Building a Bot to Answer FAQs: Predicting Text Similarity
February 15, 2017 / Business, Developers, Tutorials
In our previous tutorial on customer support bots, we trained a bot using the Custom Collection API to direct customers to the team member who is best suited to assist them with their problem or query. The bot improved our team’s response times as we no longer had to rely on a human facilitator (who also plays many other roles in our company #startuplife) to do the job. However, we’re generally only able to respond during our office hours of 11am-7pm EST, so there’s still lag for inquiries outside of that period. How can we improve this? Build a bot to answer frequently asked questions, reducing lag time for more customers and ensuring our engineers don’t need to spend more time than necessary away from the products we’re building for you :).
We’ll conduct a nearest neighbour search in Python, comparing a user input question to a list of FAQs. To do this, we’ll use indico’s Text Features API to find the feature vectors for the text data, and calculate the distance between each FAQ’s vector and that of the user’s input question in 300-dimensional space. Then we’ll return the appropriate answer based on the FAQ that the user’s question is most similar to (if it meets a certain confidence threshold).
First, get the skeleton code from our SuperCell GitHub repo.
You’ll need to install all necessary packages if you don’t have them: texttable and, of course, indicoio.
If you haven’t already set up your indico account, follow our Quickstart Guide. It will walk you through the process of getting your API key and installing the indicoio Python library. If you run into any problems, check the Installation section of the docs. If all else fails, you can also reach out to us through that little chat bubble. Assuming your account is all set up and you’ve installed everything, let’s get started!
Go to the top of your file and import indicoio. Don’t forget to set your API key. There are a number of ways you can do it; I like to put mine in a configuration file.

    import indicoio
    indicoio.config.api_key = 'YOUR_API_KEY'
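One way to keep the key out of the script is a small JSON configuration file. The sketch below is only an illustration: the file name `indico_config.json`, the `api_key` field, and the `load_api_key` helper are assumptions made for this example, not part of the indicoio library.

```python
import json


def load_api_key(path="indico_config.json"):
    """Read an API key from a JSON file shaped like {"api_key": "..."}.

    The file name and the "api_key" field are hypothetical choices for this sketch.
    """
    with open(path) as f:
        config = json.load(f)
    key = config.get("api_key")
    if not key:
        raise ValueError("No 'api_key' entry found in %s" % path)
    return key
```

You could then write `indicoio.config.api_key = load_api_key()` and keep the configuration file out of version control.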
Using indico’s Text Features API
You’ll need to store your FAQs and their respective answers in a dictionary. For simplicity’s sake, I’ve created a dictionary, faqs, of five questions and answers in the script itself. This will be our starting dataset. We only need to find the text features for the questions and not the answers, so we extract faqs.keys() and then feed that data into our make_feats() function:
    def make_feats(data):
        """
        Send our text data through the indico API and return each text example's text vector representation
        """
        # split the data into batches of 100 texts per API call
        chunks = [data[x:x+100] for x in xrange(0, len(data), 100)]
        feats = []
        # just a progress bar to show us how much we have left
        for chunk in tqdm(chunks):
            feats.extend(indicoio.text_features(chunk))
        return feats
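The batching step can be tried offline, without an API key. In this sketch, `fake_text_features` is a hypothetical stand-in for `indicoio.text_features` (it returns a made-up two-number vector per text), and `chunked`, `make_feats_offline`, and `batch_size` are illustrative names. Only the chunking logic mirrors the function above.

```python
def chunked(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def fake_text_features(texts):
    """Hypothetical stand-in for the API call: one tiny vector per text."""
    return [[float(len(t)), float(t.count(" "))] for t in texts]


def make_feats_offline(data, batch_size=100):
    """Featurize `data` batch by batch, keeping results in input order."""
    feats = []
    for chunk in chunked(data, batch_size):
        feats.extend(fake_text_features(chunk))
    return feats


questions = [
    "How do I reset my password?",
    "Where is my invoice?",
    "Do you offer refunds?",
]
feats = make_feats_offline(questions, batch_size=2)
```

Because each batch is appended in order, `feats[i]` always corresponds to `questions[i]`, which is what the later distance calculation relies on.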
Next, let’s update the run() function. Save out feats to a Pickle file so you don’t have to keep re-running the Text Features API on the static list of FAQs every time you want to compare a user’s question to it.

    def run():
        data = faqs.keys()
        print "FAQ data received. Finding features."
        feats = make_feats(data)
        with open('faq_feats.pkl', 'wb') as f:
            pickle.dump(feats, f)
        print "FAQ features found!"
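To make the caching idea concrete, here is a minimal sketch of the save-then-load round trip. The helper names `save_feats` and `load_feats`, the temporary directory, and the toy vectors are assumptions for illustration; the tutorial's script writes to `faq_feats.pkl` directly.

```python
import os
import pickle
import tempfile


def save_feats(feats, path):
    """Write the feature list to disk in binary mode."""
    with open(path, "wb") as f:
        pickle.dump(feats, f)


def load_feats(path):
    """Read the feature list back from disk."""
    with open(path, "rb") as f:
        return pickle.load(f)


# toy vectors standing in for real Text Features output
toy_feats = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
cache_path = os.path.join(tempfile.mkdtemp(), "faq_feats.pkl")
save_feats(toy_feats, cache_path)
restored = load_feats(cache_path)
```

The cached vectors are matched to questions by position, so the file should be regenerated whenever the FAQ list changes.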
Comparing FAQs to User Input
Now that we’ve got the feature representations for the FAQ text data, let’s move on to the next phase: collecting and comparing user questions to our FAQs. So that everyone can run this script locally, no matter what customer support chat service you plan to hook this up to, we’ll just use raw_input(). You’ll need to set up your own webhook according to your messaging app’s docs.
First, let’s get an input and add it to the list of FAQs, then find the text features for the input and add them to the main feats list. This will simplify things later when we need to calculate the distances for all the feature representations. Update the input_question() function:
    def input_question(data, feats):
        # input a question
        question = raw_input("What is your question? ")
        # add the user question and its vector representations to the corresponding lists, `data` and `feats`
        # insert them at index 0 so you know exactly where they are for later distance calculations
        if question is not None:
            data.insert(0, question)
            new_feats = indicoio.text_features(question)
            feats.insert(0, new_feats)
        return data, feats
Time to update the run() function again. This time, you can just load the Pickle file with the FAQ features you found earlier.

    def run():
        data = faqs.keys()
        print "FAQ data received. Finding features."
        with open('faq_feats.pkl', 'rb') as f:
            feats = pickle.load(f)
        print "Features found -- success! Calculating similarities..."
So now we’ve got a list of feature vectors for all the FAQs and the user’s question! How will this help us figure out which FAQ the input is most similar to? Similarity between pieces of text is measured by similarity between their corresponding feature vectors. We predict their similarity in the calculate_distances function, which calculates the cosine distance between these vectors. Cosine distance is generally the comparison metric of choice when you’re dealing with points in high-dimensional space. calculate_distances produces an m × n matrix that stores the distance between document m and document n. Update run() once again:
    def run():
        data = faqs.keys()
        print "FAQ data received. Finding features."
        with open('faq_feats.pkl', 'rb') as f:
            feats = pickle.load(f)
        print "Features found -- success! Calculating similarities..."
        # input_question returns a (data, feats) tuple, so unpack it by index
        input_results = input_question(data, feats)
        new_data = input_results[0]
        new_feats = input_results[1]
        distance_matrix = calculate_distances(new_feats)
        print "Similarities found. Generating table."
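The skeleton repo supplies calculate_distances, and its implementation is not reproduced in this post. As a rough idea of what a pairwise cosine distance computation looks like, here is a pure-Python sketch. The names `cosine_distance` and `pairwise_cosine_distances` are hypothetical, and the three-dimensional vectors are toy data rather than real 300-dimensional features.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v); 0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def pairwise_cosine_distances(feats):
    """Build a square matrix where entry [i][j] is the distance from doc i to doc j."""
    return [[cosine_distance(u, v) for v in feats] for u in feats]


# index 0 plays the role of the user's question, the rest are FAQs
toy_feats = [
    [1.0, 0.0, 1.0],  # user question
    [2.0, 0.0, 2.0],  # same direction as the question, different length
    [0.0, 1.0, 0.0],  # orthogonal to the question
]
matrix = pairwise_cosine_distances(toy_feats)
```

Because the user's question sits at index 0, row 0 of the matrix holds its distance to every FAQ. Cosine distance ignores vector length, which is why the second toy vector comes out at a distance of roughly zero from the first.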
Finally, let’s see how well our nearest neighbour search performs! The similarity_text() function will sort through the distance_matrix and order each piece of text according to level of similarity, and then print it out in a table. We don’t want our bot to give an answer if it’s not very confident that it’s found an FAQ match though, so we need to set a confidence threshold. Add the following code to similarity_text():
    # set a confidence threshold
    if similar_idx == most_sim_idx and similarity >= 0.75:
        faq_match = data[most_sim_idx]
    else:
        sorry = "Sorry, I'm not sure how to respond. Let me find someone who can help you."
If the bot’s confidence level meets the threshold, it should return the appropriate FAQ answer. Otherwise, it should notify your customer support manager (you’ll have to hook that up based on your messaging app’s docs):
    # print the appropriate answer to the FAQ, or bring in a human to respond
    if faq_match is not None:
        print "A: %r" % faqs[faq_match]
    else:
        print sorry
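The thresholding logic can also be viewed in isolation from the table-printing code. The sketch below is a simplified, standalone version and not the skeleton's similarity_text(): `best_faq_match` is a hypothetical helper, and it assumes similarity is computed as 1 minus the cosine distance, which the post does not state explicitly.

```python
def best_faq_match(distances_from_question, data, threshold=0.75):
    """Return the closest FAQ if it clears the threshold, otherwise None.

    `distances_from_question` is row 0 of the distance matrix and `data[0]`
    is the user's question, so index 0 is skipped when searching.
    Assumes similarity = 1 - cosine distance (an assumption of this sketch).
    """
    best_idx, best_sim = None, float("-inf")
    for idx in range(1, len(data)):
        similarity = 1.0 - distances_from_question[idx]
        if similarity > best_sim:
            best_idx, best_sim = idx, similarity
    if best_idx is not None and best_sim >= threshold:
        return data[best_idx]
    return None


data = ["can I get my money back", "Do you offer refunds?", "Where is my invoice?"]
confident = best_faq_match([0.0, 0.1, 0.6], data)  # closest FAQ has similarity 0.9
unsure = best_faq_match([0.0, 0.4, 0.6], data)     # closest FAQ has similarity 0.6
```

Returning None for the low-confidence case leaves the caller free to decide how to hand off to a human. The 0.75 cut-off is the value used in this tutorial and may need tuning on your own FAQs.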
Update run() one last time and then, well, run the code!
    def run():
        data = faqs.keys()
        print "FAQ data received. Finding features."
        with open('faq_feats.pkl', 'rb') as f:
            feats = pickle.load(f)
        print "Features found -- success! Calculating similarities..."
        # input_question returns a (data, feats) tuple, so unpack it by index
        input_results = input_question(data, feats)
        new_data = input_results[0]
        new_feats = input_results[1]
        distance_matrix = calculate_distances(new_feats)
        print "Similarities found. Generating table."
        # the user's question was inserted at index 0
        idx = 0
        similarity_text(idx, distance_matrix, new_data)
        print '\n' + '-' * 80
How well did it perform? Here’s an example:
Woo! It performed quite well. Even though the input question’s word choice differed from the FAQ match and was more concise, our rich text features were still able to capture its meaning. So, what’s happening here? Why does this work?
indico’s Text Features API creates hundreds of thousands of rich feature vector representations for a given text input, learned using deep learning techniques. These feature vectors (numerical representations in multi-dimensional space) are a computer’s way of assigning meaning to a word.
When we fed the list of FAQs into the Text Features algorithm, it essentially identified certain words in each question as “significant”, and determined where they could be found in multi-dimensional space. Text Features then combined these word representations to produce a representation of the entire “document” (in this case, an FAQ — you could also pass in an entire article to find its feature representation). It did the same thing when we showed it our new input question. We then used these document representations to calculate the distance between them to determine how conceptually similar the two “documents” are. The closer the features in multi-dimensional space, the closer they are in meaning.
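The post does not say how Text Features combines word representations into a document representation. One simple and common approach is to average the word vectors, sketched below purely as an illustration. The vectors in `TOY_WORD_VECTORS` are made up, `document_vector` is a hypothetical helper, and none of this should be read as indico's actual method.

```python
# made-up 3-dimensional word vectors; real features are learned and much larger
TOY_WORD_VECTORS = {
    "refund": [0.9, 0.1, 0.0],
    "money": [0.8, 0.2, 0.1],
    "back": [0.7, 0.1, 0.2],
    "invoice": [0.1, 0.9, 0.3],
}


def document_vector(text, word_vectors=TOY_WORD_VECTORS):
    """Average the vectors of known words; unknown words are skipped."""
    known = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    if not known:
        raise ValueError("no known words in %r" % text)
    dims = len(known[0])
    return [sum(vec[d] for vec in known) / len(known) for d in range(dims)]


doc = document_vector("I want my money back")
```

In this toy setup, "money back" and "refund" end up close together because their made-up vectors point in similar directions, which is the intuition behind matching a question to an FAQ that uses different words.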
At the heart of it, this was an exercise in predicting text similarity. What other applications can you imagine for this task? If you create something cool with our APIs, definitely let us know at firstname.lastname@example.org!