Creating Personalized Content Filters with CrowdLabel

We are all inundated with information. Even in a Google-powered world, where it’s relatively easy to find any type of content that we may seek, it can sometimes still take significant effort to pore through the search results and find the specific pieces we are interested in.
For instance, say you’re interested in wearable tech. More specifically, you need to keep track of how the market is doing — but you’re not that interested in in-depth articles that explore the tech behind the devices. You want to filter out all that “unnecessary” content, and just read a certain type of article that covers the topic in a way that matters to you.

The Task + Data + Approach

We’re going to create a personalized content filter that only shows you articles you’ve deemed interesting. Let’s use the example we discussed above, and tune our filter to pick out articles that talk about the wearable device market specifically, while leaving out other content related to the topic. This is a simple classification problem, and in this case, there are two categories: Interesting and Not Interesting.

Now, we need a dataset. This is a CSV file of 30 articles discussing wearable devices, which I gathered from Wired, Financial Times, and other content providers after running a basic Google search on “wearable tech”. Next, we need to label this data and train a model on it to teach it what we care about, which we can do using indico’s CrowdLabel. This tool lets you upload and label your own data and, when used in conjunction with Custom Collections, train a model on it, all without writing any code.

If you’re thinking that a dataset of 30 articles sounds too small to train a robust deep learning model, don’t worry. Such a small dataset works because Custom Collections builds on indico’s hundreds of thousands of rich feature embeddings through transfer learning, rather than learning language features from scratch.
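To make the transfer learning idea concrete, here is a minimal sketch of the general pattern: a frozen feature extractor plus a small classifier trained on a handful of labeled examples. It is not indico’s implementation. The `embed` function is a hand-written, hypothetical stand-in for a pretrained embedding model, and the term lists, training sentences, and function names are all made up for illustration.

```python
import math

# Hypothetical stand-in for a frozen, pretrained feature extractor.
# A real system would use learned embeddings; these word lists are invented.
MARKET_TERMS = {"market", "shipments", "sales", "growth"}
TECH_TERMS = {"sensor", "circuitry", "chip", "prototype"}

def embed(text):
    """Map text to a fixed 2-number feature vector; never updated in training."""
    words = text.lower().split()
    return [
        float(sum(w in MARKET_TERMS for w in words)),
        float(sum(w in TECH_TERMS for w in words)),
    ]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_head(examples, epochs=200, lr=0.5):
    """Fit a logistic-regression head on the frozen features (batch gradient descent)."""
    features = [embed(text) for text, _ in examples]
    targets = [1.0 if label == "yes" else 0.0 for _, label in examples]
    weights, bias = [0.0, 0.0], 0.0
    n = len(examples)
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for x, y in zip(features, targets):
            error = sigmoid(weights[0] * x[0] + weights[1] * x[1] + bias) - y
            grad_w[0] += error * x[0] / n
            grad_w[1] += error * x[1] / n
            grad_b += error / n
        weights = [weights[0] - lr * grad_w[0], weights[1] - lr * grad_w[1]]
        bias -= lr * grad_b
    return weights, bias

def predict_yes(text, weights, bias):
    """Probability that the text belongs to the "yes" (interesting) class."""
    x = embed(text)
    return sigmoid(weights[0] * x[0] + weights[1] * x[1] + bias)

# Four made-up labeled examples are enough here because only the head is trained.
train = [
    ("wearable market growth slows as shipments fall", "yes"),
    ("fitbit sales drop while the market expands", "yes"),
    ("new sensor prototype detects touch", "no"),
    ("circuitry printed on sticky paper", "no"),
]
weights, bias = train_head(train)
print(predict_yes("smartwatch shipments and market share", weights, bias) > 0.5)
print(predict_yes("a sensor chip for gloves", weights, bias) < 0.5)
```

Only three numbers (two weights and a bias) are learned from the labeled data, which is why so few examples can be enough when the features already carry most of the signal.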

Note: We’ll only be building the backend function (as in, the machine learning model) in this tutorial. A real app would likely need some sort of frontend, but that is outside the scope of this tutorial.

Using CrowdLabel

Let’s get started! First things first — go to CrowdLabel and sign into your indico account (or sign up if you don’t have one yet). If you already know how to use CrowdLabel, feel free to skip to the section that examines how well our filter performs.

Before uploading your dataset, make sure all your examples are in a single column, with each example in its own cell. You should also give the column a heading, as the first row of your CSV file will automatically be interpreted as the column title.
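If your articles live in a script rather than a spreadsheet, a file in that shape can be produced with Python’s standard `csv` module. This is a minimal sketch: the function name, the `text` heading, the file name, and the sample articles are all hypothetical, and collapsing line breaks is a precaution of my own, not a documented CrowdLabel requirement.

```python
import csv

def write_single_column_csv(path, examples, heading="text"):
    """Write one example per row under a single heading row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([heading])  # first row is read as the column title
        for example in examples:
            # Precaution (assumption): collapse line breaks so each article
            # stays on one physical line of the file.
            writer.writerow([" ".join(example.split())])

# Made-up sample articles; commas inside the text are quoted automatically.
articles = [
    "Wearable shipments grew, according to a new market report.",
    "A new smartwatch sensor can detect\nwhich object you touch.",
]
write_single_column_csv("wearable_articles.csv", articles)
```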

Upload your dataset to CrowdLabel. You’ll be brought to a page that will allow you to prepare your dataset for labeling. See the image below for more details on the different form fields.

Specifying the number of labelers per task will set the minimum number of times an example must be labeled before it’s accepted as complete.
Be sure to leave instructions for your labeling party under “Question” so they complete the task the way you want it. Make sure you are clear about the task. Here we specify, “Does the article talk about overall wearable tech market trends?” This way the labeler knows our criteria for what is interesting.

Now, let’s consider how the end product should work. The question we’re trying to answer as we read each article is “Is this article interesting or not?” We’ll allow labelers to select only one option, since the answers are mutually exclusive, by using radio buttons. Go ahead and fill out the different label options, and click “Create Dataset”.

You’ll then be taken to the details page for your dataset, where you can review the task, add further instructions and team members, and take snapshots of your data. Go ahead and start labeling!

Try to label the data as best you can, even if you’re uncertain about the accuracy of your choice. If you have multiple people labeling, examples will only be used as training data if several people validate the label (you can turn this function on or off, but we recommend using it for better results).
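The validation idea can be sketched in a few lines. This only illustrates agreement-based filtering in general; CrowdLabel’s actual rule is not described here, so the function name, the `min_labelers` and `min_agreement` parameters, and the strict-majority requirement are all assumptions.

```python
from collections import Counter

def validated_labels(labels_by_example, min_labelers=2, min_agreement=2):
    """Return {example_id: label} for examples whose top label is well supported.

    Assumed rule: an example needs at least `min_labelers` votes, and its most
    common label needs at least `min_agreement` votes and a strict majority.
    """
    accepted = {}
    for example_id, votes in labels_by_example.items():
        if len(votes) < min_labelers:
            continue  # not labeled enough times yet
        label, count = Counter(votes).most_common(1)[0]
        if count >= min_agreement and count * 2 > len(votes):
            accepted[example_id] = label
    return accepted

# Made-up votes from several labelers.
votes = {
    "article-1": ["yes", "yes", "no"],  # two of three agree: kept
    "article-2": ["no"],                # only one labeler so far: skipped
    "article-3": ["yes", "no"],         # tie: skipped
}
print(validated_labels(votes))
```

Requiring a strict majority means ties never reach the training set, which trades a little data for cleaner labels.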

Take a Snapshot to train your Custom Collection model

You can take a snapshot of your data at any point for a macro look at how your data has been labeled (it will download as a CSV). Taking a snapshot also gives you the option to train a Custom Collection on all the available, validated labeled data with the press of a button. You can also specify a model domain, which tells Custom Collections to work with the feature embeddings best suited to your task. For this case, it isn’t necessary. Finally, click “Save” to train your personalized content filter!

Your Personalized Content Filter

Note that at the moment, you will need to run code to use your Custom Collection model, but it only takes a few lines. If you want to try running it, use this script and test dataset. (These articles are different from the ones we used for training; you could also gather your own for testing if you prefer.)

Otherwise, here are some accuracy metrics produced by the model:

{u'metrics': {u'recall': {u'no': 1.0, u'yes': 0.975}, u'class_accuracy': 0.9888888889, u'precision': {u'no': 0.9833333333000001, u'yes': 1.0}}}
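As a rough illustration of what per-class numbers like these measure, the sketch below computes precision and recall from a made-up list of true and predicted labels. The function name and the toy labels are hypothetical; this is not the script or the test data that produced the metrics above.

```python
def precision_recall(true_labels, predicted_labels, classes=("yes", "no")):
    """Compute per-class precision and recall from paired label lists."""
    pairs = list(zip(true_labels, predicted_labels))
    metrics = {"precision": {}, "recall": {}}
    for cls in classes:
        true_pos = sum(1 for t, p in pairs if t == cls and p == cls)
        predicted = sum(1 for _, p in pairs if p == cls)  # everything labeled cls
        actual = sum(1 for t, _ in pairs if t == cls)     # everything truly cls
        metrics["precision"][cls] = true_pos / predicted if predicted else 0.0
        metrics["recall"][cls] = true_pos / actual if actual else 0.0
    return metrics

# Made-up example: the model misses one interesting article and calls it "no".
true_labels = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]
predicted = ["yes", "yes", "yes", "no", "no", "no", "no", "no"]
print(precision_recall(true_labels, predicted))
```

In this toy case every article flagged “yes” really is interesting (precision 1.0 for “yes”), but one interesting article was missed (recall 0.75 for “yes”), the same pattern as in the metrics above.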

Precision and recall metrics are typically used to measure the relevance of results for classification problems such as this one. High precision means that most of the articles the model assigned to a class really belong to it, and high recall means that the model found most of the articles that belong to that class. Looks like our model is really good at recognizing the sort of content we want to read (articles about the wearable tech market)!

Now, to make the filter return only the “yes” (interesting) articles, we would set a threshold probability in our code so that it only shows us articles the model is confident we’ll like. If you want to be extremely strict, you could set it to yes = 0.9, but somewhere around 0.6 or 0.7 is probably good enough.
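The thresholding step might look like the sketch below. It assumes the model returns a dictionary of label probabilities such as `{"yes": 0.93, "no": 0.07}` for each article; that response shape, the `filter_interesting` helper, and the canned scores in `fake_predict` are assumptions for illustration, not the actual Custom Collection output.

```python
def filter_interesting(articles, predict, threshold=0.7):
    """Keep articles whose predicted probability of "yes" meets the threshold."""
    kept = []
    for article in articles:
        probabilities = predict(article)  # assumed shape: {"yes": p, "no": 1 - p}
        if probabilities.get("yes", 0.0) >= threshold:
            kept.append(article)
    return kept

# Hypothetical stand-in for a trained model: canned scores, not real output.
CANNED_SCORES = {
    "IDC: wearable shipments up": 0.93,
    "Fitbit sales fall": 0.66,
    "How e-skin sensors work": 0.08,
}

def fake_predict(article):
    yes = CANNED_SCORES[article]
    return {"yes": yes, "no": 1.0 - yes}

articles = list(CANNED_SCORES)
print(filter_interesting(articles, fake_predict, threshold=0.7))
print(filter_interesting(articles, fake_predict, threshold=0.6))
```

Lowering the threshold from 0.7 to 0.6 lets the borderline second article through, which is the strictness trade-off described above.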

To give you a more tangible sense of how the model is working, here are some snippets of the articles the model deemed interesting vs. not interesting:

Interesting

Wristwear Dominates the Wearables Market While Clothing and Earwear Have Market-Beating Growth by 2021
The overall wearables market is expected to return to strong growth after a brief slowdown in 2016 that resulted from delayed launches of major platforms and notable vendors struggling to maintain pace. Looking ahead, new vendors, emerging form factors, and an expanded number of retail outlets will drive worldwide wearable device shipments from 102.4 million in 2016 to 237.5 million in 2021, a five-year compound annual growth rate (CAGR) of 18.3%, according to new data from the International Data Corporation (IDC) Worldwide Quarterly Wearable Device Tracker…

Fitbit’s Dominance Diminishes But Wearable Tech Market Bigger Than Ever
Fitbit has taken a hit with year-on-year Q4 sales but the wearable tech market is bigger than ever. They are the two main takeways from the latest IDC Worldwide Quarterly Wearable Device Tracker, which states that Fitbit shipments were down 22.7 percent in the fourth quarter of 2016: 6.5 million units compared to 8.4 million units in the same period in 2015…

Wearable Market in India Saw Shipment of 2.5 Million Units in 2016
Wearable market in India observed a shipment of 2.5 million units in 2016 with Goqii and Xiaomi dominating the entry-level segment, International Data Corporation’s (IDC) Worldwide Quarterly Wearable Device Tracker has said. Total wearable shipments in the fourth quarter of 2016 were 675,000 units, which included both smart wearables that can run third party apps and basic wearables, which cannot. Wearables shipments declined 19.6 per cent sequentially as basic wearables — that account to 86.4 per cent of total wearable — declined 23.9 per cent in the fourth quarter of 2016…

Not Interesting

Em-Sense Enabled Smartwatch Can Detect When You Touch a Doorknob
The great promise of the smartwatch and its intelligent brethren is that they will unite everyday objects in a seamless web of connectivity. But there is a glaring hole in this plan: Our lives remain filled with countless dumb objects—door handles, lamps, pots and pans—that have no way of communicating with our smart objects. One solution is to make everything intelligent. A more discerning approach is to design objects that are clever enough to glean information from the dumb objects around them. This is the idea behind Em-Sense, a technology Carnegie Mellon’s future interfaces group developed with Disney Research…

Athletes Turn to Implants to Take Training Into the Future
With the rollout this season of RFID (radio-frequency identification) player-tracking technology in every NFL stadium—bottle-cap-sized chips are embedded in every set of shoulder pads—the league can keep tabs on all of its players the same way that businesses do their inventories. The system, provided by Zebra Technologies and still largely kept under wraps by the NFL, is geared partly toward improving the TV experience, but when that data is opened up to fans and teams, things could get really interesting…

Electronic Skin May Transform the Way We Interact With Tech
It’s still just a proof-of-concept, but ideas for how to use skins like this one come pretty easily—think the Bionic Man, or Luke Skywalker’s prosthetic hand. But only recently has microfabrication tech caught up to humanity’s cyborg dreams. Now, every few months or so, a lab comes out with a new, different design: A bunch of sensors stuck to a glove! Circuitry printed on weird sticky paper! “We’re at a very nascent stage,” says Benjamin Tee, an e-skin researcher at the Institute of Materials Research and Engineering in Singapore…

Next steps? Build a frontend for your app, and hook up Text Tags to create an initial broad filter before applying this finer one. Questions? Reach out to us through that little chat bubble in the bottom right-hand corner of your screen!
