General, pre-trained machine learning solutions are now easier to come by: sentiment analysis, topic classification, entity extraction, and so on. Unfortunately, these solutions typically don’t work so well with industry- or company-specific tasks, because the data that matters for your business is probably different from the data used to train the model. For instance, financial articles are generally more formal and serious in tone than the casual language used on social media and sites such as Buzzfeed. Most general sentiment analysis models are trained only on the latter, so they would perform poorly on financial articles.
indico’s Custom Collections API employs transfer learning to help you efficiently build machine learning models that are tailored to your task and industry. You can train these models using datasets that are orders of magnitude smaller than what would be required if you were to start from scratch. To learn more about transfer learning, check out our Deep Learning in Fashion series for a non-technical explanation, or take a more technical dive in our Exploring Computer Vision series.
Let’s build a simple, customized topic tagging model using an unlabeled dataset and no code. Wait. Waiiit. What’s that now?
No, no joke. It’s now possible with our shiny new tool, CrowdLabel — but I’ll get to that in a sec.
We specifically built this demo for categorizing financial articles, but you can tweak this for the domain of your preference (you’ll just have to gather your own dataset ;)).
It’s an exciting time to be a part of the machine learning field. 2015 was described as a “breakthrough year”, and this year we’ve seen even more development. Research is moving so quickly, we regularly have to go back and update articles and tutorials we published just a few months back.
Yet folks outside of the general tech sphere are cautious about this technology, or find it difficult to adopt in their businesses. Yes, machine learning research is moving quickly, but it’s focused on improving or building algorithms that aren’t necessarily useful for businesses in other industries today. This gap between what practitioners and researchers are pursuing and what people in other industries actually need makes machine learning seem inaccessible and mysterious, even scary.
Machine learning needs to be usable. You might have the most accurate algorithm for something, but it’s meaningless if it doesn’t actually help anyone. We still need researchers to keep pursuing these brilliant advances, but we also need to improve the user experience of building and deploying machine learning models.
That’s why we built CrowdLabel, the latest tool in our machine learning kit. It’s a seamless way to quickly upload, label, and process data, and then train a model on said data when used in conjunction with Custom Collections — all without writing any code.
Don’t worry, I’ll show you how it works. Let’s get to the good stuff.
We started with an unlabeled dataset of close to 2,000 financial news articles. Download a truncated version (750 articles) here. Five of us got comfortable on the couches with some coffee and proceeded to label about 400 of them (50-200 each). It took us maybe two hours. We set our data processing options in CrowdLabel to automatically exclude examples where labelers disagreed, which left us with 225 examples to train our customized model. Remember that we can use such a small dataset because Custom Collections allows you to make use of indico’s hundreds of thousands of rich feature embeddings through transfer learning.
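To make the "exclude examples where labelers disagreed" option concrete, here is a minimal sketch of that kind of filter. This is not CrowdLabel’s actual implementation; the function name `keep_unanimous`, the dictionary layout, and the sample article IDs are all assumptions made for illustration.

```python
def keep_unanimous(labels_by_example, min_labelers=2):
    """Keep only examples that enough people labeled and that everyone labeled the same way.

    labels_by_example maps a (hypothetical) example ID to the list of labels
    it received, one entry per labeler.
    """
    agreed = {}
    for example_id, labels in labels_by_example.items():
        enough_labelers = len(labels) >= min_labelers
        everyone_agrees = len(set(labels)) == 1
        if enough_labelers and everyone_agrees:
            agreed[example_id] = labels[0]
    return agreed


# Toy data, not taken from the real dataset.
raw_labels = {
    "article-1": ["Corporate News", "Corporate News"],
    "article-2": ["Stock News & Analysis", "Fundamental Analysis"],
    "article-3": ["Other", "Other", "Other"],
    "article-4": ["Sector/Market News"],
}

training_set = keep_unanimous(raw_labels)
print(training_set)
```

In this toy run, only the first and third articles survive: the second has conflicting labels, and the fourth was labeled by just one person. A filter this strict is why a few hundred labeled articles can shrink to a much smaller training set.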
Training the Model with CrowdLabel
We defined five categories for our financial topic tags:
- Corporate News: Article about a specific company describing an event that happened to that company, such as a management change, product launch, or strategy change.
- Stock News & Analysis: An article reporting the performance of a single company’s stock. This should include articles like “Pepsi’s earnings beat expectations”, “Pepsi’s revenue worse than expected”, “Pepsi’s stock price has risen”, etc.
- Fundamental Analysis: An article examining several factors to make an assessment about a company’s stock. For example, “Barclays increases estimate for Pepsi” or “Pepsi is a strong buy”.
- Sector/Market News: News or analysis that focuses on a sector, region, or market.
- Other: Non-financial news, i.e. articles that don’t fall into any of the other categories.
We created these categories based on the idea of augmented, personalized search. For instance, if you’re an analyst researching whether or not you should invest in Pepsi, you could ask the algorithm to return all the articles tagged as “Corporate News” to determine if any recent management changes will impact your decision. Note that we built this demo as a simple proof of concept, so you should create your labels/tags according to your required level of specificity.
Before uploading your dataset, make sure all your examples are in a single column, with each example in its own cell (in the image below, the text has simply overflowed out of the cells, but everything lives in Column A). You should also give the column a heading, because the first row in your CSV file will automatically be interpreted as the column title.
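If your articles live in a script rather than a spreadsheet, a single-column CSV with a heading can be produced with Python’s standard `csv` module. This is only a sketch: the heading `text`, the helper name, and the sample articles are assumptions, not requirements stated by CrowdLabel.

```python
import csv
import io


def write_single_column_csv(articles, handle, heading="text"):
    """Write one article per row, in a single column, under a heading row."""
    writer = csv.writer(handle)
    writer.writerow([heading])  # the first row is read as the column title
    for article in articles:
        # Collapse line breaks and repeated spaces so each article stays in one tidy cell.
        writer.writerow([" ".join(article.split())])


# Toy examples; replace with your own articles.
articles = [
    "Pepsi names a new CFO.\nThe change takes effect in May.",
    "Tech stocks rallied, led by chipmakers.",
]

buffer = io.StringIO()
write_single_column_csv(articles, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file instead of an in-memory buffer, pass a handle opened with `open("articles.csv", "w", newline="", encoding="utf-8")`. The `csv` module quotes commas inside an article for you, so the text never spills into a second column.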
Go ahead and upload your dataset to CrowdLabel. You’ll be brought to a page that will allow you to prepare your dataset for labeling. See the image below for more details on the different form fields.
- Specifying the number of labelers per task will set the minimum number of times an example must be labeled before it’s accepted as complete.
- Be sure to leave instructions for your labeling party under “Question” so they complete the task the way you want it!
Now you must think about how your end product will work. Do you want to force your model to place each article into a single category, or can it fall into a variety of buckets? Since our categories are already quite broad, for our purposes we’ll be more rigid and allow labelers to only select one option (Radio buttons). Go ahead and fill out the different label options, and click “Create Dataset”.
You’ll then be taken to the details page for your dataset, where you can review the task, add further instructions and team members, and take snapshots of your data. So go ahead and add your team members to the project (make sure they’ve got an indico account!), and start labeling!
Examples can be rejected if the data doesn’t load properly. Otherwise, it’s best to try to label the data as accurately as possible, even if you’re uncertain about the accuracy of your label. Remember, examples will only be used as training data if multiple people validate the label (you can actually turn this function on or off, but we recommend using it for better results).
You can take a snapshot of your data at any point for a macro look at how your data has been labeled (it will download as a CSV). Taking a snapshot also gives you the option to train a Custom Collection on all the available, validated labeled data with the press of a button. You can also specify a model domain, which tells Custom Collections to work with the feature embeddings best suited to your task. Since we’re working with financial articles, you should select “Finance” as your model domain. Finally, click “Save” to train your customized finance topic classification model! Simple as that. No code required, as promised.
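If you do want to poke at a snapshot with code, counting how many examples landed in each category is a quick sanity check. The snapshot’s real column layout isn’t described here, so the `text` and `label` column names below are purely hypothetical; adjust them to match the CSV you actually download.

```python
import csv
import io
from collections import Counter


def label_counts(handle, label_column="label"):
    """Count how many rows carry each label in a CSV with a header row."""
    reader = csv.DictReader(handle)
    return Counter(row[label_column] for row in reader)


# Stand-in for a downloaded snapshot; the column names are assumptions.
sample_snapshot = io.StringIO(
    "text,label\n"
    "Pepsi names a new CFO,Corporate News\n"
    "Pepsi's stock price has risen,Stock News & Analysis\n"
    "Pepsi launches a new product line,Corporate News\n"
)

counts = label_counts(sample_snapshot)
for label, count in counts.most_common():
    print(label, count)
```

A heavily lopsided count is a hint that one category is too broad, or that another needs more labeled examples before you train.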
Your Finance Topic Tagger
Note that at the moment, you will need to run code to use your model, but you can do it in just a few lines of code (Python, in this example):
    import indicoio
    from indicoio.custom import Collection

    indicoio.config.api_key = 'YOUR_API_KEY'
    collection = Collection("YOUR_COLLECTION_NAME")

    # Make a batch prediction for several articles; be sure to specify the "finance" domain.
    articles = ["STRING_OF_ARTICLE1", "STRING_OF_ARTICLE2"]  # add more article strings as needed
    print(collection.predict(articles, domain="finance"))
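Once you have predictions, you can turn them into the kind of tag-based search described earlier. The sketch below assumes each prediction is a dictionary mapping a category name to a probability, one dictionary per article; check the docs for the exact response format. It runs on hand-written sample scores rather than calling the API, and the helper names are our own.

```python
def top_label(scores):
    """Return the category with the highest score for one article."""
    return max(scores, key=scores.get)


def filter_by_tag(articles, predictions, tag):
    """Return the articles whose highest-scoring category equals `tag`."""
    return [
        article
        for article, scores in zip(articles, predictions)
        if top_label(scores) == tag
    ]


# Hand-written sample data standing in for real articles and model output.
sample_articles = ["Pepsi names a new CFO", "Pepsi is a strong buy"]
sample_predictions = [
    {"Corporate News": 0.81, "Fundamental Analysis": 0.12, "Other": 0.07},
    {"Corporate News": 0.10, "Fundamental Analysis": 0.85, "Other": 0.05},
]

corporate_news = filter_by_tag(sample_articles, sample_predictions, "Corporate News")
print(corporate_news)
```

With real output, you would pass in the list returned by the prediction call in place of `sample_predictions`.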
We ran some accuracy tests to determine how well the model performed. First, we calculated the inter-labeler agreement rate (basically, a measure of how accurately our human selves performed at this task, based on multiple people validating a label): 60%. The model, however, categorized articles correctly 70% of the time — yay! Still, because the inter-labeler agreement wasn’t very high, it’s likely that we need clearer distinctions among categories. Essentially, the model is limited by the accuracy of our labeled training data, so improving inter-labeler agreement would help to improve the accuracy of the model itself.
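For reference, here is one simple way to compute both numbers. The exact formula behind our agreement rate isn’t spelled out above, so treat this as an assumption: agreement is the share of multiply-labeled examples where every labeler chose the same category, and accuracy is the share of model predictions that match the validated label. The data is a toy example, not our real results.

```python
def agreement_rate(labels_by_example):
    """Share of examples with two or more labels where all labelers agree."""
    multi = [labels for labels in labels_by_example.values() if len(labels) >= 2]
    if not multi:
        raise ValueError("need at least one example labeled by two or more people")
    unanimous = sum(1 for labels in multi if len(set(labels)) == 1)
    return unanimous / len(multi)


def accuracy(predicted, expected):
    """Share of predictions that match the expected label."""
    if not expected or len(predicted) != len(expected):
        raise ValueError("predicted and expected must be non-empty and the same length")
    correct = sum(1 for p, e in zip(predicted, expected) if p == e)
    return correct / len(expected)


# Toy data for illustration only.
toy_labels = {
    "a": ["Other", "Other"],
    "b": ["Corporate News", "Stock News & Analysis"],
    "c": ["Corporate News", "Corporate News"],
    "d": ["Sector/Market News", "Sector/Market News"],
    "e": ["Other"],  # ignored: only one labeler
}
toy_predicted = ["Other", "Corporate News", "Other", "Sector/Market News", "Other"]
toy_expected = ["Other", "Corporate News", "Corporate News", "Sector/Market News", "Other"]

print(agreement_rate(toy_labels))
print(accuracy(toy_predicted, toy_expected))
```

Keep in mind that the model is scored against human labels, so noisy labels cap how trustworthy the accuracy figure can be.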
Check out our docs for more details on using Custom Collections, or reach out to us through that little chat bubble if you have questions 🙂
There’s no doubt that machine learning is incredibly useful for staying afloat in this Information Age, with data being hurled at us from all directions. It can eliminate tedious tasks that prevent us from getting to the more important and interesting aspects of our jobs (whatever your job may be). Yet, many companies have been slow to adopt the tech because it’s so difficult to build customized solutions from scratch. Or rather, it was.
CrowdLabel and Custom Collections are now at your disposal.