Create a Customized RSS Feed with indico Text Tags, Python, & HTML

April 28, 2015 | Developers, Text Data Use Case, Tutorials

Introduction

This tutorial describes how to create a customized, automated RSS feed for reddit using the indico Text Tags API with Python and HTML/CSS. This takes four steps:

  1. Setup [Python]
  2. Categorization and setting thresholds [Python]
  3. Access reddit’s RSS feed [Python]
  4. Create a basic front-end [HTML/CSS]

This tutorial assumes that you’re already familiar with Python and have registered for an API key.

Where to get help:
If you’re having trouble going through this tutorial, please email us at contact@indico.io.

Installation
Install all the packages needed for this tutorial by running the following in your terminal:

git clone https://github.com/IndicoDataSolutions/RSSCustomization.git
cd RSSCustomization
sudo python setup.py install
python server.py

Step 1: Setup [Python]

First, import Flask, a lightweight web framework for building simple web applications. You’ll need it to respond to requests sent to your app.

from flask import Flask, render_template

Next, import feedparser, which lets you access reddit’s RSS feed. If you don’t have it, install it with pip (pip install feedparser).

import feedparser

Finally, import indico’s Text Tags API, along with its batch variant, which Step 3 uses.

from indicoio import text_tags, batch_text_tags

Once you’ve imported everything, just set up your app to work with Flask.

app = Flask(__name__)
app.debug = True

Step 2: Categorization and Setting Thresholds [Python]

This step controls how strictly your application assigns categories. Before setting a threshold, though, you need to define the feed you’ll be accessing. We’ll use reddit here, but you can substitute any RSS feed of your choice.

feed = "http://www.reddit.com/.rss"

Now you’ll need to define a series of functions for processing the results from the RSS feed. We’ve broken it up this way so that it’s easier to make sense of what’s going on.

def thresholded(tags, minimum):
    """
    Remove all tags with probability less than 'minimum'
    """
    return dict((category, prob) for category, prob in tags.items()
                if prob > minimum)


def likely_tag(tags, minimum=0.1):
    """
    Threshold tags, then get the tag with the highest probability.
    If no tag probability exceeds the minimum, return the string 'none'
    """
    trimmed = thresholded(tags, minimum) or {'none': 0}
    return max(trimmed, key=lambda key: trimmed[key])

Use the minimum argument of likely_tag to set the threshold to your taste. For a broader definition of a category, for instance everything that is even tangentially related to technology, lower the threshold from 0.1 to 0.05. For a stricter definition, raise it to something like 0.5.

The max function will return the tag with the highest probability. The key argument within it tells the max function which value to use to compare tags — in this case, the probability output by the Text Tags API.
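To see how the threshold changes the result, here is a small sketch that runs the two functions on made-up scores. The sample_tags values are hypothetical; real scores come from the Text Tags API.

```python
def thresholded(tags, minimum):
    """Keep only the tags whose probability exceeds `minimum`."""
    return {category: prob for category, prob in tags.items() if prob > minimum}


def likely_tag(tags, minimum=0.1):
    """Return the most probable tag above `minimum`, or the string 'none'."""
    trimmed = thresholded(tags, minimum) or {'none': 0}
    return max(trimmed, key=lambda key: trimmed[key])


# Hypothetical scores for a single headline (not real API output).
sample_tags = {'technology': 0.32, 'science': 0.08, 'sports': 0.02}

print(thresholded(sample_tags, 0.05))  # technology and science survive
print(likely_tag(sample_tags))         # technology
print(likely_tag(sample_tags, 0.5))    # none
```

With the default of 0.1 only technology clears the bar. At 0.5 nothing does, so the fallback dictionary makes max return 'none'.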

This next function takes an entry returned by feedparser, categorizes its title with the Text Tags API, and keeps only the most likely tag. If no tag clears the threshold you’ve set, the entry is tagged 'none'.

def parsed(entry):
    """
    Strip unnecessary content from the return of feedparser,
    and augment with the output of indico's `text_tags` API
    """
    return {
        'title': entry['title'],
        'link': entry['link'],
        'tag': likely_tag(text_tags(entry['title']))
    }

In short, parsed drops every field except the title and link, and passes each article’s title through the API to get its tag.
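If you want to try this step without an API key, the sketch below swaps in a stand-in tagger. The fake_text_tags function and the tagger parameter are assumptions made for illustration; they are not part of indicoio, and the scores are invented.

```python
def fake_text_tags(title):
    """Hypothetical stand-in for indico's text_tags; returns fixed scores."""
    if 'python' in title.lower():
        return {'programming': 0.6, 'technology': 0.2}
    return {'news': 0.04}


def likely_tag(tags, minimum=0.1):
    """Return the most probable tag above `minimum`, or the string 'none'."""
    trimmed = {c: p for c, p in tags.items() if p > minimum} or {'none': 0}
    return max(trimmed, key=trimmed.get)


def parsed(entry, tagger=fake_text_tags):
    """Reduce a feed entry to its title, link, and most likely tag."""
    return {
        'title': entry['title'],
        'link': entry['link'],
        'tag': likely_tag(tagger(entry['title'])),
    }


# A made-up entry with extra fields, similar in shape to what feedparser returns.
entry = {
    'title': 'Python 3.5 released',
    'link': 'http://example.com/a',
    'summary': 'Release notes',
    'published': 'Mon, 27 Apr 2015',
}

result = parsed(entry)
print(result)
```

The extra summary and published fields are dropped, and the entry gains a tag of programming because that score is the highest one above the 0.1 threshold.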

Step 3: Access the RSS Feed [Python]

Now you can set up the main function that accesses reddit’s RSS feed, parses it and returns the data for each entry coming into your feed.

@app.route('/')
def main():
    entries = feedparser.parse(feed)['entries']
    titles = [entry.get('title') for entry in entries]
    title_tags = batch_text_tags(titles)
    for entry, tags in zip(entries, title_tags):
        entry['tags'] = tags
    entries = [parsed(entry) for entry in entries]
    # render template with additional jinja2 data
    return render_template('main.html', entries=entries)


if __name__ == '__main__':
    app.run()

The app.route('/') decorator is a bit of Flask logic that tells the web server to run the main function whenever the root URL is requested. The batch_text_tags function lets you send multiple examples in a single request, so you get results from the indico API quickly.
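The batching pattern in main can be tried offline with a stand-in. In this sketch fake_batch_text_tags and CANNED_SCORES are hypothetical; the only assumption about the real batch function is the one the code above relies on, namely that it returns one result per title, in the same order.

```python
# Hypothetical canned scores; the real batch_text_tags calls the indico API.
CANNED_SCORES = {
    'New Python release': {'programming': 0.7, 'technology': 0.2},
    'Local team wins final': {'sports': 0.8},
}


def fake_batch_text_tags(titles):
    """Stand-in for batch_text_tags: one dict of scores per title, in order."""
    return [CANNED_SCORES.get(title, {}) for title in titles]


entries = [
    {'title': 'New Python release', 'link': 'http://example.com/1'},
    {'title': 'Local team wins final', 'link': 'http://example.com/2'},
]

titles = [entry.get('title') for entry in entries]
title_tags = fake_batch_text_tags(titles)  # one call covers every title

# zip pairs each entry with the scores at the same position.
for entry, tags in zip(entries, title_tags):
    entry['tags'] = tags

for entry in entries:
    print(entry['title'], entry['tags'])
```

Because zip matches items by position, the approach only works if the results come back in the same order as the titles that were sent.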

Step 4: Create Basic Front-End [HTML/CSS]

Here’s some sample code to create your front-end.

The hashChanged function is triggered when you click a link that takes you to a hash URL. The function will pass in the text that occurs after the hash and filter your feed to contain only articles from that category.
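The filtering logic described above can be sketched in Python to make the behavior concrete. This is not the tutorial's front-end code, which runs in the browser; filter_by_hash is a hypothetical helper, and the entries are made up.

```python
def filter_by_hash(entries, url_hash):
    """Return entries whose tag matches the text after '#'.

    An empty hash means no filter, so every entry is returned.
    """
    category = url_hash.lstrip('#')
    if not category:
        return list(entries)
    return [entry for entry in entries if entry['tag'] == category]


# Made-up entries in the shape produced by parsed().
entries = [
    {'title': 'New Python release', 'tag': 'programming'},
    {'title': 'Local team wins final', 'tag': 'sports'},
    {'title': 'Compiler tricks', 'tag': 'programming'},
]

for entry in filter_by_hash(entries, '#programming'):
    print(entry['title'])
```

Here a hash of #programming keeps the first and third entries, while a hash that matches no tag yields an empty list.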

RSS Customization: Reddit's Home Page

indico Tag | Title
{% for entry in entries %}
{{ entry.tag }} | {{ entry.title }}
{% endfor %}

The loop and the {{ ... }} placeholders use Jinja2, a templating language for Python. Flask’s render_template fills them in with the entries list passed from main.
And there you have it: an automated, customized RSS feed using Python, indico and HTML.
