Introduction

This tutorial describes how to create a customized, automated RSS feed for reddit using the indico Text Tags API with Python and HTML/CSS. This takes four steps:

  1. Setup [Python]
  2. Categorization and setting thresholds [Python]
  3. Access reddit’s RSS feed [Python]
  4. Create a basic front-end [HTML/CSS]

This tutorial assumes that you’re already familiar with Python and have registered for an API key.

Where to get help:
If you’re having trouble going through this tutorial, please email us at contact@indico.io.

Installation
Install all the packages needed for this tutorial and start the server by running the following in your terminal:

git clone https://github.com/IndicoDataSolutions/RSSCustomization.git
cd RSSCustomization
sudo python setup.py install
python server.py

Step 1: Setup [Python]

First, import Flask, a web framework that provides a set of tools for building simple, lightweight web applications. You'll need this library to respond to requests sent to your app.

from flask import Flask, render_template

Next, import feedparser, a Python library that will let you access reddit's RSS feed. If you don't have it, install it with pip (the package is named feedparser).

import feedparser

Finally, import indico's Text Tags API, along with the batch version that Step 3 uses.

from indicoio import text_tags, batch_text_tags

Once you’ve imported everything, just set up your app to work with Flask.

app = Flask(__name__)
app.debug = True

Step 2: Categorization and Setting Thresholds [Python]

This step lets you control how strictly posts are matched to categories. Before you set a threshold, though, you need to define the feed you'll be accessing. We'll be using reddit in this case, but you can substitute any RSS feed of your choice.

feed = "http://www.reddit.com/.rss"

Now you’ll need to define a series of functions for processing the results from the RSS feed. We’ve broken it up this way so that it’s easier to make sense of what’s going on.

def thresholded(tags, minimum):
    """
    Remove all tags with probability less than 'minimum'
    """
    return dict((category, prob) for category, prob in tags.items()
                if prob > minimum)

def likely_tag(tags, minimum=0.1):
    """
    Threshold tags, then get the tag with the highest probability.
    If no tag probability exceeds the minimum, return the string 'none'
    """
    trimmed = thresholded(tags, minimum) or {'none': 0}
    return max(trimmed, key=lambda key: trimmed[key])

Use the minimum argument of the likely_tag function to set the threshold to your personal tastes. If you want a broader definition of a category (for instance, everything that is even tangentially related to technology), lower the threshold from 0.1 to 0.05. If you want it to be more specific, raise the threshold to something like 0.5.

The max function returns the tag with the highest probability. Its key argument tells max which value to use when comparing tags: in this case, the probability output by the Text Tags API.
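To see how the threshold changes the result, here is a minimal sketch that runs the two functions on a made-up set of probabilities. The sample_tags values are hypothetical and chosen only for illustration; real values come from the Text Tags API.

```python
def thresholded(tags, minimum):
    """Keep only the tags whose probability exceeds 'minimum'."""
    return {category: prob for category, prob in tags.items() if prob > minimum}


def likely_tag(tags, minimum=0.1):
    """Return the most probable tag above 'minimum', or the string 'none'."""
    trimmed = thresholded(tags, minimum) or {'none': 0}
    return max(trimmed, key=lambda key: trimmed[key])


# Hypothetical probabilities, made up for this example.
sample_tags = {'technology': 0.32, 'gaming': 0.07, 'science': 0.12}

print(thresholded(sample_tags, 0.05))  # loose threshold: all three tags survive
print(thresholded(sample_tags, 0.1))   # default threshold: 'gaming' is dropped
print(likely_tag(sample_tags))         # technology
print(likely_tag(sample_tags, 0.5))    # none (nothing exceeds 0.5)
```

An empty dictionary is falsy in Python, which is why the `or {'none': 0}` fallback kicks in when no tag clears the threshold.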

This next function takes an entry returned by feedparser, keeps only the fields you need, and labels it with the most likely tag from the Text Tags API. If no tag clears the threshold that you've set, the entry is labeled 'none'.

def parsed(entry):
    """
    Strip unnecessary content from the return of feedparser,
    and augment with the output of indico's `text_tags` API
    """
    return {
        'title': entry['title'],
        'link': entry['link'],
        'tag': likely_tag(entry['tags'])  # probabilities attached in main() by the batch call
    }

Essentially, this keeps only each article's title and link, and attaches the most likely tag for its title.
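As a rough illustration of what comes out, the sketch below runs a version of parsed on a hand-written entry. Everything in raw_entry is hypothetical, and this version assumes the tag probabilities are already stored under a 'tags' key, so it makes no API call.

```python
def likely_tag(tags, minimum=0.1):
    """Most probable tag above 'minimum', or 'none' if nothing qualifies."""
    trimmed = {tag: p for tag, p in tags.items() if p > minimum} or {'none': 0}
    return max(trimmed, key=trimmed.get)


def parsed(entry):
    """Keep only the title, link, and most likely tag of an entry."""
    return {
        'title': entry['title'],
        'link': entry['link'],
        # Assumption: probabilities were stored under 'tags' beforehand.
        'tag': likely_tag(entry['tags']),
    }


# Hypothetical feed entry; real entries carry many more fields.
raw_entry = {
    'title': 'Open-source compiler hits 1.0',
    'link': 'http://example.com/compiler',
    'summary': 'A long HTML summary...',
    'published': 'Thu, 01 Jan 2015 00:00:00 GMT',
    'tags': {'programming': 0.47, 'technology': 0.31},
}

print(parsed(raw_entry))
```

The result contains only 'title', 'link', and 'tag', which are the three fields a template needs to render a row.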

Step 3: Access the RSS Feed [Python]

Now you can set up the main function that accesses reddit’s RSS feed, parses it and returns the data for each entry coming into your feed.

@app.route('/')
def main():
    # fetch the feed and collect the title of every entry
    entries = feedparser.parse(feed)['entries']
    titles = [entry.get('title') for entry in entries]

    # tag all titles in a single request, then attach each result to its entry
    title_tags = batch_text_tags(titles)
    for entry, tags in zip(entries, title_tags):
        entry['tags'] = tags

    # keep only the fields the template needs
    entries = [parsed(entry) for entry in entries]

    # render template with additional jinja2 data
    return render_template('main.html', entries=entries)

if __name__ == '__main__':
    app.run()

The app.route decorator tells Flask to run the main function whenever a request for the root URL ('/') arrives. The batch_text_tags function lets you send multiple examples in a single request, so you get results from the indico API faster than you would with one request per title.
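The pairing step in main can be tried offline with a stand-in for the batch call. In this sketch, fake_batch_text_tags and its canned probabilities are hypothetical. It assumes, as the zip in main does, that the batch call returns one result per title in the same order as the input.

```python
def fake_batch_text_tags(titles):
    """Hypothetical stand-in for batch_text_tags: returns one dict of
    tag probabilities per title, in the same order as the input."""
    canned = {
        'New GPU benchmarks released': {'technology': 0.41, 'gaming': 0.22},
        'Local team wins the final': {'sports': 0.58, 'news': 0.09},
    }
    return [canned.get(title, {}) for title in titles]


# Hypothetical entries shaped like the ones feedparser returns.
entries = [
    {'title': 'New GPU benchmarks released', 'link': 'http://example.com/1'},
    {'title': 'Local team wins the final', 'link': 'http://example.com/2'},
]

titles = [entry.get('title') for entry in entries]
title_tags = fake_batch_text_tags(titles)  # one call for the whole feed

# zip pairs the first entry with the first result, and so on.
for entry, tags in zip(entries, title_tags):
    entry['tags'] = tags

for entry in entries:
    print(entry['title'], '->', entry['tags'])
```

Note that zip stops at the shorter of its two inputs, so if a batch call ever returned fewer results than titles, the trailing entries would be left without a 'tags' key.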

Step 4: Create Basic Front-End [HTML/CSS]

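Here's some sample code to create your front-end. Save it as main.html in your app's templates folder, which is where Flask's render_template looks for it.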

<head>
        <link rel="stylesheet" href="http://yui.yahooapis.com/pure/0.5.0/pure-min.css">
        <link rel="stylesheet" href="/static/css/main.css">
        <script src="http://zeptojs.com/zepto.min.js"></script>
        <script>
            var hashChanged = function(hashlink) {
                // update table when url fragment changes

                // strip leading hash
                var tag = hashlink.substring(1);

                if (tag === "") {
                    // show all links
                    $('tr').show();
            
                } else {
                    // show only links from given category
                    $('tr').hide();
                    $('.' + tag).show();     
                }
            }
            // supported in nearly all modern browsers
            window.onhashchange = function () {
                hashChanged(window.location.hash); 
            }
        </script>
</head>

The hashChanged function is triggered when you click a link that takes you to a hash URL. The function will pass in the text that occurs after the hash and filter your feed to contain only articles from that category.

<h1>RSS Customization: Reddit's Home Page</h1>

    <table class="pure-table">
        <thead>
            <tr>
                <th>indico Tag</th>
                <th>Title</th>
            </tr>
        </thead>

        <tbody>
            {% for entry in entries %}
                <tr class="{{ entry.tag }}">
                    <td><a href="#{{ entry.tag }}" class="tag">{{ entry.tag }}</a></td>
                    <td><a href="{{ entry.link }}">{{ entry.title }}</a></td>
                </tr>
            {% endfor %}
        </tbody>
    </table>

The code within the <tbody> element uses Jinja2, the templating language that Flask uses, to generate one table row for each entry passed to render_template.

And there you have it: an automated, customized RSS feed using Python, indico and HTML.
