RSS is a great format for communicating the most recent stories, comments or changes on a site but does not usually go back further than the last 10 or 20 entries. I wanted a fairly large, historical data set of RSS feeds from a variety of services but was unable to find any freely available collections of this sort.
Fortunately, there are a number of libraries out there that make parsing feeds really easy. Couple these with some of the really nice ORM (object-relational mapping) frameworks and presto! My very own SQL database of RSS feeds.
If you need to do something similar, you are more than welcome to use my scripts. All I ask is that if you build an interesting collection of RSS feeds, you send me a quick email letting me know about it.
Feed Bag is a Ruby script that depends on two libraries that are not part of the standard library shipped with the Ruby 1.8.6 distribution: sequel and feed-normalizer.
Both libraries are easily installed as ruby gems. These ensure that any other dependencies are automatically installed. Once you have ruby gems installed on your machine, all you need to do is:
sudo gem install sequel
sudo gem install feed-normalizer
You will also need some SQL database manager to run the script. I’ve found SQLite to be a great database manager for quick and dirty projects like this one as you don’t have to worry about user management, servers and the like. I am currently using version 3.4.0 for Mac OS X.
Once you have all of the above dependencies, you can do one of two things:
Either way, you will get these three files:
feedbag.rb
: The main script.
models.rb
: Used to define the database schema and operations.
tally.sh
: A small bash script that calls sqlite to count entries.
There are three main modes of using Feed Bag: adding a new feed, listing the current feeds, and scanning existing feeds.
All feeds and entries are stored in an SQLite database that is created as required. SQLite databases are saved as a single file. All modes of usage must specify which SQLite file to use as the database through the -d option.
A complete list of options and a brief usage synopsis is available by calling feedbag.rb --help.
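To make the three modes concrete, here is a minimal sketch of how a command line like this could be handled with Ruby's standard OptionParser. It is not the actual code in feedbag.rb: the function name `parse_feedbag_args` and the returned hash layout are made up for illustration, and only the -d, -l and --list options mentioned here are assumed.

```ruby
require 'optparse'

# Hypothetical sketch of Feed Bag-style argument handling (not the real feedbag.rb).
# Returns the selected mode (:add, :list or :scan), the database file and any URLs.
def parse_feedbag_args(argv)
  options = { db: nil, list: false }
  parser = OptionParser.new do |opts|
    opts.banner = 'Usage: feedbag.rb -d FILE [-l] [URL ...]'
    opts.on('-d FILE', 'SQLite file to use as the database') { |f| options[:db] = f }
    opts.on('-l', '--list', 'List the feeds in the database') { options[:list] = true }
  end

  # parse returns the leftover non-option arguments, treated here as feed URLs.
  urls = parser.parse(argv)
  raise OptionParser::MissingArgument, '-d FILE' if options[:db].nil?

  mode = if options[:list]
           :list
         elsif urls.empty?
           :scan
         else
           :add
         end
  { mode: mode, db: options[:db], urls: urls }
end
```

With this sketch, `parse_feedbag_args(%w[-d news.db -l])` selects the list mode, while passing one or more URLs after the options selects the add mode. OptionParser supplies a --help option on its own.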
To add the feed for the ABC News coverage of Australian news to a new or existing database file called news.db, run this at your command prompt:
$ ./feedbag.rb -d news.db http://www.abc.net.au/news/indexes/idx-australia/rss.xml
Using news.db for Feed DB
Creating new feed for http://www.abc.net.au/news/indexes/idx-australia/rss.xml
The new feed is called 'ABC News : Australia'
If news.db does not exist it will be created in the current directory. The feed at the given URL will be parsed and stored in the database. However, no entries from the feed are read at this stage.
Multiple URLs for feeds can be added with one call to feedbag.rb.
To list the feeds that exist in a given database, their number of entries and when they were last checked, use the -l or --list option:
$ ./feedbag.rb -d news.db -l
Using news.db for Feed DB
1: ABC News : Australia (Checked: Thu Jan 01 00:00:00 +1000 1970) - 0
The first number is a unique identifier for the feed, the text after it is its name as it appears in the parsed RSS feed. Inside the brackets is the date and time of the most recently parsed entry in the feed (here it is set to the Unix epoch as there are no entries yet). The last number is the number of parsed entries.
A call to feedbag.rb with no arguments other than the -d option scans all the existing feeds in the database for new entries (i.e., those with dates after the currently most recent one):
$ ./feedbag.rb -d news.db
Using news.db for Feed DB
Scanning ABC News : Australia
Budget cuts force Centrelink to axe 2,000 jobs
Tripodi facing suspension over Scimone probe
Iemma promises crackdown on donations
[... 20 more titles ...]
Nuclear energy not yet an option: Wong
The output here shows the titles of each of the entries that were parsed and added to the database.
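The "dates after the currently most recent one" rule can be sketched in a few lines of plain Ruby. This is only an illustration of the idea, not the code from feedbag.rb: `ParsedEntry` and `new_entries` are hypothetical names, and the real script works with Sequel models rather than a Struct.

```ruby
require 'time'

# Hypothetical stand-in for one parsed feed entry.
ParsedEntry = Struct.new(:title, :url, :time)

# Keep only the entries dated after last_checked, oldest first, and return
# them together with the timestamp to remember for the next scan.
def new_entries(entries, last_checked)
  fresh = entries.select { |entry| entry.time > last_checked }.sort_by(&:time)
  newest = fresh.empty? ? last_checked : fresh.last.time
  [fresh, newest]
end
```

For a feed that has never been scanned, passing `Time.at(0)` (the Unix epoch) as `last_checked` lets every dated entry through, which matches the listing shown earlier for a freshly added feed.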
Listing the contents of the database now shows the following:
$ ./feedbag.rb -d news.db -l
Using news.db for Feed DB
1: ABC News : Australia (Checked: Fri Feb 22 16:39:00 +1100 2008) - 24
Feed Bag only scans and archives entries from RSS feeds when it is called; it does nothing more. To check for new entries periodically you will have to set up a crontab or launchd item, or some other script that calls feedbag.rb at regular intervals.
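If you would rather not touch crontab or launchd, the "some other script" option could look something like the following Ruby sketch. The names `scan_command` and `run_scans` are hypothetical, and the sketch assumes feedbag.rb is executable and sits in the current directory, as in the examples above.

```ruby
# Build the argument list for a plain scan: feedbag.rb with only -d given.
def scan_command(db_path, script = './feedbag.rb')
  [script, '-d', db_path]
end

# Run `rounds` scans, pausing `interval_seconds` between them.
# `runner` is injectable so the loop can be tried out without the real script.
def run_scans(db_path, interval_seconds, rounds, runner: method(:system))
  rounds.times do |i|
    ok = runner.call(*scan_command(db_path))
    warn "scan of #{db_path} failed" unless ok
    sleep(interval_seconds) if i < rounds - 1
  end
end
```

Calling `run_scans('news.db', 3600, 24)` would scan once an hour for a day. A scheduler such as cron is more robust for long-running collection, since it survives reboots and does not need a process kept alive.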
Once you have collected some data it’s up to you what you do with it. The easiest way to start processing it is via SQL. For example, if you wanted to count the number of entries in each feed you can use SQL like this:
$ sqlite3 news.db "select name, count(*) from feeds, entries where entries.feed_id = feeds.id group by feed_id;"
ABC News : Australia|24
This is effectively what the tally.sh script does. Alternatively, you could also write more powerful analysis scripts using Ruby and the Sequel framework (it’s really, really slick!).
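As a small first step towards analysis in Ruby, the sketch below turns the `name|count` rows printed by the sqlite3 command above into a Hash. It deliberately uses only the standard library and the command-line output rather than Sequel, and `parse_tally` is a made-up helper name.

```ruby
# Parse rows of the form "feed name|count", as printed by the sqlite3
# command-line tool in its default output mode, into a { name => count } Hash.
def parse_tally(output)
  output.each_line.each_with_object({}) do |line, counts|
    line = line.chomp
    next if line.empty?

    # Split on the last '|' so that feed names containing '|' stay intact.
    name, _separator, count = line.rpartition('|')
    counts[name] = Integer(count)
  end
end
```

The output of the query could then be captured with backticks and passed straight in, for example ``parse_tally(`./tally.sh`)``, assuming tally.sh prints rows in the same format.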
The database schema used by Feed Bag is concisely summarised in the Sequel domain-specific language, found in the models.rb file. The feeds:
class Feed < Sequel::Model(:feeds)
  set_schema do
    primary_key :id
    text :name
    text :url
    time :last_checked
    time :created
  end
end
And the entries:
class Entry < Sequel::Model(:entries)
  set_schema do
    primary_key :id
    text :url
    text :title
    text :content
    text :description
    time :time
    foreign_key :feed_id, :table => :feeds
    index :url
  end
end
Even if you don’t precisely understand what Sequel is doing here it’s pretty clear what the tables and columns are if you know a little SQL.
If you end up using Feed Bag, I’d love to hear about it — especially if you create a useful data set with it.
You can email me using the link at the bottom of this page, or leave me a comment at my blog: inductio ex machina.