Tuesday, September 19, 2006

Caltrain news RSS feed


RSS seems to be a mechanism that gives many types of devices access to structured data.  This is especially good for mobile phones, where the browser can render the information in a form that is readable on a small screen.


One example of this is the Caltrain news updates.  When I am waiting at the train station, I want an easy way to check the status of the trains.  Caltrain has a web page that displays this information, but it doesn't offer an RSS feed.


I wrote a script that parses the HTML and creates an RSS feed.  It is based on the example given on this page.  Here is the resulting RSS feed.  (It will stay up as long as Caltrain doesn't ask me to take it down.)  There are two things that I want to fix in the script:


  1. Keep posts from being marked as unread each time the script runs again.

  2. Add a summary of the body to the RSS feed.
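
For the first item, here is a minimal sketch of one possible approach. It assumes (I have not verified this) that readers re-flag the posts because the feed file is rewritten on every run, which changes its modification time even when the news page has not changed. The helper name write_if_changed is hypothetical and not part of the script; it uses only core Perl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): write $content to
# $path only when it differs from what is already stored there.
# Returns 1 if the file was written, 0 if it was left untouched.
sub write_if_changed {
    my ($path, $content) = @_;

    if (-e $path) {
        open my $in, '<', $path or die "Cannot read $path: $!";
        local $/;                  # slurp mode: read the whole file at once
        my $old = <$in>;
        close $in;
        return 0 if defined $old && $old eq $content;
    }

    open my $out, '>', $path or die "Cannot write $path: $!";
    print {$out} $content;
    close $out or die "Cannot close $path: $!";
    return 1;
}
```

In the script, this would stand in for the final save call, for example write_if_changed("caltrain.rss", $rss->as_string). If the reader decides what is new by some other rule, this will not help.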






#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::TokeParser;
use XML::RSS;

# First - LWP::Simple. Download the page using get().
# get() returns undef on failure and does not set $!, so die with our own message.
my $content = get( "http://www.caltrain.com/news.html" )
    or die "Could not fetch the Caltrain news page\n";

# Second - Create a TokeParser object, using our downloaded HTML.
my $stream = HTML::TokeParser->new( \$content )
    or die "Could not parse the downloaded HTML\n";

# Finally - create the RSS object.
my $rss = XML::RSS->new( version => '0.9' );

# Prep the RSS.
$rss->channel(
    title       => "Caltrain news",
    link        => "http://www.caltrain.com/news.html",
    description => "Latest Caltrain news",
);

# Declare variables.
my ($tag, $headline, $url);

# First indication of a headline - an <a> tag is present.
while ( $tag = $stream->get_tag("a") ) {
    # Inside this loop, $tag is at an <a> tag.
    # But does it have a class="newstitle" attribute, too?
    if ($tag->[1]{class} and $tag->[1]{class} eq 'newstitle') {
        # We do!
        # Now, we're at the <a> with the headline in.
        # We need to put the contents of the 'href' attribute in $url.
        $url = $tag->[1]{href} || "--";

        # Now we can grab $headline, by using get_trimmed_text
        # up to the close of the <a> tag.
        $headline = $stream->get_trimmed_text('/a');

        # We need to escape ampersands, as they start entity references in XML.
        $url =~ s/&/&amp;/g;

        # The <a> tags contain relative URLs - we need to qualify these.
        $url = 'http://www.caltrain.com/' . $url;

        # And that's it. We can add our pair to the RSS channel.
        $rss->add_item( title => $headline, link => $url );
    }
}

$rss->save("caltrain.rss");
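
For the second item on the list, add_item in XML::RSS also accepts a description field, so a summary could be passed alongside the title and link. The sketch below covers only the shortening step. The helper name summarize and the 200-character default are my own assumptions, and it presumes the body text of each article has already been fetched into a string, which the script above does not yet do.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse whitespace and shorten $text to at most
# $max characters, cutting at a word boundary where possible.
# An ellipsis is appended after the cut, so the result can be up to
# $max + 3 characters long.
sub summarize {
    my ($text, $max) = @_;
    $max = 200 unless defined $max;

    $text =~ s/\s+/ /g;            # collapse runs of whitespace
    $text =~ s/^ | $//g;           # trim leading and trailing space
    return $text if length($text) <= $max;

    # Take one extra character so a word ending exactly at $max is kept.
    my $cut = substr($text, 0, $max + 1);
    $cut =~ s/\s+\S*$//;           # drop the trailing space or partial word
    $cut = substr($cut, 0, $max) if length($cut) > $max;   # one very long word
    return $cut . '...';
}
```

With a hypothetical $body variable holding the article text, the call in the loop would become $rss->add_item( title => $headline, link => $url, description => summarize($body) ).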
