Monday, October 9, 2006

Fixed problem in Caltrain rss generator


Someone pointed out a problem with the rss that was being generated from the script that I wrote to create the Caltrain rss feed.  The problem was that iso-8859-1 characters were being included the rss feed, when the feed stated that it was utf-8 encoded.


The fix was to just convert the characters to utf-8 before parsing.  (The character encodeing should really be determined from the document, instead of hard coding iso-8859-1.)






#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::TokeParser;
use XML::RSS;
# First - LWP::Simple. Download the page using get();.
my $content = get( "http://www.caltrain.com/news.html" ) or die $!;
# convert the string from iso-8859-1 to utf-8
$content = decode("iso-8859-1", $content);
# Second - Create a TokeParser object, using our downloaded HTML.
my $stream = HTML::TokeParser->new( \$content ) or die $!;
# Finally - create the RSS object.
my $rss = XML::RSS->new( version => '0.9' );
# Prep the RSS.
$rss->channel(
title => "Caltrain news",
link => "http://www.caltrain.com/news.html",
description => "Latest caltrain news");
# Declare variables.
my ($tag, $headline, $url);
# First indication of a headline - A <div> tag is present.
while ( $tag = $stream->get_tag("a") ) {
# Inside this loop, $tag is at a <a> tag.
# But do we have a "class="newstitle">" token, too?
if ($tag->[1]{class} and $tag->[1]{class} eq 'newstitle') {
# We do!
# Now, we're at the <a> with the headline in.
# We need to put the contents of the 'href' token in $url.
$url = $tag->[1]{href} || "--";
# Now we can grab $headline, by using get_trimmed_text
# up to the close of the <a> tag.
$headline = $stream->get_trimmed_text('/a');
# We need to escape ampersands, as they start entity references in XML.
$url =~ s/&/&/g;
# The <a> tags contain relative URLs - we need to qualify these.
$url = 'http://www.caltrain.com/'.$url;
# And that's it. We can add our pair to the RSS channel.
$rss->add_item( title => $headline, link => $url);
}
}
$rss->save("caltrain.rss");

No comments:

Post a Comment

Unlocking Raspberry Pi Potential: Navigating Network Booting Challenges for Enhanced Performance and Reliability

I've set up several Raspberry Pis around our house for various projects, but one recurring challenge is the potential for SD card failur...