Wednesday, October 11, 2006

Another fix for caltrain rss script


Today I noticed a problem in the rss feed generated by the script that I wrote.  An invalid character (character code 0x93) was included in the feed.  The problem was that the page stated that the character set was iso-8859-1, even though there are characters in the cp1250 character set.



Here is the updated script. I also had to apply the patch described on this page that allows the RSS module handle multiple byte characters






#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::TokeParser;
use XML::RSS;
# First - LWP::Simple. Download the page using get();.
my $content = get( "http://www.caltrain.com/news.html" ) or die $!;
# convert the string from iso-8859-1 to utf-8
$content = decode("cp1250", $content, Encode::FB_HTMLCREF);
# Second - Create a TokeParser object, using our downloaded HTML.
my $stream = HTML::TokeParser->new( \$content ) or die $!;
# Finally - create the RSS object.
my $rss = XML::RSS->new( version => '0.9' );
# Prep the RSS.
$rss->channel(
title => "Caltrain news",
link => "http://www.caltrain.com/news.html",
description => "Latest caltrain news");
# Declare variables.
my ($tag, $headline, $url);
# First indication of a headline - A <div> tag is present.
while ( $tag = $stream->get_tag("a") ) {
# Inside this loop, $tag is at a <a> tag.
# But do we have a "class="newstitle">" token, too?
if ($tag->[1]{class} and $tag->[1]{class} eq 'newstitle') {
# We do!
# Now, we're at the <a> with the headline in.
# We need to put the contents of the 'href' token in $url.
$url = $tag->[1]{href} || "--";
# Now we can grab $headline, by using get_trimmed_text
# up to the close of the <a> tag.
$headline = $stream->get_trimmed_text('/a');
# We need to escape ampersands, as they start entity references in XML.
$url =~ s/&/&/g;
# The <a> tags contain relative URLs - we need to qualify these.
$url = 'http://www.caltrain.com/'.$url;
# And that's it. We can add our pair to the RSS channel.
$rss->add_item( title => $headline, link => $url);
}
}
$rss->save("caltrain.rss");

No comments:

Post a Comment

Unlocking Raspberry Pi Potential: Navigating Network Booting Challenges for Enhanced Performance and Reliability

I've set up several Raspberry Pis around our house for various projects, but one recurring challenge is the potential for SD card failur...