
Paul's Time Sink


Another fix for caltrain rss script

Paul Westbrook | 11 October, 2006 21:32

Today I noticed a problem in the RSS feed generated by the script that I wrote: the feed contained an invalid character (character code 0x93). The source page declares its character set as iso-8859-1, but it actually contains characters from the cp1250 character set. In cp1250, 0x93 is a left double quotation mark; in iso-8859-1 the same byte is a control code rather than a printable character.
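
To make the mismatch concrete, here is a minimal sketch (separate from the feed script) that decodes the byte 0x93 under both character sets and prints the resulting Unicode code point. The helper name code_point_for is made up for this illustration; the only library call is decode() from the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: return the Unicode code point that a single
# byte maps to under the given encoding, formatted as "U+XXXX".
sub code_point_for {
    my ($encoding, $octets) = @_;
    my $char = decode($encoding, $octets);
    return sprintf("U+%04X", ord($char));
}

# The byte that showed up in the feed.
my $byte = "\x93";

# Under iso-8859-1 the byte is a C1 control code.
printf "iso-8859-1: %s\n", code_point_for("iso-8859-1", $byte);   # U+0093

# Under cp1250 the byte is a left double quotation mark.
printf "cp1250:     %s\n", code_point_for("cp1250", $byte);       # U+201C
```

Decoding with the character set the page really uses is what turns the stray byte back into an ordinary quotation mark.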

Here is the updated script. I also had to apply the patch described on this page, which allows the RSS module to handle multibyte characters.

#!/usr/bin/perl -w
 
use strict;
use Encode;    # provides decode() and the Encode::FB_* fallback constants
use LWP::Simple;
use HTML::TokeParser;
use XML::RSS;
 
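# First - LWP::Simple. Download the page using get().
# get() returns undef on failure, so die with our own message.
my $content = get( "http://www.caltrain.com/news.html" )
    or die "Could not fetch http://www.caltrain.com/news.html\n";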
 
# Decode the raw bytes as cp1250. The page declares iso-8859-1,
# but it contains cp1250 characters such as 0x93.
$content = decode("cp1250", $content, Encode::FB_HTMLCREF);
 
# Second - Create a TokeParser object, using our downloaded HTML.
my $stream = HTML::TokeParser->new( \$content ) or die $!;
 
# Finally - create the RSS object.
my $rss = XML::RSS->new( version => '0.9' );
 
# Prep the RSS.
$rss->channel(
    title        => "Caltrain news",
    link         => "http://www.caltrain.com/news.html",
    description  => "Latest caltrain news");
 
# Declare variables.
my ($tag, $headline, $url);
 
# First indication of a headline - An <a> tag is present.
 
while ( $tag = $stream->get_tag("a") ) {
    # Inside this loop, $tag is at a <a> tag.
 
    # But do we have a "class="newstitle">" token, too?
 
    if ($tag->[1]{class} and $tag->[1]{class} eq 'newstitle') {
        # We do!
 
        # Now, we're at the <a> with the headline in.
        # We need to put the contents of the 'href' token in $url.
        $url = $tag->[1]{href} || "--";
 
        # Now we can grab $headline, by using get_trimmed_text
        # up to the close of the <a> tag.
        $headline = $stream->get_trimmed_text('/a');
 
        # We need to escape ampersands, as they start entity references in XML.
        $url =~ s/&/&amp;/g;
 
        # The <a> tags contain relative URLs - we need to qualify these.
        $url = 'http://www.caltrain.com/'.$url;
 
        # And that's it. We can add our pair to the RSS channel.
        $rss->add_item( title => $headline, link => $url);
    }
}
 
$rss->save("caltrain.rss");
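
The script qualifies links by prepending the site root, which works as long as every href is a plain relative path. As a hedged alternative, the sketch below resolves each href against the page URL with URI->new_abs, which also copes with root-relative and already-absolute links. The sample markup, the sub name extract_headlines, and the hrefs are invented for illustration; they are not taken from the real Caltrain page.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use URI;

# Hypothetical sample markup standing in for the real news page.
my $html = <<'HTML';
<div><a class="newstitle" href="news_item.html">Schedule change</a></div>
<div><a href="other.html">Not a headline</a></div>
<div><a class="newstitle" href="/alerts/delay.html">  Delays   expected </a></div>
HTML

# Hypothetical helper: return a list of [headline, absolute URL] pairs,
# one for each <a class="newstitle"> link found in the markup.
sub extract_headlines {
    my ($markup, $base) = @_;
    my $stream = HTML::TokeParser->new(\$markup)
        or die "Could not create parser\n";
    my @items;
    while (my $tag = $stream->get_tag("a")) {
        my $attr = $tag->[1];    # hash ref of the tag's attributes
        next unless defined $attr->{class} && $attr->{class} eq 'newstitle';
        next unless defined $attr->{href};

        # Text up to the closing </a>, with whitespace collapsed.
        my $headline = $stream->get_trimmed_text('/a');

        # Resolve the href against the URL of the page it came from.
        my $url = URI->new_abs($attr->{href}, $base)->as_string;

        push @items, [$headline, $url];
    }
    return @items;
}

for my $item (extract_headlines($html, "http://www.caltrain.com/news.html")) {
    print "$item->[0] => $item->[1]\n";
}
```

With this sample input the loop prints two lines, one per newstitle link, and skips the link without the class. URI is not a core module, but it is installed as a dependency of LWP, which the script already requires.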

