
Paul's Time Sink

Fixed problem in Caltrain rss generator

Paul Westbrook | 09 October, 2006 13:15

Someone pointed out a problem with the RSS generated by the script I wrote to create the Caltrain RSS feed.  ISO-8859-1 characters were being included in the feed, even though the feed declared itself to be UTF-8 encoded.
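
To make the mismatch concrete, here is a minimal sketch using the core Encode module. The sample string is made up for illustration and is not taken from the Caltrain page.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample: "café" as raw ISO-8859-1 bytes (é is the single byte 0xE9).
my $latin1_bytes = "caf\xE9";

# Decode the bytes into a Perl character string, then encode that string as UTF-8.
my $chars      = decode("iso-8859-1", $latin1_bytes);
my $utf8_bytes = encode("UTF-8", $chars);

# Print both byte sequences as hex so the difference is visible.
printf "latin-1: %s\n", join " ", map { sprintf "%02X", ord } split //, $latin1_bytes;
printf "utf-8:   %s\n", join " ", map { sprintf "%02X", ord } split //, $utf8_bytes;
```

The first line prints 63 61 66 E9 and the second prints 63 61 66 C3 A9. A lone E9 byte is not a valid UTF-8 sequence, which is why a feed reader that trusts the utf-8 declaration chokes on it.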

The fix was to just convert the characters from ISO-8859-1 before parsing, so that the feed can be written out as UTF-8.  (The character encoding should really be determined from the document instead of hard-coding iso-8859-1.)
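
As a rough sketch of what determining the encoding could look like (this is not part of the script below), the charset can be read out of a Content-Type value, whether that value comes from the HTTP response header or from a meta tag in the page. The helper names charset_from_content_type and decode_page are made up for this example, and the fallback to iso-8859-1 is an assumption that matches the hard-coded value. LWP::Simple's get() only returns the body, so the header would have to be fetched some other way, for example with LWP::UserAgent.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# Hypothetical helper: pull the charset out of a Content-Type value such as
# "text/html; charset=ISO-8859-1". Returns undef if none is declared.
sub charset_from_content_type {
    my ($content_type) = @_;
    return undef unless defined $content_type;
    return $content_type =~ /charset\s*=\s*["']?([\w.:-]+)/i ? $1 : undef;
}

# Hypothetical helper: decode raw bytes using the declared charset, falling
# back to iso-8859-1 when the charset is missing or unknown to Encode.
sub decode_page {
    my ($bytes, $content_type) = @_;
    my $charset = charset_from_content_type($content_type);
    $charset = "iso-8859-1" unless defined $charset && find_encoding($charset);
    return decode($charset, $bytes);
}

# The same word arrives as different bytes depending on the declared charset,
# but decodes to the same character string either way.
my $from_utf8   = decode_page("caf\xC3\xA9", "text/html; charset=utf-8");
my $from_latin1 = decode_page("caf\xE9",     "text/html");
print "same text\n" if $from_utf8 eq $from_latin1;
```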

#!/usr/bin/perl -w
 
use strict;
use Encode qw(decode);   # provides decode(), used below
use LWP::Simple;
use HTML::TokeParser;
use XML::RSS;
 
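# First - LWP::Simple. Download the page using get().
# get() returns undef on failure and does not set $!, so die with our own message.
my $content = get( "http://www.caltrain.com/news.html" )
    or die "Could not fetch http://www.caltrain.com/news.html\n";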
 
# Decode the iso-8859-1 bytes into a Perl character string.
$content = decode("iso-8859-1", $content);
 
# Second - Create a TokeParser object, using our downloaded HTML.
my $stream = HTML::TokeParser->new( \$content ) or die $!;
 
# Finally - create the RSS object.
my $rss = XML::RSS->new( version => '0.9' );
 
# Prep the RSS.
$rss->channel(
    title        => "Caltrain news",
    link         => "http://www.caltrain.com/news.html",
    description  => "Latest caltrain news");
 
# Declare variables.
my ($tag, $headline, $url);
 
# First indication of a headline - an <a> tag is present.
 
while ( $tag = $stream->get_tag("a") ) {
    # Inside this loop, $tag is at an <a> tag.
 
    # But does it have a class="newstitle" attribute, too?
 
    if ($tag->[1]{class} and $tag->[1]{class} eq 'newstitle') {
        # We do!
 
        # Now, we're at the <a> with the headline in.
        # We need to put the contents of the 'href' token in $url.
        $url = $tag->[1]{href} || "--";
 
        # Now we can grab $headline, by using get_trimmed_text
        # up to the close of the <a> tag.
        $headline = $stream->get_trimmed_text('/a');
 
        # We need to escape ampersands, as they start entity references in XML.
        $url =~ s/&/&amp;/g;
 
        # The <a> tags contain relative URLs - we need to qualify these.
        $url = 'http://www.caltrain.com/'.$url;
 
        # And that's it. We can add our pair to the RSS channel.
        $rss->add_item( title => $headline, link => $url);
    }
}
 
$rss->save("caltrain.rss");
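
The script qualifies links by plain string concatenation, which assumes every href is relative to the site root and has no leading slash. A possible alternative, shown here only as a sketch, is to resolve each href against the page URL with the URI module (not a core module, but it is installed as a dependency of LWP). The helper name qualify_url and the sample hrefs are made up for illustration.

```perl
use strict;
use warnings;
use URI;

# Base URL taken from the script above.
my $base = "http://www.caltrain.com/news.html";

# Hypothetical helper: resolve an href against the base URL.
# Hrefs that are already absolute come back unchanged.
sub qualify_url {
    my ($href) = @_;
    return URI->new_abs($href, $base)->as_string;
}

# Made-up sample hrefs: page-relative, root-relative, and already absolute.
for my $href ("news_item.html", "/schedules.html", "http://example.com/other.html") {
    print "$href -> ", qualify_url($href), "\n";
}
```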
