Today I noticed a problem in the RSS feed generated by the script that I wrote: the feed included an invalid character (character code 0x93). The cause was that the page declares its character set as iso-8859-1, even though it actually contains characters from the cp1250 character set.
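To make the mismatch concrete, here is a minimal sketch (not part of the original script) that decodes the single byte 0x93 both ways using the core Encode module. The variable names are just for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The raw byte that showed up in the feed.
my $raw = "\x93";

# Read as iso-8859-1, 0x93 becomes U+0093, a C1 control character.
my $as_latin1 = decode("iso-8859-1", $raw);

# Read as cp1250, the same byte becomes U+201C, a left double quotation mark.
my $as_cp1250 = decode("cp1250", $raw);

printf "iso-8859-1: U+%04X\n", ord($as_latin1);   # prints "iso-8859-1: U+0093"
printf "cp1250:     U+%04X\n", ord($as_cp1250);   # prints "cp1250:     U+201C"

# Once decoded with the right character set, it can be written out as UTF-8.
my $utf8 = encode("UTF-8", $as_cp1250);
printf "UTF-8 bytes: %s\n",
    join(" ", map { sprintf("%02X", ord($_)) } split(//, $utf8));   # prints "UTF-8 bytes: E2 80 9C"
```

Decoding with the character set the page really uses, rather than the one it claims, is what turns the stray control character into an ordinary quotation mark.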
Here is the updated script. I also had to apply the patch described on this page, which allows the RSS module to handle multi-byte characters.
    #!/usr/bin/perl -w
    use strict;
    use Encode;
    use LWP::Simple;
    use HTML::TokeParser;
    use XML::RSS;

    # First - LWP::Simple. Download the page using get().
    my $content = get("http://www.caltrain.com/news.html")
        or die "Could not download the news page";

    # The page claims iso-8859-1 but contains cp1250 characters,
    # so decode the downloaded bytes as cp1250.
    $content = decode("cp1250", $content, Encode::FB_HTMLCREF);

    # Second - Create a TokeParser object, using our downloaded HTML.
    my $stream = HTML::TokeParser->new(\$content) or die $!;

    # Finally - create the RSS object.
    my $rss = XML::RSS->new(version => '0.9');

    # Prep the RSS.
    $rss->channel(
        title       => "Caltrain news",
        link        => "http://www.caltrain.com/news.html",
        description => "Latest caltrain news");

    # Declare variables.
    my ($tag, $headline, $url);

    # First indication of a headline - an <a> tag is present.
    while ($tag = $stream->get_tag("a")) {

        # Inside this loop, $tag is at an <a> tag.
        # But does it have class="newstitle", too?
        if ($tag->[1]{class} and $tag->[1]{class} eq 'newstitle') {

            # It does! We're at the <a> with the headline in.
            # Put the contents of the 'href' attribute in $url.
            $url = $tag->[1]{href} || "--";

            # Now we can grab $headline, by using get_trimmed_text
            # up to the close of the <a> tag.
            $headline = $stream->get_trimmed_text('/a');

            # We need to escape ampersands, as they start entity references in XML.
            $url =~ s/&/&amp;/g;

            # The <a> tags contain relative URLs - we need to qualify these.
            $url = 'http://www.caltrain.com/' . $url;

            # And that's it. We can add our pair to the RSS channel.
            $rss->add_item(title => $headline, link => $url);
        }
    }

    $rss->save("caltrain.rss");
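The two URL clean-up steps in the script (escaping ampersands and qualifying relative links) can also be tried on their own. The sketch below is only an illustration: `escape_ampersands` and `qualify_url` are hypothetical helper names, the sample href is made up, and it assumes the site root used in the script. Unlike the plain substitution above, it leaves ampersands alone when they already start an entity.

```perl
use strict;
use warnings;

# Hypothetical helper: escape bare ampersands for XML,
# skipping ones that already begin an entity such as &amp; or &#38;.
sub escape_ampersands {
    my ($text) = @_;
    $text =~ s/&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)/&amp;/g;
    return $text;
}

# Hypothetical helper: prefix a relative URL with the site root,
# and return URLs that already have a scheme unchanged.
sub qualify_url {
    my ($url, $base) = @_;
    return $url if $url =~ m{^[a-z][a-z0-9+.-]*://}i;
    $url =~ s{^/}{};    # avoid a double slash after the base
    return $base . $url;
}

my $base = 'http://www.caltrain.com/';

# Made-up href, only to exercise the helpers.
my $url = qualify_url(escape_ampersands('news.html?id=1&page=2'), $base);
print "$url\n";    # prints "http://www.caltrain.com/news.html?id=1&amp;page=2"
```

Whether the extra escaping is needed depends on how the installed XML::RSS version encodes item fields, so it is worth checking the generated feed for double-escaped ampersands.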
Technorati tags: Caltrain, perl, rss
This work is licensed under a Creative Commons License.