Sunday, March 30, 2008

Toddler's play with HTML in Haskell

I just read a blog article entitled Kid's play with HTML in Haskell, where the author extracts some information from an HTML document, using the Haskell XML Toolbox. I have an alternative XML/HTML library, TagSoup, so I decided to implement their problem with my library.

The Problem

Given an HTML file, extract all hyperlinks to mp3 files.

In TagSoup


[mp3 | TagOpen "a" atts <- parseTags txt
, ("href",mp3) <- atts
, takeExtension mp3 == ".mp3"]


The code is a list comprehension. The first line says use TagSoup to parse the text, and pick all "a" links. The second line says pick all "href" attributes from the tag you matched. The final line uses the FilePath library to check the extension is mp3.

A Complete Program

The above fragment is all the TagSoup logic, but to match exact the interface to the original code, we can wrap it up as so:


import System.FilePath
import System.Environment
import Text.HTML.TagSoup

main = do
[src] <- getArgs
txt <- readFile src
mapM_ putStrLn [mp3 | TagOpen "a" atts <- parseTags txt
, ("href",mp3) <- atts
, takeExtension mp3 == ".mp3"]


Summary

If you have a desire to quickly get a bit of information out of some XML/HTML page, TagSoup may be the answer. It isn't intended to be a complete HTML framework, but it does nicely optimise fairly common patterns of use.

2 comments:

Anonymous said...

Very nice indeed. I have one more little HTML/XML-related problem to solve. I'll take the tagsoup-route first, I think it offers enough for that problem as well.

Anonymous said...
This comment has been removed by a blog administrator.