Downloading stock prices in F# - Part II - Html scraping

Luca Bolognese - Sep 2008

Getting stock prices and dividends is relatively easy given that, on Yahoo, you can get the info as a CSV file. Getting the splits info is harder. You would think that Yahoo would put that info in the dividends CSV, as it does when it displays it on screen, but it doesn’t. So I had to write code to scrape it from the multiple web pages where it might reside. In essence, I’m scraping this.

html.fs

In this file there are utility functions that I will use later on to retrieve split info.

#light
open System
open System.IO
open System.Text.RegularExpressions
// It assumes no table inside table ...
let tableExpr = "<table[^>]*>(.*?)</table>"
let headerExpr = "<th[^>]*>(.*?)</th>"
let rowExpr = "<tr[^>]*>(.*?)</tr>"
let colExpr = "<td[^>]*>(.*?)</td>"
let regexOptions = RegexOptions.Multiline ||| RegexOptions.Singleline 
                                          ||| RegexOptions.IgnoreCase

This code is straightforward enough (if you know what Regex does). I’m sure that there are better expressions to scrape tables and rows on the web, but these work in my case. I really don’t need to scrape tables. I put the table expression there in case you need it.
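To see what the options buy you, here is a small sketch that runs the cell expression over a made-up fragment. The sample string and the names cells and cellsNoSingleline are mine, just for illustration; the fragment has uppercase tags and a cell that spans two lines.

```fsharp
open System.Text.RegularExpressions

// Same cell pattern and options as in html.fs.
let colExpr = "<td[^>]*>(.*?)</td>"
let regexOptions = RegexOptions.Multiline ||| RegexOptions.Singleline
                                          ||| RegexOptions.IgnoreCase

// Hypothetical fragment: uppercase tags, and a newline inside the first cell.
let sample = "<TR><TD class=\"x\">2:1\nsplit</TD><td>Jun 2005</td></TR>"

// With all the options on, both cells are found.
let cells =
    [ for m in Regex.Matches(sample, colExpr, regexOptions) -> m.Groups.[1].Value ]

// Without Singleline, '.' does not match '\n', so the first cell is skipped.
let cellsNoSingleline =
    [ for m in Regex.Matches(sample, colExpr, RegexOptions.IgnoreCase) -> m.Groups.[1].Value ]

printfn "%A" cells
printfn "%A" cellsNoSingleline
```

The lazy quantifier (.*?) is what stops each match at the first closing tag instead of swallowing everything up to the last one.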

I then write code to scrape all the cells in a piece of html:

let scrapHtmlCells html =
  seq { for x in Regex.Matches(html, colExpr, regexOptions) -> x.Groups.Item(1).ToString()}

This is a sequence expression. Sequence expressions are used to generate sequences starting from some expression (as the name hints). In this case Regex.Matches returns a MatchCollection, which is a non-generic IEnumerable. For each match in it, we return the value of the first capture group, which is the content of the cell. We could as easily have constructed a list or an array, given that there is not much deferred computation going on. But oh well.
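For comparison, this is roughly what the list and array versions would look like. It is only a sketch: scrapHtmlCellsList and scrapHtmlCellsArray are hypothetical names, not part of html.fs, and I use .Groups.[1].Value where the original uses .Groups.Item(1).ToString(), which gives the same string.

```fsharp
open System.Text.RegularExpressions

let colExpr = "<td[^>]*>(.*?)</td>"
let regexOptions = RegexOptions.Multiline ||| RegexOptions.Singleline
                                          ||| RegexOptions.IgnoreCase

// Lazy: nothing is extracted until somebody enumerates the sequence.
let scrapHtmlCells (html: string) =
    seq { for x in Regex.Matches(html, colExpr, regexOptions) -> x.Groups.[1].Value }

// Eager list version (hypothetical name): same comprehension, square brackets.
let scrapHtmlCellsList (html: string) =
    [ for x in Regex.Matches(html, colExpr, regexOptions) -> x.Groups.[1].Value ]

// Eager array version (hypothetical name).
let scrapHtmlCellsArray (html: string) =
    [| for x in Regex.Matches(html, colExpr, regexOptions) -> x.Groups.[1].Value |]

let html = "<tr><td>Jun 2005</td><td>2:1</td></tr>"
printfn "%A" (scrapHtmlCells html |> Seq.toList)
printfn "%A" (scrapHtmlCellsList html)
printfn "%A" (scrapHtmlCellsArray html)
```

The only thing that changes is the brackets around the comprehension; the body stays the same.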

Always check the type of your functions in F#! With type inference it is easy to get it wrong. Hovering your mouse over the function name in VS shows it. This one is typed: string -> seq<string>. It takes a string (html) and returns a sequence of strings (the cells in html).
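If you would rather have the compiler do the checking than your mouse, one option is to write the expected type down. A minimal sketch, assuming you are happy to add annotations (scrapHtmlCellsTyped is a hypothetical name, not something in html.fs):

```fsharp
open System.Text.RegularExpressions

let colExpr = "<td[^>]*>(.*?)</td>"
let regexOptions = RegexOptions.Multiline ||| RegexOptions.Singleline
                                          ||| RegexOptions.IgnoreCase

// The annotations state the intended type: string -> seq<string>.
// If the body yielded, say, a Group instead of a string, this would not compile.
let scrapHtmlCellsTyped (html: string) : seq<string> =
    seq { for x in Regex.Matches(html, colExpr, regexOptions) -> x.Groups.Item(1).ToString() }

let typedCells = scrapHtmlCellsTyped "<td>a</td><td>b</td>" |> Seq.toList
printfn "%A" typedCells
```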

We’ll need rows as well.

let scrapHtmlRows html =
    seq { for x in Regex.Matches(html, rowExpr, regexOptions) -> scrapHtmlCells x.Value }

This works about the same. I’m matching all the rows and retrieving the cells for each one of them. I’m getting back a matrix-like structure, that is to say that this function has type: string -> seq<seq<string>>.
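Here is a quick way to try the two functions together. The table below is invented for illustration (it is not what Yahoo returns), and rows and dataRows are just names I picked. I turn the nested sequences into lists so they are easy to print.

```fsharp
open System.Text.RegularExpressions

let rowExpr = "<tr[^>]*>(.*?)</tr>"
let colExpr = "<td[^>]*>(.*?)</td>"
let regexOptions = RegexOptions.Multiline ||| RegexOptions.Singleline
                                          ||| RegexOptions.IgnoreCase

let scrapHtmlCells (html: string) =
    seq { for x in Regex.Matches(html, colExpr, regexOptions) -> x.Groups.Item(1).ToString() }

let scrapHtmlRows (html: string) =
    seq { for x in Regex.Matches(html, rowExpr, regexOptions) -> scrapHtmlCells x.Value }

// Made-up table: one header row (th cells) and two data rows (td cells).
let sample =
    "<table><tr><th>Date</th><th>Split</th></tr>" +
    "<tr><td>Jun 2005</td><td>2:1</td></tr>" +
    "<tr><td>Feb 2003</td><td>3:2</td></tr></table>"

// Materialize the seq<seq<string>> as a list of lists.
let rows = [ for row in scrapHtmlRows sample -> List.ofSeq row ]

// The header row has no td cells, so it comes back empty; drop it.
let dataRows = rows |> List.filter (fun row -> not (List.isEmpty row))

printfn "%A" rows
printfn "%A" dataRows
```

Note that a row made only of th cells shows up as an empty sequence, since scrapHtmlCells looks for td cells only.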

That’s all for today. In the next installment we’ll make it happen.

