Downloading stock prices in F# - Part II - Html scraping

Getting stock prices and dividends is relatively easy given that, on Yahoo, you can get the info as a CSV file. Getting the splits info is harder. You would think that Yahoo would put that info in the dividends CSV, as it does when it displays it on screen, but it doesn’t. So I had to write code to scrape it from the multiple web pages where it might reside. In essence, I’m scraping this.

html.fs

In this file there are utility functions that I will use later on to retrieve split info.

#light
open System
open System.IO
open System.Text.RegularExpressions

// Patterns for html elements; the lazy (.*?) group captures the inner content.
// It assumes no table inside table ...
let tableExpr = "<table[^>]*>(.*?)</table>"
let headerExpr = "<th[^>]*>(.*?)</th>"
let rowExpr = "<tr[^>]*>(.*?)</tr>"
let colExpr = "<td[^>]*>(.*?)</td>"

// Singleline lets '.' match newlines; IgnoreCase matches <TD> as well as <td>.
let regexOptions =
    RegexOptions.Multiline ||| RegexOptions.Singleline ||| RegexOptions.IgnoreCase

This code is straightforward enough (if you know what Regex does). I’m sure that there are better expressions to scrape tables and rows on the web, but these work in my case. I really don’t need to scrape tables. I put the table expression there in case you need it.
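
To see why RegexOptions.Singleline is part of the options, consider a cell whose content spans two lines. This is only a small sketch: the sample string is made up for illustration and is not taken from the pages I scrape.

```fsharp
open System.Text.RegularExpressions

let colExpr = "<td[^>]*>(.*?)</td>"

// Hypothetical sample: the second cell contains a line break.
let sample = "<tr><td>first cell</td><td>second\ncell</td></tr>"

// Without Singleline, '.' does not match '\n', so the second cell is missed.
let withoutSingleline =
    Regex.Matches(sample, colExpr, RegexOptions.IgnoreCase).Count

// With Singleline, '.' matches any character, including '\n'.
let withSingleline =
    Regex.Matches(sample, colExpr, RegexOptions.IgnoreCase ||| RegexOptions.Singleline).Count

printfn "%d %d" withoutSingleline withSingleline   // prints "1 2"
```

The lazy quantifier (.*?) matters as well: it stops at the first closing tag, so two adjacent cells are not swallowed into one match.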

I then write code to scrape all the cells in a piece of html:

let scrapHtmlCells html =
  seq { for x in Regex.Matches(html, colExpr, regexOptions) -> x.Groups.Item(1).ToString()}            

This is a sequence expression. Sequence expressions are used to generate sequences starting from some expression (as the name hints). In this case Regex.Matches returns a MatchCollection, which is a non-generic IEnumerable. For each match in it, we return the value of the first captured group, which is the content of the cell. We could as easily have constructed a list or an array, given that there is not much deferred computation going on. But oh well.
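
For comparison, here is roughly what the list and array versions might look like. The names scrapHtmlCellsList and scrapHtmlCellsArray are hypothetical, introduced only for this sketch; the comprehension is the same and only the brackets change.

```fsharp
open System.Text.RegularExpressions

let colExpr = "<td[^>]*>(.*?)</td>"
let regexOptions =
    RegexOptions.Multiline ||| RegexOptions.Singleline ||| RegexOptions.IgnoreCase

// Hypothetical eager variant: builds an F# list right away.
let scrapHtmlCellsList (html: string) =
    [ for x in Regex.Matches(html, colExpr, regexOptions) -> x.Groups.[1].Value ]

// Hypothetical eager variant: builds an array right away.
let scrapHtmlCellsArray (html: string) =
    [| for x in Regex.Matches(html, colExpr, regexOptions) -> x.Groups.[1].Value |]

// Made-up input; IgnoreCase lets the upper-case <TD> match too.
printfn "%A" (scrapHtmlCellsList "<td>a</td><TD class='x'>b</TD>")
```

The seq version defers the work until someone enumerates the result, while these two do it at the point of the call.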

Always check the type of your functions in F#! With type inference it is easy to get it wrong. Hovering your mouse over the function name in VS shows it. This one is typed: string -> seq<string>. It takes a string (html) and returns a sequence of strings (the cells in html).
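
If you would rather have the compiler check the type than rely on hovering, one option is to write the annotations out. The sketch below repeats the function with explicit types; the input string is invented for the example.

```fsharp
open System.Text.RegularExpressions

let colExpr = "<td[^>]*>(.*?)</td>"
let regexOptions =
    RegexOptions.Multiline ||| RegexOptions.Singleline ||| RegexOptions.IgnoreCase

// Same body as scrapHtmlCells, with the inferred types spelled out.
// If the body ever stops returning strings, this no longer compiles.
let scrapHtmlCells (html: string) : seq<string> =
    seq { for x in Regex.Matches(html, colExpr, regexOptions) -> x.Groups.Item(1).ToString() }

// Made-up row, for illustration only.
let cells =
    scrapHtmlCells "<tr><td>some date</td><td>2 : 1 Stock Split</td></tr>"
    |> List.ofSeq

printfn "%A" cells
```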

We’ll need rows as well.

let scrapHtmlRows html =
    seq { for x in Regex.Matches(html, rowExpr, regexOptions) -> scrapHtmlCells x.Value }

This works about the same. I’m matching all the rows and retrieving the cells for each one of them. I’m getting back a matrix-like structure, that is to say that this function has type: string -> seq<seq<string>>.
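
Here is a small sketch of the two functions working together. The table is made up for the example and is not the real Yahoo markup. Note that a header row made of th cells has no td cells, so it comes back as an empty sequence.

```fsharp
open System.Text.RegularExpressions

let rowExpr = "<tr[^>]*>(.*?)</tr>"
let colExpr = "<td[^>]*>(.*?)</td>"
let regexOptions =
    RegexOptions.Multiline ||| RegexOptions.Singleline ||| RegexOptions.IgnoreCase

let scrapHtmlCells (html: string) : seq<string> =
    seq { for x in Regex.Matches(html, colExpr, regexOptions) -> x.Groups.Item(1).ToString() }

let scrapHtmlRows (html: string) : seq<seq<string>> =
    seq { for x in Regex.Matches(html, rowExpr, regexOptions) -> scrapHtmlCells x.Value }

// Hypothetical table: one header row and two data rows.
let html =
    "<table>\n" +
    "<tr><th>Date</th><th>Event</th></tr>\n" +
    "<tr><td>date 1</td><td>2 : 1 Stock Split</td></tr>\n" +
    "<tr><td>date 2</td><td>3 : 2 Stock Split</td></tr>\n" +
    "</table>"

// Force the lazy sequences into lists so they are easy to inspect.
let rows = scrapHtmlRows html |> Seq.map List.ofSeq |> List.ofSeq

// Drop rows without td cells (here, the header row).
let dataRows = rows |> List.filter (fun r -> not (List.isEmpty r))

printfn "%A" dataRows
```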

That’s all for today. In the next installment we’ll make it happen.
