Downloading stock prices in F# - Part II - Html scraping

Getting stock prices and dividends is relatively easy given that, on Yahoo, you can get the info as a CSV file. Getting the split info is harder. You would think that Yahoo would put that info in the dividends CSV, as it does when it displays it on screen, but it doesn’t. So I had to write code to scrape it from the multiple web pages where it might reside. In essence, I’m scraping this.


In this file there are utility functions that I will use later on to retrieve split info.

open System
open System.IO
open System.Text.RegularExpressions
// It assumes no table inside table ...
let tableExpr = "<table[^>]*>(.*?)</table>"
let headerExpr = "<th[^>]*>(.*?)</th>"
let rowExpr = "<tr[^>]*>(.*?)</tr>"
let colExpr = "<td[^>]*>(.*?)</td>"
let regexOptions =
    RegexOptions.Multiline ||| RegexOptions.Singleline ||| RegexOptions.IgnoreCase

This code is straightforward enough (if you know what Regex does). I’m sure that there are better expressions to scrape tables and rows on the web, but these work in my case. I don’t really need to scrape tables; I put the table expression there in case you need it.
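To see what these expressions and options do in practice, here is a minimal sketch that runs the cell pattern over a tiny piece of html. The sample string is made up for illustration; it has an attribute on the first tag, and upper-case tags plus a line break in the second cell, which is where IgnoreCase and Singleline come in.

```fsharp
open System.Text.RegularExpressions

// Same pattern and options as in the post.
let colExpr = "<td[^>]*>(.*?)</td>"
let regexOptions =
    RegexOptions.Multiline ||| RegexOptions.Singleline ||| RegexOptions.IgnoreCase

// Hypothetical sample input, not taken from any real page.
let sample = "<td class=\"a\">1.5</td><TD>2 for\n1</TD>"

// Groups.[1] is the lazy (.*?) capture, i.e. the text between the tags.
// Singleline lets '.' match the line break inside the second cell,
// IgnoreCase lets the lower-case pattern match <TD>.
let cells =
    [ for m in Regex.Matches(sample, colExpr, regexOptions) -> m.Groups.[1].Value ]

printfn "%A" cells
```

The lazy quantifier is what keeps each match inside a single cell; with a greedy (.*) the first match would run to the last closing tag.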

I then write code to scrape all the cells in a piece of html:

let scrapHtmlCells html =
  seq { for x in Regex.Matches(html, colExpr, regexOptions) -> x.Groups.Item(1).ToString()}            

This is a sequence expression. Sequence expressions are used to generate sequences starting from some expression (as the name hints). In this case Regex.Matches returns a MatchCollection, which is a non-generic IEnumerable. For each match in it, we return the value of the first capture group, that is, the text between the opening and closing tags. We could as easily have constructed a list or an array, given that there is not much deferred computation going on. But oh well.
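For comparison, this is roughly what the list and array versions would look like. It is only a sketch: the names scrapHtmlCellsList and scrapHtmlCellsArray are hypothetical and are not used anywhere else in this series.

```fsharp
open System.Text.RegularExpressions

let colExpr = "<td[^>]*>(.*?)</td>"
let regexOptions =
    RegexOptions.Multiline ||| RegexOptions.Singleline ||| RegexOptions.IgnoreCase

// Eager variant: builds the whole list as soon as it is called.
let scrapHtmlCellsList (html: string) : string list =
    [ for x in Regex.Matches(html, colExpr, regexOptions) -> x.Groups.[1].Value ]

// Same idea with an array.
let scrapHtmlCellsArray (html: string) : string [] =
    [| for x in Regex.Matches(html, colExpr, regexOptions) -> x.Groups.[1].Value |]
```

The only change from the seq version is the kind of brackets around the comprehension.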

Always check the type of your functions in F#! With type inference it is easy to get it wrong. Hovering your mouse on top of it in VS shows it. This one is typed: string -> seq<string>. It takes a string (html) and returns a sequence of strings (the cells in html).
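If you would rather have the compiler do the checking than rely on hovering, one option is to write the expected types down. This is a sketch of the same function with explicit annotations (the annotations are my addition here); if the body ever stops producing a seq<string>, it no longer compiles.

```fsharp
open System.Text.RegularExpressions

let colExpr = "<td[^>]*>(.*?)</td>"
let regexOptions =
    RegexOptions.Multiline ||| RegexOptions.Singleline ||| RegexOptions.IgnoreCase

// Same body as before; only the parameter and return types are spelled out.
let scrapHtmlCells (html: string) : seq<string> =
    seq { for x in Regex.Matches(html, colExpr, regexOptions) -> x.Groups.Item(1).ToString() }
```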

We’ll need rows as well.

let scrapHtmlRows html =
    seq { for x in Regex.Matches(html, rowExpr, regexOptions) -> scrapHtmlCells x.Value }

This works about the same. I’m matching all the rows and retrieving the cells for each one of them. I’m getting back a matrix-like structure, that is to say that this function has type: string -> seq<seq<string>>.
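Here is a small usage sketch to show the shape of the result. The table below is invented sample data, not real split info. Note that the header row contains only th elements, which colExpr does not match, so it comes back as an empty inner sequence; the sketch filters those out.

```fsharp
open System.Text.RegularExpressions

let rowExpr = "<tr[^>]*>(.*?)</tr>"
let colExpr = "<td[^>]*>(.*?)</td>"
let regexOptions =
    RegexOptions.Multiline ||| RegexOptions.Singleline ||| RegexOptions.IgnoreCase

let scrapHtmlCells html =
    seq { for x in Regex.Matches(html, colExpr, regexOptions) -> x.Groups.Item(1).ToString() }

let scrapHtmlRows html =
    seq { for x in Regex.Matches(html, rowExpr, regexOptions) -> scrapHtmlCells x.Value }

// Made-up sample table: one header row and two data rows.
let sampleTable =
    "<table>" +
    "<tr><th>Date</th><th>Split</th></tr>" +
    "<tr><td>2000-01-01</td><td>2 : 1</td></tr>" +
    "<tr><td>2001-01-01</td><td>3 : 2</td></tr>" +
    "</table>"

// Force the nested sequences into lists and drop rows without td cells.
let rows =
    scrapHtmlRows sampleTable
    |> Seq.map List.ofSeq
    |> Seq.filter (fun cells -> not (List.isEmpty cells))
    |> List.ofSeq

printfn "%A" rows
```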

That’s all for today. In the next installment we’ll make it happen.