Recursively extracting nested links from a webpage using Racket


This is my first attempt at writing Racket or LISP code. It might be ugly … For now, let's build a web crawler; next, I shall write myself a language. That's where Racket really shines.

The code is here. Thanks to Mike for reviewing it.

Why

I want to translate a website, including recursively reached pages, to a PDF to read on my e-reader. This program does the first step: going from a URL to a list of URLs by recursively navigating the links on the page.

The PS1Script directory contains scripts to perform the other steps:

  1. Translate all the links to PDF pages (using Chrome headless).
  2. Combine all the PDFs to a single document (using cpdf).

Prelude

First you declare which language you are using. Racket is a language to build languages, and the language itself is just one of many (try rereading this phrase).

#lang racket

We then expose the main functions of the library: recursively getting all links from a webpage, down to a certain nesting level, both as a Racket list and as a newline-delimited string.

; Provides functions to extract links from web pages recursively

(provide
 ; uriString nestingLevels -> list of urls
 ; Doesn't follow links to non-HTML resources or links pointing to a different domain than uriString.
 uri->nestedLinks
 ; uriString nestingLevels -> string of newline-separated uri strings
 uri->nestedLinksNl)

Implementation

This is similar, but not identical, to a C# using statement: you declare which packages are needed.

(require (planet neil/html-parsing:2:0) ; html->xexp: parse HTML into an x-expression
         net/url                        ; url structs, string->url, combine-url/relative
         xml
         html
         sxml/sxpath                    ; XPath queries over x-expressions
         threading)                     ; the ~>, ~>> and λ~> pipeline macros
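
A side note, in case the PLaneT form gives you trouble: as far as I know the same parser is also published on the current package catalog, so an alternative would be something like the following.

; raco pkg install html-parsing
(require html-parsing) ; provides html->xexp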

The program crawls only links to HTML files in the same domain as the first URI given to it. It would be possible to extend this by having the program take a regular expression (or wildcard expression) to identify which files to leave out of the crawl.

(define invalid-suffixes
  '("./" ".xml" ".jpg" ".jpeg" ".png" ".gif" ".tiff" ".psd"
    ".eps" ".ai" ".indd" ".raw" ".svg"))

(define invalid-prefixes '("#" "mailto:" "javascript:"))

(define (different-domain? baseUrl l)
  (define url (string->url l))
  (and (url-host url) (not (equal? (url-host baseUrl) (url-host url)))))

(define (good-link? baseUrl l)
  (not (or (different-domain? baseUrl l)
           (ormap (curry string-suffix? l) invalid-suffixes)
           (ormap (curry string-prefix? l) invalid-prefixes))))
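
To make the filtering concrete, here is a small REPL sketch; the base URL and the links are made up for illustration:

(define base (string->url "https://example.com/docs/"))
(good-link? base "/docs/intro.html")           ; #t - same domain, HTML page
(good-link? base "mailto:someone@example.com") ; #f - invalid prefix
(good-link? base "https://other.org/page")     ; #f - different domain
(good-link? base "logo.png")                   ; #f - invalid suffix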

Next we have to parse the HTML. We use XPath for that. Racket is particularly good at XML parsing, as XML maps naturally to s-expressions, which are the bread and butter of the language. I don't enjoy the nesting of expressions that LISP-like languages force onto you (as in the expression below), but see later for a partial solution.

(define (xexp->links xexp) (flatten (map cdr ((sxpath "//a/@href") xexp))))
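
To see what the pieces return, here is a small sketch; the HTML snippet and the sample-xexp name are made up for illustration:

(define sample-xexp (html->xexp "<p><a href=\"/a\">one</a><a href=\"/b\">two</a></p>"))
((sxpath "//a/@href") sample-xexp) ; '((href "/a") (href "/b"))
(xexp->links sample-xexp)          ; '("/a" "/b")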

The strange ~>> operator in the code below comes from the threading package. This set of macros lets you build pipelines of computations, similar to the F# |> operator: the initial value comes first, and then the functions are applied in series, each one to the result of the previous. By doing that you ‘flatten’ the nested expressions, making them more readable (at least to my eyes).
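
For instance (an illustrative example, not from the crawler):

(~>> '(1 2 3) (map add1) (filter odd?)) ; reads as (filter odd? (map add1 '(1 2 3))) and gives '(3)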

This capability of changing the core behavior of the language is something very peculiar to the LISP family, and the reason why I was attracted to Racket in the first place.

This function extracts all the ‘good’ links from a URL. BTW: I love that you can use -> to name symbols.

(define (url->links url)
  (~>> (call/input-url url get-pure-port html->xexp)
       xexp->links
       (filter (curry good-link? url))))

λ~> is the function-composition operator from the threading library. You have got to love embedding lambdas in the code.

(define uri->links
  (λ~> string->url url->links))
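
If I understand the macro correctly, the definition above is roughly equivalent to this more explicit version (the name is hypothetical, just for illustration):

(define (uri->links-explicit uri) ; same behaviour as uri->links above
  (~> uri string->url url->links))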

This is the main recursive workhorse of the program. It works something like this (the items below match the <0> to <4> markers in the code):

  1. <0> Treat links to sub-parts of a web page (fragments) as if they were links to the page itself.
  2. <1> If it is not a good link, return the links already visited (visited).
  3. <2> The same if the link is already in visited.
  4. <3> If we have reached the specified nesting level, add the link to visited and return visited.
  5. <4> Otherwise add the link to visited and call yourself on all the links on the page.

The function is not tail recursive, but that is not a huge deal in Racket, as the stack is very large. It doesn't blow up as easily as in most other languages.

(define (uri->nestedLinks-rec baseUrl uri visited levels)
  
  (define abs-url (if (string-contains? uri "#") ; <0>
      (~> uri (string-split "#") first (combine-url/relative baseUrl _))
      (~>> uri (combine-url/relative baseUrl))))
  
  (log-info "~a, ~a, ~a:~a~n" (url->string baseUrl)
                              levels uri (url->string abs-url))

  (cond [(not (good-link? baseUrl uri)) visited] ; <1>
        [(member abs-url visited)  visited]      ; <2>
        [(zero? levels)  (cons abs-url visited)] ; <3>
        [else  (for/fold ([acc (cons abs-url visited)]) ;<4>
                         ([l (in-list (url->links abs-url))])
                 (uri->nestedLinks-rec abs-url l acc (sub1 levels)))]))
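
The for/fold at <4> is what threads the growing visited list through the recursive calls. A minimal sketch of how for/fold accumulates, unrelated to URLs:

(for/fold ([acc '()]) ([x '(1 2 3)])
  (cons x acc)) ; '(3 2 1)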

Finally, we can trivially define the two main functions in the module.

(define (uri->nestedLinks uri levels)
  (reverse (uri->nestedLinks-rec (string->url uri) "" '() levels)))

(define (uri->nestedLinksNl uri levels)
  (define links (uri->nestedLinks uri levels))
  (string-join (map url->string links) "\n" #:after-last "\n"))
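
A quick REPL sketch of the two (using one of the test sites listed further down; the actual links depend on the site at crawl time):

(uri->nestedLinks "https://www.lucabol.com" 1)             ; a list of url structs, starting page first
(display (uri->nestedLinksNl "https://www.lucabol.com" 1)) ; prints one URL per line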

Test

To my great pleasure, Racket allows (encourages?) you to have tests in the same file as the code. They just go into submodules, which can be built up piecewise with the module+ form.

You could add the tests beside each function, but I decided to have a separate section in the file instead. To run them you call raco test FILENAME.

(define (uri->path test-uri)
  (build-path "./data" (~> test-uri first uri->file string->path)))
(define uri->file (λ~> string->url url-host))
(define test-uris
  '(("https://www.lucabol.com" 3)
    ("https://beautifulracket.com/" 3)
    ("https://en.wikipedia.org/wiki/Typeface" 1)
    ("https://brieferhistoryoftime.com" 3)
    ("https://mobydick.wales/" 3)
    ("https://resilientwebdesign.com" 3)
    ("https://www.c82.net/euclid/" 3)))
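
For example, the saved output for the first entry ends up in a file named after the host (a sketch, assuming the ./data directory exists):

(uri->path '("https://www.lucabol.com" 3)) ; a path like ./data/www.lucabol.com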

(module+ test
  (require rackunit)

I got a bit sloppy by not naming my lambdas here … But doesn't the lambda symbol look cool?

  (for-each (λ (test-uri)
              (with-input-from-file
                  (uri->path test-uri)
                (λ ()
                  (define saved-result (port->string))
                  (define calc-result
                    (uri->nestedLinksNl (first test-uri) (second test-uri)))
                  (check-equal? calc-result saved-result test-uri))
                #:mode 'text))
            test-uris))

This is used to regenerate the test data. You can then inspect it manually before running the tests.

(define (refresh-test-data)
  (for-each (λ (test-uri)
              (with-output-to-file
                  (uri->path test-uri)
                (λ () (display (uri->nestedLinksNl (first test-uri) (second test-uri))))
                #:exists 'replace))
            test-uris))
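
Since refresh-test-data sits at the top level of the file, one way to run it (just how I would do it, not the only way) is to load the file in DrRacket and evaluate it in the interactions window:

(refresh-test-data) ; overwrites the files under ./data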

Main

Main goes into its own submodule as well. Racket is not as pure as Haskell, so you can naturally manage side effects like user input and such. You have got to appreciate the conciseness of the command-line parsing library.

The code below looks a bit odd to me. It could probably be refactored so that the parser expression returns the values instead of filling out parameters; a sketch of that refactoring follows the code.

(module+ main

  (define levels (make-parameter "3"))
  (define uri (make-parameter #f))

  (define parser
    (command-line
     #:program "website-links"
     #:usage-help "Extracts links from a webpage recursively to a specified level."
     #:once-each
     [("-l" "--levels") LEVELS "How many nested levels to process (default 3)." (levels LEVELS)]
     #:args (URI) (uri URI)))
      
  (display (uri->nestedLinksNl (uri) (string->number (levels)))))
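
Something like the sketch below might work (untested, and it would replace the module above): the body of the #:args clause is the value of the whole command-line expression, so the URI can be returned directly instead of being stuffed into a parameter.

(module+ main

  (define levels (make-parameter 3))

  (define uri
    (command-line
     #:program "website-links"
     #:usage-help "Extracts links from a webpage recursively to a specified level."
     #:once-each
     [("-l" "--levels") LEVELS "How many nested levels to process (default 3)."
                        (levels (string->number LEVELS))]
     #:args (URI) URI))

  (display (uri->nestedLinksNl uri (levels))))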
  

Conclusion

I liked Racket very much. It takes a little while to get used to the expression syntax, which is very different from the C-like one most of us are used to. It also takes a while to get used to the style of the documentation, which is written very precisely for the careful reader. We are more used to the ‘here is an example, copy it’ kind of documentation, for the distracted programmer …
