[racket-users] Html to text, how to obtain a rough preview

Erich Rast Tue, 30 May 2017 04:08:34 -0700

Hi all,

I need a function to provide a rough textual preview (without
formatting except newlines) of the content of a web page.


So far I'm using this:

(require net/url
         html-parsing
         sxml)

(provide fetch fetch-string-content)

(define (fetch url)
  (call/input-url url
                  get-pure-port
                  port->string))

(define (fetch-string-content url)
  (sxml:text ((sxpath '(html body)) (html->xexp (fetch url)))))

The sxpath correctly returns the body sexp, but fetch-string-content
still returns only an empty string or a run of newlines such as "\n\n\n".

I guess the problem is that sxml:text only returns the text nodes directly
below the element, and that's not what I want. Web pages contain all kinds
of arbitrarily nested div and span tags. I'm looking for a way to get
a simplified version of the full textual content of the html body. If I
were targeting Linux only, I'd run "lynx -dump -nolist" in a subprocess,
but this needs to be cross-platform.

Is there an SXML trick to achieve that? It doesn't need to be perfect.
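
For illustration, here is a rough sketch of the kind of thing I mean: a
plain recursive walk over the xexp that collects every string at any depth,
instead of only the direct children. The names collect-strings, xexp->text,
skipped-tags and sample are made up for this example and are not part of
sxml or html-parsing. The sample list is hand-written in the shape I would
expect html->xexp to produce, and the whitespace cleanup is just one
possible choice.

```racket
#lang racket/base
;; Sketch only: all names defined here are hypothetical helpers,
;; not functions from sxml or html-parsing.
(require racket/string)

;; Node types whose content should not appear in the preview:
;; attribute lists (@), comments, and non-visible elements.
(define skipped-tags '(@ *COMMENT* *PI* *DECL* head script style))

;; Collect every string in an xexp, depth first, in document order.
(define (collect-strings x)
  (cond
    [(string? x) (list x)]
    [(null? x) '()]
    ;; An element: (tag child ...)
    [(and (list? x) (symbol? (car x)))
     (if (memq (car x) skipped-tags)
         '()
         (apply append (map collect-strings (cdr x))))]
    ;; A nodeset, e.g. the list returned by an sxpath query
    [(list? x) (apply append (map collect-strings x))]
    [else '()]))

;; Join the collected strings and tidy the whitespace:
;; runs of spaces/tabs become one space, runs of blank lines one newline.
(define (xexp->text x)
  (let* ([s (string-append* (collect-strings x))]
         [s (regexp-replace* #px"[ \t\r]+" s " ")]
         [s (regexp-replace* #px" ?\n[ \n]*" s "\n")])
    (string-trim s)))

;; Hand-written sample input (an assumption about the xexp shape).
(define sample
  '(*TOP*
    (html
     (head (title "Ignored title"))
     (body (@ (class "main"))
           (script "var x = 1;")
           (div (p "Hello " (b "world") "!")
                "\n\n"
                (span "Second   line"))))))

(xexp->text sample) ; => "Hello world!\nSecond line"
```

Since nodesets are handled as plain lists, this could presumably replace
the sxml:text call in fetch-string-content, as in
(xexp->text ((sxpath '(html body)) (html->xexp (fetch url)))), though I have
not tried that on real pages. One known limitation: text from adjacent block
elements runs together unless the source has whitespace between them.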

Best,

Erich
