I would handle this by adding some special cases: ignoring the content of script tags, extracting the alt text of images when it's provided, and so on.
This gets meaningful content from both nytimes.com and cnn.com (though CNN seems to make only its navigation links accessible without JavaScript):

#lang racket

(require net/url
         html-parsing
         sxml
         xml
         xml/path)

(define (fetch url)
  (call/input-url url get-pure-port port->string))

(define (html->string-content src-str)
  ;; Parse the HTML with html-parsing, serialize the resulting SXML back
  ;; to XML text, and re-read it with the xml library so we can walk
  ;; ordinary x-expressions.
  (let loop ([to-go (se-path*/list
                     '(body)
                     (xml->xexpr
                      (document-element
                       (read-xml
                        (open-input-string
                         (srl:sxml->xml-noindent (html->xexp src-str)))))))]
            [so-far ""])
    (match to-go
      ['() so-far]
      ;; Plain text: keep it.
      [(cons (? string? str) more)
       (loop more (string-append so-far str))]
      ;; Skip the contents of <style> and <script> elements entirely.
      [(cons (cons (or 'style 'script) _) more)
       (loop more so-far)]
      ;; For <img>, keep the alt text when an alt attribute is present.
      [(cons (list-rest 'img (list-no-order (list 'alt alt-text) _ ...) _) more)
       (loop more (string-append so-far alt-text))]
      ;; Any other element: recurse into its body, then continue.
      [(cons (list-rest _ _ body) more)
       (loop more (loop body so-far))]
      ;; Ignore entities, CDATA, and processing instructions for simplicity.
      [(cons _ more)
       (loop more so-far)])))

(displayln (html->string-content (fetch (string->url "https://www.nytimes.com"))))

I only converted to the xml library's x-expressions because I haven't worked with SXML before. The first thing I'd improve is probably the handling of whitespace, since HTML normalizes it; then you'd want more special cases for tags like <p> that should translate to whitespace in the output (see the sketches after the quoted messages below).

-Philip

On Tue, May 30, 2017 at 4:17 PM, Jon Zeppieri <zeppi...@gmail.com> wrote:
> ((sxpath '(// *text*)) doc)
>
> should return all (and only) the text nodes in doc. I'm not so
> familiar with the sxml-xexp compatibility stuff, so I don't know if
> you can use an xexp here or if you really need an sxml document.
>
> On Tue, May 30, 2017 at 7:08 AM, Erich Rast <er...@snafu.de> wrote:
> > Hi all,
> >
> > I need a function to provide a rough textual preview (without
> > formatting except newlines) of the content of a web page.
> >
> > So far I'm using this:
> >
> > (require net/url
> >          html-parsing
> >          sxml)
> >
> > (provide fetch fetch-string-content)
> >
> > (define (fetch url)
> >   (call/input-url url
> >                   get-pure-port
> >                   port->string))
> >
> > (define (fetch-string-content url)
> >   (sxml:text ((sxpath '(html body)) (html->xexp (fetch url)))))
> >
> > The sxpath correctly returns the body sexp, but fetch-string-content
> > still only returns an empty string or a bunch of "\n\n\n".
> >
> > I guess the problem is that sxml:text only returns what is immediately
> > below the element, and that's not what I want. There are all kinds of
> > unknown div and span tags in web pages. I'm looking for a way to get
> > a simplified version of the textual content of the html body. If I were
> > on Linux only, I'd use "lynx -dump -nolist" in a subprocess, but it needs
> > to be cross-platform.
> >
> > Is there an sxml trick to achieve that? It doesn't need to be perfect.
> >
> > Best,
> >
> > Erich
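A minimal sketch for the whitespace handling mentioned above: collapse every run of whitespace to a single space, roughly what HTML rendering does. (normalize-whitespace is a name I made up, not part of the code above, and this assumes it is added to the same #lang racket module, where string-trim is already available.)

;; Collapse runs of whitespace to a single space and trim the ends,
;; approximating HTML's whitespace normalization. Block-level tags
;; like <p> would still need their own clauses in the walker to emit
;; newlines before the result reads well as a preview.
(define (normalize-whitespace s)
  (string-trim (regexp-replace* #px"\\s+" s " ")))

;; Example: (displayln (normalize-whitespace (html->string-content (fetch url))))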
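Jon's sxpath suggestion can also be wired into Erich's original setup, roughly like this. This is only a sketch: it assumes html->xexp's output is close enough to SXML for sxpath's *text* node test (the compatibility question Jon raises above), and fetch-all-text is a hypothetical name.

;; Collect every text node anywhere in the document and join them with
;; newlines. Note that (// *text*) keeps the contents of <script> and
;; <style> elements too, which is exactly what the match-based walker
;; above filters out.
(define (fetch-all-text url)
  (string-join ((sxpath '(// *text*)) (html->xexp (fetch url)))
               "\n"))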