Aaron, this is the simplest change I could make to that program to show you
an example. It does what you want.

https://play.golang.org/p/Q8pu0l-Yvy

It shows what I was saying about replacing a string with a struct of string
and int.
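
In essence, the change is something like this (a sketch, not the exact
playground code; the sequential loop is only there to keep it short):

    package main

    import "fmt"

    // A work item carries the url together with its link depth, so the
    // crawler can stop descending once the depth limit is reached.
    type link struct {
        url   string
        depth int
    }

    func main() {
        const maxDepth = 2
        seen := make(map[string]bool)
        work := []link{{url: "http://www.whitehouse.gov", depth: 0}}

        for len(work) > 0 {
            l := work[0]
            work = work[1:]
            if l.depth > maxDepth || seen[l.url] {
                continue
            }
            seen[l.url] = true
            fmt.Println(l.depth, l.url)
            // A real crawler fetches l.url here and appends each href it
            // finds as link{url: found, depth: l.depth + 1}.
        }
    }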

Depth 1:

http://www.whitehouse.gov
https://www.whitehouse.gov/blog/2017/06/08/president-trumps-plan-rebuild-americas-infrastructure
:
https://www.whitehouse.gov/privacy
https://www.whitehouse.gov/copyright

On Tue, Sep 26, 2017 at 7:29 AM, Michael Jones <michael.jo...@gmail.com>
wrote:

> To be sure, examples are a starting place.
>
> To crawl to a certain depth, you need to know the depth of each url. That
> means that the url being passed around stops being "just a string" and
> becomes a struct holding, at least, the url as a string and the link depth
> as an integer. Following a link means incrementing the depth, checking it
> against the limit, and also checking whether the string part is already in
> the map.
>
> Crawling to a certain distance from a website, in the sense of how many
> hops to other websites, needs similar processing on the hostname.
>
> Crawling with rate limiting, so as not to overburden a distant host's
> server(s), means having a separate work queue for each hostname. That is
> the natural place to do rate limits, time windows, or other such processing.
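>
> A minimal sketch of the per-host queue idea (the hosts and the rate are
> illustrative, and the fetch is just a print so it runs as-is):
>
>     package main
>
>     import (
>         "fmt"
>         "net/url"
>         "sync"
>         "time"
>     )
>
>     // perHost returns the work queue for a hostname, starting that host's
>     // worker the first time the host is seen. The per-host worker is the
>     // natural place for rate limits, time windows, and similar policies.
>     func perHost(host string, queues map[string]chan string, wg *sync.WaitGroup) chan string {
>         q, ok := queues[host]
>         if !ok {
>             q = make(chan string, 100)
>             queues[host] = q
>             wg.Add(1)
>             go func() {
>                 defer wg.Done()
>                 limit := time.NewTicker(500 * time.Millisecond) // at most ~2 fetches per second per host
>                 defer limit.Stop()
>                 for u := range q {
>                     <-limit.C
>                     fmt.Println("fetching", u) // a real crawler would http.Get(u) here
>                 }
>             }()
>         }
>         return q
>     }
>
>     func main() {
>         queues := make(map[string]chan string)
>         var wg sync.WaitGroup
>         links := []string{
>             "https://golang.org/doc/",
>             "https://golang.org/pkg/",
>             "https://blog.golang.org/",
>         }
>         for _, raw := range links {
>             u, err := url.Parse(raw)
>             if err != nil {
>                 continue
>             }
>             perHost(u.Hostname(), queues, &wg) <- raw
>         }
>         for _, q := range queues {
>             close(q)
>         }
>         wg.Wait()
>     }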
>
> The real meaning of the example is to point out that the work queue -- a
> channel here -- can be written to by the (go)routines that read from it.
> This is a useful notion in this case.
>
> Good luck!
>
> On Mon, Sep 25, 2017 at 8:46 PM, Aaron <gdz...@gmail.com> wrote:
>
>> Thank you Mike. I have read the book and done that exercise. The code in
>> that example does not crawl to a certain depth of the website and then
>> stop. My requirements are a bit different, so I can't just use the approach
>> from the example directly; more precisely, I don't know how to use it in my
>> code.
>>
>> On Tuesday, September 26, 2017 at 6:18:43 AM UTC+8, Michael Jones wrote:
>>>
>>> The book, The Go Programming Language, discusses the web crawl task at
>>> several points in the text. The simplest complete parallel version is:
>>>
>>> https://github.com/adonovan/gopl.io/blob/master/ch8/crawl3/findlinks.go
>>>
>>> which, if you download and build it, works quite nicely:
>>>
>>> $ crawl3 http://www.golang.org
>>> http://www.golang.org
>>> http://www.google.com/intl/en/policies/privacy/
>>> https://golang.org/doc/tos.html
>>> https://golang.org/project/
>>> https://golang.org/pkg/
>>> https://golang.org/doc/
>>> http://play.golang.org/
>>> https://tour.golang.org/
>>> https://golang.org/LICENSE
>>> https://developers.google.com/site-policies#restrictions
>>> https://golang.org/dl/
>>> https://golang.org/blog/
>>> https://golang.org/help/
>>> https://golang.org/
>>> https://blog.golang.org/
>>> https://www.google.com/intl/en/privacy/privacy-policy.html
>>> https://www.google.com/intl/en/policies/terms/
>>> https://golang.org/LICENSE?m=text
>>> https://golang.org/pkg
>>> https://golang.org/doc/go_faq.html
>>> https://groups.google.com/group/golang-nuts
>>> https://blog.gopheracademy.com/gophers-slack-community/
>>> https://golang.org/wiki
>>> https://forum.golangbridge.org/
>>> irc:irc.freenode.net/go-nuts
>>> 2017/09/25 15:13:07 Get irc:irc.freenode.net/go-nuts: unsupported
>>> protocol scheme "irc"
>>> https://golang.org/doc/faq
>>> https://groups.google.com/group/golang-announce
>>> https://blog.golang.org
>>> https://twitter.com/golang
>>> :
>>>
>>> On Mon, Sep 25, 2017 at 7:46 AM, Michael Jones <michae...@gmail.com>
>>> wrote:
>>>
>>>> I suggest that you first make it work in the simple way and then make
>>>> it concurrent.
>>>>
>>>> However, one lock-free concurrent way to think of this is as follows...
>>>>
>>>> 1. Start with a list of urls (in code, on the command line, etc.).
>>>> 2. Spawn a goroutine that writes each of them to a channel of strings,
>>>> perhaps called PENDING.
>>>> 3. Spawn a goroutine that reads a url string from PENDING and, if it is
>>>> not in the map of already-processed urls, writes it to a second channel
>>>> of strings, WORK, after adding the url to the map.
>>>> 4. Spawn a set of goroutines that read WORK, fetch the url, do whatever
>>>> it is that you need to do, and, for urls found there, write them back to
>>>> PENDING.
>>>>
>>>> This is enough. Now, as written, you have the challenge of knowing when
>>>> the workers are done and PENDING is empty; that is when you exit. There
>>>> are other ways to do this, but the point is to state with emphasis what
>>>> an earlier email said: keep the map in its own goroutine, the one that
>>>> decides which urls should be processed. A sketch of this shape follows
>>>> below.
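>>>>
>>>> A compact, runnable sketch under those assumptions (the fetch is faked
>>>> so it runs without the network; the counter n tracks lists promised to
>>>> PENDING but not yet received, and reaching zero is the exit condition):
>>>>
>>>>     package main
>>>>
>>>>     import "fmt"
>>>>
>>>>     // fetch stands in for an http.Get plus link extraction, so the
>>>>     // sketch runs as-is. Note the deliberate duplicate child link.
>>>>     func fetch(u string) []string {
>>>>         if len(u) > 26 {
>>>>             return nil
>>>>         }
>>>>         return []string{u + "/a", u + "/b", u + "/a"}
>>>>     }
>>>>
>>>>     func main() {
>>>>         pending := make(chan []string) // lists of found urls, written by workers
>>>>         work := make(chan string, 100) // deduplicated urls for the workers
>>>>
>>>>         // Workers: read WORK, fetch, and write whatever they find back
>>>>         // to PENDING.
>>>>         for i := 0; i < 3; i++ {
>>>>             go func() {
>>>>                 for u := range work {
>>>>                     fmt.Println("crawled:", u)
>>>>                     pending <- fetch(u)
>>>>                 }
>>>>             }()
>>>>         }
>>>>
>>>>         // The map lives only in this loop, which alone decides which
>>>>         // urls get processed, so no mutex is needed. Every url sent to
>>>>         // WORK promises exactly one list back on PENDING, so n counts
>>>>         // promised lists not yet received; zero means all workers are
>>>>         // idle and PENDING is empty.
>>>>         seen := make(map[string]bool)
>>>>         n := 1
>>>>         go func() { pending <- []string{"http://example.com"} }()
>>>>
>>>>         for ; n > 0; n-- {
>>>>             for _, u := range <-pending {
>>>>                 if !seen[u] {
>>>>                     seen[u] = true
>>>>                     n++
>>>>                     work <- u
>>>>                 }
>>>>             }
>>>>         }
>>>>         close(work)
>>>>     }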
>>>>
>>>> On Mon, Sep 25, 2017 at 5:35 AM, Aaron <gdz...@gmail.com> wrote:
>>>>
>>>>> I have come up with a fix, using Mutex. But I am not sure how to do it
>>>>> with channels.
>>>>>
>>>>> package main
>>>>>
>>>>> import (
>>>>>     "fmt"
>>>>>     "log"
>>>>>     "net/http"
>>>>>     "os"
>>>>>     "strings"
>>>>>     "sync"
>>>>>
>>>>>     "golang.org/x/net/html"
>>>>> )
>>>>>
>>>>> var lock = sync.RWMutex{}
>>>>>
>>>>> func main() {
>>>>>     if len(os.Args) != 2 {
>>>>>         fmt.Println("Usage: crawl [URL].")
>>>>>         return
>>>>>     }
>>>>>
>>>>>     url := os.Args[1]
>>>>>     if !strings.HasPrefix(url, "http://") {
>>>>>         url = "http://" + url
>>>>>     }
>>>>>
>>>>>     n := 0
>>>>>
>>>>>     for link := range newCrawl(url, 1) {
>>>>>         n++
>>>>>         fmt.Println(link)
>>>>>     }
>>>>>
>>>>>     fmt.Printf("Total links: %d\n", n)
>>>>> }
>>>>>
>>>>> func newCrawl(url string, num int) chan string {
>>>>>     visited := make(map[string]bool)
>>>>>     ch := make(chan string, 20)
>>>>>
>>>>>     go func() {
>>>>>         crawl(url, 3, ch, &visited)
>>>>>         close(ch)
>>>>>     }()
>>>>>
>>>>>     return ch
>>>>> }
>>>>>
>>>>> func crawl(url string, n int, ch chan string, visited *map[string]bool) {
>>>>>     if n < 1 {
>>>>>         return
>>>>>     }
>>>>>     resp, err := http.Get(url)
>>>>>     if err != nil {
>>>>>         log.Fatalf("Can not reach the site. Error = %v\n", err)
>>>>>         os.Exit(1)
>>>>>     }
>>>>>
>>>>>     b := resp.Body
>>>>>     defer b.Close()
>>>>>
>>>>>     z := html.NewTokenizer(b)
>>>>>
>>>>>     nextN := n - 1
>>>>>     for {
>>>>>         token := z.Next()
>>>>>
>>>>>         switch token {
>>>>>         case html.ErrorToken:
>>>>>             return
>>>>>         case html.StartTagToken:
>>>>>             current := z.Token()
>>>>>             if current.Data != "a" {
>>>>>                 continue
>>>>>             }
>>>>>             result, ok := getHrefTag(current)
>>>>>             if !ok {
>>>>>                 continue
>>>>>             }
>>>>>
>>>>>             hasProto := strings.HasPrefix(result, "http")
>>>>>             if hasProto {
>>>>>                 lock.RLock()
>>>>>                 ok := (*visited)[result]
>>>>>                 lock.RUnlock()
>>>>>                 if ok {
>>>>>                     continue
>>>>>                 }
>>>>>                 done := make(chan struct{})
>>>>>                 go func() {
>>>>>                     crawl(result, nextN, ch, visited)
>>>>>                     close(done)
>>>>>                 }()
>>>>>                 <-done
>>>>>                 lock.Lock()
>>>>>                 (*visited)[result] = true
>>>>>                 lock.Unlock()
>>>>>                 ch <- result
>>>>>             }
>>>>>         }
>>>>>     }
>>>>> }
>>>>>
>>>>> func getHrefTag(token html.Token) (result string, ok bool) {
>>>>>     for _, a := range token.Attr {
>>>>>         if a.Key == "href" {
>>>>>             result = a.Val
>>>>>             ok = true
>>>>>             break
>>>>>         }
>>>>>     }
>>>>>     return
>>>>> }
>>>>>
>>>>>
>>>>> On Sunday, September 24, 2017 at 10:13:16 PM UTC+8, Aaron wrote:
>>>>>>
>>>>>> Hi, I am learning Go concurrency and trying to build a simple website
>>>>>> crawler. I managed to crawl all the links of the pages at any depth of
>>>>>> a website. But I still have one problem to tackle: how do I avoid
>>>>>> re-crawling links that have already been visited?
>>>>>>
>>>>>> Here is my code. Hope you guys can shed some light. Thank you in
>>>>>> advance.
>>>>>>
>>>>>> package main
>>>>>> import (
>>>>>>     "fmt"
>>>>>>     "log"
>>>>>>     "net/http"
>>>>>>     "os"
>>>>>>     "strings"
>>>>>>
>>>>>>     "golang.org/x/net/html")
>>>>>>
>>>>>> func main() {
>>>>>>     if len(os.Args) != 2 {
>>>>>>         fmt.Println("Usage: crawl [URL].")
>>>>>>         return
>>>>>>     }
>>>>>>
>>>>>>     url := os.Args[1]
>>>>>>     if !strings.HasPrefix(url, "http://") {
>>>>>>         url = "http://" + url
>>>>>>     }
>>>>>>
>>>>>>     for link := range newCrawl(url, 1) {
>>>>>>         fmt.Println(link)
>>>>>>     }
>>>>>> }
>>>>>>
>>>>>> func newCrawl(url string, num int) chan string {
>>>>>>     ch := make(chan string, 20)
>>>>>>
>>>>>>     go func() {
>>>>>>         crawl(url, 1, ch)
>>>>>>         close(ch)
>>>>>>     }()
>>>>>>
>>>>>>     return ch
>>>>>> }
>>>>>>
>>>>>> func crawl(url string, n int, ch chan string) {
>>>>>>     if n < 1 {
>>>>>>         return
>>>>>>     }
>>>>>>     resp, err := http.Get(url)
>>>>>>     if err != nil {
>>>>>>         log.Fatalf("Can not reach the site. Error = %v\n", err)
>>>>>>         os.Exit(1)
>>>>>>     }
>>>>>>
>>>>>>     b := resp.Body
>>>>>>     defer b.Close()
>>>>>>
>>>>>>     z := html.NewTokenizer(b)
>>>>>>
>>>>>>     nextN := n - 1
>>>>>>     for {
>>>>>>         token := z.Next()
>>>>>>
>>>>>>         switch token {
>>>>>>         case html.ErrorToken:
>>>>>>             return
>>>>>>         case html.StartTagToken:
>>>>>>             current := z.Token()
>>>>>>             if current.Data != "a" {
>>>>>>                 continue
>>>>>>             }
>>>>>>             result, ok := getHrefTag(current)
>>>>>>             if !ok {
>>>>>>                 continue
>>>>>>             }
>>>>>>
>>>>>>             hasProto := strings.HasPrefix(result, "http")
>>>>>>             if hasProto {
>>>>>>                 done := make(chan struct{})
>>>>>>                 go func() {
>>>>>>                     crawl(result, nextN, ch)
>>>>>>                     close(done)
>>>>>>                 }()
>>>>>>                 <-done
>>>>>>                 ch <- result
>>>>>>             }
>>>>>>         }
>>>>>>     }
>>>>>> }
>>>>>>
>>>>>> func getHrefTag(token html.Token) (result string, ok bool) {
>>>>>>     for _, a := range token.Attr {
>>>>>>         if a.Key == "href" {
>>>>>>             result = a.Val
>>>>>>             ok = true
>>>>>>             break
>>>>>>         }
>>>>>>     }
>>>>>>     return
>>>>>> }
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Michael T. Jones
>>>> michae...@gmail.com
>>>>
>>>
>>>
>>>
>>> --
>>> Michael T. Jones
>>> michae...@gmail.com
>>>
>>
>
>
>
> --
> Michael T. Jones
> michael.jo...@gmail.com
>



-- 
Michael T. Jones
michael.jo...@gmail.com
