The book, The Go Programming Language, discusses the web crawl task at
several points in the text. The simplest complete parallel version is:

https://github.com/adonovan/gopl.io/blob/master/ch8/crawl3/findlinks.go

which, if you download and build it, works quite nicely:

$ crawl3 http://www.golang.org
http://www.golang.org
http://www.google.com/intl/en/policies/privacy/
https://golang.org/doc/tos.html
https://golang.org/project/
https://golang.org/pkg/
https://golang.org/doc/
http://play.golang.org/
https://tour.golang.org/
https://golang.org/LICENSE
https://developers.google.com/site-policies#restrictions
https://golang.org/dl/
https://golang.org/blog/
https://golang.org/help/
https://golang.org/
https://blog.golang.org/
https://www.google.com/intl/en/privacy/privacy-policy.html
https://www.google.com/intl/en/policies/terms/
https://golang.org/LICENSE?m=text
https://golang.org/pkg
https://golang.org/doc/go_faq.html
https://groups.google.com/group/golang-nuts
https://blog.gopheracademy.com/gophers-slack-community/
https://golang.org/wiki
https://forum.golangbridge.org/
irc:irc.freenode.net/go-nuts
2017/09/25 15:13:07 Get irc:irc.freenode.net/go-nuts: unsupported protocol scheme "irc"
https://golang.org/doc/faq
https://groups.google.com/group/golang-announce
https://blog.golang.org
https://twitter.com/golang
:

On Mon, Sep 25, 2017 at 7:46 AM, Michael Jones <michael.jo...@gmail.com>
wrote:

> i suggest that you first make it work in the simple way and then make it
> concurrent.
>
> however, one lock-free concurrent way to think of this is as follows...
>
> 1. start with a list of urls (in code, on command line, etc.)
> 2. spawn a goroutine that writes each of them to a channel of strings,
> perhaps called PENDING
> 3. spawn a goroutine that reads a url string from PENDING and, if it is
> not in the map of already-processed urls, adds the url to the map and
> writes it to a channel of strings, WORK
> 4. spawn a set of goroutines that read WORK, fetch the url, do whatever
> it is that you need to do, and write any urls found there to PENDING
>
> this is enough. now, as written, you have the challenge of knowing when
> the workers are done and PENDING is empty; that's when you exit. there are
> other ways to do this, but the point is to state with emphasis what an
> earlier email said: keep the map in its own goroutine, the one that
> decides which urls should be processed.
>
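[The four steps above can be sketched in Go. This is a minimal sketch, not the design verbatim: the PENDING feeder and the dedupe step are folded into one coordinating loop (a single goroutine still owns the map), `fetch` is a hypothetical stand-in for the real HTTP fetch plus link extraction, and termination is handled by counting outstanding fetches.]

```go
package main

import "fmt"

// fetch is a hypothetical stand-in for the real work: fetch the url
// and return the links found on the page.
func fetch(url string) []string {
	pages := map[string][]string{
		"http://a": {"http://b", "http://c"},
		"http://b": {"http://c"},
	}
	return pages[url]
}

// crawl keeps the visited map in a single goroutine (this loop), so no
// lock is needed. Workers read urls from work and report found links on
// results; the loop exits when nothing is queued and nothing is in flight.
func crawl(seeds []string, workers int) []string {
	work := make(chan string)      // urls confirmed new, ready to fetch
	results := make(chan []string) // links found by workers

	for i := 0; i < workers; i++ {
		go func() {
			for url := range work {
				results <- fetch(url)
			}
		}()
	}

	seen := make(map[string]bool)
	var queue, visited []string
	for _, s := range seeds {
		if !seen[s] {
			seen[s] = true
			queue = append(queue, s)
		}
	}
	outstanding := 0 // fetches handed to workers, results not yet back
	for len(queue) > 0 || outstanding > 0 {
		// Enable the send case only when something is queued: a send
		// on a nil channel blocks forever, so select ignores it.
		var send chan string
		var next string
		if len(queue) > 0 {
			next, send = queue[0], work
		}
		select {
		case send <- next:
			queue = queue[1:]
			visited = append(visited, next)
			outstanding++
		case links := <-results:
			outstanding--
			for _, l := range links {
				if !seen[l] {
					seen[l] = true
					queue = append(queue, l)
				}
			}
		}
	}
	close(work) // all done; let the workers exit
	return visited
}

func main() {
	for _, url := range crawl([]string{"http://a"}, 2) {
		fmt.Println(url)
	}
}
```

[Because the map is touched only by the coordinating loop, no worker can race on it, and the outstanding counter answers the "when are we done" question directly.]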
> On Mon, Sep 25, 2017 at 5:35 AM, Aaron <gdz...@gmail.com> wrote:
>
>> I have come up with a fix, using Mutex. But I am not sure how to do it
>> with channels.
>>
>> package main
>>
>> import (
>>     "fmt"
>>     "log"
>>     "net/http"
>>     "os"
>>     "strings"
>>     "sync"
>>
>>     "golang.org/x/net/html"
>> )
>>
>> var lock = sync.RWMutex{}
>>
>> func main() {
>>     if len(os.Args) != 2 {
>>         fmt.Println("Usage: crawl [URL].")
>>         os.Exit(1)
>>     }
>>
>>     url := os.Args[1]
>>     if !strings.HasPrefix(url, "http://") {
>>         url = "http://" + url
>>     }
>>
>>     n := 0
>>
>>     for link := range newCrawl(url, 1) {
>>         n++
>>         fmt.Println(link)
>>     }
>>
>>     fmt.Printf("Total links: %d\n", n)
>> }
>>
>> func newCrawl(url string, num int) chan string {
>>     visited := make(map[string]bool)
>>     ch := make(chan string, 20)
>>
>>     go func() {
>>         crawl(url, 3, ch, &visited)
>>         close(ch)
>>     }()
>>
>>     return ch
>> }
>>
>> func crawl(url string, n int, ch chan string, visited *map[string]bool) {
>>     if n < 1 {
>>         return
>>     }
>>     resp, err := http.Get(url)
>>     if err != nil {
>>         // log.Fatalf already exits with status 1; a separate os.Exit is unreachable.
>>         log.Fatalf("Can not reach the site. Error = %v\n", err)
>>     }
>>
>>     b := resp.Body
>>     defer b.Close()
>>
>>     z := html.NewTokenizer(b)
>>
>>     nextN := n - 1
>>     for {
>>         token := z.Next()
>>
>>         switch token {
>>         case html.ErrorToken:
>>             return
>>         case html.StartTagToken:
>>             current := z.Token()
>>             if current.Data != "a" {
>>                 continue
>>             }
>>             result, ok := getHrefTag(current)
>>             if !ok {
>>                 continue
>>             }
>>
>>             hasProto := strings.HasPrefix(result, "http")
>>             if hasProto {
>>                 lock.RLock()
>>                 ok := (*visited)[result]
>>                 lock.RUnlock()
>>                 if ok {
>>                     continue
>>                 }
>>                 done := make(chan struct{})
>>                 go func() {
>>                     crawl(result, nextN, ch, visited)
>>                     close(done)
>>                 }()
>>                 <-done
>>                 lock.Lock()
>>                 (*visited)[result] = true
>>                 lock.Unlock()
>>                 ch <- result
>>             }
>>         }
>>     }
>> }
>>
>> func getHrefTag(token html.Token) (result string, ok bool) {
>>     for _, a := range token.Attr {
>>         if a.Key == "href" {
>>             result = a.Val
>>             ok = true
>>             break
>>         }
>>     }
>>     return
>> }
>>
>>
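[For the "not sure how to do it with channels" part: one common pattern, sketched here with hypothetical names, is to give the visited map to a single goroutine and query it over a channel. All access is then serialized through that goroutine, so the RWMutex disappears.]

```go
package main

import "fmt"

// request asks whether url has been seen before; reply receives true
// the first time a url appears (i.e., the caller should crawl it).
type request struct {
	url   string
	reply chan bool
}

// newVisited starts a goroutine that owns the visited map. No lock is
// needed because only this goroutine ever touches the map.
func newVisited() chan<- request {
	req := make(chan request)
	go func() {
		seen := make(map[string]bool)
		for r := range req {
			r.reply <- !seen[r.url] // true only on first sighting
			seen[r.url] = true
		}
	}()
	return req
}

// firstTime reports whether this is the first time url has been seen.
func firstTime(req chan<- request, url string) bool {
	reply := make(chan bool)
	req <- request{url, reply}
	return <-reply
}

func main() {
	req := newVisited()
	fmt.Println(firstTime(req, "http://a")) // true
	fmt.Println(firstTime(req, "http://a")) // false
}
```

[Each crawler goroutine would call firstTime before fetching; check-and-mark happens in one step inside the owning goroutine, which also closes the check/insert race present in the RLock/Lock version above.]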
>> On Sunday, September 24, 2017 at 10:13:16 PM UTC+8, Aaron wrote:
>>>
>>> Hi I am learning Golang concurrency and trying to build a simple Website
>>> crawler. I managed to crawl all the links of the pages of any depth of
>>> website. But I still have one problem to tackle: how to avoid crawling
>>> visited links that are previously crawled?
>>>
>>> Here is my code. Hope you guys can shed some light. Thank you in advance.
>>>
>>> package main
>>> import (
>>>     "fmt"
>>>     "log"
>>>     "net/http"
>>>     "os"
>>>     "strings"
>>>
>>>     "golang.org/x/net/html"
>>> )
>>>
>>> func main() {
>>>     if len(os.Args) != 2 {
>>>         fmt.Println("Usage: crawl [URL].")
>>>         os.Exit(1)
>>>     }
>>>
>>>     url := os.Args[1]
>>>     if !strings.HasPrefix(url, "http://") {
>>>         url = "http://" + url
>>>     }
>>>
>>>     for link := range newCrawl(url, 1) {
>>>         fmt.Println(link)
>>>     }
>>> }
>>>
>>> func newCrawl(url string, num int) chan string {
>>>     ch := make(chan string, 20)
>>>
>>>     go func() {
>>>         crawl(url, 1, ch)
>>>         close(ch)
>>>     }()
>>>
>>>     return ch
>>> }
>>>
>>> func crawl(url string, n int, ch chan string) {
>>>     if n < 1 {
>>>         return
>>>     }
>>>     resp, err := http.Get(url)
>>>     if err != nil {
>>>         // log.Fatalf already exits with status 1; a separate os.Exit is unreachable.
>>>         log.Fatalf("Can not reach the site. Error = %v\n", err)
>>>     }
>>>
>>>     b := resp.Body
>>>     defer b.Close()
>>>
>>>     z := html.NewTokenizer(b)
>>>
>>>     nextN := n - 1
>>>     for {
>>>         token := z.Next()
>>>
>>>         switch token {
>>>         case html.ErrorToken:
>>>             return
>>>         case html.StartTagToken:
>>>             current := z.Token()
>>>             if current.Data != "a" {
>>>                 continue
>>>             }
>>>             result, ok := getHrefTag(current)
>>>             if !ok {
>>>                 continue
>>>             }
>>>
>>>             hasProto := strings.HasPrefix(result, "http")
>>>             if hasProto {
>>>                 done := make(chan struct{})
>>>                 go func() {
>>>                     crawl(result, nextN, ch)
>>>                     close(done)
>>>                 }()
>>>                 <-done
>>>                 ch <- result
>>>             }
>>>         }
>>>     }
>>> }
>>>
>>> func getHrefTag(token html.Token) (result string, ok bool) {
>>>     for _, a := range token.Attr {
>>>         if a.Key == "href" {
>>>             result = a.Val
>>>             ok = true
>>>             break
>>>         }
>>>     }
>>>     return
>>> }
>>>
>>> --
>> You received this message because you are subscribed to the Google Groups
>> "golang-nuts" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to golang-nuts+unsubscr...@googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>
>
> --
> Michael T. Jones
> michael.jo...@gmail.com
>




