Thank you Mike. I have read the book and done that exercise. The code in that example does not crawl down to a given depth of the website and stop there, and my requirements are a bit different, so I can't just use the approach from the example directly; more precisely, I don't know how to apply it in my code.
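To make the requirement concrete, here is a sequential sketch of what I mean by stopping at a certain depth. extractLinks is only a stand-in for a real fetch-and-parse step (like the tokenizer loop in my code quoted below), and the toy graph in main just makes the sketch runnable:

package main

import "fmt"

type page struct {
    url   string
    depth int
}

// crawlTo prints start and everything reachable from it, but never follows
// links found at maxDepth, and never visits the same URL twice.
func crawlTo(start string, maxDepth int, extractLinks func(string) []string) {
    seen := map[string]bool{start: true}
    queue := []page{{start, 0}}
    for len(queue) > 0 {
        p := queue[0]
        queue = queue[1:]
        fmt.Println(p.url)
        if p.depth == maxDepth {
            continue // requested depth reached: do not descend further
        }
        for _, link := range extractLinks(p.url) {
            if !seen[link] {
                seen[link] = true
                queue = append(queue, page{link, p.depth + 1})
            }
        }
    }
}

func main() {
    // Toy link "graph" standing in for real HTTP fetches.
    links := map[string][]string{"a": {"b", "c"}, "b": {"d"}}
    crawlTo("a", 1, func(u string) []string { return links[u] })
}

What I still can't see is how to keep a per-URL depth like this once the fetching happens concurrently in the crawl3 style.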
On Tuesday, September 26, 2017 at 6:18:43 AM UTC+8, Michael Jones wrote:

The book, The Go Programming Language, discusses the web crawl task at several points through the text. The simplest complete parallel version is:

https://github.com/adonovan/gopl.io/blob/master/ch8/crawl3/findlinks.go

which, if you download and build it, works quite nicely:

$ crawl3 http://www.golang.org
http://www.golang.org
http://www.google.com/intl/en/policies/privacy/
https://golang.org/doc/tos.html
https://golang.org/project/
https://golang.org/pkg/
https://golang.org/doc/
http://play.golang.org/
https://tour.golang.org/
https://golang.org/LICENSE
https://developers.google.com/site-policies#restrictions
https://golang.org/dl/
https://golang.org/blog/
https://golang.org/help/
https://golang.org/
https://blog.golang.org/
https://www.google.com/intl/en/privacy/privacy-policy.html
https://www.google.com/intl/en/policies/terms/
https://golang.org/LICENSE?m=text
https://golang.org/pkg
https://golang.org/doc/go_faq.html
https://groups.google.com/group/golang-nuts
https://blog.gopheracademy.com/gophers-slack-community/
https://golang.org/wiki
https://forum.golangbridge.org/
irc:irc.freenode.net/go-nuts
2017/09/25 15:13:07 Get irc:irc.freenode.net/go-nuts: unsupported protocol scheme "irc"
https://golang.org/doc/faq
https://groups.google.com/group/golang-announce
https://blog.golang.org
https://twitter.com/golang

On Mon, Sep 25, 2017 at 7:46 AM, Michael Jones <michae...@gmail.com> wrote:

I suggest that you first make it work in the simple way and then make it concurrent.

However, one lock-free concurrent way to think of this is as follows:

1. Start with a list of URLs (in code, on the command line, etc.).
2. Spawn a goroutine that writes each of them to a channel of strings, perhaps called PENDING.
3. Spawn a goroutine that reads a URL string from PENDING and, if it is not in the map of already-processed URLs, adds it to the map and writes it to a channel of strings, WORK.
4. Spawn a set of goroutines that read WORK, fetch the URL, do whatever it is that you need to do, and write the URLs found there to PENDING.

This is enough. Now, as written, you have the challenge of knowing when the workers are done and PENDING is empty; that's when you exit. There are other ways to do this, but the point is to state with emphasis what an earlier email said, which is to have the map in its own goroutine, the one that decides which URLs should be processed.
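To check that I understand steps 1-4, here is my attempt at a minimal sketch of that design. I used batches ([]string) on PENDING so a worker can hand back everything it found at once, and a simple counter owned by the same goroutine that owns the map to decide when everything has drained (I realise that is only one of the possible ways to know when to exit). fetchLinks and the toy graph in main are stand-ins for the real fetch-and-parse step:

package main

import "fmt"

// crawlAll prints every URL reachable from seeds exactly once.
func crawlAll(seeds []string, workers int, fetchLinks func(string) []string) {
    pending := make(chan []string) // batches of discovered links, may contain repeats
    work := make(chan string)      // deduplicated links handed to the workers

    // Step 4: workers fetch a link and report what they found back to PENDING.
    // The send back to PENDING runs in its own goroutine so a worker never
    // blocks there and can keep draining WORK.
    for i := 0; i < workers; i++ {
        go func() {
            for link := range work {
                fmt.Println(link)
                found := fetchLinks(link)
                go func() { pending <- found }()
            }
        }()
    }

    // Steps 1 and 2: feed the seed URLs into PENDING.
    n := 1 // number of batches still owed to PENDING (the seed batch counts as one)
    go func() { pending <- seeds }()

    // Step 3: this goroutine is the only one that touches the map. It also
    // keeps the counter, so it knows when nothing more can arrive.
    seen := make(map[string]bool)
    for ; n > 0; n-- {
        for _, link := range <-pending {
            if !seen[link] {
                seen[link] = true
                n++ // the worker will owe one batch back for this link
                work <- link
            }
        }
    }
    close(work) // every batch accounted for: let the workers exit
}

func main() {
    // Toy link "graph" standing in for real HTTP fetches.
    links := map[string][]string{"a": {"b", "c"}, "b": {"c", "d"}, "c": {"a"}}
    crawlAll([]string{"a"}, 3, func(u string) []string { return links[u] })
}

If that shape is right, I suppose my depth requirement could be layered on by sending small (url, depth) structs on PENDING and WORK instead of plain strings, and having workers stop extracting links once the depth limit is reached.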
On Mon, Sep 25, 2017 at 5:35 AM, Aaron <gdz...@gmail.com> wrote:

I have come up with a fix, using Mutex. But I am not sure how to do it with channels.

package main

import (
    "fmt"
    "log"
    "net/http"
    "os"
    "strings"
    "sync"

    "golang.org/x/net/html"
)

var lock = sync.RWMutex{}

func main() {
    if len(os.Args) != 2 {
        fmt.Println("Usage: crawl [URL].")
    }

    url := os.Args[1]
    if !strings.HasPrefix(url, "http://") {
        url = "http://" + url
    }

    n := 0

    for link := range newCrawl(url, 1) {
        n++
        fmt.Println(link)
    }

    fmt.Printf("Total links: %d\n", n)
}

func newCrawl(url string, num int) chan string {
    visited := make(map[string]bool)
    ch := make(chan string, 20)

    go func() {
        crawl(url, 3, ch, &visited)
        close(ch)
    }()

    return ch
}

func crawl(url string, n int, ch chan string, visited *map[string]bool) {
    if n < 1 {
        return
    }
    resp, err := http.Get(url)
    if err != nil {
        log.Fatalf("Can not reach the site. Error = %v\n", err)
        os.Exit(1)
    }

    b := resp.Body
    defer b.Close()

    z := html.NewTokenizer(b)

    nextN := n - 1
    for {
        token := z.Next()

        switch token {
        case html.ErrorToken:
            return
        case html.StartTagToken:
            current := z.Token()
            if current.Data != "a" {
                continue
            }
            result, ok := getHrefTag(current)
            if !ok {
                continue
            }

            hasProto := strings.HasPrefix(result, "http")
            if hasProto {
                lock.RLock()
                ok := (*visited)[result]
                lock.RUnlock()
                if ok {
                    continue
                }
                done := make(chan struct{})
                go func() {
                    crawl(result, nextN, ch, visited)
                    close(done)
                }()
                <-done
                lock.Lock()
                (*visited)[result] = true
                lock.Unlock()
                ch <- result
            }
        }
    }
}

func getHrefTag(token html.Token) (result string, ok bool) {
    for _, a := range token.Attr {
        if a.Key == "href" {
            result = a.Val
            ok = true
            break
        }
    }
    return
}
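Re-reading my own fix, I think I see two weaknesses (please correct me if I'm wrong): the go func / <-done pair is just a blocking call, so the crawl is still effectively sequential; and the read-locked check and the later write-locked update are separate critical sections, so if the crawls did run freely, two goroutines could both pass the check for the same URL. Marking the URL as visited before crawling it, in a single critical section, would avoid both. Roughly, inside the if hasProto block:

    lock.Lock()
    seen := (*visited)[result]
    if !seen {
        (*visited)[result] = true // claim the URL before anyone else can re-enter it
    }
    lock.Unlock()
    if seen {
        continue
    }
    crawl(result, nextN, ch, visited) // a plain call does the same as go + <-done
    ch <- result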
On Sunday, September 24, 2017 at 10:13:16 PM UTC+8, Aaron wrote:

Hi, I am learning Go concurrency and trying to build a simple website crawler. I managed to crawl all the links on the pages of a website to any depth, but I still have one problem to tackle: how do I avoid crawling links that have already been visited?

Here is my code. Hope you guys can shed some light. Thank you in advance.

package main

import (
    "fmt"
    "log"
    "net/http"
    "os"
    "strings"

    "golang.org/x/net/html"
)

func main() {
    if len(os.Args) != 2 {
        fmt.Println("Usage: crawl [URL].")
    }

    url := os.Args[1]
    if !strings.HasPrefix(url, "http://") {
        url = "http://" + url
    }

    for link := range newCrawl(url, 1) {
        fmt.Println(link)
    }
}

func newCrawl(url string, num int) chan string {
    ch := make(chan string, 20)

    go func() {
        crawl(url, 1, ch)
        close(ch)
    }()

    return ch
}

func crawl(url string, n int, ch chan string) {
    if n < 1 {
        return
    }
    resp, err := http.Get(url)
    if err != nil {
        log.Fatalf("Can not reach the site. Error = %v\n", err)
        os.Exit(1)
    }

    b := resp.Body
    defer b.Close()

    z := html.NewTokenizer(b)

    nextN := n - 1
    for {
        token := z.Next()

        switch token {
        case html.ErrorToken:
            return
        case html.StartTagToken:
            current := z.Token()
            if current.Data != "a" {
                continue
            }
            result, ok := getHrefTag(current)
            if !ok {
                continue
            }

            hasProto := strings.HasPrefix(result, "http")
            if hasProto {
                done := make(chan struct{})
                go func() {
                    crawl(result, nextN, ch)
                    close(done)
                }()
                <-done
                ch <- result
            }
        }
    }
}

func getHrefTag(token html.Token) (result string, ok bool) {
    for _, a := range token.Attr {
        if a.Key == "href" {
            result = a.Val
            ok = true
            break
        }
    }
    return
}