To be sure, examples are a starting place. To crawl to a certain depth, you need to know the depth of each url. That means the url being passed around stops being "just a string" and becomes a struct carrying, at least, the url as a string and the link depth as an integer. Following a link then means incrementing the depth, checking it against the limit, and checking whether the string part is already in the map.
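Something like this -- an untested sketch, where the names (link, maxDepth, seen) are made up, and the map is meant to live in the single goroutine that decides what gets enqueued:

package main

import "fmt"

// link is what travels through the work queue instead of a bare string.
type link struct {
    url   string // the url itself
    depth int    // how many hops from the starting page
}

const maxDepth = 3

// seen is consulted only by the one goroutine that makes the
// enqueue decision, so it needs no lock.
var seen = make(map[string]bool)

// follow bumps the depth, enforces the limit, and checks the map;
// the second result says whether the url should be enqueued.
func follow(from link, url string) (link, bool) {
    next := link{url: url, depth: from.depth + 1}
    if next.depth > maxDepth || seen[next.url] {
        return link{}, false
    }
    seen[next.url] = true
    return next, true
}

func main() {
    start := link{url: "http://www.golang.org", depth: 0}
    if next, ok := follow(start, "https://golang.org/doc/"); ok {
        fmt.Println("enqueue:", next.url, "at depth", next.depth)
    }
}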
To crawl to a certain distance from a website, in the sense of how many hops to other websites, needs similar bookkeeping on the hostname. To crawl with rate limiting, so as not to overburden a distant host's server(s), means having multiple work queues, one per hostname. Each per-host queue is the natural place to apply rate limits, time windows, or other such processing.
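For instance -- again an untested sketch; the one-request-per-second pace and the names hostQueues and queueFor are made up for illustration:

package main

import (
    "fmt"
    "net/url"
    "time"
)

// one work queue per hostname, created on first sight of the host.
var hostQueues = make(map[string]chan string)

// queueFor lazily creates a queue and a rate-limited worker for the
// url's hostname; the per-host worker is the natural place to pace
// requests, apply time windows, and so on.
func queueFor(rawURL string) chan string {
    u, err := url.Parse(rawURL)
    if err != nil {
        return nil
    }
    host := u.Hostname()
    q, ok := hostQueues[host]
    if !ok {
        q = make(chan string, 100)
        hostQueues[host] = q
        go func() {
            limit := time.NewTicker(1 * time.Second) // at most one request per second to this host
            defer limit.Stop()
            for link := range q {
                <-limit.C
                fmt.Println("fetching", link, "from", host)
                // http.Get(link) and link extraction would go here.
            }
        }()
    }
    return q
}

func main() {
    for _, link := range []string{
        "https://golang.org/doc/",
        "https://golang.org/pkg/",
        "https://blog.golang.org/",
    } {
        if q := queueFor(link); q != nil {
            q <- link
        }
    }
    time.Sleep(4 * time.Second) // crude: let the demo drain before exiting
}

(Here hostQueues is touched only by the one goroutine that dispatches, so it needs no lock; closing the queues and collecting results are left out of the sketch.)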
The real meaning of the book's crawl3 example is to point out that the work queue -- a channel here -- can be written to by the very (go)routines that read from it. That is a useful notion in this case.
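Here is that shape as an untested sketch, with PENDING and WORK as in the message quoted below; the outstanding counter is just one way to solve the "when are we done?" problem, and a fake fetch stands in for the real HTTP work:

package main

import "fmt"

// a tiny fake web so the sketch runs and terminates.
var pages = map[string][]string{
    "http://www.golang.org":   {"https://golang.org/doc/", "https://golang.org/pkg/"},
    "https://golang.org/doc/": {"https://golang.org/pkg/", "http://www.golang.org"},
}

// fetch stands in for http.Get plus link extraction.
func fetch(url string) []string {
    fmt.Println("fetched:", url)
    return pages[url]
}

func main() {
    pending := make(chan string, 100) // urls found, not yet vetted
    work := make(chan string, 100)    // urls vetted, waiting to be fetched
    done := make(chan int)            // each finished fetch reports how many urls it found

    // workers: read WORK, fetch, and write urls found back to PENDING.
    for i := 0; i < 4; i++ {
        go func() {
            for url := range work {
                found := fetch(url)
                for _, u := range found {
                    pending <- u
                }
                done <- len(found)
            }
        }()
    }

    // seed PENDING with the starting urls.
    seeds := []string{"http://www.golang.org"}
    for _, s := range seeds {
        pending <- s
    }

    // the dispatcher owns the map, so no lock is needed. outstanding
    // counts urls still somewhere in the system; at zero the workers
    // are idle and pending is empty, which is when we exit.
    seen := make(map[string]bool)
    outstanding := len(seeds)
    for outstanding > 0 {
        select {
        case url := <-pending:
            if seen[url] {
                outstanding-- // a duplicate leaves the system here
                continue
            }
            seen[url] = true
            work <- url
        case n := <-done:
            outstanding += n - 1 // n urls entered the system, one left it
        }
    }
    close(work)
}

The generous buffers paper over a real hazard: if WORK or PENDING fills, the dispatcher and the workers can deadlock writing to each other, so a production version needs more care there.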
Good luck!

On Mon, Sep 25, 2017 at 8:46 PM, Aaron <gdz...@gmail.com> wrote:

> Thank you Mike. I have read the book and done that exercise. The code
> in the example will not crawl to a certain depth and stop at that
> depth of the website. My requirements are a bit different. I can't
> just use the approach in the example directly; more precisely, I don't
> know how to use it in my code.
>
> On Tuesday, September 26, 2017 at 6:18:43 AM UTC+8, Michael Jones wrote:
>>
>> The book, The Go Programming Language, discusses the web crawl task
>> at several points through the text. The simplest complete parallel
>> version is:
>>
>> https://github.com/adonovan/gopl.io/blob/master/ch8/crawl3/findlinks.go
>>
>> which, if you download and build it, works quite nicely:
>>
>> $ crawl3 http://www.golang.org
>> http://www.golang.org
>> http://www.google.com/intl/en/policies/privacy/
>> https://golang.org/doc/tos.html
>> https://golang.org/project/
>> https://golang.org/pkg/
>> https://golang.org/doc/
>> http://play.golang.org/
>> https://tour.golang.org/
>> https://golang.org/LICENSE
>> https://developers.google.com/site-policies#restrictions
>> https://golang.org/dl/
>> https://golang.org/blog/
>> https://golang.org/help/
>> https://golang.org/
>> https://blog.golang.org/
>> https://www.google.com/intl/en/privacy/privacy-policy.html
>> https://www.google.com/intl/en/policies/terms/
>> https://golang.org/LICENSE?m=text
>> https://golang.org/pkg
>> https://golang.org/doc/go_faq.html
>> https://groups.google.com/group/golang-nuts
>> https://blog.gopheracademy.com/gophers-slack-community/
>> https://golang.org/wiki
>> https://forum.golangbridge.org/
>> irc:irc.freenode.net/go-nuts
>> 2017/09/25 15:13:07 Get irc:irc.freenode.net/go-nuts: unsupported
>> protocol scheme "irc"
>> https://golang.org/doc/faq
>> https://groups.google.com/group/golang-announce
>> https://blog.golang.org
>> https://twitter.com/golang
>> :
>>
>> On Mon, Sep 25, 2017 at 7:46 AM, Michael Jones <michae...@gmail.com>
>> wrote:
>>
>>> i suggest that you first make it work in the simple way and then
>>> make it concurrent.
>>>
>>> however, one lock-free concurrent way to think of this is as
>>> follows...
>>>
>>> 1. start with a list of urls (in code, on command line, etc.)
>>> 2. spawn a go process that writes each of them to a channel of
>>> strings, perhaps called PENDING
>>> 3. spawn a go process that reads a url string from PENDING and, if
>>> it is not in the map of already processed urls, writes it to a
>>> channel of strings, WORK, after adding the url to the map.
>>> 4. spawn a set of go processes that read WORK, fetch the url, do
>>> whatever it is that you need to do, and, for urls found there, write
>>> them to PENDING.
>>>
>>> this is enough. now as written you have the challenge to know when
>>> the workers are done and pending is empty. that's when you exit.
>>> there are other ways to do this, but the point is to state with
>>> emphasis what an earlier email said, which is to have the map in its
>>> own goroutine, the one that decides which urls should be processed.
>>>
>>> On Mon, Sep 25, 2017 at 5:35 AM, Aaron <gdz...@gmail.com> wrote:
>>>
>>>> I have come up with a fix, using Mutex. But I am not sure how to do
>>>> it with channels.
>>>>
>>>> package main
>>>>
>>>> import (
>>>>     "fmt"
>>>>     "log"
>>>>     "net/http"
>>>>     "os"
>>>>     "strings"
>>>>     "sync"
>>>>
>>>>     "golang.org/x/net/html"
>>>> )
>>>>
>>>> var lock = sync.RWMutex{}
>>>>
>>>> func main() {
>>>>     if len(os.Args) != 2 {
>>>>         fmt.Println("Usage: crawl [URL].")
>>>>     }
>>>>
>>>>     url := os.Args[1]
>>>>     if !strings.HasPrefix(url, "http://") {
>>>>         url = "http://" + url
>>>>     }
>>>>
>>>>     n := 0
>>>>
>>>>     for link := range newCrawl(url, 1) {
>>>>         n++
>>>>         fmt.Println(link)
>>>>     }
>>>>
>>>>     fmt.Printf("Total links: %d\n", n)
>>>> }
>>>>
>>>> func newCrawl(url string, num int) chan string {
>>>>     visited := make(map[string]bool)
>>>>     ch := make(chan string, 20)
>>>>
>>>>     go func() {
>>>>         crawl(url, 3, ch, &visited)
>>>>         close(ch)
>>>>     }()
>>>>
>>>>     return ch
>>>> }
>>>>
>>>> func crawl(url string, n int, ch chan string, visited *map[string]bool) {
>>>>     if n < 1 {
>>>>         return
>>>>     }
>>>>     resp, err := http.Get(url)
>>>>     if err != nil {
>>>>         log.Fatalf("Can not reach the site. Error = %v\n", err)
>>>>         os.Exit(1)
>>>>     }
>>>>
>>>>     b := resp.Body
>>>>     defer b.Close()
>>>>
>>>>     z := html.NewTokenizer(b)
>>>>
>>>>     nextN := n - 1
>>>>     for {
>>>>         token := z.Next()
>>>>
>>>>         switch token {
>>>>         case html.ErrorToken:
>>>>             return
>>>>         case html.StartTagToken:
>>>>             current := z.Token()
>>>>             if current.Data != "a" {
>>>>                 continue
>>>>             }
>>>>             result, ok := getHrefTag(current)
>>>>             if !ok {
>>>>                 continue
>>>>             }
>>>>
>>>>             hasProto := strings.HasPrefix(result, "http")
>>>>             if hasProto {
>>>>                 lock.RLock()
>>>>                 ok := (*visited)[result]
>>>>                 lock.RUnlock()
>>>>                 if ok {
>>>>                     continue
>>>>                 }
>>>>                 done := make(chan struct{})
>>>>                 go func() {
>>>>                     crawl(result, nextN, ch, visited)
>>>>                     close(done)
>>>>                 }()
>>>>                 <-done
>>>>                 lock.Lock()
>>>>                 (*visited)[result] = true
>>>>                 lock.Unlock()
>>>>                 ch <- result
>>>>             }
>>>>         }
>>>>     }
>>>> }
>>>>
>>>> func getHrefTag(token html.Token) (result string, ok bool) {
>>>>     for _, a := range token.Attr {
>>>>         if a.Key == "href" {
>>>>             result = a.Val
>>>>             ok = true
>>>>             break
>>>>         }
>>>>     }
>>>>     return
>>>> }
>>>>
>>>> On Sunday, September 24, 2017 at 10:13:16 PM UTC+8, Aaron wrote:
>>>>>
>>>>> Hi, I am learning Golang concurrency and trying to build a simple
>>>>> website crawler. I managed to crawl all the links of the pages of
>>>>> any depth of a website. But I still have one problem to tackle:
>>>>> how do I avoid crawling links that were previously crawled?
>>>>>
>>>>> Here is my code. Hope you guys can shed some light. Thank you in
>>>>> advance.
>>>>>
>>>>> package main
>>>>>
>>>>> import (
>>>>>     "fmt"
>>>>>     "log"
>>>>>     "net/http"
>>>>>     "os"
>>>>>     "strings"
>>>>>
>>>>>     "golang.org/x/net/html"
>>>>> )
>>>>>
>>>>> func main() {
>>>>>     if len(os.Args) != 2 {
>>>>>         fmt.Println("Usage: crawl [URL].")
>>>>>     }
>>>>>
>>>>>     url := os.Args[1]
>>>>>     if !strings.HasPrefix(url, "http://") {
>>>>>         url = "http://" + url
>>>>>     }
>>>>>
>>>>>     for link := range newCrawl(url, 1) {
>>>>>         fmt.Println(link)
>>>>>     }
>>>>> }
>>>>>
>>>>> func newCrawl(url string, num int) chan string {
>>>>>     ch := make(chan string, 20)
>>>>>
>>>>>     go func() {
>>>>>         crawl(url, 1, ch)
>>>>>         close(ch)
>>>>>     }()
>>>>>
>>>>>     return ch
>>>>> }
>>>>>
>>>>> func crawl(url string, n int, ch chan string) {
>>>>>     if n < 1 {
>>>>>         return
>>>>>     }
>>>>>     resp, err := http.Get(url)
>>>>>     if err != nil {
>>>>>         log.Fatalf("Can not reach the site. Error = %v\n", err)
>>>>>         os.Exit(1)
>>>>>     }
>>>>>
>>>>>     b := resp.Body
>>>>>     defer b.Close()
>>>>>
>>>>>     z := html.NewTokenizer(b)
>>>>>
>>>>>     nextN := n - 1
>>>>>     for {
>>>>>         token := z.Next()
>>>>>
>>>>>         switch token {
>>>>>         case html.ErrorToken:
>>>>>             return
>>>>>         case html.StartTagToken:
>>>>>             current := z.Token()
>>>>>             if current.Data != "a" {
>>>>>                 continue
>>>>>             }
>>>>>             result, ok := getHrefTag(current)
>>>>>             if !ok {
>>>>>                 continue
>>>>>             }
>>>>>
>>>>>             hasProto := strings.HasPrefix(result, "http")
>>>>>             if hasProto {
>>>>>                 done := make(chan struct{})
>>>>>                 go func() {
>>>>>                     crawl(result, nextN, ch)
>>>>>                     close(done)
>>>>>                 }()
>>>>>                 <-done
>>>>>                 ch <- result
>>>>>             }
>>>>>         }
>>>>>     }
>>>>> }
>>>>>
>>>>> func getHrefTag(token html.Token) (result string, ok bool) {
>>>>>     for _, a := range token.Attr {
>>>>>         if a.Key == "href" {
>>>>>             result = a.Val
>>>>>             ok = true
>>>>>             break
>>>>>         }
>>>>>     }
>>>>>     return
>>>>> }

--
Michael T. Jones
michael.jo...@gmail.com