I suggest that you first make it work the simple way (see the sketch below) and then make it concurrent.
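For the simple version, a single goroutine and a visited map already solve the duplicate-URL problem, and no locking is needed because only one goroutine ever touches the map. A minimal sketch, where fetchLinks is a hypothetical stand-in for your fetch-and-parse code:

package main

import "fmt"

// fetchLinks is a hypothetical stand-in: fetch the page at url and
// return the absolute URLs found on it.
func fetchLinks(url string) []string { /* ... */ return nil }

// crawl visits url and everything reachable from it, up to depth
// levels deep, skipping anything already in visited.
func crawl(url string, depth int, visited map[string]bool) {
    if depth < 1 || visited[url] {
        return
    }
    visited[url] = true
    fmt.Println(url)
    for _, link := range fetchLinks(url) {
        crawl(link, depth-1, visited)
    }
}

func main() {
    crawl("http://example.com", 3, make(map[string]bool))
}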
However, one lock-free, concurrent way to think about this is as follows:

1. Start with a list of URLs (in code, on the command line, etc.).
2. Spawn a goroutine that writes each of them to a channel of strings, perhaps called PENDING.
3. Spawn a goroutine that reads a URL string from PENDING and, if it is not already in the map of processed URLs, adds it to the map and then writes it to a second channel of strings, WORK.
4. Spawn a set of goroutines that read from WORK, fetch the URL, do whatever it is that you need to do, and write any URLs found on the page back to PENDING.

That is enough. Now, as written, you have the challenge of knowing when the workers are done and PENDING is empty; that is when you exit. There are other ways to arrange this, but the point, restating with emphasis what an earlier email said, is to keep the map in its own goroutine: the one that decides which URLs should be processed.
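To make that concrete, here is a minimal sketch of the design, with two liberties taken: steps 1-3 are folded into a single coordinator goroutine, and the workers report each page's links back as a single batch (pending carries []string rather than string), which lets the goroutine that owns the map also count pages in flight and close WORK when the queue is empty and nothing is outstanding. As before, fetchLinks is a hypothetical stand-in for the real fetching and parsing.

package main

import "fmt"

// fetchLinks is a hypothetical stand-in: fetch the page at url and
// return the absolute URLs found on it.
func fetchLinks(url string) []string { /* ... */ return nil }

// worker reads URLs from work, fetches each one, and reports the
// links found on that page back on pending as a single batch.
func worker(work <-chan string, pending chan<- []string) {
    for url := range work {
        pending <- fetchLinks(url)
    }
}

// coordinator owns the visited map, so no mutex is needed. It also
// counts pages in flight: when its queue is empty and no pages are
// outstanding, the crawl is done and it closes work to stop the workers.
func coordinator(seeds []string, work chan<- string, pending <-chan []string) {
    visited := make(map[string]bool)
    var queue []string // deduplicated URLs waiting for a worker
    inflight := 0      // URLs handed to workers, not yet reported back

    // admit enqueues a URL exactly once.
    admit := func(u string) {
        if !visited[u] {
            visited[u] = true
            fmt.Println(u)
            queue = append(queue, u)
        }
    }
    for _, u := range seeds {
        admit(u)
    }

    for len(queue) > 0 || inflight > 0 {
        // Enable the send case only when there is work to hand out;
        // a nil channel makes that select case block forever.
        var send chan<- string
        var next string
        if len(queue) > 0 {
            next, send = queue[0], work
        }
        select {
        case send <- next:
            queue = queue[1:]
            inflight++
        case links := <-pending:
            inflight--
            for _, u := range links {
                admit(u)
            }
        }
    }
    close(work) // no more URLs: the workers' range loops end
}

func main() {
    work := make(chan string)
    pending := make(chan []string)
    for i := 0; i < 4; i++ { // worker pool size is arbitrary here
        go worker(work, pending)
    }
    coordinator([]string{"http://example.com"}, work, pending)
}

Because the coordinator can always either hand out work or receive a batch of results, the workers never block forever, and because only the coordinator touches the map and the counter, there is nothing to lock.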
On Mon, Sep 25, 2017 at 5:35 AM, Aaron <gdz...@gmail.com> wrote:

> I have come up with a fix using a Mutex, but I am not sure how to do it
> with channels.
>
> package main
>
> import (
>     "fmt"
>     "log"
>     "net/http"
>     "os"
>     "strings"
>     "sync"
>
>     "golang.org/x/net/html"
> )
>
> var lock = sync.RWMutex{}
>
> func main() {
>     if len(os.Args) != 2 {
>         fmt.Println("Usage: crawl [URL].")
>     }
>
>     url := os.Args[1]
>     if !strings.HasPrefix(url, "http://") {
>         url = "http://" + url
>     }
>
>     n := 0
>
>     for link := range newCrawl(url, 1) {
>         n++
>         fmt.Println(link)
>     }
>
>     fmt.Printf("Total links: %d\n", n)
> }
>
> func newCrawl(url string, num int) chan string {
>     visited := make(map[string]bool)
>     ch := make(chan string, 20)
>
>     go func() {
>         crawl(url, 3, ch, &visited)
>         close(ch)
>     }()
>
>     return ch
> }
>
> func crawl(url string, n int, ch chan string, visited *map[string]bool) {
>     if n < 1 {
>         return
>     }
>     resp, err := http.Get(url)
>     if err != nil {
>         log.Fatalf("Can not reach the site. Error = %v\n", err)
>         os.Exit(1)
>     }
>
>     b := resp.Body
>     defer b.Close()
>
>     z := html.NewTokenizer(b)
>
>     nextN := n - 1
>     for {
>         token := z.Next()
>
>         switch token {
>         case html.ErrorToken:
>             return
>         case html.StartTagToken:
>             current := z.Token()
>             if current.Data != "a" {
>                 continue
>             }
>             result, ok := getHrefTag(current)
>             if !ok {
>                 continue
>             }
>
>             hasProto := strings.HasPrefix(result, "http")
>             if hasProto {
>                 lock.RLock()
>                 ok := (*visited)[result]
>                 lock.RUnlock()
>                 if ok {
>                     continue
>                 }
>                 done := make(chan struct{})
>                 go func() {
>                     crawl(result, nextN, ch, visited)
>                     close(done)
>                 }()
>                 <-done
>                 lock.Lock()
>                 (*visited)[result] = true
>                 lock.Unlock()
>                 ch <- result
>             }
>         }
>     }
> }
>
> func getHrefTag(token html.Token) (result string, ok bool) {
>     for _, a := range token.Attr {
>         if a.Key == "href" {
>             result = a.Val
>             ok = true
>             break
>         }
>     }
>     return
> }
>
> On Sunday, September 24, 2017 at 10:13:16 PM UTC+8, Aaron wrote:
>>
>> Hi, I am learning Golang concurrency and trying to build a simple
>> website crawler. I managed to crawl all the links of the pages at any
>> depth of a website, but I still have one problem to tackle: how do I
>> avoid crawling links that have already been visited?
>>
>> Here is my code. Hope you guys can shed some light. Thank you in
>> advance.
>>
>> package main
>>
>> import (
>>     "fmt"
>>     "log"
>>     "net/http"
>>     "os"
>>     "strings"
>>
>>     "golang.org/x/net/html"
>> )
>>
>> func main() {
>>     if len(os.Args) != 2 {
>>         fmt.Println("Usage: crawl [URL].")
>>     }
>>
>>     url := os.Args[1]
>>     if !strings.HasPrefix(url, "http://") {
>>         url = "http://" + url
>>     }
>>
>>     for link := range newCrawl(url, 1) {
>>         fmt.Println(link)
>>     }
>> }
>>
>> func newCrawl(url string, num int) chan string {
>>     ch := make(chan string, 20)
>>
>>     go func() {
>>         crawl(url, 1, ch)
>>         close(ch)
>>     }()
>>
>>     return ch
>> }
>>
>> func crawl(url string, n int, ch chan string) {
>>     if n < 1 {
>>         return
>>     }
>>     resp, err := http.Get(url)
>>     if err != nil {
>>         log.Fatalf("Can not reach the site. Error = %v\n", err)
>>         os.Exit(1)
>>     }
>>
>>     b := resp.Body
>>     defer b.Close()
>>
>>     z := html.NewTokenizer(b)
>>
>>     nextN := n - 1
>>     for {
>>         token := z.Next()
>>
>>         switch token {
>>         case html.ErrorToken:
>>             return
>>         case html.StartTagToken:
>>             current := z.Token()
>>             if current.Data != "a" {
>>                 continue
>>             }
>>             result, ok := getHrefTag(current)
>>             if !ok {
>>                 continue
>>             }
>>
>>             hasProto := strings.HasPrefix(result, "http")
>>             if hasProto {
>>                 done := make(chan struct{})
>>                 go func() {
>>                     crawl(result, nextN, ch)
>>                     close(done)
>>                 }()
>>                 <-done
>>                 ch <- result
>>             }
>>         }
>>     }
>> }
>>
>> func getHrefTag(token html.Token) (result string, ok bool) {
>>     for _, a := range token.Attr {
>>         if a.Key == "href" {
>>             result = a.Val
>>             ok = true
>>             break
>>         }
>>     }
>>     return
>> }

-- 
Michael T. Jones
michael.jo...@gmail.com