Thank you Mike. I have read the book and done that exercise. The code in that example does not crawl down to a given depth of the website and stop there, and my requirements are a bit different, so I can't just use the approach from the example directly; more precisely, I don't know how to apply it in my code.
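To make the requirement concrete, here is a sequential sketch of what I mean by stopping at a certain depth. extractLinks is only a stand-in for a real fetch-and-parse step (like the tokenizer loop in my code quoted below), and the toy graph in main just makes the sketch runnable:

package main

import "fmt"

type page struct {
    url   string
    depth int
}

// crawlTo prints start and everything reachable from it, but never follows
// links found at maxDepth, and never visits the same URL twice.
func crawlTo(start string, maxDepth int, extractLinks func(string) []string) {
    seen := map[string]bool{start: true}
    queue := []page{{start, 0}}
    for len(queue) > 0 {
        p := queue[0]
        queue = queue[1:]
        fmt.Println(p.url)
        if p.depth == maxDepth {
            continue // requested depth reached: do not descend further
        }
        for _, link := range extractLinks(p.url) {
            if !seen[link] {
                seen[link] = true
                queue = append(queue, page{link, p.depth + 1})
            }
        }
    }
}

func main() {
    // Toy link "graph" standing in for real HTTP fetches.
    links := map[string][]string{"a": {"b", "c"}, "b": {"d"}}
    crawlTo("a", 1, func(u string) []string { return links[u] })
}

What I still can't see is how to keep a per-URL depth like this once the fetching happens concurrently in the crawl3 style.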
On Tuesday, September 26, 2017 at 6:18:43 AM UTC+8, Michael Jones wrote:

The book, The Go Programming Language, discusses the web crawl task at several points through the text. The simplest complete parallel version is:

https://github.com/adonovan/gopl.io/blob/master/ch8/crawl3/findlinks.go

which, if you download and build it, works quite nicely:

$ crawl3 http://www.golang.org
http://www.golang.org
http://www.google.com/intl/en/policies/privacy/
https://golang.org/doc/tos.html
https://golang.org/project/
https://golang.org/pkg/
https://golang.org/doc/
http://play.golang.org/
https://tour.golang.org/
https://golang.org/LICENSE
https://developers.google.com/site-policies#restrictions
https://golang.org/dl/
https://golang.org/blog/
https://golang.org/help/
https://golang.org/
https://blog.golang.org/
https://www.google.com/intl/en/privacy/privacy-policy.html
https://www.google.com/intl/en/policies/terms/
https://golang.org/LICENSE?m=text
https://golang.org/pkg
https://golang.org/doc/go_faq.html
https://groups.google.com/group/golang-nuts
https://blog.gopheracademy.com/gophers-slack-community/
https://golang.org/wiki
https://forum.golangbridge.org/
irc:irc.freenode.net/go-nuts
2017/09/25 15:13:07 Get irc:irc.freenode.net/go-nuts: unsupported protocol scheme "irc"
https://golang.org/doc/faq
https://groups.google.com/group/golang-announce
https://blog.golang.org
https://twitter.com/golang

On Mon, Sep 25, 2017 at 7:46 AM, Michael Jones <michae...@gmail.com> wrote:

I suggest that you first make it work in the simple way and then make it concurrent.

However, one lock-free concurrent way to think of this is as follows:

1. Start with a list of URLs (in code, on the command line, etc.).
2. Spawn a goroutine that writes each of them to a channel of strings, perhaps called PENDING.
3. Spawn a goroutine that reads a URL string from PENDING and, if it is not in the map of already-processed URLs, adds it to the map and writes it to a channel of strings, WORK.
4. Spawn a set of goroutines that read WORK, fetch the URL, do whatever it is that you need to do, and write the URLs found there to PENDING.

This is enough. Now, as written, you have the challenge of knowing when the workers are done and PENDING is empty; that's when you exit. There are other ways to do this, but the point is to state with emphasis what an earlier email said, which is to have the map in its own goroutine, the one that decides which URLs should be processed.
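To check that I understand steps 1-4, here is my attempt at a minimal sketch of that design. I used batches ([]string) on PENDING so a worker can hand back everything it found at once, and a simple counter owned by the same goroutine that owns the map to decide when everything has drained (I realise that is only one of the possible ways to know when to exit). fetchLinks and the toy graph in main are stand-ins for the real fetch-and-parse step:

package main

import "fmt"

// crawlAll prints every URL reachable from seeds exactly once.
func crawlAll(seeds []string, workers int, fetchLinks func(string) []string) {
    pending := make(chan []string) // batches of discovered links, may contain repeats
    work := make(chan string)      // deduplicated links handed to the workers

    // Step 4: workers fetch a link and report what they found back to PENDING.
    // The send back to PENDING runs in its own goroutine so a worker never
    // blocks there and can keep draining WORK.
    for i := 0; i < workers; i++ {
        go func() {
            for link := range work {
                fmt.Println(link)
                found := fetchLinks(link)
                go func() { pending <- found }()
            }
        }()
    }

    // Steps 1 and 2: feed the seed URLs into PENDING.
    n := 1 // number of batches still owed to PENDING (the seed batch counts as one)
    go func() { pending <- seeds }()

    // Step 3: this goroutine is the only one that touches the map. It also
    // keeps the counter, so it knows when nothing more can arrive.
    seen := make(map[string]bool)
    for ; n > 0; n-- {
        for _, link := range <-pending {
            if !seen[link] {
                seen[link] = true
                n++ // the worker will owe one batch back for this link
                work <- link
            }
        }
    }
    close(work) // every batch accounted for: let the workers exit
}

func main() {
    // Toy link "graph" standing in for real HTTP fetches.
    links := map[string][]string{"a": {"b", "c"}, "b": {"c", "d"}, "c": {"a"}}
    crawlAll([]string{"a"}, 3, func(u string) []string { return links[u] })
}

If that shape is right, I suppose my depth requirement could be layered on by sending small (url, depth) structs on PENDING and WORK instead of plain strings, and having workers stop extracting links once the depth limit is reached.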
On Mon, Sep 25, 2017 at 5:35 AM, Aaron <gdz...@gmail.com> wrote:

I have come up with a fix, using Mutex. But I am not sure how to do it with channels.

package main

import (
    "fmt"
    "log"
    "net/http"
    "os"
    "strings"
    "sync"

    "golang.org/x/net/html"
)

var lock = sync.RWMutex{}

func main() {
    if len(os.Args) != 2 {
        fmt.Println("Usage: crawl [URL].")
    }

    url := os.Args[1]
    if !strings.HasPrefix(url, "http://") {
        url = "http://" + url
    }

    n := 0

    for link := range newCrawl(url, 1) {
        n++
        fmt.Println(link)
    }

    fmt.Printf("Total links: %d\n", n)
}

func newCrawl(url string, num int) chan string {
    visited := make(map[string]bool)
    ch := make(chan string, 20)

    go func() {
        crawl(url, 3, ch, &visited)
        close(ch)
    }()

    return ch
}

func crawl(url string, n int, ch chan string, visited *map[string]bool) {
    if n < 1 {
        return
    }
    resp, err := http.Get(url)
    if err != nil {
        log.Fatalf("Can not reach the site. Error = %v\n", err)
        os.Exit(1)
    }

    b := resp.Body
    defer b.Close()

    z := html.NewTokenizer(b)

    nextN := n - 1
    for {
        token := z.Next()

        switch token {
        case html.ErrorToken:
            return
        case html.StartTagToken:
            current := z.Token()
            if current.Data != "a" {
                continue
            }
            result, ok := getHrefTag(current)
            if !ok {
                continue
            }

            hasProto := strings.HasPrefix(result, "http")
            if hasProto {
                lock.RLock()
                ok := (*visited)[result]
                lock.RUnlock()
                if ok {
                    continue
                }
                done := make(chan struct{})
                go func() {
                    crawl(result, nextN, ch, visited)
                    close(done)
                }()
                <-done
                lock.Lock()
                (*visited)[result] = true
                lock.Unlock()
                ch <- result
            }
        }
    }
}

func getHrefTag(token html.Token) (result string, ok bool) {
    for _, a := range token.Attr {
        if a.Key == "href" {
            result = a.Val
            ok = true
            break
        }
    }
    return
}
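Re-reading my own fix, I think I see two weaknesses (please correct me if I'm wrong): the go func / <-done pair is just a blocking call, so the crawl is still effectively sequential; and the read-locked check and the later write-locked update are separate critical sections, so if the crawls did run freely, two goroutines could both pass the check for the same URL. Marking the URL as visited before crawling it, in a single critical section, would avoid both. Roughly, inside the if hasProto block:

    lock.Lock()
    seen := (*visited)[result]
    if !seen {
        (*visited)[result] = true // claim the URL before anyone else can re-enter it
    }
    lock.Unlock()
    if seen {
        continue
    }
    crawl(result, nextN, ch, visited) // a plain call does the same as go + <-done
    ch <- result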
On Sunday, September 24, 2017 at 10:13:16 PM UTC+8, Aaron wrote:

Hi, I am learning Go concurrency and trying to build a simple website crawler. I managed to crawl all the links on the pages of a website to any depth, but I still have one problem to tackle: how do I avoid crawling links that have already been visited?

Here is my code. Hope you guys can shed some light. Thank you in advance.

package main

import (
    "fmt"
    "log"
    "net/http"
    "os"
    "strings"

    "golang.org/x/net/html"
)

func main() {
    if len(os.Args) != 2 {
        fmt.Println("Usage: crawl [URL].")
    }

    url := os.Args[1]
    if !strings.HasPrefix(url, "http://") {
        url = "http://" + url
    }

    for link := range newCrawl(url, 1) {
        fmt.Println(link)
    }
}

func newCrawl(url string, num int) chan string {
    ch := make(chan string, 20)

    go func() {
        crawl(url, 1, ch)
        close(ch)
    }()

    return ch
}

func crawl(url string, n int, ch chan string) {
    if n < 1 {
        return
    }
    resp, err := http.Get(url)
    if err != nil {
        log.Fatalf("Can not reach the site. Error = %v\n", err)
        os.Exit(1)
    }

    b := resp.Body
    defer b.Close()

    z := html.NewTokenizer(b)

    nextN := n - 1
    for {
        token := z.Next()

        switch token {
        case html.ErrorToken:
            return
        case html.StartTagToken:
            current := z.Token()
            if current.Data != "a" {
                continue
            }
            result, ok := getHrefTag(current)
            if !ok {
                continue
            }

            hasProto := strings.HasPrefix(result, "http")
            if hasProto {
                done := make(chan struct{})
                go func() {
                    crawl(result, nextN, ch)
                    close(done)
                }()
                <-done
                ch <- result
            }
        }
    }
}

func getHrefTag(token html.Token) (result string, ok bool) {
    for _, a := range token.Attr {
        if a.Key == "href" {
            result = a.Val
            ok = true
            break
        }
    }
    return
}