The book The Go Programming Language discusses the web-crawl task at several points in the text. The simplest complete parallel version is:
https://github.com/adonovan/gopl.io/blob/master/ch8/crawl3/findlinks.go

which, if you download and build it, works quite nicely:

$ crawl3 http://www.golang.org
http://www.golang.org
http://www.google.com/intl/en/policies/privacy/
https://golang.org/doc/tos.html
https://golang.org/project/
https://golang.org/pkg/
https://golang.org/doc/
http://play.golang.org/
https://tour.golang.org/
https://golang.org/LICENSE
https://developers.google.com/site-policies#restrictions
https://golang.org/dl/
https://golang.org/blog/
https://golang.org/help/
https://golang.org/
https://blog.golang.org/
https://www.google.com/intl/en/privacy/privacy-policy.html
https://www.google.com/intl/en/policies/terms/
https://golang.org/LICENSE?m=text
https://golang.org/pkg
https://golang.org/doc/go_faq.html
https://groups.google.com/group/golang-nuts
https://blog.gopheracademy.com/gophers-slack-community/
https://golang.org/wiki
https://forum.golangbridge.org/
irc:irc.freenode.net/go-nuts
2017/09/25 15:13:07 Get irc:irc.freenode.net/go-nuts: unsupported protocol scheme "irc"
https://golang.org/doc/faq
https://groups.google.com/group/golang-announce
https://blog.golang.org
https://twitter.com/golang

On Mon, Sep 25, 2017 at 7:46 AM, Michael Jones <michael.jo...@gmail.com> wrote:

> i suggest that you first make it work in the simple way and then make it
> concurrent.
>
> however, one lock-free concurrent way to think of this is as follows...
>
> 1. start with a list of urls (in code, on command line, etc.)
> 2. spawn a go process that writes each of them to a channel of strings,
> perhaps called PENDING
> 3. spawn a go process that reads a url string from PENDING and, if it is
> not in the map of already-processed urls, writes it to a channel of
> strings, WORK, after adding the url to the map.
> 4. spawn a set of go processes that read WORK, fetch the url, do whatever
> it is that you need to do, and, for urls found there, write them to
> PENDING
>
> this is enough. now, as written, you have the challenge of knowing when
> the workers are done and PENDING is empty. that's when you exit. there
> are other ways to do this, but the point is to state with emphasis what
> an earlier email said, which is to have the map in its own goroutine,
> the one that decides which urls should be processed.
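That design is concrete enough to sketch. What follows is a minimal, hypothetical rendering of the PENDING/WORK scheme; the extractLinks helper, the pool size of 10, and the 100-url cap are illustrative choices, not anything from the thread. Termination, the challenge named above, is handled with an in-flight counter owned by the dispatcher: it counts the batches of links still owed to PENDING, and when it reaches zero the workers are idle and PENDING is empty.

package main

import (
    "fmt"
    "net/http"
    "os"
    "strings"

    "golang.org/x/net/html"
)

// extractLinks fetches url and returns the href values of its <a> tags,
// keeping only absolute http/https links. Errors yield an empty batch.
func extractLinks(url string) []string {
    resp, err := http.Get(url)
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        return nil
    }
    defer resp.Body.Close()

    var links []string
    z := html.NewTokenizer(resp.Body)
    for {
        switch z.Next() {
        case html.ErrorToken:
            return links // end of document (or a real error) lands here
        case html.StartTagToken:
            t := z.Token()
            if t.Data != "a" {
                continue
            }
            for _, a := range t.Attr {
                if a.Key == "href" && strings.HasPrefix(a.Val, "http") {
                    links = append(links, a.Val)
                }
            }
        }
    }
}

func main() {
    pending := make(chan []string) // batches of urls found by workers
    work := make(chan string)      // de-duplicated urls for the workers

    // step 4: workers read WORK, fetch, and report what they find.
    // Each result is sent from a fresh goroutine so a worker never blocks
    // on pending while the dispatcher blocks handing it more work.
    for i := 0; i < 10; i++ {
        go func() {
            for url := range work {
                links := extractLinks(url)
                go func() { pending <- links }()
            }
        }()
    }

    // steps 1 and 2: seed PENDING with the command-line urls.
    go func() { pending <- os.Args[1:] }()

    // step 3: the dispatcher owns the visited map, so no mutex is needed.
    // inflight counts the batches still owed to pending: the seed batch
    // plus one batch per url handed to the workers. Zero means the workers
    // are done and pending is empty, which is when we exit.
    visited := make(map[string]bool)
    inflight := 1 // the seed batch
    for inflight > 0 {
        batch := <-pending
        inflight--
        for _, url := range batch {
            if !visited[url] && len(visited) < 100 { // cap the demo run
                visited[url] = true
                inflight++ // the worker will owe one batch for this url
                work <- url
            }
        }
    }
    close(work)
    fmt.Printf("crawled %d urls\n", len(visited))
}

Run it as, say, go run crawl.go http://www.golang.org. The shape is the same as crawl3's worklist loop; the in-flight counter is the addition that lets the program stop instead of waiting forever.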
> On Mon, Sep 25, 2017 at 5:35 AM, Aaron <gdz...@gmail.com> wrote:
>
>> I have come up with a fix, using Mutex. But I am not sure how to do it
>> with channels.
>>
>> package main
>>
>> import (
>>     "fmt"
>>     "log"
>>     "net/http"
>>     "os"
>>     "strings"
>>     "sync"
>>
>>     "golang.org/x/net/html"
>> )
>>
>> var lock = sync.RWMutex{}
>>
>> func main() {
>>     if len(os.Args) != 2 {
>>         fmt.Println("Usage: crawl [URL].")
>>         return // without this, os.Args[1] below panics
>>     }
>>
>>     url := os.Args[1]
>>     if !strings.HasPrefix(url, "http://") {
>>         url = "http://" + url
>>     }
>>
>>     n := 0
>>
>>     for link := range newCrawl(url, 1) {
>>         n++
>>         fmt.Println(link)
>>     }
>>
>>     fmt.Printf("Total links: %d\n", n)
>> }
>>
>> func newCrawl(url string, num int) chan string {
>>     visited := make(map[string]bool)
>>     ch := make(chan string, 20)
>>
>>     go func() {
>>         crawl(url, 3, ch, &visited)
>>         close(ch)
>>     }()
>>
>>     return ch
>> }
>>
>> func crawl(url string, n int, ch chan string, visited *map[string]bool) {
>>     if n < 1 {
>>         return
>>     }
>>     resp, err := http.Get(url)
>>     if err != nil {
>>         // log.Fatalf already calls os.Exit, so the os.Exit(1) that
>>         // followed it was unreachable and has been dropped.
>>         log.Fatalf("Can not reach the site. Error = %v\n", err)
>>     }
>>
>>     b := resp.Body
>>     defer b.Close()
>>
>>     z := html.NewTokenizer(b)
>>
>>     nextN := n - 1
>>     for {
>>         token := z.Next()
>>
>>         switch token {
>>         case html.ErrorToken:
>>             return
>>         case html.StartTagToken:
>>             current := z.Token()
>>             if current.Data != "a" {
>>                 continue
>>             }
>>             result, ok := getHrefTag(current)
>>             if !ok {
>>                 continue
>>             }
>>
>>             hasProto := strings.HasPrefix(result, "http")
>>             if hasProto {
>>                 lock.RLock()
>>                 seen := (*visited)[result]
>>                 lock.RUnlock()
>>                 if seen {
>>                     continue
>>                 }
>>                 // Mark the url before crawling it, not after: marking
>>                 // afterwards let the same url be crawled again from
>>                 // pages reached during the recursion.
>>                 lock.Lock()
>>                 (*visited)[result] = true
>>                 lock.Unlock()
>>                 // Spawning a goroutine and immediately waiting for it
>>                 // makes this call synchronous; the crawl is not
>>                 // actually concurrent yet.
>>                 done := make(chan struct{})
>>                 go func() {
>>                     crawl(result, nextN, ch, visited)
>>                     close(done)
>>                 }()
>>                 <-done
>>                 ch <- result
>>             }
>>         }
>>     }
>> }
>>
>> func getHrefTag(token html.Token) (result string, ok bool) {
>>     for _, a := range token.Attr {
>>         if a.Key == "href" {
>>             result = a.Val
>>             ok = true
>>             break
>>         }
>>     }
>>     return
>> }
>>
>> On Sunday, September 24, 2017 at 10:13:16 PM UTC+8, Aaron wrote:
>>>
>>> Hi, I am learning Go concurrency and trying to build a simple website
>>> crawler. I managed to crawl all the links of the pages at any depth of
>>> a website, but I still have one problem to tackle: how do I avoid
>>> crawling links that were already crawled?
>>>
>>> Here is my code. Hope you guys can shed some light. Thank you in
>>> advance.
>>>
>>> package main
>>>
>>> import (
>>>     "fmt"
>>>     "log"
>>>     "net/http"
>>>     "os"
>>>     "strings"
>>>
>>>     "golang.org/x/net/html"
>>> )
>>>
>>> func main() {
>>>     if len(os.Args) != 2 {
>>>         fmt.Println("Usage: crawl [URL].")
>>>         return
>>>     }
>>>
>>>     url := os.Args[1]
>>>     if !strings.HasPrefix(url, "http://") {
>>>         url = "http://" + url
>>>     }
>>>
>>>     for link := range newCrawl(url, 1) {
>>>         fmt.Println(link)
>>>     }
>>> }
>>>
>>> func newCrawl(url string, num int) chan string {
>>>     ch := make(chan string, 20)
>>>
>>>     go func() {
>>>         crawl(url, 1, ch)
>>>         close(ch)
>>>     }()
>>>
>>>     return ch
>>> }
>>>
>>> func crawl(url string, n int, ch chan string) {
>>>     if n < 1 {
>>>         return
>>>     }
>>>     resp, err := http.Get(url)
>>>     if err != nil {
>>>         log.Fatalf("Can not reach the site. Error = %v\n", err)
>>>     }
>>>
>>>     b := resp.Body
>>>     defer b.Close()
>>>
>>>     z := html.NewTokenizer(b)
>>>
>>>     nextN := n - 1
>>>     for {
>>>         token := z.Next()
>>>
>>>         switch token {
>>>         case html.ErrorToken:
>>>             return
>>>         case html.StartTagToken:
>>>             current := z.Token()
>>>             if current.Data != "a" {
>>>                 continue
>>>             }
>>>             result, ok := getHrefTag(current)
>>>             if !ok {
>>>                 continue
>>>             }
>>>
>>>             hasProto := strings.HasPrefix(result, "http")
>>>             if hasProto {
>>>                 done := make(chan struct{})
>>>                 go func() {
>>>                     crawl(result, nextN, ch)
>>>                     close(done)
>>>                 }()
>>>                 <-done
>>>                 ch <- result
>>>             }
>>>         }
>>>     }
>>> }
>>>
>>> func getHrefTag(token html.Token) (result string, ok bool) {
>>>     for _, a := range token.Attr {
>>>         if a.Key == "href" {
>>>             result = a.Val
>>>             ok = true
>>>             break
>>>         }
>>>     }
>>>     return
>>> }
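On the question Aaron raises at the top of his fix, how to do the de-duplication with channels instead of a Mutex: one way, matching the advice above to give the map its own goroutine, is to confine visited to a single goroutine and make the check-and-mark a single message exchange. The sketch below is illustrative only; visitReq, newVisited, and seen are invented names, not from the thread.

package main

import "fmt"

// visitReq asks the map's owner whether url was seen before; the answer
// comes back on reply, and the owner then marks url as visited.
type visitReq struct {
    url   string
    reply chan bool // true if url had already been visited
}

// newVisited confines the visited map to one goroutine. Callers never
// touch the map directly, so no mutex is needed, and the check and the
// mark cannot race because the owner handles one request at a time.
func newVisited() chan<- visitReq {
    requests := make(chan visitReq)
    go func() {
        visited := make(map[string]bool)
        for req := range requests {
            req.reply <- visited[req.url]
            visited[req.url] = true
        }
    }()
    return requests
}

func main() {
    requests := newVisited()

    seen := func(url string) bool {
        reply := make(chan bool)
        requests <- visitReq{url, reply}
        return <-reply
    }

    fmt.Println(seen("http://golang.org")) // false: first visit
    fmt.Println(seen("http://golang.org")) // true: already visited
}

Dropped into the crawler, the lock.RLock/lock.Lock pairs collapse into a single "if seen(result) { continue }", and the RWMutex and the pointer-to-map parameter both go away.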