Aaron, this is the simplest change I could make to that program to show you an example. It does what you want:

https://play.golang.org/p/Q8pu0l-Yvy

It shows what I was saying about replacing the bare URL string with a struct holding the string and an int for the depth.

Depth 1:

http://www.whitehouse.gov
https://www.whitehouse.gov/blog/2017/06/08/president-trumps-plan-rebuild-americas-infrastructure
:
https://www.whitehouse.gov/privacy
https://www.whitehouse.gov/copyright
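In outline, the change amounts to something like this (a compilable toy, not the playground code itself; maxDepth and the field names are my own illustrative choices):

package main

import "fmt"

const maxDepth = 3 // illustrative limit, not taken from the playground code

// The work item stops being "just a string": carry the depth along.
type link struct {
    url   string
    depth int
}

func main() {
    start := link{url: "http://www.whitehouse.gov", depth: 0}

    // Following an href found on a page means bumping the depth and
    // checking it against the limit before enqueueing.
    next := link{url: "https://www.whitehouse.gov/privacy", depth: start.depth + 1}
    if next.depth <= maxDepth {
        fmt.Println("crawl:", next.url, "at depth", next.depth)
    }
}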
On Tue, Sep 26, 2017 at 7:29 AM, Michael Jones <michael.jo...@gmail.com> wrote:

> To be sure, examples are a starting place.
>
> To crawl to a certain depth, you need to know the depth of each URL. That
> means that the URL being passed around stops being "just a string" and
> becomes a struct holding at least the URL as a string and the link depth
> as an integer. Following a link means incrementing the depth, checking it
> against the limit, and also checking for the presence of the string part
> in the map.
>
> To crawl to a certain distance from a website, in the sense of how many
> hops to other websites, needs similar processing on the hostname.
>
> To crawl with rate limiting, so as not to overburden a distant host's
> server(s), means having a work queue per hostname. That is the natural
> place to do rate limits, time windows, or other such processing.
>
> The real meaning of the example is to point out that the work queue -- a
> channel here -- can be written to by the (go)routines that read from it.
> This is a useful notion in this case.
>
> Good luck!
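The per-hostname queue idea can be made concrete in a few lines. Below is a minimal runnable sketch; the name dispatch, the 100-slot queue size, and the one-request-per-second ticker are my illustrative choices, and fetch is a stand-in for the real HTTP work:

package main

import (
    "fmt"
    "net/url"
    "sync"
    "time"
)

// dispatch fans URLs out to one queue per hostname. Each queue is
// drained by its own goroutine behind a ticker, so a given host sees
// at most about one request per second.
func dispatch(urls <-chan string, fetch func(string)) {
    var wg sync.WaitGroup
    perHost := make(map[string]chan string)
    for u := range urls {
        parsed, err := url.Parse(u)
        if err != nil {
            continue
        }
        q, ok := perHost[parsed.Host]
        if !ok {
            q = make(chan string, 100)
            perHost[parsed.Host] = q
            wg.Add(1)
            go func(q chan string) {
                defer wg.Done()
                tick := time.Tick(time.Second)
                for link := range q {
                    <-tick
                    fetch(link)
                }
            }(q)
        }
        q <- u
    }
    for _, q := range perHost {
        close(q) // input is done; let each host worker drain and exit
    }
    wg.Wait()
}

func main() {
    urls := make(chan string, 4)
    urls <- "http://example.com/"
    urls <- "http://example.com/about"
    urls <- "http://example.org/"
    close(urls)
    dispatch(urls, func(u string) { fmt.Println(time.Now().Format("15:04:05"), u) })
}

This is only the rate-limiting skeleton; a real crawler would plug its fetch-and-parse into fetch, and would probably use a limiter that can be stopped, since time.Tick never releases its ticker.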
> On Mon, Sep 25, 2017 at 8:46 PM, Aaron <gdz...@gmail.com> wrote:
>
>> Thank you, Mike. I have read the book and done that exercise, but the
>> code in that example does not crawl to a given depth and stop there, so
>> the requirements are a bit different. I can't use the approach from the
>> example directly, or, more precisely, I don't know how to use it in my
>> code.
>>
>> On Tuesday, September 26, 2017 at 6:18:43 AM UTC+8, Michael Jones wrote:
>>
>>> The book The Go Programming Language discusses the web crawl task at
>>> several points through the text. The simplest complete parallel
>>> version is:
>>>
>>> https://github.com/adonovan/gopl.io/blob/master/ch8/crawl3/findlinks.go
>>>
>>> which, if you download and build it, works quite nicely:
>>>
>>> $ crawl3 http://www.golang.org
>>> http://www.golang.org
>>> http://www.google.com/intl/en/policies/privacy/
>>> https://golang.org/doc/tos.html
>>> https://golang.org/project/
>>> https://golang.org/pkg/
>>> https://golang.org/doc/
>>> http://play.golang.org/
>>> https://tour.golang.org/
>>> https://golang.org/LICENSE
>>> https://developers.google.com/site-policies#restrictions
>>> https://golang.org/dl/
>>> https://golang.org/blog/
>>> https://golang.org/help/
>>> https://golang.org/
>>> https://blog.golang.org/
>>> https://www.google.com/intl/en/privacy/privacy-policy.html
>>> https://www.google.com/intl/en/policies/terms/
>>> https://golang.org/LICENSE?m=text
>>> https://golang.org/pkg
>>> https://golang.org/doc/go_faq.html
>>> https://groups.google.com/group/golang-nuts
>>> https://blog.gopheracademy.com/gophers-slack-community/
>>> https://golang.org/wiki
>>> https://forum.golangbridge.org/
>>> irc:irc.freenode.net/go-nuts
>>> 2017/09/25 15:13:07 Get irc:irc.freenode.net/go-nuts: unsupported protocol scheme "irc"
>>> https://golang.org/doc/faq
>>> https://groups.google.com/group/golang-announce
>>> https://blog.golang.org
>>> https://twitter.com/golang
>>> :
>>>
>>> On Mon, Sep 25, 2017 at 7:46 AM, Michael Jones <michae...@gmail.com> wrote:
>>>
>>>> I suggest that you first make it work in the simple way and then
>>>> make it concurrent.
>>>>
>>>> However, one lock-free concurrent way to think of this is as follows:
>>>>
>>>> 1. Start with a list of URLs (in code, on the command line, etc.).
>>>> 2. Spawn a goroutine that writes each of them to a channel of
>>>>    strings, perhaps called PENDING.
>>>> 3. Spawn a goroutine that reads a URL string from PENDING and, if it
>>>>    is not in the map of already-processed URLs, adds it to the map
>>>>    and writes it to a channel of strings, WORK.
>>>> 4. Spawn a set of goroutines that read WORK, fetch the URL, do
>>>>    whatever it is that you need to do, and write the URLs found
>>>>    there to PENDING.
>>>>
>>>> This is enough. Now, as written, you have the challenge of knowing
>>>> when the workers are done and PENDING is empty; that's when you
>>>> exit. There are other ways to do this, but the point is to state
>>>> with emphasis what an earlier email said, which is to have the map
>>>> in its own goroutine, the one that decides which URLs should be
>>>> processed.
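Spelled out, that recipe is short. Here is a minimal runnable sketch of steps 1-4; the canned fetch stands in for real HTTP work, and the outstanding counter is one way (not the only way) to answer the "when are we done" question. The names pending, work, and seen, and the buffer sizes, are mine:

package main

import "fmt"

// fetch stands in for the real HTTP fetch and parse; it returns the
// links "found" on a page. Swap in real fetching here.
func fetch(url string) []string {
    fake := map[string][]string{
        "a": {"b", "c"},
        "b": {"a", "c"},
        "c": {"a"},
    }
    return fake[url]
}

func main() {
    pending := make(chan string, 100) // step 2: everything discovered
    work := make(chan string, 100)    // step 3: deduplicated work
    results := make(chan []string)    // what each completed fetch found

    // Steps 1 and 2: seed PENDING with the starting URLs.
    for _, s := range []string{"a"} {
        pending <- s
    }

    // Step 4: workers read WORK, fetch, and report links back.
    for i := 0; i < 4; i++ {
        go func() {
            for u := range work {
                results <- fetch(u)
            }
        }()
    }

    // Step 3: the map lives in exactly one goroutine (this one), so no
    // lock is needed. outstanding counts URLs handed to workers but not
    // yet reported back; when it is zero and PENDING is drained, we are
    // done. Caveat: the send to work relies on the buffer; a production
    // version must not block here.
    seen := make(map[string]bool)
    outstanding := 0
    for outstanding > 0 || len(pending) > 0 {
        select {
        case u := <-pending:
            if !seen[u] {
                seen[u] = true
                outstanding++
                work <- u
                fmt.Println(u)
            }
        case links := <-results:
            outstanding--
            for _, l := range links {
                pending <- l
            }
        }
    }
    close(work)
}

Because only that one loop ever touches seen, no mutex is needed, and the loop exits exactly when every handed-out URL has reported back and nothing new is waiting.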
>>>> On Mon, Sep 25, 2017 at 5:35 AM, Aaron <gdz...@gmail.com> wrote:
>>>>
>>>>> I have come up with a fix, using Mutex. But I am not sure how to do
>>>>> it with channels.
>>>>>
>>>>> package main
>>>>>
>>>>> import (
>>>>>     "fmt"
>>>>>     "log"
>>>>>     "net/http"
>>>>>     "os"
>>>>>     "strings"
>>>>>     "sync"
>>>>>
>>>>>     "golang.org/x/net/html"
>>>>> )
>>>>>
>>>>> var lock = sync.RWMutex{}
>>>>>
>>>>> func main() {
>>>>>     if len(os.Args) != 2 {
>>>>>         fmt.Println("Usage: crawl [URL].")
>>>>>     }
>>>>>
>>>>>     url := os.Args[1]
>>>>>     if !strings.HasPrefix(url, "http://") {
>>>>>         url = "http://" + url
>>>>>     }
>>>>>
>>>>>     n := 0
>>>>>
>>>>>     for link := range newCrawl(url, 1) {
>>>>>         n++
>>>>>         fmt.Println(link)
>>>>>     }
>>>>>
>>>>>     fmt.Printf("Total links: %d\n", n)
>>>>> }
>>>>>
>>>>> func newCrawl(url string, num int) chan string {
>>>>>     visited := make(map[string]bool)
>>>>>     ch := make(chan string, 20)
>>>>>
>>>>>     go func() {
>>>>>         crawl(url, 3, ch, &visited)
>>>>>         close(ch)
>>>>>     }()
>>>>>
>>>>>     return ch
>>>>> }
>>>>>
>>>>> func crawl(url string, n int, ch chan string, visited *map[string]bool) {
>>>>>     if n < 1 {
>>>>>         return
>>>>>     }
>>>>>     resp, err := http.Get(url)
>>>>>     if err != nil {
>>>>>         log.Fatalf("Can not reach the site. Error = %v\n", err)
>>>>>         os.Exit(1)
>>>>>     }
>>>>>
>>>>>     b := resp.Body
>>>>>     defer b.Close()
>>>>>
>>>>>     z := html.NewTokenizer(b)
>>>>>
>>>>>     nextN := n - 1
>>>>>     for {
>>>>>         token := z.Next()
>>>>>
>>>>>         switch token {
>>>>>         case html.ErrorToken:
>>>>>             return
>>>>>         case html.StartTagToken:
>>>>>             current := z.Token()
>>>>>             if current.Data != "a" {
>>>>>                 continue
>>>>>             }
>>>>>             result, ok := getHrefTag(current)
>>>>>             if !ok {
>>>>>                 continue
>>>>>             }
>>>>>
>>>>>             hasProto := strings.HasPrefix(result, "http")
>>>>>             if hasProto {
>>>>>                 lock.RLock()
>>>>>                 ok := (*visited)[result]
>>>>>                 lock.RUnlock()
>>>>>                 if ok {
>>>>>                     continue
>>>>>                 }
>>>>>                 done := make(chan struct{})
>>>>>                 go func() {
>>>>>                     crawl(result, nextN, ch, visited)
>>>>>                     close(done)
>>>>>                 }()
>>>>>                 <-done
>>>>>                 lock.Lock()
>>>>>                 (*visited)[result] = true
>>>>>                 lock.Unlock()
>>>>>                 ch <- result
>>>>>             }
>>>>>         }
>>>>>     }
>>>>> }
>>>>>
>>>>> func getHrefTag(token html.Token) (result string, ok bool) {
>>>>>     for _, a := range token.Attr {
>>>>>         if a.Key == "href" {
>>>>>             result = a.Val
>>>>>             ok = true
>>>>>             break
>>>>>         }
>>>>>     }
>>>>>     return
>>>>> }
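A couple of caveats about that fix. Starting a goroutine and then immediately blocking on <-done means each child crawl finishes before the parent moves on, so the crawl is still effectively sequential. Also, log.Fatalf itself exits the program, so the os.Exit(1) after it never runs, and any one unreachable link aborts the whole crawl; and since a Go map is a reference type, visited can be passed as a plain map[string]bool instead of *map[string]bool. For children that actually run in parallel, a sync.WaitGroup is the usual shape. A toy sketch, with the fetching faked by a print:

package main

import (
    "fmt"
    "sync"
)

func main() {
    links := []string{"a", "b", "c"}

    var wg sync.WaitGroup
    for _, l := range links {
        wg.Add(1)
        go func(l string) { // pass l so each goroutine gets its own copy
            defer wg.Done()
            fmt.Println("fetching", l) // stand-in for crawl(l, ...)
        }(l)
    }

    // Wait once for all children; in the real code this belongs in
    // newCrawl, before close(ch).
    wg.Wait()
}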
>>>>> On Sunday, September 24, 2017 at 10:13:16 PM UTC+8, Aaron wrote:
>>>>>
>>>>>> Hi, I am learning Go concurrency and trying to build a simple
>>>>>> website crawler. I managed to crawl all the links of the pages at
>>>>>> any depth of a website, but I still have one problem to tackle:
>>>>>> how do I avoid crawling links that have already been crawled?
>>>>>>
>>>>>> Here is my code. I hope you guys can shed some light. Thank you in
>>>>>> advance.
>>>>>>
>>>>>> package main
>>>>>>
>>>>>> import (
>>>>>>     "fmt"
>>>>>>     "log"
>>>>>>     "net/http"
>>>>>>     "os"
>>>>>>     "strings"
>>>>>>
>>>>>>     "golang.org/x/net/html"
>>>>>> )
>>>>>>
>>>>>> func main() {
>>>>>>     if len(os.Args) != 2 {
>>>>>>         fmt.Println("Usage: crawl [URL].")
>>>>>>     }
>>>>>>
>>>>>>     url := os.Args[1]
>>>>>>     if !strings.HasPrefix(url, "http://") {
>>>>>>         url = "http://" + url
>>>>>>     }
>>>>>>
>>>>>>     for link := range newCrawl(url, 1) {
>>>>>>         fmt.Println(link)
>>>>>>     }
>>>>>> }
>>>>>>
>>>>>> func newCrawl(url string, num int) chan string {
>>>>>>     ch := make(chan string, 20)
>>>>>>
>>>>>>     go func() {
>>>>>>         crawl(url, 1, ch)
>>>>>>         close(ch)
>>>>>>     }()
>>>>>>
>>>>>>     return ch
>>>>>> }
>>>>>>
>>>>>> func crawl(url string, n int, ch chan string) {
>>>>>>     if n < 1 {
>>>>>>         return
>>>>>>     }
>>>>>>     resp, err := http.Get(url)
>>>>>>     if err != nil {
>>>>>>         log.Fatalf("Can not reach the site. Error = %v\n", err)
>>>>>>         os.Exit(1)
>>>>>>     }
>>>>>>
>>>>>>     b := resp.Body
>>>>>>     defer b.Close()
>>>>>>
>>>>>>     z := html.NewTokenizer(b)
>>>>>>
>>>>>>     nextN := n - 1
>>>>>>     for {
>>>>>>         token := z.Next()
>>>>>>
>>>>>>         switch token {
>>>>>>         case html.ErrorToken:
>>>>>>             return
>>>>>>         case html.StartTagToken:
>>>>>>             current := z.Token()
>>>>>>             if current.Data != "a" {
>>>>>>                 continue
>>>>>>             }
>>>>>>             result, ok := getHrefTag(current)
>>>>>>             if !ok {
>>>>>>                 continue
>>>>>>             }
>>>>>>
>>>>>>             hasProto := strings.HasPrefix(result, "http")
>>>>>>             if hasProto {
>>>>>>                 done := make(chan struct{})
>>>>>>                 go func() {
>>>>>>                     crawl(result, nextN, ch)
>>>>>>                     close(done)
>>>>>>                 }()
>>>>>>                 <-done
>>>>>>                 ch <- result
>>>>>>             }
>>>>>>         }
>>>>>>     }
>>>>>> }
>>>>>>
>>>>>> func getHrefTag(token html.Token) (result string, ok bool) {
>>>>>>     for _, a := range token.Attr {
>>>>>>         if a.Key == "href" {
>>>>>>             result = a.Val
>>>>>>             ok = true
>>>>>>             break
>>>>>>         }
>>>>>>     }
>>>>>>     return
>>>>>> }

--
Michael T. Jones
michael.jo...@gmail.com