I wrote this a while ago, here it is :-) https://github.com/gtrak/betamore-clj/blob/master/src/betamore_clj/core.clj
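Looking at the code quoted below, as far as I can tell two things go wrong: `links` in `get-all-links` is always the empty list, so the "already seen" check never fires and any cycle between pages recurses until the stack overflows, and nothing ever adds a URL to the result, so pages without cycles come back as an empty list. Here is a minimal sketch of one way around both, using an explicit queue and a visited set instead of recursion. The names `crawl`, `fetch-links`, and `site` are mine, made up for illustration, and the in-memory map stands in for real HTTP fetching so the traversal can be tried at the REPL without a network.

```clojure
;; Hypothetical sketch: breadth-first crawl with loop/recur.
;; `fetch-links` is any function from a URL string to a collection of
;; URL strings (nil entries are tolerated and skipped).
(defn crawl
  "Returns the set of every URL reachable from `start`."
  [fetch-links start]
  (loop [queue   (conj clojure.lang.PersistentQueue/EMPTY start)
         visited #{}]
    (if-let [url (peek queue)]
      (if (contains? visited url)
        ;; Already crawled: drop it, which is what breaks cycles.
        (recur (pop queue) visited)
        ;; New page: record it and enqueue its children.
        (let [children (remove nil? (fetch-links url))]
          (recur (into (pop queue) children)
                 (conj visited url))))
      ;; Queue is empty: everything reachable has been visited.
      visited)))

;; Stand-in for a website: page -> links on that page (made-up data).
;; A Clojure map is itself a function of its keys, so it can be
;; passed directly as `fetch-links`.
(def site
  {"/"  ["/a" "/b"]
   "/a" ["/" "/b" nil]   ; links back to "/" form a cycle
   "/b" ["/a" "/c"]
   "/c" []})

(crawl site "/")   ; a set containing "/", "/a", "/b" and "/c"
```

Because `recur` reuses the stack frame, the depth of the site no longer matters. For real pages, something like `#(get-links % nil)` from the quoted code could presumably be passed as `fetch-links`, since the visited set now does the job the `links` argument was meant to do.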
On Thu, Jul 4, 2013 at 1:23 PM, Amir Fouladi <ah.foul...@gmail.com> wrote:

> Hi everybody
>
> I'm a clojure newbie, so I'm sorry if I'm asking a dumb question.
> I'm trying to write a program to crawl a webpage, find all links and
> recursively do this until all the links in the website are crawled (of
> course I'm omitting external hosts to avoid infinite crawling). So
> basically I'm dealing with a tree data structure. The problem is that my
> knowledge of clojure data structures is nowhere near enough to be able to
> implement this. I read a little bit about zippers and lots of other stuff
> but it only made me more confused.
>
> This is what I've got so far:
>
> (ns cralwer.core
>   (:gen-class)
>   (:require [net.cgrand.enlive-html :as h])
>   (:import (java.net URL MalformedURLException))
>   (:import java.io.FileNotFoundException)
>   )
>
> (defn get-absolute-url-same-host
>   "Convert the URL to absolute form if it's already not. Returns nil if
>   the url is not from the same host"
>   [url parent]
>   (try (let [u (URL. url)]
>          (if (= (.getHost u) (.getHost parent))
>            (.toString u)))
>        (catch MalformedURLException e (.toString (URL. parent url)))
>        ))
>
> (defn get-links
>   "Return all the links in a URI"
>   [url links]
>   ;I do this check to avoid back edges/already seen urls and stop when
>   ;there are no links in the current page
>   (if-not (or (nil? url) (some #{url} links))
>     (try (let [j-url (java.net.URL. url)
>                page (h/html-resource j-url)]
>            (map #(get-absolute-url-same-host (:href (:attrs %)) j-url)
>                 (h/select page [(h/attr? :href)])))
>
>          (catch FileNotFoundException e (println "invalid URL: " url)))))
>
> (defn get-all-links
>   "Return a collection of all links"
>   [url]
>   (let [links '() children (get-links url links)]
>     (concat links (mapcat get-all-links children))))
>
> For small inputs, I get an empty list and for large inputs I just get
> a stack overflow exception.
>
> Thanks a lot for your help in advance
>
> --
> --
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clojure@googlegroups.com
> Note that posts from new members are moderated - please be patient with
> your first post.
> To unsubscribe from this group, send email to
> clojure+unsubscr...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
> ---
> You received this message because you are subscribed to the Google Groups
> "Clojure" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to clojure+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.