Hi everybody

I'm a Clojure newbie, so I'm sorry if this is a dumb question.
I'm trying to write a program that crawls a webpage, finds all its links,
and recursively does the same for each link until every page on the website
has been crawled (I skip external hosts, of course, to avoid crawling
forever). So basically I'm dealing with a tree data structure. The problem
is that my knowledge of Clojure data structures is nowhere near enough to
implement this. I read a little about zippers and lots of other stuff,
but it only made me more confused.

This is what I've got so far: 



(ns cralwer.core
  (:gen-class)
  (:require [net.cgrand.enlive-html :as h])
  (:import (java.net URL MalformedURLException)
           (java.io FileNotFoundException)))

(defn get-absolute-url-same-host
  "Convert the URL to absolute form if it isn't already. Returns nil if
the URL is not from the same host"
  [url parent]
  (try (let [u (URL. url)]
         (if (= (.getHost u) (.getHost parent))
           (.toString u)))
    (catch MalformedURLException e (.toString (URL. parent url)))))



(defn get-links
  "Return all the links in a URI"
  [url links]
  ;; I do this check to avoid back edges/already seen urls and stop when
  ;; there are no links in the current page
  (if-not (or (nil? url) (some #{url} links))
    (try (let [j-url (java.net.URL. url)
               page (h/html-resource j-url)]
           (map #(get-absolute-url-same-host (:href (:attrs %)) j-url)
                (h/select page [(h/attr? :href)])))
      (catch FileNotFoundException e (println "invalid URL: " url)))))



(defn get-all-links
  "Return a collection of all links"
  [url]
  (let [links '() children (get-links url links)]
    (concat links (mapcat get-all-links children))))


For small inputs, I get an empty list, and for large inputs I get a stack
overflow exception.

Thanks a lot in advance for your help.

-- 
-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en