Hi, all

I want to parse big log files using Clojure. Each line of the log has the
structure "UserID,Latitude,Longitude,Timestamp". My implementation works in
two steps:

----> Read the log file & get the top-N user list
----> Find each top-N user's records and store them in a separate log file
(UserID.log).

The implementation source code:
;======================================================
;; assumes (require '[clojure.java.io :as io]
;;                  '[clojure.string :as string])
;; and a MAIN-PATH var bound to the output directory
(defn parse-file
  "Count records per user, print the sorted top-N list, then write
  each top user's records to a separate log file."
  [file n]
  (with-open [rdr (io/reader file)]
    (println "001 begin with open ")
    (let [lines  (line-seq rdr)
          res    (parse-recur lines)
          sorted (into (sorted-map-by (fn [key1 key2]
                                        (compare [(get res key2) key2]
                                                 [(get res key1) key1])))
                       res)]
      (println "Statistic result : " res)
      (println "Top-N User List : " sorted)
      (find-write-recur lines sorted n))))

(defn parse-recur
  "Walk the lines once and build a map of UserID -> record count."
  [lines]
  (loop [ls  lines
         res {}]
    (if (seq ls)
      (recur (next ls)
             (update-res res (first ls)))
      res)))

(defn update-res
  "Increment the record count for the UserID found in this line."
  [res line]
  (let [params (string/split line #",")
        id     (if (> (count params) 1) (params 0) "0")]
    (update-in res [id] (fnil inc 0))))

(defn find-write-recur
  "Get each user's records and store them in a separate log file."
  [lines sorted n]
  (loop [x  n
         sd sorted]
    (when (and (pos? x) (seq sd))
      (let [id (first (keys sd))]
        (create-write-file id (find-recur lines id))
        (recur (dec x) (rest sd))))))

(defn find-recur
  "Collect every line whose UserID matches id."
  [lines id]
  (loop [ls  lines
         res []]
    (if (seq ls)
      (recur (next ls)
             (update-vec res id (first ls)))
      res)))

(defn update-vec
  "Append the line to res when its UserID matches id."
  [res id line]
  (let [params (string/split line #",")
        id_    (if (> (count params) 1) (params 0) "0")]
    (if (= id id_)
      (conj res line)
      res)))

(defn create-write-file
  "Create a new file and write the lines into it."
  ([file info-lines]
   (create-write-file file info-lines false))
  ([file info-lines append?]
   (with-open [wr (io/writer (str MAIN-PATH file) :append append?)]
     (doseq [line info-lines]
       (.write wr (str line "\n"))))))
;======================================================

I tested this in the REPL with (parse-file "./DATA/log.log" 3) and got these
results:

Records      Size      Time     Result
1,000        42KB      <1s      OK
10,000       420KB     <1s      OK
100,000      4.3MB     3s       OK
1,000,000    43MB      15s      OK
6,000,000    258MB     >20min   "OutOfMemoryError Java heap space
                                 java.lang.String.substring (String.java:1913)"

======================================================
Here are my questions:
1. How can I fix the error when parsing a big log file, like >200MB?
2. How can I optimize the functions to run faster?
3. Some logs are more than 1GB in size. How can the code handle those?
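
The OutOfMemoryError is most likely caused by holding the head of the lazy
line seq: parse-file binds lines and reuses it in find-write-recur (and
find-recur walks it again per user), so every parsed line stays reachable for
the whole run. One way around that is two streaming passes that never retain
the sequence. This is only a sketch, not the code above: the names
(line->id, count-users, top-n, write-top-users, out-dir) are mine.

```clojure
(ns log-parse.core
  (:require [clojure.java.io :as io]
            [clojure.string :as string]))

(defn line->id
  "UserID of one CSV line, \"0\" when the line is malformed."
  [line]
  (let [params (string/split line #",")]
    (if (> (count params) 1) (params 0) "0")))

(defn count-users
  "Pass 1: reduce over the lazy line seq inside with-open, so only
  the counts map is kept in memory, never the lines themselves."
  [file]
  (with-open [rdr (io/reader file)]
    (reduce (fn [res line]
              (update-in res [(line->id line)] (fnil inc 0)))
            {}
            (line-seq rdr))))

(defn top-n
  "The n [id count] pairs with the highest counts."
  [counts n]
  (take n (sort-by val > counts)))

(defn write-top-users
  "Pass 2: re-read the file once and stream each matching line
  straight to that user's writer instead of collecting records."
  [file out-dir n]
  (let [ids     (map key (top-n (count-users file) n))
        writers (into {} (for [id ids]
                           [id (io/writer (str out-dir id ".log"))]))]
    (try
      (with-open [rdr (io/reader file)]
        (doseq [line (line-seq rdr)
                :let [w (writers (line->id line))]
                :when w]
          (.write ^java.io.Writer w (str line "\n"))))
      (finally
        (doseq [^java.io.Writer w (vals writers)]
          (.close w))))))
```

Something like (write-top-users "./DATA/log.log" "./DATA/" 3) reads the file
twice, but in roughly constant memory, so 1GB+ files should cost time rather
than heap. If it is still too slow, the usual next step is to count chunks of
the file in parallel and merge the per-chunk maps.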

I am still new to Clojure; any suggestion or solution will be appreciated.
Thanks

BR

------------------------------------

Liu Jiaqi (Jacky Liu)

Mobile: 15201091195        Email: liujiaq...@gmail.com

Skype: jacky_liu_1987   QQ: 406229156

-- 
-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en