Clojure's subs, and many other functions that return substrings of a
larger one (e.g. re-find, re-seq, etc.), are based on the behavior of
Java's java.lang.String/substring() method.

Before Java version 7u6 or thereabouts, this was implemented in O(1) time
by creating a String object that referred to an offset and length within
the original String's character array, thus keeping that whole array
reachable for as long as any of the substrings were referenced.

Around Java version 7u6, Java's substring() method behavior changed to copy
the desired substring into a new String object, so no references are kept
to the original.


http://www.javaadvent.com/2012/12/changes-to-stringsubstring-in-java-7.html
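
To make the difference concrete, here is a toy model of the two strategies
in Clojure. This is only a sketch and not the JDK source: the map keys
(:value, :offset, :count) and all function names are made up for
illustration.

```clojure
;; Toy model only: a "string" is a map holding a char array plus an
;; offset and a count. All names here are hypothetical.
(defn make-str [^String s]
  {:value (.toCharArray s) :offset 0 :count (count s)})

;; Old style (before about 7u6): O(1), reuses the parent's array.
(defn shared-substring [{:keys [value offset]} start end]
  {:value value :offset (+ offset start) :count (- end start)})

;; New style (about 7u6 and later): O(n), copies only the needed chars.
(defn copied-substring [{:keys [value offset]} start end]
  (let [from (+ offset start)
        to   (+ offset end)]
    {:value  (java.util.Arrays/copyOfRange ^chars value (int from) (int to))
     :offset 0
     :count  (- end start)}))

;; Number of chars kept reachable by holding on to s.
(defn retained-chars [s]
  (alength ^chars (:value s)))

(def parent (make-str (apply str "row 10\t" (repeat 1000 "x"))))

(retained-chars (shared-substring parent 0 6)) ; the full 1007-char array
(retained-chars (copied-substring parent 0 6)) ; just the 6 chars of "row 10"
```

In the shared version, holding the 6-character result keeps the whole
parent array alive, which is the retention described above.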

Fun, eh?  And no, this was not obvious to me until I ran across the issue
some time back.  Mark Engelberg encountered this issue while doing
performance tuning on his Instaparse library:
https://github.com/Engelberg/instaparse

If you know you are deploying on a Java version that is earlier than 7u6,
you can use the String constructor, e.g. (String. s), from Clojure to
force the copying of the string.  You could even get fancier and write code
that depends upon the Java version you are running on, if that interests
you.
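
A minimal sketch of the version-dependent approach might look like the
following. The function names shares-parent-string? and compact-string are
made up for this example, and the parsing assumes the "1.x.y_zz" format of
the java.version system property used by Java 8 and earlier; anything that
does not match that format is treated as a newer JVM.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper: true when the given java.version string denotes
;; a JVM older than 1.7.0_06, i.e. one where substring() is assumed to
;; share the parent's character array.
(defn shares-parent-string? [version]
  (if-let [[_ minor update] (re-find #"^1\.(\d+)\.\d+(?:_(\d+))?" version)]
    (let [minor  (Long/parseLong minor)
          update (if update (Long/parseLong update) 0)]
      (or (< minor 7)
          (and (= minor 7) (< update 6))))
    false))

;; Hypothetical helper: copy s only on JVMs where that is needed to
;; drop the reference to the larger parent string.
(defn compact-string [^String s]
  (if (shares-parent-string? (System/getProperty "java.version"))
    (String. s)
    s))

;; Usage, mirroring the example from the original message:
(compact-string (first (str/split "row 10\txxxx" #"\t")))
```

Calling (String. s) unconditionally is simpler and harmless on newer JVMs,
so the version check is only worth it if the extra allocation matters.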

Andy


On Thu, Sep 12, 2013 at 9:08 AM, Brian Craft <craft.br...@gmail.com> wrote:

> After working around the seq + closure = death problem, I still had a
> severe memory leak in my code, which took many hours to find.
>
> Holding a reference to a string returned by clojure.string/split is
> somehow retaining a reference to the original string. In my case I needed
> to hold the first column of each row in a tsv file that was 4G in size.
> This resulted in holding the entire 4G in memory.
>
> Here's a demo. Function "data" returns a seq of lines that are about 1000
> bytes. The first column, however, is just a few bytes, and 10k of them
> should easily fit in 10M of heap space. But, no:
>
> $ LEIN_JVM_OPTS=-Xmx10M lein repl
> REPL started; server listening on localhost port 34955
> user=> (defn data [] (for [i (range)] (str "row " i "\t"
> (clojure.string/join "" (repeat 1000 "x")))))
> #'user/data
> user=> (def x (vec (take 10000 (map #(first (clojure.string/split %
> #"\t")) (data)))))
> java.lang.OutOfMemoryError: Java heap space (NO_SOURCE_FILE:4)
> user=>
>
> If I copy the returned string with the String constructor, it's fine:
>
> $ LEIN_JVM_OPTS=-Xmx10M lein repl
> REPL started; server listening on localhost port 20587
> user=> (defn data [] (for [i (range)] (str "row " i "\t"
> (clojure.string/join "" (repeat 1000 "x")))))
> #'user/data
> user=> (def x (vec (take 10000 (map #(String. (first (clojure.string/split
> % #"\t"))) (data)))))
> #'user/x
> user=> (x 10)
> "row 10"
> user=>
>
> Two observations about this.
>
> First, this behavior is very unexpected to me. I don't understand if it is
> a property of strings, collections, or string/split specifically that is
> causing it. Is there something in the docs that I overlooked, that would
> have warned of this?
>
> Second, for tracking down problems like this, the available tooling is
> pathetic, to put it as politely as possible. jhat would not trace the
> leaked strings. It consistently froze up when tracing them to GC roots.
> visualvm traced it back to CacheLRU, as in the screenshot I posted in the
> other thread, which was perfectly uninformative.
>
> Without any usable tooling, the only workflow I found to narrow the
> problem was to iteratively stub out portions of code and re-run the program
> for several minutes to determine if the leak was active. Obviously, this is
> incredibly painful, slow, and tedious.
>
> I'm hoping someone can tell me there's a better way.
>
> Note that the leak did not appear when exercising subsystems
> independently, because in that case no references were retained from one
> subsystem to the other. So, "try it in the repl" was not an effective
> strategy.
>
> --
> --
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clojure@googlegroups.com
> Note that posts from new members are moderated - please be patient with
> your first post.
> To unsubscribe from this group, send email to
> clojure+unsubscr...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
> ---
> You received this message because you are subscribed to the Google Groups
> "Clojure" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to clojure+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.
>
