Clojure's subs, and many other functions that return substrings of a larger string (e.g. re-find, re-seq, etc.), are based on the behavior of Java's java.lang.String/substring() method.
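For illustration, here is a minimal REPL-style sketch. The sample string and the var names are made up for this example. It only shows that subs and a direct .substring call return equal strings, and that re-find likewise hands back a piece of its input:

```clojure
;; Sketch only: compares Clojure's substring-returning functions with a
;; direct call to java.lang.String/substring. The sample text is made up.
(def line "row 10\txxxxxxxx")

;; subs takes a start index and an optional (exclusive) end index.
(def via-subs (subs line 0 6))

;; The same request made through Java interop.
(def via-substring (.substring line 0 6))

;; Regex helpers also return pieces of the input string.
(def via-re-find (re-find #"^[^\t]+" line))

(println via-subs "|" via-substring "|" via-re-find)
```

Equal results say nothing about memory sharing by themselves. Whether the returned string shares storage with the original depends on the Java version.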
Before Java version 7u6 or thereabouts, this was implemented in O(1) time by creating a String object that referred to an offset and length within the original String's character data, thus keeping the whole original reachable for as long as any substring was referenced.

Around Java version 7u6, Java's substring() method behavior changed to copy the desired substring into a new String object, so no references are kept to the original.

http://www.javaadvent.com/2012/12/changes-to-stringsubstring-in-java-7.html

Fun, eh? And no, this was not obvious to me until I ran across the issue some time back. Mark Engelberg encountered this issue while doing performance tuning on his Instaparse library:

https://github.com/Engelberg/instaparse

If you know you are deploying on a Java version that is earlier than 7u6, you can use the String constructor, e.g. (String. s) from Clojure, to force the copying of the string. You could even get fancier and write code that depends upon the Java version you are running upon, if that interests you.

Andy

On Thu, Sep 12, 2013 at 9:08 AM, Brian Craft <craft.br...@gmail.com> wrote:

> After working around the seq + closure = death problem, I still had a
> severe memory leak in my code, which took many hours to find.
>
> Holding a reference to a string returned by clojure.string/split is
> somehow retaining a reference to the original string. In my case I needed
> to hold the first column of each row in a tsv file that was 4G in size.
> This resulted in holding the entire 4G in memory.
>
> Here's a demo. Function "data" returns a seq of lines that are about 1000
> bytes. The first column, however, is just a few bytes, and 10k of them
> should easily fit in 10M of heap space.
> But, no:
>
> $ LEIN_JVM_OPTS=-Xmx10M lein repl
> REPL started; server listening on localhost port 34955
> user=> (defn data [] (for [i (range)] (str "row " i "\t"
> (clojure.string/join "" (repeat 1000 "x")))))
> #'user/data
> user=> (def x (vec (take 10000 (map #(first (clojure.string/split %
> #"\t")) (data)))))
> java.lang.OutOfMemoryError: Java heap space (NO_SOURCE_FILE:4)
> user=>
>
> If I copy the returned string with the String constructor, it's fine:
>
> $ LEIN_JVM_OPTS=-Xmx10M lein repl
> REPL started; server listening on localhost port 20587
> user=> (defn data [] (for [i (range)] (str "row " i "\t"
> (clojure.string/join "" (repeat 1000 "x")))))
> #'user/data
> user=> (def x (vec (take 10000 (map #(String. (first (clojure.string/split
> % #"\t"))) (data)))))
> #'user/x
> user=> (x 10)
> "row 10"
> user=>
>
> Two observations about this.
>
> First, this behavior is very unexpected to me. I don't understand if it is
> a property of strings, collections, or string/split specifically that is
> causing it. Is there something in the docs that I overlooked, that would
> have warned of this?
>
> Second, for tracking down problems like this, the available tooling is
> pathetic, to put it as politely as possible. jhat would not trace the
> leaked strings. It consistently froze up when tracing them to GC roots.
> visualvm traced it back to CacheLRU, as in the screenshot I posted in the
> other thread, which was perfectly uninformative.
>
> Without any usable tooling, the only workflow I found to narrow the
> problem was to iteratively stub out portions of code and re-run the program
> for several minutes to determine if the leak was active. Obviously, this is
> incredibly painful, slow, and tedious.
>
> I'm hoping someone can tell me there's a better way.
>
> Note that the leak did not appear when exercising subsystems
> independently, because in that case no references were retained from one
> subsystem to the other.
> So, "try it in the repl" was not an effective
> strategy.
>
> --
> --
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clojure@googlegroups.com
> Note that posts from new members are moderated - please be patient with
> your first post.
> To unsubscribe from this group, send email to
> clojure+unsubscr...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
> ---
> You received this message because you are subscribed to the Google Groups
> "Clojure" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to clojure+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.