Ouch. Thanks for the explanation.

On Thursday, September 12, 2013 9:46:47 AM UTC-7, Andy Fingerhut wrote:
>
> Clojure's subs, and many other functions that return substrings of a 
> larger one (e.g. re-find, re-seq, etc.), are based on the behavior of 
> Java's java.lang.String/substring() method.
>
> Before Java version 7u6 or thereabouts, this was implemented in O(1) time 
> by creating a String object that referred to an offset and length within 
> the original String object, thus retaining a reference to it as long as the 
> substrings were referenced.
>
> Around Java version 7u6, Java's substring() method behavior changed to 
> copy the desired substring into a new String object, so no references are 
> kept to the original.
>
>     
> http://www.javaadvent.com/2012/12/changes-to-stringsubstring-in-java-7.html
>
> Fun, eh?  And no, this was not obvious to me until I ran across the issue 
> some time back.  Mark Engelberg encountered this issue while doing 
> performance tuning on his Instaparse library: 
> https://github.com/Engelberg/instaparse
>
> If you know you are deploying on a Java version that is earlier than 7u6, 
> you can use the String constructor, e.g. (String. s) from Clojure, to 
> force the copying of the string.  You could even get fancier and write code 
> that depends upon the Java version you are running upon, if that interests 
> you.
>
> Andy
>
>
> On Thu, Sep 12, 2013 at 9:08 AM, Brian Craft <craft...@gmail.com> wrote:
>
>> After working around the seq + closure = death problem, I still had a 
>> severe memory leak in my code, which took many hours to find.
>>
>> Holding a reference to a string returned by clojure.string/split is 
>> somehow retaining a reference to the original string. In my case I needed 
>> to hold the first column of each row in a tsv file that was 4G in size. 
>> This resulted in holding the entire 4G in memory.
>>
>> Here's a demo. Function "data" returns a seq of lines that are about 1000 
>> bytes. The first column, however, is just a few bytes, and 10k of them 
>> should easily fit in 10M of heap space. But, no:
>>
>> $ LEIN_JVM_OPTS=-Xmx10M lein repl
>> REPL started; server listening on localhost port 34955
>> user=> (defn data [] (for [i (range)] (str "row " i "\t" 
>> (clojure.string/join "" (repeat 1000 "x")))))
>> #'user/data
>> user=> (def x (vec (take 10000 (map #(first (clojure.string/split % 
>> #"\t")) (data)))))
>> java.lang.OutOfMemoryError: Java heap space (NO_SOURCE_FILE:4)
>> user=> 
>>
>> If I copy the returned string with the String constructor, it's fine:
>>
>> $ LEIN_JVM_OPTS=-Xmx10M lein repl
>> REPL started; server listening on localhost port 20587
>> user=> (defn data [] (for [i (range)] (str "row " i "\t" 
>> (clojure.string/join "" (repeat 1000 "x")))))
>> #'user/data
>> user=> (def x (vec (take 10000 (map #(String. (first 
>> (clojure.string/split % #"\t"))) (data)))))
>> #'user/x
>> user=> (x 10)
>> "row 10"
>> user=> 
>>
>> Two observations about this.
>>
>> First, this behavior is very unexpected to me. I don't understand if it 
>> is a property of strings, collections, or string/split specifically that is 
>> causing it. Is there something in the docs that I overlooked, that would 
>> have warned of this?
>>
>> Second, for tracking down problems like this, the available tooling is 
>> pathetic, to put it as politely as possible. jhat would not trace the 
>> leaked strings. It consistently froze up when tracing them to GC roots. 
>> visualvm traced it back to CacheLRU, as in the screenshot I posted in the 
>> other thread, which was perfectly uninformative.
>>
>> Without any usable tooling, the only workflow I found to narrow the 
>> problem was to iteratively stub out portions of code and re-run the program 
>> for several minutes to determine if the leak was active. Obviously, this is 
>> incredibly painful, slow, and tedious.
>>
>> I'm hoping someone can tell me there's a better way.
>>
>> Note that the leak did not appear when exercising subsystems 
>> independently, because in that case no references were retained from one 
>> subsystem to the other. So, "try it in the repl" was not an effective 
>> strategy.
>>  
>> -- 
>> -- 
>> You received this message because you are subscribed to the Google
>> Groups "Clojure" group.
>> To post to this group, send email to clo...@googlegroups.com
>> Note that posts from new members are moderated - please be patient with 
>> your first post.
>> To unsubscribe from this group, send email to
>> clojure+u...@googlegroups.com
>> For more options, visit this group at
>> http://groups.google.com/group/clojure?hl=en
>> --- 
>> You received this message because you are subscribed to the Google Groups 
>> "Clojure" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to clojure+u...@googlegroups.com.
>> For more options, visit https://groups.google.com/groups/opt_out.
>>
>
>
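
For anyone finding this thread later, here is a minimal sketch of the workaround discussed above, wrapped in a small helper. The names detach and first-column are hypothetical and not part of Clojure or clojure.string; data is the same demo function as in the original post. As far as I understand it, on JVMs before 7u6 the (String. s) call copies only the characters the substring uses, while on later JVMs it should cost little more than one small object allocation, so calling it unconditionally seems like a reasonable default if you cannot control the Java version.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper: return a fresh String with the same contents as s.
;; On JVMs older than 7u6 this drops the reference to the parent string's
;; backing array that substring-style calls would otherwise keep alive.
(defn detach
  [^String s]
  (String. s))

;; Hypothetical helper: first tab-separated column of a line, detached
;; from the (possibly very large) line it was split from.
(defn first-column
  [^String line]
  (detach (first (str/split line #"\t"))))

;; Same demo data as in the original post. It is a function rather than
;; a def so that nothing holds the head of the infinite lazy seq.
(defn data
  []
  (for [i (range)]
    (str "row " i "\t" (str/join "" (repeat 1000 "x")))))

;; Keep only the short first column of 10000 rows.
(def x (vec (take 10000 (map first-column (data)))))
```

With these definitions, (x 10) gives "row 10", the same as in the REPL session quoted above.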
