Ouch. Thanks for the explanation.

On Thursday, September 12, 2013 9:46:47 AM UTC-7, Andy Fingerhut wrote:
>
> Clojure's subs, and many other functions that return substrings of a
> larger one (e.g. re-find, re-seq, etc.), are based on the behavior of
> Java's java.lang.String/substring() method.
>
> Before Java version 7u6 or thereabouts, this was implemented in O(1) time
> by creating a String object that referred to an offset and length within
> the original String object, thus retaining a reference to it as long as the
> substrings were referenced.
>
> Around Java version 7u6, Java's substring() method behavior changed to
> copy the desired substring into a new String object, so no references are
> kept to the original.
>
> http://www.javaadvent.com/2012/12/changes-to-stringsubstring-in-java-7.html
>
> Fun, eh? And no, this was not obvious to me until I ran across the issue
> some time back. Mark Engelberg encountered this issue while doing
> performance tuning on his Instaparse library:
> https://github.com/Engelberg/instaparse
>
> If you know you are deploying on a Java version that is earlier than 7u6,
> you can use the String constructor, e.g. (String. s) from Clojure, to
> force the copying of the string. You could even get fancier and write code
> that depends upon the Java version you are running on, if that interests
> you.
>
> Andy
>
>
> On Thu, Sep 12, 2013 at 9:08 AM, Brian Craft <craft...@gmail.com> wrote:
>
>> After working around the seq + closure = death problem, I still had a
>> severe memory leak in my code, which took many hours to find.
>>
>> Holding a reference to a string returned by clojure.string/split is
>> somehow retaining a reference to the original string. In my case I needed
>> to hold the first column of each row in a tsv file that was 4G in size.
>> This resulted in holding the entire 4G in memory.
>>
>> Here's a demo. Function "data" returns a seq of lines that are about 1000
>> bytes. The first column, however, is just a few bytes, and 10k of them
>> should easily fit in 10M of heap space. But, no:
>>
>> $ LEIN_JVM_OPTS=-Xmx10M lein repl
>> REPL started; server listening on localhost port 34955
>> user=> (defn data [] (for [i (range)] (str "row " i "\t" (clojure.string/join "" (repeat 1000 "x")))))
>> #'user/data
>> user=> (def x (vec (take 10000 (map #(first (clojure.string/split % #"\t")) (data)))))
>> java.lang.OutOfMemoryError: Java heap space (NO_SOURCE_FILE:4)
>> user=>
>>
>> If I copy the returned string with the String constructor, it's fine:
>>
>> $ LEIN_JVM_OPTS=-Xmx10M lein repl
>> REPL started; server listening on localhost port 20587
>> user=> (defn data [] (for [i (range)] (str "row " i "\t" (clojure.string/join "" (repeat 1000 "x")))))
>> #'user/data
>> user=> (def x (vec (take 10000 (map #(String. (first (clojure.string/split % #"\t"))) (data)))))
>> #'user/x
>> user=> (x 10)
>> "row 10"
>> user=>
>>
>> Two observations about this.
>>
>> First, this behavior is very unexpected to me. I don't understand if it
>> is a property of strings, collections, or string/split specifically that is
>> causing it. Is there something in the docs that I overlooked, that would
>> have warned of this?
>>
>> Second, for tracking down problems like this, the available tooling is
>> pathetic, to put it as politely as possible. jhat would not trace the
>> leaked strings. It consistently froze up when tracing them to GC roots.
>> visualvm traced it back to CacheLRU, as in the screenshot I posted in the
>> other thread, which was perfectly uninformative.
>>
>> Without any usable tooling, the only workflow I found to narrow the
>> problem was to iteratively stub out portions of code and re-run the program
>> for several minutes to determine if the leak was active. Obviously, this is
>> incredibly painful, slow, and tedious.
>>
>> I'm hoping someone can tell me there's a better way.
>>
>> Note that the leak did not appear when exercising subsystems
>> independently, because in that case no references were retained from one
>> subsystem to the other. So, "try it in the repl" was not an effective
>> strategy.
>>
>> --
>> --
>> You received this message because you are subscribed to the Google
>> Groups "Clojure" group.
>> To post to this group, send email to clo...@googlegroups.com
>> Note that posts from new members are moderated - please be patient with
>> your first post.
>> To unsubscribe from this group, send email to
>> clojure+u...@googlegroups.com
>> For more options, visit this group at
>> http://groups.google.com/group/clojure?hl=en
>> ---
>> You received this message because you are subscribed to the Google Groups
>> "Clojure" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to clojure+u...@googlegroups.com.
>> For more options, visit https://groups.google.com/groups/opt_out.
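
The String-constructor workaround discussed in this thread can be wrapped in a small helper so the copy happens in one place. This is only a minimal sketch: `detach` and `first-column` are hypothetical names made up for illustration (they are not part of clojure.string), and the explicit copy only matters on JVMs older than roughly 7u6. On later JVMs substring already copies, so the helper just costs one small extra object.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper (not part of clojure.string): copy a substring so it
;; owns its characters instead of sharing the parent string's backing array.
;; Assumption: only needed on JVMs before about 7u6; harmless afterwards.
(defn detach
  ^String [^String s]
  (String. s))

;; Hypothetical helper: first tab-separated field of a line, detached from
;; the (much longer) line it was split from.
(defn first-column
  ^String [^String line]
  (detach (first (str/split line #"\t"))))

;; Same shape as the demo data in the thread: a short key followed by
;; about 1000 bytes of filler per line.
(defn data []
  (for [i (range)]
    (str "row " i "\t" (str/join "" (repeat 1000 "x")))))

;; Keep only the detached first columns; the long lines can be collected.
(def x (vec (take 10000 (map first-column (data)))))
```

With this in place, `(x 10)` returns `"row 10"`, the same result as the second REPL session above, without repeating the `(String. ...)` call at every use site.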