I've been using Clojure to automate some file management tasks over here; in particular, I whipped together about 100 lines of code to build and maintain a hash database of certain files and thereby efficiently discover duplications among these files. (Basically, it uses a subdirectory of the filesystem as a persisted hashmap, with the first few bytes of the hash used to make a tree of subdirectories, the leaves of which are hash buckets that collect files; FileInputStream/MessageDigest is used to compute file hashes, and byte-by-byte comparisons detect identical files from among those with identical hashes.)
I noticed something odd after spending a while working in other file management tools, then switching to my REPL and hammering off an (add-files! "/some/directory"). I flipped to NB's *out* tab to monitor its progress and ... what progress? It had printed nothing, and as I watched, it continued to print nothing. I switched back to the REPL and it looked like it had submitted the sexp to the evaluator. But no output. Then I checked the process list and saw the java task consuming no CPU (unsusprising since I'd given it an I/O bound task) but rapidly growing in memory use. It bloated up by several tens of megs before I finally saw "Entering /some/directory" in *out*. The funny thing is, the code is basically this: (defn add-files! [dir] (sysprint "Entering " dir) (doseq [f (file-seq dir)] ...)) and sysprint is basically this: (defn sysprint [& objs] (.println System/out (apply str objs))) (so the output goes to *out* instead of blitzkrieging the REPL tab's scrollback with thousands of lines of output, which then makes looking things up from earlier in the REPL session very annoying). Since the thing froze for a significant amount of time allocating a significant amount of memory before the (.println System.out ...) succeeded, I conclude that the rest of add-files! is irrelevant here and it was something in (apply str "Entering " "/some/directory") or in (.println System/out ...) that spent a long time allocating tens of megabytes of heap. I'm mystified as to why. The add-files! function doesn't usually do this, and it recurses for subdirectories, so the "Entering ..." message is printed several times in a typical use. After the one long pause, this particular run generated several more such messages without any noticeable pauses. What the heck was it? A GC pause? Why would that grow the heap? The heap should be near max size already and perhaps shrink somewhat in a GC pause. On the other hand, a long pause accompanied by rapid heap growth suggests the construction of a large data structure, so large it's not only big in memory but slow to build. But there was no significant CPU use by the Java process during the pause, which suggests that it was performing *blocking I/O*. That suggests it was loading something big from disk, but a) it hadn't even reached the doseq yet to start reading and hashing files, b) the files never reside completely in live heap at one time (only in 4KB chunks), and c) the files themselves were smallish, few exceeding 1MB and none even approaching 10MB. The nearest thing I can figure out is that the process was going ahead and hashing files, but the *out* display was lagging way behind. So, the sysprint ran, then the doseq was well underway, and the heap grew for a while as 4KB chunks were allocated ... Except a single (byte-array 4096) is being recycled in the hashing loop. It's a mutable Java array and DigestInputStream.read() is overwriting its contents for each new chunk. And unless the attached MessageDigest is building huge temporary data structures, which seems unlikely, that's basically it. The heap should have grown by a few megs, total, from String, String[], and File object churn, and creating one more 4KB array for each new entire file (not each 4KB of data), over the entire operation. If there were 20 subdirectories and 1000 files, a reasonable estimate, we're looking at: 20 String[] from File.list() on the subdirectories, average length 51 (1020/20), average size 212 bytes or so, for 4240 bytes. 1020 String for file paths, average length say around 30, for 30600 bytes. 1000 4KB byte[] arrays for just under 4MB. 1000 byte[]s of length 20 to hold the 160-bit hashes generated by MessageDigests, for around 28000 bytes. 1000 MessageDigest objects. If we assume that they contain 20-element byte[]s but return defensive copies, so those aren't the 28000 bytes just counted, and (generously!) that they contain another 1KB of temporary state at peak size, that's another meg. 1020 seq objects over arrays and a few layers of filter, remove, map, etc. over these, say 10,000 seq objects total, backed by arrays -- generously call that another whole megabyte. Assorted small temporary strings and such per object -- assume, generously, yet another full meg total. = about 7 megabytes. With a 64 meg heap and a much smaller initial process size, I'd have expected it to just grow by 7 megs (or less, probably quite a lot less) and not trigger ANY garbage collection. Any way I figure it, the code I wrote had no business causing the JVM to allocate even ten megs, let alone over thirty, let alone accompanied by a lengthy pause, whether for GC or disk I/O. Nor is there an obvious reason for the System/out println's return to precede the actual appearance of the text in the *out* tab by any significant span of time. In particular, the Java task wasn't using detectable levels of CPU (< 1% the whole time) so it couldn't have been hogging the CPU and making Netbeans slow to update its display; and it wasn't remote, so network latency is not a factor. NB was talking to it via loopback interface and NB's own maintaining of the display is essentially a CPU bound task that would not be impacted by contention for access to the network or the disk controller (so, the Java task sending the println to *out* and then heavily hogging the disk controller during the file hashing should not have blocked NB from receiving the text in the pipe from *out* and updating its display). In fact the only thing that would seem like it could explain it is if the operating system paused one or both processes to page a big chunk of something in from disk itself, but with the memory usage by the system under 60% at the time nothing should have been paged out to begin with. And of course any paging of the NB process back in should have concluded before I was able to issue any command at the REPL in it, whereas the Java process's size was (before it began growing) only a few megs, which should page in virtually instantaneously as judged at human perception speeds. And of course that rapid process size growth would remain unexplained. Something's hinky. Either I'm not understanding something in how System.out.println works, or in the Clojure code that called it, or in how NB and an attached Clojure runtime communicate; or else something even stranger is going on, like the operating system misreporting process sizes or something. (If the Java process size was actually much larger than it was claiming, then perhaps it both needed paging and needed a lot of time to page in, for instance. Though the free memory % suggested as much as a whole GB of unallocated RAM, meaning that process would have to have been a LOT larger to force the OS to page -- close to the -Xmx 1G I gave it, in fact, and then the pause could have been a GC pause after all...then again, if the OS-reported process sizes are suspect, then so is the OS-reported free memory %! Damn, once you distrust any of your diagnostic tools, it opens a Pandora's box that leads one to doubt everything and be sure of nothing, unless you successfully predict the output and it was statistically impossible to get by accident. The files all have the same SHA-1 as reported by multiple tools as well as the MessageDigest code, for instance, and that's highly unlikely to happen unless the various SHA-1 algorithm implementations in these tools are in fact all working correctly.) -- You received this message because you are subscribed to the Google Groups "Clojure" group. To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en