I've been using Clojure to automate some file management tasks over
here; in particular, I whipped together about 100 lines of code to
build and maintain a hash database of certain files and thereby
efficiently discover duplications among these files. (Basically, it
uses a subdirectory of the filesystem as a persisted hashmap, with the
first few bytes of the hash used to make a tree of subdirectories, the
leaves of which are hash buckets that collect files;
FileInputStream/MessageDigest is used to compute file hashes, and
byte-by-byte comparisons detect identical files from among those with
identical hashes.)

I noticed something odd after spending a while working in other file
management tools, then switching to my REPL and hammering off an
(add-files! "/some/directory"). I flipped to NB's *out* tab to monitor
its progress and ... what progress? It had printed nothing, and as I
watched, it continued to print nothing. I switched back to the REPL
and it looked like it had submitted the sexp to the evaluator. But no
output. Then I checked the process list and saw the java task
consuming no CPU (unsusprising since I'd given it an I/O bound task)
but rapidly growing in memory use. It bloated up by several tens of
megs before I finally saw "Entering /some/directory" in *out*.

The funny thing is, the code is basically this:

(defn add-files! [dir]
  (sysprint "Entering " dir)
  (doseq [f (file-seq dir)]
    ...))

and sysprint is basically this:

(defn sysprint [& objs]
  (.println System/out (apply str objs)))

(so the output goes to *out* instead of blitzkrieging the REPL tab's
scrollback with thousands of lines of output, which then makes looking
things up from earlier in the REPL session very annoying).

Since the thing froze for a significant amount of time allocating a
significant amount of memory before the (.println System.out ...)
succeeded, I conclude that the rest of add-files! is irrelevant here
and it was something in (apply str "Entering " "/some/directory") or
in (.println System/out ...) that spent a long time allocating tens of
megabytes of heap. I'm mystified as to why. The add-files! function
doesn't usually do this, and it recurses for subdirectories, so the
"Entering ..." message is printed several times in a typical use.
After the one long pause, this particular run generated several more
such messages without any noticeable pauses.

What the heck was it? A GC pause? Why would that grow the heap? The
heap should be near max size already and perhaps shrink somewhat in a
GC pause. On the other hand, a long pause accompanied by rapid heap
growth suggests the construction of a large data structure, so large
it's not only big in memory but slow to build. But there was no
significant CPU use by the Java process during the pause, which
suggests that it was performing *blocking I/O*. That suggests it was
loading something big from disk, but a) it hadn't even reached the
doseq yet to start reading and hashing files, b) the files never
reside completely in live heap at one time (only in 4KB chunks), and
c) the files themselves were smallish, few exceeding 1MB and none even
approaching 10MB.

The nearest thing I can figure out is that the process was going ahead
and hashing files, but the *out* display was lagging way behind. So,
the sysprint ran, then the doseq was well underway, and the heap grew
for a while as 4KB chunks were allocated ...

Except a single (byte-array 4096) is being recycled in the hashing
loop. It's a mutable Java array and DigestInputStream.read() is
overwriting its contents for each new chunk. And unless the attached
MessageDigest is building huge temporary data structures, which seems
unlikely, that's basically it. The heap should have grown by a few
megs, total, from String, String[], and File object churn, and
creating one more 4KB array for each new entire file (not each 4KB of
data), over the entire operation. If there were 20 subdirectories and
1000 files, a reasonable estimate, we're looking at:

20 String[] from File.list() on the subdirectories, average length 51
(1020/20), average size 212 bytes or so, for 4240 bytes.

1020 String for file paths, average length say around 30, for 30600 bytes.

1000 4KB byte[] arrays for just under 4MB.

1000 byte[]s of length 20 to hold the 160-bit hashes generated by
MessageDigests, for around 28000 bytes.

1000 MessageDigest objects. If we assume that they contain 20-element
byte[]s but return defensive copies, so those aren't the 28000 bytes
just counted, and (generously!) that they contain another 1KB of
temporary state at peak size, that's another meg.

1020 seq objects over arrays and a few layers of filter, remove, map,
etc. over these, say 10,000 seq objects total, backed by arrays --
generously call that another whole megabyte.

Assorted small temporary strings and such per object -- assume,
generously, yet another full meg total.

= about 7 megabytes.

With a 64 meg heap and a much smaller initial process size, I'd have
expected it to just grow by 7 megs (or less, probably quite a lot
less) and not trigger ANY garbage collection.

Any way I figure it, the code I wrote had no business causing the JVM
to allocate even ten megs, let alone over thirty, let alone
accompanied by a lengthy pause, whether for GC or disk I/O. Nor is
there an obvious reason for the System/out println's return to precede
the actual appearance of the text in the *out* tab by any significant
span of time. In particular, the Java task wasn't using detectable
levels of CPU (< 1% the whole time) so it couldn't have been hogging
the CPU and making Netbeans slow to update its display; and it wasn't
remote, so network latency is not a factor. NB was talking to it via
loopback interface and NB's own maintaining of the display is
essentially a CPU bound task that would not be impacted by contention
for access to the network or the disk controller (so, the Java task
sending the println to *out* and then heavily hogging the disk
controller during the file hashing should not have blocked NB from
receiving the text in the pipe from *out* and updating its display).

In fact the only thing that would seem like it could explain it is if
the operating system paused one or both processes to page a big chunk
of something in from disk itself, but with the memory usage by the
system under 60% at the time nothing should have been paged out to
begin with. And of course any paging of the NB process back in should
have concluded before I was able to issue any command at the REPL in
it, whereas the Java process's size was (before it began growing) only
a few megs, which should page in virtually instantaneously as judged
at human perception speeds. And of course that rapid process size
growth would remain unexplained.

Something's hinky. Either I'm not understanding something in how
System.out.println works, or in the Clojure code that called it, or in
how NB and an attached Clojure runtime communicate; or else something
even stranger is going on, like the operating system misreporting
process sizes or something. (If the Java process size was actually
much larger than it was claiming, then perhaps it both needed paging
and needed a lot of time to page in, for instance. Though the free
memory % suggested as much as a whole GB of unallocated RAM, meaning
that process would have to have been a LOT larger to force the OS to
page -- close to the -Xmx 1G I gave it, in fact, and then the pause
could have been a GC pause after all...then again, if the OS-reported
process sizes are suspect, then so is the OS-reported free memory %!
Damn, once you distrust any of your diagnostic tools, it opens a
Pandora's box that leads one to doubt everything and be sure of
nothing, unless you successfully predict the output and it was
statistically impossible to get by accident. The files all have the
same SHA-1 as reported by multiple tools as well as the MessageDigest
code, for instance, and that's highly unlikely to happen unless the
various SHA-1 algorithm implementations in these tools are in fact all
working correctly.)

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en

Reply via email to