On Dec 9, 2004, at 2:32 PM, Paul Johnson wrote:

> On Thu, Dec 09, 2004 at 02:06:39PM -0800, Kevin Scaldeferri wrote:
>
> > My latest theory was that my forked processes were stomping on each
> > other and corrupting the stored data structure.  So I replaced the
> > calls to nstore and retrieve with lock_nstore and lock_retrieve in
> > DB.pm and Structure.pm.  This doesn't seem to have helped.

> The run files are named
>
>     time . ".$$." . sprintf "%05d", rand 2 ** 16
>
> which should protect against forked processes. It may be that the
> structure files are not so well protected.

> > I ran Storable::retrieve on all the files in my cover_db after the
> > latest failure.  I found that all the files in cover_db/runs were
> > corrupted, while only one out of a couple hundred files in
> > cover_db/structure was bad.  I don't know if that's a useful bit of
> > info or not.

> Are you absolutely certain that there's not another version of Storable
> around which could be being picked up? Maybe you could try printing out
> the version of Storable being used before nstore is called?



Okay, so it turns out I was being a bit of an idiot and failed to notice that cover_db/runs/ contains directories, not files. So, naturally, Storable can't retrieve a directory. The actual files one level deeper seem fine.


So, in reality, I just have one structure file that is bad.
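
For anyone repeating the check, here is a minimal sketch that descends into subdirectories and only hands plain files to Storable::retrieve, so directories are never mistaken for corrupt files. The find_unreadable name is made up for illustration and is not part of Devel::Cover.

```perl
use strict;
use warnings;
use File::Find ();
use Storable ();

# Hypothetical helper: return the plain files under $dir that
# Storable::retrieve cannot read back.
sub find_unreadable {
    my ($dir) = @_;
    my @bad;
    File::Find::find(
        {
            no_chdir => 1,    # keep $File::Find::name usable as a path
            wanted   => sub {
                return unless -f $File::Find::name;    # skip directories
                # retrieve croaks on a bad file and returns undef on I/O errors
                my $data = eval { Storable::retrieve($File::Find::name) };
                push @bad, $File::Find::name unless defined $data;
            },
        },
        $dir,
    );
    @bad = sort @bad;
    return @bad;
}

# Only scan when a directory is given on the command line.
if (@ARGV) {
    print "$_\n" for find_unreadable($ARGV[0]);
}
```

Run it with the coverage database directory as the only argument, e.g. cover_db, and it prints one path per unreadable file.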

I did try adding a line to print out the Storable version, and it doesn't change. However, I also couldn't reproduce the bug with that line in there. Yay for Heisenbugs!
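
The exact line used isn't shown above; a check along these lines (an assumed equivalent, with storable_info as a made-up name) reports both the version and the file Storable was loaded from, which would also expose a second copy on @INC:

```perl
use strict;
use warnings;
use Storable ();

# Hypothetical helper: describe which Storable is actually in use.
sub storable_info {
    return sprintf "Storable %s loaded from %s",
        Storable->VERSION, $INC{'Storable.pm'};
}

# Emit on STDERR so it doesn't mix with normal program output.
warn storable_info(), "\n";
```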

I think this supports the idea that there is a race condition between multiple processes attempting to write the file: if the additional I/O slows things down enough, the problem goes away. Except that I'm still using lock_retrieve and lock_nstore in Structure.pm and it still happens. Nothing else modifies those files, right?
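
One way to test the race theory without relying on locks is to write each structure file under a temporary name and rename it into place, so a reader sees either the old file or the complete new one. This is a sketch of that alternative technique, not what Devel::Cover does; atomic_nstore is a made-up name, and it assumes the temp file and the target live on the same filesystem, where rename is atomic on POSIX systems.

```perl
use strict;
use warnings;
use File::Basename ();
use File::Temp ();
use Storable ();

# Hypothetical helper: store $data at $path via temp file plus rename.
sub atomic_nstore {
    my ($data, $path) = @_;
    my $dir = File::Basename::dirname($path);

    # Create the temp file next to the target so rename stays on one filesystem.
    my ($fh, $tmp) = File::Temp::tempfile("store.XXXXXX", DIR => $dir);
    close $fh or die "close $tmp: $!";

    Storable::nstore($data, $tmp) or die "nstore to $tmp failed";
    rename $tmp, $path or die "rename $tmp -> $path: $!";
    return 1;
}
```

If the corruption disappears with this in place of lock_nstore, that would point at a writer or reader that bypasses the locks, since the flock-style locks Storable takes are advisory.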


-kevin


