Strange occasional marshal error

2011-03-02 Thread Graham Stratton
Hi,

I'm using Python with ZeroMQ to distribute data around an HPC cluster.
The results have been good apart from one issue which I am completely
stuck with:

We are using marshal for serialising objects before distributing them
around the cluster, and extremely occasionally a corrupted marshal is
produced. The current workaround is to serialise everything twice and
check that the serialisations are the same. On the rare occasions that
they are not, I have dumped the files for comparison. It turns out
that there are a few positions within the serialisation where
corruption tends to occur (these positions seem to be independent of
the data or the size of the complete serialisation). These are:

4 bytes starting at 548867 (0x86003)
4 bytes starting at 4398083 (0x431c03)
4 bytes starting at 17595395 (0x10c7c03)
4 bytes starting at 19794819 (0x12e0b83)
4 bytes starting at 22269171 (0x153ccf3)
2 bytes starting at 25052819 (0x17e4693)
3 bytes starting at 28184419 (0x1ae0f63)
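
The serialise-twice workaround and the comparison of the two dumps can be
sketched roughly as follows. This is only an illustration, not the code
used on the cluster: the helper names diff_spans and checked_dumps are
made up for this example, and the comparison only covers the common
prefix if the two serialisations differ in length.

```python
import marshal


def diff_spans(a, b):
    """Return (start, length) runs where two byte strings differ.

    Only the common prefix is compared; a length mismatch is not reported.
    """
    spans = []
    start = None
    limit = min(len(a), len(b))
    for i in range(limit):
        if a[i] != b[i]:
            if start is None:
                start = i          # a new run of differing bytes begins
        elif start is not None:
            spans.append((start, i - start))
            start = None
    if start is not None:          # run extends to the end of the data
        spans.append((start, limit - start))
    return spans


def checked_dumps(obj):
    """Serialise twice and raise if the two results disagree."""
    first = marshal.dumps(obj)
    second = marshal.dumps(obj)
    if first != second:
        raise ValueError("marshal mismatch at %r" % diff_spans(first, second))
    return first
```

Each (start, length) pair corresponds to one line of the list above, for
example (548867, 4).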

I note that the ratio between the later positions is almost exactly
1.125. Presumably this has something to do with memory allocation
somewhere?
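
The ratio claim can be checked with a few lines of arithmetic on the
offsets listed above. Nothing here is new data; the variable names are
just for illustration.

```python
# Corruption offsets as reported in the list above.
offsets = [548867, 4398083, 17595395, 19794819,
           22269171, 25052819, 28184419]

# Ratio of each offset to the one before it.
ratios = [later / float(earlier)
          for earlier, later in zip(offsets, offsets[1:])]

for earlier, later, ratio in zip(offsets, offsets[1:], ratios):
    print("%d -> %d: x%.6f" % (earlier, later, ratio))
```

The last four ratios agree with 1.125 to better than one part in a
million, while the first two come out close to 8 and 4.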

Some datapoints:

- The phenomenon has been observed in a single-threaded process
without ZeroMQ
- I think the phenomenon has been observed in pickled as well as
marshalled data
- The phenomenon has been observed on different hardware

Unfortunately after quite a lot of work I still haven't managed to
reproduce this error on a single machine. Hopefully the above is
enough information for someone to speculate as to where the problem
is.

Many thanks in advance for any help.

Regards,

Graham
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Strange occasional marshal error

2011-03-03 Thread Graham Stratton
On Mar 2, 3:01 pm, Graham Stratton wrote:
> We are using marshal for serialising objects before distributing them
> around the cluster, and extremely occasionally a corrupted marshal is
> produced. The current workaround is to serialise everything twice and
> check that the serialisations are the same. On the rare occasions that
> they are not, I have dumped the files for comparison. It turns out
> that there are a few positions within the serialisation where
> corruption tends to occur (these positions seem to be independent of
> the data or the size of the complete serialisation). These are:
>
> 4 bytes starting at 548867 (0x86003)
> 4 bytes starting at 4398083 (0x431c03)
> 4 bytes starting at 17595395 (0x10c7c03)
> 4 bytes starting at 19794819 (0x12e0b83)
> 4 bytes starting at 22269171 (0x153ccf3)
> 2 bytes starting at 25052819 (0x17e4693)
> 3 bytes starting at 28184419 (0x1ae0f63)

I modified marshal.c to print a message whenever it extends the string
buffer that the marshal is written into. This gave me these results:

>>> s = marshal.dumps(list((i, str(i)) for i in range(140)))
Resizing string from 50 to 1124 bytes
Resizing string from 1124 to 3272 bytes
Resizing string from 3272 to 7568 bytes
Resizing string from 7568 to 16160 bytes
Resizing string from 16160 to 33344 bytes
Resizing string from 33344 to 67712 bytes
Resizing string from 67712 to 136448 bytes
Resizing string from 136448 to 273920 bytes
Resizing string from 273920 to 548864 bytes
Resizing string from 548864 to 1098752 bytes
Resizing string from 1098752 to 2198528 bytes
Resizing string from 2198528 to 4398080 bytes
Resizing string from 4398080 to 8797184 bytes
Resizing string from 8797184 to 17595392 bytes
Resizing string from 17595392 to 19794816 bytes
Resizing string from 19794816 to 22269168 bytes
Resizing string from 22269168 to 25052814 bytes
Resizing string from 25052814 to 28184415 bytes
Resizing string from 28184415 to 31707466 bytes
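
The sizes in this log are consistent with a simple growth rule: double
the buffer and add 1024 bytes, but once that would exceed some cap, grow
by one eighth instead (which is where the 1.125 ratio comes from). The
sketch below reproduces the logged sequence under that rule. The rule and
the 32 MiB cap are inferred from the numbers above and should be treated
as assumptions, not as a quotation of marshal.c; any cap between the
largest doubled size in the log and the first rejected one would fit
equally well.

```python
# New sizes taken from the "Resizing string" log above.
LOGGED = [1124, 3272, 7568, 16160, 33344, 67712, 136448, 273920,
          548864, 1098752, 2198528, 4398080, 8797184, 17595392,
          19794816, 22269168, 25052814, 28184415, 31707466]

ASSUMED_CAP = 32 * 1024 * 1024   # assumption: inferred, not read from source


def next_size(size, cap=ASSUMED_CAP):
    """Return the next buffer size under the assumed growth rule."""
    doubled = 2 * size + 1024
    if doubled > cap:
        return size + (size >> 3)   # 12.5% growth once doubling is too big
    return doubled


def simulate(start, steps):
    """Apply next_size repeatedly, returning each new size in order."""
    sizes = []
    size = start
    for _ in range(steps):
        size = next_size(size)
        sizes.append(size)
    return sizes


print(simulate(50, len(LOGGED)) == LOGGED)
```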

Every corruption point occurs exactly three bytes above an extension
point (after rounding the extension point up to a word boundary for the
last two). This clearly isn't a coincidence, but I can't see where there
could be a problem. I'd be grateful for any pointers.
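
The three-byte relationship can be checked directly by pairing each
corruption offset with the extension point just below it. The word size
of 4 bytes is an assumption made for this sketch; 8 gives the same
result for these particular values.

```python
WORD = 4  # assumed word size in bytes


def round_up(n, word=WORD):
    """Round n up to the next multiple of word."""
    return (n + word - 1) // word * word


# (corruption offset, buffer size at the extension point just below it)
pairs = [
    (548867, 548864),
    (4398083, 4398080),
    (17595395, 17595392),
    (19794819, 19794816),
    (22269171, 22269168),
    (25052819, 25052814),
    (28184419, 28184415),
]

gaps = [offset - round_up(size) for offset, size in pairs]
print(gaps)
```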

Thanks,

Graham


Re: Strange occasional marshal error

2011-03-03 Thread Graham Stratton
On Mar 3, 6:04 pm, Guido van Rossum wrote:
> This bug report doesn't mention the Python version nor the platform --
> it could in theory be a bug in the platform compiler or memory
> allocator.

I've seen the problem with Python 2.6 and 2.7, on RHEL 4 (possibly with
a custom kernel; I can't check at the moment).

> It would also be nice to provide the test program that
> reproduces the issue.

I'm working on trying to reproduce it without the proprietary code
that uses it, but so far haven't managed it. There are some custom C
extensions in the system where this is observed, but since the code is
single-threaded I don't think they can have any effect during
marshalling.

Thanks,

Graham