Strange occasional marshal error
Hi,

I'm using Python with ZeroMQ to distribute data around an HPC cluster. The results have been good apart from one issue which I am completely stuck with:

We are using marshal for serialising objects before distributing them around the cluster, and extremely occasionally a corrupted marshal is produced. The current workaround is to serialise everything twice and check that the serialisations are the same. On the rare occasions that they are not, I have dumped the files for comparison. It turns out that there are a few positions within the serialisation where corruption tends to occur (these positions seem to be independent of the data or the size of the complete serialisation). These are:

4 bytes starting at 548867 (0x86003)
4 bytes starting at 4398083 (0x431c03)
4 bytes starting at 17595395 (0x10c7c03)
4 bytes starting at 19794819 (0x12e0b83)
4 bytes starting at 22269171 (0x153ccf3)
2 bytes starting at 25052819 (0x17e4693)
3 bytes starting at 28184419 (0x1ae0f63)

I note that the ratio between the later positions is almost exactly 1.125. Presumably this has something to do with memory allocation somewhere?

Some datapoints:

- The phenomenon has been observed in a single-threaded process without ZeroMQ
- I think the phenomenon has been observed in pickled as well as marshalled data
- The phenomenon has been observed on different hardware

Unfortunately, after quite a lot of work I still haven't managed to reproduce this error on a single machine. Hopefully the above is enough information for someone to speculate as to where the problem is.

Many thanks in advance for any help.

Regards,

Graham
--
http://mail.python.org/mailman/listinfo/python-list
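For readers who want to try the workaround described in this message, here is a minimal sketch of serialising twice and locating the bytes that differ. The helper names `dumps_checked` and `differing_runs` are hypothetical and are not taken from the poster's code, and the sketch is written for a current Python 3 rather than the 2.6/2.7 interpreters discussed in the thread.

```python
import marshal


def differing_runs(a, b):
    """Return (start, length) pairs for each run of bytes where a and b differ.

    Only the common prefix length of the two byte strings is compared.
    """
    runs = []
    start = None
    limit = min(len(a), len(b))
    for i in range(limit):
        if a[i] != b[i]:
            if start is None:
                start = i          # a new run of differing bytes begins here
        elif start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:          # a run that extends to the end of the data
        runs.append((start, limit - start))
    return runs


def dumps_checked(obj):
    """Serialise obj twice and raise if the two results are not identical."""
    first = marshal.dumps(obj)
    second = marshal.dumps(obj)
    if first != second:
        raise ValueError(
            "marshal output differs at (offset, length): %r"
            % differing_runs(first, second)
        )
    return first


if __name__ == "__main__":
    data = [(i, str(i)) for i in range(1000)]
    payload = dumps_checked(data)
    print(len(payload), "bytes serialised consistently")
```

Reporting runs of (offset, length) rather than individual byte positions mirrors the "4 bytes starting at ..." form used in the list above, which makes it easier to compare dumps from different failures.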
Re: Strange occasional marshal error
On Mar 2, 3:01 pm, Graham Stratton wrote:
> We are using marshal for serialising objects before distributing them
> around the cluster, and extremely occasionally a corrupted marshal is
> produced. The current workaround is to serialise everything twice and
> check that the serialisations are the same. On the rare occasions that
> they are not, I have dumped the files for comparison. It turns out
> that there are a few positions within the serialisation where
> corruption tends to occur (these positions seem to be independent of
> the data or the size of the complete serialisation). These are:
>
> 4 bytes starting at 548867 (0x86003)
> 4 bytes starting at 4398083 (0x431c03)
> 4 bytes starting at 17595395 (0x10c7c03)
> 4 bytes starting at 19794819 (0x12e0b83)
> 4 bytes starting at 22269171 (0x153ccf3)
> 2 bytes starting at 25052819 (0x17e4693)
> 3 bytes starting at 28184419 (0x1ae0f63)

I modified marshal.c to print a message whenever it extends the string that the marshal is written to. This gave me these results:

>>> s = marshal.dumps(list((i, str(i)) for i in range(140)))
Resizing string from 50 to 1124 bytes
Resizing string from 1124 to 3272 bytes
Resizing string from 3272 to 7568 bytes
Resizing string from 7568 to 16160 bytes
Resizing string from 16160 to 33344 bytes
Resizing string from 33344 to 67712 bytes
Resizing string from 67712 to 136448 bytes
Resizing string from 136448 to 273920 bytes
Resizing string from 273920 to 548864 bytes
Resizing string from 548864 to 1098752 bytes
Resizing string from 1098752 to 2198528 bytes
Resizing string from 2198528 to 4398080 bytes
Resizing string from 4398080 to 8797184 bytes
Resizing string from 8797184 to 17595392 bytes
Resizing string from 17595392 to 19794816 bytes
Resizing string from 19794816 to 22269168 bytes
Resizing string from 22269168 to 25052814 bytes
Resizing string from 25052814 to 28184415 bytes
Resizing string from 28184415 to 31707466 bytes

Every corruption point occurs exactly three bytes above an extension point (rounded to the nearest word for the last two). This clearly isn't a coincidence, but I can't see where there could be a problem. I'd be grateful for any pointers.

Thanks,

Graham
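The relationship between the resize log and the corruption offsets can be checked with a short script. This is only a sketch under stated assumptions: the growth rule (double and add 1024 bytes, switching to growth by one eighth once doubling would exceed a threshold) is inferred from the logged sizes rather than quoted from marshal.c, the 32 MiB threshold is just one value consistent with the log, and the 4-byte word size is assumed.

```python
# Buffer sizes printed by the instrumented marshal.c, copied from the log above.
logged_sizes = [
    50, 1124, 3272, 7568, 16160, 33344, 67712, 136448, 273920, 548864,
    1098752, 2198528, 4398080, 8797184, 17595392, 19794816, 22269168,
    25052814, 28184415, 31707466,
]

# Start offsets of the corrupted bytes, copied from the original report.
corruption_offsets = [
    548867, 4398083, 17595395, 19794819, 22269171, 25052819, 28184419,
]

# Assumption: the log only shows that the switch happens somewhere between
# 17595392 and 35191808 bytes; 32 MiB is one value in that range.
THRESHOLD = 32 * 1024 * 1024


def next_size(size):
    """Growth rule inferred from the log (an assumption, not quoted source)."""
    doubled = 2 * size + 1024
    if doubled <= THRESHOLD:
        return doubled
    return size + size // 8        # 12.5% growth, i.e. a ratio of about 1.125


def replay(start, steps):
    """Apply next_size repeatedly, returning every size including the start."""
    sizes = [start]
    for _ in range(steps):
        sizes.append(next_size(sizes[-1]))
    return sizes


def round_up_to_word(n, word=4):
    """Round n up to the next multiple of word (4-byte word size is assumed)."""
    return (n + word - 1) // word * word


def offsets_past_resize(offsets, sizes, word=4):
    """Map each offset to its distance past the nearest word-aligned resize point."""
    result = {}
    for off in offsets:
        base = max(s for s in sizes if s <= off)   # last resize at or below off
        result[off] = off - round_up_to_word(base, word)
    return result


if __name__ == "__main__":
    print(replay(50, len(logged_sizes) - 1) == logged_sizes)
    for off, delta in offsets_past_resize(corruption_offsets, logged_sizes).items():
        print(off, "is", delta, "bytes past a word-aligned resize point")
```

The 12.5% growth step also accounts for the ratio of almost exactly 1.125 between the later corruption positions noted in the first message.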
Re: Strange occasional marshal error
On Mar 3, 6:04 pm, Guido van Rossum wrote:
> This bug report doesn't mention the Python version nor the platform --
> it could in theory be a bug in the platform compiler or memory
> allocator.

I've seen the problem with 2.6 and 2.7, on RHEL 4 (possibly with a custom kernel, I can't check at the moment).

> It would also be nice to provide the test program that
> reproduces the issue.

I'm working on trying to reproduce it without the proprietary code that uses it, but so far haven't managed it.

There are some custom C extensions in the system where this is observed, but since the code is single-threaded I don't think they can have any effect during marshalling.

Thanks,

Graham