[
https://issues.apache.org/jira/browse/BEAM-6748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16784093#comment-16784093
]
Valentyn Tymofieiev commented on BEAM-6748:
-------------------------------------------
I suspect the problem here may be with the test.
Adding some debug output at
[https://github.com/apache/beam/blob/af2e5bd8a42ea0eb7ce12ea29b8a32757accc197/sdks/python/apache_beam/io/avroio.py#L465],
we can see that on Python 3 block.size is roughly 16000 (bytes?), while on
Python 2 it is roughly 64000. The exact numbers vary slightly, but I think there
is some implementation detail in the fastavro library that makes the default
block size <=64k on Python 2 and <=16k on Python 3.
So on Python 2 we have 3 blocks total, while on Python 3 we have 11 blocks.
The test assumes that there are always 3 blocks:
[https://github.com/apache/beam/blob/af2e5bd8a42ea0eb7ce12ea29b8a32757accc197/sdks/python/apache_beam/io/avroio_test.py#L302]
[~chamikara], do you know how fastavro.read.block_reader selects the block
size? Are there any requirements or guarantees that dictate a certain size, or
may it be implementation/platform-dependent?
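To make the arithmetic concrete, here is a rough sketch of how the block-size limit would translate into the block counts and the report tail seen in the failing assertion. The helper names and the 170,000-byte total are hypothetical, chosen only so that the numbers match the observed 3 and 11 blocks; this is not taken from avroio or fastavro, and it assumes every block is filled up to the limit.

```python
import math


def estimated_block_count(total_bytes, max_block_bytes):
    """Rough number of blocks if each block holds up to max_block_bytes."""
    return math.ceil(total_bytes / max_block_bytes)


def expected_report_tail(num_blocks, n=10):
    """Tail of the split-points report while the last block is being read.

    Each entry is (blocks already consumed, blocks remaining).
    """
    return [(num_blocks - 1, 1)] * n


# Hypothetical total payload size; an assumption, not a measured value.
TOTAL_BYTES = 170_000

py2_blocks = estimated_block_count(TOTAL_BYTES, 64_000)  # ~64k limit
py3_blocks = estimated_block_count(TOTAL_BYTES, 16_000)  # ~16k limit

print(py2_blocks, expected_report_tail(py2_blocks)[0])
print(py3_blocks, expected_report_tail(py3_blocks)[0])
```

If this reading is right, the test could derive its expected tail from the actual number of blocks in the file instead of hard-coding (2, 1).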
> Splitting logic in Avro IO tests behaves unexpectedly in Python 3
> -----------------------------------------------------------------
>
> Key: BEAM-6748
> URL: https://issues.apache.org/jira/browse/BEAM-6748
> Project: Beam
> Issue Type: Sub-task
> Components: sdk-py-core
> Reporter: Valentyn Tymofieiev
> Assignee: Valentyn Tymofieiev
> Priority: Major
>
> *apache_beam.io.avroio_test.TestAvro.test_split_points*
> *apache_beam.io.avroio_test.TestFastAvro.test_split_points*
> fail with:
>
> {code:java}
> Traceback (most recent call last):
> File "/home/robbe/workspace/beam/sdks/python/apache_beam/io/avroio_test.py",
> line 308, in test_split_points
> self.assertEquals(split_points_report[-10:], [(2, 1)] * 10)
> AssertionError: Lists differ: [(10, 1), (10, 1), (10, 1), (10, 1), (10, 1[42
> chars], 1)] != [(2, 1), (2, 1), (2, 1), (2, 1), (2, 1), (2[32 chars], 1)]
> First differing element 0:
> (10, 1)
> (2, 1)
> + [(2, 1), (2, 1), (2, 1), (2, 1), (2, 1), (2, 1), (2, 1), (2, 1), (2, 1),
> (2, 1)]
> - [(10, 1),
> - (10, 1),
> - (10, 1),
> - (10, 1),
> - (10, 1),
> - (10, 1),
> - (10, 1),
> - (10, 1),
> - (10, 1),
> - (10, 1)] {code}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)