Jonathan mercier created ARROW-11903:
----------------------------------------

             Summary: Stored data to parquet do not fit values before the storing
                 Key: ARROW-11903
                 URL: https://issues.apache.org/jira/browse/ARROW-11903
             Project: Apache Arrow
          Issue Type: Bug
          Components: Archery
    Affects Versions: 2.0.0
            Reporter: Jonathan mercier

Dear all,

I am seeing strange behavior: the data do not keep their values once stored to Parquet.

The schema is:
{code:python}
from pyarrow import field, int8, int64, list_, schema, string, struct

variations = struct((field('start', int64(), nullable=False),
                     field('stop', int64(), nullable=False),
                     field('reference', string(), nullable=False),
                     field('alternative', string(), nullable=False),
                     field('category', int8(), nullable=False)))
variations_field = field('variations', list_(variations))
metadata = {b'pandas': b'{"index_columns": ["sample"], '
                       b'"column_indexes": [{"name": null, "field_name": "sample", "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], '
                       b'"columns": ['
                       b'{"name": "variations", "field_name": "variations", "pandas_type": "list[object]", "numpy_type": "object", "metadata": null}, '
                       b'{"name": "sample", "field_name": "sample", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}], '
                       b'"pandas_version": "1.2.0"}'}
# sample_field is defined elsewhere and is not shown in this report
sample_to_variations_schema = schema((sample_field, variations_field), metadata=metadata)
{code}

To store the data I do:
{code:python}
from os import makedirs, path
from pyarrow import Table
from pyarrow.parquet import ParquetWriter

table = Table.from_arrays([samples, variations_by_sample], schema=sample_to_variations_schema)
dataset_dir = path.join(outdir, f'contig={contig}')
makedirs(dataset_dir, exist_ok=True)
with ParquetWriter(where=path.join(dataset_dir, 'variant_to_samples'), version='2.0', schema=table.schema, compression='SNAPPY') as pw:
    pw.write_table(table)
{code}

I put a breakpoint just after table is assigned, in order to check the values in memory. Example for row number 210027:
{code:python}
>>> samples[210027]
831028
>>> variations_by_sample[210027]
[(241, 241, 'C', 'T', 0), (445, 445, 'T', 'C', 0), (3037, 3037, 'C', 'T', 0),
 (6286, 6286, 'C', 'T', 0), (11024, 11024, 'A', 'G', 0), (14408, 14408, 'C', 'T', 0),
 (21255, 21255, 'G', 'C', 0), (22227, 22227, 'C', 'T', 0), (23403, 23403, 'A', 'G', 0),
 (24140, 24140, 'G', 'A', 0), (25496, 25496, 'T', 'C', 0), (26801, 26801, 'C', 'G', 0),
 (27840, 27840, 'T', 'C', 0), (27944, 27944, 'C', 'T', 0), (27948, 27948, 'G', 'T', 0),
 (28932, 28932, 'C', 'T', 0), (29645, 29645, 'G', 'T', 0)]
{code}

The application then ends successfully and the data are stored in a Parquet dataset. So I load the data back and check their consistency.
{code:python}
$ ipython
In [1]: from pyarrow.parquet import read_table
   ...: sample_to_variants = read_table('sample_to_variants_db')

In [2]: row_num = 0
   ...: an_id = 0
   ...: while an_id != 831028:
   ...:     an_id = sample_to_variants.column(0)[row_num].as_py()
   ...:     row_num += 1
   ...:

In [3]: sample_to_variants.column(0)[row_num-1].as_py()
Out[3]: 831028

In [4]: sample_to_variants.column(1)[row_num-1].as_py()
Out[4]:
[{'start': 241, 'stop': 241, 'reference': 'C', 'alternative': 'T', 'category': 0},
 {'start': 445, 'stop': 445, 'reference': 'G', 'alternative': 'T', 'category': 0},
 {'start': 3037, 'stop': 3037, 'reference': 'G', 'alternative': 'T', 'category': 0},
 {'start': 6286, 'stop': 6286, 'reference': 'C', 'alternative': 'T', 'category': 0},
 {'start': 11024, 'stop': 11024, 'reference': 'C', 'alternative': 'T', 'category': 0},
 {'start': 14408, 'stop': 14408, 'reference': 'C', 'alternative': 'T', 'category': 0},
 {'start': 21255, 'stop': 21255, 'reference': 'G', 'alternative': 'T', 'category': 0},
 {'start': 22227, 'stop': 22227, 'reference': 'G', 'alternative': 'A', 'category': 0},
 {'start': 23403, 'stop': 23403, 'reference': 'C', 'alternative': 'T', 'category': 0},
 {'start': 24140, 'stop': 24140, 'reference': 'C', 'alternative': 'T', 'category': 0},
 {'start': 25496, 'stop': 25496, 'reference': 'A', 'alternative': 'G', 'category': 0},
 {'start': 26801, 'stop': 26801, 'reference': 'G', 'alternative': 'T', 'category': 0},
 {'start': 27840, 'stop': 27840, 'reference': 'C', 'alternative': 'T', 'category': 0},
 {'start': 27944, 'stop': 27944, 'reference': 'T', 'alternative': 'C', 'category': 0},
 {'start': 27948, 'stop': 27948, 'reference': 'G', 'alternative': 'A', 'category': 0},
 {'start': 28932, 'stop': 28932, 'reference': 'C', 'alternative': 'T', 'category': 0},
 {'start': 29645, 'stop': 29645, 'reference': 'G', 'alternative': 'A', 'category': 0}]
{code}

We can see that column 1 (0-based) does not have the same values as it had before being written to Parquet. For example, in the Parquet dataset I have this value:
{code:python}
{'start': 24140, 'stop': 24140, 'reference': 'C', 'alternative': 'T', 'category': 0},
{code}
while in memory, before being stored, it was:
{code:python}
(24140, 24140, 'G', 'A', 0)
{code}

I do not understand what mechanism leads to this inconsistency, so I am not able to produce a minimal example case (sorry).

Thanks



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
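
The manual comparison in the report (in-memory tuples versus the dicts read back from Parquet) could be made systematic with a small field-by-field check. This is only a sketch in plain Python: `FIELDS` and `diff_row` are hypothetical names introduced for illustration, not part of pyarrow, and the field order is assumed to follow the struct schema given in the report.

```python
# Field order assumed from the struct schema in the report.
FIELDS = ('start', 'stop', 'reference', 'alternative', 'category')


def diff_row(expected, stored):
    """Return (position, field, expected_value, stored_value) per mismatch.

    expected: list of tuples as held in memory before writing.
    stored:   list of dicts as returned by .as_py() after reading back.
    """
    if len(expected) != len(stored):
        raise ValueError(f'length differs: {len(expected)} != {len(stored)}')
    mismatches = []
    for pos, (exp, got) in enumerate(zip(expected, stored)):
        for name, exp_value in zip(FIELDS, exp):
            if got[name] != exp_value:
                mismatches.append((pos, name, exp_value, got[name]))
    return mismatches


# The single element quoted in the report.
in_memory = [(24140, 24140, 'G', 'A', 0)]
from_parquet = [{'start': 24140, 'stop': 24140, 'reference': 'C',
                 'alternative': 'T', 'category': 0}]

for mismatch in diff_row(in_memory, from_parquet):
    print(mismatch)
```

Run over every row, a check like this would show whether only the string fields differ and whether the mismatches start at a particular row, which may help narrow the problem down to a smaller reproducer.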