Martin Durant created ARROW-3247:
------------------------------------

             Summary: Support spark array and map types
                 Key: ARROW-3247
                 URL: https://issues.apache.org/jira/browse/ARROW-3247
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
            Reporter: Martin Durant

As far as I understand, there is already some support for nested array/dict/structs in Arrow. However, Spark Map and List types are structured one level deeper (I believe to allow for both NULL and empty entries). Surprisingly, fastparquet can load these. I do not know the plan for arbitrary nested object support, but it should be made clear.

Schema of a Spark-generated file from the fastparquet test suite (please see in text mode):

- spark_schema:
| - map_op_op: MAP, OPTIONAL
|   - key_value: REPEATED
|   | - key: BYTE_ARRAY, UTF8, REQUIRED
|     - value: BYTE_ARRAY, UTF8, OPTIONAL
| - map_op_req: MAP, OPTIONAL
|   - key_value: REPEATED
|   | - key: BYTE_ARRAY, UTF8, REQUIRED
|     - value: BYTE_ARRAY, UTF8, REQUIRED
| - map_req_op: MAP, REQUIRED
|   - key_value: REPEATED
|   | - key: BYTE_ARRAY, UTF8, REQUIRED
|     - value: BYTE_ARRAY, UTF8, OPTIONAL
| - map_req_req: MAP, REQUIRED
|   - key_value: REPEATED
|   | - key: BYTE_ARRAY, UTF8, REQUIRED
|     - value: BYTE_ARRAY, UTF8, REQUIRED
| - arr_op_op: LIST, OPTIONAL
|   - list: REPEATED
|     - element: BYTE_ARRAY, UTF8, OPTIONAL
| - arr_op_req: LIST, OPTIONAL
|   - list: REPEATED
|     - element: BYTE_ARRAY, UTF8, REQUIRED
| - arr_req_op: LIST, REQUIRED
|   - list: REPEATED
|     - element: BYTE_ARRAY, UTF8, OPTIONAL
  - arr_req_req: LIST, REQUIRED
    - list: REPEATED
      - element: BYTE_ARRAY, UTF8, REQUIRED

(Please forgive that some of this has already been mentioned elsewhere; this is one of the entries in the list at https://github.com/dask/fastparquet/issues/374, as a feature that is useful in fastparquet.)
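To illustrate why the extra level may matter, here is a minimal sketch in plain Python of how an arr_op_op-style column (an OPTIONAL list of OPTIONAL strings) could map onto Parquet's Dremel-style repetition and definition levels. The helper `encode_optional_list` is hypothetical and is not part of Arrow or fastparquet; it only assumes the three-level layout shown in the schema, where the OPTIONAL outer group, the REPEATED `list` group, and the OPTIONAL `element` each contribute one definition level.

```python
from typing import List, Optional, Tuple

# One row of the column: either NULL, or a list whose items may be NULL.
Row = Optional[List[Optional[str]]]
# (repetition level, definition level, value)
Triple = Tuple[int, int, Optional[str]]


def encode_optional_list(rows: List[Row]) -> List[Triple]:
    """Hypothetical helper: encode rows of an OPTIONAL list of OPTIONAL
    strings into (repetition, definition, value) triples."""
    triples: List[Triple] = []
    for row in rows:
        if row is None:
            # The list itself is NULL: nothing on the path is defined.
            triples.append((0, 0, None))
        elif len(row) == 0:
            # The list exists but has no entries.
            triples.append((0, 1, None))
        else:
            for i, item in enumerate(row):
                # 0 starts a new row, 1 continues the current list.
                rep = 0 if i == 0 else 1
                # 2: the entry exists but is NULL, 3: the entry has a value.
                definition = 2 if item is None else 3
                triples.append((rep, definition, item))
    return triples


rows: List[Row] = [None, [], [None], ["a", "b"]]
triples = encode_optional_list(rows)
for triple in triples:
    print(triple)
```

Under these assumptions a NULL list, an empty list, and a list holding a single NULL each get a distinct definition level (0, 1, and 2), which a flatter two-level layout could not tell apart.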