Hi Lucas,

My understanding is that the Parquet format itself does not place any such restrictions on field names, so this is a Spark SQL-specific issue (anyone please correct me if I'm mistaken about this). I would be happy to help add a schema-cleaning option to normalize field names for use in Spark. I just opened:
https://issues.apache.org/jira/browse/ARROW-1359

Thanks,
Wes

On Wed, Aug 16, 2017 at 11:58 AM, Lucas Pickup
<lucas.pic...@microsoft.com.invalid> wrote:
> Hello,
>
> I have been using pyarrow and PySpark to write Parquet files. I have used
> pyarrow to successfully write out a Parquet file with spaces in column
> names, e.g. 'X Coordinate'.
> When I try to write out the same dataset using Spark's Parquet writer, it
> fails, claiming:
> "Attribute name "X Coordinate" contains invalid character(s) among
> " ,;{}()\n\t="."
> It seems that according to Spark's Parquet implementation, the above
> characters are not allowed in a Parquet schema because they carry special
> meaning.
> The code that checks this is here:
> https://github.com/apache/spark/blob/cba826d00173a945b0c9a7629c66e36fa73b723e/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L565
>
> I was wondering if there was a reason why the implementations have such a
> major difference when it comes to schema generation?
>
> Cheers,
> Lucas Pickup
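
For anyone hitting this in the meantime, here is a minimal workaround sketch on the pyarrow side: rename the columns before writing so Spark will accept the file. The underscore replacement rule is just an illustrative choice, not necessarily what the ARROW-1359 option will end up doing.

import re

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Characters rejected by Spark's ParquetSchemaConverter: " ,;{}()\n\t="
INVALID_CHARS = re.compile(r'[ ,;{}()\n\t=]')

def sanitize_column_names(df):
    # Replace each Spark-incompatible character with an underscore.
    return df.rename(columns=lambda name: INVALID_CHARS.sub('_', name))

df = pd.DataFrame({'X Coordinate': [1.0, 2.0], 'Y Coordinate': [3.0, 4.0]})
table = pa.Table.from_pandas(sanitize_column_names(df))
pq.write_table(table, 'coords.parquet')  # columns become X_Coordinate, Y_Coordinate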