Hi Wes: Somehow I have been inadvertently added to your list and am getting all these emails that make no sense to me at all. I'm in on some conversation I know nothing about and am getting up to 20 emails a day from different people. Can I ask you to remove me from your list and can you get all the other people in your group to remove me as well? Thanks!
Erin Sobkow, BA Kin, RMT
Community Consultant
Parkland Valley Sport, Culture & Recreation District
Box 263, Yorkton, SK S3N 2V7
Phone: (306) 786-6585
Fax: (306) 782-0474
Email: esob...@parklandvalley.ca
Website: www.parklandvalley.ca

If you no longer wish to receive electronic messages from Parkland Valley Sport, Culture & Recreation District please reply with the word 'STOP'.

Together...building healthy communities through sport, culture and recreation

-----Original Message-----
From: Wes McKinney [mailto:wesmck...@gmail.com]
Sent: August 16, 2017 10:04 AM
To: dev@arrow.apache.org
Subject: Re: Major difference between Spark and Arrow Parquet Implementations

hi Lucas,

My understanding is that the Parquet format by itself does not place any such restrictions on the names of fields, so this is a Spark SQL-specific issue (anyone please correct me if I'm mistaken about this). I would be happy to help add a schema-cleaning option to normalize field names for use in Spark. I just opened:

https://issues.apache.org/jira/browse/ARROW-1359

Thanks
Wes

On Wed, Aug 16, 2017 at 11:58 AM, Lucas Pickup <lucas.pic...@microsoft.com.invalid> wrote:
> Hello,
>
> I have been using pyarrow and PySpark to write Parquet files. I have used pyarrow to successfully write out a Parquet file with spaces in column names, e.g. 'X Coordinate'.
> When I try to write out the same dataset using Spark's Parquet writer, it fails, claiming:
> "Attribute name "X Coordinate" contains invalid character(s) among " ,;{}()\n\t="".
> It seems that, according to Spark's Parquet implementation, the characters above are not allowed in a Parquet schema because they carry special meaning.
> The code that checks this is here:
> https://github.com/apache/spark/blob/cba826d00173a945b0c9a7629c66e36fa73b723e/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L565
>
> I was wondering if there was a reason why the implementations have such a major difference when it comes to schema generation?
>
> Cheers,
> Lucas Pickup
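For anyone hitting the same error in the archives, here is a minimal sketch of the behavior Lucas describes, assuming a reasonably recent pyarrow and PySpark; the file names are placeholders:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # pyarrow places no restrictions on field names, so a space is fine here
    table = pa.Table.from_pydict({'X Coordinate': [1.0, 2.0, 3.0]})
    pq.write_table(table, 'coords.parquet')  # succeeds

    # Spark's Parquet writer, by contrast, rejects the same column name:
    #   spark.createDataFrame([(1.0,)], ['X Coordinate']) \
    #        .write.parquet('coords_spark.parquet')
    #   => AnalysisException: Attribute name "X Coordinate" contains
    #      invalid character(s) among " ,;{}()\n\t="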
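Until something like ARROW-1359 lands, the normalization Wes mentions can be approximated by renaming columns before the data reaches Spark. This is only a sketch: clean_name and the underscore-replacement policy are my own invention, not an Arrow API; Table.rename_columns assumes a recent pyarrow.

    import re

    # The characters Spark's check rejects, per the error message above
    _INVALID = re.compile(r'[ ,;{}()\n\t=]')

    def clean_name(name):
        # Hypothetical policy: replace each disallowed character with '_'
        return _INVALID.sub('_', name)

    # Rename the pyarrow table's columns before handing the data to Spark
    table = table.rename_columns([clean_name(c) for c in table.column_names])

The same cleanup could equally be done on the Spark side with df.withColumnRenamed for each offending column; doing it at the pyarrow layer just keeps the Parquet files readable by both writers.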