Hi Wes: Somehow I have been inadvertently added to your list and am getting all these emails that make no sense to me at all. I'm in on some conversation I know nothing about and am getting up to 20 emails a day from different people. Can I ask you to remove me from your list and can you get all the other people in your group to remove me as well? Thanks!
Erin Sobkow, BA Kin, RMT
Community Consultant
Parkland Valley Sport, Culture & Recreation District
Box 263, Yorkton, SK S3N 2V7
Phone: (306) 786-6585
Fax: (306) 782-0474
Email: esob...@parklandvalley.ca
Website: www.parklandvalley.ca

If you no longer wish to receive electronic messages from Parkland Valley Sport, Culture & Recreation District please reply with the word 'STOP'.

Together...building healthy communities through sport, culture and recreation

-----Original Message-----
From: Wes McKinney [mailto:wesmck...@gmail.com]
Sent: August 16, 2017 10:04 AM
To: dev@arrow.apache.org
Subject: Re: Major difference between Spark and Arrow Parquet Implementations

hi Lucas,

My understanding is that the Parquet format by itself does not place any such restrictions on the names of fields, so this is a Spark SQL-specific issue (anyone please correct me if I'm mistaken about this). I would be happy to help add a schema-cleaning option to normalize field names for use in Spark. I just opened:

https://issues.apache.org/jira/browse/ARROW-1359

Thanks
Wes

On Wed, Aug 16, 2017 at 11:58 AM, Lucas Pickup <lucas.pic...@microsoft.com.invalid> wrote:
> Hello,
>
> I have been using pyarrow and PySpark to write Parquet files. I have used pyarrow to successfully write out a Parquet file with spaces in column names, e.g. 'X Coordinate'.
> When I try to write out the same dataset using Spark's Parquet writer, it fails, claiming:
> "Attribute name "X Coordinate" contains invalid character(s) among " ,;{}()\n\t="".
> It seems that, according to Spark's Parquet implementation, the characters above are not allowed in a Parquet schema because they carry special meaning.
> The code that checks this is here:
> https://github.com/apache/spark/blob/cba826d00173a945b0c9a7629c66e36fa73b723e/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L565
>
> I was wondering if there was a reason why the implementations have such a major difference when it comes to schema generation?
>
> Cheers,
> Lucas Pickup
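For anyone hitting the same error in the archives, here is a minimal sketch of the behavior Lucas describes, assuming a reasonably recent pyarrow and PySpark; the file names are placeholders:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # pyarrow places no restrictions on field names, so a space is fine here
    table = pa.Table.from_pydict({'X Coordinate': [1.0, 2.0, 3.0]})
    pq.write_table(table, 'coords.parquet')  # succeeds

    # Spark's Parquet writer, by contrast, rejects the same column name:
    #   spark.createDataFrame([(1.0,)], ['X Coordinate']) \
    #        .write.parquet('coords_spark.parquet')
    #   => AnalysisException: Attribute name "X Coordinate" contains
    #      invalid character(s) among " ,;{}()\n\t="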
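Until something like ARROW-1359 lands, the normalization Wes mentions can be approximated by renaming columns before the data reaches Spark. This is only a sketch: clean_name and the underscore-replacement policy are my own invention, not an Arrow API; Table.rename_columns assumes a recent pyarrow.

    import re

    # The characters Spark's check rejects, per the error message above
    _INVALID = re.compile(r'[ ,;{}()\n\t=]')

    def clean_name(name):
        # Hypothetical policy: replace each disallowed character with '_'
        return _INVALID.sub('_', name)

    # Rename the pyarrow table's columns before handing the data to Spark
    table = table.rename_columns([clean_name(c) for c in table.column_names])

The same cleanup could equally be done on the Spark side with df.withColumnRenamed for each offending column; doing it at the pyarrow layer just keeps the Parquet files readable by both writers.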