Hi Lucas,

My understanding is that the Parquet format itself does not place any such restrictions on field names, so this is a Spark SQL-specific issue (anyone please correct me if I'm mistaken about this). I would be happy to help add a schema-cleaning option to normalize field names for use in Spark. I just opened:
https://issues.apache.org/jira/browse/ARROW-1359

Thanks,
Wes

On Wed, Aug 16, 2017 at 11:58 AM, Lucas Pickup
<lucas.pic...@microsoft.com.invalid> wrote:
> Hello,
>
> I have been using pyarrow and PySpark to write Parquet files. I have used
> pyarrow to successfully write out a Parquet file with spaces in column
> names, e.g. 'X Coordinate'.
> When I try to write out the same dataset using Spark's Parquet writer, it
> fails, claiming:
> "Attribute name "X Coordinate" contains invalid character(s) among
> " ,;{}()\n\t="."
> It seems that according to Spark's Parquet implementation, the above
> characters are not allowed in a Parquet schema because they carry special
> meaning.
> The code that checks this is here:
> https://github.com/apache/spark/blob/cba826d00173a945b0c9a7629c66e36fa73b723e/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L565
>
> I was wondering if there was a reason why the implementations have such a
> major difference when it comes to schema generation?
>
> Cheers,
> Lucas Pickup
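
For anyone hitting this in the meantime, here is a minimal workaround sketch on the pyarrow side: rename the columns before writing so Spark will accept the file. The underscore replacement rule is just an illustrative choice, not necessarily what the ARROW-1359 option will end up doing.

import re

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Characters rejected by Spark's ParquetSchemaConverter: " ,;{}()\n\t="
INVALID_CHARS = re.compile(r'[ ,;{}()\n\t=]')

def sanitize_column_names(df):
    # Replace each Spark-incompatible character with an underscore.
    return df.rename(columns=lambda name: INVALID_CHARS.sub('_', name))

df = pd.DataFrame({'X Coordinate': [1.0, 2.0], 'Y Coordinate': [3.0, 4.0]})
table = pa.Table.from_pandas(sanitize_column_names(df))
pq.write_table(table, 'coords.parquet')  # columns become X_Coordinate, Y_Coordinate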