Re: Partition parquet data by ENUM column

2015-07-24 Thread Cheng Lian
Your guess is partly right. Firstly, Spark SQL doesn’t have a data type equivalent to Parquet BINARY (ENUM), and always falls back to the normal StringType. So in your case, Spark SQL just sees a StringType, which maps to Parquet BINARY (UTF8), but the underlying data type is BINARY (ENUM). Secon
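The fallback described above can be illustrated with a small toy model. This is only a sketch in plain Python, not Spark's or Parquet's actual code: `ParquetType`, `to_spark_sql_type` and `to_parquet_type` are hypothetical names made up for this illustration, and the model covers only the two annotated BINARY types mentioned in the message.

```python
from typing import NamedTuple, Optional


class ParquetType(NamedTuple):
    """Toy model of a Parquet primitive type plus its optional annotation."""
    physical: str
    annotation: Optional[str] = None


def to_spark_sql_type(pq: ParquetType) -> str:
    """Hypothetical read-side mapping: UTF8 and ENUM binaries both become StringType."""
    if pq.physical == "BINARY" and pq.annotation in ("UTF8", "ENUM"):
        return "StringType"
    raise ValueError(f"type not covered by this sketch: {pq}")


def to_parquet_type(spark_type: str) -> ParquetType:
    """Hypothetical reverse mapping: StringType corresponds to BINARY (UTF8)."""
    if spark_type == "StringType":
        return ParquetType("BINARY", "UTF8")
    raise ValueError(f"type not covered by this sketch: {spark_type}")


# The column as stored in the file: BINARY annotated as ENUM.
enum_column = ParquetType("BINARY", "ENUM")

# What Spark SQL sees after the fallback.
seen_by_spark = to_spark_sql_type(enum_column)

# What Spark SQL assumes the file column looks like, given StringType.
expected_by_spark = to_parquet_type(seen_by_spark)

# The assumed annotation (UTF8) differs from the stored one (ENUM).
annotation_mismatch = expected_by_spark != enum_column

print(seen_by_spark, expected_by_spark, annotation_mismatch)
```

The point of the sketch is that the mapping is lossy: once the ENUM annotation has collapsed into StringType, mapping back yields BINARY (UTF8) rather than the BINARY (ENUM) that is actually in the file.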

Re: Partition parquet data by ENUM column

2015-07-23 Thread Cheng Lian
Could you please provide the full stack trace of the exception? And what's the Git commit hash of the version you were using?
Cheng
On 7/24/15 6:37 AM, Jerry Lam wrote:
Hi Cheng, I ran into issues related to ENUM when I tried to use Filter push down. I'm using Spark 1.5.0 (which contains fix

Re: Partition parquet data by ENUM column

2015-07-23 Thread Jerry Lam
Hi Cheng, I ran into issues related to ENUM when I tried to use Filter push down. I'm using Spark 1.5.0 (which contains fixes for parquet filter push down). The exception is the following: java.lang.IllegalArgumentException: FilterPredicate column: item's declared type (org.apache.parquet.io.api.

Re: Partition parquet data by ENUM column

2015-07-21 Thread Cheng Lian
On 7/22/15 9:03 AM, Ankit wrote:
Thanks a lot Cheng. So it seems even in Spark 1.3 and 1.4, Parquet ENUMs were treated as Strings in Spark SQL, right? So does this mean partitioning for enums already works in previous versions too, since they are just treated as strings?
It’s a little bit comp

Re: Partition parquet data by ENUM column

2015-07-21 Thread Ankit
Thanks a lot Cheng. So it seems even in Spark 1.3 and 1.4, Parquet ENUMs were treated as Strings in Spark SQL, right? So does this mean partitioning for enums already works in previous versions too, since they are just treated as strings? Also, is there a good way to verify that the partitioning is

Re: Partition parquet data by ENUM column

2015-07-21 Thread Cheng Lian
Parquet support for Thrift/Avro/ProtoBuf ENUM types was just added to the master branch (https://github.com/apache/spark/pull/7048). ENUM types are actually not in the Parquet format spec, which is why we didn't have it in the first place. Basically, ENUMs are always treated as UTF8 strings in Spa