In our case we do not actually need partition inference, so the workaround was easy: instead of writing the paths as rootpath/batch_id=333/... we changed them to rootpath/333/.... This works for us because we compute the set of HDFS paths manually and then register a DataFrame with the SQLContext ourselves.
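Roughly, a minimal sketch of that kind of workaround; the paths and class name here are hypothetical, and it assumes the Spark 1.5 Java API:

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class ManualPathsExample {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("manual-parquet-paths");
    JavaSparkContext sc = new JavaSparkContext(conf);
    SQLContext sqlContext = new SQLContext(sc);

    // Hypothetical: the batch ids to read, computed by our own logic instead
    // of relying on Spark's partition discovery.
    List<String> paths = Arrays.asList(
        "hdfs://namenode/rootpath/333",
        "hdfs://namenode/rootpath/334");

    // Read the explicit list of directories. There is no "batch_id=..."
    // segment in the path, so Spark never infers a partition column type.
    DataFrame df = sqlContext.read().parquet(
        paths.toArray(new String[paths.size()]));

    // Register the result so it can be queried through the SQLContext.
    df.registerTempTable("batches");
    sqlContext.sql("SELECT count(*) FROM batches").show();
  }
}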
But it seems like there is a nicer solution, described at
http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery:

"Notice that the data types of the partitioning columns are automatically
inferred. Currently, numeric data types and string type are supported.
Sometimes users may not want to automatically infer the data types of the
partitioning columns. For these use cases, the automatic type inference can
be configured by spark.sql.sources.partitionColumnTypeInference.enabled,
which defaults to true. When type inference is disabled, string type will be
used for the partitioning columns."

On Sat, Oct 10, 2015 at 9:52 PM, shobhit gupta <smartsho...@gmail.com> wrote:

> Here is what df.schema.toString() prints:
>
> DF Schema is ::StructType(StructField(batch_id,StringType,true))
>
> I think you nailed the problem; this field is part of our HDFS file path.
> We have partitioned our data into folders by batch_id.
>
> How did you get around it?
>
> Thanks for the help. :)
>
> On Sat, Oct 10, 2015 at 7:55 AM, Yana Kadiyska <yana.kadiy...@gmail.com>
> wrote:
>
>> Can you show the output of df.printSchema? Just a guess, but I think I
>> ran into something similar with a column that was part of a path in
>> Parquet. E.g. we had an account_id in the Parquet file data itself which
>> was of type string, but we also named the files in the following manner:
>> /somepath/account_id=.../file.parquet. Since Spark uses the paths for
>> partition discovery, it was actually inferring that account_id is a
>> numeric type, and upon reading the data we ran into the exception you're
>> describing (this was in Spark 1.4).
>>
>> On Fri, Oct 9, 2015 at 7:55 PM, Abhisheks <smartsho...@gmail.com> wrote:
>>
>>> Hi there,
>>>
>>> I have saved my records in Parquet format and am using Spark 1.5. But
>>> when I try to fetch the columns it throws the exception
>>> java.lang.ClassCastException: java.lang.Long cannot be cast to
>>> org.apache.spark.unsafe.types.UTF8String.
>>>
>>> This field is saved as a String when writing the Parquet files, so here
>>> is the sample code and output for it:
>>>
>>> logger.info("troubling thing is ::" +
>>>     sqlContext.sql(fileSelectQuery).schema().toString());
>>> DataFrame df = sqlContext.sql(fileSelectQuery);
>>> JavaRDD<Row> rdd2 = df.toJavaRDD();
>>>
>>> The first line in the code (the logger) prints this:
>>> troubling thing is ::StructType(StructField(batch_id,StringType,true))
>>>
>>> But the moment after it, the exception comes up.
>>>
>>> Any idea why it is treating the field as a Long? (One unique thing about
>>> the column is that it is always a number, e.g. a timestamp.)
>>>
>>> Any help is appreciated.
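For reference, a minimal sketch of the configuration-based approach mentioned at the top of this reply, i.e. disabling partition column type inference so that a batch_id discovered from the path stays a string. The paths and class name are hypothetical, and it assumes the Spark 1.5 Java API:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class PartitionTypeInferenceExample {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("partition-type-inference-off");
    JavaSparkContext sc = new JavaSparkContext(conf);
    SQLContext sqlContext = new SQLContext(sc);

    // With inference disabled, partition columns discovered from paths such
    // as .../batch_id=333/... are kept as strings instead of being inferred
    // as numeric types.
    sqlContext.setConf(
        "spark.sql.sources.partitionColumnTypeInference.enabled", "false");

    // Hypothetical root path; Spark still discovers the batch_id partition
    // column from the directory names, but leaves its type as StringType.
    DataFrame df = sqlContext.read().parquet("hdfs://namenode/rootpath");
    df.printSchema();  // batch_id should now appear as string
  }
}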