On Sun, Aug 10, 2014 at 2:43 PM, Michael Armbrust <mich...@databricks.com> wrote:
>>> if I try to add hive-exec-0.12.0-cdh5.0.3.jar to my SPARK_CLASSPATH, in
>>> order to get DeprecatedParquetInputFormat, I find out that there is an
>>> incompatibility in the SerDeUtils class. Spark's Hive snapshot expects to
>>> find
>
> Instead of including CDH's version of Hive, I'd try just including the
> Hive jars for Parquet from here:
> http://mvnrepository.com/artifact/com.twitter/parquet-hive-bundle/1.5.0

This worked for me, thank you. In case someone else wishes to try it: I
un-jarred the spark-1.0.2 assembly, un-jarred the parquet-hive-bundle in the
same place, and re-jarred the whole thing back into an assembly. I was then
able to run it with PySpark on YARN. It is really nice to be able to
leverage the data partitioning through Hive.

Note that I had to use the Java 6 version of jar: something about the way
Java 7 creates jar files makes the Python code in the assembly
inaccessible. With Java 6's jar, all is well.

Thank you, Michael, for this suggestion. And Sean, I do value the effort
Cloudera puts into making this "real." At this stage I am evaluating
options, so it's helpful to be able to kick the tires, at scale, without
asking the SREs to undertake a full upgrade effort. Rest assured that we
will do that in due course.

Thanks again for sharing your insights.

Eric
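For anyone who wants to script the same unpack-and-repack step, here is a
minimal sketch in Python using the standard zipfile module (a jar is just a
zip archive). The jar file names in the commented call are illustrative
placeholders, not the exact names from the build described above:

```python
import zipfile

def merge_jars(out_path, *jar_paths):
    # A jar is a zip archive: read every entry from each input jar
    # (entries from later jars overwrite earlier ones with the same
    # name) and write the union into a single output jar.
    entries = {}
    for path in jar_paths:
        with zipfile.ZipFile(path) as zf:
            for name in zf.namelist():
                if not name.endswith("/"):  # skip bare directory entries
                    entries[name] = zf.read(name)
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as out:
        for name, data in sorted(entries.items()):
            out.writestr(name, data)

# Hypothetical file names -- substitute your actual assembly and bundle:
# merge_jars("spark-assembly-merged.jar",
#            "spark-assembly-1.0.2.jar",
#            "parquet-hive-bundle-1.5.0.jar")
```

This sidesteps the Java 7 jar-tool issue entirely, since zipfile writes a
plain zip layout; whether the resulting assembly behaves identically on
YARN is something you would want to verify in your own environment.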