On Sun, Aug 10, 2014 at 2:43 PM, Michael Armbrust <mich...@databricks.com>
wrote:

> if I try to add hive-exec-0.12.0-cdh5.0.3.jar to my SPARK_CLASSPATH, in
>> order to get DeprecatedParquetInputFormat, I find out that there is an
>> incompatibility in the SerDeUtils class.  Spark's Hive snapshot expects to
>> find
>
>
> Instead of including CDH's version of Hive, I'd try just including the
> Hive jars for Parquet from here:
> http://mvnrepository.com/artifact/com.twitter/parquet-hive-bundle/1.5.0
>
>
This worked for me, thank you.  In case someone else wishes to try it, I
un-jarred the spark-1.0.2 assembly and then un-jarred the
parquet-hive-bundle in the same place.  I then re-jarred the whole thing
back into an assembly and was able to run it with PySpark on YARN.  It is
really nice to be able to leverage the data partitioning through Hive.

Note that I had to use the Java 6 version of jar, after finding that
something in the way Java 7 creates jar files makes the Python code in the
assembly inaccessible.  But with Java 6's jar, all is well.
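For anyone curious what the merge amounts to: a jar is just a zip archive, so
un-jarring both archives into one directory and re-jarring is equivalent to
copying every entry from both jars into a single new archive.  The sketch below
shows that with Python's zipfile module; the file names and entries are
hypothetical stand-ins, not the real Spark assembly or parquet-hive-bundle
contents.

```python
import os
import tempfile
import zipfile

# Hypothetical names standing in for the real spark-1.0.2 assembly jar
# and parquet-hive-bundle-1.5.0.jar.
workdir = tempfile.mkdtemp()
assembly = os.path.join(workdir, "spark-assembly.jar")
bundle = os.path.join(workdir, "parquet-hive-bundle.jar")
merged = os.path.join(workdir, "merged-assembly.jar")

# Build two dummy "jars" with disjoint entries.
with zipfile.ZipFile(assembly, "w") as z:
    z.writestr("org/apache/spark/SparkContext.class", b"spark bytecode")
    z.writestr("pyspark/context.py", b"python source")
with zipfile.ZipFile(bundle, "w") as z:
    z.writestr("parquet/hive/DeprecatedParquetInputFormat.class", b"parquet bytecode")

# "Un-jar both in the same place, then re-jar": copy every entry from
# both archives into one.  Entries from the second archive would
# overwrite any duplicates, mirroring what extracting on top does.
with zipfile.ZipFile(merged, "w") as out:
    for src in (assembly, bundle):
        with zipfile.ZipFile(src) as z:
            for name in z.namelist():
                out.writestr(name, z.read(name))

with zipfile.ZipFile(merged) as z:
    merged_names = set(z.namelist())
print(sorted(merged_names))
```

Doing it with the jar tool itself (as I did) additionally rewrites the
manifest, which zipfile does not touch, so treat this only as an
illustration of the merge.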

Thank you, Michael, for this suggestion.

And Sean, I do value the effort Cloudera puts into making this "real."  At
this stage I am evaluating options and so it's helpful to me to be able to
kick the tires, at scale, without asking the SREs to undertake a full
upgrade effort.  Rest assured that we will do that in due course.  Thanks
again for sharing your insights.

Eric
