I have a CDH5.0.3 cluster with Hive tables written in Parquet.
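
For context, the tables were written with the parquet-hive bindings. A
minimal sketch of the DDL, with made-up table and column names (the SerDe
and output-format class names are the usual parquet-hive ones):

    hive -e "
      CREATE TABLE events (id BIGINT, payload STRING)
      ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
      STORED AS
        INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
        OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat';
    "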

So the tables have "DeprecatedParquetInputFormat" in their metadata, and
when I try to select from one using Spark SQL, it blows up with a stack
trace like this:

java.lang.RuntimeException: java.lang.ClassNotFoundException:
parquet.hive.DeprecatedParquetInputFormat
        at org.apache.hadoop.hive.ql.metadata.Table.getInputFormatClass(Table.java:309)
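
For reference, this is roughly how I hit it from a Spark build with Hive
support (table name made up to match the sketch above):

    spark-shell <<'EOF'
    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    // Any select on one of the Parquet tables triggers the ClassNotFoundException
    hiveContext.hql("SELECT * FROM events LIMIT 10").collect()
    EOF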


Fair enough: DeprecatedParquetInputFormat isn't in the Spark assembly
built with Hive.
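
That's easy to confirm against the assembly jar (substitute whatever path
your build produced):

    jar tf /path/to/spark-assembly.jar | grep DeprecatedParquetInputFormat
    # prints nothing -- the class isn't there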


If I try to add hive-exec-0.12.0-cdh5.0.3.jar to my SPARK_CLASSPATH, in
order to get DeprecatedParquetInputFormat, I run into an incompatibility
in the SerDeUtils class instead. Spark's Hive snapshot expects a method
that isn't there:


java.lang.NoSuchMethodError:
org.apache.hadoop.hive.serde2.SerDeUtils.lookupDeserializer(Ljava/lang/String;)Lorg/apache/hadoop/hive/serde2/Deserializer;
        at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:217)
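
For the record, this is roughly how I was adding the jar (the parcel path
is a guess at a standard CDH layout; adjust for your install):

    # Hypothetical location of CDH's hive-exec jar
    export SPARK_CLASSPATH=/opt/cloudera/parcels/CDH/lib/hive/lib/hive-exec-0.12.0-cdh5.0.3.jar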


But lookupDeserializer(String) isn't in the Hive jars that CDH5.0.3 provides.


Both Spark and CDH label their Hive versions as 0.12.0.


According to the Apache SVN server
<http://svn.apache.org/viewvc/hive/tags/release-0.12.0/serde/src/java/org/apache/hadoop/hive/serde2/SerDeUtils.java?revision=1532081&view=markup>,
CDH is the one that's out of step, as this method is definitely in the
0.12.0 release.  I have raised a ticket with Cloudera about this.


Has anyone found a workaround?


I did try extracting a subset of classes from hive-exec.jar and
repackaging them, but that quickly turned into a journey down the rabbit
hole.  A rough sketch of the attempt is below.
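
For completeness, this is the shape of what I tried (paths hypothetical;
the trouble is that the extracted classes presumably have further
dependencies inside hive-exec, which is where the rabbit hole starts):

    # Pull only the bundled parquet.hive classes out of CDH's hive-exec
    mkdir parquet-only && cd parquet-only
    unzip /path/to/hive-exec-0.12.0-cdh5.0.3.jar 'parquet/*'
    # Repackage them into a small jar to add to SPARK_CLASSPATH
    jar cf ../parquet-hive-subset.jar .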
