I have a CDH 5.0.3 cluster with Hive tables stored as Parquet. The tables have parquet.hive.DeprecatedParquetInputFormat set as the input format in their metadata, and when I try to select from one of them using Spark SQL, the query blows up.
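For reference, this is roughly how I'm triggering it; the table name here is hypothetical, but any Hive table that uses the deprecated Parquet input format reproduces the failure:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object ParquetRepro {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("parquet-repro"))
        val hive = new HiveContext(sc)

        // Any SELECT against a Parquet-backed Hive table triggers the
        // input-format class lookup that fails below.
        hive.hql("SELECT * FROM my_parquet_table LIMIT 10")
          .collect()
          .foreach(println)
      }
    }

The stack trace looks like this: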
    java.lang.RuntimeException: java.lang.ClassNotFoundException: parquet.hive.DeprecatedParquetInputFormat
        at org.apache.hadoop.hive.ql.metadata.Table.getInputFormatClass(Table.java:309)

Fair enough: DeprecatedParquetInputFormat isn't in the Spark assembly built with Hive. But if I add hive-exec-0.12.0-cdh5.0.3.jar to my SPARK_CLASSPATH in order to get DeprecatedParquetInputFormat, I run into an incompatibility in the SerDeUtils class instead. Spark's Hive snapshot expects SerDeUtils.lookupDeserializer(String) to exist, but that method isn't in the Hive snapshot provided by CDH 5.0.3:

    java.lang.NoSuchMethodError: org.apache.hadoop.hive.serde2.SerDeUtils.lookupDeserializer(Ljava/lang/String;)Lorg/apache/hadoop/hive/serde2/Deserializer;
        at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:217)

Both Spark and CDH label their Hive versions as 0.12.0. According to the Apache SVN server <http://svn.apache.org/viewvc/hive/tags/release-0.12.0/serde/src/java/org/apache/hadoop/hive/serde2/SerDeUtils.java?revision=1532081&view=markup>, CDH is the one that's out of step: the method is definitely present in the 0.12.0 release. I have raised a ticket with Cloudera about this.

Has anyone found a workaround? I did try extracting a subset of the classes from hive-exec.jar, but that quickly turned into a journey down the rabbit hole.