Oh, I found an explanation at
http://cmenguy.github.io/blog/2013/10/30/using-hive-with-parquet-format-in-cdh-4-dot-3/
The error here is a bit misleading; what it really means is that the class
parquet.hive.DeprecatedParquetOutputFormat isn’t in the classpath for Hive.
Sure enough, doing an ls /usr
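In case it helps anyone else, here is a minimal sketch of one way to get the missing classes onto the classpath from Spark's HiveContext. The jar path and version below are just guesses for a typical install; point it at wherever the parquet-hive bundle actually lives in your setup.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("parquet-hive-classpath"))
    val hiveContext = new HiveContext(sc)

    // The jar location below is an assumption; it should be whichever jar
    // contains parquet.hive.DeprecatedParquetOutputFormat in your installation.
    hiveContext.sql("ADD JAR /usr/lib/hive/lib/parquet-hive-bundle-1.5.0.jar")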
Hi,
Looks like the latest SparkSQL with Hive 0.12 has a bug in Parquet support.
I got the following exception:
org.apache.hadoop.hive.ql.parse.SemanticException: Output Format must
implement HiveOutputFormat, otherwise it should be either
IgnoreKeyTextOutputFormat or SequenceFileOutputFormat
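For reference, a minimal sketch of the kind of DDL where this surfaces for me. The table name and schema are placeholders, and the SerDe/input-format class names are my best guess at the matching old parquet-hive classes; only DeprecatedParquetOutputFormat is the one from the error above.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("parquet-table-repro"))
    val hiveContext = new HiveContext(sc)

    // Hive's semantic analysis raises the exception above when the output
    // format class can't be loaded or isn't recognized as a HiveOutputFormat.
    hiveContext.sql("""
      CREATE TABLE IF NOT EXISTS events_parquet (id INT, payload STRING)
      ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
      STORED AS
        INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
        OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'
    """)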
It works fine, thanks for the help Michael.
Liancheng also told me a trick: use a subquery with LIMIT n. It works in
the latest 1.2.0.
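In case it helps anyone searching the archives, here is a rough sketch of how I understand the trick. All table and column names are made up, and it assumes a HiveContext against 1.2.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("limit-subquery-trick"))
    val hiveContext = new HiveContext(sc)

    // Wrapping the small dimension table in a subquery with LIMIT gives the
    // planner a size estimate for that side, so it can choose a broadcast join.
    val joined = hiveContext.sql("""
      SELECT f.*, d.dim_name
      FROM fact_table f
      JOIN (SELECT dim_id, dim_name FROM dim_table LIMIT 100000) d
        ON f.dim_id = d.dim_id
    """)

    // Print the logical and physical plans; look for a broadcast join on the small side.
    println(joined.queryExecution)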
BTW, looks like the broadcast optimization won't be recognized if I do a
left join instead of an inner join. Is that true? How can I make it work for
left joins?
Che
Thanks for the input. We purposefully made sure that the config option did
not make it into a release as it is not something that we are willing to
support long term. That said, we'll try to make this easier in the future
either through hints or better support for statistics.
In this particular
OK, currently there's cost-based optimization, but Parquet statistics are
not implemented...
What's a good way to join a big fact table with several tiny
dimension tables in Spark SQL (1.1)?
I wish we could allow user hints for the join.
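For concreteness, here is roughly the shape of what I'm trying. Table and column names and the threshold value are just placeholders.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("fact-dim-join"))
    val hiveContext = new HiveContext(sc)

    // Raise the auto-broadcast threshold (the value here is arbitrary) so any
    // dimension table whose estimated size falls under it gets broadcast rather
    // than shuffled. This only helps when Spark SQL actually has a size estimate,
    // which is the missing piece for Parquet tables mentioned above.
    hiveContext.setConf("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString)

    val result = hiveContext.sql("""
      SELECT f.*, a.attr_a, b.attr_b
      FROM fact f
      JOIN dim_a a ON f.a_id = a.id
      JOIN dim_b b ON f.b_id = b.id
    """)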
Jianshi
On Wed, Oct 8, 2014 at 2:18 PM, Jiansh
Looks like https://issues.apache.org/jira/browse/SPARK-1800 is not merged
into master?
I cannot find spark.sql.hints.broadcastTables in the latest master, but it's in
the following patch:
https://github.com/apache/spark/commit/76ca4341036b95f71763f631049fdae033990ab5
Jianshi
On Mon, Sep 29, 2014