On 6/9/15 8:42 AM, James Pirz wrote:
Thanks for the help!
I am actually trying to use Spark SQL to run queries against tables that I've
defined in Hive.
I follow these steps:
- I start hiveserver2, and on the Spark side I start Spark's Thrift server with:
$SPARK_HOME/sbin/start-thriftserver.sh --master
spark://spark-master-node-ip:7077
- and I start beeline:
$SPARK_HOME/bin/beeline
- In my beeline session, I connect to my running hiveserver2:
!connect jdbc:hive2://hive-node-ip:10000
and I can run queries successfully. However, based on the hiveserver2 logs, it
seems the queries are actually executed by Hadoop MapReduce, *not* by Spark's
workers. My goal is to access the data in my Hive tables, but to run the
queries through Spark SQL using Spark workers (not Hadoop).
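(For reference, if the Spark Thrift Server started above is listening on its
default port 10000 on the master node, I assume pointing beeline at it instead
would look like the following; the host name is the same placeholder as above.)
!connect jdbc:hive2://spark-master-node-ip:10000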
Hm, interesting. HiveThriftServer2 should never issue MR jobs to perform
queries. I did receive two reports in the past which also said that MR jobs,
rather than Spark jobs, were issued to perform the SQL query. However, I was
only able to reproduce this issue in a rare corner case, which uses HTTP mode
to connect to Hive 0.12.0. Apparently that isn't your case. Would you mind
providing more details so that I can dig in? The following information would
be very helpful:
1. Hive version
2. A copy of your hive-site.xml
3. Hadoop version
4. Full HiveThriftServer2 log (which can be found in $SPARK_HOME/logs)
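If it helps, assuming hive and hadoop are on your PATH and HIVE_HOME points to
your Hive installation, something like the following should collect items 1
through 4:
hive --version
cat $HIVE_HOME/conf/hive-site.xml
hadoop version
ls $SPARK_HOME/logs    # the Thrift server log should be the file whose name mentions HiveThriftServer2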
Thanks in advance!
Is it possible to do that via Spark SQL (its CLI) or through its
Thrift server? (I tried to find some basic examples in the
documentation, but I was not able to find any.) Any suggestion or hint on how
I can do that would be highly appreciated.
Thnx
On Sun, Jun 7, 2015 at 6:39 AM, Cheng Lian <lian.cs....@gmail.com> wrote:
On 6/6/15 9:06 AM, James Pirz wrote:
I am pretty new to Spark. Using Spark 1.3.1, I am trying to
use Spark SQL to run some SQL scripts on the cluster. I
realized that for better performance it is a good idea to use
Parquet files. I have 2 questions regarding that:
1) If I want to use Spark SQL against *partitioned & bucketed*
tables with Parquet format in Hive, does the Spark binary provided
on the Apache website support that, or do I need to build a
new Spark binary with some additional flags? (I found a note
<https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables> in
the documentation about enabling Hive support, but I could not
fully work out what the correct way of building is, if I do need to
build.)
Yes, Hive support is enabled by default now for the binaries on
the website. However, Spark SQL doesn't support bucketed tables yet.
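As a rough sketch of what that looks like in practice (the table name,
partition column, and value below are just placeholders, and this assumes
your hive-site.xml is in $SPARK_HOME/conf so Spark SQL can find your
metastore), you can point the Spark SQL CLI at your cluster and query a
partitioned Parquet table directly; partition pruning kicks in when you
filter on the partition column:
$SPARK_HOME/bin/spark-sql --master spark://spark-master-node-ip:7077
spark-sql> SELECT count(*) FROM my_parquet_table WHERE dt = '2015-06-01';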
2) Does running Spark SQL against tables in Hive degrade the
performance? Is it better to load Parquet files directly
into HDFS, or is having Hive in the picture harmless?
If you're using Parquet, then it should be fine since by default
Spark SQL uses its own native Parquet support to read Parquet Hive
tables.
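In case you want to double-check, this behavior is controlled by the
spark.sql.hive.convertMetastoreParquet option, which defaults to true; you can
inspect or override it from a SQL session, for example:
SET spark.sql.hive.convertMetastoreParquet=true;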
Thnx