First, sorry for the long post. Back to Tableau and Spark SQL: I'm still missing something.
TL;DR: To get a Spark SQL temp table associated with the metastore, are there additional steps required beyond doing the below as the Initial SQL on the connection?

create temporary table test
using org.apache.spark.sql.json
options (path '/data/json/*');
cache table test;

I feel like I'm missing a step that associates the Spark SQL table with the metastore. Do I need to actually save it in some fashion? I'm trying to avoid saving to Hive if possible.

Details:

I configured hive-site.xml and placed it in $SPARK_HOME/conf. It looks like this (thanks Andrew and Arush for the assistance):

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hive.semantic.analyzer.factory.impl</name>
    <value>org.apache.hcatalog.cli.HCatSemanticAnalyzerFactory</value>
  </property>
  <property>
    <name>hive.metastore.sasl.enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.server2.authentication</name>
    <value>NONE</value>
  </property>
  <property>
    <name>hive.server2.enable.doAs</name>
    <value>true</value>
  </property>
  <!--
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://localhost:9083</value>
    <description>IP address (or fully-qualified domain name) and port of the metastore host</description>
  </property>
  -->
  <property>
    <name>hive.warehouse.subdir.inherit.perms</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/metastore_db?createDatabaseIfNotExist=true</value>
    <description>metadata is stored in a MySQL server</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>MySQL JDBC driver class</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hiveuser</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hiveuser</value>
  </property>
</configuration>

When I start the server it looks fine:

$ ./sbin/start-thriftserver.sh \
    --hiveconf hive.server2.thrift.port=10001 \
    --hiveconf hive.server2.thrift.bind.host=radtech.io \
    --master spark://radtech.io:7077 \
    --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar
starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2, logging to /usr/local/spark-1.2.1-bin-hadoop2.4/logs/spark-tnist-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-radtech.io.out

radtech:spark tnist$ tail -f logs/spark-tnist-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-radtech.io.out
15/02/11 19:15:24 INFO SparkDeploySchedulerBackend: Granted executor ID app-20150211191524-0008/1 on hostPort 192.168.1.2:50851 with 2 cores, 512.0 MB RAM
15/02/11 19:15:24 INFO AppClient$ClientActor: Executor updated: app-20150211191524-0008/0 is now LOADING
15/02/11 19:15:24 INFO AppClient$ClientActor: Executor updated: app-20150211191524-0008/1 is now LOADING
15/02/11 19:15:24 INFO AppClient$ClientActor: Executor updated: app-20150211191524-0008/0 is now RUNNING
15/02/11 19:15:24 INFO AppClient$ClientActor: Executor updated: app-20150211191524-0008/1 is now RUNNING
15/02/11 19:15:24 INFO NettyBlockTransferService: Server created on 50938
15/02/11 19:15:24 INFO BlockManagerMaster: Trying to register BlockManager
15/02/11 19:15:24 INFO BlockManagerMasterActor: Registering block manager 192.168.1.2:50938 with 265.1 MB RAM, BlockManagerId(<driver>, 192.168.1.2, 50938)
15/02/11 19:15:24 INFO BlockManagerMaster: Registered BlockManager
15/02/11 19:15:25 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
15/02/11 19:15:25 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/02/11 19:15:25 INFO ObjectStore: ObjectStore, initialize called
15/02/11 19:15:26 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
15/02/11 19:15:26 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
15/02/11 19:15:26 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/02/11 19:15:26 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/02/11 19:15:27 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
15/02/11 19:15:28 INFO SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@192.168.1.2:50944/user/Executor#1008909571] with ID 0
15/02/11 19:15:28 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
15/02/11 19:15:28 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
15/02/11 19:15:28 INFO SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@192.168.1.2:50948/user/Executor#-688434541] with ID 1
15/02/11 19:15:28 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
15/02/11 19:15:28 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
15/02/11 19:15:28 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
15/02/11 19:15:28 INFO ObjectStore: Initialized ObjectStore
15/02/11 19:15:28 INFO BlockManagerMasterActor: Registering block manager 192.168.1.2:50951 with 265.1 MB RAM, BlockManagerId(0, 192.168.1.2, 50951)
15/02/11 19:15:28 INFO BlockManagerMasterActor: Registering block manager 192.168.1.2:50952 with 265.1 MB RAM, BlockManagerId(1, 192.168.1.2, 50952)
15/02/11 19:15:28 INFO HiveMetaStore: Added admin role in metastore
15/02/11 19:15:28 INFO HiveMetaStore: Added public role in metastore
15/02/11 19:15:29 INFO HiveMetaStore: No user is added in admin role, since config is empty
15/02/11 19:15:29 INFO SessionState: No Tez session required at this point. hive.execution.engine=mr.
15/02/11 19:15:29 INFO AbstractService: HiveServer2: Async execution pool size 100
15/02/11 19:15:29 INFO AbstractService: Service:OperationManager is inited.
15/02/11 19:15:29 INFO AbstractService: Service: SessionManager is inited.
15/02/11 19:15:29 INFO AbstractService: Service: CLIService is inited.
15/02/11 19:15:29 INFO AbstractService: Service:ThriftBinaryCLIService is inited.
15/02/11 19:15:29 INFO AbstractService: Service: HiveServer2 is inited.
15/02/11 19:15:29 INFO AbstractService: Service:OperationManager is started.
15/02/11 19:15:29 INFO AbstractService: Service:SessionManager is started.
15/02/11 19:15:29 INFO AbstractService: Service:CLIService is started.
15/02/11 19:15:29 INFO HiveMetaStore: No user is added in admin role, since config is empty
15/02/11 19:15:29 INFO HiveMetaStore: 0: get_databases: default
15/02/11 19:15:29 INFO audit: ugi=tnist ip=unknown-ip-addr cmd=get_databases: default
15/02/11 19:15:29 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/02/11 19:15:29 INFO ObjectStore: ObjectStore, initialize called
15/02/11 19:15:29 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
15/02/11 19:15:29 INFO ObjectStore: Initialized ObjectStore
15/02/11 19:15:29 INFO AbstractService: Service:ThriftBinaryCLIService is started.
15/02/11 19:15:29 INFO AbstractService: Service:HiveServer2 is started.
15/02/11 19:15:29 INFO HiveThriftServer2: HiveThriftServer2 started
15/02/11 19:15:29 INFO ThriftCLIService: ThriftBinaryCLIService listening on radtech.io/192.168.1.2:10001

When I start Tableau and use the SparkSQL (Beta) connector, I can see that the "Initial SQL" is being executed:

15/02/11 19:25:35 INFO HiveMetaStore: 2: get_database: default
15/02/11 19:25:35 INFO audit: ugi=anonymous ip=unknown-ip-addr cmd=get_database: default
15/02/11 19:25:35 INFO Driver: OK
15/02/11 19:25:35 INFO SparkExecuteStatementOperation: Running query 'set -v'
15/02/11 19:25:35 INFO SparkExecuteStatementOperation: Result Schema: List(#18)
15/02/11 19:25:35 INFO SparkExecuteStatementOperation: Result Schema: List(#21)
15/02/11 19:25:35 INFO ThriftCLIService: Client protocol version: HIVE_CLI_SERVICE_PROTOCOL_V6
15/02/11 19:25:35 INFO HiveMetaStore: No user is added in admin role, since config is empty
15/02/11 19:25:35 INFO SessionState: No Tez session required at this point. hive.execution.engine=mr.
15/02/11 19:25:35 INFO SessionState: No Tez session required at this point. hive.execution.engine=mr.
15/02/11 19:25:35 INFO SparkExecuteStatementOperation: Running query 'use `default`'
15/02/11 19:25:35 INFO ParseDriver: Parsing command: use `default`
15/02/11 19:25:35 INFO ParseDriver: Parse Completed
15/02/11 19:25:35 INFO SessionState: No Tez session required at this point. hive.execution.engine=mr.
15/02/11 19:25:35 INFO Driver: Concurrency mode is disabled, not creating a lock manager
15/02/11 19:25:35 INFO ParseDriver: Parsing command: use `default`
15/02/11 19:25:35 INFO ParseDriver: Parse Completed
15/02/11 19:25:35 INFO Driver: Semantic Analysis Completed
15/02/11 19:25:35 INFO Driver: Returning Hive schema: Schema(fieldSchemas:null, properties:null)
15/02/11 19:25:35 INFO Driver: Starting command: use `default`
15/02/11 19:25:35 INFO HiveMetaStore: 3: get_database: default
15/02/11 19:25:35 INFO audit: ugi=anonymous ip=unknown-ip-addr cmd=get_database: default
15/02/11 19:25:35 INFO HiveMetaStore: 3: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/02/11 19:25:35 INFO ObjectStore: ObjectStore, initialize called
15/02/11 19:25:36 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
15/02/11 19:25:36 INFO ObjectStore: Initialized ObjectStore
15/02/11 19:25:36 INFO HiveMetaStore: 3: get_database: default
15/02/11 19:25:36 INFO audit: ugi=anonymous ip=unknown-ip-addr cmd=get_database: default
15/02/11 19:25:36 INFO Driver: OK
15/02/11 19:25:36 INFO SparkExecuteStatementOperation: Running query 'create temporary table test using org.apache.spark.sql.json options (path '/data/json/*')'
....
15/02/11 19:25:38 INFO Driver: Starting command: use `default`
15/02/11 19:25:38 INFO HiveMetaStore: 4: get_database: default
15/02/11 19:25:38 INFO audit: ugi=anonymous ip=unknown-ip-addr cmd=get_database: default
15/02/11 19:25:38 INFO HiveMetaStore: 4: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/02/11 19:25:38 INFO ObjectStore: ObjectStore, initialize called
15/02/11 19:25:38 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
15/02/11 19:25:38 INFO ObjectStore: Initialized ObjectStore
15/02/11 19:25:38 INFO HiveMetaStore: 4: get_database: default
15/02/11 19:25:38 INFO audit: ugi=anonymous ip=unknown-ip-addr cmd=get_database: default
15/02/11 19:25:38 INFO Driver: OK
15/02/11 19:25:38 INFO SparkExecuteStatementOperation: Running query 'cache table test'
15/02/11 19:25:38 INFO MemoryStore: ensureFreeSpace(211383) called with curMem=101514, maxMem=278019440
15/02/11 19:25:38 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 206.4 KB, free 264.8 MB)

However, I see no way in Tableau to see the cached table "test". I think I am missing a step that associates the temp table generated by Spark SQL with the metastore. Any guidance or insights on what I'm missing here would be appreciated.

Thanks for the assistance.

-Todd
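P.S. In case it helps clarify what I'm after, below is roughly what I was planning to try next if the Initial SQL route turns out to be a dead end: build the table in a HiveContext myself and either share that context with the thrift server or persist the table into the metastore. This is only a sketch based on my reading of the 1.2 API (jsonFile, registerTempTable, cacheTable, saveAsTable, and HiveThriftServer2.startWithContext); the object/app names are made up and I have not run this, so treat it as an assumption rather than something working.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

object JsonThriftServer {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("json-thrift"))
    val hive = new HiveContext(sc)

    // Same JSON files as in the Initial SQL above.
    val json = hive.jsonFile("/data/json/*")

    // Option 1: keep it as a cached temp table, but start the thrift
    // server on this same HiveContext so the JDBC/Tableau sessions share it.
    json.registerTempTable("test")
    hive.cacheTable("test")
    HiveThriftServer2.startWithContext(hive)

    // Option 2: persist it as a real metastore table instead - this is the
    // "save to hive" step I was hoping to avoid.
    // json.saveAsTable("test")
  }
}

If startWithContext isn't actually usable here, then I suppose option 2 (writing it out as a real Hive table) is the fallback, which is exactly the step I was hoping to avoid.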
On Wed, Feb 11, 2015 at 3:20 PM, Andrew Lee <alee...@hotmail.com> wrote:

> Sorry folks, it is executing Spark jobs instead of Hive jobs. I misread
> the logs since there were other activities going on on the cluster.
>
> ------------------------------
> From: alee...@hotmail.com
> To: ar...@sigmoidanalytics.com; tsind...@gmail.com
> CC: user@spark.apache.org
> Subject: RE: SparkSQL + Tableau Connector
> Date: Wed, 11 Feb 2015 11:56:44 -0800
>
> I'm using mysql as the metastore DB with Spark 1.2.
> I simply copied the hive-site.xml to /etc/spark/ and added the mysql JDBC
> JAR to spark-env.sh in /etc/spark/, and everything works fine now.
>
> My setup looks like this:
>
> Tableau => Spark ThriftServer2 => HiveServer2
>
> It's talking to Tableau Desktop 8.3. Interestingly, when I query a Hive
> table, it still invokes Hive queries to HiveServer2, which is running the
> MR or Tez engine. Is this expected?
>
> I thought it would at least use the Catalyst engine and talk to the
> underlying HDFS to pull the data into an RDD, like the HiveContext API
> does. Did I misunderstand the purpose of Spark ThriftServer2?
>
> ------------------------------
> Date: Wed, 11 Feb 2015 16:07:40 +0530
> Subject: Re: SparkSQL + Tableau Connector
> From: ar...@sigmoidanalytics.com
> To: tsind...@gmail.com
> CC: user@spark.apache.org
>
> Hi,
>
> I used this, though it's using an embedded driver and is not a good
> approach. It works. You can configure some other metastore type as well;
> I have not tried the metastore URIs.
>
> <configuration>
>   <property>
>     <name>javax.jdo.option.ConnectionURL</name>
>     <value>jdbc:derby:;databaseName=/opt/bigdata/spark-1.2.0/metastore_db;create=true</value>
>     <description>URL for the DB</description>
>   </property>
>
>   <property>
>     <name>javax.jdo.option.ConnectionDriverName</name>
>     <value>org.apache.derby.jdbc.EmbeddedDriver</value>
>   </property>
>
>   <!-- <property>
>     <name>hive.metastore.uris</name>
>     <value>thrift://x.x.x.x:10000 <http://172.17.1.172:10000/></value>
>     <description>IP address (or fully-qualified domain name) and port of
>     the metastore host</description>
>   </property> -->
> </configuration>
>
> On Wed, Feb 11, 2015 at 3:59 PM, Todd Nist <tsind...@gmail.com> wrote:
>
> Hi Arush,
>
> So yes, I want to create the tables through Spark SQL. I have placed the
> hive-site.xml file inside the $SPARK_HOME/conf directory; I thought that
> was all I should need to do to have the thriftserver use it. Perhaps my
> hive-site.xml is wrong. It currently looks like this:
>
> <configuration>
>   <property>
>     <name>hive.metastore.uris</name>
>     <!-- Ensure that the following statement points to the Hive Metastore
>          URI in your cluster -->
>     <value>thrift://sandbox.hortonworks.com:9083</value>
>     <description>URI for client to contact metastore server</description>
>   </property>
> </configuration>
>
> Which leads me to believe it is going to pull from the thriftserver from
> Horton? I will go look at the docs to see if this is right; it is what
> Horton says to do. Do you have an example hive-site.xml by chance that
> works with Spark SQL?
>
> I am using 8.3 of Tableau with the SparkSQL Connector.
>
> Thanks for the assistance.
>
> -Todd
>
> On Wed, Feb 11, 2015 at 2:34 AM, Arush Kharbanda <
> ar...@sigmoidanalytics.com> wrote:
>
> BTW, what Tableau connector are you using?
>
> On Wed, Feb 11, 2015 at 12:55 PM, Arush Kharbanda <
> ar...@sigmoidanalytics.com> wrote:
>
> I am a little confused here: why do you want to create the tables in
> Hive? You want to create the tables in Spark SQL, right?
>
> If you are not able to find the same tables through Tableau, then thrift
> is connecting to a different metastore than your spark-shell.
>
> One way to specify a metastore to thrift is to provide the path to
> hive-site.xml while starting thrift, using --files hive-site.xml.
>
> Similarly, you can specify the same metastore to your spark-submit or
> spark-shell using the same option.
>
> On Wed, Feb 11, 2015 at 5:23 AM, Todd Nist <tsind...@gmail.com> wrote:
>
> Arush,
>
> As for #2, do you mean something like this from the docs:
>
> // sc is an existing SparkContext.
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
>
> sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
> sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
>
> // Queries are expressed in HiveQL
> sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)
>
> Or did you have something else in mind?
>
> -Todd
>
> On Tue, Feb 10, 2015 at 6:35 PM, Todd Nist <tsind...@gmail.com> wrote:
>
> Arush,
>
> Thank you, I will take a look at that approach in the morning. I sort of
> figured the answer to #1 was NO and that I would need to do 2 and 3;
> thanks for clarifying it for me.
>
> -Todd
>
> On Tue, Feb 10, 2015 at 5:24 PM, Arush Kharbanda <
> ar...@sigmoidanalytics.com> wrote:
>
> 1. Can the connector fetch or query schemaRDDs saved to Parquet or JSON
> files? NO
> 2. Do I need to do something to expose these via hive / metastore other
> than creating a table in hive? Create a table in Spark SQL to expose it
> via Spark SQL.
> 3. Does the thriftserver need to be configured to expose these in some
> fashion, sort of related to question 2? You would need to configure
> thrift to read from the metastore you expect it to read from; by default
> it reads from the metastore_db directory present in the directory used to
> launch the thrift server.
>
> On 11 Feb 2015 01:35, "Todd Nist" <tsind...@gmail.com> wrote:
>
> Hi,
>
> I'm trying to understand how and what the Tableau connector to SparkSQL
> is able to access. My understanding is that it needs to connect to the
> thriftserver, and I am not sure how or if it exposes parquet, json, and
> schemaRDDs, or whether it only exposes schemas defined in the metastore /
> hive.
>
> For example, I do the following from the spark-shell, which generates a
> schemaRDD from a csv file and saves it as a JSON file as well as a
> parquet file.
>
> import org.apache.spark.sql.SQLContext
> import com.databricks.spark.csv._
>
> val sqlContext = new SQLContext(sc)
> val test = sqlContext.csvFile("/data/test.csv")
> test.toJSON().saveAsTextFile("/data/out")
> test.saveAsParquetFile("/data/out")
>
> When I connect from Tableau, the only thing I see is the "default" schema
> and nothing in the tables section.
>
> So my questions are:
>
> 1. Can the connector fetch or query schemaRDDs saved to Parquet or JSON
> files?
> 2. Do I need to do something to expose these via hive / metastore other
> than creating a table in hive?
> 3. Does the thriftserver need to be configured to expose these in some
> fashion, sort of related to question 2.
>
> TIA for the assistance.
>
> -Todd
>
>
> --
>
> [image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>
>
> *Arush Kharbanda* || Technical Teamlead
>
> ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
>
>
> --
>
> [image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>
>
> *Arush Kharbanda* || Technical Teamlead
>
> ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
>
>
> --
>
> [image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>
>
> *Arush Kharbanda* || Technical Teamlead
>
> ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
>