First, sorry for the long post. Back to Tableau and Spark SQL: I'm still missing something.
TL;DR: To get a Spark SQL temp table associated with the metastore, are there additional steps required beyond doing the below as the Initial SQL on the connection?

create temporary table test
using org.apache.spark.sql.json
options (path '/data/json/*');
cache table test;

I feel like I'm missing a step that associates the Spark SQL table with the metastore. Do I need to actually save it in some fashion? I'm trying to avoid saving to Hive if possible.

Details:

I configured hive-site.xml and placed it in $SPARK_HOME/conf. It looks like this (thanks Andrew and Arush for the assistance):

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hive.semantic.analyzer.factory.impl</name>
    <value>org.apache.hcatalog.cli.HCatSemanticAnalyzerFactory</value>
  </property>
  <property>
    <name>hive.metastore.sasl.enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.server2.authentication</name>
    <value>NONE</value>
  </property>
  <property>
    <name>hive.server2.enable.doAs</name>
    <value>true</value>
  </property>
  <!--
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://localhost:9083</value>
    <description>IP address (or fully-qualified domain name) and port of the metastore host</description>
  </property>
  -->
  <property>
    <name>hive.warehouse.subdir.inherit.perms</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/metastore_db?createDatabaseIfNotExist=true</value>
    <description>metadata is stored in a MySQL server</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>MySQL JDBC driver class</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hiveuser</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hiveuser</value>
  </property>
</configuration>

When I start the server it looks fine:

$ ./sbin/start-thriftserver.sh \
    --hiveconf hive.server2.thrift.port=10001 \
    --hiveconf hive.server2.thrift.bind.host=radtech.io \
    --master spark://radtech.io:7077 \
    --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar
starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2, logging to /usr/local/spark-1.2.1-bin-hadoop2.4/logs/spark-tnist-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-radtech.io.out

radtech:spark tnist$ tail -f logs/spark-tnist-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-radtech.io.out
15/02/11 19:15:24 INFO SparkDeploySchedulerBackend: Granted executor ID app-20150211191524-0008/1 on hostPort 192.168.1.2:50851 with 2 cores, 512.0 MB RAM
15/02/11 19:15:24 INFO AppClient$ClientActor: Executor updated: app-20150211191524-0008/0 is now LOADING
15/02/11 19:15:24 INFO AppClient$ClientActor: Executor updated: app-20150211191524-0008/1 is now LOADING
15/02/11 19:15:24 INFO AppClient$ClientActor: Executor updated: app-20150211191524-0008/0 is now RUNNING
15/02/11 19:15:24 INFO AppClient$ClientActor: Executor updated: app-20150211191524-0008/1 is now RUNNING
15/02/11 19:15:24 INFO NettyBlockTransferService: Server created on 50938
15/02/11 19:15:24 INFO BlockManagerMaster: Trying to register BlockManager
15/02/11 19:15:24 INFO BlockManagerMasterActor: Registering block manager 192.168.1.2:50938 with 265.1 MB RAM, BlockManagerId(<driver>, 192.168.1.2, 50938)
15/02/11 19:15:24 INFO BlockManagerMaster: Registered BlockManager
15/02/11 19:15:25 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
15/02/11 19:15:25 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/02/11 19:15:25 INFO ObjectStore: ObjectStore, initialize called
15/02/11 19:15:26 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
15/02/11 19:15:26 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
15/02/11 19:15:26 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/02/11 19:15:26 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/02/11 19:15:27 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
15/02/11 19:15:28 INFO SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@192.168.1.2:50944/user/Executor#1008909571] with ID 0
15/02/11 19:15:28 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
15/02/11 19:15:28 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
15/02/11 19:15:28 INFO SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@192.168.1.2:50948/user/Executor#-688434541] with ID 1
15/02/11 19:15:28 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
15/02/11 19:15:28 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
15/02/11 19:15:28 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
15/02/11 19:15:28 INFO ObjectStore: Initialized ObjectStore
15/02/11 19:15:28 INFO BlockManagerMasterActor: Registering block manager 192.168.1.2:50951 with 265.1 MB RAM, BlockManagerId(0, 192.168.1.2, 50951)
15/02/11 19:15:28 INFO BlockManagerMasterActor: Registering block manager 192.168.1.2:50952 with 265.1 MB RAM, BlockManagerId(1, 192.168.1.2, 50952)
15/02/11 19:15:28 INFO HiveMetaStore: Added admin role in metastore
15/02/11 19:15:28 INFO HiveMetaStore: Added public role in metastore
15/02/11 19:15:29 INFO HiveMetaStore: No user is added in admin role, since config is empty
15/02/11 19:15:29 INFO SessionState: No Tez session required at this point. hive.execution.engine=mr.
15/02/11 19:15:29 INFO AbstractService: HiveServer2: Async execution pool size 100
15/02/11 19:15:29 INFO AbstractService: Service:OperationManager is inited.
15/02/11 19:15:29 INFO AbstractService: Service: SessionManager is inited.
15/02/11 19:15:29 INFO AbstractService: Service: CLIService is inited.
15/02/11 19:15:29 INFO AbstractService: Service:ThriftBinaryCLIService is inited.
15/02/11 19:15:29 INFO AbstractService: Service: HiveServer2 is inited.
15/02/11 19:15:29 INFO AbstractService: Service:OperationManager is started.
15/02/11 19:15:29 INFO AbstractService: Service:SessionManager is started.
15/02/11 19:15:29 INFO AbstractService: Service:CLIService is started.
15/02/11 19:15:29 INFO HiveMetaStore: No user is added in admin role, since config is empty
15/02/11 19:15:29 INFO HiveMetaStore: 0: get_databases: default
15/02/11 19:15:29 INFO audit: ugi=tnist ip=unknown-ip-addr cmd=get_databases: default
15/02/11 19:15:29 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/02/11 19:15:29 INFO ObjectStore: ObjectStore, initialize called
15/02/11 19:15:29 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
15/02/11 19:15:29 INFO ObjectStore: Initialized ObjectStore
15/02/11 19:15:29 INFO AbstractService: Service:ThriftBinaryCLIService is started.
15/02/11 19:15:29 INFO AbstractService: Service:HiveServer2 is started.
15/02/11 19:15:29 INFO HiveThriftServer2: HiveThriftServer2 started
15/02/11 19:15:29 INFO ThriftCLIService: ThriftBinaryCLIService listening on radtech.io/192.168.1.2:10001

When I start Tableau and use the SparkSQL (Beta) connector, I can see that the "Initial SQL" is being executed:

15/02/11 19:25:35 INFO HiveMetaStore: 2: get_database: default
15/02/11 19:25:35 INFO audit: ugi=anonymous ip=unknown-ip-addr cmd=get_database: default
15/02/11 19:25:35 INFO Driver: OK
15/02/11 19:25:35 INFO SparkExecuteStatementOperation: Running query 'set -v'
15/02/11 19:25:35 INFO SparkExecuteStatementOperation: Result Schema: List(#18)
15/02/11 19:25:35 INFO SparkExecuteStatementOperation: Result Schema: List(#21)
15/02/11 19:25:35 INFO ThriftCLIService: Client protocol version: HIVE_CLI_SERVICE_PROTOCOL_V6
15/02/11 19:25:35 INFO HiveMetaStore: No user is added in admin role, since config is empty
15/02/11 19:25:35 INFO SessionState: No Tez session required at this point. hive.execution.engine=mr.
15/02/11 19:25:35 INFO SessionState: No Tez session required at this point. hive.execution.engine=mr.
15/02/11 19:25:35 INFO SparkExecuteStatementOperation: Running query 'use `default`'
15/02/11 19:25:35 INFO ParseDriver: Parsing command: use `default`
15/02/11 19:25:35 INFO ParseDriver: Parse Completed
15/02/11 19:25:35 INFO SessionState: No Tez session required at this point. hive.execution.engine=mr.
15/02/11 19:25:35 INFO Driver: Concurrency mode is disabled, not creating a lock manager
15/02/11 19:25:35 INFO ParseDriver: Parsing command: use `default`
15/02/11 19:25:35 INFO ParseDriver: Parse Completed
15/02/11 19:25:35 INFO Driver: Semantic Analysis Completed
15/02/11 19:25:35 INFO Driver: Returning Hive schema: Schema(fieldSchemas:null, properties:null)
15/02/11 19:25:35 INFO Driver: Starting command: use `default`
15/02/11 19:25:35 INFO HiveMetaStore: 3: get_database: default
15/02/11 19:25:35 INFO audit: ugi=anonymous ip=unknown-ip-addr cmd=get_database: default
15/02/11 19:25:35 INFO HiveMetaStore: 3: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/02/11 19:25:35 INFO ObjectStore: ObjectStore, initialize called
15/02/11 19:25:36 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
15/02/11 19:25:36 INFO ObjectStore: Initialized ObjectStore
15/02/11 19:25:36 INFO HiveMetaStore: 3: get_database: default
15/02/11 19:25:36 INFO audit: ugi=anonymous ip=unknown-ip-addr cmd=get_database: default
15/02/11 19:25:36 INFO Driver: OK
15/02/11 19:25:36 INFO SparkExecuteStatementOperation: Running query 'create temporary table test using org.apache.spark.sql.json options (path '/data/json/*')'
....
15/02/11 19:25:38 INFO Driver: Starting command: use `default`
15/02/11 19:25:38 INFO HiveMetaStore: 4: get_database: default
15/02/11 19:25:38 INFO audit: ugi=anonymous ip=unknown-ip-addr cmd=get_database: default
15/02/11 19:25:38 INFO HiveMetaStore: 4: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/02/11 19:25:38 INFO ObjectStore: ObjectStore, initialize called
15/02/11 19:25:38 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
15/02/11 19:25:38 INFO ObjectStore: Initialized ObjectStore
15/02/11 19:25:38 INFO HiveMetaStore: 4: get_database: default
15/02/11 19:25:38 INFO audit: ugi=anonymous ip=unknown-ip-addr cmd=get_database: default
15/02/11 19:25:38 INFO Driver: OK
15/02/11 19:25:38 INFO SparkExecuteStatementOperation: Running query 'cache table test'
15/02/11 19:25:38 INFO MemoryStore: ensureFreeSpace(211383) called with curMem=101514, maxMem=278019440
15/02/11 19:25:38 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 206.4 KB, free 264.8 MB)

However, I see no way in Tableau to see the cached table "test". I think I am missing a step that associates the temp table generated by Spark SQL with the metastore. Any guidance or insights on what I'm missing here would be appreciated.

Thanks for the assistance.

-Todd
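P.S. In case it helps clarify what I'm after, below is roughly what I was planning to try next if the Initial SQL route turns out to be a dead end: build the table in a HiveContext myself and either share that context with the thrift server or persist the table into the metastore. This is only a sketch based on my reading of the 1.2 API (jsonFile, registerTempTable, cacheTable, saveAsTable, and HiveThriftServer2.startWithContext); the object/app names are made up and I have not run this, so treat it as an assumption rather than something working.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

object JsonThriftServer {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("json-thrift"))
    val hive = new HiveContext(sc)

    // Same JSON files as in the Initial SQL above.
    val json = hive.jsonFile("/data/json/*")

    // Option 1: keep it as a cached temp table, but start the thrift
    // server on this same HiveContext so the JDBC/Tableau sessions share it.
    json.registerTempTable("test")
    hive.cacheTable("test")
    HiveThriftServer2.startWithContext(hive)

    // Option 2: persist it as a real metastore table instead - this is the
    // "save to hive" step I was hoping to avoid.
    // json.saveAsTable("test")
  }
}

If startWithContext isn't actually usable here, then I suppose option 2 (writing it out as a real Hive table) is the fallback, which is exactly the step I was hoping to avoid.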
On Wed, Feb 11, 2015 at 3:20 PM, Andrew Lee <alee...@hotmail.com> wrote:

> Sorry folks, it is executing Spark jobs instead of Hive jobs. I misread
> the logs since there were other activities going on on the cluster.
>
> ------------------------------
> From: alee...@hotmail.com
> To: ar...@sigmoidanalytics.com; tsind...@gmail.com
> CC: user@spark.apache.org
> Subject: RE: SparkSQL + Tableau Connector
> Date: Wed, 11 Feb 2015 11:56:44 -0800
>
> I'm using mysql as the metastore DB with Spark 1.2.
> I simply copied the hive-site.xml to /etc/spark/ and added the mysql JDBC
> JAR to spark-env.sh in /etc/spark/, and everything works fine now.
>
> My setup looks like this:
>
> Tableau => Spark ThriftServer2 => HiveServer2
>
> It's talking to Tableau Desktop 8.3. Interestingly, when I query a Hive
> table, it still invokes Hive queries to HiveServer2, which is running the
> MR or Tez engine. Is this expected?
>
> I thought it would at least use the Catalyst engine and talk to the
> underlying HDFS to pull the data into an RDD, like the HiveContext API
> does. Did I misunderstand the purpose of Spark ThriftServer2?
>
> ------------------------------
> Date: Wed, 11 Feb 2015 16:07:40 +0530
> Subject: Re: SparkSQL + Tableau Connector
> From: ar...@sigmoidanalytics.com
> To: tsind...@gmail.com
> CC: user@spark.apache.org
>
> Hi,
>
> I used this, though it's using an embedded driver and is not a good
> approach. It works. You can configure some other metastore type as well;
> I have not tried the metastore URIs.
>
> <configuration>
>   <property>
>     <name>javax.jdo.option.ConnectionURL</name>
>     <value>jdbc:derby:;databaseName=/opt/bigdata/spark-1.2.0/metastore_db;create=true</value>
>     <description>URL for the DB</description>
>   </property>
>
>   <property>
>     <name>javax.jdo.option.ConnectionDriverName</name>
>     <value>org.apache.derby.jdbc.EmbeddedDriver</value>
>   </property>
>
>   <!-- <property>
>     <name>hive.metastore.uris</name>
>     <value>thrift://x.x.x.x:10000 <http://172.17.1.172:10000/></value>
>     <description>IP address (or fully-qualified domain name) and port of
>     the metastore host</description>
>   </property> -->
> </configuration>
>
> On Wed, Feb 11, 2015 at 3:59 PM, Todd Nist <tsind...@gmail.com> wrote:
>
> Hi Arush,
>
> So yes, I want to create the tables through Spark SQL. I have placed the
> hive-site.xml file inside the $SPARK_HOME/conf directory; I thought that
> was all I should need to do to have the thriftserver use it. Perhaps my
> hive-site.xml is wrong. It currently looks like this:
>
> <configuration>
>   <property>
>     <name>hive.metastore.uris</name>
>     <!-- Ensure that the following statement points to the Hive Metastore
>          URI in your cluster -->
>     <value>thrift://sandbox.hortonworks.com:9083</value>
>     <description>URI for client to contact metastore server</description>
>   </property>
> </configuration>
>
> Which leads me to believe it is going to pull from the thriftserver from
> Horton? I will go look at the docs to see if this is right; it is what
> Horton says to do. Do you have an example hive-site.xml by chance that
> works with Spark SQL?
>
> I am using 8.3 of Tableau with the SparkSQL Connector.
>
> Thanks for the assistance.
>
> -Todd
>
> On Wed, Feb 11, 2015 at 2:34 AM, Arush Kharbanda <
> ar...@sigmoidanalytics.com> wrote:
>
> BTW, what Tableau connector are you using?
>
> On Wed, Feb 11, 2015 at 12:55 PM, Arush Kharbanda <
> ar...@sigmoidanalytics.com> wrote:
>
> I am a little confused here: why do you want to create the tables in
> Hive? You want to create the tables in Spark SQL, right?
>
> If you are not able to find the same tables through Tableau, then thrift
> is connecting to a different metastore than your spark-shell.
>
> One way to specify a metastore to thrift is to provide the path to
> hive-site.xml while starting thrift, using --files hive-site.xml.
>
> Similarly, you can specify the same metastore to your spark-submit or
> spark-shell using the same option.
>
> On Wed, Feb 11, 2015 at 5:23 AM, Todd Nist <tsind...@gmail.com> wrote:
>
> Arush,
>
> As for #2, do you mean something like this from the docs:
>
> // sc is an existing SparkContext.
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
>
> sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
> sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
>
> // Queries are expressed in HiveQL
> sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)
>
> Or did you have something else in mind?
>
> -Todd
>
> On Tue, Feb 10, 2015 at 6:35 PM, Todd Nist <tsind...@gmail.com> wrote:
>
> Arush,
>
> Thank you, I will take a look at that approach in the morning. I sort of
> figured the answer to #1 was NO and that I would need to do 2 and 3;
> thanks for clarifying it for me.
>
> -Todd
>
> On Tue, Feb 10, 2015 at 5:24 PM, Arush Kharbanda <
> ar...@sigmoidanalytics.com> wrote:
>
> 1. Can the connector fetch or query schemaRDDs saved to Parquet or JSON
> files? NO
> 2. Do I need to do something to expose these via hive / metastore other
> than creating a table in hive? Create a table in Spark SQL to expose it
> via Spark SQL.
> 3. Does the thriftserver need to be configured to expose these in some
> fashion, sort of related to question 2? You would need to configure
> thrift to read from the metastore you expect it to read from; by default
> it reads from the metastore_db directory present in the directory used to
> launch the thrift server.
>
> On 11 Feb 2015 01:35, "Todd Nist" <tsind...@gmail.com> wrote:
>
> Hi,
>
> I'm trying to understand how and what the Tableau connector to SparkSQL
> is able to access. My understanding is that it needs to connect to the
> thriftserver, and I am not sure how or if it exposes parquet, json, and
> schemaRDDs, or whether it only exposes schemas defined in the metastore /
> hive.
>
> For example, I do the following from the spark-shell, which generates a
> schemaRDD from a csv file and saves it as a JSON file as well as a
> parquet file.
>
> import org.apache.spark.sql.SQLContext
> import com.databricks.spark.csv._
>
> val sqlContext = new SQLContext(sc)
> val test = sqlContext.csvFile("/data/test.csv")
> test.toJSON().saveAsTextFile("/data/out")
> test.saveAsParquetFile("/data/out")
>
> When I connect from Tableau, the only thing I see is the "default" schema
> and nothing in the tables section.
>
> So my questions are:
>
> 1. Can the connector fetch or query schemaRDDs saved to Parquet or JSON
> files?
> 2. Do I need to do something to expose these via hive / metastore other
> than creating a table in hive?
> 3. Does the thriftserver need to be configured to expose these in some
> fashion, sort of related to question 2.
>
> TIA for the assistance.
>
> -Todd
>
>
> --
>
> [image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>
>
> *Arush Kharbanda* || Technical Teamlead
>
> ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
>
>
> --
>
> [image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>
>
> *Arush Kharbanda* || Technical Teamlead
>
> ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
>
>
> --
>
> [image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>
>
> *Arush Kharbanda* || Technical Teamlead
>
> ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
>