Hey Todd, I don't have an app to test against the thrift server. Are you able to define custom SQL without using Tableau's schema query? I suspect it isn't possible to use SparkSQL temp tables alone; you may have to use permanent Hive tables that actually exist in the metastore so Tableau can discover them in the schema. In that case you will either have to generate the Hive tables externally from Spark, or use Spark to process the data and save it with a HiveContext.
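If you do go the HiveContext route, something along these lines should work (a rough sketch, untested on my side; the path and table name are just placeholders):

import org.apache.spark.sql.hive.HiveContext

// sc is the existing SparkContext from the spark-shell.
val hiveContext = new HiveContext(sc)

// Load the JSON data, then save it as a permanent table in the Hive
// metastore so Tableau's schema query can discover it.
val events = hiveContext.jsonFile("/data/json/*")
events.saveAsTable("test")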
From: Todd Nist
Date: Wednesday, February 11, 2015 at 7:53 PM
To: Andrew Lee
Cc: Arush Kharbanda, user@spark.apache.org
Subject: Re: SparkSQL + Tableau Connector

First, sorry for the long post. So, back to Tableau and Spark SQL: I'm still missing something.

TL;DR: To get a Spark SQL temp table associated with the metastore, are there additional steps required beyond doing the below?

Initial SQL on connection:

create temporary table test
using org.apache.spark.sql.json
options (path '/data/json/*');

cache table test;

I feel like I'm missing a step of associating the Spark SQL table with the metastore; do I need to actually save it in some fashion? I'm trying to avoid saving to Hive if possible.

Details:

I configured hive-site.xml and placed it in $SPARK_HOME/conf. It looks like this (thanks Andrew and Arush for the assistance):

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hive.semantic.analyzer.factory.impl</name>
    <value>org.apache.hcatalog.cli.HCatSemanticAnalyzerFactory</value>
  </property>
  <property>
    <name>hive.metastore.sasl.enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.server2.authentication</name>
    <value>NONE</value>
  </property>
  <property>
    <name>hive.server2.enable.doAs</name>
    <value>true</value>
  </property>
  <!--
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://localhost:9083</value>
    <description>IP address (or fully-qualified domain name) and port of the metastore host</description>
  </property>
  -->
  <property>
    <name>hive.warehouse.subdir.inherit.perms</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/metastore_db?createDatabaseIfNotExist=true</value>
    <description>metadata is stored in a MySQL server</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>MySQL JDBC driver class</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hiveuser</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hiveuser</value>
  </property>
</configuration>

When I start the server it looks fine:

$ ./sbin/start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001 \
    --hiveconf hive.server2.thrift.bind.host radtech.io \
    --master spark://radtech.io:7077 \
    --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar
starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2, logging to /usr/local/spark-1.2.1-bin-hadoop2.4/logs/spark-tnist-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-radtech.io.out

radtech:spark tnist$ tail -f logs/spark-tnist-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-radtech.io.out
15/02/11 19:15:24 INFO SparkDeploySchedulerBackend: Granted executor ID app-20150211191524-0008/1 on hostPort 192.168.1.2:50851 with 2 cores, 512.0 MB RAM
15/02/11 19:15:24 INFO AppClient$ClientActor: Executor updated: app-20150211191524-0008/0 is now LOADING
15/02/11 19:15:24 INFO AppClient$ClientActor: Executor updated: app-20150211191524-0008/1 is now LOADING
15/02/11 19:15:24 INFO AppClient$ClientActor: Executor updated: app-20150211191524-0008/0 is now RUNNING
15/02/11 19:15:24 INFO AppClient$ClientActor: Executor updated: app-20150211191524-0008/1 is now RUNNING
15/02/11 19:15:24 INFO NettyBlockTransferService: Server created on 50938
15/02/11 19:15:24 INFO BlockManagerMaster: Trying to register BlockManager
15/02/11 19:15:24 INFO BlockManagerMasterActor: Registering block manager 192.168.1.2:50938 with 265.1 MB RAM, BlockManagerId(<driver>, 192.168.1.2, 50938)
15/02/11 19:15:24 INFO BlockManagerMaster: Registered BlockManager
15/02/11 19:15:25 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
15/02/11 19:15:25 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/02/11 19:15:25 INFO ObjectStore: ObjectStore, initialize called
15/02/11 19:15:26 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
15/02/11 19:15:26 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
15/02/11 19:15:26 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/02/11 19:15:26 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/02/11 19:15:27 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
15/02/11 19:15:28 INFO SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@192.168.1.2:50944/user/Executor#1008909571] with ID 0
15/02/11 19:15:28 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
15/02/11 19:15:28 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
15/02/11 19:15:28 INFO SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@192.168.1.2:50948/user/Executor#-688434541] with ID 1
15/02/11 19:15:28 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
15/02/11 19:15:28 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
15/02/11 19:15:28 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
15/02/11 19:15:28 INFO ObjectStore: Initialized ObjectStore
15/02/11 19:15:28 INFO BlockManagerMasterActor: Registering block manager 192.168.1.2:50951 with 265.1 MB RAM, BlockManagerId(0, 192.168.1.2, 50951)
15/02/11 19:15:28 INFO BlockManagerMasterActor: Registering block manager 192.168.1.2:50952 with 265.1 MB RAM, BlockManagerId(1, 192.168.1.2, 50952)
15/02/11 19:15:28 INFO HiveMetaStore: Added admin role in metastore
15/02/11 19:15:28 INFO HiveMetaStore: Added public role in metastore
15/02/11 19:15:29 INFO HiveMetaStore: No user is added in admin role, since config is empty
15/02/11 19:15:29 INFO SessionState: No Tez session required at this point. hive.execution.engine=mr.
15/02/11 19:15:29 INFO AbstractService: HiveServer2: Async execution pool size 100
15/02/11 19:15:29 INFO AbstractService: Service:OperationManager is inited.
15/02/11 19:15:29 INFO AbstractService: Service:SessionManager is inited.
15/02/11 19:15:29 INFO AbstractService: Service:CLIService is inited.
15/02/11 19:15:29 INFO AbstractService: Service:ThriftBinaryCLIService is inited.
15/02/11 19:15:29 INFO AbstractService: Service:HiveServer2 is inited.
15/02/11 19:15:29 INFO AbstractService: Service:OperationManager is started.
15/02/11 19:15:29 INFO AbstractService: Service:SessionManager is started.
15/02/11 19:15:29 INFO AbstractService: Service:CLIService is started.
15/02/11 19:15:29 INFO HiveMetaStore: No user is added in admin role, since config is empty
15/02/11 19:15:29 INFO HiveMetaStore: 0: get_databases: default
15/02/11 19:15:29 INFO audit: ugi=tnist ip=unknown-ip-addr cmd=get_databases: default
15/02/11 19:15:29 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/02/11 19:15:29 INFO ObjectStore: ObjectStore, initialize called
15/02/11 19:15:29 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
15/02/11 19:15:29 INFO ObjectStore: Initialized ObjectStore
15/02/11 19:15:29 INFO AbstractService: Service:ThriftBinaryCLIService is started.
15/02/11 19:15:29 INFO AbstractService: Service:HiveServer2 is started.
15/02/11 19:15:29 INFO HiveThriftServer2: HiveThriftServer2 started
15/02/11 19:15:29 INFO ThriftCLIService: ThriftBinaryCLIService listening on radtech.io/192.168.1.2:10001

When I start Tableau and use the SparkSQL (Beta) connector, I see that the "initial SQL" is being executed:

15/02/11 19:25:35 INFO HiveMetaStore: 2: get_database: default
15/02/11 19:25:35 INFO audit: ugi=anonymous ip=unknown-ip-addr cmd=get_database: default
15/02/11 19:25:35 INFO Driver: OK
15/02/11 19:25:35 INFO SparkExecuteStatementOperation: Running query 'set -v'
15/02/11 19:25:35 INFO SparkExecuteStatementOperation: Result Schema: List(#18)
15/02/11 19:25:35 INFO SparkExecuteStatementOperation: Result Schema: List(#21)
15/02/11 19:25:35 INFO ThriftCLIService: Client protocol version: HIVE_CLI_SERVICE_PROTOCOL_V6
15/02/11 19:25:35 INFO HiveMetaStore: No user is added in admin role, since config is empty
15/02/11 19:25:35 INFO SessionState: No Tez session required at this point. hive.execution.engine=mr.
15/02/11 19:25:35 INFO SessionState: No Tez session required at this point. hive.execution.engine=mr.
15/02/11 19:25:35 INFO SparkExecuteStatementOperation: Running query 'use `default`'
15/02/11 19:25:35 INFO ParseDriver: Parsing command: use `default`
15/02/11 19:25:35 INFO ParseDriver: Parse Completed
15/02/11 19:25:35 INFO SessionState: No Tez session required at this point. hive.execution.engine=mr.
15/02/11 19:25:35 INFO Driver: Concurrency mode is disabled, not creating a lock manager
15/02/11 19:25:35 INFO ParseDriver: Parsing command: use `default`
15/02/11 19:25:35 INFO ParseDriver: Parse Completed
15/02/11 19:25:35 INFO Driver: Semantic Analysis Completed
15/02/11 19:25:35 INFO Driver: Returning Hive schema: Schema(fieldSchemas:null, properties:null)
15/02/11 19:25:35 INFO Driver: Starting command: use `default`
15/02/11 19:25:35 INFO HiveMetaStore: 3: get_database: default
15/02/11 19:25:35 INFO audit: ugi=anonymous ip=unknown-ip-addr cmd=get_database: default
15/02/11 19:25:35 INFO HiveMetaStore: 3: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/02/11 19:25:35 INFO ObjectStore: ObjectStore, initialize called
15/02/11 19:25:36 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
15/02/11 19:25:36 INFO ObjectStore: Initialized ObjectStore
15/02/11 19:25:36 INFO HiveMetaStore: 3: get_database: default
15/02/11 19:25:36 INFO audit: ugi=anonymous ip=unknown-ip-addr cmd=get_database: default
15/02/11 19:25:36 INFO Driver: OK
15/02/11 19:25:36 INFO SparkExecuteStatementOperation: Running query 'create temporary table test using org.apache.spark.sql.json options (path '/data/json/*')'
....
15/02/11 19:25:38 INFO Driver: Starting command: use `default`
15/02/11 19:25:38 INFO HiveMetaStore: 4: get_database: default
15/02/11 19:25:38 INFO audit: ugi=anonymous ip=unknown-ip-addr cmd=get_database: default
15/02/11 19:25:38 INFO HiveMetaStore: 4: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/02/11 19:25:38 INFO ObjectStore: ObjectStore, initialize called
15/02/11 19:25:38 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
15/02/11 19:25:38 INFO ObjectStore: Initialized ObjectStore
15/02/11 19:25:38 INFO HiveMetaStore: 4: get_database: default
15/02/11 19:25:38 INFO audit: ugi=anonymous ip=unknown-ip-addr cmd=get_database: default
15/02/11 19:25:38 INFO Driver: OK
15/02/11 19:25:38 INFO SparkExecuteStatementOperation: Running query 'cache table test'
15/02/11 19:25:38 INFO MemoryStore: ensureFreeSpace(211383) called with curMem=101514, maxMem=278019440
15/02/11 19:25:38 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 206.4 KB, free 264.8 MB)

I see no way in Tableau to see the cached table "test". I think I am missing a step of associating the generated temp table from Spark SQL with the metastore. Any guidance or insights on what I'm missing here would be appreciated.

Thanks for the assistance.

-Todd

On Wed, Feb 11, 2015 at 3:20 PM, Andrew Lee <alee...@hotmail.com> wrote:

Sorry folks, it is executing Spark jobs instead of Hive jobs. I misread the logs since there was other activity going on on the cluster.

________________________________
From: alee...@hotmail.com
To: ar...@sigmoidanalytics.com; tsind...@gmail.com
CC: user@spark.apache.org
Subject: RE: SparkSQL + Tableau Connector
Date: Wed, 11 Feb 2015 11:56:44 -0800

I'm using MySQL as the metastore DB with Spark 1.2. I simply copied the hive-site.xml to /etc/spark/ and added the MySQL JDBC JAR to spark-env.sh in /etc/spark/; everything works fine now.
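Roughly like this, if it helps (the jar path below is just an example; point it at wherever your connector jar actually lives):

# /etc/spark/spark-env.sh
# Put the MySQL JDBC driver on the classpath of the driver and executors.
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/usr/lib/mysql/mysql-connector-java-5.1.34-bin.jar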
My setup looks like this:

Tableau => Spark ThriftServer2 => HiveServer2

It's talking to Tableau Desktop 8.3. Interestingly, when I query a Hive table, it still invokes Hive queries against HiveServer2, which is running the MR or Tez engine. Is this expected? I thought it would at least use the Catalyst engine and talk to the underlying HDFS to pull the data into an RDD, like the HiveContext API does. Did I misunderstand the purpose of Spark ThriftServer2?

________________________________
Date: Wed, 11 Feb 2015 16:07:40 +0530
Subject: Re: SparkSQL + Tableau Connector
From: ar...@sigmoidanalytics.com
To: tsind...@gmail.com
CC: user@spark.apache.org

Hi,

I used this, though it's using an embedded driver, which is not a good approach. It works. You can configure some other metastore type as well. I have not tried the metastore URIs.

<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=/opt/bigdata/spark-1.2.0/metastore_db;create=true</value>
    <description>URL for the DB</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
  </property>
  <!--
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://x.x.x.x:10000</value>
    <description>IP address (or fully-qualified domain name) and port of the metastore host</description>
  </property>
  -->
</configuration>

On Wed, Feb 11, 2015 at 3:59 PM, Todd Nist <tsind...@gmail.com> wrote:

Hi Arush,

So yes, I want to create the tables through Spark SQL. I have placed the hive-site.xml file inside the $SPARK_HOME/conf directory; I thought that was all I should need to do to have the thriftserver use it. Perhaps my hive-site.xml is wrong. It currently looks like this:

<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <!-- Ensure that the following statement points to the Hive Metastore URI in your cluster -->
    <value>thrift://sandbox.hortonworks.com:9083</value>
    <description>URI for client to contact metastore server</description>
  </property>
</configuration>

Which leads me to believe it is going to pull from the Hortonworks metastore? I will go look at the docs to see if this is right; it is what Hortonworks says to do. Do you have an example hive-site.xml by chance that works with Spark SQL? I am using Tableau 8.3 with the SparkSQL connector.

Thanks for the assistance.

-Todd

On Wed, Feb 11, 2015 at 2:34 AM, Arush Kharbanda <ar...@sigmoidanalytics.com> wrote:

BTW, what Tableau connector are you using?

On Wed, Feb 11, 2015 at 12:55 PM, Arush Kharbanda <ar...@sigmoidanalytics.com> wrote:

I am a little confused here; why do you want to create the tables in Hive? You want to create the tables in spark-sql, right? If you are not able to find the same tables through Tableau, then thrift is connecting to a different metastore than your spark-shell. One way to specify a metastore to thrift is to provide the path to hive-site.xml while starting thrift, using --files hive-site.xml. Similarly, you can specify the same metastore to your spark-submit or spark-shell using the same option.
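For example, something like this (master URL and paths are placeholders for your environment):

# Point the thrift server and the shell at the same metastore config.
./sbin/start-thriftserver.sh --master spark://yourhost:7077 --files /path/to/hive-site.xml
./bin/spark-shell --master spark://yourhost:7077 --files /path/to/hive-site.xml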
On Wed, Feb 11, 2015 at 5:23 AM, Todd Nist <tsind...@gmail.com> wrote:

Arush,

As for #2, do you mean something like this, from the docs:

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

// Queries are expressed in HiveQL
sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)

Or did you have something else in mind?

-Todd

On Tue, Feb 10, 2015 at 6:35 PM, Todd Nist <tsind...@gmail.com> wrote:

Arush,

Thank you; I will take a look at that approach in the morning. I sort of figured the answer to #1 was NO and that I would need to do 2 and 3. Thanks for clarifying it for me.

-Todd

On Tue, Feb 10, 2015 at 5:24 PM, Arush Kharbanda <ar...@sigmoidanalytics.com> wrote:

1. Can the connector fetch or query schemaRDDs saved to Parquet or JSON files? NO.
2. Do I need to do something to expose these via hive / metastore other than creating a table in hive? Create a table in Spark SQL to expose it via Spark SQL.
3. Does the thriftserver need to be configured to expose these in some fashion, sort of related to question 2? You would need to configure thrift to read from the metastore you expect it to read from; by default it reads from the metastore_db directory present in the directory used to launch the thrift server.

On 11 Feb 2015 01:35, "Todd Nist" <tsind...@gmail.com> wrote:

Hi,

I'm trying to understand how and what the Tableau connector to SparkSQL is able to access. My understanding is that it needs to connect to the thriftserver, and I am not sure how or if it exposes parquet, json, and schemaRDDs, or whether it only exposes schemas defined in the metastore / hive.

For example, I do the following from the spark-shell, which generates a schemaRDD from a csv file and saves it as a JSON file as well as a parquet file:

import org.apache.spark.sql.SQLContext
import com.databricks.spark.csv._

val sqlContext = new SQLContext(sc)
val test = sqlContext.csvFile("/data/test.csv")
test.toJSON.saveAsTextFile("/data/out/json")
test.saveAsParquetFile("/data/out/parquet")

When I connect from Tableau, the only thing I see is the "default" schema and nothing in the tables section.

So my questions are:

1. Can the connector fetch or query schemaRDDs saved to Parquet or JSON files?
2. Do I need to do something to expose these via hive / metastore other than creating a table in hive?
3. Does the thriftserver need to be configured to expose these in some fashion, sort of related to question 2?

TIA for the assistance.

-Todd

--
Arush Kharbanda || Technical Teamlead
ar...@sigmoidanalytics.com || www.sigmoidanalytics.com