For this issue in particular (ERROR XSDB6: Another instance of Derby may have
already booted the database /spark/spark-1.4.1/metastore_db) -- I think it
depends on where you start your application and the Hive Thrift server from.
I've run into a similar issue: running a driver app first creates a directory
called metastore_db in the working directory, and if I then try to start
spark-shell from the same directory, I see this exception. So it is like
SPARK-9776, but it's not so much that the two are in the same process (as the
bug resolution states); rather, I think you can't run two drivers that start
a HiveContext from the same directory, because each one tries to boot the
embedded Derby database in that directory's metastore_db.
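A workaround that may help (treat this as a sketch -- I haven't verified it
against 1.4.1): since the embedded Derby metastore_db is created relative to
the driver's working directory, either start each driver from its own
directory, or try pointing each application at its own Derby path before the
metastore is first touched. The app name and path below are hypothetical:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("windows-events"))
val hiveContext = new HiveContext(sc)

// Hypothetical per-application Derby path. If this conf is picked up before
// the first table access, this driver should boot its own metastore instead
// of contending for /spark/spark-1.4.1/metastore_db.
hiveContext.setConf("javax.jdo.option.ConnectionURL",
  "jdbc:derby:;databaseName=/tmp/windows-events-metastore_db;create=true")

If the thrift server and the streaming application actually need to see the
same tables, though, the real fix is a metastore backed by a database server
(configured in hive-site.xml) rather than embedded Derby, which only admits
one process at a time.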
On Wed, Oct 28, 2015 at 4:10 PM, Bryan Jeffrey <bryan.jeff...@gmail.com> wrote:

> All,
>
> One issue I'm seeing is that I start the thrift server (for jdbc access)
> via the following: /spark/spark-1.4.1/sbin/start-thriftserver.sh --master
> spark://master:7077 --hiveconf "spark.cores.max=2"
>
> After about 40 seconds the Thrift server is started and available on
> default port 10000.
>
> I then submit my application - and the application throws the following
> error:
>
> Caused by: java.sql.SQLException: Failed to start database 'metastore_db'
> with class loader
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@6a552721,
> see the next exception for details.
>   at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
>   at org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown Source)
>   ... 86 more
> Caused by: java.sql.SQLException: Another instance of Derby may have
> already booted the database /spark/spark-1.4.1/metastore_db.
>   at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
>   at org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown Source)
>   at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source)
>   at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source)
>   ... 83 more
> Caused by: ERROR XSDB6: Another instance of Derby may have already booted
> the database /spark/spark-1.4.1/metastore_db.
>
> This also happens if I do the opposite (submit the application first, and
> then start the thrift server).
>
> It looks similar to the following issue -- but not quite the same:
> https://issues.apache.org/jira/browse/SPARK-9776
>
> It seems like this set of steps works fine if the metadata database is not
> yet created - but once it's created this happens every time. Is this a
> known issue? Is there a workaround?
>
> Regards,
>
> Bryan Jeffrey
>
> On Wed, Oct 28, 2015 at 3:13 PM, Bryan Jeffrey <bryan.jeff...@gmail.com>
> wrote:
>
>> Susan,
>>
>> I did give that a shot -- I'm seeing a number of oddities:
>>
>> (1) 'partitionBy' appears to accept only lower-case alphanumeric field
>> names. It will work for 'machinename', but not 'machineName' or
>> 'machine_name'.
>> (2) When partitioning with maps included in the data, I get odd string
>> conversion issues.
>> (3) When partitioning without maps, I see frequent out-of-memory issues.
>>
>> I'll update this email when I've got a more concrete example of the
>> problems.
>>
>> Regards,
>>
>> Bryan Jeffrey
>>
>> On Wed, Oct 28, 2015 at 1:33 PM, Susan Zhang <suchenz...@gmail.com>
>> wrote:
>>
>>> Have you tried partitionBy?
>>>
>>> Something like
>>>
>>> hiveWindowsEvents.foreachRDD( rdd => {
>>>   val eventsDataFrame = rdd.toDF()
>>>   eventsDataFrame.write.mode(SaveMode.Append)
>>>     .partitionBy("windows_event_time_bin")
>>>     .saveAsTable("windows_event")
>>> })
>>>
>>> On Wed, Oct 28, 2015 at 7:41 AM, Bryan Jeffrey <bryan.jeff...@gmail.com>
>>> wrote:
>>>
>>>> Hello.
>>>>
>>>> I am working to get a simple solution working using Spark SQL. I am
>>>> writing streaming data to persistent tables using a HiveContext.
>>>> Writing to a persistent non-partitioned table works well - I update
>>>> the table using Spark streaming, and the output is available via Hive
>>>> Thrift/JDBC.
>>>>
>>>> I create a table that looks like the following:
>>>>
>>>> 0: jdbc:hive2://localhost:10000> describe windows_event;
>>>> describe windows_event;
>>>> +--------------------------+---------------------+----------+
>>>> |         col_name         |      data_type      | comment  |
>>>> +--------------------------+---------------------+----------+
>>>> | target_entity            | string              | NULL     |
>>>> | target_entity_type       | string              | NULL     |
>>>> | date_time_utc            | timestamp           | NULL     |
>>>> | machine_ip               | string              | NULL     |
>>>> | event_id                 | string              | NULL     |
>>>> | event_data               | map<string,string>  | NULL     |
>>>> | description              | string              | NULL     |
>>>> | event_record_id          | string              | NULL     |
>>>> | level                    | string              | NULL     |
>>>> | machine_name             | string              | NULL     |
>>>> | sequence_number          | string              | NULL     |
>>>> | source                   | string              | NULL     |
>>>> | source_machine_name      | string              | NULL     |
>>>> | task_category            | string              | NULL     |
>>>> | user                     | string              | NULL     |
>>>> | additional_data          | map<string,string>  | NULL     |
>>>> | windows_event_time_bin   | timestamp           | NULL     |
>>>> | # Partition Information  |                     |          |
>>>> | # col_name               | data_type           | comment  |
>>>> | windows_event_time_bin   | timestamp           | NULL     |
>>>> +--------------------------+---------------------+----------+
>>>>
>>>> However, when I create a partitioned table and write data using the
>>>> following:
>>>>
>>>> hiveWindowsEvents.foreachRDD( rdd => {
>>>>   val eventsDataFrame = rdd.toDF()
>>>>   eventsDataFrame.write.mode(SaveMode.Append).saveAsTable("windows_event")
>>>> })
>>>>
>>>> the data is written as though the table were not partitioned (so
>>>> everything is written to
>>>> /user/hive/warehouse/windows_event/file.gz.parquet). Because the data
>>>> does not follow the partition scheme, it is not accessible (and not
>>>> partitioned).
>>>>
>>>> Is there a straightforward way to write to partitioned tables using
>>>> Spark SQL? I understand that the read performance for partitioned data
>>>> is far better - are there other performance improvements that might be
>>>> better to use instead of partitioning?
>>>>
>>>> Regards,
>>>>
>>>> Bryan Jeffrey
>>>
>>
>
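P.S. On the original partitioned-write question: Susan's partitionBy
suggestion is the right mechanism. Here is a self-contained sketch of the
whole write path -- the WindowsEvent case class and writeEvents helper are
hypothetical stand-ins for the real types, trimmed to a few columns, with
field names all lower-case to sidestep the oddity noted above:

import java.sql.Timestamp
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.streaming.dstream.DStream

// Hypothetical event type standing in for the real one; only a few of the
// table's columns are shown.
case class WindowsEvent(
  target_entity: String,
  machine_name: String,
  windows_event_time_bin: Timestamp)

def writeEvents(events: DStream[WindowsEvent], hiveContext: HiveContext): Unit = {
  import hiveContext.implicits._
  events.foreachRDD { rdd =>
    val eventsDataFrame = rdd.toDF()
    // partitionBy names a DataFrame column; Spark lifts it out of the row
    // data and into the directory layout, e.g.
    // .../windows_event/windows_event_time_bin=.../part-*.gz.parquet
    eventsDataFrame.write
      .mode(SaveMode.Append)
      .partitionBy("windows_event_time_bin")
      .saveAsTable("windows_event")
  }
}

Since each distinct value of the partition column becomes its own directory,
a low-cardinality column like the time bin is a good choice; partitioning on
something high-cardinality would produce a flood of small files.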