Thanks Michal. I have submitted a Spark issue and PR based on my understanding of why this changed in Spark 2.0. If you are interested, you can follow it at https://issues.apache.org/jira/browse/SPARK-18687
Regards,
Vinayak.

From: Michal Šenkýř <bina...@gmail.com>
To: Vinayak Joshi5/India/IBM@IBMIN, "user.spark" <user@spark.apache.org>
Date: 02/12/2016 05:50 AM
Subject: Re: Spark 2.x Pyspark Spark SQL createDataframe Error

Hello Vinayak,

As I understand it, Spark creates a Derby metastore database in the current location, in the metastore_db subdirectory, whenever you first use an SQL context. This database cannot be shared by multiple instances. This should be controlled by the javax.jdo.option.ConnectionURL property. I can imagine that using another kind of metastore database, like an in-memory or server-client db, would solve this specific problem. However, I do not think it is advisable.

Is there a specific reason why you are creating a second SQL context? I think it is meant to be created only once per application and passed around.

I also have no idea why the behavior changed between Spark 1.6 and Spark 2.0.

Michal Šenkýř

On Thu, Dec 1, 2016, 18:33 Vinayak Joshi5 <vijos...@in.ibm.com> wrote:

This is the error received:

16/12/01 22:35:36 ERROR Schema: Failed initialising database.
Unable to open a test connection to the given database. JDBC url = jdbc:derby:;databaseName=metastore_db;create=true, username = APP. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: ------
java.sql.SQLException: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@4494053, see the next exception for details.
    at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
    at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
    at org.apache.derby.impl.jdbc.Util.seeNextException(Unknown Source)
. .
------
org.datanucleus.exceptions.NucleusDataStoreException: Unable to open a test connection to the given database. JDBC url = jdbc:derby:;databaseName=metastore_db;create=true, username = APP. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: ------
java.sql.SQLException: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@519dabfd, see the next exception for details.
    at org.apache.derby.impl.jdb
. . .
NestedThrowables:
java.sql.SQLException: Unable to open a test connection to the given database. JDBC url = jdbc:derby:;databaseName=metastore_db;create=true, username = APP. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: ------
java.sql.SQLException: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@519dabfd, see the next exception for details.
    at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
. . .
Caused by: java.sql.SQLException: Unable to open a test connection to the given database. JDBC url = jdbc:derby:;databaseName=metastore_db;create=true, username = APP. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: ------
java.sql.SQLException: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@519dabfd, see the next exception for details.
    at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
    at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
    at org.apache.derby.impl.jdbc.Util.seeNextException(Unknown Source)
    at org.apache.derby.impl.jdbc.EmbedConnection.bootDatabase(Unknown Source)
    at org.apache.derby.impl.jdbc.EmbedConnection.<init>(Unknown Source)
. . .

16/12/01 22:48:09 ERROR Schema: Failed initialising database.
Unable to open a test connection to the given database. JDBC url = jdbc:derby:;databaseName=metastore_db;create=true, username = APP. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: ------
java.sql.SQLException: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@519dabfd, see the next exception for details.
    at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
    at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
    at org.apache.derby.impl.jdbc.Util.seeNextException(Unknown Source)
    at org.apache.derby.impl.jdbc.EmbedConnection.bootDatabase(Unknown Source)
    at org.apache.derby.impl.jdbc.EmbedConnection.<init>(Unknown Source)
. . .
Caused by: java.sql.SQLException: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@519dabfd, see the next exception for details.
    at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
    at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
    at org.apache.derby.impl.jdbc.Util.seeNextException(Unknown Source)
    at org.apache.derby.impl.jdbc.EmbedConnection.bootDatabase(Unknown Source)
    at org.apache.derby.impl.jdbc.EmbedConnection.<init>(Unknown Source)
. . .
Caused by: ERROR XJ040: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@519dabfd, see the next exception for details.
    at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
    at org.apache.derby.impl.jdbc.SQLExceptionFactory.wrapArgsForTransportAcrossDRDA(Unknown Source)
    ... 111 more
Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database /Users/vinayak/devel/spark-stc/git_repo/spark-master-x/spark/metastore_db.
    at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
    at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
    at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.privGetJBMSLockOnDB(Unknown Source)
    at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.run(Unknown Source)
    at java.security.AccessController.doPrivileged(Native Method)
    at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.getJBMSLockOnDB(Unknown Source)
    at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.boot(Unknown Source)
    at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source)

Regards,
Vinayak Joshi

From: Vinayak Joshi5/India/IBM@IBMIN
To: "user.spark" <user@spark.apache.org>
Date: 01/12/2016 10:53 PM
Subject: Spark 2.x Pyspark Spark SQL createDataframe Error

With a local spark instance built with hive support, (-Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver)

The following script/sequence works in Pyspark without any error against 1.6.x, but fails with 2.x.

    people = sc.parallelize(["Michael,30", "Andy,12", "Justin,19"])
    peoplePartsRDD = people.map(lambda p: p.split(","))
    peopleRDD = peoplePartsRDD.map(lambda p: pyspark.sql.Row(name=p[0], age=int(p[1])))
    peopleDF= sqlContext.createDataFrame(peopleRDD)
    peopleDF.first()

    sqlContext2 = SQLContext(sc)
    people2 = sc.parallelize(["Abcd,40", "Efgh,14", "Ijkl,16"])
    peoplePartsRDD2 = people2.map(lambda l: l.split(","))
    peopleRDD2 = peoplePartsRDD2.map(lambda p: pyspark.sql.Row(fname=p[0], age=int(p[1])))
    peopleDF2 = sqlContext2.createDataFrame(peopleRDD2) # <==== error here

The error goes away if sqlContext2 is replaced with sqlContext in the error line. Is this a regression, or has something changed that makes this the expected behavior in Spark 2.x ?

Regards,
Vinayak
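
For reference, the workaround noted in this thread (using the existing sqlContext for both DataFrames instead of creating a second SQLContext) can be sketched as below. This is only a sketch under stated assumptions: `sc` and `sql_context` are assumed to be the SparkContext and SQLContext that the pyspark shell predefines as `sc` and `sqlContext`, the helper names `parse_person` and `build_people_frames` are hypothetical, and plain tuples with a list of column names are used here in place of the `Row` objects from the original script.

```python
def parse_person(line):
    """Split a "name,age" string into a (name, age) tuple with an integer age."""
    name, age = line.split(",")
    return (name, int(age))


def build_people_frames(sc, sql_context):
    """Build both DataFrames from the script above with ONE SQL context.

    Assumption: `sc` and `sql_context` are the objects the pyspark shell
    exposes as `sc` and `sqlContext`; no second SQLContext(sc) is created.
    """
    people = sc.parallelize(["Michael,30", "Andy,12", "Justin,19"])
    # Passing a list of column names as the schema leaves type inference to Spark.
    people_df = sql_context.createDataFrame(
        people.map(parse_person), ["name", "age"]
    )

    people2 = sc.parallelize(["Abcd,40", "Efgh,14", "Ijkl,16"])
    # Same context as above, only the first column name differs.
    people_df2 = sql_context.createDataFrame(
        people2.map(parse_person), ["fname", "age"]
    )
    return people_df, people_df2
```

In the pyspark shell this would be called as `people_df, people_df2 = build_people_frames(sc, sqlContext)`. Passing the context in as an argument follows the suggestion earlier in the thread that it be created once per application and passed around.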