if the intention is to create this on the default hadoop filesystem (and not local), then maybe we can use FileSystem.getHomeDirectory()? it should return the correct home directory on the relevant FileSystem (local or hdfs).
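something along these lines, roughly (an untested sketch; presumably the hadoop Configuration would come from the spark context rather than being built from scratch like here):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // default filesystem as configured (local or hdfs)
    val fs = FileSystem.get(new Configuration())
    // home directory on that same filesystem
    // (e.g. /user/<name> on hdfs, /home/<name> locally)
    val warehouse = new Path(fs.getHomeDirectory, "spark-warehouse")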
if the intention is to create this only locally, then why bother using the hadoop filesystem api at all?

On Thu, Oct 6, 2016 at 9:45 AM, Koert Kuipers <ko...@tresata.com> wrote:

> well it seems to work if i set spark.sql.warehouse.dir to
> /tmp/spark-warehouse in spark-defaults, and it creates it on hdfs.
>
> however can this directory safely be shared between multiple users
> running jobs?
>
> if not then i need to set this per user (instead of a single setting in
> spark-defaults), which means i need to change the jobs, which means an
> upgrade for a production cluster running many jobs becomes more
> difficult.
>
> or can i create a setting in spark-defaults that includes a reference to
> the user? something like /tmp/{user}/spark-warehouse?
>
> On Thu, Oct 6, 2016 at 6:04 AM, Sean Owen <so...@cloudera.com> wrote:
>
>> Yeah I see the same thing. You can fix this by setting
>> spark.sql.warehouse.dir of course as a workaround. I restarted a
>> conversation about it at
>> https://github.com/apache/spark/pull/13868#pullrequestreview-3081020
>>
>> I think the question is whether spark-warehouse is always supposed to
>> be a local dir, or could be an HDFS dir? a change is needed either way,
>> just want to clarify what it is.
>>
>> On Thu, Oct 6, 2016 at 5:18 AM Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> i just replaced our spark 2.0.0 install on the yarn cluster with spark
>>> 2.0.1 and copied over the configs.
>>>
>>> to give it a quick test i started spark-shell and created a dataset.
>>> i get this:
>>>
>>> 16/10/05 23:55:13 WARN spark.SparkContext: Use an existing SparkContext,
>>> some configuration may not take effect.
>>> Spark context Web UI available at http://***:4040
>>> Spark context available as 'sc' (master = yarn, app id =
>>> application_1471212701720_1580).
>>> Spark session available as 'spark'.
>>> Welcome to
>>>       ____              __
>>>      / __/__  ___ _____/ /__
>>>     _\ \/ _ \/ _ `/ __/  '_/
>>>    /___/ .__/\_,_/_/ /_/\_\   version 2.0.1
>>>       /_/
>>>
>>> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_75)
>>> Type in expressions to have them evaluated.
>>> Type :help for more information.
>>>
>>> scala> import spark.implicits._
>>> import spark.implicits._
>>>
>>> scala> val x = List(1,2,3).toDS
>>> org.apache.spark.SparkException: Unable to create database default as
>>> failed to create its directory hdfs://dev/home/koert/spark-warehouse
>>>   at org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.liftedTree1$1(InMemoryCatalog.scala:114)
>>>   at org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.createDatabase(InMemoryCatalog.scala:108)
>>>   at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createDatabase(SessionCatalog.scala:147)
>>>   at org.apache.spark.sql.catalyst.catalog.SessionCatalog.<init>(SessionCatalog.scala:89)
>>>   at org.apache.spark.sql.internal.SessionState.catalog$lzycompute(SessionState.scala:95)
>>>   at org.apache.spark.sql.internal.SessionState.catalog(SessionState.scala:95)
>>>   at org.apache.spark.sql.internal.SessionState$$anon$1.<init>(SessionState.scala:112)
>>>   at org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:112)
>>>   at org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:111)
>>>   at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
>>>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:161)
>>>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
>>>   at org.apache.spark.sql.Dataset$.apply(Dataset.scala:59)
>>>   at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:423)
>>>   at org.apache.spark.sql.SQLContext.createDataset(SQLContext.scala:380)
>>>   at org.apache.spark.sql.SQLImplicits.localSeqToDatasetHolder(SQLImplicits.scala:171)
>>>   ... 50 elided
>>>
>>> this did not happen in spark 2.0.0.
>>> the location it is trying to access makes little sense, since it is
>>> going to hdfs but then it is looking for my local home directory
>>> (/home/koert exists locally but not on hdfs).
>>>
>>> i suspect the issue is SPARK-15899, but i am not sure. in the pullreq
>>> for that WAREHOUSE_PATH got changed:
>>>
>>>   val WAREHOUSE_PATH = SQLConfigBuilder("spark.sql.warehouse.dir")
>>>     .doc("The default location for managed databases and tables.")
>>>     .stringConf
>>> -   .createWithDefault("file:${system:user.dir}/spark-warehouse")
>>> +   .createWithDefault("${system:user.dir}/spark-warehouse")
>>>
>>> notice how the file: got removed from the url, causing spark to look on
>>> hdfs now since it is my default filesystem on the cluster. but
>>> system:user.dir is still a local home directory. when combining the two
>>> you get something that doesn't exist.
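
for anyone hitting this in the meantime, the workaround Sean mentions is just to pin spark.sql.warehouse.dir explicitly. a rough sketch (the per-user path is only an illustration of the /tmp/{user}/spark-warehouse idea above, not a recommendation):

    import org.apache.spark.sql.SparkSession

    // set the warehouse dir explicitly so the filesystem and path are unambiguous
    val user = sys.props("user.name")
    val spark = SparkSession.builder()
      .config("spark.sql.warehouse.dir", s"/tmp/$user/spark-warehouse")
      .getOrCreate()

the same value can of course go in spark-defaults.conf instead; whether a single shared directory there is safe across users is the open question above.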