>
> df = spark.sqlContext.read.csv('out/df_in.csv')
>

> 17/05/09 15:51:29 WARN ObjectStore: Version information not found in
> metastore. hive.metastore.schema.verification is not enabled so recording
> the schema version 1.2.0
> 17/05/09 15:51:29 WARN ObjectStore: Failed to get database default,
> returning NoSuchObjectException
> 17/05/09 15:51:30 WARN ObjectStore: Failed to get database global_temp,
> returning NoSuchObjectException
>


> Py4JJavaError: An error occurred while calling o72.csv.
> : java.lang.RuntimeException: Multiple sources found for csv
> (com.databricks.spark.csv.DefaultSource15,
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat), please
> specify the fully qualified class name.
> at scala.sys.package$.error(package.scala:27)
> at
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:591)
> at
> org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:86)
> at
> org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:86)
> at
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:325)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
> at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:415)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:280)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:214)
> at java.lang.Thread.run(Thread.java:745)


When I change our call to:

df = spark.hiveContext.read \
    .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat') \
    .load('df_in.csv')

there's no such issue. I was under the impression (obviously wrongly) that
Spark would automatically pick the built-in library. We keep the Databricks
library in our deploy because other jobs still explicitly call it.
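
For reference, the jobs that still need the Databricks reader pin it by its
documented format name, roughly like the sketch below (the path and the
header option are made up for illustration):

# Pin the external Databricks CSV source explicitly by its package name so
# there is no ambiguity with Spark's built-in 'csv' source.
df_legacy = spark.read \
    .format('com.databricks.spark.csv') \
    .option('header', 'true') \
    .load('out/df_in.csv')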

Is the 'correct' answer to go through and modify our jobs so we can remove
the Databricks lib from our deploy?  Or should this just work?
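
In the meantime, to see exactly which sources are competing for the 'csv'
short name, a sketch like the one below can poke at the driver JVM through
the private spark._jvm gateway. It mirrors the ServiceLoader lookup that
DataSource.lookupDataSource performs, but it relies on unsupported internals
(and classloader setup can vary by deploy), so treat it as a diagnostic
illustration only:

# Diagnostic only: list every DataSourceRegister implementation that
# java.util.ServiceLoader can see, which is the same mechanism Spark's
# lookupDataSource uses to resolve short names like 'csv'.
jvm = spark._jvm
register_cls = jvm.java.lang.Class.forName(
    'org.apache.spark.sql.sources.DataSourceRegister')
it = jvm.java.util.ServiceLoader.load(register_cls).iterator()
while it.hasNext():
    source = it.next()
    # shortName() is the alias each provider claims for DataFrameReader
    print(source.shortName(), '->', source.getClass().getName())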

One of the things I find less helpful in the Spark docs is that when there
are multiple ways to do something, there's no clear guidance on what each
method is intended to accomplish.

Thanks!
