I'm a bit confused by that answer; I'm assuming it's Spark that decides which lib to use.
On 9 May 2017 at 14:30, Mark Hamstra <m...@clearstorydata.com> wrote:

> This looks more like a matter for Databricks support than spark-user.
>
> On Tue, May 9, 2017 at 2:02 PM, lucas.g...@gmail.com <lucas.g...@gmail.com> wrote:
>
>> df = spark.sqlContext.read.csv('out/df_in.csv')
>>
>>> 17/05/09 15:51:29 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
>>> 17/05/09 15:51:29 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
>>> 17/05/09 15:51:30 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
>>>
>>> Py4JJavaError: An error occurred while calling o72.csv.
>>> : java.lang.RuntimeException: Multiple sources found for csv (com.databricks.spark.csv.DefaultSource15, org.apache.spark.sql.execution.datasources.csv.CSVFileFormat), please specify the fully qualified class name.
>>> at scala.sys.package$.error(package.scala:27)
>>> at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:591)
>>> at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:86)
>>> at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:86)
>>> at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:325)
>>> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
>>> at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:415)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>> at java.lang.reflect.Method.invoke(Method.java:606)
>>> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>>> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>>> at py4j.Gateway.invoke(Gateway.java:280)
>>> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>>> at py4j.commands.CallCommand.execute(CallCommand.java:79)
>>> at py4j.GatewayConnection.run(GatewayConnection.java:214)
>>> at java.lang.Thread.run(Thread.java:745)
>>
>> When I change our call to:
>>
>> df = spark.hiveContext.read \
>>     .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat') \
>>     .load('df_in.csv')
>>
>> there is no such issue. I was under the impression (obviously wrongly) that Spark would automatically pick the local lib. We have the Databricks library because other jobs still explicitly call it.
>>
>> Is the 'correct answer' to go through and modify our jobs so as to remove the Databricks lib, and remove it from our deploy? Or should this just work?
>>
>> One of the things I find less helpful in the Spark docs is when there are multiple ways to do something but no clear guidance on what each method is intended to accomplish.
>>
>> Thanks!
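For reference, a minimal sketch of the two unambiguous ways to read the file while both CSV sources are still on the classpath. This is untested and assumes Spark 2.x; the app name "csv-source-disambiguation" is made up for the example, while the paths and class names are taken from the trace above:

# Untested sketch, Spark 2.x / PySpark. Assumes both the built-in CSV source
# and the external spark-csv package are on the classpath, as in the error above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-source-disambiguation").getOrCreate()

# Option 1: pin the built-in Spark 2.x source by its fully qualified class name.
df_builtin = spark.read \
    .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat') \
    .load('out/df_in.csv')

# Option 2: pin the external Databricks package instead.
df_databricks = spark.read \
    .format('com.databricks.spark.csv') \
    .load('out/df_in.csv')

# Once the spark-csv package is dropped from the deploy (e.g. removed from
# --packages), only one source registers for "csv" and the original short-name
# call should work again:
# df = spark.read.csv('out/df_in.csv')

Pinning a fully qualified name sidesteps the short-name lookup entirely, which is exactly the step that fails when two registered data sources both claim "csv".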