    df = spark.sqlContext.read.csv('out/df_in.csv')
    17/05/09 15:51:29 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
    17/05/09 15:51:29 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
    17/05/09 15:51:30 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException

    Py4JJavaError: An error occurred while calling o72.csv.
    : java.lang.RuntimeException: Multiple sources found for csv (com.databricks.spark.csv.DefaultSource15, org.apache.spark.sql.execution.datasources.csv.CSVFileFormat), please specify the fully qualified class name.
        at scala.sys.package$.error(package.scala:27)
        at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:591)
        at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:86)
        at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:86)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:325)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
        at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:415)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:280)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:214)
        at java.lang.Thread.run(Thread.java:745)

When I change our call to:

    df = spark.hiveContext.read \
        .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat') \
        .load('df_in.csv')

there is no such issue. I was under the impression (obviously wrongly) that Spark would automatically pick the built-in lib. We have the Databricks library because other jobs still explicitly call it. Is the 'correct answer' to go through and modify those jobs so we can remove the Databricks lib from our deploy? Or should this just work? One of the things I find less helpful in the Spark docs is when there are multiple ways to do something but no clear guidance on what each method is intended to accomplish.

Thanks!
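P.S. For concreteness, here is roughly the migration I'm contemplating for the jobs that still call the Databricks package explicitly, assuming Spark 2.x; the path and option values are placeholders for whatever each job actually uses:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName('csv-migration').getOrCreate()

    # Current style: explicitly targets the external Databricks package,
    # so it only works while the spark-csv jar is on the classpath.
    df_old = (spark.read
              .format('com.databricks.spark.csv')
              .option('header', 'true')
              .option('inferSchema', 'true')
              .load('out/df_in.csv'))

    # Spark 2.x built-in reader: the same options are supported natively.
    # Once every job reads this way, the spark-csv jar can be dropped and
    # the short name 'csv' stops being ambiguous.
    df_new = (spark.read
              .option('header', 'true')
              .option('inferSchema', 'true')
              .csv('out/df_in.csv'))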