>
> df = spark.sqlContext.read.csv('out/df_in.csv')
>

> 17/05/09 15:51:29 WARN ObjectStore: Version information not found in
> metastore. hive.metastore.schema.verification is not enabled so recording
> the schema version 1.2.0
> 17/05/09 15:51:29 WARN ObjectStore: Failed to get database default,
> returning NoSuchObjectException
> 17/05/09 15:51:30 WARN ObjectStore: Failed to get database global_temp,
> returning NoSuchObjectException
>


> Py4JJavaError: An error occurred while calling o72.csv.
> : java.lang.RuntimeException: Multiple sources found for csv
> (com.databricks.spark.csv.DefaultSource15,
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat), please
> specify the fully qualified class name.
> at scala.sys.package$.error(package.scala:27)
> at
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:591)
> at
> org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:86)
> at
> org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:86)
> at
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:325)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
> at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:415)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:280)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:214)
> at java.lang.Thread.run(Thread.java:745)


When I change our call to:

df = spark.hiveContext.read \
    .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat') \
    .load('df_in.csv')

there's no such issue. I was under the impression (obviously wrongly) that
Spark would automatically pick the built-in library. We keep the Databricks
library in our deploy because other jobs still explicitly call it.
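
For reference, the jobs that still need the Databricks reader pin it by its
documented format name, roughly like the sketch below (the path and the
header option are made up for illustration):

# Pin the external Databricks CSV source explicitly by its package name so
# there is no ambiguity with Spark's built-in 'csv' source.
df_legacy = spark.read \
    .format('com.databricks.spark.csv') \
    .option('header', 'true') \
    .load('out/df_in.csv')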

Is the 'correct' answer to go through and modify our jobs so we can remove
the Databricks lib from our deploy?  Or should this just work?
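
In the meantime, to see exactly which sources are competing for the 'csv'
short name, a sketch like the one below can poke at the driver JVM through
the private spark._jvm gateway. It mirrors the ServiceLoader lookup that
DataSource.lookupDataSource performs, but it relies on unsupported internals
(and classloader setup can vary by deploy), so treat it as a diagnostic
illustration only:

# Diagnostic only: list every DataSourceRegister implementation that
# java.util.ServiceLoader can see, which is the same mechanism Spark's
# lookupDataSource uses to resolve short names like 'csv'.
jvm = spark._jvm
register_cls = jvm.java.lang.Class.forName(
    'org.apache.spark.sql.sources.DataSourceRegister')
it = jvm.java.util.ServiceLoader.load(register_cls).iterator()
while it.hasNext():
    source = it.next()
    # shortName() is the alias each provider claims for DataFrameReader
    print(source.shortName(), '->', source.getClass().getName())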

One of the things I find less helpful in the Spark docs is that when there
are multiple ways to do something, there's no clear guidance on what each
method is intended to accomplish.

Thanks!
