I'm a bit confused by that answer; I'm assuming it's Spark that decides which lib to use.
On 9 May 2017 at 14:30, Mark Hamstra <m...@clearstorydata.com> wrote:

> This looks more like a matter for Databricks support than spark-user.
>
> On Tue, May 9, 2017 at 2:02 PM, lucas.g...@gmail.com <lucas.g...@gmail.com> wrote:
>
>> df = spark.sqlContext.read.csv('out/df_in.csv')
>>
>>> 17/05/09 15:51:29 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
>>> 17/05/09 15:51:29 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
>>> 17/05/09 15:51:30 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
>>>
>>> Py4JJavaError: An error occurred while calling o72.csv.
>>> : java.lang.RuntimeException: Multiple sources found for csv (com.databricks.spark.csv.DefaultSource15, org.apache.spark.sql.execution.datasources.csv.CSVFileFormat), please specify the fully qualified class name.
>>> at scala.sys.package$.error(package.scala:27)
>>> at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:591)
>>> at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:86)
>>> at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:86)
>>> at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:325)
>>> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
>>> at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:415)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>> at java.lang.reflect.Method.invoke(Method.java:606)
>>> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>>> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>>> at py4j.Gateway.invoke(Gateway.java:280)
>>> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>>> at py4j.commands.CallCommand.execute(CallCommand.java:79)
>>> at py4j.GatewayConnection.run(GatewayConnection.java:214)
>>> at java.lang.Thread.run(Thread.java:745)
>>
>> When I change our call to:
>>
>> df = spark.hiveContext.read \
>>     .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat') \
>>     .load('df_in.csv')
>>
>> there is no such issue. I was under the impression (obviously wrongly) that Spark would automatically pick the local lib. We have the Databricks library because other jobs still explicitly call it.
>>
>> Is the 'correct answer' to go through and modify our jobs so as to remove the Databricks lib, and remove it from our deploy? Or should this just work?
>>
>> One of the things I find less helpful in the Spark docs is when there are multiple ways to do something but no clear guidance on what each method is intended to accomplish.
>>
>> Thanks!
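For reference, a minimal sketch of the two unambiguous ways to read the file while both CSV sources are still on the classpath. This is untested and assumes Spark 2.x; the app name "csv-source-disambiguation" is made up for the example, while the paths and class names are taken from the trace above:

# Untested sketch, Spark 2.x / PySpark. Assumes both the built-in CSV source
# and the external spark-csv package are on the classpath, as in the error above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-source-disambiguation").getOrCreate()

# Option 1: pin the built-in Spark 2.x source by its fully qualified class name.
df_builtin = spark.read \
    .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat') \
    .load('out/df_in.csv')

# Option 2: pin the external Databricks package instead.
df_databricks = spark.read \
    .format('com.databricks.spark.csv') \
    .load('out/df_in.csv')

# Once the spark-csv package is dropped from the deploy (e.g. removed from
# --packages), only one source registers for "csv" and the original short-name
# call should work again:
# df = spark.read.csv('out/df_in.csv')

Pinning a fully qualified name sidesteps the short-name lookup entirely, which is exactly the step that fails when two registered data sources both claim "csv".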