Yes, thanks, great. That seems to be the issue.
In any case, running it with spark-submit works as well.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Using-Hadoop-InputFormat-in-Python-tp12067p12126.html
Sent from the Apache Spark User List mailing list archive
Good timing! I encountered that same issue recently and to address it, I
changed the default Class.forName call to Utils.classForName. See my patch
at https://github.com/apache/spark/pull/1916. After that change, my
bin/pyspark --jars worked.
On Wed, Aug 13, 2014 at 11:47 PM, Tassilo Klein wrote:
Thanks. That already helped a bit. But the examples don't use custom
InputFormats; rather, they use fully qualified org.apache InputFormats. If I
want to use my own custom InputFormat in the form of a .class (or .jar) file,
how can I use it? I tried providing it to pyspark with --jars and then using
sc.newAPIHadoopRDD.
Tassilo, newAPIHadoopRDD has been added to PySpark in master and the
yet-to-be-released 1.1 branch. It allows you to specify your custom
InputFormat. Examples of using it include hbase_inputformat.py and
cassandra_inputformat.py in examples/src/main/python. Check it out.
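Roughly, a call might look like the sketch below (the class name
com.example.MyInputFormat, the input path, and the configuration key are
placeholders, and it assumes the jar providing the format was passed in via
--jars):

    # Launched with something like: bin/pyspark --jars my-custom-inputformat.jar
    # `sc` is the SparkContext the pyspark shell provides.
    # com.example.MyInputFormat is a made-up placeholder for a custom format
    # that emits Text keys and Text values.
    rdd = sc.newAPIHadoopRDD(
        inputFormatClass="com.example.MyInputFormat",
        keyClass="org.apache.hadoop.io.Text",
        valueClass="org.apache.hadoop.io.Text",
        # Hadoop 2 key for the input path; adjust to whatever your format expects.
        conf={"mapreduce.input.fileinputformat.inputdir": "hdfs:///path/to/input"})

    # PySpark converts common Writables such as Text to Python strings,
    # so this yields an RDD of (key, value) string pairs.
    print(rdd.take(5))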
On Wed, Aug 13, 2014 at 3:12 PM,
Yes, that somehow seems logical. But where / how do I pass the InputFormat
definition (.jar/.java/.class) to Spark?
I mean, when using Hadoop I need to call something like 'hadoop jar ...
-inFormat ...' plus other arguments to register the file format definition.
I'm not that familiar with the Python APIs, but you should be able to
configure a Job object with your custom InputFormat and pass the required
configuration (i.e. job.getConfiguration()) to newAPIHadoopRDD to get the
required RDD.
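In PySpark, a rough equivalent is to pass the Hadoop configuration as a dict
of string properties via the conf argument; a sketch, where the class names
and property keys are placeholders:

    # Whatever you would have set on job.getConfiguration() in Java/Scala
    # can be passed as a plain dict of string properties.
    conf = {
        "mapreduce.input.fileinputformat.inputdir": "hdfs:///data/events",
        "com.example.myinputformat.delimiter": "\t",   # hypothetical format-specific option
    }

    rdd = sc.newAPIHadoopRDD(
        inputFormatClass="com.example.MyInputFormat",  # hypothetical custom format
        keyClass="org.apache.hadoop.io.LongWritable",
        valueClass="org.apache.hadoop.io.Text",
        conf=conf)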
On Wed, Aug 13, 2014 at 2:59 PM, Tassilo Klein wrote:
> Hi,
>
> I'