Spark RDD.take() generating duplicates for AvroData

2016-04-15 Thread Anoop Shiralige
Hi All, I have some Avro data, which I am reading in the following way: val data = sc.newAPIHadoopFile(file, classOf[AvroKeyInputFormat[GenericRecord]], classOf[AvroKey[GenericRecord]], classOf[NullWritable]).map(_._1.datum) But when I try to print the data, it is generating duplica

Re: PySpark : couldn't pickle object of type class T

2016-02-26 Thread Anoop Shiralige
> > https://github.com/databricks/spark-avro > > > On Thu, Feb 11, 2016 at 10:38 PM, Anoop Shiralige < > anoop.shiral...@gmail.com> wrote: > >> Hi All, >> >> I am working with Spark 1.6.0 and pySpark shell specifically. I have an >> JavaRDD[org.apa

PySpark : couldn't pickle object of type class T

2016-02-11 Thread Anoop Shiralige
Hi All, I am working with Spark 1.6.0, specifically the PySpark shell. I have a JavaRDD[org.apache.avro.GenericRecord] which I have converted to a pythonRDD in the following way. javaRDD = sc._jvm.java.package.loadJson("path to data", sc._jsc) javaPython = sc._jvm.SerDe.javaToPython(javaRDD) from

Unexpected element type class

2016-02-07 Thread Anoop Shiralige
Hi All, I have written some functions in Scala, which I want to expose in PySpark (interactively, Spark 1.6.0). The Scala function (loadAvro) returns a JavaRDD[AvroGenericRecord]. AvroGenericRecord is my wrapper class over the /org.apache.avro.generic.GenericRecord/. I am trying to convert this

DecisionTree Algorithm used in Spark MLLib

2014-12-29 Thread Anoop Shiralige
Hi All, I am trying to compare a model built locally using R with one built on a cluster using Spark. There is some difference in the results. Any idea what the internal implementation of Decision Tree in Spark MLLib is (ID3, C4.5, C5.0, or CART)? Thanks, Anoop Shiralige