Re: Running spark in standalone mode, saveAsTextFile() runs forever

2014-11-26 Thread lalit1303
Try repartitioning the RDD to, say, 2x the number of available cores before calling saveAsTextFile. - Lalit Yadav la...@sigmoidanalytics.com
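A minimal sketch of that suggestion, assuming a SparkContext named sc; the core count and paths are placeholders:

    val numCores = 8                          // placeholder for the cores available to the job
    val lines = sc.textFile("hdfs:///input")  // placeholder input
    lines.repartition(2 * numCores)           // roughly 2x the available cores, as suggested
         .saveAsTextFile("hdfs:///output")    // placeholder output path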

Re: Lifecycle of RDD in spark-streaming

2014-11-26 Thread lalit1303
Hi Mukesh, Once you create a streaming job, a DAG is created which contains your job plan, i.e. all map transformations and all action operations to be performed on each batch of the streaming application. So, once your job is started, the input DStream takes the data input from the specified source and all
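For reference, a minimal sketch of how such a job is wired up, assuming a socket source on a hypothetical host/port; the transformations here only stand in for the per-batch plan described above:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("StreamingLifecycleSketch")  // master supplied via spark-submit
    val ssc = new StreamingContext(conf, Seconds(10))                  // batch interval
    val input = ssc.socketTextStream("localhost", 9999)                // placeholder source
    val counts = input.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)  // per-batch transformations
    counts.print()                                                     // action applied to every batch
    ssc.start()                                                        // the DAG is run on each incoming batch
    ssc.awaitTermination()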

Re: S3NativeFileSystem inefficient implementation when calling sc.textFile

2014-11-26 Thread lalit1303
You can try creating a Hadoop Configuration and setting the S3 configuration in it, i.e. access keys etc. Then, for reading files from S3, use newAPIHadoopFile and pass the config object along with the key and value classes. - Lalit Yadav la...@sigmoidanalytics.com
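A hedged sketch of that approach, assuming a SparkContext named sc; the bucket path and credentials are placeholders, and the s3n property names reflect the Hadoop versions in use at the time:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    val hadoopConf = new Configuration()
    hadoopConf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")      // placeholder
    hadoopConf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")  // placeholder

    // Read the S3 files through the new Hadoop API, passing the config and key/value classes.
    val lines = sc.newAPIHadoopFile(
        "s3n://your-bucket/path",            // placeholder path
        classOf[TextInputFormat],
        classOf[LongWritable],
        classOf[Text],
        hadoopConf)
      .map(_._2.toString)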

Re: Does Spark work on multicore systems?

2014-11-09 Thread lalit1303
While creating the SparkConf, set the variable "spark.cores.max" to the maximum number of cores to be used by the Spark job. By default it is set to 1. - Lalit Yadav la...@sigmoidanalytics.com
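A minimal sketch of setting that property, assuming the application builds its own SparkConf; the app name and core count are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("MulticoreJob")        // placeholder app name
      .set("spark.cores.max", "8")       // cap on the total cores the job may use
    val sc = new SparkContext(conf)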

Re: How to add elements into map?

2014-11-07 Thread lalit1303
It doesn't work that way. Following is the correct way: val table = sc.textFile(args(1)) val histMap = table.map(x => (x.split('|')(0).toInt, 1)) - Lalit Yadav la...@sigmoidanalytics.com
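Laid out as a block, with a hypothetical follow-up step (not in the original reply) that tallies the pairs per key on the driver:

    // Build (key, 1) pairs from a '|'-delimited text file; args(1) is the input path.
    val table = sc.textFile(args(1))
    val histMap = table.map(x => (x.split('|')(0).toInt, 1))

    // Hypothetical follow-up: count occurrences per key and return them as a driver-side Map.
    val counts: scala.collection.Map[Int, Long] = histMap.countByKey()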

Re: Manipulating RDDs within a DStream

2014-10-30 Thread lalit1303
Hi, Since the Cassandra object is not serializable, you can't open the connection at the driver level and access the object inside foreachRDD (i.e. at the worker level). You have to open the connection inside foreachRDD only, perform the operation, and then close the connection. For example: wordCounts.fore
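A hedged sketch of that pattern, continuing the truncated example above; CassandraConnectionFactory and writeToCassandra are hypothetical stand-ins for whatever client the original code used, and the per-partition variant is a common refinement rather than the exact code from the thread:

    // Open and close the connection on the workers, never on the driver.
    wordCounts.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val connection = CassandraConnectionFactory.open()                 // hypothetical client helper
        records.foreach(record => writeToCassandra(connection, record))    // hypothetical write call
        connection.close()
      }
    }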

Re: Spark streaming at-least once guarantee

2014-08-06 Thread lalit1303
Hi TD, Thanks a lot for your reply :) I am already looking into creating a new DStream for SQS messages. It would be very helpful if you could provide some guidance regarding the same. The main motive of integrating SQS with Spark Streaming is to make my jobs run with high availability. As of no

Re: Spark streaming at-least once guarantee

2014-08-05 Thread lalit1303
Hi Sanjeet, I have been using Spark Streaming for processing files present in S3 and HDFS. I am also using SQS messages for the same purpose as yours, i.e. a pointer to an S3 file. As of now, I have a separate SQS job which receives messages from the SQS queue and gets the corresponding file from S3. Now,

Re: Spark Streaming Json file groupby function

2014-07-30 Thread lalit1303
You can try repartition/coalesce to make the final RDD a single partition before saveAsTextFile. This should bring the content of the whole RDD into a single part file. - Lalit Yadav la...@sigmoidanalytics.com
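A minimal sketch, assuming a SparkContext named sc and placeholder paths; coalesce(1) funnels everything through one task, so it only makes sense for small final outputs:

    val result = sc.textFile("hdfs:///input")        // placeholder for the final RDD
    result.coalesce(1)                               // collapse to a single partition...
          .saveAsTextFile("hdfs:///output-single")   // ...so the output is a single part file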

Re: java.lang.StackOverflowError when calling count()

2014-07-23 Thread lalit1303
Hi, Thanks TD for your reply. I am still not able to resolve the problem for my use case. I have, let's say, 1000 different RDDs, and I am applying a transformation function on each RDD and I want the output of all the RDDs combined into a single output RDD. For this, I am doing the following: ** tempRD
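A hedged sketch of one way to express that combination, assuming a SparkContext named sc and a hypothetical transform function; the thread's exact loop is truncated above, so this only illustrates unioning many transformed RDDs in one step:

    import org.apache.spark.rdd.RDD

    // Hypothetical illustration: apply a transformation to many RDDs and combine them with one union,
    // instead of chaining unions pairwise (a long union chain grows the lineage and can overflow the stack).
    def transform(rdd: RDD[String]): RDD[String] = rdd.map(_.toUpperCase)   // placeholder transformation

    val inputs: Seq[RDD[String]] =
      (0 until 1000).map(i => sc.textFile(s"hdfs:///input/part-$i"))        // placeholder inputs

    val combined = sc.union(inputs.map(transform))                          // single union over all RDDs
    combined.saveAsTextFile("hdfs:///combined-output")                      // placeholder output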

Re: Transform pair to a new pair

2014-06-13 Thread lalit1303
Hi, You can use map functions like flatMapValues and mapValues, which will apply the map function to each pair RDD contained in your input pair DStream and return a paired DStream. On Fri, Jun 13, 2014 at 8:48 AM, ryan_seq [via Apache Spark User List] < ml-node+s1001560n7550...@n3.nabble.com> wr
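A minimal sketch of mapValues/flatMapValues on a pair DStream, assuming an existing DStream[(String, Int)] named pairStream (pair operations come in via the streaming implicits); the transformations are placeholders:

    val doubled  = pairStream.mapValues(_ * 2)                    // transform only the values, keys unchanged
    val expanded = pairStream.flatMapValues(v => Seq(v, v + 1))   // one value can become several, same key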

Re: java.lang.StackOverflowError when calling count()

2014-05-14 Thread lalit1303
If we do cache() + count() after say every 50 iterations, the whole process becomes very slow. I have tried checkpoint(), cache() + count(), and saveAsObjectFiles(). Nothing works. Materializing RDDs leads to a drastic decrease in performance, and if we don't materialize, we face a StackOverflowError. On W
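For context, a hedged sketch of the kind of iterative loop being described, with periodic materialization every 50 iterations; the update step, interval, and paths are placeholders, not the code from the thread:

    var current = sc.textFile("hdfs:///input").map(_.length)   // placeholder starting RDD
    sc.setCheckpointDir("hdfs:///checkpoints")                  // placeholder checkpoint dir

    for (i <- 1 to 1000) {
      current = current.map(_ + 1)            // placeholder per-iteration transformation
      if (i % 50 == 0) {                      // every 50 iterations, truncate the lineage
        current.cache()
        current.checkpoint()
        current.count()                       // force materialization so the checkpoint is written
      }
    }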

Re: Pig on Spark

2014-04-23 Thread lalit1303
Hi, We got Spork working on Spark 0.9.0. Repository available at: https://github.com/sigmoidanalytics/pig/tree/spork-hadoopasm-fix Please share your feedback. - Lalit Yadav la...@sigmoidanalytics.com

Re: Pig on Spark

2014-03-25 Thread lalit1303
Hi, I have been following Aniket's Spork github repository: https://github.com/aniket486/pig I have made all the changes mentioned in the recently modified pig-spark file. I am using: hadoop 2.0.5 alpha, spark-0.8.1-incubating, mesos 0.16.0. ##PIG variables export HADOOP_CONF_DIR=$HADOOP_INSTALL/etc/