Spark task default timeout

2018-06-03 Thread Shushant Arora
Hi I have a Spark application where the driver starts a few tasks, and in each task, which is a VoidFunction, I have a long-running infinite loop. I have set speculative execution to false. Will Spark kill my task after some time (timeout) or will the tasks run indefinitely? If tasks will be killed after some

Re: Spark on EMR suddenly stalling

2017-12-29 Thread Shushant Arora
you may have to recreate your cluster with below configuration at emr creation "Configurations": [ { "Properties": { "maximizeResourceAllocation": "false" }, "Classification": "spark" } ] On Fri

is persistence required for a single action?

2017-02-07 Thread Shushant Arora
Hi I have a workflow like below: rdd1 = sc.textFile(input); rdd2 = rdd1.filter(filterfunc1); rdd3 = rdd1.filter(filterfunc2); rdd4 = rdd2.map(maptrans1); rdd5 = rdd3.map(maptrans2); rdd6 = rdd4.union(rdd5); rdd6.foreach(some transformation); 1. Do I need to persist
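
A minimal sketch of the caching decision for this DAG (filterfunc1/2 and maptrans1/2 are the hypothetical functions above, assuming a JavaSparkContext sc and the usual Spark Java imports): rdd1 is the only RDD read by two branches, so it is the only one worth persisting for a single action; rdd2-rdd5 are each consumed once.

    JavaRDD<String> rdd1 = sc.textFile(input);
    rdd1.persist(StorageLevel.MEMORY_AND_DISK());      // reused by both filter branches below

    JavaRDD<String> rdd2 = rdd1.filter(filterfunc1);   // branch 1
    JavaRDD<String> rdd3 = rdd1.filter(filterfunc2);   // branch 2
    JavaRDD<String> rdd4 = rdd2.map(maptrans1);
    JavaRDD<String> rdd5 = rdd3.map(maptrans2);
    JavaRDD<String> rdd6 = rdd4.union(rdd5);

    rdd6.foreach(someAction);                          // single action: rdd2..rdd5 are each read once
    rdd1.unpersist();                                  // free the cached blocks once done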

Re: spark narrow vs wide dependency

2017-01-26 Thread Shushant Arora
3. Also, can mapPartitions go out of memory if I return the ArrayList of the whole partition after processing the partition? What's the alternative to this if it can fail? On Fri, Jan 27, 2017 at 9:32 AM, Shushant Arora wrote: > Hi > > I have two transformations in series.

spark narrow vs wide dependency

2017-01-26 Thread Shushant Arora
Hi I have two transformations in series. rdd1 = sourcerdd.map(new Function(...)); //step1 rdd2 = rdd1.mapPartitions(new Function(...)); //step2 1. Are map and mapPartitions narrow dependencies? Does Spark optimise the DAG and execute step 1 and step 2 in a single stage, or will there be two stages? B
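
A sketch of how the DAG splits here (parseFunc/batchFunc are hypothetical stand-ins for the functions above): map and mapPartitions are both narrow dependencies, so Spark pipelines them into one stage; a second stage only appears when a wide dependency introduces a shuffle.

    JavaRDD<String> rdd1 = sourcerdd.map(parseFunc);         // step 1: narrow
    JavaRDD<String> rdd2 = rdd1.mapPartitions(batchFunc);    // step 2: narrow -> pipelined into the same stage

    // only a wide dependency forces a shuffle boundary and a new stage, e.g.:
    JavaRDD<String> rdd3 = rdd2.repartition(30);             // wide -> stage 2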

Re: spark streaming with kinesis

2016-11-20 Thread Shushant Arora
> > // maropu > > > On Mon, Nov 14, 2016 at 11:20 PM, Shushant Arora < > shushantaror...@gmail.com> wrote: > >> 1.No, I want to implement low level consumer on kinesis stream. >> so need to stop the worker once it read the latest sequence number sent >> by

Re: spark streaming with kinesis

2016-11-14 Thread Shushant Arora
r usecase? > > On Mon, Nov 14, 2016 at 10:23 PM, Shushant Arora < > shushantaror...@gmail.com> wrote: > >> Thanks! >> Is there a way to get the latest sequence number of all shards of a >> kinesis stream? >> >> >> >> On Mon, Nov 14, 2016 at

Re: spark streaming with kinesis

2016-11-14 Thread Shushant Arora
gh, > it is not configurable in the current implementation. > > The detail can be found in; > https://github.com/apache/spark/blob/master/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisReceiver.scala#L152 > > // maropu > > >

receiver based spark streaming doubts

2016-11-13 Thread Shushant Arora
Hi In spark streaming based on receivers - when the receiver gets data and stores it in blocks for workers to process, how many blocks does the receiver give to the worker? Say I have a streaming app with a 30 sec batch interval, what will happen: 1. for the first batch (first 30 sec) there will not be any data for wo

spark streaming with kinesis

2016-11-12 Thread Shushant Arora
Hi, is spark.streaming.blockInterval for a kinesis input stream hardcoded to 1 sec or is it configurable? It is the time interval at which the receiver fetches data from kinesis. That means the stream batch interval cannot be less than spark.streaming.blockInterval, and this should be configurable. Also is the

Re: spark streaming with kinesis

2016-11-06 Thread Shushant Arora
; Also, we currently cannot disable the interval checkpoints. > > On Tue, Oct 25, 2016 at 11:53 AM, Shushant Arora < > shushantaror...@gmail.com> wrote: > >> Thanks! >> >> Is kinesis streams are receiver based only? Is there non receiver based >> consumer for

Re: spark streaming with kinesis

2016-10-24 Thread Shushant Arora
. > However, all the executors that have the replicated data crash, > IIUC the dataloss occurs. > > // maropu > > On Mon, Oct 24, 2016 at 4:43 PM, Shushant Arora > wrote: > >> Does spark streaming consumer for kinesis uses Kinesis Client Library >> and mandates

spark streaming with kinesis

2016-10-24 Thread Shushant Arora
Does the spark streaming consumer for kinesis use the Kinesis Client Library and mandate checkpointing the sequence number of shards in DynamoDB? Will it lead to data loss if consumed data records are not yet processed and kinesis checkpointed the consumed sequence numbers in DynamoDB and the spark worker

detecting last record of partition

2016-10-13 Thread Shushant Arora
Hi I have a transformation on a pair rdd using a flatMap function. 1. Can I detect in flatMap whether the current record is the last record of the partition being processed, and 2. what is the partition index of this partition? public Iterable> call(Tuple2 t) throws Exception { //whether element is last ele
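
One hedged way to get both pieces of information is to use mapPartitionsWithIndex instead of a plain flatMap: the function receives the partition index directly, and the last record is simply the element for which the iterator has nothing left. A sketch, assuming a JavaPairRDD<String, String>, the usual Spark Java imports, and a hypothetical process() helper:

    JavaRDD<String> out = pairRdd.mapPartitionsWithIndex(
        new Function2<Integer, Iterator<Tuple2<String, String>>, Iterator<String>>() {
          @Override
          public Iterator<String> call(Integer partitionIndex,
                                       Iterator<Tuple2<String, String>> records) {
            List<String> result = new ArrayList<String>();
            while (records.hasNext()) {
              Tuple2<String, String> t = records.next();
              boolean lastInPartition = !records.hasNext();   // nothing left after this one
              result.add(process(t, partitionIndex, lastInPartition));
            }
            return result.iterator();
          }
        }, false);

If the per-partition ArrayList is too large to hold in memory, returning a custom Iterator that wraps the input iterator (instead of materialising the whole list) avoids that risk.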

spark streaming minimum batch interval

2016-09-29 Thread Shushant Arora
Hi I want to enquire whether spark streaming has a limitation of 500ms for the batch interval? Is Storm better than spark streaming for real time (for latency of just 50-100ms)? In spark streaming can parallel batches be run? If yes, is it supported at production level? Thanks

spark persistence doubt

2016-09-28 Thread Shushant Arora
Hi I have a flow like below 1. rdd1 = somesource.transform(); 2. transformedrdd1 = rdd1.transform(..); 3. transformedrdd2 = rdd1.transform(..); 4. transformedrdd1.action(); Do I need to persist rdd1 to optimise steps 2 and 3? Or, since there is no lineage breakage, will it work without persist? Thank

Re: spark on yarn

2016-05-21 Thread Shushant Arora
3. And does the same behavior apply to a streaming application also? On Sat, May 21, 2016 at 7:44 PM, Shushant Arora wrote: > And will it allocate the rest of the executors when other containers get freed which > were occupied by other hadoop jobs/spark applications? > > And is there any m

Re: spark on yarn

2016-05-21 Thread Shushant Arora
: YARN is a lot more unforgiving about > memory use than it is about CPU > > > On 20 Apr 2016, at 16:21, Shushant Arora > wrote: > > > > I am running a spark application on yarn cluster. > > > > say I have available vcors in cluster as 100.And I start spark

spark on yarn

2016-04-20 Thread Shushant Arora
I am running a spark application on a yarn cluster. Say I have 100 vcores available in the cluster, and I start the spark application with --num-executors 200 --executor-cores 2 (so I need a total of 200*2=400 vcores) but in my cluster only 100 are available. What will happen? Will the job abort or will it be subm

spark stages in parallel

2016-02-17 Thread Shushant Arora
Can two stages of a single job run in parallel in spark? e.g. one stage is a map transformation and another is a repartition on the mapped rdd: rdd.map(function,100).repartition(30); can it happen that while the map transformation is running 100 tasks, after a few of them, say 10, are finished, spark starte

Re: spark rdd grouping

2015-12-25 Thread Shushant Arora
Hi I have created a jira for this feature https://issues.apache.org/jira/browse/SPARK-12524 Please vote for this feature if it's necessary. I would like to implement this feature. Thanks Shushant On Wed, Dec 2, 2015 at 1:14 PM, Rajat Kumar wrote: > What if I don't have to use aggregate function onl

hbase Put object kryo serialisation error

2015-12-09 Thread Shushant Arora
Hi I have a JavaPairRDD pairrdd. When I do rdd.persist(StorageLevel.MEMORY_AND_DISK()) it throws an exception: com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 100 Serialization trace: familyMap(org.apache.hadoop.hbase.client.Put) I have registered Put and TreeMap in sp
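
That Kryo class-ID error usually means some class reached the serializer without being registered (or the driver and executors registered classes in a different order, so the IDs disagree). A hedged sketch of registering Put and its nested field types up front via SparkConf:

    SparkConf conf = new SparkConf()
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        // register every class Kryo will meet inside a Put, not just Put itself
        .registerKryoClasses(new Class<?>[] {
            org.apache.hadoop.hbase.client.Put.class,
            java.util.TreeMap.class
        });

If the error keeps pointing at classes nested inside Put, it is often simpler to keep Put objects out of persisted RDDs altogether and build them inside foreachPartition just before writing.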

Re: parquet file doubts

2015-12-07 Thread Shushant Arora
Singh, Abhijeet wrote: > Yes, Parquet has min/max. > > > > *From:* Cheng Lian [mailto:l...@databricks.com] > *Sent:* Monday, December 07, 2015 11:21 AM > *To:* Ted Yu > *Cc:* Shushant Arora; user@spark.apache.org > *Subject:* Re: parquet file doubts > > > > O

Re: spark shuffle

2015-11-22 Thread Shushant Arora
partitioner and make all values of same key on same node and number of partitions to be equal to number of distinct keys. On Sat, Nov 21, 2015 at 11:21 PM, Shushant Arora wrote: > Hi > > I have few doubts > > 1.does > rdd.saveasNewAPIHadoopFile(outputdir,keyclass,valueclass,ouputformat &

spark shuffle

2015-11-21 Thread Shushant Arora
Hi I have a few doubts. 1. Does rdd.saveAsNewAPIHadoopFile(outputdir, keyclass, valueclass, outputformat class) shuffle data, or will it always create the same number of files in the output dir as the number of partitions in the rdd? 2. How to use multiple outputs in saveAsNewAPIHadoopFile to have file names generated fro
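
saveAsNewAPIHadoopFile itself does not shuffle: it writes one output file per existing RDD partition. If all values of a key need to land in the same file, a hedged sketch is to partition explicitly first (the Text key/value types and partition count are illustrative):

    JavaPairRDD<Text, Text> partitioned =
        pairRdd.partitionBy(new HashPartitioner(numDistinctKeys));   // this is the (only) shuffle

    partitioned.saveAsNewAPIHadoopFile(
        outputDir,
        Text.class,
        Text.class,
        TextOutputFormat.class);    // still one part file per partition

Deriving the file name from the key itself is a job for a custom OutputFormat (MultipleOutputs-style), not something saveAsNewAPIHadoopFile does on its own.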

Re: spark streaming 1.3 with kafka connection timeout

2015-09-10 Thread Shushant Arora
rebalance... why did you say " I am getting Connection tmeout in my code." > > You've asked questions about this exact same situation before, the answer > remains the same > > On Thu, Sep 10, 2015 at 9:44 AM, Shushant Arora > wrote: > >> Stack trace is >&g

Re: spark streaming 1.3 with kafka connection timeout

2015-09-10 Thread Shushant Arora
15 at 6:58 PM, Cody Koeninger wrote: > Post the actual stacktrace you're getting > > On Thu, Sep 10, 2015 at 12:21 AM, Shushant Arora < > shushantaror...@gmail.com> wrote: > >> Executors in spark streaming 1.3 fetch messages from kafka in batches and >> what ha

spark streaming 1.3 with kafka connection timeout

2015-09-09 Thread Shushant Arora
Executors in spark streaming 1.3 fetch messages from kafka in batches and what happens when executor takes longer time to complete a fetch batch say in directKafkaStream.foreachRDD(new Function, Void>() { @Override public Void call(JavaRDD v1) throws Exception { v1.foreachPartition(new VoidFun

Re: repartition on direct kafka stream

2015-09-04 Thread Shushant Arora
ady given is correct. You shouldn't doubt this, because > you've already seen the shuffle data change accordingly. > > On Fri, Sep 4, 2015 at 11:25 AM, Shushant Arora > wrote: > >> But Kafka stream has underlyng RDD which consists of offsets reanges >> only- so how

Re: repartition on direct kafka stream

2015-09-04 Thread Shushant Arora
repartition and shuffled. On Fri, Sep 4, 2015 at 10:24 AM, Saisai Shao wrote: > Yes not the offset ranges, but the real data will be shuffled when you > using repartition(). > > Thanks > Saisai > > On Fri, Sep 4, 2015 at 12:42 PM, Shushant Arora > wrote: > >>

repartition on direct kafka stream

2015-09-03 Thread Shushant Arora
1. Does repartitioning on a direct kafka stream shuffle only the offsets or the actual kafka messages across executors? Say I have a direct kafka stream directKafkaStream.repartition(numexecutors).mapPartitions(new FlatMapFunction>, String>(){ ... } Say originally I have 5*numexecutors partitions in kafka
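
What moves in that repartition is the fetched records themselves, not just offset metadata: the direct stream's RDD starts with one partition per Kafka partition (offset ranges only), but once repartition() runs, the consumed messages are shuffled across executors. A sketch of the two steps (handle() is hypothetical; 1.3-era Java API where FlatMapFunction returns an Iterable):

    directKafkaStream                        // 5 * numexecutors Kafka partitions -> same number of RDD partitions
        .repartition(numexecutors)           // full shuffle of the consumed messages
        .mapPartitions(new FlatMapFunction<Iterator<Tuple2<String, String>>, String>() {
          @Override
          public Iterable<String> call(Iterator<Tuple2<String, String>> records) {
            List<String> out = new ArrayList<String>();
            while (records.hasNext()) {
              out.add(handle(records.next()));
            }
            return out;
          }
        })
        .print();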

Re: spark streaming 1.3 with kafka

2015-09-01 Thread Shushant Arora
s > down, and scripting monitoring / restarting of your job. > > On Tue, Sep 1, 2015 at 11:19 AM, Shushant Arora > wrote: > >> Since in my app , after processing the events I am posting the events to >> some external server- if external server is down - I want to bac

Re: spark streaming 1.3 with kafka

2015-09-01 Thread Shushant Arora
Sep 1, 2015 at 8:57 PM, Cody Koeninger wrote: > Honestly I'd concentrate more on getting your batches to finish in a > timely fashion, so you won't even have the issue to begin with... > > On Tue, Sep 1, 2015 at 10:16 AM, Shushant Arora > wrote: > >> What if I use c

Re: spark streaming 1.3 with kafka

2015-09-01 Thread Shushant Arora
ompute is called, things are going to get out of whack. > > e.g. checkpoints are no longer going to correspond to what you're actually > processing > > On Tue, Sep 1, 2015 at 10:04 AM, Shushant Arora > wrote: > >> can I reset the range based on some condition - be

Re: spark streaming 1.3 with kafka

2015-09-01 Thread Shushant Arora
09 PM, Cody Koeninger wrote: > It's at the time compute() gets called, which should be near the time the > batch should have been queued. > > On Tue, Sep 1, 2015 at 8:02 AM, Shushant Arora > wrote: > >> Hi >> >> In spark streaming 1.3 with kafka- when does

spark streaming 1.3 with kafka

2015-09-01 Thread Shushant Arora
Hi In spark streaming 1.3 with kafka - when does the driver bring the latest offsets of this run - at the start of each batch or at the time when the batch gets queued? Say a few of my batches take longer to complete than their batch interval, so some batches will go into the queue. Will the driver wait for queued ba

Re: spark streaming 1.3 kafka topic error

2015-08-31 Thread Shushant Arora
it's > better to figure out what's going on with kafka. > > On Wed, Aug 26, 2015 at 9:07 PM, Shushant Arora > wrote: > >> Hi >> >> My streaming application gets killed with below error >> >> 5/08/26 21:55:20 ERROR

Re: spark streaming 1.3 kafka buffer size

2015-08-26 Thread Shushant Arora
unlimited again . On Wed, Aug 26, 2015 at 9:32 PM, Cody Koeninger wrote: > see http://kafka.apache.org/documentation.html#consumerconfigs > > fetch.message.max.bytes > > in the kafka params passed to the constructor > > > On Wed, Aug 26, 2015 at 10:39 AM, Shushant Arora <
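
A sketch of passing that consumer setting through the kafkaParams handed to the direct stream (broker list, topic and the 1 MB value are illustrative; kafka.serializer.StringDecoder is the usual decoder for String payloads):

    Map<String, String> kafkaParams = new HashMap<String, String>();
    kafkaParams.put("metadata.broker.list", "broker1:9092,broker2:9092");
    kafkaParams.put("fetch.message.max.bytes", "1048576");   // cap per fetch request per partition

    JavaPairInputDStream<String, String> stream = KafkaUtils.createDirectStream(
        jssc,
        String.class, String.class,
        StringDecoder.class, StringDecoder.class,
        kafkaParams,
        new HashSet<String>(Arrays.asList("mytopic")));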

spark streaming 1.3 kafka topic error

2015-08-26 Thread Shushant Arora
Hi My streaming application gets killed with below error 5/08/26 21:55:20 ERROR kafka.DirectKafkaInputDStream: ArrayBuffer(kafka.common.NotLeaderForPartitionException, kafka.common.NotLeaderForPartitionException, kafka.common.NotLeaderForPartitionException, kafka.common.NotLeaderForPartitionExcep

spark streaming 1.3 kafka buffer size

2015-08-26 Thread Shushant Arora
What's the default buffer in spark streaming 1.3 for kafka messages? Say in this run it has to fetch messages from offset 1 to 1. Will it fetch all in one go, or does it internally fetch the messages in smaller batches? Is there any setting to configure the number of offsets fetched in one batch?

Re: spark streaming 1.3 kafka error

2015-08-22 Thread Shushant Arora
t; > > On Sat, Aug 22, 2015 at 7:24 PM, Akhil Das > wrote: > >> Can you try some other consumer and see if the issue still exists? >> On Aug 22, 2015 12:47 AM, "Shushant Arora" >> wrote: >> >>> Exception comes when client has so many connec

Re: spark streaming 1.3 kafka error

2015-08-22 Thread Shushant Arora
: > Can you try some other consumer and see if the issue still exists? > On Aug 22, 2015 12:47 AM, "Shushant Arora" > wrote: > >> Exception comes when client has so many connections to some another >> external server also. >> So I think Exception is coming b

Re: spark streaming 1.3 kafka error

2015-08-22 Thread Shushant Arora
task ? Or is it created once only and that is getting closed somehow ? On Sat, Aug 22, 2015 at 9:41 AM, Shushant Arora wrote: > it comes at start of each tasks when there is new data inserted in kafka.( > data inserted is very few) > kafka topic has 300 partitions - data inserted

Re: spark streaming 1.3 kafka error

2015-08-21 Thread Shushant Arora
tat to see what's going on with those ports while the job >> is running, or tcpdump if you need to. >> >> If you can't figure out what's going on from a networking point of view, >> post a minimal reproducible code sample that demonstrates the issu

spark streaming 1.3 kafka error

2015-08-21 Thread Shushant Arora
Hi Getting the below error in spark streaming 1.3 while consuming from kafka using the direct kafka stream. A few tasks are failing in each run. What is the reason/solution for this error? 15/08/21 08:54:54 ERROR executor.Executor: Exception in task 262.0 in stage 130.0 (TID 16332) java.io.EOF

Re: spark streaming 1.3 doubts(force it to not consume anything)

2015-08-19 Thread Shushant Arora
will try scala. Only Reason of not using scala is - never worked on that. On Wed, Aug 19, 2015 at 7:34 PM, Cody Koeninger wrote: > Is there a reason not to just use scala? It's not a lot of code... and > it'll be even less code in scala ;) > > On Wed, Aug 19, 2015 at 4

Re: spark streaming 1.3 doubts(force it to not consume anything)

2015-08-19 Thread Shushant Arora
overriding DirectKafkaInputDStream ? On Wed, Aug 19, 2015 at 12:45 AM, Shushant Arora wrote: > But KafkaRDD[K, V, U, T, R] is not subclass of RDD[R] as per java generic > inheritance is not supported so derived class cannot return different > genric typed subclass from overriden method. &

Re: spark streaming 1.3 doubts(force it to not consume anything)

2015-08-18 Thread Shushant Arora
ass of RDD > > On Tue, Aug 18, 2015 at 1:12 PM, Shushant Arora > wrote: > >> Is it that in scala its allowed for derived class to have any return type >> ? >> >> And streaming jar is originally created in scala so its allowed for >> DirectKafkaInputDSt

Re: spark streaming 1.3 doubts(force it to not consume anything)

2015-08-18 Thread Shushant Arora
Is it that in scala it's allowed for a derived class to have any return type? And the streaming jar is originally created in scala, so it's allowed for DirectKafkaInputDStream's compute method to return Option[KafkaRDD[K, V, U, T, R]]? On Tue, Aug 18, 2015 at 8:36 PM, Shushant Arora wrote: > look

Re: spark streaming 1.3 doubts(force it to not consume anything)

2015-08-18 Thread Shushant Arora
on[RDD[T] so it should have been failed? On Tue, Aug 18, 2015 at 7:28 PM, Cody Koeninger wrote: > The superclass method in DStream is defined as returning an Option[RDD[T]] > > On Tue, Aug 18, 2015 at 8:07 AM, Shushant Arora > wrote: > >> Getting compilation error while o

Re: spark streaming 1.3 doubts(force it to not consume anything)

2015-08-18 Thread Shushant Arora
Stream > methods (the ones that take a JavaStreamingContext) > > On Mon, Aug 17, 2015 at 5:13 AM, Shushant Arora > wrote: > >> How to create classtag in java ?Also Constructor >> of DirectKafkaInputDStream takes Function1 not Function but >> kafkaut

Re: spark streaming 1.3 doubts(force it to not consume anything)

2015-08-17 Thread Shushant Arora
015 at 12:16 AM, Cody Koeninger wrote: > I'm not aware of an existing api per se, but you could create your own > subclass of the DStream that returns None for compute() under certain > conditions. > > > > On Wed, Aug 12, 2015 at 1:03 PM, Shushant Arora > wrote: > &

Re: stopping spark stream app

2015-08-12 Thread Shushant Arora
calling jssc.stop()- since that leads to deadlock. On Tue, Aug 11, 2015 at 9:54 PM, Shushant Arora wrote: > Is stopping in the streaming context in onBatchCompleted event > of StreamingListener does not kill the app? > > I have below code in streaming listener > > public voi
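
A hedged sketch of the pattern discussed later in the thread: do the stop from a separate monitor thread on the driver rather than from the StreamingListener callback itself, so the (blocking) graceful stop never runs on the listener/batch thread. stopRequested() is a hypothetical check, e.g. a marker file or an RPC flag:

    // started from the driver after jssc.start()
    new Thread(new Runnable() {
      @Override
      public void run() {
        while (!stopRequested()) {
          try { Thread.sleep(5000); } catch (InterruptedException e) { return; }
        }
        jssc.stop(true, true);   // stopSparkContext = true, stopGracefully = true; blocking, but off the listener thread
      }
    }).start();

    jssc.awaitTermination();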

Re: avoid duplicate due to executor failure in spark stream

2015-08-11 Thread Shushant Arora
m/watch?v=fXnNEq1v3VA > > > On Mon, Aug 10, 2015 at 4:32 PM, Shushant Arora > wrote: > >> Hi >> >> How can I avoid duplicate processing of kafka messages in spark stream >> 1.3 because of executor failure. >> >> 1.Can I some how access accumulat

Re: stopping spark stream app

2015-08-11 Thread Shushant Arora
jssc.stop(false,false); System.out.println("stopped gracefully"); } stopped gracefully is never printed. On the UI no more batches are processed but the application is never killed/stopped. What's the best way to kill the app after stopping the context? On Tue, Aug 11, 2015 at 2:55 AM, Shushant Aro

avoid duplicate due to executor failure in spark stream

2015-08-10 Thread Shushant Arora
Hi How can I avoid duplicate processing of kafka messages in spark streaming 1.3 because of executor failure? 1. Can I somehow access accumulators of the failed task in the retry task to skip those events which were already processed by the failed task on this partition? 2. Or will I have to persist each ms

Re: stopping spark stream app

2015-08-10 Thread Shushant Arora
ag. When you get the signal from RPC, you can just > call context.stop(stopGracefully = true) . Though note that this is > blocking, so gotta be carefully about doing blocking calls on the RPC > thread. > > On Mon, Aug 10, 2015 at 12:24 PM, Shushant Arora < > shushantaror...@gmail

Re: stopping spark stream app

2015-08-10 Thread Shushant Arora
; gracefully stop the context and terminate. This is more robust that than > leveraging shutdown hooks. > > On Mon, Aug 10, 2015 at 11:56 AM, Shushant Arora < > shushantaror...@gmail.com> wrote: > >> Any help in best recommendation for gracefully shutting down a spark >>

Re: stopping spark stream app

2015-08-10 Thread Shushant Arora
down. Runtime.getRuntime().addShutdownHook does not seem to be working. Yarn kills the application immediately and does not call the shutdown hook callback. On Sun, Aug 9, 2015 at 12:45 PM, Shushant Arora wrote: > Hi > > How to ensure in spark streaming 1.3 with kafka that when an ap

stream application map transformation constructor called

2015-08-09 Thread Shushant Arora
In a streaming application, how many times is the map transformation object created? Say I have directKafkaStream.repartition(numPartitions).mapPartitions(new FlatMapFunction_derivedclass(configs)); class FlatMapFunction_derivedclass{ FlatMapFunction_derivedclass(Config config){ } @Override publi
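
The function object above is constructed once on the driver, but a serialized copy of it travels with every task, so anything heavy held in its (non-static) fields is shipped and rebuilt for every partition of every batch. A common hedged pattern is to keep only the small, serializable Config in the instance and build the expensive resource lazily in a static field, once per executor JVM (HeavyClient and Config are hypothetical; 1.3-era FlatMapFunction returning an Iterable):

    class FlatMapFunctionDerived implements FlatMapFunction<Iterator<String>, String> {
      private final Config config;                 // small and serializable: shipped with each task
      private static HeavyClient client;           // built once per executor JVM, never serialized

      FlatMapFunctionDerived(Config config) { this.config = config; }

      private static synchronized HeavyClient getClient(Config config) {
        if (client == null) {
          client = new HeavyClient(config);        // expensive construction happens once per executor
        }
        return client;
      }

      @Override
      public Iterable<String> call(Iterator<String> records) {
        HeavyClient c = getClient(config);
        List<String> out = new ArrayList<String>();
        while (records.hasNext()) {
          out.add(c.process(records.next()));
        }
        return out;
      }
    }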

Re: stopping spark stream app

2015-08-09 Thread Shushant Arora
Hi How to ensure in spark streaming 1.3 with kafka that when an application is killed, the last running batch is fully processed and offsets are written to the checkpointing dir? On Fri, Aug 7, 2015 at 8:56 AM, Shushant Arora wrote: > Hi > > I am using spark stream 1.3 and using custom chec

Re: Spark on YARN

2015-08-08 Thread Shushant Arora
Which scheduler is on your cluster? Just check the RM UI scheduler tab and see your user and the max limit of vcores for that user; if other applications of that user have currently occupied up to that user's max vcores, then that could be the reason for not allocating vcores to this user, but for som

stopping spark stream app

2015-08-06 Thread Shushant Arora
Hi I am using spark streaming 1.3 and using a custom checkpoint to save kafka offsets. 1. Is doing Runtime.getRuntime().addShutdownHook(new Thread() { @Override public void run() { jssc.stop(true, true); System.out.println("Inside Add Shutdown Hook"); } }); to handle stop safe? 2. And

Re: Upgrade of Spark-Streaming application

2015-08-06 Thread Shushant Arora
at 9:05 AM, Shushant Arora wrote: > Hi > > For checkpointing and using fromOffsets arguments- Say for the first > time when my app starts I don't have any prev state stored and I want to > start consuming from largest offset > > 1. is it possible to specify that in fromOffset

Re: Upgrade of Spark-Streaming application

2015-08-05 Thread Shushant Arora
Hi For checkpointing and using the fromOffsets argument - say for the first time when my app starts I don't have any prev state stored and I want to start consuming from the largest offset 1. is it possible to specify that in the fromOffsets api - I don't want to use another api which returns JavaPairInputDS

spark streaming max receiver rate doubts

2015-08-03 Thread Shushant Arora
1. In spark 1.3 (non receiver) - if my batch interval is 1 sec and I don't set spark.streaming.kafka.maxRatePerPartition - is the default behaviour to bring all messages from kafka from the last offset to the current offset? Say the number of messages was large and it took 5 sec to process those, so will all jobs
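
A sketch of the rate cap for the 1.3 direct stream (the value is illustrative): with the property unset, each batch reads everything from the last consumed offset up to the latest offset, however much that is; with it set, each Kafka partition contributes at most maxRatePerPartition * batch-interval-seconds messages per batch.

    SparkConf conf = new SparkConf()
        .setAppName("kafka-direct")
        // at most 1000 messages per Kafka partition per second of batch interval
        .set("spark.streaming.kafka.maxRatePerPartition", "1000");

    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));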

spark --files permission error

2015-08-03 Thread Shushant Arora
Is there any setting to allow --files to copy a jar from the driver to executor nodes? When I pass some jar files using --files to executors and add them to the executor classpath, it throws a File not found exception: 15/08/03 07:59:50 WARN TaskSetManager: Lost task 8.0 in stage 0.0 (TID 8, ip

Re: broadcast variable and accumulators issue while spark streaming checkpoint recovery

2015-07-29 Thread Shushant Arora
y created one. > rdd.map { x => /// use accum } > } > > > On Wed, Jul 29, 2015 at 1:15 PM, Shushant Arora > wrote: > >> Hi >> >> I am using spark streaming 1.3 and using checkpointing. >> But job is failing to recover from checkpoint on restart. &g

broadcast variable and accumulators issue while spark streaming checkpoint recovery

2015-07-29 Thread Shushant Arora
Hi I am using spark streaming 1.3 and using checkpointing. But the job is failing to recover from the checkpoint on restart. For the broadcast variable it says: 1. WARN TaskSetManager: Lost task 15.0 in stage 7.0 (TID 1269, hostIP): java.lang.ClassCastException: [B cannot be cast to pkg.broadcastvariableclas

spark streaming get kafka individual message's offset and partition no

2015-07-28 Thread Shushant Arora
Hi I am processing kafka messages using spark streaming 1.3. I am using the mapPartitions function to process kafka messages. How can I access the offset number of the individual message being processed? JavaPairInputDStream directKafkaStream = KafkaUtils.createDirectStream(..); directKafkaStream.mapPa
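
The RDDs produced by the direct stream carry their offset ranges, one OffsetRange per RDD partition, so the partition index plus the record's position inside that partition gives every message's offset. A hedged sketch following the 1.3 direct-stream pattern (String key/value types and the printed format are illustrative):

    directKafkaStream.foreachRDD(new Function<JavaPairRDD<String, String>, Void>() {
      @Override
      public Void call(JavaPairRDD<String, String> rdd) {
        final OffsetRange[] ranges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();

        rdd.mapPartitionsWithIndex(
            new Function2<Integer, Iterator<Tuple2<String, String>>, Iterator<String>>() {
              @Override
              public Iterator<String> call(Integer idx, Iterator<Tuple2<String, String>> it) {
                OffsetRange range = ranges[idx];      // RDD partition idx maps to this Kafka partition
                long offset = range.fromOffset();
                List<String> out = new ArrayList<String>();
                while (it.hasNext()) {
                  Tuple2<String, String> msg = it.next();
                  out.add(range.topic() + "/" + range.partition() + "@" + offset + ": " + msg._2());
                  offset++;                           // offsets inside a range are contiguous
                }
                return out.iterator();
              }
            }, true).collect();                       // collect() only to force evaluation in this sketch
        return null;
      }
    });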

Re: spark as a lookup engine for dedup

2015-07-27 Thread Shushant Arora
ave a long running application where you want to check that > you didn't see the same value before, and check that for every value, you > probably need a key-value store, not RDD. > > On Sun, Jul 26, 2015 at 7:38 PM Shushant Arora > wrote: > >> Hi >> >> I

spark as a lookup engine for dedup

2015-07-26 Thread Shushant Arora
Hi I have a requirement for processing a large number of events while ignoring duplicates at the same time. Events are consumed from kafka and each event has an eventid. It may happen that an event was already processed and came again at some other offset. 1. Can I use a Spark RDD to persist processed events and th

spark classpath issue duplicate jar with diff versions

2015-07-24 Thread Shushant Arora
Hi I am running a spark streaming app on yarn and using apache httpasyncclient 4.1. This client jar internally has a dependency on httpcore-4.4.1.jar. This jar's (http-core.jar) old version, i.e. httpcore-4.2.5.jar, is also present in the classpath and has higher priority in the classpath (coming earlier

Re: spark streaming 1.3 issues

2015-07-22 Thread Shushant Arora
rtitions. > > > > On Tue, Jul 21, 2015 at 1:02 AM, Akhil Das > wrote: > >> I'd suggest you upgrading to 1.4 as it has better metrices and UI. >> >> Thanks >> Best Regards >> >> On Mon, Jul 20, 2015 at 7:01 PM, Shushant Arora < >>

Re: user threads in executors

2015-07-22 Thread Shushant Arora
s. > If you are using the Kafka receiver based approach (not Direct), then the > raw Kafka data is stored in the executor memory. If you are using Direct > Kafka, then it is read from Kafka directly at the time of filtering. > > TD > > On Tue, Jul 21, 2015 at 9:34 PM, Shushant

Re: user threads in executors

2015-07-21 Thread Shushant Arora
ixed size thread pool to share across items on a partition as >> opposed to spawning a future per record in the RDD for example. >> >> On Tue, Jul 21, 2015 at 4:11 AM, Shushant Arora < >> shushantaror...@gmail.com> wrote: >> >>> Hi >>> >>>

user threads in executors

2015-07-21 Thread Shushant Arora
Hi Can I create user threads in executors? I have a streaming app where, after processing, I have a requirement to push events to an external system. Each post request costs ~90-100 ms. To make the posts parallel, I cannot use the same thread because that is limited by the number of cores available in the system; can I
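
Spawning your own threads inside an executor is allowed; the main care points are bounding the pool and making sure the task does not finish before the posts do. A hedged sketch that fans the HTTP posts out over a small fixed pool per partition (postEvent() is hypothetical, and the pool size is illustrative since the work is IO-bound rather than CPU-bound):

    eventsDStream.foreachRDD(new Function<JavaRDD<String>, Void>() {
      @Override
      public Void call(JavaRDD<String> rdd) {
        rdd.foreachPartition(new VoidFunction<Iterator<String>>() {
          @Override
          public void call(Iterator<String> events) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(10);
            List<Future<?>> futures = new ArrayList<Future<?>>();
            while (events.hasNext()) {
              final String event = events.next();
              futures.add(pool.submit(new Runnable() {
                @Override
                public void run() { postEvent(event); }   // the ~90-100 ms blocking HTTP call
              }));
            }
            for (Future<?> f : futures) { f.get(); }      // wait, so the task doesn't finish early
            pool.shutdown();
          }
        });
        return null;
      }
    });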

spark streaming 1.3 coalesce on kafkadirectstream

2015-07-20 Thread Shushant Arora
Does spark streaming 1.3 launch a task for each partition offset range whether that range is 0 or not? If yes, how can I enforce it not to launch tasks for empty rdds? Not able to use coalesce on directKafkaStream. Shall we always enforce repartitioning before processing the direct stream? use case i

Re: spark streaming 1.3 issues

2015-07-20 Thread Shushant Arora
call for getting offsets of each partition separately, or does it get all partitions' new offsets in a single call? I mean, will reducing the number of partitions in kafka help improve the performance? On Mon, Jul 20, 2015 at 4:52 PM, Shushant Arora wrote: > Hi > > 1.I am using spark streamin

spark streaming 1.3 issues

2015-07-20 Thread Shushant Arora
Hi 1. I am using spark streaming 1.3 for reading from a kafka queue and pushing events to an external source. I passed 20 executors in my job but it is showing only 6 in the executor tab? When I used high-level streaming 1.2 it showed 20 executors. My cluster is a 10 node yarn cluster with each node ha

Re: spark streaming job to hbase write

2015-07-17 Thread Shushant Arora
; Cheers > > On Fri, Jul 17, 2015 at 5:15 AM, Shushant Arora > wrote: > >> Thanks ! >> >> My key is random (hexadecimal). So hot spot should not be created. >> >> Is there any concept of bulk put. Say I want to raise a one put request >> for a

Re: spark streaming job to hbase write

2015-07-17 Thread Shushant Arora
how you intend to use the data. > > So really this is less a spark issue than an HBase issue when it comes to > design. > > HTH > > -Mike > > > On Jul 15, 2015, at 11:46 AM, Shushant Arora > wrote: > > > > Hi > > > > I have a requirement of

spark streaming job to hbase write

2015-07-15 Thread Shushant Arora
Hi I have a requirement of writing to an hbase table from a Spark streaming app after some processing. Is the HBase put operation the only way of writing to hbase, or is there any specialised connector or spark rdd for hbase writes? Should bulk load to hbase from a streaming app be avoided if the output of ea
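
For plain client writes, the usual shape is one HBase connection per partition and one batched multi-Put call rather than a put per record. A hedged sketch using the HBase 1.x client API (table, column family and qualifier names are illustrative; records is the partition's iterator of key/value pairs inside a foreachRDD/foreachPartition):

    streamRdd.foreachPartition(new VoidFunction<Iterator<Tuple2<String, String>>>() {
      @Override
      public void call(Iterator<Tuple2<String, String>> records) throws Exception {
        Configuration hconf = HBaseConfiguration.create();
        Connection conn = ConnectionFactory.createConnection(hconf);
        Table table = conn.getTable(TableName.valueOf("events"));
        try {
          List<Put> puts = new ArrayList<Put>();
          while (records.hasNext()) {
            Tuple2<String, String> r = records.next();
            Put put = new Put(Bytes.toBytes(r._1()));      // random hex keys spread load, no hotspot
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("v"), Bytes.toBytes(r._2()));
            puts.add(put);
          }
          table.put(puts);                                 // one batched client call per partition
        } finally {
          table.close();
          conn.close();
        }
      }
    });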

Re: spark on yarn

2015-07-14 Thread Shushant Arora
hat queue. > > Hope this helps. > > Jong Wook > > > > On Jul 15, 2015, at 01:57, Shushant Arora > wrote: > > > > I am running spark application on yarn managed cluster. > > > > When I specify --executor-cores > 4 it fails to start the applicat

Re: spark on yarn

2015-07-14 Thread Shushant Arora
containers . And these 10 containers will be released only at end of streaming application never in between if none of them fails. On Tue, Jul 14, 2015 at 11:32 PM, Marcelo Vanzin wrote: > On Tue, Jul 14, 2015 at 10:55 AM, Shushant Arora < > shushantaror...@gmail.com> wr

Re: spark on yarn

2015-07-14 Thread Shushant Arora
Is yarn.scheduler.maximum-allocation-vcores the setting for max vcores per container? What's the setting for the max limit of --num-executors? On Tue, Jul 14, 2015 at 11:18 PM, Marcelo Vanzin wrote: > On Tue, Jul 14, 2015 at 10:40 AM, Shushant Arora < > shushantaror...@gmail.com> wr

Re: spark on yarn

2015-07-14 Thread Shushant Arora
are very less ? On Tue, Jul 14, 2015 at 10:52 PM, Marcelo Vanzin wrote: > > On Tue, Jul 14, 2015 at 9:57 AM, Shushant Arora > wrote: > >> When I specify --executor-cores > 4 it fails to start the application. >> When I give --executor-cores as 4 , it works fine. >&

spark on yarn

2015-07-14 Thread Shushant Arora
I am running a spark application on a yarn managed cluster. When I specify --executor-cores > 4 it fails to start the application. I am starting the app as spark-submit --class classname --num-executors 10 --executor-cores 5 --master masteradd jarname Exception in thread "main" org.apache.spark.Spar

Re: spark streaming doubt

2015-07-13 Thread Shushant Arora
rkload will be distributed more evenly. There's some degree of per-task > overhead, but as long as you don't have a huge imbalance between number of > tasks and number of executors that shouldn't be a large problem. > > I don't really understand your second question. >

spark streaming doubt

2015-07-11 Thread Shushant Arora
1. Spark streaming 1.3 creates as many RDD partitions as there are kafka partitions in the topic. Say I have 300 partitions in the topic and 10 executors, each with 3 cores; does that mean at a time only 10*3=30 partitions are processed, and then the next 30, and so on, since executors launch tasks per RDD partitio

Re: spark streaming kafka compatibility

2015-07-09 Thread Shushant Arora
: > It's the consumer version. Should work with 0.8.2 clusters. > > On Thu, Jul 9, 2015 at 11:10 AM, Shushant Arora > wrote: > >> Does spark streaming 1.3 requires kafka version 0.8.1.1 and is not >> compatible with kafka 0.8.2 ? >> >> As per maven

spark streaming kafka compatibility

2015-07-09 Thread Shushant Arora
Does spark streaming 1.3 require kafka version 0.8.1.1, and is it not compatible with kafka 0.8.2? As per the maven dependency of spark streaming 1.3 with kafka: org.apache.spark spark-streaming_2.10 1.3.0 provided spark-core_2.10 org.apache.spark org.apache.kafka kafka_2.10 0.8.1.1 compile jmxri com.sun.jmx j

pause and resume streaming app

2015-07-08 Thread Shushant Arora
Is it possible to pause and resume a streaming app? I have a streaming app which reads events from kafka and posts to some external source. I want to pause the app when the external source is down and resume it automatically when it comes back. Is it possible to pause the app, and is it possible to p

spark core/streaming doubts

2015-07-08 Thread Shushant Arora
1. Is the creation of a read-only singleton object in each map function the same as a broadcast object, since the singleton never gets garbage collected unless the executor gets shut down? The aim is to avoid creation of a complex object at each batch interval of a spark streaming app. 2. Why does JavaStreamingContext's sc()

Re: writing to kafka using spark streaming

2015-07-06 Thread Shushant Arora
better design in general to have > "transformations" be purely functional (that is, no external side effect) > and all non-functional stuff be "actions" (e.g., saveAsHadoopFile is an > action). > > > On Mon, Jul 6, 2015 at 12:09 PM, Shushant Arora > wrote: >

Re: writing to kafka using spark streaming

2015-07-06 Thread Shushant Arora
t; >> On Mon, Jul 6, 2015 at 6:11 AM, Shushant Arora > > wrote: >> >>> I have a requirement to write in kafka queue from a spark streaming >>> application. >>> >>> I am using spark 1.2 streaming. Since different executors in spark are >>

Re: kafka offset commit in spark streaming 1.2

2015-07-06 Thread Shushant Arora
auto commit is disabled, no any part will > call commitOffset, you need to call this API yourself. > > > > Also Kafka’s offset commitment mechanism is actually a timer way, so it is > asynchronized with replication. > > > > *From:* Shushant Arora [mailto:shushantaror...@gmail

Re: kafka offset commit in spark streaming 1.2

2015-07-06 Thread Shushant Arora
t you’re setting with autocommit.enable, internally Spark > Streaming will set it to false to turn off autocommit mechanism. > > > > Thanks > > Jerry > > > > *From:* Shushant Arora [mailto:shushantaror...@gmail.com] > *Sent:* Monday, July 6, 2015 8:11 PM > *To:* user > *

kafka offset commit in spark streaming 1.2

2015-07-06 Thread Shushant Arora
In spark streaming 1.2, are the offsets of consumed kafka messages updated in zookeeper only after writing to the WAL (if WAL and checkpointing are enabled), or does it depend on the kafkaParams used while initialising the kafka DStream? Map kafkaParams = new HashMap(); kafkaParams.put("zookeeper.connect","ip:2181");

writing to kafka using spark streaming

2015-07-06 Thread Shushant Arora
I have a requirement to write to a kafka queue from a spark streaming application. I am using spark 1.2 streaming. Since different executors in spark are allocated at each run, instantiating a new kafka producer at each run seems a costly operation. Is there a way to reuse objects in processing ex
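
One hedged way to avoid paying the producer construction cost per batch is a lazily-initialized, per-executor singleton that the write code looks up inside foreachPartition (a sketch assuming the 0.8.2+ Kafka Java producer client is on the classpath; broker list and topic name are illustrative):

    class ProducerHolder {
      private static KafkaProducer<String, String> producer;

      static synchronized KafkaProducer<String, String> get() {
        if (producer == null) {              // built once per executor JVM, reused by later batches
          Properties props = new Properties();
          props.put("bootstrap.servers", "broker1:9092,broker2:9092");
          props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
          props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
          producer = new KafkaProducer<String, String>(props);
        }
        return producer;
      }
    }

    outputDStream.foreachRDD(new Function<JavaRDD<String>, Void>() {
      @Override
      public Void call(JavaRDD<String> rdd) {
        rdd.foreachPartition(new VoidFunction<Iterator<String>>() {
          @Override
          public void call(Iterator<String> msgs) {
            KafkaProducer<String, String> p = ProducerHolder.get();
            while (msgs.hasNext()) {
              p.send(new ProducerRecord<String, String>("out-topic", msgs.next()));
            }
            // no close() here: the producer is shared across batches on this executor
          }
        });
        return null;
      }
    });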
