Re: Spark Streaming from Kafka - no receivers and spark.streaming.receiver.maxRate?

2015-05-27 Thread Dmitry Goldenberg
Got it, thank you, Tathagata and Ted. Could you comment on my other question as well? Basically, I'm trying to get a handle on a good approa

Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-27 Thread Dmitry Goldenberg
Thanks, Rajesh. I think that acquiring/relinquishing executors is important but I feel like there are at least two layers for resource allocation and autoscaling. It seems that acquiring and relinquishing executors is a way to optimize resource utilization within a pre-set Spark cluster of machine

Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Dmitry Goldenberg
Thank you, Gerard. We're looking at the receiver-less setup with Kafka Spark streaming so I'm not sure how to apply your comments to that case (not that we have to use receiver-less but it seems to offer some advantages over the receiver-based). As far as "the number of Kafka receivers is fixed f

Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Dmitry Goldenberg
Thanks, Evo. Per the last part of your comment, it sounds like we will need to implement a job manager which will be in control of starting the jobs, monitoring the status of the Kafka topic(s), shutting jobs down and marking them as ones to relaunch, scaling the cluster up/down by adding/removing

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Dmitry Goldenberg
> > > > > Sent from Samsung Mobile > > ---- Original message > > From: Evo Eftimov > > Date:2015/05/28 13:22 (GMT+00:00) > > To: Dmitry Goldenberg > > Cc: Gerard Maas ,spark users > > Subject: Re: Autoscaling Spark cluster based on top

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Dmitry Goldenberg
y against sys crashes due to memory leaks. >> Until there is free RAM, spark streaming (spark) will NOT resort to disk – >> and of course resorting to disk from time to time (ie when there is no free >> RAM ) and taking a performance hit from that, BUT only until there is no >>

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Dmitry Goldenberg
>> Probably you should ALWAYS keep the RDD storage policy to MEMORY AND DISK >>> – it will be your insurance policy against sys crashes due to memory leaks. >>> Until there is free RAM, spark streaming (spark) will NOT resort to disk – >>> and of course resorting to d

Re: How to monitor Spark Streaming from Kafka?

2015-06-01 Thread Dmitry Goldenberg
Thank you, Tathagata, Cody, Otis. - Dmitry On Mon, Jun 1, 2015 at 6:57 PM, Otis Gospodnetic wrote: > I think you can use SPM - http://sematext.com/spm - it will give you all > Spark and all Kafka metrics, including offsets broken down by topic, etc. > out of the box. I see more and more peopl

Re: Objects serialized before foreachRDD/foreachPartition ?

2015-06-03 Thread Dmitry Goldenberg
So Evo, option b is to "singleton" the Param, as in your modified snippet, i.e. instantiate it once per RDD. But if I understand correctly the a) option is broadcast, meaning instantiation is in the Driver once before any transformations and actions, correct? That's where my serialization cost
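
A minimal sketch of the two options, assuming a hypothetical serializable Param class with a process() method, a JavaStreamingContext jssc, and a JavaDStream<String> dstream (Spark 1.x-era Java API, where the foreachRDD function returns Void):

    // Option (a): broadcast -- Param is instantiated once on the driver and
    // serialized once; each executor deserializes it at most once.
    final Broadcast<Param> paramBc = jssc.sparkContext().broadcast(new Param());
    dstream.foreachRDD(rdd -> {
        rdd.foreachPartition(iter -> {
            Param p = paramBc.value();
            while (iter.hasNext()) { p.process(iter.next()); }
        });
        return null; // Spark 1.x Function<JavaRDD<String>, Void> signature
    });

    // Option (b): per-RDD/per-partition instantiation -- nothing is serialized,
    // but a new Param is constructed on the executors for every partition.
    dstream.foreachRDD(rdd -> {
        rdd.foreachPartition(iter -> {
            Param p = new Param();
            while (iter.hasNext()) { p.process(iter.next()); }
        });
        return null;
    });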

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-03 Thread Dmitry Goldenberg
. >> >> -Andrew >> >> >> >> 2015-05-28 8:02 GMT-07:00 Evo Eftimov : >> >> Probably you should ALWAYS keep the RDD storage policy to MEMORY AND DISK >>> – it will be your insurance policy against sys crashes due to memory leaks. >>

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-03 Thread Dmitry Goldenberg
e and make them contact > and register with the Master without bringing down the Master (or any of > the currently running worker nodes) > > > > Then just shutdown your currently running spark streaming job/app and > restart it with new params to take advantage of the larg

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-03 Thread Dmitry Goldenberg
> nodes which allows you to double, triple etc in the background/parallel the > resources of the currently running cluster > > > > I was thinking more about the scenario where you have e.g. 100 boxes and > want to / can add e.g. 20 more > > > > *From:* Dmitry Go

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-03 Thread Dmitry Goldenberg
es of the currently running cluster > > > > I was thinking more about the scenario where you have e.g. 100 boxes and > want to / can add e.g. 20 more > > > > *From:* Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com] > *Sent:* Wednesday, June 3, 2015 4:46 PM > *To:* E

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-03 Thread Dmitry Goldenberg
und/parallel the > resources of the currently running cluster > > > > I was thinking more about the scenario where you have e.g. 100 boxes and > want to / can add e.g. 20 more > > > > *From:* Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com] > *Sent:* Wednesday,

Re: StreamingListener, anyone?

2015-06-04 Thread Dmitry Goldenberg
Shixiong, Thanks, interesting point. So if we want to only process one batch then terminate the consumer, what's the best way to achieve that? Presumably the listener could set a flag on the driver notifying it that it can terminate. But the driver is not in a loop, it's basically blocked in await
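
A hedged sketch of the "driver in a loop" alternative: rather than blocking indefinitely in awaitTermination(), poll with a timeout and stop once a StreamingListener (registration elided) has flagged the first completed batch:

    import java.util.concurrent.atomic.AtomicBoolean;

    final AtomicBoolean firstBatchDone = new AtomicBoolean(false);
    // ... register a StreamingListener whose onBatchCompleted callback
    //     does firstBatchDone.set(true) ...

    jssc.start();
    while (!firstBatchDone.get()) {
        jssc.awaitTerminationOrTimeout(1000);  // wake up at least once per second
    }
    jssc.stop(true, true);  // also stop the SparkContext; shut down gracefully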

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-04 Thread Dmitry Goldenberg
from Samsung Mobile > > Original message ---- > > From: Evo Eftimov > > Date:2015/05/28 13:22 (GMT+00:00) > > To: Dmitry Goldenberg > > Cc: Gerard Maas ,spark users > > Subject: Re: Autoscaling Spark cluster based on topic sizes/rate of growth

Re: How to share large resources like dictionaries while processing data with Spark ?

2015-06-04 Thread Dmitry Goldenberg
Thanks so much, Yiannis, Olivier, Huang! On Thu, Jun 4, 2015 at 6:44 PM, Yiannis Gkoufas wrote: > Hi there, > > I would recommend checking out > https://github.com/spark-jobserver/spark-jobserver which I think gives > the functionality you are looking for. > I haven't tested it though. > > BR >

Re: How to share large resources like dictionaries while processing data with Spark ?

2015-06-05 Thread Dmitry Goldenberg
ads > > > > Re Dmitry's initial question – you can load large data sets as Batch > (Static) RDD from any Spark Streaming App and then join DStream RDDs > against them to emulate “lookups”, you can also try the “Lookup RDD” – > there is a GitHub project > > > > *From

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-09 Thread Dmitry Goldenberg
re > unless you want it to. > > If you want it, just call cache. > > On Thu, Jun 4, 2015 at 8:20 AM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> "set the storage policy for the DStream RDDs to MEMORY AND DISK" - it >> appears t

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-11 Thread Dmitry Goldenberg
rces of the currently running cluster > > > > I was thinking more about the scenario where you have e.g. 100 boxes and > want to / can add e.g. 20 more > > > > *From:* Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com] > *Sent:* Wednesday, June 3, 2015 4:46 PM > *T

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-11 Thread Dmitry Goldenberg
> > > > > > > On Thu, Jun 11, 2015 at 7:30 AM, Cody Koeninger > wrote: > >> Depends on what you're reusing multiple times (if anything). >> >> Read >> http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence >> >

Re: Registering custom metrics

2015-06-22 Thread Dmitry Goldenberg
Great, thank you, Silvio. In your experience, is there any way to instrument a callback into Coda Hale or the Spark consumers from the metrics sink? If the sink performs some steps once it has received the metrics, I'd like to be able to make the consumers aware of that via some sort of a callback.

Re: Any way to retrieve time of message arrival to Kafka topic, in Spark Streaming?

2015-06-23 Thread Dmitry Goldenberg
Yes, Akhil. We already have an origination timestamp in the body of the message when we send it. But we can't guarantee the network speed nor a precise enough synchronization of clocks across machines. Pulling the timestamp from Kafka itself would be a step forward although the broker is most l

Re: Best practice for using singletons on workers (seems unanswered) ?

2015-07-08 Thread Dmitry Goldenberg
Richard, That's exactly the strategy I've been trying, which is a wrapper singleton class. But I was seeing the inner object being created multiple times. I wonder if the problem has to do with the way I'm processing the RDD's. I'm using JavaDStream to stream data (from Kafka). Then I'm processin

Re: Best practice for using singletons on workers (seems unanswered) ?

2015-07-08 Thread Dmitry Goldenberg
My singletons do in fact stick around. They're one per worker, looks like. So with 4 workers running on the box, we're creating one singleton per worker process/jvm, which seems OK. Still curious about foreachPartition vs. foreachRDD though... On Tue, Jul 7, 2015 at 11:27 AM, Richard Marscher wr
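
For reference, a sketch of the wrapper-singleton shape being described, with ExpensiveClient standing in for whatever non-serializable resource is shared; the static field is never serialized, so each worker process/JVM lazily builds exactly one instance:

    public final class ClientHolder {
        private static ExpensiveClient instance;

        private ClientHolder() {}

        public static synchronized ExpensiveClient get() {
            if (instance == null) {
                instance = new ExpensiveClient();  // once per executor JVM
            }
            return instance;
        }
    }

    // Worker-side usage:
    //   rdd.foreachPartition(iter -> { ExpensiveClient c = ClientHolder.get(); ... });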

Re: foreachRDD vs. foreachPartition ?

2015-07-08 Thread Dmitry Goldenberg
"These are quite different operations. One operates on RDDs in DStream and one operates on partitions of an RDD. They are not alternatives." Sean, different operations as they are, they can certainly be used on the same data set. In that sense, they are alternatives. Code can be written using on

Re: foreachRDD vs. foreachPartition ?

2015-07-08 Thread Dmitry Goldenberg
Thanks, Cody. The "good boy" comment wasn't from me :) I was the one asking for help. On Wed, Jul 8, 2015 at 10:52 AM, Cody Koeninger wrote: > Sean already answered your question. foreachRDD and foreachPartition are > completely different, there's nothing fuzzy or insufficient about that > a

Re: foreachRDD vs. foreachPartition ?

2015-07-08 Thread Dmitry Goldenberg
to allocate some expensive resource to do the processing. > > On Wed, Jul 8, 2015 at 3:18 PM, Dmitry Goldenberg > wrote: > > "These are quite different operations. One operates on RDDs in DStream > and > > one operates on partitions of an RDD. They are not alternativ

Re: Best practice for using singletons on workers (seems unanswered) ?

2015-07-08 Thread Dmitry Goldenberg
I and DB clients on the executor > side. I think it's the most straightforward approach to dealing with any > non-serializable object you need. > > I don't entirely follow what over-network data shuffling effects you are > alluding to (maybe more specific to streaming?).

Re: What is a best practice for passing environment variables to Spark workers?

2015-07-10 Thread Dmitry Goldenberg
Thanks, Akhil. We're trying the conf.setExecutorEnv() approach since we've already got environment variables set. For system properties we'd go the conf.set("spark.") route. We were concerned that doing the below type of thing did not work, which this blog post seems to confirm ( http://proge
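
A sketch of the combinations discussed (property names are illustrative):

    SparkConf conf = new SparkConf().setAppName(appName);

    // OS environment variables for the executor processes:
    conf.setExecutorEnv("MY_SERVICE_URL", System.getenv("MY_SERVICE_URL"));

    // Config values carried through SparkConf must be "spark."-prefixed:
    conf.set("spark.myapp.lookup.dir", "/opt/data/lookups");

    // Arbitrary -D system properties for the executor JVMs:
    conf.set("spark.executor.extraJavaOptions", "-Dmyapp.lookup.dir=/opt/data/lookups");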

Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-11 Thread Dmitry Goldenberg
We have a bunch of Spark jobs deployed and a few large resource files such as e.g. a dictionary for lookups or a statistical model. Right now, these are deployed as part of the Spark jobs which will eventually make the mongo-jars too bloated for deployments. What are some of the best practices to

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-12 Thread Dmitry Goldenberg
ion could be to store them as blobs in a cache like Redis and then > read + broadcast them from the driver. Or you could store them in HDFS and > read + broadcast from the driver. > > Regards > Sab > >> On Tue, Jan 12, 2016 at 1:44 AM, Dmitry Goldenberg >> wrote:
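
The "read once on the driver, then broadcast" approach in sketch form, assuming a JavaSparkContext sc, a JavaRDD<String> records, and hypothetical loadDictionary/enrich helpers:

    // Load the resource a single time on the driver (from HDFS, Redis, etc.)...
    Map<String, String> dictionary = loadDictionary("hdfs:///resources/dictionary.ser");

    // ...then broadcast it; each executor fetches and deserializes it only once.
    final Broadcast<Map<String, String>> dictBc = sc.broadcast(dictionary);
    JavaRDD<String> enriched = records.map(r -> enrich(r, dictBc.value()));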

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-12 Thread Dmitry Goldenberg
Jorn, you said Ignite or ... ? What was the second choice you were thinking of? It seems that got omitted. > On Jan 12, 2016, at 2:44 AM, Jörn Franke wrote: > > You can look at ignite as a HDFS cache or for storing rdds. > >> On 11 Jan 2016, at 21:14, Dmitry Goldenberg

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-12 Thread Dmitry Goldenberg
I'd guess that if the resources are broadcast Spark would put them into Tachyon... > On Jan 12, 2016, at 7:04 AM, Dmitry Goldenberg > wrote: > > Would it make sense to load them into Tachyon and read and broadcast them > from there since Tachyon is already a part of the

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-12 Thread Dmitry Goldenberg
lso able to > interact with Tachyon via the Hadoop-compatible API, so Spark jobs can read > input files from Tachyon and write output files to Tachyon. > > I hope that helps, > Gene > > On Tue, Jan 12, 2016 at 4:26 AM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-14 Thread Dmitry Goldenberg
since it's pluggable into Spark), Redis, and the like. Ideally, I would think we'd want resources to be loaded into the cluster memory as needed; paged in/out on-demand in an LRU fashion. From this perspective, it's not yet clear to me what the best option(s) would be. Any thoughts / re

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-14 Thread Dmitry Goldenberg
The other thing from some folks' recommendations on this list was Apache Ignite. Their In-Memory File System ( https://ignite.apache.org/features/igfs.html) looks quite interesting. On Thu, Jan 14, 2016 at 7:54 AM, Dmitry Goldenberg wrote: > OK so it looks like Tachyon is a cluste

Kafka error "partitions don't have a leader" / LeaderNotAvailableException

2015-09-29 Thread Dmitry Goldenberg
I apologize for posting this Kafka related issue into the Spark list. Have gotten no responses on the Kafka list and was hoping someone on this list could shed some light on the below. --- We're running into this

Re: Kafka error "partitions don't have a leader" / LeaderNotAvailableException

2015-09-29 Thread Dmitry Goldenberg
set the advertised.host.name in Kafka's server.properties. Yay/nay? Thanks again. On Tue, Sep 29, 2015 at 8:31 AM, Adrian Tanase wrote: > I believe some of the brokers in your cluster died and there are a number > of partitions that nobody is currently managing. > > -adrian > >

Re: Kafka error "partitions don't have a leader" / LeaderNotAvailableException

2015-09-29 Thread Dmitry Goldenberg
ts where folks suggest to check the > DNS settings (those appear fine) and to set the advertised.host.name in > Kafka's server.properties. Yay/nay? > > Thanks again. > > On Tue, Sep 29, 2015 at 8:31 AM, Adrian Tanase wrote: > >> I believe some of the brokers i

Re: Kafka error "partitions don't have a leader" / LeaderNotAvailableException

2015-09-29 Thread Dmitry Goldenberg
on the broker, you > may want to look through the kafka jira, e.g. > > https://issues.apache.org/jira/browse/KAFKA-899 > > > On Tue, Sep 29, 2015 at 8:05 AM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> "more partitions and replicas than avai

Re: Kafka error "partitions don't have a leader" / LeaderNotAvailableException

2015-09-29 Thread Dmitry Goldenberg
er, you > may want to look through the kafka jira, e.g. > > https://issues.apache.org/jira/browse/KAFKA-899 > > > On Tue, Sep 29, 2015 at 8:05 AM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> "more partitions and replicas than available brokers"

ThrowableSerializationWrapper: Task exception could not be deserialized / ClassNotFoundException: org.apache.solr.common.SolrException

2015-09-29 Thread Dmitry Goldenberg
We're seeing this occasionally. Granted, this was caused by a wrinkle in the Solr schema but this bubbled up all the way in Spark and caused job failures. I just checked and SolrException class is actually in the consumer job jar we use. Is there any reason why Spark cannot find the SolrException

Re: ThrowableSerializationWrapper: Task exception could not be deserialized / ClassNotFoundException: org.apache.solr.common.SolrException

2015-09-29 Thread Dmitry Goldenberg
ase of Spark > command line for running Spark job > > Cheers > > On Tue, Sep 29, 2015 at 1:37 PM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> We're seeing this occasionally. Granted, this was caused by a wrinkle in >> the Solr schema but this

Re: ThrowableSerializationWrapper: Task exception could not be deserialized / ClassNotFoundException: org.apache.solr.common.SolrException

2015-09-29 Thread Dmitry Goldenberg
ou tried the following ? > --conf spark.driver.userClassPathFirst=true --conf spark.executor. > userClassPathFirst=true > > On Tue, Sep 29, 2015 at 4:38 PM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> Release of Spark: 1.5.0. >> >> Command line

Re: ThrowableSerializationWrapper: Task exception could not be deserialized / ClassNotFoundException: org.apache.solr.common.SolrException

2015-09-29 Thread Dmitry Goldenberg
I'm actually not sure how either one of these would possibly cause Spark to find SolrException. Whether the driver or executor class path is first, should it not matter, if the class is in the consumer job jar? On Tue, Sep 29, 2015 at 9:12 PM, Dmitry Goldenberg wrote: > Ted, I thin

How to tell Spark not to use /tmp for snappy-unknown-***-libsnappyjava.so

2015-09-30 Thread Dmitry Goldenberg
Is there a way to ensure Spark doesn't write to /tmp directory? We've got spark.local.dir specified in the spark-defaults.conf file to point at another directory. But we're seeing many of these snappy-unknown-***-libsnappyjava.so files being written to /tmp still. Is there a config setting or so

Re: How to tell Spark not to use /tmp for snappy-unknown-***-libsnappyjava.so

2015-09-30 Thread Dmitry Goldenberg
Thanks, Ted, will try it out. On Wed, Sep 30, 2015 at 9:07 AM, Ted Yu wrote: > See the tail of this: > https://bugzilla.redhat.com/show_bug.cgi?id=1005811 > > FYI > > > On Sep 30, 2015, at 5:54 AM, Dmitry Goldenberg > wrote: > > > > Is there a way t
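
For the archive: snappy-java extracts its native library into java.io.tmpdir unless org.xerial.snappy.tempdir is set, so spark-defaults.conf entries along these lines (paths illustrative) should redirect the .so files away from /tmp:

    spark.driver.extraJavaOptions    -Dorg.xerial.snappy.tempdir=/data/tmp -Djava.io.tmpdir=/data/tmp
    spark.executor.extraJavaOptions  -Dorg.xerial.snappy.tempdir=/data/tmp -Djava.io.tmpdir=/data/tmp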

Re: ThrowableSerializationWrapper: Task exception could not be deserialized / ClassNotFoundException: org.apache.solr.common.SolrException

2015-09-30 Thread Dmitry Goldenberg
d ahead of hbase-protocol.jar, things start to get > interesting ... > > On Tue, Sep 29, 2015 at 6:12 PM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> Ted, I think I have tried these settings with the hbase protocol jar, to >> no avail. >> >>

What are the .snapshot files in /home/spark/Snapshots?

2015-11-10 Thread Dmitry Goldenberg
We're seeing a bunch of .snapshot files being created under /home/spark/Snapshots, such as the following for example: CoarseGrainedExecutorBackend-2015-08-27-shutdown.snapshot CoarseGrainedExecutorBackend-2015-08-31-shutdown-1.snapshot SparkSubmit-2015-08-31-shutdown-1.snapshot Worker-2015-08-27-s

Re: What are the .snapshot files in /home/spark/Snapshots?

2015-11-10 Thread Dmitry Goldenberg
N/m, these are just profiling snapshots :) Sorry for the wide distribution. On Tue, Nov 10, 2015 at 9:46 AM, Dmitry Goldenberg wrote: > We're seeing a bunch of .snapshot files being created under > /home/spark/Snapshots, such as the following for example: > > CoarseGrainedExe

Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-07-31 Thread Dmitry Goldenberg
I've instrumented checkpointing per the programming guide and I can tell that Spark Streaming is creating the checkpoint directories but I'm not seeing any content being created in those directories nor am I seeing the effects I'd expect from checkpointing. I'd expect any data that comes into Kafk
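
For context, a minimal sketch of the getOrCreate pattern this is built on (Java 8 lambda over org.apache.spark.api.java.function.Function0); checkpointing only does anything when the context both gets a checkpoint directory and is recovered this way:

    JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(checkpointDir, () -> {
        JavaStreamingContext ctx =
            new JavaStreamingContext(sparkConf, Durations.milliseconds(batchDurationMillis));
        ctx.checkpoint(checkpointDir);  // enable metadata checkpointing
        // ... create the direct Kafka stream and output operations here ...
        return ctx;
    });
    jssc.start();
    jssc.awaitTermination();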

Re: Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-07-31 Thread Dmitry Goldenberg
f it's quite long. > > It sort of seems wrong though since > https://spark.apache.org/docs/latest/streaming-programming-guide.html > suggests it was intended to be a multiple of the batch interval. The > slide duration wouldn't always be relevant anyway. > > On Fri,

Re: Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-07-31 Thread Dmitry Goldenberg
est/streaming-programming-guide.html > suggests it was intended to be a multiple of the batch interval. The > slide duration wouldn't always be relevant anyway. > > On Fri, Jul 31, 2015 at 6:16 PM, Dmitry Goldenberg > wrote: > > I've instrumented checkpointing per the

How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Dmitry Goldenberg
We're getting the below error. Tried increasing spark.executor.memory e.g. from 1g to 2g but the below error still happens. Any recommendations? Something to do with specifying -Xmx in the submit job scripts? Thanks. Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit excee

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Dmitry Goldenberg
at looks like it's during recovery from a checkpoint, so it'd be driver > memory not executor memory. > > How big is the checkpoint directory that you're trying to restore from? > > On Mon, Aug 10, 2015 at 10:57 AM, Dmitry Goldenberg < > dgoldenberg...@gmail.com

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Dmitry Goldenberg
r we can estimate the > size of the checkpoint and compare with Runtime.getRuntime().freeMemory(). > > If the size of checkpoint is much bigger than free memory, log warning, etc > > Cheers > > On Mon, Aug 10, 2015 at 9:34 AM, Dmitry Goldenberg < > dgoldenberg...@gmail.com

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Dmitry Goldenberg
om? Thanks. On Mon, Aug 10, 2015 at 1:07 PM, Cody Koeninger wrote: > You need to keep a certain number of rdds around for checkpointing, based > on e.g. the window size. Those would all need to be loaded at once. > > On Mon, Aug 10, 2015 at 11:49 AM, Dmitry Goldenberg < > dgold

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Dmitry Goldenberg
to workers. Hefty file system usage, hefty memory consumption... What can we do to offset some of these costs? On Mon, Aug 10, 2015 at 4:27 PM, Cody Koeninger wrote: > The rdd is indeed defined by mostly just the offsets / topic partitions. > > On Mon, Aug 10, 2015 at 3:24 PM

Re: Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-08-14 Thread Dmitry Goldenberg
s the interval? that could at least > explain why it's not doing anything, if it's quite long. > > It sort of seems wrong though since > https://spark.apache.org/docs/latest/streaming-programming-guide.html > suggests it was intended to be a multiple of the batch interval. Th

Re: Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-08-14 Thread Dmitry Goldenberg
er program could terminate as the last batch is being processed... On Fri, Aug 14, 2015 at 6:17 PM, Cody Koeninger wrote: > You'll resume and re-process the rdd that didnt finish > > On Fri, Aug 14, 2015 at 1:31 PM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote

Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-03 Thread Dmitry Goldenberg
I'm seeing an oddity where I initially set the batchdurationmillis to 1 second and it works fine: JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.milliseconds(batchDurationMillis)); Then I tried changing the value to 10 seconds. The change didn't seem to take. I've bounc
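
The explanation that emerges later in the thread, sketched: when a valid checkpoint exists, getOrCreate never invokes the factory, and the batch interval stored in the checkpoint wins, so a changed duration only takes effect once the checkpoint directory is cleared:

    JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(checkpointDir, () -> {
        // NOT called when recovering from an existing checkpoint -- the 1-second
        // interval saved there keeps winning until checkpointDir is wiped.
        return new JavaStreamingContext(sparkConf, Durations.milliseconds(batchDurationMillis));
    });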

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-04 Thread Dmitry Goldenberg
ng from checkpoint files which has 10 second > as the batch interval? > > > On Thu, Sep 3, 2015 at 7:34 AM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> I'm seeing an oddity where I initially set the batchdurationmillis to 1 >> second an

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-04 Thread Dmitry Goldenberg
UI says? It should show > the underlying batch duration of the StreamingContext, the details of when > the batch starts, etc. > > BTW, it seems that the 5.6 or 6.8 seconds delay is present only when data > is present (that is, * Documents processed: > 0)* > > On Fri,

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-04 Thread Dmitry Goldenberg
ontext(sparkConf, params); jssc.start(); jssc.awaitTermination(); jssc.close(); On Fri, Sep 4, 2015 at 8:48 PM, Tathagata Das wrote: > Are you sure you are not accidentally recovering from checkpoint? How are > you using StreamingContext.getOrCreate() in your code? > > TD

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-04 Thread Dmitry Goldenberg
ew ProcessPartitionFunction(params); rdd.foreachPartition(func); return null; } }); return jssc; } On Fri, Sep 4, 2015 at 8:57 PM, Dmitry Goldenberg wrote: > I'd think that we wouldn't be "accidentally recovering from checkpoint" > hours or ev

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Dmitry Goldenberg
or moving the contents of the checkpoint directory > and restarting the job? > > On Fri, Sep 4, 2015 at 8:02 PM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> Sorry, more relevant code below: >> >> SparkConf sparkConf = createSparkConf(appName, ka

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Dmitry Goldenberg
what values are actually being read > out of your params object for the various settings. > > > On Tue, Sep 8, 2015 at 10:24 AM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> I've stopped the jobs, the workers, and the master. Deleted the contents >

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Dmitry Goldenberg
ger wrote: > Well, I'm not sure why you're checkpointing messages. > > I'd also put in some logging to see what values are actually being read > out of your params object for the various settings. > > > On Tue, Sep 8, 2015 at 10:24 AM, Dmitry Goldenberg < > dg

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Dmitry Goldenberg
:23 PM, Tathagata Das wrote: > Why are you checkpointing the direct kafka stream? It serves not purpose. > > TD > > On Tue, Sep 8, 2015 at 9:35 AM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> I just disabled checkpointing in our consumers and

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Dmitry Goldenberg
ata > in each batch > > (which checkpoint is enabled using > streamingContext.checkpoint(checkpointDir)) and can recover from failure by > reading the exact same data back from Kafka. > > > TD > > On Tue, Sep 8, 2015 at 4:38 PM, Dmitry Goldenberg < > dgoldenbe

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Dmitry Goldenberg
wrote: >> >>> Calling directKafkaStream.checkpoint() will make the system write the >>> raw kafka data into HDFS files (that is, RDD checkpointing). This is >>> completely unnecessary with Direct Kafka because it already tracks the >>> offset of data in each

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-09 Thread Dmitry Goldenberg
pointing to override its checkpoint duration millis, is there? Is the default there max(batchdurationmillis, 10seconds)? Is there a way to override this? Thanks. On Wed, Sep 9, 2015 at 2:44 PM, Tathagata Das wrote: > > > See inline. > > On Tue, Sep 8, 2015 at 9:02 PM, Dmitry

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-10 Thread Dmitry Goldenberg
the time of > recovery? Trying to understand your usecase. > > > On Wed, Sep 9, 2015 at 12:03 PM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> >> when you use getOrCreate, and there exists a valid checkpoint, it will >> always return the c

Re: Using KafkaDirectStream, stopGracefully and exceptions

2015-09-10 Thread Dmitry Goldenberg
>> checkpoints can't be used between controlled restarts Is that true? If so, why? From my testing, checkpoints appear to be working fine, we get the data we've missed between the time the consumer went down and the time we brought it back up. >> If I cannot make checkpoints between code upgrades

A way to kill laggard jobs?

2015-09-11 Thread Dmitry Goldenberg
Is there a way to kill a laggard Spark job manually, and more importantly, is there a way to do it programmatically based on a configurable timeout value? Thanks.
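
One programmatic sketch using job groups (group name and timeout illustrative): run the work under a group on a JavaSparkContext jsc, and cancel the group from a watchdog thread if it overruns:

    import java.util.concurrent.*;

    jsc.setJobGroup("batch-42", "process batch 42", true);  // true: interrupt tasks on cancel

    ScheduledExecutorService watchdog = Executors.newSingleThreadScheduledExecutor();
    ScheduledFuture<?> killer =
        watchdog.schedule(() -> jsc.cancelJobGroup("batch-42"), 30, TimeUnit.MINUTES);

    rdd.foreach(record -> process(record));  // the potentially laggard work (process: hypothetical)
    killer.cancel(false);                    // finished in time; disarm the watchdog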

A way to timeout and terminate a laggard 'Stage' ?

2015-09-14 Thread Dmitry Goldenberg
Is there a way in Spark to automatically terminate laggard "stages", ones that appear to be hanging? In other words, is there a timeout for processing of a given RDD? In the Spark GUI, I see the "kill" function for a given Stage under 'Details for Job <...>". Is there something in Spark that w

Re: A way to timeout and terminate a laggard 'Stage' ?

2015-09-15 Thread Dmitry Goldenberg
proach, but yes you >> can have a separate program to keep an eye in the webui (possibly parsing >> the content) and make it trigger the kill task/job once it detects a lag. >> (Again you will have to figure out the correct numbers before killing any >> job) >> >>

Reason for Kafka topic existence check / "Does the topic exist?" error

2016-10-08 Thread Dmitry Goldenberg
Hi, I am trying to start up a simple consumer that streams from a Kafka topic, using Spark 2.0.0: - spark-streaming_2.11 - spark-streaming-kafka-0-8_2.11 I was getting an error as below until I created the topic in Kafka. From integrating Spark 1.5, I never used to hit this check; we were

SparkConf.setExecutorEnv works differently in Spark 2.0.0

2016-10-08 Thread Dmitry Goldenberg
Hi, We have some code which worked with Spark 1.5.0, which allowed us to pass system properties to the executors: SparkConf sparkConf = new SparkConf().setAppName(appName); ... for each property: sparkConf.setExecutorEnv(propName, propValue); Our current Spark Streaming driver is compiled wi
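
A hedged workaround sketch: setExecutorEnv sets OS environment variables rather than JVM system properties, so the usual route for -D properties on executors is spark.executor.extraJavaOptions (propsToForward is an illustrative Map<String, String>):

    StringBuilder opts = new StringBuilder();
    for (Map.Entry<String, String> e : propsToForward.entrySet()) {
        opts.append("-D").append(e.getKey()).append('=').append(e.getValue()).append(' ');
    }
    sparkConf.set("spark.executor.extraJavaOptions", opts.toString().trim());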

Re: Reason for Kafka topic existence check / "Does the topic exist?" error

2016-10-29 Thread Dmitry Goldenberg
for those of us that > prefer to run with auto.create set to false (because it makes sure the > topic is actually set up the way you want, reduces the likelihood of > spurious topics being created, etc). > > > > On Sat, Oct 8, 2016 at 11:44 AM, Dmitry Goldenberg > wrote:

NoNodeAvailableException (None of the configured nodes are available) error when trying to push data to Elastic from a Spark job

2017-02-03 Thread Dmitry Goldenberg
Hi, Any reason why we might be getting this error? The code seems to work fine in the non-distributed mode but the same code when run from a Spark job is not able to get to Elastic. Spark version: 2.0.1 built for Hadoop 2.4, Scala 2.11 Elastic version: 2.3.1 I've verified the Elastic hosts and

How to fix error "Failed to get records for..." after polling for 120000

2017-04-18 Thread Dmitry Goldenberg
Hi, I was wondering if folks have some ideas, recommendation for how to fix this error (full stack trace included below). We're on Kafka 0.10.0.0 and spark_streaming_2.11 v. 2.0.0. We've tried a few things as suggested in these sources: - http://stackoverflow.com/questions/42264669/spark

Losing system properties on executor side, if context is checkpointed

2019-02-19 Thread Dmitry Goldenberg
Hi all, I'm seeing an odd behavior where if I switch the context from regular to checkpointed, the system properties are no longer automatically carried over into the worker / executors and turn out to be null there. This is in Java, using spark-streaming_2.10, version 1.5.0. I'm placing propert

Issues with Spark Streaming checkpointing of Kafka topic content

2019-04-02 Thread Dmitry Goldenberg
Hi, I've got 3 questions/issues regarding checkpointing, was hoping someone could help shed some light on this. We've got a Spark Streaming consumer consuming data from a Kafka topic; works fine generally until I switch it to the checkpointing mode by calling the 'checkpoint' method on the contex

Re: Issues with Spark Streaming checkpointing of Kafka topic content

2019-04-02 Thread Dmitry Goldenberg
To add more info, this project is on an older version of Spark, 1.5.0, and on an older version of Kafka which is 0.8.2.1 (2.10-0.8.2.1). On Tue, Apr 2, 2019 at 11:39 AM Dmitry Goldenberg wrote: > Hi, > > I've got 3 questions/issues regarding checkpointing, was hoping someone >

Re: Class loading issue, spark.files.userClassPathFirst doesn't seem to be working

2015-02-18 Thread Dmitry Goldenberg
Are you proposing I downgrade Solrj's httpclient dependency to be on par with that of Spark/Hadoop? Or upgrade Spark/Hadoop's httpclient to the latest? Solrj has to stay with its selected version. I could try and rebuild Spark with the latest httpclient but I've no idea what effects that may cau

Re: Class loading issue, spark.files.userClassPathFirst doesn't seem to be working

2015-02-18 Thread Dmitry Goldenberg
I think I'm going to have to rebuild Spark with commons.httpclient.version set to 4.3.1 which looks to be the version chosen by Solrj, rather than the 4.2.6 that Spark's pom mentions. Might work. On Wed, Feb 18, 2015 at 1:37 AM, Arush Kharbanda wrote: > Hi > > Did you try to make maven pick the

Re: Class loading issue, spark.files.userClassPathFirst doesn't seem to be working

2015-02-18 Thread Dmitry Goldenberg
ll the > functionality I needed. > > -- > Emre Sevinç > http://www.bigindustries.be/ > > > > On Wed, Feb 18, 2015 at 4:39 PM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> I think I'm going to have to rebuild Spark with >> commons.h

Re: Class loading issue, spark.files.userClassPathFirst doesn't seem to be working

2015-02-18 Thread Dmitry Goldenberg
Thanks, Emre! Will definitely try this. On Wed, Feb 18, 2015 at 11:00 AM, Emre Sevinc wrote: > > On Wed, Feb 18, 2015 at 4:54 PM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: >> Thank you, Emre. It seems solrj still depends on HttpClient 4.1.3; would >

Re: JdbcRDD, ClassCastException with scala.Function0

2015-02-18 Thread Dmitry Goldenberg
Thanks, Cody. Yes, I originally started off by looking at that but I get a compile error if I try and use that approach: constructor JdbcRDD in class JdbcRDD cannot be applied to given types. Not to mention that JavaJdbcRDDSuite somehow manages to not pass in the class tag (the last argument). Wo

Re: JdbcRDD, ClassCastException with scala.Function0

2015-02-18 Thread Dmitry Goldenberg
a static method JdbcRDD.create, not new JdbcRDD. Is that what > you tried doing? > > On Wed, Feb 18, 2015 at 12:00 PM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> Thanks, Cody. Yes, I originally started off by looking at that but I get >> a compile err
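
The working shape of the JdbcRDD.create call from Java ends up roughly as follows (connection details and query are illustrative); note it wants a JavaSparkContext, and the connection factory and row mapper can be lambdas:

    JavaRDD<String> rows = JdbcRDD.create(
        jsc,
        () -> DriverManager.getConnection("jdbc:mysql://host/db", "user", "pass"),
        "SELECT name FROM people WHERE id >= ? AND id <= ?",
        1L, 1000L,   // bound values substituted for the two ?'s
        4,           // number of partitions
        rs -> rs.getString(1));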

Re: Class loading issue, spark.files.userClassPathFirst doesn't seem to be working

2015-02-18 Thread Dmitry Goldenberg
I'm not sure what "on the driver" means but I've tried setting spark.files.userClassPathFirst to true, in $SPARK-HOME/conf/spark-defaults.conf and also in the SparkConf programmatically; it appears to be ignored. The solution was to follow Emre's recommendation and downgrade the selected Solrj dis

Re: JdbcRDD, ClassCastException with scala.Function0

2015-02-18 Thread Dmitry Goldenberg
Koeninger wrote: > Is sc there a SparkContext or a JavaSparkContext? The compilation error > seems to indicate the former, but JdbcRDD.create expects the latter > > On Wed, Feb 18, 2015 at 12:30 PM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> I have t

Re: JdbcRDD, ClassCastException with scala.Function0

2015-02-18 Thread Dmitry Goldenberg
e and pass an instance of that to JdbcRDD.create ? > > On Wed, Feb 18, 2015 at 3:48 PM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> Cody, you were right, I had a copy and paste snag where I ended up with a >> vanilla SparkContext rather than a Java

Re: NotSerializableException: org.apache.http.impl.client.DefaultHttpClient when trying to send documents to Solr

2015-02-18 Thread Dmitry Goldenberg
Thank you, Jose. That fixed it. On Wed, Feb 18, 2015 at 6:31 PM, Jose Fernandez wrote: > You need to instantiate the server in the forEachPartition block or Spark > will attempt to serialize it to the task. See the design patterns section > in the Spark Streaming guide. > > > Jose Fernandez | Pr
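
The fix in sketch form: construct the non-serializable client inside foreachPartition so it never has to be serialized into a task (Solr URL and document conversion are illustrative; Spark 1.x Void signature):

    dstream.foreachRDD(rdd -> {
        rdd.foreachPartition(docs -> {
            // Created on the executor, once per partition -- never shipped by Spark.
            HttpSolrServer solr = new HttpSolrServer("http://solr-host:8983/solr/collection1");
            while (docs.hasNext()) {
                solr.add(toSolrInputDocument(docs.next()));  // hypothetical converter
            }
            solr.shutdown();
        });
        return null;
    });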

Re: JdbcRDD, ClassCastException with scala.Function0

2015-02-19 Thread Dmitry Goldenberg
DD[T] = { > > > JFunction here is the interface org.apache.spark.api.java.function.Function, > not scala Function0 > > LIkewise, ConnectionFactory is an interface defined inside JdbcRDD, not > scala Function0 > > On Wed, Feb 18, 2015 at 4:50 PM, Dmitry Goldenberg

Re: JdbcRDD, ClassCastException with scala.Function0

2015-02-19 Thread Dmitry Goldenberg
t that takes 2 numbers to specify the bounds. Of > course, a numeric primary key is going to be the most efficient way to do > that. > > On Thu, Feb 19, 2015 at 8:57 AM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> Yup, I did see that. Good point though, Cody.

Re: Apache Spark User List: people's responses not showing in the browser view

2015-03-18 Thread Dmitry Goldenberg
Thanks, Ted. Well, so far even there I'm only seeing my post and not, for example, your response. On Wed, Mar 18, 2015 at 7:28 PM, Ted Yu wrote: > Was this one of the threads you participated ? > http://search-hadoop.com/m/JW1q5w0p8x1 > > You should be able to find your posts on search-hadoop.co

Re: Apache Spark User List: people's responses not showing in the browser view

2015-03-19 Thread Dmitry Goldenberg
r/ >>> >>> Nick >>> >>> On Thu, Mar 19, 2015 at 5:18 AM Ted Yu wrote: >>> >>>> There might be some delay: >>>> >>>> >>>> http://search-hadoop.com/m/JW1q5mjZUy/Spark+people%2527s+responses&subj=Apache+Sp
