Distribution of Spark 3.0.1 with Hive 1.2

2020-11-10 Thread Dmitry
Hi all, I am trying to build a Spark 3.0.1 distribution using ./dev/make-distribution.sh --name spark3-hive12 --pip --tgz -Phive-1.2 -Phadoop-2.7 -Pyarn The problem is that Maven can't find the right profile for Hive, and the build ends without the Hive jars. ++ /Users/reireirei/spark/spark/build/mvn help:eval

Re: Issues with Spark Streaming checkpointing of Kafka topic content

2019-04-02 Thread Dmitry Goldenberg
To add more info, this project is on an older version of Spark, 1.5.0, and on an older version of Kafka which is 0.8.2.1 (2.10-0.8.2.1). On Tue, Apr 2, 2019 at 11:39 AM Dmitry Goldenberg wrote: > Hi, > > I've got 3 questions/issues regarding checkpointing, was hoping someone >

Issues with Spark Streaming checkpointing of Kafka topic content

2019-04-02 Thread Dmitry Goldenberg
r, once the data has been processed by the driver + the workers successfully? If the former, how can we configure checkpointing to do the latter? Thanks, - Dmitry

Losing system properties on executor side, if context is checkpointed

2019-02-19 Thread Dmitry Goldenberg
params.getAppName(), params.getTopic(), .. rdd.foreachPartition(func); return null; } }); Would appreciate any recommendations/clues, Thanks, - Dmitry
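A minimal sketch of the usual workaround, assuming the settings can travel inside a serializable holder captured by the closure (class name hypothetical) instead of being read from driver-set system properties on the executors:

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical holder: carries settings to the executors inside the closure, so
// nothing depends on driver-side System properties after checkpoint recovery.
public class JobParams implements Serializable {
    private final Map<String, String> settings = new HashMap<>();

    public JobParams set(String key, String value) {
        settings.put(key, value);
        return this;
    }

    public String get(String key) {
        return settings.get(key);
    }
}
```

Anything the tasks need is then read from the captured object (as the params usage above already does), which survives checkpoint recovery because it is serialized along with the DStream operations.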

Re: Spark on Kubernetes (minikube) 2.3 fails with class not found exception

2018-04-10 Thread Dmitry
spark-examples_2.11-2.3.0.jar. > > On Tue, Apr 10, 2018 at 1:34 AM, Dmitry wrote: > >> Hello spent a lot of time to find what I did wrong , but not found. >> I have a minikube WIndows based cluster ( Hyper V as hypervisor ) and try >> to run examples against

Spark on Kubernetes (minikube) 2.3 fails with class not found exception

2018-04-10 Thread Dmitry
Hello, I spent a lot of time trying to find what I did wrong, but found nothing. I have a minikube Windows-based cluster (Hyper-V as hypervisor) and am trying to run examples against Spark 2.3. Tried several Docker image builds: * several builds that I built myself * andrusha/spark-k8s:2.3.0-hadoop2.7 from docke

Re: Why do checkpoints work the way they do?

2017-09-11 Thread Dmitry Naumenko
+1 for me for this question. If there are any constraints on restoring a checkpoint for Structured Streaming, they should be documented. 2017-08-31 9:20 GMT+03:00 张万新 : > So is there any documents demonstrating in what condition can my > application recover from the same checkpoint and in what conditi

How to fix error "Failed to get records for..." after polling for 120000

2017-04-18 Thread Dmitry Goldenberg
s/42264669/spark-streaming-assertion-failed-failed-to-get-records-for-spark-executor-a-gro - https://issues.apache.org/jira/browse/SPARK-19275 - https://issues.apache.org/jira/browse/SPARK-17147 but still seeing the error. We'd appreciate any clues or recommendations. T
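For reference, a hedged sketch of the knobs usually discussed around this error, assuming the Kafka 0.10 direct-stream integration; the values shown are illustrative only:

```java
import org.apache.spark.SparkConf;

public class KafkaPollTimeoutConf {
    public static SparkConf build() {
        return new SparkConf()
            .setAppName("kafka-010-consumer")
            // Executor-side poll timeout for the cached Kafka consumer; when unset it
            // falls back to spark.network.timeout (120000 ms), matching the error text.
            .set("spark.streaming.kafka.consumer.poll.ms", "30000")
            // Often raised alongside it to give slow batches more headroom.
            .set("spark.network.timeout", "300s");
    }
}
```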

NoNodeAvailableException (None of the configured nodes are available) error when trying to push data to Elastic from a Spark job

2017-02-03 Thread Dmitry Goldenberg
Hi, Any reason why we might be getting this error? The code seems to work fine in the non-distributed mode but the same code when run from a Spark job is not able to get to Elastic. Spark version: 2.0.1 built for Hadoop 2.4, Scala 2.11 Elastic version: 2.3.1 I've verified the Elastic hosts and

subtractByKey modifes values in the source RDD

2016-11-23 Thread Dmitry Dzhus
I'm experiencing a problem with subtractByKey using Spark 2.0.2 with Scala 2.11.x: Relevant code: object Types { type ContentId = Int type ContentKey = Tuple2[Int, ContentId] type InternalContentId = Int } val inverseItemIDMap: RDD[(InternalContentId, ContentKey)

Re: Reason for Kafka topic existence check / "Does the topic exist?" error

2016-10-29 Thread Dmitry Goldenberg
ipt at install time. This seems like a Kafka doc issue potentially, to explain what exactly one can expect from the auto.create.topics.enable flag. -Dmitry On Sat, Oct 8, 2016 at 1:26 PM, Cody Koeninger wrote: > So I just now retested this with 1.5.2, and 2.0.0, and the behavior is > ex

SparkConf.setExecutorEnv works differently in Spark 2.0.0

2016-10-08 Thread Dmitry Goldenberg
way to pass environment variables or system properties to the executor side, preferably a programmatic way rather than configuration-wise? Thanks, - Dmitry
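A minimal sketch of the two programmatic routes raised in this thread (the variable and property names are placeholders):

```java
import org.apache.spark.SparkConf;

public class ExecutorEnvConf {
    public static SparkConf build() {
        return new SparkConf()
            .setAppName("executor-env-example")
            // Environment variable exported to the executor processes.
            .setExecutorEnv("MY_SERVICE_URL", "http://example.com")
            // JVM system property set on each executor.
            .set("spark.executor.extraJavaOptions", "-Dmy.service.url=http://example.com");
    }
}
```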

Reason for Kafka topic existence check / "Does the topic exist?" error

2016-10-08 Thread Dmitry Goldenberg
ather be able to not have to pre-create the topics before I start the consumers. Any thoughts/comments would be appreciated. Thanks. - Dmitry Exception in thread "main" org.apache.spark

Profiling a spark job

2016-04-05 Thread Dmitry Olshansky
nt in EPollWait. However, I'm using standalone mode with a local master without starting a separate daemon (could it be that I should?) --- Dmitry Olshansky

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-14 Thread Dmitry Goldenberg
The other thing from some folks' recommendations on this list was Apache Ignite. Their In-Memory File System ( https://ignite.apache.org/features/igfs.html) looks quite interesting. On Thu, Jan 14, 2016 at 7:54 AM, Dmitry Goldenberg wrote: > OK so it looks like Tachyon is a cluste

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-14 Thread Dmitry Goldenberg
since it's pluggable into Spark), Redis, and the like. Ideally, I would think we'd want resources to be loaded into the cluster memory as needed; paged in/out on-demand in an LRU fashion. From this perspective, it's not yet clear to me what the best option(s) would be. Any thoughts / re

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-12 Thread Dmitry Goldenberg
, Gene Pang wrote: > Hi Dmitry, > > Yes, Tachyon can help with your use case. You can read and write to > Tachyon via the filesystem api ( > http://tachyon-project.org/documentation/File-System-API.html). There is > a native Java API as well as a Hadoop-compatible API. Spark is a

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-12 Thread Dmitry Goldenberg
I'd guess that if the resources are broadcast Spark would put them into Tachyon... > On Jan 12, 2016, at 7:04 AM, Dmitry Goldenberg > wrote: > > Would it make sense to load them into Tachyon and read and broadcast them > from there since Tachyon is already a part of the

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-12 Thread Dmitry Goldenberg
Jorn, you said Ignite or ... ? What was the second choice you were thinking of? It seems that got omitted. > On Jan 12, 2016, at 2:44 AM, Jörn Franke wrote: > > You can look at ignite as a HDFS cache or for storing rdds. > >> On 11 Jan 2016, at 21:14, Dmitry Goldenberg

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-12 Thread Dmitry Goldenberg
ion could be to store them as blobs in a cache like Redis and then > read + broadcast them from the driver. Or you could store them in HDFS and > read + broadcast from the driver. > > Regards > Sab > >> On Tue, Jan 12, 2016 at 1:44 AM, Dmitry Goldenberg >> wrote:

Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-11 Thread Dmitry Goldenberg
We have a bunch of Spark jobs deployed and a few large resource files such as e.g. a dictionary for lookups or a statistical model. Right now, these are deployed as part of the Spark jobs which will eventually make the mongo-jars too bloated for deployments. What are some of the best practices to
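One of the options weighed later in the thread is loading the resource once on the driver and broadcasting it; a minimal sketch under that assumption (the dictionary contents and class names are placeholders):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class DictionaryBroadcast {
    public static void main(String[] args) {
        JavaSparkContext jsc = new JavaSparkContext(
            new SparkConf().setMaster("local[2]").setAppName("dictionary-broadcast"));

        // Load the large resource once on the driver (loading logic is hypothetical).
        Map<String, String> dictionary = new HashMap<>();
        dictionary.put("term", "definition");

        // Ship it to each executor once and reuse it from every task.
        Broadcast<Map<String, String>> dictBc = jsc.broadcast(dictionary);

        jsc.parallelize(Arrays.asList("term", "other"))
           .map(word -> dictBc.value().getOrDefault(word, "<unknown>"))
           .collect()
           .forEach(System.out::println);

        jsc.stop();
    }
}
```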

Re: [discuss] dropping Python 2.6 support

2016-01-10 Thread Dmitry Kniazev
st in the same environment. For example, we use virtualenv to run Spark with Python 2.7 and do not touch system Python 2.6. Thank you, Dmitry 09.01.2016, 06:36, "Sasha Kacanski" : > +1 > Companies that use stock python in redhat 2.6 will need to upgrade or install > fresh vers

[no subject]

2015-11-26 Thread Dmitry Tolpeko

Re: What are the .snapshot files in /home/spark/Snapshots?

2015-11-10 Thread Dmitry Goldenberg
N/m, these are just profiling snapshots :) Sorry for the wide distribution. On Tue, Nov 10, 2015 at 9:46 AM, Dmitry Goldenberg wrote: > We're seeing a bunch of .snapshot files being created under > /home/spark/Snapshots, such as the following for example: > > CoarseGrainedExe

What are the .snapshot files in /home/spark/Snapshots?

2015-11-10 Thread Dmitry Goldenberg
We're seeing a bunch of .snapshot files being created under /home/spark/Snapshots, such as the following for example: CoarseGrainedExecutorBackend-2015-08-27-shutdown.snapshot CoarseGrainedExecutorBackend-2015-08-31-shutdown-1.snapshot SparkSubmit-2015-08-31-shutdown-1.snapshot Worker-2015-08-27-s

Re: Broadcast var is null

2015-10-05 Thread Dmitry Pristin
ar = ssc.sparkContext.broadcast(Array(1, 2, 3)) >> >> //Try to print out the value of the broadcast var here >> val transformed = events.transform(rdd => { >> rdd.map(x => { >> if(broadcastVar == null) { >> println("broadcastVar is null") >>

Re: ThrowableSerializationWrapper: Task exception could not be deserialized / ClassNotFoundException: org.apache.solr.common.SolrException

2015-09-30 Thread Dmitry Goldenberg
d ahead of hbase-protocol.jar, things start to get > interesting ... > > On Tue, Sep 29, 2015 at 6:12 PM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> Ted, I think I have tried these settings with the hbase protocol jar, to >> no avail. >> >>

Re: How to tell Spark not to use /tmp for snappy-unknown-***-libsnappyjava.so

2015-09-30 Thread Dmitry Goldenberg
Thanks, Ted, will try it out. On Wed, Sep 30, 2015 at 9:07 AM, Ted Yu wrote: > See the tail of this: > https://bugzilla.redhat.com/show_bug.cgi?id=1005811 > > FYI > > > On Sep 30, 2015, at 5:54 AM, Dmitry Goldenberg > wrote: > > > > Is there a way t

How to tell Spark not to use /tmp for snappy-unknown-***-libsnappyjava.so

2015-09-30 Thread Dmitry Goldenberg
Is there a way to ensure Spark doesn't write to /tmp directory? We've got spark.local.dir specified in the spark-defaults.conf file to point at another directory. But we're seeing many of these snappy-unknown-***-libsnappyjava.so files being written to /tmp still. Is there a config setting or so
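A hedged sketch of the workaround implied by the bug report linked in the reply: spark.local.dir only covers Spark's own scratch space, so the native-library temp directory has to be moved separately via JVM options (the directory shown is illustrative and must exist on every node; org.xerial.snappy.tempdir is snappy-java's own override property):

```java
import org.apache.spark.SparkConf;

public class SnappyTempDirConf {
    public static SparkConf build() {
        // snappy-java extracts its native library into the JVM temp dir, not spark.local.dir.
        String tempOpts = "-Dorg.xerial.snappy.tempdir=/data/tmp -Djava.io.tmpdir=/data/tmp";
        return new SparkConf()
            .setAppName("snappy-tempdir-example")
            .set("spark.executor.extraJavaOptions", tempOpts)
            .set("spark.driver.extraJavaOptions", tempOpts);
    }
}
```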

Re: ThrowableSerializationWrapper: Task exception could not be deserialized / ClassNotFoundException: org.apache.solr.common.SolrException

2015-09-29 Thread Dmitry Goldenberg
I'm actually not sure how either one of these would possibly cause Spark to find SolrException. Whether the driver or executor class path is first, should it not matter, if the class is in the consumer job jar? On Tue, Sep 29, 2015 at 9:12 PM, Dmitry Goldenberg wrote: > Ted, I thin

Re: ThrowableSerializationWrapper: Task exception could not be deserialized / ClassNotFoundException: org.apache.solr.common.SolrException

2015-09-29 Thread Dmitry Goldenberg
ou tried the following ? > --conf spark.driver.userClassPathFirst=true --conf spark.executor. > userClassPathFirst=true > > On Tue, Sep 29, 2015 at 4:38 PM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> Release of Spark: 1.5.0. >> >> Command line

Re: ThrowableSerializationWrapper: Task exception could not be deserialized / ClassNotFoundException: org.apache.solr.common.SolrException

2015-09-29 Thread Dmitry Goldenberg
ase of Spark > command line for running Spark job > > Cheers > > On Tue, Sep 29, 2015 at 1:37 PM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> We're seeing this occasionally. Granted, this was caused by a wrinkle in >> the Solr schema but this

ThrowableSerializationWrapper: Task exception could not be deserialized / ClassNotFoundException: org.apache.solr.common.SolrException

2015-09-29 Thread Dmitry Goldenberg
We're seeing this occasionally. Granted, this was caused by a wrinkle in the Solr schema but this bubbled up all the way in Spark and caused job failures. I just checked and SolrException class is actually in the consumer job jar we use. Is there any reason why Spark cannot find the SolrException
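The classpath-precedence settings that come up later in this thread, shown here as a programmatic sketch for reference only (they are experimental and, per the replies above, did not resolve this particular case):

```java
import org.apache.spark.SparkConf;

public class UserClassPathFirstConf {
    public static SparkConf build() {
        return new SparkConf()
            .setAppName("solr-consumer")
            // Prefer classes from the application jar over Spark's own classpath
            // when the driver and executors deserialize task results and exceptions.
            .set("spark.driver.userClassPathFirst", "true")
            .set("spark.executor.userClassPathFirst", "true");
    }
}
```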

Re: Kafka error "partitions don't have a leader" / LeaderNotAvailableException

2015-09-29 Thread Dmitry Goldenberg
er, you > may want to look through the kafka jira, e.g. > > https://issues.apache.org/jira/browse/KAFKA-899 > > > On Tue, Sep 29, 2015 at 8:05 AM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> "more partitions and replicas than available brokers"

Re: Kafka error "partitions don't have a leader" / LeaderNotAvailableException

2015-09-29 Thread Dmitry Goldenberg
on the broker, you > may want to look through the kafka jira, e.g. > > https://issues.apache.org/jira/browse/KAFKA-899 > > > On Tue, Sep 29, 2015 at 8:05 AM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> "more partitions and replicas than avai

Re: Kafka error "partitions don't have a leader" / LeaderNotAvailableException

2015-09-29 Thread Dmitry Goldenberg
—describe topic_name and see which brokers are missing from the replica > assignment. > *(replace home, zk-quorum etc with your own set-up)* > > Lastly, has this ever worked? Maybe you’ve accidentally created the topic > with more partitions and replicas than available brok

Re: Kafka error "partitions don't have a leader" / LeaderNotAvailableException

2015-09-29 Thread Dmitry Goldenberg
set the advertised.host.name in Kafka's server.properties. Yay/nay? Thanks again. On Tue, Sep 29, 2015 at 8:31 AM, Adrian Tanase wrote: > I believe some of the brokers in your cluster died and there are a number > of partitions that nobody is currently managing. > > -adrian > >

Kafka error "partitions don't have a leader" / LeaderNotAvailableException

2015-09-29 Thread Dmitry Goldenberg
I apologize for posting this Kafka related issue into the Spark list. Have gotten no responses on the Kafka list and was hoping someone on this list could shed some light on the below. --- We're running into this

Re: A way to timeout and terminate a laggard 'Stage' ?

2015-09-15 Thread Dmitry Goldenberg
proach, but yes you >> can have a separate program to keep an eye in the webui (possibly parsing >> the content) and make it trigger the kill task/job once it detects a lag. >> (Again you will have to figure out the correct numbers before killing any >> job) >> >>

A way to timeout and terminate a laggard 'Stage' ?

2015-09-14 Thread Dmitry Goldenberg
Is there a way in Spark to automatically terminate laggard "stages", ones that appear to be hanging? In other words, is there a timeout for processing of a given RDD? In the Spark GUI, I see the "kill" function for a given Stage under "Details for Job <...>". Is there something in Spark that w

A way to kill laggard jobs?

2015-09-11 Thread Dmitry Goldenberg
Is there a way to kill a laggard Spark job manually, and more importantly, is there a way to do it programmatically based on a configurable timeout value? Thanks.
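One programmatic pattern (short of a built-in timeout, which Spark does not appear to offer) is to tag work with a job group and cancel the group from a watchdog once a configurable deadline passes; a minimal sketch, with the group name and timeout purely illustrative:

```java
import java.util.Arrays;
import java.util.Timer;
import java.util.TimerTask;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class LaggardJobKiller {
    public static void main(String[] args) {
        JavaSparkContext jsc = new JavaSparkContext(
            new SparkConf().setMaster("local[2]").setAppName("laggard-killer"));
        long timeoutMillis = 60_000L;  // configurable timeout

        // Tag everything submitted from this thread; interruptOnCancel=true asks
        // Spark to interrupt the running tasks when the group is cancelled.
        jsc.setJobGroup("slow-batch", "possibly laggard work", true);

        // Watchdog: cancel the whole group if it is still running after the timeout.
        Timer watchdog = new Timer("job-watchdog", true);
        watchdog.schedule(new TimerTask() {
            @Override public void run() {
                jsc.cancelJobGroup("slow-batch");
            }
        }, timeoutMillis);

        jsc.parallelize(Arrays.asList(1, 2, 3)).count();  // stand-in for the real job

        watchdog.cancel();
        jsc.stop();
    }
}
```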

Re: Using KafkaDirectStream, stopGracefully and exceptions

2015-09-10 Thread Dmitry Goldenberg
>> checkpoints can't be used between controlled restarts Is that true? If so, why? From my testing, checkpoints appear to be working fine, we get the data we've missed between the time the consumer went down and the time we brought it back up. >> If I cannot make checkpoints between code upgrades

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-10 Thread Dmitry Goldenberg
the time of > recovery? Trying to understand your usecase. > > > On Wed, Sep 9, 2015 at 12:03 PM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> >> when you use getOrCreate, and there exists a valid checkpoint, it will >> always return the c

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-09 Thread Dmitry Goldenberg
pointing to override its checkpoint duration millis, is there? Is the default there max(batchdurationmillis, 10seconds)? Is there a way to override this? Thanks. On Wed, Sep 9, 2015 at 2:44 PM, Tathagata Das wrote: > > > See inline. > > On Tue, Sep 8, 2015 at 9:02 PM, Dmitry

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Dmitry Goldenberg
sing here? On Tue, Sep 8, 2015 at 11:42 PM, Tathagata Das wrote: > Well, you are returning JavaStreamingContext.getOrCreate(params. > getCheckpointDir(), factory); > That is loading the checkpointed context, independent of whether params > .isCheckpointed() is true. > > &

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Dmitry Goldenberg
ata > in each batch > > (which checkpoint is enabled using > streamingContext.checkpoint(checkpointDir)) and can recover from failure by > reading the exact same data back from Kafka. > > > TD > > On Tue, Sep 8, 2015 at 4:38 PM, Dmitry Goldenberg < > dgoldenbe

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Dmitry Goldenberg
:23 PM, Tathagata Das wrote: > Why are you checkpointing the direct kafka stream? It serves not purpose. > > TD > > On Tue, Sep 8, 2015 at 9:35 AM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> I just disabled checkpointing in our consumers and

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Dmitry Goldenberg
ger wrote: > Well, I'm not sure why you're checkpointing messages. > > I'd also put in some logging to see what values are actually being read > out of your params object for the various settings. > > > On Tue, Sep 8, 2015 at 10:24 AM, Dmitry Goldenberg < > dg

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Dmitry Goldenberg
what values are actually being read > out of your params object for the various settings. > > > On Tue, Sep 8, 2015 at 10:24 AM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> I've stopped the jobs, the workers, and the master. Deleted the contents >

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Dmitry Goldenberg
or moving the contents of the checkpoint directory > and restarting the job? > > On Fri, Sep 4, 2015 at 8:02 PM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> Sorry, more relevant code below: >> >> SparkConf sparkConf = createSparkConf(appName, ka

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-04 Thread Dmitry Goldenberg
ew ProcessPartitionFunction(params); rdd.foreachPartition(func); return null; } }); return jssc; } On Fri, Sep 4, 2015 at 8:57 PM, Dmitry Goldenberg wrote: > I'd think that we wouldn't be "accidentally recovering from checkpoint" > hours or ev

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-04 Thread Dmitry Goldenberg
ontext(sparkConf, params);jssc.start(); jssc.awaitTermination(); jssc.close(); On Fri, Sep 4, 2015 at 8:48 PM, Tathagata Das wrote: > Are you sure you are not accidentally recovering from checkpoint? How are > you using StreamingContext.getOrCreate() in your code? > > TD &

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-04 Thread Dmitry Goldenberg
UI says? It should show > the underlying batch duration of the StreamingContext, the details of when > the batch starts, etc. > > BTW, it seems that the 5.6 or 6.8 seconds delay is present only when data > is present (that is, * Documents processed: > 0)* > > On Fri,

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-04 Thread Dmitry Goldenberg
c={}.)...", appName, topic); // ... iterate ... log.info("Finished data partition processing (appName={}, topic={}). Documents processed: {}.", appName, topic, docCount); } Any ideas? Thanks. - Dmitry On Thu, Sep 3, 2015 at 10:45 PM, Tathagata Das wrote: > Are you accidentally recoveri

Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-03 Thread Dmitry Goldenberg
rkers, and consumers and the value seems "stuck" at 10 seconds. Any ideas? We're running Spark 1.3.0 built for Hadoop 2.4. Thanks. - Dmitry
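The stickiness follows from how getOrCreate works: the factory only runs when no valid checkpoint exists, so a recovered context keeps the batch duration it was checkpointed with. A minimal sketch, assuming a Spark release whose Java getOrCreate overload takes a Function0 (older releases use a JavaStreamingContextFactory, as in the code quoted upthread):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function0;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamingContextSetup {
    public static JavaStreamingContext create(String checkpointDir, long batchMillis) {
        Function0<JavaStreamingContext> factory = () -> {
            JavaStreamingContext jssc = new JavaStreamingContext(
                new SparkConf().setAppName("direct-kafka-consumer"),
                Durations.milliseconds(batchMillis));
            jssc.checkpoint(checkpointDir);
            // ... define the Kafka direct stream and its processing here ...
            return jssc;
        };
        // Runs the factory only when no valid checkpoint exists; otherwise it restores
        // the saved context together with its original batch duration.
        return JavaStreamingContext.getOrCreate(checkpointDir, factory);
    }
}
```

Changing batchMillis therefore has no effect until the checkpoint directory is cleared or pointed elsewhere.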

Re: Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-08-14 Thread Dmitry Goldenberg
er program could terminate as the last batch is being processed... On Fri, Aug 14, 2015 at 6:17 PM, Cody Koeninger wrote: > You'll resume and re-process the rdd that didnt finish > > On Fri, Aug 14, 2015 at 1:31 PM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote

Re: Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-08-14 Thread Dmitry Goldenberg
s the interval? that could at least > explain why it's not doing anything, if it's quite long. > > It sort of seems wrong though since > https://spark.apache.org/docs/latest/streaming-programming-guide.html > suggests it was intended to be a multiple of the batch interval. Th

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Dmitry Goldenberg
to workers. Hefty file system usage, hefty memory consumption... What can we do to offset some of these costs? On Mon, Aug 10, 2015 at 4:27 PM, Cody Koeninger wrote: > The rdd is indeed defined by mostly just the offsets / topic partitions. > > On Mon, Aug 10, 2015 at 3:24 PM

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Dmitry Goldenberg
om? Thanks. On Mon, Aug 10, 2015 at 1:07 PM, Cody Koeninger wrote: > You need to keep a certain number of rdds around for checkpointing, based > on e.g. the window size. Those would all need to be loaded at once. > > On Mon, Aug 10, 2015 at 11:49 AM, Dmitry Goldenberg < > dgold

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Dmitry Goldenberg
r we can estimate the > size of the checkpoint and compare with Runtime.getRuntime().freeMemory(). > > If the size of checkpoint is much bigger than free memory, log warning, etc > > Cheers > > On Mon, Aug 10, 2015 at 9:34 AM, Dmitry Goldenberg < > dgoldenberg...@gmail.com&

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Dmitry Goldenberg
at looks like it's during recovery from a checkpoint, so it'd be driver > memory not executor memory. > > How big is the checkpoint directory that you're trying to restore from? > > On Mon, Aug 10, 2015 at 10:57 AM, Dmitry Goldenberg < > dgoldenberg...@gmail.com&g

How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Dmitry Goldenberg
We're getting the below error. Tried increasing spark.executor.memory e.g. from 1g to 2g but the below error still happens. Any recommendations? Something to do with specifying -Xmx in the submit job scripts? Thanks. Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit excee

Re: Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-07-31 Thread Dmitry Goldenberg
est/streaming-programming-guide.html > suggests it was intended to be a multiple of the batch interval. The > slide duration wouldn't always be relevant anyway. > > On Fri, Jul 31, 2015 at 6:16 PM, Dmitry Goldenberg > wrote: > > I've instrumented checkpointing per the

Re: Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-07-31 Thread Dmitry Goldenberg
f it's quite long. > > It sort of seems wrong though since > https://spark.apache.org/docs/latest/streaming-programming-guide.html > suggests it was intended to be a multiple of the batch interval. The > slide duration wouldn't always be relevant anyway. > > On Fri,

Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-07-31 Thread Dmitry Goldenberg
I've instrumented checkpointing per the programming guide and I can tell that Spark Streaming is creating the checkpoint directories but I'm not seeing any content being created in those directories nor am I seeing the effects I'd expect from checkpointing. I'd expect any data that comes into Kafk
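For a direct stream, the checkpoint mainly records offset metadata rather than message content, so small checkpoint directories are normal. A small sketch (Kafka 0.8 integration API) that logs which offsets each batch covered, which helps verify that recovery resumes where processing left off; it must be called on the RDD produced directly by the stream, before any transformation:

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.streaming.kafka.HasOffsetRanges;
import org.apache.spark.streaming.kafka.OffsetRange;

public class OffsetLogger {
    // Call from inside foreachRDD on the direct stream's own RDD.
    public static void logOffsets(JavaRDD<?> rdd) {
        OffsetRange[] ranges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
        for (OffsetRange range : ranges) {
            System.out.printf("topic=%s partition=%d from=%d until=%d%n",
                range.topic(), range.partition(), range.fromOffset(), range.untilOffset());
        }
    }
}
```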

Re: What is a best practice for passing environment variables to Spark workers?

2015-07-10 Thread Dmitry Goldenberg
Thanks, Akhil. We're trying the conf.setExecutorEnv() approach since we've already got environment variables set. For system properties we'd go the conf.set("spark.") route. We were concerned that doing the below type of thing did not work, which this blog post seems to confirm ( http://proge

Re: Best practice for using singletons on workers (seems unanswered) ?

2015-07-08 Thread Dmitry Goldenberg
I and DB clients on the executor > side. I think it's the most straightforward approach to dealing with any > non-serializable object you need. > > I don't entirely follow what over-network data shuffling effects you are > alluding to (maybe more specific to streaming?). &g

Re: foreachRDD vs. forearchPartition ?

2015-07-08 Thread Dmitry Goldenberg
ean it > 'works' to call foreachRDD on an RDD? > > @Dmitry are you asking about foreach vs foreachPartition? that's quite > different. foreachPartition does not give more parallelism but lets > you operate on a whole batch of data at once, which is nice if you > need

Re: foreachRDD vs. forearchPartition ?

2015-07-08 Thread Dmitry Goldenberg
Thanks, Cody. The "good boy" comment wasn't from me :) I was the one asking for help. On Wed, Jul 8, 2015 at 10:52 AM, Cody Koeninger wrote: > Sean already answered your question. foreachRDD and foreachPartition are > completely different, there's nothing fuzzy or insufficient about that > a

Re: foreachRDD vs. forearchPartition ?

2015-07-08 Thread Dmitry Goldenberg
"These are quite different operations. One operates on RDDs in DStream and one operates on partitions of an RDD. They are not alternatives." Sean, different operations as they are, they can certainly be used on the same data set. In that sense, they are alternatives. Code can be written using on

Re: Best practice for using singletons on workers (seems unanswered) ?

2015-07-08 Thread Dmitry Goldenberg
My singletons do in fact stick around. They're one per worker, looks like. So with 4 workers running on the box, we're creating one singleton per worker process/jvm, which seems OK. Still curious about foreachPartition vs. foreachRDD though... On Tue, Jul 7, 2015 at 11:27 AM, Richard Marscher wr

Re: Best practice for using singletons on workers (seems unanswered) ?

2015-07-08 Thread Dmitry Goldenberg
Richard, That's exactly the strategy I've been trying, which is a wrapper singleton class. But I was seeing the inner object being created multiple times. I wonder if the problem has to do with the way I'm processing the RDDs. I'm using JavaDStream to stream data (from Kafka). Then I'm processin
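For reference, a minimal sketch of the lazy-static wrapper being described (class names hypothetical); because static fields are per-JVM and never serialized with the closure, each executor process ends up with exactly one instance, which matches the one-per-worker behavior noted above:

```java
public class ServiceClientHolder {
    // Static state lives once per executor JVM and is never serialized with a closure;
    // it is built lazily the first time a task on that executor asks for it.
    private static ExpensiveClient client;

    public static synchronized ExpensiveClient get() {
        if (client == null) {
            client = new ExpensiveClient();  // hypothetical heavy, non-serializable client
        }
        return client;
    }

    // Stand-in for the real resource (Solr/DB/HTTP client, etc.).
    public static class ExpensiveClient {
        public void send(String record) { /* no-op placeholder */ }
    }
}
```

Calling ServiceClientHolder.get() inside foreachPartition, rather than constructing the client on the driver, keeps the heavy object out of the serialized closure entirely.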

Re: Any way to retrieve time of message arrival to Kafka topic, in Spark Streaming?

2015-06-23 Thread Dmitry Goldenberg
Yes, Akhil. We already have an origination timestamp in the body of the message when we send it. But we can't guarantee the network speed nor a precise enough synchronization of clocks across machines. Pulling the timestamp from Kafka itself would be a step forward although the broker is most l

Re: Registering custom metrics

2015-06-22 Thread Dmitry Goldenberg
Great, thank you, Silvio. In your experience, is there any way to instrument a callback into Coda Hale or the Spark consumers from the metrics sink? If the sink performs some steps once it has received the metrics, I'd like to be able to make the consumers aware of that via some sort of a callback.

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-11 Thread Dmitry Goldenberg
gt; > > > > > > On Thu, Jun 11, 2015 at 7:30 AM, Cody Koeninger > wrote: > >> Depends on what you're reusing multiple times (if anything). >> >> Read >> http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence >> >

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-11 Thread Dmitry Goldenberg
rces of the currently running cluster > > > > I was thinking more about the scenario where you have e.g. 100 boxes and > want to / can add e.g. 20 more > > > > *From:* Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com] > *Sent:* Wednesday, June 3, 2015 4:46 PM > *T

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-09 Thread Dmitry Goldenberg
re > unless you want it to. > > If you want it, just call cache. > > On Thu, Jun 4, 2015 at 8:20 AM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> "set the storage policy for the DStream RDDs to MEMORY AND DISK" - it >> appears t

Re: How to share large resources like dictionaries while processing data with Spark ?

2015-06-05 Thread Dmitry Goldenberg
ads > > > > Re Dmytiis intial question – you can load large data sets as Batch > (Static) RDD from any Spark Streaming App and then join DStream RDDs > against them to emulate “lookups” , you can also try the “Lookup RDD” – > there is a git hub project > > > > *From

Re: How to share large resources like dictionaries while processing data with Spark ?

2015-06-04 Thread Dmitry Goldenberg
Thanks so much, Yiannis, Olivier, Huang! On Thu, Jun 4, 2015 at 6:44 PM, Yiannis Gkoufas wrote: > Hi there, > > I would recommend checking out > https://github.com/spark-jobserver/spark-jobserver which I think gives > the functionality you are looking for. > I haven't tested it though. > > BR >

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-04 Thread Dmitry Goldenberg
from Samsung Mobile > > Original message > > From: Evo Eftimov > > Date:2015/05/28 13:22 (GMT+00:00) > > To: Dmitry Goldenberg > > Cc: Gerard Maas ,spark users > > Subject: Re: Autoscaling Spark cluster based on topic sizes/rate of growth &

Re: StreamingListener, anyone?

2015-06-04 Thread Dmitry Goldenberg
blocked in awaitTermination. So what would be a way to trigger the termination in the driver? "context.awaitTermination() allows the current thread to wait for the termination of a context by stop() or by an exception" - presumably, we need to call stop() somewhere or perhaps throw. Chee

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-03 Thread Dmitry Goldenberg
und/parallel the > resources of the currently running cluster > > > > I was thinking more about the scenario where you have e.g. 100 boxes and > want to / can add e.g. 20 more > > > > *From:* Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com] > *Sent:* Wednesday,

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-03 Thread Dmitry Goldenberg
es of the currently running cluster > > > > I was thinking more about the scenario where you have e.g. 100 boxes and > want to / can add e.g. 20 more > > > > *From:* Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com] > *Sent:* Wednesday, June 3, 2015 4:46 PM > *To:* E

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-03 Thread Dmitry Goldenberg
> nodes which allows you to double, triple etc in the background/parallel the > resources of the currently running cluster > > > > I was thinking more about the scenario where you have e.g. 100 boxes and > want to / can add e.g. 20 more > > > > *From:* Dmitry Go

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-03 Thread Dmitry Goldenberg
e and make them contact > and register with the Master without bringing down the Master (or any of > the currently running worker nodes) > > > > Then just shutdown your currently running spark streaming job/app and > restart it with new params to take advantage of the larg

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-03 Thread Dmitry Goldenberg
. >> >> -Andrew >> >> >> >> 2015-05-28 8:02 GMT-07:00 Evo Eftimov : >> >> Probably you should ALWAYS keep the RDD storage policy to MEMORY AND DISK >>> – it will be your insurance policy against sys crashes due to memory leaks. >>

Re: Objects serialized before foreachRDD/foreachPartition ?

2015-06-03 Thread Dmitry Goldenberg
am - ? public class Param { // ==> potentially a very hefty resource to load private Map dictionary = new HashMap(); ... } I'm grokking that Spark will serialize Param right before the call to foreachRDD, if we're to broadcast... On Wed, Jun 3, 2015 at 9:58 AM,

Re: How to monitor Spark Streaming from Kafka?

2015-06-01 Thread Dmitry Goldenberg
Thank you, Tathagata, Cody, Otis. - Dmitry On Mon, Jun 1, 2015 at 6:57 PM, Otis Gospodnetic wrote: > I think you can use SPM - http://sematext.com/spm - it will give you all > Spark and all Kafka metrics, including offsets broken down by topic, etc. > out of the box. I see more

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Dmitry Goldenberg
>> Probably you should ALWAYS keep the RDD storage policy to MEMORY AND DISK >>> – it will be your insurance policy against sys crashes due to memory leaks. >>> Until there is free RAM, spark streaming (spark) will NOT resort to disk – >>> and of course resorting to d

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Dmitry Goldenberg
at the optimal patterns... Regards, - Dmitry On Thu, May 28, 2015 at 3:21 PM, Andrew Or wrote: > Hi all, > > As the author of the dynamic allocation feature I can offer a few insights > here. > > Gerard's explanation was both correct and concise: dynamic allocation is > not int

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Dmitry Goldenberg
> > > > > Sent from Samsung Mobile > > ---- Original message > > From: Evo Eftimov > > Date:2015/05/28 13:22 (GMT+00:00) > > To: Dmitry Goldenberg > > Cc: Gerard Maas ,spark users > > Subject: Re: Autoscaling Spark cluster based on top

Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Dmitry Goldenberg
Thanks, Evo. Per the last part of your comment, it sounds like we will need to implement a job manager which will be in control of starting the jobs, monitoring the status of the Kafka topic(s), shutting jobs down and marking them as ones to relaunch, scaling the cluster up/down by adding/removing

Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Dmitry Goldenberg
l of machines, my subsequent questions then are, a) will Spark sense the addition of a new node / is it sufficient that the cluster manager is aware, then work just starts flowing there? and b) what would be a way to gracefully remove a worker node when the load subsides, so that no currently runn

Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-27 Thread Dmitry Goldenberg
Thanks, Rajesh. I think that acquiring/relinquishing executors is important but I feel like there are at least two layers for resource allocation and autoscaling. It seems that acquiring and relinquishing executors is a way to optimize resource utilization within a pre-set Spark cluster of machine

Re: Spark Streaming from Kafka - no receivers and spark.streaming.receiver.maxRate?

2015-05-27 Thread Dmitry Goldenberg
Got it, thank you, Tathagata and Ted. Could you comment on my other question as well? Basically, I'm trying to get a handle on a good approa

Re: Migrate Relational to Distributed

2015-05-23 Thread Dmitry Tolpeko
can help. Thanks, Dmitry On Sat, May 23, 2015 at 1:22 AM, Brant Seibert wrote: > Hi, The healthcare industry can do wonderful things with Apache Spark. > But, > there is already a very large base of data and applications firmly rooted > in > the relational paradigm and they a

Re: Spark Streaming and reducing latency

2015-05-18 Thread Dmitry Goldenberg
Thanks, Akhil. So what do folks typically do to increase/contract the capacity? Do you plug in some cluster auto-scaling solution to make this elastic? Does Spark have any hooks for instrumenting auto-scaling? In other words, how do you avoid overwhelming the receivers in a scenario when your sy

Re: Spark and RabbitMQ

2015-05-12 Thread Dmitry Goldenberg
Thanks, Akhil. It looks like in the second example, for Rabbit they're doing this: https://www.rabbitmq.com/mqtt.html. On Tue, May 12, 2015 at 7:37 AM, Akhil Das wrote: > I found two examples Java version >

Re: Running Spark in local mode seems to ignore local[N]

2015-05-11 Thread Dmitry Goldenberg
batch interval of 10s > and block interval of 1s you'll get 10 partitions of data in the RDD. > > On Mon, May 11, 2015 at 10:29 PM, Dmitry Goldenberg > wrote: > > Understood. We'll use the multi-threaded code we already have.. > > > > How are these execu

Re: Running Spark in local mode seems to ignore local[N]

2015-05-11 Thread Dmitry Goldenberg
rtitions in a streaming RDD is determined by the block > interval and the batch interval. If you have a batch interval of 10s > and block interval of 1s you'll get 10 partitions of data in the RDD. > > On Mon, May 11, 2015 at 10:29 PM, Dmitry Goldenberg > wrote: > > Under
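A minimal sketch tying the quoted explanation together, assuming a receiver-based stream (property value format as in recent releases): local[N] caps the worker threads, but the partition count of each batch comes from batchInterval / blockInterval.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class LocalModePartitioning {
    public static JavaStreamingContext build() {
        SparkConf conf = new SparkConf()
            .setMaster("local[4]")  // 4 worker threads in a single JVM
            .setAppName("local-mode-partitioning")
            // With a 10 s batch and a 1 s block interval, each batch RDD has ~10 partitions,
            // independent of the N in local[N].
            .set("spark.streaming.blockInterval", "1000ms");
        return new JavaStreamingContext(conf, Durations.seconds(10));
    }
}
```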
