Re: Best way to present data collected by Flume through Spark

2016-09-15 Thread Jeff Nadler
Yes, we do something very similar and it's working well: Kafka -> Spark Streaming (writes temp files of serialized RDDs) -> Spark batch application (builds partitioned Parquet files on HDFS; this is needed because building Parquet files of a reasonable size is too slow to do in streaming) -> query with Spark…
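
A minimal sketch of the streaming half of that pipeline (Kafka 0.8-era direct stream API; the topic, broker, and landing path are illustrative assumptions, not the poster's actual code):

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object LandingJob {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("kafka-landing"), Seconds(10))

        val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, Map("metadata.broker.list" -> "broker1:9092"), Set("events"))

        // Land each micro-batch as a cheap serialized-RDD file; a separate
        // batch job compacts these into right-sized partitioned Parquet files.
        stream.map(_._2).foreachRDD { (rdd, time) =>
          if (!rdd.isEmpty()) rdd.saveAsObjectFile(s"/landing/batch-${time.milliseconds}")
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }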

Re: Not all KafkaReceivers processing the data Why?

2016-09-14 Thread Jeff Nadler
Partition: 4 Leader: 89 Replicas: 89 Isr: 89 …

Re: Not all KafkaReceivers processing the data Why?

2016-09-14 Thread Jeff Nadler
Have you checked your Kafka brokers to be certain that data is going to all 5 partitions? We use something very similar (but in Scala) and have no problems. Also, you might not get the best response blasting both user+dev lists like this; normally you'd want to use 'user' only. -Jeff On Wed…
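
For reference, a sketch of the receiver-based pattern under discussion: one receiver per partition, unioned into a single DStream. If data only reaches some partitions, the other receivers sit idle. Topic, ZK quorum, and group id are illustrative assumptions:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(new SparkConf().setAppName("receivers"), Seconds(5))

    // Five receivers, matching the topic's five partitions.
    val streams = (1 to 5).map { _ =>
      KafkaUtils.createStream(ssc, "zk1:2181", "my-group", Map("my-topic" -> 1))
    }
    val unified = ssc.union(streams).map(_._2)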

Re: Streaming Backpressure with Multiple Streams

2016-09-14 Thread Jeff Nadler
…Fri, Sep 9, 2016 at 5:54 PM, Jeff Nadler wrote: Yes I'll test that next. On Sep 9, 2016 5:36 PM, "Cody Koeninger" wrote: Does the same thing happen if you're only using direct stream plus backpressure, not the receiver stream? …

Re: Streaming Backpressure with Multiple Streams

2016-09-09 Thread Jeff Nadler
Yes I'll test that next. On Sep 9, 2016 5:36 PM, "Cody Koeninger" wrote: Does the same thing happen if you're only using direct stream plus backpressure, not the receiver stream? On Sep 9, 2016 6:41 PM, "Jeff Nadler" wrote: Mayb…

Streaming Backpressure with Multiple Streams

2016-09-09 Thread Jeff Nadler
…consumption until it is eventually consuming 1 record / second / partition. This happens even though there's no scheduling delay, and the receiver-based stream does not appear to be throttled. Anyone ever see anything like this? Thanks! Jeff Nadler, Aerohive Networks
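
For anyone hitting the same wall, these are the backpressure knobs involved; a sketch under the assumption that the PID rate estimator is driving the rate down to its floor, which is not a confirmed diagnosis from this thread:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.streaming.backpressure.enabled", "true")
      // Floor for the PID rate estimator (records/sec); the default is 100.
      .set("spark.streaming.backpressure.pid.minRate", "1000")
      // Optional hard cap per Kafka partition for the direct stream.
      .set("spark.streaming.kafka.maxRatePerPartition", "10000")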

Re: Storing object in spark streaming

2015-10-12 Thread Jeff Nadler
Your receiver must extend Receiver[String]. Try changing it to extend Receiver[Message]? On Mon, Oct 12, 2015 at 2:03 PM, Something Something <mailinglist...@gmail.com> wrote: In my custom receiver for Spark Streaming I have code such as this: messages.toArray().foreach(ms…
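
A minimal sketch of what that change looks like; Message stands in for whatever type the original poster is actually producing:

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    case class Message(body: String)

    // The Receiver's type parameter must match what is passed to store();
    // extending Receiver[Message] yields a DStream[Message] downstream.
    class MessageReceiver extends Receiver[Message](StorageLevel.MEMORY_AND_DISK_2) {
      override def onStart(): Unit = {
        new Thread("message-receiver") {
          override def run(): Unit = {
            while (!isStopped()) store(Message("payload")) // read from the real source here
          }
        }.start()
      }
      override def onStop(): Unit = () // the reading thread exits via isStopped()
    }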

Re: Weird performance pattern of Spark Streaming (1.4.1) + direct Kafka

2015-10-06 Thread Jeff Nadler
Gerard - any chance this is related to task locality waiting? Can you try (just as a diagnostic) something like this, and see if the unexpected delay goes away? .set("spark.locality.wait", "0") On Tue, Oct 6, 2015 at 12:00 PM, Gerard Maas wrote: Hi Cody, The job is doing ETL from Kafka record…
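
In context, that setting goes on the SparkConf; a short sketch, with the caveat that 0 is for diagnosis only since it disables locality preference entirely (the default wait is 3s per locality level):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("locality-test")
      .set("spark.locality.wait", "0") // diagnostic: schedule immediately, ignore locality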

Re: API to run spark Jobs

2015-10-06 Thread Jeff Nadler
…used by core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala FYI On Tue, Oct 6, 2015 at 10:08 AM, shahid qadri wrote: hi Jeff, thanks. More specifically I need the REST API to submit a pyspark job; can you point me to Sp…

Re: API to run spark Jobs

2015-10-06 Thread Jeff Nadler
Spark standalone doesn't come with a UI for submitting jobs. Some Hadoop distros might; for example, EMR in AWS has a job-submit UI. spark-submit just calls a REST API, so you could build any UI you want on top of that... On Tue, Oct 6, 2015 at 9:37 AM, shahid qadri wrote: Hi Folks, How i c…
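
That REST endpoint is the undocumented standalone cluster-mode submission server (default port 6066) that spark-submit itself talks to. The sketch below is from reading RestSubmissionClient, so the field names and behavior are assumptions to verify against your Spark version:

    import java.net.{HttpURLConnection, URL}

    val payload = """{
      "action": "CreateSubmissionRequest",
      "clientSparkVersion": "1.5.1",
      "appResource": "hdfs:///jars/my-app.jar",
      "mainClass": "com.example.Main",
      "appArgs": ["arg1"],
      "sparkProperties": {
        "spark.app.name": "my-app",
        "spark.master": "spark://master:6066",
        "spark.submit.deployMode": "cluster"
      },
      "environmentVariables": { "SPARK_ENV_LOADED": "1" }
    }"""

    val conn = new URL("http://master:6066/v1/submissions/create")
      .openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    conn.getOutputStream.write(payload.getBytes("UTF-8"))
    println(scala.io.Source.fromInputStream(conn.getInputStream).mkString)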

Streaming Performance w/ UpdateStateByKey

2015-10-05 Thread Jeff Nadler
While investigating performance challenges in a Spark Streaming application using updateStateByKey, I found that serialization of state was a meaningful (though not dominant) portion of our execution time. In StateDStream.scala, serialized persistence is required: super.persist(StorageLevel.MEMORY_ONLY_SER)…
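
For context, the pattern being measured; a toy stateful count, where ssc and events (a DStream[String]) are assumed from context. The cost in question comes from StateDStream persisting the full state RDD in serialized form every batch:

    ssc.checkpoint("/checkpoints") // required for stateful transformations

    def updateCount(newValues: Seq[Long], state: Option[Long]): Option[Long] =
      Some(newValues.sum + state.getOrElse(0L))

    val counts = events.map(word => (word, 1L)).updateStateByKey(updateCount)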

Re: Multiple HA spark clusters managed by 1 ZK cluster?

2015-04-22 Thread Jeff Nadler
You can run multiple Spark clusters against one ZK cluster. Just use this config to set an independent ZK root for each cluster: spark.deploy.zookeeper.dir - the directory in ZooKeeper in which to store recovery state (default: /spark). -Jeff …
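
Concretely, on each cluster's masters this looks something like the sketch below (hostnames and roots are illustrative):

    # spark-env.sh on the masters of cluster A
    SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
      -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
      -Dspark.deploy.zookeeper.dir=/spark-cluster-a"

    # cluster B: same ZK ensemble, different root
    #   -Dspark.deploy.zookeeper.dir=/spark-cluster-b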

StackOverflowError from KafkaReceiver when rate limiting used

2015-04-16 Thread Jeff Nadler
…the Scala compiler isn't optimizing waitToPush into a loop? Looks like tail recursion, no? Thanks - Jeff Nadler
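
For what it's worth, looking tail-recursive isn't sufficient: scalac only rewrites a self-call into a loop when the method cannot be overridden. A sketch of the general gotcha (not the actual RateLimiter code):

    import scala.annotation.tailrec

    class Limiter {
      // Overridable method: scalac will NOT optimize this, so deep
      // recursion here grows the JVM stack until StackOverflowError.
      def waitOpen(): Unit = if (!tryPush()) waitOpen()

      // final (or private) + @tailrec: compiled to a loop, and the
      // annotation makes the compiler fail if it can't do the rewrite.
      @tailrec final def waitLoop(): Unit = if (!tryPush()) waitLoop()

      private def tryPush(): Boolean = true
    }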

Re: Streaming with Java: Expected ReduceByWindow to Return JavaDStream

2015-01-19 Thread Jeff Nadler
…JavaDStreamLike is used for Java code, so returning a Scala DStream is not reasonable. You can fix this by submitting a PR, or I can help you fix it. Thanks, Jerry From: Jeff Nadler Sent: Monday, January 19,…

Streaming with Java: Expected ReduceByWindow to Return JavaDStream

2015-01-18 Thread Jeff Nadler
…slideDuration: Duration): DStream[T] = { dstream.reduceByWindow(reduceFunc, windowDuration, slideDuration) } So I'm just a noob. Is this a bug, or am I missing something? Thanks! Jeff Nadler
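
If it is a bug, the likely shape of the fix is to re-wrap the result for Java callers; a sketch, not the actual patch (the ClassTag plumbing is my assumption):

    import scala.reflect.ClassTag
    import org.apache.spark.streaming.Duration
    import org.apache.spark.streaming.api.java.JavaDStream
    import org.apache.spark.streaming.dstream.DStream

    // Wrap the underlying Scala DStream so Java callers get a JavaDStream back.
    def reduceByWindow[T: ClassTag](
        dstream: DStream[T],
        reduceFunc: (T, T) => T,
        windowDuration: Duration,
        slideDuration: Duration): JavaDStream[T] =
      JavaDStream.fromDStream(dstream.reduceByWindow(reduceFunc, windowDuration, slideDuration))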