Hello Flavio,

It sounds to me like the best solution for you is to implement your own ReceiverInputDStream/Receiver component to feed your events into Spark Streaming as a DStream. It is not as scary as it sounds; take a look at some of the examples like TwitterInputDStream
<https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala>,
which is only ~60 lines of code.
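If it helps, here is a rough skeleton of the same pattern with the Twitter-specific parts stripped out, modelled on the custom receivers guide. The class name and the socket-based event source are placeholders for whatever your real source is:

    import java.io.{BufferedReader, InputStreamReader}
    import java.net.Socket
    import java.nio.charset.StandardCharsets

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class EventReceiver(host: String, port: Int)
      extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

      def onStart() {
        // Receive on a background thread so onStart() returns immediately.
        new Thread("Event Receiver") {
          override def run() { receive() }
        }.start()
      }

      def onStop() {
        // Nothing to do: the receive thread checks isStopped() and exits.
      }

      private def receive() {
        try {
          val socket = new Socket(host, port)
          val reader = new BufferedReader(
            new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
          var line = reader.readLine()
          while (!isStopped() && line != null) {
            store(line) // hand each event to Spark to be batched into a DStream
            line = reader.readLine()
          }
          reader.close()
          socket.close()
          restart("Trying to connect again") // ask Spark to restart the receiver
        } catch {
          case t: Throwable => restart("Error receiving data", t)
        }
      }
    }

You would wire it in with ssc.receiverStream(new EventReceiver("localhost", 9999)), which gives you a DStream[String] to work with.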
In the Twitter example, the key snippet to look for is:

    def onStart() {
      try {
        val newTwitterStream = new TwitterStreamFactory().getInstance(twitterAuth)
        newTwitterStream.addListener(new StatusListener {
          def onStatus(status: Status) = {
            store(status)
          }
          ...

This is the interface between twitter4j (the external library which receives the tweets) and Spark: you call store(), which passes the object down to Spark to be batched into a DStream and fed into Spark Streaming.

There is some excellent documentation on implementing custom receivers here:
http://spark.apache.org/docs/latest/streaming-custom-receivers.html

Using your own custom receiver means you can implement flow control to ingest data at a sensible rate, as well as failover/retry logic, etc.

Best of luck!

MC

Michael Cutler
Founder, CTO
tumra.com
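P.S. Since limiting the rate was the original question: the natural place to throttle is the receive loop itself, before calling store(). Below is a minimal sketch, assuming Guava's RateLimiter is on your classpath (Guava is a Spark dependency); ThrottledReceiver and fetchNextEvent() are hypothetical names, and the events-per-second figure is arbitrary. (See also the bottom of this mail for a sketch of the "component in front" idea Soumya mentioned.)

    import com.google.common.util.concurrent.RateLimiter
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    // Hypothetical receiver that caps ingestion at a fixed events/second
    // rate, so downstream calls to the external service are never flooded.
    class ThrottledReceiver(eventsPerSecond: Double)
      extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

      def onStart() {
        new Thread("Throttled Receiver") {
          override def run() { receive() }
        }.start()
      }

      def onStop() { }

      private def receive() {
        val limiter = RateLimiter.create(eventsPerSecond) // e.g. 100.0
        while (!isStopped()) {
          limiter.acquire()       // blocks until a permit is available
          store(fetchNextEvent()) // placeholder for your real event source
        }
      }

      // Placeholder so the sketch compiles; swap in your real client library.
      private def fetchNextEvent(): String = "event"
    }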
On 19 June 2014 07:50, Flavio Pompermaier <pomperma...@okkam.it> wrote:

> Yes, I need to call the external service for every event, and the order
> does not matter.
> There's no time limit within which each event should be processed. I
> can't tell the producer to slow down, nor drop events.
> Of course I could put a message broker in between, like an AMQP or JMS
> broker, but I was thinking that maybe this issue was already addressed
> in some way (of course there should be some buffer to process high-rate
> streaming)... or not?
>
> On Thu, Jun 19, 2014 at 4:48 AM, Soumya Simanta
> <soumya.sima...@gmail.com> wrote:
>
>> Flavio - I'm new to Spark as well, but I've done stream processing
>> using other frameworks. My comments below are not Spark Streaming
>> specific. Maybe someone who knows more can provide better insights.
>>
>> I read your post on my phone, and I believe my answer doesn't
>> completely address the issue you have raised.
>>
>> Do you need to call the external service for every event? i.e., do
>> you need to process all events? Also, does the order of processing
>> events matter? Is there a time bound within which each event should
>> be processed?
>>
>> Calling an external service means network IO, so you have to buffer
>> events if your service is rate limited or slower than the rate at
>> which you are processing your events.
>>
>> Here are some ways of dealing with this situation:
>>
>> 1. Drop events based on a policy (such as buffer/queue size).
>> 2. Tell the event producer to slow down, if that's in your control.
>> 3. Use a proxy or a set of proxies to distribute the calls to the
>> remote service, if the rate limit is by user or network node only.
>>
>> I'm not sure how many of these are implemented directly in Spark
>> Streaming, but you can have an external component that controls the
>> rate of events and only sends events to Spark streams when it's ready
>> to process more messages.
>>
>> Hope this helps.
>>
>> -Soumya
>>
>> On Wed, Jun 18, 2014 at 6:50 PM, Flavio Pompermaier
>> <pomperma...@okkam.it> wrote:
>>
>>> Thanks for the quick reply, Soumya.
>>> Unfortunately I'm a newbie with Spark... what do you mean? Is there
>>> any reference for how to do that?
>>>
>>> On Thu, Jun 19, 2014 at 12:24 AM, Soumya Simanta
>>> <soumya.sima...@gmail.com> wrote:
>>>
>>>> You can add a back-pressure-enabled component in front that feeds
>>>> data into Spark. This component can control the input rate to
>>>> Spark.
>>>>
>>>> On Jun 18, 2014, at 6:13 PM, Flavio Pompermaier
>>>> <pomperma...@okkam.it> wrote:
>>>>
>>>>> Hi to all,
>>>>> In my use case I'd like to receive events and call an external
>>>>> service as they pass through. Is it possible to limit the number
>>>>> of concurrent calls to that service (to avoid DoS) using Spark
>>>>> Streaming? If so, limiting the rate implies possible buffer
>>>>> growth... how can I control the buffer of incoming events waiting
>>>>> to be processed?
>>>>>
>>>>> Best,
>>>>> Flavio
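As promised above, a minimal sketch of the "component in front" idea Soumya describes: at its simplest it is just a bounded buffer sitting between the event producer and the receiver, which blocks the producing side when full instead of dropping events. The class name is hypothetical, and a plain java.util.concurrent queue is only one possible buffer:

    import java.util.concurrent.ArrayBlockingQueue

    // Hypothetical front component: a bounded buffer between the event
    // producer and the Spark receiver. When the buffer fills up, put()
    // blocks, which pushes back on the producer instead of dropping events.
    class BackPressureBuffer[T](capacity: Int) {
      private val queue = new ArrayBlockingQueue[T](capacity)

      def push(event: T): Unit = queue.put(event) // producer side: blocks when full
      def pull(): T = queue.take()                // receiver side: blocks when empty
    }

A custom receiver's receive loop would then just call pull() and store() in turn, and the buffer capacity bounds how far ingestion can run ahead of processing.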