You can create an RDD of JSON credentials, then run a mapper that takes each credential, queries the API, and stores the results in another RDD. That RDD can be passed from stage to stage for further computation. There are three issues to consider:

1. Throttling the number of calls/sec: if you want Spark to throttle, create an RDD of the appropriate size; if the API itself throttles you, you should still size the RDD appropriately, since slower responses will delay your overall processing.

2. Failed query responses: these can be handled by filtering the RDD and writing failed responses to some disk location for debugging.

3. Pipelining: do you expect the processing to run for a long time? If so, are you planning to pipeline it (basically, running tasks on already-downloaded data while you are downloading fresh data)? This is a toughie; most likely you can do it with threading, but there is no guarantee that you will get pipelining benefits.
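To make points 1 and 2 concrete, here is a minimal pure-Python sketch of the map-then-filter pattern described above. It deliberately avoids a real Spark context (which needs a cluster); the comments note where `sc.parallelize`, `rdd.map`, `rdd.filter`, and `saveAsTextFile` would slot in. All names here (query_api, split_results, the "token" field) are illustrative, not from the original post.

```python
# Sketch: query an API once per credential, then split successes from
# failures -- mirroring rdd.map(query_api) followed by two rdd.filter passes.

def query_api(credential):
    """Stand-in for the real REST call. In Spark this function would run
    inside rdd.map(query_api) on each executor; a real implementation
    would issue an HTTP request using the credential."""
    try:
        if not credential.get("token"):
            raise ValueError("missing token")
        # real code would call the API here and parse the JSON response
        return {"credential": credential, "ok": True,
                "body": {"echo": credential["token"]}}
    except Exception as exc:
        return {"credential": credential, "ok": False, "error": str(exc)}

def split_results(results):
    """Mirrors rdd.filter(...): successes feed the next computation stage,
    failures would be persisted (e.g. saveAsTextFile) for debugging."""
    successes = [r for r in results if r["ok"]]
    failures = [r for r in results if not r["ok"]]
    return successes, failures

if __name__ == "__main__":
    creds = [{"token": "a"}, {"token": ""}, {"token": "c"}]
    results = [query_api(c) for c in creds]   # in Spark: rdd.map(query_api)
    ok, bad = split_results(results)
    print(len(ok), len(bad))                  # 2 1
```

In Spark, the number of partitions in the credentials RDD is the main knob for how many concurrent API calls you get, which is how the throttling in point 1 would be tuned.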
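For point 3, the threading approach alluded to is a producer/consumer pipeline: one thread keeps downloading batches while the main thread processes batches already fetched. Below is a hedged sketch using Python's standard `threading` and `queue` modules; `fetch_batch` and `process_batch` are hypothetical stand-ins for the API download and the Spark job, not anything from the original post.

```python
import queue
import threading

def fetch_batch(i):
    """Stand-in for downloading one batch of API responses."""
    return [i * 10 + k for k in range(3)]

def process_batch(batch):
    """Stand-in for a Spark job over one downloaded batch."""
    return sum(batch)

def pipeline(n_batches):
    # Bounded queue: the downloader may run at most one batch ahead of
    # the processor, which caps memory while still overlapping the work.
    q = queue.Queue(maxsize=2)

    def producer():
        for i in range(n_batches):
            q.put(fetch_batch(i))
        q.put(None)  # sentinel: no more data

    t = threading.Thread(target=producer)
    t.start()
    totals = []
    while True:
        batch = q.get()
        if batch is None:
            break
        totals.append(process_batch(batch))
    t.join()
    return totals

if __name__ == "__main__":
    print(pipeline(3))  # [3, 33, 63]
```

Whether this actually overlaps anything depends on the workload: if the download and the processing contend for the same resource (network, or the same Spark cluster), the pipeline degenerates to sequential execution, which is the "no guarantee" caveat above.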
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>

On Thu, Mar 6, 2014 at 10:21 AM, sonyjv <sonyjvech...@yahoo.com> wrote:
> Hi,
>
> I am very new to Spark and currently trying to implement a use case. We have
> a JSON-based REST API implemented in Spring which gets around 50 calls/sec.
> I would like to stream these JSON strings to Spark for processing and
> aggregation. We have a strict SLA and would like to know the best way to
> design the interface between the REST API and Spark.
>
> Also, the processing part has different steps, and I am thinking of having
> multiple Spark jobs for performing these steps. What is the best way of
> triggering one job from another and passing data between these jobs?
>
> Thanks,
> Sony
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Streaming-JSON-string-from-REST-Api-in-Spring-tp2358.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.