Querying a service or a database from a Spark job is in most cases an anti-pattern; relying on a live database makes jobs unstable and non-deterministic. There are exceptions, however.

The recommended pattern is to take regular dumps of the database to your cluster storage, e.g. HDFS, and join the dump dataset with other datasets, e.g. your incoming events. There are good and bad ways to dump, however. I covered the topic in this presentation, which you may find useful:

http://www.slideshare.net/lallea/functional-architectural-patterns
https://vimeo.com/channels/flatmap2015/128468974
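As a rough sketch, assuming you run in spark-shell (so sqlContext is in scope), keep Parquet dumps on HDFS, and join on a made-up "userId" column, the join side looks something like this:

  // Read an immutable snapshot of the database from cluster storage,
  // instead of querying the live database from inside the job.
  val users = sqlContext.read.parquet("hdfs:///dumps/users/2016-01-26")

  // Read the incoming events for the same processing window.
  val events = sqlContext.read.json("hdfs:///incoming/events/2016-01-26")

  // Join the events with the snapshot. The job now depends only on
  // immutable files, so reruns produce the same result.
  val enriched = events.join(users, "userId")

  enriched.write.parquet("hdfs:///output/enriched_events/2016-01-26")

Since the dump is a plain dataset, you can rerun or backfill the job without touching the production database.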
Let me know if you have follow-up questions, or want assistance.

Regards,

Lars Albertsson
Data engineering consultant
www.mapflat.com
+46 70 7687109


On Tue, Jan 26, 2016 at 10:25 PM, Daniel Schulz
<danielschulz2...@hotmail.com> wrote:
> Hi,
>
> We are currently working on a solution architecture to solve IoT workloads
> on Spark. Therefore, I am interested in getting to know whether it is
> considered an anti-pattern in Spark to get records from a database and make
> a REST call to an external server with that data. This external server may
> and will be the bottleneck -- but from a Spark point of view: is it possibly
> harmful to open connections and wait for their responses for vast amounts of
> rows?
>
> In the same manner: is calling an external library (instead of making a REST
> call) for any row possibly problematic?
>
> How would we best embed a C++ library in this workflow: is it best to make a
> function having a JNI call to run it natively -- iff we know we are single
> threaded then? Or is there a better way to include C++ code in Spark jobs?
>
> Many thanks in advance.
>
> Kind regards, Daniel.