Querying a service or a database from a Spark job is in most cases an
anti-pattern, but there are exceptions. Jobs that rely on a live
database become unstable and nondeterministic: the database contents
change between runs, and the job's outcome depends on the database's
load and availability.

The recommended pattern is to take regular dumps of the database to
your cluster storage, e.g. HDFS, and join the dump dataset with other
datasets, e.g. your incoming events. There are good and bad ways to
dump, however. I covered the topic in this presentation, which you may
find useful:
slides: http://www.slideshare.net/lallea/functional-architectural-patterns
video: https://vimeo.com/channels/flatmap2015/128468974
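For illustration, here is a minimal sketch of the dump-and-join
pattern with the DataFrame API. The paths, the dump layout, and the
user_id join key are made up for the example:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object EnrichEvents {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("EnrichEvents"))
    val sqlContext = new SQLContext(sc)

    // The latest database dump, written to HDFS by a separate dump job.
    // Path and schema are hypothetical.
    val users = sqlContext.read.parquet("hdfs:///dumps/users/2016-01-26")

    // The incoming events to be enriched.
    val events = sqlContext.read.parquet("hdfs:///events/2016-01-26")

    // Join against the static dump instead of querying a live database,
    // so the job is deterministic and puts no load on production systems.
    val enriched = events.join(users, "user_id")

    enriched.write.parquet("hdfs:///enriched/2016-01-26")
  }
}

Since the dump is immutable once written, rerunning the job yields the
same result every time.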

Let me know if you have follow-up questions or want assistance.

Regards,


Lars Albertsson
Data engineering consultant
www.mapflat.com
+46 70 7687109


On Tue, Jan 26, 2016 at 10:25 PM, Daniel Schulz
<danielschulz2...@hotmail.com> wrote:
> Hi,
>
> We are currently working on a solution architecture for IoT workloads
> on Spark. I would therefore like to know whether it is considered an
> anti-pattern in Spark to fetch records from a database and make a REST
> call to an external server with that data. This external server may,
> and will, be the bottleneck -- but from a Spark point of view: is it
> potentially harmful to open connections and wait for their responses
> for vast numbers of rows?
>
> In the same manner: is calling an external library (instead of making a REST
> call) for every row potentially problematic?
>
> And how best to embed a C++ library in this workflow: is it best to
> wrap it in a function that makes a JNI call to run it natively --
> assuming we know we are single-threaded at that point? Or is there a
> better way to include C++ code in Spark jobs?
>
> Many thanks in advance.
>
> Kind regards, Daniel.

