Spark is at its best for in-memory batch processing and machine learning. 
Connecting many different sources and destinations requires some thought, and 
probably more than Spark alone.
Connecting Spark directly to a database makes sense in only very few cases. 
You will see serious performance issues due to the lack of data locality, and 
the database can receive unexpected load when speculative execution kicks in 
or when nodes crash and tasks are retried.
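A toy illustration of that last point, with no Spark involved (the `FakeDatabase` class and `run_task` function are invented purely for this sketch): when a task attempt is speculatively re-launched, the database behind it serves the same work twice.

```python
# Toy sketch: a speculatively re-launched task attempt hits the backing
# database a second time, even though Spark only keeps one result.
# FakeDatabase and run_task are invented for illustration only.

class FakeDatabase:
    def __init__(self):
        self.query_count = 0

    def fetch_rows(self, partition_id):
        self.query_count += 1          # every attempt costs the database a query
        return [f"row-{partition_id}-{i}" for i in range(3)]

def run_task(db, partition_id):
    """One task attempt: read its partition from the database."""
    return db.fetch_rows(partition_id)

db = FakeDatabase()
result = run_task(db, partition_id=0)       # original attempt
speculative = run_task(db, partition_id=0)  # speculative duplicate attempt

# The framework keeps whichever attempt finishes first, but the database
# has already done the work twice.
print(db.query_count)  # 2 queries for 1 logical partition
```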
Using REST for transferring a lot of data is also something to be careful 
with. REST does not let you resume transmissions: if the transfer is 
interrupted after you have transferred 1 TB, you have to transmit everything 
again. Also, while REST is format agnostic, it is usually used with formats 
that are highly inefficient for large files, such as JSON or XML. You are 
better off with Avro or the like (for exchanges between systems! Not for 
querying!). In exceptional cases (e.g. legacy systems) one or several 
well-designed CSV files are better. In any case, please use compression.
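To make the compression point concrete: even if you are stuck with JSON, generic compression alone shrinks repetitive record data dramatically, because the field names repeat in every record. A minimal sketch using only the Python standard library (the sample records are invented):

```python
import gzip
import json

# Invented sample records; the repeated field names typical of JSON
# exports compress extremely well.
records = [{"sensor_id": i % 10, "temperature": 20.5, "unit": "celsius"}
           for i in range(10_000)]

raw = json.dumps(records).encode("utf-8")
compressed = gzip.compress(raw)

# The gzipped payload is a small fraction of the raw JSON.
print(len(raw), len(compressed))
```

A binary format like Avro goes further by dropping the per-record field names entirely and shipping the schema once.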
What does the REST service do with the data? Why don't you use SFTP + rsync 
(or duplicity) to resume interrupted transfers?
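The resume behaviour you get from rsync (e.g. with --partial) boils down to keeping the partially transferred bytes and continuing from that offset instead of restarting from zero. A pure-Python toy of that idea (all names invented; real rsync additionally verifies content with checksums):

```python
import os
import tempfile

def resume_copy(src_path, dst_path, chunk_size=4096):
    """Copy src to dst, continuing from however many bytes dst already
    has -- the core of what rsync --partial or SFTP 'reget' do for you."""
    offset = os.path.getsize(dst_path) if os.path.exists(dst_path) else 0
    with open(src_path, "rb") as src, open(dst_path, "ab") as dst:
        src.seek(offset)               # skip what already arrived
        while chunk := src.read(chunk_size):
            dst.write(chunk)

tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "big.bin")
dst = os.path.join(tmp, "copy.bin")
with open(src, "wb") as f:
    f.write(b"0123456789" * 100)       # 1000-byte "large" file

# Simulate an interrupted transfer: only the first 300 bytes arrived.
with open(src, "rb") as s, open(dst, "wb") as f:
    f.write(s.read(300))

resume_copy(src, dst)                  # picks up at byte 300, not byte 0
assert open(dst, "rb").read() == open(src, "rb").read()
```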

I did not understand your last question. Generally, JNI is fine. However, you 
should carefully test the memory allocation of your JNI library, or run it in 
software containers such as Docker, or use cgroups to limit the memory and 
CPU usage of your JNI library.
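If containers or cgroups are overkill, one lightweight safety net is an address-space rlimit on the worker process, so a rogue native allocation fails with an error instead of exhausting the node. A sketch with Python's stdlib resource module (Unix-only; the 1 GiB cap is an arbitrary example value, not a recommendation):

```python
import resource

# Cap the address space of the current process at 1 GiB so oversized
# native allocations fail instead of exhausting the node.
# Unix-only; conceptually the same as `ulimit -v` or a cgroup limit.
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
cap = 1 << 30  # 1 GiB, arbitrary example value
resource.setrlimit(resource.RLIMIT_AS, (cap, hard))

capped, _ = resource.getrlimit(resource.RLIMIT_AS)
print(capped)  # 1073741824

resource.setrlimit(resource.RLIMIT_AS, (soft, hard))  # restore
```

For a JVM worker the same limit can be applied from the shell that launches it, which is why cgroups or Docker memory limits are usually the cleaner operational choice.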

However, all of this depends on the details of your use case.

> On 26 Jan 2016, at 22:25, Daniel Schulz <danielschulz2...@hotmail.com> wrote:
> 
> Hi,
> 
> We are currently working on a solution architecture to solve IoT workloads on 
> Spark. Therefore, I am interested in getting to know whether it is considered 
> an Anti-Pattern in Spark to get records from a database and make a ReST call 
> to an external server with that data. This external server may and will be 
> the bottleneck -- but from a Spark point of view: is it possibly harmful to 
> open connections and wait for their responses for vast amounts of rows?
> 
> In the same manner: is calling an external library (instead of making a ReST 
> call) for any row possibly problematic?
> 
> How to rather embed a C++ library in this workflow: is it best to make a 
> function having a JNI call to run it natively -- iff we know we are single 
> threaded then? Or is there a better way to include C++ code in Spark jobs?
> 
> Many thanks in advance.
> 
> Kind regards, Daniel.
