Hi all,

We have a big issue and would appreciate it if someone has any insights or
ideas. The problem is composed of two connected problems:

1. The run time of a single application.

2. The run time of multiple applications in parallel scales almost
linearly with the number of applications, as if they were running
sequentially.

We have written a Spark application fetching its data from HBase.
We are running the application on YARN in client mode.
The cluster has 2 nodes (both used as HBase data nodes and Spark/YARN
processing nodes).

We have a few Spark steps in our app; the heaviest and longest of them all
is described by this flow (a rough sketch follows below):

1. flatMap - converting the HBase RDD to an RDD of objects.

2. Group by key.

3. map - making the calculations we need (checking a set of basic
mathematical conditions).

When running a single instance of this step, working on only 2000 records,
this step takes around 13s (all records are related to one key).
The HBase table we fetch the data from has 5 regions.
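
For reference, here is a rough Scala sketch of that stage. The record
type, the table and column names, and the condition check are placeholders
standing in for our real code, not the actual implementation:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.spark.SparkContext

    case class Record(key: String, value: Double)  // hypothetical record type

    // Stand-in row parser ("cf"/"val" are placeholder column names).
    def toRecords(result: Result): Seq[Record] = {
      val bytes = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("val"))
      if (bytes == null) Seq.empty
      else Seq(Record(Bytes.toString(result.getRow), Bytes.toDouble(bytes)))
    }

    // Stand-in for the set of basic mathematical conditions checked per key.
    def checkConditions(records: Iterable[Record]): Boolean =
      records.forall(_.value >= 0.0)

    def runStage(sc: SparkContext): Array[(String, Boolean)] = {
      val conf = HBaseConfiguration.create()
      conf.set(TableInputFormat.INPUT_TABLE, "our_table")  // placeholder name

      // The scan yields one partition per HBase region (5 in our case).
      val hbaseRdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
        classOf[ImmutableBytesWritable], classOf[Result])

      hbaseRdd
        .flatMap { case (_, r) => toRecords(r).map(rec => (rec.key, rec)) } // 1. flatMap
        .groupByKey()                                                      // 2. group by key
        .mapValues(checkConditions)                                        // 3. map/calculate
        .collect()
    }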

The implementation we have made uses a REST service which creates one
SparkContext.
Each request we make to this service runs an instance of the application
(but again, all of them use the same SparkContext).
Each request creates multiple threads which run all the application steps.
When running one request (with 10 parallel threads), the relevant stage
takes about 40s for all the threads - each one of them takes 40s by itself,
but they run almost completely in parallel, so the total run time of one
request is also 40s.
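
The threading is essentially the following shape (a minimal sketch; the
pool size matches our 10 threads per request, and runStageForKey is a
placeholder for the stage above; as far as we understand, SparkContext is
thread-safe for submitting jobs concurrently):

    import java.util.concurrent.Executors
    import scala.concurrent.{Await, ExecutionContext, Future}
    import scala.concurrent.duration.Duration
    import org.apache.spark.{SparkConf, SparkContext}

    object Service {
      // One SparkContext shared by all requests for the service's lifetime.
      val sc = new SparkContext(new SparkConf().setAppName("rest-service"))

      // Each request fans out to 10 threads.
      implicit val ec: ExecutionContext =
        ExecutionContext.fromExecutor(Executors.newFixedThreadPool(10))

      def handleRequest(keys: Seq[String]): Seq[Boolean] = {
        val futures = keys.map(key => Future { runStageForKey(sc, key) })
        Await.result(Future.sequence(futures), Duration.Inf)
      }

      // Placeholder for the flatMap/groupByKey/map stage sketched earlier.
      def runStageForKey(sc: SparkContext, key: String): Boolean = true
    }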

We have allocated 10 workers, each with 512M of memory (no need for more;
it looks like the whole RDD is cached).
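
In configuration terms, the allocation is equivalent to this (the app name
is a placeholder; the same values can be passed to spark-submit as
--num-executors 10 --executor-memory 512m):

    import org.apache.spark.SparkConf

    // 10 executors ("workers") with 512M each, in YARN client mode.
    val conf = new SparkConf()
      .setAppName("rest-service")          // placeholder app name
      .setMaster("yarn-client")
      .set("spark.executor.instances", "10")
      .set("spark.executor.memory", "512m")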

So the first question:
Does this run time make sense? To us it seems too long. Do you have an
idea of what we are doing wrong?

The second problem, and the more serious one:
We need to run multiple parallel requests of this kind.
When doing so, the run time spikes again: instead of one request that runs
in about 1m (the 40s is only the main stage),
we get 2 applications, both running almost in parallel, and both running
for 2m.
This also happens if we use 2 different services and send each of them 1
request.
These run times grow as we send more requests.

We have also monitored the CPU usage of the node, and each request makes
it jump to 90%.

If we reduce the number of workers to 2, the CPU usage jumps only to about
35%, but the run time increases significantly.

This behavior seems very strange to us.
Are there any Spark parameters we should consider changing?
Any other ideas? We are quite stuck on this.

Thanks in advance,
Dana




