Re: Resource under-utilization when using RocksDb state backend [SOLVED]

Clifford Resnick Thu, 16 Feb 2017 08:50:42 -0800

Hi Vinay,

We found that our problems were not with RocksDB, but rather with what we were 
throwing at it. We were working with more complex data types (e.g. Collections) 
and found that nearly 80% of the time was spent in serialization, so optimizing 
that helped a lot. If your state is more primitive or aligned to byte[], 
another thing to consider (assuming you have a keyed stream) is skew, which 
might be helped by an upstream combiner. To answer your question: we are also 
running with the default FLASH_SSD_OPTIMIZED. I did play with increasing the 
buffer size, among other things, but found that the benefit was not worth the 
resource cost. Our data, like most, is naturally clustered on time, so based on 
my rough understanding of RocksDB I’m guessing we get a lot of Level 0 hits, 
though that is not something I know how to measure.
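
A minimal sketch of what an upstream combiner for a skewed keyed stream might look like, assuming the aggregation is a simple per-key sum. The class name LocalCombiner, the maxKeys threshold, and the downstream callback are all hypothetical; this is plain Java, not Flink API, and a real operator would also need to flush before checkpoints and on watermarks.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical pre-aggregating combiner: sums values per key in memory and
// emits one partial sum per key on flush, so a hot key produces far fewer
// records (and keyed-state updates) downstream.
class LocalCombiner {
    private final Map<String, Long> partialSums = new HashMap<>();
    private final int maxKeys;
    private final BiConsumer<String, Long> downstream;

    LocalCombiner(int maxKeys, BiConsumer<String, Long> downstream) {
        this.maxKeys = maxKeys;
        this.downstream = downstream;
    }

    // Fold one record into the buffer; flush once too many distinct keys are held.
    void add(String key, long value) {
        partialSums.merge(key, value, Long::sum);
        if (partialSums.size() >= maxKeys) {
            flush();
        }
    }

    // Emit every buffered partial sum and clear the buffer.
    void flush() {
        partialSums.forEach(downstream);
        partialSums.clear();
    }
}
```

With this shape, 1000 records for one hot key collapse into a single downstream record per flush. It only applies when the aggregation is associative and commutative, since partial results are merged again after the keyBy.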

-Cliff

From: vinay patil <vinay18.pa...@gmail.com>
Reply-To: "user@flink.apache.org" <user@flink.apache.org>
Date: Thursday, February 16, 2017 at 8:24 AM
To: "user@flink.apache.org" <user@flink.apache.org>
Subject: Re: Resource under-utilization when using RocksDb state backend [SOLVED]

Hi Cliff,

It would be really helpful if you could share your RocksDB configuration.

I am also running on c3.4xlarge EC2 instances backed by SSDs.

I tried the FLASH_SSD_OPTIMIZED option, which works great, but somehow the 
pipeline stops partway through and the overall processing time increases.

I tried setting different values as mentioned in this video, but somehow I am 
not getting it right: the TMs get killed after some time.

Regards,
Vinay Patil

On Thu, Dec 8, 2016 at 10:19 PM, Cliff Resnick [via Apache Flink User Mailing 
List archive.] <[hidden email]> wrote:
It turns out that most of the time in RocksDBFoldingState was spent on 
serialization/deserialization. RocksDB read/write was performing well. By moving 
from Kryo to custom serialization we were able to increase throughput 
dramatically. Load is now where it should be.
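
To illustrate the kind of hand-written serialization meant here, the following is a sketch under stated assumptions, not the code used in this thread. It encodes a key of one string and two ints (the key shape described in the original post) with a fixed field order using only java.io. The names StateKey and StateKeyCodec are hypothetical; in a Flink job this logic would presumably live inside a custom TypeSerializer, which is not shown.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical state key: one string and two ints.
final class StateKey {
    final String name;
    final int a;
    final int b;

    StateKey(String name, int a, int b) {
        this.name = name;
        this.a = a;
        this.b = b;
    }
}

// Hand-written codec: fixed field order, no class metadata, no reflection.
final class StateKeyCodec {
    static byte[] serialize(StateKey key) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(key.name); // 2-byte length prefix + modified UTF-8
            out.writeInt(key.a);    // 4 bytes
            out.writeInt(key.b);    // 4 bytes
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static StateKey deserialize(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            // Fields are read back in exactly the order they were written.
            return new StateKey(in.readUTF(), in.readInt(), in.readInt());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the layout is known up front, the encoded form carries only a length prefix and the field bytes. A generic serializer typically has to do more work per record to discover or tag types, which is one plausible source of the overhead described above.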

On Mon, Dec 5, 2016 at 1:15 PM, Robert Metzger <[hidden email]> wrote:
Another Flink user using RocksDB with large state on SSDs recently posted this 
video on optimizing the performance of RocksDB on SSDs: 
https://www.youtube.com/watch?v=pvUqbIeoPzM
That could be relevant for you.

How long did you watch iotop? It could be that the IO access happens in 
bursts, depending on how data is cached.

I'll also add Stefan Richter to the conversation; he may have some more ideas 
about what we can do here.

On Mon, Dec 5, 2016 at 6:19 PM, Cliff Resnick <[hidden email]> wrote:
Hi Robert,

We're following 1.2-SNAPSHOT, using event time. I have tried "iotop" and I 
usually see less than 1% IO. The most I've seen was a quick flash here or there 
of something substantial (e.g. 19%, 52%), then back to nothing. I also assumed 
we were disk-bound, but to use your metaphor I'm having trouble finding any 
smoke. However, I'm not very experienced in sussing out IO issues, so perhaps 
there is something else I'm missing.

I'll keep investigating. If I continue to come up empty then I guess my next 
step may be to stage some independent tests directly against RocksDB.

-Cliff

On Mon, Dec 5, 2016 at 5:52 AM, Robert Metzger <[hidden email]> wrote:
Hi Cliff,

Which Flink version are you using?
Are you using event-time or processing-time windows?

I suspect that your disks are "burning" (= your job is IO bound). Can you check 
with a tool like "iotop" how much disk IO Flink is producing?
Then I would compare this number with the theoretical maximum of your SSDs (a 
good rough estimate is to use dd for that).

If you find that your disk bandwidth is saturated by Flink, you could look into 
tuning the RocksDB settings so that it uses more memory for caching.

Regards,
Robert

On Fri, Dec 2, 2016 at 11:34 PM, Cliff Resnick <[hidden email]> wrote:
In tests comparing the RocksDB state backend to the fs state backend we observe 
much lower throughput, around 10x slower. While the lower throughput is 
expected, what's perplexing is that machine load is also very low with RocksDB, 
typically falling to < 25% CPU and negligible IO wait (around 0.1%). Our test 
instances are EC2 c3.xlarge, which have 4 virtual CPUs and 7.5G RAM, each 
running a single TaskManager in YARN, with 6.5G allocated memory per 
TaskManager. The instances also have 2x40G attached SSDs which we have mapped 
to `taskmanager.tmp.dirs`.

With FS state and 4 slots per TM, we easily max out with a load average around 
5 or 6, so we actually need to throttle the slots down to 3. With RocksDB using 
the Flink SSD-configured options we see a load average of around 1. Also, load 
(and actual throughput) remains more or less constant no matter how many slots 
we use. The weak load is spread over all CPUs.

Here is a sample top:

Cpu0  : 20.5%us,  0.0%sy,  0.0%ni, 79.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  : 18.5%us,  0.0%sy,  0.0%ni, 81.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  : 11.6%us,  0.7%sy,  0.0%ni, 87.0%id,  0.7%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  : 12.5%us,  0.3%sy,  0.0%ni, 86.8%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st

Our pipeline uses tumbling windows, each with a ValueState keyed to a 3-tuple 
of one string and two ints. Each ValueState comprises a small set of tuples of 
around 5-7 fields each. The WindowFunction simply diffs against the set and 
updates state if there is a diff.
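
The diff-then-update pattern described above can be sketched roughly as follows. This is an illustrative stand-in in plain Java, not the actual WindowFunction from this job and not Flink's ValueState API; the class name DiffingState and the writes counter are hypothetical and exist only to make the behaviour observable.

```java
import java.util.HashSet;
import java.util.Set;

// Stand-in for a ValueState holding a small set; not Flink's API.
class DiffingState<T> {
    private Set<T> stored = new HashSet<>();
    private int writes = 0;

    // Replace the stored set only when the incoming set differs.
    // Returns true if state was updated.
    boolean updateIfChanged(Set<T> incoming) {
        if (stored.equals(incoming)) {
            return false; // no diff: skip the write
        }
        stored = new HashSet<>(incoming); // defensive copy of the new value
        writes++;
        return true;
    }

    // Number of times the stored value was actually replaced.
    int writes() {
        return writes;
    }
}
```

Note that even when no write happens, the comparison needs the stored set in object form. With a backend that keeps state as serialized bytes, every such read implies a deserialization, which fits the serialization cost reported in the follow-ups.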

Any ideas as to what the bottleneck is here? Any suggestions welcomed!

-Cliff
