Also, what baffles me a little bit is that the PUTs are taking about 10x as long as the GETs. We're running on SSDs and here is our config:

stores.stage-auctions-store.write.batch.size=100000
stores.stage-auctions-store.object.cache.size=200000
stores.stage-auctions-store.rocksdb.num.write.buffers=3
stores.stage-auctions-store.rocksdb.compaction.style=level
stores.stage-auctions-store.rocksdb.block.size.bytes=16384
stores.stage-auctions-store.container.write.buffer.size.bytes=268435456  # 256MB
stores.stage-auctions-store.container.cache.size.bytes=268435456  # 256MB

This is for a container that has 1 task per container, since our earlier 2-tasks-per-container experiment did not look too promising. Each message is ~ 1KB. We do not have a lot of cache hits (cache hit % is ~ 10% or less) and most GETs do end up going to the KeyValueStorageEngine, but even with that, get-ns is ~ 25,000 ns while put-ns can be as high as 200,000 ns.

I can send other metrics if they are useful.
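For what it's worth, one way to see where those nanoseconds go per task is to wrap the store calls with custom Timers from Samza's MetricsRegistry and compare them against the built-in KeyValueStorageEngine store-get-ns / store-put-ns numbers. A rough sketch along those lines (the key/value types and the placement of the join logic are assumptions, not the actual job):

import org.apache.samza.config.Config;
import org.apache.samza.metrics.Timer;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

/**
 * Sketch: time the store get/put (and anything else in process()) with
 * custom Timers so the cost can be broken out per task and compared with
 * the built-in KeyValueStorageEngine store-get-ns / store-put-ns metrics.
 */
public class InstrumentedJoinTask implements StreamTask, InitableTask {

  private KeyValueStore<String, String> store;  // key/value types are assumed
  private Timer customGetNs;
  private Timer customPutNs;

  @Override
  @SuppressWarnings("unchecked")
  public void init(Config config, TaskContext context) {
    // "stage-auctions-store" matches the store name in the config above.
    store = (KeyValueStore<String, String>) context.getStore("stage-auctions-store");
    customGetNs = context.getMetricsRegistry().newTimer("join-task", "custom-get-ns");
    customPutNs = context.getMetricsRegistry().newTimer("join-task", "custom-put-ns");
  }

  @Override
  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    String key = (String) envelope.getKey();
    String message = (String) envelope.getMessage();

    long t0 = System.nanoTime();
    String existing = store.get(key);
    customGetNs.update(System.nanoTime() - t0);

    // ... join logic would go here ...

    long t1 = System.nanoTime();
    store.put(key, message);
    customPutNs.update(System.nanoTime() - t1);
  }
}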
On 3/9/17, 11:45 AM, "Ankit Malhotra" <[email protected]> wrote:

Replies inline

On 3/9/17, 11:24 AM, "Jagadish Venkatraman" <[email protected]> wrote:

I understand you are receiving messages from *all* partitions (but fewer messages from some partitions). Some questions:

1. Is it possible that you may have saturated the capacity of the entire container?

>> Possible, but what metric/metrics would tell me that?

2. What is the time you spend inside *process* and *window* for the affected container? (How does it compare with other containers?) The metrics of interest are *process-ns* and *window-ns*.

>> process-ns is ~ 200,000 ns. window-ns happens once every 60 secs and is < 1 ms.

3. What is the number of messages per second you process for the affected containers? (How does it compare with other containers?) The metric of interest is *process-calls*. You can also look at per-task process calls <https://github.com/apache/samza/blob/master/samza-core/src/main/scala/org/apache/samza/container/TaskInstanceMetrics.scala> and per-partition messages read <https://github.com/apache/samza/blob/master/samza-kafka/src/main/scala/org/apache/samza/system/kafka/KafkaSystemConsumerMetrics.scala>.

>> I was quoting process-calls from the metrics registry. For task 2, it is lower than for task 1, as I stated earlier.

4. How many partitions do you have on the input topics?

>> 100 each, which we are trying to handle in 50 containers (2 tasks each, with a pool of size 5).

4.1 Have you tried increasing the # of partitions and re-deploying with more containers? (I take it that you are running on YARN.)

>> No, since we don’t have the ability to change input partitions right now.

*>> Another thing to add is our amount of time being blocked is generally quite high, which we believe is mainly because our processing is not fast enough?*

5. What metric are you referring to here? Do you mean the time spent in the *process* and *window* call? If you're not processing "fast enough", it will be helpful to instrument where you are spending most of the time.

>> Referring to block-ns. From my understanding, since the stores are synchronized, once the run loop asks the tasks to process, it then waits (blocks) until at least one of them is ready to receive more. So multiple threads may not really be helping us?

You can rely on Samza's MetricsRegistry to configure / report your custom metrics.

>> Most of the time spent is in the get and put from RocksDB. According to the Samza metrics (KeyValueStorageEngine.store-put-ns), our PUTs are much more expensive than GETs. Any idea on how one would go about profiling that? We are using level compaction with 3 write buffers of 500M each. The average input message is ~ 1KB and we commit (and thus flush the store) every 90 seconds. We are also using an object cache of 2 million objects with a write cache of 1m objects. Input messages are ~ 5k/sec/partition aggregate for both input streams that we are joining.
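One way to profile the RocksDB side in isolation is a small standalone benchmark against a local RocksDB instance opened with options that approximate the store config quoted above. A rough sketch (it assumes a reasonably recent rocksdbjni on the classpath; the path, key pattern and iteration count are arbitrary, and a toy loop like this will not reproduce the job's real compaction and cache pressure, so treat the numbers as directional only):

import java.nio.charset.StandardCharsets;
import java.util.Random;

import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.CompactionStyle;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

/**
 * Standalone micro-benchmark: write ~1KB values into a local RocksDB opened
 * with options approximating the store config in this thread (level
 * compaction, 3 write buffers, 16KB blocks) and read back random earlier
 * keys, to check whether the put/get gap shows up in raw RocksDB on the
 * same SSDs, independent of Samza's serde, cache and batching layers.
 */
public class RocksDbPutGetBench {

  public static void main(String[] args) throws RocksDBException {
    RocksDB.loadLibrary();

    Options options = new Options()
        .setCreateIfMissing(true)
        .setMaxWriteBufferNumber(3)
        .setWriteBufferSize(256L * 1024 * 1024)  // 256MB, mirroring the config
        .setCompactionStyle(CompactionStyle.LEVEL)
        .setTableFormatConfig(new BlockBasedTableConfig().setBlockSize(16384));

    Random random = new Random(42);
    byte[] value = new byte[1024];  // ~1KB messages, as in the job
    random.nextBytes(value);

    int iterations = 200_000;
    long putNs = 0;
    long getNs = 0;

    try (RocksDB db = RocksDB.open(options, "/tmp/rocksdb-put-get-bench")) {
      for (int i = 0; i < iterations; i++) {
        byte[] putKey = key(i);
        long t0 = System.nanoTime();
        db.put(putKey, value);
        putNs += System.nanoTime() - t0;

        byte[] getKey = key(random.nextInt(i + 1));  // read a random earlier key
        long t1 = System.nanoTime();
        db.get(getKey);
        getNs += System.nanoTime() - t1;
      }
    }

    System.out.printf("avg put: %d ns, avg get: %d ns%n",
        putNs / iterations, getNs / iterations);
  }

  private static byte[] key(int i) {
    return ("key-" + i).getBytes(StandardCharsets.UTF_8);
  }
}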
On Thu, Mar 9, 2017 at 5:20 AM, Ankit Malhotra <[email protected]> wrote:

> Replies inline.
>
> --
> Ankit
>
> > On Mar 9, 2017, at 12:34 AM, Jagadish Venkatraman <[email protected]> wrote:
> >
> > We can certainly help you debug this more. Some questions:
> >
> > 1. Are you processing messages (at all) from the "suffering" containers?
> > (You can verify that by observing metrics/logging etc.)
>
> Processing messages for sure, but mainly from one of the 2 partitions that
> the container is reading from.
>
> The overall number of process calls for task 1 is much higher than the
> process calls for task 2, and the diff is approximately the lag on the
> container, which is all from task 2.
>
> > 2. If you are indeed processing messages, is it possible the impacted
> > containers are not able to keep up with the surge in the data? You can try
> > re-partitioning your input topics (and increasing the number of containers).
>
> We were trying the async loop by having 2 tasks on the same container
> and multiple threads processing messages. The process is a simple inner
> join with a get and put into the RocksDB store.
>
> We saw that both get and put for the suffering task were higher than for
> the task that was chugging along.
>
> > 3. If you are not processing messages, maybe you can provide us with your
> > stack trace? It will be super-helpful to find out if (or where) containers
> > are stuck.
>
> Another thing to add is that our amount of time being blocked is generally
> quite high, which we believe is mainly because our processing is not fast
> enough. To add more color, our RocksDB store's average get across all tasks
> is around 20,000 ns BUT average put is 5x or more. We have the object cache
> enabled.
>
> > Thanks,
> > Jagadish
> >
> > On Wed, Mar 8, 2017 at 1:05 PM, Ankit Malhotra <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> While joining 2 streams from 2 partitions, we see that some containers
> >> start suffering, in that the lag (messages behind the high watermark)
> >> for one of the tasks starts skyrocketing while the other one is ~ 0.
> >>
> >> We are using default values for buffer sizes and fetch threshold, are
> >> using 4 threads as part of the pool, and are using the default
> >> RoundRobinMessageChooser.
> >>
> >> Happy to share more details/config if it can help to debug this further.
> >>
> >> Thanks
> >> Ankit
> >
> > --
> > Jagadish V,
> > Graduate Student,
> > Department of Computer Science,
> > Stanford University

--
Jagadish V,
Graduate Student,
Department of Computer Science,
Stanford University
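For reference, the join pattern described in the thread above (a simple inner join doing one store get and one store put per message) would look roughly like the sketch below. The system/stream names, output topic, store value types and the string-concatenation "join" are illustrative placeholders, not the actual job:

import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

/**
 * Sketch of a two-stream inner join: each incoming message does one
 * store.get (looking for the other stream's record with the same key)
 * and one store.put (remembering its own record for a future match).
 */
public class SimpleStreamJoinTask implements StreamTask, InitableTask {

  // Placeholder output stream; not the job's real system/topic.
  private static final SystemStream OUTPUT = new SystemStream("kafka", "joined-output");

  private KeyValueStore<String, String> store;

  @Override
  @SuppressWarnings("unchecked")
  public void init(Config config, TaskContext context) {
    // "stage-auctions-store" matches the store name in the config earlier in the thread.
    store = (KeyValueStore<String, String>) context.getStore("stage-auctions-store");
  }

  @Override
  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    String joinKey = (String) envelope.getKey();
    String message = (String) envelope.getMessage();
    String thisStream = envelope.getSystemStreamPartition().getStream();
    String otherStream = thisStream.equals("stream-a") ? "stream-b" : "stream-a";

    // One get: has the other stream already produced a record for this key?
    String match = store.get(otherStream + ":" + joinKey);
    if (match != null) {
      collector.send(new OutgoingMessageEnvelope(OUTPUT, joinKey, message + "|" + match));
    }

    // One put: remember this record for a future match from the other stream.
    store.put(thisStream + ":" + joinKey, message);
  }
}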
