Re: Review Request 48459: SAMZA-964 Improve the performance of the continuous OFFSET checkpointing for logged stores

Navina Ramesh Tue, 14 Jun 2016 16:29:02 -0700


> On June 14, 2016, 8:11 p.m., Navina Ramesh wrote:
> > samza-kafka/src/main/scala/org/apache/samza/system/kafka/KafkaSystemAdmin.scala,
> >  line 276
> > <https://reviews.apache.org/r/48459/diff/3/?file=1416306#file1416306line276>
> >
> >     Documentation is not clear. It says "It does not retry if there is a 
> > failure".. But you are retrying upto "maxretries".
> >     
> >     It will be useful to define what is considered a failure, for example, 
> > what exceptions are ok for retry and which ones aren't. If there isn't any 
> > clear distinction between the exceptions, you can mention that as well
> 
> Jake Maes wrote:
>     Typo. I meant to say "It does not retry *indefinitely* if there is a 
> failure.


Got it. Thanks for fixing it!


> On June 14, 2016, 8:11 p.m., Navina Ramesh wrote:
> > samza-kafka/src/main/scala/org/apache/samza/system/kafka/KafkaSystemAdmin.scala,
> >  line 284
> > <https://reviews.apache.org/r/48459/diff/3/?file=1416306#file1416306line284>
> >
> >     I don't like what we are doing here - setting the ttl to be max value. 
> > However, I don't see a better way of doing it. :(
> 
> Jake Maes wrote:
>     There's no reason to fetch metadata if everything is working, so 
> initially we want to use the cache no matter what. If there's a problem, we 
> should refresh. 
>     
>     To accomplish this, I considered 2 ways:
>     1. Update the metadata cache to explicitly expire problematic entries. 
> This would be a much more invasive change and potentially less performant. 
>     2. Lower the cache TTL if there is a problem so the cache refreshes 
> itself. 
>     
>     Both rely on a large or infinite initial TTL.

Got it


> On June 14, 2016, 8:11 p.m., Navina Ramesh wrote:
> > samza-kv/src/main/scala/org/apache/samza/storage/kv/CachedStore.scala, line 
> > 192
> > <https://reviews.apache.org/r/48459/diff/3/?file=1416309#file1416309line192>
> >
> >     nit: Do we really need these trace lines? If we do, can you change it 
> > to something like: "Begin flush to store" and "Flush to store complete"
> 
> Jake Maes wrote:
>     They could be submitted in a separate patch, but they are useful trace 
> messages. 
>     The original message only tells us that the flush method was called but 
> doesn't help to trace the calls in the various KV Store wrappers. These new 
> messages can also be used to time the operation. 
>     
>     I don't see any value to making the messages more verbose. They currently 
> read like this:
>     2016-06-14 21:31:19.171 [main] [SerializedKeyValueStore] [TRACE] Flushing 
> store.
>     2016-06-14 21:31:19.171 [main] [LoggedStore] [TRACE] Flushing store.
>     2016-06-14 21:31:19.171 [main] [RocksDbKeyValueStore] [TRACE] Flushing 
> store: pending-requests.
>     2016-06-14 21:31:19.172 [main] [RocksDbKeyValueStore] [TRACE] Flushed 
> store: pending-requests.
>     2016-06-14 21:31:19.172 [main] [LoggedStore] [TRACE] Flushed store.
>     2016-06-14 21:31:19.172 [main] [SerializedKeyValueStore] [TRACE] Flushed 
> store.

oh cool. I didn't realize this. Thanks for pointing it out.


- Navina


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48459/#review137418
-----------------------------------------------------------


On June 14, 2016, 11:14 p.m., Jake Maes wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48459/
> -----------------------------------------------------------
> 
> (Updated June 14, 2016, 11:14 p.m.)
> 
> 
> Review request for samza, Boris Shkolnik, Chris Pettitt, Jake Maes, Navina 
> Ramesh, Jagadish Venkatraman, Xinyu Liu, and Yi Pan (Data Infrastructure).
> 
> 
> Bugs: SAMZA-964
>     https://issues.apache.org/jira/browse/SAMZA-964
> 
> 
> Repository: samza
> 
> 
> Description
> -------
> 
> SAMZA-964 Improve the performance of the continuous OFFSET checkpointing for 
> logged stores
> 
> 1. Cache metadata more aggressively. Only expire metadata if we get Kafka 
> exceptions. This applies for all cases EXCEPT the partition count monitor, 
> which uses the TTL from the StreamMetadataCache
> 2. Reduce excessive Offset fetching. 
> 3. Do not allow unbounded exponential backoff for offset checkpointing, just 
> skip the offset file. Exponential backoff can balloon the commit time and 
> stall the event loop. So we will only retry up to 3 times for a max delay of 
> 400ms
> 4. Add some trace log messages to help track/time KV Store flushes (the other 
> culprit for the slowdown)
> 
> 
> Diffs
> -----
> 
>   samza-api/src/main/java/org/apache/samza/system/ExtendedSystemAdmin.java 
> daa2212cf1d54e90861657fab86b2e780d7e89e2 
>   
> samza-core/src/main/java/org/apache/samza/coordinator/stream/CoordinatorStreamSystemConsumer.java
>  0a6661c423a09944aa211223cad205958d3b1fee 
>   samza-core/src/main/scala/org/apache/samza/storage/TaskStorageManager.scala 
> c7b05203a1958a62af9dec04b215d985c4646dc4 
>   samza-core/src/main/scala/org/apache/samza/system/StreamMetadataCache.scala 
> 18b47ec3393978e403cadd8754f3fa5fd68654e9 
>   
> samza-core/src/test/scala/org/apache/samza/coordinator/TestJobCoordinator.scala
>  110c3a910aa0bae77dfe5eebbf82286b56dc4654 
>   
> samza-core/src/test/scala/org/apache/samza/storage/TestTaskStorageManager.scala
>  c8ea64c7c67dd6bf789d2a3445d620ccef1beac0 
>   
> samza-kafka/src/main/scala/org/apache/samza/system/kafka/KafkaSystemAdmin.scala
>  23aa58dff6b5e282bb634d3913cacd73003402ea 
>   
> samza-kafka/src/test/scala/org/apache/samza/system/kafka/TestKafkaSystemAdmin.scala
>  6c292234dcdd54eaca05f3e1a3fc401e205d6066 
>   
> samza-kv-rocksdb/src/main/scala/org/apache/samza/storage/kv/RocksDbKeyValueStore.scala
>  f0965aec5f3ec2a214dc40c70832c58273623749 
>   samza-kv/src/main/scala/org/apache/samza/storage/kv/CachedStore.scala 
> c28f8db8cb59bd5415e78535877acc1e5bee0f67 
>   samza-kv/src/main/scala/org/apache/samza/storage/kv/LoggedStore.scala 
> 7bba6ff37d8266674e7f15c10c7c146f4a41fc91 
>   
> samza-kv/src/main/scala/org/apache/samza/storage/kv/SerializedKeyValueStore.scala
>  8e183efcdec6fd3f921fc2bfe1971c95715930ed 
> 
> Diff: https://reviews.apache.org/r/48459/diff/
> 
> 
> Testing
> -------
> 
> New unit tests. 
> 
> ./gradlew clean build
> bin/check-all.sh on my Mac
> 
> Manual testing with 2 test jobs and the big job that had the performance 
> issue.
> 
> 
> Thanks,
> 
> Jake Maes
> 
>

Re: Review Request 48459: SAMZA-964 Improve the performance of the continuous OFFSET checkpointing for logged stores

Reply via email to