> On June 14, 2016, 8:11 p.m., Navina Ramesh wrote: > > samza-kafka/src/main/scala/org/apache/samza/system/kafka/KafkaSystemAdmin.scala, > > line 276 > > <https://reviews.apache.org/r/48459/diff/3/?file=1416306#file1416306line276> > > > > Documentation is not clear. It says "It does not retry if there is a > > failure".. But you are retrying upto "maxretries". > > > > It will be useful to define what is considered a failure, for example, > > what exceptions are ok for retry and which ones aren't. If there isn't any > > clear distinction between the exceptions, you can mention that as well > > Jake Maes wrote: > Typo. I meant to say "It does not retry *indefinitely* if there is a > failure.
Got it. Thanks for fixing it! > On June 14, 2016, 8:11 p.m., Navina Ramesh wrote: > > samza-kafka/src/main/scala/org/apache/samza/system/kafka/KafkaSystemAdmin.scala, > > line 284 > > <https://reviews.apache.org/r/48459/diff/3/?file=1416306#file1416306line284> > > > > I don't like what we are doing here - setting the ttl to be max value. > > However, I don't see a better way of doing it. :( > > Jake Maes wrote: > There's no reason to fetch metadata if everything is working, so > initially we want to use the cache no matter what. If there's a problem, we > should refresh. > > To accomplish this, I considered 2 ways: > 1. Update the metadata cache to explicitly expire problematic entries. > This would be a much more invasive change and potentially less performant. > 2. Lower the cache TTL if there is a problem so the cache refreshes > itself. > > Both rely on a large or infinite initial TTL. Got it > On June 14, 2016, 8:11 p.m., Navina Ramesh wrote: > > samza-kv/src/main/scala/org/apache/samza/storage/kv/CachedStore.scala, line > > 192 > > <https://reviews.apache.org/r/48459/diff/3/?file=1416309#file1416309line192> > > > > nit: Do we really need these trace lines? If we do, can you change it > > to something like: "Begin flush to store" and "Flush to store complete" > > Jake Maes wrote: > They could be submitted in a separate patch, but they are useful trace > messages. > The original message only tells us that the flush method was called but > doesn't help to trace the calls in the various KV Store wrappers. These new > messages can also be used to time the operation. > > I don't see any value to making the messages more verbose. They currently > read like this: > 2016-06-14 21:31:19.171 [main] [SerializedKeyValueStore] [TRACE] Flushing > store. > 2016-06-14 21:31:19.171 [main] [LoggedStore] [TRACE] Flushing store. > 2016-06-14 21:31:19.171 [main] [RocksDbKeyValueStore] [TRACE] Flushing > store: pending-requests. > 2016-06-14 21:31:19.172 [main] [RocksDbKeyValueStore] [TRACE] Flushed > store: pending-requests. > 2016-06-14 21:31:19.172 [main] [LoggedStore] [TRACE] Flushed store. > 2016-06-14 21:31:19.172 [main] [SerializedKeyValueStore] [TRACE] Flushed > store. oh cool. I didn't realize this. Thanks for pointing it out. - Navina ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/48459/#review137418 ----------------------------------------------------------- On June 14, 2016, 11:14 p.m., Jake Maes wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/48459/ > ----------------------------------------------------------- > > (Updated June 14, 2016, 11:14 p.m.) > > > Review request for samza, Boris Shkolnik, Chris Pettitt, Jake Maes, Navina > Ramesh, Jagadish Venkatraman, Xinyu Liu, and Yi Pan (Data Infrastructure). > > > Bugs: SAMZA-964 > https://issues.apache.org/jira/browse/SAMZA-964 > > > Repository: samza > > > Description > ------- > > SAMZA-964 Improve the performance of the continuous OFFSET checkpointing for > logged stores > > 1. Cache metadata more aggressively. Only expire metadata if we get Kafka > exceptions. This applies for all cases EXCEPT the partition count monitor, > which uses the TTL from the StreamMetadataCache > 2. Reduce excessive Offset fetching. > 3. Do not allow unbounded exponential backoff for offset checkpointing, just > skip the offset file. Exponential backoff can balloon the commit time and > stall the event loop. So we will only retry up to 3 times for a max delay of > 400ms > 4. Add some trace log messages to help track/time KV Store flushes (the other > culprit for the slowdown) > > > Diffs > ----- > > samza-api/src/main/java/org/apache/samza/system/ExtendedSystemAdmin.java > daa2212cf1d54e90861657fab86b2e780d7e89e2 > > samza-core/src/main/java/org/apache/samza/coordinator/stream/CoordinatorStreamSystemConsumer.java > 0a6661c423a09944aa211223cad205958d3b1fee > samza-core/src/main/scala/org/apache/samza/storage/TaskStorageManager.scala > c7b05203a1958a62af9dec04b215d985c4646dc4 > samza-core/src/main/scala/org/apache/samza/system/StreamMetadataCache.scala > 18b47ec3393978e403cadd8754f3fa5fd68654e9 > > samza-core/src/test/scala/org/apache/samza/coordinator/TestJobCoordinator.scala > 110c3a910aa0bae77dfe5eebbf82286b56dc4654 > > samza-core/src/test/scala/org/apache/samza/storage/TestTaskStorageManager.scala > c8ea64c7c67dd6bf789d2a3445d620ccef1beac0 > > samza-kafka/src/main/scala/org/apache/samza/system/kafka/KafkaSystemAdmin.scala > 23aa58dff6b5e282bb634d3913cacd73003402ea > > samza-kafka/src/test/scala/org/apache/samza/system/kafka/TestKafkaSystemAdmin.scala > 6c292234dcdd54eaca05f3e1a3fc401e205d6066 > > samza-kv-rocksdb/src/main/scala/org/apache/samza/storage/kv/RocksDbKeyValueStore.scala > f0965aec5f3ec2a214dc40c70832c58273623749 > samza-kv/src/main/scala/org/apache/samza/storage/kv/CachedStore.scala > c28f8db8cb59bd5415e78535877acc1e5bee0f67 > samza-kv/src/main/scala/org/apache/samza/storage/kv/LoggedStore.scala > 7bba6ff37d8266674e7f15c10c7c146f4a41fc91 > > samza-kv/src/main/scala/org/apache/samza/storage/kv/SerializedKeyValueStore.scala > 8e183efcdec6fd3f921fc2bfe1971c95715930ed > > Diff: https://reviews.apache.org/r/48459/diff/ > > > Testing > ------- > > New unit tests. > > ./gradlew clean build > bin/check-all.sh on my Mac > > Manual testing with 2 test jobs and the big job that had the performance > issue. > > > Thanks, > > Jake Maes > >