One of our Samza jobs restarted overnight recently, and we noticed that restoration from the changelog took much longer than we would expect (well over an hour). Looking through the logs, the throughput initially seems reasonable, if not stellar. But nearly every container seems to hit one or more long pauses during the restoration process:

2016-03-09 03:50:09,940 (default) [main] INFO [org.apache.samza.storage.kv.KeyValueStorageEngine] - 56000000 entries restored...
2016-03-09 03:50:19,895 (default) [main] INFO [org.apache.samza.storage.kv.KeyValueStorageEngine] - 57000000 entries restored...
2016-03-09 03:51:41,310 (default) [main] INFO [org.apache.samza.storage.kv.KeyValueStorageEngine] - 58000000 entries restored...
2016-03-09 04:22:13,003 (default) [main] INFO [org.apache.samza.storage.kv.KeyValueStorageEngine] - 59000000 entries restored...

Here we see a span of roughly 30 minutes (03:51:41 to 04:22:13) with no log output at all, compared with about 10 seconds per million entries just before it. As far as we can tell, Kafka is healthy during this period, and other containers are making progress restoring their partitions around this time, so the "gaps" are not happening at the same time across containers.

We are running Samza 0.9.1 on a YARN cluster in AWS, so some variance in performance is to be expected, but this seems pretty extreme. Is anyone else seeing this behavior?

--
Tommy Becker
Senior Software Engineer
Digitalsmiths, A TiVo Company
www.digitalsmiths.com
tobec...@tivo.com
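
For anyone who wants to check their own container logs for similar stalls, here is a minimal Python sketch that measures the time between consecutive "entries restored" lines and reports the slow intervals. It is not part of Samza; the function name `restore_gaps` and the 60-second threshold are illustrative assumptions, and the regular expression assumes the log layout shown in the excerpt above.

```python
import re
from datetime import datetime

# Assumes the log layout from the excerpt above:
# "<date> <time>,<millis> ... - <count> entries restored..."
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}).* - (\d+) entries restored"
)

def restore_gaps(lines, threshold_seconds=60.0):
    """Return (start, end, elapsed_seconds, entries_at_end) for each pair of
    consecutive progress lines that are at least threshold_seconds apart."""
    gaps = []
    prev_ts = None
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # ignore lines that are not restore progress messages
        ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
        count = int(match.group(2))
        if prev_ts is not None:
            elapsed = (ts - prev_ts).total_seconds()
            if elapsed >= threshold_seconds:
                gaps.append((prev_ts, ts, elapsed, count))
        prev_ts = ts
    return gaps

# The four log lines quoted in this message.
PREFIX = "(default) [main] INFO [org.apache.samza.storage.kv.KeyValueStorageEngine] - "
SAMPLE = [
    "2016-03-09 03:50:09,940 " + PREFIX + "56000000 entries restored...",
    "2016-03-09 03:50:19,895 " + PREFIX + "57000000 entries restored...",
    "2016-03-09 03:51:41,310 " + PREFIX + "58000000 entries restored...",
    "2016-03-09 04:22:13,003 " + PREFIX + "59000000 entries restored...",
]

if __name__ == "__main__":
    for start, end, elapsed, count in restore_gaps(SAMPLE):
        print(f"{start} -> {end}: {elapsed:.1f}s to reach {count} entries")
```

To run it against a real container log, pass an open file object instead of `SAMPLE`, for example `restore_gaps(open(path))`, where `path` is wherever your YARN container logs are stored.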