On Wed, Nov 6, 2019 at 6:28 PM Tom Bentley <tbent...@redhat.com> wrote:
> Hi Ying,
>
> > Because only inactive segments can be shipped to remote storage, to be
> > able to ship log data as soon as possible, we will roll log segments
> > very fast (e.g. every half hour).
>
> So that means a consumer which gets behind by half an hour will find its
> reads being served from remote storage.

No, the segments are shipped to remote storage as soon as possible, but the
local segment is not deleted until a configurable time has passed (e.g. 6
hours). Consumer requests are served from local storage as long as the local
copy is still available; only after those 6 hours (or longer) are they served
from remote storage.

> And, if I understand the proposed algorithm, each such consumer fetch
> request could result in a separate fetch request from the remote storage.
> I.e. there's no mechanism to amortize the cost of the fetching between
> multiple consumers fetching similar ranges?

We can have a small in-memory cache on the broker (a rough sketch is at the
end of this mail), but this is not a high priority right now. In any normal
case, a Kafka consumer should not lag by more than several hours, so only in
some very extreme cases does a consumer have to read from remote storage at
all, and it is very rare for two or more consumers to read the same piece of
remote data at about the same time.

> (Actually the doc for RemoteStorageManager.read() says "It will read at
> least one batch, if the 1st batch size is larger than maxBytes.". Does
> that mean the broker might have to retry with increased maxBytes if the
> first request fails to read a batch? If so, how does it know how much to
> increase maxBytes by?)

No, there is no retry; the broker just keeps reading until at least one full
batch has been received. The logic is exactly the same as the existing local
segment read (see the second sketch below).
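To make the cache idea above a bit more concrete, here is a rough sketch of
what a broker-side cache could look like. This is only an illustration: none
of the class or method names below are part of the KIP.

    import java.nio.ByteBuffer;
    import java.util.LinkedHashMap;
    import java.util.Map;

    // Hypothetical sketch only -- not part of the KIP's interfaces.
    public class RemoteReadCache {
        private static final int MAX_ENTRIES = 1024;

        // Access-ordered LinkedHashMap gives simple LRU eviction:
        // the eldest entry is dropped once we exceed MAX_ENTRIES.
        private final Map<String, ByteBuffer> cache =
            new LinkedHashMap<String, ByteBuffer>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(
                        Map.Entry<String, ByteBuffer> eldest) {
                    return size() > MAX_ENTRIES;
                }
            };

        // Key is "<remote segment id>:<byte position>", so concurrent
        // fetches for the same range hit the same entry.
        public synchronized ByteBuffer get(String segmentAndPosition) {
            return cache.get(segmentAndPosition);
        }

        public synchronized void put(String segmentAndPosition,
                                     ByteBuffer data) {
            cache.put(segmentAndPosition, data);
        }
    }

A plain LRU should be enough here, since the cache only needs to absorb the
rare case of several lagging consumers reading the same remote range at
about the same time.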
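And to illustrate the read logic: it is roughly the loop below. Again a
simplified sketch; RemoteBatchReader and Batch are stand-ins I made up, not
the actual RemoteStorageManager API.

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    public class RemoteRead {
        interface Batch { int sizeInBytes(); }
        interface RemoteBatchReader extends Iterator<Batch> {}

        // Read whole batches until maxBytes is reached. The first batch
        // is always returned, even if it alone exceeds maxBytes -- the
        // same "at least one batch" rule the local segment read follows.
        // There is no retry with a larger maxBytes.
        static List<Batch> read(RemoteBatchReader reader, int maxBytes) {
            List<Batch> result = new ArrayList<>();
            int bytesSoFar = 0;
            while (reader.hasNext()) {
                Batch batch = reader.next();
                if (!result.isEmpty()
                        && bytesSoFar + batch.sizeInBytes() > maxBytes)
                    break; // would exceed maxBytes, and we have a batch
                result.add(batch);
                bytesSoFar += batch.sizeInBytes();
            }
            return result;
        }
    }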