Thanks Harsha, makes sense for the most part.

> tiered storage is to get away from this and make this transparent to the user

I think you are saying that this enables additional (potentially cheaper) storage options without *requiring* an existing ETL pipeline. But it's not really a replacement for the sort of pipelines people build with Connect, Gobblin etc.

My point was that, if you are already offloading records in an ETL pipeline, why do you need a new pipeline built into the broker to ship the same data to the same place?

I think in most cases this will be an additional pipeline, not a replacement, because the segments written to cold storage won't be useful outside Kafka. So you'd end up with one of:

1) cold segments are only useful to Kafka;
2) you have the same data written to HDFS/etc twice, once for Kafka and once for everything else, in two separate formats;
3) you use your existing ETL pipeline and read cold data directly.

To me, an ideal solution would let me spool segments from Kafka to any sink I would like, and then let Kafka clients seamlessly access that cold data. Today I can do that in the client, but ideally the broker would do it for me via some HDFS/Hive/S3 plugin. The KIP seems to accomplish that -- just without leveraging anything I've currently got in place.
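To make that concrete, here is a rough sketch of the kind of broker-side plugin and resolution chain I have in mind -- purely illustrative, none of these names come from the KIP:

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URI;
    import java.util.List;
    import java.util.Optional;

    // Hypothetical plugin API, one implementation per URI scheme (file, hdfs, s3, ...).
    interface SegmentSource {
        boolean handles(URI uri);                                // e.g. "s3".equals(uri.getScheme())
        Optional<InputStream> open(URI uri) throws IOException;  // empty if the segment isn't there
    }

    // Broker-side resolution chain: try each candidate location in order, e.g.
    // file:///logs/{topic}/{segment}.log first, then s3://mybucket/{topic}/{date}/{segment}.log.
    class SegmentResolver {
        private final List<SegmentSource> sources;

        SegmentResolver(List<SegmentSource> sources) {
            this.sources = sources;
        }

        InputStream resolve(List<URI> candidates) throws IOException {
            for (URI uri : candidates) {
                for (SegmentSource source : sources) {
                    if (!source.handles(uri)) {
                        continue;
                    }
                    Optional<InputStream> stream = source.open(uri);
                    if (stream.isPresent()) {
                        return stream.get();
                    }
                }
            }
            throw new IOException("Segment not found in any configured tier");
        }
    }

The broker wouldn't need to care whether a given segment ultimately comes from file:// or s3://; the scheme-specific sources are where an HDFS/Hive/S3 plugin would live.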
Ryanne

On Mon, Feb 4, 2019 at 3:34 PM Harsha <ka...@harsha.io> wrote:

> Hi Eric,
> Thanks for your questions. Answers are in-line.
>
> "The high-level design seems to indicate that all of the logic for when and how to copy log segments to remote storage lives in the RLM class. The default implementation is then HDFS specific with additional implementations being left to the community. This seems like it would require anyone implementing a new RLM to also re-implement the logic for when to ship data to remote storage."
>
> RLM will be responsible for shipping log segments, and it will decide when a log segment is ready to be shipped over.
> Once log segments are identified as rolled over, RLM will delegate this responsibility to a pluggable remote storage implementation. Users who are looking to add their own implementation to enable other storages only need to implement the copy and read mechanisms, not re-implement RLM itself.
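> For illustration, the split could look roughly like the sketch below -- the names here are made up, not the final interface from the KIP. RLM keeps the "when" (which rolled-over segments to ship and when to read them back); the plugin keeps the "how" for a particular store:
>
>     import java.io.File;
>     import java.io.IOException;
>     import java.io.InputStream;
>     import org.apache.kafka.common.TopicPartition;
>
>     // Illustrative only -- the exact interface is still being worked out in the KIP.
>     public interface RemoteStorage {
>
>         // Copy a rolled-over, immutable log segment and its offset index to the remote store.
>         void copyLogSegment(TopicPartition tp, File segment, File offsetIndex) throws IOException;
>
>         // Read bytes of a previously copied segment back to serve a fetch request.
>         InputStream readLogSegment(TopicPartition tp, long baseOffset, long startPosition, int maxBytes) throws IOException;
>
>         // Remove the remote copy of a segment once retention expires.
>         void deleteLogSegment(TopicPartition tp, long baseOffset) throws IOException;
>     }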
> "Would it not be better for the Remote Log Manager implementation to be non-configurable, and instead have an interface for the remote storage layer? That way the "when" of the logic is consistent across all implementations and it's only a matter of "how," similar to how the Streams StateStores are managed."
>
> It's possible that we can make RLM non-configurable. But for the initial release, and to keep backward compatibility, we want to make this configurable, so that users who are not interested in having log segments shipped to remote storage don't need to worry about it.
>
> Hi Ryanne,
> Thanks for your questions.
>
> "How could this be used to leverage fast key-value stores, e.g. Couchbase, which can serve individual records but maybe not entire segments? Or is the idea to only support writing and fetching entire segments? Would it make sense to support both?"
>
> Log segments, once rolled over, are immutable objects, and we want to keep the current structure of log segments and their corresponding index files. It will be easier to copy the whole segment as it is, instead of re-reading each file and writing into a key/value store.
>
> "- Instead of defining a new interface and/or mechanism to ETL segment files from brokers to cold storage, can we just leverage Kafka itself? In particular, we can already ETL records to HDFS via Kafka Connect, Gobblin etc -- we really just need a way for brokers to read these records back. I'm wondering whether the new API could be limited to the fetch, and then existing ETL pipelines could be more easily leveraged. For example, if you already have an ETL pipeline from Kafka to HDFS, you could leave that in place and just tell Kafka how to read these records/segments from cold storage when necessary."
>
> This is pretty much what everyone does today, and it carries the additional overhead of operating and monitoring those pipelines.
> What's proposed in the KIP is not ETL. It's just looking at the logs that are written and rolled over, and copying the files as they are.
> With a traditional ETL pipeline, each new topic needs to be added (sure, we can do so via wildcard or another mechanism) -- new topics need to be onboarded to ship their data into remote storage.
> Once the data lands somewhere like HDFS/Hive etc., users need to write another processing pipeline to re-process this data, similar to how they are doing it in their stream processing pipelines. Tiered storage is to get away from this and make this transparent to the user. They don't need to run another ETL process to ship the logs.
>
> "I'm wondering if we could just add support for loading segments from remote URIs instead of from file, i.e. via plugins for s3://, hdfs:// etc. I suspect less broker logic would change in that case -- the broker wouldn't necessarily care if it reads from file:// or s3:// to load a given segment."
>
> Yes, this is what we are discussing in the KIP. We are leaving the details of loading segments to the RLM read path instead of exposing this directly in the broker. This way we can keep the current Kafka code as it is, without changing the assumptions around the local disk, and let the RLM handle the remote storage part.
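> As a rough sketch of that read path (stand-in names, not actual broker code, and assuming the remote tier only holds data older than what is still on local disk): the local path stays exactly as it is, and the RLM is consulted only for offsets that have already been aged off local disk.
>
>     import java.io.IOException;
>     import org.apache.kafka.common.TopicPartition;
>     import org.apache.kafka.common.record.Records;
>
>     // Stand-in types, only to show where the delegation to the RLM would happen.
>     class TieredReadPath {
>         interface LocalLog {
>             long logStartOffset();                                      // oldest offset still on local disk
>             Records read(long offset, int maxBytes) throws IOException;
>         }
>
>         interface RemoteLogManager {
>             Records read(TopicPartition tp, long offset, int maxBytes) throws IOException;
>         }
>
>         private final LocalLog local;
>         private final RemoteLogManager rlm;
>
>         TieredReadPath(LocalLog local, RemoteLogManager rlm) {
>             this.local = local;
>             this.rlm = rlm;
>         }
>
>         Records read(TopicPartition tp, long offset, int maxBytes) throws IOException {
>             if (offset >= local.logStartOffset()) {
>                 return local.read(offset, maxBytes);   // existing local-disk read path, unchanged
>             }
>             return rlm.read(tp, offset, maxBytes);     // older offsets are served through the RLM plugin
>         }
>     }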
> Thanks,
> Harsha
>
> On Mon, Feb 4, 2019, at 12:54 PM, Ryanne Dolan wrote:
> > Harsha, Sriharsha, Suresh, a couple thoughts:
> >
> > - How could this be used to leverage fast key-value stores, e.g. Couchbase, which can serve individual records but maybe not entire segments? Or is the idea to only support writing and fetching entire segments? Would it make sense to support both?
> >
> > - Instead of defining a new interface and/or mechanism to ETL segment files from brokers to cold storage, can we just leverage Kafka itself? In particular, we can already ETL records to HDFS via Kafka Connect, Gobblin etc -- we really just need a way for brokers to read these records back. I'm wondering whether the new API could be limited to the fetch, and then existing ETL pipelines could be more easily leveraged. For example, if you already have an ETL pipeline from Kafka to HDFS, you could leave that in place and just tell Kafka how to read these records/segments from cold storage when necessary.
> >
> > - I'm wondering if we could just add support for loading segments from remote URIs instead of from file, i.e. via plugins for s3://, hdfs:// etc. I suspect less broker logic would change in that case -- the broker wouldn't necessarily care if it reads from file:// or s3:// to load a given segment.
> >
> > Combining the previous two comments, I can imagine a URI resolution chain for segments. For example, first try file:///logs/{topic}/{segment}.log, then s3://mybucket/{topic}/{date}/{segment}.log, etc, leveraging your existing ETL pipeline(s).
> >
> > Ryanne
> >
> > On Mon, Feb 4, 2019 at 12:01 PM Harsha <ka...@harsha.io> wrote:
> >
> > > Hi All,
> > > We are interested in adding tiered storage to Kafka. More details about motivation and design are in the KIP. We are working towards an initial POC. Any feedback or questions on this KIP are welcome.
> > >
> > > Thanks,
> > > Harsha
> > >