Hi Eric,

Thanks for your questions. Answers are in-line.

"The high-level design seems to indicate that all of the logic for when and how to copy log segments to remote storage lives in the RLM class. The default implementation is then HDFS specific with additional implementations being left to the community. This seems like it would require anyone implementing a new RLM to also re-implement the logic for when to ship data to remote storage."
RLM will be responsible for shipping log segments, and it will decide when a log segment is ready to be shipped. Once log segments are identified as rolled over, RLM delegates the actual transfer to a pluggable remote storage implementation. Users who want to add their own implementation to enable other storage systems only need to implement the copy and read mechanisms (there is a rough sketch of what such an interface could look like at the end of this reply); they do not need to re-implement RLM itself.

"Would it not be better for the Remote Log Manager implementation to be non-configurable, and instead have an interface for the remote storage layer? That way the "when" of the logic is consistent across all implementations and it's only a matter of "how," similar to how the Streams StateStores are managed."

It's possible that we can make RLM non-configurable. But for the initial release, and to keep backward compatibility, we want to make this configurable, so that users who are not interested in having log segments shipped to remote storage don't need to worry about it at all (see the illustrative config at the end of this reply).

Hi Ryanne,

Thanks for your questions.

"How could this be used to leverage fast key-value stores, e.g. Couchbase, which can serve individual records but maybe not entire segments? Or is the idea to only support writing and fetching entire segments? Would it make sense to support both?"

Log segments, once they are rolled over, are immutable objects, and we want to keep the current structure of log segments and their corresponding index files. It is much simpler to copy the whole segment as it is than to re-read each file and write the records into a key/value store.

"- Instead of defining a new interface and/or mechanism to ETL segment files from brokers to cold storage, can we just leverage Kafka itself? In particular, we can already ETL records to HDFS via Kafka Connect, Gobblin etc -- we really just need a way for brokers to read these records back. I'm wondering whether the new API could be limited to the fetch, and then existing ETL pipelines could be more easily leveraged. For example, if you already have an ETL pipeline from Kafka to HDFS, you could leave that in place and just tell Kafka how to read these records/segments from cold storage when necessary."

This is pretty much what everyone does today, and it carries the additional overhead of operating and monitoring those pipelines. What's proposed in the KIP is not ETL: it just looks at the logs that are written and rolled over and copies the files as they are. With a traditional ETL pipeline, each new topic needs to be onboarded (sure, we can do so via wildcards or another mechanism) before its data is shipped into remote storage. Once the data lands somewhere like HDFS/Hive, users need to write another processing step to re-process it, similar to how they do in their stream processing pipelines. Tiered storage is meant to get away from this and make it transparent to the user; they don't need to run another ETL process to ship the logs.

"I'm wondering if we could just add support for loading segments from remote URIs instead of from file, i.e. via plugins for s3://, hdfs:// etc. I suspect less broker logic would change in that case -- the broker wouldn't necessarily care if it reads from file:// or s3:// to load a given segment."

Yes, this is what we are discussing in the KIP. We are leaving the details of loading segments to the RLM read path instead of exposing this directly in the broker. This way we can keep the current Kafka code as it is, without changing its assumptions around the local disk.
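To make that split of responsibilities concrete, here is a minimal sketch of what a storage plugin might implement. The interface name, method names and signatures below are purely illustrative assumptions on my part, not the actual API from the KIP:

    import java.io.File;
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.util.List;
    import java.util.Map;

    // Hypothetical plugin interface, for illustration only; the real
    // interface is defined in the KIP and may look different.
    public interface RemoteStorageSketch {

        // Plugin-specific configuration, e.g. HDFS namenode or S3 bucket.
        void configure(Map<String, ?> configs);

        // Copy a rolled-over, immutable segment and its index files as-is.
        // RLM decides *when* to call this; the plugin only decides *how*.
        void copyLogSegment(String topic, int partition, long baseOffset,
                            List<File> segmentAndIndexFiles) throws IOException;

        // Read bytes from a previously copied segment so the broker can
        // serve fetches for data that is no longer on local disk.
        ByteBuffer read(String topic, int partition, long baseOffset,
                        long startPosition, int maxBytes) throws IOException;

        // Remove remote data once retention for the segment expires.
        void deleteLogSegment(String topic, int partition, long baseOffset)
            throws IOException;

        void close();
    }

An HDFS implementation would be the default we ship; an S3 or other implementation would provide the same copy/read/delete behaviour without touching RLM's "when" logic.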
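And on the configurability point, the intent is that tiering is strictly opt-in, along these lines (the property names here are placeholders I made up for illustration, not the final names from the KIP):

    # Hypothetical broker settings, for illustration only
    remote.log.storage.enable=true
    remote.log.storage.manager.class=com.example.storage.HdfsRemoteStorageManager

Brokers and topics that don't enable this keep today's local-disk-only behaviour unchanged.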
Let the RLM handle the remote storage part.

Thanks,
Harsha

On Mon, Feb 4, 2019, at 12:54 PM, Ryanne Dolan wrote:
> Harsha, Sriharsha, Suresh, a couple thoughts:
>
> - How could this be used to leverage fast key-value stores, e.g. Couchbase,
> which can serve individual records but maybe not entire segments? Or is the
> idea to only support writing and fetching entire segments? Would it make
> sense to support both?
>
> - Instead of defining a new interface and/or mechanism to ETL segment files
> from brokers to cold storage, can we just leverage Kafka itself? In
> particular, we can already ETL records to HDFS via Kafka Connect, Gobblin
> etc -- we really just need a way for brokers to read these records back.
> I'm wondering whether the new API could be limited to the fetch, and then
> existing ETL pipelines could be more easily leveraged. For example, if you
> already have an ETL pipeline from Kafka to HDFS, you could leave that in
> place and just tell Kafka how to read these records/segments from cold
> storage when necessary.
>
> - I'm wondering if we could just add support for loading segments from
> remote URIs instead of from file, i.e. via plugins for s3://, hdfs:// etc.
> I suspect less broker logic would change in that case -- the broker
> wouldn't necessarily care if it reads from file:// or s3:// to load a given
> segment.
>
> Combining the previous two comments, I can imagine a URI resolution chain
> for segments. For example, first try file:///logs/{topic}/{segment}.log,
> then s3://mybucket/{topic}/{date}/{segment}.log, etc, leveraging your
> existing ETL pipeline(s).
>
> Ryanne
>
>
> On Mon, Feb 4, 2019 at 12:01 PM Harsha <ka...@harsha.io> wrote:
>
> > Hi All,
> > We are interested in adding tiered storage to Kafka. More details
> > about motivation and design are in the KIP. We are working towards an
> > initial POC. Any feedback or questions on this KIP are welcome.
> >
> > Thanks,
> > Harsha
> >