Thanks Harsha, makes sense for the most part.

> tiered storage is to get away from this and make this transparent to the user

I think you are saying that this enables additional (potentially cheaper) storage options without *requiring* an existing ETL pipeline. But it's not really a replacement for the sort of pipelines people build with Connect, Gobblin etc.

My point was that, if you are already offloading records in an ETL pipeline, why do you need a new pipeline built into the broker to ship the same data to the same place?

I think in most cases this will be an additional pipeline, not a replacement, because the segments written to cold storage won't be useful outside Kafka. So you'd end up with one of:

1) cold segments are only useful to Kafka;
2) you have the same data written to HDFS/etc twice, once for Kafka and once for everything else, in two separate formats;
3) you use your existing ETL pipeline and read cold data directly.

To me, an ideal solution would let me spool segments from Kafka to any sink I would like, and then let Kafka clients seamlessly access that cold data. Today I can do that in the client, but ideally the broker would do it for me via some HDFS/Hive/S3 plugin. The KIP seems to accomplish that -- just without leveraging anything I've currently got in place.
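To make that concrete, here is a rough sketch of the kind of broker-side plugin and resolution chain I have in mind -- purely illustrative, none of these names come from the KIP:

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URI;
    import java.util.List;
    import java.util.Optional;

    // Hypothetical plugin API, one implementation per URI scheme (file, hdfs, s3, ...).
    interface SegmentSource {
        boolean handles(URI uri);                                // e.g. "s3".equals(uri.getScheme())
        Optional<InputStream> open(URI uri) throws IOException;  // empty if the segment isn't there
    }

    // Broker-side resolution chain: try each candidate location in order, e.g.
    // file:///logs/{topic}/{segment}.log first, then s3://mybucket/{topic}/{date}/{segment}.log.
    class SegmentResolver {
        private final List<SegmentSource> sources;

        SegmentResolver(List<SegmentSource> sources) {
            this.sources = sources;
        }

        InputStream resolve(List<URI> candidates) throws IOException {
            for (URI uri : candidates) {
                for (SegmentSource source : sources) {
                    if (!source.handles(uri)) {
                        continue;
                    }
                    Optional<InputStream> stream = source.open(uri);
                    if (stream.isPresent()) {
                        return stream.get();
                    }
                }
            }
            throw new IOException("Segment not found in any configured tier");
        }
    }

The broker wouldn't need to care whether a given segment ultimately comes from file:// or s3://; the scheme-specific sources are where an HDFS/Hive/S3 plugin would live.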
Ryanne

On Mon, Feb 4, 2019 at 3:34 PM Harsha <ka...@harsha.io> wrote:

> Hi Eric,
> Thanks for your questions. Answers are in-line.
>
> "The high-level design seems to indicate that all of the logic for when and how to copy log segments to remote storage lives in the RLM class. The default implementation is then HDFS specific with additional implementations being left to the community. This seems like it would require anyone implementing a new RLM to also re-implement the logic for when to ship data to remote storage."
>
> RLM will be responsible for shipping log segments, and it will decide when a log segment is ready to be shipped over.
> Once log segments are identified as rolled over, RLM will delegate this responsibility to a pluggable remote storage implementation. Users who are looking to add their own implementation to enable other storages only need to implement the copy and read mechanisms, not re-implement RLM itself.
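> For illustration, the split could look roughly like the sketch below -- the names here are made up, not the final interface from the KIP. RLM keeps the "when" (which rolled-over segments to ship and when to read them back); the plugin keeps the "how" for a particular store:
>
>     import java.io.File;
>     import java.io.IOException;
>     import java.io.InputStream;
>     import org.apache.kafka.common.TopicPartition;
>
>     // Illustrative only -- the exact interface is still being worked out in the KIP.
>     public interface RemoteStorage {
>
>         // Copy a rolled-over, immutable log segment and its offset index to the remote store.
>         void copyLogSegment(TopicPartition tp, File segment, File offsetIndex) throws IOException;
>
>         // Read bytes of a previously copied segment back to serve a fetch request.
>         InputStream readLogSegment(TopicPartition tp, long baseOffset, long startPosition, int maxBytes) throws IOException;
>
>         // Remove the remote copy of a segment once retention expires.
>         void deleteLogSegment(TopicPartition tp, long baseOffset) throws IOException;
>     }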
> "Would it not be better for the Remote Log Manager implementation to be non-configurable, and instead have an interface for the remote storage layer? That way the "when" of the logic is consistent across all implementations and it's only a matter of "how," similar to how the Streams StateStores are managed."
>
> It's possible that we can make RLM non-configurable. But for the initial release, and to keep backward compatibility, we want to make this configurable, so that users who are not interested in having log segments shipped to remote storage don't need to worry about it.
>
> Hi Ryanne,
> Thanks for your questions.
>
> "How could this be used to leverage fast key-value stores, e.g. Couchbase, which can serve individual records but maybe not entire segments? Or is the idea to only support writing and fetching entire segments? Would it make sense to support both?"
>
> Log segments, once rolled over, are immutable objects, and we want to keep the current structure of log segments and their corresponding index files. It will be easier to copy the whole segment as it is, instead of re-reading each file and writing into a key/value store.
>
> "- Instead of defining a new interface and/or mechanism to ETL segment files from brokers to cold storage, can we just leverage Kafka itself? In particular, we can already ETL records to HDFS via Kafka Connect, Gobblin etc -- we really just need a way for brokers to read these records back. I'm wondering whether the new API could be limited to the fetch, and then existing ETL pipelines could be more easily leveraged. For example, if you already have an ETL pipeline from Kafka to HDFS, you could leave that in place and just tell Kafka how to read these records/segments from cold storage when necessary."
>
> This is pretty much what everyone does today, and it carries the additional overhead of operating and monitoring those pipelines.
> What's proposed in the KIP is not ETL. It's just looking at the logs that are written and rolled over, and copying the files as they are.
> With a traditional ETL pipeline, each new topic needs to be added (sure, we can do so via wildcard or another mechanism) -- new topics need to be onboarded to ship their data into remote storage.
> Once the data lands somewhere like HDFS/Hive etc., users need to write another processing pipeline to re-process this data, similar to how they are doing it in their stream processing pipelines. Tiered storage is to get away from this and make this transparent to the user. They don't need to run another ETL process to ship the logs.
>
> "I'm wondering if we could just add support for loading segments from remote URIs instead of from file, i.e. via plugins for s3://, hdfs:// etc. I suspect less broker logic would change in that case -- the broker wouldn't necessarily care if it reads from file:// or s3:// to load a given segment."
>
> Yes, this is what we are discussing in the KIP. We are leaving the details of loading segments to the RLM read path instead of exposing this directly in the broker. This way we can keep the current Kafka code as it is, without changing the assumptions around the local disk, and let the RLM handle the remote storage part.
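> As a rough sketch of that read path (stand-in names, not actual broker code, and assuming the remote tier only holds data older than what is still on local disk): the local path stays exactly as it is, and the RLM is consulted only for offsets that have already been aged off local disk.
>
>     import java.io.IOException;
>     import org.apache.kafka.common.TopicPartition;
>     import org.apache.kafka.common.record.Records;
>
>     // Stand-in types, only to show where the delegation to the RLM would happen.
>     class TieredReadPath {
>         interface LocalLog {
>             long logStartOffset();                                      // oldest offset still on local disk
>             Records read(long offset, int maxBytes) throws IOException;
>         }
>
>         interface RemoteLogManager {
>             Records read(TopicPartition tp, long offset, int maxBytes) throws IOException;
>         }
>
>         private final LocalLog local;
>         private final RemoteLogManager rlm;
>
>         TieredReadPath(LocalLog local, RemoteLogManager rlm) {
>             this.local = local;
>             this.rlm = rlm;
>         }
>
>         Records read(TopicPartition tp, long offset, int maxBytes) throws IOException {
>             if (offset >= local.logStartOffset()) {
>                 return local.read(offset, maxBytes);   // existing local-disk read path, unchanged
>             }
>             return rlm.read(tp, offset, maxBytes);     // older offsets are served through the RLM plugin
>         }
>     }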
> Thanks,
> Harsha
>
> On Mon, Feb 4, 2019, at 12:54 PM, Ryanne Dolan wrote:
> > Harsha, Sriharsha, Suresh, a couple thoughts:
> >
> > - How could this be used to leverage fast key-value stores, e.g. Couchbase, which can serve individual records but maybe not entire segments? Or is the idea to only support writing and fetching entire segments? Would it make sense to support both?
> >
> > - Instead of defining a new interface and/or mechanism to ETL segment files from brokers to cold storage, can we just leverage Kafka itself? In particular, we can already ETL records to HDFS via Kafka Connect, Gobblin etc -- we really just need a way for brokers to read these records back. I'm wondering whether the new API could be limited to the fetch, and then existing ETL pipelines could be more easily leveraged. For example, if you already have an ETL pipeline from Kafka to HDFS, you could leave that in place and just tell Kafka how to read these records/segments from cold storage when necessary.
> >
> > - I'm wondering if we could just add support for loading segments from remote URIs instead of from file, i.e. via plugins for s3://, hdfs:// etc. I suspect less broker logic would change in that case -- the broker wouldn't necessarily care if it reads from file:// or s3:// to load a given segment.
> >
> > Combining the previous two comments, I can imagine a URI resolution chain for segments. For example, first try file:///logs/{topic}/{segment}.log, then s3://mybucket/{topic}/{date}/{segment}.log, etc, leveraging your existing ETL pipeline(s).
> >
> > Ryanne
> >
> > On Mon, Feb 4, 2019 at 12:01 PM Harsha <ka...@harsha.io> wrote:
> >
> > > Hi All,
> > > We are interested in adding tiered storage to Kafka. More details about motivation and design are in the KIP. We are working towards an initial POC. Any feedback or questions on this KIP are welcome.
> > >
> > > Thanks,
> > > Harsha
> > >