@Neha, not sure what you mean by using base64 encoded strings. base64 encoding takes bytes and gives you ASCII text. We need to go from arbitrarily structured offset data to bytes (e.g., the user has given us a record (with a schema they have defined) containing a db name + table name for the key, and another record (with a schema they have defined) containing a timestamp and a value from an auto-incrementing column).
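A minimal sketch of the point above, using hypothetical offset structures (these are illustrative, not actual Copycat APIs): base64 only maps bytes to ASCII text, so some serialization step still has to turn the connector-defined record into bytes first. JSON is used here purely for illustration.

```python
import base64
import json

# Hypothetical connector-defined offset data: structured records whose
# schemas are defined by the connector, not by the Copycat framework.
offset_key = {"db": "inventory", "table": "orders"}
offset_value = {"timestamp": 1439700000000, "last_id": 42}

# Step 1 (the actual problem): structured record -> bytes.
# Some serialization format must be chosen; JSON is just one option.
key_bytes = json.dumps(offset_key).encode("utf-8")
value_bytes = json.dumps(offset_value).encode("utf-8")

# Step 2: base64 merely maps those bytes to ASCII text. It says nothing
# about how the structured record became bytes in the first place.
ascii_key = base64.b64encode(key_bytes).decode("ascii")

# Round trip: base64 is lossless but completely format-agnostic.
assert base64.b64decode(ascii_key) == key_bytes
```

In other words, base64 can sit on top of whatever serializer is chosen, but it cannot replace the choice of serializer.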
By the way, offset serialization comes with all the same schema challenges as regular data does, although admittedly offsets may be less likely to see upgrades to the schema. You need to either ship schemas with the data, have a separate registry and just ship an ID, or have some way for the plugins to provide their schemas to deserializers (perhaps with IDs, acting as an embedded-in-code registry).

It may be fine to choose JSON for all offset data, but then that either requires committing to this ugly envelope approach to include the schema with messages (and accepting all that overhead) or making separate offset deserializer APIs where you can pass in the schema of the data to be decoded. Alternatively, you ignore schemas and hope the connectors behave well (which was described as the loosey-goosey approach in the other thread about handling connector config upgrades).

-Ewen

On Sat, Aug 15, 2015 at 9:00 PM, Gwen Shapira <g...@confluent.io> wrote:
> Yeah, I agree that if we have the ser/de we can do anything :)
>
> I'd actually feel more comfortable if the users *have* to go through our APIs to get to the metadata (which, again, is kind of internal to Copycat). If they start writing their own code that depends on this data, who knows what we may accidentally break?
>
> I'd prefer a more well-defined contract here.
>
> The JSON shop should still feel fairly comfortable using REST APIs... most of them are :)
>
> On Fri, Aug 14, 2015 at 8:14 PM, Ewen Cheslack-Postava <e...@confluent.io> wrote:
> > On Fri, Aug 14, 2015 at 6:35 PM, Gwen Shapira <g...@confluent.io> wrote:
> > > Yeah, I missed the option to match the serialization of offsets to data, which solves the configuration overhead.
> > >
> > > It still doesn't give us the ability to easily evolve the metadata messages or to use them in monitoring tools.
> > >
> > > And I am still not clear on the benefits of using user-defined serialization for the offsets.
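The two alternatives Ewen describes at the top of this message (envelope vs. registry ID) can be sketched roughly as follows. All field names here are hypothetical illustrations, not a committed Copycat format.

```python
import json

# Hypothetical offset value as a connector would define it.
payload = {"timestamp": 1439700000000, "last_id": 42}

# Hand-written schema for the payload; field names are illustrative only.
schema = {
    "type": "struct",
    "fields": [
        {"name": "timestamp", "type": "int64"},
        {"name": "last_id", "type": "int64"},
    ],
}

# Envelope approach: every message ships its schema with the payload,
# so any reader can decode it without external coordination -- at the
# cost of repeating the schema in every message.
envelope = json.dumps({"schema": schema, "payload": payload}).encode("utf-8")

# Registry approach: ship only a small schema ID; the reader must be
# able to resolve that ID back to the schema somewhere else.
with_id = json.dumps({"schema_id": 7, "payload": payload}).encode("utf-8")

# The envelope is self-describing but substantially larger per message.
print(len(envelope), len(with_id))
```

This is the overhead trade-off being discussed: the envelope buys self-describing messages, the registry buys compact messages plus an external dependency.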
> >
> > If we can get at the serialization config (e.g. via the REST API), then we can decode the data regardless of the format. The main drawback is that any tool that needs to decode them then needs the serializer jars on its classpath. I think the benefit is that it lets users process that data in their preferred format if they want to build their own tools. For example, in a shop that prefers JSON, this could be the difference between them considering the data easily accessible (they just read the topic and parse it using their favorite JSON library) vs. not accessible (they won't pull in whatever serializer we use internally, and don't want to write a custom parser for our custom serialization format).
> >
> > -Ewen
> >
> > > Gwen
> > >
> > > On Fri, Aug 14, 2015 at 1:29 AM, Ewen Cheslack-Postava <e...@confluent.io> wrote:
> > > > I'm not sure the existing discussion is clear about how the format of offset data is decided. One possibility is that we choose one fixed format and that is what we use internally to store offsets, no matter what serializer you choose. This would be similar to how the __offsets topic is currently handled (with a custom serialization format). In other words, we use format X to store offsets; if you serialize your data with Y or Z, we don't care, we still use format X. The other option (which is used in the current PR-99 patch) would still make offset serialization pluggable, but there wouldn't be a separate option for it. Offset serialization would use the same format as the data serialization. If you use X for data, we use X for offsets; if you use Y for data, we use Y for offsets.
> > > >
> > > > @neha wrt providing access through a REST API, I guess you are suggesting that we can serialize that data to JSON for that API.
> > > > I think it's important to point out that this is arbitrarily structured, connector-specific data. In many ways, it's not that different from the actual message data, in that it is highly dependent on the connector, and downstream consumers need to understand the connector and its data format to do anything meaningful with the data. Because of this, I'm not convinced that serializing it in a format other than the one used for the data will be particularly useful.
> > > >
> > > > On Thu, Aug 13, 2015 at 11:22 PM, Neha Narkhede <n...@confluent.io> wrote:
> > > > > Copycat enables streaming data in and out of Kafka. Connector writers need to define the serde of the data, as it is different per system. Metadata should be entirely hidden by the Copycat framework and isn't something users or connector implementors need to serialize differently, as long as we provide tools/REST APIs to access the metadata where required. Moreover, as you suggest, evolution, maintenance, and configs are much simpler if it remains hidden.
> > > > >
> > > > > +1 on keeping just the serializers for data configurable.
> > > > >
> > > > > On Thu, Aug 13, 2015 at 9:59 PM, Gwen Shapira <g...@confluent.io> wrote:
> > > > > > Hi Team Kafka,
> > > > > >
> > > > > > As you know from KIP-26 and PR-99, when you use Copycat to move data from an external system to Kafka, in addition to storing the data itself, Copycat will also need to store some metadata.
> > > > > >
> > > > > > This metadata is currently offsets on the source system (say, an SCN# from the Oracle redo log), but I can imagine storing a bit more.
> > > > > > When storing data, we obviously want pluggable serializers, so users will get the data in a format they like.
> > > > > >
> > > > > > But the metadata seems internal, i.e. users shouldn't care about it, and if we want them to read or change anything, we want to provide them with tools to do it.
> > > > > >
> > > > > > Moreover, by controlling the format we can do three important things:
> > > > > > * Read the metadata for monitoring / audit purposes
> > > > > > * Evolve / modify it. If users serialize it in their own format, and actually write clients to use this metadata, we don't know if it's safe to evolve.
> > > > > > * Keep configuration a bit simpler. This adds at least 4 new configuration items...
> > > > > >
> > > > > > What do you guys think?
> > > > > >
> > > > > > Gwen
> > > > >
> > > > > --
> > > > > Thanks,
> > > > > Neha
> > > >
> > > > --
> > > > Thanks,
> > > > Ewen
> >
> > --
> > Thanks,
> > Ewen

--
Thanks,
Ewen
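Gwen's configuration-overhead point earlier in the thread can be made concrete. If offset serialization were independently pluggable, a worker config would need something like the following extra entries on top of the data serializer settings. All key names below are purely hypothetical, shown only to illustrate the overhead; the placeholder values are deliberately left unfilled.

```properties
# Existing data serializer settings (names illustrative):
key.serializer=org.apache.kafka.copycat.json.JsonSerializer
value.serializer=org.apache.kafka.copycat.json.JsonSerializer

# The "at least 4 new configuration items" user-defined offset
# serialization would add (hypothetical keys):
offset.key.serializer=...
offset.key.deserializer=...
offset.value.serializer=...
offset.value.deserializer=...
```

Reusing the data serializers for offsets, or fixing one internal offset format, makes all four of these unnecessary.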