We also use this feature of AvroSerDe and find it very useful. In our case
we copy the schema from our schema registry into S3 and reference it from
there. In effect, we listen to the internal topic used to store schemas by
our registry, and push to S3 whenever there is a new record. As well as
being automated, this also provides a degree of system decoupling; Hive
queries can be executed independently of the availability of our registry.
I appreciate that this is not a general solution.
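
For anyone wanting to try something similar, the mirroring step described above might look roughly like the sketch below. It is not our actual implementation. It assumes that each record on the registry's internal topic has a JSON key with a "keytype" field and a JSON value with "subject", "version" and "schema" fields. The "schemas/<subject>/<version>.avsc" object layout and the function names are hypothetical, and the consumer and the S3 client are left out (the upload step is passed in as a callable).

```python
import json
from typing import Callable, Iterable, Optional, Tuple


def schema_object_for_record(
    key_bytes: Optional[bytes],
    value_bytes: Optional[bytes],
    prefix: str = "schemas",
) -> Optional[Tuple[str, bytes]]:
    """Map one internal-topic record to (object_key, body), or None.

    None is returned for records that carry no schema: missing keys,
    tombstones (no value), and keys whose "keytype" is not "SCHEMA".
    The record layout is an assumption, see the text above.
    """
    if key_bytes is None or value_bytes is None:
        return None
    key = json.loads(key_bytes)
    if key.get("keytype") != "SCHEMA":
        return None
    value = json.loads(value_bytes)
    # Hypothetical object layout: <prefix>/<subject>/<version>.avsc
    object_key = f"{prefix}/{value['subject']}/{value['version']}.avsc"
    # Store the bare schema document, without the registry's wrapper.
    return object_key, value["schema"].encode("utf-8")


def mirror(
    records: Iterable[Tuple[Optional[bytes], Optional[bytes]]],
    upload: Callable[[str, bytes], None],
) -> int:
    """Upload every schema record and return how many were uploaded."""
    count = 0
    for key_bytes, value_bytes in records:
        mapped = schema_object_for_record(key_bytes, value_bytes)
        if mapped is not None:
            upload(*mapped)
            count += 1
    return count
```

In a real deployment the records would come from a Kafka consumer and upload would wrap an S3 client call. The table's 'avro.schema.url' would then point at the uploaded object.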

On the topic of this feature proposal: adding JSON filtering/extraction
to the SerDe feels like a workaround for a very specific design decision
made in the Confluent API rather than a generally useful feature for the
SerDe. Arguably it would be more useful if the Confluent API were amended
or extended to return only the schema document, rather than encapsulating
it in a seemingly superfluous wrapper; that way any system that can load
an Avro schema from a URI could potentially integrate with the registry
with no Confluent-specific transformations. However, I can understand any
reluctance to make such a change.

Given that it would be possible to implement some fairly simple
workarounds, I don't think it's the responsibility of the Hive project to
bridge this gap.

All that said, if you are still keen on this approach, might I suggest
using a JSON Pointer to locate the relevant node in the returned document,
as this could be applied generally to many different JSON response
structures: https://tools.ietf.org/html/rfc6901
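
To make the JSON Pointer suggestion concrete, here is a minimal sketch of RFC 6901 resolution applied to a wrapped schema response. It is illustrative only and is not a proposed AvroSerDe patch. The function names are hypothetical. The assumption that the schema arrives as a JSON-encoded string under a "schema" attribute is mine, based on the description of the registry response in the quoted message.

```python
import json
from typing import Any


def resolve_pointer(document: Any, pointer: str) -> Any:
    """Resolve an RFC 6901 JSON Pointer against already-parsed JSON."""
    if pointer == "":
        return document  # the empty pointer refers to the whole document
    if not pointer.startswith("/"):
        raise ValueError(f"invalid JSON Pointer: {pointer!r}")
    node = document
    for raw in pointer[1:].split("/"):
        # Unescape in the order RFC 6901 requires: "~1" first, then "~0".
        token = raw.replace("~1", "/").replace("~0", "~")
        if isinstance(node, list):
            # Array indices are ASCII digits with no leading zeros.
            valid = token.isascii() and token.isdigit()
            if not valid or (len(token) > 1 and token[0] == "0"):
                raise KeyError(f"invalid array index: {token!r}")
            node = node[int(token)]
        elif isinstance(node, dict):
            node = node[token]
        else:
            raise KeyError(f"cannot descend into scalar at {token!r}")
    return node


def extract_schema(response_body: str, pointer: str) -> Any:
    """Return the schema found at `pointer` in a JSON response body.

    If the located node is a string, it is assumed to hold a
    JSON-encoded schema and is decoded once more.
    """
    node = resolve_pointer(json.loads(response_body), pointer)
    return json.loads(node) if isinstance(node, str) else node
```

With a response shaped like {"subject": ..., "version": ..., "schema": "<escaped schema>"}, the pointer "/schema" selects the embedded document. A differently shaped response would only need a different pointer, which is the appeal over a single hard-coded field name.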

Elliot.

On Thu, 12 Oct 2017 at 14:52, Stephen Durfey <sjdur...@gmail.com> wrote:

> Recently my team has opened a discussion with Confluent [1] regarding
> the schema registry being used to serve up Avro schemas for the Hive
> AvroSerDe to make use of through the 'avro.schema.url' config. Originally
> we were hoping to just get a REST endpoint that returns just the schema,
> to avoid making any changes to the AvroSerDe. The Confluent REST endpoint
> today returns the Avro schema embedded as an attribute inside a JSON
> response [2], which makes it unusable by the AvroSerDe.
>
> I wanted to reach out to the community to talk about the possibility of
> enhancing the AvroSerDe to be able to make use of JSON responses returned
> from the configured URL. One of the possibilities mentioned in the
> Confluent GitHub issue was to add a new (optional) configuration that
> identifies the field within the JSON response; if set, the AvroSerDe
> would use that config to retrieve the schema from that attribute.
>
> We're open to other suggestions and would be happy to contribute the patch
> back to hive for whatever design is settled on.
>
> - Stephen
>
>
> [1] https://github.com/confluentinc/schema-registry/issues/629
> [2] https://docs.confluent.io/current/schema-registry/docs/api.html#get--subjects-(string- subject)-versions-(versionId- version)
>
