We also use this feature of AvroSerDe and find it very useful. In our case we copy the schema from our schema registry into S3 and reference it from there. In effect, we listen to the internal topic used to store schemas by our registry, and push to S3 whenever there is a new record. As well as being automated, this also provides a degree of system decoupling; Hive queries can be executed independently of the availability of our registry. I appreciate that this is not a general solution.
On the topic of this feature proposal: the addition of some JSON filtering/extraction in the SerDe feels like a workaround for a very specific design decision made in the Confluent API rather than a more generally useful feature for the SerDe. Arguably it would be more generally useful if the Confluent API were amended or extended to return only the schema document, rather than encapsulating it in a seemingly superfluous wrapper; that way, any system that can load an Avro schema from a URI could potentially integrate with the registry with no Confluent-specific transformations. However, I can understand any reluctance to make such a change.

Given that it would be possible to implement some fairly simple workarounds, I don't think it's the responsibility of the Hive project to bridge this gap.

All that said, if you are still keen on this approach, might I suggest using a JSON Pointer to locate the relevant node in the returned document, as this could be applied generally to many different JSON response structures: https://tools.ietf.org/html/rfc6901

Elliot.

On Thu, 12 Oct 2017 at 14:52, Stephen Durfey <sjdur...@gmail.com> wrote:

> Recently my team has opened a discussion with Confluent [1] in regards to
> the schema registry being used to serve up Avro schemas for the Hive
> AvroSerDe to make use of through the 'avro.schema.url' config. Originally we
> were hoping to just get a REST endpoint that returns just the schema to
> avoid making any changes to the AvroSerDe. The Confluent REST endpoint
> today returns the Avro schema embedded as an attribute inside a JSON
> response [2], which makes it unusable by the AvroSerDe.
>
> I wanted to reach out to the community to talk about the possibility of
> enhancing the AvroSerDe to be able to make use of JSON responses returned
> from the configured URL. One of the possibilities mentioned in the
> Confluent GitHub issue was to add a new (optional) configuration to
> identify the field within the JSON response, and have the AvroSerDe, if
> it is set, use that config to retrieve the schema from that attribute.
>
> We're open to other suggestions and would be happy to contribute the patch
> back to Hive for whatever design is settled on.
>
> - Stephen
>
> [1] https://github.com/confluentinc/schema-registry/issues/629
> [2] https://docs.confluent.io/current/schema-registry/docs/api.html#get--subjects-(string- subject)-versions-(versionId- version)
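
To make the JSON Pointer suggestion in the reply above a little more concrete, here is a minimal Python sketch of RFC 6901 resolution applied to a wrapped schema response. It is illustrative only and is not AvroSerDe code. The function names, the sample payload, and the "/schema" pointer are assumptions made for the example, as is the guess that the wrapper carries the schema as a JSON-encoded string.

```python
import json


def resolve_pointer(document, pointer):
    """Resolve an RFC 6901 JSON Pointer against an already-parsed JSON value."""
    if pointer == "":
        return document  # the empty pointer refers to the whole document
    if not pointer.startswith("/"):
        raise ValueError("a non-empty JSON Pointer must start with '/'")
    node = document
    for raw in pointer[1:].split("/"):
        # Unescape in the order RFC 6901 requires: "~1" first, then "~0".
        token = raw.replace("~1", "/").replace("~0", "~")
        if isinstance(node, list):
            # Array indices are "0" or ASCII digits without a leading zero.
            valid = token.isascii() and token.isdigit()
            if not valid or (len(token) > 1 and token[0] == "0"):
                raise KeyError("invalid array index: %r" % token)
            node = node[int(token)]
        elif isinstance(node, dict):
            node = node[token]
        else:
            raise KeyError("cannot descend into a scalar at %r" % token)
    return node


def extract_schema(response_text, pointer="/schema"):
    """Return the schema located by `pointer` (hypothetical helper).

    Assumption: if the located node is a string, it is treated as an
    embedded JSON document and parsed again.
    """
    node = resolve_pointer(json.loads(response_text), pointer)
    return json.loads(node) if isinstance(node, str) else node


# Made-up response body standing in for a registry reply; the field names
# and values here are assumptions for illustration.
response_body = json.dumps({
    "subject": "example-subject",
    "version": 1,
    "id": 1,
    "schema": json.dumps({
        "type": "record",
        "name": "Example",
        "fields": [{"name": "id", "type": "long"}],
    }),
})

schema = extract_schema(response_body)
```

Because the pointer is just a string, a single optional setting could address differently shaped responses (for example "/data/0/schema") without any registry-specific logic in the SerDe.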