Dariusz Seweryn created NIFI-14628:
--------------------------------------

             Summary: Amazon Glue message deserialization
                 Key: NIFI-14628
                 URL: https://issues.apache.org/jira/browse/NIFI-14628
             Project: Apache NiFi
          Issue Type: New Feature
          Components: Extensions
    Affects Versions: 2.4.0
            Reporter: Dariusz Seweryn


h1. Context

Using an (Avro)Reader with Schema Access Strategy = Schema Reference Reader, 
which means the schema reference is retrieved from the message itself, requires 
both the Schema Registry and the corresponding Schema Reference Reader.

A ConsumeKinesisStream processor is available, and Kinesis messages can use 
Amazon Glue Schema Registry. There is an AmazonGlueSchemaRegistry service, but 
there is no corresponding AmazonGlueEncodedSchemaReferenceReader service. As a 
result, Glue encoded Kinesis messages cannot currently be read "the NiFi way".

By comparison, Confluent has both services: ConfluentSchemaRegistry and 
ConfluentEncodedSchemaReferenceReader.
h1. Solution

Introduce AmazonGlueEncodedSchemaReferenceReader.
h1. Discussion Points
h2. Schema Version ID vs SchemaIdentifier
The SchemaReferenceReader contract requires returning a SchemaIdentifier.
This is somewhat problematic because the Glue encoded message header contains a 
Header Version byte, a Header Compression byte, and the UUID (16 bytes) of the 
"Schema Version ID", which is a unique, opaque identifier of both the schema 
name and its version. The Amazon API allows retrieving a schema by using 
(Schema Name + optionally Schema Version) XOR Schema Version ID.
The UUID does not fit well into any of the SchemaIdentifier fields: name 
(string), version ID (long), identifier (long), version (int), branch (string).
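
To make the header layout concrete, here is a minimal sketch of reading it. GlueHeaderSketch and GlueHeader are hypothetical names, not NiFi or AWS classes. The sketch assumes the three fields appear in the order listed above and that the UUID is stored most significant bytes first. It does not interpret the version or compression byte values.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Hypothetical helper for illustration only; not part of NiFi or the AWS SDK.
final class GlueHeaderSketch {
    // 1 header version byte + 1 compression byte + 16 UUID bytes
    static final int HEADER_LENGTH = 18;

    // Plain value holder for the three header fields described above.
    static final class GlueHeader {
        final byte headerVersion;
        final byte compression;
        final UUID schemaVersionId;

        GlueHeader(byte headerVersion, byte compression, UUID schemaVersionId) {
            this.headerVersion = headerVersion;
            this.compression = compression;
            this.schemaVersionId = schemaVersionId;
        }
    }

    static GlueHeader parse(byte[] message) {
        if (message == null || message.length < HEADER_LENGTH) {
            throw new IllegalArgumentException(
                    "Message is shorter than the " + HEADER_LENGTH + "-byte header");
        }
        // ByteBuffer is big-endian by default.
        // Assumption: the UUID is written most significant bytes first.
        ByteBuffer buffer = ByteBuffer.wrap(message);
        byte headerVersion = buffer.get();
        byte compression = buffer.get();
        long mostSignificantBits = buffer.getLong();
        long leastSignificantBits = buffer.getLong();
        return new GlueHeader(headerVersion, compression,
                new UUID(mostSignificantBits, leastSignificantBits));
    }
}
```

Under these assumptions the record payload starts at offset HEADER_LENGTH, and schemaVersionId is the value that somehow has to travel through SchemaIdentifier.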

How to continue?
 # Pass the UUID in the Branch field.
No other SchemaIdentifier field would be populated. Since this Reference Reader 
is expected to be used only with AmazonGlueSchemaRegistry, we can introduce 
specialized logic there that applies when the branch is present. The current 
implementation does not use the branch at all, but it expects the Name field to 
always be present.
 # Pass the UUID in the Name field.
This feels less hacky than using the Branch field. However, it would need some 
kind of custom prefix/suffix in the passed value so that 
AmazonGlueSchemaRegistry knows it should handle this particular name as a UUID. 
This approach may introduce schema name conflicts.
 # Introduce yet another SchemaIdentifier field. 
Inspiration for the naming is needed.

Personally, I lean toward option 1 because it would not introduce any breaking 
changes for existing users and, since SchemaIdentifier stays unchanged, it does 
not require potentially cascading changes. Option 2 would be my second choice.
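
A rough sketch of how option 1 could look. Identifier is a hypothetical stand-in for the name and branch fields of SchemaIdentifier, and describeLookup only illustrates the branching that AmazonGlueSchemaRegistry would need. None of this is existing NiFi code.

```java
import java.util.Optional;
import java.util.UUID;

// Hypothetical illustration of option 1; not existing NiFi code.
final class BranchCarriedUuidSketch {
    // Stand-in for the two SchemaIdentifier fields that matter here.
    static final class Identifier {
        final Optional<String> name;
        final Optional<String> branch;

        Identifier(Optional<String> name, Optional<String> branch) {
            this.name = name;
            this.branch = branch;
        }
    }

    // Reference reader side: only the branch field is populated.
    static Identifier fromSchemaVersionId(UUID schemaVersionId) {
        return new Identifier(Optional.empty(), Optional.of(schemaVersionId.toString()));
    }

    // Registry side: choose the lookup based on which field is present.
    static String describeLookup(Identifier identifier) {
        if (identifier.branch.isPresent()) {
            // Throws IllegalArgumentException if the branch is not a valid UUID.
            UUID schemaVersionId = UUID.fromString(identifier.branch.get());
            return "by-schema-version-id:" + schemaVersionId;
        }
        // Existing behavior: the name is required.
        String name = identifier.name.orElseThrow(
                () -> new IllegalArgumentException("Schema name is required"));
        return "by-name:" + name;
    }
}
```

Existing users never set the branch, so they would always take the by-name path, which is why this option should not break them.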
h2. Support for compression

The header of the Glue message contains a byte describing whether the message 
is compressed. If it is, the data after the header has to be decompressed 
before further processing. With my current understanding of NiFi, I am not 
confident that this can be easily addressed within the current architecture.

How to continue?
 # Divide into smaller problems
For now don't support compressed messages
 # Other
Open for suggestions
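
For reference, a minimal sketch of what handling the compression byte could look like on a plain InputStream. GluePayloadSketch is a hypothetical name. The byte values (0 for uncompressed, 5 for zlib) and the use of zlib are assumptions based on my reading of the AWS serde library and should be verified. The sketch also does not show how this would fit into NiFi's reader services.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical helper for illustration only; not part of NiFi or the AWS SDK.
final class GluePayloadSketch {
    static final int SCHEMA_VERSION_ID_LENGTH = 16;
    // Assumptions to verify against the AWS library: 0 = uncompressed, 5 = zlib.
    static final byte COMPRESSION_NONE = 0;
    static final byte COMPRESSION_ZLIB = 5;

    // Consumes the header and returns a stream positioned at the record payload.
    static InputStream openPayload(InputStream message) throws IOException {
        DataInputStream in = new DataInputStream(message);
        in.readByte();                                   // header version, not validated here
        byte compression = in.readByte();
        in.readFully(new byte[SCHEMA_VERSION_ID_LENGTH]); // skip the schema version ID
        if (compression == COMPRESSION_NONE) {
            return in;
        }
        if (compression == COMPRESSION_ZLIB) {
            // Inflates lazily while the caller reads from the returned stream.
            return new InflaterInputStream(in);
        }
        throw new IOException("Unsupported compression byte: " + compression);
    }
}
```

Option 1 above would amount to throwing for every value other than COMPRESSION_NONE instead of wrapping the stream.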



