[jira] [Updated] (NIFI-14628) Amazon Glue message deserialization

Dariusz Seweryn (Jira) Tue, 10 Jun 2025 04:54:03 -0700


     [ 
https://issues.apache.org/jira/browse/NIFI-14628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Dariusz Seweryn updated NIFI-14628:
-----------------------------------
    Description: 
h1. Context

Using an (Avro)Reader with Schema Access Strategy = Schema Reference Reader, 
which means the schema reference is retrieved from the message itself, requires 
both a Schema Registry and a corresponding Schema Reference Reader.

There is an available ConsumeKinesisStream processor. The Kinesis messages can 
use Amazon Glue Schema Registry. There is an AmazonGlueSchemaRegistry service, 
but there is no corresponding AmazonGlueEncodedSchemaReferenceReader service. 
As a result, Glue-encoded Kinesis messages cannot be read "the NiFi way".

By comparison, there are services ConfluentSchemaRegistry and 
ConfluentEncodedSchemaReferenceReader. 
h1. Solution

Introduce AmazonGlueEncodedSchemaReferenceReader.
h1. Discussion Points
h2. Schema Version ID vs SchemaIdentifier

The SchemaReferenceReader contract requires returning a SchemaIdentifier.
This is somewhat problematic because the Glue encoded message header contains a 
Header Version byte, a Header Compression byte, and the UUID (16 bytes) of the 
"Schema Version ID", which is a unique, opaque identifier of both the schema 
name and its version. The Amazon API allows retrieving a schema by either 
(Schema Name + optionally Schema Version) or the Schema Version ID, but not 
both. 
The UUID does not fit well into the SchemaIdentifier fields: name (string), 
version ID (long), identifier (long), version (int), branch (string).
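
To make the header layout concrete, here is a minimal sketch of reading the three fields described above. The class and method names are hypothetical, and the byte order of the UUID (most significant long first) is an assumption that should be checked against the Glue serializer.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

public class GlueHeaderSketch {
    // Layout from the description: 1 version byte, 1 compression byte, 16 UUID bytes.
    static final int HEADER_LENGTH = 1 + 1 + 16;

    /** Hypothetical holder for the three header fields. */
    static final class GlueHeader {
        final byte headerVersion;
        final byte compression;
        final UUID schemaVersionId;

        GlueHeader(byte headerVersion, byte compression, UUID schemaVersionId) {
            this.headerVersion = headerVersion;
            this.compression = compression;
            this.schemaVersionId = schemaVersionId;
        }
    }

    static GlueHeader parse(byte[] message) {
        if (message == null || message.length < HEADER_LENGTH) {
            throw new IllegalArgumentException(
                    "Message is shorter than the " + HEADER_LENGTH + " header bytes");
        }
        ByteBuffer buffer = ByteBuffer.wrap(message); // ByteBuffer is big-endian by default
        byte headerVersion = buffer.get();
        byte compression = buffer.get();
        // Assumption: the UUID is written as most significant long, then least significant long.
        long mostSignificant = buffer.getLong();
        long leastSignificant = buffer.getLong();
        return new GlueHeader(headerVersion, compression, new UUID(mostSignificant, leastSignificant));
    }
}
```

The resulting UUID is the value that has to be carried through a SchemaIdentifier, which is the problem discussed below.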

How to continue?
 # Pass the UUID in the Branch field.
No other SchemaIdentifier field would be populated. Since this Reference Reader 
is expected to be used only with AmazonGlueSchemaRegistry, we can introduce 
specialized logic there that applies when the branch is available. The current 
implementation does not use the branch at all but expects the Name field to 
always be available.
 # Pass the UUID in the Name field.
This feels less hacky than using the Branch field. However, it would need some 
kind of custom prefix/suffix in the passed value, so that 
AmazonGlueSchemaRegistry knows it should handle this particular name as a UUID. 
This approach may introduce schema name conflicts.
 # Introduce yet another SchemaIdentifier field.
Inspiration for the naming is needed.

-Personally I would lean to option 1. because it would not introduce any 
breaking changes to existing users and does not require potentially cascading 
changes due to no changes in SchemaIdentifier. Then option 2.-

After consultation, option 2 seems like a good way to go, with the addition of 
a specific prefix.
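
A rough sketch of how option 2 with a prefix could look on both sides. The prefix value and all names here are hypothetical; nothing in this ticket fixes them. The reader side would put the prefixed string into the Name field, and the registry side would check for the prefix before deciding how to look the schema up.

```java
import java.util.Optional;
import java.util.UUID;

public class GlueSchemaNameSketch {
    // Hypothetical prefix; the actual value has not been decided in this ticket.
    static final String PREFIX = "glue-schema-version-id:";

    /** Reader side: encode the Schema Version ID as a prefixed schema name. */
    static String toSchemaName(UUID schemaVersionId) {
        return PREFIX + schemaVersionId;
    }

    /** Registry side: return the UUID if the name carries the prefix, otherwise empty. */
    static Optional<UUID> toSchemaVersionId(String schemaName) {
        if (schemaName == null || !schemaName.startsWith(PREFIX)) {
            return Optional.empty(); // a regular schema name, resolve it by name as before
        }
        try {
            return Optional.of(UUID.fromString(schemaName.substring(PREFIX.length())));
        } catch (IllegalArgumentException e) {
            return Optional.empty(); // prefixed but not a valid UUID, treat as a regular name
        }
    }
}
```

Existing users who pass plain schema names would be unaffected, because names without the prefix fall through to the current lookup. The risk of a name conflict depends on whether a real schema name can start with the chosen prefix, so the prefix should be checked against the Glue naming rules.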
h2. Support for compression

In the header of the Glue message there is a byte describing whether the 
message is compressed. If so, the data following the header has to be 
decompressed before further processing. My understanding of NiFi does not give 
me confidence that this can be easily addressed with the current architecture.

How to continue?
 # Divide into smaller problems
For now, don't support compressed messages.
 # Other
Open for suggestions
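
If option 1 is chosen, the reader could simply reject compressed messages up front. A minimal sketch follows, with hypothetical names. It assumes that a compression byte of 0 means "not compressed", which needs to be confirmed against the Glue serializer, and it takes the byte position from the header layout described earlier.

```java
public class GlueCompressionGuardSketch {
    // Assumption: a compression byte of 0 marks an uncompressed payload.
    static final byte UNCOMPRESSED = 0;
    // The compression byte is the second header byte, per the layout in the description.
    static final int COMPRESSION_BYTE_OFFSET = 1;

    /** Throws if the message header indicates a compressed payload. */
    static void requireUncompressed(byte[] message) {
        if (message == null || message.length <= COMPRESSION_BYTE_OFFSET) {
            throw new IllegalArgumentException("Message is too short to contain a Glue header");
        }
        byte compression = message[COMPRESSION_BYTE_OFFSET];
        if (compression != UNCOMPRESSED) {
            throw new UnsupportedOperationException(
                    "Compressed Glue messages are not supported (compression byte " + compression + ")");
        }
    }
}
```

Failing fast with a clear error keeps the first iteration small and leaves room to add decompression later without changing the contract for uncompressed messages.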

  was:
h1. Context

Using an (Avro)Reader with Schema Access Strategy = Schema Reference Reader, 
which means the schema reference is retrieved from the message itself, requires 
both a Schema Registry and a corresponding Schema Reference Reader.

There is an available ConsumeKinesisStream processor. The Kinesis messages can 
use Amazon Glue Schema Registry. There is an AmazonGlueSchemaRegistry service, 
but there is no corresponding AmazonGlueEncodedSchemaReferenceReader service. 
As a result, Glue-encoded Kinesis messages cannot be read "the NiFi way".

By comparison, there are services ConfluentSchemaRegistry and 
ConfluentEncodedSchemaReferenceReader. 
h1. Solution

Introduce AmazonGlueEncodedSchemaReferenceReader.
h1. Discussion Points
h2. Schema Version ID vs SchemaIdentifier

The SchemaReferenceReader contract requires returning a SchemaIdentifier.
This is somewhat problematic because the Glue encoded message header contains a 
Header Version byte, a Header Compression byte, and the UUID (16 bytes) of the 
"Schema Version ID", which is a unique, opaque identifier of both the schema 
name and its version. The Amazon API allows retrieving a schema by either 
(Schema Name + optionally Schema Version) or the Schema Version ID, but not 
both. 
The UUID does not fit well into the SchemaIdentifier fields: name (string), 
version ID (long), identifier (long), version (int), branch (string).

How to continue?
 # Pass the UUID in the Branch field.
No other SchemaIdentifier field would be populated. Since this Reference Reader 
is expected to be used only with AmazonGlueSchemaRegistry, we can introduce 
specialized logic there that applies when the branch is available. The current 
implementation does not use the branch at all but expects the Name field to 
always be available.
 # Pass the UUID in the Name field.
This feels less hacky than using the Branch field. However, it would need some 
kind of custom prefix/suffix in the passed value, so that 
AmazonGlueSchemaRegistry knows it should handle this particular name as a UUID. 
This approach may introduce schema name conflicts.
 # Introduce yet another SchemaIdentifier field.
Inspiration for the naming is needed.

Personally I would lean to option 1. because it would not introduce any 
breaking changes to existing users and does not require potentially cascading 
changes due to no changes in SchemaIdentifier. Then option 2.
h2. Support for compression

In the header of the Glue message there is a byte describing whether the 
message is compressed. If so, the data following the header has to be 
decompressed before further processing. My understanding of NiFi does not give 
me confidence that this can be easily addressed with the current architecture.

How to continue?
 # Divide into smaller problems
For now, don't support compressed messages.
 # Other
Open for suggestions


> Amazon Glue message deserialization
> -----------------------------------
>
>                 Key: NIFI-14628
>                 URL: https://issues.apache.org/jira/browse/NIFI-14628
>             Project: Apache NiFi
>          Issue Type: New Feature
>          Components: Extensions
>    Affects Versions: 2.4.0
>            Reporter: Dariusz Seweryn
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
