Matt Burgess created NIFI-7989:
----------------------------------

             Summary: Add Hive "data drift" processor
                 Key: NIFI-7989
                 URL: https://issues.apache.org/jira/browse/NIFI-7989
             Project: Apache NiFi
          Issue Type: New Feature
          Components: Extensions
            Reporter: Matt Burgess


It would be nice to have a Hive processor (one for each Hive NAR) that could 
check an incoming record-based flowfile against a destination table and add any 
missing columns and/or partition values, or even create the table if it does 
not exist. Such a processor could be used in a flow where the incoming data's 
schema can change and we want to be able to write it to a Hive table, 
preferably by using PutHDFS, PutParquet, or PutORC to place it directly where 
it can be queried.
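
To illustrate the check described above, here is a minimal sketch in Python 
(not NiFi code). The function name plan_schema_changes and the dict-based 
schema representation are hypothetical, and the sketch assumes Hive column 
names can be compared case-insensitively.

```python
from typing import Dict, List, Optional, Tuple


def plan_schema_changes(
    record_fields: Dict[str, str],
    table_columns: Optional[Dict[str, str]],
) -> Tuple[str, List[Tuple[str, str]]]:
    """Decide what must happen so the table can hold the incoming record.

    record_fields: field name -> Hive type, taken from the record schema.
    table_columns: column name -> Hive type, or None if the table is missing.
    Returns ("create" | "alter" | "none", columns to create or add).
    """
    if table_columns is None:
        # Table does not exist: every record field becomes a column.
        return "create", list(record_fields.items())

    # Assumption: column names are matched case-insensitively.
    existing = {name.lower() for name in table_columns}
    missing = [
        (name, hive_type)
        for name, hive_type in record_fields.items()
        if name.lower() not in existing
    ]
    return ("alter" if missing else "none"), missing
```

A "create" or "alter" result would then drive the DDL the processor issues; 
"none" means the flowfile can pass through unchanged.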

Such a processor should be able to use a HiveConnectionPool to execute any DDL 
(e.g., ALTER TABLE ... ADD COLUMNS) necessary to make the table match the 
incoming data. Partition values could be provided via a property that supports 
Expression Language, in which case an ALTER TABLE would be issued to add the 
partition directory.
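
As a rough sketch of the DDL involved (Python used only for illustration), the 
statements could be built as below. The helper names are hypothetical, the 
table name is assumed to be unqualified (no database prefix), and partition 
values are assumed to be strings that have already been evaluated from the 
property.

```python
from typing import Dict, List, Tuple


def quote_ident(name: str) -> str:
    # Backtick-quote an identifier, doubling any embedded backticks.
    return "`" + name.replace("`", "``") + "`"


def quote_literal(value: str) -> str:
    # Single-quote a string literal, escaping backslashes and quotes.
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


def add_columns_ddl(table: str, columns: List[Tuple[str, str]]) -> str:
    # columns: (name, Hive type) pairs missing from the table.
    cols = ", ".join(quote_ident(n) + " " + t for n, t in columns)
    return "ALTER TABLE " + quote_ident(table) + " ADD COLUMNS (" + cols + ")"


def add_partition_ddl(table: str, partition_values: Dict[str, str]) -> str:
    # partition_values: partition column -> value, in partition-key order.
    spec = ", ".join(
        quote_ident(k) + "=" + quote_literal(v)
        for k, v in partition_values.items()
    )
    return (
        "ALTER TABLE " + quote_ident(table)
        + " ADD IF NOT EXISTS PARTITION (" + spec + ")"
    )
```

For example, add_partition_ddl("users", {"year": "2020"}) yields 
ALTER TABLE `users` ADD IF NOT EXISTS PARTITION (`year`='2020'). The 
IF NOT EXISTS clause is a choice made for this sketch so that repeated 
flowfiles for the same partition do not fail.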

Regardless of whether the table is created or updated, and whether or not 
there are partition values to consider, an attribute should be written to the 
outgoing flowfile corresponding to the location of the table (and any 
associated partitions). This supports the idea of having a flow that updates a 
Hive table based on the incoming data, and then allows the user to put the 
flowfile directly into the destination location (PutHDFS, e.g.) instead of 
having to load it using HiveQL or being subject to the restrictions of Hive 
Streaming tables (ORC-backed, transactional, etc.).
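
One way the location attribute's value might be derived is sketched below in 
Python. The function name partition_location is hypothetical, and the sketch 
assumes the conventional key=value partition directory layout under the table 
location, with no escaping of special characters in partition values.

```python
from typing import Dict


def partition_location(table_location: str, partition_values: Dict[str, str]) -> str:
    """Build the directory a downstream PutHDFS-style processor would write to.

    table_location: the table's storage location, e.g. as reported by Hive.
    partition_values: partition column -> value, in partition-key order.
    """
    # Assumption: partitions live in key=value subdirectories of the table.
    parts = [k + "=" + v for k, v in partition_values.items()]
    return "/".join([table_location.rstrip("/")] + parts)
```

With no partition values the table location itself is returned, so the same 
attribute could be used for both partitioned and unpartitioned tables.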



