joshuagrisham opened a new pull request #11523:
URL: https://github.com/apache/kafka/pull/11523


   > Note this PR is copied from #9492 where I mistakenly did a rebase on a 
very old change which added thousands of commits, so this is a second attempt 
to get a very clean PR for this change instead. 
   
   I have made an update to **TimestampConverter** Connect transform to address 
the main issues that I logged in 
[KAFKA-10627](https://issues.apache.org/jira/browse/KAFKA-10627).
   
   Namely, that it now ...
   
   - supports multiple fields via a new configuration parameter `fields` as a 
comma-separated list of field names. The old parameter `field` is still 
supported for compatibility; its value is translated to the new parameter.
   - supports a DateTimeFormatter-compatible pattern string that can support 
multiple timestamp formats for parsing input of string values to whatever 
target you configure (e.g. parsing strings to Timestamp type).
      - `format` config is now split into two: `format.input` and 
`format.output` but you can still just send `format` by itself if you do not 
need to use a more complicated input pattern. When providing only `format`, the 
string pattern which you provide will be used for both `format.input` and 
`format.output`.
   
   I realized that Kafka uses `java.util.Date` everywhere as part of 
its core types (including in Schemas, values, etc).  In theory it would be good 
to move to the `java.time` classes over time, but on first reflection that 
seems like quite a big overhaul.
   
   So instead I focused on the specific problem at hand: parsing strings into 
`Date` where the strings can come in different formats.  For this part alone 
I switched to `DateTimeFormatter`, so we can use multiple patterns to match 
input strings and then convert the result to a `java.util.Date`.
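
As a rough illustration of that idea (not the code in this PR), the sketch below parses several string shapes with one `DateTimeFormatter` and converts the result to `java.util.Date`. The class name, the simplified pattern, and the choice to treat zone-less input as UTC are all assumptions made for the example.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.temporal.ChronoField;
import java.util.Date;

// Hypothetical helper, for illustration only.
final class MultiFormatParseSketch {
    // Optional sections ([...]) let one formatter accept several input shapes.
    // This simplified pattern is an assumption, not the one used in the PR.
    private static final DateTimeFormatter INPUT = new DateTimeFormatterBuilder()
            .appendPattern("yyyy-MM-dd[['T'][ ]HH:mm:ss[.SSS]]")
            // Default the time fields so date-only input still resolves fully.
            .parseDefaulting(ChronoField.HOUR_OF_DAY, 0)
            .parseDefaulting(ChronoField.MINUTE_OF_HOUR, 0)
            .parseDefaulting(ChronoField.SECOND_OF_MINUTE, 0)
            .toFormatter();

    // Assumption: input without a zone is interpreted as UTC.
    static Date parse(String text) {
        LocalDateTime parsed = LocalDateTime.parse(text, INPUT);
        return Date.from(parsed.toInstant(ZoneOffset.UTC));
    }

    public static void main(String[] args) {
        String[] samples = {
            "2020-10-21", "2020-10-21 12:34:56", "2020-10-21T12:34:56.789"
        };
        for (String sample : samples) {
            // Print as an Instant so the output does not depend on the local zone.
            System.out.println(sample + " -> " + parse(sample).toInstant());
        }
    }
}
```

Input that matches none of the alternatives raises a `DateTimeParseException`, so a caller would still need to decide how to handle unparseable values.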
   
   I also updated some of the way the Config parameters and values work, to 
bring them in line with the other classes, similar to what I did in #9470.
   
   #### String Input and Output Timestamp Format updates
   
   Because input formats now allow multiple alternatives via pattern matching, 
the same pattern cannot serve as the output format when converting a Timestamp 
to a String (which is another capability of this transform).  So I have 
changed the configuration a bit; there are now three parameters:
   
   - `format` which is the original one. You can still use this one, and it 
will set both input (parsing) and output (Date/Timestamp to string format) 
based on this format.
   - `format.input` is a new parameter, where you can specify a 
DateTimeFormatter-compatible pattern string that supports multiple different 
formats in case you have a mix in your data.  As one example, you can now use 
something like this as `format.input` and it will catch many of the variations 
you might see in one timestamp field: 
`"[yyyy-MM-dd[['T'][ ]HH:mm:ss[.SSSSSSSz][.SSS[XXX][X]]]]"`
   - `format.output` is a new parameter which only controls the output of a 
Date/Timestamp to target type of `string`. This works the same as before and 
still uses `SimpleDateFormat` to create the output string; it is just 
controlled by a separate parameter now.
   
   I also added some code which checks the value of each of these three.  
Basically it forces you to use either `format`, or one or both of the new 
parameters -- you cannot mix the old and new together.  In the end, 
`format.input` and `format.output` are the ones used in the rest of the logic, 
but the code first compares `format` against these values and sets the value 
for both of the new parameters depending on what was sent in the config.
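
The rule described above could be sketched roughly as follows. This is only an approximation of the behavior, not the PR's implementation: the class and method names are hypothetical, and `IllegalArgumentException` stands in for whatever exception the transform actually raises.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, for illustration only.
final class FormatConfigSketch {
    // Returns the effective format.input / format.output values.
    static Map<String, String> resolveFormats(Map<String, String> props) {
        String format = props.get("format");
        String input = props.get("format.input");
        String output = props.get("format.output");

        boolean hasOld = isSet(format);
        boolean hasNew = isSet(input) || isSet(output);
        if (hasOld && hasNew) {
            // Stand-in exception type; the old and new parameters cannot be mixed.
            throw new IllegalArgumentException(
                "Use either 'format' or 'format.input'/'format.output', not both");
        }

        Map<String, String> resolved = new HashMap<>();
        if (hasOld) {
            // The legacy parameter drives both parsing and output.
            resolved.put("format.input", format);
            resolved.put("format.output", format);
        } else {
            if (isSet(input)) resolved.put("format.input", input);
            if (isSet(output)) resolved.put("format.output", output);
        }
        return resolved;
    }

    private static boolean isSet(String value) {
        return value != null && !value.trim().isEmpty();
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        props.put("format", "yyyy-MM-dd");
        System.out.println(resolveFormats(props));
    }
}
```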
   
   #### Support for multiple fields instead of one single field
   
   I renamed the `field` parameter to `fields`, which now supports 
multiple values as a comma-separated list.  I used the new 
`ConfigUtils.translateDeprecatedConfigs` method to provide automatic 
translation of the old parameter to the new one as well.
   
   With this change I also updated the `apply` methods so that they loop 
through each field and check against the list of `fields`.  Now you can specify 
a comma-separated list of multiple fields to have the same input format/output 
type applied.
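
To make the multi-field behavior concrete, here is a minimal sketch for a schemaless (`Map`) record value. It is not the PR's code: the class and method names are hypothetical, and the hand-rolled fallback from `field` to `fields` only stands in for the translation that `ConfigUtils.translateDeprecatedConfigs` performs in the real change.

```java
import java.util.ArrayList;
import java.util.Date;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper, for illustration only.
final class MultiFieldSketch {
    // Reads the comma-separated `fields` value, falling back to legacy `field`.
    static List<String> resolveFields(Map<String, String> props) {
        String raw = props.get("fields");
        if (raw == null || raw.trim().isEmpty()) {
            raw = props.get("field");
        }
        List<String> names = new ArrayList<>();
        if (raw != null) {
            for (String name : raw.split(",")) {
                String trimmed = name.trim();
                if (!trimmed.isEmpty()) names.add(trimmed);
            }
        }
        return names;
    }

    // Loops over the record's entries and converts only the configured fields.
    static Map<String, Object> convertFields(Map<String, Object> value,
                                             List<String> fields,
                                             Function<Object, Object> convert) {
        Map<String, Object> updated = new HashMap<>(value);
        for (Map.Entry<String, Object> entry : value.entrySet()) {
            if (fields.contains(entry.getKey())) {
                updated.put(entry.getKey(), convert.apply(entry.getValue()));
            }
        }
        return updated;
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        props.put("fields", "created, updated");

        Map<String, Object> value = new HashMap<>();
        value.put("created", 0L);
        value.put("updated", 1000L);
        value.put("name", "example");

        // Example conversion: epoch milliseconds to java.util.Date.
        Map<String, Object> out =
            convertFields(value, resolveFields(props), v -> new Date((Long) v));
        System.out.println(out.get("updated").getClass().getSimpleName());
    }
}
```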
   
   Unit tests have been added for both new updates (string formatting and 
multiple field support).
   
   While looking at this I realized that it might be good to add 
`recursive` support similar to what I did in #9470, but I guess that can 
come another day!
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

