Just wondering if this will help you:

https://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/time/DateUtils.html#parseDate-java.lang.String-java.util.Locale-java.lang.String...-

You'll still need to extract the candidate date strings, but once you have them, 
this can try parsing each one against a list of formats.
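
A rough Groovy sketch (untested; the patterns below are only examples, swap in 
whatever candidates you actually expect):

    @Grab('org.apache.commons:commons-lang3:3.12.0')
    import org.apache.commons.lang3.time.DateUtils

    // Patterns are tried in order until one parses; parseDateStrictly()
    // avoids lenient rollover if the 8-digit forms turn out to be ambiguous.
    String[] patterns = ['yyyy-MM-dd', 'yyyy/MM/dd', 'yyyyMMdd', 'MMddyyyy']

    ['2023-06-13', '2023/06/13', '20230613'].each { candidate ->
        try {
            println "$candidate -> ${DateUtils.parseDate(candidate, patterns)}"
        } catch (java.text.ParseException e) {
            println "$candidate -> no pattern matched"
        }
    }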

Perhaps...since we are now in "The Age of AI" (:-)), you could use Apache 
OpenNLP, per this:

https://stackoverflow.com/questions/27182040/how-to-detect-dates-with-opennlp

I've used NLP in other situations...it's not popular but it does the job nicely.
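
If you go that route, it would look something like this (untested sketch; it 
assumes you've downloaded the pre-trained en-ner-date.bin model that the Stack 
Overflow answer points to, and the 1.5-series models should still load with the 
1.9.x tools):

    @Grab('org.apache.opennlp:opennlp-tools:1.9.4')
    import opennlp.tools.namefind.NameFinderME
    import opennlp.tools.namefind.TokenNameFinderModel
    import opennlp.tools.tokenize.SimpleTokenizer

    // load the pre-trained English date NER model
    def model  = new TokenNameFinderModel(new File('en-ner-date.bin'))
    def finder = new NameFinderME(model)

    def tokens = SimpleTokenizer.INSTANCE.tokenize(
            'The report was filed on June 13, 2023 and updated 2023-06-14.')
    finder.find(tokens).each { span ->
        println tokens[span.start..<span.end].join(' ')
    }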

A bit more of a general discussion:

https://www.baeldung.com/cs/finding-dates-addresses-in-emails

Hope this helps.

BOB

________________________________
From: Jochen Theodorou <blackd...@gmx.org>
Sent: Wednesday, 14 June 2023 4:42 AM
To: users@groovy.apache.org <users@groovy.apache.org>
Subject: Re: Existing resources to seek date patterns from raw data and 
normalize them

On 13.06.23 16:52, James McMahon wrote:
> Hello.  I have a task to parse dates out of incoming raw content. Of
> course the date patterns can assume any number of forms -   YYYY-MM-DD,
> YYYY/MM/DD, YYYYMMDD, MMDDYYYY, etc etc etc. I can build myself a robust
> regex to match a broad set of such patterns in the raw data, but I
> wonder if there is a project or library available for Groovy that
> already offers this?

I've always wanted to try
https://github.com/joestelmach/natty/tree/master or at least
https://github.com/sisyphsu/dateparser... but never got around to it ;)
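
From a quick look it should be roughly this (untested sketch):

    @Grab('com.joestelmach:natty:0.13')
    import com.joestelmach.natty.Parser

    def parser = new Parser()
    // natty scans free text and returns one DateGroup per date expression it finds
    parser.parse('meet on 2023-06-13 or the day after tomorrow').each { group ->
        println "${group.text} -> ${group.dates}"
    }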

> Assuming I get pattern matches parsed out of my raw data, I will have a
> collection of strings representing year-month-days in a variety of
> formats. I'd then like to normalize them to a standard form so that I
> can sort and compare them. I intend to identify the range of dates in
> the raw data as a sorted Groovy list.

Once the library has identified the format, this is the easy step.
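
For example, map everything onto java.time.LocalDate and let the natural 
ordering do the sorting (the formats below are only examples):

    import java.time.LocalDate
    import java.time.format.DateTimeFormatter
    import java.time.format.DateTimeParseException

    def formats = ['yyyy-MM-dd', 'yyyy/MM/dd', 'yyyyMMdd', 'MMddyyyy']
            .collect { DateTimeFormatter.ofPattern(it) }

    // return the first format that parses, or null if none do
    def normalize = { String s ->
        for (f in formats) {
            try { return LocalDate.parse(s, f) } catch (DateTimeParseException ignored) { }
        }
        null
    }

    def raw   = ['2023/06/13', '20230612', '2023-06-14']
    def dates = raw.collect(normalize).findAll().sort()
    println dates              // [2023-06-12, 2023-06-13, 2023-06-14]
    println dates*.toString()  // ISO-8601 strings, ready to compare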

  [...]
> I intend to write a Groovy script that will run from an Apache NiFi
> ExecuteScript processor. I'll read in my data flowfile content using a
> buffered reader so I can handle flowfiles that may be large.

What does large mean? 1 TB? Then a BufferedReader may not be the right
choice ;)
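
If the content is line-oriented, something like this (untested) at least streams 
it through the script instead of holding it all in memory:

    import org.apache.nifi.processor.io.InputStreamCallback
    import java.nio.charset.StandardCharsets

    def flowFile = session.get()
    if (flowFile != null) {
        session.read(flowFile, { inputStream ->
            inputStream.withReader(StandardCharsets.UTF_8.name()) { reader ->
                reader.eachLine { line ->
                    // run the date regex / parser against each line here
                }
            }
        } as InputStreamCallback)
        session.transfer(flowFile, REL_SUCCESS)
    }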

bye Jochen
