BOB, Jochen, and Paul, you've given me a lot of strong suggestions to
consider, and I wanted to let you know I'm experimenting with some of them
now. Paul, your thoughts prompted me to think more about performance. I'm
going to try this first with a Groovy script in an ExecuteScript
processor, see how the performance looks, and keep your other suggestions
in mind if it falls short. I suspect a buffered line reader approach
should suffice: my incoming files typically range from tens to hundreds of
megabytes. An occasional monster of a few gigabytes does appear, but
nothing on the order of hundreds of gigabytes or a terabyte.

I'm working now to build my regex. My first goal is to correctly parse a
wide variety of date formats, in string representation, from a test file.
I will post that here as my starting point once I get something working.
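
To make the plan concrete, here is a rough sketch of the approach: read
line by line through a buffered reader, pull out candidate strings with a
regex, and normalize the valid ones to a sortable date type. It is plain
Java rather than Groovy so that it compiles standalone. The class name
DateScan, the method names, the sample text, and the four formats are all
placeholders for illustration, and the NiFi flowfile plumbing is left out
entirely.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DateScan {
    // Candidate shapes only: YYYY-MM-DD, YYYY/MM/DD, or a bare run of exactly 8 digits.
    private static final Pattern CANDIDATE = Pattern.compile(
        "(?<!\\d)(?:\\d{4}-\\d{2}-\\d{2}|\\d{4}/\\d{2}/\\d{2}|\\d{8})(?!\\d)");

    // Tried in order (assumed priority). An 8-digit run that is valid both as
    // YYYYMMDD and as MMDDYYYY is read as YYYYMMDD.
    private static final List<DateTimeFormatter> FORMATS = List.of(
        fmt("uuuu-MM-dd"), fmt("uuuu/MM/dd"), fmt("uuuuMMdd"), fmt("MMdduuuu"));

    // STRICT resolution rejects impossible dates such as 2023-02-30.
    private static DateTimeFormatter fmt(String pattern) {
        return DateTimeFormatter.ofPattern(pattern).withResolverStyle(ResolverStyle.STRICT);
    }

    // Returns the parsed date, or null if the text matches no format as a valid date.
    public static LocalDate normalize(String text) {
        for (DateTimeFormatter f : FORMATS) {
            try {
                return LocalDate.parse(text, f);
            } catch (DateTimeParseException e) {
                // Not this format; try the next one.
            }
        }
        return null;
    }

    // Streams the input one line at a time; only the matched dates are kept in memory.
    public static List<LocalDate> extractDates(BufferedReader reader) throws IOException {
        List<LocalDate> found = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            Matcher m = CANDIDATE.matcher(line);
            while (m.find()) {
                LocalDate d = normalize(m.group());
                if (d != null) {
                    found.add(d);
                }
            }
        }
        Collections.sort(found); // LocalDate sorts chronologically
        return found;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample content standing in for a flowfile.
        String sample = "shipped 2023-06-14, invoiced 2023/01/05\n"
                      + "legacy stamp 06132023 and 20221231\n"
                      + "bogus 2023-02-30\n";
        List<LocalDate> dates = extractDates(new BufferedReader(new StringReader(sample)));
        System.out.println(dates);
        if (!dates.isEmpty()) {
            System.out.println("range: " + dates.get(0) + " to " + dates.get(dates.size() - 1));
        }
    }
}
```

The first and last entries of the sorted list give the date range. The
obvious weak spot is the bare 8-digit case, which will also pick up
numbers that merely happen to look like dates.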

Jim

On Wed, Jun 14, 2023 at 5:18 AM Bob Brown <b...@transentia.com.au> wrote:

> Just wondering if this will help you:
>
>
> https://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/time/DateUtils.html#parseDate-java.lang.String-java.util.Locale-java.lang.String...-
>
> You'll still need to extract the candidate date strings but once you have
> them, this can parse them using various formats.
>
> Perhaps...since we are now in "The Age of AI" (:-)), you could use Apache
> OpenNLP, per this:
>
>
> https://stackoverflow.com/questions/27182040/how-to-detect-dates-with-opennlp
>
> I've used NLP in other situations...it's not popular but it does the job
> nicely.
>
> A bit more of a general discussion:
>
> https://www.baeldung.com/cs/finding-dates-addresses-in-emails
>
> Hope this helps.
>
> BOB
>
> ------------------------------
> *From:* Jochen Theodorou <blackd...@gmx.org>
> *Sent:* Wednesday, 14 June 2023 4:42 AM
> *To:* users@groovy.apache.org <users@groovy.apache.org>
> *Subject:* Re: Existing resources to seek date patterns from raw data and
> normalize them
>
> On 13.06.23 16:52, James McMahon wrote:
> > Hello.  I have a task to parse dates out of incoming raw content. Of
> > course the date patterns can assume any number of forms -   YYYY-MM-DD,
> > YYYY/MM/DD, YYYYMMDD, MMDDYYYY, etc etc etc. I can build myself a robust
> > regex to match a broad set of such patterns in the raw data, but I
> > wonder if there is a project or library available for Groovy that
> > already offers this?
>
> I always wanted to try
> https://github.com/joestelmach/natty/tree/master or at least
> https://github.com/sisyphsu/dateparser... never got around to it ;)
>
> > Assuming I get pattern matches parsed out of my raw data, I will have a
> > collection of strings representing year-month-days in a variety of
> > formats. I'd then like to normalize them to a standard form so that I
> > can sort and compare them. I intend to identify the range of dates in
> > the raw data as a sorted Groovy list.
>
> Once the library has identified the format, this is the easy step.
>
>   [...]
> > I intend to write a Groovy script that will run from an Apache NiFi
> > ExecuteScript processor. I'll read in my data flowfile content using a
> > buffered reader so I can handle flowfiles that may be large.
>
> what does large mean? 1TB? Then BufferedReader may not be the right
> choice ;)
>
> bye Jochen
>
