Hi James,
You have an interesting problem.
I too will be interested to see if a project library exists to meet your specifications.

Earlier this year I built a legal regulatory text parser that corrected formatting/logic errors and then transformed the text into a "normalized" form for further processing.
My lexical analyzer (lexan), parser, and post-processor were not "scripts" but a number of Groovy class files that were all components of a 100% Groovy program.
I ran the main-program file from the Linux command-line shell "bash"; all supporting classes were picked up from the CLASSPATH.
I tagged all classes CompileStatic so that I could get Slf4j/logback to work right (showing the correct code line numbers in the log).
CompileStatic-groovy takes a while to get used to but in the end I can now work with such code easily.
I greatly value my detailed logs that identify code line numbers when debugging my >10K-line code base while processing my large datasets.
I used the latest Java and Groovy and it all worked like a champ.
In my opinion, Groovy is a great choice for regex processing because regexes are first-class citizens in the language.
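
For anyone who has not used Groovy's regex support, here is a minimal sketch of what I mean. The pattern and the sample text are made up for illustration only.

```groovy
import java.util.regex.Pattern

// A slashy string avoids double-escaping backslashes; ~ turns it into a compiled Pattern.
Pattern isoDate = ~/\d{4}-\d{2}-\d{2}/

// ==~ is true only when the whole string matches the pattern.
assert '2023-06-13' ==~ isoDate
assert !('due 2023-06-13' ==~ isoDate)

// findAll collects every occurrence found anywhere in the text.
List<String> found = 'filed 2023-06-13, amended 2023-07-01'.findAll(isoDate)
assert found == ['2023-06-13', '2023-07-01']
```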

Observations in case no project library shows up to meet your needs:
- You will likely need a lexan to generate a stream of tokens where some will be your various date forms.
- Split your big regex into one piece per date form, assign each piece to a field, regex-OR the fields together into a super-search-regex, and pre-compile that super-search-regex.
  Example: String ReDate1=/.../; String ReDate2=/.../; String ReDate3=/.../; etc
- Once you determine that a token matches one of your known date patterns, you can then categorize it with a Groovy switch statement that uses the individual field values.
  Example:
  switch ( dateTokenString ) {
     case ~ReDate1 : /* normalize here */ break
     case ~ReDate2 : /* normalize here */ break
     ...
     default: /* squawk about an unexpected date form, showing context, and exit */
  }
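
The pieces above can be put together roughly as follows. This is only a sketch under assumptions: the class name, the field names, and the three patterns are hypothetical, the target form YYYY-MM-DD is my choice, an 8-digit token is assumed to be YYYYMMDD (it cannot be told apart from MMDDYYYY by shape alone), and the default branch throws an exception rather than exiting.

```groovy
import groovy.transform.CompileStatic
import java.util.regex.Matcher
import java.util.regex.Pattern

@CompileStatic
class DateNormalizer {
    // Hypothetical per-form patterns; a real set would grow as test data reveals new forms.
    // YYYY-MM-DD
    static final String RE_ISO_DASH = /\d{4}-\d{2}-\d{2}/
    // YYYY/MM/DD
    static final String RE_ISO_SLASH = /\d{4}\/\d{2}\/\d{2}/
    // YYYYMMDD (assumption: never MMDDYYYY)
    static final String RE_COMPACT = /\d{8}/

    // The pieces are ORed together and compiled once; \b keeps matches off longer digit runs.
    static final Pattern ANY_DATE =
        Pattern.compile('\\b(?:' + [RE_ISO_DASH, RE_ISO_SLASH, RE_COMPACT].join('|') + ')\\b')

    // Categorizes one token by its individual pattern and rewrites it as YYYY-MM-DD.
    static String normalize(String token) {
        switch (token) {
            case ~RE_ISO_DASH:
                return token
            case ~RE_ISO_SLASH:
                return token.replace('/', '-')
            case ~RE_COMPACT:
                return token.substring(0, 4) + '-' + token.substring(4, 6) + '-' + token.substring(6)
            default:
                throw new IllegalArgumentException('Unexpected date form: ' + token)
        }
    }

    // Finds every date-like token in the text and returns the normalized forms, sorted.
    static List<String> extractDates(String text) {
        List<String> out = []
        Matcher m = ANY_DATE.matcher(text)
        while (m.find()) {
            out << normalize(m.group())
        }
        return out.sort()
    }
}

String sample = 'Filed 2023/06/13, amended 20230701; original dated 2022-12-31.'
List<String> dates = DateNormalizer.extractDates(sample)
assert dates == ['2022-12-31', '2023-06-13', '2023-07-01']
```

Because every form ends up as YYYY-MM-DD, a plain string sort is also a chronological sort, so the first and last elements give the date range.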

Also, I have been able to mix CompileStatic with CompileDynamic (traditional Groovy) code and it worked OK.
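
A minimal sketch of that kind of mixing, with a made-up class and method names: the class is statically compiled, and one method opts back into dynamic dispatch.

```groovy
import groovy.transform.CompileDynamic
import groovy.transform.CompileStatic

@CompileStatic
class TokenStats {
    // Statically compiled: method calls and types are checked at compile time.
    static int countDigits(String token) {
        int n = 0
        for (int i = 0; i < token.length(); i++) {
            if (Character.isDigit(token.charAt(i))) n++
        }
        return n
    }

    // This one method goes back to traditional dynamic Groovy.
    @CompileDynamic
    static String describe(Object thing) {
        // size() is looked up at runtime, so any argument that has a size() works here;
        // under @CompileStatic this call on a plain Object would not compile.
        return thing.getClass().simpleName + ' of size ' + thing.size()
    }
}

assert TokenStats.countDigits('2023-06-13') == 8
assert TokenStats.describe([1, 2, 3]) == 'ArrayList of size 3'
assert TokenStats.describe('abc') == 'String of size 3'
```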

- If you will be running a "groovy script" many times from Apache NiFi you will likely be disappointed with performance.
  Running "groovy SomeScript arg1 arg2 arg3" is likely to take more than 1 second to start up on each invocation because Groovy needs to compile the script and then run it each time.
  There are some strategies to mitigate this, but if you need to run your script thousands of times you may want to set up your date-normalizer program/library as a network-connected coprocess and then interact with it from NiFi.
  Better still, find a way to integrate your date-normalizer into NiFi.
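
To make the coprocess idea concrete, here is a rough sketch using plain JDK sockets. Everything in it is an assumption for illustration: the one-token-per-line protocol, the placeholder normalizeToken method, and the use of an OS-assigned port. The client half only stands in for whatever the NiFi side would do; it does not use any NiFi API.

```groovy
import java.nio.charset.StandardCharsets

// Placeholder normalizer (assumption): handles only YYYY/MM/DD and YYYYMMDD.
String normalizeToken(String token) {
    if (token ==~ /\d{4}\/\d{2}\/\d{2}/) return token.replace('/', '-')
    if (token ==~ /\d{8}/) {
        return token.substring(0, 4) + '-' + token.substring(4, 6) + '-' + token.substring(6)
    }
    return 'ERROR unrecognized date form: ' + token
}

// Serves one connection: reads a token per line and writes one reply line for each.
def handle = { Socket s ->
    try {
        def reader = new BufferedReader(new InputStreamReader(s.inputStream, StandardCharsets.UTF_8))
        def writer = new PrintWriter(new OutputStreamWriter(s.outputStream, StandardCharsets.UTF_8), true)
        String line
        while ((line = reader.readLine()) != null) {
            writer.println(normalizeToken(line.trim()))
        }
    } finally {
        s.close()
    }
}

// Port 0 asks the OS for any free port; a real deployment would use a fixed, configured port.
ServerSocket server = new ServerSocket(0)
Thread.startDaemon {
    while (!server.closed) {
        try {
            Socket client = server.accept()
            // curry binds this particular socket to the handler before the thread starts.
            Thread.startDaemon(handle.curry(client))
        } catch (SocketException ignored) {
            break   // the server socket was closed
        }
    }
}

// Client side: the JVM and the compiled code stay warm, so each request costs only a round trip.
List<String> replies = []
Socket socket = new Socket(InetAddress.getLoopbackAddress(), server.localPort)
try {
    def toServer = new PrintWriter(new OutputStreamWriter(socket.outputStream, StandardCharsets.UTF_8), true)
    def fromServer = new BufferedReader(new InputStreamReader(socket.inputStream, StandardCharsets.UTF_8))
    ['2023/06/13', '20230701', 'June 13'].each { String token ->
        toServer.println(token)
        replies << fromServer.readLine()
    }
} finally {
    socket.close()
    server.close()
}
assert replies == ['2023-06-13', '2023-07-01', 'ERROR unrecognized date form: June 13']
```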

Just a few thoughts...


On 6/13/2023 7:52 AM, James McMahon wrote:
Hello. I have a task to parse dates out of incoming raw content. Of course the date patterns can assume any number of forms - YYYY-MM-DD, YYYY/MM/DD, YYYYMMDD, MMDDYYYY, etc. I can build myself a robust regex to match a broad set of such patterns in the raw data, but I wonder if there is a project or library available for Groovy that already offers this?

Assuming I get pattern matches parsed out of my raw data, I will have a collection of strings representing year-month-days in a variety of formats. I'd then like to normalize them to a standard form so that I can sort and compare them. I intend to identify the range of dates in the raw data as a sorted Groovy list.

I anticipate I will miss many pattern variations with my initial cut at this. I do have one thing going for me: as I test through volumes of raw data, I'll be able to improve the pattern net I cast to catch an ever-improving percentage of year-month-day expressions.

I intend to write a Groovy script that will run from an Apache NiFi ExecuteScript processor. I'll read in my data flowfile content using a buffered reader so I can handle flowfiles that may be large.

Any recommendations or suggestions?


--
-
====================================================================
Paul                                              
EMAIL: pb...@s218777419.onlinehome.us    
====================================================================


 



