Thank you Bob, that did work for me.
Some Java syntax is new to me - like this .map(Path::toFile). Back to school
again.
This is standard Java, which is pretty groovy already, but I wonder if this
could be (or already has been) groovy-ised in some way, e.g. to simplify the
Files.walk(..).collect(..).parallelStream() chain.
I put the filter before the collect, on the assumption that it is more
efficient to skip unwanted files before feeding them to the parallel
processing.
In the following snippet I include a processedCount counter - and although this
works, I am aware that mutating state outside of the parallel closure can be
bad.
import java.nio.file.*
import java.util.regex.Pattern
import java.util.stream.*

long scanFolder (File directory, Pattern fileMatch)
{
    long processedCount = 0
    Files.walk(directory.toPath(), 1)   // just walk the current directory, not subdirectories
        .filter(p -> Files.isRegularFile(p) && p.toString().matches(fileMatch))  // skip files that do not match the regex pattern
        .collect(Collectors.toList())
        .parallelStream()
        .map(Path::toFile)
        .forEach( msgFile -> {
            <do stuff with msgFile>
            processedCount++
        } )
    return processedCount
}
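The counter concern above is real: incrementing a captured long from a parallel stream is a data race, so the total can come out short. One safe variant is an AtomicLong. A minimal sketch in plain Java (the class name and the temp-directory usage example are invented for illustration; the structure mirrors the snippet above):

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.*;
import java.util.concurrent.atomic.AtomicLong;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class ScanFolderDemo {
    // Walks only the top level of 'directory', processes files whose full path
    // matches 'fileMatch', and returns how many were processed.
    // AtomicLong makes the increment safe across the stream's worker threads.
    static long scanFolder(File directory, Pattern fileMatch) throws IOException {
        AtomicLong processedCount = new AtomicLong();
        try (var paths = Files.walk(directory.toPath(), 1)) {  // close the stream to release directory handles
            paths.filter(p -> Files.isRegularFile(p) && fileMatch.matcher(p.toString()).matches())
                 .collect(Collectors.toList())
                 .parallelStream()
                 .map(Path::toFile)
                 .forEach(msgFile -> {
                     // do stuff with msgFile here
                     processedCount.incrementAndGet();
                 });
        }
        return processedCount.get();
    }

    public static void main(String[] args) throws IOException {
        // tiny usage example against a throwaway directory
        Path dir = Files.createTempDirectory("scan");
        Files.createFile(dir.resolve("a.msg"));
        Files.createFile(dir.resolve("b.msg"));
        Files.createFile(dir.resolve("c.txt"));
        System.out.println(scanFolder(dir.toFile(), Pattern.compile(".*\\.msg")));  // prints 2
    }
}
```

If the only side effect is the count itself, replacing the forEach with `.count()` on the stream avoids shared state entirely.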
Merlin Beedell
From: Bob Brown <[email protected]>
Sent: 10 May 2022 09:19
To: [email protected]
Subject: RE: Design pattern for processing a huge directory tree of files using
GPars
If you are able to use a modern Java implementation, you can use pure-Java
streams, eg:
https://stackoverflow.com/a/66044221
///
Files.walk(Paths.get("/path/to/root/directory"))  // create a stream of paths
    .collect(Collectors.toList())                 // collect paths into a list to better parallelize
    .parallelStream()                             // process this stream on multiple threads
    .filter(Files::isRegularFile)                 // filter out any non-files (such as directories)
    .map(Path::toFile)                            // convert Path to File object
    .sorted((a, b) -> Long.compare(a.lastModified(), b.lastModified()))  // sort files by modification date
    .limit(500)                                   // limit processing to 500 files (optional)
    .forEachOrdered(f -> {
        // do processing here
        System.out.println(f);
    });
///
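Worth noting with this pattern: Files.walk returns a stream backed by open directory handles, and its Javadoc recommends closing it, typically via try-with-resources. Collecting to a list first, as above, also means the handles can be released before the parallel phase starts. A minimal sketch (class name and temp-directory setup invented for illustration):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.List;
import java.util.stream.Collectors;

public class WalkDemo {
    // Collects all regular files under 'root', closing the walk stream
    // (and its directory handles) before returning.
    static List<Path> regularFiles(Path root) throws IOException {
        try (var stream = Files.walk(root)) {  // auto-closes the underlying directory handles
            return stream.filter(Files::isRegularFile)
                         .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("walk-demo");
        Files.createFile(root.resolve("one.txt"));
        Files.createFile(root.resolve("two.txt"));

        List<Path> files = regularFiles(root);
        // stream is closed; parallel work can now run without holding handles open
        files.parallelStream().forEach(p -> System.out.println(p.getFileName()));
        System.out.println(files.size());  // prints 2
    }
}
```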
Also read:
https://www.airpair.com/java/posts/parallel-processing-of-io-based-data-with-java-streams
Hope this helps some.
BOB
From: Merlin Beedell <[email protected]>
Sent: Monday, 9 May 2022 8:12 PM
To: [email protected]<mailto:[email protected]>
Subject: Design pattern for processing a huge directory tree of files using
GPars
I am trying to process millions of files, spread over a tree of directories.
At the moment I can collect the set of top level directories into a list and
then process these in parallel using GPars with list processing (e.g.
.eachParallel).
But what would be more efficient is a parallel version of the File-handling
routines themselves. If I could write, for example:
withPool() {
    directory.eachFileMatchParallel(FILES, ~/($fileMatch)/) { aFile -> ...
then I would be a very happy bunny!
I know I could copy the list of matching files into an ArrayList and then use
withPool { filesArray.eachParallel { ... } } - but this does not seem like an
efficient solution - especially if there are several hundred thousand files in
a directory.
What design pattern(s) might be better to consider using?
Merlin Beedell