Thank you Bob, that did work for me.
Some Java syntax is new to me - like this .map(Path::toFile). Back to school
again.
This is standard Java, which is pretty groovy already, but I wonder if this
could be (or already has been) groovy-ised in some way, e.g. to simplify the
Files.walk(..).collect(..).parallelStream() chain.
I put the filter before the collect, on the assumption that it is more
efficient to skip unwanted files before feeding them to the parallel
processing.
In the following snippet I include a processedCount counter - and although this
works, I am aware that mutating state outside of the parallel closure can be
bad.
import java.nio.file.*
import java.util.regex.Pattern
import java.util.stream.*

long scanFolder (File directory, Pattern fileMatch)
{
    long processedCount = 0
    Files.walk(directory.toPath(), 1)   // just walk the current directory, not subdirectories
        .filter(p -> Files.isRegularFile(p) && p.toString().matches(fileMatch))  // skip files that do not match the regex pattern
        .collect(Collectors.toList())
        .parallelStream()
        .map(Path::toFile)
        .forEach( msgFile -> {
            <do stuff with msgFile>
            processedCount++
        } )
    return processedCount
}
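The counter concern above is real: incrementing a captured long from a parallel stream is a data race, so the total can come out short. One safe variant is an AtomicLong. A minimal sketch in plain Java (the class name and the temp-directory usage example are invented for illustration; the structure mirrors the snippet above):

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.*;
import java.util.concurrent.atomic.AtomicLong;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class ScanFolderDemo {
    // Walks only the top level of 'directory', processes files whose full path
    // matches 'fileMatch', and returns how many were processed.
    // AtomicLong makes the increment safe across the stream's worker threads.
    static long scanFolder(File directory, Pattern fileMatch) throws IOException {
        AtomicLong processedCount = new AtomicLong();
        try (var paths = Files.walk(directory.toPath(), 1)) {  // close the stream to release directory handles
            paths.filter(p -> Files.isRegularFile(p) && fileMatch.matcher(p.toString()).matches())
                 .collect(Collectors.toList())
                 .parallelStream()
                 .map(Path::toFile)
                 .forEach(msgFile -> {
                     // do stuff with msgFile here
                     processedCount.incrementAndGet();
                 });
        }
        return processedCount.get();
    }

    public static void main(String[] args) throws IOException {
        // tiny usage example against a throwaway directory
        Path dir = Files.createTempDirectory("scan");
        Files.createFile(dir.resolve("a.msg"));
        Files.createFile(dir.resolve("b.msg"));
        Files.createFile(dir.resolve("c.txt"));
        System.out.println(scanFolder(dir.toFile(), Pattern.compile(".*\\.msg")));  // prints 2
    }
}
```

If the only side effect is the count itself, replacing the forEach with `.count()` on the stream avoids shared state entirely.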
Merlin Beedell
From: Bob Brown <[email protected]>
Sent: 10 May 2022 09:19
To: [email protected]
Subject: RE: Design pattern for processing a huge directory tree of files using
GPars
If you are able to use a modern Java implementation, you can use pure-Java
streams, eg:
https://stackoverflow.com/a/66044221
///
Files.walk(Paths.get("/path/to/root/directory"))  // create a stream of paths
    .collect(Collectors.toList())                 // collect paths into a list to better parallelize
    .parallelStream()                             // process this stream on multiple threads
    .filter(Files::isRegularFile)                 // filter out any non-files (such as directories)
    .map(Path::toFile)                            // convert Path to File object
    .sorted((a, b) -> Long.compare(a.lastModified(), b.lastModified()))  // sort files by modification date
    .limit(500)                                   // limit processing to 500 files (optional)
    .forEachOrdered(f -> {
        // do processing here
        System.out.println(f);
    });
///
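Worth noting with this pattern: Files.walk returns a stream backed by open directory handles, and its Javadoc recommends closing it, typically via try-with-resources. Collecting to a list first, as above, also means the handles can be released before the parallel phase starts. A minimal sketch (class name and temp-directory setup invented for illustration):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.List;
import java.util.stream.Collectors;

public class WalkDemo {
    // Collects all regular files under 'root', closing the walk stream
    // (and its directory handles) before returning.
    static List<Path> regularFiles(Path root) throws IOException {
        try (var stream = Files.walk(root)) {  // auto-closes the underlying directory handles
            return stream.filter(Files::isRegularFile)
                         .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("walk-demo");
        Files.createFile(root.resolve("one.txt"));
        Files.createFile(root.resolve("two.txt"));

        List<Path> files = regularFiles(root);
        // stream is closed; parallel work can now run without holding handles open
        files.parallelStream().forEach(p -> System.out.println(p.getFileName()));
        System.out.println(files.size());  // prints 2
    }
}
```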
Also read:
https://www.airpair.com/java/posts/parallel-processing-of-io-based-data-with-java-streams
Hope this helps some.
BOB
From: Merlin Beedell <[email protected]>
Sent: Monday, 9 May 2022 8:12 PM
To: [email protected]<mailto:[email protected]>
Subject: Design pattern for processing a huge directory tree of files using
GPars
I am trying to process millions of files, spread over a tree of directories.
At the moment I can collect the set of top level directories into a list and
then process these in parallel using GPars with list processing (e.g.
.eachParallel).
But what would be more efficient is a parallel version of the File-handling
routines themselves. If I could write, for example:
withPool() {
    directory.eachFileMatchParallel(FILES, ~/($fileMatch)/) { aFile -> ...
then I would be a very happy bunny!
I know I could copy the list of matching files into an ArrayList and then use
withPool { filesArray.eachParallel { ... } } - but this does not seem like an
efficient solution - especially if there are several hundred thousand files in
a directory.
What design pattern(s) might be better to consider using?
Merlin Beedell