I am working on a process to compact files in S3. I read a bucket full of files, key them, pull them all into a window, and then remove older versions of each file. The files are not organized inside the bucket; they are simply named by GUID. I can iterate them using a custom Source that just does a listObjects/listNextBatchOfObjects loop and emits the ObjectKeys downstream.

The problem is that right now I have to run only one source instance at a time to ensure that I see each file exactly once. What I would like to do is run the source in parallel, with each instance picking a key prefix like 00 or A6 and using that as the prefix for listObjects. That would let me emit more filenames downstream at once. I could build some sort of process that uses a DB to track partition ownership, but I am hoping there is a better (or already implemented) solution. Any ideas? A rough sketch of what I'm picturing is below.
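To make the idea concrete, here is a minimal sketch, assuming this is a Flink job using the RichParallelSourceFunction API and the AWS SDK v1 client mentioned above. The class name, the two-character hex prefix scheme, and the assumption that the GUID keys are lowercase hex are all illustrative, not what I actually have running; each subtask just derives its own disjoint set of prefixes from its subtask index, so no external coordination (DB or otherwise) is needed.

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ListObjectsRequest;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.S3ObjectSummary;
import org.apache.flink.streaming.api.functions.source.RichParallelSourceFunction;

// Sketch: each parallel subtask owns a disjoint subset of the 256 two-character
// hex prefixes ("00".."ff"), chosen by subtask index, so no two instances ever
// list the same keys. Assumes keys are GUIDs written in lowercase hex.
public class PrefixPartitionedS3Source extends RichParallelSourceFunction<String> {

    private final String bucket;
    private volatile boolean running = true;

    public PrefixPartitionedS3Source(String bucket) {
        this.bucket = bucket;
    }

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        int subtask = getRuntimeContext().getIndexOfThisSubtask();
        int parallelism = getRuntimeContext().getNumberOfParallelSubtasks();

        // Walk all 256 prefixes; this subtask only handles the prefixes that
        // map (mod parallelism) to its own index.
        for (int p = 0; p < 256 && running; p++) {
            if (p % parallelism != subtask) {
                continue;
            }
            String prefix = String.format("%02x", p);
            ObjectListing listing = s3.listObjects(
                    new ListObjectsRequest().withBucketName(bucket).withPrefix(prefix));
            while (running) {
                for (S3ObjectSummary summary : listing.getObjectSummaries()) {
                    // Emit under the checkpoint lock, as the SourceFunction contract requires.
                    synchronized (ctx.getCheckpointLock()) {
                        ctx.collect(summary.getKey());
                    }
                }
                if (!listing.isTruncated()) {
                    break;
                }
                listing = s3.listNextBatchOfObjects(listing);
            }
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}

The nice part of a scheme like this is that prefix ownership is a pure function of subtask index and parallelism, so nothing needs to be stored anywhere; but it obviously assumes the prefix distribution is roughly uniform and that the parallelism doesn't change mid-listing. Is there something already in Flink (or a connector) that handles this kind of split assignment for me?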
-Steve