I am working on a process to do some compaction of files in S3. I read a
bucket full of files, key them, pull them all into a window, and then remove
older versions of each file. The files are not organized inside the bucket;
they are simply named by GUID. I can iterate them using a custom Source that
just does a listObjects/listNextBatchOfObjects, and the source emits the
ObjectKeys from that listing. The problem is that right now I can only have
one source instance running at a time in order to ensure that I see each
file exactly once. What I would like to do is run the source in parallel,
with each source instance picking a key prefix like 00 or A6 and using it
for its listObjects calls. That would let me emit filenames downstream in
parallel. I could build some sort of process that uses a DB to track
partition ownership, but I am hoping there is a better (or already
implemented) solution. Any ideas?
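
To make the idea concrete, here is roughly the per-subtask listing I have in
mind. This is only a sketch: the class and method names are placeholders, it
assumes two-character hex prefixes (which would need to match the case my
GUID key names actually use), and it uses the same v1 AWS SDK
listObjects/listNextBatchOfObjects calls my current source already does. The
subtask index and count would come from whatever the framework's parallel
source API exposes.

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ListObjectsRequest;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.S3ObjectSummary;

import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Placeholder name: sketch of prefix-partitioned listing for a parallel source.
public class PrefixPartitionedLister {

    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    // Assign the 256 two-character hex prefixes (00..ff) round-robin, so each
    // parallel instance owns a disjoint set and every key is listed exactly once.
    // NOTE: assumes lower-case hex; adjust if the GUID key names are upper-case.
    static List<String> prefixesFor(int subtaskIndex, int numSubtasks) {
        List<String> prefixes = new ArrayList<>();
        for (int i = 0; i < 256; i++) {
            if (i % numSubtasks == subtaskIndex) {
                prefixes.add(String.format("%02x", i));
            }
        }
        return prefixes;
    }

    // List every key under this instance's prefixes and hand each ObjectKey to
    // the emitter (in the real source this would be the downstream emit call).
    void listKeys(String bucket, int subtaskIndex, int numSubtasks, Consumer<String> emit) {
        for (String prefix : prefixesFor(subtaskIndex, numSubtasks)) {
            ObjectListing listing = s3.listObjects(
                    new ListObjectsRequest().withBucketName(bucket).withPrefix(prefix));
            while (true) {
                for (S3ObjectSummary summary : listing.getObjectSummaries()) {
                    emit.accept(summary.getKey());
                }
                if (!listing.isTruncated()) {
                    break;
                }
                listing = s3.listNextBatchOfObjects(listing);
            }
        }
    }
}

Since the 256 prefixes divide round-robin across the parallel instances,
every key would be listed by exactly one instance without any external
coordination, which is the part I would otherwise need the DB for.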

-Steve
