What is the correct way to implement a custom non-splittable file parser in
Python?

My desired end-state is: 1) use Read with a file pattern (with wildcards)
pointing to several XML files on remote storage (S3 or GCS); 2) parse each
file as a single element (XML cannot be processed line by line), resulting
in a PCollection; 3) combine all PCollections together.

I've subclassed FileBasedSource, which seems to give me everything out of
the box. However, I have a problem with compressed files.
The self.open_file(fname) method returns a file object. For non-compressed
files I can call self.open_file(fname).read(). But for compressed files I
get a missing-argument error and must provide the number of bytes to read:
self.open_file(fname).read(num_bytes).
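
One workaround I can think of is to always call read(num_bytes) in a loop
until nothing more comes back, since that form of the call is accepted in
both cases. Below is a minimal sketch of the idea, not a confirmed solution.
The helper name read_all and the chunk size are hypothetical, and I am
assuming that the object returned by open_file returns an empty bytes object
at end of file. io.BytesIO and gzip.GzipFile are only stand-ins for the
plain and compressed file objects.

```python
import gzip
import io


def read_all(file_obj, chunk_size=64 * 1024):
    """Read a file-like object to the end using only read(num_bytes).

    Hypothetical helper: assumes read(n) returns an empty bytes object
    once the end of the file has been reached.
    """
    chunks = []
    while True:
        chunk = file_obj.read(chunk_size)
        if not chunk:  # empty result is assumed to mean end of file
            break
        chunks.append(chunk)
    return b"".join(chunks)


# Stand-ins for the objects open_file might return (assumption).
xml_bytes = b"<root><item>1</item></root>"
plain_file = io.BytesIO(xml_bytes)
compressed_file = gzip.GzipFile(fileobj=io.BytesIO(gzip.compress(xml_bytes)))

plain_content = read_all(plain_file)
compressed_content = read_all(compressed_file)
```

If that assumption holds, the same helper could be called inside the
source's read logic so that each file is read whole and yielded as one
element, regardless of compression.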

Is it possible to implement a FileBasedSource that works generically for
compressed and non-compressed non-splittable files?
