Hi all,

I’m looking for a working solution for cases where it’s needed (or even 
required) to use different file system configurations (HDFS, S3, GCS) in the 
same pipeline, where the IO is based on Beam FileSystems (FileIO, TextIO, etc.). 
For example: 
- reading data from one HDFS cluster and writing results into another one which 
requires different configuration;
- reading objects from one S3 bucket and writing into another one, using 
different credentials and/or regions for each;
- we can even have a heterogeneous case, where we need to read data from HDFS 
and write results into S3, or vice versa.

Usually, in other IOs, we can do this easily via dedicated methods on Read and 
Write, like “withConfiguration()”, “withCredentialsProvider()”, etc., but 
FileSystems-based IO can only be configured through PipelineOptions, afaik. 
There was a thread about this a while ago [1] where Lukasz Cwik said that it’s 
feasible by using different schemes but, unfortunately, I haven’t managed to 
make it work on my side (neither for HDFS nor for S3).
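For reference, here is roughly what the PipelineOptions-based approach looks 
like today — a sketch using the Beam Java SDK’s HadoopFileSystemOptions (from 
the beam-sdks-java-io-hadoop-file-system module); the cluster host names are 
made up for illustration. It shows why per-read/per-write settings are awkward: 
the Hadoop configuration is attached to the pipeline as a whole, not to an 
individual transform.

```java
import java.util.Collections;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.hdfs.HadoopFileSystemOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.hadoop.conf.Configuration;

public class HdfsConfigSketch {
  public static void main(String[] args) {
    // FileSystems-based IO reads its configuration from PipelineOptions,
    // so the Hadoop configuration is global to the whole pipeline.
    HadoopFileSystemOptions options =
        PipelineOptionsFactory.as(HadoopFileSystemOptions.class);

    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://cluster-a:8020"); // hypothetical host
    options.setHdfsConfiguration(Collections.singletonList(conf));

    Pipeline p = Pipeline.create(options);

    // Both the read and the write resolve "hdfs://" against the same
    // configuration above; there is no withConfiguration()-style hook to
    // point the write at a second cluster with different settings.
    p.apply(TextIO.read().from("hdfs://cluster-a:8020/input/*"))
     .apply(TextIO.write().to("hdfs://cluster-a:8020/output/part"));

    p.run().waitUntilFinish();
  }
}
```

S3 is in the same situation: credentials and region live on the pipeline-wide 
S3 options rather than on the individual Read/Write transforms.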

So, any additional input or working solutions would be very welcome if 
someone has any. In the long term, I’d like to document this in detail since, 
I guess, this use case can be quite in demand.

[1] 
https://lists.apache.org/thread.html/bb5f98c4154cc72d097ce5b404ff0b3bcb52b7360b0834af7116883b@%3Cdev.beam.apache.org%3E
 
<https://lists.apache.org/thread.html/bb5f98c4154cc72d097ce5b404ff0b3bcb52b7360b0834af7116883b@%3Cdev.beam.apache.org%3E>
