Hi Guys,

TextIO can handle tar.gz-type doubly compressed files. See the test code
below.

    PipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline p = Pipeline.create(options);

    p.apply("ReadLines", TextIO.read().from("/dataset.tar.gz"))
        // Pass-through DoFn: emits each line unchanged.
        .apply(ParDo.of(new DoFn<String, String>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                c.output(c.element());
            }
        }))
        .apply(TextIO.write().to("/tmp/filout/outputfile"));

    p.run().waitUntilFinish();
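
For the original question (pulling the inner tar.gz files out of the outer
archive), here is a rough JDK-only sketch. It is not a Beam transform, and
NestedTarGz and extract are names made up for illustration. The gzip layer
comes off with GZIPInputStream; the tar layer underneath is parsed by hand.
It assumes plain ustar headers (no GNU long names, no pax extensions, no
ustar prefix field, no base-256 sizes) and entries small enough to hold in
memory.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;

/** Hypothetical helper (not part of Beam): pulls entries out of a .tar.gz stream. */
final class NestedTarGz {
    private static final int BLOCK = 512; // tar works in 512-byte blocks

    /** Returns entry name -> bytes for regular files whose name ends with suffix. */
    static Map<String, byte[]> extract(InputStream tarGz, String suffix) throws IOException {
        Map<String, byte[]> out = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(new GZIPInputStream(tarGz))) {
            byte[] header = new byte[BLOCK];
            while (true) {
                try {
                    in.readFully(header);
                } catch (EOFException e) {
                    break; // stream ended without the usual trailing zero blocks
                }
                if (header[0] == 0) {
                    break; // empty name: treat as the end-of-archive marker (simplified)
                }
                String name = field(header, 0, 100);
                String octal = field(header, 124, 12).trim();
                long size = octal.isEmpty() ? 0 : Long.parseLong(octal, 8);
                byte type = header[156];
                boolean regular = type == '0' || type == 0;

                // Read the entry body, then skip the padding up to the next block.
                byte[] data = new byte[Math.toIntExact(size)];
                in.readFully(data);
                int pad = (int) ((BLOCK - size % BLOCK) % BLOCK);
                in.readFully(new byte[pad]);

                if (regular && name.endsWith(suffix)) {
                    out.put(name, data);
                }
            }
        }
        return out;
    }

    /** Decodes a NUL-terminated ASCII header field. */
    private static String field(byte[] b, int off, int len) {
        int end = off;
        while (end < off + len && b[end] != 0) {
            end++;
        }
        return new String(b, off, end - off, StandardCharsets.US_ASCII);
    }
}
```

Calling NestedTarGz.extract(new FileInputStream("/dataset.tar.gz"), ".tar.gz")
would return the inner archives keyed by entry name. Each value is itself a
tar.gz, so wrapping it in a ByteArrayInputStream and passing it through the
same method again lists what is inside it.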

Thanks
/Saj
On 16 March 2018 at 04:29, Pablo Estrada <pabl...@google.com> wrote:

> Hi!
> Quick questions:
> - which sdk are you using?
> - is this batch or streaming?
>
> As JB mentioned, TextIO is able to work with compressed files that contain
> text. Nothing currently handles the double decompression that I believe
> you're looking for.
> TextIO for Java is also able to "watch" a directory for new files. If
> you're able to (outside of your pipeline) decompress your first zip file
> into a directory that your pipeline is watching, you may be able to use
> that as a workaround. Does that sound like a good thing?
> Finally, if you want to implement a transform that does all your logic,
> well then that sounds like SplittableDoFn material; and in that case,
> someone that knows SDF better can give you guidance (or clarify if my
> suggestions are not correct).
> Best
> -P.
>
> On Thu, Mar 15, 2018, 8:09 PM Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
>
>> Hi
>>
>> TextIO supports compressed files. Do you want to read the files as text?
>>
>> Can you describe the use case in a bit more detail?
>>
>> Thanks
>> Regards
>> JB
>> On 15 March 2018, at 18:28, Shirish Jamthe <sjam...@google.com> wrote:
>>>
>>> Hi,
>>>
>>> My input is a tar.gz or .zip file which contains thousands of tar.gz
>>> files and other files.
>>> I would like to extract the tar.gz files from the tar.
>>>
>>> Is there a transform that can do that? I couldn't find one.
>>> If not, is it in the works? Any pointers for starting work on it?
>>>
>>> thanks
>>>
>> --
> Got feedback? go/pabloem-feedback
>
