Eugene - Yes, you are correct. I tried it with a text file and the Beam wordcount example. The TextIO reader reads some illegal characters, as seen below.
here’s: 1 addiction: 1 new: 1 we: 1 mood: 1 an: 1 incredible: 1 swings,: 1 known: 1 choices.: 1 ^@eachsaj^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@eachsaj^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@They’re: 1 already: 2 today: 1 the: 3 generation: 1 wordcount-00002 thanks Saj On 16 March 2018 at 17:45, Eugene Kirpichov <kirpic...@google.com> wrote: > To clarify: I think natively supporting .tar and .tar.gz would be quite > useful. I'm just saying that currently we don't. > > On Fri, Mar 16, 2018 at 10:44 AM Eugene Kirpichov <kirpic...@google.com> > wrote: > >> The code behaves as I expected, and the output is corrupt. >> Beam unzipped the .gz, but then interpreted the .tar as a text file, and >> split the .tar file by \n. >> E.g. the first file of the output starts with lines: >> A20171012.1145+0200-1200+0200_epg10-1_node.xml/ >> 0000755000175000017500000000000013252764467016513 5ustar >> eachsajeachsajA20171012.1145+0200-1200+0200_epg10-1_node.xml/ >> data0000644000175000017500000000360513252764467017353 0ustar >> eachsajeachsaj<?xml version="1.0" encoding="UTF-8"?> >> >> which are clearly not the expected input. >> >> On Fri, Mar 16, 2018 at 10:39 AM Sajeevan Achuthan < >> achuthan.sajee...@gmail.com> wrote: >> >>> Eugene, I ran the code and it works fine. I am very confident in this >>> case. I appreciate you guys for the great work. >>> >>> The code supposed to show that Beam TextIO can read the double >>> compressed files and write output without any processing. so ignored the >>> processing steps. I agree with you the further processing is not easy in >>> this case. 
>>>
>>> import org.apache.beam.sdk.Pipeline;
>>> import org.apache.beam.sdk.io.TextIO;
>>> import org.apache.beam.sdk.options.PipelineOptions;
>>> import org.apache.beam.sdk.options.PipelineOptionsFactory;
>>> import org.apache.beam.sdk.transforms.DoFn;
>>> import org.apache.beam.sdk.transforms.ParDo;
>>>
>>> public class ReadCompressedTextFile {
>>>
>>>   public static void main(String[] args) {
>>>     PipelineOptions optios = PipelineOptionsFactory.
>>>         fromArgs(args).withValidation().create();
>>>     Pipeline p = Pipeline.create(optios);
>>>
>>>     p.apply("ReadLines",
>>>         TextIO.read().from("./dataset.tar.gz")
>>>
>>>     ).apply(ParDo.of(new DoFn<String, String>(){
>>>       @ProcessElement
>>>       public void processElement(ProcessContext c) {
>>>         c.output(c.element());
>>>         // Just write all the content to "/tmp/filout/outputfile"
>>>       }
>>>
>>>     }))
>>>
>>>     .apply(TextIO.write().to("/tmp/filout/outputfile"));
>>>
>>>     p.run().waitUntilFinish();
>>>   }
>>>
>>> }
>>>
>>> The full code, data file & output contents are attached.
>>>
>>> thanks
>>> Saj
>>>
>>>
>>> On 16 March 2018 at 16:56, Eugene Kirpichov <kirpic...@google.com>
>>> wrote:
>>>
>>>> Sajeevan - I'm quite confident that TextIO can handle .gz, but cannot
>>>> properly handle .tar. Did you run this code? Did your test .tar.gz file
>>>> contain multiple files? Did you obtain the expected output, identical to
>>>> the input except for order of lines?
>>>> (also, the ParDo in this code doesn't do anything - it outputs its
>>>> input - so it can be removed)
>>>>
>>>> On Fri, Mar 16, 2018 at 9:06 AM Sajeevan Achuthan <
>>>> achuthan.sajee...@gmail.com> wrote:
>>>>
>>>>> Hi Guys,
>>>>>
>>>>> TextIO can handle tar.gz type double compressed files. See the
>>>>> test code.
>>>>>
>>>>> PipelineOptions optios = PipelineOptionsFactory.
>>>>>     fromArgs(args).withValidation().create();
>>>>> Pipeline p = Pipeline.create(optios);
>>>>>
>>>>> p.apply("ReadLines", TextIO.read().from("/dataset.tar.gz"))
>>>>>     .apply(ParDo.of(new DoFn<String, String>(){
>>>>>       @ProcessElement
>>>>>       public void processElement(ProcessContext c) {
>>>>>         c.output(c.element());
>>>>>       }
>>>>>
>>>>>     }))
>>>>>
>>>>>     .apply(TextIO.write().to("/tmp/filout/outputfile"));
>>>>>
>>>>> p.run().waitUntilFinish();
>>>>>
>>>>> Thanks
>>>>> /Saj
>>>>>
>>>>> On 16 March 2018 at 04:29, Pablo Estrada <pabl...@google.com> wrote:
>>>>>
>>>>>> Hi!
>>>>>> Quick questions:
>>>>>> - which sdk are you using?
>>>>>> - is this batch or streaming?
>>>>>>
>>>>>> As JB mentioned, TextIO is able to work with compressed files that
>>>>>> contain text. Nothing currently handles the double decompression that I
>>>>>> believe you're looking for.
>>>>>> TextIO for Java is also able to "watch" a directory for new files. If
>>>>>> you're able to (outside of your pipeline) decompress your first zip file
>>>>>> into a directory that your pipeline is watching, you may be able to use
>>>>>> that as a workaround. Does that sound like a good thing?
>>>>>> Finally, if you want to implement a transform that does all your
>>>>>> logic, well then that sounds like SplittableDoFn material; and in that
>>>>>> case, someone that knows SDF better can give you guidance (or clarify if
>>>>>> my suggestions are not correct).
>>>>>> Best
>>>>>> -P.
>>>>>>
>>>>>> On Thu, Mar 15, 2018, 8:09 PM Jean-Baptiste Onofré <j...@nanthrax.net>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi
>>>>>>>
>>>>>>> TextIO supports compressed files. Do you want to read the files as text?
>>>>>>>
>>>>>>> Can you describe the use case in a bit more detail?
>>>>>>>
>>>>>>> Thanks
>>>>>>> Regards
>>>>>>> JB
>>>>>>> On 15 March 2018 at 18:28, Shirish Jamthe <sjam...@google.com>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> My input is a tar.gz or .zip file which contains thousands of
>>>>>>>> tar.gz files and other files.
>>>>>>>> I would like to extract the tar.gz files from the tar.
>>>>>>>>
>>>>>>>> Is there a transform that can do that? I couldn't find one.
>>>>>>>> If not, is it in the works? Any pointers to start work on it?
>>>>>>>>
>>>>>>>> thanks
>>>>>>>>
>>>>>>> --
>>>>>> Got feedback? go/pabloem-feedback
>>>>>> <https://goto.google.com/pabloem-feedback>
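
Eugene's diagnosis in this thread (the .gz layer is undone, then the raw .tar bytes are split on \n) can be illustrated without Beam. What follows is a minimal sketch, not Beam API: it uses only the JDK, assumes a plain ustar archive whose entries fit in memory, and does not verify checksums or handle GNU/pax extensions. TarGzSketch, readTarGz, tarEntry, field and put are hypothetical names made up for this illustration. It walks the 512-byte header blocks that appear as the "ustar" strings and ^@ (NUL) runs in the corrupt output, and returns each regular file's text separately.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class TarGzSketch {
    private static final int BLOCK = 512; // tar stores headers and data in 512-byte blocks

    /** Returns the ASCII text of a NUL-terminated header field. */
    static String field(byte[] header, int offset, int length) {
        int end = offset;
        while (end < offset + length && header[end] != 0) end++;
        return new String(header, offset, end - offset, StandardCharsets.US_ASCII);
    }

    /** Reads every regular file from a gzip-compressed tar stream: name -> UTF-8 text. */
    static Map<String, String> readTarGz(InputStream raw) throws IOException {
        Map<String, String> files = new LinkedHashMap<>();
        DataInputStream in = new DataInputStream(new GZIPInputStream(raw));
        byte[] header = new byte[BLOCK];
        while (true) {
            try {
                in.readFully(header);
            } catch (EOFException e) {
                break; // stream ended without the usual zero blocks
            }
            if (header[0] == 0) break; // an all-zero block marks the end of the archive
            String name = field(header, 0, 100);
            String octalSize = field(header, 124, 12).trim();
            int size = octalSize.isEmpty() ? 0 : Math.toIntExact(Long.parseLong(octalSize, 8));
            byte type = header[156]; // '0' or NUL = regular file, '5' = directory
            byte[] body = new byte[size];
            in.readFully(body);
            in.readFully(new byte[(BLOCK - size % BLOCK) % BLOCK]); // skip NUL padding
            if (type == '0' || type == 0) {
                files.put(name, new String(body, StandardCharsets.UTF_8));
            }
        }
        return files;
    }

    /** Copies an ASCII string into the header at the given offset. */
    static void put(byte[] header, int offset, String ascii) {
        byte[] bytes = ascii.getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(bytes, 0, header, offset, bytes.length);
    }

    /** Builds one ustar entry (header block plus NUL-padded body); used only by the demo. */
    static byte[] tarEntry(String name, byte[] data) {
        byte[] header = new byte[BLOCK];
        put(header, 0, name);
        put(header, 100, "0000644");                           // mode
        put(header, 108, "0000000");                           // uid
        put(header, 116, "0000000");                           // gid
        put(header, 124, String.format("%011o", data.length)); // size in octal
        put(header, 136, "00000000000");                       // mtime
        Arrays.fill(header, 148, 156, (byte) ' ');             // checksum field is summed as spaces
        header[156] = '0';                                     // regular file
        put(header, 257, "ustar");                             // magic, followed by a NUL
        put(header, 263, "00");                                // version
        int sum = 0;
        for (byte b : header) sum += b & 0xFF;
        put(header, 148, String.format("%06o", sum));
        header[154] = 0;                                       // checksum ends with NUL, then space
        byte[] out = new byte[BLOCK + (data.length + BLOCK - 1) / BLOCK * BLOCK];
        System.arraycopy(header, 0, out, 0, BLOCK);
        System.arraycopy(data, 0, out, BLOCK, data.length);
        return out;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(tarEntry("a.txt", "hello\n".getBytes(StandardCharsets.UTF_8)));
            gz.write(tarEntry("b.txt", "world\n".getBytes(StandardCharsets.UTF_8)));
            gz.write(new byte[2 * BLOCK]); // end-of-archive marker
        }
        Map<String, String> files = readTarGz(new ByteArrayInputStream(buf.toByteArray()));
        files.forEach((name, text) -> System.out.println(name + " -> " + text.trim()));
    }
}
```

Inside a pipeline, a loop like readTarGz could presumably run in a DoFn over whole files rather than lines, emitting one element per archive entry; that wiring is not shown or tested here.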