Re: Reading compressed XML data

2017-02-24 Thread Sebastian Neef
Hi Till, just to clarify and for my understanding. Let's assume we have the following Bzip2 file:

|A.BA.B|A...B|A|..BA.|...BA|B|A...B|
|1     |2    |3|4    |5    |6|7    | ("block number")
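
To make the layout above easier to reason about, here is a small sketch that works out which blocks each record touches. It assumes that `A` stands for an opening tag, `B` for the matching closing tag, and `.` for record content; the excerpt does not state this, so treat it as one reading of the diagram. The function name `record_block_spans` is made up for illustration.

```python
def record_block_spans(blocks, start_tag="A", end_tag="B"):
    """Return (first_block, last_block) for every start_tag...end_tag record.

    Block numbers are 1-based, matching the diagram.
    """
    spans = []
    open_block = None  # block in which the currently open record started
    for number, block in enumerate(blocks, 1):
        for symbol in block:
            if symbol == start_tag:
                open_block = number
            elif symbol == end_tag and open_block is not None:
                spans.append((open_block, number))
                open_block = None
    return spans


# The seven blocks from the diagram, in order.
blocks = ["A.BA.B", "A...B", "A", "..BA.", "...BA", "B", "A...B"]
spans = record_block_spans(blocks)

# Records whose closing tag lies in a later block than their opening tag.
straddling = [span for span in spans if span[0] != span[1]]

print(spans)
print(straddling)
```

Under that reading, the records opened in blocks 3, 4 and 5 each finish in the following block, so a reader that sees only one block at a time would never find their closing tags.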

Re: Reading compressed XML data

2017-02-17 Thread Till Rohrmann
Hi Sebastian, file input splits basically define the region of a file which a subtask will read. Thus, your file input format would have to break up the bzip2 file exactly at the border of compressed blocks when generating the input file splits. Otherwise a subtask won't be able to decompress it.
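
As a rough illustration of what "the border of compressed blocks" means, the sketch below scans a bzip2 stream for the 48-bit marker `0x314159265359` that begins every compressed block. It is plain Python rather than Flink code, the helper name `find_block_bit_offsets` is hypothetical, and the result is a list of bit offsets because bzip2 blocks are not byte-aligned. A real input format would still need to turn such offsets into splits and handle the rare case of the marker appearing by chance inside compressed data.

```python
import bz2
import random

BLOCK_MAGIC = 0x314159265359  # 48-bit marker at the start of each bzip2 block
MASK = (1 << 48) - 1


def find_block_bit_offsets(data):
    """Return the bit offset of every block marker found in a bzip2 stream."""
    offsets = []
    window = 0      # sliding window over the last 48 bits seen
    bit_index = 0   # index of the bit currently being shifted in
    for byte in data:
        for shift in range(7, -1, -1):  # most significant bit first
            window = ((window << 1) | ((byte >> shift) & 1)) & MASK
            if bit_index >= 47 and window == BLOCK_MAGIC:
                offsets.append(bit_index - 47)
            bit_index += 1
    return offsets


# Demo input: pseudo-random bytes, compressed with the smallest block size
# (compresslevel=1 means 100 kB blocks), so the stream holds several blocks.
rng = random.Random(0)
payload = bytes(rng.getrandbits(8) for _ in range(220_000))
compressed = bz2.compress(payload, compresslevel=1)

offsets = find_block_bit_offsets(compressed)
print(len(offsets), offsets[0])
```

The first marker sits directly after the 4-byte stream header (`BZh` plus the block-size digit), which is why the first offset is 32 bits. Later offsets are generally not multiples of 8, which is part of what makes splitting bzip2 files harder than splitting plain text.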

Re: Reading compressed XML data

2017-02-16 Thread Sebastian Neef
Hi Robert, sorry for the long delay.

> I wonder why the decompression with the XmlInputFormat doesn't work. Did
> you get any exception?

I didn't get any exception; it just seems to read nothing (or at least doesn't match any opening/closing tags). I dug a bit into the code and found out that

Re: Reading compressed XML data

2017-01-14 Thread Robert Metzger
Hi Sebastian, I'm not aware of a better way of implementing this in Flink. You could implement your own XmlInputFormat using Flink's InputFormat abstractions, but you would end up with almost exactly the same code as Mahout / Hadoop. I wonder why the decompression with the XmlInputFormat doesn't w

Reading compressed XML data

2017-01-11 Thread Sebastian Neef
Hi, what's the best way to read a compressed (bz2 / gz) XML file and split it at a specific XML tag? So far I've been using Hadoop's TextInputFormat in combination with Mahout's XmlInputFormat ([0]) with env.readHadoopFile(). Whereas the plain TextInputFormat can handle compressed data, the XmlInp