Hi Till,
just to clarify, and for my own understanding:
Let's assume we have the following Bzip2 file:
|A.BA.B|A...B|A|..BA.|...BA|B|A...B|
|  1   |  2  |3|  4  |  5  |6|  7  |  ("block number")
Hi Sebastian,
file input splits basically define the region of a file that a subtask
will read. Thus, your file input format would have to break up the bzip2
file exactly at the borders of the compressed blocks when generating the
input file splits. Otherwise, a subtask won't be able to decompress its split.
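Just to sketch the idea (untested, and findBz2BlockOffsets() is a made-up placeholder
for the logic that scans the file for the 48-bit bzip2 block magic 0x314159265359;
note that bzip2 blocks are bit-, not byte-aligned, which is what makes this tricky),
such an input format could override createInputSplits() roughly like this; the actual
record reading would still have to be implemented in open()/nextRecord():

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.flink.api.common.io.FileInputFormat;
import org.apache.flink.core.fs.FileInputSplit;
import org.apache.flink.core.fs.FileStatus;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;

// Untested sketch: one input split per compressed bzip2 block, so that every
// subtask starts decompressing at a block border.
public abstract class Bzip2BlockAlignedInputFormat extends FileInputFormat<String> {

  @Override
  public FileInputSplit[] createInputSplits(int minNumSplits) throws IOException {
    final Path path = this.filePath;
    final FileSystem fs = path.getFileSystem();
    final FileStatus status = fs.getFileStatus(path);

    // Offsets at which compressed blocks start (placeholder, see below).
    List<Long> offsets = findBz2BlockOffsets(fs, path, status.getLen());

    List<FileInputSplit> splits = new ArrayList<>();
    for (int i = 0; i < offsets.size(); i++) {
      long start = offsets.get(i);
      long end = (i + 1 < offsets.size()) ? offsets.get(i + 1) : status.getLen();
      splits.add(new FileInputSplit(i, path, start, end - start, null));
    }
    return splits.toArray(new FileInputSplit[splits.size()]);
  }

  // Placeholder: would have to locate the bzip2 block borders in the file.
  protected abstract List<Long> findBz2BlockOffsets(FileSystem fs, Path path, long fileLength)
      throws IOException;
}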
Hi Robert,
sorry for the long delay.
> I wonder why the decompression with the XmlInputFormat doesn't work. Did
> you get any exception?
I didn't get any exception; it just seems to read nothing (or at least
doesn't match any opening/closing tags).
I dug a bit into the code and found out that
Hi Sebastian,
I'm not aware of a better way of implementing this in Flink. You could
implement your own XmlInputFormat using Flink's InputFormat abstractions,
but you would end up with almost exactly the same code as Mahout / Hadoop.
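Roughly, such a format could look like the following untested sketch. It marks the
file as unsplittable and uses a simplified tag matcher, so it does none of the
split-boundary handling that Mahout's XmlInputFormat does, and decompression would
still have to be handled around 'stream' (e.g. by wrapping it in a bzip2/gzip
input stream):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.flink.api.common.io.FileInputFormat;
import org.apache.flink.core.fs.FileInputSplit;

// Untested sketch of an XML record reader on top of Flink's FileInputFormat.
public class XmlInputFormat extends FileInputFormat<String> {

  private final byte[] startTag;
  private final byte[] endTag;
  private boolean end;

  public XmlInputFormat(String startTag, String endTag) {
    this.startTag = startTag.getBytes(StandardCharsets.UTF_8);
    this.endTag = endTag.getBytes(StandardCharsets.UTF_8);
    this.unsplittable = true; // read the whole file in one subtask
  }

  @Override
  public void open(FileInputSplit split) throws IOException {
    super.open(split);
    end = false;
  }

  @Override
  public boolean reachedEnd() {
    return end;
  }

  @Override
  public String nextRecord(String reuse) throws IOException {
    // Skip ahead to the next start tag, then copy everything up to and
    // including the matching end tag.
    if (!readUntil(startTag, null)) {
      end = true;
      return null;
    }
    ByteArrayOutputStream record = new ByteArrayOutputStream();
    record.write(startTag, 0, startTag.length);
    if (!readUntil(endTag, record)) {
      end = true;
      return null;
    }
    return record.toString("UTF-8");
  }

  // Reads single bytes from 'stream' (opened by FileInputFormat#open) until
  // 'needle' has been matched; copies consumed bytes into 'out' if given.
  // Simplified matcher, good enough for tags like "<page>".
  private boolean readUntil(byte[] needle, ByteArrayOutputStream out) throws IOException {
    int matched = 0;
    int b;
    while ((b = stream.read()) != -1) {
      if (out != null) {
        out.write(b);
      }
      if (b == needle[matched]) {
        if (++matched == needle.length) {
          return true;
        }
      } else {
        matched = (b == needle[0]) ? 1 : 0;
      }
    }
    return false;
  }
}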
I wonder why the decompression with the XmlInputFormat doesn't work. Did
you get any exception?
Hi,
what's the best way to read a compressed (bz2 / gz) XML file, splitting
it at a specific XML tag?
So far I've been using Hadoop's TextInputFormat in combination with
Mahout's XmlInputFormat ([0]) with env.readHadoopFile(). Whereas the
plain TextInputFormat can handle compressed data, the XmlInputFormat
does not seem to.
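For reference, this is roughly how I wire it up (the path and the <page>/</page>
tags are just placeholders, and the XmlInputFormat import path may differ between
Mahout versions):

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.mahout.text.wikipedia.XmlInputFormat;

public class ReadCompressedXml {
  public static void main(String[] args) throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    Job job = Job.getInstance();
    // the tags that delimit one record
    job.getConfiguration().set(XmlInputFormat.START_TAG_KEY, "<page>");
    job.getConfiguration().set(XmlInputFormat.END_TAG_KEY, "</page>");

    DataSet<Tuple2<LongWritable, Text>> pages =
        env.readHadoopFile(new XmlInputFormat(), LongWritable.class, Text.class,
            "hdfs:///path/to/dump.xml.bz2", job);

    pages.first(10).print();
  }
}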