To be clearer about my requirements: what I have is actually a custom
format, with a header, a summary and multi-line blocks. I want to create
tasks per block, not per line. I already have a library that reads an
InputStream and outputs an Iterator of Block, but now I need to integrate
this with Spark.
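
For illustration, here is a rough sketch of how this might be wired up with
only RDD APIs, written in PySpark. It assumes each file is small enough to be
parsed by a single task. `read_blocks` and its toy format ('#' lines for
header/summary, blank lines between blocks) are hypothetical stand-ins for my
library, and `blocks_rdd` and its parameters are names made up for this
example. `SparkContext.binaryFiles` hands over whole files, so the parsing of
one file is not parallel; only the resulting blocks are spread across tasks.
Splitting a single large file at block boundaries would probably still need a
custom Hadoop input format.

```python
import io
from typing import BinaryIO, Iterator, List

Block = List[str]  # hypothetical stand-in for the library's Block type


def read_blocks(stream: BinaryIO) -> Iterator[Block]:
    """Hypothetical stand-in for the existing stream -> iterator-of-blocks library.

    Assumed toy format: lines starting with '#' are header/summary lines,
    and blocks are runs of non-empty lines separated by blank lines.
    """
    block: Block = []
    for raw in stream:
        line = raw.decode("utf-8").rstrip("\r\n")
        if line.startswith("#"):
            continue  # skip header/summary lines
        if line:
            block.append(line)
        elif block:
            yield block  # a blank line closes the current block
            block = []
    if block:
        yield block  # last block when the file has no trailing blank line


def blocks_rdd(sc, path, num_partitions):
    """Return an RDD with one element per block (sc is a SparkContext)."""
    return (
        sc.binaryFiles(path)  # (file_path, file_bytes), one whole file per record
        .flatMap(lambda kv: read_blocks(io.BytesIO(kv[1])))
        .repartition(num_partitions)  # spread the blocks over separate tasks
    )
```

Something like `blocks_rdd(sc, path, 8).map(process_block)` would then run a
(hypothetical) `process_block` function once per block. In Scala or Java the
same idea should work by calling `open()` on the `PortableDataStream` that
`binaryFiles` returns and passing that stream to the library.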

On Tue, 17 Sep 2019 at 16:28, Marcelo Valle <marcelo.va...@ktech.com> wrote:

> Hi,
>
> I want to create a custom RDD which will read n lines in sequence from a
> file, which I call a block, and each block should be converted to a spark
> dataframe to be processed in parallel.
>
> Question - do I have to implement a custom hadoop input format to achieve
> this? Or is it possible to do it only with RDD APIs?
>
> Thanks,
> Marcelo.
>

