And you have to write your own input format, but this is not so complicated
(probably recommended anyway for the PDF case)
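For the custom-input-format route, here is a minimal sketch, hedged: `WholePdfInputFormat` is a hypothetical class name, it assumes the Hadoop `mapreduce` API, and it emits one (path, bytes) record per file so a PDF is never split across records:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Hypothetical whole-file input format: key = file path, value = raw bytes.
public class WholePdfInputFormat extends FileInputFormat<Text, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // a PDF must be read as a single record, never split
    }

    @Override
    public RecordReader<Text, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new WholeFileRecordReader();
    }

    static class WholeFileRecordReader extends RecordReader<Text, BytesWritable> {
        private FileSplit split;
        private Configuration conf;
        private boolean processed = false;
        private final Text key = new Text();
        private final BytesWritable value = new BytesWritable();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            this.split = (FileSplit) split;
            this.conf = context.getConfiguration();
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) return false;
            // Read the entire file into one value
            Path path = split.getPath();
            byte[] contents = new byte[(int) split.getLength()];
            FileSystem fs = path.getFileSystem(conf);
            try (FSDataInputStream in = fs.open(path)) {
                IOUtils.readFully(in, contents, 0, contents.length);
            }
            key.set(path.toString());
            value.set(contents, 0, contents.length);
            processed = true;
            return true;
        }

        @Override public Text getCurrentKey() { return key; }
        @Override public BytesWritable getCurrentValue() { return value; }
        @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
        @Override public void close() {}
    }
}
```

This is the classic "whole file as one record" pattern; it is only suitable while the individual PDFs fit comfortably in memory.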
> On 20.11.2018, at 08:06, Jörn Franke wrote:
>
Well, I am not so sure about the use cases, but what about using
StreamingContext.fileStream?
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/streaming/StreamingContext.html#fileStream-java.lang.String-scala.Function1-boolean-org.apache.hadoop.conf.Configuration-scala.reflect.ClassTa
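Wiring that up could look roughly like the sketch below. Hedged: `WholePdfInputFormat` is a hypothetical custom input format (the one you would have to write) emitting (path, bytes) pairs per whole file; the directory, batch interval, and `.pdf` filter are assumptions:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class PdfFileStream {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("pdf-stream").setMaster("local[2]");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(30));

        // WholePdfInputFormat is hypothetical: a custom whole-file input
        // format producing one (path, bytes) record per PDF.
        JavaPairInputDStream<Text, BytesWritable> pdfs = ssc.fileStream(
                "hdfs:///landing/pdf",                    // directory to monitor (assumed)
                Text.class, BytesWritable.class, WholePdfInputFormat.class,
                path -> path.getName().endsWith(".pdf"),  // only pick up PDFs
                true);                                    // newFilesOnly

        pdfs.foreachRDD(rdd ->
                rdd.foreach(kv ->
                        System.out.println(kv._1() + ": " + kv._2().getLength() + " bytes")));

        ssc.start();
        ssc.awaitTermination();
    }
}
```

Note that fileStream only picks up files atomically moved or renamed into the monitored directory, so the producer has to write elsewhere and move the finished PDF in.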
On Mon, Nov 19, 2018 at 07:23:10AM +0100, Jörn Franke wrote:
> Why does it have to be a stream?
>
Right now I manage the pipelines as spark batch processing. Moving to
stream would add some improvements, such as:
- simplification of the pipeline
- more frequent data ingestion
- better resource management
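For comparison, the batch side does not need a custom format at all: `SparkContext.binaryFiles` already reads each small binary file whole. A minimal sketch of that current-style batch load (the input path is an assumption):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.input.PortableDataStream;

public class PdfBatchLoad {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("pdf-batch").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // binaryFiles returns (path, stream) pairs, one per file;
            // the stream is read lazily on the executor.
            JavaPairRDD<String, PortableDataStream> pdfs =
                    sc.binaryFiles("hdfs:///landing/pdf");  // assumed path
            pdfs.foreach(kv -> {
                byte[] bytes = kv._2().toArray(); // whole PDF in memory
                System.out.println(kv._1() + ": " + bytes.length + " bytes");
            });
        }
    }
}
```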
Why does it have to be a stream?
> On 18.11.2018, at 23:29, Nicolas Paris wrote:
Hi
I have pdf to load into spark with at least
format. I have considered some options:
- spark streaming does not provide a native file stream for binary files of
  variable size (binaryRecordsStream specifies a constant record size) and I
  would have to write my own receiver.
- Structured streaming allow