You can listen to files in a specific directory using streamingContext.fileStream. Take a look at: http://spark.apache.org/docs/latest/streaming-programming-guide.html

On Thu, Sep 15, 2016 at 10:31 AM, Jörn Franke <jornfra...@gmail.com> wrote:
> Hi,
> I recommend that the third-party application put an empty file with the
> same filename as the original file, but with the extension ".uploaded".
> This is an indicator that the file has been fully (!) written to the fs;
> otherwise you risk reading only parts of the file.
> Then you can have a file-system listener for this ".uploaded" file.
>
> Spark Streaming and Kafka are not needed/suitable if the server is a file
> server. You can use Oozie (maybe with a simple custom action) to poll for
> ".uploaded" files and transmit them.
>
> On 15 Sep 2016, at 19:00, Kappaganthu, Sivaram (ES) <sivaram.kappagan...@adp.com> wrote:
>> Hello,
>>
>> I am a newbie to Spark and I have the requirement below.
>>
>> Problem statement: A third-party application is continuously dumping
>> files onto a server, typically around 100 files per hour, each smaller
>> than 50 MB. My application has to process those files.
>>
>> 1) Is it possible for Spark Streaming to trigger a job after a file is
>> placed, instead of triggering a job at a fixed batch interval?
>> 2) If it is not possible with Spark Streaming, can we control this with
>> Kafka/Flume?
>>
>> Thanks,
>> Sivaram
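The directory-listening suggestion at the top of the thread can be sketched in the Python API, where the equivalent of fileStream is textFileStream. This is a minimal sketch, assuming PySpark is installed; the HDFS path, app name, and the 60-second batch interval are illustrative, not from the thread.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="FileIngest")
ssc = StreamingContext(sc, 60)  # check the directory for new files every 60s

# Picks up files that appear in the directory after the stream starts.
lines = ssc.textFileStream("hdfs:///landing/incoming")
lines.count().pprint()  # e.g. report how many lines arrived per batch

ssc.start()
ssc.awaitTermination()
```

Note this does not answer question 1) directly: the stream still runs on a fixed batch interval, and a batch in which no new files appeared simply processes an empty RDD.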
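Jörn's ".uploaded" marker check can be sketched without Spark as a plain polling function. This is a hypothetical sketch, not code from the thread; it assumes the marker name is the data filename with ".uploaded" appended, and the `ready_files` helper and `processed` set are illustrative names.

```python
import os

def ready_files(directory, processed):
    """Return paths of data files whose '.uploaded' marker exists
    and that have not been handed out before."""
    ready = []
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".uploaded"):
            continue  # only react to marker files, never to partial uploads
        data_name = name[: -len(".uploaded")]
        data_path = os.path.join(directory, data_name)
        # Hand each data file out exactly once, and only if it really exists.
        if data_name not in processed and os.path.exists(data_path):
            ready.append(data_path)
            processed.add(data_name)
    return ready
```

A poller (e.g. a cron job or an Oozie custom action, as suggested above) would call this periodically and submit each returned path for processing.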