So you can run a job / Spark job to get the data to disk/HDFS, then run a DStream from an HDFS folder. As you move your files into that folder, the DStream will kick in. A rough sketch of this approach follows below.

Regards,
Mayur
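A minimal, untested sketch of that file-based approach (the HDFS path and app name are placeholders; textFileStream only picks up files that appear atomically in the watched folder, which is why moving rather than copying them in matters):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object HdfsFolderStream {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("HdfsFolderStream")
        val ssc  = new StreamingContext(conf, Seconds(30))

        // textFileStream creates a batch from every new file that appears
        // in the folder; write files elsewhere first and then move them in,
        // so the stream never sees a half-written file
        val lines = ssc.textFileStream("hdfs:///data/incoming")
        lines.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }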
On 6 Jun 2014 21:13, "Gianluca Privitera" <gianluca.privite...@studio.unibo.it> wrote:

> Where are the APIs for QueueStream and RDDQueue?
> In my solution I cannot open a DStream with an S3 location, because I need
> to run a script on the file (a script that unfortunately doesn't accept
> stdin as input), so I have to download it to my disk somehow and handle it
> from there before creating the stream.
>
> Thanks
> Gianluca
>
> On 06/06/2014 02:19, Mayur Rustagi wrote:
>
> You can look to create a DStream directly from the S3 location using file
> stream. If you want to use any specific logic, you can rely on QueueStream:
> read the data yourself from S3, process it, and push it into an RDDQueue.
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>
> On Fri, Jun 6, 2014 at 3:00 AM, Gianluca Privitera
> <gianluca.privite...@studio.unibo.it> wrote:
>
>> Hi,
>> I've got a weird question, but maybe someone has already dealt with it.
>> My Spark Streaming application needs to:
>> - download a file from an S3 bucket,
>> - run a script with the file as input,
>> - create a DStream from this script's output.
>> I've already got the second part done with the rdd.pipe() API, which
>> really fits my request, but I have no idea how to manage the first part.
>> How can I manage to download a file and run a script on it inside a
>> Spark Streaming Application?
>> Should I use process() from Scala, or won't that work?
>>
>> Thanks
>> Gianluca
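For completeness, a rough sketch of the QueueStream route discussed above (queueStream is the actual StreamingContext method): download the file yourself on the driver, run the script on the local file (since it takes a path rather than stdin), and push its output into the queue as an RDD. The bucket, local path, script name, and the use of the aws CLI for the download are all placeholder assumptions:

    import scala.collection.mutable
    import scala.sys.process._
    import org.apache.spark.SparkConf
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object S3QueueStream {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("S3QueueStream")
        val ssc  = new StreamingContext(conf, Seconds(10))

        // queueStream dequeues one RDD from the queue per batch interval
        val rddQueue = new mutable.Queue[RDD[String]]()
        val stream   = ssc.queueStream(rddQueue)
        stream.print()
        ssc.start()

        // Driver-side work (could run in its own thread or loop):
        // 1. fetch the file from S3 to local disk, here via the aws CLI
        val localPath = "/tmp/input.dat"
        Seq("aws", "s3", "cp", "s3://my-bucket/input.dat", localPath).!

        // 2. run the script on the local file and capture its stdout
        val output = Seq("/path/to/script.sh", localPath).!!.split("\n")

        // 3. turn the script output into an RDD and enqueue it
        rddQueue.synchronized {
          rddQueue += ssc.sparkContext.parallelize(output)
        }

        ssc.awaitTermination()
      }
    }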