Hi Sebastian,
Yes... my mistake... you are right. Every partition will create a different
file.
Shridhar
On Fri, Mar 25, 2016 at 6:58 PM, Sebastian Piu wrote:
I don't understand the race condition comment you mention.
Have you seen this somewhere? That timestamp will be the same on each
worker for that RDD, and each worker is handling a different partition,
which will be reflected in the filename, so no data will be overwritten. In
fact this is what
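For illustration, a rough sketch of that pattern (untested; stream is assumed
to be an existing DStream and the output path is only a placeholder):

    // Each batch is written under a folder named after the batch time.
    // Inside that folder Spark writes one part-NNNNN file per partition,
    // so tasks handling different partitions never overwrite each other.
    stream.foreachRDD { (rdd, time) =>
      rdd.saveAsTextFile(s"hdfs:///data/output/${time.milliseconds}")
    }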
Hi Surendra,
Thanks for your suggestion. I tried MultipleOutputFormat and
MultipleTextOutputFormat, but the result was the same. The folder would
always contain a single file, part-r-0, and this file gets overwritten
every time.
This is how I am invoking the API
dataToWrite.saveAsHad
Hi Sebastian,
Thanks for your reply. I think using rdd.timestamp may cause one issue: if
the parallelized RDD is executed on more than one worker, there may be a
race condition if rdd.timestamp is used. It may also result in the file
being overwritten.
Shridhar
On Wed, Mar 23, 2016 at 12:32 AM, Sebastian Piu wrote:
Hi Vetal,
You may try MultipleOutputFormat instead of TextOutputFormat in
saveAsNewAPIHadoopFile().
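Something along these lines might work (rough, untested sketch; dataToWrite is
assumed to be an RDD of (String, String) pairs keyed by minute, and the path
and class name are placeholders. Note that MultipleTextOutputFormat belongs to
the old mapred API, so it goes with saveAsHadoopFile rather than
saveAsNewAPIHadoopFile):

    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

    // Send each (key, value) pair to a sub-directory named after its key
    // (e.g. a minute bucket), keeping the usual part-NNNNN suffix.
    class KeyBasedOutputFormat extends MultipleTextOutputFormat[String, String] {
      override def generateFileNameForKeyValue(key: String, value: String, name: String): String =
        key + "/" + name
    }

    dataToWrite.saveAsHadoopFile(
      "hdfs:///data/output",
      classOf[String],
      classOf[String],
      classOf[KeyBasedOutputFormat])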
Regards,
Surendra M
-- Surendra Manchikanti
On Tue, Mar 22, 2016 at 10:26 AM, vetal king wrote:
> We are using Spark 1.4 for Spark Streaming. Kafka is the data source for the
> Spark stream.
As you said, create a folder for each different minute; you can also use
rdd.time as the timestamp.
You might also want to have a look at the window function for the batching.
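A rough sketch of that combination (untested; stream and the output path are
placeholders):

    import org.apache.spark.streaming.Minutes

    // Keep the small batch interval for reading, but group records into
    // 1-minute windows and write each window under a folder named after
    // its batch time.
    stream.window(Minutes(1), Minutes(1)).foreachRDD { (rdd, time) =>
      if (!rdd.isEmpty()) {
        rdd.saveAsTextFile(s"hdfs:///data/output/${time.milliseconds}")
      }
    }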
On Tue, 22 Mar 2016, 17:43, vetal king wrote:
Hi Cody,
Thanks for your reply.
The five-second batch and one-minute publishing interval is just a representative
example. What we want is to group data over a certain frequency, and that
frequency is configurable. One way we think it can be achieved is that a
"directory" will be created per this frequency, and
If you want 1-minute granularity, why not use a 1-minute batch time?
Also, HDFS is not a great match for this kind of thing, because of the
small files issue.
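For example, a minimal sketch (names are illustrative): with a one-minute batch
interval, each micro-batch already covers a full minute, so every batch can be
written straight to its own folder.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, StreamingContext}

    val conf = new SparkConf().setAppName("per-minute-batches")
    val ssc = new StreamingContext(conf, Minutes(1))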
On Tue, Mar 22, 2016 at 12:26 PM, vetal king wrote:
We are using Spark 1.4 for Spark Streaming. Kafka is the data source for the
Spark stream.
Records are published to Kafka every second. Our requirement is to store the
records published to Kafka in a single folder per minute. The stream will
read records every five seconds. For instance, records published