Hi Ayan and Helena,

I considered using Cassandra/HBase but ended up opting to save to HDFS on the workers because I want to take advantage of data locality, since the data will often be loaded into Spark for further processing. I was also under the impression that saving to a filesystem (instead of a db) is the better option for intermediate data. Definitely going to read up some more and reconsider because of the time series nature of the data, though.
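For context, this is roughly the write path I have in mind, as a minimal sketch; the textFileStream source, the hour-prefix parsing, and the paths are placeholders rather than my actual setup:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object TimePartitionedWriter {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("time-partitioned-writer")
        val ssc  = new StreamingContext(conf, Seconds(60))

        // Stand-in source so the sketch is self-contained; in my job this
        // is the Kafka stream.
        val lines = ssc.textFileStream("hdfs:///tmp/incoming")

        lines.foreachRDD { rdd =>
          // Placeholder time dimension: assume each line starts with an ISO
          // timestamp, so the first 13 chars are "yyyy-MM-dd'T'HH".
          val byHour = rdd.map(line => (line.take(13), line)).cache()
          // One directory per hour seen in this batch; a unique batch
          // subdirectory avoids "output directory already exists" failures
          // when a later batch carries the same hour.
          byHour.keys.distinct().collect().foreach { hour =>
            byHour.filter { case (h, _) => h == hour }
                  .values
                  .saveAsTextFile(s"hdfs:///data/events/hour=$hour/batch-${System.currentTimeMillis}")
          }
          byHour.unpersist()
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }

Writing each batch to its own subdirectory under the hour is how I'm currently dodging the "directory already exists" problem, at the cost of many small files per hour.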
This might be a bit off topic, but in your experience is it common to store intermediate data in Cassandra when it will be loaded into Spark many times in the future? Regarding how late data can be, I might be able to set a limit. Would you know if it's possible to combine RDDs from different intervals in Spark Streaming (I've put a rough sketch of what I mean at the bottom of this mail)? Or would I need to write to files first and then group the data by time dimension in a separate batch job?

Thanks in advance!
Nisrina.

On May 16, 2015 7:26 PM, "Helena Edelson" <helena.edel...@datastax.com> wrote:

> Consider using Cassandra with Spark Streaming and time series; Cassandra
> has been doing time series for years.
> Here are some snippets with Kafka streaming and writing/reading the data
> back:
>
> https://github.com/killrweather/killrweather/blob/master/killrweather-app/src/main/scala/com/datastax/killrweather/KafkaStreamingActor.scala#L62-L64
>
> or write in the stream, read back:
>
> https://github.com/killrweather/killrweather/blob/master/killrweather-examples/src/main/scala/com/datastax/killrweather/KafkaStreamingJson2.scala#L53-L61
>
> or more detailed reads back:
>
> https://github.com/killrweather/killrweather/blob/master/killrweather-app/src/main/scala/com/datastax/killrweather/TemperatureActor.scala#L65-L69
>
> A CassandraInputDStream is coming; I'm working on it now.
>
> Helena
> @helenaedelson
>
> On May 15, 2015, at 9:59 AM, ayan guha <guha.a...@gmail.com> wrote:
>
> Hi
>
> Do you have a cut-off time, like how "late" an event can be? If not, you
> may consider a different persistent storage like Cassandra/HBase and
> delegate the "update" part to them.
>
> On Fri, May 15, 2015 at 8:10 PM, Nisrina Luthfiyati <
> nisrina.luthfiy...@gmail.com> wrote:
>
>> Hi all,
>> I have a stream of data from Kafka that I want to process and store in
>> HDFS using Spark Streaming.
>> Each data point has a date/time dimension and I want to write data
>> within the same time dimension to the same HDFS directory. The data
>> stream might be unordered (by time dimension).
>>
>> I'm wondering what the best practices are for grouping/storing time
>> series data streams using Spark Streaming?
>>
>> I'm considering grouping each batch of data in Spark Streaming per time
>> dimension and then saving each group to a different HDFS directory.
>> However, since it is possible for data with the same time dimension to
>> be in different batches, I would need to handle an "update" in case the
>> HDFS directory already exists.
>>
>> Is this a common approach? Are there any other approaches that I can try?
>>
>> Thank you!
>> Nisrina.
>
> --
> Best Regards,
> Ayan Guha
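P.S. Here is the rough sketch of the windowing idea I was asking about, assuming the late-data cut-off fits inside the window; the source, the window sizes, and the hour-prefix parsing are all just placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

    object WindowedGrouping {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("windowed-grouping")
        val ssc  = new StreamingContext(conf, Seconds(60))

        // Stand-in source for the Kafka stream.
        val lines = ssc.textFileStream("hdfs:///tmp/incoming")

        // window(length, slide): each windowed RDD is the union of the last
        // ten 1-minute batch RDDs; sliding by the same amount makes the
        // windows non-overlapping, so records arriving up to ~10 minutes
        // late can still be grouped with the on-time ones.
        val windowed = lines.window(Minutes(10), Minutes(10))

        windowed.foreachRDD { rdd =>
          rdd.groupBy(line => line.take(13)) // same placeholder hour prefix
             .foreach { case (hour, records) =>
               // runs on the executors; a real job would write each group out
               println(s"$hour: ${records.size} records")
             }
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }

If window() isn't the right tool for this, writing raw batches first and compacting per hour in a separate batch job is my fallback.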