Cassandra is only one of the NoSQL options. Don't forget there is HBase :-)
On Oct 26, 2014, at 6:21 PM, Rick Richardson <rick.richard...@gmail.com> wrote:

> I agree with Soumya. A relational database is usually the worst kind of
> database to receive a constant event stream.
>
> That said, the best solution is one that already works :)
>
> If your system is meeting your needs, then great. When you get so many
> events that your db can't keep up, I'd look into Cassandra to receive the
> events, and Spark to analyze them.
>
> On Oct 26, 2014 9:14 PM, "Soumya Simanta" <soumya.sima...@gmail.com> wrote:
>> Given that you are storing event data (which is basically things that have
>> happened in the past AND cannot be modified), you should definitely look at
>> event sourcing:
>> http://martinfowler.com/eaaDev/EventSourcing.html
>>
>> If all you are doing is storing events, then I don't think you need a
>> relational database. Rather, an event log is ideal. Please see:
>> http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
>>
>> There are many other datastores that can do a better job of storing your
>> events. You can process your data and then store it in a relational
>> database to query later.
>>
>> On Sun, Oct 26, 2014 at 9:01 PM, Peter Wolf <opus...@gmail.com> wrote:
>>> Thanks for all the useful responses.
>>>
>>> We have the usual task of mining a stream of events coming from our many
>>> users. We need to store these events and process them. We use a standard
>>> star schema to represent our data.
>>>
>>> For the moment, it looks like we should store these events in SQL. When
>>> appropriate, we will do analysis with relational queries. Or, when
>>> appropriate, we will extract data into working sets in Spark.
>>>
>>> I imagine this is a pretty common use case for Spark.
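[Editor's note: the append-only event log that Soumya and the linked LinkedIn article describe can be sketched in a few lines of Scala. All names and fields below are invented for illustration; the thread does not describe a concrete schema.]

```scala
// Hypothetical event type; the fields are assumptions for illustration only.
final case class Event(userId: String, action: String, timestampMs: Long)

object EventLog {
  // An event log is append-only: events are immutable facts about the past
  // and are never updated in place.
  private var log = Vector.empty[Event]

  def append(e: Event): Unit = { log = log :+ e }

  // Current state is *derived* by folding over the history, so the same log
  // can later be replayed into Spark, a star schema, or any other store.
  def actionCountsByUser: Map[String, Int] =
    log.groupBy(_.userId).map { case (user, events) => (user, events.size) }
}
```

Because state is derived rather than stored, the log itself stays a simple sequential write, which is why it handles a constant event stream better than a normalized relational schema.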
>>> On Sun, Oct 26, 2014 at 10:05 AM, Rick Richardson
>>> <rick.richard...@gmail.com> wrote:
>>>> Spark's API definitely covers all of the things that a relational
>>>> database can do. It will probably outperform a relational star schema if
>>>> all of your *working* data set can fit into RAM on your cluster. It will
>>>> still perform quite well if most of the data fits and some has to spill
>>>> over to disk.
>>>>
>>>> What are your requirements, exactly?
>>>> What is "massive amounts of data," exactly?
>>>> How big is your cluster?
>>>>
>>>> Note that Spark is not for data storage, only data analysis. It pulls
>>>> data into working data sets called RDDs.
>>>>
>>>> As a migration path, you could probably pull the data out of a
>>>> relational database to analyze. But in the long run, I would recommend
>>>> using a more purpose-built, huge-storage database such as Cassandra. If
>>>> your data is very static, you could also just store it in files.
>>>>
>>>> On Oct 26, 2014 9:19 AM, "Peter Wolf" <opus...@gmail.com> wrote:
>>>>> My understanding is that Spark SQL allows one to access Spark data as
>>>>> if it were stored in a relational database. It compiles SQL queries
>>>>> into a series of calls to the Spark API.
>>>>>
>>>>> I need the performance of a SQL database, but I don't care about doing
>>>>> queries with SQL.
>>>>>
>>>>> I create the input to MLlib by doing a massive JOIN query. So, I am
>>>>> creating a single collection by combining many collections. This sort
>>>>> of operation is very inefficient in Mongo, Cassandra, or HDFS.
>>>>>
>>>>> I could store my data in a relational database and copy the query
>>>>> results to Spark for processing. However, I was hoping I could keep
>>>>> everything in Spark.
>>>>>
>>>>> On Sat, Oct 25, 2014 at 11:34 PM, Soumya Simanta
>>>>> <soumya.sima...@gmail.com> wrote:
>>>>>> 1. What data store do you want to store your data in? HDFS, HBase,
>>>>>> Cassandra, S3, or something else?
>>>>>> 2.
>>>>>> Have you looked at Spark SQL (https://spark.apache.org/sql/)?
>>>>>>
>>>>>> One option is to process the data in Spark and then store it in the
>>>>>> relational database of your choice.
>>>>>>
>>>>>> On Sat, Oct 25, 2014 at 11:18 PM, Peter Wolf <opus...@gmail.com> wrote:
>>>>>>> Hello all,
>>>>>>>
>>>>>>> We are considering Spark for our organization. It is obviously a
>>>>>>> superb platform for processing massive amounts of data... how about
>>>>>>> retrieving it?
>>>>>>>
>>>>>>> We are currently storing our data in a relational database in a star
>>>>>>> schema. Retrieving our data requires doing many complicated joins
>>>>>>> across many tables.
>>>>>>>
>>>>>>> Can we use Spark as a relational database? Or, if not, can we put
>>>>>>> Spark on top of a relational database?
>>>>>>>
>>>>>>> Note that we don't care about SQL. Accessing our data via standard
>>>>>>> queries is nice, but we are equally happy (or happier) to write
>>>>>>> Scala code.
>>>>>>>
>>>>>>> What is important to us is doing relational queries on huge amounts
>>>>>>> of data. Is Spark good at this?
>>>>>>>
>>>>>>> Thank you very much in advance,
>>>>>>> Peter
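[Editor's note: the "massive JOIN across many tables" discussed in this thread has the same shape as an ordinary key/value join followed by an aggregation. A minimal plain-Scala sketch of that shape, using invented table and column names (the thread gives no concrete schema); on a cluster the analogous operations would be Spark's pair-RDD `join` or an equivalent Spark SQL query.]

```scala
// Fact and dimension tables as plain Scala collections. Every name here is
// made up for illustration.
object StarSketch {
  val sales: Seq[(Int, Double)] =            // (productId, amount)
    Seq((10, 5.0), (10, 7.5), (20, 3.0))

  val products: Map[Int, String] =           // productId -> product name
    Map(10 -> "book", 20 -> "pen")

  // Join each fact row to its dimension row, then aggregate -- the same
  // shape as:
  //   SELECT p.name, SUM(s.amount)
  //   FROM sales s JOIN products p ON s.productId = p.id
  //   GROUP BY p.name
  val revenueByProduct: Map[String, Double] =
    sales
      .flatMap { case (pid, amt) => products.get(pid).map(name => (name, amt)) }
      .groupBy(_._1)
      .map { case (name, rows) => (name, rows.map(_._2).sum) }
}
```

The widened, flat collection produced by the join step is exactly the kind of single collection Peter describes building as input to MLlib; whether that join is fast then comes down to whether the working set fits in cluster memory, as Rick notes above.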