Given that you are storing event data (which is essentially a record of things that have happened in the past and cannot be modified), you should definitely look at Event Sourcing: http://martinfowler.com/eaaDev/EventSourcing.html
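As a minimal illustration of the pattern (the Account domain and the event types below are invented for this example, not something from this thread): in an event-sourced system, current state is just a fold over an append-only log of immutable events.

// Plain Scala sketch of the event-sourcing idea; domain and numbers are made up.
sealed trait Event
case class Deposited(amount: BigDecimal) extends Event
case class Withdrawn(amount: BigDecimal) extends Event

final case class Account(balance: BigDecimal = BigDecimal(0)) {
  // Applying an event produces a new state; events themselves are never changed.
  def applyEvent(e: Event): Account = e match {
    case Deposited(a) => copy(balance = balance + a)
    case Withdrawn(a) => copy(balance = balance - a)
  }
}

object EventSourcingSketch extends App {
  // The log is append-only: past events are only appended and replayed.
  val log: Vector[Event] = Vector(Deposited(100), Withdrawn(30), Deposited(5))

  // Current state is a pure function of the event history.
  val current = log.foldLeft(Account())((state, e) => state.applyEvent(e))
  println(current) // Account(75)
}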
If all you are doing is storing events, then I don't think you need a
relational database; an event log is ideal. Please see
http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying

There are many other datastores that can do a better job of storing your
events. You can process your data and then store the results in a
relational database to query later.

On Sun, Oct 26, 2014 at 9:01 PM, Peter Wolf <opus...@gmail.com> wrote:

> Thanks for all the useful responses.
>
> We have the usual task of mining a stream of events coming from our many
> users. We need to store these events and process them. We use a standard
> Star Schema to represent our data.
>
> For the moment, it looks like we should store these events in SQL. When
> appropriate, we will do analysis with relational queries. Or, when
> appropriate, we will extract data into working sets in Spark.
>
> I imagine this is a pretty common use case for Spark.
>
> On Sun, Oct 26, 2014 at 10:05 AM, Rick Richardson <rick.richard...@gmail.com> wrote:
>
>> Spark's API definitely covers all of the things that a relational
>> database can do. It will probably outperform a relational star schema if
>> all of your *working* data set can fit into RAM on your cluster. It will
>> still perform quite well if most of the data fits and some has to spill
>> over to disk.
>>
>> What are your requirements exactly?
>> What is "massive amounts of data" exactly?
>> How big is your cluster?
>>
>> Note that Spark is not for data storage, only data analysis. It pulls
>> data into working data sets called RDDs.
>>
>> As a migration path, you could probably pull the data out of a
>> relational database to analyze. But in the long run, I would recommend
>> using a more purpose-built, huge-storage database such as Cassandra. If
>> your data is very static, you could also just store it in files.
>>
>> On Oct 26, 2014 9:19 AM, "Peter Wolf" <opus...@gmail.com> wrote:
>>
>>> My understanding is that SparkSQL allows one to access Spark data as if
>>> it were stored in a relational database. It compiles SQL queries into a
>>> series of calls to the Spark API.
>>>
>>> I need the performance of a SQL database, but I don't care about doing
>>> queries with SQL.
>>>
>>> I create the input to MLlib by doing a massive JOIN query. So, I am
>>> creating a single collection by combining many collections. This sort
>>> of operation is very inefficient in Mongo, Cassandra or HDFS.
>>>
>>> I could store my data in a relational database, and copy the query
>>> results to Spark for processing. However, I was hoping I could keep
>>> everything in Spark.
>>>
>>> On Sat, Oct 25, 2014 at 11:34 PM, Soumya Simanta <soumya.sima...@gmail.com> wrote:
>>>
>>>> 1. What data store do you want to store your data in? HDFS, HBase,
>>>> Cassandra, S3 or something else?
>>>> 2. Have you looked at SparkSQL (https://spark.apache.org/sql/)?
>>>>
>>>> One option is to process the data in Spark and then store it in the
>>>> relational database of your choice.
>>>>
>>>> On Sat, Oct 25, 2014 at 11:18 PM, Peter Wolf <opus...@gmail.com> wrote:
>>>>
>>>>> Hello all,
>>>>>
>>>>> We are considering Spark for our organization. It is obviously a
>>>>> superb platform for processing massive amounts of data... how about
>>>>> retrieving it?
>>>>>
>>>>> We are currently storing our data in a relational database in a star
>>>>> schema. Retrieving our data requires doing many complicated joins
>>>>> across many tables.
>>>>>
>>>>> Can we use Spark as a relational database? Or, if not, can we put
>>>>> Spark on top of a relational database?
>>>>>
>>>>> Note that we don't care about SQL. Accessing our data via standard
>>>>> queries is nice, but we are equally happy (or more happy) to write
>>>>> Scala code.
>>>>>
>>>>> What is important to us is doing relational queries on huge amounts
>>>>> of data. Is Spark good at this?
>>>>>
>>>>> Thank you very much in advance,
>>>>> Peter
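The denormalising JOIN discussed above can be expressed directly in Spark. The sketch below is not from the thread: the table paths, column names and feature columns are invented, and it uses the DataFrame-based API of a later Spark release (in 2014-era Spark the same joins would be written against SchemaRDDs or plain RDDs). It shows a star-schema join done inside Spark, with the joined result assembled into MLlib feature vectors.

// Sketch only: star-schema join in Spark, feeding MLlib. Paths and columns are hypothetical.
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object StarSchemaJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("star-schema-join-sketch").getOrCreate()

    // Fact and dimension tables, e.g. exported from the relational store as Parquet.
    val events = spark.read.parquet("hdfs:///warehouse/fact_events") // fact table
    val users  = spark.read.parquet("hdfs:///warehouse/dim_users")   // dimension
    val items  = spark.read.parquet("hdfs:///warehouse/dim_items")   // dimension

    // The "massive JOIN": denormalise the star schema into one wide collection.
    val wide = events
      .join(users, "user_id")
      .join(items, "item_id")

    // Turn selected numeric columns into an MLlib feature vector.
    val assembler = new VectorAssembler()
      .setInputCols(Array("user_age", "item_price", "event_count"))
      .setOutputCol("features")
    val trainingInput = assembler.transform(wide).select("features")

    trainingInput.show(5)
    spark.stop()
  }
}

Registering the same tables with createOrReplaceTempView and writing the join as a spark.sql("SELECT ...") query yields an equivalent plan; as noted earlier in the thread, Spark SQL compiles such queries down to the same underlying Spark operations.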