> Given that you are storing event data (which is basically things that have
> happened in the past AND cannot be modified) you should definitely look at
> Event sourcing.
> http://martinfowler.com/eaaDev/EventSourcing.html
Agreed. In this context: a lesser known fact is that the Lambda Architecture is, in a nutshell, an extension of Fowler's ES, so you might also want to check out: https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark

Cheers,
Michael

--
Michael Hausenblas
Ireland, Europe
http://mhausenblas.info/

> On 27 Oct 2014, at 01:14, Soumya Simanta <soumya.sima...@gmail.com> wrote:
>
> Given that you are storing event data (which is basically things that have
> happened in the past AND cannot be modified) you should definitely look at
> Event sourcing.
> http://martinfowler.com/eaaDev/EventSourcing.html
>
> If all you are doing is storing events, then I don't think you need a
> relational database. Rather, an event log is ideal. Please see:
> http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
>
> There are many other datastores that can do a better job of storing your
> events. You can process your data and then store the results in a relational
> database to query later.
>
> On Sun, Oct 26, 2014 at 9:01 PM, Peter Wolf <opus...@gmail.com> wrote:
> Thanks for all the useful responses.
>
> We have the usual task of mining a stream of events coming from our many
> users. We need to store these events and process them. We use a standard
> Star Schema to represent our data.
>
> For the moment, it looks like we should store these events in SQL. When
> appropriate, we will do analysis with relational queries. Or, when
> appropriate, we will extract data into working sets in Spark.
>
> I imagine this is a pretty common use case for Spark.
>
> On Sun, Oct 26, 2014 at 10:05 AM, Rick Richardson <rick.richard...@gmail.com> wrote:
> Spark's API definitely covers all of the things that a relational database
> can do. It will probably outperform a relational star schema if all of your
> *working* data set can fit into RAM on your cluster.
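The event-sourcing idea Soumya and Michael point to can be sketched in a few lines of plain Python (no Spark required): events are immutable, append-only facts, and current state is derived by replaying the log. The event types and fields below are invented purely for illustration.

```python
from dataclasses import dataclass

# An event is an immutable fact about something that already happened.
@dataclass(frozen=True)
class Event:
    user: str
    action: str   # "deposit" / "withdraw" -- hypothetical event types
    amount: int

# The event log is append-only: past events are never updated or deleted.
log = [
    Event("alice", "deposit", 100),
    Event("alice", "withdraw", 30),
    Event("bob",   "deposit", 50),
]

def replay(events):
    """Derive current state (balances) by folding over the full event log."""
    balances = {}
    for e in events:
        delta = e.amount if e.action == "deposit" else -e.amount
        balances[e.user] = balances.get(e.user, 0) + delta
    return balances

print(replay(log))  # {'alice': 70, 'bob': 50}
```

Because the log itself is the source of truth, a batch layer (as in the Lambda Architecture) can recompute any derived view at any time simply by replaying from the start.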
> It will still perform
> quite well if most of the data fits and some has to spill over to disk.
>
> What are your requirements exactly?
> What is "massive amounts of data" exactly?
> How big is your cluster?
>
> Note that Spark is not for data storage, only data analysis. It pulls data
> into working data sets called RDDs.
>
> As a migration path, you could probably pull the data out of a relational
> database to analyze. But in the long run, I would recommend using a more
> purpose-built, huge-storage database such as Cassandra. If your data is very
> static, you could also just store it in files.
>
> On Oct 26, 2014 9:19 AM, "Peter Wolf" <opus...@gmail.com> wrote:
> My understanding is that Spark SQL allows one to access Spark data as if it
> were stored in a relational database. It compiles SQL queries into a series
> of calls to the Spark API.
>
> I need the performance of a SQL database, but I don't care about doing
> queries with SQL.
>
> I create the input to MLlib by doing a massive JOIN query. So, I am creating
> a single collection by combining many collections. This sort of operation is
> very inefficient in Mongo, Cassandra or HDFS.
>
> I could store my data in a relational database and copy the query results to
> Spark for processing. However, I was hoping I could keep everything in Spark.
>
> On Sat, Oct 25, 2014 at 11:34 PM, Soumya Simanta <soumya.sima...@gmail.com> wrote:
> 1. What data store do you want to store your data in? HDFS, HBase,
> Cassandra, S3 or something else?
> 2. Have you looked at Spark SQL (https://spark.apache.org/sql/)?
>
> One option is to process the data in Spark and then store it in the
> relational database of your choice.
>
> On Sat, Oct 25, 2014 at 11:18 PM, Peter Wolf <opus...@gmail.com> wrote:
> Hello all,
>
> We are considering Spark for our organization. It is obviously a superb
> platform for processing massive amounts of data... how about retrieving it?
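The join Peter describes, denormalizing a fact table against its dimension tables to build a single flat collection for MLlib, can be sketched as a hash join. This is plain Python for illustration only; in Spark the same shape would be a `join` on pair RDDs or a Spark SQL query. The table and column names here are invented.

```python
# Hypothetical star schema: one fact table plus two dimension tables.
fact_sales = [                         # (user_id, product_id, quantity)
    (1, 10, 2),
    (2, 11, 1),
    (1, 11, 3),
]
dim_users    = {1: "alice", 2: "bob"}    # user_id    -> name
dim_products = {10: "book", 11: "lamp"}  # product_id -> name

def denormalize(facts, users, products):
    """Hash join: resolve each fact row's foreign keys against the (small,
    in-memory) dimension tables, yielding the flat rows an ML pipeline
    would consume."""
    return [
        (users[u], products[p], qty)
        for (u, p, qty) in facts
        if u in users and p in products
    ]

rows = denormalize(fact_sales, dim_users, dim_products)
print(rows)  # [('alice', 'book', 2), ('bob', 'lamp', 1), ('alice', 'lamp', 3)]
```

This is also why the Spark model fits Rick's point about RAM: if the dimension tables fit in memory (in Spark terms, a broadcast join), the fact table can be streamed through once, which is exactly the access pattern that is awkward in Mongo or plain HDFS files.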
>
> We are currently storing our data in a relational database in a star schema.
> Retrieving our data requires doing many complicated joins across many tables.
>
> Can we use Spark as a relational database? Or, if not, can we put Spark on
> top of a relational database?
>
> Note that we don't care about SQL. Accessing our data via standard queries
> is nice, but we are equally happy (or happier) to write Scala code.
>
> What is important to us is doing relational queries on huge amounts of data.
> Is Spark good at this?
>
> Thank you very much in advance,
> Peter

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org