Re: Spark as Relational Database

2014-10-27 Thread Peter Wolf
Great. Thank you very much Michael :-D
On Mon, Oct 27, 2014 at 2:03 PM, Michael Armbrust wrote:
> I'd suggest checking out the Spark SQL programming guide to answer this
> type of query:
> http://spark.apache.org/docs/latest/sql-programming-guide.html
>
> You could also perform it using the raw

Re: Spark as Relational Database

2014-10-27 Thread Michael Armbrust
I'd suggest checking out the Spark SQL programming guide to answer this type of query: http://spark.apache.org/docs/latest/sql-programming-guide.html You could also perform it using the raw Spark RDD API, but it's of

Re: Spark as Relational Database

2014-10-27 Thread Peter Wolf
I agree. I'd like to avoid SQL. If I could store everything in Cassandra or Mongo and process in Spark, that would be far preferable to creating a temporary Working Set. I'd like to write a performance test. Let's say I have two large collections A and B. Each collection has 2 columns and many m

Re: Spark as Relational Database

2014-10-26 Thread Michael Hausenblas
> Given that you are storing event data (which is basically things that have
> happened in the past AND cannot be modified) you should definitely look at
> Event sourcing.
> http://martinfowler.com/eaaDev/EventSourcing.html
Agreed. In this context: a lesser-known fact is that the Lambda Archi

Re: Spark as Relational Database

2014-10-26 Thread Ted Yu
Cassandra is only one of the NoSQL options. Don't forget there is HBase :-)
On Oct 26, 2014, at 6:21 PM, Rick Richardson wrote:
> I agree with Soumya. A relational database is usually the worst kind of
> database to receive a constant event stream.
>
> That said, the best solution is one th

Re: Spark as Relational Database

2014-10-26 Thread Rick Richardson
I agree with Soumya. A relational database is usually the worst kind of database to receive a constant event stream. That said, the best solution is one that already works :) If your system is meeting your needs, then great. When you get so many events that your db can't keep up, I'd look into C

Re: Spark as Relational Database

2014-10-26 Thread Soumya Simanta
Given that you are storing event data (which is basically things that have happened in the past AND cannot be modified) you should definitely look at Event sourcing. http://martinfowler.com/eaaDev/EventSourcing.html If all you are doing is storing events then I don't think you need a relational da

Re: Spark as Relational Database

2014-10-26 Thread Peter Wolf
Thanks for all the useful responses. We have the usual task of mining a stream of events coming from our many users. We need to store these events, and process them. We use a standard Star Schema to represent our data. For the moment, it looks like we should store these events in SQL. When app

Re: Spark as Relational Database

2014-10-26 Thread Helena Edelson
Hi, It is very easy to integrate Cassandra in a use case such as this. For instance, do your joins in Spark and store your data in Cassandra, which allows a very flexible schema, unlike a relational DB, and is much faster, fault tolerant, and with Spark and colocation WRT data locality
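As a rough illustration of the "joins in Spark" idea, here is a minimal sketch in plain Python, so it runs without a cluster. The `inner_join` helper and the `events` and `users` collections are hypothetical and do not come from the thread. The result shape, (key, (left_value, right_value)), mirrors what a join on key/value pair RDDs returns in Spark. This sketch uses a simple in-memory hash join, which is not necessarily how Spark executes it.

```python
from collections import defaultdict


def inner_join(left, right):
    """Hash join of two iterables of (key, value) pairs.

    Yields (key, (left_value, right_value)) for every matching pair.
    """
    # Build phase: index the right side by key.
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    # Probe phase: stream the left side and emit one row per match.
    for key, left_value in left:
        for right_value in index.get(key, ()):
            yield key, (left_value, right_value)


# Hypothetical star-schema fragment (assumption, not from the thread):
# events is the fact table, users is a dimension table, both keyed by user id.
events = [(1, "login"), (2, "click"), (1, "logout"), (3, "click")]
users = [(1, "alice"), (2, "bob")]

joined = list(inner_join(events, users))
```

Events whose key has no match in `users` (user id 3 here) are dropped, as in any inner join.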

Re: Spark as Relational Database

2014-10-26 Thread Soumya Simanta
@Peter - as Rick said - Spark's main usage is data analysis and not storage. Spark allows you to plug in different storage layers based on your use cases and quality attribute requirements. So in essence if your relational database is meeting your storage requirements you should think about how to

Re: Spark as Relational Database

2014-10-26 Thread Rick Richardson
Spark's API definitely covers all of the things that a relational database can do. It will probably outperform a relational star schema if all of your *working* data set can fit into RAM on your cluster. It will still perform quite well if most of the data fits and some has to spill over to disk.

Re: Spark as Relational Database

2014-10-26 Thread Peter Wolf
My understanding is that SparkSQL allows one to access Spark data as if it were stored in a relational database. It compiles SQL queries into a series of calls to the Spark API. I need the performance of a SQL database, but I don't care about doing queries with SQL. I create the input to MLlib by

Re: Spark as Relational Database

2014-10-25 Thread Soumya Simanta
1. What data store do you want to store your data in? HDFS, HBase, Cassandra, S3 or something else?
2. Have you looked at SparkSQL (https://spark.apache.org/sql/)?
One option is to process the data in Spark and then store it in the relational database of your choice. On Sat, Oct 25, 2014 at 1