I agree. I'd like to avoid SQL. If I could store everything in Cassandra or Mongo and process it in Spark, that would be far preferable to creating a temporary working set.
I'd like to write a performance test. Let's say I have two large collections, A and B. Each collection has 2 columns and many, many rows. The columns are Id and Value. I want to create a third collection that is the equivalent of the SQL query

    SELECT A.Id, A.Value, B.Value FROM A, B WHERE A.Id = B.Id

This new collection is the inner join of A and B. It has 3 columns (A.Id, A.Value, B.Value) and one row for each Id that A and B have in common. Furthermore, this table is only needed temporarily as part of processing. It needs to be created efficiently and accessed quickly.

Can someone give me a pointer to the appropriate API and/or example code? (Two rough sketches of what I have in mind are at the bottom of this mail.)

Thanks again
P

On Mon, Oct 27, 2014 at 1:04 AM, Michael Hausenblas <michael.hausenb...@gmail.com> wrote:

>
> > Given that you are storing event data (which is basically things that have happened in the past AND cannot be modified) you should definitely look at Event sourcing.
> > http://martinfowler.com/eaaDev/EventSourcing.html
>
> Agreed. In this context: a lesser-known fact is that the Lambda Architecture is, in a nutshell, an extension of Fowler's ES, so you might also want to check out:
>
> https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark
>
> Cheers,
> Michael
>
> --
> Michael Hausenblas
> Ireland, Europe
> http://mhausenblas.info/
>
> On 27 Oct 2014, at 01:14, Soumya Simanta <soumya.sima...@gmail.com> wrote:
>
> > Given that you are storing event data (which is basically things that have happened in the past AND cannot be modified) you should definitely look at Event sourcing.
> > http://martinfowler.com/eaaDev/EventSourcing.html
> >
> > If all you are doing is storing events then I don't think you need a relational database. Rather, an event log is ideal. Please see http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
> >
> > There are many other datastores that can do a better job of storing your events. You can process your data and then store the results in a relational database to query later.
> >
> > On Sun, Oct 26, 2014 at 9:01 PM, Peter Wolf <opus...@gmail.com> wrote:
> > Thanks for all the useful responses.
> >
> > We have the usual task of mining a stream of events coming from our many users. We need to store these events and process them. We use a standard star schema to represent our data.
> >
> > For the moment, it looks like we should store these events in SQL. When appropriate, we will do analysis with relational queries. Or, when appropriate, we will extract data into working sets in Spark.
> >
> > I imagine this is a pretty common use case for Spark.
> >
> > On Sun, Oct 26, 2014 at 10:05 AM, Rick Richardson <rick.richard...@gmail.com> wrote:
> > Spark's API definitely covers all of the things that a relational database can do. It will probably outperform a relational star schema if all of your *working* data set can fit into RAM on your cluster. It will still perform quite well if most of the data fits and some has to spill over to disk.
> >
> > What are your requirements exactly?
> > What is "massive amounts of data" exactly?
> > How big is your cluster?
> >
> > Note that Spark is not for data storage, only data analysis. It pulls data into working data sets called RDDs.
> >
> > As a migration path, you could probably pull the data out of a relational database to analyze. But in the long run, I would recommend using a more purpose-built, huge-storage database such as Cassandra.
> > If your data is very static, you could also just store it in files.
> >
> > On Oct 26, 2014 9:19 AM, "Peter Wolf" <opus...@gmail.com> wrote:
> > My understanding is that SparkSQL allows one to access Spark data as if it were stored in a relational database. It compiles SQL queries into a series of calls to the Spark API.
> >
> > I need the performance of a SQL database, but I don't care about doing queries with SQL.
> >
> > I create the input to MLlib by doing a massive JOIN query. So, I am creating a single collection by combining many collections. This sort of operation is very inefficient in Mongo, Cassandra or HDFS.
> >
> > I could store my data in a relational database and copy the query results to Spark for processing. However, I was hoping I could keep everything in Spark.
> >
> > On Sat, Oct 25, 2014 at 11:34 PM, Soumya Simanta <soumya.sima...@gmail.com> wrote:
> > 1. What data store do you want to store your data in? HDFS, HBase, Cassandra, S3 or something else?
> > 2. Have you looked at SparkSQL (https://spark.apache.org/sql/)?
> >
> > One option is to process the data in Spark and then store it in the relational database of your choice.
> >
> > On Sat, Oct 25, 2014 at 11:18 PM, Peter Wolf <opus...@gmail.com> wrote:
> > Hello all,
> >
> > We are considering Spark for our organization. It is obviously a superb platform for processing massive amounts of data... how about retrieving it?
> >
> > We are currently storing our data in a relational database in a star schema. Retrieving our data requires doing many complicated joins across many tables.
> >
> > Can we use Spark as a relational database? Or, if not, can we put Spark on top of a relational database?
> >
> > Note that we don't care about SQL. Accessing our data via standard queries is nice, but we are equally happy (or happier) to write Scala code.
> >
> > What is important to us is doing relational queries on huge amounts of data. Is Spark good at this?
> >
> > Thank you very much in advance,
> > Peter
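P.S. To make the question concrete, here is a minimal sketch of what I have in mind, using the plain RDD API (pair RDDs and join()). The collection contents are made up, and I'm not sure this is the idiomatic way to do it:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._ // implicits that add join() to pair RDDs
    import org.apache.spark.rdd.RDD

    object JoinSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local[*]", "join-sketch")

        // Made-up stand-ins for the two large (Id, Value) collections
        val a: RDD[(Long, String)] = sc.parallelize(Seq((1L, "a1"), (2L, "a2"), (3L, "a3")))
        val b: RDD[(Long, String)] = sc.parallelize(Seq((2L, "b2"), (3L, "b3"), (4L, "b4")))

        // Inner join on Id: one row per Id that A and B have in common,
        // i.e. SELECT A.Id, A.Value, B.Value FROM A, B WHERE A.Id = B.Id
        val joined: RDD[(Long, (String, String))] = a.join(b)

        // The result is only needed temporarily, so keep it in memory
        // rather than writing it out to a store
        joined.cache()

        joined.collect().foreach { case (id, (aValue, bValue)) =>
          println(s"$id\t$aValue\t$bValue")
        }

        sc.stop()
      }
    }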
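P.P.S. And the same join through SparkSQL's temporary tables, in case that is the recommended route. This assumes the current (Spark 1.1) SchemaRDD API, and again all the names are placeholders:

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SQLContext

    // Placeholder row types for the two collections
    case class ARow(id: Long, aValue: String)
    case class BRow(id: Long, bValue: String)

    object SqlJoinSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local[*]", "sql-join-sketch")
        val sqlContext = new SQLContext(sc)
        import sqlContext.createSchemaRDD // implicitly turns an RDD of case classes into a SchemaRDD

        val a = sc.parallelize(Seq(ARow(1L, "a1"), ARow(2L, "a2"), ARow(3L, "a3")))
        val b = sc.parallelize(Seq(BRow(2L, "b2"), BRow(3L, "b3"), BRow(4L, "b4")))

        // Register the RDDs as temporary tables; they live only as long as this SQLContext
        a.registerTempTable("A")
        b.registerTempTable("B")

        // The same inner join, written in SQL but executed as Spark operations
        val joined = sqlContext.sql("SELECT A.id, A.aValue, B.bValue FROM A JOIN B ON A.id = B.id")

        joined.collect().foreach(println)
        sc.stop()
      }
    }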