> Given that you are storing event data (which is basically things that have
> happened in the past AND cannot be modified) you should definitely look at
> Event sourcing.
> http://martinfowler.com/eaaDev/EventSourcing.html
Agreed. In this context: a lesser known fact is that the Lambda Architecture is, in a nutshell, an extension of Fowler's ES, so you might also want to check out: https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark

Cheers,
Michael

--
Michael Hausenblas
Ireland, Europe
http://mhausenblas.info/

> On 27 Oct 2014, at 01:14, Soumya Simanta <soumya.sima...@gmail.com> wrote:
>
> Given that you are storing event data (which is basically things that have
> happened in the past AND cannot be modified) you should definitely look at
> Event sourcing.
> http://martinfowler.com/eaaDev/EventSourcing.html
>
> If all you are doing is storing events, then I don't think you need a
> relational database. Rather, an event log is ideal. Please see:
> http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
>
> There are many other datastores that can do a better job of storing your
> events. You can process your data and then store the results in a relational
> database to query later.
>
> On Sun, Oct 26, 2014 at 9:01 PM, Peter Wolf <opus...@gmail.com> wrote:
> Thanks for all the useful responses.
>
> We have the usual task of mining a stream of events coming from our many
> users. We need to store these events and process them. We use a standard
> Star Schema to represent our data.
>
> For the moment, it looks like we should store these events in SQL. When
> appropriate, we will do analysis with relational queries. Or, when
> appropriate, we will extract data into working sets in Spark.
>
> I imagine this is a pretty common use case for Spark.
>
> On Sun, Oct 26, 2014 at 10:05 AM, Rick Richardson <rick.richard...@gmail.com> wrote:
> Spark's API definitely covers all of the things that a relational database
> can do. It will probably outperform a relational star schema if all of your
> *working* data set can fit into RAM on your cluster.
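The event-sourcing idea Soumya and Michael point to can be sketched in a few lines of plain Python (no Spark required): events are immutable, append-only facts, and current state is derived by replaying the log. The event types and fields below are invented purely for illustration.

```python
from dataclasses import dataclass

# An event is an immutable fact about something that already happened.
@dataclass(frozen=True)
class Event:
    user: str
    action: str   # "deposit" / "withdraw" -- hypothetical event types
    amount: int

# The event log is append-only: past events are never updated or deleted.
log = [
    Event("alice", "deposit", 100),
    Event("alice", "withdraw", 30),
    Event("bob",   "deposit", 50),
]

def replay(events):
    """Derive current state (balances) by folding over the full event log."""
    balances = {}
    for e in events:
        delta = e.amount if e.action == "deposit" else -e.amount
        balances[e.user] = balances.get(e.user, 0) + delta
    return balances

print(replay(log))  # {'alice': 70, 'bob': 50}
```

Because the log itself is the source of truth, a batch layer (as in the Lambda Architecture) can recompute any derived view at any time simply by replaying from the start.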
> It will still perform
> quite well if most of the data fits and some has to spill over to disk.
>
> What are your requirements exactly?
> What is "massive amounts of data" exactly?
> How big is your cluster?
>
> Note that Spark is not for data storage, only data analysis. It pulls data
> into working data sets called RDDs.
>
> As a migration path, you could probably pull the data out of a relational
> database to analyze. But in the long run, I would recommend using a more
> purpose-built, huge-storage database such as Cassandra. If your data is very
> static, you could also just store it in files.
>
> On Oct 26, 2014 9:19 AM, "Peter Wolf" <opus...@gmail.com> wrote:
> My understanding is that Spark SQL allows one to access Spark data as if it
> were stored in a relational database. It compiles SQL queries into a series
> of calls to the Spark API.
>
> I need the performance of a SQL database, but I don't care about doing
> queries with SQL.
>
> I create the input to MLlib by doing a massive JOIN query. So, I am creating
> a single collection by combining many collections. This sort of operation is
> very inefficient in Mongo, Cassandra or HDFS.
>
> I could store my data in a relational database and copy the query results to
> Spark for processing. However, I was hoping I could keep everything in Spark.
>
> On Sat, Oct 25, 2014 at 11:34 PM, Soumya Simanta <soumya.sima...@gmail.com> wrote:
> 1. What data store do you want to store your data in? HDFS, HBase,
> Cassandra, S3 or something else?
> 2. Have you looked at Spark SQL (https://spark.apache.org/sql/)?
>
> One option is to process the data in Spark and then store it in the
> relational database of your choice.
>
> On Sat, Oct 25, 2014 at 11:18 PM, Peter Wolf <opus...@gmail.com> wrote:
> Hello all,
>
> We are considering Spark for our organization. It is obviously a superb
> platform for processing massive amounts of data... how about retrieving it?
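The join Peter describes, denormalizing a fact table against its dimension tables to build a single flat collection for MLlib, can be sketched as a hash join. This is plain Python for illustration only; in Spark the same shape would be a `join` on pair RDDs or a Spark SQL query. The table and column names here are invented.

```python
# Hypothetical star schema: one fact table plus two dimension tables.
fact_sales = [                         # (user_id, product_id, quantity)
    (1, 10, 2),
    (2, 11, 1),
    (1, 11, 3),
]
dim_users    = {1: "alice", 2: "bob"}    # user_id    -> name
dim_products = {10: "book", 11: "lamp"}  # product_id -> name

def denormalize(facts, users, products):
    """Hash join: resolve each fact row's foreign keys against the (small,
    in-memory) dimension tables, yielding the flat rows an ML pipeline
    would consume."""
    return [
        (users[u], products[p], qty)
        for (u, p, qty) in facts
        if u in users and p in products
    ]

rows = denormalize(fact_sales, dim_users, dim_products)
print(rows)  # [('alice', 'book', 2), ('bob', 'lamp', 1), ('alice', 'lamp', 3)]
```

This is also why the Spark model fits Rick's point about RAM: if the dimension tables fit in memory (in Spark terms, a broadcast join), the fact table can be streamed through once, which is exactly the access pattern that is awkward in Mongo or plain HDFS files.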
>
> We are currently storing our data in a relational database in a star schema.
> Retrieving our data requires doing many complicated joins across many tables.
>
> Can we use Spark as a relational database? Or, if not, can we put Spark on
> top of a relational database?
>
> Note that we don't care about SQL. Accessing our data via standard queries
> is nice, but we are equally happy (or happier) to write Scala code.
>
> What is important to us is doing relational queries on huge amounts of data.
> Is Spark good at this?
>
> Thank you very much in advance,
> Peter

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org