Given that you are storing event data (which is essentially a record of things that have happened in the past and cannot be modified), you should definitely look at Event Sourcing: http://martinfowler.com/eaaDev/EventSourcing.html
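As a minimal illustration of the pattern (the Account domain and the event types below are invented for this example, not something from this thread): in an event-sourced system, current state is just a fold over an append-only log of immutable events.

// Plain Scala sketch of the event-sourcing idea; domain and numbers are made up.
sealed trait Event
case class Deposited(amount: BigDecimal) extends Event
case class Withdrawn(amount: BigDecimal) extends Event

final case class Account(balance: BigDecimal = BigDecimal(0)) {
  // Applying an event produces a new state; events themselves are never changed.
  def applyEvent(e: Event): Account = e match {
    case Deposited(a) => copy(balance = balance + a)
    case Withdrawn(a) => copy(balance = balance - a)
  }
}

object EventSourcingSketch extends App {
  // The log is append-only: past events are only appended and replayed.
  val log: Vector[Event] = Vector(Deposited(100), Withdrawn(30), Deposited(5))

  // Current state is a pure function of the event history.
  val current = log.foldLeft(Account())((state, e) => state.applyEvent(e))
  println(current) // Account(75)
}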
If all you are doing is storing events, then I don't think you need a
relational database; an event log is ideal. Please see
http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying

There are many other datastores that can do a better job of storing your
events. You can process your data and then store the results in a
relational database to query later.

On Sun, Oct 26, 2014 at 9:01 PM, Peter Wolf <opus...@gmail.com> wrote:

> Thanks for all the useful responses.
>
> We have the usual task of mining a stream of events coming from our many
> users. We need to store these events and process them. We use a standard
> Star Schema to represent our data.
>
> For the moment, it looks like we should store these events in SQL. When
> appropriate, we will do analysis with relational queries. Or, when
> appropriate, we will extract data into working sets in Spark.
>
> I imagine this is a pretty common use case for Spark.
>
> On Sun, Oct 26, 2014 at 10:05 AM, Rick Richardson <rick.richard...@gmail.com> wrote:
>
>> Spark's API definitely covers all of the things that a relational
>> database can do. It will probably outperform a relational star schema if
>> all of your *working* data set can fit into RAM on your cluster. It will
>> still perform quite well if most of the data fits and some has to spill
>> over to disk.
>>
>> What are your requirements exactly?
>> What is "massive amounts of data" exactly?
>> How big is your cluster?
>>
>> Note that Spark is not for data storage, only data analysis. It pulls
>> data into working data sets called RDDs.
>>
>> As a migration path, you could probably pull the data out of a
>> relational database to analyze. But in the long run, I would recommend
>> using a more purpose-built, huge-storage database such as Cassandra. If
>> your data is very static, you could also just store it in files.
>>
>> On Oct 26, 2014 9:19 AM, "Peter Wolf" <opus...@gmail.com> wrote:
>>
>>> My understanding is that SparkSQL allows one to access Spark data as if
>>> it were stored in a relational database. It compiles SQL queries into a
>>> series of calls to the Spark API.
>>>
>>> I need the performance of a SQL database, but I don't care about doing
>>> queries with SQL.
>>>
>>> I create the input to MLlib by doing a massive JOIN query. So, I am
>>> creating a single collection by combining many collections. This sort
>>> of operation is very inefficient in Mongo, Cassandra or HDFS.
>>>
>>> I could store my data in a relational database, and copy the query
>>> results to Spark for processing. However, I was hoping I could keep
>>> everything in Spark.
>>>
>>> On Sat, Oct 25, 2014 at 11:34 PM, Soumya Simanta <soumya.sima...@gmail.com> wrote:
>>>
>>>> 1. What data store do you want to store your data in? HDFS, HBase,
>>>> Cassandra, S3 or something else?
>>>> 2. Have you looked at SparkSQL (https://spark.apache.org/sql/)?
>>>>
>>>> One option is to process the data in Spark and then store it in the
>>>> relational database of your choice.
>>>>
>>>> On Sat, Oct 25, 2014 at 11:18 PM, Peter Wolf <opus...@gmail.com> wrote:
>>>>
>>>>> Hello all,
>>>>>
>>>>> We are considering Spark for our organization. It is obviously a
>>>>> superb platform for processing massive amounts of data... how about
>>>>> retrieving it?
>>>>>
>>>>> We are currently storing our data in a relational database in a star
>>>>> schema. Retrieving our data requires doing many complicated joins
>>>>> across many tables.
>>>>>
>>>>> Can we use Spark as a relational database? Or, if not, can we put
>>>>> Spark on top of a relational database?
>>>>>
>>>>> Note that we don't care about SQL. Accessing our data via standard
>>>>> queries is nice, but we are equally happy (or more happy) to write
>>>>> Scala code.
>>>>>
>>>>> What is important to us is doing relational queries on huge amounts
>>>>> of data. Is Spark good at this?
>>>>>
>>>>> Thank you very much in advance,
>>>>> Peter
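The denormalising JOIN discussed above can be expressed directly in Spark. The sketch below is not from the thread: the table paths, column names and feature columns are invented, and it uses the DataFrame-based API of a later Spark release (in 2014-era Spark the same joins would be written against SchemaRDDs or plain RDDs). It shows a star-schema join done inside Spark, with the joined result assembled into MLlib feature vectors.

// Sketch only: star-schema join in Spark, feeding MLlib. Paths and columns are hypothetical.
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object StarSchemaJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("star-schema-join-sketch").getOrCreate()

    // Fact and dimension tables, e.g. exported from the relational store as Parquet.
    val events = spark.read.parquet("hdfs:///warehouse/fact_events") // fact table
    val users  = spark.read.parquet("hdfs:///warehouse/dim_users")   // dimension
    val items  = spark.read.parquet("hdfs:///warehouse/dim_items")   // dimension

    // The "massive JOIN": denormalise the star schema into one wide collection.
    val wide = events
      .join(users, "user_id")
      .join(items, "item_id")

    // Turn selected numeric columns into an MLlib feature vector.
    val assembler = new VectorAssembler()
      .setInputCols(Array("user_age", "item_price", "event_count"))
      .setOutputCol("features")
    val trainingInput = assembler.transform(wide).select("features")

    trainingInput.show(5)
    spark.stop()
  }
}

Registering the same tables with createOrReplaceTempView and writing the join as a spark.sql("SELECT ...") query yields an equivalent plan; as noted earlier in the thread, Spark SQL compiles such queries down to the same underlying Spark operations.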