I agree with Soumya. A relational database is usually the worst kind of
database to receive a constant event stream.

That said, the best solution is one that already works :)

If your system is meeting your needs, then great.  When you get so many
events that your database can't keep up, I'd look into Cassandra to receive
the events, and Spark to analyze them.
On Oct 26, 2014 9:14 PM, "Soumya Simanta" <soumya.sima...@gmail.com> wrote:

> Given that you are storing event data (which is basically things that have
> happened in the past AND cannot be modified) you should definitely look at
> event sourcing:
> http://martinfowler.com/eaaDev/EventSourcing.html
>
> If all you are doing is storing events then I don't think you need a
> relational database. Rather, an event log is ideal. Please see -
> http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
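The core idea behind event sourcing can be shown with a toy sketch: the log is append-only, and current state is derived by replaying it rather than by updating rows in place. The event names and fields below are invented for illustration, not taken from anyone's system in this thread.

```python
# Toy event-sourcing sketch: state is never mutated in place; it is
# rebuilt by folding over an append-only log of immutable past events.
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    user: str
    kind: str       # e.g. "signup", "purchase" (hypothetical event types)
    amount: int = 0

# The log only ever grows; events are facts and are never modified.
log = [
    Event("alice", "signup"),
    Event("alice", "purchase", 30),
    Event("bob", "signup"),
    Event("alice", "purchase", 12),
]

def replay(events):
    """Derive current state (total spend per user) by replaying the log."""
    state = {}
    for e in events:
        if e.kind == "purchase":
            state[e.user] = state.get(e.user, 0) + e.amount
    return state

print(replay(log))  # {'alice': 42}
```

Because the log is the source of truth, any derived view (totals, counts, sessions) can be recomputed later by replaying with a different fold.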
>
> There are many other datastores that can do a better job of storing your
> events. You can process your data and then store the results in a
> relational database to query later.
>
> On Sun, Oct 26, 2014 at 9:01 PM, Peter Wolf <opus...@gmail.com> wrote:
>
>> Thanks for all the useful responses.
>>
>> We have the usual task of mining a stream of events coming from our many
>> users.  We need to store these events, and process them.  We use a standard
>> Star Schema to represent our data.
>>
>> For the moment, it looks like we should store these events in SQL.  When
>> appropriate, we will do analysis with relational queries.  Or, when
>> appropriate, we will extract data into working sets in Spark.
>>
>> I imagine this is a pretty common use case for Spark.
>>
>> On Sun, Oct 26, 2014 at 10:05 AM, Rick Richardson <
>> rick.richard...@gmail.com> wrote:
>>
>>> Spark's API definitely covers all of the things that a relational
>>> database can do. It will probably outperform a relational star schema if
>>> your entire *working* data set fits into RAM on your cluster. It will
>>> still perform quite well if most of the data fits and some has to spill
>>> over to disk.
>>>
>>> What are your requirements exactly?
>>> What is massive amounts of data exactly?
>>> How big is your cluster?
>>>
>>> Note that Spark is not for data storage, only data analysis. It pulls
>>> data into working data sets called RDDs.
>>>
>>> As a migration path, you could probably pull the data out of a
>>> relational database to analyze. But in the long run, I would recommend
>>> using a purpose-built, large-scale storage database such as Cassandra.
>>> If your data is very static, you could also just store it in files.
>>>  On Oct 26, 2014 9:19 AM, "Peter Wolf" <opus...@gmail.com> wrote:
>>>
>>>> My understanding is that SparkSQL allows one to access Spark data as if
>>>> it were stored in a relational database.  It compiles SQL queries into a
>>>> series of calls to the Spark API.
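As a toy illustration of that point (this is not Spark's actual planner), a SQL equi-join can be lowered to operations on (key, value) pairs, which is essentially what an RDD join does: group each side by key, then emit the per-key cross product. The table contents below are invented.

```python
# Hand-rolled shuffle-style inner equi-join on (key, value) pairs,
# mimicking what RDD.join does after a SQL JOIN is lowered to it.
from collections import defaultdict

def join(left, right):
    """Inner equi-join: group each side by key, emit cross product per key."""
    l, r = defaultdict(list), defaultdict(list)
    for k, v in left:
        l[k].append(v)
    for k, v in right:
        r[k].append(v)
    return [(k, (lv, rv)) for k in l if k in r for lv in l[k] for rv in r[k]]

# Roughly: SELECT * FROM users u JOIN orders o ON u.id = o.user_id
users = [(1, "alice"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (3, "lamp")]
print(sorted(join(users, orders)))
# [(1, ('alice', 'book')), (1, ('alice', 'pen'))]
```

Unmatched keys on either side (bob, the lamp order) drop out, just as in an inner join.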
>>>>
>>>> I need the performance of a SQL database, but I don't care about doing
>>>> queries with SQL.
>>>>
>>>> I create the input to MLlib by doing a massive JOIN query.  So, I am
>>>> creating a single collection by combining many collections.  This sort of
>>>> operation is very inefficient in MongoDB, Cassandra, or HDFS.
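A minimal sketch of that kind of join, assuming a star schema: each fact-table row is joined against its dimension tables to produce one flat feature row suitable as ML input. All table contents and column names here are invented.

```python
# Flatten a tiny star schema: join facts against dimension tables to
# build one feature row per fact (the usual prep step before ML training).
user_dim = {1: {"country": "US"}, 2: {"country": "DE"}}
item_dim = {10: {"price": 5}, 11: {"price": 9}}

# Fact table rows: (user_id, item_id, quantity)
facts = [(1, 10, 2), (2, 11, 1), (1, 11, 3)]

def to_features(facts, user_dim, item_dim):
    """Look up each fact's dimensions and emit a flat feature row."""
    rows = []
    for user_id, item_id, qty in facts:
        u, i = user_dim[user_id], item_dim[item_id]
        rows.append({"country": u["country"], "price": i["price"], "qty": qty})
    return rows

rows = to_features(facts, user_dim, item_dim)
print(rows[0])  # {'country': 'US', 'price': 5, 'qty': 2}
```

At scale this is exactly the many-way join in question; the dimension lookups become joins (or broadcast joins, when a dimension fits in memory).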
>>>>
>>>> I could store my data in a relational database, and copy the query
>>>> results to Spark for processing.  However, I was hoping I could keep
>>>> everything in Spark.
>>>>
>>>> On Sat, Oct 25, 2014 at 11:34 PM, Soumya Simanta <
>>>> soumya.sima...@gmail.com> wrote:
>>>>
>>>>> 1. What data store do you want to store your data in? HDFS, HBase,
>>>>> Cassandra, S3 or something else?
>>>>> 2. Have you looked at SparkSQL (https://spark.apache.org/sql/)?
>>>>>
>>>>> One option is to process the data in Spark and then store it in the
>>>>> relational database of your choice.
>>>>>
>>>>> On Sat, Oct 25, 2014 at 11:18 PM, Peter Wolf <opus...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hello all,
>>>>>>
>>>>>> We are considering Spark for our organization.  It is obviously a
>>>>>> superb platform for processing massive amounts of data... how about
>>>>>> retrieving it?
>>>>>>
>>>>>> We are currently storing our data in a relational database in a star
>>>>>> schema.  Retrieving our data requires doing many complicated joins across
>>>>>> many tables.
>>>>>>
>>>>>> Can we use Spark as a relational database?  Or, if not, can we put
>>>>>> Spark on top of a relational database?
>>>>>>
>>>>>> Note that we don't care about SQL.  Accessing our data via standard
>>>>>> queries is nice, but we are equally happy (or happier) to write Scala
>>>>>> code.
>>>>>>
>>>>>> What is important to us is doing relational queries on huge amounts
>>>>>> of data.  Is Spark good at this?
>>>>>>
>>>>>> Thank you very much in advance
>>>>>> Peter
>>>>>>
