Cassandra is only one of the NoSQL options. Don't forget there is HBase :-)
On Oct 26, 2014, at 6:21 PM, Rick Richardson <rick.richard...@gmail.com> wrote:

> I agree with Soumya. A relational database is usually the worst kind of
> database to receive a constant event stream.
>
> That said, the best solution is one that already works :)
>
> If your system is meeting your needs, then great. When you get so many
> events that your db can't keep up, I'd look into Cassandra to receive the
> events, and Spark to analyze them.
>
> On Oct 26, 2014 9:14 PM, "Soumya Simanta" <soumya.sima...@gmail.com> wrote:
>> Given that you are storing event data (which is basically things that have
>> happened in the past AND cannot be modified), you should definitely look at
>> event sourcing:
>> http://martinfowler.com/eaaDev/EventSourcing.html
>>
>> If all you are doing is storing events, then I don't think you need a
>> relational database. Rather, an event log is ideal. Please see:
>> http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
>>
>> There are many other datastores that can do a better job of storing your
>> events. You can process your data and then store it in a relational
>> database to query later.
>>
>> On Sun, Oct 26, 2014 at 9:01 PM, Peter Wolf <opus...@gmail.com> wrote:
>>> Thanks for all the useful responses.
>>>
>>> We have the usual task of mining a stream of events coming from our many
>>> users. We need to store these events and process them. We use a standard
>>> star schema to represent our data.
>>>
>>> For the moment, it looks like we should store these events in SQL. When
>>> appropriate, we will do analysis with relational queries. Or, when
>>> appropriate, we will extract data into working sets in Spark.
>>>
>>> I imagine this is a pretty common use case for Spark.
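[Editor's note: the append-only event log that Soumya and the linked LinkedIn article describe can be sketched in a few lines of Scala. All names and fields below are invented for illustration; the thread does not describe a concrete schema.]

```scala
// Hypothetical event type; the fields are assumptions for illustration only.
final case class Event(userId: String, action: String, timestampMs: Long)

object EventLog {
  // An event log is append-only: events are immutable facts about the past
  // and are never updated in place.
  private var log = Vector.empty[Event]

  def append(e: Event): Unit = { log = log :+ e }

  // Current state is *derived* by folding over the history, so the same log
  // can later be replayed into Spark, a star schema, or any other store.
  def actionCountsByUser: Map[String, Int] =
    log.groupBy(_.userId).map { case (user, events) => (user, events.size) }
}
```

Because state is derived rather than stored, the log itself stays a simple sequential write, which is why it handles a constant event stream better than a normalized relational schema.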
>>> On Sun, Oct 26, 2014 at 10:05 AM, Rick Richardson
>>> <rick.richard...@gmail.com> wrote:
>>>> Spark's API definitely covers all of the things that a relational
>>>> database can do. It will probably outperform a relational star schema if
>>>> all of your *working* data set can fit into RAM on your cluster. It will
>>>> still perform quite well if most of the data fits and some has to spill
>>>> over to disk.
>>>>
>>>> What are your requirements, exactly?
>>>> What is "massive amounts of data," exactly?
>>>> How big is your cluster?
>>>>
>>>> Note that Spark is not for data storage, only data analysis. It pulls
>>>> data into working data sets called RDDs.
>>>>
>>>> As a migration path, you could probably pull the data out of a
>>>> relational database to analyze. But in the long run, I would recommend
>>>> using a more purpose-built, huge-storage database such as Cassandra. If
>>>> your data is very static, you could also just store it in files.
>>>>
>>>> On Oct 26, 2014 9:19 AM, "Peter Wolf" <opus...@gmail.com> wrote:
>>>>> My understanding is that Spark SQL allows one to access Spark data as
>>>>> if it were stored in a relational database. It compiles SQL queries
>>>>> into a series of calls to the Spark API.
>>>>>
>>>>> I need the performance of a SQL database, but I don't care about doing
>>>>> queries with SQL.
>>>>>
>>>>> I create the input to MLlib by doing a massive JOIN query. So, I am
>>>>> creating a single collection by combining many collections. This sort
>>>>> of operation is very inefficient in Mongo, Cassandra, or HDFS.
>>>>>
>>>>> I could store my data in a relational database and copy the query
>>>>> results to Spark for processing. However, I was hoping I could keep
>>>>> everything in Spark.
>>>>>
>>>>> On Sat, Oct 25, 2014 at 11:34 PM, Soumya Simanta
>>>>> <soumya.sima...@gmail.com> wrote:
>>>>>> 1. What data store do you want to store your data in? HDFS, HBase,
>>>>>> Cassandra, S3, or something else?
>>>>>> 2.
>>>>>> Have you looked at Spark SQL (https://spark.apache.org/sql/)?
>>>>>>
>>>>>> One option is to process the data in Spark and then store it in the
>>>>>> relational database of your choice.
>>>>>>
>>>>>> On Sat, Oct 25, 2014 at 11:18 PM, Peter Wolf <opus...@gmail.com> wrote:
>>>>>>> Hello all,
>>>>>>>
>>>>>>> We are considering Spark for our organization. It is obviously a
>>>>>>> superb platform for processing massive amounts of data... how about
>>>>>>> retrieving it?
>>>>>>>
>>>>>>> We are currently storing our data in a relational database in a star
>>>>>>> schema. Retrieving our data requires doing many complicated joins
>>>>>>> across many tables.
>>>>>>>
>>>>>>> Can we use Spark as a relational database? Or, if not, can we put
>>>>>>> Spark on top of a relational database?
>>>>>>>
>>>>>>> Note that we don't care about SQL. Accessing our data via standard
>>>>>>> queries is nice, but we are equally happy (or happier) to write
>>>>>>> Scala code.
>>>>>>>
>>>>>>> What is important to us is doing relational queries on huge amounts
>>>>>>> of data. Is Spark good at this?
>>>>>>>
>>>>>>> Thank you very much in advance,
>>>>>>> Peter
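[Editor's note: the "massive JOIN across many tables" discussed in this thread has the same shape as an ordinary key/value join followed by an aggregation. A minimal plain-Scala sketch of that shape, using invented table and column names (the thread gives no concrete schema); on a cluster the analogous operations would be Spark's pair-RDD `join` or an equivalent Spark SQL query.]

```scala
// Fact and dimension tables as plain Scala collections. Every name here is
// made up for illustration.
object StarSketch {
  val sales: Seq[(Int, Double)] =            // (productId, amount)
    Seq((10, 5.0), (10, 7.5), (20, 3.0))

  val products: Map[Int, String] =           // productId -> product name
    Map(10 -> "book", 20 -> "pen")

  // Join each fact row to its dimension row, then aggregate -- the same
  // shape as:
  //   SELECT p.name, SUM(s.amount)
  //   FROM sales s JOIN products p ON s.productId = p.id
  //   GROUP BY p.name
  val revenueByProduct: Map[String, Double] =
    sales
      .flatMap { case (pid, amt) => products.get(pid).map(name => (name, amt)) }
      .groupBy(_._1)
      .map { case (name, rows) => (name, rows.map(_._2).sum) }
}
```

The widened, flat collection produced by the join step is exactly the kind of single collection Peter describes building as input to MLlib; whether that join is fast then comes down to whether the working set fits in cluster memory, as Rick notes above.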