@Peter - as Rick said, Spark's main use is data analysis, not
storage.

Spark allows you to plug in different storage layers based on your use cases
and quality attribute requirements. So in essence, if your relational
database is meeting your storage requirements, you should think about how to
use it "with" Spark, because even if you decide not to use your
relational database you will still have to select some storage layer, most
likely a distributed one.
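
To make the "with" part concrete, here is a minimal sketch in Python. It
assumes a current Spark release with the JDBC data source and an existing
SparkSession; the URL, table, column, bounds and credentials are hypothetical
placeholders, and the helper names are mine, not part of Spark.

```python
def jdbc_options(url, table, user, password,
                 partition_column, lower_bound, upper_bound, num_partitions):
    """Build the option map for Spark's JDBC data source.

    partitionColumn/lowerBound/upperBound/numPartitions ask Spark to split
    the read into parallel range queries instead of one big query.
    """
    return {
        "url": url,
        "dbtable": table,
        "user": user,
        "password": password,
        "partitionColumn": partition_column,
        "lowerBound": str(lower_bound),
        "upperBound": str(upper_bound),
        "numPartitions": str(num_partitions),
    }


def load_table(spark, options):
    """Read one database table as a DataFrame; `spark` is a SparkSession."""
    return spark.read.format("jdbc").options(**options).load()


# Hypothetical connection details, for illustration only.
fact_opts = jdbc_options(
    url="jdbc:postgresql://db.example.com:5432/warehouse",
    table="sales_fact",
    user="spark_reader",
    password="change-me",
    partition_column="sale_id",
    lower_bound=1,
    upper_bound=10_000_000,
    num_partitions=16,
)

# With a running cluster and the JDBC driver on the classpath, roughly:
#   facts = load_table(spark, fact_opts)
#   facts.cache()  # keep the working set in cluster memory
```

The database stays the system of record; Spark only pulls the tables it needs
into a working set and does the joins there.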

Another option to think about is - "can you possibly restructure your data
schema so that you don't have to do such a large number of joins?". If this
is an option, then you can potentially think about using stores such as
Cassandra, HBase, HDFS, etc.
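
As a rough illustration of that restructuring, the sketch below does the
star-schema join once and materializes a wide, denormalized table. It uses
Python's built-in sqlite3 only as a stand-in for your relational database,
and the table and column names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE store_dim   (store_id   INTEGER PRIMARY KEY, region   TEXT);
    CREATE TABLE sales_fact  (sale_id INTEGER PRIMARY KEY,
                              product_id INTEGER, store_id INTEGER, amount REAL);
    INSERT INTO product_dim VALUES (1, 'books'), (2, 'games');
    INSERT INTO store_dim   VALUES (10, 'east'), (20, 'west');
    INSERT INTO sales_fact  VALUES (100, 1, 10, 12.5), (101, 2, 20, 40.0),
                                   (102, 1, 20, 7.5);
""")

# Pay for the star-schema join once, at write time ...
conn.execute("""
    CREATE TABLE sales_wide AS
    SELECT f.sale_id, p.category, s.region, f.amount
    FROM sales_fact f
    JOIN product_dim p ON p.product_id = f.product_id
    JOIN store_dim   s ON s.store_id   = f.store_id
""")

# ... so that later reads are join-free scans of one wide table.
wide_rows = conn.execute(
    "SELECT sale_id, category, region, amount FROM sales_wide ORDER BY sale_id"
).fetchall()
print(wide_rows)
```

Rows shaped like sales_wide are the kind of record you could export to
Cassandra, HBase or files on HDFS and read back without any joins. The cost
is duplicated dimension values and a rebuild whenever a dimension changes.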

Spark excels at processing large volumes of data quickly (given
enough memory) on horizontally scalable commodity hardware. As Rick pointed
out - "It will probably outperform a relational star schema if all of your
*working* data set can fit into RAM on your cluster." However, if your data
size is much larger than your cluster memory, you don't have a choice but to
select a datastore.

HTH

-Soumya

On Sun, Oct 26, 2014 at 10:05 AM, Rick Richardson <rick.richard...@gmail.com> wrote:

> Spark's API definitely covers all of the things that a relational database
> can do. It will probably outperform a relational star schema if all of your
> *working* data set can fit into RAM on your cluster. It will still perform
> quite well if most of the data fits and some has to spill over to disk.
>
> What are your requirements exactly?
> What is massive amounts of data exactly?
> How big is your cluster?
>
> Note that Spark is not for data storage, only data analysis. It pulls data
> into working data sets called RDDs.
>
> As a migration path, you could probably pull the data out of a relational
> database to analyze. But in the long run, I would recommend using a more
> purpose-built, large-scale storage database such as Cassandra. If your data is
> very static, you could also just store it in files.
>  On Oct 26, 2014 9:19 AM, "Peter Wolf" <opus...@gmail.com> wrote:
>
>> My understanding is that SparkSQL allows one to access Spark data as if it
>> were stored in a relational database.  It compiles SQL queries into a
>> series of calls to the Spark API.
>>
>> I need the performance of a SQL database, but I don't care about doing
>> queries with SQL.
>>
>> I create the input to MLlib by doing a massive JOIN query.  So, I am
>> creating a single collection by combining many collections.  This sort of
>> operation is very inefficient in Mongo, Cassandra or HDFS.
>>
>> I could store my data in a relational database, and copy the query
>> results to Spark for processing.  However, I was hoping I could keep
>> everything in Spark.
>>
>> On Sat, Oct 25, 2014 at 11:34 PM, Soumya Simanta <
>> soumya.sima...@gmail.com> wrote:
>>
>>> 1. What data store do you want to store your data in? HDFS, HBase,
>>> Cassandra, S3 or something else?
>>> 2. Have you looked at SparkSQL (https://spark.apache.org/sql/)?
>>>
>>> One option is to process the data in Spark and then store it in the
>>> relational database of your choice.
>>>
>>>
>>>
>>>
>>> On Sat, Oct 25, 2014 at 11:18 PM, Peter Wolf <opus...@gmail.com> wrote:
>>>
>>>> Hello all,
>>>>
>>>> We are considering Spark for our organization.  It is obviously a
>>>> superb platform for processing massive amounts of data... how about
>>>> retrieving it?
>>>>
>>>> We are currently storing our data in a relational database in a star
>>>> schema.  Retrieving our data requires doing many complicated joins across
>>>> many tables.
>>>>
>>>> Can we use Spark as a relational database?  Or, if not, can we put
>>>> Spark on top of a relational database?
>>>>
>>>> Note that we don't care about SQL.  Accessing our data via standard
>>>> queries is nice, but we are equally happy (or more happy) to write Scala
>>>> code.
>>>>
>>>> What is important to us is doing relational queries on huge amounts of
>>>> data.  Is Spark good at this?
>>>>
>>>> Thank you very much in advance
>>>> Peter
>>>>
>>>
>>>
>>
