I agree.  I'd like to avoid SQL.

If I could store everything in Cassandra or Mongo and process it in Spark,
that would be far preferable to creating a temporary working set.

I'd like to write a performance test.  Let's say I have two large
collections, A and B.  Each collection has two columns, Id and Value, and
many, many rows.

I want to create a third collection that is the equivalent of the SQL query

select A.Id, A.Value, B.Value from A inner join B on A.Id = B.Id

This new collection is the inner join of A and B.  It has three columns
(A.Id, A.Value, B.Value) and one row for each Id that A and B have in common.

Furthermore, this collection is only needed temporarily as part of
processing.  It needs to be created efficiently and accessed quickly.
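
Roughly, I imagine something like the sketch below, using the pair-RDD
join API.  The tiny in-memory datasets are placeholders (in practice A
and B would be loaded through the Cassandra or Mongo connectors), so
treat the loading code as illustrative only:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._ // pair-RDD implicits (pre-1.3)
    import org.apache.spark.rdd.RDD

    object JoinPerfTest {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("JoinPerfTest"))

        // Placeholder data: (Id, Value) pairs for collections A and B.
        val a: RDD[(Long, String)] =
          sc.parallelize(Seq((1L, "a1"), (2L, "a2"), (3L, "a3")))
        val b: RDD[(Long, String)] =
          sc.parallelize(Seq((2L, "b2"), (4L, "b4")))

        // Inner join on Id: one row per Id present in both A and B,
        // shaped as (Id, (A.Value, B.Value)).
        val joined: RDD[(Long, (String, String))] = a.join(b)

        // Cache the temporary result so downstream processing is fast.
        joined.cache()
        joined.collect().foreach(println)

        sc.stop()
      }
    }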

Can someone give me a pointer to the appropriate API and/or example code?

Thanks again
P

On Mon, Oct 27, 2014 at 1:04 AM, Michael Hausenblas <
michael.hausenb...@gmail.com> wrote:

>
> > Given that you are storing event data (which is basically things that
> have happened in the past AND cannot be modified) you should definitely
> look at Event sourcing.
> > http://martinfowler.com/eaaDev/EventSourcing.html
>
>
> Agreed. In this context: a lesser-known fact is that the Lambda
> Architecture is, in a nutshell, an extension of Fowler’s ES, so you might
> also want to check out:
>
> https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark
>
>
> Cheers,
>                 Michael
>
> --
> Michael Hausenblas
> Ireland, Europe
> http://mhausenblas.info/
>
> > On 27 Oct 2014, at 01:14, Soumya Simanta <soumya.sima...@gmail.com>
> wrote:
> >
> > Given that you are storing event data (which is basically things that
> have happened in the past AND cannot be modified) you should definitely
> look at Event sourcing.
> > http://martinfowler.com/eaaDev/EventSourcing.html
> >
> > If all you are doing is storing events, then I don't think you need a
> relational database.  Rather, an event log is ideal.  Please see:
> http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
> >
> > There are many other datastores that can do a better job of storing your
> events.  You can process your data and then store the results in a
> relational database to query later.
> >
> >
> >
> >
> >
> > On Sun, Oct 26, 2014 at 9:01 PM, Peter Wolf <opus...@gmail.com> wrote:
> > Thanks for all the useful responses.
> >
> > We have the usual task of mining a stream of events coming from our many
> users.  We need to store these events, and process them.  We use a standard
> Star Schema to represent our data.
> >
> > For the moment, it looks like we should store these events in SQL.  We
> will do analysis with relational queries or extract data into working
> sets in Spark, as appropriate.
> >
> > I imagine this is a pretty common use case for Spark.
> >
> > On Sun, Oct 26, 2014 at 10:05 AM, Rick Richardson <
> rick.richard...@gmail.com> wrote:
> > Spark's API definitely covers all of the things that a relational
> database can do. It will probably outperform a relational star schema if
> all of your *working* data set can fit into RAM on your cluster. It will
> still perform quite well if most of the data fits and some has to spill
> over to disk.
> >
> > What are your requirements exactly?
> > What is massive amounts of data exactly?
> > How big is your cluster?
> >
> > Note that Spark is not for data storage, only data analysis.  It pulls
> data into working data sets called RDDs.
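> >
> > For example (hypothetical path, just to illustrate the model; sc is an
> > existing SparkContext):
> >
> >     val events = sc.textFile("hdfs:///events/*.log") // lazily defines an RDD
> >     events.cache() // keeps the working set in cluster memory once computed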
> >
> > As a migration path, you could probably pull the data out of a
> relational database to analyze.  But in the long run, I would recommend
> using a purpose-built, large-scale datastore such as Cassandra.  If your
> data is very static, you could also just store it in files.
> > On Oct 26, 2014 9:19 AM, "Peter Wolf" <opus...@gmail.com> wrote:
> > My understanding is that SparkSQL allows one to access Spark data as if
> it were stored in a relational database.  It compiles SQL queries into a
> series of calls to the Spark API.
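> >
> > For example, something like this sketch (based on my reading of the
> > Spark 1.1 docs; the "events" table and its schema are made up):
> >
> >     val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> >     import sqlContext.createSchemaRDD // implicit RDD -> SchemaRDD
> >
> >     case class Event(id: Long, value: String) // hypothetical schema
> >     val events = sc.parallelize(Seq(Event(1L, "x"), Event(2L, "y")))
> >     events.registerTempTable("events")
> >
> >     // The query is planned as ordinary Spark operations over the RDD.
> >     sqlContext.sql("SELECT id, value FROM events WHERE id > 1")
> >       .collect().foreach(println)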
> >
> > I need the performance of a SQL database, but I don't care about doing
> queries with SQL.
> >
> > I create the input to MLlib by doing a massive JOIN query.  So, I am
> creating a single collection by combining many collections.  This sort of
> operation is very inefficient in Mongo, Cassandra or HDFS.
> >
> > I could store my data in a relational database, and copy the query
> results to Spark for processing.  However, I was hoping I could keep
> everything in Spark.
> >
> > On Sat, Oct 25, 2014 at 11:34 PM, Soumya Simanta <
> soumya.sima...@gmail.com> wrote:
> > 1. What data store do you want to store your data in? HDFS, HBase,
> Cassandra, S3 or something else?
> > 2. Have you looked at SparkSQL (https://spark.apache.org/sql/)?
> >
> > One option is to process the data in Spark and then store it in the
> relational database of your choice.
> >
> >
> >
> >
> > On Sat, Oct 25, 2014 at 11:18 PM, Peter Wolf <opus...@gmail.com> wrote:
> > Hello all,
> >
> > We are considering Spark for our organization.  It is obviously a superb
> platform for processing massive amounts of data... how about retrieving it?
> >
> > We are currently storing our data in a relational database in a star
> schema.  Retrieving our data requires doing many complicated joins across
> many tables.
> >
> > Can we use Spark as a relational database?  Or, if not, can we put Spark
> on top of a relational database?
> >
> > Note that we don't care about SQL.  Accessing our data via standard
> queries is nice, but we are equally happy (or happier) to write Scala
> code.
> >
> > What is important to us is doing relational queries on huge amounts of
> data.  Is Spark good at this?
> >
> > Thank you very much in advance
> > Peter
> >
> >
> >
> >
>
>
