I'd suggest checking out the Spark SQL programming guide to answer this
type of query:
http://spark.apache.org/docs/latest/sql-programming-guide.html

You could also perform the query using the raw Spark RDD API
<http://spark.apache.org/docs/1.1.0/api/scala/index.html#org.apache.spark.rdd.RDD>,
but it's often the case that the in-memory columnar caching of Spark SQL is
faster and more space-efficient.
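For example, here is a minimal sketch of the Spark SQL route, written
against the Spark 1.1-era API (SQLContext / SchemaRDD) that the links above
document; the file paths and the Record case class are hypothetical:

    import org.apache.spark.sql.SQLContext

    case class Record(id: Long, value: String)

    val sqlContext = new SQLContext(sc)   // sc: an existing SparkContext
    import sqlContext.createSchemaRDD     // implicit RDD -> SchemaRDD conversion

    // Load the two collections and register them as temporary tables.
    val a = sc.textFile("a.csv").map(_.split(',')).map(f => Record(f(0).toLong, f(1)))
    val b = sc.textFile("b.csv").map(_.split(',')).map(f => Record(f(0).toLong, f(1)))
    a.registerTempTable("a")
    b.registerTempTable("b")
    sqlContext.cacheTable("a")            // in-memory columnar caching
    sqlContext.cacheTable("b")

    val joined = sqlContext.sql(
      "SELECT a.id, a.value, b.value FROM a JOIN b ON a.id = b.id")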

On Mon, Oct 27, 2014 at 6:27 AM, Peter Wolf <opus...@gmail.com> wrote:

> I agree.  I'd like to avoid SQL.
>
> If I could store everything in Cassandra or Mongo and process in Spark,
> that would be far preferable to creating a temporary Working Set.
>
> I'd like to write a performance test.  Let's say I have two large
> collections, A and B.  Each collection has two columns, Id and Value, and
> many, many rows.
>
> I want to create a third collection that is the equivalent of the SQL query
>
> select A.Id, A.Value, B.Value from A, B where A.Id = B.Id
>
> This new collection is the inner join of A and B.  It has three columns
> (A.Id, A.Value, B.Value) and one row for each Id that A and B have in common.
>
> Furthermore, this table is only needed temporarily, as part of processing.
> It needs to be created efficiently and accessed quickly.
>
> Can someone give me a pointer to the appropriate API and/or example code?
>
> Thanks again
> P
>
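For reference, the raw RDD version of the join Peter asks for is a
one-liner once both collections are keyed by Id; a sketch, with invented
sample data:

    import org.apache.spark.SparkContext._   // pair-RDD functions (pre-1.3)

    val a = sc.parallelize(Seq((1L, "a1"), (2L, "a2"), (3L, "a3")))  // (Id, A.Value)
    val b = sc.parallelize(Seq((2L, "b2"), (3L, "b3"), (4L, "b4")))  // (Id, B.Value)

    // Inner join on the key: one row per Id that A and B have in common.
    val joined = a.join(b)    // RDD[(Id, (A.Value, B.Value))]
    joined.cache()            // keep the temporary working set in memory

    // joined.collect() => Array((2,(a2,b2)), (3,(a3,b3)))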
> On Mon, Oct 27, 2014 at 1:04 AM, Michael Hausenblas <
> michael.hausenb...@gmail.com> wrote:
>
>>
>> > Given that you are storing event data (which is basically things that
>> have happened in the past AND cannot be modified) you should definitely
>> look at Event sourcing.
>> > http://martinfowler.com/eaaDev/EventSourcing.html
>>
>>
>> Agreed. In this context: a lesser-known fact is that the Lambda
>> Architecture is, in a nutshell, an extension of Fowler’s ES, so you might
>> also want to check out:
>>
>> https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark
>>
>>
>> Cheers,
>>                 Michael
>>
>> --
>> Michael Hausenblas
>> Ireland, Europe
>> http://mhausenblas.info/
>>
>> > On 27 Oct 2014, at 01:14, Soumya Simanta <soumya.sima...@gmail.com>
>> wrote:
>> >
>> > Given that you are storing event data (which is basically things that
>> have happened in the past AND cannot be modified) you should definitely
>> look at Event sourcing.
>> > http://martinfowler.com/eaaDev/EventSourcing.html
>> >
>> > If all you are doing is storing events, then I don't think you need a
>> relational database. Rather, an event log is ideal. Please see -
>> http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
>> >
>> > There are many other datastores that can do a better job at storing
>> your events. You can process your data and then store them in a relational
>> database to query later.
>> >
>> >
>> >
>> >
>> >
>> > On Sun, Oct 26, 2014 at 9:01 PM, Peter Wolf <opus...@gmail.com> wrote:
>> > Thanks for all the useful responses.
>> >
>> > We have the usual task of mining a stream of events coming from our
>> many users.  We need to store these events, and process them.  We use a
>> standard Star Schema to represent our data.
>> >
>> > For the moment, it looks like we should store these events in SQL.
>> When appropriate, we will do analysis with relational queries.  Or, when
>> appropriate, we will extract data into working sets in Spark.
>> >
>> > I imagine this is a pretty common use case for Spark.
>> >
>> > On Sun, Oct 26, 2014 at 10:05 AM, Rick Richardson <
>> rick.richard...@gmail.com> wrote:
>> > Spark's API definitely covers all of the things that a relational
>> database can do. It will probably outperform a relational star schema if
>> your entire *working* data set fits into RAM on your cluster. It will
>> still perform quite well if most of the data fits and some has to spill
>> over to disk.
>> >
>> > What are your requirements exactly?
>> > What is massive amounts of data exactly?
>> > How big is your cluster?
>> >
>> > Note that Spark is not for data storage, only data analysis. It pulls
>> data into working data sets called RDDs.
>> >
>> > As a migration path, you could probably pull the data out of a
>> relational database to analyze. But in the long run, I would recommend
>> using a purpose-built, large-scale storage database such as Cassandra. If
>> your data is very static, you could also just store it in files.
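The spill-over behaviour Rick describes is a function of the storage level
chosen when caching; a short sketch, with a hypothetical input path:

    import org.apache.spark.storage.StorageLevel

    val events = sc.textFile("hdfs:///data/events.log")  // hypothetical path
    // MEMORY_AND_DISK keeps partitions in RAM and spills the ones that do
    // not fit to local disk, instead of recomputing them on each access.
    events.persist(StorageLevel.MEMORY_AND_DISK)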
>> > On Oct 26, 2014 9:19 AM, "Peter Wolf" <opus...@gmail.com> wrote:
>> > My understanding is that Spark SQL allows one to access Spark data as if
>> it were stored in a relational database.  It compiles SQL queries into a
>> series of calls to the Spark API.
>> >
>> > I need the performance of a SQL database, but I don't care about doing
>> queries with SQL.
>> >
>> > I create the input to MLlib by doing a massive JOIN query.  So, I am
>> creating a single collection by combining many collections.  This sort of
>> operation is very inefficient in Mongo, Cassandra or HDFS.
>> >
>> > I could store my data in a relational database, and copy the query
>> results to Spark for processing.  However, I was hoping I could keep
>> everything in Spark.
>> >
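A sketch of the pattern Peter describes, with invented collections and
feature names: several keyed RDDs are joined into a single MLlib training
set, entirely inside Spark.

    import org.apache.spark.SparkContext._   // pair-RDD functions (pre-1.3)
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // Hypothetical collections, each keyed by the same id.
    val labels   = sc.parallelize(Seq((1L, 0.0), (2L, 1.0)))
    val feature1 = sc.parallelize(Seq((1L, 3.5), (2L, 7.1)))
    val feature2 = sc.parallelize(Seq((1L, 0.2), (2L, 0.9)))

    // Chain the joins, then build the MLlib input from the combined rows.
    val training = labels.join(feature1).join(feature2).map {
      case (id, ((label, f1), f2)) => LabeledPoint(label, Vectors.dense(f1, f2))
    }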
>> > On Sat, Oct 25, 2014 at 11:34 PM, Soumya Simanta <
>> soumya.sima...@gmail.com> wrote:
>> > 1. What data store do you want to store your data in ? HDFS, HBase,
>> Cassandra, S3 or something else?
>> > 2. Have you looked at SparkSQL (https://spark.apache.org/sql/)?
>> >
>> > One option is to process the data in Spark and then store it in the
>> relational database of your choice.
>> >
>> >
>> >
>> >
>> > On Sat, Oct 25, 2014 at 11:18 PM, Peter Wolf <opus...@gmail.com> wrote:
>> > Hello all,
>> >
>> > We are considering Spark for our organization.  It is obviously a
>> superb platform for processing massive amounts of data... how about
>> retrieving it?
>> >
>> > We are currently storing our data in a relational database in a star
>> schema.  Retrieving our data requires doing many complicated joins across
>> many tables.
>> >
>> > Can we use Spark as a relational database?  Or, if not, can we put
>> Spark on top of a relational database?
>> >
>> > Note that we don't care about SQL.  Accessing our data via standard
>> queries is nice, but we are equally happy (or happier) to write Scala
>> code.
>> >
>> > What is important to us is doing relational queries on huge amounts of
>> data.  Is Spark good at this?
>> >
>> > Thank you very much in advance
>> > Peter
>> >
>> >
>> >
>> >
>>
>>
>
