I'd suggest checking out the Spark SQL programming guide to answer this type of query: http://spark.apache.org/docs/latest/sql-programming-guide.html

You could also perform it using the raw Spark RDD API <http://spark.apache.org/docs/1.1.0/api/scala/index.html#org.apache.spark.rdd.RDD>, but it's often the case that the in-memory columnar caching of Spark SQL is faster and more space efficient.
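Concretely, both approaches look something like this for the two (Id, Value) collections Peter describes below. This is an untested sketch against the Spark 1.1 APIs: the sample data, the Record case class, and the local master are stand-ins for however you actually load and run things.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._      // brings in PairRDDFunctions (join) pre-1.3
    import org.apache.spark.sql.SQLContext

    // Hypothetical (Id, Value) row type standing in for Peter's collections.
    case class Record(id: Long, value: String)

    object JoinSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("join-test").setMaster("local[*]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.createSchemaRDD       // implicit RDD[Record] -> SchemaRDD (1.1)

        // Stand-in data; in practice A and B come from wherever they live.
        val a = sc.parallelize(Seq(Record(1, "a1"), Record(2, "a2"), Record(3, "a3")))
        val b = sc.parallelize(Seq(Record(1, "b1"), Record(3, "b3"), Record(4, "b4")))

        // Option 1: raw RDD API. Key both sides by id, then inner join.
        // Result type: RDD[(Long, (String, String))] == (Id, (A.Value, B.Value))
        val joinedRdd = a.map(r => (r.id, r.value)).join(b.map(r => (r.id, r.value)))
        joinedRdd.take(3).foreach(println)

        // Option 2: Spark SQL, with the columnar in-memory cache.
        a.registerTempTable("A")
        b.registerTempTable("B")
        sqlContext.cacheTable("A")
        sqlContext.cacheTable("B")
        val joinedSql = sqlContext.sql(
          "SELECT A.id, A.value AS aValue, B.value AS bValue " +
          "FROM A JOIN B ON A.id = B.id")

        joinedSql.collect().foreach(println)
        sc.stop()
      }
    }

The RDD version gives you pairs keyed by Id; the SQL version gives you the three named columns and gets the compressed columnar cache for free.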
On Mon, Oct 27, 2014 at 6:27 AM, Peter Wolf <opus...@gmail.com> wrote:

> I agree. I'd like to avoid SQL.
>
> If I could store everything in Cassandra or Mongo and process it in Spark,
> that would be far preferable to creating a temporary working set.
>
> I'd like to write a performance test. Let's say I have two large
> collections, A and B. Each collection has two columns and many, many rows.
> The columns are Id and Value.
>
> I want to create a third collection that is the equivalent of the SQL query
>
>     select A.Id, A.Value, B.Value from A, B where A.Id = B.Id
>
> This new collection is the inner join of A and B. It has three columns
> (A.Id, A.Value, B.Value) and one row for each Id that A and B have in
> common.
>
> Furthermore, this table is only needed temporarily, as part of processing.
> It needs to be created efficiently and accessed quickly.
>
> Can someone give me a pointer to the appropriate API and/or example code?
>
> Thanks again
> P
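On the "needed temporarily... created efficiently and accessed quickly" part: compute the join once, pin it in memory, and drop it when you're done. Continuing the sketch above, and assuming the joinedSql result from it:

    import org.apache.spark.storage.StorageLevel

    // Assuming the joinedSql SchemaRDD from the sketch above: pin it in memory,
    // spilling partitions to disk rather than failing if RAM runs short.
    joinedSql.persist(StorageLevel.MEMORY_AND_DISK)
    joinedSql.count()      // force materialization of the cache

    // ... downstream processing that reuses the joined working set ...

    joinedSql.unpersist()  // release it once the temporary table is no longer needed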
> On Mon, Oct 27, 2014 at 1:04 AM, Michael Hausenblas <michael.hausenb...@gmail.com> wrote:
>
>> > Given that you are storing event data (which is basically things that
>> > have happened in the past AND cannot be modified) you should definitely
>> > look at Event sourcing.
>> > http://martinfowler.com/eaaDev/EventSourcing.html
>>
>> Agreed. In this context: a lesser-known fact is that the Lambda
>> Architecture is, in a nutshell, an extension of Fowler's ES, so you might
>> also want to check out:
>> https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark
>>
>> Cheers,
>> Michael
>>
>> --
>> Michael Hausenblas
>> Ireland, Europe
>> http://mhausenblas.info/
>>
>> > On 27 Oct 2014, at 01:14, Soumya Simanta <soumya.sima...@gmail.com> wrote:
>> >
>> > Given that you are storing event data (which is basically things that
>> > have happened in the past AND cannot be modified) you should definitely
>> > look at Event sourcing.
>> > http://martinfowler.com/eaaDev/EventSourcing.html
>> >
>> > If all you are doing is storing events, then I don't think you need a
>> > relational database. Rather, an event log is ideal. Please see:
>> > http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
>> >
>> > There are many other datastores that can do a better job of storing
>> > your events. You can process your data and then store the results in a
>> > relational database to query later.
>> >
>> > On Sun, Oct 26, 2014 at 9:01 PM, Peter Wolf <opus...@gmail.com> wrote:
>> > Thanks for all the useful responses.
>> >
>> > We have the usual task of mining a stream of events coming from our
>> > many users. We need to store these events and process them. We use a
>> > standard star schema to represent our data.
>> >
>> > For the moment, it looks like we should store these events in SQL. When
>> > appropriate, we will do analysis with relational queries. Or, when
>> > appropriate, we will extract data into working sets in Spark.
>> >
>> > I imagine this is a pretty common use case for Spark.
>> >
>> > On Sun, Oct 26, 2014 at 10:05 AM, Rick Richardson <rick.richard...@gmail.com> wrote:
>> > Spark's API definitely covers all of the things that a relational
>> > database can do. It will probably outperform a relational star schema if
>> > all of your *working* data set can fit into RAM on your cluster. It will
>> > still perform quite well if most of the data fits and some has to spill
>> > over to disk.
>> >
>> > What are your requirements exactly?
>> > What is "massive amounts of data" exactly?
>> > How big is your cluster?
>> >
>> > Note that Spark is not for data storage, only data analysis. It pulls
>> > data into working data sets called RDDs.
>> >
>> > As a migration path, you could probably pull the data out of a
>> > relational database to analyze. But in the long run, I would recommend
>> > using a more purpose-built, huge-storage database such as Cassandra. If
>> > your data is very static, you could also just store it in files.
>> >
>> > On Oct 26, 2014 9:19 AM, "Peter Wolf" <opus...@gmail.com> wrote:
>> > My understanding is that Spark SQL allows one to access Spark data as
>> > if it were stored in a relational database. It compiles SQL queries
>> > into a series of calls to the Spark API.
>> >
>> > I need the performance of a SQL database, but I don't care about doing
>> > queries with SQL.
>> >
>> > I create the input to MLlib by doing a massive JOIN query. So, I am
>> > creating a single collection by combining many collections. This sort
>> > of operation is very inefficient in Mongo, Cassandra or HDFS.
>> >
>> > I could store my data in a relational database and copy the query
>> > results to Spark for processing. However, I was hoping I could keep
>> > everything in Spark.
>> >
>> > On Sat, Oct 25, 2014 at 11:34 PM, Soumya Simanta <soumya.sima...@gmail.com> wrote:
>> > 1. What data store do you want to store your data in? HDFS, HBase,
>> > Cassandra, S3 or something else?
>> > 2. Have you looked at SparkSQL (https://spark.apache.org/sql/)?
>> >
>> > One option is to process the data in Spark and then store it in the
>> > relational database of your choice.
>> >
>> > On Sat, Oct 25, 2014 at 11:18 PM, Peter Wolf <opus...@gmail.com> wrote:
>> > Hello all,
>> >
>> > We are considering Spark for our organization. It is obviously a
>> > superb platform for processing massive amounts of data... how about
>> > retrieving it?
>> >
>> > We are currently storing our data in a relational database in a star
>> > schema. Retrieving our data requires doing many complicated joins
>> > across many tables.
>> >
>> > Can we use Spark as a relational database? Or, if not, can we put
>> > Spark on top of a relational database?
>> >
>> > Note that we don't care about SQL. Accessing our data via standard
>> > queries is nice, but we are equally happy (or happier) to write Scala
>> > code.
>> >
>> > What is important to us is doing relational queries on huge amounts of
>> > data. Is Spark good at this?
>> >
>> > Thank you very much in advance
>> > Peter
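PS: on Rick's migration-path suggestion in the quoted thread -- if you do start by pulling rows out of the existing relational database, Spark 1.1 ships a JdbcRDD for exactly that. Another rough, untested sketch; the driver, URL, credentials, table and id bounds are all placeholders:

    import java.sql.{DriverManager, ResultSet}
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.{JdbcRDD, RDD}

    // Rough sketch: pull an (Id, Value) working set out of an RDBMS in parallel.
    // URL, credentials, table and id bounds are placeholders; the JDBC driver
    // jar must be on the executor classpath.
    def loadA(sc: SparkContext): RDD[(Long, String)] =
      new JdbcRDD(
        sc,
        () => DriverManager.getConnection("jdbc:postgresql://dbhost/mydb", "user", "pass"),
        "SELECT id, value FROM a WHERE id >= ? AND id <= ?",
        1L, 1000000L,  // assumed range of the id column
        10,            // partitions; each runs the query over one id sub-range
        (rs: ResultSet) => (rs.getLong("id"), rs.getString("value")))

Each partition substitutes its own sub-range for the two '?' placeholders, so keep an index on the bounds column.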