Mahender,

To really address your question you'd have to supply a bit more information, 
such as the kind of data you want to store and how you access it: RDBMS-style 
lookups, key/value/index lookups, insert velocity, etc.  These technologies 
are suited to different use cases, although they overlap in some areas.

In a previous position we used Spark on Cassandra to solve a similar problem.  
The DataStax distribution puts Spark worker nodes directly on the Cassandra 
nodes.  Because Cassandra partitions the data across nodes based on a row key, 
it's a nice match: if the key is chosen properly, the Spark workers are mostly 
reading data that is local to their Cassandra node, so direct queries need very 
few shuffles and inserts go straight to the proper Cassandra nodes.  Our data 
was time series, so the row keys were unique per series and the column keys 
were the timestamps.  In that case most of our queries went directly through 
the Cassandra clients, with Spark SQL used primarily for ad-hoc queries.
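
To make that layout concrete, here is a minimal sketch, assuming the 
spark-cassandra-connector is on the classpath; the keyspace, table, column 
names, and connection host are hypothetical placeholders, not what we 
actually ran:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical CQL layout for "row key = series id, column key = timestamp":
//   CREATE TABLE metrics.readings (
//     sensor_id text,
//     ts        timestamp,
//     value     double,
//     PRIMARY KEY (sensor_id, ts)
//   );

val spark = SparkSession.builder
  .appName("cassandra-timeseries-sketch")
  .config("spark.cassandra.connection.host", "127.0.0.1")  // assumption: local node
  .getOrCreate()

// Read the table through the connector's DataFrame source.
val readings = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "metrics", "table" -> "readings"))
  .load()

// Filtering on the partition key lets the connector push the predicate down,
// so each Spark worker mostly touches data held by its co-located Cassandra node.
readings
  .filter(col("sensor_id") === "sensor-42")
  .orderBy(col("ts"))
  .show()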

At my current position we load raw data directly into Hive (using HiveQL) and 
then use Presto for queries.  That's our OLAP data store.  You can use any 
number of other tools to query the Hive-managed tables as well.
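
If it helps, here is a minimal sketch of that first step driven from Spark, 
assuming Hive support is enabled and a metastore is configured; the database, 
table, and path names are hypothetical:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("hive-load-sketch")
  .enableHiveSupport()  // requires a Hive metastore to be configured
  .getOrCreate()

// External table over the raw landing directory (HiveQL issued through Spark SQL).
spark.sql("CREATE DATABASE IF NOT EXISTS staging")
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS staging.events (
    event_id STRING,
    event_ts TIMESTAMP,
    payload  STRING
  )
  STORED AS PARQUET
  LOCATION '/data/landing/events'
""")

// Presto, or any other engine pointed at the same metastore, can then query
// staging.events directly; from Spark it is just another table.
spark.sql("SELECT count(*) FROM staging.events").show()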

Then we have another pipeline that takes the same raw data, uses Spark for the 
ETL, and inserts the results into Aurora (MySQL).  The schema is designed for 
specific queries, so the Spark ETL transforms the data to fit that schema and 
to allow efficient updates to those tables.  That's our OLTP data store, and we 
use standard SQL for queries.
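
A rough sketch of that last step, with placeholder connection details (the 
JDBC URL, credentials, and table names are made up for illustration):

import java.util.Properties
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder.appName("etl-to-aurora-sketch").getOrCreate()

// Read the raw data (assumed Parquet here) and reshape it for the
// query-specific schema of the OLTP tables; the real ETL does much more.
val raw = spark.read.parquet("/data/landing/events")
val shaped = raw.select("event_id", "event_ts", "payload")

val props = new Properties()
props.setProperty("user", "etl_user")                    // placeholder
props.setProperty("password", "etl_password")            // placeholder
props.setProperty("driver", "com.mysql.cj.jdbc.Driver")  // MySQL Connector/J

// Append the transformed rows into the Aurora (MySQL-compatible) table.
shaped.write
  .mode(SaveMode.Append)
  .jdbc("jdbc:mysql://aurora-endpoint:3306/reporting", "events_by_day", props)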

Rick


----- Original Message -----
From: "Furcy Pin" <pin.fu...@gmail.com>
To: user@hive.apache.org
Sent: Wednesday, April 4, 2018 6:58:58 AM
Subject: Re: Building Data Warehouse Application in Spark


Hi Mahender, 


Did you look at this? https://www.snappydata.io/blog/the-spark-database 


But I believe that most people handle this use case by using either: 
- Their favorite regular RDBMS (MySQL, Postgres, Oracle, SQL Server, ...) if 
the data is not too big 
- Their favorite NoSQL store (Cassandra, HBase) if the data is too big and 
needs to be distributed 


Spark generally makes it easy enough to query these other databases to allow 
you to perform analytics. 
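
For example, reading one of those databases into Spark for analytics is only a 
few lines with the JDBC data source (a sketch; the host, credentials, and 
table name are made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("jdbc-analytics-sketch").getOrCreate()

// Pull a table from a regular RDBMS into Spark; all connection details are placeholders.
val orders = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/shop")
  .option("dbtable", "public.orders")
  .option("user", "analyst")
  .option("password", "secret")
  .load()

// From here it is ordinary Spark SQL / DataFrame analytics.
orders.groupBy("customer_id").count().show()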


Hive and Spark have been designed as OLAP tools, not OLTP. 
I'm not sure which features you are seeking for your SCD, but they probably 
won't be part of Spark's core design. 


Hope this helps, 


Furcy 

On 4 April 2018 at 11:29, Mahender Sarangam <mahender.bigd...@outlook.com> 
wrote: 

Hi, 
Does anyone have a good architecture document or design principles for 
building a warehouse application using Spark? 


Is it better to create a Hive context and perform the transformations with 
HQL, or to load the files directly into a DataFrame and perform the data 
transformations there? 


We need to implement SCD Type 2 in Spark. Is there any good document or 
reference for building a Type 2 warehouse object? 


Thanks in advance 


/Mahender 
