Hi Sai,
Sqoop will not be able to do that by itself. You can use Sqoop for the initial load to populate your Hive table at time T0. At later times T1, T2, ... you can get Sqoop to read new rows based on the primary key of your source table. Assuming the primary key is a monotonically increasing number, say trade_id, you can get max(trade_id) from the Hive table and request only the newer rows. However, performance implications aside, that only gives you the new rows added to the Hive table. What about deletes and updates?

In an RDBMS we do schema on write: we sort out and clean the data first. In other words, we do Extract, Transform and Load (ETL) before storing the data in the source table. In a schema-on-read system like Hive, we load the data "as is" and then decide how to do the ETL, slicing and dicing. So when we say that a very large amount of data is loaded daily into Hadoop, that basically means the base table is augmented with new inserts, new updates and new deletes. Unlike an RDBMS table, we do not go and update or delete rows in place. In Hive, every updated or deleted row becomes a new row appended to the Hive table with the appropriate operation identifier. That is the way I do it.

There are three record types added to the Hive table in any time interval, following the general CRUD (Create, Read, Update, Delete) rules. Within an entity life history, every relational table will have one insert for a given primary key (PK), one delete for the same PK, and possibly many updates for the same PK. So within a day we end up with new records, new updates and deleted records. Crucially, a row may in theory be updated any number of times within a period, but what matters is the most recent update. In other words, the Hive table is a running total of the original direct load from the RDBMS table plus all subsequent operations (CRUD). For those familiar with replication, it is like synchronising the replicate once and letting replication take care of new inserts, updates and deletes, i.e. the deltas, thereafter. The crucial point is that rather than updating or deleting rows in place, you append the updated or deleted row to the Hive table.

Performance-wise, I have done this with an ORC table in Hive, and as long as you have partitioned the Hive table sensibly, performance looks fine to me. After all, a transactional table is not a data-warehouse table; it does not grow that much, relatively speaking.

HTH,

Mich Talebzadeh

http://talebzadehmich.wordpress.com
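To make the incremental pull described above concrete, here is a minimal Sqoop sketch along those lines; the JDBC URL, credentials, table name, key column and last-value figure are all illustrative rather than taken from the thread:

    # Pull only rows whose trade_id is greater than the last value already in Hive
    sqoop import \
      --connect jdbc:mysql://dbhost/trading \
      --username etl_user -P \
      --table trades \
      --incremental append \
      --check-column trade_id \
      --last-value 1500000 \
      --target-dir /data/staging/trades_delta \
      -m 4

The running-total table and the "most recent update wins" read can be sketched in HiveQL roughly as follows, assuming the load job tags every row with an op_type ('I', 'U' or 'D') and an op_ts capture timestamp; those audit columns, and the trade columns, are assumptions for illustration only:

    -- Illustrative append-only table, ORC and partitioned as suggested above
    CREATE TABLE trades_all (
      trade_id  BIGINT,
      quantity  DOUBLE,
      price     DOUBLE,
      op_type   STRING,     -- 'I', 'U' or 'D' (assumed audit column)
      op_ts     TIMESTAMP   -- when the change was captured (assumed audit column)
    )
    PARTITIONED BY (load_date STRING)
    STORED AS ORC;

    -- Latest image per primary key: the most recent operation wins, deletes drop out
    SELECT *
    FROM (
      SELECT t.*,
             ROW_NUMBER() OVER (PARTITION BY trade_id ORDER BY op_ts DESC) AS rn
      FROM   trades_all t
    ) latest
    WHERE rn = 1
      AND op_type <> 'D';

In practice the figure passed to --last-value would itself come from something like SELECT max(trade_id) FROM trades_all, and partitioning by load_date keeps each interval's scan small, which matches the point above about partitioned ORC tables.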
From: Sai Gopalakrishnan [mailto:sai.gopalakrish...@aspiresys.com]
Sent: 20 November 2015 15:49
To: user@hive.apache.org
Subject: Re: Hive on Spark - Hadoop 2 - Installation - Ubuntu

Hi Mich,

Could you please explain more about how to efficiently reflect updates and deletes done in the RDBMS in HDFS via Sqoop? Even if Hive supports ACID properties in ORC, it still needs to know which records are to be updated or deleted, right? You had mentioned feeding deltas from the RDBMS to Hive, but query performance degrades as the number of delta files grows. Is there an existing feature for this in Sqoop, or one planned for release any time soon?

Thanks & Regards,
Sai

_____
From: Mich Talebzadeh <m...@peridale.co.uk>
Sent: Friday, November 20, 2015 4:54 PM
To: user@hive.apache.org
Subject: RE: Hive on Spark - Hadoop 2 - Installation - Ubuntu

Right, your steps look reasonable. Trying to understand your approach:

1. You have a current RDBMS (Oracle, Sybase, MSSQL?).
2. You want to feed that data daily, in batch or real time, from the RDBMS to Hadoop as relational tables (that is where Hive comes into it).
3. You need to have Hive fully installed and configured, including HiveServer2.
4. You will need to use Sqoop (SQL to Hadoop) to get the DDL and data from the RDBMS created and loaded in Hive. This is a priority step.
5. You will use Hive/MapReduce for batch processing.
6. You want to use Spark for real-time data processing on Hadoop.

How about feeding the deltas (daily/periodic changes) from the RDBMS to Hive? How are you going to do that? Remember we are talking about inserts, updates and deletes.

HTH,

Mich Talebzadeh

From: Dasun Hegoda [mailto:dasunheg...@gmail.com]
Sent: 20 November 2015 09:36
To: user@hive.apache.org
Subject: Hive on Spark - Hadoop 2 - Installation - Ubuntu

Hi,

What I'm planning to do is develop a reporting platform using existing data. I have an existing RDBMS with a large number of records, so I'm using the following stack (http://stackoverflow.com/questions/33635234/hadoop-2-7-spark-hive-jasperreports-scoop-architecuture):

- Sqoop - extract data from the RDBMS to Hadoop
- Hadoop - storage platform -> *Deployment completed*
- Hive - data warehouse
- Spark - real-time processing -> *Deployment completed*

I'm planning to deploy Hive on Spark but I can't find the installation steps. I tried to read the official '[Hive on Spark][1]' guide but it has problems.
As an example, it says under 'Configuring Yarn' to set `yarn.resourcemanager.scheduler.class=org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler`, but it does not say where I should do this. Also, as per the guide, configurations are set in the Hive runtime shell, which is not permanent as far as I know. Given that, I read [this][2] but it does not have any steps either. Could you please provide the steps to run Hive on Spark on Ubuntu as a production system?

[1]: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
[2]: http://stackoverflow.com/questions/26018306/how-to-configure-hive-to-use-spark

--
Regards,
Dasun Hegoda, Software Engineer
www.dasunhegoda.com | dasunheg...@gmail.com
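For reference, a minimal sketch of where these settings normally live, since the Getting Started guide only shows per-session `set` commands. The property quoted above belongs in yarn-site.xml on the ResourceManager node, and the Hive-on-Spark settings can be made permanent in hive-site.xml; the master URL and memory value below are illustrative, not prescriptions:

    <!-- yarn-site.xml (restart YARN after changing the scheduler class) -->
    <property>
      <name>yarn.resourcemanager.scheduler.class</name>
      <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
    </property>

    <!-- hive-site.xml: makes the engine choice permanent instead of per-session "set" commands -->
    <property>
      <name>hive.execution.engine</name>
      <value>spark</value>
    </property>
    <property>
      <name>spark.master</name>
      <value>yarn-cluster</value>
    </property>
    <property>
      <name>spark.executor.memory</name>
      <value>2g</value>
    </property>

With these in hive-site.xml, the values apply to every new HiveServer2 or Hive CLI session rather than having to be re-issued with `set` each time.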