You're welcome :-)

On Tue, Sep 12, 2017 at 1:06 PM, Jinhui Qin <qin.jin...@gmail.com> wrote:
> Furcy,
>
> Thank you so much for the detailed explanation. I realize now that this
> overhead is unavoidable: each statement has to be turned into a MapReduce
> job before it can be executed, and running a MapReduce job carries its own
> overhead, such as provisioning the resources for the computation and
> persisting the results on HDFS, as you mentioned.
>
> I will consider what Jörn suggested and try to collect the records and do
> a bulk load instead of a single insertion per record.
>
> Jörn and Furcy, thanks again for the explanations and suggestions!
>
> On Sep 12, 2017 4:53 AM, "Furcy Pin" <furcy....@flaminem.com> wrote:
>
> Hi,
>
> this is a very common question, as many people who know SQL are used to
> RDBMS like MySQL, Oracle, or SQL Server.
> The first thing you need to know about Hive is that it was not designed
> to replace such databases, at least not when they are used for
> transaction processing.
>
> This is stated in the Hive Tutorial, in the section "What Hive is NOT":
> https://cwiki.apache.org/confluence/display/Hive/Tutorial
>
> Transactionality is a feature that was added afterwards, and as Jörn
> stated, it can be fast when used with Tez + LLAP.
>
> In the logs you sent, you can see that Hive is running a MapReduce job to
> perform your task:
>
> MapReduce Jobs Launched:
> Stage-Stage-1: Map: 1  Cumulative CPU: 2.7 sec  HDFS Read: 3836 HDFS Write: 81 SUCCESS
> Total MapReduce CPU Time Spent: 2 seconds 700 msec
> OK
>
> If you are familiar with MapReduce, you should know that however small
> your dataset is, what takes time here is:
>
> 1. Provisioning executors on YARN
> 2. Starting one JVM per mapper and reducer (here you have only 1 mapper
>    and 0 reducers, as stated by the logs).
>    Starting a JVM generally takes a few seconds nowadays, except when it
>    has to load 200 MB of Hadoop jars first, in which case it takes around
>    10 seconds.
> 3. Performing the task (a few milliseconds in your case)
> 4. Persisting the results on HDFS (which requires a few HDFS operations
>    and can take from a few seconds to minutes if you write a lot of
>    files, but in your case should be quick)
>
> The main improvement that a Tez + LLAP or Spark backend brings is that
> the first two steps are already done and waiting for your queries to run.
> Hive was designed in the first place to go faster than an RDBMS in the
> cases where:
> - your data is too large to fit on a single instance, and sharding is
>   painful
> - your jobs mostly consist of analytical processing, like full table
>   aggregations
>
> In such cases, the correct way to use Hive is to partition your table by
> day (if you run nightly batches) and generate a new partition every day.
> If you want to change something in your table (e.g. fix a bug), you just
> regenerate it.
>
> If you need fast response times for updating and fetching records, plus
> scalability, perhaps you should look into HBase, Cassandra, or Kudu.
>
> On Mon, Sep 11, 2017 at 9:18 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> Why do you want to do single inserts?
>> Hive has been designed more for bulk loads.
>> In any case, newer versions of Hive 2 using Tez + LLAP improve this
>> significantly (also for bulk analysis). Nevertheless, it is good
>> practice not to use single inserts in an analytics system, but to try
>> to combine and bulk-load them.
>>
>> On 11. Sep 2017, at 21:01, Jinhui Qin <qin.jin...@gmail.com> wrote:
>>
>> Hi,
>>
>> I am new to Hive. I just created a simple table in Hive and inserted
>> two records. The first insertion took 16.4 sec, while the second took
>> 14.3 sec. Why is that so slow? Is this the normal performance you get
>> in Hive using INSERT?
>> Is there a way to improve the performance of a single "insert" in Hive?
>> Any help would be really appreciated. Thanks!
>>
>> Here is the record from a terminal in a Hive shell:
>>
>> =========================
>>
>> hive> show tables;
>> OK
>> Time taken: 2.758 seconds
>> hive> create table people(id int, name string, age int);
>> OK
>> Time taken: 0.283 seconds
>> hive> insert into table people(1,'Tom A', 20);
>> Query ID = hive_20170911134052_04680c79-432a-43e0-827b-29a4212fbbc0
>> Total jobs = 3
>> Launching Job 1 out of 3
>> Number of reduce tasks is set to 0 since there's no reduce operator
>> Starting Job = job_1505146047428_0098, Tracking URL =
>> http://iop-hadoop-bi.novalocal:8088/proxy/application_1505146047428_0098/
>> Kill Command = /usr/iop/4.1.0.0/hadoop/bin/hadoop job -kill job_1505146047428_0098
>> Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
>> 2017-09-11 13:41:01,492 Stage-1 map = 0%, reduce = 0%
>> 2017-09-11 13:41:06,940 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.7 sec
>> MapReduce Total cumulative CPU time: 2 seconds 700 msec
>> Ended Job = job_1505146047428_0098
>> Stage-4 is selected by condition resolver.
>> Stage-3 is filtered out by condition resolver.
>> Stage-5 is filtered out by condition resolver.
>> Moving data to: hdfs://iop-hadoop-bi.novalocal:8020/apps/hive/warehouse/people/.hive-staging_hive_2017-09-11_13-40-52_106_4621567581104615441-1/-ext-10000
>> Loading data to table default.people
>> Table default.people stats: [numFiles=1, numRows=1, totalSize=11, rawDataSize=10]
>> MapReduce Jobs Launched:
>> Stage-Stage-1: Map: 1  Cumulative CPU: 2.7 sec  HDFS Read: 3836 HDFS Write: 81 SUCCESS
>> Total MapReduce CPU Time Spent: 2 seconds 700 msec
>> OK
>> Time taken: 16.417 seconds
>> hive> insert into table people values(1,'Tom A', 20);
>> Query ID = hive_20170911134128_c8f46977-7718-4496-9a98-cce0f89ced79
>> Total jobs = 3
>> Launching Job 1 out of 3
>> Number of reduce tasks is set to 0 since there's no reduce operator
>> Starting Job = job_1505146047428_0099, Tracking URL =
>> http://iop-hadoop-bi.novalocal:8088/proxy/application_1505146047428_0099/
>> Kill Command = /usr/iop/4.1.0.0/hadoop/bin/hadoop job -kill job_1505146047428_0099
>> Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
>> 2017-09-11 13:41:36,289 Stage-1 map = 0%, reduce = 0%
>> 2017-09-11 13:41:40,721 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.28 sec
>> MapReduce Total cumulative CPU time: 2 seconds 280 msec
>> Ended Job = job_1505146047428_0099
>> Stage-4 is selected by condition resolver.
>> Stage-3 is filtered out by condition resolver.
>> Stage-5 is filtered out by condition resolver.
>> Moving data to: hdfs://iop-hadoop-bi.novalocal:8020/apps/hive/warehouse/people/.hive-staging_hive_2017-09-11_13-41-28_757_4458472522071240567-1/-ext-10000
>> Loading data to table default.people
>> Table default.people stats: [numFiles=2, numRows=2, totalSize=22, rawDataSize=20]
>> MapReduce Jobs Launched:
>> Stage-Stage-1: Map: 1  Cumulative CPU: 2.28 sec  HDFS Read: 3924 HDFS Write: 81 SUCCESS
>> Total MapReduce CPU Time Spent: 2 seconds 280 msec
>> OK
>> Time taken: 14.288 seconds
>> hive> exit;
>> =================
>>
>> Jinhui
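
[Editor's note] The advice in the thread above (bulk loads plus one partition per day, rather than one MapReduce job per inserted row) can be sketched in HiveQL. This is a minimal, hypothetical sketch: the staging path, the `people_staging` table, and the `ds` partition column are illustrative names, not taken from the thread.

```sql
-- With Hive 2, switching the execution engine to Tez (with LLAP on the
-- cluster) avoids re-provisioning YARN containers and JVMs per statement:
SET hive.execution.engine=tez;

-- A table partitioned by day, so each nightly batch writes one partition
-- (plain delimited text here, so files can be loaded as-is):
CREATE TABLE people (
  id   INT,
  name STRING,
  age  INT
)
PARTITIONED BY (ds STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Bulk alternative 1: collect the records into a file on HDFS and load it
-- in one operation (a file move, not one MapReduce job per record):
LOAD DATA INPATH '/staging/people/2017-09-12.csv'
INTO TABLE people PARTITION (ds = '2017-09-12');

-- Bulk alternative 2: a single multi-row INSERT still launches one job,
-- but amortizes the startup overhead over all rows:
INSERT INTO TABLE people PARTITION (ds = '2017-09-12')
VALUES (1, 'Tom A', 20), (2, 'Ann B', 30), (3, 'Joe C', 41);

-- Regenerating a partition (e.g. to fix a bug), as Furcy describes:
-- overwrite the whole day's partition from a staging table in one job.
INSERT OVERWRITE TABLE people PARTITION (ds = '2017-09-12')
SELECT id, name, age FROM people_staging;
```

Each of these statements pays the YARN/JVM startup cost once per batch instead of once per record, which is the distinction the thread turns on.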