You're welcome :-)

On Tue, Sep 12, 2017 at 1:06 PM, Jinhui Qin <qin.jin...@gmail.com> wrote:
> Furcy,
>
> Thank you so much for the detailed explanation. I realize now that this
> overhead is unavoidable: each statement has to be turned into a MapReduce
> job before it can be executed, and running a MapReduce job carries its own
> overhead, such as provisioning the resources for the computation and
> persisting the results on HDFS, as you mentioned.
>
> I will consider what Jörn suggested and try to collect the records and do
> a bulk load instead of a single insertion per record.
>
> Jörn and Furcy, thanks again for the explanations and suggestions!
>
> On Sep 12, 2017 4:53 AM, "Furcy Pin" <furcy....@flaminem.com> wrote:
>
> Hi,
>
> this is a very common question, as many people who know SQL are used to
> RDBMS like MySQL, Oracle, or SQL Server.
> The first thing you need to know about Hive is that it was not designed
> to replace such databases, at least not when they are used for
> transaction processing.
>
> This is stated in the Hive Tutorial, in the section "What Hive is NOT":
> https://cwiki.apache.org/confluence/display/Hive/Tutorial
>
> Transactionality is a feature that was added afterwards, and as Jörn
> stated, it can be fast when used with Tez + LLAP.
>
> In the logs you sent, you can see that Hive is running a MapReduce job to
> perform your task:
>
> MapReduce Jobs Launched:
> Stage-Stage-1: Map: 1  Cumulative CPU: 2.7 sec  HDFS Read: 3836 HDFS Write: 81 SUCCESS
> Total MapReduce CPU Time Spent: 2 seconds 700 msec
> OK
>
> If you are familiar with MapReduce, you should know that however small
> your dataset is, what takes time here is:
>
> 1. Provisioning executors on YARN
> 2. Starting one JVM per mapper and reducer (here you have only 1 mapper
>    and 0 reducers, as stated by the logs).
>    Starting a JVM generally takes a few seconds nowadays, except when it
>    has to load 200 MB of Hadoop jars first, in which case it takes around
>    10 seconds.
> 3. Performing the task (a few milliseconds in your case)
> 4. Persisting the results on HDFS (which requires a few HDFS operations
>    and can take from a few seconds to minutes if you write a lot of
>    files, but in your case should be quick)
>
> The main improvement that a Tez + LLAP or Spark backend brings is that
> the first two steps are already done and waiting for your queries to run.
> Hive was designed in the first place to go faster than an RDBMS in the
> cases where:
> - your data is too large to fit on a single instance, and sharding is
>   painful
> - your jobs mostly consist of analytical processing, like full table
>   aggregations
>
> In such cases, the correct way to use Hive is to partition your table by
> day (if you run nightly batches) and generate a new partition every day.
> If you want to change something in your table (e.g. fix a bug), you just
> regenerate it.
>
> If you need fast response times for updating and fetching records, plus
> scalability, perhaps you should look into HBase, Cassandra, or Kudu.
>
> On Mon, Sep 11, 2017 at 9:18 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> Why do you want to do single inserts?
>> Hive has been designed more for bulk loads.
>> In any case, newer versions of Hive 2 using Tez + LLAP improve this
>> significantly (also for bulk analysis). Nevertheless, it is good
>> practice not to use single inserts in an analytics system, but to try
>> to combine and bulk-load them.
>>
>> On 11. Sep 2017, at 21:01, Jinhui Qin <qin.jin...@gmail.com> wrote:
>>
>> Hi,
>>
>> I am new to Hive. I just created a simple table in Hive and inserted
>> two records. The first insertion took 16.4 sec, while the second took
>> 14.3 sec. Why is that so slow? Is this the normal performance you get
>> in Hive using INSERT?
>> Is there a way to improve the performance of a single "insert" in Hive?
>> Any help would be really appreciated. Thanks!
>>
>> Here is the record from a terminal in a Hive shell:
>>
>> =========================
>>
>> hive> show tables;
>> OK
>> Time taken: 2.758 seconds
>> hive> create table people(id int, name string, age int);
>> OK
>> Time taken: 0.283 seconds
>> hive> insert into table people(1,'Tom A', 20);
>> Query ID = hive_20170911134052_04680c79-432a-43e0-827b-29a4212fbbc0
>> Total jobs = 3
>> Launching Job 1 out of 3
>> Number of reduce tasks is set to 0 since there's no reduce operator
>> Starting Job = job_1505146047428_0098, Tracking URL =
>> http://iop-hadoop-bi.novalocal:8088/proxy/application_1505146047428_0098/
>> Kill Command = /usr/iop/4.1.0.0/hadoop/bin/hadoop job -kill job_1505146047428_0098
>> Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
>> 2017-09-11 13:41:01,492 Stage-1 map = 0%, reduce = 0%
>> 2017-09-11 13:41:06,940 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.7 sec
>> MapReduce Total cumulative CPU time: 2 seconds 700 msec
>> Ended Job = job_1505146047428_0098
>> Stage-4 is selected by condition resolver.
>> Stage-3 is filtered out by condition resolver.
>> Stage-5 is filtered out by condition resolver.
>> Moving data to: hdfs://iop-hadoop-bi.novalocal:8020/apps/hive/warehouse/people/.hive-staging_hive_2017-09-11_13-40-52_106_4621567581104615441-1/-ext-10000
>> Loading data to table default.people
>> Table default.people stats: [numFiles=1, numRows=1, totalSize=11, rawDataSize=10]
>> MapReduce Jobs Launched:
>> Stage-Stage-1: Map: 1  Cumulative CPU: 2.7 sec  HDFS Read: 3836 HDFS Write: 81 SUCCESS
>> Total MapReduce CPU Time Spent: 2 seconds 700 msec
>> OK
>> Time taken: 16.417 seconds
>> hive> insert into table people values(1,'Tom A', 20);
>> Query ID = hive_20170911134128_c8f46977-7718-4496-9a98-cce0f89ced79
>> Total jobs = 3
>> Launching Job 1 out of 3
>> Number of reduce tasks is set to 0 since there's no reduce operator
>> Starting Job = job_1505146047428_0099, Tracking URL =
>> http://iop-hadoop-bi.novalocal:8088/proxy/application_1505146047428_0099/
>> Kill Command = /usr/iop/4.1.0.0/hadoop/bin/hadoop job -kill job_1505146047428_0099
>> Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
>> 2017-09-11 13:41:36,289 Stage-1 map = 0%, reduce = 0%
>> 2017-09-11 13:41:40,721 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.28 sec
>> MapReduce Total cumulative CPU time: 2 seconds 280 msec
>> Ended Job = job_1505146047428_0099
>> Stage-4 is selected by condition resolver.
>> Stage-3 is filtered out by condition resolver.
>> Stage-5 is filtered out by condition resolver.
>> Moving data to: hdfs://iop-hadoop-bi.novalocal:8020/apps/hive/warehouse/people/.hive-staging_hive_2017-09-11_13-41-28_757_4458472522071240567-1/-ext-10000
>> Loading data to table default.people
>> Table default.people stats: [numFiles=2, numRows=2, totalSize=22, rawDataSize=20]
>> MapReduce Jobs Launched:
>> Stage-Stage-1: Map: 1  Cumulative CPU: 2.28 sec  HDFS Read: 3924 HDFS Write: 81 SUCCESS
>> Total MapReduce CPU Time Spent: 2 seconds 280 msec
>> OK
>> Time taken: 14.288 seconds
>> hive> exit;
>> =================
>>
>> Jinhui
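
[Editor's note] The advice in the thread above (bulk loads plus one partition per day, rather than one MapReduce job per inserted row) can be sketched in HiveQL. This is a minimal, hypothetical sketch: the staging path, the `people_staging` table, and the `ds` partition column are illustrative names, not taken from the thread.

```sql
-- With Hive 2, switching the execution engine to Tez (with LLAP on the
-- cluster) avoids re-provisioning YARN containers and JVMs per statement:
SET hive.execution.engine=tez;

-- A table partitioned by day, so each nightly batch writes one partition
-- (plain delimited text here, so files can be loaded as-is):
CREATE TABLE people (
  id   INT,
  name STRING,
  age  INT
)
PARTITIONED BY (ds STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Bulk alternative 1: collect the records into a file on HDFS and load it
-- in one operation (a file move, not one MapReduce job per record):
LOAD DATA INPATH '/staging/people/2017-09-12.csv'
INTO TABLE people PARTITION (ds = '2017-09-12');

-- Bulk alternative 2: a single multi-row INSERT still launches one job,
-- but amortizes the startup overhead over all rows:
INSERT INTO TABLE people PARTITION (ds = '2017-09-12')
VALUES (1, 'Tom A', 20), (2, 'Ann B', 30), (3, 'Joe C', 41);

-- Regenerating a partition (e.g. to fix a bug), as Furcy describes:
-- overwrite the whole day's partition from a staging table in one job.
INSERT OVERWRITE TABLE people PARTITION (ds = '2017-09-12')
SELECT id, name, age FROM people_staging;
```

Each of these statements pays the YARN/JVM startup cost once per batch instead of once per record, which is the distinction the thread turns on.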