Re: Is my Use Case possible with Hive?

Nitin Pawar Tue, 15 May 2012 04:44:51 -0700

the problem with hive server with jdbc currently is that it does not handle
concurrent connection in a seamless manner and chokes down on larger number
of parallel query executions.


For this one reason, I had actually written a pipeline kind of infra using
shell scripts which used to run queries after one another and used to run
them from different terminals or run them as background processes (but this
needed a larger memory on hive client cli as lot of times hive cli went OOM
as too many queries were doing some pre query processing (like mapside
joins etc)

On Tue, May 15, 2012 at 5:03 PM, Bhavesh Shah <bhavesh25s...@gmail.com>wrote:

> Thanks all for their replies.
> Just now I tried one thing that as folows:
> 1) I open tho two hive CLI.  hive>
> 2) I have one query which takes 7 jobs for execution. I submitted that
> query to both the CLI.
> 3) one of the hive CLI took 147.319 seconds  and second one took: 161.542
> seconds
> 4) Later I tried that query only on one CLI and it took 122.307 seconds
>  The thing what I want to ask is this, if multiple query runs parallel it
> takes less time to execute compare to execute one by one.
>
>   If I want to execute such parallel queries through JDBC, how can I do it.
>   I know that hive can accept at a time one connection. But still is
> there any way to so it?
>   Pls suggest me some solution for this.
>
>
> --
> Regards,
> Bhavesh Shah
>
>
> On Tue, May 15, 2012 at 1:15 AM, Nanda Vijaydev 
> <nanda.vijay...@gmail.com>wrote:
>
>> Hadoop in general does well with fewer large data files instead of more
>> smaller data files. RDBMS type of indexing and run time optimization is not
>> exactly available in Hadoop/Hive yet. So one suggestion is to combine some
>> of this data, if you can, into fewer tables as you are doing sqoop. Even if
>> there is a slight redundancy it should be OK. Storage is cheap and helps
>> during read.
>>
>> Other suggestions as given in this thread is to set map side and reduce
>> side hive optimization parameters. Querying via jdbc is generally slow as
>> well. There are certain products in Hadoop space that allow for hive
>> querying without jdbc interface. Give it a try and it should improve
>> performance.
>>
>> Good luck
>>
>>
>>
>> On Mon, May 14, 2012 at 6:17 AM, Bhavesh Shah <bhavesh25s...@gmail.com>wrote:
>>
>>> Thanks Nitin for your continous support.
>>> *Here is my data layout and change the queries as per needed*:
>>> 1) Initially after importing the tables from MS SQL Server, 1st basic
>>> task I am doing is that *PIVOTING.*
>>>    As SQL stores data in name value pair.
>>> 2) Pivoting results in subset of data, Using this subset we are running
>>> complex queries on history data and retrieves result for each row in
>>> subset.
>>>     again *data is updated into pivoted columns*. (I am not using
>>> partition. updated by INSERT OVERWRITE)
>>>     As update is not supporting, I have to again do *INSERT OVERWRITE
>>> TABLE
>>> *3) Likewise I have to do near about 20-30 times. (Depends upon
>>> Business rules and scenario if needed to Business rules)
>>> 4) After this I have to do computation which has very large queries from
>>> above generated tables.
>>>     (Each query has near about 10-11 jobs query)
>>>     This again repeats for 30 times.
>>>
>>> (My all queries contains -  case when, group by, cast function, etc )
>>>
>>> --
>>> Regards,
>>> Bhavesh Shah
>>>
>>>
>>> On Mon, May 14, 2012 at 6:05 PM, Nitin Pawar <nitinpawar...@gmail.com>wrote:
>>>
>>>> partitioning is mainly used when you want to access the table based on
>>>> value of a particular column and dont want to go through entire table for
>>>> same operation. This actually means if there are few columns whose values
>>>> are repeated in all the records, then you can consider partitioning on
>>>> them. Other approach will be partition data based on date/time if
>>>> applicable.
>>>>
>>>> From the queries you showed, i am just seeing inserting and creating
>>>> indexes. loading data to tables should not take much time and I personally
>>>> have never used indexing so can not tell about that particular query
>>>> execution time.
>>>>
>>>> if I understand correctly following is your execution approach
>>>>
>>>> 1) Import data from MS-SQL to hive using sqoop
>>>>     should be over quickly depending on how much time MS-SQL takes to
>>>> export
>>>> 2) example of queries which you are doing on the data being dumped in
>>>> hive will be good to know if we can decide on the data layout and change
>>>> the queries as per needed if needed
>>>> 3) Once query execution is over you are putting the result back in
>>>> MS-SQL
>>>>
>>>> can you note individually how much time each step is taking?
>>>>
>>>>
>>>> On Mon, May 14, 2012 at 4:38 PM, Bhavesh Shah 
>>>> <bhavesh25s...@gmail.com>wrote:
>>>>
>>>>> Hello Nitin,
>>>>> Thanks for suggesting me about the partition.
>>>>> But I want to tell one thing that I forgot to mention before is that :
>>>>> *
>>>>> I am using Indexes on all tables tables which are used again and
>>>>> again. *
>>>>> But the problem is that after execution I didn't see the difference in
>>>>> performance (before applying the index and after applying it)
>>>>> I have created the indexes as below:
>>>>> sql = "CREATE INDEX INDEX_VisitDate ON TABLE Tmp(Uid,VisitDate) as
>>>>> 'COMPACT' WITH DEFERRED REBUILD stored as RCFILE";
>>>>> res2 = stmt2.executeQuery(sql);
>>>>> sql = (new StringBuilder(" INSERT OVERWRITE TABLE Tmp  select C1.Uid,
>>>>> C1.VisitDate, C1.ID from
>>>>>        TmpElementTable C1 LEFT OUTER JOIN Tmp T on C1.Uid=T.Uid and
>>>>> C1.VisitDate=T.VisitDate").toString();
>>>>> stmt2.executeUpdate(sql);
>>>>> sql = "load data inpath '/user/hive/warehouse/tmp' overwrite into
>>>>> table TmpElementTable";
>>>>> stmt2.executeUpdate(sql);
>>>>> sql = "alter index clinical_index on TmpElementTable REBUILD";
>>>>> res2 = stmt2.executeQuery(sql);
>>>>> *Did I use it in correct way?*
>>>>>
>>>>> As you told me told me to try with partition
>>>>> Actually I am altering the table with large number of columns at the
>>>>> runtime only.
>>>>> If i use partition in such situation then is it good to use partition
>>>>> for all columns?
>>>>>
>>>>> So, I want to know that After using the partition Will it be able to
>>>>> improve the performance or
>>>>> do I need to use both Partition and Indexes?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Bhavesh Shah
>>>>>
>>>>>
>>>>> On Mon, May 14, 2012 at 3:13 PM, Nitin Pawar 
>>>>> <nitinpawar...@gmail.com>wrote:
>>>>>
>>>>>> it is definitely possible to increase your performance.
>>>>>>
>>>>>> I have run queries where more than 10 billion records were involved.
>>>>>> If you are doing joins in your queries, you may have a look at
>>>>>> different kind of joins supported by hive.
>>>>>> If one of your table is very small in size compared to another table
>>>>>> then you may consider mapside join etc
>>>>>>
>>>>>> Also the number of maps and reducers are decided by the split size
>>>>>> you provide to maps.
>>>>>>
>>>>>> I would suggest before you go full speed, decide on how you want to
>>>>>> layout data for hive.
>>>>>>
>>>>>> You can try loading some data, partition the data and write queries
>>>>>> based on partition then performance will improve but in that case your
>>>>>> queries will be in batch processing format. there are other approaches as
>>>>>> well.
>>>>>>
>>>>>>
>>>>>> On Mon, May 14, 2012 at 2:31 PM, Bhavesh Shah <
>>>>>> bhavesh25s...@gmail.com> wrote:
>>>>>>
>>>>>>> That I fail to know, how many maps and reducers are there. Because
>>>>>>> due to some reason my instance get terminated   :(
>>>>>>> I want to know one thing that If we use multiple nodes, then what
>>>>>>> should be the count of maps and reducers.
>>>>>>> Actually I am confused about that. How to decide it?
>>>>>>>
>>>>>>> Also I want to try the different properties like block size,
>>>>>>> compress output, size of in-memorybuffer, parallel execution etc.
>>>>>>> Will these all properties matters to increase the performance?
>>>>>>>
>>>>>>> Nitin, you have read all my use case. Whatever the thing I did to
>>>>>>> implement with the help of Hadoop is correct?
>>>>>>> Is it possible to increase the performance?
>>>>>>>
>>>>>>> Thanks Nitin for your reply.   :)
>>>>>>>
>>>>>>> --
>>>>>>> Regards,
>>>>>>> Bhavesh Shah
>>>>>>>
>>>>>>>
>>>>>>> On Mon, May 14, 2012 at 2:07 PM, Nitin Pawar <
>>>>>>> nitinpawar...@gmail.com> wrote:
>>>>>>>
>>>>>>>> with a 10 node cluster the performance should improve.
>>>>>>>> how many maps and reducers are being launched?
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, May 14, 2012 at 1:18 PM, Bhavesh Shah <
>>>>>>>> bhavesh25s...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I have near about 1 billion records in my relational database.
>>>>>>>>> Currently locally I am using just one cluster. But I also tried
>>>>>>>>> this on Amazon Elastic Mapreduce with 10 nodes. But the time taken to
>>>>>>>>> execute the complete program is same as that on my  single local 
>>>>>>>>> machine.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, May 14, 2012 at 1:13 PM, Nitin Pawar <
>>>>>>>>> nitinpawar...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> how many # records?
>>>>>>>>>>
>>>>>>>>>> what is your hadoop cluster setup? how many nodes?
>>>>>>>>>> if you are running hadoop on a single node setup with normal
>>>>>>>>>> desktop, i doubt it will be of any help.
>>>>>>>>>>
>>>>>>>>>> You need a stronger cluster setup for better query runtimes and
>>>>>>>>>> ofcourse query optimization which I guess you would have already 
>>>>>>>>>> taken care.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, May 14, 2012 at 12:39 PM, Bhavesh Shah <
>>>>>>>>>> bhavesh25s...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello all,
>>>>>>>>>>> My Use Case is:
>>>>>>>>>>> 1) I have a relational database which has a very large data. (MS
>>>>>>>>>>> SQL Server)
>>>>>>>>>>> 2) I want to do analysis on these huge data  and want to
>>>>>>>>>>> generate reports
>>>>>>>>>>> on it after analysis.
>>>>>>>>>>> Like this I have to generate various reports based on different
>>>>>>>>>>> analysis.
>>>>>>>>>>>
>>>>>>>>>>> I tried to implement this using Hive. What I did is:
>>>>>>>>>>> 1) I imported all tables in Hive from MS SQL Server using SQOOP.
>>>>>>>>>>> 2) I wrote many queries in Hive which is executing using JDBC on
>>>>>>>>>>> Hive
>>>>>>>>>>> Thrift Server
>>>>>>>>>>> 3) I am getting the correct result in table form, which I am
>>>>>>>>>>> expecting
>>>>>>>>>>> 4) But the problem is that the time which require to execute is
>>>>>>>>>>> too much
>>>>>>>>>>> long.
>>>>>>>>>>>    (My complete program is executing in near about 3-4 hours on
>>>>>>>>>>> *small
>>>>>>>>>>> amount of data*).
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>    I decided to do this using Hive.
>>>>>>>>>>>     And as I told previously how much time Hive consumed for
>>>>>>>>>>> execution. my
>>>>>>>>>>> organization is expecting to complete this task in near about
>>>>>>>>>>> less than
>>>>>>>>>>> 1/2 hours
>>>>>>>>>>>
>>>>>>>>>>> Now after spending too much time for complete execution for this
>>>>>>>>>>> task what
>>>>>>>>>>> should I do?
>>>>>>>>>>> I want to ask one thing that:
>>>>>>>>>>> *Is this Use Case is possible with Hive?* If possible what
>>>>>>>>>>> should I do in
>>>>>>>>>>>
>>>>>>>>>>> my program to increase the performance?
>>>>>>>>>>> *And If not possible what is the other good way to implement
>>>>>>>>>>> this Use Case?*
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Please reply me.
>>>>>>>>>>> Thanks
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Regards,
>>>>>>>>>>> Bhavesh Shah
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Nitin Pawar
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Regards,
>>>>>>>>> Bhavesh Shah
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Nitin Pawar
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Nitin Pawar
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Nitin Pawar
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>
>
>
>


-- 
Nitin Pawar

Re: Is my Use Case possible with Hive?

Reply via email to