Thanks all for your replies. Just now I tried the following experiment:
1) I opened two Hive CLI sessions (hive>).
2) I have one query that takes 7 MapReduce jobs to execute. I submitted that query in both CLIs at the same time.
3) One of the Hive CLIs took 147.319 seconds and the second took 161.542 seconds.
4) Later I ran the same query in only one CLI, and it took 122.307 seconds.

What I want to point out is this: if multiple queries run in parallel, the total wall-clock time is less than executing them one by one (about 161 seconds for the two concurrent runs versus roughly 244 seconds if run serially).
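[Editor's aside, not from the original thread: besides running whole queries concurrently, Hive can also run the independent MapReduce stages of a *single* query in parallel via the `hive.exec.parallel` setting. A minimal sketch; whether it helps the 7-job query above depends on how many of those stages are independent of each other:]

```sql
-- Let Hive launch independent stages of one query concurrently.
-- hive.exec.parallel.thread.number caps how many stages run at once.
SET hive.exec.parallel=true;
SET hive.exec.parallel.thread.number=8;

-- The multi-job query is then submitted as usual; only stages with no
-- dependency on each other actually overlap.
```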
If I want to execute such parallel queries through JDBC, how can I do it? I know that Hive can accept only one connection at a time, but is there still some way to do it? Please suggest a solution for this.

--
Regards,
Bhavesh Shah

On Tue, May 15, 2012 at 1:15 AM, Nanda Vijaydev <nanda.vijay...@gmail.com> wrote:

> Hadoop in general does well with fewer large data files instead of many
> smaller data files. RDBMS-style indexing and run-time optimization is not
> exactly available in Hadoop/Hive yet. So one suggestion is to combine some
> of this data, if you can, into fewer tables as you do the Sqoop import.
> Even if there is a slight redundancy it should be OK. Storage is cheap and
> it helps during reads.
>
> Other suggestions given in this thread are to set the map-side and
> reduce-side Hive optimization parameters. Querying via JDBC is generally
> slow as well. There are certain products in the Hadoop space that allow
> Hive querying without a JDBC interface. Give them a try and it should
> improve performance.
>
> Good luck
>
> On Mon, May 14, 2012 at 6:17 AM, Bhavesh Shah <bhavesh25s...@gmail.com> wrote:
>
>> Thanks Nitin for your continuous support.
>> *Here is my data layout and how the queries change as needed:*
>> 1) After importing the tables from MS SQL Server, the first basic task I
>> do is *PIVOTING*, since SQL Server stores the data as name-value pairs.
>> 2) Pivoting produces a subset of the data. Using this subset, we run
>> complex queries on the history data and retrieve a result for each row in
>> the subset; the *data is then written back into the pivoted columns*.
>> (I am not using partitions; updates are done with INSERT OVERWRITE.)
>> As UPDATE is not supported, I have to do *INSERT OVERWRITE TABLE* again.
>> 3) I have to repeat this roughly 20-30 times (depending on the business
>> rules and the scenario).
>> 4) After this I have to do the computation, which uses very large queries
>> over the tables generated above.
>> (Each query spawns about 10-11 jobs.)
>> This again repeats about 30 times.
>>
>> (All my queries contain CASE WHEN, GROUP BY, CAST functions, etc.)
>>
>> --
>> Regards,
>> Bhavesh Shah
>>
>> On Mon, May 14, 2012 at 6:05 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
>>
>>> Partitioning is mainly used when you want to access the table based on
>>> the value of a particular column and don't want to go through the entire
>>> table for the same operation. This means that if there are a few columns
>>> whose values are repeated across many records, you can consider
>>> partitioning on them. Another approach is to partition the data based on
>>> date/time, if applicable.
>>>
>>> From the queries you showed, I am only seeing inserts and index
>>> creation. Loading data into tables should not take much time, and I
>>> personally have never used indexing, so I cannot comment on that
>>> particular query's execution time.
>>>
>>> If I understand correctly, the following is your execution approach:
>>>
>>> 1) Import data from MS SQL to Hive using Sqoop
>>> (should be over quickly, depending on how long MS SQL takes to export)
>>> 2) An example of the queries you run on the data dumped into Hive would
>>> help us decide on the data layout and change the queries if needed
>>> 3) Once query execution is over, you put the result back into MS SQL
>>>
>>> Can you note individually how much time each step takes?
>>>
>>> On Mon, May 14, 2012 at 4:38 PM, Bhavesh Shah <bhavesh25s...@gmail.com> wrote:
>>>
>>>> Hello Nitin,
>>>> Thanks for suggesting partitioning to me.
>>>> But one thing I forgot to mention before is that
>>>> *I am using indexes on all the tables that are used again and again.
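[Editor's aside: to make the partitioning suggestion above concrete, here is a minimal HiveQL sketch. The table and column names (`visits`, `staging_visits`, `visit_date`) are hypothetical, not from the thread:]

```sql
-- Hypothetical table partitioned by date, so queries that filter on
-- visit_date scan only the matching partitions (partition pruning).
CREATE TABLE visits (
  uid        BIGINT,
  element_id BIGINT
)
PARTITIONED BY (visit_date STRING);

-- Load one day's data into its own partition.
INSERT OVERWRITE TABLE visits PARTITION (visit_date = '2012-05-14')
SELECT uid, element_id FROM staging_visits WHERE dt = '2012-05-14';

-- This query reads only the 2012-05-14 partition, not the whole table.
SELECT COUNT(*) FROM visits WHERE visit_date = '2012-05-14';
```

Note this fits batch-style workloads, as Nitin says later in the thread: each INSERT OVERWRITE rewrites a whole partition rather than updating rows in place.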
>>>> *
>>>> But the problem is that after execution I didn't see any difference in
>>>> performance (before applying the index versus after applying it).
>>>> I have created the indexes as below:
>>>>
>>>> sql = "CREATE INDEX INDEX_VisitDate ON TABLE Tmp(Uid, VisitDate) AS
>>>>     'COMPACT' WITH DEFERRED REBUILD STORED AS RCFILE";
>>>> res2 = stmt2.executeQuery(sql);
>>>> sql = new StringBuilder("INSERT OVERWRITE TABLE Tmp SELECT C1.Uid, "
>>>>     + "C1.VisitDate, C1.ID FROM TmpElementTable C1 LEFT OUTER JOIN "
>>>>     + "Tmp T ON C1.Uid = T.Uid AND C1.VisitDate = T.VisitDate").toString();
>>>> stmt2.executeUpdate(sql);
>>>> sql = "LOAD DATA INPATH '/user/hive/warehouse/tmp' OVERWRITE INTO TABLE
>>>>     TmpElementTable";
>>>> stmt2.executeUpdate(sql);
>>>> sql = "ALTER INDEX clinical_index ON TmpElementTable REBUILD";
>>>> res2 = stmt2.executeQuery(sql);
>>>>
>>>> *Did I use it correctly?*
>>>>
>>>> As for your suggestion to try partitioning: I am actually altering the
>>>> table, which has a large number of columns, at runtime only. If I use
>>>> partitioning in such a situation, is it good to partition on all columns?
>>>>
>>>> So I want to know: after using partitioning, will it improve the
>>>> performance, or do I need to use both partitions and indexes?
>>>>
>>>> --
>>>> Regards,
>>>> Bhavesh Shah
>>>>
>>>> On Mon, May 14, 2012 at 3:13 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
>>>>
>>>>> It is definitely possible to increase your performance.
>>>>>
>>>>> I have run queries where more than 10 billion records were involved.
>>>>> If you are doing joins in your queries, you may have a look at the
>>>>> different kinds of joins supported by Hive.
>>>>> If one of your tables is very small compared to the other table,
>>>>> then you may consider a map-side join, etc.
>>>>>
>>>>> Also, the number of maps and reducers is decided by the split size you
>>>>> provide to the maps.
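[Editor's aside: a sketch of the map-side join Nitin mentions, with hypothetical table names (`big_fact_table`, `small_dim_table`). When one side of a join fits in memory, Hive can ship it to every mapper and skip the reduce-side shuffle:]

```sql
-- Ask Hive to convert joins automatically into map joins when one
-- side is below the size threshold (in bytes).
SET hive.auto.convert.join=true;
SET hive.mapjoin.smalltable.filesize=25000000;

-- Or force it explicitly with the MAPJOIN hint: the small dimension
-- table is loaded into memory on each mapper, avoiding a reduce phase.
SELECT /*+ MAPJOIN(d) */ f.uid, d.description
FROM big_fact_table f
JOIN small_dim_table d ON f.dim_id = d.dim_id;
```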
>>>>>
>>>>> I would suggest that before you go full speed, you decide on how you
>>>>> want to lay out the data for Hive.
>>>>>
>>>>> You can try loading some data, partitioning it, and writing queries
>>>>> based on the partitions; then performance will improve, but in that
>>>>> case your queries will be in batch-processing form. There are other
>>>>> approaches as well.
>>>>>
>>>>> On Mon, May 14, 2012 at 2:31 PM, Bhavesh Shah <bhavesh25s...@gmail.com> wrote:
>>>>>
>>>>>> That I fail to know: how many maps and reducers there are, because
>>>>>> for some reason my instance got terminated. :(
>>>>>> I want to know one thing: if we use multiple nodes, what should the
>>>>>> count of maps and reducers be?
>>>>>> Actually I am confused about that. How do I decide it?
>>>>>>
>>>>>> I also want to try different properties like block size, compressed
>>>>>> output, size of the in-memory buffer, parallel execution, etc.
>>>>>> Do all these properties matter for increasing performance?
>>>>>>
>>>>>> Nitin, you have read my whole use case. Is what I did to implement it
>>>>>> with the help of Hadoop correct?
>>>>>> Is it possible to increase the performance?
>>>>>>
>>>>>> Thanks Nitin for your reply. :)
>>>>>>
>>>>>> --
>>>>>> Regards,
>>>>>> Bhavesh Shah
>>>>>>
>>>>>> On Mon, May 14, 2012 at 2:07 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
>>>>>>
>>>>>>> With a 10-node cluster the performance should improve.
>>>>>>> How many maps and reducers are being launched?
>>>>>>>
>>>>>>> On Mon, May 14, 2012 at 1:18 PM, Bhavesh Shah <bhavesh25s...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I have nearly 1 billion records in my relational database.
>>>>>>>> Currently, locally, I am using just one cluster. But I also tried
>>>>>>>> this on Amazon Elastic MapReduce with 10 nodes, and the time taken
>>>>>>>> to execute the complete program was the same as on my single local
>>>>>>>> machine.
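[Editor's aside on the maps-and-reducers question above: the counts are not set directly per node but follow from split size and data volume. A hedged sketch of Hive/Hadoop 0.x-era settings; the specific values are illustrative, not from the thread:]

```sql
-- The number of map tasks follows from the input split size.
SET mapred.max.split.size=256000000;   -- upper bound, in bytes per split
SET mapred.min.split.size=1;

-- Let Hive derive the reducer count from the data volume...
SET hive.exec.reducers.bytes.per.reducer=1000000000;  -- ~1 GB per reducer

-- ...or pin it explicitly for a given query.
SET mapred.reduce.tasks=20;
```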
>>>>>>>>
>>>>>>>> On Mon, May 14, 2012 at 1:13 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> How many records?
>>>>>>>>>
>>>>>>>>> What is your Hadoop cluster setup? How many nodes?
>>>>>>>>> If you are running Hadoop as a single-node setup on an ordinary
>>>>>>>>> desktop, I doubt it will be of any help.
>>>>>>>>>
>>>>>>>>> You need a stronger cluster setup for better query runtimes, and
>>>>>>>>> of course query optimization, which I guess you have already taken
>>>>>>>>> care of.
>>>>>>>>>
>>>>>>>>> On Mon, May 14, 2012 at 12:39 PM, Bhavesh Shah <bhavesh25s...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hello all,
>>>>>>>>>> My use case is:
>>>>>>>>>> 1) I have a relational database (MS SQL Server) which holds a
>>>>>>>>>> very large amount of data.
>>>>>>>>>> 2) I want to analyze this huge data set and generate reports from
>>>>>>>>>> the analysis. I have to generate various reports like this, based
>>>>>>>>>> on different analyses.
>>>>>>>>>>
>>>>>>>>>> I tried to implement this using Hive. What I did is:
>>>>>>>>>> 1) I imported all tables into Hive from MS SQL Server using Sqoop.
>>>>>>>>>> 2) I wrote many queries in Hive, which I execute over JDBC
>>>>>>>>>> against the Hive Thrift server.
>>>>>>>>>> 3) I get the correct results, in table form, as expected.
>>>>>>>>>> 4) But the problem is that the execution time is far too long.
>>>>>>>>>> (My complete program executes in about 3-4 hours on a *small
>>>>>>>>>> amount of data*.)
>>>>>>>>>>
>>>>>>>>>> I decided to do this using Hive, and as I said above, that is how
>>>>>>>>>> much time Hive takes to execute.
>>>>>>>>>> My organization expects this task to be completed in less than
>>>>>>>>>> half an hour.
>>>>>>>>>>
>>>>>>>>>> Now, after spending so much time on the complete execution of
>>>>>>>>>> this task, what should I do?
>>>>>>>>>> I want to ask one thing:
>>>>>>>>>> *Is this use case possible with Hive?* If it is possible, what
>>>>>>>>>> should I do in my program to increase the performance?
>>>>>>>>>> *And if it is not possible, what is another good way to implement
>>>>>>>>>> this use case?*
>>>>>>>>>>
>>>>>>>>>> Please reply.
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Regards,
>>>>>>>>>> Bhavesh Shah
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Nitin Pawar