Thanks all for your replies. Just now I tried the following experiment:
1) I opened two Hive CLI sessions (hive>).
2) I have one query that takes 7 MapReduce jobs to execute. I submitted that query in both CLIs at the same time.
3) One of the Hive CLIs took 147.319 seconds and the second took 161.542 seconds.
4) Later I ran the same query in only one CLI, and it took 122.307 seconds.

What I want to point out is this: if multiple queries run in parallel, the total wall-clock time is less than executing them one by one (about 161 seconds for the two concurrent runs versus roughly 244 seconds if run serially).
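[Editor's aside, not from the original thread: besides running whole queries concurrently, Hive can also run the independent MapReduce stages of a *single* query in parallel via the `hive.exec.parallel` setting. A minimal sketch; whether it helps the 7-job query above depends on how many of those stages are independent of each other:]

```sql
-- Let Hive launch independent stages of one query concurrently.
-- hive.exec.parallel.thread.number caps how many stages run at once.
SET hive.exec.parallel=true;
SET hive.exec.parallel.thread.number=8;

-- The multi-job query is then submitted as usual; only stages with no
-- dependency on each other actually overlap.
```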
If I want to execute such parallel queries through JDBC, how can I do it? I know that Hive can accept only one connection at a time, but is there still some way to do it? Please suggest a solution for this.

--
Regards,
Bhavesh Shah

On Tue, May 15, 2012 at 1:15 AM, Nanda Vijaydev <nanda.vijay...@gmail.com> wrote:

> Hadoop in general does well with fewer large data files instead of many
> smaller data files. RDBMS-style indexing and run-time optimization is not
> exactly available in Hadoop/Hive yet. So one suggestion is to combine some
> of this data, if you can, into fewer tables as you do the Sqoop import.
> Even if there is a slight redundancy it should be OK. Storage is cheap and
> it helps during reads.
>
> Other suggestions given in this thread are to set the map-side and
> reduce-side Hive optimization parameters. Querying via JDBC is generally
> slow as well. There are certain products in the Hadoop space that allow
> Hive querying without a JDBC interface. Give them a try and it should
> improve performance.
>
> Good luck
>
> On Mon, May 14, 2012 at 6:17 AM, Bhavesh Shah <bhavesh25s...@gmail.com> wrote:
>
>> Thanks Nitin for your continuous support.
>> *Here is my data layout and how the queries change as needed:*
>> 1) After importing the tables from MS SQL Server, the first basic task I
>> do is *PIVOTING*, since SQL Server stores the data as name-value pairs.
>> 2) Pivoting produces a subset of the data. Using this subset, we run
>> complex queries on the history data and retrieve a result for each row in
>> the subset; the *data is then written back into the pivoted columns*.
>> (I am not using partitions; updates are done with INSERT OVERWRITE.)
>> As UPDATE is not supported, I have to do *INSERT OVERWRITE TABLE* again.
>> 3) I have to repeat this roughly 20-30 times (depending on the business
>> rules and the scenario).
>> 4) After this I have to do the computation, which uses very large queries
>> over the tables generated above.
>> (Each query spawns about 10-11 jobs.)
>> This again repeats about 30 times.
>>
>> (All my queries contain CASE WHEN, GROUP BY, CAST functions, etc.)
>>
>> --
>> Regards,
>> Bhavesh Shah
>>
>> On Mon, May 14, 2012 at 6:05 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
>>
>>> Partitioning is mainly used when you want to access the table based on
>>> the value of a particular column and don't want to go through the entire
>>> table for the same operation. This means that if there are a few columns
>>> whose values are repeated across many records, you can consider
>>> partitioning on them. Another approach is to partition the data based on
>>> date/time, if applicable.
>>>
>>> From the queries you showed, I am only seeing inserts and index
>>> creation. Loading data into tables should not take much time, and I
>>> personally have never used indexing, so I cannot comment on that
>>> particular query's execution time.
>>>
>>> If I understand correctly, the following is your execution approach:
>>>
>>> 1) Import data from MS SQL to Hive using Sqoop
>>> (should be over quickly, depending on how long MS SQL takes to export)
>>> 2) An example of the queries you run on the data dumped into Hive would
>>> help us decide on the data layout and change the queries if needed
>>> 3) Once query execution is over, you put the result back into MS SQL
>>>
>>> Can you note individually how much time each step takes?
>>>
>>> On Mon, May 14, 2012 at 4:38 PM, Bhavesh Shah <bhavesh25s...@gmail.com> wrote:
>>>
>>>> Hello Nitin,
>>>> Thanks for suggesting partitioning to me.
>>>> But one thing I forgot to mention before is that
>>>> *I am using indexes on all the tables that are used again and again.
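[Editor's aside: to make the partitioning suggestion above concrete, here is a minimal HiveQL sketch. The table and column names (`visits`, `staging_visits`, `visit_date`) are hypothetical, not from the thread:]

```sql
-- Hypothetical table partitioned by date, so queries that filter on
-- visit_date scan only the matching partitions (partition pruning).
CREATE TABLE visits (
  uid        BIGINT,
  element_id BIGINT
)
PARTITIONED BY (visit_date STRING);

-- Load one day's data into its own partition.
INSERT OVERWRITE TABLE visits PARTITION (visit_date = '2012-05-14')
SELECT uid, element_id FROM staging_visits WHERE dt = '2012-05-14';

-- This query reads only the 2012-05-14 partition, not the whole table.
SELECT COUNT(*) FROM visits WHERE visit_date = '2012-05-14';
```

Note this fits batch-style workloads, as Nitin says later in the thread: each INSERT OVERWRITE rewrites a whole partition rather than updating rows in place.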
>>>> *
>>>> But the problem is that after execution I didn't see any difference in
>>>> performance (before applying the index versus after applying it).
>>>> I have created the indexes as below:
>>>>
>>>> sql = "CREATE INDEX INDEX_VisitDate ON TABLE Tmp(Uid, VisitDate) AS
>>>>     'COMPACT' WITH DEFERRED REBUILD STORED AS RCFILE";
>>>> res2 = stmt2.executeQuery(sql);
>>>> sql = new StringBuilder("INSERT OVERWRITE TABLE Tmp SELECT C1.Uid, "
>>>>     + "C1.VisitDate, C1.ID FROM TmpElementTable C1 LEFT OUTER JOIN "
>>>>     + "Tmp T ON C1.Uid = T.Uid AND C1.VisitDate = T.VisitDate").toString();
>>>> stmt2.executeUpdate(sql);
>>>> sql = "LOAD DATA INPATH '/user/hive/warehouse/tmp' OVERWRITE INTO TABLE
>>>>     TmpElementTable";
>>>> stmt2.executeUpdate(sql);
>>>> sql = "ALTER INDEX clinical_index ON TmpElementTable REBUILD";
>>>> res2 = stmt2.executeQuery(sql);
>>>>
>>>> *Did I use it correctly?*
>>>>
>>>> As for your suggestion to try partitioning: I am actually altering the
>>>> table, which has a large number of columns, at runtime only. If I use
>>>> partitioning in such a situation, is it good to partition on all columns?
>>>>
>>>> So I want to know: after using partitioning, will it improve the
>>>> performance, or do I need to use both partitions and indexes?
>>>>
>>>> --
>>>> Regards,
>>>> Bhavesh Shah
>>>>
>>>> On Mon, May 14, 2012 at 3:13 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
>>>>
>>>>> It is definitely possible to increase your performance.
>>>>>
>>>>> I have run queries where more than 10 billion records were involved.
>>>>> If you are doing joins in your queries, you may have a look at the
>>>>> different kinds of joins supported by Hive.
>>>>> If one of your tables is very small compared to the other table,
>>>>> then you may consider a map-side join, etc.
>>>>>
>>>>> Also, the number of maps and reducers is decided by the split size you
>>>>> provide to the maps.
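[Editor's aside: a sketch of the map-side join Nitin mentions, with hypothetical table names (`big_fact_table`, `small_dim_table`). When one side of a join fits in memory, Hive can ship it to every mapper and skip the reduce-side shuffle:]

```sql
-- Ask Hive to convert joins automatically into map joins when one
-- side is below the size threshold (in bytes).
SET hive.auto.convert.join=true;
SET hive.mapjoin.smalltable.filesize=25000000;

-- Or force it explicitly with the MAPJOIN hint: the small dimension
-- table is loaded into memory on each mapper, avoiding a reduce phase.
SELECT /*+ MAPJOIN(d) */ f.uid, d.description
FROM big_fact_table f
JOIN small_dim_table d ON f.dim_id = d.dim_id;
```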
>>>>>
>>>>> I would suggest that before you go full speed, you decide on how you
>>>>> want to lay out the data for Hive.
>>>>>
>>>>> You can try loading some data, partitioning it, and writing queries
>>>>> based on the partitions; then performance will improve, but in that
>>>>> case your queries will be in batch-processing form. There are other
>>>>> approaches as well.
>>>>>
>>>>> On Mon, May 14, 2012 at 2:31 PM, Bhavesh Shah <bhavesh25s...@gmail.com> wrote:
>>>>>
>>>>>> That I fail to know: how many maps and reducers there are, because
>>>>>> for some reason my instance got terminated. :(
>>>>>> I want to know one thing: if we use multiple nodes, what should the
>>>>>> count of maps and reducers be?
>>>>>> Actually I am confused about that. How do I decide it?
>>>>>>
>>>>>> I also want to try different properties like block size, compressed
>>>>>> output, size of the in-memory buffer, parallel execution, etc.
>>>>>> Do all these properties matter for increasing performance?
>>>>>>
>>>>>> Nitin, you have read my whole use case. Is what I did to implement it
>>>>>> with the help of Hadoop correct?
>>>>>> Is it possible to increase the performance?
>>>>>>
>>>>>> Thanks Nitin for your reply. :)
>>>>>>
>>>>>> --
>>>>>> Regards,
>>>>>> Bhavesh Shah
>>>>>>
>>>>>> On Mon, May 14, 2012 at 2:07 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
>>>>>>
>>>>>>> With a 10-node cluster the performance should improve.
>>>>>>> How many maps and reducers are being launched?
>>>>>>>
>>>>>>> On Mon, May 14, 2012 at 1:18 PM, Bhavesh Shah <bhavesh25s...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I have nearly 1 billion records in my relational database.
>>>>>>>> Currently, locally, I am using just one cluster. But I also tried
>>>>>>>> this on Amazon Elastic MapReduce with 10 nodes, and the time taken
>>>>>>>> to execute the complete program was the same as on my single local
>>>>>>>> machine.
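[Editor's aside on the maps-and-reducers question above: the counts are not set directly per node but follow from split size and data volume. A hedged sketch of Hive/Hadoop 0.x-era settings; the specific values are illustrative, not from the thread:]

```sql
-- The number of map tasks follows from the input split size.
SET mapred.max.split.size=256000000;   -- upper bound, in bytes per split
SET mapred.min.split.size=1;

-- Let Hive derive the reducer count from the data volume...
SET hive.exec.reducers.bytes.per.reducer=1000000000;  -- ~1 GB per reducer

-- ...or pin it explicitly for a given query.
SET mapred.reduce.tasks=20;
```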
>>>>>>>>
>>>>>>>> On Mon, May 14, 2012 at 1:13 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> How many records?
>>>>>>>>>
>>>>>>>>> What is your Hadoop cluster setup? How many nodes?
>>>>>>>>> If you are running Hadoop as a single-node setup on an ordinary
>>>>>>>>> desktop, I doubt it will be of any help.
>>>>>>>>>
>>>>>>>>> You need a stronger cluster setup for better query runtimes, and
>>>>>>>>> of course query optimization, which I guess you have already taken
>>>>>>>>> care of.
>>>>>>>>>
>>>>>>>>> On Mon, May 14, 2012 at 12:39 PM, Bhavesh Shah <bhavesh25s...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hello all,
>>>>>>>>>> My use case is:
>>>>>>>>>> 1) I have a relational database (MS SQL Server) which holds a
>>>>>>>>>> very large amount of data.
>>>>>>>>>> 2) I want to analyze this huge data set and generate reports from
>>>>>>>>>> the analysis. I have to generate various reports like this, based
>>>>>>>>>> on different analyses.
>>>>>>>>>>
>>>>>>>>>> I tried to implement this using Hive. What I did is:
>>>>>>>>>> 1) I imported all tables into Hive from MS SQL Server using Sqoop.
>>>>>>>>>> 2) I wrote many queries in Hive, which I execute over JDBC
>>>>>>>>>> against the Hive Thrift server.
>>>>>>>>>> 3) I get the correct results, in table form, as expected.
>>>>>>>>>> 4) But the problem is that the execution time is far too long.
>>>>>>>>>> (My complete program executes in about 3-4 hours on a *small
>>>>>>>>>> amount of data*.)
>>>>>>>>>>
>>>>>>>>>> I decided to do this using Hive, and as I said above, that is how
>>>>>>>>>> much time Hive takes to execute.
>>>>>>>>>> My organization expects this task to be completed in less than
>>>>>>>>>> half an hour.
>>>>>>>>>>
>>>>>>>>>> Now, after spending so much time on the complete execution of
>>>>>>>>>> this task, what should I do?
>>>>>>>>>> I want to ask one thing:
>>>>>>>>>> *Is this use case possible with Hive?* If it is possible, what
>>>>>>>>>> should I do in my program to increase the performance?
>>>>>>>>>> *And if it is not possible, what is another good way to implement
>>>>>>>>>> this use case?*
>>>>>>>>>>
>>>>>>>>>> Please reply.
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Regards,
>>>>>>>>>> Bhavesh Shah
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Nitin Pawar