Re: Issue while inserting data in the hive table using map side join

2014-04-23 Thread Db-Blog
Hi Anirudh,

Below are some links suggesting the problem might be related to the DataNodes.
Please go through them and let us know if they help.
1. http://hansmire.tumblr.com
2. http://wiki.apache.org/hadoop/CouldOnlyBeReplicatedTo

Hive Experts- Kindly share your suggestions/findings on the same. 

Thanks,
Saurabh

> On 24-Apr-2014, at 1:08 am, anirudh kala  wrote:
> 
> org.apache.hadoop.ipc.RemoteException(java.io.IOException):


Re: large number of small files vs one big file in hive table

2014-05-05 Thread Db-Blog
In general it is recommended to have millions of large files rather than
billions of small files in Hadoop.

Please describe your use case in more detail, for example:
- How are you planning to consume the data stored in this partitioned table?
- Are you looking for storage and performance optimizations?
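
If the goal is simply fewer, larger files per monthly partition, below is a
minimal sketch of the merge settings commonly used for that (values are
illustrative, and the table/column names are made up):

-- Merge the small files produced at the end of map-only and map-reduce jobs.
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
-- Target size of merged files and the average-size threshold that triggers
-- an extra merge pass (bytes; illustrative values).
SET hive.merge.size.per.task=256000000;
SET hive.merge.smallfiles.avgsize=128000000;

-- Hypothetical monthly-partitioned target for the daily RDBMS load.
INSERT INTO TABLE monthly_table PARTITION (load_month='2014-05')
SELECT col1, col2, col3
FROM daily_staging;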

Thanks
Saurabh

Sent from my iPhone, please avoid typos.

> On 05-May-2014, at 3:33 pm, Shushant Arora  wrote:
> 
> I have a Hive table in which data is populated from an RDBMS on a daily basis.
> 
> After map reduce, each mapper writes its data into the Hive table, partitioned at 
> month level.
> The issue is that when the daily job runs it fetches the previous day's data and each 
> mapper writes its output to a separate file. Should I merge those files into a single one?
> 
> What should the file format be? Is a sequence file or text better?
> 
> 


Re: largest table last in joins

2014-05-05 Thread Db-Blog
Hi, 
If we have one big table joined with a small table and the MAPJOIN hint is
specified on the smaller table, is the ordering still required?

We can always force the auto-convert-join property to false and enable
MAPJOIN hints.
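
For reference, a minimal sketch of what I mean (table names are made up;
hive.ignore.mapjoin.hint also has to be disabled on versions that ignore the
hint by default):

-- Turn off automatic map-join conversion and honour explicit hints instead.
SET hive.auto.convert.join=false;
SET hive.ignore.mapjoin.hint=false;

-- Hypothetical tables: big_tbl is large, dim_small is the small lookup table.
SELECT /*+ MAPJOIN(dim_small) */ b.id, b.amount, s.description
FROM big_tbl b
JOIN dim_small s
  ON b.dim_key = s.dim_key;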

Please let me know if I am off base on this topic. 

Thanks,
Saurabh

Sent from my iPhone, please avoid typos.

> On 05-May-2014, at 9:19 pm, Alan Gates  wrote:
> 
> Join ordering is not yet part of the Hive optimizer.  There is integration 
> work being done with the Optiq framework that will address this, but it is 
> not complete yet.  Hopefully at least an initial integration will be 
> available in the next Hive release.
> 
> Alan.
> 
>> On May 2, 2014, at 5:36 AM, Aleksei Udatšnõi  wrote:
>> 
>> Hello,
>> 
>> There is this old recommendation for optimizing Hive join to use the largest 
>> table last in the join.
>> http://archive.cloudera.com/cdh/3/hive/language_manual/joins.html
>> 
>> The same recommendation appears in Programming Hive book.
>> 
>> Is this recommendation still valid, or do newer versions of Hive take care of 
>> such optimization automatically?
>> 
>> Best,
>> Aleksei
> 
> 


Re: Hive Table : Read or load data in Hive Table from multiple subdirectories

2014-05-26 Thread Db-Blog
Implement dynamic partitioning on a daily cadence.
Example:
ParentDirectory/partition=Day1/Day1_n_files.gz
ParentDirectory/partition=Day2/Day2_n_files.gz
ParentDirectory/partition=Day30/Day30_n_files.gz
And so on...

You can also opt for monthly partitions rather than daily ones by comparing the
daily file size with the HDFS block size.
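
A minimal sketch of the daily dynamic-partition load (table and column names
are hypothetical, and day_id stands in for the partition column in the layout
above):

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Hypothetical names: staging_import holds the raw imported rows,
-- daily_data is the target table partitioned by day.
CREATE TABLE IF NOT EXISTS daily_data (
  col1 STRING,
  col2 STRING
)
PARTITIONED BY (day_id STRING);

-- The partition column goes last in the SELECT list; each distinct value
-- loads its own partition directory (day_id=Day1, day_id=Day2, ...).
INSERT OVERWRITE TABLE daily_data PARTITION (day_id)
SELECT col1, col2, import_day
FROM staging_import;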

Let us know your comments on the same. 

Thanks,
Saurabh

Sent from my iPhone, please avoid typos.

> On 16-May-2014, at 1:00 am, Matouk IFTISSEN  
> wrote:
> 
> The files are all the same format (.gz), but they are in different subdirectories.
> My problem is this: I want to do an import by day from Oracle to HDFS (into 
> directories):
> hdfs_my_parent_directory/import_dir_day1/part_data_import.gz
> hdfs_my_parent_directory/import_dir_day2/part_data_import.gz
> 
> hdfs_my_parent_directory/import_dir_day30/part_data_import.gz
> 
> If I point the Hive table to the parent directory (hdfs_my_parent_directory), 
> it does not read (load) the data.
> 
> How can I do this, to read all the files in the subdirectories using one (or two) 
> Hive table(s)?
> 
> Thanks in advance, and sorry for the grammar and spelling errors :)
> 
> 
> 2014-05-14 13:46 GMT+02:00 Joshua Fennessy :
>> If those files are all the same format, you would point the Hive table to 
>> the parent directory. It will recurse through and find all of the files to 
>> include in the table. 
>> 
>> You can filter files to use multiple tables, but recursion is designed in. 
>> 
>> Sent from my gadget.  Please excuse any spelling errors.
>> 
>> 
>>> On May 14, 2014, at 7:31 AM, "Matouk IFTISSEN"  
>>> wrote:
>>> 
>>> Hey geeks,
>>> Is there a best way to load or read data into a Hive table (managed or 
>>> external) from multiple subdirectories?
>>> Example: I have a directory my_directory and in it there are lots of 
>>> subdirectories, i.e.:
>>> my_directory --> my_subdirectory1, my_subdirectory2, ..., my_subdirectoryx
>>> and my data (files) are in those subdirectories. How can I read them 
>>> with one/two Hive tables?
>>> 
>>> Thanks in advance
> 
> 
> 


Re: Hive huge 'startup time'

2014-07-18 Thread Db-Blog
Hello everyone, 

Thanks for sharing these valuable inputs. I am working on a similar kind of task;
it would be really helpful if you could share the command for increasing the heap
size of the hive-cli/launching process.
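
So far I have only found that the heap of the launching process can be raised
through the client environment before starting the CLI. An untested sketch
(the 4 GB value is just an example); please confirm or correct:

# Raise the heap of the Hive CLI / launching JVM (value is illustrative).
export HADOOP_CLIENT_OPTS="-Xmx4096m"
# Alternatively, HADOOP_HEAPSIZE (in MB) is honoured by the hadoop/hive scripts.
export HADOOP_HEAPSIZE=4096

hive -hiveconf hive.root.logger=DEBUG,console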

Thanks,
Saurabh

Sent from my iPhone, please avoid typos.

> On 18-Jul-2014, at 8:23 pm, Edward Capriolo  wrote:
> 
> Unleash ze file crusha!
> 
> https://github.com/edwardcapriolo/filecrush
> 
> 
>> On Fri, Jul 18, 2014 at 10:51 AM, diogo  wrote:
>> Sweet, great answers, thanks.
>> 
>> Indeed, I have a small number of partitions, but lots of small files, ~20MB 
>> each. I'll make sure to combine them. Also, increasing the heap size of the 
>> cli process already helped speed it up.
>> 
>> Thanks, again.
>> 
>> 
>>> On Fri, Jul 18, 2014 at 10:26 AM, Edward Capriolo  
>>> wrote:
>>> The planning phase needs to do work for every Hive partition and every 
>>> Hadoop file. If you have a lot of 'small' files or many partitions this 
>>> can take a long time. 
>>> Also, the planning phase that happens on the job tracker is single threaded.
>>> Also, the new YARN stuff requires back and forth to allocate containers. 
>>> 
>>> Sometimes raising the heap for the hive-cli/launching process helps, 
>>> because the default heap of 1 GB may not be a lot of space to deal with all 
>>> of the partition information, and more memory headroom will make this go faster.
>>> Sometimes setting the min split size higher launches fewer map tasks, which 
>>> speeds up everything.
>>> 
>>> So the answer...Try to tune everything, start hive like this:
>>> 
>>> bin/hive -hiveconf hive.root.logger=DEBUG,console
>>> 
>>> And note where the longest gaps with no output are; that is what you 
>>> should try to tune first.
>>> 
>>> 
>>> 
>>> 
 On Fri, Jul 18, 2014 at 9:36 AM, diogo  wrote:
 This is probably a simple question, but I'm noticing that for queries that 
 run on 1+TB of data, it can take Hive up to 30 minutes to actually start 
 the first map-reduce stage. What is it doing? I imagine it's gathering 
 information about the data somehow; this 'startup' time is clearly a 
 function of the amount of data I'm trying to process.
 
 Cheers,
> 


Re: Handling blob in hive

2014-08-11 Thread Db-Blog
You can store the BLOB data type as a string in Hive.
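
A minimal sketch (table and column names are hypothetical); binary payloads are
typically encoded, e.g. base64, before being stored as a STRING, and Hive's
BINARY type is another option:

-- Hypothetical table holding blob payloads exported from the relational DB.
CREATE TABLE IF NOT EXISTS blob_store (
  doc_id  BIGINT,
  payload STRING  -- e.g. base64-encoded blob content
);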

Thanks,
Saurabh

Sent from my iPhone, please avoid typos.

> On 08-Aug-2014, at 9:10 am, Chhaya Vishwakarma 
>  wrote:
> 
> Hi,
>  
> I want to store and retrieve a BLOB in Hive. Is it possible to store a BLOB in 
> Hive?
> If it is not supported, what alternatives can I go with?
> The BLOB may also reside inside a relational DB.
> I did some research but could not find a relevant solution.
>  
> Regards,
> Chhaya Vishwakarma
>  
> 


Learn Java for Hadoop

2014-08-14 Thread Db-Blog
Greetings to everyone.

I am a newbie in Java and am seeking guidance on learning the "Java specifically
required for Hadoop". It would be really helpful if someone could pass on
links/topics/online courses that can help me get started.

I come from an ETL & DB/SQL background and have been working on
Hive/Impala/Pig/Sqoop for a couple of years.

I have done some research on other Big Data tools, and Java will be required
in depth for them. Below is the list of tools analysed:
- Real-time processing (Apache Kafka and Storm)
- Advanced searching (Solr/Lucene)
- Machine learning (Apache Mahout)

Please feel free to comment if I am off base on anything.

Kindly share your suggestions, and thanks for going through the post and giving
your valuable time.

Thanks,
Saurabh

Re: Learn Java for Hadoop

2014-08-15 Thread Db-Blog
Hey There,

Thanks for suggesting the links mentioned below; however, I am aware of how
Hadoop works and have referred to these links in detail since my start with
Hadoop. My apologies if my earlier email wasn't clear enough to explain my
problem statement.

Starting fresh again!
I have experience in Hadoop and have worked on bare-metal and cloud
implementations of big data, e.g. Cloudera CDH, Hortonworks HDP and Amazon EMR.
During this time I got a chance to explore Hive, Impala, Sqoop and Pig in detail
and processed large data sets residing in HDFS. I also enjoyed playing with
shell scripts to automate commands and orchestrate processes. All of this was
batch processing and mostly related to SQL.

Now I want to move on to real-time implementations and the other technologies
(mentioned in the trailing mails), which definitely need Java expertise.

I am seeking guidance on the specific Java topics that will be needed for
Hadoop only. Links/posts/courses on the same would be really helpful.

I also look forward to contributing and sharing my knowledge with the
community. :)

Thanks,
Saurabh

> On 15-Aug-2014, at 5:09 am, Nishant Kelkar  wrote:
> 
> Hi Saurabh, 
> 
> Welcome to the world of Apache Hadoop! Here are a few good places to start: 
> 
> 1. Apache Hadoop Definitive Guide book: 
> http://shop.oreilly.com/product/0636920021773.do (you could find a free 
> e-copy if you Google some :) )
> 2. Hadoop Javadocs: https://hadoop.apache.org/docs/current/api/
> 3. If you want to install Hadoop on your local, Noll's tutorial on how to do 
> so for a pseudo-distributed mode is really nice: 
> http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
> 4. The way I started, is by experimenting with Hadoop on my Linux box 
> terminal. You should definitely try out basic operations, like adding a file 
> to HDFS from your local filesystem, copying a file from HDFS to your local, 
> looking at filesystem size, moving files around in HDFS, etc. Here's where 
> you can start: http://hadoop.apache.org/docs/r0.18.3/hdfs_shell.html
> 
> In general, I think you should also look at blogs/posts that help you 
> distinguish Java from the other languages you've used (like HiveQL for 
> example). How is Java different from C++? What is the difference between a 
> declarative programming language and an object-oriented programming language? 
> How does Java create objects? How does it manage them, and dispose of them? 
> These are the questions you want to look into first, even before starting to 
> write code in Java.
> 
> Welcome to the group once again, and hope you'll be able to start 
> contributing to the open-source community real quick! :)
> 
> Best Regards,
> Nishant Kelkar
> 
> 
>> On Thu, Aug 14, 2014 at 3:27 PM, Db-Blog  wrote:
>> Greetings to everyone.
>> 
>> I am a newbie in Java and seeks guidance in learning "Java specifically 
>> required for Hadoop". It will be really helpful if someone can pass on the 
>> links/topics/online-courses which can be helpful to get started on it.
>> 
>> I come from ETL & DB- SQL background and currently working on 
>> Hive/Impala/Pig/Sqoop since couple of years.
>> 
>> I have done some research on other tools of Big Data and Java will be 
>> required in depth. Below is the list of tools analysed :
>> - Real time processing  (Apache Kafka and  Storm)
>> - Advance Searching (Solr/Lucene)
>> - Machine learning (Apache Mahout)
>> 
>> Please feel free to comment if I am off-base on anything.
>> 
>> Kindly suggest regarding the same and thanks for going thru the post and 
>> providing your valuable time.
>> 
>> Thanks,
>> Saurabh
> 


Tez Optimisation Parameters

2015-08-22 Thread Db-Blog
Hi, 

I am trying to load aggregate data from one massive table containing historical
data for ONE year. Partitioning is implemented on the historical table; however,
the number of files is huge (>100) and they are gz-compressed.

I am trying to load it using the Tez execution engine. Can someone suggest some
quick optimisation parameters to fine-tune the query performance?
Something similar to HDFS block size, min input splits, map-side aggregation, etc.
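
The parameters I have found so far are below, as an untested sketch (values are
illustrative; corrections welcome):

-- Input split grouping for Tez (bytes; illustrative values).
SET tez.grouping.min-size=134217728;
SET tez.grouping.max-size=1073741824;
-- Container size (MB) and JVM opts for Hive-on-Tez tasks.
SET hive.tez.container.size=4096;
SET hive.tez.java.opts=-Xmx3276m;
-- Map-side aggregation for the aggregate load.
SET hive.map.aggr=true;
-- Note: .gz files are not splittable, so each file is read by a single task.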

Thanks,
Saurabh

Bucketing- Identify Number of Buckets

2015-09-06 Thread Db-Blog
Hi, 

I need to join two big tables in Hive. The join key is the grain of both tables,
so clustering and sorting on it will provide significant performance optimisation
while joining.

However, I am not sure how to calculate the exact number of buckets while
creating these tables. Can someone please share any pointers on the same?

I am planning to keep these clustered and sorted tables as Parquet/ORC for
columnar storage and better compression.
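
For context, this is the shape of the DDL I have in mind (table/column names and
the bucket count are placeholders; one rule of thumb I have seen is to size each
bucket at roughly one to two HDFS blocks, but I would like confirmation):

SET hive.enforce.bucketing=true;
SET hive.enforce.sorting=true;

-- Hypothetical table; 64 is a placeholder bucket count.
CREATE TABLE fact_a (
  join_key BIGINT,
  metric   DOUBLE
)
CLUSTERED BY (join_key) SORTED BY (join_key) INTO 64 BUCKETS
STORED AS ORC;

-- With both tables bucketed and sorted identically on the join key, the
-- sort-merge bucket join can be enabled.
SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;
SET hive.auto.convert.sortmerge.join=true;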

Thanks,
Saurabh

Sent from my iPhone, please avoid typos.

Re: Bucketing- Identify Number of Buckets

2015-09-06 Thread Db-Blog
Details of the Hive version:
I am using Hive 0.14.0 with Tez as the execution engine.

Thanks,
Saurabh

Sent from my iPhone, please avoid typos.

> On 07-Sep-2015, at 1:51 am, Db-Blog  wrote:
> 
> Hi, 
> 
> I need to join two big tables in hive. The join key is the grain of both 
> these tables, hence clustering and sorting on the same will provide 
> significant performance optimisation while joining.  
> 
> However, i am not sure how to calculate the exact number of buckets while 
> creating these tables. Can someone please share any pointers on the same? 
> 
> Planning to keep these Clustered and Sorted tables as parquet/orc- for 
> columnar storage and better compression. 
> 
> Thanks,
> Saurabh
> 
> Sent from my iPhone, please avoid typos.


Re: Hive update operation

2016-09-01 Thread Db-Blog
Hi Mich, 
Nice explanation! 
Does the UPDATE operation in Hive work row by row, or is it performed in batches?
We also observed multiple temp files getting generated in HDFS while performing
the update operation.

It would be really helpful if you could share details of what Hive does in the
background.

Thanks,
Saurabh

> On 26-Aug-2016, at 11:47 AM, Mich Talebzadeh  
> wrote:
> 
> OK, this is what you have in MSSQL (COLLATE does not come into it in Hive):
> 
> UPDATE table1
> SET
>   address=regexp_replace(t2.cout_event_description,,)
> , latitude=t2.latitude
> , longitude=t2.longitude
> , speed =t2.speed
> , dtimestamp =mv.dtimestamp
> , reg_no=t2.registration
> , gpsstate = t2.bgps
> FROM
>   default.maxvalues  mv
> , table2 t2
> INNER JOIN table2 t2 on  mv.dtimestamp=t2.dtimestamp AND  mv.acqnum=t2.acqnum 
> INNER JOIN table1 t1 on mv.acqnum=t1.deal_number
> where t1.deal_number=mv.acqnum;
> 
> Simplify this in Hive and test
> 
> CREATE TEMPORARY TABLE tmp1
> AS
> SELECT  FROM
>  table2 t2, default.maxvalues mv
> WHERE mv.dtimestamp=t2.dtimestamp AND mv.acqnum=t2.acqnum
> 
> UPDATE table1
> SET
>   address=regexp_replace(t2.cout_event_description,,)
> , latitude=tmp1.latitude
> , longitude=tmp1.longitude
> , speed =tmp1.speed
> , dtimestamp =tmp1.dtimestamp
> , reg_no=tmp1.registration
> , gpsstate = tmp1.bgps
> FROM
> tmp1, table1 t1
> WHERE   tmp1.acqnum = tmp1.deal_number
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> 
>> On 26 August 2016 at 06:38, Priyanka Raghuvanshi  
>> wrote:
>> Current RDBMS:  SQL Server 2012
>> 
>> 
>> Yes, I tried the one below.
>> 
>> 
>> UPDATE table1 set 
>> address=regexp_replace(t2.cout_event_description,,),latitude=t2.latitude,longitude=t2.longitude
>>  ,speed =t2.speed,dtimestamp =mv.dtimestamp,reg_no=t2.registration,gpsstate 
>> = t2.bgps FROM  default.maxvalues mv, table2 t2 INNER JOIN table2 t2 on 
>> mv.dtimestamp=t2.dtimestamp AND mv.acqnum=t2.acqnum INNER JOIN table1 t1 on 
>> mv.acqnum=t1.deal_number
>> where t1.deal_number=mv.acqnum;
>> 
>> OUTPUT:
>> 
>> " FAILED: ParseException line 1:221 missing EOF at 'FROM' near 'bgps' " 
>> 
>> 
>> From: Mich Talebzadeh 
>> Sent: 25 August 2016 21:41:51
>> 
>> To: user
>> Subject: Re: Fw: Hive update operation
>>  
>> Hi,
>> 
>> What is your current RDBMS, and are these the SQL statements used in that RDBMS?
>> 
>> Have you tried them on Hive?
>> 
>> HTH
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> http://talebzadehmich.wordpress.com
>> 
>> 
>>> On 25 August 2016 at 06:56, Priyanka Raghuvanshi  
>>> wrote:
>>> Hi  Dr Mich,
>>> 
>>> 
>>> Thank you for replying.
>>> 
>>> 
>>> Yes, while creating the table the transactional property was set to true; 
>>> the same applies to the other tables.
>>> 
>>> 
>>> Following are SQL update query examples; I want to achieve the same through 
>>> HQL:
>>> 
>>> 
>>> 1)
>>> 
>>> UPDATE table1
>>> SET FAging=t2.FAging, 
>>> PaymentR=t2.PaymentR,
>>> ArrearsO=t2.ArrearsO ,
>>> IRemaining=t2.IRemaining,
>>> Taxi_Association=t2.TaxiAssociation
>>> From table2 t2
>>> Left JOIN table1  t1 
>>> ON t2.AccNum COLLATE DATABASE_DEFAULT= t1.AccNo COLLATE DATABASE_DEFAULT
>>> 
>>> 2)
>>> 
>>> UPDATE table1
>>> SET Img_String=CASE WHEN 
>>> convert(nvarchar,T1.dTimeStamp,103)=Convert(nvarchar,getdate(),103) AND 
>>> T1.Speed>0 then 
>>> isnull(T2.clmn1,'Other') 
>>> +';Moving;'+ISNULL(T2.PinsStatus,'Performing')+';'+CASE WHEN 
>>> ISNULL(T2.SupplierName,'New') LIKE '%Repo%' THEN 'Refurbished' ELSE 'New' 
>>> END   
>>> ELSE 
>>> isnull(T2.clmn1,'Other') +';Non 
>>> Moving;'+ISNULL(VEH.PinsStatus,'Performing')+';'+CASE WHEN 
>>> ISNULL(T2.SupplierName,'New') LIKE '%Repo%' THEN 'Refurbished' ELSE 'New' 
>>> END
>>> END,
>>> Moving_or_NonMoving
>>> =CASE WHEN 
>>> convert(nvarchar,T1.dTimeStamp,103)=Convert(nvarchar,getdate(),103) AND 
>>> T1.Speed>0 then 
>>> 'Moving'
>>> ELSE
>>> 'Non

Re: Controlling Number of small files while inserting into Hive table

2017-06-25 Thread Db-Blog
Hi Arpan,
Include the partition column in the DISTRIBUTE BY clause of the DML; it will
generate only one file per day. Hope this resolves the issue.

> "insert into 'target_table' select a,b,c from x where ... distribute by 
> (date)"
> 
PS: Backdated processing will generate additional file(s). One file per load. 
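
As a concrete sketch (table, column and partition names are hypothetical):

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- DISTRIBUTE BY the partition column sends all rows for a given date to the
-- same reducer, so each daily partition is written as a single file.
INSERT INTO TABLE target_table PARTITION (load_date)
SELECT a, b, c, load_date
FROM x
WHERE load_date = '2017-06-22'
DISTRIBUTE BY load_date;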

Thanks,
Saurabh

Sent from my iPhone, please avoid typos.

> On 22-Jun-2017, at 11:30 AM, Arpan Rajani  wrote:
> 
> Hello everyone,
> 
> 
> 
> I am sure many of you might have faced a similar issue.
> 
> We do "insert into 'target_table' select a,b,c from x where .." kinds of 
> queries for a nightly load. This insert goes into a new partition of the 
> target_table. 
> 
> Now the concern is: these inserts load hardly any data (I would say less 
> than 128 MB per day) but the data is fragmented into 1200 files, each file a 
> few kilobytes. This is slowing down performance. How can we make sure 
> this load does not generate a lot of small files?
> 
> I have already set hive.merge.mapfiles and hive.merge.mapredfiles to true 
> in the custom/advanced hive-site.xml, but the load job still writes the data as 
> 1200 small files. 
> 
> I know where 1200 comes from: it is the maximum number of 
> reducers/containers configured in one of the hive-site settings. (I do not think it is a 
> good idea to change this number cluster-wide, as that can 
> affect other jobs that use the cluster when it has free containers.) 
> 
> What other ways/settings could ensure that the Hive insert does not take 1200 
> slots and generate lots of small files?
> 
> I also have another question, which is partly contrary to the above (this is 
> relatively less important):
> 
> When I reload this table by creating a new table with a select on the target 
> table, the newly created table does not contain as many small files. The 
> newly created table's number of files drops from 1200 to ±50. What could 
> be the reason?
> 
> PS: I did go through 
> http://www.openkb.info/2014/12/how-to-control-file-numbers-of-hive.html
> 
> 
> 
> Regards,
> Arpan
> 