We have a similar situation in production. For your case I would
propose the following steps:
1. Design a MapReduce job (job output format: Text, Lzo, Snappy, your choice)
   Inputs to Mapper
   -- records from these three feeds
   Outputs from Mapper
   -- Key = <EMP1> Value = <feed1~field1 field2 field6 field9>
   -- Key = <EMP1> Value = <feed2~field5 field7 field10>
   -- Key = <EMP1> Value = <feed3~field3 field4 field8>
   Reducer Output
   -- Key = <EMP1> Value = <field1 field2 field3 field4 field5 field6 field7 field8 field9 field10>
2. (Optional) If you use LZO, you will need to run LzoIndexer so that the output files can be split
3. CREATE TABLE IF NOT EXISTS YOUR_HIVE_TABLE
4. ALTER TABLE YOUR_HIVE_TABLE ADD PARTITION (foo1 = , foo2 = ) LOCATION 'path/to/files'
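
As a rough illustration of the mapper and reducer in step 1, here is a minimal Python sketch that simulates the job in memory instead of running on Hadoop. It assumes tab-separated input records with the employee ID first and values that contain no spaces. The names FEED_FIELDS, map_record, reduce_employee and run_job are made up for this example, and dropping employees that are missing a feed is just one possible policy.

```python
from collections import defaultdict

# Which fields each feed carries, taken from the original question.
FEED_FIELDS = {
    "feed1": ["field1", "field2", "field6", "field9"],
    "feed2": ["field5", "field7", "field10"],
    "feed3": ["field3", "field4", "field8"],
}
OUTPUT_ORDER = [f"field{i}" for i in range(1, 11)]


def map_record(feed, line):
    """Mapper: emit (emp_id, 'feed~v1 v2 ...') for one tab-separated record."""
    emp_id, *values = line.rstrip("\n").split("\t")
    return emp_id, feed + "~" + " ".join(values)


def reduce_employee(emp_id, tagged_values):
    """Reducer: merge the tagged values for one key.

    Returns None when a feed is missing (assumed policy: skip the employee).
    """
    merged = {}
    seen = set()
    for tagged in tagged_values:
        feed, _, payload = tagged.partition("~")
        merged.update(zip(FEED_FIELDS[feed], payload.split(" ")))
        seen.add(feed)
    if seen != set(FEED_FIELDS):
        return None
    return emp_id, [merged[name] for name in OUTPUT_ORDER]


def run_job(records):
    """Simulate the shuffle: group mapper output by key, then reduce each group."""
    groups = defaultdict(list)
    for feed, line in records:
        key, value = map_record(feed, line)
        groups[key].append(value)
    results = (reduce_employee(k, v) for k, v in sorted(groups.items()))
    return [r for r in results if r is not None]
```

Calling run_job with a list of (feed name, line) pairs returns one merged row per employee that appears in all three feeds, with the fields in field1..field10 order.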
From: Stephen Sprague <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Friday, July 26, 2013 4:37 PM
To: "[email protected]" <[email protected]>
Subject: Re: Merging different HDFS file for HIVE
I like #2.
So you have, say, three external tables representing your three feed files.
After the third and final file is loaded, join them all together. Maybe
make the final table partitioned, with one partition per day.
for example:
alter table final add partition (datekey=YYYYMMDD);
insert overwrite table final partition (datekey=YYYYMMDD) select
a.EMP_ID,f1,...,f10 from FF1 a join FF2 b on (a.EMP_ID=b.EMP_ID) join FF3 c on
(b.EMP_ID=c.EMP_ID);
Or a variation on #3: make a view over the three tables that looks just
like the select statement above.
What do you want to optimize for?
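
The "after the third and final file is loaded" condition could be checked with something like the Python sketch below, which reports the branches whose three feed files have all arrived. The file naming scheme '<branch>_<feed>.txt' and the names REQUIRED_FEEDS and complete_branches are assumptions for illustration only. The thread does not say how the files are actually named.

```python
from collections import defaultdict

# The three feeds that must all be present before the join is run.
REQUIRED_FEEDS = frozenset({"feed1", "feed2", "feed3"})


def complete_branches(arrived_files):
    """Return the branches for which every required feed has arrived.

    Assumes hypothetical file names of the form '<branch>_<feed>.txt'.
    Names that do not match this pattern are ignored.
    """
    feeds_by_branch = defaultdict(set)
    for name in arrived_files:
        stem = name.rsplit(".", 1)[0]
        branch, _, feed = stem.rpartition("_")
        if feed in REQUIRED_FEEDS:
            feeds_by_branch[branch].add(feed)
    return sorted(
        branch
        for branch, feeds in feeds_by_branch.items()
        if feeds >= REQUIRED_FEEDS
    )
```

A scheduler could run the join for each branch this function returns and leave the other branches waiting for their remaining files.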
On Fri, Jul 26, 2013 at 5:30 AM, Nitin Pawar
<[email protected]> wrote:
Option 1) Use Pig or Oozie: write a workflow that joins the files into a
single file
Option 2) Create a temp table for each of the different files, join them
into a single table, and then delete the temp tables
Option 3) Don't do anything; change your queries to look at the three
different files when they need data from different files
Wait for others to give better suggestions :)
On Fri, Jul 26, 2013 at 4:22 PM, Ramasubramanian Narayanan
<[email protected]> wrote:
Hi,
Please help with a solution for the problem below. This scenario applies in
banking, at least.
I have a Hive table with the structure below:
Hive Table:
Field1
...
Field 10
For the above table, I will get the values for each feed in a different file.
You can imagine that these files belong to the same branch and can arrive at
any time. I have to load into the table only if I get all 3 files for the same
branch. (Assume that we have a common field in all the files to join on.)
Feed file 1 :
EMP ID
Field 1
Field 2
Field 6
Field 9
Feed File2 :
EMP ID
Field 5
Field 7
Field 10
Feed File3 :
EMP ID
Field 3
Field 4
Field 8
Now the question is:
what is the best way to combine all these files into a single file so
that it can be placed under the Hive table structure?
regards,
Rams
--
Nitin Pawar