Hi,

I have thought about a map-only join, but as I understand it that is still going to do a full table scan of my large data file. With billions of records that's still going to be slow, even if the query only returns a handful of them.
Also, I don't know of any way to get Hive to do a join without performing a Map/Reduce. And, as I mentioned before, just the overhead of setting up a Map/Reduce, even if it's map-only and does practically nothing, makes the elapsed time too high. I want this to be interactive. (I guess that something like Tez, when it becomes available, might solve this problem...) My ugly "hack" approach works in seconds; the overhead of setting up a Map/Reduce turns that into minutes.

Indexing looks promising but, as far as I can see, it can't be used without a Map/Reduce. If I could find a way to perform a join or use indexing without a Map/Reduce I would be happy to take that approach. Partitioning and ORC would be helpful, but I can't assume anything about the original data format.

Z

From: Nitin Pawar [mailto:nitinpawar...@gmail.com]
Sent: 27 June 2013 09:52
To: user@hive.apache.org
Subject: Re: Table Wrapper

A few thoughts: if you have a smaller file (MBs in size), have you considered a map-only join? Also, if you are interested in particular records from a table and do not want to go through the entire table to find them, then partitioning + indexing will be handy. The ORCFile format (still very new) can help you in this regard as well.

On Thu, Jun 27, 2013 at 2:16 PM, Peter Marron <peter.mar...@trilliumsoftware.com> wrote:

Well, I'm not very good at keeping things brief, unfortunately. But I'll have a go, trying to keep things simple.

Suppose that I have a data table in Hive with many rows - say billions. I have another file stored in HDFS (it can be a Hive table too if that helps); this file is small and contains file offsets into the data, stored as binary, 8 bytes per offset. Now suppose that I want to read the records from the data at the offsets given in the small file, in the order defined by the small file. How can I do that?

The obvious way is to turn the small file into a Hive table with a custom InputFormat which can read the binary. I've done that, that's the easy part, and then I can form a query like this:

    SELECT * FROM data JOIN small
    ON data.BLOCK__OFFSET__INSIDE__FILE = small.offset;

But, when it works, this performs awfully.

The approach that I have taken instead is to create a "copy" of the data table which is "hacked" to use a custom InputFormat which knows about the small file and which overrides the record reader to use the offsets as seeks before it reads the records. This is awkward, for various reasons, but it works well: I avoid a full table scan, in fact I can suppress any Map/Reduce, and so the query runs very quickly. So I was just trying to "wrap" the data table so that I didn't have to create the copy.

I hope that you don't regret asking too much.

Regards,

Z
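For concreteness, here is a rough sketch of the kind of seek-based record reader described above. It is illustrative only: the class and member names are invented, not Peter's actual code; it uses the old mapred API current in Hive 0.11/Hadoop 1.0.4; and it assumes newline-terminated text records in the data file and big-endian 8-byte offsets in the small file. Error handling and split handling are omitted.

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.util.LineReader;

    // Visits only the rows named in a side file of 8-byte offsets,
    // instead of scanning the whole data file.
    public class OffsetSeekingRecordReader implements RecordReader<LongWritable, Text> {

      private final FSDataInputStream data;     // the big data file
      private final FSDataInputStream offsets;  // the small file: 8 bytes per offset
      private final long totalOffsets;          // number of offsets in the small file
      private long consumed = 0;                // offsets visited so far

      public OffsetSeekingRecordReader(JobConf conf, Path dataFile, Path offsetFile)
          throws IOException {
        FileSystem fs = dataFile.getFileSystem(conf);
        data = fs.open(dataFile);
        offsets = fs.open(offsetFile);
        totalOffsets = fs.getFileStatus(offsetFile).getLen() / 8;
      }

      public boolean next(LongWritable key, Text value) throws IOException {
        if (consumed >= totalOffsets) {
          return false;                    // small file exhausted: we're done
        }
        long off = offsets.readLong();     // next 8-byte big-endian offset
        data.seek(off);                    // jump straight to the record - no scan
        // A fresh LineReader per record is wasteful but keeps the sketch
        // correct: a long-lived reader's read-ahead buffer would go stale
        // across seeks.
        new LineReader(data).readLine(value);
        key.set(off);
        consumed++;
        return true;
      }

      public LongWritable createKey() { return new LongWritable(); }
      public Text createValue() { return new Text(); }
      public long getPos() throws IOException { return consumed * 8; }
      public float getProgress() {
        return totalOffsets == 0 ? 1.0f : (float) consumed / totalOffsets;
      }
      public void close() throws IOException {
        data.close();
        offsets.close();
      }
    }

Each call to next() costs one seek and one record read, so a query driven by a small offset file touches only a handful of records no matter how large the data file is - which is why this approach runs in seconds where a full scan would not.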
From: Stephen Sprague [mailto:sprag...@gmail.com]
Sent: 25 June 2013 18:37
To: user@hive.apache.org
Subject: Re: Table Wrapper

Good luck, bro. :) May I ask why you are doing this to yourself? I think your instincts are correct: going down the path you describe sounds a tad more painful than just hitting yourself in the head with a hammer. Different strokes for different folks, though.

So can we back up? What - briefly if possible - do you want to achieve with a "wrapper"? (I'm going to regret asking that, I know.)

On Tue, Jun 25, 2013 at 7:29 AM, Peter Marron <peter.mar...@trilliumsoftware.com> wrote:

Hi,

Running Hive 0.11.0 over Hadoop 1.0.4.

I would like to be able to "wrap" a Hive table. So, if I have a table "X" which uses SerDe "s" and InputFormat "i", then I would like to be able to create a table "Y" with a SerDe "ws" which is a wrapper of "s" (and so can encapsulate an instance of "s") and an InputFormat "wi" which is a wrapper of "i" (and similarly encapsulates an instance of "i").

So far I have done this by creating a table like this:

    CREATE TABLE Y (... copy of underlying table's columns ...)
    ROW FORMAT SERDE 'ws'
    WITH SERDEPROPERTIES (
      'wrapped.serde.name'='org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe',
      'wrapped.inputformat.name'='TextInputFormat',
      'serialization.format'='|',
      'field.delim'='|'
    )
    STORED AS
      INPUTFORMAT 'wi'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    TBLPROPERTIES (...);

I have to add the names of the underlying classes "s" and "i" to the table properties so that I know what to instantiate. I also have to replicate all the column details of the wrapped table, to ensure that the correct information is passed down to the underlying SerDe when I instantiate it. I also have to know the output format. I have to explicitly add the default SerDe properties to get it to work, and I have to explicitly provide the default output format too. If any changes are made to the underlying table then I need to reflect those changes in my "wrapper" table. It's a mess.

What I'd like to be able to do is just parameterise my wrapper table with the name of the underlying table, and use that name to instantiate the correct SerDe and InputFormat. Is there an easier way to do this? Any pointers appreciated.

Z

--
Nitin Pawar
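For completeness, a rough sketch of the delegation pattern described above: a wrapper SerDe that instantiates the wrapped SerDe named in the table properties and forwards every call to it. The property name follows the 'wrapped.serde.name' convention from the CREATE TABLE; the class name WrapperSerDe is invented, and error handling is trimmed.

    import java.util.Properties;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hive.serde2.SerDe;
    import org.apache.hadoop.hive.serde2.SerDeException;
    import org.apache.hadoop.hive.serde2.SerDeStats;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.util.ReflectionUtils;

    // Wraps whichever SerDe is named by 'wrapped.serde.name' and
    // delegates everything to it.
    public class WrapperSerDe implements SerDe {

      private SerDe wrapped;

      public void initialize(Configuration conf, Properties tbl) throws SerDeException {
        String name = tbl.getProperty("wrapped.serde.name");
        try {
          wrapped = (SerDe) ReflectionUtils.newInstance(Class.forName(name), conf);
        } catch (ClassNotFoundException e) {
          throw new SerDeException(e);
        }
        // Pass the same table properties through - which is why table Y
        // has to replicate the wrapped table's columns and delimiters.
        wrapped.initialize(conf, tbl);
      }

      public Object deserialize(Writable blob) throws SerDeException {
        return wrapped.deserialize(blob);   // the place to intercept rows
      }

      public ObjectInspector getObjectInspector() throws SerDeException {
        return wrapped.getObjectInspector();
      }

      public Class<? extends Writable> getSerializedClass() {
        return wrapped.getSerializedClass();
      }

      public Writable serialize(Object obj, ObjectInspector oi) throws SerDeException {
        return wrapped.serialize(obj, oi);
      }

      public SerDeStats getSerDeStats() {
        return wrapped.getSerDeStats();
      }
    }

A wrapper InputFormat can delegate getSplits() and getRecordReader() the same way, using the 'wrapped.inputformat.name' property. Note that this doesn't remove the pain point in the question: the wrapped SerDe still only initializes correctly because table Y replicates the underlying table's columns and default properties.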