Ok, given the large number of doubles in the schema and the bzip2 compression, I can see why the text would be smaller.

ORC doesn't do compression on floats or doubles, although there is a JIRA to add it (https://issues.apache.org/jira/browse/HIVE-3889). Bzip2 is a very aggressive compressor; we should probably add it as an option for ORC for long-term storage files (https://issues.apache.org/jira/browse/HIVE-5067).

ORC will give you significant performance gains when you select a subset of the columns. A query like

    select COL70, COL74 from test;

will run much faster on ORC than on text files.
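For reference, requesting a specific codec when creating an ORC table in Hive 0.11 looks roughly like the following sketch. The table name, column subset, and location are hypothetical; the "orc.compress" table property accepts NONE, ZLIB, or SNAPPY and defaults to ZLIB.

    -- Sketch: an ORC table with an explicitly chosen codec (Hive 0.11).
    -- Table name and location are hypothetical.
    CREATE EXTERNAL TABLE test_orc (
      COL17 DOUBLE,
      COL70 STRING,
      COL74 STRING
    )
    STORED AS ORC
    LOCATION 's3://test/orcfile/'
    TBLPROPERTIES ("orc.compress"="ZLIB");

    -- ORC reads only the streams for the projected columns, which is
    -- where the performance gains show up:
    SELECT COL70, COL74 FROM test_orc;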
On Mon, Aug 12, 2013 at 9:27 AM, pandees waran <pande...@gmail.com> wrote:

> Hi Owen,
>
> Thanks for your response.
>
> My structure is like this:
>
> a) Textfile:
>
> CREATE EXTERNAL TABLE test_textfile (
> COL1 BIGINT,
> COL2 STRING,
> COL3 BIGINT,
> COL4 STRING,
> COL5 STRING,
> COL6 BIGINT,
> COL7 BIGINT,
> COL8 BIGINT,
> COL9 BIGINT,
> COl10 BIGINT,
> COl11 BIGINT,
> COL12 STRING,
> COl13 STRING,
> COl14 STRING,
> COl15 BIGINT,
> COl16 STRING,
> COL17 DOUBLE,
> COl18 DOUBLE,
> COl19 DOUBLE,
> COl20 DOUBLE,
> COl21 DOUBLE,
> COL22 DOUBLE,
> COl23 DOUBLE,
> COL24 DOUBLE,
> COl25 DOUBLE,
> COL26 DOUBLE,
> COl27 DOUBLE,
> COL28 DOUBLE,
> COL29 DOUBLE,
> COl30 DOUBLE,
> COl31 DOUBLE,
> COL32 DOUBLE,
> COL33 STRING,
> COl34 STRING,
> COl35 DOUBLE,
> COL36 DOUBLE,
> COl37 DOUBLE,
> COL38 DOUBLE,
> COl39 DOUBLE,
> COL40 DOUBLE,
> COl41 DOUBLE,
> COL42 DOUBLE,
> COL43 DOUBLE,
> COl44 DOUBLE,
> COl45 DOUBLE,
> COL46 DOUBLE,
> COL47 DOUBLE,
> COl48 DOUBLE,
> COl49 DOUBLE,
> COL50 DOUBLE,
> COL51 DOUBLE,
> COl52 DOUBLE,
> COl53 DOUBLE,
> COl54 DOUBLE,
> COL55 DOUBLE,
> COL56 STRING,
> COL57 DOUBLE,
> COL58 DOUBLE,
> COL59 DOUBLE,
> COl60 DOUBLE,
> COl61 STRING,
> COL62 STRING,
> COL63 STRING,
> COL64 STRING,
> COl65 STRING,
> COl66 STRING,
> COl67 STRING,
> COL68 STRING,
> Col69 STRING,
> COL70 STRING,
> COL71 STRING,
> COl72 STRING,
> COl73 STRING,
> COL74 STRING
> ) PARTITIONED BY (
> COL75 STRING,
> COL76 STRING
> ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
> STORED AS TEXTFILE LOCATION 's3://test/textfile/';
>
> Using block-level compression and BZip2Codec for output.
>
> b) With the above set of columns, I just changed it to STORED AS ORC to
> create the ORC table. Not using any compression option.
>
> c) Inserted 7256852 records into both tables.
>
> d) Space occupied in S3:
>
> Storing as ORC (3 files): 153.4 MB * 3 = 460.2 MB
> TEXT (single file in bz2 format): 306 MB
>
> I need to check ORC with compression enabled.
>
> Please let me know if I missed anything.
>
> Thanks,
>
> On Mon, Aug 12, 2013 at 8:50 PM, Owen O'Malley <omal...@apache.org> wrote:
>
>> Pandees,
>>   I've never seen a table that was larger with ORC than with text. Can
>> you share your text file's schema with us? Is the table very small? How
>> many rows and GB are the tables? The overhead for ORC is typically small,
>> but as Ed says, in rare cases it is possible for the overhead to dominate
>> the data size itself.
>>
>> -- Owen
>>
>> On Mon, Aug 12, 2013 at 6:52 AM, pandees waran <pande...@gmail.com> wrote:
>>
>>> Thanks, Edward. I shall try compression with ORC and let you know.
>>> Also, it looks like the CPU usage is lower when querying ORC rather
>>> than the text file. But the total time taken by the query is slightly
>>> higher for ORC than for the text file. Could you please explain the
>>> difference between cumulative CPU time and the total time taken
>>> (usually shown in the last line, in seconds)? Which one should we give
>>> preference to?
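The block-level bzip2 output compression described in the setup above is typically produced by session settings along these lines. This is a sketch: the property names are the Hadoop 1.x / Hive 0.8-era ones, the partition values are made up, and staging_table is hypothetical.

    -- Sketch: writing bzip2-compressed text output from Hive.
    SET hive.exec.compress.output=true;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec;
    -- The BLOCK type applies to SequenceFile output; plain text output is
    -- simply written as .bz2 files, one per final task.
    SET mapred.output.compression.type=BLOCK;

    -- Hypothetical load into the partitioned text table above:
    INSERT OVERWRITE TABLE test_textfile PARTITION (COL75='2013', COL76='08')
    SELECT * FROM staging_table;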
>>> On Aug 12, 2013 7:01 PM, "Edward Capriolo" <edlinuxg...@gmail.com> wrote:
>>>
>>>> Columnar formats do not always beat row-wise storage. Many times gzip
>>>> plus block storage will compress data better than columnar storage,
>>>> especially when you have repeated data across different columns.
>>>>
>>>> Based on what you are saying, it is possible that you missed a setting
>>>> and the ORC files are not compressed.
>>>>
>>>> On Monday, August 12, 2013, pandees waran <pande...@gmail.com> wrote:
>>>> > Hi,
>>>> >
>>>> > Currently, we use the TEXTFILE format in Hive 0.8 when creating
>>>> > external tables for intermediate processing.
>>>> > I have read about ORC in 0.11, and I have created the same table in
>>>> > 0.11 with the ORC format.
>>>> > Without any compression, the ORC files (3 files in total) occupied
>>>> > twice the space of the TEXTFILE (only one file).
>>>> > Also, when I query the data from ORC:
>>>> >
>>>> > Select count(*) from orc_table
>>>> >
>>>> > it took more time than the same query against the text file.
>>>> > But I see the cumulative CPU time is lower for ORC than for the
>>>> > text file.
>>>> >
>>>> > What sort of queries will benefit if we use ORC?
>>>> > In which cases would TEXTFILE be preferred over ORC?
>>>> >
>>>> > Thanks.

> --
> Thanks,
> Pandeeswaran
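Edward's "gzip plus block storage" baseline can be tried directly for comparison. A minimal sketch, using a few columns from the schema above: the table name is hypothetical, the property names are again the Hadoop 1.x ones, and this is a size comparison aid rather than a benchmark.

    -- Sketch: row-wise SequenceFile storage with block-level gzip,
    -- per Edward's suggestion.
    SET hive.exec.compress.output=true;
    SET mapred.output.compression.type=BLOCK;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

    -- Hypothetical table reusing a subset of the schema above:
    CREATE TABLE test_seq (
      COL1 BIGINT,
      COL2 STRING,
      COL17 DOUBLE
    )
    STORED AS SEQUENCEFILE;

    INSERT OVERWRITE TABLE test_seq
    SELECT COL1, COL2, COL17 FROM test_textfile;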