Ok, given the large number of doubles in the schema and the bzip2 compression, I can see why the text would be smaller.

ORC doesn't do compression on floats or doubles, although there is a JIRA to add it (https://issues.apache.org/jira/browse/HIVE-3889). Bzip2 is a very aggressive compressor; we should probably add it as an option for ORC for long-term storage files (https://issues.apache.org/jira/browse/HIVE-5067).

ORC will give you significant performance gains when you select a subset of the columns. A query like

    select COL70, COL74 from test;

will run much faster on ORC than on text files.
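For reference, requesting a specific codec when creating an ORC table in Hive 0.11 looks roughly like the following sketch. The table name, column subset, and location are hypothetical; the "orc.compress" table property accepts NONE, ZLIB, or SNAPPY and defaults to ZLIB.

    -- Sketch: an ORC table with an explicitly chosen codec (Hive 0.11).
    -- Table name and location are hypothetical.
    CREATE EXTERNAL TABLE test_orc (
      COL17 DOUBLE,
      COL70 STRING,
      COL74 STRING
    )
    STORED AS ORC
    LOCATION 's3://test/orcfile/'
    TBLPROPERTIES ("orc.compress"="ZLIB");

    -- ORC reads only the streams for the projected columns, which is
    -- where the performance gains show up:
    SELECT COL70, COL74 FROM test_orc;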
On Mon, Aug 12, 2013 at 9:27 AM, pandees waran <pande...@gmail.com> wrote:

> Hi Owen,
>
> Thanks for your response.
>
> My structure is like this:
>
> a) Textfile:
>
> CREATE EXTERNAL TABLE test_textfile (
> COL1 BIGINT,
> COL2 STRING,
> COL3 BIGINT,
> COL4 STRING,
> COL5 STRING,
> COL6 BIGINT,
> COL7 BIGINT,
> COL8 BIGINT,
> COL9 BIGINT,
> COl10 BIGINT,
> COl11 BIGINT,
> COL12 STRING,
> COl13 STRING,
> COl14 STRING,
> COl15 BIGINT,
> COl16 STRING,
> COL17 DOUBLE,
> COl18 DOUBLE,
> COl19 DOUBLE,
> COl20 DOUBLE,
> COl21 DOUBLE,
> COL22 DOUBLE,
> COl23 DOUBLE,
> COL24 DOUBLE,
> COl25 DOUBLE,
> COL26 DOUBLE,
> COl27 DOUBLE,
> COL28 DOUBLE,
> COL29 DOUBLE,
> COl30 DOUBLE,
> COl31 DOUBLE,
> COL32 DOUBLE,
> COL33 STRING,
> COl34 STRING,
> COl35 DOUBLE,
> COL36 DOUBLE,
> COl37 DOUBLE,
> COL38 DOUBLE,
> COl39 DOUBLE,
> COL40 DOUBLE,
> COl41 DOUBLE,
> COL42 DOUBLE,
> COL43 DOUBLE,
> COl44 DOUBLE,
> COl45 DOUBLE,
> COL46 DOUBLE,
> COL47 DOUBLE,
> COl48 DOUBLE,
> COl49 DOUBLE,
> COL50 DOUBLE,
> COL51 DOUBLE,
> COl52 DOUBLE,
> COl53 DOUBLE,
> COl54 DOUBLE,
> COL55 DOUBLE,
> COL56 STRING,
> COL57 DOUBLE,
> COL58 DOUBLE,
> COL59 DOUBLE,
> COl60 DOUBLE,
> COl61 STRING,
> COL62 STRING,
> COL63 STRING,
> COL64 STRING,
> COl65 STRING,
> COl66 STRING,
> COl67 STRING,
> COL68 STRING,
> Col69 STRING,
> COL70 STRING,
> COL71 STRING,
> COl72 STRING,
> COl73 STRING,
> COL74 STRING
> ) PARTITIONED BY (
> COL75 STRING,
> COL76 STRING
> ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
> STORED AS TEXTFILE LOCATION 's3://test/textfile/';
>
> Using block-level compression and BZip2Codec for output.
>
> b) With the above set of columns, I just changed it to STORED AS ORC to
> create the ORC table. Not using any compression option.
>
> c) Inserted 7256852 records into both tables.
>
> d) Space occupied in S3:
>
> Storing as ORC (3 files): 153.4 MB * 3 = 460.2 MB
> TEXT (single file in bz2 format): 306 MB
>
> I need to check ORC with compression enabled.
>
> Please let me know if I missed anything.
>
> Thanks,
>
> On Mon, Aug 12, 2013 at 8:50 PM, Owen O'Malley <omal...@apache.org> wrote:
>
>> Pandees,
>>   I've never seen a table that was larger with ORC than with text. Can
>> you share your text file's schema with us? Is the table very small? How
>> many rows and GB are the tables? The overhead for ORC is typically small,
>> but as Ed says, in rare cases it is possible for the overhead to dominate
>> the data size itself.
>>
>> -- Owen
>>
>> On Mon, Aug 12, 2013 at 6:52 AM, pandees waran <pande...@gmail.com> wrote:
>>
>>> Thanks, Edward. I shall try compression with ORC and let you know.
>>> Also, it looks like the CPU usage is lower when querying ORC rather
>>> than the text file. But the total time taken by the query is slightly
>>> higher for ORC than for the text file. Could you please explain the
>>> difference between cumulative CPU time and the total time taken
>>> (usually shown in the last line, in seconds)? Which one should we give
>>> preference to?
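The block-level bzip2 output compression described in the setup above is typically produced by session settings along these lines. This is a sketch: the property names are the Hadoop 1.x / Hive 0.8-era ones, the partition values are made up, and staging_table is hypothetical.

    -- Sketch: writing bzip2-compressed text output from Hive.
    SET hive.exec.compress.output=true;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec;
    -- The BLOCK type applies to SequenceFile output; plain text output is
    -- simply written as .bz2 files, one per final task.
    SET mapred.output.compression.type=BLOCK;

    -- Hypothetical load into the partitioned text table above:
    INSERT OVERWRITE TABLE test_textfile PARTITION (COL75='2013', COL76='08')
    SELECT * FROM staging_table;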
>>> On Aug 12, 2013 7:01 PM, "Edward Capriolo" <edlinuxg...@gmail.com> wrote:
>>>
>>>> Columnar formats do not always beat row-wise storage. Many times gzip
>>>> plus block storage will compress data better than columnar storage,
>>>> especially when you have repeated data across different columns.
>>>>
>>>> Based on what you are saying, it is possible that you missed a setting
>>>> and the ORC files are not compressed.
>>>>
>>>> On Monday, August 12, 2013, pandees waran <pande...@gmail.com> wrote:
>>>> > Hi,
>>>> >
>>>> > Currently, we use the TEXTFILE format in Hive 0.8 when creating
>>>> > external tables for intermediate processing.
>>>> > I have read about ORC in 0.11, and I have created the same table in
>>>> > 0.11 with the ORC format.
>>>> > Without any compression, the ORC files (3 files in total) occupied
>>>> > twice the space of the TEXTFILE (only one file).
>>>> > Also, when I query the data from ORC:
>>>> >
>>>> > Select count(*) from orc_table
>>>> >
>>>> > it took more time than the same query against the text file.
>>>> > But I see the cumulative CPU time is lower for ORC than for the
>>>> > text file.
>>>> >
>>>> > What sort of queries will benefit if we use ORC?
>>>> > In which cases would TEXTFILE be preferred over ORC?
>>>> >
>>>> > Thanks.

> --
> Thanks,
> Pandeeswaran
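Edward's "gzip plus block storage" baseline can be tried directly for comparison. A minimal sketch, using a few columns from the schema above: the table name is hypothetical, the property names are again the Hadoop 1.x ones, and this is a size comparison aid rather than a benchmark.

    -- Sketch: row-wise SequenceFile storage with block-level gzip,
    -- per Edward's suggestion.
    SET hive.exec.compress.output=true;
    SET mapred.output.compression.type=BLOCK;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

    -- Hypothetical table reusing a subset of the schema above:
    CREATE TABLE test_seq (
      COL1 BIGINT,
      COL2 STRING,
      COL17 DOUBLE
    )
    STORED AS SEQUENCEFILE;

    INSERT OVERWRITE TABLE test_seq
    SELECT COL1, COL2, COL17 FROM test_textfile;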