Hi Mich, DDL as below.
Hi Prasanth,

Hive version as reported by Hortonworks is 1.2.1.2.3.

Thanks,
Marcin

CREATE TABLE `<tablename>`(
  `col1` string,
  `col2` bigint,
  `col3` string,
  `col4` string,
  `col4` string,
  `col5` bigint,
  `col6` string,
  `col7` string,
  `col8` string,
  `col9` string,
  `col10` boolean,
  `col11` boolean,
  `col12` string,
  `metadata` struct<file:string,hostname:string,level:string,line:bigint,logger:string,method:string,millis:bigint,pid:bigint,timestamp:string>,
  `col14` string,
  `col15` bigint,
  `col16` double,
  `col17` bigint)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://reporting-handy/<path>'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='true',
  'numFiles'='2800',
  'numRows'='297263',
  'rawDataSize'='454748401',
  'totalSize'='31310353',
  'transient_lastDdlTime'='1457437204')
Time taken: 1.062 seconds, Fetched: 34 row(s)

On Tue, Mar 8, 2016 at 4:29 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> Hi
>
> Can you please provide the DDL for this table: "show create table <TABLE>"
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> http://talebzadehmich.wordpress.com
>
> On 7 March 2016 at 23:25, Marcin Tustin <mtus...@handybook.com> wrote:
>> Hi All,
>>
>> Following on from our parquet vs orc discussion, today I observed
>> hive's alter table ... concatenate command remove rows from an ORC
>> formatted table.
>>
>> 1. Has anyone else observed this (fuller description below)? And
>> 2. How do parquet users handle the file fragmentation issue?
>>
>> Description of the problem:
>>
>> Today I ran a query to count rows by date. Relevant days below:
>>
>> 2016-02-28 16866
>> 2016-03-06 219
>> 2016-03-07 2863
>>
>> I then ran concatenation on that table.
>> Rerunning the same query resulted in:
>>
>> 2016-02-28 16866
>> 2016-03-06 219
>> 2016-03-07 1158
>>
>> Note the reduced count for 2016-03-07.
>>
>> I then ran concatenation a second time, and the query a third time:
>>
>> 2016-02-28 16344
>> 2016-03-06 219
>> 2016-03-07 1158
>>
>> Now the count for 2016-02-28 is reduced.
>>
>> This doesn't look like an elimination of duplicates occurring by design -
>> these didn't all happen on the first run of concatenation. It looks like
>> concatenation just kind of loses data.
>>
>> Want to work at Handy? Check out our culture deck and open roles
>> <http://www.handy.com/careers>
>> Latest news <http://www.handy.com/press> at Handy
>> Handy just raised $50m
>> <http://venturebeat.com/2015/11/02/on-demand-home-service-handy-raises-50m-in-round-led-by-fidelity/>
>> led by Fidelity
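For anyone trying to reproduce, the workflow described in the thread can be sketched in HiveQL roughly as follows. This is a sketch, not the exact queries run: the table name is the placeholder from the DDL above, and the date column used for grouping is an assumption on my part.

```sql
-- Count rows per day before compaction.
-- `<date_col>` is a placeholder; the actual column isn't named in the thread.
SELECT to_date(`<date_col>`) AS day, COUNT(*) AS cnt
FROM `<tablename>`
GROUP BY to_date(`<date_col>`);

-- Merge small ORC files -- the step reported to drop rows:
ALTER TABLE `<tablename>` CONCATENATE;

-- Re-run the same count and compare the two result sets day by day.
```

Note that the `numRows` figure in TBLPROPERTIES ('numRows'='297263') is a stored statistic and can go stale, so a full `SELECT COUNT(*)` scan (or `ANALYZE TABLE ... COMPUTE STATISTICS` first) is the safer baseline for this comparison.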
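Not Hive-specific, but the before/after comparison above is easy to mechanize once the two per-day counts have been captured. A minimal sketch (the function name is mine; the sample values are the counts quoted in the thread):

```python
def diff_counts(before, after):
    """Return {day: (before_count, after_count)} for days whose counts differ."""
    days = set(before) | set(after)
    return {d: (before.get(d, 0), after.get(d, 0))
            for d in days
            if before.get(d, 0) != after.get(d, 0)}

# Counts from the first and second runs reported in the thread:
before = {"2016-02-28": 16866, "2016-03-06": 219, "2016-03-07": 2863}
after  = {"2016-02-28": 16866, "2016-03-06": 219, "2016-03-07": 1158}

print(diff_counts(before, after))
# {'2016-03-07': (2863, 1158)} -> rows lost on 2016-03-07 after concatenation
```

Running the same diff after the second concatenation would flag 2016-02-28 as well, matching the third query's output.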