[ https://issues.apache.org/jira/browse/HIVE-8371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14161290#comment-14161290 ]

Sushanth Sowmyan commented on HIVE-8371:
----------------------------------------

The main goal of HIVE-6405 was to unify expected behaviour between hive and 
hcatalog, and making the default for HCatStorer different from the default for 
hive defeats that purpose. To that end, I disagree that HCatStorer should fail 
by default unless you are also saying that hive should fail by default when 
inserting into a partition that already exists.

I fully see that immutability matters for data quality when jobs are not 
written to be idempotent, and that's why HIVE-6406 added a table-wide property 
for exactly that: the default append behaviour can currently be turned off per 
table by setting "immutable"="true" as a table property. My suggestion would be 
to use that on tables whose jobs you expect to hit this problem.
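
For example, a minimal sketch (the table name is made up; the property itself 
is the one HIVE-6406 introduced):

  -- Mark an existing table immutable so that inserts/stores into a non-empty
  -- partition (or a non-empty unpartitioned table) fail instead of appending.
  ALTER TABLE page_views SET TBLPROPERTIES ('immutable'='true');

The same property can also be set at creation time in the table's 
TBLPROPERTIES clause.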

If your requirement is to have a job-level property that handles this, then the 
"keep in sync with hive default behaviour" principle leads me to the following 
behaviour:

a) If no special argument is provided, stick to the table's defaults - i.e., 
hive defaults, overridable by the "immutable" table property, which overrides 
the default behaviour for hive as well.
b) org.apache.hive.hcatalog.pig.HCatStorer('partspec', '', ' -immutable') => 
ignore immutable setting, disallow append.
c) org.apache.hive.hcatalog.pig.HCatStorer('partspec', '', ' -append') => 
ignore immutable setting, allow append. 

Now, thinking more about this, if we were to have a job-level override, I am 
honestly not comfortable with (c), since it would let an end user write a pig 
script that ignores the table-level immutability property and causes data 
quality issues later, even when the table owner has tried to lock the table 
down with the "immutable" property. Thus, I think we should not implement (c) 
in this case. I am okay with implementing (b) if you want a stricter safeguard 
than the default.

I would further say, btw, that I would also be okay with making the default 
value of the "immutable" table property (i.e. the value it takes when it isn't 
set explicitly) configurable at the warehouse level from hive-site.xml. That 
would also solve your problem without requiring you to set it on each table.
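
If such a knob were added, it might be used like any other hive configuration 
property (the property name below is purely hypothetical, invented here for 
illustration - no such setting exists today):

  -- hypothetical warehouse-wide default for the "immutable" table property;
  -- it would live in hive-site.xml, or be set per session for testing:
  SET hive.table.immutable.default=true;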


> HCatStorer should fail by default when publishing to an existing partition
> --------------------------------------------------------------------------
>
>                 Key: HIVE-8371
>                 URL: https://issues.apache.org/jira/browse/HIVE-8371
>             Project: Hive
>          Issue Type: Bug
>          Components: HCatalog
>    Affects Versions: 0.13.0, 0.14.0, 0.13.1
>            Reporter: Thiruvel Thirumoolan
>            Assignee: Thiruvel Thirumoolan
>              Labels: hcatalog, partition
>
> In Hive 0.12 and before (and in previous HCatalog releases) HCatStorer would 
> fail if the partition already existed (whether before launching the job or 
> during commit, depending on the partitioning). HIVE-6406 changed that behavior 
> and by default does an append. This causes data quality issues since a rerun 
> (or duplicate run) won't fail (when it used to) and will just append to the 
> partition.
> A preferable approach would be to leave HCatStorer behavior as is (fail 
> during a duplicate publish) and support append through an option. Overwrite 
> can also be implemented in a similar fashion. E.g.:
> store A into 'db.table' using 
> org.apache.hive.hcatalog.pig.HCatStorer('partspec', '', ' -append');



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
