[ https://issues.apache.org/jira/browse/HIVE-8371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14161290#comment-14161290 ]
Sushanth Sowmyan commented on HIVE-8371:
----------------------------------------

The main goal of HIVE-6405 was to unify expected behaviour between Hive and HCatalog, and making the default for HCatStorer different from the default for Hive defeats that purpose. To that end, I disagree that it should fail by default, unless you are also saying that Hive should fail by default when inserting into a partition that already exists.

I fully see that data quality concerns call for immutability when jobs are not written to be idempotent, and that is why HIVE-6406 added a table-wide property to do exactly that: the default append behaviour can currently be turned off table-wide by setting "immutable"="true" as a table property. My suggestion would be to use that on tables whose jobs you expect to hit this problem.

If your requirement is a job-level property that handles this, then the "keep in sync with Hive default behaviour" principle leads me to the following behaviour:

a) If no special argument is provided, stick to the defaults for the table - i.e., Hive defaults, overridable by the "immutable" property, which also overrides the default behaviour for Hive.

b) org.apache.hive.hcatalog.pig.HCatStorer('partspec', '', ' -immutable') => ignore the immutable setting, disallow append.

c) org.apache.hive.hcatalog.pig.HCatStorer('partspec', '', ' -append') => ignore the immutable setting, allow append.

Thinking more about this, though: if we were to have a job-level override, I am honestly not comfortable with (c), since it would let an end user write a Pig script that ignores the table-level immutability property even when it is set, causing data quality issues later despite the table owner's attempt to control appends with the "immutable" property. Thus, I think we should not implement (c). I am okay with implementing (b) if you want a safeguard default.

I would further say, by the way, that I would also be okay with making the default value of the "immutable" table property (i.e., the value it has when it is not set explicitly) configurable warehouse-wide from hive-site.xml. That would also solve your problem without requiring you to set the property on each table.

> HCatStorer should fail by default when publishing to an existing partition
> ---------------------------------------------------------------------------
>
>                 Key: HIVE-8371
>                 URL: https://issues.apache.org/jira/browse/HIVE-8371
>             Project: Hive
>          Issue Type: Bug
>          Components: HCatalog
>    Affects Versions: 0.13.0, 0.14.0, 0.13.1
>            Reporter: Thiruvel Thirumoolan
>            Assignee: Thiruvel Thirumoolan
>              Labels: hcatalog, partition
>
> In Hive 0.12 and before (or in previous HCatalog releases), HCatStorer would
> fail if the partition already exists (whether before launching the job or
> during commit, depending on the partitioning). HIVE-6406 changed that behavior
> and by default does an append. This causes data quality issues, since a rerun
> (or duplicate run) won't fail (when it used to) and will just append to the
> partition.
> A preferable approach would be to leave HCatStorer behavior as is (fail
> during a duplicate publish) and support append through an option. Overwrite
> can also be implemented in a similar fashion. E.g.:
> store A into 'db.table' using
> org.apache.hive.hcatalog.pig.HCatStorer('partspec', '', ' -append');
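
For reference, a minimal sketch of the table-level safeguard described in the comment above, assuming a hypothetical table mydb.clicks (table and column names are illustrative only); the "immutable" table property itself is the one introduced by HIVE-6406:

    -- Mark an existing table immutable so that writing into a partition
    -- that already contains data fails instead of silently appending.
    ALTER TABLE mydb.clicks SET TBLPROPERTIES ('immutable'='true');

    -- Or declare it up front when the table is created.
    CREATE TABLE mydb.clicks (url STRING, hits BIGINT)
      PARTITIONED BY (ds STRING)
      TBLPROPERTIES ('immutable'='true');

With the property set, a duplicate publish into a non-empty partition of that table fails rather than appending, which is the per-table remedy suggested in the comment.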