Many thanks to Zoltán for starting this thread and driving the discussion forward. I am happy to see that many people chimed in and gave feedback about the suggested changes in HIVE-28483; this is a good indicator of a healthy community.
Since there were some comments/questions about the behavior in previous Hive versions, I added a new set of tests under HIVE-27586 [1] to give a better overview of the situation. People who are interested can give a look. Personally, I am +0 with the changes in HIVE-28483; I like them but don't know if all users will. There is already a clear +1 from Okumin and I sense a general positive sentiment from Attila and Ayush so I will proceed in merging the PR in the following days unless the tide changes. Best, Stamatis [1] https://issues.apache.org/jira/browse/HIVE-27586 On Thu, Sep 19, 2024 at 8:43 AM Zoltán Rátkai <zrat...@cloudera.com> wrote: > > Hi Okumin, > > thank you very much to checking them I appreciate it! > Now we have information on 5 different DB (MySQL, Oracle, Postgre, Spark, > Trino) and 3 different versions of Hive about this topic and as I see only > Hive 4.0 works like the No. 1 way. > > > Regards, > > Zoltan > > > On Wed, Sep 18, 2024 at 3:53 PM Okumin <m...@okumin.com> wrote: >> >> Hi, >> >> I put supplementary information in the ticket, such as examples of >> Hive 2, Hive 3, Spark, Trino, or my research of the ANSI standard. >> >> Regards, >> Okumin >> >> On Wed, Sep 18, 2024 at 7:17 PM Attila Turoczy >> <aturo...@cloudera.com.invalid> wrote: >> > >> > Hi Folks, >> > >> > Let me share my opinion about this change and request. To have an option >> > for parse data from string is a very valuable feature especially in OLAP, >> > because many of the input data many times come from different formats. It >> > is easy and makes sense that the date follows the ISO format. But real >> > life is different. Especially on data lakes the sources could come from >> > many formats and the ETL job parses excel, json, csv etc to create tables, >> > insert rows. For those countries like Hungary where the ISO date format is >> > default everywhere the HIVE-27586 would not cause any issue. However, in >> > regions such as the US and UK, where date formats like mm/dd/yyyy and >> > dd/mm/yyyy are common (e.g., in Excel files), this could present >> > challenges. In these cases, users may encounter problems. In my mind >> > especially for these formats it has to be null or as an oracle throws an >> > error. If users fail to validate the output, this could lead to data >> > correctness issues or, worse, potential data loss if source files are >> > deleted. Yes, Hive users do not like the behavior change, but one type >> > (one of the most popular formats in the US) could lead to more problems. >> > >> > -Attila >> > >> > On Wed, Sep 18, 2024 at 9:43 AM Ayush Saxena <ayush...@gmail.com> wrote: >> >> >> >> Since this is landing in Spam for some reason for most people, I am >> >> explicitly forwarding it rather than replying, so as people can get >> >> it. >> >> >> >> I will vote for (2) considering it was returning `null` earlier as >> >> well & the new behaviour came up recently, it would be a behaviour >> >> change for folks migrating from earlier versions of hive to 4.x & the >> >> new behaviour ain't that fancy either... >> >> >> >> -Ayush >> >> >> >> ---------- Forwarded message --------- >> >> From: Zoltán Rátkai <zrat...@cloudera.com> >> >> Date: Wed, 18 Sept 2024 at 12:55 >> >> Subject: Re: IMPORTANT: Hive date parsing issue, input needed >> >> To: <u...@hive.apache.org> >> >> Cc: dev <dev@hive.apache.org> >> >> >> >> >> >> Hi Ayush, >> >> >> >> thanks for your thoughts! >> >> >> >> My answers: >> >> >> >> - On the official Hive page about types this is not documented at all: >> >> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=82706456#LanguageManualTypes-date >> >> >> >> "Dates: DATE values describe a particular year/month/day, in the form >> >> YYYY-MM-DD. For example, DATE '2013-01-01'. Date types do not have >> >> a time of day component. The range of values supported for the Date >> >> type is 0000-01-01 to 9999-12-31, dependent on support by the >> >> primitive Java Date type." >> >> >> >> - Since the beginning of Hive it worked like returning null, this was >> >> introduced one year ago as a side effect of this: HIVE-27586 "Parse >> >> dates from strings ignoring trailing (potentialy) invalid chars", so >> >> the goal of this ticket was to be able to handle string to date >> >> parsing like : "2024-09-18 GARBAGE" and parse it to 2024-09-18 as >> >> date. >> >> So it is not a long standing behaviour. >> >> >> >> I made a comparison of other DB (Postre, MySQL, ORACLE) and none of >> >> them handle it like Hive now. Please check it here: >> >> https://issues.apache.org/jira/browse/HIVE-28483 >> >> I agree with you to align with those. This is why I asked this question. >> >> >> >> Thanks and regards, >> >> >> >> Zoltan Ratkai >> >> >> >> >> >> On Wed, Sep 18, 2024 at 8:58 AM Ayush Saxena <ayush...@gmail.com> wrote: >> >> > >> >> > ++ Adding the dev mailing list, as this might be relevant to them as >> >> > well. >> >> > >> >> > I have a few additional questions: >> >> > >> >> > * Has this behavior been present since the inception of Hive, or did >> >> > it start occurring more recently? If this has been the behavior for >> >> > over a decade, it might be best to leave it unchanged. However, if >> >> > this is a recent development, we should aim to restore the original >> >> > behavior. In my opinion, any change that alters long-standing behavior >> >> > could be considered incompatible unless it was introduced to prevent >> >> > data loss or address a security issue. >> >> > * How do other engines (e.g., Impala, Spark) handle similar scenarios, >> >> > and how do other databases like MySQL and PostgreSQL behave in this >> >> > regard? In the past, we've typically aimed to align with these >> >> > systems, not necessarily exactly, but closely enough—especially in >> >> > cases where there's ambiguity or conflict. >> >> > >> >> > >> >> > -Ayush >> >> > >> >> > >> >> > On Wed, 18 Sept 2024 at 12:00, Zoltán Rátkai <zrat...@cloudera.com> >> >> > wrote: >> >> > > >> >> > > Dear Hive User! >> >> > > >> >> > > In Hive the standard date format is YYYY-MM-DD, like “2024-09-17”. >> >> > > So when casting a string which format is different, like DD-MM-YYYY >> >> > > e.g. “17-09-2024" to date, as a Hive user what would you expect as a >> >> > > result? >> >> > > >> >> > > >> >> > > 1. “0017-09-20” => Year of 17 September 20 >> >> > > or >> >> > > 2. null >> >> > > >> >> > > It is possible to give a date pattern CAST(<string-value> AS DATE >> >> > > FORMAT <date-pattern>)and cast with it like "DD-MM-YYYY". >> >> > > Earlier it gave back null, but currently Hive gives back the first, >> >> > > which I think unexpected as a user and customers complain about that. >> >> > > There is a debate if it is a bug or feature, so I need your help! >> >> > > >> >> > > I would like to ask you to reply with a vote on 1 or 2 what you want >> >> > > as a user to happen! >> >> > > >> >> > > I really appreciate your input! >> >> > > >> >> > > Thank you, >> >> > > >> >> > > Zoltan Ratkai