On Thu, Jan 23, 2020 at 11:00 PM Tom Lane <t...@sss.pgh.pa.us> wrote:
Thank you for your time looking into this. Here's a v7 patch, rebased over my recent hacking, and addressing > most of the complaints I raised in <31691.1579648...@sss.pgh.pa.us>. > However, I've got some new complaints now that I've looked harder > at the code: > > * It's not exactly apparent to me why this code should be concerned > about non-normalized characters when noplace else in the backend is. > I think we should either rip that out entirely, or move the logic > into str_tolower (and hence also str_toupper etc). I'd tend to favor > the former, but of course I don't speak any languages where this would > be an issue. Perhaps a better idea is to invent a new SQL function > that normalizes a given string, and expect users to call that first > if they'd like to_date() to match unnormalized text. > > There is an open patch that will make the normalization functionality user visible [1]. So, if a user can call to_date(normalize('01 ŞUB 2010'), 'DD TMMON YYYY') I would vote to drop the normalization logic inside this patch altogether. * I have no faith in this calculation that decides how long the match > length was: > > *len = element_len + name_len - norm_len; > > I seriously doubt that this does the right thing if either the > normalization or the downcasing changed the byte length of the > string. I'm not actually sure how we can do that correctly. > There's no good way to know whether the changes happened within > the part of the "name" string that we matched, or the part beyond > what we matched, but we only want to count the former. So this > seems like a pretty hard problem, and even if this logic is somehow > correct as it stands, it needs a comment explaining why. > > The proper logic would come from do_to_timestamp() receiving a normalized "date_txt" input, so we do not operate with unnormalize and normalize strings at the same time. > * I'm still concerned about the risk of integer overflow in the > string allocations in seq_search_localized(). Those need to be > adjusted to be more paranoid, similar to the code in e.g. str_tolower. > Please find attached a patch with the normalization logic removed, thus no direct allocations in seq_search_localized(). I would like to rise a couple of questions myself: * When compiled with DEBUG_TO_FROM_CHAR, there is a warning "‘dump_node’ defined but not used". Should we drop this function or uncomment its usage? * Would it be worth moving str_tolower(localized_name) from seq_search_localized() into cache_locale_time()? [1] https://www.postgresql.org/message-id/014866c8-d7ff-2a4f-c185-cf7e3ceb7028%402ndquadrant.com Regards, Juan José Santamaría Flecha
0001-Allow-localized-month-names-to_date-v8.patch
Description: Binary data