Hi, On Sun, Sep 8, 2024 at 4:09 AM Tom Lane <t...@sss.pgh.pa.us> wrote:
> A few notes from a quick read of the patch: > > * roman_to_int() should have a header comment, notably explaining its > result convention. I find it fairly surprising that "0" means an > error --- I realize that Roman notation can't represent zero, but > wouldn't it be better to use -1? > Noted. > > * Do *not* rely on toupper(). There are multiple reasons why not, > but in this context the worst one is that in Turkish locales it's > likely to translate "i" to "İ", on which you will fail. I'd use > pg_ascii_toupper(). > Noted. > > * I think roman_to_int() is under-commented internally too. > To start with, where did the magic "15" come from? And why > should we have such a test anyway --- what if the format allows > for trailing stuff after the roman numeral? (That is, I think > you need to fix this so that roman_to_int reports how much data > it ate, instead of assuming that it must read to the end of the > input.) MMMDCCCLXXXVIII is the longest valid standard roman numeral (15 characters). I will add appropriate comment on length check. I am not sure I am able to understand the latter part. Could you please elaborate it? Are you referring to cases like these? SELECT to_number('CXXIII foo', 'RN foo'); SELECT to_number('CXXIII foo', 'RN'); Please check my comments on Oracle's behaviour below. > The logic for detecting invalid numeral combinations > feels a little opaque too. Do you have a source for the rules > you're implementing, and if so could you cite it? > There are many sources on the internet. A few sources: 1. https://www.toppr.com/guides/maths/knowing-our-numbers/roman-numerals/ 2. https://www.cuemath.com/numbers/roman-numerals/ Note that we are following the standard form: https://en.wikipedia.org/wiki/Roman_numerals#Standard_form > > * This code will get run through pgindent before commit, so you > might want to revisit whether your comments will still look nice > afterwards. There's not a lot of consistency in them about initial > cap versus lower case or trailing period versus not, too. > Noted. > > * roman_result could be declared where used, no? > Noted. > > * I'm not really convinced that we need a new errcode > ERRCODE_INVALID_ROMAN_NUMERAL rather than using a generic one like > ERRCODE_INVALID_TEXT_REPRESENTATION. However, if we do do it > like that, why did you pick the out-of-sequence value 22P07? > 22P07 is not out-of-sequence. 22P01 to 22P06 are already used. However, I do agree with you that we can use ERRCODE_INVALID_TEXT_REPRESENTATION. I will change it. > > * Might be good to have a few test cases demonstrating what happens > when there's other stuff combined with the RN format spec. Notably, > even if RN itself won't eat whitespace, there had better be a way > to write the format string to allow that. > The current patch (v3) simply ignores other formats with RN except when RN is used EEEE which returns error. ``` postgres=# SELECT to_number('CXXIII', 'foo RN'); to_number ----------- 123 (1 row) postgres=# SELECT to_number('CXXIII', 'MIrn'); to_number ----------- 123 (1 row) postgres=# SELECT to_number('CXXIII', 'EEEErn'); ERROR: "EEEE" must be the last pattern used ``` However, I think that other formats except FM should be rejected when used with RN in NUMDesc_prepare function. Any opinions? If we look into Oracle, they are strict in this regard and don't allow any other format with RN. 1. SELECT to_char(12, 'MIrn') from dual; 2. SELECT to_char(12, 'foo RN') from dual; results in error: ORA-01481: invalid number format model I also found that there is no check when RN is used twice (both in to_char and to_number) and that results in unexpected output. ``` postgres=# SELECT to_number('CXXIII', 'RNrn'); to_number ----------- 123 (1 row) postgres=# SELECT to_char(12, 'RNrn'); to_char -------------------------------- XII xii (1 row) postgres=# SELECT to_char(12, 'rnrn'); to_char -------------------------------- xii xii (1 row) postgres=# SELECT to_char(12, 'FMrnrn'); to_char --------- xiixii (1 row) ``` > * Further to Aleksander's point about lack of test coverage for > the to_char direction, I see from > > https://coverage.postgresql.org/src/backend/utils/adt/formatting.c.gcov.html > that almost none of the existing roman-number code paths are covered > today. While it's not strictly within the charter of this patch > to improve that, maybe we should try to do so --- at the very > least it'd raise formatting.c's coverage score a few notches. > > I see that you have provided a patch to increase test coverage of to_char roman format including some other formats. I will try to add more tests for the to_number roman format. I will provide the next patch soon. Regards, Hunaid Sohail