On Wed, Nov 19, 2014 at 9:52 AM, Robert Haas <robertmh...@gmail.com> wrote:
> If you agree, then I'm not being clear enough. I don't think
> that tinkering with the Levenshtein cost factors is a good idea, and I
> think it's unhelpful to suggest something when the suggestion and the
> original word differ by more than 50% of the number of characters in the
> shorter word.

I agree - except for very short strings, where there is insufficient
information to go on. I was talking about the difficulty of bolting
something on top of Levenshtein distance that looks at the first
character. That would not need two costings, but it would require an
encoding-aware match on the first code point.

> Suggesting "col" for "oid" or "x" for "xmax", as crops
> up in the regression tests with this patch applied, shows the folly of
> this: the user didn't mean the other named column; rather, the user
> was confused about whether a particular system column existed for that
> table.

Those are all very terse strings. What you're overlooking is what is
broken by using straight Levenshtein distance, which includes
suggestions in the regression tests that are reasonable and helpful.
As I mentioned before, capping the distance at 50% of the total string
size breaks things, just within the regression tests:

"""
ERROR:  column "f1" does not exist
LINE 1: select f1 from c1;
               ^
- HINT:  Perhaps you meant to reference the column "c1"."f2".
"""

And:

"""
ERROR:  column atts.relid does not exist
LINE 1: select atts.relid::regclass, s.* from pg_stats s join
               ^
- HINT:  Perhaps you meant to reference the column "atts"."indexrelid".
"""

Those are really useful suggestions, and they're much more
representative of real user error. The downside of weighing deletion
less than substitution and insertion is much smaller than the upside;
it's worth it. The downside is only that the user sees the best
available suggestion even when it isn't all that good in an absolute
sense (something that is much harder to judge with simple tests when
the misspelled string is short).
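To make the weighting concrete, here is a toy sketch - my own
illustration, not the patch's actual code - of a Levenshtein variant
where deletion costs half as much as insertion or substitution. The
weights and the length limit are assumptions for illustration only:

"""
#include <stdio.h>
#include <string.h>

/* Illustrative weights: deleting a character from the candidate name
 * costs half as much as inserting or substituting one.  (Everything is
 * scaled by 2 to keep the arithmetic integral.) */
#define DEL_COST 1
#define INS_COST 2
#define SUB_COST 2

#define MAX_LEN  64             /* assumes short identifiers */

static int
min3(int a, int b, int c)
{
    int m = (a < b) ? a : b;

    return (m < c) ? m : c;
}

/* Cost of editing "candidate" into what the user actually typed. */
static int
weighted_levenshtein(const char *candidate, const char *typed)
{
    int m = (int) strlen(candidate);
    int n = (int) strlen(typed);
    int prev[MAX_LEN + 1];
    int curr[MAX_LEN + 1];

    if (m > MAX_LEN || n > MAX_LEN)
        return -1;              /* out of scope for this sketch */

    for (int j = 0; j <= n; j++)
        prev[j] = j * INS_COST;

    for (int i = 1; i <= m; i++)
    {
        curr[0] = i * DEL_COST;
        for (int j = 1; j <= n; j++)
        {
            int sub = prev[j - 1] +
                ((candidate[i - 1] == typed[j - 1]) ? 0 : SUB_COST);

            curr[j] = min3(prev[j] + DEL_COST,     /* drop candidate char */
                           curr[j - 1] + INS_COST, /* add typed char */
                           sub);
        }
        memcpy(prev, curr, sizeof(int) * (n + 1));
    }
    return prev[n];
}

int
main(void)
{
    /* The pg_stats example above: the plain distance from "indexrelid"
     * to "relid" is 5, which a 50%-of-the-shorter-string cutoff rejects
     * out of hand; with half-cost deletions it scores 5 on this doubled
     * scale, i.e. the equivalent of 2.5. */
    printf("%d\n", weighted_levenshtein("indexrelid", "relid"));
    printf("%d\n", weighted_levenshtein("f2", "f1"));
    return 0;
}
"""

The point is that a candidate that merely *contains* what the user
typed stays within reach, without also letting in candidates that
differ by lots of substitutions.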
> If we had a large database of examples showing what the user typed and
> what they intended, we could try different algorithms against it and
> see which one performs best with fewest false positives. But if we
> don't have that, we should do things that are like the things that
> other people have done before.

That seems totally impractical. No one has that kind of data that I'm
aware of. How about git as a kind of precedent? It is not at all
conservative about showing *some* suggestion:

"""
$ git aa
git: 'aa' is not a git command. See 'git --help'.

Did you mean this?
        am

$ git d
git: 'd' is not a git command. See 'git --help'.

Did you mean one of these?
        diff
        add

$ git ddd
git: 'ddd' is not a git command. See 'git --help'.

Did you mean this?
        add
"""

And why wouldn't git be? As far as it's concerned, you can only have
meant one of a small number of things. Similarly, with the patch, the
number of things we can pick from is fairly limited at this stage,
since we are actually fairly far along with parse analysis.

Now, this won't give a suggestion:

"""
$ git aaaa
git: 'aaaa' is not a git command. See 'git --help'.
"""

So it looks like git similarly weighs deletion less than
insertion/substitution. Are its suggestions any less ridiculous than
your examples of questionable hints from the modified regression test
expected output? This is just a guidance mechanism, and at worst we'll
show the best match (on the mechanism's own terms, which isn't too
bad).

-- 
Peter Geoghegan
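PS: By "encoding-aware matching of the first code point" I mean
something along these lines - a toy sketch that assumes UTF-8 (the
real thing would have to go through the server encoding's own
character-length logic, pg_mblen() or similar):

"""
#include <stdbool.h>
#include <string.h>

/* Byte length of the first UTF-8 code point (assumes valid UTF-8). */
static int
utf8_first_len(const char *s)
{
    unsigned char c = (unsigned char) s[0];

    if (c < 0x80)
        return 1;               /* ASCII */
    if ((c & 0xE0) == 0xC0)
        return 2;
    if ((c & 0xF0) == 0xE0)
        return 3;
    return 4;
}

/* Do two names begin with the same code point?  Comparing only the
 * first *byte* would get multi-byte characters wrong. */
static bool
first_code_point_matches(const char *a, const char *b)
{
    int la = utf8_first_len(a);

    return la == utf8_first_len(b) && memcmp(a, b, la) == 0;
}
"""

That's all I meant; it's a match test bolted on in front, not a second
set of cost factors.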