Re: [BUGS] Bug with Tsearch and tsvector

2010-05-01 Thread Jasen Betts
On 2010-04-29, Tom Lane wrote: > Jasen Betts writes: >> \ is popular in URIs on some platforms, or is URI a different beast > > I hope not, because \ is explicitly disallowed by both the older and > newer versions of that RFC. I should have known better than to assume that Microsoft was using a

Re: [BUGS] Bug with Tsearch and tsvector

2010-04-29 Thread Tom Lane
Jasen Betts writes: > \ is popular in URIs on some platforms, or is URI a different beast I hope not, because \ is explicitly disallowed by both the older and newer versions of that RFC. I did think of proposing that we allow \ and : in FilePath, which is currently pretty Unix-centric: regressi

Re: [BUGS] Bug with Tsearch and tsvector

2010-04-29 Thread Jasen Betts
On 2010-04-26, Kevin Grittner wrote: > Tom Lane wrote: > > From the RFC: > >| control = >| space = >| delims = "<" | ">" | "#" | "%" | <"> >| unwise = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`" > > Except, of course, that since % is the escape character, it is OK. >

Re: [BUGS] Bug with Tsearch and tsvector

2010-04-27 Thread Tom Lane
"Kevin Grittner" writes: > reserved = gen-delims / sub-delims > gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@" > sub-delims = "!" / "$" / "&" / "'" / "(" / ")" > / "*" / "+" / "," / ";" / "=" > unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" >

Re: [BUGS] Bug with Tsearch and tsvector

2010-04-27 Thread Tom Lane
"Kevin Grittner" writes: > I think that we should accept all the above characters (reserved and > unreserved) and the percent character (since it is the escape > character) as part of a URL. Check. > I don't know whether we should try to extract components of the URL, > but if we do, perhaps we

Re: [BUGS] Bug with Tsearch and tsvector

2010-04-27 Thread Kevin Grittner
"Kevin Grittner" wrote: > I'll read this RFC closely and follow up later today. For anyone not clear on what a URI is compared to a URL, every URL is also a URI (but not the other way around): A URI can be further classified as a locator, a name, or both. The term "Uniform Resource Locator

Re: [BUGS] Bug with Tsearch and tsvector

2010-04-27 Thread Kevin Grittner
Tom Lane wrote: > "Kevin Grittner" writes: >> Tom Lane wrote: >>> We'd probably not want to apply this as-is, but should first >>> tighten up what characters URLPath allows, per Kevin's spec >>> research. > >> If we're headed that way, I figured I should double-check. The >> RFC I referenced

Re: [BUGS] Bug with Tsearch and tsvector

2010-04-27 Thread Tom Lane
"Kevin Grittner" writes: > Tom Lane wrote: >> We'd probably not want to apply this as-is, but should first >> tighten up what characters URLPath allows, per Kevin's spec >> research. > If we're headed that way, I figured I should double-check. The RFC > I referenced was later obsoleted by: > h

Re: [BUGS] Bug with Tsearch and tsvector

2010-04-27 Thread Kevin Grittner
Tom Lane wrote: > We'd probably not want to apply this as-is, but should first > tighten up what characters URLPath allows, per Kevin's spec > research. If we're headed that way, I figured I should double-check. The RFC I referenced was later obsoleted by: http://www.ietf.org/rfc/rfc3986.tx

Re: [BUGS] Bug with Tsearch and tsvector

2010-04-26 Thread Tom Lane
"Kevin Grittner" writes: > Hmm. Having typed that, I'm staring at the # character, which is > used to mark off an anchor within an HTML page identified by the > URL. Should we consider the # and anchor part of a URL? Yeah, I would think so. This discussion is making me think that my previous p

Re: [BUGS] Bug with Tsearch and tsvector

2010-04-26 Thread Kevin Grittner
Tom Lane wrote: > there's a potential compatibility issue here, so my thought is to > apply this only in HEAD. Agreed. -Kevin

Re: [BUGS] Bug with Tsearch and tsvector

2010-04-26 Thread Kevin Grittner
Tom Lane wrote: > Hmm, thanks for the reference, but I'm not sure this is specifying > quite what we want to get at. In particular I note that it > excludes '%' on the grounds that that ought to be escaped, so I > guess this is specifying the characters allowed in an underlying > URI, *not* the

Re: [BUGS] Bug with Tsearch and tsvector

2010-04-26 Thread Tom Lane
I wrote: > "Donald Fraser" writes: >> Using the default tsearch configuration, for 'english', text is being >> wrongly parsed into the tsvector type. > ts_debug shows that it's being parsed like this: > alias | description | token >

Re: [BUGS] Bug with Tsearch and tsvector

2010-04-26 Thread Tom Lane
"Kevin Grittner" writes: > Tom Lane wrote: >> ie the critical point seems to be that url_path is willing to soak >> up a string containing "<" and ">", so the span tags don't get >> recognized as separate lexemes. While that's "obviously" the >> wrong thing in this particular example, I'm not su

Re: [BUGS] Bug with Tsearch and tsvector

2010-04-26 Thread Kevin Grittner
Tom Lane wrote: > ie the critical point seems to be that url_path is willing to soak > up a string containing "<" and ">", so the span tags don't get > recognized as separate lexemes. While that's "obviously" the > wrong thing in this particular example, I'm not sure if it's the > wrong thing i

Re: [BUGS] Bug with Tsearch and tsvector

2010-04-26 Thread Tom Lane
"Donald Fraser" writes: > Using the default tsearch configuration, for 'english', text is being wrongly > parsed into the tsvector type. ts_debug shows that it's being parsed like this: alias | description | token | dictionaries

[BUGS] Bug with Tsearch and tsvector

2010-04-26 Thread Donald Fraser
PostgreSQL 8.3.10 (on i686-redhat-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-46)) OS: Linux Redhat EL 5.4 Database encoding: LATIN9 Using the default tsearch configuration, for 'english', text is being wrongly parsed into the tsvector type. The fail condition is shown w