On Thu, Aug 11, 2016 at 10:42 AM, Ryan Pedela <rped...@datalanche.com> wrote:
> On Thu, Aug 11, 2016 at 9:27 AM, Oleg Bartunov <obartu...@gmail.com>
> wrote:
>
>> On Tue, Aug 9, 2016 at 9:59 PM, Ryan Pedela <rped...@datalanche.com>
>> wrote:
>> >
>> > I would say that it is worth it to have a "phrase slop" operator (Apache
>> > Lucene terminology). Proximity search is extremely useful for improving
>> > relevance and phrase slop is one of the tools to achieve that.
>> >
>>
>> It'd be great if you explain what is "phrase slop". I assume it's not
>> about search, but about relevance.
>>
>
> Sure. An exact phrase query has slop = 0 which means find all terms in the
> exact positions relative to each other. Phrase query with slop > 0 means
> find all terms within <slop> positions relative to each other. If slop =
> 10, find all terms within 10 positions of each other. Here is a concrete
> example from my current work searching SEC filings.
>
> Bill Gates' full legal name is William H. Gates, III. In the SEC database
> [1], his name is GATES WILLIAM H III. If you are searching the records of
> people within the SEC database and you want to find Bill Gates, most users
> will type "bill gates". Since there are many people with the first name
> Bill (William) and the last name Gates, Bill Gates most likely won't be the
> first result with a standard keyword query. Likewise an exact phrase query
> (slop = 0) will not find him either because the first and last names are
> transposed. What you need is a phrase query with a slop = 2 which will
> match "William Gates", "William H Gates", "Gates William", etc. There is
> still the issue of Bill vs William, but that can be solved with synonyms
> and is a different topic.
>
> 1. https://www.sec.gov/cgi-bin/browse-edgar?CIK=902012&owner=exclude&action=getcompany&Find=Search
>

One more thing. In that trivial example, an AND query would probably do a great job too.
However, if you are searching for Bill Gates in large text documents rather than a list of names, an AND query will not give you very good results, because the words "bill" and "gates" are both so common that they often appear in the same document without being anywhere near each other.
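
To make the difference between a sloppy phrase query and an AND query concrete, here is a minimal Python sketch. It is only a toy model of the idea, not Lucene's implementation or a proposal for how PostgreSQL should do it. The function names (min_slop, phrase_matches, and_matches) and the whitespace tokenizer are my own assumptions for illustration. Slop is modeled as the spread between how far each term sits from its position in the query, which is one common way to describe it and reproduces the "transposition costs 2" behavior from the example above.

```python
from itertools import product


def min_slop(query_terms, doc_terms):
    """Smallest slop at which the query phrase matches, or None if a term is missing.

    Toy model: for each query term, compare its position in the document
    with its position in the query. The slop needed is the spread
    (max - min) of those offsets over the best choice of occurrences.
    """
    occurrences = []
    for term in query_terms:
        hits = [i for i, t in enumerate(doc_terms) if t == term]
        if not hits:
            return None  # a query term does not occur at all
        occurrences.append(hits)

    best = None
    for combo in product(*occurrences):
        # A repeated query term must not reuse the same document position.
        if len(set(combo)) < len(combo):
            continue
        offsets = [doc_pos - query_pos for query_pos, doc_pos in enumerate(combo)]
        spread = max(offsets) - min(offsets)
        if best is None or spread < best:
            best = spread
    return best


def phrase_matches(query, document, slop=0):
    """True if the phrase matches the document within the given slop."""
    needed = min_slop(query.lower().split(), document.lower().split())
    return needed is not None and needed <= slop


def and_matches(query, document):
    """Plain AND query: every term occurs somewhere, position ignored."""
    doc_terms = set(document.lower().split())
    return all(term in doc_terms for term in query.lower().split())


if __name__ == "__main__":
    name = "GATES WILLIAM H III"
    print(phrase_matches("william gates", name, slop=0))  # False (transposed)
    print(phrase_matches("william gates", name, slop=2))  # True

    text = "the bill passed the senate while the gates stayed closed"
    print(and_matches("bill gates", text))                # True (false positive)
    print(phrase_matches("bill gates", text, slop=2))     # False
```

In the name list both query types find the record, but in the longer text only the AND query matches, which is exactly the kind of irrelevant hit that a slop limit filters out.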