On Thu, Aug 11, 2016 at 10:42 AM, Ryan Pedela <rped...@datalanche.com>
wrote:

> On Thu, Aug 11, 2016 at 9:27 AM, Oleg Bartunov <obartu...@gmail.com>
> wrote:
>
>> On Tue, Aug 9, 2016 at 9:59 PM, Ryan Pedela <rped...@datalanche.com>
>> wrote:
>> > I would say that it is worth it to have a "phrase slop" operator
>> > (Apache Lucene terminology). Proximity search is extremely useful for
>> > improving relevance and phrase slop is one of the tools to achieve that.
>>
>> It'd be great if you could explain what "phrase slop" is. I assume it's
>> not about search, but about relevance.
>>
>
> Sure. An exact phrase query has slop = 0, which means find all terms in
> the exact positions relative to each other. A phrase query with slop > 0
> means find all terms within <slop> positions of each other. If slop = 10,
> find all terms within 10 positions of each other. Here is a concrete
> example from my current work searching SEC filings.
>
> Bill Gates' full legal name is William H. Gates, III. In the SEC database
> [1], his name is GATES WILLIAM H III. If you are searching the records of
> people within the SEC database and you want to find Bill Gates, most users
> will type "bill gates". Since there are many people with the first name
> Bill (William) and the last name Gates, Bill Gates most likely won't be the
> first result with a standard keyword query. Likewise, an exact phrase query
> (slop = 0) will not find him either, because the first and last names are
> transposed. What you need is a phrase query with slop = 2, which will
> match "William Gates", "William H Gates", "Gates William", etc. There is
> still the issue of Bill vs William, but that can be solved with synonyms
> and is a different topic.
>
> 1. https://www.sec.gov/cgi-bin/browse-edgar?CIK=902012&owner=exclude&action=getcompany&Find=Search
>


One more thing. In that trivial example, an AND query would probably do a
great job too. However, if you are searching for Bill Gates in large text
documents rather than a list of names, an AND query will not give you very
good results, because the words "bill" and "gates" are so common and can
match anywhere in the document, far apart from each other.
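
To make the slop idea concrete, here is a minimal Python sketch. It is a
simplified model of sloppy phrase matching, not Lucene's actual scorer and
not a proposed operator; the function names (tokenize, min_slop,
phrase_match, and_match) are made up for illustration. It assumes slop is
measured as the spread of (document position - query position) over the
matched terms, which is why a transposed pair needs slop = 2.

```python
from itertools import product


def tokenize(text):
    """Lowercase, split on whitespace, and drop trailing '.' and ','."""
    return [t.strip(".,").lower() for t in text.split() if t.strip(".,")]


def min_slop(query, document):
    """Smallest slop at which the query phrase matches, or None if a term is missing.

    Assumption: slop = max(offset) - min(offset), where offset is
    (position in document) - (position in query) for each matched term.
    """
    q_terms = tokenize(query)
    d_terms = tokenize(document)
    # For every query term, collect all positions where it occurs in the document.
    positions = [[i for i, tok in enumerate(d_terms) if tok == term] for term in q_terms]
    if any(not p for p in positions):
        return None
    best = None
    for combo in product(*positions):
        if len(set(combo)) != len(combo):
            continue  # two query terms may not share one document position
        offsets = [doc_pos - query_pos for query_pos, doc_pos in enumerate(combo)]
        needed = max(offsets) - min(offsets)
        if best is None or needed < best:
            best = needed
    return best


def phrase_match(query, document, slop=0):
    """True if the phrase matches within the given slop (slop=0 is an exact phrase)."""
    needed = min_slop(query, document)
    return needed is not None and needed <= slop


def and_match(query, document):
    """Plain AND query: every term occurs somewhere, positions ignored."""
    d_terms = set(tokenize(document))
    return all(term in d_terms for term in tokenize(query))


print(min_slop("william gates", "William Gates"))        # 0
print(min_slop("william gates", "William H Gates"))      # 1
print(min_slop("william gates", "GATES WILLIAM H III"))  # 2

# In longer text the AND query matches, but the terms are too far apart for slop = 2.
text = "The senate passed the bill before the park gates closed"
print(and_match("bill gates", text))             # True
print(phrase_match("bill gates", text, slop=2))  # False
```

With slop = 2 all three name variants match, while the unrelated sentence
is rejected even though it contains both words.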
