RE: Wildcard searches????

Fuad Efendi Fri, 05 Feb 2010 20:29:49 -0800

I understand this:

> So what I need is to have a "generalization" of the user agent in  my
> index



So that we end up with 5 - 10 different tokens. The mapping has to be 
hard-coded, for instance via a synonym dictionary or something similar (it is 
very easy in SOLR): WAP, HTML, etc., i.e. the most important agent attributes. 
It doesn't matter whether it is IE or Mozilla; what plays a role is, for 
instance, screen resolution, character encoding support, gzip support, etc.; 
WAP vs. HTML is very important.
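
To make the "few tokens" idea concrete, here is a rough sketch in Python (not 
SOLR configuration). The rule table, the token names and the substrings are 
all made up for illustration; a real mapping would come from the use case.

```python
# Hypothetical rules: (substring to look for, coarse token to emit).
# Both the substrings and the token names are illustrative assumptions.
RULES = [
    ("sonyericsson", "MOBILE"),
    ("midp", "MOBILE"),
    ("up.link", "WAP"),
    ("firefox", "HTML"),
    ("msie", "HTML"),
]


def generalize(user_agent):
    """Return the sorted coarse tokens for one raw User-Agent string."""
    ua = user_agent.lower()
    tokens = {token for needle, token in RULES if needle in ua}
    # Fall back to a single catch-all token when no rule fires.
    return sorted(tokens) or ["UNKNOWN"]


if __name__ == "__main__":
    ua = ("Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4"
          "+Profile/MIDP-2.1+Configuration/CLDC-1.1"
          "+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0")
    print(generalize(ua))
```

The same function would be applied to the stored value at indexing time and to 
the incoming agent at query time, so a changed version number never reaches 
the index.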

But why???

I think we are giving bad advice without knowing the source of the problem 
(the use case)...


Obviously:
Niclas is trying to map thousands of User-Agent strings to a few tokens, both 
at indexing time and at query time.

Question:
Why use a multivalued field? {"Mozilla", "Firefox"} - can't we have a simple 
encoded value "MF"? - we need the use case...


...

(Better to post this on the SOLR list; it is just configuration settings, 
without any hard coding...)
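
If the requirement really is the reverse match from the quoted thread (the 
partial agent is stored in the document, the full agent arrives as the query), 
one possible approach is to expand the incoming agent into all of its leading 
prefixes and match those exactly against a non-tokenized field. The sketch 
below is plain Python, not a SOLR or Lucene API; the function names and the 
min_len cut-off are assumptions for illustration only.

```python
def leading_prefixes(user_agent, min_len=3):
    """All leading prefixes of the query string, longest first."""
    return [user_agent[:n] for n in range(len(user_agent), min_len - 1, -1)]


def matches(indexed_patterns, user_agent):
    """True if any indexed pattern is a leading prefix of the user agent."""
    candidates = set(leading_prefixes(user_agent))
    return any(pattern in candidates for pattern in indexed_patterns)


if __name__ == "__main__":
    doc = ["Firefox", "Mozilla/4.0+SonyEricsson"]
    user_a = ("Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4"
              "+Profile/MIDP-2.1+Configuration/CLDC-1.1"
              "+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0")
    user_b = user_a.replace("JP8.4.1", "JP9.5.1")
    print(matches(doc, user_a), matches(doc, user_b))
```

In a search engine the candidate prefixes would presumably be sent as an OR of 
exact terms; escaping of query syntax characters such as + and / is left out 
here.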




> -----Original Message-----
> From: Ted Dunning [mailto:ted.dunn...@gmail.com]
> Sent: February-05-10 6:45 PM
> To: gene...@lucene.apache.org
> Cc: java-user@lucene.apache.org
> Subject: Re: Wildcard searches????
> 
> Fuad,
> 
> I think that you took Niclas's requirements backwards.  He wants a reverse
> wild-card search where the wildcard is in the document and the search
> query
> is more specific.
> 
> You are correct that leading wildcard is critical here.
> 
> On Fri, Feb 5, 2010 at 2:25 PM, Digy <digyd...@gmail.com> wrote:
> 
> > http://en.wikipedia.org/wiki/Crossposting
> >
> > -----Original Message-----
> > From: Niclas Rothman [mailto:n...@lechill.com]
> > Sent: Saturday, February 06, 2010 12:12 AM
> > To: gene...@lucene.apache.org
> > Cc: java-user@lucene.apache.org
> > Subject: RE: Wildcard searches????
> >
> > Hi Fuad and thanks for your reply!
> >
> > The first post I know now was a wrong approach, I should not have the
> > wildcard included in my index.
> >
> > However, I can't do as you suggest, to have the full user agent in the
> > index, that’s the whole idea actually.
> >
> > The reason can be explained like this, device manufactures are
> literally
> > spitting out new devices and updates all the time which generates new
> user
> > agents that are very similar, perhaps only a small version number
> differs.
> > So what I need is to have a "generalization" of the user agent in  my
> > index, to only have the start of the useragent without including the
> > versions numbers.
> > This way my index are all the time "up to date" even if users with new
> > version numbers access my search service, which in my app isn’t
> significant
> > but instead causing my problems....
> >
> > Example:
> >
> > I have 2 Indexed documents where the documents useragent field are
> partial:
> > <doc>
> >        <id>1</id>
> >        <useragents>
> >        Firefox
> >            Mozilla/4.0+SonyEricsson
> >        </useragents>
> > </doc>
> > <doc>
> >        <id>2</id>
> >        <useragents>
> >        Firefox
> >            Mozilla/4.0+SonyEricsson
> >        </useragents>
> > </doc>
> >
> > User A searches my app with an user agent as:
> >
> >
> >  Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-
> 2.1+Configuration/CLDC-1.1+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0
> >
> > The search app will display both document 1 and 2, because his user
> agent
> > starts exactly as the user agent pattern in my document.
> >
> >
> > User B searches my app with an user agent as (Please note that this
> user
> > agent differs in the near end from Users A (JP9.5.1 instead of
> JP8.4.1)):
> >
> >
> >  Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-
> 2.1+Configuration/CLDC-1.1+JavaPlatform/JP9.5.1+UP.Link/6.3.1.20.0
> >
> > The search app will also display both document 1 and 2, because his
> user
> > agent starts exactly as the user agent pattern in my document.
> > Even if the version number of the java platform differs between user A
> and
> >  B.
> >
> > If we now have a different index with FULL user agents, only User A
> would
> > have documents returned, none of the documents user agents matched
> Users B
> > user agent because of the "silly" version number!!
> >
> > <doc>
> >        <id>1</id>
> >        <useragents>
> >        Firefox
> >
> >  Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-
> 2.1+Configuration/CLDC-1.1+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0
> >        </useragents>
> > </doc>
> > <doc>
> >        <id>2</id>
> >        <useragents>
> >        Firefox
> >
> >  Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-
> 2.1+Configuration/CLDC-1.1+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0
> >        </useragents>
> > </doc>
> >
> > Can you see my problem?
> > So the basic thing is if I somehow can do a query saying that at match
> > should take place if a document useragent starts with the value of the
> users
> > useragent.
> >
> > In theory, a startsWith function / logic is easy enough to
> > implement in C# / T-SQL,  but how on earth should I do this in SolR /
> > Lucene?????
> >
> > Regards
> >
> > Niclas
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > -----Original Message-----
> > From: Fuad Efendi [mailto:f...@efendi.ca]
> > Sent: 05 February 2010 22:49
> > To: gene...@lucene.apache.org
> > Cc: java-user@lucene.apache.org
> > Subject: RE: Wildcard searches????
> >
> > Niclas,
> >
> > I looked at your initial post, you are creating document with field
> "abc*"
> > - nothing related to "wildcard query"!
> >
> > Of course, query [useragents:abcdefghijklm] will return no results,
> and
> > [q=useragents:abc] no results, but [q=useragents:abc*] will return
> > something.
> >
> > text_rev is a specific SOLR type for _leading_ wildcard queries; you
> don't
> > need it (you don't need _leading_ wildcard queries).
> >
> > On indexing time, instead of
> > <doc>
> > <useragents>
> >                Firefox*
> >                Mozilla/4.0*
> > </useragents>
> > </doc>
> >
> >
> > You should index
> > <doc>
> > <useragents>
> >
> >  Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-
> 2.1+Configuration/CLDC-1.1+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0
> > </useragents>
> > </doc>
> >
> > And also, you need to choose properly SOLR type; for instance,
> textTight or
> > textgen, or even non-tokenized string!
> >
> >
> > And, query [q=useragents:moz*] will return this document (even if this
> > field is nontokenized).
> >
> >
> > -Fuad
> >
> >
> > P.S. Don't use * when you create Lucene document; use it as part of
> query.
> >
> >
> >
> >
> > > -----Original Message-----
> > > From: Niclas Rothman [mailto:n...@lechill.com]
> > > Sent: February-05-10 4:44 PM
> > > To: gene...@lucene.apache.org
> > > Cc: java-user@lucene.apache.org
> > > Subject: RE: Wildcard searches????
> > >
> > > Ted im using SOLR, but I cant figure out what type of fieldtype I
> should
> > > use to get a query like this to work:
> > >
> > >
> > > q=useragents: abcdefghijklm
> > >
> > >
> > > where I have in my index one document with value "abc" in field
> > > "useragents"
> > >
> > > That query results in 0 hits.
> > >
> > > If I issue this I get 1 hit of course (exact match)
> > >
> > > q=useragents: Mozilla
> > >
> > >
> > > My document definition in SOLR looks like:
> > >
> > > <fields>
> > >     <field name="id" type="tint" indexed="true" stored="true"
> > > required="true" />
> > >     <field name="useragents" type="text_rev" indexed="true"
> > > stored="true" required="false" multiValued="true" />
> > > </fields>
> > >
> > > Any clue?
> > >
> > > Nic
> > >
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: Ted Dunning [mailto:ted.dunn...@gmail.com]
> > > Sent: 05 February 2010 21:18
> > > To: gene...@lucene.apache.org
> > > Cc: java-user@lucene.apache.org
> > > Subject: Re: Wildcard searches????
> > >
> > > This is quite close.  You will have to break down the user agent
> that is
> > > your query into the same kinds of pieces as you did for your index.
> > > Lucene
> > > will only do exact matching of terms during searching (wildcard
> queries
> > > are
> > > handled by exploding the term into all possible variants).
> > >
> > > Regarding the field type, you will probably have to customize that a
> > > fair
> > > bit to make +'s be separators and such.  If you use SOLR to index
> and
> > > query
> > > your data, then it will make sure that your separation into tokens
> is
> > > compatible unless you are using shortened forms like you mention
> here.
> > >
> > > On Fri, Feb 5, 2010 at 12:03 PM, Niclas Rothman <n...@lechill.com>
> > > wrote:
> > >
> > > > Hi again Ted and many thanks for your efforts.
> > > > Ok, just to be sure that we fully understand each other:
> > > >
> > > > In my index I will store partial useragents without any wildcards
> *,
> > > e.g.
> > > >
> > > > Fire    (for Firefox)
> > > > Inte    (Internet Explorer)
> > > > Moz     (Mozilla)
> > > >
> > > >
> > > > When I during runtime search my index for Media objects that are
> > > compatible
> > > > with a useragent,
> > > > e.g:
> > > >
> > > >
> > > >
> > >
> "Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-
> > > 2.1+Configuration/CLDC-1.1+JavaPlatform/JP-8.4.1+UP.Link/6.3.1.20.0"
> > > >
> > > > Hopefully lucene / solr will serve me with all Media objects that
> > > partially
> > > > match my full user agent string and also perhaps some mismatches.
> To be
> > > > absolutely sure that I only show Media objects that are
> compatible, I
> > > will
> > > > have to loop through the resultset in my program to do a final
> test
> > > and
> > > > exclude any mismatches.
> > > >
> > > > Is this what you are saying Ted, that I cant do the whole process
> in
> > > Solr /
> > > > Lucene, that I need to do the final test in my program (C#)?
> > > >
> > > > Also, Im using Solr 1.4, what fieldtype would you recommend to use
> for
> > > the
> > > > useragent ( tokenized)
> > > >
> > > > Okey, lets see what you have to say about this.
> > > > Please bear with me, im all new to lucene and solr!!
> > > >
> > > > Regards
> > > > Niclas
> > > >
> > > >
> > > >
> > > >
> > > > -----Original Message-----
> > > > From: Ted Dunning [mailto:ted.dunn...@gmail.com]
> > > > Sent: 05 February 2010 20:43
> > > > To: gene...@lucene.apache.org
> > > > Cc: java-user@lucene.apache.org
> > > > Subject: Re: Wildcard searches????
> > > >
> > > > Yes.  I think you have it.
> > > >
> > > > To explain in a bit more detail, I think that you should store a
> > > tokenized
> > > > form of the user agents and should query using a tokenized form of
> > > your
> > > > user
> > > > agent.  This will retrieve documents that have partial matches to
> the
> > > user
> > > > agent of interest.  Many of these matches, however, may not meet
> the
> > > > requirements of the wildcard expression in the documents.  As
> such,
> > > you
> > > > will
> > > > need to look at each retrieved document to retrieve the wild
> > > expression
> > > > from
> > > > each one in turn to test if the original (untokenized) query
> satisfies
> > > the
> > > > wildcard.
> > > >
> > > > If your wildcards are all of a positive nature as your example is,
> > > then
> > > > this
> > > > should work pretty well.
> > > >
> > > > On Fri, Feb 5, 2010 at 9:09 AM, Niclas Rothman <n...@lechill.com>
> > > wrote:
> > > >
> > > > > Hi Ted and thanks for all your efforts.
> > > > > Listen im a little bit lost here trying to understand what you
> are
> > > trying
> > > > > to tell me :-)
> > > > >
> > > > > 1. I Store my useragents in a field that is tokenized.
> > > > > 2. Then when I search, you are saying that I should "scan" down
> the
> > > > matches
> > > > > via a SOLR function, or what?
> > > > > Are you referring to these functions in SOLR?
> > > > >
> > > > > http://wiki.apache.org/solr/FunctionQuery
> > > > >
> > > > >
> > > > > Sorry for not grasping immediately!
> > > > >
> > > > > Regards Niclas
> > > > >
> > > > > -----Original Message-----
> > > > > From: Ted Dunning [mailto:ted.dunn...@gmail.com]
> > > > > Sent: 05 February 2010 17:44
> > > > > To: gene...@lucene.apache.org
> > > > > Cc: java-user@lucene.apache.org
> > > > > Subject: Re: Wildcard searches????
> > > > >
> > > > > Tokenize your user agent strings, then store the tokenized form
> > > > separately
> > > > > from the wild card.  At retrieval time, scan down the matches
> and
> > > apply
> > > > the
> > > > > wildcard from each document to your original query.  The SOLR
> > > function
> > > > > query
> > > > > might be useful for this as would be a custom hit collector.
> > > > >
> > > > > On Fri, Feb 5, 2010 at 7:57 AM, Niclas Rothman
> <n...@lechill.com>
> > > wrote:
> > > > >
> > > > > > Hi there, I am facing a problem and would like to ask the
> community
> > > for
> > > > some
> > > > > > help.
> > > > > >
> > > > > > In my index I store browser  useragent values as "wildcarded"
> /
> > > > partial,
> > > > > >  which should be understood that an indexed document
> > > > > > should only be shown to end users if his browsers useragent
> > > matches a
> > > > > > wildcarded useragent in my document.
> > > > > >
> > > > > > So what I have Is actually a "reversed" matching, the
> wildcards
> > > are in
> > > > my
> > > > > > document and NOT in my actual query.
> > > > > > Does anyone know if this "setup" Is possible, e.g. to execute
> a
> > > query
> > > > in
> > > > > > style with:
> > > > > >
> > > > > > useragents:
> > > > > >
> > > > >
> > > >
> "Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-
> > > 2.1+Configuration/CLDC-1.1+JavaPlatform/JP-8.4.1+UP.Link/6.3.1.20.0"
> > > > > >
> > > > > > In this example I would have a hit because Mozilla/4.0*
> matches
> > > the
> > > > > > useragent.
> > > > > >
> > > > > > <doc>
> > > > > > <useragents>
> > > > > >                Firefox*
> > > > > >                Mozilla/4.0*
> > > > > > </useragents>
> > > > > > </doc>
> > > > > >
> > > > > >
> > > > > > Regards
> > > > > > Niclas
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Ted Dunning, CTO
> > > > > DeepDyve
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Ted Dunning, CTO
> > > > DeepDyve
> > > >
> > >
> > >
> > >
> > > --
> > > Ted Dunning, CTO
> > > DeepDyve
> >
> >
> >
> >
> 
> 
> --
> Ted Dunning, CTO
> DeepDyve


