I understand this: > So what I need is to have a "generalization" of the user agent in my > index
So that we may end up with 5 - 10 different tokens. It has to be hard-coded, for instance, via synonym dictionary or something similar (it is very easy in SOLR). WAP, HTML, and etc. Most important agent attributes. It doesn't matter IE or Mozilla; what plays a role is, for instance, screen resolution, character encoding support, gzip support, and etc.; WAP or HTML is very important. But why??? I think we are giving bad advice without knowing source of a problem (use case)... Obviously: Niclas tries to map thousands User-Agent strings to few tokens at indexing time, and at query time. Question: Why to use multivalued field? {"Mozilla", "Firefox"} - can't we have simple encoded value "MF"? - we need use case... ... (better to post in SOLR; it is just configuration settings without hard coding...) > -----Original Message----- > From: Ted Dunning [mailto:ted.dunn...@gmail.com] > Sent: February-05-10 6:45 PM > To: gene...@lucene.apache.org > Cc: java-user@lucene.apache.org > Subject: Re: Wildcard searches???? > > Fuad, > > I think that you took Niclas requirements backwards. He wants a reverse > wild-card search where the wildcard is in the document and the search > query > is more specific. > > You are correct that leading wildcard is critical here. > > On Fri, Feb 5, 2010 at 2:25 PM, Digy <digyd...@gmail.com> wrote: > > > http://en.wikipedia.org/wiki/Crossposting > > > > -----Original Message----- > > From: Niclas Rothman [mailto:n...@lechill.com] > > Sent: Saturday, February 06, 2010 12:12 AM > > To: gene...@lucene.apache.org > > Cc: java-user@lucene.apache.org > > Subject: RE: Wildcard searches???? > > > > Hi Fuad and thanks for your reply! > > > > The first post I know now was a wrong approach, I should not have the > > wildcard included in my index. > > > > However, I can't do as you suggest, to have the full user agent in the > > index, that’s the whole idea actually. > > > > The reason can be explained like this, device manufactures are > literally > > spitting out new devices and updates all the time which generates new > user > > agents that are very similar, perhaps only a small version number > differs. > > So what I need is to have a "generalization" of the user agent in my > > index, to only have the start of the useragent without including the > > versions numbers. > > This way my index are all the time "up to date" even if users with new > > version numbers access my search service, which in my app isn’t > significant > > but instead causing my problems.... > > > > Example: > > > > I have 2 Indexed documents where the documents useragent field are > partial: > > <doc> > > <id>1</id> > > <useragents> > > Firefox > > Mozilla/4.0+SonyEricsson > > </useragents> > > </doc> > > <doc> > > <id>2</id> > > <useragents> > > Firefox > > Mozilla/4.0+SonyEricsson > > </useragents> > > </doc> > > > > User A searches my app with an user agent as: > > > > > > Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP- > 2.1+Configuration/CLDC-1.1+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0 > > > > The search app will display both document 1 and 2, because his user > agent > > starts exactly has the user agent pattern in my document. > > > > > > User B searches my app with an user agent as (Please note that this > user > > agent differs in the near end from Users A (JP9.5.1 instead of > JP8.4.1)): > > > > > > Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP- > 2.1+Configuration/CLDC-1.1+JavaPlatform/JP9.5.1+UP.Link/6.3.1.20.0 > > > > The search app will also display both document 1 and 2, because his > user > > agent starts exactly has the user agent pattern in my document. > > Even if the version number of the java platform differs between user A > and > > B. > > > > If we now have a different index with FULL user agents, only User A > would > > have documents returned, none of the documents user agents matched > Users B > > user agent because of the "silly" version number!! > > > > <doc> > > <id>1</id> > > <useragents> > > Firefox > > > > Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP- > 2.1+Configuration/CLDC-1.1+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0 > > </useragents> > > </doc> > > <doc> > > <id>2</id> > > <useragents> > > Firefox > > > > Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP- > 2.1+Configuration/CLDC-1.1+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0 > > </useragents> > > </doc> > > > > Can you see my problem? > > So the basic thing is if I somehow can do a query saying that at match > > should take place if a document useragent starts with the value of the > users > > useragent. > > > > In theory, having a startsWith "function / locig are easy enough to > > implement in C# / T-SQL, but how on earth should I do this in SolR / > > Lucene????? > > > > Regards > > > > Niclas > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -----Original Message----- > > From: Fuad Efendi [mailto:f...@efendi.ca] > > Sent: 05 February 2010 22:49 > > To: gene...@lucene.apache.org > > Cc: java-user@lucene.apache.org > > Subject: RE: Wildcard searches???? > > > > Niclas, > > > > I looked at your initial post, you are creating document with field > "abc*" > > - nothing related to "wildcard query"! > > > > Of course, query [useragents:abcdefghijklm] will return no results, > and > > [q=useragents:abc] no results, but [q=useragents:abc*] will return > > something. > > > > text_nav is specific SOLR type for _leading_ wildcard queries; you > don't > > need it (you don't need _leading_ wildcard queries). > > > > On indexing time, instead of > > <doc> > > <useragents> > > Firefox* > > Mozilla/4.0* > > </useragents> > > </doc> > > > > > > You should index > > <doc> > > <useragents> > > > > Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP- > 2.1+Configuration/CLDC-1.1+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0 > > </useragents> > > </doc> > > > > And also, you need to choose properly SOLR type; for instance, > textTight or > > textgen, or even non-tokenized string! > > > > > > And, query [q=useragents:moz*] will return this document (even if this > > field is nontokenized). > > > > > > -Fuad > > > > > > P.S. Don't use * when you create Lucene document; use it as part of > query. > > > > > > > > > > > -----Original Message----- > > > From: Niclas Rothman [mailto:n...@lechill.com] > > > Sent: February-05-10 4:44 PM > > > To: gene...@lucene.apache.org > > > Cc: java-user@lucene.apache.org > > > Subject: RE: Wildcard searches???? > > > > > > Ted im using SOLR, but I cant figure out what type of fieldtype I > should > > > use to get a query like this to work: > > > > > > > > > q=useragents: abcdefghijklm > > > > > > > > > where I have in my index one document with value "abc" in field > > > "useragents" > > > > > > That query results in 0 hits. > > > > > > If I issue this I get 1 hit of course (exact mathch) > > > > > > q=useragents: Mozilla > > > > > > > > > My document definition in SOLR looks like: > > > > > > <fields> > > > <field name="id" type="tint" indexed="true" stored="true" > > > required="true" /> > > > <field name="useragents" type="text_rev" indexed="true" > > > stored="true" required="false" multiValued="true" /> > > > </fields> > > > > > > Any clue? > > > > > > Nic > > > > > > > > > > > > > > > -----Original Message----- > > > From: Ted Dunning [mailto:ted.dunn...@gmail.com] > > > Sent: 05 February 2010 21:18 > > > To: gene...@lucene.apache.org > > > Cc: java-user@lucene.apache.org > > > Subject: Re: Wildcard searches???? > > > > > > This is quite close. You will have to break down the user agent > that is > > > your query into the same kinds of pieces as you did for your index. > > > Lucene > > > will only do exact matching of terms during searching (wildcard > queries > > > are > > > handled by exploding the term into all possible variants). > > > > > > Regarding the field type, you will probably have to customize that a > > > fair > > > bit to make +'s be separators and such. If you use SOLR to index > and > > > query > > > your data, then it will make sure that your separation into tokens > is > > > compatible unless you are using shortened forms like you mention > here. > > > > > > On Fri, Feb 5, 2010 at 12:03 PM, Niclas Rothman <n...@lechill.com> > > > wrote: > > > > > > > Hi again Ted and many thanks for your efforts. > > > > Ok, just to be sure that we fully understand each other: > > > > > > > > In my index I will store partial useragents without any wildcards > *, > > > e.g. > > > > > > > > Fire (for Firefox) > > > > Inte (Internet Explorer) > > > > Moz (Mozill) > > > > > > > > > > > > When I during runtime search my index for Media objects that are > > > compatible > > > > with a useragent, > > > > e.g: > > > > > > > > > > > > > > > > "Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP- > > > 2.1+Configuration/CLDC-1.1+JavaPlatform/JP-8.4.1+UP.Link/6.3.1.20.0" > > > > > > > > Hopefully lucene / solr will serve me with all Media objects that > > > partially > > > > math my full user agent string and also perhaps some mismatches. > To be > > > > absolutely sure that I only show Media objects that are > compatible, I > > > will > > > > have to loop through the resultset in my program to do a final > test > > > and > > > > exclude any mismatches. > > > > > > > > Is this what you are saying Ted, that I cant do the whole process > in > > > Solr / > > > > Lucene, that I need to do the final test in my program (C#)? > > > > > > > > Also, Im using Solr 1.4, what fieldtype would you recommend to use > for > > > the > > > > useragent ( tokenized) > > > > > > > > Okey, lets see what you have to say about this. > > > > Please bear with me, im all new to lucene and solr!! > > > > > > > > Regards > > > > Niclas > > > > > > > > > > > > > > > > > > > > -----Original Message----- > > > > From: Ted Dunning [mailto:ted.dunn...@gmail.com] > > > > Sent: 05 February 2010 20:43 > > > > To: gene...@lucene.apache.org > > > > Cc: java-user@lucene.apache.org > > > > Subject: Re: Wildcard searches???? > > > > > > > > Yes. I think you have it. > > > > > > > > To explain in a bit more detail, I think that you should store a > > > tokenized > > > > form of the user agents and should query using a tokenized form of > > > your > > > > user > > > > agent. This will retrieve documents that have partial matches to > the > > > user > > > > agent of interest. Many of these matches, however, may not meet > the > > > > requirements of the wildcard expression in the documents. As > such, > > > you > > > > will > > > > need to look at each retrieved document to retrieve the wild > > > expression > > > > from > > > > each one in turn to test if the original (untokenized) query > satisfies > > > the > > > > wildcard. > > > > > > > > If your wildcards are all of a positive nature as your example is, > > > then > > > > this > > > > should work pretty well. > > > > > > > > On Fri, Feb 5, 2010 at 9:09 AM, Niclas Rothman <n...@lechill.com> > > > wrote: > > > > > > > > > Hi Ted and thanks for all your efforts. > > > > > Listen im a little bit lost here trying to understand what you > are > > > trying > > > > > to tell me :-) > > > > > > > > > > 1. I Store my useragents in a field that is tokenized. > > > > > 2. Then when I search, you are saying that I should "scan" down > the > > > > matches > > > > > via a SOLR function, or what? > > > > > Are you referring to these functions in SOLR? > > > > > > > > > > http://wiki.apache.org/solr/FunctionQuery > > > > > > > > > > > > > > > Sorry for not grasping immmediatley! > > > > > > > > > > Regards Niclas > > > > > > > > > > -----Original Message----- > > > > > From: Ted Dunning [mailto:ted.dunn...@gmail.com] > > > > > Sent: 05 February 2010 17:44 > > > > > To: gene...@lucene.apache.org > > > > > Cc: java-user@lucene.apache.org > > > > > Subject: Re: Wildcard searches???? > > > > > > > > > > Tokenize your user agent strings, then store the tokenized form > > > > separately > > > > > from the wild card. At retrieval time, scan down the matches > and > > > apply > > > > the > > > > > wildcard from each document to your original query. The SOLR > > > function > > > > > query > > > > > might be useful for this as would be a custom hit collector. > > > > > > > > > > On Fri, Feb 5, 2010 at 7:57 AM, Niclas Rothman > <n...@lechill.com> > > > wrote: > > > > > > > > > > > Hi there, i facing a problem and would like to ask the > community > > > for > > > > some > > > > > > help. > > > > > > > > > > > > In my index I store browser useragent values as "wildcarded" > / > > > > partial, > > > > > > which should be understood that an indexed document > > > > > > should only be shown to end users if his browsers useragent > > > matches a > > > > > > wildcared usereragent in my document. > > > > > > > > > > > > So what I have Is actually a "reversed" matching, the > wildcards > > > are in > > > > my > > > > > > document and NOT in my actual query. > > > > > > Does anyone know if this "setup" Is possible, e.g. to execute > a > > > query > > > > in > > > > > > style with: > > > > > > > > > > > > useragents: > > > > > > > > > > > > > > > > "Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP- > > > 2.1+Configuration/CLDC-1.1+JavaPlatform/JP-8.4.1+UP.Link/6.3.1.20.0" > > > > > > > > > > > > In this example I would have a hit because Mozilla/4.0* > matches > > > the > > > > > > useragent. > > > > > > > > > > > > <doc> > > > > > > <useragents> > > > > > > Firefox* > > > > > > Mozilla/4.0* > > > > > > </useragents> > > > > > > </doc> > > > > > > > > > > > > > > > > > > Regards > > > > > > Niclas > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > Ted Dunning, CTO > > > > > DeepDyve > > > > > > > > > > > > > > > > > > > > > -- > > > > Ted Dunning, CTO > > > > DeepDyve > > > > > > > > > > > > > > > > -- > > > Ted Dunning, CTO > > > DeepDyve > > > > > > > > > > > -- > Ted Dunning, CTO > DeepDyve --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org