Re: Lucene Searches VS. Relational database Queries
Gentlemen,

A join-like operation between Lucene indexes can be done with (at least) reasonable performance by using a few standard methods from RDBs: sort before going to disk, and cache whenever possible. The steps are:

- Query the first Lucene index with the low-level search API to get the Lucene doc numbers (using HitCollector or TermDocs).
- Retrieve the key field values for the second index from the first, in doc number order. This step performs better when less data is stored in the first index. This is normally the most performance-critical step. (IndexReader.document(n))
- Sort these key values and use them, again with the low-level API, to get the doc numbers for the second Lucene index (using TermDocs).
- Build a Filter for the second index from these doc numbers; this step usually implies some sorting of document numbers, for example by collecting them in a BitSet.
- Use this filter for a text search in the second index.

On Friday 14 April 2006 00:58, Ananth T. Sarathy wrote:
> Erick,
> Don't get me wrong. I agree with you 100 percent on everything you just said, and have been advocating what you are saying. I turned to the forum to get other people's thoughts on the issue, feeling that my perspective may be a little warped, and wanted to see what the community thinks. I think there is a performance issue with our DB that I have never experienced in any other project I have worked on, which needs someone with more specific domain knowledge to fix. I think Lucene is fantastic for what we are already using it for (searching contents of HTML, colliding the values of database rows to make them free text searchable). We have been using it for over 2 years, and with very good results (once we got the hang of it).
>
> I for one think that native language searches are fundamentally different than discrete database queries; I am just having a problem trying to explain this to some of the people on my team, and wanted to see if there were other POVs out there.

The first step above can start from the results of an RDB query. Usually, the last text search step is more interactive (fundamentally different?) than earlier steps, so a filter is used to cache the join result. If the join needs to be changed slightly, it is also quite effective to cache the retrieved key values from the first index and the retrieved fields from the second index.

For the last step (and earlier ones), when two successive searches retrieve a somewhat overlapping document set, one might also want to avoid using the Hits class, because it only caches results for a single search. Instead, some LRU caches for retrieved documents and for filters can be quite effective. The caches can have the index version in their keys to keep things in sync.

Enough RAM should be available so that the indexes can be accessed without alternating between them. Also, the disk head should not do anything else while it is working through the sorted inputs, to minimize the total seek time. When the filters start taking too much RAM, have a look here: http://issues.apache.org/jira/browse/LUCENE-328

Regards,
Paul Elschot

> Ananth
>
> On 4/13/06, Erick Erickson <[EMAIL PROTECTED]> wrote:
> >
> > On 4/13/06, Ananth T. Sarathy <[EMAIL PROTECTED]> wrote:
> > >
> > > No, we do have drop-down selects that would allow for the substitution, but we also have free text fields to allow the user to search.
That > > solution > > > would I think work for the DB query replacement, but you would need a > > > regular non underscored field to allow for free text. > > > > > > > > Well, as I say, you've solved that problem already. Somewhere, somehow, > > you > > have to decide what to do with the "free text" data. Somewhere, somehow, > > you've got to decide whether "stunt director trainee" means "stunt > > director" > > + trainee, stunt + "director trainee", or stunt + director + trainee. Or > > else you can't form your SQL in the first place. And the query doesn't > > produce reasonable results if you *do* form the query. > > > > If you can form your SQL with distinct "Title = 'blah'" clauses, you can > > substitute underscores for spaces in the terms. If you can do that, you > > can > > ask Lucene to find the terms you indexed with underscores. And if you > > can't > > form your SQL queries in the first place, the question is irrelevant. > > > > All that said, perhaps a better question is "why is your SQL slow?". > > Relational databases are really good at this sort of thing. Many smart > > people have put many, many developer years into making relational > > databases > > deal with joins efficiently. Assuming you have the proper indexes etc. > > > > As much as I've been impressed with Lucene, I have to ask whether it's > > relevant to your problem. I have no clue what database you're using, how > > it's set up, or whether the examples you've given are simplified enough > > that > > I don't understand what the *real* pr
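A minimal sketch of the join steps Paul describes, against the Lucene 1.9/2.0-era low-level API. The "key" field name, the firstDocs BitSet (e.g. filled by a HitCollector on the first query), and the helper class are illustrative assumptions, not code from the thread:

    import java.util.Arrays;
    import java.util.BitSet;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.search.Filter;

    // Sketch: join firstIndex -> secondIndex on a "key" field stored in the first index.
    public class IndexJoin {
        public static Filter joinFilter(IndexReader first, BitSet firstDocs,
                                        IndexReader second) throws Exception {
            // 1. Retrieve key values from the first index in doc number order.
            String[] keys = new String[firstDocs.cardinality()];
            int k = 0;
            for (int doc = firstDocs.nextSetBit(0); doc >= 0; doc = firstDocs.nextSetBit(doc + 1)) {
                Document d = first.document(doc);   // the performance-critical step
                keys[k++] = d.get("key");
            }
            // 2. Sort the keys so the second index's term dictionary is read sequentially.
            Arrays.sort(keys);
            // 3. Collect the matching doc numbers of the second index in a BitSet.
            final BitSet bits = new BitSet(second.maxDoc());
            TermDocs td = second.termDocs();
            for (int i = 0; i < keys.length; i++) {
                td.seek(new Term("key", keys[i]));
                while (td.next()) {
                    bits.set(td.doc());
                }
            }
            td.close();
            // 4. Wrap the BitSet in a Filter for use with a later text search.
            return new Filter() {
                public BitSet bits(IndexReader reader) {
                    return bits;
                }
            };
        }
    }

The returned Filter can then be passed to Searcher.search(textQuery, filter) against the second index, and cached between searches as described above.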
Max Frequency and Tf/Idf
Hello everybody. We are building a complex automatic classification system using Lucene. We need to manage normalized Tf/Idf (Term Frequency / Inverse Document Frequency). We understand that Lucene can give us Tf and Df, and we are using these values to calculate the normalized Tf/Idf, but we would like to optimize this calculation for better performance. Is there any way to expose the maximum term frequency in a document from Lucene, and maybe to obtain the normalized Tf/Idf from Lucene directly? There are no public methods to get these values, but maybe Lucene holds this information privately, and with a modification to the Lucene source we could speed up the system. P.S. Sorry for MY English: I hope I explained my question clearly. 1000 KBye [) /\ |\| | |_ () web: www.ciconet.it Web Portal Now: www.webportalnow.com - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Max Frequency and Tf/Idf
The Term Vector code can be used to get the term frequencies from a specific document. Search this list, see the Lucene In Action book or look at http://www.cnlp.org/apachecon2005 for examples on how to use Term Vectors Danilo Cicognani wrote: Hello everybody. We are building a complex automatic classification system using Lucene. We need to manage normalized Tf/Idf (Term Frequency / Inverse Document Frequency). We understood that Lucene can give us Tf and Df and we are using these values to calculate the normalized Tf/Idf but we would like to optimize this calculation for better performance. Is there any way to expose the maximum term frequency in a document from Lucene, and maybe to obtain the normalized Tf/Idf from Lucene? There aren't a public methods to get these values, but maybe Lucene holds these informations privately and with a modify on Lucene source we could have the work done to fasten the system. P.S. Sorry for MY English: I hope I explained clearly my question. 1000 KBye [) /\ |\| | |_ () web: www.ciconet.it Web Portal Now: www.webportalnow.com - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] -- Grant Ingersoll Sr. Software Engineer Center for Natural Language Processing Syracuse University School of Information Studies 335 Hinds Hall Syracuse, NY 13244 http://www.cnlp.org Voice: 315-443-5484 Fax: 315-443-6886 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
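A small sketch of that approach, assuming the field (here called "contents") was indexed with Field.TermVector.YES; the class and field names are illustrative only:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermFreqVector;

    // Compute the maximum term frequency within one document's "contents" field.
    public class MaxTfExample {
        public static int maxTermFreq(IndexReader reader, int docNum) throws Exception {
            TermFreqVector tfv = reader.getTermFreqVector(docNum, "contents");
            if (tfv == null) {
                return 0;   // field was not indexed with term vectors
            }
            int[] freqs = tfv.getTermFrequencies();
            int max = 0;
            for (int i = 0; i < freqs.length; i++) {
                if (freqs[i] > max) {
                    max = freqs[i];
                }
            }
            return max;
        }
    }

The maximum frequency gives the denominator for a normalized tf, and idf can be computed from IndexReader.docFreq(term) and IndexReader.numDocs().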
query analysis
Hello list, I want to know if a human-written query passed through the QueryParser is "clean", i.e. free of fields, boolean clauses and query indicators. The easy way out would of course be to add a boolean flag that resets at ReInit(), but maybe there is a smarter way to do it. Perhaps it is possible to treat the returned Query as a composite pattern (i.e. query.iterateNonRewrittenParts())? The plan is to avoid making suggestions on metadata in the query: "+name:foo" should suggest on "foo" only, and not on "+name:foo". I initially tried to work with enumerateTerms, but realised this is hopeless(?) as a rewritten query looks quite different. Perhaps I'm attacking this from the wrong angle? - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Syntax help
Hello, We are using Lucene to facilitate searching of our application's log files. I am noticing some inconsistencies in result sets when searching on certain fields. One field we index is the file path. I am using a simple query like "location:Z:\logs\someLogFile.log". However, I can never get path searches like this to come back with any results. I tried escaping the backslashes and colon. Nothing seems to work. Am I missing something here in my syntax? We also index the file name. However, on file names that have mixed case or multiple extensions (logfile.D20060303.T234234) I cannot get results either. Weird. I haven't worked with Lucene very long, so I expect I am missing something simple here. If you need more info, let me know! Many Thanks! --Bill
Re: Syntax help
It would be helpful to download Luke (http://www.getopt.org/luke/) and analyze whats getting indexed. Have you tried that? On 4/14/06, Bill Snyder <[EMAIL PROTECTED]> wrote: > > Hello, > > We am using Lucene to facilitate searching of our applications log files. > I > am noticing some inconsistencies in result sets when searching on certain > fields. > > One field we index is the file path. I am using a simple query like > "location:Z:\logs\someLogFile.log". However, I can never get path searches > like this to come back with any results. Tried escaping the backslashes > and > colon. Nothing seems to work. I missing something here in my syntax? > > We also index the file name. However, on file names that have mixed case > or > multiple extensions (logfile.D20060303.T234234) I cannot get results > either. > Weird. > > I haven't worked with Lucene very long, so I expect I am missing something > simple here. > > If you need more info, let me know! > Many Thanks! > > --Bill > >
Re: Syntax help
14 apr 2006 kl. 16.37 skrev Bill Snyder: One field we index is the file path. I am using a simple query like "location:Z:\logs\someLogFile.log". However, I can never get path searches like this to come back with any results. I tried escaping the backslashes and colon. Nothing seems to work. Am I missing something here in my syntax? Can you open your index with Luke and see what the index looks like? If it looks right, what does the code look like that retrieves the field value? If not, what does the code look like that sets the field value? In case everything seems fine, do some debugging and report what values you send to Lucene and what you get out. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Syntax help
Oh, cool. Look at that. A neat tool made with thinlets. I had not heard of this...I'll see if it helps me figure out whats going on. --Bill On 4/14/06, Rajesh Munavalli <[EMAIL PROTECTED]> wrote: > > It would be helpful to download Luke (http://www.getopt.org/luke/) and > analyze whats getting indexed. Have you tried that? > > On 4/14/06, Bill Snyder <[EMAIL PROTECTED]> wrote: > > > > Hello, > > > > We am using Lucene to facilitate searching of our applications log > files. > > I > > am noticing some inconsistencies in result sets when searching on > certain > > fields. > > > > One field we index is the file path. I am using a simple query like > > "location:Z:\logs\someLogFile.log". However, I can never get path > searches > > like this to come back with any results. Tried escaping the backslashes > > and > > colon. Nothing seems to work. I missing something here in my syntax? > > > > We also index the file name. However, on file names that have mixed case > > or > > multiple extensions (logfile.D20060303.T234234) I cannot get results > > either. > > Weird. > > > > I haven't worked with Lucene very long, so I expect I am missing > something > > simple here. > > > > If you need more info, let me know! > > Many Thanks! > > > > --Bill > > > > > >
Re: Syntax help
AHA! I am using the Search tab and have entered the query: location:Z:\install\logs\archive.log.D20060406.T141958 and the query details say it was parsed to location:z. If I escape the colon, I see the new parsed query as location:"z installlogsarchive.log.d20060406.t141958". So, Lucene does not store the file path exactly?! It converts it all to lower case! Is there some property I should turn on? Plus, it is not storing the backslash. Should I be escaping these in the index before storing them? It seems so. -Bill On 4/14/06, Bill Snyder <[EMAIL PROTECTED]> wrote: > > Oh, cool. Look at that. A neat tool made with thinlets. I had not heard of > this...I'll see if it helps me figure out whats going on. > > --Bill > > > On 4/14/06, Rajesh Munavalli <[EMAIL PROTECTED]> wrote: > > > > It would be helpful to download Luke (http://www.getopt.org/luke/) and > > analyze whats getting indexed. Have you tried that? > > > > On 4/14/06, Bill Snyder < [EMAIL PROTECTED]> wrote: > > > > > > Hello, > > > > > > We am using Lucene to facilitate searching of our applications log > > files. > > > I > > > am noticing some inconsistencies in result sets when searching on > > certain > > > fields. > > > > > > One field we index is the file path. I am using a simple query like > > > "location:Z:\logs\someLogFile.log". However, I can never get path > > searches > > > like this to come back with any results. Tried escaping the > > backslashes > > > and > > > colon. Nothing seems to work. I missing something here in my syntax? > > > > > > We also index the file name. However, on file names that have mixed > > case > > > or > > > multiple extensions (logfile.D20060303.T234234 ) I cannot get results > > > either. > > > Weird. > > > > > > I haven't worked with Lucene very long, so I expect I am missing > > something > > > simple here. > > > > > > If you need more info, let me know! > > > Many Thanks! > > > > > > --Bill > > > > > > > > > > >
Re: Syntax help
14 apr 2006 kl. 17.11 skrev Bill Snyder: so if I escape the colon I see the new parsed query as location:"z installlogsarchive.log.d20060406.t141958" So, Lucene does not store the file path exactly?! It converts it all to lower case! Is there some property I should turn on? It is the Analyzer that does that. Try creating your IndexSearcher with a KeywordAnalyzer (I think). - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Syntax help
On 4/14/06, Bill Snyder <[EMAIL PROTECTED]> wrote: > > AHA! I am using the Search tab and have enteres the query : > > location:Z:\install\logs\archive.log.D20060406.T141958 > > the query details says the query was parsed to > > location:z > > so if I escape the colon I see the new parsed query as > > location:"z installlogsarchive.log.d20060406.t141958" > > So, lucence does not store the file path exactly?! It converts it all > lower > case! Is there some property I should turn on? In the StandardAnalyzer, the LowerCaseFilter converts everything into lower case. You can skip that step. Plus, it is not storing the backslash. Should I be escaping these in the > index before storing them? It seems so. Yes -Bill On 4/14/06, Bill Snyder <[EMAIL PROTECTED]> wrote: > > Oh, cool. Look at that. A neat tool made with thinlets. I had not heard of > this...I'll see if it helps me figure out whats going on. > > --Bill > > > On 4/14/06, Rajesh Munavalli <[EMAIL PROTECTED]> wrote: > > > > It would be helpful to download Luke (http://www.getopt.org/luke/) and > > analyze whats getting indexed. Have you tried that? > > > > On 4/14/06, Bill Snyder < [EMAIL PROTECTED]> wrote: > > > > > > Hello, > > > > > > We am using Lucene to facilitate searching of our applications log > > files. > > > I > > > am noticing some inconsistencies in result sets when searching on > > certain > > > fields. > > > > > > One field we index is the file path. I am using a simple query like > > > "location:Z:\logs\someLogFile.log". However, I can never get path > > searches > > > like this to come back with any results. Tried escaping the > > backslashes > > > and > > > colon. Nothing seems to work. I missing something here in my syntax? > > > > > > We also index the file name. However, on file names that have mixed > > case > > > or > > > multiple extensions (logfile.D20060303.T234234 ) I cannot get results > > > either. > > > Weird. > > > > > > I haven't worked with Lucene very long, so I expect I am missing > > something > > > simple here. > > > > > > If you need more info, let me know! > > > Many Thanks! > > > > > > --Bill > > > > > > > > > > >
Using Lucene for searching tokens, not storing them.
I would like to store everything in my application rather than using the Lucene persistency mechanism for tokens. I only want the search mechanism. I do not need the IndexReader and IndexWriter, as that will be a natural part of my application. I only want to use the Searchable. So I looked at implementing my own and tried to follow the code. Is there UML or something that describes the code and the process? I would very much appreciate someone telling me what I need to do :-) Perhaps there is some implementation I should take a look at? Memory consumption is not an issue. What do I need to consider for the CPU? - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Syntax help
Thanks! OK, how do I get the file separator to be part of the term? Luke shows the parsed query as ignoring the file separator. so location:Z\:\\/install/logs\\jetspeedservices.log becomes location:"z install logs jetspeedservices.log" --Bill On 4/14/06, Rajesh Munavalli <[EMAIL PROTECTED]> wrote: > > On 4/14/06, Bill Snyder <[EMAIL PROTECTED]> wrote: > > > > AHA! I am using the Search tab and have enteres the query : > > > > location:Z:\install\logs\archive.log.D20060406.T141958 > > > > the query details says the query was parsed to > > > > location:z > > > > so if I escape the colon I see the new parsed query as > > > > location:"z installlogsarchive.log.d20060406.t141958" > > > > So, lucence does not store the file path exactly?! It converts it all > > lower > > case! Is there some property I should turn on? > > > In the StandardAnalyzer, the LowerCaseFilter converts everything into > lower > case. You can skip that step. > > Plus, it is not storing the backslash. Should I be escaping these in the > > index before storing them? It seems so. > > Yes > > -Bill > > On 4/14/06, Bill Snyder <[EMAIL PROTECTED]> wrote: > > > > Oh, cool. Look at that. A neat tool made with thinlets. I had not heard > of > > this...I'll see if it helps me figure out whats going on. > > > > --Bill > > > > > > On 4/14/06, Rajesh Munavalli <[EMAIL PROTECTED]> wrote: > > > > > > It would be helpful to download Luke (http://www.getopt.org/luke/) and > > > analyze whats getting indexed. Have you tried that? > > > > > > On 4/14/06, Bill Snyder < [EMAIL PROTECTED]> wrote: > > > > > > > > Hello, > > > > > > > > We am using Lucene to facilitate searching of our applications log > > > files. > > > > I > > > > am noticing some inconsistencies in result sets when searching on > > > certain > > > > fields. > > > > > > > > One field we index is the file path. I am using a simple query like > > > > "location:Z:\logs\someLogFile.log". However, I can never get path > > > searches > > > > like this to come back with any results. Tried escaping the > > > backslashes > > > > and > > > > colon. Nothing seems to work. I missing something here in my syntax? > > > > > > > > We also index the file name. However, on file names that have mixed > > > case > > > > or > > > > multiple extensions (logfile.D20060303.T234234 ) I cannot get > results > > > > either. > > > > Weird. > > > > > > > > I haven't worked with Lucene very long, so I expect I am missing > > > something > > > > simple here. > > > > > > > > If you need more info, let me know! > > > > Many Thanks! > > > > > > > > --Bill > > > > > > > > > > > > > > > > > >
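One common way to get exact, case- and separator-preserving matches on a path field (a sketch, not something prescribed in the thread; the paths and index location are made up): index the field untokenized and query it with a TermQuery, so no analyzer touches it.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;

    public class ExactPathField {
        public static void main(String[] args) throws Exception {
            String indexDir = "C:/tmp/logindex";                     // hypothetical location
            String path = "Z:\\install\\logs\\jetspeedservices.log";

            // Index the path as a single, unanalyzed term (case and separators preserved).
            IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), true);
            Document doc = new Document();
            doc.add(new Field("location", path, Field.Store.YES, Field.Index.UN_TOKENIZED));
            writer.addDocument(doc);
            writer.close();

            // Search with a TermQuery; no analyzer or QueryParser escaping is involved.
            IndexSearcher searcher = new IndexSearcher(indexDir);
            Hits hits = searcher.search(new TermQuery(new Term("location", path)));
            System.out.println("hits: " + hits.length());
            searcher.close();
        }
    }

If the field must also be reachable through the QueryParser, pairing it with a KeywordAnalyzer via PerFieldAnalyzerWrapper (as suggested later in this thread) keeps query-time analysis consistent with the untokenized field.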
Re: query analysis
tried MultiFieldQueryParser? Chris Lu --- Full-Text Lucene Search on Any Databases http://www.dbsight.net Faster to Setup than reading marketing materials! On 4/14/06, karl wettin <[EMAIL PROTECTED]> wrote: > Hello list, > > I want to know if a human written query passed through the > QueryParser is "clean" from fields, boolean clauses and query > indicators. Easy way out would of course to add a boolean that resets > at ReInit(), but maybe there is a smart way to do it. Perhaps it is > possible to treat the retuned Query as a composite pattern (i.e. > query.iterateNonRewrittenParts())? > > The plan is to avoid making suggestions on meta data in the query. > "+name:foo" should suggest on "foo" only, and not "+name:foo". I > initially tried to work the enumerateTerms, but realised this is > hopeless(?) as a rewritten query looks quite different. > > Perhaps I'm attacking this from the wrong angle? > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Syntax help
14 apr 2006 kl. 17.22 skrev karl wettin: It is the Analyzer that does that. Try creating your IndexSearcher with a KeywordAnalyzer (I think). err It is the Analyzer that does that. Try using a KeywordAnalyzer (I think). - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: query analysis
14 apr 2006 kl. 17.41 skrev Chris Lu: On 4/14/06, karl wettin <[EMAIL PROTECTED]> wrote: I want to know if a human written query passed through the QueryParser is "clean" from fields, boolean clauses and query indicators. Easy way out would of course to add a boolean that resets at ReInit(), but maybe there is a smart way to do it. Perhaps it is possible to treat the retuned Query as a composite pattern (i.e. query.iterateNonRewrittenParts())? tried MultiFieldQueryParser? How do you mean that it can help? I'm not sure if you understood my question or if MultiFieldQueryParser has some features I'm unaware of. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Using Lucene for searching tokens, not storing them.
use Store.NO when creating Field Chris Lu --- Full-Text Lucene Search on Any Databases http://www.dbsight.net Faster to Setup than reading marketing materials! On 4/14/06, karl wettin <[EMAIL PROTECTED]> wrote: > I would like to store all in my application rather than using the > Lucene persistency mechanism for tokens. I only want the search > mechanism. I do not need the IndexReader and IndexWriter as that will > be a natural part of my application. I only want to use the Searchable. > > So I looked at extending my own. Tried to follow the code. Is there > UML or something that describes the code and the process? Would very > much appreciate someone telling me what I need to do :-) > > Perhaps there is some implementation I should take a look at? > > Memory consumption is not an issue. What do I need to consider for > the CPU? > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Using Lucene for searching tokens, not storing them.
14 apr 2006 kl. 17.46 skrev Chris Lu: On 4/14/06, karl wettin <[EMAIL PROTECTED]> wrote: I would like to store everything in my application rather than using the Lucene persistency mechanism for tokens. I only want the search mechanism. I do not need the IndexReader and IndexWriter, as that will be a natural part of my application. I only want to use the Searchable. use Store.NO when creating Field You misunderstand all my questions. But it's OK. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Using Lucene for searching tokens, not storing them.
On 14 Apr 2006, at 08:51, karl wettin wrote: You missunderstand all my questions. I must admit I was not sure I understood your question, either. In order to search, Lucene needs an index. That index is maintained by the IndexReader and IndexWriter classes. Are you contemplating having your own index and index format? In that case, it's not clear to me how much leverage you will be getting using Lucene at all. Could you explain in more detail what you are trying to do? - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Using Lucene for searching tokens, not storing them.
14 apr 2006 kl. 17.51 skrev karl wettin: 14 apr 2006 kl. 17.46 skrev Chris Lu: On 4/14/06, karl wettin <[EMAIL PROTECTED]> wrote: I would like to store everything in my application rather than using the Lucene persistency mechanism for tokens. I only want the search mechanism. I do not need the IndexReader and IndexWriter, as that will be a natural part of my application. I only want to use the Searchable. use Store.NO when creating Field You misunderstand all my questions. I'll clarify though. I don't want to use Lucene for persistence. I do not want to store tokens nor field text in a FSDirectory or in a RAMDirectory. I want to store the tokens in my application. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Using Lucene for searching tokens, not storing them.
14 apr 2006 kl. 17.51 skrev Christophe: Are you contemplating having your own index and index format? In that case, it's not clear to me how much leverage you will be getting using Lucene at all. Could you explain in more detail what you are trying to do? I want to use the parts of Lucene built to query an index, not the part that persists an index. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Syntax help
oops, thought that you were just referring to the lowercase... :) On 4/14/06, karl wettin <[EMAIL PROTECTED]> wrote: > > > 14 apr 2006 kl. 17.22 skrev karl wettin: > > > > It is the Analyzer that does that. Try creating your IndexSearcher > > with a KeywordAnalyzer (it think). > > err > > It is the Analyzer that does that. Try using a KeywordAnalyzer (it > think). > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > >
Re: Using Lucene for searching tokens, not storing them.
On 14 Apr 2006, at 08:55, karl wettin wrote: I don't want to use Lucene for persistence. I do not want to store tokens nor field text in a FSDirectory or in a RAMDirectory. I want to take store the the tokens in my application. If I understand your question, I think that the first answer was exactly correct. You don't need to use Lucene for persistence in order to use it for searching. By setting the fields to be non-stored, Lucene only constructs the index for those fields, and doesn't save the full text of the field. For example, we store the text we are searching in an RDBMS, and only use Lucene for the full-text index. When we need to retrieve the actual document, we don't go to Lucene; we go to the RDBMS. This doesn't require any code changes at all; you just set the fields to non-stored when you index the documents. Lucene still does need an index, somewhere, in order to search, and Lucene manages the format of the index, so you will still need to use IndexWriter and IndexReader, and some Directory subclass, in order for Lucene to have a place to store its index. You can create a new flavor of Directory if you want Lucene to store its index files somewhere more exotic than the standard classes allow. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
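A small sketch of that pattern; the field names, the database key, and the RDBMS lookup are hypothetical. The text is indexed but not stored, and only a primary key is stored so hits can be resolved against the database:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class NonStoredIndexing {
        public static void index(IndexWriter writer, String dbId, String text) throws Exception {
            Document doc = new Document();
            // Full text is tokenized and searchable, but its bytes are not kept in the index.
            doc.add(new Field("contents", text, Field.Store.NO, Field.Index.TOKENIZED));
            // Only the database key is stored, so hits can be resolved against the RDBMS.
            doc.add(new Field("id", dbId, Field.Store.YES, Field.Index.UN_TOKENIZED));
            writer.addDocument(doc);
        }

        public static void search(IndexSearcher searcher, String queryText) throws Exception {
            Query q = new QueryParser("contents", new StandardAnalyzer()).parse(queryText);
            Hits hits = searcher.search(q);
            for (int i = 0; i < hits.length(); i++) {
                String dbId = hits.doc(i).get("id");
                // fetch the actual row/document from the RDBMS using dbId
            }
        }
    }

The index stays small because only the inverted terms and the short id field are written.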
Re: query analysis
Sorry, really misunderstood you. And you already know Lucene a lot. :) Basically you want to restore the original query from the Query object. But it may have already passed a lot of composition, like Boolean, Span, Wildcard. I don't feel it's possible to reconstruct the original human query. Chris Lu --- Full-Text Lucene Search on Any Databases http://www.dbsight.net Faster to Setup than reading marketing materials! On 4/14/06, karl wettin <[EMAIL PROTECTED]> wrote: > > 14 apr 2006 kl. 17.41 skrev Chris Lu: > > > > On 4/14/06, karl wettin <[EMAIL PROTECTED]> wrote: > > >> I want to know if a human written query passed through the > >> QueryParser is "clean" from fields, boolean clauses and query > >> indicators. Easy way out would of course to add a boolean that resets > >> at ReInit(), but maybe there is a smart way to do it. Perhaps it is > >> possible to treat the retuned Query as a composite pattern (i.e. > >> query.iterateNonRewrittenParts())? > > > tried MultiFieldQueryParser? > > How do you mean that it can help? I'm not sure if you understood my > question or if MultiFieldQueryParser has some features I'm unaware of. > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Using Lucene for searching tokens, not storing them.
14 apr 2006 kl. 17.56 skrev karl wettin: 14 apr 2006 kl. 17.51 skrev Christophe: Are you contemplating having your own index and index format? In that case, it's not clear to me how much leverage you will be getting using Lucene at all. Could you explain in more detail what you are trying to do? I want to use the parts of Lucene built to query an index, not the part that persists an index. Sorry for flooding. Here is a class diagram (view in a fixed-width font) of what I want to do: [MyTokenizedClass](field)-- {0..*} | {0..1} --[Token]<- - - <> - -[Searchable] | \---[Offset] I want to store all the tokens in the realm of my application. I do not want to use the IndexWriter to analyze and tokenize my fields. I do that myself. I only want the query mechanism of Lucene. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Using Lucene for searching tokens, not storing them.
Thanks, Christophe. Hi, Kevin, I think your question means you want to store the Analyzed tokens yourself? If so, you can use Analyzer to directly process the text, and save the analyzed results in your application, maybe later use it in some RDBMS? or BerkelyDB? Chris Lu --- Full-Text Lucene Search on Any Databases http://www.dbsight.net Faster to Setup than reading marketing materials! On 4/14/06, Christophe <[EMAIL PROTECTED]> wrote: > On 14 Apr 2006, at 08:55, karl wettin wrote: > > > I don't want to use Lucene for persistence. I do not want to store > > tokens nor field text in a FSDirectory or in a RAMDirectory. I want > > to take store the the tokens in my application. > > If I understand your question, I think that the first answer was > exactly correct. > > You don't need to use Lucene for persistence in order to use it for > searching. By setting the fields to be non-stored, Lucene only > constructs the index for those fields, and doesn't save the full text > of the field. For example, we store the text we are searching in an > RDBMS, and only use Lucene for the full-text index. When we need to > retrieve the actual document, we don't go to Lucene; we go to the RDBMS. > > This doesn't require any code changes at all; you just set the fields > to non-stored when you index the documents. > > Lucene still does need an index, somewhere, in order to search, and > Lucene manages the format of the index, so you will still need to use > IndexWriter and IndexReader, and some Directory subclass, in order > for Lucene to have a place to store its index. You can create a new > flavor of Directory if you want Lucene to store its index files > somewhere more exotic than the standard classes allow. > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Using Lucene for searching tokens, not storing them.
14 apr 2006 kl. 18.01 skrev Christophe: On 14 Apr 2006, at 08:55, karl wettin wrote: I don't want to use Lucene for persistence. I do not want to store tokens nor field text in a FSDirectory or in a RAMDirectory. I want to take store the the tokens in my application. If I understand your question, I think that the first answer was exactly correct. You don't need to use Lucene for persistence in order to use it for searching. By setting the fields to be non-stored, Lucene only constructs the index You speak of storing field values in the Lucene index. I speak of not using a Lucene index at all, to only use the query mechanism. All data Lucene need (the index) would be supplied from my application. Not from the Lucene Directory implementation. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Lucene probabilistic
Hi all, I came across an old mail list item from 2003 exploring the possibilities of a more probabilistic approach to using Lucene. Do the online experts know if anyone achieved this since? Thanks for any advice, Malc
Re: Using Lucene for searching tokens, not storing them.
karl wettin wrote: I would like to store all in my application rather than using the Lucene persistency mechanism for tokens. I only want the search mechanism. I do not need the IndexReader and IndexWriter as that will be a natural part of my application. I only want to use the Searchable. Implement the IndexReader API, overriding all of the abstract methods. That will enable you to search your index using Lucene's search code. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
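A rough skeleton of what Doug suggests, with heavy caveats: the TokenStore interface and all names are invented, only a subset of the abstract methods of the 1.9-era IndexReader is shown, and the class is left abstract because the remaining methods (terms(), termPositions(), norms(), getTermFreqVector(), doDelete(), doCommit(), doClose(), ...) still need real or stubbed implementations before an IndexSearcher can use it:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;

    // Sketch: expose an application-side token store through the IndexReader API so
    // Lucene's search code can run over it without any Directory or index files.
    public abstract class TokenStoreReader extends IndexReader {

        /** Hypothetical application-side token store. */
        public interface TokenStore {
            int documentCount();
            int documentFrequency(String field, String text);
            Document makeDocument(int n);
        }

        private final TokenStore store;

        protected TokenStoreReader(TokenStore store) {
            super(null);                  // no Directory; nothing is read from disk
            this.store = store;
        }

        public int numDocs() { return store.documentCount(); }
        public int maxDoc() { return store.documentCount(); }
        public boolean hasDeletions() { return false; }
        public boolean isDeleted(int n) { return false; }
        public Document document(int n) { return store.makeDocument(n); }
        public int docFreq(Term t) { return store.documentFrequency(t.field(), t.text()); }

        // termDocs() is the heart of it: it must iterate the doc numbers (and freqs)
        // for a term out of the application's own data structures.
        public abstract TermDocs termDocs();
    }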
Re: Boosting Fields (in index) or Queries
Wow, I finally found out why I was getting results in the wrong order - I got the results in the correct order from the Lucene index. I got the explanation of each of the results along with their database id and found the ordering mismatch. The problem is in the database call. I am calling: select * from product where id in (444, 333, 555, 888); and the ordering that comes back is not preserved. So the results are correct but the ordering and hence all of the relevancy is out the window. So that at least leads me to the actual problem. Now I have to figure out how I'll approach reordering the results because I don't believe that there's any way to force the ordering of a list and I don't want to call a separate database query for each id (lots of database round-trips). Thanks for the help Erik! On Apr 13, 2006, at 7:13 PM, Erik Hatcher wrote: On Apr 13, 2006, at 8:55 PM, Jeremy Hanna wrote: Looking at the results, the first document in the results should hopefully be near the bottom and the Explanation for this document has a Description/Details (using the toString() on the Explanation) of: product of: 0.0 = sum of: 0.0 = coord(0/7) So I'm kind of at a loss as to what's going on. Am I just doing something crazy weird in my code? I didn't find that many examples out there, so I'm kind of winging it according to what I've read in the javadocs and what examples I could find. Be sure to pass the document id, not the hit number, to explain(). Looks like you passed an id of an unmatched document. Erik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Using Lucene for searching tokens, not storing them.
14 apr 2006 kl. 18.31 skrev Doug Cutting: karl wettin wrote: I would like to store all in my application rather than using the Lucene persistency mechanism for tokens. I only want the search mechanism. I do not need the IndexReader and IndexWriter as that will be a natural part of my application. I only want to use the Searchable. Implement the IndexReader API, overriding all of the abstract methods. That will enable you to search your index using Lucene's search code. Aha, thanks. Do I have to worry about passing a null Directory to the default constructor? - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Boosting Fields (in index) or Queries
We tried two approaches: 1) Pull data from the db in arbitrary order and then sort in the application AFTER the retrieve. This will require two passes over the results. 2) Add an order by clause to the select. In Oracle, you could do something like "order by decode(444,1,333,2,555,3,888,4,...)". This will force the order you want in the query from the db. FWIW, after trying both of the above in production, we changed our strategy to avoid the db hit altogether, storing everything we needed for presentation within the Lucene index. We saw a net performance increase AND simpler code when we did this. -Mike -Original Message- From: Jeremy Hanna [mailto:[EMAIL PROTECTED] Sent: Fri 4/14/06 1:15 PM To: java-user@lucene.apache.org Cc: Subject:Re: Boosting Fields (in index) or Queries Wow, I finally found out why I was getting results in the wrong order - I got the results in the correct order from the Lucene index. I got the explanation of each of the results along with their database id and found the ordering mismatch. The problem is in the database call. I am calling: select * from product where id in (444, 333, 555, 888); and the ordering that comes back is not preserved. So the results are correct but the ordering and hence all of the relevancy is out the window. So that at least leads me to the actual problem. Now I have to figure out how I'll approach reordering the results because I don't believe that there's any way to force the ordering of a list and I don't want to call a separate database query for each id (lots of database round-trips). Thanks for the help Erik! On Apr 13, 2006, at 7:13 PM, Erik Hatcher wrote: > > On Apr 13, 2006, at 8:55 PM, Jeremy Hanna wrote: >> Looking at the results, the first document in the results should >> hopefully be near the bottom and the Explanation for this document >> has a Description/Details (using the toString() on the >> Explanation) of: >> >> product of: >> 0.0 = sum of: >> 0.0 = coord(0/7) >> >> So I'm kind of at a loss as to what's going on. Am I just doing >> something crazy weird in my code? I didn't find that many >> examples out there, so I'm kind of winging it according to what >> I've read in the javadocs and what examples I could find. > > Be sure to pass the document id, not the hit number, to explain(). > Looks like you passed an id of an unmatched document. > > Erik > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
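A sketch of approach (1), re-sorting in the application after the retrieve; the Row type and its getId() method are hypothetical stand-ins for whatever the ORM or JDBC layer returns:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Re-sort rows fetched with "WHERE id IN (...)" into the Lucene relevance order.
    public class HitOrderResort {

        /** Hypothetical row object; only the primary key is needed here. */
        public interface Row {
            Object getId();
        }

        public static List reorder(List luceneOrderedIds, List dbRows) {
            Map rowsById = new HashMap();
            for (int i = 0; i < dbRows.size(); i++) {
                Row r = (Row) dbRows.get(i);
                rowsById.put(r.getId(), r);
            }
            List result = new ArrayList(luceneOrderedIds.size());
            for (int i = 0; i < luceneOrderedIds.size(); i++) {
                Object row = rowsById.get(luceneOrderedIds.get(i));
                if (row != null) {
                    result.add(row);          // keeps Lucene's ranking
                }
            }
            return result;
        }
    }

Rows go into a map keyed by primary key and come back out in the order of the Lucene hit list, so relevance ranking survives the unordered IN (...) fetch.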
Re: Using Lucene for searching tokens, not storing them.
On 4/14/06, karl wettin <[EMAIL PROTECTED]> wrote: > Do I have to worry about passing a null Directory to the default > constructor? That's not an easy road you are trying to take, but it should be doable. There are some final methods you can't override, but just set directoryOwner=false and closeDirectory=false, and that code shoudn't touch the directory you set to null. -Yonik http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Syntax help
Something that took me a while to get was that the analyzer is important BOTH in the indexing phase and in the searching phase (assuming you're using the QueryParser). For your experiment, you probably want to use the WhitespaceAnalyzer. See page 119 of "Lucene in Action". The other three most-common analyzers divide text at non-letter characters, which will do bad things to your path names. Also note that you can use the PerFieldAnalyzerWrapper to use, say, the WhitespaceAnalyzer on the file-path field and other analyzers on other fields; you're not locked into using the same analyzer for all fields. Best Erick BTW, I really recommend a copy of "Lucene in Action".
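A sketch of that setup; the field names follow the thread, everything else is illustrative. The important part is that the very same wrapper is used both when indexing and when constructing the QueryParser:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;

    public class PathAnalyzerSetup {
        public static Analyzer buildAnalyzer() {
            // StandardAnalyzer for most fields, but whitespace-only tokenizing for paths
            PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer());
            analyzer.addAnalyzer("location", new WhitespaceAnalyzer());
            return analyzer;
        }

        public static Query parse(String userQuery) throws Exception {
            // Use the same analyzer at query time as at index time.
            return new QueryParser("contents", buildAnalyzer()).parse(userQuery);
        }
    }

Special characters such as ':' and '\' in the query string still have to be escaped for the QueryParser itself, but with WhitespaceAnalyzer on the location field they survive analysis afterwards.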
Re: Syntax help
On 4/14/06, Erick Erickson <[EMAIL PROTECTED]> wrote: > > Something that took me a while to get was that the analyzer is important > BOTH in the indexing phase and in the searching phase (assuming you're > using > the QueryParser). For you experiment, you probably want to use the > WhitespaceAnalyzer. See page 119 of "Lucene in Action". The other three most-common analyzers divide text at nonletter characters, > which will do bad things to your path names. > > Also note that you can use the PerFieldAnalyzerWrapper to use, say, the > WhitespaceAnalyzer on the file-path field and other analyzers on other > fields, you're not locked into using the same analyzer for all fields. > > Best > Erick > > > BTW, I really recommend a copy of "Lucene in Action".. PerFieldAnalyzerWrapper looks like what I want! I've heard nothing but good things about the book and will have to pick it up! Thanks for the help everyone!
Re: Boosting Fields (in index) or Queries
I would use a database function to force the ordering like the one your provided that works in Oracle, but it doesn't look like mysql 5 supports that. If anyone else knows of a way to force the ordering using mysql 5 queries, please respond. I think I'll just resort them when they get back though. Thanks! On Apr 14, 2006, at 11:39 AM, Bryzek.Michael wrote: We tried two approaches: 1) Pull data from the db in arbitrary order and then sort in the application AFTER the retrieve. This will require two passes over the results. 2) Add an order by clause to the select. In Oracle, you could do something like "order by decode(444,1,333,2,555,3,888,4,...)". This will force the order you want in the query from the db. FWIW, after trying both of the above in production, we changed our strategy to avoid the db hit altogether, storing everything we needed for presentation within the Lucene index. We saw a net performance increase AND simpler code when we did this. -Mike -Original Message- From: Jeremy Hanna [mailto:[EMAIL PROTECTED] Sent: Fri 4/14/06 1:15 PM To: java-user@lucene.apache.org Cc: Subject:Re: Boosting Fields (in index) or Queries Wow, I finally found out why I was getting results in the wrong order - I got the results in the correct order from the Lucene index. I got the explanation of each of the results along with their database id and found the ordering mismatch. The problem is in the database call. I am calling: select * from product where id in (444, 333, 555, 888); and the ordering that comes back is not preserved. So the results are correct but the ordering and hence all of the relevancy is out the window. So that at least leads me to the actual problem. Now I have to figure out how I'll approach reordering the results because I don't believe that there's any way to force the ordering of a list and I don't want to call a separate database query for each id (lots of database round-trips). Thanks for the help Erik! On Apr 13, 2006, at 7:13 PM, Erik Hatcher wrote: On Apr 13, 2006, at 8:55 PM, Jeremy Hanna wrote: Looking at the results, the first document in the results should hopefully be near the bottom and the Explanation for this document has a Description/Details (using the toString() on the Explanation) of: product of: 0.0 = sum of: 0.0 = coord(0/7) So I'm kind of at a loss as to what's going on. Am I just doing something crazy weird in my code? I didn't find that many examples out there, so I'm kind of winging it according to what I've read in the javadocs and what examples I could find. Be sure to pass the document id, not the hit number, to explain(). Looks like you passed an id of an unmatched document. Erik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Using Lucene for searching tokens, not storing them.
karl wettin wrote: Do I have to worry about passing a null Directory to the default constructor? A null Directory should not cause you problems. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Boosting Fields (in index) or Queries
Jeremy Hanna wrote: I would use a database function to force the ordering like the one your provided that works in Oracle, but it doesn't look like mysql 5 supports that. If anyone else knows of a way to force the ordering using mysql 5 queries, please respond. I think I'll just resort them when they get back though. If there's nothing in the relational table that specifies the ordering, I'm afraid you've probably got similar problems in other places. RDBMSes don't guarantee to return rows in the order they were INSERTed. Sure, early in the life of a table that will tend to happen, but as DELETEs, then UPDATEs and new INSERTs get processed, the on-disk order tends to get pretty jumbled. Note that I'm talking about anything that uses the results of your SELECT, not just your Lucene-related code. If ordering of the rows is something your app needs, I recommend adding a column that is expressly for ordering. A one-up integer or something like that. I don't remember what the keyword in MySQL is for that, but I'm pretty sure there is one. Then you can code all your SELECTs with an ORDER BY clause that does what you want. Good luck! --MDC - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Boosting Fields (in index) or Queries
I still have a similar problem with the boost factor. I change the name to have the AND operator and set that query's boost to a very high value in relation to the others. I also have a regular OR based name so that it doesn't rule those out. However whenever I change the boost values with the queries, nothing, absolutely nothing changes with the results. Besides that - I search for: playstation game. The only value that has both playstation and game in the name field is Hit number 20. That's really why I put the name AND operator in there with such a high boost value, to see if it would bring that single ANDed record towards the top, but nothing. Am I doing something wrong in all of this? Am I doing the boost wrong or something? On Apr 14, 2006, at 1:43 PM, Michael D. Curtin wrote: Jeremy Hanna wrote: I would use a database function to force the ordering like the one your provided that works in Oracle, but it doesn't look like mysql 5 supports that. If anyone else knows of a way to force the ordering using mysql 5 queries, please respond. I think I'll just resort them when they get back though. If there's nothing in the relational table that specifies the ordering, I'm afraid you've probably got similar problems in other places. RDBMSes don't guarantee to return rows in the order they were INSERTed. Sure, early in the life of a table that will tend to happen, but as DELETEs, then UPDATEs and new INSERTs get processed, the on-disk order tends to get pretty jumbled. Note that I'm talking about anything that uses the results of your SELECT, not just your Lucene-related code. If ordering of the rows is something your app needs, I recommend adding a column that is expressly for ordering. A one-up integer or something like that. I don't remember what the keyword in MySQL is for that, but I'm pretty sure there is one. Then you can code all your SELECTs with an ORDER BY clause that does what you want. Good luck! --MDC - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
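For reference, a sketch of one way to combine a heavily boosted all-words clause with a looser any-word clause and then inspect the scoring; the field name, boost value, and structure are illustrative, not taken from Jeremy's code:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class BoostedNameQuery {
        public static void search(IndexSearcher searcher, String words) throws Exception {
            QueryParser parser = new QueryParser("name", new StandardAnalyzer());

            parser.setDefaultOperator(QueryParser.AND_OPERATOR);
            Query allWordsInName = parser.parse(words);
            allWordsInName.setBoost(10f);               // documents matching every word rank higher

            parser.setDefaultOperator(QueryParser.OR_OPERATOR);
            Query anyWordInName = parser.parse(words);  // keeps partial matches in the result set

            BooleanQuery combined = new BooleanQuery();
            combined.add(allWordsInName, BooleanClause.Occur.SHOULD);
            combined.add(anyWordInName, BooleanClause.Occur.SHOULD);

            Hits hits = searcher.search(combined);
            for (int i = 0; i < Math.min(5, hits.length()); i++) {
                // explain() takes the internal document id, not the hit rank
                System.out.println(searcher.explain(combined, hits.id(i)));
            }
        }
    }

As Erik noted earlier in the thread, explain() wants the internal document id (hits.id(i)), not the hit rank; passing the rank is a common reason for seeing all-zero explanations.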
Re: Lucene Searches VS. Relational database Queries
I'm the co-worker who suggested this to Ananth (I think we have been debating this for 3 days now; from the post it seems he is winning :)... ). Anyway, as Ananth stated, I suggested this because I am wondering if Lucene could solve a bottleneck query that is taking a deathly long time to complete (read-only). The original design actually generated 60+ threaded queries on the database to return results per user thread that hits our website for this view... I know that this will kill our server when user load increases... I know that Lucene is built for speed and can handle a very large number of people searching (we are using a singleton Searcher), and the (threaded) results will be the "hits" returned from Lucene. Also, this query will NOT be executed by any user in a text field, but rather in our application code, only when the user selects different parts of the site... If all values in this 1:n relationship are in the Lucene index we are querying, then the "application-provided" query will return accurate results. We are using Quartz, and not creating threads in servlets... FINAL SOLUTION MAYBE?: if our client EVER gives us a requirement that says we must have accurate text searching even where our index has a 1:n relationship like "Jason" and "Jason Black", then we should just simply say we cannot implement this, because a Lucene search will yield inaccurate results. Correct??? Comments? -- View this message in context: http://www.nabble.com/Lucene-Seaches-VS.-Relational-database-Queries-t1434583.html#a3925693 Sent from the Lucene - Java Users forum at Nabble.com. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Catching BooleanQuery.TooManyClauses
Hi Lucene Users, I would like to catch the BooleanQuery.TooManyClauses exception for certain wildcard searches and display a 'subset' of results. I have used the WildcardTermEnum to give me the first X documents matching the wildcard query. Below is the code I use to implement the solution. Setting performance concerns aside, is this the best solution? Or should I just tell the user to refine their query!? Thanks Ben

= QueryParserTest.java
...
public class QueryParserTest extends LuceneTestCase {
...
    private static int MAX_HITS = 10;

    public void testCatchTooManyClauses() throws Exception {
        reader = IndexReader.open(directory);
        String queryStr = "9*";
        String field = "PART_NBR";
        Hits hits = null;
        Vector docList;
        try {
            System.out.println("query: " + queryStr);
            System.out.println("field: " + field);
            hits = searcher.search(parser.parse(field + ":" + queryStr));
            docList = new Vector(hits.length());
            Iterator docListIt = hits.iterator();
            while (docListIt.hasNext())
                docList.add(((Hit) docListIt.next()).getDocument());
        } catch (BooleanQuery.TooManyClauses ex) {
            System.out.println("catch BooleanQuery.TooManyClauses, refining query");
            Term term = new Term(field, queryStr);
            WildcardTermEnum wte = new WildcardTermEnum(reader, term);
            int cnt = 0;
            docList = new Vector(MAX_HITS);
            while (wte.next() && cnt++ < MAX_HITS) {
                term = wte.term();
                TermQuery query = new TermQuery(new Term(field, term.text()));
                System.out.println("search for " + query.getTerm().text());
                hits = searcher.search(query);
                Iterator docListIt = hits.iterator();
                while (docListIt.hasNext())
                    docList.add(((Hit) docListIt.next()).getDocument());
            }
        }
        System.out.println("found:" + docList.size());
    }
...
= QueryParserTest.java

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
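One alternative worth noting (not from Ben's mail): the clause limit that triggers the exception is a global setting, so for moderately larger expansions it can simply be raised, at the cost of more memory and time per expanded wildcard query:

    import org.apache.lucene.search.BooleanQuery;

    public class ClauseLimit {
        public static void raiseLimit() {
            // The default is 1024; a larger limit trades memory and query time for coverage.
            BooleanQuery.setMaxClauseCount(8192);
        }
    }

Ben's WildcardTermEnum approach is still the safer choice when an expansion could be huge, since it caps the work instead of merely moving the ceiling.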