Re: Lucene Searches VS. Relational database Queries
Gentlemen,

A join-like operation between Lucene indexes can be done with (at least) reasonable performance by using a few standard methods from RDBs: sort before going to disk, and cache whenever possible. The steps are:

- Query the first Lucene index with the low-level search API to get the Lucene doc numbers (using HitCollector or TermDocs).
- Retrieve the key field values for the second index from the first, in doc number order. This step performs better when less data is stored in the first index. This is normally the most performance-critical step. (IndexReader.document(n))
- Sort these key values and use them, again with the low-level API, to get the doc numbers for the second Lucene index (using TermDocs).
- Build a Filter for the second index from these doc numbers; this step usually implies some sorting of document numbers, for example by collecting them in a BitSet.
- Use this filter for a text search in the second index.

On Friday 14 April 2006 00:58, Ananth T. Sarathy wrote:
> Erick,
> Don't get me wrong. I agree with you 100 percent on everything you just said, and have been advocating what you are saying. I turned to the forum to get other people's thoughts on the issue, feeling that my perspective may be a little warped, and wanted to see what the community thinks. I think there is a performance issue with our DB that I have never experienced in any other project I have worked on, which needs someone with more specific domain knowledge to fix. I think Lucene is fantastic for what we are already using it for (searching contents of HTML, colliding the values of database rows to make them free text searchable). We have been using it for over 2 years, and with very good results (once we got the hang of it).
>
> I for one think that native language searches are fundamentally different than discrete database queries; I am just having a problem trying to explain this to some of the people on my team, and wanted to see if there were other POVs out there.

The first step above can start from the results of an RDB query. Usually, the last text search step is more interactive (fundamentally different?) than earlier steps, so a filter is used to cache the join result. If the join needs to be changed slightly, it is also quite effective to cache the retrieved key values from the first index and the retrieved fields from the second index.

For the last step (and earlier ones), when two successive searches retrieve a somewhat overlapping document set, one might also want to avoid using the Hits class, because it only caches results for a single search. Instead, some LRU caches for retrieved documents and for filters can be quite effective. The caches can have the index version in their keys to keep things in sync.

Enough RAM should be available so that the indexes can be accessed without alternating between them. Also, the disk head should not do anything else while it is working through the sorted inputs, to minimize the total seek time. When the filters start taking too much RAM, have a look here: http://issues.apache.org/jira/browse/LUCENE-328

Regards,
Paul Elschot

> Ananth
>
> On 4/13/06, Erick Erickson <[EMAIL PROTECTED]> wrote:
> >
> > On 4/13/06, Ananth T. Sarathy <[EMAIL PROTECTED]> wrote:
> > >
> > > No, we do have drop-down selects that would allow for the substitution, but we also have free text fields to allow the user to search.
That > > solution > > > would I think work for the DB query replacement, but you would need a > > > regular non underscored field to allow for free text. > > > > > > > > Well, as I say, you've solved that problem already. Somewhere, somehow, > > you > > have to decide what to do with the "free text" data. Somewhere, somehow, > > you've got to decide whether "stunt director trainee" means "stunt > > director" > > + trainee, stunt + "director trainee", or stunt + director + trainee. Or > > else you can't form your SQL in the first place. And the query doesn't > > produce reasonable results if you *do* form the query. > > > > If you can form your SQL with distinct "Title = 'blah'" clauses, you can > > substitute underscores for spaces in the terms. If you can do that, you > > can > > ask Lucene to find the terms you indexed with underscores. And if you > > can't > > form your SQL queries in the first place, the question is irrelevant. > > > > All that said, perhaps a better question is "why is your SQL slow?". > > Relational databases are really good at this sort of thing. Many smart > > people have put many, many developer years into making relational > > databases > > deal with joins efficiently. Assuming you have the proper indexes etc. > > > > As much as I've been impressed with Lucene, I have to ask whether it's > > relevant to your problem. I have no clue what database you're using, how > > it's set up, or whether the examples you've given are simplified enough > > that > > I don't understand what the *real* pr
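A minimal sketch of the join steps Paul describes, against the Lucene 1.9/2.0-era low-level API. The "key" field name, the firstDocs BitSet (e.g. filled by a HitCollector on the first query), and the helper class are illustrative assumptions, not code from the thread:

    import java.util.Arrays;
    import java.util.BitSet;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.search.Filter;

    // Sketch: join firstIndex -> secondIndex on a "key" field stored in the first index.
    public class IndexJoin {
        public static Filter joinFilter(IndexReader first, BitSet firstDocs,
                                        IndexReader second) throws Exception {
            // 1. Retrieve key values from the first index in doc number order.
            String[] keys = new String[firstDocs.cardinality()];
            int k = 0;
            for (int doc = firstDocs.nextSetBit(0); doc >= 0; doc = firstDocs.nextSetBit(doc + 1)) {
                Document d = first.document(doc);   // the performance-critical step
                keys[k++] = d.get("key");
            }
            // 2. Sort the keys so the second index's term dictionary is read sequentially.
            Arrays.sort(keys);
            // 3. Collect the matching doc numbers of the second index in a BitSet.
            final BitSet bits = new BitSet(second.maxDoc());
            TermDocs td = second.termDocs();
            for (int i = 0; i < keys.length; i++) {
                td.seek(new Term("key", keys[i]));
                while (td.next()) {
                    bits.set(td.doc());
                }
            }
            td.close();
            // 4. Wrap the BitSet in a Filter for use with a later text search.
            return new Filter() {
                public BitSet bits(IndexReader reader) {
                    return bits;
                }
            };
        }
    }

The returned Filter can then be passed to Searcher.search(textQuery, filter) against the second index, and cached between searches as described above.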
Max Frequency and Tf/Idf
Hello everybody. We are building a complex automatic classification system using Lucene. We need to manage normalized Tf/Idf (Term Frequency / Inverse Document Frequency). We understand that Lucene can give us Tf and Df, and we are using these values to calculate the normalized Tf/Idf, but we would like to optimize this calculation for better performance. Is there any way to expose the maximum term frequency in a document from Lucene, and maybe to obtain the normalized Tf/Idf from Lucene directly? There are no public methods to get these values, but maybe Lucene holds this information privately, and with a modification to the Lucene source we could speed up the system. P.S. Sorry for MY English: I hope I explained my question clearly. 1000 KBye [) /\ |\| | |_ () web: www.ciconet.it Web Portal Now: www.webportalnow.com - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Max Frequency and Tf/Idf
The Term Vector code can be used to get the term frequencies from a specific document. Search this list, see the Lucene In Action book or look at http://www.cnlp.org/apachecon2005 for examples on how to use Term Vectors Danilo Cicognani wrote: Hello everybody. We are building a complex automatic classification system using Lucene. We need to manage normalized Tf/Idf (Term Frequency / Inverse Document Frequency). We understood that Lucene can give us Tf and Df and we are using these values to calculate the normalized Tf/Idf but we would like to optimize this calculation for better performance. Is there any way to expose the maximum term frequency in a document from Lucene, and maybe to obtain the normalized Tf/Idf from Lucene? There aren't a public methods to get these values, but maybe Lucene holds these informations privately and with a modify on Lucene source we could have the work done to fasten the system. P.S. Sorry for MY English: I hope I explained clearly my question. 1000 KBye [) /\ |\| | |_ () web: www.ciconet.it Web Portal Now: www.webportalnow.com - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] -- Grant Ingersoll Sr. Software Engineer Center for Natural Language Processing Syracuse University School of Information Studies 335 Hinds Hall Syracuse, NY 13244 http://www.cnlp.org Voice: 315-443-5484 Fax: 315-443-6886 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
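A small sketch of that approach, assuming the field (here called "contents") was indexed with Field.TermVector.YES; the class and field names are illustrative only:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermFreqVector;

    // Compute the maximum term frequency within one document's "contents" field.
    public class MaxTfExample {
        public static int maxTermFreq(IndexReader reader, int docNum) throws Exception {
            TermFreqVector tfv = reader.getTermFreqVector(docNum, "contents");
            if (tfv == null) {
                return 0;   // field was not indexed with term vectors
            }
            int[] freqs = tfv.getTermFrequencies();
            int max = 0;
            for (int i = 0; i < freqs.length; i++) {
                if (freqs[i] > max) {
                    max = freqs[i];
                }
            }
            return max;
        }
    }

The maximum frequency gives the denominator for a normalized tf, and idf can be computed from IndexReader.docFreq(term) and IndexReader.numDocs().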
query analysis
Hello list, I want to know if a human-written query passed through the QueryParser is "clean", i.e. free of fields, boolean clauses and query indicators. The easy way out would of course be to add a boolean flag that resets at ReInit(), but maybe there is a smarter way to do it. Perhaps it is possible to treat the returned Query as a composite pattern (i.e. query.iterateNonRewrittenParts())? The plan is to avoid making suggestions on metadata in the query: "+name:foo" should suggest on "foo" only, and not on "+name:foo". I initially tried to work with enumerateTerms, but realised this is hopeless(?) as a rewritten query looks quite different. Perhaps I'm attacking this from the wrong angle? - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Syntax help
Hello, We are using Lucene to facilitate searching of our application's log files. I am noticing some inconsistencies in result sets when searching on certain fields. One field we index is the file path. I am using a simple query like "location:Z:\logs\someLogFile.log". However, I can never get path searches like this to come back with any results. I tried escaping the backslashes and colon. Nothing seems to work. Am I missing something here in my syntax? We also index the file name. However, on file names that have mixed case or multiple extensions (logfile.D20060303.T234234) I cannot get results either. Weird. I haven't worked with Lucene very long, so I expect I am missing something simple here. If you need more info, let me know! Many Thanks! --Bill
Re: Syntax help
It would be helpful to download Luke (http://www.getopt.org/luke/) and analyze whats getting indexed. Have you tried that? On 4/14/06, Bill Snyder <[EMAIL PROTECTED]> wrote: > > Hello, > > We am using Lucene to facilitate searching of our applications log files. > I > am noticing some inconsistencies in result sets when searching on certain > fields. > > One field we index is the file path. I am using a simple query like > "location:Z:\logs\someLogFile.log". However, I can never get path searches > like this to come back with any results. Tried escaping the backslashes > and > colon. Nothing seems to work. I missing something here in my syntax? > > We also index the file name. However, on file names that have mixed case > or > multiple extensions (logfile.D20060303.T234234) I cannot get results > either. > Weird. > > I haven't worked with Lucene very long, so I expect I am missing something > simple here. > > If you need more info, let me know! > Many Thanks! > > --Bill > >
Re: Syntax help
14 apr 2006 kl. 16.37 skrev Bill Snyder: One field we index is the file path. I am using a simple query like "location:Z:\logs\someLogFile.log". However, I can never get path searches like this to come back with any results. I tried escaping the backslashes and colon. Nothing seems to work. Am I missing something here in my syntax? Can you open your index with Luke and see what the index looks like? If it looks right, what does the code look like that retrieves the field value? If not, what does the code look like that sets the field value? In case everything seems fine, do some debugging and report what values you send to Lucene and what you get out. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Syntax help
Oh, cool. Look at that. A neat tool made with thinlets. I had not heard of this...I'll see if it helps me figure out whats going on. --Bill On 4/14/06, Rajesh Munavalli <[EMAIL PROTECTED]> wrote: > > It would be helpful to download Luke (http://www.getopt.org/luke/) and > analyze whats getting indexed. Have you tried that? > > On 4/14/06, Bill Snyder <[EMAIL PROTECTED]> wrote: > > > > Hello, > > > > We am using Lucene to facilitate searching of our applications log > files. > > I > > am noticing some inconsistencies in result sets when searching on > certain > > fields. > > > > One field we index is the file path. I am using a simple query like > > "location:Z:\logs\someLogFile.log". However, I can never get path > searches > > like this to come back with any results. Tried escaping the backslashes > > and > > colon. Nothing seems to work. I missing something here in my syntax? > > > > We also index the file name. However, on file names that have mixed case > > or > > multiple extensions (logfile.D20060303.T234234) I cannot get results > > either. > > Weird. > > > > I haven't worked with Lucene very long, so I expect I am missing > something > > simple here. > > > > If you need more info, let me know! > > Many Thanks! > > > > --Bill > > > > > >
Re: Syntax help
AHA! I am using the Search tab and have entered the query: location:Z:\install\logs\archive.log.D20060406.T141958 and the query details say it was parsed to location:z. If I escape the colon, I see the new parsed query as location:"z installlogsarchive.log.d20060406.t141958". So, Lucene does not store the file path exactly?! It converts it all to lower case! Is there some property I should turn on? Plus, it is not storing the backslash. Should I be escaping these in the index before storing them? It seems so. -Bill On 4/14/06, Bill Snyder <[EMAIL PROTECTED]> wrote: > > Oh, cool. Look at that. A neat tool made with thinlets. I had not heard of > this...I'll see if it helps me figure out whats going on. > > --Bill > > > On 4/14/06, Rajesh Munavalli <[EMAIL PROTECTED]> wrote: > > > > It would be helpful to download Luke (http://www.getopt.org/luke/) and > > analyze whats getting indexed. Have you tried that? > > > > On 4/14/06, Bill Snyder < [EMAIL PROTECTED]> wrote: > > > > > > Hello, > > > > > > We am using Lucene to facilitate searching of our applications log > > files. > > > I > > > am noticing some inconsistencies in result sets when searching on > > certain > > > fields. > > > > > > One field we index is the file path. I am using a simple query like > > > "location:Z:\logs\someLogFile.log". However, I can never get path > > searches > > > like this to come back with any results. Tried escaping the > > backslashes > > > and > > > colon. Nothing seems to work. I missing something here in my syntax? > > > > > > We also index the file name. However, on file names that have mixed > > case > > > or > > > multiple extensions (logfile.D20060303.T234234 ) I cannot get results > > > either. > > > Weird. > > > > > > I haven't worked with Lucene very long, so I expect I am missing > > something > > > simple here. > > > > > > If you need more info, let me know! > > > Many Thanks! > > > > > > --Bill > > > > > > > > > > >
Re: Syntax help
14 apr 2006 kl. 17.11 skrev Bill Snyder: so if I escape the colon I see the new parsed query as location:"z installlogsarchive.log.d20060406.t141958" So, Lucene does not store the file path exactly?! It converts it all to lower case! Is there some property I should turn on? It is the Analyzer that does that. Try creating your IndexSearcher with a KeywordAnalyzer (I think). - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Syntax help
On 4/14/06, Bill Snyder <[EMAIL PROTECTED]> wrote: > > AHA! I am using the Search tab and have enteres the query : > > location:Z:\install\logs\archive.log.D20060406.T141958 > > the query details says the query was parsed to > > location:z > > so if I escape the colon I see the new parsed query as > > location:"z installlogsarchive.log.d20060406.t141958" > > So, lucence does not store the file path exactly?! It converts it all > lower > case! Is there some property I should turn on? In the StandardAnalyzer, the LowerCaseFilter converts everything into lower case. You can skip that step. Plus, it is not storing the backslash. Should I be escaping these in the > index before storing them? It seems so. Yes -Bill On 4/14/06, Bill Snyder <[EMAIL PROTECTED]> wrote: > > Oh, cool. Look at that. A neat tool made with thinlets. I had not heard of > this...I'll see if it helps me figure out whats going on. > > --Bill > > > On 4/14/06, Rajesh Munavalli <[EMAIL PROTECTED]> wrote: > > > > It would be helpful to download Luke (http://www.getopt.org/luke/) and > > analyze whats getting indexed. Have you tried that? > > > > On 4/14/06, Bill Snyder < [EMAIL PROTECTED]> wrote: > > > > > > Hello, > > > > > > We am using Lucene to facilitate searching of our applications log > > files. > > > I > > > am noticing some inconsistencies in result sets when searching on > > certain > > > fields. > > > > > > One field we index is the file path. I am using a simple query like > > > "location:Z:\logs\someLogFile.log". However, I can never get path > > searches > > > like this to come back with any results. Tried escaping the > > backslashes > > > and > > > colon. Nothing seems to work. I missing something here in my syntax? > > > > > > We also index the file name. However, on file names that have mixed > > case > > > or > > > multiple extensions (logfile.D20060303.T234234 ) I cannot get results > > > either. > > > Weird. > > > > > > I haven't worked with Lucene very long, so I expect I am missing > > something > > > simple here. > > > > > > If you need more info, let me know! > > > Many Thanks! > > > > > > --Bill > > > > > > > > > > >
Using Lucene for searching tokens, not storing them.
I would like to store everything in my application rather than using the Lucene persistency mechanism for tokens. I only want the search mechanism. I do not need the IndexReader and IndexWriter, as that will be a natural part of my application. I only want to use the Searchable. So I looked at implementing my own and tried to follow the code. Is there UML or something that describes the code and the process? I would very much appreciate someone telling me what I need to do :-) Perhaps there is some implementation I should take a look at? Memory consumption is not an issue. What do I need to consider for the CPU? - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Syntax help
Thanks! OK, how do I get the file separator to be part of the term? Luke shows the parsed query as ignoring the file separator. so location:Z\:\\/install/logs\\jetspeedservices.log becomes location:"z install logs jetspeedservices.log" --Bill On 4/14/06, Rajesh Munavalli <[EMAIL PROTECTED]> wrote: > > On 4/14/06, Bill Snyder <[EMAIL PROTECTED]> wrote: > > > > AHA! I am using the Search tab and have enteres the query : > > > > location:Z:\install\logs\archive.log.D20060406.T141958 > > > > the query details says the query was parsed to > > > > location:z > > > > so if I escape the colon I see the new parsed query as > > > > location:"z installlogsarchive.log.d20060406.t141958" > > > > So, lucence does not store the file path exactly?! It converts it all > > lower > > case! Is there some property I should turn on? > > > In the StandardAnalyzer, the LowerCaseFilter converts everything into > lower > case. You can skip that step. > > Plus, it is not storing the backslash. Should I be escaping these in the > > index before storing them? It seems so. > > Yes > > -Bill > > On 4/14/06, Bill Snyder <[EMAIL PROTECTED]> wrote: > > > > Oh, cool. Look at that. A neat tool made with thinlets. I had not heard > of > > this...I'll see if it helps me figure out whats going on. > > > > --Bill > > > > > > On 4/14/06, Rajesh Munavalli <[EMAIL PROTECTED]> wrote: > > > > > > It would be helpful to download Luke (http://www.getopt.org/luke/) and > > > analyze whats getting indexed. Have you tried that? > > > > > > On 4/14/06, Bill Snyder < [EMAIL PROTECTED]> wrote: > > > > > > > > Hello, > > > > > > > > We am using Lucene to facilitate searching of our applications log > > > files. > > > > I > > > > am noticing some inconsistencies in result sets when searching on > > > certain > > > > fields. > > > > > > > > One field we index is the file path. I am using a simple query like > > > > "location:Z:\logs\someLogFile.log". However, I can never get path > > > searches > > > > like this to come back with any results. Tried escaping the > > > backslashes > > > > and > > > > colon. Nothing seems to work. I missing something here in my syntax? > > > > > > > > We also index the file name. However, on file names that have mixed > > > case > > > > or > > > > multiple extensions (logfile.D20060303.T234234 ) I cannot get > results > > > > either. > > > > Weird. > > > > > > > > I haven't worked with Lucene very long, so I expect I am missing > > > something > > > > simple here. > > > > > > > > If you need more info, let me know! > > > > Many Thanks! > > > > > > > > --Bill > > > > > > > > > > > > > > > > > >
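One common way to get exact, case- and separator-preserving matches on a path field (a sketch, not something prescribed in the thread; the paths and index location are made up): index the field untokenized and query it with a TermQuery, so no analyzer touches it.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;

    public class ExactPathField {
        public static void main(String[] args) throws Exception {
            String indexDir = "C:/tmp/logindex";                     // hypothetical location
            String path = "Z:\\install\\logs\\jetspeedservices.log";

            // Index the path as a single, unanalyzed term (case and separators preserved).
            IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), true);
            Document doc = new Document();
            doc.add(new Field("location", path, Field.Store.YES, Field.Index.UN_TOKENIZED));
            writer.addDocument(doc);
            writer.close();

            // Search with a TermQuery; no analyzer or QueryParser escaping is involved.
            IndexSearcher searcher = new IndexSearcher(indexDir);
            Hits hits = searcher.search(new TermQuery(new Term("location", path)));
            System.out.println("hits: " + hits.length());
            searcher.close();
        }
    }

If the field must also be reachable through the QueryParser, pairing it with a KeywordAnalyzer via PerFieldAnalyzerWrapper (as suggested later in this thread) keeps query-time analysis consistent with the untokenized field.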
Re: query analysis
tried MultiFieldQueryParser? Chris Lu --- Full-Text Lucene Search on Any Databases http://www.dbsight.net Faster to Setup than reading marketing materials! On 4/14/06, karl wettin <[EMAIL PROTECTED]> wrote: > Hello list, > > I want to know if a human written query passed through the > QueryParser is "clean" from fields, boolean clauses and query > indicators. Easy way out would of course to add a boolean that resets > at ReInit(), but maybe there is a smart way to do it. Perhaps it is > possible to treat the retuned Query as a composite pattern (i.e. > query.iterateNonRewrittenParts())? > > The plan is to avoid making suggestions on meta data in the query. > "+name:foo" should suggest on "foo" only, and not "+name:foo". I > initially tried to work the enumerateTerms, but realised this is > hopeless(?) as a rewritten query looks quite different. > > Perhaps I'm attacking this from the wrong angle? > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Syntax help
14 apr 2006 kl. 17.22 skrev karl wettin: It is the Analyzer that does that. Try creating your IndexSearcher with a KeywordAnalyzer (I think). err It is the Analyzer that does that. Try using a KeywordAnalyzer (I think). - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: query analysis
14 apr 2006 kl. 17.41 skrev Chris Lu: On 4/14/06, karl wettin <[EMAIL PROTECTED]> wrote: I want to know if a human written query passed through the QueryParser is "clean" from fields, boolean clauses and query indicators. Easy way out would of course to add a boolean that resets at ReInit(), but maybe there is a smart way to do it. Perhaps it is possible to treat the retuned Query as a composite pattern (i.e. query.iterateNonRewrittenParts())? tried MultiFieldQueryParser? How do you mean that it can help? I'm not sure if you understood my question or if MultiFieldQueryParser has some features I'm unaware of. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Using Lucene for searching tokens, not storing them.
use Store.NO when creating Field Chris Lu --- Full-Text Lucene Search on Any Databases http://www.dbsight.net Faster to Setup than reading marketing materials! On 4/14/06, karl wettin <[EMAIL PROTECTED]> wrote: > I would like to store all in my application rather than using the > Lucene persistency mechanism for tokens. I only want the search > mechanism. I do not need the IndexReader and IndexWriter as that will > be a natural part of my application. I only want to use the Searchable. > > So I looked at extending my own. Tried to follow the code. Is there > UML or something that describes the code and the process? Would very > much appreciate someone telling me what I need to do :-) > > Perhaps there is some implementation I should take a look at? > > Memory consumption is not an issue. What do I need to consider for > the CPU? > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Using Lucene for searching tokens, not storing them.
14 apr 2006 kl. 17.46 skrev Chris Lu: On 4/14/06, karl wettin <[EMAIL PROTECTED]> wrote: I would like to store everything in my application rather than using the Lucene persistency mechanism for tokens. I only want the search mechanism. I do not need the IndexReader and IndexWriter, as that will be a natural part of my application. I only want to use the Searchable. use Store.NO when creating Field You misunderstand all my questions. But it's OK. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Using Lucene for searching tokens, not storing them.
On 14 Apr 2006, at 08:51, karl wettin wrote: You missunderstand all my questions. I must admit I was not sure I understood your question, either. In order to search, Lucene needs an index. That index is maintained by the IndexReader and IndexWriter classes. Are you contemplating having your own index and index format? In that case, it's not clear to me how much leverage you will be getting using Lucene at all. Could you explain in more detail what you are trying to do? - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Using Lucene for searching tokens, not storing them.
14 apr 2006 kl. 17.51 skrev karl wettin: 14 apr 2006 kl. 17.46 skrev Chris Lu: On 4/14/06, karl wettin <[EMAIL PROTECTED]> wrote: I would like to store everything in my application rather than using the Lucene persistency mechanism for tokens. I only want the search mechanism. I do not need the IndexReader and IndexWriter, as that will be a natural part of my application. I only want to use the Searchable. use Store.NO when creating Field You misunderstand all my questions. I'll clarify though. I don't want to use Lucene for persistence. I do not want to store tokens nor field text in a FSDirectory or in a RAMDirectory. I want to store the tokens in my application. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Using Lucene for searching tokens, not storing them.
14 apr 2006 kl. 17.51 skrev Christophe: Are you contemplating having your own index and index format? In that case, it's not clear to me how much leverage you will be getting using Lucene at all. Could you explain in more detail what you are trying to do? I want to use the parts of Lucene built to query an index, not the part that persists an index. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Syntax help
oops, thought that you were just referring to the lowercase... :) On 4/14/06, karl wettin <[EMAIL PROTECTED]> wrote: > > > 14 apr 2006 kl. 17.22 skrev karl wettin: > > > > It is the Analyzer that does that. Try creating your IndexSearcher > > with a KeywordAnalyzer (it think). > > err > > It is the Analyzer that does that. Try using a KeywordAnalyzer (it > think). > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > >
Re: Using Lucene for searching tokens, not storing them.
On 14 Apr 2006, at 08:55, karl wettin wrote: I don't want to use Lucene for persistence. I do not want to store tokens nor field text in a FSDirectory or in a RAMDirectory. I want to take store the the tokens in my application. If I understand your question, I think that the first answer was exactly correct. You don't need to use Lucene for persistence in order to use it for searching. By setting the fields to be non-stored, Lucene only constructs the index for those fields, and doesn't save the full text of the field. For example, we store the text we are searching in an RDBMS, and only use Lucene for the full-text index. When we need to retrieve the actual document, we don't go to Lucene; we go to the RDBMS. This doesn't require any code changes at all; you just set the fields to non-stored when you index the documents. Lucene still does need an index, somewhere, in order to search, and Lucene manages the format of the index, so you will still need to use IndexWriter and IndexReader, and some Directory subclass, in order for Lucene to have a place to store its index. You can create a new flavor of Directory if you want Lucene to store its index files somewhere more exotic than the standard classes allow. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
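A small sketch of that pattern; the field names, the database key, and the RDBMS lookup are hypothetical. The text is indexed but not stored, and only a primary key is stored so hits can be resolved against the database:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class NonStoredIndexing {
        public static void index(IndexWriter writer, String dbId, String text) throws Exception {
            Document doc = new Document();
            // Full text is tokenized and searchable, but its bytes are not kept in the index.
            doc.add(new Field("contents", text, Field.Store.NO, Field.Index.TOKENIZED));
            // Only the database key is stored, so hits can be resolved against the RDBMS.
            doc.add(new Field("id", dbId, Field.Store.YES, Field.Index.UN_TOKENIZED));
            writer.addDocument(doc);
        }

        public static void search(IndexSearcher searcher, String queryText) throws Exception {
            Query q = new QueryParser("contents", new StandardAnalyzer()).parse(queryText);
            Hits hits = searcher.search(q);
            for (int i = 0; i < hits.length(); i++) {
                String dbId = hits.doc(i).get("id");
                // fetch the actual row/document from the RDBMS using dbId
            }
        }
    }

The index stays small because only the inverted terms and the short id field are written.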
Re: query analysis
Sorry, really misunderstood you. And you already know Lucene a lot. :) Basically you want to restore the original query from the Query object. But it may have already passed a lot of composition, like Boolean, Span, Wildcard. I don't feel it's possible to reconstruct the original human query. Chris Lu --- Full-Text Lucene Search on Any Databases http://www.dbsight.net Faster to Setup than reading marketing materials! On 4/14/06, karl wettin <[EMAIL PROTECTED]> wrote: > > 14 apr 2006 kl. 17.41 skrev Chris Lu: > > > > On 4/14/06, karl wettin <[EMAIL PROTECTED]> wrote: > > >> I want to know if a human written query passed through the > >> QueryParser is "clean" from fields, boolean clauses and query > >> indicators. Easy way out would of course to add a boolean that resets > >> at ReInit(), but maybe there is a smart way to do it. Perhaps it is > >> possible to treat the retuned Query as a composite pattern (i.e. > >> query.iterateNonRewrittenParts())? > > > tried MultiFieldQueryParser? > > How do you mean that it can help? I'm not sure if you understood my > question or if MultiFieldQueryParser has some features I'm unaware of. > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Using Lucene for searching tokens, not storing them.
14 apr 2006 kl. 17.56 skrev karl wettin: 14 apr 2006 kl. 17.51 skrev Christophe: Are you contemplating having your own index and index format? In that case, it's not clear to me how much leverage you will be getting using Lucene at all. Could you explain in more detail what you are trying to do? I want to use the parts of Lucene built to query an index, not the part that persists an index. Sorry for flooding. Here is a class diagram (view in a fixed-width font) of what I want to do: [MyTokenizedClass](field)-- {0..*} | {0..1} --[Token]<- - - <> - -[Searchable] | \---[Offset] I want to store all the tokens in the realm of my application. I do not want to use the IndexWriter to analyze and tokenize my fields. I do that myself. I only want the query mechanism of Lucene. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Using Lucene for searching tokens, not storing them.
Thanks, Christophe. Hi, Kevin, I think your question means you want to store the Analyzed tokens yourself? If so, you can use Analyzer to directly process the text, and save the analyzed results in your application, maybe later use it in some RDBMS? or BerkelyDB? Chris Lu --- Full-Text Lucene Search on Any Databases http://www.dbsight.net Faster to Setup than reading marketing materials! On 4/14/06, Christophe <[EMAIL PROTECTED]> wrote: > On 14 Apr 2006, at 08:55, karl wettin wrote: > > > I don't want to use Lucene for persistence. I do not want to store > > tokens nor field text in a FSDirectory or in a RAMDirectory. I want > > to take store the the tokens in my application. > > If I understand your question, I think that the first answer was > exactly correct. > > You don't need to use Lucene for persistence in order to use it for > searching. By setting the fields to be non-stored, Lucene only > constructs the index for those fields, and doesn't save the full text > of the field. For example, we store the text we are searching in an > RDBMS, and only use Lucene for the full-text index. When we need to > retrieve the actual document, we don't go to Lucene; we go to the RDBMS. > > This doesn't require any code changes at all; you just set the fields > to non-stored when you index the documents. > > Lucene still does need an index, somewhere, in order to search, and > Lucene manages the format of the index, so you will still need to use > IndexWriter and IndexReader, and some Directory subclass, in order > for Lucene to have a place to store its index. You can create a new > flavor of Directory if you want Lucene to store its index files > somewhere more exotic than the standard classes allow. > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Using Lucene for searching tokens, not storing them.
14 apr 2006 kl. 18.01 skrev Christophe: On 14 Apr 2006, at 08:55, karl wettin wrote: I don't want to use Lucene for persistence. I do not want to store tokens nor field text in a FSDirectory or in a RAMDirectory. I want to take store the the tokens in my application. If I understand your question, I think that the first answer was exactly correct. You don't need to use Lucene for persistence in order to use it for searching. By setting the fields to be non-stored, Lucene only constructs the index You speak of storing field values in the Lucene index. I speak of not using a Lucene index at all, to only use the query mechanism. All data Lucene need (the index) would be supplied from my application. Not from the Lucene Directory implementation. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Lucene probabilistic
Hi all, I came across an old mail list item from 2003 exploring the possibilities of a more probabilistic approach to using Lucene. Do the online experts know if anyone achieved this since? Thanks for any advice, Malc
Re: Using Lucene for searching tokens, not storing them.
karl wettin wrote: I would like to store all in my application rather than using the Lucene persistency mechanism for tokens. I only want the search mechanism. I do not need the IndexReader and IndexWriter as that will be a natural part of my application. I only want to use the Searchable. Implement the IndexReader API, overriding all of the abstract methods. That will enable you to search your index using Lucene's search code. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
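A rough skeleton of what Doug suggests, with heavy caveats: the TokenStore interface and all names are invented, only a subset of the abstract methods of the 1.9-era IndexReader is shown, and the class is left abstract because the remaining methods (terms(), termPositions(), norms(), getTermFreqVector(), doDelete(), doCommit(), doClose(), ...) still need real or stubbed implementations before an IndexSearcher can use it:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;

    // Sketch: expose an application-side token store through the IndexReader API so
    // Lucene's search code can run over it without any Directory or index files.
    public abstract class TokenStoreReader extends IndexReader {

        /** Hypothetical application-side token store. */
        public interface TokenStore {
            int documentCount();
            int documentFrequency(String field, String text);
            Document makeDocument(int n);
        }

        private final TokenStore store;

        protected TokenStoreReader(TokenStore store) {
            super(null);                  // no Directory; nothing is read from disk
            this.store = store;
        }

        public int numDocs() { return store.documentCount(); }
        public int maxDoc() { return store.documentCount(); }
        public boolean hasDeletions() { return false; }
        public boolean isDeleted(int n) { return false; }
        public Document document(int n) { return store.makeDocument(n); }
        public int docFreq(Term t) { return store.documentFrequency(t.field(), t.text()); }

        // termDocs() is the heart of it: it must iterate the doc numbers (and freqs)
        // for a term out of the application's own data structures.
        public abstract TermDocs termDocs();
    }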
Re: Boosting Fields (in index) or Queries
Wow, I finally found out why I was getting results in the wrong order - I got the results in the correct order from the Lucene index. I got the explanation of each of the results along with their database id and found the ordering mismatch. The problem is in the database call. I am calling: select * from product where id in (444, 333, 555, 888); and the ordering that comes back is not preserved. So the results are correct but the ordering and hence all of the relevancy is out the window. So that at least leads me to the actual problem. Now I have to figure out how I'll approach reordering the results because I don't believe that there's any way to force the ordering of a list and I don't want to call a separate database query for each id (lots of database round-trips). Thanks for the help Erik! On Apr 13, 2006, at 7:13 PM, Erik Hatcher wrote: On Apr 13, 2006, at 8:55 PM, Jeremy Hanna wrote: Looking at the results, the first document in the results should hopefully be near the bottom and the Explanation for this document has a Description/Details (using the toString() on the Explanation) of: product of: 0.0 = sum of: 0.0 = coord(0/7) So I'm kind of at a loss as to what's going on. Am I just doing something crazy weird in my code? I didn't find that many examples out there, so I'm kind of winging it according to what I've read in the javadocs and what examples I could find. Be sure to pass the document id, not the hit number, to explain(). Looks like you passed an id of an unmatched document. Erik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Using Lucene for searching tokens, not storing them.
14 apr 2006 kl. 18.31 skrev Doug Cutting: karl wettin wrote: I would like to store all in my application rather than using the Lucene persistency mechanism for tokens. I only want the search mechanism. I do not need the IndexReader and IndexWriter as that will be a natural part of my application. I only want to use the Searchable. Implement the IndexReader API, overriding all of the abstract methods. That will enable you to search your index using Lucene's search code. Aha, thanks. Do I have to worry about passing a null Directory to the default constructor? - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Boosting Fields (in index) or Queries
We tried two approaches: 1) Pull data from the db in arbitrary order and then sort in the application AFTER the retrieve. This will require two passes over the results. 2) Add an order by clause to the select. In Oracle, you could do something like "order by decode(444,1,333,2,555,3,888,4,...)". This will force the order you want in the query from the db. FWIW, after trying both of the above in production, we changed our strategy to avoid the db hit altogether, storing everything we needed for presentation within the Lucene index. We saw a net performance increase AND simpler code when we did this. -Mike -Original Message- From: Jeremy Hanna [mailto:[EMAIL PROTECTED] Sent: Fri 4/14/06 1:15 PM To: java-user@lucene.apache.org Cc: Subject:Re: Boosting Fields (in index) or Queries Wow, I finally found out why I was getting results in the wrong order - I got the results in the correct order from the Lucene index. I got the explanation of each of the results along with their database id and found the ordering mismatch. The problem is in the database call. I am calling: select * from product where id in (444, 333, 555, 888); and the ordering that comes back is not preserved. So the results are correct but the ordering and hence all of the relevancy is out the window. So that at least leads me to the actual problem. Now I have to figure out how I'll approach reordering the results because I don't believe that there's any way to force the ordering of a list and I don't want to call a separate database query for each id (lots of database round-trips). Thanks for the help Erik! On Apr 13, 2006, at 7:13 PM, Erik Hatcher wrote: > > On Apr 13, 2006, at 8:55 PM, Jeremy Hanna wrote: >> Looking at the results, the first document in the results should >> hopefully be near the bottom and the Explanation for this document >> has a Description/Details (using the toString() on the >> Explanation) of: >> >> product of: >> 0.0 = sum of: >> 0.0 = coord(0/7) >> >> So I'm kind of at a loss as to what's going on. Am I just doing >> something crazy weird in my code? I didn't find that many >> examples out there, so I'm kind of winging it according to what >> I've read in the javadocs and what examples I could find. > > Be sure to pass the document id, not the hit number, to explain(). > Looks like you passed an id of an unmatched document. > > Erik > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
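A sketch of approach (1), re-sorting in the application after the retrieve; the Row type and its getId() method are hypothetical stand-ins for whatever the ORM or JDBC layer returns:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Re-sort rows fetched with "WHERE id IN (...)" into the Lucene relevance order.
    public class HitOrderResort {

        /** Hypothetical row object; only the primary key is needed here. */
        public interface Row {
            Object getId();
        }

        public static List reorder(List luceneOrderedIds, List dbRows) {
            Map rowsById = new HashMap();
            for (int i = 0; i < dbRows.size(); i++) {
                Row r = (Row) dbRows.get(i);
                rowsById.put(r.getId(), r);
            }
            List result = new ArrayList(luceneOrderedIds.size());
            for (int i = 0; i < luceneOrderedIds.size(); i++) {
                Object row = rowsById.get(luceneOrderedIds.get(i));
                if (row != null) {
                    result.add(row);          // keeps Lucene's ranking
                }
            }
            return result;
        }
    }

Rows go into a map keyed by primary key and come back out in the order of the Lucene hit list, so relevance ranking survives the unordered IN (...) fetch.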
Re: Using Lucene for searching tokens, not storing them.
On 4/14/06, karl wettin <[EMAIL PROTECTED]> wrote: > Do I have to worry about passing a null Directory to the default > constructor? That's not an easy road you are trying to take, but it should be doable. There are some final methods you can't override, but just set directoryOwner=false and closeDirectory=false, and that code shoudn't touch the directory you set to null. -Yonik http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Syntax help
Something that took me a while to get was that the analyzer is important BOTH in the indexing phase and in the searching phase (assuming you're using the QueryParser). For your experiment, you probably want to use the WhitespaceAnalyzer. See page 119 of "Lucene in Action". The other three most-common analyzers divide text at non-letter characters, which will do bad things to your path names. Also note that you can use the PerFieldAnalyzerWrapper to use, say, the WhitespaceAnalyzer on the file-path field and other analyzers on other fields; you're not locked into using the same analyzer for all fields. Best Erick BTW, I really recommend a copy of "Lucene in Action".
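A sketch of that setup; the field names follow the thread, everything else is illustrative. The important part is that the very same wrapper is used both when indexing and when constructing the QueryParser:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;

    public class PathAnalyzerSetup {
        public static Analyzer buildAnalyzer() {
            // StandardAnalyzer for most fields, but whitespace-only tokenizing for paths
            PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer());
            analyzer.addAnalyzer("location", new WhitespaceAnalyzer());
            return analyzer;
        }

        public static Query parse(String userQuery) throws Exception {
            // Use the same analyzer at query time as at index time.
            return new QueryParser("contents", buildAnalyzer()).parse(userQuery);
        }
    }

Special characters such as ':' and '\' in the query string still have to be escaped for the QueryParser itself, but with WhitespaceAnalyzer on the location field they survive analysis afterwards.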
Re: Syntax help
On 4/14/06, Erick Erickson <[EMAIL PROTECTED]> wrote: > > Something that took me a while to get was that the analyzer is important > BOTH in the indexing phase and in the searching phase (assuming you're > using > the QueryParser). For you experiment, you probably want to use the > WhitespaceAnalyzer. See page 119 of "Lucene in Action". The other three most-common analyzers divide text at nonletter characters, > which will do bad things to your path names. > > Also note that you can use the PerFieldAnalyzerWrapper to use, say, the > WhitespaceAnalyzer on the file-path field and other analyzers on other > fields, you're not locked into using the same analyzer for all fields. > > Best > Erick > > > BTW, I really recommend a copy of "Lucene in Action".. PerFieldAnalyzerWrapper looks like what I want! I've heard nothing but good things about the book and will have to pick it up! Thanks for the help everyone!
Re: Boosting Fields (in index) or Queries
I would use a database function to force the ordering like the one your provided that works in Oracle, but it doesn't look like mysql 5 supports that. If anyone else knows of a way to force the ordering using mysql 5 queries, please respond. I think I'll just resort them when they get back though. Thanks! On Apr 14, 2006, at 11:39 AM, Bryzek.Michael wrote: We tried two approaches: 1) Pull data from the db in arbitrary order and then sort in the application AFTER the retrieve. This will require two passes over the results. 2) Add an order by clause to the select. In Oracle, you could do something like "order by decode(444,1,333,2,555,3,888,4,...)". This will force the order you want in the query from the db. FWIW, after trying both of the above in production, we changed our strategy to avoid the db hit altogether, storing everything we needed for presentation within the Lucene index. We saw a net performance increase AND simpler code when we did this. -Mike -Original Message- From: Jeremy Hanna [mailto:[EMAIL PROTECTED] Sent: Fri 4/14/06 1:15 PM To: java-user@lucene.apache.org Cc: Subject:Re: Boosting Fields (in index) or Queries Wow, I finally found out why I was getting results in the wrong order - I got the results in the correct order from the Lucene index. I got the explanation of each of the results along with their database id and found the ordering mismatch. The problem is in the database call. I am calling: select * from product where id in (444, 333, 555, 888); and the ordering that comes back is not preserved. So the results are correct but the ordering and hence all of the relevancy is out the window. So that at least leads me to the actual problem. Now I have to figure out how I'll approach reordering the results because I don't believe that there's any way to force the ordering of a list and I don't want to call a separate database query for each id (lots of database round-trips). Thanks for the help Erik! On Apr 13, 2006, at 7:13 PM, Erik Hatcher wrote: On Apr 13, 2006, at 8:55 PM, Jeremy Hanna wrote: Looking at the results, the first document in the results should hopefully be near the bottom and the Explanation for this document has a Description/Details (using the toString() on the Explanation) of: product of: 0.0 = sum of: 0.0 = coord(0/7) So I'm kind of at a loss as to what's going on. Am I just doing something crazy weird in my code? I didn't find that many examples out there, so I'm kind of winging it according to what I've read in the javadocs and what examples I could find. Be sure to pass the document id, not the hit number, to explain(). Looks like you passed an id of an unmatched document. Erik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Using Lucene for searching tokens, not storing them.
karl wettin wrote: Do I have to worry about passing a null Directory to the default constructor? A null Directory should not cause you problems. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Boosting Fields (in index) or Queries
Jeremy Hanna wrote: I would use a database function to force the ordering like the one your provided that works in Oracle, but it doesn't look like mysql 5 supports that. If anyone else knows of a way to force the ordering using mysql 5 queries, please respond. I think I'll just resort them when they get back though. If there's nothing in the relational table that specifies the ordering, I'm afraid you've probably got similar problems in other places. RDBMSes don't guarantee to return rows in the order they were INSERTed. Sure, early in the life of a table that will tend to happen, but as DELETEs, then UPDATEs and new INSERTs get processed, the on-disk order tends to get pretty jumbled. Note that I'm talking about anything that uses the results of your SELECT, not just your Lucene-related code. If ordering of the rows is something your app needs, I recommend adding a column that is expressly for ordering. A one-up integer or something like that. I don't remember what the keyword in MySQL is for that, but I'm pretty sure there is one. Then you can code all your SELECTs with an ORDER BY clause that does what you want. Good luck! --MDC - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Boosting Fields (in index) or Queries
I still have a similar problem with the boost factor. I change the name to have the AND operator and set that query's boost to a very high value in relation to the others. I also have a regular OR based name so that it doesn't rule those out. However whenever I change the boost values with the queries, nothing, absolutely nothing changes with the results. Besides that - I search for: playstation game. The only value that has both playstation and game in the name field is Hit number 20. That's really why I put the name AND operator in there with such a high boost value, to see if it would bring that single ANDed record towards the top, but nothing. Am I doing something wrong in all of this? Am I doing the boost wrong or something? On Apr 14, 2006, at 1:43 PM, Michael D. Curtin wrote: Jeremy Hanna wrote: I would use a database function to force the ordering like the one your provided that works in Oracle, but it doesn't look like mysql 5 supports that. If anyone else knows of a way to force the ordering using mysql 5 queries, please respond. I think I'll just resort them when they get back though. If there's nothing in the relational table that specifies the ordering, I'm afraid you've probably got similar problems in other places. RDBMSes don't guarantee to return rows in the order they were INSERTed. Sure, early in the life of a table that will tend to happen, but as DELETEs, then UPDATEs and new INSERTs get processed, the on-disk order tends to get pretty jumbled. Note that I'm talking about anything that uses the results of your SELECT, not just your Lucene-related code. If ordering of the rows is something your app needs, I recommend adding a column that is expressly for ordering. A one-up integer or something like that. I don't remember what the keyword in MySQL is for that, but I'm pretty sure there is one. Then you can code all your SELECTs with an ORDER BY clause that does what you want. Good luck! --MDC - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
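For reference, a sketch of one way to combine a heavily boosted all-words clause with a looser any-word clause and then inspect the scoring; the field name, boost value, and structure are illustrative, not taken from Jeremy's code:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class BoostedNameQuery {
        public static void search(IndexSearcher searcher, String words) throws Exception {
            QueryParser parser = new QueryParser("name", new StandardAnalyzer());

            parser.setDefaultOperator(QueryParser.AND_OPERATOR);
            Query allWordsInName = parser.parse(words);
            allWordsInName.setBoost(10f);               // documents matching every word rank higher

            parser.setDefaultOperator(QueryParser.OR_OPERATOR);
            Query anyWordInName = parser.parse(words);  // keeps partial matches in the result set

            BooleanQuery combined = new BooleanQuery();
            combined.add(allWordsInName, BooleanClause.Occur.SHOULD);
            combined.add(anyWordInName, BooleanClause.Occur.SHOULD);

            Hits hits = searcher.search(combined);
            for (int i = 0; i < Math.min(5, hits.length()); i++) {
                // explain() takes the internal document id, not the hit rank
                System.out.println(searcher.explain(combined, hits.id(i)));
            }
        }
    }

As Erik noted earlier in the thread, explain() wants the internal document id (hits.id(i)), not the hit rank; passing the rank is a common reason for seeing all-zero explanations.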
Re: Lucene Searches VS. Relational database Queries
I'm the co-worker who suggested this to Ananth (I think we have been debating this for 3 days now; from the post it seems he is winning :)... ). Anyway, as Ananth stated, I suggested this because I am wondering if Lucene could solve a bottleneck query that is taking a deathly long time to complete (read-only). The original design actually generated 60+ threaded queries on the database to return results per user thread that hits our website for this view... I know that this will kill our server when user load increases... I know that Lucene is built for speed and can handle a very large number of people searching (we are using a singleton Searcher), and the (threaded) results will be the "hits" returned from Lucene. Also, this query will NOT be executed by any user in a text field, but rather in our application code, only when the user selects different parts of the site... If all values in this 1:n relationship are in the Lucene index we are querying, then the "application-provided" query will return accurate results. We are using Quartz, and not creating threads in servlets... FINAL SOLUTION MAYBE?: if our client EVER gives us a requirement that says we must have accurate text searching even where our index has a 1:n relationship like "Jason" and "Jason Black", then we should just simply say we cannot implement this, because a Lucene search will yield inaccurate results. Correct??? Comments? -- View this message in context: http://www.nabble.com/Lucene-Seaches-VS.-Relational-database-Queries-t1434583.html#a3925693 Sent from the Lucene - Java Users forum at Nabble.com. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Catching BooleanQuery.TooManyClauses
Hi Lucene Users, I would like to catch the BooleanQuery.TooManyClauses exception for certain wildcard searches and display a 'subset' of results. I have used the WildcardTermEnum to give me the first X documents matching the wildcard query. Below is the code I use to implement the solution. Setting performance concerns aside, is this the best solution? Or should I just tell the user to refine their query!? Thanks Ben

= QueryParserTest.java
...
public class QueryParserTest extends LuceneTestCase {
...
    private static int MAX_HITS = 10;

    public void testCatchTooManyClauses() throws Exception {
        reader = IndexReader.open(directory);
        String queryStr = "9*";
        String field = "PART_NBR";
        Hits hits = null;
        Vector docList;
        try {
            System.out.println("query: " + queryStr);
            System.out.println("field: " + field);
            hits = searcher.search(parser.parse(field + ":" + queryStr));
            docList = new Vector(hits.length());
            Iterator docListIt = hits.iterator();
            while (docListIt.hasNext())
                docList.add(((Hit) docListIt.next()).getDocument());
        } catch (BooleanQuery.TooManyClauses ex) {
            System.out.println("catch BooleanQuery.TooManyClauses, refining query");
            Term term = new Term(field, queryStr);
            WildcardTermEnum wte = new WildcardTermEnum(reader, term);
            int cnt = 0;
            docList = new Vector(MAX_HITS);
            while (wte.next() && cnt++ < MAX_HITS) {
                term = wte.term();
                TermQuery query = new TermQuery(new Term(field, term.text()));
                System.out.println("search for " + query.getTerm().text());
                hits = searcher.search(query);
                Iterator docListIt = hits.iterator();
                while (docListIt.hasNext())
                    docList.add(((Hit) docListIt.next()).getDocument());
            }
        }
        System.out.println("found:" + docList.size());
    }
...
= QueryParserTest.java

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
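One alternative worth noting (not from Ben's mail): the clause limit that triggers the exception is a global setting, so for moderately larger expansions it can simply be raised, at the cost of more memory and time per expanded wildcard query:

    import org.apache.lucene.search.BooleanQuery;

    public class ClauseLimit {
        public static void raiseLimit() {
            // The default is 1024; a larger limit trades memory and query time for coverage.
            BooleanQuery.setMaxClauseCount(8192);
        }
    }

Ben's WildcardTermEnum approach is still the safer choice when an expansion could be huge, since it caps the work instead of merely moving the ceiling.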