Re: document diversity

2009-10-01 Thread Phil Whelan
Hi Mike, I'd simply store a field "doctype" with values "pdf", "txt", "html" and perform a separate search for each type. Although, I'd be interested if anyone has a cooler way of doing this. Cheers, Phil On Thu, Oct 1, 2009 at 9:56 AM, Michael Masters wrote: > I was wondering if there is any w
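The "one search per doctype" idea above can be sketched in plain Java, without Lucene on the classpath. Everything here is an assumption made for illustration: `Hit` is a hypothetical stand-in for a ranked search hit carrying its "doctype" value, and `topPerType` plays the role of running a separate search for each type and keeping the first few results of each.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PerTypeSearch {
    /** Hypothetical hit: a document id plus the value of its "doctype" field. */
    static final class Hit {
        final String id;
        final String doctype;

        Hit(String id, String doctype) {
            this.id = id;
            this.doctype = doctype;
        }
    }

    /** One pass per doctype: keeps the first perType ids of each requested type. */
    static Map<String, List<String>> topPerType(List<Hit> ranked, List<String> types, int perType) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (String type : types) {
            List<String> ids = new ArrayList<>();
            for (Hit hit : ranked) {
                // Hits arrive in rank order, so the first matches are the best ones.
                if (hit.doctype.equals(type) && ids.size() < perType) {
                    ids.add(hit.id);
                }
            }
            out.put(type, ids);
        }
        return out;
    }
}
```

Calling `topPerType(hits, Arrays.asList("pdf", "txt", "html"), 2)` yields at most two ids per type, so no single document type can crowd out the others.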

Problems with IndexReader.reopen()

2009-09-14 Thread Phil Whelan
Hi, I'm not sure why my IndexReader.reopen() call is not working. The latest results are not coming back, meaning the reader / searcher has not been re-opened for the new Documents that have been added. IndexReader openReader = searcher.getIndexReader(); searcher.close(); openReader.reope

Re: Problems with IndexReader.reopen()

2009-09-14 Thread Phil Whelan
Sorry, just realised my mistake. I should read the docs more carefully. IndexReader.reopen() does not reopen the existing IndexReader, but returns a new one. Phil On Mon, Sep 14, 2009 at 3:20 PM, Phil Whelan wrote: > Hi, > > I'm not sure why my IndexReader.reopen() call is not wo
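The mistake described above (dropping the value returned by reopen()) can be shown with a small self-contained sketch. `SnapshotReader` is a hypothetical stand-in written for this example, not Lucene's `IndexReader`; it only mimics the documented contract that `reopen()` hands back a new instance when the index has changed.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for an index reader; NOT the Lucene class. */
final class SnapshotReader {
    private final List<String> source;    // shared, mutable "index"
    private final List<String> snapshot;  // frozen point-in-time view

    SnapshotReader(List<String> source) {
        this.source = source;
        this.snapshot = new ArrayList<>(source);
    }

    int numDocs() {
        return snapshot.size();
    }

    /** Returns this reader if nothing changed, otherwise a NEW reader. */
    SnapshotReader reopen() {
        return snapshot.equals(source) ? this : new SnapshotReader(source);
    }
}

final class ReopenDemo {
    /** Wrong: the return value is dropped, so the old snapshot is still used. */
    static int countIgnoringResult(SnapshotReader reader) {
        reader.reopen();
        return reader.numDocs();
    }

    /** Right: keep whichever reader reopen() hands back. */
    static int countUsingResult(SnapshotReader reader) {
        SnapshotReader current = reader.reopen();
        return current.numDocs();
    }
}
```

With the real Lucene 2.x API the same shape applies: assign the result of `reader.reopen()` to a new variable, and if it differs from the old reader, close the old one and build a fresh `IndexSearcher` on the new one.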

Re: [ANNOUNCEMENT] LucidGaze for Lucene released

2009-09-14 Thread Phil Whelan
Hi Mark, Is there any Lucene 2.9 versions of this in development that I could get my hands on? I'd be happy to be an alpha tester. Cheers, Phil > LucidGaze for Lucene works as a drop-in replacement for the Lucene JAR; > it requires no changes to the source code of the application, or even > reco

Re: Enumerating NumericField using TermEnum?

2009-09-13 Thread Phil Whelan
Hi Uwe, Thanks for the explanation! It really helps. That makes sense that for a small number of values, such as "hour" NumericField is not going to help me. I'm experimenting with using epoch NumericField for sorting, which funnily is where I started with 2.4.1, before going down the usual TooMan

Enumerating NumericField using TermEnum?

2009-09-11 Thread Phil Whelan
Hi, I've used NumericField to store my "hour" field. Example... doc.add(new NumericField("hour").setIntValue(Integer.parseInt("12"))); Before I was using plain string Field and enumerating them with TermEnum, which worked fine. Now that I'm using NumericFields, I'm not sure how to port this enu

Re: Why does this search succeed with web app, but not Luke?

2009-08-06 Thread Phil Whelan
the "" part. >> >> That's why I said in my original post that I was kind of surprised that >> doing a web query for "path:.yyy" succeeded, i.e, in the path field in >> the index, there is no ".yyy", just "". >

Re: Why does this search succeed with web app, but not Luke?

2009-08-06 Thread Phil Whelan
Hi Jim, Are you using the same Analyzer for indexing and searching? .yyy will be seen as a HOSTNAME by StandardAnalyzer, which will keep it as one term, whereas another indexer might split this into 2 terms. This should not matter either way as long as you are using the same Analyzer for both ind

Re: Searching doubt

2009-08-04 Thread Phil Whelan
(sorry, tangent. I'll be quick) On Tue, Aug 4, 2009 at 8:42 AM, Shai Erera wrote: > Interesting ... I don't have access to a Japanese dictionary, so I just > extract bi-grams. Shai - if you're interested in parsing Japanese, check out Kakasi. It can split into words and convert Kanji->Katakana/Hi

Re: Searching doubt

2009-08-04 Thread Phil Whelan
On Tue, Aug 4, 2009 at 8:31 AM, Shai Erera wrote: > Hi Darren, > > The question was, how given a string "aboutus" in a document, you can return > that document as a result to the query "about us" (note the space). So we're > mostly discussing how to detect and then break the word "aboutus" to two >

Re: Searching doubt

2009-08-04 Thread Phil Whelan
On Tue, Aug 4, 2009 at 3:56 AM, Shai Erera wrote: > 2) Use a dictionary (real dictionary), and search it for every substring, > e.g. "a", "ab", "abo" ... "about" etc. If you find a match, split it there. > This needs some fine tuning, like checking if the rest is also a word and if > the full strin
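The dictionary approach quoted above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the poster's code: the dictionary is a plain `Set<String>`, the method splits into at most two words, and the check that the full string is itself a word is a guess at the "fine tuning" the truncated message goes on to describe.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

final class CompoundSplitter {
    /**
     * Tries every prefix ("a", "ab", "abo", ...) and splits at the first
     * position where both the prefix and the remainder are dictionary words.
     * Returns the input unchanged when it is itself a word or no split fits.
     */
    static List<String> split(String token, Set<String> dictionary) {
        if (dictionary.contains(token)) {
            return Collections.singletonList(token);
        }
        for (int i = 1; i < token.length(); i++) {
            String head = token.substring(0, i);
            String tail = token.substring(i);
            if (dictionary.contains(head) && dictionary.contains(tail)) {
                return Arrays.asList(head, tail);
            }
        }
        return Collections.singletonList(token);
    }
}
```

Requiring the remainder to be a word as well is what stops "aboutus" from being split after the one-letter word "a". Indexing both the original token and its parts would then let the query "about us" match the document.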

Re: How to improve search time?

2009-08-02 Thread Phil Whelan
Hi Prashant, Take a look at this... http://wiki.apache.org/lucene-java/ImproveSearchingSpeed Cheers, Phil On Sun, Aug 2, 2009 at 9:33 PM, prashant ullegaddi wrote: > Hi, > > I've a single index of size 87GB containing around 50M documents. When I > search for any query, > best search time I obse

Re: Weird discrepancy with term counts vs. terms (off by 1)

2009-08-02 Thread Phil Whelan
Hi Jim, On Sun, Aug 2, 2009 at 12:12 PM, wrote: > i.e., I was ignoring the 1st term in the TermEnum (since the .next() bumps the TermEnum to the 2nd term, initially). Great! Glad you found the problem. I couldn't see it. Phil
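The off-by-one described above is easy to reproduce with any cursor that already sits on its first element. The `TermCursor` class below is a hypothetical stand-in written for this sketch (it is not Lucene's `TermEnum`); it assumes, as in Jim's case, that the cursor is positioned on the first term when it is handed out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical cursor that starts ON the first term; not the Lucene TermEnum class. */
final class TermCursor {
    private final List<String> terms;
    private int pos = 0;

    TermCursor(List<String> terms) {
        this.terms = terms;
    }

    /** Current term, or null once the cursor is exhausted. */
    String term() {
        return pos < terms.size() ? terms.get(pos) : null;
    }

    /** Advances to the next term; returns false when there is none. */
    boolean next() {
        pos++;
        return pos < terms.size();
    }
}

final class TermListing {
    /** Buggy: next() runs before the first read, so the first term is skipped. */
    static List<String> advanceThenRead(TermCursor cursor) {
        List<String> out = new ArrayList<>();
        while (cursor.next()) {
            out.add(cursor.term());
        }
        return out;
    }

    /** Fixed: read the current term first, then advance. */
    static List<String> readThenAdvance(TermCursor cursor) {
        List<String> out = new ArrayList<>();
        if (cursor.term() == null) {
            return out;
        }
        do {
            out.add(cursor.term());
        } while (cursor.next());
        return out;
    }
}
```

The buggy loop lists one term fewer than the index holds, which matches the "13 documents, 12 path terms" symptom discussed in this thread.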

Re: Weird discrepancy with term counts vs. terms (off by 1)

2009-08-02 Thread Phil Whelan
On Sun, Aug 2, 2009 at 10:58 AM, Andrzej Bialecki wrote: > Thank you Phil for spotting this bug - this fix will be included in the next > release of Luke. Glad to help. Thanks for building this great tool! Phil - To unsubscribe,

Re: Weird behaviour

2009-08-02 Thread Phil Whelan
Hi Prashant, I agree with Shai that using Luke, and printing out what the Document looks like before it goes into the index, are going to be your best bet for debugging this problem. The problem you're having is that StandardAnalyzer does not break up the hostname into separate terms, as it has a

Re: Weird discrepancy with term counts vs. terms (off by 1)

2009-08-02 Thread Phil Whelan
Hi Jim, On Sun, Aug 2, 2009 at 9:08 AM, Phil Whelan wrote: > >> So then, I reviewed the index using Luke, and what I saw with that was that >> there were indeed only 12 "path" terms (under "Term Count" on the left), >> but, when I clicked the "Show

Re: Weird discrepancy with term counts vs. terms (off by 1)

2009-08-02 Thread Phil Whelan
Hi Jim, On Sun, Aug 2, 2009 at 1:32 AM, wrote: > I first noticed the problem that I'm seeing while working on this latter app. > Basically, what I noticed was that while I was adding 13 documents to the > index, when I listed the "path" terms, there were only 12 of them. Field text (the whole

Re: java.io.IOException when trying to list terms in index (IndexReader)

2009-08-01 Thread Phil Whelan
Hi Jim, I cannot see anything obvious, but both open() and terms() throw IOExceptions. You could try putting these in separate try..catch blocks to see which one it's coming from. Or using e.printStackTrace() in the catch block will give more info to help you debug what's happening. On Sat, Aug
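The debugging advice above can be sketched as follows. The `IoStep` interface and the `locate` helper are hypothetical names introduced for this example; the two steps stand in for the open() and terms() calls, each of which may throw IOException.

```java
import java.io.IOException;

final class FailureLocator {
    /** A step that may throw IOException, e.g. opening the index or listing terms. */
    interface IoStep {
        void run() throws IOException;
    }

    /**
     * Runs the two steps in separate try/catch blocks and reports which one
     * failed: "open", "terms", or "none".
     */
    static String locate(IoStep openStep, IoStep termsStep) {
        try {
            openStep.run();
        } catch (IOException e) {
            e.printStackTrace(); // the stack trace shows the failing call
            return "open";
        }
        try {
            termsStep.run();
        } catch (IOException e) {
            e.printStackTrace();
            return "terms";
        }
        return "none";
    }
}
```

Splitting the blocks identifies the failing call directly, and the stack trace printed in each catch block adds the line number and the underlying cause.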

Re: ThreadedIndexWriter vs. IndexWriter

2009-08-01 Thread Phil Whelan
Hi Mike, It's Jibo, not me, having the problem. But thanks for the link. I was interested to look at the code. Will be buying the book soon. Phil On Sat, Aug 1, 2009 at 2:08 AM, Michael McCandless wrote: > > (Please note that ThreadedIndexWriter is source code available with > the upcoming revi

Is it possible to retrieve Terms from a Document?

2009-07-31 Thread Phil Whelan
Hi, I know you can use Field.Store.YES, but I want to inspect the terms / tokens and their order related to the field name at search time. Is this possible? Obviously this information is stored in the index, but I can not find any API to access it. I'm guessing the answer might be that Terms point

Re: ThreadedIndexWriter vs. IndexWriter

2009-07-31 Thread Phil Whelan
Hi Jibo, Your mergeFactor is different, and the resulting numFiles (segment files) is different. Maybe each thread is responsible for a segment file. Just curious - do you have 3 threads? Phil

Re: ThreadedIndexWriter vs. IndexWriter

2009-07-31 Thread Phil Whelan
Hi Jibo, Have you tried optimizing indexes? I do not know anything about the implementation of ThreadedIndexWriter, but if they both optimize down to the same size, it could just mean that ThreadedIndexWriter is not as optimized. Thanks, Phil On Fri, Jul 31, 2009 at 11:38 AM, Jibo John wrote: >

Re: Seeking guidance for updating indexes

2009-07-31 Thread Phil Whelan
Hi Jim, There should not be much difference from the Lucene end between a new index and an index you want to update (add more documents to). As stated in the Lucene docs, IndexWriter will create the index "if it does not already exist". http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/in

Re: indexing multiple email addresses in one field

2009-07-31 Thread Phil Whelan
u do, just don't include stop word removal in the > processing of your token stream. > > Matt > > Phil Whelan wrote: >> >> Hi Matthew / Paul, >> >> On Thu, Jul 30, 2009 at 4:32 PM, Paul Cowan wrote: >> >>> >>> Matthew Hall wrote: >

Re: Is there a list of "special" characters for standard analyzer?

2009-07-30 Thread Phil Whelan
On Thu, Jul 30, 2009 at 7:12 PM, wrote: > I was wonder if there is a list of special characters for the standard > analyzer? > > What I mean by "special" is characters that the analyzer considers break > characters. > For example, if I have something like "foo=something", apparently the analyzer

Re: indexing multiple email addresses in one field

2009-07-30 Thread Phil Whelan
Hi Matthew / Paul, On Thu, Jul 30, 2009 at 4:32 PM, Paul Cowan wrote: > Matthew Hall wrote: >> >> Place a delimiter between the email addresses that doesn't get removed in >> your analyzer.  (preferably something you know will never be searched on) > > Or add them separately (rather than: >  doc.a

Re: indexing multiple email addresses in one field

2009-07-30 Thread Phil Whelan
On Thu, Jul 30, 2009 at 11:22 AM, Matthew Hall wrote: > > 1. Sure, just have an analyzer that splits on all non letter characters. > 2. Phrase queries keep the order intact.  (And yes, the positional > information for the terms is kept, which is what allows span queries to work) > > So searching

indexing multiple email addresses in one field

2009-07-30 Thread Phil Whelan
Hi, We have a very large Lucene index that we're developing that has a field of email addresses. (Actually multiple fields with multiple email addresses, but I'll simplify here) Each document will have one "email" field containing multiple email addresses. I am indexing email addresses only usi

Re: Querying across object relationships

2009-07-29 Thread Phil Whelan
Hi Don, On Wed, Jul 29, 2009 at 1:42 PM, Donal Murtagh wrote: >    Course.name   Attendance.mandatory   Student.name >    - >    cooking                        N                      Bob >    art                                Y                      

Re: Batch searching

2009-07-22 Thread Phil Whelan
On Wed, Jul 22, 2009 at 12:28 PM, Matthew Hall wrote: > Not sure if this helps you, but some of the issue you are facing seem > similar to those in the "real time" search threads. Hi Matthew, Do you have a pointer of where to go to see the "real time" threads? Thanks, Phil

Re: Alternative way to simulate sorting without doing actual sort

2009-07-22 Thread Phil Whelan
Hi Ganesh, I'm not sure whether this will work for you, but one way I got around this was with multiple searches. I only needed the first 50 results, but wanted to sort by date,hour,min,sec. This could result in 5 results or millions of results. I added the date to the query, so I'd search for r
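The multiple-search workaround above might look roughly like the sketch below. The message is truncated, so the details are assumptions: the code walks backwards one day at a time from a start date, treats `hitsByDay` as a hypothetical stand-in for "run the query restricted to one date", sorts only that day's hits, and stops as soon as enough results have been collected.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

final class NewestFirstSearch {
    /**
     * hitsByDay stands in for a date-restricted search; each value holds the
     * zero-padded times ("HH:mm:ss") of the documents matching on that day.
     */
    static List<String> newest(Map<LocalDate, List<String>> hitsByDay,
                               LocalDate start, int maxDaysBack, int limit) {
        List<String> results = new ArrayList<>();
        for (int back = 0; back < maxDaysBack && results.size() < limit; back++) {
            LocalDate day = start.minusDays(back);
            List<String> dayHits = hitsByDay.get(day);
            if (dayHits == null) {
                continue; // no matches on this day
            }
            // Only one day's hits are sorted at a time, newest first.
            List<String> sorted = new ArrayList<>(dayHits);
            sorted.sort(Collections.reverseOrder());
            for (String time : sorted) {
                if (results.size() == limit) {
                    break;
                }
                results.add(day + " " + time);
            }
        }
        return results;
    }
}
```

The design choice is to trade one large sort over possibly millions of hits for a handful of small date-restricted searches, which works when only the first page of results is needed.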

Re: indexing 100GB of data

2009-07-22 Thread Phil Whelan
On Wed, Jul 22, 2009 at 5:46 AM, m.harig wrote: > Is there any article or forum for using Hadoop with lucene? Please any1 help > me Hi M, Katta is a project that is combining Lucene and Hadoop. Check it out here... http://katta.sourceforge.net/ Thanks, Phil

Re: Exclusion search

2009-07-22 Thread Phil Whelan
If there are only a few thousand documents, and the number of results is quite small, is this a case where post-search filtering can be done? I have not done anything like this myself with Lucene, so is this a bad idea? If not, what would be the best way to do this? org.apache.lucene.search.Filte