DocIdSet to represent small number of hits in large Document set

2011-04-04 Thread Antony Bowesman
I'm converting from Lucene 2.3.2 to 2.4.1 (with a view to going to 2.9.4). Many of our indexes are 5M+ Documents; however, only a small subset of these are relevant to any user. As a DocIdSet backed by a BitSet or OpenBitSet is rather inefficient in terms of memory use, what is the recommended
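For a handful of hits in a multi-million-document index, a sorted array of docIds uses memory proportional to the hit count rather than to maxDoc; Lucene 2.4+ also ships org.apache.lucene.util.SortedVIntList for exactly this case. A minimal sketch against the 2.9 DocIdSetIterator contract (class name hypothetical; 2.4's iterator still used next()/skipTo()):

    import org.apache.lucene.search.DocIdSet;
    import org.apache.lucene.search.DocIdSetIterator;

    // A DocIdSet over a sorted int[] of matching docIds.
    // Memory is proportional to the number of hits, not maxDoc.
    public class SortedIntDocIdSet extends DocIdSet {
      private final int[] docs;  // must be sorted ascending

      public SortedIntDocIdSet(int[] sortedDocs) { this.docs = sortedDocs; }

      public DocIdSetIterator iterator() {
        return new DocIdSetIterator() {
          private int i = -1;

          public int docID() {
            return (i < 0) ? -1 : (i < docs.length ? docs[i] : NO_MORE_DOCS);
          }

          public int nextDoc() {
            return ++i < docs.length ? docs[i] : NO_MORE_DOCS;
          }

          public int advance(int target) {
            // linear scan keeps the sketch short; binary search scales better
            while (++i < docs.length) {
              if (docs[i] >= target) return docs[i];
            }
            return NO_MORE_DOCS;
          }
        };
      }
    }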

Index time boost question

2011-04-14 Thread Antony Bowesman
I have a test case written for 2.3.2 that tested an index-time boost of 0.0F on a field and then did a search using Hits and got 0 results. I'm now in the process of upgrading to 2.9.4 and am removing all use of Hits in my test cases, using a Collector instead. Now the test case fails as it

NullPointerException in FieldSortedHitQueue

2011-04-14 Thread Antony Bowesman
Upgrading from 2.3.2 to 2.9.4 I get NPE as below Caused by: java.lang.NullPointerException at org.apache.lucene.search.FieldSortedHitQueue$1.createValue(FieldSortedHitQueue.java:224) at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:224) at org.apache.lucene.s

What doc id to use on IndexReader with SetNextReader

2011-04-18 Thread Antony Bowesman
Migrating some code from 2.3.2 to 2.9.4 and I have custom Collectors. Now there are multiple calls to collect and each call needs to adjust the passed doc id by docBase as given in SetNextReader. However, if you want to fetch the document in the collector, what docId/IndexReader combination s

Re: What doc id to use on IndexReader with SetNextReader

2011-04-18 Thread Antony Bowesman
Thanks Uwe, I assumed as much. On 18/04/2011 7:28 PM, Uwe Schindler wrote: Document d = reader.document(doc) This is the correct way to do it. Uwe
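A minimal 2.9-style Collector illustrating Uwe's point: collect() receives segment-local ids, and reader.document(doc) on the reader passed to setNextReader is equivalent to topReader.document(docBase + doc). Class and field names are illustrative:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Collector;
    import org.apache.lucene.search.Scorer;

    public class GlobalIdCollector extends Collector {
      private final List<Integer> globalIds = new ArrayList<Integer>();
      private IndexReader segmentReader;
      private int docBase;

      public void setScorer(Scorer scorer) {}          // scores not needed here

      public void setNextReader(IndexReader reader, int docBase) {
        this.segmentReader = reader;                   // use with the raw, local doc id
        this.docBase = docBase;
      }

      public void collect(int doc) throws IOException {
        // segmentReader.document(doc) works here; equivalently,
        // topReader.document(docBase + doc) works after the search
        globalIds.add(docBase + doc);
      }

      public boolean acceptsDocsOutOfOrder() { return true; }

      public List<Integer> getGlobalIds() { return globalIds; }
    }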

Can TermDocs.skipTo() go backwards

2008-08-27 Thread Antony Bowesman
I have a custom TopDocsCollector and need to collect a payload from each final document hit. The payload comes from a single term in each hit. When collecting the payload, I don't want to fetch the payload during the collect() method as it will make fetches which may subsequently be bumped fro

Re: Can TermDocs.skipTo() go backwards

2008-08-27 Thread Antony Bowesman
Michael McCandless wrote: TermDocs.skipTo() only moves forwards. Can you use a stored field to retrieve this information, or do you really need to store it per-term-occurrence in your docs? I discussed my use case with Doron earlier and there were two options, either to use payloads or stor

Re: Can TermDocs.skipTo() go backwards

2008-08-27 Thread Antony Bowesman
Michael McCandless wrote: Ahh right, my short term memory failed me ;) I now remember this thread. Excused :) I expect you have real work to occupy your mind! Yes, though LUCENE-1231 (column stride stored fields) should help this. I see from JIRA that MB has started working on this - It's

Javadoc wording in IndexWriter.addIndexesNoOptimize()

2008-09-04 Thread Antony Bowesman
The Javadoc for this method has the following comment: "This requires this index not be among those to be added, and the upper bound* of those segment doc counts not exceed maxMergeDocs. " What does the second part of that mean, which is especially confusing given that MAX_MERGE_DOCS is depre

Merging indexes - which is best option?

2008-09-04 Thread Antony Bowesman
I am creating several temporary batches of indexes to separate indices and periodically will merge those batches to a set of master indices. I'm using IndexWriter#addIndexesNoOptimize(), but the problem that gives me is that the master may already contain the index for that document and I get a dup

Re: Merging indexes - which is best option?

2008-09-08 Thread Antony Bowesman
Thanks Karsten, I decided first to delete all duplicates from master(iW) and then to insert all temporary indices(other). I reached the same conclusion. As your code shows, it's a simple enough solution. You had a good point with the iW.abort() in the rollback case. Antony

Caching Filters and docIds when using MultiSearcher/IndexSearcher(MultiReader)...

2008-09-11 Thread Antony Bowesman
Up to now I have only needed to search a single index, but now I will have many index shards to search across. My existing search maintained cached filters for the index as well as a cache of my own unique ID fields in the index, keyed by Lucene DocId. Now I need to search multiple indices, I

Re: Phrase Query

2008-09-16 Thread Antony Bowesman
Is it possible to write a document with different analyzers in different fields? PerFieldAnalyzerWrapper
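A sketch of PerFieldAnalyzerWrapper usage, with the field name made up for the example:

    import org.apache.lucene.analysis.KeywordAnalyzer;
    import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    // StandardAnalyzer for most fields, KeywordAnalyzer (one unmodified token)
    // for the "id" field
    PerFieldAnalyzerWrapper analyzer =
        new PerFieldAnalyzerWrapper(new StandardAnalyzer());
    analyzer.addAnalyzer("id", new KeywordAnalyzer());
    // hand the same wrapper to both IndexWriter and QueryParser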

Re: distinct field values

2008-10-14 Thread Antony Bowesman
Akanksha Baid wrote: I have indexed multiple documents - each of them have 3 fields ( id, tag , text). Is there an easy way to determine the set of tags for a given query without iterating through all the hits? For example if I have 100 documents in my index and my set of tag = {A, B, C}. Query
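One way to answer this without walking every hit once per tag, sketched against the pre-2.9 HitCollector API ("tag" is the field from the question; the method name is hypothetical): collect the query's matches into a BitSet once, then probe each tag's TermDocs against it, stopping at the first overlap.

    import java.io.IOException;
    import java.util.BitSet;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.search.HitCollector;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public static boolean tagInResults(IndexSearcher searcher, Query query, String tag)
        throws IOException {
      IndexReader reader = searcher.getIndexReader();
      final BitSet hits = new BitSet(reader.maxDoc());
      searcher.search(query, new HitCollector() {      // pre-2.9 collector API
        public void collect(int doc, float score) { hits.set(doc); }
      });
      TermDocs td = reader.termDocs(new Term("tag", tag));
      try {
        while (td.next()) {
          if (hits.get(td.doc())) return true;         // tag occurs in the result set
        }
      } finally {
        td.close();
      }
      return false;
    }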

Which is faster/better

2008-11-24 Thread Antony Bowesman
In 2.4, as well as IndexWriter.deleteDocuments(Term) there is also IndexReader.deleteDocuments(Term). I understand opening a reader is expensive, so does this mean using IndexWriter.deleteDocuments would be faster from a closed index position? As the IndexReader instance is newer, it has bet

Re: Which is faster/better

2008-11-25 Thread Antony Bowesman
Michael McCandless wrote: If you have nothing open already, and all you want to do is delete certain documents and make a commit point, then using IndexReader vs IndexWriter should show very little difference in speed. Thanks. This use case can assume there may be nothing open. I prefer Ind

addIndexesNoOptimize question

2008-12-17 Thread Antony Bowesman
The javadocs state "This requires ... and the upper bound* of those segment doc counts not exceed maxMergeDocs." Can one of the gurus please explain what that means and what needs to be done to find out whether an index being merged fits that criteria. Thanks Antony

Re: addIndexesNoOptimize question

2008-12-19 Thread Antony Bowesman
Thanks Mike, I'm still on 2.3.1, so will upgrade soon. Antony Michael McCandless wrote: This was an attempt on addIndexesNoOptimize's part to "respect" the maxMergeDocs (which prevents large segments from being merged) you had set on IndexWriter. However, the check was too pedantic, and was

Re: Lucene 2.4 - Searching

2009-01-27 Thread Antony Bowesman
Karl Heinz Marbaise wrote: I have a field which is called filename and contains a filename which can of course be lowercase or uppercase or a mixture... I would like to do the following: +filename:/*scm*.doc That should result in getting things like /...SCMtest.doc /...scmtest.doc /...scm

How to not overwrite a Document if it 'already exists'?

2009-05-05 Thread Antony Bowesman
I'm adding Documents in batches to an index with IndexWriter. In certain circumstances, I do not want to add the Document if it already exists, where existence is determined by field id=myId. Is there any way to do this with IndexWriter or do I have to open a reader and look for the term id:X

Which is more efficient

2009-05-05 Thread Antony Bowesman
Just wondered which was more efficient under the hood for (int i = 0; i < size; i++) terms[i] = new Term("id", doc_key[i]); This writer.deleteDocuments(terms); for (int i = 0; i < size; i++) writer.addDocument(doc[i]); Or this for (int i = 0; i < size; i++) writer.updateDoc
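The two alternatives from the question, reformatted for readability (variable names as in the post). updateDocument(Term, Document) is documented as an atomic delete-then-add on the given term, so the second form does per-document what the first does in bulk:

    // batch: delete all old versions by id term, then add the new documents
    Term[] terms = new Term[size];
    for (int i = 0; i < size; i++) {
      terms[i] = new Term("id", doc_key[i]);
    }
    writer.deleteDocuments(terms);
    for (int i = 0; i < size; i++) {
      writer.addDocument(doc[i]);
    }

    // versus one call per document: atomic delete-then-add on the id term
    for (int i = 0; i < size; i++) {
      writer.updateDocument(new Term("id", doc_key[i]), doc[i]);
    }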

Re: How to not overwrite a Document if it 'already exists'?

2009-05-05 Thread Antony Bowesman
Michael McCandless wrote: Lucene doesn't provide any way to do this, except opening a reader. Opening a reader is not "that" expensive if you use it for this purpose. EG neither norms nor FieldCache will be loaded if you just enumerate the term docs. Thanks for that info. These indexes will
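A sketch of the term-docs existence test Mike describes — enumerating term docs loads neither norms nor FieldCache. The method name is hypothetical; "id" is the unique field from the question:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;

    // true if at least one (non-deleted) document carries the id term
    public static boolean exists(IndexReader reader, String id) throws IOException {
      TermDocs td = reader.termDocs(new Term("id", id));
      try {
        return td.next();
      } finally {
        td.close();
      }
    }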

Re: How to not overwrite a Document if it 'already exists'?

2009-05-05 Thread Antony Bowesman
Thanks for that info. These indexes will be large, in the 10s of millions. id field is unique and is 29 bytes. I guess that's still a lot of data to trawl through to get to the term. Have you tested how long it takes to look up docs from your id? Not in indexes that size in a live environme

TermEnum with deleted documents

2009-05-06 Thread Antony Bowesman
I am merging Index A to Index B. First I read the terms for a particular field from index A and some of the documents in A get deleted. I then enumerate the terms on a different field also in index A, but the terms from the deleted document are still present. The termEnum.docFreq() also retu

Re: TermEnum with deleted documents

2009-05-10 Thread Antony Bowesman
At 1:04 AM, Antony Bowesman wrote: I am merging Index A to Index B. First I read the terms for a particular field from index A and some of the documents in A get deleted. I then enumerate the terms on a different field also in index A, but the terms from the deleted document are still pres

NumberFormatException when creating field cache

2009-09-09 Thread Antony Bowesman
I'm using Lucene 2.3.2 and have a date field used for sorting, which is MMDDHHMM. I get an exception when the FieldCache is being generated as follows: java.lang.NumberFormatException: For input string: "190400-412317" java.lang.NumberFormatException.forInputString(NumberFormatException.jav
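If the field's values are not purely numeric, forcing the sort type stops FieldCache from auto-detecting the field as a number and failing on a malformed value like the one in the exception; a one-line sketch ("date" stands in for the actual field name):

    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;

    // sort lexicographically; FieldCache never attempts a numeric parse
    Sort byDate = new Sort(new SortField("date", SortField.STRING));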

TopFieldDocCollector and v3.0.0

2009-12-07 Thread Antony Bowesman
I'm on 2.3.2 and looking to move to 2.9.1 or 3.0.0 In 2.9.1 TopFieldDocCollector is "Deprecated. Please use TopFieldCollector instead." in 3.0.0 TopFieldCollector says NOTE: This API is experimental and might change in incompatible ways in the next release What is the suggested path for mig

deleteDocuments by Term[] for ALL terms

2007-11-25 Thread Antony Bowesman
Hi, I'm using IndexReader.deleteDocuments(Term) to delete documents in batches. I need the deleted count, so I cannot use IndexWriter.deleteDocuments(). What I want to do is delete documents based on more than one term, but not like IndexWriter.deleteDocuments(Term[]) which deletes all docum

Re: deleteDocuments by Term[] for ALL terms

2007-12-04 Thread Antony Bowesman
int delCount = 0; while(scorer.next()) { reader.deleteDocument(scorer.doc()); delCount++; } that iterates over all the docIDs without scoring them and without building up a Hit for each, etc. Mike
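Mike's loop, fleshed out into a compilable sketch against the pre-2.9 Scorer API (method name hypothetical):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Scorer;

    // delete every document matching the query, returning the count;
    // no scoring work is done and no Hits objects are built
    public static int deleteByQuery(IndexReader reader, Query query) throws IOException {
      IndexSearcher searcher = new IndexSearcher(reader);
      try {
        Scorer scorer = query.weight(searcher).scorer(reader);
        int delCount = 0;
        if (scorer != null) {
          while (scorer.next()) {
            reader.deleteDocument(scorer.doc());
            delCount++;
          }
        }
        return delCount;
      } finally {
        searcher.close();  // does not close the reader we were given
      }
    }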

Concurrency between IndexReader and IndexWriter

2007-12-09 Thread Antony Bowesman
My application batch adds documents to the index using IndexWriter.addDocument. Another thread handles searchers, creating new ones as needed, based on a policy. These searchers open a new IndexReader and there is currently no synchronisation between this action and any being performed by my w

Re: Concurrency between IndexReader and IndexWriter

2007-12-09 Thread Antony Bowesman
Using Lucene 2.1 Antony Bowesman wrote: My application batch adds documents to the index using IndexWriter.addDocument. Another thread handles searchers, creating new ones as needed, based on a policy. These searchers open a new IndexReader and there is currently no synchronisation between

Re: Concurrency between IndexReader and IndexWriter

2007-12-09 Thread Antony Bowesman
Looks like I got myself into a twist for nothing - the reader will see a consistent view, despite what the writer does, as long as the reader remains open. Apologies for the noise... Antony

Deleting a single TermPosition for a Document

2008-01-07 Thread Antony Bowesman
I'd like to 'update' a single Document in a Lucene index. In practice, this 'update' is actually just a removal of a single TermPosition for a given Term for a given doc Id. I don't think this is currently possible, but would it be easy to change Lucene to support this type of usage? The re

Re: Deleting a single TermPosition for a Document

2008-01-08 Thread Antony Bowesman
…cument and most are not stored. Antony

Re: Why is lucene so slow indexing in nfs file system ?

2008-01-09 Thread Antony Bowesman
Ariel wrote: The problem I have is that my application spends a lot of time to index all the documents, the delay to index 10 gb of pdf documents is about 2 days (to convert pdf to text I am using pdfbox) that is of course a lot of time, others applications based in lucene, for instance ibm omni

Re: how do I get my own TopDocHitCollector?

2008-01-09 Thread Antony Bowesman
Beard, Brian wrote: Question: The documents that I index have two id's - a unique document id and a record_id that can link multiple documents together that belong to a common record. I'd like to use something like TopDocs to return the first 1024 results that have unique record_id's, but I wil

Re: how do I get my own TopDocHitCollector?

2008-01-10 Thread Antony Bowesman
fetches the external id's from the searcher and places them in the cache?

Re: Lucene sorting case-sensitive by default?

2008-01-15 Thread Antony Bowesman
Erick Erickson wrote: doc.add( new Field( "f", "This is Some Mixed, case Junk($*%& With Ugly SYmbols", Field.Store.YES, Field.Index.TOKENIZED)); pr

Re: Using RangeFilter

2008-01-21 Thread Antony Bowesman
vivek sar wrote: I need to be able to sort on optime as well, thus need to store it. Lucene's default sorting does not need the field to be stored, only indexed as untokenized. Antony

DateTools UTC/GMT mismatch

2008-01-22 Thread Antony Bowesman
Hi, I just noticed that although the Javadocs for Lucene 2.2 state that the dates for DateTools use UTC as a timezone, they are actually using GMT. Should either the Javadocs be corrected or the code corrected to use UTC instead. Antony

Re: Multiple searchers (Was: CachingWrapperFilter: why cache per IndexReader?)

2008-01-23 Thread Antony Bowesman
Toke Eskildsen wrote: == Average over the first 50.000 queries == metis_flash_RAID0_8GB_i37_t2_l21.log - 279.6 q/sec metis_flash_RAID0_8GB_i37_t2_l23.log - 202.3 q/sec metis_flash_RAID0_8GB_i37_v23_t2_l23.log - 195.9 q/sec == Average over the first 340.000 queries == metis_flash_RAID0_8GB_i37

Re: Using RangeFilter

2008-01-24 Thread Antony Bowesman
vivek sar wrote: I've a field as NO_NORM, does it have to be untokenized to be able to sort on it? NO_NORMS is the same as UNTOKENIZED + omitNorms, so you can sort on that. Antony

Re: Biggest index

2008-03-16 Thread Antony Bowesman
[EMAIL PROTECTED] wrote: Yes of course, the answers to your questions are important too. But no anwser at all until now :( One example: 1.5 million documents Approx 15 fields per document DB is 10-15GB (can't find correct figure) All on one machine. No stats on search usage though. We're abo

Re: Search emails - parsing mailbox (mbox) files

2008-04-04 Thread Antony Bowesman
Subodh Damle wrote: Is there any reliable implementation for parsing email mailbox files (mbox format), especially large (>50MB) archives ? Even after searching lucene mailing list archives, googling around, I couldn't find one. I took a look at Apache James project which seems to offer some supp

Re: How to improve performance of large numbers of successive searches?

2008-04-10 Thread Antony Bowesman
Chris McGee wrote: These tips have significantly improved the time to build the directory and search it. However, I have noticed that when I perform term queries using a searcher many times in rapid succession and iterate over all of the hits it can take a significant time. To perform 1000 te

Using Lucene partly as DB and 'joining' search results.

2008-04-11 Thread Antony Bowesman
We're planning to archive email over many years and have been looking at using DB to store mail meta data and Lucene for the indexed mail data, or just Lucene on its own with email data and structure stored as XML and the raw message stored in the file system. For some customers, the volumes a

Re: Using Lucene partly as DB and 'joining' search results.

2008-04-11 Thread Antony Bowesman
Paul Elschot wrote: Op Friday 11 April 2008 13:49:59 schreef Mathieu Lecarme: Use Filter and BitSet. From the personal data, you build a Filter (http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/search/Filter.html) which is used in the main index. With 1 billion mails, and possibly

Re: Using Lucene partly as DB and 'joining' search results.

2008-04-14 Thread Antony Bowesman
Thanks all for the suggestions - there was also another thread "Lucene index on relational data" which had crossover here. That's an interesting idea about using ParallelReader for the changeable index. I had thought to just have a triplet indexed 'owner:mailId:label' in each Doc and have multi

Re: Using Lucene partly as DB and 'joining' search results.

2008-04-14 Thread Antony Bowesman
Chris Hostetter wrote: you can't ... that's why i said you'd need to rebuild the smaller index completely on a periodic basis (going in the same order as the docs in the Mmm, the annotations would only be stored in the index. It would be possible to store them elsewhere, so I can investigate

Re: Binding lucene instance/threads to a particular processor(or core)

2008-04-21 Thread Antony Bowesman
That paper from 1997 is pretty old, but mirrors our experiences in those days. Then, we used Solaris processor sets to really improve performance by binding one of our processes to a particular CPU while leaving the other CPUs to manage the thread intensive work. You can bind processes/LWPs to

Rebuilding parallel indexes

2008-06-09 Thread Antony Bowesman
I have a design where I will be using multiple index shards to hold approx 7.5 million documents per index per month over many years. These will be large static R/O indexes but the corresponding smaller parallel index will get many frequent changes. I understand from previous replies by Hoss

Re: Rebuilding parallel indexes

2008-06-09 Thread Antony Bowesman
Andrzej Bialecki wrote: I have a thought ;) Perhaps you could use a FilteredIndexReader to maintain a map between new IDs and old IDs, and remap on the fly. Although I think that some parts of Lucene depend on the fact that in a normal index the IDs are monotonically increasing ... this would co

Modifying a document by updating a payloads?

2008-07-30 Thread Antony Bowesman
I seem to recall some discussion about updating a payload, but I can't find it. I was wondering if it were possible to use a payload to implement 'modify' of a Lucene document. For example, I have an ID field, which has a unique ID referring to an external DB. For example, I would like to stor

Re: Modifying a document by updating a payloads?

2008-07-30 Thread Antony Bowesman
Hi Mike, Unfortunately you will have to delete the old doc, then reindex a new doc, in order to change any payloads in the document's Tokens. This issue: https://issues.apache.org/jira/browse/LUCENE-1231 which is still in progress, could make updating stored (but not indexed) fields a m

Re: Per user data store

2008-08-05 Thread Antony Bowesman
Ganesh - yahoo wrote: Hello all, Documents coressponding to multiple users are to be indexed. Each user is going to search only his documents. Only Administrator could search all users data. Is it good to have one database for each User or to have only one database for all Users? Which will be

Payloads and tokenizers

2008-08-13 Thread Antony Bowesman
I started playing with payloads and have been trying to work out how to get the data into the payload. I have a field where I want to add the following untokenized fields: A1 A2 A3. With these fields, I would like to add the payloads B1 B2 B3. Firstly, it looks like you cannot add payloads to un

Re: Payloads and tokenizers

2008-08-14 Thread Antony Bowesman
Thanks for your comments Doron. I found the earlier discussions on the dev list (21/12/06), where this issue is discussed - my use case is similar to Nadav Har'El. Implementing payloads via Tokens explicitly prevents the use of payloads for untokenized fields, as they only support field.string

Fields with the same name?? - Was Re: Payloads and tokenizers

2008-08-17 Thread Antony Bowesman
I assume you already know this but just to make sure what I meant was clear - no tokenization but still indexing just means that the entire field's text becomes a single unchanged token. I believe this is exactly what SingleTokenTokenStream can buy you - a single token, for which you can pre set a
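A sketch of the approach, assuming the contrib SingleTokenTokenStream and 2.x Token/Payload APIs; the "A1"/"B1" values are the ones from the thread:

    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.miscellaneous.SingleTokenTokenStream;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.Payload;

    // one unanalyzed token "A1" carrying payload bytes "B1"
    Token token = new Token("A1", 0, 2);
    token.setPayload(new Payload("B1".getBytes()));
    Document doc = new Document();
    doc.add(new Field("x", new SingleTokenTokenStream(token)));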

Re: Fields with the same name?? - Was Re: Payloads and tokenizers

2008-08-18 Thread Antony Bowesman
Doron Cohen wrote: The API definitely doesn't promise this. AFAIK implementation wise it happens to be like this but I can be wrong and plus it might change in the future. It would make me nervous to rely on this. I made some tests and it 'seems' to work, but I agree, it also makes me nervous

Re: Multiple index performance

2008-08-18 Thread Antony Bowesman
Cyndy wrote: I want to keep user text files indexed separately, I will have about 10,000 users and each user may have about 20,000 short files, and I need to keep privacy. So the idea is to have one folder with the text files and index for each user, so when search will be done, it will be poin

Re: Multiple index performance

2008-08-18 Thread Antony Bowesman
[EMAIL PROTECTED] wrote: Thanks Anthony for your response, I did not know about that field. You make your own fields in Lucene, it is not something Lucene gives you. But still I have a problem and it is about privacy. The users are concerned about privacy and so, we thought we could have all
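A common pattern for the single-shared-index route (names illustrative): give every document an untokenized owner field and have the server wrap each user query so it can never reach another owner's documents.

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    // userQuery is whatever the user asked for; userId is resolved server-side,
    // never taken from user input
    BooleanQuery secured = new BooleanQuery();
    secured.add(userQuery, BooleanClause.Occur.MUST);
    secured.add(new TermQuery(new Term("owner", userId)), BooleanClause.Occur.MUST);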

Re: Problem with Field.Text()

2006-10-05 Thread Antony Bowesman
You have to create a new Field class with "new Field(...", i.e. replace doc.add(Field.Text with doc.add(new Field(... Antony Jan Pieper wrote: No it is not your fault, it is mine, but it also does not function. My compiler gives me this error message:

Analyzers and multiple languages

2006-10-13 Thread Antony Bowesman
Hello, I'm new to Lucene and wanted some advice on analyzers, stemmers and language analysis. I've got LIA, so have read its chapters. I am writing a framework that needs to be able to index documents from a range of languages where just the character set of the document is known. Has anyo

Email and attachments

2006-10-13 Thread Antony Bowesman
Hi, I am a newbie with Lucene and I am working out the best way to index email data. An earlier poster talked about index attachments with two alternatives: However, there is a third alternative: Each message/attachment is indexed as a separate Document with the email header data included in

Query not finding indexed data

2006-10-15 Thread Antony Bowesman
Hi, I have a field "attname" that is indexed with Field.Store.YES, Field.Index.UN_TOKENIZED. I have a document with the attname of "IqTstAdminGuide2.pdf". QueryParser parser = new QueryParser("body", new StandardAnalyzer()); Query query = parser.parse("attname:IqTstAdminGuide2.pdf"); fails

Re: Query not finding indexed data

2006-10-15 Thread Antony Bowesman
Doron Cohen wrote: Hi Antony, you cannot instruct the query parser to do that. Note that an Thanks, I suspected as much. I've changed it to make the field tokenized. field name. This is an application logic to know that a certain query is not to be tokenized. In this case you could create yo
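For an UN_TOKENIZED field the index holds the value as one exact term, so the usual alternative to fighting QueryParser is to build the term query directly:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    // exact lookup; no analysis is applied to the value
    Query query = new TermQuery(new Term("attname", "IqTstAdminGuide2.pdf"));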

idf in scores

2006-11-06 Thread Antony Bowesman
I've been trying to understand how idf is arrived at from a query. I have a single Document with 9 fields. One field "subject" has the phrase "RFC2822 - Internet Message Format" and a second "body" has the contents of rfc2822. The other fields contain additional meta data. If I search for su

Re: idf in scores

2006-11-07 Thread Antony Bowesman
Yonik Seeley wrote: idf is dependent only on the corpus, not on the individual document. The formula is here: http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html 1+log(1/2) = 0.30685282 Thanks Yonik, whilst all is not yet completely clear, it is much more so! Ant

Re: Indexing Performance issue

2006-11-16 Thread Antony Bowesman
spinergywmy wrote: Hi, I'm having a performance issue indexing PDF files. It took me more than 10 sec to index a PDF file of about 200 KB. Is it because I only have a segment file? How can I make the indexing performance better? If you're using the log4j PDFBox jar file, you must make sure

IOException question

2006-11-16 Thread Antony Bowesman
Hi, I have the IndexWriter.infoStream set to System.out and get the following merging segments _4m (2 docs) _4n (1 docs) into _4o (3 docs) java.io.IOException: Cannot delete PathToDB\_29.cfs; Will re-try later. java.io.IOException: Cannot delete PathToDB\_29.cfs; Will re-try later. Is this norm

Re: IOException question

2006-11-16 Thread Antony Bowesman
Hi Mike, Do you also have a reader open against this index? If yes, then this is totally normal on Windows. A reader holds open the segments cfs files that it is using, so when the writer tries to delete them (because they were merged) the delete fails and Lucene will try again later. Aha,

NOT queries

2006-11-21 Thread Antony Bowesman
Hi, I'm writing a mapping mechanism between an existing search interface and Lucene and wondered how to support a single NOT/- query. Given the query "-attribute", then from an earlier comment by Chris Hostetter where he says "you can't have a negative clause in isolation by itself", I assume

Limiting QueryParser

2006-11-21 Thread Antony Bowesman
Hi, I have a search UI that allows search criteria to be input against specific fields, e.g. Subject. In order to create a suitable Lucene Query, I must analyze that String so that it becomes a set of Tokens which I can then turn into Terms. QueryParser seems to fit the bill for that, howev

Re: How to do a "starts with" search

2006-11-21 Thread Antony Bowesman
Martin Braun wrote: Please refer to the answers to my question on this list: http://www.nabble.com/forum/ViewPost.jtp?post=7337585&framed=y Shortly spoken: SpanFirstQuery works like a charm :) Thanks Martin, that looks just right. I'll try it. Antony

Re: NOT queries

2006-11-21 Thread Antony Bowesman
Daniel Naber wrote: That's correct. For the "find everything" part you can use MatchAllDocsQuery. Thanks - I hadn't noticed that Query. Antony
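Daniel's suggestion in code — a purely negative query needs a positive "match everything" clause to subtract from (field/term names illustrative):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.MatchAllDocsQuery;
    import org.apache.lucene.search.TermQuery;

    // everything except documents containing field:attribute
    BooleanQuery notQuery = new BooleanQuery();
    notQuery.add(new MatchAllDocsQuery(), BooleanClause.Occur.MUST);
    notQuery.add(new TermQuery(new Term("field", "attribute")),
        BooleanClause.Occur.MUST_NOT);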

Re: Limiting QueryParser

2006-11-21 Thread Antony Bowesman
Chris Hostetter wrote: : important:conference agenda : I want to end up with : : +subject:important +subject:conference +subject:agenda : : I've written something to do this, but I know it is not as clever as QP as : currently it can only create BooleanQueries with TermQueries and cannot handle

Re: Limiting QueryParser

2006-11-21 Thread Antony Bowesman
Mark Miller wrote: if you scan the query and escape all colons (ie \:) then you should be good (I have not verified). Of course you will not be able to do a field search, but that seems to be what your after. Thanks for that suggestion. However, a standard un-escaped parse gives Input - impo

Re: Q: Wildcard searching with germ an umlauts (ä, ö, ß, ...)

2006-11-21 Thread Antony Bowesman
Stephan Spat wrote: Hello again! It replaces german umlauts, e.g. ä <=> a, ü <=> u, ... . So no umlauts are in the index. For searching I use the same Analyzer. When I do a simple search for a word with umlauts there is no problem. But if I use addidionally wildcards I suppose the word is not

Re: Limiting QueryParser

2006-11-22 Thread Antony Bowesman
Michael Rusch wrote: Sorry if I'm missing the point here, but what about simply replacing colons with spaces first? Michael. Err, thanks. I've been in too deep at the wrong end :) Wood, trees and visibility spring to mind! Antony

Re: Limiting QueryParser

2006-11-22 Thread Antony Bowesman
Erik Hatcher wrote: It doesn't seem like you need a "parser" at all for your field-specific search fields. Simply tokenize, using a Lucene Analyzer, the text field and build up a BooleanQuery of all the tokens. That's what I'm currently doing, but I was getting bogged down with trying to s
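Erik's suggestion as a sketch against the 2.0-era TokenStream API (method name hypothetical): analyze the raw input for one field and AND all resulting terms, with no query syntax involved.

    import java.io.IOException;
    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    public static BooleanQuery fieldQuery(Analyzer analyzer, String field, String text)
        throws IOException {
      BooleanQuery bq = new BooleanQuery();
      TokenStream ts = analyzer.tokenStream(field, new StringReader(text));
      // every analyzed token becomes a required TermQuery clause
      for (Token t = ts.next(); t != null; t = ts.next()) {
        bq.add(new TermQuery(new Term(field, t.termText())), BooleanClause.Occur.MUST);
      }
      ts.close();
      return bq;
    }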

Re: indexing performance issue

2006-11-30 Thread Antony Bowesman
Grant Ingersoll wrote: On Nov 30, 2006, at 10:54 AM, spinergywmy wrote: For my scenario will be every time the users upload the single file, I need to index that particular file. Previously was because the previous version of pdfbox integrate with log4j.jar file and I believe is the log4j.j

Re: indexing performance issue

2006-11-30 Thread Antony Bowesman
spinergywmy wrote: I have posted this question before and this time I found that it could be a pdfbox problem and this pdfbox I downloaded doesn't use the log4j.jar. To index the app 2.13mb pdf file took me 17s and total time to upload a file is 18s. Re: PDFBox. I have a 2.5Mb test file that

Re: de-boosting fields

2006-12-11 Thread Antony Bowesman
Daniel Naber wrote: On Saturday 09 December 2006 02:25, Scott Smith wrote: What is the best way to do this? Is changing the boost the right answer? Can a field's boost be zero? Yes, just use: term1 term2 category1^0 category2^0. Erick's Filter idea is also useful. Isn't it also true that

IOException - The handle is invalid

2006-12-21 Thread Antony Bowesman
Hi, I'm running load tests with Lucene 2.0, Sun's JDK 6 on Windows XP SP2, dual-core CPU. I have 8 worker threads adding a few hundred K documents, split between two Lucene indexes. I've started getting java.io.IOException: The handle is invalid in places like java.io.RandomAccessFile.writeByt

Re: IOException - The handle is invalid

2007-01-02 Thread Antony Bowesman
Hi Mike, I saw Mike McCandless JIRA issue http://issues.apache.org/jira/browse/LUCENE-669 Is the patch referenced there useful for a 2.0 system. I would like to use the lockless commit stuff, but am waiting until I get the core system working well. I am also getting IOException in some of

Boost/Scoring question

2007-01-30 Thread Antony Bowesman
Hi, In trying to understand scoring and boosting a bit better, I tried setting a boost of 0.0F for a field. As it's used as a multiplier, I wanted to see how it affects score. I added a single document with two fields, one with the default boost and another with a boost of 0.0F. hits.score

Re: Boost/Scoring question

2007-01-30 Thread Antony Bowesman
Chris Hostetter wrote: 1) you can never compare the score from a Hits object with the score from an Explanation. Explanation has the raw score, Hits has the pseudo-normalized score. Thanks for the comments. Where I was trying to get to was whether a match on a field with boost of 0.0 can eve

Re: Boost/Scoring question

2007-02-01 Thread Antony Bowesman
Hi Chris, : If I search for a document where the field boost is 0.0 then the document is not : found I just search that field. Is this expected??? you mean you search on: A^0and get no results even though documents contain A, and if you search on: +A^0 B^1 you see those d

Re: Boost/Scoring question

2007-02-02 Thread Antony Bowesman
Thanks a lot for your answers Hoss. This list is really well supported! Antony Chris Hostetter wrote: : It's the index time boost, rather than query time boost. This short example : shows the behaviour of searches for A... index boosts! ... totally didn't occur to me that was what you we
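The experiment in miniature, assuming 2.x Field APIs: an index-time boost of 0.0 is folded into the field norm, so a match scored only by that field comes out at 0 — which Hits drops while a custom Collector still sees the hit (compare the 2011 "Index time boost question" entry above).

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    Document doc = new Document();
    Field boosted = new Field("b", "A", Field.Store.NO, Field.Index.TOKENIZED);
    boosted.setBoost(0.0f);   // zeroes the norm at index time
    doc.add(boosted);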

Re: search on colon ":" ending words

2007-02-12 Thread Antony Bowesman
Not sure if you're still after a solution, but I had a similar issue and I modified QueryParser.jj to not treat : as a field name terminator, so work: would then just be given as work: to the analyzer and treated as a search term. Antony Felix Litman wrote: We want to be able to return a res

Positions in SpanFirst

2007-02-21 Thread Antony Bowesman
Hi, I have a field to which I add several bits of information, e.g. doc.add(new Field("x", "first bit")); doc.add(new Field("x", "second part")); doc.add(new Field("x", "third section")); I am using SpanFirstQuery to search them with something like: while... SpanTermQuery stquery = new SpanT
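For reference, the shape of the query being discussed (positions are 0-based term positions within the field):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.spans.SpanFirstQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    // match "first" only if its span ends within the first 2 positions of field "x"
    SpanTermQuery stq = new SpanTermQuery(new Term("x", "first"));
    SpanFirstQuery sfq = new SpanFirstQuery(stq, 2);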

Re: Positions in SpanFirst

2007-02-21 Thread Antony Bowesman
Hi Erick, I'm not sure you can, since all the interfaces I use alter the increment between successive terms, but I'll be the first to admit that there are many nooks and crannies that I don't know about... But I suspect that a negative increment is not supported intentionally I read your

ClassCastException/DocumentWriter and NullPointerException/RAMInputStream

2007-02-21 Thread Antony Bowesman
When adding documents to an index has anyone seen either java.lang.ClassCastException: org.apache.lucene.analysis.Token cannot be cast to org.apache.lucene.index.Posting at org.apache.lucene.index.DocumentWriter.sortPostingTable(DocumentWriter.java:238) at org.apache.lucene.index.DocumentW

Re: Positions in SpanFirst

2007-02-21 Thread Antony Bowesman
Hi Erick, What this does is allow you to put gaps between successive sets of terms indexed in the same field. For instance... doc.add("field", "some stuff"); doc.add("field", "bunch hooey"); doc.add("field", "what is this"); writer.add(doc); In this case, there would be the following positions,

Re: Positions in SpanFirst

2007-02-21 Thread Antony Bowesman
Ahh, now it falls into place. Thanks Antony Chris Hostetter wrote: it's not called Analyzer.getPositionAfterGap .. it's Analyzer.getPositionIncrementGap .. it's the Position Increment used when there is a Gap -- so returning 0 means that no exra increment is used, and multiple values are treate
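What the thread converged on, sketched: override getPositionIncrementGap so successive values of the same field don't sit at adjacent positions, which keeps a small-window SpanFirstQuery from leaking into later values. The delegate analyzer and gap size are illustrative:

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceAnalyzer;

    public class GapAnalyzer extends Analyzer {
      private final Analyzer delegate = new WhitespaceAnalyzer();

      public TokenStream tokenStream(String fieldName, Reader reader) {
        return delegate.tokenStream(fieldName, reader);
      }

      // gap added between successive values of the same field;
      // the default of 0 positions the values back-to-back
      public int getPositionIncrementGap(String fieldName) {
        return 1000;
      }
    }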

QueryParser bug?

2007-02-21 Thread Antony Bowesman
Using QueryParser to parse *tex* seems to create a PrefixQuery rather than a WildcardQuery due to the trailing *, even though there is also a leading *. As a result, this does not match, for example, "context". I've swapped the order of WILDTERM and PREFIXTERM in QueryParser.jj but

Re: QueryParser bug?

2007-02-22 Thread Antony Bowesman
in the JavaCC compiled code and I'm not familiar enough with JavaCC high level stuff to know how to make it choose based on an existing condition. Regards Antony

Re: Positions in SpanFirst

2007-02-22 Thread Antony Bowesman
I'll probably end up ducking it on the basis that the system directory defaults to a surname/firstname name order, but of course there's no guarantee that mail from other systems will have those names in that order, e.g. #1 To: Bowesman Antony #2 To: Antony Bowesman makes this 'starts

Re: search on colon ":" ending words

2007-02-22 Thread Antony Bowesman
Felix Litman wrote: Yes, thank you. How did you make that modification not to treat ":" as a field-name terminator? Is it using this, or some other way? I removed the : handling stuff from QueryParser.jj in the method: Query Clause(String field) : I removed this section --- [ LOOKAHE

TextMining.org Word extractor

2007-02-22 Thread Antony Bowesman
I'm extracting text from Word using TextMining.org extractors - it works better than POI because it extracts Word 6/95 as well as 97-2002, which POI cannot do. However, I'm trying to find out about licence issues with the TM jar. The TM website seems to be permanently hacked these days. Anyon
