Re: Lucene scoring: coord_q_d factor

2006-12-19 Thread Doug Cutting
Karl Koch wrote: Are there any other papers that regard the combination of coordination level matching and TFxIDF as advantageous? We independently developed coordination-level matching combined with TFxIDF when I worked at Apple. This is documented in: http://www.informatik.uni-trier.de/~

Re: Lucene 2.0.1 release date

2006-12-19 Thread Doug Cutting
Steven Rowe wrote: "2.1" is much more likely to be the label used for the next release than "2.0.1". The roadmap in Jira shows 21 issues scheduled for 2.0.1. If there is in fact no intent to merge these into the 2.0 branch, these should probably be retargetted for 2.1.0, and the 2.0.1 versio

Re: Oracle and Lucene Integration

2006-11-22 Thread Doug Cutting
Marcelo Ochoa wrote: Then I'll move the code outside the lucene-2.0 code tree to be packed as subdirectory of the contrib area, for example. Other alternative is to make an small zip file and send it to the list as attach as a preliminary (alpha-alpha version ;) This sounds like great potenti

Re: Searching by bit masks

2006-11-10 Thread Doug Cutting
Erick Erickson wrote: Something like Document doc = new Document(); doc.add("flag1", "Y"); doc.add("flag2", "Y"); IndexWriter.add(doc); Fields have overheads. It would be more efficient to implement this as a single field with a different value for each boolean flag (as others have suggested
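
As a rough illustration of the single-field suggestion in the reply above, the sketch below collapses the flags that are set into one space-separated value, which could then be indexed as a single field with one term per flag. It is plain Java with no Lucene calls; the class, method and flag names are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

class FlagField {
    /** Builds the text of one "flags" field: one token per flag that is set. */
    static String encode(Set<String> setFlags) {
        return String.join(" ", setFlags);
    }

    /** Mimics a term match on that field: true if the flag's token is present. */
    static boolean hasFlag(String fieldValue, String flag) {
        return Arrays.asList(fieldValue.split(" ")).contains(flag);
    }

    public static void main(String[] args) {
        Set<String> flags = new LinkedHashSet<>(Arrays.asList("flag1", "flag2"));
        String value = encode(flags);
        System.out.println(value);                   // the single field's text
        System.out.println(hasFlag(value, "flag3")); // a flag that was never set
    }
}
```

A document with no flags set simply gets an empty value, so no per-flag field is created for it.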

Re: DateTools oddity....

2006-10-18 Thread Doug Cutting
Michael J. Prichard wrote: I get this output: Tue Aug 01 21:15:45 EDT 2006 That's August 2, 2006 at 01:15:45 GMT. 20060802 Huh?! Should it be: 20060801 DateTools uses GMT. Doug
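
The date shift described in this reply can be reproduced with the standard library alone. The sketch below does not call DateTools; it only formats the same instant at day resolution in two zones, assuming "EDT" in the quoted output corresponds to the America/New_York zone. The class and method names are hypothetical.

```java
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

class GmtDayExample {
    private static final DateTimeFormatter DAY =
            DateTimeFormatter.ofPattern("yyyyMMdd", Locale.ROOT);

    /** Tue Aug 01 21:15:45 EDT 2006, the instant quoted in the thread. */
    static final ZonedDateTime SAMPLE =
            ZonedDateTime.of(2006, 8, 1, 21, 15, 45, 0, ZoneId.of("America/New_York"));

    /** Formats the same instant as a yyyyMMdd day in the given zone. */
    static String dayInZone(ZonedDateTime instant, ZoneId zone) {
        return instant.withZoneSameInstant(zone).format(DAY);
    }

    public static void main(String[] args) {
        System.out.println(dayInZone(SAMPLE, ZoneId.of("America/New_York"))); // 20060801
        System.out.println(dayInZone(SAMPLE, ZoneOffset.UTC));                // 20060802
    }
}
```

Because 21:15:45 EDT is 01:15:45 GMT on the following day, a day-resolution value computed in GMT is one day later than the local date.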

Re: Seeing what's occupying all the space in the index

2006-05-26 Thread Doug Cutting
Rob Staveley (Tom) wrote: Is there a tool I can use to see how much of the index is occupied by the different fields I am indexing? Note that IndexReader has a main() that will list the contents of compound index files. Doug

Re: Changing the scoring (newest doc date first)

2006-05-22 Thread Doug Cutting
Marcus Falck wrote: There is however one LARGE problem that we have run into. All search result should be displayed sorted with the newest document at top. We tried to accomplish this using Lucene's sort capabilites but quickly ran into large performance bottlenecks. So i figured since the default

Re: How are results merged from a multisearcher?

2006-05-22 Thread Doug Cutting
Tom Emerson wrote: Thanks for the clarification. What then is the difference between a MultiSearcher and using an IndexSearcher on a MultiReader? The results should be identical. A MultiSearcher permits use of ParallelMultiSearcher and RemoteSearchable, for parallel and/or distributed operat

Re: Ask for a better solution for the case

2006-04-28 Thread Doug Cutting
hu andy wrote: Hi, I have an application that needs to mark the retrieved documents which have been read, so that next time I needn't read the marked documents again. You could mark the documents as deleted, then later clear deletions. So long as you don't close the IndexReader, the deletions wil

Re: Lucene search benchmark/stress test tool

2006-04-27 Thread Doug Cutting
Sunil Kumar PK wrote: I want to know is there any possibility or method to merge the weight calculation of index 1 and its search in a single RPC instead of doing the both function in separate steps. To score correctly, weights from all indexes must be created before any can be searched. This

Re: RAM Directory / querying Performance issue

2006-04-26 Thread Doug Cutting
Is this markedly faster than using an MMapDirectory? Copying all this data into the Java heap (as RAMDirectory does) puts a tremendous burden on the garbage collector. MMapDirectory should be nearly as fast, but keeps the index out of the Java heap. Doug z shalev wrote: I've rewritten

Re: Using Lucene for searching tokens, not storing them.

2006-04-14 Thread Doug Cutting
karl wettin wrote: Do I have to worry about passing a null Directory to the default constructor? A null Directory should not cause you problems. Doug

Re: Using Lucene for searching tokens, not storing them.

2006-04-14 Thread Doug Cutting
karl wettin wrote: I would like to store all in my application rather than using the Lucene persistency mechanism for tokens. I only want the search mechanism. I do not need the IndexReader and IndexWriter as that will be a natural part of my application. I only want to use the Searchable.

Re: MultiReader and MultiSearcher

2006-04-11 Thread Doug Cutting
Peter Keegan wrote: Oops. I meant to say: Does this mean that an IndexSearcher constructed from a MultiReader doesn't merge the search results and sort the results as if there was only one index? It doesn't have to, since a MultiReader *is* a single index. A quick test indicates that it does

Re: Distributed Lucene.. - clustering as a requirement

2006-04-10 Thread Doug Cutting
Dmitry Goldenberg wrote: For an enterprise-level application, Lucene appears too file-system and too byte-sequence-centric a technology. Just my opinion. The Directory API is just too low-level. There are good reasons why Lucene is not built on top of a RDBMS. An inverted index is not effi

Re: Lucene Document order not being maintained?

2006-04-05 Thread Doug Cutting
Dan Armbrust wrote: My indexing process works as follows (and some of this is hold-over from the time before lucene had a compound file format - so bear with me) I open up a File based index - using a merge factor of 90, and in my current test, the compound index format. When I have added 100

Re: Data structure of a Lucene Index

2006-03-30 Thread Doug Cutting
I talked about this a bit in a presentation at Haifa last year: http://www.haifa.ibm.com/Workshops/ir2005/papers/DougCutting-Haifa05.pdf See the section on "Seek versus Transfer". Doug Prasenjit Mukherjee wrote: It seems to me that lucene doesn't use B-tree for its indexing storage. Any paper

Re: Lucene Performance Issues

2006-03-28 Thread Doug Cutting
thomasg wrote: Hi, we are currently intending to implement a document storage / search tool using Jackrabbit and Lucene. We have been approached by a commercial search and indexing organisation called ISYS who are suggesting the following problems with using Lucene. We do have a requirement to st

Re: Lucene indexing on Hadoop distributed file system

2006-03-27 Thread Doug Cutting
Igor Bolotin wrote: Does it make sense to change TermInfosWriter.FORMAT in the patch? Yes. This should be updated for any change to the format of the file, and this certainly constitutes a format change. This discussion should move to [EMAIL PROTECTED] Doug

Re: Lucene indexing on Hadoop distributed file system

2006-03-27 Thread Doug Cutting
Igor Bolotin wrote: If somebody is interested - I can post our changes in TermInfosWriter and SegmentTermEnum code, although they are pretty trivial. Please submit this as a patch attached to a bug report. I contemplated making this change to Lucene myself, when writing Nutch's FsDirectory, b

Re: span query scoring vs boolean query scoring

2006-03-27 Thread Doug Cutting
Vincent Le Maout wrote: Am I missing something? Is it intended or is it a bug? Looks like a bug. Can you please submit a bug report, and, ideally, attach a patch? Thanks, Doug

Re: span query scoring vs boolean query scoring

2006-03-27 Thread Doug Cutting
Vincent Le Maout wrote: Am I missing something? Is it intended or is it a bug? Looks like a bug. Can you submit a patch? Doug

Re: lucene NFS support

2006-03-23 Thread Doug Cutting
Dai, Chunhe wrote: Does anyone know whether Lucene plans to support NFS in later release(2.0)? We are planning to integrate Lucene into our products and cluster support is definitely needed. We want to check whether NFS support is in the plan or not before implementing a new file locking ourselve

Re: Multiple threads in Lucene

2006-03-23 Thread Doug Cutting
Olivier Jaquemet wrote: IndexReader.unlock(indexDir); // unlock directory in case of unproper shutdown This should be used very carefully. In particular, you should only call it when you are certain that no other applications are accessing the index. Doug

Re: Lookup Issues

2006-03-22 Thread Doug Cutting
The Hits-based search API is optimized for returning earlier hits. If you want the lowest-scoring matches, then you could reverse-sort the hits, so that these are returned first. Or you could use the TopDocs-based API to retrieve hits up to your "toHits". (Hits-based search is implemented us

Re: Lucene job

2006-03-17 Thread Doug Cutting
Michael Wechner wrote: Maybe it would make sense to sort it alphabetically [ ... ] +1 This should be sorted alphabetically by business name or last name. That's what it says on the page, although a few entries are out of place. Please feel free to fix this. Doug

Re: Throughput doesn't increase when using more concurrent threads

2006-03-17 Thread Doug Cutting
Peter Keegan wrote: I did some additional testing with Chris's patch and mine (based on Doug's note) vs. no patch and found that all 3 produced the same throughput - about 330 qps - over a longer period. Was CPU utilization 100%? If not, where do you think the bottleneck is now? Network? Or

Re: Lucene and Tomcat, too many open files

2006-03-16 Thread Doug Cutting
Are you changing the default mergeFactor or other settings? If so, how? Large mergeFactors are generally a bad idea: they don't make things faster in the long run and they chew up file handles. Are all searches reusing a single IndexReader? They should. This is the other most common reason

Re: TooManyClauses exception in Lucene (1.4)

2006-03-16 Thread Doug Cutting
Erick Erickson wrote: Could you point me to any explanation of *why* range queries expand this way? It's just what they do. They were contributed a long time ago, before things like RangeFilter or ConstantScoreRangeQuery were written. The latter are relatively recent additions to Lucene and

Re: Can Lucene load more then 2GB into RAM memory?

2006-03-16 Thread Doug Cutting
and it seems like performance is basically the same if not better!!! if anyone is interested let me know Doug Cutting <[EMAIL PROTECTED]> wrote: RAMDirectory is indeed currently limited to 2GB. This would not be too hard to fix. Please file a bug report. Better yet, attach a patch.

Re: PhraseQuery and edit distance slightly confusing.

2006-03-15 Thread Doug Cutting
Dawid Weiss wrote: I get the concept implemented in PhraseQuery but isn't calling it an edit distance a little bit far fetched? Yes, it should probably be called "edit-distance-like" or something. Only the marginal elements (minimum and maximum distance from their respective query positions)

Re: Can Lucene load more then 2GB into RAM memory?

2006-03-13 Thread Doug Cutting
RAMDirectory is indeed currently limited to 2GB. This would not be too hard to fix. Please file a bug report. Better yet, attach a patch. I assume you're running a 64bit JVM. If so, then MMapDirectory might also work well for you. Doug z shalev wrote: this is in continuation of a pr

Re: Throughput doesn't increase when using more concurrent threads

2006-03-07 Thread Doug Cutting
Peter Keegan wrote: I ran a query performance tester against 8-cpu and 16-cpu Xeon servers (16/32 cpu hyperthreaded). on Linux. Here are the results: 8-cpu: 275 qps 16-cpu: 305 qps (the dual-core Opteron servers are still faster) Here is the stack trace of 8 of the 16 query threads during the

Re: Lucene version 1.9

2006-03-07 Thread Doug Cutting
WATHELET Thomas wrote: I've created an index with Lucene version 1.9, and when I try to open this index I always get this error message: java.lang.ArrayIndexOutOfBoundsException. If I use an index built with Lucene version 1.4.3, it works. What's wrong? Are you perhaps trying to open

Lucene 1.9.1 release available

2006-03-03 Thread Doug Cutting
Release 1.9.1 of Lucene is now available from: http://www.apache.org/dyn/closer.cgi/lucene/java/ This fixes a serious bug in 1.9-final. It is strongly recommended that all 1.9-final users upgrade to 1.9.1. For details see: http://svn.apache.org/repos/asf/lucene/java/tags/lucene_1_9_1/CHANGES.

Lucene 1.9-final release available

2006-03-01 Thread Doug Cutting
Release 1.9-final of Lucene is now available from: http://www.apache.org/dyn/closer.cgi/lucene/java/ This release has many improvements since release 1.4.3, including new features, performance improvements, bug fixes, etc. For details, see: http://svn.apache.org/viewcvs.cgi/*checkout*/lucene/j

Re: Hacking proximity search: looking for feedback

2006-03-01 Thread Doug Cutting
Jeff Rodenburg wrote: Following on the Range Query approach, how is performance? I found the range approach (albeit with the exact values) to be slower than the parsed-string approach I posited. Note that Hoss suggested RangeFilter, not RangeQuery. Or perhaps ConstantScoreRangeQuery, which i

Re: Frequency of phrase

2006-02-24 Thread Doug Cutting
Eric Jain wrote: This gives you the number of documents containing the phrase, rather than the number of occurrences of the phrase itself, but that may in fact be good enough... If you use a span query then you can get the actual number of phrase instances. Doug

Re: Indexing speed

2006-02-24 Thread Doug Cutting
revati joshi wrote: Hi all, I just wanted to know how to increase the speed of indexing files. I tried a multithreading approach but couldn't get much better performance; it was the same as with the usual sequential indexing. Is there any other approach to get better Inde

Lucene 1.9 RC1 release available

2006-02-22 Thread Doug Cutting
Release 1.9 RC1 of Lucene is now available from: http://www.apache.org/dyn/closer.cgi/lucene/java/ This release candidate has many improvements since release 1.4.3, including new features, performance improvements, bug fixes, etc. For details, see: http://svn.apache.org/viewcvs.cgi/*checkout*/

Re: BM25 Similarity implementation

2006-02-16 Thread Doug Cutting
Trieschnigg, R.B. (Dolf) wrote: I would like to implement the Okapi BM25 weighting function using my own Similarity implementation. Unfortunately BM25 requires the document length in the score calculation, which is not provided by the Scorer. How do you want to measure document length? If th

Re: CompoundFileReader question/'leaking' file descriptors ?

2006-02-13 Thread Doug Cutting
Paul Smith wrote: is 1.9 binary backward compatible? (both source code and index format). That is the intent. Try a nightly build: http://cvs.apache.org/dist/lucene/java/nightly/ Doug

Re: Boosting

2006-02-13 Thread Doug Cutting
Sebastian Menge wrote: Or, to put it more simple, what does a boost of "2" or "10" _mean_ in contrast to a boost of "0.5" or "0.1" !? Boosts are simply multiplied into scores. So they only mean something in the context of the rest of the scoring mechanism. http://lucene.apache.org/java/docs

Re: CompoundFileReader question/'leaking' file descriptors ?

2006-02-13 Thread Doug Cutting
Paul Smith wrote: We're using Lucene 1.4.3, and after hunting around in the source code just to see what I might be missing, I came across this, and I'd just like some comments. Please try using a 1.9 build to see if this is something that's perhaps already been fixed. CompoundFileReader

Re: [SPAM] - Re: Performance tips? - Sending mail server found on bl.spamcop.net

2006-01-27 Thread Doug Cutting
Daniel Pfeifer wrote: Are we both talking about Lucene? I am using Lucene 1.4.3 and can't find a class called MapDirectory or MMapDirectory. It is post-1.4. You can download a nightly build of the current trunk at: http://cvs.apache.org/dist/lucene/java/nightly/ Doug

Re: Performance tips?

2006-01-27 Thread Doug Cutting
Daniel Pfeifer wrote: We are sporting Solaris 10 on a Sun Fire-machine with four cores and 12GB of RAM and mirrored Ultra 320-disks. I guess I could try switching to FSDirectory and hope for the best. Or, since you're on a 64-bit platform, try MMapDirectory, which supports greater parallelism

Re: Throughput doesn't increase when using more concurrent threads

2006-01-26 Thread Doug Cutting
Doug Cutting wrote: A 64-bit JVM with NioDirectory would really be optimal for this. Oops. I meant MMapDirectory, not NioDirectory. Doug

Re: Throughput doesn't increase when using more concurrent threads

2006-01-26 Thread Doug Cutting
Peter Keegan wrote: The throughput is worse with NioFSDIrectory than with the FSDIrectory (patched and unpatched). The bottleneck still seems to be synchronization, this time in NioFile.getChannel (7 of the 8 threads were blocked there during one snapshot). I tried this with 4 and 8 channels.

Re: Throughput doesn't increase when using more concurrent threads

2006-01-25 Thread Doug Cutting
Peter Keegan wrote: This is just fyi - in my stress tests on a 8-cpu box (that's 8 real cpus), the maximum throughput occurred with just 4 query threads. The query throughput decreased with fewer than 4 or greater than 4 query threads. The entire index was most likely in the file system cache, t

Re: Lucene Logo? (high resolution)

2006-01-19 Thread Doug Cutting
Daniel Rabus wrote: I've created an Semantic Desktop application using Lucene. For a presentation I'd like to create a poster. Unfortunately I haven't found any high resolution version (or vector graphic) of the Lucene logo. At http://svn.apache.org/repos/asf/lucene/java/trunk/docs/images/ only

Re: BTree

2006-01-12 Thread Doug Cutting
B-trees are best for random, incremental updates. They require log_b(N) disk accesses for inserts, deletes and accesses, where b is the number of entries per page, and N is the total number of entries in the tree. But that's too slow for text indexing. Rather, Lucene uses a combination of fi
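
To make the log_b(N) figure concrete, here is a small worked calculation. The page size (100 entries) and tree size (one billion entries) are illustrative numbers chosen for this sketch, not values from the message, and the class name is hypothetical.

```java
class BTreeDepth {
    /**
     * Smallest h such that b^h >= n, i.e. ceil(log_b(n)) for n >= 1 and b >= 2.
     * Integer arithmetic avoids floating-point rounding; inputs are assumed
     * small enough that the running product does not overflow a long.
     */
    static int levels(long entriesPerPage, long totalEntries) {
        int h = 0;
        long capacity = 1;
        while (capacity < totalEntries) {
            capacity *= entriesPerPage;
            h++;
        }
        return h;
    }

    public static void main(String[] args) {
        // 100^4 = 10^8 is too small and 100^5 = 10^10 is enough, so this prints 5.
        System.out.println(levels(100, 1_000_000_000L));
    }
}
```

Under these assumptions each uncached insert or lookup touches about five pages, which is the per-operation cost the message describes as too slow for text indexing.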

Re: AW: Boolean Query

2006-01-12 Thread Doug Cutting
Klaus wrote: I have tried to study to lucene scoring in the default similarity. Can anyone explain me, how this similarity was designed? I have read a lot of IR literature, but I have never seen an equation like the one used in lucene. Why is this better then the normal cosine-measure? It degen

Re: IndexReader.open crashes JVM

2005-12-15 Thread Doug Cutting
chandler burgess wrote: Im using lucene1.4.3 on a XP machine with jdk1.5. Any help is appreciated. Try typing control-break to get some stack dumps. I also recommend building the current Lucene code from subversion and trying that. There have been lots of improvements since 1.4.3. It woul

Re: Merging with IndexWriter.addIndexes(...)

2005-12-08 Thread Doug Cutting
J.J. Larrea wrote: So... I notice that both IndexWriter.addIndexes(...) merge methods start and end with calls to optimize() on the target index. I'm not sure whether that is causing the unpacking and repacking I observe, but it does wonder whether they truly need to be there: I don't recall

Re: Lucene performance bottlenecks

2005-12-07 Thread Doug Cutting
Andrzej Bialecki wrote: It's nice to have these couple percent... however, it doesn't solve the main problem; I need 50 or more percent increase... :-) and I suspect this can be achieved only by some radical changes in the way Nutch uses Lucene. It seems the default query structure is too compl

Re: Lucene performance bottlenecks

2005-12-07 Thread Doug Cutting
Paul Elschot wrote: Querying the host field like this in a web page index can be dangerous business. For example when term1 is "wikipedia" and term2 is "org", the query will match at least all pages from wikipedia.org. Note that if you search for wikipedia.org in Nutch this is interpreted as a

Re: Lucene performance bottlenecks

2005-12-02 Thread Doug Cutting
Andrzej Bialecki wrote: For a simple TermQuery, if the DF(term) is above 10%, the response time from IndexSearcher.search() is around 400ms (repeatable, after warm-up). For such complex phrase queries the response time is around 1 sec or more (again, after warm-up). Are you specifying -server

Re: IndexReader locking

2005-11-28 Thread Doug Cutting
IndexReader locks the index while opening it to prohibit an IndexWriter from deleting any of the files in that index until all are opened. Lock files are not stored in the index directory since write access to an index should not be required to lock it while opening an IndexReader. Doug Dani

Re: Throughput doesn't increase when using more concurrent threads

2005-11-21 Thread Doug Cutting
Jay Booth wrote: I had a similar problem with threading, the problem turned out to be that in the back end of the FSDirectory class I believe it was, there was a synchronized block on the actual RandomAccessFile resource when reading a block of data from it... high-concurrency situations caused t

Re: Memory Usage

2005-11-17 Thread Doug Cutting
Daniel Noll wrote: Doug Cutting wrote: Daniel Noll wrote: I actually did throw a lot of terms in, and eventually chose "one" for the tests because it was the slowest query to complete of them all (hence I figured it was already spending some fairly long time in I/O, and would be

Re: Memory Usage

2005-11-17 Thread Doug Cutting
Daniel Noll wrote: I actually did throw a lot of terms in, and eventually chose "one" for the tests because it was the slowest query to complete of them all (hence I figured it was already spending some fairly long time in I/O, and would be penalised the most.) Every other query was around 7ms

Re: Filtering on a SpanQuery without losing spans

2005-11-16 Thread Doug Cutting
Greg K wrote: Now, however, I'd like to be able restrict the search to certain documents in the index, so I don't have to stream through a couple of thousand spans to produce the 10 excerpts on a subset of the documents. I've tried added a term to the SpanNearQueries that targets a keyword field

Re: Memory Usage

2005-11-16 Thread Doug Cutting
Daniel Noll wrote: Timings were obtained by performing the same search 1,000 times and averaging the total time. This was then performed five times in a row to get the range that's displayed below. Memory usage was obtained using a 20-second sleep after loading the index, and then using the Win

Re: Memory Usage

2005-11-14 Thread Doug Cutting
Marvin Humphrey wrote: You *can't* set it on the reader end. If you could set it, the reader would get out of sync and break. The value is set per-segment at write time, and the reader has to be able to adapt on the fly. It would actually not be too hard to change things so that there was

Re: Sentence boundary storage

2005-10-30 Thread Doug Cutting
Chris Hostetter wrote: : One thing that I know has bogged me is when matching a phrase where I : would expect mathematical formula (which is "just a subphrase"). I : would have liked the phrase-query to extend as far as it wishes but not : passed a given token... would this be possible ? : Presum

Re: trying to boost a phrase higher than its individual words

2005-10-30 Thread Doug Cutting
Erik Hatcher wrote: On 28 Oct 2005, at 22:31, Andy Lee wrote: You know what, I was confusing Nutch and Lucene classes (as I've done before), in this case the IndexSearcher classes. Sorry. The Nutch names are bad. I'm continually amazed at Doug's ability to build these using only emacs - h

Re: query across fields?

2005-10-11 Thread Doug Cutting
Marc Hadfield wrote: In the SpanNear (or for that matter PhraseQuery), one can set a slop value where 0 (zero) means one following after the other. How can one differentiate between Terms at the **same** position vs. one after the other? The following queries only match "x" and "y" at the sa

Re: query across fields?

2005-10-10 Thread Doug Cutting
Marc Hadfield wrote: I'll give Span Query's a try as they can handle the 0 increment issue. Note that PhraseQuery can now handle this too. Doug

Re: query across fields?

2005-10-10 Thread Doug Cutting
Marc Hadfield wrote: I actually mention your option in my email: In principle I could store the full text in two fields with the second field containing the types without incrementing the token index. Then, do a SpanQuery for "Johnson" and "name" with a distance of 0. The resulting match w

Re: query across fields?

2005-10-10 Thread Doug Cutting
Marc Hadfield wrote: I would prefer not to mix the full text and "types" in the same field as it would make the term positions inconsistent which i depend on for other queries. Why not store them in the same field using positionIncrement=0 for the types? Then they won't change positions of n
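
The effect of positionIncrement=0 can be sketched without Lucene, assuming a token's position is the running sum of the increments seen so far, starting so that a first increment of 1 lands on position 0. The token stream and class name below are hypothetical.

```java
class PositionIncrements {
    /** Converts per-token position increments into absolute positions. */
    static int[] positions(int[] increments) {
        int[] result = new int[increments.length];
        int position = -1; // a first increment of 1 then yields position 0
        for (int i = 0; i < increments.length; i++) {
            position += increments[i];
            result[i] = position;
        }
        return result;
    }

    public static void main(String[] args) {
        String[] tokens = {"Mr", "Johnson", "name", "said"};
        int[] increments = {1, 1, 0, 1}; // the type token "name" uses increment 0
        int[] pos = positions(increments);
        for (int i = 0; i < tokens.length; i++) {
            System.out.println(tokens[i] + " @ " + pos[i]);
        }
        // "Johnson" and "name" share position 1; "said" is still at position 2.
    }
}
```

The type token sits on top of the word it annotates, so the positions of the ordinary text tokens are the same as they would be without the type tokens.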

Re: IllegalArgumentException: attempt to access a deleted document

2005-10-06 Thread Doug Cutting
Peter Kim wrote: I noticed one way to get around this is to use IndexReader.isDeleted() to check if it's deleted or not. The problem with that is I only have access to a MultiSearcher in my HitCollector which doesn't give me access to the underlying IndexReader. I don't want to have to open an In

Re: IndexWriter.optimize() need to much time.

2005-10-05 Thread Doug Cutting
Eric Louvard wrote: my problem is that IndexWriter.optimize() take 20 minutes. OK it is not a lot of time, but I can't allow me to block the system such a long time :-(. If you're worried about blocking, queue changes to the index and have a separate thread which processes the queue, adding a
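
The queueing idea in this reply might look roughly like the following. It is a sketch under stated assumptions: changes are represented as plain strings, and the list `applied` stands in for the real index writer, which is not shown. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class IndexUpdateQueue {
    // Sentinel compared by identity, so no real change can collide with it.
    private static final String STOP = new String("STOP");

    private final BlockingQueue<String> pending = new LinkedBlockingQueue<>();
    private final List<String> applied = Collections.synchronizedList(new ArrayList<>());
    private final Thread worker = new Thread(this::drain, "index-updater");

    /** Creates the queue and starts its single background worker. */
    static IndexUpdateQueue start() {
        IndexUpdateQueue queue = new IndexUpdateQueue();
        queue.worker.start();
        return queue;
    }

    /** Called by application threads; returns immediately, never blocks on the index. */
    void enqueue(String change) {
        pending.add(change);
    }

    /** Lets the worker finish everything queued so far, then returns what was applied. */
    List<String> shutdown() {
        pending.add(STOP);
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return applied;
    }

    private void drain() {
        try {
            while (true) {
                String change = pending.take();
                if (change == STOP) {
                    return;
                }
                applied.add(change); // real code would add/delete/optimize here
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Only the worker thread ever touches the index in this arrangement, so a long-running operation delays queued changes but not the callers of `enqueue`.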

Re: Performance Improvments?

2005-10-04 Thread Doug Cutting
Palmer, Andrew MMI Woking wrote: I am looking at changing the value BufferedIndexOutput.BUFFER_SIZE from 1024 to maybe 8192. Has anyone done anything similar and did they get any performance improvements. I doubt this will speed things much. Generally I am looking to reduce the time it ta

Re: A very technical question.

2005-09-28 Thread Doug Cutting
Dawid Weiss wrote: I have a very technical question. I need to alter document score (or in fact: document boosts) for an existing index, but for each query. In other words, I'd like these to have pseudo-queries of the form: 1. civil war PREFER:shorter 2. civil war PREFER:longer for these two

Re: Is Lucene right for my app?

2005-09-18 Thread Doug Cutting
Jeff Rodenburg wrote: My suggestion to you: pick up a copy of Lucene in Action. [ ...] The authors lurk on this list. They're pretty chatty for lurkers. http://en.wikipedia.org/wiki/Lurker But good advice nonetheless! Cheers, Doug

Re: OutOfMemoryError on addIndexes()

2005-08-18 Thread Doug Cutting
Tony Schwartz wrote: What about the TermInfosReader class? It appears to read the entire term set for the segment into 3 arrays. Am I seeing double on this one? p.s. I am looking at the current sources. see TermInfosReader.ensureIndexIsRead(); The index only has 1/128 of the terms, by def

Re: Indexing document instances and retrieving instance attributes

2005-08-18 Thread Doug Cutting
Chris D wrote: Well in my case field order is important, but the order of the individual fields isn't. So I can speed up getFields to roughly O(1) by implementing Document as follows. Have you actually found getFields to be a performance bottleneck in your application? I'd be surprised if it

Re: OutOfMemory error when searching

2005-08-18 Thread Doug Cutting
Fredrik wrote: Opening the index with Luke, I can see the following: Number of fields: 17 Number of documents: 1165726 Number of terms: 6721726 The size of the index is approx 5,3 GB. Lucene version is 1.4.3. The index contains Norwegian terms, but lots of inline HTML, etc is probably increasin

Re: OutOfMemoryError on addIndexes()

2005-08-18 Thread Doug Cutting
Tony Schwartz wrote: I think you're jumping into the conversation too late. What you have said here does not address the problem at hand. That is, in TermInfosReader, all terms in the segment get loaded into three very large arrays. That's not true. Only 1/128th of the terms are loaded by
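
The 1/128 sampling mentioned in this reply can be pictured as a sparse index over a sorted term dictionary. The sketch below is not Lucene's TermInfosReader code. It assumes a sorted array standing in for the on-disk dictionary and keeps every 128th term in memory; all names are hypothetical.

```java
import java.util.Arrays;

class SparseTermIndex {
    static final int INTERVAL = 128;

    private final String[] allTerms;   // stands in for the sorted on-disk dictionary
    private final String[] indexTerms; // every INTERVAL-th term, held in memory

    SparseTermIndex(String[] sortedTerms) {
        this.allTerms = sortedTerms;
        int n = (sortedTerms.length + INTERVAL - 1) / INTERVAL;
        this.indexTerms = new String[n];
        for (int i = 0; i < n; i++) {
            indexTerms[i] = sortedTerms[i * INTERVAL];
        }
    }

    int inMemoryTermCount() {
        return indexTerms.length;
    }

    /** Returns the term's position in the full dictionary, or -1 if absent. */
    int find(String term) {
        int pos = Arrays.binarySearch(indexTerms, term);
        // On a miss, binarySearch returns -(insertionPoint) - 1; the block to
        // scan starts at the last sampled term <= term, i.e. insertionPoint - 1.
        int block = pos >= 0 ? pos : -pos - 2;
        if (block < 0) {
            return -1; // sorts before the first term
        }
        int start = block * INTERVAL;
        int end = Math.min(start + INTERVAL, allTerms.length);
        for (int i = start; i < end; i++) { // at most INTERVAL comparisons
            int cmp = allTerms[i].compareTo(term);
            if (cmp == 0) {
                return i;
            }
            if (cmp > 0) {
                break;
            }
        }
        return -1;
    }
}
```

With 1,000 terms this keeps only 8 in memory, and a lookup costs one binary search over those 8 plus a scan of at most 128 dictionary entries.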

Re: Why is Hits.java not Serializable?

2005-08-10 Thread Doug Cutting
Ali Rouhi wrote: I can think of 3 reasons why search methods returning Hits objects are not exposed in Searchable: 1) Someone forgot to declare Hits Serializable 2) There is a fundamental reason the forms of search which return Hits objects cannot be called remotely, some non optimal form of se

Re: Regarding range queries.

2005-08-09 Thread Doug Cutting
Tony, If your improvements are of general utility, please contribute them. Even if they are not, post them as-is and perhaps someone will take the time to make them more reusable. Cheers, Doug Tony Schwartz wrote: I think there are a few things that should be added to lucene to really give

Re: "docMap" array in SegmentMergeInfo

2005-07-13 Thread Doug Cutting
Lokesh Bajaj wrote: For a very large index where we might want to delete/replace some documents, this would require a lot of memory (for 100 million documents, this would need 381 MB of memory). Is there any reason why this was implemented this way? In practice this has not been an issue. A
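
The 381 MB figure in the question is consistent with one 4-byte entry per document. That entry size is an assumption made for this sketch rather than something stated in the message, and the class name is hypothetical.

```java
class DocMapMemory {
    /** Size in MiB of an array holding one fixed-size entry per document. */
    static double mebibytes(long docCount, int bytesPerEntry) {
        return docCount * (double) bytesPerEntry / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        // 100,000,000 docs * 4 bytes = 400,000,000 bytes, about 381.5 MiB.
        System.out.printf("%.1f MiB%n", mebibytes(100_000_000L, 4));
    }
}
```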

Re: Queries boost and scoring problems

2005-06-15 Thread Doug Cutting
The method Similarity.queryNorm() normalizes query term weights. To disable this you could define it to return 1.0 in your own Similarity implementation. http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html#queryNorm(float) Doug Robichaud, Jean-Philippe wrote: Ok,

Re: Preserving original HTML file offsets for highlighting, need HTMLTokenizer?

2005-06-03 Thread Doug Cutting
Fred Toth wrote: I'm thinking we need something like "HTMLTokenizer" which bridges the gap between StandardAnalyzer and an external HTML parser. Since so many of us are dealing with HTML, I would think this would be generally useful for many problems. It could work this way: Given this input: H

Re: managing docids for ParallelReader

2005-06-03 Thread Doug Cutting
Sebastian Marius Kirsch wrote: I took up your suggestion to use a ParallelReader for adding more fields to existing documents. I now have two indexes with the same number of documents, but different fields. Does search work using the ParallelReader? One field is duplicated (the id field.) Wh

Re: Indexing multiple languages

2005-06-03 Thread Doug Cutting
Tansley, Robert wrote: What if we're trying to index multiple languages in the same site? Is it best to have: 1/ one index for all languages 2/ one index for all languages, with an extra language field so searches can be constrained to a particular language 3/ separate indices for each language

Re: managing docids for ParallelReader (was Augmenting an existing index)

2005-05-31 Thread Doug Cutting
Matt Quail wrote: I have a similar problem, for which ParallelReader looks like a good solution -- except for the problem of creating a set of indices with matching document numbers. I have wondered about this as well. Are there any *sure fire* ways of creating (and updating) two indices so

Re: Indexing in multi-threaded environment

2005-05-10 Thread Doug Cutting
Chris Lamprecht wrote: I've done exactly what you describe, using N threads where N is the number of processors on the machine, plus one more thread that writes to the file system index (since that is I/O-bound anyway). Since most of the CPU time is tokenizing/stemming/etc, the method works well.

Re: Distribution Strategies?

2005-05-10 Thread Doug Cutting
Steven J. Owens wrote: A friend just asked me for advice about synchronizing lucene indexes across a very large number of servers. I haven't really delved that deeply into this sort of stuff, but I've seen a variety of comments here about similar topics. Are there are any well-known approach

Re: Deletes and Hits

2005-05-04 Thread Doug Cutting
Scott Smith wrote: Any other solutions or comments? Use a different IndexReader for searching than you use for deletions? Doug

Re: PerFieldSimilarity

2005-05-04 Thread Doug Cutting
Robichaud, Jean-Philippe wrote: How cool, I did not knew that... that may help me... If I understand you correctly, I can create a boolean query where each "clause" use a different similarity ? Yes. That would look something like: BooleanQuery booleanQuery = new BooleanQuery(); TermQuery clause1

Re: PerFieldSimilarity

2005-05-04 Thread Doug Cutting
Robichaud, Jean-Philippe wrote: Again, I can change the similarity of the reader at run-time and issue specific queries, summing the score myself, but that is pretty inefficient. You can also specify a Similarity implementation per Query node in a complex query, e.g.: BooleanQuery query = new Boo

Re: Results ranking on filtered multi-field query

2005-05-02 Thread Doug Cutting
Chuck Williams wrote: I found this to be a problem as well and created alternative classes, DistributedMultiFieldQueryParser and MaxDisjunctionQuery, which are available here: http://issues.apache.org/bugzilla/show_bug.cgi?id=32674 You might check these out and see if they provide the ranking y

Re: Indexing of virtual "made up" documents

2005-04-27 Thread Doug Cutting
Morus Walter wrote: Alternatively it should be able to write a query that does such a scoring directly (without the document start anchor) by the same means proximity query uses. Proximity query uses positional information so it should be possible to use that information for scoring based on docum

Re: CVS Lucene 2.0

2005-04-26 Thread Doug Cutting
Yonik Seeley wrote: I don't think at this point anything structural has been proposed as different between 1.9 and 2.0. Are any of Paul Elschot's query and scorer changes being considered for 2.0? 1.9 and 2.0 will be what's in the SVN trunk. Many of Paul's changes have already been committed. Ar

Re: CVS Lucene 2.0

2005-04-25 Thread Doug Cutting
George Aroush wrote: I would like to see a source release of 1.9, a packaged source release as ZIP/TAR. Is that possible? There is no 1.9 release. It is a *planned* release at this point. When a release is actually made, then you will be able to download it. Doug

Re: Fields with same name boosting

2005-04-15 Thread Doug Cutting
Peter Veentjer - Anchor Men wrote: I have question about field boosting. If I have 2 (or more) fields with the same fieldname in a single document, and I boost one of those, than only that one will be boosted? Or will all fields with the same name be boosted? I guess only one field is boosted, bu

Re: Update performance/indexwriter.delete()?

2005-04-14 Thread Doug Cutting
Roy Klein wrote: I think this is a better way of asking my original questions: "Why was this designed this way?" In order to optimize updates. "Can it be changed to optimize updates?" Updates are fastest when additions and deletions are separately batched. That is the design. Doug

Re: Update performance/indexwriter.delete()?

2005-04-14 Thread Doug Cutting
Yonik Seeley wrote: There are times, however, when it would be nice for deletes to be able to be concurrent with adds. It would also be nice if good coffee was free. Q: can docids change after an add() (with merging segments going on behind the scenes) or is optimize() the only call that ends up ch

Re: Reverting QueryParser ?

2005-04-14 Thread Doug Cutting
Paul Libbrecht wrote: I am currently evaluating the need for an elaborate query data-structure (to be exchanged over XML-RPC) as opposed to working with plain strings. I'd opt for both. For example: "java based" -coffee site apache.org d
