Many thanks to Erik and Ollie for responding - a lot of ideas and I'll have
my work cut out grokking them properly and thinking about what to do.
I'll respond further as that develops.
One quick thing though - Erik wrote:
> So, I wonder if your out of memory issue is really related to the number
Right, as described in my book:
> The Oracle database furnishes an embedded Java run time, which can be
> used by database components such as XDB, *inter*Media, Spatial, Text,
> XQuery, and so on. Oracle Text leverages the XML DB framework, which
> includes a protocol server and a
True. But is it enough faster than TermDocs.seek(new Term("unique id",
id)).doc() to be worth the complication for this situation? ...
Erick
On 10/17/06, Daniel Noll <[EMAIL PROTECTED]> wrote:
Erick Erickson wrote:
> Why go through all this effort when it's easy to make your own unique ID?
I
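For reference, the lookup being weighed here, as a minimal sketch against the Lucene 2.0 API (field name, value, and index path are assumptions):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;

    // Map an externally assigned unique ID to the current internal doc number.
    IndexReader reader = IndexReader.open("/path/to/index");
    TermDocs termDocs = reader.termDocs(new Term("myuniqueid", "42"));
    int docNum = -1;
    if (termDocs.next()) {
        docNum = termDocs.doc();  // internal Lucene document number
    }
    termDocs.close();
    reader.close();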
I can certainly vouch for the benefits of partitioning; we've seen a very
big improvement in searcher refresh times (our main pain point) since we
implemented such an architecture.
Our application has thousands of indexes, ranging in size from a few
megabytes up to several gigabytes, and updates occur very frequently
Another option is to run Lucene inside your Oracle instance using
its JVM. This might help with combining Lucene and Oracle search
results.
On Oct 17, 2006, at 12:39 PM, Chris Lu wrote:
Several additional reasons I can think of:
1) Being able to control the algorithms, for example,
1.1)
Hi,
The IndexModifier class always opens an IndexWriter in its init
method. If we need to update a document, it closes the IndexWriter and
opens an IndexReader to delete the desired document. Then it opens the
IndexWriter again to add the document to the index.
Instead, can't we pass one extra
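For clarity, the cycle being described is roughly this, sketched against the Lucene 2.0 API (index path, key field, and the new Document are assumptions):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    // Delete the old version of the document by its unique key...
    IndexReader reader = IndexReader.open("/path/to/index");
    reader.deleteDocuments(new Term("myuniqueid", "42"));
    reader.close();

    // ...then reopen an IndexWriter and add the new version.
    IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
    writer.addDocument(newVersionOfDocument);  // a Document built by the caller
    writer.close();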
Erick Erickson wrote:
Why go through all this effort when it's easy to make your own unique ID?
I can think of one reason: hits.id() is orders of magnitude faster than
hits.doc().
Daniel
--
Daniel Noll
Nuix Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia
Ph: +61 2 9280 0699
W
Thanks for the explanation.
I am using ChainedFilter, and it is taking somewhat more time than using just
one Filter.
I read somewhere on the Lucene forums that Filter speed can be improved if we
build a large BitSet once and then work on it. Is this possible, and if so,
how? I would like to know
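One possibility, assuming the same filter is reused across many searches: wrap it in CachingWrapperFilter so its BitSet is computed once per IndexReader (a sketch; myChainedFilter, query, and searcher are placeholders):

    import org.apache.lucene.search.CachingWrapperFilter;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.search.Hits;

    // The wrapped filter's BitSet is cached per IndexReader, so repeated
    // searches against the same reader skip the recomputation.
    Filter cached = new CachingWrapperFilter(myChainedFilter);
    Hits hits = searcher.search(query, cached);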
On 10/17/06, vasu shah <[EMAIL PROTECTED]> wrote:
Can anyone please tell me what the difference is between PrefixFilter and
WildcardQuery as far as memory is concerned?
I saw the code of PrefixFilter and it gets a TermEnum for all the terms in the
index. Won't this consume memory?
It t
Hi,
Can anyone please tell me what the difference is between PrefixFilter and
WildcardQuery as far as memory is concerned?
I saw the code of PrefixFilter and it gets a TermEnum for all the terms in the
index. Won't this consume memory?
I started using PrefixFilter, ConstantSc
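For comparison, the filter-based prefix search looks something like this (a sketch; field name, prefix, and searcher are assumptions):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.ConstantScoreQuery;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.PrefixFilter;
    import org.apache.lucene.search.Query;

    // The filter walks the matching terms and sets bits in a single BitSet,
    // instead of expanding into one BooleanQuery clause per term.
    Query q = new ConstantScoreQuery(new PrefixFilter(new Term("title", "abc")));
    Hits hits = searcher.search(q);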
I've been curious for a while about this scheme, and I'm hoping you
implement it and tell me if it works. In truth, my data is pretty static
so I haven't had to worry about it much. That said...
Would it do (and, perhaps, be less complex) to have a FSDirectory and a
RAMDirectory that you search?
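A minimal sketch of that combination using a MultiSearcher (paths and analyzer are assumptions); new documents would go to the RAMDirectory and be flushed to disk periodically:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MultiSearcher;
    import org.apache.lucene.search.Searchable;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.RAMDirectory;

    // Recent documents live in RAM; the bulk of the index stays on disk.
    RAMDirectory ramDir = new RAMDirectory();
    new IndexWriter(ramDir, new StandardAnalyzer(), true).close();  // initialize the empty RAM index

    Searchable[] searchables = {
        new IndexSearcher(FSDirectory.getDirectory("/path/to/index", false)),
        new IndexSearcher(ramDir)
    };
    MultiSearcher searcher = new MultiSearcher(searchables);  // queries both as one logical index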
Hi chaps,
Just looking for some ideas/experience as to how to improve our
current architecture.
We have a single-index system containing approx. 2.5 million docs of
about 1-3k each.
The Lucene implementation is a daemon that services requests on
a port in a multi-threaded manner, and it runs on
Several additional reasons I can think of:
1) Being able to control the algorithms, for example,
1.1) applying your own analyzer to a field.
1.2) control your own way of ranking
2) De-couple your data model from the searching
Searching directly on your data model may not be ideal. You may want
It all has to do with the total focus on strings in an inverted index, as
opposed to the more general model in an RDBMS.
Lucene doesn't need to track the maximum length. It sees each date as a
string and compares all strings lexicographically. That
means 20060401 is less than 20060401HHMMSS f
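A one-line illustration of that ordering:

    // Lexicographic order is what makes zero-padded date strings range correctly:
    System.out.println("20060401".compareTo("20060401235959") < 0);  // prints true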
Under the covers, as I understand it, a BooleanQuery is assembled with one
clause for each unique term in the range. So, if you store your dates with
milliseconds, there can be, what, 86,000,000+ unique terms per day. If you
stored your times as strings to millisecond resolution, you can have a lot of clauses in
Another solution is to work with plain Java Date and Calendar objects,
convert them into Lucene strings using DateTools (day resolution), and
then query this field with two RangeFilters combined using ChainedFilter.
You will never get the BooleanQuery error.
Peter
On Oct 17, 2006, at 10:57 AM, Bushey, John wrote
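A sketch of that suggestion (field names, the two Date values, and the AND combination are assumptions; ChainedFilter lives in contrib/miscellaneous):

    import java.util.Date;
    import org.apache.lucene.document.DateTools;
    import org.apache.lucene.misc.ChainedFilter;  // contrib/miscellaneous
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.search.RangeFilter;

    // Day-resolution bounds: a RangeFilter never expands into a BooleanQuery,
    // so there is no clause limit to hit.
    String lower = DateTools.dateToString(fromDate, DateTools.Resolution.DAY);
    String upper = DateTools.dateToString(toDate, DateTools.Resolution.DAY);
    Filter created = new RangeFilter("createdDate", lower, upper, true, true);
    Filter modified = new RangeFilter("modifiedDate", lower, upper, true, true);
    Filter combined = new ChainedFilter(new Filter[] { created, modified }, ChainedFilter.AND);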
Thanks. That's the explanation that I was looking for. The wiki does
not cover this in much detail. The architectural reason for this sounds
strange to me, since my background is in relational databases, where this
is not an issue, so I still have a question. How does reducing the
precision really help
See also relevant FAQ entry & Wiki page:
http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-06fafb5d19e786a50fb3dfb8821a6af9f37aa831
http://wiki.apache.org/jakarta-lucene/LargeScaleDateRangeProcessing
"Steven Parkes" <[EMAIL PROTECTED]> wrote on 17/10/2006 09:12:55:
> Lucene takes your date
karl wettin wrote:
On 17 Oct 2006, at 17:54, Find Me wrote:
How to eliminate near duplicates from the index?
I would probably try to measure the Euclidean distance between all
documents, computed on terms and their positions. Or perhaps use
standard deviation to find the distribution of terms in a document.
On 17 Oct 2006, at 17:54, Find Me wrote:
How to eliminate near duplicates from the index?
Oh, one more thing. You should probably look at the norms in order to
avoid comparing all documents to each other.
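A sketch of the norms idea (field name, threshold, and doc numbers are assumptions): read the length norm for each document and only compare pairs whose norms are close.

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Similarity;

    // With no boosts, the norm roughly encodes 1/sqrt(doc length), so
    // documents with very different norms cannot be near-duplicates.
    byte[] norms = reader.norms("contents");
    float normA = Similarity.decodeNorm(norms[docA]);
    float normB = Similarity.decodeNorm(norms[docB]);
    boolean worthComparing = Math.abs(normA - normB) < 0.01f;  // threshold assumed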
We used Oracle interMedia/Text for search within the RDBMS from Oracle
8i through Oracle 10g. Two primary reasons we switched to Solr/Lucene:
* We saw random errors (< 0.1% of the time) when users ran full-text search.
We believe this error originated during index updates, as
On 17 Oct 2006, at 17:54, Find Me wrote:
How to eliminate near duplicates from the index?
I would probably try to measure the Euclidean distance between all
documents, computed on terms and their positions. Or perhaps use
standard deviation to find the distribution of terms in a document.
On
On 17 Oct 2006, at 15:55, Ariel Isaac Romero Cartaya wrote:
Here are pieces of my source code:
public Hits search(String query) throws IOException {
    for (int i = 0; i < IndexCount; i++) {
        searchables[i] = new IndexSearcher(RAMIndexsManager.getInstance().getDir
Why go through all this effort when it's easy to make your own unique ID?
Add a new field to each document "myuniqueid" and fill it in yourself. It'll
never change then.
The complex coordination way.
To coordinate things, you could keep the last ID used (and maybe other
information) in a unique
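A sketch of adding such a field at index time (field name and ID source are assumptions):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    // Store the ID untokenized so it is a single exact term; it survives
    // merges and deletes, unlike Lucene's internal document numbers.
    Document doc = new Document();
    doc.add(new Field("myuniqueid", String.valueOf(nextId++),
            Field.Store.YES, Field.Index.UN_TOKENIZED));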
Lucene takes your date range, enumerates all the unique date/time values
in your corpus within that range, and then executes that query. So the
number of terms in your query is going to be equal to the number of
unique date/time values in the range.
The most common way of handling this is to not i
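The usual reduced-resolution indexing looks something like this (a sketch; the field name is an assumption):

    import java.util.Date;
    import org.apache.lucene.document.DateTools;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    // At Resolution.DAY a whole year contributes at most 366 unique terms,
    // versus tens of millions per day at millisecond resolution.
    Document doc = new Document();
    doc.add(new Field("date",
            DateTools.dateToString(new Date(), DateTools.Resolution.DAY),
            Field.Store.YES, Field.Index.UN_TOKENIZED));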
Hi -
I'm currently looking into adding full text search capabilities to our
site. While some threads in this list had the same basic question (RDBMS
full-text versus Lucene), their configurations and concerns were different.
Here's my configuration:
* RDBMS is Enterprise Oracle 10g
* RAC-enabled
How to eliminate near duplicates from the index? Someone suggested that I
could look at the TermVectors and do a comparison to remove the duplicates.
One major problem with this is that the structure of the document is no longer
taken into account. Are there any obvious pitfalls? For example: Document A being
I think the idea is that 2.0.1 would be a patch-fix release from the
branch created at the 2.0 release. This release would incorporate only
back-ported high-impact patches, where "high-impact" is defined by the
community. Certainly security vulnerabilities would be included. As Otis
said, to date, nobody
On 10/17/06, Johan Stuyts <[EMAIL PROTECTED]> wrote:
So my questions are: is there a way to prevent the IndexWriter from
merging, forcing it to create a new segment for each indexing batch?
Already done in the Lucene trunk:
http://issues.apache.org/jira/browse/LUCENE-672
Background:
http://www
Hi,
I'm trying to come up with the best design for a problem.
I want to search texts for expressions that shouldn't be found in them.
My bad expressions list is quite stable. But the texts that I want to scan
change often.
Design I
Index my texts, and then loop on my expressions list to see i
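A sketch of Design I (index path, field name, phrase quoting, and the badExpressions array are assumptions; exception handling elided):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;

    // One search per bad expression against the indexed texts.
    IndexSearcher searcher = new IndexSearcher("/path/to/textIndex");
    QueryParser parser = new QueryParser("body", new StandardAnalyzer());
    for (int i = 0; i < badExpressions.length; i++) {
        Hits hits = searcher.search(parser.parse("\"" + badExpressions[i] + "\""));
        if (hits.length() > 0) {
            // these texts contain the forbidden expression badExpressions[i]
        }
    }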
Ignore the bit about keeping the mappings, it's too tricky unless really
really necessary, since by virtue of updating the meta-data document, you'll
delete a document, thus perhaps changing the Lucene IDs.
I should proofread before hitting the "send" button ...
Erick
On 10/17/06, Erick Erickso
Hi,
(I am using Lucene 2.0.0)
I have been looking at a way to use stable IDs with Lucene. The reason I
want this is so I can efficiently store and retrieve information outside
of Lucene for filtering search results. It looks like this is going to
require most of Lucene to be rewritten, so I gave
Here are pieces of my source code:
First of all, given a query string, I search in all the indexes with a
parallel searcher. As you can see, I make a multi-field query. Then you can
see the index format I use; I store all the fields in the index. My index is
optimized.
public Hits search
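From that description, the setup is presumably along these lines (a sketch; RAMIndexsManager is the poster's own class, and the field names are assumptions):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.MultiFieldQueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ParallelMultiSearcher;
    import org.apache.lucene.search.Searchable;

    // One IndexSearcher per RAM index, searched in parallel.
    Searchable[] searchables = new Searchable[IndexCount];
    for (int i = 0; i < IndexCount; i++) {
        searchables[i] = new IndexSearcher(RAMIndexsManager.getInstance().getDir(i));
    }
    ParallelMultiSearcher searcher = new ParallelMultiSearcher(searchables);
    String[] fields = { "title", "author", "body" };  // assumed field names
    Hits hits = searcher.search(MultiFieldQueryParser.parse(query, fields, new StandardAnalyzer()));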
Thanks for all your help.
I used PrefixFilter, ChainedFilter, CachingWrapperFilter, ConstantScoreQuery
and the search speed has been dramatically improved. I am just doing wildcard
searches like abc*.
It used to give me OOM problems with WildcardQuery. Will I get the same
problem with
Take a look at the explain functionality on the Searcher
On Oct 17, 2006, at 5:43 AM, Mukesh Bhardwaj wrote:
Hi,
If I do a search such as "field1:jim OR field2:bob" is there any way to
determine for each document that was a hit, which field caused the hit?
Or rather, since they both migh
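A minimal sketch of that (searcher and query come from the surrounding search code):

    import org.apache.lucene.search.Explanation;
    import org.apache.lucene.search.Hits;

    // The explanation tree names each matching clause, and each clause names
    // its field, e.g. weight(field1:jim in 42).
    Hits hits = searcher.search(query);
    for (int i = 0; i < hits.length(); i++) {
        Explanation expl = searcher.explain(query, hits.id(i));
        System.out.println(expl.toString());
    }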
Hi,
If I do a search such as "field1:jim OR field2:bob" is there any way to
determine for each document that was a hit, which field caused the hit?
Or rather, since they both might, is there any easy way to find out
which fields definitely cause a hit?
Regards,
--Mukesh