See
http://lucene.apache.org/core/4_3_1/queryparser/org/apache/lucene/queryparser/complexPhrase/ComplexPhraseQueryParser.html
From: Ian Lea
To: java-user@lucene.apache.org
Sent: Tuesday, 27 August 2013, 10:16
Subject: Re: Wildcard in PhraseQuery
See the FAQ
Answering my own question - add optional new MatchAllDocsQuery("text") clause
to factor in the encoded norms from the "text" field.
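A minimal sketch of how that optional clause might be added (assuming the Lucene 3.x API, where MatchAllDocsQuery could still take a norms field; "origQuery" stands for the original non-text query):
BooleanQuery bq = new BooleanQuery();
bq.add(origQuery, BooleanClause.Occur.MUST);
// matches everything, but factors the "text" field's norms
// (and hence the index-time document boost) into the score
bq.add(new MatchAllDocsQuery("text"), BooleanClause.Occur.SHOULD);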
____
From: mark harwood
To: "java-user@lucene.apache.org"
Sent: Friday, 25 January 2013, 16:11
Subj
I have a 3.6 index with many no-norms fields and a single text field with norms
(a fairly common configuration). There is a document boost I have set at
index-time that will have been encoded into the text field's norms.
If I query solely on a non-text field then the ranking does not apply the boost
encoded into the text field's norms.
This was part of the rationale for introducing the XML Query Parser:
1) An extensible query syntax that is expressive enough to represent the full
range of Lucene functions (filters, moreLikeThis etc)
2) Serializable
3) Language independent
4) Decouples the holder of query criteria from the impl
Hi Brandon,
Can you start by calling toString on the parse result (the Query object) to
see what is being produced, and post that here.
On the face of it, it sounds like it should work OK. What happens if you use
the "normal" query parser on your query "time to leave" - that should parse OK
as
DuplicateFilter has been mostly broken since Lucene's switch over to
segment-level filtering.
Since v2.9 the calls to Filter.getDocIdSet no longer pass a "top-level" reader
for accessing the whole index and instead pass a reader restricted to only
accessing a single segment's contents.
Becaus
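For context, the relevant contract after that change looks roughly like this (a sketch of the 2.9/3.x Filter class):
public abstract class Filter {
  // since 2.9 "reader" covers a single segment only, not the whole index,
  // so logic needing cross-segment visibility (like duplicate spotting) breaks
  public abstract DocIdSet getDocIdSet(IndexReader reader) throws IOException;
}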
>>> Ideally I'd like to take any ANDed clauses and require them to occur
>>> within $SPAN of the other ANDs.
See ComplexPhraseQueryParser?
Like the standard QueryParser it uses quotes to define a phrase but also
interprets any special characters between the quotes e.g. ( ) * ~
The syntax and
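For illustration, a minimal sketch (assuming the 4.3 queryparser module linked above; the field name and query text are invented):
ComplexPhraseQueryParser parser = new ComplexPhraseQueryParser(
    Version.LUCENE_43, "text", new StandardAnalyzer(Version.LUCENE_43));
// wildcard and fuzzy syntax is honoured inside the quotes
Query q = parser.parse("\"(john jon jonathan~) peters*\"");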
Many considerations here - I find the technical concerns you present typically
open a can of worms for any business worried about security.
It gets political quickly.
In environments where security is paramount, software must be formally
accredited, which is a costly exercise.
Often the choi
I've created a couple of sequence diagrams of core Lucene 4.0 classes that may
be of use to others:
Low-level classes used while writing indexes
http://goo.gl/dI3HY
Low-level classes used while reading indexes:
http://goo.gl/e8JEj
FWIW I found the websequencediagrams.com editor in these lin
    value: 20
  }
}
doc 2:
{
  form: { id: 1040 }
  attrib: {
    name: age
    value: 22
  }
}
On Mon, May 21, 2012 at 3:24 PM, Mark Harwood wrote:
> You're describing what I call the "cross matching" problem if you flatten
> nested,
You're describing what I call the "cross matching" problem if you flatten
nested, repeating structures with multiple fields into a single flat Lucene
document model.
The approach for handling the more complex mappings is to use nested child docs
in Lucene and for that look at BlockJoinQuery.
Ho
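A rough sketch of the block-join approach (assuming the 3.4+ join module, where BlockJoinQuery became ToParentBlockJoinQuery; field names are invented):
// index a block: child docs first, parent doc last
List<Document> block = new ArrayList<Document>();
block.add(childDoc1);
block.add(childDoc2);
block.add(parentDoc); // parent carries a marker field, e.g. type=parent
writer.addDocuments(block);
// query the children, join matches up to their parents
Filter parents = new CachingWrapperFilter(
    new QueryWrapperFilter(new TermQuery(new Term("type", "parent"))));
Query childQuery = new TermQuery(new Term("name", "age"));
Query join = new ToParentBlockJoinQuery(childQuery, parents, ScoreMode.Avg);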
Your requirement does not sound like a good fit for the nested stuff but is
probably more one for conventional faceting.
I would characterise the uses for Nested as follows:
1) The parent of a nested block is typically the "item of interest" that is
returned i.e. the search results are a list of parent documents
>
> Other parameters such as filters, faceting, highlighting, sorting,
> etc, don't normally have any hierarchy.
I regularly mix filters and queries inside Boolean logic. Attempts to structure
data (e.g. geocoding) don't always achieve 100% coverage and so for better
recall you must also resor
I don't think of queries as inherently flat in the way HTTP request parameters
are with their name=value pairings.
JSON or XML can reflect more closely the hierarchy in the underlying Lucene
query objects.
For me using a "flat" query interface feels a bit like when you start off
trying to manag
>> > Avg lookup time slightly less than a HashSet? Interesting.
Scratch that. A new dataset and revised code shows HashSets out in front (but
still not a realistic option for very large sets) : http://goo.gl/Lb4J1
In this benchmark I removed the code common to all previous tests which was
firs
slightly less than a HashSet? Interesting. Is the code
> to these benchmarks available somewhere?
>
> Dawid
>
> On Tue, Oct 25, 2011 at 9:57 PM, Grant Ingersoll wrote:
>>
>> On Oct 25, 2011, at 11:26 AM, mark harwood wrote:
>>
>>> using Lucene that don't fit under the core premise of full text search
I've had several use cases over the years that use features peculiar to Lucene
but here's a very simple one I came across today that illustrates its raw index
lookup capability:
I needed a fast, scalable and persistent "S
Check "norms" are disabled on your fields because they'll cost you1byte x
NumberOfDocs x numberOfFieldsWithNormsEnabled.
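A sketch of switching norms off at indexing time (3.x Field API; field names invented):
// un-analyzed identifier field, no norms
doc.add(new Field("id", id, Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));
// analyzed field where length normalisation isn't wanted
doc.add(new Field("tags", tags, Field.Store.NO, Field.Index.ANALYZED_NO_NORMS));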
On 16 Aug 2011, at 15:11, Bennett, Tony wrote:
> Thank you for your response.
>
> You are correct, we are sorting on timestamp.
> Timestamp has microsecond granularity, a
From: Michael McCandless
To: java-user@lucene.apache.org
Sent: Tue, 28 June, 2011 14:59:48
Subject: Re: Corrupt segments file full of zeros
On Tue, Jun 28, 2011 at 9:29 AM, mark harwood wrote:
> Hi Mike.
>>>Hmmm -- what code are you running here, to pr
Hi Mike.
>>Hmmm -- what code are you running here, to print the number of docs?
SegmentInfos.setInfoStream(System.out); // trace segment file reads
FSDirectory dir = FSDirectory.open(new File("j:/indexes/myindex"));
IndexReader r = IndexReader.open(dir, true); // open read-only
System.out.println("index has " + r.maxDoc() + " docs");
From my
According to the spec there should at least be an Int32 of -9 to declare the
Format - http://lucene.apache.org/java/2_9_3/fileformats.html#Segments File
- Original Message
From: Uwe Schindler
To: java-user@lucene.apache.org
Sent: Tue, 28 June, 2011 12:32:34
Subject: RE: Corrupt segme
See Highlighter's GradientFormatter
Cheers
Mark
On 16 Jun 2011, at 22:01, Itamar Syn-Hershko wrote:
> Hi all,
>
>
> Interesting question: is it possible to color search results in a web-page
> based on their score? e.g. most relevant results in green, and then different
> shades through ora
Partitioning and replication are the keys to handling data and user volumes
respectively.
However, this approach introduces some other concerns over consistency and
availability of content which I've tried to capture
here: http://www.slideshare.net/MarkHarwood/patterns-for-large-scale-search
Th
As of 3.2 the necessary changes were put in to safely support indexing nested
docs. See
http://lucene.apache.org/java/3_2_0/changes/Changes.html#3.2.0.new_features
On 6 Jun 2011, at 17:18, 周诚 wrote:
> I just saw this:
> https://issues.apache.org/jira/secure/attachment/12480123/LUCENE-2454.patc
Of course IDF is a factor too, meaning a match on a single term that is rare
across the whole index may be worth more than a match on two different terms
that are common in it.
As Ian suggests a custom Similarity implementation can be used to tune this out.
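A sketch of that suggestion (assuming the 2.x/3.x Similarity API):
Similarity flatIdf = new DefaultSimilarity() {
  @Override
  public float idf(int docFreq, int numDocs) {
    return 1.0f; // ignore term rarity entirely
  }
};
searcher.setSimilarity(flatIdf);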
- Original Message
From: Ian Lea
To: j
See https://issues.apache.org/jira/browse/LUCENE-1720
- Original Message
From: Alex vB
To: java-user@lucene.apache.org
Sent: Wed, 16 March, 2011 0:12:41
Subject: Early Termination
Hi,
is Lucene capable of any early termination techniques during query
processing?
On the forum I only fo
This is possible using contrib's DuplicateFilter.
Below is an example of your problem defined as an XML-based test which I just
ran OK through my test writer/runner.
Hopefully this is readable and demonstrates the use of
FilteredQuery/DuplicateFilter.
This is my test
Somewhat historic reasons.
It used to be that IndexWriter was the only place you could define this setting
(making it an index-time decision burnt into the index).
The IndexReader option is a relatively newer addition that adds the flexibility
to decide about memory usage whenever you open the index (
Probably off-topic for a Lucene list but the typical database options are:
1) an auto-updated "last changed" timestamp column on related tables that can
be
queried
2) a database trigger automatically feeding a "to-be-indexed" table
Option 1 would also need a "marked as deleted" column adding to
>> 1. Why does ir.hashCode() return a different value every time I run
>> this code?
Presumably because it is a different object instance in a different JVM?
IndexReader.hashCode() and IndexReader.equals() are not designed to
represent/summarise the physical contents of an index.
They
See the Collocation stuff here https://issues.apache.org/jira/browse/LUCENE-474
- Original Message
From: Lucene
To: java-user@lucene.apache.org
Sent: Tue, 26 October, 2010 13:27:06
Subject: Next Word - Any Suggestions?
Am about to implement a custom query that is sort of mash-up of Fac
Can you not just call reader.docFreq(categoryTerm)?
The returned figure includes deleted docs, but then the search for the term
uses this method too, so it should suffer from the same inaccuracy.
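For example (field and term invented):
Term categoryTerm = new Term("category", "books");
int count = reader.docFreq(categoryTerm); // includes deleted docs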
Cheers
Mark
- Original Message
From: Max Jakob
To: java-user@lucene.apache.org
Sent: Mon, 18 Octobe
the last commit.
Mike
On Tue, Oct 5, 2010 at 6:45 PM, Mark Harwood wrote:
> OK. I'll double check the reports.
> So presumably when merges occur outside of transaction control (post commit)
>the post-merge update of the segments_N file is managed safely somehow?
> I can see the
>
> In both 2.4 and 2.9.x (and all later versions), neither .prepareCommit
> nor .commit wait for merges.
>
> That said, if a merge happens to complete before you call those
> methods, then it is in fact committed.
>
> Mike
>
> On Tue, Oct 5, 2010 at 1:13 PM, Mar
Having upgraded a live system from 2.4 to 2.9.3 the client is reporting a
change in merge behaviour that is causing some issues with their update
monitoring logic.
The suggestion is that any merge operations now complete as part of the
IW.prepareCommit() call rather than previously when they ra
A pretty thorough exploration of the issues in federated search here:
http://ilpubs.stanford.edu:8090/271/
I'd add "security" i.e. authentication and authorisation to the list of issues
to be considered (key in some environments).
If you consolidate content in a centralised Solr/Lucene indexing
.set(docs[0]);
> }
>
>>> That could involve a lot of disk seeks unless you cache a pk->docid lookup
>>> in ram.
> That sounds interesting. How would the pk->docid lookup get populated?
> Wouldn't a pk->docid cache be invalidated with each commit or merge?
Re scalability of filter construction - the database is likely to hold stable
primary keys, not Lucene doc ids, which are unstable in the face of updates. You
therefore need a quick way of converting stable database keys read from the db
into current Lucene doc ids to create the filter. That could
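A rough sketch of that conversion (2.x/3.x TermDocs API; the "pk" field name is invented):
OpenBitSet bits = new OpenBitSet(reader.maxDoc());
TermDocs td = reader.termDocs();
for (String key : databaseKeys) { // stable keys read from the db
  td.seek(new Term("pk", key));
  if (td.next()) {
    bits.set(td.doc()); // current Lucene doc id for that key
  }
}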
Lucene 2454 includes an example of matching logic that respects the structure
in XML documents (see (https://issues.apache.org/jira/browse/LUCENE-2454 )
The example class TestNestedDocumentQuery queries xhtml marked up with hResume
syntax.
We don't have XQuery syntax support in a parser now (an
Check out lucene 2454 and accompanying slide show if your reason for doing this
is modelling repeating elements.
On 9 Jul 2010, at 13:43, "Hans-Gunther Birken" wrote:
> I'm examining the following search problem. Consider a document with two
> multi-va
The DuplicateFilter passed to the searcher does not have visibility of the text
query and is therefore evaluated independently from all other criteria.
Sounds like the behaviour you want is to get the last duplicate that also
matches your criteria, which seems like something fairly common to need
terest and Query
objects record match metadata in singleton MatchAttribute objects as they
stream their way through result sets.
Result set streaming and tokenisation streams are similar problems and the
Attribute design seems like it can apply here.
Cheers
Mark
Le 11-mai-10 à 12:02, mark harwo
See https://issues.apache.org/jira/browse/LUCENE-1999
- Original Message
From: Paul Libbrecht
To: java-user@lucene.apache.org
Sent: Tue, 11 May, 2010 10:52:14
Subject: Re: best way to interest two queries?
Dear lucene experts,
Let me try to make this precise since there was not answe
Not the fastest thing in the world but works:
Term startTerm = new Term("myFieldName", "");
TermEnum te = reader.terms(startTerm); // enumerate this field's terms from the start
BitSet docsRead = new BitSet(reader.maxDoc()); // one bit per doc seen so far
boolean multiValued = false;
rietary metadata system, and any other config
resource to hook into Luke.
That would be pretty cool
- Original Message
From: Andrzej Bialecki
To: java-user@lucene.apache.org
Sent: Fri, 5 March, 2010 11:11:12
Subject: Re: SpanQueries in Luke
On 2010-03-05 11:22, mark harwood wrote:
things like
Solr's config.
Cheers,
Mark
- Original Message
From: Andrzej Bialecki
To: java-user@lucene.apache.org
Sent: Fri, 5 March, 2010 10:03:23
Subject: Re: SpanQueries in Luke
On 2010-03-05 10:47, mark harwood wrote:
>>No, this simply means that you will be able to use the xml-query-parser
>>instead of the regular one
Not sure exactly what you have in mind for an editor, Andrzej but there is an
opportunity to do something smart here for little effort.
The XMLQueryParser comes with a DTD which means you ca
> Mark Harwood wrote:
>> Yes it is being maintained and I have it in product
This was part of the rationale for creating the XMLQueryParser which can be
found in contrib.
See here for the background:
http://marc.info/?l=lucene-dev&m=113355526731460&w=2
On 17 Feb 2010, at 18:44, Aaron Schon wrote:
> Hi all, I know that persisting a Lucene query by query ToString() meth
This could be down to IDF, i.e. "Lucane" is ranked higher because it is rarer
despite having a worse edit distance.
This is arguably a bug.
See http://issues.apache.org/jira/browse/LUCENE-329 which discusses this. You
could try subclassing QueryParser and overriding newFuzzyQuery to return
FuzzyLikeThisQuery.
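A sketch of that override (assuming the 2.9/3.x contrib FuzzyLikeThisQuery; the max-terms constant is a guess):
QueryParser qp = new QueryParser(Version.LUCENE_30, "name", analyzer) {
  @Override
  protected Query newFuzzyQuery(Term term, float minimumSimilarity, int prefixLength) {
    FuzzyLikeThisQuery flt = new FuzzyLikeThisQuery(32, getAnalyzer());
    flt.addTerms(term.text(), term.field(), minimumSimilarity, prefixLength);
    return flt;
  }
};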
Re Mike's delegating custom query suggestion - see
https://issues.apache.org/jira/browse/LUCENE-1999
- Original Message
From: Michael McCandless
To: java-user@lucene.apache.org
Sent: Mon, 15 February, 2010 10:03:30
Subject: Re: Further refinement of search results - distinguishing hi
rd analyzer and changed the
> name to be added to the index to "Mr.\\ Kumar"
> but still couldn't get it to work.
> Rohit Banga
> On Tue, Feb 9, 2010 at 1:06 PM, Mark Harwood wrote:
>
>> I suspect it is because QueryPa
I suspect it is because QueryParser uses space characters to separate different
clauses in a query string while you want the space to represent some content in
your "name" field. Try escaping the space character.
Cheers
Mark
On 9 Feb 2010, at 07:26, Rohit Banga wrote:
> Hello
>
> i have a f
Try calling rewrite on the query object to expand it, then call toString on
the result.
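For example (the field name is a placeholder):
Query expanded = query.rewrite(reader); // wildcard/fuzzy clauses become concrete terms
System.out.println(expanded.toString("field"));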
Cheers,
Mark
-
On 1 Feb 2010, at 21:32, "Haghighi, Nariman" wrote:
> We are relying on the ComplexPhraseQueryParser for some impressive
> matching capabilities.
>
> Of concern is that Wildcard Queries,
> > Do you think I can get any advantage from building a solution on
> Lucene?
Lucene is generally about information retrieval not information extraction (as
suggested, GATE or UIMA are more commonly used for extraction).
However, Lucene can play a role in extraction if you use it for determining
Hi Fayyaz,
>>I have found an error in the web.xml file,
Good job! I found an error in your code so that makes us even :)
It looks like you removed the line in the "openExampleIndex" method which opens
the searcher.
That explains your null pointer.
The problem you found in the web.xml isn't a
It could be the "merge contiguous fragments" feature, which attempts to
do exactly this to improve readability.
It's an option you can turn off.
On 15 Nov 2009, at 01:21, Felipe Lobo wrote:
Hi, I'm having some problems with the size of the fragments when I'm doing
the highlight. I pass on the
So many questions...
>>Which one will be better
As in:
* Faster to implement?
* Faster to search?
* Faster to update?
* Cheaper in licenses?
* More robust?
* Easier to maintain?
* Easier to backup?
Are results sorted by :
* quality (e.g. when using fuzzy text matching)?
* distance?
* pric
I have a client with a 700 million doc index running on a SAN. The performance is
very good but this obviously depends on your choice of SAN config. In this
environment I have multiple search servers happily hitting the same physical
lucene index on the SAN. The servers negotiate with each other via
Since you can't (and it doesn't make sense to) use wildcards in phrase
queries,
You can with this:
http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/misc/src/java/org/apache/lucene/queryParser/complexPhrase/
Discussion here: http://tinyurl.com/lrnage
Cheers,
Mark
Or https://issues.apache.org/jira/browse/LUCENE-1720 offers lightweight timeout
testing at all index access stages prior to calls to Collector, e.g. it will
catch a runaway fuzzy query during its expensive term expansion phase.
- Original Message
From: Uwe Schindler
To: java-user@lucene
>>It removes the duplicates at query time and not in the results.
Not sure I understand that statement. Do you mean you want index-time
rejection of potentially duplicate inserts?
On 4 Sep 2009, at 07:01, Ganesh wrote:
It removes the duplicates at query time and not in the results.
--
See "DuplicateFilter" in contrib.
http://markmail.org/message/lsvnpu7mwhht3a4p
Cheers
Mark
- Original Message
From: Ganesh
To: java-user@lucene.apache.org
Sent: Wednesday, 2 September, 2009 12:38:35
Subject: Re: First result in the group
I have a field called category and all docume
>>I need to start off with this project where we can find the ranking of
>>controversial articles. Could anyone kindly help me how to start?
Check out the wikipedia "logging" dumps which contain the reasons for actions
on page titles (including ip blocks and deletes) but without the bulk of the
I just tried the norms idea as well; no change.
You'll need to look at searcher.explain() for the two docs or post a
Junit or code example that can be executed which shows the issue
-
To unsubscribe, e-mail: java-user-unsubscr...@l
If the term is "X Y", document 2 is getting a higher score than
document 1.
That may be length normalisation at play. Doc 2 is shorter so may be
seen as a better match for that reason.
Using the "explain" function helps illustrate the break down of scores
in matches.
You could try index
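For example (names invented):
Explanation e = searcher.explain(query, docId);
System.out.println(e.toString()); // shows the fieldNorm/lengthNorm contributions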
ts() + " hits");
The result is 0 hits (should be 640).
[1] tinyurl.com/ml52ye
2009/7/4 Mark Harwood :
>
> Check out booleanfilter in contrib/queries. It can be wrapped in a
> constantScoreQuery
>
>
>
> On 4 Jul 2009, at 17:37, Lukas Michelbacher
> wrote:
>
I would appreciate it if I can get help with the code as well.
If you want to tweak an existing example rather than coding entirely
from scratch the XMLQueryParser in /contrib has a demo web app for job
search with a "location" field similar in principle to your "state"
field plus it has a G
Check out booleanfilter in contrib/queries. It can be wrapped in a
constantScoreQuery
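A minimal sketch (contrib/queries as of 2.4-3.x; field and term invented):
TermsFilter required = new TermsFilter();
required.addTerm(new Term("contents", "lucene"));
BooleanFilter bf = new BooleanFilter();
bf.add(new FilterClause(required, BooleanClause.Occur.MUST));
Query scoreless = new ConstantScoreQuery(bf); // every match gets the same score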
On 4 Jul 2009, at 17:37, Lukas Michelbacher
wrote:
This is about an experiment comparing plain Boolean retrieval with
vector-space-based retrieval.
I would like to disable all of Lucene's scoring mechani
On 1 Jul 2009, at 17:39, k.sayama wrote:
I could verify Token byte offsets
The system outputs
aaa:0:3
bbb:0:3
ccc:4:7
That explains the highlighter behaviour. Clearly BBB is not at
position 0-3 in the String you supplied
String CONTENTS = "AAA :BBB CCC";
Looks like the Tokenizer needs
day, 1 July, 2009 16:13:17
Subject: Re: Highligheter fails using JapaneseAnalyzer
Sorry
I can not verify the Token byte offsets produced by JapaneseAnalyzer
How should I verify it?
- Original Message -
From: "mark harwood"
To:
Sent: Wednesday, July 01, 2009 11:31 PM
Subject:
Can you verify the Token byte offsets produced by this particular analyzer are
correct?
- Original Message
From: k.sayama
To: java-user@lucene.apache.org
Sent: Wednesday, 1 July, 2009 15:22:37
Subject: Re: Highligheter fails using JapaneseAnalyzer
hi
I verified it by using SimpleAn
See MoreLikeThis in the contrib/queries folder. It optimizes the speed
of similarity comparisons by taking the most significant words only
from a document as search terms.
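A minimal sketch (contrib/queries MoreLikeThis; field name and doc number invented):
MoreLikeThis mlt = new MoreLikeThis(reader);
mlt.setFieldNames(new String[] { "contents" });
Query like = mlt.like(docNumber); // only the most significant terms from that doc
TopDocs similar = searcher.search(like, 10);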
On 29 Jun 2009, at 20:14, Amir Hossein Jadidinejad wrote:
Hi,
It's my first experiment with Lucene. Please help me.
FuzzyQuery performance is related to the number of unique terms in the index, not
the number of documents, e.g. a single "telephone directory" document could
contain millions of terms.
Each term considered is compared using an "edit distance" algorithm which is CPU
intensive.
The FuzzyQuery prefix length
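For example, the prefix-length knob on the query itself (2.x-era constructor; field and values invented):
// only terms sharing the first two characters get edit-distance checked
Query q = new FuzzyQuery(new Term("contents", "telephone"), 0.5f, 2);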
>techniques used by big search engines to search among such huge data.
Two keywords here - partitioning and replication.
Partitioning is breaking the content down into shards and assigning shards to
servers. These can then be queried in parallel to make search response times
independent of the
If you're CPU-bound: I've had issues before with GC in long-running indexing
tasks loading very large volumes (100s of millions) of docs. I was seeing lots
of CPU usage tied up in GC.
I solved all these problems by firing batches of indexing activity off in
separate processes then immediately
See IndexReader.setTermInfosIndexDivisor() for a way to help reduce memory
usage without needing to re-index.
If you have indexed fields with omitNorms off (the default) you will be paying
a 1 byte per field per document memory cost and may need to look at re-indexing
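A sketch of the divisor (assuming the 2.4-era API, where it is set on the reader before the term index is first loaded):
IndexReader reader = IndexReader.open(dir);
reader.setTermInfosIndexDivisor(4); // sample 1 in 4 index terms, ~1/4 the RAM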
Cheers
Mark
- Orig
Related: https://issues.apache.org/jira/browse/LUCENE-1486
- Original Message
From: Steven A Rowe
To: "java-user@lucene.apache.org"
Sent: Thursday, 23 April, 2009 16:54:08
Subject: RE: SpanQuery wildcards?
Hi Ivan, SpanRegexQuery should work - just use ".*" instead of "*". - Steve
Spring is pretty useful for managing and sharing resources - see what looks
like a related example here:
http://croarkin.blogspot.com/2008/05/injecting-spring-bean-into-servlet.html
Cheers,
Mark
- Original Message
From: David Seltzer
To: java-user@lucene.apache.org
Sent: Tuesday,
Try setting the minimum prefix length for fuzzy queries (I think there is a
setting on QueryParser, or you may need to subclass).
Prefix length of zero does edit-distance comparisons for all unique terms, e.g.
from "aardvark" to "".
Prefix length of one would cut this search space down to just
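A sketch of the QueryParser setting (setFuzzyPrefixLength; field and analyzer invented):
QueryParser qp = new QueryParser("contents", analyzer);
qp.setFuzzyPrefixLength(1); // "aardvark~" now only scans terms starting with 'a'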
ptimal approach incase someone already have similar
situation.
-----Original Message-----
From: mark harwood [mailto:markharw...@yahoo.co.uk]
Sent: Mon 3/30/2009 11:16 AM
To: java-user@lucene.apache.org
Subject: Re: What is an optimal approach?
That's probably more a question about MarkLogic A
That's probably more a question about MarkLogic APIs than it is about Lucene.
What APIs does MarkLogic provide for getting at the content, e.g. does it provide
a JSR-170 standard interface (
http://www.slideshare.net/uncled/introduction-to-jcr )
I presume you have already ruled out the in-built M
The attachment didn't make it through here. Can you add it as an attachment to
a new JIRA issue?
Thanks,
Mark
From: Amin Mohammed-Coleman
To: java-user@lucene.apache.org
Sent: Thursday, 12 March, 2009 7:47:20
Subject: Re: Lucene Highlighting and Dynamic Summ
OK, it's early days and I'm holding my breath but I'm currently progressing
further through my content without an OOM just by using a different GC setting.
Thanks to advice here and colleagues at work I've gone with a GC setting of
-XX:+UseSerialGC for this indexing task.
The rationale for that is
Wednesday, 11 March, 2009 10:42:33
Subject: Re: A model for predicting indexing memory costs?
* mark harwood:
>>>Could you get a heap dump (eg with YourKit) of what's using up all the
>>>memory when you hit OOM?
>
> On this particular machine I have a JRE, no adm
ts
by pointing out that it's not only *your* time that's at risk, but
customers' time too. Whether you define customers as internal
or external is irrelevant. Every round of diagnosis/fix carries the risk
that N people waste time (and get paid for it). All to avoid a little
up-front co
ing a new IndexWriter each
time? Or, just calling .commit() and then re-using the same writer?
It seems likely this has something to do with merging, though from your listing
I count 14 segments which shouldn't have been doing any merging at
mergeFactor=20, so that's confusing.
Token class when creating the trie
> encoded fields.
>
> How works TrieRange for you? Are you happy, does searches work well with
> 30
> mio docs, which precisionStep do you use?
>
> Uwe
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http:
you? Are you happy, does searches work well with 30
mio docs, which precisionStep do you use?
Uwe
-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
> -Original Message-
> From: mark harwood [mailto:markharw...@yahoo.co.uk]
> Sent:
with -XX:-UseGCOverheadLimit
http://java-monitor.com/forum/archive/index.php/t-54.html
http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html#par_gc.oom
--
Ian.
On Tue, Mar 10, 2009 at 10:45 AM, mark harwood wrote:
>
>>>But... how come setting IW's RAM buffer do
ent: Tuesday, 10 March, 2009 0:01:30
Subject: Re: A model for predicting indexing memory costs?
mark harwood wrote:
>
> I've been building a large index (hundreds of millions) with mainly
> structured data which consists of several fields with mostly unique values.
> I've been
>>Maybe we could do something similar to declare that agiven field uses Trie*,
>>and with what datatype.
With the current implementation you can at least test for the presence of a
field called:
[fieldName]#trie
..which tells you some form of trie is used but could be extended to include
I've been building a large index (hundreds of millions) with mainly structured
data which consists of several fields with mostly unique values.
I've been hitting out of memory issues when doing periodic commits/closes which
I suspect is down to the sheer number of terms.
I set the IndexWriter..
As suggested, the window for failure here is very small. The commit is
effectively an atomic single file rename operation to make the new segments
file visible.
However, should there be a failure between 2 commits the new deletion policy
logic should help you recover to prior commit points. See
I was having some thoughts recently about speeding up fuzzy search.
The current system does edit-distance on all terms A-Z, single-threaded. Prefix
length can reduce the search space and there is a "minimum similarity"
threshold but that's roughly where we are. Multithreading this to make use o
>>My documents are quite big, sometimes up to 300k tokens.
You could look at indexing them as separate documents using overlapping
sections of text. Erik used this for one of his projects.
Cheers
Mark
- Original Message
From: Michael Stoppelman
To: java-user@lucene.apache.org
Sent: Tu
Field("field5", "groupId" + i, Field.Store.YES,
Field.Index.UN_TOKENIZED));
writer.addDocument(doc);
From: mark harwood
To: java-user@lucene.apache.org
Sent: Tuesday, December 23, 2008 2:42:25 PM
Subject: Re: Optimize a
I've had reports of OOM exceptions during optimize on a couple of large
deployments recently (based on Lucene 2.4.0)
I've given the usual advice of turning off norms, providing plenty of RAM and
also suggested setting IndexWriter.setTermIndexInterval().
I don't have access to these deployment en
,
Mark
- Original Message
From: Andrzej Bialecki <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Friday, 14 November, 2008 10:47:03
Subject: Re: [ANN] Luke 0.9 released
mark harwood wrote:
> Hi Andrzej,
>
> Thanks for the update. Looks like you've been bus