Your requirement does not sound like a good fit for the nested stuff but is
probably better suited to conventional faceting.
I would characterise the uses for Nested as follows:
1) The parent of a nested block is typically the "item of interest" that is
returned i.e. the search results are a list
You're describing what I call the "cross matching" problem if you flatten
nested, repeating structures with multiple fields into a single flat Lucene
document model.
The approach for handling the more complex mappings is to use nested child docs
in Lucene and for that look at BlockJoinQuery.
Ho
value: 20
}
}
doc 2:
{
form: { id: 1040 }
attrib: {
name: age
value: 22
}
}
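Roughly, the shape of the code (a sketch against the 3.4-era contrib join API,
where the class was named BlockJoinQuery before later versions renamed it
ToParentBlockJoinQuery; the "type"/"parent" marker field is an illustrative
convention, not required naming):

// Index each form and its attribs as one contiguous block, parent doc last
List<Document> block = new ArrayList<Document>();
Document attrib = new Document();
attrib.add(new Field("name", "age", Field.Store.YES, Field.Index.NOT_ANALYZED));
attrib.add(new Field("value", "22", Field.Store.YES, Field.Index.NOT_ANALYZED));
block.add(attrib);
Document form = new Document();
form.add(new Field("type", "parent", Field.Store.YES, Field.Index.NOT_ANALYZED));
form.add(new Field("form", "1040", Field.Store.YES, Field.Index.NOT_ANALYZED));
block.add(form);
writer.addDocuments(block);

// Match attribs with name=age AND value=22, then join up to the parent form
Filter parents = new CachingWrapperFilter(
    new QueryWrapperFilter(new TermQuery(new Term("type", "parent"))));
BooleanQuery child = new BooleanQuery();
child.add(new TermQuery(new Term("name", "age")), BooleanClause.Occur.MUST);
child.add(new TermQuery(new Term("value", "22")), BooleanClause.Occur.MUST);
Query joined = new BlockJoinQuery(child, parents, BlockJoinQuery.ScoreMode.Max);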
On Mon, May 21, 2012 at 3:24 PM, Mark Harwood wrote:
> You're describing what I call the "cross matching" problem if you flatten
> nested,
I've created a couple of sequence diagrams of core Lucene 4.0 classes that may
be of use to others:
Low-level classes used while writing indexes:
http://goo.gl/dI3HY
Low-level classes used while reading indexes:
http://goo.gl/e8JEj
FWIW I found the websequencediagrams.com editor in these lin
Many considerations here - I find the technical concerns you present typically
open a can of worms for any business worried about security.
It gets political quickly.
In environments where security is paramount, software must be formally
accredited, which is a costly exercise.
Often the choi
>>> Ideally I'd like to take any ANDed clauses and require them to occur
>>> within $SPAN of the other ANDs.
See ComplexPhraseQueryParser?
Like the standard QueryParser it uses quotes to define a phrase but also
interprets any special characters between the quotes e.g. ( ) * ~
The syntax and
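By way of illustration, a minimal sketch (the "text" field is an assumption;
the example phrase follows the class javadoc):

ComplexPhraseQueryParser parser = new ComplexPhraseQueryParser(
    Version.LUCENE_43, "text", new StandardAnalyzer(Version.LUCENE_43));
// fuzzy and wildcard terms are honoured inside the quotes, with a slop of 3
Query q = parser.parse("\"(john jon jonathan~) peters*\"~3");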
DuplicateFilter has been mostly broken since Lucene's switch over to
segment-level filtering.
Since v2.9 the calls to Filter.getDocIdSet no longer pass a "top-level" reader
for accessing the whole index and instead pass a reader restricted to only
accessing a single segment's contents.
Becaus
Hi Brandon,
Can you start by calling toString on the parse result (the Query object) to
see what is being produced, and post that here?
On the face of it, it sounds like it should work OK. What happens if you use
the "normal" query parser on your query "time to leave" - that should parse ok
as
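Something as small as this (field name invented here) usually tells the story:

Query q = queryParser.parse("\"time to leave\"");
System.out.println(q.toString("contents"));   // shows exactly what the parser built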
This was part of the rationale for introducing the XML Query Parser:
1) An extensible query syntax that is expressive enough to represent the full
range of Lucene functions (filters, moreLikeThis etc)
2) Serializable
3) Language independent
4) Decouples the holder of query criteria from the impl
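As a flavour of the usage (a minimal sketch; "contents" and the analyzer are
placeholders):

String xml =
      "<BooleanQuery fieldName=\"contents\">"
    + "  <Clause occurs=\"must\"><TermQuery>lucene</TermQuery></Clause>"
    + "  <Clause occurs=\"mustNot\"><TermQuery>draft</TermQuery></Clause>"
    + "</BooleanQuery>";
CoreParser coreParser = new CoreParser("contents", analyzer);
Query q = coreParser.parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));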
I have a 3.6 index with many no-norms fields and a single text field with norms
(a fairly common configuration). There is a document boost I have set at
index-time that will have been encoded into the text field's norms.
If I query solely on a non-text field then the ranking does not apply the
Answering my own question - add an optional new MatchAllDocsQuery("text") clause
to factor in the encoded norms from the "text" field.
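i.e. something along these lines (a sketch only - "structuredQuery" stands in
for whatever the non-text criteria are):

BooleanQuery combined = new BooleanQuery();
combined.add(structuredQuery, BooleanClause.Occur.MUST);
// the optional clause matches everything but pulls in the "text" norms,
// and with them the index-time document boost
combined.add(new MatchAllDocsQuery("text"), BooleanClause.Occur.SHOULD);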
____
From: mark harwood
To: "java-user@lucene.apache.org"
Sent: Friday, 25 January 2013, 16:11
Subj
See
http://lucene.apache.org/core/4_3_1/queryparser/org/apache/lucene/queryparser/complexPhrase/ComplexPhraseQueryParser.html
From: Ian Lea
To: java-user@lucene.apache.org
Sent: Tuesday, 27 August 2013, 10:16
Subject: Re: Wildcard in PhraseQuery
See the FAQ
See https://issues.apache.org/jira/browse/LUCENE-1999
- Original Message
From: Paul Libbrecht
To: java-user@lucene.apache.org
Sent: Tue, 11 May, 2010 10:52:14
Subject: Re: best way to intersect two queries?
Dear lucene experts,
Let me try to make this precise since there was not answe
terest and Query
objects record match metadata in singleton MatchAttribute objects as they
stream their way through result sets.
Result set streaming and tokenisation streams are similar problems and the
Attribute design seems like it can apply here.
Cheers
Mark
On 11 May 2010, at 12:02, mark harwo
The DuplicateFilter passed to the searcher does not have visibility of the text
query and is therefore evaluated independently from all other criteria.
Sounds like the behaviour you want is to get the last duplicate that also
matches your criteria, which seems like something fairly common to need
Check out LUCENE-2454 and the accompanying slide show if your reason for doing
this is modelling repeating elements.
On 9 Jul 2010, at 13:43, "Hans-Gunther Birken" wrote:
> I'm examining the following search problem. Consider a document with two
> multi-va
LUCENE-2454 includes an example of matching logic that respects the structure
in XML documents (see https://issues.apache.org/jira/browse/LUCENE-2454).
The example class TestNestedDocumentQuery queries xhtml marked up with hResume
syntax.
We don't have XQuery syntax support in a parser now (an
Re scalability of filter construction - the database is likely to hold stable
primary keys, not Lucene doc ids, which are unstable in the face of updates. You
therefore need a quick way of converting stable database keys read from the db
into current lucene doc ids to create the filter. That could
.set(docs[0]);
> }
>
>>> That could involve a lot of disk seeks unless you cache a pk->docid lookup
>>> in ram.
> That sounds interesting. How would the pk->docid lookup get populated?
> Wouldn't a pk->docid cache be invalidated with each commit or merge?
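One way it could be seeded (a pre-4.0 sketch; "pk" is a hypothetical
NOT_ANALYZED field holding the database key):

OpenBitSet bits = new OpenBitSet(reader.maxDoc());
TermDocs td = reader.termDocs();
for (String key : keysFromDatabase) {
    td.seek(new Term("pk", key));
    if (td.next()) {
        bits.set(td.doc());   // at most one doc per primary key
    }
}
td.close();
// the bitset then backs the Filter - and yes, it must be rebuilt (or remapped)
// whenever a reopened reader changes the doc ids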
A pretty thorough exploration of the issues in federated search here:
http://ilpubs.stanford.edu:8090/271/
I'd add "security" i.e. authentication and authorisation to the list of issues
to be considered (key in some environments).
If you consolidate content in a centralised Solr/Lucene indexing
Having upgraded a live system from 2.4 to 2.9.3 the client is reporting a
change in merge behaviour that is causing some issues with their update
monitoring logic.
The suggestion is that any merge operations now complete as part of the
IW.prepareCommit() call rather than previously when they ra
>
> In both 2.4 and 2.9.x (and all later versions), neither .prepareCommit
> nor .commit wait for merges.
>
> That said, if a merge happens to complete before you call those
> methods, then it is in fact committed.
>
> Mike
>
> On Tue, Oct 5, 2010 at 1:13 PM, Mar
the last commit.
Mike
On Tue, Oct 5, 2010 at 6:45 PM, Mark Harwood wrote:
> OK. I'll double check the reports.
> So presumably when merges occur outside of transaction control (post commit)
> the post-merge update of the segments_N file is managed safely somehow?
> I can see the
Can you not just call reader.docFreq(categoryTerm)?
The returned figure includes deleted docs, but the search uses this method too
so it should suffer from the same inaccuracy.
Cheers
Mark
- Original Message
From: Max Jakob
To: java-user@lucene.apache.org
Sent: Mon, 18 Octobe
See the Collocation stuff here https://issues.apache.org/jira/browse/LUCENE-474
- Original Message
From: Lucene
To: java-user@lucene.apache.org
Sent: Tue, 26 October, 2010 13:27:06
Subject: Next Word - Any Suggestions?
Am about to implement a custom query that is sort of mash-up of Fac
>> 1. Why does ir.hashCode() return a different value every time I run
>> this code?
Presumably because it is a different object instance in a different JVM?
IndexReader.hashCode() and IndexReader.equals() are not designed to
represent/summarise the physical contents of an index.
They
Probably off-topic for a Lucene list but the typical database options are:
1) an auto-updated "last changed" timestamp column on related tables that can be
queried
2) a database trigger automatically feeding a "to-be-indexed" table
Option 1 would also need a "marked as deleted" column adding to
Somewhat historic reasons.
It used to be that IndexWriter was the only place you could define this setting
(making it an index-time decision burnt into the index).
The IndexReader option is a more recent addition that adds the flexibility
to decide about memory usage whenever you open the index (
This is possible using contrib's DuplicateFilter.
Below is an example of your problem defined as an XML-based test which I just
ran OK through my test writer/runner.
Hopefully this is readable and demonstrates the use of
FilteredQuery/DuplicateFilter.
This is my test
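In plain Java the combination looks roughly like this (a sketch, assuming
duplicates share an "id" field):

DuplicateFilter dups = new DuplicateFilter("id");   // keep one doc per unique "id" term
Query q = new FilteredQuery(new TermQuery(new Term("text", "lucene")), dups);
TopDocs hits = searcher.search(q, 10);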
See https://issues.apache.org/jira/browse/LUCENE-1720
- Original Message
From: Alex vB
To: java-user@lucene.apache.org
Sent: Wed, 16 March, 2011 0:12:41
Subject: Early Termination
Hi,
is Lucene capable of any early termination techniques during query
processing?
On the forum I only fo
Of course IDF is a factor too, meaning a match on a single term that is rare in
the overall index may be worth more than a match on 2 different terms that are
common in the index.
As Ian suggests a custom Similarity implementation can be used to tune this out.
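e.g. a minimal sketch of that tuning (class name invented):

public class FlatIdfSimilarity extends DefaultSimilarity {
    @Override
    public float idf(int docFreq, int numDocs) {
        return 1.0f;   // treat rare and common terms alike
    }
}
// then: searcher.setSimilarity(new FlatIdfSimilarity());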
- Original Message
From: Ian Lea
To: j
As of 3.2 the necessary changes were put in to safely support indexing nested
docs. See
http://lucene.apache.org/java/3_2_0/changes/Changes.html#3.2.0.new_features
On 6 Jun 2011, at 17:18, 周诚 wrote:
> I just saw this:
> https://issues.apache.org/jira/secure/attachment/12480123/LUCENE-2454.patc
Partitioning and replication are the keys to handling data and user volumes
respectively.
However, this approach introduces some other concerns over consistency and
availability of content which I've tried to capture
here: http://www.slideshare.net/MarkHarwood/patterns-for-large-scale-search
Th
See Highlighter's GradientFormatter
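Roughly (a sketch - maxScore is whatever you normalise against, colours are
"#RRGGBB" strings, null means no background colour):

Formatter formatter = new GradientFormatter(maxScore, "#CCCCCC", "#FF0000", null, null);
Highlighter highlighter = new Highlighter(formatter, new QueryScorer(query));
String html = highlighter.getBestFragment(analyzer, "contents", text);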
Cheers
Mark
On 16 Jun 2011, at 22:01, Itamar Syn-Hershko wrote:
> Hi all,
>
>
> Interesting question: is it possible to color search results in a web-page
> based on their score? e.g. most relevant results in green, and then different
> shades through ora
According to the spec there should at least be an Int32 of -9 to declare the
Format - http://lucene.apache.org/java/2_9_3/fileformats.html#Segments File
- Original Message
From: Uwe Schindler
To: java-user@lucene.apache.org
Sent: Tue, 28 June, 2011 12:32:34
Subject: RE: Corrupt segme
Hi Mike.
>>Hmmm -- what code are you running here, to print the number of docs?
SegmentInfos.setInfoStream(System.out);
FSDirectory dir = FSDirectory.open(new File("j:/indexes/myindex"));
IndexReader r = IndexReader.open(dir, true);
System.out.println("index has "+r.maxDoc()+" docs");
From my
From: Michael McCandless
To: java-user@lucene.apache.org
Sent: Tue, 28 June, 2011 14:59:48
Subject: Re: Corrupt segments file full of zeros
On Tue, Jun 28, 2011 at 9:29 AM, mark harwood wrote:
> Hi Mike.
>>>Hmmm -- what code are you running here, to pr
Check "norms" are disabled on your fields because they'll cost you1byte x
NumberOfDocs x numberOfFieldsWithNormsEnabled.
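For fields that don't need length/boost scoring that means indexing along these
lines (a sketch):

doc.add(new Field("timestamp", value, Field.Store.YES,
        Field.Index.NOT_ANALYZED_NO_NORMS));   // no 1-byte-per-doc norms cost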
On 16 Aug 2011, at 15:11, Bennett, Tony wrote:
> Thank you for your response.
>
> You are correct, we are sorting on timestamp.
> Timestamp has microsecond granularity, a
>>using Lucene that don't fit under the core premise of full text search
I've had several use cases over the years that use features peculiar to Lucene
but here's a very simple one I came across today that illustrates its raw index
lookup capability:
I needed a fast, scalable and persistent "S
slightly less than a HashSet? Interesting. Is the code
> to these benchmarks available somewhere?
>
> Dawid
>
> On Tue, Oct 25, 2011 at 9:57 PM, Grant Ingersoll wrote:
>>
>> On Oct 25, 2011, at 11:26 AM, mark harwood wrote:
>>
>>>>> using
>> > Avg lookup time slightly less than a HashSet? Interesting.
Scratch that. A new dataset and revised code shows HashSets out in front (but
still not a realistic option for very large sets): http://goo.gl/Lb4J1
In this benchmark I removed the code common to all previous tests which was
firs
I don't think of queries as inherently flat in the way HTTP request parameters
are with their name=value pairings.
JSON or XML can reflect more closely the hierarchy in the underlying Lucene
query objects.
For me using a "flat" query interface feels a bit like when you start off
trying to manag
>
> Other parameters such as filters, faceting, highlighting, sorting,
> etc, don't normally have any hierarchy.
I regularly mix filters and queries inside Boolean logic. Attempts to structure
data (e.g. geocoding) don't always achieve 100% coverage and so for better
recall you must also resor
>> You don't need to copy the whole index every time
>> if you do incremental indexing/updates and don't optimize the index
But at 5 minute intervals for replication does this not quickly lead to a very
fragmented index?
It seems there is a fundamental conflict when building replication system
MoreLikeThis needs to find the terms in your doc. It tries to do this by using
TermFreqVectors which are stored in the index if you choose to add them at
index-time. If you haven't done this then it will fall back to reanalysing the
content of the document using an analyser (despite what the j
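The usual set-up looks something like this (a sketch; "contents" is a
placeholder field name):

MoreLikeThis mlt = new MoreLikeThis(reader);
mlt.setFieldNames(new String[] { "contents" });
mlt.setAnalyzer(analyzer);        // only consulted when no term vectors are stored
Query like = mlt.like(docId);     // docId of the example document
TopDocs similar = searcher.search(like, 10);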
shows no result.
I checked the stored documents and the TermVector exists and is correct, but
MoreLikeThis returns no result for a given document id.
What am I missing?
mark harwood wrote:
>
> MoreLikeThis needs to find the terms in your doc. It tries to do this by
> using TermFreqVecto
You need to call rewrite on the query to expand it then give that version to
the highlighter - see the package javadocs.
http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/search/highlight/package-summary.html#package_description
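i.e. (a sketch, with placeholder names):

Query rewritten = query.rewrite(reader);   // expands wildcard/prefix terms into concrete terms
Highlighter highlighter = new Highlighter(new QueryScorer(rewritten));
String fragment = highlighter.getBestFragment(analyzer, "contents", storedText);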
Cheers
Mark
- Original Message
From: "Sertic M
Ok, one final question:
If I query for "*ll*", the query is expanded to ("hallo" or "alle" or ...), so the
Highlighter will highlight the words "hallo" or "alle". But how can I highlight
only the original query, so only the "ll"? Is this
pass to the
highlighter.
That should give you the functionality you are looking for.
-Matt
mark harwood wrote:
>>> Is this possible?
>>>
>
> Not currently, the highlighter works with a list of words (or words AND
> phrases using the new span support) and highlig
TermsFilter has taken the relatively easy option of ORing terms and this is
inexpensive to construct.
Adding more complex features (mixes of MUST/SHOULD/NOT clauses) starts to
require the sorts of optimisations you see in BooleanQuery (MUST clauses
accelerating processing of other clauses throu
>>here I'm AND-ing each bitset. Does it look ok?
In principle it looks like it will work fine but the BooleanQuery approach I
described may prove to be faster on large datasets because ultimately
td.skipTo() will be called to avoid excessive disk reads.
Cheers
Mark
- Original Message ---
Ah, sorry. Just saw the bit about the free text query too.
A FieldCache is the answer here, I suspect, in order to quickly retrieve the
date values for arbitrary queries.
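Something like (a sketch, pre-4.0 API, dates indexed as YYYYMMDD strings):

// one int per document, built lazily on first use and cached for the reader's lifetime
int[] dates = FieldCache.DEFAULT.getInts(reader, "date");
// in a collector, the date for a hit is then just dates[doc]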
- Original Message
From: mark harwood <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Friday, 10 O
Assuming your date data is held as YYYYMMDD and you want daily totals:
Term startTerm = new Term("date", "20080101");
TermEnum termEnum = indexReader.terms(startTerm);
do
{
    Term currentTerm = termEnum.term();
    if (currentTerm == null || !currentTerm.field().equals(startTerm.field()))
    {
        break;   // ran off the end of the "date" field
    }
    System.out.println(currentTerm.text() + " : " + termEnum.docFreq() + " docs");
} while (termEnum.next());
termEnum.close();
Assuming content is added in chronological order and with no updates to
existing docs couldn't you rely on internal Lucene document id to give a
chronological sort order?
That would require no memory cache at all when sorting.
Querying across multiple indexes simultaneously however may present a
indexes than this, right?
cheers,
Aleksander
On Fri, 10 Oct 2008 15:18:46 +0200, mark harwood <[EMAIL PROTECTED]>
wrote:
> Assuming content is added in chronological order and with no updates to
> existing docs couldn't you rely on internal Lucene document id to give a
> ch
represent up to 65536 values - capable of representing a date
range of 179 years.
- Original Message ----
From: mark harwood <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Friday, 10 October, 2008 15:43:35
Subject: Re: Question regarding sorting and memory consumpt
ick isn't really a word in my vocabulary when it's 6 o'clock on
a Friday :(
Guess it'll be a looong night.. :(
Cheers,
Aleks
On Fri, 10 Oct 2008 17:07:31 +0200, mark harwood <[EMAIL PROTECTED]>
wrote:
> Update: The statement "...cost is field size (10
Yes, StringIndex's public fields make life awkward. Re initialization - I did
think you could try using arrays of byte arrays. The first 256 terms can be
addressed using just one byte array; on encountering a 257th term an extra byte
array is allocated. References to terms then require indexing into
Further to our discussion - see below a class that measures the added
construction cost and memory savings for an optimised field value cache for a
given index.
The optimisation here being initial use of byte arrays, then shorts, then ints
as more unique terms emerge.
I imagine the majority of
Yes, use TermsFilter to add your 5000 terms by calling
TermsFilter.addTerm(term) repeatedly then put that single filter as a single
"not" clause in a BooleanFilter
Cheers
Mark
On 17 Oct 2008, at 04:02, "prabin meitei" <[EMAIL PROTECTED]> wrote:
Hi, Thanks for the reply. I looked through the Fi
One issue with the existing field cache implementation is that it uses int
arrays to reference into the list of unique terms where short or even byte
arrays may suffice for fields with smaller numbers of unique terms.
How many unique terms do you have?
I posted some code that measures the potent
>>I'd like to ask the Lucene user community what version of Lucene would be
>>preferable
A Swing-based one, managed in Lucene/contrib and released with every Lucene
build.
;)
- Original Message
From: Andrzej Bialecki <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Thur
Message
From: Andrzej Bialecki <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Thursday, 30 October, 2008 11:32:37
Subject: Re: Luke is coming .. not there yet.
mark harwood wrote:
>>> I'd like to ask the Lucene user community what version of Lucene would be
>
Probably a question for Mike M.
Is it possible/sensible to use IndexDeletionPolicy to remove the *newest*
commit points (as opposed to the usual scenario of deleting old commit points)?
I experimented with this:
class RollbackDeletionPolicy implements IndexDeletionPolicy
{
pub
Hi Andrzej,
Thanks for the update. Looks like you've been busy adding some great new
features!
I think you may have a bug in opening an index with prior commit points,
though. I want to keep these in my index and so I opened it in Luke selecting
the "open read only" and "keep all commit points
,
Mark
- Original Message
From: Andrzej Bialecki <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Friday, 14 November, 2008 10:47:03
Subject: Re: [ANN] Luke 0.9 released
mark harwood wrote:
> Hi Andrzej,
>
> Thanks for the update. Looks like you've been bus
I've had reports of OOM exceptions during optimize on a couple of large
deployments recently (based on Lucene 2.4.0)
I've given the usual advice of turning off norms, providing plenty of RAM and
also suggested setting IndexWriter.setTermIndexInterval().
I don't have access to these deployment en
Field("field5", "groupId" + i, Field.Store.YES,
Field.Index.UN_TOKENIZED));
writer.addDocument(doc);
From: mark harwood
To: java-user@lucene.apache.org
Sent: Tuesday, December 23, 2008 2:42:25 PM
Subject: Re: Optimize a
>>My documents are quite big, sometimes up to 300k tokens.
You could look at indexing them as separate documents using overlapping
sections of text. Erik used this for one of his projects.
Cheers
Mark
- Original Message
From: Michael Stoppelman
To: java-user@lucene.apache.org
Sent: Tu
I was having some thoughts recently about speeding up fuzzy search.
The current system does edit-distance comparisons on all terms A-Z,
single-threaded. Prefix length can reduce the search space and there is a
"minimum similarity" threshold but that's roughly where we are. Multithreading
this to make use o
As suggested, the window for failure here is very small. The commit is
effectively an atomic single file rename operation to make the new segments
file visible.
However, should there be a failure between 2 commits the new deletion policy
logic should help you recover to prior commit points. See
I've been building a large index (hundreds of millions) with mainly structured
data which consists of several fields with mostly unique values.
I've been hitting out of memory issues when doing periodic commits/closes which
I suspect is down to the sheer number of terms.
I set the IndexWriter..
>>Maybe we could do something similar to declare that a given field uses Trie*,
>>and with what datatype.
With the current implementation you can at least test for the presence of a
field called:
[fieldName]#trie
..which tells you some form of trie is used but could be extended to include
Sent: Tuesday, 10 March, 2009 0:01:30
Subject: Re: A model for predicting indexing memory costs?
mark harwood wrote:
>
> I've been building a large index (hundreds of millions) with mainly
> structured data which consists of several fields with mostly unique values.
> I've been
with -XX:-UseGCOverheadLimit
http://java-monitor.com/forum/archive/index.php/t-54.html
http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html#par_gc.oom
--
Ian.
On Tue, Mar 10, 2009 at 10:45 AM, mark harwood wrote:
>
>>>But... how come setting IW's RAM buffer do
you? Are you happy, do searches work well with 30
mio docs, which precisionStep do you use?
Uwe
-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
> -Original Message-
> From: mark harwood [mailto:markharw...@yahoo.co.uk]
> Sent:
Token class when creating the trie
> encoded fields.
>
> How does TrieRange work for you? Are you happy, do searches work well with
> 30
> mio docs, which precisionStep do you use?
>
> Uwe
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http:
ing a new IndexWriter each
time? Or, just calling .commit() and then re-using the same writer?
It seems likely this has something to do with merging, though from your listing
I count 14 segments which shouldn't have been doing any merging at
mergeFactor=20, so that's confusing.
ts
by pointing out that it's not only *your* time that's at risk, but
customers' time too. Whether you define customers as internal
or external is irrelevant. Every round of diagnosis/fix carries the risk
that N people waste time (and get paid for it). All to avoid a little
up-front co
Wednesday, 11 March, 2009 10:42:33
Subject: Re: A model for predicting indexing memory costs?
* mark harwood:
>>>Could you get a heap dump (eg with YourKit) of what's using up all the
>>>memory when you hit OOM?
>
> On this particular machine I have a JRE, no adm
OK, it's early days and I'm holding my breath but I'm currently progressing
further through my content without an OOM just by using a different GC setting.
Thanks to advice here and colleagues at work I've gone with a GC setting of
-XX:+UseSerialGC for this indexing task.
The rationale that is
The attachment didn't make it through here. Can you add it as an attachment to
a new JIRA issue?
Thanks,
Mark
From: Amin Mohammed-Coleman
To: java-user@lucene.apache.org
Sent: Thursday, 12 March, 2009 7:47:20
Subject: Re: Lucene Highlighting and Dynamic Summ
That's probably more a question about MarkLogic APIs than it is about Lucene.
What APIs does MarkLogic provide for getting at the content e.g does it provide
a JSR-170 standard interface
(http://www.slideshare.net/uncled/introduction-to-jcr)
I presume you have already ruled out the in-built M
optimal approach in case someone already has a similar
situation.
-Original Message-----
From: mark harwood [mailto:markharw...@yahoo.co.uk]
Sent: Mon 3/30/2009 11:16 AM
To: java-user@lucene.apache.org
Subject: Re: What is an optimal approach?
That's probably more a question about MarkLogic A
Try setting the minimum prefix length for fuzzy queries (I think there is a
setting on QueryParser or you may need to subclass).
Prefix length of zero does edit-distance comparisons for all unique terms, e.g.
from "aardvark" to ""
Prefix length of one would cut this search space down to just
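The setting itself is one line (a sketch against the 2.4-era parser; field name
invented):

QueryParser parser = new QueryParser("contents", analyzer);
parser.setFuzzyPrefixLength(2);   // candidate terms must share their first 2 chars
Query q = parser.parse("aardvark~");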
Spring is pretty useful for managing and sharing resources - see what looks
like a related example here:
http://croarkin.blogspot.com/2008/05/injecting-spring-bean-into-servlet.html
Cheers,
Mark
- Original Message
From: David Seltzer
To: java-user@lucene.apache.org
Sent: Tuesday,
Related: https://issues.apache.org/jira/browse/LUCENE-1486
- Original Message
From: Steven A Rowe
To: "java-user@lucene.apache.org"
Sent: Thursday, 23 April, 2009 16:54:08
Subject: RE: SpanQuery wildcards?
Hi Ivan, SpanRegexQuery should work - just use ".*" instead of "*". - Steve
See IndexReader.setTermInfosIndexDivisor() for a way to help reduce memory
usage without needing to re-index.
If you have indexed fields with omitNorms off (the default) you will be paying
a 1 byte per field per document memory cost and may need to look at re-indexing
Cheers
Mark
- Orig
If you're CPU-bound - I've had issues before with GC in long-running indexing
tasks loading very large volumes (100s of millions) of docs. I was seeing lots
of CPU usage tied up in GC.
I solved all these problems by firing batches of indexing activity off in
separate processes then immediately
>techniques used by big search engines to search among such huge data.
Two keywords here - partitioning and replication.
Partitioning is breaking the content down into shards and assigning shards to
servers. These can then be queried in parallel to make search response times
independent of the
FuzzyQuery performance is related to number of unique terms in the index not
the number of documents e.g. a single "telephone directory" document could
contain millions of terms.
Each term considered is compared using an "edit distance" algo which is CPU
intensive.
The FuzzyQuery prefix length
See MoreLikeThis in the contrib/queries folder. It optimizes the speed
of similarity comparisons by taking only the most significant words
from a document as search terms.
On 29 Jun 2009, at 20:14, Amir Hossein Jadidinejad wrote:
Hi,
It's my first experiment with Lucene. Please help me.
Can you verify the Token byte offsets produced by this particular analyzer are
correct?
- Original Message
From: k.sayama
To: java-user@lucene.apache.org
Sent: Wednesday, 1 July, 2009 15:22:37
Subject: Re: Highligheter fails using JapaneseAnalyzer
hi
I verified it by using SimpleAn
day, 1 July, 2009 16:13:17
Subject: Re: Highligheter fails using JapaneseAnalyzer
Sorry
I can not verify the Token byte offsets produced by JapaneseAnalyzer
How should I verify it?
- Original Message -
From: "mark harwood"
To:
Sent: Wednesday, July 01, 2009 11:31 PM
Subject:
On 1 Jul 2009, at 17:39, k.sayama wrote:
I could verify Token byte offsets
The system outputs
aaa:0:3
bbb:0:3
ccc:4:7
That explains the highlighter behaviour. Clearly BBB is not at
position 0-3 in the String you supplied
String CONTENTS = "AAA :BBB CCC";
Looks like the Tokenizer need
Check out BooleanFilter in contrib/queries. It can be wrapped in a
ConstantScoreQuery.
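i.e. along these lines (a sketch; field and terms invented):

BooleanFilter bf = new BooleanFilter();
bf.add(new FilterClause(new QueryWrapperFilter(
        new TermQuery(new Term("text", "boolean"))), BooleanClause.Occur.MUST));
bf.add(new FilterClause(new QueryWrapperFilter(
        new TermQuery(new Term("text", "retrieval"))), BooleanClause.Occur.MUST));
Query q = new ConstantScoreQuery(bf);   // every matching doc gets the same score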
On 4 Jul 2009, at 17:37, Lukas Michelbacher
wrote:
This is about an experiment comparing plain Boolean retrieval with
vector-space-based retrieval.
I would like to disable all of Lucene's scoring mechani
I would appreciate it if I can get help with the code as well.
If you want to tweak an existing example rather than coding entirely
from scratch the XMLQueryParser in /contrib has a demo web app for job
search with a "location" field similar in principle to your "state"
field plus it has a G
ts() + " hits");
The result is 0 hits (should be 640).
[1] tinyurl.com/ml52ye
2009/7/4 Mark Harwood :
>
> Check out BooleanFilter in contrib/queries. It can be wrapped in a
> ConstantScoreQuery.
>
>
>
> On 4 Jul 2009, at 17:37, Lukas Michelbacher
> wrote:
>
if the term is "X Y" document 2 is getting a higher score than
document 1.
That may be length normalisation at play. Doc 2 is shorter so may be
seen as a better match for that reason.
Using the "explain" function helps illustrate the break down of scores
in matches.
You could try index
I just tried the norms idea as well, no change
You'll need to look at searcher.explain() for the two docs, or post a JUnit
test or code example that can be executed which shows the issue