Re: Nested BlockJoinQuery

2012-02-11 Thread Mark Harwood
Your requirement does not sound like a good fit for the nested stuff but is probably more one for conventional faceting. I would characterise the uses for Nested as follows: 1) The parent of a nested block is typically the "item of interest" that is returned i.e. the search results are a list

Re: Searching accross 2 fields

2012-05-21 Thread Mark Harwood
You're describing what I call the "cross matching" problem if you flatten nested, repeating structures with multiple fields into a single flat Lucene document model. The approach for handling the more complex mappings is to use nested child docs in Lucene and for that look at BlockJoinQuery. Ho
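A minimal sketch of the cross matching problem described above, in plain Java rather than Lucene. The class and method names (CrossMatchDemo, Attrib, flatMatch, nestedMatch) and the sample "height" attribute are hypothetical and only illustrate the idea: once repeating name/value pairs are flattened into one document, a query can pair the name from one child with the value from another.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model of cross matching; this is not Lucene code. */
final class CrossMatchDemo {

    /** One repeating child element: an attribute with a name and a value. */
    static final class Attrib {
        final String name;
        final String value;

        Attrib(String name, String value) {
            this.name = name;
            this.value = value;
        }
    }

    /** Flattened model: name and value are tested independently, so the pairing is lost. */
    static boolean flatMatch(List<Attrib> attribs, String name, String value) {
        boolean nameSeen = false;
        boolean valueSeen = false;
        for (Attrib a : attribs) {
            nameSeen |= a.name.equals(name);
            valueSeen |= a.value.equals(value);
        }
        return nameSeen && valueSeen;
    }

    /** Nested model: both criteria must hold on the same child. */
    static boolean nestedMatch(List<Attrib> attribs, String name, String value) {
        for (Attrib a : attribs) {
            if (a.name.equals(name) && a.value.equals(value)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Attrib> form = Arrays.asList(new Attrib("age", "20"), new Attrib("height", "22"));
        // Query: name=age AND value=22. No single attribute satisfies both.
        System.out.println("flat match:   " + flatMatch(form, "age", "22"));
        System.out.println("nested match: " + nestedMatch(form, "age", "22"));
    }
}
```

The flattened check reports a false positive for name=age AND value=22, while the per-child check does not. Evaluating criteria per child is the behaviour that nested child docs are meant to provide.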

Re: Searching accross 2 fields

2012-05-22 Thread mark harwood
       value: 20            } } doc 2: { form: { id: 1040 }   attrib: {                  name: age                  value: 22            } } On Mon, May 21, 2012 at 3:24 PM, Mark Harwood wrote: > You're describing what I call the "cross matching" problem if you flatten > nested,

Sequence diagrams for Lucene 4.0 classes

2012-05-23 Thread mark harwood
I've created a couple of sequence diagrams of core Lucene 4.0 classes that may be of use to others: Low-level classes used while writing indexes http://goo.gl/dI3HY Low-level classes used while reading indexes: http://goo.gl/e8JEj FWIW I found the websequencediagrams.com editor in these lin

Re: Mapping Lucene search results with a relational database

2012-07-03 Thread mark harwood
Many considerations here - I find the technical concerns you present typically open a can of worms for any businesses worried about security. It gets political quickly.   In environments where security is paramount, software must be formally accredited, which is a costly exercise. Often the choi

Re: Creating Span Queries from Boolean Queries

2012-08-22 Thread mark harwood
>>> Ideally I'd like to take any ANDed clauses and require them to occur >>> within $SPAN of the other ANDs. See ComplexPhraseQueryParser? Like the standard QueryParser it uses quotes to define a phrase but also interprets any special characters between the quotes e.g. ( ) * ~ The syntax and

Re: DuplicateFilter filters not only duplicates

2012-08-30 Thread mark harwood
DuplicateFilter has been mostly broken since Lucene's switch over to segment-level filtering. Since v2.9 the calls to Filter.getDocIdSet no longer pass a "top-level" reader for accessing the whole index and instead pass a reader restricted to only accessing a single segment's contents. Becaus
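A toy model, in plain Java, of why a filter that only sees one segment at a time can miss duplicates. It assumes the consequence of the change described above is that keys repeated in different segments are never compared. The class SegmentDedupDemo and its methods are hypothetical and do not reproduce DuplicateFilter's actual code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model: each inner list is one segment, each string is a document's duplicate key. */
final class SegmentDedupDemo {

    /** Keeps the first doc per key but forgets the keys seen in earlier segments. */
    static List<String> dedupePerSegment(List<List<String>> segments) {
        List<String> kept = new ArrayList<>();
        for (List<String> segment : segments) {
            Set<String> seen = new HashSet<>(); // reset for every segment
            for (String key : segment) {
                if (seen.add(key)) {
                    kept.add(key);
                }
            }
        }
        return kept;
    }

    /** Keeps the first doc per key across the whole index (the "top-level" view). */
    static List<String> dedupeTopLevel(List<List<String>> segments) {
        List<String> kept = new ArrayList<>();
        Set<String> seen = new HashSet<>(); // shared across all segments
        for (List<String> segment : segments) {
            for (String key : segment) {
                if (seen.add(key)) {
                    kept.add(key);
                }
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<List<String>> segments = Arrays.asList(
                Arrays.asList("a", "b", "a"),
                Arrays.asList("b", "c"));
        System.out.println("per segment: " + dedupePerSegment(segments));
        System.out.println("top level:   " + dedupeTopLevel(segments));
    }
}
```

In this model the key "b" survives twice under per-segment evaluation because its two occurrences sit in different segments.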

Re: ComplexPhraseQueryParser and stop words

2012-11-02 Thread Mark Harwood
Hi Brandon, Can you start by calling toString on the parse result (the Query object) to see what is being produced and post that here. On the face of it it sounds like it should work OK. What happens if you use the "normal" query parser on your query "time to leave" - that should parse ok as

Re: Lucene 4.0, Serialization

2012-12-04 Thread mark harwood
This was part of the rationale for introducing the XML Query Parser: 1) An extensible query syntax that is expressive enough to represent the full range of Lucene functions (filters, moreLikeThis etc) 2) Serializable 3) Language independent 4) Decouples the holder of query criteria from the  impl

3.6 - querying a no-norms field and getting document boost

2013-01-25 Thread mark harwood
I have a 3.6 index with many no-norms fields and a single text field with norms (a fairly common configuration). There is a document boost I have set at index-time that will have been encoded into the text field's norms. If I query solely on a non-text field then the ranking does not apply the

Re: 3.6 - querying a no-norms field and getting document boost

2013-01-25 Thread mark harwood
Answering my own question - add optional new MatchAllDocsQuery("text") clause to factor in the encoded norms from the "text" field. ____ From: mark harwood To: "java-user@lucene.apache.org" Sent: Friday, 25 January 2013, 16:11 Subj

Re: Wildcard in PhraseQuery

2013-08-27 Thread mark harwood
See  http://lucene.apache.org/core/4_3_1/queryparser/org/apache/lucene/queryparser/complexPhrase/ComplexPhraseQueryParser.html From: Ian Lea To: java-user@lucene.apache.org Sent: Tuesday, 27 August 2013, 10:16 Subject: Re: Wildcard in PhraseQuery See the FAQ

Re: best way to interest two queries?

2010-05-11 Thread mark harwood
See https://issues.apache.org/jira/browse/LUCENE-1999 - Original Message From: Paul Libbrecht To: java-user@lucene.apache.org Sent: Tue, 11 May, 2010 10:52:14 Subject: Re: best way to interest two queries? Dear lucene experts, Let me try to make this precise since there was not answe

Re: best way to interest two queries?

2010-05-12 Thread mark harwood
terest and Query objects record match metadata in singleton MatchAttribute objects as they stream their way through result sets. Result set streaming and tokenisation streams are similar problems and the Attribute design seems like it can apply here. Cheers Mark Le 11-mai-10 à 12:02, mark harwo

Re: DuplicateFilter question

2010-05-31 Thread Mark Harwood
The DuplicateFilter passed to the searcher does not have visibility of the text query and is therefore evaluated independently from all other criteria. Sounds like the behaviour you want is to get the last duplicate that also matches your criteria, which seems like something fairly common to need

Re: Searching docs with multi-value fields

2010-07-09 Thread Mark Harwood
Check out lucene 2454 and accompanying slide show if your reason for doing this is modelling repeating elements. On 9 Jul 2010, at 13:43, "Hans-Gunther Birken" wrote: > I'm examining the following search problem. Consider a document with two > multi-va

Re: XML results ranking

2010-07-16 Thread mark harwood
Lucene 2454 includes an example of matching logic that respects the structure in XML documents (see https://issues.apache.org/jira/browse/LUCENE-2454). The example class TestNestedDocumentQuery queries xhtml marked up with hResume syntax. We don't have XQuery syntax support in a parser now (an

Re: on-the-fly "filters" from docID lists

2010-07-22 Thread Mark Harwood
Re scalability of filter construction - the database is likely to hold stable primary keys not lucene doc ids which are unstable in the face of updates. You therefore need a quick way of converting stable database keys read from the db into current lucene doc ids to create the filter. That could
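A rough sketch of the conversion step described above, assuming an in-RAM map from stable primary keys to current doc ids. The class PkDocIdCache and its methods are hypothetical, and how the map gets populated from the index is deliberately left out. Because doc ids are unstable, a cache like this would have to be rebuilt whenever the index view changes.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-RAM lookup from stable database keys to current doc ids. */
final class PkDocIdCache {

    private final Map<String, Integer> pkToDocId = new HashMap<>();

    /** Records the current doc id for a primary key. */
    void put(String primaryKey, int docId) {
        pkToDocId.put(primaryKey, docId);
    }

    /** Builds filter bits for the given keys; keys missing from the index are skipped. */
    BitSet toFilterBits(Iterable<String> primaryKeys, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        for (String pk : primaryKeys) {
            Integer docId = pkToDocId.get(pk);
            if (docId != null) {
                bits.set(docId);
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        PkDocIdCache cache = new PkDocIdCache();
        cache.put("order-17", 3);
        cache.put("order-42", 7);
        BitSet bits = cache.toFilterBits(Arrays.asList("order-17", "order-42", "order-99"), 10);
        System.out.println("filter bits: " + bits);
    }
}
```

The lookup replaces one index seek per key with a hash probe, which is the point of caching the mapping in RAM.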

Re: on-the-fly "filters" from docID lists

2010-07-23 Thread Mark Harwood
.set(docs[0]); > } > >>> That could involve a lot of disk seeks unless you cache a pk->docid lookup >>> in ram. > That sounds interesting. How would the pk->docid lookup get populated? > Wouldn't a pk->docid cache be invalidated with each commit or merge?

Re: Federated search with opensearch or proprietary APIs for Atlassian

2010-09-02 Thread mark harwood
A pretty thorough exploration of the issues in federated search here: http://ilpubs.stanford.edu:8090/271/ I'd add "security" i.e. authentication and authorisation to the list of issues to be considered (key in some environments). If you consolidate content in a centralised Solr/Lucene indexing

Merge and commit behaviour - changed between 2.4 and 2.9?

2010-10-05 Thread Mark Harwood
Having upgraded a live system from 2.4 to 2.9.3 the client is reporting a change in merge behaviour that is causing some issues with their update monitoring logic. The suggestion is that any merge operations now complete as part of the IW.prepareCommit() call rather than previously when they ra

Re: Merge and commit behaviour - changed between 2.4 and 2.9?

2010-10-05 Thread Mark Harwood
> > In both 2.4 and 2.9.x (and all later versions), neither .prepareCommit > nor .commit wait for merges. > > That said, if a merge happens to complete before you call those > methods, then it is in fact committed. > > Mike > > On Tue, Oct 5, 2010 at 1:13 PM, Mar

Re: Merge and commit behaviour - changed between 2.4 and 2.9?

2010-10-06 Thread mark harwood
the last commit. Mike On Tue, Oct 5, 2010 at 6:45 PM, Mark Harwood wrote: > OK. I'll double check the reports. > So presumably when merges occur outside of transaction control (post commit) >the post-merge update of the segments_N file is managed safely somehow? > I can see the

Re: Consider only documents of a category for IDF

2010-10-18 Thread mark harwood
Can you not just call reader.docFreq(categoryTerm) ? The returned figure includes deleted docs but then the search term uses this method too so should suffer from the same inaccuracy. Cheers Mark - Original Message From: Max Jakob To: java-user@lucene.apache.org Sent: Mon, 18 Octobe

Re: Next Word - Any Suggestions?

2010-10-26 Thread mark harwood
See the Collocation stuff here https://issues.apache.org/jira/browse/LUCENE-474 - Original Message From: Lucene To: java-user@lucene.apache.org Sent: Tue, 26 October, 2010 13:27:06 Subject: Next Word - Any Suggestions? Am about to implement a custom query that is sort of mash-up of Fac

Re: How to Cache Filter Results between Servers

2010-11-29 Thread mark harwood
>> 1. why ir.hashCode() returns different value every time I run >> this >>code? Presumably because it is a different object instance in a different JVM? IndexReader.hashCode() and IndexReader.equals() are not designed to represent/summarise the physical contents of an index. They

Re: Maintaining index for "flattened" database tables

2011-01-13 Thread mark harwood
Probably off-topic for a Lucene list but the typical database options are: 1) an auto-updated "last changed" timestamp column on related tables that can be queried 2) a database trigger automatically feeding a "to-be-indexed" table Option 1 would also need a "marked as deleted" column adding to

Re: termInfosIndexDivisor vs termIndexInterval

2011-02-07 Thread mark harwood
Somewhat historic reasons. It used to be IndexWriter was the only place you could define this setting (making it an index-time decision burnt into the index). The IndexReader option is a relatively newer addition that adds the flexibility to decide about memory usage whenever you open the index (

Re: Detecting duplicates

2011-03-10 Thread mark harwood
This is possible using contrib's DuplicateFilter. Below is an example of your problem defined as an XML-based test which I just ran OK through my test writer/runner. Hopefully this is readable and demonstrates the use of FilteredQuery/DuplicateFilter. This is my test

Re: Early Termination

2011-03-16 Thread mark harwood
See https://issues.apache.org/jira/browse/LUCENE-1720 - Original Message From: Alex vB To: java-user@lucene.apache.org Sent: Wed, 16 March, 2011 0:12:41 Subject: Early Termination Hi, is Lucene capable of any early termination techniques during query processing? On the forum I only fo

Re: Ranking docs with all terms higher

2011-05-19 Thread mark harwood
Of course IDF is a factor too meaning a match on a single rare (to the overall index) term may be worth more than a match on 2 different common (to the index) terms. As Ian suggests a custom Similarity implementation can be used to tune this out. - Original Message From: Ian Lea To: j

Re: When nested indexing and search will be available?

2011-06-06 Thread Mark Harwood
As of 3.2 the necessary changes were put in to safely support indexing nested docs. See http://lucene.apache.org/java/3_2_0/changes/Changes.html#3.2.0.new_features On 6 Jun 2011, at 17:18, 周诚 wrote: > I just saw this: > https://issues.apache.org/jira/secure/attachment/12480123/LUCENE-2454.patc

Re: Index size and performance degradation

2011-06-14 Thread mark harwood
Partitioning and replication are the keys to handling data and user volumes respectively. However, this approach introduces some other concerns over consistency and availability of content which I've tried to capture here: http://www.slideshare.net/MarkHarwood/patterns-for-large-scale-search Th

Re: Coloring search results based on score?

2011-06-16 Thread Mark Harwood
See Highlighter's GradientFormatter Cheers Mark On 16 Jun 2011, at 22:01, Itamar Syn-Hershko wrote: > Hi all, > > > Interesting question: is it possible to color search results in a web-page > based on their score? e.g. most relevant results in green, and then different > shades through ora

Re: Corrupt segments file full of zeros

2011-06-28 Thread mark harwood
According to the spec there should at least be an Int32 of -9 to declare the Format - http://lucene.apache.org/java/2_9_3/fileformats.html#Segments File - Original Message From: Uwe Schindler To: java-user@lucene.apache.org Sent: Tue, 28 June, 2011 12:32:34 Subject: RE: Corrupt segme

Re: Corrupt segments file full of zeros

2011-06-28 Thread mark harwood
Hi Mike. >>Hmmm -- what code are you running here, to print the number of docs? SegmentInfos.setInfoStream(System.out); FSDirectory dir = FSDirectory.open(new File("j:/indexes/myindex")); IndexReader r = IndexReader.open(dir, true); System.out.println("index has "+r.maxDoc()+" docs"); From my

Re: Corrupt segments file full of zeros

2011-06-28 Thread mark harwood
From: Michael McCandless To: java-user@lucene.apache.org Sent: Tue, 28 June, 2011 14:59:48 Subject: Re: Corrupt segments file full of zeros On Tue, Jun 28, 2011 at 9:29 AM, mark harwood wrote: > Hi Mike. >>>Hmmm -- what code are you running here, to pr

Re: What kind of System Resources are required to index 625 million row table...???

2011-08-16 Thread Mark Harwood
Check "norms" are disabled on your fields because they'll cost you 1 byte x NumberOfDocs x numberOfFieldsWithNormsEnabled. On 16 Aug 2011, at 15:11, Bennett, Tony wrote: > Thank you for your response. > > You are correct, we are sorting on timestamp. > Timestamp has microsecond granularity, a
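A worked example of the formula above for the 625 million row table in the subject line. The figure of 5 fields with norms enabled is an assumption chosen for illustration; the thread snippet does not say how many fields carry norms. The class NormsCost is hypothetical.

```java
/** Worked arithmetic for: 1 byte x NumberOfDocs x numberOfFieldsWithNormsEnabled. */
final class NormsCost {

    /** Returns the norms memory cost in bytes, failing loudly on overflow. */
    static long normsBytes(long numberOfDocs, int fieldsWithNormsEnabled) {
        return Math.multiplyExact(numberOfDocs, (long) fieldsWithNormsEnabled);
    }

    public static void main(String[] args) {
        long docs = 625_000_000L; // from the thread subject
        int fields = 5;           // assumed for illustration
        long bytes = normsBytes(docs, fields);
        System.out.println(bytes + " bytes, roughly " + bytes / (1024L * 1024L) + " MB");
    }
}
```

With these assumed inputs the cost is 3,125,000,000 bytes, so norms alone would need about 3 GB of heap.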

Re: Bet you didn't know Lucene can...

2011-10-25 Thread mark harwood
>>using Lucene that don't fit under the core premise of full text search  I've had several use cases over the years that use features peculiar to Lucene but here's a very simple one I came across today that illustrates its raw index lookup capability: I needed a fast, scalable and persistent "S

Re: Bet you didn't know Lucene can...

2011-10-25 Thread Mark Harwood
lightly less than a HashSet? Interesting. Is the code > to these benchmarks available somewhere? > > Dawid > > On Tue, Oct 25, 2011 at 9:57 PM, Grant Ingersoll wrote: >> >> On Oct 25, 2011, at 11:26 AM, mark harwood wrote: >> >>>>> using

Re: Bet you didn't know Lucene can...

2011-10-26 Thread mark harwood
>>  > Avg lookup time slightly less than a HashSet? Interesting. Scratch that. A new dataset and revised code shows HashSets out in front (but still not a realistic option for very large sets) : http://goo.gl/Lb4J1 In this benchmark I removed the code common to all previous tests which was firs

Re: ElasticSearch

2011-11-17 Thread Mark Harwood
I don't think of queries as inherently flat in the way HTTP request parameters are with their name=value pairings. JSON or XML can reflect more closely the hierarchy in the underlying Lucene query objects. For me using a "flat" query interface feels a bit like when you start off trying to manag

Re: ElasticSearch

2011-11-17 Thread Mark Harwood
> > Other parameters such as filters, faceting, highlighting, sorting, > etc, don't normally have any hierarchy. I regularly mix filters and queries inside Boolean logic. Attempts to structure data (e.g. geocoding) don't always achieve 100% coverage and so for better recall you must also resor

Re: Replicating Lucene Index with out SOLR

2008-08-28 Thread mark harwood
>> You don't need to copy the whole index every time >> if you do incremental indexing/updates and don't optimize the index But at 5 minute intervals for replication does this not quickly lead to a very fragmented index? It seems there is a fundamental conflict when building replication system

Re: MoreLikeThis return no results

2008-08-30 Thread mark harwood
MoreLikeThis needs to find the terms in your doc. It tries to do this by using TermFreqVectors which are stored in the index if you choose to add them at index-time. If you haven't done this then it will fall back to reanalysing the content of the document using an analyser (despite what the j

Re: MoreLikeThis return no results

2008-09-01 Thread mark harwood
hows no result. I checked the stored documents and the TermVector exists and is correct but MoreLikeThis returns no result for a given document id. What am I missing? mark harwood wrote: > > MoreLikeThis needs to find the terms in your doc. It tries to do this by > using TermFreqVecto

Re: AW: Search with multiple wildcards

2008-09-11 Thread mark harwood
You need to call rewrite on the query to expand it then give that version to the highlighter - see the package javadocs. http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/search/highlight/package-summary.html#package_description Cheers Mark - Original Message From: "Sertic M

Re: AW: AW: Search with multiple wildcards

2008-09-11 Thread mark harwood
Ok, one final question: If i query for "*ll*", the query is expanded to ("hallo" or "alle" or ...), so the Highligter will highlight the words "hallo" or "alle". But how can i highlight only the original query, so only the "ll"? Is this

Re: AW: AW: Search with multiple wildcards

2008-09-11 Thread mark harwood
ass to the highlighter. That should give you the functionality you are looking for. -Matt mark harwood wrote: >>> Is this possible? >>> > > Not currently, the highlighter works with a list of words (or words AND > phrases using the new span support) and highlig

Re: TermsFilter and MUST

2008-09-12 Thread mark harwood
TermsFilter has taken the relatively easy option of ORing terms and this is inexpensive to construct. Adding more complex features (mixes of MUST/SHOULD/NOT clauses) starts to require the sorts of optimisations you see in BooleanQuery (MUST clauses accelerating processing of other clauses throu

Re: TermsFilter and MUST

2008-09-12 Thread mark harwood
>>here I'm AND-ing each bitset. Does it look ok? In principle it looks like it will work fine but the BooleanQuery approach I described may prove to be faster on large datasets because ultimately td.skipTo() will be called to avoid excessive disk reads. Cheers Mark - Original Message ---

Re: Buzz measurement - Aggregate functions

2008-10-10 Thread mark harwood
Ah, sorry. Just saw the bit about the free text query too. A FieldCache is the answer here I suspect in order to quickly retrieve the date values for arbitrary queries. - Original Message From: mark harwood <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Friday, 10 O

Re: Buzz measurement - Aggregate functions

2008-10-10 Thread mark harwood
Assuming your date data is held as YYYYMMDD and you want daily totals Term startTerm=new Term("date","20080101"); TermEnum termEnum = indexReader.terms(startTerm); do { Term currentTerm = termEnum.term(); if(currentTerm.field()!=startTerm

Re: Question regarding sorting and memory consumption in lucene

2008-10-10 Thread mark harwood
Assuming content is added in chronological order and with no updates to existing docs couldn't you rely on internal Lucene document id to give a chronological sort order? That would require no memory cache at all when sorting. Querying across multiple indexes simultaneously however may present a

Re: Question regarding sorting and memory consumption in lucene

2008-10-10 Thread mark harwood
ndexes than this, right? cheers, Aleksander On Fri, 10 Oct 2008 15:18:46 +0200, mark harwood <[EMAIL PROTECTED]> wrote: > Assuming content is added in chronological order and with no updates to > existing docs couldn't you rely on internal Lucene document id to give a > ch

Re: Question regarding sorting and memory consumption in lucene

2008-10-10 Thread mark harwood
epresent up to 65536 values - capable of representing a date range of 179 years. - Original Message ---- From: mark harwood <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Friday, 10 October, 2008 15:43:35 Subject: Re: Question regarding sorting and memory consumpt

Re: Question regarding sorting and memory consumption in lucene

2008-10-10 Thread mark harwood
ick isn't really a word in my vocabulary when it's 6 o'clock on a Friday :( Guess it'll be a looong night.. :( Cheers, Aleks On Fri, 10 Oct 2008 17:07:31 +0200, mark harwood <[EMAIL PROTECTED]> wrote: > Update: The statement "...cost is field size (10

Re: Question regarding sorting and memory consumption in lucene

2008-10-14 Thread Mark Harwood
Yes, StringIndex's public fields make life awkward. Re initialization - I did think you could try use arrays of byte arrays. First 256 terms can be addressed using just one byte array, on encountering a 257th term an extra byte array is allocated. References to terms then require indexing into

Re: Question regarding sorting and memory consumption in lucene

2008-10-15 Thread mark harwood
Further to our discussion - see below a class that measures the added construction cost and memory savings for an optimised field value cache for a given index. The optimisation here being initial use of byte arrays, then shorts, then ints as more unique terms emerge. I imagine the majority of
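A minimal sketch of the optimisation described above: term ordinals per document start in a byte array and are widened to shorts, then ints, only when a larger ordinal turns up. This is not the class posted in the thread. The name CompactOrdinals and its methods are hypothetical, and the copy on each upgrade stands in for the added construction cost mentioned above.

```java
/** Per-document term ordinals stored in the narrowest array that fits; a sketch only. */
final class CompactOrdinals {

    private final int size;
    private byte[] bytes;   // used while every ordinal is <= 255
    private short[] shorts; // used while every ordinal is <= 65535
    private int[] ints;     // final fallback

    CompactOrdinals(int numDocs) {
        this.size = numDocs;
        this.bytes = new byte[numDocs];
    }

    /** Stores the ordinal for a doc, widening the backing array first if needed. */
    void set(int docId, int ordinal) {
        if (ordinal < 0) {
            throw new IllegalArgumentException("ordinal must be >= 0");
        }
        if (bytes != null) {
            if (ordinal <= 0xFF) {
                bytes[docId] = (byte) ordinal;
                return;
            }
            upgradeToShorts();
        }
        if (shorts != null) {
            if (ordinal <= 0xFFFF) {
                shorts[docId] = (short) ordinal;
                return;
            }
            upgradeToInts();
        }
        ints[docId] = ordinal;
    }

    /** Reads the ordinal back, masking so bytes and shorts are treated as unsigned. */
    int get(int docId) {
        if (bytes != null) {
            return bytes[docId] & 0xFF;
        }
        if (shorts != null) {
            return shorts[docId] & 0xFFFF;
        }
        return ints[docId];
    }

    /** Current memory cost per document in bytes. */
    int bytesPerDoc() {
        return bytes != null ? 1 : shorts != null ? 2 : 4;
    }

    private void upgradeToShorts() {
        shorts = new short[size];
        for (int i = 0; i < size; i++) {
            shorts[i] = (short) (bytes[i] & 0xFF);
        }
        bytes = null;
    }

    private void upgradeToInts() {
        ints = new int[size];
        for (int i = 0; i < size; i++) {
            ints[i] = shorts[i] & 0xFFFF;
        }
        shorts = null;
    }
}
```

A field with at most 256 unique terms stays at 1 byte per document instead of the 4 bytes an int array would use.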

Re: using list of items to be excluded while querying

2008-10-16 Thread Mark Harwood
Yes, use TermsFilter to add your 5000 terms by calling TermsFilter.addTerm(term) repeatedly then put that single filter as a single "not" clause in a BooleanFilter Cheers Mark On 17 Oct 2008, at 04:02, "prabin meitei" <[EMAIL PROTECTED]> wrote: Hi, Thanks for the reply. I looked through the Fi

Re: OutOfMemory Problems Lucene 2.4 / Tomcat

2008-10-30 Thread mark harwood
One issue with the existing field cache implementation is that it uses int arrays to reference into the list of unique terms where short or even byte arrays may suffice for fields with smaller numbers of unique terms. How many unique terms do you have? I posted some code that measures the potent

Re: Luke is coming .. not there yet.

2008-10-30 Thread mark harwood
>>I'd like to ask the Lucene user community what version of Lucene would be >>preferable A Swing-based one, managed in Lucene/contrib and released with every Lucene build . ;) - Original Message From: Andrzej Bialecki <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Thur

Re: Luke is coming .. not there yet.

2008-10-30 Thread mark harwood
Message From: Andrzej Bialecki <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Thursday, 30 October, 2008 11:32:37 Subject: Re: Luke is coming .. not there yet. mark harwood wrote: >>> I'd like to ask the Lucene user community what version of Lucene would be >

Using DeletionPolicy to roll back to previous commit point

2008-11-11 Thread mark harwood
Probably a question for Mike M. Is it possible/sensible to use IndexDeletionPolicy to remove the *newest* commit points (as opposed to the usual scenario of deleting old commit points). I experimented with this: class RollbackDeletionPolicy implements IndexDeletionPolicy { pub

Re: [ANN] Luke 0.9 released

2008-11-14 Thread mark harwood
Hi Andrzej, Thanks for the update. Looks like you've been busy adding some great new features! I think you may have a bug in opening an index with prior commit points, though. I want to keep these in my index and so I opened it in Luke selecting the "open read only" and "keep all commit points

Re: [ANN] Luke 0.9 released

2008-11-14 Thread mark harwood
, Mark - Original Message From: Andrzej Bialecki <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Friday, 14 November, 2008 10:47:03 Subject: Re: [ANN] Luke 0.9 released mark harwood wrote: > Hi Andrzej, > > Thanks for the update. Looks like you've been bus

Re: Optimize and Out Of Memory Errors

2008-12-23 Thread mark harwood
I've had reports of OOM exceptions during optimize on a couple of large deployments recently (based on Lucene 2.4.0) I've given the usual advice of turning off norms, providing plenty of RAM and also suggested setting IndexWriter.setTermIndexInterval(). I don't have access to these deployment en

Re: Optimize and Out Of Memory Errors

2008-12-23 Thread mark harwood
Field("field5", "groupId" + i, Field.Store.YES, Field.Index.UN_TOKENIZED)); writer.addDocument(doc); From: mark harwood To: java-user@lucene.apache.org Sent: Tuesday, December 23, 2008 2:42:25 PM Subject: Re: Optimize a

Re: Poor QPS with highlighting

2009-02-03 Thread mark harwood
>>My documents are quite big sometimes up to 300k tokens. You could look at indexing them as separate documents using overlapping sections of text. Erik used this for one of his projects. Cheers Mark - Original Message From: Michael Stoppelman To: java-user@lucene.apache.org Sent: Tu

Re: Lucene search performance on Sun UltraSparc T2 (T5120) servers

2009-02-18 Thread mark harwood
I was having some thoughts recently about speeding up fuzzy search. The current system does edit-distance on all terms A-Z, single threaded. Prefix length can reduce the search space and there is a "minimum similarity" threshold but that's roughly where we are. Multithreading this to make use o

Re: IndexWriter 2-phase commit usage

2009-02-24 Thread mark harwood
As suggested, the window for failure here is very small. The commit is effectively an atomic single file rename operation to make the new segments file visible. However, should there be a failure between 2 commits the new deletion policy logic should help you recover to prior commit points. See

A model for predicting indexing memory costs?

2009-03-09 Thread mark harwood
I've been building a large index (hundreds of millions) with mainly structured data which consists of several fields with mostly unique values. I've been hitting out of memory issues when doing periodic commits/closes which I suspect is down to the sheer number of terms. I set the IndexWriter..

Re: Lucene 2.9

2009-03-09 Thread mark harwood
>>Maybe we could do something similar to declare that a given field uses Trie*, >>and with what datatype. With the current implementation you can at least test for the presence of a field called: [fieldName]#trie ..which tells you some form of trie is used but could be extended to include

Re: A model for predicting indexing memory costs?

2009-03-10 Thread mark harwood
ent: Tuesday, 10 March, 2009 0:01:30 Subject: Re: A model for predicting indexing memory costs? mark harwood wrote: > > I've been building a large index (hundreds of millions) with mainly > structured data which consists of several fields with mostly unique values. > I've been

Re: A model for predicting indexing memory costs?

2009-03-10 Thread mark harwood
with -XX:-UseGCOverheadLimit http://java-monitor.com/forum/archive/index.php/t-54.html http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html#par_gc.oom -- Ian. On Tue, Mar 10, 2009 at 10:45 AM, mark harwood wrote: > >>>But... how come setting IW's RAM buffer do

Re: A model for predicting indexing memory costs?

2009-03-10 Thread mark harwood
you? Are you happy, does searches work well with 30 mio docs, which precisionStep do you use? Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: mark harwood [mailto:markharw...@yahoo.co.uk] > Sent:

Re: A model for predicting indexing memory costs?

2009-03-10 Thread mark harwood
Token class when creating the trie > encoded fields. > > How works TrieRange for you? Are you happy, does searches work well with > 30 > mio docs, which precisionStep do you use? > > Uwe > > - > Uwe Schindler > H.-H.-Meier-Allee 63, D-28213 Bremen > http:

Re: A model for predicting indexing memory costs?

2009-03-10 Thread mark harwood
ing a new IndexWriter each time? Or, just calling .commit() and then re-using the same writer? It seems likely this has something to do with merging, though from your listing I count 14 segments which shouldn't have been doing any merging at mergeFactor=20, so that's confusing.

Re: A model for predicting indexing memory costs?

2009-03-10 Thread mark harwood
ts by pointing out that it's not only *your* time that's at risk, but customers' time too. Whether you define customers as internal or external is irrelevant. Every round of diagnosis/fix carries the risk that N people waste time (and get paid for it). All to avoid a little up-front co

Re: A model for predicting indexing memory costs?

2009-03-11 Thread mark harwood
Wednesday, 11 March, 2009 10:42:33 Subject: Re: A model for predicting indexing memory costs? * mark harwood: >>>Could you get a heap dump (eg with YourKit) of what's using up all the >>>memory when you hit OOM? > > On this particular machine I have a JRE, no adm

Re: A model for predicting indexing memory costs?

2009-03-11 Thread mark harwood
OK, it's early days and I'm holding my breath but I'm currently progressing further through my content without an OOM just by using a different GC setting. Thanks to advice here and colleagues at work I've gone with a GC setting of -XX:+UseSerialGC for this indexing task. The rationale that is

Re: Lucene Highlighting and Dynamic Summaries

2009-03-12 Thread mark harwood
The attachment didn't make it through here. Can you add it as an attachment to a new JIRA issue? Thanks, Mark From: Amin Mohammed-Coleman To: java-user@lucene.apache.org Sent: Thursday, 12 March, 2009 7:47:20 Subject: Re: Lucene Highlighting and Dynamic Summ

Re: What is an optimal approach?

2009-03-30 Thread mark harwood
That's probably more a question about MarkLogic APIs than it is about Lucene. What APIs does MarkLogic provide for getting at the content e.g does it provide a JSR-170 standard interface ( http://www.slideshare.net/uncled/introduction-to-jcr ) I presume you have already ruled out the in-built M

Re: What is an optimal approach?

2009-03-30 Thread mark harwood
ptimal approach incase someone already have similar situation. -Original Message----- From: mark harwood [mailto:markharw...@yahoo.co.uk] Sent: Mon 3/30/2009 11:16 AM To: java-user@lucene.apache.org Subject: Re: What is an optimal approach? That's probably more a question about MarkLogic A

Re: Speed of fuzzy searches

2009-04-02 Thread mark harwood
Try setting the minimum prefix length for fuzzy queries ( I think there is a setting on QueryParser or you may need to subclass) Prefix length of zero does edit distance comparisons for all unique terms e.g. from "aardvark" to "" Prefix length of one would cut this search space down to just
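A small sketch, in plain Java, of how a prefix length shrinks the set of terms that need an edit distance comparison. It is not Lucene's FuzzyQuery implementation: the class FuzzyPrefixDemo is hypothetical, the term list is made up, and a maximum edit count stands in for Lucene's similarity threshold to keep the example short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy fuzzy matcher showing the effect of a required common prefix. */
final class FuzzyPrefixDemo {

    /** Classic Levenshtein distance using two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Terms that share the first prefixLength characters of the query. */
    static List<String> candidates(List<String> terms, String query, int prefixLength) {
        String prefix = query.substring(0, Math.min(prefixLength, query.length()));
        List<String> result = new ArrayList<>();
        for (String term : terms) {
            if (term.startsWith(prefix)) {
                result.add(term);
            }
        }
        return result;
    }

    /** Runs the expensive edit distance check only on the prefix candidates. */
    static List<String> fuzzyMatch(List<String> terms, String query, int prefixLength, int maxEdits) {
        List<String> result = new ArrayList<>();
        for (String term : candidates(terms, query, prefixLength)) {
            if (editDistance(term, query) <= maxEdits) {
                result.add(term);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("aardvark", "apple", "lucene", "lucent", "zebra");
        System.out.println("prefix 0 compares " + candidates(terms, "lucine", 0).size() + " terms");
        System.out.println("prefix 1 compares " + candidates(terms, "lucine", 1).size() + " terms");
        System.out.println("matches: " + fuzzyMatch(terms, "lucine", 1, 1));
    }
}
```

With a prefix length of zero every term is compared; with a prefix length of one only the terms starting with the same first letter are compared, at the cost of missing misspellings in that first letter.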

Re: Servlets Sharing Resources

2009-04-21 Thread mark harwood
Spring is pretty useful for managing and sharing resources - see what looks like a related example here: http://croarkin.blogspot.com/2008/05/injecting-spring-bean-into-servlet.html Cheers, Mark - Original Message From: David Seltzer To: java-user@lucene.apache.org Sent: Tuesday,

Re: SpanQuery wildcards?

2009-04-23 Thread mark harwood
Related: https://issues.apache.org/jira/browse/LUCENE-1486 - Original Message From: Steven A Rowe To: "java-user@lucene.apache.org" Sent: Thursday, 23 April, 2009 16:54:08 Subject: RE: SpanQuery wildcards? Hi Ivan, SpanRegexQuery should work - just use ".*" instead of "*". - Steve

Re: Low-memory searcher

2009-04-24 Thread mark harwood
See IndexReader.setTermInfosIndexDivisor() for a way to help reduce memory usage without needing to re-index. If you have indexed fields with omitNorms off (the default) you will be paying a 1 byte per field per document memory cost and may need to look at re-indexing Cheers Mark - Orig

Re: Indexing becomes slow with time

2009-04-30 Thread mark harwood
If you're CPU-bound - I've had issues before with GC in long-running indexing tasks loading very large volumes (100s of millions) of docs. I was seeing lots of CPU usage tied up in GC. I solved all these problems by firing batches of indexing activity off in separate processes then immediately

Re: Max size of index? How do search engines avoid this?

2009-05-18 Thread mark harwood
>techniques used by big search engines to search among such huge data. Two keywords here - partitioning and replication. Partitioning is breaking the content down into shards and assigning shards to servers. These can then be queried in parallel to make search response times independent of the

Re: Fuzzy vs Prefix query Performance

2009-06-15 Thread mark harwood
FuzzyQuery performance is related to number of unique terms in the index not the number of documents e.g. a single "telephone directory" document could contain millions of terms. Each term considered is compared using an "edit distance" algo which is CPU intensive. The FuzzyQuery prefix length

Re: Doc-Doc Similarity Matrix Construction

2009-06-29 Thread Mark Harwood
See MoreLikeThis in the contrib/queries folder. It optimizes the speed of similarity comparisons by taking the most significant words only from a document as search terms. On 29 Jun 2009, at 20:14, Amir Hossein Jadidinejad wrote: Hi, It's my first experiment with Lucene. Please help me.

Re: Highligheter fails using JapaneseAnalyzer

2009-07-01 Thread mark harwood
Can you verify the Token byte offsets produced by this particular analyzer are correct? - Original Message From: k.sayama To: java-user@lucene.apache.org Sent: Wednesday, 1 July, 2009 15:22:37 Subject: Re: Highligheter fails using JapaneseAnalyzer hi I verified it by using SimpleAn

Re: Highligheter fails using JapaneseAnalyzer

2009-07-01 Thread mark harwood
day, 1 July, 2009 16:13:17 Subject: Re: Highligheter fails using JapaneseAnalyzer Sorry I can not verify the Token byte offsets produced by JapaneseAnalyzer How should I verify it? - Original Message - From: "mark harwood" To: Sent: Wednesday, July 01, 2009 11:31 PM Subject:

Re: Highligheter fails using JapaneseAnalyzer

2009-07-01 Thread Mark Harwood
On 1 Jul 2009, at 17:39, k.sayama wrote: I could verify Token byte offsets The system outputs aaa:0:3 bbb:0:3 ccc:4:7 That explains the highlighter behaviour. Clearly BBB is not at position 0-3 in the String you supplied String CONTENTS = "AAA :BBB CCC"; Looks like the Tokenizer need

Re: Boolean retrieval

2009-07-04 Thread Mark Harwood
Check out BooleanFilter in contrib/queries. It can be wrapped in a ConstantScoreQuery. On 4 Jul 2009, at 17:37, Lukas Michelbacher wrote: This is about an experiment comparing plain Boolean retrieval with vector-space-based retrieval. I would like to disable all of Lucene's scoring mechani

Re: Need help regarding Lucene index/query

2009-07-05 Thread Mark Harwood
I would appreciate if i can get help with the code as well. If you want to tweak an existing example rather than coding entirely from scratch the XMLQueryParser in /contrib has a demo web app for job search with a "location" field similar in principle to your "state" field plus it has a G

Re: Boolean retrieval

2009-07-07 Thread mark harwood
ts() + " hits"); The result is 0 hits (should be 640). [1] tinyurl.com/ml52ye 2009/7/4 Mark Harwood : > > Check out booleanfilter in contrib/queries. It can be wrapped in a > constantScoreQuery > > > > On 4 Jul 2009, at 17:37, Lukas Michelbacher > wrote: >

Re: Multi Value field

2009-07-07 Thread Mark Harwood
if the term is "X Y" the document 2 is getting higher score then document 1. That may be length normalisation at play. Doc 2 is shorter so may be seen as a better match for that reason. Using the "explain" function helps illustrate the break down of scores in matches. You could try index

Re: Multi Value field

2009-07-07 Thread Mark Harwood
I just try norms idea as well no change You'll need to look at searcher.explain() for the two docs or post a Junit or code example that can be executed which shows the issue - To unsubscribe, e-mail: java-user-unsubscr...@l
