Re: Wildcard in PhraseQuery

2013-08-27 Thread mark harwood
See  http://lucene.apache.org/core/4_3_1/queryparser/org/apache/lucene/queryparser/complexPhrase/ComplexPhraseQueryParser.html From: Ian Lea To: java-user@lucene.apache.org Sent: Tuesday, 27 August 2013, 10:16 Subject: Re: Wildcard in PhraseQuery See the FAQ

Re: 3.6 - querying a no-norms field and getting document boost

2013-01-25 Thread mark harwood
Answering my own question - add optional new MatchAllDocsQuery("text") clause to factor in the encoded norms from the "text" field. ____ From: mark harwood To: "java-user@lucene.apache.org" Sent: Friday, 25 January 2013, 16:11 Subj

3.6 - querying a no-norms field and getting document boost

2013-01-25 Thread mark harwood
I have a 3.6 index with many no-norms fields and a single text field with norms (a fairly common configuration). There is a document boost I have set at index-time that will have been encoded into the text field's norms. If I query solely on a non-text field then the ranking does not apply the

Re: Lucene 4.0, Serialization

2012-12-04 Thread mark harwood
This was part of the rationale for introducing the XML Query Parser: 1) An extensible query syntax that is expressive enough to represent the full range of Lucene functions (filters, moreLikeThis etc) 2) Serializable 3) Language independent 4) Decouples the holder of query criteria from the  impl

Re: ComplexPhraseQueryParser and stop words

2012-11-02 Thread Mark Harwood
Hi Brandon, Can you start by calling toString on the parse result (the Query object) to see what is being produced and post that here. On the face of it it sounds like it should work OK. What happens if you use the "normal" query parser on your query "time to leave" - that should parse ok as

Re: DuplicateFilter filters not only duplicates

2012-08-30 Thread mark harwood
DuplicateFilter has been mostly broken  since Lucene's switch over to segment-level filtering. Since v2.9 the calls to Filter.getDocIdSet no longer pass a "top-level" reader for accessing the whole index and instead pass a reader restricted to only accessing a single segment's contents. Becaus

Re: Creating Span Queries from Boolean Queries

2012-08-22 Thread mark harwood
>>> Ideally I'd like to take any ANDed clauses and require them to occur >>> within $SPAN of the other ANDs. See ComplexPhraseQueryParser? Like the standard QueryParser it uses quotes to define a phrase but also interprets any special characters between the quotes e.g. ( ) * ~ The syntax and

Re: Mapping Lucene search results with a relational database

2012-07-03 Thread mark harwood
Many considerations here - I find the technical concerns you present typically open a can of worms for any businesses worried about security. It gets political quickly.   In environments where security is paramount, software must be formally accredited, which is a costly exercise. Often the choi

Sequence diagrams for Lucene 4.0 classes

2012-05-23 Thread mark harwood
I've created a couple of sequence diagrams of core Lucene 4.0 classes that may be of use to others: Low-level classes used while writing indexes http://goo.gl/dI3HY Low-level classes used while reading indexes: http://goo.gl/e8JEj FWIW I found the websequencediagrams.com editor in these lin

Re: Searching accross 2 fields

2012-05-22 Thread mark harwood
       value: 20            } } doc 2: { form: { id: 1040 }   attrib: {                  name: age                  value: 22            } } On Mon, May 21, 2012 at 3:24 PM, Mark Harwood wrote: > You're describing what I call the "cross matching" problem if you flatten > nested,

Re: Searching accross 2 fields

2012-05-21 Thread Mark Harwood
You're describing what I call the "cross matching" problem if you flatten nested, repeating structures with multiple fields into a single flat Lucene document model. The approach for handling the more complex mappings is to use nested child docs in Lucene and for that look at BlockJoinQuery. Ho

Re: Nested BlockJoinQuery

2012-02-11 Thread Mark Harwood
Your requirement does not sound like a good fit for the nested stuff but is probably more one for conventional faceting. I would characterise the uses for Nested as follows: 1) The parent of a nested block is typically the "item of interest" that is returned i.e. the search results are a list

Re: ElasticSearch

2011-11-17 Thread Mark Harwood
> > Other parameters such as filters, faceting, highlighting, sorting, > etc, don't normally have any hierarchy. I regularly mix filters and queries inside Boolean logic. Attempts to structure data (e.g. geocoding) don't always achieve 100% coverage and so for better recall you must also resor

Re: ElasticSearch

2011-11-17 Thread Mark Harwood
I don't think of queries as inherently flat in the way HTTP request parameters are with their name=value pairings. JSON or XML can reflect more closely the hierarchy in the underlying Lucene query objects. For me using a "flat" query interface feels a bit like when you start off trying to manag

Re: Bet you didn't know Lucene can...

2011-10-26 Thread mark harwood
>>  > Avg lookup time slightly less than a HashSet? Interesting. Scratch that. A new dataset and revised code shows HashSets out in front (but still not a realistic option for very large sets) : http://goo.gl/Lb4J1 In this benchmark I removed the code common to all previous tests which was firs

Re: Bet you didn't know Lucene can...

2011-10-25 Thread Mark Harwood
lightly less than a HashSet? Interesting. Is the code > to these benchmarks available somewhere? > > Dawid > > On Tue, Oct 25, 2011 at 9:57 PM, Grant Ingersoll wrote: >> >> On Oct 25, 2011, at 11:26 AM, mark harwood wrote: >> >>>>> using

Re: Bet you didn't know Lucene can...

2011-10-25 Thread mark harwood
>>using Lucene that don't fit under the core premise of full text search  I've had several use cases over the years that use features peculiar to Lucene but here's a very simple one I came across today that illustrates its raw index lookup capability: I needed a fast, scalable and persistent "S

Re: What kind of System Resources are required to index 625 million row table...???

2011-08-16 Thread Mark Harwood
Check "norms" are disabled on your fields because they'll cost you 1 byte x NumberOfDocs x numberOfFieldsWithNormsEnabled. On 16 Aug 2011, at 15:11, Bennett, Tony wrote: > Thank you for your response. > > You are correct, we are sorting on timestamp. > Timestamp has microsecond granularity, a
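As a back-of-the-envelope illustration of the rule of thumb above (not part of the original post): the sketch below just multiplies the two counts. The class name, the method name and the figure of 4 norms-enabled fields are made up for the example; only the 625 million row count comes from the thread subject.

```java
// Rough estimate of the memory taken by norms, using the rule of thumb:
// 1 byte x number of docs x number of fields with norms enabled.
// NormsCostEstimate and estimateBytes are hypothetical names, not Lucene API.
final class NormsCostEstimate {

    // Returns the estimated norms cost in bytes.
    static long estimateBytes(long numDocs, int fieldsWithNorms) {
        if (numDocs < 0 || fieldsWithNorms < 0) {
            throw new IllegalArgumentException("counts must be non-negative");
        }
        // multiplyExact throws instead of silently overflowing on huge inputs
        return Math.multiplyExact(numDocs, (long) fieldsWithNorms);
    }

    public static void main(String[] args) {
        long docs = 625_000_000L; // row count from the thread subject
        int fields = 4;           // assumed number of fields with norms enabled
        long bytes = estimateBytes(docs, fields);
        System.out.printf("approx. %.2f GiB for norms%n",
                bytes / (1024.0 * 1024.0 * 1024.0));
    }
}
```

Disabling norms on a field removes that field from the multiplication, which is why the advice is to check them first.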

Re: Corrupt segments file full of zeros

2011-06-28 Thread mark harwood
From: Michael McCandless To: java-user@lucene.apache.org Sent: Tue, 28 June, 2011 14:59:48 Subject: Re: Corrupt segments file full of zeros On Tue, Jun 28, 2011 at 9:29 AM, mark harwood wrote: > Hi Mike. >>>Hmmm -- what code are you running here, to pr

Re: Corrupt segments file full of zeros

2011-06-28 Thread mark harwood
Hi Mike. >>Hmmm -- what code are you running here, to print the number of docs? SegmentInfos.setInfoStream(System.out); FSDirectory dir = FSDirectory.open(new File("j:/indexes/myindex")); IndexReader r = IndexReader.open(dir, true); System.out.println("index has "+r.maxDoc()+" docs"); From my

Re: Corrupt segments file full of zeros

2011-06-28 Thread mark harwood
According to the spec there should at least be an Int32 of -9 to declare the Format - http://lucene.apache.org/java/2_9_3/fileformats.html#Segments File - Original Message From: Uwe Schindler To: java-user@lucene.apache.org Sent: Tue, 28 June, 2011 12:32:34 Subject: RE: Corrupt segme

Re: Coloring search results based on score?

2011-06-16 Thread Mark Harwood
See Highlighter's GradientFormatter Cheers Mark On 16 Jun 2011, at 22:01, Itamar Syn-Hershko wrote: > Hi all, > > > Interesting question: is it possible to color search results in a web-page > based on their score? e.g. most relevant results in green, and then different > shades through ora

Re: Index size and performance degradation

2011-06-14 Thread mark harwood
Partitioning and replication are the keys to handling data and user volumes respectively. However, this approach introduces some other concerns over consistency and availability of content which I've tried to capture here: http://www.slideshare.net/MarkHarwood/patterns-for-large-scale-search Th

Re: When nested indexing and search will be available?

2011-06-06 Thread Mark Harwood
As of 3.2 the necessary changes were put in to safely support indexing nested docs. See http://lucene.apache.org/java/3_2_0/changes/Changes.html#3.2.0.new_features On 6 Jun 2011, at 17:18, 周诚 wrote: > I just saw this: > https://issues.apache.org/jira/secure/attachment/12480123/LUCENE-2454.patc

Re: Ranking docs with all terms higher

2011-05-19 Thread mark harwood
Of course IDF is a factor too meaning a match on a single rare (to the overall index) term may be worth more than a match on 2 different common (to the index) terms. As Ian suggests a custom Similarity implementation can be used to tune this out. - Original Message From: Ian Lea To: j

Re: Early Termination

2011-03-16 Thread mark harwood
See https://issues.apache.org/jira/browse/LUCENE-1720 - Original Message From: Alex vB To: java-user@lucene.apache.org Sent: Wed, 16 March, 2011 0:12:41 Subject: Early Termination Hi, is Lucene capable of any early termination techniques during query processing? On the forum I only fo

Re: Detecting duplicates

2011-03-10 Thread mark harwood
This is possible using contrib's DuplicateFilter. Below is an example of your problem defined as an XML-based test which I just ran OK through my test writer/runner. Hopefully this is readable and demonstrates the use of FilteredQuery/DuplicateFilter. This is my test

Re: termInfosIndexDivisor vs termIndexInterval

2011-02-07 Thread mark harwood
Somewhat historic reasons. It used to be IndexWriter was the only place you could define this setting (making it an index-time decision burnt into the index). The IndexReader option is a relatively newer addition that adds the flexibility to decide about memory usage whenever you open the index (

Re: Maintaining index for "flattened" database tables

2011-01-13 Thread mark harwood
Probably off-topic for a Lucene list but the typical database options are: 1) an auto-updated "last changed" timestamp column on related tables that can be queried 2) a database trigger automatically feeding a "to-be-indexed" table Option 1 would also need a "marked as deleted" column adding to

Re: How to Cache Filter Results between Servers

2010-11-29 Thread mark harwood
>> 1. why ir.hashCode() returns different value every time I run >> this >>code? Presumably because it is a different object instance in a different JVM? IndexReader.hashCode() and IndexReader.equals() are not designed to represent/summarise the physical contents of an index. They

Re: Next Word - Any Suggestions?

2010-10-26 Thread mark harwood
See the Collocation stuff here https://issues.apache.org/jira/browse/LUCENE-474 - Original Message From: Lucene To: java-user@lucene.apache.org Sent: Tue, 26 October, 2010 13:27:06 Subject: Next Word - Any Suggestions? Am about to implement a custom query that is sort of mash-up of Fac

Re: Consider only documents of a category for IDF

2010-10-18 Thread mark harwood
Can you not just call reader.docFreq(categoryTerm) ? The returned figure includes deleted docs but then the search term uses this method too so should suffer from the same inaccuracy. Cheers Mark - Original Message From: Max Jakob To: java-user@lucene.apache.org Sent: Mon, 18 Octobe

Re: Merge and commit behaviour - changed between 2.4 and 2.9?

2010-10-06 Thread mark harwood
the last commit. Mike On Tue, Oct 5, 2010 at 6:45 PM, Mark Harwood wrote: > OK. I'll double check the reports. > So presumably when merges occur outside of transaction control (post commit) >the post-merge update of the segments_N file is managed safely somehow? > I can see the

Re: Merge and commit behaviour - changed between 2.4 and 2.9?

2010-10-05 Thread Mark Harwood
> > In both 2.4 and 2.9.x (and all later versions), neither .prepareCommit > nor .commit wait for merges. > > That said, if a merge happens to complete before you call those > methods, then it is in fact committed. > > Mike > > On Tue, Oct 5, 2010 at 1:13 PM, Mar

Merge and commit behaviour - changed between 2.4 and 2.9?

2010-10-05 Thread Mark Harwood
Having upgraded a live system from 2.4 to 2.9.3 the client is reporting a change in merge behaviour that is causing some issues with their update monitoring logic. The suggestion is that any merge operations now complete as part of the IW.prepareCommit() call rather than previously when they ra

Re: Federated search with opensearch or proprietary APIs for Atlassian

2010-09-02 Thread mark harwood
A pretty thorough exploration of the issues in federated search here: http://ilpubs.stanford.edu:8090/271/ I'd add "security" i.e. authentication and authorisation to the list of issues to be considered (key in some environments). If you consolidate content in a centralised Solr/Lucene indexing

Re: on-the-fly "filters" from docID lists

2010-07-23 Thread Mark Harwood
.set(docs[0]); > } > >>> That could involve a lot of disk seeks unless you cache a pk->docid lookup >>> in ram. > That sounds interesting. How would the pk->docid lookup get populated? > Wouldn't a pk->docid cache be invalidated with each commit or merge?

Re: on-the-fly "filters" from docID lists

2010-07-22 Thread Mark Harwood
Re scalability of filter construction - the database is likely to hold stable primary keys not lucene doc ids which are unstable in the face of updates. You therefore need a quick way of converting stable database keys read from the db into current lucene doc ids to create the filter. That could
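A minimal sketch of the idea above, assuming the pk->docid mapping is held in RAM as a plain map (the approach raised in the follow-up message). PkDocIdCache and its methods are hypothetical names; no Lucene classes are used, and how the map is populated from the index is left out. Because doc ids are unstable in the face of updates, a real version would have to rebuild the map whenever the index changes.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory lookup from stable database primary keys to current doc ids.
// Hypothetical helper for illustration only, not part of Lucene.
final class PkDocIdCache {
    private final Map<String, Integer> pkToDocId = new HashMap<>();
    private final int maxDoc;

    PkDocIdCache(int maxDoc) {
        this.maxDoc = maxDoc;
    }

    // Record the doc id currently assigned to a primary key.
    void put(String pk, int docId) {
        if (docId < 0 || docId >= maxDoc) {
            throw new IllegalArgumentException("docId out of range: " + docId);
        }
        pkToDocId.put(pk, docId);
    }

    // Turn a list of keys read from the database into a bit set of doc ids.
    // Keys that are not in the cache are skipped.
    BitSet toBitSet(List<String> pks) {
        BitSet bits = new BitSet(maxDoc);
        for (String pk : pks) {
            Integer docId = pkToDocId.get(pk);
            if (docId != null) {
                bits.set(docId);
            }
        }
        return bits;
    }
}
```

The resulting bit set is the raw material for an on-the-fly filter; each key costs one hash lookup rather than a disk seek.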

Re: XML results ranking

2010-07-16 Thread mark harwood
LUCENE-2454 includes an example of matching logic that respects the structure in XML documents (see https://issues.apache.org/jira/browse/LUCENE-2454). The example class TestNestedDocumentQuery queries XHTML marked up with hResume syntax. We don't have XQuery syntax support in a parser now (an

Re: Searching docs with multi-value fields

2010-07-09 Thread Mark Harwood
Check out LUCENE-2454 and accompanying slide show if your reason for doing this is modelling repeating elements. On 9 Jul 2010, at 13:43, "Hans-Gunther Birken" wrote: > I'm examining the following search problem. Consider a document with two > multi-va

Re: DuplicateFilter question

2010-05-31 Thread Mark Harwood
The DuplicateFilter passed to the searcher does not have visibility of the text query and is therefore evaluated independently from all other criteria. Sounds like the behaviour you want is to get the last duplicate that also matches your criteria, which seems like something fairly common to need

Re: best way to interest two queries?

2010-05-12 Thread mark harwood
terest and Query objects record match metadata in singleton MatchAttribute objects as they stream their way through result sets. Result set streaming and tokenisation streams are similar problems and the Attribute design seems like it can apply here. Cheers Mark Le 11-mai-10 à 12:02, mark harwo

Re: best way to interest two queries?

2010-05-11 Thread mark harwood
See https://issues.apache.org/jira/browse/LUCENE-1999 - Original Message From: Paul Libbrecht To: java-user@lucene.apache.org Sent: Tue, 11 May, 2010 10:52:14 Subject: Re: best way to interest two queries? Dear lucene experts, Let me try to make this precise since there was not answe

Re: Get info wheter a field is multivalued

2010-03-17 Thread mark harwood
Not the fastest thing in the world but works: Term startTerm=new Term("myFieldName",""); TermEnum te=reader.terms(startTerm); BitSet docsRead=new BitSet(reader.maxDoc()); boolean multiValued=false;

Re: SpanQueries in Luke

2010-03-05 Thread mark harwood
rietary metadata system, and any other config resource to hook into Luke. That would be pretty cool - Original Message From: Andrzej Bialecki To: java-user@lucene.apache.org Sent: Fri, 5 March, 2010 11:11:12 Subject: Re: SpanQueries in Luke On 2010-03-05 11:22, mark harwood wrote:

Re: SpanQueries in Luke

2010-03-05 Thread mark harwood
things like Solr's config. Cheers, Mark - Original Message From: Andrzej Bialecki To: java-user@lucene.apache.org Sent: Fri, 5 March, 2010 10:03:23 Subject: Re: SpanQueries in Luke On 2010-03-05 10:47, mark harwood wrote: > > >>> No, this simply means tha

Re: SpanQueries in Luke

2010-03-05 Thread mark harwood
>>No, this simply means that you will be able to use the xml-query-parser >>instead of the regular one Not sure exactly what you have in mind for an editor, Andrzej but there is an opportunity to do something smart here for little effort. The XMLQueryParser comes with a DTD which means you ca

Re: Query about Query.ToString()

2010-02-18 Thread Mark Harwood
/wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes > DBSight customer, a shopping comparison site, (anonymous per request) got 2.6 > Million Euro funding! > > > > Mark Harwood wrote: >> Yes it is being maintained and I have it in product

Re: Query about Query.ToString()

2010-02-17 Thread Mark Harwood
nymous per request) got 2.6 Million Euro funding! Mark Harwood wrote: This was part of the rationale for creating the XMLQueryParser which can be found in contrib. See here for the background: http://marc.info/?l=lucene-dev&m=113355526731460&w=2 On 17 Feb 2010, at 18:44, Aaron Schon w

Re: Query about Query.ToString()

2010-02-17 Thread Mark Harwood
This was part of the rationale for creating the XMLQueryParser which can be found in contrib. See here for the background: http://marc.info/?l=lucene-dev&m=113355526731460&w=2 On 17 Feb 2010, at 18:44, Aaron Schon wrote: > Hi all, I know that persisting a Lucene query by query ToString() meth

Re: Strange Fuzzyquery results scoring when using a low minimal distance

2010-02-15 Thread mark harwood
This could be down to IDF, i.e. "Lucane" is ranked higher because it is rarer despite having a worse edit distance. This is arguably a bug. See http://issues.apache.org/jira/browse/LUCENE-329 which discusses this. You could try subclassing QueryParser and override newFuzzyQuery to return FuzzyLikeThisQu

Re: Further refinement of search results - distinguishing hits with exact phrase match from the rest

2010-02-15 Thread mark harwood
Re Mike's delegating custom query suggestion - see https://issues.apache.org/jira/browse/LUCENE-1999 - Original Message From: Michael McCandless To: java-user@lucene.apache.org Sent: Mon, 15 February, 2010 10:03:30 Subject: Re: Further refinement of search results - distinguishing hi

Re: Lucene fields not analyzed

2010-02-09 Thread Mark Harwood
rd analyzer and changed the > name to be added to the index to "Mr.\\ Kumar" > but still couldn't get it to work. > > > > > > > Rohit Banga > > > On Tue, Feb 9, 2010 at 1:06 PM, Mark Harwood wrote: > >> I suspect it is because QueryPa

Re: Lucene fields not analyzed

2010-02-08 Thread Mark Harwood
I suspect it is because QueryParser uses space characters to separate different clauses in a query string while you want the space to represent some content in your "name" field. Try escaping the space character. Cheers Mark On 9 Feb 2010, at 07:26, Rohit Banga wrote: > Hello > > i have a f

Re: ComplexPhraseQueryParser (Expanded Form and Boosting)

2010-02-01 Thread Mark Harwood
Try calling rewrite() on the query object to expand it, then call toString() on the result. Cheers, Mark - On 1 Feb 2010, at 21:32, "Haghighi, Nariman" wrote: > We are relying on the ComplexPhraseQueryParser for some impressive > matching capabilities. > > Of concern is that Wildcard Queries,

Re: Extracting contact data

2010-01-14 Thread mark harwood
> > Do you think I can get any advantage from building a solution on > Lucene? Lucene is generally about information retrieval not information extraction (as suggested, GATE or UIMA are more commonly used for extraction). However, Lucene can play a role in extraction if you use it for determining

Re: Need help with XML Query Parser example in Lucene 3.0

2009-12-23 Thread mark harwood
Hi Fayyaz, >>I have found an error in the web.xml file, Good job! I found an error in your code so that makes us even :) It looks like you removed the line in the "openExampleIndex" method which opens the searcher. That explains your null pointer. The problem you found in the web.xml isn't a

Re: Problems with fragments size on highlight.

2009-11-18 Thread Mark Harwood
It could be the "merge contiguous fragments" feature that attempts to do exactly this to improve readability. It's an option you can turn off. On 15 Nov 2009, at 01:21, Felipe Lobo wrote: Hi, i'm having some problems with the size of the fragments when i'm doing the highlight. I pass on the

Re: How to use Lucene to suppot quick search on huge databases where the primary content is of non textual format ?

2009-11-09 Thread mark harwood
So many questions.. >>Which one will be better As in. * Faster to implement? * Faster to search? * Faster to update? * Cheaper in licenses? * More robust? * Easier to maintain? * Easier to backup? Are results sorted by : * quality (e.g. when using fuzzy text matching)? * distance? * pric

Re: Storing a Lucene Index on a SAN Storage: good idea?

2009-09-26 Thread Mark Harwood
I have a client with 700 million doc index running on a SAN. The performance is v good but this obviously depends on your choice of SAN config. In this environment I have multiple search servers happily hitting the same physical lucene index on the SAN. The servers negotiate with each other via

Re: How to perform a phrase "begins with" query?

2009-09-17 Thread Mark Harwood
Since you can't (and it doesn't make sense to) use wildcards in phrase queries, You can with this: http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/misc/src/java/org/apache/lucene/queryParser/complexPhrase/ Discussion here: http://tinyurl.com/lrnage Cheers, Mark

Re: Stopping a runaway search, any ideas?

2009-09-11 Thread mark harwood
Or https://issues.apache.org/jira/browse/LUCENE-1720 offers lightweight timeout testing at all index access stages prior to calls to Collector e.g. will catch a runaway fuzzy query during its expensive term expansion phase. - Original Message From: Uwe Schindler To: java-user@lucene

Re: First result in the group

2009-09-04 Thread Mark Harwood
>>It removes the duplicates at query time and not in the results. Not sure I understand that statement. Do you mean you want index-time rejection of potentially duplicate inserts? On 4 Sep 2009, at 07:01, Ganesh wrote: It removes the duplicates at query time and not in the results. --

Re: First result in the group

2009-09-02 Thread mark harwood
See "DuplicateFilter" in contrib. http://markmail.org/message/lsvnpu7mwhht3a4p Cheers Mark - Original Message From: Ganesh To: java-user@lucene.apache.org Sent: Wednesday, 2 September, 2009 12:38:35 Subject: Re: First result in the group I have a field called category and all docume

Re: Deletion of words in articles of Wikipedia

2009-09-02 Thread mark harwood
>>I need to start off with this project where we can find the ranking of >>controversial articles. Could anyone kindly help me how to start? Check out the wikipedia "logging" dumps which contain the reasons for actions on page titles (including ip blocks and deletes) but without the bulk of the

Re: Multi Value field

2009-07-07 Thread Mark Harwood
I just try norms idea as well no change You'll need to look at searcher.explain() for the two docs or post a JUnit or code example that can be executed which shows the issue

Re: Multi Value field

2009-07-07 Thread Mark Harwood
if the term is "X Y" the document 2 is getting higher score than document 1. That may be length normalisation at play. Doc 2 is shorter so may be seen as a better match for that reason. Using the "explain" function helps illustrate the breakdown of scores in matches. You could try index

Re: Boolean retrieval

2009-07-07 Thread mark harwood
ts() + " hits"); The result is 0 hits (should be 640). [1] tinyurl.com/ml52ye 2009/7/4 Mark Harwood : > > Check out booleanfilter in contrib/queries. It can be wrapped in a > constantScoreQuery > > > > On 4 Jul 2009, at 17:37, Lukas Michelbacher > wrote: >

Re: Need help regarding Lucene index/query

2009-07-05 Thread Mark Harwood
I would appreciate if i can get help with the code as well. If you want to tweak an existing example rather than coding entirely from scratch the XMLQueryParser in /contrib has a demo web app for job search with a "location" field similar in principle to your "state" field plus it has a G

Re: Boolean retrieval

2009-07-04 Thread Mark Harwood
Check out BooleanFilter in contrib/queries. It can be wrapped in a ConstantScoreQuery. On 4 Jul 2009, at 17:37, Lukas Michelbacher wrote: This is about an experiment comparing plain Boolean retrieval with vector-space-based retrieval. I would like to disable all of Lucene's scoring mechani

Re: Highligheter fails using JapaneseAnalyzer

2009-07-01 Thread Mark Harwood
On 1 Jul 2009, at 17:39, k.sayama wrote: I could verify Token byte offsets The system outputs aaa:0:3 bbb:0:3 ccc:4:7 That explains the highlighter behaviour. Clearly BBB is not at position 0-3 in the String you supplied String CONTENTS = "AAA :BBB CCC"; Looks like the Tokenizer need

Re: Highligheter fails using JapaneseAnalyzer

2009-07-01 Thread mark harwood
day, 1 July, 2009 16:13:17 Subject: Re: Highligheter fails using JapaneseAnalyzer Sorry I can not verify the Token byte offsets produced by JapaneseAnalyzer How should I verify it? - Original Message - From: "mark harwood" To: Sent: Wednesday, July 01, 2009 11:31 PM Subject:

Re: Highligheter fails using JapaneseAnalyzer

2009-07-01 Thread mark harwood
Can you verify the Token byte offsets produced by this particular analyzer are correct? - Original Message From: k.sayama To: java-user@lucene.apache.org Sent: Wednesday, 1 July, 2009 15:22:37 Subject: Re: Highligheter fails using JapaneseAnalyzer hi I verified it by using SimpleAn

Re: Doc-Doc Similarity Matrix Construction

2009-06-29 Thread Mark Harwood
See MoreLikeThis in the contrib/queries folder. It optimizes the speed of similarity comparisons by taking the most significant words only from a document as search terms. On 29 Jun 2009, at 20:14, Amir Hossein Jadidinejad wrote: Hi, It's my first experiment with Lucene. Please help me.

Re: Fuzzy vs Prefix query Performance

2009-06-15 Thread mark harwood
FuzzyQuery performance is related to number of unique terms in the index not the number of documents e.g. a single "telephone directory" document could contain millions of terms. Each term considered is compared using an "edit distance" algo which is CPU intensive. The FuzzyQuery prefix length

Re: Max size of index? How do search engines avoid this?

2009-05-18 Thread mark harwood
>techniques used by big search engines to search among such huge data. Two keywords here - partitioning and replication. Partitioning is breaking the content down into shards and assigning shards to servers. These can then be queried in parallel to make search response times independent of the

Re: Indexing becomes slow with time

2009-04-30 Thread mark harwood
If you're CPU-bound - I've had issues before with GC in long-running indexing tasks loading very large volumes (100s of millions) of docs. I was seeing lots of CPU usage tied up in GC. I solved all these problems by firing batches of indexing activity off in separate processes then immediately

Re: Low-memory searcher

2009-04-24 Thread mark harwood
See IndexReader.setTermInfosIndexDivisor() for a way to help reduce memory usage without needing to re-index. If you have indexed fields with omitNorms off (the default) you will be paying a 1 byte per field per document memory cost and may need to look at re-indexing Cheers Mark - Orig

Re: SpanQuery wildcards?

2009-04-23 Thread mark harwood
Related: https://issues.apache.org/jira/browse/LUCENE-1486 - Original Message From: Steven A Rowe To: "java-user@lucene.apache.org" Sent: Thursday, 23 April, 2009 16:54:08 Subject: RE: SpanQuery wildcards? Hi Ivan, SpanRegexQuery should work - just use ".*" instead of "*". - Steve

Re: Servlets Sharing Resources

2009-04-21 Thread mark harwood
Spring is pretty useful for managing and sharing resources - see what looks like a related example here: http://croarkin.blogspot.com/2008/05/injecting-spring-bean-into-servlet.html Cheers, Mark - Original Message From: David Seltzer To: java-user@lucene.apache.org Sent: Tuesday,

Re: Speed of fuzzy searches

2009-04-02 Thread mark harwood
Try setting the minimum prefix length for fuzzy queries (I think there is a setting on QueryParser or you may need to subclass). Prefix length of zero does edit distance comparisons for all unique terms e.g. from "aardvark" to "" Prefix length of one would cut this search space down to just
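To illustrate why a non-zero prefix length helps, here is a small self-contained sketch (not Lucene code, and not part of the original post). It counts how many edit-distance computations are needed with and without a required prefix. PrefixedFuzzyScan, the linear scan over a term list and the maxEdits cut-off are all simplifications chosen for the example; they are assumptions, not how FuzzyQuery is implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy fuzzy term scan that skips terms not sharing the required prefix.
// Hypothetical helper for illustration only, not part of Lucene.
final class PrefixedFuzzyScan {
    int comparisons; // edit-distance computations done by the last scan

    // Returns the terms within maxEdits of query, considering only terms
    // that share the first prefixLength characters of the query.
    List<String> scan(List<String> terms, String query, int prefixLength, int maxEdits) {
        comparisons = 0;
        String prefix = query.substring(0, Math.min(prefixLength, query.length()));
        List<String> hits = new ArrayList<>();
        for (String term : terms) {
            if (!term.startsWith(prefix)) {
                continue; // cheap test, no edit distance needed
            }
            comparisons++;
            if (editDistance(term, query) <= maxEdits) {
                hits.add(term);
            }
        }
        return hits;
    }

    // Classic Levenshtein distance using two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("aardvark", "lucane", "lucene", "lucent", "zebra");
        PrefixedFuzzyScan scanner = new PrefixedFuzzyScan();
        scanner.scan(terms, "lucene", 0, 1);
        System.out.println("prefix 0: " + scanner.comparisons + " comparisons");
        scanner.scan(terms, "lucene", 1, 1);
        System.out.println("prefix 1: " + scanner.comparisons + " comparisons");
    }
}
```

The trade-off is that a term differing from the query inside the prefix (a typo in the first letter, say) can no longer match.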

Re: What is an optimal approach?

2009-03-30 Thread mark harwood
ptimal approach incase someone already have similar situation. -Original Message----- From: mark harwood [mailto:markharw...@yahoo.co.uk] Sent: Mon 3/30/2009 11:16 AM To: java-user@lucene.apache.org Subject: Re: What is an optimal approach? That's probably more a question about MarkLogic A

Re: What is an optimal approach?

2009-03-30 Thread mark harwood
That's probably more a question about MarkLogic APIs than it is about Lucene. What APIs does MarkLogic provide for getting at the content e.g does it provide a JSR-170 standard interface ( http://www.slideshare.net/uncled/introduction-to-jcr ) I presume you have already ruled out the in-built M

Re: Lucene Highlighting and Dynamic Summaries

2009-03-12 Thread mark harwood
The attachment didn't make it through here. Can you add it as an attachment to a new JIRA issue? Thanks, Mark From: Amin Mohammed-Coleman To: java-user@lucene.apache.org Sent: Thursday, 12 March, 2009 7:47:20 Subject: Re: Lucene Highlighting and Dynamic Summ

Re: A model for predicting indexing memory costs?

2009-03-11 Thread mark harwood
OK, it's early days and I'm holding my breath but I'm currently progressing further through my content without an OOM just by using a different GC setting. Thanks to advice here and colleagues at work I've gone with a GC setting of -XX:+UseSerialGC for this indexing task. The rationale that is

Re: A model for predicting indexing memory costs?

2009-03-11 Thread mark harwood
Wednesday, 11 March, 2009 10:42:33 Subject: Re: A model for predicting indexing memory costs? * mark harwood: >>>Could you get a heap dump (eg with YourKit) of what's using up all the >>>memory when you hit OOM? > > On this particular machine I have a JRE, no adm

Re: A model for predicting indexing memory costs?

2009-03-10 Thread mark harwood
ts by pointing out that it's not only *your* time that's at risk, but customers' time too. Whether you define customers as internal or external is irrelevant. Every round of diagnosis/fix carries the risk that N people waste time (and get paid for it). All to avoid a little up-front co

Re: A model for predicting indexing memory costs?

2009-03-10 Thread mark harwood
ing a new IndexWriter each time? Or, just calling .commit() and then re-using the same writer? It seems likely this has something to do with merging, though from your listing I count 14 segments which shouldn't have been doing any merging at mergeFactor=20, so that's confusing.

Re: A model for predicting indexing memory costs?

2009-03-10 Thread mark harwood
Token class when creating the trie > encoded fields. > > How works TrieRange for you? Are you happy, does searches work well with > 30 > mio docs, which precisionStep do you use? > > Uwe > > - > Uwe Schindler > H.-H.-Meier-Allee 63, D-28213 Bremen > http:

Re: A model for predicting indexing memory costs?

2009-03-10 Thread mark harwood
you? Are you happy, does searches work well with 30 mio docs, which precisionStep do you use? Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: mark harwood [mailto:markharw...@yahoo.co.uk] > Sent:

Re: A model for predicting indexing memory costs?

2009-03-10 Thread mark harwood
with -XX:-UseGCOverheadLimit http://java-monitor.com/forum/archive/index.php/t-54.html http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html#par_gc.oom -- Ian. On Tue, Mar 10, 2009 at 10:45 AM, mark harwood wrote: > >>>But... how come setting IW's RAM buffer do

Re: A model for predicting indexing memory costs?

2009-03-10 Thread mark harwood
ent: Tuesday, 10 March, 2009 0:01:30 Subject: Re: A model for predicting indexing memory costs? mark harwood wrote: > > I've been building a large index (hundreds of millions) with mainly > structured data which consists of several fields with mostly unique values. > I've been

Re: Lucene 2.9

2009-03-09 Thread mark harwood
>>Maybe we could do something similar to declare that a given field uses Trie*, >>and with what datatype. With the current implementation you can at least test for the presence of a field called: [fieldName]#trie ..which tells you some form of trie is used but could be extended to include

A model for predicting indexing memory costs?

2009-03-09 Thread mark harwood
I've been building a large index (hundreds of millions) with mainly structured data which consists of several fields with mostly unique values. I've been hitting out of memory issues when doing periodic commits/closes which I suspect is down to the sheer number of terms. I set the IndexWriter..

Re: IndexWriter 2-phase commit usage

2009-02-24 Thread mark harwood
As suggested, the window for failure here is very small. The commit is effectively an atomic single file rename operation to make the new segments file visible. However, should there be a failure between 2 commits the new deletion policy logic should help you recover to prior commit points. See

Re: Lucene search performance on Sun UltraSparc T2 (T5120) servers

2009-02-18 Thread mark harwood
I was having some thoughts recently about speeding up fuzzy search. The current system does edit-distance on all terms A-Z, single threaded. Prefix length can reduce the search space and there is a "minimum similarity" threshold but that's roughly where we are. Multithreading this to make use o

Re: Poor QPS with highlighting

2009-02-03 Thread mark harwood
>>My documents are quite big sometimes up to 300k tokens. You could look at indexing them as separate documents using overlapping sections of text. Erik used this for one of his projects. Cheers Mark - Original Message From: Michael Stoppelman To: java-user@lucene.apache.org Sent: Tu

Re: Optimize and Out Of Memory Errors

2008-12-23 Thread mark harwood
Field("field5", "groupId" + i, Field.Store.YES, Field.Index.UN_TOKENIZED)); writer.addDocument(doc); From: mark harwood To: java-user@lucene.apache.org Sent: Tuesday, December 23, 2008 2:42:25 PM Subject: Re: Optimize a

Re: Optimize and Out Of Memory Errors

2008-12-23 Thread mark harwood
I've had reports of OOM exceptions during optimize on a couple of large deployments recently (based on Lucene 2.4.0) I've given the usual advice of turning off norms, providing plenty of RAM and also suggested setting IndexWriter.setTermIndexInterval(). I don't have access to these deployment en

Re: [ANN] Luke 0.9 released

2008-11-14 Thread mark harwood
, Mark - Original Message From: Andrzej Bialecki <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Friday, 14 November, 2008 10:47:03 Subject: Re: [ANN] Luke 0.9 released mark harwood wrote: > Hi Andrzej, > > Thanks for the update. Looks like you've been bus
