Re: Filter.getDocIdSet() returning null, and what this means for CachingWrapperFilter

2010-05-26 Thread Daniel Noll
h, which will probably result in another post sooner or later. Daniel

Filter.getDocIdSet() returning null, and what this means for CachingWrapperFilter

2010-05-25 Thread Daniel Noll
"the entry isn't in the cache" from "the entry is in the cache but it's null". Daniel
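The distinction quoted in this snippet can be illustrated outside Lucene. The following is a minimal sketch in plain Java, not the CachingWrapperFilter implementation; the class name NullAwareCache and its methods are hypothetical. It wraps each cached value in an Optional so that a computed null and a missing entry stay distinguishable.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Toy cache that tells "never computed" apart from "computed, and the result was null". */
final class NullAwareCache<K, V> {
    // Optional.empty() records a computed null; an absent key means "not cached yet".
    private final Map<K, Optional<V>> entries = new HashMap<>();

    /** Stores a result, which may legitimately be null. */
    void put(K key, V valueOrNull) {
        entries.put(key, Optional.ofNullable(valueOrNull));
    }

    /** True once a result (null or not) has been stored for the key. */
    boolean isCached(K key) {
        return entries.containsKey(key);
    }

    /** Returns the stored result; null means either "stored null" or "not cached", so check isCached first. */
    V get(K key) {
        Optional<V> hit = entries.get(key);
        return hit == null ? null : hit.orElse(null);
    }
}
```

A caller would test isCached(key) before deciding whether to recompute, rather than treating a null from get(key) as a cache miss.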

Re: Questions about the new query parser framework

2010-05-03 Thread Daniel Noll
probably fine anyway, as I don't really want to encourage the former way of formatting it as the latter is more concise. Actually it could even be... tag:(a AND (b OR c)) But I don't think my formatting logic is quite smart enough for that yet. Daniel

Questions about the new query parser framework

2010-05-02 Thread Daniel Noll
"tag:a tag:b" and "tag:(a b)" both parse to the same node structure (making it impossible to figure out which the user actually used)? Daniel

Re: Filters and multiple, per-segment calls to getDocIdSet

2010-03-25 Thread Daniel Noll
the API would explicitly pass the docBase for the IndexReader - this would reduce the need to perform maths to determine the docBase ourselves, and also make it possible to parallelise those calls later. Daniel

Filters and multiple, per-segment calls to getDocIdSet

2010-03-24 Thread Daniel Noll
ter is now explicitly not threadsafe. We weren't keeping any state in them anyway, but now we will have to, so there is potential for a lot of new bugs if a filter is somehow used by two queries running at the same time. Daniel

Re: "Deleting" documents without deleting them

2010-03-16 Thread Daniel Noll
a filter which only matches the last doc for each term. Then I don't have to pay for the storage of a filter... but I guess it will cost to build this filter anyway so I don't know if it's practical yet. I guess storing the filter on disk would be an easier way to go, with the caveat
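As a rough illustration of "a filter which only matches the last doc for each term", here is a minimal sketch in plain Java rather than against the Lucene API. The class LastDocPerTerm and the Map-of-arrays postings representation are assumptions made for the example; a real implementation would walk the index's term and document enumerations instead.

```java
import java.util.BitSet;
import java.util.Map;

/** Builds a bit set containing only the highest doc ID seen for each term. */
final class LastDocPerTerm {
    /**
     * postings maps each term to its matching doc IDs in ascending order
     * (an assumption of this sketch). maxDoc is only a sizing hint for the bit set.
     */
    static BitSet lastDocs(Map<String, int[]> postings, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        for (int[] docs : postings.values()) {
            if (docs.length > 0) {
                // The last element is the most recently added doc for this term.
                bits.set(docs[docs.length - 1]);
            }
        }
        return bits;
    }
}
```

The sketch makes the cost mentioned in the snippet visible: building the bit set still requires one pass over every term, even if the result is never stored.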

"Deleting" documents without deleting them

2010-03-15 Thread Daniel Noll
pretty large even if I use a BitSet. :-( Is there any other way to go about it? Daniel

New Query Parser: converting a QueryNode back into a String?

2009-11-29 Thread Daniel Noll
ding SyntaxFormatter to convert from QueryNode back to String. 3. What about going all the way from Query back to String? (My naive answer to my own question here is that some QueryNodeProcessor may perform an irreversible operation, making it impossible to do this, but I thought I would throw

Re: Finding the highest term in a field

2009-11-18 Thread Daniel Noll
On Thu, Nov 19, 2009 at 16:01, Yonik Seeley wrote: > On Wed, Nov 18, 2009 at 10:48 PM, Daniel Noll wrote: >> But what if I want to find the highest?  TermEnum can't step backwards. > > I've also wanted to do the same. It's coming with the new flex

Finding the highest term in a field

2009-11-18 Thread Daniel Noll
of binary search by getting the TermEnum for different terms until I find a term where there are terms higher than the term but no terms higher than the term for the next day? Daniel

Re: Directory.list() deprecation

2009-11-09 Thread Daniel Noll
does not exist is somewhat simpler. Daniel -- Daniel Noll Forensic and eDiscovery Software Senior Developer The world's most advanced Nuix email data analysis http://nuix.com/

Re: IndexWriter.close() no longer seems to close everything

2009-11-08 Thread Daniel Noll
index state (before adding docs.) When the IndexWriter was opened, another reader was opened, so even though we thought we were closing both, it turned out there were two readers and one writer, and we were only closing one of the readers. Daniel

IndexWriter.close() no longer seems to close everything

2009-11-08 Thread Daniel Noll
time (though I was under the impression that close() waited for merges and so forth to complete before returning.) Daniel

Re: Directory.list() deprecation

2009-11-08 Thread Daniel Noll
were using while providing no replacement except for "write it yourself", the same as what happened when Hits got canned. Daniel

Directory.list() deprecation

2009-11-05 Thread Daniel Noll
ve at least all used the same filter.) Daniel

Re: Phrase search

2009-06-10 Thread Daniel Noll
ly assuming a score of 1.0 for each hit, you would get something like... 1. "cool gaming laptop"=> 3 (cool, gaming, "cool gaming") 2. "cool gaming lappy"=> 3 (cool, gaming, "cool gaming") 3. "gaming laptop cool"=> 2 (cool,
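The counts in this snippet can be reproduced with a toy scorer. This is a minimal sketch under stated assumptions: the three clauses (cool, gaming, and the phrase "cool gaming") are inferred from the parenthesised matches in the snippet, each matching clause contributes exactly 1.0, and the ClauseCount class is hypothetical. Real Lucene scoring weights clauses quite differently.

```java
import java.util.Arrays;
import java.util.List;

/** Toy scorer: one point per query clause that matches the document text. */
final class ClauseCount {
    /** True if the phrase tokens occur contiguously and in order within the doc tokens. */
    static boolean containsPhrase(List<String> doc, List<String> phrase) {
        for (int i = 0; i + phrase.size() <= doc.size(); i++) {
            if (doc.subList(i, i + phrase.size()).equals(phrase)) {
                return true;
            }
        }
        return false;
    }

    /** Counts matching clauses; a single-word clause is treated as a one-token phrase. */
    static int score(String text, List<String> clauses) {
        List<String> doc = Arrays.asList(text.split(" "));
        int score = 0;
        for (String clause : clauses) {
            if (containsPhrase(doc, Arrays.asList(clause.split(" ")))) {
                score++;
            }
        }
        return score;
    }
}
```

With the clauses cool, gaming and "cool gaming", the text "gaming laptop cool" matches both single terms but not the phrase, which is why it ranks below the other two examples.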

Re: Migrating from Hit/Hits to TopDocs/TopDocCollector

2009-06-10 Thread Daniel Noll
large number of documents in the index. It's a shame we don't have an inverted kind of HitCollector where we can say "give me the next hit", so that we can get the best of both worlds (like what StAX gives us in the XML world.) Daniel

Extending StandardAnalyzer considered harmful

2009-06-03 Thread Daniel Noll
, it would have prevented the problem in its entirety as we would have realised much sooner that it wasn't safe to override in the beginning. Daniel

ArrayIndexOutOfBoundsException from TermInfosReader.get (2.3.2)

2009-04-27 Thread Daniel Noll
easy to try 2.4.1 and see if it has been fixed, but was there a bug along these lines in 2.3.2? Daniel

Internals question: BooleanQuery with many TermQuery children

2009-04-06 Thread Daniel Noll
tion.) Daniel

Re: underscore a word separator in StandardAnalyzer?

2009-03-15 Thread Daniel Noll
ene, wouldn't a trivial analyser which breaks on commas be the way to go? Daniel

Re: Lucene: MultiSearcher

2009-03-08 Thread Daniel Noll
Michael McCandless wrote: You could look at the docID of each hit, and compare to the .maxDoc() of each underlying reader. There is also MultiSearcher#subSearcher(int) which also works as you add more without having to do the maths yourself. Daniel
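For readers curious what "the maths" involves, the following is a minimal sketch in plain Java. It does not call Lucene; the SubIndexLookup class is hypothetical, and the maxDocs array stands in for the .maxDoc() value of each underlying reader, in order.

```java
/** Maps a combined doc ID back to the sub-index that owns it. */
final class SubIndexLookup {
    /** Returns the position of the sub-index containing docId, given each sub-index's maxDoc. */
    static int subIndex(int docId, int[] maxDocs) {
        if (docId < 0) {
            throw new IllegalArgumentException("doc ID out of range: " + docId);
        }
        int base = 0; // first combined doc ID belonging to sub-index i
        for (int i = 0; i < maxDocs.length; i++) {
            if (docId < base + maxDocs[i]) {
                return i;
            }
            base += maxDocs[i];
        }
        throw new IllegalArgumentException("doc ID out of range: " + docId);
    }
}
```

For sub-indexes holding 100, 50 and 200 documents, combined doc IDs 0 to 99 belong to the first, 100 to 149 to the second, and 150 to 349 to the third.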

Re: Optimum way to find all document without particular field

2009-03-04 Thread Daniel Noll
ontent" is more efficient. Daniel

Re: double metaphone for misspellings

2008-12-17 Thread Daniel Noll
g similar. So you would end up with a DoubleMetaphoneFilter, which you could then use with PerFieldAnalyzerWrapper to have it apply only to the fields you use that for. Daniel

Re: Query Search returns always the same id

2008-10-28 Thread Daniel Noll
't particularly surprising that it isn't stored. ;-) Daniel

Re: QueryParser returning TermQuery instead of PhraseQuery?

2008-10-20 Thread Daniel Noll
s sounding like an X-Y problem, so what are you actually trying to achieve? It sounds like you don't want stemming (talking about "exact" matches) yet you chose the snowball analyser (whose sole purpose is stemming, unless I am mistaken...) Daniel

Re: StandardTokenizer and Korean grouping with alphanum

2008-09-22 Thread Daniel Noll
age again instead of using [:letter:] which is much more convenient. Daniel

StandardTokenizer and Korean grouping with alphanum

2008-09-21 Thread Daniel Noll
Basically I'm seeing some tokens come back with mixed digits and Hangul, and I'm questioning the correctness of that. Disclaimer: we're not performing any further processing of Korean in subsequent filters at the current point in time, and I don't know the language eit

Re: IndexSearcher.search

2008-09-19 Thread Daniel Noll
ating it even says something like "it was originally designed for GUI but was anyone even using it for that?" Some of us obviously were. Daniel

Re: IndexSearcher.search

2008-09-16 Thread Daniel Noll
in time, there is always the need to get the "set of every match" for any given search eventually. Maybe others have different opinions as they are working on webapps, where the user is already expecting paging before they even see the results page. Daniel

Re: TopDocs question

2008-09-15 Thread Daniel Noll
know the *number* of hits, and don't need the hits themselves, then you should just use a custom HitCollector which increments a counter. It will run much faster. Daniel

Re: IndexSearcher.search

2008-09-15 Thread Daniel Noll
to grow? Daniel -- Daniel Noll - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]

Re: How to search

2008-08-25 Thread Daniel Noll
some time now, just not by default. Daniel

Re: How I can find wildcard symbol with WildcardQuery?

2008-08-19 Thread Daniel Noll
Daniel

Re: How I can find wildcard symbol with WildcardQuery?

2008-08-19 Thread Daniel Noll
Kwon, Ohsang wrote: Why do you use to WildcardQuery? You are not need to whildcard. (maybe..) Use term query. What if you need to match a literal wildcard *and* an actual wildcard? :-) Daniel

Re: How I can find wildcard symbol with WildcardQuery?

2008-08-19 Thread Daniel Noll
UN_TOKENIZED. Source code QFT: } else if (index == Index.NO_NORMS) { this.isIndexed = true; this.isTokenized = false; this.omitNorms = true; } ... Daniel

Re: Testing for field existence

2008-08-18 Thread Daniel Noll
you have to support older text indexes.) Daniel

Re: Lucene search for OR

2008-08-14 Thread Daniel Noll
s to search for "the". If it gives no results then you won't find "or" either, without reindexing with stop words off. Daniel

Re: SpanRegexQuery

2008-07-31 Thread Daniel Noll
same issue with regex queries here and had to apply a workaround of that sort. Daniel

Re: Ignoring XML tags when Indexing

2008-07-24 Thread Daniel Noll
y as long as you have a BufferedReader wrapped around the entire thing. Daniel

Re: MoreLikeThis from a field with a specific value

2008-07-15 Thread Daniel Noll
category Not surprising at all. This is what you actually want: +(content:blah content:blah content:blah) +categoryId:2 Your original query's only REQUIRED constraint was that it match the category. Daniel

Re: Match all documents with non empty field

2008-07-02 Thread Daniel Noll
s you wrap it in a QueryFilter to cache the result, but I found it to be "fast enough" even for relatively large document sets. Daniel

Re: How to retrieve number of documents based on a query ?

2008-06-25 Thread Daniel Noll
On Thursday 26 June 2008 15:09:44 java_is_everything wrote: > Hi all. > > Is there a way to obtain the number of documents in the Lucene index > (2.0.0), having a particular term indexed, much like what we do in a > database ? I suspect the normal way is a HitCollector which does nothing but incre

Re: lucene search options

2008-06-23 Thread Daniel Noll
On Monday 23 June 2008 18:08:29 Aditi Goyal wrote: > Oh. For one moment I was elated to hear the news. :( > Is there any way out? *:* -"jakarta apache" Or subclass QueryParser and override the getBooleanQuery() method to do this behind the scenes using MatchAllDocsQuery. Daniel

Re: lucene search options

2008-06-22 Thread Daniel Noll
On Monday 23 June 2008 16:21:17 Aditi Goyal wrote: > I think wildcard (*) cannot be used in the beginning :( Wrong: http://lucene.apache.org/java/2_3_0/api/core/org/apache/lucene/queryParser/QueryParser.html#setAllowLeadingWildcard(boolean) Daniel

Re: creating Array of IndexReaders

2008-06-22 Thread Daniel Noll
On Saturday 21 June 2008 18:57:49 Sebastin wrote: > Since i am maintaining more than 1.5 years records in the windows 2003 > server,based on the user input for example if the user wants to display > june 1 - june 15 folders and fetch the records from them.if the user wants > to display may 1-may15

Re: How international languages are supported in Lucene

2008-06-09 Thread Daniel Noll
On Tuesday 10 June 2008 07:49:29 Otis Gospodnetic wrote: > Hi Daniel, > > What makes you say that about language detection? Wouldn't that depend on > the language detection approach or tool one uses and on the type and amount > of content one trains language detector on? And what is the threshold

Re: lucene memory consumption

2008-05-29 Thread Daniel Noll
On Friday 30 May 2008 08:17:52 Alex wrote: > Hi, > other than the in memory terms (.tii), and the few kilobytes of opened file > buffer, where are some other sources of significant memory consumption when > searching on a large index ? (> 100GB). The queries are just normal term > queries. Norms

Re: Is it possible to add multiple keywords to a single field from one doc?

2008-05-25 Thread Daniel Noll
On Monday 26 May 2008 02:25:40 Tom Conlon wrote: > Hi Mark, > > For example: > you have a content field (default) and you also have an 'attributes' > field. > > I'd like to add multiple attributes for a given document rather than > just one value and be able to somehow search on the attributes. > >

Re: Search for long titles - wildcard queries

2008-05-13 Thread Daniel Noll
On Saturday 10 May 2008 20:32:42 legrand thomas wrote: > I think I cannot use the WildcardQuery because the term shouldn't start > with "*" of "?". Should I use a QueryParser ? How can I do it ? WildcardQuery does permit a wildcard at the front, it's just much slower. Also, QueryParser allows w

Re: Does Lucene Supports Billions of data

2008-04-30 Thread Daniel Noll
On Thursday 01 May 2008 00:01:48 John Wang wrote: > I am not sure how well lucene would perform with > 2 Billion docs in a > single index anyway. Even if they're in multiple indexes, the doc IDs being ints will still prevent it going past 2Gi unless you wrap your own framework around it. Daniel
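The int limit mentioned here is easy to demonstrate. This is a minimal sketch in plain Java, assuming hypothetical per-index document counts; the DocIdLimit class is not part of Lucene. It adds the counts with Math.addExact so that exceeding Integer.MAX_VALUE (2,147,483,647) raises an error instead of silently wrapping to a negative number.

```java
/** Demonstrates why int doc IDs cap a combined index at Integer.MAX_VALUE documents. */
final class DocIdLimit {
    /** Sums per-index document counts, failing loudly instead of wrapping around. */
    static int combinedMaxDoc(int[] maxDocs) {
        int total = 0;
        for (int maxDoc : maxDocs) {
            // Math.addExact throws ArithmeticException when the sum no longer fits in an int.
            total = Math.addExact(total, maxDoc);
        }
        return total;
    }
}
```

Two indexes of 1.5 billion documents each fit individually, but their combined doc ID space does not, which is the situation the reply describes.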

Re: Problems about using Lucene to generate tag cloud..

2008-04-02 Thread Daniel Noll
On Thursday 03 April 2008 08:08:09 Dominique Béjean wrote: > Hum, it looks like it is not true. > Use a do-while loop make the first terms.term().field() generate a null > pointer exception. Depends which terms method you use. TermEnum terms = reader.terms(); System.out.println(terms.term

Re: Problems about using Lucene to generate tag cloud..

2008-04-01 Thread Daniel Noll
On Tuesday 01 April 2008 18:51:55 Dominique Béjean wrote: > IndexReader reader = IndexReader.open(temp_index); > TermEnum terms = reader.terms(); > > while (terms.next()) { > String field = terms.term().field(); Gotcha: after calling terms() it's already pointin

Re: Contrib Highlighter and Phrase search

2008-03-19 Thread Daniel Noll
On Wednesday 19 March 2008 18:28:15 Itamar Syn-Hershko wrote: > 1. Build a Radix tree (PATRICIA) and populate it with all search terms. > Phrase queries will be considered as one big string, regardless their > spaces. > > 2. Iterate through your text ignoring spaces and punctuation marks, and for >

Re: java.lang.OutOfMemoryError: Java heap space when sorting the fields

2008-03-19 Thread Daniel Noll
On Thursday 20 March 2008 07:22:27 Mark Miller wrote: > You might think, if I only ask for the top 10 docs, don't i only read 10 > field values? But of course you don't know what docs will be returned as > each search comes in...so you have to cache them all. If it lazily cached one field at a tim

Re: Question with Hits Interface

2008-03-18 Thread Daniel Noll
On Wednesday 19 March 2008 01:44:33 Ramdas M Ramakrishnan wrote: > I am using a MultiFieldQueryParser to parse and search the index. Once I > have the Hits and iterate thru it, I need to know the following? > > For every hit document I need to know under which indexed field was this > Hit originati

Re: Document ID shuffling under 2.3.x (on merge?)

2008-03-17 Thread Daniel Noll
On Monday 17 March 2008 19:38:46 Michael McCandless wrote: > Well ... expungeDeletes() first forces a flush, at which point the > deletions are flushed as a .del file against the just flushed > segment. Still, if you call expungeDeletes after every flush > (commit) then it's only 1 segment whose d

Re: Document ID shuffling under 2.3.x (on merge?)

2008-03-16 Thread Daniel Noll
On Thursday 13 March 2008 19:46:20 Michael McCandless wrote: > But, when a normal merge of segments with deletions completes, your > docIDs will shift. In trunk we now explicitly compute the docID > shifting that happens after a merge, because we don't always flush > pending deletes when flushing

Re: Document ID shuffling under 2.3.x (on merge?)

2008-03-12 Thread Daniel Noll
On Thursday 13 March 2008 00:42:59 Erick Erickson wrote: > I certainly found that lazy loading changed my speed dramatically, but > that was on a particularly field-heavy index. > > I wonder if TermEnum/TermDocs would be fast enough on an indexed > (UN_TOKENIZED???) field for a unique id. > > Mostl

Re: indexing api wrt Analyzer

2008-03-12 Thread Daniel Noll
On Thursday 13 March 2008 15:21:19 Asgeir Frimannsson wrote: > >I was hoping to have IndexWriter take an AnalyzerFactory, where the > > AnalyzerFactory produces Analyzer depending on some criteria of the > > document, e.g. language. > With PerFieldAnalyzerWrapper, you can specify which analyze

Re: Document ID shuffling under 2.3.x (on merge?)

2008-03-12 Thread Daniel Noll
On Wednesday 12 March 2008 19:36:57 Michael McCandless wrote: > OK, I think very likely this is the issue: when IndexWriter hits an > exception while processing a document, the portion of the document > already indexed is left in the index, and then its docID is marked > for deletion. You can see

Re: Document ID shuffling under 2.3.x (on merge?)

2008-03-11 Thread Daniel Noll
On Wednesday 12 March 2008 10:20:12 Michael McCandless wrote: > Oh, so you do not see the problem with SerialMergeScheduler but you > do with ConcurrentMergeScheduler? [...] > Oh, there are no deletions?  Then this is very strange.  Is it   > optimize that messes up the docIDs?  Or, is it when you

Re: Document ID shuffling under 2.3.x (on merge?)

2008-03-11 Thread Daniel Noll
On Wednesday 12 March 2008 09:53:58 Erick Erickson wrote: > But to me, it always seems...er...fraught to even *think* about relying > on doc ids. I know you've been around the block with Lucene, but do you > have a compelling reason to use the doc ID and not your own unique ID? From memory it was

Re: Document ID shuffling under 2.3.x (on merge?)

2008-03-11 Thread Daniel Noll
On Tuesday 11 March 2008 19:55:39 Michael McCandless wrote: > Hi Daniel, > > 2.3 should be no different from 2.2 in that docIDs only "shift" when > a merge of segments with deletions completes. > > Could it be the ConcurrentMergeScheduler? Merges now run in the > background by default and commit w

Document ID shuffling under 2.3.x (on merge?)

2008-03-10 Thread Daniel Noll
Hi all. We're using the document ID to associate extra information stored outside Lucene. Some of this information is being stored at load-time and some afterwards; later on it turns out the information stored at load-time is returning the wrong results when converting the database contents ba

Re: searching for "Nothing"

2008-03-02 Thread Daniel Noll
On Monday 03 March 2008 05:40:39 Ghinwa Choueiter wrote: > thank you. You were right. Indexing by "" does not do what I need. > > How would one represent a null index? Perhaps another way of asking the > question is what query would return to me all the documents in the > database (all-pass filter)

Re: Inconsistent Search Speed

2008-02-28 Thread Daniel Noll
On Thursday 28 February 2008 01:52:27 Erick Erickson wrote: > And don't iterate through the Hits object for more than 100 or so hits. > Like Mark said. Really. Really don't ... Is there a good trick for avoiding this? Say you have a situation like this... - User searches - User sees first N h

Re: Rebuilding Document from index?

2008-02-28 Thread Daniel Noll
On Wednesday 27 February 2008 03:33:53 Itamar Syn-Hershko wrote: > I'm still trying to engineer the best possible solution for Lucene with > Hebrew, right now my path is NOT using a stemmer by default, only by > explicit request of the user. MoreLikeThis would only return relevant > results if I wi

Re: When does QueryParser creates PhraseQueries

2008-02-28 Thread Daniel Noll
On Wednesday 27 February 2008 00:50:04 [EMAIL PROTECTED] wrote: > Looks that this is really hard-coded behaviour, and not Analyzer-specific. The whitespace part is coded into QueryParser.jj, yes. So are the quotes and : and other query-specific things. > I want to search for directories with to

Re: When does QueryParser creates PhraseQueries

2008-02-25 Thread Daniel Noll
On Tuesday 26 February 2008 01:05:27 [EMAIL PROTECTED] wrote: > Hi all, > > I have the behaviour that when I search with Luke (version 0.7.1, Lucene > version 2.2.0) inside an arbritray field, the QueryParser creates a > PhraseQuery when I type in > ~ termA/termB (no "...") > When

Re: Searching multiple indexes

2008-02-21 Thread Daniel Noll
On Tuesday 19 February 2008 21:08:59 [EMAIL PROTECTED] wrote: > 1. IndexSearcher with a MultiReader will search the indexes > sequentially? Not exactly. It will fuse the indexes together such that things like TermEnum will merge the ones from the real indexes, and will search using those compos

Re: Indexing Speed: 2.3 vs 2.2 (real world numbers)

2008-02-04 Thread Daniel Noll
On Monday 04 February 2008 21:51:39 Michael McCandless wrote: > Even pre-2.3, you should have seen gains by adding threads, if indeed > your hardware has good concurrency. > > And definitely with the changes in 2.3, you should see gains by > adding threads. With regards to this, I have been wonder

Re: Lucene to index OCR text

2008-01-28 Thread Daniel Noll
On Friday 25 January 2008 19:26:44 Paul Elschot wrote: > There is no way to do exact phrase matching on OCR data, because no > correction of OCR data will be perfect. Otherwise the OCR would have made > the correction... > The problem I see with a fuzzy query is that if you have the fuzziness set

field:* query type, and prefix queries

2008-01-23 Thread Daniel Noll
Hi all... Just out of interest, why does field:* go via getWildcardQuery instead of getPrefixQuery? It seems to me that it should be treated as a prefix of "", but am I missing something important? Also, I've noticed that although RangeQuery was optimised in a recent version of Lucene, Prefix

Re: Question regarding adding documents

2008-01-07 Thread Daniel Noll
On Tuesday 08 January 2008 00:52:35 Developer Developer wrote: > here is another approach. > > StandardAnalyzer st = new StandardAnalyzer(); > StringReader reader= new StringReader("text to index..."); > TokenStream stream = st.tokenStream("content", reader); > > Then use the Field

Re: Question regarding adding documents

2008-01-06 Thread Daniel Noll
On Monday 07 January 2008 11:35:59 chris.b wrote: > is it possible to add a document to an index and, while doing so, get the > terms in that document? If so, how would one do this? :x My first thought would be: when adding fields to the document, use the Field constructors which accept a TokenSt

Fullwidth alphanumeric characters, plus a question on Korean ranges

2008-01-06 Thread Daniel Noll
Hi all. We discovered that fullwidth letters are not treated as and fullwidth digits are not treated as . This in itself is probably easy to fix (including the filter for normalising these back to the normal versions) but while sanity checking the blocks in StandardTokenizer.jj I found some s

Re: Query.rewrite - help me to understand it

2007-12-16 Thread Daniel Noll
On Thursday 13 December 2007 23:07:49 游泳池的鱼 wrote: > hehe ,you can do a test with PrefixQuery rewrite method,and extract terms . > like this > query = prefixQuery.rewrite(reader); > query.extractTerms(set); > for(String term : set){ > System.out.println(term); > } > > It will give you

Re: DEFAULT_OPERATOR_AND globally ?

2007-12-11 Thread Daniel Noll
On Wednesday 12 December 2007 03:34:08 Helmut Jarausch wrote: > Hi, > > I know how to set DEFAULT_OPERATOR_AND for an individual QueryParser > Objekt (after creation) > > Since I always want this to be set, is there a means to set a (global) > option such that any QueryParser object has this defaul

Tricky (maybe) query question

2007-12-05 Thread Daniel Noll
Hi all. Suppose you have a text index with a field used for deduplication, and then you later add a second field with further information that might also be used for deduplication. We'll call them A and B for the sake of brevity. If I have only a current text index, then I can use (a:foo AND b

Re: How to check which field contains Term

2007-11-07 Thread Daniel Noll
On Thursday 08 November 2007 02:41:50 Lukasz Rzeszotarski wrote: > I must write application, where client wants to make very complex query, > like: > find word "blabla" in (Content_1 OR Content_2) AND (...) AND (...)... > and as a result he expectes not only documents, but also information in

Re: Index Dedupe

2007-10-01 Thread Daniel Noll
On Tuesday 02 October 2007 12:25:47 Johnny R. Ruiz III wrote: > Hi, > > I can't seem to find a way to delete duplicate in lucene index. I hve a > unique key so it seems to be straight forward. But I can't find a simple > way to do it except for putting each record in the index into HashMap. >

Re: Storing Host and IP Information in Lucene

2007-09-10 Thread Daniel Noll
On Monday 10 September 2007 23:53:06 AnkitSinghal wrote: > And if i make the field as UNTOKENIZED i cannot search for queries like > host:xyz.* . I'm not sure why that wouldn't work. If the stored token is xyz.example.com, then xyz.* will certainly match it. Daniel

Re: How do YOU detect corrupt indexes?

2007-08-02 Thread Daniel Noll
otherwise need to do to ensure consistency. Daniel -- Daniel Noll Nuix Pty Ltd Suite 79, 89 Jones St, Ultimo NSW 2007, Australia Ph: +61 2 9280 0699 Web: http://nuix.com/ Fax: +61 2 9212 6902

Re: Getting only the Ids, not the whole documents.

2007-08-02 Thread Daniel Noll
more than others. Daniel

Re: Lucene equivalent of SQL DISTINCT for a specific field's "stored values"

2007-07-26 Thread Daniel Noll
sort of thing only works with untokenised fields, unless you have somewhere else you can store the untokenised version which is quicker to iterate over. Daniel

Re: Search for null

2007-07-25 Thread Daniel Noll
way to do it for speed. For the least code you can probably do...
    BooleanFilter f = new BooleanFilter();
    f.add(new FilterClause(RangeFilter.More("field", ""), BooleanClause.Occur.MUST_NOT));
    Filter cached = new CachingWrapperFilter(f); // separate variable: the wrapper is a Filter, not a BooleanFilter
Daniel

Re: How to make a case insensitive search using a FuzzyQuery?

2007-07-05 Thread Daniel Noll
really appreciate any help! Why don't you just have your analyser lowercase the field at indexing time? I don't see why you would use a FuzzyQuery for something where a normal PhraseQuery should suffice. Daniel

Can I delete without shuffling document IDs?

2007-06-28 Thread Daniel Noll
ping table from actual document ID to the sequence ID. (e.g. if documents 1000 through 1999 are deleted, there would be an entry in the table saying that ID 2000 starts at document ID 1000.) I just wanted to put the question out in case someone has solved the exact same problem already. Daniel
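The mapping table described in this snippet could look roughly like the following. This is a minimal sketch in plain Java, not the poster's solution; the SequenceIdMap class, its method names, and the choice of -1 for "deleted" are all assumptions. Each entry records the first stable sequence ID of a surviving run and the document ID that run now starts at.

```java
import java.util.Map;
import java.util.TreeMap;

/** Translates stable sequence IDs to current document IDs after deletions compact the index. */
final class SequenceIdMap {
    // Key: first sequence ID of a surviving run; value: the document ID that run now starts at.
    private final TreeMap<Integer, Integer> runs = new TreeMap<>();

    SequenceIdMap() {
        runs.put(0, 0); // before any deletion, sequence IDs and document IDs coincide
    }

    /** Records that the run beginning at firstSequenceId now begins at firstDocId. */
    void addRun(int firstSequenceId, int firstDocId) {
        runs.put(firstSequenceId, firstDocId);
    }

    /** Returns the current document ID for a sequence ID, or -1 if that document was deleted. */
    int docIdFor(int sequenceId) {
        Map.Entry<Integer, Integer> run = runs.floorEntry(sequenceId);
        if (run == null) {
            return -1; // negative sequence IDs are never valid
        }
        int docId = run.getValue() + (sequenceId - run.getKey());
        Map.Entry<Integer, Integer> next = runs.higherEntry(sequenceId);
        // Reaching the next run's starting doc ID means the sequence ID fell into a deleted gap.
        if (next != null && docId >= next.getValue()) {
            return -1;
        }
        return docId;
    }
}
```

For the example in the snippet, calling addRun(2000, 1000) after deleting documents 1000 through 1999 makes sequence ID 2000 resolve to document ID 1000, while sequence IDs 1000 to 1999 resolve to -1.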

Re: Problem using RAMDirectory as a buffer

2007-06-21 Thread Daniel Noll
On Friday 22 June 2007 09:34:44 Tanya Levshina wrote: >          ramWriter.addDocument(doc); > >          fsWriter.addIndexes(new Directory[] {ramDir,}); As IndexWriter already does this internally, I'm not exactly sure why you're trying to implement it again on the outside.

Re: negative queries

2007-06-18 Thread Daniel Noll
On Tuesday 19 June 2007 11:03:25 Erik Hatcher wrote: > > Good way to discourage potential contributors I suppose. > > And (most) spammers, which is really the point of requiring a > profile. I believe this is called "throwing the baby out with the bath water." Dan

Re: negative queries

2007-06-18 Thread Daniel Noll
up, and click on the "Create Profile" button. Good way to discourage potential contributors I suppose. Daniel

Re: negative queries

2007-06-17 Thread Daniel Noll
FAQ page claims to be immutable, however. Daniel

Re: negative queries

2007-06-14 Thread Daniel Noll
this question on it? Daniel

Re: regarding range search

2007-06-12 Thread Daniel Noll
, this will probably work well. (It gets harder if you want to do it inside ordinary text content as well.) Daniel

Re: I need 'cat???' to match 'cat' again!

2007-06-06 Thread Daniel Noll
by removing those > "new" lines, but I don't want to maintain a custom > lucene package. > > Please help! Can you not use RegexQuery instead? Daniel

Re: times of match in a document

2007-05-27 Thread Daniel Noll
ny code example for this... See IndexReader#termDocs(), TermDocs#seek(Term), TermDocs#skipTo(int) and TermDocs#freq(). If you need to do it for multiple documents and terms, you probably want to do it in order to reduce redundant creation of multiple TermDocs objects. Daniel

Re: Concept Search

2007-05-16 Thread Daniel Noll
> > But I've been wrong before. Ah, I see. A feature I haven't toyed with just yet. That's rather nice. :-) Daniel

Re: Concept Search

2007-05-16 Thread Daniel Noll
nd end offset and highlight identically. What *would* be tricky is phrase queries since inserting a new term breaks the offsets AFAIK. Although, I suppose you could always store the concepts in a different field and not modify the analyser being used for the text itself. Daniel
