Re: Document Frequency for a set of documents

2010-02-05 Thread Ard Schrijvers
Crossposting to the user list as I think this issue belongs there. See my comments inline. On Fri, Feb 5, 2010 at 10:27 AM, lionel duboeuf wrote: > Hi, > > Sorry for asking again, I still have not found a scalable solution to get > the document frequency of a term t according to a set of documents.

RE: Performance of never optimizing

2008-11-03 Thread Ard Schrijvers
Hello Justus, Chris and Otis, IIRC Ocean [1] by Jason Rutherglen addresses the issue for real time searches on large data sets. A conceptually comparable implementation is done for Jackrabbit, where you can see an enlightening picture over here [2]. In short: 1) IndexReaders are opened only once

RE: Hiring etiquette

2008-10-20 Thread Ard Schrijvers
Hello Rich, There is actually also a specific list indeed for it, [EMAIL PROTECTED], but it is a really low traffic list I must admit, most likely not read at all by the people you are looking for...though, officially, it is the list to use :-) Ard > Hi all, > > Is there a mailing-list-appropr

RE: Advise for Mediabase with Lucene

2008-10-06 Thread Ard Schrijvers
Hello Mathias, IMHO sounds like you are planning to re-invent the wheel while all things you want (AFAICS) are already largely available as open source projects, and perhaps more important, open standards. Your hierarchical data storage sounds like jsr-170 and jsr-283 are the open standard solu

RE: Reusing indexed and analyzed documents

2008-01-21 Thread Ard Schrijvers
Hello, > On 21 Jan 2008 at 16:37, Ard Schrijvers wrote: > > > is there a way to reuse a Lucene document which was indexed and > > analyzed before, but only one single Field has changed? > Karl Wetting wrote: > I don't think you can reuse document instances like t

Reusing indexed and analyzed documents

2008-01-21 Thread Ard Schrijvers
Hello, is there a way to reuse a Lucene document which was indexed and analyzed before, but only one single Field has changed? The use case (Jackrabbit indexing) is when a *lot* of documents have a common field which changes, and the rest of the document is unchanged. I would guess that there is

RE: how can i store lucene results from a webpage to a oracle database

2007-11-08 Thread Ard Schrijvers
I suppose you have about 5 minutes to display a single search? :-) Perhaps before pointing out your possible solutions, you might better start describing your functional requirements, because your suggested solution is headed for problems. So you need custom ordering, check out lucene scoring

RE: Search performance using BooleanQueries in BooleanQueries

2007-10-30 Thread Ard Schrijvers
> On Friday 26 October 2007 09:36:58 Ard Schrijvers wrote: > > Hello, > > > > I am seeing that a query with boolean queries in boolean > queries takes > > much longer than just a single boolean query when the > number of hits > > is fairly large. For e

Search performance using BooleanQueries in BooleanQueries

2007-10-26 Thread Ard Schrijvers
Hello, I am seeing that a query with boolean queries in boolean queries takes much longer than just a single boolean query when the number of hits is fairly large. For example +prop1:a +prop2:b +prop3:c +prop4:d +prop5:e is much faster than (+(+(+(+prop1:a +prop2:b) +prop3:c) +prop4:d) +pro

RE: Performance searching over multiple indexes

2007-10-25 Thread Ard Schrijvers
sistent indexes must be kept small I think. I'll do some more testing, thx for your advice, regards Ard > > > -Original Message- > From: Ard Schrijvers [mailto:[EMAIL PROTECTED] > Sent: Thursday, October 25, 2007 6:09 PM > To: java-user@lucene.apache.or

Performance searching over multiple indexes

2007-10-25 Thread Ard Schrijvers
Hello, I am experimenting with lucene MultiSearcher and do some simple BooleanQueries in which I combine a couple of TermQueries. I am finding that a single lucene index for just 100,000 docs (~10 k each) is about 100 times faster than when I have about 100 separate indexes and use MultiSear

RE: Indexing

2007-08-26 Thread Ard Schrijvers
> > Concept Search - > > 1. For example - Would like to search documents for "Wild > Animals". However, "Wild Animals" will consist of an unlimited number > of N-grams such as I am a bit confused. What is the point of N-grams regarding this concept search? I do not see how N-grams cou

RE: Indexing

2007-08-22 Thread Ard Schrijvers
10 updates per minute is not very much? Why not invalidate your used reader after every commit, and reopen it? If your index is really big, you might want to reopen it less often, but this is very simple to do (reopen after every x updates). Also the RAM and FS solution Erick suggests is possib
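
The "reopen after every x updates" idea can be sketched roughly as follows. This is only an illustration in plain Java: the class ReopenEveryN, its methods and the Supplier used to open a reader are hypothetical names, not Lucene API, and a real implementation would also have to close the reader it discards.

```java
import java.util.function.Supplier;

/** Hypothetical helper (not Lucene API): caches a reader and drops it after every N updates. */
final class ReopenEveryN<R> {
    private final Supplier<R> opener;   // assumption: the caller supplies how a fresh reader is opened
    private final int reopenEvery;      // the "x" from the message above
    private int updatesSinceOpen = 0;
    private int opens = 0;
    private R current;

    ReopenEveryN(Supplier<R> opener, int reopenEvery) {
        if (reopenEvery < 1) {
            throw new IllegalArgumentException("reopenEvery must be >= 1");
        }
        this.opener = opener;
        this.reopenEvery = reopenEvery;
    }

    /** Call once after each committed update to the index. */
    synchronized void updateCommitted() {
        updatesSinceOpen++;
        if (updatesSinceOpen >= reopenEvery) {
            current = null;          // invalidate; real code would close the old reader here
            updatesSinceOpen = 0;
        }
    }

    /** Returns the cached reader, opening a new one only if the old one was invalidated. */
    synchronized R get() {
        if (current == null) {
            current = opener.get();
            opens++;
        }
        return current;
    }

    synchronized int timesOpened() {
        return opens;
    }

    public static void main(String[] args) {
        ReopenEveryN<Object> readers = new ReopenEveryN<>(Object::new, 3);
        readers.get();
        for (int i = 0; i < 7; i++) {
            readers.updateCommitted();
            readers.get();
        }
        System.out.println("opened " + readers.timesOpened() + " times");
    }
}
```

With reopenEvery set to 1 this degenerates to "invalidate after every commit"; larger values trade freshness of search results for fewer reopen operations.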

RE: search returns always the first indexed name

2007-08-22 Thread Ard Schrijvers
Use getValues("name"), see http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/document/Document.html#getValues(java.lang.String) Regards Ard Hi I am using lucene to index xml. I have already managed to index the elements. I am indexing the element of xml w

RE: Indexing

2007-08-22 Thread Ard Schrijvers
Do you reindex everything every 5 minutes from scratch? Can't you keep track of what changes, and only add/remove the correct parts to the index? Ard I'm new to this list. So first of all Hello to everyone! So right now I have a little issue I would like to discuss with you. Suppose that your a

RE: reg-ex based stop word removal

2007-08-22 Thread Ard Schrijvers
> "implement a TokenFilter http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/analysis/TokenFilter.html" You might though want to check the performance implications :-)

RE: lucene suggest

2007-08-21 Thread Ard Schrijvers
the subject, the returned hits > will have duplicates ) > i was asking if i can remove duplicates from the hits?? > > thanks in advance > > Ard Schrijvers <[EMAIL PROTECTED]> wrote: Hello Heba, > > you need some lucene field that serves as an identifier for > your

RE: lucene suggest

2007-08-21 Thread Ard Schrijvers
Hello Heba, you need some lucene field that serves as an identifier for your documents that are indexed. Then, when re-indexing some documents, you can first use the identifier to delete the old indexed documents. You have to take care of this yourself. Regards Ard > > Hello > i would like
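
A minimal sketch of the delete-then-add pattern described above, in plain Java. The in-memory list only stands in for a real index, and TinyIndex, Doc and the method names are made up for illustration; they are not Lucene classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for an index: adding never replaces, so duplicates pile up. */
final class TinyIndex {
    static final class Doc {
        final String id;        // the identifier field the message above asks for
        final String subject;

        Doc(String id, String subject) {
            this.id = id;
            this.subject = subject;
        }
    }

    private final List<Doc> docs = new ArrayList<>();

    /** Plain add: an older version of the same document stays in the index. */
    void add(Doc doc) {
        docs.add(doc);
    }

    /** Removes every document carrying this identifier and returns how many were removed. */
    int deleteById(String id) {
        int before = docs.size();
        docs.removeIf(d -> d.id.equals(id));
        return before - docs.size();
    }

    /** Re-index: delete the old version by identifier first, then add the new one. */
    void update(Doc doc) {
        deleteById(doc.id);
        add(doc);
    }

    int countById(String id) {
        int n = 0;
        for (Doc d : docs) {
            if (d.id.equals(id)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.add(new Doc("42", "first version"));
        index.add(new Doc("42", "second version"));   // duplicate hit for id 42
        System.out.println("after plain adds: " + index.countById("42"));
        index.update(new Doc("42", "third version"));
        System.out.println("after update: " + index.countById("42"));
    }
}
```

The point of the sketch is only the ordering: the caller is responsible for deleting by identifier before adding the re-indexed document.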

RE: What is the contrib/surround/src/java purpose

2007-08-09 Thread Ard Schrijvers
> The minimal documentation is in the Java API documentation > on the lucene java site under contrib: Surround Parser, and in > the surround.txt file here: > http://svn.apache.org/viewvc/lucene/java/trunk/contrib/surround/surround.txt?view=log > > Groeten, > Paul Elsch

RE: Fastest way to perform 'like' searches

2007-08-09 Thread Ard Schrijvers
Thanks Daniel, I understand how it can be done. The only thing that bothers me is that expanding the "*" might result in many phrases, and that in turn might imply a performance hit. I'll see what the impact is, Regards Ard > > On Wednesday 08 August 2007 10:28,

What is the contrib/surround/src/java purpose

2007-08-08 Thread Ard Schrijvers
Hello, without having to dive into the code, I was hoping somebody could tell me what this contrib block does? I can't seem to find any documentation or relevant hits when searching for it, Thanks in advance, Regards Ard

Fastest way to perform 'like' searches

2007-08-08 Thread Ard Schrijvers
Hello, I need to do a search that is also capable of matching on substrings, for example: *oo bar the qu* should find a document that contains 'foo bar the quux' and 'foo bar the qux'. Now, should I index the text as UN_TOKENIZED also, and do a WildcardQuery on this field? Obviously, then every b

RE: How to show category count with results?

2007-07-31 Thread Ard Schrijvers
Hello Shailendra, AFAICS you are reasoning from a static doc-id POV, while documents do not have a static doc-id in lucene. When you have a frequently updated index, you'll end up invalidating cached BitSets (which as the number of categories and number of documents grow can absorb quite amoun
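
For readers unfamiliar with the cached-BitSet approach being discussed, here is a rough sketch of how category counts are typically derived from it, using only java.util.BitSet. CategoryCounts and its method are hypothetical names; bit positions stand for document ids, which is exactly why such a cache has to be thrown away when the ids shift after index updates.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: counts search hits per category by intersecting bit sets. */
final class CategoryCounts {
    /**
     * @param hits         one bit per document id that matched the query
     * @param categoryBits cached bit set per category, one bit per document id in that category
     * @return number of hits falling in each category, in the map's iteration order
     */
    static Map<String, Integer> count(BitSet hits, Map<String, BitSet> categoryBits) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, BitSet> entry : categoryBits.entrySet()) {
            BitSet both = (BitSet) hits.clone();   // clone so the caller's hits are not modified
            both.and(entry.getValue());            // keep only documents that are in this category
            counts.put(entry.getKey(), both.cardinality());
        }
        return counts;
    }

    public static void main(String[] args) {
        BitSet hits = new BitSet();
        hits.set(0);
        hits.set(2);
        hits.set(3);

        BitSet books = new BitSet();
        books.set(0);
        books.set(1);
        books.set(2);

        Map<String, BitSet> categories = new LinkedHashMap<>();
        categories.put("books", books);
        System.out.println(count(hits, categories));
    }
}
```

Each cached set costs roughly one bit per document in the index, so memory grows with both the number of categories and the number of documents, which matches the concern raised in the message.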

RE: Search query with wildcard and spaces

2007-07-31 Thread Ard Schrijvers
Hello, is this just one single example of different words that should return the same results? You might consider implementing a synonym analyzer otherwise. In your case, storing NAME as UN_TOKENIZED should enable your NAME:"De Agos"* search Regards Ard > > Hi, > I would like to make a searc

RE: Running query text through an Analyzer without QueryParser?

2007-07-30 Thread Ard Schrijvers
> > So then would I just concatenate the tokens together to form > the query text? You might better create a TermQuery for each token instead of concatenating, and combine them in a BooleanQuery and say whether all terms must or should occur. Very simple, see [1] Regards Ard [1] http://luce
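
The difference between "must" and "should" when there is one clause per token can be illustrated with a small plain-Java model. This is not the Lucene API: TokenClauses, Occur and matches are hypothetical stand-ins that only mimic the matching rule (all tokens required versus at least one token required) and ignore scoring.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical model of one clause per analyzed token, combined as all-must or all-should. */
final class TokenClauses {
    enum Occur { MUST, SHOULD }

    /**
     * @param docTerms terms present in the document's field
     * @param tokens   tokens produced by running the query text through the analyzer
     * @param occur    MUST: every token has to be present; SHOULD: at least one has to be
     */
    static boolean matches(Set<String> docTerms, List<String> tokens, Occur occur) {
        if (tokens.isEmpty()) {
            return false;                          // no clauses, nothing to match
        }
        if (occur == Occur.MUST) {
            return docTerms.containsAll(tokens);
        }
        for (String token : tokens) {
            if (docTerms.contains(token)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> doc = new HashSet<>(Arrays.asList("quick", "brown", "fox"));
        List<String> tokens = Arrays.asList("quick", "dog");
        System.out.println("MUST:   " + matches(doc, tokens, Occur.MUST));
        System.out.println("SHOULD: " + matches(doc, tokens, Occur.SHOULD));
    }
}
```

Keeping the tokens as separate clauses, rather than concatenating them into one string, is what makes this per-token choice possible.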

RE: Tokenizer

2007-07-30 Thread Ard Schrijvers
Hello, > I have two questions. > > First, Is there a tokenizer that takes every word and simply > makes a token > out of it? org.apache.lucene.analysis.WhitespaceTokenizer > So it looks for two white spaces and takes the characters > between them and makes a token out of them? > > If this to

RE: Indexing/Analyzer question - case-insensitive "contains" search

2007-07-30 Thread Ard Schrijvers
> > It does sound very strange to me, to default to a > WildcardQuery! Suppose I > > am looking for "bold", I am getting hits for "old". > > I know - but that's what the requirements dictate. A better > example might be > a MAC or IP address, where someone might be searching for a > string in

RE: How to show category count with results?

2007-07-30 Thread Ard Schrijvers
Or check out Solr and see if you can use that, or see how they do it, Regards Ard > > You might want to search the mail archive for "facets" or > "faceted search" > (no quotes), as I *think* this might be relevant. > > Best > Erick > > On 7/26/07, Ramana Jelda <[EMAIL PROTECTED]> wrote: > > >

RE: Indexing/Analyzer question - case-insensitive "contains" search

2007-07-30 Thread Ard Schrijvers
Hello, > Hi everyone, > > I told you I'd be back with more questions! :-) > Here is my situation. In my application, the field to be searched is > selected via a drop-down box. I want my searches to basically > be "contains" > searches - I take what the user typed in, put a wildcard > characte

RE: Search terms on a single "instance" of field

2007-07-27 Thread Ard Schrijvers
Hello, > > Company AB", ...). With this I'd like to search for documents that have > daniel and president on the same field, because in a same > text, can exist > daniel and president in different fields. Is this possible?? Not totally sure whether I understand your problem, because it does not s

RE: Lucene shows parts of search query as a HIT

2007-07-20 Thread Ard Schrijvers
ver. If > you're calling > > > > this fragment for each document, you'll always have > only one doc. Try > > > > changing the 'true' to 'false'. Or better yet, open the > writer outside > > > the > > > > document add

RE: Inrease the performance of Indexing in Lucene

2007-07-19 Thread Ard Schrijvers
Hello, Did you take a look at nutch or hadoop or solr? They partially seem to address the things you describe... About the LSI I am not sure what has been done in those projects Regards Ard > > Hi, Please help me. > It's been a month since i am trying lucene. > My requirements are huge, i have to i

RE: Lucene shows parts of search query as a HIT

2007-07-19 Thread Ard Schrijvers
Hello Askar, Which analyzer are you using for indexing and searching? If you use an analyzer that uses stemming, you might see that "change", "changing", "changed", "chan" etc. get reduced to the same word "chan". In luke you can test with plugins that show you what tokens are created from y

RE: Token offset values for custom Tokenizer

2007-07-16 Thread Ard Schrijvers
that were placed into the token during > indexing > are not being returned, they have been shifted. > Thanks. > Shahan > > Ard Schrijvers wrote: > > Hello, > > > > > >> Hi, > >> I am storing custom values in the Tokens provided by

RE: Serving remote lucene client - RMI vs HTTP

2007-07-16 Thread Ard Schrijvers
Hello, > Hi Everyone, > > Thank you all for your replies. > > And reply to your questions Grant: > We have more than 3 Million document in our index. > We get more than 150,000 searches (queries) per day. We > expect this no to go > up. Just curious, but suppose those 150,000 searches are don

RE: Does Index have a Tokenizer Built into it

2007-07-16 Thread Ard Schrijvers
orrect? Will I need to > store "term text" > in order to be able to access the actual term instead of > stemmed words? > > Thanks for all your help, > > --JP > > On 7/13/07, Ard Schrijvers <[EMAIL PROTECTED]> wrote: > > > > Hello,

RE: Token offset values for custom Tokenizer

2007-07-16 Thread Ard Schrijvers
Hello, > Hi, > I am storing custom values in the Tokens provided by a Tokenizer but > when retrieving them from the index the values don't match. What do you mean by retrieving? Do you mean retrieving terms, or do you mean doing a search with words you know that should be in, but you do not fi

RE: Does Index have a Tokenizer Built into it

2007-07-13 Thread Ard Schrijvers
Hello, > I'm wondering if after > opening the > index I can retrieve the Tokens (not the terms) of a > document, something > akin to IndexReader.Document(n).getTokenizer(). It is obviously not possible to get the original tokens of the document back when you haven't stored the document, becaus

RE: How to reflect index changes to search automatically

2007-07-13 Thread Ard Schrijvers
The SearchClient is obviously not aware of a changing index, so doesn't know when it has to be reopened. You can at least do the following: 1) you periodically check whether the index folder's timestamp has changed (or, if this stays the same, check the files in it) --> if changed, reo
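
Option 1) above, the periodic timestamp check, could look roughly like this with nothing but java.io.File. IndexChangeWatcher is a hypothetical name, and the sketch assumes that every change to the index touches either the directory or a file directly inside it; how often to poll and how to reopen the searcher are left to the caller.

```java
import java.io.File;

/** Hypothetical poller: reports when the index directory or a file in it got a new timestamp. */
final class IndexChangeWatcher {
    private final File indexDir;
    private long lastSeen;

    IndexChangeWatcher(File indexDir) {
        this.indexDir = indexDir;
        this.lastSeen = newestTimestamp();
    }

    /** Newest lastModified value among the directory itself and the files directly in it. */
    long newestTimestamp() {
        long newest = indexDir.lastModified();
        File[] files = indexDir.listFiles();
        if (files != null) {
            for (File file : files) {
                newest = Math.max(newest, file.lastModified());
            }
        }
        return newest;
    }

    /** Returns true once per observed change; the caller would reopen its searcher then. */
    synchronized boolean changedSinceLastCheck() {
        long now = newestTimestamp();
        if (now != lastSeen) {
            lastSeen = now;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : ".");
        IndexChangeWatcher watcher = new IndexChangeWatcher(dir);
        System.out.println("changed since start: " + watcher.changedSinceLastCheck());
    }
}
```

A scheduled task could call changedSinceLastCheck() every few seconds and swap in a new searcher only when it returns true, so searches between changes keep using the already opened one.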

RE: Calling indexWriter.close() in web app

2007-07-12 Thread Ard Schrijvers
Hello, > > The lock file is only for Writers. The lock file ensures that > even two > writers from two JVM's will not step on each other. Readers > do not care > about what the writers are doing or whether there is a lock > file... Is this always true? The deleteDocuments method of the Index

RE: document field indexing

2007-07-10 Thread Ard Schrijvers
Hello John, see another thread about this issue this morning. Because of indexing performance in combination with the inverted index, what you want is not possible. Regards Ard > > Hi > Lets say we have a single lucene document that has two text fields: > field1 and field2. > Data kept in field1

RE: Calling indexWriter.close() in web app

2007-07-09 Thread Ard Schrijvers
Hello, > I'm developing a web app with struts that need to embed lucene > functionalities. I need that my app adds documents to the > index after that a > document is added (documents are very few, but of large > size). I read that i > have to use a single instance of indexwriter to edit the >

RE: Should the IndexSearcher be closed after very search completed

2007-07-09 Thread Ard Schrijvers
Closing the IndexSearcher is best done only after a deleteDocuments with a reader or changes with a writer. For performance reasons, it is better not to close the IndexSearcher if not needed. Regards Ard > > > sorry, the subject should be "Should the IndexSearcher be > closed after > every sear

RE: Auto Slop

2007-07-02 Thread Ard Schrijvers
> I just ran into an interesting problem today, and wanted to know if it > was my understanding or Lucene that was out of whack -- right now I'm > leaning toward a fault between the chair and the keyboard. > > I attempted to do a simple phrase query using the StandardAnalyzer: > "United States"

RE: Using Lucene to search Multiple Databases

2007-06-18 Thread Ard Schrijvers
A search server based on lucene which is very easy to use and implement. I think you can use it to achieve what you want, Regards > > @Ard Schrijvers > > > What is this Solr > i didnt get you. will you

RE: Using Lucene to search Multiple Databases

2007-06-18 Thread Ard Schrijvers
Hello Rajat, this sounds to me like something very suitable for Solr, Regards Ard > > > Rajat, > > I don't know about the Web Interface you are mentioning but > the task can be > done with a little bit coding from your side. > > I would suggest indexing each database in its own index which

RE: Does Lucene search over memory too?

2007-05-28 Thread Ard Schrijvers
Hello, think you can find your answer in the IndexWriter API: http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/index/IndexWriter.html The optional autoCommit argument to the constructors controls visibility of the changes to IndexReader instances reading

RE: Number of documents in an index with filter

2007-05-27 Thread Ard Schrijvers
> > > Greetings, > > I would like to add the number of possible hits in my > queries, for example, > "found 18 hits out of a possible 245,000 documents". I am > assuming that > IndexReader.numDocs() is the best way to get this value. > > However, I would like to use a filter as part of the

RE: Setting the maximum number of documents in a lucene segment

2007-05-26 Thread Ard Schrijvers
ufferedDocs. But, increasing the default number of documents in the "smallest" segments from 10 to, say 100, would also help me. Then again, I am not sure whether I am doing something which can be achieved more effectively/simply, thanks in advance for any pointers, Regards Ard Schri

RE: Setting the maximum number of documents in a lucene segment

2007-05-25 Thread Ard Schrijvers
axBufferedDocs(largeValue) does not do the trick > (I think because in my case because the writer is flushed and > closed after a few updates) > > Does anyone know whether it is possible to make the default > number of documents a segment can contain larger? > > Thanks in a

Setting the maximum number of documents in a lucene segment

2007-05-25 Thread Ard Schrijvers
documents a segment can contain larger? Thanks in advance, Ard Schrijvers -- Hippo Oosteinde 11 1017WT Amsterdam The Netherlands Tel +31 (0)20 5224466 - [EMAIL PROTECTED] / http://www.hippo.nl