Crossposting to the user list, as I think this issue belongs there. See
my comments inline
On Fri, Feb 5, 2010 at 10:27 AM, lionel duboeuf
wrote:
> Hi,
>
> Sorry for asking again, I still have not found a scalable solution to get
> the document frequency of a term t according to a set of documents.
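For what it's worth: the document frequency over a whole index is directly
available on IndexReader; it is restricting it to an arbitrary subset of
documents that Lucene does not give you out of the box. A minimal sketch of
the whole-index case (field and term names are made up):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;

    IndexReader reader = IndexReader.open("/path/to/index");
    // number of documents in this index that contain the term
    int df = reader.docFreq(new Term("contents", "lucene"));
    reader.close();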
Hello Justus, Chris and Otis,
IIRC Ocean [1] by Jason Rutherglen addresses the issue for real time
searches on large data sets. A conceptually comparable implementation is
done for Jackrabbit, where you can see an enlightening picture over here
[2]. In short:
1) IndexReaders are opened only once
Hello Rich,
There is actually a specific list for it, [EMAIL PROTECTED],
but it is a really low-traffic list, I must admit, most likely not read
at all by the people you are looking for... though, officially, it is the
list to use :-)
Ard
> Hi all,
>
> Is there a mailing-list-appropr
Hello Mathias,
IMHO it sounds like you are planning to re-invent the wheel, while all the
things you want (AFAICS) are already largely available as open source
projects and, perhaps more importantly, as open standards.
Your hierarchical data storage sounds like jsr-170 and jsr-283 are the
open standard solu
Hello,
> 21 jan 2008 kl. 16.37 skrev Ard Schrijvers:
>
> > is there a way to reuse a Lucene document which was indexed and
> > analyzed before, but only one single Field has changed?
> Karl Wetting wrote:
> I don't think you can reuse document instances like t
Hello,
is there a way to reuse a Lucene document which was indexed and analyzed
before, but only one single Field has changed? The use case (Jackrabbit
indexing) is when a *lot* of documents have a common field which
changes, and the rest of the document is unchanged. I would guess that
there is
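Sketching the usual route, since as far as I know the already-analyzed fields
cannot be reused: delete and re-add the whole document, e.g. with
updateDocument (a sketch assuming Lucene 2.1+, an open IndexWriter writer,
and a hypothetical untokenized "uid" field):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.Term;

    // the full document has to be rebuilt, even if only one field changed
    Document doc = new Document();
    doc.add(new Field("uid", uid, Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.add(new Field("contents", contents, Field.Store.NO, Field.Index.TOKENIZED));
    // atomically deletes any older document with the same uid and adds this one
    writer.updateDocument(new Term("uid", uid), doc);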
I suppose you then have about 5 minutes to display a single search? :-)
Perhaps before pointing out your possible solutions, you might better
start by describing your functional requirements, because your suggested
solution is headed for problems. So you need custom ordering; check out
lucene scoring
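If the ordering is on a field value rather than on relevance, a Sort at
search time is usually the cheaper route; a minimal sketch (the "date" field
is made up and must be indexed untokenized):

    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.Sort;

    // order the hits on a field value instead of by score
    Hits hits = searcher.search(query, new Sort("date"));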
> On Friday 26 October 2007 09:36:58 Ard Schrijvers wrote:
> > Hello,
> >
> > I am seeing that a query with boolean queries in boolean
> queries takes
> > much longer than just a single boolean query when the
> number of hits
> > is fairly large. For e
Hello,
I am seeing that a query with boolean queries in boolean queries takes
much longer than just a single boolean query when the number of hits is
fairly large. For example
+prop1:a +prop2:b +prop3:c +prop4:d +prop5:e
is much faster than
(+(+(+(+prop1:a +prop2:b) +prop3:c) +prop4:d) +pro
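For reference, the flat variant built programmatically is a single
BooleanQuery with one required clause per property (a sketch against the
2.x API):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    BooleanQuery bq = new BooleanQuery();
    bq.add(new TermQuery(new Term("prop1", "a")), BooleanClause.Occur.MUST);
    bq.add(new TermQuery(new Term("prop2", "b")), BooleanClause.Occur.MUST);
    bq.add(new TermQuery(new Term("prop3", "c")), BooleanClause.Occur.MUST);
    // ...one flat MUST clause per property, no nesting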
sistent indexes must be kept small I think. I'll do some more
testing,
thx for your advice,
regards Ard
>
>
> -Original Message-
> From: Ard Schrijvers [mailto:[EMAIL PROTECTED]
> Sent: Thursday, October 25, 2007 6:09 PM
> To: java-user@lucene.apache.or
Hello,
I am experimenting with lucene MultiSearcher and do some simple
BooleanQueries in which I combine a couple of TermQueries. I am
experiencing that a single lucene index for just 100.000 docs (~10 k
each) is like 100 times faster than when I have about 100 separate
indexes and use MultiSear
>
> Concept Search -
>
> 1. For example - Would like to search documents for "Wild
> Animals". However, "Wild Animals" will consist of an unlimited number
> of N-grams such as
I am a bit confused. What is the point of N-grams regarding this concept
search? I do not see how N-grams cou
10 updates per minute is not very much? Why not invalidate your used reader
after every commit, and reopen it? If your index is really big, you might want
to reopen it fewer times, but this is very simple to do (reopen every x
updates)
Also the RAM and FS solution Erick suggests is possib
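A sketch of the simple reopen-on-change check (assuming a 2.x IndexReader
opened from a Directory; isCurrent() tells you whether the index changed on
disk since the reader was opened):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;

    if (!reader.isCurrent()) {       // index changed since this reader was opened
        IndexReader newReader = IndexReader.open(directory);
        searcher.close();
        reader.close();
        reader = newReader;
        searcher = new IndexSearcher(reader);
    }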
Use getValues("name"), see
http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/document/Document.html#getValues(java.lang.String)
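In code that comes down to (field name made up):

    import org.apache.lucene.document.Document;

    Document doc = hits.doc(i);
    // all stored values of a multi-valued field, in the order they were added
    String[] values = doc.getValues("name");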
Regards Ard
Hi
I am using lucene to index xml. I have already managed to index the
elements. I am indexing the element of xml w
Do you reindex everything every 5 minutes from scratch? Can't you keep track of
what changes, and only add/remove the correct parts to the index?
Ard
I'm new to this list. So first of all Hello to everyone!
So right now I have a little issue I would like to discuss with you.
Suppose that you're a
>
"implement a TokenFilter
http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/analysis/TokenFilter.html";
You might though want to check the performance implications :-)
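A minimal sketch of such a filter against the 2.x TokenStream API (the
lower-casing body is just a stand-in for whatever per-token transformation
you need):

    import java.io.IOException;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    public class MyTokenFilter extends TokenFilter {
        public MyTokenFilter(TokenStream input) {
            super(input);
        }
        public Token next() throws IOException {
            Token t = input.next();
            if (t == null) return null;   // end of the stream
            // emit a transformed copy of the token, keeping its offsets
            return new Token(t.termText().toLowerCase(), t.startOffset(), t.endOffset());
        }
    }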
the subject, the returned hits
> will have duplicates )
> I was asking if I can remove duplicates from the hits?
>
> thanks in advance
>
> Ard Schrijvers <[EMAIL PROTECTED]> wrote: Hello Heba,
>
> you need some lucene field that serves as an identifier for
> your
Hello Heba,
you need some lucene field that serves as an identifier for your documents that
are indexed. Then, when re-indexing some documents, you can first use the
identifier to delete the old indexed documents. You have to take care of this
yourself.
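A sketch of that delete-then-re-add, assuming Lucene 2.1+ with an open
IndexWriter and a hypothetical untokenized "uid" field:

    import org.apache.lucene.index.Term;

    // remove every previously indexed version of this document first
    writer.deleteDocuments(new Term("uid", uid));
    writer.addDocument(newDoc);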
Regards Ard
>
> Hello
> i would like
> The minimal documentation is in the Java API documentation
> on the lucene java site under contrib: Surround Parser, and in
> the surround.txt file here:
> http://svn.apache.org/viewvc/lucene/java/trunk/contrib/surroun
> d/surround.txt?view=log
>
> Groeten,
> Paul Elsch
Thanks Daniel,
I understand how it can be done. The only thing that bothers me is that
expanding the "*" might result in many phrases, and that in turn might imply a
performance hit. I'll see what the impact is,
Regards Ard
>
> On Wednesday 08 August 2007 10:28,
Hello,
without having to dive into the code, I was hoping somebody could tell me what
this contrib block does? I can't seem to find any documentation or relevant
hits when searching for it,
Thanks in advance,
Regards Ard
Hello,
I need to do a search that is also capable of matching substrings, for example:
*oo bar the qu*
should find a document that contains 'foo bar the quux' and 'foo bar the qux'.
Now, should I index the text as UN_TOKENIZED also, and do a WildCardQuery on
this field? Obviously, then every b
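Concretely, the variant I have in mind (the extra untokenized field name is
made up; note that a programmatic WildcardQuery does allow the leading
wildcard, but it then has to walk a large part of the term dictionary):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.WildcardQuery;

    // on an UN_TOKENIZED field the whole sentence is a single term,
    // so the wildcard pattern may contain spaces
    WildcardQuery wq = new WildcardQuery(new Term("contents_raw", "*oo bar the qu*"));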
Hello Shailendra,
AFAICS you are reasoning from a static doc-id POV, while documents do not have
a static doc-id in lucene. When you have a frequently updated index, you'll end
up invalidating cached BitSets (which, as the number of categories and number
of documents grow, can absorb quite amoun
Hello,
is this just one single example of different words that should return the same
results? Otherwise, you might consider implementing a synonym analyzer.
In your case, storing NAME as UN_TOKENIZED should enable your NAME:"De Agos"*
search
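Programmatically that prefix search on the untokenized field would be:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.PrefixQuery;

    // matches every document whose untokenized NAME starts with "De Agos"
    PrefixQuery pq = new PrefixQuery(new Term("NAME", "De Agos"));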
Regards Ard
>
> Hi,
> I would like to make a searc
>
> So then would I just concatenate the tokens together to form
> the query text?
You might better create a TermQuery for each token instead of concatenating,
combine them in a BooleanQuery, and say whether all terms must or should
occur. Very simple, see [1]
Regards Ard
[1]
http://luce
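Roughly like this (the tokens array stands for whatever your analyzer
produced; use Occur.SHOULD instead of MUST if any term may match):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    BooleanQuery bq = new BooleanQuery();
    for (int i = 0; i < tokens.length; i++) {
        // one TermQuery per token, all required
        bq.add(new TermQuery(new Term("contents", tokens[i])), BooleanClause.Occur.MUST);
    }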
Hello,
> I have two questions.
>
> First, Is there a tokenizer that takes every word and simply
> makes a token
> out of it?
org.apache.lucene.analysis.WhitespaceTokenizer
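For example, to see what it produces (2.x Token API):

    import java.io.StringReader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.WhitespaceTokenizer;

    WhitespaceTokenizer tokenizer = new WhitespaceTokenizer(new StringReader("foo bar baz"));
    Token token;
    while ((token = tokenizer.next()) != null) {
        System.out.println(token.termText());   // foo, bar, baz
    }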
> So it looks for two white spaces and takes the characters
> between them and makes a token out of them?
>
> If this to
> > It does sound very strange to me, to default to a
> WildCardQuery! Suppose I
> > am looking for "bold", I am getting hits for "old".
>
> I know - but that's what the requirements dictate. A better
> example might be
> a MAC or IP address, where someone might be searching for a
> string in
Or check out Solr and see if you can use that, or see how they do it,
Regards Ard
>
> You might want to search the mail archive for "facets" or
> "faceted search"
> (no quotes), as I *think* this might be relevant.
>
> Best
> Erick
>
> On 7/26/07, Ramana Jelda <[EMAIL PROTECTED]> wrote:
> >
>
Hello,
> Hi everyone,
>
> I told you I'd be back with more questions! :-)
> Here is my situation. In my application, the field to be searched is
> selected via a drop-down box. I want my searches to basically
> be "contains"
> searches - I take what the user typed in, put a wildcard
> characte
Hello,
>
> Company AB", ...). With this I´d like to search for documents that has
> daniel and president on the same field, because in a same
> text, can exist
> daniel and president in different fields. Is this possible??
Not totally sure wether I understand your problem, because it does not s
ver. If
> you're calling
> > > > this fragment for each document, you'll always have
> only one doc. Try
> > > > changing the 'true' to 'false'. Or better yet, open the
> writer outside
> > > the
> > > > document add
Hello,
Did you take a look at Nutch, Hadoop, or Solr? They partially seem to address
the things you describe... About the LSI I am not sure what has been done in
those projects
Regards Ard
>
> Hi, Please help me.
> Its been a month since i am trying lucene.
> My requirements are huge, i have to i
Hello Askar,
Which analyzer are you using for indexing and searching? If you use an analyzer
that uses stemming, you might see that "change", "changing", "changed", "chan",
etc. all get reduced to the same word "chan".
In Luke you can test with plugins that show you what tokens are created from
your text.
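You can also see it without Luke by running text through an analysis chain
yourself, for example with the core PorterStemFilter (a sketch; the exact
stems depend on the stemmer you actually use):

    import java.io.StringReader;
    import org.apache.lucene.analysis.PorterStemFilter;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;

    TokenStream stream = new PorterStemFilter(
            new WhitespaceTokenizer(new StringReader("change changing changed")));
    Token t;
    while ((t = stream.next()) != null) {
        System.out.println(t.termText());   // all three reduce to the same stem
    }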
that were placed into the token during
> indexing
> are not being returned, they have been shifted.
> Thanks.
> Shahan
>
> Ard Schrijvers wrote:
> > Hello,
> >
> >
> >> Hi,
> >> I am storing custom values in the Tokens provided by
Hello,
> Hi EVeryone,
>
> Thank you all for your replies.
>
> And reply to your questions Grant:
> We have more than 3 million documents in our index.
> We get more than 150,000 searches (queries) per day. We
> expect this not to go
> up.
Just curious, but suppose those 150,000 searches are don
orrect? Will I need to
> store "term text"
> in order to be able to access the actual term instead of
> stemmed words?
>
> Thanks for all your help,
>
> --JP
>
> On 7/13/07, Ard Schrijvers <[EMAIL PROTECTED]> wrote:
> >
> > Hello,
Hello,
> Hi,
> I am storing custom values in the Tokens provided by a Tokenizer but
> when retrieving them from the index the values don't match.
What do you mean by retrieving? Do you mean retrieving terms, or do you mean
doing a search with words you know that should be in, but you do not fi
Hello,
> I'm wondering if after
> opening the
> index I can retrieve the Tokens (not the terms) of a
> document, something
> akin to IndexReader.Document(n).getTokenizer().
It is obviously not possible to get the original tokens of the document back
when you haven't stored the document, becaus
The SearchClient is obviously not aware of a changing index, so doesn't know
when it has to be reopened.
You can at least do the following:
1) you periodically check whether the index folder's timestamp changed
(or, if that stays the same, do it with the files in it) --> if changed, reo
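Instead of file timestamps, the index version is the more reliable signal; a
sketch with the 2.x API (directory is the Directory the index lives in):

    import org.apache.lucene.index.IndexReader;

    // compare the version on disk with the version this reader was opened on
    if (IndexReader.getCurrentVersion(directory) != reader.getVersion()) {
        // close the old reader/searcher and open new ones
    }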
Hello,
>
> The lock file is only for Writers. The lock file ensures that
> even two
> writers from two JVM's will not step on each other. Readers
> do not care
> about what the writers are doing or whether there is a lock
> file...
Is this always true? The deleteDocuments method of the Index
Hello John,
see another thread about this issue this morning. Due to index performance in
combination with an inverted index, what you want is not possible.
Regards Ard
>
> Hi
> Lets say we have a single lucene document that has two text fields:
> field1 and field2.
> Data kept in field1
Hello,
> I'm developing a web app with struts that need to embed lucene
> functionalities. I need that my app adds documents to the
> index after that a
> document is added (documents are very few, but of large
> size). I read that i
> have to use a single instance of indexwriter to edit the
>
Closing the IndexSearcher is best done only after a deleteDocuments with a
reader or changes with a writer.
For performance reasons, it is better not to close the IndexSearcher if not
needed,
Regards Ard
>
>
> sorry, the subject should be "Should the IndexSearcher be
> closed after
> every sear
> I just ran into an interesting problem today, and wanted to know if it
> was my understanding or Lucene that was out of whack -- right now I'm
> leaning toward a fault between the chair and the keyboard.
>
> I attempted to do a simple phrase query using the StandardAnalyzer:
> "United States"
A search server based on lucene which is very easy to use and implement. I
think you can use it to achieve what you want,
Regards
>
> @Ard Schrijvers
>
>
> What is this Solr
> i didnt get you. will you
Hello Rajat,
this sounds to me like something very suitable for Solr,
Regards Ard
>
>
> Rajat,
>
> I don't know about the Web Interface you are mentioning but
> the task can be
> done with a little bit coding from your side.
>
> I would suggest indexing each database in its own index which
Hello,
I think you can find your answer in the IndexWriter API:
http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/index/IndexWriter.html
The optional autoCommit argument to the constructors controls visibility of
the changes to IndexReader instances reading
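For reference, autoCommit is the second argument of the 2.2/2.3 constructors
(it was deprecated again in later versions):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    // autoCommit=false: readers only see the changes once the writer is closed
    IndexWriter writer = new IndexWriter(directory, false, new StandardAnalyzer());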
>
>
> Greetings,
>
> I would like to add the number of possible hits in my
> queries, for example,
> "found 18 hits out of a possible 245,000 documents". I am
> assuming that
> IndexReader.numDocs() is the best way to get this value.
>
> However, I would like to use a filter as part of the
ufferedDocs. But increasing the default number of documents in the
"smallest" segments from 10 to, say, 100 would also help me.
Then again, I am not sure whether I am doing something which can be achieved
more effectively/simply,
thanks in advance for any pointers,
Regards Ard Schri
setMaxBufferedDocs(largeValue) does not do the trick
> (I think because in my case the writer is flushed and
> closed after a few updates)
>
> Does anyone know whether it is possible to make the default
> number of documents a segment can contain larger?
>
> Thanks in a
documents
a segment can contain larger?
Thanks in advance,
Ard Schrijvers
--
Hippo
Oosteinde 11
1017WT Amsterdam
The Netherlands
Tel +31 (0)20 5224466
-
[EMAIL PROTECTED] / http://www.hippo.nl