Hello Mathias,
IMHO it sounds like you are planning to re-invent the wheel, while all the
things you want (AFAICS) are already largely available as open source
projects and, perhaps more importantly, open standards.
For your hierarchical data storage, JSR-170 and JSR-283 sound like the
open standard solu
Hello Rich,
There is indeed a specific list for it, [EMAIL PROTECTED],
but it is a really low traffic list I must admit, most likely not read
at all by the people you are looking for... though, officially, it is the
list to use :-)
Ard
> Hi all,
>
> Is there a mailing-list-appropr
Hello Justus, Chris and Otis,
IIRC Ocean [1] by Jason Rutherglen addresses the issue for real time
searches on large data sets. A conceptually comparable implementation is
done for Jackrabbit, where you can see an enlightening picture over here
[2]. In short:
1) IndexReaders are opened only once
crossposting to the user list as I think this issue belongs there. See
my comments inline
On Fri, Feb 5, 2010 at 10:27 AM, lionel duboeuf
wrote:
> Hi,
>
> Sorry for asking again, I still have not found a scalable solution to get
> the document frequency of a term t according to a set of documents.
Hello Heba,
You need some Lucene field that serves as an identifier for the documents
you index. Then, when re-indexing some documents, you can first use the
identifier to delete the old indexed documents. You have to take care of this
yourself.
Regards Ard
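A minimal sketch of this delete-then-add cycle, assuming the Lucene 2.x API of that era (the field name `uid` and the `Reindexer` helper are illustrative, not from the original mail):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

// Illustrative helper: remove the old copy of a document by its identifier
// field, then add the fresh version. The field name "uid" is an assumption.
public class Reindexer {
    public static void reindex(IndexWriter writer, String uid, Document fresh)
            throws Exception {
        // Delete any previously indexed document carrying this identifier.
        writer.deleteDocuments(new Term("uid", uid));
        // Make sure the new document carries the identifier too.
        fresh.add(new Field("uid", uid, Field.Store.YES, Field.Index.UN_TOKENIZED));
        writer.addDocument(fresh);
    }
}
```

In Lucene 2.1 and later the two steps are also available as a single call, `writer.updateDocument(term, doc)`.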
>
> Hello
> i would like
the subject, the returned hits
> will have duplicates )
> i was asking if i can remove duplicates from the hits??
>
> thanks in advance
>
> Ard Schrijvers <[EMAIL PROTECTED]> wrote: Hello Heba,
>
> you need some lucene field that serves as an identifier for
> your
>
implement a TokenFilter
http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/analysis/TokenFilter.html
You might though want to check the performance implications :-)
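A minimal custom filter, sketched against the Token-based TokenFilter API linked above (the filter's behavior here, lowercasing, is just a placeholder):

```java
import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Placeholder filter: lowercases each token passed through it.
public class LowerCasingFilter extends TokenFilter {
    public LowerCasingFilter(TokenStream in) {
        super(in);
    }

    public Token next() throws IOException {
        Token token = input.next();   // pull the next token from the chain
        if (token == null) {
            return null;              // end of stream
        }
        token.setTermText(token.termText().toLowerCase());
        return token;
    }
}
```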
Do you reindex everything every 5 minutes from scratch? Can't you keep track of
what changes, and only add/remove the correct parts to the index?
Ard
I'm new to this list. So first of all Hello to everyone!
So right now I have a little issue I would like to discuss with you.
Suppose that your a
Use getValues("name"), see
http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/document/Document.html#getValues(java.lang.String)
Regards Ard
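For a multi-valued field, something like this (field name `name` taken from the question; the `Hits` object is assumed to come from a prior search):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.search.Hits;

// Read every stored value of the multi-valued "name" field from the first
// hit. getValues only returns values that were stored at indexing time.
public class MultiValueExample {
    public static void printNames(Hits hits) throws Exception {
        Document doc = hits.doc(0);
        String[] names = doc.getValues("name");
        for (String name : names) {
            System.out.println(name);
        }
    }
}
```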
Hi
I am using lucene to index xml. I have already managed to index the
elements. I am indexing the element of xml w
10 updates per minute is not very much. Why not invalidate your used reader
after every commit, and reopen it? If your index is really big, you might want
to reopen it fewer times, but this is very simple to do (reopen after every x
updates)
Also the RAM and FS solution Erick suggests is possib
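The invalidate-and-reopen cycle could look roughly like this (a sketch; the counter, threshold, and index path are all illustrative):

```java
import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;

// Sketch: reopen the searcher only after every N updates instead of after
// each one. Names and the threshold value are illustrative.
public class SearcherHolder {
    private IndexSearcher searcher;
    private int updatesSinceReopen = 0;
    private static final int REOPEN_EVERY = 10;

    public SearcherHolder(String indexPath) throws IOException {
        this.searcher = new IndexSearcher(indexPath);
    }

    public synchronized IndexSearcher getSearcher() {
        return searcher;
    }

    // Call this after each commit to the index.
    public synchronized void afterCommit(String indexPath) throws IOException {
        if (++updatesSinceReopen >= REOPEN_EVERY) {
            searcher.close();
            searcher = new IndexSearcher(indexPath);
            updatesSinceReopen = 0;
        }
    }
}
```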
>
> Concept Search -
>
> 1. For example - Would like to search documents for "Wild
> Animals". However, "Wild Animals" will consist of an unlimited number
> of N-grams such as
I am a bit confused. What is the point of N-grams regarding this concept
search? I do not see how N-grams cou
Hello,
I am experimenting with lucene MultiSearcher and do some simple
BooleanQueries in which I combine a couple of TermQueries. I am
experiencing that a single lucene index for just 100,000 docs (~10 k
each) is like 100 times faster than when I have about 100 separate
indexes and use MultiSear
sistent indexes must be kept small I think. I'll do some more
testing,
thx for your advice,
regards Ard
>
>
> -Original Message-
> From: Ard Schrijvers [mailto:[EMAIL PROTECTED]
> Sent: Thursday, October 25, 2007 6:09 PM
> To: java-user@lucene.apache.or
Hello,
I am seeing that a query with boolean queries nested in boolean queries takes
much longer than just a single boolean query when the number of hits is
fairly large. For example
+prop1:a +prop2:b +prop3:c +prop4:d +prop5:e
is much faster than
(+(+(+(+prop1:a +prop2:b) +prop3:c) +prop4:d) +pro
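The flat variant can be built directly, instead of nesting clause by clause (a sketch using the BooleanQuery API; field names taken from the example above):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// Flat form: one BooleanQuery with five required clauses, rather than four
// nested two-clause queries.
public class FlatQueryExample {
    public static BooleanQuery build() {
        BooleanQuery flat = new BooleanQuery();
        flat.add(new TermQuery(new Term("prop1", "a")), BooleanClause.Occur.MUST);
        flat.add(new TermQuery(new Term("prop2", "b")), BooleanClause.Occur.MUST);
        flat.add(new TermQuery(new Term("prop3", "c")), BooleanClause.Occur.MUST);
        flat.add(new TermQuery(new Term("prop4", "d")), BooleanClause.Occur.MUST);
        flat.add(new TermQuery(new Term("prop5", "e")), BooleanClause.Occur.MUST);
        return flat;
    }
}
```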
> On Friday 26 October 2007 09:36:58 Ard Schrijvers wrote:
> > Hello,
> >
> > I am seeing that a query with boolean queries in boolean
> queries takes
> > much longer than just a single boolean query when the
> number of hits
> > is fairly large. For e
I suppose you have about 5 minutes to display a single search? :-)
Perhaps before pointing out your possible solutions, you might better
start describing your functional requirements, because your suggested
solution is headed for problems. So you need custom ordering, check out
lucene scoring
Hello,
is there a way to reuse a Lucene document which was indexed and analyzed
before, but only one single Field has changed? The use case (Jackrabbit
indexing) is when a *lot* of documents have a common field which
changes, while the rest of the document is unchanged. I would guess that
there is
Hello,
> 21 jan 2008 kl. 16.37 skrev Ard Schrijvers:
>
> > is there a way to reuse a Lucene document which was indexed and
> > analyzed before, but only one single Field has changed?
> Karl Wetting wrote:
> I don't think you can reuse document instances like t
documents
a segment can contain larger?
Thanks in advance,
Ard Schrijvers
--
Hippo
Oosteinde 11
1017WT Amsterdam
The Netherlands
Tel +31 (0)20 5224466
-
[EMAIL PROTECTED] / http://www.hippo.nl
axBufferedDocs(largeValue) does not do the trick
> (I think because in my case because the writer is flushed and
> closed after an few updates)
>
> Does anyone know whether it is possible to make the default
> number of documents a segment can contain larger?
>
> Thanks in a
ufferedDocs. But, increasing the default number of documents in the
"smallest" segments from 10 to, say, 100, would also help me.
Then again, I am not sure whether I am doing something which can be achieved
more effectively/simply,
thanks in advance for any pointers,
Regards Ard Schri
>
>
> Greetings,
>
> I would like to add the number of possible hits in my
> queries, for example,
> "found 18 hits out of a possible 245,000 documents". I am
> assuming that
> IndexReader.numDocs() is the best way to get this value.
>
> However, I would like to use a filter as part of the
Hello,
think you can find your answer in the IndexWriter API:
http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/index/IndexWriter.html
The optional autoCommit argument to the constructors controls visibility of
the changes to IndexReader instances reading
Hello Rajat,
this sounds to me like something very suitable for Solr,
Regards Ard
>
>
> Rajat,
>
> I don't know about the Web Interface you are mentioning but
> the task can be
> done with a little bit coding from your side.
>
> I would suggest indexing each database in its own index which
Solr is a search server based on Lucene which is very easy to use and implement.
I think you can use it to achieve what you want,
Regards
>
> @Ard Schrijvers
>
>
> What is this Solr
> i didnt get you. will you
> I just ran into an interesting problem today, and wanted to know if it
> was my understanding or Lucene that was out of whack -- right now I'm
> leaning toward a fault between the chair and the keyboard.
>
> I attempted to do a simple phrase query using the StandardAnalyzer:
> "United States"
Closing the IndexSearcher is best done only after a deleteDocuments with a
reader or after changes with a writer.
For performance reasons, it is better not to close the IndexSearcher when not
needed,
Regards Ard
>
>
> sorry, the subject should be "Should the IndexSearcher be
> closed after
> every sear
Hello,
> I'm developing a web app with struts that need to embed lucene
> functionalities. I need that my app adds documents to the
> index after that a
> document is added (documents are very few, but of large
> size). I read that i
> have to use a single instance of indexwriter to edit the
>
Hello John,
see another thread about this issue from this morning. Due to index performance
in combination with an inverted index, what you want is not possible.
Regards Ard
>
> Hi
> Lets say we have a single lucene document that has two text fields:
> field1 and field2.
> Data kept in field1
Hello,
>
> The lock file is only for Writers. The lock file ensures that
> even two
> writers from two JVM's will not step on each other. Readers
> do not care
> about what the writers are doing or whether there is a lock
> file...
Is this always true? The deleteDocuments method of the Index
Hello,
> I'm wondering if after
> opening the
> index I can retrieve the Tokens (not the terms) of a
> document, something
> akin to IndexReader.Document(n).getTokenizer().
It is obviously not possible to get the original tokens of the document back
when you haven't stored the document, becaus
The SearchClient is obviously not aware of a changing index, so doesn't know
when it has to be reopened.
You can at least do the following:
1) you periodically check whether the index folder's timestamp has changed
(or if this stays the same, do it with the files in it) --> if changed, reo
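Step 1 could be sketched in plain Java like this (class and method names are made up for illustration; checking the files as well as the directory guards against filesystems that do not bump the directory timestamp on file changes):

```java
import java.io.File;

// Illustrative poller: remembers the latest modification time it has seen
// and reports true once whenever something in the index folder changed.
public class IndexChangeDetector {
    private final File indexDir;
    private long lastSeen;

    public IndexChangeDetector(File indexDir) {
        this.indexDir = indexDir;
        this.lastSeen = latestModification();
    }

    // Latest lastModified over the directory itself and the files in it.
    private long latestModification() {
        long latest = indexDir.lastModified();
        File[] files = indexDir.listFiles();
        if (files != null) {
            for (File f : files) {
                latest = Math.max(latest, f.lastModified());
            }
        }
        return latest;
    }

    // Returns true once per detected change; the caller should then reopen
    // its IndexReader/IndexSearcher.
    public boolean changedSinceLastCheck() {
        long current = latestModification();
        if (current > lastSeen) {
            lastSeen = current;
            return true;
        }
        return false;
    }
}
```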
Hello,
> Hi,
> I am storing custom values in the Tokens provided by a Tokenizer but
> when retrieving them from the index the values don't match.
What do you mean by retrieving? Do you mean retrieving terms, or do you mean
doing a search with words you know that should be in, but you do not fi
orrect? Will I need to
> store "term text"
> in order to be able to access the actual term instead of
> stemmed words?
>
> Thanks for all your help,
>
> --JP
>
> On 7/13/07, Ard Schrijvers <[EMAIL PROTECTED]> wrote:
> >
> > Hello,
Hello,
> Hi EVeryone,
>
> Thank you all for your replies.
>
> And reply to your questions Grant:
> We have more than 3 Million document in our index.
> We get more than 150,000 searches (queries) per day. We
> expect this no to go
> up.
Just curious, but suppose those 150,000 searches are don
that were placed into the token during
> indexing
> are not being returned, they have been shifted.
> Thanks.
> Shahan
>
> Ard Schrijvers wrote:
> > Hello,
> >
> >
> >> Hi,
> >> I am storing custom values in the Tokens provided by
Hello Askar,
Which analyzer are you using for indexing and searching? If you use an analyzer
that uses stemming, you might see that "change", "changing", "changed", "chan"
etc. all get reduced to the same word "chan".
In Luke you can test with plugins that show you what tokens are created from
y
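The same check can be done in code by running text through the analyzer and printing the tokens it emits (a sketch; SnowballAnalyzer from the Lucene contrib area is just one example of a stemming analyzer):

```java
import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;

// Print the tokens a stemming analyzer produces for a piece of text, to see
// which words get reduced to the same stem.
public class TokenDump {
    public static void main(String[] args) throws Exception {
        SnowballAnalyzer analyzer = new SnowballAnalyzer("English");
        TokenStream stream = analyzer.tokenStream(
                "field", new StringReader("change changing changed"));
        Token token;
        while ((token = stream.next()) != null) {
            System.out.println(token.termText());
        }
    }
}
```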
Hello,
Did you take a look at Nutch, Hadoop, or Solr? They partially seem to address
the things you describe... About the LSI I am not sure what has been done in
those projects
Regards Ard
>
> Hi, Please help me.
> Its been a month since i am trying lucene.
> My requirements are huge, i have to i
ver. If
> you're calling
> > > > this fragment for each document, you'll always have
> only one doc. Try
> > > > changing the 'true' to 'false'. Or better yet, open the
> writer outside
> > > the
> > > > document add
Hello,
>
> Company AB", ...). With this I'd like to search for documents that has
> daniel and president on the same field, because in a same
> text, can exist
> daniel and president in different fields. Is this possible??
Not totally sure whether I understand your problem, because it does not s
Hello,
> Hi everyone,
>
> I told you I'd be back with more questions! :-)
> Here is my situation. In my application, the field to be searched is
> selected via a drop-down box. I want my searches to basically
> be "contains"
> searches - I take what the user typed in, put a wildcard
> characte
Or check out Solr and see if you can use that, or see how they do it,
Regards Ard
>
> You might want to search the mail archive for "facets" or
> "faceted search"
> (no quotes), as I *think* this might be relevant.
>
> Best
> Erick
>
> On 7/26/07, Ramana Jelda <[EMAIL PROTECTED]> wrote:
> >
>
> > It does sound very strange to me, to default to a
> WildCardQuery! Suppose I
> > am looking for "bold", I am getting hits for "old".
>
> I know - but that's what the requirements dictate. A better
> example might be
> a MAC or IP address, where someone might be searching for a
> string in
Hello,
> I have two questions.
>
> First, Is there a tokenizer that takes every word and simply
> makes a token
> out of it?
org.apache.lucene.analysis.WhitespaceTokenizer
> So it looks for two white spaces and takes the characters
> between them and makes a token out of them?
>
> If this to
>
> So then would I just concatenate the tokens together to form
> the query text?
You might better create a TermQuery for each token instead of concatenating,
and combine them in a BooleanQuery saying whether all terms must or should
occur. Very simple, see [1]
Regards Ard
[1]
http://luce
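A sketch of that per-token approach (the field name `contents` and the tokens array are illustrative):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// Combine one TermQuery per token into a single BooleanQuery. MUST means
// every term is required; SHOULD would make them optional.
public class PerTokenQueryExample {
    public static BooleanQuery build(String[] tokens) {
        BooleanQuery query = new BooleanQuery();
        for (String token : tokens) {
            query.add(new TermQuery(new Term("contents", token)),
                      BooleanClause.Occur.MUST);
        }
        return query;
    }
}
```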
Hello,
is this just one single example of different words that should return the same
results? You might consider implementing a synonym analyzer otherwise.
In your case, storing NAME as UN_TOKENIZED should enable your NAME:"De Agos"*
search
Regards Ard
>
> Hi,
> I would like to make a searc
Hello Shailendra,
AFAICS you are reasoning from a static doc-id POV, while documents do not have
a static doc-id in lucene. When you have a frequently updated index, you'll end
up invalidating cached BitSets (which, as the number of categories and number
of documents grow, can absorb quite amoun
Hello,
I need to do a search that is capable to also match on substrings, for example:
*oo bar the qu*
should find a document that contains 'foo bar the quux' and 'foo bar the qux'.
Now, should I index the text as UN_TOKENIZED also, and do a WildCardQuery on
this field? Obviously, then every b
Hello,
without having to dive into the code, I was hoping somebody could tell me what
this contrib block does? I can't seem to find any documentation or relevant
hits when searching for it,
Thanks in advance,
Regards Ard
Thanks Daniel,
I understand how it can be done. The only thing that bothers me is that
expanding the "*" might result in many phrases, and that in turn might imply a
performance hit. I'll see what the impact is,
Regards Ard
>
> On Wednesday 08 August 2007 10:28,
t; The minimal documentation is in the Java API documentation
> on the lucene java site under contrib: Surround Parser, and in
> the surround.txt file here:
> http://svn.apache.org/viewvc/lucene/java/trunk/contrib/surroun
> d/surround.txt?view=log
>
> Groeten,
> Paul Elsch