> i think you want to adjust the end offset of the first output token,
> and the start offset of the second.
Makes sense. Thanks so much.
After thinking about this a bit more, it seems I should think of the
contents of a Token's termBuffer simply as an index (or key) into the
region of text defined by the start and end offsets.
Babak if your filter splits a token into two output tokens,
i think you want to adjust the end offset of the first output token,
and the start offset of the second.
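In code, that advice boils down to something like the sketch below, against
the Lucene 2.9 attribute API (DashSplitFilter and the split-on-'-' rule are
invented for illustration; a real filter would also manage
PositionIncrementAttribute, clear its buffered state in reset(), and this
assumes no upstream CharFilter has shifted the offsets):

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;

    public final class DashSplitFilter extends TokenFilter {
      private final TermAttribute termAtt =
          (TermAttribute) addAttribute(TermAttribute.class);
      private final OffsetAttribute offsetAtt =
          (OffsetAttribute) addAttribute(OffsetAttribute.class);
      private String pendingTerm;          // second half, not yet emitted
      private int pendingStart, pendingEnd;

      public DashSplitFilter(TokenStream input) {
        super(input);
      }

      public boolean incrementToken() throws IOException {
        if (pendingTerm != null) {         // emit the buffered second half
          termAtt.setTermBuffer(pendingTerm);
          offsetAtt.setOffset(pendingStart, pendingEnd);
          pendingTerm = null;
          return true;
        }
        if (!input.incrementToken()) {
          return false;
        }
        String term = termAtt.term();
        int dash = term.indexOf('-');
        if (dash > 0 && dash < term.length() - 1) {
          int start = offsetAtt.startOffset();
          // second half starts one past the dash in the original text
          pendingTerm = term.substring(dash + 1);
          pendingStart = start + dash + 1;
          pendingEnd = offsetAtt.endOffset();
          // first half keeps its start offset; only the end offset shrinks
          termAtt.setTermBuffer(term.substring(0, dash));
          offsetAtt.setOffset(start, start + dash);
        }
        return true;
      }
    }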
Babak, for a fairly simple example of this, you can look at the
ThaiWordFilter in the lucene contrib-analyzers package.
it has to break a single input token into several output tokens and set
their offsets accordingly.
Thanks for your explanations. I think I have a basic understanding now.
What I'm not so sure about, now, is how to decide on the start and
end offsets when the TokenFilter implementation wants to break an
input token into subtokens. Should the offsets of the emitted
subtokens be the same as the offsets of the original input token?
Hello,
Also keep in mind prefix queries are not the cheapest.
Plug:
We've seen people use this successfully:
http://www.sematext.com/products/autocomplete/index.html
I believe somebody is trying this out with a set of 1B suggestions. The demo
at http://www.sematext.com/demo/ac/index.html search
Hello,
Comments inlined.
----- Original Message -----
> From: vsevel
> To: java-user@lucene.apache.org
> Sent: Fri, November 13, 2009 11:32:02 AM
> Subject: Re: OutofMemory in large index
>
>
> Hi, I am jumping into the thread because I have got a similar issue.
> My index is 30Gb large and contains 21M docs.
On Fri, Nov 13, 2009 at 4:21 PM, Max Lynch wrote:
> > Well already, without doing any boosting, documents matching more of
> > the terms in your query will score higher. If you really want to make
> > this effect more pronounced, yes, you can boost the more important
> > query terms higher.
>
> Well already, without doing any boosting, documents matching more of
> the terms in your query will score higher. If you really want to make
> this effect more pronounced, yes, you can boost the more important
> query terms higher.
>
> -jake
>
But there isn't a way to determine exactly what bo
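For reference, boosting query terms is just the caret syntax; a sketch
(the field name "contents" and the weights are made-up examples, against
Lucene 2.9):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.util.Version;

    QueryParser parser = new QueryParser(Version.LUCENE_29, "contents",
        new StandardAnalyzer(Version.LUCENE_29));
    // ^4 and ^2 scale those clauses' score contributions; the default is 1.0
    Query q = parser.parse(
        "\"John Smith Manufacturing\"^4 \"John Smith\"^2 \"California\"");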
On Fri, Nov 13, 2009 at 4:02 PM, Max Lynch wrote:
> > > Now, I would like to know exactly what term was found. For example,
> > > if a result comes back from the query above, how do I know whether
> > > John Smith was found, or both John Smith and his company, or just
> > > John Smith Manufacturing was found?
> > Now, I would like to know exactly what term was found. For example, if a
> > result comes back from the query above, how do I know whether John Smith
> > was found, or both John Smith and his company, or just John Smith
> > Manufacturing was found?
>
> In general, this is actually very
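One way to inspect this from the outside (a sketch, not necessarily where
that reply was headed; "searcher" and "query" are the ones you already
have) is Searcher.explain(), whose output tree shows which optional clauses
actually contributed to a hit's score:

    import org.apache.lucene.search.Explanation;
    import org.apache.lucene.search.TopDocs;

    TopDocs hits = searcher.search(query, 10);
    for (int i = 0; i < hits.scoreDocs.length; i++) {
      // per-clause breakdown; SHOULD clauses that didn't match are absent.
      // explain() is expensive, so use it for debugging, not per request.
      Explanation exp = searcher.explain(query, hits.scoreDocs[i].doc);
      System.out.println(exp.toString());
    }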
On Fri, Nov 13, 2009 at 3:35 PM, Max Lynch wrote:
> > query: "San Francisco" "California" +("John Smith" "John Smith
> > Manufacturing")
> >
> > Here the San Fran and CA clauses are optional, and the ("John Smith" OR
> > "John Smith Manufacturing") is required.
>
> Thanks Jake, that works nicely.
> query: "San Francisco" "California" +("John Smith" "John Smith
> Manufacturing")
>
> Here the San Fran and CA clauses are optional, and the ("John Smith" OR
> "John Smith Manufacturing") is required.
>
Thanks Jake, that works nicely.
Now, I would like to know exactly what term was found. For example, if a
result comes back from the query above, how do I know whether John Smith
was found, or both John Smith and his company, or just John Smith
Manufacturing was found?
Another example is if you used a stemmer, it might change the termLength:
(walking -> walk), but the offsets of the original unstemmed word (walking)
stay the same.
On Fri, Nov 13, 2009 at 6:01 PM, Uwe Schindler wrote:
> This is not coupled because:
>
> termLength() is the number of chars in the term buffer, where the offsets
> give the offsets in the original char stream.
This is not coupled because:
termLength() is the number of chars in the term buffer, where the offsets
give the offsets in the original char stream. If you use a CharFilter to
e.g. remove chars, the termLength will get shorter, but the offsets are
still the original ones. Also both things are indexed
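A quick way to see this end to end (a sketch using WhitespaceTokenizer plus
the Porter stemmer, Lucene 2.9):

    import java.io.StringReader;
    import org.apache.lucene.analysis.PorterStemFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;

    TokenStream ts = new PorterStemFilter(
        new WhitespaceTokenizer(new StringReader("he was walking")));
    TermAttribute term = (TermAttribute) ts.addAttribute(TermAttribute.class);
    OffsetAttribute off =
        (OffsetAttribute) ts.addAttribute(OffsetAttribute.class);
    while (ts.incrementToken()) {
      // the last token prints: walk [7,14] -- termLength() is 4, but the
      // offsets still span the 7 chars of "walking" in the original text
      System.out.println(term.term()
          + " [" + off.startOffset() + "," + off.endOffset() + "]");
    }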
I'm writing a TokenFilter and am confused about why class Token has
both an *endOffset* and a *termLength* field. It would appear that
the following invariant should always hold for a Token instance:
termLength() == endOffset() - startOffset()
If so, then
1) Why 2 fields, instead of 1?
2) W
Did I do that wrong? I always mess up the AND/OR human-readable form
of this - it's clearer when you use +/- unary operators instead:
query: "San Francisco" "California" +("John Smith" "John Smith
Manufacturing")
Here the San Fran and CA clauses are optional, and the ("John Smith" OR
"John Smith
> You want a query like
>
> ("San Francisco" OR "California") AND ("John Smith" OR "John Smith
> Manufacturing")
>
Won't this require San Francisco or California to be present? I do not
require them to be, I only require "John Smith" OR "John Smith
Manufacturing", but I want to get a bigger score
Hi Max,
You want a query like
("San Francisco" OR "California") AND ("John Smith" OR "John Smith
Manufacturing")
essentially? You can give Lucene exactly this query and it will require that
either "John Smith" or "John Smith Manufacturing" be present, but will score
results which have these
Hi,
I am trying to move from a system where I counted the frequency of terms by
hand in a highlighter to determine if a result was useful to me. In an
earlier post on this list someone suggested I could boost the terms that are
useful to me and only accept hits above a certain threshold. However,
Ooooh, that'll teach me.
On Fri, Nov 13, 2009 at 1:30 PM, Uwe Schindler wrote:
> List IndexReader.getFieldNames() ?
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
> > -----Original Message-----
> > From: vsevel [mailto:v.se...@lombardodier.com]
List IndexReader.getFieldNames() ?
-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
> -----Original Message-----
> From: vsevel [mailto:v.se...@lombardodier.com]
> Sent: Friday, November 13, 2009 5:44 PM
> To: java-user@lucene.apache.org
> Su
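i.e. something along these lines (Lucene 2.9; "dir" is the Directory you
already have open):

    import java.util.Collection;
    import java.util.Iterator;
    import org.apache.lucene.index.IndexReader;

    IndexReader reader = IndexReader.open(dir, true);   // read-only
    // every field name this index has ever seen, regular and custom alike
    Collection fieldNames = reader.getFieldNames(IndexReader.FieldOption.ALL);
    for (Iterator it = fieldNames.iterator(); it.hasNext();) {
      System.out.println(it.next());
    }
    reader.close();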
Does TermEnum work in your situation?
Best
Erick
On Fri, Nov 13, 2009 at 11:44 AM, vsevel wrote:
>
> Hi,
>
> I am indexing log4j/logback/JUL logging events. My documents include
> regular fields (eg: logger, message, date, ...) and custom fields that
> applications choose to use (eg: MDC).
> I
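A sketch of the TermEnum route, e.g. to walk all values of one custom
field (the field name "mdc" is made up; "reader" is an open IndexReader):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;

    TermEnum terms = reader.terms(new Term("mdc", ""));  // seek to the field
    try {
      do {
        Term t = terms.term();
        if (t == null || !"mdc".equals(t.field())) {
          break;               // walked past the last term of this field
        }
        System.out.println(t.text() + " (docFreq=" + terms.docFreq() + ")");
      } while (terms.next());
    } finally {
      terms.close();
    }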
Hi again.
I've made a proof of concept using the boost factor. I have done the
following: added a field for each feature and set the field's boost factor
to the feature value.
private static void addDocument(String id, Map
features, IndexWriter writer) throws IOException {
Doc
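The preview cuts off there (and the archive ate the Map's type parameters),
so here is roughly how I read that proof of concept; Map<String, Float> and
the "id" field handling are my assumptions:

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.Map;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    private static void addDocument(String id, Map features,
        IndexWriter writer) throws IOException {
      Document doc = new Document();
      doc.add(new Field("id", id, Field.Store.YES,
          Field.Index.NOT_ANALYZED));
      for (Iterator it = features.entrySet().iterator(); it.hasNext();) {
        Map.Entry e = (Map.Entry) it.next();
        // one field per feature; the field boost carries the feature value
        Field f = new Field((String) e.getKey(), "1",
            Field.Store.NO, Field.Index.NOT_ANALYZED);
        f.setBoost(((Float) e.getValue()).floatValue());
        doc.add(f);
      }
      writer.addDocument(doc);
    }

One caveat to keep in mind: the boost gets multiplied into the field's
one-byte norm, so it is stored very lossily and nearby values can round to
the same norm.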
Hi,
I am indexing log4j/logback/JUL logging events. My documents include
regular fields (eg: logger, message, date, ...) and custom fields that
applications choose to use (eg: MDC).
I would like to do full text searches on those fields just as I do on
regular fields, I just need to know about th
Hi, I am jumping into the thread because I have got a similar issue.
My index is 30Gb large and contains 21M docs.
I was able to stay with 1Gb of RAM on the server for a while. Recently I
started to simulate parallel searches. Just 2 parallel searches would get
the server to crash with out of memory
Hi Simon,
Thank you very much for your reply.
Maybe an example will help clarify my use case-
Say I have the following two indexed columns with this data:

    data           boostfield
    african ant    10
    alligator      50
    anthem         20
    antelope       30
    another        5
Hi.
I am developing an application and I would like to add searching
capabilities. I have a database with items. Each item has a number of
"features" with a numeric value. Example: feature_x=100,
feature_y=200. Items can have common or different "features". And they
can have a variable number of "features".
Anjana, maybe I don't understand your question correctly but what you
want to do is a spell suggestion kind of thing on terms in the index,
right? You try to use prefix query to display those terms as an
auto-completion?! So I assume that what you do is run a query and
then get the possible terms f
We are using Lucene for one of our projects here and it has been working
very well for the last 2 years.
The new requirement is to use it for autocomplete. Here, queries like a* or
ab* pose a problem.
I have set BooleanQuery.setMaxClauseCount(Integer.MAX_VALUE) to get around
the TooManyClauses exception.
T
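Since 2.9 there is also a way around this that does not involve raising the
clause count at all (a sketch; the field and prefix are made up): have the
prefix query rewrite to a constant-score filter, which never expands the
matching terms into BooleanQuery clauses, so TooManyClauses cannot happen.
The trade-off is that every match then gets the same score:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.MultiTermQuery;
    import org.apache.lucene.search.PrefixQuery;

    PrefixQuery pq = new PrefixQuery(new Term("name", "ab"));
    // enumerate matching terms into a filter instead of boolean clauses
    pq.setRewriteMethod(MultiTermQuery.CONSTANT_SCORE_FILTER_REWRITE);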
I don't know how, because the problem is with JBoss in a production
environment; on localhost this doesn't happen.
The JBoss server is in a production environment and contains a lot of
projects; I don't know if Lucene is conflicting with other libraries.
I don't have control of this computer, I
Alas, I can't repro this problem ("leaking file descriptors with NRT"), either.
I've got a decent stress test setup -- start with a 5M Wikipedia
index, update (delete & add) @ 1000 docs/sec (using 2 threads), reopen
10X per second, searching at redline (using 9 threads), and the open
file descript
On Fri, Nov 13, 2009 at 12:01 PM, Wenbo Zhao wrote:
> Thank you all... I think I need to read more docs
>
> A little question : how to add more memory over 1G ?
> When I specify more than -Xmx1450M, jvm gives error:
>>java -Xmx1450m asdf
> Exception in thread "main" java.lang.NoClassDefFoundError: asdf
Thank you all... I think I need to read more docs
A little question : how to add more memory over 1G ?
When I specify more than -Xmx1450M, jvm gives error:
>java -Xmx1450m asdf
Exception in thread "main" java.lang.NoClassDefFoundError: asdf
>java -Xmx1451m asdf
Error occurred during initialization of VM
Phew :)
Thanks for bringing closure!
Mike
On Fri, Nov 13, 2009 at 5:22 AM, Benjamin Heilbrunn wrote:
> Hello,
>
> sorry for causing inconvenience.
> It was my mistake and I wasn't able to reproduce it completely this morning.
>
> My testcase was a little too complex and there were two or three bugs /
> false assumptions which made it look to me like I explained above.
Interrupting optimize shouldn't cause any problems. It should have no
effect on the index, except possibly the partially created files might
be orphan'd (left on disk but not referenced by the index), in which
case they'll be cleaned up the next time you open a writer on the
index.
Still, running
On Fri, Nov 13, 2009 at 11:17 AM, Ian Lea wrote:
>> I got OutOfMemoryError at
>> org.apache.lucene.search.Searcher.search(Searcher.java:183)
>> My index is 43G bytes. Is that too big for Lucene ?
>> Luke can see the index has over 1800M docs, but the search is also out
>> of memory.
> I use -Xmx1024M to specify 1G java heap space.
Any luck narrowing this to a standalone test case that shows the problem?
That new exception appears to be inside the Java code created by the
app server compiling your JSP -- it's not very helpful since it
doesn't "enter" Lucene. Can you try to narrow this to a standalone
test case, too?
Thanks
Hello,
sorry for causing inconvenience.
It was my mistake and I wasn't able to reproduce it completely this morning.
My testcase was a little too complex and there were two or three bugs /
false assumptions which made it look to me like I explained above.
Benjamin
--
> I got OutOfMemoryError at
> org.apache.lucene.search.Searcher.search(Searcher.java:183)
> My index is 43G bytes. Is that too big for Lucene ?
> Luke can see the index has over 1800M docs, but the search is also out
> of memory.
> I use -Xmx1024M to specify 1G java heap space.
43Gb is not too big
Hi,
About this problem I did a test yesterday: I downgraded from version 2.9.1
to 2.4.1 and the problem was solved; all the files are closed correctly and
JBoss is no longer unstable.
Another problem that we have observed is:
Sometimes, at random, when you try to make a search the