Note that I believe, with some work (marking the "zones" during
analysis), one can accomplish this with Spans without the
field-creation problem that John mentions.
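A rough sketch of the idea, assuming the analyzer injects synthetic
boundary tokens around each zone (the field and token names here are
made up):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.spans.*;

    // One "body" field holds every zone; analysis marks zone boundaries
    // with artificial tokens instead of creating one field per zone.
    SpanQuery start = new SpanTermQuery(new Term("body", "_TITLE_START_"));
    SpanQuery word  = new SpanTermQuery(new Term("body", "lucene"));
    SpanQuery end   = new SpanTermQuery(new Term("body", "_TITLE_END_"));

    // Match the term only between the boundary tokens; the slop bound is
    // hypothetical and depends on the longest zone you index.
    SpanQuery inZone =
        new SpanNearQuery(new SpanQuery[] { start, word, end }, 1000, true);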
-Grant
On Apr 3, 2009, at 7:24 PM, John Wang wrote:
Not quite. For example, the # of fields is static throughout the
corpus. # zone
Mike,
Thanks for the response -- we've already jumped on a couple of your suggestions.
Here is some feedback and follow-ups:
We have watched GC times closely in the past. Most of our attempts at
tuning various settings made GC worse instead of better.
We didn't know about reopen() until
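(For reference, the reopen() pattern, added in Lucene 2.4, looks
roughly like this; a minimal sketch:)

    // reopen() shares unchanged segments with the old reader, so it is
    // far cheaper than a fresh IndexReader.open() on a large index.
    IndexReader newReader = reader.reopen();
    if (newReader != reader) {
        reader.close();      // old reader is no longer needed
        reader = newReader;  // searches now see the latest commit
    }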
OK, I opened https://issues.apache.org/jira/browse/LUCENE-1586 to track
this. Thanks, deminix!
Mike
Ah yes. I'd be happy with the ability to monitor it for now, assuming
it is too involved to remove the limitation.
For all practical purposes we should only be using, worst case, 10% of
the term space today (roughly 215 million of the ~2.1 billion term
limit). That happens to make it risky enough that it needs an eye kept
on it, as this will be o
On Sat, Apr 4, 2009 at 11:57 AM, deminix wrote:
> Yea. That is all that matters anyway, right: the limit at the segment
> level?
Well... the problem is when merges kick off.
You could have N segments that are each below the limit, but when a
merge runs the merged segment would try to exceed the limit.
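To make that concrete (the numbers are invented, and the real merged
count would be smaller wherever the segments share terms):

    // Three segments, each comfortably below the 2^31 term limit...
    long[] perSegmentTerms = { 900000000L, 800000000L, 700000000L };
    long merged = 0;
    for (long t : perSegmentTerms) {
        merged += t;
    }
    // ...but merging them could need up to 2,400,000,000 term numbers,
    // which no longer fits in a Java int.
    System.out.println(merged > Integer.MAX_VALUE); // prints true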
On Sat, Apr 4, 2009 at 11:52 AM, deminix wrote:
> My crude regex'ing of the code has me thinking it is only term vectors that
> are limited to 32 bits, since they allocate arrays. Otherwise it seems
> good. Does that sound right?
Not quite... SegmentTermEnum.seek takes "int p". TermInfosReader
Yea. That is all that matters anyway, right: the limit at the segment
level?
On Sat, Apr 4, 2009 at 8:44 AM, Michael McCandless <
luc...@mikemccandless.com> wrote:
> On Sat, Apr 4, 2009 at 10:25 AM, deminix wrote:
>
> > AFAIK there isn't an API that returns the current number of terms,
> cor
My crude regex'ing of the code has me thinking it is only term vectors that
are limited to 32 bits, since they allocate arrays. Otherwise it seems
good. Does that sound right?
On Sat, Apr 4, 2009 at 7:25 AM, deminix wrote:
> Thanks for the clarification.
>
> I'm partitioning the document spac
On Sat, Apr 4, 2009 at 10:25 AM, deminix wrote:
> AFAIK there isn't an API that returns the current number of terms, correct?
Alas, no. This limitation has been talked about before... maybe we
should add it.
But it's not actually simple to compute at the MultiSegmentReader
level. Each Segme
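(As a stopgap, one can count terms by exhausting a TermEnum; a minimal
sketch against the 2.4 API. It is O(number of terms), which is exactly
why a cheap built-in counter would be nice:)

    // Walks every unique term in the (possibly multi-segment) reader.
    long numTerms = 0;
    TermEnum terms = reader.terms();
    try {
        while (terms.next()) {
            numTerms++;
        }
    } finally {
        terms.close();
    }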
Thanks for the clarification.
I'm partitioning the document space, so I'm not really concerned about
the fact that document IDs are ints. Some fields have very
high-cardinality value spaces, though (and many values per document),
and they don't align with the way the documents are partitioned, so
they may have a very
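(For what it's worth, a minimal sketch of searching such partitions as
one logical index with the 2.4 API; the paths are hypothetical. Note
that MultiSearcher still renumbers documents into a single int space,
so going truly beyond max int means keeping the partitions separate at
the application level:)

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.*;

    Query query = new TermQuery(new Term("body", "lucene"));
    Searchable[] shards = new Searchable[] {
        new IndexSearcher("/indexes/part0"),
        new IndexSearcher("/indexes/part1"),
    };
    MultiSearcher searcher = new MultiSearcher(shards);
    TopDocs top = searcher.search(query, null, 10); // merged top hits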
Correct, and no, not that I know of.
Mike
On Sat, Apr 4, 2009 at 7:55 AM, Murat Yakici wrote:
>
> I assume the total number of documents that you can index is also limited
> by Java max int. Is this correct? Is there any way to index documents
> beyond this number in a single index?
>
> Murat
I assume the total number of documents that you can index is also limited
by Java max int. Is this correct? Is there any way to index documents
beyond this number in a single index?
Murat
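(The int cap is visible directly in the API; a tiny illustration, with
a hypothetical index path:)

    IndexReader reader = IndexReader.open("/index/path");
    int docCount = reader.maxDoc(); // an int, so at most 2,147,483,647 docs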
> I tentatively think you are correct: the file format itself does not
> impose this limitation.
>
> But in
On Fri, Apr 3, 2009 at 10:21 PM, Dan OConnor wrote:
> All,
>
> I have several questions regarding query response time and I would
> appreciate any help that can be provided.
>
> We have a system that indexes approximately 200,000 documents per day at a
> fairly constant rate and holds them in
I tentatively think you are correct: the file format itself does not
impose this limitation.
But in at least a couple of places internally, Lucene uses a Java int
to hold the term number, which is actually a limit of 2,147,483,648
terms. I'll update fileformats.html for 2.9.
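(Spelling out the arithmetic: assuming term numbers start at 0, a
signed 32-bit int offers 2^31 = 2,147,483,648 distinct non-negative
values, 0 through 2,147,483,647, hence that ceiling.)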
Mike
On Sat, Apr 4, 200