I think I just answered my own question. No, there's no difference,
including running into an OutOfMemoryError when adding the doc. It seems
I can think of this exactly as concatenating all the chunks together and
indexing them all at once, limitations and all.

So even though it all seems to work for small examples, I don't think it
helps with the memory problem.

Never mind......

Erick

On 8/15/06, Erick Erickson <[EMAIL PROTECTED]> wrote:

That's waaaaay cool. I'm not entirely clear on whether this is intended
behavior, but my test sure worked out well. So I thought I'd check whether
the behavior I'm seeing is expected. If so, way cool coding, guys!

Here's the root question: "Am I reasonably safe, for a single document, in
thinking of indexing multiple chunks into the same field as identical,
for all practical purposes, to indexing the field once with all the chunks
concatenated together?"

Details follow...

Background: I'm working with books now, so the default 10,000-term limit
isn't acceptable. It is, of course, easy to bump it up via
IndexWriter.setMaxFieldLength, but I'd rather just break up the input
stream, since I can't realistically put an upper limit on the size of a
doc. So I experimented a bit with indexing the same field multiple times
in a document. It seems to me that everything "just works", at least in my
simple test case. This is one heck of a coincidence if it isn't intended
behavior <G>, so all I'm asking is whether anybody is surprised by this
(especially Otis, Chris, Erik, etc.).
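
For completeness, the bump-the-limit route I'm rejecting would be
something like this (writer being the IndexWriter; purely illustrative):

    // Raise the per-field token cap instead of chunking. I'm avoiding
    // this because book sizes are effectively unbounded.
    writer.setMaxFieldLength(Integer.MAX_VALUE);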

Please ignore the Field.Store value; I realize it will probably have to be
Field.Store.NO for this app......

Lucene is fine with indexing the same field multiple times, like this....

    Document doc = new Document();
    doc.add(new Field("tokens", "one two three", Field.Store.YES,
                      Field.Index.TOKENIZED));
    doc.add(new Field("tokens", "four five six", Field.Store.YES,
                      Field.Index.TOKENIZED));
    writer.addDocument(doc);

What surprised me a bit is that SpanQueries work just fine this way. If I
create a span query for "two" and "five", this doc is found for some slop
factors and not found for other slop factors, just as though I indexed the
"tokens" field once with "one two three four five six".

Boolean MUST clauses work as well.
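
The MUST test was roughly:

    BooleanQuery bq = new BooleanQuery();
    // Both terms must match; "two" and "five" come from different chunks,
    // but the doc is still found since they land in the same field.
    bq.add(new TermQuery(new Term("tokens", "two")),
           BooleanClause.Occur.MUST);
    bq.add(new TermQuery(new Term("tokens", "five")),
           BooleanClause.Occur.MUST);
    Hits hits = searcher.search(bq);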

So, are there any "gotchas" that spring to mind with the notion of
chunking the input to < 10,000 words and adding the chunks as multiple
instances of the same field (sketched below)? Let me be clear: I'm just
beginning to design this, so all I'm thinking about here is whether
indexing the same field multiple times is fraught with danger <G>. I have
a bunch of details to work out about what the requirements are and whether
I really *want* to do this, but that's my problem.
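
The shape I have in mind is something like this (purely illustrative; the
field name and how the chunks get split are placeholders):

    // Add one book as a single Document, one Field instance per chunk.
    // Each chunk is assumed to stay under the maxFieldLength token limit.
    void addBook(IndexWriter writer, String[] chunks) throws IOException {
        Document doc = new Document();
        for (int i = 0; i < chunks.length; i++) {
            doc.add(new Field("text", chunks[i], Field.Store.NO,
                              Field.Index.TOKENIZED));
        }
        writer.addDocument(doc);
    }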

Thanks
Erick
