Re: a faster way to addDocument and get the ID just added?

Erick Erickson Tue, 29 Mar 2011 05:21:50 -0700

I'm always skeptical of storing the doc IDs since they can
change out from underneath you (just delete even a single
document and optimize). What is it you're doing with
the doc ID that you couldn't do with the guid? If your "guid list"
were ordered, I can imagine building filters quite quickly from
it using TermDocs.skipTo, for instance.

Or is this entirely unreasonable???
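
To make the idea concrete, here is a rough sketch of the forward-only
walk I have in mind. It does not use the Lucene API at all: a TreeMap
stands in for the index's sorted "guid" terms, and the names
GuidFilterSketch and buildFilter are made up for illustration. The
assumption is that the guid list arrives in the same sort order as the
terms, so each lookup only ever skips forward, and the doc IDs are
resolved at filter-build time rather than stored anywhere.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

public class GuidFilterSketch {

    /**
     * Builds a bit set of doc IDs for the given guids.
     *
     * guidToDoc is a hypothetical stand-in for the sorted "guid" term
     * dictionary (guid -> current doc ID). sortedGuids must be in
     * ascending order; an out-of-order guid makes tailMap throw
     * IllegalArgumentException.
     */
    static BitSet buildFilter(NavigableMap<String, Integer> guidToDoc,
                              List<String> sortedGuids, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        NavigableMap<String, Integer> remaining = guidToDoc;
        for (String guid : sortedGuids) {
            // The "skip to" step: discard every term before this guid.
            remaining = remaining.tailMap(guid, true);
            if (remaining.isEmpty()) {
                break; // no terms left, so no later guid can match
            }
            Map.Entry<String, Integer> first = remaining.firstEntry();
            if (first.getKey().equals(guid)) {
                bits.set(first.getValue());
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        NavigableMap<String, Integer> index = new TreeMap<String, Integer>();
        index.put("guid-a", 0);
        index.put("guid-b", 1);
        index.put("guid-c", 2);
        index.put("guid-d", 3);

        List<String> wanted = Arrays.asList("guid-b", "guid-d", "guid-z");
        System.out.println(buildFilter(index, wanted, 4));
    }
}
```

Since the doc IDs are looked up when the filter is built, nothing goes
stale after a delete plus optimize. Whether this is fast enough for
your case is something only a benchmark on your data can answer.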

Best
Erick

On Mon, Mar 28, 2011 at 8:31 PM, Trejkaz <trej...@trypticon.org> wrote:
> Hi all.
>
> I'm trying to parallelise writing documents into an index.  Let's set
> aside the fact that 3.1 is much better at this than 3.0.x... but I'm
> using 3.0.3.
>
> One of the things I need to know is the doc ID of each document added
> so that we can add them into auxiliary database tables which are keyed
> by it.  If multiple threads are using the same writer, I can still do
> this as follows:
>
>    IndexWriter writer;
>    boolean parallel;
>
>    // ...
>
>    private int addDocument(String guid, ...) {
>        Document doc = new Document();
>        doc.add(new Field("guid", guid, Store.YES, Index.ANALYZED));
>        // eliding other fields
>        writer.addDocument(doc);
>
>        if (parallel) {
>            IndexReader realTimeReader = writer.getReader();
>            try {
>                TermDocs termDocs = realTimeReader.termDocs();
>                try {
>                    termDocs.seek(new Term("guid", guid));
>                    if (termDocs.next()) {
>                        return termDocs.doc();
>                    } else {
>                        throw new IllegalStateException(String.format(
>                            "We added item with GUID %s but it wasn't found immediately afterwards", guid));
>                    }
>                } finally {
>                    termDocs.close();
>                }
>            } finally {
>                realTimeReader.close();
>            }
>        } else {
>            // doc IDs are 0-based, so the document just added is maxDoc() - 1
>            return writer.maxDoc() - 1;
>        }
>    }
>
> Benchmarking this for a single thread, there is a difference in cost
> between doing it using a search and doing it by calling maxDoc(), as
> you might expect:
>
>    Time for parallel-safe version: 147.561s
>    Time for unsafe version: 62.603s
>
> Is there a way to achieve this result with less overhead?
>
> (Note: for reasons of performance, we cannot use a field to store an
> ID to use for database tables, as this is several orders of magnitude
> slower when you need to build a filter based on a database query.)
>
> TX
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

