deleteDocuments(Term... terms) takes a long time to do nothing.

2013-12-13 Thread Jason Corekin
Let me start by stating that I am almost certain that I am doing something
wrong, and I hope that I am, because if not there is a VERY large bug
in Lucene.  What I am trying to do is use the method


deleteDocuments(Term... terms)


out of the IndexWriter class to delete several arrays of Term objects, each
fed to it via a separate Thread.  Each array has around 460k+ Term objects
in it.  The issue is that after running for around 30 minutes or more the
method finishes, I then have a commit run, and nothing changes in my files.
To be fair, I am running a custom Directory implementation that might be
causing problems, but I do not think that is the case, as I do not even
see any of my Directory methods in the stack trace.  In fact, when I set
breakpoints inside the delete methods of my Directory implementation they
never get hit.  To be clear, replacing the custom Directory implementation
with a standard one is not an option due to the nature of the data, which
is made up of terabytes of small (1 KB and less) files.  So, if the issue
is in the Directory implementation, I have to figure out how to fix it.


Below are the pieces of code that I think are relevant to this issue, as
well as a copy of the stack trace of the thread that was doing work when I
paused the debug session.  As you will likely notice, the thread is called a
DBCloner because it is being used to clone the underlying index-based
database (needed to avoid storing trillions of files directly on disk).  The
idea is to duplicate the selected group of terms into a new database and
then delete the original terms from the original database.  The duplication
works wonderfully, but no matter what I do, including cutting the program
down to one thread, I cannot shrink the database, and the deletes take
drastically too long.


In an attempt to be as helpful as possible, I will say this.  I have been
tracing this problem for a few days and have seen that

BlockTreeTermsReader$FieldReader$SegmentTermsEnum.seekExact(BytesRef)

is where the majority of the execution time is spent.  I have also noticed
that this method returns false MUCH more often than it returns true.  I have
been trying to figure out how the mechanics of this process work, just in
case the issue was not in my code and I might have been able to find the
problem.  But I have yet to find the problem in either Lucene 4.5.1 or
Lucene 4.6.  If anyone has any ideas as to what I might be doing wrong, I
would really appreciate reading what you have to say.  Thanks in advance.



Jason



private void cloneDB() throws QueryNodeException {

    Document doc;
    ArrayList<String> fileNames;
    int start = docRanges[(threadNumber * 2)];
    int stop  = docRanges[(threadNumber * 2) + 1];

    try {
        fileNames = new ArrayList<String>(docsPerThread);
        for (int i = start; i < stop; i++) {
            doc = searcher.doc(i);
            try {
                adder.addDoc(doc);
                fileNames.add(doc.get("FileName"));
            } catch (TransactionExceptionRE | TransactionException | LockConflictException te) {
                adder.txnAbort();
                System.err.println(Thread.currentThread().getName()
                        + ": Adding a message failed, retrying.");
            }
        }

        deleters[threadNumber].deleteTerms("FileName", fileNames);
        deleters[threadNumber].commit();

    } catch (IOException | ParseException ex) {
        Logger.getLogger(DocReader.class.getName()).log(Level.SEVERE, null, ex);
    }
}





public void deleteTerms(String dbField, ArrayList<String> fieldTexts) throws IOException {

    Term[] terms = new Term[fieldTexts.size()];
    for (int i = 0; i < fieldTexts.size(); i++) {
        terms[i] = new Term(dbField, fieldTexts.get(i));
    }
    // ... (the rest of this method, which hands the terms array to
    // IndexWriter.deleteDocuments(Term...), is cut off in the archived message)
}


Stack trace of the thread that was doing work when the debug session was paused:

FST.readFirstRealTargetArc(long, Arc, BytesReader) line: 979
FST.findTargetArc(int, Arc, Arc, BytesReader) line: 1220
BlockTreeTermsReader$FieldReader$SegmentTermsEnum.seekExact(BytesRef) line: 1679
BufferedUpdatesStream.applyTermDeletes(Iterable, ReadersAndUpdates, SegmentReader) line: 414
BufferedUpdatesStream.applyDeletesAndUpdates(ReaderPool, List) line: 283
IndexWriter.applyAllDeletesAndUpdates() line: 3112
IndexWriter.applyDeletesAndPurge(boolean) line: 4641
DocumentsWriter$ApplyDeletesEvent.process(IndexWriter, boolean, boolean) line: 673
IndexWriter.processEvents(Queue, boolean, boolean) line: 4665

Re: deleteDocuments(Term... terms) takes a long time to do nothing.

2013-12-14 Thread Jason Corekin
I knew that I had forgotten something.  Below is the line that I use to
create the field that I am trying to delete the entries by.  I hope this
avoids some confusion.  Thank you very much to anyone who takes the time to
read these messages.

doc.add(new StringField("FileName",filename, Field.Store.YES));
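
For reference, a minimal, self-contained sketch of this exact pattern (a
StringField plus a delete by Term on the same field) might look like the
following.  It is only an illustration under assumptions, not code from this
project: it uses Lucene 4.6, a RAMDirectory instead of the custom Directory
implementation, and a made-up file name.  The point it demonstrates is that
the deleting Term must carry exactly the same field name and un-analyzed
value as the indexed StringField.

import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class SingleDeleteTest {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_46, new KeywordAnalyzer());
        IndexWriter writer = new IndexWriter(dir, iwc);

        // Index one document with the same kind of field as above.
        Document doc = new Document();
        doc.add(new StringField("FileName", "file-000001", Field.Store.YES));
        writer.addDocument(doc);
        writer.commit();
        System.out.println("docs before delete: " + writer.numDocs());   // expect 1

        // The Term must use exactly the same field name and exact value.
        writer.deleteDocuments(new Term("FileName", "file-000001"));
        writer.commit();   // numDocs() only reflects the delete once it is applied
        System.out.println("docs after delete:  " + writer.numDocs());   // expect 0

        writer.close();
    }
}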



Re: deleteDocuments(Term... terms) takes a long time to do nothing.

2013-12-14 Thread Jason Corekin
Mike,

Thanks for the input; it will take me some time to digest and try everything
you wrote about.  I will post back the answers to your questions and the
results of your suggestions once I have gone over everything.  Thanks for
the quick reply,

Jason


On Sat, Dec 14, 2013 at 5:13 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> It sounds like there are at least two issues.
>
> First, that it takes so long to do the delete.
>
> Unfortunately, deleting by Term is at heart a costly operation.  It
> entails up to one disk seek per segment in your index; a custom
> Directory impl that makes seeking costly would slow things down, or if
> the OS doesn't have enough RAM to cache the "hot" pages (if your Dir
> impl is using the OS).  Is seeking somehow costly in your custom Dir
> impl?
>
> If you are deleting ~1M terms in ~30 minutes that works out to ~2 msec
> per Term, which may actually be expected.
>
> How many terms in your index?  Can you run CheckIndex and post the output?
>
> You could index your ID field using MemoryPostingsFormat, which should
> be a good speedup, but will consume more RAM.
>
> Is it possible to delete by query instead?  Ie, create a query that
> matches the 460K docs and pass that to
> IndexWriter.deleteDocuments(Query).
>
> Also, try passing fewer ids at once to Lucene, e.g. break the 460K
> into smaller chunks.  Lucene buffers up all deleted terms from one
> call, and then applies them, so my guess is you're using way too much
> intermediate memory by passing 460K in a single call.
>
> Instead of indexing everything into one index, and then deleting tons
> of docs to "clone" to a new index, why not just index to two separate
> indices to begin with?
>
> The second issue is that after all that work, nothing in fact changed.
>  For that, I think you should make a small test case that just tries
> to delete one document, and iterate/debug until that works.  Your
> StringField indexing line looks correct; make sure you're passing
> precisely the same field name and value?  Make sure you're not
> deleting already-deleted documents?  (Your for loop seems to ignore
> already deleted documents).
>
> Mike McCandless
>
> http://blog.mikemccandless.com
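
In rough code form, three of the suggestions above (a per-field
MemoryPostingsFormat for the ID field, deleting by Query, and chunking the
Term deletes) might look like the sketch below.  This is only a sketch under
assumptions, not code from the thread: it assumes Lucene 4.6, that the
lucene-codecs jar is available for MemoryPostingsFormat, an already-open
IndexWriter, and an arbitrary chunk size.

import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.lucene46.Lucene46Codec;
import org.apache.lucene.codecs.memory.MemoryPostingsFormat;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class DeleteSketches {

    // Use MemoryPostingsFormat for the "FileName" field only, so the
    // per-term lookups during delete-by-Term are served from RAM.
    static IndexWriterConfig configWithMemoryIdField(Analyzer analyzer) {
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_46, analyzer);
        iwc.setCodec(new Lucene46Codec() {
            @Override
            public PostingsFormat getPostingsFormatForField(String field) {
                if ("FileName".equals(field)) {
                    return new MemoryPostingsFormat();
                }
                return super.getPostingsFormatForField(field);
            }
        });
        return iwc;
    }

    // Delete by Query: let Lucene resolve the matching documents itself
    // instead of handing it hundreds of thousands of individual Terms.
    static void deleteByQuery(IndexWriter writer, Query query) throws Exception {
        writer.deleteDocuments(query);
        writer.commit();
    }

    // Still delete by Term, but feed the writer smaller batches so the
    // buffered deletes are applied (and their memory freed) a chunk at a time.
    static void deleteTermsInChunks(IndexWriter writer, List<Term> terms, int chunkSize)
            throws Exception {
        for (int i = 0; i < terms.size(); i += chunkSize) {
            List<Term> chunk = terms.subList(i, Math.min(i + chunkSize, terms.size()));
            writer.deleteDocuments(chunk.toArray(new Term[chunk.size()]));
            writer.commit();
        }
    }
}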

Re: deleteDocuments(Term... terms) takes a long time to do nothing.

2013-12-16 Thread Jason Corekin
Mike,



Thank you for your help.  Below are a few comments in direct reply to your
questions, but in general your suggestions helped get me on the right track,
and I believe I have been able to solve the Lucene component of my problems.
The short answer is that when I had previously tried to delete by query, I
used the filenames stored in each document as the query, which was
essentially equivalent to deleting by term.  Your email helped me realize
this and, in turn, change my query to be time-range based, which now takes
seconds to run.
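
For concreteness, a time-range delete in Lucene 4.6 can look roughly like
the sketch below.  The numeric "timestamp" field name and the LongField
indexing are assumptions for illustration only, not the actual fields from
this index.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.Query;

public class TimeRangeDelete {

    // Indexing side: store each document's time as a numeric field so it can
    // be range-queried later ("timestamp" is a made-up field name).
    static void addTimestamp(Document doc, long millis) {
        doc.add(new LongField("timestamp", millis, Field.Store.NO));
    }

    // Delete everything whose timestamp falls inside [from, to] with a single
    // range query instead of one Term per file name.
    static void deleteRange(IndexWriter writer, long from, long to) throws Exception {
        Query query = NumericRangeQuery.newLongRange("timestamp", from, to, true, true);
        writer.deleteDocuments(query);
        writer.commit();
    }
}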



Thank You



Jason Corekin



> It sounds like there are at least two issues.
>
> First, that it takes so long to do the delete.
>
> Unfortunately, deleting by Term is at heart a costly operation.  It
> entails up to one disk seek per segment in your index; a custom
> Directory impl that makes seeking costly would slow things down, or if
> the OS doesn't have enough RAM to cache the "hot" pages (if your Dir
> impl is using the OS).  Is seeking somehow costly in your custom Dir
> impl?



No, seeks are not slow at all.

>
> If you are deleting ~1M terms in ~30 minutes that works out to ~2 msec
> per Term, which may actually be expected.
>
> How many terms in your index?  Can you run CheckIndex and post the output?

In the main test case that was causing problems I believe there are around
3.7 million terms, and this is tiny in comparison to what will need to be
held.  Unfortunately, I forgot to save the CheckIndex output that I created
from this test set while the problem was occurring, and now that the problem
is solved I do not think it is worth going back to recreate it.
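
For reference, CheckIndex can be run from the command line (e.g. java -cp
lucene-core-4.6.0.jar org.apache.lucene.index.CheckIndex /path/to/index) or
programmatically.  A minimal sketch of the programmatic form, assuming the
index Directory is already open, is:

import java.io.PrintStream;

import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.store.Directory;

public class IndexChecker {

    // Prints per-segment statistics (doc counts, term counts, deletions) and
    // reports any corruption found; it does not modify the index unless
    // fixIndex() is explicitly called.
    static void check(Directory dir) throws Exception {
        CheckIndex checker = new CheckIndex(dir);
        checker.setInfoStream(new PrintStream(System.out, true, "UTF-8"));
        CheckIndex.Status status = checker.checkIndex();
        System.out.println("index is clean: " + status.clean);
    }
}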



>
> You could index your ID field using MemoryPostingsFormat, which should
> be a good speedup, but will consume more RAM.
>
> Is it possible to delete by query instead?  Ie, create a query that
> matches the 460K docs and pass that to
> IndexWriter.deleteDocuments(Query).
>

Thanks so much for this suggestion, I had thought of it on my own.



>
> Also, try passing fewer ids at once to Lucene, e.g. break the 460K
> into smaller chunks.  Lucene buffers up all deleted terms from one
> call, and then applies them, so my guess is you're using way too much
> intermediate memory by passing 460K in a single call.

This does not seem to be the issue now, but I will keep it in mind.

>
> Instead of indexing everything into one index, and then deleting tons
> of docs to "clone" to a new index, why not just index to two separate
> indices to begin with?
>

The clone idea is only a test; the final design is to be able to copy date
ranges of data out of the main index and into secondary indexes that will be
backed up and removed from the main system at regular intervals.  The copy
component of this idea seems to work just fine; it is getting the deletion
from the main index to work that is giving me all the trouble.



> The second issue is that after all that work, nothing in fact changed.
> For that, I think you should make a small test case that just tries
> to delete one document, and iterate/debug until that works.  Your
> StringField indexing line looks correct; make sure you're passing
> precisely the same field name and value?  Make sure you're not
> deleting already-deleted documents?  (Your for loop seems to ignore
> already deleted documents).

This was caused by an incorrect use of the underlying data structure.  That
is partially fixed now and is what I am currently working on.  I have it
fixed far enough to determine that the remaining issue should no longer be
related to Lucene.



>
> Mike McCandless
