Re: LuceneJUnitResultFormatter sometimes fails to lock

Robert Muir Wed, 28 Apr 2010 09:55:10 -0700

As far as the build system goes, I implemented the two ideas mentioned
earlier in this thread (not creating a new Formatter for each test, and not
spawning 26 JVMs for each batch).


Jira is down, but if you want to help test you can try a patch here:
http://pastebin.com/iqwb73H2 (click Raw/Download)

Additionally, this cuts about 1:20 off the total Solrcene 'ant clean test' time for me.

before:
BUILD SUCCESSFUL
Total time: 7 minutes 42 seconds

after:
BUILD SUCCESSFUL
Total time: 6 minutes 23 seconds

On Wed, Apr 28, 2010 at 12:25 PM, Michael McCandless <[email protected]> wrote:

> I think these are good changes to NativeFSLockFactory.
>
> But: the chances that N JVMs launched at once would conflict on the
> randomly generated lock file name should be minuscule... though it
> does depend on how good new Random() is at seeding itself.  Do we
> really think this explains your exceptions Shai?  (And, if so, even w/
> these changes, the conflict could still happen?)  Maybe we should
> explicitly seed it?
>
> Mike
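
A minimal sketch of one way to take the seeding question off the table,
assuming the test lock name keeps the lucene-<random>-test.lock shape seen in
the stack trace further down. The class and method names are hypothetical, and
SecureRandom is swapped in for new Random() purely as an illustration; this is
not what the patch does:

```java
import java.security.SecureRandom;

/** Hypothetical helper; not part of Lucene. */
final class TestLockNames {
  // SecureRandom draws its seed from the OS entropy source, so the name does
  // not depend on how well new Random() seeds itself in each JVM.
  private static final SecureRandom RANDOM = new SecureRandom();

  /** Returns a name such as lucene-<base-36 digits>-test.lock. */
  static String randomLockName() {
    long bits = RANDOM.nextLong() & Long.MAX_VALUE; // keep it non-negative
    return "lucene-" + Long.toString(bits, Character.MAX_RADIX) + "-test.lock";
  }
}
```

With 63 random bits per name, two JVMs started at the same instant should not
plausibly draw the same lock file name.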
>
> On Wed, Apr 28, 2010 at 11:22 AM, Shai Erera <[email protected]> wrote:
> > I'd like to summarize the IRC discussion Mark and I had:
> >
> > The lock file's existence in the directory should not prevent obtain() from
> > obtaining a lock. That's the whole difference between Simple and Native. So
> > we should make a best effort to delete it. If the delete fails on release(),
> > then OK. On obtain(), we won't return false if the lock file exists, but
> > attempt to really obtain it and fail appropriately.
> >
> > While the previously proposed fix (add "&& path.exists()" to release())
> > might work most of the time, it will only work "most of the time". I.e.,
> > between release() and delete(), an external process, such as an antivirus
> > scanner, might lock the file, so delete will fail while the file is still
> > there, and we'll still throw an exception.
> >
> > So, the proposed changes are:
> > * release() is allowed to fail to delete the lock file.
> > * obtain() should not return false if the lock file exists - it should
> > really attempt to obtain it.
> > * in acquireTestLock(), if the lock file still exists after release() is
> > called, we'll retry the delete a few ms later and, if that fails, call
> > deleteOnExit.
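
The third bullet could look roughly like this. It is a sketch only:
LockFileCleanup and deleteBestEffort are hypothetical names, and the retry
count and pause are parameters chosen for illustration, not values from any
patch:

```java
import java.io.File;

/** Hypothetical helper; not part of Lucene. */
final class LockFileCleanup {

  /**
   * Tries to delete the lock file, retrying after a short pause.
   * Returns true if the file is gone afterwards; otherwise registers it
   * for deletion at JVM exit and returns false. Never throws.
   */
  static boolean deleteBestEffort(File path, int retries, long pauseMillis) {
    for (int attempt = 0; attempt <= retries; attempt++) {
      // A failed delete() is fine as long as the file no longer exists,
      // e.g. because another JVM removed it first.
      if (path.delete() || !path.exists()) {
        return true;
      }
      if (attempt < retries) {
        try {
          Thread.sleep(pauseMillis);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt(); // preserve the interrupt and give up
          break;
        }
      }
    }
    path.deleteOnExit(); // last resort: let the JVM try again on shutdown
    return false;
  }
}
```

A caller in acquireTestLock() might use something like
deleteBestEffort(lockFile, 1, 10) after release() and treat a false result as
a warning rather than a failure.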
> >
> > How's that sound?
> >
> > Shai
> >
> > On Wed, Apr 28, 2010 at 5:58 PM, Mark Miller <[email protected]> wrote:
> >>
> >> I don't follow. The simple lock impl must delete the file, but the native
> >> impl should not have to. The file has nothing to do with the lock - it's
> >> just the medium to ask for and release the lock. If it already exists, you
> >> don't have to create it - you can just use it to try and get a native lock.
> >> Likewise, it doesn't need to be removed to release a native lock - you
> >> simply call unlock on it.
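
To illustrate the point with plain java.nio: the sketch below is a
hypothetical NativeLockSketch class, not Lucene's NativeFSLock. obtain() reuses
the file if it is already there, and release() only unlocks, leaving the file
in place:

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;

/** Hypothetical illustration; not the Lucene implementation. */
final class NativeLockSketch {
  private final File path;
  private RandomAccessFile file;
  private FileLock lock;

  NativeLockSketch(File path) {
    this.path = path;
  }

  /** Tries to take the OS-level lock; a pre-existing file is not an error. */
  synchronized boolean obtain() {
    if (lock != null) {
      return false; // this instance already holds the lock
    }
    try {
      file = new RandomAccessFile(path, "rw"); // creates the file or reuses it
      lock = file.getChannel().tryLock();      // null if another process holds it
    } catch (IOException | OverlappingFileLockException e) {
      lock = null; // could not open the file, or this JVM already locked it
    }
    if (lock == null) {
      closeQuietly();
    }
    return lock != null;
  }

  /** Releases the native lock. The file itself is deliberately left alone. */
  synchronized void release() {
    try {
      if (lock != null) {
        lock.release();
      }
    } catch (IOException ignored) {
      // best effort: closing the channel below drops the lock anyway
    } finally {
      lock = null;
      closeQuietly();
    }
  }

  private void closeQuietly() {
    try {
      if (file != null) {
        file.close();
      }
    } catch (IOException ignored) {
      // nothing useful to do here
    }
    file = null;
  }
}
```

One caveat the sketch does not handle: the FileLock Javadoc notes that on some
systems closing any channel on a file releases all locks the JVM holds on that
file, so real code needs its own in-JVM bookkeeping on top of this.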
> >>
> >> On 4/28/10 10:34 AM, Shai Erera wrote:
> >>>
> >>> But this method is called also for the regular lock file - if release()
> >>> won't delete the file, then the next l.obtain() will return false.
> >>>
> >>> Shai
> >>>
> >>> On Wed, Apr 28, 2010 at 5:31 PM, Mark Miller <[email protected]> wrote:
> >>>
> >>>    It shouldn't need to though - the native lock file is simply a
> >>>    dummy file to apply the lock to - shouldn't matter if it already
> >>>    exists or not (though it seems to in the current code).
> >>>
> >>>
> >>>    On 4/28/10 10:22 AM, Shai Erera wrote:
> >>>
> >>>        If you don't delete the file, won't the next obtain fail?
> >>>
> >>>        On Wed, Apr 28, 2010 at 5:12 PM, Mark Miller <[email protected]> wrote:
> >>>
> >>>            I wonder if not being able to delete the file should throw a
> >>>            release failed exception at all. You have actually released the
> >>>            native lock - you were just not able to clean up - but that
> >>>            seems more like a warning situation than a failure.
> >>>
> >>>
> >>>            --
> >>>            - Mark
> >>>
> >>>        http://www.lucidimagination.com
> >>>
> >>>            On 4/28/10 9:53 AM, Shai Erera wrote:
> >>>
> >>>                I've hit it again and here's the full stacktrace (at least
> >>>                what's printed):
> >>>
> >>>                [junit] Exception in thread "main" java.lang.RuntimeException: Failed to acquire random test lock; please verify filesystem for lock directory 'C:\DOCUME~1\shaie\LOCALS~1\Temp\lucene_junit_lock' supports locking
> >>>                [junit]     at org.apache.lucene.store.NativeFSLockFactory.acquireTestLock(NativeFSLockFactory.java:88)
> >>>                [junit]     at org.apache.lucene.store.NativeFSLockFactory.makeLock(NativeFSLockFactory.java:127)
> >>>                [junit]     at org.apache.lucene.util.LuceneJUnitResultFormatter.<init>(LuceneJUnitResultFormatter.java:74)
> >>>                [junit]     at java.lang.J9VMInternals.newInstanceImpl(Native Method)
> >>>                [junit]     at java.lang.Class.newInstance(Class.java:1325)
> >>>                [junit]     at org.apache.tools.ant.taskdefs.optional.junit.FormatterElement.createFormatter(FormatterElement.java:248)
> >>>                [junit]     at org.apache.tools.ant.taskdefs.optional.junit.FormatterElement.createFormatter(FormatterElement.java:214)
> >>>                [junit]     at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.transferFormatters(JUnitTestRunner.java:819)
> >>>                [junit]     at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:909)
> >>>                [junit]     at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:743)
> >>>                [junit] Caused by: org.apache.lucene.store.LockReleaseFailedException: failed to delete C:\DOCUME~1\shaie\LOCALS~1\Temp\lucene_junit_lock\lucene-wn1v4z-test.lock
> >>>                [junit]     at org.apache.lucene.store.NativeFSLock.release(NativeFSLockFactory.java:311)
> >>>                [junit]     at org.apache.lucene.store.NativeFSLockFactory.acquireTestLock(NativeFSLockFactory.java:86)
> >>>                [junit]     ... 9 more
> >>>
> >>>                The exception is thrown from NativeFSLock.release() b/c it
> >>>                fails to delete the lock file. I think I know what the
> >>>                problem is - and it must be related to the large number of
> >>>                JVMs that are created w/ the parallel tests:
> >>>                * Suppose that JVM1 draws the number '1' for the test lock
> >>>                file - it thus creates lock1.
> >>>                * Now suppose that JVM2 somehow draws the same number - it
> >>>                thus creates lock1 as well.
> >>>                * The code of acquireTestLock in NativeFSLockFactory looks
> >>>                like this:
> >>>                     Lock l = makeLock(randomLockName);
> >>>                     try {
> >>>                       l.obtain();
> >>>                       l.release();
> >>>                --> both will create the same test Lock file. Then
> >>>                l.obtain() probably returns false for one of them, but it's
> >>>                not checked.
> >>>                * Then in release there are a couple of things to note:
> >>>                1) the method is synced on the instance, which does not
> >>>                affect the two JVMs.
> >>>                2) suppose that both JVMs pass through the if (exists())
> >>>                check. Then JVM1 releases the lock, and deletes the file.
> >>>                3) Now JVM2 kicks in, calls lock.release() which has no
> >>>                effect (from the jdoc: "If this lock object is invalid then
> >>>                invoking this method has no effect."). Then when it comes to
> >>>                path.delete(), the file isn't there, the method returns
> >>>                false and thus an exception is thrown ...
> >>>
> >>>                This situation is extremely unlikely to happen, but still,
> >>>                it happens on my machine quite frequently since the parallel
> >>>                tests were introduced. I'm thinking that acquireTestLock
> >>>                should be less strict, but perhaps we can fix it if we
> >>>                replace the line:
> >>>                      if (!path.delete()) (line 310)
> >>>                with this
> >>>                      if (!path.delete() && path.exists())
> >>>
> >>>                I.e., only throw the exception if the lock file fails to
> >>>                delete but is still there ...
> >>>
> >>>                What do you think?
> >>>
> >>>                Shai
> >>>
> >>>                On Tue, Apr 27, 2010 at 10:21 PM, Robert Muir <[email protected]> wrote:
> >>>
> >>>
> >>>
> >>>                    On Tue, Apr 27, 2010 at 3:06 PM, Andi Vajda <[email protected]> wrote:
> >>>
> >>>
> >>>                        I've had similar random failures on Mac OS X 10.6.
> >>>                        They started happening recently, about two weeks ago.
> >>>
> >>>
> >>>                    That's just too randomly close to when I last worked on
> >>>                    this build system stuff for LUCENE-1709... perhaps I
> >>>                    made it worse instead of better.
> >>>
> >>>                    --
> >>>                    Robert Muir
> >>>                    [email protected]
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>            ---------------------------------------------------------------------
> >>>            To unsubscribe, e-mail: [email protected]
> >>>            For additional commands, e-mail: [email protected]
> >>>
> >>>
> >>>
> >>>
> >>>    --
> >>>    - Mark
> >>>
> >>>    http://www.lucidimagination.com
> >>>
> >>>
> >>>
> >>>
> >>
> >>
> >> --
> >> - Mark
> >>
> >> http://www.lucidimagination.com
> >>
> >>
> >
> >
>
>
>


-- 
Robert Muir
[email protected]
