Re: Potential bug in task list management

Chris Hostetter Wed, 19 Mar 2025 17:50:45 -0700


: Essentially, whenever a query task is abnormally ended, ie either the 
: client times out and closes the connection, the query hits the 
: timeAllowed or cpuAllowed limit, or the task is cancelled through the 
: /solr/collection/tasks/cancel?queryUUID= mechanism, the task is never or 
: almost never removed from the list of tasks returned by the 
: /v2/collections/collection/tasks/list endpoint.


I'm not really familiar with the "task" management API or with canceling 
queries, nor have I tried to reproduce the behavior you are describing, 
but based on a skim of the only test I see that involves canceling 
queries, I don't see anything in that test that would rule out what you're 
describing.

TestTaskManagement.testNonExistentQuery
 - just asserts a 404 when trying to cancel a nonexistent UUID

TestTaskManagement.testCancellationQuery
 - runs some queries in background threads and then cancels them
 - nothing about the queries ensures they are still around to be canceled
 - test only asserts that the number of queries it created equals the 
   number of successes + failures in trying to cancel
 - so even if 100% of the queries never ran, and were never tracked, 
   this test would pass
 - and nothing in the test confirms that any tasks which *might* have been 
   tracked are removed from the list at the end of the test

TestTaskManagement.testListCancellableQueries
 - runs 50 queries in background threads and then lists current tasks
 - only asserts that the number of items in the list is: 0 <= n <= 50 
 - so again: if 100% of the queries never make it to Solr, the test passes
 - if 100% of the queries are stuck in the list forever, the test passes


So yeah.  There's really not much that this test actually proves.

Can you please file a Jira with the details of your observations?



If you're up for it, here's how I would approach fixing the test:


1) write a custom SearchComponent that checks for some "blockTest=true" 
request param, and if it's set...

  - calls release() on a "public static final Semaphore REQ_READY"
  - then calls acquire() on a "public static final Semaphore REQ_WAITS_FOR"

...before letting the request finish

2) register & use that component in a /blocking SearchHandler (either via 
a new configset, or via the APIs to add them at query time)


3) change the test logic:

  - use the new /blocking request handler and send blockTest=true on all N 
    requests
  - wait to acquire(N) permits from REQ_READY before making any 
    assertions about the task list and/or canceling X requests
  - then and only then release(N-X) on REQ_WAITS_FOR
  - wait for all the background request responses, and assert that the 
    canceled ones failed, and the other ones succeeded
  - then check the task list and confirm it's now empty.
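
To make the handshake concrete, here's a rough, Solr-free sketch of the two 
semaphore pattern in plain java.util.concurrent.  Everything except the two 
semaphore names is made up for illustration: blockingRequest() stands in for 
a request hitting the custom component, TASKS stands in for Solr's task 
list, and Future.cancel(true) stands in for the cancel API (I'm assuming a 
canceled request actually gets unblocked, which you'd need to confirm for 
real Solr).  One deliberate difference from the steps above: the sketch 
waits for the canceled requests to leave the list *before* releasing the 
N-X permits, so a canceled thread can never grab a permit meant for another 
request.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.CancellationException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class BlockingHandshakeSketch {
  // The two semaphores from step 1, shared by the "component" and the test.
  public static final Semaphore REQ_READY = new Semaphore(0);
  public static final Semaphore REQ_WAITS_FOR = new Semaphore(0);

  // Hypothetical stand-in for Solr's list of active tasks.
  static final Set<Integer> TASKS = ConcurrentHashMap.newKeySet();

  // Stand-in for one blocked request: track it, signal readiness, then wait.
  static boolean blockingRequest(int id) throws InterruptedException {
    TASKS.add(id);
    try {
      REQ_READY.release();      // tell the test this request is in flight
      REQ_WAITS_FOR.acquire();  // block until released (or interrupted)
      return true;
    } finally {
      TASKS.remove(id);         // the cleanup a real test should verify
    }
  }

  // Polls until the task list has the expected size, or the timeout expires.
  static boolean awaitTaskCount(int expected, long timeoutMs) throws InterruptedException {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    while (TASKS.size() != expected) {
      if (System.nanoTime() > deadline) return false;
      Thread.sleep(5);
    }
    return true;
  }

  // Runs n requests and cancels x of them.
  // Returns {succeeded, canceled, tasksLeftInList}.
  static int[] run(int n, int x) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(n);
    try {
      List<Future<Boolean>> futures = new ArrayList<>();
      for (int i = 0; i < n; i++) {
        final int id = i;
        futures.add(pool.submit(() -> blockingRequest(id)));
      }
      // Wait until all n requests are provably in flight before asserting.
      if (!REQ_READY.tryAcquire(n, 10, TimeUnit.SECONDS)) {
        throw new IllegalStateException("requests never became ready");
      }
      if (TASKS.size() != n) {
        throw new IllegalStateException("expected " + n + " tracked tasks");
      }
      // Cancel x requests and wait for them to drop out of the list.
      for (int i = 0; i < x; i++) {
        futures.get(i).cancel(true);
      }
      if (!awaitTaskCount(n - x, 10_000)) {
        throw new IllegalStateException("canceled tasks still listed");
      }
      // Then, and only then, let the remaining n - x requests finish.
      REQ_WAITS_FOR.release(n - x);
      int succeeded = 0;
      int canceled = 0;
      for (Future<Boolean> f : futures) {
        try {
          if (f.get(10, TimeUnit.SECONDS)) succeeded++;
        } catch (CancellationException e) {
          canceled++;
        }
      }
      awaitTaskCount(0, 10_000);
      return new int[] {succeeded, canceled, TASKS.size()};
    } finally {
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception {
    int[] r = run(8, 3);
    System.out.println("succeeded=" + r[0] + " canceled=" + r[1] + " tasksLeft=" + r[2]);
  }
}
```

With N=8 and X=3 this should report 5 successes, 3 cancellations, and an 
empty list.  The point is that every assertion happens at a moment when the 
test knows exactly which requests are in flight, which is what the current 
test can't guarantee.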



-Hoss
http://www.lucidworks.com/
