[ https://issues.apache.org/jira/browse/SOLR-17416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877915#comment-17877915 ]
Chris M. Hostetter commented on SOLR-17416:
-------------------------------------------

I've seen this in the wild and after digging into it a bit can elaborate on the circumstances...
* {{ExportBuffers.run(Callable)}} is designed around the use of two threads:
** a background thread that acts as a "filler" to read data from disk and populate a {{fillBuffer}}
** the caller thread, which serially invokes the "writer" logic passed in as the {{Callable}} argument, to serialize an {{outputBuffer}}
* These two threads use {{exchangeBuffers()}} to swap with each other, which under the covers uses a {{CyclicBarrier}}
** The {{CyclicBarrier.await(time)}} calls use a hard-coded timeout of 600 seconds
*** This seems to have been chosen as an arbitrary "big value" to ensure that neither thread sat around waiting forever (consuming RAM) if the other thread died.

The problem, however, is that this 600 second timeout may not be enough to account for really slow downstream consumption of the data. With really large collections, and really complicated streaming expressions, this can happen even with well behaved clients that are actively trying to consume data.

One particular example I've seen is a stream where a {{leftOuterJoin}} is wrapped around two very large collections, spread over a large number of shards...
{noformat}
leftOuterJoin(search(biggest_left_collection,q="*:*",sort="join_field asc",fl="...",qt="/export"),
              search(big_right_collection,q="*:*",sort="join_field asc",fl="...",qt="/export"),
              on="join_field")
{noformat}
...After an initial burst of fetching a first {{batchSize}} of Tuples from both streams, and slurping up all the results from both streams that have the same {{join_field}} value as the head of the left Tuple, each {{read()}} call from the downstream client causes Tuples from the merged set to be consumed – but neither of the upstream "left" or "right" streams reads any new data for a bit.
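To make the failure mode concrete, here is a minimal standalone sketch (not Solr code; all names here are invented for illustration) of the two-party exchange pattern described above: a timed {{CyclicBarrier.await()}} on the "filler" side times out when the "writer" is stalled for longer than the timeout – and per the {{CyclicBarrier}} contract, the timeout also leaves the barrier in the "broken" state:

{code:java}
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BarrierExchangeSketch {

  // Simulates the "filler" waiting at the barrier while the "writer" is
  // stalled by slow downstream consumption for longer than the timeout.
  // Returns what the filler thread experienced.
  static String fillerOutcome() throws InterruptedException {
    CyclicBarrier barrier = new CyclicBarrier(2);
    final String[] outcome = new String[1];
    Thread filler = new Thread(() -> {
      try {
        // 200ms stands in for the hard-coded 600 second timeout
        barrier.await(200, TimeUnit.MILLISECONDS);
        outcome[0] = "exchanged";
      } catch (TimeoutException e) {
        outcome[0] = "TimeoutException";
      } catch (Exception e) {
        outcome[0] = e.getClass().getSimpleName();
      }
    });
    filler.start();
    Thread.sleep(500); // the "writer" never arrives in time
    filler.join();
    // a timed-out await() also leaves the barrier "broken"
    return outcome[0] + (barrier.isBroken() ? " (barrier broken)" : "");
  }

  public static void main(String[] args) throws Exception {
    System.out.println(fillerOutcome());
  }
}
{code}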
And even once additional values are read from the "left" stream, the "right" stream may sit stagnant for even longer, as the node running the join code keeps consuming (and passing downstream) Tuples from the left stream until it encounters the same {{join_field}} value as the head of the right stream, and _then_ starts fetching more data from the individual replicas of that right stream collection. Which can take a while – and in the meantime the "filler" thread on one (or more) of the replicas of {{big_right_collection}} may have encountered a {{TimeoutException}} from the {{barrier.await(time)}}.

In _theory_ the "filler" code in {{ExportBuffers}} reports the {{TimeoutException}} in this situation by passing the exception to the following method before the filler thread ends...
{code:java}
public void error(Throwable t) {
  error = t;
  // break the lock on the other thread too
  barrier.reset();
}
{code}
...expecting the "writer" code (from the caller thread) to check the value of {{getError()}} and report it. But there are two main problems with this approach to passing the exception along:
1. The common case "writer" logic in {{ExportWriter}} (which calls {{ExportBuffers.run()}}) *_NEVER_* checks {{ExportBuffers.getError()}} – it is only ever checked by (the special case logic in) {{ExportWriterStream}}, and even there it's only checked in the event of a {{BrokenBarrierException}}
2. The logic inside {{ExportBuffers.run()}} itself only logs (and ignores) any type of Throwable that propagates up to it.
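Generically, the intended hand-off amounts to the filler recording its failure in a shared slot, and the writer consulting that slot however it finishes. A minimal sketch of that pattern (these are not Solr's actual classes; names are invented here):

{code:java}
import java.io.IOException;
import java.util.concurrent.atomic.AtomicReference;

public class ErrorPropagationSketch {

  // shared slot the "filler" thread writes into before it dies
  private final AtomicReference<Throwable> error = new AtomicReference<>();

  // called by the filler when its await() times out (or anything else fails)
  void fillerFailed(Throwable t) {
    error.compareAndSet(null, t);
  }

  // the writer must check the slot no matter how its loop ends, and turn a
  // filler-side failure into an exception the client can actually see;
  // otherwise the response ends with a clean EOF marker (a "false success")
  void writeResults(Runnable writeLoop) throws IOException {
    try {
      writeLoop.run();
    } finally {
      Throwable t = error.get();
      if (t != null) {
        throw new IOException("filler thread failed", t);
      }
    }
  }
}
{code}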
A very tightly related third problem is the call to {{barrier.reset()}}:
* AFAICT the usage of this method call seems to be a complete mistake / misunderstanding
** Its comment & usage here in the {{error()}} method (ie: after one of the two parties gets either a {{TimeoutException}} or {{BrokenBarrierException}}) seems to be with the intent of ensuring that _if_ the other thread is currently {{await(time)}}-ing on the barrier, that thread should get a {{BrokenBarrierException}} – but that's not what this method is designed for.
** The way we are using the {{CyclicBarrier}} in this code, the situation described in the comment should never even be possible.
* In general, anytime any call to {{CyclicBarrier.await()}} throws an exception the barrier is left in a "broken" state, and any other calls to {{CyclicBarrier.await()}} ({_}either currently waiting, or in the future{_}) will get a {{BrokenBarrierException}}
** The call to {{barrier.reset()}} here causes that "broken" state to be reset, *_preventing_* any future calls to {{CyclicBarrier.await()}} (from the thread that didn't already encounter an error) from throwing {{BrokenBarrierException}} – instead that thread waits the full timeout amount (for a thread that has already errored out and is long gone)

----

In general, I think the design of {{ExportBuffers}} (and the use of a {{CyclicBarrier.await(time)}} to exchange buffers between two threads) is more complicated and error-prone than it needs to be, and should be re-considered in order to ensure that these kinds of "slow consumption" situations can't result in the "filler" thread giving up after an arbitrary time limit even though the "writer" thread is still around and the client is still consuming results. I will open a separate jira to discuss options for redesigning this.
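The {{reset()}} behavior described above can be demonstrated in isolation (standalone sketch with invented names, not Solr code): after one party's timed {{await()}} times out, the barrier is broken, so the other party's later {{await()}} fails fast with {{BrokenBarrierException}}. But if {{reset()}} is called after the timeout, that broken state is cleared, and the other party instead waits out its full timeout for a thread that is already gone:

{code:java}
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BarrierResetSketch {

  // First await() plays the "filler" timing out alone; second await() plays
  // the "writer" arriving later. Both run on one thread, which CyclicBarrier
  // permits, since it has no notion of thread ownership.
  static String secondAwaitOutcome(boolean resetAfterTimeout) throws Exception {
    CyclicBarrier barrier = new CyclicBarrier(2);
    try {
      barrier.await(50, TimeUnit.MILLISECONDS); // "filler" times out alone
    } catch (TimeoutException expected) {
      if (resetAfterTimeout) {
        barrier.reset(); // what ExportBuffers.error() does
      }
    }
    try {
      barrier.await(200, TimeUnit.MILLISECONDS); // the "writer" arrives later
      return "exchanged";
    } catch (BrokenBarrierException e) {
      return "BrokenBarrierException (fails fast)";
    } catch (TimeoutException e) {
      return "TimeoutException (waited the full timeout)";
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println("without reset: " + secondAwaitOutcome(false));
    System.out.println("with reset:    " + secondAwaitOutcome(true));
  }
}
{code}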
In the meantime, my biggest concern is the "false success" that is returned to clients in the event that the filler code does timeout – so that client/caller code has no idea there was a problem and partial results have been returned. I'm attaching a patch that does 3 things:
* Fixes both "writer" code paths to check {{getError()}} (and report it) regardless of how they finish
** I am not 100% certain that these are the _best_ places to check {{getError()}}, or that there are not _additional_ places that should do this check – but this seemed to address the situations where I could reproduce this type of problem
* Makes the {{CyclicBarrier.await(time)}} timeout value configurable via an (undocumented) EnvUtils sysprop
** So that Solr deployments with large collections have an expert-level mechanism for addressing TimeoutErrors they may encounter
* Removes the call to {{barrier.reset()}}
** It just makes no sense in this code, and leaving it in ensures that _if_ the "filler" times out, the "writer" code (when it is ready to {{exchangeBuffers()}}) winds up waiting the full time limit itself for no reason.

...The way the code is designed, we don't really have a straightforward way to write a junit test for these types of situations, and if we're going to spend time refactoring to improve the test coverage I'd rather focus on bigger-picture refactoring to eliminate the risk of these bugs in general – but I'm open to suggestions if folks want to contribute test cases that should be committed along with these changes.

> Streaming Expressions: Exception swallowed and not propagated back to the
> client leading to inconsistent results
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-17416
>                 URL: https://issues.apache.org/jira/browse/SOLR-17416
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level.
> Issues are Public)
>          Components: streaming expressions
>            Reporter: Lamine
>            Priority: Major
>
> There appears to be a bug in the _ExportWriter/ExportBuffers_ implementation
> within the Streaming Expressions plugin. Specifically, when an
> InterruptedException occurs due to an ExportBuffers timeout, the exception is
> swallowed and not propagated back to the client (still logged on the server
> side though).
> As a result, the client receives an EOF marker, thinking that it has received
> the full set of results, when in fact it has only received partial results.
> This leads to inconsistent search results, as the client is unaware that the
> export process was interrupted and terminated prematurely.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)