The conception of the max-retry option seems strange to me: if the number of retries reaches max-retry, we just increment the counter of failed transactions and try again (possibly with different random numbers). In the end we need to distinguish the number of errored transactions from the number of failed transactions, and to find this difference the documentation suggests rerunning pgbench with debugging on.

Maybe I didn't catch the idea, but it seems to me that max-tries should be removed. On a transaction serialization or deadlock error, pgbench should increment the counter of failed transactions; reset the conditional stack, variables, etc. (but not the random generator); and then start a new transaction from the first line of the script.
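
A rough sketch of that alternative handling (all names below are illustrative, not the actual patch API):

   case ABORTED:
      if (error_is_serialization_or_deadlock(res))
      {
         st->failed_tx++;               /* count the failed transaction */
         reset_conditional_stack(st);   /* drop any open \if levels */
         reset_variables(st);
         /* the random generator is deliberately NOT reset, so the new
          * transaction may run with different random numbers */
         st->command = 0;               /* first line of the script */
         st->state = START_TX;
      }
      break;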

Marina Polyakova wrote:
On 26-03-2018 18:53, Fabien COELHO wrote:
Hello Marina,

Hello!

Many thanks to both of you! I'm working on a patch in this direction..

I think that the best approach for now is simply to reset (command
zero, random generator) and start over the whole script, without
attempting to be more intelligent. The limitations should be clearly
documented (one transaction per script), though. That would be a
significant enhancement already.

I'm not sure that we can always do this, because we can get new errors until we finish the failed transaction block, and we need to destroy the conditional stack..

Sure. I'm suggesting, so as to simplify, that on failure the retry would always restart from the beginning of the script by resetting everything, indeed including the conditional stack, the random generator state, the variable values, and so on.

This means somehow enforcing that one script is one transaction.

If the user does not do that, it would be their decision and the result would become unpredictable on errors (eg some sub-transactions could be executed more than once).

Then if more is needed, that could be for another patch.

Here is the fifth version of the patch for pgbench (based on commit 4b9094eb6e14dfdbed61278ea8e51cc846e43579), in which I tried to implement these ideas, thanks to your comments and those of Teodor Sigaev. Since we may need to execute commands to complete a failed transaction block, the script is now always executed completely. If there is a serialization/deadlock failure that can be retried, the script is executed again with the same random state and array of variables as before its first run. Meta command errors, like all SQL errors, do not cause the client to abort. The first failure in the current script execution determines whether the script run will be retried or not, so only such failures (which are retried) or errors (which are not) are reported.
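
To make the resulting control flow concrete, here is a toy model (not pgbench itself; the names, probabilities and counts are all illustrative). It shows the behaviour described above: the script always runs to completion, a failed run is retried with the same client random state, and the transaction counts as an error once max-tries attempts have failed:

   #define _XOPEN_SOURCE 700
   #include <stdio.h>
   #include <stdlib.h>

   typedef struct { unsigned short xseed[3]; } RandomState;

   #define MAX_TRIES 3
   #define NCOMMANDS 4

   /* models server-side nondeterminism: a command hits a retriable
    * serialization/deadlock failure ~20% of the time, independently
    * of the client random state (which IS restored on retry) */
   static unsigned short server_seed[3] = {1, 2, 3};

   static int
   run_command(RandomState *client)
   {
       (void) erand48(client->xseed);     /* draw the command's parameters */
       return erand48(server_seed) < 0.2; /* 1 = retriable failure */
   }

   int
   main(void)
   {
       RandomState rs = {{11, 22, 33}};
       int processed = 0, errors = 0, retried = 0, retries = 0;

       for (int tx = 0; tx < 10000; tx++)
       {
           RandomState saved = rs;        /* snapshot at transaction start */
           int tries = 0;

           for (;;)
           {
               int failed = 0;

               tries++;
               /* the script is always executed completely, even after
                * a failure, so the transaction block gets ended */
               for (int cmd = 0; cmd < NCOMMANDS; cmd++)
                   if (run_command(&rs))
                       failed = 1;

               if (!failed)
               {
                   processed++;           /* only successes are "processed" */
                   break;
               }
               if (tries >= MAX_TRIES)
               {
                   errors++;              /* reported as an error, not tps */
                   break;
               }
               rs = saved;                /* same random state, same params */
               retries++;
           }
           if (tries > 1)
               retried++;
       }
       printf("processed=%d errors=%d retried=%d retries=%d\n",
              processed, errors, retried, retries);
       return 0;
   }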

I tried to make fixes in accordance with your previous reviews ([1], [2], [3]):

About the feature:

I'm unclear about the example added in the documentation. There are 71% errors, but 100% of transactions are reported as processed. If there were errors, then it was not a success, so the transactions were not processed? To me it looks inconsistent. Also, while testing, it seems that failed transactions are counted in tps, which I think is not appropriate:

 sh> PGOPTIONS='-c default_transaction_isolation=serializable' \
       ./pgbench -P 1 -T 3 -r -M prepared -j 2 -c 4
 starting vacuum...end.
 progress: 1.0 s, 10845.8 tps, lat 0.091 ms stddev 0.491, 10474 failed
 # NOT 10845.8 TPS...
 progress: 2.0 s, 10534.6 tps, lat 0.094 ms stddev 0.658, 10203 failed
 progress: 3.0 s, 10643.4 tps, lat 0.096 ms stddev 0.568, 10290 failed
 ...
 number of transactions actually processed: 32028 # NO!
 number of errors: 30969 (96.694 %)
 latency average = 2.833 ms
 latency stddev = 1.508 ms
 tps = 10666.720870 (including connections establishing) # NO
 tps = 10683.034369 (excluding connections establishing) # NO
 ...

For me this is all wrong. I think that the tps report is about transactions that succeeded, not mere attempts. I cannot say that a transaction which aborted was "actually processed"... as it was not.

Fixed

The order of reported elements is not logical:

 maximum number of transaction tries: 100
 scaling factor: 10
 query mode: prepared
 number of clients: 4
 number of threads: 2
 duration: 3 s
 number of transactions actually processed: 967
 number of errors: 152 (15.719 %)
 latency average = 9.630 ms
 latency stddev = 13.366 ms
 number of transactions retried: 623 (64.426 %)
 number of retries: 32272

I would suggest grouping everything about error handling in one block, eg something like:

 scaling factor: 10
 query mode: prepared
 number of clients: 4
 number of threads: 2
 duration: 3 s
 number of transactions actually processed: 967
 number of errors: 152 (15.719 %)
 number of transactions retried: 623 (64.426 %)
 number of retries: 32272
 maximum number of transaction tries: 100
 latency average = 9.630 ms
 latency stddev = 13.366 ms

Fixed

Also, the percent character should be stuck to its number: 15.719%, to make the style more homogeneous (although there seem to be pre-existing inhomogeneities).

I would replace "transaction tries/retried" with "tries/retried"; everything is about transactions in the report anyway.

Without reading the documentation, the overall report semantics is unclear, especially given the absurd tps results I got with my first attempt, as failing transactions are counted as "processed".

Fixed

About the code:

I'm at a loss with the 7 states added to the automaton, where I would have hoped that only 2 (eg RETRY & FAIL, or even fewer) would be enough.

Fixed

I'm wondering whether the whole feature could be simplified by considering that one script is one "transaction" (it is from the report point of view at least), and that any retry is for the full script only, from its beginning. That would remove the need to guess at transaction begin or end, avoid scanning manually for subcommands, and so on.
 - Would it make sense?
 - Would it be ok for your use case?

Fixed

The proposed version of the code looks unmaintainable to me. There are 3 levels of nested "switch/case" with state changes at the deepest level. I cannot even see it on my screen, which is not wide enough.

Fixed

There should be a typedef for "random_state", eg something like:

  typedef struct { unsigned short data[3]; } RandomState;
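
As a self-contained illustration of why the typedef helps (this uses POSIX erand48(), which has the same three-word seed shape as pgbench's pg_erand48; everything else is made up for the example): the whole random state can be saved and restored by plain struct assignment when a transaction is retried.

   #define _XOPEN_SOURCE 700
   #include <stdio.h>
   #include <stdlib.h>

   typedef struct
   {
       unsigned short xseed[3];
   } RandomState;

   int
   main(void)
   {
       RandomState rs = {{0x330E, 0xABCD, 0x1234}};
       RandomState saved = rs;            /* snapshot before the first try */

       double first = erand48(rs.xseed);  /* advances rs.xseed */

       rs = saved;                        /* retry: restore the snapshot */
       double again = erand48(rs.xseed);  /* reproduces 'first' exactly */

       printf("%.6f == %.6f\n", first, again);
       return 0;
   }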

Please keep "const" declarations, eg "commandFailed".

I think that choosing the script should depend on the thread random state, not the client random state, so that a run would generate the same pattern per thread, independently of which client finishes first.

I'm sceptical of the "--debug-fails" option. ISTM that --debug is already there and should just be reused.

Fixed

I agree that function naming style is already a mess, but I think that new functions you add should use a common style, eg "is_compound" vs "canRetry".

Fixed

Translating error strings to their enum should be put in a function.

Removed

I'm not sure this whole thing should be done anyway.

The processing of compound commands is removed.

The "node" is started but never stopped.

Fixed

For file contents, maybe the << 'EOF' here-document syntax would help instead of using concatenated backslashed strings everywhere.

I'm sorry, but I could not get it to work with regular expressions :(

I'd start by stating (i.e. documenting) that the feature assumes that one script is just *one* transaction.

Note that pgbench somehow already assumes that one script is one
transaction when it reports performance anyway.

If you want 2 transactions, then you have to put them in two scripts,
which looks fine with me. Different transactions are expected to be
independent, otherwise they should be merged into one transaction.

Fixed

Under these restrictions, ISTM that a retry is something like:

   case ABORTED:
      if (we want to retry) {
         // do necessary stats
         // reset the initial state (random, vars, current command)
         state = START_TX; // loop
      }
      else {
         // count as failed...
         state = FINISHED; // or done.
      }
      break;
...
I'm fine with having END_COMMAND skipping to START_TX if it can be done
easily and cleanly, esp without code duplication.

I did not want to add additional if-expressions to most of the code in CSTATE_START_TX/CSTATE_END_TX/CSTATE_END_COMMAND, so CSTATE_FAILURE is used instead of CSTATE_END_COMMAND in case of failure, and CSTATE_RETRY is entered before CSTATE_END_TX if there was a failure during the current script execution.
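
A compact sketch of how those two states could slot into the client automaton (the CSTATE_* names follow pgbench, but the struct fields and helpers are assumed for illustration; this is not the actual patch code):

   #include <stdbool.h>

   typedef struct { unsigned short xseed[3]; } RandomState;

   typedef enum
   {
       CSTATE_START_TX,
       CSTATE_START_COMMAND,
       CSTATE_END_COMMAND,
       CSTATE_FAILURE,   /* replaces CSTATE_END_COMMAND after a failure */
       CSTATE_RETRY,     /* visited before CSTATE_END_TX on failure */
       CSTATE_END_TX
   } ClientState;

   typedef struct
   {
       ClientState state;
       int         command;             /* current script command */
       bool        failed;              /* failure in this script run? */
       int         tries;
       RandomState random_state;
       RandomState retry_random_state;  /* snapshot from CSTATE_START_TX */
   } CState;

   static const int max_tries = 3;

   static bool
   canRetry(CState *st)
   {
       return st->tries < max_tries;
   }

   static void
   advance(CState *st)
   {
       switch (st->state)
       {
           case CSTATE_FAILURE:
               /* remember the failure, but keep executing the script so
                * the failed transaction block is ended/rolled back */
               st->failed = true;
               st->command++;
               st->state = CSTATE_START_COMMAND;
               break;

           case CSTATE_RETRY:
               if (st->failed && canRetry(st))
               {
                   st->tries++;
                   st->random_state = st->retry_random_state;
                   /* ... also restore the variables here ... */
                   st->command = 0;
                   st->failed = false;
                   st->state = CSTATE_START_TX;  /* run the script again */
               }
               else
                   st->state = CSTATE_END_TX;    /* success, or give up */
               break;

           default:
               /* other states handled as before */
               break;
       }
   }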

ISTM that ABORTED & FINISHED are currently exactly the same. That would give a particular use to ABORTED. Also, there are many points where the code may go to the "aborted" state, so reusing it could help avoid duplicating stuff on each abort decision.

To end and roll back the failed transaction block, the script is always executed completely, and after a failure the following script commands are still executed.

[1] https://www.postgresql.org/message-id/alpine.DEB.2.20.1801031720270.20034%40lancre
[2] https://www.postgresql.org/message-id/alpine.DEB.2.20.1801121309300.10810%40lancre
[3] https://www.postgresql.org/message-id/alpine.DEB.2.20.1801121607310.13422%40lancre


--
Teodor Sigaev                                   E-mail: teo...@sigaev.ru
                                                   WWW: http://www.sigaev.ru/
