On 2/12/18 2:04 AM, Jacques-Henri Berthemet wrote:
Mahdi, you don’t need to re-read at CL ONE on line 9. When an LWT
statement is not applied, the values that prevented it from being
applied are returned as part of the response; I’d expect them to be
more consistent than your read. I’m not 100% sure that’s the case for
2.0.x, but it is for Cassandra 2.2.
Yes. That's an optimization that can be added. I need to check that it
works properly with the version of cassandra that I'm running. Right
now, we have line 9 done at a SERIAL consistency and the issue still
happens.
And it’s the same for line 1: you should keep only your LWT statement
unless you get a huge performance benefit from the extra read. In
Cassandra, doing a read before a write is a bad pattern.
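A minimal sketch of the two suggestions above (drop the read on line 1, and use the row returned by the failed LWT instead of re-reading on line 9). `FakeHashIdTable` is a hypothetical in-memory stand-in for the `hash_id` table, not a Cassandra client; it only mimics the shape of an `INSERT ... IF NOT EXISTS` response (an applied flag plus the existing row when not applied), and `get_or_create_id` is an illustrative name.

```python
import threading
from typing import Dict, Optional, Tuple


class FakeHashIdTable:
    """Hypothetical in-memory stand-in for the hash_id table (assumption)."""

    def __init__(self) -> None:
        self._rows: Dict[str, str] = {}
        self._lock = threading.Lock()

    def insert_if_not_exists(self, hash_: str, id_: str) -> Tuple[bool, Optional[str]]:
        """Mimics INSERT ... IF NOT EXISTS: returns (applied, existing_id)."""
        with self._lock:
            existing = self._rows.get(hash_)
            if existing is None:
                self._rows[hash_] = id_
                return True, None
            # Not applied: hand back the value that blocked the insert.
            return False, existing


def get_or_create_id(table: FakeHashIdTable, computed_hash: str, new_id: str) -> str:
    """Single conditional insert; no read before it and no re-read after it."""
    applied, existing_id = table.insert_if_not_exists(computed_hash, new_id)
    if applied:
        return new_id
    assert existing_id is not None
    return existing_id
```

With this shape, a caller that loses the race gets the winner's id from the same response, so there is no separate read whose consistency level could be wrong.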
I'll be trying this next and seeing if the issue disappears when we
change it to serial. Although, I still don't understand how this would
cause any inconsistencies. In the worst case, a non serial read would
return no rows for the specified primary key which I handle by trying to
do an LWT insert. If it's returning a result, I assume that result will
be the row that the winning lightweight transaction has written. I think
that assumption may not be correct all the time and I would love to
understand why that is the case.
--
Mahdi.
AFAIK an LWT statement is always executed as SERIAL; the only choice
you have is between SERIAL and LOCAL_SERIAL.
Regards,
*--*
*Jacques-Henri Berthemet*
*From:* DuyHai Doan [mailto:doanduy...@gmail.com]
*Sent:* Sunday, February 11, 2018 6:11 PM
*To:* user <user@cassandra.apache.org>
*Subject:* Re: LWT broken?
Mahdi , the issue in your code is here:
else // we lost LWT, fetch the winning value
9 existing_id = SELECT id FROM hash_id WHERE hash=computed_hash |
consistency = ONE
You lost the LWT, which means a concurrent LWT has won the
Paxos round and has applied its value using QUORUM/SERIAL.
In the best case, the winning LWT value has been applied to at
least 2 replicas out of 3 (assuming RF=3).
In the worst case, the winning LWT value has not yet been applied,
or is still pending, on any replica.
Now, if you immediately read with CL=ONE, you may:
1) Read the stale value on the 3rd replica, which has not yet received
the winning LWT value
2) Or worse, read a stale value because the winning LWT is still being
applied when the read is made
That's the main reason reading with CL=SERIAL is recommended
(CL=QUORUM is not sufficient).
Reading with CL=SERIAL will:
a. like QUORUM, contact a strict majority of replicas
b. unlike QUORUM, look for a validated (but not yet applied) value
from a previous Paxos round and force-apply it before actually reading
the new value
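To make points a and b concrete, here is a toy model, not Cassandra's actual implementation. It assumes three replicas, each holding a committed value (what ordinary reads see) and an accepted-but-uncommitted Paxos value. The `Replica` class and the `read_*` functions are hypothetical names, and ballots, timestamps and read repair are deliberately ignored.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    committed: Optional[str] = None  # value visible to ordinary reads
    accepted: Optional[str] = None   # Paxos-accepted value, not yet committed


def read_one(replicas: List[Replica], index: int) -> Optional[str]:
    """CL=ONE: trust whichever single replica answers."""
    return replicas[index].committed


def read_quorum(replicas: List[Replica], contacted: List[int]) -> Optional[str]:
    """CL=QUORUM: ask a majority, but only look at committed data."""
    assert len(contacted) > len(replicas) // 2
    values = [replicas[i].committed for i in contacted]
    return next((v for v in values if v is not None), None)


def read_serial(replicas: List[Replica], contacted: List[int]) -> Optional[str]:
    """CL=SERIAL: ask a majority, finish any in-flight round, then read."""
    assert len(contacted) > len(replicas) // 2
    pending = next(
        (replicas[i].accepted for i in contacted if replicas[i].accepted is not None),
        None,
    )
    if pending is not None:
        # Force-apply the accepted value before answering the read.
        for i in contacted:
            replicas[i].committed = pending
            replicas[i].accepted = None
    return read_quorum(replicas, contacted)


# Worst case from the text: the winning value is accepted on two
# replicas but has not been committed anywhere yet.
replicas = [Replica(accepted="id-A"), Replica(accepted="id-A"), Replica()]
stale_one = read_one(replicas, 2)
stale_quorum = read_quorum(replicas, [1, 2])
fresh_serial = read_serial(replicas, [1, 2])
```

In this model the ONE and QUORUM reads both return nothing, because no replica has committed the value yet, while the SERIAL read completes the pending round and returns `id-A`.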
On Sun, Feb 11, 2018 at 5:36 PM, Mahdi Ben Hamida <ma...@signalfx.com
<mailto:ma...@signalfx.com>> wrote:
Totally understood that it's not worth (or it's rather incorrect)
to mix serial and non serial operations for LWT tables. It would
be highly satisfying to my engineer mind if someone can explain
why that would cause issues in this particular situation. The only
explanation I have is that a non serial read may cause a read
repair to happen and that could interfere with a concurrent serial
write, although I still can't explain how that would cause two
different "insert if not exist" transactions to both succeed.
--
Mahdi.
On 2/9/18 2:40 PM, Jonathan Haddad wrote:
If you want consistent reads you have to use the CL that
enforces it. There’s no way around it.
On Fri, Feb 9, 2018 at 2:35 PM Mahdi Ben Hamida
<ma...@signalfx.com <mailto:ma...@signalfx.com>> wrote:
In this case, we only write using CAS (code guarantees
that). We also never update, just insert if not exist.
Once a hash exists, it never changes (it may get deleted
later and that'll be a CAS delete as well).
--
Mahdi.
On 2/9/18 1:38 PM, Jeff Jirsa wrote:
On Fri, Feb 9, 2018 at 1:33 PM, Mahdi Ben Hamida
<ma...@signalfx.com <mailto:ma...@signalfx.com>> wrote:
Under what circumstances would we be reading
inconsistent results? Is there a case where we
end up reading a value that actually ends up not
being written?
If you ever write the same value with CAS and without
CAS (different code paths both updating the same
value), you're using CAS wrong, and inconsistencies
can happen.