We seem to be having a fundamental misunderstanding. Thanks for your
comments. aj
On 7/3/2011 8:28 PM, William Oberman wrote:
I'm using cassandra as a tool, like a black box with a certain
contract to the world. Without modifying the "core", C* will send the
updates to all replicas, so your plan would cause the extra write (for
the placeholder). I wasn't assuming a modification to how C*
fundamentally works.
Sounds like you are hacking on (or at least looking at) the source, so
all the power to you if/when you try these kinds of changes.
will
On Sun, Jul 3, 2011 at 8:45 PM, AJ <a...@dude.podzone.net> wrote:
On 7/3/2011 6:32 PM, William Oberman wrote:
Was just going off of: "Send the value to the primary replica
and send placeholder values to the other replicas." Sounded to me like
you wanted to write the value to one and write the placeholder to N-1.
Yes, that is what I was suggesting. The point of the placeholders
is to handle the crash case that I talked about... "like" a WAL does.
But, C* will propagate the value to N-1 eventually anyway,
'cause that's just what it does :-)
will
On Sun, Jul 3, 2011 at 7:47 PM, AJ <a...@dude.podzone.net> wrote:
On 7/3/2011 3:49 PM, Will Oberman wrote:
Why not send the value itself instead of a placeholder? Now
it takes 2x writes on a random node to do a single update
(write placeholder, write update) and N writes from the
client (write the value, write the placeholder to N-1), where N is
the replication factor. Seems like extra network and IO instead
of less...
Sending the value to each node is 1) unnecessary and 2) would
only cause a large burst of network traffic. Think about if
it's a large data value, such as a document. Just let C* do
its thing. The extra messages are tiny and don't
significantly increase latency since they are all sent
asynchronously.
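(For what it's worth, a rough back-of-the-envelope comparison of client-side bytes sent per update under the two options; the replication factor and value size below are made-up numbers, purely for illustration.)

# Python sketch: bytes a client pushes onto the wire for one update.
RF = 3                                  # assumed replication factor
VALUE_BYTES = 1000000                   # assume a large 1 MB document value
PLACEHOLDER_BYTES = len(b"PENDING_UPDATE")

# Option A: the client writes the full value to every replica itself.
full_fanout = RF * VALUE_BYTES

# Option B: the client writes the value once (to the primary) plus tiny
# placeholders to the other RF - 1 replicas; C* replicates the value itself.
placeholder_scheme = VALUE_BYTES + (RF - 1) * PLACEHOLDER_BYTES

print("full fan-out from client:", full_fanout, "bytes")
print("placeholder scheme:      ", placeholder_scheme, "bytes")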
Of course, I still think this sounds like reimplementing
Cassandra internals in a Cassandra client (just guessing,
I'm not a cassandra dev)
I don't see how. Maybe you should take a peek at the source.
On Jul 3, 2011, at 5:20 PM, AJ <a...@dude.podzone.net> wrote:
Yang,
How would you deal with the problem where the 1st node
responds with success but then crashes before completely
forwarding the update to any replicas? Then, after switching to the next
primary, a read would return stale data.
Here's a quick-and-dirty way: send the value to the primary
replica and send placeholder values to the other replicas.
The placeholder value is something like "PENDING_UPDATE".
The placeholder values are sent with timestamps 1 less than
the timestamp of the actual value that went to the
primary. Later, when the change propagates, the actual
value will overwrite the placeholders. In the event of a
crash before a placeholder gets overwritten, the next
read will tell the client so, and the client will report
to the user that the key/column is unavailable. The
downside is that you've overwritten your data and might
like to know what the old data was! But maybe there's
another way using other columns or with MVCC. The client
would want a success from the primary and the secondary
replicas to be certain of future read consistency in case
the primary goes down immediately, as I said above. The
ability to set an "update_pending" flag on any column value
would probably make this work. But I'll think more on
this later.
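To make the ordering concrete, here is a minimal Python sketch of the timestamp trick, modeling each replica as an in-memory last-write-wins map instead of a real Cassandra client (every name below is hypothetical, purely illustrative):

PENDING = "PENDING_UPDATE"

class Replica:
    """Toy stand-in for one replica: column -> (value, timestamp)."""
    def __init__(self):
        self.store = {}

    def write(self, column, value, ts):
        # Last write wins by timestamp, like Cassandra column reconciliation.
        current = self.store.get(column)
        if current is None or ts > current[1]:
            self.store[column] = (value, ts)

    def read(self, column):
        return self.store.get(column, (None, None))[0]

def client_update(primary, others, column, value, ts):
    # Placeholders carry ts - 1, so the real value always wins once it arrives.
    primary.write(column, value, ts)
    for replica in others:
        replica.write(column, PENDING, ts - 1)

def client_read(replica, column):
    value = replica.read(column)
    if value == PENDING:
        raise RuntimeError("column unavailable: update never finished propagating")
    return value

# Normal case: replication delivers the value and it overwrites the placeholder.
p1, r2 = Replica(), Replica()
client_update(p1, [r2], "doc", "v1", ts=100)
r2.write("doc", "v1", 100)        # normal C* propagation arriving later
assert client_read(r2, "doc") == "v1"

# Crash case: the primary dies before forwarding, so a reader sees the
# placeholder instead of silently getting stale data.
p2, r3 = Replica(), Replica()
client_update(p2, [r3], "doc", "v2", ts=200)
try:
    client_read(r3, "doc")
except RuntimeError as err:
    print(err)

The point is only that a placeholder stamped ts-1 can never shadow the real value once normal replication delivers it, yet it does surface an interrupted update to the reader.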
aj
On 7/2/2011 10:55 AM, Yang wrote:
there is a JIRA completed in 0.7.x that "Prefers" a
certain node in snitch, so this does roughly what you want
MOST of the time
but the problem is that it does not GUARANTEE that the
same node will always be read. I recently read into the
HBase vs Cassandra comparison thread that started after
Facebook dropped Cassandra for their messaging system, and
understood some of the differences. what you want is
essentially what HBase does. the fundamental difference
there is really due to the gossip protocol: it's a
probablistic, or eventually consistent failure detector
while HBase/Google Bigtable use Zookeeper/Chubby to
provide a strong failure detector (a distributed lock).
so in HBase, if a tablet server goes down, it really goes
down, it can not re-grab the tablet from the new tablet
server without going through a start up protocol
(notifying the master, which would notify the clients
etc), in other words it is guaranteed that one tablet is
served by only one tablet server at any given time. in
comparison the above JIRA only TRYIES to serve that key
from one particular replica. HBase can have that guarantee
because the group membership is maintained by the strong
failure detector.
Just out of hacking curiosity: a strong failure detector +
Cassandra replicas is not impossible (actually it seems not
difficult), although the performance is not clear. What
would such a strong failure detector bring to Cassandra
besides this ONE-ONE strong consistency? That is an
interesting question, I think.
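Just to illustrate what the "strong" part could look like, here is a sketch using ZooKeeper ephemeral nodes via the kazoo Python client; the znode paths and addresses are invented, and it assumes a ZooKeeper server on 127.0.0.1:2181:

import time
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Each replica registers an ephemeral znode tied to its ZooKeeper session.
# If the process dies or its session expires, the znode vanishes.
zk.create("/cassandra/alive/node-1", b"10.0.0.1", ephemeral=True, makepath=True)

# Anyone watching the membership gets a definitive "it is really down" signal
# the moment an ephemeral node disappears, unlike the probabilistic gossip view.
def on_membership_change(children):
    print("live replicas:", children)

zk.ChildrenWatch("/cassandra/alive", on_membership_change)

time.sleep(5)   # pretend to serve requests; killing the process here would
                # remove node-1 and fire the watch on every observer
zk.stop()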
Considering that HBase has been deployed on big clusters,
the performance of the strong ZooKeeper failure detector
is probably OK. Then a further question: why did Dynamo
originally choose the probabilistic failure detector? Yes,
Dynamo's main theme is "eventually consistent", so the
Phi failure detector is **enough**, but if a strong
detector buys us more at little cost, wouldn't that be
great?
On Fri, Jul 1, 2011 at 6:53 PM, AJ <a...@dude.podzone.net> wrote:
Is this possible?
All reads and writes for a given key will always go to
the same node from a client. It seems the only thing
needed is to allow the clients to compute which node
is the closest replica for the given key, using the same
algorithm C* uses. When the first replica receives
the write request, it will write to itself, which
should complete before any of the other replicas, and
then return. The loads should still stay balanced if
using the random partitioner. If the first replica
becomes unavailable (however that is defined), then
the clients can send to the next replica in the ring
and switch from ONE writes/reads to QUORUM writes/reads
temporarily until the first replica becomes available
again. QUORUM is required since there could be some
replicas that were not updated before the first replica
went down.

Will this work? The goal is to have strong
consistency with a read/write consistency level as low
as possible, while secondarily getting a network performance boost.
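For what it's worth, a rough Python sketch of the client-side routing idea, assuming a RandomPartitioner-style MD5 token and SimpleStrategy-like placement; the ring, node names, and is_up check are invented for illustration and are not a real client API:

import hashlib
from bisect import bisect_left

# Toy ring: sorted (token, node) pairs. A real client would learn this from
# the cluster rather than hard-coding it.
RING = sorted([(0, "nodeA"), (2**125, "nodeB"), (2**126, "nodeC")])
RF = 3

def token(key):
    # RandomPartitioner-style token derived from the MD5 of the key.
    return int(hashlib.md5(key).hexdigest(), 16) % (2**127)

def replicas(key, rf=RF):
    # Walk clockwise from the first node whose token >= the key's token.
    toks = [t for t, _ in RING]
    start = bisect_left(toks, token(key)) % len(RING)
    return [RING[(start + i) % len(RING)][1] for i in range(rf)]

def route(key, is_up):
    # Prefer the first replica at ONE; if it looks down, fall back to QUORUM
    # against the remaining replicas until it comes back.
    primary, *rest = replicas(key)
    if is_up(primary):
        return primary, "ONE"
    return rest[0], "QUORUM"

print(replicas(b"some-row-key"))
print(route(b"some-row-key", is_up=lambda node: True))    # -> (primary, 'ONE')
print(route(b"some-row-key", is_up=lambda node: False))   # -> (next replica, 'QUORUM')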
--
Will Oberman
Civic Science, Inc.
3030 Penn Avenue, First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) ober...@civicscience.com