Newbie question 1/3
(An earlier post seems not to have gone through. My apologies in the eventual case of a duplicate.)

I'm thinking of using Riak to replace a large Oracle system, and I'm trying to understand its guarantees. I have a few introductory questions; this is the first of three.

I'm trying to understand the reliability of stored data. Imagine (for example) that I have 5 Riak hosts, and an n_val of 3. Imagine that each host is down 1% of the time (I bought the disks at a flood sale), and imagine that host failures are uncorrelated, and imagine that when hosts come back up, they stay up long enough to fully rejoin the service, and imagine that I haven't done any writes for a long while. Given these assumptions, I might naïvely assume that my data are available with a probability of about 99.999%, or down about 5 minutes a year. This would be great (perhaps).

Of course, this ignores the possibility that some of my data may not be replicated at all, perhaps even with all three copies on the same host. If all I know is that some data may not be replicated, then all I know is that (some of) my data may be unavailable as much as 3.65 days a year, which would not be nearly as great. I understand things probably won't be this bad, but "probably" isn't a probability.

Is this right? Is there anything I can do to guarantee higher reliability, short of setting n_val to 5?

Cheers,
John

___
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
Newbie question 2/3
(An earlier post seems not to have gone through. My apologies in the eventual case of a duplicate.)

I'm thinking of using Riak to replace a large Oracle system, and I'm trying to understand its guarantees. I have a few introductory questions; this is the second of three.

Imagine I do a write, and the write fails because it could not contact enough hosts. Am I right to imagine that the write may actually have persisted, and that the data might later be available for reading? Am I also right to imagine that the data, once read, might later vanish due to host failure, because it was persisted to fewer hosts than expected?

Cheers,
John
Newbie question 3/3
(An earlier post seems not to have gone through. My apologies in the eventual case of a duplicate.)

I'm thinking of using Riak to replace a large Oracle system, and I'm trying to understand its guarantees. I have a few introductory questions; this is the third of three.

I would like to do two updates atomically, but of course I cannot. I imagine I could construct my own redo log, and perform a sequence of operations something like:

    write redo log entry (timestamp, A's update, B's update) to redo log
    update A
    update B
    delete redo log entry from redo log

Asynchronously, I could read dangling entries from the redo log and repeat them, deleting them upon success. (Let's imagine for simplicity that the updates are idempotent and commutative.) This seems doable, but it's not pretty. Is this the best I can do? Or should I think about the problem differently?

(BTW, I believe that secondary indexes won't help me.)

Cheers,
John
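[The redo-log sequence above can be sketched in a few lines. This is only an illustration of the scheme as described; `kv_put` and the in-memory dicts are placeholders standing in for Riak client calls and buckets, not a real API:]

```python
# Sketch of the redo-log scheme: record intent first, apply both
# updates, then retire the log entry. A crash between steps leaves a
# dangling entry that an asynchronous repair pass can replay, which is
# safe only because the updates are assumed idempotent and commutative.
import time
import uuid

store = {}     # stand-in for the data bucket
redo_log = {}  # stand-in for the redo-log bucket

def kv_put(key, value):  # placeholder for a real client put
    store[key] = value

def atomic_pair_update(a_key, a_val, b_key, b_val):
    entry_id = str(uuid.uuid4())
    # 1. Write the redo log entry (timestamp, A's update, B's update).
    redo_log[entry_id] = (time.time(), a_key, a_val, b_key, b_val)
    # 2. Update A, then update B.
    kv_put(a_key, a_val)
    kv_put(b_key, b_val)
    # 3. Delete the redo log entry.
    del redo_log[entry_id]

def replay_dangling():
    # Asynchronous repair: repeat dangling entries, deleting on success.
    for entry_id, (_, a_key, a_val, b_key, b_val) in list(redo_log.items()):
        kv_put(a_key, a_val)
        kv_put(b_key, b_val)
        del redo_log[entry_id]

atomic_pair_update("A", 1, "B", 2)
```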
Re: Newbie question 2/3
Thanks for the reply, which confirms what I expected.

Let me explain why I asked. I have an application that my intuition says would be a good match to Riak, but I don't trust my intuition since I've never used Riak and I'm not sure I understand all of its failure modes. One thing I'm trying is to work through a mental model-checking exercise—which I might eventually turn over to a real model checker—which is making me wonder about all the things that can go wrong. A failed write that is visible anyway, either permanently or just for a while, is just one example.

In the long run, it would be great if Riak were documented perfectly and completely—and any other piece of software in the world too!—but in the meanwhile I'm just trying to build my own mental model. I'd prefer, of course, a mental model that does not depend on a detailed knowledge of Riak's internal workings, enumerating only the preconditions and postconditions of each operation. We'll see how far I can get.

Cheers,
John

On Jan 9, 2012, at 2:38 PM, John DeTreville wrote:

> Thank you very much for your reply. Longer response to follow.
>
> Cheers,
> John
>
> On Jan 9, 2012, at 2:33 PM, Ryan Zezeski wrote:
>
>> John,
>>
>> To your first question, yes, it is possible that the client may receive a failure response from Riak but the data could have persisted on some of the nodes. This is because a single write to Riak is actually N writes to N different partitions inside of Riak. These N writes are not atomic in relation to each other.
>>
>> As for your second question, it depends on what happens between the time of the "failed" write and the time the node(s) with the replicas go down. If some form of anti-entropy is employed before the node failure then the replicas should have been repaired and N copies should exist. Riak's main form of anti-entropy is read repair that occurs at read time (we also have a form of active anti-entropy between Riak clusters in our enterprise offering). If the object is read before node failure then read-repair will occur and repair all N replicas.
>>
>> An example might help. If N=3/W=2 and two partitions fail to write then the overall request will fail but the remaining W is successful. If you perform a read after this "failed" write then you may or may not see the new value depending on the R value and which partitions respond to the coordinator first. However, regardless of what is returned by that read the coordinator will stay alive a while longer in an attempt to perform read-repair. If read-repair is successful then you should have N copies and it will be like the write failure never occurred. If you hadn't performed that read and the replicas hadn't been repaired and the node containing the only replica went down and you did a read then you would get the old value or a not_found (depending on if a value existed for that key before the write).
>>
>> -Ryan
>>
>> On Mon, Jan 9, 2012 at 12:32 AM, John DeTreville wrote:
>> (An earlier post seems not to have gone through. My apologies in the eventual case of a duplicate.)
>>
>> I'm thinking of using Riak to replace a large Oracle system, and I'm trying to understand its guarantees. I have a few introductory questions; this is the second of three.
>>
>> Imagine I do a write, and the write fails because it could not contact enough hosts. Am I right to imagine that the write may actually have persisted, and that the data might later be available for reading? Am I also right to imagine that the data, once read, might later vanish due to host failure, because it was persisted to fewer hosts than expected?
>>
>> Cheers,
>> John
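[Ryan's N=3/W=2 example can be reduced to a toy model. This is an illustration of the request semantics he describes — a write is N independent per-partition writes, and the request "fails" if fewer than W acknowledge — not Riak's actual code:]

```python
# Toy model of quorum writes: the client-visible result depends on W,
# but each surviving replica keeps whatever it wrote, so a "failed"
# write can still leave the new value behind.
N, W = 3, 2

def quorum_write(replicas, value, up):
    # up[i] says whether partition i accepted its write.
    acks = 0
    for i in range(N):
        if up[i]:
            replicas[i] = value
            acks += 1
    return acks >= W  # request-level success

replicas = ["old"] * N
ok = quorum_write(replicas, "new", up=[True, False, False])
print(ok)        # False: 1 ack < W=2, so the client sees a failure...
print(replicas)  # ['new', 'old', 'old']: ...yet the value persisted
```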
Re: Newbie question 3/3
Right, I certainly don't want distributed transactions, which I agree would destroy availability. (I should add that my system is geographically distributed, making everything much worse.) Still, that leaves open the question of doing what my application needs without transactions.

Let's consider two situations involving updates.

The first situation is when I can reduce an update to a single write, such as by using Riak's secondary indexes. Unfortunately, I don't have a great understanding of the performance of secondary indexes, and I don't have a great understanding of their failure modes. Can you offer any guidance?

The second situation is when I really need to do multiple writes, in which case I must model (some subset of) transactional semantics at the application level. One example is implementing my own redo log, as mentioned earlier. Have other users ever had such problems? What are the good ways to solve them? Heck, what are the bad ways (just so I'll know what to avoid)?

Cheers,
John

On Jan 9, 2012, at 2:54 PM, Ryan Zezeski wrote:

> John,
>
> As you already seem to understand, Riak doesn't provide a way to make multiple ops atomic. Part of the reason is that Riak's main focus thus far has been availability. Distributed transactions would work, but at the cost of availability. I think a flaw with the redo log approach is that you need to serialize all operations to A & B through _one_ client to keep from reading an inconsistent state.
>
> A much simpler option, if you can bend your data, is to combine A and B into one object.
>
> -Ryan
>
> On Mon, Jan 9, 2012 at 12:33 AM, John DeTreville wrote:
> (An earlier post seems not to have gone through. My apologies in the eventual case of a duplicate.)
>
> I'm thinking of using Riak to replace a large Oracle system, and I'm trying to understand its guarantees. I have a few introductory questions; this is the third of three.
>
> I would like to do two updates atomically, but of course I cannot. I imagine I could construct my own redo log, and perform a sequence of operations something like:
>
>     write redo log entry (timestamp, A's update, B's update) to redo log
>     update A
>     update B
>     delete redo log entry from redo log
>
> Asynchronously, I could read dangling entries from the redo log and repeat them, deleting them upon success. (Let's imagine for simplicity that the updates are idempotent and commutative.) This seems doable, but it's not pretty. Is this the best I can do? Or should I think about the problem differently?
>
> (BTW, I believe that secondary indexes won't help me.)
>
> Cheers,
> John
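[Ryan's simpler alternative — combining A and B into one object — can be sketched like this. `kv_put` and the JSON encoding are illustrative placeholders, not a real client API; the point is just that one object means one put, and a single put is atomic with respect to itself:]

```python
# Combine two logically related values into one stored object, so both
# change together in a single write instead of two.
import json

store = {}

def kv_put(key, value):  # placeholder for a real client put
    store[key] = value

# Instead of separate keys "A" and "B" updated by two writes...
kv_put("A+B", json.dumps({"a": 1, "b": 2}))  # ...one combined object

print(json.loads(store["A+B"]))  # {'a': 1, 'b': 2}
```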
Re: Newbie question 2/3
Let me elaborate a tiny bit more.

Consider the write(2) syscall on Unix and its lookalikes. If it succeeds, it returns the number of bytes written. If it fails, it returns -1. One must sometimes learn the hard way that some bytes may have been written even in the case of failure, but that there is no way to know how many. Interesting!

I'm just trying to accelerate my learning process with Riak.

Cheers,
John

On Jan 9, 2012, at 3:25 PM, John DeTreville wrote:

> Thanks for the reply, which confirms what I expected.
>
> Let me explain why I asked. I have an application that my intuition says would be a good match to Riak, but I don't trust my intuition since I've never used Riak and I'm not sure I understand all of its failure modes. One thing I'm trying is to work through a mental model-checking exercise—which I might eventually turn over to a real model checker—which is making me wonder about all the things that can go wrong. A failed write that is visible anyway, either permanently or just for a while, is just one example.
>
> In the long run, it would be great if Riak were documented perfectly and completely—and any other piece of software in the world too!—but in the meanwhile I'm just trying to build my own mental model. I'd prefer, of course, a mental model that does not depend on a detailed knowledge of Riak's internal workings, enumerating only the preconditions and postconditions of each operation. We'll see how far I can get.
>
> Cheers,
> John
>
> On Jan 9, 2012, at 2:38 PM, John DeTreville wrote:
>
>> Thank you very much for your reply. Longer response to follow.
>>
>> Cheers,
>> John
>>
>> On Jan 9, 2012, at 2:33 PM, Ryan Zezeski wrote:
>>
>>> John,
>>>
>>> To your first question, yes, it is possible that the client may receive a failure response from Riak but the data could have persisted on some of the nodes. This is because a single write to Riak is actually N writes to N different partitions inside of Riak. These N writes are not atomic in relation to each other.
>>>
>>> As for your second question, it depends on what happens between the time of the "failed" write and the time the node(s) with the replicas go down. If some form of anti-entropy is employed before the node failure then the replicas should have been repaired and N copies should exist. Riak's main form of anti-entropy is read repair that occurs at read time (we also have a form of active anti-entropy between Riak clusters in our enterprise offering). If the object is read before node failure then read-repair will occur and repair all N replicas.
>>>
>>> An example might help. If N=3/W=2 and two partitions fail to write then the overall request will fail but the remaining W is successful. If you perform a read after this "failed" write then you may or may not see the new value depending on the R value and which partitions respond to the coordinator first. However, regardless of what is returned by that read the coordinator will stay alive a while longer in an attempt to perform read-repair. If read-repair is successful then you should have N copies and it will be like the write failure never occurred. If you hadn't performed that read and the replicas hadn't been repaired and the node containing the only replica went down and you did a read then you would get the old value or a not_found (depending on if a value existed for that key before the write).
>>>
>>> -Ryan
>>>
>>> On Mon, Jan 9, 2012 at 12:32 AM, John DeTreville wrote:
>>> (An earlier post seems not to have gone through. My apologies in the eventual case of a duplicate.)
>>>
>>> I'm thinking of using Riak to replace a large Oracle system, and I'm trying to understand its guarantees. I have a few introductory questions; this is the second of three.
>>>
>>> Imagine I do a write, and the write fails because it could not contact enough hosts. Am I right to imagine that the write may actually have persisted, and that the data might later be available for reading? Am I also right to imagine that the data, once read, might later vanish due to host failure, because it was persisted to fewer hosts than expected?
>>>
>>> Cheers,
>>> John
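[The write(2) behavior John describes — a call that may write fewer bytes than requested — has a standard defensive pattern: loop and track progress. A small sketch using Python's `os.write`, which exposes the same short-write semantics:]

```python
# write(2) may perform a short write; the caller must account for how
# many bytes actually went out and retry with the remainder.
import os

def write_all(fd, data):
    # Keep calling os.write until every byte is accounted for.
    view = memoryview(data)
    while view:
        n = os.write(fd, view)  # returns the number of bytes written
        view = view[n:]

r, w = os.pipe()
write_all(w, b"hello")
os.close(w)
print(os.read(r, 1024))  # b'hello'
```

(Note the failure case John raises is worse than the short write handled here: if `os.write` raises, there is no return value at all, so the caller cannot know how much of the buffer was written.)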
Re: Newbie question 1/3
Excellent answer; thank you.

I imagine the unavailability I see will depend strongly on the speed of read repairs. Since I have quite a lot of data, I imagine that they might be quite slow, but I probably can't say more than that without real measurements.

A related question. You say that if my n_val is 3, some data may reside only on 2 physical nodes. Ignoring failures, might some of it reside on just one node?

Cheers,
John
Re: Newbie question 1/3
That's good to know; thanks.

I imagine I may have to vary my physical node count as time goes by, and I'm wondering how much planning ahead that might take.

Going by your example, if my n_val is 4 and an object hashes to partition 6, then my object will be stored only on two physical nodes, right?

In my system (as in many), some objects are much more important than others, although I unfortunately don't know which are which until after the fact. Having two server failures in rapid succession is not all that uncommon, so I might have to use an n_val of 5 to guarantee storage on 3 physical nodes, right?

Cheers,
John
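[The partition-6 example can be made concrete with a toy model of the ring. This is a deliberate simplification — round-robin partition ownership, with replicas on the next n_val partitions — chosen to reproduce the repeated-node effect under discussion; Riak's actual claim algorithm tries to spread replicas across nodes but, as this thread notes, does not guarantee n_val distinct physical nodes:]

```python
# Toy ring: 8 partitions claimed round-robin by 2 physical nodes.
# An object's preflist is the n_val consecutive partitions starting at
# its home partition; counting distinct owners shows replica overlap.
RING_SIZE = 8
NODES = 2
N_VAL = 4

def preflist(partition):
    # (partition, owning node) for each of the n_val replicas.
    parts = [(partition + i) % RING_SIZE for i in range(N_VAL)]
    return [(p, p % NODES) for p in parts]

pl = preflist(6)
owners = {node for _, node in pl}
print(pl)           # [(6, 0), (7, 1), (0, 0), (1, 1)]
print(len(owners))  # 2 -- four replicas, but only two physical nodes
```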
Re: Non-standard ring size
That's a pity: if the number of physical hosts is not also a power of two, some records will reside on fewer than n_val physical hosts.

Cheers,
John

On Jan 19, 2012, at 8:22 AM, Sean Cribbs wrote:

> The ring size must be a power of two because it must evenly divide 2^160 (the size of our consistent hashing space), which is not divisible by 3. Using a non-power-of-two ring size will have unknown or unpredictable effects.
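[Sean's divisibility point checks out with simple arithmetic: partitions are equal slices of the 2^160 hashing space, so the ring size must divide 2^160 evenly, and 2^160 has no odd factor greater than 1:]

```python
# Which candidate ring sizes evenly divide the 2^160 hash space?
space = 2 ** 160

for ring_size in (3, 5, 64, 256):
    print(ring_size, space % ring_size == 0)
# 3 False
# 5 False
# 64 True
# 256 True
```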
Re: Delta updates using pre-commit hooks
I for one would find the availability of CRDTs to be very interesting. Good luck!

Cheers,
John

On Jan 18, 2012, at 5:36 AM, Marek Zawirski wrote:

> Thanks for your answer with a bunch of useful info. I should have introduced ourselves better given your replies. In fact we authored the paper on CRDTs that Ryan mentioned and we continue to work in this area. I am putting some relevant people in CC. What we try now is to experiment with CRDTs on top of Riak.
OutOfMemoryError in Java client
I have a simple single-threaded Java client for Riak that consistently runs out of memory creating threads.

    java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:658)
        at java.util.concurrent.ThreadPoolExecutor.addIfUnderCorePoolSize(ThreadPoolExecutor.java:703)
        at java.util.concurrent.ThreadPoolExecutor.prestartCoreThread(ThreadPoolExecutor.java:1381)
        at java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:222)
        at java.util.concurrent.ScheduledThreadPoolExecutor.scheduleWithFixedDelay(ScheduledThreadPoolExecutor.java:443)
        at com.basho.riak.pbc.RiakConnectionPool.doStart(RiakConnectionPool.java:232)
        at com.basho.riak.pbc.RiakConnectionPool.access$100(RiakConnectionPool.java:41)
        at com.basho.riak.pbc.RiakConnectionPool$State$1.start(RiakConnectionPool.java:58)
        at com.basho.riak.pbc.RiakConnectionPool.start(RiakConnectionPool.java:227)
        at com.basho.riak.pbc.RiakClient.<init>(RiakClient.java:90)
        at com.basho.riak.pbc.RiakClient.<init>(RiakClient.java:81)
        at com.basho.riak.client.raw.pbc.PBClientAdapter.<init>(PBClientAdapter.java:91)
        at com.basho.riak.client.RiakFactory.pbcClient(RiakFactory.java:107)

The client is a JUnit test for some data structures I'm storing in Riak. When I run it, my Java client process starts about 2028 native threads before it collapses. This JUnit test creates a moderately large number of IRiakClient objects, but only one at a time. It does not close them, as there is no method for doing so.

This happens with Riak 1.0.2 and with Riak 1.1.0RC2. As I've said, the client is single-threaded.

Any ideas?

Cheers,
John
Re: OutOfMemoryError in Java client
Excellent! I had imagined it was something like this, but it's nice to see it confirmed. My real code is not so profligate with IRiakClient objects, of course, but it was a surprise to see this pop up in JUnit tests.

Thanks very much for the quick answer.

Cheers,
John