On Fri, Jan 20, 2012 at 15:53, Michael Clemmons <glassresis...@gmail.com> wrote:
>...
> I've decided to go back and clean up that approach and reapply it to the
> current master branch.  To my surprise I found that the ConnectionManager
> supports multiple connections and has the tools to add and remove them.
> Looking at the simple case of the RiakHttpTransport layer and the
> http_request method, it looks like it should grab a new connection (or an
> old one) and try it, and if it fails, move on to the next, putting the
> host/port pair at the bottom of the list.

Right. That is the intent.

I will note that a failure does not automatically remove a host/port
pair, but fails the particular request (after N retries). I'm not sure
that is entirely "the best strategy" (see below, ref: many
strategies), but the plumbing should be there for applications to
decide the proper behavior. It may be possible to settle on a
best/default strategy so that (most) applications do not have to get
involved.
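
To illustrate the shape of that behavior (none of these names are the
client's actual API; only the control flow matters here):

    # Rough sketch of "retry N times, then fail the request" behavior.
    # MAX_RETRIES, pool.take(), pool.give(), and pool.demote() are all
    # made-up names for this example.
    MAX_RETRIES = 3

    def request_with_retries(pool, method, path, body=None):
        last_error = None
        for attempt in range(MAX_RETRIES):
            conn = pool.take()          # grab a pooled (or fresh) connection
            try:
                return conn.request(method, path, body)
            except IOError as exc:
                last_error = exc
                # push the failing host/port pair to the bottom of the
                # list, but do NOT remove it; only this request fails
                pool.demote(conn.host, conn.port)
            finally:
                pool.give(conn)
        raise last_error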

> So it looked like all I needed to do was update the client code to accept a
> list of host/port pairs and it would just work, which sounded too easy to be
> true.  I tested it anyway, and if I use one host/port that is a working riak
> node, it connects and everything works.  If I include 2 nodes, one working
> and one a random port, it fails no matter the order.  So it's not just
> trying to connect to the first and failing; it's connecting to them all and
> failing if any fail.

Well, yeah. You started the thing up, saying all the host/port pairs
were proper. It is telling you they are not :-)

One question: when does the failure happen? At instantiation time, or
later at request time? On the first request, or some later request?

(as I recall, it should lazy-open all host/port pairs, so the failure
should not happen until later... and only when the pair is actually
used)
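
A toy example of what I mean by lazy-open (a made-up class, not the
client's code):

    import socket

    class LazyPool(object):
        """Registering pairs does not touch the network at all."""
        def __init__(self, pairs):
            self.pairs = list(pairs)    # just stored; nothing opened here

        def open_next(self):
            # a socket is only created when a request actually needs one,
            # so a bad pair fails here, not at construction time
            host, port = self.pairs[0]
            return socket.create_connection((host, port), 10)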

> Anyone have any idea of why it's built this way and what other solutions

The overall intent is to connect to (at least) one known working node.
That node can then be queried for "all" other known working nodes,
which are then added into the ConnectionManager (CM). The (long-lived)
process can then continue to monitor the status of the ring and make
corresponding updates to the CM.
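
In rough pseudo-code, that bootstrap idea looks something like this
(the /stats endpoint exposing ring_members is from memory, and
cm.add_host() is a made-up method standing in for whatever the CM
actually provides):

    import json
    import urllib2

    def bootstrap_ring(cm, seed_host, seed_port=8098):
        # ask one known-good node which nodes are in the ring
        stats = json.load(urllib2.urlopen(
            "http://%s:%d/stats" % (seed_host, seed_port)))
        for member in stats.get("ring_members", []):
            # entries look like 'riak@10.0.0.5'; keep only the host part
            host = member.split("@", 1)[-1]
            cm.add_host(host, seed_port)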

The code does not (yet) have a well-defined process for *removal* of
non-working nodes. That is a complex application-level decision.
Should it remove the host/port permanently? If it is just a network
glitch, or the particular host is having transient issues, then maybe
the pair should be kept around (but unused) and re-installed in a
minute or two when the host starts replying again. Or maybe you just
remove the pair and wait for a general background monitoring thread to
notice the node is back and reinstall the pair.
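
One possible shape for the "keep it around but unused, retry it later"
policy, with every name below being hypothetical:

    import time

    class Quarantine(object):
        """Bench a failing host/port pair and offer it back later."""
        def __init__(self, retry_after=120):
            self.retry_after = retry_after
            self.benched = {}           # (host, port) -> time it was benched

        def bench(self, host, port):
            self.benched[(host, port)] = time.time()

        def due_for_retry(self):
            now = time.time()
            return [pair for pair, when in self.benched.items()
                    if now - when >= self.retry_after]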

For a single-threaded, short-lived application, the multiple host/port
pair capability is not very useful. That functionality is really
necessary for multiple threads and/or long-lived processes. In this
scenario, as existing connections become busy, the CM will spin up new
connections for threads to use for their operations. (the
underlying connections are persistent and reused until the server
decides to close them, at which point the client will attempt to
reopen the connection and use it again)
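
From a worker thread's point of view the pattern is basically
check-out / check-in; roughly (take()/give() are stand-ins for
whatever the CM exposes, not its real method names):

    def run_operation(cm, op):
        conn = cm.take()        # reuse an idle connection, or open a new one
        try:
            result = op(conn)   # the underlying HTTP connection stays open
            cm.give(conn)       # hand it back so other threads can reuse it
            return result
        except Exception:
            conn.close()        # the server (or network) broke it; discard
            raise               # the next take() will open a fresh one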

What happens when you give the CM a list of *working* host/port pairs?
Does that still fail for you? It is true that when one goes down,
some level of the stack should remove the pair, but "which level" just
hasn't been decided.

There is also a "monitor" concept that has been sketched out in the
code, but not implemented. See riak/transports/monitor.py. That should
be used in a long-running application to periodically hit the riak
servers, querying what nodes are in the ring, and adding new ones and
removing broken ones. I sketched it out, but neither I nor anybody
else has done further work on that logic.
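
A rough shape for such a monitor might be (pure sketch; the fetch
callback and the add/remove methods are assumptions, since none of
this is implemented):

    import threading
    import time

    class RingMonitor(threading.Thread):
        def __init__(self, cm, fetch_ring_members, interval=60):
            threading.Thread.__init__(self)
            self.daemon = True
            self.cm = cm
            self.fetch = fetch_ring_members   # returns [(host, port), ...]
            self.interval = interval

        def run(self):
            known = set()
            while True:
                current = set(self.fetch())
                for host, port in current - known:
                    self.cm.add_host(host, port)      # node joined the ring
                for host, port in known - current:
                    self.cm.remove_host(host, port)   # node left or went dark
                known = current
                time.sleep(self.interval)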

> people have worked out on their own?  My intent is to do this so it merges
> cleanly with the current master, and doesn't introduce unnecessary change,
> to increase the likelihood of a successful pull request.

For production code, some of this host/port pair management needs to
be done. It would be nice to have the monitor (thread) completed, but
that may not be appropriate for your application.

I think that Brett's work on timeouts is necessary for production
code. The key decision point here is Python compatibility support. If
the library requires 2.7, then it should be quite easy to merge his
changes. I think (but don't recall offhand) that the timeout parameter
for HTTPConnection might be available in 2.6, but I definitely know it
is not available for Python 2.5. When I began my work on the client, I
was targeting 2.5 and made many compatibility changes with that in
mind. This was primarily to support my 2.5-based dev environment, even
though I was going to deploy to 2.7. I eventually upgraded my dev
environment, so compatibility isn't a huge concern for me any more. I
would leave that decision to the Basho folks, who are controlling the
decisions and guidance for the client.
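
For reference, the kind of version guard involved (the timeout
argument to HTTPConnection arrived in 2.6, as far as I can tell, so
older interpreters have to fall back to no timeout):

    import sys
    import httplib

    def make_connection(host, port, timeout=None):
        if timeout is not None and sys.version_info >= (2, 6):
            return httplib.HTTPConnection(host, port, timeout=timeout)
        return httplib.HTTPConnection(host, port)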

Hopefully, that gives you a good background on the current design and
the thinking around it. I'd be more than happy to elaborate further on
the choices made... please just ask!

Cheers,
-g

