If you continue to pursue Riak for a distributed FS, and you have any resources to toss at development, it may be possible to build a FUSE driver that acts as a Riak client. FUSE (Filesystem in Userspace) works on most any Linux/BSD variant (including Mac OS X).
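To make the idea concrete, here is a very rough, read-only sketch of what such a driver could look like in Python, assuming the fusepy package and the Basho Python client. The RiakFS class, the 'fs' bucket name and the exact client attribute names are my own guesses for illustration, not an existing driver, and may need adjusting for your client version:

#!/usr/bin/env python
# Hypothetical sketch only: expose one Riak bucket as a read-only filesystem.
# Assumes the 'fusepy' package and the Basho 'riak' Python client; method and
# attribute names (exists, encoded_data, get_keys) vary between client
# versions, so treat this as pseudocode for the shape of the driver.
import errno
import stat
import sys

import riak                                   # Basho Riak Python client
from fuse import FUSE, FuseOSError, Operations


class RiakFS(Operations):
    """Maps each key in a single Riak bucket to a file in the mount root."""

    def __init__(self, bucket_name='fs'):
        self.client = riak.RiakClient()       # default: local node
        self.bucket = self.client.bucket(bucket_name)

    def _fetch(self, path):
        obj = self.bucket.get(path.lstrip('/'))
        if not obj.exists:                    # property in newer clients, method in older ones
            raise FuseOSError(errno.ENOENT)
        return obj.encoded_data or b''        # raw value bytes

    def getattr(self, path, fh=None):
        if path == '/':
            return dict(st_mode=(stat.S_IFDIR | 0o755), st_nlink=2)
        data = self._fetch(path)
        return dict(st_mode=(stat.S_IFREG | 0o444), st_nlink=1, st_size=len(data))

    def readdir(self, path, fh):
        # Listing keys is expensive in Riak; acceptable in a sketch only.
        return ['.', '..'] + list(self.bucket.get_keys())

    def read(self, path, size, offset, fh):
        return self._fetch(path)[offset:offset + size]


if __name__ == '__main__':
    # Usage: python riakfs.py /mnt/riak
    FUSE(RiakFS(), sys.argv[1], foreground=True, ro=True)

Listing every key on readdir and fetching whole values on each read obviously won't scale, so a real driver would want caching and chunked storage, but it shows how thin the FUSE layer itself can be.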
More info: http://en.wikipedia.org/wiki/Filesystem_in_Userspace

There is also a list of FUSE drivers at the above URL, several of which mention "distributed" in the description. One of those may suffice for you (if you've not already reviewed them). Otherwise, you could possibly use those FUSE drivers as a basis for your own custom FUSE Riak driver.

<http://www.loomlearning.com/>
Jonathan Langevin
Systems Administrator
Loom Inc.
Wilmington, NC: (910) 241-0433 - jlange...@loomlearning.com - www.loomlearning.com - Skype: intel352


On Sun, Sep 25, 2011 at 4:29 PM, Jeremiah Peschka <jeremiah.pesc...@gmail.com> wrote:

> Responses inline
> ---
> Jeremiah Peschka - Founder, Brent Ozar PLF, LLC
> Microsoft SQL Server MVP
>
> On Sep 25, 2011, at 5:30 AM, pille wrote:
>
> > hi,
> >
> > i'm quite new to riak and only know it from the docs available online.
> > to be honest, i did not search for a key/value store, but for a reliable (HA) distributed, replicated filesystem that allows dynamic growth.
>
> To be honest, what you're looking for is a SAN. EMC's Isilon line, Dell's Equallogic, and HP's Lefthand devices all meet your needs very well. They don't require a lot of administrative knowledge, they're easy to set up and maintain, and they are very easy to expand. SANs provide the features and functionality that you're looking for and won't require any additional development or maintenance. Yes, they cost money, but they do just sorta work straight out of the box.
>
> That being said, I answered the rest of these questions as if you weren't willing to just throw a bucket of money and SAN gear at your problem.
>
> > all the filesystems i've dealt with are either immature, abandoned, limited in features like dynamic scaling and snapshotting, or fail in out-of-diskspace scenarios (as they don't give you high availability and data protection at the same time).
> >
> > somehow i stumbled upon this project and liked its features, despite it not being a filesystem at all. i can live with its flat structure if it'll bring me all the other features i need.
> >
> > so i'm now at the point where reading the online docs without any hands-on experience leaves some questions unanswered.
> > since i'm used to storing all data in a filesystem, our application's storage interface would need a complete rewrite to interface with riak and provide the same services as before. therefore i'd like to ask you to share your knowledge and experience.
> >
> > 1) are snapshots provided?
> > i guess they aren't, but i'm more interested whether i can use the vector clocks for that.
> > i only need one snapshot and live data to provide a consistent old view of the data for our staging instance.
>
> Snapshots are not provided. You could probably cook something up yourself, but there's no snapshotting involved that I know of. Vector clocks are used for determining object lineage and conflict resolution.
>
> > 2) how does riak deal with different storage capacities of the different nodes? is it a problem if some nodes provide less space than others? is data distributed uniformly across all nodes or is their capacity taken into account?
>
> AFAIK, data is distributed evenly across a number of virtual nodes (64 by default). Those virtual nodes are then distributed evenly across your physical nodes. I don't know of a way to change this, but I've been very wrong before.
>
> > 3) we've got quite huge files for a database to store. is that a problem?
> > what storage backend do you propose?
> > currently we see the following distribution, but i expect more in the range from 512MB to 4GB to come in future:
> >
> >       < 1KB:   64053
> >   1KB - 1MB:  873795
> >   1MB - 2MB:    4776
> >   2MB - 4MB:    3131
> >   4MB - 8MB:    3136
> >  8MB - 16MB:    2842
> > 16MB - 32MB:    3136
> > 32MB - 64MB:    4032
> > 64MB - 128MB:   3118
> > 128MB - 256MB:  3361
> > 256MB - 512MB:  3221
> > 512MB - 1GB:    1423
> >   1GB - 2GB:      75
>
> Riak KV's max acceptable performance size is about 64MB for a file, but performance would probably start degrading before that. Luwak is an application built on top of Riak that probably meets your needs a lot better than plain old Riak KV: http://wiki.basho.com/Luwak.html
>
> > 4) is range access possible to read parts of a file^W value, or do i need to stream the whole file through? this would not perform well on huge values.
>
> With Luwak it's possible to get a portion of the object using the optional Range parameter: http://wiki.basho.com/HTTP-Fetch-Luwak-Object.html [a ranged-read example follows at the end of this message]
>
> > 5) to reduce the impact of a disk failure on the storage backend, i'd like each disk of a server to be assigned to its own riak node. i guess healing the failed node after replacement is faster than raid recovery and less data is at risk.
> > is it possible to reflect the hardware hierarchy in some way to influence the placement of replicas? CephFS offers this to make sure replicas are held on different hardware or even in different locations.
> > e.g. a STORAGE is in a SERVER, which is in a RACK, which is in a DATACENTER. replicas of a file in a STORAGE should never be placed inside the same SERVER (or RACK, or DATACENTER).
>
> You can purchase Riak EDS which has multi-site replication. Otherwise, Riak is just going to throw data into N nodes in your cluster and it will be up to you to make sure those nodes are in different racks.
>
> > 6) what happens if fewer than R or W nodes report data? does it mean not found or not available? even if the data is on a currently offline node.
>
> If fewer than R nodes are present, your read will fail. The R value means "this many nodes have to respond with data for it to be considered a successful read." Anything less than R would, thusly, mean there was a failure.
>
> If fewer than W nodes are able to write data, a hinted handoff will occur. [a quorum example follows at the end of this message]
>
> > 7) can the client applications connect to some random node?
> > should they simply retry the next one in the list upon failure?
>
> Client applications should connect to a random node, yes. Even better, you should put a load balancing proxy server in front of your Riak cluster so developers don't have to worry about writing their own load balancing code.
>
> I'd retry on failure, but that's up to you. ;) [a failover sketch follows at the end of this message]
>
> > 8) is the data reported back on a read compared/verified with all replicas to ensure consistency, or just its metadata (if R>1)?
>
> Yes, R nodes have to respond with *the same* copy of the data before a read is successful. You can quickly do this by comparing vector clocks and other assorted metadata.
>
> > 9) is data integrity in the storage backend secured through checksums?
>
> I think it depends on the storage backend implementation. Doing a quick grep through the source code turns up the word "checksum" a lot, though.
>
> > these are the questions puzzling me at the moment.
> > if you know some filesystem that matches my feature list, please don't hesitate to answer off-topic ;-)
>
> Other options include HDFS and MogileFS (http://danga.com/mogilefs/). Last.fm uses MogileFS.
>
> > cheers
> > pille
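Following up on question 4 above: a minimal sketch of a ranged read from Luwak over plain HTTP, assuming a node with Luwak enabled at localhost:8098 and a hypothetical key named 'bigfile' that was previously stored through the /luwak interface (the Range header behaviour is the one described on the wiki page Jeremiah linked):

# Hypothetical sketch: read only the first kilobyte of a large Luwak object.
import urllib.request

req = urllib.request.Request(
    'http://127.0.0.1:8098/luwak/bigfile',
    headers={'Range': 'bytes=0-1023'},        # ask for bytes 0..1023 only
)
with urllib.request.urlopen(req) as resp:
    chunk = resp.read()
    # Expect 206 Partial Content and a 1024-byte body if ranges are honoured.
    print(resp.status, len(chunk))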
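For questions 6 and 8: R and W can also be passed per request from the client. A sketch with the Basho Python client follows; the bucket and key names are invented and the keyword arguments may differ slightly between client versions:

# Hypothetical sketch of per-request R/W quorums with the Basho Python client.
import riak

client = riak.RiakClient()                    # default: local node
bucket = client.bucket('documents')

# Write: ask for 2 replica acknowledgements before store() returns.
obj = bucket.new('report-2011-09', data={'state': 'draft'})
obj.store(w=2)

# Read: require 2 replicas to answer; if fewer are reachable the read should
# fail rather than silently return whatever a single node has.
fetched = bucket.get('report-2011-09', r=2)
print(fetched.data)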
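And for question 7, a hedged sketch of client-side failover across a node list (host names are placeholders; as Jeremiah says, a load-balancing proxy such as HAProxy in front of the cluster is usually the simpler option):

# Hypothetical sketch: pick nodes at random and fall back to the next one on
# failure. Constructor arguments may differ between client versions.
import random
import riak

NODES = ['riak1.example.com', 'riak2.example.com', 'riak3.example.com']


def fetch_with_failover(bucket_name, key, attempts=3):
    last_error = None
    for host in random.sample(NODES, min(attempts, len(NODES))):
        try:
            client = riak.RiakClient(host=host)
            return client.bucket(bucket_name).get(key)
        except Exception as exc:              # node down / network error: try the next one
            last_error = exc
    raise last_error


# Example: obj = fetch_with_failover('documents', 'report-2011-09')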
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com