On 10/19/2010 01:31 AM, Greg Stein wrote:
On Mon, Oct 18, 2010 at 23:51, Blair Zajac<bl...@orcaware.com>  wrote:
On 10/04/2010 06:45 AM, C. Michael Pilato wrote:

There, you can learn more about what the Meetups tend to look like, what
other Meetups are planned for this year's conference, and so on.  You'll
also find a link to the Subversion Meetup wiki page:

        http://subversion.open.collab.net/wiki/ApacheConNA2010Meetup

That's the first mention I've seen of FSv2.  What ideas are going into it?
What problems is it primarily meant to solve?

FSv2 is a hand-wave.

Personally, I see it as a broad swath of API changes to align our
needs with the underlying storage. Trowbridge noted that our current
API makes it *really* difficult to implement an effective backend. I'd
also like to see a backend that allows for parallel PUTs during the
commit process. Hyrum sees FSv2 as some kind of super-key-value
storage with layers on top, allowing for various types of high-scaling
mechanisms.

How would that API look?  The API as it stands is pretty clear.

Background for my wish list.

We use Subversion as the backend for a versioned asset management system. We get up to 5 commits per second from render processes generating new assets and from artists saving assets. We also have interactive GUI users doing asset lookups all the time.

While the immutability of svn has allowed us to cache revision data, and our servers can push 4,000 lookups per second to render farm clients that look up a particular revision, interactive users doing HEAD lookups suffer because of the high commit rate. We cache data by node-id in memcached, but because the root node always gets a new node-id on every commit, and because the first thing interactive users do is list the folders in the root node, we always get cache misses. I don't really want svn to change the way new node-ids are assigned to parent nodes all the way up to the root.

1) Scalability to 30,000 child nodes in a single directory.

Currently, a single change to a node in a directory with 20,000 child nodes causes the new revision file in fsfs to use around 960 kB. With a commit rate of 1.5 commits per second in a repository, the disk usage is very high. We introduced a hidden layer of "hash:DD" directories, 30 in our case, that our internal Subversion server hashes path elements into. This makes the revision files much smaller, but now when getting the list of nodes in a directory, we have to index up to 30 child directories, increasing lookup times.

If we could remove the need for hash directories, the lookup on the root node would be much faster and interactive users would be happier.

2) I would like to ensure that the new backend supports multiple modifications to the same node within one transaction. I don't know whether this was designed into the current backend, but since I expose svn_fs.h over RPC, clients can make one or many modifications to the tree, so the new backend should support this.

And while we're discussing wants.

3) Pools are painful to use. We have repository, revision and transaction C++ objects stored in an LRU cache; they cache revision and transaction roots for improved performance. Using the wrong pool in an RPC method can cause memory leaks (we just found one on Monday that caused a backend server to run out of memory), and constructing and destroying pools in the wrong order can crash the process. This is hard to get right, so a different memory-management model would be very useful. I haven't had the cycles to look at Hyrum's new C++ objects to see how they would help.

Blair
