Hi there,

Currently, ra_serf creates a significantly higher I/O and CPU load
than ra_svn during checkout. The root cause is that for each path,
it will make two DAG requests, first for path@co_rev and then for
path@last_changed_rev. Because last_changed revisions have
poor locality, this requires DAG walks for "random" revisions.

Luckily, we do already have all the info available and infrastructure
at hand to bypass the second DAG walk for this specific pattern.
Details are in the .log and .patch files.

If there is no objection to it, I'd like to have this patch in 1.9 and
apply this patch around the 15th.

- Stefan^2.
Index: subversion/libsvn_fs_fs/tree.c
===================================================================
--- subversion/libsvn_fs_fs/tree.c	(revision 1643578)
+++ subversion/libsvn_fs_fs/tree.c	(working copy)
@@ -203,6 +203,13 @@
      Thus, remember the last hit location for optimistic lookup. */
   apr_size_t last_hit;
 
+  /* Position of the last bucket hit that actually had a DAG node in it.
+     LAST_HIT may refer to a bucket that matches path&rev but has not
+     its NODE element set, yet.
+     This value is a mere hint for optimistic lookup and any value is
+     valid (as long as it is < BUCKET_COUNT). */
+  apr_size_t last_non_empty;
+
   /* List of receiving pools that are still alive. */
   cache_lock_t *first_lock;
 };
@@ -338,6 +345,10 @@
       && (result->path_len == path_len)
       && !memcmp(result->path, path, path_len))
     {
+      /* Remember the position of the last node we found in this cache. */
+      if (result->node)
+        cache->last_non_empty = cache->last_hit;
+
       return result;
     }
 
@@ -409,10 +420,38 @@
 
       cache->insertions++;
     }
+  else if (result->node)
+    {
+      /* This bucket is valid & has a suitable DAG node in it.
+         Remember its location. */
+      cache->last_non_empty = bucket_index;
+    }
 
   return result;
 }
 
+/* Optimistic lookup using the last seen non-empty location in CACHE.
+   Return that entry, if it is still in use and matches PATH.  Return
+   NULL otherwise.  Since the caller usually already knows the path
+   length, provide it in PATH_LEN. */
+static cache_entry_t *
+cache_lookup_last_path(fs_fs_dag_cache_t *cache,
+                       const char *path,
+                       apr_size_t path_len)
+{
+  cache_entry_t *result = &cache->buckets[cache->last_non_empty];
+  assert(strlen(path) == path_len);
+
+  if (   result->node
+      && (result->path_len == path_len)
+      && !memcmp(result->path, path, path_len))
+    {
+      return result;
+    }
+
+  return NULL;
+}
+
 /* 2nd level cache */
 
 /* Find and return the DAG node cache for ROOT and the key that
@@ -959,16 +998,63 @@
      at the respective position and replacing that with a '/' in the next
      iteration.  This is correct as we assert() PATH to be canonical. */
   svn_stringbuf_t *path_so_far = svn_stringbuf_create(path, pool);
+  apr_size_t path_len = path_so_far->len;
 
-  /* callers often traverse the tree in some path-based order.  That means
-     a sibling of PATH has been presently accessed.  Try to start the lookup
-     directly at the parent node, if the caller did not requested the full
-     parent chain. */
+  /* Callers often traverse the DAG in some path-based order or along the
+     history segments.  That allows us to try a few guesses about where to
+     find the next item.  This is only useful if the caller didn't request
+     the full parent chain. */
   assert(svn_fs__is_canonical_abspath(path));
   path_so_far->len = 0; /* "" */
   if (flags & open_path_node_only)
     {
-      const char *directory = svn_dirent_dirname(path, pool);
+      const char *directory;
+      fs_fs_data_t *ffd = root->fs->fsap_data;
+
+      /* First attempt: Assume that we access the DAG for the same path as
+         in the last lookup but for a different revision that happens to be
+         the last revision that touched the respective node.  This is a
+         common pattern when e.g. checking out over ra_serf.
+
+         This shortcut is quick and will exit this function upon success.
+         So, try it first. */
+      cache_entry_t *bucket
+        = cache_lookup_last_path(ffd->dag_node_cache, path, path_len);
+
+      /* Did we get a bucket with a committed node? */
+      if (bucket && !svn_fs_fs__dag_check_mutable(bucket->node))
+        {
+          /* Get the path&rev pair at which this node was created.
+             This is repository location for which this node is _known_ to
+             be the right lookup result irrespective of how we found it. */
+          const char *created_path
+            = svn_fs_fs__dag_get_created_path(bucket->node);
+          svn_revnum_t revision;
+          SVN_ERR(svn_fs_fs__dag_get_revision(&revision, bucket->node,
+            iterpool));
+
+          /* Is it an exact match? */
+          if (revision == root->rev && strcmp(created_path, path) == 0)
+            {
+              /* Yes, it is what we've been looking for.
+                 Cache it under its full path@rev access path. */
+              SVN_ERR(dag_node_cache_set(root, path, bucket->node,
+                                         iterpool));
+
+              /* Construct and return the result. */
+              svn_pool_destroy(iterpool);
+              parent_path = make_parent_path(bucket->node, 0, 0, pool);
+              parent_path->copy_inherit = copy_id_inherit_self;
+              *parent_path_p = parent_path;
+
+              return SVN_NO_ERROR;
+            }
+        }
+
+      /* Second attempt: Try starting the lookup immediately at the parent
+         node.  We will often have recently accessed either a sibling or
+         said parent DIRECTORY itself for the same revision. */
+      directory = svn_dirent_dirname(path, pool);
       if (directory[1] != 0) /* root nodes are covered anyway */
         {
           SVN_ERR(dag_node_cache_get(&here, root, directory, TRUE, pool));
Eliminate most of the "random DAG walks" in FSFS during checkout or ra_serf.

For a c/o at revision REV, ra_serf will first ask for the node representing
PATH@REV, get its last changed revision and then request PATH@LAST_CHANGED.
On a branch or other copy, most of those LAST_CHANGED revisions will be the
same and the parent / sibling lookup optimization in open_path() will kick
in because parent(PATH)@LAST_CHANGED has probably been accessed recently.

On /trunk, the LAST_CHANGED info rarely matches between PATH, its parent or
its siblings.  That requires a full path walk for path@LAST_CHANGED with
all parent directories for this revision being pulled without a good chance
to be reused soon.  This gets particularly bad for deltified directories
as their histories, hence deltification chains, are again mostly unrelated.

This patch makes the simple guess that if you ask for the same path in a
different revision (if it was the same revision you would get it from cache
directly instead of calling open_path), you will end up with same DAG node.
We only need to verify that this is actually the case.

A DAG node records the revision it was created in as well as the path that
it got created for.  If that pair matches the requested PATH@LAST_CHANGED,
we *know* that this it the correct answer: That node cannot be replaced or
hidden within the same revision nor can there be two such nodes.  Although
this verification would usually fail for the general case, it will usually
succeed for ra_serf's pattern on /trunk.

* subversion/libsvn_fs_fs/tree.c
  (fs_fs_dag_cache_t): Add a member pointing to last node read.
  (cache_lookup): Remember the position of the last node access.
  (cache_lookup_last_path,
   try_match_last_node): New functions using that info for the new
                         optimistic lookup.
  (open_path): Attempt yet another optimistic lookup.

Reply via email to