I was able to make the code change to create the tmp directory inside the 3-byte hash directory and to fix the unit tests so this works. I will file a bug to get a discussion started on this, in case there are people not following this thread.
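Roughly, the change derives the temp location from the object's final suffix directory instead of the single per-device tmp/. A minimal sketch, with illustrative helper and variable names rather than the actual diskfile code:

    import os

    # Sketch of the proposed layout change: create the tmp directory
    # inside the 3-byte hash (suffix) directory the object will land
    # in, so the temp file's inode is allocated in the same XFS
    # allocation group as its final home.
    def get_tmp_dir(device_path, partition, name_hash):
        suffix = name_hash[-3:]  # the 3-character suffix directory
        tmp_dir = os.path.join(device_path, 'objects',
                               str(partition), suffix, 'tmp')
        os.makedirs(tmp_dir, exist_ok=True)  # tolerate concurrent PUTs
        return tmp_dir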
On Wed, Apr 29, 2015 at 4:08 PM, Shrinand Javadekar <shrin...@maginatics.com> wrote:
> Hi,
>
> I have been investigating a pretty serious Swift performance problem
> for a while now. I have a single-node Swift instance with 16 cores,
> 64GB memory and 8 disks of 3TB each. I only write 256KB objects into
> this Swift instance with high concurrency: 256 parallel object PUTs.
> Also, I was sharding the objects equally across 32 containers.
>
> On a completely clean system, we were getting ~375 object PUTs per
> second. But this kept dropping pretty quickly, and by the time we
> had 600GB of data in Swift, the throughput was ~100 objects per
> second.
>
> We used sysdig to get a trace of what's happening in the system and
> found that the open() system calls were taking far longer: several
> hundred milliseconds, sometimes even a full second.
>
> Investigating this further revealed a problem in the way Swift writes
> objects on XFS. Swift's object server creates a temp directory under
> the mount point /srv/node/r0. It first creates a file under this temp
> directory (say /srv/node/r0/tmp/tmpASDF) and eventually renames this
> file to its final destination:
>
> rename /srv/node/r0/tmp/tmpASDF ->
> /srv/node/r0/objects/312/eef/deadbeef/33453453454323424.data
>
> XFS creates an inode in the same allocation group as its parent
> directory. So, when the temp file tmpASDF is created, it goes into
> the same allocation group as "tmp". When the rename happens, only the
> filesystem metadata gets modified; the allocation groups of the
> inodes don't change.
>
> Since all object PUTs start off in the tmp directory, all inodes get
> created in the same allocation group. The B-tree used for keeping
> track of these inodes in the allocation group grows bigger and bigger
> as more files are written, and searching this tree for existence
> checks or for creating new inodes becomes more and more expensive.
>
> See the discussion [1] I had on the XFS mailing list, where this
> issue was brought to light, and this other, slightly older thread
> where the problem was identical [2].
>
> I validated this theory by periodically deleting the temp directory
> and observed that the objects-per-second rate was no longer dropping
> at the same rate as before: starting at ~375 obj/s, I was still
> getting ~340 obj/s after 600GB of data in Swift.
>
> Now, how do we fix this?
>
> One option would be to put the temp directory somewhere deeper in the
> filesystem rather than immediately under the mount point, e.g. create
> one temp directory under each of the 3-byte hash directories and use
> the temp directory corresponding to the object's hash. But it's
> unclear what other repercussions this would have. Will the replicator
> start replicating these temp directories?
>
> Another option is to actually delete the tmp directory periodically.
> The problem is that we don't know when; and whenever we decide to do
> it, the temp directory may have files in it, making it impossible to
> delete the directory.
>
> Any other options?
>
> Thanks in advance.
> -Shri
>
> [1] http://www.spinics.net/lists/xfs/msg32868.html
> [2] http://xfs.9218.n7.nabble.com/Performance-degradation-over-time-td28514.html
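For anyone joining via the bug: the write path Shri describes boils down to the pattern below. This is a simplified sketch with illustrative names and paths, not Swift's actual diskfile code, but it shows why every temp inode lands in one allocation group: the inode is allocated under the single per-device tmp/, and rename() only rewrites directory entries, never moves the inode.

    import os
    import tempfile

    # Simplified sketch of the current write path (illustrative names,
    # not Swift's diskfile code). The temp file's inode is allocated in
    # the allocation group of the single per-device tmp directory;
    # rename() is a metadata-only operation, so the inode stays there.
    def put_object(device_path, rel_target, data):
        tmp_dir = os.path.join(device_path, 'tmp')    # e.g. /srv/node/r0/tmp
        fd, tmp_path = tempfile.mkstemp(dir=tmp_dir)  # inode allocated here
        try:
            os.write(fd, data)
            os.fsync(fd)                              # durable before rename
        finally:
            os.close(fd)
        target = os.path.join(device_path, rel_target)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        os.rename(tmp_path, target)  # entry moves; the inode's AG does not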