This is the latest default kernel for CentOS 7. We also tried a newer 4.4
kernel (from elrepo), which has the same problem, so I don't think the
kernel is the cause. Thank you for the suggestion, though.
We upgraded our cluster to the 10.2.2 release today, and it didn't resolve
all of the issues. It's possible that a related issue is actually
permissions: something may not be right with our config, or there may be a
bug here.
While testing, we noticed that there may actually be two issues here. I am
unsure, but the most consistent way to reproduce our issue is to use vim or
sed -i, both of which do in-place renames:
[root@ftp01 cron]# ls -la
total 3
drwx------ 1 root root 2044 Jun 16 15:50 .
drwxr-xr-x. 10 root root 104 May 19 09:34 ..
-rw-r--r-- 1 root root 300 Jun 16 15:50 file
-rw------- 1 root root 2044 Jun 16 13:47 root
[root@ftp01 cron]# sed -i 's/^/#/' file
sed: cannot rename ./sedfB2CkO: Permission denied
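For anyone trying to narrow this down: as far as I understand it, sed -i
writes the edited content to a temporary file in the same directory and then
renames it over the original, and it is that last step that fails above.
Below is a minimal sketch of that sequence in plain shell, so each step can
be run and checked on its own. The function name inplace_prefix is made up
for illustration, and unlike sed -i it does not copy the original file's
mode or ownership onto the new file.

```shell
# Hypothetical helper (not part of sed): mimic "sed -i 's/^/#/' FILE"
# one step at a time, reporting which step failed.
inplace_prefix() {
    target=$1
    dir=$(dirname "$target")

    # Step 1: create the temp file next to the target, as sed -i does.
    tmp=$(mktemp "$dir/sedXXXXXX") || { echo "create failed" >&2; return 1; }

    # Step 2: write the edited content into the temp file.
    sed 's/^/#/' "$target" > "$tmp" || { echo "write failed" >&2; rm -f "$tmp"; return 2; }

    # Step 3: rename the temp file over the original. Both paths are in the
    # same directory, so mv performs a plain rename. This is the step that
    # sed reports as "cannot rename".
    mv -f "$tmp" "$target" || { echo "rename failed" >&2; rm -f "$tmp"; return 3; }
}
```

Running `inplace_prefix file; echo "exit status: $?"` in the cron directory
on each client should show whether create, write, or rename is the step
that gets "Permission denied".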
Strangely, adding or deleting files works fine; only renaming fails. Just
as strangely, I was able to edit the file successfully on ftp02:
[root@ftp02 cron]# sed -i 's/^/#/' file
[root@ftp02 cron]# ls -la
total 3
drwx------ 1 root root 2044 Jun 16 15:49 .
drwxr-xr-x. 10 root root 104 May 19 09:34 ..
-rw-r--r-- 1 root root 313 Jun 16 15:49 file
-rw------- 1 root root 2044 Jun 16 13:47 root
Then it worked on ftp01 this time:
[root@ftp01 cron]# ls -la
total 3
drwx------ 1 root root 2357 Jun 16 15:49 .
drwxr-xr-x. 10 root root 104 May 19 09:34 ..
-rw-r--r-- 1 root root 313 Jun 16 15:49 file
-rw------- 1 root root 2044 Jun 16 13:47 root
Then I edited it with vim successfully on ftp01, and ran the sed again:
[root@ftp01 cron]# sed -i 's/^/#/' file
sed: cannot rename ./sedfB2CkO: Permission denied
[root@ftp01 cron]# ls -la
total 3
drwx------ 1 root root 2044 Jun 16 15:51 .
drwxr-xr-x. 10 root root 104 May 19 09:34 ..
-rw-r--r-- 1 root root 300 Jun 16 15:50 file
-rw------- 1 root root 2044 Jun 16 13:47 root
And now we have the zero-length file problem again, this time on ftp02:
[root@ftp02 cron]# ls -la
total 2
drwx------ 1 root root 2044 Jun 16 15:51 .
drwxr-xr-x. 10 root root 104 May 19 09:34 ..
-rw-r--r-- 1 root root 0 Jun 16 15:50 file
-rw------- 1 root root 2044 Jun 16 13:47 root
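To compare what each client sees after an edit, it may help to print the
size and checksum with the same command on both hosts. A minimal sketch;
the name report is made up for illustration. An empty file shows size 0
and the md5sum d41d8cd98f00b204e9800998ecf8427e, as in my original message.

```shell
# Hypothetical helper: print host name, file name, size in bytes and md5sum
# on one line, so output from ftp01 and ftp02 can be compared directly.
report() {
    f=$1
    size=$(wc -c < "$f" | tr -d ' ')          # byte count, padding stripped
    sum=$(md5sum < "$f" | cut -d' ' -f1)      # checksum only, no file name
    printf '%s %s %s %s\n' "$(uname -n)" "$f" "$size" "$sum"
}
```

For example, `report file` run in the cron directory on both clients right
after an edit, and again a few seconds later.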
Anyway, I wonder how much of this issue is related to the "cannot rename"
error above. Here are our security settings:
client.ftp01
    key: <redacted>
    caps: [mds] allow r, allow rw path=/ftp
    caps: [mon] allow r
    caps: [osd] allow rw pool=cephfs_metadata, allow rw pool=cephfs_data
client.ftp02
    key: <redacted>
    caps: [mds] allow r, allow rw path=/ftp
    caps: [mon] allow r
    caps: [osd] allow rw pool=cephfs_metadata, allow rw pool=cephfs_data
/ftp is the directory on cephfs under which cron lives; the full path is
/ftp/cron.
I hope this helps and thank you for your time!
Jason
On 6/15/16, 4:43 PM, "John Spray" <[email protected]> wrote:
>On Wed, Jun 15, 2016 at 10:21 PM, Jason Gress <[email protected]>
>wrote:
>> While trying to use CephFS as a clustered filesystem, we stumbled upon a
>> reproducible bug that is unfortunately pretty serious, as it leads to
>> data loss. Here is the situation:
>>
>> We have two systems, named ftp01 and ftp02. They are both running
>> CentOS 7.2, with this kernel release and ceph packages:
>>
>> kernel-3.10.0-327.18.2.el7.x86_64
>
>That is an old-ish kernel to be using with cephfs. It may well be the
>source of your issues.
>
>> [root@ftp01 cron]# rpm -qa | grep ceph
>> ceph-base-10.2.1-0.el7.x86_64
>> ceph-deploy-1.5.33-0.noarch
>> ceph-mon-10.2.1-0.el7.x86_64
>> libcephfs1-10.2.1-0.el7.x86_64
>> ceph-selinux-10.2.1-0.el7.x86_64
>> ceph-mds-10.2.1-0.el7.x86_64
>> ceph-common-10.2.1-0.el7.x86_64
>> ceph-10.2.1-0.el7.x86_64
>> python-cephfs-10.2.1-0.el7.x86_64
>> ceph-osd-10.2.1-0.el7.x86_64
>>
>> Mounted like so:
>> XX.XX.XX.XX:/ftp/cron /var/spool/cron ceph _netdev,relatime,name=ftp01,secretfile=/etc/ceph/ftp01.secret 0 0
>> And:
>> XX.XX.XX.XX:/ftp/cron /var/spool/cron ceph _netdev,relatime,name=ftp02,secretfile=/etc/ceph/ftp02.secret 0 0
>>
>> This filesystem has 234GB worth of data on it, and I created another
>> subdirectory and mounted it, NFS style.
>>
>> Here were the steps to reproduce:
>>
>> First, I created a file (I was mounting /var/spool/cron on two systems)
>> on ftp01:
>> (crond is not running right now on either system to keep the variables
>> down)
>>
>> [root@ftp01 cron]# cp /tmp/root .
>>
>> Shows up on both fine:
>> [root@ftp01 cron]# ls -la
>> total 2
>> drwx------ 1 root root 0 Jun 15 15:50 .
>> drwxr-xr-x. 10 root root 104 May 19 09:34 ..
>> -rw------- 1 root root 2043 Jun 15 15:50 root
>> [root@ftp01 cron]# md5sum root
>> 0636c8deaeadfea7b9ddaa29652b43ae root
>>
>> [root@ftp02 cron]# ls -la
>> total 2
>> drwx------ 1 root root 2043 Jun 15 15:50 .
>> drwxr-xr-x. 10 root root 104 May 19 09:34 ..
>> -rw------- 1 root root 2043 Jun 15 15:50 root
>> [root@ftp02 cron]# md5sum root
>> 0636c8deaeadfea7b9ddaa29652b43ae root
>>
>> Now, I vim the file on one of them:
>> [root@ftp01 cron]# vim root
>> [root@ftp01 cron]# ls -la
>> total 2
>> drwx------ 1 root root 0 Jun 15 15:51 .
>> drwxr-xr-x. 10 root root 104 May 19 09:34 ..
>> -rw------- 1 root root 2044 Jun 15 15:50 root
>> [root@ftp01 cron]# md5sum root
>> 7a0c346bbd2b61c5fe990bb277c00917 root
>>
>> [root@ftp02 cron]# md5sum root
>> 7a0c346bbd2b61c5fe990bb277c00917 root
>>
>> So far so good, right? Then, a few seconds later:
>>
>> [root@ftp02 cron]# ls -la
>> total 0
>> drwx------ 1 root root 0 Jun 15 15:51 .
>> drwxr-xr-x. 10 root root 104 May 19 09:34 ..
>> -rw------- 1 root root 0 Jun 15 15:50 root
>> [root@ftp02 cron]# cat root
>> [root@ftp02 cron]# md5sum root
>> d41d8cd98f00b204e9800998ecf8427e root
>>
>> And on ftp01:
>>
>> [root@ftp01 cron]# ls -la
>> total 2
>> drwx------ 1 root root 0 Jun 15 15:51 .
>> drwxr-xr-x. 10 root root 104 May 19 09:34 ..
>> -rw------- 1 root root 2044 Jun 15 15:50 root
>> [root@ftp01 cron]# md5sum root
>> 7a0c346bbd2b61c5fe990bb277c00917 root
>>
>> I later create a 'root2' on ftp02 and cause a similar issue. The end
>> results are two non-matching files:
>>
>> [root@ftp01 cron]# ls -la
>> total 2
>> drwx------ 1 root root 0 Jun 15 15:53 .
>> drwxr-xr-x. 10 root root 104 May 19 09:34 ..
>> -rw------- 1 root root 2044 Jun 15 15:50 root
>> -rw-r--r-- 1 root root 0 Jun 15 15:53 root2
>>
>> [root@ftp02 cron]# ls -la
>> total 2
>> drwx------ 1 root root 0 Jun 15 15:53 .
>> drwxr-xr-x. 10 root root 104 May 19 09:34 ..
>> -rw------- 1 root root 0 Jun 15 15:50 root
>> -rw-r--r-- 1 root root 1503 Jun 15 15:53 root2
>>
>> We were able to reproduce this on two other systems with the same cephfs
>> filesystem. I have also seen cases where the file would just blank out
>> on both as well.
>>
>> We could not reproduce it with our dev/test cluster running the
>> development ceph version:
>>
>> ceph-10.2.2-1.g502540f.el7.x86_64
>
>Strange. In that cluster, was the same 3.x kernel in use? There
>aren't a whole lot of changes on the server side in v10.2.2 that I
>could imagine affecting this case.
>
>The best thing to do right now is to try using ceph-fuse in your
>production environment, to check that it is not exhibiting the same
>behaviour as the old kernel client. Once you confirm that, I would
>recommend upgrading your kernel to the most recent 4.x that you are
>comfortable with, and confirm that that also does not exhibit the bad
>behaviour.
>
>John
>
>> Is this a known bug with the current production Jewel release? If so,
>> will it be patched in the next release?
>>
>> Thank you very much,
>>
>> Jason Gress
>>
>> "This message and any attachments may contain confidential information.
>> If you have received this message in error, any use or distribution is
>> prohibited. Please notify us by reply e-mail if you have mistakenly
>> received this message, and immediately and permanently delete it and any
>> attachments. Thank you."
>>
>>
>> _______________________________________________
>> ceph-users mailing list
>> [email protected]
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>