[ceph-users] One osd crashing daily, the problem with osd.50

2016-05-09 Thread Ronny Aasen

hello

I am running a small lab ceph cluster consisting of 6 old used servers. 
They have 36 slots for drives, but too little RAM (32GB max for this 
mainboard) to take advantage of them all. When I get to around 20 OSDs 
on a node, the OOM killer becomes a problem if there are incidents that 
require recovery.


In order to remedy some of the RAM problems I am running the OSDs on 
5-disk software RAID5 sets. This gives me about 7 12TB OSDs on a node 
and a global hot spare. I have tried this on one of the nodes with good 
success, and I am in the process of doing the migrations on the other 
nodes as well.
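
For reference, an OSD on a 5-disk md RAID5 set can be put together roughly 
like this; the device names are illustrative and the exact ceph-disk 
invocation used here may have differed:

mdadm --create /dev/md0 --level=5 --raid-devices=5 /dev/sd[b-f]
ceph-disk prepare /dev/md0       # partitions the md device for data + journal
ceph-disk activate /dev/md0p1    # brings the new OSD up and in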


I am running on Debian jessie using the 0.94.6 hammer release from ceph's repo.

But an issue has started appearing on one of these RAID5 OSDs.

osd.50 has a tendency to stop ~daily with the error message seen in the 
log below. The OSD is running on a healthy software RAID5 disk, and I 
can see nothing in dmesg or any other log that would indicate a problem 
with this md device.
Once I restart the OSD it's up and in, and probably stays up and in for 
some hours up to a few days. The other 6 OSDs on this node do not show 
the same problem. I have restarted this OSD about 8-10 times, so it's 
fairly regular.


The RAID5 sets are 12TB, so I was hoping to be able to fix the problem 
rather than zapping the md device and recreating it from scratch. I was 
also wondering whether there is something fundamentally wrong with 
running OSDs on software md RAID5 devices.



kind regards
Ronny Aasen





NB: ignore osd.41 in the log below, that was a single broken disk
--
ceph-50.log
-7> 2016-05-05 04:36:04.382514 7f2758135700 1 -- 10.24.11.22:6805/22452 
<== osd.66 10.24.12.24:0/5339 22940  osd_ping(ping e51484 stamp 
2016-05-05 04:36:04.367488) v2  47+0+0 (1824128128 0 0) 0x41bfba00 
con 0x428df1e0
-6> 2016-05-05 04:36:04.382534 7f2756932700 1 -- 10.24.12.22:6803/22452 
--> 10.24.12.24:0/5339 -- osd_ping(ping_reply e51484 stamp 2016-05-05 
04:36:04.367488) v2 -- ?+0 0x4df27200 con 0x428de6e0
-5> 2016-05-05 04:36:04.382576 7f2758135700 1 -- 10.24.11.22:6805/22452 
--> 10.24.12.24:0/5339 -- osd_ping(ping_reply e51484 stamp 2016-05-05 
04:36:04.367488) v2 -- ?+0 0x4df4a200 con 0x428df1e0
-4> 2016-05-05 04:36:04.412314 7f2756932700 1 -- 10.24.12.22:6803/22452 
<== osd.19 10.24.12.25:0/5355 22879  osd_ping(ping e51484 stamp 
2016-05-05 04:36:04.412495) v2  47+0+0 (1694664336 0 0) 0x57434a00 
con 0x421ddb20
-3> 2016-05-05 04:36:04.412366 7f2756932700 1 -- 10.24.12.22:6803/22452 
--> 10.24.12.25:0/5355 -- osd_ping(ping_reply e51484 stamp 2016-05-05 
04:36:04.412495) v2 -- ?+0 0x56c67800 con 0x421ddb20
-2> 2016-05-05 04:36:04.412394 7f2758135700 1 -- 10.24.11.22:6805/22452 
<== osd.19 10.24.12.25:0/5355 22879  osd_ping(ping e51484 stamp 
2016-05-05 04:36:04.412495) v2  47+0+0 (1694664336 0 0) 0x4e485600 
con 0x428de9a0
-1> 2016-05-05 04:36:04.412440 7f2758135700 1 -- 10.24.11.22:6805/22452 
--> 10.24.12.25:0/5355 -- osd_ping(ping_reply e51484 stamp 2016-05-05 
04:36:04.412495) v2 -- ?+0 0x41bfba00 con 0x428de9a0
0> 2016-05-05 04:36:04.418305 7f274c91e700 -1 os/FileStore.cc: In 
function 'virtual int FileStore::read(coll_t, const ghobject_t&, 
uint64_t, size_t, ceph::bufferlist&, uint32_t, bool)' thread 
7f274c91e700 time 2016-05-05 04:36:04.115448
os/FileStore.cc: 2854: FAILED assert(allow_eio || !m_filestore_fail_eio 
|| got != -5)


ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x76) [0xc03c46]
2: (FileStore::read(coll_t, ghobject_t const&, unsigned long, unsigned 
long, ceph::buffer::list&, unsigned int, bool)+0xcc2) [0x90af82]
3: (ReplicatedBackend::be_deep_scrub(hobject_t const&, unsigned int, 
ScrubMap::object&, ThreadPool::TPHandle&)+0x31c) [0xa1c0ec]
4: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t, 
std::allocator<hobject_t> > const&, bool, unsigned int, 
ThreadPool::TPHandle&)+0x2ca) [0x8cd23a]
5: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, 
unsigned int, ThreadPool::TPHandle&)+0x1fa) [0x7dc0ba]

6: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x3be) [0x7e437e]
7: (PG::scrub(ThreadPool::TPHandle&)+0x1d7) [0x7e5a87]
8: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x19) [0x6b3e69]
9: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa77) [0xbf41f7]
10: (ThreadPool::WorkThread::entry()+0x10) [0xbf52c0]
11: (()+0x80a4) [0x7f27790c20a4]
12: (clone()+0x6d) [0x7f277761d87d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed 
to interpret this.


--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_replay
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 keyvaluestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 fin

Re: [ceph-users] One osd crashing daily, the problem with osd.50

2016-05-09 Thread Christian Balzer

Hello,

On Mon, 9 May 2016 09:31:20 +0200 Ronny Aasen wrote:

> hello
> 
> I am running a small lab ceph cluster consisting of 6 old used servers. 

That's larger than quite a few production deployments. ^_-

> they have 36 slots for drives. but too little ram, 32GB max, for this 
> mainboard, to take advantage of them all. When i get to around 20 osd's 
> on a node the OOM killer becomes a problem, if there are incidents that 
> require recovery.
> 
No surprise there, if you're limited to that little RAM I suspect you'd
run out of CPU power with a full load, too.

> In order to remedy some of the ram problems i am running the osd's on 5 
> disk raid5 software sets. this gives me about 7 12TB osd's on a node and 
> a global hotspare. I have tried this on one of the nodes with good 
> success. and I am in the process of doing the migrations on the other 
> nodes as well.
> 
That's optimizing for space and nothing else.

Having done something similar in the past I would strongly recommend the
following:
a) Use RAID6, so that you never have to worry about an OSD failure. 
I've personally lost 2 RAID5 sets of similar size due to double disk
failures.

b) use RAID10 for much improved performance (IOPS). To offset the loss in
space, consider running with a replication of 2, which would be safe, same
for option a).

> i am running on debian jessie using the 0.94.6 hammer from ceph's repo.
> 
> but an issue has started appearing on one of these raid5 osd's 
> 
> osd.50 has a tendency to stop ~daily with the error message seen in the 
> log below. The osd is running on a healthy software raid5 disk, and i 
> can see nothing in dmesg or any other log that can indicate a problem 
> with this md device.

The key part of that log is the EIO failed assert. If you google for 

"FAILED assert(allow_eio" you will get hits from last year; this is an FS
issue and has nothing to do with the RAID per se.
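
For completeness, and not as a recommendation: the assert fires because 
FileStore::read() got EIO (-5) back from the filesystem, and the related 
filestore_fail_eio option can make the OSD tolerate EIO instead of 
asserting. That only masks the underlying disk/FS error, but for 
reference it would look roughly like this in ceph.conf:

[osd]
filestore_fail_eio = false    # default is true; false hides real I/O errors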

Which FS are you using?

If it's not BTRFS and since your other OSDs are not having issues, it
might be worth going over this FS with a fine comb. 

The "near full" OSD is something that you want to address, too.

> once i restart the osd it's up and in and probably stays up and in for 
> some hours up to a few days. the other 6 osd's on this node do not show 
> the same problem. i have restarted this osd about 8-10 times. so it's 
> fairly regular.
> 
Might have to bite the bullet and re-create it if you can't find the issue.

> the raid5 sets are 12TB so i was hoping to be able to fix the problem, 
> rather than zapping the md and recreating from scratch. I was also 
> worrying if there was something fundamentally wrong about running osd's 
> on software md raid5 devices.
> 
No problem in and by itself, other than reduced performance.

Regards,

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] One osd crashing daily, the problem with osd.50

2016-05-09 Thread Ronny Aasen

Thanks for your comments. Answers inline.

On 05/09/16 09:53, Christian Balzer wrote:


Hello,

On Mon, 9 May 2016 09:31:20 +0200 Ronny Aasen wrote:


hello

I am running a small lab ceph cluster consisting of 6 old used servers.


That's larger than quite a few production deployments. ^_-


:)


they have 36 slots for drives. but too little ram, 32GB max, for this
mainboard, to take advantage of them all. When i get to around 20 osd's
on a node the OOM killer becomes a problem, if there are incidents that
require recovery.


No surprise there, if you're limited to that little RAM I suspect you'd
run out of CPU power with a full load, too.


CPU has not been an issue yet, but I suspect it might change when I can 
release enough RAM to try erasure coding.





In order to remedy some of the ram problems i am running the osd's on 5
disk raid5 software sets. this gives me about 7 12TB osd's on a node and
a global hotspare. I have tried this on one of the nodes with good
success. and I am in the process of doing the migrations on the other
nodes as well.


That's optimizing for space and nothing else.

Having done something similar in the past I would strongly recommend the
following:
a) Use RAID6, so that you never have to worry about an OSD failure.
I've personally lost 2 RAID5 sets of similar size due to double disk
failures.

b) use RAID10 for much improved performance (IOPS). To offset the loss in
space, consider running with a replication of 2, which would be safe, same
for option a).


Yes, definitely optimizing for space, and space only. That's also the 
reason for RAID5 and not RAID6 or 10.

And why erasure coding is a must-have if I can get it there.


i am running on debian jessie using the 0.94.6 hammer from ceph's repo.

but an issue has started appearing on one of these raid5 osd's

osd.50 has a tendency to stop ~daily with the error message seen in the
log below. The osd is running on a healthy software raid5 disk, and i
can see nothing in dmesg or any other log that can indicate a problem
with this md device.


The key part of that log is EIO failed assert, if you google for

"FAILED assert(allow_eio" you will get hits from last year, this is FS
issue and has nothing to do with the RAID per se.

Which FS are you using?

If it's not BTRFS and since your other OSDs are not having issues, it
might be worth going over this FS with a fine comb.

The "near full" OSD is something that you want to address, too.


The near full is already fixed.
I am running on XFS. I had assumed that FS issues would also show up in 
the system logs. I'll google for this issue.
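
In the meantime, the rough checklist I intend to run through on the md 
device and the XFS filesystem behind osd.50 (device names here are only 
placeholders):

cat /proc/mdstat              # array state, degraded members, rebuilds
smartctl -a /dev/sdX          # pending/reallocated sectors on each member disk
xfs_repair -n /dev/md0        # read-only check; stop the OSD and unmount first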



once i restart the osd it's up and in and probably stays up and in for
some hours up to a few days. the other 6 osd's on this node do not show
the same problem. i have restarted this osd about 8-10 times. so it's
fairly regular.


Might have to bite the bullet and re-create it if you can't find the issue.


Thanks. I'll do this today if nobody (or Google) has any better suggestions.
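
For the record, the removal/re-creation I have in mind follows the usual 
manual procedure, roughly (the device name is a placeholder):

ceph osd out 50               # drain it and wait for the cluster to settle
ceph osd crush remove osd.50  # after stopping the osd.50 daemon
ceph auth del osd.50
ceph osd rm 50
ceph-disk zap /dev/md0        # then re-create with ceph-disk prepare /dev/md0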


the raid5 sets are 12TB so i was hoping to be able to fix the problem,
rather than zapping the md and recreating from scratch. I was also
worrying if there was something fundamentally wrong about running osd's
on software md raid5 devices.


No problem in and by itself, other than reduced performance.


Kind regards
Ronny Aasen

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RadosGW - Problems running the S3 and SWIFT API at the same time

2016-05-09 Thread Saverio Proto
I try to simplify the question to get some feedback.

Is anyone running the RadosGW in production with S3 and SWIFT API
active at the same time ?

thank you !

Saverio


2016-05-06 11:39 GMT+02:00 Saverio Proto :
> Hello,
>
> We have been running the Rados GW with the S3 API and we did not have
> problems for more than a year.
>
> We recently enabled also the SWIFT API for our users.
>
> radosgw --version
> ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
>
> The idea is that each user of the system is free to choose the S3
> client or the SWIFT client to access the same container/buckets.
>
> Please tell us if this is possible by design or if we are doing something 
> wrong.
>
> We now have a problem that some files written in the past with S3,
> cannot be read with the SWIFT API because the md5sum always fails.
>
> I am able to reproduce the bug in this way:
>
> We have this file googlebooks-fre-all-2gram-20120701-ts.gz and we know
> the correct md5 is 1c8113d2bd21232688221ec74dccff3a
> You can download the same file here:
> https://www.dropbox.com/s/auq16vdv2maw4p7/googlebooks-fre-all-2gram-20120701-ts.gz?dl=0
>
> rclone mkdir lss3:bugreproduce
> rclone copy googlebooks-fre-all-2gram-20120701-ts.gz lss3:bugreproduce
>
> The file is successfully uploaded.
>
> At this point I can successfully download the file again:
> rclone copy lss3:bugreproduce/googlebooks-fre-all-2gram-20120701-ts.gz test.gz
>
> but not with swift:
>
> swift download googlebooks-ngrams-gz
> fre/googlebooks-fre-all-2gram-20120701-ts.gz
> Error downloading object
> 'googlebooks-ngrams-gz/fre/googlebooks-fre-all-2gram-20120701-ts.gz':
> u'Error downloading fre/googlebooks-fre-all-2gram-20120701-ts.gz:
> md5sum != etag, 1c8113d2bd21232688221ec74dccff3a !=
> 1a209a31b4ac3eb923fac5e8d194d9d3-2'
>
> Also I found strange the dash character '-' at the end of the md5 that
> is trying to compare.
>
> Of course upload a file with the swift client and redownloading the
> same file just works.
>
> Should I open a bug for the radosgw on http://tracker.ceph.com/ ?
>
> thank you
>
> Saverio
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Boot from image (create a new volume) on Openstack dashboard

2016-05-09 Thread Дробышевский , Владимир
Hi, Calvin!

  Actually it's an OpenStack question rather than a ceph one, but you can use
the "cinder quota-show" and "cinder quota-update" commands to show and set
a tenant's quota.
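
  For example (the tenant ID and the numbers are just placeholders):

cinder quota-show <tenant_id>
cinder quota-update --volumes 20 --gigabytes 500 <tenant_id>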

  You can look at this question/answer about setting volume quotas for
certain volume types; it's somewhat relevant to your question:

https://ask.openstack.org/en/question/57912/can-you-restrict-a-user-or-tenant-to-a-cinder-volume-type/

  I don't remember exactly, but as far as I recall you can set volume
limits in Horizon as well (though I'm not sure you can set per-volume-type
quotas from there).

Kind regards,
Vladimir

2016-05-09 11:35 GMT+05:00 박선규 [slash] :

> Hi
>
> When launching a new instance as 'Boot from image (create a new volume)' on
> the OpenStack dashboard,
> a normal user gets the error message below. (The admin user is okay.)
>
> 'The requested instance cannot be launched. Requested volume exceeds
> quota: Available: 0, Requested: 1.'
>
> How do I assign permission or quota to a normal user?
>
> Regards
> Calvin
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to avoid kernel conflicts

2016-05-09 Thread Ilya Dryomov
On Mon, May 9, 2016 at 12:19 AM, K.C. Wong  wrote:
>
>> As the tip said, you should not use rbd via kernel module on an OSD host
>>
>> However, using it with userspace code (librbd etc, as in kvm) is fine
>>
>> Generally, you should not have both:
>> - "server" in userspace
>> - "client" in kernelspace
>
>> If `librbd` would help avoid this problem, then switching to `rbd-fuse`
> should do the trick, right?
>
> The reason for my line of question is that I've seen occasional freeze
> up of `rbd map` that's resolved by a 'slight tap' by way of an strace.
> There is definitely great attractiveness to not have specialized nodes
> and make every one the same as the next one on the rack.

The problem with placing the kernel client on the OSD node is the
potential deadlock under heavy I/O when memory becomes scarce.  It's
not recommended, but people are doing it - if you don't stress your
system too much, it'll never happen.

"rbd map" freeze is definitely not related to the abov.  Did the actual
command hang?  Could you describe what you saw in more detail and how
did strace help?  It could be that you ran into

http://tracker.ceph.com/issues/14737

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RadosGW - Problems running the S3 and SWIFT API at the same time

2016-05-09 Thread Xusangdi
Hi,

I'm not running a cluster like yours, but I don't think the issue is caused by 
you using the 2 APIs at the same time.
IIRC the dash thing is appended by S3 multipart upload, with the following digit 
indicating the number of parts.
You may want to check this report in the s3cmd community:
https://sourceforge.net/p/s3tools/bugs/123/

and some basic info from Amazon:
http://docs.aws.amazon.com/AmazonS3/latest/dev/mpuoverview.html
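
To illustrate: a multipart ETag is the md5 of the concatenated binary md5s 
of the individual parts, with "-<number of parts>" appended, so it will 
never match the md5sum of the whole object. Roughly (the 5M part size is 
only a guess at what the client actually used):

split -b 5M googlebooks-fre-all-2gram-20120701-ts.gz part_
md5sum part_* | awk '{print $1}' | xxd -r -p | md5sum   # then append "-2"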

Hope this helps :D

Regards,
---Sandy

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Saverio Proto
> Sent: Monday, May 09, 2016 4:42 PM
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] RadosGW - Problems running the S3 and SWIFT API at 
> the same time
>
> I try to simplify the question to get some feedback.
>
> Is anyone running the RadosGW in production with S3 and SWIFT API active at 
> the same time ?
>
> thank you !
>
> Saverio
>
>
> 2016-05-06 11:39 GMT+02:00 Saverio Proto :
> > Hello,
> >
> > We have been running the Rados GW with the S3 API and we did not have
> > problems for more than a year.
> >
> > We recently enabled also the SWIFT API for our users.
> >
> > radosgw --version
> > ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
> >
> > The idea is that each user of the system is free to choose the S3
> > client or the SWIFT client to access the same container/buckets.
> >
> > Please tell us if this is possible by design or if we are doing something 
> > wrong.
> >
> > We now have a problem that some files written in the past with S3,
> > cannot be read with the SWIFT API because the md5sum always fails.
> >
> > I am able to reproduce the bug in this way:
> >
> > We have this file googlebooks-fre-all-2gram-20120701-ts.gz and we know
> > the correct md5 is 1c8113d2bd21232688221ec74dccff3a You can download
> > the same file here:
> > https://www.dropbox.com/s/auq16vdv2maw4p7/googlebooks-fre-all-2gram-20
> > 120701-ts.gz?dl=0
> >
> > rclone mkdir lss3:bugreproduce
> > rclone copy googlebooks-fre-all-2gram-20120701-ts.gz lss3:bugreproduce
> >
> > The file is successfully uploaded.
> >
> > At this point I can successfully download the file again:
> > rclone copy lss3:bugreproduce/googlebooks-fre-all-2gram-20120701-ts.gz
> > test.gz
> >
> > but not with swift:
> >
> > swift download googlebooks-ngrams-gz
> > fre/googlebooks-fre-all-2gram-20120701-ts.gz
> > Error downloading object
> > 'googlebooks-ngrams-gz/fre/googlebooks-fre-all-2gram-20120701-ts.gz':
> > u'Error downloading fre/googlebooks-fre-all-2gram-20120701-ts.gz:
> > md5sum != etag, 1c8113d2bd21232688221ec74dccff3a !=
> > 1a209a31b4ac3eb923fac5e8d194d9d3-2'
> >
> > Also I found strange the dash character '-' at the end of the md5 that
> > is trying to compare.
> >
> > Of course upload a file with the swift client and redownloading the
> > same file just works.
> >
> > Should I open a bug for the radosgw on http://tracker.ceph.com/ ?
> >
> > thank you
> >
> > Saverio
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
-
This e-mail and its attachments contain confidential information from H3C, 
which is
intended only for the person or entity whose address is listed above. Any use 
of the
information contained herein in any way (including, but not limited to, total 
or partial
disclosure, reproduction, or dissemination) by persons other than the intended
recipient(s) is prohibited. If you receive this e-mail in error, please notify 
the sender
by phone or email immediately and delete it!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CephFS + CTDB/Samba - MDS session timeout on lockfile

2016-05-09 Thread Nick Fisk
Hi All,

I've been testing an active/active Samba cluster over CephFS, performance
seems really good with small files compared to Gluster. Soft reboots work
beautifully with little to no interruption in file access. However when I
perform a hard shutdown/reboot of one of the samba nodes, the remaining node
detects that the other Samba node has disappeared but then eventually bans
itself. If I leave everything for around 5 minutes, CTDB unbans itself and
then everything continues running.

From what I can work out, it looks like the MDS has a stale session from
the powered down node and won't let the remaining node access the CTDB lock
file (which is also sitting on the CephFS). CTDB meanwhile is hammering
away trying to access the lock file, but it sees what it thinks is a split
brain scenario because something still has a lock on the lockfile, and so
bans itself.

I'm guessing the solution is to either reduce the mds session timeout or
increase the amount of time/retries for CTDB, but I'm not sure what's the
best approach. Does anyone have any ideas?
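
For what it's worth, the ~5 minutes before CTDB unbans itself matches its 
default RecoveryBanPeriod of 300 seconds, so one knob on the CTDB side 
might be something like (illustrative value, untested here):

ctdb setvar RecoveryBanPeriod 60
# or persistently via CTDB_SET_RecoveryBanPeriod=60 in the ctdb config file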

Nick

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS + CTDB/Samba - MDS session timeout on lockfile

2016-05-09 Thread Sage Weil
On Mon, 9 May 2016, Nick Fisk wrote:
> Hi All,
> 
> I've been testing an active/active Samba cluster over CephFS, performance
> seems really good with small files compared to Gluster. Soft reboots work
> beautifully with little to no interruption in file access. However when I
> perform a hard shutdown/reboot of one of the samba nodes, the remaining node
> detects that the other Samba node has disappeared but then eventually bans
> itself. If I leave everything for around 5 minutes, CTDB unbans itself and
> then everything continues running.
> 
> From what I can work out it looks like as the MDS has a stale session from
> the powered down node, it won't let the remaining node access the CTDB lock
> file (which is also sitting on the CephFS). CTDB meanwhile is hammering
> away trying to access the lock file, but it sees what it thinks is a split
> brain scenario because something still has a lock on the lockfile, and so
> bans itself.
> 
> I'm guessing the solution is to either reduce the mds session timeout or
> increase the amount of time/retries for CTDB, but I'm not sure what's the
> best approach. Does anyone have any ideas?

I believe Ira was looking at this exact issue, and addressed it by 
lowering the mds_session_timeout to 30 seconds?

sage

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS + CTDB/Samba - MDS session timeout on lockfile

2016-05-09 Thread Gregory Farnum
On Mon, May 9, 2016 at 8:48 AM, Sage Weil  wrote:
> On Mon, 9 May 2016, Nick Fisk wrote:
>> Hi All,
>>
>> I've been testing an active/active Samba cluster over CephFS, performance
>> seems really good with small files compared to Gluster. Soft reboots work
>> beautifully with little to no interruption in file access. However when I
>> perform a hard shutdown/reboot of one of the samba nodes, the remaining node
>> detects that the other Samba node has disappeared but then eventually bans
>> itself. If I leave everything for around 5 minutes, CTDB unbans itself and
>> then everything continues running.
>>
>> From what I can work out it looks like as the MDS has a stale session from
>> the powered down node, it won't let the remaining node access the CTDB lock
>> file (which is also sitting on the CephFS). CTDB meanwhile is hammering
>> away trying to access the lock file, but it sees what it thinks is a split
>> brain scenario because something still has a lock on the lockfile, and so
>> bans itself.
>>
>> I'm guessing the solution is to either reduce the mds session timeout or
>> increase the amount of time/retries for CTDB, but I'm not sure what's the
>> best approach. Does anyone have any ideas?
>
> I believe Ira was looking at this exact issue, and addressed it by
> lowering the mds_session_timeout to 30 seconds?

That's the default timeout. I think he lowered the beacon intervals to
5 seconds, plus whatever else flows out from that.

We aren't quite sure if that's a good idea for real deployments or not, though!
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ACL support in Jewel using fuse and SAMBA

2016-05-09 Thread Gregory Farnum
On Sat, May 7, 2016 at 9:53 PM, Eric Eastman
 wrote:
> On Fri, May 6, 2016 at 2:14 PM, Eric Eastman
>  wrote:
>
>> As it should be working, I will increase the logging level in my
>> smb.conf file and see what info I can get out of the logs, and report back.
>
> Setting the log level = 20 in my smb.conf file, and trying to add an
> additional user to a directory on the Windows 2012 server, that has
> mounted the share using a fuse mount to a Ceph file system shows the
> error: "Operation not supported"  in the smbd log file:
>
> [2016/05/07 23:41:19.213997, 10, pid=2823630, effective(2000501,
> 2000514), real(2000501, 0)]
> ../source3/modules/vfs_posixacl.c:92(posixacl_sys_acl_set_file)
>   Calling acl_set_file: New folder (4), 0
> [2016/05/07 23:41:19.214170, 10, pid=2823630, effective(2000501,
> 2000514), real(2000501, 0)]
> ../source3/modules/vfs_posixacl.c:111(posixacl_sys_acl_set_file)
>   acl_set_file failed: Operation not supported
>
> A simple test of setting an ACL from the command line to a fuse
> mounted Ceph file system also fails:
> # mkdir /cephfsFUSE/x
> # setfacl -m d:o:rw /cephfsFUSE/x
> setfacl: /cephfsFUSE/x: Operation not supported
>
> The same test to the same Ceph file system using the kernel mount
> method works.
>
> Is there some option in my ceph.conf file or on the mount line that
> needs to be used to support setting ACLs on a fuse mounted Ceph file
> system?

A quick check of the man page doesn't tell me what setfacl is doing,
but I imagine this is another oddity of using FUSE filesystems.

Judging by https://sourceforge.net/p/fuse/mailman/message/23787505/
there's some superblock flag that needs to be set in order for the VFS
to allow ACLs. I'm not sure offhand if that's something that FUSE will
let us do or not; please create a tracker ticket and somebody will get
to it.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to avoid kernel conflicts

2016-05-09 Thread K.C. Wong
The systems on which the `rbd map` hang problem occurred are
definitely not under memory stress. I don't believe they
are doing a lot of disk I/O either. Here's the basic set-up:

* all nodes in the "data-plane" are identical
* they each host an OSD instance, sharing one of the drives
* I'm running Docker containers using an RBD volume plugin and
  Docker Compose
* when the hang happens, the most visible behavior is that
  `docker ps` hangs
* then I run `systemctl status` and see an `rbd map` process
  spawned by the RBD volume plugin
* I then tried an `strace -f -p ` and that process
  promptly exits (with RC 0) and the hang resolves itself

I'll try to capture the strace output the next time I run into
it and share it with the mailing list.
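
Something along these lines is what I have in mind (the PID and output 
path are placeholders):

pgrep -f 'rbd map'                        # find the stuck rbd map process
strace -f -tt -o /tmp/rbdmap.strace -p <pid>
dmesg | tail -n 50                        # look for libceph / hung task messages
cat /sys/kernel/debug/ceph/*/osdc         # in-flight requests, if debugfs is mounted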

Thanks, Ilya.

-kc

> On May 9, 2016, at 2:21 AM, Ilya Dryomov  wrote:
> 
> On Mon, May 9, 2016 at 12:19 AM, K.C. Wong  wrote:
>> 
>>> As the tip said, you should not use rbd via kernel module on an OSD host
>>> 
>>> However, using it with userspace code (librbd etc, as in kvm) is fine
>>> 
>>> Generally, you should not have both:
>>> - "server" in userspace
>>> - "client" in kernelspace
>> 
>> If `librbd` would help avoid this problem, then switching to `rbd-fuse`
>> should do the trick, right?
>> 
>> The reason for my line of question is that I've seen occasional freeze
>> up of `rbd map` that's resolved by a 'slight tap' by way of an strace.
>> There is definitely great attractiveness to not have specialized nodes
>> and make every one the same as the next one on the rack.
> 
> The problem with placing the kernel client on the OSD node is the
> potential deadlock under heavy I/O when memory becomes scarce.  It's
> not recommended, but people are doing it - if you don't stress your
> system too much, it'll never happen.
> 
> "rbd map" freeze is definitely not related to the abov.  Did the actual
> command hang?  Could you describe what you saw in more detail and how
> did strace help?  It could be that you ran into
> 
>http://tracker.ceph.com/issues/14737
> 
> Thanks,
> 
>Ilya

K.C. Wong
kcw...@verseon.com
4096R/B8995EDE  E527 CBE8 023E 79EA 8BBB  5C77 23A6 92E9 B899 5EDE
hkps://hkps.pool.sks-keyservers.net



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ACL support in Jewel using fuse and SAMBA

2016-05-09 Thread Eric Eastman
On Mon, May 9, 2016 at 10:36 AM, Gregory Farnum  wrote:
> On Sat, May 7, 2016 at 9:53 PM, Eric Eastman
>  wrote:
>> On Fri, May 6, 2016 at 2:14 PM, Eric Eastman
>>  wrote:
>>

>>
>> A simple test of setting an ACL from the command line to a fuse
>> mounted Ceph file system also fails:
>> # mkdir /cephfsFUSE/x
>> # setfacl -m d:o:rw /cephfsFUSE/x
>> setfacl: /cephfsFUSE/x: Operation not supported
>>
>> The same test to the same Ceph file system using the kernel mount
>> method works.
>>
>> Is there some option in my ceph.conf file or on the mount line that
>> needs to be used to support setting ACLs on a fuse mounted Ceph file
>> system?
>
> A quick check of the man page doesn't tell me what setfacl is doing,
> but I imagine this is another oddity of using FUSE filesystems.
>
> Judging by https://sourceforge.net/p/fuse/mailman/message/23787505/
> there's some superblock flag that needs to be set in order for the VFS
> to allow ACLs. I'm not sure offhand if that's something that FUSE will
> let us do or not; please create a tracker ticket and somebody will get
> to it.
> -Greg

Thank you for your help. I have opened: http://tracker.ceph.com/issues/15783

Eric
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS + CTDB/Samba - MDS session timeout on lockfile

2016-05-09 Thread Eric Eastman
I am trying to do some similar testing with SAMBA and CTDB with the
Ceph file system.  Are you using the vfs_ceph SAMBA module or are you
kernel mounting the Ceph file system?

Thanks
Eric

On Mon, May 9, 2016 at 9:31 AM, Nick Fisk  wrote:
> Hi All,
>
> I've been testing an active/active Samba cluster over CephFS, performance
> seems really good with small files compared to Gluster. Soft reboots work
> beautifully with little to no interruption in file access. However when I
> perform a hard shutdown/reboot of one of the samba nodes, the remaining node
> detects that the other Samba node has disappeared but then eventually bans
> itself. If I leave everything for around 5 minutes, CTDB unbans itself and
> then everything continues running.
>
> From what I can work out it looks like as the MDS has a stale session from
> the powered down node, it won't let the remaining node access the CTDB lock
> file (which is also sitting the on the CephFS). CTDB meanwhile is hammering
> away trying to access the lock file, but it sees what it thinks is a split
> brain scenario because something still has a lock on the lockfile, and so
> bans itself.
>
> I'm guessing the solution is to either reduce the mds session timeout or
> increase the amount of time/retries for CTDB, but I'm not sure what's the
> best approach. Does anyone have any ideas?
>
> Nick
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS + CTDB/Samba - MDS session timeout on lockfile

2016-05-09 Thread Nick Fisk
> -Original Message-
> From: Ira Cooper [mailto:icoo...@redhat.com]
> Sent: 09 May 2016 17:31
> To: Sage Weil 
> Cc: Nick Fisk ; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] CephFS + CTDB/Samba - MDS session timeout on
> lockfile
> - Original Message -
> > On Mon, 9 May 2016, Nick Fisk wrote:
> > > Hi All,
> > >
> > > I've been testing an active/active Samba cluster over CephFS,
> > > performance seems really good with small files compared to Gluster.
> > > Soft reboots work beautifully with little to no interruption in file
> > > access. However when I perform a hard shutdown/reboot of one of the
> > > samba nodes, the remaining node detects that the other Samba node
> > > has disappeared but then eventually bans itself. If I leave
> > > everything for around 5 minutes, CTDB unbans itself and then
> > > everything continues running.
> > >
> > > From what I can work out it looks like as the MDS has a stale
> > > session from the powered down node, it won't let the remaining node
> > > access the CTDB lock file (which is also sitting the on the CephFS).
> > > CTDB meanwhile is hammering away trying to access the lock file, but
> > > it sees what it thinks is a split brain scenario because something
> > > still has a lock on the lockfile, and so bans itself.
> > >
> > > I'm guessing the solution is to either reduce the mds session
> > > timeout or increase the amount of time/retries for CTDB, but I'm not
> > > sure what's the best approach. Does anyone have any ideas?
> >
> > I believe Ira was looking at this exact issue, and addressed it by
> > lowering the mds_session_timeout to 30 seconds?
> 
> Actually...
> 
> There's a problem with the way I did it, in that there's issues in CephFS that
> start to come out.  Like the fact that it doesn't ban clients properly. :(

Could you shed any more light on what these issues might be? I'm assuming they 
are around the locking part of ctdb?

> 
> Greg's made comments about this not being production safe, I tend to agree.
> ;)
> 
> But it is possible, to make the cluster happy, I've been testing on VMs with
> the following added to my ceph.conf for "a while" now.
> 
> DISCLAIMER: THESE ARE NOT PRODUCTION SETTINGS!  DO NOT USE IN
> PRODUCTION IF YOU LIKE YOUR DATA!
> 
> mds_session_timeout = 5
> mds_tick_interval = 1
> mon_tick_interval = 1
> mon_session_timeout = 2
> mds_session_autoclose = 15

These all look like they make Ceph more responsive to the loss of a client. As 
per your warning above, what negative effects do you see potentially arising 
from them? Or is that more of a warning because they haven't had long-term testing?

If the problem is only around the ctdb locking to avoid split brain, I would 
imagine using ctdb in conjunction with pacemaker to handle the fencing would 
also be a workaround?

> 
> Since I did this, there have been changes made to CTDB to allow an external
> program to be the arbitrator instead of the fcntl lockfile.  I'm working on an
> etcd integration for that.  Not that it is that complicated, but making sure 
> you
> get the details right is a minor pain.
> 
> Also I'll be giving a talk on all of this at SambaXP on Thursday, so if you 
> are
> there, feel free to catch me in the hall.  (That goes for anyone interested in
> this topic or ceph/samba topics in general!)

I would be really interested in slides/video if any are available after the event.

> 
> Clearly my being at SambaXP will slow the etcd integration down.  And I'm
> betting Greg, John or Sage will want to talk to me about using mon instead of
> etcd ;).  Call it a "feeling".
> 
> Cheers,
> 
> -Ira

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS + CTDB/Samba - MDS session timeout on lockfile

2016-05-09 Thread Nick Fisk
Hi Eric,

> -Original Message-
> From: Eric Eastman [mailto:eric.east...@keepertech.com]
> Sent: 09 May 2016 19:21
> To: Nick Fisk 
> Cc: Ceph Users 
> Subject: Re: [ceph-users] CephFS + CTDB/Samba - MDS session timeout on
> lockfile
> 
> I am trying to do some similar testing with SAMBA and CTDB with the Ceph
> file system.  Are you using the vfs_ceph SAMBA module or are you kernel
> mounting the Ceph file system?

I'm using the kernel client. I couldn't find any up-to-date information on 
whether the vfs plugin supports all the necessary bits and pieces.

How is your testing coming along? I would be very interested in any findings 
you may have come across.

Nick

> 
> Thanks
> Eric
> 
> On Mon, May 9, 2016 at 9:31 AM, Nick Fisk  wrote:
> > Hi All,
> >
> > I've been testing an active/active Samba cluster over CephFS,
> > performance seems really good with small files compared to Gluster.
> > Soft reboots work beautifully with little to no interruption in file
> > access. However when I perform a hard shutdown/reboot of one of the
> > samba nodes, the remaining node detects that the other Samba node has
> > disappeared but then eventually bans itself. If I leave everything for
> > around 5 minutes, CTDB unbans itself and then everything continues
> running.
> >
> > From what I can work out it looks like as the MDS has a stale session
> > from the powered down node, it won't let the remaining node access the
> > CTDB lock file (which is also sitting on the CephFS). CTDB
> > meanwhile is hammering away trying to access the lock file, but it
> > sees what it thinks is a split brain scenario because something still
> > has a lock on the lockfile, and so bans itself.
> >
> > I'm guessing the solution is to either reduce the mds session timeout
> > or increase the amount of time/retries for CTDB, but I'm not sure
> > what's the best approach. Does anyone have any ideas?
> >
> > Nick
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS + CTDB/Samba - MDS session timeout on lockfile

2016-05-09 Thread Eric Eastman
On Mon, May 9, 2016 at 3:28 PM, Nick Fisk  wrote:
> Hi Eric,
>
>>
>> I am trying to do some similar testing with SAMBA and CTDB with the Ceph
>> file system.  Are you using the vfs_ceph SAMBA module or are you kernel
>> mounting the Ceph file system?
>
> I'm using the kernel client. I couldn't find any up to date information on if 
> the vfs plugin supported all the necessary bits and pieces.
>
> How is your testing coming along? I would be very interested in any findings 
> you may have come across.
>
> Nick

I am also using CephFS kernel mounts, with 4 SAMBA gateways. When, from
a SAMBA client, I write a large file (about 2GB) to a gateway that is
not the holder of the CTDB lock file, and then kill that gateway
server during the write, the IP failover works as expected. In
most cases the file ends up being the correct size after the new
server finishes writing it, but the data is corrupt: from the point
of the failover onward, the file is all zeros.

I thought the issue might be with the kernel mount, so I looked into
using the SAMBA vfs_ceph module, but I need SAMBA with AD support, and
the current vfs_ceph module, even in the SAMBA git master version, is
lacking ACL support for CephFS, as the vfs_ceph.c patches submitted to
the SAMBA mailing list are not yet available. See:
https://lists.samba.org/archive/samba-technical/2016-March/113063.html

I tried using a FUSE mount of the CephFS, and it also fails setting
ACLs.  See: http://tracker.ceph.com/issues/15783.

My current status is that IP failover is working, but I am seeing data
corruption on writes to the share when using kernel mounts. I am also
seeing the issue you reported when I kill the system holding the CTDB
lock file.  Are you verifying your data after each failover?
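
For example, one straightforward check is to stage a file locally, copy it 
through the share while hard-killing the active gateway, and compare 
checksums after the failover (paths and size are placeholders):

dd if=/dev/urandom of=/tmp/testfile bs=1M count=2048
cp /tmp/testfile /mnt/smbshare/
md5sum /tmp/testfile /mnt/smbshare/testfile   # the two sums should match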

Eric
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ACL support in Jewel using fuse and SAMBA

2016-05-09 Thread Yan, Zheng
On Tue, May 10, 2016 at 2:10 AM, Eric Eastman
 wrote:
> On Mon, May 9, 2016 at 10:36 AM, Gregory Farnum  wrote:
>> On Sat, May 7, 2016 at 9:53 PM, Eric Eastman
>>  wrote:
>>> On Fri, May 6, 2016 at 2:14 PM, Eric Eastman
>>>  wrote:
>>>
>
>>>
>>> A simple test of setting an ACL from the command line to a fuse
>>> mounted Ceph file system also fails:
>>> # mkdir /cephfsFUSE/x
>>> # setfacl -m d:o:rw /cephfsFUSE/x
>>> setfacl: /cephfsFUSE/x: Operation not supported
>>>
>>> The same test to the same Ceph file system using the kernel mount
>>> method works.
>>>
>>> Is there some option in my ceph.conf file or on the mount line that
>>> needs to be used to support setting ACLs on a fuse mounted Ceph file
>>> system?
>>
>> A quick check of the man page doesn't tell me what setfacl is doing,
>> but I imagine this is another oddity of using FUSE filesystems.
>>
>> Judging by https://sourceforge.net/p/fuse/mailman/message/23787505/
>> there's some superblock flag that needs to be set in order for the VFS
>> to allow ACLs. I'm not sure offhand if that's something that FUSE will
>> let us do or not; please create a tracker ticket and somebody will get
>> to it.
>> -Greg
>
> Thank you for your help. I have opened: http://tracker.ceph.com/issues/15783


The fuse kernel module does not have ACL support. To use ACLs, you need to add
the "--fuse_default_permission=0 --client_acl_type=posix_acl" options to
ceph-fuse. The '--fuse_default_permission=0' option disables the kernel
file permission check and lets ceph-fuse do the check.

Regards
Yan, Zheng


>
> Eric
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] thanks for a double check on ceph's config

2016-05-09 Thread Geocast
Hi members,

We have 21 hosts for ceph OSD servers; each host has 12 SATA disks (4TB
each) and 64GB of memory.
ceph version 10.2.0, Ubuntu 16.04 LTS
The whole cluster is newly installed.

Can you help check whether the arguments we put in ceph.conf are
reasonable?
thanks.

[osd]
osd_data = /var/lib/ceph/osd/ceph-$id
osd_journal_size = 2
osd_mkfs_type = xfs
osd_mkfs_options_xfs = -f
filestore_xattr_use_omap = true
filestore_min_sync_interval = 10
filestore_max_sync_interval = 15
filestore_queue_max_ops = 25000
filestore_queue_max_bytes = 10485760
filestore_queue_committing_max_ops = 5000
filestore_queue_committing_max_bytes = 1048576
journal_max_write_bytes = 1073714824
journal_max_write_entries = 1
journal_queue_max_ops = 5
journal_queue_max_bytes = 1048576
osd_max_write_size = 512
osd_client_message_size_cap = 2147483648
osd_deep_scrub_stride = 131072
osd_op_threads = 8
osd_disk_threads = 4
osd_map_cache_size = 1024
osd_map_cache_bl_size = 128
osd_mount_options_xfs = "rw,noexec,nodev,noatime,nodiratime,nobarrier"
osd_recovery_op_priority = 4
osd_recovery_max_active = 10
osd_max_backfills = 4

[client]
rbd_cache = true
rbd_cache_size = 268435456
rbd_cache_max_dirty = 134217728
rbd_cache_max_dirty_age = 5
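
For reference, the values an OSD actually picked up can be double-checked 
at runtime via the admin socket on the OSD host (the daemon id is just an 
example):

ceph daemon osd.0 config show | egrep 'filestore_queue|journal_|osd_op_threads'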
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ACL support in Jewel using fuse and SAMBA

2016-05-09 Thread Eric Eastman
On Mon, May 9, 2016 at 8:08 PM, Yan, Zheng  wrote:
> On Tue, May 10, 2016 at 2:10 AM, Eric Eastman
>  wrote:
>> On Mon, May 9, 2016 at 10:36 AM, Gregory Farnum  wrote:
>>> On Sat, May 7, 2016 at 9:53 PM, Eric Eastman
>>>  wrote:
 On Fri, May 6, 2016 at 2:14 PM, Eric Eastman
  wrote:

>>

 A simple test of setting an ACL from the command line to a fuse
 mounted Ceph file system also fails:
 # mkdir /cephfsFUSE/x
 # setfacl -m d:o:rw /cephfsFUSE/x
 setfacl: /cephfsFUSE/x: Operation not supported

 The same test to the same Ceph file system using the kernel mount
 method works.

 Is there some option in my ceph.conf file or on the mount line that
 needs to be used to support setting ACLs on a fuse mounted Ceph file
 system?
>>>
>>> A quick check of the man page doesn't tell me what setfacl is doing,
>>> but I imagine this is another oddity of using FUSE filesystems.
>>>
>>> Judging by https://sourceforge.net/p/fuse/mailman/message/23787505/
>>> there's some superblock flag that needs to be set in order for the VFS
>>> to allow ACLs. I'm not sure offhand if that's something that FUSE will
>>> let us do or not; please create a tracker ticket and somebody will get
>>> to it.
>>> -Greg
>>
>> Thank you for your help. I have opened: http://tracker.ceph.com/issues/15783
>
>
> fuse kernel does not have ACL support. To use ACL, you need to add
> "--fuse_default_permission=0 --client_acl_type=posix_acl" options to
> ceph-fuse. The  '--fuse_default_permission=0' option disables kernel
> file permission check and let ceph-fuse do the check.
>
> Regards
> Yan, Zheng
>

Thank you for the answer.  My command line test to set the ACL:

 setfacl -m d:o:rw /cephfsFUSE/x

Now works.  I did find a minor typo, where --fuse_default_permission=0
needs an "s", so it is --fuse_default_permissions=0. The line I am
using in my /etc/fstab is:

id=cephfs,keyring=/etc/ceph/client.cephfs.keyring,client_acl_type=posix_acl,fuse_default_permissions=0
/cephfsFUSE fuse.ceph noatime,noauto 0 0

-Eric
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com