Re: [ceph-users] Striping

2014-06-12 Thread David
Hi,

It depends on what you mean by a "user". You can set up pools with different
replication / erasure coding settings etc.:

http://ceph.com/docs/master/rados/operations/pools/
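
For example, something along these lines gives you a 3-replica pool and an
erasure coded pool side by side (names, PG counts and the k=2/m=1 profile are
only illustrative; adjust them for your cluster):

ceph osd pool create rbd-3rep 128 128 replicated
ceph osd pool set rbd-3rep size 3
ceph osd erasure-code-profile set ec-21-profile k=2 m=1
ceph osd pool create archive-ec 128 128 erasure ec-21-profile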

Kind Regards,
David Majchrzak


On 12 Jun 2014, at 10:22, Kumar wrote:

> Hi All,
>  
>  
> I have a ceph cluster. If a user wants just striped, distributed or
> replicated storage, can we provide these types of storage exclusively?
>  
>  
> Thanks
> Kumar
>  
>  
> 
> 


[ceph-users] Backfilling, latency and priority

2014-06-12 Thread David
Hi,

We have 5 OSD servers, with 10 OSDs each (journals on enterprise SSDs).

We lost an OSD and the cluster started to backfill the data to the rest of the 
OSDs - during which the latency skyrocketed on some OSDs and connected clients 
experienced massive IO wait.

I’m trying to rectify the situation now and from what I can tell, these are the 
settings that might help.

osd client op priority
osd recovery op priority
osd max backfills
osd recovery max active

1. Does a higher priority value mean higher priority (when the other setting
has a lower value), or does a priority of 1 mean highest priority?
2. I'm running with the defaults for these settings. Does anyone else have any
experience changing them?
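
For reference, this is the kind of runtime change I'm considering - a hedged
sketch only, the values are a conservative starting point and not a
recommendation:

ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'

The same settings can be made persistent under [osd] in ceph.conf
(osd max backfills, osd recovery max active, osd recovery op priority).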

Kind Regards,
David Majchrzak


Re: [ceph-users] Backfilling, latency and priority

2014-06-12 Thread David
Thanks Mark!

Well, our workload has more IOs and quite low throughput, perhaps 10MB/s ->
100MB/s. It's quite a mixed workload, but mostly small files (http / mail /
sql).
During the recovery, throughput ranged between 600-1000MB/s.

So the only way to currently "fix" this is to have enough IO to handle both
recovery and client IOs?
What's the easiest/best way to add more IOPS to a current cluster if you don't
want to scale out? Add more RAM to the OSD servers or add an SSD-backed r/w cache tier?

Kind Regards,

David Majchrzak


On 12 Jun 2014, at 14:42, Mark Nelson wrote:

> On 06/12/2014 03:44 AM, David wrote:
>> Hi,
>> 
>> We have 5 OSD servers, with 10 OSDs each (journals on enterprise SSDs).
>> 
>> We lost an OSD and the cluster started to backfill the data to the rest of 
>> the OSDs - during which the latency skyrocketed on some OSDs and connected 
>> clients experienced massive IO wait.
>> 
>> I’m trying to rectify the situation now and from what I can tell, these are 
>> the settings that might help.
>> 
>> osd client op priority
>> osd recovery op priority
>> osd max backfills
>> osd recovery max active
>> 
>> 1. Does a high priority value mean it has higher priority? (if the other one 
>> has lower value) Or does a priority of 1 mean highest priority?
>> 2. I’m running with default on these settings. Does anyone else have any 
>> experience changing those?
> 
> We did some investigation into this a little while back.  I suspect you'll 
> see some benefit by reducing backfill/recovery priority and max concurrent 
> operations, but you have to be careful.  We found that the higher the number 
> of concurrent client IOs (past the saturation point), the greater relative 
> proportion of throughput is used by client IO. That makes it hard to nail 
> down specific priority and concurrency settings.  If your workload requires 
> high throughput and low latency with few client IOs (ie below the saturation 
> point), you may need to overly favour client IO.  If you are over-saturating 
> the cluster with many concurrent IOs, you may want to give client IO less 
> priority.  If you overly favor client IO when over-saturating the cluster, 
> recovery can take much much longer and client throughput may actually be 
> lower in aggregate.  Obviously this isn't ideal, but seems to be what's going 
> on right now.
> 
> Mark
> 
>> 
>> Kind Regards,
>> David Majchrzak


Re: [ceph-users] spiky io wait within VMs running on rbd

2014-06-12 Thread David
Hi Simon,

Did you check iostat on the OSDs to check their utilization? What does your
ceph -w say - perhaps you're maxing out your cluster's IOPS?
Also, are you running any monitoring of your VMs' iostats? We've often found
some culprits overusing IO.

Kind Regards,
David Majchrzak

On 12 Jun 2014, at 15:22, Xu (Simon) Chen wrote:

> Hi folks,
> 
> We have two similar ceph deployments, but one of them is having trouble: VMs 
> running with ceph-provided block devices are seeing frequent high io wait, 
> every few minutes, usually 15-20%, but as high as 60-70%. This is 
> cluster-wide and not correlated with VM's IO load. We turned on rbd cache and 
> enabled writeback in qemu, but the problem persists. No-deepscrub doesn't 
> help either.
> 
> Without providing any one of our probably wrong theories, any ideas on how to 
> troubleshoot?
> 
> Thanks.
> -Simon


[ceph-users] Taking down one OSD node (10 OSDs) for maintenance - best practice?

2014-06-13 Thread David
Hi,

We’re going to take down one OSD node for maintenance (add cpu + ram) which 
might take 10-20 minutes.
What’s the best practice here in a production cluster running dumpling 
0.67.7-1~bpo70+1?

Kind Regards,
David Majchrzak



Re: [ceph-users] Taking down one OSD node (10 OSDs) for maintenance - best practice?

2014-06-13 Thread David
Thanks Wido,

So with noout set, data will be degraded but not resynced, which won't interrupt
operations (we're running the default 3 replicas and a normal map, so each OSD
node only has 1 replica of the data).
Do we need to do anything after bringing the node up again, or will it resync
automatically?

Kind Regards,
David Majchrzak

On 13 Jun 2014, at 11:13, Wido den Hollander wrote:

> On 06/13/2014 10:56 AM, David wrote:
>> Hi,
>> 
>> We’re going to take down one OSD node for maintenance (add cpu + ram) which 
>> might take 10-20 minutes.
>> What’s the best practice here in a production cluster running dumpling 
>> 0.67.7-1~bpo70+1?
>> 
> 
> I suggest:
> 
> $ ceph osd set noout
> 
> This way no OSD will be marked as out, preventing data re-distribution.
> 
> After the OSDs are back up and synced:
> 
> $ ceph osd unset noout
> 
>> Kind Regards,
>> David Majchrzak
>> 
>> 
> 
> 
> -- 
> Wido den Hollander
> 42on B.V.
> 
> Phone: +31 (0)20 700 9902
> Skype: contact42on


Re: [ceph-users] Taking down one OSD node (10 OSDs) for maintenance - best practice?

2014-06-13 Thread David
Alright, thanks! :)

Kind Regards,
David Majchrzak

On 13 Jun 2014, at 11:21, Wido den Hollander wrote:

> On 06/13/2014 11:18 AM, David wrote:
>> Thanks Wido,
>> 
>> So during no out data will be degraded but not resynced, which won’t 
>> interrupt operations ( running default 3 replicas and a normal map, so each 
>> osd node only has 1 replica of the data)
>> Do we need to do anything after bringing the node up again or will it 
>> resynch automatically?
>> 
> 
> Correct. The OSDs will be marked as down, so that will cause the PGs to go 
> into a degraded state, but they will stay marked as "in", not triggering data 
> re-distribution.
> 
> You don't have to do anything. Just let the machine and OSDs boot and Ceph 
> will take care of the rest (assuming it's all configured properly).
> 
> Afterwards unset the noout flag.
> 
> Wido
> 
>> Kind Regards,
>> David Majchrzak
>> 
>> On 13 Jun 2014, at 11:13, Wido den Hollander wrote:
>> 
>>> On 06/13/2014 10:56 AM, David wrote:
>>>> Hi,
>>>> 
>>>> We’re going to take down one OSD node for maintenance (add cpu + ram) 
>>>> which might take 10-20 minutes.
>>>> What’s the best practice here in a production cluster running dumpling 
>>>> 0.67.7-1~bpo70+1?
>>>> 
>>> 
>>> I suggest:
>>> 
>>> $ ceph osd set noout
>>> 
>>> This way NO OSD will be marked as out and prevent data re-distribution.
>>> 
>>> After the OSDs are back up and synced:
>>> 
>>> $ ceph osd unset noout
>>> 
>>>> Kind Regards,
>>>> David Majchrzak
>>>> 
>>>> 
>>> 
>>> 
>>> --
>>> Wido den Hollander
>>> 42on B.V.
>>> 
>>> Phone: +31 (0)20 700 9902
>>> Skype: contact42on
>> 
> 
> 
> -- 
> Wido den Hollander
> 42on B.V.
> 
> Phone: +31 (0)20 700 9902
> Skype: contact42on



Re: [ceph-users] Taking down one OSD node (10 OSDs) for maintenance - best practice?

2014-06-19 Thread David
Hi,

Thanks all for the answers - we actually already did this last night, one OSD
node at a time, without disrupting service.
We used the noout flag and also paused deep scrubbing, which was running, with
the nodeep-scrub flag during the maintenance.
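
For anyone searching the archives later, the flag handling boils down to
something like this sketch (not our exact shell history):

ceph osd set noout
ceph osd set nodeep-scrub
# shut the node down, do the hardware work, boot it, wait for recovery
ceph osd unset nodeep-scrub
ceph osd unset noout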

Took down one node with 10 OSDs just through normal shutdown and put in CPU / 
RAM, took around 5-7 min and booted it again. When it came up it recovered the 
missing writes - then when it was done we took down the next one until we had 
finished our 5 node cluster.

There was of course a little bit of iowait on some disks due to higher latency 
during the recovery process, nothing too disruptive for our workload (since we 
mostly have high workload during daytime and did this during the night).

Kind Regards,
David Majchrzak


On 19 Jun 2014, at 19:58, Gregory Farnum wrote:

> No, you definitely don't need to shut down the whole cluster. Just do
> a polite shutdown of the daemons, optionally with the noout flag that
> Wido mentioned.
> Software Engineer #42 @ http://inktank.com | http://ceph.com
> 
> 
> On Thu, Jun 19, 2014 at 1:55 PM, Alphe Salas Michels  wrote:
>> Hello, the best practice is to simply shut down the whole cluster, starting
>> from the clients, then the monitors, the MDS and the OSDs. You do your maintenance,
>> then you bring everything back, starting with monitors, MDS, OSDs, clients.
>> 
>> Otherwise, the missing OSDs will lead to a reconstruction of your cluster
>> that will not end with the return of the "faulty" OSD(s). If you turn off
>> everything related to the Ceph cluster, then it will be transparent to the
>> monitors, and they will not have to deal with partial reconstruction, cleanup
>> and rescrubbing of the returned OSD(s).
>> 
>> best regards.
>> 
>> Alphe Salas
>> IT engineer.
>> 
>> 
>> 
>> On 06/13/2014 04:56 AM, David wrote:
>>> 
>>> Hi,
>>> 
>>> We’re going to take down one OSD node for maintenance (add cpu + ram)
>>> which might take 10-20 minutes.
>>> What’s the best practice here in a production cluster running dumpling
>>> 0.67.7-1~bpo70+1?
>>> 
>>> Kind Regards,
>>> David Majchrzak
>>> 


Re: [ceph-users] How to enable the writeback qemu rbd cache

2014-07-08 Thread David
Do you set cache=writeback in your vm’s qemu conf for that disk?
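
For illustration only, on the qemu command line it would look roughly like the
following (pool/image name and the admin id are made up here; with libvirt the
equivalent is cache='writeback' on the disk's driver element):

qemu-system-x86_64 ... -drive file=rbd:rbd/vm-disk-1:id=admin,format=raw,if=virtio,cache=writeback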

// david


On 8 Jul 2014, at 14:34, lijian wrote:

> Hello,
> 
> I want to enable the qemu rbd writeback cache, the following is the settings 
> in /etc/ceph/ceph.conf
> [client]
> rbd_cache = true
> rbd_cache_writethrough_until_flush = false
> rbd_cache_size = 27180800
> rbd_cache_max_dirty = 20918080
> rbd_cache_target_dirty = 16808000
> rbd_cache_max_dirty_age = 60
> 
> and the next section is the vm definition xml:
> [domain XML disk definition not preserved in the archive]
> 
> 
> my host OS is Ubuntu, kernel 3.11.0-12-generic, the kvm-qemu is 
> 1.5.0+dfsg-3ubuntu5.4, the guest os is Ubuntu 13.11
> ceph version is 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
> 
> No performance improvement using the above cache settings. So what am I doing 
> wrong? Please help, thanks!
> 
> Jian Li 
> 
> 


[ceph-users] Possible to schedule deep scrub to nights?

2014-07-18 Thread David
Are there any known workarounds to schedule deep scrubs to run nightly?
Latency does go up a little bit when it runs so I’d rather that it didn’t 
affect our daily activities.

Kind Regards,
David



Re: [ceph-users] Possible to schedule deep scrub to nights?

2014-07-20 Thread David
Thanks!

Found this thread, guess I’ll do something like this then.
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg09984.html

Question though - will it still obey the scrubbing variables? Say I schedule
1000 PGs during the night, will it still just do 1 OSD at a time (the default
max scrubs)?
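
In case it helps others, the approach from that thread amounts to a cron job
that toggles the nodeep-scrub flag, roughly like this sketch (the times are an
example and I haven't tested this verbatim):

# /etc/cron.d/ceph-deep-scrub
0 7 * * *   root   ceph osd set nodeep-scrub
0 22 * * *  root   ceph osd unset nodeep-scrub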

Kind Regards,
David


On 18 Jul 2014, at 20:04, Gregory Farnum wrote:

> There's nothing built in to the system but I think some people have
> had success with scripts that set nobackfill during the day, and then
> trigger them regularly at night. Try searching the list archives. :)
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
> 
> 
> On Fri, Jul 18, 2014 at 12:56 AM, David  wrote:
>> Is there any known workarounds to schedule deep scrubs to run nightly?
>> Latency does go up a little bit when it runs so I’d rather that it didn’t 
>> affect our daily activities.
>> 
>> Kind Regards,
>> David
>> 


Re: [ceph-users] Using Crucial MX100 for journals or cache pool

2014-08-01 Thread David
Performance seems quite low on those. I’d really step it up to intel s3700s.

Check the performance benchmarks here and compare between them:

http://www.anandtech.com/show/8066/crucial-mx100-256gb-512gb-review/3

http://www.anandtech.com/show/6433/intel-ssd-dc-s3700-200gb-review/3

If you’re going to run it in production I’d go with the intel one.
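
If you want to compare the drives yourself, a common journal-style test is a
small O_DSYNC sequential write with fio, roughly like this (destructive - only
run it against an empty device; /dev/sdX is a placeholder):

fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test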

Kind Regards,
David

On 1 Aug 2014, at 10:38, Andrei Mikhailovsky wrote:

> Hello guys,
> 
> Was wondering if anyone has tried using the Crucial MX100 ssds either for osd 
> journals or cache pool? It seems like a good cost effective alternative to 
> the more expensive drives and read/write performance is very good as well.
> 
> Thanks
> 
> -- 
> Andrei Mikhailovsky
> Director
> Arhont Information Security
> 
> Web: http://www.arhont.com
> http://www.wi-foo.com
> Tel: +44 (0)870 4431337
> Fax: +44 (0)208 429 3111
> PGP: Key ID - 0x2B3438DE
> PGP: Server - keyserver.pgp.com
> 
> 
> 


[ceph-users] Huge issues with slow requests

2014-09-04 Thread David
b8e9b3d1b58ba.5c00 [stat,write 2457600~16384] 3.47dbbb97 
e13901) v4 currently waiting for subops from [12,29]

Kind Regards,

David




Re: [ceph-users] Huge issues with slow requests

2014-09-04 Thread David
Hi,

Sorry for the lack of information yesterday; this was "solved" after some 30
minutes, after having reloaded/restarted all OSD daemons.
Unfortunately we couldn't pinpoint it to a single OSD or drive. All drives
seemed OK; some had a bit higher latency and we tried to out/in them to see
if it had any effect, which it did not.

The cluster consists of 3 mon servers and 5 OSD servers, each with 10 enterprise
HDDs backed by 2 S3700 SSDs for journals. The OSD servers have 256GB of RAM and
2x E5-2630 v2 @ 2.60GHz CPUs.

The log that I posted yesterday was just a small taste of the full one ;) The
slow requests were all pointing to different OSDs that they were waiting for.
We're also monitoring all of the VMs running on KVM, and we didn't see any
exceptional throughput or IOPS usage before or during this event. We were
checking iostat etc. and nothing was out of the ordinary.

Going to double-check SMART and also see if we can offload some of the cluster
in any way. If you have any other advice that'd be appreciated :)
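
For the record, this is roughly what we are checking on each OSD host (device
names are examples):

iostat -xm 2 /dev/sd[b-k]        # look for one disk with much higher await/%util than the rest
smartctl -a /dev/sdb | egrep -i 'realloc|pending|uncorrect'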

Thanks for your help!

Kind Regards,
David

On 5 Sep 2014, at 07:30, Martin B Nielsen wrote:

> Just echoing what Christian said.
> 
> Also, iirc the "currently waiting for subops from [" could also mean a problem 
> on those as it waits for an ack from them (I might remember wrong).
> 
> If that is the case you might want to check in on osd 13 & 37 as well.
> 
> With the cluster load and size you should not have this problem; I'm pretty 
> sure you're dealing with a rogue/faulty osd/node somewhere.
> 
> Cheers,
> Martin
> 
> 
> On Fri, Sep 5, 2014 at 2:28 AM, Christian Balzer  wrote:
> On Thu, 4 Sep 2014 12:02:13 +0200 David wrote:
> 
> > Hi,
> >
> > We’re running a ceph cluster with version:
> >
> > 0.67.7-1~bpo70+1
> >
> > All of a sudden we’re having issues with the cluster (running RBD images
> > for kvm) with slow requests on all of the OSD servers. Any idea why and
> > how to fix it?
> >
> You give us a Ceph version at least, but for anybody to make guesses we
> need much more information than a log spew.
> 
> How many nodes/OSDs, OS, hardware, OSD details (FS, journals on SSDs), etc.
> 
> Run atop (in a sufficiently large terminal) on all your nodes, see if you
> can spot a bottleneck, like a disk being at 100% all the time with a
> much higher avio than the others.
> Looking at your logs, I'd pay particular attention to the disk holding
> osd.22.
> A single slow disk can bring a whole large cluster to a crawl.
> If you're using a hardware controller with a battery backed up cache,
> check if that is fine, loss of the battery would switch from writeback to
> writethrough and massively slow down IOPS.
> 
> Regards,
> 
> Christian
> >
> > 2014-09-04 11:56:35.868521 mon.0 [INF] pgmap v12504451: 6860 pgs: 6860
> > active+clean; 12163 GB data, 36308 GB used, 142 TB / 178 TB avail;
> > 634KB/s rd, 487KB/s wr, 90op/s 2014-09-04 11:56:29.510270 osd.22 [WRN]
> > 15 slow requests, 1 included below; oldest blocked for > 44.745754 secs
> > 2014-09-04 11:56:29.510276 osd.22 [WRN] slow request 30.999821 seconds
> > old, received at 2014-09-04 11:55:58.510424:
> > osd_op(client.10731617.0:81868956
> > rbd_data.967e022eb141f2.0e72 [write 0~4194304] 3.c585cebe
> > e13901) v4 currently waiting for subops from [37,13] 2014-09-04
> > 11:56:30.510528 osd.22 [WRN] 21 slow requests, 6 included below; oldest
> > blocked for > 45.745989 secs 2014-09-04 11:56:30.510534 osd.22 [WRN]
> > slow request 30.122555 seconds old, received at 2014-09-04
> > 11:56:00.387925: osd_op(client.13425082.0:11962345
> > rbd_data.54f24c3d1b58ba.3753 [stat,write 1114112~8192]
> > 3.c9e49140 e13901) v4 currently waiting for subops from [13,42]
> > 2014-09-04 11:56:30.510537 osd.22 [WRN] slow request 30.122362 seconds
> > old, received at 2014-09-04 11:56:00.388118:
> > osd_op(client.13425082.0:11962352
> > rbd_data.54f24c3d1b58ba.3753 [stat,write 1134592~4096]
> > 3.c9e49140 e13901) v4 currently waiting for subops from [13,42]
> > 2014-09-04 11:56:30.510541 osd.22 [WRN] slow request 30.122298 seconds
> > old, received at 2014-09-04 11:56:00.388182:
> > osd_op(client.13425082.0:11962353
> > rbd_data.54f24c3d1b58ba.3753 [stat,write 4046848~8192]
> > 3.c9e49140 e13901) v4 currently waiting for subops from [13,42]
> > 2014-09-04 11:56:30.510544 osd.22 [WRN] slow request 30.121577 seconds
> > old, received at 2014-09-04 11:56:00.388903:
> > osd_op(client.13425082.0:11962374
> > rbd_data.54f24c3d1b58ba.47f2 [stat,write 2527232~4096]
> > 3.cd9a9015 e13901) v4 currently waiting for subops from [45,1]

Re: [ceph-users] Huge issues with slow requests

2014-09-05 Thread David
 
active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 16195KB/s 
rd, 15554KB/s wr, 2444op/s
2014-09-05 10:44:57.452136 mon.0 [INF] pgmap v12582787: 6860 pgs: 6860 
active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 15016KB/s 
rd, 16356KB/s wr, 2358op/s
2014-09-05 10:44:58.465958 mon.0 [INF] pgmap v12582788: 6860 pgs: 6860 
active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 11668KB/s 
rd, 18443KB/s wr, 2029op/s
2014-09-05 10:44:59.483462 mon.0 [INF] pgmap v12582789: 6860 pgs: 6860 
active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 12253KB/s 
rd, 10846KB/s wr, 1529op/s
2014-09-05 10:45:00.492322 mon.0 [INF] pgmap v12582790: 6860 pgs: 6860 
active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 12247KB/s 
rd, 7084KB/s wr, 1464op/s
2014-09-05 10:45:01.516581 mon.0 [INF] pgmap v12582791: 6860 pgs: 6860 
active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 16460KB/s 
rd, 12089KB/s wr, 2537op/s
2014-09-05 10:45:02.527110 mon.0 [INF] pgmap v12582792: 6860 pgs: 6860 
active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 13382KB/s 
rd, 15080KB/s wr, 2563op/s
2014-09-05 10:45:03.538090 mon.0 [INF] pgmap v12582793: 6860 pgs: 6860 
active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 10902KB/s 
rd, 18745KB/s wr, 2863op/s
2014-09-05 10:45:04.558261 mon.0 [INF] pgmap v12582794: 6860 pgs: 6860 
active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 10850KB/s 
rd, 15995KB/s wr, 2695op/s
2014-09-05 10:45:05.565750 mon.0 [INF] pgmap v12582795: 6860 pgs: 6860 
active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 9636KB/s rd, 
13262KB/s wr, 2372op/s
2014-09-05 10:45:06.593984 mon.0 [INF] pgmap v12582796: 6860 pgs: 6860 
active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 18808KB/s 
rd, 19329KB/s wr, 3819op/s
2014-09-05 10:45:07.595866 mon.0 [INF] pgmap v12582797: 6860 pgs: 6860 
active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 21265KB/s 
rd, 20743KB/s wr, 3861op/s
2014-09-05 10:45:08.624949 mon.0 [INF] pgmap v12582798: 6860 pgs: 6860 
active+clean; 12254 GB data, 36574 GB used, 142 TB / 178 TB avail; 20114KB/s 
rd, 18543KB/s wr, 3248op/s
2014-09-05 10:45:09.627901 mon.0 [INF] pgmap v12582799: 6860 pgs: 6860 
active+clean; 12254 GB data, 36574 GB used, 142 TB / 178 TB avail; 14717KB/s 
rd, 15141KB/s wr, 2302op/s
2014-09-05 10:45:10.643234 mon.0 [INF] pgmap v12582800: 6860 pgs: 6860 
active+clean; 12254 GB data, 36574 GB used, 142 TB / 178 TB avail; 8328KB/s rd, 
13950KB/s wr, 1919op/s
2014-09-05 10:45:11.651602 mon.0 [INF] pgmap v12582801: 6860 pgs: 6860 
active+clean; 12254 GB data, 36574 GB used, 142 TB / 178 TB avail; 16978KB/s 
rd, 15921KB/s wr, 3377op/s
2014-09-05 10:45:12.674819 mon.0 [INF] pgmap v12582802: 6860 pgs: 6860 
active+clean; 12254 GB data, 36574 GB used, 142 TB / 178 TB avail; 16471KB/s 
rd, 14034KB/s wr, 3379op/s
2014-09-05 10:45:13.688080 mon.0 [INF] pgmap v12582803: 6860 pgs: 6860 
active+clean; 12254 GB data, 36574 GB used, 142 TB / 178 TB avail; 16149KB/s 
rd, 12657KB/s wr, 2734op/s

Aye, we actually saw latency on the disks go up a bit when we had 128GB of RAM 
on the OSDs and decided to beef them up to 256GB which helped.
They’re running different workloads (shared hosting) but we’ve never 
encountered the issue we had yesterday even during our testing/benchmarking.

Kind Regards,
David

On 5 Sep 2014, at 09:05, Christian Balzer wrote:

> 
> Hello,
> 
> On Fri, 5 Sep 2014 08:26:47 +0200 David wrote:
> 
>> Hi,
>> 
>> Sorry for the lack of information yesterday, this was "solved" after
>> some 30 minutes, after having reloaded/restarted all osd daemons.
>> Unfortunately we couldn’t pin point it to a single OSD or drive, all
>> drives seemed ok, some had a bit higher latency and we tried to out / in
>> them to see if it had any effect which it did not.
>> 
> This is odd. 
> Having it "fixed" by restarting all OSDs would suggest either a software
> problem (bug) with Ceph or some resource other than the storage system
> being starved. But memory seems unlikely, even with bloated, leaking OSD
> daemon. And CPU seems even less likely.
> 
>> The cluster consists of 3 mon servrers, 5 OSD servers with 10 enterprise
>> HDDs backed with 2 S3700 SSDs for journals each. OSD servers have 256GB
>> of RAM, 2x E5-2630 v2 @ 2.60GHz CPUs.
>> 
>> The log that I posted yesterday was just a small taste of the full
>> one ;) They were all pointing to different osd’s that they were waiting
>> for. We’re also monitoring all of the VMs running on KVM, and we didn’t
>> see any exceptional throughput or iops usage before or during this
>> event. We were checking iostat etc and nothing was out of the ordinary..
>> 
>> Going to double check SMART and also see if we can off load some of the

Re: [ceph-users] Introducing "Learning Ceph" : The First ever Book on Ceph

2015-02-13 Thread David
Thanks, just bought a paperback copy :)
Always great to have as a reference, even if ceph is still evolving quickly.

Cheers!


On 13 Feb 2015, at 09:43, Karan Singh wrote:

> Here is the new link for sample book : 
> https://www.dropbox.com/s/2zcxawtv4q29fm9/Learning_Ceph_Sample.pdf?dl=0
> 
> 
> 
> Karan Singh 
> Systems Specialist , Storage Platforms
> CSC - IT Center for Science,
> Keilaranta 14, P. O. Box 405, FIN-02101 Espoo, Finland
> mobile: +358 503 812758
> tel. +358 9 4572001
> fax +358 9 4572302
> http://www.csc.fi/
> 
> 
>> On 13 Feb 2015, at 05:25, Frank Yu  wrote:
>> 
>> Wow, congrats!
>> BTW, I found the link to the sample copy is a 404.
>> 
>> 
>> 
>> 2015-02-06 6:53 GMT+08:00 Karan Singh :
>> Hello Community Members
>> 
>> I am happy to introduce the first book on Ceph with the title “Learning 
>> Ceph”. 
>> 
>> Many folks from the publishing house, together with the technical reviewers,
>> and I spent several months getting this book compiled and published.
>> 
>> Finally the book is up for sale (links below); I hope you will like it and
>> will surely learn a lot from it.
>> 
>> Amazon :  
>> http://www.amazon.com/Learning-Ceph-Karan-Singh/dp/1783985623/ref=sr_1_1?s=books&ie=UTF8&qid=1423174441&sr=1-1&keywords=ceph
>> Packtpub : https://www.packtpub.com/application-development/learning-ceph
>> 
>> You can grab the sample copy from here :  
>> https://www.dropbox.com/s/ek76r01r9prs6pb/Learning_Ceph_Packt.pdf?dl=0
>> 
>> Finally , I would like to express my sincere thanks to 
>> 
>> Sage Weil - For developing Ceph and everything around it as well as writing 
>> foreword for “Learning Ceph”.
>> Patrick McGarry - For his usual off the track support that too always.
>> 
>> Last but not the least , to our great community members , who are also 
>> reviewers of the book Don Talton , Julien Recurt , Sebastien Han and Zihong 
>> Chen , Thank you guys for your efforts.
>> 
>> 
>> 
>> Karan Singh 
>> Systems Specialist , Storage Platforms
>> CSC - IT Center for Science,
>> Keilaranta 14, P. O. Box 405, FIN-02101 Espoo, Finland
>> mobile: +358 503 812758
>> tel. +358 9 4572001
>> fax +358 9 4572302
>> http://www.csc.fi/
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> -- 
>> Regards
>> Frank Yu
> 


[ceph-users] Shutting down a cluster fully and powering it back up

2015-02-28 Thread David
Hi!

I'm about to do maintenance on a Ceph cluster, where we need to shut it all
down fully.
We're currently only using it for rados block devices for KVM hypervisors.

Are these steps sane?

Shutting it down

1. Shut down all IO to the cluster. This means turning off all clients (KVM
hypervisors in our case).
2. Set cluster to noout by running: ceph osd set noout
3. Shut down the MON nodes.
4. Shut down the OSD nodes.

Starting it up

1. Start the OSD nodes.
2. Start the MON nodes.
3. Check ceph -w to see the status of ceph and take actions if something is 
wrong.
4. Start up the clients (KVM hypervisors)
5. Run ceph osd unset noout
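
The only Ceph-side commands involved are the flag and a status check, roughly:

ceph osd set noout      # step 2, after client IO has stopped
ceph -w                 # step 3 on the way up, watch until the cluster is healthy
ceph osd unset noout    # step 5, once the clients are back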

Kind Regards,
David


[ceph-users] Segfault in libtcmalloc.so.4.2.2

2016-05-13 Thread David
Hi,

Been getting some segfaults in our newest ceph cluster running ceph 9.2.1-1 on 
Debian 8.3

segfault at 0 ip 7f27e85120f7 sp 7f27cff9e860 error 4 in 
libtcmalloc.so.4.2.2

I saw there's already a bug for it on the tracker:
http://tracker.ceph.com/issues/15628
Don't know how many others are affected by it. We stop and start the OSD to
bring it up again, but it's quite annoying.

I’m guessing this affects Jewel as well?

Kind Regards,

David Majchrzak



Re: [ceph-users] Segfault in libtcmalloc.so.4.2.2

2016-05-13 Thread David
Linux osd11.storage 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt20-1+deb8u3 
(2016-01-17) x86_64 GNU/Linux

apt-show-versions linux-image-3.16.0-4-amd64
linux-image-3.16.0-4-amd64:amd64/jessie-updates 3.16.7-ckt20-1+deb8u3 
upgradeable to 3.16.7-ckt25-2

apt-show-versions libtcmalloc-minimal4
libtcmalloc-minimal4:amd64/jessie 2.2.1-0.2 uptodate



> On 13 May 2016, at 16:02, Somnath Roy wrote:
> 
> What is the exact kernel version?
> Ubuntu has a new tcmalloc incorporated from the 3.16.0.50 kernel onwards. If you 
> are using an older kernel than this, it's better to upgrade the kernel or try 
> building the latest tcmalloc and see if this is happening there.
> Ceph does not package tcmalloc; it uses the tcmalloc available with the distro.
>  
> Thanks & Regards
> Somnath
>  
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of David
> Sent: Friday, May 13, 2016 6:13 AM
> To: ceph-users
> Subject: [ceph-users] Segfault in libtcmalloc.so.4.2.2
>  
> Hi,
>  
> Been getting some segfaults in our newest ceph cluster running ceph 9.2.1-1 
> on Debian 8.3
> 
> segfault at 0 ip 7f27e85120f7 sp 7f27cff9e860 error 4 in 
> libtcmalloc.so.4.2.2
>  
> I saw there’s already a bug up there on the tracker: 
> http://tracker.ceph.com/issues/15628 <http://tracker.ceph.com/issues/15628>
> Don’t know how many other are affected by it. We stop and start the osd to 
> bring it up again but it’s quite annoying.
>  
> I’m guessing this affects Jewel as well?
>  
> Kind Regards,
>  
> David Majchrzak
>  



[ceph-users] openSuse Leap 42.1, slow krbd, max_sectors_kb = 127

2016-05-23 Thread David
Hi All

I'm doing some testing with OpenSUSE Leap 42.1, it ships with kernel 4.1.12
but I've also tested with 4.1.24

When I map an image with the kernel RBD client, max_sectors_kb = 127. I'm
unable to increase:

# echo 4096 > /sys/block/rbd0/queue/max_sectors_kb
-bash: echo: write error: Invalid argument

I'm seeing very poor sequential read performance which I suspect is a
result of max_sectors_kb being stuck at 127

Does anyone know what's going on here?

The behavior I've seen on older kernels is max_sectors_kb is set to 512 but
can be increased up to max_hw_sectors_kb. On newer kernels max_sectors_kb
matches the image's object size.

The image is default 4MB object size, here's everything in
/sys/block/rbd0/queue:

# grep -r .
nomerges:0
logical_block_size:512
rq_affinity:1
discard_zeroes_data:1
max_segments:128
max_segment_size:4194304
rotational:0
scheduler:none
read_ahead_kb:512
max_hw_sectors_kb:4096
discard_granularity:4194304
discard_max_bytes:4194304
write_same_max_bytes:0
max_integrity_segments:0
max_sectors_kb:127
physical_block_size:512
add_random:0
nr_requests:128
minimum_io_size:4194304
hw_sector_size:512
optimal_io_size:4194304
iostats:1


[ceph-users] CephFS: slow writes over NFS when fs is mounted with kernel driver but fast with Fuse

2016-05-30 Thread David
Hi All

I'm having an issue with slow writes over NFS (v3) when cephfs is mounted
with the kernel driver. Writing a single 4K file from the NFS client is
taking 3 - 4 seconds, however a 4K write (with sync) into the same folder
on the server is fast as you would expect. When mounted with ceph-fuse, I
don't get this issue on the NFS client.

Test environment is a small cluster with a single MON and single MDS, all
running 10.2.1, CephFS metadata is an ssd pool, data is on spinners. The
NFS server is CentOS 7, I've tested with the current shipped kernel (3.10),
ELrepo 4.4 and ELrepo 4.6.

More info:

With the kernel driver, I mount the filesystem with "-o name=admin,secret"

I've exported a folder with the following options:

*(rw,root_squash,sync,wdelay,no_subtree_check,fsid=1244,sec=1)

I then mount the folder on a CentOS 6 client with the following options
(all default):

rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.3.231,mountvers=3,mountport=597,mountproto=udp,local_lock=none

A small 4k write is taking 3 - 4 secs:

 # time dd if=/dev/zero of=testfile bs=4k count=1
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 3.59678 s, 1.1 kB/s

real0m3.624s
user0m0.000s
sys 0m0.001s

But a sync write on the sever directly into the same folder is fast (this
is with the kernel driver):

# time dd if=/dev/zero of=testfile2 bs=4k count=1 conv=fdatasync
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.0121925 s, 336 kB/s

real0m0.015s
user0m0.000s
sys 0m0.002s

If I mount cephfs with Fuse instead of the kernel, the NFS client write is
fast:

dd if=/dev/zero of=fuse01 bs=4k count=1
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.026078 s, 157 kB/s

Does anyone know what's going on here?

Thanks


Re: [ceph-users] CephFS: slow writes over NFS when fs is mounted with kernel driver but fast with Fuse

2016-06-03 Thread David
Zheng, thanks for looking into this, it makes sense although strangely I've
set up a new nfs server (different hardware, same OS, Kernel etc.) and I'm
unable to recreate the issue. I'm no longer getting the delay, the nfs
export is still using sync. I'm now comparing the servers to see what's
different on the original server. Apologies if I've wasted your time on
this!

Jan, I did some more testing with Fuse on the original server and I was
seeing the same issue, yes I was testing from the nfs client. As above I
think there was something weird with that original server. Noted on sync vs
async, I plan on sticking with sync.

On Fri, Jun 3, 2016 at 5:03 AM, Yan, Zheng  wrote:

> On Mon, May 30, 2016 at 10:29 PM, David  wrote:
> > Hi All
> >
> > I'm having an issue with slow writes over NFS (v3) when cephfs is mounted
> > with the kernel driver. Writing a single 4K file from the NFS client is
> > taking 3 - 4 seconds, however a 4K write (with sync) into the same
> folder on
> > the server is fast as you would expect. When mounted with ceph-fuse, I
> don't
> > get this issue on the NFS client.
> >
> > Test environment is a small cluster with a single MON and single MDS, all
> > running 10.2.1, CephFS metadata is an ssd pool, data is on spinners. The
> NFS
> > server is CentOS 7, I've tested with the current shipped kernel (3.10),
> > ELrepo 4.4 and ELrepo 4.6.
> >
> > More info:
> >
> > With the kernel driver, I mount the filesystem with "-o
> name=admin,secret"
> >
> > I've exported a folder with the following options:
> >
> > *(rw,root_squash,sync,wdelay,no_subtree_check,fsid=1244,sec=1)
> >
> > I then mount the folder on a CentOS 6 client with the following options
> (all
> > default):
> >
> >
> rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.3.231,mountvers=3,mountport=597,mountproto=udp,local_lock=none
> >
> > A small 4k write is taking 3 - 4 secs:
> >
> >  # time dd if=/dev/zero of=testfile bs=4k count=1
> > 1+0 records in
> > 1+0 records out
> > 4096 bytes (4.1 kB) copied, 3.59678 s, 1.1 kB/s
> >
> > real0m3.624s
> > user0m0.000s
> > sys 0m0.001s
> >
> > But a sync write on the sever directly into the same folder is fast
> (this is
> > with the kernel driver):
> >
> > # time dd if=/dev/zero of=testfile2 bs=4k count=1 conv=fdatasync
> > 1+0 records in
> > 1+0 records out
> > 4096 bytes (4.1 kB) copied, 0.0121925 s, 336 kB/s
>
>
> Your nfs export has sync option. 'dd if=/dev/zero of=testfile bs=4k
> count=1' on nfs client is equivalent to 'dd if=/dev/zero of=testfile
> bs=4k count=1 conv=fsync' on cephfs. The reason that sync metadata
> operation takes 3~4 seconds is that the MDS flushes its journal every
> 5 seconds.  Adding async option to nfs export can avoid this delay.
>
> >
> > real0m0.015s
> > user0m0.000s
> > sys 0m0.002s
> >
> > If I mount cephfs with Fuse instead of the kernel, the NFS client write
> is
> > fast:
> >
> > dd if=/dev/zero of=fuse01 bs=4k count=1
> > 1+0 records in
> > 1+0 records out
> > 4096 bytes (4.1 kB) copied, 0.026078 s, 157 kB/s
> >
>
> In this case, ceph-fuse sends an extra request (getattr request on
> directory) to MDS. The request causes MDS to flush its journal.
> Whether or not client sends the extra request depends on what
> capabilities it has.  What capabilities client has, in turn, depend on
> how many clients are accessing the directory. In my test, nfs on
> ceph-fuse is not always fast.
>
> Yan, Zheng
>
>
> > Does anyone know what's going on here?
>
>
>
> >
> > Thanks
> >
> >


Re: [ceph-users] CephFS in the wild

2016-06-03 Thread David
I'm hoping to implement cephfs in production at some point this year so I'd
be interested to hear your progress on this.

Have you considered SSD for your metadata pool? You wouldn't need loads of
capacity although even with reliable SSD I'd probably still do x3
replication for metadata. I've been looking at the intel s3610's for this.



On Wed, Jun 1, 2016 at 9:50 PM, Brady Deetz  wrote:

> Question:
> I'm curious if there is anybody else out there running CephFS at the scale
> I'm planning for. I'd like to know some of the issues you didn't expect
> that I should be looking out for. I'd also like to simply see when CephFS
> hasn't worked out and why. Basically, give me your war stories.
>
>
> Problem Details:
> Now that I'm out of my design phase and finished testing on VMs, I'm ready
> to drop $100k on a pilo. I'd like to get some sense of confidence from the
> community that this is going to work before I pull the trigger.
>
> I'm planning to replace my 110 disk 300TB (usable) Oracle ZFS 7320 with
> CephFS by this time next year (hopefully by December). My workload is a mix
> of small and vary large files (100GB+ in size). We do fMRI analysis on
> DICOM image sets as well as other physio data collected from subjects. We
> also have plenty of spreadsheets, scripts, etc. Currently 90% of our
> analysis is I/O bound and generally sequential.
>
> In deploying Ceph, I am hoping to see more throughput than the 7320 can
> currently provide. I'm also looking to get away from traditional
> file-systems that require forklift upgrades. That's where Ceph really
> shines for us.
>
> I don't have a total file count, but I do know that we have about 500k
> directories.
>
>
> Planned Architecture:
>
> Storage Interconnect:
> Brocade VDX 6940 (40 gig)
>
> Access Switches for clients (servers):
> Brocade VDX 6740 (10 gig)
>
> Access Switches for clients (workstations):
> Brocade ICX 7450
>
> 3x MON:
> 128GB RAM
> 2x 200GB SSD for OS
> 2x 400GB P3700 for LevelDB
> 2x E5-2660v4
> 1x Dual Port 40Gb Ethernet
>
> 2x MDS:
> 128GB RAM
> 2x 200GB SSD for OS
> 2x 400GB P3700 for LevelDB (is this necessary?)
> 2x E5-2660v4
> 1x Dual Port 40Gb Ethernet
>
> 8x OSD:
> 128GB RAM
> 2x 200GB SSD for OS
> 2x 400GB P3700 for Journals
> 24x 6TB Enterprise SATA
> 2x E5-2660v4
> 1x Dual Port 40Gb Ethernet
>


Re: [ceph-users] CephFS in the wild

2016-06-06 Thread David
On Mon, Jun 6, 2016 at 7:06 AM, Christian Balzer  wrote:

>
> Hello,
>
> On Fri, 3 Jun 2016 15:43:11 +0100 David wrote:
>
> > I'm hoping to implement cephfs in production at some point this year so
> > I'd be interested to hear your progress on this.
> >
> > Have you considered SSD for your metadata pool? You wouldn't need loads
> > of capacity although even with reliable SSD I'd probably still do x3
> > replication for metadata. I've been looking at the intel s3610's for
> > this.
> >
> That's an interesting and potentially quite beneficial thought, but it
> depends on a number of things (more below).
>
> I'm using S3610s (800GB) for a cache pool with 2x replication and am quite
> happy with that, but then again I have a very predictable usage pattern
> and am monitoring those SSDs religiously and I'm sure they will outlive
> things by a huge margin.
>
> We didn't go for 3x replication due to (in order):
> a) cost
> b) rack space
> c) increased performance with 2x


I'd also be happy with 2x replication for data pools and that's probably
what I'll do for the reasons you've given. I plan on using File Layouts to
map some dirs to the ssd pool. I'm testing this at the moment and it's an
awesome feature. I'm just very paranoid with the metadata and considering
the relatively low capacity requirement I'd stick with the 3x replication
although as you say that means a performance hit.
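
For reference, pointing a directory at an SSD pool with file layouts looks
roughly like this (pool, filesystem and path names are examples; the pool has
to be added to the filesystem before it can be used in a layout):

ceph fs add_data_pool cephfs ssd-data
setfattr -n ceph.dir.layout.pool -v ssd-data /mnt/cephfs/hot

New files created under that directory then go to the ssd-data pool; existing
files keep their old layout.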


>
> Now for how useful/helpful a fast meta-data pool would be, I reckon it
> depends on a number of things:
>
> a) Is the cluster write or read heavy?
> b) Do reads, flocks, anything that is not directly considered a read
>cause writes to the meta-data pool?
> c) Anything else that might cause write storms to the meta-data pool, like
>bit in the current NFS over CephFS thread with sync?
>
> A quick glance at my test cluster seems to indicate that CephFS meta data
> per filesystem object is about 2KB, somebody with actual clues please
> confirm this.
>

2K per object appears to be the case on my test cluster too.


> Brady has large amounts of NVMe space left over in his current design,
> assuming 10GB journals about 2.8TB of raw space.
> So if running the (verified) numbers indicates that the meta data can fit
> in this space, I'd put it there.
>
> Otherwise larger SSDs (indeed S3610s) for OS and meta-data pool storage may
> be the way forward.
>
> Regards,
>
> Christian
> >
> >
> > On Wed, Jun 1, 2016 at 9:50 PM, Brady Deetz  wrote:
> >
> > > Question:
> > > I'm curious if there is anybody else out there running CephFS at the
> > > scale I'm planning for. I'd like to know some of the issues you didn't
> > > expect that I should be looking out for. I'd also like to simply see
> > > when CephFS hasn't worked out and why. Basically, give me your war
> > > stories.
> > >
> > >
> > > Problem Details:
> > > Now that I'm out of my design phase and finished testing on VMs, I'm
> > > ready to drop $100k on a pilo. I'd like to get some sense of
> > > confidence from the community that this is going to work before I pull
> > > the trigger.
> > >
> > > I'm planning to replace my 110 disk 300TB (usable) Oracle ZFS 7320 with
> > > CephFS by this time next year (hopefully by December). My workload is
> > > a mix of small and vary large files (100GB+ in size). We do fMRI
> > > analysis on DICOM image sets as well as other physio data collected
> > > from subjects. We also have plenty of spreadsheets, scripts, etc.
> > > Currently 90% of our analysis is I/O bound and generally sequential.
> > >
> > > In deploying Ceph, I am hoping to see more throughput than the 7320 can
> > > currently provide. I'm also looking to get away from traditional
> > > file-systems that require forklift upgrades. That's where Ceph really
> > > shines for us.
> > >
> > > I don't have a total file count, but I do know that we have about 500k
> > > directories.
> > >
> > >
> > > Planned Architecture:
> > >
> > > Storage Interconnect:
> > > Brocade VDX 6940 (40 gig)
> > >
> > > Access Switches for clients (servers):
> > > Brocade VDX 6740 (10 gig)
> > >
> > > Access Switches for clients (workstations):
> > > Brocade ICX 7450
> > >
> > > 3x MON:
> > > 128GB RAM
> > > 2x 200GB SSD for OS
> > > 2x 400

Re: [ceph-users] which CentOS 7 kernel is compatible with jewel?

2016-06-13 Thread David
"rbd ls" does work with 4.6 (just tested with 4.6.1-1.el7.elrepo.x86_64).
That's against a 10.2.0 cluster with ceph-common-10.2.0-0

What's the error you're getting? Are you using default rbd pool or
specifying pool with '-p'? I'd recommend checking your ceph-common package.

Thanks,

On Fri, Jun 10, 2016 at 8:29 PM, Michael Kuriger  wrote:

> Hi Everyone,
> I’ve been running jewel for a while now, with tunables set to hammer.
> However, I want to test the new features but cannot find a fully compatible
> Kernel for CentOS 7.  I’ve tried a few of the elrepo kernels -
> elrepo-kernel 4.6 works perfectly in CentOS 6, but not CentOS 7.  I’ve
> tried 3.10, 4.3, 4.5, and 4.6.
>
> What does seem to work with the 4.6 kernel is mounting, read/write to a
> cephfs, and rbd map / mounting works also.  I just can’t do 'rbd ls'
>
> 'rbd ls' does not work with 4.6 kernel but it does work with the stock
> 3.10 kernel.
> 'rbd mount' does not work with the stock 3.10 kernel, but works with the
> 4.6 kernel.
>
> Very odd.  Any advice?
>
> Thanks!
>
>
>
>
> Michael Kuriger
> Sr. Unix Systems Engineer
> * mk7...@yp.com |( 818-649-7235
>
>


Re: [ceph-users] ceph benchmark

2016-06-16 Thread David
I'm probably misunderstanding the question but if you're getting 3GB/s from
your dd, you're already caching. Can you provide some more detail on what
you're trying to achieve.
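
For example, forcing the data out to the cluster makes the number much more
meaningful (same dd as in the original mail, plus a flush, or direct IO):

dd if=/dev/zero of=/cephtest/test bs=1M count=10240 conv=fdatasync
dd if=/dev/zero of=/cephtest/test bs=1M count=10240 oflag=direct
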
On 16 Jun 2016 21:53, "Patrick McGarry"  wrote:

> Moving this over to ceph-user where it’ll get the eyeballs you need.
>
> On Mon, Jun 13, 2016 at 2:58 AM, Marcus Strasser
>  wrote:
> > Hello!
> >
> >
> >
> > I have a little test cluster with 2 server. Each Server have an osd with
> 800
> > GB, there is a 10 Gbps Link between the servers.
> >
> > On a ceph-client i have configured a cephfs, mount kernelspace. The
> client
> > is also connected with a 10 Gbps Link.
> >
> > All 3 use debian
> >
> > 4.5.5 kernel
> >
> > 64 GB mem
> >
> > There is no special configuration.
> >
> >
> >
> > Now the question:
> >
> > When i use the dd (~11GB) command in the cephfs mount, i get a result of
> 3
> > GB/s
> >
> >
> >
> > dd if=/dev/zero of=/cephtest/test bs=1M count=10240
> >
> >
> >
> > Is it possble to transfer the data faster (use full capacity oft he
> network)
> > and cache it with the memory?
> >
> >
> >
> > Thanks,
> >
> > Marcus Strasser
> >
> >
> >
> >
> >
> > Marcus Strasser
> >
> > Linux Systeme
> >
> > Russmedia IT GmbH
> >
> > A-6850 Schwarzach, Gutenbergstr. 1
> >
> >
> >
> > T +43 5572 501-872
> >
> > F +43 5572 501-97872
> >
> > marcus.stras...@highspeed.vol.at
> >
> > highspeed.vol.at
> >
> >
> >
> >
>
>
>
> --
>
> Best Regards,
>
> Patrick McGarry
> Director Ceph Community || Red Hat
> http://ceph.com  ||  http://community.redhat.com
> @scuttlemonkey || @ceph


Re: [ceph-users] Performance Testing

2016-06-17 Thread David
On 17 Jun 2016 3:33 p.m., "Carlos M. Perez"  wrote:

>
> Hi,
>
>
>
> I found the following on testing performance  -
http://tracker.ceph.com/projects/ceph/wiki/Benchmark_Ceph_Cluster_Performance
and have a few questions:
>
>
>
> -  By testing the block device Do the performance tests take the
overall cluster performance (how long it takes the data to replicate to the
other nodes based on copies, etc.)? or is it just a local portion, ignoring
the backend/external ceph processes?  We’re using ceph as block devices for
proxmox storage for kvms/containers.
>

I'm not sure what you mean by "local portion", are you doing the
benchmarking directly on an OSD node? When writing with rbd bench or fio,
the writes will be distributed across the cluster according to your cluster
config so the performance will reflect the various attributes of your
cluster (replication count, journal speed, network latency etc.).
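
As an example of benchmarking the cluster as a whole from any client, fio's
rbd engine can drive a test image directly (pool/image names are placeholders
and the image has to exist first):

rbd create bench-test --size 10240
fio --ioengine=rbd --clientname=admin --pool=rbd --rbdname=bench-test --rw=randwrite --bs=4k --iodepth=32 --direct=1 --runtime=60 --time_based --name=rbd-4k-randwrite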

>
>
> -  If the above is as a whole, is there a way to test the “local”
storage independently of the cluster/pool as a whole.  Basically, I’m
testing a few different journal drive options (Intel S3700, Samsung SM863)
and controllers (ICH, LSI, Adaptec) and would prefer to change hardware in
one node (also limits purchasing requirements for testing), rather than
having to replicate it in all nodes.  Getting close enough numbers to a
fully deployed setup is good enough for .  We currently have three nodes,
two pools, 6 OSDs per node, and trying to find an appropriate drive before
we scale the system and start putting workloads.
>

If I understand correctly, you're doing your rbd testing on an OSD node and
you want to just test the performance of the OSD's in that node. Localising
in this way isn't really a common use case for Ceph. You could potentially
create a new pool containing just the OSD's in the node but you would need
to play around with your crush map to get that working e.g changing the
'osd crush chooseleaf type'.

>
>
> -  Write Cache – In most benchmarking scenarios, it’s said to
disable write caching on the drive.  However, according to this (
http://tracker.ceph.com/projects/ceph/wiki/Benchmark_Ceph_Cluster_Performance)
it seems to indicate that “Newer kernels should work fine” – does this mean
that on a “modern” kernel this setting is not necessary since it’s
accounted for during the use of the journal, or that the disabling should
work fine?  We’ve seen vast differences using Sebastien Han’s guide (
http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/)
but that uses fio directly to the device (which will clear out the
partitions on a “live” journal…yes it was a test system so nothing major,
just an unexpected issue of the OSD’s not coming up after reboot).  We’ve
been disabling it but just want to check to see if this is an unnecessary
step, or a “best practice” step that should be done regardless.
>

I think you meant this link: it is saying that on kernels newer than 2.6.33 there is no need to
disable the write cache on a raw disk being used for a journal. That is
because the data is properly flushed to the disk before it sends an ACK.


>
>
> Thanks in advance….
>
>
>
> Carlos M. Perez
>
> CMP Consulting Services
>
> 305-669-1515
>
>
>
>


Re: [ceph-users] cluster ceph -s error

2016-06-18 Thread David
Is this a test cluster that has never been healthy or a working cluster
which has just gone unhealthy?  Have you changed anything? Are all hosts,
drives, network links working? More detail please. Any/all of the following
would help:

ceph health detail
ceph osd stat
ceph osd tree
Your ceph.conf
Your crushmap

On 17 Jun 2016 14:14, "Ishmael Tsoaela"  wrote:
>
> Hi All,
>
> please assist to fix the error:
>
> 1 X admin
> 2 X admin(hosting admin as well)
>
> 4 osd each node

Please provide more detail, this suggests you should have 12 osd's but your
osd map shows 10 osd's, 5 of which are down.
>
>
> cluster a04e9846-6c54-48ee-b26f-d6949d8bacb4
>  health HEALTH_ERR
> 819 pgs are stuck inactive for more than 300 seconds
> 883 pgs degraded
> 64 pgs stale
> 819 pgs stuck inactive
> 245 pgs stuck unclean
> 883 pgs undersized
> 17 requests are blocked > 32 sec
> recovery 2/8 objects degraded (25.000%)
> recovery 2/8 objects misplaced (25.000%)
> crush map has legacy tunables (require argonaut, min is
firefly)
> crush map has straw_calc_version=0
>  monmap e1: 1 mons at {nodeB=155.232.195.4:6789/0}
> election epoch 7, quorum 0 nodeB
>  osdmap e80: 10 osds: 5 up, 5 in; 558 remapped pgs
> flags sortbitwise
>   pgmap v480: 1064 pgs, 3 pools, 6454 bytes data, 4 objects
> 25791 MB used, 4627 GB / 4652 GB avail
> 2/8 objects degraded (25.000%)
> 2/8 objects misplaced (25.000%)
>  819 undersized+degraded+peered
>  181 active
>   64 stale+active+undersized+degraded
>
>


Re: [ceph-users] Should I use different pool?

2016-06-27 Thread David
Yes, you should definitely create different pools for the different drive
types. Another decision you need to make is whether you want dedicated nodes
for SSDs or want to mix them in the same nodes. You need to ensure you have
sufficient CPU and fat enough network links to get the most out of your
SSDs.

You can add multiple data pools to CephFS, so if you can identify the hot
and cold data in your dataset you could do "manual" tiering as an
alternative to using a cache tier.

18TB is a relatively small capacity; have you considered an all-SSD cluster?
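
As a rough sketch of the manual tiering idea (pool, filesystem and path
names here are placeholders), you would add an SSD-backed pool as an extra
data pool and pin the hot directories to it via their layout:

ceph osd pool create cephfs_ssd 128
ceph fs add_data_pool cephfs cephfs_ssd
setfattr -n ceph.dir.layout.pool -v cephfs_ssd /mnt/cephfs/hot

New files created under that directory would then be written to the SSD pool.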

On Sun, Jun 26, 2016 at 10:18 AM, EM - SC 
wrote:

> Hi,
>
> I'm new to ceph and in the mailing list, so hello all!
>
> I'm testing ceph and the plan is to migrate our current 18TB storage
> (zfs/nfs) to ceph. This will be using CephFS and mounted in our backend
> application.
> We are also planning on using virtualisation (opennebula) with rbd for
> images and, if it makes sense, use rbd for our oracle server.
>
> My question is about pools.
> For what I read, I should create different pools for different HD speed
> (SAS, SSD, etc).
> - What else should I consider for creating pools?
> - should I create different pools for rbd, cephfs, etc?
>
> thanks in advanced,
> em
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD Cache

2016-06-28 Thread David
Hi,

Please clarify what you mean by "osd cache": RAID controller cache or
Ceph's cache tiering feature?

On Tue, Jun 28, 2016 at 10:21 AM, Mohd Zainal Abidin Rabani <
zai...@nocser.net> wrote:

> Hi,
>
>
>
> We are using OSDs in production, with SSDs as journals. We have tested IO
> and the results are good. We plan to use an OSD cache to get better IOPS.
> Has anyone here deployed an OSD cache successfully? Please share advice here.
>
>
>
> Thanks.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How many nodes/OSD can fail

2016-06-28 Thread David
Hi,

This is probably the min_size on your cephfs data and/or metadata pool. I
believe the default is 2; if you have fewer than 2 replicas available, I/O
will stop. See:
http://docs.ceph.com/docs/master/rados/operations/pools/#set-the-number-of-object-replicas
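
You can check and, if you accept the reduced safety, relax it per pool (pool
names below are assumed):

ceph osd pool get cephfs_data min_size
ceph osd pool set cephfs_data min_size 1
ceph osd pool set cephfs_metadata min_size 1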

On Tue, Jun 28, 2016 at 10:23 AM, willi.feh...@t-online.de <
willi.feh...@t-online.de> wrote:

> Hello,
>
> I'm still very new to Ceph. I've created a small test Cluster.
>
>
>
> ceph-node1
>
> osd0
>
> osd1
>
> osd2
>
> ceph-node2
>
> osd3
>
> osd4
>
> osd5
>
> ceph-node3
>
> osd6
>
> osd7
>
> osd8
>
>
>
> My pool for CephFS has a replication count of 3. I powered off 2 nodes (6
> OSDs went down) and my cluster status became critical and my ceph
> clients (cephfs) ran into a timeout. My data (I had only one file in my pool)
> was still on one of the active OSDs. Is this the expected behaviour, that
> the cluster status becomes critical and my clients run into a timeout?
>
>
>
> Many thanks for your feedback.
>
>
>
> Regards - Willi
>
>
> 
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CEPH Replication

2016-07-01 Thread David
It will work, but be aware that 2x replication is not a good idea if your
data is important. The exception would be if the OSDs are DC-class SSDs that
you monitor closely.
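
If you want to verify the placement behaviour yourself, something like this
should do it (the default rule name and the object name are assumptions):

ceph osd crush rule dump replicated_ruleset   # look for a chooseleaf step with type "host"
ceph osd map mypool someobject                # shows which OSDs a given object maps to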

On Fri, Jul 1, 2016 at 1:09 PM, Ashley Merrick 
wrote:

> Hello,
>
> Perfect, I want to keep them on separate nodes, so wanted to make sure the
> expected behaviour was that it would do that.
>
> And no issues with running an odd number of nodes with a replication of 2?
> I know you have quorum, just wanted to make sure it would not cause issues
> when running an even replication factor.
>
> Will be adding nodes in future as required, but will always keep an uneven
> number.
>
> ,Ashley
>
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> c...@jack.fr.eu.org
> Sent: 01 July 2016 13:07
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] CEPH Replication
>
> It will put each object on 2 OSDs, on 2 separate nodes. All nodes, and all
> OSDs, will have roughly the same used space.
>
> If you want to allow both copies of an object to be stored on the same
> node, you should use osd_crush_chooseleaf_type = 0 (see
> http://docs.ceph.com/docs/master/rados/operations/crush-map/#crush-map-bucket-types
> and
> http://docs.ceph.com/docs/hammer/rados/configuration/pool-pg-config-ref/)
>
>
> On 01/07/2016 13:49, Ashley Merrick wrote:
> > Hello,
> >
> > Looking at setting up a new CEPH Cluster, starting with the following.
> >
> > 3 x CEPH OSD Servers
> >
> > Each Server:
> >
> > 20Gbps Network
> > 12 OSD's
> > SSD Journal
> >
> > Looking at running with replication of 2, will there be any issues using
> 3 nodes with a replication of two, this should "technically" give me ½ the
> available total capacity of the 3 node's?
> >
> > Will the CRUSH map automaticly setup each 12 OSD's as a separate group,
> so that the two replicated objects are put on separate OSD servers?
> >
> > Thanks,
> > Ashley
> >
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 40Gb fileserver/NIC suggestions

2016-07-13 Thread David
Aside from the 10GbE vs 40GbE question, if you're planning to export an RBD
image over smb/nfs I think you are going to struggle to reach anywhere near
1GB/s in a single-threaded read. This is because even with readahead
cranked right up you're still only going to be hitting a handful of disks at
a time. There are a few threads on this list about sequential reads with the
kernel rbd client. I think CephFS would be more appropriate in your use
case.
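
If you do test single-threaded reads from a krbd device, readahead is tuned
per block device, e.g. (the device name is a placeholder):

echo 16384 > /sys/block/rbd0/queue/read_ahead_kb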

On Wed, Jul 13, 2016 at 1:52 PM, Götz Reinicke - IT Koordinator <
goetz.reini...@filmakademie.de> wrote:

> Am 13.07.16 um 14:27 schrieb Wido den Hollander:
> >> Op 13 juli 2016 om 12:00 schreef Götz Reinicke - IT Koordinator <
> goetz.reini...@filmakademie.de>:
> >>
> >>
> >> Am 13.07.16 um 11:47 schrieb Wido den Hollander:
>  Op 13 juli 2016 om 8:19 schreef Götz Reinicke - IT Koordinator <
> goetz.reini...@filmakademie.de>:
> 
> 
>  Hi,
> 
>  can anybody give some realworld feedback on what hardware
>  (CPU/Cores/NIC) you use for a 40Gb (file)server (smb and nfs)? The
> Ceph
>  Cluster will be mostly rbd images. S3 in the future, CephFS we will
> see :)
> 
>  Thanks for some feedback and hints! Regadrs . Götz
> 
> >>> Why do you think you need 40Gb? That's some serious traffic to the
> OSDs and I doubt it's really needed.
> >>>
> >>> Latency-wise 40Gb isn't much better than 10Gb, so why not stick with
> that?
> >>>
> >>> It's also better to have more smaller nodes than a few big nodes with
> Ceph.
> >>>
> >>> Wido
> >>>
> >> Hi Wido,
> >>
> >> may be my post was misleading. The OSD Nodes do have 10G, the FIleserver
> >> in front to the Clients/Destops should have 40G.
> >>
> > Ah, the fileserver will re-export RBD/Samba? Any Xeon E5 CPU will do
> just fine I think.
> True @re-export
> > Still, 40GbE is a lot of bandwidth!
> :) I know, but we have users who like to transfer e.g. raw movie
> footage for a normal project, which might easily be 1TB, and they don't
> want to wait hours ;). Or others like to screen/stream raw 4K video
> footage, which is +- 10Gb/second ... That's the challenge :)
>
> And yes our Ceph Cluster is well designed .. on the paper ;) SSDs
> considered. With lot of helpful feedback from the List!!
>
> I just try to find linux/ceph useres with 40Gb experiences :)
>
> cheers . Götz
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CephFS | Recursive stats not displaying with GNU ls

2016-07-18 Thread David
Hi all

Recursive statistics on directories are no longer showing on an ls -l
output but getfattr is accurate:

# ls -l
total 0
drwxr-xr-x 1 root root 3 Jul 18 12:42 dir1
drwxr-xr-x 1 root root 0 Jul 18 12:42 dir2

]# getfattr -d -m ceph.dir.* dir1
# file: dir1
ceph.dir.entries="3"
ceph.dir.files="3"
ceph.dir.rbytes="27917283328"
ceph.dir.rctime="1468842139.0979444"
ceph.dir.rentries="4"
ceph.dir.rfiles="3"
ceph.dir.rsubdirs="1"
ceph.dir.subdirs="0"

I've potentially done something silly but I don't recall changing anything.
When I last checked this a few weeks back I'm pretty sure I was getting the
correct rbytes on the directory listing. Remounting makes no difference.

Does anyone know what's going on?

Client details:

CentOS Linux release 7.2.1511
4.6.1-1.el7.elrepo.x86_64
Cephfs mount options rw,relatime,name=admin,secret=,acl 0 0

Cheers,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS | Recursive stats not displaying with GNU ls

2016-07-18 Thread David
Thanks Zheng, I should have checked that.

Sean, from the commit:

When rbytes mount option is enabled, directory size is recursive size.
Recursive size is not updated instantly. This can cause directory size to
change between successive stat(1)
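
If you want the old behaviour back, it can be re-enabled per mount with the
rbytes option, e.g. (monitor address and secret file are placeholders):

mount -t ceph 192.168.0.1:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret,rbytes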

On Mon, Jul 18, 2016 at 2:49 PM, Sean Redmond 
wrote:

> Hi,
>
> Is this disabled because its not a stable feature or just user preference?
>
> Thanks
>
> On Mon, Jul 18, 2016 at 2:37 PM, Yan, Zheng  wrote:
>
>> On Mon, Jul 18, 2016 at 9:00 PM, David  wrote:
>> > Hi all
>> >
>> > Recursive statistics on directories are no longer showing on an ls -l
>> output
>> > but getfattr is accurate:
>> >
>> > # ls -l
>> > total 0
>> > drwxr-xr-x 1 root root 3 Jul 18 12:42 dir1
>> > drwxr-xr-x 1 root root 0 Jul 18 12:42 dir2
>> >
>> > ]# getfattr -d -m ceph.dir.* dir1
>> > # file: dir1
>> > ceph.dir.entries="3"
>> > ceph.dir.files="3"
>> > ceph.dir.rbytes="27917283328"
>> > ceph.dir.rctime="1468842139.0979444"
>> > ceph.dir.rentries="4"
>> > ceph.dir.rfiles="3"
>> > ceph.dir.rsubdirs="1"
>> > ceph.dir.subdirs="0"
>> >
>> > I've potentially done something silly but I don't recall changing
>> anything.
>> > When I last checked this a few weeks back I'm pretty sure I was getting
>> the
>> > correct rbytes on the directory listing. Remounting makes no difference.
>> >
>> > Does anyone know what's going on?
>> >
>> > Client details:
>> >
>> > CentOS Linux release 7.2.1511
>> > 4.6.1-1.el7.elrepo.x86_64
>> > Cephfs mount options rw,relatime,name=admin,secret=,acl 0 0
>> >
>>
>> There is rbytes mount option. It's not enabled by default since 4.6
>> kernel.
>>
>>
>>
>>
>> > Cheers,
>> >
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mon_osd_nearfull_ratio (unchangeable) ?

2016-07-26 Thread David
Try:

ceph pg set_nearfull_ratio 0.9

On 26 Jul 2016 08:16, "Goncalo Borges"  wrote:

> Hello...
>
> I do not think that these settings are working properly in jewel. Maybe
> someone else can confirm.
>
> So, to summarize:
>
> 1./ I've restarted mon and osd services (systemctl restart ceph.target)
> after setting
>
> # grep nearfull /etc/ceph/ceph.conf
> mon osd nearfull ratio = 0.90
>
> 2./ Thos configs seems active in the daemons configurations
>
> # ceph --admin-daemon /var/run/ceph/ceph-mon.rccephmon1.asok config show
> |grep mon_osd_nearfull_ratio
> "mon_osd_nearfull_ratio": "0.9",
> [
> # ceph daemon mon.rccephmon1 config show | grep mon_osd_nearfull_ratio
> "mon_osd_nearfull_ratio": "0.9",
>
> 3./ However, I still receive a warning of near full osds if they are above
> 85%
>
> 4./ A ceph pg dump does show:
>
> # ceph pg dump
> dumped all in format plain
> version 12415999
> stamp 2016-07-26 07:15:29.018848
> last_osdmap_epoch 2546
> last_pg_scan 2546
> full_ratio 0.95
> *nearfull_ratio 0.85*
>
>
> Cheers
> G.
>
>
> On 07/26/2016 12:39 PM, Brad Hubbard wrote:
>
> On Tue, Jul 26, 2016 at 12:16:35PM +1000, Goncalo Borges wrote:
>
> Hi Brad
>
> Thanks for replying.
>
> Answers inline.
>
>
>
> I am a bit confused about the 'unchachable' message we get in Jewel 10.2.2
> when I try to change some cluster configs.
>
> For example:
>
> 1./ if I try to change mon_osd_nearfull_ratio from 0.85 to 0.90, I get
>
> # ceph tell mon.* injectargs "--mon_osd_nearfull_ratio 0.90"
> mon.rccephmon1: injectargs:mon_osd_nearfull_ratio = '0.9'
> (unchangeable)
> mon.rccephmon3: injectargs:mon_osd_nearfull_ratio = '0.9'
> (unchangeable)
> mon.rccephmon2: injectargs:mon_osd_nearfull_ratio = '0.9'
> (unchangeable)
>
> This is telling you that this variable has no observers (i.e. nothing monitors
> it dynamically) so changing it at runtime has no effect. IOW it is read at
> start-up and not referred to again after that IIUC.
>
>
> but the 0.85 default values continues to be showed in
>
>  ceph --show-config --conf /dev/null | grep mon_osd_nearfull_ratio
>  mon_osd_nearfull_ratio = 0.85
>
> Try something like the following.
>
> $ ceph daemon mon.a config show|grep mon_osd_nearfull_ratio
>
>
> and I continue to have health warnings regarding near full osds.
>
> So the actual config value has been changed but has no affect and will not
> persist. IOW, this value needs to be modified in the conf file and the daemon
> restarted.
>
>
>
> 2./ If I change in the ceph.conf and restart services, I get the same
> behaviour as in 1./ However, if I check the daemon configuration, I see:
>
> Please clarify what you mean by "the same behaviour"?
>
>
> So, in my ceph.conf I've set 'mon osd nearfull ratio = 0.90' and restarted
> mon and osd (not sure if those were needed) daemons everywhere.
>
> After restarting, I am still getting the health warnings regarding near full
> osds above 85%. If the new value was active, I should not get such warnings.
>
>
>   # ceph daemon mon.rccephmon2 config show | grep mon_osd_nearfull_ratio
>  "mon_osd_nearfull_ratio": "0.9",
>
> Use the daemon command I showed above.
>
>
> Isn't it the same as you suggested? That was run after restarting services
>
>
> Yes, it is. I assumed wrongly that you were using the "--show-config" command
> again here.
>
>
> so it is still unclear to me why the new value is not picked up and why
> running 'ceph --show-config --conf /dev/null | grep mon_osd_nearfull_ratio'
>
>
> That command shows the default ceph config, try something like this.
>
> $ ceph -n mon.rccephmon2 --show-config|grep mon_osd_nearfull_ratio
>
>
> still shows 0.85
>
> Maybe a restart if services is not what has to be done but a stop/start
> instead?
>
>
> You can certainly try it but I would have thought a restart would involve
> stop/start of the MON daemon. This thread includes additional information that
> may be relevant to you atm.
> http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/23391
>
>
> Cheers
> Goncalo
>
>
> --
> Goncalo Borges
> Research Computing
> ARC Centre of Excellence for Particle Physics at the Terascale
> School of Physics A28 | University of Sydney, NSW  2006
> T: +61 2 93511937
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 2TB useable - small business - help appreciated

2016-07-30 Thread David
Hi Richard,

It would be useful to know what you're currently using for storage, as that
would help in recommending a strategy. My guess is an all-CephFS setup
might be best for your use case. I haven't tested this myself, but I'd mount
CephFS on the OSD nodes with the Fuse client and export over NFS or Samba,
so something resembling a Gluster setup. My preference would be to use
separate gateway servers, but if you are limited to those 4 servers I don't
think you have another option.

Your Ubuntu clients could mount CephFS directly; the OSX clients could use
Samba or NFS. I'm no expert on ESX integration, but from recent threads on
this list it seems like NFS is the simplest way of getting some decent
performance at the moment.

If you only have those servers to work with, 3 servers would run Mons and
the non-Mon server should run the active MDS. I'd run a standby or
standby-replay MDS on the Mon server with the highest IP:port. If you've
got any spare RAM handy, stick as much as you can in the MDS server.

Agree with Wido that all SSD would be the way to go for such a small
capacity requirement.
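
A minimal sketch of the gateway idea, assuming a ceph-fuse mount re-exported
with kernel NFS (paths and export options here are placeholders):

ceph-fuse /mnt/cephfs

and then in /etc/exports (an explicit fsid is required when exporting a FUSE
filesystem):

/mnt/cephfs  *(rw,no_root_squash,fsid=101)

Samba would share the same mount point in the usual way.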

On Sat, Jul 30, 2016 at 2:12 PM, Wido den Hollander  wrote:

>
> > Op 30 juli 2016 om 8:51 schreef Richard Thornton <
> richie.thorn...@gmail.com>:
> >
> >
> > Hi,
> >
> > Thanks for taking a look, any help you can give would be much
> appreciated.
> >
> > In the next few months or so I would like to implement Ceph for my
> > small business because it sounds cool and I love tinkering.
> >
> > The requirements are simple, I only need a minimum of 2TB (useable) of
> > (highly) redundant file storage for our Mac's, Ubuntu and VSphere to
> > use, Ubuntu is usually my distro of choice.
> >
>
> What will you be using? RBD, CephFS?
>
> > I already have the following spare hardware that I could use:
> >
> > 4 x Supermicro c2550 servers
> > 4 x 24GB Intel SLC drives
> > 6 x 200GB Intel DC S3700
> > 2 x Intel 750 400GB PCIe NVMe
> > 4 x 2TB 7200rpm drives
> > 10GBe NICs
> >
>
> Since you only need 2TB of usable storage I would suggest to skip spinning
> disks and go completely to SSD/Flash.
>
> For example, take the Samsung PM836 SSDs. They go up to 4TB per SSD right
> now. They aren't cheap, but the price per I/O is low. Spinning disks are
> cheap with storage, but very expensive per I/O.
>
> Per server:
> - SSD for OS (simple one)
> - Multiple SSDs for OSD
>
> > I am a little confused on how I should set it up, I have 4 servers so
> > it's going to look more like your example PoC environment, should I
> > just use 3 of the 4 servers to save on energy costs (the 4th server
> > could be a cold spare)?
> >
>
> No, more machines is better. I would go for 4.
>
> > So I guess I will have my monitor nodes on my OSD nodes.
> >
> > Would I just have each of the physical nodes with just one 2TB disk,
> > would I use BlueStore (it looks cool but I read it's not stable until
> > later this year)?
> >
> > I have no idea on what I should do for RGW, RBD and CephFS, should I
> > just have them all running on the 3 nodes?
> >
>
> I always try to spread services. MONs on dedicated hardware, OSDs and the
> same with RGW and CephFS MDS servers.
>
> It is not a requirement persé, but it makes things easier to run.
>
> Wido
>
> > Thanks again!
> >
> > Richard
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to configure OSD heart beat to happen on public network

2016-07-31 Thread David
The purpose of the cluster network is to isolate the heartbeat (and
recovery) traffic, so I imagine that is why you are struggling to get the
heartbeat traffic onto the public network.
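
For reference, the split is defined by the network settings in ceph.conf
(the subnets below are placeholders); removing the cluster network line puts
all OSD traffic, heartbeats included, on the public network:

[global]
public network = 192.168.1.0/24
cluster network = 192.168.2.0/24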

On 27 Jul 2016 8:32 p.m., "Venkata Manojawa Paritala" 
wrote:

> Hi,
>
> I have configured the below 2 networks in Ceph.conf.
>
> 1. public network
> 2. cluster_network
>
> Now, the heart beat for the OSDs is happening thru cluster_network. How
> can I configure the heart beat to happen thru public network?
>
> I actually configured the property "osd heartbeat address" in the global
> section and provided public network's subnet, but it is not working out.
>
> Am I doing something wrong? Appreciate your quick responses, as I need to
> urgently.
>
>
> Thanks & Regards,
> Manoj
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Giant to Jewel poor read performance with Rados bench

2016-08-06 Thread David
Hi All

I've just installed Jewel 10.2.2 on hardware that has previously been
running Giant. Rados Bench with the default rand and seq tests is giving me
approx 40% of the throughput I used to achieve. On Giant I would get
~1000MB/s (so probably limited by the 10GbE interface), now I'm getting 300
- 400MB/s.

I can see there is no activity on the disks during the bench so the data is
all coming out of cache. The cluster isn't doing anything else during the
test. I'm fairly sure my network is sound, I've done the usual testing with
iperf etc. The write test seems about the same as I used to get (~400MB/s).

This was a fresh install rather than an upgrade.

Are there any gotchas I should be aware of?

Some more details:

OS: CentOS 7
Kernel: 3.10.0-327.28.2.el7.x86_64
5 nodes (each 10 * 4TB SATA, 2 * Intel dc3700 SSD partitioned up for
journals).
10GbE public network
10GbE cluster network
MTU 9000 on all interfaces and switch
Ceph installed from ceph repo

Ceph.conf is pretty basic (IPs, hosts etc omitted):

filestore_xattr_use_omap = true
osd_journal_size = 1
osd_pool_default_size = 3
osd_pool_default_min_size = 2
osd_pool_default_pg_num = 4096
osd_pool_default_pgp_num = 4096
osd_crush_chooseleaf_type = 1
max_open_files = 131072
mon_clock_drift_allowed = .15
mon_clock_drift_warn_backoff = 30
mon_osd_down_out_interval = 300
mon_osd_report_timeout = 300
mon_osd_full_ratio = .95
mon_osd_nearfull_ratio = .80
osd_backfill_full_ratio = .80

Thanks
David
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Giant to Jewel poor read performance with Rados bench

2016-08-07 Thread David
I created a new pool that only contains OSDs on a single node. The Rados
bench gives me the speed I'd expect (1GB/s...all coming out of cache)

I then created a pool that contains OSDs from 2 nodes. Now the strange part
is, if I run the Rados bench from either of those nodes, I get the speed
I'd expect: 2GB/s (1GB local and 1GB coming over from the other node). If I
run the same bench from a 3rd node, I only get about 200MB/s. During that
bench, I monitor the interfaces on the 2 OSD nodes and they are not going
any faster than 1Gb/s. It's almost as if the speed has negotiated down to
1Gb. If I run iperf tests between the 3 nodes I'm getting the full 10Gb
speed.

'rados -p 2node bench 60 rand --no-cleanup' from one of the nodes in the 2
node pool:

Total time run:   60.036413
Total reads made: 33496
Read size:4194304
Object size:  4194304
Bandwidth (MB/sec):   2231.71
Average IOPS: 557
Stddev IOPS:  10
Max IOPS: 584
Min IOPS: 535
Average Latency(s):   0.0275722
Max latency(s):   0.164382
Min latency(s):   0.00480053

'rados -p 2node bench 60 rand --no-cleanup' from a node not in the 2 node
pool:

Total time run:   60.383206
Total reads made: 2715
Read size:4194304
Object size:  4194304
Bandwidth (MB/sec):   179.851
Average IOPS: 44
Stddev IOPS:  10
Max IOPS: 77
Min IOPS: 28
Average Latency(s):   0.355126
Max latency(s):   2.17366
Min latency(s):   0.00641849

I appreciate this may not be a Ceph config issue but any tips on tracking
down this issue would be much appreciated.


On Sat, Aug 6, 2016 at 9:38 PM, David  wrote:

> Hi All
>
> I've just installed Jewel 10.2.2 on hardware that has previously been
> running Giant. Rados Bench with the default rand and seq tests is giving me
> approx 40% of the throughput I used to achieve. On Giant I would get
> ~1000MB/s (so probably limited by the 10GbE interface), now I'm getting 300
> - 400MB/s.
>
> I can see there is no activity on the disks during the bench so the data
> is all coming out of cache. The cluster isn't doing anything else during
> the test. I'm fairly sure my network is sound, I've done the usual testing
> with iperf etc. The write test seems about the same as I used to get
> (~400MB/s).
>
> This was a fresh install rather than an upgrade.
>
> Are there any gotchas I should be aware of?
>
> Some more details:
>
> OS: CentOS 7
> Kernel: 3.10.0-327.28.2.el7.x86_64
> 5 nodes (each 10 * 4TB SATA, 2 * Intel dc3700 SSD partitioned up for
> journals).
> 10GbE public network
> 10GbE cluster network
> MTU 9000 on all interfaces and switch
> Ceph installed from ceph repo
>
> Ceph.conf is pretty basic (IPs, hosts etc omitted):
>
> filestore_xattr_use_omap = true
> osd_journal_size = 1
> osd_pool_default_size = 3
> osd_pool_default_min_size = 2
> osd_pool_default_pg_num = 4096
> osd_pool_default_pgp_num = 4096
> osd_crush_chooseleaf_type = 1
> max_open_files = 131072
> mon_clock_drift_allowed = .15
> mon_clock_drift_warn_backoff = 30
> mon_osd_down_out_interval = 300
> mon_osd_report_timeout = 300
> mon_osd_full_ratio = .95
> mon_osd_nearfull_ratio = .80
> osd_backfill_full_ratio = .80
>
> Thanks
> David
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Recover Data from Deleted RBD Volume

2016-08-08 Thread David
That will be down to the pool the rbd was in; the crush rule for that pool
will dictate which OSDs store its objects. In a standard config that rbd
will likely have objects on every OSD in your cluster.

On 8 Aug 2016 9:51 a.m., "Georgios Dimitrakakis" 
wrote:

> Hi,
>>
>>
>> On 08.08.2016 09:58, Georgios Dimitrakakis wrote:
>>
>>> Dear all,
>>>
>>> I would like your help with an emergency issue but first let me describe
>>> our environment.
>>>
>>> Our environment consists of 2OSD nodes with 10x 2TB HDDs each and 3MON
>>> nodes (2 of them are the OSD nodes as well) all with ceph version 0.80.9
>>> (b5a67f0e1d15385bc0d60a6da6e7fc810bde6047)
>>>
>>> This environment provides RBD volumes to an OpenStack Icehouse
>>> installation.
>>>
>>> Although not a state of the art environment is working well and within
>>> our expectations.
>>>
>>> The issue now is that one of our users accidentally deleted one of the
>>> volumes without keeping its data first!
>>>
>>> Is there any way (since the data are considered critical and very
>>> important) to recover them from CEPH?
>>>
>>
>> Short answer: no
>>
>> Long answer: no, but
>>
>> Consider the way Ceph stores data... each RBD is striped into chunks
>> (RADOS objects with 4MB size by default); the chunks are distributed
>> among the OSDs with the configured number of replicates (probably two
>> in your case since you use 2 OSD hosts). RBD uses thin provisioning,
>> so chunks are allocated upon first write access.
>> If an RBD is deleted all of its chunks are deleted on the
>> corresponding OSDs. If you want to recover a deleted RBD, you need to
>> recover all individual chunks. Whether this is possible depends on
>> your filesystem and whether the space of a former chunk is already
>> assigned to other RADOS objects. The RADOS object names are composed
>> of the RBD name and the offset position of the chunk, so if an
>> undelete mechanism exists for the OSDs' filesystem, you have to be
>> able to recover file by their filename, otherwise you might end up
>> mixing the content of various deleted RBDs. Due to the thin
>> provisioning there might be some chunks missing (e.g. never allocated
>> before).
>>
>> Given the fact that
>> - you probably use XFS on the OSDs since it is the preferred
>> filesystem for OSDs (there is RDR-XFS, but I've never had to use it)
>> - you would need to stop the complete ceph cluster (recovery tools do
>> not work on mounted filesystems)
>> - your cluster has been in use after the RBD was deleted and thus
>> parts of its former space might already have been overwritten
>> (replication might help you here, since there are two OSDs to try)
>> - XFS undelete does not work well on fragmented files (and OSDs tend
>> to introduce fragmentation...)
>>
>> the answer is no, since it might not be feasible and the chance of
>> success are way too low.
>>
>> If you want to spend time on it I would propose the stop the ceph
>> cluster as soon as possible, create copies of all involved OSDs, start
>> the cluster again and attempt the recovery on the copies.
>>
>> Regards,
>> Burkhard
>>
>
> Hi! Thanks for the info...I understand that this is a very difficult and
> probably not feasible task but in case I need to try a recovery what other
> info should I need? Can I somehow find out on which OSDs the specific data
> were stored and minimize my search there?
> Any ideas on how should I proceed?
>
>
> Best,
>
> G.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Recover Data from Deleted RBD Volume

2016-08-08 Thread David
I don't think there's a way of getting the prefix from the cluster at this
point.

If the deleted image was a similar size to the example you've given, you
will likely have had objects on every OSD. If this data is absolutely
critical you need to stop your cluster immediately or make copies of all
the drives with something like dd. If you've never deleted any other rbd
images and assuming you can recover data with names, you may be able to
find the rbd objects.
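
For the copies, something along these lines per OSD data partition would do
(device and destination are placeholders, and the cluster should be stopped
first):

dd if=/dev/sdb1 of=/mnt/backup/osd-0-sdb1.img bs=4M conv=sync,noerror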

On Mon, Aug 8, 2016 at 7:28 PM, Georgios Dimitrakakis 
wrote:

> Hi,
>>>
>>>
>>> On 08.08.2016 10:50, Georgios Dimitrakakis wrote:
>>>
 Hi,
>
>
> On 08.08.2016 09:58, Georgios Dimitrakakis wrote:
>
>> Dear all,
>>
>> I would like your help with an emergency issue but first let me
>> describe our environment.
>>
>> Our environment consists of 2OSD nodes with 10x 2TB HDDs each and
>> 3MON nodes (2 of them are the OSD nodes as well) all with ceph version
>> 0.80.9 (b5a67f0e1d15385bc0d60a6da6e7fc810bde6047)
>>
>> This environment provides RBD volumes to an OpenStack Icehouse
>> installation.
>>
>> Although not a state of the art environment is working well and
>> within our expectations.
>>
>> The issue now is that one of our users accidentally deleted one of
>> the volumes without keeping its data first!
>>
>> Is there any way (since the data are considered critical and very
>> important) to recover them from CEPH?
>>
>
> Short answer: no
>
> Long answer: no, but
>
> Consider the way Ceph stores data... each RBD is striped into chunks
> (RADOS objects with 4MB size by default); the chunks are distributed
> among the OSDs with the configured number of replicates (probably two
> in your case since you use 2 OSD hosts). RBD uses thin provisioning,
> so chunks are allocated upon first write access.
> If an RBD is deleted all of its chunks are deleted on the
> corresponding OSDs. If you want to recover a deleted RBD, you need to
> recover all individual chunks. Whether this is possible depends on
> your filesystem and whether the space of a former chunk is already
> assigned to other RADOS objects. The RADOS object names are composed
> of the RBD name and the offset position of the chunk, so if an
> undelete mechanism exists for the OSDs' filesystem, you have to be
> able to recover file by their filename, otherwise you might end up
> mixing the content of various deleted RBDs. Due to the thin
> provisioning there might be some chunks missing (e.g. never allocated
> before).
>
> Given the fact that
> - you probably use XFS on the OSDs since it is the preferred
> filesystem for OSDs (there is RDR-XFS, but I've never had to use it)
> - you would need to stop the complete ceph cluster (recovery tools do
> not work on mounted filesystems)
> - your cluster has been in use after the RBD was deleted and thus
> parts of its former space might already have been overwritten
> (replication might help you here, since there are two OSDs to try)
> - XFS undelete does not work well on fragmented files (and OSDs tend
> to introduce fragmentation...)
>
> the answer is no, since it might not be feasible and the chance of
> success are way too low.
>
> If you want to spend time on it I would propose the stop the ceph
> cluster as soon as possible, create copies of all involved OSDs, start
> the cluster again and attempt the recovery on the copies.
>
> Regards,
> Burkhard
>

 Hi! Thanks for the info...I understand that this is a very difficult
 and probably not feasible task but in case I need to try a recovery what
 other info should I need? Can I somehow find out on which OSDs the specific
 data were stored and minimize my search there?
 Any ideas on how should I proceed?

>>> First of all you need to know the exact object names for the RADOS
>>> objects. As mentioned before, the name is composed of the RBD name and
>>> an offset.
>>>
>>> In case of OpenStack, there are three different patterns for RBD names:
>>>
>>> , e.g. 50f2a0bd-15b1-4dbb-8d1f-fc43ce535f13
>>> for glance images,
>>> , e.g. 9aec1f45-9053-461e-b176-c65c25a48794_disk for nova
>>> images
>>> , e.g. volume-0ca52f58-7e75-4b21-8b0f-39cbcd431c42 for
>>> cinder volumes
>>>
>>> (not considering snapshots etc, which might use different patterns)
>>>
>>> The RBD chunks are created using a certain prefix (using examples
>>> from our openstack setup):
>>>
>>> # rbd -p os-images info 8fa3d9eb-91ed-4c60-9550-a62f34aed014
>>> rbd image '8fa3d9eb-91ed-4c60-9550-a62f34aed014':
>>> size 446 MB in 56 objects
>>> order 23 (8192 kB objects)
>>> block_name_prefix: rbd_data.30e57d54dea573
>>> format: 2
>>> features: layering, striping
>>> flags:
>>> stripe unit: 8192 kB
>>> stripe count: 1
>>>

Re: [ceph-users] Giant to Jewel poor read performance with Rados bench

2016-08-09 Thread David
Hi Mark, thanks for following up. I'm now pretty convinced I have issues
with my network; it's not Ceph related. My cursory iperf tests between
pairs of hosts were looking fine, but with multiple clients I'm seeing
really high TCP retransmissions.
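
For anyone hitting something similar, checks along these lines can expose it
(the hostname is a placeholder):

iperf -c osd-node1 -P 8         # several parallel streams rather than a single pair
netstat -s | grep -i retrans    # watch the TCP retransmission counters during the test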

On Mon, Aug 8, 2016 at 1:07 PM, Mark Nelson  wrote:

> Hi David,
>
> We haven't done any direct giant to jewel comparisons, but I wouldn't
> expect a drop that big, even for cached tests.  How long are you running
> the test for, and how large are the IOs?  Did you upgrade anything else at
> the same time Ceph was updated?
>
> Mark
>
>
> On 08/06/2016 03:38 PM, David wrote:
>
>> Hi All
>>
>> I've just installed Jewel 10.2.2 on hardware that has previously been
>> running Giant. Rados Bench with the default rand and seq tests is giving
>> me approx 40% of the throughput I used to achieve. On Giant I would get
>> ~1000MB/s (so probably limited by the 10GbE interface), now I'm getting
>> 300 - 400MB/s.
>>
>> I can see there is no activity on the disks during the bench so the data
>> is all coming out of cache. The cluster isn't doing anything else during
>> the test. I'm fairly sure my network is sound, I've done the usual
>> testing with iperf etc. The write test seems about the same as I used to
>> get (~400MB/s).
>>
>> This was a fresh install rather than an upgrade.
>>
>> Are there any gotchas I should be aware of?
>>
>> Some more details:
>>
>> OS: CentOS 7
>> Kernel: 3.10.0-327.28.2.el7.x86_64
>> 5 nodes (each 10 * 4TB SATA, 2 * Intel dc3700 SSD partitioned up for
>> journals).
>> 10GbE public network
>> 10GbE cluster network
>> MTU 9000 on all interfaces and switch
>> Ceph installed from ceph repo
>>
>> Ceph.conf is pretty basic (IPs, hosts etc omitted):
>>
>> filestore_xattr_use_omap = true
>> osd_journal_size = 1
>> osd_pool_default_size = 3
>> osd_pool_default_min_size = 2
>> osd_pool_default_pg_num = 4096
>> osd_pool_default_pgp_num = 4096
>> osd_crush_chooseleaf_type = 1
>> max_open_files = 131072
>> mon_clock_drift_allowed = .15
>> mon_clock_drift_warn_backoff = 30
>> mon_osd_down_out_interval = 300
>> mon_osd_report_timeout = 300
>> mon_osd_full_ratio = .95
>> mon_osd_nearfull_ratio = .80
>> osd_backfill_full_ratio = .80
>>
>> Thanks
>> David
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Recover Data from Deleted RBD Volume

2016-08-09 Thread David
On Mon, Aug 8, 2016 at 9:39 PM, Georgios Dimitrakakis 
wrote:

> Dear David (and all),
>
> the data are considered very critical therefore all this attempt to
> recover them.
>
> Although the cluster hasn't been fully stopped all users actions have. I
> mean services are running but users are not able to read/write/delete.
>
> The deleted image was the exact same size of the example (500GB) but it
> wasn't the only one deleted today. Our user was trying to do a "massive"
> cleanup by deleting 11 volumes and unfortunately one of them was very
> important.
>
> Let's assume that I "dd" all the drives what further actions should I do
> to recover the files? Could you please elaborate a bit more on the phrase
> "If you've never deleted any other rbd images and assuming you can recover
> data with names, you may be able to find the rbd objects"??
>

Sorry, that last comment was a bit confusing; I was suggesting at this stage
you just need to concentrate on recovering everything you can and then try
to find the data you need.

The dd is to make a backup of the partition so you can work on it safely.
Ideally you would make a 2nd copy of the dd'd partition and work on that.
Then you need to find tools to attempt the recovery, which is going to be
slow and painful and not guaranteed to be successful.



>
> Do you mean that if I know the file names I can go through and check for
> them? How?
> Do I have to know *all* file names or by searching for a few of them I can
> find all data that exist?
>
> Thanks a lot for taking the time to answer my questions!
>
> All the best,
>
> G.
>
> I don't think there's a way of getting the prefix from the cluster at
>> this point.
>>
>> If the deleted image was a similar size to the example you've given,
>> you will likely have had objects on every OSD. If this data is
>> absolutely critical you need to stop your cluster immediately or make
>> copies of all the drives with something like dd. If you've never
>> deleted any other rbd images and assuming you can recover data with
>> names, you may be able to find the rbd objects.
>>
>> On Mon, Aug 8, 2016 at 7:28 PM, Georgios Dimitrakakis  wrote:
>>
>> Hi,
>>>>>
>>>>> On 08.08.2016 10:50, Georgios Dimitrakakis wrote:
>>>>>
>>>>> Hi,
>>>>>>>
>>>>>>> On 08.08.2016 09:58, Georgios Dimitrakakis wrote:
>>>>>>>
>>>>>>> Dear all,
>>>>>>>>
>>>>>>>> I would like your help with an emergency issue but first
>>>>>>>> let me describe our environment.
>>>>>>>>
>>>>>>>> Our environment consists of 2OSD nodes with 10x 2TB HDDs
>>>>>>>> each and 3MON nodes (2 of them are the OSD nodes as well)
>>>>>>>> all with ceph version 0.80.9
>>>>>>>> (b5a67f0e1d15385bc0d60a6da6e7fc810bde6047)
>>>>>>>>
>>>>>>>> This environment provides RBD volumes to an OpenStack
>>>>>>>> Icehouse installation.
>>>>>>>>
>>>>>>>> Although not a state of the art environment is working
>>>>>>>> well and within our expectations.
>>>>>>>>
>>>>>>>> The issue now is that one of our users accidentally
>>>>>>>> deleted one of the volumes without keeping its data first!
>>>>>>>>
>>>>>>>> Is there any way (since the data are considered critical
>>>>>>>> and very important) to recover them from CEPH?
>>>>>>>>
>>>>>>>
>>>>>>> Short answer: no
>>>>>>>
>>>>>>> Long answer: no, but
>>>>>>>
>>>>>>> Consider the way Ceph stores data... each RBD is striped
>>>>>>> into chunks
>>>>>>> (RADOS objects with 4MB size by default); the chunks are
>>>>>>> distributed
>>>>>>> among the OSDs with the configured number of replicates
>>>>>>> (probably two
>>>>>>> in your case since you use 2 OSD hosts). RBD uses thin
>>>>>>> provisioning,
>>>>>>> so chunks are allocated upon first write access.
>>>>>>> If an RBD is deleted all of its chunks are deleted on the
>>>>>>> corresponding OSDs. If you want to recover a deleted RBD,

[ceph-users] CephFS: cached inodes with active-standby

2016-08-15 Thread David
Hi All

When I compare a  'ceph daemon mds.*id* perf dump mds' on my active MDS
with my standby-replay MDS, the inodes count on the standby is a lot less
than the active. I would expect to see a very similar number of inodes or
have I misunderstood this feature? My understanding was the replay daemon
will maintain the same cache as the active.

If I stop the mds daemon on the active, the standby-replay rejoins quickly,
I'm just curious about the discrepancy in the inode count.

This is Jewel 10.2.2

On the active server I see:

 "inode_max": 20,
"inodes": 200015,

On the standby-replay:

 "inode_max": 20,
"inodes": 98000,

mds section from my ceph.conf (hostnames changed):

[mds]
  mds data = /var/lib/ceph/mds/mds.$host
  keyring = /var/lib/ceph/mds/mds.$host/mds.$host.keyring
  mds standby replay = true

[mds.active]
  host = active
  standby for rank = 0
  mds_cache_size = 20

[mds.standbyreplay]
  host = standbyreplay
  standby for rank = 0
  mds_cache_size = 20
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Single-node Ceph & Systemd shutdown

2016-08-20 Thread David
It sounds like the Ceph services are being stopped before it gets to the
unmounts. It probably can't unmount the rbd cleanly, so shutdown hangs.

Btw, mounting with the kernel client on an OSD node isn't recommended.
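
One possible (untested) workaround is to make systemd order the mount
against the Ceph services so it gets unmounted first, e.g. an /etc/fstab
entry like this (device, mount point and filesystem are placeholders):

/dev/rbd/rbd/myimage  /mnt/myimage  xfs  _netdev,x-systemd.requires=ceph.target  0 0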

On 20 Aug 2016 6:35 p.m., "Marcus"  wrote:

> For a home server project I've set up a single-node ceph system.
>
> Everything works just fine; I can mount block devices and store stuff on
> them, however the system will not shut down without hanging.
>
> I've traced it back to systemd; it shuts down part(s?) of ceph before
> unmounting or unmapping the block devices, so when it tries to do so it
> hangs.
>
> Does anyone have any idea how I might fix this?
> If this weren't systemd I'm sure I could find a place to kludge it, but
> I've had no such luck.
>
> Thanks!,
> Marcus
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Problems getting nfs-ganesha with cephfs backend to work.

2017-07-18 Thread David
You mentioned the Kernel client works but the Fuse mount would be a better
test in relation to the Ganesha FSAL.

The following config didn't give me the error you describe in 1), but I'm
mounting on the client with NFSv4. Not sure about 2); is that dm-nfs?

EXPORT
{
Export_ID = 1;
Path = "/";
Pseudo = "/";
Access_Type = RW;
Squash = No_Root_Squash;
SecType = "none";
Protocols = "3", "4";
Transports = "TCP";

FSAL {
Name = CEPH;
}
}

Ganesha version 2.5.0.1 from the nfs-ganesha repo hosted on
download.ceph.com
CentOS 7.3 server and client
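
A client-side NFSv4 test mount along these lines should work against that
export (the server name is a placeholder):

mount -t nfs4 ganesha-host:/ /mnt/ganesha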


On Mon, Jul 17, 2017 at 2:26 PM, Micha Krause  wrote:

> Hi,
>
>
> > Change Pseudo to something like /mypseudofolder
>
> I tried this, without success, but I managed to get something working with
> version 2.5.
>
> I can mount the NFS export now, however 2 problems remain:
>
> 1. The root directory of the mount-point looks empty (ls shows no files),
> however directories
>and files can be accessed, and ls works in subdirectories.
>
> 2. I can't create devices in the nfs mount, not sure if ganesha supports
> this with other backends.
>
>
>
> Micha Krause
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 答复: How's cephfs going?

2017-07-19 Thread David
On Wed, Jul 19, 2017 at 4:47 AM, 许雪寒  wrote:

> Is there anyone else willing to share some usage information of cephfs?
>

I look after 2 CephFS deployments, both Jewel; they have been in production
since Jewel went stable, so just over a year I think. We've had a really
positive experience: I've not experienced any MDS crashes or read-only
operation (touch wood!). The majority of clients are accessing through
gateway servers re-exporting over SMB and NFS. Data is mixed, but with lots
of image sequences/video.

Workload is primarily large reads but clients are all 1GbE currently
(gateway servers are on faster links) so I'd say our performance
requirements are modest. If/when we get clients on 10GbE we'll probably
need to start looking at performance more closely, definitely playing
around with stripe settings.

As I think someone already mentioned, the recursive stats are awesome. I
use the Python xattr module to grab the stats and format them with the
prettytable library; it's a real pleasure not to have to wait for du to
stat through the directory tree. I'm thinking about doing something cool
with Kibana in the future.
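(The raw values can also be pulled straight from a shell, e.g.
getfattr --only-values -n ceph.dir.rbytes /mnt/cephfs/projects, with the
path being whatever directory you're interested in.)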

The main issues we've had are with kernel NFS: writes are currently slow in
Jewel, see http://tracker.ceph.com/issues/17563. That was fixed in master (
https://github.com/ceph/ceph/pull/11710) but I don't think it will make
its way into Jewel, so I'm eagerly awaiting stable Luminous. I've also
experienced nfsd lock-ups when OSDs fail.

Hope that helps a bit

> Could developers tell whether cephfs is a major effort in the whole ceph
> development?
>

I'm not a dev but I can confidently say this is very actively being worked
on.


>
> From: 许雪寒
> Sent: 17 July 2017 11:00
> To: ceph-users@lists.ceph.com
> Subject: How's cephfs going?
>
> Hi, everyone.
>
> We intend to use the Jewel version of cephfs; however, we don't know its
> status. Is it production ready in Jewel? Does it still have lots of bugs?
> Is it a major effort of the current ceph development? And who is using
> cephfs now?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 答复: How's cephfs going?

2017-07-19 Thread David
On Tue, Jul 18, 2017 at 6:54 AM, Blair Bethwaite 
wrote:

> We are a data-intensive university, with an increasingly large fleet
> of scientific instruments capturing various types of data (mostly
> imaging of one kind or another). That data typically needs to be
> stored, protected, managed, shared, connected/moved to specialised
> compute for analysis. Given the large variety of use-cases we are
> being somewhat more circumspect it our CephFS adoption and really only
> dipping toes in the water, ultimately hoping it will become a
> long-term default NAS choice from Luminous onwards.
>
> On 18 July 2017 at 15:21, Brady Deetz  wrote:
> > All of that said, you could also consider using rbd and zfs or whatever
> filesystem you like. That would allow you to gain the benefits of scaleout
> while still getting a feature rich fs. But, there are some down sides to
> that architecture too.
>
> We do this today (KVMs with a couple of large RBDs attached via
> librbd+QEMU/KVM), but the throughput able to be achieved this way is
> nothing like native CephFS - adding more RBDs doesn't seem to help
> increase overall throughput. Also, if you have NFS clients you will
> absolutely need SSD ZIL. And of course you then have a single point of
> failure and downtime for regular updates etc.
>
> In terms of small file performance I'm interested to hear about
> experiences with in-line file storage on the MDS.
>
> Also, while we're talking about CephFS - what size metadata pools are
> people seeing on their production systems with 10s-100s millions of
> files?
>

On a system with 10.1 million files, metadata pool is 60MB



> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 答复: How's cephfs going?

2017-07-20 Thread David
On Wed, Jul 19, 2017 at 7:09 PM, Gregory Farnum  wrote:

>
>
> On Wed, Jul 19, 2017 at 10:25 AM David  wrote:
>
>> On Tue, Jul 18, 2017 at 6:54 AM, Blair Bethwaite <
>> blair.bethwa...@gmail.com> wrote:
>>
>>> We are a data-intensive university, with an increasingly large fleet
>>> of scientific instruments capturing various types of data (mostly
>>> imaging of one kind or another). That data typically needs to be
>>> stored, protected, managed, shared, connected/moved to specialised
>>> compute for analysis. Given the large variety of use-cases we are
>>> being somewhat more circumspect it our CephFS adoption and really only
>>> dipping toes in the water, ultimately hoping it will become a
>>> long-term default NAS choice from Luminous onwards.
>>>
>>> On 18 July 2017 at 15:21, Brady Deetz  wrote:
>>> > All of that said, you could also consider using rbd and zfs or
>>> whatever filesystem you like. That would allow you to gain the benefits of
>>> scaleout while still getting a feature rich fs. But, there are some down
>>> sides to that architecture too.
>>>
>>> We do this today (KVMs with a couple of large RBDs attached via
>>> librbd+QEMU/KVM), but the throughput able to be achieved this way is
>>> nothing like native CephFS - adding more RBDs doesn't seem to help
>>> increase overall throughput. Also, if you have NFS clients you will
>>> absolutely need SSD ZIL. And of course you then have a single point of
>>> failure and downtime for regular updates etc.
>>>
>>> In terms of small file performance I'm interested to hear about
>>> experiences with in-line file storage on the MDS.
>>>
>>> Also, while we're talking about CephFS - what size metadata pools are
>>> people seeing on their production systems with 10s-100s millions of
>>> files?
>>>
>>
>> On a system with 10.1 million files, metadata pool is 60MB
>>
>>
> Unfortunately that's not really an accurate assessment, for good but
> terrible reasons:
> 1) CephFS metadata is principally stored via the omap interface (which is
> designed for handling things like the directory storage CephFS needs)
> 2) omap is implemented via Level/RocksDB
> 3) there is not a good way to determine which pool is responsible for
> which portion of RocksDBs data
> 4) So the pool stats do not incorporate omap data usage at all in their
> reports (it's part of the overall space used, and is one of the things that
> can make that larger than the sum of the per-pool spaces)
>
> You could try and estimate it by looking at how much "lost" space there is
> (and subtracting out journal sizes and things, depending on setup). But I
> promise there's more than 60MB of CephFS metadata for 10.1 million files!
> -Greg
>

That makes more sense; I did think I'd got my units mixed up or something.
For a start, my MDS daemon is using about 17GB with 1.5 million inodes in
cache. Thanks for shedding some light on this.

>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Writing data to pools other than filesystem

2017-07-20 Thread David
I think the multiple namespace feature would be more appropriate for your
use case. So that would be multiple file systems within the same pools
rather than multiple pools in a single filesystem.

With that said, that might be overkill for your requirement. You might be
able to achieve what you need with path restriction:
http://docs.ceph.com/docs/master/cephfs/client-auth/
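
A path-restricted client cap looks roughly like this (client name, path and
pool are placeholders; check the doc above for the exact syntax on your
release):

ceph auth get-or-create client.abc \
    mon 'allow r' \
    mds 'allow r, allow rw path=/abc' \
    osd 'allow rw pool=cephfs_data'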

On Thu, Jul 20, 2017 at 10:23 AM,  wrote:

> 19 July 2017 17:34, "LOPEZ Jean-Charles"  wrote:
>
> > Hi,
> >
> > you must add the extra pools to your current file system configuration:
> ceph fs add_data_pool
> > {fs_name} {pool_name}
> >
> > Once this is done, you just have to create some specific directory
> layout within CephFS to modify
> > the name of the pool targetted by a specific directory. See
> > http://docs.ceph.com/docs/master/cephfs/file-layouts
> >
> > Just set the ceph.dir.layout.pool attribute to the appropriate Pool ID
> of the new pool.
> >
> > Regards
> > JC
> >
> >> On Jul 19, 2017, at 07:59, c.mo...@web.de wrote:
> >>
> >> Hello!
> >>
> >> I want to organize data in pools and therefore created additional pools:
> >> ceph osd lspools
> >> 0 rbd,1 templates,2 hdb-backup,3 cephfs_data,4 cephfs_metadata,
> >>
> >> As you can see, pools "cephfs_data" and "cephfs_metadata" belong to a
> Ceph filesystem.
> >>
> >> Question:
> >> How can I write data to other pools, e.g. hdb-backup?
> >>
> >> THX
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> Hello JC,
>
> thanks for your reply.
>
> I'm not sure why I should add pools to a current file system configuration.
> Therefore it could be helpful to explain my use case.
>
> The Ceph Storage Cluster should provide storage for database backups.
> For security reasons I consider to create one pool per database identified
> by an unique id (e.g. ABC).
> And for each pool only a dedicated user (+ ceph admin) can access (read /
> write) the data in the related pool;
> this user is unique for each database (e.g. abcadm).
>
> The first question is:
> Do I need to create two RADOS pools as documented in guide 'Create a Ceph
> filesystem' (http://docs.ceph.com/docs/master/cephfs/createfs/) for each
> database id:
> "A Ceph filesystem requires at least two RADOS pools, one for data and one
> for metadata."
> If yes, this would mean to create the following pools:
> $ ceph osd pool create abc_data 
> $ ceph osd pool create abc_metadata 
> $ ceph osd pool create xyz_data 
> $ ceph osd pool create xyz_metadata 
>
> Or should I create only one "File System Pool" (= cephfs_data and
> cephfs_metadata) and add all database pools to this file system?
> In that case, how can I ensure that admin "abcadm" cannot modify files
> belonging to database XYZ?
>
> THX
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph MDS Q Size troubleshooting

2017-07-20 Thread David
Hi James

On Tue, Jul 18, 2017 at 8:07 AM, James Wilkins 
wrote:

> Hello list,
>
> I'm looking for some more information relating to CephFS and the 'Q' size,
> specifically how to diagnose what contributes towards it rising up
>
> Ceph Version: 11.2.0.0
> OS: CentOS 7
> Kernel (Ceph Servers): 3.10.0-514.10.2.el7.x86_64
> Kernel (CephFS Clients): 4.4.76-1.el7.elrepo.x86_64 - using kernel mount
>

Not suggesting this is the cause, but I think the current official CentOS
kernel (the one you're using on the servers) has more up-to-date CephFS
code than 4.4.


> Storage: 8 OSD Servers, 2TB NVME (P3700) in front of 6 x 6TB Disks (bcache)
> 2 pools for CephFS
>
> pool 1 'cephfs_data' replicated size 3 min_size 2 crush_ruleset 0
> object_hash rjenkins pg_num 8192 pgp_num 8192 last_change 1984 flags
> hashpspool crash_replay_interval 45 stripe_width 0 pool 2
> 'cephfs_metadata' replicated size 3 min_size 2 crush_ruleset 0 object_hash
> rjenkins pg_num 256 pgp_num 256 last_change 40695 flags hashpspool
> stripe_width 0
>
> Average client IO is between 1000-2000 op/s and 150-200MB/s
>
> We track the q size attribute coming out of ceph daemon
> /var/run/ceph/ perf dump mds q into prometheus on a regular basis
> and this figure is always northbound of 5K
>
> When we run into performance issues/sporadic failovers of the MDS servers
> this figure is the warning sign and normally peaks at >50K prior to an
> issue occuring
>

What do other resources on the server look like at this time?

How big is your MDS cache?


> I've attached a sample graph showing the last 12 hours of the q figure as
> an example
>
> Does anyone have any suggestions as to where we look at what is causing
> this Q size?
>


>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Writing data to pools other than filesystem

2017-07-20 Thread David
On Thu, Jul 20, 2017 at 3:05 PM,  wrote:

> Hello!
>
> My understanding is that I create on (big) pool for all DB backups written
> to storage.
> The clients have restricted access to a specific directory only, means
> they can mount only this directory.
>
> Can I define a quota for a specific directory, or only for the pool?
>

You can define quotas per directory but there are a number of caveats with
quotas: http://docs.ceph.com/docs/master/cephfs/quota/
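
As a rough example (path and size are placeholders, and the client mounting
the directory needs to support quotas, e.g. ceph-fuse):

setfattr -n ceph.quota.max_bytes -v 107374182400 /mnt/cephfs/hdb-backup/abc
getfattr -n ceph.quota.max_bytes /mnt/cephfs/hdb-backup/abc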



> And do I need to define the OSD Restriction?
>

I think you would still need to do this if you have other pools
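
As a sketch, a restricted client for one of those directories could look
something like this (client name, path and pool are made up for the example):

ceph auth get-or-create client.abc \
  mon 'allow r' \
  mds 'allow rw path=/hdb-backup/abc' \
  osd 'allow rw pool=cephfs_data'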


> "To prevent clients from writing or reading data to pools other than those
> in use for CephFS, set an OSD authentication capability that restricts
> access to the CephFS data pool(s)."
>
> THX
>
>
>
> 20. Juli 2017 14:00, "David" <dclistsli...@gmail.com> schrieb:
>
> I think the multiple namespace feature would be more appropriate for your
> use case. So that would be multiple file systems within the same pools
> rather than multiple pools in a single filesystem.
>
> With that said, that might be overkill for your requirement. You might be
> able to achieve what you need with path restriction:
> http://docs.ceph.com/docs/master/cephfs/client-auth/
> On Thu, Jul 20, 2017 at 10:23 AM,  wrote:
>
> 19. Juli 2017 17:34, "LOPEZ Jean-Charles"  schrieb:
>
> > Hi,
> >
> > you must add the extra pools to your current file system configuration:
> ceph fs add_data_pool
> > {fs_name} {pool_name}
> >
> > Once this is done, you just have to create some specific directory
> layout within CephFS to modify
> > the name of the pool targetted by a specific directory. See
> > http://docs.ceph.com/docs/master/cephfs/file-layouts
> >
> > Just set the ceph.dir.layout.pool attribute to the appropriate Pool ID
> of the new pool.
> >
> > Regards
> > JC
> >
> >> On Jul 19, 2017, at 07:59, c.mo...@web.de wrote:
> >>
> >> Hello!
> >>
> >> I want to organize data in pools and therefore created additional pools:
> >> ceph osd lspools
> >> 0 rbd,1 templates,2 hdb-backup,3 cephfs_data,4 cephfs_metadata,
> >>
> >> As you can see, pools "cephfs_data" and "cephfs_metadata" belong to a
> Ceph filesystem.
> >>
> >> Question:
> >> How can I write data to other pools, e.g. hdb-backup?
> >>
> >> THX
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> Hello JC,
>
> thanks for your reply.
>
> I'm not sure why I should add pools to a current file system configuration.
> Therefore it could be helpful to explain my use case.
>
> The Ceph Storage Cluster should provide storage for database backups.
> For security reasons I am considering creating one pool per database, identified
> by a unique id (e.g. ABC).
> And for each pool only a dedicated user (+ ceph admin) can access (read /
> write) the data in the related pool;
> this user is unique for each database (e.g. abcadm).
>
> The first question is:
> Do I need to create two RADOS pools as documented in guide 'Create a Ceph
> filesystem' (http://docs.ceph.com/docs/master/cephfs/createfs/) for each
> database id:
> "A Ceph filesystem requires at least two RADOS pools, one for data and one
> for metadata."
> If yes, this would mean to create the following pools:
> $ ceph osd pool create abc_data <pg_num>
> $ ceph osd pool create abc_metadata <pg_num>
> $ ceph osd pool create xyz_data <pg_num>
> $ ceph osd pool create xyz_metadata <pg_num>
>
> Or should I create only one "File System Pool" (= cephfs_data and
> cephfs_metadata) and add all database pools to this file system?
> In that case, how can I ensure that admin "abcadm" cannot modify files
> belonging to database XYZ?
>
> THX
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] oVirt/RHEV and Ceph

2017-07-25 Thread David
My understanding was Cinder is needed to create/delete/manage etc. on
volumes but I/O to the volumes is direct from the hypervisors. In theory
you could lose your Cinder service and VMs would stay up.

On 25 Jul 2017 4:18 a.m., "Brady Deetz"  wrote:

Thanks for pointing to some documentation. I'd seen that and it is
certainly an option. From my understanding, with a Cinder deployment, you'd
have the same failure domains and similar performance characteristics to an
oVirt + NFS + RBD deployment. This is acceptable. But, the dream I have in
my head is where the RBD images are mounted and controlled on each
hypervisor instead of a central storage authority like Cinder. Does that
exist for anything or is this a fundamentally flawed idea?

On Mon, Jul 24, 2017 at 9:41 PM, Jason Dillaman  wrote:

> oVirt 3.6 added Cinder/RBD integration [1] and it looks like they are
> currently working on integrating Cinder within a container to simplify
> the integration [2].
>
> [1] http://www.ovirt.org/develop/release-management/features/storage/cinder-integration/
> [2] http://www.ovirt.org/develop/release-management/features/cinderglance-docker-integration/
>
> On Mon, Jul 24, 2017 at 10:27 PM, Brady Deetz  wrote:
> > Funny enough, I just had a call with Redhat where the OpenStack engineer
> was
> > voicing his frustration that there wasn't any movement on RBD for oVirt.
> > This is important to me because I'm building out a user-facing private
> cloud
> > that just isn't going to be big enough to justify OpenStack and its
> > administrative overhead. But, I already have 1.75PB (soon to be 2PB) of
> > CephFS in production. So, it puts me in a really difficult design
> position.
> >
> > On Mon, Jul 24, 2017 at 9:09 PM, Dino Yancey  wrote:
> >>
> >> I was as much as told by Redhat in a sales call that they push Gluster
> >> for oVirt/RHEV and Ceph for OpenStack, and don't have any plans to
> >> change that in the short term. (note this was about a year ago, i
> >> think - so this isn't super current information).
> >>
> >> I seem to recall the hangup was that oVirt had no orchestration
> >> capability for RBD comparable to OpenStack, and that CephFS wasn't
> >> (yet?) viable for use as a "POSIX filesystem" oVirt storage domain.
> >> Personally, I feel like Redhat is worried about competing with
> >> themselves with GlusterFS versus CephFS and is choosing to focus on
> >> Gluster as a filesystem, and Ceph as everything minus the filesystem.
> >>
> >> Which is a shame, as I'm a fan of both Ceph and oVirt and would love
> >> to use my existing RHEV infrastructure to bring Ceph into my
> >> environment.
> >>
> >>
> >> On Mon, Jul 24, 2017 at 8:39 PM, Brady Deetz  wrote:
> >> > I haven't seen much talk about direct integration with oVirt.
> Obviously
> >> > it
> >> > kind of comes down to oVirt being interested in participating. But, is
> >> > the
> >> > only hold-up getting development time toward an integration or is
> there
> >> > some
> >> > kind of friction between the dev teams?
> >> >
> >> > ___
> >> > ceph-users mailing list
> >> > ceph-users@lists.ceph.com
> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >
> >>
> >>
> >>
> >> --
> >> __
> >> Dino Yancey
> >> 2GNT.com Admin
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
>
>
> --
> Jason
>


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bad IO performance CephFS vs. NFS for block size 4k/128k

2017-09-04 Thread David
On Mon, Sep 4, 2017 at 4:27 PM,  wrote:

> Hello!
>
> I'm validating IO performance of CephFS vs. NFS.
>
> Therefore I have mounted the relevant filesystems on the same client.
> Then I start fio with the following parameters:
> action = randwrite randrw
> blocksize = 4k 128k 8m
> rwmixreadread = 70 50 30
> 32 jobs run in parallel
>
> The NFS share is striping over 5 virtual disks with a 4+1 RAID5
> configuration; each disk has ~8TB.
> The CephFS is configured on 2 MDS servers (1 up:active, 1 up:standby);
> each MDS has 47 OSDs where 1 OSD is represented by single 8TB disk.
> (The disks of RAID5 and OSD are identical.)
>

So that's a 2 node cluster? I'm assuming filestore OSDs with journals on
the OSDs, 2x or 3x replication. The NFS server on local storage is going to
perform much better as you've found.

>
> What I can see is that the IO performance of blocksize 8m is slightly
> better with CephFS, but worse (by factor 4-10) with blocksize 4k / 128k.


Not surprising. You can

> Here the stats for randrw with mix 30:
> ld9930:/home # tail -n 3 ld9930-fio-test-cephfs-randrw30-8m
> Run status group 0 (all jobs):
>READ: bw=335MiB/s (351MB/s), 335MiB/s-335MiB/s (351MB/s-351MB/s),
> io=19.7GiB (21.2GB), run=60099-60099msec
>   WRITE: bw=753MiB/s (789MB/s), 753MiB/s-753MiB/s (789MB/s-789MB/s),
> io=44.2GiB (47.5GB), run=60099-60099msec
>
> ld9930:/home # tail -n 3 ld9930-fio-test-nfs-randrw30-8m
> Run status group 0 (all jobs):
>READ: bw=324MiB/s (340MB/s), 324MiB/s-324MiB/s (340MB/s-340MB/s),
> io=19.0GiB (20.5GB), run=60052-60052msec
>   WRITE: bw=725MiB/s (760MB/s), 725MiB/s-725MiB/s (760MB/s-760MB/s),
> io=42.6GiB (45.7GB), run=60052-60052msec
>
> ld9930:/home # tail -n 3 ld9930-fio-test-nfs-randrw30-128k
> Run status group 0 (all jobs):
>READ: bw=287MiB/s (301MB/s), 287MiB/s-287MiB/s (301MB/s-301MB/s),
> io=16.9GiB (18.7GB), run=60006-60006msec
>   WRITE: bw=667MiB/s (700MB/s), 667MiB/s-667MiB/s (700MB/s-700MB/s),
> io=39.1GiB (41.1GB), run=60006-60006msec
>
> ld9930:/home # tail -n 3 ld9930-fio-test-cephfs-randrw30-128k
> Run status group 0 (all jobs):
>READ: bw=69.2MiB/s (72.6MB/s), 69.2MiB/s-69.2MiB/s (72.6MB/s-72.6MB/s),
> io=4172MiB (4375MB), run=60310-60310msec
>   WRITE: bw=161MiB/s (169MB/s), 161MiB/s-161MiB/s (169MB/s-169MB/s),
> io=9732MiB (10.3GB), run=60310-60310msec
>
> ld9930:/home # tail -n 3 ld9930-fio-test-cephfs-randrw30-4k
> Run status group 0 (all jobs):
>READ: bw=5631KiB/s (5766kB/s), 5631KiB/s-5631KiB/s (5766kB/s-5766kB/s),
> io=330MiB (346MB), run=60043-60043msec
>   WRITE: bw=12.8MiB/s (13.4MB/s), 12.8MiB/s-12.8MiB/s (13.4MB/s-13.4MB/s),
> io=767MiB (804MB), run=60043-60043msec
>
> ld9930:/home # tail -n 3 ld9930-fio-test-nfs-randrw30-4k
> Run status group 0 (all jobs):
>READ: bw=77.2MiB/s (80.8MB/s), 77.2MiB/s-77.2MiB/s (80.8MB/s-80.8MB/s),
> io=4621MiB (4846MB), run=60004-60004msec
>   WRITE: bw=180MiB/s (188MB/s), 180MiB/s-180MiB/s (188MB/s-188MB/s),
> io=10.6GiB (11.4GB), run=60004-60004msec
>
>
> This implies that for good IO performance only data with blocksize > 128k
> (I guess > 1M) should be used.
> Can anybody confirm this?
>
> THX
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] debian-hammer wheezy Packages file incomplete?

2017-09-12 Thread David
Hi!

Noticed tonight during maintenance that the hammer repo for debian wheezy only 
has 2 packages listed in the Packages file.
Thought perhaps it's being moved to archive or something. However the files are 
still there: https://download.ceph.com/debian-hammer/pool/main/c/ceph/ 

Is it a known issue or rather a "feature" =D

Kind Regards,

David Majchrzak
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] debian-hammer wheezy Packages file incomplete?

2017-09-13 Thread David
Case closed, found the answer in the mailing list archive.
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-March/016706.html 

Weird though that we installed it through the repo in June 2017.
Why not put them in the archive like debian-dumpling and debian-firefly?


> 13 sep. 2017 kl. 03:09 skrev David :
> 
> Hi!
> 
> Noticed tonight during maintenance that the hammer repo for debian wheezy 
> only has 2 packages listed in the Packages file.
> Thought perhaps it's being moved to archive or something. However the files 
> are still there: https://download.ceph.com/debian-hammer/pool/main/c/ceph/ 
> 
> Is it a known issue or rather a "feature" =D
> 
> Kind Regards,
> 
> David Majchrzak

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Jewel -> Luminous upgrade, package install stopped all daemons

2017-09-13 Thread David
Hi All

I did a Jewel -> Luminous upgrade on my dev cluster and it went very
smoothly.

I've attempted to upgrade on a small production cluster but I've hit a
snag.

After installing the ceph 12.2.0 packages with "yum install ceph" on the
first node and accepting all the dependencies, I found that all the OSD
daemons, the MON and the MDS running on that node were terminated. Systemd
appears to have attempted to restart them all but the daemons didn't start
successfully (not surprising as first stage of upgrading all mons in
cluster not completed). I was able to start the MON and it's running. The
OSDs are all down and I'm reluctant to attempt to start them without
upgrading the other MONs in the cluster. I'm also reluctant to attempt
upgrading the remaining 2 MONs without understanding what happened.

The cluster is on Jewel 10.2.5 (as was the dev cluster)
Both clusters running on CentOS 7.3

The only obvious difference I can see between the dev and production is the
production has selinux running in permissive mode, the dev had it disabled.

Any advice on how to proceed at this point would be much appreciated. The
cluster is currently functional, but I have 1 node out 4 with all OSDs
down. I had noout set before the upgrade and I've left it set for now.

Here's the journalctl right after the packages were installed (hostname
changed):

https://pastebin.com/fa6NMyjG
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Jewel -> Luminous upgrade, package install stopped all daemons

2017-09-15 Thread David
Hi David

I like your thinking! Thanks for the suggestion. I've got a maintenance
window later to finish the update so will give it a try.


On Thu, Sep 14, 2017 at 6:24 PM, David Turner  wrote:

> This isn't a great solution, but something you could try.  If you stop all
> of the daemons via systemd and start them all in a screen as a manually
> running daemon in the foreground of each screen... I don't think that yum
> updating the packages can stop or start the daemons.  You could copy and
> paste the running command (viewable in ps) to know exactly what to run in
> the screens to start the daemons like this.
>
> On Wed, Sep 13, 2017 at 6:53 PM David  wrote:
>
>> Hi All
>>
>> I did a Jewel -> Luminous upgrade on my dev cluster and it went very
>> smoothly.
>>
>> I've attempted to upgrade on a small production cluster but I've hit a
>> snag.
>>
>> After installing the ceph 12.2.0 packages with "yum install ceph" on the
>> first node and accepting all the dependencies, I found that all the OSD
>> daemons, the MON and the MDS running on that node were terminated. Systemd
>> appears to have attempted to restart them all but the daemons didn't start
>> successfully (not surprising as first stage of upgrading all mons in
>> cluster not completed). I was able to start the MON and it's running. The
>> OSDs are all down and I'm reluctant to attempt to start them without
>> upgrading the other MONs in the cluster. I'm also reluctant to attempt
>> upgrading the remaining 2 MONs without understanding what happened.
>>
>> The cluster is on Jewel 10.2.5 (as was the dev cluster)
>> Both clusters running on CentOS 7.3
>>
>> The only obvious difference I can see between the dev and production is
>> the production has selinux running in permissive mode, the dev had it
>> disabled.
>>
>> Any advice on how to proceed at this point would be much appreciated. The
>> cluster is currently functional, but I have 1 node out 4 with all OSDs
>> down. I had noout set before the upgrade and I've left it set for now.
>>
>> Here's the journalctl right after the packages were installed (hostname
>> changed):
>>
>> https://pastebin.com/fa6NMyjG
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Jewel -> Luminous upgrade, package install stopped all daemons

2017-09-15 Thread David
Happy to report I got everything up to Luminous, used your tip to keep the
OSDs running, David, thanks again for that.

I'd say this is a potential gotcha for people collocating MONs. It appears
that if you're running selinux, even in permissive mode, upgrading the
ceph-selinux packages forces a restart on all the OSDs. You're left with a
load of OSDs down that you can't start as you don't have a Luminous mon
quorum yet.
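
For the archives, the workaround was roughly: stop the unit, then run the
daemon in the foreground inside a screen so the package scripts can't restart
it (copy the exact command line you see in ps for your own OSD - the id below
is just an example):

systemctl stop ceph-osd@3
screen -dmS osd3 /usr/bin/ceph-osd -f --cluster ceph --id 3 --setuser ceph --setgroup ceph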


On 15 Sep 2017 4:54 p.m., "David"  wrote:

Hi David

I like your thinking! Thanks for the suggestion. I've got a maintenance
window later to finish the update so will give it a try.


On Thu, Sep 14, 2017 at 6:24 PM, David Turner  wrote:

> This isn't a great solution, but something you could try.  If you stop all
> of the daemons via systemd and start them all in a screen as a manually
> running daemon in the foreground of each screen... I don't think that yum
> updating the packages can stop or start the daemons.  You could copy and
> paste the running command (viewable in ps) to know exactly what to run in
> the screens to start the daemons like this.
>
> On Wed, Sep 13, 2017 at 6:53 PM David  wrote:
>
>> Hi All
>>
>> I did a Jewel -> Luminous upgrade on my dev cluster and it went very
>> smoothly.
>>
>> I've attempted to upgrade on a small production cluster but I've hit a
>> snag.
>>
>> After installing the ceph 12.2.0 packages with "yum install ceph" on the
>> first node and accepting all the dependencies, I found that all the OSD
>> daemons, the MON and the MDS running on that node were terminated. Systemd
>> appears to have attempted to restart them all but the daemons didn't start
>> successfully (not surprising as first stage of upgrading all mons in
>> cluster not completed). I was able to start the MON and it's running. The
>> OSDs are all down and I'm reluctant to attempt to start them without
>> upgrading the other MONs in the cluster. I'm also reluctant to attempt
>> upgrading the remaining 2 MONs without understanding what happened.
>>
>> The cluster is on Jewel 10.2.5 (as was the dev cluster)
>> Both clusters running on CentOS 7.3
>>
>> The only obvious difference I can see between the dev and production is
>> the production has selinux running in permissive mode, the dev had it
>> disabled.
>>
>> Any advice on how to proceed at this point would be much appreciated. The
>> cluster is currently functional, but I have 1 node out 4 with all OSDs
>> down. I had noout set before the upgrade and I've left it set for now.
>>
>> Here's the journalctl right after the packages were installed (hostname
>> changed):
>>
>> https://pastebin.com/fa6NMyjG
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CephFS Luminous | MDS frequent "replicating dir" message in log

2017-09-25 Thread David
Hi All

Since upgrading a cluster from Jewel to Luminous I'm seeing a lot of the
following line in my ceph-mds log (path name changed by me - the messages
refer to different dirs)

2017-09-25 12:47:23.073525 7f06df730700  0 mds.0.bal replicating dir [dir
0x1003e5b /path/to/dir/ [2,head] auth v=50477 cv=50465/50465 ap=0+3+4
state=1610612738|complete f(v0 m2017-03-27 11:04:17.935529 51=19+32)
n(v3297 rc2017-09-25 12:46:13.379651 b14050737379 13086=10218+2868)/n(v3297
rc2017-09-25 12:46:13.052651 b14050862881 13083=10215+2868) hs=51+0,ss=0+0
dirty=1 | child=1 dirty=1 waiter=0 authpin=0 0x7f0707298000] pop 13139 ..
rdp 191 adj 0

I've not had any issues reported, just interested to know why I'm suddenly
seeing a lot of these messages, the client versions and workload hasn't
changed. Anything to be concerned about?

Single MDS with standby-replay
Luminous 12.2.0
Kernel clients: 3.10.0-514.2.2.el7.x86_64

Thanks,
David
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Updating ceps client - what will happen to services like NFS on clients

2017-09-25 Thread David
Hi Götz

If you did a rolling upgrade, RBD clients shouldn't have experienced
interrupted IO and therefore IO to NFS exports shouldn't have been affected.
However, in the past when using kernel NFS over kernel RBD, I did have some
lockups when OSDs went down in the cluster so that's something to watch out
for.


On Mon, Sep 25, 2017 at 8:38 AM, Götz Reinicke <
goetz.reini...@filmakademie.de> wrote:

> Hi,
>
> I updated our ceph OSD/MON Nodes from 10.2.7 to 10.2.9 and everything
> looks good so far.
>
> Now I was wondering (as I may have forgotten how this works) what will
> happen to a  NFS server which has the nfs shares on a ceph rbd ? Will the
> update interrupt any access to the NFS share or is it that smooth that e.g.
> clients accessing the NFS share will not notice?
>
> Thanks for some lecture on managing ceph and regards . Götz
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] nfs-ganesha / cephfs issues

2017-10-01 Thread David
Cephfs does have repair tools but I wouldn't jump the gun, your metadata
pool is probably fine. Unless you're getting health errors or seeing errors
in your MDS log?

Are you exporting a fuse or kernel mount with Ganesha (i.e using the vfs
FSAL) or using the Ceph FSAL? Have you tried any tests directly on a CephFS
mount (taking Ganesha out of the equation)?
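
If you're on the Ceph FSAL, a minimal export block in ganesha.conf looks
roughly like this (IDs and paths are only examples, defaults omitted):

EXPORT {
    Export_ID = 1;
    Path = "/";
    Pseudo = "/cephfs";
    Access_Type = RW;
    FSAL {
        Name = CEPH;
    }
}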


On Sat, Sep 30, 2017 at 11:09 PM, Marc Roos 
wrote:

>
>
> I have on luminous 12.2.1 on a osd node nfs-ganesha 2.5.2 (from ceph
> download) running. And when I rsync on a vm that has the nfs mounted, I
> get stalls.
>
> I thought it was related to the amount of files of rsyncing the centos7
> distro. But when I tried to rsync just one file it also stalled. It
> looks like it could not create the update of the 'CentOS_BuildTag' file.
>
> Could this be a problem in the meta data pool of cephfs? Does this sound
> familiar? Is there something like an fsck for cephfs?
>
> drwxr-xr-x 1 500 500    7 Jan 24  2016 ..
> -rw-r--r-- 1 500 500   14 Dec  5  2016 CentOS_BuildTag
> -rw-r--r-- 1 500 500   29 Dec  5  2016 .discinfo
> -rw-r--r-- 1 500 500  946 Jan 12  2017 .treeinfo
> drwxr-xr-x 1 500 500    1 Sep  5 15:36 LiveOS
> drwxr-xr-x 1 500 500    1 Sep  5 15:36 EFI
> drwxr-xr-x 1 500 500    3 Sep  5 15:36 images
> drwxrwxr-x 1 500 500   10 Sep  5 23:57 repodata
> drwxrwxr-x 1 500 500 9591 Sep 19 20:33 Packages
> drwxr-xr-x 1 500 500    9 Sep 19 20:33 isolinux
> -rw------- 1 500 500    0 Sep 30 23:49 .CentOS_BuildTag.PKZC1W
> -rw------- 1 500 500    0 Sep 30 23:52 .CentOS_BuildTag.gM1C1W
> drwxr-xr-x 1 500 500   15 Sep 30 23:52 .
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph monitoring

2017-10-02 Thread David
If you take Ceph out of your search string you should find loads of
tutorials on setting up the popular collectd/influxdb/grafana stack. Once
you've got that in place, the Ceph bit should be fairly easy. There's Ceph
collectd plugins out there or you could write your own.
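
As a very rough sketch of the "write your own" route, assuming jq is installed
and a Graphite/carbon plaintext listener on localhost:2003 (the JSON field
names can differ between releases, so check ceph df -f json first, and adjust
the nc flags for your netcat flavour):

while sleep 60; do
  used=$(ceph df -f json | jq '.stats.total_used_bytes')
  echo "ceph.total_used_bytes ${used} $(date +%s)" | nc -q1 localhost 2003
done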



On Mon, Oct 2, 2017 at 12:34 PM, Osama Hasebou  wrote:

> Hi Everyone,
>
> Is there a guide/tutorial about how to setup Ceph monitoring system using
> collectd / grafana / graphite ? Other suggestions are welcome as well !
>
> I found some GitHub solutions but not much documentation on how to
> implement.
>
> Thanks.
>
> Regards,
> Ossi
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Calamari ( what a nightmare !!! )

2017-12-11 Thread David
Hi!

I think Calamari is more or less deprecated now that ceph luminous is out with 
Ceph Manager and the dashboard plugin:

http://docs.ceph.com/docs/master/mgr/dashboard/
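
(If memory serves, enabling it on Luminous is just:

ceph mgr module enable dashboard

and it listens on port 7000 by default.)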

You could also try out:

https://www.openattic.org/

or if you want to start a whole new cluster without needing to know how to 
operate it ;)

https://croit.io/

The latter isn't open sourced yet as far as I know.

Kind Regards,

David


> 12 dec. 2017 kl. 02:18 skrev DHD.KOHA :
> 
> Hello list,
> 
> Newbie here,
> 
> After managing to install ceph, with all possible ways that I could manage  
> on 4 nodes, 4 osd and 3 monitors , with ceph-deploy and latter with 
> ceph-ansible, I thought to to give a try to install CALAMARI on UBUNTU 14.04 
> ( another separate server being not a node or anything in a cluster ).
> 
> After all the mess of salt 2014.7.5 and different UBUNTU's since I am 
> installing nodes on xenial but CALAMARI on trusty while the calamari packages 
> on node come from download.ceph.com and trusty, I ended up having a server 
> that refuses to gather anything from anyplace at all.
> 
> 
> # salt '*' ceph.get_heartbeats
> c1.zz.prv:
>The minion function caused an exception: Traceback (most recent call last):
>  File "/usr/lib/python2.7/dist-packages/salt/minion.py", line 1020, in 
> _thread_return
>return_data = func(*args, **kwargs)
>  File "/var/cache/salt/minion/extmods/modules/ceph.py", line 467, in 
> get_heartbeats
>service_data = service_status(filename)
>  File "/var/cache/salt/minion/extmods/modules/ceph.py", line 526, in 
> service_status
>fsid = json.loads(admin_socket(socket_path, ['status'], 
> 'json'))['cluster_fsid']
>KeyError: 'cluster_fsid'
> c2.zz.prv:
>The minion function caused an exception: Traceback (most recent call last):
>  File "/usr/lib/python2.7/dist-packages/salt/minion.py", line 1020, in 
> _thread_return
>return_data = func(*args, **kwargs)
>  File "/var/cache/salt/minion/extmods/modules/ceph.py", line 467, in 
> get_heartbeats
>service_data = service_status(filename)
>  File "/var/cache/salt/minion/extmods/modules/ceph.py", line 526, in 
> service_status
>fsid = json.loads(admin_socket(socket_path, ['status'], 
> 'json'))['cluster_fsid']
>KeyError: 'cluster_fsid'
> c3.zz.prv:
>The minion function caused an exception: Traceback (most recent call last):
>  File "/usr/lib/python2.7/dist-packages/salt/minion.py", line 1020, in 
> _thread_return
>return_data = func(*args, **kwargs)
>  File "/var/cache/salt/minion/extmods/modules/ceph.py", line 467, in 
> get_heartbeats
>service_data = service_status(filename)
>  File "/var/cache/salt/minion/extmods/modules/ceph.py", line 526, in 
> service_status
>fsid = json.loads(admin_socket(socket_path, ['status'], 
> 'json'))['cluster_fsid']
>KeyError: 'cluster_fsid'
> 
> which means obviously that I am doing something WRONG and I have no IDEA what 
> is it.
> 
> Given the fact that documentation on the matter is very poor to limited,
> 
> Is there anybody out-there with some clues or hints that is willing to share ?
> 
> Regards,
> 
> Harry.
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Linux Meltdown (KPTI) fix and how it affects performance?

2018-01-05 Thread David
Hi!

Adding nopti or pti=off to the kernel boot options should disable the KPTI mitigation.
I haven't tried it yet though, so give it a whirl.

https://en.wikipedia.org/wiki/Kernel_page-table_isolation 
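
Something along these lines on a CentOS/RHEL 7 style box, if you want to
experiment (weigh the security trade-off first):

# add nopti to GRUB_CMDLINE_LINUX in /etc/default/grub, then:
grub2-mkconfig -o /boot/grub2/grub.cfg
# reboot, then check whether page table isolation is still active:
dmesg | grep -i 'page table isolation'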

Kind Regards,

David Majchrzak


> 5 jan. 2018 kl. 11:03 skrev Xavier Trilla :
> 
> Hi Nick,
> 
> I'm actually wondering about exactly the same. Regarding OSDs, I agree, there 
> is no reason to apply the security patch to the machines running the OSDs -if 
> they are properly isolated in your setup-.
> 
> But I'm worried about the hypervisors, as I don't know how meltdown or 
> Spectre patches -AFAIK, only Spectre patch needs to be applied to the host 
> hypervisor, Meltdown patch only needs to be applied to guest- will affect 
> librbd performance in the hypervisors. 
> 
> Does anybody have some information about how Meltdown or Spectre affect ceph 
> OSDs and clients? 
> 
> Also, regarding Meltdown patch, seems to be a compilation option, meaning you 
> could build a kernel without it easily.
> 
> Thanks,
> Xavier. 
> 
> -Mensaje original-
> De: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] En nombre de Nick 
> Fisk
> Enviado el: jueves, 4 de enero de 2018 17:30
> Para: 'ceph-users' 
> Asunto: [ceph-users] Linux Meltdown (KPTI) fix and how it affects performance?
> 
> Hi All,
> 
> As the KPTI fix largely only affects the performance where there are a large 
> number of syscalls made, which Ceph does a lot of, I was wondering if anybody 
> has had a chance to perform any initial tests. I suspect small write 
> latencies will the worse affected?
> 
> Although I'm thinking the backend Ceph OSD's shouldn't really be at risk from 
> these vulnerabilities, due to them not being direct user facing and could 
> have this work around disabled?
> 
> Nick
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Migrating filestore to bluestore using ceph-volume

2018-01-26 Thread David
/db/version_set.cc:2867] Column family [default] 
(ID 0), log number is 94

2018-01-26 15:09:07.379087 7f545d3b9cc0  4 rocksdb: EVENT_LOG_v1 
{"time_micros": 1516979347379083, "job": 1, "event": "recovery_started", 
"log_files": [96]}
2018-01-26 15:09:07.379091 7f545d3b9cc0  4 rocksdb: 
[/build/ceph-12.2.2/src/rocksdb/db/db_impl_open.cc:482] Recovering log #96 mode 0
2018-01-26 15:09:07.379102 7f545d3b9cc0  4 rocksdb: 
[/build/ceph-12.2.2/src/rocksdb/db/version_set.cc:2395] Creating manifest 98

2018-01-26 15:09:07.380466 7f545d3b9cc0  4 rocksdb: EVENT_LOG_v1 
{"time_micros": 1516979347380463, "job": 1, "event": "recovery_finished"}
2018-01-26 15:09:07.381331 7f545d3b9cc0  4 rocksdb: 
[/build/ceph-12.2.2/src/rocksdb/db/db_impl_open.cc:1063] DB pointer 
0x556ecb8c3000
2018-01-26 15:09:07.381353 7f545d3b9cc0  1 bluestore(/var/lib/ceph/osd/ceph-0) 
_open_db opened rocksdb path db options 
compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152
2018-01-26 15:09:07.381616 7f545d3b9cc0  1 freelist init
2018-01-26 15:09:07.381660 7f545d3b9cc0  1 bluestore(/var/lib/ceph/osd/ceph-0) 
_open_alloc opening allocation metadata
2018-01-26 15:09:07.381679 7f545d3b9cc0  1 bluestore(/var/lib/ceph/osd/ceph-0) 
_open_alloc loaded 447 G in 1 extents
2018-01-26 15:09:07.382077 7f545d3b9cc0  0 _get_class not permitted to load kvs
2018-01-26 15:09:07.382309 7f545d3b9cc0  0  
/build/ceph-12.2.2/src/cls/cephfs/cls_cephfs.cc:197: loading cephfs
2018-01-26 15:09:07.382583 7f545d3b9cc0  0 _get_class not permitted to load sdk
2018-01-26 15:09:07.382827 7f545d3b9cc0  0  
/build/ceph-12.2.2/src/cls/hello/cls_hello.cc:296: loading cls_hello
2018-01-26 15:09:07.385755 7f545d3b9cc0  0 _get_class not permitted to load lua
2018-01-26 15:09:07.386073 7f545d3b9cc0  0 osd.0 0 crush map has features 
288232575208783872, adjusting msgr requires for clients
2018-01-26 15:09:07.386078 7f545d3b9cc0  0 osd.0 0 crush map has features 
288232575208783872 was 8705, adjusting msgr requires for mons
2018-01-26 15:09:07.386079 7f545d3b9cc0  0 osd.0 0 crush map has features 
288232575208783872, adjusting msgr requires for osds
2018-01-26 15:09:07.386132 7f545d3b9cc0  0 osd.0 0 load_pgs
2018-01-26 15:09:07.386134 7f545d3b9cc0  0 osd.0 0 load_pgs opened 0 pgs
2018-01-26 15:09:07.386137 7f545d3b9cc0  0 osd.0 0 using weightedpriority op 
queue with priority op cut off at 64.
2018-01-26 15:09:07.386580 7f545d3b9cc0 -1 osd.0 0 log_to_monitors 
{default=true}
2018-01-26 15:09:07.388077 7f545d3b9cc0 -1 osd.0 0 init authentication failed: 
(1) Operation not permitted


The old osd is still there.

# ceph osd tree
ID CLASS WEIGHT  TYPE NAME      STATUS    REWEIGHT PRI-AFF
-1       2.60458 root default
-2       0.86819     host int1
 0   ssd 0.43159         osd.0  destroyed        0 1.0
 3   ssd 0.43660         osd.3         up      1.0 1.0
-3       0.86819     host int2
 1   ssd 0.43159         osd.1         up      1.0 1.0
 4   ssd 0.43660         osd.4         up      1.0 1.0
-4       0.86819     host int3
 2   ssd 0.43159         osd.2         up      1.0 1.0
 5   ssd 0.43660         osd.5         up      1.0 1.0


What's the best course of action? Purging osd.0, zapping the device again and 
creating without --osd-id set?
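
(i.e. something along the lines of:

ceph osd purge 0 --yes-i-really-mean-it
ceph-volume lvm zap /dev/sdX
ceph-volume lvm create --bluestore --data /dev/sdX

with /dev/sdX being the device that backed osd.0.)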


Kind Regards,

David Majchrzak
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CephFS: caps went stale, renewing

2016-09-02 Thread David
Hi All

Kernel client: 4.6.4-1.el7.elrepo.x86_64
MDS version: 10.2.2
OS: CentOS 7

I have Cephfs mounted on a few servers, I see the following in the log
approx every 20 secs on all of them:

kernel: ceph: mds0 caps went stale, renewing
kernel: ceph: mds0 caps stale
kernel: ceph: mds0 caps renewed

I'm trying to debug a few intermittent nfs related issues (which may or may
not be cephfs related) and I'm just wondering if these lines are anything
to worry about.

I've mounted on a server that is almost completely idle and I'm still
seeing those lines every 20 secs.

Ceph health is OK and the MDS server seems pretty happy although I am
occasionally seeing some "closing stale session" lines in the ceph-mds log but
I think that's a separate issue.

Thanks,
David
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS: caps went stale, renewing

2016-09-03 Thread David
Guys, thanks for looking into this, turns out I had the following options
configured on the MDS:

mds_session_timeout = 10
mds_session_autoclose = 15

I must have set them when fiddling around with some ctdb stuff. After
reverting to the defaults (60 and 300 respectively), I'm no longer seeing
the stale caps errors.
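
For anyone hitting the same thing, the running values are easy to check on
the MDS via the admin socket, e.g.:

ceph daemon mds.<name> config get mds_session_timeout
ceph daemon mds.<name> config get mds_session_autoclose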

Sorry for the noise.

Thanks,

On Sat, Sep 3, 2016 at 4:14 AM, Yan, Zheng  wrote:

> On Sat, Sep 3, 2016 at 1:35 AM, Gregory Farnum  wrote:
> > On Fri, Sep 2, 2016 at 2:58 AM, David  wrote:
> >> Hi All
> >>
> >> Kernel client: 4.6.4-1.el7.elrepo.x86_64
> >> MDS version: 10.2.2
> >> OS: CentOS 7
> >>
> >> I have Cephfs mounted on a few servers, I see the following in the log
> >> approx every 20 secs on all of them:
> >>
> >> kernel: ceph: mds0 caps went stale, renewing
> >> kernel: ceph: mds0 caps stale
> >> kernel: ceph: mds0 caps renewed
> >>
> >> I'm trying to debug a few intermittent nfs related issues (which may or
> may
> >> not be cephfs related) and I'm just wondering if these lines are
> anything to
> >> worry about.
> >>
> >> I've mounted on a server that is almost completely idle and I'm still
> seeing
> >> those lines every 20 secs.
> >
> > If it's idle I think that's why — normally these are kept live
> > (unstale) just by passing normal messages back and forth. Although,
> > Zheng, it ought to be sending off messages prior to stale if there's
> > no other traffic, shouldn't it?
>
> client sends CEPH_SESSION_REQUEST_RENEWCAPS to mds even it's idle.
>
> Regards
> Yan, Zheng
>
> >
> > If you're not seeing any problems on these hosts I don't think you
> > need to worry about it, though.
> >
> >
> >> Ceph health is OK and the MDS server seems pretty happy although I am
> >> occasionally seeing some "closing stale session" lines in the ceph-mds
> log
> >> but I think that's a separate issue.
> >
> > Closing stale sessions means the MDS is removing clients which haven't
> > checked in for a while. That might be a problem if the client computer
> > is still around and wants to access the FS. (Alternatively, if you've
> > got eg laptops mounting CephFS and they're timing out when they get
> > taken home for the night, not so much.)
> > -Greg
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Raw data size used seems incorrect (version Jewel, 10.2.2)

2016-09-07 Thread David
Could be related to this? http://tracker.ceph.com/issues/13844

On Wed, Sep 7, 2016 at 7:40 AM, james  wrote:

> Hi,
>
> Not sure if anyone can help clarify or provide any suggestion on how to
> troubleshoot this
>
> We have a ceph cluster recently build up with ceph version Jewel, 10.2.2.
> Based on "ceph -s" it shows that the data size is around 3TB but rawdata
> used is only around 6TB,
> as the ceph is set with 3 replicates, I suppose the raw data should be
> around 9TB, is this correct and work as design?
> Thank you
>
> ceph@ceph1:~$ ceph -s
> cluster 292a8b61-549e-4529-866e-01776520b6bf
>  health HEALTH_OK
>  monmap e1: 3 mons at {cpm1=192.168.1.7:6789/0,cpm2=
> 192.168.1.8:6789/0,cpm3=192.168.1.9:6789/0}
> election epoch 70, quorum 0,1,2 cpm1,cpm2,cpm3
>  osdmap e1980: 18 osds: 18 up, 18 in
> flags sortbitwise
>   pgmap v1221102: 512 pgs, 1 pools, 3055 GB data, 801 kobjects
> 6645 GB used, 60380 GB / 67026 GB avail
>  512 active+clean
>
> ceph@ceph1:~$ ceph osd dump
> epoch 1980
> fsid 292a8b61-549e-4529-866e-01776520b6bf
> created 2016-08-12 09:30:28.771332
> modified 2016-09-06 06:34:43.068060
> flags sortbitwise
> pool 1 'default' replicated size 3 min_size 2 crush_ruleset 0 object_hash
> rjenkins pg_num 512 pgp_num 512 last_change 45 flags hashpspool
> stripe_width 0
> removed_snaps [1~3]
> 
> ceph@ceph1:~$ ceph df
> GLOBAL:
> SIZE   AVAIL  RAW USED %RAW USED
> 67026G 60380G    6645G      9.91
> POOLS:
> NAME    ID USED  %USED MAX AVAIL OBJECTS
> default 1  3055G 13.68    26124G  821054
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NFS gateway

2016-09-07 Thread David
I have clients accessing CephFS over nfs (kernel nfs). I was seeing slow
writes with sync exports. I haven't had a chance to investigate and in the
meantime I'm exporting with async (not recommended, but acceptable in my
environment).
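
For illustration only, an async export of a CephFS mount in /etc/exports
looks something like this (network, fsid and options are just examples):

/mnt/cephfs  192.168.0.0/24(rw,async,no_subtree_check,fsid=100)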

I've been meaning to test out Ganesha for a while now

@Sean, have you used Ganesha with Ceph? How does performance compare with
kernel nfs?

On Wed, Sep 7, 2016 at 3:30 PM, jan hugo prins  wrote:

> Hi,
>
> One of the use-cases I'm currently testing is the possibility to replace
> a NFS storage cluster using a Ceph cluster.
>
> The idea I have is to use a server as an intermediate gateway. On the
> client side it will expose a NFS share and on the Ceph side it will
> mount the CephFS using mount.ceph. The whole network that holds the Ceph
> environment is 10G connected and when I use the same server as S3
> gateway I can store files rather quickly. When I use the same server as
> a NFS gateway putting data on the Ceph cluster is really very slow.
>
> The reason we want to do this is that we want to create a dedicated Ceph
> storage network and have all clients that need some data access either
> use S3 or NFS to access the data. I want to do this this way because I
> don't want to give the clients in some specific networks full access to
> the Ceph filesystem.
>
> Has anyone tried this before? Is this the way to go, or are there better
> ways to fix this?
>
> --
> Met vriendelijke groet / Best regards,
>
> Jan Hugo Prins
> Infra and Isilon storage consultant
>
> Better.be B.V.
> Auke Vleerstraat 140 E | 7547 AN Enschede | KvK 08097527
> T +31 (0) 53 48 00 694 | M +31 (0)6 26 358 951
> jpr...@betterbe.com | www.betterbe.com
>
> This e-mail is intended exclusively for the addressee(s), and may not
> be passed on to, or made available for use by any person other than
> the addressee(s). Better.be B.V. rules out any and every liability
> resulting from any electronic transmission.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cannot start the Ceph daemons using upstart after upgrading to Jewel 10.2.2

2016-09-08 Thread David
Afaik, the daemons are managed by systemd now on most distros e.g:

systemctl start ceph-osd@0.service



On Thu, Sep 8, 2016 at 3:36 PM, Simion Marius Rad  wrote:

> Hello,
>
> Today I upgraded an Infernalis 9.2.1 cluster to Jewel 10.2.2.
> All went well until I wanted to restart the daemons using upstart (initctl
> ).
> Any upstart invocation fails to start the daemons.
> In order to keep the cluster up I started the daemons by myself using the
> commands invoked usually by upstart.
>
>
> The cluster runs on Ubuntu 14.04 LTS (kernel 3.19 ).
>
> Did someone else have a similar issue after upgrade ?
>
> Thanks,
> Simion Rad
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] I/O freeze while a single node is down.

2016-09-13 Thread David
What froze? Kernel RBD? Librbd? CephFS?

Ceph version?

On Tue, Sep 13, 2016 at 11:24 AM, Daznis  wrote:

> Hello,
>
>
> I have encountered a strange I/O freeze while rebooting one OSD node
> for maintenance purpose. It was one of the 3 Nodes in the entire
> cluster. Before this rebooting or shutting down and entire node just
> slowed down the ceph, but not completely froze it.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Full OSD halting a cluster - isn't this violating the "no single point of failure" promise?

2016-09-19 Thread David
Ceph is pretty awesome but I'm not sure it can be expected to keep I/O
going if there is no available capacity. Granted, the osds aren't always
balanced evenly but generally if you've got one drive hitting full ratio,
you've probably got a lot more not far behind.

Although probably not recommended, it should be pretty easy to automate
taking an OSD out of the cluster if it gets too full. Of course the best
practice is to not let osds get past nearfull without taking action.
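
E.g. keep an eye on ceph osd df and, if usage gets uneven, something like the
following can help (the threshold is only an example, and the test variant
shows what would change before you commit to it):

ceph osd test-reweight-by-utilization 110
ceph osd reweight-by-utilization 110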

On 16 Sep 2016 19:36, "Christian Theune"  wrote:

> Hi,
>
> (just in case: this isn’t intended as a rant and I hope it doesn’t get
> read at it. I’m trying to understand what some perspectives towards
> potential future improvements are and I think it would be valuable to have
> this discoverable in the archives)
>
> We’ve had a “good" time recently balancing our growing cluster and did a
> lot of reweighting after a full OSD actually did bite us once.
>
> Apart from paying our dues (tight monitoring, reweighting and generally
> hedging the cluster) I was wondering whether this behaviour is a violation
> of the “no single point of failure” promise: independent of how big your
> setup grows, a single OSD can halt practically everything. Even just
> stopping the OSD would unblock your cluster (assuming that Crush made a
> particular pathological choice and that 1 OSD being extremely off the curve
> compared to the others) and keep going.
>
> I haven’t found much whether this is “it’s the way it is and we don’t see
> a way forward” or whether this behaviour is considered something that could
> be improved in the future and whether there are strategies around already?
>
> From my perspective this is directly related to how well Crush weighting
> works with respect to placing data evenly. (I would expect that in certain
> situations like a single RBD cluster where all objects are identically
> sized that this should be something that Crush can perform well in, but my
> last weeks tells me that isn’t the case. :) )
>
> An especially interesting edge case is if your cluster consists of 2 pools
> where each runs using a completely disjoint set of OSDs: I guess it’s an
> accidental (not intentional) behaviour that the one pool would be affecting
> the other, right?
>
> Thoughts?
>
> Hugs,
> Christian
>
> --
> Christian Theune · c...@flyingcircus.io · +49 345 219401 0
> Flying Circus Internet Operations GmbH · http://flyingcircus.io
> Forsterstraße 29 · 06112 Halle (Saale) · Deutschland
> HR Stendal HRB 21169 · Geschäftsführer: Christian. Theune, Christian.
> Zagrodnick
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Jewel Docs | error on mount.ceph page

2016-09-20 Thread David
Sorry I don't know the correct way to report this.

Potential error on this page:

on http://docs.ceph.com/docs/jewel/man/8/mount.ceph/

Currently:

rsize
int (bytes), max readahead, multiple of 1024, Default: 524288 (512*1024)

Should it be something like the following?

rsize
int (bytes), max read size. Default: none

rasize
int (bytes), max readahead, multiple of 1024, Default: 8388608 (8192*1024)
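
For context, the option is passed at mount time, e.g. (values illustrative):

mount -t ceph mon1:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret,rasize=8388608
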
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS metadata pool size

2016-09-26 Thread David
Ryan, a team at Ebay recently did some metadata testing, have a search on
this list. Pretty sure they found there wasn't a huge benefit in putting
the metadata pool on solid state. As Christian says, it's all about RAM and CPU.
You want to get as many inodes into cache as possible.
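
On Jewel that's mainly the mds_cache_size option (an inode count, 100k by
default), so something like this on the MDS, sized to the RAM you actually
have:

[mds]
mds cache size = 4000000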

On 26 Sep 2016 2:09 a.m., "Christian Balzer"  wrote:

>
> Hello,
>
> On Sun, 25 Sep 2016 19:51:25 -0400 (EDT) Tyler Bishop wrote:
>
> > 800TB of NVMe?  That sounds wonderful!
> >
> That's not what he wrote at all.
> 800TB capacity, of which the meta-data will likely be a small fraction.
>
> As for the OP, try your google foo on the ML archives, this of course has
> been discussed before.
> See the "CephFS in the wild" thread 3 months ago for example.
>
> In short, you need to have an idea of the number of files and calculate
> 2KB per object (file).
> Plus some overhead for the underlying OSD FS, for the time being at least.
>
> And while having the meta-data pool on fast storage certainly won't hurt,
> the consensus here seems to be that the CPU (few, fast cores) and RAM of
> the MDS have a much higher priority/benefit.
>
> Christian
> >
> > - Original Message -
> > From: "Ryan Leimenstoll" 
> > To: "ceph new" 
> > Sent: Saturday, September 24, 2016 5:37:08 PM
> > Subject: [ceph-users] CephFS metadata pool size
> >
> > Hi all,
> >
> > We are in the process of expanding our current Ceph deployment (Jewel,
> 10.2.2) to incorporate CephFS for fast, network attached scratch storage.
> We are looking to have the metadata pool exist entirely on SSDs (or NVMe),
> however I am not sure how big to expect this pool to grow to. Is there any
> good rule of thumb or guidance to getting an estimate on this before
> purchasing hardware? We are expecting upwards of 800T usable capacity at
> the start.
> >
> > Thanks for any insight!
> >
> > Ryan Leimenstoll
> > rleim...@umiacs.umd.edu
> > University of Maryland Institute for Advanced Computer Studies
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
>
> --
> Christian BalzerNetwork/Systems Engineer
> ch...@gol.com   Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Very Small Cluster

2016-09-29 Thread David
Ranjan,

If you unmount the file system on both nodes and then gracefully stop the
Ceph services (or even yank the network cable for that node), what state is
your cluster in? Are you able to do a basic rados bench write and read?
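
E.g. (pool name as appropriate for your setup):

rados bench -p cephfs_data 30 write --no-cleanup
rados bench -p cephfs_data 30 seq
rados -p cephfs_data cleanup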

How are you mounting CephFS, through the Kernel or Fuse client? Have you
tested with both to see if you get the same issue with blocked requests?

When you say an OSD on each node, are we talking about literally 1 OSD
daemon on each node? What is the storage behind that?




On Wed, Sep 28, 2016 at 4:03 PM, Ranjan Ghosh  wrote:

> Hi everyone,
>
> Up until recently, we were using GlusterFS to have two web servers in sync
> so we could take one down and switch back and forth between them - e.g. for
> maintenance or failover. Usually, both were running, though. The
> performance was abysmal, unfortunately. Copying many small files on the
> file system caused outages for several minutes - simply unacceptable. So I
> found Ceph. It's fairly new but I thought I'd give it a try. I liked
> especially the good, detailed documentation, the configurability and the
> many command-line tools which allow you to find out what is going on with
> your Cluster. All of this is severly lacking with GlusterFS IMHO.
>
> Because we're on a very tiny budget for this project we cannot currently
> have more than two file system servers. I added a small Virtual Server,
> though, only for monitoring. So at least we have 3 monitoring nodes. I also
> created 3 MDS's, though as far as I understood, two are only for standby.
> To sum it up, we have:
>
> server0: Admin (Deployment started from here) + Monitor + MDS
> server1: Monitor + MDS + OSD
> server2: Monitor + MDS + OSD
>
> So, the OSD is on server1 and server2 which are next to each other
> connected by a local GigaBit-Ethernet connection. The cluster is mounted
> (also on server1 and server2) as /var/www and Apache is serving files off
> the cluster.
>
> I've used these configuration settings:
>
> osd pool default size = 2
> osd pool default min_size = 1
>
> My idea was that by default everything should be replicated on 2 servers
> i.e. each file is normally written on server1 and server2. In case of
> emergency though (one server has a failure), it's better to keep operating
> and only write the file to one server. Therefore, i set min_size = 1. My
> further understanding is (correct me if I'm wrong), that when the server
> comes back online, the files that were written to only 1 server during the
> outage will automatically be replicated to the server that has come back
> online.
>
> So far, so good. With two servers now online, the performance is
> light-years away from sluggish GlusterFS. I've also worked with XtreemFS,
> OCFS2, AFS and never had such a good performance with any Cluster. In fact
> it's so blazingly fast, that I had to check twice I really had the cluster
> mounted and wasnt accidentally working on the hard drive. Impressive. I can
> edit files on server1 and they are immediately changed on server2 and vice
> versa. Great!
>
> Unfortunately, when I'm now stopping all ceph-Services on server1, the
> websites on server2 start to hang/freeze. And "ceph health" shows "#x
> blocked requests". Now, what I don't understand: Why is it blocking?
> Shouldnt both servers have the file? And didn't I set min_size to "1"? And
> if there are a few files (could be some unimportant stuff) that's missing
> on one of the servers: How can I abort the blocking? I'd rather have a
> missing file or whatever, then a completely blocking website.
>
> Are my files really duplicated 1:1 - or are they perhaps spread evenly
> between both OSDs? Do I have to edit the crushmap to achieve a real
> "RAID-1"-type of replication? Is there a command to find out for a specific
> file where it actually resides and whether it has really been replicated?
>
> Thank you!
> Ranjan
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New OSD Nodes, pgs haven't changed state

2016-10-10 Thread David
Can you provide a 'ceph health detail'

On 9 Oct 2016 3:56 p.m., "Mike Jacobacci"  wrote:

Hi,

Yesterday morning I added two more OSD nodes and changed the crushmap from
disk to node. It looked to me like everything went ok besides some disks
missing that I can re-add later, but the cluster status hasn't changed
since then.  Here is the output of ceph -w:

cluster 395fb046-0062-4252-914c-013258c5575c
>  health HEALTH_ERR
> 1761 pgs are stuck inactive for more than 300 seconds
> 1761 pgs peering
> 1761 pgs stuck inactive
> 8 requests are blocked > 32 sec
> crush map has legacy tunables (require bobtail, min is firefly)
>  monmap e2: 3 mons at {birkeland=192.168.10.190:6789/0,immanuel=192.168.10.125:6789/0,peratt=192.168.10.187:6789/0}
> election epoch 14, quorum 0,1,2 immanuel,peratt,birkeland
>  osdmap e186: 26 osds: 26 up, 26 in; 1796 remapped pgs
> flags sortbitwise
>   pgmap v6599413: 1796 pgs, 4 pools, 1343 GB data, 336 kobjects
> 4049 GB used, 92779 GB / 96829 GB avail
> 1761 remapped+peering
>   35 active+clean
> 2016-10-09 07:00:00.000776 mon.0 [INF] HEALTH_ERR; 1761 pgs are stuck
> inactive for more than 300 seconds;
> 1761 pgs peering; 1761 pgs stuck inactive; 8 requests
> are blocked > 32 sec; crush map has legacy tunables (require
> bobtail, min is firefly)



I have legacy tunables on since Ceph is only backing our Xenserver
infrastructure.  The number of pgs remapping and clean haven't changed and
there doesn't seem to be that much data... Is this normal behavior?

Here is my crushmap:

# begin crush map
> tunable choose_local_tries 0
> tunable choose_local_fallback_tries 0
> tunable choose_total_tries 50
> tunable chooseleaf_descend_once 1
> tunable straw_calc_version 1
> # devices
> device 0 osd.0
> device 1 osd.1
> device 2 osd.2
> device 3 osd.3
> device 4 osd.4
> device 5 osd.5
> device 6 osd.6
> device 7 osd.7
> device 8 osd.8
> device 9 osd.9
> device 10 osd.10
> device 11 osd.11
> device 12 osd.12
> device 13 osd.13
> device 14 osd.14
> device 15 osd.15
> device 16 osd.16
> device 17 osd.17
> device 18 osd.18
> device 19 osd.19
> device 20 osd.20
> device 21 osd.21
> device 22 osd.22
> device 23 osd.23
> device 24 osd.24
> device 25 osd.25
> # types
> type 0 osd
> type 1 host
> type 2 chassis
> type 3 rack
> type 4 row
> type 5 pdu
> type 6 pod
> type 7 room
> type 8 datacenter
> type 9 region
> type 10 root
> # buckets
> host tesla {
> id -2   # do not change unnecessarily
> # weight 36.369
> alg straw
> hash 0  # rjenkins1
> item osd.5 weight 3.637
> item osd.0 weight 3.637
> item osd.2 weight 3.637
> item osd.4 weight 3.637
> item osd.8 weight 3.637
> item osd.3 weight 3.637
> item osd.6 weight 3.637
> item osd.1 weight 3.637
> item osd.9 weight 3.637
> item osd.7 weight 3.637
> }
> host faraday {
> id -3   # do not change unnecessarily
> # weight 32.732
> alg straw
> hash 0  # rjenkins1
> item osd.23 weight 3.637
> item osd.18 weight 3.637
> item osd.17 weight 3.637
> item osd.25 weight 3.637
> item osd.20 weight 3.637
> item osd.22 weight 3.637
> item osd.21 weight 3.637
> item osd.19 weight 3.637
> item osd.24 weight 3.637
> }
> host hertz {
> id -4   # do not change unnecessarily
> # weight 25.458
> alg straw
> hash 0  # rjenkins1
> item osd.15 weight 3.637
> item osd.12 weight 3.637
> item osd.13 weight 3.637
> item osd.14 weight 3.637
> item osd.16 weight 3.637
> item osd.10 weight 3.637
> item osd.11 weight 3.637
> }
> root default {
> id -1   # do not change unnecessarily
> # weight 94.559
> alg straw
> hash 0  # rjenkins1
> item tesla weight 36.369
> item faraday weight 32.732
> item hertz weight 25.458
> }
> # rules
> rule replicated_ruleset {
> ruleset 0
> type replicated
> min_size 1
> max_size 10
> step take default
> step chooseleaf firstn 0 type host
> step emit
> }
> # end crush map



Cheers,
Mike

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] crush map has straw_calc_version=0

2018-06-24 Thread David
Hi!

So I've got an old dumpling production cluster which has slowly been upgraded 
to Jewel.
Now I'm facing the Ceph Health warning that straw_calc_version = 0

According to an old thread from 2016 and the docs it could trigger a small to 
moderate amount of migration.
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-May/009702.html
http://docs.ceph.com/docs/master/rados/operations/crush-map/#straw-calc-version-tunable-introduced-with-firefly-too
Since we're heading on to Luminous and later on to Mimic, I'm not sure it's wise 
to leave it as it is. Since this is a filestore HDD + SSD journals cluster, a 
moderate migration might cause issues for our production servers.
Is there any way to "test" how much migration it will cause? The servers/disks are 
homogeneous.
Also, would ignoring it cause any issues with Luminous/Mimic? The plan is to 
set up another pool and replicate all data to the new pool on the same OSDs 
(not sure that's in Mimic yet though?)
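
For the "test" part, my current idea is to compare the CRUSH mappings offline before
touching the live map, roughly like this (rule number and replica count are examples and
need to match the pools):

$ ceph osd getcrushmap -o crush.bin
$ crushtool -d crush.bin -o crush.txt
# edit crush.txt, set "tunable straw_calc_version 1", then recompile:
$ crushtool -c crush.txt -o crush.new
$ crushtool -i crush.bin --test --show-mappings --rule 0 --num-rep 3 --min-x 0 --max-x 10000 > before.txt
$ crushtool -i crush.new --test --show-mappings --rule 0 --num-rep 3 --min-x 0 --max-x 10000 > after.txt
$ diff before.txt after.txt | wc -l

No idea whether that is a reliable estimate of the real movement though, so comments welcome.
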
Kind Regards,
David Majchrzak
> Moving to straw_calc_version 1 and then adjusting a straw bucket (by adding, 
> removing, or reweighting an item, or by using the reweight-all command) can 
> trigger a small to moderate amount of data movement if the cluster has hit 
> one of the problematic conditions.
>
>
>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Sharing SSD journals and SSD drive choice

2017-04-25 Thread David
On 19 Apr 2017 18:01, "Adam Carheden"  wrote:

Does anyone know if XFS uses a single thread to write to its journal?


You probably know this but just to avoid any confusion, the journal in this
context isn't the metadata journaling in XFS; it's a separate journal
written to by the OSD daemons

I think the number of threads per OSD is controlled by the 'osd op threads'
setting which defaults to 2


I'm evaluating SSDs to buy as journal devices. I plan to have multiple
OSDs share a single SSD for journal.


I'm benchmarking several brands as
described here:

https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
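
(For reference, the journal write test from that post is roughly of this shape, with the
device path as a placeholder and --numjobs varied for the thread counts below:

fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k --numjobs=4 \
    --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test )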

It appears that sequential write speed using multiple threads varies
widely between brands. Here's what I have so far:
 SanDisk SDSSDA240G, dd:6.8 MB/s
 SanDisk SDSSDA240G, fio  1 jobs:   6.7 MB/s
 SanDisk SDSSDA240G, fio  2 jobs:   7.4 MB/s
 SanDisk SDSSDA240G, fio  4 jobs:   7.5 MB/s
 SanDisk SDSSDA240G, fio  8 jobs:   7.5 MB/s
 SanDisk SDSSDA240G, fio 16 jobs:   7.5 MB/s
 SanDisk SDSSDA240G, fio 32 jobs:   7.6 MB/s
 SanDisk SDSSDA240G, fio 64 jobs:   7.6 MB/s
HFS250G32TND-N1A2A 3P10, dd:1.8 MB/s
HFS250G32TND-N1A2A 3P10, fio  1 jobs:   4.8 MB/s
HFS250G32TND-N1A2A 3P10, fio  2 jobs:   5.2 MB/s
HFS250G32TND-N1A2A 3P10, fio  4 jobs:   9.5 MB/s
HFS250G32TND-N1A2A 3P10, fio  8 jobs:  23.4 MB/s
HFS250G32TND-N1A2A 3P10, fio 16 jobs:   7.2 MB/s
HFS250G32TND-N1A2A 3P10, fio 32 jobs:  49.8 MB/s
HFS250G32TND-N1A2A 3P10, fio 64 jobs:  70.5 MB/s
INTEL SSDSC2BB150G7, dd:   90.1 MB/s
INTEL SSDSC2BB150G7, fio  1 jobs:  91.0 MB/s
INTEL SSDSC2BB150G7, fio  2 jobs: 108.3 MB/s
INTEL SSDSC2BB150G7, fio  4 jobs: 134.2 MB/s
INTEL SSDSC2BB150G7, fio  8 jobs: 118.2 MB/s
INTEL SSDSC2BB150G7, fio 16 jobs:  39.9 MB/s
INTEL SSDSC2BB150G7, fio 32 jobs:  25.4 MB/s
INTEL SSDSC2BB150G7, fio 64 jobs:  15.8 MB/s

The SanDisk is slow, but speed is the same at any number of threads. The
Intel peaks at 4-6 threads and then declines rapidly into sub-par
performance (at least for a pricey "enterprise" drive). The SK Hynix is
slow at low numbers of threads but gets huge performance gains with more
threads. (This is all with one trial, but I have a script running
multiple trials across all drives today.)


I would avoid the SanDisk and Hynix. The s3500 isn't too bad. Perhaps
consider going up to a 37xx and putting more OSDs on it. Of course with the
caveat that you'll lose more OSDs if it goes down.


So if XFS has a single thread that does journaling, it looks like my
best option would be 1 intel SSD shared by 4-6 OSDs. If XFS already
throws multiple threads at the journal, then having OSDs share an Intel
drive will likely kill my SSD performance, but having as many OSDs as I
can cram in a chassis share the SK Hynix drive would get me great
performance for a fraction of the cost.


I don't think the hynix is going to give you great performance with
multiple OSDs


Anyone have any related advice or experience to share regarding journal
SSD selection?


Need to know a bit more about the type of cluster you're planning to build,
number of nodes, type of OSD, workload etc.



--
Adam Carheden

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CEPH backup strategy and best practices

2017-06-04 Thread David


> On 4 June 2017 at 23:23, Roger Brown wrote:
> 
> I'm a n00b myself, but I'll go on record with my understanding.
> 
> On Sun, Jun 4, 2017 at 3:03 PM Benoit GEORGELIN - yulPa 
> mailto:benoit.george...@yulpa.io>> wrote:
> Hi ceph users, 
> 
> Ceph has very good documentation about technical usage, but there are a lot 
> of conceptual things missing (from my point of view).
> It's not easy to understand all at the same time, but yes, little by little 
> it's working. 
> 
> Here are some questions about Ceph; I hope someone can take a little time to 
> point me to where I can find answers:
> 
>  - Backup:
> Do you back up data from a Ceph cluster, or do you consider a copy as a backup of 
> that file? 
> Let's say I have a replica size of 3. Somehow, my crush map will keep 2 copies 
> in my main rack and 1 copy in another rack in another datacenter. 
> Can I consider the third copy as a backup? What would be your position? 
> 
> Replicas are not backups. Just ask GitLab after accidental deletion. source: 
> https://www.theregister.co.uk/2017/02/01/gitlab_data_loss/ 
> 
> 
> 
> - Writing process of ceph object storage using radosgw
> Simple question, but not sure about it. 
> The more replicas, the slower my cluster will be? Does Ceph have to 
> acknowledge the number of replicas before saying it's good? 
> From what I read, Ceph will write and acknowledge on the primary OSD of the 
> pool, so if that's the case, it does not matter how many replicas I want and how 
> far away the other OSDs are situated; it would work the same. 
> Can I choose myself the primary OSD in my zone 1, have a copy in zone 2 
> (same rack) and a third in zone 3 in another datacenter that might have some 
> latency? 
> 
> More replicas make for a slower cluster because it waits for all devices to 
> acknowledge the write before reporting back. source: ?

I’d say stick with 3 replicas in one DC, then if you want to add another DC for 
better data protection (note, not backup), you’ll just add asynchronous 
mirroring between DCs (http://docs.ceph.com/docs/master/rbd/rbd-mirroring/) 
with another cluster there.
That way you’ll have a quick cluster (especially if you use awesome disks like 
NVMe SSD journals + SSD storage or better) with location redundancy.
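
A very rough sketch of what that involves (pool, image and peer names are placeholders,
and the rbd-mirror daemon has to run against the cluster receiving the replicas):

$ rbd mirror pool enable rbd pool                      # on both clusters, per pool
$ rbd mirror pool peer add rbd client.mirror@remote    # point each cluster at the other
$ rbd feature enable rbd/myimage journaling            # journaling must be on for an image to be mirrored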

> 
> 
> - Data persistance / availability 
> If the crush map is by hosts and I have 3 hosts with a replication of 3, 
> this means I will have 1 copy on each host.
> Does it mean I can lose 2 hosts and still have my cluster working, at least 
> in read mode? And eventually in write too if I set osd pool default min 
> size = 1?
> 
> Yes, I think. But best practice is to have at least 5 hosts (N+2) so you can 
> lose 2 hosts and still keep 3 replicas.
>  

Keep in mind that you ”should” have enough storage free as well to be able to 
lose 2 nodes. If you fill 5 nodes to 80% and lose 2 nodes you won’t be able 
to repair it all until you get them up and running again.
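
(Quick numbers: 5 equal nodes at 80% means 4 nodes' worth of data sitting on 5 nodes' worth
of raw space; lose 2 nodes and only 3 nodes' worth of raw space remains for those 4 nodes'
worth of data, so full re-replication is simply impossible until capacity comes back.)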

> 
> Thanks for your help. 
> - 
> 
> Benoît G
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
> 
> Roger
>  
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CephFS | flapping OSD locked up NFS

2017-06-19 Thread David
Hi All

We had a faulty OSD that was going up and down for a few hours until Ceph
marked it out. During this time Cephfs was accessible, however, for about
10 mins all NFS processes (kernel NFSv3) on a server exporting Cephfs were
hung, locking up all the NFS clients. The cluster was healthy before the
faulty OSD. I'm trying to understand if this is expected behaviour, a bug
or something else. Any insights would be appreciated.

MDS active/passive
Jewel 10.2.2
Ceph client 3.10.0-514.6.1.el7.x86_64
Cephfs mount: (rw,relatime,name=admin,secret=,acl)

I can see some slow requests in the MDS log during the time the NFS
processes were hung, some for setattr calls:

2017-06-15 04:29:37.081175 7f889401f700  0 log_channel(cluster) log [WRN] :
slow request 60.974528 seconds old, received at 2017-06-15 04:
28:36.106598: client_request(client.2622511:116375892 setattr size=0
#100025b3554 2017-06-15 04:28:36.104928) currently acquired locks

and some for getattr:

2017-06-15 04:29:42.081224 7f889401f700  0 log_channel(cluster) log [WRN] :
slow request 32.225883 seconds old, received at 2017-06-15 04:
29:09.855302: client_request(client.2622511:116380541 getattr pAsLsXsFs
#100025b4d37 2017-06-15 04:29:09.853772) currently failed to rdloc
k, waiting

And a "client not responding to mclientcaps revoke" warning:

2017-06-15 04:31:12.084561 7f889401f700  0 log_channel(cluster) log [WRN] :
client.2344872 isn't responding to mclientcaps(revoke), ino 100025b4d37
pending pAsxLsXsxFcb issued pAsxLsXsxFsxcrwb, sent 122.229172 seconds ag

These issues seemed to have cleared once the faulty OSD was marked out.

In general I have noticed the NFS processes exporting Cephfs do seem to
spend a lot of time in 'D' state, with WCHAN as 'lock_page', compared with
an NFS server exporting a local file system. Also, NFS performance hasn't
been great with small reads/writes, particularly writes with the default
sync export option, so I've had to export with async for the time being. I
haven't had a chance to troubleshoot this in any depth yet, just mentioning
in case it's relevant.
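
For context, an async CephFS export in /etc/exports looks something like this (the subnet
and fsid are just placeholders):

/mnt/cephfs  10.0.0.0/24(rw,async,no_subtree_check,fsid=100)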

Thanks,
David
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS | flapping OSD locked up NFS

2017-06-20 Thread David
Hi John

I've had nfs-ganesha testing on the to do list for a while, I think I might
move it closer to the top!  I'll certainly report back with the results.
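
Judging from a first read of the ganesha docs, the export block for the Ceph FSAL seems to
be along these lines (IDs and paths are placeholders, untested on my side):

EXPORT {
    Export_Id = 1;
    Path = "/";
    Pseudo = "/cephfs";
    Access_Type = RW;
    FSAL {
        Name = CEPH;
    }
}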

I'd still be interested to hear any kernel nfs experiences/tips, my
understanding is nfs is included in the ceph testing suite so there is an
expectation people will want to use it.

Thanks,
David


On 19 Jun 2017 3:56 p.m., "John Petrini"  wrote:

> Hi David,
>
> While I have no personal experience with this; from what I've been told,
> if you're going to export cephfs over NFS it's recommended that you use a
> userspace implementation of NFS (like nfs-ganesha) rather than
> nfs-kernel-server. This may be the source of you issues and might be worth
> testing. I'd be interested to hear the results if you do.
>
> ___
>
> John Petrini
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Again - state of Ceph NVMe and SSDs

2016-01-16 Thread David
Hi!

We’re planning our third ceph cluster and been trying to find how to maximize 
IOPS on this one.

Our needs:
* Pool for MySQL, rbd (mounted as /var/lib/mysql or equivalent on KVM servers)
* Pool for storage of many small files, rbd (probably dovecot maildir and 
dovecot index etc)

So I’ve been reading up on:

https://communities.intel.com/community/itpeernetwork/blog/2015/11/20/the-future-ssd-is-here-pcienvme-boosts-ceph-performance

and ceph-users from october 2015:

http://www.spinics.net/lists/ceph-users/msg22494.html

We’re planning something like 5 OSD servers, with:

* 4x 1.2TB Intel S3510
* 8x 4TB HDD
* 2x Intel P3700 Series HHHL PCIe 400GB (one for SSD Pool Journal and one for 
HDD pool journal)
* 2x 80GB Intel S3510 raid1 for system
* 256GB RAM
* 2x 8 core CPU Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz or better

This cluster will probably run Hammer LTS unless there are huge improvements in 
Infernalis when dealing with 4k IOPS.

The first link above hints at awesome performance. The second one from the list 
not so much yet.. 

Is anyone running Hammer or Infernalis with a setup like this?
Is it a sane setup?
Will we become CPU constrained or can we just throw more RAM on it? :D

Kind Regards,
David Majchrzak
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Again - state of Ceph NVMe and SSDs

2016-01-17 Thread David
Thanks Wido, those are good pointers indeed :)
So we just have to make sure the backend storage (SSD/NVMe journals) won’t be 
saturated (or the controllers) and then go with as many RBDs per VM as possible.
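
Something like this is what I have in mind for the striped images, if I read the striping
docs right (pool/image name and sizes are just examples):

$ rbd create rbd/mysql-data --size 204800 --image-format 2 --stripe-unit 262144 --stripe-count 16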

Kind Regards,
David Majchrzak

On 16 Jan 2016, at 22:26, Wido den Hollander wrote:

> On 01/16/2016 07:06 PM, David wrote:
>> Hi!
>> 
>> We’re planning our third ceph cluster and been trying to find how to
>> maximize IOPS on this one.
>> 
>> Our needs:
>> * Pool for MySQL, rbd (mounted as /var/lib/mysql or equivalent on KVM
>> servers)
>> * Pool for storage of many small files, rbd (probably dovecot maildir
>> and dovecot index etc)
>> 
> 
> Not completely NVMe related, but in this case, make sure you use
> multiple disks.
> 
> For MySQL for example:
> 
> - Root disk for OS
> - Disk for /var/lib/mysql (data)
> - Disk for /var/log/mysql (binary log)
> - Maybe even a InnoDB logfile disk
> 
> With RBD you gain more performance by sending I/O into the cluster in
> parallel. So when ever you can, do so!
> 
> Regarding small files, it might be interesting to play with the stripe
> count and stripe size there. By default this is 1 and 4MB. But maybe 16
> and 256k work better here.
> 
> With Dovecot as well, use a different RBD disk for the indexes and a
> different one for the Maildir itself.
> 
> Ceph excels at parallel performance. That is what you want to aim for.
> 
>> So I’ve been reading up on:
>> 
>> https://communities.intel.com/community/itpeernetwork/blog/2015/11/20/the-future-ssd-is-here-pcienvme-boosts-ceph-performance
>> 
>> and ceph-users from october 2015:
>> 
>> http://www.spinics.net/lists/ceph-users/msg22494.html
>> 
>> We’re planning something like 5 OSD servers, with:
>> 
>> * 4x 1.2TB Intel S3510
>> * 8x 4TB HDD
>> * 2x Intel P3700 Series HHHL PCIe 400GB (one for SSD Pool Journal and
>> one for HDD pool journal)
>> * 2x 80GB Intel S3510 raid1 for system
>> * 256GB RAM
>> * 2x 8 core CPU Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz or better
>> 
>> This cluster will probably run Hammer LTS unless there are huge
>> improvements in Infernalis when dealing 4k IOPS.
>> 
>> The first link above hints at awesome performance. The second one from
>> the list not so much yet.. 
>> 
>> Is anyone running Hammer or Infernalis with a setup like this?
>> Is it a sane setup?
>> Will we become CPU constrained or can we just throw more RAM on it? :D
>> 
>> Kind Regards,
>> David Majchrzak
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
> 
> 
> -- 
> Wido den Hollander
> 42on B.V.
> Ceph trainer and consultant
> 
> Phone: +31 (0)20 700 9902
> Skype: contact42on
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Again - state of Ceph NVMe and SSDs

2016-01-17 Thread David
That is indeed great news! :)
Thanks for the heads up.

Kind Regards,
David Majchrzak


On 17 Jan 2016, at 21:34, Tyler Bishop wrote:

> The changes you are looking for are coming from Sandisk in the ceph "Jewel" 
> release coming up.
> 
> Based on benchmarks and testing, SanDisk has really contributed heavily on 
> the tuning aspects and is promising 90%+ of the native IOPS of a drive in the 
> cluster.
> 
> The biggest changes will come from the memory allocation with writes.  
> Latency is going to be a lot lower.
> 
> 
> - Original Message -
> From: "David" 
> To: "Wido den Hollander" 
> Cc: ceph-users@lists.ceph.com
> Sent: Sunday, January 17, 2016 6:49:25 AM
> Subject: Re: [ceph-users] Again - state of Ceph NVMe and SSDs
> 
> Thanks Wido, those are good pointers indeed :)
> So we just have to make sure the backend storage (SSD/NVMe journals) won’t be 
> saturated (or the controllers) and then go with as many RBD per VM as 
> possible.
> 
> Kind Regards,
> David Majchrzak
> 
> On 16 Jan 2016, at 22:26, Wido den Hollander wrote:
> 
>> On 01/16/2016 07:06 PM, David wrote:
>>> Hi!
>>> 
>>> We’re planning our third ceph cluster and been trying to find how to
>>> maximize IOPS on this one.
>>> 
>>> Our needs:
>>> * Pool for MySQL, rbd (mounted as /var/lib/mysql or equivalent on KVM
>>> servers)
>>> * Pool for storage of many small files, rbd (probably dovecot maildir
>>> and dovecot index etc)
>>> 
>> 
>> Not completely NVMe related, but in this case, make sure you use
>> multiple disks.
>> 
>> For MySQL for example:
>> 
>> - Root disk for OS
>> - Disk for /var/lib/mysql (data)
>> - Disk for /var/log/mysql (binary log)
>> - Maybe even a InnoDB logfile disk
>> 
>> With RBD you gain more performance by sending I/O into the cluster in
>> parallel. So when ever you can, do so!
>> 
>> Regarding small files, it might be interesting to play with the stripe
>> count and stripe size there. By default this is 1 and 4MB. But maybe 16
>> and 256k work better here.
>> 
>> With Dovecot as well, use a different RBD disk for the indexes and a
>> different one for the Maildir itself.
>> 
>> Ceph excels at parallel performance. That is what you want to aim for.
>> 
>>> So I’ve been reading up on:
>>> 
>>> https://communities.intel.com/community/itpeernetwork/blog/2015/11/20/the-future-ssd-is-here-pcienvme-boosts-ceph-performance
>>> 
>>> and ceph-users from october 2015:
>>> 
>>> http://www.spinics.net/lists/ceph-users/msg22494.html
>>> 
>>> We’re planning something like 5 OSD servers, with:
>>> 
>>> * 4x 1.2TB Intel S3510
>>> * 8x 4TB HDD
>>> * 2x Intel P3700 Series HHHL PCIe 400GB (one for SSD Pool Journal and
>>> one for HDD pool journal)
>>> * 2x 80GB Intel S3510 raid1 for system
>>> * 256GB RAM
>>> * 2x 8 core CPU Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz or better
>>> 
>>> This cluster will probably run Hammer LTS unless there are huge
>>> improvements in Infernalis when dealing 4k IOPS.
>>> 
>>> The first link above hints at awesome performance. The second one from
>>> the list not so much yet.. 
>>> 
>>> Is anyone running Hammer or Infernalis with a setup like this?
>>> Is it a sane setup?
>>> Will we become CPU constrained or can we just throw more RAM on it? :D
>>> 
>>> Kind Regards,
>>> David Majchrzak
>>> 
>>> 
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> 
>> 
>> 
>> -- 
>> Wido den Hollander
>> 42on B.V.
>> Ceph trainer and consultant
>> 
>> Phone: +31 (0)20 700 9902
>> Skype: contact42on
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph and NFS

2016-01-18 Thread david
Hello All.
Does anyone provide Ceph rbd/rgw/cephfs through NFS?  I have a 
requirement for a Ceph cluster which needs to provide NFS service. 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph and NFS

2016-01-18 Thread david
Hi,
Is CephFS stable enough to deploy in production environments? And have 
you compared the performance between nfs-ganesha and the standard kernel-based NFSd, 
both backed by CephFS?

> On Jan 18, 2016, at 20:34, Burkhard Linke 
>  wrote:
> 
> Hi,
> 
> On 18.01.2016 10:36, david wrote:
>> Hello All.
>>  Does anyone provides Ceph rbd/rgw/cephfs through NFS?  I have a 
>> requirement about Ceph Cluster which needs to provide NFS service.
> 
> We export a CephFS mount point on one of our NFS servers. Works out of the 
> box with Ubuntu Trusty, a recent kernel and kernel-based cephfs driver.
> 
> ceph-fuse did not work that well, and using nfs-ganesha 2.2 instead of 
> standard kernel based NFSd resulted in segfaults and permissions problems.
> 
> Regards,
> Burkhard
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph and NFS

2016-01-18 Thread david
Hi,
Thanks for your answer. Is CephFS stable enough to deploy in 
production environments? And have you compared the performance between nfs-ganesha 
and the standard kernel-based NFSd, both backed by CephFS?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Upgrading Ceph

2016-02-01 Thread david
Hi Vlad,
I just upgraded my ceph cluster from firefly to hammer and everything went fine. 
Please do it according to the manuals on www.ceph.com 
and restart the monitors first and then the OSDs. I restarted the OSDs one by one, which means 
restart one OSD, wait until it is running normally, then restart the next OSD, and 
so on.
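
A rough version of that sequence, per OSD (sysvinit here, adjust for your init system;
setting noout first is optional but avoids unnecessary rebalancing while an OSD is down):

$ ceph osd set noout
$ service ceph restart osd.0     # then wait until 'ceph -s' is back to HEALTH_OK
$ service ceph restart osd.1     # and so on for the remaining OSDs
$ ceph osd unset noout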


> On Feb 2, 2016, at 09:10, Vlad Blando  wrote:
> 
> What if the upgrade fails, what is the rollback scenario?
> 
> 
> 
> On Wed, Jan 27, 2016 at 10:10 PM,  > wrote:
> I just upgraded my cluster from firefly to infernalis (firefly to hammer to
> infernalis).
> It all went like a charm.
> I upgraded the mons first, then the osds, one by one, restarting the daemon
> after each upgrade.
> Remember to change the uid of your files, as the osd daemon now runs under the
> user ceph (and not root): this can be a pretty long operation .. do the chown,
> then upgrade, then shut down the daemon, re-chown the remaining files, and start
> the daemon (use chown --from=0:0 ceph:ceph -R /var/lib/ceph/osd/ceph-2
> to speed up the last chown)
> 
> To conclude, I upgraded the whole cluster without any downtime
> (pretty surprised, didn't expect the process to be "that" robust)
> 
> On 27/01/2016 15:00, Vlad Blando wrote:
> > Hi,
> >
> > I have a production Ceph Cluster
> > - 3 nodes
> > - 3 mons on each nodes
> > - 9 OSD @ 4TB per node
> > - using ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6)
> >
> > ​Now I want to upgrade it to Hammer, I saw the documentation on upgrading,
> > it looks straight forward, but I want to know to those who have tried
> > upgrading a production environment, any precautions, caveats, preparation
> > that I need to do before doing it?
> >
> > - Vlad
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com 
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> > 
> >
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph and Google Summer of Code

2016-02-29 Thread David
Great idea! +1

David Majchrzak

> On 29 Feb 2016, at 22:53, Wido den Hollander wrote:
> 
> A long wanted feature is mail storage in RADOS:
> http://tracker.ceph.com/issues/12430
> 
> Would that be a good idea? I'd be more than happy to mentor this one.
> 
> I will probably lack the technical C++ skills, but e-mail storage itself is
> something I'm very familiar with.
> 
> Wido
> 
>> Op 29 februari 2016 om 22:12 schreef Patrick McGarry :
>> 
>> 
>> Hey cephers,
>> 
>> As many of you may have seen by now, Ceph was accepted back for
>> another year of GSoC. I’m asking all of you to make sure that any
>> applicable students that you know consider working with Ceph this
>> year.
>> 
>> We’re happy to accept proposals from our ideas list [0], or any custom
>> proposal that you and they might dream up. This also applies to
>> mentors. While we have a great group of initial mentors, if you are
>> interested in mentoring and have a student work with you to create a
>> proposal, I can add you as a mentor all the way through the end of the
>> application period.
>> 
>> If you have questions or comments, please feel free to reach out to me
>> directly. Thanks!
>> 
>> [0] http://ceph.com/gsoc2016
>> 
>> -- 
>> 
>> Best Regards,
>> 
>> Patrick McGarry
>> Director Ceph Community || Red Hat
>> http://ceph.com  ||  http://community.redhat.com
>> @scuttlemonkey || @ceph
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] DSS 7000 for large scale object storage

2016-03-21 Thread David
Sounds like you’ll have a field day waiting for rebuild in case of a node 
failure or an upgrade of the crush map ;)

David


> On 21 March 2016, at 09:55, Bastian Rosner wrote:
> 
> Hi,
> 
> any chance that somebody here already got hands on Dell DSS 7000 machines?
> 
> 4U chassis containing 90x 3.5" drives and 2x dual-socket server sleds 
> (DSS7500). Sounds ideal for high capacity and density clusters, since each of 
> the server-sleds would run 45 drives, which I believe is a suitable number of 
> OSDs per node.
> 
> When searching for this model there's not much detailed information out there.
> Sadly I could not find a review from somebody who actually owns a bunch of 
> them and runs a decent PB-size cluster with it.
> 
> Cheers, Bastian
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] DSS 7000 for large scale object storage

2016-03-21 Thread David
From my experience you’ll be better off planning exactly how many OSDs and 
nodes you’re going to have and if possible equip them from the start.

By just adding a new drive to the same pool, ceph will start to rearrange data 
across the whole cluster, which might leave less IO for the clients depending on what 
you’re comfortable with. In a worst case scenario, your clients won’t have 
enough IO and your services might be ”down” until it’s healthy again.

Rebuilding 60 x 6TB drives will take quite some time. Each SATA drive has about 
75-125MB/s throughput at best, so a rebuild of one such drive would take 
approx. 16-17 hours. Usually it takes 2x or 3x longer in a normal 
case, and more if your controllers or network are limited.

// david


> On 21 March 2016, at 13:13, Bastian Rosner wrote:
> 
> Yes, rebuild in case of a whole chassis failure is indeed an issue. That 
> depends on how the failure domain looks like.
> 
> I'm currently thinking of initially not running fully equipped nodes.
> Let's say four of these machines with 60x 6TB drives each, so only loaded 2/3.
> That's raw 1440TB distributed over eight OSD nodes.
> Each individual OSD-node would therefore host "only" 30 OSDs but still allow 
> for fast expansion.
> 
> Usually delivery and installation of a bunch of HDDs is much faster than 
> servers.
> 
> I really wonder how easy it is to add additional disks and whether the chance of 
> node or even chassis failure increases.
> 
> Cheers, Bastian
> 
> Am 2016-03-21 10:33, schrieb David:
>> Sounds like you’ll have a field day waiting for rebuild in case of a
>> node failure or an upgrade of the crush map ;)
>> David
>>> On 21 March 2016, at 09:55, Bastian Rosner wrote:
>>> Hi,
>>> any chance that somebody here already got hands on Dell DSS 7000 machines?
>>> 4U chassis containing 90x 3.5" drives and 2x dual-socket server sleds 
>>> (DSS7500). Sounds ideal for high capacity and density clusters, since each 
>>> of the server-sleds would run 45 drives, which I believe is a suitable 
>>> number of OSDs per node.
>>> When searching for this model there's not much detailed information out 
>>> there.
>>> Sadly I could not find a review from somebody who actually owns a bunch of 
>>> them and runs a decent PB-size cluster with it.
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] What's the best practice for Erasure Coding

2019-07-07 Thread David
Hi Ceph-Users,

 

I'm working with a  Ceph cluster (about 50TB, 28 OSDs, all Bluestore on lvm).

Recently, I'm trying to use the Erasure Code pool.

My question is "what's the best practice for using EC pools ?".

More specifically, which plugin (jerasure, isa, lrc, shec or  clay) should I 
adopt, and how to choose the combinations of (k,m) (e.g. (k=3,m=2), (k=6,m=3) ).
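
For reference, the kind of thing I mean is e.g. (profile name, k/m values and failure
domain are just examples):

$ ceph osd erasure-code-profile set ec-63 k=6 m=3 plugin=jerasure crush-failure-domain=host
$ ceph osd pool create ecpool 128 128 erasure ec-63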

 

Can anyone share some experience?

 

Thanks for any help.

 

Regards,

David

 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Multi-MDS setup, one MDS stuck in resolve, 3 stuck in standby, can't make another MDS come live

2014-06-05 Thread David Jericho
Hi all,

I did a bit of an experiment with multi-mds on firefly, and it worked fine 
until one of the MDS crashed when rebalancing. It's not the end of the world, 
and I could just start fresh with the cluster, but I'm keen to see if this can 
be fixed, as running multi-mds is something I would like to do in production: 
when it was working, it did reduce load and improve response time significantly.

The output of ceph mds dump is:

dumped mdsmap epoch 1232
epoch   1232
flags   0
created 2014-03-24 23:24:35.584469
modified2014-06-06 00:17:54.336201
tableserver 0
root0
session_timeout 60
session_autoclose   300
max_file_size   1099511627776
last_failure1227
last_failure_osd_epoch  24869
compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable 
ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses 
versioned encoding,6=dirfrag is stored in omap}
max_mds 2
in  0,1
up  {1=578616}
failed
stopped
data_pools  
0,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,101,105
metadata_pool   1
inline_data disabled
578616: 10.60.8.18:6808/252227 'c' mds.1.36 up:resolve seq 2
577576: 10.60.8.19:6800/58928 'd' mds.-1.0 up:standby seq 1
577603: 10.60.8.2:6801/245281 'a' mds.-1.0 up:standby seq 1
578623: 10.60.8.3:6800/75325 'b' mds.-1.0 up:standby seq 1

Modifying max_mds has no effect, and restarting/rebooting the cluster has no 
effect. No combination of commands I try, either with the ceph-mds binary 
or via the ceph tool, can make a second MDS start up and cause mds.1 to leave 
resolve and move on to the next step. Running with -debug_mds 10 provides no 
really enlightening information, nor does watching the mon logs. At a guess, 
it's looking for mds.0 to communicate with.

Anyone have some pointers?

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multi-MDS setup, one MDS stuck in resolve, 3 stuck in standby, can't make another MDS come live

2014-06-05 Thread David Jericho
> -Original Message-
> From: Yan, Zheng [mailto:uker...@gmail.com]
> looks like you removed mds.0 from the failed list. I don't think there is a
> command to add an mds back to the failed list. Maybe you can use 'ceph mds setmap
> ...' .

From memory, I probably did, misunderstanding how it worked. 

Is there any documentation on how to do this? My Google searching isn't turning 
up results on how to determine the last known good mdsmap. 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Swift API Authentication Failure

2014-06-06 Thread David Curtiss
Over the last two days, I set up ceph on a set of ubuntu 12.04 VMs (my
first time working with ceph), and it seems to be working fine (I have
HEALTH_OK, and can create a test document via the rados commandline tool),
but *I can't authenticate with the swift API*.

I followed the quickstart guides to get ceph and radosgw installed. (Listed
here, if you want to check my work: http://pastebin.com/nfPWCn9P )

Visiting the root of the web server *shows the ListAllMyBucketsResult XML*,
as expected, but trying to authenticate always gives me *"403 Forbidden"
errors*.

Here's the output of "radosgw-admin user info --uid=hive_cache":
http://pastebin.com/vwwbyd4c
And here's my curl invocation: http://pastebin.com/EfQ8nw8a
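
(In case the pastebin disappears: the general shape of the request is e.g.

$ curl -i -H "X-Auth-User: hive_cache:swift" -H "X-Auth-Key: <swift_secret_key>" http://gateway-host/auth/1.0

where gateway-host and the key are placeholders; on success the response should carry
X-Storage-Url and X-Auth-Token headers.)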

Any ideas on what might be wrong?

Thanks,
David
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG Selection Criteria for Deep-Scrub

2014-06-11 Thread David Zafman

The code checks the pg with the oldest scrub_stamp/deep_scrub_stamp to see 
whether the osd_scrub_min_interval/osd_deep_scrub_interval time has elapsed.  
So the output you are showing with the very old scrub stamps shouldn’t happen 
under default settings.  As soon set deep-scrub is re-enabled, the 5 pgs with 
that old stamp should be the first to get run.

A PG needs to have active and clean set to be scrubbed.   If any weren’t 
active+clean, then even a manual scrub would do nothing.

Now that I’m looking at the code I see that your symptom is possible if the 
values of osd_scrub_min_interval or osd_scrub_max_interval are larger than your 
osd_deep_scrub_interval.  Should the osd_scrub_min_interval be greater than 
osd_deep_scrub_interval, there won't be a deep scrub until the 
osd_scrub_min_interval has elapsed.  If an OSD is under load and the 
osd_scrub_max_interval is greater than the osd_deep_scrub_interval, there won't 
be a deep scrub until osd_scrub_max_interval has elapsed.

Please check the 3 interval config values.  Verify that your PGs are 
active+clean just to be sure.
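
One way to read them off a live OSD (osd.0 and the default admin socket path assumed):

$ ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config get osd_scrub_min_interval
$ ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config get osd_scrub_max_interval
$ ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config get osd_deep_scrub_interval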

David


On May 20, 2014, at 5:21 PM, Mike Dawson  wrote:

> Today I noticed that deep-scrub is consistently missing some of my Placement 
> Groups, leaving me with the following distribution of PGs and the last day 
> they were successfully deep-scrubbed.
> 
> # ceph pg dump all | grep active | awk '{ print $20}' | sort -k1 | uniq -c
>  5 2013-11-06
>221 2013-11-20
>  1 2014-02-17
> 25 2014-02-19
> 60 2014-02-20
>  4 2014-03-06
>  3 2014-04-03
>  6 2014-04-04
>  6 2014-04-05
> 13 2014-04-06
>  4 2014-04-08
>  3 2014-04-10
>  2 2014-04-11
> 50 2014-04-12
> 28 2014-04-13
> 14 2014-04-14
>  3 2014-04-15
> 78 2014-04-16
> 44 2014-04-17
>  8 2014-04-18
>  1 2014-04-20
> 16 2014-05-02
> 69 2014-05-04
>140 2014-05-05
>569 2014-05-06
>   9231 2014-05-07
>103 2014-05-08
>514 2014-05-09
>   1593 2014-05-10
>393 2014-05-16
>   2563 2014-05-17
>   1283 2014-05-18
>   1640 2014-05-19
>   1979 2014-05-20
> 
> I have been running the default "osd deep scrub interval" of once per week, 
> but have disabled deep-scrub on several occasions in an attempt to avoid the 
> associated degraded cluster performance I have written about before.
> 
> To get the PGs longest in need of a deep-scrub started, I set the 
> nodeep-scrub flag, and wrote a script to manually kick off deep-scrub 
> according to age. It is processing as expected.
> 
> Do you consider this a feature request or a bug? Perhaps the code that 
> schedules PGs to deep-scrub could be improved to prioritize PGs that have 
> needed a deep-scrub the longest.
> 
> Thanks,
> Mike Dawson
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Swift API Authentication Failure

2014-06-11 Thread David Curtiss
Success! You nailed it. Thanks, Yehuda.

I can successfully use the second subuser.

Given this success, I also tried the following:

$ rados -p .users.swift get '' tmp
$ rados -p .users.swift put hive_cache:swift tmp
$ rados -p .users.swift rm ''
$ rados -p .users.swift ls
hive_cache:swift2
hive_cache:swift

So everything looked good, as far as I can tell, but I still can't
authenticate with the first subuser. (But at least the second one still
works.)

- David


On Wed, Jun 11, 2014 at 5:38 PM, Yehuda Sadeh  wrote:

>  (resending also to list)
> Right. So Basically the swift subuser wasn't created correctly. I created
> issue #8587. Can you try creating a second subuser, see if it's created
> correctly the second time?
>
>
> On Wed, Jun 11, 2014 at 2:03 PM, David Curtiss  > wrote:
>
>> Hmm. Using that method, the subuser object appears to be an empty
>> string.
>>
>> First, note that I skipped the "Create Pools" step:
>> http://ceph.com/docs/master/radosgw/config/#create-pools
>> because it says "If the user you created has permissions, the gateway
>> will create the pools automatically."
>>
>> And indeed, the .users.swift pool is there:
>>
>> $ rados lspools
>> data
>> metadata
>> rbd
>> .rgw.root
>> .rgw.control
>> .rgw
>> .rgw.gc
>> .users.uid
>> .users.email
>> .users
>> .users.swift
>>
>> But the only entry in that pool is an empty string.
>>
>> $ rados ls -p .users.swift
>> 
>>
>> And that is indeed a blank line (as opposed to 0 lines), because there is
>> 1 object in that pool:
>> $ rados df
>> pool name   category KB  objects   clones
>> degraded  unfound   rdrd KB   wrwr KB
>> ...
>> .users.swift-  110
>>  0   00011
>>
>> For comparison, the 'df' line for the .users pool lists 2 objects, which
>> are as follows:
>>
>> $ rados ls -p .users
>> 4U5H60BMDL7OSI5ZBL8P
>> F7HZCI4SL12KVVSJ9UVZ
>>
>> - David
>>
>>
>> On Tue, Jun 10, 2014 at 11:49 PM, Yehuda Sadeh 
>> wrote:
>>
>>> Can you verify that the subuser object actually exist? Try doing:
>>>
>>> $ rados ls -p .users.swift
>>>
>>> (unless you have non default pools set)
>>>
>>> Yehuda
>>>
>>> On Tue, Jun 10, 2014 at 6:44 PM, David Curtiss
>>>  wrote:
>>> > No good. In fact, for some reason when I tried to load up my cluster
>>> VMs
>>> > today, I couldn't get them to work (something to do with a pipe
>>> fault), so
>>> > I recreated my VMs nearly from scratch, to no avail.
>>> >
>>> > Here are the commands I used to create the user and subuser:
>>> > radosgw-admin user create --uid=hive_cache --display-name="Hive Cache"
>>> > --email=pds.supp...@ni.com
>>> > radosgw-admin subuser create --uid=hive_cache
>>> --subuser=hive_cache:swift
>>> > --access=full
>>> > radosgw-admin key create --subuser=hive_cache:swift --key-type=swift
>>> > --secret=QFAMEDSJP5DEKJO0DDXY
>>> >
>>> > - David
>>> >
>>> >
>>> > On Mon, Jun 9, 2014 at 11:14 PM, Yehuda Sadeh 
>>> wrote:
>>> >>
>>> >> It seems that the subuser object was not created for some reason. Can
>>> >> you try recreating it?
>>> >>
>>> >> Yehuda
>>> >>
>>> >> On Sun, Jun 8, 2014 at 5:50 PM, David Curtiss
>>> >>  wrote:
>>> >> > Here's the log: http://pastebin.com/bRt9kw9C
>>> >> >
>>> >> > Thanks,
>>> >> > David
>>> >> >
>>> >> >
>>> >> > On Fri, Jun 6, 2014 at 10:58 PM, Yehuda Sadeh 
>>> >> > wrote:
>>> >> >>
>>> >> >> On Wed, Jun 4, 2014 at 12:00 PM, David Curtiss
>>> >> >>  wrote:
>>> >> >> > Over the last two days, I set up ceph on a set of ubuntu 12.04
>>> VMs
>>> >> >> > (my
>>> >> >> > first
>>> >> >> > time working with ceph), and it seems to be working fine (I have
>>> >> >> > HEALTH_OK,
>>> >> >> > and can create a test document via the rados commandline tool),
>>> but I
>>> >> >> > can't
>>> >> >> > authenticate with the swift API.
>>> >> >> >
>>> >> >> > I followed the quickstart guides to get ceph and radosgw
>>> installed.
>>> >> >> > (Listed
>>> >> >> > here, if you want to check my work: http://pastebin.com/nfPWCn9P
>>> )
>>> >> >> >
>>> >> >> > Visiting the root of the web server shows the
>>> ListAllMyBucketsResult
>>> >> >> > XML, as
>>> >> >> > expected, but trying to authenticate always gives me "403
>>> Forbidden"
>>> >> >> > errors.
>>> >> >> >
>>> >> >> > Here's the output of "radosgw-admin user info --uid=hive_cache":
>>> >> >> > http://pastebin.com/vwwbyd4c
>>> >> >> > And here's my curl invocation: http://pastebin.com/EfQ8nw8a
>>> >> >> >
>>> >> >> > Any ideas on what might be wrong?
>>> >> >> >
>>> >> >>
>>> >> >> Not sure. Can you try reproducing it with 'debug rgw = 20' and
>>> 'debug
>>> >> >> ms = 1' on rgw and provide the log?
>>> >> >>
>>> >> >> Thanks,
>>> >> >> Yehuda
>>> >> >
>>> >> >
>>> >
>>> >
>>>
>>
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What exactly is the kernel rbd on osd issue?

2014-06-12 Thread David Zafman

This was commented on recently on ceph-users, but I’ll explain the scenario.

If the same kernel that runs the OSD needs to flush rbd blocks to reclaim memory, and the OSD 
process needs memory to handle the flushes, you end up deadlocked.

If you run the rbd client in a VM with dedicated memory allocation from the 
point of view of the host kernel, this won’t happen.

David Zafman
Senior Developer
http://www.inktank.com
http://www.redhat.com

On Jun 12, 2014, at 6:33 PM, lists+c...@deksai.com wrote:

> I remember reading somewhere that the kernel ceph clients (rbd/fs) could
> not run on the same host as the OSD.  I tried finding where I saw that,
> and could only come up with some irc chat logs.
> 
> The issue stated there is that there can be some kind of deadlock.  Is
> this true, and if so, would you have to run a totally different kernel
> in a vm, or would some form of namespacing be enough to avoid it?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


  1   2   3   4   5   6   7   8   9   10   >