[ceph-users] large difference between "STORED" and "USED" size of ceph df

2020-05-03 Thread Lee, H. (Hurng-Chun)
Hello,

We use purely cephfs in our ceph cluster (version 14.2.7). The cephfs
data pool is an EC pool (k=4, m=2) with hdd OSDs using bluestore. The
default file layout (i.e. 4MB object size) is used.

We see the following output of ceph df:

---
RAW STORAGE:
    CLASS     SIZE        AVAIL       USED        RAW USED     %RAW USED
    hdd       951 TiB     888 TiB      63 TiB       63 TiB          6.58
    ssd       9.6 TiB     9.6 TiB     1.4 GiB       16 GiB          0.17
    TOTAL     961 TiB     898 TiB      63 TiB       63 TiB          6.52

POOLS:
    POOL               ID     STORED      OBJECTS     USED        %USED     MAX AVAIL
    cephfs-data         2      34 TiB      12.51M      52 TiB      5.93       553 TiB
    cephfs-metadata     4     994 MiB      98.61k     1.5 GiB      0.02       3.0 TiB
---

What caught my attention is the discrepancy between the reported
"USED" (52 TiB) and "STORED" (34 TiB) sizes for the cephfs-data pool.

According to this document (
https://docs.ceph.com/docs/master/releases/nautilus/#upgrade-compatibility-notes
):

- "USED" represents the amount of space allocated purely for data by all
OSD nodes, in KB
- "STORED" represents the amount of data stored by the user.

My understanding is that the "USED" size can be roughly taken as the
number of objects (12.51M) times the object size (4MB) of the file
layout; and since many files in our system are smaller than 4 MB, the
actual stored data is less.
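
As a quick sanity check of that estimate, here is a small back-of-envelope
sketch in Python, using only the numbers from the ceph df output above (this
is just rough arithmetic on my part, not how Ceph computes "USED"):

# Rough check: does OBJECTS x object size land near the reported USED?
# All values are taken from the ceph df output above; nothing queries the cluster.
objects = 12.51e6           # OBJECTS for cephfs-data
object_size = 4 * 2**20     # default file layout object size: 4 MiB, in bytes

estimate_tib = objects * object_size / 2**40
print("objects x object_size ~ %.1f TiB" % estimate_tib)   # ~47.7 TiB
print("reported USED = 52 TiB, reported STORED = 34 TiB")

The estimate lands in the same ballpark as the reported 52 TiB, which is what
led me to this interpretation.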

Is my interpretation correct? If so, does it mean that we will be
wasting a lot of space when we have a lot of files smaller than the
4MB object size in the system? Thanks for the help!

Cheers, Hong

-- 
Hurng-Chun (Hong) Lee, PhD
ICT manager

Donders Institute for Brain, Cognition and Behaviour, 
Centre for Cognitive Neuroimaging
Radboud University Nijmegen

e-mail: h@donders.ru.nl
tel: +31(0) 243610977
web: http://www.ru.nl/donders/
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 14.2.9 MDS Failing

2020-05-03 Thread Sasha Litvak
Marco,

Could you please share what was done to make your cluster stable again?

On Fri, May 1, 2020 at 4:47 PM Marco Pizzolo  wrote:
>
> Thanks Everyone,
>
> I was able to address the issue at least temporarily.  The filesystem and
> MDSes are for the time staying online and the pgs are being remapped.
>
> What I'm not sure about is the best tuning for MDS given our use case, nor
> am I sure of exactly what caused the OSDs to flap as they did, so I don't
> yet know how to avoid a recurrence.
>
> I do very much like Ceph though
>
> Best wishes,
>
> Marco
>
> On Fri, May 1, 2020 at 3:49 PM Marco Pizzolo  wrote:
>
> > Understood Paul, thanks.
> >
> > In case this helps to shed any further light...Digging through logs I'm
> > also seeing this:
> >
> > 2020-05-01 10:06:55.984 7eff10cc3700  1 mds.prdceph01 Updating MDS map to
> > version 1487236 from mon.2
> > 2020-05-01 10:06:56.398 7eff0e4be700  0 log_channel(cluster) log [WRN] :
> > 17 slow requests, 1 included below; oldest blocked for > 254.203584 secs
> > 2020-05-01 10:06:56.398 7eff0e4be700  0 log_channel(cluster) log [WRN] :
> > slow request 60.552277 seconds old, received at 2020-05-01 10:05:55.846466:
> > client_request(client.2525280:277916371 mkdir #0x10014e76974/1f 2020-05-01
> > 10:05:55.844490 caller_uid=1010, caller_gid=1015{}) currently submit entry:
> > journal_and_reply
> > 2020-05-01 10:06:57.400 7eff0e4be700  0 log_channel(cluster) log [WRN] :
> > 17 slow requests, 2 included below; oldest blocked for > 255.205489 secs
> > 2020-05-01 10:06:57.400 7eff0e4be700  0 log_channel(cluster) log [WRN] :
> > slow request 60.564545 seconds old, received at 2020-05-01 10:05:56.836104:
> > client_request(client.2525280:277921203 create
> > #0x10014f12b86/9254b3f0-1d5a-4e88-8d41-d36f244bcb12.zip 2020-05-01
> > 10:05:56.834494 caller_uid=1010, caller_gid=1015{}) currently submit entry:
> > journal_and_reply
> > 2020-05-01 10:06:57.400 7eff0e4be700  0 log_channel(cluster) log [WRN] :
> > slow request 60.550874 seconds old, received at 2020-05-01 10:05:56.849775:
> > client_request(client.2525280:277921267 mkdir #0x10014e78bec/e0 2020-05-01
> > 10:05:56.848494 caller_uid=1010, caller_gid=1015{}) currently submit entry:
> > journal_and_reply
> > 2020-05-01 10:06:58.400 7eff0e4be700  0 log_channel(cluster) log [WRN] :
> > 17 slow requests, 0 included below; oldest blocked for > 256.205519 secs
> > 2020-05-01 10:07:15.250 7eff0dcbd700  1 heartbeat_map is_healthy 'MDSRank'
> > had timed out after 15
> > 2020-05-01 10:07:15.250 7eff0dcbd700  0 mds.beacon.prdceph01 Skipping
> > beacon heartbeat to monitors (last acked 3.9s ago); MDS internal
> > heartbeat is not healthy!
> > 2020-05-01 10:07:15.750 7eff0dcbd700  1 heartbeat_map is_healthy 'MDSRank'
> > had timed out after 15
> > 2020-05-01 10:07:15.750 7eff0dcbd700  0 mds.beacon.prdceph01 Skipping
> > beacon heartbeat to monitors (last acked 4.4s ago); MDS internal
> > heartbeat is not healthy!
> > 2020-05-01 10:07:16.250 7eff0dcbd700  1 heartbeat_map is_healthy 'MDSRank'
> > had timed out after 15
> > 2020-05-01 10:07:16.250 7eff0dcbd700  0 mds.beacon.prdceph01 Skipping
> > beacon heartbeat to monitors (last acked 4.8s ago); MDS internal
> > heartbeat is not healthy!
> > 2020-05-01 10:07:16.750 7eff0dcbd700  1 heartbeat_map is_healthy 'MDSRank'
> > had timed out after 15
> > 2020-05-01 10:07:16.750 7eff0dcbd700  0 mds.beacon.prdceph01 Skipping
> > beacon heartbeat to monitors (last acked 5.49998s ago); MDS internal
> > heartbeat is not healthy!
> > 2020-05-01 10:07:17.250 7eff0dcbd700  1 heartbeat_map is_healthy 'MDSRank'
> > had timed out after 15
> > 2020-05-01 10:07:17.250 7eff0dcbd700  0 mds.beacon.prdceph01 Skipping
> > beacon heartbeat to monitors (last acked 5.8s ago); MDS internal
> > heartbeat is not healthy!
> >
> >
> > THEN about 5 minutes later...
> >
> >
> >
> > 2020-05-01 10:07:35.559 7eff10cc3700  1 mds.prdceph01  9: 'ceph'
> > 2020-05-01 10:07:35.559 7eff10cc3700  1 mds.prdceph01 respawning with exe
> > /usr/bin/ceph-mds
> > 2020-05-01 10:07:35.559 7eff10cc3700  1 mds.prdceph01  exe_path
> > /proc/self/exe
> > 2020-05-01 10:07:50.785 7fbff66291c0  0 ceph version 14.2.9
> > (581f22da52345dba46ee232b73b990f06029a2a0) nautilus (stable), process
> > ceph-mds, pid 9710
> > 2020-05-01 10:07:50.787 7fbff66291c0  0 pidfile_write: ignore empty
> > --pid-file
> > 2020-05-01 10:07:50.817 7fbfe4408700  1 mds.prdceph01 Updating MDS map to
> > version 1487238 from mon.2
> > 2020-05-01 10:07:55.820 7fbfe4408700  1 mds.prdceph01 Updating MDS map to
> > version 1487239 from mon.2
> > 2020-05-01 10:07:55.820 7fbfe4408700  1 mds.prdceph01 Map has assigned me
> > to become a standby
> > 2020-05-01 10:11:07.369 7fbfe4408700  1 mds.prdceph01 Updating MDS map to
> > version 1487282 from mon.2
> > 2020-05-01 10:11:07.373 7fbfe4408700  1 mds.0.1487282 handle_mds_map i am
> > now mds.0.1487282
> > 2020-05-01 10:11:07.373 7fbfe4408700  1 mds.0.1487282 handle_mds_map state
> > change up:boot --> up:re

[ceph-users] Re: upmap balancer and consequences of osds briefly marked out

2020-05-03 Thread Anthony D'Atri
Do I misunderstand this script, or does it not _quite_ do what’s desired here?

I fully get the scenario of applying a full-cluster map to allow incremental 
topology changes.

To be clear, if this is run to effectively freeze backfill during / following a 
traumatic event, it will freeze that adapted state, not strictly return to the 
pre-event state?  And thus the pg-upmap balancer would still need to be run to 
revert to the prior state?  And this would also hold true for a failed/replaced 
OSD?
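
For anyone following along, my (possibly incorrect) reading of the core idea is
sketched below in Python. This is not the CERN script itself, just a simplified
illustration of the approach: for each remapped PG, emit a
"ceph osd pg-upmap-items" command mapping the new "up" OSDs back to the current
"acting" OSDs so that backfill stops. The real script handles EC chunk ordering
and other edge cases this sketch ignores, and the JSON layout of "ceph pg ls"
differs between releases.

import json
import subprocess

# Simplified sketch of the "freeze backfill" idea: pin each remapped PG to where
# its data currently lives (the acting set) via pg-upmap-items so up == acting.
# Review the printed commands before running any of them.
raw = subprocess.check_output(["ceph", "pg", "ls", "remapped", "-f", "json"])
data = json.loads(raw)
pgs = data["pg_stats"] if isinstance(data, dict) else data  # layout varies by release

for pg in pgs:
    up, acting = pg["up"], pg["acting"]
    if 2147483647 in acting:        # CRUSH_ITEM_NONE: incomplete acting set, skip
        continue
    pairs = []
    for u, a in zip(up, acting):    # naive positional pairing; good enough for a sketch
        if u != a:
            pairs += [str(u), str(a)]
    if pairs:
        print("ceph osd pg-upmap-items", pg["pgid"], " ".join(pairs))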


> On May 1, 2020, at 7:37 AM, Dylan McCulloch  wrote:
> 
> Thanks Dan, that looks like a really neat method & script for a few 
> use-cases. We've actually used several of the scripts in that repo over the 
> years, so, many thanks for sharing.
> 
> That method will definitely help in the scenario in which a set of 
> unnecessary pg remaps have been triggered and can be caught early and 
> reverted. I'm still a little concerned about the possibility of, for example, 
> a brief network glitch occurring at night and then waking up to a full 
> unbalanced cluster. Especially with NVMe clusters that can rapidly remap and 
> rebalance (and for which we also have a greater impetus to squeeze out as 
> much available capacity as possible with upmap due to cost per TB). It's just 
> a risk I hadn't previously considered and was wondering if others have either 
> run into it or felt any need to plan around it.
> 
> Cheers,
> Dylan
> 
> 
>> From: Dan van der Ster 
>> Sent: Friday, 1 May 2020 5:53 PM
>> To: Dylan McCulloch 
>> Cc: ceph-users 
>> 
>> Subject: Re: [ceph-users] upmap balancer and consequences of osds briefly 
>> marked out
>> 
>> Hi,
>> 
>> You're correct that all the relevant upmap entries are removed when an
>> OSD is marked out.
>> You can try to use this script which will recreate them and get the
>> cluster back to HEALTH_OK quickly:
>> https://github.com/cernceph/ceph-scripts/blob/master/tools/upmap/upmap-remapped.py
>> 
>> Cheers, Dan
>> 
>> 
>> On Fri, May 1, 2020 at 9:36 AM Dylan McCulloch  wrote:
>>> 
>>> Hi all,
>>> 
>>> We're using upmap balancer which has made a huge improvement in evenly 
>>> distributing data on our osds and has provided a substantial increase in 
>>> usable capacity.
>>> 
>>> Currently on ceph version: 12.2.13 luminous
>>> 
>>> We ran into a firewall issue recently which led to a large number of osds 
>>> being briefly marked 'down' & 'out'. The osds came back 'up' & 'in' after 
>>> about 25 mins and the cluster was fine but had to perform a significant 
>>> amount of backfilling/recovery despite
>>> there being no end-user client I/O during that period.
>>> 
>>> Presumably the large number of remapped pgs and backfills were due to 
>>> pg_upmap_items being removed from the osdmap when osds were marked out and 
>>> subsequently those pgs were redistributed using the default crush algorithm.
>>> As a result of the brief outage our cluster became significantly imbalanced 
>>> again with several osds very close to full.
>>> Is there any reasonable mitigation for that scenario?
>>> 
>>> The auto-balancer will not perform optimizations while there are degraded 
>>> pgs, so it would only start reapplying pg upmap exceptions after initial 
>>> recovery is complete (at which point capacity may be dangerously reduced).
>>> Similarly, as admins, we normally only apply changes when the cluster is in 
>>> a healthy state, but if the same issue were to occur again would it be 
>>> advisable to manually apply balancer plans while initial recovery is still 
>>> taking place?
>>> 
>>> I guess my concern from this experience is that making use of the capacity 
>>> gained by using upmap balancer appears to carry some risk. i.e. it's 
>>> possible for a brief outage to remove those space efficiencies relatively 
>>> quickly and potentially result in full
>>> osds/cluster before the automatic balancer is able to resume and 
>>> redistribute pgs using upmap.
>>> 
>>> Curious whether others have any thoughts or experience regarding this.
>>> 
>>> Cheers,
>>> Dylan
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: What's the best practice for Erasure Coding

2020-05-03 Thread Alex Gorbachev
Hi Frank,

Reviving this old thread to ask whether the performance of these raw NL-SAS
drives has turned out to be adequate. I was wondering whether this is a deep
archive with almost no retrieval, and how many drives are used? In my
experience with large parallel writes, WAL/DB on SSD with bluestore, or
journal drives on SSD with filestore, have always been needed to sustain a
reasonably consistent transfer rate.
Very much appreciate any reference info on your design.

Best regards,
Alex

On Mon, Jul 8, 2019 at 4:30 AM Frank Schilder  wrote:

> >> Hi David,
>>
>> I'm running a cluster with bluestore on raw devices (no lvm) and all
>> journals collocated on the same disk with the data. Disks are spinning
>> NL-SAS. Our goal was to build storage at lowest cost, therefore all data on
>> HDD only. I got a few SSDs that I'm using for FS and RBD meta data. All
>> large pools are EC on spinning disk.
>>
>> I spent at least one month running detailed benchmarks (rbd bench), varying
>> the EC profile, object size, write size, etc. Results varied a lot. My
>> advice would be to run benchmarks on your own hardware: if there were a
>> single perfect choice, there wouldn't be so many options. For example, my
>> tests will not be valid when using separate fast disks for WAL and DB.
>>
>> There are some results though that might be valid in general:
>>
>> 1) EC pools have high throughput but low IOP/s compared with replicated
>> pools
>>
>> I see single-thread write speeds of up to 1.2GB (gigabyte) per second,
>> which is probably the network limit and not the disk limit. IOP/s get
>> better with more disks, but are way lower than what replicated pools can
>> provide. On a cephfs with EC data pool, small-file IO will be comparably
>> slow and eat a lot of resources.
>>
>> 2) I observe massive network traffic amplification on small IO sizes,
>> which is due to the way EC overwrites are handled. This is one bottleneck
>> for IOP/s. We have 10G infrastructure and use 2x10G client and 4x10G OSD
>> network. OSD bandwidth should be at least 2x the client network, better 4x or more.
>>
>> 3) k should only have small prime factors, power of 2 if possible
>>
>> I tested k=5,6,8,10,12. Best results in decreasing order: k=8, k=6. All
>> other choices were poor. The value of m does not seem relevant for performance.
>> Larger k will require more failure domains (more hardware).
>>
>> 4) object size matters
>>
>> I see the best throughput (at 1M write size) with object sizes of 4MB or
>> 8MB, with IOP/s getting somewhat better at smaller object sizes but
>> throughput dropping fast. I use the default of 4MB in production. Works
>> well for us.
>>
>> 5) jerasure is quite good and seems most flexible
>>
>> jerasure is quite CPU efficient and can handle smaller chunk sizes than
>> other plugins, which is preferable for IOP/s. However, CPU usage can
>> become a problem and a plugin optimized for specific values of k and m
>> might help here. Under usual circumstances I see very low load on all OSD
>> hosts, even under rebalancing. However, I remember that once I needed to
>> rebuild something on all OSDs (I don't remember what it was, sorry). In
>> this situation, CPU load went up to 30-50% (meaning up to half the cores
>> were at 100%), which is really high considering that each server has only
>> 16 disks at the moment and is sized to handle up to 100. CPU power could
>> become a bottleneck for us in the future.
>>
>> These are some general observations and do not replace benchmarks for
>> specific use cases. I was hunting for a specific performance pattern, which
>> might not be what you want to optimize for. I would recommend running
>> extensive benchmarks if you have to live with a configuration for a long
>> time - EC profiles cannot be changed.
>>
>> We settled on 8+2 and 6+2 pools with jerasure and object size 4M. We also
>> use bluestore compression. All meta data pools are on SSD, only very little
>> SSD space is required. This choice works well for the majority of our use
>> cases. We can still build small expensive pools to accommodate special
>> performance requests.
>>
>> Best regards,
>>
>> =
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> 
>> From: ceph-users  on behalf of David <
>> xiaomajia...@gmail.com>
>> Sent: 07 July 2019 20:01:18
>> To: ceph-us...@lists.ceph.com
>> Subject: [ceph-users]  What's the best practice for Erasure Coding
>>
>> Hi Ceph-Users,
>>
>> I'm working with a Ceph cluster (about 50TB, 28 OSDs, all Bluestore on
>> lvm).
>> Recently, I have been trying to use an Erasure Code pool.
>> My question is "what's the best practice for using EC pools?".
>> More specifically, which plugin (jerasure, isa, lrc, shec or clay)
>> should I adopt, and how do I choose the combination of (k,m) (e.g.
>> (k=3,m=2), (k=6,m=3))?
>>
>> Does anyone share some experience?
>>
>> Thanks for any help.
>>
>> Regards,
>> David
>>
>> ___
>> ceph-users mailin
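
As a concrete illustration of the 8+2 jerasure pool with bluestore compression
that Frank describes above, here is a minimal provisioning sketch. The pool
name, PG counts, failure domain and device class are placeholders for
illustration, not values taken from this thread:

import subprocess

# Sketch: create an 8+2 jerasure EC profile and a compressed EC data pool for CephFS.
# Every name and number below is a placeholder; adjust for your own cluster.
def ceph(*args):
    subprocess.run(["ceph", *args], check=True)

ceph("osd", "erasure-code-profile", "set", "ec82",
     "k=8", "m=2", "plugin=jerasure",
     "crush-failure-domain=host", "crush-device-class=hdd")
ceph("osd", "pool", "create", "cephfs-ec82-data", "2048", "2048", "erasure", "ec82")
ceph("osd", "pool", "set", "cephfs-ec82-data", "allow_ec_overwrites", "true")
ceph("osd", "pool", "set", "cephfs-ec82-data", "compression_mode", "aggressive")
ceph("fs", "add_data_pool", "cephfs", "cephfs-ec82-data")

As Frank notes, k, m and the plugin cannot be changed after the pool is
created, so it is worth benchmarking a layout like this before committing to it.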

[ceph-users] mount issues with rbd running xfs - Structure needs cleaning

2020-05-03 Thread Void Star Nill
Hello All,

One of the use cases (e.g. machine learning workloads) for RBD volumes in
our production environment is that users mount an RBD volume in RW mode in
a container, write some data to it, and later attach the same volume in RO
mode to a number of containers in parallel to consume the data.

I am trying to test this scenario with different file systems (ext3/4 and
xfs). I have automated test code that creates a volume, maps it to a node,
mounts it in RW mode, and writes some data into it. Later the same volume
is mounted in RO mode on a number of other nodes and a process reads from
the file.

I don't see any issues with ext3 or ext4 filesystems, but with XFS I notice
that 1 or 2 (out of 6) parallel read-only mounts fail with a "Structure needs
cleaning" error. What is surprising is that the remaining 4 or 5 mounts
succeed and I don't see any I/O issues on those - which suggests that there
shouldn't be any corruption on the volume itself. Also note that there is
no other process writing to the volume at this time, so no chance of
corruption that way.

I am doing xfs mounts with "ro,nouuid" mount options.

Any inputs on why I may be seeing this issue randomly?

Regards,
Shridhar
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: mount issues with rbd running xfs - Structure needs cleaning

2020-05-03 Thread Adam Tygart
I'm pretty sure that, to XFS, "read-only" is not quite "read-only": my
understanding is that XFS replays the journal on mount, even when mounted
read-only, unless it is also mounted with norecovery.
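
If that is indeed what's happening, a read-only mount that skips log replay
might look like the sketch below (the device and mountpoint are placeholders,
and whether norecovery actually cures the error you are seeing is an
assumption on my part):

import subprocess

# Hypothetical read-only XFS mount of an RBD device that skips log recovery.
# "norecovery" requires "ro"; "nouuid" allows mounting images with duplicate UUIDs.
dev, mnt = "/dev/rbd0", "/mnt/data"   # placeholders, not values from this thread
subprocess.run(["mount", "-t", "xfs", "-o", "ro,norecovery,nouuid", dev, mnt], check=True)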

--
Adam

On Sun, May 3, 2020, 22:14 Void Star Nill  wrote:

> Hello All,
>
> One of the use cases (e.g. machine learning workloads) for RBD volumes in
> our production environment is that users mount an RBD volume in RW mode in
> a container, write some data to it, and later attach the same volume in RO
> mode to a number of containers in parallel to consume the data.
>
> I am trying to test this scenario with different file systems (ext3/4 and
> xfs). I have automated test code that creates a volume, maps it to a node,
> mounts it in RW mode, and writes some data into it. Later the same volume
> is mounted in RO mode on a number of other nodes and a process reads from
> the file.
>
> I don't see any issues with ext3 or ext4 filesystems, but with XFS I notice
> that 1 or 2 (out of 6) parallel read-only mounts fail with a "Structure needs
> cleaning" error. What is surprising is that the remaining 4 or 5 mounts
> succeed and I don't see any I/O issues on those - which suggests that there
> shouldn't be any corruption on the volume itself. Also note that there is
> no other process writing to the volume at this time, so no chance of
> corruption that way.
>
> I am doing xfs mounts with "ro,nouuid" mount options.
>
> Any inputs on why I may be seeing this issue randomly?
>
> Regards,
> Shridhar
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: mount issues with rbd running xfs - Structure needs cleaning

2020-05-03 Thread brad . swanson
Are you mounting the RO with noatime?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: mount issues with rbd running xfs - Structure needs cleaning

2020-05-03 Thread Void Star Nill
Hello Brad, Adam,

Thanks for the quick responses.

I am not passing any arguments other than "ro,nouuid" on mount.

One thing I forgot to mention is that there could be more than one mount
of the same volume on a host - I don't know how this plays out for xfs.

Appreciate your inputs.

Regards,
Shridhar


On Sun, 3 May 2020 at 21:43,  wrote:

> Are you mounting the RO with noatime?
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] page cache flush before unmap?

2020-05-03 Thread Void Star Nill
Hello,

I wanted to know whether rbd will flush any writes in the page cache when a
volume is "unmap"ed on the host, or if we need to flush explicitly using
"sync" before unmap?

Thanks,
Shridhar
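
P.S. For context, assuming the volume carries a filesystem, the conservative
teardown sequence would look roughly like the sketch below (device and
mountpoint names are placeholders; whether the explicit sync step is redundant
is exactly what I am asking):

import subprocess

# Hypothetical teardown before releasing an RBD volume. Unmounting should flush
# dirty pages for the filesystem; the extra sync is the belt-and-braces step
# whose necessity is the question here.
dev, mnt = "/dev/rbd0", "/mnt/data"   # placeholders
subprocess.run(["sync"], check=True)               # flush page cache (possibly redundant)
subprocess.run(["umount", mnt], check=True)        # unmount the filesystem
subprocess.run(["rbd", "unmap", dev], check=True)  # release the kernel mapping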
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io