[ceph-users] Re: Nautilus 14.2.6 ceph-volume bluestore _read_fsid unparsable uuid

2020-01-28 Thread Jan Fajerski
On Mon, Jan 27, 2020 at 03:23:55PM -0500, Dave Hall wrote:
>All,
>
>I've just spent a significant amount of time unsuccessfully chasing 
>the _read_fsid unparsable uuid error on Debian 10 / Nautilus 14.2.6.
>Since this is a brand new cluster, last night I gave up and moved back 
>to Debian 9 / Luminous 12.2.11.  In both cases I'm using the packages 
>from Debian Backports with ceph-ansible as my deployment tool.
>
>Note that above I said 'the _read_fsid unparsable uuid' error. I've 
>searched around a bit and found some previously reported issues, but I 
>did not see any conclusive resolutions.
>
>I would like to get to Nautilus as quickly as possible, so I'd gladly 
>provide additional information to help track down the cause of this 
>symptom.  I can confirm that, looking at the ceph-volume.log on the 
>OSD host I see no difference between the ceph-volume lvm batch command 
>generated by the ceph-ansible versions associated with these two Ceph 
>releases:
>
>   ceph-volume --cluster ceph lvm batch --bluestore --yes
>   --block-db-size 133358734540 /dev/sdc /dev/sdd /dev/sde /dev/sdf
>   /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/nvme0n1
>
>Note that I'm using --block-db-size to divide my NVMe into 12 segments 
>as I have 4 empty drive bays on my OSD servers that I may eventually 
>be able to fill.
>
>My OSD hardware is:
>
>   Disk /dev/nvme0n1: 1.5 TiB, 1600321314816 bytes, 3125627568 sectors
>   Disk /dev/sdc: 10.9 TiB, 12000138625024 bytes, 23437770752 sectors
>   Disk /dev/sdd: 10.9 TiB, 12000138625024 bytes, 23437770752 sectors
>   Disk /dev/sde: 10.9 TiB, 12000138625024 bytes, 23437770752 sectors
>   Disk /dev/sdf: 10.9 TiB, 12000138625024 bytes, 23437770752 sectors
>   Disk /dev/sdg: 10.9 TiB, 12000138625024 bytes, 23437770752 sectors
>   Disk /dev/sdh: 10.9 TiB, 12000138625024 bytes, 23437770752 sectors
>   Disk /dev/sdi: 10.9 TiB, 12000138625024 bytes, 23437770752 sectors
>   Disk /dev/sdj: 10.9 TiB, 12000138625024 bytes, 23437770752 sectors
>
>I'd send the output of ceph-volume inventory on Luminous, but I'm 
>getting  -->: KeyError: 'human_readable_size'.
>
>Please let me know if I can provide any further information.
Mind re-running your ceph-volume command with debug output enabled:
CEPH_VOLUME_DEBUG=true ceph-volume --cluster ceph lvm batch --bluestore ...

Ideally you could also open a bug report here:
https://tracker.ceph.com/projects/ceph-volume/issues/new

Thanks!
>
>Thanks.
>
>-Dave
>
>-- 
>Dave Hall
>Binghamton University
>
>___
>ceph-users mailing list -- ceph-users@ceph.io
>To unsubscribe send an email to ceph-users-le...@ceph.io

-- 
Jan Fajerski
Senior Software Engineer Enterprise Storage
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5, 90409 Nürnberg, Germany
(HRB 36809, AG Nürnberg)
Geschäftsführer: Felix Imendörffer
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: data loss on full file system?

2020-01-28 Thread Paul Emmerich
Yes, data that is not synced is not guaranteed to be written to disk,
this is consistent with POSIX semantics.
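
As a minimal illustration (hypothetical mount point and file name, not from the
original thread): forcing the data to disk before close makes the error visible
at write time instead of only after the page cache is dropped:

    # conv=fsync makes dd call fsync() at the end and check its return code,
    # so a full CephFS should report "No space left on device" here instead
    # of silently losing the data later
    dd if=/dev/urandom of=/mnt/cephfs/testfile bs=1M count=100 conv=fsync

In a C program the equivalent is to fflush() the stream, fsync() the file
descriptor and check both return codes before fclose().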


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Mon, Jan 27, 2020 at 9:11 PM Håkan T Johansson  wrote:
>
>
> Hi,
>
> for test purposes, I have set up two 100 GB OSDs, one
> taking a data pool and the other metadata pool for cephfs.
>
> Am running 14.2.6-1-gffd69200ad-1 with packages from
> https://mirror.croit.io/debian-nautilus
>
> Am then running a program that creates a lot of 1 MiB files by calling
>fopen()
>fwrite()
>fclose()
> for each of them.  Error codes are checked.
>
> This works successfully for ~100 GB of data, and then strangely also succeeds
> for many more 100 GB of data...  ??
>
> All written files have size 1 MiB with 'ls', and thus should contain the data
> written.  However, on inspection, the files written after the first ~100 GiB,
> are full of just 0s.  (hexdump -C)
>
>
> To further test this, I use the standard tool 'cp' to copy a few 
> random-content
> files into the full cephfs filesystem.  cp reports no complaints, and after
> the copy operations, content is seen with hexdump -C.  However, after forcing
> the data out of cache on the client by reading other earlier created files,
> hexdump -C show all-0 content for the files copied with 'cp'.  Data that was
> there is suddenly gone...?
>
>
> I am new to ceph.  Is there an option I have missed to avoid this behaviour?
> (I could not find one in
> https://docs.ceph.com/docs/master/man/8/mount.ceph/ )
>
> Is this behaviour related to
> https://docs.ceph.com/docs/mimic/cephfs/full/
> ?
>
> (That page states 'sometime after a write call has already returned 0'. But if
> write returns 0, then no data has been written, so the user program would not
> assume any kind of success.)
>
> Best regards,
>
> Håkan
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: EC pool creation results in incorrect M value?

2020-01-28 Thread Geoffrey Rhodes
Hi Eric,

With regards to "From the output of “ceph osd pool ls detail” you can see
min_size=4, the crush rule says min_size=3 however the pool does NOT
survive 2 hosts failing.  Am I missing something?"

For your EC profile you need to set the pool min_size=3 to still read/write
to the pool with two host failures.
RUN:  sudo ceph osd pool set ec32pool min_size 3
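
To verify the change took effect (a quick sketch, assuming the pool name used
above):

    sudo ceph osd pool ls detail | grep ec32pool   # should now show min_size 3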

Kind regards
Geoffrey Rhodes

On Mon, 27 Jan 2020 at 22:11,  wrote:

> Send ceph-users mailing list submissions to
> ceph-users@ceph.io
>
> To subscribe or unsubscribe via email, send a message with subject or
> body 'help' to
> ceph-users-requ...@ceph.io
>
> You can reach the person managing the list at
> ceph-users-ow...@ceph.io
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of ceph-users digest..."
>
> Today's Topics:
>
>1. Re: EC pool creation results in incorrect M value? (Paul Emmerich)
>2. Re: EC pool creation results in incorrect M value? (Smith, Eric)
>3. Re: EC pool creation results in incorrect M value? (Smith, Eric)
>4. data loss on full file system? (Håkan T Johansson)
>
>
> --
>
> Date: Mon, 27 Jan 2020 17:14:55 +0100
> From: Paul Emmerich 
> Subject: [ceph-users] Re: EC pool creation results in incorrect M
> value?
> To: "Smith, Eric" 
> Cc: "ceph-users@ceph.io" 
> Message-ID:
> <
> cad9ytbfb28fx_xanxruuxq1uqaya_ouho_fke+vbts1vxgk...@mail.gmail.com>
> Content-Type: text/plain; charset="UTF-8"
>
> min_size in the crush rule and min_size in the pool are completely
> different things that happen to share the same name.
>
> Ignore min_size in the crush rule, it has virtually no meaning in
> almost all cases (like this one).
>
>
> Paul
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>
> On Mon, Jan 27, 2020 at 3:41 PM Smith, Eric  wrote:
> >
> > I have a Ceph Luminous (12.2.12) cluster with 6 nodes. I’m attempting to
> create an EC3+2 pool with the following commands:
> >
> > Create the EC profile:
> >
> > ceph osd erasure-code-profile set es32 k=3 m=2 plugin=jerasure w=8
> technique=reed_sol_van crush-failure-domain=host crush-root=sgshared
> >
> > Verify profile creation:
> >
> > [root@mon-1 ~]# ceph osd erasure-code-profile get es32
> >
> > crush-device-class=
> >
> > crush-failure-domain=host
> >
> > crush-root=sgshared
> >
> > jerasure-per-chunk-alignment=false
> >
> > k=3
> >
> > m=2
> >
> > plugin=jerasure
> >
> > technique=reed_sol_van
> >
> > w=8
> >
> > Create a pool using this profile:
> >
> > ceph osd pool create ec32pool 1024 1024 erasure es32
> >
> > List pool detail:
> >
> > pool 31 'es32' erasure size 5 min_size 4 crush_rule 11 object_hash
> rjenkins pg_num 1024 pgp_num 1024 last_change 1568 flags hashpspool
> stripe_width 12288 application ES
> >
> > Here’s the crush rule that’s created:
> > {
> >
> > "rule_id": 11,
> >
> > "rule_name": "es32",
> >
> > "ruleset": 11,
> >
> > "type": 3,
> >
> > "min_size": 3,
> >
> > "max_size": 5,
> >
> > "steps": [
> >
> > {
> >
> > "op": "set_chooseleaf_tries",
> >
> > "num": 5
> >
> > },
> >
> > {
> >
> > "op": "set_choose_tries",
> >
> > "num": 100
> >
> > },
> >
> > {
> >
> > "op": "take",
> >
> > "item": -2,
> >
> > "item_name": "sgshared"
> >
> > },
> >
> > {
> >
> > "op": "chooseleaf_indep",
> >
> > "num": 0,
> >
> > "type": "host"
> >
> > },
> >
> > {
> >
> > "op": "emit"
> >
> > }
> >
> > ]
> >
> > },
> >
> >
> >
> > From the output of “ceph osd pool ls detail” you can see min_size=4, the
> crush rule says min_size=3 however the pool does NOT survive 2 hosts
> failing.
> >
> >
> >
> > Am I missing something?
> >
> >
> >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
> --
>
> Date: Mon, 27 Jan 2020 16:22:06 +
> From: "Smith, Eric" 
> Subject: [ceph-users] Re: EC pool creation results in incorrect M
> value?
> To: Paul Emmerich 
> Cc: "ceph-users@ceph.io" 
> Message-ID:   prd14.prod.outlook.com>
> Content-Type: text/plain; charset="utf-8"
>
> Thanks for the info regarding min_size in the crush rule - does this seem
> like a bug to you then? Is anyone else able to reproduce this?
>
> -Original Message-
> From: Paul Emmerich 
> Sent: Monday, January 27, 2020 11:15 AM
> To: Smith, Eric 
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] EC po

[ceph-users] Renaming LVM Groups of OSDs

2020-01-28 Thread Stolte, Felix
Hi all,

I would like to rename the logical volumes / volume groups used by my OSDs. Do 
I need to change anything else than the block and block.db links under 
/var/lib/ceph/osd/?

IT-Services
Telefon 02461 61-9243
E-Mail: f.sto...@fz-juelich.de
-
-
Forschungszentrum Juelich GmbH
52425 Juelich
Sitz der Gesellschaft: Juelich
Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
Vorsitzender des Aufsichtsrats: MinDir Dr. Karl Eugen Huthmacher
Geschaeftsfuehrung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender),
Karsten Beneke (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
Prof. Dr. Sebastian M. Schmidt
-
-
 

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Renaming LVM Groups of OSDs

2020-01-28 Thread Wido den Hollander
Hi,

Keep in mind that /var/lib/ceph/osd/ is a tmpfs which is created by
'ceph-bluestore-tool' on OSD startup.

All the data in there comes from the lvtags set on the LVs.

So I *think* you can just rename the Volume Group and rescan with
ceph-volume.
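
For example, the tags ceph-volume reads at activation time can be inspected
with standard commands (a sketch, nothing cluster-specific assumed):

    # show the ceph.* lvtags on each logical volume
    lvs -o lv_name,vg_name,lv_tags
    # or let ceph-volume print what it knows per OSD
    ceph-volume lvm list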

Wido

On 1/28/20 10:25 AM, Stolte, Felix wrote:
> Hi all,
> 
> I would like to rename the logical volumes / volume groups used by my OSDs. 
> Do I need to change anything else than the block and block.db links under 
> /var/lib/ceph/osd/?
> 
> IT-Services
> Telefon 02461 61-9243
> E-Mail: f.sto...@fz-juelich.de
> -
> -
> Forschungszentrum Juelich GmbH
> 52425 Juelich
> Sitz der Gesellschaft: Juelich
> Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
> Vorsitzender des Aufsichtsrats: MinDir Dr. Karl Eugen Huthmacher
> Geschaeftsfuehrung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender),
> Karsten Beneke (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
> Prof. Dr. Sebastian M. Schmidt
> -
> -
>  
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Renaming LVM Groups of OSDs

2020-01-28 Thread Kaspar Bosma

Hi,

In my experience it is also wise to make sure the lvtags reflect the new vg/lv 
names!

Kaspar

> On 28 January 2020 at 10:38 Wido den Hollander wrote:
>
> Hi,
>
> Keep in mind that /var/lib/ceph/osd/ is a tmpfs which is created by
> 'ceph-bluestore-tool' on OSD startup.
>
> All the data in there comes from the lvtags set on the LVs.
>
> So I *think* you can just rename the Volume Group and rescan with
> ceph-volume.
>
> Wido
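
A sketch of the kind of workflow being discussed, with made-up VG/LV names and
OSD id; this is untested here, so try it on a single non-critical OSD first:

    systemctl stop ceph-osd@12                          # hypothetical OSD id
    vgrename old-vg new-vg
    lvrename new-vg old-lv new-lv
    # update the ceph.* tags that embed the old device path (ceph.block_device,
    # and ceph.db_device if a separate DB LV is used)
    lvchange --deltag "ceph.block_device=/dev/old-vg/old-lv" new-vg/new-lv
    lvchange --addtag "ceph.block_device=/dev/new-vg/new-lv" new-vg/new-lv
    ceph-volume lvm activate --all                      # rescan and restart the OSD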
 ___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] moving small production cluster to different datacenter

2020-01-28 Thread Marc Roos


Say one is forced to move a production cluster (4 nodes) to a different 
datacenter. What options do I have, other than just turning it off at 
the old location and back on at the new location?

Maybe buying some extra nodes, and move one node at a time?


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Luminous Bluestore OSDs crashing with ASSERT

2020-01-28 Thread Stefan Priebe - Profihost AG
Hello Igor,

i updated all servers to latest 4.19.97 kernel but this doesn't fix the
situation.

I can provide you with all those logs - any idea where to upload / how
to sent them to you?

Greets,
Stefan

Am 20.01.20 um 13:12 schrieb Igor Fedotov:
> Hi Stefan,
> 
> these lines are result of transaction dump performed on a failure during
> transaction submission (which is shown as
> 
> "submit_transaction error: Corruption: block checksum mismatch code = 2"
> 
> Most probably they are out of interest (checksum errors are unlikely to
> be caused by transaction content) and hence we need earlier stuff to
> learn what caused that
> 
> checksum mismatch.
> 
> It's hard to give any formal overview of what you should look for, from
> my troubleshooting experience generally one may try to find:
> 
> - some previous error/warning indications (e.g. allocation, disk access,
> etc)
> 
> - prior OSD crashes (sometimes they might have different causes/stack
> traces/assertion messages)
> 
> - any timeout or retry indications
> 
> - any uncommon log patterns which aren't present during regular running
> but happen each time before the crash/failure.
> 
> Anyway I think the inspection depth should be much(?) deeper than
> presumably it is (from what I can see from your log snippets).
> 
> Ceph keeps the last 10000 log events with an increased log level and dumps
> them on crash with a negative index starting at -10000 up to -1 as a prefix.
> 
> -1> 2020-01-16 01:10:13.404090 7f3350a14700 -1 rocksdb:
> 
> 
> It would be great if you could share several log snippets for different
> crashes containing these last 10000 lines.
> 
> 
> Thanks,
> 
> Igor
> 
> 
> On 1/19/2020 9:42 PM, Stefan Priebe - Profihost AG wrote:
>> Hello Igor,
>>
>> there's absolutely nothing in the logs before.
>>
>> What do those lines mean:
>> Put( Prefix = O key =
>> 0x7f8001cc45c881217262'd_data.4303206b8b4567.9632!='0xfffe6f0012'x'
>>
>> Value size = 480)
>> Put( Prefix = O key =
>> 0x7f8001cc45c881217262'd_data.4303206b8b4567.9632!='0xfffe'o'
>>
>> Value size = 510)
>>
>> on the right side I always see 0xfffe on all
>> failed OSDs.
>>
>> greets,
>> Stefan
>> Am 19.01.20 um 14:07 schrieb Stefan Priebe - Profihost AG:
>>> Yes, except that this happens on 8 different clusters with different
>>> hw but same ceph version and same kernel version.
>>>
>>> Greets,
>>> Stefan
>>>
 Am 19.01.2020 um 11:53 schrieb Igor Fedotov :

 So the intermediate summary is:

 Any OSD in the cluster can experience interim RocksDB checksum
 failure. Which isn't present after OSD restart.

 No HW issues observed, no persistent artifacts (except OSD log)
 afterwards.

 And looks like the issue is rather specific to the cluster as no
 similar reports from other users seem to be present.


 Sorry, I'm out of ideas other then collect all the failure logs and
 try to find something common in them. May be this will shed some
 light..

 BTW from my experience it might make sense to inspect OSD log prior
 to failure (any error messages and/or prior restarts, etc) sometimes
 this might provide some hints.


 Thanks,

 Igor


> On 1/17/2020 2:30 PM, Stefan Priebe - Profihost AG wrote:
> HI Igor,
>
>> Am 17.01.20 um 12:10 schrieb Igor Fedotov:
>> hmmm..
>>
>> Just in case - suggest to check H/W errors with dmesg.
> this happens on around 80 nodes - I don't expect all of those to have
> unidentified hw errors. Also all of them are monitored - no dmesg output
> contains any errors.
>
>> Also there are some (not very much though) chances this is another
>> incarnation of the following bug:
>> https://tracker.ceph.com/issues/22464
>> https://github.com/ceph/ceph/pull/24649
>>
>> The corresponding PR works around it for main device reads (user data
>> only!) but theoretically it might still happen
>>
>> either for DB device or DB data at main device.
>>
>> Can you observe any bluefs spillovers? Are there any correlation
>> between
>> failing OSDs and spillover presence if any, e.g. failing OSDs always
>> have a spillover. While OSDs without spillovers never face the
>> issue...
>>
>> To validate this hypothesis one can try to monitor/check (e.g. once a
>> day for a week or something) "bluestore_reads_with_retries"
>> counter over
>> OSDs to learn if the issue is happening
>>
>> in the system.  Non-zero values mean it's there for user data/main
>> device and hence is likely to happen for DB ones as well (which
>> doesn't
>> have any workaround yet).
> OK i checked bluestore_reads_with_retries on 360 osds but all of
> them say 0.
>
>
>> Additionally you might want to monitor memory usage as th

[ceph-users] Re: moving small production cluster to different datacenter

2020-01-28 Thread Wido den Hollander



On 1/28/20 11:19 AM, Marc Roos wrote:
> 
> Say one is forced to move a production cluster (4 nodes) to a different 
> datacenter. What options do I have, other than just turning it off at 
> the old location and on on the new location?
> 
> Maybe buying some extra nodes, and move one node at a time?

I did this once. This cluster was running IPv6-only (still is) and thus
I had the flexibility of new IPs.

First I temporarily moved the MONs from hardware to Virtual. MONMAP went
from 3 to 5 MONs.

Then I moved the MONs one by one to the new DC and then removed the 2
additional VMs.

Then I set the 'noout' flag and moved the OSD nodes one by one. These
datacenters were located very close thus each node could be moved within
20 minutes.

Wait for recovery to finish and then move the next node.
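
In command form the per-node loop roughly looks like this (a sketch, not the
exact commands used during that migration):

    ceph osd set noout          # before moving the first node
    # shut down one OSD node, move it, power it back up, then:
    ceph -s                     # wait for recovery / HEALTH_OK before the next node
    ceph osd unset noout        # once the last node is back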

Keep in mind that there is/might be additional latency between the two
datacenters.

Wido

> 
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: moving small production cluster to different datacenter

2020-01-28 Thread Tobias Urdin

We did this as well, pretty much the same as Wido.
We had a fiber connection with good latency between the locations.

We installed a virtual monitor in the destination datacenter to always 
keep quorum, then we simply moved one node at a time after setting noout.

When we brought a node up at the destination we had a small data movement, 
then the cluster was back to healthy again.

We had higher apply and commit latency until all the nodes were on the 
destination side, but we never noticed any performance issues that caused 
problems for us.

Best regards

On 1/28/20 1:30 PM, Wido den Hollander wrote:


On 1/28/20 11:19 AM, Marc Roos wrote:

Say one is forced to move a production cluster (4 nodes) to a different
datacenter. What options do I have, other than just turning it off at
the old location and on on the new location?

Maybe buying some extra nodes, and move one node at a time?

I did this ones. This cluster was running IPv6-only (still is) and thus
I had the flexibility of new IPs.

First I temporarily moved the MONs from hardware to Virtual. MONMAP went
from 3 to 5 MONs.

Then I moved the MONs one by one to the new DC and then removed the 2
additional VMs.

Then I set the 'noout' flag and moved the OSD nodes one by one. These
datacenters were located very close thus each node could be moved within
20 minutes.

Wait for recovery to finish and then move the next node.

Keep in mind that there is/might be additional latency between the two
datacenters.

Wido



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: moving small production cluster to different datacenter

2020-01-28 Thread Simon Ironside
And us too, exactly as below. One at a time then wait for things to 
recover before moving the next host. We didn't have any issues with this 
approach either.


Regards,
Simon.

On 28/01/2020 13:03, Tobias Urdin wrote:

We did this as well, pretty much the same as Wido.
We had a fiber connection with good latency between the locations.

We installed a virtual monitor in the destination datacenter to always 
keep quorum, then we simply moved one node at a time after setting noout.

When we brought a node up at the destination we had a small data movement, 
then the cluster was back to healthy again.

We had higher apply and commit latency until all the nodes were on the 
destination side, but we never noticed any performance issues that caused 
problems for us.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] CephFS - objects in default data pool

2020-01-28 Thread CASS Philip
I have a query about https://docs.ceph.com/docs/master/cephfs/createfs/:

"The data pool used to create the file system is the "default" data pool and 
the location for storing all inode backtrace information, used for hard link 
management and disaster recovery. For this reason, all inodes created in CephFS 
have at least one object in the default data pool."

This does not match my experience (nautilus servers, nautilus FUSE client or 
Centos 7 kernel client). I have a cephfs with a replicated top-level pool and a 
directory set to use erasure coding with setfattr, though I also did the same 
test using the subvolume commands with the same result.  "Ceph df detail" shows 
no objects used in the top level pool, as shown in 
https://gist.github.com/pcass-epcc/af24081cf014a66809e801f33bcb535b (also 
displayed in-line below)

It would be useful if indeed clients didn't have to write to the top-level 
pool, since that would mean we could give different clients permission only to 
pool-associated subdirectories without giving everyone write access to a pool 
with data structures shared between all users of the filesystem.

[root@hdr-admon01 ec]# ceph df detail; ceph fs ls; ceph fs status
RAW STORAGE:
CLASS SIZEAVAIL   USEDRAW USED %RAW USED
hdd   3.3 PiB 3.3 PiB  32 TiB   32 TiB  0.95
nvme  2.9 TiB 2.9 TiB 504 MiB  2.5 GiB  0.08
TOTAL 3.3 PiB 3.3 PiB  32 TiB   32 TiB  0.95

POOLS:
    POOL                        ID  STORED   OBJECTS  USED     %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  DIRTY  USED COMPR  UNDER COMPR
    cephfs.fs1.metadata          5  162 MiB       63  324 MiB   0.01    1.4 TiB  N/A            N/A             63  0 B         0 B
    cephfs.fs1-replicated.data   6      0 B        0      0 B      0    1.0 PiB  N/A            N/A              0  0 B         0 B
    cephfs.fs1-ec.data           7  8.0 GiB    2.05k   11 GiB      0    2.4 PiB  N/A            N/A          2.05k  0 B         0 B
name: fs1, metadata pool: cephfs.fs1.metadata, data pools: 
[cephfs.fs1-replicated.data cephfs.fs1-ec.data ]
fs1 - 4 clients
===
+--+++---+---+---+
| Rank | State  |MDS |Activity   |  dns  |  inos |
+--+++---+---+---+
|  0   | active | hdr-meta02 | Reqs:0 /s |   29  |   16  |
+--+++---+---+---+
++--+---+---+
|Pool|   type   |  used | avail |
++--+---+---+
|cephfs.fs1.metadata | metadata |  324M | 1414G |
| cephfs.fs1-replicated.data |   data   |0  | 1063T |
| cephfs.fs1-ec.data |   data   | 11.4G | 2505T |
++--+---+---+
+-+
| Standby MDS |
+-+
|  hdr-meta01 |
+-+
MDS version: ceph version 14.2.5 (ad5bd132e1492173c85fda2cc863152730b16a92) 
nautilus (stable)

[root@hdr-admon01 ec]# ll /test-fs/ec/
total 12582912
-rw-r--r--. 1 root root 4294967296 Jan 27 22:26 new-file
-rw-r--r--. 2 root root 4294967296 Jan 28 14:06 new-file2
-rw-r--r--. 2 root root 4294967296 Jan 28 14:06 new-file-same-inode-as-newfile2

Regards,
Phil
_
Philip Cass
HPC Systems Specialist - Senior Systems Administrator
EPCC

Advanced Computing Facility
Bush Estate
Penicuik

Tel:+44 (0)131 4457815
Email:   
p.c...@epcc.ed.ac.uk

_

The University of Edinburgh is a charitable body, registered in Scotland, with 
registration number SC005336.
The information contained in this e-mail (including any attachments) is 
confidential and is intended for the use of the addressee only.  If you have 
received this message in error, please delete it and notify the originator 
immediately.
Please consider the environment before printing this email.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] No Activity?

2020-01-28 Thread DHilsbos
All;

I haven't had a single email come in from the ceph-users list at ceph.io since 
01/22/2020.

Is there just that little traffic right now?

Thank you,

Dominic L. Hilsbos, MBA 
Director - Information Technology 
Perform Air International Inc.
dhils...@performair.com 
www.PerformAir.com

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: No Activity?

2020-01-28 Thread Sage Weil
On Tue, 28 Jan 2020, dhils...@performair.com wrote:
> All;
> 
> I haven't had a single email come in from the ceph-users list at ceph.io 
> since 01/22/2020.
> 
> Is there just that little traffic right now?

I'm seeing 10-20 messages per day.  Confirm your registration and/or check 
your filters?

sage
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Nautilus 14.2.6 ceph-volume bluestore _read_fsid unparsable uuid

2020-01-28 Thread Dave Hall

Jan,

Unfortunately I'm under immense pressure right now to get some form of 
Ceph into production, so it's going to be Luminous for now, or maybe a 
live upgrade to Nautilus without recreating the OSDs (if that's possible).


The good news is that in the next couple months I expect to add more 
hardware that should be nearly identical.  I will gladly give it a go at 
that time and see if I can recreate.  (Or, if I manage to thoroughly 
crash my current fledgling cluster, I'll give it another go on one node 
while I'm up all night recovering.)


If you could tell me where to look I'd gladly read some code and see if 
I can find anything that way.  Or if there's any sort of design document 
describing the deep internals I'd be glad to scan it to see if I've hit 
a corner case of some sort.  Actually, I'd be interested in reading 
those documents anyway if I could.


Thanks.

-Dave

Dave Hall

On 1/28/2020 3:05 AM, Jan Fajerski wrote:

On Mon, Jan 27, 2020 at 03:23:55PM -0500, Dave Hall wrote:

All,

I've just spent a significant amount of time unsuccessfully chasing
the _read_fsid unparsable uuid error on Debian 10 / Nautilus 14.2.6.
Since this is a brand new cluster, last night I gave up and moved back
to Debian 9 / Luminous 12.2.11.  In both cases I'm using the packages

from Debian Backports with ceph-ansible as my deployment tool.

Note that above I said 'the _read_fsid unparsable uuid' error. I've
searched around a bit and found some previously reported issues, but I
did not see any conclusive resolutions.

I would like to get to Nautilus as quickly as possible, so I'd gladly
provide additional information to help track down the cause of this
symptom.  I can confirm that, looking at the ceph-volume.log on the
OSD host I see no difference between the ceph-volume lvm batch command
generated by the ceph-ansible versions associated with these two Ceph
releases:

   ceph-volume --cluster ceph lvm batch --bluestore --yes
   --block-db-size 133358734540 /dev/sdc /dev/sdd /dev/sde /dev/sdf
   /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/nvme0n1

Note that I'm using --block-db-size to divide my NVMe into 12 segments
as I have 4 empty drive bays on my OSD servers that I may eventually
be able to fill.

My OSD hardware is:

   Disk /dev/nvme0n1: 1.5 TiB, 1600321314816 bytes, 3125627568 sectors
   Disk /dev/sdc: 10.9 TiB, 12000138625024 bytes, 23437770752 sectors
   Disk /dev/sdd: 10.9 TiB, 12000138625024 bytes, 23437770752 sectors
   Disk /dev/sde: 10.9 TiB, 12000138625024 bytes, 23437770752 sectors
   Disk /dev/sdf: 10.9 TiB, 12000138625024 bytes, 23437770752 sectors
   Disk /dev/sdg: 10.9 TiB, 12000138625024 bytes, 23437770752 sectors
   Disk /dev/sdh: 10.9 TiB, 12000138625024 bytes, 23437770752 sectors
   Disk /dev/sdi: 10.9 TiB, 12000138625024 bytes, 23437770752 sectors
   Disk /dev/sdj: 10.9 TiB, 12000138625024 bytes, 23437770752 sectors

I'd send the output of ceph-volume inventory on Luminous, but I'm
getting  -->: KeyError: 'human_readable_size'.

Please let me know if I can provide any further information.

Mind re-running your ceph-volume command with debug output enabled:
CEPH_VOLUME_DEBUG=true ceph-volume --cluster ceph lvm batch --bluestore ...

Ideally you could also open a bug report here:
https://tracker.ceph.com/projects/ceph-volume/issues/new

Thanks!

Thanks.

-Dave

--
Dave Hall
Binghamton University

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] getting rid of incomplete pg errors

2020-01-28 Thread Hartwig Hauschild
Hi.

before I descend into what happened and why it happened: I'm talking about a
test-cluster so I don't really care about the data in this case.

We've recently started upgrading from luminous to nautilus, and for us that
means we're retiring ceph-disk in favour of ceph-volume with lvm and
dmcrypt.

Our setup is in containers and we've got DBs separated from Data.
When testing our upgrade-path we discovered that running the host on
ubuntu-xenial and the containers on centos-7.7 leads to lvm inside the
containers not using lvmetad because it's too old. That in turn means that
not running `vgscan --cache` on the host before adding a LV to a VG
essentially zeros the metadata for all LVs in that VG.

That happened on two out of three hosts for a bunch of OSDs and those OSDs
are gone. I have no way of getting them back, they've been overwritten
multiple times trying to figure out what went wrong.

So now I have a cluster that's got 16 pgs in 'incomplete', 14 of them with 0
objects, 2 with about 150 objects each.

I have found a couple of howtos that tell me to use ceph-objectstore-tool to
find the pgs on the active osds and I've given that a try, but
ceph-objectstore-tool always tells me it can't find the pg I am looking for.

Can I tell ceph to re-init the pgs? Do I have to delete the pools and
recreate them?

There's no data I can't get back in there, I just don't feel like
scrapping and redeploying the whole cluster.


-- 
Cheers,
Hardy
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Question about erasure code

2020-01-28 Thread Zorg

Hi,

we are planning to use EC

I have 3 questions about it:

1 / What is the advantage of having more machines than (k + m)? We are 
planning to have 11 nodes and use k=8 and m=3. Does having more nodes 
than k+m improve performance? By how many? In what ratio?


2 / What behavior should we expect if we lose 1, 2 or 3 nodes, in terms 
of performance and recovery? How many nodes could we lose before the 
cluster state changes to read-only?


3 / What is the maximum occupancy rate on an EC pool?

Thanks for your help

Zorg



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Question about erasure code

2020-01-28 Thread Janne Johansson
On Tue, 28 Jan 2020 at 17:34, Zorg wrote:

> Hi,
>
> we are planning to use EC
>
> I have  3 questions about it
>
> 1 / what is the advantage of having more machines than (k + m)? We are
> planning to have 11 nodes and use k=8 and m=3. does it improved
> performance to have more node than K+M? of how many ? what ratio?
>

You should always try to have one or two (or many more!) hosts more than the
replication size or the sum of K+M in EC. If you run with exactly K+M hosts,
any surprise means the cluster is degraded. One host falling over means that
ALL PGs are now in degraded mode and most reads will have to reconstruct data
from the remaining chunks. This is bad.

If you ever plan to do maintenance on a host (be it OS upgrades, package
upgrades, the never-ending Intel mitigation patches, whatever) and you don't
have spare hosts, then you are going into degraded mode even for planned
downtimes, as well as for all the unplanned ones.

Having more nodes (regardless of whether you do replication or EC) brings more
networking capacity, more CPU, more RAM for caches, more IOPS in total and
more ways to share the load of normal usage and recovery, not just "getting
more space".


> 2 / what behavior should we expect if we lose 1,2 or 3 nodes: perfs /
> recovery. how many nodes could we loose before the cluster state change
> to read only
>

It depends a bit on the M of course, but EC in Ceph will not be happy with only
K hosts, so K+1 is the minimum for it to work and recover by itself.
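
A concrete sketch for the planned k=8/m=3 pool (profile and pool names made
up): with the default pool min_size of k+1 seen earlier in this digest, losing
3 of the 11 hosts leaves exactly k chunks, so PGs go inactive until a host
returns or min_size is lowered to k.

    ceph osd erasure-code-profile set ec83 k=8 m=3 crush-failure-domain=host
    ceph osd pool create ecpool 256 256 erasure ec83
    ceph osd pool ls detail | grep ecpool   # expect min_size 9 (= k+1) by default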


-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS - objects in default data pool

2020-01-28 Thread Gregory Farnum
On Tue, Jan 28, 2020 at 4:26 PM CASS Philip  wrote:

> I have a query about https://docs.ceph.com/docs/master/cephfs/createfs/:
>
>
>
> “The data pool used to create the file system is the “default” data pool
> and the location for storing all inode backtrace information, used for hard
> link management and disaster recovery. For this reason, all inodes created
> in CephFS have at least one object in the default data pool.”
>
>
>
> This does not match my experience (nautilus servers, nautlius FUSE client
> or Centos 7 kernel client). I have a cephfs with a replicated top-level
> pool and a directory set to use erasure coding with setfattr, though I also
> did the same test using the subvolume commands with the same result.  "Ceph
> df detail" shows no objects used in the top level pool, as shown in
> https://gist.github.com/pcass-epcc/af24081cf014a66809e801f33bcb535b (also
> displayed in-line below)
>

Hmm I think this is tripping over the longstanding issue that omap data is
not reflected in the pool stats (although I would expect it to still show
up as objects, but perhaps the "ceph df" view has a different reporting
chain? Or else I'm confused somehow.)
But anyway...


>
>
> It would be useful if indeed clients didn’t have to write to the top-level
> pool, since that would mean we could give different clients permission only
> to pool-associated subdirectories without giving everyone write access to a
> pool with data structures shared between all users of the filesystem.
>

*Clients* don't need write permission to the default data pool unless you
want them to write files there. The backtraces are maintained by the MDS. :)
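
In practice that means a client's caps can be restricted to the pool it
actually writes file data to, e.g. (a sketch with hypothetical client and path
names, using the pool names from this thread):

    ceph auth get-or-create client.projectA \
        mds 'allow rw path=/projectA' \
        mon 'allow r' \
        osd 'allow rw pool=cephfs.fs1-ec.data'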
-Greg


>
>
> [root@hdr-admon01 ec]# ceph df detail; ceph fs ls; ceph fs status
>
> RAW STORAGE:
>
> CLASS SIZEAVAIL   USEDRAW USED %RAW USED
>
> hdd   3.3 PiB 3.3 PiB  32 TiB   32 TiB  0.95
>
> nvme  2.9 TiB 2.9 TiB 504 MiB  2.5 GiB  0.08
>
> TOTAL 3.3 PiB 3.3 PiB  32 TiB   32 TiB  0.95
>
>
>
> POOLS:
>
> POOL   ID STORED  OBJECTS
> USED%USED MAX AVAIL QUOTA OBJECTS QUOTA BYTES
> DIRTY USED COMPR UNDER COMPR
>
> cephfs.fs1.metadata 5 162 MiB  63 324
> MiB  0.01   1.4 TiB N/A   N/A
> 630 B 0 B
>
> cephfs.fs1-replicated.data  6 0 B   0 0
> B 0   1.0 PiB N/A   N/A
> 00 B 0 B
>
> cephfs.fs1-ec.data  7 8.0 GiB   2.05k  11
> GiB 0   2.4 PiB N/A   N/A
> 2.05k0 B 0 B
>
> name: fs1, metadata pool: cephfs.fs1.metadata, data pools:
> [cephfs.fs1-replicated.data cephfs.fs1-ec.data ]
>
> fs1 - 4 clients
>
> ===
>
> +--+++---+---+---+
>
> | Rank | State  |MDS |Activity   |  dns  |  inos |
>
> +--+++---+---+---+
>
> |  0   | active | hdr-meta02 | Reqs:0 /s |   29  |   16  |
>
> +--+++---+---+---+
>
> ++--+---+---+
>
> |Pool|   type   |  used | avail |
>
> ++--+---+---+
>
> |cephfs.fs1.metadata | metadata |  324M | 1414G |
>
> | cephfs.fs1-replicated.data |   data   |0  | 1063T |
>
> | cephfs.fs1-ec.data |   data   | 11.4G | 2505T |
>
> ++--+---+---+
>
> +-+
>
> | Standby MDS |
>
> +-+
>
> |  hdr-meta01 |
>
> +-+
>
> MDS version: ceph version 14.2.5
> (ad5bd132e1492173c85fda2cc863152730b16a92) nautilus (stable)
>
>
>
> [root@hdr-admon01 ec]# ll /test-fs/ec/
>
> total 12582912
>
> -rw-r--r--. 1 root root 4294967296 Jan 27 22:26 new-file
>
> -rw-r--r--. 2 root root 4294967296 Jan 28 14:06 new-file2
>
> -rw-r--r--. 2 root root 4294967296 Jan 28 14:06
> new-file-same-inode-as-newfile2
>
>
>
> Regards,
> Phil
> _
> *Philip Cass*
>
> *HPC Systems Specialist – Senior Systems Administrator *
>
> *EPCC*
>
>
>
> Advanced Computing Facility
>
> Bush Estate
>
> Penicuik
>
>
>
> *Tel:*+44 (0)131 4457815
>
> *Email:*   p.c...@epcc.ed.ac.uk
>
>
> _
>
>
>
> The University of Edinburgh is a charitable body, registered in Scotland,
> with registration number SC005336.
>
> The information contained in this e-mail (including any attachments) is
> confidential and is intended for the use of the addressee only.  If you
> have received this message in error, please delete it and notify the
> originator immediately.
>
> Please consider the environment before printing this email.
>
>
> 

[ceph-users] Re: CephFS - objects in default data pool

2020-01-28 Thread CASS Philip
Hi Greg,

Thanks – if I understand https://ceph.io/geen-categorie/get-omap-keyvalue-size/ 
correctly, “rados -p cephfs.fs1-replicated.data ls” should show any such 
objects?  It’s also returning blank (and correctly returns a lot for the EC 
pool).

That being said – if it’s only written to by the MDS in any case, my concerns 
are moot.  Do clients need _read_ access to the default pool either?

Regards,
Phil
_
Philip Cass
HPC Systems Specialist – Senior Systems Administrator
EPCC

Advanced Computing Facility
Bush Estate
Penicuik

Tel:+44 (0)131 4457815
Email:   
p.c...@epcc.ed.ac.uk

_

The University of Edinburgh is a charitable body, registered in Scotland, with 
registration number SC005336.
The information contained in this e-mail (including any attachments) is 
confidential and is intended for the use of the addressee only.  If you have 
received this message in error, please delete it and notify the originator 
immediately.
Please consider the environment before printing this email.

From: Gregory Farnum 
Sent: 28 January 2020 17:13
To: CASS Philip 
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] CephFS - objects in default data pool

On Tue, Jan 28, 2020 at 4:26 PM CASS Philip 
mailto:p.c...@epcc.ed.ac.uk>> wrote:
I have a query about https://docs.ceph.com/docs/master/cephfs/createfs/:

“The data pool used to create the file system is the “default” data pool and 
the location for storing all inode backtrace information, used for hard link 
management and disaster recovery. For this reason, all inodes created in CephFS 
have at least one object in the default data pool.”

This does not match my experience (nautilus servers, nautlius FUSE client or 
Centos 7 kernel client). I have a cephfs with a replicated top-level pool and a 
directory set to use erasure coding with setfattr, though I also did the same 
test using the subvolume commands with the same result.  "Ceph df detail" shows 
no objects used in the top level pool, as shown in 
https://gist.github.com/pcass-epcc/af24081cf014a66809e801f33bcb535b (also 
displayed in-line below)

Hmm I think this is tripping over the longstanding issue that omap data is not 
reflected in the pool stats (although I would expect it to still show up as 
objects, but perhaps the "ceph df" view has a different reporting chain? Or 
else I'm confused somehow.)
But anyway...


It would be useful if indeed clients didn’t have to write to the top-level 
pool, since that would mean we could give different clients permission only to 
pool-associated subdirectories without giving everyone write access to a pool 
with data structures shared between all users of the filesystem.

*Clients* don't need write permission to the default data pool unless you want 
them to write files there. The backtraces are maintained by the MDS. :)
-Greg


[root@hdr-admon01 ec]# ceph df detail; ceph fs ls; ceph fs status
RAW STORAGE:
CLASS SIZEAVAIL   USEDRAW USED %RAW USED
hdd   3.3 PiB 3.3 PiB  32 TiB   32 TiB  0.95
nvme  2.9 TiB 2.9 TiB 504 MiB  2.5 GiB  0.08
TOTAL 3.3 PiB 3.3 PiB  32 TiB   32 TiB  0.95

POOLS:
POOL   ID STORED  OBJECTS USED
%USED MAX AVAIL QUOTA OBJECTS QUOTA BYTES DIRTY USED COMPR  
   UNDER COMPR
cephfs.fs1.metadata 5 162 MiB  63 324 MiB  
0.01   1.4 TiB N/A   N/A630 B   
  0 B
cephfs.fs1-replicated.data  6 0 B   0 0 B   
  0   1.0 PiB N/A   N/A 00 B
 0 B
cephfs.fs1-ec.data  7 8.0 GiB   2.05k  11 GiB   
  0   2.4 PiB N/A   N/A 2.05k0 B
 0 B
name: fs1, metadata pool: cephfs.fs1.metadata, data pools: 
[cephfs.fs1-replicated.data cephfs.fs1-ec.data ]
fs1 - 4 clients
===
+--+++---+---+---+
| Rank | State  |MDS |Activity   |  dns  |  inos |
+--+++---+---+---+
|  0   | active | hdr-meta02 | Reqs:0 /s |   29  |   16  |
+--+++---+---+---+
++--+---+---+
|Pool|   type   |  used | avail |
++--+---+---+
|cephfs.fs1.metadata | metadata |  324M | 1414G |
| cephfs.fs1-replicated.data |   data   |0  | 1063T |
| cephfs.fs1-ec.data |   data   | 11.4G | 2505T |
++--+---+---+
+-+
| Standby MDS |
+-

[ceph-users] Re: moving small production cluster to different datacenter

2020-01-28 Thread Reed Dier
I did this, but with the benefit of taking the network with me, just a forklift 
from one datacenter to the next.

Shutdown the clients, then OSDs, then MDS/MON/MGRs, then switches.

Reverse order to bring it back up.

> On Jan 28, 2020, at 4:19 AM, Marc Roos  wrote:
> 
> 
> Say one is forced to move a production cluster (4 nodes) to a different 
> datacenter. What options do I have, other than just turning it off at 
> the old location and on on the new location?
> 
> Maybe buying some extra nodes, and move one node at a time?
> 
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] librados behavior when some OSDs are unreachables

2020-01-28 Thread David DELON
Hi, 

I had a problem with one application (seafile) which uses a CEPH backend with 
librados. 
The corresponding pools are defined with size=3 and each object copy is on a 
different host. 
The cluster health is OK: all the monitors see all the hosts. 

Now, a network problem happens between my RADOS client and a single host. 
Then, when my application/client tries to access an object which is situated on 
the unreachable host (primary for the corresponding PG), 
it does not fail over to another copy/host (and my application crashes later 
because after a while, with many requests, too many files are opened on Linux). 
Is this the normal behavior? My storage is resilient (great!) but not its 
access... 
If, on the host, I stop the OSDs or change their primary affinity to zero, it 
solves the problem, so it seems like librados just checks and trusts the osdmap. 
And doing a tcpdump shows the client tries to access the same OSD without timeout. 

It can be easily reproduced with defining a netfilter rule on a host to drop 
packets coming from the client. 
Note: i am still on Luminous (both on lient and cluster sides). 

Thanks for reading. 

D. 



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS - objects in default data pool

2020-01-28 Thread Gregory Farnum
On Tue, Jan 28, 2020 at 6:55 PM CASS Philip  wrote:

> Hi Greg,
>
>
> Thanks – if I understand
> https://ceph.io/geen-categorie/get-omap-keyvalue-size/ correctly, “rados
> -p cephfs.fs1-replicated.data ls” should show any such objects?  It’s also
> returning blank (and correctly returns a lot for the EC pool).
>
>
>
> That being said – if it’s only written to by the MDS in any case, my
> concerns are moot.  Do clients need _*read*_ access to the default pool
> either?
>

Nope!

>
>
> Regards,
> Phil
> _
> *Philip Cass*
>
> *HPC Systems Specialist – Senior Systems Administrator *
>
> *EPCC*
>
>
>
> Advanced Computing Facility
>
> Bush Estate
>
> Penicuik
>
>
>
> *Tel:*+44 (0)131 4457815
>
> *Email:*   p.c...@epcc.ed.ac.uk
>
>
> _
>
>
>
> The University of Edinburgh is a charitable body, registered in Scotland,
> with registration number SC005336.
>
> The information contained in this e-mail (including any attachments) is
> confidential and is intended for the use of the addressee only.  If you
> have received this message in error, please delete it and notify the
> originator immediately.
>
> Please consider the environment before printing this email.
>
>
>
> *From:* Gregory Farnum 
> *Sent:* 28 January 2020 17:13
> *To:* CASS Philip 
> *Cc:* ceph-users@ceph.io
> *Subject:* Re: [ceph-users] CephFS - objects in default data pool
>
>
>
> On Tue, Jan 28, 2020 at 4:26 PM CASS Philip  wrote:
>
> I have a query about https://docs.ceph.com/docs/master/cephfs/createfs/:
>
>
>
> “The data pool used to create the file system is the “default” data pool
> and the location for storing all inode backtrace information, used for hard
> link management and disaster recovery. For this reason, all inodes created
> in CephFS have at least one object in the default data pool.”
>
>
>
> This does not match my experience (nautilus servers, nautlius FUSE client
> or Centos 7 kernel client). I have a cephfs with a replicated top-level
> pool and a directory set to use erasure coding with setfattr, though I also
> did the same test using the subvolume commands with the same result.  "Ceph
> df detail" shows no objects used in the top level pool, as shown in
> https://gist.github.com/pcass-epcc/af24081cf014a66809e801f33bcb535b (also
> displayed in-line below)
>
>
>
> Hmm I think this is tripping over the longstanding issue that omap data is
> not reflected in the pool stats (although I would expect it to still show
> up as objects, but perhaps the "ceph df" view has a different reporting
> chain? Or else I'm confused somehow.)
>
> But anyway...
>
>
>
>
>
> It would be useful if indeed clients didn’t have to write to the top-level
> pool, since that would mean we could give different clients permission only
> to pool-associated subdirectories without giving everyone write access to a
> pool with data structures shared between all users of the filesystem.
>
>
>
> *Clients* don't need write permission to the default data pool unless you
> want them to write files there. The backtraces are maintained by the MDS. :)
>
> -Greg
>
>
>
>
>
> [root@hdr-admon01 ec]# ceph df detail; ceph fs ls; ceph fs status
>
> RAW STORAGE:
>
> CLASS SIZEAVAIL   USEDRAW USED %RAW USED
>
> hdd   3.3 PiB 3.3 PiB  32 TiB   32 TiB  0.95
>
> nvme  2.9 TiB 2.9 TiB 504 MiB  2.5 GiB  0.08
>
> TOTAL 3.3 PiB 3.3 PiB  32 TiB   32 TiB  0.95
>
>
>
> POOLS:
>
> POOL   ID STORED  OBJECTS
> USED%USED MAX AVAIL QUOTA OBJECTS QUOTA BYTES
> DIRTY USED COMPR UNDER COMPR
>
> cephfs.fs1.metadata 5 162 MiB  63 324
> MiB  0.01   1.4 TiB N/A   N/A
> 630 B 0 B
>
> cephfs.fs1-replicated.data  6 0 B   0 0
> B 0   1.0 PiB N/A   N/A
> 00 B 0 B
>
> cephfs.fs1-ec.data  7 8.0 GiB   2.05k  11
> GiB 0   2.4 PiB N/A   N/A
> 2.05k0 B 0 B
>
> name: fs1, metadata pool: cephfs.fs1.metadata, data pools:
> [cephfs.fs1-replicated.data cephfs.fs1-ec.data ]
>
> fs1 - 4 clients
>
> ===
>
> +--+++---+---+---+
>
> | Rank | State  |MDS |Activity   |  dns  |  inos |
>
> +--+++---+---+---+
>
> |  0   | active | hdr-meta02 | Reqs:0 /s |   29  |   16  |
>
> +--+++---+---+---+
>
> ++--+---+---+
>
> |Pool|   type   |  used | avail |
>
> ++--+---+---+
>
> |cephfs.fs1.metadata | metadata |

[ceph-users] Re: Nautilus 14.2.6 ceph-volume bluestore _read_fsid unparsable uuid

2020-01-28 Thread bauen1

Hi,

I've run into the same issue while testing:

ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus 
(stable)


debian bullseye

Ceph was installed using ceph-ansible on a vm from the repo 
http://download.ceph.com/debian-nautilus


The output of `sudo sh -c 'CEPH_VOLUME_DEBUG=true ceph-volume --cluster 
test lvm batch --bluestore /dev/vdb'` has been attached.


Also worth noting might be that '/var/lib/ceph/osd/test-0/fsid' is empty 
(but I don't know too much about the internals)


- bauen1

On 1/28/20 4:54 PM, Dave Hall wrote:

Jan,

Unfortunately I'm under immense pressure right now to get some form of 
Ceph into production, so it's going to be Luminous for now, or maybe a 
live upgrade to Nautilus without recreating the OSDs (if that's 
possible).


The good news is that in the next couple months I expect to add more 
hardware that should be nearly identical.  I will gladly give it a go 
at that time and see if I can recreate.  (Or, if I manage to 
thoroughly crash my current fledgling cluster, I'll give it another go 
on one node while I'm up all night recovering.)


If you could tell me where to look I'd gladly read some code and see 
if I can find anything that way.  Or if there's any sort of design 
document describing the deep internals I'd be glad to scan it to see 
if I've hit a corner case of some sort.  Actually, I'd be interested 
in reading those documents anyway if I could.


Thanks.

-Dave

Dave Hall

On 1/28/2020 3:05 AM, Jan Fajerski wrote:

On Mon, Jan 27, 2020 at 03:23:55PM -0500, Dave Hall wrote:

All,

I've just spent a significant amount of time unsuccessfully chasing
the _read_fsid unparsable uuid error on Debian 10 / Nautilus 14.2.6.
Since this is a brand new cluster, last night I gave up and moved back
to Debian 9 / Luminous 12.2.11.  In both cases I'm using the packages

from Debian Backports with ceph-ansible as my deployment tool.

Note that above I said 'the _read_fsid unparsable uuid' error. I've
searched around a bit and found some previously reported issues, but I
did not see any conclusive resolutions.

I would like to get to Nautilus as quickly as possible, so I'd gladly
provide additional information to help track down the cause of this
symptom.  I can confirm that, looking at the ceph-volume.log on the
OSD host I see no difference between the ceph-volume lvm batch command
generated by the ceph-ansible versions associated with these two Ceph
releases:

   ceph-volume --cluster ceph lvm batch --bluestore --yes
   --block-db-size 133358734540 /dev/sdc /dev/sdd /dev/sde /dev/sdf
   /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/nvme0n1

Note that I'm using --block-db-size to divide my NVMe into 12 segments
as I have 4 empty drive bays on my OSD servers that I may eventually
be able to fill.

My OSD hardware is:

   Disk /dev/nvme0n1: 1.5 TiB, 1600321314816 bytes, 3125627568 sectors
   Disk /dev/sdc: 10.9 TiB, 12000138625024 bytes, 23437770752 sectors
   Disk /dev/sdd: 10.9 TiB, 12000138625024 bytes, 23437770752 sectors
   Disk /dev/sde: 10.9 TiB, 12000138625024 bytes, 23437770752 sectors
   Disk /dev/sdf: 10.9 TiB, 12000138625024 bytes, 23437770752 sectors
   Disk /dev/sdg: 10.9 TiB, 12000138625024 bytes, 23437770752 sectors
   Disk /dev/sdh: 10.9 TiB, 12000138625024 bytes, 23437770752 sectors
   Disk /dev/sdi: 10.9 TiB, 12000138625024 bytes, 23437770752 sectors
   Disk /dev/sdj: 10.9 TiB, 12000138625024 bytes, 23437770752 sectors

I'd send the output of ceph-volume inventory on Luminous, but I'm
getting  -->: KeyError: 'human_readable_size'.

Please let me know if I can provide any further information.

Mind re-running your ceph-volume command with debug output enabled:
CEPH_VOLUME_DEBUG=true ceph-volume --cluster ceph lvm batch 
--bluestore ...


Ideally you could also open a bug report here:
https://tracker.ceph.com/projects/ceph-volume/issues/new

Thanks!

Thanks.

-Dave

--
Dave Hall
Binghamton University

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
sysadmin@ceph-test:~$ sudo setenforce 0
sysadmin@ceph-test:~$ sudo sh -c 'CEPH_VOLUME_DEBUG=true ceph-volume --cluster 
test lvm batch --bluestore /dev/vdb'

Total OSDs: 1

  Type            Path          LV Size      % of device
  [data]          /dev/vdb      63.00 GB     100.0%
--> The above OSDs would be created if the operation continues
--> do you want to proceed? (yes/no) yes
Running command: /usr/sbin/vgcreate -s 1G --force --yes 
ceph-1cc81d7c-a153-462a-8080-ec3d217c7180 /dev/vdb
 stdout: Physical volume "/dev/vdb" successfully created.
 stdout: Volum

[ceph-users] unable to obtain rotating service keys

2020-01-28 Thread Raymond Clotfelter
I have a server with 12 OSDs on it. Five of them are unable to start, and give 
the following error message in their logs:

2020-01-28 13:00:41.760 7f61fb490c80  0 monclient: wait_auth_rotating timed out 
after 30
2020-01-28 13:00:41.760 7f61fb490c80 -1 osd.178 411005 unable to obtain 
rotating service keys; retrying

These OSDs were up and running until they just died on me. I tried to restart 
them and they failed to come up. I rebooted the node and they did not recover. 
All 5 died within a few hours, and all 5 were down by the time I started poking 
them. I previously had this happen with 2 other OSDs, one each on 2 
servers each with 12 OSDs. I ended up just purging and recreating those OSDs. I 
would really like to find a solution to fix this problem that does not involve 
purging the OSDs.

I have tried stopping and starting all monitors and managers, one at a time, 
and all at the same time. Additionally, all servers in the cluster have been 
restarted over the past couple of days for various other reasons.

I am on Ceph 14.2.6, Debian buster and am using the Debian packages. All of my 
servers are kept in time sync via ntp, and it has been verified multiple times 
that everything remains in sync.

I have googled the error message and tried all of the solutions offered from 
that, but nothing makes any difference.

I would appreciate any constructive advice.

Thanks.

-- ray

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: moving small production cluster to different datacenter

2020-01-28 Thread Wido den Hollander


On 1/28/20 6:58 PM, Anthony D'Atri wrote:
> 
> 
>> I did this ones. This cluster was running IPv6-only (still is) and thus
>> I had the flexibility of new IPs.
> 
> Dumb question — how was IPv6 a factor in that flexibility?  Was it just that 
> you had unused addresses within an existing block?
> 

There are no dumb questions :-)

Usually Ceph is put into RFC 1918 IPv4 space (10.x, 172.x) and those addresses
are more difficult to route between networks.

IPv6 address space is globally routed in most networks thus making this
easier.

As long as the hosts can talk IP(4/6) with each other you can perform
such a migration.

Wido
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: librados behavior when some OSDs are unreachables

2020-01-28 Thread Wido den Hollander



On 1/28/20 7:03 PM, David DELON wrote:
> Hi, 
> 
> i had a problem with one application (seafile) which uses CEPH backend with 
> librados. 
> The corresponding pools are defined with size=3 and each object copy is on a 
> different host. 
> The cluster health is OK: all the monitors see all the hosts. 
> 
> Now, a network problem just happens between my RADOS client and a single 
> host. 
> Then, when my application/client tries to access an object which is situed on 
> the unreachable host (primary for the corresponding PG), 
> it does not failover to another copy/host (and my application crashes later 
> because after a while, with many requests, too many files are opened on 
> Linux). 
> Is it the normal behavior? My storage is resilient (great!) but not its 
> access...

Yes. Reads and Writes for a PG are always served by the primary OSD.
That's how Ceph is designed.


> If on the host, i stop the OSDs or change the affinity to zero, it solves, 
> so it seems like the librados just check and trust the osdmap 
> And doing a tcpdump show the client tries to access the same OSD without 
> timeout. 

There is a network issue and that's the root cause. Ceph can't fix that
for you. You will need to make sure the network is functioning.

> 
> It can be easily reproduced with defining a netfilter rule on a host to drop 
> packets coming from the client. 
> Note: i am still on Luminous (both on lient and cluster sides). 

Again, this is exactly how Ceph works :-)

The primary OSD serves reads and writes. Only when it is marked as down
is the client informed via an osdmap update, and then it goes to
another OSD.
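
A quick way to see which OSD would serve a given object (a sketch with
hypothetical pool and object names):

    # prints the PG plus the up/acting sets; the OSD after "p" is the primary
    ceph osd map seafile-data some-object-name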

Wido

> 
> Thanks for reading. 
> 
> D. 
> 
> 
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: No Activity?

2020-01-28 Thread Marc Roos


https://www.mail-archive.com/ceph-users@ceph.io/

https://www.mail-archive.com/ceph-users@lists.ceph.com/
 

-Original Message-
Sent: 28 January 2020 16:32
To: ceph-users@ceph.io
Subject: [ceph-users] No Activity?

All;

I haven't had a single email come in from the ceph-users list at ceph.io 
since 01/22/2020.

Is there just that little traffic right now?

Thank you,

Dominic L. Hilsbos, MBA
Director - Information Technology
Perform Air International Inc.
dhils...@performair.com
www.PerformAir.com

___
ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an 
email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] High CPU usage by ceph-mgr in 14.2.6

2020-01-28 Thread jbardgett
After upgrading one of our clusters from Luminous 12.2.12 to Nautilus 14.2.6, I 
am seeing 100% CPU usage by a single ceph-mgr thread (found using 'top -H').  
The way we found this was that Prometheus was unable to report certain 
pieces of data, specifically OSD usage and OSD apply and commit latency, which 
are all similar to issues people were having in previous versions of Nautilus.

Bryan Stillwell reported this previously on a separate cluster, 14.2.5, we have 
here:
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/VW3GNVJGOOWA5RMUULRMZCQL5OEY44N7/#6QNDSLMHDVN7AZ3T6OPGU3YOJYAVUAEY

That issue was resolved with the upgrade to 14.2.6.

We are seeing a similar issue on this other cluster with a couple differences.

This cluster has 1900+ OSD in it, the previous one had 300+
The top user is libceph-common, instead of mmap 

4.86%  libceph-common.so.0   [.] EventCenter::create_time_event
2.78%  [kernel] [k] nmi
2.64%  libstdc++.so.6.0.19   [.] __dynamic_cast

On all our other clusters that have been upgraded to 14.2.6 we are not 
experiencing this issue, the next largest being 800+ OSD.

We feel this is related to the size of the cluster, similarly to the previous 
report.

Anyone else experiencing this and/or can provide some direction on how to go 
about resolving this?

Thanks,
Joe
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph MDS specific perf info disappeared in Nautilus

2020-01-28 Thread Stefan Kooman
Quoting Stefan Kooman (ste...@bit.nl):
> Hi,
> 
> The command "ceph daemon mds.$mds perf dump" does not give the
> collection with MDS specific data anymore. In Mimic I get the following
> MDS specific collections:
> 
> - mds
> - mds_cache
> - mds_log
> - mds_mem
> - mds_server
> - mds_sessions
> 
> But those are not available in Nautilus anymore (14.2.4). Also not
> listed in a "perf schema".
> 
> Where did these metrics go?

Nobody? It would be a shame to lose these helpful metrics when upgrading
to Nautilus (as nothing except the prometheus metrics collector gets (all)
useful metrics and you need to resort to "local" daemon metrics for the
time being ...).

Thanks,

Stefan


-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph MDS specific perf info disappeared in Nautilus

2020-01-28 Thread Dan van der Ster
On Wed, Jan 29, 2020 at 7:33 AM Stefan Kooman  wrote:
>
> Quoting Stefan Kooman (ste...@bit.nl):
> > Hi,
> >
> > The command "ceph daemon mds.$mds perf dump" does not give the
> > collection with MDS specific data anymore. In Mimic I get the following
> > MDS specific collections:
> >
> > - mds
> > - mds_cache
> > - mds_log
> > - mds_mem
> > - mds_server
> > - mds_sessions
> >
> > But those are not available in Nautilus anymore (14.2.4). Also not
> > listed in a "perf schema".
> >
> > Where did these metrics go?
>
> Nobody? It would a shame to lose these helpfull metrics when upgrading
> to Nautilus (as none except prometheus metric collectors get (all)
> useful metrics and you need to reside to "local" daemon metrics for the
> time being ...).

wfm:

# ceph daemon mds.`hostname -s` version
{"version":"14.2.6","release":"nautilus","release_type":"stable"}
# ceph daemon mds.`hostname -s` perf dump | jq 'keys[]'
"AsyncMessenger::Worker-0"
"AsyncMessenger::Worker-1"
"AsyncMessenger::Worker-2"
"cct"
"finisher-PurgeQueue"
"mds"
"mds_cache"
"mds_log"
"mds_mem"
"mds_server"
"mds_sessions"
"mempool"
"objecter"
"purge_queue"
"throttle-msgr_dispatch_throttler-mds"
"throttle-objecter_bytes"
"throttle-objecter_ops"
"throttle-write_buf_throttle"
"throttle-write_buf_throttle-0x5575a16c5dc0"

Maybe you're checking a standby MDS ?
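
A quick way to confirm which MDS daemons are active versus standby (sketch):

    ceph mds stat      # e.g. "fs1:1 {0=mds-a=up:active} 1 up:standby"
    ceph fs status     # more detail, as shown earlier in this digest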

-- Dan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io