[ceph-users] Re: some ceph general questions about the design

2020-04-21 Thread Anthony D'Atri


> 
> 1. Should I use a RAID controller and create, for example, a RAID 5 with all disks 
> on each OSD server? Or should I pass through all disks to the Ceph OSDs?
> 
> If your OSD servers have HDDs, buy a good RAID Controller with a 
> battery-backed write cache and configure it using multiple RAID-0 volumes (1 
> physical disk per volume). That way, reads and writes will be accelerated by 
> the cache on the HBA.

I’ve lived this scenario and hated it.  Multiple firmware and manufacturing 
issues; batteries/supercaps that can fail and need to be monitored; bugs causing 
staged data to be lost before it was written to disk; another bug that required 
replacing the card if there was preserved cache for a failed drive, because it 
would refuse to boot; difficulties in drive monitoring; an HBA monitoring utility 
that would lock the HBA or peg the CPU; the list goes on.

For the additional cost of RoC, cache RAM, supercap to (fingers crossed) 
protect the cache, all the additional monitoring and hands work … you might 
find that SATA SSDs on a JBOD HBA are no more expensive.

> 3. If I have a 3-node OSD cluster, do I need 5 physical MONs?
> No, 3 MONs are enough.

If you have good hands and spares.  If your cluster is on a different continent 
and colo hands can’t find their own butts …..  it’s nice to survive a double 
failure.
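
For the quorum arithmetic behind that: 3 MONs keep quorum with one monitor down 
(2 of 3) but not with two down, while 5 MONs still have quorum after a double 
failure (3 of 5). A quick sketch for checking where you stand (plain ceph CLI, 
nothing cluster-specific assumed):

  # Monitor count and current quorum membership:
  ceph mon stat

  # Full quorum detail, including the current leader:
  ceph quorum_status --format json-pretty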

ymmv


[ceph-users] Re: RGW and the orphans

2020-04-21 Thread Janne Johansson
On Tue, 21 Apr 2020 at 07:29, Eric Ivancich wrote:

> Please be certain to read the associated docs in both:
>
> doc/radosgw/orphans.rst
> doc/man/8/rgw-orphan-list.rst
>
> so you understand the limitations and potential pitfalls. Generally this
> tool will be a precursor to a large delete job, so understanding what’s
> going on is important.
> I look forward to your report! And please feel free to post additional
> questions in this forum.
>
>
Where are those?
https://github.com/ceph/ceph/tree/master/doc/man/8
https://github.com/ceph/ceph/tree/master/doc/radosgw
don't seem to contain them in master, nor in the nautilus or octopus branches.

This whole issue feels weird: rgw (or its users) produces dead fragments of
multiparts, orphans and whatnot that need cleaning up sooner or later, and
the info we get is that the old cleaner isn't meant to be used, it hasn't
worked for a long while, there is no fixed version, and perhaps there is a
script somewhere with caveats. This (slightly frustrated) issue is of
course on top of
"bi trim"
"bilog trim"
"mdlog trim"
"usage trim"

"datalog trim"

"sync error trim"

"gc process"

"reshard stale-instances rm"



that we rgw admins are supposed to know when to run, how often, what their
quirks are and so on.
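
For anyone keeping these in one place, a rough sketch of what a periodic RGW
housekeeping pass might look like (which steps are safe, and how often to run
them, depends entirely on your setup; the usage-trim dates are placeholders):

  # Trigger a garbage-collection pass for deleted/aborted object data:
  radosgw-admin gc process

  # List resharding leftovers, then remove them:
  radosgw-admin reshard stale-instances list
  radosgw-admin reshard stale-instances rm

  # Trim the usage log for a given window (placeholder dates):
  radosgw-admin usage trim --start-date=2020-01-01 --end-date=2020-03-31

  # The bilog/mdlog/datalog/sync-error trims additionally take per-bucket or
  # per-shard arguments; check radosgw-admin --help for your release.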


'Docs' for rgw means that "datalog trim" --help says "trims the datalog", and
the long version on the web would be "this operation trims the datalog" or
something else that doesn't add anything more.




-- 
"Grumpy cat was an optimist"


[ceph-users] Re: Nautilus cluster damaged + crashing OSDs

2020-04-21 Thread Paul Emmerich
On Tue, Apr 21, 2020 at 3:20 AM Brad Hubbard  wrote:
>
> Wait for recovery to finish so you know whether any data from the down
> OSDs is required. If not just reprovision them.

Recovery will not finish from this state as several PGs are down and/or stale.


Paul

>
> If data is required from the down OSDs you will need to run a query on
> the pg(s) to find out what OSDs have the required copies of the
> pg/object required. you can then export the pg from the down osd using
> the ceph-objectstore-tool, back it up, then import it back into the
> cluster.
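
For reference, the quoted procedure in rough commands (OSD path and PG id are
placeholders, and the OSD daemon must be stopped before pointing
ceph-objectstore-tool at it):

  # Which OSDs hold, or held, copies of a problem PG:
  ceph pg 2.7f query

  # On the host with the down OSD: export the PG shard as a backup.
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-42 \
      --op export --pgid 2.7f --file /root/pg-2.7f.export

  # Later, if that copy is needed, import it into an OSD (again stopped):
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-42 \
      --op import --file /root/pg-2.7f.export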
>
> On Tue, Apr 21, 2020 at 1:05 AM Robert Sander
>  wrote:
> >
> > Hi,
> >
> > one of our customers had his Ceph cluster crashed due to a power or network 
> > outage (they still try to figure out what happened).
> >
> > The cluster is very unhealthy but recovering:
> >
> > # ceph -s
> >   cluster:
> > id: 1c95ca5d-948b-4113-9246-14761cb9a82a
> > health: HEALTH_ERR
> > 1 filesystem is degraded
> > 1 mds daemon damaged
> > 1 osds down
> > 1 pools have many more objects per pg than average
> > 1/115117480 objects unfound (0.000%)
> > Reduced data availability: 71 pgs inactive, 53 pgs down, 18 pgs 
> > peering, 27 pgs stale
> > Possible data damage: 1 pg recovery_unfound
> > Degraded data redundancy: 7303464/230234960 objects degraded 
> > (3.172%), 693 pgs degraded, 945 pgs undersized
> > 14 daemons have recently crashed
> >
> >   services:
> > mon: 3 daemons, quorum maslxlabstore01,maslxlabstore02,maslxlabstore04 
> > (age 64m)
> > mgr: maslxlabstore01(active, since 69m), standbys: maslxlabstore03, 
> > maslxlabstore02, maslxlabstore04
> > mds: cephfs:2/3 
> > {0=maslxlabstore03=up:resolve,1=maslxlabstore01=up:resolve} 2 up:standby, 1 
> > damaged
> > osd: 140 osds: 130 up (since 4m), 131 in (since 4m); 847 remapped pgs
> > rgw: 4 daemons active (maslxlabstore01.rgw0, maslxlabstore02.rgw0, 
> > maslxlabstore03.rgw0, maslxlabstore04.rgw0)
> >
> >   data:
> > pools:   6 pools, 8328 pgs
> > objects: 115.12M objects, 218 TiB
> > usage:   425 TiB used, 290 TiB / 715 TiB avail
> > pgs: 0.853% pgs not active
> >  7303464/230234960 objects degraded (3.172%)
> >  13486/230234960 objects misplaced (0.006%)
> >  1/115117480 objects unfound (0.000%)
> >  7311 active+clean
> >  338  active+undersized+degraded+remapped+backfill_wait
> >  255  active+undersized+degraded+remapped+backfilling
> >  215  active+undersized+remapped+backfilling
> >  99   active+undersized+degraded
> >  44   down
> >  37   active+undersized+remapped+backfill_wait
> >  13   stale+peering
> >  9stale+down
> >  5stale+remapped+peering
> >  1active+recovery_unfound+undersized+degraded+remapped
> >  1active+clean+remapped
> >
> >   io:
> > client:   168 B/s rd, 0 B/s wr, 0 op/s rd, 0 op/s wr
> > recovery: 1.9 GiB/s, 15 keys/s, 948 objects/s
> >
> >
> > The MDS cluster is unable to start because one of them is damaged.
> >
> > 10 of the OSDs do not start. They crash very early in the boot process:
> >
> > 2020-04-20 16:26:14.935 7f818ec8cc00  0 set uid:gid to 64045:64045 
> > (ceph:ceph)
> > 2020-04-20 16:26:14.935 7f818ec8cc00  0 ceph version 14.2.9 
> > (581f22da52345dba46ee232b73b990f06029a2a0) nautilus (stable), process 
> > ceph-osd, pid 69463
> > 2020-04-20 16:26:14.935 7f818ec8cc00  0 pidfile_write: ignore empty 
> > --pid-file
> > 2020-04-20 16:26:15.503 7f818ec8cc00  0 starting osd.42 osd_data 
> > /var/lib/ceph/osd/ceph-42 /var/lib/ceph/osd/ceph-42/journal
> > 2020-04-20 16:26:15.523 7f818ec8cc00  0 load: jerasure load: lrc load: isa
> > 2020-04-20 16:26:16.339 7f818ec8cc00  0  set rocksdb option 
> > compaction_readahead_size = 2MB
> > 2020-04-20 16:26:16.339 7f818ec8cc00  0  set rocksdb option 
> > compaction_style = kCompactionStyleLevel
> > 2020-04-20 16:26:16.339 7f818ec8cc00  0  set rocksdb option 
> > compaction_threads = 32
> > 2020-04-20 16:26:16.339 7f818ec8cc00  0  set rocksdb option compression = 
> > kNoCompression
> > 2020-04-20 16:26:16.339 7f818ec8cc00  0  set rocksdb option flusher_threads 
> > = 8
> > 2020-04-20 16:26:16.339 7f818ec8cc00  0  set rocksdb option 
> > level0_file_num_compaction_trigger = 8
> > 2020-04-20 16:26:16.339 7f818ec8cc00  0  set rocksdb option 
> > level0_slowdown_writes_trigger = 32
> > 2020-04-20 16:26:16.339 7f818ec8cc00  0  set rocksdb option 
> > level0_stop_writes_trigger = 64
> > 2020-04-20 16:26:16.339 7f818ec8cc00  0  set rocksdb option 
> > max_background_compactions = 31
> > 2020-04-20 16:26:16.339 7f818ec8cc00  0  set rocksdb option 
> > max_bytes_for_level_base = 536870912
> > 2020-04-20 16:26:16.339 7f818ec8cc00  0  set rocksdb option 
> > max_bytes_for_level_multiplier = 8
> > 2

[ceph-users] Re: RGW and the orphans

2020-04-21 Thread Katarzyna Myrek
Hi

I was looking into running the tool. The question is: do I need to
compile all of Ceph, or is radosgw-admin available precompiled for
download? A nightly build or something?

Kind regards / Pozdrawiam,
Katarzyna Myrek


On Tue, 21 Apr 2020 at 09:57, Janne Johansson wrote:
>
> Den tis 21 apr. 2020 kl 07:29 skrev Eric Ivancich :
>>
>> Please be certain to read the associated docs in both:
>>
>> doc/radosgw/orphans.rst
>> doc/man/8/rgw-orphan-list.rst
>>
>> so you understand the limitations and potential pitfalls. Generally this 
>> tool will be a precursor to a large delete job, so understanding what’s 
>> going on is important.
>> I look forward to your report! And please feel free to post additional 
>> questions in this forum.
>>
>
> Where are those?
> https://github.com/ceph/ceph/tree/master/doc/man/8
> https://github.com/ceph/ceph/tree/master/doc/radosgw
> don't seem to contain them in master. Nor in nautilus branch or octopus.
>
> This whole issue feels weird, rgw (or its users) produces dead fragments of 
> mulitparts, orphans and whatnot that needs cleaning up sooner or later and 
> the info we get is that the old cleaner isn't meant to be used, it hasn't 
> worked for a long while, there is no fixed version, perhaps there is a script 
> somewhere with caveats. This (slightly frustrated) issue is of course on top 
> of
> "bi trim"
> "bilog trim"
> "mdlog trim"
> "usage trim"
>
> "datalog trim"
>
> "sync error trim"
>
> "gc process"
>
> "reshard stale-instances rm"
>
>
>
> that we rgw admins are supposed to know when to run, how often, what their 
> quirks are and so on.
>
>
> 'Docs' for rgw means "datalog trim" --help says "trims the datalog", and the 
> long version on the web would be "this operation trims the datalog" or 
> something that doesn't add anything more.
>
>
>
>
> --
>
> "Grumpy cat was an optimist"
>


[ceph-users] Re: Nautilus cluster damaged + crashing OSDs

2020-04-21 Thread Marc Roos


I had a test data cephfs pool with 1x replication that also left me with 1 
stale pg. I have no idea how to resolve this. I already marked the 
osd as lost. Do I need to manually 'unconfigure' this cephfs data pool? 
Or can I 'reinitialize' it?

 

-Original Message-
To: Brad Hubbard
Cc: ceph-users
Subject: [ceph-users] Re: Nautilus cluster damaged + crashing OSDs

On Tue, Apr 21, 2020 at 3:20 AM Brad Hubbard  
wrote:
>
> Wait for recovery to finish so you know whether any data from the down 

> OSDs is required. If not just reprovision them.

Recovery will not finish from this state as several PGs are down and/or 
stale.


Paul

>
> If data is required from the down OSDs you will need to run a query on 

> the pg(s) to find out what OSDs have the required copies of the 
> pg/object required. you can then export the pg from the down osd using 

> the ceph-objectstore-tool, back it up, then import it back into the 
> cluster.
>
> On Tue, Apr 21, 2020 at 1:05 AM Robert Sander 
>  wrote:
> >
> > Hi,
> >
> > one of our customers had his Ceph cluster crashed due to a power or 
network outage (they still try to figure out what happened).
> >
> > The cluster is very unhealthy but recovering:
> >
> > # ceph -s
> >   cluster:
> > id: 1c95ca5d-948b-4113-9246-14761cb9a82a
> > health: HEALTH_ERR
> > 1 filesystem is degraded
> > 1 mds daemon damaged
> > 1 osds down
> > 1 pools have many more objects per pg than average
> > 1/115117480 objects unfound (0.000%)
> > Reduced data availability: 71 pgs inactive, 53 pgs down, 
18 pgs peering, 27 pgs stale
> > Possible data damage: 1 pg recovery_unfound
> > Degraded data redundancy: 7303464/230234960 objects 
degraded (3.172%), 693 pgs degraded, 945 pgs undersized
> > 14 daemons have recently crashed
> >
> >   services:
> > mon: 3 daemons, quorum 
maslxlabstore01,maslxlabstore02,maslxlabstore04 (age 64m)
> > mgr: maslxlabstore01(active, since 69m), standbys: 
maslxlabstore03, maslxlabstore02, maslxlabstore04
> > mds: cephfs:2/3 
{0=maslxlabstore03=up:resolve,1=maslxlabstore01=up:resolve} 2 
up:standby, 1 damaged
> > osd: 140 osds: 130 up (since 4m), 131 in (since 4m); 847 
remapped pgs
> > rgw: 4 daemons active (maslxlabstore01.rgw0, 
> > maslxlabstore02.rgw0, maslxlabstore03.rgw0, maslxlabstore04.rgw0)
> >
> >   data:
> > pools:   6 pools, 8328 pgs
> > objects: 115.12M objects, 218 TiB
> > usage:   425 TiB used, 290 TiB / 715 TiB avail
> > pgs: 0.853% pgs not active
> >  7303464/230234960 objects degraded (3.172%)
> >  13486/230234960 objects misplaced (0.006%)
> >  1/115117480 objects unfound (0.000%)
> >  7311 active+clean
> >  338  active+undersized+degraded+remapped+backfill_wait
> >  255  active+undersized+degraded+remapped+backfilling
> >  215  active+undersized+remapped+backfilling
> >  99   active+undersized+degraded
> >  44   down
> >  37   active+undersized+remapped+backfill_wait
> >  13   stale+peering
> >  9stale+down
> >  5stale+remapped+peering
> >  1
active+recovery_unfound+undersized+degraded+remapped
> >  1active+clean+remapped
> >
> >   io:
> > client:   168 B/s rd, 0 B/s wr, 0 op/s rd, 0 op/s wr
> > recovery: 1.9 GiB/s, 15 keys/s, 948 objects/s
> >
> >
> > The MDS cluster is unable to start because one of them is damaged.
> >
> > 10 of the OSDs do not start. They crash very early in the boot 
process:
> >
> > 2020-04-20 16:26:14.935 7f818ec8cc00  0 set uid:gid to 64045:64045 
> > (ceph:ceph) 2020-04-20 16:26:14.935 7f818ec8cc00  0 ceph version 
> > 14.2.9 (581f22da52345dba46ee232b73b990f06029a2a0) nautilus (stable), 

> > process ceph-osd, pid 69463 2020-04-20 16:26:14.935 7f818ec8cc00  0 
> > pidfile_write: ignore empty --pid-file 2020-04-20 16:26:15.503 
> > 7f818ec8cc00  0 starting osd.42 osd_data /var/lib/ceph/osd/ceph-42 
> > /var/lib/ceph/osd/ceph-42/journal 2020-04-20 16:26:15.523 
> > 7f818ec8cc00  0 load: jerasure load: lrc load: isa 2020-04-20 
> > 16:26:16.339 7f818ec8cc00  0  set rocksdb option 
> > compaction_readahead_size = 2MB 2020-04-20 16:26:16.339 7f818ec8cc00 
 
> > 0  set rocksdb option compaction_style = kCompactionStyleLevel 
> > 2020-04-20 16:26:16.339 7f818ec8cc00  0  set rocksdb option 
> > compaction_threads = 32 2020-04-20 16:26:16.339 7f818ec8cc00  0  set 

> > rocksdb option compression = kNoCompression 2020-04-20 16:26:16.339 
> > 7f818ec8cc00  0  set rocksdb option flusher_threads = 8 2020-04-20 
> > 16:26:16.339 7f818ec8cc00  0  set rocksdb option 
> > level0_file_num_compaction_trigger = 8 2020-04-20 16:26:16.339 
> > 7f818ec8cc00  0  set rocksdb option level0_slowdown_writes_trigger = 

> > 32 2020-04-20 16:26:16.339 7f818ec8cc00  0  set rocksdb option 
> 

[ceph-users] Re: Nautilus cluster damaged + crashing OSDs

2020-04-21 Thread Robert Sander
Hi,

On 21.04.20 10:33, Paul Emmerich wrote:
> On Tue, Apr 21, 2020 at 3:20 AM Brad Hubbard  wrote:
>>
>> Wait for recovery to finish so you know whether any data from the down
>> OSDs is required. If not just reprovision them.
> 
> Recovery will not finish from this state as several PGs are down and/or stale.
> 

Thanks for your input so far.

It looks like this issue: https://tracker.ceph.com/issues/36337
We will try to use the linked Python script to repair the OSD.
ceph-bluestore-tool repair did not find anything.

Regards
-- 
Robert Sander
Heinlein Support GmbH
Schwedter Str. 8/9b, 10119 Berlin

https://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Amtsgericht Berlin-Charlottenburg - HRB 93818 B
Geschäftsführer: Peer Heinlein - Sitz: Berlin





[ceph-users] Re: missing amqp-exchange on bucket-notification with AMQP endpoint

2020-04-21 Thread Yuval Lifshitz
Hi Andreas,
The message format you tried to use is the standard one (the one being
emitted from boto3, or any other AWS SDK [1]).
It passes the arguments using 'x-www-form-urlencoded'. For example:

POST / HTTP/1.1
Host: localhost:8000
Accept-Encoding: identity
Date: Tue, 21 Apr 2020 08:52:35 GMT
Content-Length: 293
Content-Type: application/x-www-form-urlencoded; charset=utf-8
Authorization: AWS KOC0EIWUFANCC3FX:8PunIZ4F36uK2c+3AKwhaKXgK84=
User-Agent: Boto3/1.9.225 Python/2.7.17 Linux/5.5.13-200.fc31.x86_64 Botocore/1.15.28

Name=ajmmvc-1_topic_1&
Attributes.entry.2.key=amqp-exchange&
Attributes.entry.1.key=amqp-ack-level&
Attributes.entry.2.value=amqp.direct&
Version=2010-03-31&
Attributes.entry.3.value=amqp%3A%2F%2F127.0.0.1%3A7001&
Attributes.entry.1.value=none&
Action=CreateTopic&
Attributes.entry.3.key=push-endpoint

Note that the arguments are passed inside the message body (no '?' in the
URL), and that "Attributes" entries are used for all the non-standard parameters
we added on top of the standard AWS topic creation command.

The format that worked for you is a non-standard one that we support, as
documented for pubsub [2], which uses regular URL-encoded parameters.
Feel free to use either, but I would recommend the standard one.
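
For example, roughly the same standard CreateTopic call via the AWS CLI
(endpoint, credentials and the amqp URL are placeholders, and this assumes your
CLI version passes the attribute map through unchanged, which is worth
verifying):

  # Assumes the RGW user's access/secret keys and a region are configured for
  # the aws CLI (e.g. AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION).
  aws --endpoint-url https://ceph.example.com sns create-topic \
      --name testtopic \
      --attributes push-endpoint=amqp://user:password@rabbitmq.example.com:5672,amqp-exchange=amqp.direct,amqp-ack-level=none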

Anyway, thanks for pointing out this confusion; we will clarify that in the doc,
and also fix the 'push-endpoint' part.

Yuval

[1] https://docs.aws.amazon.com/sns/latest/api/API_CreateTopic.html
[2] https://docs.ceph.com/docs/master/radosgw/pubsub-module/#create-a-topic


On Mon, Apr 20, 2020 at 8:05 PM Andreas Unterkircher 
wrote:

> I've tried to debug this a bit.
>
> >  
> > amqp://
> rabbitmquser:rabbitmqp...@rabbitmq.example.com:5672
> >
> Attributes.entry.1.key=amqp-exchange&Attributes.entry.1.value=amqp.direct&push-endpoint=amqp://
> rabbitmquser:rabbitmqp...@rabbitmq.example.com:5672
> >  testtopic
> >  
>
> For the above I was using the following request to create the topic -
> similar as it is described here [1]:
>
>
> https://ceph.example.com/?Action=CreateTopic&Name=testtopic&Attributes.entry.1.key=amqp-exchange&Attributes.entry.1.value=amqp.direct&push-endpoint=amqp://rabbitmquser:rabbitmqp...@rabbitmq.example.com:5672
>
> (of course endpoint then URL-encoded)
>
> It seems to me that RGWHTTPArgs::parse() is not translating the
> "Attributes.entry.1..." strings into keys & values in its map.
>
> This are the keys & values that can now be found in the map:
>
>
> Found name:  Attributes.entry.1.key
> Found value: amqp-exchange
> Found name:  Attributes.entry.1.value
> Found value: amqp.direct
> Found name:  push-endpoint
> Found value: amqp://rabbitmquser:rabbitmqp...@rabbitmq.example.com:5672
>
> If I simply change the request to:
>
>
> https://ceph.example.com/?Action=CreateTopic&Name=testtopic&amqp-exchange=amqp.direct&push-endpoint=amqp://rabbitmquser:rabbitmqp...@rabbitmq.example.com:5672/foobar
>
> -> at voila, the entries in the map are correct
>
>
> Found name:  amqp-exchange
> Found value: amqp.direct
> Found name:  push-endpoint
> Found value: amqp://rabbitmquser:rabbitmqp...@rabbitmq.example.com:5672
>
> And then the bucket-notification works like it should.
>
> But I don't think the documentation is wrong, or is it?
>
> Cheers,
> Andreas
>
>
> [1]
> https://docs.ceph.com/docs/master/radosgw/notifications/#create-a-topic
>
>
>
> [2] Index: ceph-15.2.1/src/rgw/rgw_common.cc
> ===
> --- ceph-15.2.1.orig/src/rgw/rgw_common.cc
> +++ ceph-15.2.1/src/rgw/rgw_common.cc
> @@ -810,6 +810,8 @@ int RGWHTTPArgs::parse()
> string& name = nv.get_name();
> string& val = nv.get_val();
>
> +  cout << "Found name:  " << name << std::endl;
> +  cout << "Found value: " << val << std::endl;
> append(name, val);
>   }


[ceph-users] Re: Nautilus cluster damaged + crashing OSDs

2020-04-21 Thread Brad Hubbard
On Tue, Apr 21, 2020 at 6:35 PM Paul Emmerich  wrote:
>
> On Tue, Apr 21, 2020 at 3:20 AM Brad Hubbard  wrote:
> >
> > Wait for recovery to finish so you know whether any data from the down
> > OSDs is required. If not just reprovision them.
>
> Recovery will not finish from this state as several PGs are down and/or stale.

What I meant was let recovery get as far as it can.

>
>
> Paul
>
> >
> > If data is required from the down OSDs you will need to run a query on
> > the pg(s) to find out what OSDs have the required copies of the
> > pg/object required. you can then export the pg from the down osd using
> > the ceph-objectstore-tool, back it up, then import it back into the
> > cluster.
> >
> > On Tue, Apr 21, 2020 at 1:05 AM Robert Sander
> >  wrote:
> > >
> > > Hi,
> > >
> > > one of our customers had his Ceph cluster crashed due to a power or 
> > > network outage (they still try to figure out what happened).
> > >
> > > The cluster is very unhealthy but recovering:
> > >
> > > # ceph -s
> > >   cluster:
> > > id: 1c95ca5d-948b-4113-9246-14761cb9a82a
> > > health: HEALTH_ERR
> > > 1 filesystem is degraded
> > > 1 mds daemon damaged
> > > 1 osds down
> > > 1 pools have many more objects per pg than average
> > > 1/115117480 objects unfound (0.000%)
> > > Reduced data availability: 71 pgs inactive, 53 pgs down, 18 
> > > pgs peering, 27 pgs stale
> > > Possible data damage: 1 pg recovery_unfound
> > > Degraded data redundancy: 7303464/230234960 objects degraded 
> > > (3.172%), 693 pgs degraded, 945 pgs undersized
> > > 14 daemons have recently crashed
> > >
> > >   services:
> > > mon: 3 daemons, quorum 
> > > maslxlabstore01,maslxlabstore02,maslxlabstore04 (age 64m)
> > > mgr: maslxlabstore01(active, since 69m), standbys: maslxlabstore03, 
> > > maslxlabstore02, maslxlabstore04
> > > mds: cephfs:2/3 
> > > {0=maslxlabstore03=up:resolve,1=maslxlabstore01=up:resolve} 2 up:standby, 
> > > 1 damaged
> > > osd: 140 osds: 130 up (since 4m), 131 in (since 4m); 847 remapped pgs
> > > rgw: 4 daemons active (maslxlabstore01.rgw0, maslxlabstore02.rgw0, 
> > > maslxlabstore03.rgw0, maslxlabstore04.rgw0)
> > >
> > >   data:
> > > pools:   6 pools, 8328 pgs
> > > objects: 115.12M objects, 218 TiB
> > > usage:   425 TiB used, 290 TiB / 715 TiB avail
> > > pgs: 0.853% pgs not active
> > >  7303464/230234960 objects degraded (3.172%)
> > >  13486/230234960 objects misplaced (0.006%)
> > >  1/115117480 objects unfound (0.000%)
> > >  7311 active+clean
> > >  338  active+undersized+degraded+remapped+backfill_wait
> > >  255  active+undersized+degraded+remapped+backfilling
> > >  215  active+undersized+remapped+backfilling
> > >  99   active+undersized+degraded
> > >  44   down
> > >  37   active+undersized+remapped+backfill_wait
> > >  13   stale+peering
> > >  9stale+down
> > >  5stale+remapped+peering
> > >  1active+recovery_unfound+undersized+degraded+remapped
> > >  1active+clean+remapped
> > >
> > >   io:
> > > client:   168 B/s rd, 0 B/s wr, 0 op/s rd, 0 op/s wr
> > > recovery: 1.9 GiB/s, 15 keys/s, 948 objects/s
> > >
> > >
> > > The MDS cluster is unable to start because one of them is damaged.
> > >
> > > 10 of the OSDs do not start. They crash very early in the boot process:
> > >
> > > 2020-04-20 16:26:14.935 7f818ec8cc00  0 set uid:gid to 64045:64045 
> > > (ceph:ceph)
> > > 2020-04-20 16:26:14.935 7f818ec8cc00  0 ceph version 14.2.9 
> > > (581f22da52345dba46ee232b73b990f06029a2a0) nautilus (stable), process 
> > > ceph-osd, pid 69463
> > > 2020-04-20 16:26:14.935 7f818ec8cc00  0 pidfile_write: ignore empty 
> > > --pid-file
> > > 2020-04-20 16:26:15.503 7f818ec8cc00  0 starting osd.42 osd_data 
> > > /var/lib/ceph/osd/ceph-42 /var/lib/ceph/osd/ceph-42/journal
> > > 2020-04-20 16:26:15.523 7f818ec8cc00  0 load: jerasure load: lrc load: isa
> > > 2020-04-20 16:26:16.339 7f818ec8cc00  0  set rocksdb option 
> > > compaction_readahead_size = 2MB
> > > 2020-04-20 16:26:16.339 7f818ec8cc00  0  set rocksdb option 
> > > compaction_style = kCompactionStyleLevel
> > > 2020-04-20 16:26:16.339 7f818ec8cc00  0  set rocksdb option 
> > > compaction_threads = 32
> > > 2020-04-20 16:26:16.339 7f818ec8cc00  0  set rocksdb option compression = 
> > > kNoCompression
> > > 2020-04-20 16:26:16.339 7f818ec8cc00  0  set rocksdb option 
> > > flusher_threads = 8
> > > 2020-04-20 16:26:16.339 7f818ec8cc00  0  set rocksdb option 
> > > level0_file_num_compaction_trigger = 8
> > > 2020-04-20 16:26:16.339 7f818ec8cc00  0  set rocksdb option 
> > > level0_slowdown_writes_trigger = 32
> > > 2020-04-20 16:26:16.339 7f818ec8cc00  0  set rocksdb option 
> > > level0_stop_writes_t

[ceph-users] Re: Nautilus cluster damaged + crashing OSDs

2020-04-21 Thread Paul Emmerich
On Tue, Apr 21, 2020 at 12:44 PM Brad Hubbard  wrote:
>
> On Tue, Apr 21, 2020 at 6:35 PM Paul Emmerich  wrote:
> >
> > On Tue, Apr 21, 2020 at 3:20 AM Brad Hubbard  wrote:
> > >
> > > Wait for recovery to finish so you know whether any data from the down
> > > OSDs is required. If not just reprovision them.
> >
> > Recovery will not finish from this state as several PGs are down and/or 
> > stale.
>
> What I meant was let recovery get as far as it can.

Which doesn't solve anything: you can already see that you need to get
at least some of these OSDs back in order to fix it.
No point in waiting for the recovery.

I agree that it looks like https://tracker.ceph.com/issues/36337
I happen to know Jonas, who opened that issue and wrote the script;
I'll poke him, maybe he has an idea or additional input.


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

>
> >
> >
> > Paul
> >
> > >
> > > If data is required from the down OSDs you will need to run a query on
> > > the pg(s) to find out what OSDs have the required copies of the
> > > pg/object required. you can then export the pg from the down osd using
> > > the ceph-objectstore-tool, back it up, then import it back into the
> > > cluster.
> > >
> > > On Tue, Apr 21, 2020 at 1:05 AM Robert Sander
> > >  wrote:
> > > >
> > > > Hi,
> > > >
> > > > one of our customers had his Ceph cluster crashed due to a power or 
> > > > network outage (they still try to figure out what happened).
> > > >
> > > > The cluster is very unhealthy but recovering:
> > > >
> > > > # ceph -s
> > > >   cluster:
> > > > id: 1c95ca5d-948b-4113-9246-14761cb9a82a
> > > > health: HEALTH_ERR
> > > > 1 filesystem is degraded
> > > > 1 mds daemon damaged
> > > > 1 osds down
> > > > 1 pools have many more objects per pg than average
> > > > 1/115117480 objects unfound (0.000%)
> > > > Reduced data availability: 71 pgs inactive, 53 pgs down, 18 
> > > > pgs peering, 27 pgs stale
> > > > Possible data damage: 1 pg recovery_unfound
> > > > Degraded data redundancy: 7303464/230234960 objects 
> > > > degraded (3.172%), 693 pgs degraded, 945 pgs undersized
> > > > 14 daemons have recently crashed
> > > >
> > > >   services:
> > > > mon: 3 daemons, quorum 
> > > > maslxlabstore01,maslxlabstore02,maslxlabstore04 (age 64m)
> > > > mgr: maslxlabstore01(active, since 69m), standbys: maslxlabstore03, 
> > > > maslxlabstore02, maslxlabstore04
> > > > mds: cephfs:2/3 
> > > > {0=maslxlabstore03=up:resolve,1=maslxlabstore01=up:resolve} 2 
> > > > up:standby, 1 damaged
> > > > osd: 140 osds: 130 up (since 4m), 131 in (since 4m); 847 remapped 
> > > > pgs
> > > > rgw: 4 daemons active (maslxlabstore01.rgw0, maslxlabstore02.rgw0, 
> > > > maslxlabstore03.rgw0, maslxlabstore04.rgw0)
> > > >
> > > >   data:
> > > > pools:   6 pools, 8328 pgs
> > > > objects: 115.12M objects, 218 TiB
> > > > usage:   425 TiB used, 290 TiB / 715 TiB avail
> > > > pgs: 0.853% pgs not active
> > > >  7303464/230234960 objects degraded (3.172%)
> > > >  13486/230234960 objects misplaced (0.006%)
> > > >  1/115117480 objects unfound (0.000%)
> > > >  7311 active+clean
> > > >  338  active+undersized+degraded+remapped+backfill_wait
> > > >  255  active+undersized+degraded+remapped+backfilling
> > > >  215  active+undersized+remapped+backfilling
> > > >  99   active+undersized+degraded
> > > >  44   down
> > > >  37   active+undersized+remapped+backfill_wait
> > > >  13   stale+peering
> > > >  9stale+down
> > > >  5stale+remapped+peering
> > > >  1active+recovery_unfound+undersized+degraded+remapped
> > > >  1active+clean+remapped
> > > >
> > > >   io:
> > > > client:   168 B/s rd, 0 B/s wr, 0 op/s rd, 0 op/s wr
> > > > recovery: 1.9 GiB/s, 15 keys/s, 948 objects/s
> > > >
> > > >
> > > > The MDS cluster is unable to start because one of them is damaged.
> > > >
> > > > 10 of the OSDs do not start. They crash very early in the boot process:
> > > >
> > > > 2020-04-20 16:26:14.935 7f818ec8cc00  0 set uid:gid to 64045:64045 
> > > > (ceph:ceph)
> > > > 2020-04-20 16:26:14.935 7f818ec8cc00  0 ceph version 14.2.9 
> > > > (581f22da52345dba46ee232b73b990f06029a2a0) nautilus (stable), process 
> > > > ceph-osd, pid 69463
> > > > 2020-04-20 16:26:14.935 7f818ec8cc00  0 pidfile_write: ignore empty 
> > > > --pid-file
> > > > 2020-04-20 16:26:15.503 7f818ec8cc00  0 starting osd.42 osd_data 
> > > > /var/lib/ceph/osd/ceph-42 /var/lib/ceph/osd/ceph-42/journal
> > > > 2020-04-20 16:26:15.523 7f818ec8cc00  0 load: jerasure load: lrc load: 
> > > > isa
> > > > 2020-04-2

[ceph-users] Re: Nautilus cluster damaged + crashing OSDs

2020-04-21 Thread Jonas Jelten
Hi!

Yes, it looks like you hit the same bug.

My corruption back then happened because the server was out of memory and OSDs 
restarted and crashed quickly again and again for quite some time...

What I think happens is that the journals somehow get out of sync between OSDs, 
which is something that should
definitely not happen under the intended consistency guarantees.

However, I managed to resolve it back then by deleting the PG with the older 
log (under the assumption that the newer one is the more recent and better 
one). This only works if enough shards of that PG are available, of course, 
and then the regular recovery process will restore the missing shards again.

I hope my script still works for you. If you need any help, I'll see what I can 
do :)
If things fail, you can still manually import the exported-and-deleted PGs back 
into any OSD (which will probably cause
the other OSDs of the PG to crash since then the logs won't overlap once again).
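
A rough sketch of the export-then-delete step described above, with placeholder
OSD path and PG id (stop the OSD first, and keep the export file until the PG
is healthy again):

  OSD_PATH=/var/lib/ceph/osd/ceph-42   # the OSD holding the shard with the older log
  PGID=2.7f

  # Export the shard as a backup before touching anything:
  ceph-objectstore-tool --data-path "$OSD_PATH" --op export --pgid "$PGID" \
      --file /root/"$PGID".export

  # Remove that shard so the surviving copies win; normal recovery should then
  # restore the PG from the other OSDs:
  ceph-objectstore-tool --data-path "$OSD_PATH" --op remove --pgid "$PGID" --force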


Cheers
  -- Jonas

On 21/04/2020 11.26, Robert Sander wrote:
> Hi,
> 
> On 21.04.20 10:33, Paul Emmerich wrote:
>> On Tue, Apr 21, 2020 at 3:20 AM Brad Hubbard  wrote:
>>>
>>> Wait for recovery to finish so you know whether any data from the down
>>> OSDs is required. If not just reprovision them.
>>
>> Recovery will not finish from this state as several PGs are down and/or 
>> stale.
>>
> 
> Thanks for your input so far.
> 
> It looks like this issue: https://tracker.ceph.com/issues/36337
> We will try to use the linked Python script to repair the OSD.
> ceph-bluestore-tool repair did not find anything.
> 
> Regards
> 
> 


[ceph-users] block.db symlink missing after each reboot

2020-04-21 Thread Stefan Priebe - Profihost AG
Hi there,

I've a bunch of hosts where I migrated HDD-only OSDs to hybrid ones using:
sudo -E -u ceph -- bash -c 'ceph-bluestore-tool --path
/var/lib/ceph/osd/ceph-${OSD} bluefs-bdev-new-db --dev-target
/dev/bluefs_db1/db-osd${OSD}'

While this worked fine and each OSD was running fine, every OSD
loses its block.db symlink after a reboot.

If i manually recreate the block.db symlink inside:
/var/lib/ceph/osd/ceph-*

all OSDs start fine. Can anybody tell me what creates those symlinks and why
they're not created automatically in the case of a migrated DB?

Greets,
Stefan


[ceph-users] Re: Nautilus cluster damaged + crashing OSDs

2020-04-21 Thread Robert Sander
Hi Jonas,

On 21.04.20 14:47, Jonas Jelten wrote:

> I hope my script still works for you. If you need any help, I'll see what I 
> can do :)

The script currently does not find the info it needs and wants us to
increase the logging level.

We set the logging level to 10 and tried to restart the OSD (which
resulted in a crash) but the script still is not able to find the info.

Regards
-- 
Robert Sander
Heinlein Support GmbH
Schwedter Str. 8/9b, 10119 Berlin

https://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Amtsgericht Berlin-Charlottenburg - HRB 93818 B
Geschäftsführer: Peer Heinlein - Sitz: Berlin





[ceph-users] Re: block.db symlink missing after each reboot

2020-04-21 Thread Igor Fedotov

Hi Stefan,

I think that's the cause:

https://tracker.ceph.com/issues/42928


On 4/21/2020 4:02 PM, Stefan Priebe - Profihost AG wrote:

Hi there,

i've a bunch of hosts where i migrated HDD only OSDs to hybird ones using:
sudo -E -u ceph -- bash -c 'ceph-bluestore-tool --path
/var/lib/ceph/osd/ceph-${OSD} bluefs-bdev-new-db --dev-target
/dev/bluefs_db1/db-osd${OSD}'

while this worked fine and each OSD was running fine.

It looses it's block.db symlink after reboot.

If i manually recreate the block.db symlink inside:
/var/lib/ceph/osd/ceph-*

all osds start fine. Can anybody help who creates those symlinks and why
they're not created automatically in case of migrated db?

Greets,
Stefan


[ceph-users] Sporadic mgr segmentation fault

2020-04-21 Thread XuYun
Dear ceph users,

We are experiencing sporadic mgr crashes in all three of our Ceph clusters 
(version 14.2.6 and version 14.2.8); the crash log is:

2020-04-17 23:10:08.986 7fed7fe07700 -1 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.8/rpm/el7/BUILD/ceph-14.2.8/src/common/buffer.cc:
 In function 'const char* ceph::buffer::v14_2_0::ptr::c_str() const' thread 
7fed7fe07700 time 2020-04-17 23:10:08.984887
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.8/rpm/el7/BUILD/ceph-14.2.8/src/common/buffer.cc:
 578: FAILED ceph_assert(_raw)

 ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus 
(stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x14a) [0x7fed8605c325]
 2: (()+0x2534ed) [0x7fed8605c4ed]
 3: (()+0x5a21ed) [0x7fed863ab1ed]
 4: (PosixConnectedSocketImpl::send(ceph::buffer::v14_2_0::list&, bool)+0xbd) 
[0x7fed863840ed]
 5: (AsyncConnection::_try_send(bool)+0xb6) [0x7fed8632fc76]
 6: (ProtocolV2::write_message(Message*, bool)+0x832) [0x7fed8635bf52]
 7: (ProtocolV2::write_event()+0x175) [0x7fed863718c5]
 8: (AsyncConnection::handle_write()+0x40) [0x7fed86332600]
 9: (EventCenter::process_events(unsigned int, std::chrono::duration >*)+0x1397) [0x7fed8637f997]
 10: (()+0x57c977) [0x7fed86385977]
 11: (()+0x80bdaf) [0x7fed86614daf]
 12: (()+0x7e65) [0x7fed8394ce65]
 13: (clone()+0x6d) [0x7fed825fa88d]

2020-04-17 23:10:08.990 7fed7ee05700 -1 *** Caught signal (Segmentation fault) 
**
 in thread 7fed7ee05700 thread_name:msgr-worker-2

 ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus 
(stable)
 1: (()+0xf5f0) [0x7fed839545f0]
 2: (ceph::buffer::v14_2_0::ptr::release()+0x8) [0x7fed863aafd8]
 3: 
(ceph::crypto::onwire::AES128GCM_OnWireTxHandler::~AES128GCM_OnWireTxHandler()+0x59)
 [0x7fed86388669]
 4: (ProtocolV2::reset_recv_state()+0x11f) [0x7fed8635f5af]
 5: (ProtocolV2::stop()+0x77) [0x7fed8635f857]
 6: 
(ProtocolV2::handle_existing_connection(boost::intrusive_ptr)+0x5ef)
 [0x7fed86374f8f]
 7: (ProtocolV2::handle_client_ident(ceph::buffer::v14_2_0::list&)+0xd9c) 
[0x7fed8637673c]
 8: (ProtocolV2::handle_frame_payload()+0x1fb) [0x7fed86376c1b]
 9: (ProtocolV2::handle_read_frame_dispatch()+0x150) [0x7fed86376e70]
 10: 
(ProtocolV2::handle_read_frame_epilogue_main(std::unique_ptr&&, int)+0x44d) [0x7fed863773cd]
 11: (ProtocolV2::run_continuation(Ct&)+0x34) [0x7fed86360534]
 12: (AsyncConnection::process()+0x186) [0x7fed86330656]
 13: (EventCenter::process_events(unsigned int, std::chrono::duration >*)+0xa15) [0x7fed8637f015]
 14: (()+0x57c977) [0x7fed86385977]
 15: (()+0x80bdaf) [0x7fed86614daf]
 16: (()+0x7e65) [0x7fed8394ce65]
 17: (clone()+0x6d) [0x7fed825fa88d]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
interpret this.

Any thoughts about this issue?
Xu Yun
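
Not an answer to the crash itself, but for collecting the details worth
attaching to a tracker report, the crash module in Nautilus can help (the
crash id below is a placeholder taken from the 'ceph crash ls' output):

  # Crashes recorded by the cluster:
  ceph crash ls

  # Full metadata (version, entity, stack trace) for one of them:
  ceph crash info <crash-id>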


[ceph-users] Re: Nautilus cluster damaged + crashing OSDs

2020-04-21 Thread Jonas Jelten
Hi!

Since you are on nautilus and I was on mimic back then, the messages may have 
changed.
The script is only an automation for deleting many broken PGs; you can
perform the procedure by hand first.

You can perform the steps in my state machine by hand and identify the right 
messages, and then update the parser.

-- Jonas


On 21/04/2020 15.13, Robert Sander wrote:
> Hi Jonas,
> 
> On 21.04.20 14:47, Jonas Jelten wrote:
> 
>> I hope my script still works for you. If you need any help, I'll see what I 
>> can do :)
> 
> The script currently does not find the info it needs and wants us to
> increase to logging level.
> 
> We set the logging level to 10 and tried to restart the OSD (which
> resulted in a crash) but the script still is not able to find the info.
> 
> Regards
> 
> 


[ceph-users] Re: block.db symlink missing after each reboot

2020-04-21 Thread Stefan Priebe - Profihost AG
Hi Igor,

On 21.04.20 at 15:52, Igor Fedotov wrote:
> Hi Stefan,
> 
> I think that's the cause:
> 
> https://tracker.ceph.com/issues/42928

thanks yes that matches. Is there any way to fix this manually?

And is this also related to:
https://tracker.ceph.com/issues/44509

Greets,
Stefan

> 
> On 4/21/2020 4:02 PM, Stefan Priebe - Profihost AG wrote:
>> Hi there,
>>
>> i've a bunch of hosts where i migrated HDD only OSDs to hybird ones
>> using:
>> sudo -E -u ceph -- bash -c 'ceph-bluestore-tool --path
>> /var/lib/ceph/osd/ceph-${OSD} bluefs-bdev-new-db --dev-target
>> /dev/bluefs_db1/db-osd${OSD}'
>>
>> while this worked fine and each OSD was running fine.
>>
>> It looses it's block.db symlink after reboot.
>>
>> If i manually recreate the block.db symlink inside:
>> /var/lib/ceph/osd/ceph-*
>>
>> all osds start fine. Can anybody help who creates those symlinks and why
>> they're not created automatically in case of migrated db?
>>
>> Greets,
>> Stefan


[ceph-users] Re: block.db symlink missing after each reboot

2020-04-21 Thread Igor Fedotov

On 4/21/2020 4:59 PM, Stefan Priebe - Profihost AG wrote:

Hi Igor,

Am 21.04.20 um 15:52 schrieb Igor Fedotov:

Hi Stefan,

I think that's the cause:

https://tracker.ceph.com/issues/42928

thanks yes that matches. Is there any way to fix this manually?


I think so - AFAIK the missing tags are pure LVM stuff and hence can be set 
by regular LVM tools.


ceph-volume does that during OSD provisioning as well. But 
unfortunately I haven't dived deeper into this topic yet, so I can't 
provide you with step-by-step details on how to fix it.
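
As a very rough sketch of the "regular LVM tools" route (VG/LV names below are
placeholders, and the exact tag set ceph-volume expects should be copied from a
healthy OSD rather than from this example):

  # See which tags ceph-volume currently knows about:
  ceph-volume lvm list

  # Inspect the tags on a given LV (placeholder VG/LV names):
  lvs -o lv_tags ceph-block-vg/osd-block-0

  # Add a missing tag with plain LVM, e.g. the pointer to the migrated DB:
  lvchange --addtag "ceph.db_device=/dev/bluefs_db1/db-osd0" ceph-block-vg/osd-block-0

  # The DB LV itself may also need its own tags (ceph.type=db, ceph.osd_id, ...);
  # again, compare with an OSD that was provisioned with a separate DB from the start.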




And is this also related to:
https://tracker.ceph.com/issues/44509


Probably unrelated. That's either a different bug or rather some 
artifact from RocksDB/BlueFS interaction.


Leaving a request for more info in the ticket...



Greets,
Stefan


On 4/21/2020 4:02 PM, Stefan Priebe - Profihost AG wrote:

Hi there,

i've a bunch of hosts where i migrated HDD only OSDs to hybird ones
using:
sudo -E -u ceph -- bash -c 'ceph-bluestore-tool --path
/var/lib/ceph/osd/ceph-${OSD} bluefs-bdev-new-db --dev-target
/dev/bluefs_db1/db-osd${OSD}'

while this worked fine and each OSD was running fine.

It looses it's block.db symlink after reboot.

If i manually recreate the block.db symlink inside:
/var/lib/ceph/osd/ceph-*

all osds start fine. Can anybody help who creates those symlinks and why
they're not created automatically in case of migrated db?

Greets,
Stefan


[ceph-users] Re: some ceph general questions about the design

2020-04-21 Thread Antoine Lecrux
Hi Anthony,

You bring up a very valid point. My advice is to carefully choose the HBA and the 
disks, do extensive testing during the initial phase of the project, and have 
controlled firmware upgrade campaigns with a good pre-production setup.

In a multiple RAID-0 scenario, there are some parameters you need to disable 
such as rebuild priority or consistency check if you don't want your entire OSD 
server to temporarily go down in case of a single drive failure.

The points you bring up are valid with SATA flash disks too, as you have to deal 
with disk, HBA and sometimes backplane firmware.

- Antoine

PS: the "preserved cache" issue you're refering too... I had to ditch an HBA 
that had that "feature" during my initial hardware tests. It was dramatically 
affecting the stability of the entire OSD.


From: Anthony D'Atri 
Sent: Tuesday, April 21, 2020 2:59 AM
To: ceph-users 
Subject: [ceph-users] Re: some ceph general questions about the design



>
> 1. shoud i use a raid controller a create for example a raid 5 with all disks 
> on each osd server? or should i passtrough all disks to ceph osd?
>
> If your OSD servers have HDDs, buy a good RAID Controller with a 
> battery-backed write cache and configure it using multiple RAID-0 volumes (1 
> physical disk per volume). That way, reads and write will be accelerated by 
> the cache on the HBA.

I’ve lived this scenario and hated it.  Multiple firmware and manufacturing 
issues, batteries/supercaps can fail and need to be monitored, bugs causing 
staged data to be lost before writing to disk, another bug that required 
replacing the card if there was preserved cache for a failed drive, because it 
would refuse to boot, difficulties in drive monitoring, HBA monitoring utility 
that would lock the HBA or peg the CPU, the list goes on.

For the additional cost of RoC, cache RAM, supercap to (fingers crossed) 
protect the cache, all the additional monitoring and hands work … you might 
find that SATA SSDs on a JBOD HBA are no more expensive.

> 3. if i have a 3 physically node osd cluster, did i need 5 physicall mons?
> No. 3 MON are enough

If you have good hands and spares.  If your cluster is on a different continent 
and colo hands can’t find their own butts …..  it’s nice to survive a double 
failure.

ymmv


[ceph-users] Re: PG deep-scrub does not finish

2020-04-21 Thread Andras Pataki

Hi Brad,

Indeed - osd.694 kept crashing with a read error (medium error on the 
hard drive), and got restarted by systemd.  So net net the system ended 
up in an infinite loop of deep scrub attempts on the PG for a week.  
Typically when a scrub encounters a read error, I get an inconsistent 
placement group, not an OSD crash and an infinite loop of scrub 
attempts.  With the inconsistent placement group a pg repair fixes the 
read error (by reallocating the sector inside the drive).
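
For comparison, that usual inconsistent-PG path looks roughly like this (using 
the PG from this thread as the example id):

  # After a scrub flags the inconsistency:
  ceph health detail
  rados list-inconsistent-obj 1.43f --format=json-pretty
  ceph pg repair 1.43f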


Here is the stack trace of the osd.694 crash on the scrub read error:

Apr 19 03:39:17 popeye-oss-3-03 kernel: sd 14:0:27:0: [sdz] tag#1 FAILED 
Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Apr 19 03:39:17 popeye-oss-3-03 kernel: sd 14:0:27:0: [sdz] tag#1 Sense 
Key : Medium Error [current] [descriptor]
Apr 19 03:39:17 popeye-oss-3-03 kernel: sd 14:0:27:0: [sdz] tag#1 Add. 
Sense: Unrecovered read error
Apr 19 03:39:17 popeye-oss-3-03 kernel: sd 14:0:27:0: [sdz] tag#1 CDB: 
Read(10) 28 00 6e a9 7a 30 00 00 80 00
Apr 19 03:39:17 popeye-oss-3-03 kernel: print_req_error: critical medium 
error, dev sdz, sector 14852804992
Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: 2020-04-19 03:39:17.095 
7fffd2e2b700 -1 bluestore(/var/lib/ceph/osd/ceph-694) _do_read bdev-read 
failed: (61) No data available
Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.8/rpm/el7/BUILD/ceph-14.2.8/src/os/bluestore/BlueStore.cc: 
In function 'int BlueStore::_do_read(BlueStore::Collection*, 
BlueStore::OnodeRef, uint64_t, size_t, ceph::bufferlist&, uint32_t, 
uint64_t)' thread 7fffd2e2b700 time 2020-04-19 03:39:17.099677
Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.8/rpm/el7/BUILD/ceph-14.2.8/src/os/bluestore/BlueStore.cc: 
9214: FAILED ceph_assert(r == 0)
Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: ceph version 14.2.8 
(2d095e947a02261ce61424021bb43bd3022d35cb) nautilus (stable)
Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: 1: 
(ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x14a) [0x55a1ea4d]

Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: 2: (()+0x4cac15) [0x55a1ec15]
Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: 3: 
(BlueStore::_do_read(BlueStore::Collection*, 
boost::intrusive_ptr, unsigned long, unsigned long, 
ceph::buffer::v14_2_0::list&, unsigned int, unsigned long)+0x3512) 
[0x55f64132]
Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: 4: 
(BlueStore::read(boost::intrusive_ptr&, 
ghobject_t const&, unsigned long, unsigned long, 
ceph::buffer::v14_2_0::list&, unsigned int)+0x1b8) [0x55f64778]
Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: 5: 
(ReplicatedBackend::be_deep_scrub(hobject_t const&, ScrubMap&, 
ScrubMapBuilder&, ScrubMap::object&)+0x2c2) [0x55deb832]
Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: 6: 
(PGBackend::be_scan_list(ScrubMap&, ScrubMapBuilder&)+0x663) 
[0x55d082c3]
Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: 7: 
(PG::build_scrub_map_chunk(ScrubMap&, ScrubMapBuilder&, hobject_t, 
hobject_t, bool, ThreadPool::TPHandle&)+0x8b) [0x55bbaacb]
Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: 8: 
(PG::chunky_scrub(ThreadPool::TPHandle&)+0x181c) [0x55be4fcc]
Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: 9: (PG::scrub(unsigned int, 
ThreadPool::TPHandle&)+0x4bb) [0x55be61db]
Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: 10: (PGScrub::run(OSD*, 
OSDShard*, boost::intrusive_ptr&, ThreadPool::TPHandle&)+0x12) 
[0x55d8c7b2]
Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: 11: 
(OSD::ShardedOpWQ::_process(unsigned int, 
ceph::heartbeat_handle_d*)+0x90f) [0x55b1898f]
Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: 12: 
(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) 
[0x560bd056]
Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: 13: 
(ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x560bfb70]

Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: 14: (()+0x7e65) [0x75025e65]
Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: 15: (clone()+0x6d) 
[0x73ee988d]


I ended up recreating the OSD (and thus overwriting all data) to fix the 
issue.


Andras


On 4/20/20 9:28 PM, Brad Hubbard wrote:

On Mon, Apr 20, 2020 at 11:01 PM Andras Pataki
 wrote:

On a cluster running Nautilus (14.2.8), we are getting a complaint about
a PG not being deep-scrubbed on time.  Looking at the primary OSD's
logs, it looks like it tries to deep-scrub the PG every hour or so,
emits some complaints that I don't understand, but the deep scrub does
not finish (either with or without a scrub error).

Here is the PG from pg dump:

1.43f 31794  00 0   0
66930087214   0  0 3004 3004
active+clean+scrubbing+deep 2020-04-20 04:48:13.055481 46286'483734
46286:563439 [354,694,851]354 [354,694,851]354
3959

[ceph-users] Re: block.db symlink missing after each reboot

2020-04-21 Thread Stefan Priebe - Profihost AG
Hi Igor,

Hmm, I updated the missing LV tags:

# lvs -o lv_tags /dev/ceph-3a295647-d5a1-423c-81dd-1d2b32d7c4c5/osd-block-c2676c5f-111c-4603-b411-473f7a7638c2 | tr ',' '\n' | sort

  LV Tags
ceph.block_device=/dev/ceph-3a295647-d5a1-423c-81dd-1d2b32d7c4c5/osd-block-c2676c5f-111c-4603-b411-473f7a7638c2
ceph.block_uuid=0wBREi-I5t1-UeUa-EvbA-sET0-S9O0-VaxOgg
ceph.cephx_lockbox_secret=
ceph.cluster_fsid=7e242332-55c3-4926-9646-149b2f5c8081
ceph.cluster_name=ceph
ceph.crush_device_class=None
ceph.db_device=/dev/bluefs_db1/db-osd0
ceph.db_uuid=UUw35K-YnNT-HZZE-IfWd-Rtxn-0eVW-kTuQmj
ceph.encrypted=0
ceph.osd_fsid=c2676c5f-111c-4603-b411-473f7a7638c2
ceph.osd_id=0
ceph.type=block
ceph.vdo=0

# lvdisplay /dev/bluefs_db1/db-osd0
  --- Logical volume ---
  LV Path/dev/bluefs_db1/db-osd0
  LV Namedb-osd0
  VG Namebluefs_db1
  LV UUIDUUw35K-YnNT-HZZE-IfWd-Rtxn-0eVW-kTuQmj
  LV Write Accessread/write
  LV Creation host, time cloud10-1517, 2020-02-28 21:32:48 +0100
  LV Status  available
  # open 0
  LV Size185,00 GiB
  Current LE 47360
  Segments   1
  Allocation inherit
  Read ahead sectors auto
  - currently set to 256
  Block device   253:1

but lvm trigger still says:

# /usr/sbin/ceph-volume lvm trigger
0-c2676c5f-111c-4603-b411-473f7a7638c2
-->  RuntimeError: could not find db with uuid
UUw35K-YnNT-HZZE-IfWd-Rtxn-0eVW-kTuQmj
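
One guess, given that error: ceph-volume resolves the DB by the tags on the DB
LV itself, so bluefs_db1/db-osd0 probably needs its own tag set as well,
mirroring what a freshly provisioned DB LV carries (the exact list below is an
assumption; compare against 'ceph-volume lvm list' on a healthy hybrid OSD):

  # Hypothetical: values copied from the block LV's tags shown above.
  lvchange \
    --addtag "ceph.type=db" \
    --addtag "ceph.osd_id=0" \
    --addtag "ceph.osd_fsid=c2676c5f-111c-4603-b411-473f7a7638c2" \
    --addtag "ceph.db_uuid=UUw35K-YnNT-HZZE-IfWd-Rtxn-0eVW-kTuQmj" \
    --addtag "ceph.db_device=/dev/bluefs_db1/db-osd0" \
    bluefs_db1/db-osd0

  # Re-check and retry activation:
  lvs -o lv_tags bluefs_db1/db-osd0
  /usr/sbin/ceph-volume lvm trigger 0-c2676c5f-111c-4603-b411-473f7a7638c2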

Kind regards,
  Stefan Priebe
Bachelor of Science in Computer Science (BSCS)
Vorstand (CTO)

---
Profihost AG
Expo Plaza 1
30539 Hannover
Deutschland

Tel.: +49 (511) 5151 8181 | Fax.: +49 (511) 5151 8282
URL: http://www.profihost.com | E-Mail: i...@profihost.com

Sitz der Gesellschaft: Hannover, USt-IdNr. DE813460827
Registergericht: Amtsgericht Hannover, Register-Nr.: HRB 202350
Vorstand: Cristoph Bluhm, Stefan Priebe
Aufsichtsrat: Prof. Dr. iur. Winfried Huck (Vorsitzender)

On 21.04.20 at 16:07, Igor Fedotov wrote:
> On 4/21/2020 4:59 PM, Stefan Priebe - Profihost AG wrote:
>> Hi Igor,
>>
>> Am 21.04.20 um 15:52 schrieb Igor Fedotov:
>>> Hi Stefan,
>>>
>>> I think that's the cause:
>>>
>>> https://tracker.ceph.com/issues/42928
>> thanks yes that matches. Is there any way to fix this manually?
> 
> I think so - AFAIK missed tags are pure LVM stuff and hence can be set
> by regular LVM tools.
> 
>  ceph-volume does that during OSD provisioning as well. But
> unfortunately I haven't dived into this topic deeper yet. So can't
> provide you with the details how to fix this step-by-step.
> 
>>
>> And is this also related to:
>> https://tracker.ceph.com/issues/44509
> 
> Probably unrelated. That's either a different bug or rather some
> artifact from RocksDB/BlueFS interaction.
> 
> Leaving a request for more info in the ticket...
> 
>>
>> Greets,
>> Stefan
>>
>>> On 4/21/2020 4:02 PM, Stefan Priebe - Profihost AG wrote:
 Hi there,

 i've a bunch of hosts where i migrated HDD only OSDs to hybird ones
 using:
 sudo -E -u ceph -- bash -c 'ceph-bluestore-tool --path
 /var/lib/ceph/osd/ceph-${OSD} bluefs-bdev-new-db --dev-target
 /dev/bluefs_db1/db-osd${OSD}'

 while this worked fine and each OSD was running fine.

 It looses it's block.db symlink after reboot.

 If i manually recreate the block.db symlink inside:
 /var/lib/ceph/osd/ceph-*

 all osds start fine. Can anybody help who creates those symlinks and
 why
 they're not created automatically in case of migrated db?

 Greets,
 Stefan


[ceph-users] Rebuilding the Ceph.io site with Jekyll

2020-04-21 Thread Lars Marowsky-Bree
Hi all,

as part of the Ceph Foundation, we're considering re-launching the Ceph
website and migrating it away from a dated WordPress to Jekyll, backed by
Git et al. (Either hosted on our own infrastructure or even GitHub
Pages.)

This would involve building/customizing a Jekyll theme, providing
feedback on the site structure proposal and usability, migrating content
(where appropriate) from the existing site, and working with the Ceph
infra team on getting it hosted/deployed.

Some help with improving the design would be welcome.

Content creation isn't necessarily part of the requirements, but working
with stakeholders on filling in blanks is; and if we could get someone
savvy with Ceph who wants to fill in a few pages, that's a plus!

After the launch, we should be mostly self-sufficient again for
day-to-day tasks.

If that's the kind of contract work you or a friend is interested in,
please reach out to me.

(The Foundation hasn't yet approved the budget, we're still trying to
get a feeling for the funding required. But I'd be fairly optimistic.)



Regards,
Lars

-- 
SUSE Software Solutions Germany GmbH, MD: Felix Imendörffer, HRB 36809 (AG 
Nürnberg)
"Architects should open possibilities and not determine everything." (Ueli 
Zbinden)


[ceph-users] Re: PG deep-scrub does not finish

2020-04-21 Thread Brad Hubbard
Looks like that drive is dying.
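
A quick way to confirm that from the OS side, using the device name from the
kernel messages in the quoted report:

  # SMART health verdict and the attributes that matter for a dying disk:
  smartctl -H /dev/sdz
  smartctl -A /dev/sdz | egrep -i 'realloc|pending|uncorrect'

  # Recent kernel-level I/O errors for the same device:
  dmesg | grep -i sdz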

On Wed, Apr 22, 2020 at 12:25 AM Andras Pataki
 wrote:
>
> Hi Brad,
>
> Indeed - osd.694 kept crashing with a read error (medium error on the
> hard drive), and got restarted by systemd.  So net net the system ended
> up in an infinite loop of deep scrub attempts on the PG for a week.
> Typically when a scrub encounters a read error, I get an inconsistent
> placement group, not an OSD crash and an infinite loop of scrub
> attempts.  With the inconsistent placement group a pg repair fixes the
> read error (by reallocating the sector inside the drive).
>
> Here is the stack trace of the osd.694 crash on the scrub read error:
>
> Apr 19 03:39:17 popeye-oss-3-03 kernel: sd 14:0:27:0: [sdz] tag#1 FAILED
> Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> Apr 19 03:39:17 popeye-oss-3-03 kernel: sd 14:0:27:0: [sdz] tag#1 Sense
> Key : Medium Error [current] [descriptor]
> Apr 19 03:39:17 popeye-oss-3-03 kernel: sd 14:0:27:0: [sdz] tag#1 Add.
> Sense: Unrecovered read error
> Apr 19 03:39:17 popeye-oss-3-03 kernel: sd 14:0:27:0: [sdz] tag#1 CDB:
> Read(10) 28 00 6e a9 7a 30 00 00 80 00
> Apr 19 03:39:17 popeye-oss-3-03 kernel: print_req_error: critical medium
> error, dev sdz, sector 14852804992
> Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: 2020-04-19 03:39:17.095
> 7fffd2e2b700 -1 bluestore(/var/lib/ceph/osd/ceph-694) _do_read bdev-read
> failed: (61) No data available
> Apr 19 03:39:17 popeye-oss-3-03 ceph-osd:
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.8/rpm/el7/BUILD/ceph-14.2.8/src/os/bluestore/BlueStore.cc:
> In function 'int BlueStore::_do_read(BlueStore::Collection*,
> BlueStore::OnodeRef, uint64_t, size_t, ceph::bufferlist&, uint32_t,
> uint64_t)' thread 7fffd2e2b700 time 2020-04-19 03:39:17.099677
> Apr 19 03:39:17 popeye-oss-3-03 ceph-osd:
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.8/rpm/el7/BUILD/ceph-14.2.8/src/os/bluestore/BlueStore.cc:
> 9214: FAILED ceph_assert(r == 0)
> Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: ceph version 14.2.8
> (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus (stable)
> Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: 1:
> (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x14a) [0x55a1ea4d]
> Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: 2: (()+0x4cac15) [0x55a1ec15]
> Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: 3:
> (BlueStore::_do_read(BlueStore::Collection*,
> boost::intrusive_ptr, unsigned long, unsigned long,
> ceph::buffer::v14_2_0::list&, unsigned int, unsigned long)+0x3512)
> [0x55f64132]
> Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: 4:
> (BlueStore::read(boost::intrusive_ptr&,
> ghobject_t const&, unsigned long, unsigned long,
> ceph::buffer::v14_2_0::list&, unsigned int)+0x1b8) [0x55f64778]
> Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: 5:
> (ReplicatedBackend::be_deep_scrub(hobject_t const&, ScrubMap&,
> ScrubMapBuilder&, ScrubMap::object&)+0x2c2) [0x55deb832]
> Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: 6:
> (PGBackend::be_scan_list(ScrubMap&, ScrubMapBuilder&)+0x663)
> [0x55d082c3]
> Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: 7:
> (PG::build_scrub_map_chunk(ScrubMap&, ScrubMapBuilder&, hobject_t,
> hobject_t, bool, ThreadPool::TPHandle&)+0x8b) [0x55bbaacb]
> Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: 8:
> (PG::chunky_scrub(ThreadPool::TPHandle&)+0x181c) [0x55be4fcc]
> Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: 9: (PG::scrub(unsigned int,
> ThreadPool::TPHandle&)+0x4bb) [0x55be61db]
> Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: 10: (PGScrub::run(OSD*,
> OSDShard*, boost::intrusive_ptr&, ThreadPool::TPHandle&)+0x12)
> [0x55d8c7b2]
> Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: 11:
> (OSD::ShardedOpWQ::_process(unsigned int,
> ceph::heartbeat_handle_d*)+0x90f) [0x55b1898f]
> Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: 12:
> (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6)
> [0x560bd056]
> Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: 13:
> (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x560bfb70]
> Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: 14: (()+0x7e65) [0x75025e65]
> Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: 15: (clone()+0x6d)
> [0x73ee988d]
>
> I ended up recreating the OSD (and thus overwriting all data) to fix the
> issue.
>
> Andras
>
>
> On 4/20/20 9:28 PM, Brad Hubbard wrote:
> > On Mon, Apr 20, 2020 at 11:01 PM Andras Pataki
> >  wrote:
> >> On a cluster running Nautilus (14.2.8), we are getting a complaint about
> >> a PG not being deep-scrubbed on time.  Looking at the primary OSD's
> >> logs, it looks like it tries to deep-scrub the PG every hour or so,
> >> emits some complaints that I don't understand, but the deep scrub does
> >> not finish (either with or without a scrub error).
> >>
> >> Here is the PG from pg dump:
> >>
> >> 1.43f