Re: [ceph-users] RBD image "lightweight snapshots"

2018-08-27 Thread Bartosz Rabiega

Bumping the topic.


So, what do you think guys?


On 08/13/2018 12:22 PM, Bartosz Rabiega wrote:



On 08/11/2018 07:56 AM, Paweł Sadowski wrote:

On 08/10/2018 06:24 PM, Gregory Farnum wrote:

On Fri, Aug 10, 2018 at 4:53 AM, Paweł Sadowski wrote:

On 08/09/2018 04:39 PM, Alex Elder wrote:

On 08/09/2018 08:15 AM, Sage Weil wrote:

On Thu, 9 Aug 2018, Piotr Dałek wrote:

Hello,

At OVH we're heavily utilizing snapshots for our backup system. We think
there's an interesting optimization opportunity regarding snapshots I'd
like to discuss here.

The idea is to introduce the concept of a "lightweight" snapshot - such a
snapshot would not contain data but only the information about what has
changed on the image since it was created (so basically only the object
map part of snapshots).

Our backup solution (which seems to be a pretty common practice) is as
follows:

1. Create a snapshot of the image we want to back up
2. If there's a previous backup snapshot, export the diff and apply it on
the backup image
3. If there's no older snapshot, just do a full backup of the image
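
A minimal sketch of this workflow using the rbd CLI (pool, image and
snapshot names below are only examples, not taken from the thread):

  # create today's snapshot on the source image
  rbd snap create rbd/vm1@2018-08-27
  # first run only: full copy, plus a matching snapshot on the backup image
  rbd export rbd/vm1@2018-08-27 - | rbd import - backup/vm1
  rbd snap create backup/vm1@2018-08-27
  # later runs: apply only the diff between yesterday's and today's snapshot
  rbd export-diff --from-snap 2018-08-26 rbd/vm1@2018-08-27 - | rbd import-diff - backup/vm1
  # the old source snapshot is no longer needed once the diff is applied
  rbd snap rm rbd/vm1@2018-08-26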

This introduces one big issue: it enforces a COW snapshot on the image,
meaning that original image access latencies and consumed space increase.
"Lightweight" snapshots would remove these inefficiencies - no COW
performance and storage overhead.
The snapshot in 1 would be lightweight you mean?  And you'd do the backup
some (short) time later based on a diff with changed extents?

I'm pretty sure this will export a garbage image.  I mean, it will usually
be non-garbage, but the result won't be crash consistent, and in some
(many?) cases won't be usable.

Consider:

- take reference snapshot
- back up this image (assume for now it is perfect)
- write A to location 1
- take lightweight snapshot
- write B to location 1
- backup process copies location 1 (B) to target

The way I (we) see it working is a bit different:
  - take snapshot (1)
  - data writes might occur, that's ok - CoW kicks in here to preserve data
  - export data
  - convert snapshot (1) to a lightweight one (not create a new one):
    * from now on just remember which blocks have been modified instead
      of doing CoW
    * you can get rid of previously CoW'd data blocks (they've been
      exported already)
  - more writes
  - take snapshot (2)
  - export diff - only blocks modified since snap (1)
  - convert snapshot (2) to a lightweight one
  - ...


That way I don't see a place for data corruption. Of course this has
some drawbacks - you can't rollback/export data from such a lightweight
snapshot anymore. But on the other hand we are reducing the need for CoW -
and that's the main goal of this idea. Instead of doing CoW ~all the
time, it's needed only for the time of exporting the image/modified blocks.

What's the advantage of remembering the blocks changed for a
"lightweight snapshot" once the actual data diff is no longer there?
Is there a meaningful difference between this and just immediately
deleting a snapshot after doing the export?
-Greg


The advantage is that when I need to export a diff I know which blocks
changed, without checking (reading) the others, so I can just export them
for backup. If I delete the snapshot after the export, next time I'll have
to read the whole image again - no possibility to do a differential backup.

But as Sage wrote, we are doing this on Filestore. I don't know how
Bluestore works with snapshots (are whole 4MB chunks copied or only the
area of the current write) so performance might be much better - we need
to test it.

Our main goal with this idea is to improve performance in the case where
all images have at least one snapshot taken every *backup period* (24h or
lower).



The actual advantage lies in keeping COW to a minimum.

Assuming that you want to do differential backups every 24h.

With normal snapshots:
1. Create snapshot A, do full image export, takes 3h
2. Typical client IO, all writes are COW for 24h
3. After 24h Create snapshot B, and do export diff (A -> B), takes 0.5h
4. Remove snapshot A, as it's no longer needed
5. Typical client IO, all writes are COW for 24h
6. After 24h Create snapshot C, and do export diff (B -> C), takes 0.5h
7. Remove snapshot B, as it's no longer needed
8. Typical client IO, all writes are COW for 24h

Simplified estimation:
COW done for writes all the time since snapshot A = 72h of COW

With 'lightweight' snapshots
1. Create snapshot A, do full image export, takes 3h
2. Convert snapshot A to lightweight
3. Typical client IO, COW was done for 3h only
4. After 24h Create snapshot B, and do export diff (A -> B), takes 0.5h
5. Remove snapshot A, as it's no longer needed
6. Convert snapshot B to lightweight
7. Typical client IO, COW was done only for 0.5h
8. After 24h Create snapshot C, and do export diff (B -> C), takes 0.5h
9. Remove snapshot B, as it's no longer needed
10. Convert snapshot C to lightweight
11. Typical client IO, all writes are COW for 0.5h

Simplified estimation:
COW done for the full snapshot lifespan = 3h + 0.5h + 0.5h = 4h of COW

Re: [ceph-users] Design a PetaByte scale CEPH object storage

2018-08-27 Thread Marc Roos
 

> I am a software developer and am new to this domain. 

So maybe first get some senior system admin or so? You also do not want 
me to start doing some amateur brain surgery, do you?

> each file has approx 15 TB
 Pfff, maybe rethink/work this to







-Original Message-
From: James Watson [mailto:import.me...@gmail.com] 
Sent: zondag 26 augustus 2018 20:24
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Design a PetaByte scale CEPH object storage

Hi CEPHers,

I need to design an HA Ceph object storage system. The scenario is that
we are recording HD videos and at the end of the day we need to copy all
these video files (each file is approx 15 TB) to our storage system.

1) Which would be the best storage technology to transfer these PB-scale
loads of videos to Ceph-based object storage wirelessly?

2) How should I design my Ceph cluster at the scale of PBs and make sure
it's future proof?

3) What are the latest hardware components I might require to accomplish
this task?


I am a software developer and am new to this domain. I kindly request
everyone to provide the names of even the most basic hardware components
required for the setup so that I can do a cost estimation and compare
with other technologies.

My novice solution so far:

1. Transmission module: if using WiFi (802.11ac, aka Gigabit WiFi, max
200 Mbps speed), transferring a single file of 15 TB to Ceph storage
takes about 7 days!
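
(Sanity check on that estimate: 15 TB is roughly 1.2 x 10^8 megabits;
at 200 Mbps that is about 6 x 10^5 seconds, i.e. ~167 hours or ~7 days,
before any protocol overhead.)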

2. Ceph needs to be configured with high availability. A SAN with FC
networking in place (Gen 6 SAN), using NVMe SSDs with HBAs that support
NVMe over Fibre Channel, gives a transfer rate of 16 Gbps to the host
server.


Thanks for your help in advance. 




[ceph-users] CephFS Quota and ACL support

2018-08-27 Thread Oliver Freyermuth
Dear Cephalopodians,

sorry if this is the wrong place to ask - but does somebody know if the 
recently added quota support in the kernel client,
and the ACL support, are going to be backported to RHEL 7 / CentOS 7 kernels? 
Or can someone redirect me to the correct place to ask? 
We don't have a RHEL subscription, but are using CentOS. 

These features are critical for us, so right now we use the Fuse client. My 
hope is CentOS 8 will use a recent enough kernel
to get those features automatically, though. 

Cheers and thanks,
Oliver





Re: [ceph-users] Ceph-Deploy error on 15/71 stage

2018-08-27 Thread Eugen Block

Hi Jones,

all ceph logs are in the directory /var/log/ceph/, each daemon has its  
own log file, e.g. OSD logs are named ceph-osd.*.


I haven't tried it but I don't think SUSE Enterprise Storage deploys  
OSDs on partitioned disks. Is there a way to attach a second disk to  
the OSD nodes, maybe via USB or something?


Although this thread is ceph related it is referring to a specific  
product, so I would recommend to post your question in the SUSE forum  
[1].


Regards,
Eugen

[1] https://forums.suse.com/forumdisplay.php?99-SUSE-Enterprise-Storage

Quoting Jones de Andrade:


Hi Eugen.

Thanks for the suggestion. I'll look for the logs (since it's our first
attempt with ceph, I'll have to discover where they are, but no problem).

One thing called my attention on your response however:

I haven't made myself clear, but one of the failures we encountered was
that the files now containing:

node02:
   --
   storage:
   --
   osds:
   --
   /dev/sda4:
   --
   format:
   bluestore
   standalone:
   True

were originally empty, and we filled them by hand following a model found
elsewhere on the web. It was necessary so that we could continue, but the
model indicated that, for example, it should have the path for /dev/sda
here, not /dev/sda4. We chose to include the specific partition
identification because we won't have dedicated disks here, rather just the
very same partition, as all disks were partitioned exactly the same.

While that was enough for the procedure to continue at that point, now I
wonder if it was the right call and, if it indeed was, whether it was done
properly.  As such, I wonder: what do you mean by "wipe" the partition here?
/dev/sda4 is created, but is both empty and unmounted: should a different
operation be performed on it, should I remove it first, or should I have
written the files above with only /dev/sda as the target?
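
For reference, "wiping" in this context usually means clearing leftover
partition/filesystem/LVM signatures so the deployment tool sees a clean
device; a minimal sketch, assuming /dev/sda4 really is the partition
reserved for Ceph and contains nothing you want to keep:

  wipefs --all /dev/sda4
  ceph-volume lvm zap /dev/sda4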

I know that I probably wouldn't run into these issues with dedicated disks,
but unfortunately that is absolutely not an option.

Thanks a lot in advance for any comments and/or extra suggestions.

Sincerely yours,

Jones

On Sat, Aug 25, 2018 at 5:46 PM Eugen Block  wrote:


Hi,

take a look into the logs, they should point you in the right direction.
Since the deployment stage fails at the OSD level, start with the OSD
logs. Something's not right with the disks/partitions, did you wipe
the partition from previous attempts?

Regards,
Eugen

Quoting Jones de Andrade:


(Please forgive my previous email: I was using another message and
completely forgot to update the subject)

Hi all.

I'm new to ceph, and after having serious problems in ceph stages 0, 1
and 2 that I could solve myself, now it seems that I have hit a wall harder
than my head. :)

When I run salt-run state.orch ceph.stage.deploy and monitor it, I see it
going up to here:

###
[14/71]   ceph.sysctl on
  node01... ✓ (0.5s)
  node02 ✓ (0.7s)
  node03... ✓ (0.6s)
  node04. ✓ (0.5s)
  node05... ✓ (0.6s)
  node06.. ✓ (0.5s)

[15/71]   ceph.osd on
  node01.. ❌ (0.7s)
  node02 ❌ (0.7s)
  node03... ❌ (0.7s)
  node04. ❌ (0.6s)
  node05... ❌ (0.6s)
  node06.. ❌ (0.7s)

Ended stage: ceph.stage.deploy succeeded=14/71 failed=1/71 time=624.7s

Failures summary:

ceph.osd (/srv/salt/ceph/osd):
  node02:
deploy OSDs: Module function osd.deploy threw an exception.

Exception:

Mine on node02 for cephdisks.list
  node03:
deploy OSDs: Module function osd.deploy threw an exception.

Exception:

Mine on node03 for cephdisks.list
  node01:
deploy OSDs: Module function osd.deploy threw an exception.

Exception:

Mine on node01 for cephdisks.list
  node04:
deploy OSDs: Module function osd.deploy threw an exception.

Exception:

Mine on node04 for cephdisks.list
  node05:
deploy OSDs: Module function osd.deploy threw an exception.

Exception:

Mine on node05 for cephdisks.list
  node06:
deploy OSDs: Module function osd.deploy threw an exception.

Exception:

Mine on node06 for cephdisks.list
###

Since this is a first attempt on 6 simple test machines, we are going to
put the mon, osds, etc, on all nodes at first. Only the master is left on
a single machine (node01) for now.

As they are simple machines, they have a single hdd, which is partitioned
as follows (the hda4 partition is unmounted and left for the ceph
system):


##

Re: [ceph-users] Design a PetaByte scale CEPH object storage

2018-08-27 Thread John Hearns
James, I would recommend that you do the following

a) write out a clear set of requirements and use cases for this system. Do
not mention any specific technology
b) plan to install and test a small ProofOfConcept system. You can then
assess if it meets the requirement in (a)

On Mon, 27 Aug 2018 at 09:14, Marc Roos  wrote:

>
>
> > I am a software developer and am new to this domain.
>
> So maybe first get some senior system admin or so? You also do not want
> me to start doing some amateur brain surgery, do you?
>
> > each file has approx 15 TB
>  Pfff, maybe rethink/work this to
>
>
>
>
>
>
>
> -Original Message-
> From: James Watson [mailto:import.me...@gmail.com]
> Sent: zondag 26 augustus 2018 20:24
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] Design a PetaByte scale CEPH object storage
>
> Hi CEPHers,
>
> I need to design an HA CEPH object storage system. The scenario is that
> we are recording HD Videos and end of the day we need to copy all these
> video files (each file has approx 15 TB ) to our storage system.
>
> 1)Which would be the best tech in storage to transfer these PBs size
> loads of videos to CEPH based object storage wirelessly.
>
> 2)How should I design my CEPH in the scale of PBs and make sure its
> future proof.
>
> 3)What are the latest hardware components I might require to accomplish
> this task?
>
>
> I am a software developer and am new to this domain. Kindly request all
> to provide the name of even the most basic of hardware components
> required for the setup so that I can do a cost estimation and compare
> with other techs.
>
> My novice solution so far:
>
> 1. Transmitting module if using WiFi (802.11ac (aka Gigabit Wifi) max
> 200 Mbps speed) to transfer a file of size 15 TB to CEPH Storage takes 7
> days !!
>
> 2.CEPH needs to be configured with High Availability A SAN with FC
> networking in place (GEN 6 SANS) using NVMe SSD with HBAs that support
> NVMe over Fibre Channel giving a transfer rate of 16 Gbps to Host
> Server.
>
>
> Thanks for your help in advance.
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


[ceph-users] pgs incomplete and inactive

2018-08-27 Thread Josef Zelenka
Hi, I've had a very ugly thing happen to me over the weekend. Some of
my OSDs in a root that handles metadata pools overflowed to 100% disk
usage due to omap size (even though I had a 97% full ratio, which is odd)
and refused to start. There were some pgs on those OSDs that went away
with them. I have tried compacting the omap, moving files away etc, but
nothing - I can't export the pgs, I get errors like this:


2018-08-27 04:42:33.436182 7fcb53382580  4 rocksdb: EVENT_LOG_v1 
{"time_micros": 1535359353436170, "job": 1, "event": "recovery_started", 
"log_files": [5504, 5507]}
2018-08-27 04:42:33.436194 7fcb53382580  4 rocksdb: 
[/build/ceph-12.2.5/src/rocksdb/db/db_impl_open.cc:482] Recovering log 
#5504 mode 2
2018-08-27 04:42:35.422502 7fcb53382580  4 rocksdb: 
[/build/ceph-12.2.5/src/rocksdb/db/db_impl.cc:217] Shutdown: canceling 
all background work
2018-08-27 04:42:35.431613 7fcb53382580  4 rocksdb: 
[/build/ceph-12.2.5/src/rocksdb/db/db_impl.cc:343] Shutdown complete
2018-08-27 04:42:35.431716 7fcb53382580 -1 rocksdb: IO error: No space 
left on device/var/lib/ceph/osd/ceph-5//current/omap/005507.sst: No 
space left on device

Mount failed with '(1) Operation not permitted'
2018-08-27 04:42:35.432945 7fcb53382580 -1 
filestore(/var/lib/ceph/osd/ceph-5/) mount(1723): Error initializing 
rocksdb :


I decided to take the loss and mark the osds as lost and remove them 
from the cluster, however, it left 4 pgs hanging in incomplete + 
inactive state, which apparently prevents my radosgw from starting. Is 
there another way to export/import the pgs into their new osds/recreate 
them? I'm running Luminous 12.2.5 on Ubuntu 16.04.


Thanks

Josef



Re: [ceph-users] Error EINVAL: (22) Invalid argument While using ceph osd safe-to-destroy

2018-08-27 Thread Eugen Block

Hi,

could you please paste your osd tree and the exact command you're trying to execute?

Extra note, the while loop in the instructions look like it's bad.   
I had to change it to make it work in bash.


The documented command didn't work for me either.

Regards,
Eugen

Quoting Robert Stanford:


I am following the procedure here:
http://docs.ceph.com/docs/mimic/rados/operations/bluestore-migration/

 When I get to the part to run "ceph osd safe-to-destroy $ID" in a while
loop, I get an EINVAL error.  I get this error when I run "ceph osd
safe-to-destroy 0" on the command line by itself, too.  (Extra note, the
while loop in the instructions looks like it's bad.  I had to change it to
make it work in bash.)
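
For comparison, one bash formulation of that wait loop that does work (a
sketch, assuming $ID holds the OSD id; it does not explain the EINVAL
itself):

  ID=0
  while ! ceph osd safe-to-destroy osd.$ID ; do
      sleep 60
  done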

 I know my ID is correct because I was able to use it in the previous step
(ceph osd out $ID).  I also substituted $ID for the number on the command
line and got the same error.  Why isn't this working?

Error: Error EINVAL: (22) Invalid argument While using ceph osd
safe-to-destroy

 Thank you
R






Re: [ceph-users] CephFS Quota and ACL support

2018-08-27 Thread Sergey Malinin
It is supported in the mainline kernel from elrepo.
http://elrepo.org/tiki/tiki-index.php 
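
On CentOS 7 that typically means enabling the elrepo-kernel repository and
installing the mainline kernel package (a sketch; install the
elrepo-release RPM from the page above first):

  yum --enablerepo=elrepo-kernel install kernel-ml
  grub2-set-default 0
  reboot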

> On 27.08.2018, at 10:51, Oliver Freyermuth  
> wrote:
> 
> Dear Cephalopodians,
> 
> sorry if this is the wrong place to ask - but does somebody know if the 
> recently added quota support in the kernel client,
> and the ACL support, are going to be backported to RHEL 7 / CentOS 7 kernels? 
> Or can someone redirect me to the correct place to ask? 
> We don't have a RHEL subscription, but are using CentOS. 
> 
> These features are critical for us, so right now we use the Fuse client. My 
> hope is CentOS 8 will use a recent enough kernel
> to get those features automatically, though. 
> 
> Cheers and thanks,
>   Oliver
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



Re: [ceph-users] pgs incomplete and inactive

2018-08-27 Thread Paul Emmerich
Don't ever let an OSD run 100% full, that's usually bad news.
Two ways to salvage this:

1. You can try to extract the PGs with ceph-objectstore-tool and
inject them into another OSD; Ceph will find them and recover
2. You seem to be using Filestore, so you should easily be able to
just delete a whole PG on the full OSD's file system to make space
(preferably one that is already recovered and active+clean even
without the dead OSD)
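
A hedged sketch of option 1 (OSD paths, PG id and the scratch file are
examples only; the OSDs involved must be stopped while the tool runs):

  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-5 \
      --journal-path /var/lib/ceph/osd/ceph-5/journal \
      --op export --pgid 4.1 --file /mnt/scratch/pg.4.1.export

  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-7 \
      --journal-path /var/lib/ceph/osd/ceph-7/journal \
      --op import --file /mnt/scratch/pg.4.1.export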


Paul

2018-08-27 10:44 GMT+02:00 Josef Zelenka :
> Hi, i've had a very ugly thing happen to me over the weekend. Some of my
> OSDs in a root that handles metadata pools overflowed to 100% disk usage due
> to omap size(even though i had 97% full ratio, which is odd) and refused to
> start. There were some pgs on those OSDs that went away with them. I have
> tried compacting the omap, moving files away etc, but nothing  - i can't
> export the pgs, i get errors like this:
>
> 2018-08-27 04:42:33.436182 7fcb53382580  4 rocksdb: EVENT_LOG_v1
> {"time_micros": 1535359353436170, "job": 1, "event": "recovery_started",
> "log_files": [5504, 5507]}
> 2018-08-27 04:42:33.436194 7fcb53382580  4 rocksdb:
> [/build/ceph-12.2.5/src/rocksdb/db/db_impl_open.cc:482] Recovering log #5504
> mode 2
> 2018-08-27 04:42:35.422502 7fcb53382580  4 rocksdb:
> [/build/ceph-12.2.5/src/rocksdb/db/db_impl.cc:217] Shutdown: canceling all
> background work
> 2018-08-27 04:42:35.431613 7fcb53382580  4 rocksdb:
> [/build/ceph-12.2.5/src/rocksdb/db/db_impl.cc:343] Shutdown complete
> 2018-08-27 04:42:35.431716 7fcb53382580 -1 rocksdb: IO error: No space left
> on device/var/lib/ceph/osd/ceph-5//current/omap/005507.sst: No space left on
> device
> Mount failed with '(1) Operation not permitted'
> 2018-08-27 04:42:35.432945 7fcb53382580 -1
> filestore(/var/lib/ceph/osd/ceph-5/) mount(1723): Error initializing rocksdb
> :
>
> I decided to take the loss and mark the osds as lost and remove them from
> the cluster, however, it left 4 pgs hanging in incomplete + inactive state,
> which apparently prevents my radosgw from starting. Is there another way to
> export/import the pgs into their new osds/recreate them? I'm running
> Luminous 12.2.5 on Ubuntu 16.04.
>
> Thanks
>
> Josef
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


Re: [ceph-users] pgs incomplete and inactive

2018-08-27 Thread Josef Zelenka
The full ratio was ignored, that's most likely why that happened. I
can't delete pgs, because it's only KBs worth of space - the OSD is
40 GB, and 39.8 GB is taken up by omap - that's why I can't move/extract. Any
clue on how to compress/move away the omap dir?




On 27/08/18 12:34, Paul Emmerich wrote:

Don't ever let an OSD run 100% full, that's usually bad news.
Two ways to salvage this:

1. You can try to extract the PGs with ceph-objectstore-tool and
inject them into another OSD; Ceph will find them and recover
2. You seem to be using Filestore, so you should easily be able to
just delete a whole PG on the full OSD's file system to make space
(preferably one that is already recovered and active+clean even
without the dead OSD)


Paul

2018-08-27 10:44 GMT+02:00 Josef Zelenka :

Hi, i've had a very ugly thing happen to me over the weekend. Some of my
OSDs in a root that handles metadata pools overflowed to 100% disk usage due
to omap size(even though i had 97% full ratio, which is odd) and refused to
start. There were some pgs on those OSDs that went away with them. I have
tried compacting the omap, moving files away etc, but nothing  - i can't
export the pgs, i get errors like this:

2018-08-27 04:42:33.436182 7fcb53382580  4 rocksdb: EVENT_LOG_v1
{"time_micros": 1535359353436170, "job": 1, "event": "recovery_started",
"log_files": [5504, 5507]}
2018-08-27 04:42:33.436194 7fcb53382580  4 rocksdb:
[/build/ceph-12.2.5/src/rocksdb/db/db_impl_open.cc:482] Recovering log #5504
mode 2
2018-08-27 04:42:35.422502 7fcb53382580  4 rocksdb:
[/build/ceph-12.2.5/src/rocksdb/db/db_impl.cc:217] Shutdown: canceling all
background work
2018-08-27 04:42:35.431613 7fcb53382580  4 rocksdb:
[/build/ceph-12.2.5/src/rocksdb/db/db_impl.cc:343] Shutdown complete
2018-08-27 04:42:35.431716 7fcb53382580 -1 rocksdb: IO error: No space left
on device/var/lib/ceph/osd/ceph-5//current/omap/005507.sst: No space left on
device
Mount failed with '(1) Operation not permitted'
2018-08-27 04:42:35.432945 7fcb53382580 -1
filestore(/var/lib/ceph/osd/ceph-5/) mount(1723): Error initializing rocksdb
:

I decided to take the loss and mark the osds as lost and remove them from
the cluster, however, it left 4 pgs hanging in incomplete + inactive state,
which apparently prevents my radosgw from starting. Is there another way to
export/import the pgs into their new osds/recreate them? I'm running
Luminous 12.2.5 on Ubuntu 16.04.

Thanks

Josef

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com







[ceph-users] Odp.: pgs incomplete and inactive

2018-08-27 Thread Tomasz Kuzemko
Hello Josef,
I would suggest setting up a bigger disk (if not physical then maybe an LVM
volume from 2 smaller disks) and cloning (remember about extended attributes!)
the OSD data dir to the new disk, then try to bring the OSD back into the cluster.
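
A hedged sketch of such a clone (source/target paths are examples; the OSD
must be stopped, and -H/-A/-X preserve hard links, ACLs and extended
attributes):

  systemctl stop ceph-osd@5
  rsync -aHAX /var/lib/ceph/osd/ceph-5/ /mnt/bigger-disk/ceph-5/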

--
Tomasz Kuzemko
tomasz.kuze...@corp.ovh.com


From: ceph-users  on behalf of Josef Zelenka
Sent: Monday, 27 August 2018 13:29
To: Paul Emmerich; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] pgs incomplete and inactive

The fullratio was ignored, that's why that happenned most likely. I
can't delete pgs, because it's only kb's worth of space - the osd is
40gb, 39.8 gb is taken up by omap - that's why i can't move/extract. Any
clue on how to compress/move away the omap dir?



On 27/08/18 12:34, Paul Emmerich wrote:
> Don't ever let an OSD run 100% full, that's usually bad news.
> Two ways to salvage this:
>
> 1. You can try to extract the PGs with ceph-objectstore-tool and
> inject them into another OSD; Ceph will find them and recover
> 2. You seem to be using Filestore, so you should easily be able to
> just delete a whole PG on the full OSD's file system to make space
> (preferably one that is already recovered and active+clean even
> without the dead OSD)
>
>
> Paul
>
> 2018-08-27 10:44 GMT+02:00 Josef Zelenka :
>> Hi, i've had a very ugly thing happen to me over the weekend. Some of my
>> OSDs in a root that handles metadata pools overflowed to 100% disk usage due
>> to omap size(even though i had 97% full ratio, which is odd) and refused to
>> start. There were some pgs on those OSDs that went away with them. I have
>> tried compacting the omap, moving files away etc, but nothing  - i can't
>> export the pgs, i get errors like this:
>>
>> 2018-08-27 04:42:33.436182 7fcb53382580  4 rocksdb: EVENT_LOG_v1
>> {"time_micros": 1535359353436170, "job": 1, "event": "recovery_started",
>> "log_files": [5504, 5507]}
>> 2018-08-27 04:42:33.436194 7fcb53382580  4 rocksdb:
>> [/build/ceph-12.2.5/src/rocksdb/db/db_impl_open.cc:482] Recovering log #5504
>> mode 2
>> 2018-08-27 04:42:35.422502 7fcb53382580  4 rocksdb:
>> [/build/ceph-12.2.5/src/rocksdb/db/db_impl.cc:217] Shutdown: canceling all
>> background work
>> 2018-08-27 04:42:35.431613 7fcb53382580  4 rocksdb:
>> [/build/ceph-12.2.5/src/rocksdb/db/db_impl.cc:343] Shutdown complete
>> 2018-08-27 04:42:35.431716 7fcb53382580 -1 rocksdb: IO error: No space left
>> on device/var/lib/ceph/osd/ceph-5//current/omap/005507.sst: No space left on
>> device
>> Mount failed with '(1) Operation not permitted'
>> 2018-08-27 04:42:35.432945 7fcb53382580 -1
>> filestore(/var/lib/ceph/osd/ceph-5/) mount(1723): Error initializing rocksdb
>> :
>>
>> I decided to take the loss and mark the osds as lost and remove them from
>> the cluster, however, it left 4 pgs hanging in incomplete + inactive state,
>> which apparently prevents my radosgw from starting. Is there another way to
>> export/import the pgs into their new osds/recreate them? I'm running
>> Luminous 12.2.5 on Ubuntu 16.04.
>>
>> Thanks
>>
>> Josef
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>



Re: [ceph-users] ceph-fuse slow cache?

2018-08-27 Thread Stefan Kooman
Hi,

Quoting Yan, Zheng (uker...@gmail.com):
> Could you strace apacha process, check which syscall waits for a long time.

Yes, that's how I did all the tests (strace -t -T apache2 -X). With
debug=20 (ceph-fuse) you see apache waiting for almost 20 seconds before it 
starts serving data:

13:33:55 accept4(4, {sa_family=AF_INET6, sin6_port=htons(36829), 
inet_pton(AF_INET6, ":::213.136.12.151", &sin6_addr), sin6_flowinfo=0, 
sin6_scope_id=0}, [28], SOCK_CLOEXEC) = 24 <1.218381>
13:33:56 getsockname(24, {sa_family=AF_INET6, sin6_port=htons(80), 
inet_pton(AF_INET6, ":::10.5.80.8", &sin6_addr), sin6_flowinfo=0, 
sin6_scope_id=0}, [28]) = 0 <0.000113>
13:33:56 fcntl(24, F_GETFL) = 0x2 (flags O_RDWR) <0.72>
13:33:56 fcntl(24, F_SETFL, O_RDWR|O_NONBLOCK) = 0 <0.25>
13:33:56 clone(child_stack=0, 
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, 
child_tidptr=0x7f132cf6aa50) = 24260 <0.001235>
13:33:56 wait4(24260, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 24260 
<19.424578>
13:34:15 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=24260, 
si_uid=5003, si_status=0, si_utime=35, si_stime=27} ---
13:34:15 close(24)  = 0 <0.60>
13:34:15 clone(child_stack=0, 
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, 
child_tidptr=0x7f132cf6aa50) = 24263 <0.001003>
13:34:15 wait4(24263, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 24263 
<5.043079>
13:34:20 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=24263, 
si_uid=5003, si_status=0, si_utime=0, si_stime=0} ---
13:34:20 close(24)  = 0 <0.96>

Gr. Stefan

-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl


Re: [ceph-users] Why rbd rn did not clean used pool?

2018-08-27 Thread Jason Dillaman
On Sat, Aug 25, 2018 at 10:29 AM Fyodor Ustinov  wrote:
>
> Hi!
>
> Configuration:
> rbd - erasure pool
> rbdtier - tier pool for rbd
>
> ceph osd tier add-cache rbd rbdtier 549755813888
> ceph osd tier cache-mode rbdtier writeback
>
> Create new rbd block device:
> rbd create --size 16G  rbdtest
> rbd feature disable rbdtest object-map fast-diff deep-flatten
> rbd device map rbdtest
>
> And fill in rbd0 by data (dd, fio and like).
>
> Remove rbd block device:
> rbd device unmap rbdtest
> rbd rm rbdtest

What version of librbd are you using?

> And now pool usage look like:
>
> POOLS:
> NAMEID USED%USED MAX AVAIL OBJECTS
> rbd 9   16 GiB 0   0 B4094
> rbdtier 14 104 KiB 0   1.7 TiB5110
>
> rbd and rbdtier contain some objects:
> rados -p rbdtier ls
> rbd_data.14716b8b4567.0dc4
> rbd_data.14716b8b4567.02fc
> rbd_data.14716b8b4567.0e82
> rbd_data.14716b8b4567.03d7
> rbd_data.14716b8b4567.0fb1
> rbd_data.14716b8b4567.0018
> [...]
>
> rados - p rbd ls
> rbd_data.14716b8b4567.0dc4
> rbd_data.14716b8b4567.02fc
> rbd_data.14716b8b4567.0e82
> rbd_data.14716b8b4567.03d7
> rbd_data.14716b8b4567.0fb1
> [...]
>
> why does rbd rm not remove all used objects from the pools?
>
>
>
> WBR,
> Fyodor.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Jason


Re: [ceph-users] cephfs kernel client hangs

2018-08-27 Thread Zhenshi Zhou
Hi,
The kernel version is 4.12.8-1.el7.elrepo.x86_64.
Client.267792 is gone as I restarted the server at the weekend.
Is ceph-fuse more stable than the kernel client?

Yan, Zheng wrote on Mon, Aug 27, 2018 at 11:41 AM:

> please check client.213528, instead of client.267792 - which kernel
> version does client.213528 use?
> On Sat, Aug 25, 2018 at 6:12 AM Zhenshi Zhou  wrote:
> >
> > Hi,
> > This time,  osdc:
> >
> > REQUESTS 0 homeless 0
> > LINGER REQUESTS
> >
> > monc:
> >
> > have monmap 2 want 3+
> > have osdmap 4545 want 4546
> > have fsmap.user 0
> > have mdsmap 446 want 447+
> > fs_cluster_id -1
> >
> > mdsc:
> >
> > 649065  mds0setattr  #12e7e5a
> >
> > Anything useful?
> >
> >
> >
> >> Yan, Zheng wrote on Sat, Aug 25, 2018 at 7:53 AM:
> >>
> >> Are there hang request in /sys/kernel/debug/ceph//osdc
> >>
> >> On Fri, Aug 24, 2018 at 9:32 PM Zhenshi Zhou 
> wrote:
> >> >
> >> > I'm afraid that the client hangs again...the log shows:
> >> >
> >> > 2018-08-24 21:27:54.714334 [WRN]  slow request 62.607608 seconds old,
> received at 2018-08-24 21:26:52.106633: client_request(client.213528:241811
> getattr pAsLsXsFs #0x12e7e5a 2018-08-24 21:26:52.106425 caller_uid=0,
> caller_gid=0{}) currently failed to rdlock, waiting
> >> > 2018-08-24 21:27:54.714320 [WRN]  3 slow requests, 1 included below;
> oldest blocked for > 843.556758 secs
> >> > 2018-08-24 21:27:24.713740 [WRN]  slow request 32.606979 seconds old,
> received at 2018-08-24 21:26:52.106633: client_request(client.213528:241811
> getattr pAsLsXsFs #0x12e7e5a 2018-08-24 21:26:52.106425 caller_uid=0,
> caller_gid=0{}) currently failed to rdlock, waiting
> >> > 2018-08-24 21:27:24.713729 [WRN]  3 slow requests, 1 included below;
> oldest blocked for > 813.556129 secs
> >> > 2018-08-24 21:25:49.711778 [WRN]  slow request 483.807963 seconds
> old, received at 2018-08-24 21:17:45.903726:
> client_request(client.213528:241810 getattr pAsLsXsFs #0x12e7e5a
> 2018-08-24 21:17:45.903049 caller_uid=0, caller_gid=0{}) currently failed
> to rdlock, waiting
> >> > 2018-08-24 21:25:49.711766 [WRN]  2 slow requests, 1 included below;
> oldest blocked for > 718.554206 secs
> >> > 2018-08-24 21:21:54.707536 [WRN]  client.213528 isn't responding to
> mclientcaps(revoke), ino 0x12e7e5a pending pAsLsXsFr issued
> pAsLsXsFscr, sent 483.548912 seconds ago
> >> > 2018-08-24 21:21:54.706930 [WRN]  slow request 483.549363 seconds
> old, received at 2018-08-24 21:13:51.157483:
> client_request(client.267792:649065 setattr size=0 mtime=2018-08-24
> 21:13:51.163236 #0x12e7e5a 2018-08-24 21:13:51.163236 caller_uid=0,
> caller_gid=0{}) currently failed to xlock, waiting
> >> > 2018-08-24 21:21:54.706920 [WRN]  2 slow requests, 1 included below;
> oldest blocked for > 483.549363 secs
> >> > 2018-08-24 21:21:49.706838 [WRN]  slow request 243.803027 seconds
> old, received at 2018-08-24 21:17:45.903726:
> client_request(client.213528:241810 getattr pAsLsXsFs #0x12e7e5a
> 2018-08-24 21:17:45.903049 caller_uid=0, caller_gid=0{}) currently failed
> to rdlock, waiting
> >> > 2018-08-24 21:21:49.706828 [WRN]  2 slow requests, 1 included below;
> oldest blocked for > 478.549269 secs
> >> > 2018-08-24 21:19:49.704294 [WRN]  slow request 123.800486 seconds
> old, received at 2018-08-24 21:17:45.903726:
> client_request(client.213528:241810 getattr pAsLsXsFs #0x12e7e5a
> 2018-08-24 21:17:45.903049 caller_uid=0, caller_gid=0{}) currently failed
> to rdlock, waiting
> >> > 2018-08-24 21:19:49.704284 [WRN]  2 slow requests, 1 included below;
> oldest blocked for > 358.546729 secs
> >> > 2018-08-24 21:18:49.703073 [WRN]  slow request 63.799269 seconds old,
> received at 2018-08-24 21:17:45.903726: client_request(client.213528:241810
> getattr pAsLsXsFs #0x12e7e5a 2018-08-24 21:17:45.903049 caller_uid=0,
> caller_gid=0{}) currently failed to rdlock, waiting
> >> > 2018-08-24 21:18:49.703062 [WRN]  2 slow requests, 1 included below;
> oldest blocked for > 298.545511 secs
> >> > 2018-08-24 21:18:19.702465 [WRN]  slow request 33.798637 seconds old,
> received at 2018-08-24 21:17:45.903726: client_request(client.213528:241810
> getattr pAsLsXsFs #0x12e7e5a 2018-08-24 21:17:45.903049 caller_uid=0,
> caller_gid=0{}) currently failed to rdlock, waiting
> >> > 2018-08-24 21:18:19.702456 [WRN]  2 slow requests, 1 included below;
> oldest blocked for > 268.544880 secs
> >> > 2018-08-24 21:17:54.702517 [WRN]  client.213528 isn't responding to
> mclientcaps(revoke), ino 0x12e7e5a pending pAsLsXsFr issued
> pAsLsXsFscr, sent 243.543893 seconds ago
> >> > 2018-08-24 21:17:54.701904 [WRN]  slow request 243.544331 seconds
> old, received at 2018-08-24 21:13:51.157483:
> client_request(client.267792:649065 setattr size=0 mtime=2018-08-24
> 21:13:51.163236 #0x12e7e5a 2018-08-24 21:13:51.163236 caller_uid=0,
> caller_gid=0{}) currently failed to xlock, waiting
> >> > 2018-08-24 21:17:54.701894 [WRN]  1 slow requests, 1 included below;
> oldest blocked for > 243.544331 secs
> >> > 2018-08-24 21:15:

Re: [ceph-users] RBD image "lightweight snapshots"

2018-08-27 Thread Jason Dillaman
On Mon, Aug 27, 2018 at 3:29 AM Bartosz Rabiega
 wrote:
>
> Bumping the topic.
>
>
> So, what do you think guys?

Not sure if you saw my response from August 13th, but I stated that
this is something that you should be able to build right now using the
RADOS Python bindings and the rbd CLI. It would be pretty dangerous
for the average user to use without adding a lot of safety guardrails
to the entire process, however.

Of course, now that I think about it some more, I am not sure how the
OSDs would behave if sent a snap set with a deleted snapshot. They
used to just filter the errant entry, but I'm not sure how they would
behave under the removed snapshot interval set cleanup logic [1].

> On 08/13/2018 12:22 PM, Bartosz Rabiega wrote:
> >
> >
> > On 08/11/2018 07:56 AM, Paweł Sadowski wrote:
> >> On 08/10/2018 06:24 PM, Gregory Farnum wrote:
> >>> On Fri, Aug 10, 2018 at 4:53 AM, Paweł Sadowsk  wrote:
>  On 08/09/2018 04:39 PM, Alex Elder wrote:
> > On 08/09/2018 08:15 AM, Sage Weil wrote:
> >> On Thu, 9 Aug 2018, Piotr Dałek wrote:
> >>> Hello,
> >>>
> >>> At OVH we're heavily utilizing snapshots for our backup system.
> >>> We think
> >>> there's an interesting optimization opportunity regarding
> >>> snapshots I'd like
> >>> to discuss here.
> >>>
> >>> The idea is to introduce a concept of a "lightweight" snapshots
> >>> - such
> >>> snapshot would not contain data but only the information about
> >>> what has
> >>> changed on the image since it was created (so basically only the
> >>> object map
> >>> part of snapshots).
> >>>
> >>> Our backup solution (which seems to be a pretty common practice)
> >>> is as
> >>> follows:
> >>>
> >>> 1. Create snapshot of the image we want to backup
> >>> 2. If there's a previous backup snapshot, export diff and apply
> >>> it on the
> >>> backup image
> >>> 3. If there's no older snapshot, just do a full backup of image
> >>>
> >>> This introduces one big issue: it enforces COW snapshot on
> >>> image, meaning that
> >>> original image access latencies and consumed space increases.
> >>> "Lightweight"
> >>> snapshots would remove these inefficiencies - no COW performance
> >>> and storage
> >>> overhead.
> >> The snapshot in 1 would be lightweight you mean?  And you'd do
> >> the backup
> >> some (short) time later based on a diff with changed extents?
> >>
> >> I'm pretty sure this will export a garbage image.  I mean, it
> >> will usually
> >> be non-garbage, but the result won't be crash consistent, and in
> >> some
> >> (many?) cases won't be usable.
> >>
> >> Consider:
> >>
> >> - take reference snapshot
> >> - back up this image (assume for now it is perfect)
> >> - write A to location 1
> >> - take lightweight snapshot
> >> - write B to location 1
> >> - backup process copie location 1 (B) to target
>  The way I (we) see it working is a bit different:
>    - take snapshot (1)
>    - data write might occur, it's ok - CoW kicks in here to preserve
>  data
>    - export data
>    - convert snapshot (1) to a lightweight one (not create new):
>  * from now on just remember which blocks has been modified instead
>    of doing CoW
>  * you can get rid on previously CoW data blocks (they've been
>    exported already)
>    - more writes
>    - take snapshot (2)
>    - export diff - only blocks modified since snap (1)
>    - convert snapshot (2) to a lightweight one
>    - ...
> 
> 
>  That way I don't see a place for data corruption. Of course this has
>  some drawbacks - you can't rollback/export data from such lightweight
>  snapshot anymore. But on the other hand we are reducing need for CoW -
>  and that's the main goal with this idea. Instead of making CoW ~all
>  the
>  time it's needed only for the time of exporting image/modified blocks.
> >>> What's the advantage of remembering the blocks changed for a
> >>> "lightweight snapshot" once the actual data diff is no longer there?
> >>> Is there a meaningful difference between this and just immediately
> >>> deleting a snapshot after doing the export?
> >>> -Greg
> >>
> >> Advantage is that when I need to export diff I know which blocks
> >> changed,
> >> without checking (reading) others so I can just export them for backup.
> >> If i delete snapshot after export, next time I'll have to read whole
> >> image
> >> again - no possibility to do differential backup.
> >>
> >> But as Sage wrote, we are doing this on Filestore. I don't know how
> >> Bluestore
> >> works with snapshots (are whole 4MB chunks copied or only area of
> >> current write)
> >> so performance might be much better - need to test it.
> >>
> >> Our main goal with this idea is to improve performance in case where
> >> all imag

Re: [ceph-users] limited disk slots - should I ran OS on SD card ?

2018-08-27 Thread Paul Emmerich
This exact problem with the OS disk and problems deploying lots of
servers in an efficient way
was the main motivator for developing our croit orchestration product:
https://croit.io

I've talked about this on a few Ceph days, but the short summary is:

We started with Ceph in 2013 and decided to use SATA DOMs with a custom
installer for Ubuntu (derived from an existing internal tool; but it's
just a live image
that calls debootstrap and creates a few config files). That worked
reasonably well
for some time.

But a year later or so more and more of the SATA DOMs started to fail
and Servers
failed in the most annoying way possible: Byzantine failures --
locking up some random
CPU cores and no longer replying to some services/requests while
appearing perfectly
healthy to others...

We thought we didn't write anything to the sticks (logs sent via
rsyslogd), but I guess we missed something. I suspect the ntp drift file
might have been one of the problems.
We had several nodes fail within a week, at which point we decided that
this approach clearly wasn't working out for us.
Maybe SATA DOMs are better today? This was in 2013. Maybe we should have
caught what was writing to these disks.

Anyways, there is no point in installing an operating system on
something like a Ceph
OSD server. You got lots of them and they all look the same. So I
wrote a quick and
dirty PXE live boot system based on Debian. It was really just a
collection of shell scripts
that creates the image and DHCP server configuration. Debian (and
Ubuntu) make that
*really easy*. You basically just run deboostrap, add initramfs-live,
customize the chroot
and put the result in a squashfs image, that's it. (CentOS/RHEL is
significantly more
complicated because of dracut. I do like dracut, but the way it does
live boot is unnecessarily
complicated.)
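
A minimal sketch of that kind of image build, assuming a Debian base
(suite name, package set and output path are examples only):

  debootstrap stretch /srv/liveroot http://deb.debian.org/debian
  chroot /srv/liveroot apt-get install -y linux-image-amd64 live-boot ceph-osd
  mksquashfs /srv/liveroot /srv/tftp/filesystem.squashfs -comp xz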

The initial prototype running on one of the OSD servers took a few hours
to create. It then grew into an unmaintainable mess of bash scripts over
the coming years...

We started croit based on this idea in early 2017. It's based on the
same concept, but
the whole implementation behind it is completely new. Dockerized
deployment, fully
featured REST API on a Kotlin/Java stack for management, vue.js HTML5 UI, ...
Also, we are still planning to open source it later this year (working on
separating some components to release it as 'open core').

What I'm saying is: there are only very few circumstances under which
I would consider
installing an operating system on a server that is used as a "cattle
server". It makes no
sense in most setups, you just add a point of failure, management
overhead and you
waste time when deploying a server. Adding a new OSD server on Ceph deployments
that we manage is this simple: put server into the rack, plug it in,
boot it. That's it. No
need to install or configure *anything* on the server.
I also really like the "immutable infrastructure" part of these
deployments. I can easily
get back to clean slate by rebooting servers. I can upgrade lots of
servers by running
a rolling reboot task.


Paul

2018-08-17 11:01 GMT+02:00 Daznis :
> Hi,
>
> We used a PXE boot with NFS server, but had some issues if NFS server
> crapped out and dropped connections or needed a reboot for
> maintenance. If I remember it correctly it sometimes took out some of
> the rebooted servers. So we switched to PXE with livecd based images.
> You basically create a livecd image, then boot it with specially
> prepared initramfs image and it uses a copy on write disk for basic
> storage. With mimic osd's are started automatically, just need to feed
> some basic settings for that server.
> On Fri, Aug 17, 2018 at 11:31 AM Florian Florensa  wrote:
>>
>> What about PXE booting the OSD's server ? I am considering doing these
>> sort of things as it doesn't seem that complicated.
>> A simple script could easily bring the osd back onine using some lvm
>> commands to bring the lvm back online and then some ceph-lvm activate
>> command to fire the osd's back up.
>>
>>
>> 2018-08-15 16:09 GMT+02:00 Götz Reinicke :
>> > Hi,
>> >
>> >> Am 15.08.2018 um 15:11 schrieb Steven Vacaroaia :
>> >>
>> >> Thank you all
>> >>
>> >> Since all concerns were about reliability I am assuming  performance 
>> >> impact of having OS running on SD card is minimal / negligible
>> >
>> > some time ago we had some Cisco blades booting VMware ESXi from SD cards
>> > and had no issues for months ... till after an update a blade was rebooted
>> > and the SD failed ... and then another one on another server ... From my POV at
>> > that time the "server" SDs were not nearly as reliable as SSDs or rotating
>> > disks. My experiences from some years ago.
>> >
>> >>
>> >> In other words, an OSD server is not writing/reading from Linux OS
>> >> partitions too much (especially with logs at a minimum),
>> >> so its performance is not dependent on what type of disk the OS resides on.
>> >
>> > Regarding performance: What kind of SDs are supported? You can get some 
>> > 

Re: [ceph-users] cephfs kernel client hangs

2018-08-27 Thread Yan, Zheng
On Mon, Aug 27, 2018 at 6:10 AM Zhenshi Zhou  wrote:
>
> Hi,
> The kernel version is 4.12.8-1.el7.elrepo.x86_64.
> Client.267792 has gone as I restart the server at weekend.
> Does ceph-fuse more stable than kernel client?
>

For old kernels such as 4.12, ceph-fuse is more stable. If you use the
kernel client, you'd better use a recent kernel, and all clients should
use the same kernel version.


> Yan, Zheng wrote on Mon, Aug 27, 2018 at 11:41 AM:
>>
>> please check client.213528, instead of client.267792. which version of
>> kernel client.213528 use.
>> On Sat, Aug 25, 2018 at 6:12 AM Zhenshi Zhou  wrote:
>> >
>> > Hi,
>> > This time,  osdc:
>> >
>> > REQUESTS 0 homeless 0
>> > LINGER REQUESTS
>> >
>> > monc:
>> >
>> > have monmap 2 want 3+
>> > have osdmap 4545 want 4546
>> > have fsmap.user 0
>> > have mdsmap 446 want 447+
>> > fs_cluster_id -1
>> >
>> > mdsc:
>> >
>> > 649065  mds0setattr  #12e7e5a
>> >
>> > Anything useful?
>> >
>> >
>> >
>> >> Yan, Zheng wrote on Sat, Aug 25, 2018 at 7:53 AM:
>> >>
>> >> Are there hang request in /sys/kernel/debug/ceph//osdc
>> >>
>> >> On Fri, Aug 24, 2018 at 9:32 PM Zhenshi Zhou  wrote:
>> >> >
>> >> > I'm afaid that the client hangs again...the log shows:
>> >> >
>> >> > 2018-08-24 21:27:54.714334 [WRN]  slow request 62.607608 seconds old, 
>> >> > received at 2018-08-24 21:26:52.106633: 
>> >> > client_request(client.213528:241811 getattr pAsLsXsFs #0x12e7e5a 
>> >> > 2018-08-24 21:26:52.106425 caller_uid=0, caller_gid=0{}) currently 
>> >> > failed to rdlock, waiting
>> >> > 2018-08-24 21:27:54.714320 [WRN]  3 slow requests, 1 included below; 
>> >> > oldest blocked for > 843.556758 secs
>> >> > 2018-08-24 21:27:24.713740 [WRN]  slow request 32.606979 seconds old, 
>> >> > received at 2018-08-24 21:26:52.106633: 
>> >> > client_request(client.213528:241811 getattr pAsLsXsFs #0x12e7e5a 
>> >> > 2018-08-24 21:26:52.106425 caller_uid=0, caller_gid=0{}) currently 
>> >> > failed to rdlock, waiting
>> >> > 2018-08-24 21:27:24.713729 [WRN]  3 slow requests, 1 included below; 
>> >> > oldest blocked for > 813.556129 secs
>> >> > 2018-08-24 21:25:49.711778 [WRN]  slow request 483.807963 seconds old, 
>> >> > received at 2018-08-24 21:17:45.903726: 
>> >> > client_request(client.213528:241810 getattr pAsLsXsFs #0x12e7e5a 
>> >> > 2018-08-24 21:17:45.903049 caller_uid=0, caller_gid=0{}) currently 
>> >> > failed to rdlock, waiting
>> >> > 2018-08-24 21:25:49.711766 [WRN]  2 slow requests, 1 included below; 
>> >> > oldest blocked for > 718.554206 secs
>> >> > 2018-08-24 21:21:54.707536 [WRN]  client.213528 isn't responding to 
>> >> > mclientcaps(revoke), ino 0x12e7e5a pending pAsLsXsFr issued 
>> >> > pAsLsXsFscr, sent 483.548912 seconds ago
>> >> > 2018-08-24 21:21:54.706930 [WRN]  slow request 483.549363 seconds old, 
>> >> > received at 2018-08-24 21:13:51.157483: 
>> >> > client_request(client.267792:649065 setattr size=0 mtime=2018-08-24 
>> >> > 21:13:51.163236 #0x12e7e5a 2018-08-24 21:13:51.163236 caller_uid=0, 
>> >> > caller_gid=0{}) currently failed to xlock, waiting
>> >> > 2018-08-24 21:21:54.706920 [WRN]  2 slow requests, 1 included below; 
>> >> > oldest blocked for > 483.549363 secs
>> >> > 2018-08-24 21:21:49.706838 [WRN]  slow request 243.803027 seconds old, 
>> >> > received at 2018-08-24 21:17:45.903726: 
>> >> > client_request(client.213528:241810 getattr pAsLsXsFs #0x12e7e5a 
>> >> > 2018-08-24 21:17:45.903049 caller_uid=0, caller_gid=0{}) currently 
>> >> > failed to rdlock, waiting
>> >> > 2018-08-24 21:21:49.706828 [WRN]  2 slow requests, 1 included below; 
>> >> > oldest blocked for > 478.549269 secs
>> >> > 2018-08-24 21:19:49.704294 [WRN]  slow request 123.800486 seconds old, 
>> >> > received at 2018-08-24 21:17:45.903726: 
>> >> > client_request(client.213528:241810 getattr pAsLsXsFs #0x12e7e5a 
>> >> > 2018-08-24 21:17:45.903049 caller_uid=0, caller_gid=0{}) currently 
>> >> > failed to rdlock, waiting
>> >> > 2018-08-24 21:19:49.704284 [WRN]  2 slow requests, 1 included below; 
>> >> > oldest blocked for > 358.546729 secs
>> >> > 2018-08-24 21:18:49.703073 [WRN]  slow request 63.799269 seconds old, 
>> >> > received at 2018-08-24 21:17:45.903726: 
>> >> > client_request(client.213528:241810 getattr pAsLsXsFs #0x12e7e5a 
>> >> > 2018-08-24 21:17:45.903049 caller_uid=0, caller_gid=0{}) currently 
>> >> > failed to rdlock, waiting
>> >> > 2018-08-24 21:18:49.703062 [WRN]  2 slow requests, 1 included below; 
>> >> > oldest blocked for > 298.545511 secs
>> >> > 2018-08-24 21:18:19.702465 [WRN]  slow request 33.798637 seconds old, 
>> >> > received at 2018-08-24 21:17:45.903726: 
>> >> > client_request(client.213528:241810 getattr pAsLsXsFs #0x12e7e5a 
>> >> > 2018-08-24 21:17:45.903049 caller_uid=0, caller_gid=0{}) currently 
>> >> > failed to rdlock, waiting
>> >> > 2018-08-24 21:18:19.702456 [WRN]  2 slow requests, 1 included below; 
>> >> > oldest blocked for > 268.544880 secs
>> >> > 2018-08-24 21:17:54.702517 [WRN]  client.213528 isn't respo

Re: [ceph-users] ceph-fuse slow cache?

2018-08-27 Thread Yan, Zheng
On Mon, Aug 27, 2018 at 4:47 AM Stefan Kooman  wrote:
>
> Hi,
>
> Quoting Yan, Zheng (uker...@gmail.com):
> > Could you strace apacha process, check which syscall waits for a long time.
>
> Yes, that's how I did all the tests (strace -t -T apache2 -X). With
> debug=20 (ceph-fuse) you see apache waiting for almost 20 seconds before it 
> starts serving data:
>
> 13:33:55 accept4(4, {sa_family=AF_INET6, sin6_port=htons(36829), 
> inet_pton(AF_INET6, ":::213.136.12.151", &sin6_addr), sin6_flowinfo=0, 
> sin6_scope_id=0}, [28], SOCK_CLOEXEC) = 24 <1.218381>
> 13:33:56 getsockname(24, {sa_family=AF_INET6, sin6_port=htons(80), 
> inet_pton(AF_INET6, ":::10.5.80.8", &sin6_addr), sin6_flowinfo=0, 
> sin6_scope_id=0}, [28]) = 0 <0.000113>
> 13:33:56 fcntl(24, F_GETFL) = 0x2 (flags O_RDWR) <0.72>
> 13:33:56 fcntl(24, F_SETFL, O_RDWR|O_NONBLOCK) = 0 <0.25>
> 13:33:56 clone(child_stack=0, 
> flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, 
> child_tidptr=0x7f132cf6aa50) = 24260 <0.001235>
> 13:33:56 wait4(24260, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 
> 24260 <19.424578>
> 13:34:15 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=24260, 
> si_uid=5003, si_status=0, si_utime=35, si_stime=27} ---
> 13:34:15 close(24)  = 0 <0.60>
> 13:34:15 clone(child_stack=0, 
> flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, 
> child_tidptr=0x7f132cf6aa50) = 24263 <0.001003>
> 13:34:15 wait4(24263, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 
> 24263 <5.043079>
> 13:34:20 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=24263, 
> si_uid=5003, si_status=0, si_utime=0, si_stime=0} ---
> 13:34:20 close(24)  = 0 <0.96>
>
> Gr. Stefan
>

Please add the '-f' option (trace child processes' syscalls) to strace.
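
For example (a sketch; the output file path is only an illustration):

  strace -f -t -T -o /tmp/apache.trace apache2 -X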

> --
> | BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
> | GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl


[ceph-users] mimic + cephmetrics + prometheus - working ?

2018-08-27 Thread Steven Vacaroaia
Hi

has anyone been able to use Mimic + cephmetrics + prometheus?

I am struggling to make it fully functional as it appears the data provided
by node_exporter has different names than the ones grafana expects.

As a result of the above, only certain dashboards are being populated (the
ceph-specific ones) while others have "no data points" (the server-specific
ones).

Any advice/suggestion/troubleshooting tips will be greatly appreciated

Example:

Grafana latency by server uses
node_disk_read_time_ms

but node_exporter does not provide it

 curl http://osd01:9100/metrics | grep node_disk_read_time
  % Total% Received % Xferd  Average Speed   TimeTime Time
Current
 Dload  Upload   Total   SpentLeft
Speed
  0 00 00 0  0  0 --:--:-- --:--:-- --:--:--
 0# HELP node_disk_read_time_seconds_total The total number of milliseconds
spent by all reads.
# TYPE node_disk_read_time_seconds_total counter
node_disk_read_time_seconds_total{device="dm-0"} 8910.801
node_disk_read_time_seconds_total{device="sda"} 0.525
node_disk_read_time_seconds_total{device="sdb"} 14221.732
node_disk_read_time_seconds_total{device="sdc"} 0.465
node_disk_read_time_seconds_total{device="sdd"} 0.46
node_disk_read_time_seconds_total{device="sde"} 0.017
node_disk_read_time_seconds_total{device="sdf"} 455.064
node_disk_read_time_seconds_total{device="sr0"} 0
100 64683  100 646830 0  4452k  0 --:--:-- --:--:-- --:--:--
5263k
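
One way to bridge the gap, as a hedged suggestion: node_exporter 0.16
renamed its metrics to seconds-based names, so the affected dashboard
panels can usually be pointed at the new series, e.g. using

  node_disk_read_time_seconds_total * 1000

in place of node_disk_read_time_ms (the exact panel queries in cephmetrics
may differ).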


Re: [ceph-users] Odp.: pgs incomplete and inactive

2018-08-27 Thread David Turner
I came across a problem like this before with small flash OSDs for
metadata.  There is an open tracker about why it was able to fill 100% of
the way up, but no work was done on it in the 6 months after I got back to
healthy.  The way I got healthy was deleting one copy of a PG from each OSD
(different PGs on each), taking me down to 2 replicas of those PGs, i.e. I
used the ceph-objectstore-tool to delete pg 4.1, 4.2, and 4.3 on osd.6, pg
4.4, 4.5, and 4.7 on osd.7, etc.  That allowed the cluster to compact the
things it needed to, as well as allowing me to change the crush rule for the
pool so that it would move the pgs for the pool to larger disks until I had
larger OSDs to put it back on later.
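
A hedged sketch of that kind of per-OSD PG removal (OSD id and PG id are
examples; the OSD must be stopped first, and this deliberately drops one
replica, so only do it while the remaining copies are healthy):

  systemctl stop ceph-osd@6
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-6 --op remove --pgid 4.1 --force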

On Mon, Aug 27, 2018 at 7:36 AM Tomasz Kuzemko 
wrote:

> Hello Josef,
> I would suggest setting up a bigger disk (if not physical then maybe a LVM
> volume from 2 smaller disks) and cloning (remember about extended
> attributes!) the OSD data dir to the new disk, then try to bring the OSD
> back into cluster.
>
> --
> Tomasz Kuzemko
> tomasz.kuze...@corp.ovh.com
>
> 
> From: ceph-users  on behalf of
> Josef Zelenka 
> Sent: Monday, 27 August 2018 13:29
> To: Paul Emmerich; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] pgs incomplete and inactive
>
> The fullratio was ignored, that's why that happenned most likely. I
> can't delete pgs, because it's only kb's worth of space - the osd is
> 40gb, 39.8 gb is taken up by omap - that's why i can't move/extract. Any
> clue on how to compress/move away the omap dir?
>
>
>
> On 27/08/18 12:34, Paul Emmerich wrote:
> > Don't ever let an OSD run 100% full, that's usually bad news.
> > Two ways to salvage this:
> >
> > 1. You can try to extract the PGs with ceph-objectstore-tool and
> > inject them into another OSD; Ceph will find them and recover
> > 2. You seem to be using Filestore, so you should easily be able to
> > just delete a whole PG on the full OSD's file system to make space
> > (preferably one that is already recovered and active+clean even
> > without the dead OSD)
> >
> >
> > Paul
> >
> > 2018-08-27 10:44 GMT+02:00 Josef Zelenka  >:
> >> Hi, i've had a very ugly thing happen to me over the weekend. Some of my
> >> OSDs in a root that handles metadata pools overflowed to 100% disk
> usage due
> >> to omap size(even though i had 97% full ratio, which is odd) and
> refused to
> >> start. There were some pgs on those OSDs that went away with them. I
> have
> >> tried compacting the omap, moving files away etc, but nothing  - i can't
> >> export the pgs, i get errors like this:
> >>
> >> 2018-08-27 04:42:33.436182 7fcb53382580  4 rocksdb: EVENT_LOG_v1
> >> {"time_micros": 1535359353436170, "job": 1, "event": "recovery_started",
> >> "log_files": [5504, 5507]}
> >> 2018-08-27 04:42:33.436194 7fcb53382580  4 rocksdb:
> >> [/build/ceph-12.2.5/src/rocksdb/db/db_impl_open.cc:482] Recovering log
> #5504
> >> mode 2
> >> 2018-08-27 04:42:35.422502 7fcb53382580  4 rocksdb:
> >> [/build/ceph-12.2.5/src/rocksdb/db/db_impl.cc:217] Shutdown: canceling
> all
> >> background work
> >> 2018-08-27 04:42:35.431613 7fcb53382580  4 rocksdb:
> >> [/build/ceph-12.2.5/src/rocksdb/db/db_impl.cc:343] Shutdown complete
> >> 2018-08-27 04:42:35.431716 7fcb53382580 -1 rocksdb: IO error: No space
> left
> >> on device/var/lib/ceph/osd/ceph-5//current/omap/005507.sst: No space
> left on
> >> device
> >> Mount failed with '(1) Operation not permitted'
> >> 2018-08-27 04:42:35.432945 7fcb53382580 -1
> >> filestore(/var/lib/ceph/osd/ceph-5/) mount(1723): Error initializing
> rocksdb
> >> :
> >>
> >> I decided to take the loss and mark the osds as lost and remove them
> from
> >> the cluster, however, it left 4 pgs hanging in incomplete + inactive
> state,
> >> which apparently prevents my radosgw from starting. Is there another
> way to
> >> export/import the pgs into their new osds/recreate them? I'm running
> >> Luminous 12.2.5 on Ubuntu 16.04.
> >>
> >> Thanks
> >>
> >> Josef
> >>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS Quota and ACL support

2018-08-27 Thread Patrick Donnelly
On Mon, Aug 27, 2018 at 12:51 AM, Oliver Freyermuth
 wrote:
> These features are critical for us, so right now we use the Fuse client. My 
> hope is CentOS 8 will use a recent enough kernel
> to get those features automatically, though.

Your cluster needs to be running Mimic and Linux v4.17+.

See also: https://github.com/ceph/ceph/pull/23728/files
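
For completeness, the quotas themselves are set through virtual extended attributes on a directory, regardless of which client later enforces them; a quick example (mount point and values are made up):

  setfattr -n ceph.quota.max_bytes -v 107374182400 /mnt/cephfs/some/dir   # 100 GiB
  setfattr -n ceph.quota.max_files -v 100000 /mnt/cephfs/some/dir
  getfattr -n ceph.quota.max_bytes /mnt/cephfs/some/dir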

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS Quota and ACL support

2018-08-27 Thread Oliver Freyermuth
Thanks for the replies. 

Am 27.08.18 um 19:25 schrieb Patrick Donnelly:
> On Mon, Aug 27, 2018 at 12:51 AM, Oliver Freyermuth
>  wrote:
>> These features are critical for us, so right now we use the Fuse client. My 
>> hope is CentOS 8 will use a recent enough kernel
>> to get those features automatically, though.
> 
> Your cluster needs to be running Mimic and Linux v4.17+.
> 
> See also: https://github.com/ceph/ceph/pull/23728/files
> 

Yes, I know that it's part of the official / vanilla kernel as of 4.17.
However, I was wondering whether this functionality is also likely to be
backported to the Red Hat-maintained kernel that is used in CentOS 7.
Even though that kernel version is "stone-aged", it matches CentOS 7's userspace
and Red Hat takes good care to backport fixes.

Seeing that even new features are backported, it would be really helpful if this
functionality also appeared as part of CentOS 7.6 / 7.7,
especially since CentOS 8 still appears to be quite some time away.

Cheers,
Oliver



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS Quota and ACL support

2018-08-27 Thread Brett Niver
+Ilya

On Mon, Aug 27, 2018 at 10:53 AM Oliver Freyermuth <
freyerm...@physik.uni-bonn.de> wrote:

> Thanks for the replies.
>
> Am 27.08.18 um 19:25 schrieb Patrick Donnelly:
> > On Mon, Aug 27, 2018 at 12:51 AM, Oliver Freyermuth
> >  wrote:
> >> These features are critical for us, so right now we use the Fuse
> client. My hope is CentOS 8 will use a recent enough kernel
> >> to get those features automatically, though.
> >
> > Your cluster needs to be running Mimic and Linux v4.17+.
> >
> > See also: https://github.com/ceph/ceph/pull/23728/files
> >
>
> Yes, I know that it's part of the official / vanilla kernel as of 4.17.
> However, I was wondering whether this functionality is also likely to be
> backported to the RedHat-maintained kernel which is also used in CentOS 7?
> Even though the kernel version is "stone-aged", it matches CentOS 7's
> userspace and RedHat is taking good care to implement fixes.
>
> Seeing that even features are backported, it would be really helpful if
> also this functionality would appear as part of CentOS 7.6 / 7.7,
> especially since CentOS 8 still appears to be quite some time away.
>
> Cheers,
> Oliver
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [question] one-way RBD mirroring doesn't work

2018-08-27 Thread sat
> -Original Message-
> From: Jason Dillaman 
> Sent: Friday, August 24, 2018 12:09 AM
> To: sat 
> Cc: ceph-users 
> Subject: Re: [ceph-users] [question] one-way RBD mirroring doesn't work
> 
> On Thu, Aug 23, 2018 at 10:56 AM sat  wrote:
> >
> > Hi,
> >
> >
> > I'm trying to make a one-way RBD mirroed cluster between two Ceph
> > clusters. But it hasn't worked yet. It seems to sucecss, but after
> > making an RBD image from local cluster, it's considered as "unknown".
> >
> > ```
> > $ sudo rbd --cluster local create rbd/local.img --size=1G
> > --image-feature=exclusive-lock,journaling
> > $ sudo rbd --cluster local ls rbd
> > local.img
> > $ sudo rbd --cluster remote ls rbd
> > local.img
> > $ sudo rbd --cluster local mirror pool status rbd
> > health: WARNING
> > images: 1 total
> > 1 unknown
> > $ sudo rbd --cluster remote mirror pool status rbd
> > health: OK
> > images: 1 total
> > 1 replaying
> > $
> > ```
> >
> > Could you tell me what is wrong?
> 
> Nothing -- with one-directional RBD mirroring, only the receive side would
> report status. If you started an rbd-mirror daemon against the "local"
> cluster, it would report as healthy w/ that particular image in the
> "stopped" state since it's primary.

Thank you very much!

So, since my image's state is "unknown" rather than "stopped", my rbd-mirror
setting is not correct. I'll check it.

Best,
Satoru
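
A quick way to cross-check, assuming the pool and image names from the quoted setup below, is to ask for per-image detail on the cluster where the rbd-mirror daemon actually runs:

  sudo rbd --cluster remote mirror image status rbd/local.img
  sudo rbd --cluster remote mirror pool status rbd --verbose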

> 
> >
> > # detail
> >
> > There are two clusters, named "local" and "remote". "remote" is the mirror 
> > of
> "local".
> > Both two clusters has a pool, named "rbd".
> >
> > ## system environment
> >
> > - OS: ubuntu 16.04
> > - kernel: 4.4.0-112-generic
> > - ceph: luminous 12.2.5
> >
> > ## system configuration diagram
> >
> >
> 
> ==
> > +- manager(192.168.33.2): manipulate two clusters,
> > |
> > +- node0(192.168.33.3): "local"'s MON, MGR, and OSD0
> > |
> > +- node1(192.168.33.4); "local"'s OSD1
> > |
> > +- node2(192.168.33.5); "local"'s OSD2
> > |
> > +- remote-node0(192.168.33.7): "remote"'s MON, MGR, OSD0, and
> > +ceph-rbd-mirror
> > |
> > +- remote-node1(192.168.33.8); "remote"'s OSD1
> > |
> > +- remote-node2(192.168.33.9); "remote"'s OSD2
> >
> 
> 
> >
> > # Step to reproduce
> >
> > 1. Prepare two clusters "local" and "remote"
> >
> > ```
> > $ sudo ceph --cluster local -s
> >   cluster:
> >   id: 9faca802-745d-43d8-b572-16617e553a5f
> >   health: HEALTH_WARN
> >   application not enabled on 1 pool(s)
> >
> >   services:
> >   mon: 1 daemons, quorum 0
> >   mgr: 0(active)
> >   osd: 3 osds: 3 up, 3 in
> >
> >   data:
> >   pools:   1 pools, 128 pgs
> >   objects: 16 objects, 12395 kB
> >   usage:   3111 MB used, 27596 MB / 30708 MB avail
> >   pgs: 128 active+clean
> >
> >   io:
> >   client:   852 B/s rd, 0 op/s rd, 0 op/s wr
> >
> > $ sudo ceph --cluster remote -s
> >   cluster:
> >   id: 1ecb0aa6-5a00-4946-bdba-bad78bfa4372
> >   health: HEALTH_WARN
> >   application not enabled on 1 pool(s)
> >
> >   services:
> >   mon:1 daemons, quorum 0
> >   mgr:0(active)
> >   osd:3 osds: 3 up, 3 in
> >   rbd-mirror: 1 daemon active
> >
> >   data:
> >   pools:   1 pools, 128 pgs
> >   objects: 18 objects, 7239 kB
> >   usage:   3100 MB used, 27607 MB / 30708 MB avail
> >   pgs: 128 active+clean
> >
> >   io:
> >   client:   39403 B/s rd, 0 B/s wr, 4 op/s rd, 0 op/s wr
> >
> > $
> > 
> >
> >
> > Two clusters looks fine.
> >
> > 2. Setup one-way RBD pool mirroring from "local" to "remote"
> >
> > Setup an RBD pool mirroring between "local" and "remote" with the following
> steps.
> >
> >
> https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/h
> > tml/block_device_guide/block_device_mirroring
> >
> > Both cluster's status look fine as follows.
> >
> > ```
> > $ sudo rbd --cluster local mirror pool info rbd
> > Mode: pool
> > Peers: none
> > $ sudo rbd --cluster local mirror pool status rbd
> > health: OK
> > images: 0 total
> > $ sudo rbd --cluster remote mirror pool info rbd
> > Mode: pool
> > Peers:
> >   UUID NAME  CLIENT
> >   53fb3a9a-c451-4552-b409-c08709ebe1a9 local client.local $ sudo rbd
> > --cluster remote mirror pool status rbd
> > health: OK
> > images: 0 total
> > $
> > ```
> > 3. Create an RBD image
> >
> > ```
> > $ sudo rbd --cluster local create rbd/local.img --size=1G
> > --image-feature=exclusive-lock,journaling
> > $ sudo rbd --cluster local ls rbd
> > local.img
> > $ sudo rbd --cluster remote ls rbd
> > local.img
> > $
> > ```
> >
> > "rbd/local.img" seemd to be created and be mirrored fine.
> >
> > 4. Check both cluster's status and info
> >
> > Execute "rbd mirror pool info/status

Re: [ceph-users] fixable inconsistencies but more appears

2018-08-27 Thread Alfredo Daniel Rezinovsky

Well, it seems it was memory.

I have 3 OSDs per host with 8 GB RAM and block.db on SSD.

Setting bluestore_cache_size_ssd=1G seems to have fixed the problem. No
new inconsistencies.
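
For reference, a sketch of what that change could look like in ceph.conf (value in bytes; the OSDs need a restart to pick it up):

  [osd]
  bluestore_cache_size_ssd = 1073741824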




On 21/08/18 16:09, Paul Emmerich wrote:

Are you running tight on memory?

Paul

2018-08-21 20:37 GMT+02:00 Alfredo Daniel Rezinovsky
:

My cluster suddenly shows many inconsistent PGs.

with this kind of log

2018-08-21 15:29:39.065613 osd.2 osd.2 10.64.1.1:6801/1310438 146 : cluster
[ERR] 2.61 shard 5: soid 2:864a5b37:::170510e.0004:head candidate
had a read error
2018-08-21 15:31:38.542447 osd.2 osd.2 10.64.1.1:6801/1310438 147 : cluster
[ERR] 2.61 shard 5: soid 2:86783f28:::1241f7f.:head candidate
had a read error

Al error fixes with "ceph pg repair" eventually but new inconsistencies
appears.

smart and kernel logs shows no hdd problems.

I have bluestore OSDs in HDD with journal in an SDD partition.

--
Alfredo Daniel Rezinovsky
Director de Tecnologías de Información y Comunicaciones
Facultad de Ingeniería - Universidad Nacional de Cuyo

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





--
Alfredo Daniel Rezinovsky
Director de Tecnologías de Información y Comunicaciones
Facultad de Ingeniería - Universidad Nacional de Cuyo

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Bluestore crashing constantly with load on newly created cluster/host.

2018-08-27 Thread Tyler Bishop
Having a constant segfault issue under io load with my newly created
bluestore deployment.

https://pastebin.com/82YjXRm7

Setup is 28GB SSD LVM for block.db and 6T spinner for data.

Config:
[global]
fsid =  REDACTED
mon_initial_members = cephmon-1001, cephmon-1002, cephmon-1003
mon_host = 10.20.142.5,10.20.142.6,10.20.142.7
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true

# Fixes issue where image is created with newer than supported features
enabled.
rbd_default_features = 3


# Debug Tuning
debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_journaler = 0/0
debug_objectcatcher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_mon = 0/0
debug_paxos = 0/0
debug_rgw = 0/0

[osd]
osd_mkfs_type = xfs
osd_mount_options_xfs =
rw,noatime,,nodiratime,inode64,logbsize=256k,delaylog
osd_mkfs_options_xfs = -f -i size=2048
osd_journal_size = 10240
filestore_queue_max_ops=1000
filestore_queue_max_bytes = 1048576000
filestore_max_sync_interval = 10
filestore_merge_threshold = 500
filestore_split_multiple = 100
osd_op_shard_threads = 6
journal_max_write_entries = 5000
journal_max_write_bytes = 1048576000
journal_queueu_max_ops = 3000
journal_queue_max_bytes = 1048576000
ms_dispatch_throttle_bytes = 1048576000
objecter_inflight_op_bytes = 1048576000
public network = 10.20.142.0/24
cluster_network = 10.20.136.0/24
osd_disk_thread_ioprio_priority = 7
osd_disk_thread_ioprio_class = idle
osd_max_backfills = 2
osd_recovery_sleep = 0.10


[client]
rbd_cache = False
rbd cache size = 33554432
rbd cache target dirty = 16777216
rbd cache max dirty = 25165824
rbd cache max dirty age = 2
rbd cache writethrough until flush = false





2018-08-28 02:31:30.961954 7f64a895a700  4 rocksdb:
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.7/rpm/el7/BUILD/ceph-12.2.7/src/rocksdb/db/flush_job.cc:319]
[default] [JOB 19] Level-0 flush table #688: 6121532 bytes OK
2018-08-28 02:31:30.962476 7f64a895a700  4 rocksdb:
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.7/rpm/el7/BUILD/ceph-12.2.7/src/rocksdb/db/db_impl_files.cc:242]
adding log 681 to recycle list

2018-08-28 02:31:30.962495 7f64a895a700  4 rocksdb: (Original Log Time
2018/08/28-02:31:30.961973)
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.7/rpm/el7/BUILD/ceph-12.2.7/src/rocksdb/db/memtable_list.cc:360]
[default] Level-0 commit table #688 started
2018-08-28 02:31:30.962501 7f64a895a700  4 rocksdb: (Original Log Time
2018/08/28-02:31:30.962413)
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.7/rpm/el7/BUILD/ceph-12.2.7/src/rocksdb/db/memtable_list.cc:383]
[default] Level-0 commit table #688: memtable #1 done
2018-08-28 02:31:30.962505 7f64a895a700  4 rocksdb: (Original Log Time
2018/08/28-02:31:30.962432) EVENT_LOG_v1 {"time_micros": 1535423490962423,
"job": 19, "event": "flush_finished", "lsm_state": [1, 4, 1, 0, 0, 0, 0],
"immutable_memtables": 0}
2018-08-28 02:31:30.962509 7f64a895a700  4 rocksdb: (Original Log Time
2018/08/28-02:31:30.962458)
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.7/rpm/el7/BUILD/ceph-12.2.7/src/rocksdb/db/db_impl_compaction_flush.cc:132]
[default] Level summary: base level 1 max bytes base 268435456 files[1 4 1
0 0 0 0] max score 0.84

2018-08-28 02:31:30.962517 7f64a895a700  4 rocksdb:
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.7/rpm/el7/BUILD/ceph-12.2.7/src/rocksdb/db/db_impl_files.cc:388]
[JOB 19] Try to delete WAL files size 258068015, prev total WAL file size
260608480, number of live WAL files 2.

2018-08-28 02:32:06.102335 7f64b917b700  4 rocksdb:
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.7/rpm/el7/BUILD/ceph-12.2.7/src/rocksdb/db/db_impl_write.cc:684]
reusing log 681 from recycle list

2018-08-28 02:32:06.102473 7f64b917b700  4 rocksdb:
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.7/rpm/

Re: [ceph-users] Bluestore crashing constantly with load on newly created cluster/host.

2018-08-27 Thread Alfredo Daniel Rezinovsky
I had blockdb in ssd, with 3 OSDs per host (8G ram) and the default 3G 
bluestore_cache_size_ssd


I stopped having inconsistencies dropping the cache to 1G.


On 27/08/18 23:32, Tyler Bishop wrote:
Having a constant segfault issue under io load with my newly created 
bluestore deployment.


https://pastebin.com/82YjXRm7

Setup is 28GB SSD LVM for block.db and 6T spinner for data.

Config:
[global]
fsid =  REDACTED
mon_initial_members = cephmon-1001, cephmon-1002, cephmon-1003
mon_host = 10.20.142.5,10.20.142.6,10.20.142.7
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true

# Fixes issue where image is created with newer than supported 
features enabled.

rbd_default_features = 3


# Debug Tuning
debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_journaler = 0/0
debug_objectcatcher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_mon = 0/0
debug_paxos = 0/0
debug_rgw = 0/0

[osd]
osd_mkfs_type = xfs
osd_mount_options_xfs = 
rw,noatime,,nodiratime,inode64,logbsize=256k,delaylog

osd_mkfs_options_xfs = -f -i size=2048
osd_journal_size = 10240
filestore_queue_max_ops=1000
filestore_queue_max_bytes = 1048576000
filestore_max_sync_interval = 10
filestore_merge_threshold = 500
filestore_split_multiple = 100
osd_op_shard_threads = 6
journal_max_write_entries = 5000
journal_max_write_bytes = 1048576000
journal_queueu_max_ops = 3000
journal_queue_max_bytes = 1048576000
ms_dispatch_throttle_bytes = 1048576000
objecter_inflight_op_bytes = 1048576000
public network = 10.20.142.0/24 
cluster_network = 10.20.136.0/24 
osd_disk_thread_ioprio_priority = 7
osd_disk_thread_ioprio_class = idle
osd_max_backfills = 2
osd_recovery_sleep = 0.10


[client]
rbd_cache = False
rbd cache size = 33554432
rbd cache target dirty = 16777216
rbd cache max dirty = 25165824
rbd cache max dirty age = 2
rbd cache writethrough until flush = false





2018-08-28 02:31:30.961954 7f64a895a700  4 rocksdb: 
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.7/rpm/el7/BUILD/ceph-12.2.7/src/rocksdb/db/flush_job.cc:319] 
[default] [JOB 19] Level-0 flush table #688: 6121532 bytes OK
2018-08-28 02:31:30.962476 7f64a895a700  4 rocksdb: 
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.7/rpm/el7/BUILD/ceph-12.2.7/src/rocksdb/db/db_impl_files.cc:242] 
adding log 681 to recycle list


2018-08-28 02:31:30.962495 7f64a895a700  4 rocksdb: (Original Log Time 
2018/08/28-02:31:30.961973) 
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.7/rpm/el7/BUILD/ceph-12.2.7/src/rocksdb/db/memtable_list.cc:360] 
[default] Level-0 commit table #688 started
2018-08-28 02:31:30.962501 7f64a895a700  4 rocksdb: (Original Log Time 
2018/08/28-02:31:30.962413) 
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.7/rpm/el7/BUILD/ceph-12.2.7/src/rocksdb/db/memtable_list.cc:383] 
[default] Level-0 commit table #688: memtable #1 done
2018-08-28 02:31:30.962505 7f64a895a700  4 rocksdb: (Original Log Time 
2018/08/28-02:31:30.962432) EVENT_LOG_v1 {"time_micros": 
1535423490962423, "job": 19, "event": "flush_finished", "lsm_state": 
[1, 4, 1, 0, 0, 0, 0], "immutable_memtables": 0}
2018-08-28 02:31:30.962509 7f64a895a700  4 rocksdb: (Original Log Time 
2018/08/28-02:31:30.962458) 
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.7/rpm/el7/BUILD/ceph-12.2.7/src/rocksdb/db/db_impl_compaction_flush.cc:132] 
[default] Level summary: base level 1 max bytes base 268435456 files[1 
4 1 0 0 0 0] max score 0.84


2018-08-28 02:31:30.962517 7f64a895a700  4 rocksdb: 
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.7/rpm/el7/BUILD/ceph-12.2.7/src/rocksdb/db/db_impl_files.cc:388] 
[JOB 19] Try to delete WAL files size 258068015, prev total WAL file 
size 260608480, number of live WAL files 2.


2018-08-28 02:32:06.102335 7f64b917b700  4 rocksdb: 
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.7/rpm/el7/BUILD/ceph-12.2.7/sr

Re: [ceph-users] Bluestore crashing constantly with load on newly created cluster/host.

2018-08-27 Thread Tyler Bishop
My host has 256GB of ram.  62GB used under most heavy io workload.
_

*Tyler Bishop*
EST 2007


O: 513-299-7108 x1000
M: 513-646-5809
http://BeyondHosting.net 


This email is intended only for the recipient(s) above and/or
otherwise authorized personnel. The information contained herein and
attached is confidential and the property of Beyond Hosting. Any
unauthorized copying, forwarding, printing, and/or disclosing
any information related to this email is prohibited. If you received this
message in error, please contact the sender and destroy all copies of this
email and any attachment(s).


On Mon, Aug 27, 2018 at 10:36 PM Alfredo Daniel Rezinovsky <
alfredo.rezinov...@ingenieria.uncuyo.edu.ar> wrote:

> I had blockdb in ssd, with 3 OSDs per host (8G ram) and the default 3G
> bluestore_cache_size_ssd
>
> I stopped having inconsistencies dropping the cache to 1G.
>
> On 27/08/18 23:32, Tyler Bishop wrote:
>
> Having a constant segfault issue under io load with my newly created
> bluestore deployment.
>
> https://pastebin.com/82YjXRm7
>
> Setup is 28GB SSD LVM for block.db and 6T spinner for data.
>
> Config:
> [global]
> fsid =  REDACTED
> mon_initial_members = cephmon-1001, cephmon-1002, cephmon-1003
> mon_host = 10.20.142.5,10.20.142.6,10.20.142.7
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> filestore_xattr_use_omap = true
>
> # Fixes issue where image is created with newer than supported features
> enabled.
> rbd_default_features = 3
>
>
> # Debug Tuning
> debug_lockdep = 0/0
> debug_context = 0/0
> debug_crush = 0/0
> debug_buffer = 0/0
> debug_timer = 0/0
> debug_filer = 0/0
> debug_objecter = 0/0
> debug_rados = 0/0
> debug_rbd = 0/0
> debug_journaler = 0/0
> debug_objectcatcher = 0/0
> debug_client = 0/0
> debug_osd = 0/0
> debug_optracker = 0/0
> debug_objclass = 0/0
> debug_filestore = 0/0
> debug_journal = 0/0
> debug_ms = 0/0
> debug_monc = 0/0
> debug_tp = 0/0
> debug_auth = 0/0
> debug_finisher = 0/0
> debug_heartbeatmap = 0/0
> debug_perfcounter = 0/0
> debug_asok = 0/0
> debug_throttle = 0/0
> debug_mon = 0/0
> debug_paxos = 0/0
> debug_rgw = 0/0
>
> [osd]
> osd_mkfs_type = xfs
> osd_mount_options_xfs =
> rw,noatime,,nodiratime,inode64,logbsize=256k,delaylog
> osd_mkfs_options_xfs = -f -i size=2048
> osd_journal_size = 10240
> filestore_queue_max_ops=1000
> filestore_queue_max_bytes = 1048576000
> filestore_max_sync_interval = 10
> filestore_merge_threshold = 500
> filestore_split_multiple = 100
> osd_op_shard_threads = 6
> journal_max_write_entries = 5000
> journal_max_write_bytes = 1048576000
> journal_queueu_max_ops = 3000
> journal_queue_max_bytes = 1048576000
> ms_dispatch_throttle_bytes = 1048576000
> objecter_inflight_op_bytes = 1048576000
> public network = 10.20.142.0/24
> cluster_network = 10.20.136.0/24
> osd_disk_thread_ioprio_priority = 7
> osd_disk_thread_ioprio_class = idle
> osd_max_backfills = 2
> osd_recovery_sleep = 0.10
>
>
> [client]
> rbd_cache = False
> rbd cache size = 33554432
> rbd cache target dirty = 16777216
> rbd cache max dirty = 25165824
> rbd cache max dirty age = 2
> rbd cache writethrough until flush = false
>
>
> 
>
>
> 2018-08-28 02:31:30.961954 7f64a895a700  4 rocksdb:
> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.7/rpm/el7/BUILD/ceph-12.2.7/src/rocksdb/db/flush_job.cc:319]
> [default] [JOB 19] Level-0 flush table #688: 6121532 bytes OK
> 2018-08-28 02:31:30.962476 7f64a895a700  4 rocksdb:
> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.7/rpm/el7/BUILD/ceph-12.2.7/src/rocksdb/db/db_impl_files.cc:242]
> adding log 681 to recycle list
>
> 2018-08-28 02:31:30.962495 7f64a895a700  4 rocksdb: (Original Log Time
> 2018/08/28-02:31:30.961973)
> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.7/rpm/el7/BUILD/ceph-12.2.7/src/rocksdb/db/memtable_list.cc:360]
> [default] Level-0 commit table #688 started
> 2018-08-28 02:31:30.962501 7f64a895a700  4 rocksdb: (Original Log Time
> 2018/08/28-02:31:30.962413)
> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.7/rpm/el7/BUILD/ceph-12.2.7/src/rocksdb/db/memtable_list.cc:383]
> [default] Level-0 commit table #688: memtable #1 done
> 2018-08-28 02:31:30.962505 7f64a895a700  4 rocksdb: (Original Log Time
> 2018/08/28-02:31:30.962432) EVENT_LOG_v1 {"time_micros": 1535423490962423,
> "job": 19, "event": "flush_finished", "lsm_state": [1, 4, 1, 0, 0, 0, 0],
> "immutable_memtables": 0}
> 2018-08-28 02:31:30.962509 7f64a895a700  4 rocksdb: (Original Log Time
> 2018/08/28-02:31:30.962458)
> [/home/j

Re: [ceph-users] OSD Segfaults after Bluestore conversion

2018-08-27 Thread Tyler Bishop
Did you solve this?  Similar issue.
_


On Wed, Feb 28, 2018 at 3:46 PM Kyle Hutson  wrote:

> I'm following up from awhile ago. I don't think this is the same bug. The
> bug referenced shows "abort: Corruption: block checksum mismatch", and I'm
> not seeing that on mine.
>
> Now I've had 8 OSDs down on this one server for a couple of weeks, and I
> just tried to start it back up. Here's a link to the log of that OSD (which
> segfaulted right after starting up):
> http://people.beocat.ksu.edu/~kylehutson/ceph-osd.414.log
>
> To me, it looks like the logs are providing surprisingly few hints as to
> where the problem lies. Is there a way I can turn up logging to see if I
> can get any more info as to why this is happening?
>
> On Thu, Feb 8, 2018 at 3:02 AM, Mike O'Connor  wrote:
>
>> On 7/02/2018 8:23 AM, Kyle Hutson wrote:
>> > We had a 26-node production ceph cluster which we upgraded to Luminous
>> > a little over a month ago. I added a 27th-node with Bluestore and
>> > didn't have any issues, so I began converting the others, one at a
>> > time. The first two went off pretty smoothly, but the 3rd is doing
>> > something strange.
>> >
>> > Initially, all the OSDs came up fine, but then some started to
>> > segfault. Out of curiosity more than anything else, I did reboot the
>> > server to see if it would get better or worse, and it pretty much
>> > stayed the same - 12 of the 18 OSDs did not properly come up. Of
>> > those, 3 again segfaulted
>> >
>> > I picked one that didn't properly come up and copied the log to where
>> > anybody can view it:
>> > http://people.beocat.ksu.edu/~kylehutson/ceph-osd.426.log
>> > 
>> >
>> > You can contrast that with one that is up:
>> > http://people.beocat.ksu.edu/~kylehutson/ceph-osd.428.log
>> > 
>> >
>> > (which is still showing segfaults in the logs, but seems to be
>> > recovering from them OK?)
>> >
>> > Any ideas?
>> Ideas ? yes
>>
>> There is a a bug which is hitting a small number of systems and at this
>> time there is no solution. Issues details at
>> http://tracker.ceph.com/issues/22102.
>>
>> Please submit more details of your problem on the ticket.
>>
>> Mike
>>
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD Segfaults after Bluestore conversion

2018-08-27 Thread Adam Tygart
This issue was related to using Jemalloc. Jemalloc is not as well
tested with Bluestore and led to lots of segfaults. We moved back to
the default of tcmalloc with Bluestore and these stopped.

Check /etc/sysconfig/ceph under RHEL-based distros.
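
A sketch of what to look for there - the exact jemalloc library path is an assumption and varies by distro; the point is that any jemalloc LD_PRELOAD line has to go so the OSDs fall back to the default tcmalloc:

  # /etc/sysconfig/ceph
  TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728
  # remove or comment out a line like this if present:
  # LD_PRELOAD=/usr/lib64/libjemalloc.so.1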

--
Adam
On Mon, Aug 27, 2018 at 9:51 PM Tyler Bishop
 wrote:
>
> Did you solve this?  Similar issue.
> _
>
>
> On Wed, Feb 28, 2018 at 3:46 PM Kyle Hutson  wrote:
>>
>> I'm following up from awhile ago. I don't think this is the same bug. The 
>> bug referenced shows "abort: Corruption: block checksum mismatch", and I'm 
>> not seeing that on mine.
>>
>> Now I've had 8 OSDs down on this one server for a couple of weeks, and I 
>> just tried to start it back up. Here's a link to the log of that OSD (which 
>> segfaulted right after starting up): 
>> http://people.beocat.ksu.edu/~kylehutson/ceph-osd.414.log
>>
>> To me, it looks like the logs are providing surprisingly few hints as to 
>> where the problem lies. Is there a way I can turn up logging to see if I can 
>> get any more info as to why this is happening?
>>
>> On Thu, Feb 8, 2018 at 3:02 AM, Mike O'Connor  wrote:
>>>
>>> On 7/02/2018 8:23 AM, Kyle Hutson wrote:
>>> > We had a 26-node production ceph cluster which we upgraded to Luminous
>>> > a little over a month ago. I added a 27th-node with Bluestore and
>>> > didn't have any issues, so I began converting the others, one at a
>>> > time. The first two went off pretty smoothly, but the 3rd is doing
>>> > something strange.
>>> >
>>> > Initially, all the OSDs came up fine, but then some started to
>>> > segfault. Out of curiosity more than anything else, I did reboot the
>>> > server to see if it would get better or worse, and it pretty much
>>> > stayed the same - 12 of the 18 OSDs did not properly come up. Of
>>> > those, 3 again segfaulted
>>> >
>>> > I picked one that didn't properly come up and copied the log to where
>>> > anybody can view it:
>>> > http://people.beocat.ksu.edu/~kylehutson/ceph-osd.426.log
>>> > 
>>> >
>>> > You can contrast that with one that is up:
>>> > http://people.beocat.ksu.edu/~kylehutson/ceph-osd.428.log
>>> > 
>>> >
>>> > (which is still showing segfaults in the logs, but seems to be
>>> > recovering from them OK?)
>>> >
>>> > Any ideas?
>>> Ideas ? yes
>>>
>>> There is a a bug which is hitting a small number of systems and at this
>>> time there is no solution. Issues details at
>>> http://tracker.ceph.com/issues/22102.
>>>
>>> Please submit more details of your problem on the ticket.
>>>
>>> Mike
>>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD Segfaults after Bluestore conversion

2018-08-27 Thread Tyler Bishop
Okay, so far since switching back it looks more stable. I have around 2 GB/s
and 100k IOPS flowing with fio at the moment to test.
_
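
For anyone wanting to reproduce a similar load, a hedged fio sketch using the rbd engine (pool, image and client names are placeholders):

  fio --name=rbd-bench --ioengine=rbd --clientname=admin --pool=rbd \
      --rbdname=bench-img --rw=randwrite --bs=4k --iodepth=32 \
      --numjobs=4 --runtime=300 --time_based --group_reporting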



On Mon, Aug 27, 2018 at 11:06 PM Adam Tygart  wrote:

> This issue was related to using Jemalloc. Jemalloc is not as well
> tested with Bluestore and lead to lots of segfaults. We moved back to
> the default of tcmalloc with Bluestore and these stopped.
>
> Check /etc/sysconfig/ceph under RHEL based distros.
>
> --
> Adam
> On Mon, Aug 27, 2018 at 9:51 PM Tyler Bishop
>  wrote:
> >
> > Did you solve this?  Similar issue.
> > _
> >
> >
> > On Wed, Feb 28, 2018 at 3:46 PM Kyle Hutson  wrote:
> >>
> >> I'm following up from awhile ago. I don't think this is the same bug.
> The bug referenced shows "abort: Corruption: block checksum mismatch", and
> I'm not seeing that on mine.
> >>
> >> Now I've had 8 OSDs down on this one server for a couple of weeks, and
> I just tried to start it back up. Here's a link to the log of that OSD
> (which segfaulted right after starting up):
> http://people.beocat.ksu.edu/~kylehutson/ceph-osd.414.log
> >>
> >> To me, it looks like the logs are providing surprisingly few hints as
> to where the problem lies. Is there a way I can turn up logging to see if I
> can get any more info as to why this is happening?
> >>
> >> On Thu, Feb 8, 2018 at 3:02 AM, Mike O'Connor  wrote:
> >>>
> >>> On 7/02/2018 8:23 AM, Kyle Hutson wrote:
> >>> > We had a 26-node production ceph cluster which we upgraded to
> Luminous
> >>> > a little over a month ago. I added a 27th-node with Bluestore and
> >>> > didn't have any issues, so I began converting the others, one at a
> >>> > time. The first two went off pretty smoothly, but the 3rd is doing
> >>> > something strange.
> >>> >
> >>> > Initially, all the OSDs came up fine, but then some started to
> >>> > segfault. Out of curiosity more than anything else, I did reboot the
> >>> > server to see if it would get better or worse, and it pretty much
> >>> > stayed the same - 12 of the 18 OSDs did not properly come up. Of
> >>> > those, 3 again segfaulted
> >>> >
> >>> > I picked one that didn't properly come up and copied the log to where
> >>> > anybody can view it:
> >>> > http://people.beocat.ksu.edu/~kylehutson/ceph-osd.426.log
> >>> > 
> >>> >
> >>> > You can contrast that with one that is up:
> >>> > http://people.beocat.ksu.edu/~kylehutson/ceph-osd.428.log
> >>> > 
> >>> >
> >>> > (which is still showing segfaults in the logs, but seems to be
> >>> > recovering from them OK?)
> >>> >
> >>> > Any ideas?
> >>> Ideas ? yes
> >>>
> >>> There is a a bug which is hitting a small number of systems and at this
> >>> time there is no solution. Issues details at
> >>> http://tracker.ceph.com/issues/22102.
> >>>
> >>> Please submit more details of your problem on the ticket.
> >>>
> >>> Mike
> >>>
> >>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore crashing constantly with load on newly created cluster/host.

2018-08-27 Thread Alfredo Daniel Rezinovsky

Have you created the block.db partitions or LVM volumes manually?

What size?

On 27/08/18 23:48, Tyler Bishop wrote:

My host has 256GB of ram.  62GB used under most heavy io workload.
_

*Tyler Bishop*
EST 2007


O:513-299-7108 x1000
M:513-646-5809
http://BeyondHosting.net 


This email is intended only for the recipient(s) above and/or 
otherwise authorized personnel. The information contained herein and 
attached is confidential and the property of Beyond Hosting. Any 
unauthorized copying, forwarding, printing, and/or disclosing 
any information related to this email is prohibited. If you received 
this message in error, please contact the sender and destroy all 
copies of this email and any attachment(s).



On Mon, Aug 27, 2018 at 10:36 PM Alfredo Daniel Rezinovsky 
> wrote:


I had blockdb in ssd, with 3 OSDs per host (8G ram) and the
default 3G bluestore_cache_size_ssd

I stopped having inconsistencies dropping the cache to 1G.


On 27/08/18 23:32, Tyler Bishop wrote:

Having a constant segfault issue under io load with my newly
created bluestore deployment.

https://pastebin.com/82YjXRm7

Setup is 28GB SSD LVM for block.db and 6T spinner for data.

Config:
[global]
fsid =  REDACTED
mon_initial_members = cephmon-1001, cephmon-1002, cephmon-1003
mon_host = 10.20.142.5,10.20.142.6,10.20.142.7
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true

# Fixes issue where image is created with newer than supported
features enabled.
rbd_default_features = 3


# Debug Tuning
debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_journaler = 0/0
debug_objectcatcher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_mon = 0/0
debug_paxos = 0/0
debug_rgw = 0/0

[osd]
osd_mkfs_type = xfs
osd_mount_options_xfs =
rw,noatime,,nodiratime,inode64,logbsize=256k,delaylog
osd_mkfs_options_xfs = -f -i size=2048
osd_journal_size = 10240
filestore_queue_max_ops=1000
filestore_queue_max_bytes = 1048576000
filestore_max_sync_interval = 10
filestore_merge_threshold = 500
filestore_split_multiple = 100
osd_op_shard_threads = 6
journal_max_write_entries = 5000
journal_max_write_bytes = 1048576000
journal_queueu_max_ops = 3000
journal_queue_max_bytes = 1048576000
ms_dispatch_throttle_bytes = 1048576000
objecter_inflight_op_bytes = 1048576000
public network = 10.20.142.0/24 
cluster_network = 10.20.136.0/24 
osd_disk_thread_ioprio_priority = 7
osd_disk_thread_ioprio_class = idle
osd_max_backfills = 2
osd_recovery_sleep = 0.10


[client]
rbd_cache = False
rbd cache size = 33554432
rbd cache target dirty = 16777216
rbd cache max dirty = 25165824
rbd cache max dirty age = 2
rbd cache writethrough until flush = false





2018-08-28 02:31:30.961954 7f64a895a700  4 rocksdb:

[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.7/rpm/el7/BUILD/ceph-12.2.7/src/rocksdb/db/flush_job.cc:319]
[default] [JOB 19] Level-0 flush table #688: 6121532 bytes OK
2018-08-28 02:31:30.962476 7f64a895a700  4 rocksdb:

[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.7/rpm/el7/BUILD/ceph-12.2.7/src/rocksdb/db/db_impl_files.cc:242]
adding log 681 to recycle list

2018-08-28 02:31:30.962495 7f64a895a700  4 rocksdb: (Original Log
Time 2018/08/28-02:31:30.961973)

[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.7/rpm/el7/BUILD/ceph-12.2.7/src/rocksdb/db/memtable_list.cc:360]
[default] Level-0 commit table #688 started
2018-08-28 02:31:30.962501 7f64a895a700  4 rocksdb: (Original Log
Time 2018/08/28-02:31:30.962413)

[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.7/rpm/el7/BUILD/ceph-12.2.7/src/rocksdb/db/memtable_list.cc:383]
[default] Level-0 commit table #688: memtable #1 done
2018-08-2

Re: [ceph-users] Bluestore crashing constantly with load on newly created cluster/host.

2018-08-27 Thread Tyler Bishop
I bumped another post from earlier in the year.  I got this reply:


Adam Tygart wrote:
This issue was related to using Jemalloc. Jemalloc is not as well
tested with Bluestore and lead to lots of segfaults. We moved back to
the default of tcmalloc with Bluestore and these stopped.

Check /etc/sysconfig/ceph under RHEL based distros.

---

I had enabled jemalloc in the sysconfig previously. Disabled that and now
appear to have stable OSDs.


On Mon, Aug 27, 2018 at 11:13 PM Alfredo Daniel Rezinovsky <
alfredo.rezinov...@ingenieria.uncuyo.edu.ar> wrote:

> Have you created the blockdb partitions or LVM manually ?
>
> What size?
> On 27/08/18 23:48, Tyler Bishop wrote:
>
> My host has 256GB of ram.  62GB used under most heavy io workload.
> _
>
> *Tyler Bishop*
> EST 2007
>
>
> O: 513-299-7108 x1000
> M: 513-646-5809
> http://BeyondHosting.net 
>
>
> This email is intended only for the recipient(s) above and/or
> otherwise authorized personnel. The information contained herein and
> attached is confidential and the property of Beyond Hosting. Any
> unauthorized copying, forwarding, printing, and/or disclosing
> any information related to this email is prohibited. If you received this
> message in error, please contact the sender and destroy all copies of this
> email and any attachment(s).
>
>
> On Mon, Aug 27, 2018 at 10:36 PM Alfredo Daniel Rezinovsky <
> alfredo.rezinov...@ingenieria.uncuyo.edu.ar> wrote:
>
>> I had blockdb in ssd, with 3 OSDs per host (8G ram) and the default 3G
>> bluestore_cache_size_ssd
>>
>> I stopped having inconsistencies dropping the cache to 1G.
>>
>> On 27/08/18 23:32, Tyler Bishop wrote:
>>
>> Having a constant segfault issue under io load with my newly created
>> bluestore deployment.
>>
>> https://pastebin.com/82YjXRm7
>>
>> Setup is 28GB SSD LVM for block.db and 6T spinner for data.
>>
>> Config:
>> [global]
>> fsid =  REDACTED
>> mon_initial_members = cephmon-1001, cephmon-1002, cephmon-1003
>> mon_host = 10.20.142.5,10.20.142.6,10.20.142.7
>> auth_cluster_required = cephx
>> auth_service_required = cephx
>> auth_client_required = cephx
>> filestore_xattr_use_omap = true
>>
>> # Fixes issue where image is created with newer than supported features
>> enabled.
>> rbd_default_features = 3
>>
>>
>> # Debug Tuning
>> debug_lockdep = 0/0
>> debug_context = 0/0
>> debug_crush = 0/0
>> debug_buffer = 0/0
>> debug_timer = 0/0
>> debug_filer = 0/0
>> debug_objecter = 0/0
>> debug_rados = 0/0
>> debug_rbd = 0/0
>> debug_journaler = 0/0
>> debug_objectcatcher = 0/0
>> debug_client = 0/0
>> debug_osd = 0/0
>> debug_optracker = 0/0
>> debug_objclass = 0/0
>> debug_filestore = 0/0
>> debug_journal = 0/0
>> debug_ms = 0/0
>> debug_monc = 0/0
>> debug_tp = 0/0
>> debug_auth = 0/0
>> debug_finisher = 0/0
>> debug_heartbeatmap = 0/0
>> debug_perfcounter = 0/0
>> debug_asok = 0/0
>> debug_throttle = 0/0
>> debug_mon = 0/0
>> debug_paxos = 0/0
>> debug_rgw = 0/0
>>
>> [osd]
>> osd_mkfs_type = xfs
>> osd_mount_options_xfs =
>> rw,noatime,,nodiratime,inode64,logbsize=256k,delaylog
>> osd_mkfs_options_xfs = -f -i size=2048
>> osd_journal_size = 10240
>> filestore_queue_max_ops=1000
>> filestore_queue_max_bytes = 1048576000
>> filestore_max_sync_interval = 10
>> filestore_merge_threshold = 500
>> filestore_split_multiple = 100
>> osd_op_shard_threads = 6
>> journal_max_write_entries = 5000
>> journal_max_write_bytes = 1048576000
>> journal_queueu_max_ops = 3000
>> journal_queue_max_bytes = 1048576000
>> ms_dispatch_throttle_bytes = 1048576000
>> objecter_inflight_op_bytes = 1048576000
>> public network = 10.20.142.0/24
>> cluster_network = 10.20.136.0/24
>> osd_disk_thread_ioprio_priority = 7
>> osd_disk_thread_ioprio_class = idle
>> osd_max_backfills = 2
>> osd_recovery_sleep = 0.10
>>
>>
>> [client]
>> rbd_cache = False
>> rbd cache size = 33554432
>> rbd cache target dirty = 16777216
>> rbd cache max dirty = 25165824
>> rbd cache max dirty age = 2
>> rbd cache writethrough until flush = false
>>
>>
>> 
>>
>>
>> 2018-08-28 02:31:30.961954 7f64a895a700  4 rocksdb:
>> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.7/rpm/el7/BUILD/ceph-12.2.7/src/rocksdb/db/flush_job.cc:319]
>> [default] [JOB 19] Level-0 flush table #688: 6121532 bytes OK
>> 2018-08-28 02:31:30.962476 7f64a895a700  4 rocksdb:
>> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.7/rpm/el7/BUILD/ceph-12.2.7/src/rocksdb/db/db_impl_files.cc:242]
>> adding log 681 to recycle list
>>
>> 2018-08-28 02:31:30.962495 7f64a895a700  4 rocksdb: (Original Log Time
>> 2018/08/28-02:31:30.961973)
>> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/rele

Re: [ceph-users] CephFS Quota and ACL support

2018-08-27 Thread Yan, Zheng
On Mon, Aug 27, 2018 at 10:53 AM Oliver Freyermuth
 wrote:
>
> Thanks for the replies.
>
> Am 27.08.18 um 19:25 schrieb Patrick Donnelly:
> > On Mon, Aug 27, 2018 at 12:51 AM, Oliver Freyermuth
> >  wrote:
> >> These features are critical for us, so right now we use the Fuse client. 
> >> My hope is CentOS 8 will use a recent enough kernel
> >> to get those features automatically, though.
> >
> > Your cluster needs to be running Mimic and Linux v4.17+.
> >
> > See also: https://github.com/ceph/ceph/pull/23728/files
> >
>
> Yes, I know that it's part of the official / vanilla kernel as of 4.17.
> However, I was wondering whether this functionality is also likely to be 
> backported to the RedHat-maintained kernel which is also used in CentOS 7?
> Even though the kernel version is "stone-aged", it matches CentOS 7's 
> userspace and RedHat is taking good care to implement fixes.
>

We have already backported quota patches to RHEL 3.10 kernel. It may
take some time for redhat to release the new kernel.

Regards
Yan, Zheng

> Seeing that even features are backported, it would be really helpful if also 
> this functionality would appear as part of CentOS 7.6 / 7.7,
> especially since CentOS 8 still appears to be quite some time away.
>
> Cheers,
> Oliver
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com