Re: [ceph-users] OSD + Flashcache + udev + Partition uuid

2015-03-20 Thread Burkhard Linke

Hi,

On 03/19/2015 10:41 PM, Nick Fisk wrote:

I'm looking at trialling OSDs with a small flashcache device on top of them to
hopefully reduce the impact of metadata updates when doing small block I/O.
Inspiration from here:-

http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/12083

One thing I suspect will happen is that when the OSD node starts up, udev
could mount the base OSD partition instead of the flashcached device,
since the base disk carries the Ceph partition UUID type. This could result
in quite nasty corruption.
I ran into this problem with an enhanceio-based cache for one of our
database servers.


I think you can prevent this problem by using bcache, which is also 
integrated into the official kernel tree. It does not act as a drop-in 
replacement, but creates a new device that is only available once the 
cache is initialized correctly. A GPT partition table on the bcache device 
should be enough to allow the standard udev rules to kick in.


I haven't used bcache in this scenario yet, and I cannot comment on its 
speed and reliability compared to other solutions. But from an 
operational point of view it is "safer" than enhanceio/flashcache.
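
A rough sketch of such a setup (untested here; device names and the
cache set UUID are examples):

# format the SSD partition as cache and the OSD disk as backing device
make-bcache -C /dev/nvme0n1p1
make-bcache -B /dev/sdc
# once both devices are registered (normally done by the bcache udev rules),
# attach the cache set to the backing device; /dev/bcache0 only appears
# after the cache is set up, so the raw disk cannot be mounted by accident
echo <cache-set-uuid> > /sys/block/bcache0/bcache/attach
# finally, create the GPT with the Ceph partition type on /dev/bcache0,
# not on /dev/sdc, so only the bcache device matches the ceph udev rules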


Best regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 'pgs stuck unclean ' problem

2015-03-20 Thread Burkhard Linke

Hi,


On 03/20/2015 01:58 AM, houguanghua wrote:

Dear all,
Ceph 0.72.2 is deployed on three hosts, but the cluster's status is 
 HEALTH_WARN . The status is as follows:


 # ceph -s
cluster e25909ed-25d9-42fd-8c97-0ed31eec6194
 health HEALTH_WARN 768 pgs degraded; 768 pgs stuck unclean;
recovery 2/3 objects degraded (66.667%)
 monmap e3: 3 mons at

{ceph-node1=192.168.57.101:6789/0,ceph-node2=192.168.57.102:6789/0,ceph-node3=192.168.57.103:6789/0},
election epoch 34, quorum 0,1,2 ceph-node1,ceph-node2,ceph-node3
 osdmap e170: 9 osds: 9 up, 9 in
  pgmap v1741: 768 pgs, 7 pools, 36 bytes data, 1 objects
367 MB used, 45612 MB / 45980 MB avail
2/3 objects degraded (66.667%)
 768 active+degraded



*snipsnap*


Other info is depicted here.

# ceph osd tree
# id    weight  type name       up/down reweight
-1  0   root default
-7  0   rack rack03
-4  0   host ceph-node3
6   0   osd.6   up  1
7   0   osd.7   up  1
8   0   osd.8   up  1
-6  0   rack rack02
-3  0   host ceph-node2
3   0   osd.3   up  1
4   0   osd.4   up  1
5   0   osd.5   up  1
-5  0   rack rack01
-2  0   host ceph-node1
0   0   osd.0   up  1
1   0   osd.1   up  1
2   0   osd.2   up  1

The weights of all OSD devices are 0. As a result, CRUSH considers all 
OSDs unusable and will not place any objects on them.


This problem usually occurs in test setups with very small OSD devices. 
If this is the case in your setup, you can adjust the weight of the OSDs 
manually (see the sketch below) or use larger devices. If your devices do 
have a sufficient size, you need to check why the OSD weights were not set 
accordingly.
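
For example (a rough sketch; pick weights that reflect the actual device sizes):

# assign a small non-zero CRUSH weight to each of the nine OSDs
for i in $(seq 0 8); do
    ceph osd crush reweight osd.$i 0.05
done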


Best regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Disabling btrfs snapshots for existing OSDs

2015-04-23 Thread Burkhard Linke

Hi,

I have a small number of OSDs running on Ubuntu Trusty 14.04 with Ceph 
Firefly 0.80.9. Due to stability issues I would like to disable the 
btrfs snapshot feature (filestore btrfs snap = false).


Is it possible to apply this change to an existing OSD (stop OSD, change 
config, restart OSD), or do I need to recreate the OSD from scratch?
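
The change I intend to apply would roughly look like this (assuming it is
safe for existing OSDs; the OSD id is an example, Upstart syntax as on Trusty):

stop ceph-osd id=58
# add to /etc/ceph/ceph.conf on the OSD host:
#   [osd]
#   filestore btrfs snap = false
start ceph-osd id=58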


Best regards,
Burkhard

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] "Compacting" btrfs file storage

2015-04-23 Thread Burkhard Linke

Hi,

I've noticed that the btrfs file storage is performing some 
cleanup/compacting operations during OSD startup.


Before OSD start:
/dev/sdc1  2.8T  2.4T  390G  87% /var/lib/ceph/osd/ceph-58

After OSD start:
/dev/sdc1  2.8T  2.2T  629G  78% /var/lib/ceph/osd/ceph-58

OSDs are configured with firefly default settings.

This "compacting" of the underlying storage happens during the PG 
loading phase of the OSD start.


Is it possible to trigger this compacting without restarting the OSD?

Best regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-fuse unable to run through "screen" ?

2015-04-23 Thread Burkhard Linke

Hi,

I had a similar problem during reboots. It was solved by adding 
'_netdev' to the options of the fstab entry. Otherwise the system may 
try to mount the CephFS mount point before the network is available.


This solution is for Ubuntu, YMMV.
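
For reference, the fstab entry now looks roughly like this (client id,
config path and mount point are examples):

id=admin,conf=/etc/ceph/ceph.conf  /ceph  fuse.ceph  defaults,noatime,_netdev  0  0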

Best regards,
Burkhard

--
Dr. rer. nat. Burkhard Linke
Bioinformatics and Systems Biology
Justus-Liebig-University Giessen
35392 Giessen, Germany
Phone: (+49) (0)641 9935810

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD move after reboot

2015-04-23 Thread Burkhard Linke

Hi,

On 04/23/2015 11:18 AM, Jake Grimmett wrote:

Dear All,

I have multiple disk types (15k & 7k) on each ceph node, which I 
assign to different pools, but have a problem: whenever I reboot a 
node, the OSDs move in the CRUSH map.


i.e. on host ceph4, before a reboot I have this osd tree

-10  7.68980 root 15k-disk
(snip)
 -9  2.19995 host ceph4-15k

*snipsnap*

 -1 34.96852 root 7k-disk
(snip)
 -5  7.36891 host ceph4

*snipsnap*

After a reboot I have this:

-10  5.48985 root 15k-disk
 -6  2.19995 host ceph1-15k
 32  0.54999 osd.32 up  1.0 1.0
 33  0.54999 osd.33 up  1.0 1.0
 34  0.54999 osd.34 up  1.0 1.0
 35  0.54999 osd.35 up  1.0 1.0
 -7  0        host ceph2-15k
 -8  0        host ceph3-15k
 -9  0        host ceph4-15k
-1 37.16847 root 7k-disk
(snip)
 -5  9.56886 host ceph4

*snipsnap*



My current kludge is to just put a series of "osd crush set" lines 
like this in rc.local:


ceph osd crush set osd.44 0.54999 root=15k-disk host=ceph4-15k

*snipsnap*

Upon reboot, the OSD updates its location in the CRUSH tree by default. 
It uses the hostname of the box (output of 'hostname -s') if no other 
location information is given.


You can either disable the location update entirely or define a custom 
location (either fixed or via a hook script), e.g. as sketched below. See 
the "CRUSH LOCATION" paragraph on 
http://docs.ceph.com/docs/master/rados/operations/crush-map/
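
Possible ceph.conf snippets (the hook path and OSD id are only examples):

[osd]
osd crush update on start = false      # keep the manually set locations

# or pin a fixed location per OSD:
[osd.44]
osd crush location = root=15k-disk host=ceph4-15k

# or compute the location with a script:
[osd]
osd crush location hook = /usr/local/bin/ceph-crush-location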


Best regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cephfs: recovering from transport endpoint not connected?

2015-04-27 Thread Burkhard Linke

Hi,

I've deployed ceph on a number of nodes in our compute cluster (Ubuntu 
14.04 Ceph Firefly 0.80.9). /ceph is mounted via ceph-fuse.


From time to time some nodes lose their access to CephFS with the 
following error message:


# ls /ceph
ls: cannot access /ceph: Transport endpoint is not connected

The ceph client log contains entries like:
2015-04-22 14:25:42.834607 7fcca6fa07c0  0 ceph version 0.80.9 
(b5a67f0e1d15385bc0d60a6da6e7fc810bde6047), process ceph-fuse, pid 156483


2015-04-26 17:23:15.430052 7f08570777c0  0 ceph version 0.80.9 
(b5a67f0e1d15385bc0d60a6da6e7fc810bde6047), process ceph-fuse, pid 140778

2015-04-26 17:23:15.625731 7f08570777c0 -1 fuse_parse_cmdline failed.
2015-04-26 17:23:18.921788 7f5bc299b7c0  0 ceph version 0.80.9 
(b5a67f0e1d15385bc0d60a6da6e7fc810bde6047), process ceph-fuse, pid 140807

2015-04-26 17:23:19.166199 7f5bc299b7c0 -1 fuse_parse_cmdline failed.

Re-mounting resolves the problem, but it may not always be possible due to 
processes with (now stale) access to the mount point. Is there a better 
way to resolve this problem (especially without remounting)?
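
The re-mount we currently perform is roughly this (a workaround, not a fix;
the monitor address is an example):

fusermount -uz /ceph
ceph-fuse -m mon1.example.com:6789 /ceph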


Best regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs: recovering from transport endpoint not connected?

2015-04-28 Thread Burkhard Linke

Hi,

On 04/27/2015 02:31 PM, Yan, Zheng wrote:

On Mon, Apr 27, 2015 at 3:42 PM, Burkhard Linke
 wrote:

Hi,

I've deployed ceph on a number of nodes in our compute cluster (Ubuntu 14.04
Ceph Firefly 0.80.9). /ceph is mounted via ceph-fuse.

 From time to time some nodes lose their access to cephfs with the following
error message:

# ls /ceph
ls: cannot access /ceph: Transport endpoint is not connected

looks like ceph-fuse was crashed. please check if there is any crash
related information in the client log files


The ceph client log contains the entries like:
2015-04-22 14:25:42.834607 7fcca6fa07c0  0 ceph version 0.80.9
(b5a67f0e1d15385bc0d60a6da6e7fc810bde6047), process ceph-fuse, pid 156483

2015-04-26 17:23:15.430052 7f08570777c0  0 ceph version 0.80.9
(b5a67f0e1d15385bc0d60a6da6e7fc810bde6047), process ceph-fuse, pid 140778
2015-04-26 17:23:15.625731 7f08570777c0 -1 fuse_parse_cmdline failed.
2015-04-26 17:23:18.921788 7f5bc299b7c0  0 ceph version 0.80.9
(b5a67f0e1d15385bc0d60a6da6e7fc810bde6047), process ceph-fuse, pid 140807
2015-04-26 17:23:19.166199 7f5bc299b7c0 -1 fuse_parse_cmdline failed.
The lines above are the complete log output in ceph-client.admin.log for 
the day the filesystem was mounted (04/22) and the day it became 
unavailable (04/26). The problem is not reproducible (or the trigger is 
not known yet) and has affected several hosts during the last weeks.


I'll update to hammer today, maybe it resolves this problem.

Best regards,
Burkhard Linke
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Btrfs defragmentation

2015-05-07 Thread Burkhard Linke

Hi,

On 05/07/2015 12:04 PM, Lionel Bouton wrote:

On 05/06/15 19:51, Lionel Bouton wrote:

*snipsnap*

We've seen progress on this front. Unfortunately for us we had 2 power
outages and they seem to have damaged the disk controller of the system
we are testing Btrfs on: we just had a system crash.
On the positive side this gives us an update on the OSD boot time.

With a freshly booted system without anything in cache :
- the first Btrfs OSD we installed loaded the pgs in ~1mn30s which is
half of the previous time,
- the second Btrfs OSD where defragmentation was disabled for some time
and was considered more fragmented by our tool took nearly 10 minutes to
load its pgs (and even spent 1mn before starting to load them).
- the third Btrfs OSD which was always defragmented took 4mn30 seconds
to load its pgs (it was considered more fragmented than the first and
less than the second).

My current assumption is that the defragmentation process we use can't
handle large spikes of writes (at least when originally populating the
OSD with data through backfills) but then can repair the damage on
performance they cause at least partially (it's still slower to boot
than the 3 XFS OSDs on the same system where loading pgs took 6-9 seconds).
In the current setup the defragmentation is very slow to process because
I set it up to generate very little load on the filesystems it processes
: there may be room to improve.


Part of the OSD boot-up process is also the handling of existing 
snapshots and the journal replay. I've also had several btrfs-based OSDs 
that took up to 20-30 minutes to start, especially after a crash. During 
journal replay the OSD daemon creates a number of new snapshots for its 
operations (newly created snap_XYZ directories that vanish after a short 
time). This snapshotting probably adds further overhead to the OSD startup 
time.
I have disabled snapshots in my setup now, since the stock Ubuntu Trusty 
kernel had some stability problems with btrfs.


I also had to establish cron jobs for rebalancing the btrfs partitions. 
This compacts the extents and may reduce the total amount of space used. 
Unfortunately this procedure is not a default in most distributions (it 
definitely should be!). The problems associated with unbalanced extents 
should have been solved in kernel 3.18, but I haven't had the time to 
verify that yet.
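
The cron job is essentially a loop like this (the usage thresholds are a
local choice):

#!/bin/sh
# weekly rebalance of all OSD filesystems on this host
for fs in /var/lib/ceph/osd/ceph-*; do
    btrfs balance start -dusage=50 -musage=50 "$fs"
done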


As a side note: I had several OSDs with dangling snapshots (more than the 
two usually handled by the OSD). They are probably due to crashed OSD 
daemons. You have to remove them manually, otherwise they start to 
consume disk space.
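
Removing them boils down to something like this (the OSD path is an
example; keep 'current' and the snap_* subvolumes the OSD still references):

btrfs subvolume list /var/lib/ceph/osd/ceph-58
btrfs subvolume delete /var/lib/ceph/osd/ceph-58/snap_<stale-transaction-id>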


Best regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Fwd: Re: Unexpected disk write activity with btrfs OSDs

2015-06-19 Thread Burkhard Linke


Forgot to send the reply to the list...

 Forwarded Message 
Subject:Re: [ceph-users] Unexpected disk write activity with btrfs OSDs
Date:   Fri, 19 Jun 2015 09:06:33 +0200
From:   Burkhard Linke 
To: Lionel Bouton 



Hi,

On 06/18/2015 11:28 PM, Lionel Bouton wrote:

Hi,

*snipsnap*


- Disks with btrfs OSD have a spike of activity every 30s (2 intervals
of 10s with nearly 0 activity, one interval with a total amount of
writes of ~120MB). The averages are : 4MB/s, 100 IO/s.


Just a guess:

btrfs has a commit interval which defaults to 30 seconds.

You can verify this by changing the interval with the commit=XYZ mount
option, e.g. as sketched below.
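
For example (assuming a kernel recent enough to support btrfs' commit=
option; the mount point is an example):

# raise the commit interval on one OSD and check whether the write
# spikes move from every 30 s to every 60 s
mount -o remount,commit=60 /var/lib/ceph/osd/ceph-58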

Best regards,
Burkhard



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Removing empty placement groups / empty objects

2015-06-29 Thread Burkhard Linke

Hi,

I've noticed that a number of placement groups in our setup contain 
objects, but no actual data
(ceph pg dump | grep remapped during a hard disk replacement operation):

7.616   26360   0   52720   4194304 3003 3003
active+remapped+wait_backfill   2015-06-29 13:43:28.716687
159913'33987  160091:526298   [30,6,36]   30
  [30,36,3]   30   153699'33892  2015-06-29 07:30:16.030470
  149573'32565  2015-06-23 07:00:21.948563
7.60a   26960   0   53920   0   3046 3046
active+remapped+wait_backfill   2015-06-29 13:43:09.847541
159919'34627  160091:388532   [2,36,3]   2
  [2,36,31]   2   153669'34496  2015-06-28 20:09:51.850005
  153669'34496  2015-06-28 20:09:51.850005
7.60d   26940   2   53880   0   3026 3026
active+remapped+wait_backfill   2015-06-29 13:43:27.202928
159939'33708  160091:392535   [31,6,38]   31
  [31,38,3]   31   152584'33610  2015-06-29 07:11:37.484500
  152584'33610  2015-06-29 07:11:37.484500



Pool 7 was used as a data pool in CephFS, but almost all files stored in 
that pool have been removed:

~# rados df
pool name KB  objects   clones degraded
unfound   rd   rd KB   wr   wr KB
cephfs_test_data   940066  55378380 202   
0  2022238   1434381904 21823705   3064326550


Is it possible to remove these "zombie" objects, since they influence 
maintenance operations like backfilling or recovery?


Best regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Removing empty placement groups / empty objects

2015-07-01 Thread Burkhard Linke

Hi,

On 07/01/2015 06:09 PM, Gregory Farnum wrote:

On Mon, Jun 29, 2015 at 1:44 PM, Burkhard Linke
 wrote:

Hi,

I've noticed that a number of placement groups in our setup contain objects,
but no actual data
(ceph pg dump | grep remapped during a hard disk replace operation):

7.616   26360   0   52720   4194304 3003 3003
active+remapped+wait_backfill   2015-06-29 13:43:28.716687  159913'33987
160091:526298   [30,6,36]   30
   [30,36,3]   30  153699'33892  2015-06-29 07:30:16.030470
149573'32565  2015-06-23 07:00:21.948563
7.60a   26960   0   53920   0   3046 3046
active+remapped+wait_backfill   2015-06-29 13:43:09.847541  159919'34627
160091:388532   [2,36,3]   2
[2,36,31]   2   153669'34496  2015-06-28 20:09:51.850005
153669'34496  2015-06-28 20:09:51.850005
7.60d   26940   2   53880   0   3026 3026
active+remapped+wait_backfill   2015-06-29 13:43:27.202928  159939'33708
160091:392535   [31,6,38]   31
   [31,38,3]   31  152584'33610  2015-06-29 07:11:37.484500
152584'33610  2015-06-29 07:11:37.484500


Pool 7 was used as a data pool in cephfs, but almost all files stored in that
pool have been removed:
~# rados df
pool name KB  objects   clones degraded  unfound
rd   rd KB   wr   wr KB
cephfs_test_data   940066  55378380 202   0
2022238   1434381904 21823705   3064326550

Is it possible to remove these "zombie" objects, since they influence
maintenance operations like backfilling or recovery?

That's odd; the actual objects should have been deleted (not just
truncated). Have you used this pool for anything else (CephFS metadata
storage, RGW bucket indexes, etc)? What version of Ceph are you
running and what workload did you do to induce this issue?
Ceph version is 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3), 
running on Ubuntu 14.04 with kernel 3.13.0-55-generic.


The cephfs_test_data pool has only been used as a CephFS data pool in a 
backup scenario using rsync. It contained a mix of files resulting from 
several rsync attempts from a failing NAS device. Most files were small 
(kbyte range). The total number of files in that pool was about 10-15 
million before almost all files were removed. The total size of the pool 
was about 10 TB.


Since I want to remove the pool completely, I'm currently trying to 
locate the remaining files in the filesystem (roughly as sketched below), 
but that's a low-priority task at the moment.
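
A simple scan over the layout xattrs, assuming the client exposes the
ceph.file.layout.pool virtual xattr (path and pool name as in our setup):

find /ceph -type f | while read -r f; do
    pool=$(getfattr -n ceph.file.layout.pool --only-values --absolute-names "$f" 2>/dev/null)
    [ "$pool" = "cephfs_test_data" ] && echo "$f"
done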


Best regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] any recommendation of using EnhanceIO?

2015-07-02 Thread Burkhard Linke

Hi,

On 07/01/2015 10:13 PM, German Anders wrote:

Hi cephers,

Is anyone out there that implement enhanceIO in a production 
environment? any recommendation? any perf output to share with the 
diff between using it and not?


I've used EnhanceIO as an accelerator for our MySQL server, but I had to 
discard it after a fatal kernel crash related to the module.


In my experience it works stably in write-through mode, but write-back 
is buggy. Since the latter is the interesting one in almost any use 
case, I would not recommend using it.


Best regards,
Burkhard Linke
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] State of nfs-ganesha CEPH fsal

2015-07-27 Thread Burkhard Linke

Hi,

the nfs-ganesha documentation states:

"... This FSAL links to a modified version of the CEPH library that has 
been extended to expose its distributed cluster and replication 
facilities to the pNFS operations in the FSAL. ... The CEPH library 
modifications have not been merged into the upstream yet. "


(https://github.com/nfs-ganesha/nfs-ganesha/wiki/Fsalsupport#ceph)

Is this still the case with the hammer release?

Best regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] State of nfs-ganesha CEPH fsal

2015-07-28 Thread Burkhard Linke

Hi,

On 07/27/2015 05:42 PM, Gregory Farnum wrote:

On Mon, Jul 27, 2015 at 4:33 PM, Burkhard Linke
 wrote:

Hi,

the nfs-ganesha documentation states:

"... This FSAL links to a modified version of the CEPH library that has been
extended to expose its distributed cluster and replication facilities to the
pNFS operations in the FSAL. ... The CEPH library modifications have not
been merged into the upstream yet. "

(https://github.com/nfs-ganesha/nfs-ganesha/wiki/Fsalsupport#ceph)

Is this still the case with the hammer release?

The FSAL has been upstream for quite a while, but it's not part of our
regular testing yet and I'm not sure what it gets from the Ganesha
side. I'd encourage you to test it, but be wary — we had a recent
report of some issues we haven't been able to set up to reproduce yet.
Can you give some details on those issues? I'm currently looking for a 
way to provide NFS-based access to CephFS for our desktop machines.


The kernel NFS implementation in Ubuntu had some problems with CephFS in 
our setup, which I have not been able to resolve yet. Ganesha seems more 
promising, since it uses libcephfs directly and does not need a 
mount point of its own.
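
The export definition in my test setup is essentially a minimal Ceph FSAL
block like this (export id, pseudo path and squash setting are local choices):

EXPORT
{
    Export_ID = 1;
    Path = "/";
    Pseudo = "/cephfs";
    Access_Type = RW;
    Squash = No_Root_Squash;
    FSAL {
        Name = CEPH;
    }
}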


Best regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] State of nfs-ganesha CEPH fsal

2015-07-28 Thread Burkhard Linke

Hi,

On 07/28/2015 11:08 AM, Haomai Wang wrote:

On Tue, Jul 28, 2015 at 4:47 PM, Gregory Farnum  wrote:

On Tue, Jul 28, 2015 at 8:01 AM, Burkhard Linke
 wrote:


*snipsnap*
Can you give some details on that issues? I'm currently looking for 
a way to provide NFS based access to CephFS to our desktop machines. 

Ummm...sadly I can't; we don't appear to have any tracker tickets and
I'm not sure where the report went to. :( I think it was from
Haomai...

My fault, I should report this to ticket.

I have forgotten the details about the problem, I submit the infos to IRC :-(

It related to the "ls" output. It will print the wrong user/group
owner as "-1", maybe related to root squash?
Are you sure this problem is related to the CephFS FSAL? I also had a 
hard time setting up Ganesha correctly with respect to user and group 
mappings, in particular with a kerberized setup.


I'm currently running a small test setup with one server and one client 
to single out the last Kerberos-related problems (nfs-ganesha 2.2.0 / 
Ceph Hammer 0.94.2 / Ubuntu 14.04). User/group listings have been OK so 
far. Do you remember whether the problem occurs every time or just 
sporadically?


Best regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Group permission problems with CephFS

2015-08-04 Thread Burkhard Linke

Hi,

I've encountered some problems accesing files on CephFS:

$ ls -al syntenyPlot.png
-rw-r----- 1 edgar edgar 9329 Jun 11  2014 syntenyPlot.png

$ groups
... edgar ...

$ cat syntenyPlot.png
cat: syntenyPlot.png: Permission denied

CephFS is mounted via ceph-fuse:
ceph-fuse on /ceph type fuse.ceph-fuse 
(rw,nosuid,nodev,noatime,allow_other,default_permissions)


OS is Ubuntu 14.04, Ceph version is 0.94.2

I've isolated a test machine and activated debugging (debug_client = 
20/20). The following lines correspond to the 'cat' invocation:


2015-08-04 12:59:44.030372 7f574dffb700 20 client.421984 _ll_get 
0x7f5758024da0 100022310be -> 13
2015-08-04 12:59:44.030398 7f574dffb700  3 client.421984 ll_getattr 
100022310be.head
2015-08-04 12:59:44.030403 7f574dffb700 10 client.421984 _getattr mask 
pAsLsXsFs issued=1
2015-08-04 12:59:44.030413 7f574dffb700 10 client.421984 fill_stat on 
100022310be snap/devhead mode 042770 mtime 2014-06-12 09:31:39.00 
ctime 2015-07-31 14:17:12.364416
2015-08-04 12:59:44.030426 7f574dffb700  3 client.421984 ll_getattr 
100022310be.head = 0
2015-08-04 12:59:44.030443 7f574dffb700  3 client.421984 ll_forget 
100022310be 1
2015-08-04 12:59:44.030447 7f574dffb700 20 client.421984 _ll_put 
0x7f5758024da0 100022310be 1 -> 12
2015-08-04 12:59:44.030459 7f574dffb700 20 client.421984 _ll_get 
0x7f5758024da0 100022310be -> 13
2015-08-04 12:59:44.030463 7f574dffb700  3 client.421984 ll_lookup 
0x7f5758024da0 syntenyPlot.png
2015-08-04 12:59:44.030469 7f574dffb700 20 client.421984 _lookup have dn 
syntenyPlot.png mds.-1 ttl 0.00 seq 0
2015-08-04 12:59:44.030476 7f574dffb700 10 client.421984 _lookup 
100022310be.head(ref=3 ll_ref=13 cap_refs={} open={} mode=42770 size=0/0 
mtime=2014-06-12 09:31:39.00 caps=pAsLsXsFs(0=pAsLsXsFs) COMPLETE 
parents=0x7f57580261d0 0x7f5758024da0) syntenyPlot.png = 
1000223121e.head(ref=2 ll_ref=20 cap_refs={} open={} mode=100640 
size=9329/0 mtime=2014-06-11 09:05:47.00 
caps=pAsLsXsFscr(0=pAsLsXsFscr) objectset[1000223121e ts 0/0 objects 0 
dirty_or_tx 0] parents=0x7f575802d290 0x7f575802c5a0)
2015-08-04 12:59:44.030530 7f574dffb700 10 client.421984 fill_stat on 
1000223121e snap/devhead mode 0100640 mtime 2014-06-11 09:05:47.00 
ctime 2015-08-04 11:07:53.623370
2015-08-04 12:59:44.030539 7f574dffb700 20 client.421984 _ll_get 
0x7f575802c5a0 1000223121e -> 21
2015-08-04 12:59:44.030542 7f574dffb700  3 client.421984 ll_lookup 
0x7f5758024da0 syntenyPlot.png -> 0 (1000223121e)
2015-08-04 12:59:44.030555 7f574dffb700  3 client.421984 ll_forget 
100022310be 1
2015-08-04 12:59:44.030558 7f574dffb700 20 client.421984 _ll_put 
0x7f5758024da0 100022310be 1 -> 12
2015-08-04 12:59:44.030628 7f57467fc700 20 client.421984 _ll_get 
0x7f575802c5a0 1000223121e -> 22
2015-08-04 12:59:44.030645 7f57467fc700  3 client.421984 ll_getattr 
1000223121e.head
2015-08-04 12:59:44.030649 7f57467fc700 10 client.421984 _getattr mask 
pAsLsXsFs issued=1
2015-08-04 12:59:44.030659 7f57467fc700 10 client.421984 fill_stat on 
1000223121e snap/devhead mode 0100640 mtime 2014-06-11 09:05:47.00 
ctime 2015-08-04 11:07:53.623370
2015-08-04 12:59:44.030672 7f57467fc700  3 client.421984 ll_getattr 
1000223121e.head = 0
2015-08-04 12:59:44.030690 7f57467fc700  3 client.421984 ll_forget 
1000223121e 1
2015-08-04 12:59:44.030695 7f57467fc700 20 client.421984 _ll_put 
0x7f575802c5a0 1000223121e 1 -> 21
2015-08-04 12:59:44.030760 7f574e7fc700 20 client.421984 _ll_get 
0x7f575802c5a0 1000223121e -> 22
2015-08-04 12:59:44.030775 7f574e7fc700  3 client.421984 ll_open 
1000223121e.head 32768
2015-08-04 12:59:44.030779 7f574e7fc700  3 client.421984 ll_open 
1000223121e.head 32768 = -13 (0)
2015-08-04 12:59:44.030797 7f574e7fc700  3 client.421984 ll_forget 
1000223121e 1
2015-08-04 12:59:44.030802 7f574e7fc700 20 client.421984 _ll_put 
0x7f575802c5a0 1000223121e 1 -> 21



The return value of -13 in the open call corresponds to EACCES ('Permission denied').

The setup looks OK with respect to permissions. The root user is able to 
read the file in question. The owning user is also able to read the file 
(after sudo). The problem occurs on several hosts for a number of files, 
but not all files or all users on CephFS are affected. User and group 
information is stored in LDAP and made available via SSSD; ls -l 
displays the correct group and user names, and id(1) lists the correct 
ids and names.


Any hints on what's going wrong here?

Best regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unable to start libvirt VM when using cache tiering.

2015-08-05 Thread Burkhard Linke

Hi,

On 08/05/2015 02:13 PM, Pieter Koorts wrote:

Hi All,

This seems to be a weird issue. Firstly all deployment is done with 
"ceph-deploy" and 3 host machines acting as MON and OSD using the 
Hammer release on Ubuntu 14.04.3 and running KVM (libvirt).


When using vanilla CEPH, single rbd pool no log device or cache 
tiering, the virtual machine will start without any problem. I can see 
CEPH doing data work and the virtual machine runs the OS installer fine.


However, when I enable cache tiering so that I have a separate RBD and 
SSD pool, libvirt is unable to start any virtual machines at all, with 
an error about a feature I think CEPH disabled (not sure). This is 
entirely repeatable, as I did re-installs of the software and even the 
operating system.


I'm not sure if it is CEPH in this case but it only seems to happen 
when doing cache tiering which leads me to believe it has some of the 
blame.


*snipsnap*

Did you grant access to the cache tier pool for the libvirt ceph user?
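
e.g. something along these lines (client name and pool names as used in
your setup; check the existing caps first):

ceph auth get client.libvirt
ceph auth caps client.libvirt mon 'allow r' osd 'allow rwx pool=rbd, allow rwx pool=ssd'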

Best regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unable to start libvirt VM when using cache tiering.

2015-08-05 Thread Burkhard Linke

Hi,

On 08/05/2015 02:54 PM, Pieter Koorts wrote:

Hi Burkhard,

I seem to have missed that part, but even after allowing access 
(rwx) to the cache pool it still has a similar (not the same) problem. The 
VM process starts but it looks more like a dead or stuck process 
trying forever to start, and has high CPU (for the qemu-system-x86 
process). When I kill the process, as it never times out, I get the 
following error.


internal error: early end of file from monitor: possible problem: 
libust[6583/6583]: Warning: HOME environment variable not set. 
Disabling LTTng-UST per-user tracing. (in setup_local_apps() at 
lttng-ust-comm.c:305) libust[6583/6584]: Error: Error opening shm 
/lttng-ust-wait-5 (in get_wait_shm() at lttng-ust-comm.c:886) 
libust[6583/6584]: Error: Error opening shm /lttng-ust-wait-5 (in 
get_wait_shm() at lttng-ust-comm.c:886) libust[6583/6584]: Error: 
Error opening shm /lttng-ust-wait-5 (in get_wait_shm() at 
lttng-ust-comm.c:886) libust[6583/6584]: Error: Error opening shm 
/lttng-ust-wait-5 (in get_wait_shm() at lttng-ust-comm.c:886) 
libust[6583/6584]: Error: Error opening shm /lttng-ust-wait-5 (in 
get_wait_shm() at lttng-ust-comm.c:886) libust[6583/6584]: Error: 
Error opening shm /lttng-ust-wait-5 (in get_wait_shm() at 
lttng-ust-comm.c:886) libust[6583/6584]: Error: Error opening shm 
/lttng-ust-wait-5 (in get_wait_shm() at lttng-ust-comm.c:886)


I understand that there is something similar on launchpad and some 
replies refer to hammer disabling the feature causing the error with 
"lttng-ust-wait-5" but I still seem to get it.

At least the libvirt user is able to access both pools now.

Can you post the complete configuration of both pools (e.g. ceph osd dump 
| grep pool)? I remember having some trouble when configuring cache 
pools for the first time. You need to set all the relevant options 
(target size/objects etc.).


Best regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unable to start libvirt VM when using cache tiering.

2015-08-05 Thread Burkhard Linke

Hi,


On 08/05/2015 03:09 PM, Pieter Koorts wrote:

Hi,

This is my OSD dump below

###
osc-mgmt-1:~$ sudo ceph osd dump | grep pool
pool 0 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash 
rjenkins pg_num 128 pgp_num 128 last_change 43 lfor 43 flags 
hashpspool tiers 1 read_tier 1 write_tier 1 stripe_width 0
pool 1 'ssd' replicated size 3 min_size 2 crush_ruleset 1 object_hash 
rjenkins pg_num 128 pgp_num 128 last_change 44 flags 
hashpspool,incomplete_clones tier_of 0 cache_mode writeback hit_set 
bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 0s x0 
stripe_width 0

###

I have also attached my crushmap (plain text version) if that can 
provide any detail too.


This is the setup of my VM cache pool:
pool 9 'ssd_cache' replicated size 2 min_size 1 crush_ruleset 2 
object_hash rjenkins pg_num 128 pgp_num 128 last_change 182947 flags 
hashpspool,incomplete_clones tier_of 5 cache_mode writeback target_bytes 
5000 target_objects 100 hit_set 
bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 3600s 
x1 min_read_recency_for_promote 1 stripe_width 0


You probably need to set at least either target_bytes or 
target_objects. These are the values the flush/evict ratios of cache 
pools refer to.
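
For example (sizes are placeholders; pick values that match your SSD pool):

ceph osd pool set ssd target_max_bytes 100000000000      # ~100 GB
ceph osd pool set ssd target_max_objects 1000000
ceph osd pool set ssd cache_target_dirty_ratio 0.4
ceph osd pool set ssd cache_target_full_ratio 0.8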


Best regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unable to start libvirt VM when using cache tiering.

2015-08-05 Thread Burkhard Linke

Hi,

On 08/05/2015 05:54 PM, Pieter Koorts wrote:

Hi

I suspect something more sinister may be going on. I have set the 
values (though smaller) on my cluster but the same issue happens. I 
also find when the VM is trying to start there might be an IRQ flood 
as processes like ksoftirqd seem to use more CPU than they should.



pool 1 'ssd' replicated size 3 min_size 2 crush_ruleset 1 object_hash 
rjenkins pg_num 128 pgp_num 128 last_change 60 flags 
hashpspool,incomplete_clones tier_of 0 cache_mode writeback 
target_bytes 1200 target_objects 100 hit_set 
bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 1800s 
x1 stripe_width 0




You can check whether the cache pool operates correctly by using the 
ceph admin user and the rbd command line tool or qemu-img to create some 
objects in the pools, e.g.

qemu-img create -f raw rbd:<pool>/test 1G

rbd -p <pool> import <local file> test-image

(adjust pool and file names for your setup)

If this is working correctly the pool setup is fine.

Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Different filesystems on OSD hosts at the same cluster

2015-08-07 Thread Burkhard Linke

Hi,


On 08/07/2015 04:04 PM, Udo Lembke wrote:

Hi,
some time ago I switched all OSDs from XFS to ext4 (step by step).
I had no issues during mixed osd-format (the process takes some weeks).

And yes, for me ext4 performs also better (esp. the latencies).

Just out of curiosity:

Do you use an ext4 setup as described in the documentation? Did you try 
using external ext4 journals on an SSD?


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Different filesystems on OSD hosts at the same cluster

2015-08-07 Thread Burkhard Linke

Hi,

On 08/07/2015 04:30 PM, Udo Lembke wrote:

Hi,
I use the ext4-parameters like Christian Balzer wrote in one posting:
osd mount options ext4 = "user_xattr,rw,noatime,nodiratime"
osd_mkfs_options_ext4 = -J size=1024 -E lazy_itable_init=0,lazy_journal_init=0

Thx for the details.


The OSD journals are on SSD partitions (without a filesystem). IMHO ext4 doesn't 
support a different journal device like
xfs does, but I assume you mean the OSD journal and not the filesystem journal?!

No, I was indeed talking about the ext4 journal, e.g. as described here:

http://raid6.com.au/posts/fs_ext4_external_journal_caveats/

The setup is tempting (both the ext4 journal and the OSD journal on SSD), 
but the problem with persistent device names is keeping me from trying it.
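
For reference, such a setup would look roughly like this (untested here;
referencing the journal by UUID should sidestep the device naming problem):

mke2fs -O journal_dev /dev/sdb1                   # journal on an SSD partition
mkfs.ext4 -J device=/dev/sdb1 /dev/sdc1           # OSD filesystem using it
# or, to avoid unstable device names:
mkfs.ext4 -J device=UUID=<journal-uuid> /dev/sdc1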


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CephFS: removing default data pool

2015-09-28 Thread Burkhard Linke

Hi,

I've created CephFS with a certain data pool some time ago (using the 
firefly release). I've added additional pools in the meantime and moved 
all data to them. But a large number of empty (or very small) objects 
are left in the pool according to 'ceph df':


cephfs_test_data 7918M 0 45424G  6751721

The number of objects changes if new files are added to CephFS or deleted.

Does the first data pool play a special role, i.e. is it used to store 
additional information? How can I remove this pool? In the current 
configuration the pool is a burden both for recovery/backfilling (many 
objects) and for performance due to object creation/deletion.


Regards,
Burkhard


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] "stray" objects in empty cephfs data pool

2015-10-08 Thread Burkhard Linke

Hi,

I've moved all files from a CephFS data pool (EC pool with frontend 
cache tier) in order to remove the pool completely.


Some objects are left in the pools ('ceph df' output of the affected pools):

cephfs_ec_data   19  7565k 0 66288G   13

Listing the objects and the readable part of their 'parent' attribute:

# for obj in $(rados -p cephfs_ec_data ls); do echo $obj; rados -p 
cephfs_ec_data getxattr $obj parent | strings; done

1f6119f.
1f6119f
stray9
1f63fe5.
1f6119f
stray9
1f61196.
1f6119f
stray9
...

The names are valid CephFS object names. But the parent attribute does 
not contain the path of the file the object belongs to; instead the string 
'stray' is the only useful information (without dissecting the binary 
content of the parent attribute).


What are those objects and is it safe to remove the pool in this state?

Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] "stray" objects in empty cephfs data pool

2015-10-08 Thread Burkhard Linke

Hi John,

On 10/08/2015 12:05 PM, John Spray wrote:

On Thu, Oct 8, 2015 at 10:21 AM, Burkhard Linke
 wrote:

Hi,

*snipsnap*


I've moved all files from a CephFS data pool (EC pool with frontend cache
tier) in order to remove the pool completely.

Some objects are left in the pools ('ceph df' output of the affected pools):

 cephfs_ec_data   19  7565k 0 66288G   13

Listing the objects and the readable part of their 'parent' attribute:

# for obj in $(rados -p cephfs_ec_data ls); do echo $obj; rados -p
cephfs_ec_data getxattr $obj parent | strings; done
1f6119f.
1f6119f
stray9
1f63fe5.
1f6119f
stray9
1f61196.
1f6119f
stray9
...


*snipsnap*


Well, they're strays :-)

You get stray dentries when you unlink files.  They hang around either
until the inode is ready to be purged, or if there are hard links then
they hang around until something prompts ceph to "reintegrate" the
stray into a new path.
Thanks for the fast reply. During the transfer of all files from the EC 
pool to a standard replicated pool I copied each file to a new file 
name, removed the original one and renamed the copy. There might have 
been some processes with open files at that time, which might explain 
the stray objects.


I've also been able to locate some processes that might be the reason 
for these leftover files. I've terminated these processes, but the 
objects are still present in the pool. How long does purging an inode 
usually take?


You don't say what version you're running, so it's possible you're
running an older version (pre hammer, I think) where you're
experiencing either a bug holding up deletion (we've had a few) or a
bug preventing reintegration (we had one of those too).  The bugs
holding up deletion can usually be worked around with some client
and/or mds restarts.
The cluster is running on hammer. I'm going to restart the mds to try to 
get rid of these objects.


It isn't safe to remove the pool in this state.  The MDS is likely to
crash if it eventually gets around to trying to purge these files.
That's bad. Does the mds provide a way to get more information about 
these files, e.g. which client is blocking purging? We have about 3 
hosts working on CephFS, and checking every process might be difficult.


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] "stray" objects in empty cephfs data pool

2015-10-08 Thread Burkhard Linke

Hi John,

On 10/08/2015 01:03 PM, John Spray wrote:

On Thu, Oct 8, 2015 at 11:41 AM, Burkhard Linke
 wrote:


*snipsnap*


Thanks for the fast reply. During the transfer of all files from the EC pool
to a standard replicated pool I've copied the file to a new file name,
removed the orignal one and renamed the copy. There might have been some
processed with open files at that time, which might explain the stray files
objects.

I've also been able to locate some processes that might be the reason for
these leftover files. I've terminated these processes, but the objects are
still present in the pool. How long does purging an inode usually take?

If nothing is holding a file open, it'll start purging within a couple
of journal-latencies of the unlink (i.e. pretty darn quick), and it'll
take as long to purge as there are objects in the file (again, pretty
darn quick for normal-sized files and a non-overloaded cluster).
Chances are if you're noticing strays, they're stuck for some reason.
You're probably on the right track looking for processes holding files
open.


You don't say what version you're running, so it's possible you're
running an older version (pre hammer, I think) where you're
experiencing either a bug holding up deletion (we've had a few) or a
bug preventing reintegration (we had one of those too).  The bugs
holding up deletion can usually be worked around with some client
and/or mds restarts.

The cluster is running on hammer. I'm going to restart the mds to try to get
rid of these objects.

OK, let us know how it goes.  You may find the num_strays,
num_strays_purging, num_strays_delayted performance counters (ceph
daemon mds. perf dump) useful.
The number of objects dropped to 7 after the MDS restart. I was also 
able to identify the application the objects belong to (some were Perl 
modules), but I've been unable to locate a running instance of this 
application. The main user of this application is also not aware of any 
running instance at the moment.

It isn't safe to remove the pool in this state.  The MDS is likely to
crash if it eventually gets around to trying to purge these files.

That's bad. Does the mds provide a way to get more information about these
files, e.g. which client is blocking purging? We have about 3 hosts working
on CephFS, and checking every process might be difficult.

If a client has caps on an inode, you can find out about it by dumping
(the whole!) cache from a running MDS.  We have tickets for adding a
more surgical version of this[1] but for now it's bit of a heavyweight
thing.  You can do JSON ("ceph daemon mds. dump cache > foo.json")
or plain text ("ceph daemon mds. dump cache foo.txt").  The latter
version is harder to parse but is less likely to eat all the memory on
your MDS (JSON output builds the whole thing in memory before writing
it)!
Hammer 0.94.3 does not support a 'dump cache' mds command. 
'dump_ops_in_flight' does not list any pending operations. Is there any 
other way to access the cache?


'perf dump' stray information (after mds restart):
"num_strays": 2327,
"num_strays_purging": 0,
"num_strays_delayed": 0,
"strays_created": 33,
"strays_purged": 34,

The data pool is a combination of an EC pool and a cache tier. I've 
evicted the cache pool, resulting in 128 objects left (one per PG? hitset 
information?). After restarting the MDS the number of objects increases 
by 7 (the ones left in the data pool). So either the MDS rejoin 
process promotes them back to the cache, or some ceph-fuse instance 
insists on reading them.
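
The eviction was done with the usual flush/evict call, roughly:

rados -p ec_ssd_cache cache-flush-evict-all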



Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] "stray" objects in empty cephfs data pool

2015-10-12 Thread Burkhard Linke

Hi,

On 10/08/2015 09:14 PM, John Spray wrote:

On Thu, Oct 8, 2015 at 7:23 PM, Gregory Farnum  wrote:

On Thu, Oct 8, 2015 at 6:29 AM, Burkhard Linke
 wrote:

Hammer 0.94.3 does not support a 'dump cache' mds command.
'dump_ops_in_flight' does not list any pending operations. Is there any
other way to access the cache?

"dumpcache", it looks like. You can get all the supported commands
with "help" and look for things of interest or alternative phrasings.
:)

To head off any confusion for someone trying to just replace dump
cache with dumpcache: "dump cache" is the new (post hammer,
apparently) admin socket command, dumpcache is the old tell command.
So it's "ceph mds tell  dumpcache ".
Thanks, that did the trick. I was able to locate the host blocking the 
file handles and remove the objects from the EC pool.


Well, all except one:

# ceph df
  ...
ec_ssd_cache 18  4216k 0 2500G  129
cephfs_ec_data   19  4096k 0 31574G1

# rados -p ec_ssd_cache ls
1ef540f.0386
# rados -p cephfs_ec_data ls
1ef540f.0386
# ceph mds tell cb-dell-pe620r dumpcache cache.file
# grep 1ef540f /cache.file
#

It does not show up in the dumped cache file, but keeps being promoted 
to the cache tier after MDS restarts. I've restarted most of the cephfs 
clients by unmounting cephfs and restarting ceph-fuse, but the object 
remains active.


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CephFS and page cache

2015-10-16 Thread Burkhard Linke

Hi,

I've noticed that CephFS (both ceph-fuse and the kernel client in version 
4.2.3) removes files from the page cache as soon as they are no longer in 
use by a process.


Is this intended behaviour? We use CephFS as a replacement for NFS in 
our HPC cluster. It should serve large files which are read by multiple 
jobs on multiple hosts, so keeping them in the page cache over the 
duration of several job invocations is crucial.


Mount options are defaults,noatime,_netdev (+ extra options for the 
kernel client). Is there an option to keep data in page cache just like 
any other filesystem?


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS and page cache

2015-10-19 Thread Burkhard Linke

Hi,

On 10/19/2015 05:27 AM, Yan, Zheng wrote:

On Sat, Oct 17, 2015 at 1:42 AM, Burkhard Linke
 wrote:

Hi,

I've noticed that CephFS (both ceph-fuse and kernel client in version 4.2.3)
remove files from page cache as soon as they are not in use by a process
anymore.

Is this intended behaviour? We use CephFS as a replacement for NFS in our
HPC cluster. It should serve large files which are read by multiple jobs on
multiple hosts, so keeping them in the page cache over the duration of
several job invocations is crucial.

Yes. MDS needs resource to track the cached data. We don't want MDS
use too much resource.


Mount options are defaults,noatime,_netdev (+ extra options for the kernel
client). Is there an option to keep data in page cache just like any other
filesystem?

So far there is no option to do that. Later, we may add an option to
keep the cached data for a few seconds.


This renders CephFS useless for almost any HPC cluster application. And 
keeping data for a few seconds is not a solution in most cases.


CephFS supports capabilities to manage access to objects, enforce 
consistency of data, etc. IMHO a sane way to handle the page cache is to 
use a capability to inform the MDS about cached objects; as long as no 
other client claims write access to an object or its metadata, the cached 
copy is considered consistent. Upon write access the client should drop 
the capability (and thus remove the object from the page cache). If 
another process tries to access a cached object with an intact 'cache' 
capability, it may be promoted to a read/write capability.


I haven't dug into the details of either capabilities or the kernel page 
cache, but the method described above should be very similar to the 
existing read-only capability. I don't know whether there's a kind of 
eviction callback in the page cache that CephFS could use to update 
capabilities if an object is removed from the page cache (e.g. due to 
memory pressure), but I'm pretty sure that other filesystems like NFS 
also need to keep track of what's cached.


This approach will probably increase the resource usage of both the MDS 
and the CephFS clients, but the benefits are obvious. For use cases with 
limited resources the MDS may refuse the 'cache' capability to clients to 
reduce the memory footprint.


Just my 2 ct and regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS and page cache

2015-10-19 Thread Burkhard Linke

Hi,

On 10/19/2015 10:34 AM, Shinobu Kinjo wrote:

What kind of applications are you talking about regarding to applications
for HPC.

Are you talking about like netcdf?

Caching is quite necessary for some applications for computation.
But it's not always the case.

It's not quite related to this topic but I'm really interested in your
thought using Ceph cluster for HPC computation.
Our applications are in the field of bioinformatics. This involves read 
mapping, homology searches in databases, etc.


In almost all cases there's a fixed dataset or database, like the human 
genome with all read mapping index files (> 20 GB) or the database of 
all known protein sequences (> 25 GB). With enough RAM in the cluster 
machines most of these datasets can be kept in memory for subsequent 
processing runs.


These datasets are updated from time to time, so keeping them on network 
storage is simpler than distributing updates to copies on local hard 
disks. The latter would also require intensive interaction with the 
queuing system to ensure that one job array operates on a consistent 
dataset. It worked fine with NFS-based storage, but NFS introduces a 
single point of failure (except for pNFS).


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS and page cache

2015-10-19 Thread Burkhard Linke

Hi,

On 10/19/2015 12:34 PM, John Spray wrote:

On Mon, Oct 19, 2015 at 8:59 AM, Burkhard Linke
 wrote:

Hi,

On 10/19/2015 05:27 AM, Yan, Zheng wrote:

On Sat, Oct 17, 2015 at 1:42 AM, Burkhard Linke
 wrote:

Hi,

I've noticed that CephFS (both ceph-fuse and kernel client in version
4.2.3)
remove files from page cache as soon as they are not in use by a process
anymore.

Is this intended behaviour? We use CephFS as a replacement for NFS in our
HPC cluster. It should serve large files which are read by multiple jobs
on
multiple hosts, so keeping them in the page cache over the duration of
several job invocations is crucial.

Yes. MDS needs resource to track the cached data. We don't want MDS
use too much resource.


Mount options are defaults,noatime,_netdev (+ extra options for the
kernel
client). Is there an option to keep data in page cache just like any
other
filesystem?

So far there is no option to do that. Later, we may add an option to
keep the cached data for a few seconds.


This renders CephFS useless for almost any HPC cluster application. And
keeping data for a few seconds is not a solution in most cases.

While I appreciate your frustration, that isn't an accurate statement.
For example, many physics HPC workloads use a network filesystem for
snapshotting their progress, where they dump their computed dataset at
regular intervals.  In these instances, having a cache of the data in
the pagecache is rarely if ever useful.
I completely agree. HPC workloads differ depending on your field, 
and even within a certain field the workloads may vary. The examples 
mentioned in another mail are just that: examples. We also have other 
applications and other workloads. Traditional HPC clusters used to be 
isolated with respect to both compute nodes and storage; access was 
possible via a head node and maybe some NFS server. In our setup compute 
and storage are more integrated into the users' environment. I think the 
traditional model is becoming extinct in our field, given all the new 
developments of the last 15 years.




Moreover, in the general case of a shared filesystem with many nodes,
it is not to be assumed that the same client will be accessing the
same data repeatedly: there is an implicit hint in the use of a shared
filesystem that applications are likely to want to access that data
from different nodes, rather than the same node repeatedly.  Clearly
that is by no means true in all cases, but I think you may be
overestimating the generality of your own workload (not that we don't
want to make it faster for you)
As mentioned above, CephFS is not restricted to our cluster hosts. It is 
also available on interactive compute machines and even on desktops. On 
these machines users expect data to be present in the cache if they 
want to start a computation a second time, e.g. after adjusting some 
parameters. I don't mind file access being slow on the batch machines, 
but our users do mind slow access in their day-to-day work.



CephFS supports capabilities to manages access to objects, enforce
consistency of data etc. IMHO a sane way to handle the page cache is use a
capability to inform the mds about caches objects; as long as no other
client claims write access to an object or its metadata, the cache copy is
considered consistent. Upon write access the client should drop the
capability (and thus remove the object from the page cache). If another
process tries to access a cache object with intact 'cache' capability, it
may be promoted to read/write capability.

This is essentially what we already do, except that we pro-actively
drop the capability when files are closed, rather than keeping it
around on the client in case its needed again.

Having those caps linger on a client is a tradeoff:
  * while it makes subsequent cached reads from the original client
nice and fast, it adds latency for any other client that wants to open
the file.
I assume the same is also true with the current situation, if the file 
is already opened by another client.

  * It also adds latency for the original client when it wants to open
many other files, because it will have to wait for the original file's
capabilities to be given up before it has room in its metadata cache
to open other files.
  * it also creates confusion if someone opens a big file, then closes
it, then wonders why their ceph-fuse process is still sitting on gigs
of memory
I agree on that. ceph-fuse processes already become way too large in my 
opinion:

  PID USER  PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  902 root  20   0 3045056 1.680g   4328 S   0.0 21.5 338:23.78 ceph-fuse

(and that's just a web server with some Perl CGI stuff)

But the data itself should be stored in the page cache (I don't know 
whether a FUSE process can actually push data to the page cache).


Further, as Zheng pointed out, the design of cephfs requires that
whenever a client has capab

Re: [ceph-users] CephFS and page cache

2015-10-21 Thread Burkhard Linke

Hi,

On 10/22/2015 02:54 AM, Gregory Farnum wrote:

On Sun, Oct 18, 2015 at 8:27 PM, Yan, Zheng  wrote:

On Sat, Oct 17, 2015 at 1:42 AM, Burkhard Linke
 wrote:

Hi,

I've noticed that CephFS (both ceph-fuse and kernel client in version 4.2.3)
remove files from page cache as soon as they are not in use by a process
anymore.

Is this intended behaviour? We use CephFS as a replacement for NFS in our
HPC cluster. It should serve large files which are read by multiple jobs on
multiple hosts, so keeping them in the page cache over the duration of
several job invocations is crucial.

Yes. MDS needs resource to track the cached data. We don't want MDS
use too much resource.

So if I'm reading things right, the code to drop the page cache for
ceph-fuse was added in https://github.com/ceph/ceph/pull/1594
(specifically 82015e409d09701a7048848f1d4379e51dd00892). I don't think
it's actually needed for the cap trimming stuff or to prevent MDS
cache pressure and it's actually not clear to me why it was added here
anyway. But you do say the PR as a whole fixed a lot of bugs. Do you
know if the page cache clearing was for any bugs in particular, Zheng?

In general I think proactively clearing the page cache is something we
really only want to do as part of our consistency and cap handling
story, and file closes don't really play into that. I've pushed a
TOTALLY UNTESTED (NOT EVEN COMPILED) branch client-pagecache-norevoke
based on master to the gitbuilders. If it does succeed in building you
should be able to download it and you can use it for testing, or
cherry-pick the top commit out of git and build your own packages.
Then set the (new to this branch) client_preserve_pagecache config
option to true (default: false) and it should avoid flushing the page
cache.


Thanks a lot for having a closer look at this. I'm currently preparing 
the deployment of 0.94.4 (or 0.94.5 due to rbd bug), and need to add 
some patches to ceph-fuse for correct permission handling. I'll 
cherry-pick the changes of that branch and test the package.



Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] "stray" objects in empty cephfs data pool

2015-10-23 Thread Burkhard Linke

Hi,

On 10/14/2015 06:32 AM, Gregory Farnum wrote:

On Mon, Oct 12, 2015 at 12:50 AM, Burkhard Linke
 wrote:



*snipsnap*

Thanks, that did the trick. I was able to locate the host blocking the file
handles and remove the objects from the EC pool.

Well, all except one:

# ceph df
   ...
 ec_ssd_cache 18  4216k 0 2500G  129
 cephfs_ec_data   19  4096k 0 31574G1

# rados -p ec_ssd_cache ls
1ef540f.0386
# rados -p cephfs_ec_data ls
1ef540f.0386
# ceph mds tell cb-dell-pe620r dumpcache cache.file
# grep 1ef540f /cache.file
#

It does not show up in the dumped cache file, but keeps being promoted to
the cache tier after MDS restarts. I've restarted most of the cephfs clients
by unmounting cephfs and restarting ceph-fuse, but the object remains
active.

You can enable MDS debug logging and see if the inode shows up in the
log during replay. It's possible it's getting read in (from journal
operations) but then getting evicted from cache if nobody's accessing
it any more.
You can also look at the xattrs on the object to see what the
backtrace is and if that file is in cephfs.
After the last MDS restart the stray object was not promoted to the 
cache anymore:

ec_ssd_cache 18   120k 0 3842G  128
cephfs_ec_data   19  4096k 0 10392G1

There are no xattrs available for the stray object, so it's not possible 
to find out which file it belongs/belonged to:

# rados -p cephfs_ec_data ls
1ef540f.0386
# rados -p cephfs_ec_data listxattr 1ef540f.0386
#
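
For reference, on an object that still carries its backtrace the owning file can usually be derived from the "parent" xattr — a rough sketch, assuming ceph-dencoder from the ceph packages is available and using a placeholder object name:

# rados -p cephfs_ec_data getxattr <object name> parent > /tmp/parent
# ceph-dencoder type inode_backtrace_t import /tmp/parent decode dump_json

Since the stray object above has no xattrs at all, this does not help in this case.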

Is it possible to list pending journal operations to be on the safe side?

Regards,
Burkhard

--
Dr. rer. nat. Burkhard Linke
Bioinformatics and Systems Biology
Justus-Liebig-University Giessen
35392 Giessen, Germany
Phone: (+49) (0)641 9935810

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS and page cache

2015-10-28 Thread Burkhard Linke

Hi,

On 10/26/2015 01:43 PM, Yan, Zheng wrote:

On Thu, Oct 22, 2015 at 2:55 PM, Burkhard Linke
 wrote:

Hi,


On 10/22/2015 02:54 AM, Gregory Farnum wrote:

On Sun, Oct 18, 2015 at 8:27 PM, Yan, Zheng  wrote:

On Sat, Oct 17, 2015 at 1:42 AM, Burkhard Linke
 wrote:

Hi,

I've noticed that CephFS (both ceph-fuse and kernel client in version
4.2.3)
remove files from page cache as soon as they are not in use by a process
anymore.

Is this intended behaviour? We use CephFS as a replacement for NFS in
our
HPC cluster. It should serve large files which are read by multiple jobs
on
multiple hosts, so keeping them in the page cache over the duration of
several job invocations is crucial.

Yes. MDS needs resource to track the cached data. We don't want MDS
use too much resource.

So if I'm reading things right, the code to drop the page cache for
ceph-fuse was added in https://github.com/ceph/ceph/pull/1594
(specifically 82015e409d09701a7048848f1d4379e51dd00892). I don't think
it's actually needed for the cap trimming stuff or to prevent MDS
cache pressure and it's actually not clear to me why it was added here
anyway. But you do say the PR as a whole fixed a lot of bugs. Do you
know if the page cache clearing was for any bugs in particular, Zheng?

In general I think proactively clearing the page cache is something we
really only want to do as part of our consistency and cap handling
story, and file closes don't really play into that. I've pushed a
TOTALLY UNTESTED (NOT EVEN COMPILED) branch client-pagecache-norevoke
based on master to the gitbuilders. If it does succeed in building you
should be able to download it and you can use it for testing, or
cherry-pick the top commit out of git and build your own packages.
Then set the (new to this branch) client_preserve_pagecache config
option to true (default: false) and it should avoid flushing the page
cache.


Thanks a lot for having a closer look at this. I'm currently preparing the
deployment of 0.94.4 (or 0.94.5 due to rbd bug), and need to add some
patches to ceph-fuse for correct permission handling. I'll cherry-pick the
changes of that branch and test the package.



I have wrote patches for both kernel and fuse clients. they are under testing

https://github.com/ceph/ceph/pull/6380
https://github.com/ceph/ceph-client/commit/dfbb503e4e12580fc3d2952269104f293b0ec7e8
Great! I've applied the changes of the fuse client to the current 0.94.5 
source tree. Automatic cache invalidation does not occur any more:


start: 196280 cached Mem
cat'ing of some file on cephfs (~850MB): 1027556 cached Mem

After termination of the cat command the cached size stays at about 1 GB.

Unfortunately we're only halfway there:

dd'ing the first MB of the same file should be handled by the page cache 
(file is not changed on any other node). But cache size drops to 203244 
(~ start value above), so the file's content is evicted from cache by 
reopening the same file.


Debug output of ceph-fuse (debug_client = 10/10):

2015-10-28 17:40:38.647653 7f3a1700 10 client.904899 renew_caps()
2015-10-28 17:40:38.647764 7f3a1700 10 client.904899 renew_caps mds.0
2015-10-28 17:40:38.650445 7f3a1e7fc700 10 client.904899 
handle_client_session client_session(renewcaps seq 24) v1 from mds.0

2015-10-28 17:40:43.529085 7f39f387b700  3 client.904899 ll_getattr 1.head
2015-10-28 17:40:43.529149 7f39f387b700 10 client.904899 _getattr mask 
pAsLsXsFs issued=1
2015-10-28 17:40:43.529370 7f39f387b700 10 client.904899 fill_stat on 1 
snap/devhead mode 040755 mtime 2015-09-18 16:06:20.645030 ctime 
2015-09-18 16:06:20.645030
2015-10-28 17:40:43.529407 7f39f387b700  3 client.904899 ll_getattr 
1.head = 0

2015-10-28 17:40:43.529441 7f39f387b700  3 client.904899 ll_forget 1 1
2015-10-28 17:40:43.529876 7f3a01ffb700  3 client.904899 ll_lookup 
0x7f3a0c01b320 volumes
2015-10-28 17:40:43.529911 7f3a01ffb700 10 client.904899 _lookup 
1.head(ref=3 ll_ref=14 cap_refs={} open={} mode=40755 size=0/0 
mtime=2015-09-18 16:06:20.645030 caps=pAsLsXsFs(0=pAsLsXsFs) 
has_dir_layout 0x7f3a0c01b320) volumes = 19de0f2.head(ref=3 ll_ref=3 
cap_refs={} open={} mode=40755 size=0/0 mtime=2015-09-18 10:28:37.519639 
caps=pAsLsXsFs(0=pAsLsXsFs) parents=0x7f3a0c01dfd0 has_dir_layout 
0x7f3a0c01d210)
2015-10-28 17:40:43.529998 7f3a01ffb700 10 client.904899 fill_stat on 
19de0f2 snap/devhead mode 040755 mtime 2015-09-18 10:28:37.519639 
ctime 2015-09-18 10:28:37.519639
2015-10-28 17:40:43.530014 7f3a01ffb700  3 client.904899 ll_lookup 
0x7f3a0c01b320 volumes -> 0 (19de0f2)

2015-10-28 17:40:43.530036 7f3a01ffb700  3 client.904899 ll_forget 1 1
2015-10-28 17:40:43.530527 7f3a017fa700  3 client.904899 ll_getattr 
19de0f2.head
2015-10-28 17:40:43.530570 7f3a017fa700 10 client.904899 _getattr mask 
pAsLsXsFs issued=1
2015-10-28 17:40:43.530584 7f3a017fa700 10 client.904899 fill_stat on 
19de0f2 snap/devhead mode 040755 mtime

Re: [ceph-users] State of nfs-ganesha CEPH fsal

2015-10-28 Thread Burkhard Linke

Hi,

On 10/28/2015 03:08 PM, Dennis Kramer (DT) wrote:

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Sorry for raising this topic from the dead, but i'm having the same
issues with NFS-GANESHA /w the wrong user/group information.

Do you maybe have a working ganesha.conf? I'm assuming I might
mis-configured something in this file. It's also nice to have some
reference config file from a working FSAL CEPH, the sample config is
very minimalistic.

I also have another issue with files that are not immediately visible
in a NFS folder after another system (using the same NFS) has created
it. There seems to be a slight delay before all system have the same
directory listing. This can be enforced by creating a *new* file in
this directory which will cause a refresh on this folder. Changing
directories also helps on affected system(s).


I've been testing ganesha with a kerberos setup as alternative to 
kernel-nfs and re-exporting a ceph/ceph-fuse mountpoint (side note: 
ceph-fuse and kernel-nfs do not play well, use kernel cephfs in this 
case...)


The ganesha.conf I've used looks like this:

NFS_KRB5
{
PrincipalName = "nfs";
KeytabPath = /etc/krb5.keytab ;
Active_krb5 = true ;
}

NFSv4
{
# Set an alternative path for libnfsidmap configuration file
IdmapConf = /etc/idmapd.conf;
}

NFS_CORE_PARAM {
NFS_Protocols = 4;
}

EXPORT_DEFAULT {
Protocols = 4;
Transports = TCP;
SecType = "krb5p";
}

EXPORT {
Export_ID = 2;
Path = "/ceph_subdiretory_to_mount";
Pseudo = "/exported_name_of_the_subdirectory";
SecType = "krb5p";

FSAL {
Name = CEPH;
}
CLIENT {
Clients = ;
Access_Type = RW;
}
}

On the testclient I've mounted it with

mount.nfs :/exported_name_ /mnt -o 
rw,noatime,fsc,nfsvers=4,intr,ac,sec=krb5p


Accessing files work as expected:

$ ls /mnt
-bash: cd: /mnt: Permission denied
$ klist
klist: Credentials cache file '/tmp/krb5cc_XYZ' not found
$ kinit
Password for XYZ@XYZ:
$ klist
< ticket details >
$ ls /mnt
< directory content >

The difficult part is setting up kerberos correctly (keytab, id mapping 
etc.). It took me some time to figure it out. You need a very recent 
version of ganesha (I'm using 2.1.0). And you should test the setup 
before trying to use the ceph fsal, e.g. with a local directory:


EXPORT
{
 Export_ID = 3;
 Path = "/opt";
 Pseudo = "/test";
 SecType = "krb5p";
 FSAL {
Name = VFS;
 }
 CLIENT {
Clients = ;
Access_Type = RW;
 }
}

(different Export_ID and pseudo are mandatory!)

No tests with root squash so far, but at least the kerberos part is working.
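
For completeness, the libnfsidmap part referenced in the NFSv4 block above boils down to a minimal /etc/idmapd.conf along these lines (the domain is a placeholder for our local realm):

[General]
Domain = example.org

[Mapping]
Nobody-User = nobody
Nobody-Group = nogroup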

Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Benchmark individual OSD's

2015-10-29 Thread Burkhard Linke

Hi,

On 10/29/2015 09:54 AM, Luis Periquito wrote:

Only way I can think of that is creating a new crush rule that selects
that specific OSD with min_size = max_size = 1, then creating a pool
with size = 1 and using that crush rule.

Then you can use that pool as you'd use any other pool.

I haven't tested however it should work.
There's also the osd bench command that writes a certain amount of data 
to a given OSD:


# ceph tell osd.1 bench
{
"bytes_written": 1073741824,
"blocksize": 4194304,
"bytes_per_sec": 117403227.00
}

It might help you to figure out whether individual OSDs perform 
as expected. The amount of data written is limited (but there's a config 
setting for it). With 1 GB as in the example above, the write operation 
will probably be absorbed almost entirely by the journal.
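
If I remember correctly, both the total size and the block size can be passed on the command line, so a run large enough to spill past the journal would look roughly like this (the OSD may refuse very large values unless the osd bench limits in the config are raised):

# ceph tell osd.1 bench 10737418240 4194304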


Regards,
Burkhard


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS and page cache

2015-10-29 Thread Burkhard Linke

Hi,

On 10/29/2015 09:30 AM, Sage Weil wrote:

On Thu, 29 Oct 2015, Yan, Zheng wrote:

On Thu, Oct 29, 2015 at 2:21 PM, Gregory Farnum  wrote:

On Wed, Oct 28, 2015 at 8:38 PM, Yan, Zheng  wrote:

On Thu, Oct 29, 2015 at 1:10 AM, Burkhard Linke

I tried to dig into the ceph-fuse code, but I was unable to find the
fragment that is responsible for flushing the data from the page cache.


fuse kernel code invalidates page cache on opening file. you can
disable this behaviour by setting "fuse use invalidate cb" config
option to true.

With that option ceph-fuse finally works with page cache:

$ time cat /ceph/volumes/biodb/asn1/nr.3*.psq > /dev/null

real2m0.979s
user0m0.020s
sys0m3.164s
$ time cat /ceph/volumes/biodb/asn1/nr.3*.psq > /dev/null

real0m2.106s
user0m0.000s
sys0m1.996s


Zheng, do you know any reason we shouldn't make that the default value
now? There was a loopback deadlock (which is why it's disabled by
default) but I don't remember the details offhand well enough to know
if your recent work in those interfaces has fixed it. Or Sage?
-Greg

there is no loopback deadlock now, because we use a separate thread to
invalidate kernel page cache. I think we can enable this option
safely.

...as long as nobody blocks waiting for invalidate while holding a lock
(client_lock?) that could prevent other fuse ops like write (pretty sure
that was the deadlock we saw before).  I worry this could still happen
with a writer (or reader?) getting stuck in a check_caps() type situation
while the invalidate cb is waiting on a page lock held by the calling
kernel syscall...

I have created an issue to track this: http://tracker.ceph.com/issues/13640

It would be great if the patch were ported to one of the next hammer 
releases once the potential deadlock situation has been analysed.


Best regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Group permission problems with CephFS

2015-11-06 Thread Burkhard Linke

Hi,

On 11/06/2015 04:52 PM, Aaron Ten Clay wrote:

I'm seeing similar behavior as well.

-rw-rw-r-- 1 testuser testgroup 6 Nov  6 07:41 testfile
aaron@testhost$ groups
... testgroup ...
aaron@testhost$ cat > testfile
-bash: testfile: Permission denied

Running version 9.0.2. Were you able to make any progress on this?
There's a pending pull request that need to be included in the next 
release (see http://tracker.ceph.com/issues/12617)


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure coded pools and 'feature set mismatch'issue

2015-11-09 Thread Burkhard Linke

Hi,

On 11/09/2015 11:49 AM, Ilya Dryomov wrote:

*snipsnap*


You can install an ubuntu kernel from a newer ubuntu release, or pretty
much any mainline kernel from kernel-ppa.
Ubuntu Trusty has backported kernels from newer releases, e.g. 
linux-generic-lts-vivid. By using these packages you will also receive 
kernel updates from Vivid.
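
Installing it should be as simple as this (assuming a standard Trusty setup), followed by a reboot into the new kernel:

# apt-get update
# apt-get install linux-generic-lts-vivid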


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cephfs: Client hp-s3-r4-compute failing to respond to capabilityrelease

2015-11-09 Thread Burkhard Linke

Hi,

I'm currently investigating a lockup problem involving CephFS and SQLite 
databases. Applications lock up if the same database is accessed from 
multiple hosts.


I was able to narrow the problem down to two host:

host A:
sqlite3 
.schema

host B:
sqlite3 
.schema

If both .schema commands happen at the same time, both applications are 
blocked. Client 1332420 is host A in the example above, client ID 
1263969 is host B, the inode is the sqlite file:


ceph mds log:
2015-11-09 13:39:49.588024 7f6272805700  0 log_channel(cluster) log 
[WRN] : client.1263969 isn't responding to mclientcaps(revoke), ino 
10002c4e840 pending pAsLsXsFr issued pAsLsXsFscr, sent 245.303153 
seconds ago
2015-11-09 13:39:49.588520 7f6272805700  0 log_channel(cluster) log 
[WRN] : 1 slow requests, 1 included below; oldest blocked for > 
245.301935 secs
2015-11-09 13:39:49.588527 7f6272805700  0 log_channel(cluster) log 
[WRN] : slow request 245.301935 seconds old, received at 2015-11-09 
13:35:44.286527: client_request(client.1332420:97 getattr pAsLsXsFs 
#10002c4e840 2015-11-09 13:35:44.312820) currently failed to rdlock, waiting


ceph -s:
cluster 49098879-85ac-4c5d-aac0-e1a2658a680b
 health HEALTH_WARN
mds0: Client  failing to respond to capability release
mds0: Many clients (16) failing to respond to cache pressure

ceph mds cache dump (grepped for inode id):
inode 10002c4e840 [2,head] 
/volumes/adm/temp/test/sqlite/uniprot_sprot.dat.idx auth v183 ap=2+0 
s=53466112 n(v0 b53466112 1=1+0) (ifile sync->mix) (iversion lock) 
cr={1263969=0-109051904@1,1332420=0-134217728@1} 
caps={1263969=pAsLsXsFr/pAsLsXsFscr/pAsxXsxFsxcrwb@39,1332420=pAsLsXsFr/pAsxXsxFsxcrwb@63} 
| ptrwaiter=0 request=1 lock=1 caps=1 dirtyparent=0 dirty=0 waiter=1 
authpin=1 0x1480a3af8]



Cluster is running Hammer 0.94.5 on top of Ubuntu 14.04. Clients use 
ceph-fuse with patches for improved page cache handling, but the problem 
also occur with the official hammer packages from download.ceph.com


Any help with resolving this problem is appreciated.

Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs: Client hp-s3-r4-compute failing to respondtocapabilityrelease

2015-11-09 Thread Burkhard Linke

Hi,

On 11/09/2015 02:07 PM, Burkhard Linke wrote:

Hi,

*snipsnap*




Cluster is running Hammer 0.94.5 on top of Ubuntu 14.04. Clients use 
ceph-fuse with patches for improved page cache handling, but the 
problem also occur with the official hammer packages from 
download.ceph.com
I've tested the same setup with clients running kernel 4.2.5 and using 
the kernel cephfs client. I was not able to reproduce the problem in 
that setup.


Regards,
Burkhard

--
Dr. rer. nat. Burkhard Linke
Bioinformatics and Systems Biology
Justus-Liebig-University Giessen
35392 Giessen, Germany
Phone: (+49) (0)641 9935810

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs: Client hp-s3-r4-compute failing torespondtocapabilityrelease

2015-11-09 Thread Burkhard Linke

Hi,

On 11/09/2015 04:03 PM, Gregory Farnum wrote:

On Mon, Nov 9, 2015 at 6:57 AM, Burkhard Linke
 wrote:

Hi,

On 11/09/2015 02:07 PM, Burkhard Linke wrote:

Hi,

*snipsnap*



Cluster is running Hammer 0.94.5 on top of Ubuntu 14.04. Clients use
ceph-fuse with patches for improved page cache handling, but the problem
also occur with the official hammer packages from download.ceph.com

I've tested the same setup with clients running kernel 4.2.5 and using the
kernel cephfs client. I was not able to reproduce the problem in that setup.

What's the workload you're running, precisely? I would not generally
expect multiple accesses to a sqlite database to work *well*, but
offhand I'm not entirely certain why it would work differently between
the kernel and userspace clients. (Probably something to do with the
timing of the shared requests and any writes happening.)
Using SQLite on network filesystems is somewhat challenging, especially 
if multiple instances write to the database. The reproducible test case 
does not write to the database at all; it simply extracts the table 
structure from the default database. The applications itself only read 
from the database and do not modify anything. The underlying SQLite 
library may attempt to use locking to protect certain operations. 
According to dmesg the processes are blocked within fuse calls:


Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.543966] INFO: task 
ceph-fuse:6298 blocked for more than 120 seconds.
Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544014]   Not 
tainted 4.2.5-040205-generic #201510270124
Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544054] "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544119] ceph-fuse   
D 881fbf8d64c0 0  6298   3262 0x0100
Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544125] 881f9768f838 
0086 883fb2d83700 881f97b38dc0
Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544130] 1000 
881f9769 881fbf8d64c0 7fff
Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544134] 0002 
817dc300 881f9768f858 817dbb07

Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544138] Call Trace:
Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544147] 
[] ? bit_wait+0x50/0x50
Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544156] 
[] schedule_timeout+0x189/0x250
Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544166] 
[] ? bit_wait+0x50/0x50
Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544176] 
[] ? prepare_to_wait_exclusive+0x54/0x80
Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544185] 
[] __wait_on_bit_lock+0x4b/0xa0
Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544195] 
[] ? autoremove_wake_function+0x40/0x40
Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544205] 
[] ? get_user_pages_fast+0x112/0x190
Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544213] 
[] ? ilookup5_nowait+0x6f/0x90
Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544222] 
[] fuse_notify+0x14d/0x830
Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544230] 
[] ? fuse_copy_do+0x84/0xf0
Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544239] 
[] ? ttwu_do_activate.constprop.89+0x5d/0x70
Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544248] 
[] do_iter_readv_writev+0x6c/0xa0
Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544257] 
[] ? mprotect_fixup+0x148/0x230
Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544264] 
[] SyS_writev+0x59/0xf0
Nov  9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672548]   Not 
tainted 4.2.5-040205-generic #201510270124
Nov  9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672654] ceph-fuse   
D 881fbf8d64c0 0  6298   3262 0x0100
Nov  9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672665] 1000 
881f9769 881fbf8d64c0 7fff

Nov  9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672673] Call Trace:
Nov  9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672687] 
[] schedule+0x37/0x80
Nov  9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672698] 
[] ? read_tsc+0x9/0x10
Nov  9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672707] 
[] io_schedule_timeout+0xa4/0x110
Nov  9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672717] 
[] bit_wait_io+0x35/0x50
Nov  9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672726] 
[] __lock_page+0xbb/0xe0
Nov  9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672736] 
[] invalidate_inode_pages2_range+0x22c/0x460
Nov  9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672745] 
[] ? fuse_init_file_inode+0x30/0x30
Nov  9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672753] 
[] fuse_reverse_inval_inode+0x66/0x90
Nov  9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672761] 
[] ? iov_iter_get_pages+0xa2/0x220
Nov  9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672770] 
[] fuse_dev_do_write+0x22d/0x380
Nov  9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672779] 
[] fuse_dev_write+0x5b/0x80
Nov  9 14:19:08 hp-s2

Re: [ceph-users] cephfs: Client hp-s3-r4-compute failingtorespondtocapabilityrelease

2015-11-11 Thread Burkhard Linke

Hi,

On 11/10/2015 09:20 PM, Gregory Farnum wrote:

Can you dump the metadata ops in flight on each ceph-fuse when it hangs?

ceph daemon  mds_requests


Current state: host A and host B blocked, both running ceph-fuse 0.94.5 
(trusty package)


hostA mds_requests (client id 1265369):
{
"request": {
"tid": 20,
"op": "getattr",
"path": "#10002c4e840",
"path2": "",
"ino": "10002c4e840",
"hint_ino": "0",
"sent_stamp": "2015-11-11 15:15:40.109818",
"mds": 0,
"resend_mds": -1,
"send_to_auth": 0,
"sent_on_mseq": 0,
"retry_attempt": 1,
"got_unsafe": 0,
"uid": 0,
"gid": 0,
"oldest_client_tid": 20,
"mdsmap_epoch": 0,
"flags": 0,
"num_retry": 0,
"num_fwd": 0,
"num_releases": 0
}
}

hostB mds_requests (client id 1348375, also marked as failing to respond 
to capability release):

{}

Excerpt of ceph log (ceph -w):
...
2015-11-11 15:16:10.293337 mds.0 [WRN] slow request 30.181451 seconds 
old, received at 2015-11-11 15:15:40.111736: 
client_request(client.1265369:20 getattr pAsLsXsFs #10002c4e840 
2015-11-11 15:15:40.109816) currently failed to rdlock, waiting

...
2015-11-11 15:16:40.310519 mds.0 [WRN] client.1348375 isn't responding 
to mclientcaps(revoke), ino 10002c4e840 pending pAsLsXsFr issued 
pAsLsXsFscr, sent 60.200778 seconds ago

...

Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs: Client hp-s3-r4-compute failingtorespondtocapabilityrelease

2015-11-16 Thread Burkhard Linke

Hi,

On 11/13/2015 03:42 PM, Yan, Zheng wrote:
On Tue, Nov 10, 2015 at 12:06 AM, Burkhard Linke wrote:

> Hi,


*snipsnap*


it seems the hang is related to async invalidate.  please try the 
following patch

---
diff --git a/src/client/Client.cc b/src/client/Client.cc
index 0d85db2..afbb896 100644
--- a/src/client/Client.cc
+++ b/src/client/Client.cc
@@ -3151,8 +3151,6 @@ void Client::_async_invalidate(Inode *in, int64_t off, int64_t len, bool keep_caps

   ino_invalidate_cb(callback_handle, in->vino(), off, len);

   client_lock.Lock();
-  if (!keep_caps)
-    check_caps(in, false);
   put_inode(in);
   client_lock.Unlock();
   ldout(cct, 10) << "_async_invalidate " << off << "~" << len << (keep_caps ? " keep_caps" : "") << " done" << dendl;
@@ -3163,7 +3161,7 @@ void Client::_schedule_invalidate_callback(Inode *in, int64_t off, int64_t len,

   if (ino_invalidate_cb)
     // we queue the invalidate, which calls the callback and decrements the ref
     async_ino_invalidator.queue(new C_Client_CacheInvalidate(this, in, off, len, keep_caps));

-  else if (!keep_caps)
+  if (!keep_caps)
     check_caps(in, false);
 }
I've deployed the patch together with the page cache patch on two 
machines in the compute cluster. I've not been able to reproduce the 
lockup on these machines.


Most cluster machines are currently under load, so I'll have to postpone 
a rollout to more machines until the jobs are finished.


Thanks again for the fast patch.

Best regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Removing OSD - double rebalance?

2015-11-30 Thread Burkhard Linke

Hi Carsten,

On 11/30/2015 10:08 AM, Carsten Schmitt wrote:

Hi all,

I'm running ceph version 0.94.5 and I need to downsize my servers 
because of insufficient RAM.


So I want to remove OSDs from the cluster and according to the manual 
it's a pretty straightforward process:
I'm beginning with "ceph osd out {osd-num}" and the cluster starts 
rebalancing immediately as expected. After the process is finished, 
the rest should be quick:
Stop the daemon "/etc/init.d/ceph stop osd.{osd-num}" and remove the 
OSD from the crush map: "ceph osd crush remove {name}"


But after entering the last command, the cluster starts rebalancing 
again.


And that I don't understand: Shouldn't be one rebalancing process 
enough or am I missing something?
Removing the OSD from the crush map also changes the weight of its host, 
thus a second rebalance is necessary.


The best practice to remove an OSD involves changing the crush weight to 
0.0 as the first step.


- ceph osd crush reweight osd.X 0.0
... wait for rebalance to finish
- ceph osd out X
... stop OSD daemon 
- ceph osd crush remove osd.X
- ceph auth del osd.X
- ceph osd rm X
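
As a rough sketch of the same sequence for a single OSD (osd id 12 as an example; the health check loop is just one simple way to wait for the rebalance to finish):

ceph osd crush reweight osd.12 0.0
while ceph -s | grep -qE 'backfill|recover|degraded'; do sleep 60; done
ceph osd out 12
# stop the daemon on the OSD host, e.g. 'stop ceph-osd id=12' with upstart
# or 'systemctl stop ceph-osd@12' with systemd
ceph osd crush remove osd.12
ceph auth del osd.12
ceph osd rm 12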

Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Enforce MDS map update in CephFS kernel driver

2016-04-27 Thread Burkhard Linke

Hi,

we recently stumbled over a problem with the kernel based CephFS driver 
(Ubuntu Trusty with 4.4.0-18 kernel from xenial lts backport package). 
Our MDS failed for some unknown reason, and the standby MDS became active.


After rejoining the MDS cluster, the former standby MDS got stuck in the 
clientreplay state. Clients were not able to connect to it. We had to 
fail back to the original MDS to recover the clients:


[Wed Apr 27 11:17:48 2016] ceph: mds0 hung
[Wed Apr 27 11:36:30 2016] ceph: mds0 came back
[Wed Apr 27 11:36:30 2016] ceph: mds0 caps went stale, renewing
[Wed Apr 27 11:36:30 2016] ceph: mds0 caps stale
[Wed Apr 27 11:36:33 2016] libceph: mds0 192.168.6.132:6809 socket 
closed (con state OPEN)

[Wed Apr 27 11:36:38 2016] libceph: mds0 192.168.6.132:6809 connection reset
[Wed Apr 27 11:36:38 2016] libceph: reset on mds0
[Wed Apr 27 11:36:38 2016] ceph: mds0 closed our session
[Wed Apr 27 11:36:38 2016] ceph: mds0 reconnect start
[Wed Apr 27 11:36:39 2016] ceph: mds0 reconnect denied
[Wed Apr 27 12:03:32 2016] libceph: mds0 192.168.6.132:6800 socket 
closed (con state OPEN)
[Wed Apr 27 12:03:33 2016] libceph: mds0 192.168.6.132:6800 socket 
closed (con state CONNECTING)
[Wed Apr 27 12:03:34 2016] libceph: mds0 192.168.6.132:6800 socket 
closed (con state CONNECTING)
[Wed Apr 27 12:03:35 2016] libceph: mds0 192.168.6.132:6800 socket 
closed (con state CONNECTING)
[Wed Apr 27 12:03:37 2016] libceph: mds0 192.168.6.132:6800 socket 
closed (con state CONNECTING)
[Wed Apr 27 12:03:41 2016] libceph: mds0 192.168.6.132:6800 socket 
closed (con state CONNECTING)

[Wed Apr 27 12:03:50 2016] ceph: mds0 reconnect start
[Wed Apr 27 12:03:50 2016] ceph: mds0 reconnect success
[Wed Apr 27 12:03:55 2016] ceph: mds0 recovery completed

(192.168.6.132 being the standby MDS)

The problem is similar to the one described in this mail thread from 
september:


http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-September/004070.html

My questions are:

- Does a recent kernel include the fix to react to MDS map changes?
- If this is the case, which is the upstream kernel release including 
the changes?
- Is it possible to manipulate the MDS map manually, e.g. by 
/sys/kernel/debug/ceph//mdsmap ?
- Does using a second MDS in active/active setup provide a way to handle 
this situation, although the configuration is not recommended (yet)?


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Disabling POSIX locking semantics for CephFS

2016-05-03 Thread Burkhard Linke

Hi,

we have a number of legacy applications that do not cope well with the 
POSIX locking semantics in CephFS due to missing locking support (e.g. 
flock syscalls). We are able to fix some of these applications, but 
others are binary only.


Is it possible to disable POSIX locking completely in CephFS (either 
kernel client or ceph-fuse)?


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Disabling POSIX locking semantics for CephFS

2016-05-03 Thread Burkhard Linke

Hi,

On 03.05.2016 18:39, Gregory Farnum wrote:

On Tue, May 3, 2016 at 9:30 AM, Burkhard Linke
 wrote:

Hi,

we have a number of legacy applications that do not cope well with the POSIX
locking semantics in CephFS due to missing locking support (e.g. flock
syscalls). We are able to fix some of these applications, but others are
binary only.

Is it possible to disable POSIX locking completely in CephFS (either kernel
client or ceph-fuse)?

I'm confused. CephFS supports all of these — although some versions of
FUSE don't; you need a new-ish kernel.

So are you saying that
1) in your setup, it doesn't support both fcntl and flock,
2) that some of your applications don't do well under that scenario?

I don't really see how it's safe for you to just disable the
underlying file locking in an application which depends on it. You may
need to upgrade enough that all file locks are supported.


The application in question does a binary search in a large data file 
(~75 GB), which is stored on CephFS. It uses open and mmap without any 
further locking controls (neither fcntl nor flock). The performance was 
very poor with CephFS (Ubuntu Trusty 4.4 backport kernel from Xenial and 
ceph-fuse) compared to the same application with NFS-based storage. I 
didn't have the time to dig further into the kernel implementation yet, 
but I assume that the root cause is locking pages accessed via the 
memory mapped file. Adding a simple flock syscall to mark the data 
file globally as shared solved the problem for us, reducing the overall 
runtime from nearly 2 hours to 5 minutes (and thus comparable to the NFS 
control case). The application runs on our HPC cluster, so several hundred 
instances may access the same data file at once.
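
For binary-only tools a possible workaround might be to wrap the invocation with util-linux flock(1) so that the shared lock is taken externally — a sketch with placeholder paths (whether an external lock has the same effect on the client's capability handling as an in-process flock is something that would still need to be verified):

flock --shared /ceph/path/to/large.dat ./legacy_tool /ceph/path/to/large.dat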


We have other applications that were written without locking support and 
that do not perform very well with CephFS. There was a thread in 
February with a short discussion about CephFS mmap performance 
(http://article.gmane.org/gmane.comp.file-systems.ceph.user/27501). As 
pointed out in that thread, the problem is not only related to mmap 
itself, but also to the need to implement a proper invalidation. We 
cannot fix this for all our applications due to the lack of man power 
and the lack of source code in some cases. We either have to find a way 
to make them work with CephFS, or use a different setup, e.g. an extra 
NFS based mount point with a re-export of CephFS. I would like to avoid 
the latter solution...


Disabling the POSIX semantics and having a fallback to a more NFS-like 
semantic without guarantees is a setback, but probably the easier way 
(if it is possible at all). Most data accessed by these applications is 
read only, so complex locking is not necessary in these cases.


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Disabling POSIX locking semantics for CephFS

2016-05-04 Thread Burkhard Linke

Hi,

On 05/04/2016 09:15 AM, Yan, Zheng wrote:

On Wed, May 4, 2016 at 3:39 AM, Burkhard Linke
 wrote:

Hi,

On 03.05.2016 18:39, Gregory Farnum wrote:

On Tue, May 3, 2016 at 9:30 AM, Burkhard Linke
 wrote:

Hi,

we have a number of legacy applications that do not cope well with the
POSIX
locking semantics in CephFS due to missing locking support (e.g. flock
syscalls). We are able to fix some of these applications, but others are
binary only.

Is it possible to disable POSIX locking completely in CephFS (either
kernel
client or ceph-fuse)?

I'm confused. CephFS supports all of these — although some versions of
FUSE don't; you need a new-ish kernel.

So are you saying that
1) in your setup, it doesn't support both fcntl and flock,
2) that some of your applications don't do well under that scenario?

I don't really see how it's safe for you to just disable the
underlying file locking in an application which depends on it. You may
need to upgrade enough that all file locks are supported.


The application in question does a binary search in a large data file (~75
GB), which is stored on CephFS. It uses open and mmap without any further
locking controls (neither fcntl nor flock). The performance was very poor
with CephFS (Ubuntu Trusty 4.4 backport kernel from Xenial and ceph-fuse)
compared to the same application with a NFS based storage. I didn't had the
time to dig further into the kernel implementation yet, but I assume that
the root cause is locking pages accessed via the memory mapped file. Adding
a simple flock syscall for marking the data file globally as shared solved
the problem for us, reducing the overall runtime from nearly 2 hours to 5
minutes (and thus comparable to the NFS control case). The application runs
on our HPC cluster, so several 100 instances may access the same data file
at once.

We have other applications that were written without locking support and
that do not perform very well with CephFS. There was a thread in February
with a short discussion about CephFS mmap performance
(http://article.gmane.org/gmane.comp.file-systems.ceph.user/27501). As
pointed out in that thread, the problem is not only related to mmap itself,
but also to the need to implement a proper invalidation. We cannot fix this
for all our applications due to the lack of man power and the lack of source
code in some cases. We either have to find a way to make them work with
CephFS, or use a different setup, e.g. an extra NFS based mount point with a
re-export of CephFS. I would like to avoid the later solution...

Disabling the POSIX semantics and having a fallback to a more NFS-like
semantic without guarantees is a setback, but probably the easier way (if it
is possible at all). Most data accessed by these applications is read only,
so complex locking is not necessary in these cases.


see http://tracker.ceph.com/issues/15502. Maybe it's related to this issue.
We are using Ceph release 0.94.6, so the performance problems are 
probably not related. The page cache is also kept populated after an 
application terminates:


# dd if=test of=/dev/null
20971520+0 records in
20971520+0 records out
10737418240 bytes (11 GB) copied, 109.008 s, 98.5 MB/s
# dd if=test of=/dev/null
20971520+0 records in
20971520+0 records out
10737418240 bytes (11 GB) copied, 9.24535 s, 1.2 GB/s


How does CephFS handle locking in the case of missing explicit locking 
control (e.g. flock / fcntl)? And what's the default behaviour for mmap'ed 
memory access in that case?


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unable to mount the CephFS file system fromclientnode with "mount error 5 = Input/output error"

2016-06-14 Thread Burkhard Linke

Hi,

On 06/14/2016 01:21 PM, Rakesh Parkiti wrote:

Hello,

Unable to mount the CephFS file system from client node with *"mount 
error 5 = Input/output error"*
MDS was installed on a separate node. Ceph Cluster health is OK and 
mds services are running. firewall was disabled across all the nodes 
in a cluster.


-- Ceph Cluster Nodes (RHEL 7.2 version + Jewel version 10.2.1)
-- Client Nodes - Ubuntu 14.04 LTS

Admin Node:
*[root@Admin ceph]# ceph mds stat*
e34: 0/0/1 up


*snipsnap*

The MDS is not up and running. Otherwise the output should look like this:

# ceph mds stat
e190193: 1/1/1 up {0=XYZ=up:active}
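
If the daemon really is running on the MDS node, checking its state and log on that node is the next step, e.g. (the unit name is a placeholder for your MDS id):

# systemctl status ceph-mds@<id>
# journalctl -u ceph-mds@<id>
# ceph mds stat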

Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] stuck unclean since forever

2016-06-22 Thread Burkhard Linke

Hi,

On 06/22/2016 12:10 PM, min fang wrote:
Hi, I created a new ceph cluster, and create a pool, but see "stuck 
unclean since forever" errors happen(as the following), can help point 
out the possible reasons for this? thanks.


ceph -s
cluster 602176c1-4937-45fc-a246-cc16f1066f65
 health HEALTH_WARN
8 pgs degraded
8 pgs stuck unclean
8 pgs undersized
too few PGs per OSD (2 < min 30)
 monmap e1: 1 mons at {ceph-01=172.0.0.11:6789/0 
}

election epoch 14, quorum 0 ceph-01
 osdmap e89: 3 osds: 3 up, 3 in
flags
  pgmap v310: 8 pgs, 1 pools, 0 bytes data, 0 objects
60112 MB used, 5527 GB / 5586 GB avail
   8 active+undersized+degraded


*snipsnap*

With three OSDs and a single host you need to change the crush ruleset 
for the pool, since it tries to distribute the data across 3 different 
_hosts_ by default.
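
A rough sketch of what that could look like (pool name and the resulting rule id are placeholders; check the actual id with 'ceph osd crush rule dump'):

# create a rule that uses OSDs instead of hosts as failure domain
ceph osd crush rule create-simple replicate_by_osd default osd
# point the pool at the new rule (assuming the new rule got id 1)
ceph osd pool set <pool> crush_ruleset 1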


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Recover Data from Deleted RBD Volume

2016-08-08 Thread Burkhard Linke

Hi,


On 08.08.2016 09:58, Georgios Dimitrakakis wrote:

Dear all,

I would like your help with an emergency issue but first let me 
describe our environment.


Our environment consists of 2OSD nodes with 10x 2TB HDDs each and 3MON 
nodes (2 of them are the OSD nodes as well) all with ceph version 
0.80.9 (b5a67f0e1d15385bc0d60a6da6e7fc810bde6047)


This environment provides RBD volumes to an OpenStack Icehouse 
installation.


Although not a state of the art environment is working well and within 
our expectations.


The issue now is that one of our users accidentally deleted one of the 
volumes without keeping its data first!


Is there any way (since the data are considered critical and very 
important) to recover them from CEPH?


Short answer: no

Long answer: no, but

Consider the way Ceph stores data... each RBD is striped into chunks 
(RADOS objects with 4MB size by default); the chunks are distributed 
among the OSDs with the configured number of replicates (probably two in 
your case since you use 2 OSD hosts). RBD uses thin provisioning, so 
chunks are allocated upon first write access.
If an RBD is deleted all of its chunks are deleted on the corresponding 
OSDs. If you want to recover a deleted RBD, you need to recover all 
individual chunks. Whether this is possible depends on your filesystem 
and whether the space of a former chunk is already assigned to other 
RADOS objects. The RADOS object names are composed of the RBD name and 
the offset position of the chunk, so if an undelete mechanism exists for 
the OSDs' filesystem, you have to be able to recover files by their 
filenames, otherwise you might end up mixing the content of various 
deleted RBDs. Due to the thin provisioning there might be some chunks 
missing (e.g. never allocated before).


Given the fact that
- you probably use XFS on the OSDs since it is the preferred filesystem 
for OSDs (there is RDR-XFS, but I've never had to use it)
- you would need to stop the complete ceph cluster (recovery tools do 
not work on mounted filesystems)
- your cluster has been in use after the RBD was deleted and thus parts 
of its former space might already have been overwritten (replication 
might help you here, since there are two OSDs to try)
- XFS undelete does not work well on fragmented files (and OSDs tend to 
introduce fragmentation...)


the answer is no, since it might not be feasible and the chance of 
success are way too low.


If you want to spend time on it I would propose the stop the ceph 
cluster as soon as possible, create copies of all involved OSDs, start 
the cluster again and attempt the recovery on the copies.
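
A sketch of the copy step (device and target paths are placeholders; ddrescue is an alternative if the disks already show read errors):

# dd if=/dev/sdb of=/backup/osd-2-sdb.img bs=4M conv=sync,noerror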


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Recover Data from Deleted RBD Volume

2016-08-08 Thread Burkhard Linke

Hi,


On 08.08.2016 10:50, Georgios Dimitrakakis wrote:

Hi,


On 08.08.2016 09:58, Georgios Dimitrakakis wrote:

Dear all,

I would like your help with an emergency issue but first let me 
describe our environment.


Our environment consists of 2OSD nodes with 10x 2TB HDDs each and 
3MON nodes (2 of them are the OSD nodes as well) all with ceph 
version 0.80.9 (b5a67f0e1d15385bc0d60a6da6e7fc810bde6047)


This environment provides RBD volumes to an OpenStack Icehouse 
installation.


Although not a state of the art environment is working well and 
within our expectations.


The issue now is that one of our users accidentally deleted one of 
the volumes without keeping its data first!


Is there any way (since the data are considered critical and very 
important) to recover them from CEPH?


Short answer: no

Long answer: no, but

Consider the way Ceph stores data... each RBD is striped into chunks
(RADOS objects with 4MB size by default); the chunks are distributed
among the OSDs with the configured number of replicates (probably two
in your case since you use 2 OSD hosts). RBD uses thin provisioning,
so chunks are allocated upon first write access.
If an RBD is deleted all of its chunks are deleted on the
corresponding OSDs. If you want to recover a deleted RBD, you need to
recover all individual chunks. Whether this is possible depends on
your filesystem and whether the space of a former chunk is already
assigned to other RADOS objects. The RADOS object names are composed
of the RBD name and the offset position of the chunk, so if an
undelete mechanism exists for the OSDs' filesystem, you have to be
able to recover file by their filename, otherwise you might end up
mixing the content of various deleted RBDs. Due to the thin
provisioning there might be some chunks missing (e.g. never allocated
before).

Given the fact that
- you probably use XFS on the OSDs since it is the preferred
filesystem for OSDs (there is RDR-XFS, but I've never had to use it)
- you would need to stop the complete ceph cluster (recovery tools do
not work on mounted filesystems)
- your cluster has been in use after the RBD was deleted and thus
parts of its former space might already have been overwritten
(replication might help you here, since there are two OSDs to try)
- XFS undelete does not work well on fragmented files (and OSDs tend
to introduce fragmentation...)

the answer is no, since it might not be feasible and the chance of
success are way too low.

If you want to spend time on it I would propose the stop the ceph
cluster as soon as possible, create copies of all involved OSDs, start
the cluster again and attempt the recovery on the copies.

Regards,
Burkhard


Hi! Thanks for the info...I understand that this is a very difficult 
and probably not feasible task but in case I need to try a recovery 
what other info should I need? Can I somehow find out on which OSDs 
the specific data were stored and minimize my search there?

Any ideas on how should I proceed?
First of all you need to know the exact object names for the RADOS 
objects. As mentioned before, the name is composed of the RBD name and 
an offset.


In case of OpenStack, there are three different patterns for RBD names:

<image uuid>, e.g. 50f2a0bd-15b1-4dbb-8d1f-fc43ce535f13 for glance images,

<instance uuid>_disk, e.g. 9aec1f45-9053-461e-b176-c65c25a48794_disk for nova images
volume-<volume uuid>, e.g. volume-0ca52f58-7e75-4b21-8b0f-39cbcd431c42 for 
cinder volumes


(not considering snapshots etc, which might use different patterns)

The RBD chunks are created using a certain prefix (using examples from 
our openstack setup):


# rbd -p os-images info 8fa3d9eb-91ed-4c60-9550-a62f34aed014
rbd image '8fa3d9eb-91ed-4c60-9550-a62f34aed014':
size 446 MB in 56 objects
order 23 (8192 kB objects)
block_name_prefix: rbd_data.30e57d54dea573
format: 2
features: layering, striping
flags:
stripe unit: 8192 kB
stripe count: 1

# rados -p os-images ls | grep rbd_data.30e57d54dea573
rbd_data.30e57d54dea573.0015
rbd_data.30e57d54dea573.0008
rbd_data.30e57d54dea573.000a
rbd_data.30e57d54dea573.002d
rbd_data.30e57d54dea573.0032

I don't know whether the prefix is derived from some other 
information, but to recover the RBD you definitely need it.


_If_ you are able to recover the prefix, you can use 'ceph osd map' to 
find the OSDs for each chunk:


# ceph osd map os-images rbd_data.30e57d54dea573.001a
osdmap e418590 pool 'os-images' (38) object 
'rbd_data.30e57d54dea573.001a' -> pg 38.d5d81d65 (38.65) -> 
up ([45,17,108], p45) acting ([45,17,108], p45)


With 20 OSDs in your case you will likely have to process all of them if 
the RBD has a size of several GBs.


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Merging CephFS data pools

2016-08-18 Thread Burkhard Linke

Hi,

the current setup for CephFS at our site uses two data pools due to 
different requirements in the past. I want to merge these two pools now, 
eliminating the second pool completely.


I've written a small script to locate all files on the second pool using 
their file layout attributes and replace them with a copy on the correct 
pool. This works well for files, but modifies the timestamps of the 
directories.
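
For reference, the script essentially boils down to something like this (a simplified sketch without error handling, assuming the directory layouts have already been switched to the target pool; 'old_pool' is a placeholder):

find /ceph/some/subtree -type f -print0 | while IFS= read -r -d '' f; do
    pool=$(getfattr --only-values -n ceph.file.layout.pool "$f" 2>/dev/null)
    if [ "$pool" = "old_pool" ]; then
        # a fresh copy inherits the directory layout and thus lands on the new pool;
        # cp -p keeps owner/mode/times of the file, but the final mv updates the
        # directory mtime, which is exactly the problem described above
        cp -p "$f" "$f.poolmove" && mv "$f.poolmove" "$f"
    fi
done
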
Do you have any idea for a better solution that does not modify 
timestamps and plays well with active CephFS clients (e.g. no problem 
with files being used)? A simple 'rados cppool' probably does not work 
since the pool id/name is part of a file's metadata and clients will not 
be aware of moved files.


Regards,
Burkhard

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Recommended hardware for MDS server

2016-08-22 Thread Burkhard Linke

Hi,

we are running CephFS with about 70 TB of data, > 5 million files and about 
100 clients. The MDS is currently colocated on a storage box with 14 OSDs 
(12 HDD, 2 SSD). The box has two E5-2680 v3 CPUs and 128 GB RAM. CephFS 
runs fine, but it feels like the metadata operations may need more speed.


Excerpt of MDS perf dump:
"mds": {
"request": 73389282,
"reply": 73389282,
"reply_latency": {
"avgcount": 73389282,
"sum": 259696.749971457
},
"forward": 0,
"dir_fetch": 4094842,
"dir_commit": 720085,
"dir_split": 0,
"inode_max": 500,
"inodes": 565,
"inodes_top": 320979,
"inodes_bottom": 530518,
"inodes_pin_tail": 4148568,
"inodes_pinned": 4469666,
"inodes_expired": 60001276,
"inodes_with_caps": 4468714,
"caps": 4850520,
"subtrees": 2,
"traverse": 92378836,
"traverse_hit": 75743822,
"traverse_forward": 0,
"traverse_discover": 0,
"traverse_dir_fetch": 1719440,
"traverse_remote_ino": 33,
"traverse_lock": 3952,
"load_cent": 7339063064,
"q": 0,
"exported": 0,
"exported_inodes": 0,
"imported": 0,
"imported_inodes": 0
},

The setup is expected to grow, with regards to both the amount of stored data 
and the number of clients. The MDS process currently consumes about 36 
GB RAM, with 22 GB resident. Since large parts of the MDS run single-
threaded, a CPU with fewer cores and a higher clock frequency might be a better 
choice in this setup.


How well does the MDS performance scale with CPU frequency (given that other 
latency paths like network I/O don't matter)? Given the amount of 
memory used, does the MDS benefit from larger CPU caches (e.g. E5-2XXX 
class CPUs), or is a smaller cache in faster CPUs a better choice (e.g. 
E5-1XXX or E3-1XXXv5)?


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS vs RBD

2017-06-23 Thread Burkhard Linke

Hi,


On 06/23/2017 02:44 PM, Bogdan SOLGA wrote:

Hello, everyone!

We are working on a project which uses RBD images (formatted with XFS) 
as home folders for the project's users. The access speed and the 
overall reliability have been pretty good, so far.


From the architectural perspective, our main focus is on providing a 
seamless user experience in case the Ceph clients will suddenly go 
offline. Hence, we were thinking about using CephFS instead of the RBD 
images, and we want to know your experiences with it.


My experience as user so far:


A few questions:

  * is CephFS considered production ready? As far as I know, it
currently supports a single (active) MDS server;

It was declared stable with the Jewel release. Since Jewel we have not 
encountered any severe problems with the MDS; it has improved a lot 
compared to prior releases (we have been working with CephFS since Firefly).


Active-Active setups are possible, but not recommended. You can set up 
additional MDS servers as standby/standby-replay servers, which become 
active if the current active MDS fails. There might be a delay due to 
MDS state detection, but it can be adapted to your use case.
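
A minimal ceph.conf sketch for a standby-replay daemon following rank 0 might look like this (the mds name is a placeholder):

[mds.b]
    mds standby replay = true
    mds standby for rank = 0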


The main problem with failover to a standby MDS is the fact that all 
active inodes are stat()ed during failover; with many open files and 
slow storage this might take a considerable amount of time. Be sure to 
run some tests with a real-life workload.


  * should we expect any speed / performance differences between RBD
and CephFS? if yes - should we see an improvement or a downgrade?

Definitely a downgrade. Every file metadata access requires a 
communication with the MDS to allocate the necessary capability. It 
might also require the MDS to contact other clients and ask for 
capabilities to be released.


The impact depends on your use case; many clients working in different 
directories might be less affected (especially due to the limited 
lock/capability contention), while all clients working in the same directory 
incurs a significant performance penalty. But this behavior is to be 
expected for any POSIX-compliant distributed file system.


Data I/O itself does not involve the MDS, so speed should be comparable 
to RBD. Try to use the kernel cephfs implementation if possible, since 
it does not require kernel/user space context switches and thus has a 
better performance compared to ceph-fuse.


  * as far as I know, if we'd use CephFS, we'd be able to mount the
file system on several Ceph clients; would there be any problem
(from Ceph's perspective) if one of those clients would suddenly
go offline?

Problems (e.g. files/directories still locked by the failed client) 
should be temporary, since stale sessions are detected and removed by 
the MDS.


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs df with EC pool

2017-06-28 Thread Burkhard Linke

Hi,


On 06/28/2017 01:19 PM, Jake Grimmett wrote:

Dear All,

Sorry is this has been covered before, but is it possible to configure
cephfs to report free space based on what is available in the main
storage tier?

My "df" shows 76%, this gives a false sense of security, when the EC
tier is 93% full...

i.e. # df -h /ceph
Filesystem  Size  Used Avail Use% Mounted on
ceph-fuse   440T  333T  108T  76% /ceph

# ls -lhd /ceph
drwxr-xr-x 1 root root 254T Jun 27 17:03 /ceph

but "ceph df" shows that our EC pool is %92.46 full.

# ceph df
GLOBAL:
 SIZE AVAIL RAW USED %RAW USED
 439T  107T 332T 75.57
POOLS:
 NAME ID USED %USED MAX AVAIL OBJECTS
 rbd  0 0 0  450G 0
 ecpool   1  255T 92.4621334G 105148577
 hotpool  2  818G 64.53  450G236023
 metapool 3  274M  0.06  450G   2583306
Since 'df' is not able to access ceph (it simply does not know about 
ceph), it can only report the information returned by ceph-fuse / the 
kernel client.


'df' also operates on filesystem level, and data pools are assigned on 
directory level (with the default pool being assigned to the root 
directory). A single sane value for free available space is thus not 
meaningful, so the cephfs implementation just reports the overall values.
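
For what it's worth, the pool assignment of a directory can be inspected (and changed for files created afterwards) via the layout xattrs, e.g. (a directory that simply inherits its parent's layout may not report the attribute at all):

getfattr -n ceph.dir.layout.pool /ceph/some/directory
setfattr -n ceph.dir.layout.pool -v <pool name> /ceph/some/directory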


Regards,
Burkhard Linke
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cannot mount Ceph FS

2017-06-29 Thread Burkhard Linke

Hi,


On 06/29/2017 01:26 PM, Riccardo Murri wrote:

Hello!

I tried to create and mount a filesystem using the instructions at
<http://docs.ceph.com/docs/master/cephfs/createfs/> and
<http://docs.ceph.com/docs/master/cephfs/kernel> but I am getting
errors:

$ sudo ceph fs new cephfs cephfs_metadata cephfs_data
new fs with metadata pool 1 and data pool 2
$ sudo ceph mds stat
e6: 0/0/1 up
$ sudo mount -t ceph mds001:/ /mnt -o
name=admin,secretfile=/etc/ceph/client.admin.secret
mount error 110 = Connection timed out

I found this `mds cluster_up` command and thought I need to bring the
MDS cluster up before using FS functions but I get errors there as
well:

$ sudo ceph mds cluster_up
unmarked fsmap DOWN

Still, the cluster does not show any health issue:

$ sudo ceph -s
 cluster 00baac7a-0ad4-4ab7-9d5e-fdaf7d122aee
  health HEALTH_OK
  monmap e1: 1 mons at {mon001=172.23.140.181:6789/0}
 election epoch 3, quorum 0 mon001
   fsmap e7: 0/0/1 up
  osdmap e19: 3 osds: 3 up, 3 in
 flags sortbitwise,require_jewel_osds
   pgmap v1278: 192 pgs, 3 pools, 0 bytes data, 0 objects
 9728 MB used, 281 GB / 290 GB avail
  192 active+clean

Any hints?  What I am doing wrong?


You need a running MDS daemon for CephFS.
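
If you deployed the cluster with ceph-deploy, that should just be a matter of (node name is a placeholder):

ceph-deploy mds create <node>

Afterwards 'ceph mds stat' should report 1/1/1 up with an active MDS, and the mount should succeed.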

Regards,
Burkhard Linke
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] "Zombie" ceph-osd@xx.service remain fromoldinstallation

2017-08-03 Thread Burkhard Linke

Hi,


On 03.08.2017 16:31, c.mo...@web.de wrote:

Hello!

I have purged my ceph and reinstalled it.
ceph-deploy purge node1 node2 node3
ceph-deploy purgedata node1 node2 node3
ceph-deploy forgetkeys

All disks configured as OSDs are physically in two servers.
Due to some restrictions I needed to modify the total number of disks usable as 
OSD, this means I have now less disks as before.

The installation with ceph-deploy finished w/o errors.

However, if I start all OSDs (on any of the servers) I get some services with status 
"failed".
ceph-osd@70.service 
   loaded failed failedCeph object storage daemon
ceph-osd@71.service 
   loaded failed failedCeph object storage daemon
ceph-osd@92.service 
   loaded failed failedCeph object storage daemon
ceph-osd@93.service 
   loaded failed failedCeph object storage daemon
ceph-osd@94.service 
   loaded failed failedCeph object storage daemon
ceph-osd@95.service 
   loaded failed failedCeph object storage daemon
ceph-osd@96.service 
   loaded failed failedCeph object storage daemon

Any of these services belong to the previous installation.

If I stop any of the failed service and disable it, e.g.
systemctl stop ceph-osd@70.service
systemctl disable ceph-osd@70.service
the status is correct.

However, when I trigger
systemctl restart ceph-osd.target
these zombie services get in status "auto-restart" first and then "fail" again.

As a workaround I need to mask the zombie services, but this should not be a 
final solution: systemctl mask ceph-osd@70.service

Question:
How can I get rid of the zombie services "ceph-osd@xx.service"?
If you are sure that these OSDs are "zombies", you can remove the 
dependencies for ceph-osd.target. In case of CentOS, these are symlinks 
in /etc/systemd/system/ceph-osd.target.wants/.


Do not forget to reload systemd afterwards. There might also be a nice 
systemctl command for removing dependencies.
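
A sketch for a single stale instance (id 70 as in your listing):

systemctl stop ceph-osd@70.service
rm /etc/systemd/system/ceph-osd.target.wants/ceph-osd@70.service
systemctl daemon-reload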


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] output discards (queue drops) on switchport

2017-09-08 Thread Burkhard Linke

Hi,


On 09/08/2017 02:12 PM, Marc Roos wrote:
  


Afaik ceph is is not supporting/working with bonding.

https://www.mail-archive.com/ceph-users@lists.ceph.com/msg35474.html
(thread: Maybe some tuning for bonded network adapters)
CEPH works well with LACP bonds. The problem described in that thread is 
the fact that LACP does not use the links in a round-robin fashion, but 
distributes network streams across them depending on a hash of certain 
parameters like source and destination IP address. This is already set 
to the layer3+4 policy by the OP.


Regarding the drops (and without any experience with neither 25GBit 
ethernet nor the Arista switches):
Do you have corresponding input drops on the server's network ports? Did 
you tune the network settings on server side for high throughput, e.g. 
net.ipv4.tcp_rmem, wmem, ...? And are the CPUs fast enough to handle the 
network traffic?
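
For reference, the usual starting points for 10G+ links are along these lines (values are just common examples, not a recommendation for your particular setup):

net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
net.ipv4.tcp_rmem = 4096 87380 33554432
net.ipv4.tcp_wmem = 4096 65536 33554432
net.core.netdev_max_backlog = 50000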


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] output discards (queue drops) on switchport

2017-09-08 Thread Burkhard Linke

Hi,


On 09/08/2017 04:13 PM, Andreas Herrmann wrote:

Hi,

On 08.09.2017 15:59, Burkhard Linke wrote:

On 09/08/2017 02:12 PM, Marc Roos wrote:
  
Afaik ceph is is not supporting/working with bonding.


https://www.mail-archive.com/ceph-users@lists.ceph.com/msg35474.html
(thread: Maybe some tuning for bonded network adapters)

CEPH works well with LACP bonds. The problem described in that thread is the
fact that LACP is not using links in a round robin fashion, but distributes
network stream depending on a hash of certain parameters like source and
destination IP address. This is already set to layer3+4 policy by the OP.

Regarding the drops (and without any experience with neither 25GBit ethernet
nor the Arista switches):
Do you have corresponding input drops on the server's network ports?

No input drops, just output drops
Output drops on the switch are related to input drops on the server 
side. If the link uses flow control and the server signals the switch 
that its internal buffers are full, the switch has to drop further 
packets if the port buffer is also filled. If there's no flow control, 
and the network card is not able to store the packet (full buffers...), 
it should be noted as an overrun in the interface statistics (and if this 
is not correct, please correct me, I'm not a network guy).
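
If you want to check this on the server, something along these lines should 
show the flow control settings and the drop/overrun counters (the interface 
name is a placeholder, and counter names differ between drivers):

# flow control / pause frame settings
ethtool -a eth0
# driver statistics, look for pause/drop/discard counters
ethtool -S eth0 | grep -i -E 'pause|drop|discard'
# kernel view of RX/TX errors, drops and overruns
ip -s -s link show dev eth0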





Did you tune the network settings on server side for high throughput, e.g.
net.ipv4.tcp_rmem, wmem, ...?

sysctl tuning is disabled at the moment. I tried sysctl examples from
https://fatmin.com/2015/08/19/ceph-tcp-performance-tuning/. But there is still
the same amount of output drops.


And are the CPUs fast enough to handle the network traffic?

Xeon(R) CPU E5-1660 v4 @ 3.20GHz should be fast enough. But I'm unsure. It's
my first Ceph cluster.
The CPU has 6 cores, and you are driving 2x 10GBit, 2x 25 GBit, the raid 
controller and 8 ssd based osds with it. You can use tools like atop or 
ntop to watch certain aspects of the system during the tests (network, 
cpu, disk).


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: FileStore vs BlueStore

2017-09-20 Thread Burkhard Linke

Hi,


On 09/20/2017 11:10 AM, Sam Huracan wrote:

So why does the journal not write only metadata?
As I've read, it is to ensure consistency of data, but I do not know 
how that works in detail. And why does BlueStore still ensure consistency 
without a journal?


The main reason for having a journal with filestore is having a block 
device that supports synchronous writes. Writing to a filesystem in a 
synchronous way (e.g. including all metadata writes) results in a huge 
performance penalty.


With bluestore the data is also stored on a block device, which also 
allows performing synchronous writes directly (given the backing storage 
is handling sync writes correctly and in a consistent way, e.g. no drive 
caches, bbu for raid controllers/hbas). And similar to the filestore 
journal, the bluestore wal/rocksdb partitions can be used to allow both 
faster devices (ssd/nvme) and faster sync writes (compared to spinners).
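
As a hypothetical example of such a layout with ceph-volume, data on a 
spinner and the rocksdb part on a faster device (device names are 
placeholders, adjust to your hardware):

ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1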


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: FileStore vs BlueStore

2017-09-20 Thread Burkhard Linke

Hi,


On 09/20/2017 12:24 PM, Sean Purdy wrote:

On Wed, 20 Sep 2017, Burkhard Linke said:

The main reason for having a journal with filestore is having a block device
that supports synchronous writes. Writing to a filesystem in a synchronous
way (e.g. including all metadata writes) results in a huge performance
penalty.

With bluestore the data is also stored on a block devices, and thus also
allows to perform synchronous writes directly (given the backing storage is
handling sync writes correctly and in a consistent way, e.g. no drive
caches, bbu for raid controllers/hbas). And similar to the filestore journal

Our Bluestore disks are hosted on RAID controllers.  Should I set cache policy 
as WriteThrough for these disks then?


It depends on the setup and availability of a BBU. If you have a BBU and 
cache on the controller, using write back should be ok if you monitor 
the BBU state. To be on the safe side, use write through and live 
with the performance impact.


There's also another thread on the mailing list discussing the choice of 
controllers/hba. Maybe there's more information available in that 
thread, especially with regard to vendors, firmware etc.


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFs kernel client metadata caching

2017-10-13 Thread Burkhard Linke

Hi,


On 10/13/2017 12:36 PM, Denes Dolhay wrote:

Dear All,


First of all, this is my first post, so please be lenient :)


For the last few days I have been testing ceph, and cephfs, deploying 
a PoC cluster.


I have been testing the cephfs kernel client caching, when I came 
across something strange, and I cannot decide if it is a bug or I just 
messed up something.



Steps, given client1 and client2 both mounted the same cephfs, extra 
mount option, noatime:



Client 1: watch -n 1 ls -lah /mnt/cephfs

-in tcpdump I can see that the directory is being listed once and only 
once, all the following ls requests are served from the client cache



Client 2: make any modification for example append to a file, or 
delete a file directly under /mnt/cephfs


-The operation is done, and client1 is informed about the change OK.

-Client1 does not seem to cache the new metadata information received 
from the metadata server, now it communicates every second with the mds.



Client 1: stop watch ls... command, wait a few sec and restart it

-The communication stops, client1 serves ls data from cache


Please help, if it is intentional then why, if not, how can I debug it?


This is probably the intended behaviour. CephFS is a posix compliant 
filesystem, and uses capabilities (similar to locks) to control 
concurrent access to directories and files.


In your first step, a capability for directory access is granted to 
client1. As soon as client2 wants to access the directory (probably 
read-only first for listing, write access later), the MDS has to check 
the capability requests with client1. I'm not sure about the details, 
but something similar to a "write lock" should be granted to client2, and 
client1 is granted a read lock or a "I have this entry in cache and need 
the MDS to know it" lock. That's also the reason why client1 has to ask 
the MDS every second whether its cache content is still valid. client2 
probably still holds the necessary capabilities, so you might also see 
some traffic between the MDS and client2.


I'm not sure why client1 does not continue to ask the MDS in the last 
step. Maybe the capability in client2 has expired and it was granted to 
client1. Others with more insight into the details of capabilities might 
be able to give you more details.


Short version: CephFS has a strict posix locking semantic implemented by 
capabilities, and you need to be aware of this fact (especially if you 
are used to NFS...)
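
If you want to see who is holding capabilities, the MDS admin socket should 
give you an idea (the MDS name is a placeholder):

# lists all client sessions, including the number of caps each client holds
ceph daemon mds.<name> session ls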


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFs kernel client metadata caching

2017-10-13 Thread Burkhard Linke

Hi,


On 10/13/2017 02:26 PM, Denes Dolhay wrote:

Hi,


Thank you for your fast response!


Is there a way -that You know of- to list these locks?
The only way I know of is dumping the MDS cache content. But I don't 
know exactly how to do it or how to analyse the content.
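
Something along these lines might work (untested; the MDS name is a 
placeholder):

# writes the current MDS cache content to a file on the MDS host
ceph daemon mds.<name> dump cache /tmp/mds_cache.txt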


I write to the file with echo "foo" >> /mnt/ceph/...something... so if 
there is any locking, should not it be released after the append is done?


That's the capability for the file... but there are also capabilities 
for the directory itself. And capabilities are more complex than 
read/write locks.



The strange thing is, that this -increased traffic- stage went on for 
hours, tried many times, and after I stop the watch for ~5s (not tried 
different intervals) and restart it, the traffic is gone, and there is 
normal -I think some keepalive- comm between mds and client, two 
packets in ~5s (request, response)
I'm just guessing, but at that time both clients should have 
capabilities for the same directory. Maybe the client was checking 
whether it needs to change its capability?



As if the metadata cache would only be populated in a timer, (between 
1s and 5s) which is never reached because of the repeated watch ls 
query  just a blind shot in the dark...


You can increase the debugging level of the MDS. This should give you 
much more information about what's going on, what kind of requests are 
passed between the MDS and the clients etc.
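
For example (the MDS name and the debug level are just placeholders, pick 
whatever level you need):

# raise the MDS debug level at runtime via the admin socket
ceph daemon mds.<name> config set debug_mds 10
# and lower it again afterwards
ceph daemon mds.<name> config set debug_mds 1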


Regards,
burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] unusual growth in cluster after replacing journal SSDs

2017-11-16 Thread Burkhard Linke

Hi,


On 11/16/2017 01:36 PM, Jogi Hofmüller wrote:

Dear all,

for about a month we experience something strange in our small cluster.
  Let me first describe what happened on the way.

On Oct 4ht smartmon told us that the journal SSDs in one of our two
ceph nodes will fail.  Since getting replacements took way longer than
expected we decided to place the journal on a spare HDD rather than
have the SSD fail and leave us in an uncertain state.

On Oct 17th we finally got the replacement SSDs.  First we replaced
broken/soon to be broken SSD and moved journals from the temporarily
used HDD to the new SSD.  Then we also replaced the journal SSD on the
other ceph node since it would probably fail sooner than later.

We performed all operations by setting noout first, then taking down
the OSDs, flushing journals, replacing disks, creating new journals and
starting OSDs again.  We waited until the cluster was back in HEALTH_OK
state before we proceeded to the next node.

AFAIR mkjournal crashed once on the second node.  So we ran the command
again and journals where created.


*snipsnap*


What remains is the growth of used data in the cluster.

I put background information of our cluster and some graphs of
different metrics on a wiki page:

   https://wiki.mur.at/Dokumentation/CephCluster

Basically we need to reduce the growth in the cluster, but since we are
not sure what causes it we don't have an idea.


Just a wild guess (wiki page is not accessible yet):

Are you sure that the journals were created on the new SSD? If the 
journals were created as files in the OSD directory, their size might be 
accounted for in the cluster size report (assuming OSDs are reporting 
their free space, not a sum of all object sizes).
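
A quick way to check is looking at the journal link in the OSD directories 
(default filestore layout assumed); it should point to a partition on the 
new SSD and not be a plain file:

ls -l /var/lib/ceph/osd/ceph-*/journal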


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph luminous + multi mds: slow request, behind on trimming, failed to authpin local pins

2017-12-06 Thread Burkhard Linke

Hi,


we have upgraded our cluster to luminous 12.2.2 and wanted to use a 
second MDS for HA purposes. Upgrade itself went well, setting up the 
second MDS from the former standby-replay configuration worked, too.



But upon load both MDS got stuck and needed to be restarted. It starts 
with slow requests:



2017-12-06 20:26:25.756475 7fddc4424700  0 log_channel(cluster) log [WRN] : slow request 122.370227 seconds old, received at 2017-12-06 20:24:23.386136: client_request(client.15057265:2898 getattr pAsLsXsFs #0x19de0f2 2017-12-06 20:24:23.244096 caller_uid=0, caller_gid=0{}) currently failed to rdlock, waiting


0x19de0f2 is the inode id of the directory we mount as root on most 
clients. Running daemonperf for both MDS shows a rising number of 
journal segments, accompanied with the corresponding warnings in the 
ceph log. We also see other slow requests:


2017-12-06 20:26:25.756488 7fddc4424700  0 log_channel(cluster) log [WRN] : slow request 180.346068 seconds old, received at 2017-12-06 20:23:25.410295: client_request(client.15163105:549847914 getattr pAs #0x19de0f2/sge-tmp 2017-12-06 20:23:25.406481 caller_uid=1426, caller_gid=1008{}) currently failed to authpin local pins

This is a client accessing a sub directory of the mount point.


On the client side (various Ubuntu kernels using the kernel based cephfs 
client) this leads to CPU lockups if the problem is not fixed fast enough. 
The clients need a hard reboot to recover.



We have mitigated the problem by disabling the second MDS. The MDS 
related configuration is:



[mds.ceph-storage-04]
mds_replay_interval = 10
mds_cache_memory_limit = 10737418240

[mds]
mds_beacon_grace = 60
mds_beacon_interval = 4
mds_session_timeout = 120


Data pool is on replicated HDD storage, meta data pool on replicated 
NVME storage. MDS are colocated with OSDs (12 HDD OSDs + 2 NVME OSDs, 
128 GB RAM).



The questions are:

- what is the minimum kernel version on clients required for multi mds 
setups?


- is the problem described above a known problem, e.g. a result of 
http://tracker.ceph.com/issues/21975 ?



Regards,

Burkhard Linke


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How to fix mon scrub errors?

2017-12-12 Thread Burkhard Linke

HI,


since the upgrade to luminous 12.2.2 the mons are complaining about 
scrub errors:



2017-12-13 08:49:27.169184 mon.ceph-storage-03 [ERR] scrub mismatch
2017-12-13 08:49:27.169203 mon.ceph-storage-03 [ERR]  mon.0 ScrubResult(keys {logm=87,mds_health=13} crc {logm=4080463437,mds_health=2210310418})
2017-12-13 08:49:27.169216 mon.ceph-storage-03 [ERR]  mon.1 ScrubResult(keys {logm=87,mds_health=13} crc {logm=4080463437,mds_health=1599893324})

2017-12-13 08:49:27.169229 mon.ceph-storage-03 [ERR] scrub mismatch
2017-12-13 08:49:27.169243 mon.ceph-storage-03 [ERR]  mon.0 ScrubResult(keys {logm=87,mds_health=13} crc {logm=4080463437,mds_health=2210310418})
2017-12-13 08:49:27.169260 mon.ceph-storage-03 [ERR]  mon.2 ScrubResult(keys {logm=87,mds_health=13} crc {logm=4080463437,mds_health=3057347215})

2017-12-13 08:49:27.176435 mon.ceph-storage-03 [ERR] scrub mismatch
2017-12-13 08:49:27.176454 mon.ceph-storage-03 [ERR]  mon.0 ScrubResult(keys {mgrstat=10,monmap=26,osd_metadata=64} crc {mgrstat=3940483607,monmap=3662510285,osd_metadata=45209833})
2017-12-13 08:49:27.176472 mon.ceph-storage-03 [ERR]  mon.1 ScrubResult(keys {mgrstat=10,monmap=26,osd_metadata=64} crc {mgrstat=3940483607,monmap=3662510285,osd_metadata=289852700})



These errors might have been caused by problems setting up multi mds 
after luminous upgrade.


OSD scrub errors are a well known problem with many available 
solutions... but how do I fix mon scrub errors?


Best regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deterministic naming of LVM volumes (ceph-volume)

2017-12-13 Thread Burkhard Linke

Hi,


On 12/13/2017 02:12 PM, Webert de Souza Lima wrote:

Cool


On Wed, Dec 13, 2017 at 11:04 AM, Stefan Kooman wrote:


So, a "ceph osd ls" should give us a list, and we will pick the
smallest
available number as the new osd id to use. We will make a check in the
(ansible) deployment code to see Ceph will indeed use that number.

Thanks,

Gr. Stefan



Take into account that if an ID is available within a gap, it means 
that it might have been used before, so maybe you'll still need to 
include `ceph osd rm $ID` [and/or `ceph auth del osd.$ID`] to make 
sure that ID will be usable.


Just my 2 cents:

What is happening if ansible runs on multiple hosts in parallel?

Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Problems understanding 'ceph features' output

2017-12-15 Thread Burkhard Linke

Hi,


On 12/15/2017 10:56 AM, Massimo Sgaravatto wrote:

Hi

I tried the jewel --> luminous update on a small testbed composed by:

- 3 mon + mgr nodes
- 3 osd nodes (4 OSDs per each of this node)
- 3 clients (each client maps a single volume)


*snipsnap*



[*]
    "client": {
        "group": {
            "features": "0x40106b84a842a52",
            "release": "jewel",
            "num": 3
        },
        "group": {
            "features": "0x1ffddff8eea4fffb",
            "release": "luminous",
            "num": 5
        }
AFAIK "client" does not refer to a host, but to the application running 
on the host. If you have several qemu+rbd based VMs running on a host, 
each VM with be considered an individual client.


So I assume there are 3 ceph applications (e.g. three VMs) on the jewel 
host, and 5 applications on the two luminous hosts.


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unable to ceph-deploy luminos

2017-12-18 Thread Burkhard Linke

Hi,


On 12/18/2017 05:28 PM, Andre Goree wrote:
I'm working on setting up a cluster for testing purposes and I can't 
seem to install luminos.  All nodes are running Ubuntu 16.04.


[cephadmin][DEBUG ] Err:7 https://download.ceph.com/debian-luminos 
xenial/main amd64 Packages

[cephadmin][DEBUG ]   404  Not Found
[cephadmin][DEBUG ] Ign:8 https://download.ceph.com/debian-luminos 
xenial/main i386 Packages
[cephadmin][DEBUG ] Ign:9 https://download.ceph.com/debian-luminos 
xenial/main all Packages
[cephadmin][DEBUG ] Ign:10 https://download.ceph.com/debian-luminos 
xenial/main Translation-en_US
[cephadmin][DEBUG ] Ign:11 https://download.ceph.com/debian-luminos 
xenial/main Translation-en

[cephadmin][DEBUG ] Fetched 306 kB in 1s (178 kB/s)
[cephadmin][DEBUG ] Reading package lists...
[cephadmin][WARNIN] W: The repository 
'https://download.ceph.com/debian-luminos xenial Release' does not 
have a Release file.
[cephadmin][WARNIN] E: Failed to fetch 
https://download.ceph.com/debian-luminos/dists/xenial/main/binary-amd64/Packages 
 404  Not Found
[cephadmin][WARNIN] E: Some index files failed to download. They have 
been ignored, or old ones used instead.
[cephadmin][ERROR ] RuntimeError: command returned non-zero exit 
status: 100
[ceph_deploy][ERROR ] RuntimeError: Failed to execute command: env 
DEBIAN_FRONTEND=noninteractive DEBIAN_PRIORITY=critical apt-get 
--assume-yes -q update



What's weird is that the Release file and 'Packages" does appear to be 
available when I visit download.ceph.com in my web browser. Any ideas?
Did you specify the release on the command line and make a typo? luminos 
vs. luminous


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Slow backfilling with bluestore, ssd and metadata pools

2017-12-21 Thread Burkhard Linke

Hi,


we are in the process of migrating our hosts to bluestore. Each host has 
12 HDDs (6TB / 4TB) and two Intel P3700 NVME SSDs with 375 GB capacity. 
The new bluestore OSDs are created by ceph-volume:



ceph-volume lvm create --bluestore --block.db /dev/nvmeXn1pY --data 
/dev/sdX1



6 OSDs share a SSD with 30GB partitions for rocksdb; the remaining space 
is used as additional ssd based osd without specifying additional 
partitions.



Backfilling from the other nodes works fine for the hdd based OSDs, but 
is _really_ slow for the ssd based ones. With filestore, moving our 
cephfs metadata pool around was a matter of 10 minutes (350MB, 8 million 
objects, 1024 PGs). With bluestore, remapping a part of the pool (about 
400 PGs, those affected by adding a new pair of ssd based OSDs) did not 
finish over night.



OSD config section from ceph.conf:

[osd]
osd_scrub_sleep = 0.05
osd_journal_size = 10240
osd_scrub_chunk_min = 1
osd_scrub_chunk_max = 1
max_pg_per_osd_hard_ratio = 4.0
osd_max_pg_per_osd_hard_ratio = 4.0
bluestore_cache_size_hdd = 5368709120
mon_max_pg_per_osd = 400


Backfilling runs with max-backfills set to 20 during day and 50 during 
night. Some numbers (ceph pg dump for the most advanced backfilling 
cephfs metadata PG, ten seconds difference):



ceph pg dump | grep backfilling | grep -v undersized | sort -k4 -n -r | 
tail -n 1 && sleep 10 && echo && ceph pg dump | grep backfilling | grep 
-v undersized | sort -k4 -n -r | tail -n 1

dumped all
8.101  7581  0  0  4549  0  4194304  2488  2488  active+remapped+backfilling  2017-12-21 09:03:30.429605  543240'1012998  543248:1923733  [78,34,49]  78  [78,34,19]  78  522371'1009118  2017-12-18 16:11:29.755231  522371'1009118  2017-12-18 16:11:29.755231


dumped all
8.101  7580  0  0  4542  0  0  2489  2489  active+remapped+backfilling  2017-12-21 09:03:30.429605  543248'1012999  543250:1923755  [78,34,49]  78  [78,34,19]  78  522371'1009118  2017-12-18 16:11:29.755231  522371'1009118  2017-12-18 16:11:29.755231



Seven objects in 10 seconds does not sound sane to me, given that only 
key-value has to be transferred.



Any hints how to tune this?


Regards,

Burkhard


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Proper way of removing osds

2017-12-21 Thread Burkhard Linke

Hi,


On 12/21/2017 11:03 AM, Karun Josy wrote:

Hi,

This is how I remove an OSD from cluster

  * Take it out
ceph osd out osdid

Wait for the balancing to finish

  * Mark it down
ceph osd down osdid

Then Purge it
 ceph osd purge osdid --yes-i-really-mean-it


While purging I can see there is another rebalancing occurring.
Is this the correct way to removes OSDs, or am I doing something wrong ?


The procedure is correct, but not optimal.

The first rebalancing is due to the osd being down; the second 
rebalancing is due to the fact that removing the osd changes the crush 
weight of the host and thus the base of the overall data distribution.


If you want to skip this, you can set the crush weight of the 
to-be-removed osd to 0.0, wait for the rebalancing to finish, and 
stop and remove the osd afterwards. You can also use smaller steps to 
reduce the backfill impact if necessary.
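
A rough sketch for a single OSD on luminous (untested; the id is a 
placeholder):

ceph osd crush reweight osd.<id> 0.0
# wait for the rebalancing to finish (ceph -s / HEALTH_OK)
ceph osd out <id>
systemctl stop ceph-osd@<id>
ceph osd purge <id> --yes-i-really-mean-it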


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow backfilling with bluestore, ssd and metadata pools

2017-12-21 Thread Burkhard Linke

Hi,


On 12/21/2017 11:43 AM, Richard Hesketh wrote:

On 21/12/17 10:28, Burkhard Linke wrote:

OSD config section from ceph.conf:

[osd]
osd_scrub_sleep = 0.05
osd_journal_size = 10240
osd_scrub_chunk_min = 1
osd_scrub_chunk_max = 1
max_pg_per_osd_hard_ratio = 4.0
osd_max_pg_per_osd_hard_ratio = 4.0
bluestore_cache_size_hdd = 5368709120
mon_max_pg_per_osd = 400

Consider also playing with the following OSD parameters:

osd_recovery_max_active
osd_recovery_sleep
osd_recovery_sleep_hdd
osd_recovery_sleep_hybrid
osd_recovery_sleep_ssd

In my anecdotal experience, the forced wait between requests (controlled by the 
recovery_sleep parameters) was causing significant slowdown in recovery speed 
in my cluster, though even at the default values it wasn't making things go 
nearly as slowly as your cluster - it sounds like something else is probably 
wrong.


Thanks for the hint. I've been thinking about recovery_sleep, too. But 
the default for ssd osds is set to 0.0:


# ceph daemon osd.93 config show | grep recovery
    "osd_allow_recovery_below_min_size": "true",
    "osd_debug_skip_full_check_in_recovery": "false",
    "osd_force_recovery_pg_log_entries_factor": "1.30",
    "osd_min_recovery_priority": "0",
    "osd_recovery_cost": "20971520",
    "osd_recovery_delay_start": "0.00",
    "osd_recovery_forget_lost_objects": "false",
    "osd_recovery_max_active": "3",
    "osd_recovery_max_chunk": "8388608",
    "osd_recovery_max_omap_entries_per_chunk": "64000",
    "osd_recovery_max_single_start": "1",
    "osd_recovery_op_priority": "3",
    "osd_recovery_op_warn_multiple": "16",
    "osd_recovery_priority": "5",
    "osd_recovery_retry_interval": "30.00",
    "osd_recovery_sleep": "0.00",
    "osd_recovery_sleep_hdd": "0.10",
    "osd_recovery_sleep_hybrid": "0.025000",
    "osd_recovery_sleep_ssd": "0.00",
    "osd_recovery_thread_suicide_timeout": "300",
    "osd_recovery_thread_timeout": "30",
    "osd_scrub_during_recovery": "false",

osd 93 is one of the ssd osds I've just recreated using bluestore about 3 
hours ago. All recovery related values are at their defaults. Since the 
first mail one hour ago the PG made some progress:


8.101  7580  0  0  2777  0  0  2496  2496  active+remapped+backfilling  2017-12-21 09:03:30.429605  543455'1013006  543518:1927782  [78,34,49]  78  [78,34,19]  78  522371'1009118  2017-12-18 16:11:29.755231  522371'1009118  2017-12-18 16:11:29.755231


So roughly 2000 objects on this PG have been copied to a new ssd based 
OSD (78,34,19 -> 78,34,49 -> one new copy).



Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs degraded on ceph luminous 12.2.2

2018-01-09 Thread Burkhard Linke

Hi,


On 01/08/2018 05:40 PM, Alessandro De Salvo wrote:

Thanks Lincoln,

indeed, as I said the cluster is recovering, so there are pending ops:


    pgs: 21.034% pgs not active
 1692310/24980804 objects degraded (6.774%)
 5612149/24980804 objects misplaced (22.466%)
 458 active+clean
 329 active+remapped+backfill_wait
 159 activating+remapped
 100 active+undersized+degraded+remapped+backfill_wait
 58  activating+undersized+degraded+remapped
 27  activating
 22  active+undersized+degraded+remapped+backfilling
 6   active+remapped+backfilling
 1   active+recovery_wait+degraded


If it's just a matter to wait for the system to complete the recovery 
it's fine, I'll deal with that, but I was wondendering if there is a 
more suble problem here.


OK, I'll wait for the recovery to complete and see what happens, thanks.


The blocked MDS might be caused by the 'activating' PGs. Do you have a 
warning about too many PGs per OSD? If that is the case, 
activating/creating/peering/whatever on the affected OSDs is blocked, 
which leads to blocked requests etc.


You can resolve this be increasing the number of allowed PGs per OSD 
('mon_max_pg_per_osd'). AFAIK it needs to be set for mon, mgr and osd 
instances. There was also been some discussion about this setting on the 
mailing list in the last weeks.
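
As an example, the corresponding entries from our ceph.conf (the values are 
just what we use, adjust to your setup; they need a daemon restart unless 
injected at runtime):

[global]
mon_max_pg_per_osd = 400
osd_max_pg_per_osd_hard_ratio = 4.0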


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] After Luminous upgrade: ceph-fuse clients failing to respond to cache pressure

2018-01-16 Thread Burkhard Linke

Hi,


On 01/16/2018 09:50 PM, Andras Pataki wrote:

Dear Cephers,


*snipsnap*




We are running with a larger MDS cache than usual, we have 
mds_cache_size set to 4 million.  All other MDS configs are the defaults.


AFAIK the MDS cache management in luminous has changed, focusing on 
memory size instead of number of inodes/caps/whatever.


We had to replace mds_cache_size with mds_cache_memory_limit to get mds 
cache working as expected again. This may also be the cause for the 
issue, since the default configuration uses quite a small cache. You can 
check this with 'ceph daemonperf mds.XYZ' on the mds host.
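
For example (the value is just an example, 4 GiB; the MDS name is a 
placeholder):

# in ceph.conf on the MDS host
[mds]
mds_cache_memory_limit = 4294967296

# or at runtime via the admin socket
ceph daemon mds.<name> config set mds_cache_memory_limit 4294967296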


If you change the memory limit you also need to consider a certain 
overhead of the memory allocation. There was a thread about this on the 
mailing list some weeks ago; you should expect at least 50% overhead. As 
with the previous releases this is not a hard limit. The process may 
consume more memory in certain situations. Given the fact that bluestore 
osds do not use kernel page cache anymore but their own memory cache, 
you need to plan memory consumption of all ceph daemons.


As an example, our mds is configured with mds_cache_memory_limit = 
80 and is consuming about 12 GB memory RSS.


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Adding disks -> getting unfound objects [Luminous]

2018-01-23 Thread Burkhard Linke

Hi,


On 01/23/2018 08:54 AM, Nico Schottelius wrote:

Good morning,

the osd.61 actually just crashed and the disk is still intact. However,
after 8 hours of rebuilding, the unfound objects are still missing:


*snipsnap*



Is there any chance to recover those pgs or did we actually lose data
with a 2 disk failure?

And is there any way out  of this besides going with

 ceph pg {pg-id} mark_unfound_lost revert|delete

?


Just my 2 cents:

If the disk is still intact and the data is still readable, you can try 
to export the pg content with ceph-objectstore-tool, and import it into 
another OSD.
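
A rough sketch (untested; both OSDs have to be stopped, and the paths, ids 
and pgid are placeholders):

# export the PG from the intact but crashed osd.61
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-61 --pgid <pgid> --op export --file /tmp/<pgid>.export
# import it on another (stopped) OSD
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<other id> --op import --file /tmp/<pgid>.export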


On the other hand: if the disk is still intact, just restart the OSD?

Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Importance of Stable Mon and OSD IPs

2018-01-23 Thread Burkhard Linke

Hi,


On 01/23/2018 09:53 AM, Mayank Kumar wrote:

Hi Ceph Experts

I am a new user of Ceph and currently using Kubernetes to deploy Ceph 
RBD Volumes. We are doing some initial work rolling it out to internal 
customers and in doing that we are using the ip of the host as the ip 
of the osd and mons. This means if a host goes down , we loose that 
ip. While we are still experimenting with these behaviors, i wanted to 
see what the community thinks for the following scenario :-


1: a rbd volume is already attached and mounted on host A
2: the osd on which this rbd volume resides, dies and never comes back up
3: another osd is replaced in its place. I dont know the intricacies 
here, but i am assuming the data for this rbd volume either moves to 
different osd's or goes back to the newly installed osd

4: the new osd has completley new ip
5: will the rbd volume attached to host A learn the new osd ip on 
which its data resides and everything just continues to work ?


What if all the mons also have changed ip ?
A volume does not reside "on a osd". The volume is striped, and each 
strip is stored in a placement group; the placement group on the other 
hand is distributed to several OSDs depending on the crush rules and the 
number of replicates.


If an OSD dies, ceph will backfill the now missing replicas to another 
OSD, given another OSD satisfying the crush rules is available. The same 
process is also triggered if an OSD is added.


This process is somewhat transparent to the ceph client, as long as 
enough replicas are present. The ceph client (librbd accessing a volume 
in this case) gets asynchronous notifications from the ceph mons in case 
of relevant changes, e.g. updates to the osd map reflecting the failure 
of an OSD. Traffic to the OSDs will be automatically rerouted depending 
on the crush rules as explained above. The OSD map also contains the IP 
addresses of all OSDs, so changes to the IP addresses are just another 
update to the map.


The only problem you might run into is changing the IP address of the 
mons. There's also a mon map listing all active mons; if the mon a ceph 
client is using dies/is removed, the client will switch to another 
active mon from the map. This works fine in a running system; you can 
change the IP address of a mon one by one without any interruption to 
the client (theoretically).


The problem is starting the ceph client. In this case the client uses 
the list of mons from the ceph configuration file to contact one mon and 
receive the initial mon map. If you change the hostnames/IP address of 
the mons, you also need to update the ceph configuration file.
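
For example, the mon list in /etc/ceph/ceph.conf on the clients has to match 
the new addresses (the addresses below are placeholders):

[global]
mon_host = 10.0.0.1,10.0.0.2,10.0.0.3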


The above outline is how it should work, given a valid ceph and network 
setup. YMMV.


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Importance of Stable Mon and OSD IPs

2018-01-31 Thread Burkhard Linke

Hi,


On 02/01/2018 07:21 AM, Mayank Kumar wrote:

Thanks Gregory and Burkhard

In kubernetes we use rbd create  and rbd map/unmap commands. In this 
perspective are you referring to rbd as the client or after the image 
is created and mapped, is there a different client running inside the 
kernel that you are referring to which can get osd and mon updates ?


My question is mainly: after we have run the rbd create and rbd map 
commands, does a client still exist or is it gone? If the rbd image is 
mapped on a host and then the osd or mon ips change, what happens in 
this case?


AFAIK the 'rbd create' command is creating its own client userspace 
session, which is terminated after the command is finished.


'rbd map' is instructing the kernel to map the given image to a block 
device. And the kernel is keeping track of map changes notified by the 
mons, including osd/mon ip changes.


You can verify this by having a look at the /sys/kernel/debug/ceph 
directory. All active kernel client sessions (rbd map or cephfs) create 
a subdirectory containing some information about the current state of the 
client.
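
For example (requires debugfs to be mounted; the exact file names may vary a 
bit between kernel versions):

# one subdirectory per client session, named <fsid>.client<id>
ls /sys/kernel/debug/ceph/
# current osd map as seen by the kernel client, including the OSD addresses
cat /sys/kernel/debug/ceph/*/osdmap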



Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migration from "classless pre luminous" to"deviceclasses" CRUSH.

2018-02-01 Thread Burkhard Linke

Hi,


On 02/01/2018 10:43 AM, Konstantin Shalygin wrote:


Hi cephers.


I have typical double root crush - for nvme pools and hdd pools 
created on Kraken cluster (what I mean: 
http://cephnotes.ksperis.com/blog/2015/02/02/crushmap-example-of-a-hierarchical-cluster-map).


Now cluster upgraded to Luminous and going to devices classes crush 
rules and I looking for experience.


1. Enable new crush rule with devices-class is safe for data and clients?

2. How much data movement? Should I be ready for slow requests?



We have changed our similar setup to a device class based one. According 
to the documentation the device classes are implemented by 'shadow' 
crush trees. 'ceph osd crush tree --show-shadow' displays all trees, 
including the device class specific ones. This allows the device class 
setup to be backwards compatible with older releases.


We had a MASSIVE data movement upon changing the crush rules to device 
class based ones. I'm not sure about the exact reasons, but I assume that 
the order of hosts in the crush tree has changed (hosts are ordered 
lexically now...), which resulted in about 80% of the data being moved around.


So be prepared for slow requests, and set the corresponding 
configuration values to reduce the backfill impact.
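
The Luminous commands for this look roughly like the following (rule and 
pool names are just examples):

ceph osd crush rule create-replicated replicated_hdd default host hdd
ceph osd crush rule create-replicated replicated_nvme default host nvme
# then point each pool at the matching rule
ceph osd pool set <pool> crush_rule replicated_hdd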


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Monitor won't upgrade

2018-02-15 Thread Burkhard Linke

Hi,


On 02/15/2018 09:19 AM, Mark Schouten wrote:

On woensdag 14 februari 2018 16:20:57 CET David Turner wrote:

 From the mon.0 server run `ceph --version`.  If you've restarted the mon
daemon and it is still showing 0.94.5, it is most likely because that is
the version of the packages on that server.

root@proxmox2:~# ceph --version
ceph version 0.94.10 (b1e0532418e4631af01acbc0cedd426f1905f4af)


Also, this node has OSD's running with version 0.94.5 and 0.94.7 ..
Did you verify that the ceph mon process was actually restarted? If the 
initscripts/systemd stuff has changed during the releases, the restart 
might not be able to recognize the already running process, and (maybe 
silently?) fail to start the new version.


If in doubt, stop the ceph mon service on the host, and kill any still 
running ceph-mon processes. If the mon still reports the older version 
after a restart, you need to dig further.
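
For example (the mon id is a placeholder):

# version of the actually running monitor, via its admin socket
ceph daemon mon.<id> version
# version of the installed binaries
ceph --version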


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Fwd: Re: Merging CephFS data pools

2016-08-23 Thread Burkhard Linke


Missing CC to list



 Forwarded Message 
Subject:Re: [ceph-users] Merging CephFS data pools
Date:   Tue, 23 Aug 2016 08:59:45 +0200
From:   Burkhard Linke 
To: Gregory Farnum 



Hi,


On 08/22/2016 10:02 PM, Gregory Farnum wrote:

On Thu, Aug 18, 2016 at 12:21 AM, Burkhard Linke
 wrote:

Hi,

the current setup for CephFS at our site uses two data pools due to
different requirements in the past. I want to merge these two pools now,
eliminating the second pool completely.

I've written a small script to locate all files on the second pool using
their file layout attributes and replace them with a copy on the correct
pool. This works well for files, but modifies the timestamps of the
directories.
Do you have any idea for a better solution that does not modify timestamps
and plays well with active CephFS clients (e.g. no problem with files being
used)? A simple 'rados cppool' probably does not work since the pool id/name
is part of a file's metadata and client will not be aware of moved
files.

Can't you just use rsync or something that will set the timestamps itself?

The script is using 'cp -a', which also preserves the timestamps. So
file timestamps are ok, but directory timestamps get updated by cp and
mv. And that's ok from my point of view.

The main concern is data integrity. There are 20TB left to be
transferred from the old pool, and part of this data is currently in
active use (including being overwritten in place). If write access to an
opened file happens while it is being transfered, the changes to that
file might be lost.

We can coordinate the remaining transfers with the affected users, if no
other way exists.

Regards,
Burkhard

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Recommended hardware for MDS server

2016-08-23 Thread Burkhard Linke

Hi,


On 08/22/2016 07:27 PM, Wido den Hollander wrote:

On 22 August 2016 at 15:52, Christian Balzer wrote:



Hello,

first off, not a CephFS user, just installed it on a lab setup for fun.
That being said, I tend to read most posts here.

And I do remember participating in similar discussions.

On Mon, 22 Aug 2016 14:47:38 +0200 Burkhard Linke wrote:


Hi,

we are running CephFS with about 70TB data, > 5 million files and about
100 clients. The MDS is currently colocated on a storage box with 14 OSD
(12 HDD, 2SSD). The box has two E52680v3 CPUs and 128 GB RAM. CephFS
runs fine, but it feels like the metadata operations may need more speed.


Firstly, I wouldn't share share the MDS with a storage/OSD node, a MON
node would make a more "natural" co-location spot.

Indeed. I always try to avoid to co-locate anything with the OSDs.
The MONs are also colocated with other OSD hosts, but this is also 
subject to change in the near future.



That being said, CPU wise that machine feels vastly overpowered, don't see
more then half of the cores utilized ever for OSD purposes, even in the
most contrived test cases.

Have you monitored that node with something like atop to get a feel what
tasks are using how much (of a specific) CPU?


Excerpt of MDS perf dump:
"mds": {
  "request": 73389282,
  "reply": 73389282,
  "reply_latency": {
  "avgcount": 73389282,
  "sum": 259696.749971457
  },
  "forward": 0,
  "dir_fetch": 4094842,
  "dir_commit": 720085,
  "dir_split": 0,
  "inode_max": 500,
  "inodes": 565,
  "inodes_top": 320979,
  "inodes_bottom": 530518,
  "inodes_pin_tail": 4148568,
  "inodes_pinned": 4469666,
  "inodes_expired": 60001276,
  "inodes_with_caps": 4468714,
  "caps": 4850520,
  "subtrees": 2,
  "traverse": 92378836,
  "traverse_hit": 75743822,
  "traverse_forward": 0,
  "traverse_discover": 0,
  "traverse_dir_fetch": 1719440,
  "traverse_remote_ino": 33,
  "traverse_lock": 3952,
  "load_cent": 7339063064,
  "q": 0,
  "exported": 0,
  "exported_inodes": 0,
  "imported": 0,
  "imported_inodes": 0
  },

The setup is expected grow, with regards to the amount of stored data
and the number of clients. The MDS process currently consumes about 36
TB RAM, with 22 TB resident. Since a large part of the MDS run single
threaded, a CPU with less core and more CPU frequency might be a better
choice in this setup.


I suppose you mean GB up there. ^o^

If memory serves me well, there are knobs to control MDS memory usage, so
tuning them upwards may help.


mds_cache_size you mean probably. That's the amount of inodes the MDS will 
cache at max.

Keep in mind, a single inodes uses about 4k of memory. So the default of 100k 
will consume 400MB of memory.

You can increase this to 16.777.216 so it will use about 64GB at max. I would 
still advise to put 128GB of memory in that machine since the MDS might have a 
leak at some points and you want to give it some headroom.

Source: http://docs.ceph.com/docs/master/cephfs/mds-config-ref/
mds_cache_size is already set to 5.000.000 and will need to be changed 
again since there are already cache pressure messages in the ceph logs. 
128GB RAM will definitely be a good idea.



And yes to the less cores, more speed rationale. Up to a point of course.

Indeed. Faster single-core E5 is better for the MDS than a slower multi-core.

So I'll have a closer look at configurations with E5-1XXX.



Again, checking with atop should give you a better insight there.

Also up there you said metadata stuff feels sluggish, have you considered
moving that pool to SSDs?

I recall from recent benchmarks that there was no benefit in having the 
metadata on SSD. Sure, it might help a bit with maybe a journal replay, but I 
think that regular disks with a proper journal do just fine.
Most of the metadata is read by the MDS upon start and cached in memory 
(that's why the process consumes several GB of RAM...). Given a suitable 
cache size, only journal updates should result in I/O to the metadata 
pool; client requests should be served from memory.


Thanks for the hints, I'll go for a single socket setup with an E5-1XXX and 
128GB RAM.


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CephFS + cache tiering in Jewel

2016-08-23 Thread Burkhard Linke

Hi,

the Firefly and Hammer releases did not support transparent usage of 
cache tiering in CephFS. The cache tier itself had to be specified as 
data pool, thus preventing on-the-fly addition and removal of cache tiers.


Does the same restriction also apply to Jewel? I would like to add a 
cache tier to an existing data pool.


Regards,
Burkhard

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] phantom osd.0 in osd tree

2016-08-23 Thread Burkhard Linke

Hi,


On 08/23/2016 08:19 PM, Reed Dier wrote:

Trying to hunt down a mystery osd populated in the osd tree.

Cluster was deployed using ceph-deploy on an admin node, originally 10.2.1 at 
time of deployment, but since upgraded to 10.2.2.

For reference, mons and mds do not live on the osd nodes, and the admin node is 
neither mon, mds, or osd.

Attempting to remove it from the crush map, it says that osd.0 does not exist.

Just looking for some insight into this mystery.

Thanks

# ceph osd tree
ID WEIGHT   TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 58.19960 root default
-2  7.27489     host node24
 1  7.27489         osd.1        up      1.0              1.0
-3  7.27489     host node25
 2  7.27489         osd.2        up      1.0              1.0
-4  7.27489     host node26
 3  7.27489         osd.3        up      1.0              1.0
-5  7.27489     host node27
 4  7.27489         osd.4        up      1.0              1.0
-6  7.27489     host node28
 5  7.27489         osd.5        up      1.0              1.0
-7  7.27489     host node29
 6  7.27489         osd.6        up      1.0              1.0
-8  7.27539     host node30
 9  7.27539         osd.9        up      1.0              1.0
-9  7.27489     host node31
 7  7.27489         osd.7        up      1.0              1.0
 0        0         osd.0      down        0              1.0
I've seen these entries during manual removal of OSD. There's still an 
OSD entry, but no crush location (e.g. after 'ceph osd crush remove ..' 
and before 'ceph osd rm ...'). It also happens if you invoke ceph-osd 
manually to create an OSD entry.


You should be able to remove this entry with 'ceph osd rm 0'.

Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS + cache tiering in Jewel

2016-08-24 Thread Burkhard Linke

Hi,


On 08/24/2016 10:22 PM, Gregory Farnum wrote:

On Tue, Aug 23, 2016 at 7:50 AM, Burkhard Linke
 wrote:

Hi,

the Firefly and Hammer releases did not support transparent usage of cache
tiering in CephFS. The cache tier itself had to be specified as data pool,
thus preventing on-the-fly addition and removal of cache tiers.

Does the same restriction also apply to Jewel? I would like to add a cache
tier to an existing data pool.

This got cleaned up a lot but is still a bit weird since you *can't*
use a bare EC pool on Ceph. I think right now you'll find that you can
add an EC pool to the CephFS data pools if it has a cache pool, but
doing so will prevent removing the cache pool.
EC pools have been a problem in Firefly and Hammer, too. We removed them 
from our CephFS setup in the wake of the cache tiering error in Hammer.


Does cache tiering work as expected with replicated pools? We use kernel 
based CephFS clients running kernel 4.6.6 on almost all machines.


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs/ceph-fuse: mds0: Client XXX:XXX failing to respond to capability release

2016-09-14 Thread Burkhard Linke

Hi,


On 09/14/2016 12:43 PM, Dennis Kramer (DT) wrote:

Hi Goncalo,

Thank you. Yes, i have seen that thread, but I have no near full osds 
and my mds cache size is pretty high.


You can use the daemon socket on the mds server to get an overview of 
the current cache state:


ceph daemon mds.XXX perf dump

The message itself indicates that the mds is in fact trying to convince 
clients to release capabilities, probably because it is running out of 
cache.


The 'session ls' command on the daemon socket lists all current ceph 
clients and the number of capabilities for each client. Depending on your 
workload / applications you might be surprised how many capabilities are 
assigned to individual nodes...


From the client's point of view the error means that there's either a bug 
in the client, or an application is keeping a large number of files open 
(e.g. do you run mlocate on the clients?)


If you use the kernel based client, re-mounting won't help, since the 
internal state is kept the same (afaik). In case of the ceph-fuse client 
the ugly way to get rid of the mount point is a lazy / forced umount 
and killing the ceph-fuse process if necessary. Processes with open file 
handles will complain afterwards.


Before using rude ways to terminate the client session I would propose 
to look for rogue applications on the involved host. We had a number of 
problems with multithreaded applications and concurrent file access in 
the past (both with ceph-fuse from hammer and kernel based clients). 
lsof or other tools might help locating the application.


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs/ceph-fuse: mds0: Client XXX:XXX failing to respond to capability release

2016-09-14 Thread Burkhard Linke

Hi,


My cluster is back to HEALTH_OK, the involved host has been restarted 
by the user. But I will debug some more on the host when i see this 
issue again next time.


PS: For completeness, i've stated that this issue was often seen in my 
current Jewel environment, I meant to say that this issue comes up 
sometimes (so not so often). But the times when i *do* have this 
issue, it blocks some I/O for clients as a consequence.


That's why I assume that the root cause might be a bug in ceph-fuse. 
There's support for page cache in ceph-fuse (not sure whether it is 
active by default), and afaik it has to keep the capabilities around as 
long as the corresponding file is still in the cache. If another client 
wants to access the file, the mds might need to revoke the capabilities 
for cached files (e.g. if one client wants to overwrite a file that has 
been read by another client before). The client has to wait until it is 
able to acquire the capabilities, resulting in blocked I/O.


We had similar problems in the past with ceph-fuse, especially if page 
cache support was active. We have switched to kernel based cephfs in the 
meantime (with its own pros and cons).


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CephFS: Upper limit for number of files in a directory?

2016-09-15 Thread Burkhard Linke

Hi,

does CephFS impose an upper limit on the number of files in a directory?


We currently have one directory with a large number of subdirectories:

$ ls | wc -l
158141

Creating a new subdirectory fails:

$ touch foo
touch: cannot touch 'foo': No space left on device

Creating files in a different directory does not show any problems. The 
last message in the MDS log relates to the large directory:


2016-09-15 07:51:54.539216 7f24ef2a6700  0 mds.0.bal replicating dir [dir 18bf9a6 /volumes/biodb/ncbi_genomes/all/ [2,head] auth v=57751905 cv=0/0 ap=0+2+2 state=1073741826|complete f(v0 m2016-08-22 08:51:34.714570 158141=3+158138) n(v182285 rc2016-08-22 11:59:34.976373 b3360569421156 2989670=2478235+511435) hs=158141+842,ss=0+0 | child=1 waiter=0 authpin=0 0x7f252ca05f00] pop 12842 .. rdp 7353.96 adj 0


Any hints what might go wrong in this case? MDS is taken from the 
current jewel git branch due to some pending backports:


# ceph-mds --version
ceph version 10.2.2-508-g9bfc0cf (9bfc0cf178dc21b0fe33e0ce3b90a18858abaf1b)

CephFS is mounted via kernel implementation:

# uname -a
Linux waas 4.6.6-040606-generic #201608100733 SMP Wed Aug 10 11:35:29 
UTC 2016 x86_64 x86_64 x86_64 GNU/Linux


ceph-fuse from jewel is also affected.


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS: Upper limit for number of files in adirectory?

2016-09-15 Thread Burkhard Linke

Hi,


On 09/15/2016 12:00 PM, John Spray wrote:

On Thu, Sep 15, 2016 at 2:20 PM, Burkhard Linke
 wrote:

Hi,

does CephFS impose an upper limit on the number of files in a directory?


We currently have one directory with a large number of subdirectories:

$ ls | wc -l
158141

Creating a new subdirectory fails:

$ touch foo
touch: cannot touch 'foo': No space left on device

Thanks for the fast reply.

This limit was added recently: it's a limit on the size of a directory fragment.

Previously folks were hitting nasty OSD issues with very large
directory fragments, so we added this limit to give a clean failure
instead.

I remember seeing a thread on the devel mailing list about this issue.


Directory fragmentation (mds_bal_frag setting) is turned off by
default in Jewel: I was planning to get this activated by default in
Kraken, but haven't quite got there yet.  Once fragmentation is
enabled you should find that the threshold for splitting dirfrags is
hit well before you hit the safety limit that gives you ENOSPC.
Does enabling directory fragmentation require an MDS restart? And are 
directories processed at restart or on demand during the first access? 
Are there known problems with fragmentation?




Note that if you set mds_bal_frag then you also need to use the "ceph
fs set  allow_dirfrags true" (that command from memory so check
the help if it's wrong), or the MDSs will ignore the setting.
So it's allowing fragmentation first and changing the MDS configuration 
afterwards.


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph full cluster

2016-09-26 Thread Burkhard Linke

Hi,


On 09/26/2016 12:58 PM, Dmitriy Lock wrote:

Hello all!
I need some help with my Ceph cluster.
I've installed ceph cluster with two physical servers with osd /data 
40G on each.

Here is ceph.conf:
[global]
fsid = 377174ff-f11f-48ec-ad8b-ff450d43391c
mon_initial_members = vm35, vm36
mon_host = 192.168.1.35,192.168.1.36
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx

osd pool default size = 2  # Write an object 2 times.
osd pool default min size = 1 # Allow writing one copy in a degraded 
state.


osd pool default pg num = 200
osd pool default pgp num = 200

Right after creation it was HEALTH_OK, and I've started with filling 
it. I've written 40G of data to the cluster using the Rados gateway, but 
the cluster uses all available space and keeps growing after I've added 
two more osds - 10G /data1 on each server.

Here is tree output:
# ceph osd tree
ID WEIGHT  TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.09756 root default
-2 0.04878 host vm35
0 0.03899 osd.0  up  1.0  1.0
2 0.00980 osd.2  up  1.0  1.0
-3 0.04878 host vm36
1 0.03899 osd.1  up  1.0  1.0
3 0.00980 osd.3  up  1.0  1.0

and health:
root@vm35:/etc# ceph health
HEALTH_ERR 5 pgs backfill_toofull; 15 pgs degraded; 16 pgs stuck unclean; 15 pgs undersized; recovery 87176/300483 objects degraded (29.012%); recovery 62272/300483 objects misplaced (20.724%); 1 full osd(s); 2 near full osd(s); pool default.rgw.buckets.data has many more objects per pg than average (too few pgs?)

root@vm35:/etc# ceph health detail
HEALTH_ERR 5 pgs backfill_toofull; 15 pgs degraded; 16 pgs stuck unclean; 15 pgs undersized; recovery 87176/300483 objects degraded (29.012%); recovery 62272/300483 objects misplaced (20.724%); 1 full osd(s); 2 near full osd(s); pool default.rgw.buckets.data has many more objects per pg than average (too few pgs?)
pg 10.5 is stuck unclean since forever, current state 
active+undersized+degraded, last acting [1,0]
pg 9.6 is stuck unclean since forever, current state 
active+undersized+degraded+remapped+backfill_toofull, last acting [1,0]
pg 10.4 is stuck unclean since forever, current state active+remapped, 
last acting [3,0,1]
pg 9.7 is stuck unclean since forever, current state 
active+undersized+degraded+remapped+backfill_toofull, last acting [1,0]
pg 10.7 is stuck unclean since forever, current state 
active+undersized+degraded+remapped+backfill_toofull, last acting [0,1]
pg 9.4 is stuck unclean since forever, current state 
active+undersized+degraded, last acting [1,0]
pg 9.1 is stuck unclean since forever, current state 
active+undersized+degraded, last acting [0,3]
pg 10.2 is stuck unclean since forever, current state 
active+undersized+degraded, last acting [1,0]
pg 9.0 is stuck unclean since forever, current state 
active+undersized+degraded, last acting [1,2]
pg 10.3 is stuck unclean since forever, current state 
active+undersized+degraded, last acting [2,1]
pg 9.3 is stuck unclean since forever, current state 
active+undersized+degraded+remapped+backfill_toofull, last acting [1,0]
pg 10.0 is stuck unclean since forever, current state 
active+undersized+degraded+remapped+backfill_toofull, last acting [1,0]
pg 9.2 is stuck unclean since forever, current state 
active+undersized+degraded, last acting [0,1]
pg 10.1 is stuck unclean since forever, current state 
active+undersized+degraded, last acting [0,1]
pg 9.5 is stuck unclean since forever, current state 
active+undersized+degraded, last acting [1,0]
pg 10.6 is stuck unclean since forever, current state 
active+undersized+degraded, last acting [0,1]

pg 9.1 is active+undersized+degraded, acting [0,3]
pg 10.2 is active+undersized+degraded, acting [1,0]
pg 9.0 is active+undersized+degraded, acting [1,2]
pg 10.3 is active+undersized+degraded, acting [2,1]
pg 9.3 is active+undersized+degraded+remapped+backfill_toofull, acting 
[1,0]
pg 10.0 is active+undersized+degraded+remapped+backfill_toofull, 
acting [1,0]

pg 9.2 is active+undersized+degraded, acting [0,1]
pg 10.1 is active+undersized+degraded, acting [0,1]
pg 9.5 is active+undersized+degraded, acting [1,0]
pg 10.6 is active+undersized+degraded, acting [0,1]
pg 9.4 is active+undersized+degraded, acting [1,0]
pg 10.7 is active+undersized+degraded+remapped+backfill_toofull, 
acting [0,1]
pg 9.7 is active+undersized+degraded+remapped+backfill_toofull, acting 
[1,0]
pg 9.6 is active+undersized+degraded+remapped+backfill_toofull, acting 
[1,0]

pg 10.5 is active+undersized+degraded, acting [1,0]
recovery 87176/300483 objects degraded (29.012%)
recovery 62272/300483 objects misplaced (20.724%)
osd.1 is full at 95%
osd.2 is near full at 91%
osd.3 is near full at 91%
pool default.rgw.buckets.data objects per pg (12438) is more than 
17.8451 times cluster average (697)


In log i see this:
2016-09-26 10:37:21.688849 mon.0 192.168.1.35:6789/0 
 483

Re: [ceph-users] Ceph with Cache pool - disk usage / cleanup

2016-09-28 Thread Burkhard Linke

Hi,


Someone correct me if I'm wrong, but removing objects in a cache tier 
setup results in empty objects which act as markers for deleting the 
object on the backing store. I've seen the same pattern you describe in 
the past.



As a test you can try to evict all objects from the cache pool. This 
should trigger the actual removal of pending objects.
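
For reference, a minimal sketch with the rados CLI (the cache pool name 
is a placeholder):

# flush dirty objects and evict everything from the cache tier
rados -p <cache-pool> cache-flush-evict-all
rados df    # afterwards: check that the object count of the cache pool dropped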



Regards,

Burkhard


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph write performance issue

2016-09-29 Thread Burkhard Linke

Hi,


I would propose starting with an OSD-only benchmark (ceph tell osd.* 
bench) to get an upper estimate of what the OSDs themselves are capable of.



You also did not describe the network setup. 800MB/s is a good value if 
the network connection is a 10GbE link (which has a theoretical upper 
limit of 1.2 GB/s without protocol overhead). You may also be limited by 
the client's CPU, so check CPU load, too.
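
A rough sketch of those checks (host names are placeholders; iperf3 has 
to be installed on both ends):

ceph tell osd.* bench        # per-OSD write benchmark (1 GB in 4 MB chunks by default)
iperf3 -s                    # on one of the OSD nodes
iperf3 -c <osd-node> -P 4    # on the client: raw network throughput
top                          # on the client while the rados bench is running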



As a comparison, our cluster (9 hosts with single 40 GbE links) allows 
up to 1 GB/s in simple rados benchmarks runing on clients with 2x 10GbE 
links).



Regards,

Burkhard

On 09/29/2016 12:05 PM, min fang wrote:

I used 2 copies, not 3, so should be 1000MB/s in theory. thanks.

2016-09-29 17:54 GMT+08:00 Nick Fisk <n...@fisk.me.uk>:


*From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf Of* min fang
*Sent:* 29 September 2016 10:34
*To:* ceph-users <ceph-users@lists.ceph.com>
*Subject:* [ceph-users] ceph write performance issue

Hi, I created 40 osds ceph cluster with 8 PM863 960G SSD as
journal. One ssd is used by 5 osd drives as journal.   The ssd 512
random write performance is about 450MB/s, but the whole cluster
sequential write throughput is only 800MB/s. Any suggestion on
improving sequential write performance? thanks.

Take a conservative figure of 50MB/s for each disk as writing in
Ceph is not just straight sequential writes, there is a slight
random nature to it.

(40x50MB/s)/3 = 666MB/s. Seems fine to me.


Testing result is here:
rados bench -p libvirt-pool 10 write --no-cleanup
Maintaining 16 concurrent writes of 4194304 bytes to objects of
size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_redpower-sh-04_16462
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0         0         0         0         0            -          0
    1      15       189       174   695.968       696    0.0359122   0.082477
    2      16       395       379   757.938       820    0.0634079  0.0826266
    3      16       582       566   754.601       748    0.0401129  0.0830207
    4      16       796       780   779.934       856    0.0374938  0.0816794
    5      16       977       961   768.735       724    0.0489886  0.0827479
    6      16      1172      1156   770.601       780    0.0428639  0.0812062
    7      16      1387      1371   783.362       860    0.0461826  0.0811803
    8      16      1545      1529   764.433       632     0.238497  0.0831018
    9      16      1765      1749   777.265       880    0.0557358  0.0814399
   10      16      1971      1955   781.931       824    0.0321333  0.0814144

Total time run: 10.044813
Total writes made:  1972
Write size: 4194304
Object size:4194304
Bandwidth (MB/sec): 785.281
Stddev Bandwidth:   80.8235
Max bandwidth (MB/sec): 880
Min bandwidth (MB/sec): 632
Average IOPS:   196
Stddev IOPS:20
Max IOPS:   220
Min IOPS:   158
Average Latency(s): 0.081415
Stddev Latency(s):  0.0554568
Max latency(s): 0.345111
Min latency(s): 0.0230153

my ceph osd configuration:
osd_mkfs_type = xfs
osd_mount_options_xfs = rw,noatime,inode64,logbsize=256k
osd_mkfs_options_xfs = -f -i size=2048
filestore_max_inline_xattr_size = 254
filestore_max_inline_xattrs = 6
osd_op_threads = 20
filestore_queue_max_ops = 25000
journal_max_write_entries=1
journal_queue_max_ops=5
objecter_inflight_ops=10240
filestore_queue_max_bytes=1048576000
filestore_queue_committing_max_bytes =1048576000
journal_max_write_bytes=1073714824
journal_queue_max_bytes=1048576
ms_dispatch_throttle_bytes=1048576000
objecter_inflight_op_bytes=1048576000
filestore_max_sync_interval=20
filestore_flusher=false
filestore_flush_min=0
filestore_sync_flush=true
journal_block_align = true
journal_dio = true
journal_aio = true
journal_force_aio = true
osd_op_num_shards=8
osd_op_num_threads_per_shard=2
filestore_wbthrottle_enable=false
filestore_fd_cache_size=1024
filestore_omap_header_cache_size=1024





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


--
Dr. rer. nat. Burkhard Linke
Bioinformatics and Systems Biology
Justus-Liebig-University Giessen
35392 Giessen, Germany
Phone: (+49) (0)641 9935810

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph with Cache pool - disk usage / cleanup

2016-09-29 Thread Burkhard Linke

Hi,


On 09/29/2016 01:34 PM, Sascha Vogt wrote:

*snipsnap*


We have a huge amount of short lived VMs which are deleted before they
are even flushed to the backing pool. Might this be the reason, that
ceph doesn't handle that particular thing well? Eg. when deleting an
object / RBD image which has not been flushed, that the "deletion
mechanism" only deletes whats in the backing pool and if there is
nothing it skips deleting the marker files in the cache pool?
You should be able to validate this. Create a new rbd in the pool, map 
it, write some data to it (a few MB should be sufficient), note its rbd 
prefix (rbd info <image>), and remove the rbd.


Then check whether objects with that prefix exist in the cache pool or 
the backing pool. If such objects exist, try to flush/evict them manually 
(rados cache-flush / cache-evict) and check whether they are still 
present in the pools.
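
A rough sketch of that test (pool and image names are placeholders; with 
a cache tier the rbd is created in the backing pool):

rbd create <backing-pool>/evict-test --size 128
rbd map <backing-pool>/evict-test
dd if=/dev/zero of=/dev/rbd/<backing-pool>/evict-test bs=1M count=8 oflag=direct
rbd info <backing-pool>/evict-test          # note the block_name_prefix, e.g. rbd_data.<id>
rbd unmap /dev/rbd/<backing-pool>/evict-test
rbd rm <backing-pool>/evict-test
rados -p <cache-pool> ls | grep <id>        # objects left in the cache pool?
rados -p <backing-pool> ls | grep <id>      # objects in the backing pool?
rados -p <cache-pool> cache-flush <object>  # then flush/evict a leftover object manually
rados -p <cache-pool> cache-evict <object>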


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph with Cache pool - disk usage / cleanup

2016-09-29 Thread Burkhard Linke

Hi,


On 09/29/2016 01:46 PM, Sascha Vogt wrote:

A quick follow up question:

Am 29.09.2016 um 13:34 schrieb Sascha Vogt:

Can you check/verify that the deleted objects are actually gone on the
backing pool?

How do I check that? Aka how to find out on which OSD a particular
object in the cache pool ends up in the backing pool?

Ie. I have a 0-byte file in the cache pool:
./current/11.26_head/DIR_6/DIR_2/DIR_8/DIR_6/DIR_0/rbd\udata.3abcbb93a5057a9.4225__head_94406826__b

I searched for the prefix in the backing pool via

rados -p ephemeral-vms ls | grep data.3abcbb93a5057a9

and nothing was found, so I guess the whole RBD image has been deleted
from the backing pool already (or has never hit the backing pool in the
first place)
You can use the same rados commands on the cache pool to check for 
individual rados objects.


If you want to associate the rados objects with rbd images, you need to 
know the rbd prefix of the existing images. Use something like 'rbd 
-p ephemeral-vms ls | xargs -n 1 rbd -p ephemeral-vms info' to list 
information about all rbd images, including the prefix in the 
block_name_prefix row.
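
Or, as a slightly more readable sketch that prints one image and its 
data prefix per line:

rbd -p ephemeral-vms ls | while read img; do
    prefix=$(rbd -p ephemeral-vms info "$img" | awk '/block_name_prefix/ {print $2}')
    echo "$img $prefix"
done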


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph with Cache pool - disk usage / cleanup

2016-09-29 Thread Burkhard Linke

Hi,


On 09/29/2016 02:52 PM, Sascha Vogt wrote:

Hi,

Am 29.09.2016 um 13:45 schrieb Burkhard Linke:

On 09/29/2016 01:34 PM, Sascha Vogt wrote:

We have a huge amount of short lived VMs which are deleted before they
are even flushed to the backing pool. Might this be the reason, that
ceph doesn't handle that particular thing well? Eg. when deleting an
object / RBD image which has not been flushed, that the "deletion
mechanism" only deletes whats in the backing pool and if there is
nothing it skips deleting the marker files in the cache pool?

You should be able to validate this. Create a new rbd in the pool, map
it, write some data to it (few MB should be sufficient), note its rbd
prefix (rbd info ), and remove the rbd.

Then check whether objects with the prefix exists in the cache pool or
the backend pool. If such objects exists, try to flush/evict it manually
(rados cache-flush / cache-evict) and check whether the object is still
present in the pools.

Took a while and a lot more numbers we're seeing is starting to make
sense now.

After rbd rm the objects are still present in the cache pool, stat on
any of those objects returns the "No such file or directory" error and
no object is on the backing pool.

That explains our values we're seeing on "ceph df detail" which is:

ephemeral-vms: 910880 objects
ssd: 109096429 objects <- constantly growing... only rarely dropping
number. The number drops when an Openstack

-> calculating each "missing" object with 4k we end up with around 411
GB, one replica and we have our around 800 GB missing space :(

Good thing: Evicting an object where stat returns the error removes it.
So I'm now listing all objects in the SSD pool and then trying to evict
those who return a "No such file" when stating them. Hopefully that
doesn't break anything.

Question: Do I need a flush before the evict? Just in case? Or what
happens if I call evict on an object which is technically not a dead
object and needs to be flushed first?
AFAIK evicting an object also flushes it to the backing storage, so 
evicting a live object should be ok. It will be promoted again at the 
next access (or whatever triggers promotion in the caching mechanism).
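
For reference, the loop described above would look roughly like this, 
assuming the cache pool is named 'ssd':

# evict every object whose stat fails with "No such file or directory"
rados -p ssd ls | while read obj; do
    rados -p ssd stat "$obj" >/dev/null 2>&1 || rados -p ssd cache-evict "$obj"
done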


For the dead 0-byte files: Should I open a bug report?
Not sure whether this is a bug at all. The objects should be evicted and 
removed if the cache pool hits the max object thresholds.



Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] radosgw backup / staging solutions?

2016-09-30 Thread Burkhard Linke

Hi,


we are about to move from internal testing to a first production setup 
with our object storage based on Ceph RGW. One of the last open problems 
is a backup / staging solution for S3 buckets.


As far as I know, many of the lifecycle operations available in Amazon 
S3 are not implemented in Ceph RGW yet. How do you cope with tasks like 
long-term archives (we need to keep some data around for up to 10 
years...) or restoring objects accidentally deleted by a user?


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 6 Node cluster with 24 SSD per node: Hardwareplanning/ agreement

2016-10-04 Thread Burkhard Linke

Hi,


some thoughts about network and disks inline

On 10/04/2016 03:43 PM, Denny Fuchs wrote:

Hello,


*snipsnap*



* Storage NIC: 1 x Infiniband MCX314A-BCCT
** I read that the ConnectX-3 Pro is better supported than the X-4 and a 
bit cheaper

** Switch: 2 x Mellanox SX6012 (56Gb/s)
** Active FC cables
** Maybe VPI is nice to have, but unsure.
The Infiniband support in Ceph is experimental and not recommended for 
production use. You'll have to fall back to IPoIB for the moment. The 
ConnectX-3 has configurable ports and also supports 40GbE, so ethernet 
switches might be an alternative for your setup (some of the Mellanox 
switches support both Infiniband and ethernet).


* Production NIC: 1 x Intel 520 dual port SFP+
** Connected each to one of a HP 2920 10Gb/s ports via 802.3ad

All nodes are connected over cross to every switch, so if one switch 
goes down, a second path is available.



* Disk:
** Storage: 24 x Crucial MX300 250GB (maybe for production 12xSSD / 
12x big Sata disks)

** OSD journal: 1 x Intel SSD DC P3700 PCIe


Not sure about the OSD SSDs, but keep in mind that consumer SSDs are not 
intended for running under load 24/7. Have a closer look at their write 
endurance and ensure that the SSDs are monitored properly. If you take 
all of them into production at the same time and get a good data 
distribution over all SSDs, they might also all fail within a very 
short time span...
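
A minimal monitoring sketch with smartmontools; the name of the 
wear/endurance attribute differs between vendors, so adjust the pattern 
for your drives:

for dev in /dev/sd?; do
    echo "== $dev =="
    smartctl -A "$dev" | egrep -i 'wear|lifetime|written'
done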


The journal SSD is OK (we have the same model), but according to tests 
it is only capable of writing about 1 GB/s as journal SSD (obligatory 
blog link: 
http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/). 
With 2x 10GbE public network links, the SSD might become the bottleneck 
in large scale write operations.
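
The test from the blog post is roughly the following; it overwrites the 
device, so only run it against an empty/unused disk:

# synchronous direct writes, as the OSD journal issues them
dd if=/dev/zero of=/dev/<journal-device> bs=4k count=100000 oflag=direct,dsync
dd if=/dev/zero of=/dev/<journal-device> bs=4M count=1000   oflag=direct,dsync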


For read operations with 24 SSDs (assuming 400 MB/s per SSD -> 9.6 GB/s) 
the network will definitely become the bottleneck. You might also want 
to check whether the I/O subsystem is able to drive 24 SSDs (SAS-3 runs 
at 12 Gbit/s per lane, and expanders are usually connected with 4 lanes 
-> 6 GB/s).


Regards,
Burkhard

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Merging CephFS data pools

2016-10-05 Thread Burkhard Linke

Hi,

I've managed to move the data from the old pool to the new one using 
some shell scripts and cp/rsync. Recursive getfattr on the mount point 
does not reveal any file with a layout referring to the old pool.
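
A sketch of that check (the mount point /ceph is a placeholder):

# print every file whose layout still points at the old pool
find /ceph -type f | while read f; do
    pool=$(getfattr -n ceph.file.layout.pool --only-values "$f" 2>/dev/null)
    [ "$pool" = "cephfs_two_rep_data" ] && echo "$f"
done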


Nonetheless 486 objects are left in the pool:
...
POOLS:
NAME ID USED   %USED MAX AVAIL 
OBJECTS

...
cephfs_two_rep_data  12  1695M 0 48740G  486
...


The majority of objects seem to belong to one file:
# rados -p cephfs_two_rep_data ls | grep -c 1be9363.
414

But the first chunk of this file is missing (no 1be9363. 
object in that pool):

# rados -p cephfs_two_rep_data ls | grep 1be9363 | sort
1be9363.0262
1be9363.0263
1be9363.0264
1be9363.0265

1be9363.03fe
1be9363.03ff

I suspect these objects are left over from a file deletion that was 
interrupted. The remaining objects are the first chunks of files, all 
well below the default 4 MB stripe size (-> named XYZ.). All but 
one of them do not have any xattrs associated with them (no parent, no 
layout). The single object with a parent xattr seems to be a stray 
object. I want to get rid of that data pool, but I would also like to 
avoid wrecking the filesystem.
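
The xattrs can be inspected with something like this (the object name is 
a placeholder):

rados -p cephfs_two_rep_data listxattr <object>
rados -p cephfs_two_rep_data getxattr <object> parent > parent.bin
ceph-dencoder type inode_backtrace_t import parent.bin decode dump_json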


To make a long story short:
What's the best way to verify that a certain inode id is not 
used/referenced within cephfs anymore (a crude check I have in mind is 
sketched below)?
Is it possible to dump all strays to verify that the single stray object 
in the pool is also an orphan and can be removed (MDS cache size is 
5.000.000, thus dumping the cache will result in a service interruption)?

Do the filesystem recovery tools detect orphaned objects in data pools?
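
For the first question, the crude check I have in mind (assuming the 
filesystem is mounted at /ceph):

printf '%d\n' 0x1be9363      # object names are <inode in hex>.<chunk> -> 29266787
find /ceph -inum 29266787    # does any path still reference this inode?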


Regards,
Burkhard

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

