Re: [ceph-users] Cephfs NFS failover

2017-12-21 Thread Robert Sander
On 20.12.2017 18:45, nigel davies wrote:
> Hey all
> 
> Can anyone advise on how I can do this?

You can use ctdb for that and run an active/active NFS cluster:

https://wiki.samba.org/index.php/Setting_up_CTDB_for_Clustered_NFS

The cluster filesystem can be CephFS. This also works with Samba, i.e.
you get a file server that scales without practical limits.
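A minimal sketch of the moving parts, assuming CephFS is mounted on every
gateway node via the kernel client (host names, paths and the client name
"samba" are placeholders):

# mount CephFS on each Samba/NFS gateway node
mount -t ceph mon1,mon2,mon3:/ /mnt/cephfs -o name=samba,secretfile=/etc/ceph/samba.secret

# CTDB needs its recovery lock on that shared filesystem, e.g. in
# /etc/sysconfig/ctdb (or /etc/default/ctdb, depending on the distribution):
#   CTDB_RECOVERY_LOCK=/mnt/cephfs/ctdb/.ctdb.lock
# /etc/ctdb/nodes lists the internal IPs of the gateway nodes,
# /etc/ctdb/public_addresses the floating IPs handed out to clients.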

Regards
-- 
Robert Sander
Heinlein Support GmbH
Schwedter Str. 8/9b, 10119 Berlin

http://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Zwangsangaben lt. §35a GmbHG:
HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
Geschäftsführer: Peer Heinlein -- Sitz: Berlin



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Proper way of removing osds

2017-12-21 Thread Karun Josy
Hi,

This is how I remove an OSD from the cluster:


   - Take it out:
   ceph osd out osdid

   Wait for the rebalancing to finish

   - Mark it down:
   ceph osd down osdid

   - Then purge it:
   ceph osd purge osdid --yes-i-really-mean-it


While purging I can see another rebalancing occurring.
Is this the correct way to remove OSDs, or am I doing something wrong ?



Karun
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Cephfs limits

2017-12-21 Thread nigel davies
Hey all, is it possible to set CephFS to have a space limit?
E.g. I would like to set my CephFS to a limit of 20TB
and my S3 storage to 4TB, for example.

thanks
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Proper way of removing osds

2017-12-21 Thread Konstantin Shalygin

Is this the correct way to remove OSDs, or am I doing something wrong ?
The generic way for maintenance (e.g. disk replacement) is to rebalance by 
changing the OSD's CRUSH weight:



ceph osd crush reweight osdid 0

The cluster then migrates the data off this OSD.


When the cluster reaches HEALTH_OK again, you can safely remove this OSD:

ceph osd out osd_id
systemctl stop ceph-osd@osd_id
ceph osd crush remove osd_id
ceph auth del osd_id
ceph osd rm osd_id
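
For what it's worth, on Luminous the last three commands (crush remove, auth
del, osd rm) can be collapsed into a single call; a hedged equivalent:

ceph osd purge osd_id --yes-i-really-mean-it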



k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Slow backfilling with bluestore, ssd and metadata pools

2017-12-21 Thread Burkhard Linke

Hi,


we are in the process of migrating our hosts to bluestore. Each host has 
12 HDDs (6TB / 4TB) and two Intel P3700 NVME SSDs with 375 GB capacity. 
The new bluestore OSDs are created by ceph-volume:



ceph-volume lvm create --bluestore --block.db /dev/nvmeXn1pY --data 
/dev/sdX1



6 OSDs share an SSD, with 30GB partitions for rocksdb; the remaining space 
is used as an additional ssd based OSD without specifying additional 
partitions.



Backfilling from the other nodes works fine for the hdd based OSDs, but 
is _really_ slow for the ssd based ones. With filestore, moving our 
cephfs metadata pool around was a matter of 10 minutes (350MB, 8 million 
objects, 1024 PGs). With bluestore, remapping part of the pool (about 
400 PGs, those affected by adding a new pair of ssd based OSDs) did not 
finish overnight.



OSD config section from ceph.conf:

[osd]
osd_scrub_sleep = 0.05
osd_journal_size = 10240
osd_scrub_chunk_min = 1
osd_scrub_chunk_max = 1
max_pg_per_osd_hard_ratio = 4.0
osd_max_pg_per_osd_hard_ratio = 4.0
bluestore_cache_size_hdd = 5368709120
mon_max_pg_per_osd = 400


Backfilling runs with max-backfills set to 20 during day and 50 during 
night. Some numbers (ceph pg dump for the most advanced backfilling 
cephfs metadata PG, ten seconds difference):



ceph pg dump | grep backfilling | grep -v undersized | sort -k4 -n -r | 
tail -n 1 && sleep 10 && echo && ceph pg dump | grep backfilling | grep 
-v undersized | sort -k4 -n -r | tail -n 1

dumped all
8.101  7581  0    0  4549   0 4194304 
2488 2488 active+remapped+backfilling 2017-12-21 09:03:30.429605 
543240'1012998    543248:1923733 [78,34,49] 
78 [78,34,19] 78    522371'1009118 2017-12-18 
16:11:29.755231    522371'1009118 2017-12-18 16:11:29.755231


dumped all
8.101  7580  0    0  4542 0   0 
2489 2489 active+remapped+backfilling 2017-12-21 09:03:30.429605 
543248'1012999    543250:1923755 [78,34,49] 
78 [78,34,19] 78    522371'1009118 2017-12-18 
16:11:29.755231    522371'1009118 2017-12-18 16:11:29.755231



Seven objects in 10 seconds does not sound sane to me, given that only 
key-value has to be transferred.



Any hints how to tune this?


Regards,

Burkhard


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Proper way of removing osds

2017-12-21 Thread Richard Hesketh
On 21/12/17 10:21, Konstantin Shalygin wrote:
>> Is this the correct way to remove OSDs, or am I doing something wrong ?
> The generic way for maintenance (e.g. disk replacement) is to rebalance by 
> changing the OSD's CRUSH weight:
> 
> 
> ceph osd crush reweight osdid 0
> 
> The cluster then migrates the data off this OSD.
> 
> 
> When the cluster reaches HEALTH_OK again, you can safely remove this OSD:
> 
> ceph osd out osd_id
> systemctl stop ceph-osd@osd_id
> ceph osd crush remove osd_id
> ceph auth del osd_id
> ceph osd rm osd_id
> 
> 
> 
> k

Basically this: when you mark an OSD "out" it stops receiving data and its PGs are 
remapped, but it is still part of the CRUSH map and still influences the weights of 
its buckets. So when you do the final purge, the weights shift and another 
rebalance occurs. Weighting the OSD to 0 first ensures you don't incur any 
extra data movement when you finally purge it.

Rich



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Proper way of removing osds

2017-12-21 Thread Burkhard Linke

Hi,


On 12/21/2017 11:03 AM, Karun Josy wrote:

Hi,

This is how I remove an OSD from the cluster

  * Take it out
ceph osd out osdid

Wait for the balancing to finish

  * Mark it down
ceph osd down osdid

Then Purge it
 ceph osd purge osdid --yes-i-really-mean-it


While purging I can see another rebalancing occurring.
Is this the correct way to remove OSDs, or am I doing something wrong ?


The procedure is correct, but not optimal.

The first rebalancing is due to the OSD being taken out; the second 
rebalancing is due to the fact that removing the OSD changes the CRUSH 
weight of the host and thus the basis of the overall data distribution.


If you want to avoid this, you can set the CRUSH weight of the 
to-be-removed OSD to 0.0, wait for the rebalancing to finish, and 
stop and remove the OSD afterwards. You can also lower the weight in 
smaller steps to reduce the backfill impact if necessary.
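
For illustration, a hedged sketch of that step-wise drain (OSD id, intermediate 
weights and polling interval are placeholders, and the loop assumes the cluster 
is otherwise healthy):

OSD_ID=12                                   # the OSD to be removed
for w in 2 1 0; do                          # step the crush weight down to 0
    ceph osd crush reweight osd.$OSD_ID $w
    # wait until backfill triggered by this step has finished
    until ceph health | grep -q HEALTH_OK; do sleep 60; done
done
# the OSD now holds no data; it can be stopped and purged without
# causing another rebalance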


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow backfilling with bluestore, ssd and metadata pools

2017-12-21 Thread Richard Hesketh
On 21/12/17 10:28, Burkhard Linke wrote:
> OSD config section from ceph.conf:
> 
> [osd]
> osd_scrub_sleep = 0.05
> osd_journal_size = 10240
> osd_scrub_chunk_min = 1
> osd_scrub_chunk_max = 1
> max_pg_per_osd_hard_ratio = 4.0
> osd_max_pg_per_osd_hard_ratio = 4.0
> bluestore_cache_size_hdd = 5368709120
> mon_max_pg_per_osd = 400

Consider also playing with the following OSD parameters:

osd_recovery_max_active
osd_recovery_sleep
osd_recovery_sleep_hdd
osd_recovery_sleep_hybrid
osd_recovery_sleep_ssd

In my anecdotal experience, the forced wait between requests (controlled by the 
recovery_sleep parameters) was causing significant slowdown in recovery speed 
in my cluster, though even at the default values it wasn't making things go 
nearly as slowly as your cluster - it sounds like something else is probably 
wrong.
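
A hedged sketch of inspecting and temporarily overriding those values at runtime 
(the OSD id is a placeholder and injectargs changes do not survive a restart):

# read the current value from one OSD's admin socket
ceph daemon osd.78 config get osd_recovery_sleep_hdd
# override it on all OSDs for the duration of the backfill
ceph tell osd.* injectargs '--osd_recovery_sleep_hdd 0.0'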

Rich



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Not timing out watcher

2017-12-21 Thread Ilya Dryomov
On Wed, Dec 20, 2017 at 6:56 PM, Jason Dillaman  wrote:
> ... looks like this watch "timeout" was introduced in the kraken
> release [1] so if you don't see this issue with a Jewel cluster, I
> suspect that's the cause.
>
> [1] https://github.com/ceph/ceph/pull/11378

Strictly speaking that's a backwards incompatible change, because
zeroes have never been and aren't enforced -- clients are free to fill
the remaining bits of ceph_osd_op with whatever values.

That said, the kernel client has always been zeroing the front portion
of the message before encoding, so even though the timeout field hasn't
been carried into ceph_osd_op definition in the kernel, it's always 0
(for "use osd_client_watch_timeout for this watch").

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Not timing out watcher

2017-12-21 Thread Ilya Dryomov
On Wed, Dec 20, 2017 at 6:20 PM, Serguei Bezverkhi (sbezverk)
 wrote:
> It took 30 minutes for the Watcher to time out after ungraceful restart. Is 
> there a way limit it to something a bit more reasonable? Like 1-3 minutes?
>
> On 2017-12-20, 12:01 PM, "Serguei Bezverkhi (sbezverk)"  
> wrote:
>
> Ok, here is what I found out. If I gracefully kill a pod then watcher 
> gets properly cleared, but if it is done ungracefully, without “rbd unmap” 
> then even after a node reboot Watcher stays up for a long time,  it has been 
> more than 20 minutes and it is still active (no any kubernetes services are 
> running).

Hi Serguei,

Can you try taking k8s out of the equation -- set up a fresh VM with
the same kernel, do "rbd map" in it and kill it?

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] POOL_NEARFULL

2017-12-21 Thread Konstantin Shalygin

Update your ceph.conf file


This also does not help. I have created ticket 
http://tracker.ceph.com/issues/22520
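
For what it's worth, in Luminous the nearfull/full ratios live in the OSDMap
rather than being read from ceph.conf at runtime, so if the warning is driven by
the OSD nearfull ratio (and not a pool quota), a hedged alternative to editing
the config file is:

ceph osd set-nearfull-ratio 0.85
ceph osd set-full-ratio 0.95

(the values above are just the defaults, shown as placeholders).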


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow backfilling with bluestore, ssd and metadatapools

2017-12-21 Thread Burkhard Linke

Hi,


On 12/21/2017 11:43 AM, Richard Hesketh wrote:

On 21/12/17 10:28, Burkhard Linke wrote:

OSD config section from ceph.conf:

[osd]
osd_scrub_sleep = 0.05
osd_journal_size = 10240
osd_scrub_chunk_min = 1
osd_scrub_chunk_max = 1
max_pg_per_osd_hard_ratio = 4.0
osd_max_pg_per_osd_hard_ratio = 4.0
bluestore_cache_size_hdd = 5368709120
mon_max_pg_per_osd = 400

Consider also playing with the following OSD parameters:

osd_recovery_max_active
osd_recovery_sleep
osd_recovery_sleep_hdd
osd_recovery_sleep_hybrid
osd_recovery_sleep_ssd

In my anecdotal experience, the forced wait between requests (controlled by the 
recovery_sleep parameters) was causing significant slowdown in recovery speed 
in my cluster, though even at the default values it wasn't making things go 
nearly as slowly as your cluster - it sounds like something else is probably 
wrong.


Thanks for the hint. I've been thinking about recovery_sleep, too. But 
the default for ssd osds is set to 0.0:


# ceph daemon osd.93 config show | grep recovery
    "osd_allow_recovery_below_min_size": "true",
    "osd_debug_skip_full_check_in_recovery": "false",
    "osd_force_recovery_pg_log_entries_factor": "1.30",
    "osd_min_recovery_priority": "0",
    "osd_recovery_cost": "20971520",
    "osd_recovery_delay_start": "0.00",
    "osd_recovery_forget_lost_objects": "false",
    "osd_recovery_max_active": "3",
    "osd_recovery_max_chunk": "8388608",
    "osd_recovery_max_omap_entries_per_chunk": "64000",
    "osd_recovery_max_single_start": "1",
    "osd_recovery_op_priority": "3",
    "osd_recovery_op_warn_multiple": "16",
    "osd_recovery_priority": "5",
    "osd_recovery_retry_interval": "30.00",
    "osd_recovery_sleep": "0.00",
    "osd_recovery_sleep_hdd": "0.10",
    "osd_recovery_sleep_hybrid": "0.025000",
    "osd_recovery_sleep_ssd": "0.00",
    "osd_recovery_thread_suicide_timeout": "300",
    "osd_recovery_thread_timeout": "30",
    "osd_scrub_during_recovery": "false",

osd.93 is one of the SSD OSDs I've just recreated using bluestore about 3 
hours ago. All recovery-related values are at their defaults. Since the 
first mail one hour ago the PG has made some progress:


8.101  7580  0    0  2777 0   0 
2496 2496 active+remapped+backfilling 2017-12-21 09:03:30.429605 
543455'1013006    543518:1927782 [78,34,49] 
78 [78,34,19] 78    522371'1009118 2017-12-18 
16:11:29.755231    522371'1009118 2017-12-18 16:11:29.755231


So roughly 2000 objects on this PG have been copied to a new ssd based 
OSD (78,34,19 -> 78,34,49 -> one new copy).



Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs mds millions of caps

2017-12-21 Thread Webert de Souza Lima
I have upgraded the kernel on a client node (one that has close-to-zero
traffic) used for tests.

   {
  "reconnecting" : false,
  "id" : 1620266,
  "num_leases" : 0,
  "inst" : "client.1620266 10.0.0.111:0/3921220890",
  "state" : "open",
  "completed_requests" : 0,
  "num_caps" : 1402490,
  "client_metadata" : {
 "kernel_version" : "4.4.0-104-generic",
 "hostname" : "suppressed",
 "entity_id" : "admin"
  },
  "replay_requests" : 0
   },

still 1.4M caps used.
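
For reference, the session dump above is the kind of output produced by querying
the active MDS through its admin socket (the daemon name is a placeholder):

ceph daemon mds.<name> session ls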

is upgrading the client kernel enough ?



Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*

On Fri, Dec 15, 2017 at 11:16 AM, Webert de Souza Lima <
webert.b...@gmail.com> wrote:

> So,
>
> On Fri, Dec 15, 2017 at 10:58 AM, Yan, Zheng  wrote:
>
>>
>> 300k is already quite a lot. Opening them requires a long time. Does your
>> mail server really open so many files?
>
>
> Yes, probably. It's a commercial solution. A few thousand domains, dozens
> of thousands of users and god knows how many mailboxes.
> From the daemonperf you can see the write workload is high, so yes, too
> much files opening (dovecot mdbox stores multiple e-mails per file, split
> into many files).
>
> I checked the 4.4 kernel; it includes the code that trims the cache when the mds
>> recovers.
>
>
> Ok, all nodes are running 4.4.0-75-generic. The fix might have been
> included in a newer version.
> I'll upgrade it asap.
>
>
> Regards,
>
> Webert Lima
> DevOps Engineer at MAV Tecnologia
> *Belo Horizonte - Brasil*
> *IRC NICK - WebertRLZ*
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Many concurrent drive failures - How do I activate pgs?

2017-12-21 Thread Denes Dolhay

Hi,


Since many ceph clusters use intel ssds and admins do recommend them, 
they are probably very good drives. My own experiences however are not 
so good with them. (About 70% of our intel drives ran into the 8mb bug 
at my previous job, 5xx and DC35xx series both, latest firmware at that 
time, <<10% cell usage, ~1 year use).


For the future I would recommend that you use different series / vendors 
for each of the failure domains; this way you can minimize the chance of 
"correlated failures".


There is a lecture about this here from Lars at Suse:

https://www.youtube.com/watch?v=fgRWVZXxRN8


Regards,

Denes.


On 12/21/2017 02:48 AM, David Herselman wrote:

Hi Christian,

Thanks for taking the time, I haven't been contacted by anyone yet but managed 
to get the down placement groups cleared by exporting 7.4s0 and 7.fs0 and then 
marking them as complete on the surviving OSDs:
 kvm5c:
   ceph-objectstore-tool --op export --pgid 7.4s0 --data-path 
/var/lib/ceph/osd/ceph-8 --journal-path /var/lib/ceph/osd/ceph-8/journal --file 
/var/lib/vz/template/ssd_recovery/osd8_7.4s0.export;
   ceph-objectstore-tool --op mark-complete --data-path 
/var/lib/ceph/osd/ceph-8 --journal-path /var/lib/ceph/osd/ceph-8/journal --pgid 
7.4s0;
 kvm5f:
   ceph-objectstore-tool --op export --pgid 7.fs0 --data-path 
/var/lib/ceph/osd/ceph-23 --journal-path /var/lib/ceph/osd/ceph-23/journal 
--file /var/lib/vz/template/ssd_recovery/osd23_7.fs0.export;
   ceph-objectstore-tool --op mark-complete --data-path 
/var/lib/ceph/osd/ceph-23 --journal-path /var/lib/ceph/osd/ceph-23/journal 
--pgid 7.fs0;

This would presumably simply punch holes in the RBD images but at least we can 
copy them out of that pool and hope that Intel can somehow unlock the drives 
for us to then export/import objects.


To answer your questions though, we have 6 near identical Intel Wildcat Pass 1U 
servers and have Proxmox loaded on them. Proxmox uses a Debian 9 base with the 
Ubuntu kernel, for which they apply cherry picked kernel patches (eg Intel NIC 
driver updates, vhost perf regression and mem-leak fixes, etc):

kvm5a:
Intel R1208WTTGSR System (serial: BQWS55091014)
Intel S2600WTTR Motherboard (serial: BQWL54950385, BIOS ID: 
SE5C610.86B.01.01.0021.032120170601)
2 x Intel Xeon E5-2640v4 2.4GHz (HT disabled)
24 x Micron 8GB DDR4 2133MHz (24 x 18ASF1G72PZ-2G1B1)
Intel AXX10GBNIA I/O Module
kvm5b:
Intel R1208WTTGS System (serial: BQWS53890178)
Intel S2600WTT Motherboard (serial: BQWL52550359, BIOS ID: 
SE5C610.86B.01.01.0021.032120170601)
2 x Intel Xeon E5-2640v4 2.4GHz (HT enabled)
4 x Micron 64GB DDR4 2400MHz LR-DIMM (4 x 72ASS8G72LZ-2G3B2)
Intel AXX10GBNIA I/O Module
kvm5c:
Intel R1208WT2GS System (serial: BQWS50490279)
Intel S2600WT2 Motherboard (serial: BQWL44650203, BIOS ID: 
SE5C610.86B.01.01.0021.032120170601)
2 x Intel Xeon E5-2640v3 2.6GHz (HT enabled)
4 x Micron 64GB DDR4 2400MHz LR-DIMM (4 x 72ASS8G72LZ-2G3B2)
Intel AXX10GBNIA I/O Module
kvm5d:
Intel R1208WTTGSR System (serial: BQWS62291318)
Intel S2600WTTR Motherboard (serial: BQWL61855187, BIOS ID: 
SE5C610.86B.01.01.0021.032120170601)
2 x Intel Xeon E5-2640v4 2.4GHz (HT enabled)
4 x Micron 64GB DDR4 2400MHz LR-DIMM (4 x 72ASS8G72LZ-2G3B2)
Intel AXX10GBNIA I/O Module
kvm5e:
Intel R1208WTTGSR System (serial: BQWS64290162)
Intel S2600WTTR Motherboard (serial: BQWL63953066, BIOS ID: 
SE5C610.86B.01.01.0021.032120170601)
2 x Intel Xeon E5-2640v4 2.4GHz (HT enabled)
4 x Micron 64GB DDR4 2400MHz LR-DIMM (4 x 72ASS8G72LZ-2G3B2)
Intel AXX10GBNIA I/O Module
kvm5f:
Intel R1208WTTGSR System (serial: BQWS71790632)
Intel S2600WTTR Motherboard (serial: BQWL71050622, BIOS ID: 
SE5C610.86B.01.01.0021.032120170601)
2 x Intel Xeon E5-2640v4 2.4GHz (HT enabled)
4 x Micron 64GB DDR4 2400MHz LR-DIMM (4 x 72ASS8G72LZ-2G3B2)
Intel AXX10GBNIA I/O Module
  
Summary:

   * 5b has an Intel S2600WTT, 5c has an Intel S2600WT2, all others have 
S2600WTTR Motherboards
   * 5a has ECC Registered Dual Rank DDR DIMMs, all others have ECC 
LoadReduced-DIMMs
   * 5c has an Intel X540-AT2 10 GbE adapter as the on-board NICs are only 1 GbE


Each system has identical discs:
   * 2 x 480 GB Intel SSD DC S3610 (SSDSC2BX480G4) - partitioned as software 
RAID1 OS volume and Ceph FileStore journals (spinners)
   * 4 x 2 TB Seagate discs (ST2000NX0243) - Ceph FileStore OSDs (journals in 
S3610 partitions)
   * 2 x 1.9 TB Intel SSD DC S4600 (SSDSC2KG019T7) - Ceph BlueStore OSDs 
(problematic)


Additional information:
   * All drives are directly attached to the on-board AHCI SATA controllers, 
via the standard 2.5 inch drive chassis hot-swap bays.
   * We added 12 x 1.9 TB SSD DC S4600 drives last week Thursday, 2 in each 
system's slots 7 & 8
   * Systems have been ope

Re: [ceph-users] ceph status doesnt show available and used disk space after upgrade

2017-12-21 Thread kevin parrikar
accidentally removed mailing list email

++ceph-users

Thanks a lot JC for looking into this issue. I am really out of ideas.


ceph.conf on mgr node which is also monitor node.

[global]
fsid = 06c5c906-fc43-499f-8a6f-6c8e21807acf
mon_initial_members = node-16 node-30 node-31
mon_host = 172.16.1.9 172.16.1.3 172.16.1.11
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
log_to_syslog_level = info
log_to_syslog = True
osd_pool_default_size = 2
osd_pool_default_min_size = 1
osd_pool_default_pg_num = 64
public_network = 172.16.1.0/24
log_to_syslog_facility = LOG_LOCAL0
osd_journal_size = 2048
auth_supported = cephx
osd_pool_default_pgp_num = 64
osd_mkfs_type = xfs
cluster_network = 172.16.1.0/24
osd_recovery_max_active = 1
osd_max_backfills = 1
mon allow pool delete = true

[client]
rbd_cache_writethrough_until_flush = True
rbd_cache = True

[client.radosgw.gateway]
rgw_keystone_accepted_roles = _member_, Member, admin, swiftoperator
keyring = /etc/ceph/keyring.radosgw.gateway
rgw_frontends = fastcgi socket_port=9000 socket_host=127.0.0.1
rgw_socket_path = /tmp/radosgw.sock
rgw_keystone_revocation_interval = 100
rgw_keystone_url = http://192.168.1.3:35357
rgw_keystone_admin_token = jaJSmlTNxgsFp1ttq5SuAT1R
rgw_init_timeout = 36
host = controller3
rgw_dns_name = *.sapiennetworks.com
rgw_print_continue = True
rgw_keystone_token_cache_size = 10
rgw_data = /var/lib/ceph/radosgw
user = www-data




ceph auth list


osd.100
key: AQAtZjpaVZOFBxAAwl0yFLdUOidLzPFjv+HnjA==
caps: [mgr] allow profile osd
caps: [mon] allow profile osd
caps: [osd] allow *
osd.101
key: AQA4ZjpaS4wwGBAABwgoXQRc1J8sav4MUkWceQ==
caps: [mgr] allow profile osd
caps: [mon] allow profile osd
caps: [osd] allow *
osd.102
key: AQBDZjpaBS2tEBAAtFiPKBzh8JGi8Nh3PtAGCg==
caps: [mgr] allow profile osd
caps: [mon] allow profile osd
caps: [osd] allow *

client.admin
key: AQD0yXFYflnYFxAAEz/2XLHO/6RiRXQ5HXRAnw==
caps: [mds] allow *
caps: [mgr] allow *
caps: [mon] allow *
caps: [osd] allow *
client.backups
key: AQC0y3FY4YQNNhAAs5fludq0yvtp/JJt7RT4HA==
caps: [mgr] allow r
caps: [mon] allow r
caps: [osd] allow class-read object_prefix rbd_children, allow rwx
pool=backups, allow rwx pool=volumes
client.bootstrap-mds
key: AQD5yXFYyIxiFxAAyoqLPnxxqWmUr+zz7S+qVQ==
caps: [mgr] allow r
caps: [mon] allow profile bootstrap-mds
client.bootstrap-mgr
key: AQBmOTpaXqHQDhAAyDXoxlPmG9QovfmmUd8gIg==
caps: [mon] allow profile bootstrap-mgr
client.bootstrap-osd
key: AQD0yXFYuGkSIhAAelSb3TCPuXRFoFJTBh7Vdg==
caps: [mgr] allow r
caps: [mon] allow profile bootstrap-osd
client.bootstrap-rbd
key: AQBnOTpafDS/IRAAnKzuI9AYEF81/6mDVv0QgQ==
caps: [mon] allow profile bootstrap-rbd

client.bootstrap-rgw
key: AQD3yXFYxt1mLRAArxOgRvWmmzT9pmsqTLpXKw==
caps: [mgr] allow r
caps: [mon] allow profile bootstrap-rgw
client.compute
key: AQCbynFYRcNWOBAAPzdAKfP21GvGz1VoHBimGQ==
caps: [mgr] allow r
caps: [mon] allow r
caps: [osd] allow class-read object_prefix rbd_children, allow rwx
pool=volumes, allow rx pool=images, allow rwx pool=compute
client.images
key: AQCyy3FYSMtlJRAAbJ8/U/R82NXvWBC5LmkPGw==
caps: [mgr] allow r
caps: [mon] allow r
caps: [osd] allow class-read object_prefix rbd_children, allow rwx
pool=images
client.radosgw.gateway
key: AQA3ynFYAYMSAxAApvfe/booa9KhigpKpLpUOA==
caps: [mgr] allow r
caps: [mon] allow rw
caps: [osd] allow rwx
client.volumes
key: AQCzy3FYa3paKBAA9BlYpQ1PTeR770ghVv1jKQ==
caps: [mgr] allow r
caps: [mon] allow r
caps: [osd] allow class-read object_prefix rbd_children, allow rwx
pool=volumes, allow rx pool=images
mgr.controller2
key: AQAmVTpaA+9vBhAApD3rMs//Qri+SawjUF4U4Q==
caps: [mds] allow *
caps: [mgr] allow *
caps: [mon] allow *
caps: [osd] allow *
mgr.controller3
key: AQByfDparprIEBAAj7Pxdr/87/v0kmJV49aKpQ==
caps: [mds] allow *
caps: [mgr] allow *
caps: [mon] allow *
caps: [osd] allow *

Regards,
Kevin

On Thu, Dec 21, 2017 at 8:10 AM, kevin parrikar 
wrote:

> Thanks JC,
> I tried
> ceph auth caps client.admin osd 'allow *' mds 'allow *' mon 'allow *' mgr
> 'allow *'
>
> but still the status is the same; also mgr.log is being flooded with the errors below.
>
> 2017-12-21 02:39:10.622834 7fb40a22b700  0 Cannot get stat of OSD 140
> 2017-12-21 02:39:10.622835 7fb40a22b700  0 Cannot get stat of OSD 141
> Not sure whats wrong in my setup
>
> Regards,
> Kevin
>
>
> On Thu, Dec 21, 2017 at 2:37 AM, Jean-Charles Lopez 
> wrote:
>
>> Hi,
>>
>> make sure client.admin user has an MGR cap using ceph auth list. At some
>> point there was a 

Re: [ceph-users] Cephfs limits

2017-12-21 Thread Yan, Zheng
On Thu, Dec 21, 2017 at 6:18 PM, nigel davies  wrote:
> Hey all, is it possible to set CephFS to have a space limit?
> E.g. I would like to set my CephFS to a limit of 20TB
> and my S3 storage to 4TB, for example.
>

You can set a pool quota on the CephFS data pools.
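
For reference, a hedged sketch of what that looks like; the pool names below are
common defaults and are assumptions, so substitute the actual data pools:

ceph osd pool set-quota cephfs_data max_bytes $((20 * 1024**4))
ceph osd pool set-quota default.rgw.buckets.data max_bytes $((4 * 1024**4))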

> thanks
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Added two OSDs, 10% of pgs went inactive

2017-12-21 Thread Daniel K
Caspar,

I found Nick Fisk's post yesterday
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-December/023223.html
and set osd_max_pg_per_osd_hard_ratio = 4 in my ceph.conf on the OSDs and
restarted the 10TB OSDs. The PGs went back active and recovery is complete
now.

My setup is similar to his in that there's a large difference in OSD size,
most are 1.8TB, but about 10% of them are 10TB.

The difference is I had a functional Luminous cluster until I increased the
number of 10TB OSDs from 6 to 8. I'm still not sure why that caused *more* PGs
per OSD with the same pools.

Thanks!

Daniel


On Wed, Dec 20, 2017 at 10:23 AM, Caspar Smit 
wrote:

> Hi Daniel,
>
> I've had the same problem with creating a new 12.2.2 cluster where I
> couldn't get some pgs out of the "activating+remapped" status after I
> switched some OSDs from one chassis to another (there was no data on it
> yet).
>
> I tried restarting OSD's to no avail.
>
> Couldn't find anything about the stuck in "activating+remapped" state so
> in the end i threw away the pool and started over.
>
> Could this be a bug in 12.2.2 ?
>
> Kind regards,
> Caspar
>
> 2017-12-20 15:48 GMT+01:00 Daniel K :
>
>> Just an update.
>>
>> Recovery completed but the PGS are still inactive.
>>
>> Still having a hard time understanding why adding OSDs caused this. I'm
>> on 12.2.2
>>
>> user@admin:~$ ceph -s
>>   cluster:
>> id: a3672c60-3051-440c-bd83-8aff7835ce53
>> health: HEALTH_WARN
>> Reduced data availability: 307 pgs inactive
>> Degraded data redundancy: 307 pgs unclean
>>
>>   services:
>> mon: 5 daemons, quorum stor585r2u8a,stor585r2u12a,sto
>> r585r2u16a,stor585r2u20a,stor585r2u24a
>> mgr: stor585r2u8a(active)
>> osd: 88 osds: 87 up, 87 in; 133 remapped pgs
>>
>>   data:
>> pools:   12 pools, 3016 pgs
>> objects: 387k objects, 1546 GB
>> usage:   3313 GB used, 186 TB / 189 TB avail
>> pgs: 10.179% pgs not active
>>  2709 active+clean
>>  174  activating
>>  133  activating+remapped
>>
>>   io:
>> client:   8436 kB/s rd, 935 kB/s wr, 140 op/s rd, 64 op/s wr
>>
>>
>> On Tue, Dec 19, 2017 at 8:57 PM, Daniel K  wrote:
>>
>>> I'm trying to understand why adding OSDs would cause pgs to go inactive.
>>>
>>> This cluster has 88 OSDs, and had 6 OSD with device class "hdd_10TB_7.2k"
>>>
>>> I added two more OSDs, set the device class to "hdd_10TB_7.2k" and 10%
>>> of pgs went inactive.
>>>
>>> I have an EC pool on these OSDs with the profile:
>>> user@admin:~$ ceph osd erasure-code-profile get ISA_10TB_7.2k_4.2
>>> crush-device-class=hdd_10TB_7.2k
>>> crush-failure-domain=host
>>> crush-root=default
>>> k=4
>>> m=2
>>> plugin=isa
>>> technique=reed_sol_van.
>>>
>>> some outputs of ceph health detail and ceph osd df
>>> user@admin:~$ ceph osd df |grep 10TB
>>> 76 hdd_10TB_7.2k 9.09509  1.0 9313G   349G 8963G 3.76 2.20 488
>>> 20 hdd_10TB_7.2k 9.09509  1.0 9313G   345G 8967G 3.71 2.17 489
>>> 28 hdd_10TB_7.2k 9.09509  1.0 9313G   344G 8968G 3.70 2.17 484
>>> 36 hdd_10TB_7.2k 9.09509  1.0 9313G   345G 8967G 3.71 2.17 484
>>> 87 hdd_10TB_7.2k 9.09560  1.0 9313G  8936M 9305G 0.09 0.05 311
>>> 86 hdd_10TB_7.2k 9.09560  1.0 9313G  8793M 9305G 0.09 0.05 304
>>>  6 hdd_10TB_7.2k 9.09509  1.0 9313G   344G 8969G 3.70 2.16 471
>>> 68 hdd_10TB_7.2k 9.09509  1.0 9313G   344G 8969G 3.70 2.17 480
>>> user@admin:~$ ceph health detail|grep inactive
>>> HEALTH_WARN 68287/1928007 objects misplaced (3.542%); Reduced data
>>> availability: 307 pgs inactive; Degraded data redundancy: 341 pgs unclean
>>> PG_AVAILABILITY Reduced data availability: 307 pgs inactive
>>> pg 24.60 is stuck inactive for 1947.792377, current state
>>> activating+remapped, last acting [36,20,76,6,68,28]
>>> pg 24.63 is stuck inactive for 1946.571425, current state
>>> activating+remapped, last acting [28,76,6,20,68,36]
>>> pg 24.71 is stuck inactive for 1947.625988, current state
>>> activating+remapped, last acting [6,68,20,36,28,76]
>>> pg 24.73 is stuck inactive for 1947.705250, current state
>>> activating+remapped, last acting [36,6,20,76,68,28]
>>> pg 24.74 is stuck inactive for 1947.828063, current state
>>> activating+remapped, last acting [68,36,28,20,6,76]
>>> pg 24.75 is stuck inactive for 1947.475644, current state
>>> activating+remapped, last acting [6,28,76,36,20,68]
>>> pg 24.76 is stuck inactive for 1947.712046, current state
>>> activating+remapped, last acting [20,76,6,28,68,36]
>>> pg 24.78 is stuck inactive for 1946.576304, current state
>>> activating+remapped, last acting [76,20,68,36,6,28]
>>> pg 24.7a is stuck inactive for 1947.820932, current state
>>> activating+remapped, last acting [36,20,28,68,6,76]
>>> pg 24.7b is stuck inactive for 1947.858305, current state
>>> activating+remapped, last acting [68,6,20,28,76,36]
>>> pg 24.7c is stuck inactive for 1947.753917, current state
>>> activating+remapped, las

[ceph-users] How to use vfs_ceph

2017-12-21 Thread Felix Stolte

Hello folks,

is anybody using the vfs_ceph module for exporting cephfs as samba 
shares? We are running ceph jewel with cephx enabled. Manpage of 
vfs_ceph only references the option ceph:config_file. How do I need to 
configure my share (or maybe ceph.conf)?


log.smbd:  '/' does not exist or permission denied when connecting to 
[vfs] Error was Transport endpoint is not connected


I have a user ctdb with keyring file /etc/ceph/ceph.client.ctdb.keyring 
with permissions:


    caps: [mds] allow rw
    caps: [mon] allow r
    caps: [osd] allow rwx pool=cephfs_metadata, allow rwx pool=cephfs_data


I can mount cephfs with ceph-fuse using the id ctdb and its keyfile.

My share definition is:

[vfs]
    comment = vfs
    path = /
    read only = No
    vfs objects = acl_xattr ceph
    ceph:user_id = ctdb
    ceph:config_file = /etc/ceph/ceph.conf


Any advice is appreciated.

Regards Felix

--
Forschungszentrum Jülich GmbH
52425 Jülich
Sitz der Gesellschaft: Jülich
Eingetragen im Handelsregister des Amtsgerichts Düren Nr. HR B 3498
Vorsitzender des Aufsichtsrats: MinDir. Dr. Karl Eugen Huthmacher
Geschäftsführung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender),
Karsten Beneke (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
Prof. Dr. Sebastian M. Schmidt




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Luminous 12.2.2] Cluster performance drops after certain point of time

2017-12-21 Thread shadow_lin
Thanks for your information, but I don't think that is my case. My cluster 
doesn't have any SSDs.

2017-12-21 


lin.yunfan



From: Denes Dolhay 
Sent: 2017-12-18 06:41
Subject: Re: [ceph-users] [Luminous 12.2.2] Cluster performance drops after certain 
point of time
To: "ceph-users"
Cc:

Hi,
This is just a tip, I do not know if this actually applies to you, but some 
ssds are decreasing their write throughput on purpose so they do not wear out 
the cells before the warranty period is over.


Denes.





On 12/17/2017 06:45 PM, shadow_lin wrote:

Hi All,
I am testing luminous 12.2.2 and find a strange behavior of my cluster.
   I was testing my cluster throughput by using fio on a mounted rbd with 
follow fio parameters:
   fio -directory=fiotest -direct=1 -thread -rw=write -ioengine=libaio 
-size=200G -group_reporting -bs=1m -iodepth 4 -numjobs=200 -name=writetest
   Everything was fine at the beginning, but after about 10 hrs of testing I 
found the performance dropped noticeably.
   Throughput dropped from 300-450MBps to 250-350MBps and osd latency 
increased from 300ms to 400ms.
   I also noticed the heap stats showed the OSDs started reclaiming the page heap 
freelist much more frequently, but the RSS memory of the OSDs kept increasing.
  
  below is the links of grafana graph of my cluster.
  cluster metrics: https://pasteboard.co/GYEOgV1.jpg
  osd mem metrics: https://pasteboard.co/GYEP74M.png
  In the graph the performance dropped after 10:00.

 I am investigating what happened but haven't found any clue yet. If you 
know anything about how to solve this problem or where I should look, 
please let me know. 
 Thanks. 


2017-12-18



lin.yunfan

 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] MDS behind on trimming

2017-12-21 Thread Stefan Kooman
Hi,

We have two MDS servers. One active, one active-standby. While doing a
parallel rsync of 10 threads with loads of files, dirs, subdirs we get
the following HEALTH_WARN:

ceph health detail
HEALTH_WARN 2 MDSs behind on trimming
MDS_TRIM 2 MDSs behind on trimming
mdsmds2(mds.0): Behind on trimming (124/30)max_segments: 30,
num_segments: 124
mdsmds1(mds.0): Behind on trimming (118/30)max_segments: 30,
num_segments: 118

To be clear: the amount of segments behind on trimming fluctuates. It
sometimes does get smaller, and is relatively stable around ~ 130.

The load on the MDS is low, load on OSDs is low (both CPU/RAM/IO). All
flash, cephfs_metadata co-located on the same OSDs. Using cephfs kernel
client (4.13.0-19-generic) with Ceph 12.2.2 (cllient as well as cluster
runs Ceph 12.2.2). In older threads I found several possible
explanations for getting this warning:

1) When the number of segments exceeds that setting, the MDS starts
  writing back metadata so that it can remove (trim) the oldest
  segments. If this process is too slow, or a software bug is preventing
  trimming, then this health message appears.

2) The OSDs cannot keep up with the load

3) the cephfs kernel client misbehaving / a bug

I definitely don't think 2) is the reason, and I doubt it's a Ceph MDS bug 1)
or a client bug 3). Might this be down to conservative default settings, i.e. not
trying to trim fast / soon enough? John wonders in thread [1] if the
default journal length should be longer. Yan [2] recommends bumping
"mds_log_max_expiring" to a large value (200). 

What would you suggest at this point? I'm thinking about the following
changes:

mds log max segments = 200
mds log max expiring = 200

Thanks,

Stefan

[1]: https://www.spinics.net/lists/ceph-users/msg39387.html
[2]:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-July/011138.html

-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Many concurrent drive failures - How do I activate pgs?

2017-12-21 Thread David Herselman
Hi,

I assume this can only be a physical manufacturing flaw or a firmware bug? Do 
Intel publish advisories on recalled equipment? Should others be concerned 
about using Intel DC S4600 SSD drives? Could this be an electrical issue on the 
Hot Swap Backplane or BMC firmware issue? Either way, all pure Intel...

The hole is only 1.3 GB (4 MB x 339 objects) but perfectly striped through the 
images; the file systems are subsequently severely damaged.

Is it possible to get Ceph to read in partial data shards? It would provide 
between 25-75% more yield...


Is there anything wrong with how we've proceeded thus far? Would be nice to 
reference examples of using ceph-objectstore-tool but documentation is 
virtually non-existent.

We used another SSD drive to simulate bringing all the SSDs back online. We 
carved up the drive to provide equal partitions to essentially simulate the 
original SSDs:
  # Partition a drive to provide 12 x 150GB partitions, eg:
sdd   8:48   0   1.8T  0 disk
|-sdd18:49   0   140G  0 part
|-sdd28:50   0   140G  0 part
|-sdd38:51   0   140G  0 part
|-sdd48:52   0   140G  0 part
|-sdd58:53   0   140G  0 part
|-sdd68:54   0   140G  0 part
|-sdd78:55   0   140G  0 part
|-sdd88:56   0   140G  0 part
|-sdd98:57   0   140G  0 part
|-sdd10   8:58   0   140G  0 part
|-sdd11   8:59   0   140G  0 part
+-sdd12   8:60   0   140G  0 part


  Pre-requisites:
ceph osd set noout;
apt-get install uuid-runtime;


  for ID in `seq 24 35`; do
UUID=`uuidgen`;
OSD_SECRET=`ceph-authtool --gen-print-key`;
DEVICE='/dev/sdd'$[$ID-23]; # 24-23 = /dev/sdd1, 35-23 = /dev/sdd12
echo "{\"cephx_secret\": \"$OSD_SECRET\"}" | ceph osd new $UUID $ID -i - -n 
client.bootstrap-osd -k /var/lib/ceph/bootstrap-osd/ceph.keyring;
mkdir /var/lib/ceph/osd/ceph-$ID;
mkfs.xfs $DEVICE;
mount $DEVICE /var/lib/ceph/osd/ceph-$ID;
ceph-authtool --create-keyring /var/lib/ceph/osd/ceph-$ID/keyring --name 
osd.$ID --add-key $OSD_SECRET;
ceph-osd -i $ID --mkfs --osd-uuid $UUID;
chown -R ceph:ceph /var/lib/ceph/osd/ceph-$ID;
systemctl enable ceph-osd@$ID;
systemctl start ceph-osd@$ID;
  done


Once up we imported previous exports of empty head files in to 'real' OSDs:
  kvm5b:
systemctl stop ceph-osd@8;
ceph-objectstore-tool --op import --pgid 7.4s0 --data-path 
/var/lib/ceph/osd/ceph-8 --journal-path /var/lib/ceph/osd/ceph-8/journal --file 
/var/lib/vz/template/ssd_recovery/osd8_7.4s0.export;
chown ceph:ceph -R /var/lib/ceph/osd/ceph-8;
systemctl start ceph-osd@8;
  kvm5f:
systemctl stop ceph-osd@23;
ceph-objectstore-tool --op import --pgid 7.fs0 --data-path 
/var/lib/ceph/osd/ceph-23 --journal-path /var/lib/ceph/osd/ceph-23/journal 
--file /var/lib/vz/template/ssd_recovery/osd23_7.fs0.export;
chown ceph:ceph -R /var/lib/ceph/osd/ceph-23;
systemctl start ceph-osd@23;


Bulk import previously exported objects:
cd /var/lib/vz/template/ssd_recovery;
for FILE in `ls -1A osd*_*.export | grep -Pv '^osd(8|23)_'`; do
  OSD=`echo $FILE | perl -pe 's/^osd(\d+).*/\1/'`;
  PGID=`echo $FILE | perl -pe 's/^osd\d+_(.*?).export/\1/g'`;
  echo -e "systemctl stop ceph-osd@$OSD\t ceph-objectstore-tool --op import 
--pgid $PGID --data-path /var/lib/ceph/osd/ceph-$OSD --journal-path 
/var/lib/ceph/osd/ceph-$OSD/journal --file 
/var/lib/vz/template/ssd_recovery/osd"$OSD"_$PGID.export";
done | sort

Sample output (this will wrap):
systemctl stop ceph-osd@27   ceph-objectstore-tool --op import --pgid 7.4s3 
--data-path /var/lib/ceph/osd/ceph-27 --journal-path 
/var/lib/ceph/osd/ceph-27/journal --file 
/var/lib/vz/template/ssd_recovery/osd27_7.4s3.export
systemctl stop ceph-osd@27   ceph-objectstore-tool --op import --pgid 7.fs5 
--data-path /var/lib/ceph/osd/ceph-27 --journal-path 
/var/lib/ceph/osd/ceph-27/journal --file 
/var/lib/vz/template/ssd_recovery/osd27_7.fs5.export
systemctl stop ceph-osd@30   ceph-objectstore-tool --op import --pgid 7.fs4 
--data-path /var/lib/ceph/osd/ceph-30 --journal-path 
/var/lib/ceph/osd/ceph-30/journal --file 
/var/lib/vz/template/ssd_recovery/osd30_7.fs4.export
systemctl stop ceph-osd@31   ceph-objectstore-tool --op import --pgid 7.4s2 
--data-path /var/lib/ceph/osd/ceph-31 --journal-path 
/var/lib/ceph/osd/ceph-31/journal --file 
/var/lib/vz/template/ssd_recovery/osd31_7.4s2.export
systemctl stop ceph-osd@32   ceph-objectstore-tool --op import --pgid 7.4s4 
--data-path /var/lib/ceph/osd/ceph-32 --journal-path 
/var/lib/ceph/osd/ceph-32/journal --file 
/var/lib/vz/template/ssd_recovery/osd32_7.4s4.export
systemctl stop ceph-osd@32   ceph-objectstore-tool --op import --pgid 7.fs2 
--data-path /var/lib/ceph/osd/ceph-32 --journal-path 
/var/lib/ceph/osd/ceph-32/journal --file 
/var/lib/vz/template/ssd_recovery/osd32_7.fs2.export
systemctl stop ceph-osd@34   ceph-objectstore-tool --op import --pgid 7.4s5 
--data

Re: [ceph-users] MDS behind on trimming

2017-12-21 Thread Dan van der Ster
Hi,

We've used double the defaults for around 6 months now and haven't had any
behind on trimming errors in that time.

   mds log max segments = 60
   mds log max expiring = 40

Should be simple to try.

-- dan



On Thu, Dec 21, 2017 at 2:32 PM, Stefan Kooman  wrote:

> Hi,
>
> We have two MDS servers. One active, one active-standby. While doing a
> parallel rsync of 10 threads with loads of files, dirs, subdirs we get
> the following HEALTH_WARN:
>
> ceph health detail
> HEALTH_WARN 2 MDSs behind on trimming
> MDS_TRIM 2 MDSs behind on trimming
> mdsmds2(mds.0): Behind on trimming (124/30)max_segments: 30,
> num_segments: 124
> mdsmds1(mds.0): Behind on trimming (118/30)max_segments: 30,
> num_segments: 118
>
> To be clear: the amount of segments behind on trimming fluctuates. It
> sometimes does get smaller, and is relatively stable around ~ 130.
>
> The load on the MDS is low, load on OSDs is low (both CPU/RAM/IO). All
> flash, cephfs_metadata co-located on the same OSDs. Using cephfs kernel
> client (4.13.0-19-generic) with Ceph 12.2.2 (cllient as well as cluster
> runs Ceph 12.2.2). In older threads I found several possible
> explanations for getting this warning:
>
> 1) When the number of segments exceeds that setting, the MDS starts
>   writing back metadata so that it can remove (trim) the oldest
>   segments. If this process is too slow, or a software bug is preventing
>   trimming, then this health message appears.
>
> 2) The OSDs cannot keep up with the load
>
> 3) cephfs kernel client  mis behaving / bug
>
> I definitely don't think nr 2) is the reason. I doubt it's a Ceph MDS 1)
> or client bug 3). Might this be conservative default settings? I.e. not
> trying to trim fast / soon enough. John wonders in thread [1] if the
> default journal length should be longer. Yan [2] recommends bumping
> "mds_log_max_expiring" to a large value (200).
>
> What would you suggest at this point? I'm thinking about the following
> changes:
>
> mds log max segments = 200
> mds log max expiring = 200
>
> Thanks,
>
> Stefan
>
> [1]: https://www.spinics.net/lists/ceph-users/msg39387.html
> [2]:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-July/011138.html
>
> --
> | BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
> | GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS behind on trimming

2017-12-21 Thread Yan, Zheng
On Thu, Dec 21, 2017 at 9:32 PM, Stefan Kooman  wrote:
> Hi,
>
> We have two MDS servers. One active, one active-standby. While doing a
> parallel rsync of 10 threads with loads of files, dirs, subdirs we get
> the following HEALTH_WARN:
>
> ceph health detail
> HEALTH_WARN 2 MDSs behind on trimming
> MDS_TRIM 2 MDSs behind on trimming
> mdsmds2(mds.0): Behind on trimming (124/30)max_segments: 30,
> num_segments: 124
> mdsmds1(mds.0): Behind on trimming (118/30)max_segments: 30,
> num_segments: 118
>
> To be clear: the amount of segments behind on trimming fluctuates. It
> sometimes does get smaller, and is relatively stable around ~ 130.
>
> The load on the MDS is low, load on OSDs is low (both CPU/RAM/IO). All
> flash, cephfs_metadata co-located on the same OSDs. Using cephfs kernel
> client (4.13.0-19-generic) with Ceph 12.2.2 (cllient as well as cluster
> runs Ceph 12.2.2). In older threads I found several possible
> explanations for getting this warning:
>
> 1) When the number of segments exceeds that setting, the MDS starts
>   writing back metadata so that it can remove (trim) the oldest
>   segments. If this process is too slow, or a software bug is preventing
>   trimming, then this health message appears.
>
> 2) The OSDs cannot keep up with the load
>
> 3) cephfs kernel client  mis behaving / bug
>
> I definitely don't think nr 2) is the reason. I doubt it's a Ceph MDS 1)
> or client bug 3). Might this be conservative default settings? I.e. not
> trying to trim fast / soon enough. John wonders in thread [1] if the
> default journal length should be longer. Yan [2] recommends bumping
> "mds_log_max_expiring" to a large value (200).
>
> What would you suggest at this point? I'm thinking about the following
> changes:
>
> mds log max segments = 200
> mds log max expiring = 200
>

Yes, these change should help. you can also try
https://github.com/ceph/ceph/pull/18783

> Thanks,
>
> Stefan
>
> [1]: https://www.spinics.net/lists/ceph-users/msg39387.html
> [2]:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-July/011138.html
>
> --
> | BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
> | GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs mds millions of caps

2017-12-21 Thread Yan, Zheng
On Thu, Dec 21, 2017 at 7:33 PM, Webert de Souza Lima
 wrote:
> I have upgraded the kernel on a client node (one that has close-to-zero
> traffic) used for tests.
>
>{
>   "reconnecting" : false,
>   "id" : 1620266,
>   "num_leases" : 0,
>   "inst" : "client.1620266 10.0.0.111:0/3921220890",
>   "state" : "open",
>   "completed_requests" : 0,
>   "num_caps" : 1402490,
>   "client_metadata" : {
>  "kernel_version" : "4.4.0-104-generic",
>  "hostname" : "suppressed",
>  "entity_id" : "admin"
>   },
>   "replay_requests" : 0
>},
>
> still 1.4M caps used.
>
> is upgrading the client kernel enough ?
>

See http://tracker.ceph.com/issues/22446. We haven't implemented that
feature.  "echo 3 >/proc/sys/vm/drop_caches"  should drop most caps.

>
>
> Regards,
>
> Webert Lima
> DevOps Engineer at MAV Tecnologia
> Belo Horizonte - Brasil
> IRC NICK - WebertRLZ
>
> On Fri, Dec 15, 2017 at 11:16 AM, Webert de Souza Lima
>  wrote:
>>
>> So,
>>
>> On Fri, Dec 15, 2017 at 10:58 AM, Yan, Zheng  wrote:
>>>
>>>
>>> 300k are ready quite a lot. opening them requires long time. does you
>>> mail server really open so many files?
>>
>>
>> Yes, probably. It's a commercial solution. A few thousand domains, dozens
>> of thousands of users and god knows how any mailboxes.
>> From the daemonperf you can see the write workload is high, so yes, too
>> much files opening (dovecot mdbox stores multiple e-mails per file, split
>> into many files).
>>
>>> I checked 4.4 kernel, it includes the code that trim cache when mds
>>> recovers.
>>
>>
>> Ok, all nodes are running 4.4.0-75-generic. The fix might have been
>> included in a newer version.
>> I'll upgrade it asap.
>>
>>
>> Regards,
>>
>> Webert Lima
>> DevOps Engineer at MAV Tecnologia
>> Belo Horizonte - Brasil
>> IRC NICK - WebertRLZ
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Not timing out watcher

2017-12-21 Thread Serguei Bezverkhi (sbezverk)
Hi Ilya,

Here you go, no k8s services running this time:

sbezverk@kube-4:~$ sudo rbd map raw-volume --pool kubernetes --id admin -m 
192.168.80.233  --key=AQCeHO1ZILPPDRAA7zw3d76bplkvTwzoosybvA==
/dev/rbd0
sbezverk@kube-4:~$ sudo rbd status raw-volume --pool kubernetes --id admin -m 
192.168.80.233  --key=AQCeHO1ZILPPDRAA7zw3d76bplkvTwzoosybvA==
Watchers:
watcher=192.168.80.235:0/3465920438 client.65327 cookie=1
sbezverk@kube-4:~$ sudo rbd info raw-volume --pool kubernetes --id admin -m 
192.168.80.233  --key=AQCeHO1ZILPPDRAA7zw3d76bplkvTwzoosybvA==
rbd image 'raw-volume':
size 10240 MB in 2560 objects
order 22 (4096 kB objects)
block_name_prefix: rb.0.fafa.625558ec
format: 1
sbezverk@kube-4:~$ sudo reboot

sbezverk@kube-4:~$ sudo rbd status raw-volume --pool kubernetes --id admin -m 
192.168.80.233  --key=AQCeHO1ZILPPDRAA7zw3d76bplkvTwzoosybvA==
Watchers: none

It seems when the image was mapped manually, this issue is not reproducible. 

K8s does not just map the image, it also creates a loopback device which is 
linked to /dev/rbd0. Maybe this somehow causes the rbd client to re-activate a 
watcher on reboot. I will try to mimic the exact steps k8s follows manually to see 
what exactly forces an active watcher after reboot.

Thank you
Serguei

On 2017-12-21, 5:49 AM, "Ilya Dryomov"  wrote:

On Wed, Dec 20, 2017 at 6:20 PM, Serguei Bezverkhi (sbezverk)
 wrote:
> It took 30 minutes for the Watcher to time out after ungraceful restart. 
Is there a way limit it to something a bit more reasonable? Like 1-3 minutes?
>
> On 2017-12-20, 12:01 PM, "Serguei Bezverkhi (sbezverk)" 
 wrote:
>
> Ok, here is what I found out. If I gracefully kill a pod then watcher 
gets properly cleared, but if it is done ungracefully, without “rbd unmap” then 
even after a node reboot Watcher stays up for a long time,  it has been more 
than 20 minutes and it is still active (no any kubernetes services are running).

Hi Serguei,

Can you try taking k8s out of the equation -- set up a fresh VM with
the same kernel, do "rbd map" in it and kill it?

Thanks,

Ilya


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Gateway timeout

2017-12-21 Thread Brent Kennedy
I have noticed over the years (been using Ceph since 2013) that when an
OSD attached to a single physical drive (JBOD setup) is failing, this will at
times cause the RADOS gateways to go offline.  I have two clusters
running (one on Firefly and one on Hammer, both scheduled for upgrades next
year) and it happens on both when a drive is not marked out but has many
blocked op requests.  The drive is physically still functioning but is most
likely failing, just not failed yet.  The issue is that the gateways
just stop responding to all requests.  Both of our clusters have 3 RADOS
gateways behind an haproxy load balancer, so we know immediately when they
drop.  This will occur continually until we out the failing OSD (normally
we restart the gateways or the services on them first, then move on to out the
drive).  

 

I wonder if anyone else runs into this; a quick search revealed one hit with no
actual resolution.  I am also wondering if there is some way I could prevent the
gateways from falling over due to an unresponsive OSD.

 

I did set up a test Jewel install in our dev environment and semi-recreated the
problem by shutting down all the OSDs.  This resulted in the gateway going down
completely as well.  I imagine taking the OSDs offline like that wouldn't be
expected behaviour though.  It would be nice if the gateway would just return a
message, like service unavailable.  I suppose haproxy is doing this for it
though.

 

Regards,

Brent

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-volume lvm deactivate/destroy/zap

2017-12-21 Thread Dan van der Ster
Hi,

For someone who is not an lvm expert, does anyone have a recipe for
destroying a ceph-volume lvm osd?
(I have a failed disk which I want to deactivate / wipe before
physically removing from the host, and the tooling for this doesn't
exist yet http://tracker.ceph.com/issues/22287)

>  ceph-volume lvm zap /dev/sdu # does not work
Zapping: /dev/sdu
Running command: sudo wipefs --all /dev/sdu
 stderr: wipefs: error: /dev/sdu: probing initialization failed:
Device or resource busy
-->  RuntimeError: command returned non-zero exit status: 1

This is the drive I want to remove:

= osd.240 ==

  [block]/dev/ceph-/osd-block-f1455f38-b94b-4501-86df-6d6c96727d02

  type  block
  osd id240
  cluster fsid  xxx
  cluster name  ceph
  osd fsid  f1455f38-b94b-4501-86df-6d6c96727d02
  block uuidN4fpLc-O3y0-hvfN-oRpD-y6kH-znfl-4EaVLi
  block device  /dev/ceph-/osd-block-f1455f38-b94b-4501-86df-6d6c96727d02

How does one tear that down so it can be zapped?

Best Regards,

Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] [luminous 12.2.2] Cluster write performance degradation problem(possibly tcmalloc related)

2017-12-21 Thread shadow_lin
My testing cluster is an all-HDD cluster with 12 OSDs (10TB HDD each).
I monitor luminous 12.2.2 write performance and OSD memory usage with grafana 
graphs for statistics logging.
The test is done by using fio on a mounted rbd with the following fio parameters:
fio -directory=fiotest -direct=1 -thread -rw=write -ioengine=libaio  -size=200G 
-group_reporting -bs=1m -iodepth 4 -numjobs=200 -name=writetest
   I found there is a noticeable performance degradation over time.
   Graph of write throughput and iops
   https://pasteboard.co/GZflpTO.png
   Graph of osd memory usage(2 of 12 osds,the pattern are identical)
   https://pasteboard.co/GZfmfzo.png
   Graph of osd perf
   https://pasteboard.co/GZfmZNx.png

   There are some interesting findings from the graphs.
   After 18:00 the write throughput suddenly dropped and the OSD latency 
increased. tcmalloc started reclaiming the page heap freelist much more 
frequently. All of this happened very fast and every OSD showed the identical pattern.

   I have done this kind of test several times with different bluestore 
cache settings and found that with more cache the performance degradation 
happens later.

   I don't know if this is a bug or whether I can fix it by changing some of 
the config of my cluster. 
   Any advice or direction to look into is appreciated.
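
  For reference, a hedged sketch of the commands behind the heap statistics
mentioned above; "heap release" merely asks tcmalloc to hand free pages back to
the OS and changes no Ceph setting:

ceph tell osd.* heap stats      # dump tcmalloc heap statistics for every OSD
ceph tell osd.* heap release    # release unused heap pages back to the OS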

  Thanks
   
 


2017-12-21



lin.yunfan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-volume lvm deactivate/destroy/zap

2017-12-21 Thread Stefan Kooman
Quoting Dan van der Ster (d...@vanderster.com):
> Hi,
> 
> For someone who is not an lvm expert, does anyone have a recipe for
> destroying a ceph-volume lvm osd?
> (I have a failed disk which I want to deactivate / wipe before
> physically removing from the host, and the tooling for this doesn't
> exist yet http://tracker.ceph.com/issues/22287)
> 
> >  ceph-volume lvm zap /dev/sdu # does not work
> Zapping: /dev/sdu
> Running command: sudo wipefs --all /dev/sdu
>  stderr: wipefs: error: /dev/sdu: probing initialization failed:
> Device or resource busy
> 
> How does one tear that down so it can be zapped?

wipefs -fa /dev/the/device
dd if=/dev/zero of=/dev/the/device bs=1M count=1

^^ I have successfully re-created ceph-volume lvm bluestore OSDs with the
above method (assuming you have done the ceph osd purge osd.$ID part
already and brought down the OSD process itself).
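
For the LVM layer itself, a hedged sketch of the teardown; the VG/LV names are
placeholders and should be taken from "ceph-volume lvm list" for the failed OSD:

lvremove -y ceph-<vg>/osd-block-<osd-fsid>   # remove the logical volume
vgremove -y ceph-<vg>                        # remove the now-empty volume group
pvremove /dev/sdu                            # clear the LVM label from the device

After that, wipefs/dd (or ceph-volume lvm zap) should no longer complain that the
device is busy.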

Gr. Stefan

-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Not timing out watcher

2017-12-21 Thread Ilya Dryomov
On Thu, Dec 21, 2017 at 3:04 PM, Serguei Bezverkhi (sbezverk)
 wrote:
> Hi Ilya,
>
> Here you go, no k8s services running this time:
>
> sbezverk@kube-4:~$ sudo rbd map raw-volume --pool kubernetes --id admin -m 
> 192.168.80.233  --key=AQCeHO1ZILPPDRAA7zw3d76bplkvTwzoosybvA==
> /dev/rbd0
> sbezverk@kube-4:~$ sudo rbd status raw-volume --pool kubernetes --id admin -m 
> 192.168.80.233  --key=AQCeHO1ZILPPDRAA7zw3d76bplkvTwzoosybvA==
> Watchers:
> watcher=192.168.80.235:0/3465920438 client.65327 cookie=1
> sbezverk@kube-4:~$ sudo rbd info raw-volume --pool kubernetes --id admin -m 
> 192.168.80.233  --key=AQCeHO1ZILPPDRAA7zw3d76bplkvTwzoosybvA==
> rbd image 'raw-volume':
> size 10240 MB in 2560 objects
> order 22 (4096 kB objects)
> block_name_prefix: rb.0.fafa.625558ec
> format: 1
> sbezverk@kube-4:~$ sudo reboot
>
> sbezverk@kube-4:~$ sudo rbd status raw-volume --pool kubernetes --id admin -m 
> 192.168.80.233  --key=AQCeHO1ZILPPDRAA7zw3d76bplkvTwzoosybvA==
> Watchers: none
>
> It seems when the image was mapped manually, this issue is not reproducible.
>
> K8s does not just map the image, it also creates loopback device which is 
> linked to /dev/rbd0. Maybe this somehow reminds rbd client to re-activate a 
> watcher on reboot. I will try to mimic exact steps k8s follows manually to 
> see what exactly forces an active watcher after reboot.

To confirm, I'd also make sure that nothing runs "rbd unmap" on all
images (or some subset of images) during shutdown in the manual case.
Either do a hard reboot or rename /usr/bin/rbd to something else before
running reboot.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS behind on trimming

2017-12-21 Thread Stefan Kooman
Quoting Dan van der Ster (d...@vanderster.com):
> Hi,
> 
> We've used double the defaults for around 6 months now and haven't had any
> behind on trimming errors in that time.
> 
>mds log max segments = 60
>mds log max expiring = 40
> 
> Should be simple to try.
Yup, and works like a charm:

ceph tell mds.* injectargs '--mds_log_max_segments=60'
ceph tell mds.* injectargs '--mds_log_max_expiring=40'

Although injectargs logs "(not observed, change may require restart)",
these settings do get applied almost instantly ... and the trim lag was
gone within 30 seconds after that.
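
FWIW, to make these values persist across MDS restarts the same settings
can also go into ceph.conf, along these lines:

[mds]
mds log max segments = 60
mds log max expiring = 40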

Thanks,

Stefan

-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-volume lvm deactivate/destroy/zap

2017-12-21 Thread Dan van der Ster
On Thu, Dec 21, 2017 at 3:59 PM, Stefan Kooman  wrote:
> Quoting Dan van der Ster (d...@vanderster.com):
>> Hi,
>>
>> For someone who is not an lvm expert, does anyone have a recipe for
>> destroying a ceph-volume lvm osd?
>> (I have a failed disk which I want to deactivate / wipe before
>> physically removing from the host, and the tooling for this doesn't
>> exist yet http://tracker.ceph.com/issues/22287)
>>
>> >  ceph-volume lvm zap /dev/sdu # does not work
>> Zapping: /dev/sdu
>> Running command: sudo wipefs --all /dev/sdu
>>  stderr: wipefs: error: /dev/sdu: probing initialization failed:
>> Device or resource busy
>>
>> How does one tear that down so it can be zapped?
>
> wipefs -fa /dev/the/device
> dd if=/dev/zero of=/dev/the/device bs=1M count=1

Thanks Stefan. But isn't there also some vgremove or lvremove magic
that needs to bring down these /dev/dm-... devices I have?

-- dan

>
> ^^ I have succesfully re-created ceph-volume lvm bluestore OSDs with
> above method (assuming you have done the ceph osd purge osd.$ID part
> already and brought down the OSD process itself).
>
> Gr. Stefan
>
> --
> | BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
> | GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs mds millions of caps

2017-12-21 Thread Webert de Souza Lima
Hello Zheng,

Thanks for opening that issue on the bug tracker.

Also thanks for that tip. Caps dropped from 1.6M to 600k for that client.
Is it safe to run in a cronjob? Let's say, once or twice a day during
production?

Thanks!


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*

On Thu, Dec 21, 2017 at 11:55 AM, Yan, Zheng  wrote:

> On Thu, Dec 21, 2017 at 7:33 PM, Webert de Souza Lima
>  wrote:
> > I have upgraded the kernel on a client node (one that has close-to-zero
> > traffic) used for tests.
> >
> >{
> >   "reconnecting" : false,
> >   "id" : 1620266,
> >   "num_leases" : 0,
> >   "inst" : "client.1620266 10.0.0.111:0/3921220890",
> >   "state" : "open",
> >   "completed_requests" : 0,
> >   "num_caps" : 1402490,
> >   "client_metadata" : {
> >  "kernel_version" : "4.4.0-104-generic",
> >  "hostname" : "suppressed",
> >  "entity_id" : "admin"
> >   },
> >   "replay_requests" : 0
> >},
> >
> > still 1.4M caps used.
> >
> > is upgrading the client kernel enough ?
> >
>
> See http://tracker.ceph.com/issues/22446. We haven't implemented that
> feature.  "echo 3 >/proc/sys/vm/drop_caches"  should drop most caps.
>
> >
> >
> > Regards,
> >
> > Webert Lima
> > DevOps Engineer at MAV Tecnologia
> > Belo Horizonte - Brasil
> > IRC NICK - WebertRLZ
> >
> > On Fri, Dec 15, 2017 at 11:16 AM, Webert de Souza Lima
> >  wrote:
> >>
> >> So,
> >>
> >> On Fri, Dec 15, 2017 at 10:58 AM, Yan, Zheng  wrote:
> >>>
> >>>
> >>> 300k are ready quite a lot. opening them requires long time. does you
> >>> mail server really open so many files?
> >>
> >>
> >> Yes, probably. It's a commercial solution. A few thousand domains,
> dozens
> >> of thousands of users and god knows how any mailboxes.
> >> From the daemonperf you can see the write workload is high, so yes, too
> >> much files opening (dovecot mdbox stores multiple e-mails per file,
> split
> >> into many files).
> >>
> >>> I checked 4.4 kernel, it includes the code that trim cache when mds
> >>> recovers.
> >>
> >>
> >> Ok, all nodes are running 4.4.0-75-generic. The fix might have been
> >> included in a newer version.
> >> I'll upgrade it asap.
> >>
> >>
> >> Regards,
> >>
> >> Webert Lima
> >> DevOps Engineer at MAV Tecnologia
> >> Belo Horizonte - Brasil
> >> IRC NICK - WebertRLZ
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Many concurrent drive failures - How do I activate pgs?

2017-12-21 Thread Dénes Dolhay
If I were in your shoes, I would grab a failed disk which does NOT contain the
data you need, plus an oscilloscope, and start experimenting on it ... try to
find debug test points on the board, etc. At the same time I would contact the
manufacturer or a data recovery company with a good reputation and experience
with SSDs.

If I had to bet, I would bet on a manufacturing defect, because I cannot see
any other mention of your problem on the net.

On December 21, 2017 2:38:52 PM GMT+01:00, David Herselman  
wrote:
>Hi,
>
>I assume this can only be a physical manufacturing flaw or a firmware
>bug? Do Intel publish advisories on recalled equipment? Should others
>be concerned about using Intel DC S4600 SSD drives? Could this be an
>electrical issue on the Hot Swap Backplane or BMC firmware issue?
>Either way, all pure Intel...
>
>The hole is only 1.3 GB (4 MB x 339 objects) but perfectly striped
>through images, file systems are subsequently severely damaged.
>
>Is it possible to get Ceph to read in partial data shards? It would
>provide between 25-75% more yield...
>
>
>Is there anything wrong with how we've proceeded thus far? Would be
>nice to reference examples of using ceph-objectstore-tool but
>documentation is virtually non-existent.
>
>We used another SSD drive to simulate bringing all the SSDs back
>online. We carved up the drive to provide equal partitions to
>essentially simulate the original SSDs:
>  # Partition a drive to provide 12 x 150GB partitions, eg:
>sdd   8:48   0   1.8T  0 disk
>|-sdd18:49   0   140G  0 part
>|-sdd28:50   0   140G  0 part
>|-sdd38:51   0   140G  0 part
>|-sdd48:52   0   140G  0 part
>|-sdd58:53   0   140G  0 part
>|-sdd68:54   0   140G  0 part
>|-sdd78:55   0   140G  0 part
>|-sdd88:56   0   140G  0 part
>|-sdd98:57   0   140G  0 part
>|-sdd10   8:58   0   140G  0 part
>|-sdd11   8:59   0   140G  0 part
>+-sdd12   8:60   0   140G  0 part
>
>
>  Pre-requisites:
>ceph osd set noout;
>apt-get install uuid-runtime;
>
>
>  for ID in `seq 24 35`; do
>UUID=`uuidgen`;
>OSD_SECRET=`ceph-authtool --gen-print-key`;
>DEVICE='/dev/sdd'$[$ID-23];# 24-23 = /dev/sdd1, 35-23 = /dev/sdd12
>echo "{\"cephx_secret\": \"$OSD_SECRET\"}" | ceph osd new $UUID $ID -i
>- -n client.bootstrap-osd -k /var/lib/ceph/bootstrap-osd/ceph.keyring;
>mkdir /var/lib/ceph/osd/ceph-$ID;
>mkfs.xfs $DEVICE;
>mount $DEVICE /var/lib/ceph/osd/ceph-$ID;
>ceph-authtool --create-keyring /var/lib/ceph/osd/ceph-$ID/keyring
>--name osd.$ID --add-key $OSD_SECRET;
>ceph-osd -i $ID --mkfs --osd-uuid $UUID;
>chown -R ceph:ceph /var/lib/ceph/osd/ceph-$ID;
>systemctl enable ceph-osd@$ID;
>systemctl start ceph-osd@$ID;
>  done
>
>
>Once up we imported previous exports of empty head files in to 'real'
>OSDs:
>  kvm5b:
>systemctl stop ceph-osd@8;
>ceph-objectstore-tool --op import --pgid 7.4s0 --data-path
>/var/lib/ceph/osd/ceph-8 --journal-path
>/var/lib/ceph/osd/ceph-8/journal --file
>/var/lib/vz/template/ssd_recovery/osd8_7.4s0.export;
>chown ceph:ceph -R /var/lib/ceph/osd/ceph-8;
>systemctl start ceph-osd@8;
>  kvm5f:
>systemctl stop ceph-osd@23;
>ceph-objectstore-tool --op import --pgid 7.fs0 --data-path
>/var/lib/ceph/osd/ceph-23 --journal-path
>/var/lib/ceph/osd/ceph-23/journal --file
>/var/lib/vz/template/ssd_recovery/osd23_7.fs0.export;
>chown ceph:ceph -R /var/lib/ceph/osd/ceph-23;
>systemctl start ceph-osd@23;
>
>
>Bulk import previously exported objects:
>cd /var/lib/vz/template/ssd_recovery;
>for FILE in `ls -1A osd*_*.export | grep -Pv '^osd(8|23)_'`; do
>  OSD=`echo $FILE | perl -pe 's/^osd(\d+).*/\1/'`;
>  PGID=`echo $FILE | perl -pe 's/^osd\d+_(.*?).export/\1/g'`;
>echo -e "systemctl stop ceph-osd@$OSD\t ceph-objectstore-tool --op
>import --pgid $PGID --data-path /var/lib/ceph/osd/ceph-$OSD
>--journal-path /var/lib/ceph/osd/ceph-$OSD/journal --file
>/var/lib/vz/template/ssd_recovery/osd"$OSD"_$PGID.export";
>done | sort
>
>Sample output (this will wrap):
>systemctl stop ceph-osd@27   ceph-objectstore-tool --op import
>--pgid 7.4s3 --data-path /var/lib/ceph/osd/ceph-27 --journal-path
>/var/lib/ceph/osd/ceph-27/journal --file
>/var/lib/vz/template/ssd_recovery/osd27_7.4s3.export
>systemctl stop ceph-osd@27   ceph-objectstore-tool --op import
>--pgid 7.fs5 --data-path /var/lib/ceph/osd/ceph-27 --journal-path
>/var/lib/ceph/osd/ceph-27/journal --file
>/var/lib/vz/template/ssd_recovery/osd27_7.fs5.export
>systemctl stop ceph-osd@30   ceph-objectstore-tool --op import
>--pgid 7.fs4 --data-path /var/lib/ceph/osd/ceph-30 --journal-path
>/var/lib/ceph/osd/ceph-30/journal --file
>/var/lib/vz/template/ssd_recovery/osd30_7.fs4.export
>systemctl stop ceph-osd@31   ceph-objectstore-tool --op import
>--pgid 7.4s2 --data-path /var/lib/ceph/osd/ceph-31 --journal-path
>/var/lib/ceph/osd/ceph-31/journal --file
>/var/lib/vz/template/ssd_reco

Re: [ceph-users] Cache tier unexpected behavior: promote on lock

2017-12-21 Thread Захаров Алексей
Thanks for the answers!
As it leads to a decrease in caching efficiency, I've opened an issue:
http://tracker.ceph.com/issues/22528 

15.12.2017, 23:03, "Gregory Farnum" :
> On Thu, Dec 14, 2017 at 9:11 AM, Захаров Алексей  
> wrote:
>>  Hi, Gregory,
>>  Thank you for your answer!
>>
>>  Is there a way to not promote on "locking", when not using EC pools?
>>  Is it possible to make this configurable?
>>
>>  We don't use EC pool. So, for us this meachanism is overhead. It only adds
>>  more load on both pools and network.
>
> Unfortunately I don't think there's an easy way to avoid it that
> exists right now. The caching is generally not set up well for
> handling these kinds of things, but it's possible the logic to proxy
> class operations onto replicated pools might not be *too*
> objectionable
> -Greg
>
>>  14.12.2017, 01:16, "Gregory Farnum" :
>>
>>  Voluntary “locking” in RADOS is an “object class” operation. These are not
>>  part of the core API and cannot run on EC pools, so any operation using them
>>  will cause an immediate promotion.
>>  On Wed, Dec 13, 2017 at 4:02 AM Захаров Алексей 
>>  wrote:
>>
>>  Hello,
>>
>>  I've found that when client gets lock on object then ceph ignores any
>>  promotion settings and promotes this object immedeatly.
>>
>>  Is it a bug or a feature?
>>  Is it configurable?
>>
>>  Hope for any help!
>>
>>  Ceph version: 10.2.10 and 12.2.2
>>  We use libradosstriper-based clients.
>>
>>  Cache pool settings:
>>  size: 3
>>  min_size: 2
>>  crash_replay_interval: 0
>>  pg_num: 2048
>>  pgp_num: 2048
>>  crush_ruleset: 0
>>  hashpspool: true
>>  nodelete: false
>>  nopgchange: false
>>  nosizechange: false
>>  write_fadvise_dontneed: false
>>  noscrub: true
>>  nodeep-scrub: false
>>  hit_set_type: bloom
>>  hit_set_period: 60
>>  hit_set_count: 30
>>  hit_set_fpp: 0.05
>>  use_gmt_hitset: 1
>>  auid: 0
>>  target_max_objects: 0
>>  target_max_bytes: 18819770744832
>>  cache_target_dirty_ratio: 0.4
>>  cache_target_dirty_high_ratio: 0.6
>>  cache_target_full_ratio: 0.8
>>  cache_min_flush_age: 60
>>  cache_min_evict_age: 180
>>  min_read_recency_for_promote: 15
>>  min_write_recency_for_promote: 15
>>  fast_read: 0
>>  hit_set_grade_decay_rate: 50
>>  hit_set_search_last_n: 30
>>
>>  To get lock via cli (to test behavior) we use:
>>  # rados -p poolname lock get --lock-tag weird_ceph_locks --lock-cookie
>>  `uuid` objectname striper.lock
>>  Right after that object could be found in caching pool.
>>
>>  --
>>  Regards,
>>  Aleksei Zakharov
>>  ___
>>  ceph-users mailing list
>>  ceph-users@lists.ceph.com
>>  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>  --
>>  Regards,
>>  Aleksei Zakharov

-- 
Regards,
Aleksei Zakharov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-volume lvm deactivate/destroy/zap

2017-12-21 Thread Stefan Kooman
Quoting Dan van der Ster (d...@vanderster.com):
> Thanks Stefan. But isn't there also some vgremove or lvremove magic
> that needs to bring down these /dev/dm-... devices I have?

Ah, you want to clean up properly before that. Sure:

lvremove -f <volume_group>/<logical_volume>
vgremove <volume_group>
pvremove /dev/ceph-device (should wipe labels)
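
To find out which VG/LV belong to a given OSD you can look at the LVM tags
that ceph-volume sets (a sketch; double-check the tag name on your system):

lvs -o lv_name,vg_name,lv_tags | grep 'ceph.osd_id=<id>'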

So ideally there should be a ceph-volume lvm destroy / zap option that
takes care of this:

1) Properly remove LV/VG/PV as shown above
2) wipefs to get rid of LVM signatures
3) dd zeroes to get rid of signatures that might still be there

Gr. Stefan

-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to use vfs_ceph

2017-12-21 Thread David C
At a glance it looks OK; I've not tested this in a while. Silly question, but
does your Samba package definitely ship with the Ceph VFS module? That has
caught me out in the past.

Have you tried exporting a subdirectory? Maybe 777 it, although that shouldn't
make a difference.
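
One quick way to check (just a sketch, module paths vary by distro):

smbd -b | grep MODULESDIR     # shows where the VFS modules are installed
ls <that MODULESDIR>/vfs/     # look for a ceph.so

If there's no ceph.so in there, the package was built without the Ceph VFS.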

On 21 Dec 2017 13:16, "Felix Stolte"  wrote:

> Hello folks,
>
> is anybody using the vfs_ceph module for exporting cephfs as samba shares?
> We are running ceph jewel with cephx enabled. Manpage of vfs_ceph only
> references the option ceph:config_file. How do I need to configure my share
> (or maybe ceph.conf)?
>
> log.smbd:  '/' does not exist or permission denied when connecting to
> [vfs] Error was Transport endpoint is not connected
>
> I have a user ctdb with keyring file /etc/ceph/ceph.client.ctdb.keyring
> with permissions:
>
> caps: [mds] allow rw
> caps: [mon] allow r
> caps: [osd] allow rwx pool=cephfs_metadata, allow rwx pool=cephfs_data
>
> I can mount cephfs with cephf-fuse using the id ctdb and its keyfile.
>
> My share definition is:
>
> [vfs]
> comment = vfs
> path = /
> read only = No
> vfs objects = acl_xattr ceph
> ceph:user_id = ctdb
> ceph:config_file = /etc/ceph/ceph.conf
>
>
> Any advice is appreciated.
>
> Regards Felix
>
> --
> Forschungszentrum Jülich GmbH
> 52425 Jülich
> Sitz der Gesellschaft: Jülich
> Eingetragen im Handelsregister des Amtsgerichts Düren Nr. HR B 3498
> Vorsitzender des Aufsichtsrats: MinDir. Dr. Karl Eugen Huthmacher
> Geschäftsführung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender),
> Karsten Beneke (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
> Prof. Dr. Sebastian M. Schmidt
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Permissions for mon status command

2017-12-21 Thread Andreas Calminder
Hi,
I'm writing a small python script using librados to display cluster health,
the same info as "ceph health detail" shows. It works fine, but I'd rather not
use the admin keyring for something like this. However, I have no clue what
kind of caps I should or can set; I was kind of hoping that mon "allow r"
would do it, but that didn't work, and I'm unable to find any documentation
that covers this. Any pointers would be appreciated.

Thanks,
Andreas
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Permissions for mon status command

2017-12-21 Thread Alvaro Soto
Hi Andreas,
I believe it is not a problem of caps; I have tested using the same cap on mon
and I get the same problem. Still looking into it.

[client.python]

key = AQDORjxaYHG9JxAA0qiZC0Rmf3qulsO3P/bZgw==

caps mon = "allow r"



# ceph -n client.python --keyring ceph.client.python.keyring health

HEALTH_OK


but if I run the python script that contains a connect command to the
cluster.


# python health.py

Traceback (most recent call last):

  File "health.py", line 13, in 

r.connect()

  File "/usr/lib/python2.7/dist-packages/rados.py", line 429, in connect

raise make_ex(ret, "error connecting to the cluster")

rados.Error: error connecting to the cluster: errno EINVAL


** PYTHON SCRIPT 

#!/usr/bin/env python


import rados

import json


def get_cluster_health(r):

cmd = {"prefix":"status", "format":"json"}

ret, buf, errs = r.mon_command(json.dumps(cmd), b'', timeout=5)

result = json.loads(buf)

return result['health']['overall_status']


r = rados.Rados(conffile = '/etc/ceph/ceph.conf', conf = dict (keyring =
'/etc/ceph/ceph.client.python.keyring'))

r.connect()


print("{0}".format(get_cluster_health(r)))


if r is not None:

r.shutdown()

*



On Thu, Dec 21, 2017 at 4:15 PM, Andreas Calminder <
andreas.calmin...@klarna.com> wrote:

> Hi,
> I'm writing a small python script using librados to display cluster
> health, same info as ceph health detail show, it works fine but I rather
> not use the admin keyring for something like this. However I have no clue
> what kind of caps I should or can set, I was kind of hoping that mon allow
> r would do it, but that didn't work, and I'm unable to find any
> documentation that covers this. Any pointers would be appreciated.
>
> Thanks,
> Andreas
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 

ATTE. Alvaro Soto Escobar

--
Great people talk about ideas,
average people talk about things,
small people talk ... about other people.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph not reclaiming space or overhead?

2017-12-21 Thread Brian Woods
I will start by saying I am very new to Ceph and am trying to teach myself the
ins and outs.  While doing this I have been creating and destroying pools
as I experiment on some test hardware.  Something I noticed was that when a
pool is deleted, the space is not always freed 100%.  This is true even
after days of idle time.

Right now, with 7 OSDs and a few empty pools, I have 70GBs of raw space used.

Now, I am not sure if this is normal, but I did migrate my OSDs to
bluestore and have been adding OSDs.  So maybe some space is just overhead
for each OSD?  I lost one of my disks and the usage dropped to 70GBs.
Though when I had that failure I got some REALLY odd results from ceph -s…
  Note the number of data objects (242 total) vs. the number of degraded
objects (101 of 726):

--

root@MediaServer:~# ceph -s

 cluster:

id: 26c81563-ee27-4967-a950-afffb795f29e

    health: HEALTH_WARN

            1 filesystem is degraded

            insufficient standby MDS daemons available

            1 osds down

            Degraded data redundancy: 101/726 objects degraded (13.912%), 92
pgs unclean, 92 pgs degraded, 92 pgs undersized

 services:

mon: 2 daemons, quorum TheMonolith,MediaServer

mgr: MediaServer.domain(active), standbys: TheMonolith.domain

mds: MediaStoreFS-1/1/1 up  {0=MediaMDS=up:reconnect(laggy or crashed)}

osd: 8 osds: 7 up, 8 in

rgw: 2 daemons active

 data:

pools:   8 pools, 176 pgs

objects: 242 objects, 3568 bytes

usage:   80463 MB used, 10633 GB / 10712 GB avail

pgs: 101/726 objects degraded (13.912%)

92 active+undersized+degraded

84 active+clean

--

After reweighting the failed OSD out:

--

root@MediaServer:/var/log/ceph# ceph -s

 cluster:

id: 26c81563-ee27-4967-a950-afffb795f29e

    health: HEALTH_WARN

            1 filesystem is degraded

            insufficient standby MDS daemons available

 services:

mon: 2 daemons, quorum TheMonolith,MediaServer

mgr: MediaServer.domain(active), standbys: TheMonolith.domain

mds: MediaStoreFS-1/1/1 up  {0=MediaMDS=up:reconnect(laggy or crashed)}

osd: 8 osds: 7 up, 7 in

rgw: 2 daemons active

 data:

pools:   8 pools, 176 pgs

objects: 242 objects, 3568 bytes

usage:   71189 MB used, 8779 GB / 8849 GB avail

pgs: 176 active+clean

--

My pools:

--

root@MediaServer:/var/log/ceph# ceph df

GLOBAL:

SIZE  AVAIL RAW USED %RAW USED

8849G 8779G   71189M  0.79

POOLS:

NAME                       ID   USED   %USED   MAX AVAIL   OBJECTS

.rgw.root                  6    1322   0       3316G       3
default.rgw.control        7    0      0       3316G       11
default.rgw.meta           8    0      0       3316G       0
default.rgw.log            9    0      0       3316G       207
MediaStorePool             19   0      0       5970G       0
MediaStorePool-Meta        20   2246   0       3316G       21
MediaStorePool-WriteCache  21   0      0       3316G       0
rbd                        22   0      0       4975G       0

--

Am I looking at some sort of a file system leak, or is this normal?


Also, before I deleted (or broke rather) my last pool, I marked OSDs in and
out and tracked the space. The data pool was erasure with 4 data and 1
parity and all data cleared from the cache pool:


(sizes in GB)

Obj    Used   Total Size   Data   Expected Usage   Difference   Notes

       639    10712        417    521.25           -117.75      8 OSDs
337k   636    10246        417    521.25           -114.75      7 OSDs (complete removal, osd 0, 500GB)
337k   629    10712        417    521.25           -107.75      8 OSDs (Wiped and re-added as osd.51002)
337k   631    9780         417    521.25           -109.75      7 OSDs (out, crush removed, osd 5, 1TB)
337k   649    10712        417    521.25           -127.75      8 OSDs (crush add, osd in)
337k   643    9780         417    521.25           -121.75      7 OSDs (out, osd 5, 1TB)
337k   625    9780         417    521.25           -103.75      7 OSDs (crush reweight 0, osd 5, 1TB)

There was enough difference between marking OSDs in and out that I kinda
think something is up. Even after removing the 80GBs I use with no data at
all from the difference, that still leaves me with upwards of 40GBs of
unaccounted-for usage...


Debian 9 \ Kernel: 4.4.0-104-generic

ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous
(stable)


Thanks for your input! It's appreciated!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Permissions for mon status command

2017-12-21 Thread David Turner
You aren't specifying your cluster user, only the keyring.  So the
connection command is still trying to use the default client.admin instead
of client.python.  Here's the connect line I use in my scripts.

rados.Rados(conffile='/etc/ceph/ceph.conf', conf=dict(keyring = '
/etc/ceph/ceph.client.python.keyring'), name='client.python')
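
And in case it helps, a minimal read-only user for this kind of script can be
created along these lines (keyring path is just an example):

ceph auth get-or-create client.python mon 'allow r' -o /etc/ceph/ceph.client.python.keyring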

On Thu, Dec 21, 2017 at 6:55 PM Alvaro Soto  wrote:

> Hi Andreas,
> I believe is not a problem of caps, I have tested using the same cap on
> mon and I have the same problem, still looking into.
>
> [client.python]
>
> key = AQDORjxaYHG9JxAA0qiZC0Rmf3qulsO3P/bZgw==
>
> caps mon = "allow r"
>
>
>
> # ceph -n client.python --keyring ceph.client.python.keyring health
>
> HEALTH_OK
>
>
> but if I run the python script that contains a connect command to the
> cluster.
>
>
> # python health.py
>
> Traceback (most recent call last):
>
>   File "health.py", line 13, in 
>
> r.connect()
>
>   File "/usr/lib/python2.7/dist-packages/rados.py", line 429, in connect
>
> raise make_ex(ret, "error connecting to the cluster")
>
> rados.Error: error connecting to the cluster: errno EINVAL
>
>
> ** PYTHON SCRIPT 
>
> #!/usr/bin/env python
>
>
> import rados
>
> import json
>
>
> def get_cluster_health(r):
>
> cmd = {"prefix":"status", "format":"json"}
>
> ret, buf, errs = r.mon_command(json.dumps(cmd), b'', timeout=5)
>
> result = json.loads(buf)
>
> return result['health']['overall_status']
>
>
> r = rados.Rados(conffile = '/etc/ceph/ceph.conf', conf = dict (keyring =
> '/etc/ceph/ceph.client.python.keyring'))
>
> r.connect()
>
>
> print("{0}".format(get_cluster_health(r)))
>
>
> if r is not None:
>
> r.shutdown()
>
> *
>
>
>
> On Thu, Dec 21, 2017 at 4:15 PM, Andreas Calminder <
> andreas.calmin...@klarna.com> wrote:
>
>> Hi,
>> I'm writing a small python script using librados to display cluster
>> health, same info as ceph health detail show, it works fine but I rather
>> not use the admin keyring for something like this. However I have no clue
>> what kind of caps I should or can set, I was kind of hoping that mon allow
>> r would do it, but that didn't work, and I'm unable to find any
>> documentation that covers this. Any pointers would be appreciated.
>>
>> Thanks,
>> Andreas
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
>
> --
>
> ATTE. Alvaro Soto Escobar
>
> --
> Great people talk about ideas,
> average people talk about things,
> small people talk ... about other people.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph as an Alternative to HDFS for Hadoop

2017-12-21 Thread Traiano Welcome
Hi List

I'm researching the possibility of using Ceph as a drop-in replacement for
HDFS for applications using Spark and Hadoop.

I note that the Jewel documentation states that it requires Hadoop 1.1.x,
which seems a little dated and would be a concern for people:

http://docs.ceph.com/docs/jewel/cephfs/hadoop/

What about the 2.x series?

Also, are there any benchmark comparisons between hdfs and ceph
specifically around performance of apps benefiting from data locality ?

Many thanks in advance for any feedback!

Regards,
Traiano
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph as an Alternative to HDFS for Hadoop

2017-12-21 Thread Serkan Çoban
>Also, are there any benchmark comparisons between hdfs and ceph specifically 
>around performance of apps benefiting from data locality ?
There will be no data locality in Ceph, because all the data is
accessed over the network.

On Fri, Dec 22, 2017 at 4:52 AM, Traiano Welcome  wrote:
> Hi List
>
> I'm researching the possibility os using ceph as a drop in replacement for
> hdfs for applications using spark and hadoop.
>
> I note that the jewel documentation states that it requires hadoop 1.1.x,
> which seems a little dated and would be of concern for peopel:
>
> http://docs.ceph.com/docs/jewel/cephfs/hadoop/
>
> What about the 2.x series?
>
> Also, are there any benchmark comparisons between hdfs and ceph specifically
> around performance of apps benefiting from data locality ?
>
> Many thanks in advance for any feedback!
>
> Regards,
> Traiano
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs mds millions of caps

2017-12-21 Thread Yan, Zheng
On Thu, Dec 21, 2017 at 11:46 PM, Webert de Souza Lima
 wrote:
> Hello Zheng,
>
> Thanks for opening that issue on the bug tracker.
>
> Also thanks for that tip. Caps dropped from 1.6M to 600k for that client.

An idle client shouldn't hold so many caps.

> Is it safe to run in a cronjob? Let's say, once or twice a day during
> production?
>

Yes. For now, it's better to run "echo 3 >/proc/sys/vm/drop_caches"
after the cronjob finishes.

> Thanks!
>
>
> Regards,
>
> Webert Lima
> DevOps Engineer at MAV Tecnologia
> Belo Horizonte - Brasil
> IRC NICK - WebertRLZ
>
> On Thu, Dec 21, 2017 at 11:55 AM, Yan, Zheng  wrote:
>>
>> On Thu, Dec 21, 2017 at 7:33 PM, Webert de Souza Lima
>>  wrote:
>> > I have upgraded the kernel on a client node (one that has close-to-zero
>> > traffic) used for tests.
>> >
>> >{
>> >   "reconnecting" : false,
>> >   "id" : 1620266,
>> >   "num_leases" : 0,
>> >   "inst" : "client.1620266 10.0.0.111:0/3921220890",
>> >   "state" : "open",
>> >   "completed_requests" : 0,
>> >   "num_caps" : 1402490,
>> >   "client_metadata" : {
>> >  "kernel_version" : "4.4.0-104-generic",
>> >  "hostname" : "suppressed",
>> >  "entity_id" : "admin"
>> >   },
>> >   "replay_requests" : 0
>> >},
>> >
>> > still 1.4M caps used.
>> >
>> > is upgrading the client kernel enough ?
>> >
>>
>> See http://tracker.ceph.com/issues/22446. We haven't implemented that
>> feature.  "echo 3 >/proc/sys/vm/drop_caches"  should drop most caps.
>>
>> >
>> >
>> > Regards,
>> >
>> > Webert Lima
>> > DevOps Engineer at MAV Tecnologia
>> > Belo Horizonte - Brasil
>> > IRC NICK - WebertRLZ
>> >
>> > On Fri, Dec 15, 2017 at 11:16 AM, Webert de Souza Lima
>> >  wrote:
>> >>
>> >> So,
>> >>
>> >> On Fri, Dec 15, 2017 at 10:58 AM, Yan, Zheng  wrote:
>> >>>
>> >>>
>> >>> 300k are ready quite a lot. opening them requires long time. does you
>> >>> mail server really open so many files?
>> >>
>> >>
>> >> Yes, probably. It's a commercial solution. A few thousand domains,
>> >> dozens
>> >> of thousands of users and god knows how any mailboxes.
>> >> From the daemonperf you can see the write workload is high, so yes, too
>> >> much files opening (dovecot mdbox stores multiple e-mails per file,
>> >> split
>> >> into many files).
>> >>
>> >>> I checked 4.4 kernel, it includes the code that trim cache when mds
>> >>> recovers.
>> >>
>> >>
>> >> Ok, all nodes are running 4.4.0-75-generic. The fix might have been
>> >> included in a newer version.
>> >> I'll upgrade it asap.
>> >>
>> >>
>> >> Regards,
>> >>
>> >> Webert Lima
>> >> DevOps Engineer at MAV Tecnologia
>> >> Belo Horizonte - Brasil
>> >> IRC NICK - WebertRLZ
>> >
>> >
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cephfs NFS failover

2017-12-21 Thread nigel davies
Thanks all on this one

CTDB worked amazingly; I just need to tweak its settings so that failover
happens a tad faster. But all in all it works.
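
For anyone following along, the failover timing seems to be governed by the
CTDB tunables; this is roughly what I'm poking at (names/values to be checked
against your CTDB version):

ctdb listvars                    # show current tunables and their defaults
ctdb setvar TakeoverTimeout 5    # example value only
ctdb setvar MonitorInterval 5    # example value only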

Thanks for all your help

On 21 Dec 2017 9:08 am, "Robert Sander" 
wrote:

> On 20.12.2017 18:45, nigel davies wrote:
> > Hay all
> >
> > Can any one advise on how it can do this.
>
> You can use ctdb for that and run an active/active NFS cluster:
>
> https://wiki.samba.org/index.php/Setting_up_CTDB_for_Clustered_NFS
>
> The cluster filesystem can be a CephFS. This also works with Samba, i.e.
> you get an unlimited fileserver.
>
> Regards
> --
> Robert Sander
> Heinlein Support GmbH
> Schwedter Str. 8/9b, 10119 Berlin
>
> http://www.heinlein-support.de
>
> Tel: 030 / 405051-43
> Fax: 030 / 405051-19
>
> Zwangsangaben lt. §35a GmbHG:
> HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
> Geschäftsführer: Peer Heinlein -- Sitz: Berlin
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cephfs limis

2017-12-21 Thread nigel davies
Right, OK, I'll take a look. Can you do that after the pool / CephFS has been
set up?
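
I'm guessing it's something along these lines, with the pool names just a
guess on my part:

ceph osd pool set-quota cephfs_data max_bytes 20000000000000              # ~20 TB
ceph osd pool set-quota default.rgw.buckets.data max_bytes 4000000000000  # ~4 TB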


On 21 Dec 2017 12:25 pm, "Yan, Zheng"  wrote:

> On Thu, Dec 21, 2017 at 6:18 PM, nigel davies  wrote:
> > Hay all is it possable to set cephfs to have an sapce limit
> > eg i like to set my cephfs to have an limit of 20TB
> > and my s3 storage to have 4TB for example
> >
>
> you can set pool quota on cephfs data pools
>
> > thanks
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] MDS locatiins

2017-12-21 Thread nigel davies
Hay all

Is it OK to set up the MDS on the same servers that host the OSDs, or should
they be on different servers?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com