Re: [ceph-users] unfound objects blocking cluster, need help!

2016-10-07 Thread Paweł Sadowski
Hi,

I work with Tomasz and I'm investigating this situation. We still don't
fully understand why there were unfound objects after removing a single OSD.
From the logs[1] it looks like all PGs were active+clean before marking that
OSD out. After that, backfills started on multiple OSDs. Three minutes
later our robot removed that OSD from the cluster. Before the OSD was
removed some PGs were marked as degraded, and we ended up with some PGs
degraded+undersized and with unfound objects. During the whole event there
were some slow requests related to this OSD -- that's why we decided to
get rid of it.

Then, after a while, the whole cluster seemed to be blocked -- those unfound
objects were blocking recovery and client operations -- we only got
something like 10% of normal traffic. It looks like this was caused by the
OSD with unfound objects -- there is a throttling mechanism in the OSD which
by default allows only 100 client messages. If there are unfound
objects and many client requests for them, the OSD will be completely
blocked after a while, with all ops "waiting for missing object"[2]. In this
state it's not possible to query the PG, or even to do 'ceph tell osd.X version'.
Restarting the OSD gives a chance to execute 'pg query' or
'mark_unfound_lost revert|delete' before the throttle limit is reached again.
It's also possible to increase that limit by setting
'osd_client_message_cap' in the config file (this option is not documented).

I was able to reproduce the blocking on the throttling limit on Hammer (0.94.5
and 0.94.9) and Jewel (10.2.3). This looks like a bug: client
operations are blocking the recovery process and admin tools.
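
For reference, a minimal sketch of the workaround described above -- the cap
value and the <pgid> are illustrative placeholders, not taken from this cluster:

# ceph.conf on the affected OSD: raise the client message throttle (default 100)
[osd]
    osd client message cap = 1000

# after restarting the OSD, act before the throttle fills up again
ceph pg <pgid> query
ceph pg <pgid> mark_unfound_lost revert   # or 'delete'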

1:

2016-10-01 07:25:05 - mon.0 10.97.212.2:6789/0 8690339 : cluster [INF]
pgmap v31174228: 10432 pgs: 10432 active+clean; 141 TB data, 423 TB
used, 807 TB / 1231 TB avail; 83202 kB/s rd, 144 MB/s wr, 8717
op/s  <-- cluster seems ok
2016-10-01 07:25:06 - mon.0 10.97.212.2:6789/0 8690341 : cluster [INF]
osdmap e696136: 227 osds: 227 up, 226 in <-- ceph osd out
2016-10-01 07:28:47 - mon.0 10.97.212.2:6789/0 8691212 : cluster [INF]
pgmap v31174433: 10432 pgs: 4 active+recovery_wait+degraded+remapped, 9
activating+degraded+remapped, 10301 active+clean, 59
active+remapped+wait_backfill, 19 activating+remapped, 38
active+remapped+backfilling, 2 active+recovering+degraded; 141 TB data,
421 TB used, 803 TB / 1225 TB avail; 4738 kB/s rd, 19025 kB/s wr, 2006
op/s; 1863/125061398 objects degraded (0.001%); 1211382/125061398
objects misplaced (0.969%); 994 MB/s, 352 objects/s recovering
2016-10-01 07:28:47 - mon.0 10.97.212.2:6789/0 8691232 : cluster [INF]
osd.87 marked itself down<-- osd down
2016-10-01 07:28:41 - osd.102 10.99.208.134:6801/77524 6820 : cluster
[WRN] slow request 30.664042 seconds old, received at 2016-10-01
07:28:11.248835: osd_op(client.458404438.0:13798725
rbd_data.e623422ae8944a.0d49 [write 786432~8192] 3.abb76dbb
snapc 9256b=[] ack+ondisk+write+known_if_redirected e696155) currently
waiting for subops from 5,54,87,142
2016-10-01 07:28:54 - mon.0 10.97.212.2:6789/0 8691266 : cluster [INF]
pgmap v31174439: 10432 pgs: 3
stale+active+recovery_wait+degraded+remapped, 9
stale+activating+degraded+remapped, 11
stale+active+remapped+wait_backfill, 10301 active+clean, 49
active+remapped+wait_backfill, 1
active+recovering+undersized+degraded+remapped, 17 activating+remapped,
39 active+remapped+backfilling, 2 active+recovering+degraded; 141 TB
data, 421 TB used, 803 TB / 1225 TB avail; 85707 kB/s rd, 139 MB/s wr,
6271 op/s; 6326/125057026 objects degraded (0.005%); 1224087/125057026
objects misplaced (0.979%); 1/41406770 unfound (0.000%) <-- first unfound

full log:
https://gist.githubusercontent.com/anonymous/0336ea26c7c165e20ae75fbe03204d19/raw/9f5053e91ec5ae10bfb614033c924fe2361a116e/ceph_log

2:
# ceph --admin-daemon /var/run/ceph/*.asok dump_ops_in_flight
{
"description": "osd_op(client.176998.0:2805522
rbd_data.24d2b238e1f29.182f [set-alloc-hint object_size
4194304 write_size 4194304,write 2662400~4096] 2.62ea5cd0
ack+ondisk+write+known_if_redirected e1919)",
"initiated_at": "2016-10-06 07:19:45.650089",
"age": 647.394210,
"duration": 0.003584,
"type_data": [
"delayed",
{
"client": "client.176998",
"tid": 2805522
},
[
{
"time": "2016-10-06 07:19:45.650089",
"event": "initiated"
},
{
"time": "2016-10-06 07:19:45.653667",
"event": "reached_pg"
},
{
"time": "2016-10-06 07:19:45.653673",
"event": "waiting for missing object"
}
]
]
}

3:
# ceph --admin-daemon /var/run/ceph/*.asok perf dump | grep -A 20
"

Re: [ceph-users] [EXTERNAL] Benchmarks using fio tool gets stuck

2016-10-07 Thread Mario Rodríguez Molins
Adding the parameter "--direct=1" to the fio command line, it no longer
gets stuck.
This is how my script looks now:

for operation in read write randread randwrite; do
  for rbd in 4K 64K 1M 4M; do
    for bs in 4k 64k 1M 4M; do
      # - create rbd image with block size $rbd

      fio --name=global \
          --ioengine=rbd \
          --clientname=admin \
          --pool=scbench \
          --rbdname=image01 \
          --exec_prerun="echo 3 > /proc/sys/vm/drop_caches && sync" \
          --bs=${bs} \
          --name=rbd_iodeph32 \
          --iodepth=32 \
          --direct=1 \
          --rw=${operation} \
          --output-format=json

      sleep 10

      # - delete rbd image
    done
  done
done


On Wed, Oct 5, 2016 at 5:09 PM, Mario Rodríguez Molins <
mariorodrig...@tuenti.com> wrote:

> Doing some tests using iperf, our network has a bandwidth between nodes of
> 940 Mbits/sec.
> According to our metrics of network use in this cluster, hosts with OSDs
> have a peak traffic of about 200 Mbits/sec each, and the client which runs
> FIO about 300 Mbits/sec.
> The network doesn't seem to be saturated.
>
>
>
>
>
> On Wed, Oct 5, 2016 at 4:16 PM, Will.Boege  wrote:
>
>> Because you do not have segregated networks, the cluster traffic is most
>> likely drowning out the FIO user traffic.  This is especially exacerbated
>> by the fact that it is only a 1gb link between the cluster nodes.
>>
>>
>>
>> If you are planning on using this cluster for anything other than
>> testing, you’ll want to re-evaluate your network architecture.
>>
>>
>>
>> +  >= 10gbe
>>
>> + Dedicated cluster network
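
For illustration, such a split is configured in ceph.conf via the public/cluster
network options; the subnets below are made-up examples, not from this cluster:

[global]
    public network  = 192.168.10.0/24   # client traffic
    cluster network = 192.168.20.0/24   # replication / recovery traffic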
>>
>>
>>
>>
>>
>> *From: *Mario Rodríguez Molins 
>> *Date: *Wednesday, October 5, 2016 at 8:38 AM
>> *To: *"Will.Boege" 
>> *Cc: *"ceph-users@lists.ceph.com" 
>> *Subject: *Re: [EXTERNAL] [ceph-users] Benchmarks using fio tool gets
>> stuck
>>
>>
>>
>> Hi,
>>
>>
>>
>> Currently, we do not have a separated cluster network and our setup is:
>>
>>  - 3 nodes for OSD with 1Gbps links. Each node is running a unique OSD
>> daemon. Although we plan to increase the number of OSDs per host.
>>
>>  - 3 virtual machines also with 1Gbps links, where each vm is running one
>> monitor daemon (two of them are running a metadata server too).
>>
>>  - The two clients used for testing purposes are also 2 vms.
>>
>>
>>
>> In each run of FIO tool, we do the following steps (all of them in the
>> client):
>>
>>  1.- Create an rbd image of 1Gb within a pool and map this image to a
>> block device
>>
>>  2.- Create the ext4 filesystem in this block device
>>
>>  3.- Unmap the device from the client
>>
>>  4.- Before testing, drop caches (echo 3 | tee /proc/sys/vm/drop_caches
>> && sync)
>>
>>  5.- Perform the fio test, setting the pool and name of the rbd image. In
>> each run, the block size used is changed.
>>
>>  6.- Remove the image from the pool
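
For illustration, the six steps above map roughly onto the following commands;
the device name /dev/rbd0 and the exact fio options are assumptions based on the
script earlier in this thread:

rbd create --size 1024 scbench/image01        # 1 GB image in pool "scbench"
rbd map scbench/image01                       # e.g. maps to /dev/rbd0
mkfs.ext4 /dev/rbd0
rbd unmap /dev/rbd0
echo 3 > /proc/sys/vm/drop_caches && sync
fio --ioengine=rbd --clientname=admin --pool=scbench --rbdname=image01 \
    --bs=4k --iodepth=32 --direct=1 --rw=randwrite --name=test --output-format=json
rbd rm scbench/image01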
>>
>>
>>
>>
>>
>>
>>
>> Thanks in advance!
>>
>>
>>
>> On Wed, Oct 5, 2016 at 2:57 PM, Will.Boege  wrote:
>>
>> What does your network setup look like?  Do you have a separate cluster
>> network?
>>
>>
>>
>> Can you explain how you are performing the FIO test? Are you mounting a
>> volume through krbd and testing that from a different server?
>>
>>
>> On Oct 5, 2016, at 3:11 AM, Mario Rodríguez Molins <
>> mariorodrig...@tuenti.com> wrote:
>>
>> Hello,
>>
>>
>>
>> We are setting up a new Ceph cluster and doing some benchmarks on it.
>>
>> At this moment, our cluster consists of:
>>
>>  - 3 nodes for OSD. In our current configuration one daemon per node.
>>
>>  - 3 nodes for monitors (MON). In two of these nodes, there is a metadata
>> server (MDS).
>>
>>
>>
>> Benchmarks are performed with tools that ceph/rados provides us as well
>> as with fio benchmark tool.
>>
>> Our benchmark tests are based on this tutorial:
>> http://tracker.ceph.com/projects/ceph/wiki/Benchmark_Ceph_Cluster_Performance.
>>
>>
>>
>> Using the fio benchmark tool, we are having some issues. After some
>> executions, the fio process gets stuck in a futex_wait_queue_me call:
>>
>> # cat /proc/14413/stack
>>
>> [] futex_wait_queue_me+0xd2/0x140
>>
>> [] futex_wait+0xff/0x260
>>
>> [] wake_up_q+0x2d/0x60
>>
>> [] futex_requeue+0x2c1/0x930
>>
>> [] do_futex+0x2b1/0xb20
>>
>> [] handle_mm_fault+0x14e1/0x1cd0
>>
>> [] wake_up_new_task+0x108/0x1a0
>>
>> [] SyS_futex+0x83/0x180
>>
>> [] __do_page_fault+0x221/0x510
>>
>> [] system_call_fast_compare_end+0xc/0x96
>>
>> [] 0x
>>
>>
>>
>> Logs of osd and mon daemons do not show any information or error about
>> what the problem could be.
>>
>>
>>
>> Running strace on the fio process shows the following:
>>
>>
>>
>> [pid 14416] futex(0x7fffdffa16fc, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME,
>> 632809, {1475609725, 98199000}, ) = -1 ETIMEDOUT (Connection timed
>> out)
>>
>> [pid 14416] gettimeofday({1475609725, 98347}, NULL) = 0
>>
>> [pid 14416] futex(0x7fffdffa16d0, FUTEX_WAKE, 1) = 0
>>
>> [pid 14416] clock_gettime(CLOCK_MONOTONIC_RAW, {125063, 345690227}) = 0

Re: [ceph-users] unable to start radosgw after upgrade from 10.2.2 to 10.2.3

2016-10-07 Thread Orit Wasserman
Hi,

On Wed, Oct 5, 2016 at 11:23 PM, Andrei Mikhailovsky  wrote:
> Hello everyone,
>
> I've just updated my ceph to version 10.2.3 from 10.2.2 and I am no longer
> able to start the radosgw service. When executing I get the following error:
>
> 2016-10-05 22:14:10.735883 7f1852d26a00  0 ceph version 10.2.3
> (ecc23778eb545d8dd55e2e4735b53cc93f92e65b), process radosgw, pid 2711
> 2016-10-05 22:14:10.765648 7f1852d26a00  0 pidfile_write: ignore empty
> --pid-file
> 2016-10-05 22:14:11.287772 7f1852d26a00  0 zonegroup default missing zone
> for master_zone=

This means you are missing a master zone; you can only get here if
you have configured a realm.
Is that the case?

Can you provide:
radosgw-admin realm get
radosgw-admin zonegroupmap get
radosgw-admin zonegroup get
radosgw-admin zone get --rgw-zone=default

Orit

> 2016-10-05 22:14:11.294141 7f1852d26a00 -1 Couldn't init storage provider
> (RADOS)
>
>
>
> I had no issues starting rados on 10.2.2 and all versions prior to that.
>
> I am running ceph 10.2.3 on Ubuntu 16.04 LTS servers.
>
> Could someone please help me with fixing the problem?
>
> Thanks
>
> Andrei
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] unable to start radosgw after upgrade from 10.2.2 to 10.2.3

2016-10-07 Thread Andrei Mikhailovsky
Hi Orit,

The radosgw service was configured about two years ago using the 
documentation on the ceph.com website. No changes to the configuration have been 
made since. The service was working fine until the recent 10.2.3 update. I 
have been updating ceph to include every major release and practically every 
minor release (apart from maybe one or two about a year ago).

I have followed the instructions in the following link 
(https://www.mail-archive.com/ceph-users@lists.ceph.com/msg31764.html) and that 
has solved the issue with radosgw not being able to start at all.

To answer your question, I now have the following output from the commands you 
listed. Please let me know if something is missing.


# radosgw-admin realm get
{
    "id": "5b41b1b2-0f92-463d-b582-07552f83e66c",
    "name": "default",
    "current_period": "286475fa-625b-4fdb-97bf-dcec4b437960",
    "epoch": 1
}



# radosgw-admin zonegroupmap get
{
    "zonegroups": [],
    "master_zonegroup": "",
    "bucket_quota": {
        "enabled": false,
        "max_size_kb": -1,
        "max_objects": -1
    },
    "user_quota": {
        "enabled": false,
        "max_size_kb": -1,
        "max_objects": -1
    }
}



# radosgw-admin zonegroup get
{
    "id": "default",
    "name": "default",
    "api_name": "",
    "is_master": "true",
    "endpoints": [],
    "hostnames": [],
    "hostnames_s3website": [],
    "master_zone": "",
    "zones": [
        {
            "id": "default",
            "name": "default",
            "endpoints": [],
            "log_meta": "false",
            "log_data": "false",
            "bucket_index_max_shards": 0,
            "read_only": "false"
        }
    ],
    "placement_targets": [
        {
            "name": "default-placement",
            "tags": []
        }
    ],
    "default_placement": "default-placement",
    "realm_id": ""
}



# radosgw-admin zone get  --rgw-zone=default
{
    "id": "default",
    "name": "default",
    "domain_root": ".rgw",
    "control_pool": ".rgw.control",
    "gc_pool": ".rgw.gc",
    "log_pool": ".log",
    "intent_log_pool": ".intent-log",
    "usage_log_pool": ".usage",
    "user_keys_pool": ".users",
    "user_email_pool": ".users.email",
    "user_swift_pool": ".users.swift",
    "user_uid_pool": ".users.uid",
    "system_key": {
        "access_key": "",
        "secret_key": ""
    },
    "placement_pools": [
        {
            "key": "default-placement",
            "val": {
                "index_pool": ".rgw.buckets.index",
                "data_pool": ".rgw.buckets",
                "data_extra_pool": "",
                "index_type": 0
            }
        }
    ],
    "metadata_heap": ".rgw.meta",
    "realm_id": ""
}





I did notice that I am missing the .usage pool referenced by the "usage_log_pool": 
".usage" setting. Should I create it with the same settings as the .users pool?

Cheers

Andrei


 

- Original Message -
> From: "Orit Wasserman" 
> To: "andrei" 
> Cc: "ceph-users" 
> Sent: Friday, 7 October, 2016 10:21:44
> Subject: Re: [ceph-users] unable to start radosgw after upgrade from 10.2.2 
> to 10.2.3

> Hi,
> 
> On Wed, Oct 5, 2016 at 11:23 PM, Andrei Mikhailovsky  
> wrote:
>> Hello everyone,
>>
>> I've just updated my ceph to version 10.2.3 from 10.2.2 and I am no longer
>> able to start the radosgw service. When executing I get the following error:
>>
>> 2016-10-05 22:14:10.735883 7f1852d26a00  0 ceph version 10.2.3
>> (ecc23778eb545d8dd55e2e4735b53cc93f92e65b), process radosgw, pid 2711
>> 2016-10-05 22:14:10.765648 7f1852d26a00  0 pidfile_write: ignore empty
>> --pid-file
>> 2016-10-05 22:14:11.287772 7f1852d26a00  0 zonegroup default missing zone
>> for master_zone=
> 
> This means you are missing a master zone , you can get here only if
> you configured a realm.
> Is that the case?
> 
> Can you provide:
> radosgw-admin realm get
> radosgw-admin zonegroupmap get
> radosgw-admin zonegroup get
> radosgw-admin zone get --rgw-zone=default
> 
> Orit
> 
>> 2016-10-05 22:14:11.294141 7f1852d26a00 -1 Couldn't init storage provider
>> (RADOS)
>>
>>
>>
>> I had no issues starting rados on 10.2.2 and all versions prior to that.
>>
>> I am running ceph 10.2.3 on Ubuntu 16.04 LTS servers.
>>
>> Could someone please help me with fixing the problem?
>>
>> Thanks
>>
>> Andrei
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Mon Crashing after creating Cephfs

2016-10-07 Thread John Spray
On Fri, Oct 7, 2016 at 8:04 AM, James Horner  wrote:
> Hi All
>
> Just wondering if anyone can help me out here. Small home cluster with 1
> mon, the next phase of the plan called for more but I hadn't got there yet.
>
> I was trying to setup Cephfs and I ran "ceph fs new" without having an MDS
> as I was having issues with rank 0 immediately being degraded. My thinking
> was that I would bring up an MDS and it would be assigned to rank 0. Anyhoo
> after I did that my mon crashed and I havn't been able to restart it since,
> its output is:
>
> root@bertie ~ $ /usr/bin/ceph-mon -f --cluster ceph --id bertie --setuser
> ceph --setgroup ceph 2>&1 | tee /var/log/ceph/mon-temp
> starting mon.bertie rank 0 at 192.168.2.3:6789/0 mon_data
> /var/lib/ceph/mon/ceph-bertie fsid 06e2f4e0-35e1-4f8c-b2a0-bc72c4cd3199
> terminate called after throwing an instance of 'std::out_of_range'
>   what():  map::at
> *** Caught signal (Aborted) **
>  in thread 7fad7f86c480 thread_name:ceph-mon
>  ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
>  1: (()+0x525737) [0x56219142b737]
>  2: (()+0xf8d0) [0x7fad7eb3c8d0]
>  3: (gsignal()+0x37) [0x7fad7cdc6067]
>  4: (abort()+0x148) [0x7fad7cdc7448]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x15d) [0x7fad7d6b3b3d]
>  6: (()+0x5ebb6) [0x7fad7d6b1bb6]
>  7: (()+0x5ec01) [0x7fad7d6b1c01]
>  8: (()+0x5ee19) [0x7fad7d6b1e19]
>  9: (std::__throw_out_of_range(char const*)+0x66) [0x7fad7d707b76]
>  10: (FSMap::get_filesystem(int) const+0x7c) [0x56219126ed6c]
>  11: (MDSMonitor::maybe_promote_standby(std::shared_ptr)+0x48a)
> [0x56219125b13a]
>  12: (MDSMonitor::tick()+0x4bb) [0x56219126084b]
>  13: (MDSMonitor::on_active()+0x28) [0x562191255da8]
>  14: (PaxosService::_active()+0x60a) [0x5621911d896a]
>  15: (PaxosService::election_finished()+0x7a) [0x5621911d8d7a]
>  16: (Monitor::win_election(unsigned int, std::set,
> std::allocator >&, unsigned long, MonCommand const*, int, std::set std::less, std::allocator > const*)+0x24e) [0x5621911958ce]
>  17: (Monitor::win_standalone_election()+0x20f) [0x562191195d9f]
>  18: (Monitor::bootstrap()+0x91b) [0x56219119676b]
>  19: (Monitor::init()+0x17d) [0x562191196a5d]
>  20: (main()+0x2694) [0x562191106f44]
>  21: (__libc_start_main()+0xf5) [0x7fad7cdb2b45]
>  22: (()+0x257edf) [0x56219115dedf]
> 2016-10-07 06:50:39.049061 7fad7f86c480 -1 *** Caught signal (Aborted) **
>  in thread 7fad7f86c480 thread_name:ceph-mon
>
>  ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
>  1: (()+0x525737) [0x56219142b737]
>  2: (()+0xf8d0) [0x7fad7eb3c8d0]
>  3: (gsignal()+0x37) [0x7fad7cdc6067]
>  4: (abort()+0x148) [0x7fad7cdc7448]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x15d) [0x7fad7d6b3b3d]
>  6: (()+0x5ebb6) [0x7fad7d6b1bb6]
>  7: (()+0x5ec01) [0x7fad7d6b1c01]
>  8: (()+0x5ee19) [0x7fad7d6b1e19]
>  9: (std::__throw_out_of_range(char const*)+0x66) [0x7fad7d707b76]
>  10: (FSMap::get_filesystem(int) const+0x7c) [0x56219126ed6c]
>  11: (MDSMonitor::maybe_promote_standby(std::shared_ptr)+0x48a)
> [0x56219125b13a]
>  12: (MDSMonitor::tick()+0x4bb) [0x56219126084b]
>  13: (MDSMonitor::on_active()+0x28) [0x562191255da8]
>  14: (PaxosService::_active()+0x60a) [0x5621911d896a]
>  15: (PaxosService::election_finished()+0x7a) [0x5621911d8d7a]
>  16: (Monitor::win_election(unsigned int, std::set,
> std::allocator >&, unsigned long, MonCommand const*, int, std::set std::less, std::allocator > const*)+0x24e) [0x5621911958ce]
>  17: (Monitor::win_standalone_election()+0x20f) [0x562191195d9f]
>  18: (Monitor::bootstrap()+0x91b) [0x56219119676b]
>  19: (Monitor::init()+0x17d) [0x562191196a5d]
>  20: (main()+0x2694) [0x562191106f44]
>  21: (__libc_start_main()+0xf5) [0x7fad7cdb2b45]
>  22: (()+0x257edf) [0x56219115dedf]
>  NOTE: a copy of the executable, or `objdump -rdS ` is needed to
> interpret this.
>
>  0> 2016-10-07 06:50:39.049061 7fad7f86c480 -1 *** Caught signal
> (Aborted) **
>  in thread 7fad7f86c480 thread_name:ceph-mon
>
>  ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
>  1: (()+0x525737) [0x56219142b737]
>  2: (()+0xf8d0) [0x7fad7eb3c8d0]
>  3: (gsignal()+0x37) [0x7fad7cdc6067]
>  4: (abort()+0x148) [0x7fad7cdc7448]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x15d) [0x7fad7d6b3b3d]
>  6: (()+0x5ebb6) [0x7fad7d6b1bb6]
>  7: (()+0x5ec01) [0x7fad7d6b1c01]
>  8: (()+0x5ee19) [0x7fad7d6b1e19]
>  9: (std::__throw_out_of_range(char const*)+0x66) [0x7fad7d707b76]
>  10: (FSMap::get_filesystem(int) const+0x7c) [0x56219126ed6c]
>  11: (MDSMonitor::maybe_promote_standby(std::shared_ptr)+0x48a)
> [0x56219125b13a]
>  12: (MDSMonitor::tick()+0x4bb) [0x56219126084b]
>  13: (MDSMonitor::on_active()+0x28) [0x562191255da8]
>  14: (PaxosService::_active()+0x60a) [0x5621911d896a]
>  15: (PaxosService::election_finished()+0x7a) [0x5621911d8d7a]
>  16: (Monitor::win_election(unsigned int, std::set,
> std::allocator >&, unsigned long, MonCommand const*, int, std::set std::less, std::allocator > const*)+0x24e) [0x5621911958ce

Re: [ceph-users] unable to start radosgw after upgrade from 10.2.2 to 10.2.3

2016-10-07 Thread Orit Wasserman
On Fri, Oct 7, 2016 at 12:24 PM, Andrei Mikhailovsky  wrote:
> Hi Orit,
>
> The radosgw service has been configured about two years ago using the 
> documentation on the ceph.com website. No changes to configuration has been 
> done since. The service was working fine until the 10.2.3 update recently. I 
> have been updating ceph to include every major release and practically every 
> minor release (apart from maybe one or two about a year ago.
>
> I have followed the insructions in the following link 
> (https://www.mail-archive.com/ceph-users@lists.ceph.com/msg31764.html) and 
> that has solved the issue with radosgw not able to start at all.
>
> To answer your question, I now have the following from the commands that 
> you've listed. Please let me know if something is missing.
>
>
> # radosgw-admin realm get
> {
> "id": "5b41b1b2-0f92-463d-b582-07552f83e66c",
> "name": "default",
> "current_period": "286475fa-625b-4fdb-97bf-dcec4b437960",
> "epoch": 1
> }
>
>
>
> # radosgw-admin zonegroupmap get
> {
> "zonegroups": [],
> "master_zonegroup": "",
> "bucket_quota": {
> "enabled": false,
> "max_size_kb": -1,
> "max_objects": -1
> },
> "user_quota": {
> "enabled": false,
> "max_size_kb": -1,
> "max_objects": -1
> }
> }
>
>
>
> # radosgw-admin zonegroup get
> {
> "id": "default",
> "name": "default",
> "api_name": "",
> "is_master": "true",
> "endpoints": [],
> "hostnames": [],
> "hostnames_s3website": [],
> "master_zone": "",

You need to set the master zone to be the default zone id (which is also
"default", because you upgraded from an older version).

try: radosgw-admin zone modify --master --rgw-zone default

If that doesn't work there is a more complicated procedure that I hope
we can avoid.
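
For what it's worth, since a realm/period exists here (see the realm get output
above), the zone change may also need to be committed to the period -- a hedged
sketch, to be adapted carefully:

radosgw-admin zone modify --rgw-zone=default --master
radosgw-admin period update --commit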

> "zones": [
> {
> "id": "default",
> "name": "default",
> "endpoints": [],
> "log_meta": "false",
> "log_data": "false",
> "bucket_index_max_shards": 0,
> "read_only": "false"
> }
> ],
> "placement_targets": [
> {
> "name": "default-placement",
> "tags": []
> }
> ],
> "default_placement": "default-placement",
> "realm_id": ""
> }
>
>
>
> # radosgw-admin zone get  --rgw-zone=default
> {
> "id": "default",
> "name": "default",
> "domain_root": ".rgw",
> "control_pool": ".rgw.control",
> "gc_pool": ".rgw.gc",
> "log_pool": ".log",
> "intent_log_pool": ".intent-log",
> "usage_log_pool": ".usage",
> "user_keys_pool": ".users",
> "user_email_pool": ".users.email",
> "user_swift_pool": ".users.swift",
> "user_uid_pool": ".users.uid",
> "system_key": {
> "access_key": "",
> "secret_key": ""
> },
> "placement_pools": [
> {
> "key": "default-placement",
> "val": {
> "index_pool": ".rgw.buckets.index",
> "data_pool": ".rgw.buckets",
> "data_extra_pool": "",
> "index_type": 0
> }
> }
> ],
> "metadata_heap": ".rgw.meta",
> "realm_id": ""
> }
>
>
>
>
>
> I did notice that I am missing the .usage pool as per "usage_log_pool": 
> ".usage", setting. Should I create it with the same settings as the .users 
> pool?

No need, rgw creates it when it needs it.

Orit
> Cheers
>
> Andrei
>
>
>
>
> - Original Message -
>> From: "Orit Wasserman" 
>> To: "andrei" 
>> Cc: "ceph-users" 
>> Sent: Friday, 7 October, 2016 10:21:44
>> Subject: Re: [ceph-users] unable to start radosgw after upgrade from 10.2.2 
>> to 10.2.3
>
>> Hi,
>>
>> On Wed, Oct 5, 2016 at 11:23 PM, Andrei Mikhailovsky  
>> wrote:
>>> Hello everyone,
>>>
>>> I've just updated my ceph to version 10.2.3 from 10.2.2 and I am no longer
>>> able to start the radosgw service. When executing I get the following error:
>>>
>>> 2016-10-05 22:14:10.735883 7f1852d26a00  0 ceph version 10.2.3
>>> (ecc23778eb545d8dd55e2e4735b53cc93f92e65b), process radosgw, pid 2711
>>> 2016-10-05 22:14:10.765648 7f1852d26a00  0 pidfile_write: ignore empty
>>> --pid-file
>>> 2016-10-05 22:14:11.287772 7f1852d26a00  0 zonegroup default missing zone
>>> for master_zone=
>>
>> This means you are missing a master zone , you can get here only if
>> you configured a realm.
>> Is that the case?
>>
>> Can you provide:
>> radosgw-admin realm get
>> radosgw-admin zonegroupmap get
>> radosgw-admin zonegroup get
>> radosgw-admin zone get --rgw-zone=default
>>
>> Orit
>>
>>> 2016-10-05 22:14:11.294141 7f1852d26a00 -1 Couldn't init storage provider
>>> (RADOS)
>>>
>>>
>>>
>>> I had no issues starting rados on 10.2.2 and all versions prior to that.
>>>
>>> I am running ceph 10.2.3 on Ubuntu 16.04 LTS servers.
>>>
>>> Could someone please help me with fixing the problem?
>>>
>>> Thanks
>>>
>>> Andrei
>>>
>>> 

Re: [ceph-users] Ceph Mon Crashing after creating Cephfs

2016-10-07 Thread James Horner
Hi John

Thanks for that, life saver! Running on Debian Jessie, I replaced the
main ceph repo in sources.list.d with:

deb
http://gitbuilder.ceph.com/ceph-deb-jessie-x86_64-basic/ref/wip-17466-jewel/
jessie main

Updated and Upgraded Ceph, tried to manually run my mon which failed as it
had already been started during the upgrade!

Just to ask about the gitbuilder repos, is there a way I can track if this
patch gets pushed into the mainline (10.2.4 or something)? Are there any
gotchas to consider with using them?

Thanks again, My Domain Controller thanks you, my mailserver thanks you and
my webserver thanks you!!!


James

On 7 October 2016 at 11:37, John Spray  wrote:

> On Fri, Oct 7, 2016 at 8:04 AM, James Horner 
> wrote:
> > Hi All
> >
> > Just wondering if anyone can help me out here. Small home cluster with 1
> > mon, the next phase of the plan called for more but I hadn't got there
> yet.
> >
> > I was trying to setup Cephfs and I ran "ceph fs new" without having an
> MDS
> > as I was having issues with rank 0 immediately being degraded. My
> thinking
> > was that I would bring up an MDS and it would be assigned to rank 0.
> Anyhoo
> > after I did that my mon crashed and I havn't been able to restart it
> since,
> > its output is:
> >
> > root@bertie ~ $ /usr/bin/ceph-mon -f --cluster ceph --id bertie
> --setuser
> > ceph --setgroup ceph 2>&1 | tee /var/log/ceph/mon-temp
> > starting mon.bertie rank 0 at 192.168.2.3:6789/0 mon_data
> > /var/lib/ceph/mon/ceph-bertie fsid 06e2f4e0-35e1-4f8c-b2a0-bc72c4cd3199
> > terminate called after throwing an instance of 'std::out_of_range'
> >   what():  map::at
> > *** Caught signal (Aborted) **
> >  in thread 7fad7f86c480 thread_name:ceph-mon
> >  ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
> >  1: (()+0x525737) [0x56219142b737]
> >  2: (()+0xf8d0) [0x7fad7eb3c8d0]
> >  3: (gsignal()+0x37) [0x7fad7cdc6067]
> >  4: (abort()+0x148) [0x7fad7cdc7448]
> >  5: (__gnu_cxx::__verbose_terminate_handler()+0x15d) [0x7fad7d6b3b3d]
> >  6: (()+0x5ebb6) [0x7fad7d6b1bb6]
> >  7: (()+0x5ec01) [0x7fad7d6b1c01]
> >  8: (()+0x5ee19) [0x7fad7d6b1e19]
> >  9: (std::__throw_out_of_range(char const*)+0x66) [0x7fad7d707b76]
> >  10: (FSMap::get_filesystem(int) const+0x7c) [0x56219126ed6c]
> >  11: (MDSMonitor::maybe_promote_standby(std::shared_ptr<
> Filesystem>)+0x48a)
> > [0x56219125b13a]
> >  12: (MDSMonitor::tick()+0x4bb) [0x56219126084b]
> >  13: (MDSMonitor::on_active()+0x28) [0x562191255da8]
> >  14: (PaxosService::_active()+0x60a) [0x5621911d896a]
> >  15: (PaxosService::election_finished()+0x7a) [0x5621911d8d7a]
> >  16: (Monitor::win_election(unsigned int, std::set,
> > std::allocator >&, unsigned long, MonCommand const*, int,
> std::set > std::less, std::allocator > const*)+0x24e) [0x5621911958ce]
> >  17: (Monitor::win_standalone_election()+0x20f) [0x562191195d9f]
> >  18: (Monitor::bootstrap()+0x91b) [0x56219119676b]
> >  19: (Monitor::init()+0x17d) [0x562191196a5d]
> >  20: (main()+0x2694) [0x562191106f44]
> >  21: (__libc_start_main()+0xf5) [0x7fad7cdb2b45]
> >  22: (()+0x257edf) [0x56219115dedf]
> > 2016-10-07 06:50:39.049061 7fad7f86c480 -1 *** Caught signal (Aborted) **
> >  in thread 7fad7f86c480 thread_name:ceph-mon
> >
> >  ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
> >  1: (()+0x525737) [0x56219142b737]
> >  2: (()+0xf8d0) [0x7fad7eb3c8d0]
> >  3: (gsignal()+0x37) [0x7fad7cdc6067]
> >  4: (abort()+0x148) [0x7fad7cdc7448]
> >  5: (__gnu_cxx::__verbose_terminate_handler()+0x15d) [0x7fad7d6b3b3d]
> >  6: (()+0x5ebb6) [0x7fad7d6b1bb6]
> >  7: (()+0x5ec01) [0x7fad7d6b1c01]
> >  8: (()+0x5ee19) [0x7fad7d6b1e19]
> >  9: (std::__throw_out_of_range(char const*)+0x66) [0x7fad7d707b76]
> >  10: (FSMap::get_filesystem(int) const+0x7c) [0x56219126ed6c]
> >  11: (MDSMonitor::maybe_promote_standby(std::shared_ptr<
> Filesystem>)+0x48a)
> > [0x56219125b13a]
> >  12: (MDSMonitor::tick()+0x4bb) [0x56219126084b]
> >  13: (MDSMonitor::on_active()+0x28) [0x562191255da8]
> >  14: (PaxosService::_active()+0x60a) [0x5621911d896a]
> >  15: (PaxosService::election_finished()+0x7a) [0x5621911d8d7a]
> >  16: (Monitor::win_election(unsigned int, std::set,
> > std::allocator >&, unsigned long, MonCommand const*, int,
> std::set > std::less, std::allocator > const*)+0x24e) [0x5621911958ce]
> >  17: (Monitor::win_standalone_election()+0x20f) [0x562191195d9f]
> >  18: (Monitor::bootstrap()+0x91b) [0x56219119676b]
> >  19: (Monitor::init()+0x17d) [0x562191196a5d]
> >  20: (main()+0x2694) [0x562191106f44]
> >  21: (__libc_start_main()+0xf5) [0x7fad7cdb2b45]
> >  22: (()+0x257edf) [0x56219115dedf]
> >  NOTE: a copy of the executable, or `objdump -rdS ` is
> needed to
> > interpret this.
> >
> >  0> 2016-10-07 06:50:39.049061 7fad7f86c480 -1 *** Caught signal
> > (Aborted) **
> >  in thread 7fad7f86c480 thread_name:ceph-mon
> >
> >  ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
> >  1: (()+0x525737) [0x56219142b737]
> >  2: (()+0xf8d0) [0x7fa

[ceph-users] Crash in ceph_read_iter->__free_pages due to null page

2016-10-07 Thread Nikolay Borisov
Hello, 

I've encountered yet another cephfs crash: 

[990188.822271] BUG: unable to handle kernel NULL pointer dereference at 
001c
[990188.822790] IP: [] __free_pages+0x5/0x30
[990188.823090] PGD 180dd8f067 PUD 1bf2722067 PMD 0 
[990188.823506] Oops: 0002 [#1] SMP 
[990188.831274] CPU: 25 PID: 18418 Comm: php-fpm Tainted: G   O
4.4.20-clouder2 #6
[990188.831650] Hardware name: Supermicro X10DRi/X10DRi, BIOS 2.0 12/28/2015
[990188.831876] task: 8822a3b7b700 ti: 88022427c000 task.ti: 
88022427c000
[990188.832249] RIP: 0010:[]  [] 
__free_pages+0x5/0x30
[990188.832691] RSP: :88022427fda8  EFLAGS: 00010246
[990188.832914] RAX: fe00 RBX: 0f3d RCX: 
c100
[990188.833292] RDX: 47f2 RSI:  RDI: 

[990188.833670] RBP: 88022427fe50 R08: 88022427c000 R09: 
00038459d3aa3ee4
[990188.834049] R10: 00013b00e4b8 R11:  R12: 

[990188.834429] R13: 8802c5189f88 R14: 881091270ca8 R15: 
88022427fe70
[990188.838820] FS:  7fc8ff5cb7c0() GS:881fffba() 
knlGS:
[990188.839197] CS:  0010 DS:  ES:  CR0: 80050033
[990188.839420] CR2: 001c CR3: 000405f7e000 CR4: 
001406e0
[990188.839797] Stack:
[990188.840013]  a044a1bc 8806  
88022427fe70
[990188.840639]  8802c5189f88 88189297b6a0 0f3d 
8810fe00
[990188.841263]  88022427fe98  2000 
8802c5189c20
[990188.841886] Call Trace:
[990188.842115]  [] ? ceph_read_iter+0x19c/0x5f0 [ceph]
[990188.842345]  [] __vfs_read+0xa7/0xd0
[990188.842568]  [] vfs_read+0x86/0x130
[990188.842792]  [] SyS_read+0x46/0xa0
[990188.843018]  [] entry_SYSCALL_64_fastpath+0x16/0x6e
[990188.843243] Code: e2 48 89 de ff d1 49 8b 0f 48 85 c9 75 e8 65 ff 0d 99 a7 
ed 7e eb 85 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00  ff 
4f 1c 74 01 c3 55 85 f6 48 89 e5 74 07 e8 f7 f5 ff ff 5d 
[990188.847887] RIP  [] __free_pages+0x5/0x30
[990188.848183]  RSP 
[990188.848404] CR2: 001c

The problem is that the page (%RDI) being passed to __free_pages is NULL. Also,
retry_op is CHECK_EOF (1), so the page allocation didn't execute, which leads
to the NULL page. statret is fe00, which seems to be -ERESTARTSYS.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] jewel/CephFS - misc problems (duplicate strays, mismatch between head items and fnode.fragst)

2016-10-07 Thread John Spray
On Fri, Oct 7, 2016 at 1:05 AM, Kjetil Jørgensen  wrote:
> Hi,
>
> context (i.e. what we're doing): We're migrating (or trying to) migrate off
> of an nfs server onto cephfs, for a workload that's best described as "big
> piles" of hardlinks. Essentially, we have a set of "sources":
> foo/01/
> foo/0b/<0b>
> .. and so on
> bar/02/..
> bar/0c/..
> .. and so on
>
> foo/bar/friends have been "cloned" numerous times to a set of names that
> over the course of weeks end up being recycled again, the clone is
> essentially cp -L foo copy-1-of-foo.
>
> We're doing "incremental" rsyncs of this onto cephfs, so the sense of "the
> original source of the hardlink" will end up moving around, depending on the
> whims of rsync. (if it matters, I found some allusion to "if the original
> file hardlinked is deleted, ...".

This might not be much help but... have you thought about making your
application use hardlinks less aggressively?  They have an intrinsic
overhead in any system that stores inodes locally to directories (like
we do), because you have to take an extra step to resolve them.

In CephFS, resolving a hard link involves reading the dentry (where we
would usually have the inode inline), then going and finding an
object from the data pool by the inode number, reading the "backtrace"
(i.e. path) from that object, and then going back to the metadata pool
to traverse that path.  It's all very fast if your metadata fits in
your MDS cache, but will slow down a lot otherwise, especially as your
metadata IOs are now potentially getting held up by anything hammering
your data pool.
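
As an aside, the backtrace can be inspected directly if you are curious -- it
lives in the "parent" xattr of the inode's first object in the data pool. A
rough sketch (the pool name and inode number here are just examples):

# object name is <inode number in hex>.00000000
rados -p cephfs_data getxattr 10003f25eaf.00000000 parent > /tmp/parent
ceph-dencoder type inode_backtrace_t import /tmp/parent decode dump_json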

By the way, if your workload is relatively little code and you can
share it, it sounds like it would be a useful hardlink stress test for
our test suite...

> For RBD the ceph cluster have mostly been rather well behaved, the problems
> we have had have for the most part been self-inflicted. Before introducing
> the hardlink spectacle to cephfs, the same filesystem were used for
> light-ish read-mostly loads, beint mostly un-eventful. (That being said, we
> did patch it for
>
> Cluster is v10.2.2 (mds v10.2.2+4d15eb12298e007744486e28924a6f0ae071bd06),
> clients are ubuntu's 4.4.0-32 kernel(s), and elrepo v4.4.4.
>
> The problems we're facing:
>
> Maybe a "non-problem" I have ~6M strays sitting around

So as you hint above, when the original file is deleted, the inode
goes into a stray dentry.  The next time someone reads the file via
one of its other links, the inode gets "reintegrated" (via
eval_remote_stray()) into the dentry it was read from.

> Slightly more problematic, I have duplicate stray(s) ? See log excercepts
> below. Also; rados -p cephfs_metadata listomapkeys 60X. did/does
> seem to agree with there being duplicate strays (assuming 60X. is
> the directory indexes for the stray catalogs), caveat "not a perfect
> snapshot", listomapkeys issued in serial fashion.
> We stumbled across (http://tracker.ceph.com/issues/17177 - mostly here for
> more context)

When you say you stumbled across it, do you mean that you actually had
this same deep scrub error on your system, or just that you found the
ticket?

> There's been a couple of instances of invalid backtrace(s), mostly solved by
> either mds:scrub_path or just unlinking the files/directories in question
> and re-rsync-ing.
>
> mismatch between head items and fnode.fragstat (See below for more of the
> log excercept), appeared to have been solved by mds:scrub_path
>
>
> Duplicate stray(s), ceph-mds complains (a lot, during rsync):
> 2016-09-30 20:00:21.978314 7ffb653b8700  0 mds.0.cache.dir(603) _fetched
> badness: got (but i already had) [inode 10003f25eaf [...2,head]
> ~mds0/stray0/10003f25eaf auth v38836572 s=8998 nl=5 n(v0 b8998 1=1+0)
> (iversion lock) 0x561082e6b520] mode 33188 mtime 2016-07-25 03:02:50.00
> 2016-09-30 20:00:21.978336 7ffb653b8700 -1 log_channel(cluster) log [ERR] :
> loaded dup inode 10003f25eaf [2,head] v36792929 at ~mds0/stray3/10003f25eaf,
> but inode 10003f25eaf.head v38836572 already exists at
> ~mds0/stray0/10003f25eaf

Is your workload doing lots of delete/create cycles of hard links to
the same inode?

I wonder if we are seeing a bug where a new stray is getting created
before the old one has been properly removed, due to some bogus
assumption in the code that stray unlinks don't need to be persisted
as rigorously.

>
> I briefly ran ceph-mds with debug_mds=20/20 which didn't yield anything
> immediately useful, beyond slightly-easier-to-follow the control-flow of
> src/mds/CDir.cc without becoming much wiser.
> 2016-09-30 20:43:51.910754 7ffb653b8700 20 mds.0.cache.dir(606) _fetched pos
> 310473 marker 'I' dname '100022e8617 [2,head]
> 2016-09-30 20:43:51.910757 7ffb653b8700 20 mds.0.cache.dir(606) lookup
> (head, '100022e8617')
> 2016-09-30 20:43:51.910759 7ffb653b8700 20 mds.0.cache.dir(606)   miss ->
> (10002a81c10,head)
> 2016-09-30 20:43:51.910762 7ffb653b8700  0 mds.0.cache.dir(606) _fetched
> badness: got (but i alread

Re: [ceph-users] Ceph Mon Crashing after creating Cephfs

2016-10-07 Thread John Spray
On Fri, Oct 7, 2016 at 12:37 PM, James Horner  wrote:
> Hi John
>
> Thanks for that, life saver! Running on Debian Jessie and I replaced the
> mail ceph repo in source.d to:
>
> deb
> http://gitbuilder.ceph.com/ceph-deb-jessie-x86_64-basic/ref/wip-17466-jewel/
> jessie main
>
> Updated and Upgraded Ceph, tried to manually run my mon which failed as it
> had already been started during the upgrade!
>
> Just to ask about the gitbuilder repo's, is there a way I can track if this
> patch gets pushed into the mainline (10.2.4 or something)? Are there any
> gotchas to consider with using them?

The release notes for the stable releases contain a list of tickets
fixed, so you can search for that.  We also have "Fixes:" lines in
commit messages so you can "git log --grep" for the particular URL in
any branch.  No gotchas with this particular set of patches, other
than the obvious that it isn't strictly a stable release and
consequently has had less testing.
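
For example, something like this (assuming a checkout with the jewel branch,
and using the ticket number from the wip branch name above):

git log --grep='http://tracker.ceph.com/issues/17466' origin/jewel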

I would be fairly certain this will go into jewel soon and then 10.2.4
when it comes out.

John

>
> Thanks again, My Domain Controller thanks you, my mailserver thanks you and
> my webserver thanks you!!!
>
>
> James
>
> On 7 October 2016 at 11:37, John Spray  wrote:
>>
>> On Fri, Oct 7, 2016 at 8:04 AM, James Horner 
>> wrote:
>> > Hi All
>> >
>> > Just wondering if anyone can help me out here. Small home cluster with 1
>> > mon, the next phase of the plan called for more but I hadn't got there
>> > yet.
>> >
>> > I was trying to setup Cephfs and I ran "ceph fs new" without having an
>> > MDS
>> > as I was having issues with rank 0 immediately being degraded. My
>> > thinking
>> > was that I would bring up an MDS and it would be assigned to rank 0.
>> > Anyhoo
>> > after I did that my mon crashed and I havn't been able to restart it
>> > since,
>> > its output is:
>> >
>> > root@bertie ~ $ /usr/bin/ceph-mon -f --cluster ceph --id bertie
>> > --setuser
>> > ceph --setgroup ceph 2>&1 | tee /var/log/ceph/mon-temp
>> > starting mon.bertie rank 0 at 192.168.2.3:6789/0 mon_data
>> > /var/lib/ceph/mon/ceph-bertie fsid 06e2f4e0-35e1-4f8c-b2a0-bc72c4cd3199
>> > terminate called after throwing an instance of 'std::out_of_range'
>> >   what():  map::at
>> > *** Caught signal (Aborted) **
>> >  in thread 7fad7f86c480 thread_name:ceph-mon
>> >  ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
>> >  1: (()+0x525737) [0x56219142b737]
>> >  2: (()+0xf8d0) [0x7fad7eb3c8d0]
>> >  3: (gsignal()+0x37) [0x7fad7cdc6067]
>> >  4: (abort()+0x148) [0x7fad7cdc7448]
>> >  5: (__gnu_cxx::__verbose_terminate_handler()+0x15d) [0x7fad7d6b3b3d]
>> >  6: (()+0x5ebb6) [0x7fad7d6b1bb6]
>> >  7: (()+0x5ec01) [0x7fad7d6b1c01]
>> >  8: (()+0x5ee19) [0x7fad7d6b1e19]
>> >  9: (std::__throw_out_of_range(char const*)+0x66) [0x7fad7d707b76]
>> >  10: (FSMap::get_filesystem(int) const+0x7c) [0x56219126ed6c]
>> >  11:
>> > (MDSMonitor::maybe_promote_standby(std::shared_ptr)+0x48a)
>> > [0x56219125b13a]
>> >  12: (MDSMonitor::tick()+0x4bb) [0x56219126084b]
>> >  13: (MDSMonitor::on_active()+0x28) [0x562191255da8]
>> >  14: (PaxosService::_active()+0x60a) [0x5621911d896a]
>> >  15: (PaxosService::election_finished()+0x7a) [0x5621911d8d7a]
>> >  16: (Monitor::win_election(unsigned int, std::set,
>> > std::allocator >&, unsigned long, MonCommand const*, int,
>> > std::set> > std::less, std::allocator > const*)+0x24e) [0x5621911958ce]
>> >  17: (Monitor::win_standalone_election()+0x20f) [0x562191195d9f]
>> >  18: (Monitor::bootstrap()+0x91b) [0x56219119676b]
>> >  19: (Monitor::init()+0x17d) [0x562191196a5d]
>> >  20: (main()+0x2694) [0x562191106f44]
>> >  21: (__libc_start_main()+0xf5) [0x7fad7cdb2b45]
>> >  22: (()+0x257edf) [0x56219115dedf]
>> > 2016-10-07 06:50:39.049061 7fad7f86c480 -1 *** Caught signal (Aborted)
>> > **
>> >  in thread 7fad7f86c480 thread_name:ceph-mon
>> >
>> >  ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
>> >  1: (()+0x525737) [0x56219142b737]
>> >  2: (()+0xf8d0) [0x7fad7eb3c8d0]
>> >  3: (gsignal()+0x37) [0x7fad7cdc6067]
>> >  4: (abort()+0x148) [0x7fad7cdc7448]
>> >  5: (__gnu_cxx::__verbose_terminate_handler()+0x15d) [0x7fad7d6b3b3d]
>> >  6: (()+0x5ebb6) [0x7fad7d6b1bb6]
>> >  7: (()+0x5ec01) [0x7fad7d6b1c01]
>> >  8: (()+0x5ee19) [0x7fad7d6b1e19]
>> >  9: (std::__throw_out_of_range(char const*)+0x66) [0x7fad7d707b76]
>> >  10: (FSMap::get_filesystem(int) const+0x7c) [0x56219126ed6c]
>> >  11:
>> > (MDSMonitor::maybe_promote_standby(std::shared_ptr)+0x48a)
>> > [0x56219125b13a]
>> >  12: (MDSMonitor::tick()+0x4bb) [0x56219126084b]
>> >  13: (MDSMonitor::on_active()+0x28) [0x562191255da8]
>> >  14: (PaxosService::_active()+0x60a) [0x5621911d896a]
>> >  15: (PaxosService::election_finished()+0x7a) [0x5621911d8d7a]
>> >  16: (Monitor::win_election(unsigned int, std::set,
>> > std::allocator >&, unsigned long, MonCommand const*, int,
>> > std::set> > std::less, std::allocator > const*)+0x24e) [0x5621911958ce]
>> >  17: (Monitor::win_standalone_elect

Re: [ceph-users] Hammer OSD memory usage very high

2016-10-07 Thread Haomai Wang
Did you try restarting the OSD to see the memory usage?

On Fri, Oct 7, 2016 at 1:04 PM, David Burns  wrote:
> Hello all,
>
> We have a small 160TB Ceph cluster used only as a test s3 storage repository 
> for media content.
>
> Problem
> Since upgrading from Firefly to Hammer we are experiencing very high OSD 
> memory use of 2-3 GB per TB of OSD storage - typical OSD memory 6-10GB.
> We have had to increase swap space to bring the cluster to a basic functional 
> state. Clearly this will significantly impact system performance and 
> precludes starting all OSDs simultaneously.
>
> Hardware
> 4 x storage nodes with 16 OSDs/node. OSD nodes are reasonable spec SMC 
> storage servers with dual Xeon CPUs. Storage is 16 x 3TB SAS disks in each 
> node.
> Installed RAM is 72GB (2 nodes) & 80GB (2 nodes). (We note that the installed 
> RAM is at least 50% higher than the Ceph recommended 1 GB RAM per TB of 
> storage.)
>
> Software
> OSD node OS is CentOS 6.8 (with updates). One node has been updated to CentOS 
> 7.2 - no change in memory usage was observed.
>
> "ceph -v" -> ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90)
> (all Ceph packages downloaded from download.ceph.com)
>
> The cluster has achieved status HEALTH_OK so we don’t believe this relates to 
> increased memory due to recovery.
>
> History
> Emperor 0.72.2 -> Firefly 0.80.10 -> Hammer 0.94.6 -> Hammer 0.94.7 -> Hammer 
> 0.94.9
>
> OSD per process memory is observed to increase substantially during load_pgs 
> phase.
>
> Use of "ceph tell 'osd.*' heap release” has minimal effect - there is no 
> substantial memory in the heap or cache freelists.
>
> More information can be found in bug #17228 (link 
> http://tracker.ceph.com/issues/17228)
>
> Any feedback or guidance to further understanding the high memory usage would 
> be welcomed.
>
> Thanks
>
> David
>
>
> --
> FetchTV Pty Ltd, Level 5, 61 Lavender Street, Milsons Point, NSW 2061
>
> 
>
> This email is sent by FetchTV Pty Ltd (ABN 36 130 669 500). The contents of
> this communication may be
> confidential, legally privileged and/or copyright material. If you are not
> the intended recipient, any use,
> disclosure or copying of this communication is expressed prohibited. If you
> have received this email in error,
> please notify the sender and delete it immediately.
>
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] rsync kernel client cepfs mkstemp no space left on device

2016-10-07 Thread Hauke Homburg
Hello,

I have a Ceph cluster with 5 servers and 40 OSDs. Currently this cluster
has 85GB of free space, and the rsync dir has lots of pictures and a data
volume of 40GB.

The OS is CentOS 7 with the latest stable Ceph. The client is a Debian
8 with a 4.x kernel, and the cluster is mounted via cephfs.

When I sync the directory I often see the message "rsync mkstemp: no space
left on device (28)". At that point I can still touch a file in another
directory in the cluster. In the directory I have ~63 files. Is this too many
files?

greetings

Hauke


-- 
www.w3-creative.de

www.westchat.de


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] jewel/CephFS - misc problems (duplicate strays, mismatch between head items and fnode.fragst)

2016-10-07 Thread Yan, Zheng
On Fri, Oct 7, 2016 at 8:20 AM, Kjetil Jørgensen  wrote:
> And - I just saw another recent thread -
> http://tracker.ceph.com/issues/17177 - can be an explanation of most/all of
> the above ?
>
> Next question(s) would then be:
>
> How would one deal with duplicate stray(s)

Here is an untested method:

list the omap keys in objects 600. ~ 609., and find all duplicated keys

for each duplicated key, use ceph-dencoder to decode their values,
find the one that has the biggest version, and delete the rest
(ceph-dencoder type inode_t skip 9 import /tmp/ decode dump_json)
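
For illustration, the steps above could look roughly like this (object and omap
key names are placeholders, and as noted the procedure is untested):

# mds.0 stray dirfrags are objects 600.00000000 .. 609.00000000 in the metadata pool
rados -p cephfs_metadata listomapkeys 600.00000000 > /tmp/stray0.keys
# ...repeat for 601..609, then look for keys that appear in more than one stray dir

# decode both copies of a duplicated key and compare their versions
rados -p cephfs_metadata getomapval 600.00000000 '10003f25eaf_head' /tmp/val
ceph-dencoder type inode_t skip 9 import /tmp/val decode dump_json

# remove the stale copy (keep the one with the biggest version)
rados -p cephfs_metadata rmomapkey 600.00000000 '10003f25eaf_head'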

Regards
Yan, Zheng

> How would one deal with mismatch between head items and fnode.fragstat, ceph
> daemon mds.foo scrub_path ?
>
> -KJ
>
> On Thu, Oct 6, 2016 at 5:05 PM, Kjetil Jørgensen 
> wrote:
>>
>> Hi,
>>
>> context (i.e. what we're doing): We're migrating (or trying to) migrate
>> off of an nfs server onto cephfs, for a workload that's best described as
>> "big piles" of hardlinks. Essentially, we have a set of "sources":
>> foo/01/
>> foo/0b/<0b>
>> .. and so on
>> bar/02/..
>> bar/0c/..
>> .. and so on
>>
>> foo/bar/friends have been "cloned" numerous times to a set of names that
>> over the course of weeks end up being recycled again, the clone is
>> essentially cp -L foo copy-1-of-foo.
>>
>> We're doing "incremental" rsyncs of this onto cephfs, so the sense of "the
>> original source of the hardlink" will end up moving around, depending on the
>> whims of rsync. (if it matters, I found some allusion to "if the original
>> file hardlinked is deleted, ...".
>>
>> For RBD the ceph cluster have mostly been rather well behaved, the
>> problems we have had have for the most part been self-inflicted. Before
>> introducing the hardlink spectacle to cephfs, the same filesystem were used
>> for light-ish read-mostly loads, beint mostly un-eventful. (That being said,
>> we did patch it for
>>
>> Cluster is v10.2.2 (mds v10.2.2+4d15eb12298e007744486e28924a6f0ae071bd06),
>> clients are ubuntu's 4.4.0-32 kernel(s), and elrepo v4.4.4.
>>
>> The problems we're facing:
>>
>> Maybe a "non-problem" I have ~6M strays sitting around
>> Slightly more problematic, I have duplicate stray(s) ? See log excercepts
>> below. Also; rados -p cephfs_metadata listomapkeys 60X. did/does
>> seem to agree with there being duplicate strays (assuming 60X. is
>> the directory indexes for the stray catalogs), caveat "not a perfect
>> snapshot", listomapkeys issued in serial fashion.
>> We stumbled across (http://tracker.ceph.com/issues/17177 - mostly here for
>> more context)
>> There's been a couple of instances of invalid backtrace(s), mostly solved
>> by either mds:scrub_path or just unlinking the files/directories in question
>> and re-rsync-ing.
>> mismatch between head items and fnode.fragstat (See below for more of the
>> log excercept), appeared to have been solved by mds:scrub_path
>>
>>
>> Duplicate stray(s), ceph-mds complains (a lot, during rsync):
>> 2016-09-30 20:00:21.978314 7ffb653b8700  0 mds.0.cache.dir(603) _fetched
>> badness: got (but i already had) [inode 10003f25eaf [...2,head]
>> ~mds0/stray0/10003f25eaf auth v38836572 s=8998 nl=5 n(v0 b8998 1=1+0)
>> (iversion lock) 0x561082e6b520] mode 33188 mtime 2016-07-25 03:02:50.00
>> 2016-09-30 20:00:21.978336 7ffb653b8700 -1 log_channel(cluster) log [ERR]
>> : loaded dup inode 10003f25eaf [2,head] v36792929 at
>> ~mds0/stray3/10003f25eaf, but inode 10003f25eaf.head v38836572 already
>> exists at ~mds0/stray0/10003f25eaf
>>
>> I briefly ran ceph-mds with debug_mds=20/20 which didn't yield anything
>> immediately useful, beyond slightly-easier-to-follow the control-flow of
>> src/mds/CDir.cc without becoming much wiser.
>> 2016-09-30 20:43:51.910754 7ffb653b8700 20 mds.0.cache.dir(606) _fetched
>> pos 310473 marker 'I' dname '100022e8617 [2,head]
>> 2016-09-30 20:43:51.910757 7ffb653b8700 20 mds.0.cache.dir(606) lookup
>> (head, '100022e8617')
>> 2016-09-30 20:43:51.910759 7ffb653b8700 20 mds.0.cache.dir(606)   miss ->
>> (10002a81c10,head)
>> 2016-09-30 20:43:51.910762 7ffb653b8700  0 mds.0.cache.dir(606) _fetched
>> badness: got (but i already had) [inode 100022e8617 [...2,head]
>> ~mds0/stray9/100022e8617 auth v39303851 s=11470 nl=10 n(v0 b11470 1=1+0)
>> (iversion lock) 0x560c013904b8] mode 33188 mtime 2016-07-25 03:38:01.00
>> 2016-09-30 20:43:51.910772 7ffb653b8700 -1 log_channel(cluster) log [ERR]
>> : loaded dup inode 100022e8617 [2,head] v39284583 at
>> ~mds0/stray6/100022e8617, but inode 100022e8617.head v39303851 already
>> exists at ~mds0/stray9/100022e8617
>>
>>
>> 2016-09-25 06:23:50.947761 7ffb653b8700  1 mds.0.cache.dir(10003439a33)
>> mismatch between head items and fnode.fragstat! printing dentries
>> 2016-09-25 06:23:50.947779 7ffb653b8700  1 mds.0.cache.dir(10003439a33)
>> get_num_head_items() = 36; fnode.fragstat.nfiles=53
>> fnode.fragstat.nsubdirs=0
>> 2016-09-25 06:23:50.947782 7ffb653b8700  1 mds.0.cache.dir(10003439a33)
>> mismatch between child account

Re: [ceph-users] CephFS: No space left on device

2016-10-07 Thread Yan, Zheng
On Thu, Oct 6, 2016 at 4:11 PM,   wrote:
> Is there any way to repair pgs/cephfs gracefully?
>

So far no.  We need to write a tool to repair this type of corruption.

Which version of ceph did you use before upgrading to 10.2.3 ?

Regards
Yan, Zheng

>
>
> -Mykola
>
>
>
> From: Yan, Zheng
> Sent: Thursday, 6 October 2016 04:48
> To: Mykola Dvornik
> Cc: John Spray; ceph-users
> Subject: Re: [ceph-users] CephFS: No space left on device
>
>
>
> On Wed, Oct 5, 2016 at 2:27 PM, Mykola Dvornik 
> wrote:
>
>> Hi Zheng,
>
>>
>
>> Many thanks for you reply.
>
>>
>
>> This indicates the MDS metadata is corrupted. Did you do any unusual
>
>> operation on the cephfs? (e.g reset journal, create new fs using
>
>> existing metadata pool)
>
>>
>
>> No, nothing has been explicitly done to the MDS. I had a few inconsistent
>
>> PGs that belonged to the (3 replica) metadata pool. The symptoms were
>
>> similar to http://tracker.ceph.com/issues/17177 . The PGs were eventually
>
>> repaired and no data corruption was expected as explained in the ticket.
>
>>
>
>
>
> I'm afraid that issue does cause corruption.
>
>
>
>> BTW, when I posted this issue on the ML the amount of ground state stry
>
>> objects was around 7.5K. Now it went up to 23K. No inconsistent PGs or any
>
>> other problems happened to the cluster within this time scale.
>
>>
>
>> -Mykola
>
>>
>
>> On 5 October 2016 at 05:49, Yan, Zheng  wrote:
>
>>>
>
>>> On Mon, Oct 3, 2016 at 5:48 AM, Mykola Dvornik 
>
>>> wrote:
>
>>> > Hi Johan,
>
>>> >
>
>>> > Many thanks for your reply. I will try to play with the mds tunables
>>> > and
>
>>> > report back to your ASAP.
>
>>> >
>
>>> > So far I see that mds log contains a lot of errors of the following
>
>>> > kind:
>
>>> >
>
>>> > 2016-10-02 11:58:03.002769 7f8372d54700  0 mds.0.cache.dir(100056ddecd)
>
>>> > _fetched  badness: got (but i already had) [inode 10005729a77 [2,head]
>
>>> > ~mds0/stray1/10005729a77 auth v67464942 s=196728 nl=0 n(v0 b196728
>
>>> > 1=1+0)
>
>>> > (iversion lock) 0x7f84acae82a0] mode 33204 mtime 2016-08-07
>
>>> > 23:06:29.776298
>
>>> >
>
>>> > 2016-10-02 11:58:03.002789 7f8372d54700 -1 log_channel(cluster) log
>
>>> > [ERR] :
>
>>> > loaded dup inode 10005729a77 [2,head] v68621 at
>
>>> >
>
>>> >
>>> > /users/mykola/mms/NCSHNO/final/120nm-uniform-h8200/j002654.out/m_xrange192-320_yrange192-320_016232.dump,
>
>>> > but inode 10005729a77.head v67464942 already exists at
>
>>> > ~mds0/stray1/10005729a77
>
>>>
>
>>> This indicates the MDS metadata is corrupted. Did you do any unusual
>
>>> operation on the cephfs? (e.g reset journal, create new fs using
>
>>> existing metadata pool)
>
>>>
>
>>> >
>
>>> > Those folders within mds.0.cache.dir that got badness report a size of
>
>>> > 16EB
>
>>> > on the clients. rm on them fails with 'Directory not empty'.
>
>>> >
>
>>> > As for the "Client failing to respond to cache pressure", I have 2
>
>>> > kernel
>
>>> > clients on 4.4.21, 1 on 4.7.5 and 16 fuse clients always running the
>
>>> > most
>
>>> > recent release version of ceph-fuse. The funny thing is that every
>
>>> > single
>
>>> > client misbehaves from time to time. I am aware of quite discussion
>
>>> > about
>
>>> > this issue on the ML, but cannot really follow how to debug it.
>
>>> >
>
>>> > Regards,
>
>>> >
>
>>> > -Mykola
>
>>> >
>
>>> > On 2 October 2016 at 22:27, John Spray  wrote:
>
>>> >>
>
>>> >> On Sun, Oct 2, 2016 at 11:09 AM, Mykola Dvornik
>
>>> >>  wrote:
>
>>> >> > After upgrading to 10.2.3 we frequently see messages like
>
>>> >>
>
>>> >> From which version did you upgrade?
>
>>> >>
>
>>> >> > 'rm: cannot remove '...': No space left on device
>
>>> >> >
>
>>> >> > The folders we are trying to delete contain approx. 50K files 193 KB
>
>>> >> > each.
>
>>> >>
>
>>> >> My guess would be that you are hitting the new
>
>>> >> mds_bal_fragment_size_max check.  This limits the number of entries
>
>>> >> that the MDS will create in a single directory fragment, to avoid
>
>>> >> overwhelming the OSD with oversized objects.  It is 100000 by default.
>
>>> >> This limit also applies to "stray" directories where unlinked files
>
>>> >> are put while they wait to be purged, so you could get into this state
>
>>> >> while doing lots of deletions.  There are ten stray directories that
>
>>> >> get a roughly even share of files, so if you have more than about one
>
>>> >> million files waiting to be purged, you could see this condition.
>
>>> >>
>
>>> >> The "Client failing to respond to cache pressure" messages may play a
>
>>> >> part here -- if you have misbehaving clients then they may cause the
>
>>> >> MDS to delay purging stray files, leading to a backlog.  If your
>
>>> >> clients are by any chance older kernel clients, you should upgrade
>
>>> >> them.  You can also unmount/remount them to clear this state, although
>
>>> >> it will reoccur until the clients are updated (or until the bug is
>
>>> >> fixed, if you're running latest clients already).
>
>>> >>
>
>>> >> The high le

Re: [ceph-users] unable to start radosgw after upgrade from 10.2.2 to 10.2.3

2016-10-07 Thread Graham Allan
The fundamental problem seems to be the same in each case, related to a 
missing master_zone in the zonegroup. Like yours, our cluster has been 
running for several years with few config changes, though in our case, 
the 10.2.3 radosgw simply doesn't start at all, logging the following error:


2016-10-05 16:39:53.814677 7f3a1d085900  0 zonegroup default missing 
zone for master_zone=
2016-10-05 16:39:53.819964 7f3a1d085900 -1 Couldn't init storage 
provider (RADOS)


There seem to be several approaches to fixing it - I did find that link 
you refer to, and also the "fix-zone" script from Yehuda referred to in:


http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-April/009189.html

then later this looks like a simpler solution to the same issue:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-July/011157.html

I am just moving slowly, as there is ~300TB in the object store which we 
naturally don't want anything to happen to...
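
For the record, that simpler fix appears to boil down to something like the
following - a sketch only, assuming the single "default" zonegroup/zone layout
shown elsewhere in this thread; double-check names with 'radosgw-admin
zonegroup list' and keep a copy of the JSON before changing anything:

radosgw-admin zonegroup get --rgw-zonegroup=default > zonegroup.json
# edit zonegroup.json and set "master_zone": "default"
radosgw-admin zonegroup set < zonegroup.json
# restart radosgw afterwards; if a realm/period were actually configured,
# a 'radosgw-admin period update --commit' would also be needed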


There was a good question in that first thread which I never saw an
answer to,
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-April/009178.html
- namely whether the hammer and jewel rados gateways can co-exist for a short
time, or whether correcting the master_zone will bring down the
still-functional hammer gateway. Or indeed whether all the gateways
should be stopped before updating any of them.


Graham

On 10/06/2016 06:47 PM, Andrei Mikhailovsky wrote:

Hi Graham,

Yeah, I am not sure why no one else is having the same issues. Anyway, had a 
chat on irc and got a link that helped me: 
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg31764.html

I've followed what it said, even though the errors I got were different, and it
helped me to start the service. I have yet to test whether the rgw is functional
and user clients can connect.

Hope that helps

andrei

- Original Message -

From: "Graham Allan" 
To: "ceph-users" 
Sent: Thursday, 6 October, 2016 20:04:38
Subject: Re: [ceph-users] unable to start radosgw after upgrade from 10.2.2 to 
10.2.3



That's interesting, as I am getting the exact same errors after
upgrading from Hammer 0.94.9 to Jewel 10.2.3 (on ubuntu 14.04).

I wondered if it was the issue referred to a few months ago here, but
I'm not so sure, since the error returned from radosgw-admin commands is
different:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-April/009171.html

I do have one radosgw which is still on 0.94.9 and still functions
normally - is it possible that this is preventing the config migration
alluded to in that thread? I'm reluctant to do anything to the
still-working 0.94.9 gateway until I can get the 10.2.3 gateways working!

Graham

On 10/05/2016 04:23 PM, Andrei Mikhailovsky wrote:

Hello everyone,

I've just updated my ceph to version 10.2.3 from 10.2.2 and I am no
longer able to start the radosgw service. When executing I get the
following error:

2016-10-05 22:14:10.735883 7f1852d26a00  0 ceph version 10.2.3
(ecc23778eb545d8dd55e2e4735b53cc93f92e65b), process radosgw, pid 2711
2016-10-05 22:14:10.765648 7f1852d26a00  0 pidfile_write: ignore empty
--pid-file
2016-10-05 22:14:11.287772 7f1852d26a00  0 zonegroup default missing
zone for master_zone=
2016-10-05 22:14:11.294141 7f1852d26a00 -1 Couldn't init storage
provider (RADOS)



I had no issues starting radosgw on 10.2.2 and all versions prior to that.

I am running ceph 10.2.3 on Ubuntu 16.04 LTS servers.

Could someone please help me with fixing the problem?

Thanks

Andrei


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--
Graham Allan
Minnesota Supercomputing Institute - g...@umn.edu
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


--
Graham Allan
Minnesota Supercomputing Institute - g...@umn.edu
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rsync kernel client cepfs mkstemp no space left on device

2016-10-07 Thread Gregory Farnum
On Fri, Oct 7, 2016 at 7:15 AM, Hauke Homburg  wrote:
> Hello,
>
> I have a Ceph Cluster with 5 Servers and 40 OSDs. Currently this Cluster
> has 85GB of Free Space, and the rsync dir has lots of Pictures and a Data
> Volume of 40GB.
>
> The Linux is a CentOS 7 with the last stable Ceph. The Client is a Debian
> 8 with Kernel 4, and the Cluster is mounted with cephfs.
>
> When I sync the Directory I often see the Message rsync mkstemp no space
> left on device (28). At this Point I can touch a File in another Directory
> in the Cluster. In the Directory I have ~ 63 Files. Are these too many
> Files?

Yes, in recent releases CephFS limits you to 100k dentries in a single
directory fragment. This *includes* the "stray" directories that files
get moved into when you unlink them, and is intended to prevent issues
with very large folders. It will stop being a problem once we enable
automatic fragmenting (soon, hopefully).
You can change that by changing the "mds bal fragment size max"
config, but you're probably better off by figuring out if you've got
an over-large directory or if you're deleting files faster than the
cluster can keep up. There was a thread about this very recently and
John included some details about tuning if you check the archives. :)
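
For reference, raising that limit is just an MDS config change; roughly
something like the following (a sketch only - 200000 is an arbitrary example
value, and the caveat about oversized dirfrag objects still applies):

# in ceph.conf on the MDS hosts
[mds]
    mds bal fragment size max = 200000

# or injected into a running MDS without a restart
ceph tell mds.0 injectargs '--mds_bal_fragment_size_max 200000'
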
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] jewel/CephFS - misc problems (duplicate strays, mismatch between head items and fnode.fragst)

2016-10-07 Thread Kjetil Jørgensen
On Fri, Oct 7, 2016 at 4:46 AM, John Spray  wrote:

> On Fri, Oct 7, 2016 at 1:05 AM, Kjetil Jørgensen 
> wrote:
> > Hi,
> >
> > context (i.e. what we're doing): We're migrating (or trying to) migrate off
> > of an nfs server onto cephfs, for a workload that's best described as "big
> > piles" of hardlinks. Essentially, we have a set of "sources":
> > foo/01/
> > foo/0b/<0b>
> > .. and so on
> > bar/02/..
> > bar/0c/..
> > .. and so on
> >
> > foo/bar/friends have been "cloned" numerous times to a set of names that
> > over the course of weeks end up being recycled again, the clone is
> > essentially cp -L foo copy-1-of-foo.
> >
> > We're doing "incremental" rsyncs of this onto cephfs, so the sense of "the
> > original source of the hardlink" will end up moving around, depending on the
> > whims of rsync. (if it matters, I found some allusion to "if the original
> > file hardlinked is deleted, ...".)
>
> This might not be much help but... have you thought about making your
> application use hardlinks less aggressively?  They have an intrinsic
> overhead in any system that stores inodes locally to directories (like
> we do) because you have to take an extra step to resolve them.
>
>
Under "normal" circumstances, this isn't "all that bad", the serious
hammering is
coming from trying migrate to cephfs, where I think we've for the time being
abandoned using hardlinks and take the space-penalty for now. Under "normal"
circumstances it isn't that bad (if my nfs-server stats is to be believed,
it's between
5e5 - and 1.5e6 hardlinks created and unlinked per day, it actually seems a
bit low).


> In CephFS, resolving a hard link involves reading the dentry (where we
> would usually have the inode inline), and then going and finding an
> object from the data pool by the inode number, reading the "backtrace"
> (i.e. path) from that object and then going back to the metadata pool
> to traverse that path.  It's all very fast if your metadata fits in
> your MDS cache, but will slow down a lot otherwise, especially as your
> metadata IOs are now potentially getting held up by anything hammering
> your data pool.
>
> By the way, if your workload is relatively little code and you can
> share it, it sounds like it would be a useful hardlink stress test for
> our test suite.


I'll let you know if I manage to reproduce. I'm on-and-off-again trying to
tease this out on a separate ceph cluster with a "synthetic" load that's close
to equivalent.


> ...
>
> > For RBD the ceph cluster has mostly been rather well behaved; the problems
> > we have had have for the most part been self-inflicted. Before introducing
> > the hardlink spectacle to cephfs, the same filesystem was used for
> > light-ish read-mostly loads, being mostly un-eventful. (That being said, we
> > did patch it for
> >
> > Cluster is v10.2.2 (mds v10.2.2+4d15eb12298e007744486e28924a6f0ae071bd06),
> > clients are ubuntu's 4.4.0-32 kernel(s), and elrepo v4.4.4.
> >
> > The problems we're facing:
> >
> > Maybe a "non-problem" I have ~6M strays sitting around
>
> So as you hint above, when the original file is deleted, the inode
> goes into a stray dentry.  The next time someone reads the file via
> one of its other links, the inode gets "reintegrated" (via
> eval_remote_stray()) into the dentry it was read from.
>
> > Slightly more problematic, I have duplicate stray(s)? See log excerpts
> > below. Also; rados -p cephfs_metadata listomapkeys 60X. did/does
> > seem to agree with there being duplicate strays (assuming 60X. is
> > the directory indexes for the stray catalogs), caveat "not a perfect
> > snapshot", listomapkeys issued in serial fashion.
> > We stumbled across (http://tracker.ceph.com/issues/17177 - mostly here for
> > more context)
>
> When you say you stumbled across it, do you mean that you actually had
> this same deep scrub error on your system, or just that you found the
> ticket?


No - we have done "ceph pg repair", as we did end up with single degraded
objects in the metadata pool during heavy rsync of "lots of hardlinks".


> > There's been a couple of instances of invalid backtrace(s), mostly solved by
> > either mds:scrub_path or just unlinking the files/directories in question
> > and re-rsync-ing.
> >
> > mismatch between head items and fnode.fragstat (See below for more of the
> > log excerpt), appeared to have been solved by mds:scrub_path
> >
> >
> > Duplicate stray(s), ceph-mds complains (a lot, during rsync):
> > 2016-09-30 20:00:21.978314 7ffb653b8700  0 mds.0.cache.dir(603) _fetched
> > badness: got (but i already had) [inode 10003f25eaf [...2,head]
> > ~mds0/stray0/10003f25eaf auth v38836572 s=8998 nl=5 n(v0 b8998 1=1+0)
> > (iversion lock) 0x561082e6b520] mode 33188 mtime 2016-07-25
> 03:02:50.00
> > 2016-09-30 20:00:21.978336 7ffb653b8700 -1 log_channel(cluster) log
> [ERR] :
> > loaded dup inode 10003f25eaf [2,head] v36792929 at
> ~mds0/stray3/10003f25eaf,
> > but inode 10003f25eaf.head 

Re: [ceph-users] jewel/CephFS - misc problems (duplicate strays, mismatch between head items and fnode.fragst)

2016-10-07 Thread Kjetil Jørgensen
Hi

On Fri, Oct 7, 2016 at 6:31 AM, Yan, Zheng  wrote:

> On Fri, Oct 7, 2016 at 8:20 AM, Kjetil Jørgensen 
> wrote:
> > And - I just saw another recent thread -
> > http://tracker.ceph.com/issues/17177 - can be an explanation of
> most/all of
> > the above ?
> >
> > Next question(s) would then be:
> >
> > How would one deal with duplicate stray(s)
>
> Here is an untested method
>
> list omap keys in objects 600. ~ 609.. find all duplicated keys
>
> for each duplicated key, use ceph-dencoder to decode their values,
> find the one that has the biggest version and delete the rest
> (ceph-dencoder type inode_t skip 9 import /tmp/ decode dump_json)
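
For illustration, I read that as roughly the following (object and key names
here are made up for the example, and as you say this is untested):

rados -p cephfs_metadata listomapkeys 600.00000000 > stray0.keys   # repeat for 601..609
# for a key that shows up in more than one stray dirfrag, compare versions:
rados -p cephfs_metadata getomapval 600.00000000 10003f25eaf_head /tmp/a
rados -p cephfs_metadata getomapval 603.00000000 10003f25eaf_head /tmp/b
ceph-dencoder type inode_t skip 9 import /tmp/a decode dump_json
ceph-dencoder type inode_t skip 9 import /tmp/b decode dump_json
# keep the entry with the larger version and drop the other, e.g.:
rados -p cephfs_metadata rmomapkey 603.00000000 10003f25eaf_head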


If I do this - should I turn off any active ceph-mds while/when doing so ?

Cheers,
-- 
Kjetil Joergensen 
SRE, Medallia Inc
Phone: +1 (650) 739-6580
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] maintenance questions

2016-10-07 Thread Jeff Applewhite
Hi All

I have a few questions pertaining to management of MONs and OSDs. This is
in a Ceph 2.x context only.
---
1) Can MONs be placed in something resembling maintenance mode (for
firmware updates, patch reboots, etc.). If so how? If not how addressed?

2) Can OSDs be placed in something resembling maintenance mode (for
firmware updates, patch reboots, etc.). If so how? If not how addressed?

3) Can MONs be "replaced/migrated" efficiently in a hardware upgrade
scenario? If so how? If not how addressed?

4) Can OSDs be "replaced/migrated" efficiently in a hardware upgrade
scenario? If so how? If not how addressed?

---
I suspect the answer is somewhat nuanced and has to do with timeouts and
such. Please describe how these things are successfully handled in
production settings.

The goal here is to automate such things in a management tool so the
strategies should be well worn. If the answer is "no you can't and it's not
addressed in Ceph" is this a potential roadmap item?

If this was addressed in previous discussions, please forgive me and point me
to them - I'm new to the list.

Thanks in advance!

-- 

Jeff Applewhite
Principal Product Manager
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] unable to start radosgw after upgrade from 10.2.2 to 10.2.3

2016-10-07 Thread Graham Allan

Dear Orit,

On 10/07/2016 04:21 AM, Orit Wasserman wrote:

Hi,

On Wed, Oct 5, 2016 at 11:23 PM, Andrei Mikhailovsky  wrote:

Hello everyone,

I've just updated my ceph to version 10.2.3 from 10.2.2 and I am no longer
able to start the radosgw service. When executing I get the following error:

2016-10-05 22:14:10.735883 7f1852d26a00  0 ceph version 10.2.3
(ecc23778eb545d8dd55e2e4735b53cc93f92e65b), process radosgw, pid 2711
2016-10-05 22:14:10.765648 7f1852d26a00  0 pidfile_write: ignore empty
--pid-file
2016-10-05 22:14:11.287772 7f1852d26a00  0 zonegroup default missing zone
for master_zone=


This means you are missing a master zone , you can get here only if
you configured a realm.
Is that the case?

Can you provide:
radosgw-admin realm get
radosgw-admin zonegroupmap get
radosgw-admin zonegroup get
radosgw-admin zone get --rgw-zone=default

Orit


I have not yet modified anything since the jewel upgrade - do you mind 
if I post the output for these from our cluster for your opinion? There 
is apparently no realm configured (which is what I expect for this 
cluster), but it sounds like you think this situation shouldn't arise in 
that case.


root@cephgw04:~# radosgw-admin realm get
missing realm name or id, or default realm not found
root@cephgw04:~# radosgw-admin realm list
{
    "default_info": "",
    "realms": []
}

root@cephgw04:~# radosgw-admin zonegroupmap get
failed to read current period info: (2) No such file or directory{
    "zonegroups": [],
    "master_zonegroup": "",
    "bucket_quota": {
        "enabled": false,
        "max_size_kb": -1,
        "max_objects": -1
    },
    "user_quota": {
        "enabled": false,
        "max_size_kb": -1,
        "max_objects": -1
    }
}
2016-10-07 14:33:15.24 7fecf5cf4900  0 RGWPeriod::init failed to 
init realm  id  : (2) No such file or directory

root@cephgw04:~# radosgw-admin zonegroup get
failed to init zonegroup: (2) No such file or directory
root@cephgw04:~# radosgw-admin zonegroup get --rgw-zonegroup=default
{
    "id": "default",
    "name": "default",
    "api_name": "",
    "is_master": "true",
    "endpoints": [],
    "hostnames": [],
    "hostnames_s3website": [],
    "master_zone": "",
    "zones": [
        {
            "id": "default",
            "name": "default",
            "endpoints": [],
            "log_meta": "false",
            "log_data": "true",
            "bucket_index_max_shards": 32,
            "read_only": "false"
        }
    ],
    "placement_targets": [
        {
            "name": "default-placement",
            "tags": []
        },
        {
            "name": "ec42-placement",
            "tags": []
        }
    ],
    "default_placement": "ec42-placement",
    "realm_id": ""
}

root@cephgw04:~# radosgw-admin zone get --rgw-zone=default
{
    "id": "default",
    "name": "default",
    "domain_root": ".rgw",
    "control_pool": ".rgw.control",
    "gc_pool": ".rgw.gc",
    "log_pool": ".log",
    "intent_log_pool": ".intent-log",
    "usage_log_pool": ".usage",
    "user_keys_pool": ".users",
    "user_email_pool": ".users.email",
    "user_swift_pool": ".users.swift",
    "user_uid_pool": ".users.uid",
    "system_key": {
        "access_key": "",
        "secret_key": ""
    },
    "placement_pools": [],
    "metadata_heap": ".rgw.meta",
    "realm_id": ""
}


--
Graham Allan
Minnesota Supercomputing Institute - g...@umn.edu
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS: No space left on device

2016-10-07 Thread Mykola Dvornik
10.2.2

-Mykola

On 7 October 2016 at 15:43, Yan, Zheng  wrote:

> On Thu, Oct 6, 2016 at 4:11 PM,   wrote:
> > Is there any way to repair pgs/cephfs gracefully?
> >
>
> So far no.  We need to write a tool to repair this type of corruption.
>
> Which version of ceph did you use before upgrading to 10.2.3 ?
>
> Regards
> Yan, Zheng
>
> >
> >
> > -Mykola
> >
> >
> >
> > From: Yan, Zheng
> > Sent: Thursday, 6 October 2016 04:48
> > To: Mykola Dvornik
> > Cc: John Spray; ceph-users
> > Subject: Re: [ceph-users] CephFS: No space left on device
> >
> >
> >
> > On Wed, Oct 5, 2016 at 2:27 PM, Mykola Dvornik
> > wrote:
> >
> >> Hi Zheng,
> >
> >>
> >
> >> Many thanks for you reply.
> >
> >>
> >
> >> This indicates the MDS metadata is corrupted. Did you do any unusual
> >
> >> operation on the cephfs? (e.g reset journal, create new fs using
> >
> >> existing metadata pool)
> >
> >>
> >
> >> No, nothing has been explicitly done to the MDS. I had a few inconsistent
> >
> >> PGs that belonged to the (3 replica) metadata pool. The symptoms were
> >
> >> similar to http://tracker.ceph.com/issues/17177 . The PGs were eventually
> >
> >> repaired and no data corruption was expected as explained in the ticket.
> >
> >>
> >
> >
> >
> > I'm afraid that issue does cause corruption.
> >
> >
> >
> >> BTW, when I posted this issue on the ML the amount of ground state stray
> >
> >> objects was around 7.5K. Now it went up to 23K. No inconsistent PGs or any
> >
> >> other problems happened to the cluster within this time scale.
> >
> >>
> >
> >> -Mykola
> >
> >>
> >
> >> On 5 October 2016 at 05:49, Yan, Zheng  wrote:
> >
> >>>
> >
> >>> On Mon, Oct 3, 2016 at 5:48 AM, Mykola Dvornik <mykola.dvor...@gmail.com>
> >
> >>> wrote:
> >
> >>> > Hi Johan,
> >
> >>> >
> >
> >>> > Many thanks for your reply. I will try to play with the mds tunables
> >>> > and
> >
> >>> > report back to your ASAP.
> >
> >>> >
> >
> >>> > So far I see that mds log contains a lot of errors of the following
> >
> >>> > kind:
> >
> >>> >
> >
> >>> > 2016-10-02 11:58:03.002769 7f8372d54700  0 mds.0.cache.dir(100056ddecd)
> >>> > _fetched  badness: got (but i already had) [inode 10005729a77 [2,head]
> >>> > ~mds0/stray1/10005729a77 auth v67464942 s=196728 nl=0 n(v0 b196728 1=1+0)
> >>> > (iversion lock) 0x7f84acae82a0] mode 33204 mtime 2016-08-07 23:06:29.776298
> >
> >>> >
> >
> >>> > 2016-10-02 11:58:03.002789 7f8372d54700 -1 log_channel(cluster) log [ERR] :
> >>> > loaded dup inode 10005729a77 [2,head] v68621 at
> >>> > /users/mykola/mms/NCSHNO/final/120nm-uniform-h8200/j002654.out/m_xrange192-320_yrange192-320_016232.dump,
> >>> > but inode 10005729a77.head v67464942 already exists at
> >>> > ~mds0/stray1/10005729a77
> >
> >>>
> >
> >>> This indicates the MDS metadata is corrupted. Did you do any unusual
> >
> >>> operation on the cephfs? (e.g reset journal, create new fs using
> >
> >>> existing metadata pool)
> >
> >>>
> >
> >>> >
> >
> >>> > Those folders within mds.0.cache.dir that got badness report a size of
> >
> >>> > 16EB
> >
> >>> > on the clients. rm on them fails with 'Directory not empty'.
> >
> >>> >
> >
> >>> > As for the "Client failing to respond to cache pressure", I have 2
> >
> >>> > kernel
> >
> >>> > clients on 4.4.21, 1 on 4.7.5 and 16 fuse clients always running the
> >
> >>> > most
> >
> >>> > recent release version of ceph-fuse. The funny thing is that every
> >
> >>> > single
> >
> >>> > client misbehaves from time to time. I am aware of quite discussion
> >
> >>> > about
> >
> >>> > this issue on the ML, but cannot really follow how to debug it.
> >
> >>> >
> >
> >>> > Regards,
> >
> >>> >
> >
> >>> > -Mykola
> >
> >>> >
> >
> >>> > On 2 October 2016 at 22:27, John Spray  wrote:
> >
> >>> >>
> >
> >>> >> On Sun, Oct 2, 2016 at 11:09 AM, Mykola Dvornik
> >
> >>> >>  wrote:
> >
> >>> >> > After upgrading to 10.2.3 we frequently see messages like
> >
> >>> >>
> >
> >>> >> From which version did you upgrade?
> >
> >>> >>
> >
> >>> >> > 'rm: cannot remove '...': No space left on device
> >
> >>> >> >
> >
> >>> >> > The folders we are trying to delete contain approx. 50K files 193 KB
> >
> >>> >> > each.
> >
> >>> >>
> >
> >>> >> My guess would be that you are hitting the new
> >
> >>> >> mds_bal_fragment_size_max check.  This limits the number of entries
> >
> >>> >> that the MDS will create in a single directory fragment, to avoid
> >
> >>> >> overwhelming the OSD with oversized objects.  It is 100000 by default.
> >
> >>> >> This limit also applies to "stray" directories where unlinked files
> >
> >>> >> are put while they wait to be purged, so you could get into this state
> >
> >>> >> while doing lots of deletions.  There are ten stray directories that
> >
> >>> >> get a roughly even share of files, so if you have more than about one
> >
> >>> >> million files waiting to be purged, you could see this condition.
> >
> >>> >

Re: [ceph-users] maintenance questions

2016-10-07 Thread Gregory Farnum
On Fri, Oct 7, 2016 at 1:21 PM, Jeff Applewhite  wrote:
> Hi All
>
> I have a few questions pertaining to management of MONs and OSDs. This is in
> a Ceph 2.x context only.

You mean Jewel? ;)

> ---
> 1) Can MONs be placed in something resembling maintenance mode (for firmware
> updates, patch reboots, etc.). If so how? If not how addressed?
>
> 2) Can OSDs be placed in something resembling maintenance mode (for firmware
> updates, patch reboots, etc.). If so how? If not how addressed?

In both of these cases, you just turn it off. Preferably politely (ie,
software shutdown) so that the node can report to the cluster it won't
be available. But it's the same as any other failure case from Ceph's
perspective: the node is unavailable for service.

See http://docs.ceph.com/docs/master/install/upgrading-ceph, which is
a little old now but illustrates the basic ideas.
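
In practice a short maintenance window on an OSD host usually looks something
like the following sketch (osd.12 is just an example id; the upstart syntax
matches the OSD-restart thread later in this digest, systemd would use
ceph-osd@12):

ceph osd set noout            # keep down OSDs from being marked out and rebalanced
sudo stop ceph-osd id=12      # stop the daemon(s) on the host being serviced
# ...reboot, flash firmware, apply patches...
sudo start ceph-osd id=12
ceph osd unset noout          # back to normal once everything is up again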

>
> 3) Can MONs be "replaced/migrated" efficiently in a hardware upgrade
> scenario? If so how? If not how addressed?

Monitors can be moved if they can keep the same IP; otherwise you need
to go through some shenanigans:
http://docs.ceph.com/docs/master/rados/operations/add-or-rm-mons/#changing-a-monitor-s-ip-address

Or you can add the new location and remove the old location (hopefully
in that order, to maintain your durability, but you could do it the
other way around if really necessary):
http://docs.ceph.com/docs/master/rados/operations/add-or-rm-mons/
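
The add-then-remove dance for a monitor is roughly the following (a sketch
only, with a made-up id "d" for the new host and "a" for the one being
retired; the add-or-rm-mons page above is the authoritative procedure):

# on the new monitor host, with ceph.conf and the admin keyring in place
ceph mon getmap -o /tmp/monmap
ceph auth get mon. -o /tmp/mon.keyring
ceph-mon --mkfs -i d --monmap /tmp/monmap --keyring /tmp/mon.keyring
sudo start ceph-mon id=d          # systemd: systemctl start ceph-mon@d
# once 'ceph quorum_status' shows mon.d in quorum, retire the old one:
ceph mon remove a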

>
> 4) Can OSDs be "replaced/migrated" efficiently in a hardware upgrade
> scenario? If so how? If not how addressed?

You can move an OSD around as long as you either flush its journal or
the journal device is colocated or moved with it. But by default it
will then update to a new CRUSH location and all the data will
reshuffle anyway.

You can also mark an OSD out while keeping it up and the cluster will
then backfill all its data to the correct new locations without ever
reducing redundancy.
(http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/)
This gets into the typical concerns about changing CRUSH weights and
migrating data unnecessarily if you aren't removing the whole
host/rack/whatever, but it sounds like you are only interested in
wholesale replacement.
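
Concretely, the mark-out-then-remove path is something like this (osd.12 is an
arbitrary example; wait for all PGs to go active+clean again between the out
and the removal steps):

ceph osd out 12               # data drains off osd.12 while it stays up
# watch 'ceph -s' until all PGs are active+clean again, then:
sudo stop ceph-osd id=12
ceph osd crush remove osd.12
ceph auth del osd.12
ceph osd rm 12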

It's also possible to clone the drive or whatever and just stick it in
place, in which case Ceph doesn't really notice (maybe modulo some of
the install or admin stuff on the local node, but the cluster doesn't
care).

I've included some links directly relevant throughout but there's
plenty of other info in the docs and you probably want to spend some
time reading them carefully if you're planning to build a management
tool. :)
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph + VMWare

2016-10-07 Thread Jake Young
Hey Patrick,

I work for Cisco.

We have a 200TB cluster (108 OSDs on 12 OSD Nodes) and use the cluster for
both OpenStack and VMware deployments.

We are using iSCSI now, but it really would be much better if VMware did
support RBD natively.

We present a 1-2TB Volume that is shared between 4-8 ESXi hosts.

I have been looking for an optimal solution for a few years now, and I have
finally found something that works pretty well:

We are installing FreeNAS on a KVM hypervisor and passing through rbd
volumes as disks on a SCSI bus. We are able to add volumes dynamically (no
need to reboot FreeNAS to recognize new drives).  In FreeNAS, we are
passing the disks through directly as iscsi targets, we are not putting the
disks into a ZFS volume.
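
For anyone curious, attaching one of those rbd volumes to the FreeNAS guest on
the fly looks roughly like this (a sketch with made-up pool/image/domain names,
cephx auth omitted, and it assumes the guest already has a virtio-scsi
controller):

rbd create rbd/freenas-data-01 --size 1048576     # 1 TB image; --size is in MB here
cat > /tmp/freenas-data-01.xml <<'EOF'
<disk type='network' device='disk'>
  <driver name='qemu' type='raw' cache='writeback'/>
  <source protocol='rbd' name='rbd/freenas-data-01'>
    <host name='10.0.0.1' port='6789'/>
  </source>
  <target dev='sdb' bus='scsi'/>
</disk>
EOF
virsh attach-device freenas0 /tmp/freenas-data-01.xml --live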

The biggest benefit to this is that VMware really likes the FreeBSD target
and all VAAI stuff works reliably. We also get the benefit of the stability
of rbd in QEMU client.

My next step is to create a redundant KVM host with a redundant FreeNAS VM
and see how iscsi multipath works with the ESXi hosts.

We have tried many different things and have run into all the same issues
as others have posted on this list. The general theme seems to be that most
(all?) Linux iSCSI Target software and Linux NFS solutions are not very
good. The BSD OS's (FreeBSD, Solaris derivatives, etc.) do these things a
lot better, but typically lack Ceph support as well as having poor HW
compatibility (compared to Linux).

Our goal has always been to replace FC SAN with something comparable in
performance, reliability and redundancy.

Again, the best thing in the world would be for ESXi to mount rbd volumes
natively using librbd. I'm not sure if VMware is interested in this though.

Jake


On Wednesday, October 5, 2016, Patrick McGarry  wrote:

> Hey guys,
>
> Starting to buckle down a bit in looking at how we can better set up
> Ceph for VMWare integration, but I need a little info/help from you
> folks.
>
> If you currently are using Ceph+VMWare, or are exploring the option,
> I'd like some simple info from you:
>
> 1) Company
> 2) Current deployment size
> 3) Expected deployment growth
> 4) Integration method (or desired method) ex: iscsi, native, etc
>
> Just casting the net so we know who is interested and might want to
> help us shape and/or test things in the future if we can make it
> better. Thanks.
>
>
> --
>
> Best Regards,
>
> Patrick McGarry
> Director Ceph Community || Red Hat
> http://ceph.com  ||  http://community.redhat.com
> @scuttlemonkey || @ceph
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSD won't come back "UP"

2016-10-07 Thread Reed Dier
Attempting to adjust parameters of some of my recovery options, I restarted a 
single osd in the cluster with the following syntax:

> sudo restart ceph-osd id=0


The osd restarts without issue, status shows running with the PID.

> sudo status ceph-osd id=0
> ceph-osd (ceph/0) start/running, process 2685


The osd marked itself down cleanly.

> 2016-10-07 19:36:20.872883 mon.0 10.0.1.249:6789/0 1475867 : cluster [INF] 
> osd.0 marked itself down

> 2016-10-07 19:36:21.590874 mon.0 10.0.1.249:6789/0 1475869 : cluster [INF] 
> osdmap e4361: 16 osds: 15 up, 16 in

The mon’s show this from one of many subsequent attempts to restart the osd.

> 2016-10-07 19:58:16.222949 mon.1 [INF] from='client.? 10.0.1.25:0/324114592' 
> entity='osd.0' cmd=[{"prefix": "osd crush create-or-move", "args": 
> ["host=node24", "root=default"], "id": 0, "weight": 7.2701}]: dispatch
> 2016-10-07 19:58:16.223626 mon.0 [INF] from='client.6557620 :/0' 
> entity='osd.0' cmd=[{"prefix": "osd crush create-or-move", "args": 
> ["host=node24", "root=default"], "id": 0, "weight": 7.2701}]: dispatch

mon logs show this when grepping for the osd.0 in the mon log

> 2016-10-07 19:36:20.872882 7fd39aced700  0 log_channel(cluster) log [INF] : 
> osd.0 marked itself down
> 2016-10-07 19:36:27.698708 7fd39aced700  0 log_channel(audit) log [INF] : 
> from='client.6554095 :/0' entity='osd.0' cmd=[{"prefix": "osd crush 
> create-or-move", "args": ["host=node24", "root=default"], "id": 0, "weight": 
> 7.2701}]: dispatch
> 2016-10-07 19:36:27.706374 7fd39aced700  0 mon.core@0(leader).osd e4363 
> create-or-move crush item name 'osd.0' initial_weight 7.2701 at location 
> {host=node24,root=default}
> 2016-10-07 19:39:30.515494 7fd39aced700  0 log_channel(audit) log [INF] : 
> from='client.6554587 :/0' entity='osd.0' cmd=[{"prefix": "osd crush 
> create-or-move", "args": ["host=node24", "root=default"], "id": 0, "weight": 
> 7.2701}]: dispatch
> 2016-10-07 19:39:30.515618 7fd39aced700  0 mon.core@0(leader).osd e4363 
> create-or-move crush item name 'osd.0' initial_weight 7.2701 at location 
> {host=node24,root=default}
> 2016-10-07 19:41:59.714517 7fd39b4ee700  0 log_channel(cluster) log [INF] : 
> osd.0 out (down for 338.148761)


Everything running latest Jewel release

> ceph --version
> ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)

Any help with this is extremely appreciated. Hoping someone has dealt with this 
before.

Reed Dier
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD won't come back "UP"

2016-10-07 Thread Reed Dier
Resolved.

Apparently it took the OSD almost 2.5 hours to fully boot.

Had not seen this behavior before, but it eventually booted itself back into 
the crush map.

Bookend log stamps below.

> 2016-10-07 21:33:39.241720 7f3d59a97800  0 set uid:gid to 64045:64045 
> (ceph:ceph)

> 2016-10-07 23:53:29.617038 7f3d59a97800  0 osd.0 4360 done with init, 
> starting boot process

I had noticed that there was a consistent read operation on the “down/out” osd 
tied to that osd’s PID, which led me to believe it was doing something with its 
time.
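
For what it's worth, the admin socket makes that state visible directly;
something like the following (assuming the default socket path, and that the
daemon answers on its socket while it is still loading PGs):

ceph daemon osd.0 status
# or, equivalently:
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok status
# the "state" field should read "booting" until the OSD is ready to mark itself up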

Also for reference, this was a 26% full 8TB disk.
> Filesystem      1K-blocks        Used   Available Use% Mounted on
> /dev/sda1      7806165996  1953556296  5852609700  26% /var/lib/ceph/osd/ceph-0

Reed


> On Oct 7, 2016, at 7:33 PM, Reed Dier  wrote:
> 
> Attempting to adjust parameters of some of my recovery options, I restarted a 
> single osd in the cluster with the following syntax:
> 
>> sudo restart ceph-osd id=0
> 
> 
> The osd restarts without issue, status shows running with the PID.
> 
>> sudo status ceph-osd id=0
>> ceph-osd (ceph/0) start/running, process 2685
> 
> 
> The osd marked itself down cleanly.
> 
>> 2016-10-07 19:36:20.872883 mon.0 10.0.1.249:6789/0 1475867 : cluster [INF] 
>> osd.0 marked itself down
> 
>> 2016-10-07 19:36:21.590874 mon.0 10.0.1.249:6789/0 1475869 : cluster [INF] 
>> osdmap e4361: 16 osds: 15 up, 16 in
> 
> The mon’s show this from one of many subsequent attempts to restart the osd.
> 
>> 2016-10-07 19:58:16.222949 mon.1 [INF] from='client.? 10.0.1.25:0/324114592' 
>> entity='osd.0' cmd=[{"prefix": "osd crush create-or-move", "args": 
>> ["host=node24", "root=default"], "id": 0, "weight": 7.2701}]: dispatch
>> 2016-10-07 19:58:16.223626 mon.0 [INF] from='client.6557620 :/0' 
>> entity='osd.0' cmd=[{"prefix": "osd crush create-or-move", "args": 
>> ["host=node24", "root=default"], "id": 0, "weight": 7.2701}]: dispatch
> 
> mon logs show this when grepping for the osd.0 in the mon log
> 
>> 2016-10-07 19:36:20.872882 7fd39aced700  0 log_channel(cluster) log [INF] : 
>> osd.0 marked itself down
>> 2016-10-07 19:36:27.698708 7fd39aced700  0 log_channel(audit) log [INF] : 
>> from='client.6554095 :/0' entity='osd.0' cmd=[{"prefix": "osd crush 
>> create-or-move", "args": ["host=node24", "root=default"], "id": 0, "weight": 
>> 7.2701}]: dispatch
>> 2016-10-07 19:36:27.706374 7fd39aced700  0 mon.core@0(leader).osd e4363 
>> create-or-move crush item name 'osd.0' initial_weight 7.2701 at location 
>> {host=node24,root=default}
>> 2016-10-07 19:39:30.515494 7fd39aced700  0 log_channel(audit) log [INF] : 
>> from='client.6554587 :/0' entity='osd.0' cmd=[{"prefix": "osd crush 
>> create-or-move", "args": ["host=node24", "root=default"], "id": 0, "weight": 
>> 7.2701}]: dispatch
>> 2016-10-07 19:39:30.515618 7fd39aced700  0 mon.core@0(leader).osd e4363 
>> create-or-move crush item name 'osd.0' initial_weight 7.2701 at location 
>> {host=node24,root=default}
>> 2016-10-07 19:41:59.714517 7fd39b4ee700  0 log_channel(cluster) log [INF] : 
>> osd.0 out (down for 338.148761)
> 
> 
> Everything running latest Jewel release
> 
>> ceph --version
>> ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
> 
> Any help with this is extremely appreciated. Hoping someone has dealt with 
> this before.
> 
> Reed Dier

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com