Re: [ceph-users] pros/cons of multiple OSD's per host

2017-08-22 Thread Nick Tan
Hi Christian,



> > Hi David,
> >
> > The planned usage for this CephFS cluster is scratch space for an image
> > processing cluster with 100+ processing nodes.
>
> Lots of clients, how much data movement would you expect, how many images
> come in per timeframe, lets say an hour?
> Typical size of a image?
>
> Does an image come in and then gets processed by one processing node?
> Unlikely to be touched again, at least in the short term?
> Probably being deleted after being processed?
>

We'd typically get up to 6TB of raw imagery per day at an average image
size of 20MB.  There's a complex multi-stage processing chain that happens
- typically images are read by multiple nodes, with intermediate data
generated and processed again by multiple nodes.  This would generate about
30TB of intermediate data.  The end result would be around 9TB of final
processed data.  Once the processing is complete and the final data has
been copied off and passed QA, the entire data set is deleted.  The data
sets could remain on the file system for up to 2 weeks before deletion.



> >  My thinking is we'd be
> > better off with a large number (100+) of storage hosts with 1-2 OSD's
> each,
> > rather than 10 or so storage nodes with 10+ OSD's to get better
> parallelism
> > but I don't have any practical experience with CephFS to really judge.
> CephFS is one thing (of which I have very limited experience), but at this
> point you're talking about parallelism in Ceph (RBD).
> And that happens much more on an OSD than host level.
>
> Which you _can_ achieve with larger nodes, if they're well designed.
> Meaning CPU/RAM/interal storage bandwidth/network bandwidth being in
> "harmony".
>

I'm not sure what you mean about the RBD reference.  Does CephFS use RBD
internally?


>
> Also you keep talking about really huge HDDs, you could do worse than
> halving their size and doubling their numbers to achieve much more
> bandwidth and the ever crucial IOPS (even in your use case).
>
> So something like 20x 12 HDD servers, with SSDs/NVMes for journal/bluestore
> wAL/DB if you can afford or actually need it.
>
> CephFS metadata on a SSD pool isn't the most dramatic improvement one can
> do (or so people tell me), but given your budget it may be worthwhile.
>
>
Yes, I totally get the benefits of using greater numbers of smaller HDD's.
One of the requirements is to keep $/TB low, and large capacity drives help
with that.  I guess we need to look at the tradeoff of $/TB vs number of
spindles for performance.

If CephFS's parallelism happens more at the OSD level than the host level
then perhaps the 12 disk storage host would be fine as long as
"mon_osd_down_out_subtree_limit = host" and there's enough CPU/RAM/BUS and
Network bandwidth on the host.  I'm doing some cost comparisons of these
"big" servers vs multiple "small" servers such as the supermicro microcloud
chassis or the Ambedded Mars 200 ARM cluster (which looks very
interesting).  However, cost is not the sole consideration, so I'm hoping
to get an idea of performance differences between the two architectures to
help with the decision making process given the lack of test equipment
available.



>
> > And
> > I don't have enough hardware to setup a test cluster of any significant
> > size to run some actual testing.
> >
> You may want to set up something to get a feeling for CephFS, if it's
> right for you or if something else on top of RBD may be more suitable.
>
>
I've set up a 3 node cluster, 2 OSD servers and 1 mon/mds, to get a feel for
Ceph and CephFS.  It looks pretty straightforward and performs well enough
given the lack of nodes.


Thanks,
Nick


> Christian
> --
> Christian Balzer            Network/Systems Engineer
> ch...@gol.com   Rakuten Communications
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Cache tier unevictable objects

2017-08-22 Thread Eugen Block

Hi list,

we have a production Hammer cluster for our OpenStack cloud and
recently a colleague added a cache tier consisting of 2 SSDs with
a pool size of 2; we're still experimenting with this topic.


Now we have some hardware maintenance to do and need to shut down
nodes, one at a time of course. So we tried to flush/evict the cache
pool and disable it to prevent data loss; we also set the cache-mode
to "forward". Most of the objects have been evicted successfully, but
there are still 39 objects left, and it's impossible to evict them.
I'm not sure how to verify that we can just delete the cache pool
without data loss; we want to set up the cache pool from scratch.
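
For reference, the sequence we used was roughly the following (exact
syntax may differ slightly between releases):

# ceph osd tier cache-mode images-cache forward
# rados -p images-cache cache-flush-evict-all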


# rados -p images-cache ls
rbd_header.210f542ae8944a
volume-ce17068e-a36d-4d9b-9779-3af473aba033.rbd
rbd_header.50ec372eb141f2
931f9a1e-2022-4571-909e-6c3f5f8c3ae8_disk.rbd
rbd_header.59dd32ae8944a
...

There are only 3 types of objects in the cache-pool:
  - rbd_header
  - volume-XXX.rbd (obviously cinder related)
  - XXX_disk (nova disks)

All rbd_header objects have a size of 0 if I run a "stat" command on
them; the rest have a size of 112. If I compare the objects with the
respective objects in the cold storage, they are identical:


Object rbd_header.1128db1b5d2111:
images-cache/rbd_header.1128db1b5d2111 mtime 2017-08-21  
15:55:26.00, size 0
  images/rbd_header.1128db1b5d2111 mtime 2017-08-21  
15:55:26.00, size 0


Object volume-fd07dd66-8a82-431c-99cf-9bfc3076af30.rbd:
images-cache/volume-fd07dd66-8a82-431c-99cf-9bfc3076af30.rbd mtime  
2017-08-21 15:55:26.00, size 112
  images/volume-fd07dd66-8a82-431c-99cf-9bfc3076af30.rbd mtime  
2017-08-21 15:55:26.00, size 112


Object 2dcb9d7d-3a4f-49a4-8792-b4b74f5b60e5_disk.rbd:
images-cache/2dcb9d7d-3a4f-49a4-8792-b4b74f5b60e5_disk.rbd mtime  
2017-08-21 15:55:25.00, size 112
  images/2dcb9d7d-3a4f-49a4-8792-b4b74f5b60e5_disk.rbd mtime  
2017-08-21 15:55:25.00, size 112


Some of them have an rbd_lock, some of them have a watcher, some don't  
have any of that but they still can't be evicted:


# rados -p images-cache lock list rbd_header.2207c92ae8944a
{"objname":"rbd_header.2207c92ae8944a","locks":[]}
# rados -p images-cache listwatchers rbd_header.2207c92ae8944a
#
# rados -p images-cache cache-evict rbd_header.2207c92ae8944a
error from cache-evict rbd_header.2207c92ae8944a: (16) Device or resource busy

Then I also tried to shut down an instance that uses some of the
volumes listed in the cache pool, but the objects didn't change at
all; the total number was also still 39. For the rbd_header objects I
don't even know how to identify their "owner" - is there a way?


Does anyone have a hint as to what else I could check, or is it reasonable
to assume that the objects are really the same and there would be no data
loss in case we deleted that pool?

We appreciate any help!

Regards,
Eugen

--
Eugen Block voice   : +49-40-559 51 75
NDE Netzdesign und -entwicklung AG  fax : +49-40-559 51 77
Postfach 61 03 15
D-22423 Hamburg e-mail  : ebl...@nde.ag

Vorsitzende des Aufsichtsrates: Angelika Mozdzen
  Sitz und Registergericht: Hamburg, HRB 90934
  Vorstand: Jens-U. Mozdzen
   USt-IdNr. DE 814 013 983

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] WBThrottle

2017-08-22 Thread Ranjan Ghosh

Hi Ceph gurus,

I've got the following problem with our Ceph installation (Jewel): There 
are various websites served from the CephFS mount. Sometimes, when I 
copy many new (large?) files onto this mount, it seems that after a 
certain delay, everything grinds to a halt. No websites are served; 
processes are in D state; probably until Ceph has written everything to 
disk. Then after a while, everything recovers. Obviously, it would be
great if I could tune some values to make the experience more "even" 
i.e. it can be a bit slower in general but OTOH without such huge 
"spikes" in performance...


Now, first, I discovered there is "filestore flusher" documented here:

http://docs.ceph.com/docs/jewel/rados/configuration/filestore-config-ref/?highlight=flusher

Weirdly, when I use

ceph --admin-daemon /bla/bla config show

then I cannot see anything about this config option. Does it still exist?

Then I found this somewhat cryptic page:

http://docs.ceph.com/docs/jewel/dev/osd_internals/wbthrottle/

It says: "The flusher was not an adequate solution to this problem since 
it forced writeback of small writes too eagerly killing performance."


Perhaps the "filestore flusher" was removed? But why is it still documented?

On the other hand, "config show" lists many "wbthrottle"-Options:

"filestore_wbthrottle_enable": "true",
"filestore_wbthrottle_xfs_bytes_hard_limit": "419430400",
"filestore_wbthrottle_xfs_bytes_start_flusher": "41943040",
"filestore_wbthrottle_xfs_inodes_hard_limit": "5000",
"filestore_wbthrottle_xfs_inodes_start_flusher": "500",
"filestore_wbthrottle_xfs_ios_hard_limit": "5000",
"filestore_wbthrottle_xfs_ios_start_flusher": "500",

I couldn't find them documented under docs.ceph.com; however, they are
documented here:


https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/2/html/configuration_guide/file_store_configuration_reference

Quite confusing! Now, I wonder: Could/should I modify (raise/lower) some 
of these values (we're using XFS)? Should I perhaps disable the 
WBThrottle altogether for my use case?
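
In case it matters, the way I would inspect and experiment with them at
runtime is roughly this (socket path and numbers are just examples, not
recommendations):

ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep wbthrottle
ceph tell osd.* injectargs '--filestore_wbthrottle_xfs_bytes_start_flusher 83886080'
ceph tell osd.* injectargs '--filestore_wbthrottle_xfs_ios_start_flusher 1000'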


Thank you,

Ranjan








___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cache tier unevictable objects

2017-08-22 Thread Christian Balzer
On Tue, 22 Aug 2017 09:54:34 + Eugen Block wrote:

> Hi list,
> 
> we have a productive Hammer cluster for our OpenStack cloud and  
> recently a colleague added a cache tier consisting of 2 SSDs and also  
> a pool size of 2, we're still experimenting with this topic.
> 
Risky, but I guess you know that.

> Now we have some hardware maintenance to do and need to shutdown  
> nodes, one at a time of course. So we tried to flush/evict the cache  
> pool and disable it to prevent data loss, we also set the cache-mode  
> to "forward". Most of the objects have been evicted successfully, but  
> there are still 39 objects left, and it's impossible to evict them.  
> I'm not sure how to make sure if we can just delete the cache pool  
> without data loss, we want to set up the cache-pool from scratch.
> 
Do I take it from this that your cache tier is only on one node?
If so upgrade the "Risky" up there to "Channeling Murphy".

If not, and your min_size is 1 as it should be for a size 2 pool, nothing
bad should happen.
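
A quick way to check (pool name taken from your listing):

# ceph osd pool get images-cache min_size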

Penultimately, google is EVIL but helps you find answers:
http://tracker.ceph.com/issues/12659

Christian

> # rados -p images-cache ls
> rbd_header.210f542ae8944a
> volume-ce17068e-a36d-4d9b-9779-3af473aba033.rbd
> rbd_header.50ec372eb141f2
> 931f9a1e-2022-4571-909e-6c3f5f8c3ae8_disk.rbd
> rbd_header.59dd32ae8944a
> ...
> 
> There are only 3 types of objects in the cache-pool:
>- rbd_header
>- volume-XXX.rbd (obviously cinder related)
>- XXX_disk (nova disks)
> 
> All rbd_header objects have a size of 0 if I run a "stat" command on  
> them, the rest has a size of 112. If I compare the objects with the  
> respective object in the cold-storage, they are identical:
> 
> Object rbd_header.1128db1b5d2111:
> images-cache/rbd_header.1128db1b5d2111 mtime 2017-08-21  
> 15:55:26.00, size 0
>    images/rbd_header.1128db1b5d2111 mtime 2017-08-21  
> 15:55:26.00, size 0
> 
> Object volume-fd07dd66-8a82-431c-99cf-9bfc3076af30.rbd:
> images-cache/volume-fd07dd66-8a82-431c-99cf-9bfc3076af30.rbd mtime  
> 2017-08-21 15:55:26.00, size 112
>    images/volume-fd07dd66-8a82-431c-99cf-9bfc3076af30.rbd mtime  
> 2017-08-21 15:55:26.00, size 112
> 
> Object 2dcb9d7d-3a4f-49a4-8792-b4b74f5b60e5_disk.rbd:
> images-cache/2dcb9d7d-3a4f-49a4-8792-b4b74f5b60e5_disk.rbd mtime  
> 2017-08-21 15:55:25.00, size 112
>    images/2dcb9d7d-3a4f-49a4-8792-b4b74f5b60e5_disk.rbd mtime  
> 2017-08-21 15:55:25.00, size 112
> 
> Some of them have an rbd_lock, some of them have a watcher, some don't  
> have any of that but they still can't be evicted:
> 
> # rados -p images-cache lock list rbd_header.2207c92ae8944a
> {"objname":"rbd_header.2207c92ae8944a","locks":[]}
> # rados -p images-cache listwatchers rbd_header.2207c92ae8944a
> #
> # rados -p images-cache cache-evict rbd_header.2207c92ae8944a
> error from cache-evict rbd_header.2207c92ae8944a: (16) Device or resource busy
> 
> Then I also tried to shutdown an instance that uses some of the  
> volumes listed in the cache pool, but the objects didn't change at  
> all, the total number was also still 39. For the rbd_header objects I  
> don't even know how to identify their "owner", is there a way?
> 
> Has anyone a hint what else I could check or is it reasonable to  
> assume that the objects are really the same and there would be no data  
> loss in case we deleted that pool?
> We appreciate any help!
> 
> Regards,
> Eugen
> 


-- 
Christian Balzer            Network/Systems Engineer
ch...@gol.com   Rakuten Communications
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pros/cons of multiple OSD's per host

2017-08-22 Thread Christian Balzer

Hello,

On Tue, 22 Aug 2017 16:51:47 +0800 Nick Tan wrote:

> Hi Christian,
> 
> 
> 
> > > Hi David,
> > >
> > > The planned usage for this CephFS cluster is scratch space for an image
> > > processing cluster with 100+ processing nodes.  
> >
> > Lots of clients, how much data movement would you expect, how many images
> > come in per timeframe, lets say an hour?
> > Typical size of a image?
> >
> > Does an image come in and then gets processed by one processing node?
> > Unlikely to be touched again, at least in the short term?
> > Probably being deleted after being processed?
> >  
> 
> We'd typically get up to 6TB of raw imagery per day at an average image
> size of 20MB.  There's a complex multi stage processing chain that happens
> - typically images are read by multiple nodes with intermediate data
> generated and processed again by multiple nodes.  This would generate about
> 30TB of intermediate data.  The end result would be around 9TB of final
> processed data.  Once the processing is complete and the final data is
> copied off and completed QA, the entire data set is deleted.  The data sets
> could remain on the file system for up to 2 weeks before deletion.
> 

If this is a more or less sequential process w/o too many spikes, a hot
(daily) SSD pool or cache-tier may work wonders.
45TB of flash storage would be a bit spendy, though.

630TB total, let's call it 800, that's already 20 nodes with 12x 10TB HDDs.
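
Spelling that arithmetic out (assuming the default 3x replication):

(6 + 30 + 9) TB/day x 14 days    ~= 630 TB of live data, round up to ~800 TB usable
800 TB usable x 3 replicas       ~= 2400 TB raw
2400 TB / (12 x 10 TB per node)  ~= 20 nodes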

> 
> 
> > >  My thinking is we'd be
> > > better off with a large number (100+) of storage hosts with 1-2 OSD's  
> > each,  
> > > rather than 10 or so storage nodes with 10+ OSD's to get better  
> > parallelism  
> > > but I don't have any practical experience with CephFS to really judge.  
> > CephFS is one thing (of which I have very limited experience), but at this
> > point you're talking about parallelism in Ceph (RBD).
> > And that happens much more on an OSD than host level.
> >
> > Which you _can_ achieve with larger nodes, if they're well designed.
> > Meaning CPU/RAM/interal storage bandwidth/network bandwidth being in
> > "harmony".
> >  
> 
> I'm not sure what you mean about the RBD reference.  Does CephFS use RBD
> internally?
> 
RADOS, the underlying layer.

> 
> >
> > Also you keep talking about really huge HDDs, you could do worse than
> > halving their size and doubling their numbers to achieve much more
> > bandwidth and the ever crucial IOPS (even in your use case).
> >
> > So something like 20x 12 HDD servers, with SSDs/NVMes for journal/bluestore
> > wAL/DB if you can afford or actually need it.
> >
> > CephFS metadata on a SSD pool isn't the most dramatic improvement one can
> > do (or so people tell me), but given your budget it may be worthwhile.
> >
> >  
> Yes, I totally get the benefits of using greater numbers of smaller HDD's.
> One of the requirements is to keep $/TB low and large capacity drives helps
> with that.  I guess we need to look at the tradeoff of $/TB vs number of
> spindles for performance.
> 
Again, if it's mostly sequential the IOPS needs will of course be very
different from a scenario where you get 100 images coming in at once while
the processing nodes are munching on the previous 100.

> If CephFS's parallelism happens more at the OSD level than the host level
> then perhaps the 12 disk storage host would be fine as long as
> "mon_osd_down_out_subtree_limit = host" and there's enough CPU/RAM/BUS and
> Network bandwidth on the host.  I'm doing some cost comparisons of these
> "big" servers vs multiple "small" servers such as the supermicro microcloud
> chassis or the Ambedded Mars 200 ARM cluster (which looks very
> interesting).  
The latter at least avoids the usual issue of underpowered and high-latency
networking that these kinds of designs (one from Supermicro comes to
mind) tend to have, but 2GB RAM and the CPU feel... weak.

Also you will have to buy an SSD for each in case you want/need journals
(or fast WAL/DB with bluestore).
Spendy and massively annoying if anything fails with these things (no
hot-swap).

> However, cost is not the sole consideration, so I'm hoping
> to get an idea of performance differences between the two architectures to
> help with the decision making process given the lack of test equipment
> available.
> 
If you compare the above bits, they should perform within the same
ballpark when it comes to sequential operations.
But it is a lot easier to beef up a medium-sized node (to a point) than
something like those high-density solutions.

With a larger node you have the option to go for 25Gb/s (lower latency)
NICs easily, with just 12 HDDs keep it to one NUMA node (also look at the
upcoming AMD Epyc stuff) with fast cores (lower latency again) and enough
RAM to have significant page cache effects AND even more importantly keep
SLAB data like inodes in RAM.

Which reminds me, I don't have the faintest idea how this (lots of RAM)
will apply to or help with Bluestore.


Christian

> 
> 
> >  
> > > And
> > > I do

Re: [ceph-users] RBD encryption options?

2017-08-22 Thread Marc Roos
 

I had some issues with the iscsi software starting too early; maybe this
can give you some ideas.


systemctl show target.service -p After

mkdir /etc/systemd/system/target.service.d

cat << 'EOF' > /etc/systemd/system/target.service.d/10-waitforrbd.conf
[Unit]
After=systemd-journald.socket sys-kernel-config.mount system.slice 
basic.target network.target local-fs.target rbdmap.service
EOF
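
After creating the drop-in, a daemon-reload and re-checking the unit
ordering should confirm that rbdmap.service is now listed:

systemctl daemon-reload
systemctl show target.service -p After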


-Original Message-
From: Daniel K [mailto:satha...@gmail.com] 
Sent: dinsdag 22 augustus 2017 3:03
To: ceph-users@lists.ceph.com
Subject: [ceph-users] RBD encryption options?

Are there any client-side options to encrypt an RBD device?

Using latest luminous RC, on Ubuntu 16.04 and a 4.10 kernel

I assumed adding client-side encryption would be as simple as using
luks/dm-crypt/cryptsetup after adding the RBD device to /etc/ceph/rbdmap
and enabling the rbdmap service -- but I failed to consider the order of
things loading, and it appears that the RBD gets mapped too late for
dm-crypt to recognize it as valid. It just keeps telling me it's not a
valid LUKS device.
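
For reference, the manual sequence I'm trying to get to happen
automatically at boot looks roughly like this (pool/image names are just
examples):

rbd map rbd/secure-vol
cryptsetup luksFormat /dev/rbd/rbd/secure-vol
cryptsetup luksOpen /dev/rbd/rbd/secure-vol secure-vol
mkfs.ext4 /dev/mapper/secure-vol
mount /dev/mapper/secure-vol /mnt/secure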

I know you can run the OSDs on an encrypted drive, but I was hoping for
something client-side, since it's not exactly simple (as far as I can
tell) to restrict client access to a single RBD (or group of RBDs) within
a shared pool.

Any suggestions?




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Blocked requests problem

2017-08-22 Thread Ramazan Terzi
Hello,

I have a Ceph Cluster with specifications below:
3 x Monitor node
6 x Storage Node (6 disk per Storage Node, 6TB SATA Disks, all disks have SSD 
journals)
Distributed public and private networks. All NICs are 10Gbit/s
osd pool default size = 3
osd pool default min size = 2

Ceph version is Jewel 10.2.6.

My cluster is active and a lot of virtual machines running on it (Linux and 
Windows VM's, database clusters, web servers etc).

During normal use, the cluster slowly went into a state of blocked requests.
Blocked requests are periodically incrementing. All OSDs seem healthy. Benchmarks,
iowait, network tests - all of them succeed.

Yesterday, 08:00:
$ ceph health detail
HEALTH_WARN 3 requests are blocked > 32 sec; 3 osds have slow requests
1 ops are blocked > 134218 sec on osd.31
1 ops are blocked > 134218 sec on osd.3
1 ops are blocked > 8388.61 sec on osd.29
3 osds have slow requests

Today, 16:05:
$ ceph health detail
HEALTH_WARN 32 requests are blocked > 32 sec; 3 osds have slow requests
1 ops are blocked > 134218 sec on osd.31
1 ops are blocked > 134218 sec on osd.3
16 ops are blocked > 134218 sec on osd.29
11 ops are blocked > 67108.9 sec on osd.29
2 ops are blocked > 16777.2 sec on osd.29
1 ops are blocked > 8388.61 sec on osd.29
3 osds have slow requests

$ ceph pg dump | grep scrub
dumped all in format plain
pg_stat objects mip degrmispunf bytes   log disklog state   
state_stamp v   reportedup  up_primary  acting  
acting_primary  last_scrub  scrub_stamp last_deep_scrub deep_scrub_stamp
20.1e   25183   0   0   0   0   98332537930 30663066
active+clean+scrubbing  2017-08-21 04:55:13.354379  6930'23908781   
6930:20905696   [29,31,3]   29  [29,31,3]   29  6712'22950171   
2017-08-20 04:46:59.208792  6712'22950171   2017-08-20 04:46:59.208792

The active scrub does not finish (about 24 hours). I did not restart any OSDs
meanwhile.
I'm thinking of setting the noscrub, nodeep-scrub, norebalance, nobackfill, and
norecover flags and restarting OSDs 3, 29 and 31. Would this solve my problem? Or
does anyone have a suggestion about this problem?
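
Concretely, the plan would be something like this (the systemd unit names
assume a systemd-based deployment):

$ ceph osd set noscrub
$ ceph osd set nodeep-scrub
$ ceph osd set norebalance
$ ceph osd set nobackfill
$ ceph osd set norecover
# then, on the host(s) holding OSDs 3, 29 and 31:
$ systemctl restart ceph-osd@3
$ systemctl restart ceph-osd@29
$ systemctl restart ceph-osd@31
# afterwards, unset the flags again with "ceph osd unset <flag>"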

Thanks,
Ramazan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG reported as inconsistent in status, but no inconsistencies visible to rados

2017-08-22 Thread Edward R Huyer
Neat, hadn't seen that command before.  Here's the fsck log from the primary 
OSD:  https://pastebin.com/nZ0H5ag3
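
(For anyone following along, the invocation is roughly the following; the
OSD has to be stopped first, and the path is just the primary OSD's data
directory:)

systemctl stop ceph-osd@63
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-63
systemctl start ceph-osd@63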

Looks like the OSD's bluestore "filesystem" itself has some underlying errors, 
though I'm not sure what to do about them.

-Original Message-
From: Brad Hubbard [mailto:bhubb...@redhat.com] 
Sent: Monday, August 21, 2017 7:05 PM
To: Edward R Huyer 
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] PG reported as inconsistent in status, but no 
inconsistencies visible to rados

Could you provide the output of 'ceph-bluestore-tool fsck' for one of these 
OSDs?

On Tue, Aug 22, 2017 at 2:53 AM, Edward R Huyer  wrote:
> This is an odd one.  My cluster is reporting an inconsistent pg in 
> ceph status and ceph health detail.  However, rados 
> list-inconsistent-obj and rados list-inconsistent-snapset both report 
> no inconsistencies.  Scrubbing the pg results in these errors in the osd logs:
>
>
>
> OSD 63 (primary):
>
> 2017-08-21 12:41:03.580068 7f0b36629700 -1
> bluestore(/var/lib/ceph/osd/ceph-63) _verify_csum bad crc32c/0x1000 
> checksum at blob offset 0x0, got 0x6b6b9184, expected 0x6706be76, 
> device location [0x23f39d~1000], logical extent 0x0~1000, object 
> #9:55bf7cc6:::rbd_data.33992ae8944a.200f:e#
>
> 2017-08-21 12:41:03.961945 7f0b36629700 -1 log_channel(cluster) log [ERR] :
> 9.aa soid 9:55bf7cc6:::rbd_data.33992ae8944a.200f:e: 
> failed to pick suitable object info
>
> 2017-08-21 12:41:15.357484 7f0b36629700 -1 log_channel(cluster) log [ERR] :
> 9.aa deep-scrub 3 errors
>
>
>
> OSD 50:
>
> 2017-08-21 12:41:03.592918 7f264be6d700 -1
> bluestore(/var/lib/ceph/osd/ceph-50) _verify_csum bad crc32c/0x1000 
> checksum at blob offset 0x0, got 0x64a1e2b1, expected 0x6706be76, 
> device location [0x341883~1000], logical extent 0x0~1000, object 
> #9:55bf7cc6:::rbd_data.33992ae8944a.200f:e#
>
>
>
> OSD 46:
>
> 2017-08-21 12:41:03.531394 7fb396b1f700 -1
> bluestore(/var/lib/ceph/osd/ceph-46) _verify_csum bad crc32c/0x1000 
> checksum at blob offset 0x0, got 0x7aa05c01, expected 0x6706be76, 
> device location [0x1d6e1e~1000], logical extent 0x0~1000, object 
> #9:55bf7cc6:::rbd_data.33992ae8944a.200f:e#
>
>
>
> This is on Ceph 12.1.4 (previously 12.1.1).
>
>
>
> Thoughts?
>
>
>
> -
>
> Edward Huyer
>
> School of Interactive Games and Media
>
> Rochester Institute of Technology
>
> Golisano 70-2373
>
> 152 Lomb Memorial Drive
>
> Rochester, NY 14623
>
> 585-475-6651
>
> erh...@rit.edu
>
>
>
> Obligatory Legalese:
>
> The information transmitted, including attachments, is intended only 
> for the
> person(s) or entity to which it is addressed and may contain 
> confidential and/or privileged material. Any review, retransmission, 
> dissemination or other use of, or taking of any action in reliance 
> upon this information by persons or entities other than the intended 
> recipient is prohibited. If you received this in error, please contact 
> the sender and destroy any copies of this information.
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



--
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Blocked requests problem

2017-08-22 Thread Ranjan Ghosh

Hi Ramazan,

I'm no Ceph expert, but what I can say from my experience using Ceph is:

1) During "Scrubbing", Ceph can be extremely slow. This is probably 
where your "blocked requests" are coming from. BTW: Perhaps you can even 
find out which processes are currently blocking with: ps aux | grep "D". 
You might even want to kill some of those and/or shut down services in
order to relieve some stress from the machine until it recovers.


2) I usually have the following in my ceph.conf. This lets the scrubbing 
only run between midnight and 6 AM (hopefully the time of least demand; 
adjust as necessary)  - and with the lowest priority.


#Reduce impact of scrub.
osd_disk_thread_ioprio_priority = 7
osd_disk_thread_ioprio_class = "idle"
osd_scrub_end_hour = 6

3) The Scrubbing begin and end hour will always work. The low priority 
mode, however, works (AFAIK!) only with CFQ I/O Scheduler. Show your 
current scheduler like this (replace sda with your device):


cat /sys/block/sda/queue/scheduler

You can also echo to this file to set a different scheduler.
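
For example, switching to CFQ (device name is just an example; the change
does not persist across reboots):

echo cfq > /sys/block/sda/queue/scheduler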


With these settings you can perhaps alleviate the problem so far that the
scrubbing runs over many nights until it finishes. Again, AFAIK, it
doesn't have to finish in one night. It will continue the next night and
so on.


The Ceph experts say scrubbing is important. Don't know why, but I just 
believe them. They've built this complex stuff after all :-)


Thus, you can use "noscrub"/"nodeep-scrub" to quickly get a hung server
back to work, but you should not let it run like this forever and a day.


Hope this helps at least a bit.

BR,

Ranjan


Am 22.08.2017 um 15:20 schrieb Ramazan Terzi:

Hello,

I have a Ceph Cluster with specifications below:
3 x Monitor node
6 x Storage Node (6 disk per Storage Node, 6TB SATA Disks, all disks have SSD 
journals)
Distributed public and private networks. All NICs are 10Gbit/s
osd pool default size = 3
osd pool default min size = 2

Ceph version is Jewel 10.2.6.

My cluster is active and a lot of virtual machines running on it (Linux and 
Windows VM's, database clusters, web servers etc).

During normal use, cluster slowly went into a state of blocked requests. 
Blocked requests periodically incrementing. All OSD's seems healthy. Benchmark, 
iowait, network tests, all of them succeed.

Yesterday, 08:00:
$ ceph health detail
HEALTH_WARN 3 requests are blocked > 32 sec; 3 osds have slow requests
1 ops are blocked > 134218 sec on osd.31
1 ops are blocked > 134218 sec on osd.3
1 ops are blocked > 8388.61 sec on osd.29
3 osds have slow requests

Today, 16:05:
$ ceph health detail
HEALTH_WARN 32 requests are blocked > 32 sec; 3 osds have slow requests
1 ops are blocked > 134218 sec on osd.31
1 ops are blocked > 134218 sec on osd.3
16 ops are blocked > 134218 sec on osd.29
11 ops are blocked > 67108.9 sec on osd.29
2 ops are blocked > 16777.2 sec on osd.29
1 ops are blocked > 8388.61 sec on osd.29
3 osds have slow requests

$ ceph pg dump | grep scrub
dumped all in format plain
pg_stat objects mip degrmispunf bytes   log disklog state   
state_stamp v   reportedup  up_primary  acting  
acting_primary  last_scrub  scrub_stamp last_deep_scrub deep_scrub_stamp
20.1e   25183   0   0   0   0   98332537930 30663066
active+clean+scrubbing  2017-08-21 04:55:13.354379  6930'23908781   
6930:20905696   [29,31,3]   29  [29,31,3]   29  6712'22950171   
2017-08-20 04:46:59.208792  6712'22950171   2017-08-20 04:46:59.208792

Active scrub does not finish (about 24 hours). I did not restart any OSD 
meanwhile.
I'm thinking set noscrub, noscrub-deep, norebalance, nobackfill, and norecover 
flags and restart 3,29,31th OSDs. Is this solve my problem? Or anyone has 
suggestion about this problem?

Thanks,
Ramazan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Help with file system with failed mds daemon

2017-08-22 Thread Bryan Banister
Hi all,

I'm still new to Ceph and CephFS.  Trying out the multi-fs configuration on a
Luminous test cluster.  I shut down the cluster to do an upgrade, and when I
brought the cluster back up I now have a warning that one of the file systems
has a failed mds daemon:

2017-08-21 17:00:00.81 mon.carf-ceph-osd15 [WRN] overall HEALTH_WARN 1 
filesystem is degraded; 1 filesystem is have a failed mds daemon; 1 pools have 
many more objects per pg than average; application not enabled on 9 pool(s)

I tried restarting the mds service on the system and it doesn't seem to 
indicate any problems:
2017-08-21 16:13:40.979449 7fffed8b0700  1 mds.0.20 shutdown: shutting down 
rank 0
2017-08-21 16:13:41.012167 77fde1c0  0 set uid:gid to 167:167 (ceph:ceph)
2017-08-21 16:13:41.012180 77fde1c0  0 ceph version 12.1.4 
(a5f84b37668fc8e03165aaf5cbb380c78e4deba4) luminous (rc), process (unknown), 
pid 16656
2017-08-21 16:13:41.014105 77fde1c0  0 pidfile_write: ignore empty 
--pid-file
2017-08-21 16:13:45.541442 710b7700  1 mds.0.23 handle_mds_map i am now 
mds.0.23
2017-08-21 16:13:45.541449 710b7700  1 mds.0.23 handle_mds_map state change 
up:boot --> up:replay
2017-08-21 16:13:45.541459 710b7700  1 mds.0.23 replay_start
2017-08-21 16:13:45.541466 710b7700  1 mds.0.23  recovery set is
2017-08-21 16:13:45.541475 710b7700  1 mds.0.23  waiting for osdmap 1198 
(which blacklists prior instance)
2017-08-21 16:13:45.565779 7fffea8aa700  0 mds.0.cache creating system inode 
with ino:0x100
2017-08-21 16:13:45.565920 7fffea8aa700  0 mds.0.cache creating system inode 
with ino:0x1
2017-08-21 16:13:45.571747 7fffe98a8700  1 mds.0.23 replay_done
2017-08-21 16:13:45.571751 7fffe98a8700  1 mds.0.23 making mds journal writeable
2017-08-21 16:13:46.542148 710b7700  1 mds.0.23 handle_mds_map i am now 
mds.0.23
2017-08-21 16:13:46.542149 710b7700  1 mds.0.23 handle_mds_map state change 
up:replay --> up:reconnect
2017-08-21 16:13:46.542158 710b7700  1 mds.0.23 reconnect_start
2017-08-21 16:13:46.542161 710b7700  1 mds.0.23 reopen_log
2017-08-21 16:13:46.542171 710b7700  1 mds.0.23 reconnect_done
2017-08-21 16:13:47.543612 710b7700  1 mds.0.23 handle_mds_map i am now 
mds.0.23
2017-08-21 16:13:47.543616 710b7700  1 mds.0.23 handle_mds_map state change 
up:reconnect --> up:rejoin
2017-08-21 16:13:47.543623 710b7700  1 mds.0.23 rejoin_start
2017-08-21 16:13:47.543638 710b7700  1 mds.0.23 rejoin_joint_start
2017-08-21 16:13:47.543666 710b7700  1 mds.0.23 rejoin_done
2017-08-21 16:13:48.544768 710b7700  1 mds.0.23 handle_mds_map i am now 
mds.0.23
2017-08-21 16:13:48.544771 710b7700  1 mds.0.23 handle_mds_map state change 
up:rejoin --> up:active
2017-08-21 16:13:48.544779 710b7700  1 mds.0.23 recovery_done -- successful 
recovery!
2017-08-21 16:13:48.544924 710b7700  1 mds.0.23 active_start
2017-08-21 16:13:48.544954 710b7700  1 mds.0.23 cluster recovered.

This seems like an easy problem to fix.  Any help is greatly appreciated!
-Bryan



Note: This email is for the confidential use of the named addressee(s) only and 
may contain proprietary, confidential or privileged information. If you are not 
the intended recipient, you are hereby notified that any review, dissemination 
or copying of this email is strictly prohibited, and to please notify the 
sender immediately and destroy this email and any attachments. Email 
transmission cannot be guaranteed to be secure or error-free. The Company, 
therefore, does not make any guarantees as to the completeness or accuracy of 
this email or any attachments. This email is for informational purposes only 
and does not constitute a recommendation, offer, request or solicitation of any 
kind to buy, sell, subscribe, redeem or perform any type of transaction of a 
financial product.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Help with file system with failed mds daemon

2017-08-22 Thread John Spray
On Tue, Aug 22, 2017 at 4:58 PM, Bryan Banister
 wrote:
> Hi all,
>
>
>
> I’m still new to ceph and cephfs.  Trying out the multi-fs configuration on
> at Luminous test cluster.  I shutdown the cluster to do an upgrade and when
> I brought the cluster back up I now have a warnings that one of the file
> systems has a failed mds daemon:
>
>
>
> 2017-08-21 17:00:00.81 mon.carf-ceph-osd15 [WRN] overall HEALTH_WARN 1
> filesystem is degraded; 1 filesystem is have a failed mds daemon; 1 pools
> have many more objects per pg than average; application not enabled on 9
> pool(s)
>
>
>
> I tried restarting the mds service on the system and it doesn’t seem to
> indicate any problems:
>
> 2017-08-21 16:13:40.979449 7fffed8b0700  1 mds.0.20 shutdown: shutting down
> rank 0
>
> 2017-08-21 16:13:41.012167 77fde1c0  0 set uid:gid to 167:167
> (ceph:ceph)
>
> 2017-08-21 16:13:41.012180 77fde1c0  0 ceph version 12.1.4
> (a5f84b37668fc8e03165aaf5cbb380c78e4deba4) luminous (rc), process (unknown),
> pid 16656
>
> 2017-08-21 16:13:41.014105 77fde1c0  0 pidfile_write: ignore empty
> --pid-file
>
> 2017-08-21 16:13:45.541442 710b7700  1 mds.0.23 handle_mds_map i am now
> mds.0.23
>
> 2017-08-21 16:13:45.541449 710b7700  1 mds.0.23 handle_mds_map state
> change up:boot --> up:replay
>
> 2017-08-21 16:13:45.541459 710b7700  1 mds.0.23 replay_start
>
> 2017-08-21 16:13:45.541466 710b7700  1 mds.0.23  recovery set is
>
> 2017-08-21 16:13:45.541475 710b7700  1 mds.0.23  waiting for osdmap 1198
> (which blacklists prior instance)
>
> 2017-08-21 16:13:45.565779 7fffea8aa700  0 mds.0.cache creating system inode
> with ino:0x100
>
> 2017-08-21 16:13:45.565920 7fffea8aa700  0 mds.0.cache creating system inode
> with ino:0x1
>
> 2017-08-21 16:13:45.571747 7fffe98a8700  1 mds.0.23 replay_done
>
> 2017-08-21 16:13:45.571751 7fffe98a8700  1 mds.0.23 making mds journal
> writeable
>
> 2017-08-21 16:13:46.542148 710b7700  1 mds.0.23 handle_mds_map i am now
> mds.0.23
>
> 2017-08-21 16:13:46.542149 710b7700  1 mds.0.23 handle_mds_map state
> change up:replay --> up:reconnect
>
> 2017-08-21 16:13:46.542158 710b7700  1 mds.0.23 reconnect_start
>
> 2017-08-21 16:13:46.542161 710b7700  1 mds.0.23 reopen_log
>
> 2017-08-21 16:13:46.542171 710b7700  1 mds.0.23 reconnect_done
>
> 2017-08-21 16:13:47.543612 710b7700  1 mds.0.23 handle_mds_map i am now
> mds.0.23
>
> 2017-08-21 16:13:47.543616 710b7700  1 mds.0.23 handle_mds_map state
> change up:reconnect --> up:rejoin
>
> 2017-08-21 16:13:47.543623 710b7700  1 mds.0.23 rejoin_start
>
> 2017-08-21 16:13:47.543638 710b7700  1 mds.0.23 rejoin_joint_start
>
> 2017-08-21 16:13:47.543666 710b7700  1 mds.0.23 rejoin_done
>
> 2017-08-21 16:13:48.544768 710b7700  1 mds.0.23 handle_mds_map i am now
> mds.0.23
>
> 2017-08-21 16:13:48.544771 710b7700  1 mds.0.23 handle_mds_map state
> change up:rejoin --> up:active
>
> 2017-08-21 16:13:48.544779 710b7700  1 mds.0.23 recovery_done --
> successful recovery!
>
> 2017-08-21 16:13:48.544924 710b7700  1 mds.0.23 active_start
>
> 2017-08-21 16:13:48.544954 710b7700  1 mds.0.23 cluster recovered.
>
>
>
> This seems like an easy problem to fix.  Any help is greatly appreciated!

I wonder if you have two filesystems but only one MDS?  Ceph will then
think that the second filesystem "has a failed MDS" because there
isn't an MDS online to service it.
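
A quick way to check is to compare the number of filesystems with the
number of MDS daemons (the daemon name below is just an example):

ceph fs ls
ceph mds stat
# if there is indeed no MDS left for the second filesystem, starting
# another MDS daemon should clear the warning:
systemctl start ceph-mds@mds2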

John

>
> -Bryan
>
>
> 
>
> Note: This email is for the confidential use of the named addressee(s) only
> and may contain proprietary, confidential or privileged information. If you
> are not the intended recipient, you are hereby notified that any review,
> dissemination or copying of this email is strictly prohibited, and to please
> notify the sender immediately and destroy this email and any attachments.
> Email transmission cannot be guaranteed to be secure or error-free. The
> Company, therefore, does not make any guarantees as to the completeness or
> accuracy of this email or any attachments. This email is for informational
> purposes only and does not constitute a recommendation, offer, request or
> solicitation of any kind to buy, sell, subscribe, redeem or perform any type
> of transaction of a financial product.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSDs in EC pool flapping

2017-08-22 Thread george.vasilakakos
Hey folks,


I'm staring at a problem that I have found no solution for and which is causing 
major issues.
We've had a PG go down with the first 3 OSDs all crashing and coming back only 
to crash again with the following error in their logs:

-1> 2017-08-22 17:27:50.961633 7f4af4057700 -1 osd.1290 pg_epoch: 72946 
pg[1.138s0( v 72946'430011 (62760'421568,72
946'430011] local-les=72945 n=22918 ec=764 les/c/f 72945/72881/0 
72942/72944/72944) [1290,927,672,456,177,1094,194,1513
,236,302,1326]/[1290,927,672,456,177,1094,194,2147483647,236,302,1326] r=0 
lpr=72944 pi=72880-72943/24 bft=1513(7) crt=
72946'430011 lcod 72889'430010 mlcod 72889'430010 
active+undersized+degraded+remapped+backfilling] recover_replicas: ob
ject added to missing set for backfill, but is not in recovering, error!
 0> 2017-08-22 17:27:50.965861 7f4af4057700 -1 *** Caught signal (Aborted) 
**
 in thread 7f4af4057700 thread_name:tp_osd_tp

This has been going on over the weekend when we saw a different error message 
before upgrading from 11.2.0 to 11.2.1.
The pool is running EC 8+3.

The OSDs crash with that error only to be restarted by systemd and fail again
the exact same way. Eventually systemd gives up, the mon_osd_down_out_interval
expires, and the PG just stays down+remapped while others recover and go
active+clean.

Can anybody help with this type of problem?


Best regards,

George Vasilakakos
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Blocked requests problem

2017-08-22 Thread Ramazan Terzi
Hi Ranjan,

Thanks for your reply. I did set the noscrub and nodeep-scrub flags. But the
active scrubbing operation still isn't completing properly. The scrubbing
operation is always on the same pg (20.1e).

$ ceph pg dump | grep scrub
dumped all in format plain
pg_stat objects mip degrmispunf bytes   log disklog state   
state_stamp v   reportedup  up_primary  acting  
acting_primary  last_scrub  scrub_stamp last_deep_scrub deep_scrub_stamp
20.1e   25189   0   0   0   0   98359116362 30483048
active+clean+scrubbing  2017-08-21 04:55:13.354379  6930'2393   
6930:20949058   [29,31,3]   29  [29,31,3]   29  6712'22950171   
2017-08-20 04:46:59.208792  6712'22950171   2017-08-20 04:46:59.208792


$ ceph -s
cluster 
 health HEALTH_WARN
33 requests are blocked > 32 sec
noscrub,nodeep-scrub flag(s) set
 monmap e9: 3 mons at 
{ceph-mon01=**:6789/0,ceph-mon02=**:6789/0,ceph-mon03=**:6789/0}
election epoch 84, quorum 0,1,2 ceph-mon01,ceph-mon02,ceph-mon03
 osdmap e6930: 36 osds: 36 up, 36 in
flags noscrub,nodeep-scrub,sortbitwise,require_jewel_osds
  pgmap v17667617: 1408 pgs, 5 pools, 24779 GB data, 6494 kobjects
70497 GB used, 127 TB / 196 TB avail
1407 active+clean
   1 active+clean+scrubbing


Thanks,
Ramazan


> On 22 Aug 2017, at 18:52, Ranjan Ghosh  wrote:
> 
> Hi Ramazan,
> 
> I'm no Ceph expert, but what I can say from my experience using Ceph is:
> 
> 1) During "Scrubbing", Ceph can be extremely slow. This is probably where 
> your "blocked requests" are coming from. BTW: Perhaps you can even find out 
> which processes are currently blocking with: ps aux | grep "D". You might 
> even want to kill some of those and/or shutdown services in order to relieve 
> some stress from the machine until it recovers.
> 
> 2) I usually have the following in my ceph.conf. This lets the scrubbing only 
> run between midnight and 6 AM (hopefully the time of least demand; adjust as 
> necessary)  - and with the lowest priority.
> 
> #Reduce impact of scrub.
> osd_disk_thread_ioprio_priority = 7
> osd_disk_thread_ioprio_class = "idle"
> osd_scrub_end_hour = 6
> 
> 3) The Scrubbing begin and end hour will always work. The low priority mode, 
> however, works (AFAIK!) only with CFQ I/O Scheduler. Show your current 
> scheduler like this (replace sda with your device):
> 
> cat /sys/block/sda/queue/scheduler
> 
> You can also echo to this file to set a different scheduler.
> 
> 
> With these settings you can perhaps alleviate the problem so far, that the 
> scrubbing runs over many nights until it finished. Again, AFAIK, it doesnt 
> have to finish in one night. It will continue the next night and so on.
> 
> The Ceph experts say scrubbing is important. Don't know why, but I just 
> believe them. They've built this complex stuff after all :-)
> 
> Thus, you can use "noscrub"/"nodeepscrub" to quickly get a hung server back 
> to work, but you should not let it run like this forever and a day.
> 
> Hope this helps at least a bit.
> 
> BR,
> 
> Ranjan
> 
> 
> Am 22.08.2017 um 15:20 schrieb Ramazan Terzi:
>> Hello,
>> 
>> I have a Ceph Cluster with specifications below:
>> 3 x Monitor node
>> 6 x Storage Node (6 disk per Storage Node, 6TB SATA Disks, all disks have 
>> SSD journals)
>> Distributed public and private networks. All NICs are 10Gbit/s
>> osd pool default size = 3
>> osd pool default min size = 2
>> 
>> Ceph version is Jewel 10.2.6.
>> 
>> My cluster is active and a lot of virtual machines running on it (Linux and 
>> Windows VM's, database clusters, web servers etc).
>> 
>> During normal use, cluster slowly went into a state of blocked requests. 
>> Blocked requests periodically incrementing. All OSD's seems healthy. 
>> Benchmark, iowait, network tests, all of them succeed.
>> 
>> Yesterday, 08:00:
>> $ ceph health detail
>> HEALTH_WARN 3 requests are blocked > 32 sec; 3 osds have slow requests
>> 1 ops are blocked > 134218 sec on osd.31
>> 1 ops are blocked > 134218 sec on osd.3
>> 1 ops are blocked > 8388.61 sec on osd.29
>> 3 osds have slow requests
>> 
>> Today, 16:05:
>> $ ceph health detail
>> HEALTH_WARN 32 requests are blocked > 32 sec; 3 osds have slow requests
>> 1 ops are blocked > 134218 sec on osd.31
>> 1 ops are blocked > 134218 sec on osd.3
>> 16 ops are blocked > 134218 sec on osd.29
>> 11 ops are blocked > 67108.9 sec on osd.29
>> 2 ops are blocked > 16777.2 sec on osd.29
>> 1 ops are blocked > 8388.61 sec on osd.29
>> 3 osds have slow requests
>> 
>> $ ceph pg dump | grep scrub
>> dumped all in format plain
>> pg_stat  objects mip degrmispunf bytes   log disklog 
>> state   state_stamp v   reportedup  up_primary  
>> acting  acting_primary  last_scrub  scrub_stamp last_deep_scrub 
>> deep_scrub_stamp
>> 20.1e25183   0   0   0   0   

Re: [ceph-users] Blocked requests problem

2017-08-22 Thread Ranjan Ghosh
Hm. That's quite weird. On our cluster, when I set "noscrub",
"nodeep-scrub", scrubbing will always stop pretty quickly (a few
minutes). I wonder why this doesn't happen on your cluster. When exactly
did you set the flags? Perhaps it just needs some more time... Or there
might be a disk problem that explains why the scrubbing never finishes.
Perhaps it's really a good idea, just like you proposed, to shut down the
corresponding OSDs. But those are just my thoughts. Perhaps some Ceph pro
can shed some light on the possible reasons why a scrubbing might get
stuck and how to resolve this.



Am 22.08.2017 um 18:58 schrieb Ramazan Terzi:

Hi Ranjan,

Thanks for your reply. I did set scrub and nodeep-scrub flags. But active 
scrubbing operation can’t working properly. Scrubbing operation always in same 
pg (20.1e).

$ ceph pg dump | grep scrub
dumped all in format plain
pg_stat objects mip degrmispunf bytes   log disklog state   
state_stamp v   reportedup  up_primary  acting  
acting_primary  last_scrub  scrub_stamp last_deep_scrub deep_scrub_stamp
20.1e   25189   0   0   0   0   98359116362 30483048
active+clean+scrubbing  2017-08-21 04:55:13.354379  6930'2393   
6930:20949058   [29,31,3]   29  [29,31,3]   29  6712'22950171   
2017-08-20 04:46:59.208792  6712'22950171   2017-08-20 04:46:59.208792


$ ceph -s
 cluster 
  health HEALTH_WARN
 33 requests are blocked > 32 sec
 noscrub,nodeep-scrub flag(s) set
  monmap e9: 3 mons at 
{ceph-mon01=**:6789/0,ceph-mon02=**:6789/0,ceph-mon03=**:6789/0}
 election epoch 84, quorum 0,1,2 ceph-mon01,ceph-mon02,ceph-mon03
  osdmap e6930: 36 osds: 36 up, 36 in
 flags noscrub,nodeep-scrub,sortbitwise,require_jewel_osds
   pgmap v17667617: 1408 pgs, 5 pools, 24779 GB data, 6494 kobjects
 70497 GB used, 127 TB / 196 TB avail
 1407 active+clean
1 active+clean+scrubbing


Thanks,
Ramazan



On 22 Aug 2017, at 18:52, Ranjan Ghosh  wrote:

Hi Ramazan,

I'm no Ceph expert, but what I can say from my experience using Ceph is:

1) During "Scrubbing", Ceph can be extremely slow. This is probably where your "blocked 
requests" are coming from. BTW: Perhaps you can even find out which processes are currently blocking 
with: ps aux | grep "D". You might even want to kill some of those and/or shutdown services in 
order to relieve some stress from the machine until it recovers.

2) I usually have the following in my ceph.conf. This lets the scrubbing only 
run between midnight and 6 AM (hopefully the time of least demand; adjust as 
necessary)  - and with the lowest priority.

#Reduce impact of scrub.
osd_disk_thread_ioprio_priority = 7
osd_disk_thread_ioprio_class = "idle"
osd_scrub_end_hour = 6

3) The Scrubbing begin and end hour will always work. The low priority mode, 
however, works (AFAIK!) only with CFQ I/O Scheduler. Show your current 
scheduler like this (replace sda with your device):

cat /sys/block/sda/queue/scheduler

You can also echo to this file to set a different scheduler.


With these settings you can perhaps alleviate the problem so far, that the 
scrubbing runs over many nights until it finished. Again, AFAIK, it doesnt have 
to finish in one night. It will continue the next night and so on.

The Ceph experts say scrubbing is important. Don't know why, but I just believe 
them. They've built this complex stuff after all :-)

Thus, you can use "noscrub"/"nodeepscrub" to quickly get a hung server back to 
work, but you should not let it run like this forever and a day.

Hope this helps at least a bit.

BR,

Ranjan


Am 22.08.2017 um 15:20 schrieb Ramazan Terzi:

Hello,

I have a Ceph Cluster with specifications below:
3 x Monitor node
6 x Storage Node (6 disk per Storage Node, 6TB SATA Disks, all disks have SSD 
journals)
Distributed public and private networks. All NICs are 10Gbit/s
osd pool default size = 3
osd pool default min size = 2

Ceph version is Jewel 10.2.6.

My cluster is active and a lot of virtual machines running on it (Linux and 
Windows VM's, database clusters, web servers etc).

During normal use, cluster slowly went into a state of blocked requests. 
Blocked requests periodically incrementing. All OSD's seems healthy. Benchmark, 
iowait, network tests, all of them succeed.

Yesterday, 08:00:
$ ceph health detail
HEALTH_WARN 3 requests are blocked > 32 sec; 3 osds have slow requests
1 ops are blocked > 134218 sec on osd.31
1 ops are blocked > 134218 sec on osd.3
1 ops are blocked > 8388.61 sec on osd.29
3 osds have slow requests

Today, 16:05:
$ ceph health detail
HEALTH_WARN 32 requests are blocked > 32 sec; 3 osds have slow requests
1 ops are blocked > 134218 sec on osd.31
1 ops are blocked > 134218 sec on osd.3
16 ops are blocked > 134218 sec on osd.29
11 ops are blocked > 67108.9 sec on osd.29
2 ops are blocked > 16777.2 sec on osd.29

[ceph-users] Small-cluster performance issues

2017-08-22 Thread fcid

Hello everyone,

I've been using ceph to provide storage using RBD for 60 KVM virtual 
machines running on proxmox.


The ceph cluster we have is very small (2 OSDs + 1 mon per node, and a
total of 3 nodes) and we are having some performance issues, like big
latency times (apply lat: ~0.5 s; commit lat: 0.001 s), which get worse
during the weekly deep-scrubs.


I wonder if doubling the number of OSDs would improve latency times, or
if there is any other configuration tweak recommended for such a small
cluster. Also, I'm looking forward to reading about the experiences of
other users with a similar configuration.


Some technical info:

  - Ceph version: 10.2.5

  - OSDs have SSD journals (one SSD disk per 2 OSDs) and a spindle
for the backend disk.


  - Using CFQ disk queue scheduler

  - OSD configuration excerpt:

osd_recovery_max_active = 1
osd_recovery_op_priority = 63
osd_client_op_priority = 1
osd_mkfs_options = -f -i size=2048 -n size=64k
osd_mount_options_xfs = inode64,noatime,logbsize=256k
osd_journal_size = 20480
osd_op_threads = 12
osd_disk_threads = 1
osd_disk_thread_ioprio_class = idle
osd_disk_thread_ioprio_priority = 7
osd_scrub_begin_hour = 3
osd_scrub_end_hour = 8
osd_scrub_during_recovery = false
filestore_merge_threshold = 40
filestore_split_multiple = 8
filestore_xattr_use_omap = true
filestore_queue_max_ops = 2500
filestore_min_sync_interval = 0.01
filestore_max_sync_interval = 0.1
filestore_journal_writeahead = true

Best regards,

--
Fernando Cid O.
Ingeniero de Operaciones
AltaVoz S.A.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSDs in EC pool flapping

2017-08-22 Thread Paweł Woszuk
Have you experienced huge memory consumption by the flapping OSD daemons? The
restarts could be triggered by running out of memory (OOM killer).

If yes, this could be connected with an OSD device error (bad blocks?), but we've
experienced something similar on the Jewel, not the Kraken, release. The solution
was to find the PG that causes the error, set it to deep-scrub manually and
restart the PG's primary OSD.

Hope that helps, or at least leads to some solution.

Dnia 22 sierpnia 2017 18:39:47 CEST, george.vasilaka...@stfc.ac.uk napisał(a):
>Hey folks,
>
>
>I'm staring at a problem that I have found no solution for and which is
>causing major issues.
>We've had a PG go down with the first 3 OSDs all crashing and coming
>back only to crash again with the following error in their logs:
>
>-1> 2017-08-22 17:27:50.961633 7f4af4057700 -1 osd.1290 pg_epoch: 72946
>pg[1.138s0( v 72946'430011 (62760'421568,72
>946'430011] local-les=72945 n=22918 ec=764 les/c/f 72945/72881/0
>72942/72944/72944) [1290,927,672,456,177,1094,194,1513
>,236,302,1326]/[1290,927,672,456,177,1094,194,2147483647,236,302,1326]
>r=0 lpr=72944 pi=72880-72943/24 bft=1513(7) crt=
>72946'430011 lcod 72889'430010 mlcod 72889'430010
>active+undersized+degraded+remapped+backfilling] recover_replicas: ob
>ject added to missing set for backfill, but is not in recovering,
>error!
>0> 2017-08-22 17:27:50.965861 7f4af4057700 -1 *** Caught signal
>(Aborted) **
> in thread 7f4af4057700 thread_name:tp_osd_tp
>
>This has been going on over the weekend when we saw a different error
>message before upgrading from 11.2.0 to 11.2.1.
>The pool is running EC 8+3.
>
>The OSDs crash with that error only to be restarted by systemd and fail
>again the exact same way. Eventually systemd gives, the
>mon_osd_down_out_interval expires and the PG just stays down+remapped
>while other recover and go active+clean.
>
>Can anybody help with this type of problem?
>
>
>Best regards,
>
>George Vasilakakos
>___
>ceph-users mailing list
>ceph-users@lists.ceph.com
>http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Paweł Woszuk
PCSS, Poznańskie Centrum Superkomputerowo-Sieciowe
ul. Jana Pawła II nr 10, 61-139 Poznań
Polska___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Small-cluster performance issues

2017-08-22 Thread Maged Mokhtar
It is likely your 2 spinning disks cannot keep up with the load. Things
are likely to improve if you double your OSDs, hooking them up to your
existing SSD journals. Technically it would be nice to run a
load/performance tool (atop/collectl/sysstat) and measure how
busy your resources are, but it is most likely your 2 spinning disks
will show near 100% busy utilization.

filestore_max_sync_interval: I do not recommend decreasing this to 0.1;
I would keep it at 5 sec.

osd_op_threads: do not increase this unless you have enough cores.

But adding disks is the way to go.

Maged 

On 2017-08-22 20:08, fcid wrote:

> Hello everyone,
> 
> I've been using ceph to provide storage using RBD for 60 KVM virtual machines 
> running on proxmox.
> 
> The ceph cluster we have is very small (2 OSDs + 1 mon per node, and a total 
> of 3 nodes) and we are having some performace issues, like big latency times 
> (apply lat:~0.5 s; commit lat: 0.001 s), which get worse by the weekly 
> deep-scrubs.
> 
> I wonder if doubling the numbers of OSDs would improve latency times, or if 
> there is any other configuration tweak recommended for such small cluster. 
> Also, I'm looking forward to read any experience of other users using a 
> similiar configuration.
> 
> Some technical info:
> 
> - Ceph version: 10.2.5
> 
> - OSDs have SSD journal (one SSD disk per 2 OSDs) and have a spindle for 
> backend disk.
> 
> - Using CFQ disk queue scheduler
> 
> - OSD configuration excerpt:
> 
> osd_recovery_max_active = 1
> osd_recovery_op_priority = 63
> osd_client_op_priority = 1
> osd_mkfs_options = -f -i size=2048 -n size=64k
> osd_mount_options_xfs = inode64,noatime,logbsize=256k
> osd_journal_size = 20480
> osd_op_threads = 12
> osd_disk_threads = 1
> osd_disk_thread_ioprio_class = idle
> osd_disk_thread_ioprio_priority = 7
> osd_scrub_begin_hour = 3
> osd_scrub_end_hour = 8
> osd_scrub_during_recovery = false
> filestore_merge_threshold = 40
> filestore_split_multiple = 8
> filestore_xattr_use_omap = true
> filestore_queue_max_ops = 2500
> filestore_min_sync_interval = 0.01
> filestore_max_sync_interval = 0.1
> filestore_journal_writeahead = true
> 
> Best regards,___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Exclusive-lock Ceph

2017-08-22 Thread lista

Dears,
 
A few days ago I read about the commands rbd lock add and rbd lock
remove. Will these commands stay maintained in future Ceph versions, or is
exclusive-lock the better way to use locking in Ceph, with these commands
becoming deprecated?
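
For context, the advisory lock commands I mean are used roughly like this
(the pool name and locker ID are just examples):

rbd lock add rbd/test-xlock3 mylock
rbd lock list rbd/test-xlock3
rbd lock remove rbd/test-xlock3 mylock client.4390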
 
Thanks a Lot,
Marcelo

Em 24/07/2017, Jason Dillaman  escreveu:
> On Mon, Jul 24, 2017 at 2:15 PM,   wrote: 
> > 2 questions, 
> > 
> > 
> > 
> > 1 In the moment i use kernel 4.10, the exclusive-lock not works fine in 
> > kernel's version less than < 4.12, right ? 
> 
> Exclusive lock should work just fine under 4.10 -- but you are trying 
> to use the new "exclusive" map option that is only available starting 
> with kernel 4.12. 
> 
> > 2 The comand  with exclusive 
> > would this  ? 
> > 
> > rbd map --exclusive test-xlock3 
> 
> Yes, that should be it. 
> 
> > Thanks a Lot, 
> > 
> > Marcelo 
> > 
> > 
> > Em 24/07/2017, Jason Dillaman  escreveu: 
> >> You will need to pass the "exclusive" option when running "rbd map" 
> >> (and be running kernel >= 4.12). 
> >> 
> >> On Mon, Jul 24, 2017 at 8:42 AM,   wrote: 
> >> > I'm testing ceph in my enviroment, but the feature exclusive lock don't 
> >> > works fine for me or maybe i'm doing something wrong. 
> >> > 
> >> > I testing in two machines create one image with exclusive-lock enable, 
> >> > if I 
> >> > understood correctly, with this feature, one machine only can mount and 
> >> > write in image at time. 
> >> > 
> >> > But When I'm testing, i saw the lock always is move to machine that try 
> >> > mount the volume lastly 
> >> > 
> >> > Example if i try mount the image in machine1 i see ip the machine1 and i 
> >> > mount the volume in machine1 : 
> >> > #rbd lock list test-xlock3 
> >> > There is 1 exclusive lock on this image. 
> >> > Locker      ID                        Address 
> >> > client.4390 auto  192.168.0.1:0/2940167630 
> >> > 
> >> > But if now i running rbd map and try mount image in machine2, the lock 
> >> > is 
> >> > change to machine2, and i believe this is one error, because if lock 
> >> > already 
> >> > in machine one and i write in image, the machine2 don't should can mount 
> >> > the 
> >> > same image in the same time. 
> >> > If i running in machine2 now, i see : 
> >> > 
> >> > #rbd lock list test-xlock3 
> >> > There is 1 exclusive lock on this image. 
> >> > Locker      ID                        Address 
> >> > client.4491 auto XX 192.168.0.2:0/1260424031 
> >> > 
> >> > 
> >> > 
> >> > Exclusive-lock enable in my image : 
> >> > 
> >> > rbd info  test-xlock3 | grep features 
> >> > features: exclusive-lock 
> >> > 
> >> > 
> >> > i'm doing some wrong ? Existing some conf, to add in ceph.conf, to fix 
> >> > this, 
> >> > if one machine mount the volume, the machine2 don't can in the same 
> >> > time, i 
> >> > read about command rbd 
> >> > lock, but this command seem deprecated. 
> >> > 
> >> > 
> >> > 
> >> > Thanks, a lot. 
> >> > Marcelo 
> >> > 
> >> > 
> >> > ___ 
> >> > ceph-users mailing list 
> >> > ceph-users@lists.ceph.com 
> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> >> > 
> >> 
> >> 
> >> 
> >> -- 
> >> Jason 
> 
> 
> 
> -- 
> Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Small-cluster performance issues

2017-08-22 Thread Mazzystr
Also examine your network layout.  Any saturation in the private cluster
network or client-facing network will be felt in clients / libvirt /
virtual machines.

As OSD count increases...

   - Ensure separation of the client network and the private cluster
   network - different NICs, different wires, different switches (a
   ceph.conf sketch follows the list).
   - Add more NICs on both the client side and the private cluster network
   side and lag them.
   - If/when your dept's budget suddenly swells...implement 10 gig-e.
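
A minimal sketch of that split in ceph.conf (the subnets are placeholders):

[global]
public network = 192.168.10.0/24      # client-facing traffic
cluster network = 192.168.20.0/24     # OSD replication and heartbeat traffic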


Monitor, capacity plan, execute  :)

/Chris C

On Tue, Aug 22, 2017 at 3:02 PM, Maged Mokhtar  wrote:

>
>
> It is likely your 2 spinning disks cannot keep up with the load. Things
> are likely to improve if you double your OSDs hooking them up to your
> existing SSD journal. Technically it would be nice to run a
> load/performance tool (either atop/collectl/sysstat) and measure how busy
> your resources are, but it is most likely your 2 spinning disks will show
> near 100% busy utilization.
>
> filestore_max_sync_interval: i do not recommend decreasing this to 0.1, i
> would keep it at 5 sec
>
> osd_op_threads do not increase this unless you have enough cores.
>
> but adding disks is the way to go
>
> Maged
>
>
>
> On 2017-08-22 20:08, fcid wrote:
>
> Hello everyone,
>
> I've been using ceph to provide storage using RBD for 60 KVM virtual
> machines running on proxmox.
>
> The ceph cluster we have is very small (2 OSDs + 1 mon per node, and a
> total of 3 nodes) and we are having some performace issues, like big
> latency times (apply lat:~0.5 s; commit lat: 0.001 s), which get worse by
> the weekly deep-scrubs.
>
> I wonder if doubling the numbers of OSDs would improve latency times, or
> if there is any other configuration tweak recommended for such small
> cluster. Also, I'm looking forward to read any experience of other users
> using a similiar configuration.
>
> Some technical info:
>
>   - Ceph version: 10.2.5
>
>   - OSDs have SSD journal (one SSD disk per 2 OSDs) and have a spindle for
> backend disk.
>
>   - Using CFQ disk queue scheduler
>
>   - OSD configuration excerpt:
>
> osd_recovery_max_active = 1
> osd_recovery_op_priority = 63
> osd_client_op_priority = 1
> osd_mkfs_options = -f -i size=2048 -n size=64k
> osd_mount_options_xfs = inode64,noatime,logbsize=256k
> osd_journal_size = 20480
> osd_op_threads = 12
> osd_disk_threads = 1
> osd_disk_thread_ioprio_class = idle
> osd_disk_thread_ioprio_priority = 7
> osd_scrub_begin_hour = 3
> osd_scrub_end_hour = 8
> osd_scrub_during_recovery = false
> filestore_merge_threshold = 40
> filestore_split_multiple = 8
> filestore_xattr_use_omap = true
> filestore_queue_max_ops = 2500
> filestore_min_sync_interval = 0.01
> filestore_max_sync_interval = 0.1
> filestore_journal_writeahead = true
>
> Best regards,
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Help with file system with failed mds daemon

2017-08-22 Thread Bryan Banister
Hi John,



Seems like you're right... strange that it seemed to work with only one mds 
before I shut the cluster down.  Here is the `ceph fs get` output for the two 
file systems:



[root@carf-ceph-osd15 ~]# ceph fs get carf_ceph_kube01

Filesystem 'carf_ceph_kube01' (2)

fs_name carf_ceph_kube01

epoch   22

flags   8

created 2017-08-21 12:10:57.948579

modified2017-08-21 12:10:57.948579

tableserver 0

root0

session_timeout 60

session_autoclose   300

max_file_size   1099511627776

last_failure0

last_failure_osd_epoch  1218

compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable 
ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses 
versioned encoding,6=dirfrag is stored in omap,8=file layout v2}

max_mds 1

in  0

up  {}

failed  0

damaged

stopped

data_pools  [23]

metadata_pool   24

inline_data disabled

balancer

standby_count_wanted0

[root@carf-ceph-osd15 ~]#

[root@carf-ceph-osd15 ~]# ceph fs get carf_ceph02

Filesystem 'carf_ceph02' (1)

fs_name carf_ceph02

epoch   26

flags   8

created 2017-08-18 14:20:50.152054

modified2017-08-18 14:20:50.152054

tableserver 0

root0

session_timeout 60

session_autoclose   300

max_file_size   1099511627776

last_failure0

last_failure_osd_epoch  1198

compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable 
ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses 
versioned encoding,6=dirfrag is stored in omap,8=file layout v2}

max_mds 1

in  0

up  {0=474299}

failed

damaged

stopped

data_pools  [21]

metadata_pool   22

inline_data disabled

balancer

standby_count_wanted0

474299: 7.128.13.69:6800/304042158 'carf-ceph-osd15' mds.0.23 up:active seq 5



I also looked into trying to specify the mds_namespace option to the mount 
operation (http://docs.ceph.com/docs/master/cephfs/kernel/) but that doesn’t 
seem to be valid:

[ceph-admin@carf-ceph-osd04 ~]$ sudo mount -t ceph carf-ceph-osd15:6789:/ 
/mnt/carf_ceph02/ -o 
mds_namespace=carf_ceph02,name=cephfs.k8test,secretfile=k8test.secret

mount error 22 = Invalid argument



Thanks,

-Bryan



-Original Message-
From: John Spray [mailto:jsp...@redhat.com]
Sent: Tuesday, August 22, 2017 11:18 AM
To: Bryan Banister 
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Help with file system with failed mds daemon



Note: External Email

-



On Tue, Aug 22, 2017 at 4:58 PM, Bryan Banister

mailto:bbanis...@jumptrading.com>> wrote:

> Hi all,

>

>

>

> I’m still new to ceph and cephfs.  Trying out the multi-fs configuration on

> at Luminous test cluster.  I shutdown the cluster to do an upgrade and when

> I brought the cluster back up I now have a warnings that one of the file

> systems has a failed mds daemon:

>

>

>

> 2017-08-21 17:00:00.81 mon.carf-ceph-osd15 [WRN] overall HEALTH_WARN 1

> filesystem is degraded; 1 filesystem is have a failed mds daemon; 1 pools

> have many more objects per pg than average; application not enabled on 9

> pool(s)

>

>

>

> I tried restarting the mds service on the system and it doesn’t seem to

> indicate any problems:

>

> 2017-08-21 16:13:40.979449 7fffed8b0700  1 mds.0.20 shutdown: shutting down

> rank 0

>

> 2017-08-21 16:13:41.012167 77fde1c0  0 set uid:gid to 167:167

> (ceph:ceph)

>

> 2017-08-21 16:13:41.012180 77fde1c0  0 ceph version 12.1.4

> (a5f84b37668fc8e03165aaf5cbb380c78e4deba4) luminous (rc), process (unknown),

> pid 16656

>

> 2017-08-21 16:13:41.014105 77fde1c0  0 pidfile_write: ignore empty

> --pid-file

>

> 2017-08-21 16:13:45.541442 710b7700  1 mds.0.23 handle_mds_map i am now

> mds.0.23

>

> 2017-08-21 16:13:45.541449 710b7700  1 mds.0.23 handle_mds_map state

> change up:boot --> up:replay

>

> 2017-08-21 16:13:45.541459 710b7700  1 mds.0.23 replay_start

>

> 2017-08-21 16:13:45.541466 710b7700  1 mds.0.23  recovery set is

>

> 2017-08-21 16:13:45.541475 710b7700  1 mds.0.23  waiting for osdmap 1198

> (which blacklists prior instance)

>

> 2017-08-21 16:13:45.565779 7fffea8aa700  0 mds.0.cache creating system inode

> with ino:0x100

>

> 2017-08-21 16:13:45.565920 7fffea8aa700  0 mds.0.cache creating system inode

> with ino:0x1

>

> 2017-08-21 16:13:45.571747 7fffe98a8700  1 mds.0.23 replay_done

>

> 2017-08-21 16:13:45.571751 7fffe98a8700  1 mds.0.23 making mds journal

> writeable

>

> 2017-08-21 16:13:46.542148 710b7700  1 mds.0.23 handle_mds_map i am now

> mds.0.23

>

> 2017-08-21 16:13:46.542149 710b7700  1 mds.0.23 handle_mds_map state

> change up:replay --> up:reconnect

>

> 2017-08-21 16:13:46.542158 710b7700  1 mds.0.23 reconnect_start

>

> 2017-08-21 16:13:46.542161 710b7700  1 mds.0.23 reopen_log

>

> 2017-08-21 16:13:46.542171 710b7700  1 mds.0.23 reconnect_done

>

> 2017-08-21 16:13:47.543612 710b770

Re: [ceph-users] Small-cluster performance issues

2017-08-22 Thread David Turner
I would run some benchmarking throughout the cluster environment to see
where your bottlenecks are before putting time and money into something
that might not be your limiting resource.  Sébastien Han put together a
great guide for benchmarking your cluster here:

https://www.sebastien-han.fr/blog/2012/08/26/ceph-benchmarks/

Test to see whether it's your client configuration, the client's connection
to the cluster, the cluster's disks, the cluster's overall speed, etc.
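
For the raw-cluster part, a couple of quick tests along those lines (pool
name and runtime are arbitrary):

rados bench -p rbd 60 write --no-cleanup    # raw write throughput and latency
rados bench -p rbd 60 seq                   # sequential reads of the objects just written
rados -p rbd cleanup                        # remove the benchmark objects afterwards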

On Tue, Aug 22, 2017 at 3:31 PM Mazzystr  wrote:

> Also examine your network layout.  Any saturation in the private cluster
> network or client facing network will be felt in clients / libvirt /
> virtual machines
>
> As OSD count increases...
>
>- Ensure client network private cluster network seperation - different
>nics, different wires, different switches
>- Add more nics both client side and private cluster network side and
>lag them.
>- If/When your dept's budget suddenly swells...implement 10 gig-e.
>
>
> Monitor, capacity plan, execute  :)
>
> /Chris C
>
> On Tue, Aug 22, 2017 at 3:02 PM, Maged Mokhtar 
> wrote:
>
>>
>>
>> It is likely your 2 spinning disks cannot keep up with the load. Things
>> are likely to improve if you double your OSDs hooking them up to your
>> existing SSD journal. Technically it would be nice to run a
>> load/performance tool (either atop/collectl/sysstat) and measure how busy
>> your resources are, but it is most likely your 2 spinning disks will show
>> near 100% busy utilization.
>>
>> filestore_max_sync_interval: i do not recommend decreasing this to 0.1, i
>> would keep it at 5 sec
>>
>> osd_op_threads do not increase this unless you have enough cores.
>>
>> but adding disks is the way to go
>>
>> Maged
>>
>>
>>
>> On 2017-08-22 20:08, fcid wrote:
>>
>> Hello everyone,
>>
>> I've been using ceph to provide storage using RBD for 60 KVM virtual
>> machines running on proxmox.
>>
>> The ceph cluster we have is very small (2 OSDs + 1 mon per node, and a
>> total of 3 nodes) and we are having some performace issues, like big
>> latency times (apply lat:~0.5 s; commit lat: 0.001 s), which get worse by
>> the weekly deep-scrubs.
>>
>> I wonder if doubling the numbers of OSDs would improve latency times, or
>> if there is any other configuration tweak recommended for such small
>> cluster. Also, I'm looking forward to read any experience of other users
>> using a similiar configuration.
>>
>> Some technical info:
>>
>>   - Ceph version: 10.2.5
>>
>>   - OSDs have SSD journal (one SSD disk per 2 OSDs) and have a spindle
>> for backend disk.
>>
>>   - Using CFQ disk queue scheduler
>>
>>   - OSD configuration excerpt:
>>
>> osd_recovery_max_active = 1
>> osd_recovery_op_priority = 63
>> osd_client_op_priority = 1
>> osd_mkfs_options = -f -i size=2048 -n size=64k
>> osd_mount_options_xfs = inode64,noatime,logbsize=256k
>> osd_journal_size = 20480
>> osd_op_threads = 12
>> osd_disk_threads = 1
>> osd_disk_thread_ioprio_class = idle
>> osd_disk_thread_ioprio_priority = 7
>> osd_scrub_begin_hour = 3
>> osd_scrub_end_hour = 8
>> osd_scrub_during_recovery = false
>> filestore_merge_threshold = 40
>> filestore_split_multiple = 8
>> filestore_xattr_use_omap = true
>> filestore_queue_max_ops = 2500
>> filestore_min_sync_interval = 0.01
>> filestore_max_sync_interval = 0.1
>> filestore_journal_writeahead = true
>>
>> Best regards,
>>
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Help with file system with failed mds daemon

2017-08-22 Thread John Spray
On Tue, Aug 22, 2017 at 8:49 PM, Bryan Banister
 wrote:
> Hi John,
>
>
>
> Seems like you're right... strange that it seemed to work with only one mds
> before I shut the cluster down.  Here is the `ceph fs get` output for the
> two file systems:
>
>
>
> [root@carf-ceph-osd15 ~]# ceph fs get carf_ceph_kube01
>
> Filesystem 'carf_ceph_kube01' (2)
>
> fs_name carf_ceph_kube01
>
> epoch   22
>
> flags   8
>
> created 2017-08-21 12:10:57.948579
>
> modified2017-08-21 12:10:57.948579
>
> tableserver 0
>
> root0
>
> session_timeout 60
>
> session_autoclose   300
>
> max_file_size   1099511627776
>
> last_failure0
>
> last_failure_osd_epoch  1218
>
> compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable
> ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds
> uses versioned encoding,6=dirfrag is stored in omap,8=file layout v2}
>
> max_mds 1
>
> in  0
>
> up  {}
>
> failed  0
>
> damaged
>
> stopped
>
> data_pools  [23]
>
> metadata_pool   24
>
> inline_data disabled
>
> balancer
>
> standby_count_wanted0
>
> [root@carf-ceph-osd15 ~]#
>
> [root@carf-ceph-osd15 ~]# ceph fs get carf_ceph02
>
> Filesystem 'carf_ceph02' (1)
>
> fs_name carf_ceph02
>
> epoch   26
>
> flags   8
>
> created 2017-08-18 14:20:50.152054
>
> modified2017-08-18 14:20:50.152054
>
> tableserver 0
>
> root0
>
> session_timeout 60
>
> session_autoclose   300
>
> max_file_size   1099511627776
>
> last_failure0
>
> last_failure_osd_epoch  1198
>
> compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable
> ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds
> uses versioned encoding,6=dirfrag is stored in omap,8=file layout v2}
>
> max_mds 1
>
> in  0
>
> up  {0=474299}
>
> failed
>
> damaged
>
> stopped
>
> data_pools  [21]
>
> metadata_pool   22
>
> inline_data disabled
>
> balancer
>
> standby_count_wanted0
>
> 474299: 7.128.13.69:6800/304042158 'carf-ceph-osd15' mds.0.23 up:active seq
> 5

In that instance, it's not complaining because one of the filesystems
has never had an MDS.
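
A sketch of the usual remedy, assuming you have a spare node to run another
MDS daemon on (the hostname below is hypothetical):

systemctl start ceph-mds@carf-ceph-osd16
ceph fs status        # both filesystems should then show an active MDS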


> I also looked into trying to specify the mds_namespace option to the mount
> operation (http://docs.ceph.com/docs/master/cephfs/kernel/) but that doesn’t
> seem to be valid:
>
> [ceph-admin@carf-ceph-osd04 ~]$ sudo mount -t ceph carf-ceph-osd15:6789:/
> /mnt/carf_ceph02/ -o
> mds_namespace=carf_ceph02,name=cephfs.k8test,secretfile=k8test.secret
>
> mount error 22 = Invalid argument

It's likely that you are using an older kernel that doesn't have
support for the feature.  It was added in linux 4.8.
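
You can confirm the running kernel with uname -r.  On older kernels,
ceph-fuse can select the filesystem instead; a sketch, using the
Luminous-era option name and a placeholder mount point:

ceph-fuse --client_mds_namespace=carf_ceph02 /mnt/carf_ceph02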

John

>
>
> Thanks,
>
> -Bryan
>
>
>
> -Original Message-
> From: John Spray [mailto:jsp...@redhat.com]
> Sent: Tuesday, August 22, 2017 11:18 AM
> To: Bryan Banister 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Help with file system with failed mds daemon
>
>
>
> Note: External Email
>
> -
>
>
>
> On Tue, Aug 22, 2017 at 4:58 PM, Bryan Banister
>
>  wrote:
>
>> Hi all,
>
>>
>
>>
>
>>
>
>> I’m still new to ceph and cephfs.  Trying out the multi-fs configuration
>> on
>
>> at Luminous test cluster.  I shutdown the cluster to do an upgrade and
>> when
>
>> I brought the cluster back up I now have a warnings that one of the file
>
>> systems has a failed mds daemon:
>
>>
>
>>
>
>>
>
>> 2017-08-21 17:00:00.81 mon.carf-ceph-osd15 [WRN] overall HEALTH_WARN 1
>
>> filesystem is degraded; 1 filesystem is have a failed mds daemon; 1 pools
>
>> have many more objects per pg than average; application not enabled on 9
>
>> pool(s)
>
>>
>
>>
>
>>
>
>> I tried restarting the mds service on the system and it doesn’t seem to
>
>> indicate any problems:
>
>>
>
>> 2017-08-21 16:13:40.979449 7fffed8b0700  1 mds.0.20 shutdown: shutting
>> down
>
>> rank 0
>
>>
>
>> 2017-08-21 16:13:41.012167 77fde1c0  0 set uid:gid to 167:167
>
>> (ceph:ceph)
>
>>
>
>> 2017-08-21 16:13:41.012180 77fde1c0  0 ceph version 12.1.4
>
>> (a5f84b37668fc8e03165aaf5cbb380c78e4deba4) luminous (rc), process
>> (unknown),
>
>> pid 16656
>
>>
>
>> 2017-08-21 16:13:41.014105 77fde1c0  0 pidfile_write: ignore empty
>
>> --pid-file
>
>>
>
>> 2017-08-21 16:13:45.541442 710b7700  1 mds.0.23 handle_mds_map i am
>> now
>
>> mds.0.23
>
>>
>
>> 2017-08-21 16:13:45.541449 710b7700  1 mds.0.23 handle_mds_map state
>
>> change up:boot --> up:replay
>
>>
>
>> 2017-08-21 16:13:45.541459 710b7700  1 mds.0.23 replay_start
>
>>
>
>> 2017-08-21 16:13:45.541466 710b7700  1 mds.0.23  recovery set is
>
>>
>
>> 2017-08-21 16:13:45.541475 710b7700  1 mds.0.23  waiting for osdmap
>> 1198
>
>> (which blacklists prior instance)
>
>>
>
>> 2017-08-21 16:13:45.565779 7fffea8aa700  0 mds.0.cache creating system
>> inode
>
>> with ino:0x100
>
>>
>
>> 2017-08-21 16:13:45.565920 7fffea8aa700  0 mds.0.cache creating system
>> inode
>
>> with ino:0x1
>
>>
>
>> 2017-08-21 16:13:45.5717

Re: [ceph-users] Help with file system with failed mds daemon

2017-08-22 Thread Bryan Banister
All sounds right to me... looks like this is a little too bleeding edge for my 
taste!  I'll probably drop it at this point and just wait till we are actually 
on a 4.8 kernel before checking on status again.

Thanks for your help!
-Bryan

-Original Message-
From: John Spray [mailto:jsp...@redhat.com]
Sent: Tuesday, August 22, 2017 2:56 PM
To: Bryan Banister 
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Help with file system with failed mds daemon

Note: External Email
-

On Tue, Aug 22, 2017 at 8:49 PM, Bryan Banister
 wrote:
> Hi John,
>
>
>
> Seems like you're right... strange that it seemed to work with only one mds
> before I shut the cluster down.  Here is the `ceph fs get` output for the
> two file systems:
>
>
>
> [root@carf-ceph-osd15 ~]# ceph fs get carf_ceph_kube01
>
> Filesystem 'carf_ceph_kube01' (2)
>
> fs_name carf_ceph_kube01
>
> epoch   22
>
> flags   8
>
> created 2017-08-21 12:10:57.948579
>
> modified2017-08-21 12:10:57.948579
>
> tableserver 0
>
> root0
>
> session_timeout 60
>
> session_autoclose   300
>
> max_file_size   1099511627776
>
> last_failure0
>
> last_failure_osd_epoch  1218
>
> compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable
> ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds
> uses versioned encoding,6=dirfrag is stored in omap,8=file layout v2}
>
> max_mds 1
>
> in  0
>
> up  {}
>
> failed  0
>
> damaged
>
> stopped
>
> data_pools  [23]
>
> metadata_pool   24
>
> inline_data disabled
>
> balancer
>
> standby_count_wanted0
>
> [root@carf-ceph-osd15 ~]#
>
> [root@carf-ceph-osd15 ~]# ceph fs get carf_ceph02
>
> Filesystem 'carf_ceph02' (1)
>
> fs_name carf_ceph02
>
> epoch   26
>
> flags   8
>
> created 2017-08-18 14:20:50.152054
>
> modified2017-08-18 14:20:50.152054
>
> tableserver 0
>
> root0
>
> session_timeout 60
>
> session_autoclose   300
>
> max_file_size   1099511627776
>
> last_failure0
>
> last_failure_osd_epoch  1198
>
> compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable
> ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds
> uses versioned encoding,6=dirfrag is stored in omap,8=file layout v2}
>
> max_mds 1
>
> in  0
>
> up  {0=474299}
>
> failed
>
> damaged
>
> stopped
>
> data_pools  [21]
>
> metadata_pool   22
>
> inline_data disabled
>
> balancer
>
> standby_count_wanted0
>
> 474299: 7.128.13.69:6800/304042158 'carf-ceph-osd15' mds.0.23 up:active seq
> 5

In that instance, it's not complaining because one of the filesystems
has never had an MDS.


> I also looked into trying to specify the mds_namespace option to the mount
> operation (http://docs.ceph.com/docs/master/cephfs/kernel/) but that doesn’t
> seem to be valid:
>
> [ceph-admin@carf-ceph-osd04 ~]$ sudo mount -t ceph carf-ceph-osd15:6789:/
> /mnt/carf_ceph02/ -o
> mds_namespace=carf_ceph02,name=cephfs.k8test,secretfile=k8test.secret
>
> mount error 22 = Invalid argument

It's likely that you are using an older kernel that doesn't have
support for the feature.  It was added in linux 4.8.

John

>
>
> Thanks,
>
> -Bryan
>
>
>
> -Original Message-
> From: John Spray [mailto:jsp...@redhat.com]
> Sent: Tuesday, August 22, 2017 11:18 AM
> To: Bryan Banister 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Help with file system with failed mds daemon
>
>
>
> Note: External Email
>
> -
>
>
>
> On Tue, Aug 22, 2017 at 4:58 PM, Bryan Banister
>
>  wrote:
>
>> Hi all,
>
>>
>
>>
>
>>
>
>> I’m still new to ceph and cephfs.  Trying out the multi-fs configuration
>> on
>
>> at Luminous test cluster.  I shutdown the cluster to do an upgrade and
>> when
>
>> I brought the cluster back up I now have a warnings that one of the file
>
>> systems has a failed mds daemon:
>
>>
>
>>
>
>>
>
>> 2017-08-21 17:00:00.81 mon.carf-ceph-osd15 [WRN] overall HEALTH_WARN 1
>
>> filesystem is degraded; 1 filesystem is have a failed mds daemon; 1 pools
>
>> have many more objects per pg than average; application not enabled on 9
>
>> pool(s)
>
>>
>
>>
>
>>
>
>> I tried restarting the mds service on the system and it doesn’t seem to
>
>> indicate any problems:
>
>>
>
>> 2017-08-21 16:13:40.979449 7fffed8b0700  1 mds.0.20 shutdown: shutting
>> down
>
>> rank 0
>
>>
>
>> 2017-08-21 16:13:41.012167 77fde1c0  0 set uid:gid to 167:167
>
>> (ceph:ceph)
>
>>
>
>> 2017-08-21 16:13:41.012180 77fde1c0  0 ceph version 12.1.4
>
>> (a5f84b37668fc8e03165aaf5cbb380c78e4deba4) luminous (rc), process
>> (unknown),
>
>> pid 16656
>
>>
>
>> 2017-08-21 16:13:41.014105 77fde1c0  0 pidfile_write: ignore empty
>
>> --pid-file
>
>>
>
>> 2017-08-21 16:13:45.541442 710b7700  1 mds.0.23 handle_mds_map i am
>> now
>
>> mds.0.23
>
>>
>
>> 2017-08-21 16:13:45.541449 710b7700  1 mds.0.23 handle_mds_map state
>
>> change up:boot --> u

[ceph-users] Anybody gotten boto3 and ceph RGW working?

2017-08-22 Thread Bryan Banister
Hello,

I have the boto Python API working with our Ceph cluster but haven't figured
out a way to get boto3 to communicate with our RGWs yet.  Does anybody have a
simple example?

Cheers for any help!
-Bryan
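
In case it helps anyone searching the archives, here is a minimal boto3
sketch with the same shape as a classic boto connection (endpoint, keys and
bucket name are placeholders):

import boto3

s3 = boto3.client(
    's3',
    endpoint_url='http://rgw.example.com:7480',   # RGW endpoint
    aws_access_key_id='ACCESS_KEY',
    aws_secret_access_key='SECRET_KEY',
)

# list the buckets owned by this user
for bucket in s3.list_buckets().get('Buckets', []):
    print(bucket['Name'])

# upload a small test object
s3.put_object(Bucket='my-bucket', Key='hello.txt', Body=b'hello from boto3')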



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Small-cluster performance issues

2017-08-22 Thread fcid

Thanks for your advices Maged, Chris

I'll answer below


On 08/22/2017 04:30 PM, Mazzystr wrote:
Also examine your network layout.  Any saturation in the private 
cluster network or client facing network will be felt in clients / 
libvirt / virtual machines


As OSD count increases...

  * Ensure client network private cluster network seperation -
different nics, different wires, different switches
  * Add more nics both client side and private cluster network side
and lag them.
  * If/When your dept's budget suddenly swells...implement 10 gig-e.

We have different NICs for each network, but they are connected to the
same switch. On that switch the two networks are logically separated by
VLANs. The switch does not look saturated for now (it is a 10 Gbit-E
switch), but using the same switch may become a problem as the OSD count
increases.


Monitor, capacity plan, execute  :)

/Chris C

On Tue, Aug 22, 2017 at 3:02 PM, Maged Mokhtar > wrote:


It is likely your 2 spinning disks cannot keep up with the load.
Things are likely to improve if you double your OSDs hooking them
up to your existing SSD journal. Technically it would be nice to
run a load/performance tool (either atop/collectl/sysstat) and
measure how busy your resources are, but it is most likely your 2
spinning disks will show near 100% busy utilization.

We have a monitoring "stack" built on collectd/graphite/grafana, and I
can see the spinning disks almost saturated when performing IO-heavy
tasks on the cluster.


filestore_max_sync_interval: i do not recommend decreasing this to
0.1, i would keep it at 5 sec

I'll increase this parameter today, since we have some maintenance work 
to do.


osd_op_threads do not increase this unless you have enough cores.


I'll look into this today too.


but adding disks is the way to go

Maged

On 2017-08-22 20:08, fcid wrote:


Hello everyone,

I've been using ceph to provide storage using RBD for 60 KVM
virtual machines running on proxmox.

The ceph cluster we have is very small (2 OSDs + 1 mon per node,
and a total of 3 nodes) and we are having some performace issues,
like big latency times (apply lat:~0.5 s; commit lat: 0.001 s),
which get worse by the weekly deep-scrubs.

I wonder if doubling the numbers of OSDs would improve latency
times, or if there is any other configuration tweak recommended
for such small cluster. Also, I'm looking forward to read any
experience of other users using a similiar configuration.

Some technical info:

  - Ceph version: 10.2.5

  - OSDs have SSD journal (one SSD disk per 2 OSDs) and have a
spindle for backend disk.

  - Using CFQ disk queue scheduler

  - OSD configuration excerpt:

osd_recovery_max_active = 1
osd_recovery_op_priority = 63
osd_client_op_priority = 1
osd_mkfs_options = -f -i size=2048 -n size=64k
osd_mount_options_xfs = inode64,noatime,logbsize=256k
osd_journal_size = 20480
osd_op_threads = 12
osd_disk_threads = 1
osd_disk_thread_ioprio_class = idle
osd_disk_thread_ioprio_priority = 7
osd_scrub_begin_hour = 3
osd_scrub_end_hour = 8
osd_scrub_during_recovery = false
filestore_merge_threshold = 40
filestore_split_multiple = 8
filestore_xattr_use_omap = true
filestore_queue_max_ops = 2500
filestore_min_sync_interval = 0.01
filestore_max_sync_interval = 0.1
filestore_journal_writeahead = true

Best regards,



___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





--
Fernando Cid O.
Ingeniero de Operaciones
AltaVoz S.A.
 http://www.altavoz.net
Viña del Mar, Valparaiso:
 2 Poniente 355 of 53
 +56 32 276 8060
Santiago:
 San Pío X 2460, oficina 304, Providencia
 +56 2 2585 4264

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Small-cluster performance issues

2017-08-22 Thread fcid

Hi David

I'll try to perform these tests soon.

Thank you.


On 08/22/2017 04:52 PM, David Turner wrote:
I would run some benchmarking throughout the cluster environment to 
see where your bottlenecks are before putting time and money into 
something that might not be your limiting resource.  Sebastian Han put 
together a great guide for benchmarking your cluster here.


https://www.sebastien-han.fr/blog/2012/08/26/ceph-benchmarks/

Test to see if it's your client configuration, client's connection to 
the cluster, the Cluster's disks, the cluster overall speed, etc.


On Tue, Aug 22, 2017 at 3:31 PM Mazzystr > wrote:


Also examine your network layout.  Any saturation in the private
cluster network or client facing network will be felt in clients /
libvirt / virtual machines

As OSD count increases...

  * Ensure client network private cluster network seperation -
different nics, different wires, different switches
  * Add more nics both client side and private cluster network
side and lag them.
  * If/When your dept's budget suddenly swells...implement 10 gig-e.


Monitor, capacity plan, execute  :)

/Chris C

On Tue, Aug 22, 2017 at 3:02 PM, Maged Mokhtar
mailto:mmokh...@petasan.org>> wrote:

It is likely your 2 spinning disks cannot keep up with the
load. Things are likely to improve if you double your OSDs
hooking them up to your existing SSD journal. Technically it
would be nice to run a load/performance tool (either
atop/collectl/sysstat) and measure how busy your resources
are, but it is most likely your 2 spinning disks will show
near 100% busy utilization.

filestore_max_sync_interval: i do not recommend decreasing
this to 0.1, i would keep it at 5 sec

osd_op_threads do not increase this unless you have enough cores.

but adding disks is the way to go

Maged

On 2017-08-22 20:08, fcid wrote:


Hello everyone,

I've been using ceph to provide storage using RBD for 60 KVM
virtual machines running on proxmox.

The ceph cluster we have is very small (2 OSDs + 1 mon per
node, and a total of 3 nodes) and we are having some
performace issues, like big latency times (apply lat:~0.5 s;
commit lat: 0.001 s), which get worse by the weekly deep-scrubs.

I wonder if doubling the numbers of OSDs would improve
latency times, or if there is any other configuration tweak
recommended for such small cluster. Also, I'm looking forward
to read any experience of other users using a similiar
configuration.

Some technical info:

  - Ceph version: 10.2.5

  - OSDs have SSD journal (one SSD disk per 2 OSDs) and have
a spindle for backend disk.

  - Using CFQ disk queue scheduler

  - OSD configuration excerpt:

osd_recovery_max_active = 1
osd_recovery_op_priority = 63
osd_client_op_priority = 1
osd_mkfs_options = -f -i size=2048 -n size=64k
osd_mount_options_xfs = inode64,noatime,logbsize=256k
osd_journal_size = 20480
osd_op_threads = 12
osd_disk_threads = 1
osd_disk_thread_ioprio_class = idle
osd_disk_thread_ioprio_priority = 7
osd_scrub_begin_hour = 3
osd_scrub_end_hour = 8
osd_scrub_during_recovery = false
filestore_merge_threshold = 40
filestore_split_multiple = 8
filestore_xattr_use_omap = true
filestore_queue_max_ops = 2500
filestore_min_sync_interval = 0.01
filestore_max_sync_interval = 0.1
filestore_journal_writeahead = true

Best regards,



___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--
Fernando Cid O.
Ingeniero de Operaciones
AltaVoz S.A.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-fuse hanging on df with ceph luminous >= 12.1.3

2017-08-22 Thread Patrick Donnelly
On Mon, Aug 21, 2017 at 5:37 PM, Alessandro De Salvo
 wrote:
> Hi,
>
> when trying to use df on a ceph-fuse mounted cephfs filesystem with ceph
> luminous >= 12.1.3 I'm having hangs with the following kind of messages in
> the logs:
>
>
> 2017-08-22 02:20:51.094704 7f80addb7700  0 client.174216 ms_handle_reset on
> 192.168.0.10:6789/0
>
>
> The logs are only showing this type of messages and nothing more useful. The
> only possible way to resume the operations is to kill ceph-fuse and remount.
> Only df is hanging though, while file operations, like copy/rm/ls are
> working as expected.
>
> This behavior is only shown for ceph >= 12.1.3, while for example ceph-fuse
> on 12.1.2 works.
>
> Anyone has seen the same problems? Any help is highly appreciated.

It could be caused by [1]. I don't see a particular reason why you
would experience a hang in the client. You can try adding "debug
client = 20" and "debug ms = 5" to your ceph.conf [2] to get more
information.

[1] https://github.com/ceph/ceph/pull/16378/
[2] http://docs.ceph.com/docs/master/rados/troubleshooting/log-and-debug/
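
That would be something like the following in ceph.conf on the client host
(section placement assumed):

[client]
debug client = 20
debug ms = 5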

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pros/cons of multiple OSD's per host

2017-08-22 Thread Nick Tan
Thanks for the advice Christian.  I think I'm leaning more towards the
'traditional' storage server with 12 disks - as you say they give a lot
more flexibility with the performance tuning/network options etc.

The cache pool is an interesting idea but as you say it can get quite
expensive for the capacities we're looking at.  I'm interested in how
bluestore performs without a flash/SSD WAL/DB.  In my small scale testing
it seems much better than filestore so I was planning on building something
without any flash/SSD.  There's always the option of adding it later if
required.

Thanks,
Nick

On Tue, Aug 22, 2017 at 6:56 PM, Christian Balzer  wrote:

>
> Hello,
>
> On Tue, 22 Aug 2017 16:51:47 +0800 Nick Tan wrote:
>
> > Hi Christian,
> >
> >
> >
> > > > Hi David,
> > > >
> > > > The planned usage for this CephFS cluster is scratch space for an
> image
> > > > processing cluster with 100+ processing nodes.
> > >
> > > Lots of clients, how much data movement would you expect, how many
> images
> > > come in per timeframe, lets say an hour?
> > > Typical size of a image?
> > >
> > > Does an image come in and then gets processed by one processing node?
> > > Unlikely to be touched again, at least in the short term?
> > > Probably being deleted after being processed?
> > >
> >
> > We'd typically get up to 6TB of raw imagery per day at an average image
> > size of 20MB.  There's a complex multi stage processing chain that
> happens
> > - typically images are read by multiple nodes with intermediate data
> > generated and processed again by multiple nodes.  This would generate
> about
> > 30TB of intermediate data.  The end result would be around 9TB of final
> > processed data.  Once the processing is complete and the final data is
> > copied off and completed QA, the entire data set is deleted.  The data
> sets
> > could remain on the file system for up to 2 weeks before deletion.
> >
>
> If this is a more or less sequential processes w/o too many spikes, a hot
> (daily) SSD pool or cache-tier may work wonders.
> 45TB of flash storage would be a bit spendy, though.
>
> 630TB total, lets call it 800, that's already 20 nodes with 12x 10TB HDDs.
>
> >
> >
> > > >  My thinking is we'd be
> > > > better off with a large number (100+) of storage hosts with 1-2 OSD's
> > > each,
> > > > rather than 10 or so storage nodes with 10+ OSD's to get better
> > > parallelism
> > > > but I don't have any practical experience with CephFS to really
> judge.
> > > CephFS is one thing (of which I have very limited experience), but at
> this
> > > point you're talking about parallelism in Ceph (RBD).
> > > And that happens much more on an OSD than host level.
> > >
> > > Which you _can_ achieve with larger nodes, if they're well designed.
> > > Meaning CPU/RAM/interal storage bandwidth/network bandwidth being in
> > > "harmony".
> > >
> >
> > I'm not sure what you mean about the RBD reference.  Does CephFS use RBD
> > internally?
> >
> RADOS, the underlying layer.
>
> >
> > >
> > > Also you keep talking about really huge HDDs, you could do worse than
> > > halving their size and doubling their numbers to achieve much more
> > > bandwidth and the ever crucial IOPS (even in your use case).
> > >
> > > So something like 20x 12 HDD servers, with SSDs/NVMes for
> journal/bluestore
> > > wAL/DB if you can afford or actually need it.
> > >
> > > CephFS metadata on a SSD pool isn't the most dramatic improvement one
> can
> > > do (or so people tell me), but given your budget it may be worthwhile.
> > >
> > >
> > Yes, I totally get the benefits of using greater numbers of smaller
> HDD's.
> > One of the requirements is to keep $/TB low and large capacity drives
> helps
> > with that.  I guess we need to look at the tradeoff of $/TB vs number of
> > spindles for performance.
> >
> Again, if it's mostly sequential the IOPS needs will be of course very
> different from a scenario where you get 100 images coming in at once while
> the processing nodes are munching on previous 100 ones.
>
> > If CephFS's parallelism happens more at the OSD level than the host level
> > then perhaps the 12 disk storage host would be fine as long as
> > "mon_osd_down_out_subtree_limit = host" and there's enough CPU/RAM/BUS
> and
> > Network bandwidth on the host.  I'm doing some cost comparisons of these
> > "big" servers vs multiple "small" servers such as the supermicro
> microcloud
> > chassis or the Ambedded Mars 200 ARM cluster (which looks very
> > interesting).
> The later at least avoids the usual issue of underpowered and high latency
> networking  with these kinds of designs (one from Supermicro comes to
> mind) tend to have, but 2GB RAM and CPU feel... weak
>
> Also you will have to buy an SSD for each in case you want/need journals
> (or fast WAL/DB with bluestore).
> Spendy and massively annoying if anything fails with these things (no
> hot-swap).
>
> > However, cost is not the sole consideration, so I'm hoping
> > to get an idea of performance differe

Re: [ceph-users] pros/cons of multiple OSD's per host

2017-08-22 Thread Christian Balzer
On Wed, 23 Aug 2017 13:38:25 +0800 Nick Tan wrote:

> Thanks for the advice Christian.  I think I'm leaning more towards the
> 'traditional' storage server with 12 disks - as you say they give a lot
> more flexibility with the performance tuning/network options etc.
> 
> The cache pool is an interesting idea but as you say it can get quite
> expensive for the capacities we're looking at.  I'm interested in how
> bluestore performs without a flash/SSD WAL/DB.  In my small scale testing
> it seems much better than filestore so I was planning on building something
> without any flash/SSD.  There's always the option of adding it later if
> required.
> 
Given the lack of double writes (for large writes) with Bluestore, that's
to be expected.

Since you're looking mostly at largish, sequential writes and reads, a
pure HDD cluster may be feasible.

Christian

> Thanks,
> Nick
> 
> On Tue, Aug 22, 2017 at 6:56 PM, Christian Balzer  wrote:
> 
> >
> > Hello,
> >
> > On Tue, 22 Aug 2017 16:51:47 +0800 Nick Tan wrote:
> >  
> > > Hi Christian,
> > >
> > >
> > >  
> > > > > Hi David,
> > > > >
> > > > > The planned usage for this CephFS cluster is scratch space for an  
> > image  
> > > > > processing cluster with 100+ processing nodes.  
> > > >
> > > > Lots of clients, how much data movement would you expect, how many  
> > images  
> > > > come in per timeframe, lets say an hour?
> > > > Typical size of a image?
> > > >
> > > > Does an image come in and then gets processed by one processing node?
> > > > Unlikely to be touched again, at least in the short term?
> > > > Probably being deleted after being processed?
> > > >  
> > >
> > > We'd typically get up to 6TB of raw imagery per day at an average image
> > > size of 20MB.  There's a complex multi stage processing chain that  
> > happens  
> > > - typically images are read by multiple nodes with intermediate data
> > > generated and processed again by multiple nodes.  This would generate  
> > about  
> > > 30TB of intermediate data.  The end result would be around 9TB of final
> > > processed data.  Once the processing is complete and the final data is
> > > copied off and completed QA, the entire data set is deleted.  The data  
> > sets  
> > > could remain on the file system for up to 2 weeks before deletion.
> > >  
> >
> > If this is a more or less sequential processes w/o too many spikes, a hot
> > (daily) SSD pool or cache-tier may work wonders.
> > 45TB of flash storage would be a bit spendy, though.
> >
> > 630TB total, lets call it 800, that's already 20 nodes with 12x 10TB HDDs.
> >  
> > >
> > >  
> > > > >  My thinking is we'd be
> > > > > better off with a large number (100+) of storage hosts with 1-2 OSD's 
> > > > >  
> > > > each,  
> > > > > rather than 10 or so storage nodes with 10+ OSD's to get better  
> > > > parallelism  
> > > > > but I don't have any practical experience with CephFS to really  
> > judge.  
> > > > CephFS is one thing (of which I have very limited experience), but at  
> > this  
> > > > point you're talking about parallelism in Ceph (RBD).
> > > > And that happens much more on an OSD than host level.
> > > >
> > > > Which you _can_ achieve with larger nodes, if they're well designed.
> > > > Meaning CPU/RAM/interal storage bandwidth/network bandwidth being in
> > > > "harmony".
> > > >  
> > >
> > > I'm not sure what you mean about the RBD reference.  Does CephFS use RBD
> > > internally?
> > >  
> > RADOS, the underlying layer.
> >  
> > >  
> > > >
> > > > Also you keep talking about really huge HDDs, you could do worse than
> > > > halving their size and doubling their numbers to achieve much more
> > > > bandwidth and the ever crucial IOPS (even in your use case).
> > > >
> > > > So something like 20x 12 HDD servers, with SSDs/NVMes for  
> > journal/bluestore  
> > > > wAL/DB if you can afford or actually need it.
> > > >
> > > > CephFS metadata on a SSD pool isn't the most dramatic improvement one  
> > can  
> > > > do (or so people tell me), but given your budget it may be worthwhile.
> > > >
> > > >  
> > > Yes, I totally get the benefits of using greater numbers of smaller  
> > HDD's.  
> > > One of the requirements is to keep $/TB low and large capacity drives  
> > helps  
> > > with that.  I guess we need to look at the tradeoff of $/TB vs number of
> > > spindles for performance.
> > >  
> > Again, if it's mostly sequential the IOPS needs will be of course very
> > different from a scenario where you get 100 images coming in at once while
> > the processing nodes are munching on previous 100 ones.
> >  
> > > If CephFS's parallelism happens more at the OSD level than the host level
> > > then perhaps the 12 disk storage host would be fine as long as
> > > "mon_osd_down_out_subtree_limit = host" and there's enough CPU/RAM/BUS  
> > and  
> > > Network bandwidth on the host.  I'm doing some cost comparisons of these
> > > "big" servers vs multiple "small" servers such as the supermicro  
> > microcloud