Re: [ceph-users] NVRAM cards as OSD journals

2016-05-22 Thread Adrian Saul

I am using Intel P3700DC 400G cards in a similar configuration (two per host) - 
perhaps you could look at cards of that capacity to meet your needs.

I would suggest that such small journals will mean you are constantly blocking on 
journal flushes, which will impact write performance and latency. You would be 
better off with larger journals to accommodate the expected throughput you are after.

Also, for redundancy I would suggest more than a single journal device - if you lose 
the journal you will need to rebuild all the OSDs on that host, which will be a 
significant performance impact and, depending on your replication level, opens up 
the risk of data loss should another OSD fail for whatever reason.




From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of EP 
Komarla
Sent: Saturday, 21 May 2016 1:53 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] NVRAM cards as OSD journals

Hi,

I am contemplating using a NVRAM card for OSD journals in place of SSD drives 
in our ceph cluster.

Configuration:

* 4 Ceph servers

* Each server has 24 OSDs (each OSD is a 1TB SAS drive)

* 1 PCIe NVRAM card of 16GB capacity per ceph server

* Both Client & cluster network is 10Gbps

As per the Ceph documentation:
The expected throughput number should include the expected disk throughput 
(i.e., sustained data transfer rate), and network throughput. For example, a 
7200 RPM disk will likely have approximately 100 MB/s. Taking the min() of the 
disk and network throughput should provide a reasonable expected throughput. 
Some users just start off with a 10GB journal size. For example:
osd journal size = 10000
Given that I have a single 16GB card per server that has to be carved up among all 
24 OSDs, I will have to configure each OSD journal to be much smaller, around 
600MB (16GB / 24 drives).  This value is much smaller than the 10GB/OSD journal 
that is generally used.  So, I am wondering if this configuration and journal 
size is valid.  Is there a performance benefit of having a journal that is this 
small?  Also, do I have to reduce the default "filestore max sync interval" from 
5 seconds to a smaller value, say 2 seconds, to match the smaller journal size?
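As a rough sketch of the sizing rule of thumb from the documentation quoted above 
(journal size of at least 2 x expected throughput x filestore max sync interval), 
using the ~100 MB/s per-disk figure - these numbers are illustrative only, not 
tested values:

# at the defaults: 2 x 100 MB/s x 5 s = ~1000 MB of journal per OSD
# with only ~600 MB per OSD available (16GB card / 24 OSDs), the sync
# interval would need to come down to roughly 600 / (2 x 100) = 3 s
[osd]
osd journal size = 600
filestore max sync interval = 3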

Have people used NVRAM cards in the Ceph clusters as journals?  What is their 
experience?

Any thoughts?





[ceph-users] RBD removal issue

2016-05-23 Thread Adrian Saul

A while back I attempted to create an RBD volume manually - intending it to be 
an exact size of another LUN of around 100G.  The command line instead took this 
to be the default MB argument for size and so I ended up with a 102400 TB 
volume.  Deletion was painfully slow (I never used the volume, it just seemed 
to spin on CPU for ages going through all the objects it thought it had) and 
the rbd rm command was interrupted a few times, but even after running for two 
months it still won't complete.

I still have the volume listed even though it appears to be otherwise gone from 
the RADOS view.  From what I can see there is only the rbd_header object 
remaining - can I just remove that directly or am I risking corrupting 
something else by not removing it using rbd rm?

Cheers,
 Adrian


[root@ceph-glb-fec-01 ~]# rbd info glebe-sata/oemprd01db_lun00
rbd image 'oemprd01db_lun00':
size 102400 TB in 26843545600 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.8d4ca65a5db37
format: 2
features: layering
flags:
[root@ceph-glb-fec-01 ~]# rados ls -p glebe-sata|grep rbd_data.8d4ca65a5db37
[root@ceph-glb-fec-01 ~]# rados ls -p glebe-sata|grep 8d4ca65a5db37
rbd_header.8d4ca65a5db37
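For reference, since the data objects are already gone, one approach is to remove 
the leftover bookkeeping objects by hand.  This is only a sketch - the object and 
omap key names below are inferred from the output above, so verify each one exists 
before deleting anything:

rados -p glebe-sata rm rbd_header.8d4ca65a5db37
rados -p glebe-sata rm rbd_id.oemprd01db_lun00          # if still present
rados -p glebe-sata rmomapkey rbd_directory id_8d4ca65a5db37
rados -p glebe-sata rmomapkey rbd_directory name_oemprd01db_lun00

After that the image should no longer appear in 'rbd ls' for the pool.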



Re: [ceph-users] RBD removal issue

2016-05-23 Thread Adrian Saul

Thanks - all sorted.


> -Original Message-
> From: Nick Fisk [mailto:n...@fisk.me.uk]
> Sent: Monday, 23 May 2016 6:58 PM
> To: Adrian Saul; ceph-users@lists.ceph.com
> Subject: RE: RBD removal issue
>
> See here:
>
> http://cephnotes.ksperis.com/blog/2014/07/04/remove-big-rbd-image
>
>
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> > Of Adrian Saul
> > Sent: 23 May 2016 09:37
> > To: 'ceph-users@lists.ceph.com' 
> > Subject: [ceph-users] RBD removal issue
> >
> >
> > A while back I attempted to create an RBD volume manually - intending
> > it
> to
> > be an exact size of another LUN around 100G.  The command line instead
> > took this to be the default MB argument for size and so I ended up
> > with a
> > 102400 TB volume.  Deletion was painfully slow (I never used the
> > volume,
> it
> > just seemed to spin on CPU for ages going through all the objects it
> thought it
> > had) and the rbd rm command was interrupted a few times, but even
> > after running for two months it still wont complete.
> >
> > I still have the volume listed even though it appears to be otherwise
> > gone from the RADOS view.  From what I can see there is only the
> > rbd_header object remaining - can I just remove that directly or am I
> > risking
> corrupting
> > something else by not removing it using rbd rm?
> >
> > Cheers,
> >  Adrian
> >
> >
> > [root@ceph-glb-fec-01 ~]# rbd info glebe-sata/oemprd01db_lun00 rbd
> > image 'oemprd01db_lun00':
> > size 102400 TB in 26843545600 objects
> > order 22 (4096 kB objects)
> > block_name_prefix: rbd_data.8d4ca65a5db37
> > format: 2
> > features: layering
> > flags:
> > [root@ceph-glb-fec-01 ~]# rados ls -p glebe-sata|grep
> > rbd_data.8d4ca65a5db37
> > [root@ceph-glb-fec-01 ~]# rados ls -p glebe-sata|grep 8d4ca65a5db37
> > rbd_header.8d4ca65a5db37
> >
> > Confidentiality: This email and any attachments are confidential and
> > may
> be
> > subject to copyright, legal or some other professional privilege. They
> > are intended solely for the attention and use of the named
> > addressee(s). They may only be copied, distributed or disclosed with
> > the consent of the copyright owner. If you have received this email by
> > mistake or by breach
> of
> > the confidentiality clause, please notify the sender immediately by
> > return email and delete or destroy all copies of the email. Any
> > confidentiality, privilege or copyright is not waived or lost because
> > this email has been
> sent
> > to you by mistake.
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



Re: [ceph-users] seqwrite gets good performance but random rw gets worse

2016-05-25 Thread Adrian Saul

Are you using image-format 2 RBD images?

We found a major performance hit using format 2 images under 10.2.0 today in 
some testing.  When we switched to using format 1 images we literally got 10x the 
random write IOPS (from around 1600 IOPS to roughly ten times that for the same test).
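If you want to reproduce the comparison, a minimal sketch (pool and image names 
here are just examples) is to create one image of each format and rerun the 
benchmark against both.  Limiting the format 2 image to the layering feature also 
helps rule out the newer features (exclusive-lock, object-map, fast-diff) as the 
cause of any slowdown:

rbd create --image-format 1 --size 10240 rbd/fiotest-v1
rbd create --image-format 2 --image-feature layering --size 10240 rbd/fiotest-v2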



From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Ken 
Peng
Sent: Wednesday, 25 May 2016 5:02 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] seqwrite gets good performance but random rw gets worse

Hello,
We have a cluster with 20+ hosts and 200+ OSDs, each OSD on a 4T SATA disk, no SSD cache.
OS is Ubuntu 16.04 LTS, ceph version 10.2.0.
Both the data network and the cluster network are 10Gbps.
We run ceph as a block storage service only (rbd client within a VM).
Testing within a VM with the sysbench tool, we see that sequential write gets a 
relatively good result (it can reach 170.37Mb/sec), but random read/write always 
gets a bad result - it can be as low as 474.63Kb/sec (shown below).

Can you help explain why the random IO is so much worse? Thanks.
This is what sysbench outputs:

# sysbench --test=fileio --file-total-size=5G prepare
sysbench 0.4.12:  multi-threaded system evaluation benchmark

128 files, 40960Kb each, 5120Mb total
Creating files for the test...


# sysbench --test=fileio --file-total-size=5G --file-test-mode=seqwr 
--init-rng=on --max-time=300 --max-requests=0 run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1
Initializing random number generator from timer.


Extra file open flags: 0
128 files, 40Mb each
5Gb total file size
Block size 16Kb
Periodic FSYNC enabled, calling fsync() each 100 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing sequential write (creation) test
Threads started!
Done.

Operations performed:  0 Read, 327680 Write, 128 Other = 327808 Total
Read 0b  Written 5Gb  Total transferred 5Gb  (170.37Mb/sec)
10903.42 Requests/sec executed

Test execution summary:
total time:  30.0530s
total number of events:  327680
total time taken by event execution: 28.5936
per-request statistics:
 min:  0.01ms
 avg:  0.09ms
 max:192.84ms
 approx.  95 percentile:   0.03ms

Threads fairness:
events (avg/stddev):   327680./0.00
execution time (avg/stddev):   28.5936/0.00



# sysbench --test=fileio --file-total-size=5G --file-test-mode=rndrw 
--init-rng=on --max-time=300 --max-requests=0 run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1
Initializing random number generator from timer.


Extra file open flags: 0
128 files, 40Mb each
5Gb total file size
Block size 16Kb
Number of random requests for random IO: 0
Read/Write ratio for combined random IO test: 1.50
Periodic FSYNC enabled, calling fsync() each 100 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing random r/w test
Threads started!

Time limit exceeded, exiting...
Done.

Operations performed:  5340 Read, 3560 Write, 11269 Other = 20169 Total
Read 83.438Mb  Written 55.625Mb  Total transferred 139.06Mb  (474.63Kb/sec)
   29.66 Requests/sec executed

Test execution summary:
total time:  300.0216s
total number of events:  8900
total time taken by event execution: 6.4774
per-request statistics:
 min:  0.01ms
 avg:  0.73ms
 max: 90.18ms
 approx.  95 percentile:   1.60ms

Threads fairness:
events (avg/stddev):   8900./0.00
execution time (avg/stddev):   6.4774/0.00


Re: [ceph-users] seqwrite gets good performance but random rw gets worse

2016-05-25 Thread Adrian Saul

Sync will always be lower - it causes sysbench to wait for previous writes to 
complete before issuing more, so it effectively throttles writes to a queue 
depth of 1.
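A rough way to see the same effect outside sysbench, if fio happens to be 
available in the VM (the file name and sizes are just examples), is to run the 
same random write workload at queue depth 1 and then at a deeper queue depth:

fio --name=qd1 --filename=/tmp/fiotest --size=1G --rw=randwrite --bs=16k --ioengine=libaio --direct=1 --iodepth=1 --runtime=60 --time_based
fio --name=qd32 --filename=/tmp/fiotest --size=1G --rw=randwrite --bs=16k --ioengine=libaio --direct=1 --iodepth=32 --runtime=60 --time_based

The gap between the two runs shows how much per-request latency, rather than raw 
device throughput, limits a queue-depth-1 workload.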



From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Ken 
Peng
Sent: Wednesday, 25 May 2016 6:36 PM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] seqwrite gets good performance but random rw gets 
worse

Hi again,
when set with file-fsync-freq=1 (fsync after every write) versus 
file-fsync-freq=0 (sysbench never calls fsync), the results differ hugely
(one is 382.94Kb/sec, the other is 25.921Mb/sec).
What do you make of it? Thanks.

file-fsync-freq=1,
# sysbench --test=fileio --file-total-size=5G --file-test-mode=rndrw 
--init-rng=on --max-time=300 --max-requests=0 --file-fsync-freq=1 run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1
Initializing random number generator from timer.


Extra file open flags: 0
128 files, 40Mb each
5Gb total file size
Block size 16Kb
Number of random requests for random IO: 0
Read/Write ratio for combined random IO test: 1.50
Periodic FSYNC enabled, calling fsync() each 1 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing random r/w test
Threads started!
Time limit exceeded, exiting...
Done.

Operations performed:  4309 Read, 2873 Write, 367707 Other = 374889 Total
Read 67.328Mb  Written 44.891Mb  Total transferred 112.22Mb  (382.94Kb/sec)
   23.93 Requests/sec executed

Test execution summary:
total time:  300.0782s
total number of events:  7182
total time taken by event execution: 2.3207
per-request statistics:
 min:  0.01ms
 avg:  0.32ms
 max: 80.17ms
 approx.  95 percentile:   1.48ms

Threads fairness:
events (avg/stddev):   7182./0.00
execution time (avg/stddev):   2.3207/0.00


file-fsync-freq=0,

# sysbench --test=fileio --file-total-size=5G --file-test-mode=rndrw 
--init-rng=on --max-time=300 --max-requests=0 --file-fsync-freq=0 run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1
Initializing random number generator from timer.


Extra file open flags: 0
128 files, 40Mb each
5Gb total file size
Block size 16Kb
Number of random requests for random IO: 0
Read/Write ratio for combined random IO test: 1.50
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing random r/w test
Threads started!
Time limit exceeded, exiting...
Done.

Operations performed:  298613 Read, 199075 Write, 0 Other = 497688 Total
Read 4.5565Gb  Written 3.0376Gb  Total transferred 7.5941Gb  (25.921Mb/sec)
 1658.93 Requests/sec executed

Test execution summary:
total time:  300.0049s
total number of events:  497688
total time taken by event execution: 299.7026
per-request statistics:
 min:  0.00ms
 avg:  0.60ms
 max:   2211.13ms
 approx.  95 percentile:   1.21ms

Threads fairness:
events (avg/stddev):   497688./0.00
execution time (avg/stddev):   299.7026/0.00

2016-05-25 15:01 GMT+08:00 Ken Peng <k...@dnsbed.com>:
Hello,
We have a cluster with 20+ hosts and 200+ OSDs, each 4T SATA disk for an OSD, 
no SSD cache.
OS is Ubuntu 16.04 LTS, ceph version 10.2.0
Both data network and cluster network are 10Gbps.
We run ceph as block storage service only (rbd client within VM).
For testing within a VM with sysbench tool, we see that the seqwrite has a 
relatively good performance, it can reach 170.37Mb/sec, but random read/write 
always gets bad result, it can be only 474.63Kb/sec (shown as below).

Can you help give the idea why the random IO is so worse? Thanks.
This is what sysbench outputs,

# sysbench --test=fileio --file-total-size=5G prepare
sysbench 0.4.12:  multi-threaded system evaluation benchmark

128 files, 40960Kb each, 5120Mb total
Creating files for the test...


# sysbench --test=fileio --file-total-size=5G --file-test-mode=seqwr 
--init-rng=on --max-time=300 --max-requests=0 run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1
Initializing random number generator from timer.


Extra file open flags: 0
128 files, 40Mb each
5Gb total file size
Block size 16Kb
Periodic FSYNC enabled, calling fsync() each 100 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing sequential write (creation) test
Threads started!
Done.

Operations performed:  0 Read, 327680 Write, 128 Other = 327808 Total
Read 0b  Written 5Gb  Total transferred 5Gb  (170.37Mb/sec)
10903.42 Reques

Re: [ceph-users] Fwd: [Ceph-community] Wasting the Storage capacity when using Ceph based On high-end storage systems

2016-06-01 Thread Adrian Saul

Also if for political reasons you need a “vendor” solution – ask Dell about 
their DSS 7000 servers – 90 8TB  disks and two compute nodes in 4RU would go a 
long way to making up a multi-PB Ceph solution.

Supermicro also do similar solutions, with 36, 60 and 90 disk models in 4RU.

Cisco has the C3260, which takes about 60 disks depending on config.


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jack 
Makenz
Sent: Monday, 30 May 2016 3:56 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Fwd: [Ceph-community] Wasting the Storage capacity when 
using Ceph based On high-end storage systems


Forwarded conversation
Subject: Wasting the Storage capacity when using Ceph based On high-end storage 
systems


From: Jack Makenz <jack.mak...@gmail.com>
Date: Sun, May 29, 2016 at 6:52 PM
To: ceph-commun...@lists.ceph.com

Hello All,
There is a serious problem with ceph that may waste storage capacity when 
using high-end storage systems (Hitachi, IBM, EMC, HP, ...) as the back-end for OSD 
hosts.

Imagine that in a real cloud we need n petabytes of storage capacity, and 
commodity hardware or OSD server hard disks can't provide this amount of 
capacity, so we have to use storage arrays as the back-end for the OSD hosts (to 
implement the OSD daemons on).

But because almost all of these storage systems (regardless of brand) use RAID 
technology, and ceph also replicates at least two copies of each object, a large 
amount of storage capacity is wasted.

So is there any solution to this problem/misunderstanding?

Regards
Jack Makenz

--
From: Nate Curry <cu...@mosaicatm.com>
Date: Mon, May 30, 2016 at 5:50 AM
To: Jack Makenz <jack.mak...@gmail.com>
Cc: ceph-commun...@lists.ceph.com


I think the purpose of ceph is to get away from having to rely on high end 
storage systems and to provide the capacity to utilize multiple less 
expensive servers as the storage system.

That being said, you should still be able to use the high end storage systems 
with or without RAID enabled.  You could do away with RAID altogether and let 
Ceph handle the redundancy, or you could have LUNs assigned to hosts and put into 
use as OSDs.  You could make it work either way, but to get the most out of your 
storage with Ceph I think a non-RAID configuration would be best.

Nate Curry

--
From: Doug Dressler <darbymorri...@gmail.com>
Date: Mon, May 30, 2016 at 6:02 AM
To: Nate Curry <cu...@mosaicatm.com>
Cc: Jack Makenz <jack.mak...@gmail.com>, ceph-commun...@lists.ceph.com

For non-technical reasons I had to run ceph initially using SAN disks.

Lesson learned:

Make sure deduplication is disabled on the SAN :-)



--
From: Jack Makenz <jack.mak...@gmail.com>
Date: Mon, May 30, 2016 at 9:05 AM
To: Nate Curry <cu...@mosaicatm.com>, ceph-commun...@lists.ceph.com

Thanks Nate,
But as I mentioned before, providing petabytes of storage capacity on 
commodity hardware or enterprise servers is almost impossible. Of course it is 
possible by installing hundreds of servers with 3 terabyte hard disks, but that 
solution wastes data centre raised-floor space, power consumption and also 
money :)




Re: [ceph-users] Best Network Switches for Redundancy

2016-06-01 Thread Adrian Saul

I am currently running our Ceph POC environment using dual Nexus 9372TX 10G-T 
switches, each OSD host has two connections to each switch and they are formed 
into a single 4 link VPC (MC-LAG), which is bonded under LACP on the host side.

What I have noticed is that the various hashing policies for LACP do not 
guarantee you will make full use of all the links.  I tried various policies 
and from what I could see the normal L3+L4 IP and port hashing generally worked 
as well as anything else, but if you have lots of similar connections it 
doesn't seem to hash across all the links - say 2 will be heavily used while 
not much is hashed onto the other links.  This might have just been because it 
was a fairly small pool of IPs and fairly similar port numbers that happened to 
keep hashing to the same links (I ended up going to the point of tcpdumping 
traffic and scripting a calculation of which link each flow should use; it 
really was that consistent).

For two links it should be quite good - it seemed to balance across that quite 
well, but with 4 links it seemed to really prefer 2 in my case.
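For reference, a minimal sketch of the host-side bonding configuration being 
described (RHEL/CentOS ifcfg style; device names and options here are examples, 
not a recommendation):

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
TYPE=Bond
BONDING_MASTER=yes
BOOTPROTO=none
ONBOOT=yes
BONDING_OPTS="mode=802.3ad miimon=100 lacp_rate=fast xmit_hash_policy=layer3+4"

The switch side (the VPC/MC-LAG port-channel) needs an equivalent load-balance 
hashing method configured, otherwise only one direction of the traffic gets 
spread across the links.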


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> David Riedl
> Sent: Thursday, 2 June 2016 2:12 AM
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Best Network Switches for Redundancy
>
>
> > 4. As Ceph has lots of connections on lots of IP's and port's, LACP or
> > the Linux ALB mode should work really well to balance connections.
> Linux ALB Mode looks promising. Does that work with two switches? Each
> server has 4 ports which are 'splitted' and connected to each switch.
>  _
>/ _[switch]
>   / /  ||
> [server] ||
>  \ \_ ||
>   \__[switch]
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best Network Switches for Redundancy

2016-06-01 Thread Adrian Saul

> > For two links it should be quite good - it seemed to balance across
> > that quite well, but with 4 links it seemed to really prefer 2 in my case.
> >
> Just for the record, did you also change the LACP policies on the switches?
>
> From what I gather, having fancy pants L3+4 hashing on the Linux side will not
> fix imbalances by itself, the switches need to be configured likewise.

Yes - I was changing policies on both sides in similar ways, but it seemed that 
the way the OSDs selected their service ports just happened to hash consistently 
to the same links.   There just wasn't enough variation in the combinations of 
L3+L4 or even L2 hash output to utilise more of the links (the even-numbered 
ports and consistent IP pairs just kept returning the same link for the hash 
algorithm).   Some of the more simplistic round-robin methods might have got 
better results, but I didn't want to stick with those for future scalability.

In a larger scale deployment with more clients or a wider pool of OSDs that 
would probably not be the case as there would be greater distribution of hash 
inputs.  Just something to be aware of when you look to do LACP with more than 
2 links.
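A quick way to check how evenly a bond is actually using its slaves (standard 
Linux tooling, nothing Ceph-specific) is to look at the bond state and the 
per-interface counters while the cluster is busy:

cat /proc/net/bonding/bond0    # LACP/802.3ad state and slave membership
ip -s link show                # per-interface byte and packet counters

If one or two slaves carry almost all of the tx bytes, the hash policy is 
collapsing the flows onto those links as described above.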



>
> Christian
> >
> > > -Original Message-
> > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> > > Behalf Of David Riedl
> > > Sent: Thursday, 2 June 2016 2:12 AM
> > > To: ceph-users@lists.ceph.com
> > > Subject: Re: [ceph-users] Best Network Switches for Redundancy
> > >
> > >
> > > > 4. As Ceph has lots of connections on lots of IP's and port's,
> > > > LACP or the Linux ALB mode should work really well to balance
> connections.
> > > Linux ALB Mode looks promising. Does that work with two switches?
> > > Each server has 4 ports which are 'splitted' and connected to each switch.
> > >  _
> > >/ _[switch]
> > >   / /  ||
> > > [server] ||
> > >  \ \_ ||
> > >   \__[switch]
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > Confidentiality: This email and any attachments are confidential and
> > may be subject to copyright, legal or some other professional privilege.
> > They are intended solely for the attention and use of the named
> > addressee(s). They may only be copied, distributed or disclosed with
> > the consent of the copyright owner. If you have received this email by
> > mistake or by breach of the confidentiality clause, please notify the
> > sender immediately by return email and delete or destroy all copies of
> > the email. Any confidentiality, privilege or copyright is not waived
> > or lost because this email has been sent to you by mistake.
> > ___ ceph-users
> mailing
> > list ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
>
> --
> Christian BalzerNetwork/Systems Engineer
> ch...@gol.com   Global OnLine Japan/Rakuten Communications
> http://www.gol.com/


[ceph-users] Jewel upgrade - rbd errors after upgrade

2016-06-05 Thread Adrian Saul

I upgraded my Infernalis semi-production cluster to Jewel on Friday.  The 
upgrade went through smoothly (aside from a time-wasting restorecon of 
/var/lib/ceph in the selinux package upgrade) and the services continued 
running without interruption.  However, this morning when I went to create some 
new RBD images I found I am unable to do much at all with RBD.

Just about any rbd command fails with an I/O error.   I can run showmapped but 
that is about it - anything like an ls, info or status fails.  This applies to 
all my pools.

I can see no errors in any log files that appear to suggest an issue.  I  have 
also tried the commands on other cluster members that have not done anything 
with RBD before (I was wondering if perhaps the kernel rbd was pinning the old 
library version open or something) but the same error occurs.

Where can I start trying to resolve this?

Cheers,
 Adrian


[root@ceph-glb-fec-01 ceph]# rbd ls glebe-sata
rbd: list: (5) Input/output error
2016-06-06 10:41:31.792720 7f53c06a2d80 -1 librbd: error listing image in 
directory: (5) Input/output error
2016-06-06 10:41:31.792749 7f53c06a2d80 -1 librbd: error listing v2 images: (5) 
Input/output error

[root@ceph-glb-fec-01 ceph]# rbd ls glebe-ssd
rbd: list: (5) Input/output error
2016-06-06 10:41:33.956648 7f90de663d80 -1 librbd: error listing image in 
directory: (5) Input/output error
2016-06-06 10:41:33.956672 7f90de663d80 -1 librbd: error listing v2 images: (5) 
Input/output error

[root@ceph-glb-fec-02 ~]# rbd showmapped
id pool   image snap device
0  glebe-sata test02-/dev/rbd0
1  glebe-ssd  zfstest   -/dev/rbd1
10 glebe-sata hypervtst-lun00   -/dev/rbd10
11 glebe-sata hypervtst-lun02   -/dev/rbd11
12 glebe-sata hypervtst-lun03   -/dev/rbd12
13 glebe-ssd  nspprd01_lun00-/dev/rbd13
14 glebe-sata cirrux-nfs01  -/dev/rbd14
15 glebe-sata hypervtst-lun04   -/dev/rbd15
16 glebe-sata hypervtst-lun05   -/dev/rbd16
17 glebe-sata pvtcloud-nfs01-/dev/rbd17
18 glebe-sata cloud2sql-lun00   -/dev/rbd18
19 glebe-sata cloud2sql-lun01   -/dev/rbd19
2  glebe-sata radmast02-lun00   -/dev/rbd2
20 glebe-sata cloud2sql-lun02   -/dev/rbd20
21 glebe-sata cloud2fs-lun00-/dev/rbd21
22 glebe-sata cloud2fs-lun01-/dev/rbd22
3  glebe-sata radmast02-lun01   -/dev/rbd3
4  glebe-sata radmast02-lun02   -/dev/rbd4
5  glebe-sata radmast02-lun03   -/dev/rbd5
6  glebe-sata radmast02-lun04   -/dev/rbd6
7  glebe-ssd  sybase_iquser02_lun00 -/dev/rbd7
8  glebe-ssd  sybase_iquser03_lun00 -/dev/rbd8
9  glebe-ssd  sybase_iquser04_lun00 -/dev/rbd9

[root@ceph-glb-fec-02 ~]# rbd status glebe-sata/hypervtst-lun04
2016-06-06 10:47:30.221453 7fc0030dc700 -1 librbd::image::OpenRequest: failed 
to retrieve image id: (5) Input/output error
2016-06-06 10:47:30.221556 7fc0028db700 -1 librbd::ImageState: failed to open 
image: (5) Input/output error
rbd: error opening image hypervtst-lun04: (5) Input/output error


Re: [ceph-users] Jewel upgrade - rbd errors after upgrade

2016-06-05 Thread Adrian Saul

No - it throws a usage error - if I add a file argument after it works:

[root@ceph-glb-fec-02 ceph]# rados -p glebe-sata get rbd_id.hypervtst-lun04 
/tmp/crap
[root@ceph-glb-fec-02 ceph]# cat /tmp/crap
109eb01f5f89de

stat works:

[root@ceph-glb-fec-02 ceph]# rados -p glebe-sata stat rbd_id.hypervtst-lun04
glebe-sata/rbd_id.hypervtst-lun04 mtime 2016-06-06 10:55:08.00, size 18


I can do a rados ls:

[root@ceph-glb-fec-02 ceph]# rados ls -p glebe-sata|grep rbd_id
rbd_id.cloud2sql-lun01
rbd_id.glbcluster3-vm17
rbd_id.holder   <<<  a create that said it failed while I was debugging this
rbd_id.pvtcloud-nfs01
rbd_id.hypervtst-lun05
rbd_id.test02
rbd_id.cloud2sql-lun02
rbd_id.fiotest2
rbd_id.radmast02-lun04
rbd_id.hypervtst-lun04
rbd_id.cloud2fs-lun00
rbd_id.radmast02-lun03
rbd_id.hypervtst-lun00
rbd_id.cloud2sql-lun00
rbd_id.radmast02-lun02


> -Original Message-
> From: Jason Dillaman [mailto:jdill...@redhat.com]
> Sent: Monday, 6 June 2016 11:00 AM
> To: Adrian Saul
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Jewel upgrade - rbd errors after upgrade
>
> Are you able to successfully run the following command successfully?
>
> rados -p glebe-sata get rbd_id.hypervtst-lun04
>
>
>
> On Sun, Jun 5, 2016 at 8:49 PM, Adrian Saul
>  wrote:
> >
> > I upgraded my Infernalis semi-production cluster to Jewel on Friday.  While
> the upgrade went through smoothly (aside from a time wasting restorecon
> /var/lib/ceph in the selinux package upgrade) and the services continued
> running without interruption.  However this morning when I went to create
> some new RBD images I am unable to do much at all with RBD.
> >
> > Just about any rbd command fails with an I/O error.   I can run
> showmapped but that is about it - anything like an ls, info or status fails.  
> This
> applies to all my pools.
> >
> > I can see no errors in any log files that appear to suggest an issue.  I  
> > have
> also tried the commands on other cluster members that have not done
> anything with RBD before (I was wondering if perhaps the kernel rbd was
> pinning the old library version open or something) but the same error occurs.
> >
> > Where can I start trying to resolve this?
> >
> > Cheers,
> >  Adrian
> >
> >
> > [root@ceph-glb-fec-01 ceph]# rbd ls glebe-sata
> > rbd: list: (5) Input/output error
> > 2016-06-06 10:41:31.792720 7f53c06a2d80 -1 librbd: error listing image
> > in directory: (5) Input/output error
> > 2016-06-06 10:41:31.792749 7f53c06a2d80 -1 librbd: error listing v2
> > images: (5) Input/output error
> >
> > [root@ceph-glb-fec-01 ceph]# rbd ls glebe-ssd
> > rbd: list: (5) Input/output error
> > 2016-06-06 10:41:33.956648 7f90de663d80 -1 librbd: error listing image
> > in directory: (5) Input/output error
> > 2016-06-06 10:41:33.956672 7f90de663d80 -1 librbd: error listing v2
> > images: (5) Input/output error
> >
> > [root@ceph-glb-fec-02 ~]# rbd showmapped
> > id pool   image snap device
> > 0  glebe-sata test02-/dev/rbd0
> > 1  glebe-ssd  zfstest   -/dev/rbd1
> > 10 glebe-sata hypervtst-lun00   -/dev/rbd10
> > 11 glebe-sata hypervtst-lun02   -/dev/rbd11
> > 12 glebe-sata hypervtst-lun03   -/dev/rbd12
> > 13 glebe-ssd  nspprd01_lun00-/dev/rbd13
> > 14 glebe-sata cirrux-nfs01  -/dev/rbd14
> > 15 glebe-sata hypervtst-lun04   -/dev/rbd15
> > 16 glebe-sata hypervtst-lun05   -/dev/rbd16
> > 17 glebe-sata pvtcloud-nfs01-/dev/rbd17
> > 18 glebe-sata cloud2sql-lun00   -/dev/rbd18
> > 19 glebe-sata cloud2sql-lun01   -/dev/rbd19
> > 2  glebe-sata radmast02-lun00   -/dev/rbd2
> > 20 glebe-sata cloud2sql-lun02   -/dev/rbd20
> > 21 glebe-sata cloud2fs-lun00-/dev/rbd21
> > 22 glebe-sata cloud2fs-lun01-/dev/rbd22
> > 3  glebe-sata radmast02-lun01   -/dev/rbd3
> > 4  glebe-sata radmast02-lun02   -/dev/rbd4
> > 5  glebe-sata radmast02-lun03   -/dev/rbd5
> > 6  glebe-sata radmast02-lun04   -/dev/rbd6
> > 7  glebe-ssd  sybase_iquser02_lun00 -/dev/rbd7
> > 8  glebe-ssd  sybase_iquser03_lun00 -/dev/rbd8
> > 9  glebe-ssd  sybase_iquser04_lun00 -/dev/rbd9
> >
> > [root@ceph-glb-fec-02 ~]# rbd status glebe-sata/hypervtst-lun04
> > 2016-06-06 10:47:30.221453 7fc0030dc700 -1 librbd::image::OpenRequest:
> > failed to retrieve image id: (5) Input/output error
> > 2016-06-06 10:47:30.221556 7fc0028db700 -1 librbd::ImageState: failed
> > to open image: (5) Input/o

Re: [ceph-users] Jewel upgrade - rbd errors after upgrade

2016-06-05 Thread Adrian Saul

Seems like my rbd_directory is empty for some reason:

[root@ceph-glb-fec-02 ceph]# rados get -p glebe-sata rbd_directory /tmp/dir
[root@ceph-glb-fec-02 ceph]# strings /tmp/dir
[root@ceph-glb-fec-02 ceph]# ls -la /tmp/dir
-rw-r--r--. 1 root root 0 Jun  6 11:12 /tmp/dir

[root@ceph-glb-fec-02 ceph]# rados stat -p glebe-sata rbd_directory
glebe-sata/rbd_directory mtime 2016-06-06 10:18:28.00, size 0
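Worth noting: for format 2 images the rbd_directory object stores its name/id 
mapping in omap key/values rather than in the object data, so a zero-byte object 
body is expected and 'rados get' will always return an empty file.  Something 
like the following (a sketch, reusing the pool name from above) should show 
whether the directory entries are actually there:

rados -p glebe-sata listomapvals rbd_directory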



> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Adrian Saul
> Sent: Monday, 6 June 2016 11:11 AM
> To: dilla...@redhat.com
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Jewel upgrade - rbd errors after upgrade
>
>
> No - it throws a usage error - if I add a file argument after it works:
>
> [root@ceph-glb-fec-02 ceph]# rados -p glebe-sata get rbd_id.hypervtst-
> lun04 /tmp/crap
> [root@ceph-glb-fec-02 ceph]# cat /tmp/crap 109eb01f5f89de
>
> stat works:
>
> [root@ceph-glb-fec-02 ceph]# rados -p glebe-sata stat rbd_id.hypervtst-
> lun04
> glebe-sata/rbd_id.hypervtst-lun04 mtime 2016-06-06 10:55:08.00, size 18
>
>
> I can do a rados ls:
>
> [root@ceph-glb-fec-02 ceph]# rados ls -p glebe-sata|grep rbd_id
> rbd_id.cloud2sql-lun01
> rbd_id.glbcluster3-vm17
> rbd_id.holder   <<<  a create that said it failed while I was debugging this
> rbd_id.pvtcloud-nfs01
> rbd_id.hypervtst-lun05
> rbd_id.test02
> rbd_id.cloud2sql-lun02
> rbd_id.fiotest2
> rbd_id.radmast02-lun04
> rbd_id.hypervtst-lun04
> rbd_id.cloud2fs-lun00
> rbd_id.radmast02-lun03
> rbd_id.hypervtst-lun00
> rbd_id.cloud2sql-lun00
> rbd_id.radmast02-lun02
>
>
> > -Original Message-
> > From: Jason Dillaman [mailto:jdill...@redhat.com]
> > Sent: Monday, 6 June 2016 11:00 AM
> > To: Adrian Saul
> > Cc: ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] Jewel upgrade - rbd errors after upgrade
> >
> > Are you able to successfully run the following command successfully?
> >
> > rados -p glebe-sata get rbd_id.hypervtst-lun04
> >
> >
> >
> > On Sun, Jun 5, 2016 at 8:49 PM, Adrian Saul
> >  wrote:
> > >
> > > I upgraded my Infernalis semi-production cluster to Jewel on Friday.
> > > While
> > the upgrade went through smoothly (aside from a time wasting
> > restorecon /var/lib/ceph in the selinux package upgrade) and the
> > services continued running without interruption.  However this morning
> > when I went to create some new RBD images I am unable to do much at all
> with RBD.
> > >
> > > Just about any rbd command fails with an I/O error.   I can run
> > showmapped but that is about it - anything like an ls, info or status
> > fails.  This applies to all my pools.
> > >
> > > I can see no errors in any log files that appear to suggest an
> > > issue.  I  have
> > also tried the commands on other cluster members that have not done
> > anything with RBD before (I was wondering if perhaps the kernel rbd
> > was pinning the old library version open or something) but the same error
> occurs.
> > >
> > > Where can I start trying to resolve this?
> > >
> > > Cheers,
> > >  Adrian
> > >
> > >
> > > [root@ceph-glb-fec-01 ceph]# rbd ls glebe-sata
> > > rbd: list: (5) Input/output error
> > > 2016-06-06 10:41:31.792720 7f53c06a2d80 -1 librbd: error listing
> > > image in directory: (5) Input/output error
> > > 2016-06-06 10:41:31.792749 7f53c06a2d80 -1 librbd: error listing v2
> > > images: (5) Input/output error
> > >
> > > [root@ceph-glb-fec-01 ceph]# rbd ls glebe-ssd
> > > rbd: list: (5) Input/output error
> > > 2016-06-06 10:41:33.956648 7f90de663d80 -1 librbd: error listing
> > > image in directory: (5) Input/output error
> > > 2016-06-06 10:41:33.956672 7f90de663d80 -1 librbd: error listing v2
> > > images: (5) Input/output error
> > >
> > > [root@ceph-glb-fec-02 ~]# rbd showmapped
> > > id pool   image snap device
> > > 0  glebe-sata test02-/dev/rbd0
> > > 1  glebe-ssd  zfstest   -/dev/rbd1
> > > 10 glebe-sata hypervtst-lun00   -/dev/rbd10
> > > 11 glebe-sata hypervtst-lun02   -/dev/rbd11
> > > 12 glebe-sata hypervtst-lun03   -/dev/rbd12
> > > 13 glebe-ssd  nspprd01_lun00-/dev/rbd13
> > > 14 glebe-sata cirrux-nfs01  -/dev/rbd14
> > > 15 glebe-sata hypervtst-lun04   -/dev/rbd15
> > > 16 glebe-sata hypervtst-lun05   -/dev/r

Re: [ceph-users] Jewel upgrade - rbd errors after upgrade

2016-06-05 Thread Adrian Saul

I have traced it back to an OSD giving this error:

2016-06-06 12:18:14.315573 7fd714679700 -1 osd.20 23623 class rbd open got (5) 
Input/output error
2016-06-06 12:19:49.835227 7fd714679700  0 _load_class could not open class 
/usr/lib64/rados-classes/libcls_rbd.so (dlopen failed): 
/usr/lib64/rados-classes/libcls_rbd.so: undefined symbol: 
_ZN4ceph6buffer4list8iteratorC1EPS1_j

Trying to figure out why that is the case.
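One way to narrow this down (the commands are a sketch; package names assume the 
RPM-based install used here) is to confirm the class plugin and the running OSD 
come from the same release:

rpm -qf /usr/lib64/rados-classes/libcls_rbd.so      # which package owns the plugin
rpm -q librados2 ceph-osd ceph-common               # should all report the same (Jewel) version
lsof -n | grep ceph-osd | grep -i deleted           # any pre-upgrade libraries still mapped by running OSDs

If a running OSD still has old (deleted) libraries mapped while the on-disk 
libcls_rbd.so is the new one, a dlopen symbol mismatch like the one above is 
exactly what you would expect.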


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Adrian Saul
> Sent: Monday, 6 June 2016 11:11 AM
> To: dilla...@redhat.com
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Jewel upgrade - rbd errors after upgrade
>
>
> No - it throws a usage error - if I add a file argument after it works:
>
> [root@ceph-glb-fec-02 ceph]# rados -p glebe-sata get rbd_id.hypervtst-
> lun04 /tmp/crap
> [root@ceph-glb-fec-02 ceph]# cat /tmp/crap 109eb01f5f89de
>
> stat works:
>
> [root@ceph-glb-fec-02 ceph]# rados -p glebe-sata stat rbd_id.hypervtst-
> lun04
> glebe-sata/rbd_id.hypervtst-lun04 mtime 2016-06-06 10:55:08.00, size 18
>
>
> I can do a rados ls:
>
> [root@ceph-glb-fec-02 ceph]# rados ls -p glebe-sata|grep rbd_id
> rbd_id.cloud2sql-lun01
> rbd_id.glbcluster3-vm17
> rbd_id.holder   <<<  a create that said it failed while I was debugging this
> rbd_id.pvtcloud-nfs01
> rbd_id.hypervtst-lun05
> rbd_id.test02
> rbd_id.cloud2sql-lun02
> rbd_id.fiotest2
> rbd_id.radmast02-lun04
> rbd_id.hypervtst-lun04
> rbd_id.cloud2fs-lun00
> rbd_id.radmast02-lun03
> rbd_id.hypervtst-lun00
> rbd_id.cloud2sql-lun00
> rbd_id.radmast02-lun02
>
>
> > -Original Message-
> > From: Jason Dillaman [mailto:jdill...@redhat.com]
> > Sent: Monday, 6 June 2016 11:00 AM
> > To: Adrian Saul
> > Cc: ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] Jewel upgrade - rbd errors after upgrade
> >
> > Are you able to successfully run the following command successfully?
> >
> > rados -p glebe-sata get rbd_id.hypervtst-lun04
> >
> >
> >
> > On Sun, Jun 5, 2016 at 8:49 PM, Adrian Saul
> >  wrote:
> > >
> > > I upgraded my Infernalis semi-production cluster to Jewel on Friday.
> > > While
> > the upgrade went through smoothly (aside from a time wasting
> > restorecon /var/lib/ceph in the selinux package upgrade) and the
> > services continued running without interruption.  However this morning
> > when I went to create some new RBD images I am unable to do much at all
> with RBD.
> > >
> > > Just about any rbd command fails with an I/O error.   I can run
> > showmapped but that is about it - anything like an ls, info or status
> > fails.  This applies to all my pools.
> > >
> > > I can see no errors in any log files that appear to suggest an
> > > issue.  I  have
> > also tried the commands on other cluster members that have not done
> > anything with RBD before (I was wondering if perhaps the kernel rbd
> > was pinning the old library version open or something) but the same error
> occurs.
> > >
> > > Where can I start trying to resolve this?
> > >
> > > Cheers,
> > >  Adrian
> > >
> > >
> > > [root@ceph-glb-fec-01 ceph]# rbd ls glebe-sata
> > > rbd: list: (5) Input/output error
> > > 2016-06-06 10:41:31.792720 7f53c06a2d80 -1 librbd: error listing
> > > image in directory: (5) Input/output error
> > > 2016-06-06 10:41:31.792749 7f53c06a2d80 -1 librbd: error listing v2
> > > images: (5) Input/output error
> > >
> > > [root@ceph-glb-fec-01 ceph]# rbd ls glebe-ssd
> > > rbd: list: (5) Input/output error
> > > 2016-06-06 10:41:33.956648 7f90de663d80 -1 librbd: error listing
> > > image in directory: (5) Input/output error
> > > 2016-06-06 10:41:33.956672 7f90de663d80 -1 librbd: error listing v2
> > > images: (5) Input/output error
> > >
> > > [root@ceph-glb-fec-02 ~]# rbd showmapped
> > > id pool   image snap device
> > > 0  glebe-sata test02-/dev/rbd0
> > > 1  glebe-ssd  zfstest   -/dev/rbd1
> > > 10 glebe-sata hypervtst-lun00   -/dev/rbd10
> > > 11 glebe-sata hypervtst-lun02   -/dev/rbd11
> > > 12 glebe-sata hypervtst-lun03   -/dev/rbd12
> > > 13 glebe-ssd  nspprd01_lun00-/dev/rbd13
> > > 14 glebe-sata cirrux-nfs01  -/dev/rbd14
> > > 15 glebe-sata hypervtst-lun04   -/dev/rbd15
> > > 16 glebe-sata hype

Re: [ceph-users] Jewel upgrade - rbd errors after upgrade

2016-06-05 Thread Adrian Saul

I couldn't find anything wrong with the packages and everything seemed 
installed ok.

Once I restarted the OSDs the directory issue went away but the error started 
moving to other rbd output, and the same class open error occurred on other 
OSDs.  I have gone through and bounced all the OSDs and that seems to have 
cleared the issue.

I am guessing that the restart of the OSDs during the package upgrade occurs 
before all of the library packages have been upgraded, so they start with the 
wrong versions loaded, and when these class libraries are dynamically opened 
later they fail.
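If that is the cause, a minimal sketch of an upgrade sequence that avoids it 
(assuming the systemd units shipped with the Jewel packages) is to let all the 
packages finish upgrading first and only then bounce the daemons once:

yum update -y ceph                        # upgrade all ceph packages
rpm -qa 'ceph*' 'librados*' 'librbd*'     # confirm everything is now the same version
systemctl restart ceph-osd.target         # restart OSDs once, after the packages settle
ceph tell osd.* version                   # verify every OSD reports the new version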



> -Original Message-
> From: Adrian Saul
> Sent: Monday, 6 June 2016 12:29 PM
> To: Adrian Saul; dilla...@redhat.com
> Cc: ceph-users@lists.ceph.com
> Subject: RE: [ceph-users] Jewel upgrade - rbd errors after upgrade
>
>
> I have traced it back to an OSD giving this error:
>
> 2016-06-06 12:18:14.315573 7fd714679700 -1 osd.20 23623 class rbd open got
> (5) Input/output error
> 2016-06-06 12:19:49.835227 7fd714679700  0 _load_class could not open class
> /usr/lib64/rados-classes/libcls_rbd.so (dlopen failed): /usr/lib64/rados-
> classes/libcls_rbd.so: undefined symbol:
> _ZN4ceph6buffer4list8iteratorC1EPS1_j
>
> Trying to figure out why that is the case.
>
>
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> > Of Adrian Saul
> > Sent: Monday, 6 June 2016 11:11 AM
> > To: dilla...@redhat.com
> > Cc: ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] Jewel upgrade - rbd errors after upgrade
> >
> >
> > No - it throws a usage error - if I add a file argument after it works:
> >
> > [root@ceph-glb-fec-02 ceph]# rados -p glebe-sata get rbd_id.hypervtst-
> > lun04 /tmp/crap
> > [root@ceph-glb-fec-02 ceph]# cat /tmp/crap 109eb01f5f89de
> >
> > stat works:
> >
> > [root@ceph-glb-fec-02 ceph]# rados -p glebe-sata stat
> > rbd_id.hypervtst-
> > lun04
> > glebe-sata/rbd_id.hypervtst-lun04 mtime 2016-06-06 10:55:08.00,
> > size 18
> >
> >
> > I can do a rados ls:
> >
> > [root@ceph-glb-fec-02 ceph]# rados ls -p glebe-sata|grep rbd_id
> > rbd_id.cloud2sql-lun01
> > rbd_id.glbcluster3-vm17
> > rbd_id.holder   <<<  a create that said it failed while I was debugging this
> > rbd_id.pvtcloud-nfs01
> > rbd_id.hypervtst-lun05
> > rbd_id.test02
> > rbd_id.cloud2sql-lun02
> > rbd_id.fiotest2
> > rbd_id.radmast02-lun04
> > rbd_id.hypervtst-lun04
> > rbd_id.cloud2fs-lun00
> > rbd_id.radmast02-lun03
> > rbd_id.hypervtst-lun00
> > rbd_id.cloud2sql-lun00
> > rbd_id.radmast02-lun02
> >
> >
> > > -Original Message-
> > > From: Jason Dillaman [mailto:jdill...@redhat.com]
> > > Sent: Monday, 6 June 2016 11:00 AM
> > > To: Adrian Saul
> > > Cc: ceph-users@lists.ceph.com
> > > Subject: Re: [ceph-users] Jewel upgrade - rbd errors after upgrade
> > >
> > > Are you able to successfully run the following command successfully?
> > >
> > > rados -p glebe-sata get rbd_id.hypervtst-lun04
> > >
> > >
> > >
> > > On Sun, Jun 5, 2016 at 8:49 PM, Adrian Saul
> > >  wrote:
> > > >
> > > > I upgraded my Infernalis semi-production cluster to Jewel on Friday.
> > > > While
> > > the upgrade went through smoothly (aside from a time wasting
> > > restorecon /var/lib/ceph in the selinux package upgrade) and the
> > > services continued running without interruption.  However this
> > > morning when I went to create some new RBD images I am unable to do
> > > much at all
> > with RBD.
> > > >
> > > > Just about any rbd command fails with an I/O error.   I can run
> > > showmapped but that is about it - anything like an ls, info or
> > > status fails.  This applies to all my pools.
> > > >
> > > > I can see no errors in any log files that appear to suggest an
> > > > issue.  I  have
> > > also tried the commands on other cluster members that have not done
> > > anything with RBD before (I was wondering if perhaps the kernel rbd
> > > was pinning the old library version open or something) but the same
> > > error
> > occurs.
> > > >
> > > > Where can I start trying to resolve this?
> > > >
> > > > Cheers,
> > > >  Adrian
> > > >
> > > >
> > > > [root@ceph-glb-fec-01 ceph]# rbd ls glebe-sata
> 

Re: [ceph-users] Jewel upgrade - rbd errors after upgrade

2016-06-05 Thread Adrian Saul

Thanks Jason.

I don’t have anything specified explicitly for osd class dir.   I suspect it 
might be related to the OSDs being restarted during the package upgrade process 
before all libraries are upgraded.
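For the record, the overload Jason mentions can be checked on a live OSD through 
its admin socket; a quick sketch, assuming the default socket location:

ceph daemon osd.20 config get osd_class_dir

It should point at /usr/lib64/rados-classes unless something has overridden it.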


> -Original Message-
> From: Jason Dillaman [mailto:jdill...@redhat.com]
> Sent: Monday, 6 June 2016 12:37 PM
> To: Adrian Saul
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Jewel upgrade - rbd errors after upgrade
>
> Odd -- sounds like you might have Jewel and Infernalis class objects and
> OSDs intermixed. I would double-check your installation and see if your
> configuration has any overload for "osd class dir".
>
> On Sun, Jun 5, 2016 at 10:28 PM, Adrian Saul
>  wrote:
> >
> > I have traced it back to an OSD giving this error:
> >
> > 2016-06-06 12:18:14.315573 7fd714679700 -1 osd.20 23623 class rbd open
> > got (5) Input/output error
> > 2016-06-06 12:19:49.835227 7fd714679700  0 _load_class could not open
> > class /usr/lib64/rados-classes/libcls_rbd.so (dlopen failed):
> > /usr/lib64/rados-classes/libcls_rbd.so: undefined symbol:
> > _ZN4ceph6buffer4list8iteratorC1EPS1_j
> >
> > Trying to figure out why that is the case.
> >
> >
> >> -Original Message-
> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> >> Of Adrian Saul
> >> Sent: Monday, 6 June 2016 11:11 AM
> >> To: dilla...@redhat.com
> >> Cc: ceph-users@lists.ceph.com
> >> Subject: Re: [ceph-users] Jewel upgrade - rbd errors after upgrade
> >>
> >>
> >> No - it throws a usage error - if I add a file argument after it works:
> >>
> >> [root@ceph-glb-fec-02 ceph]# rados -p glebe-sata get
> >> rbd_id.hypervtst-
> >> lun04 /tmp/crap
> >> [root@ceph-glb-fec-02 ceph]# cat /tmp/crap 109eb01f5f89de
> >>
> >> stat works:
> >>
> >> [root@ceph-glb-fec-02 ceph]# rados -p glebe-sata stat
> >> rbd_id.hypervtst-
> >> lun04
> >> glebe-sata/rbd_id.hypervtst-lun04 mtime 2016-06-06 10:55:08.00,
> >> size 18
> >>
> >>
> >> I can do a rados ls:
> >>
> >> [root@ceph-glb-fec-02 ceph]# rados ls -p glebe-sata|grep rbd_id
> >> rbd_id.cloud2sql-lun01
> >> rbd_id.glbcluster3-vm17
> >> rbd_id.holder   <<<  a create that said it failed while I was debugging 
> >> this
> >> rbd_id.pvtcloud-nfs01
> >> rbd_id.hypervtst-lun05
> >> rbd_id.test02
> >> rbd_id.cloud2sql-lun02
> >> rbd_id.fiotest2
> >> rbd_id.radmast02-lun04
> >> rbd_id.hypervtst-lun04
> >> rbd_id.cloud2fs-lun00
> >> rbd_id.radmast02-lun03
> >> rbd_id.hypervtst-lun00
> >> rbd_id.cloud2sql-lun00
> >> rbd_id.radmast02-lun02
> >>
> >>
> >> > -Original Message-
> >> > From: Jason Dillaman [mailto:jdill...@redhat.com]
> >> > Sent: Monday, 6 June 2016 11:00 AM
> >> > To: Adrian Saul
> >> > Cc: ceph-users@lists.ceph.com
> >> > Subject: Re: [ceph-users] Jewel upgrade - rbd errors after upgrade
> >> >
> >> > Are you able to successfully run the following command successfully?
> >> >
> >> > rados -p glebe-sata get rbd_id.hypervtst-lun04
> >> >
> >> >
> >> >
> >> > On Sun, Jun 5, 2016 at 8:49 PM, Adrian Saul
> >> >  wrote:
> >> > >
> >> > > I upgraded my Infernalis semi-production cluster to Jewel on Friday.
> >> > > While
> >> > the upgrade went through smoothly (aside from a time wasting
> >> > restorecon /var/lib/ceph in the selinux package upgrade) and the
> >> > services continued running without interruption.  However this
> >> > morning when I went to create some new RBD images I am unable to do
> >> > much at all
> >> with RBD.
> >> > >
> >> > > Just about any rbd command fails with an I/O error.   I can run
> >> > showmapped but that is about it - anything like an ls, info or
> >> > status fails.  This applies to all my pools.
> >> > >
> >> > > I can see no errors in any log files that appear to suggest an
> >> > > issue.  I  have
> >> > also tried the commands on other cluster members that have not done
> >> > anything with RBD before (I was wondering if perhaps the kernel rbd
> >> > was pinning the old library version open or something) but the same
> >> > erro

Re: [ceph-users] Jewel upgrade - rbd errors after upgrade

2016-06-06 Thread Adrian Saul
Centos 7 - the upgrade was done simply with "yum update -y ceph" on each node 
one by one, so the package order would have been determined by yum.




From: Jason Dillaman 
Sent: Monday, June 6, 2016 10:42 PM
To: Adrian Saul
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Jewel upgrade - rbd errors after upgrade

What OS are you using?  It actually sounds like the plugins were
updated, the Infernalis OSD was reset, and then the Jewel OSD was
installed.

On Sun, Jun 5, 2016 at 10:42 PM, Adrian Saul
 wrote:
>
> Thanks Jason.
>
> I don’t have anything specified explicitly for osd class dir.   I suspect it 
> might be related to the OSDs being restarted during the package upgrade 
> process before all libraries are upgraded.
>
>
>> -Original Message-
>> From: Jason Dillaman [mailto:jdill...@redhat.com]
>> Sent: Monday, 6 June 2016 12:37 PM
>> To: Adrian Saul
>> Cc: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] Jewel upgrade - rbd errors after upgrade
>>
>> Odd -- sounds like you might have Jewel and Infernalis class objects and
>> OSDs intermixed. I would double-check your installation and see if your
>> configuration has any overload for "osd class dir".
>>
>> On Sun, Jun 5, 2016 at 10:28 PM, Adrian Saul
>>  wrote:
>> >
>> > I have traced it back to an OSD giving this error:
>> >
>> > 2016-06-06 12:18:14.315573 7fd714679700 -1 osd.20 23623 class rbd open
>> > got (5) Input/output error
>> > 2016-06-06 12:19:49.835227 7fd714679700  0 _load_class could not open
>> > class /usr/lib64/rados-classes/libcls_rbd.so (dlopen failed):
>> > /usr/lib64/rados-classes/libcls_rbd.so: undefined symbol:
>> > _ZN4ceph6buffer4list8iteratorC1EPS1_j
>> >
>> > Trying to figure out why that is the case.
>> >
>> >
>> >> -Original Message-
>> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
>> >> Of Adrian Saul
>> >> Sent: Monday, 6 June 2016 11:11 AM
>> >> To: dilla...@redhat.com
>> >> Cc: ceph-users@lists.ceph.com
>> >> Subject: Re: [ceph-users] Jewel upgrade - rbd errors after upgrade
>> >>
>> >>
>> >> No - it throws a usage error - if I add a file argument after it works:
>> >>
>> >> [root@ceph-glb-fec-02 ceph]# rados -p glebe-sata get
>> >> rbd_id.hypervtst-
>> >> lun04 /tmp/crap
>> >> [root@ceph-glb-fec-02 ceph]# cat /tmp/crap 109eb01f5f89de
>> >>
>> >> stat works:
>> >>
>> >> [root@ceph-glb-fec-02 ceph]# rados -p glebe-sata stat
>> >> rbd_id.hypervtst-
>> >> lun04
>> >> glebe-sata/rbd_id.hypervtst-lun04 mtime 2016-06-06 10:55:08.00,
>> >> size 18
>> >>
>> >>
>> >> I can do a rados ls:
>> >>
>> >> [root@ceph-glb-fec-02 ceph]# rados ls -p glebe-sata|grep rbd_id
>> >> rbd_id.cloud2sql-lun01
>> >> rbd_id.glbcluster3-vm17
>> >> rbd_id.holder   <<<  a create that said it failed while I was debugging 
>> >> this
>> >> rbd_id.pvtcloud-nfs01
>> >> rbd_id.hypervtst-lun05
>> >> rbd_id.test02
>> >> rbd_id.cloud2sql-lun02
>> >> rbd_id.fiotest2
>> >> rbd_id.radmast02-lun04
>> >> rbd_id.hypervtst-lun04
>> >> rbd_id.cloud2fs-lun00
>> >> rbd_id.radmast02-lun03
>> >> rbd_id.hypervtst-lun00
>> >> rbd_id.cloud2sql-lun00
>> >> rbd_id.radmast02-lun02
>> >>
>> >>
>> >> > -Original Message-
>> >> > From: Jason Dillaman [mailto:jdill...@redhat.com]
>> >> > Sent: Monday, 6 June 2016 11:00 AM
>> >> > To: Adrian Saul
>> >> > Cc: ceph-users@lists.ceph.com
>> >> > Subject: Re: [ceph-users] Jewel upgrade - rbd errors after upgrade
>> >> >
>> >> > Are you able to run the following command successfully?
>> >> >
>> >> > rados -p glebe-sata get rbd_id.hypervtst-lun04
>> >> >
>> >> >
>> >> >
>> >> > On Sun, Jun 5, 2016 at 8:49 PM, Adrian Saul
>> >> >  wrote:
>> >> > >
>> >> > > I upgraded my Infernalis semi-production cluster to Jewel on Friday.
>> >> > > While
>> >> > the upgrade went through smoothly (aside from a time wasting
>> >> > restorecon /var

[ceph-users] OSD out/down detection

2016-06-19 Thread Adrian Saul
Hi All,
 We have a Jewel (10.2.1) cluster on Centos 7 - I am using an elrepo 4.4.1 
kernel on all machines and we have an issue where some of the machines hang - 
not sure if it's hardware or OS, but essentially the host including the console 
is unresponsive and can only be recovered with a hardware reset.  Unfortunately 
nothing useful is logged so I am still trying to figure out what is going on to 
cause this.  But the result for ceph is that if an OSD host goes down like 
this, we have run into an issue where only some of its OSDs are marked down.
In the instance on the weekend, the host had 8 OSDs and only 5 got marked as 
down - this led to the kRBD devices jamming up trying to send IO to 
non-responsive OSDs that stayed marked up.

The machine went into a slow death - lots of reports of slow or blocked 
requests:

2016-06-19 09:37:49.070810 osd.36 10.145.2.15:6802/31359 65 : cluster [WRN] 2 
slow requests, 2 included below; oldest blocked for > 30.297258 secs
2016-06-19 09:37:54.071542 osd.36 10.145.2.15:6802/31359 82 : cluster [WRN] 112 
slow requests, 5 included below; oldest blocked for > 35.297988 secs
2016-06-19 09:37:54.071737 osd.6 10.145.2.15:6801/21836 221 : cluster [WRN] 253 
slow requests, 5 included below; oldest blocked for > 35.325155 secs
2016-06-19 09:37:59.072570 osd.6 10.145.2.15:6801/21836 251 : cluster [WRN] 262 
slow requests, 5 included below; oldest blocked for > 40.325986 secs

And then when the monitors did report them down the OSDs disputed that:

2016-06-19 09:38:35.821716 mon.0 10.145.2.13:6789/0 244970 : cluster [INF] 
osd.6 10.145.2.15:6801/21836 failed (2 reporters from different host after 
20.000365 >= grace 20.00)
2016-06-19 09:38:36.950556 mon.0 10.145.2.13:6789/0 244978 : cluster [INF] 
osd.22 10.145.2.15:6806/21826 failed (2 reporters from different host after 
21.613336 >= grace 20.00)
2016-06-19 09:38:36.951133 mon.0 10.145.2.13:6789/0 244980 : cluster [INF] 
osd.31 10.145.2.15:6812/21838 failed (2 reporters from different host after 
21.613781 >= grace 20.836511)
2016-06-19 09:38:36.951636 mon.0 10.145.2.13:6789/0 244982 : cluster [INF] 
osd.36 10.145.2.15:6802/31359 failed (2 reporters from different host after 
21.614259 >= grace 20.00)

2016-06-19 09:38:37.156088 osd.36 10.145.2.15:6802/31359 346 : cluster [WRN] 
map e28730 wrongly marked me down
2016-06-19 09:38:36.002076 osd.6 10.145.2.15:6801/21836 473 : cluster [WRN] map 
e28729 wrongly marked me down
2016-06-19 09:38:37.046885 osd.22 10.145.2.15:6806/21826 374 : cluster [WRN] 
map e28730 wrongly marked me down
2016-06-19 09:38:37.050635 osd.31 10.145.2.15:6812/21838 351 : cluster [WRN] 
map e28730 wrongly marked me down

But shortly after

2016-06-19 09:43:39.940985 mon.0 10.145.2.13:6789/0 245305 : cluster [INF] 
osd.6 out (down for 303.951251)
2016-06-19 09:43:39.941061 mon.0 10.145.2.13:6789/0 245306 : cluster [INF] 
osd.22 out (down for 302.908528)
2016-06-19 09:43:39.941099 mon.0 10.145.2.13:6789/0 245307 : cluster [INF] 
osd.31 out (down for 302.908527)
2016-06-19 09:43:39.941152 mon.0 10.145.2.13:6789/0 245308 : cluster [INF] 
osd.36 out (down for 302.908527)

2016-06-19 10:09:10.648924 mon.0 10.145.2.13:6789/0 247076 : cluster [INF] 
osd.23 10.145.2.15:6814/21852 failed (2 reporters from different host after 
20.000378 >= grace 20.00)
2016-06-19 10:09:10.887220 osd.23 10.145.2.15:6814/21852 176 : cluster [WRN] 
map e28848 wrongly marked me down
2016-06-19 10:14:15.160513 mon.0 10.145.2.13:6789/0 247422 : cluster [INF] 
osd.23 out (down for 304.288018)

By the time the issue was eventually escalated and I was able to do something 
about it, I manually marked the remaining host OSDs down (which seemed to unclog 
RBD):

2016-06-19 15:25:06.171395 mon.0 10.145.2.13:6789/0 267212 : cluster [INF] 
osd.7 10.145.2.15:6808/21837 failed (2 reporters from different host after 
22.000367 >= grace 20.00)
2016-06-19 15:25:06.171905 mon.0 10.145.2.13:6789/0 267214 : cluster [INF] 
osd.24 10.145.2.15:6800/21813 failed (2 reporters from different host after 
22.000748 >= grace 20.710981)
2016-06-19 15:25:06.172426 mon.0 10.145.2.13:6789/0 267216 : cluster [INF] 
osd.37 10.145.2.15:6810/31936 failed (2 reporters from different host after 
22.001167 >= grace 20.00)

The question I have is why these 3 OSDs, despite not being responsive for over 
5 hours, stayed in the cluster.  The CRUSH map for all pools uses the hosts as 
fault boundaries, so I would have expected OSDs on other hosts to notice these 
as unresponsive and report them.  In the OSD logs nothing was logged in the hour 
prior to the failure, and the OSDs on other hosts seem to have noticed all the 
other OSDs on that host timing out, but with the 3 that stayed up they appeared 
to still be actively attempting backfills.
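
For reference, the detection seen in the logs above is governed by the reporter and 
grace settings; a sketch of the Jewel-era option names (the values shown are just the 
defaults already visible in the log lines, not a tuning recommendation):

    [mon]
    mon_osd_min_down_reporters = 2         # the "2 reporters" in the log lines
    mon_osd_reporter_subtree_level = host  # count reporters per host, matching the CRUSH failure domain
    mon_osd_adjust_heartbeat_grace = true  # lets the mons stretch the grace for historically laggy OSDs
    [osd]
    osd_heartbeat_grace = 20               # the "grace 20.00" in the log lines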

Any ideas on how I can improve detection of this condition?

Cheers,
 Adrian


Confidentiality: This email and any attachments are confidential and may be 
subject to copyright, legal or some other professional privi

[ceph-users] Snap delete performance impact

2016-07-05 Thread Adrian Saul

I recently started a process of using rbd snapshots to set up a backup regime 
for a few file systems contained in RBD images.  While this generally works 
well, at the time of the snapshots there is a massive increase in latency (10ms 
to multiple seconds of rbd device latency) across the entire cluster.  This has 
flow-on effects for some cluster timeouts as well as general performance hits 
to applications.

In research I have found some references to osd_snap_trim_sleep being the way 
to throttle this activity but no real guidance on values for it.   I also see 
some other osd_snap_trim tunables  (priority and cost).

Are there any recommendations around setting these for a Jewel cluster?
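
For what it's worth, the syntax for trying these on a running Jewel cluster looks like 
the below; the value is purely illustrative, not a recommendation:

    ceph tell osd.* injectargs -- --osd_snap_trim_sleep=0.1   # seconds to pause between trim operations
    # osd_snap_trim_priority and the cost tunable can be injected the same way;
    # persist whatever is settled on under [osd] in ceph.conf so it survives restarts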

cheers,
 Adrian

Confidentiality: This email and any attachments are confidential and may be 
subject to copyright, legal or some other professional privilege. They are 
intended solely for the attention and use of the named addressee(s). They may 
only be copied, distributed or disclosed with the consent of the copyright 
owner. If you have received this email by mistake or by breach of the 
confidentiality clause, please notify the sender immediately by return email 
and delete or destroy all copies of the email. Any confidentiality, privilege 
or copyright is not waived or lost because this email has been sent to you by 
mistake.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Terrible RBD performance with Jewel

2016-07-14 Thread Adrian Saul

I would suggest caution with "filestore_odsync_write" - it's fine on good SSDs, 
but on poor SSDs or spinning disks it will kill performance.
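
A quick way to see which camp a given device falls into is a single-threaded 
synchronous 4k write test with fio - a sketch only, the device path is an example and 
the test will destroy data on it:

    fio --name=journal-test --filename=/dev/sdX --direct=1 --sync=1 \
        --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting

Good DC-grade SSDs typically sustain thousands of IOPS here; consumer SSDs and spinning 
disks tend to collapse to a few hundred or less, which is where filestore_odsync_write hurts.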


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Somnath Roy
Sent: Friday, 15 July 2016 3:12 AM
To: Garg, Pankaj; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Terrible RBD performance with Jewel

Try increasing the following to say 10

osd_op_num_shards = 10
filestore_fd_cache_size = 128

Hopefully the following is something you introduced after I told you, so it 
shouldn't be the cause, it seems (?)

filestore_odsync_write = true

Also, comment out the following.

filestore_wbthrottle_enable = false



From: Garg, Pankaj [mailto:pankaj.g...@cavium.com]
Sent: Thursday, July 14, 2016 10:05 AM
To: Somnath Roy; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

Something in this section is causing the 0 IOPS issue. I have not been able 
to nail it down yet. (I did comment out the filestore_max_inline_xattr_size 
entries, and the problem still exists).
If I take out the whole [osd] section, I was able to get rid of IOPS staying at 
0 for long periods of time. Performance is still not where I would expect.
[osd]
osd_enable_op_tracker = false
osd_op_num_shards = 2
filestore_wbthrottle_enable = false
filestore_max_sync_interval = 1
filestore_odsync_write = true
#filestore_max_inline_xattr_size = 254
#filestore_max_inline_xattrs = 6
filestore_queue_committing_max_bytes = 1048576000
filestore_queue_committing_max_ops = 5000
filestore_queue_max_bytes = 1048576000
filestore_queue_max_ops = 500
journal_max_write_bytes = 1048576000
journal_max_write_entries = 1000
journal_queue_max_bytes = 1048576000
journal_queue_max_ops = 3000
filestore_fd_cache_shards = 32
filestore_fd_cache_size = 64

From: Somnath Roy [mailto:somnath@sandisk.com]
Sent: Wednesday, July 13, 2016 7:05 PM
To: Garg, Pankaj; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

I am not sure whether you need to set the following. What's the point of 
reducing inline xattr stuff? I forgot the calculation, but lower values could 
redirect your xattrs to omap. Better comment those out.

filestore_max_inline_xattr_size = 254
filestore_max_inline_xattrs = 6

We could do some improvement on some of the params, but nothing seems 
responsible for the behavior you are seeing.
Could you run iotop and see if any process (like xfsaild) is doing io on the 
drives during that time?

Thanks & Regards
Somnath

From: Garg, Pankaj [mailto:pankaj.g...@cavium.com]
Sent: Wednesday, July 13, 2016 6:40 PM
To: Somnath Roy; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

I agree, but I'm dealing with something else out here with this setup.
I just ran a test, and within 3 seconds my IOPS went to 0, and stayed there for 
90 seconds, then started and within seconds again went to 0.
This doesn't seem normal at all. Here is my ceph.conf:

[global]
fsid = xx
public_network = 
cluster_network = 
mon_initial_members = ceph1
mon_host = 
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
osd_mkfs_options = -f -i size=2048 -n size=64k
osd_mount_options_xfs = inode64,noatime,logbsize=256k
filestore_merge_threshold = 40
filestore_split_multiple = 8
osd_op_threads = 12
osd_pool_default_size = 2
mon_pg_warn_max_object_skew = 10
mon_pg_warn_min_per_osd = 0
mon_pg_warn_max_per_osd = 32768
filestore_op_threads = 6

[osd]
osd_enable_op_tracker = false
osd_op_num_shards = 2
filestore_wbthrottle_enable = false
filestore_max_sync_interval = 1
filestore_odsync_write = true
filestore_max_inline_xattr_size = 254
filestore_max_inline_xattrs = 6
filestore_queue_committing_max_bytes = 1048576000
filestore_queue_committing_max_ops = 5000
filestore_queue_max_bytes = 1048576000
filestore_queue_max_ops = 500
journal_max_write_bytes = 1048576000
journal_max_write_entries = 1000
journal_queue_max_bytes = 1048576000
journal_queue_max_ops = 3000
filestore_fd_cache_shards = 32
filestore_fd_cache_size = 64


From: Somnath Roy [mailto:somnath@sandisk.com]
Sent: Wednesday, July 13, 2016 6:06 PM
To: Garg, Pankaj; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

You should do that first to get a stable performance out with filestore.
1M seq write for the entire image should be sufficient to precondition it.

From: Garg, Pankaj [mailto:pankaj.g...@cavium.com]
Sent: Wednesday, July 13, 2016 6:04 PM
To: Somnath Roy; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

No I have not.

From: Somnath Roy [mailto:somnath@sandisk.com]
Sent: Wednesday, July 13, 2016 6:00 PM
To: Garg, Pankaj; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

In fact, I was wrong , I missed you are running with 12 OSDs (considerin

Re: [ceph-users] Lessons learned upgrading Hammer -> Jewel

2016-07-17 Thread Adrian Saul

I have SELinux disabled and the RPM post-upgrade scripts still do the 
restorecon on /var/lib/ceph regardless.

In my case I chose to kill the restorecon processes to save outage time – it 
didn’t affect the upgrade package completion.
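
In practice "killing the restorecon processes" amounts to something like the below - 
only sensible if, as here, SELinux is disabled and the labels are not relied on:

    getenforce              # confirm SELinux is Disabled/Permissive before touching anything
    pgrep -a restorecon     # find the relabel processes spawned by the RPM %post scripts
    pkill restorecon        # stop the relabel; the package transaction still completes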


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mykola 
Dvornik
Sent: Friday, 15 July 2016 6:54 PM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Lessons learned upgrading Hammer -> Jewel

I would also advise people to mind SELinux if it is enabled on the OSD 
nodes.
The re-labeling should be done as part of the upgrade and this is a rather 
time-consuming process.


-Original Message-
From: Mart van Santen 
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Lessons learned upgrading Hammer -> Jewel
Date: Fri, 15 Jul 2016 10:48:40 +0200


Hi Wido,

Thank you, we are currently in the same process so this information is very 
useful. Can you share why you upgraded from hammer directly to jewel, is there 
a reason to skip infernalis? So, I wonder why you didn't do a 
hammer->infernalis->jewel upgrade, as that seems the logical path for me.

(we did indeed saw the same errors "Failed to encode map eXXX with expected 
crc" when upgrading to the latest hammer)


Regards,

Mart






On 07/15/2016 03:08 AM, 席智勇 wrote:
good job, thank you for sharing, Wido~
it's very useful~

2016-07-14 14:33 GMT+08:00 Wido den Hollander 
mailto:w...@42on.com>>:

To add, the RGWs upgraded just fine as well.

No regions in use here (yet!), so that upgraded as it should.

Wido

> Op 13 juli 2016 om 16:56 schreef Wido den Hollander 
> mailto:w...@42on.com>>:
>
>
> Hello,
>
> The last 3 days I worked at a customer with a 1800 OSD cluster which had to 
> be upgraded from Hammer 0.94.5 to Jewel 10.2.2
>
> The cluster in this case is 99% RGW, but also some RBD.
>
> I wanted to share some of the things we encountered during this upgrade.
>
> All 180 nodes are running CentOS 7.1 on a IPv6-only network.
>
> ** Hammer Upgrade **
> At first we upgraded from 0.94.5 to 0.94.7, this went well except for the 
> fact that the monitors got spammed with these kind of messages:
>
>   "Failed to encode map eXXX with expected crc"
>
> Some searching on the list brought me to:
>
>   ceph tell osd.* injectargs -- --clog_to_monitors=false
>
>  This reduced the load on the 5 monitors and made recovery succeed smoothly.
>
>  ** Monitors to Jewel **
>  The next step was to upgrade the monitors from Hammer to Jewel.
>
>  Using Salt we upgraded the packages and afterwards it was simple:
>
>killall ceph-mon
>chown -R ceph:ceph /var/lib/ceph
>chown -R ceph:ceph /var/log/ceph
>
> Now, a systemd quirk. 'systemctl start ceph.target' does not work, I had to 
> manually enable the monitor and start it:
>
>   systemctl enable 
> ceph-mon@srv-zmb04-05.service
>   systemctl start 
> ceph-mon@srv-zmb04-05.service
>
> Afterwards the monitors were running just fine.
>
> ** OSDs to Jewel **
> To upgrade the OSDs to Jewel we initially used Salt to update the packages on 
> all systems to 10.2.2, we then used a Shell script which we ran on one node 
> at a time.
>
> The failure domain here is 'rack', so we executed this in one rack, then the 
> next one, etc, etc.
>
> Script can be found on Github: 
> https://gist.github.com/wido/06eac901bd42f01ca2f4f1a1d76c49a6
>
> Be aware that the chown can take a long, long, very long time!
>
> We ran into the issue that some OSDs crashed after start. But after trying 
> again they would start.
>
>   "void FileStore::init_temp_collections()"
>
> I reported this in the tracker as I'm not sure what is happening here: 
> http://tracker.ceph.com/issues/16672
>
> ** New OSDs with Jewel **
> We also had some new nodes which we wanted to add to the Jewel cluster.
>
> Using Salt and ceph-disk we ran into a partprobe issue in combination with 
> ceph-disk. There was already a Pull Request for the fix, but that was not 
> included in Jewel 10.2.2.
>
> We manually applied the PR and it fixed our issues: 
> https://github.com/ceph/ceph/pull/9330
>
> Hope this helps other people with their upgrades to Jewel!
>
> Wido
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com






___

ceph-users mailing list

ceph-users@lists.ceph.com

http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___

ceph-users mailing list

ceph-users@lists.ceph.co

[ceph-users] Deep scrub distribution

2017-07-05 Thread Adrian Saul

During a recent snafu with a production cluster I disabled scrubbing and deep 
scrubbing in order to reduce load on the cluster while things backfilled and 
settled down.  The PTSD caused by the incident meant I was not keen to 
re-enable it until I was confident we had fixed the root cause of the issues 
(driver issues with a new NIC type introduced with new hardware that did not 
show up until production load hit them).   My cluster is using Jewel 10.2.1, 
and is a mix of SSD and SATA over 20 hosts, 352 OSDs in total.

Fast forward a few weeks and I was ready to re-enable it.  On some reading I 
was concerned the cluster might kick off excessive scrubbing once I unset the 
flags, so I tried increasing the deep scrub interval from 7 days to 60 days - 
with most of the last deep scrubs being from over a month before I was hoping 
it would distribute them over the next 30 days.  Having unset the flag and 
carefully watched the cluster it seems to have just run a steady catch up 
without significant impact.  What I am noticing though is that the scrubbing 
seems to just run through the full set of PGs, so it did some 2280 PGs last 
night over 6 hours, and so far today in 12 hours another 4000 odd.  With 13408 
PGs, I am guessing that all this will stop some time early tomorrow.

ceph-glb-fec-01[/var/log]$ sudo ceph pg dump|awk '{print $20}'|grep 
2017|sort|uniq -c
dumped all in format plain
  5 2017-05-23
 18 2017-05-24
 33 2017-05-25
 52 2017-05-26
 89 2017-05-27
114 2017-05-28
144 2017-05-29
172 2017-05-30
256 2017-05-31
191 2017-06-01
230 2017-06-02
369 2017-06-03
606 2017-06-04
680 2017-06-05
919 2017-06-06
   1261 2017-06-07
   1876 2017-06-08
 15 2017-06-09
   2280 2017-07-05
   4098 2017-07-06

My concern is whether I am now set to have all 13408 PGs do another deep scrub 
in 60 days' time, again serially over 3 days.  I would much rather they 
distribute over that period.

Will the OSDs do this distribution themselves now they have caught up, or do I 
need to, say, create a script that will trigger batches of PGs to deep scrub over 
time to push out the distribution again?
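
A rough sketch of the kind of batch script alluded to - run periodically (e.g. from 
cron) it kicks the oldest-scrubbed PGs a slice at a time rather than letting them all 
queue up together; the batch size and the column used for the deep-scrub stamp follow 
the ad-hoc awk above and would need checking against your own pg dump output:

    #!/bin/bash
    BATCH=200    # how many PGs to kick per run - purely illustrative
    ceph pg dump pgs 2>/dev/null |
      awk '$1 ~ /^[0-9]+\./ {print $1, $20}' |   # pgid and last deep-scrub date
      sort -k2 |                                  # oldest deep scrubs first
      head -n "$BATCH" |
      while read pgid stamp; do
          ceph pg deep-scrub "$pgid"
      done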





Adrian Saul | Infrastructure Projects Team Lead
IT
T 02 9009 9041 | M +61 402 075 760
30 Ross St, Glebe NSW 2037
adrian.s...@tpgtelecom.com.au | www.tpg.com.au

TPG Telecom (ASX: TPM)




This email and any attachments are confidential and may be subject to 
copyright, legal or some other professional privilege. They are intended solely 
for the attention and use of the named addressee(s). They may only be copied, 
distributed or disclosed with the consent of the copyright owner. If you have 
received this email by mistake or by breach of the confidentiality clause, 
please notify the sender immediately by return email and delete or destroy all 
copies of the email. Any confidentiality, privilege or copyright is not waived 
or lost because this email has been sent to you by mistake.



Confidentiality: This email and any attachments are confidential and may be 
subject to copyright, legal or some other professional privilege. They are 
intended solely for the attention and use of the named addressee(s). They may 
only be copied, distributed or disclosed with the consent of the copyright 
owner. If you have received this email by mistake or by breach of the 
confidentiality clause, please notify the sender immediately by return email 
and delete or destroy all copies of the email. Any confidentiality, privilege 
or copyright is not waived or lost because this email has been sent to you by 
mistake.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] PGs per OSD guidance

2017-07-14 Thread Adrian Saul
Hi All,
   I have been reviewing the sizing of our PGs with a view to some intermittent 
performance issues.  When we have scrubs running, even when only a few are, we 
can sometimes get severe impacts on the performance of RBD images, enough to 
start causing VMs to appear stalled or unresponsive.  When some of these 
scrubs are running I can see very high latency on some disks which I suspect is 
what is impacting the performance.  We currently have around 70 PGs per SATA 
OSD, and 140 PGs per SSD OSD.   These numbers are probably not really 
reflective as most of the data is really only in half of the pools, so some PGs 
would be fairly heavy while others are practically empty.   From what I have 
read we should be able to go significantly higher though.  We are running 
10.2.1 if that matters in this context.

 My question is if we increase the numbers of PGs, is that likely to help 
reduce the scrub impact or spread it wider?  For example, does the mere act of 
scrubbing one PG mean the underlying disk is going to be hammered and so we 
will impact more PGs with that load, or would having more PGs mean the time to 
scrub the PG should be reduced and so the impact will be more dispersed?

I am also curious, from a performance point of view, are we better off with 
more PGs to reduce PG lock contention etc.?

Cheers,
 Adrian


Confidentiality: This email and any attachments are confidential and may be 
subject to copyright, legal or some other professional privilege. They are 
intended solely for the attention and use of the named addressee(s). They may 
only be copied, distributed or disclosed with the consent of the copyright 
owner. If you have received this email by mistake or by breach of the 
confidentiality clause, please notify the sender immediately by return email 
and delete or destroy all copies of the email. Any confidentiality, privilege 
or copyright is not waived or lost because this email has been sent to you by 
mistake.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PGs per OSD guidance

2017-07-19 Thread Adrian Saul

Anyone able to offer any advice on this?

Cheers,
 Adrian


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Adrian Saul
> Sent: Friday, 14 July 2017 6:05 PM
> To: 'ceph-users@lists.ceph.com'
> Subject: [ceph-users] PGs per OSD guidance
>
> Hi All,
>I have been reviewing the sizing of our PGs with a view to some
> intermittent performance issues.  When we have scrubs running, even when
> only a few are, we can sometimes get severe impacts on the performance of
> RBD images, enough to start causing VMs to appear stalled or unresponsive.
> When some of these scrubs are running I can see very high latency on some
> disks which I suspect is what is impacting the performance.  We currently
> have around 70 PGs per SATA OSD, and 140 PGs per SSD OSD.   These
> numbers are probably not really reflective as most of the data is in only 
> really
> half of the pools, so some PGs would be fairly heavy while others are
> practically empty.   From what I have read we should be able to go
> significantly higher though.We are running 10.2.1 if that matters in this
> context.
>
>  My question is if we increase the numbers of PGs, is that likely to help
> reduce the scrub impact or spread it wider?  For example, does the mere act
> of scrubbing one PG mean the underlying disk is going to be hammered and
> so we will impact more PGs with that load, or would having more PGs mean
> the time to scrub the PG should be reduced and so the impact will be more
> dispersed?
>
> I am also curious, from a performance point of view, are we better off with
> more PGs to reduce PG lock contention etc.?
>
> Cheers,
>  Adrian
>
>
> Confidentiality: This email and any attachments are confidential and may be
> subject to copyright, legal or some other professional privilege. They are
> intended solely for the attention and use of the named addressee(s). They
> may only be copied, distributed or disclosed with the consent of the
> copyright owner. If you have received this email by mistake or by breach of
> the confidentiality clause, please notify the sender immediately by return
> email and delete or destroy all copies of the email. Any confidentiality,
> privilege or copyright is not waived or lost because this email has been sent
> to you by mistake.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Confidentiality: This email and any attachments are confidential and may be 
subject to copyright, legal or some other professional privilege. They are 
intended solely for the attention and use of the named addressee(s). They may 
only be copied, distributed or disclosed with the consent of the copyright 
owner. If you have received this email by mistake or by breach of the 
confidentiality clause, please notify the sender immediately by return email 
and delete or destroy all copies of the email. Any confidentiality, privilege 
or copyright is not waived or lost because this email has been sent to you by 
mistake.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Does ceph pg scrub error affect all of I/O in ceph cluster?

2017-08-03 Thread Adrian Saul

Depends on the error case – usually you will see blocked IO messages as well if 
there is a condition causing OSDs to be unresponsive.
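
A quick way to separate the two cases (sketch; the pgid is a placeholder):

    ceph health detail                     # lists the inconsistent PGs / scrub errors behind HEALTH_ERR
    ceph -s                                # blocked or slow request warnings are what indicate real I/O impact
    rados list-inconsistent-obj <pgid>     # inspect the damaged objects (Jewel and later)
    ceph pg repair <pgid>                  # repair once the cause is understood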


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of ???
Sent: Friday, 4 August 2017 1:34 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Does ceph pg scrub error affect all of I/O in ceph 
cluster?

Hi cephers,

I experienced ceph status into HEALTH_ERR because of pg scrub error.

I thought all I/O is blocked when the status of ceph is Error.

However, ceph could operate normally even though ceph is in error status.

There are two pools in the ceph cluster which include separate 
nodes (volumes-1, volumes-2).

The OSD device which has problem is in volumes-1 pool.

I noticed that volumes-2 pool has no problem with operation.

My question is: are all I/O requests blocked when the ceph status goes into 
error, or does it depend on the error case?

Thank you!
John Haan
Confidentiality: This email and any attachments are confidential and may be 
subject to copyright, legal or some other professional privilege. They are 
intended solely for the attention and use of the named addressee(s). They may 
only be copied, distributed or disclosed with the consent of the copyright 
owner. If you have received this email by mistake or by breach of the 
confidentiality clause, please notify the sender immediately by return email 
and delete or destroy all copies of the email. Any confidentiality, privilege 
or copyright is not waived or lost because this email has been sent to you by 
mistake.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Iscsi configuration

2017-08-08 Thread Adrian Saul
Hi Sam,
  We use SCST for iSCSI with Ceph, and a pacemaker cluster to orchestrate the 
management of active/passive presentation using ALUA through SCST device groups. 
 In our case we ended up writing our own pacemaker resources to support our 
particular model and preferences, but I believe there are a few resources out 
there for setting this up that you could make use of.

For us it consists of resources for the RBD devices, the iSCSI targets, the 
device groups and hostgroups for presentation.  The resources are cloned across 
all the cluster nodes, except for the device group resources which are 
master/slave, with the master becoming the active ALUA member and the others 
becoming standby or non-optimised.

The iSCSI clients see the ALUA presentation and manage it with their own 
multipathing stacks.

There may be ways to do it with LIO now, but at the time I looked, the ALUA 
support in SCST was a lot better.

HTH.

Cheers,
 Adrian



From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Samuel 
Soulard
Sent: Wednesday, 9 August 2017 6:45 AM
To: ceph-us...@ceph.com
Subject: [ceph-users] Iscsi configuration

Hi all,

Platform : Centos 7 Luminous 12.1.2

First time here but, are there any guides or guidelines out there on how to 
configure ISCSI gateways in HA so that if one gateway fails, IO can continue on 
the passive node?

What I've done so far
-ISCSI node with Ceph client map rbd on boot
-Rbd has exclusive-lock feature enabled and layering
-Targetd service dependent on rbdmap.service
-rbd exported through LUN ISCSI
-Windows ISCSI initiator can map the lun and format / write to it (awesome)

Now I have no idea where to start to have an active /passive scenario for luns 
exported with LIO.  Any ideas?

Also the web dashboard seems to hint that it can get stats for various clients 
made on ISCSI gateways; I'm not sure where it pulls that information. Is 
Luminous now shipping an ISCSI daemon of some sort?

Thanks all!
Confidentiality: This email and any attachments are confidential and may be 
subject to copyright, legal or some other professional privilege. They are 
intended solely for the attention and use of the named addressee(s). They may 
only be copied, distributed or disclosed with the consent of the copyright 
owner. If you have received this email by mistake or by breach of the 
confidentiality clause, please notify the sender immediately by return email 
and delete or destroy all copies of the email. Any confidentiality, privilege 
or copyright is not waived or lost because this email has been sent to you by 
mistake.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] VMware + Ceph using NFS sync/async ?

2017-08-16 Thread Adrian Saul

We are using Ceph on NFS for VMWare – we are using SSD tiers in front of SATA 
and some direct SSD pools.  The datastores are just XFS file systems on RBD 
managed by a pacemaker cluster for failover.

Lessons so far are that large datastores quickly run out of IOPS and compete 
for performance – you are better off with many smaller RBDs (say 1TB) to spread 
out workloads.  Also tuning up NFS threads seems to help.
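
On the NFS thread tuning, a minimal sketch assuming a CentOS 7 style nfs-utils setup - 
the export paths are examples only:

    # /etc/sysconfig/nfs
    RPCNFSDCOUNT=64       # the default of 8 nfsd threads is usually too low for many ESXi clients

    # /etc/exports - one export per (smaller) RBD-backed XFS filesystem
    /srv/datastore01  *(rw,sync,no_root_squash)
    /srv/datastore02  *(rw,sync,no_root_squash)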


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Osama 
Hasebou
Sent: Wednesday, 16 August 2017 10:34 PM
To: n...@fisk.me.uk
Cc: ceph-users 
Subject: Re: [ceph-users] VMware + Ceph using NFS sync/async ?

Hi Nick,

Thanks for replying! If Ceph is combined with Openstack, does that mean that 
when openstack writes are happening, the data is not fully sync'd (as in 
written to disks) before it starts receiving more data, so it is acting as 
async?  In that scenario there is a chance of data loss if things go bad, i.e. 
a power outage or something like that?

As for the slow operations, reading is quite fine when I compare it to a SAN 
storage system connected to VMware. It is writing data, small chunks or big 
ones, that suffers when trying to use the sync option with FIO for benchmarking.

In that case, I wonder, is no one using CEPH with VMware in a production 
environment ?

Cheers.

Regards,
Ossi



Hi Osama,

This is a known problem with many software defined storage stacks, but 
potentially slightly worse with Ceph due to extra overheads. Sync writes have 
to wait until all copies of the data are written to disk by the OSD and 
acknowledged back to the client. The extra network hops for replication and NFS 
gateways add significant latency which impacts the time it takes to carry out 
small writes. The Ceph code also takes time to process each IO request.

What particular operations are you finding slow? Storage vmotions are just bad, 
and I don’t think there is much that can be done about them as they are split 
into lots of 64kb IO’s.

One thing you can try is to force the CPU’s on your OSD nodes to run at C1 
cstate and force their minimum frequency to 100%. This can have quite a large 
impact on latency. Also you don’t specify your network, but 10G is a must.

Nick


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Osama 
Hasebou
Sent: 14 August 2017 12:27
To: ceph-users mailto:ceph-users@lists.ceph.com>>
Subject: [ceph-users] VMware + Ceph using NFS sync/async ?

Hi Everyone,

We started testing the idea of using Ceph storage with VMware, the idea was to 
provide Ceph storage through open stack to VMware, by creating a virtual 
machine coming from Ceph + Openstack, which acts as an NFS gateway, then mount 
that storage on top of the VMware cluster.

When mounting the NFS exports using the sync option, we noticed a huge 
degradation in performance which makes it very slow to use in production; 
the async option makes it much better, but then there is the risk that in 
case a failure happens, some data might be lost in that scenario.

Now I understand that some people in the ceph community are using Ceph with 
VMware using NFS gateways, so if you can kindly shed some light on your 
experience, and if you do use it for production purposes, that would be great and 
how did you mitigate the sync/async options and keep write performance.


Thanks you!!!

Regards,
Ossi


Confidentiality: This email and any attachments are confidential and may be 
subject to copyright, legal or some other professional privilege. They are 
intended solely for the attention and use of the named addressee(s). They may 
only be copied, distributed or disclosed with the consent of the copyright 
owner. If you have received this email by mistake or by breach of the 
confidentiality clause, please notify the sender immediately by return email 
and delete or destroy all copies of the email. Any confidentiality, privilege 
or copyright is not waived or lost because this email has been sent to you by 
mistake.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] VMware + Ceph using NFS sync/async ?

2017-08-16 Thread Adrian Saul
> I'd be interested in details of this small versus large bit.

The smaller shares are simply to distribute the workload over more RBDs so 
the bottleneck doesn’t become the RBD device. The size itself doesn’t 
particularly matter; the idea is just to distribute VMs across many shares 
rather than a few large datastores.

We originally started with 10TB shares, just because we had the space - but we 
found performance was running out before capacity did.  But it's been apparent 
that the limitation appears to be at the RBD level, particularly with writes.  
So under heavy usage, with say VMWare snapshot backups, VMs get impacted by 
higher latency to the point that some VMs become unresponsive for small 
periods.  The ceph cluster itself has plenty of performance available and 
handles far higher workload periods, but individual RBD devices just seem to 
hit the wall.

For example, one of our shares will sit there all day happily doing 3-400 IOPS 
read at very low latencies.  During the backup period we get heavier writes as 
snapshots are created and cleaned up.   That increased write activity pushes 
the RBD to 100% busy and read latencies go up from 1-2ms to 20-30ms, even 
though the number of reads doesn’t change that much.   The devices though can 
handle more, I can see periods of up to 1800 IOPS read and 800 write.

There is probably more tuning that can be applied at the XFS/NFS level, but for 
the moment that’s the direction we are taking - creating more shares.

>
> Would you say that the IOPS starvation is more an issue of the large
> filesystem than the underlying Ceph/RBD?

As above - I think it's more to do with an IOPS limitation at the RBD device 
level - likely due to sync write latency limiting the number of effective IOs.  
That might be XFS as well but I have not had the chance to dial that in more.

> With a cache-tier in place I'd expect all hot FS objects (inodes, etc) to be
> there and thus be as fast as it gets from a Ceph perspective.

Yeah - the cache tier takes a fair bit of the heat and improves the response 
considerably for the SATA environments - it makes a significant difference.  
The SSD only pool images behave in a similar way but operate to a much higher 
performance level before they start showing issues.

> OTOH lots of competing accesses to same journal, inodes would be a
> limitation inherent to the FS.

It's likely there is tuning there to improve the XFS performance, but the 
stats of the RBD device are showing the latencies going up; there might be 
more impact further up the stack, but the underlying device shows the change in 
performance.

>
> Christian
>
> >
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> > Of Osama Hasebou
> > Sent: Wednesday, 16 August 2017 10:34 PM
> > To: n...@fisk.me.uk
> > Cc: ceph-users 
> > Subject: Re: [ceph-users] VMware + Ceph using NFS sync/async ?
> >
> > Hi Nick,
> >
> > Thanks for replying! If Ceph is combined with Openstack then, does that
> mean that actually when openstack writes are happening, it is not fully sync'd
> (as in written to disks) before it starts receiving more data, so acting as 
> async
> ? In that scenario there is a chance for data loss if things go bad, i.e power
> outage or something like that ?
> >
> > As for the slow operations, reading is quite fine when I compare it to a SAN
> storage system connected to VMware. It is writing data, small chunks or big
> ones, that suffer when trying to use the sync option with FIO for
> benchmarking.
> >
> > In that case, I wonder, is no one using CEPH with VMware in a production
> environment ?
> >
> > Cheers.
> >
> > Regards,
> > Ossi
> >
> >
> >
> > Hi Osama,
> >
> > This is a known problem with many software defined storage stacks, but
> potentially slightly worse with Ceph due to extra overheads. Sync writes
> have to wait until all copies of the data are written to disk by the OSD and
> acknowledged back to the client. The extra network hops for replication and
> NFS gateways add significant latency which impacts the time it takes to carry
> out small writes. The Ceph code also takes time to process each IO request.
> >
> > What particular operations are you finding slow? Storage vmotions are just
> bad, and I don’t think there is much that can be done about them as they are
> split into lots of 64kb IO’s.
> >
> > One thing you can try is to force the CPU’s on your OSD nodes to run at C1
> cstate and force their minimum frequency to 100%. This can have quite a
> large impact on latency. Also you don’t specify your network, but 10G is a
> must.
> >
> > Nick
> >
> >
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> > Of Osama Hasebou
> > Sent: 14 August 2017 12:27
> > To: ceph-users
> > mailto:ceph-users@lists.ceph.com>>
> > Subject: [ceph-users] VMware + Ceph using NFS sync/async ?
> >
> > Hi Everyone,
> >
> > We started testing the idea of using Ceph storage with VMware, the idea
> was to provide C

Re: [ceph-users] Ceph cluster with SSDs

2017-08-20 Thread Adrian Saul
> SSD make details : SSD 850 EVO 2.5" SATA III 4TB Memory & Storage - MZ-
> 75E4T0B/AM | Samsung

The performance difference between these and the SM or PM863 range is night and 
day.  I would not use these for anything you care about with performance, 
particularly IOPS or latency.
Their write latency is highly variable and even at best is still 5x higher than 
what the SM863 range does.  When we compared them we could not get them below 
6ms and they frequently spiked to much higher values (25-30ms).  With the 
SM863s they were a constant sub 1ms and didn't fluctuate.  I believe it was the 
garbage collection on the Evos that causes the issue.  Here was the difference 
in average latencies from a pool made of half Evo and half SM863:

Write latency - Evo 7.64ms - SM863 0.55ms
Read Latency - Evo 2.56ms - SM863  0.16ms

Add to that Christian's remarks on the write endurance and they are only good 
for desktops that won't exercise them that much.  You are far better off 
investing in DC/Enterprise grade devices.
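
For what it's worth, in a mixed pool the difference shows up directly in the per-OSD 
latency counters, e.g.:

    ceph osd perf    # fs_commit_latency / fs_apply_latency per OSD; the Evo-backed OSDs stand out immediately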




>
> On Sat, Aug 19, 2017 at 10:44 PM, M Ranga Swami Reddy
>  wrote:
> > Yes, Its in production and used the pg count as per the pg calcuator @
> ceph.com.
> >
> > On Fri, Aug 18, 2017 at 3:30 AM, Mehmet  wrote:
> >> Which ssds are used? Are they in production? If so how is your PG Count?
> >>
> >> Am 17. August 2017 20:04:25 MESZ schrieb M Ranga Swami Reddy
> >> :
> >>>
> >>> Hello,
> >>> I am using the Ceph cluster with HDDs and SSDs. Created separate
> >>> pool for each.
> >>> Now, when I ran the "ceph osd bench", HDD's OSDs show around 500
> >>> MB/s and SSD's OSD show around 280MB/s.
> >>>
> >>> Ideally, what I expected was - SSD's OSDs should be at-least 40%
> >>> high as compared with HDD's OSD bench.
> >>>
> >>> Did I miss anything here? Any hint is appreciated.
> >>>
> >>> Thanks
> >>> Swami
> >>> 
> >>>
> >>> ceph-users mailing list
> >>> ceph-users@lists.ceph.com
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> >>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Confidentiality: This email and any attachments are confidential and may be 
subject to copyright, legal or some other professional privilege. They are 
intended solely for the attention and use of the named addressee(s). They may 
only be copied, distributed or disclosed with the consent of the copyright 
owner. If you have received this email by mistake or by breach of the 
confidentiality clause, please notify the sender immediately by return email 
and delete or destroy all copies of the email. Any confidentiality, privilege 
or copyright is not waived or lost because this email has been sent to you by 
mistake.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ruleset vs replica count

2017-08-24 Thread Adrian Saul

Yes - ams5-ssd would have 2 replicas, ams6-ssd would have 1  (@size 3, -2 = 1)

Although for this ruleset the min_size should be set to at least 2, or more 
practically 3 or 4.
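
An illustrative reading of ruleset 1 quoted below, with a pool size of 3 (standard 
CRUSH "firstn" semantics - a negative N means "pool size minus |N|"):

    step take ams5-ssd
    step chooseleaf firstn 2 type host    # firstn 2  -> 2 hosts in ams5-ssd  (2 replicas)
    step emit
    step take ams6-ssd
    step chooseleaf firstn -2 type host   # firstn -2 -> (3 - 2) = 1 host in ams6-ssd  (1 replica)
    step emit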


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Sinan 
Polat
Sent: Friday, 25 August 2017 3:02 AM
To: ceph-us...@ceph.com
Subject: [ceph-users] Ruleset vs replica count

Hi,

In a Multi Datacenter Cluster I have the following rulesets:
--
rule ams5_ssd {
ruleset 1
type replicated
min_size 1
max_size 10
step take ams5-ssd
step chooseleaf firstn 2 type host
step emit
step take ams6-ssd
step chooseleaf firstn -2 type host
step emit
}
rule ams6_ssd {
ruleset 2
type replicated
min_size 1
max_size 10
step take ams6-ssd
step chooseleaf firstn 2 type host
step emit
step take ams5-ssd
step chooseleaf firstn -2 type host
step emit
}
--

The replication size is set to 3.

When for example ruleset 1 is used, how is the replication being done? Does it 
store 2 replica's in ams5-ssd and store 1 replica in ams6-ssd? Or does it store 
3 replicas in ams5-ssd and 3 replicas in ams6-ssd?

Thanks!

Sinan
Confidentiality: This email and any attachments are confidential and may be 
subject to copyright, legal or some other professional privilege. They are 
intended solely for the attention and use of the named addressee(s). They may 
only be copied, distributed or disclosed with the consent of the copyright 
owner. If you have received this email by mistake or by breach of the 
confidentiality clause, please notify the sender immediately by return email 
and delete or destroy all copies of the email. Any confidentiality, privilege 
or copyright is not waived or lost because this email has been sent to you by 
mistake.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Monitoring a rbd map rbd connection

2017-08-24 Thread Adrian Saul
If you are monitoring to ensure that it is mounted and active, a simple 
check_disk on the mountpoint should work.  If the mount is not present, or the 
filesystem is non-responsive then this should pick it up. A second check to 
perhaps test you can actually write files to the file system would not go 
astray either.

Other than that I don't think there is much point checking anything else like 
rbd mapped output.
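
A minimal sketch of that second check - a write probe suitable for an Icinga/Nagios 
plugin, where the mountpoint path is just an example:

    #!/bin/bash
    MNT="/mnt/rbd01"
    PROBE="$MNT/.icinga_probe.$$"
    if timeout 10 sh -c "echo ok > '$PROBE' && rm -f '$PROBE'"; then
        echo "OK - $MNT is mounted and writable"
        exit 0
    else
        echo "CRITICAL - cannot write to $MNT"
        exit 2
    fi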


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Hauke Homburg
> Sent: Friday, 25 August 2017 1:35 PM
> To: ceph-users 
> Subject: [ceph-users] Monitoring a rbd map rbd connection
>
> Hello,
>
> I want to monitor the mapped connection between an rbd-mapped RBD image
> and a /dev/rbd device.
>
> I want to do this with Icinga.
>
> Does anyone have an idea how I can do this?
>
> My first idea is to touch and remove a file in the mount point. I am not sure
> that this is the only thing I have to do
>
>
> Thanks for Help
>
> Hauke
>
> --
> www.w3-creative.de
>
> www.westchat.de
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Confidentiality: This email and any attachments are confidential and may be 
subject to copyright, legal or some other professional privilege. They are 
intended solely for the attention and use of the named addressee(s). They may 
only be copied, distributed or disclosed with the consent of the copyright 
owner. If you have received this email by mistake or by breach of the 
confidentiality clause, please notify the sender immediately by return email 
and delete or destroy all copies of the email. Any confidentiality, privilege 
or copyright is not waived or lost because this email has been sent to you by 
mistake.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph release cadence

2017-09-06 Thread Adrian Saul
> * Drop the odd releases, and aim for a ~9 month cadence. This splits the
> difference between the current even/odd pattern we've been doing.
>
>   + eliminate the confusing odd releases with dubious value
>   + waiting for the next release isn't quite as bad
>   - required upgrades every 9 months instead of ever 12 months

As a user, this is probably closest to the ideal, although a production deployment 
might slip out of the LTS view in 18 months, given that once deployed they tend to 
stay static.

From a testing perspective it would be good to know you could deploy the 
"early access" version of a release and test with that rather than having to 
switch release to productionise when that release is blessed.

Also, and this might be harder to achieve, but could krbd support for new 
releases be more aligned with kernel versions?  Or at the least a definitive 
map of what kernels and backports support which release.


Confidentiality: This email and any attachments are confidential and may be 
subject to copyright, legal or some other professional privilege. They are 
intended solely for the attention and use of the named addressee(s). They may 
only be copied, distributed or disclosed with the consent of the copyright 
owner. If you have received this email by mistake or by breach of the 
confidentiality clause, please notify the sender immediately by return email 
and delete or destroy all copies of the email. Any confidentiality, privilege 
or copyright is not waived or lost because this email has been sent to you by 
mistake.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-osd restartd via systemd in case of disk error

2017-09-19 Thread Adrian Saul
> I understand what you mean and it's indeed dangerous, but see:
> https://github.com/ceph/ceph/blob/master/systemd/ceph-osd%40.service
>
> Looking at the systemd docs it's difficult though:
> https://www.freedesktop.org/software/systemd/man/systemd.service.ht
> ml
>
> If the OSD crashes due to another bug you do want it to restart.
>
> But for systemd it's not possible to see if the crash was due to a disk I/O-
> error or a bug in the OSD itself or maybe the OOM-killer or something.

Perhaps using something like RestartPreventExitStatus and defining a specific 
exit code for the OSD to use when it is exiting due to an IO error.
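
A sketch of what that could look like as a systemd drop-in, assuming the OSD were 
changed to exit with a dedicated status (99 here is an arbitrary example) on fatal 
disk I/O errors:

    # /etc/systemd/system/ceph-osd@.service.d/override.conf
    [Service]
    RestartPreventExitStatus=99

    # then: systemctl daemon-reload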

Confidentiality: This email and any attachments are confidential and may be 
subject to copyright, legal or some other professional privilege. They are 
intended solely for the attention and use of the named addressee(s). They may 
only be copied, distributed or disclosed with the consent of the copyright 
owner. If you have received this email by mistake or by breach of the 
confidentiality clause, please notify the sender immediately by return email 
and delete or destroy all copies of the email. Any confidentiality, privilege 
or copyright is not waived or lost because this email has been sent to you by 
mistake.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] librmb: Mail storage on RADOS with Dovecot

2017-09-21 Thread Adrian Saul

Thanks for bringing this to our attention Wido - it's of interest to us as we are 
currently looking to migrate mail platforms onto Ceph using NFS, but this seems 
far more practical.


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Wido den Hollander
> Sent: Thursday, 21 September 2017 6:40 PM
> To: ceph-us...@ceph.com
> Subject: [ceph-users] librmb: Mail storage on RADOS with Dovecot
>
> Hi,
>
> A tracker issue has been out there for a while:
> http://tracker.ceph.com/issues/12430
>
> Storing e-mail in RADOS with Dovecot, the IMAP/POP3/LDA server with a
> huge marketshare.
>
> It took a while, but last year Deutsche Telekom took on the heavy work and
> started a project to develop librmb: LibRadosMailBox
>
> Together with Deutsche Telekom and Tallence GmbH (DE) this project came
> to life.
>
> First, the Github link: https://github.com/ceph-dovecot/dovecot-ceph-
> plugin
>
> I am not going to repeat everything which is on Github, put a short summary:
>
> - CephFS is used for storing Mailbox Indexes
> - E-Mails are stored directly as RADOS objects
> - It's a Dovecot plugin
>
> We would like everybody to test librmb and report back issues on Github so
> that further development can be done.
>
> It's not finalized yet, but all the help is welcome to make librmb the best
> solution for storing your e-mails on Ceph with Dovecot.
>
> Danny Al-Gaaf has written a small blogpost about it and a presentation:
>
> - https://dalgaaf.github.io/CephMeetUpBerlin20170918-librmb/
> - http://blog.bisect.de/2017/09/ceph-meetup-berlin-followup-librmb.html
>
> To get a idea of the scale: 4,7PB of RAW storage over 1.200 OSDs is the final
> goal (last slide in presentation). That will provide roughly 1,2PB of usable
> storage capacity for storing e-mail, a lot of e-mail.
>
> To see this project finally go into the Open Source world excites me a lot :-)
>
> A very, very big thanks to Deutsche Telekom for funding this awesome
> project!
>
> A big thanks as well to Tallence as they did an awesome job in developing
> librmb in such a short time.
>
> Wido
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Confidentiality: This email and any attachments are confidential and may be 
subject to copyright, legal or some other professional privilege. They are 
intended solely for the attention and use of the named addressee(s). They may 
only be copied, distributed or disclosed with the consent of the copyright 
owner. If you have received this email by mistake or by breach of the 
confidentiality clause, please notify the sender immediately by return email 
and delete or destroy all copies of the email. Any confidentiality, privilege 
or copyright is not waived or lost because this email has been sent to you by 
mistake.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd create returns duplicate ID's

2017-09-29 Thread Adrian Saul

Do you mean that after you delete and remove the crush and auth entries for the 
OSD, when you go to create another OSD later it will re-use the previous OSD ID 
that you have destroyed in the past?

Because I have seen that behaviour as well -  but only for previously allocated 
OSD IDs that have been osd rm/crush rm/auth del.
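
For clarity, the removal sequence being described is the usual Jewel-era one, after 
which a later "ceph osd create" hands back the lowest free ID - so reuse of a fully 
removed ID is expected behaviour (osd.12 is just an example):

    ceph osd out 12
    ceph osd crush remove osd.12
    ceph auth del osd.12
    ceph osd rm 12
    # a subsequent "ceph osd create" on any node can now return 12 again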




> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Luis Periquito
> Sent: Friday, 29 September 2017 6:01 PM
> To: Ceph Users 
> Subject: [ceph-users] osd create returns duplicate ID's
>
> Hi all,
>
> I use puppet to deploy and manage my clusters.
>
> Recently, as I have been doing a removal of old hardware and adding of new
> I've noticed that sometimes the "ceph osd create" is returning repeated IDs.
> Usually it's on the same server, but yesterday I saw it in different servers.
>
> I was expecting the OSD ID's to be unique, and when they come on the same
> server puppet starts spewing errors - which is desirable - but when it's in
> different servers it broke those OSDs in Ceph. As they hadn't backfill any 
> full
> PGs I just wiped, removed and started anew.
>
> As for the process itself: The OSDs are marked out and removed from crush,
> when empty they are auth del and osd rm. After building the server puppet
> will osd create, and use the generated ID for crush move and mkfs.
>
> Unfortunately I haven't been able to reproduce in isolation, and being a
> production cluster logging is tuned way down.
>
> This has happened in several different clusters, but they are all running
> 10.2.7.
>
> Any ideas?
>
> thanks,
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Confidentiality: This email and any attachments are confidential and may be 
subject to copyright, legal or some other professional privilege. They are 
intended solely for the attention and use of the named addressee(s). They may 
only be copied, distributed or disclosed with the consent of the copyright 
owner. If you have received this email by mistake or by breach of the 
confidentiality clause, please notify the sender immediately by return email 
and delete or destroy all copies of the email. Any confidentiality, privilege 
or copyright is not waived or lost because this email has been sent to you by 
mistake.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] bad crc/signature errors

2017-10-04 Thread Adrian Saul

We see the same messages and are similarly on a 4.4 KRBD version that is 
affected by this.

I have seen no impact from it so far that I know about.


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Jason Dillaman
> Sent: Thursday, 5 October 2017 5:45 AM
> To: Gregory Farnum 
> Cc: ceph-users ; Josy
> 
> Subject: Re: [ceph-users] bad crc/signature errors
>
> Perhaps this is related to a known issue on some 4.4 and later kernels [1]
> where the stable write flag was not preserved by the kernel?
>
> [1] http://tracker.ceph.com/issues/19275
>
> On Wed, Oct 4, 2017 at 2:36 PM, Gregory Farnum 
> wrote:
> > That message indicates that the checksums of messages between your
> > kernel client and OSD are incorrect. It could be actual physical
> > transmission errors, but if you don't see other issues then this isn't
> > fatal; they can recover from it.
> >
> > On Wed, Oct 4, 2017 at 8:52 AM Josy 
> wrote:
> >>
> >> Hi,
> >>
> >> We have setup a cluster with 8 OSD servers (31 disks)
> >>
> >> Ceph health is Ok.
> >> --
> >> [root@las1-1-44 ~]# ceph -s
> >>cluster:
> >>  id: de296604-d85c-46ab-a3af-add3367f0e6d
> >>  health: HEALTH_OK
> >>
> >>services:
> >>  mon: 3 daemons, quorum
> >> ceph-las-mon-a1,ceph-las-mon-a2,ceph-las-mon-a3
> >>  mgr: ceph-las-mon-a1(active), standbys: ceph-las-mon-a2
> >>  osd: 31 osds: 31 up, 31 in
> >>
> >>data:
> >>  pools:   4 pools, 510 pgs
> >>  objects: 459k objects, 1800 GB
> >>  usage:   5288 GB used, 24461 GB / 29749 GB avail
> >>  pgs: 510 active+clean
> >> 
> >>
> >> We created a pool and mounted it as RBD in one of the client server.
> >> While adding data to it, we see this below error :
> >>
> >> 
> >> [939656.039750] libceph: osd20 10.255.0.9:6808 bad crc/signature
> >> [939656.041079] libceph: osd16 10.255.0.8:6816 bad crc/signature
> >> [939735.627456] libceph: osd11 10.255.0.7:6800 bad crc/signature
> >> [939735.628293] libceph: osd30 10.255.0.11:6804 bad crc/signature
> >>
> >> =
> >>
> >> Can anyone explain what is this and if I can fix it ?
> >>
> >>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
>
>
> --
> Jason
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph-ISCSI

2017-10-11 Thread Adrian Saul

As an aside, SCST iSCSI will support ALUA and does PGRs through the use of 
DLM.  We have been using that with Solaris and Hyper-V initiators for RBD 
backed storage but still have some ongoing issues with ALUA (probably our 
current config; we need to lab-test the later recommendations).



> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Jason Dillaman
> Sent: Thursday, 12 October 2017 5:04 AM
> To: Samuel Soulard 
> Cc: ceph-users ; Zhu Lingshan 
> Subject: Re: [ceph-users] Ceph-ISCSI
>
> On Wed, Oct 11, 2017 at 1:10 PM, Samuel Soulard
>  wrote:
> > Hmmm, If you failover the identity of the LIO configuration including
> > PGRs (I believe they are files on disk), this would work no?  Using an
> > 2 ISCSI gateways which have shared storage to store the LIO
> > configuration and PGR data.
>
> Are you referring to the Active Persist Through Power Loss (APTPL) support
> in LIO where it writes the PR metadata to "/var/target/pr/aptpl_"? I
> suppose that would work for a Pacemaker failover if you had a shared file
> system mounted between all your gateways *and* the initiator requests
> APTPL mode(?).
>
> > Also, you said another "fails over to another port", do you mean a
> > port on another ISCSI gateway?  I believe LIO with multiple target
> > portal IP on the same node for path redundancy works with PGRs.
>
> Yes, I was referring to the case with multiple active iSCSI gateways which
> doesn't currently distribute PGRs to all gateways in the group.
>
> > In my scenario, if my assumptions are correct, you would only have 1
> > ISCSI gateway available through 2 target portal IP (for data path
> > redundancy).  If this first ISCSI gateway fails, both target portal IP
> > failover to the standby node with the PGR data that is available on share
> stored.
> >
> >
> > Sam
> >
> > On Wed, Oct 11, 2017 at 12:52 PM, Jason Dillaman 
> > wrote:
> >>
> >> On Wed, Oct 11, 2017 at 12:31 PM, Samuel Soulard
> >>  wrote:
> >> > Hi to all,
> >> >
> >> > What if you're using an ISCSI gateway based on LIO and KRBD (that
> >> > is, RBD block device mounted on the ISCSI gateway and published
> >> > through LIO).
> >> > The
> >> > LIO target portal (virtual IP) would failover to another node.
> >> > This would theoretically provide support for PGRs since LIO does
> >> > support SPC-3.
> >> > Granted it is not distributed and limited to 1 single node
> >> > throughput, but this would achieve high availability required by
> >> > some environment.
> >>
> >> Yes, LIO technically supports PGR but it's not distributed to other
> >> nodes. If you have a pacemaker-initiated target failover to another
> >> node, the PGR state would be lost / missing after migration (unless I
> >> am missing something like a resource agent that attempts to preserve
> >> the PGRs). For initiator-initiated failover (e.g. a target is alive
> >> but the initiator cannot reach it), after it fails over to another
> >> port the PGR data won't be available.
> >>
> >> > Of course, multiple target portal would be awesome since available
> >> > throughput would be able to scale linearly, but since this isn't
> >> > here right now, this would provide at least an alternative.
> >>
> >> It would definitely be great to go active/active but there are
> >> concerns of data-corrupting edge conditions when using MPIO since it
> >> relies on client-side failure timers that are not coordinated with
> >> the target.
> >>
> >> For example, if an initiator writes to sector X down path A and there
> >> is delay to the path A target (i.e. the target and initiator timeout
> >> timers are not in-sync), and MPIO fails over to path B, quickly
> >> performs the write to sector X and performs second write to sector X,
> >> there is a possibility that eventually path A will unblock and
> >> overwrite the new value in sector 1 with the old value. The safe way
> >> to handle that would require setting the initiator-side IO timeouts
> >> to such high values as to cause higher-level subsystems to mark the
> >> MPIO path as failed should a failure actually occur.
> >>
> >> The iSCSI MCS protocol would address these concerns since in theory
> >> path B could discover that the retried IO was actually a retry, but
> >> alas it's not available in the Linux Open-iSCSI nor ESX iSCSI
> >> initiators.
> >>
> >> > On Wed, Oct 11, 2017 at 12:26 PM, David Disseldorp 
> >> > wrote:
> >> >>
> >> >> Hi Jason,
> >> >>
> >> >> Thanks for the detailed write-up...
> >> >>
> >> >> On Wed, 11 Oct 2017 08:57:46 -0400, Jason Dillaman wrote:
> >> >>
> >> >> > On Wed, Oct 11, 2017 at 6:38 AM, Jorge Pinilla López
> >> >> > 
> >> >> > wrote:
> >> >> >
> >> >> > > As far as I am able to understand there are 2 ways of setting
> >> >> > > iscsi for ceph
> >> >> > >
> >> >> > > 1- using kernel (lrbd) only able on SUSE, CentOS, fedora...
> >> >> > >
> >> >> >
> >> >> > The target_core_rbd approach is only utilized by SUSE (and its
> >> >> > derivatives like PetaSAN) as far a

Re: [ceph-users] Ceph-ISCSI

2017-10-11 Thread Adrian Saul

It’s a fair point – in our case we are based on CentOS so self-support only 
anyway (the business does not like paying support costs).  At the time we 
evaluated LIO, SCST and STGT, with a directive to use ALUA support instead of 
IP failover.  In the end we went with SCST as it had more mature ALUA support 
at the time and was easier to integrate into pacemaker to support the ALUA 
failover; it also seemed to perform fairly well.

However given the road we have gone down and the issues we are facing as we 
scale up and load up the storage, having a vendor support channel would be a 
relief.


From: Samuel Soulard [mailto:samuel.soul...@gmail.com]
Sent: Thursday, 12 October 2017 11:20 AM
To: Adrian Saul 
Cc: Zhu Lingshan ; dilla...@redhat.com; ceph-users 

Subject: RE: [ceph-users] Ceph-ISCSI

Yes I looked at this solution, and it seems interesting.  However, one point 
often stick with business requirements is commercial support.

With Redhat or Suse, you have support provided with the solution.   I'm not 
sure about SCST what support channel they offer.

Sam

On Oct 11, 2017 20:05, "Adrian Saul" 
mailto:adrian.s...@tpgtelecom.com.au>> wrote:

As an aside, SCST  iSCSI will support ALUA and does PGRs through the use of 
DLM.  We have been using that with Solaris and Hyper-V initiators for RBD 
backed storage but still have some ongoing issues with ALUA (probably our 
current config, we need to lab later recommendations).



> -Original Message-
> From: ceph-users 
> [mailto:ceph-users-boun...@lists.ceph.com<mailto:ceph-users-boun...@lists.ceph.com>]
>  On Behalf Of
> Jason Dillaman
> Sent: Thursday, 12 October 2017 5:04 AM
> To: Samuel Soulard mailto:samuel.soul...@gmail.com>>
> Cc: ceph-users mailto:ceph-us...@ceph.com>>; Zhu 
> Lingshan mailto:ls...@suse.com>>
> Subject: Re: [ceph-users] Ceph-ISCSI
>
> On Wed, Oct 11, 2017 at 1:10 PM, Samuel Soulard
> mailto:samuel.soul...@gmail.com>> wrote:
> > Hmmm, If you failover the identity of the LIO configuration including
> > PGRs (I believe they are files on disk), this would work no?  Using an
> > 2 ISCSI gateways which have shared storage to store the LIO
> > configuration and PGR data.
>
> Are you referring to the Active Persist Through Power Loss (APTPL) support
> in LIO where it writes the PR metadata to "/var/target/pr/aptpl_"? I
> suppose that would work for a Pacemaker failover if you had a shared file
> system mounted between all your gateways *and* the initiator requests
> APTPL mode(?).
>
> > Also, you said another "fails over to another port", do you mean a
> > port on another ISCSI gateway?  I believe LIO with multiple target
> > portal IP on the same node for path redundancy works with PGRs.
>
> Yes, I was referring to the case with multiple active iSCSI gateways which
> doesn't currently distribute PGRs to all gateways in the group.
>
> > In my scenario, if my assumptions are correct, you would only have 1
> > ISCSI gateway available through 2 target portal IP (for data path
> > redundancy).  If this first ISCSI gateway fails, both target portal IP
> > failover to the standby node with the PGR data that is available on share
> stored.
> >
> >
> > Sam
> >
> > On Wed, Oct 11, 2017 at 12:52 PM, Jason Dillaman 
> > mailto:jdill...@redhat.com>>
> > wrote:
> >>
> >> On Wed, Oct 11, 2017 at 12:31 PM, Samuel Soulard
> >> mailto:samuel.soul...@gmail.com>> wrote:
> >> > Hi to all,
> >> >
> >> > What if you're using an ISCSI gateway based on LIO and KRBD (that
> >> > is, RBD block device mounted on the ISCSI gateway and published
> >> > through LIO).
> >> > The
> >> > LIO target portal (virtual IP) would failover to another node.
> >> > This would theoretically provide support for PGRs since LIO does
> >> > support SPC-3.
> >> > Granted it is not distributed and limited to 1 single node
> >> > throughput, but this would achieve high availability required by
> >> > some environment.
> >>
> >> Yes, LIO technically supports PGR but it's not distributed to other
> >> nodes. If you have a pacemaker-initiated target failover to another
> >> node, the PGR state would be lost / missing after migration (unless I
> >> am missing something like a resource agent that attempts to preserve
> >> the PGRs). For initiator-initiated failover (e.g. a target is alive
> >> but the initiator cannot reach it), after it fails over to another
> >> port the PGR data won't be available.
> >>
> >> > Of course, multiple target portal wou

Re: [ceph-users] Thick provisioning

2017-10-18 Thread Adrian Saul

I concur - at the moment we need to manually sum the RBD images to look at how 
much we have "provisioned" vs what ceph df shows.  In our case we had a rapid 
run of provisioning new LUNs but it took a while before usage started to catch 
up with what was provisioned as data was migrated in.  Ceph df would show say 
only 20% of a pool used, but the actual RBD allocation was nearer 80+%.

I am not sure if it's workable, but if there could be a pool-level metric to 
track the total allocation of RBD images that would be useful.  I imagine it 
gets tricky with snapshots/clones though.
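
Roughly what we do today to get that number - a sketch only, assuming a pool 
simply named "rbd":

    for img in $(rbd ls rbd); do
        rbd info rbd/$img | grep size     # provisioned size of each image
    done

    # newer rbd CLIs can also summarise provisioned vs used per pool:
    rbd du -p rbd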


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> si...@turka.nl
> Sent: Thursday, 19 October 2017 6:41 AM
> To: Samuel Soulard 
> Cc: ceph-users 
> Subject: Re: [ceph-users] Thick provisioning
>
> Hi all,
>
> Thanks for the replies.
>
> The main reason why I was looking for the thin/thick provisioning setting is
> that I want to be sure that provisioned space should not exceed the cluster
> capacity.
>
> With thin provisioning there is a risk that more space is provisioned than the
> cluster capacity. When you monitor closely the real usage, this should not be
> a problem; but from experience when there is no hard limit, overprovisioning
> will happen at some point.
>
> Sinan
>
> > I can only speak for some environments, but sometimes, you would want
> > to make sure that a cluster cannot fill up until you can add more capacity.
> >
> > Some organizations are unable to purchase new capacity rapidly and
> > making sure you cannot exceed your current capacity, then you can't
> > run into problems.
> >
> > It may also come from an understanding that thick provisioning will
> > provide more performance initially like virtual machines environment.
> >
> > Having said all of this, isn't there a way to make sure the cluster
> > can accommodate the size of all RBD images that are created. And
> > ensure they have the space available? Some service availability might
> > depend on making sure the storage can provide the necessary capacity.
> >
> > I'm assuming that this is all from an understanding that it is more
> > costly to run such type of environments, however, you can also
> > guarantee that you will never fill up unexpectedly your cluster.
> >
> > Sam
> >
> > On Oct 18, 2017 02:20, "Wido den Hollander"  wrote:
> >
> >
> >> Op 17 oktober 2017 om 19:38 schreef Jason Dillaman
> >> :
> >>
> >>
> >> There is no existing option to thick provision images within RBD.
> >> When an image is created or cloned, the only actions that occur are
> >> some small metadata updates to describe the image. This allows image
> >> creation to be a quick, constant time operation regardless of the
> >> image size. To thick provision the entire image would require writing
> >> data to the entire image and ensuring discard support is disabled to
> >> prevent the OS from releasing space back (and thus re-sparsifying the
> >> image).
> >>
> >
> > Indeed. It makes me wonder why anybody would want it. It will:
> >
> > - Impact recovery performance
> > - Impact scrubbing performance
> > - Utilize more space then needed
> >
> > Why would you want to do this Sinan?
> >
> > Wido
> >
> >> On Mon, Oct 16, 2017 at 10:49 AM,   wrote:
> >> > Hi,
> >> >
> >> > I have deployed a Ceph cluster (Jewel). By default all block
> >> > devices
> > that
> >> > are created are thin provisioned.
> >> >
> >> > Is it possible to change this setting? I would like to have that
> >> > all created block devices are thick provisioned.
> >> >
> >> > In front of the Ceph cluster, I am running Openstack.
> >> >
> >> > Thanks!
> >> >
> >> > Sinan
> >> >
> >> > ___
> >> > ceph-users mailing list
> >> > ceph-users@lists.ceph.com
> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> >>
> >>
> >> --
> >> Jason
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Cephfs NFS failover

2017-12-20 Thread Adrian Saul

What I have been doing with CephFS is to make a number of hosts export the same 
CephFS mountpoints, i.e.:

cephfs01:/cephfs/home
cephfs02:/cephfs/home
...

I then put the hosts all under a common DNS A record i.e "cephfs-nfs" so it 
resolves to all of the hosts exporting the share.

I then use autofs on the clients to mount the share as needed with the source 
host being "cephfs-nfs:/cephfs/home".  Autofs will automatically pick one of 
the hosts to mount and in the event it becomes unavailable autofs will remount 
using one of the other hosts in the A record.

If you wanted you could get more funky with the automount maps and use 
priorities and list the hosts individually, but the above is simple and seems 
to work well.
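
As a sketch of the pieces involved (the export options and map entry are just 
examples, not a recommendation):

    # on each gateway (cephfs01, cephfs02, ...) - /etc/exports
    /cephfs/home    *(rw,sync,no_root_squash)

    # on the clients - autofs map entry pointing at the round-robin DNS name
    home    -fstype=nfs,hard,intr    cephfs-nfs:/cephfs/home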



From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Smith, 
Eric
Sent: Thursday, 21 December 2017 7:36 AM
To: nigel davies ; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Cephfs NFS failover


We did this with RBD, pacemaker, and corosync without issue - not sure about 
CephFS though. You might have to use something like sanlock maybe?


From: ceph-users 
mailto:ceph-users-boun...@lists.ceph.com>> 
on behalf of nigel davies mailto:nigdav...@gmail.com>>
Sent: Wednesday, December 20, 2017 12:45 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Cephfs NFS failover

Hay all

Can any one advise on how it can do this.

I have set up an test ceph cluster

3 osd system
2 NFS servers

I want to set up the two NFS serves as an failover process. So if one fails the 
other starts up.

I have tried an few ways and getting stuck any advise would be gratefully 
received on this one

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Consistency problems when taking RBD snapshot

2016-09-13 Thread Adrian Saul

I found I could ignore the XFS issues and just mount it with the appropriate 
options (below from my backup scripts):

#
# Mount with nouuid (conflicting XFS) and norecovery (ro snapshot)
#
if ! mount -o ro,nouuid,norecovery  $SNAPDEV /backup${FS}; then
echo "FAILED: Unable to mount snapshot $DATESTAMP of $FS - 
cleaning up"
rbd unmap $SNAPDEV
rbd snap rm ${RBDPATH}@${DATESTAMP}
exit 3;
fi
echo "Backup snapshot of $RBDPATH mounted at: /backup${FS}"

Without converting the snapshot to a clone, it's impossible to mount it without norecovery.
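
For completeness, the sequence around that mount looks roughly like this (the 
variable values here are placeholders - the real ones come from the backup 
script above):

    FS=/home
    RBDPATH=rbd/homefs
    DATESTAMP=$(date +%Y%m%d)

    fsfreeze -f ${FS}                              # quiesce the live XFS filesystem
    rbd snap create ${RBDPATH}@${DATESTAMP}        # point-in-time snapshot
    fsfreeze -u ${FS}
    SNAPDEV=$(rbd map ${RBDPATH}@${DATESTAMP})     # maps as a read-only device
    mount -o ro,nouuid,norecovery ${SNAPDEV} /backup${FS}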



> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Ilya Dryomov
> Sent: Wednesday, 14 September 2016 1:51 AM
> To: Nikolay Borisov
> Cc: ceph-users; SiteGround Operations
> Subject: Re: [ceph-users] Consistency problems when taking RBD snapshot
>
> On Tue, Sep 13, 2016 at 4:11 PM, Nikolay Borisov  wrote:
> >
> >
> > On 09/13/2016 04:30 PM, Ilya Dryomov wrote:
> > [SNIP]
> >>
> >> Hmm, it could be about whether it is able to do journal replay on
> >> mount.  When you mount a snapshot, you get a read-only block device;
> >> when you mount a clone image, you get a read-write block device.
> >>
> >> Let's try this again, suppose image is foo and snapshot is snap:
> >>
> >> # fsfreeze -f /mnt
> >>
> >> # rbd snap create foo@snap
> >> # rbd map foo@snap
> >> /dev/rbd0
> >> # file -s /dev/rbd0
> >> # fsck.ext4 -n /dev/rbd0
> >> # mount /dev/rbd0 /foo
> >> # umount /foo
> >> 
> >> # file -s /dev/rbd0
> >> # fsck.ext4 -n /dev/rbd0
> >>
> >> # rbd clone foo@snap bar
> >> $ rbd map bar
> >> /dev/rbd1
> >> # file -s /dev/rbd1
> >> # fsck.ext4 -n /dev/rbd1
> >> # mount /dev/rbd1 /bar
> >> # umount /bar
> >> 
> >> # file -s /dev/rbd1
> >> # fsck.ext4 -n /dev/rbd1
> >>
> >> Could you please provide the output for the above?
> >
> > Here you go : http://paste.ubuntu.com/23173721/
>
> OK, so that explains it: the frozen filesystem is "needs journal recovery", so
> mounting it off of read-only block device leads to errors.
>
> root@alxc13:~# fsfreeze -f /var/lxc/c11579 root@alxc13:~# rbd snap create
> rbd/c11579@snap_test root@alxc13:~# rbd map c11579@snap_test
> /dev/rbd151
> root@alxc13:~# fsfreeze -u /var/lxc/c11579 root@alxc13:~# file -s
> /dev/rbd151
> /dev/rbd151: Linux rev 1.0 ext4 filesystem data (needs journal
> recovery) (extents) (large files) (huge files)
>
> Now, to isolate the problem, the easiest would probably be to try to
> reproduce it with loop devices.  Can you try dding one of these images to a
> file, make sure that the filesystem is clean, losetup + mount, freeze, make a
> "snapshot" with cp and losetup -r + mount?
>
> Try sticking file -s before unfreeze and also compare md5sums:
>
> root@alxc13:~# fsfreeze -f /var/lxc/c11579  device> root@alxc13:~# rbd snap create rbd/c11579@snap_test
> root@alxc13:~# rbd map c11579@snap_test  device>  root@alxc13:~# file -s /dev/rbd151
> root@alxc13:~# fsfreeze -u /var/lxc/c11579  device>  root@alxc13:~# file -s /dev/rbd151
>
> Thanks,
>
> Ilya
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Consistency problems when taking RBD snapshot

2016-09-14 Thread Adrian Saul
> But shouldn't freezing the fs and doing a snapshot constitute a "clean
> unmount" hence no need to recover on the next mount (of the snapshot) -
> Ilya?

It's what I thought as well, but XFS seems to want to attempt to replay the log 
on mount regardless, and it needs to write to the device to do so.  This was the 
only way I found to mount it without converting the snapshot to a clone (which I 
couldn't do with the image options enabled anyway).

I have this script snapshotting, mounting and backing up multiple file systems 
on my cluster with no issue.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Snap delete performance impact

2016-09-21 Thread Adrian Saul

Any guidance on this?  I have osd_snap_trim_sleep set to 1 and it seems to have 
tempered some of the issues, but it's still bad enough that NFS storage off RBD 
volumes becomes unavailable for over 3 minutes.

It seems that the activity triggered when the snapshot deletes are actioned 
generates massive disk load for around 30 minutes.  The logs show OSDs marking 
each other out, OSDs complaining they are wrongly marked out, and blocked 
requests errors for around 10 minutes at the start of this activity.

Is there any way to throttle snapshot deletes to make them much more of a 
background activity?  It really should not make the entire platform unusable 
for 10 minutes.
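
For reference, this is how I am applying the setting at the moment - a sketch 
only, and the value is just what I am experimenting with:

    # ceph.conf on the OSD hosts
    [osd]
    osd snap trim sleep = 1

    # or injected at runtime (depending on the release it may report that a
    # restart is needed for the change to take effect)
    ceph tell 'osd.*' injectargs '--osd_snap_trim_sleep 1'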



> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Adrian Saul
> Sent: Wednesday, 6 July 2016 3:41 PM
> To: 'ceph-users@lists.ceph.com'
> Subject: [ceph-users] Snap delete performance impact
>
>
> I recently started a process of using rbd snapshots to setup a backup regime
> for a few file systems contained in RBD images.  While this generally works
> well at the time of the snapshots there is a massive increase in latency (10ms
> to multiple seconds of rbd device latency) across the entire cluster.  This 
> has
> flow on effects for some cluster timeouts as well as general performance hits
> to applications.
>
> In research I have found some references to osd_snap_trim_sleep being the
> way to throttle this activity but no real guidance on values for it.   I also 
> see
> some other osd_snap_trim tunables  (priority and cost).
>
> Is there any recommendations around setting these for a Jewel cluster?
>
> cheers,
>  Adrian
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Snap delete performance impact

2016-09-22 Thread Adrian Saul

I tried a value of 2 this afternoon and saw the same results.  Essentially the 
disks appear to go to 100% busy doing very small but high numbers of IOs and 
incur massive service times (300-400ms).  During that period I get blocked 
request errors continually.

I suspect part of that might be that the SATA servers had filestore_op_threads 
set too high, hammering the disks with too much concurrent work.  They had 
inherited a setting targeted for SSDs, so I have wound that back to defaults on 
those machines to see if it makes a difference.

But I suspect, going by the disk activity, there are a lot of very small FS 
metadata updates going on and that is what is killing it.

Cheers,
 Adrian


> -Original Message-
> From: Nick Fisk [mailto:n...@fisk.me.uk]
> Sent: Thursday, 22 September 2016 7:06 PM
> To: Adrian Saul; ceph-users@lists.ceph.com
> Subject: RE: Snap delete performance impact
>
> Hi Adrian,
>
> I have also hit this recently and have since increased the
> osd_snap_trim_sleep to try and stop this from happening again. However, I
> haven't had an opportunity to actually try and break it again yet, but your
> mail seems to suggest it might not be the silver bullet I was looking for.
>
> I'm wondering if the problem is not with the removal of the snapshot, but
> actually down to the amount of object deletes that happen, as I see similar
> results when doing fstrim's or deleting RBD's. Either way I agree that a
> settable throttle to allow it to process more slowly would be a good addition.
> Have you tried that value set to higher than 1, maybe 10?
>
> Nick
>
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> > Of Adrian Saul
> > Sent: 22 September 2016 05:19
> > To: 'ceph-users@lists.ceph.com' 
> > Subject: Re: [ceph-users] Snap delete performance impact
> >
> >
> > Any guidance on this?  I have osd_snap_trim_sleep set to 1 and it
> > seems to have tempered some of the issues but its still bad
> enough
> > that NFS storage off RBD volumes become unavailable for over 3 minutes.
> >
> > It seems that the activity which the snapshot deletes are actioned
> > triggers massive disk load for around 30 minutes.  The logs
> show
> > OSDs marking each other out, OSDs complaining they are wrongly marked
> > out and blocked requests errors for around 10 minutes at the start of this
> activity.
> >
> > Is there any way to throttle snapshot deletes to make them much more
> > of a background activity?  It really should not make the
> entire
> > platform unusable for 10 minutes.
> >
> >
> >
> > > -Original Message-
> > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> > > Behalf Of Adrian Saul
> > > Sent: Wednesday, 6 July 2016 3:41 PM
> > > To: 'ceph-users@lists.ceph.com'
> > > Subject: [ceph-users] Snap delete performance impact
> > >
> > >
> > > I recently started a process of using rbd snapshots to setup a
> > > backup regime for a few file systems contained in RBD images.  While
> > > this generally works well at the time of the snapshots there is a
> > > massive increase in latency (10ms to multiple seconds of rbd device
> > > latency) across the entire cluster.  This has flow on effects for
> > > some cluster timeouts as well as general performance hits to applications.
> > >
> > > In research I have found some references to osd_snap_trim_sleep being
> the
> > > way to throttle this activity but no real guidance on values for it.   I 
> > > also
> see
> > > some other osd_snap_trim tunables  (priority and cost).
> > >
> > > Is there any recommendations around setting these for a Jewel cluster?
> > >
> > > cheers,
> > >  Adrian
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Snap delete performance impact

2016-09-23 Thread Adrian Saul

I did some observing today - with the reduced filestore_op_threads it seems to 
ride out the storm better; not ideal, but better.

The main issue is that for the 10 minutes from the moment the rbd snap rm 
command is issued, the SATA systems in my configuration load up massively on 
disk IO, and I think this is what rolls on into all the other issues (OSDs 
unresponsive, queue backlogs).  The disks all go 100% busy - the average SATA 
write latency goes from 14ms to 250ms, and I was observing disks doing 400, 700 
and higher service times.  After those few minutes it tapers down and goes back 
to normal.
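
What I am watching while this happens is basically just iostat on the OSD hosts 
- a sketch:

    # %util sits at 100 and the write await/service times blow out to 250ms+
    # during the trim window
    iostat -xm 5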

These are all ST6000VN0001 disks - is anyone aware of anything that might 
explain this sort of behaviour?  It seems odd that even if the disks were hit 
with high write traffic (an average of 50 write IOPS going up to 270-300 during 
this activity) the service times would blow out that much.

Cheers,
 Adrian






> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Adrian Saul
> Sent: Thursday, 22 September 2016 7:15 PM
> To: n...@fisk.me.uk; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Snap delete performance impact
>
>
> I tried 2 this afternoon and saw the same results.  Essentially the disks 
> appear
> to go to 100% busy doing very small but high numbers of IO and incur massive
> service times (300-400ms).   During that period I get blocked request errors
> continually.
>
> I suspect part of that might be the SATA servers had filestore_op_threads
> set too high and hammering the disks with too much concurrent work.  As
> they have inherited a setting targeted for SSDs, so I have wound that back to
> defaults on those machines see if it makes a difference.
>
> But I suspect going by the disk activity there is a lot of very small FS 
> metadata
> updates going on and that is what is killing it.
>
> Cheers,
>  Adrian
>
>
> > -Original Message-
> > From: Nick Fisk [mailto:n...@fisk.me.uk]
> > Sent: Thursday, 22 September 2016 7:06 PM
> > To: Adrian Saul; ceph-users@lists.ceph.com
> > Subject: RE: Snap delete performance impact
> >
> > Hi Adrian,
> >
> > I have also hit this recently and have since increased the
> > osd_snap_trim_sleep to try and stop this from happening again.
> > However, I haven't had an opportunity to actually try and break it
> > again yet, but your mail seems to suggest it might not be the silver bullet 
> > I
> was looking for.
> >
> > I'm wondering if the problem is not with the removal of the snapshot,
> > but actually down to the amount of object deletes that happen, as I
> > see similar results when doing fstrim's or deleting RBD's. Either way
> > I agree that a settable throttle to allow it to process more slowly would 
> > be a
> good addition.
> > Have you tried that value set to higher than 1, maybe 10?
> >
> > Nick
> >
> > > -Original Message-
> > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> > > Behalf Of Adrian Saul
> > > Sent: 22 September 2016 05:19
> > > To: 'ceph-users@lists.ceph.com' 
> > > Subject: Re: [ceph-users] Snap delete performance impact
> > >
> > >
> > > Any guidance on this?  I have osd_snap_trim_sleep set to 1 and it
> > > seems to have tempered some of the issues but its still bad
> > enough
> > > that NFS storage off RBD volumes become unavailable for over 3
> minutes.
> > >
> > > It seems that the activity which the snapshot deletes are actioned
> > > triggers massive disk load for around 30 minutes.  The logs
> > show
> > > OSDs marking each other out, OSDs complaining they are wrongly
> > > marked out and blocked requests errors for around 10 minutes at the
> > > start of this
> > activity.
> > >
> > > Is there any way to throttle snapshot deletes to make them much more
> > > of a background activity?  It really should not make the
> > entire
> > > platform unusable for 10 minutes.
> > >
> > >
> > >
> > > > -Original Message-
> > > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> > > > Behalf Of Adrian Saul
> > > > Sent: Wednesday, 6 July 2016 3:41 PM
> > > > To: 'ceph-users@lists.ceph.com'
> > > > Subject: [ceph-users] Snap delete performance impact
> > > >
> > > >
> > > > I recently started a process of using rbd snapshots to setup a
> > > > backup regime for a few file systems contained in RBD images.
> >

Re: [ceph-users] Snap delete performance impact

2016-09-23 Thread Adrian Saul
I am also testing whether reducing the filestore queue ops limit from 500 to 250 
helps.  On my graphs I can see the filestore ops queue go from 1 or 2 up to 500 
for the period of the load, so I am looking to see if throttling it down helps 
spread out the load.  The normal ops load is not enough to worry the current limit.
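
For clarity, the change is just injected at runtime - a sketch, and I am 
assuming the option in question is filestore_queue_max_ops:

    ceph tell 'osd.*' injectargs '--filestore_queue_max_ops 250'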



Sent from my SAMSUNG Galaxy S7 on the Telstra Mobile Network


 Original message 
From: Nick Fisk 
Date: 23/09/2016 7:26 PM (GMT+10:00)
To: Adrian Saul , ceph-users@lists.ceph.com
Subject: RE: Snap delete performance impact

Looking back through my graphs when this happened to me I can see that the 
queue on the disks was up as high as 30 during the period when the snapshot was 
removed, this would explain the high latencies, the disk is literally having 
fits trying to jump all over the place.

I need to test with the higher osd_snap_trim_sleep to see if that helps. What 
I'm interested in finding out is why so much disk activity is required for 
deleting an object. It feels to me that the process is async, in that Ceph will 
quite happily flood the Filestore with delete requests without any feedback to 
the higher layers.


> -Original Message-
> From: Adrian Saul [mailto:adrian.s...@tpgtelecom.com.au]
> Sent: 23 September 2016 10:04
> To: n...@fisk.me.uk; ceph-users@lists.ceph.com
> Subject: RE: Snap delete performance impact
>
>
> I did some observation today - with the reduced filestore_op_threads it seems 
> to ride out the storm better, not ideal but better.
>
> The main issue is that for the 10 minutes from the moment the rbd snap rm 
> command is issued, the SATA systems in my configuration
> load up massively on disk IO and I think this is what is rolling on to all 
> other issues (OSDs unresponsive, queue backlogs). The disks all
> go 100% busy - the average SATA write latency goes from 14ms to 250ms.  I was 
> observing disks doing 400, 700 and higher service
> times.  After those few minutes it tapers down and goes back to normal.
>
> There are all ST6000VN0001 disks - anyone aware of anything that might 
> explain this sort of behaviour?  It seems odd that even if the
> disks were hit with high write traffic (average of 50 write IOPS going up to 
> 270-300 during this activity) that the service times would
> blow out that much.
>
> Cheers,
>  Adrian
>
>
>
>
>
>
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> > Of Adrian Saul
> > Sent: Thursday, 22 September 2016 7:15 PM
> > To: n...@fisk.me.uk; ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] Snap delete performance impact
> >
> >
> > I tried 2 this afternoon and saw the same results.  Essentially the
> > disks appear to go to 100% busy doing very small but high numbers of IO and 
> > incur massive
> > service times (300-400ms).   During that period I get blocked request errors
> > continually.
> >
> > I suspect part of that might be the SATA servers had
> > filestore_op_threads set too high and hammering the disks with too
> > much concurrent work.  As they have inherited a setting targeted for
> > SSDs, so I have wound that back to defaults on those machines see if it 
> > makes a difference.
> >
> > But I suspect going by the disk activity there is a lot of very small
> > FS metadata updates going on and that is what is killing it.
> >
> > Cheers,
> >  Adrian
> >
> >
> > > -Original Message-
> > > From: Nick Fisk [mailto:n...@fisk.me.uk]
> > > Sent: Thursday, 22 September 2016 7:06 PM
> > > To: Adrian Saul; ceph-users@lists.ceph.com
> > > Subject: RE: Snap delete performance impact
> > >
> > > Hi Adrian,
> > >
> > > I have also hit this recently and have since increased the
> > > osd_snap_trim_sleep to try and stop this from happening again.
> > > However, I haven't had an opportunity to actually try and break it
> > > again yet, but your mail seems to suggest it might not be the silver
> > > bullet I
> > was looking for.
> > >
> > > I'm wondering if the problem is not with the removal of the
> > > snapshot, but actually down to the amount of object deletes that
> > > happen, as I see similar results when doing fstrim's or deleting
> > > RBD's. Either way I agree that a settable throttle to allow it to
> > > process more slowly would be a
> > good addition.
> > > Have you tried that value set to higher than 1, maybe 10?
> > >
> > > Nick
> > >
> > > > -Original Message-
> > > > From: ceph-users [mailto:ceph-users-bou

[ceph-users] Ceph outage - monitoring options

2016-11-21 Thread Adrian Saul

Hi All,
  We have a Jewel cluster (10.2.1) that we built up in a POC state (2 clients 
also being mons, 12 SSD OSDs on 3 hosts, 20 SATA OSDs on 3 hosts).  We have 
connected up our "prod" environment to it and performed a migration of all the 
OSDs, so it is now 114 OSDs (36 SSD, 78 NL-SAS, with another 26 waiting on 
replacement of 2 DOA journals).

The migration was relatively clean, except for when I removed the old hosts 
from the crush map - even though the OSDs were already out and removed, the 
host entries still had weight, and the rebalancing that kicked off when the 
buckets were removed caused some sort of lockup with the kernel RBD clients.  
Once they were kicked the main issue settled down and all was good.

We then had one SAS OSD machine have a single network flap for 8 seconds where 
it lost all network (unknown to me until later).  By the time I was looking at 
the issue it was like a ghostly silence - no OSDs down, none out, no failed 
PGs, no repairing - just blocked requests and nothing else.  After jumping 
between multiple machines and finding nothing of note I got desperate and 
restarted all OSDs on a host-by-host basis.  That resulted in a number of PGs 
becoming inactive, but no recovery or any other sort of improvement, and the 
blocked requests continued.

I tried digging blocked requests out from random OSDs - most seemed to just say 
"waiting for peered" with no other real information.  After a while I started 
finding a general pattern pointing to one machine as being a peer for some of 
the PGs that were stuck (I wish I had done ceph health detail earlier, because 
it would have clued me in to the list of PGs owned by that host faster).  
Finding nothing wrong, and having seen the OSDs already restarted (the monitors 
etc. saw them go down and come back up fine but the blocked requests remained), 
I chose to reboot the server.

As soon as that was done it was like the cluster jumped back to life, 
discovered it was broken and started repairing.  Once the down host was back 
online it joined again and the cluster recovered quickly, as if nothing had 
happened.
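
For the record, these are the commands I leaned on (or wish I had reached for 
sooner) while chasing the blocked requests - just a sketch, and the OSD ID is 
an example taken from the logs below:

    ceph health detail                        # lists the affected PGs and implicated OSDs
    ceph pg dump_stuck                        # PGs stuck inactive/unclean/stale
    # on the OSD host, via the admin socket:
    ceph daemon osd.98 dump_ops_in_flight     # currently blocked requests on that OSD
    ceph daemon osd.98 dump_historic_ops      # recently completed slow ops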

When I later dug into the logs I found the network flap, followed by the 
building blocked requests and by most of the OSDs on that host complaining of 
heartbeat_check failures.

Nov 21 16:38:31 ceph-glb-sata-06 kernel: enic :0a:00.0 enp10s0: Link DOWN
Nov 21 16:38:31 ceph-glb-sata-06 kernel: enic :0b:00.0 enp11s0: Link DOWN
Nov 21 16:38:31 ceph-glb-sata-06 kernel: enic :0c:00.0 enp12s0: Link DOWN
Nov 21 16:38:39 ceph-glb-sata-06 kernel: enic :0a:00.0 enp10s0: Link UP
Nov 21 16:38:39 ceph-glb-sata-06 kernel: enic :0b:00.0 enp11s0: Link UP
Nov 21 16:38:39 ceph-glb-sata-06 kernel: enic :0c:00.0 enp12s0: Link UP

Nov 21 16:38:45 ceph-glb-sata-06 ceph-osd: 2016-11-21 16:38:45.900016 
7f66a918c700 -1 osd.103 69295 heartbeat_check: no reply from osd.68 since back 
2016-11-21 16:38:25.707045 front 2016-11-21 16:38:43.908106 (cutoff 2016-11-21 
16:38:25.900010)
Nov 21 16:38:45 ceph-glb-sata-06 ceph-osd: 2016-11-21 16:38:45.900038 
7f66a918c700 -1 osd.103 69295 heartbeat_check: no reply from osd.70 since back 
2016-11-21 16:38:25.707045 front 2016-11-21 16:38:43.908106 (cutoff 2016-11-21 
16:38:25.900010)

2016-11-21 16:38:49.751243 7f138a653700 -1 osd.98 69295 heartbeat_check: no 
reply from osd.74 since back 2016-11-21 16:38:29.391091 front 2016-11-21 
16:38:47.293162 (cutoff 2016-11-21 16:38:29.751238)
2016-11-21 16:38:49.751264 7f138a653700 -1 osd.98 69295 heartbeat_check: no 
reply from osd.77 since back 2016-11-21 16:38:29.391091 front 2016-11-21 
16:38:47.293162 (cutoff 2016-11-21 16:38:29.751238)

In some of the blocked ops I found references to waiting on rw locks:

2016-11-21 16:43:42.787670 7f138a653700  0 log_channel(cluster) log [WRN] : 4 
slow requests, 4 included below; oldest blocked for > 120.843670 secs
2016-11-21 16:43:42.787673 7f138a653700  0 log_channel(cluster) do_log log to 
syslog
2016-11-21 16:43:42.787707 7f138a653700  0 log_channel(cluster) log [WRN] : 
slow request 120.843670 seconds old, received at 2016-11-21 16:41:41.943943: 
osd_op(client.2714100.1:69471470 1.3140a3e7 
rbd_data.159a26238e1f29.00018502 [set-alloc-hint object_size 4194304 
write_size 4194304,write 3683840~510464] snapc 0=[] 
ondisk+write+ignore_cache+ignore_overlay+known_if_redirected e69295) currently 
waiting for subops from 89
2016-11-21 16:43:42.787710 7f138a653700  0 log_channel(cluster) do_log log to 
syslog
2016-11-21 16:43:42.787725 7f138a653700  0 log_channel(cluster) log [WRN] : 
slow request 60.181839 seconds old, received at 2016-11-21 16:42:42.605773: 
osd_op(client.2714100.1:69528164 1.3140a3e7 
rbd_data.159a26238e1f29.00018502 [set-alloc-hint object_size 4194304 
write_size 4194304,write 3683840~510464] snapc 0=[] 
ondisk+write+ignore_cache+ignore_overlay+known_if_redirected e69295) currently 
waiting for subops from 89
2016-11-21 16:43:42.787740 7f138a653700  0 log_channel(cluster) d

[ceph-users] osd set noin ignored for old OSD ids

2016-11-22 Thread Adrian Saul

Hi,
 As part of a migration between hardware I have been building new OSDs and 
cleaning up old ones (osd rm osd.x, osd crush rm osd.x, auth del osd.x).  To 
try and prevent rebalancing kicking in until all the new OSDs are created on a 
host I use "ceph osd set noin".  However, what I have seen is that if the new 
OSD that is created uses a new unique ID, then the flag is honoured and the OSD 
remains out until I bring it in, whereas if the OSD re-uses a previous OSD ID 
then it will go straight to in and start backfilling.  I have to manually out 
the OSD to stop it (or set nobackfill,norebalance).
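
To illustrate the flow (IDs are placeholders):

    ceph osd set noin
    ceph osd rm osd.45 ; ceph osd crush rm osd.45 ; ceph auth del osd.45
    # ...rebuild the host and create the new OSDs...
    ceph osd create        # hands back the lowest free ID - 45 again
    # a re-used ID comes up "in" straight away despite noin;
    # a brand new ID honours the flag and stays out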

Am I doing something wrong in this process or is there something about "noin" 
that is ignored for previously existing OSDs that have been removed from both 
the OSD map and crush map?

Cheers,
 Adrian




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [EXTERNAL] Re: osd set noin ignored for old OSD ids

2016-11-23 Thread Adrian Saul

Thanks - that is more in line with what I was looking for: being able to 
suppress backfills/rebalancing until a host's full set of OSDs is up and 
ready.
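
For anyone following along, the setting referred to below would, as I 
understand it, go something like this (sketch):

    # ceph.conf on the monitors
    [mon]
    mon osd auto mark new in = false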


> -Original Message-
> From: Will.Boege [mailto:will.bo...@target.com]
> Sent: Thursday, 24 November 2016 2:17 PM
> To: Gregory Farnum
> Cc: Adrian Saul; ceph-users@lists.ceph.com
> Subject: Re: [EXTERNAL] Re: [ceph-users] osd set noin ignored for old OSD
> ids
>
> From my experience noin doesn't stop new OSDs from being marked in. noin
> only works on OSDs already in the crushmap. To accomplish the behavior you
> want I've injected "mon osd auto mark new in = false" into MONs. This also
> seems to set their OSD weight to 0 when they are created.
>
> > On Nov 23, 2016, at 1:47 PM, Gregory Farnum 
> wrote:
> >
> > On Tue, Nov 22, 2016 at 7:56 PM, Adrian Saul
> >  wrote:
> >>
> >> Hi ,
> >> As part of migration between hardware I have been building new OSDs
> and cleaning up old ones  (osd rm osd.x, osd crush rm osd.x, auth del osd.x).
> To try and prevent rebalancing kicking in until all the new OSDs are created
> on a host I use "ceph osd set noin", however what I have seen is that if the
> new OSD that is created uses a new unique ID, then the flag is honoured and
> the OSD remains out until I bring it in.  However if the OSD re-uses a 
> previous
> OSD id then it will go straight to in and start backfilling.  I have to 
> manually out
> the OSD to stop it (or set nobackfill,norebalance).
> >>
> >> Am I doing something wrong in this process or is there something about
> "noin" that is ignored for previously existing OSDs that have been removed
> from both the OSD map and crush map?
> >
> > There are a lot of different pieces of an OSD ID that need to get
> > deleted for it to be truly gone; my guess is you've missed some of
> > those. The noin flag doesn't prevent unlinked-but-up CRUSH entries
> > from getting placed back into the tree, etc.
> >
> > We may also have a bug though, so if you can demonstrate that the ID
> > doesn't exist in the CRUSH and OSD dumps then please create a ticket
> > at tracker.ceph.com!
> > -Greg
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Crush rule check

2016-12-10 Thread Adrian Saul

Hi Ceph-users,
  I just want to double check a new crush ruleset I am creating - the intent 
here is that over 2 DCs, it will select one DC, and place two copies on 
separate hosts in that DC.  The pools created on this will use size 4  and 
min-size 2.

 I just want to check I have crafted this correctly.

rule sydney-ssd {
ruleset 6
type replicated
min_size 2
max_size 10
step take ssd-sydney
step choose firstn -2 type datacenter
step chooseleaf firstn 2 type host
step emit
}

Cheers,
 Adrian



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Crush rule check

2016-12-12 Thread Adrian Saul

Thanks Wido.

I had found the show-utilization test, but had not seen show-mappings - that 
confirmed it for me.

thanks,
 Adrian


> -Original Message-
> From: Wido den Hollander [mailto:w...@42on.com]
> Sent: Monday, 12 December 2016 7:07 PM
> To: ceph-users@lists.ceph.com; Adrian Saul
> Subject: Re: [ceph-users] Crush rule check
>
>
> > Op 10 december 2016 om 12:45 schreef Adrian Saul
> :
> >
> >
> >
> > Hi Ceph-users,
> >   I just want to double check a new crush ruleset I am creating - the intent
> here is that over 2 DCs, it will select one DC, and place two copies on 
> separate
> hosts in that DC.  The pools created on this will use size 4  and min-size 2.
> >
> >  I just want to check I have crafted this correctly.
> >
>
> I suggest that you test your ruleset with crushtool like this:
>
> $ crushtool -i crushmap.new --test --rule 6 --num-rep 4 --show-utilization $
> crushtool -i crushmap.new --test --rule 6 --num-rep 4 --show-mappings
>
> You can now manually verify if the placement goes as intended.
>
> Wido
>
> > rule sydney-ssd {
> > ruleset 6
> > type replicated
> > min_size 2
> > max_size 10
> > step take ssd-sydney
> > step choose firstn -2 type datacenter
> > step chooseleaf firstn 2 type host
> > step emit
> > }
> >
> > Cheers,
> >  Adrian
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Crush rule check

2016-12-12 Thread Adrian Saul
> One thing to check though. The number of DCs is a fixed number right? You
> will always have two DCs with X hosts.

I am keeping it open in case we add other sites for some reason, but likely to 
remain  at 2.

>
> In that case:
>
>   step choose firstn 2 type datacenter
>   step chooseleaf firstn -2 type host
>
> First, take 2 of the type 'datacenter' and then find the remaining hosts. But
> since you will always use size = 4 you might even try:
>
> rule sydney-ssd {
> ruleset 6
> type replicated
> min_size 4
> max_size 4
> step take ssd-sydney
> step choose firstn 2 type datacenter
> step chooseleaf firstn 2 type host
> step emit
> }
>
> This way the ruleset will only work for size = 4.
>
> Wido
>
>
> > thanks,
> >  Adrian
> >
> >
> > > -Original Message-
> > > From: Wido den Hollander [mailto:w...@42on.com]
> > > Sent: Monday, 12 December 2016 7:07 PM
> > > To: ceph-users@lists.ceph.com; Adrian Saul
> > > Subject: Re: [ceph-users] Crush rule check
> > >
> > >
> > > > Op 10 december 2016 om 12:45 schreef Adrian Saul
> > > :
> > > >
> > > >
> > > >
> > > > Hi Ceph-users,
> > > >   I just want to double check a new crush ruleset I am creating -
> > > > the intent
> > > here is that over 2 DCs, it will select one DC, and place two copies
> > > on separate hosts in that DC.  The pools created on this will use size 4  
> > > and
> min-size 2.
> > > >
> > > >  I just want to check I have crafted this correctly.
> > > >
> > >
> > > I suggest that you test your ruleset with crushtool like this:
> > >
> > > $ crushtool -i crushmap.new --test --rule 6 --num-rep 4
> > > --show-utilization $ crushtool -i crushmap.new --test --rule 6
> > > --num-rep 4 --show-mappings
> > >
> > > You can now manually verify if the placement goes as intended.
> > >
> > > Wido
> > >
> > > > rule sydney-ssd {
> > > > ruleset 6
> > > > type replicated
> > > > min_size 2
> > > > max_size 10
> > > > step take ssd-sydney
> > > > step choose firstn -2 type datacenter
> > > > step chooseleaf firstn 2 type host
> > > > step emit
> > > > }
> > > >
> > > > Cheers,
> > > >  Adrian
> > > >
> > > >
> > > >
> > > > ___
> > > > ceph-users mailing list
> > > > ceph-users@lists.ceph.com
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Confidentiality: This email and any attachments are confidential and may be 
subject to copyright, legal or some other professional privilege. They are 
intended solely for the attention and use of the named addressee(s). They may 
only be copied, distributed or disclosed with the consent of the copyright 
owner. If you have received this email by mistake or by breach of the 
confidentiality clause, please notify the sender immediately by return email 
and delete or destroy all copies of the email. Any confidentiality, privilege 
or copyright is not waived or lost because this email has been sent to you by 
mistake.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] When Zero isn't 0 (Crush weight mysteries)

2016-12-20 Thread Adrian Saul

I found the other day that even though I had 0-weighted OSDs, there was still weight 
in the containing buckets, which triggered some rebalancing.

Maybe it is something similar here: weight was added to the bucket even though the 
OSD underneath it was 0.
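
A quick way to check is to compare the bucket weights before and after the add - a 
rough sketch only (the OSD id and host name are placeholders):

  ceph osd tree                              # note the host and root bucket weights
  ceph osd crush add osd.40 0 host=ceph-01
  ceph osd tree                              # the bucket weights should be unchanged
  ceph osd crush dump | less                 # raw item weights if you want to be certain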


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Christian Balzer
> Sent: Wednesday, 21 December 2016 12:39 PM
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] When Zero isn't 0 (Crush weight mysteries)
>
>
> Hello,
>
> I just (manually) added 1 OSD each to my 2 cache-tier nodes.
> The plan was/is to actually do the data-migration at the least busiest day in
> Japan, New Years (the actual holiday is January 2nd this year).
>
> So I was going to have everything up and in but at weight 0 initially.
>
> Alas at the "ceph osd crush add osd.x0 0 host=ceph-0x" steps Ceph happily
> started to juggle a few PGs (about 7 total) around, despite of course no
> weight in the cluster changing at all.
> No harm done (this is the fast and not too busy cache-tier after all), but 
> very
> much unexpected.
>
> So which part of the CRUSH algorithm goes around and pulls weights out of
> thin air?
>
> Christian
> --
> Christian BalzerNetwork/Systems Engineer
> ch...@gol.com   Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Confidentiality: This email and any attachments are confidential and may be 
subject to copyright, legal or some other professional privilege. They are 
intended solely for the attention and use of the named addressee(s). They may 
only be copied, distributed or disclosed with the consent of the copyright 
owner. If you have received this email by mistake or by breach of the 
confidentiality clause, please notify the sender immediately by return email 
and delete or destroy all copies of the email. Any confidentiality, privilege 
or copyright is not waived or lost because this email has been sent to you by 
mistake.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSDs for data drives

2018-07-12 Thread Adrian Saul

We started our cluster with consumer (Samsung EVO) disks and the write 
performance was pitiful: they had periodic spikes in latency (an average of 8ms, 
but much higher spikes) and just did not perform anywhere near where we were 
expecting.

When replaced with SM863 based devices the difference was night and day.  The 
DC grade disks held a nearly constant low latency (consistently sub-ms), no 
spiking, and performance was massively better.  For a period I ran both disks 
in the cluster and was able to graph them side by side with the same workload.  
This was not even a moderately loaded cluster, so I am glad we discovered this 
before we went full scale.

So while you certainly can do cheap and cheerful and let the data availability 
be handled by Ceph, don’t expect the performance to keep up.
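
If you want to see the difference for yourself before committing to hardware, a 
single-threaded sync write test along these lines tends to expose consumer drives 
very quickly (a sketch only - it writes directly to the device, so only point it at 
a blank disk):

  fio --name=journal-test --filename=/dev/sdX --direct=1 --sync=1 \
      --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based \
      --group_reporting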



From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Satish 
Patel
Sent: Wednesday, 11 July 2018 10:50 PM
To: Paul Emmerich 
Cc: ceph-users 
Subject: Re: [ceph-users] SSDs for data drives

Prices going way up if I am picking Samsung SM863a for all data drives.

We have many servers running on consumer grade SSD drives and we have never noticed 
any performance issues or any faults so far (but we have never used ceph before).

I thought that is the whole point of ceph: to provide high availability if a drive 
goes down, and also parallel reads from multiple OSD nodes.

Sent from my iPhone

On Jul 11, 2018, at 6:57 AM, Paul Emmerich <paul.emmer...@croit.io> wrote:
Hi,

we've no long-term data for the SM variant.
Performance is fine as far as we can tell, but the main difference between 
these two models should be endurance.


Also, I forgot to mention that my experiences are only for the 1, 2, and 4 TB 
variants. Smaller SSDs are often proportionally slower (especially below 500GB).

Paul

Robert Stanford <rstanford8...@gmail.com>:
Paul -

 That's extremely helpful, thanks.  I do have another cluster that uses Samsung 
SM863a just for journal (spinning disks for data).  Do you happen to have an 
opinion on those as well?

On Wed, Jul 11, 2018 at 4:03 AM, Paul Emmerich <paul.emmer...@croit.io> wrote:
PM/SM863a are usually great disks and should be the default go-to option, they 
outperform
even the more expensive PM1633 in our experience.
(But that really doesn't matter if it's for the full OSD and not as dedicated 
WAL/journal)

We got a cluster with a few hundred SanDisk Ultra II (discontinued, I believe) 
that was built on a budget.
Not the best disk but great value. They have been running for ~3 years now 
with very few failures and
okayish overall performance.

We also got a few clusters with a few hundred SanDisk Extreme Pro, but we are 
not yet sure about their
long-term durability as they are only ~9 months old (average of ~1000 write 
IOPS on each disk over that time).
Some of them report only 50-60% lifetime left.

For NVMe, the Intel NVMe 750 is still a great disk

Be careful to get these exact models. Seemingly similar disks might be just 
completely bad, for
example, the Samsung PM961 is just unusable for Ceph in our experience.

Paul

2018-07-11 10:14 GMT+02:00 Wido den Hollander <w...@42on.com>:


On 07/11/2018 10:10 AM, Robert Stanford wrote:
>
>  In a recent thread the Samsung SM863a was recommended as a journal
> SSD.  Are there any recommendations for data SSDs, for people who want
> to use just SSDs in a new Ceph cluster?
>

Depends on what you are looking for, SATA, SAS3 or NVMe?

I have very good experiences with these drives running with BlueStore in
them in SuperMicro machines:

- SATA: Samsung PM863a
- SATA: Intel S4500
- SAS: Samsung PM1633
- NVMe: Samsung PM963

Running WAL+DB+DATA with BlueStore on the same drives.

Wido

>  Thank you
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] v12.2.8 Luminous released

2018-09-05 Thread Adrian Saul
Can I confirm if this bluestore compression assert issue is resolved in 12.2.8?

https://tracker.ceph.com/issues/23540

I notice that it has a backport that is listed against 12.2.8 but there is no 
mention of that issue or backport listed in the release notes.
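
(As an aside, once the point release is rolled out, something like the following 
confirms what each daemon is actually running:)

  ceph versions              # per daemon type summary of running versions
  ceph tell osd.* version    # ask each OSD directly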


> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
> ow...@vger.kernel.org] On Behalf Of Abhishek Lekshmanan
> Sent: Wednesday, 5 September 2018 2:30 AM
> To: ceph-de...@vger.kernel.org; ceph-us...@ceph.com; ceph-
> maintain...@ceph.com; ceph-annou...@ceph.com
> Subject: v12.2.8 Luminous released
>
>
> We're glad to announce the next point release in the Luminous v12.2.X stable
> release series. This release contains a range of bugfixes and stability
> improvements across all the components of ceph. For detailed release notes
> with links to tracker issues and pull requests, refer to the blog post at
> http://ceph.com/releases/v12-2-8-released/
>
> Upgrade Notes from previous luminous releases
> -
>
> When upgrading from v12.2.5 or v12.2.6 please note that upgrade caveats
> from
> 12.2.5 will apply to any _newer_ luminous version including 12.2.8. Please
> read the notes at https://ceph.com/releases/12-2-7-luminous-
> released/#upgrading-from-v12-2-6
>
> For the cluster that installed the broken 12.2.6 release, 12.2.7 fixed the
> regression and introduced a workaround option `osd distrust data digest =
> true`, but 12.2.7 clusters still generated health warnings like ::
>
>   [ERR] 11.288 shard 207: soid
>   11:1155c332:::rbd_data.207dce238e1f29.0527:head
> data_digest
>   0xc8997a5b != data_digest 0x2ca15853
>
>
> 12.2.8 improves the deep scrub code to automatically repair these
> inconsistencies. Once the entire cluster has been upgraded and then fully
> deep scrubbed, and all such inconsistencies are resolved; it will be safe to
> disable the `osd distrust data digest = true` workaround option.
>
> Changelog
> -
> * bluestore: set correctly shard for existed Collection (issue#24761, 
> pr#22860,
> Jianpeng Ma)
> * build/ops: Boost system library is no longer required to compile and link
> example librados program (issue#25054, pr#23202, Nathan Cutler)
> * build/ops: Bring back diff -y for non-FreeBSD (issue#24396, issue#21664,
> pr#22848, Sage Weil, David Zafman)
> * build/ops: install-deps.sh fails on newest openSUSE Leap (issue#25064,
> pr#23179, Kyr Shatskyy)
> * build/ops: Mimic build fails with -DWITH_RADOSGW=0 (issue#24437,
> pr#22864, Dan Mick)
> * build/ops: order rbdmap.service before remote-fs-pre.target
> (issue#24713, pr#22844, Ilya Dryomov)
> * build/ops: rpm: silence osd block chown (issue#25152, pr#23313, Dan van
> der Ster)
> * cephfs-journal-tool: Fix purging when importing an zero-length journal
> (issue#24239, pr#22980, yupeng chen, zhongyan gu)
> * cephfs: MDSMonitor: uncommitted state exposed to clients/mdss
> (issue#23768, pr#23013, Patrick Donnelly)
> * ceph-fuse mount failed because no mds (issue#22205, pr#22895, liyan)
> * ceph-volume add a __release__ string, to help version-conditional calls
> (issue#25170, pr#23331, Alfredo Deza)
> * ceph-volume: adds test for `ceph-volume lvm list /dev/sda` (issue#24784,
> issue#24957, pr#23350, Andrew Schoen)
> * ceph-volume: do not use stdin in luminous (issue#25173, issue#23260,
> pr#23367, Alfredo Deza)
> * ceph-volume enable the ceph-osd during lvm activation (issue#24152,
> pr#23394, Dan van der Ster, Alfredo Deza)
> * ceph-volume expand on the LVM API to create multiple LVs at different
> sizes (issue#24020, pr#23395, Alfredo Deza)
> * ceph-volume lvm.activate conditional mon-config on prime-osd-dir
> (issue#25216, pr#23397, Alfredo Deza)
> * ceph-volume lvm.batch remove non-existent sys_api property
> (issue#34310, pr#23811, Alfredo Deza)
> * ceph-volume lvm.listing only include devices if they exist (issue#24952,
> pr#23150, Alfredo Deza)
> * ceph-volume: process.call with stdin in Python 3 fix (issue#24993, pr#23238,
> Alfredo Deza)
> * ceph-volume: PVolumes.get() should return one PV when using name or
> uuid (issue#24784, pr#23329, Andrew Schoen)
> * ceph-volume: refuse to zap mapper devices (issue#24504, pr#23374,
> Andrew Schoen)
> * ceph-volume: tests.functional inherit SSH_ARGS from ansible (issue#34311,
> pr#23813, Alfredo Deza)
> * ceph-volume tests/functional run lvm list after OSD provisioning
> (issue#24961, pr#23147, Alfredo Deza)
> * ceph-volume: unmount lvs correctly before zapping (issue#24796,
> pr#23128, Andrew Schoen)
> * ceph-volume: update batch documentation to explain filestore strategies
> (issue#34309, pr#23825, Alfredo Deza)
> * change default filestore_merge_threshold to -10 (issue#24686, pr#22814,
> Douglas Fuller)
> * client: add inst to asok status output (issue#24724, pr#23107, Patrick
> Donnelly)
> * client: fixup parallel calls to ceph_ll_lookup_inode() in NFS FASL
> (issue#22683, pr#23012, huanwen ren)
> * client: increase verbosity level 

Re: [ceph-users] Ceph iSCSI is a prank?

2018-03-04 Thread Adrian Saul

We are using Ceph+RBD+NFS under pacemaker for VMware.  We are doing iSCSI using 
SCST but have not used it against VMware, just Solaris and Hyper-V.

It generally works and performs well enough – the biggest issues are the 
clustering for iSCSI ALUA support and NFS failover, most of which we have 
developed in house – we still have not quite got that right yet.
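
For reference, the RBD-backed NFS datastore approach Daniel describes below boils 
down to something like this on the gateway host (a rough sketch only - pool, image 
and export names are placeholders):

  rbd create vmware/datastore1 --size 4T
  rbd map vmware/datastore1
  mkfs.xfs /dev/rbd/vmware/datastore1
  mkdir -p /export/datastore1
  mount /dev/rbd/vmware/datastore1 /export/datastore1
  echo '/export/datastore1 *(rw,no_root_squash,sync)' >> /etc/exports
  exportfs -ra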



From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Daniel 
K
Sent: Saturday, 3 March 2018 1:03 AM
To: Joshua Chen 
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph iSCSI is a prank?

There's been quite a few VMWare/Ceph threads on the mailing list in the past.

One setup I've been toying with is a linux guest running on the vmware host on 
local storage, with the guest mounting a ceph RBD with a filesystem on it, then 
exporting that via NFS to the VMWare host as a datastore.

Exporting CephFS via NFS to Vmware is another option.

I'm not sure how well shared storage will work with either of these 
configurations. but they work fairly well for single-host deployments.

There are also quite a few products that do support iscsi on ceph. Suse 
Enterprise Storage is a commercial one, PetaSAN is an open-source option.


On Fri, Mar 2, 2018 at 2:24 AM, Joshua Chen <csc...@asiaa.sinica.edu.tw> wrote:
Dear all,
  I wonder how we could support VM systems with ceph storage (block device)? My 
colleagues are waiting for my answer for vmware (vSphere 5) and I myself use 
oVirt (RHEV). The default protocol is iSCSI.
  I know that openstack/cinder work well with ceph and proxmox (just heard) 
too. But currently we are using vmware and ovirt.


Your wise suggestion is appreciated

Cheers
Joshua


On Thu, Mar 1, 2018 at 3:16 AM, Mark Schouten <m...@tuxis.nl> wrote:
Does Xen still not support RBD? Ceph has been around for years now!
Met vriendelijke groeten,

--
Kerio Operator in de Cloud? https://www.kerioindecloud.nl/
Mark Schouten | Tuxis Internet Engineering
KvK: 61527076 | http://www.tuxis.nl/
T: 0318 200208 | i...@tuxis.nl


From: Massimiliano Cuttini <m...@phoenixweb.it>
To: "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
Sent: 28-2-2018 13:53
Subject: [ceph-users] Ceph iSCSI is a prank?

I was building ceph in order to use it with iSCSI.
But I just saw from the docs that it needs:

CentOS 7.5
(which is not available yet, it's still at 7.4)
https://wiki.centos.org/Download

Kernel 4.17
(which is not available yet, it is still at 4.15.7)
https://www.kernel.org/

So I guess there is no official support, and this is just a bad prank.

Ceph has been ready to be used with S3 for many years.
But it needs the kernel of the next century to work with such an old technology 
like iSCSI.
So sad.






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Confidentiality: This email and any attachments are confidential and may be 
subject to copyright, legal or some other professional privilege. They are 
intended solely for the attention and use of the named addressee(s). They may 
only be copied, distributed or disclosed with the consent of the copyright 
owner. If you have received this email by mistake or by breach of the 
confidentiality clause, please notify the sender immediately by return email 
and delete or destroy all copies of the email. Any confidentiality, privilege 
or copyright is not waived or lost because this email has been sent to you by 
mistake.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] multi site with cephfs

2018-05-21 Thread Adrian Saul

We run CephFS in a limited fashion in a stretched cluster of about 40km with 
redundant 10G fibre between sites – link latency is in the order of 1-2ms.  
Performance is reasonable for our usage but is noticeably slower than 
comparable local ceph based RBD shares.

Essentially we just setup the ceph pools behind cephFS to have replicas on each 
site.  To export it we are simply using Linux kernel NFS and it gets exported 
from 4 hosts that act as CephFS clients.  Those 4 hosts are then setup in an 
DNS record that resolves to all 4 IPs, and we then use automount to do 
automatic mounting and host failover on the NFS clients.  Automount takes care 
of finding the quickest and available NFS server.

I stress this is a limited setup that we use for some fairly light duty, but we 
are looking to move things like user home directories onto this.  YMMV.
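
In configuration terms that setup is roughly the following (a sketch only - the 
crush rule, pool names, PG counts and client details are placeholders and need 
sizing for your own cluster):

  # pools behind CephFS, 4 copies placed by a rule that splits replicas across both sites
  ceph osd pool create cephfs_data 1024 1024 replicated stretched
  ceph osd pool create cephfs_metadata 128 128 replicated stretched
  ceph osd pool set cephfs_data size 4
  ceph osd pool set cephfs_metadata size 4
  ceph fs new cephfs cephfs_metadata cephfs_data

  # on each NFS gateway (a CephFS client)
  mount -t ceph mon1,mon2,mon3:/ /cephfs -o name=nfsgw,secretfile=/etc/ceph/nfsgw.secret
  echo '/cephfs/home  10.0.0.0/16(rw,no_root_squash)' >> /etc/exports
  exportfs -ra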


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Up Safe
Sent: Monday, 21 May 2018 5:36 PM
To: David Turner 
Cc: ceph-users 
Subject: Re: [ceph-users] multi site with cephfs

Hi,
can you be a bit more specific?
I need to understand whether this is doable at all.
Other options would be using ganesha, but I understand it's very limited on NFS;
or start looking at gluster.

Basically, I need the multi site option, i.e. active-active read-write.

Thanks

On Wed, May 16, 2018 at 5:50 PM, David Turner <drakonst...@gmail.com> wrote:
Object storage multi-site is very specific to using object storage.  It uses 
the RGW API's to sync s3 uploads between each site.  For CephFS you might be 
able to do a sync of the rados pools, but I don't think that's actually a thing 
yet.  RBD mirror is also a layer on top of things to sync between sites.  
Basically I think you need to do something on top of the Filesystem as opposed 
to within Ceph  to sync it between sites.

On Wed, May 16, 2018 at 9:51 AM Up Safe <upands...@gmail.com> wrote:
But this is not the question here.
The question is whether I can configure multi site for CephFS.
Will I be able to do so by following the guide to set up the multi site for 
object storage?

Thanks

On Wed, May 16, 2018, 16:45 John Hearns <hear...@googlemail.com> wrote:
The answer given at the seminar yesterday was that a practical limit was around 
60km.
I don't think 100km is that much longer.  I defer to the experts here.






On 16 May 2018 at 15:24, Up Safe <upands...@gmail.com> wrote:
Hi,

About a 100 km.
I have a 2-4ms latency between them.

Leon

On Wed, May 16, 2018, 16:13 John Hearns <hear...@googlemail.com> wrote:
Leon,
I was at a Lenovo/SuSE seminar yesterday and asked a similar question regarding 
separated sites.
How far apart are these two geographical locations?   It does matter.

On 16 May 2018 at 15:07, Up Safe <upands...@gmail.com> wrote:
Hi,
I'm trying to build a multi site setup.
But the only guides I've found on the net were about building it with object 
storage or rbd.
What I need is cephfs.
I.e. I need to have 2 synced file storages at 2 geographical locations.
Is this possible?
Also, if I understand correctly - cephfs is just a component on top of the 
object storage.
Following this logic - it should be possible, right?
Or am I totally off here?
Thanks,
Leon

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Confidentiality: This email and any attachments are confidential and may be 
subject to copyright, legal or some other professional privilege. They are 
intended solely for the attention and use of the named addressee(s). They may 
only be copied, distributed or disclosed with the consent of the copyright 
owner. If you have received this email by mistake or by breach of the 
confidentiality clause, please notify the sender immediately by return email 
and delete or destroy all copies of the email. Any confidentiality, privilege 
or copyright is not waived or lost because this email has been sent to you by 
mistake.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] multi site with cephfs

2018-05-21 Thread Adrian Saul

You have the same performance problem then regardless of what platform you 
choose to present it on.  If you want cross-site consistency with a single 
consistent view, you need to replicate writes synchronously between sites, 
which will induce a performance hit for writes.  Any other snapshot/async 
setup, while improving write performance, leaves you with that time-window gap 
should you lose a site.

If you are not particularly latency sensitive on writes (i.e. these are just 
small documents being written and left behind) then the write latency penalty 
is probably not that big an issue for the easier access a stretched CephFS 
filesystem would give you.  If your clients can access cephfs natively that 
might be cleaner than using NFS over the top, although it means having clients 
get full access to the ceph public network – otherwise my previously mentioned 
NFS export with automount would probably work for you.


From: Up Safe [mailto:upands...@gmail.com]
Sent: Tuesday, 22 May 2018 12:33 AM
To: David Turner 
Cc: Adrian Saul ; ceph-users 

Subject: Re: [ceph-users] multi site with cephfs

I'll explain.
Right now we have 2 sites (racks) with several dozens of servers at each
accessing a NAS (let's call it a NAS, although it's an IBM v7000 Unified that 
serves the files via NFS).

The biggest problem is that it works active-passive, i.e. we always access one 
of the storages for read/write
and the other one is replicated once every few hours, so it's more for backup 
needs.
In this setup, once the power goes down in our main site we're stuck with 
somewhat outdated (several hours old) files, and we need to remount all of the 
servers and what not.
The multi site ceph was supposed to solve this problem for us. This way we 
would have only local mounts, i.e.
each server would only access the filesystem that is in the same site. And if 
one of the sited go down - no pain.
The files are rather small, pdfs and xml of 50-300KB mostly.
The total size is about 25 TB right now.

We're a low budget company, so your advise about developing is not going to 
happen as we have no such skills or resources for this.
Plus, I want to make this transparent for the devs and everyone - just an 
infrastructure replacement that will buy me all of the ceph benefits and
allow the company to survive the power outages or storage crashes.


On Mon, May 21, 2018 at 5:12 PM, David Turner <drakonst...@gmail.com> wrote:
Not a lot of people use object storage multi-site.  I doubt anyone is using 
this like you are.  In theory it would work, but even if somebody has this 
setup running, it's almost impossible to tell if it would work for your needs 
and use case.  You really should try it out for yourself to see if it works to 
your needs.  And if you feel so inclined, report back here with how it worked.

If you're asking for advice, why do you need a networked posix filesystem?  
Unless you are using proprietary software with this requirement, it's generally 
lazy coding that requires a mounted filesystem like this and you should aim 
towards using object storage instead without any sort of NFS layer.  It's a 
little more work for the developers, but is drastically simpler to support and 
manage.

On Mon, May 21, 2018 at 10:06 AM Up Safe <upands...@gmail.com> wrote:
guys,
please tell me if I'm in the right direction.
If ceph object storage can be set up in multi site configuration,
and I add ganesha (which to my understanding is an "adapter"
that serves s3 objects via nfs to clients) -
won't this work as active-active?


Thanks

On Mon, May 21, 2018 at 11:48 AM, Up Safe <upands...@gmail.com> wrote:
ok, thanks.
but it seems to me that having pool replicas spread over sites is a bit too 
risky performance wise.
how about ganesha? will it work with cephfs and multi site setup?
I was previously reading about rgw with ganesha and it was full of limitations.
with cephfs - there is only one and one I can live with.
Will it work?

On Mon, May 21, 2018 at 10:57 AM, Adrian Saul <adrian.s...@tpgtelecom.com.au> wrote:

We run CephFS in a limited fashion in a stretched cluster of about 40km with 
redundant 10G fibre between sites – link latency is in the order of 1-2ms.  
Performance is reasonable for our usage but is noticeably slower than 
comparable local ceph based RBD shares.

Essentially we just setup the ceph pools behind cephFS to have replicas on each 
site.  To export it we are simply using Linux kernel NFS and it gets exported 
from 4 hosts that act as CephFS clients.  Those 4 hosts are then setup in an 
DNS record that resolves to all 4 IPs, and we then use automount to do 
automatic mounting and host failover on the NFS clients.  Automount takes care 
of finding the quickest and available NFS server.

I stress this is a limited setup that we use for some fairly light duty, but we 
are looking to move t

Re: [ceph-users] Review of Ceph on ZFS - or how not to deploy Ceph for RBD + OpenStack

2017-01-10 Thread Adrian Saul

I would concur having spent a lot of time on ZFS on Solaris.

ZIL will reduce the fragmentation problem a lot (because it is not doing intent 
logging into the filesystem itself which fragments the block allocations) and 
write response will be a lot better.  I would use different devices for L2ARC 
and ZIL - ZIL needs to be small and fast for writes (and mirrored - we have 
used some HGST 16G devices which are designed as ZILs - pricey but highly 
recommended) - L2ARC just needs to be faster for reads than your data disks; most 
SSDs would be fine for this.

A 14 disk RAIDZ2 is also going to be very poor for writes especially with SATA 
- you are effectively only getting one disk worth of IOPS for write as each 
write needs to hit all disks.  Without a ZIL you are also losing out on write 
IOPS for ZIL and metadata operations.
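
On the Linux/ZoL side, adding a mirrored SLOG and an L2ARC device to an existing 
pool is along these lines (device and pool names are placeholders):

  zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1   # mirrored ZIL/SLOG
  zpool add tank cache /dev/sdx                         # L2ARC
  zpool status tank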



> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Patrick Donnelly
> Sent: Wednesday, 11 January 2017 5:24 PM
> To: Kevin Olbrich
> Cc: Ceph Users
> Subject: Re: [ceph-users] Review of Ceph on ZFS - or how not to deploy Ceph
> for RBD + OpenStack
>
> Hello Kevin,
>
> On Tue, Jan 10, 2017 at 4:21 PM, Kevin Olbrich  wrote:
> > 5x Ceph node equipped with 32GB RAM, Intel i5, Intel DC P3700 NVMe
> > journal,
>
> Is the "journal" used as a ZIL?
>
> > We experienced a lot of io blocks (X requests blocked > 32 sec) when a
> > lot of data is changed in cloned RBDs (disk imported via OpenStack
> > Glance, cloned during instance creation by Cinder).
> > If the disk was cloned some months ago and large software updates are
> > applied (a lot of small files) combined with a lot of syncs, we often
> > had a node hit suicide timeout.
> > Most likely this is a problem with op thread count, as it is easy to
> > block threads with RAIDZ2 (RAID6) if many small operations are written
> > to disk (again, COW is not optimal here).
> > When recovery took place (0.020% degraded) the cluster performance was
> > very bad - remote service VMs (Windows) were unusable. Recovery itself
> > was using
> > 70 - 200 mb/s which was okay.
>
> I would think having an SSD ZIL here would make a very large difference.
> Probably a ZIL may have a much larger performance impact than an L2ARC
> device. [You may even partition it and have both but I'm not sure if that's
> normally recommended.]
>
> Thanks for your writeup!
>
> --
> Patrick Donnelly
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Confidentiality: This email and any attachments are confidential and may be 
subject to copyright, legal or some other professional privilege. They are 
intended solely for the attention and use of the named addressee(s). They may 
only be copied, distributed or disclosed with the consent of the copyright 
owner. If you have received this email by mistake or by breach of the 
confidentiality clause, please notify the sender immediately by return email 
and delete or destroy all copies of the email. Any confidentiality, privilege 
or copyright is not waived or lost because this email has been sent to you by 
mistake.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MySQL and ceph volumes

2017-03-07 Thread Adrian Saul

Possibly MySQL is doing sync writes, whereas your FIO could be doing buffered 
writes.

Try enabling the sync option on fio and compare results.
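
A pair of runs along these lines usually makes the gap obvious (a sketch only - 
device, block size and runtime are placeholders):

  # high queue depth, async - what a naive benchmark often measures
  fio --name=qd32 --filename=/dev/vdb --rw=randwrite --bs=16k --iodepth=32 \
      --ioengine=libaio --direct=1 --runtime=60 --time_based
  # sync writes at queue depth 1 - much closer to what a database redo log does
  fio --name=sync-qd1 --filename=/dev/vdb --rw=write --bs=16k --iodepth=1 \
      --sync=1 --direct=1 --runtime=60 --time_based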


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Matteo Dacrema
> Sent: Wednesday, 8 March 2017 7:52 AM
> To: ceph-users
> Subject: [ceph-users] MySQL and ceph volumes
>
> Hi All,
>
> I have a galera cluster running on openstack with data on ceph volumes
> capped at 1500 iops for read and write ( 3000 total ).
> I can’t understand why with fio I can reach 1500 iops without IOwait and
> MySQL can reach only 150 iops both read or writes showing 30% of IOwait.
>
> I tried with fio 64k block size and various io depth ( 1.2.4.8.16….128) and I
> can’t reproduce the problem.
>
> Anyone can tell me where I’m wrong?
>
> Thank you
> Regards
> Matteo
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Confidentiality: This email and any attachments are confidential and may be 
subject to copyright, legal or some other professional privilege. They are 
intended solely for the attention and use of the named addressee(s). They may 
only be copied, distributed or disclosed with the consent of the copyright 
owner. If you have received this email by mistake or by breach of the 
confidentiality clause, please notify the sender immediately by return email 
and delete or destroy all copies of the email. Any confidentiality, privilege 
or copyright is not waived or lost because this email has been sent to you by 
mistake.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MySQL and ceph volumes

2017-03-07 Thread Adrian Saul

The problem is not so much ceph, but the fact that sync workloads tend to mean 
you have an effective queue depth of 1, because the application serialises its 
IO - it waits for the last write to complete before issuing the next one.  At 
queue depth 1, IOPS is roughly 1/latency, so ~150 IOPS corresponds to roughly 
6-7ms per synchronous write round trip.


From: Matteo Dacrema [mailto:mdacr...@enter.eu]
Sent: Wednesday, 8 March 2017 10:36 AM
To: Adrian Saul
Cc: ceph-users
Subject: Re: [ceph-users] MySQL and ceph volumes

Thank you Adrian!

I'd forgotten this option, and with it I can reproduce the problem.

Now, what could be the problem on the ceph side with O_DSYNC writes?

Regards
Matteo



This email and any files transmitted with it are confidential and intended 
solely for the use of the individual or entity to whom they are addressed. If 
you have received this email in error please notify the system manager. This 
message contains confidential information and is intended only for the 
individual named. If you are not the named addressee you should not 
disseminate, distribute or copy this e-mail. Please notify the sender 
immediately by e-mail if you have received this e-mail by mistake and delete 
this e-mail from your system. If you are not the intended recipient you are 
notified that disclosing, copying, distributing or taking any action in 
reliance on the contents of this information is strictly prohibited.

On 08 Mar 2017, at 00:25, Adrian Saul <adrian.s...@tpgtelecom.com.au> wrote:


Possibly MySQL is doing sync writes, where as your FIO could be doing buffered 
writes.

Try enabling the sync option on fio and compare results.



-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Matteo Dacrema
Sent: Wednesday, 8 March 2017 7:52 AM
To: ceph-users
Subject: [ceph-users] MySQL and ceph volumes

Hi All,

I have a galera cluster running on openstack with data on ceph volumes
capped at 1500 iops for read and write ( 3000 total ).
I can’t understand why with fio I can reach 1500 iops without IOwait and
MySQL can reach only 150 iops both read or writes showing 30% of IOwait.

I tried with fio 64k block size and various io depth ( 1.2.4.8.16….128) and I
can’t reproduce the problem.

Anyone can tell me where I’m wrong?

Thank you
Regards
Matteo

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



Confidentiality: This email and any attachments are confidential and may be 
subject to copyright, legal or some other professional privilege. They are 
intended solely for the attention and use of the named addressee(s). They may 
only be copied, distributed or disclosed with the consent of the copyright 
owner. If you have received this email by mistake or by breach of the 
confidentiality clause, please notify the sender immediately by return email 
and delete or destroy all copies of the email. Any confidentiality, privilege 
or copyright is not waived or lost because this email has been sent to you by 
mistake.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd iscsi gateway question

2017-04-05 Thread Adrian Saul

I am not sure if there is a hard and fast rule you are after, but pretty much 
anything that would cause ceph transactions to be blocked (flapping OSD, 
network loss, hung host) has the potential to block RBD IO which would cause 
your iSCSI LUNs to become unresponsive for that period.

For the most part though, once that condition clears things keep working, so 
it's not like a hang where you need to reboot to clear it.  Some situations we 
have hit with our setup:


-  Failed OSDs (dead disks) – no issues

-  Cluster rebalancing – ok if throttled back to keep service times down (see the sketch below)

-  Network packet loss (bad fibre) – painful, broken communication 
everywhere, caused a krbd hang needing a reboot

-  RBD Snapshot deletion – disk latency through roof, cluster 
unresponsive for minutes at a time, won’t do again.
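
The throttling mentioned above was essentially just winding the recovery and 
backfill settings right down, roughly like this (values are examples only):

  ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'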



From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Brady 
Deetz
Sent: Thursday, 6 April 2017 12:58 PM
To: ceph-users
Subject: [ceph-users] rbd iscsi gateway question

I apologize if this is a duplicate of something recent, but I'm not finding 
much. Does the issue still exist where dropping an OSD results in a LUN's I/O 
hanging?

I'm attempting to determine if I have to move off of VMWare in order to safely 
use Ceph as my VM storage.
Confidentiality: This email and any attachments are confidential and may be 
subject to copyright, legal or some other professional privilege. They are 
intended solely for the attention and use of the named addressee(s). They may 
only be copied, distributed or disclosed with the consent of the copyright 
owner. If you have received this email by mistake or by breach of the 
confidentiality clause, please notify the sender immediately by return email 
and delete or destroy all copies of the email. Any confidentiality, privilege 
or copyright is not waived or lost because this email has been sent to you by 
mistake.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd iscsi gateway question

2017-04-06 Thread Adrian Saul
In my case I am using SCST, so that is what my experience is based on.  For our 
VMware we are using NFS, but for Hyper-V and Solaris we are using iSCSI.

There has actually been some work done on running SCST in userland, which could 
be interesting for making an scst_librbd-style integration that bypasses the 
need for krbd.



From: Nick Fisk [mailto:n...@fisk.me.uk]
Sent: Thursday, 6 April 2017 5:43 PM
To: Adrian Saul; 'Brady Deetz'; 'ceph-users'
Subject: RE: [ceph-users] rbd iscsi gateway question

I assume Brady is referring to the death spiral LIO gets into with some 
initiators, including vmware, if an IO takes longer than about 10s. I haven’t 
heard of anything, and can’t see any changes, so I would assume this issue 
still remains.

I would look at either SCST or NFS for now.

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Adrian 
Saul
Sent: 06 April 2017 05:32
To: Brady Deetz ; ceph-users 
Subject: Re: [ceph-users] rbd iscsi gateway question


I am not sure if there is a hard and fast rule you are after, but pretty much 
anything that would cause ceph transactions to be blocked (flapping OSD, 
network loss, hung host) has the potential to block RBD IO which would cause 
your iSCSI LUNs to become unresponsive for that period.

For the most part though, once that condition clears things keep working, so 
its not like a hang where you need to reboot to clear it.  Some situations we 
have hit with our setup:

-  Failed OSDs (dead disks) – no issues
-  Cluster rebalancing – ok if throttled back to keep service times down
-  Network packet loss (bad fibre) – painful, broken communication 
everywhere, caused a krbd hang needing a reboot
-  RBD Snapshot deletion – disk latency through roof, cluster 
unresponsive for minutes at a time, won’t do again.



From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Brady 
Deetz
Sent: Thursday, 6 April 2017 12:58 PM
To: ceph-users
Subject: [ceph-users] rbd iscsi gateway question

I apologize if this is a duplicate of something recent, but I'm not finding 
much. Does the issue still exist where dropping an OSD results in a LUN's I/O 
hanging?

I'm attempting to determine if I have to move off of VMWare in order to safely 
use Ceph as my VM storage.


Confidentiality: This email and any attachments are confidential and may be 
subject to copyright, legal or some other professional privilege. They are 
intended solely for the attention and use of the named addressee(s). They may 
only be copied, distributed or disclosed with the consent of the copyright 
owner. If you have received this email by mistake or by breach of the 
confidentiality clause, please notify the sender immediately by return email 
and delete or destroy all copies of the email. Any confidentiality, privilege 
or copyright is not waived or lost because this email has been sent to you by 
mistake.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] design guidance

2017-06-05 Thread Adrian Saul
> > Early usage will be CephFS, exported via NFS and mounted on ESXi 5.5
> > and
> > 6.0 hosts(migrating from a VMWare environment), later to transition to
> > qemu/kvm/libvirt using native RBD mapping. I tested iscsi using lio
> > and saw much worse performance with the first cluster, so it seems
> > this may be the better way, but I'm open to other suggestions.
> >
> I've never seen any ultimate solution to providing HA iSCSI on top of Ceph,
> though other people here have made significant efforts.

In our tests our best results were with SCST - also because it provided proper 
ALUA support at the time.  I ended up developing my own pacemaker cluster 
resources to manage the SCST orchestration and ALUA failover.  In our model we 
have  a pacemaker cluster in front being an RBD client presenting LUNs/NFS out 
to VMware (NFS), Solaris and Hyper-V (iSCSI).  We are using CephFS over NFS but 
performance has been poor, even using it just for VMware templates.  We are on 
an earlier version of Jewel so its possibly some later versions may improve 
CephFS for that but I have not had time to test it.

We have been running a small production/POC for over 18 months on that setup, 
and gone live into a much larger setup in the last 6 months based on that 
model.  It's not without its issues, but most of that is a lack of test 
resources to be able to shake out some of the client compatibility and failover 
shortfalls we have.
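
As a very rough sketch of the RBD+NFS failover side only (not our actual config - 
resource names, networks and paths are placeholders, and it assumes the RBD image 
is already mapped at boot via rbdmap):

  pcs resource create fs_share ocf:heartbeat:Filesystem \
      device=/dev/rbd/nfs/share1 directory=/export/share1 fstype=xfs
  pcs resource create exp_share ocf:heartbeat:exportfs \
      clientspec=10.0.0.0/24 directory=/export/share1 fsid=1 options=rw,no_root_squash
  pcs resource create vip_share ocf:heartbeat:IPaddr2 ip=10.0.0.50 cidr_netmask=24
  pcs resource group add grp_share1 fs_share exp_share vip_share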

Confidentiality: This email and any attachments are confidential and may be 
subject to copyright, legal or some other professional privilege. They are 
intended solely for the attention and use of the named addressee(s). They may 
only be copied, distributed or disclosed with the consent of the copyright 
owner. If you have received this email by mistake or by breach of the 
confidentiality clause, please notify the sender immediately by return email 
and delete or destroy all copies of the email. Any confidentiality, privilege 
or copyright is not waived or lost because this email has been sent to you by 
mistake.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] VMware + CEPH Integration

2017-06-18 Thread Adrian Saul
> Hi Alex,
>
> Have you experienced any problems with timeouts in the monitor action in
> pacemaker? Although largely stable, every now and again in our cluster the
> FS and Exportfs resources timeout in pacemaker. There's no mention of any
> slow requests or any peering..etc from the ceph logs so it's a bit of a 
> mystery.

Yes - we have that in our setup, which is very similar.  Usually I find it 
related to RBD device latency due to scrubbing or similar, but even when tuning 
some of that down we still get it randomly.

The most annoying part is that once it comes up, having to use "resource 
cleanup" to try and remove the failed resource usually has more impact than the 
actual error.
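
One thing worth trying is relaxing the monitor timeouts on the affected resources 
so a slow scrub period does not immediately fail them - something along these 
lines (resource name and values are examples only):

  pcs resource update fs_share op monitor interval=30s timeout=120s
  pcs resource cleanup fs_share
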
Confidentiality: This email and any attachments are confidential and may be 
subject to copyright, legal or some other professional privilege. They are 
intended solely for the attention and use of the named addressee(s). They may 
only be copied, distributed or disclosed with the consent of the copyright 
owner. If you have received this email by mistake or by breach of the 
confidentiality clause, please notify the sender immediately by return email 
and delete or destroy all copies of the email. Any confidentiality, privilege 
or copyright is not waived or lost because this email has been sent to you by 
mistake.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph RBD latencies

2016-03-02 Thread Adrian Saul
Hi Ceph-users,

TL;DR - I can't seem to pin down why an unloaded system with flash based OSD 
journals has higher than desired write latencies for RBD devices.  Any ideas?


  I am developing a storage system based on Ceph and an SCST+pacemaker cluster. 
  Our initial testing showed promising results even with mixed available 
hardware, and we proceeded to order a purpose-designed platform for developing 
into production.  The hardware is:

2x 1RU servers as "frontends" (SCST+pacemaker - ceph mons and clients using RBD 
- they present iSCSI to other systems).
3x 2RU OSD SSD servers (24 bay 2.5" SSD) - currently with 4 2TB Samsung Evo 
SSDs each
3x 4RU OSD SATA servers (36 bay) - currently with 6 8TB Seagate each

 As part of the research and planning we opted to put a pair of Intel PC3700DC 
400G NVME cards in each OSD server.  These are configured mirrored and setup as 
the journals for the OSD disks, the aim being to improve write latencies.  All 
the machines have 128G RAM and dual E5-2630v3 CPUs, and use 4 aggregated 10G 
NICs back to a common pair of switches.   All machines are running Centos 7, 
with the frontends using the 4.4.1 elrepo-ml kernel to get a later RBD kernel 
module.

On the ceph side each disk in the OSD servers are setup as an individual OSD, 
with a 12G journal created on the flash mirror.   I setup the SSD servers into 
one root, and the SATA servers into another and created pools using hosts as 
fault boundaries, with the pools set for 2 copies.   I created the pools with 
the pg_num and pgp_num set to 32x the number of OSDs in the pool.   On the 
frontends we create RBD devices and present them as iSCSI LUNs using SCST to 
clients - in this test case a Solaris host.

The problem I have is that even with a lightly loaded system the service times 
for the LUNs for writes are just not getting down to where we want them, and they 
are not very stable - with 5 LUNs doing around 200 32K IOPS consistently, the 
service times sit at around 3-4ms but regularly (every 20-30 seconds) spike to 
above 12-15ms, which puts the average at 6ms over 5 minutes.  I fully expected 
we would have some latencies due to the distributed and networked nature of 
Ceph, but in this instance I just cannot find where these latencies are coming 
from, especially with the SSD based pool and having flash based journaling.

- The RBD devices show relatively low service times, but high queue times.  
These are in line with what Solaris sees so I don't think SCST/iSCSI is adding 
much latency.
- The journals are reporting 0.02ms service times, and seem to cope fine with 
any bursts
- The SSDs do show similar latency variations with writes - bursting up to 12ms 
or more whenever there is high write workloads.
- I have tried applying what tuning I can to the SSD block devices (noop 
scheduler etc) - no difference
- I have removed any sort of smarts around IO grouping in SCST - no major impact
- I have tried tuning up filesystore  queue and wbthrottle values but could not 
find much difference from that.
- Read performance is excellent, the RBD devices show little to no rwait and I 
can do benchmarks up over 1GB/s in some tests.  Write throughput can also be 
good (~700MB/s).
- I have tried using different RBD orders more in line with the iSCSI client 
block sizes (i.e. 32K, 128K instead of 4M) but it seemed to make things worse.  
I would have thought better alignment would reduce latency, but is that offset 
by the extra overhead in object work?

What I am looking for is what other areas I need to look at, or what diagnostics 
I need, to work this out.  We would really like to use ceph across a mixed 
workload that includes some DB systems that are fairly latency sensitive, but 
as it stands it's hard to be confident in the performance when a fairly quiet, 
unloaded system seems to struggle, even with all this hardware behind it.  I 
get the impression that the SSD write latencies might be coming into play, as 
they are similar to the numbers I see, but really for writes I would expect 
them to be "hidden" behind the journaling.

I also would have thought that, not being under load and with the flash journals, 
the only latency would be coming from mapping calculations on the client or 
otherwise some contention within the RBD module itself.  Any ideas how I can 
break out what the times are for what the RBD module is doing?
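
For anyone digging into the same thing, the OSD admin sockets can break the write 
path down per stage - roughly as below (the OSD id is a placeholder, and counter 
names vary a little between releases):

  ceph daemon osd.0 perf dump | egrep -A3 'op_w_latency|subop_w_latency|journal_latency'
  ceph daemon osd.0 dump_historic_ops      # slowest recent ops with per-step timestamps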

Any help appreciated.

As an aside - I think Ceph as a concept is exactly what a storage system should 
be about, hence why we are using it this way.  Its been awesome to get stuck 
into it and learn how it works and what it can do.




Adrian Saul | Infrastructure Projects Team Lead
TPG Telecom (ASX: TPM)

Re: [ceph-users] Ceph RBD latencies

2016-03-03 Thread Adrian Saul

> Samsung EVO...
> Which exact model, I presume this is not a DC one?
>
> If you had put your journals on those, you would already be pulling your hairs
> out due to abysmal performance.
>
> Also with Evo ones, I'd be worried about endurance.

No,  I am using the P3700DCs for journals.  The Samsungs are the 850 2TB 
(MZ-75E2T0BW).  Chosen primarily on price.  We already built a system using the 
1TB models with Solaris+ZFS and I have little faith in them.  Certainly their 
write performance is erratic and not ideal.  We have other vendor options which 
are what they call "Enterprise Value" SSDs, but still 4x the price.   I would 
prefer a higher grade drive but unfortunately cost is being driven from above 
me.

> > On the ceph side each disk in the OSD servers are setup as an individual
> > OSD, with a 12G journal created on the flash mirror.   I setup the SSD
> > servers into one root, and the SATA servers into another and created
> > pools using hosts as fault boundaries, with the pools set for 2
> > copies.
> Risky. If you have very reliable and well monitored SSDs you can get away
> with 2 (I do so), but with HDDs and the combination of their reliability and
> recovery time it's asking for trouble.
> I realize that this is testbed, but if your production has a replication of 3 
> you
> will be disappointed by the additional latency.

Again, cost - the end goal is that we build metro-based dual-site pools with 
2+2 replication.  I am aware of the risks, but presenting numbers based on 
buying 4x the disk we are able to use already gets questioned hard.

> This smells like garbage collection on your SSDs, especially since it matches
> time wise what you saw on them below.

I concur.  I am just not sure why that impacts back to the client when, from 
the client perspective, the journal should hide this.  If the journal is 
struggling to keep up and has to flush constantly then perhaps, but on the 
current steady-state IO rate I am testing with I don't think the journal should 
be that saturated.

> Have you tried the HDD based pool and did you see similar, consistent
> interval, spikes?

To be honest I have been focusing on the SSD numbers but that would be a good 
comparison.

> Or alternatively, configured 2 of your NVMEs as OSDs?

That was what I was thinking of doing - move the NVMEs to the frontends, make 
them OSDs and configure them as a read-forward cache tier for the other pools, 
and just have the SSDs and SATA journal by default on a first partition.
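
The cache tier part of that would be something along these lines (a sketch only - 
pool names are placeholders, the mode available depends on the release, and the 
hit_set and target sizing need real thought):

  ceph osd tier add rbd-ssd nvme-cache
  ceph osd tier cache-mode nvme-cache readforward
  ceph osd tier set-overlay rbd-ssd nvme-cache
  ceph osd pool set nvme-cache hit_set_type bloom
  ceph osd pool set nvme-cache target_max_bytes 200000000000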

> No, not really. The journal can only buffer so much.
> There are several threads about this in the archives.
>
> You could tune it but that will only go so far if your backing storage can't 
> keep
> up.
>
> Regards,
>
> Christian


Agreed - Thanks for your help.
Confidentiality: This email and any attachments are confidential and may be 
subject to copyright, legal or some other professional privilege. They are 
intended solely for the attention and use of the named addressee(s). They may 
only be copied, distributed or disclosed with the consent of the copyright 
owner. If you have received this email by mistake or by breach of the 
confidentiality clause, please notify the sender immediately by return email 
and delete or destroy all copies of the email. Any confidentiality, privilege 
or copyright is not waived or lost because this email has been sent to you by 
mistake.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph RBD latencies

2016-03-06 Thread Adrian Saul
> >The Samsungs are the 850 2TB
> > (MZ-75E2T0BW).  Chosen primarily on price.
>
> These are spec'ed at 150TBW, or an amazingly low 0.04 DWPD (over 5 years).
> Unless you have a read-only cluster, you will wind up spending MORE on
> replacing them (and/or loosing data when 2 fail at the same time) than going
> with something more sensible like Samsung's DC models or the Intel DC ones
> (S3610s come to mind for "normal" use).
> See also the current "List of SSDs" thread in this ML.

This was a metric I struggled to find and would have been useful in comparison. 
 I am sourcing prices on the SM863s anyway.  That SSD thread has been good to 
follow as well.

> Fast, reliable, cheap. Pick any 2.

Yup - unfortunately cheap is fixed, reliable is the reason we are doing this, 
and fast is now a must have.  The normal engineering/management dilemma.

> On your test setup or even better the Solaris one, have a look at their media
> wearout, or  Wear_Leveling_Count as Samsung calls it.
> I bet that makes for some scary reading.

For the Evos we found no tools we could use on Solaris - also, because we have 
cheap nasty SAS interposers in that setup, most tools don't work anyway.  Until 
we pull a disk and put it into a windows box we can't do any sort of 
diagnostics on it.  It would be useful to see because we have those disks 
taking a fair brunt of our performance workload now.
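
(On a Linux box at least, checking wear is usually as simple as the following - 
attribute names differ between vendors, so treat it as a sketch:)

  smartctl -A /dev/sdX | egrep -i 'wear|media_wearout|percent_lifetime'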

> Note that Ceph (RBD/RADOS to be precise) isn't particular suited for "long"
> distance replication due to the incurred latencies.
>
> That's unless your replication is happening "above" Ceph in the iSCSI bits 
> with
> something that's more optimized for this.
>
> Something along the lines of the DRBD proxy has been suggested for Ceph,
> but if at all it is a backburner project at best from what I gather.

We can fairly easily do low latency links (telco) but are looking at the 
architecture to try and limit that sort of long replication - doing replication 
at application and database levels instead.  The site to site replication would 
be limited to some clusters or applications that need sync replication for 
availability.

> There are some ways around this, which may or may not be suitable for your
> use case.
> EC pools (or RAID'ed OSDs, which I prefer) for HDD based pools.
> Of course this comes at a performance penalty, which you can offset again
> with for example fast RAID controllers with HW cache to some extend.
> But it may well turn out to be zero sum game.

I modelled an EC setup but that was at a multi-site level with local cache 
tiers in front, and it was going to be too big a challenge to do as a new, 
untested platform with too many latency questions.  Within a site, EC was not 
going to be cost effective: to do it properly I would need to up the number of 
hosts, and that pushed the pricing up too far, even if I went with smaller, 
lower-spec hosts.

I thought about hardware RAID as well, but as I would need to do host level 
redundancy anyway it was not gaining any efficiency - less risk but I would 
still need to replicate anyway so why not just go disk to disk.  More than 
likely I would quietly work in higher protection as we go live and deal with it 
later as a capacity expansion.

> Another thing is to use a cache pool (with top of the line SSDs), this is of
> course only a sensible course of action if your hot objects will fit in there.
> In my case they do (about 10-20% of the 2.4TB raw pool capacity) and
> everything is as fast as can be expected and the VMs (their time
> critical/sensitive application to be precise) are happy campers.

This is the model I am working to - our "fast" workloads using SSD caches  in 
front of bulk SATA, sizing the SSDs at around 25% of the capacity we require 
for "fast" storage.

For the "bulk" storage I would still use the SSD cache but sized to 10% of the 
SATA usable capacity.   I figure once we get live we can adjust numbers as 
required - expand with more cache hosts if needed.
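
The plumbing for that model is just the standard cache tiering commands; a 
rough sketch, assuming a backing pool called "sata-bulk" and an SSD pool called 
"ssd-cache" (pool names and the size cap are illustrative only):

    ceph osd tier add sata-bulk ssd-cache
    ceph osd tier cache-mode ssd-cache writeback
    ceph osd tier set-overlay sata-bulk ssd-cache
    ceph osd pool set ssd-cache hit_set_type bloom
    # cap the cache at roughly 10% of the usable backing capacity, e.g. 10TB:
    ceph osd pool set ssd-cache target_max_bytes 10995116277760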

> There's a counter in Ceph (counter-filestore_journal_bytes) that you can
> graph for journal usage.
> The highest I have ever seen is about 100MB for HDD based OSDs, less than
> 8MB for SSD based ones with default(ish) Ceph parameters.
>
> Since you seem to have experience with ZFS (I don't really, but I read alot
> ^o^), consider the Ceph journal equivalent to the ZIL.
> It is a write-only journal; it never gets read from unless there is a crash.
> That is why sequential, sync write speed is the utmost criterion for a Ceph
> journal device.
>
> If I recall correctly you were testing with 4MB block streams, thus pretty
> much filling the pipe to capacity, atop on your storage nodes will give a good
> insight.
>
> The journal is great to cover some bursts, but the Ceph OSD is flushing things
> from RAM to the backing storage on configurable time limits and once these
> are exceeded and/or you run out of RAM (pagecache), you are limited to what
> your backing storage can sustain.
>
> Now in real lif
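
For anyone wanting to check the journal usage counter mentioned above, it can 
be read straight off the OSD admin socket on a storage node - a quick sketch, 
assuming the default socket path (exact counter names vary a little between 
releases):

    ceph daemon osd.0 perf dump | python -m json.tool | grep -i journal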

[ceph-users] OSD crash after conversion to bluestore

2016-03-30 Thread Adrian Saul

I upgraded my lab cluster to 10.1.0 specifically to test out bluestore and see 
what latency difference it makes.

I was able to one by one zap and recreate my OSDs to bluestore and rebalance 
the cluster (the change to having new OSDs start with low weight threw me at 
first, but once  I worked that out it was fine).
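
For anyone repeating this, the per-OSD conversion amounts to something along 
these lines - a sketch only, with an illustrative device name, and on 10.1.x 
bluestore also had to be allowed via the experimental-features option in 
ceph.conf:

    # ceph.conf on the OSD hosts (pre-release bluestore gate):
    #   enable experimental unrecoverable data corrupting features = bluestore rocksdb
    ceph-disk zap /dev/sdb
    ceph-disk prepare --bluestore /dev/sdb
    ceph-disk activate /dev/sdb1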

I was all good until I completed the last OSD, and then one of the earlier ones 
fell over and refuses to restart.  Every attempt to start fails with this 
assertion failure:

-2> 2016-03-31 15:15:08.868588 7f931e5f0800  0  
cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
-1> 2016-03-31 15:15:08.868800 7f931e5f0800  1  
cls/timeindex/cls_timeindex.cc:259: Loaded timeindex class!
 0> 2016-03-31 15:15:08.870948 7f931e5f0800 -1 osd/OSD.h: In function 
'OSDMapRef OSDService::get_map(epoch_t)' thread 7f931e5f0800 time 2016-03-31 
15:15:08.869638
osd/OSD.h: 886: FAILED assert(ret)

 ceph version 10.1.0 (96ae8bd25f31862dbd5302f304ebf8bf1166aba6)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) 
[0x558cee37da55]
 2: (OSDService::get_map(unsigned int)+0x3d) [0x558cedd6a6fd]
 3: (OSD::init()+0xf22) [0x558cedd1d172]
 4: (main()+0x2aab) [0x558cedc83a2b]
 5: (__libc_start_main()+0xf5) [0x7f931b506b15]
 6: (()+0x349689) [0x558cedccd689]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to 
interpret this.


I could just zap and recreate it again, but I would be curious to know how to 
fix it, or to hear whether this is a bug that needs looking at.

Cheers,
 Adrian




Re: [ceph-users] Ceph.conf

2016-03-30 Thread Adrian Saul

mon initial members lists the monitors that Ceph clients and daemons contact 
first when joining the cluster.

Once they reach one of those initial mons they receive the full monitor map and 
can then connect to any of the monitors to pull updated maps.
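
In ceph.conf terms it is just the bootstrap list the daemons and clients read 
at startup; for example (hostnames and addresses are illustrative):

    [global]
    mon initial members = mon-a, mon-b, mon-c
    mon host = 10.0.0.1, 10.0.0.2, 10.0.0.3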


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
zai...@nocser.net
Sent: Thursday, 31 March 2016 3:21 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Ceph.conf

Hi,

What is meant by mon initial members in ceph.conf? Is it the monitor nodes that 
monitor all the OSD nodes, or the OSD nodes being monitored? Could you explain?

Regards,

Mohd Zainal Abidin Rabani
Technical Support



Re: [ceph-users] OSD crash after conversion to bluestore

2016-03-31 Thread Adrian Saul

Not sure about commands; however, if you look at the OSD mount point there is a 
“bluefs” file.
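
A couple of quick checks, assuming the default OSD data path (the OSD id, path 
and metadata key are illustrative and may differ between releases):

    ls /var/lib/ceph/osd/ceph-0              # a bluestore OSD shows "block" and "bluefs" entries
    ceph osd metadata 0 | grep objectstore   # should report bluestore vs filestore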


From: German Anders [mailto:gand...@despegar.com]
Sent: Thursday, 31 March 2016 11:48 PM
To: Adrian Saul
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] OSD crash after conversion to bluestore

Having jewel installed, is it possible to run a command in order to see that the 
OSD is actually using bluestore?
Thanks in advance,
Best,

German

2016-03-31 1:24 GMT-03:00 Adrian Saul 
mailto:adrian.s...@tpgtelecom.com.au>>:

I upgraded my lab cluster to 10.1.0 specifically to test out bluestore and see 
what latency difference it makes.

I was able to one by one zap and recreate my OSDs to bluestore and rebalance 
the cluster (the change to having new OSDs start with low weight threw me at 
first, but once  I worked that out it was fine).

I was all good until I completed the last OSD, and then one of the earlier ones 
fell over and refuses to restart.  Every attempt to start fails with this 
assertion failure:

-2> 2016-03-31 15:15:08.868588 7f931e5f0800  0  
cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
-1> 2016-03-31 15:15:08.868800 7f931e5f0800  1  
cls/timeindex/cls_timeindex.cc:259: Loaded timeindex class!
 0> 2016-03-31 15:15:08.870948 7f931e5f0800 -1 osd/OSD.h: In function 
'OSDMapRef OSDService::get_map(epoch_t)' thread 7f931e5f0800 time 2016-03-31 
15:15:08.869638
osd/OSD.h: 886: FAILED assert(ret)

 ceph version 10.1.0 (96ae8bd25f31862dbd5302f304ebf8bf1166aba6)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) 
[0x558cee37da55]
 2: (OSDService::get_map(unsigned int)+0x3d) [0x558cedd6a6fd]
 3: (OSD::init()+0xf22) [0x558cedd1d172]
 4: (main()+0x2aab) [0x558cedc83a2b]
 5: (__libc_start_main()+0xf5) [0x7f931b506b15]
 6: (()+0x349689) [0x558cedccd689]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to 
interpret this.


I could just zap and recreate it again, but I would be curious to know how to 
fix it, or unless someone can suggest if this is a bug that needs looking at.

Cheers,
 Adrian




Re: [ceph-users] OSD crash after conversion to bluestore

2016-03-31 Thread Adrian Saul

No - if you use ceph-disk prepare it creates a small filesystem with some 
control files; the bluestore partition itself is not visible as a mounted 
filesystem.


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Oliver Dzombic
> Sent: Friday, 1 April 2016 12:08 AM
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] OSD crash after conversion to bluestore
>
> Hi,
>
> if I understand it correctly, bluestore won't use / is not a filesystem to be
> mounted.
>
> So if an OSD is up and in, while we don't see it mounted into the filesystem
> and accessible, we could assume that it must be powered by bluestore...
> !??!
>
> --
> Mit freundlichen Gruessen / Best regards
>
> Oliver Dzombic
> IP-Interactive
>
> mailto:i...@ip-interactive.de
>
> Anschrift:
>
> IP Interactive UG ( haftungsbeschraenkt ) Zum Sonnenberg 1-3
> 63571 Gelnhausen
>
> HRB 93402 beim Amtsgericht Hanau
> Geschäftsführung: Oliver Dzombic
>
> Steuer Nr.: 35 236 3622 1
> UST ID: DE274086107
>
>
> Am 31.03.2016 um 14:47 schrieb German Anders:
> > having jewel install, is possible to run a command in order to see
> > that the OSD is actually using bluestore?
> >
> > Thanks in advance,
> >
> > Best,
> >
> >
> > **
> >
> > *German*
> >
> > 2016-03-31 1:24 GMT-03:00 Adrian Saul  > <mailto:adrian.s...@tpgtelecom.com.au>>:
> >
> >
> > I upgraded my lab cluster to 10.1.0 specifically to test out
> > bluestore and see what latency difference it makes.
> >
> > I was able to one by one zap and recreate my OSDs to bluestore and
> > rebalance the cluster (the change to having new OSDs start with low
> > weight threw me at first, but once  I worked that out it was fine).
> >
> > I was all good until I completed the last OSD, and then one of the
> > earlier ones fell over and refuses to restart.  Every attempt to
> > start fails with this assertion failure:
> >
> > -2> 2016-03-31 15:15:08.868588 7f931e5f0800  0 
> > cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
> > -1> 2016-03-31 15:15:08.868800 7f931e5f0800  1 
> > cls/timeindex/cls_timeindex.cc:259: Loaded timeindex class!
> >  0> 2016-03-31 15:15:08.870948 7f931e5f0800 -1 osd/OSD.h: In
> > function 'OSDMapRef OSDService::get_map(epoch_t)' thread
> > 7f931e5f0800 time 2016-03-31 15:15:08.869638
> > osd/OSD.h: 886: FAILED assert(ret)
> >
> >  ceph version 10.1.0 (96ae8bd25f31862dbd5302f304ebf8bf1166aba6)
> >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > const*)+0x85) [0x558cee37da55]
> >  2: (OSDService::get_map(unsigned int)+0x3d) [0x558cedd6a6fd]
> >  3: (OSD::init()+0xf22) [0x558cedd1d172]
> >  4: (main()+0x2aab) [0x558cedc83a2b]
> >  5: (__libc_start_main()+0xf5) [0x7f931b506b15]
> >  6: (()+0x349689) [0x558cedccd689]
> >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> > needed to interpret this.
> >
> >
> > I could just zap and recreate it again, but I would be curious to
> > know how to fix it, or unless someone can suggest if this is a bug
> > that needs looking at.
> >
> > Cheers,
> >  Adrian
> >
> >

[ceph-users] Mon placement over wide area

2016-04-11 Thread Adrian Saul

We are close to being given approval to deploy a 3.5PB Ceph cluster that will 
be distributed over every major capital in Australia.  The config will be dual 
sites in each city, coupled as HA pairs - 12 sites in total.  The vast majority 
of CRUSH rules will place data either locally within the individual site, or 
replicated to the other HA site in that city.  However, there are future use 
cases where I think we could use EC to distribute data more widely, or have 
some replication that puts small data sets across multiple cities.  All of this 
will be tied together with a dedicated private IP network.
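
To make the local-versus-metro placement concrete, the metro rules would look 
something like this (a sketch only - bucket names, numbers and the exact CRUSH 
hierarchy are illustrative):

    rule syd_metro_replicated {
            ruleset 10
            type replicated
            min_size 2
            max_size 3
            step take sydney
            step chooseleaf firstn 0 type datacenter
            step emit
    }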

The concern I have is around the placement of mons.  In the current design 
there would be two monitors in each site, running separately from the OSDs on 
some hosts acting as RBD to iSCSI/NFS gateways.  There will also be a 
"tiebreaker" mon placed on a separate host which will house some management 
infrastructure for the whole platform.

Obviously a concern is latency - the east coast to west coast latency is around 
50ms, and on the east coast it is 12ms between Sydney and the other two sites, 
and 24ms Melbourne to Brisbane.  Most of the data traffic will remain local, 
but if we create a single national cluster, how much of an impact will there be 
from all the mons needing to keep in sync, as well as monitoring and 
communicating with all OSDs (in the end-goal design there will be some 2300+ 
OSDs)?

The other options I am considering:
- split into east and west coast clusters; most of the cross-city need is on 
the east coast, and any data moves between clusters can be done with snap 
replication
- city-based clusters (tightest latency), but lose the multi-DC EC option and 
do cross-city replication using snapshots

Just want to get a feel for what I need to consider when we start building at 
this scale.

Cheers,
 Adrian








Re: [ceph-users] Mon placement over wide area

2016-04-11 Thread Adrian Saul
Hello again Christian :)


> > We are close to being given approval to deploy a 3.5PB Ceph cluster that
> > will be distributed over every major capital in Australia.The config
> > will be dual sites in each city that will be coupled as HA pairs - 12
> > sites in total.   The vast majority of CRUSH rules will place data
> > either locally to the individual site, or replicated to the other HA
> > site in that city.   However there are future use cases where I think we
> > could use EC to distribute data wider or have some replication that puts
> > small data sets across multiple cities.
> This will very, very, VERY much depend on the data (use case) in question.

The EC use case would be using RGW to act as an archival backup store.

> > The concern I have is around the placement of mons.  In the current
> > design there would be two monitors in each site, running separate to the
> > OSDs as part of some hosts acting as RBD to iSCSI/NFS gateways.   There
> > will also be a "tiebreaker" mon placed on a separate host which will
> > house some management infrastructure for the whole platform.
> >
> Yes, that's the preferable way, might want to up this to 5 mons so you can
> lose one while doing maintenance on another one.
> But if that would be a coupled, national cluster you're looking both at
> significant MON traffic, interesting "split-brain" scenarios and latencies as
> well (MONs get chosen randomly by clients AFAIK).

In the case I am setting up it would be 2 per site plus the extra, so 25 - but 
I fear that would make the mon syncing too heavy.  Once we build up to multiple 
sites, though, we can maybe reduce to one per site to cut the workload of 
keeping the mons in sync.

> > Obviously a concern is latency - the east coast to west coast latency
> > is around 50ms, and on the east coast it is 12ms between Sydney and
> > the other two sites, and 24ms Melbourne to Brisbane.
> In any situation other than "write speed doesn't matter at all" combined with
> "large writes, not small ones" and "read-mostly" you're going to be in severe
> pain.

For data, yes, but the main case for that would be backup data, where it would 
be large writes, read rarely, and as long as streaming performance keeps up, 
latency won't matter.  My concern with the latency would be how it impacts the 
monitors having to keep in sync, and how that would impact client operations, 
especially with the rate of change that would occur with the predominant RBD 
use in most sites.

> > Most of the data
> > traffic will remain local but if we create a single national cluster
> > then how much of an impact will it be having all the mons needing to
> > keep in sync, as well as monitor and communicate with all OSDs (in the
> > end goal design there will be some 2300+ OSDs).
> >
> Significant.
> I wouldn't suggest it, but even if you deploy differently I'd suggest a test
> run/setup and sharing the experience with us. ^.^

Someone has to be the canary right :)

> > The other options I  am considering:
> > - split into east and west coast clusters, most of the cross city need
> > is in the east coast, any data moves between clusters can be done with
> > snap replication
> > - city based clusters (tightest latency) but loose the multi-DC EC
> > option, do cross city replication using snapshots
> >
> The latter; I seem to remember that there was work in progress to do this
> (snapshot replication) in an automated fashion.
>
> > Just want to get a feel for what I need to consider when we start
> > building at this scale.
> >
> I know you're set on iSCSI/NFS (have you worked out the iSCSI kinks?), but
> the only well known/supported way to do geo-replication with Ceph is via
> RGW.

iSCSI is working fairly well.  We have decided not to use Ceph for the 
latency-sensitive workloads, so while we are still working to keep latency low, 
we won't be putting the heavier IOPS or latency-sensitive workloads onto it 
until we get a better feel for how it behaves at scale and can be sure of the 
performance.

As above - for the most part we are going to have local site pools (replicating 
at application level), a few metro-replicated pools and a couple of very small 
multi-metro replicated pools, with the geo-redundant EC stuff a future plan.  
It would just be a shame to lock the design into a setup that won't let us do 
some of these wider options down the track.

Thanks.

Adrian


Re: [ceph-users] Mon placement over wide area

2016-04-12 Thread Adrian Saul

At this stage the RGW component is further down the line - pretty much just a 
concept while we build out the RBD side first.

What I wanted to get out of EC was distributing the data across multiple DCs 
such that we were not simply replicating data - which would give us much better 
storage efficiency and redundancy.Some of what I had read in the past was 
around using EC to spread data over multiple DCs to be able to sustain loss of 
multiple sites.  Most of this was implied fairly clearly in the documentation 
under "CHEAP MULTIDATACENTER STORAGE":

http://docs.ceph.com/docs/hammer/dev/erasure-coded-pool/

Although I note that section appears to have disappeared in later documentation 
versions.
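
From memory that section boiled down to an EC profile with the failure domain 
set at the datacenter level, roughly (profile and pool names, k/m values and PG 
counts are illustrative; the parameter name changed between releases):

    ceph osd erasure-code-profile set multi-dc k=4 m=2 ruleset-failure-domain=datacenter
    ceph osd pool create archive 1024 1024 erasure multi-dc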

It seems a little disheartening that much of this promise and capability for 
Ceph appears to be just not there in practice.






> -Original Message-
> From: Maxime Guyot [mailto:maxime.gu...@elits.com]
> Sent: Tuesday, 12 April 2016 5:49 PM
> To: Adrian Saul; Christian Balzer; 'ceph-users@lists.ceph.com'
> Subject: Re: [ceph-users] Mon placement over wide area
>
> Hi Adrian,
>
> Looking at the documentation RadosGW has multi region support with the
> “federated gateways”
> (http://docs.ceph.com/docs/master/radosgw/federated-config/):
> "When you deploy a Ceph Object Store service that spans geographical
> locales, configuring Ceph Object Gateway regions and metadata
> synchronization agents enables the service to maintain a global namespace,
> even though Ceph Object Gateway instances run in different geographic
> locales and potentially on different Ceph Storage Clusters.”
>
> Maybe that could do the trick for your multi metro EC pools?
>
> Disclaimer: I haven't tested the federated gateways RadosGW.
>
> Best Regards
>
> Maxime Guyot
> System Engineer
>
>
>
>
>
>
>
>
>
> On 12/04/16 03:28, "ceph-users on behalf of Adrian Saul"  boun...@lists.ceph.com on behalf of adrian.s...@tpgtelecom.com.au>
> wrote:
>
> >Hello again Christian :)
> >
> >
> >> > We are close to being given approval to deploy a 3.5PB Ceph cluster that
> >> > will be distributed over every major capital in Australia.The config
> >> > will be dual sites in each city that will be coupled as HA pairs - 12
> >> > sites in total.   The vast majority of CRUSH rules will place data
> >> > either locally to the individual site, or replicated to the other HA
> >> > site in that city.   However there are future use cases where I think we
> >> > could use EC to distribute data wider or have some replication that
> >> > puts small data sets across multiple cities.
> >> This will very, very, VERY much depend on the data (use case) in question.
> >
> >The EC use case would be using RGW and to act as an archival backup
> >store
> >
> >> > The concern I have is around the placement of mons.  In the current
> >> > design there would be two monitors in each site, running separate to
> the
> >> > OSDs as part of some hosts acting as RBD to iSCSI/NFS gateways.   There
> >> > will also be a "tiebreaker" mon placed on a separate host which
> >> > will house some management infrastructure for the whole platform.
> >> >
> >> Yes, that's the preferable way, might want to up this to 5 mons so
> >> you can loose one while doing maintenance on another one.
> >> But if that would be a coupled, national cluster you're looking both
> >> at significant MON traffic, interesting "split-brain" scenarios and
> >> latencies as well (MONs get chosen randomly by clients AFAIK).
> >
> >In the case I am setting up it would be 2 per site plus the extra so 25 - 
> >but I
> am fearing that would make the mon syncing become to heavy.  Once we
> build up to multiple sites though we can maybe reduce to one per site to
> reduce the workload on keeping the mons in sync.
> >
> >> > Obviously a concern is latency - the east coast to west coast
> >> > latency is around 50ms, and on the east coast it is 12ms between
> >> > Sydney and the other two sites, and 24ms Melbourne to Brisbane.
> >> In any situation other than "write speed doesn't matter at all"
> >> combined with "large writes, not small ones" and "read-mostly" you're
> >> going to be in severe pain.
> >
> >For data yes, but the main case for that would be backup data where it
> would be large writes, read rarely and as long as streaming performance
> keeps up latency wont matter.   My concern with the latency would be how

Re: [ceph-users] fibre channel as ceph storage interconnect

2016-04-21 Thread Adrian Saul

I could only see it being done using FCIP as the OSD processes use IP to 
communicate.

I guess it would depend on why you are looking to use something like FC instead 
of Ethernet or IB.


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Schlacta, Christ
> Sent: Friday, 22 April 2016 1:12 PM
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] fibre channel as ceph storage interconnect
>
> Is it possible?  Can I use fibre channel to interconnect my ceph OSDs?
>  Intuition tells me it should be possible, yet experience (Mostly with fibre
> channel) tells me no.  I don't know enough about how ceph works to know
> for sure.  All my googling returns results about using ceph as a BACKEND for
> exporting fibre channel LUNs, which is, sadly, not what I'm looking for at the
> moment.


Re: [ceph-users] fibre channel as ceph storage interconnect

2016-04-21 Thread Adrian Saul
> from the responses I've gotten, it looks like there's no viable option to use
> fibre channel as an interconnect between the nodes of the cluster.
> Would it be a worthwhile development effort to establish a block protocol
> between the nodes so that something like fibre channel could be used to
> communicate internally?  Unless I'm waaay wrong (And I'm seldom *that*
> wrong), it would not be worth the effort.  I won't even feature request it.
> Looks like I'll have to look into infiniband or CE, and possibly migrate away
> from Fibre Channel, even though it kinda just works, and therefore I really
> like it :(

I would think even conceptually it would be a mess - FC as a peer-to-peer 
network fabric might be useful (in many ways I like it a lot better than 
Ethernet), but you would have to develop an entire transport protocol over it 
(the normal SCSI model would be useless) for Ceph, and then write that in to 
replace the network code in the existing Ceph code base.

A lot of work for something that is probably easier done swapping your FC HBAs 
for 10G NICs or IB HBAs.

>
> On Thu, Apr 21, 2016 at 11:06 PM, Schlacta, Christ 
> wrote:
> > My primary motivations are:
> > Most of my systems that I want to use with ceph already have fibre
> > Chantel cards and infrastructure, and more infrastructure is
> > incredibly cheap compared to infiniband or {1,4}0gbe cards and
> > infrastructure Most of my systems are expansion slot constrained, and
> > I'd be forced to pick one or the other anyway.
> >
> > On Thu, Apr 21, 2016 at 9:28 PM, Paul Evans  wrote:
> >> In today’s world, OSDs communicate via IP and only IP*. Some
> >> FiberChannel switches and HBAs  support IP-over-FC, but it’s about
> >> 0.02% of the FC deployments.
> >> Therefore, one could technically use FC, but it does’t appear to
> >> offer enough benefit to OSD operations to justify the unique architecture.
> >>
> >> What is your motivation to leverage FC behind OSDs?
> >>
> >> -Paul
> >>
> >> *Ceph on native Infiniband may be available some day, but it seems
> >> impractical with the current releases. IP-over-IB is also known to work.
> >>
> >>
> >> On Apr 21, 2016, at 8:12 PM, Schlacta, Christ 
> wrote:
> >>
> >> Is it possible?  Can I use fibre channel to interconnect my ceph OSDs?
> >> Intuition tells me it should be possible, yet experience (Mostly with
> >> fibre channel) tells me no.  I don't know enough about how ceph works
> >> to know for sure.  All my googling returns results about using ceph
> >> as a BACKEND for exporting fibre channel LUNs, which is, sadly, not
> >> what I'm looking for at the moment.
> >>
> >>