Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-21 Thread Nick Fisk
> -Original Message-
> From: Alex Gorbachev [mailto:a...@iss-integration.com]
> Sent: 21 August 2016 04:15
> To: Nick Fisk 
> Cc: w...@globe.de; Horace Ng ; ceph-users 
> 
> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> 
> Hi Nick,
> 
> On Thu, Jul 21, 2016 at 8:33 AM, Nick Fisk  wrote:
> >> -Original Message-
> >> From: w...@globe.de [mailto:w...@globe.de]
> >> Sent: 21 July 2016 13:23
> >> To: n...@fisk.me.uk; 'Horace Ng' 
> >> Cc: ceph-users@lists.ceph.com
> >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >>
> >> Okay, and what is your plan now to speed things up?
> >
> > Now I have come up with a lower latency hardware design, there is not much
> > further improvement until persistent RBD caching is implemented, as you
> > will be moving the SSD/NVMe closer to the client. But I'm happy with what
> > I can achieve at the moment. You could also experiment with bcache on the
> > RBD.
> 
> Reviving this thread, would you be willing to share the details of the low 
> latency hardware design?  Are you optimizing for NFS or
> iSCSI?

Both really; I'm just trying to get the write latency as low as possible. As 
you know, VMware does everything with lots of unbuffered small IOs, e.g. when 
you migrate a VM or as thin VMDKs grow.

Even with storage vMotions, which might kick off 32 threads, there still 
appears to be a bottleneck: as the IOs all roughly fall on the same PG, there 
is contention on the PG itself.

These were the sorts of things I was trying to optimise for, to make the time 
spent in Ceph as minimal as possible for each IO.
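
For anyone wanting to reproduce that kind of per-IO measurement, here is a
minimal sketch using fio's rbd engine; the pool, image and client names are
placeholders, not something from this thread:

# rbd create rbd/fiotest --size 10G   (throwaway test image)
fio --name=qd1-write --ioengine=rbd --clientname=admin \
    --pool=rbd --rbdname=fiotest \
    --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 \
    --direct=1 --time_based --runtime=60
# at QD=1 the completion latency ("clat") is the whole story:
# IOPS ~= 1 second / average write latency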

So onto the hardware. Through reading various threads and experimenting on my 
own, I came to the following conclusions:

-You need the highest possible frequency on the CPU cores, which normally 
also means fewer of them.
-Dual sockets are probably bad and will impact performance.
-Use NVMe devices for journals to minimise latency.

The end result was OSD nodes based on a 3.5GHz Xeon E3 v5 with an Intel P3700 
for a journal. I used the SuperMicro X11SSH-CTF board, which has 10G-T onboard 
as well as 8 SATA and 8 SAS ports, so no expansion cards are required. As well 
as being very performant for Ceph, this design also works out very cheap, as 
you are using low-end server parts. The whole lot plus 12x 7.2k disks all goes 
into a 1U case.

During testing I noticed that by default C-states and P-states slaughter 
performance. After forcing the maximum C-state to 1 and forcing the CPU 
frequency up to its maximum, I was seeing 600us latency for a 4KB write to a 
3x-replica pool, or around 1600 IOPS at QD=1.
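
If anyone wants to try the same, a minimal sketch of that tuning, assuming an
Intel box with the intel_idle driver and GRUB (file locations vary by distro):

# persistent: cap C-states on the kernel command line in /etc/default/grub,
# then run update-grub (or grub2-mkconfig) and reboot
GRUB_CMDLINE_LINUX="... intel_idle.max_cstate=1 processor.max_cstate=1"

# runtime alternative: disable the deeper idle states on every core
for s in /sys/devices/system/cpu/cpu*/cpuidle/state[2-9]/disable; do
    echo 1 > $s
done

# and raise the frequency floor to the maximum
for c in /sys/devices/system/cpu/cpu*/cpufreq; do
    cat $c/cpuinfo_max_freq > $c/scaling_min_freq
done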

Few other observations:
1. Power usage is around 150-200W for this config with 12x 7.2k disks.
2. CPU usage when maxing out the disks is only around 10-15%, so there is 
plenty of headroom for more disks.
3. NOTE FOR ABOVE: don't include iowait when looking at CPU usage.
4. No idea about CPU load for pure SSD nodes, but based on the current disks 
you could maybe expect ~1iops per node before maxing out the CPUs.
5. A single NVMe seems to be able to journal 12 disks with no problem during 
normal operation; no doubt a specific benchmark could max it out though.
6. There are slightly faster Xeon E3s, but price/performance hits diminishing 
returns.

Hope that answers all your questions.
Nick

> 
> Thank you,
> Alex
> 
> >
> >>
> >> Would it help to put in multiple P3700 per OSD Node to improve performance 
> >> for a single Thread (example Storage VMotion) ?
> >
> > Most likely not, it's all the other parts of the puzzle which are causing
> > the latency. ESXi was designed for storage arrays that service IOs in the
> > 100us-1ms range; Ceph is probably about 10x slower than this, hence the
> > problem. Disable the BBWC on a RAID controller or SAN and you will see
> > the same behaviour.
> >
> >>
> >> Regards
> >>
> >>
> >> Am 21.07.16 um 14:17 schrieb Nick Fisk:
> >> >> -Original Message-
> >> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> >> >> Behalf Of w...@globe.de
> >> >> Sent: 21 July 2016 13:04
> >> >> To: n...@fisk.me.uk; 'Horace Ng' 
> >> >> Cc: ceph-users@lists.ceph.com
> >> >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread
> >> >> Performance
> >> >>
> >> >> Hi,
> >> >>
> >> >> Hmm, I think 200 MByte/s is really bad. Is your cluster in production
> >> >> right now?
> >> > It's just been built, not running yet.
> >> >
> >> >> So if you start a storage migration you get only 200 MByte/s right?
> >> > I wish. My current cluster (not this new one) would storage migrate
> >> > at ~10-15MB/s. Serial latency is the problem: without being able to
> >> > buffer, ESXi waits on an ack for each IO before sending the next.
> >> > Also it submits the migrations in 64kb chunks, unless you get VAAI
> >> > working. I think ESXi will try and do them in parallel, which will
> >> > help as well.
> >> >
> >> >> I think it would be awesome if you get 1000 MByte/s
> >> >>
> >> >> Where is the Bottleneck?
> >> > Latency serialisation, without a buffer, you can't drive the
> >> > 

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-21 Thread Brian ::
Hi Nick

Interested in this comment - "-Dual sockets are probably bad and will
impact performance."

Have you got real world experience of this being the case?

Thanks - B


Re: [ceph-users] Ceph repository IP block

2016-08-21 Thread Brian ::
If you point at the eu.ceph.com

ceph.apt-get.eu has address 185.27.175.43

ceph.apt-get.eu has IPv6 address 2a00:f10:121:400:48c:baff:fe00:477

On Sat, Aug 20, 2016 at 11:59 AM, Vlad Blando  wrote:

> Hi Guys,
>
> I will be installing Ceph behind a very restrictive firewall and one of
> the requirements is for me to submit the IP block of the repository used in
> the installation. I searched the internet but couldn't find one (or I
> haven't searched enough). Hoping to get answers from here.
>
> Thanks.
>
> /Vlad


Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-21 Thread Christian Balzer

Hello,

On Sun, 21 Aug 2016 09:16:33 +0100 Brian :: wrote:

> Hi Nick
> 
> Interested in this comment - "-Dual sockets are probably bad and will
> impact performance."
> 
> Have you got real world experience of this being the case?
> 
Well, Nick wrote "probably".

Dual sockets and thus NUMA, the need for CPUs to talk to each other and
share information certainly can impact things that are very time critical.
How much though is a question of design, both HW and SW.

We're looking here at a case where he's trying to reduce latency by all
means and where the actual CPU needs for the HDDs are negligible.
The idea being that a "Ceph IOPS" stays on one core which is hopefully
also not being shared at that time.

If you're looking at full SSD nodes OTOH a single CPU may very well not be
able to saturate a sensible amount of SSDs per node, so a slight penalty
but better utilization and overall IOPS with 2 CPUs may be the way forward.
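
A quick way to see what a given box looks like in this regard (the interface
name eth0 below is just a placeholder):

lscpu | grep -i numa                        # sockets/nodes and core lists
numactl --hardware                          # per-node memory and distances
cat /sys/class/net/eth0/device/numa_node    # which node the NIC hangs off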

Christian


Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-21 Thread Nick Fisk


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Christian Balzer
> Sent: 21 August 2016 09:32
> To: ceph-users 
> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> 
> 
> Hello,
> 
> On Sun, 21 Aug 2016 09:16:33 +0100 Brian :: wrote:
> 
> > Hi Nick
> >
> > Interested in this comment - "-Dual sockets are probably bad and will
> > impact performance."
> >
> > Have you got real world experience of this being the case?
> >
> Well, Nick wrote "probably".
> 
> Dual sockets and thus NUMA, the need for CPUs to talk to each other and
> share information certainly can impact things that are very time critical.
> How much though is a question of design, both HW and SW.

There was a guy from Red Hat (sorry, his name escapes me now) a few months 
ago on the performance weekly meeting. He was analysing the CPU cache miss 
effects with Ceph and it looked like a NUMA setup was having quite a severe 
impact on some things. To be honest a lot of it went over my head, but I came 
away from it with a general feeling that if you can get the required 
performance from one socket, then that is probably a better bet. This 
includes only populating a single socket in a dual-socket system. There was 
also a Ceph tech talk at the start of the year (High perf databases on Ceph) 
where the guy presenting was also recommending only populating one socket 
for latency reasons.

Both of those, coupled with the fact that Xeon E3's are the cheapest way to get 
high clock speeds, sort of made my decision.

> 
> We're looking here at a case where he's trying to reduce latency by all means 
> and where the actual CPU needs for the HDDs are
> negligible.
> The idea being that a "Ceph IOPS" stays on one core which is hopefully also 
> not being shared at that time.
> 
> If you're looking at full SSD nodes OTOH a single CPU may very well not be
> able to saturate a sensible amount of SSDs per node, so a slight penalty
> but better utilization and overall IOPS with 2 CPUs may be the way forward.

Definitely, as always work out what your requirements are and design around 
them.  

> 
> Christian
> 

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-21 Thread Nick Fisk


> -Original Message-
> From: Wilhelm Redbrake [mailto:w...@globe.de]
> Sent: 21 August 2016 09:34
> To: n...@fisk.me.uk
> Cc: Alex Gorbachev ; Horace Ng ; 
> ceph-users 
> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> 
> Hi Nick,
> i understand all of your technical improvements.
> But: why do you not use, for example, a simple Areca RAID controller with
> 8 GB cache and BBU on top in every Ceph node, configure n times RAID 0 on
> the controller and enable write-back cache?
> That must be a latency "killer" like in all the proprietary storage arrays,
> or not?

Possibly, but the latency of the NVMe is very low, to the point that the 
"latency" in Ceph dwarfs it. So I'm not sure how much more improvement can be 
had from lowering journal latency further. But you are certainly correct that 
it would help.

The other thing: if you don't use an SSD for a journal but rely on the RAID 
WBC, do you still see half the MB/s on the hard disks due to the co-located 
journal? Maybe someone can confirm?
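
The back-of-the-envelope case for "half the MB/s": with filestore every
client write is written twice, once to the journal and once to the data
partition, so with both on the same spindle a disk that can stream ~150MB/s
can only deliver roughly 150 / 2 = 75MB/s of client throughput, and less
once the seeking between the two regions is counted.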

Oh, and I just looked at the price of that thing. The 16-port version is 
nearly double what I paid for the 400GB NVMe, and that's without adding the 
8GB RAM and BBU. Maybe it's more suited to a full-SSD cluster than to 
spinning disks?

> 
> Best Regards !!

Re: [ceph-users] Ceph repository IP block

2016-08-21 Thread Wido den Hollander

> Op 21 augustus 2016 om 10:26 schreef "Brian ::" :
> 
> 
> If you point at the eu.ceph.com
> 
> ceph.apt-get.eu has address 185.27.175.43
> 
> ceph.apt-get.eu has IPv6 address 2a00:f10:121:400:48c:baff:fe00:477
> 

Yes, however, keep in mind that IPs might change without notice.

The best way is to sync the data locally, through a proxy or something similar.
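
As a rough sketch of that approach (assuming a plain HTTP mirror; adjust the
release path to whatever you run):

wget --mirror --no-parent --no-host-directories https://eu.ceph.com/debian-jewel/
# serve the local copy with any web server and point APT at it, e.g.
# deb http://your-internal-mirror/debian-jewel xenial main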

Wido



[ceph-users] Ceph pool snapshots

2016-08-21 Thread Vimal Kumar
Hi,

[ceph@ceph1 my-cluster]$ ceph -v
ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
[ceph@ceph1 my-cluster]$ rados -p mypool ls
hello.txt
[ceph@ceph1 my-cluster]$ rados -p mypool mksnap snap01
created pool mypool snap snap01
[ceph@ceph1 my-cluster]$ rados -p mypool lssnap
5 snap01 2016.08.21 03:59:28
1 snaps
[ceph@ceph1 my-cluster]$ rados -p mypool listsnaps hello.txt
hello.txt:
cloneid snaps size overlap
head - 13

Is this normal? Why is snap01 not listed in the above output? It also
contains 'hello.txt', so why does snap01 not feature in the above list? I
assume 'head' refers to the current object in the pool itself?


[ceph@ceph1 my-cluster]$ rados -p mypool rm hello.txt
[ceph@ceph1 my-cluster]$ rados -p mypool ls
[ceph@ceph1 my-cluster]$ rados -p mypool listsnaps hello.txt
hello.txt:
cloneid snaps size overlap
5 5 13 []

So the snapshot ID only shows up after the object is removed from the pool.
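
(This looks like copy-on-write behaviour: listsnaps only grows a clone row
once the head diverges from the snapshot. A minimal sketch, overwriting
instead of deleting, should show the same thing:

[ceph@ceph1 my-cluster]$ echo "new contents" > /tmp/hello2.txt
[ceph@ceph1 my-cluster]$ rados -p mypool put hello.txt /tmp/hello2.txt
[ceph@ceph1 my-cluster]$ rados -p mypool listsnaps hello.txt

with the expected result being a clone row pinned to snap 5 holding the
pre-overwrite data, plus a head row for the new contents.)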


[ceph@ceph1 my-cluster]$ rados -p mypool rollback hello.txt snap01
rolled back pool mypool to snapshot snap01
[ceph@ceph1 my-cluster]$ rados -p mypool ls
hello.txt
[ceph@ceph1 my-cluster]$ rados -p mypool listsnaps hello.txt
hello.txt:
cloneid snaps size overlap
5 5 13 []
head - 13

After rolling the deleted object back from the snapshot to the pool, both the
snapshot ID and 'head' are listed! Isn't this the same as the first case?

Moreover, on checking the man page of rados, I realise that the 'listsnaps'
and 'rollback' options are missing. Is there a better / recommended way to
deal with pool snapshots?

Thank you!


Re: [ceph-users] Recover Data from Deleted RBD Volume

2016-08-21 Thread Georgios Dimitrakakis


To close this out, I would like to thank all the people who contributed 
their knowledge to my problem, although the final decision was not to attempt 
any sort of recovery, since the effort required would have been tremendous, 
with ambiguous results (to say the least).


Jason, Ilya, Brad, David, George, Burkhard: thank you very much for your 
contributions.


Kind regards,

G.


On Wed, Aug 10, 2016 at 10:55 AM, Ilya Dryomov  
wrote:

I think Jason meant to write "rbd_id." here.



Whoops -- thanks for the typo correction.




Re: [ceph-users] Is anyone seeing iissues with task_numa_find_cpu?

2016-08-21 Thread Василий Ангапов
Yeah, switched to 4.7 recently and no issues so far.

2016-08-21 6:09 GMT+03:00 Alex Gorbachev :
> On Tue, Jul 19, 2016 at 12:04 PM, Alex Gorbachev  
> wrote:
>> On Mon, Jul 18, 2016 at 4:41 AM, Василий Ангапов  wrote:
>>> Guys,
>>>
>>> This bug is hitting me constantly, maybe once every several days. Does
>>> anyone know if there is a solution already?
>>
>>
>> I see there is a fix available, and am waiting for a backport to a
>> longterm kernel:
>>
>> https://lkml.org/lkml/2016/7/12/919
>>
>> https://lkml.org/lkml/2016/7/12/297
>>
>> --
>> Alex Gorbachev
>> Storcium
>
>
> No more issues on the latest kernel builds.
>
> Alex
>
>>
>>
>>
>>
>>>
>>> 2016-07-05 11:47 GMT+03:00 Nick Fisk :
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Alex Gorbachev
> Sent: 04 July 2016 20:50
> To: Campbell Steven 
> Cc: ceph-users ; Tim Bishop  li...@bishnet.net>
> Subject: Re: [ceph-users] Is anyone seeing iissues with
> task_numa_find_cpu?
>
> On Wed, Jun 29, 2016 at 5:41 AM, Campbell Steven 
> wrote:
> > Hi Alex/Stefan,
> >
> > I'm in the middle of testing 4.7rc5 on our test cluster to confirm
> > once and for all this particular issue has been completely resolved by
> > Peter's recent patch to sched/fair.c referred to by Stefan above. For
> > us anyway the patches that Stefan applied did not solve the issue and
> > neither did any 4.5.x or 4.6.x released kernel thus far, hopefully it
> > does the trick for you. We could get about 4 hours uptime before
> > things went haywire for us.
> >
> > It's interesting how the Ceph workload triggers this bug so well, as
> > it's quite a long-standing issue that has only just been resolved.
> > Another user chimed in on the LKML thread a couple of days ago, and
> > again his trace had ceph-osd in it.
> >
> > https://lkml.org/lkml/headers/2016/6/21/491
> >
> > Campbell
>
> Campbell, any luck with testing 4.7rc5?  rc6 came out just now, and I am
> having trouble booting it on an ubuntu box due to some other unrelated
> problem.  So dropping to kernel 4.2.0 for now, which does not seem to have
> this load related problem.
>
> I looked at the fair.c code in kernel source tree 4.4.14 and it is quite
> different from Peter's patch (assuming 4.5.x source), so the patch does
> not apply cleanly.  Maybe another 4.4.x kernel will get the update.

 I put in a new 16.04 node yesterday and went straight to 4.7.rc6. It's been
 backfilling for just under 24 hours now with no drama. Disks are set to use
 CFQ.

>
> Thanks,
> Alex
>
>
>
> >
> > On 29 June 2016 at 18:29, Stefan Priebe - Profihost AG
> >  wrote:
> >>
> >> Am 29.06.2016 um 04:30 schrieb Alex Gorbachev:
> >>> Hi Stefan,
> >>>
> >>> On Tue, Jun 28, 2016 at 1:46 PM, Stefan Priebe - Profihost AG
> >>>  wrote:
>  Please be aware that you may need even more patches. Overall this
>  needs 3 patches. Where the first two try to fix a bug and the 3rd
>  one fixes the fixes + even more bugs related to the scheduler. I've
>  no idea on which patch level Ubuntu is.
> >>>
> >>> Stefan, would you be able to please point to the other two patches
> >>> beside https://lkml.org/lkml/diff/2016/6/22/102/1 ?
> >>
> >> Sorry sure yes:
> >>
> >> 1. 2b8c41daba32 ("sched/fair: Initiate a new task's util avg to a
> >> bounded value")
> >>
> >> 2.) 40ed9cba24bb7e01cc380a02d3f04065b8afae1d ("sched/fair: Fix
> >> post_init_entity_util_avg() serialization")
> >>
> >> 3.) the one listed at lkml.
> >>
> >> Stefan
> >>
> >>>
> >>> Thank you,
> >>> Alex
> >>>
> 
>  Stefan
> 
>  Excuse my typo sent from my mobile phone.
> 
>  Am 28.06.2016 um 17:59 schrieb Tim Bishop :
> 
>  Yes - I noticed this today on Ubuntu 16.04 with the default kernel.
>  No useful information to add other than it's not just you.
> 
>  Tim.
> 
>  On Tue, Jun 28, 2016 at 11:05:40AM -0400, Alex Gorbachev wrote:
> 
>  After upgrading to kernel 4.4.13 on Ubuntu, we are seeing a few of
> 
>  these issues where an OSD would fail with the stack below.  I
>  logged a
> 
>  bug at https://bugzilla.kernel.org/show_bug.cgi?id=121101 and there
>  is
> 
>  a similar description at https://lkml.org/lkml/2016/6/22/102, but
>  the
> 
>  odd part is we have turned off CFQ and blk-mq/scsi-mq and are using
> 
>  just the noop scheduler.
> 
> 
>  Does the ceph kernel code somehow use the fair scheduler code
> block?
> 
> 
>  Tha

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-21 Thread Alex Gorbachev
On Sunday, August 21, 2016, Wilhelm Redbrake  wrote:

> Hi Nick,
> i understand all of your technical improvements.
> But: why do you not use, for example, a simple Areca RAID controller with
> 8 GB cache and BBU on top in every Ceph node, configure n times RAID 0 on
> the controller and enable write-back cache?
> That must be a latency "killer" like in all the proprietary storage arrays,
> or not?
>
> Best Regards !!


What we saw specifically with Areca cards is that performance is excellent
in benchmarking and for bursty loads. However, once we started loading them
with more constant workloads (we replicate databases and files to our Ceph
cluster), this appears to have saturated the relatively small Areca NVDIMM
caches, and we went back to pure drive-based performance.

So we built 8 new nodes with no Arecas and M500 SSDs for journals (1 SSD per
3 HDDs), in the hope that it would help reduce the noisy-neighbour impact.
That worked, but now the overall latency is really high at times, though not
always. A Red Hat engineer suggested this is due to loading the 7200rpm
NL-SAS drives with too many IOPS, which sends their latency sky-high.
Overall we are functioning fine, but I sure would like storage vMotion and
other large operations to be faster.

I am thinking I will test a few different schedulers and readahead settings
to see if we can improve this by parallelizing reads. I will also test NFS,
but I need to determine whether to do krbd/knfsd or something more
interesting like CephFS/Ganesha.
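
For reference, the sort of knobs involved (the device name sdX is a
placeholder; these apply per block device on the host doing the reads):

cat /sys/block/sdX/queue/scheduler           # current one shown in brackets
echo deadline > /sys/block/sdX/queue/scheduler
echo 4096 > /sys/block/sdX/queue/read_ahead_kb   # readahead, in KiB
# equivalently: blockdev --setra 8192 /dev/sdX   (units of 512-byte sectors)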

Thanks for your very valuable info on analysis and hw build.

Alex



Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-21 Thread Nick Fisk
From: Alex Gorbachev [mailto:a...@iss-integration.com] 
Sent: 21 August 2016 15:27
To: Wilhelm Redbrake 
Cc: n...@fisk.me.uk; Horace Ng ; ceph-users 

Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 



On Sunday, August 21, 2016, Wilhelm Redbrake  wrote:

Hi Nick,
i understand all of your technical improvements.
But: why do you not use, for example, a simple Areca RAID controller with
8 GB cache and BBU on top in every Ceph node, configure n times RAID 0 on
the controller and enable write-back cache?
That must be a latency "killer" like in all the proprietary storage arrays,
or not?

Best Regards !!

 

What we saw specifically with Areca cards is that performance is excellent
in benchmarking and for bursty loads. However, once we started loading them
with more constant workloads (we replicate databases and files to our Ceph
cluster), this appears to have saturated the relatively small Areca NVDIMM
caches, and we went back to pure drive-based performance.

 

Yes, I think that is a valid point. Although the latency is low, you are
still having to write to the disks twice (journal + data), so once the
caches on the cards start filling up, you are going to hit problems.

 

 

So we built 8 new nodes with no Arecas and M500 SSDs for journals (1 SSD per
3 HDDs), in the hope that it would help reduce the noisy-neighbour impact.
That worked, but now the overall latency is really high at times, though not
always. A Red Hat engineer suggested this is due to loading the 7200rpm
NL-SAS drives with too many IOPS, which sends their latency sky-high.
Overall we are functioning fine, but I sure would like storage vMotion and
other large operations to be faster.

 

 

Yeah, this is the biggest pain point, I think. Normal VM ops are fine, but
if you ever have to move a multi-TB VM, it’s just too slow.

 

If you use iSCSI with VAAI and are migrating a thick-provisioned VMDK, then
performance is actually quite good, as the block sizes used for the copy
are a lot bigger.

 

However, my use case required thin-provisioned VMs + snapshots, and I found
that with iSCSI you have no control over the fragmentation of the VMDKs, so
read performance is then what suffers (certainly with 7.2k disks).

 

Also, with thin-provisioned VMDKs I think I was seeing PG contention from
the updating of the VMFS metadata, although I can’t be sure.

 

 

I am thinking I will test a few different schedulers and readahead settings
to see if we can improve this by parallelizing reads. I will also test NFS,
but I need to determine whether to do krbd/knfsd or something more
interesting like CephFS/Ganesha.

 

As you know, I’m on NFS now. I’ve found it a lot easier to get going, and a
lot less sensitive to config adjustments suddenly dropping everything
offline. The fact that you can specify the extent size on XFS helps
massively with using thin VMDKs/snapshots while avoiding fragmentation.
Storage vMotions are a bit faster than over iSCSI, but I think I am hitting
PG contention when ESXi tries to write 32 copy threads to the same object.
There is probably some tuning that could be done here (RBD striping?), but
this is the best it’s been for a long time and I’m reluctant to fiddle any
further.
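
For reference, a sketch of both of those knobs; the device, mount point and
striping numbers are placeholders, and the striping part is an untested
idea rather than something proven here:

mkfs.xfs /dev/rbd0
mount /dev/rbd0 /export/vmware
xfs_io -c "extsize 16m" /export/vmware   # new files under here inherit the hint

# striping spreads adjacent stripe units over several objects, so parallel
# copy threads land on different objects/PGs (needs format 2 images)
rbd create rbd/nfs-backing --size 10T --stripe-unit 1048576 --stripe-count 4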

 

But as mentioned above, thick VMDKs with VAAI might be a really good fit.

 

Thanks for your very valuable info on analysis and hw build.

Alex

 





Re: [ceph-users] Ceph repository IP block

2016-08-21 Thread Vlad Blando
This is going to be a challenge.

/Vlad

On Sun, Aug 21, 2016 at 5:34 PM, Wido den Hollander  wrote:

>
> > Op 21 augustus 2016 om 10:26 schreef "Brian ::" :
> >
> >
> > If you point at the eu.ceph.com
> >
> > ceph.apt-get.eu has address 185.27.175.43
> >
> > ceph.apt-get.eu has IPv6 address 2a00:f10:121:400:48c:baff:fe00:477
> >
>
> Yes, however, keep in mind that IPs might change without notice.
>
> The best way is to sync the data locally, through a proxy or something similar.
>
> Wido
>


Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-21 Thread Christian Balzer

Hello,

On Sun, 21 Aug 2016 09:57:40 +0100 Nick Fisk wrote:

> 
> 
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> > Christian Balzer
> > Sent: 21 August 2016 09:32
> > To: ceph-users 
> > Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> > 
> > 
> > Hello,
> > 
> > On Sun, 21 Aug 2016 09:16:33 +0100 Brian :: wrote:
> > 
> > > Hi Nick
> > >
> > > Interested in this comment - "-Dual sockets are probably bad and will
> > > impact performance."
> > >
> > > Have you got real world experience of this being the case?
> > >
> > Well, Nick wrote "probably".
> > 
> > Dual sockets and thus NUMA, the need for CPUs to talk to each other and
> > share information certainly can impact things that are very time critical.
> > How much though is a question of design, both HW and SW.
> 
> There was a guy from Red Hat (sorry, his name escapes me now) a few months
> ago on the performance weekly meeting. He was analysing the CPU cache miss
> effects with Ceph and it looked like a NUMA setup was having quite a severe
> impact on some things. To be honest a lot of it went over my head, but I
> came away from it with a general feeling that if you can get the required
> performance from one socket, then that is probably a better bet. This
> includes only populating a single socket in a dual-socket system. There was
> also a Ceph tech talk at the start of the year (High perf databases on
> Ceph) where the guy presenting was also recommending only populating one
> socket for latency reasons.
> 
I wonder how complete their testing was and how much manual tuning they
tried.
As in:

1. Was irqbalance running? 
Because it and the normal kernel strategies clash beautifully.
Irqbalance moves stuff around, the kernel tries to move things close to
where the IRQs are, cat and mouse.

2. Did they try manual IRQ pinning?
I do. Not that it's critical on my Ceph nodes, but on other machines it can
make a LOT of difference, like keeping the cores reserved for KVM vhost
processes near (or at least on the same NUMA node as) the network IRQs.

3. Did they try pinning the Ceph OSD processes?
While this may certainly help (and make things more predictable when the
load gets high), as I said above the kernel normally does a pretty good job
of NOT moving things around and keeping processes close to the resources
they need.
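
For concreteness, a sketch of 2. and 3.; the IRQ number, core list and PID
are all hypothetical:

systemctl stop irqbalance            # or it will undo the manual pinning
grep eth /proc/interrupts            # find the NIC queue IRQs
echo 4 > /proc/irq/42/smp_affinity   # pin IRQ 42 to CPU2 (bitmask 0x4)

# pin a running OSD (PID 12345) and all of its threads to cores 0-5,
# ideally the NUMA node its NIC/HBA hang off
taskset -apc 0-5 12345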

> Both of those, coupled with the fact that Xeon E3's are the cheapest way to 
> get high clock speeds, sort of made my decision.
> 
Totally agreed, my current HDD node design is based on the single CPU
Supermicro 5028R-E1CR12L barebone, with an E5-1650 v3 (3.50GHz) CPU.

> > 
> > We're looking here at a case where he's trying to reduce latency by all 
> > means and where the actual CPU needs for the HDDs are
> > negligible.
> > The idea being that a "Ceph IOPS" stays on one core which is hopefully also 
> > not being shared at that time.
> > 
> > If you're looking at full SSD nodes OTOH a single CPU may very well not
> > be able to saturate a sensible amount of SSDs per node, so a slight
> > penalty but better utilization and overall IOPS with 2 CPUs may be the
> > way forward.
> 
> Definitely, as always work out what your requirements are and design around 
> them.  
> 
On my cache tier nodes with 2x E5-2623 v3 (3.00GHz) and currently 4 800GB
DC S3610 SSDs I can already saturate all but 2 "cores", with the "right"
extreme test cases.
Normal load is of course just around 4 (out of 16) "cores".

And for people who like it fast(er) but don't have to deal with VMware or
the like: instead of forcing the C-state to 1, just setting the governor to
"performance" was enough in my case to halve latency (from about 2ms to
1ms).

This still does save some power at times and (as Nick speculated) indeed
allows some cores to use their turbo speeds.

So the 4-5 busy cores on my cache tier nodes tend to hover around 3.3GHz,
instead of the 3.0GHz baseline for their CPUs.
And the less loaded cores don't tend to go below 2.6GHz, as opposed to the
1.2GHz that the "powersave" governor would default to.
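
For completeness, the governor change itself (cpupower ships in different
packages per distro; the sysfs path works everywhere):

cpupower frequency-set -g performance
# or per core:
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > $g
done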

Christian

> > 
> > Christian
> > 
Re: [ceph-users] RGW multisite - second cluster woes

2016-08-21 Thread Ben Morrice
Hello,

Looks fine on the first cluster:

cluster1# radosgw-admin period get
{
"id": "6ea09956-60a7-48df-980c-2b5bbf71b565",
"epoch": 2,
"predecessor_uuid": "80026abd-49f4-436e-844f-f8743685dac5",
"sync_status": [
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
""
],
"period_map": {
"id": "6ea09956-60a7-48df-980c-2b5bbf71b565",
"zonegroups": [
{
"id": "rgw1-gva",
"name": "rgw1-gva",
"api_name": "",
"is_master": "true",
"endpoints": [],
"hostnames": [],
"hostnames_s3website": [],
"master_zone": "rgw1-gva-master",
"zones": [
{
"id": "rgw1-gva-master",
"name": "rgw1-gva-master",
"endpoints": [
"http:\/\/rgw1:80\/"
],
"log_meta": "true",
"log_data": "true",
"bucket_index_max_shards": 0,
"read_only": "false"
}
],
"placement_targets": [
{
"name": "default-placement",
"tags": []
}
],
"default_placement": "default-placement",
"realm_id": "b23771d0-6005-41da-8ee0-aec03db510d7"
}
],
"short_zone_ids": [
{
"key": "rgw1-gva-master",
"val": 1414621010
}
]
},
"master_zonegroup": "rgw1-gva",
"master_zone": "rgw1-gva-master",
"period_config": {
"bucket_quota": {
"enabled": false,
"max_size_kb": -1,
"max_objects": -1
},
"user_quota": {
"enabled": false,
"max_size_kb": -1,
"max_objects": -1
}
},
"realm_id": "b23771d0-6005-41da-8ee0-aec03db510d7",
"realm_name": "gold",
"realm_epoch": 2
}

And, from the second cluster I get this:

cluster2 # radosgw-admin realm pull --url=http://rgw1:80
--access-key=access --secret=secret
2016-08-22 08:48:42.682785 7fc5d3fe29c0  0 error read_lastest_epoch
.rgw.root:periods.381464e1-4326-4b6b-9191-35940c4f645f.latest_epoch
{
"id": "98a7b356-83fd-4d42-b895-b58d45fa4233",
"name": "",
"current_period": "381464e1-4326-4b6b-9191-35940c4f645f",
"epoch": 1
}


Kind regards,

Ben Morrice

__
Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670
EPFL ENT CBS BBP
Biotech Campus
Chemin des Mines 9
1202 Geneva
Switzerland

On 19/08/16 08:46, Shilpa Manjarabad Jagannath wrote:
>
> - Original Message -
>> From: "Ben Morrice" 
>> To: ceph-users@lists.ceph.com
>> Sent: Thursday, August 18, 2016 8:59:30 PM
>> Subject: [ceph-users] RGW multisite - second cluster woes
>>
>> Hello,
>>
>> I am trying to configure a second cluster into an existing Jewel RGW
>> installation.
>>
>> I do not get the expected output when I perform a 'radosgw-admin realm
>> pull'. My realm on the first cluster is called 'gold', however when
>> doing a realm pull it doesn't reflect the 'gold' name or id and I get an
>> error related to latest_epoch (?).
>>
>> The documentation seems straight forward, so i'm not quite sure what i'm
>> missing here?
>>
>> Please see below for the full output.
>>
>> # radosgw-admin realm pull --url=http://cluster1:80 --access-key=access
>> --secret=secret
>>
>> 2016-08-18 17:20:09.585261 7fb939d879c0  0 error read_lastest_epoch
>> .rgw.root:periods.8c64a4dd-ccd8-4975-b63b-324fbb24aab6.latest_epoch
>> {
>> "id": "98a7b356-83fd-4d42-b895-b58d45fa4233",
>> "name": "",
>> "current_period": "8c64a4dd-ccd8-4975-b63b-324fbb24aab6",
>> "epoch": 1
>> }
>>
> The realm name is empty here. Could you share the output of "radosgw-admin 
> period get" from the first cluster?
>
>
>> # radosgw-admin period pull --url=http://cluster1:80 --access-key=access
>> --secret=secret
>> 2016-08-18 17:21:33.277719 7f5dbc7849c0  0 error read_lastest_epoch
>> .rgw.root:periods..latest_epoch
>