That one big server sounds great, but it also sounds like a single point of
failure. It's also not cheap. I've been able to build this cluster for
about $1400 per node, including the 10Gb networking gear, which is less
than what I see the _empty case_ you describe going for new. Even used, the
lowest price I've seen (and that's without drive trays) is what I paid for
one of my nodes, CPU, RAM, and drive trays included. So, it's been a pretty
inexpensive venture considering what we get out of it. I have no per-node
fault tolerance, but if one of my nodes dies, I just restart the VMs that
were on it somewhere else and wait for ceph to heal. I also benefit from
higher aggregate network bandwidth because I have more ports on the wire.
And better per-U CPU and RAM density (for the money). *shrug* Different
strokes.
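
(For the curious, that recovery drill is roughly the sketch below. It
assumes ceph-backed, shared-storage instances, and the instance IDs are
placeholders for what you'd pull from "nova list --host <dead-node>".)

    # rough node-death runbook: evacuate the dead node's VMs, then
    # wait for ceph to finish recovering
    import subprocess, time

    instances = ["vm-id-1", "vm-id-2"]  # placeholder IDs

    for vm in instances:
        # ceph RBD volumes are shared storage, so each instance can be
        # rebuilt on another host without losing its disk
        subprocess.check_call(["nova", "evacuate", "--on-shared-storage", vm])

    # block until the degraded PGs have healed
    while b"HEALTH_OK" not in subprocess.check_output(["ceph", "health"]):
        time.sleep(30)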

As for difficulty of management, any screwing around I've done has had
nothing to do with the converged nature of the setup, aside from
discovering and changing the one setting I mentioned. So, for me at least,
it's been a pretty much unqualified net win. I can imagine all sorts of
scenarios where that wouldn't be, but I think it's probably debatable
whether or not those constitute a common case. The higher node count does
add some complexity, but that's easily overcome with some simple
automation. Again though, that has no bearing on the converged setup; it's
just a function of how much CPU and RAM we need for our use case.
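
For example (node names and the command are just illustrative), the sort
of "simple automation" I mean is nothing fancier than:

    # fan a command out to every node over ssh; plenty at 14 nodes
    import subprocess

    NODES = ["node%02d" % i for i in range(1, 15)]  # hypothetical hostnames

    def run_everywhere(cmd):
        for node in NODES:
            out = subprocess.check_output(["ssh", node, cmd])
            print("%s: %s" % (node, out.decode().strip()))

    run_everywhere("ceph --version")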

I guess what I'm trying to say is that I don't think the answer is as cut
and dried as you seem to think.

QH

On Thu, Mar 26, 2015 at 9:36 AM, Mark Nelson <mnel...@redhat.com> wrote:

> I suspect a config like this where you only have 3 OSDs per node would be
> more manageable than something denser.
>
> I.e., theoretically a single E5-2697v3 is enough to run 36 OSDs in a 4U
> Supermicro chassis for a semi-dense converged solution.  You could attempt
> to restrict the OSDs to one socket and then use a second E5-2697v3 for VMs.
> Maybe after you've got cgroups set up properly and if you've otherwise
> balanced things it would work out ok.  I question though how much you
> really benefit by doing this rather than running a 36 drive storage server
> with lower-bin CPUs and a 2nd 1U box for VMs (which you don't need as many
> of because you can dedicate both sockets to VMs).
>
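> To make the cgroup part concrete, here's a minimal sketch against the
> cgroup-v1 cpuset interface (the group name and CPU list are assumptions;
> check your socket's CPU numbering with lscpu first):
>
>     # pin every ceph-osd process to socket 0's cores and local NUMA node
>     import os, subprocess
>
>     cg = "/sys/fs/cgroup/cpuset/ceph-osd"
>     os.makedirs(cg)
>     open(cg + "/cpuset.cpus", "w").write("0-13")  # assumed socket-0 cores
>     open(cg + "/cpuset.mems", "w").write("0")     # keep memory local too
>     for pid in subprocess.check_output(["pgrep", "ceph-osd"]).split():
>         open(cg + "/tasks", "a").write(pid.decode() + "\n")
>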
> It probably depends quite a bit on how memory, network, and disk intensive
> the VMs are, but my take is that it's better to err on the side of
> simplicity rather than making things overly complicated.  Every second you
> are screwing around trying to make the setup work right eats into any
> savings you might gain by going with the converged setup.
>
> Mark
>
> On 03/26/2015 10:12 AM, Quentin Hartman wrote:
>
>> I run a converged openstack / ceph cluster with 14 1U nodes. Each has 1
>> SSD (OS / journals), 3 1TB spinners (1 OSD each), 16 HT cores, 10Gb NICs
>> for ceph network, and 72GB of RAM. I configure openstack to leave 3GB of
>> RAM unused on each node for OSD / OS overhead. All the VMs are backed by
>> ceph volumes and things generally work very well. I would prefer a
>> dedicated storage layer simply because it seems more "right", but I
>> can't say that any of the common concerns of using this kind of setup
>> have come up for me. Aside from shaving off that 3GB of RAM, my
>> deployment isn't any more complex than a split stack deployment would
>> be. After running like this for the better part of a year, I would have
>> a hard time honestly making a real business case for the extra hardware
>> a split stack cluster would require.
>>
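>> (The standard knob for that reservation, for what it's worth, is nova's
>> reserved_host_memory_mb on each compute node; shown here to illustrate
>> the idea, not as a claim about my exact config:)
>>
>>     # /etc/nova/nova.conf
>>     [DEFAULT]
>>     reserved_host_memory_mb = 3072
>>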
>> QH
>>
>> On Thu, Mar 26, 2015 at 6:57 AM, Mark Nelson <mnel...@redhat.com> wrote:
>>
>>     It's kind of a philosophical question.  Technically there's nothing
>>     that prevents you from putting ceph and the hypervisor on the same
>>     boxes. It's a question of whether or not potential cost savings are
>>     worth increased risk of failure and contention.  You can minimize
>>     those things through various means (cgroups, restricting NUMA nodes,
>>     etc).  What is more difficult is isolating disk IO contention (say
>>     if you want local SSDs for VMs), memory bus and QPI contention,
>>     network contention, etc. If the VMs are working really hard you can
>>     restrict them to their own socket, and you can even restrict memory
>>     usage to the local socket, but what about remote socket network or
>>     disk IO? (you will almost certainly want these things on the ceph
>>     socket)  I wonder as well about increased risk of hardware failure
>>     with the increased load, but I don't have any statistics.
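>>
>>     To pin the VMs themselves, a libvirt domain XML fragment along these
>>     lines would do it (the CPU list and NUMA node are assumptions for
>>     the second socket of a dual-socket box):
>>
>>         <vcpu placement='static' cpuset='14-27'>4</vcpu>
>>         <numatune>
>>           <memory mode='strict' nodeset='1'/>
>>         </numatune>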
>>
>>     I'm guessing if you spent enough time at it you could make it work
>>     relatively well, but at least personally I question how beneficial
>>     it really is after all of that.  If you are going for cost savings,
>>     I suspect efficient compute and storage node designs will be nearly
>>     as good with much less complexity.
>>
>>     Mark
>>
>>
>>     On 03/26/2015 07:11 AM, Wido den Hollander wrote:
>>
>>         On 26-03-15 12:04, Stefan Priebe - Profihost AG wrote:
>>
>>             Hi Wido,
>>             Am 26.03.2015 um 11:59 schrieb Wido den Hollander:
>>
>>                 On 26-03-15 11:52, Stefan Priebe - Profihost AG wrote:
>>
>>                     Hi,
>>
>>                     in the past I read pretty often that it's not a
>>                     good idea to run ceph and qemu / the hypervisors
>>                     on the same nodes.
>>
>>                     But why is this a bad idea? You save space and
>>                     can make better use of the resources you have in
>>                     the nodes anyway.
>>
>>
>>                 Memory pressure during recovery *might* become a
>>                 problem. If you make sure that you don't allocate
>>                 more than, say, 50% of RAM for the guests (e.g., on
>>                 a 64GB node, keep guests under ~32GB so recovery has
>>                 headroom), it could work.
>>
>>
>>             Hmm, are you sure? I've never seen problems like that.
>>             Currently I run each ceph node with 64GB of memory and
>>             each hypervisor node with around 512GB to 1TB of RAM
>>             and 48 cores.
>>
>>
>>         Yes, it can happen. Your machines have plenty of memory,
>>         but if you overprovision them it can still become a problem.
>>
>>                 Using cgroups you could also prevent the OSDs from
>>                 eating up all the memory or CPU.
>>
>>             I've never seen an OSD do anything that crazy.
>>
>>
>>         Again, it really depends on the available memory and CPU.
>>         If you buy big machines for this purpose it probably won't
>>         be a problem.
>>
>>             Stefan
>>
>>                 So technically it could work, but memory and CPU
>>                 pressure is something which might give you problems.
>>
>>                     Stefan
>>
>>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
