> To those interested in a tricky problem,
>
> We have a Ceph cluster running at one of our data centers. One of our
> client's requirements is to have them hosted at AWS. My question is: How do
> we effectively migrate our data on our internal Ceph cluster to an AWS Ceph
> cluster?
>
> Ideas curre
> do people consider a UPS + Shutdown procedures a suitable substitute?
I certainly wouldn't; I've seen utility power fail and the transfer
switch fail to transition to UPS strings. Had this happened to me with
nobarrier it would have been a very sad day.
--
> Thanks for all the help. Can the moving over from VLAN to separate
> switches be done on a live cluster? Or does there need to be a down
> time?
You can do it on a live cluster. The more cavalier approach would be
to quickly switch the link over one server at a time, which might
cause a short io
> For a large network (say 100 servers and 2500 disks), are there any
> strong advantages to using separate switch and physical network
> instead of VLAN?
Physical isolation will ensure that congestion on one does not affect
the other. On the flip side, asymmetric network failures tend to be
more
> Can you paste me the whole output of the install? I am curious why/how you
> are getting el7 and el6 packages.
priority=1 is required in the /etc/yum.repos.d/ceph.repo entries
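For reference, a minimal sketch of such an entry, assuming the
yum-plugin-priorities package is installed (the release and arch in the
baseurl are just an example):

  [ceph]
  name=Ceph packages for $basearch
  baseurl=http://ceph.com/rpm-emperor/el6/$basearch
  enabled=1
  gpgcheck=1
  priority=1

The idea is that the ceph.com repo then wins over distro/EPEL repos that
also carry ceph packages, which is usually how mixed el6/el7 installs happen.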
--
Kyle
> I wonder whether OSDs use the system calls of the Virtual File System (i.e.
> open, read, write, etc.) when they access disks.
>
> I mean ... could I monitor the I/O commands requested by OSDs to disks if I
> monitor the VFS?
Ceph OSDs run on top of a traditional filesystem, so long as they
support xattrs - xfs by de
> I was wondering, having a cache pool in front of an RBD pool is all fine
> and dandy, but imagine you want to pull backups of all your VMs (or one
> of them, or multiple...). Going to the cache for all those reads isn't
> only pointless, it'll also potentially fill up the cache and possibly
> evi
> TL;DR: Power outages are more common than your colo facility will admit.
Seconded. I've seen power failures in at least 4 different facilities
and all of them had the usual gamut of batteries/generators/etc. Some
of those facilities I've seen problems multiple times in a single
year. Even a data
> Anyway, replacing the set of monitors means downtime for every client, so
> I'm in doubt whether 'no outage' is still applicable there.
Taking the entire quorum down for migration would be bad. It's better
to add one in the new location, remove one at the old, ad infinitum.
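A rough sketch of one add/remove cycle (monitor IDs and the address are
illustrative; the full procedure is in the add/remove monitors docs):

  # on a host at the new location
  ceph auth get mon. -o /tmp/mon.keyring
  ceph mon getmap -o /tmp/monmap
  ceph-mon -i newmon1 --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
  ceph mon add newmon1 192.0.2.10:6789   # register it, then start the daemon
  # once it has joined the quorum, retire a monitor at the old location
  service ceph stop mon.oldmon1
  ceph mon remove oldmon1

Repeat until all monitors live at the new site, keeping an odd-sized quorum
throughout.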
--
Kyle
> Let's assume a test cluster up and running with real data on it.
> Which is the best way to migrate everything to a production (and
> larger) cluster?
>
> I'm thinking to add production MONs to the test cluster, after that,
> add production OSDs to the test cluster, waiting for a full rebalance
>> >> I think the timing should work that we'll be deploying with Firefly and
>> >> so
>> >> have Ceph cache pool tiering as an option, but I'm also evaluating
>> >> Bcache
>> >> versus Tier to act as node-local block cache device. Does anybody have
>> >> real
>> >> or anecdotal evidence about whic
>> Obviously the ssds could be used as journal devices, but I'm not really
>> convinced whether this is worthwhile when all nodes have 1GB of hardware
>> writeback cache (writes to journal and data areas on the same spindle have
>> time to coalesce in the cache and minimise seek time hurt). Any adv
> I'm assuming Ceph/RBD doesn't have any direct awareness of this since
> the file system doesn't traditionally have a "give back blocks"
> operation to the block device. Is there anything special RBD does in
> this case that communicates the release of the Ceph storage back to the
> pool?
VMs ru
> I have run the repair command, and the warning info disappears in the
> output of "ceph health detail", but the replica isn't recovered in the
> "current" directory.
> In all, the ceph cluster status can recover (the pg's status recovers from
> inconsistent to active and clean), but not the replica.
I
> I have two nodes with 8 OSDs on each. First node running 2 monitors on
> different virtual machines (mon.1 and mon.2), second node running mon.3
> After several reboots (I have tested power failure scenarios) "ceph -w" on
> node 2 always fails with message:
>
> root@bes-mon3:~# ceph --verbose -w
> I upload a file through swift API, then delete it in the “current” directory
> in the secondary OSD manually, why the object can’t be recovered?
>
> If I delete it in the primary OSD, the object is deleted directly in the
> pool .rgw.bucket and it can’t be recovered from the secondary OSD.
>
> Do
> ceph-disk-prepare --fs-type xfs --dmcrypt --dmcrypt-key-dir
> /etc/ceph/dmcrypt-keys --cluster ceph -- /dev/sdb
> ceph-disk: Error: Device /dev/sdb2 is in use by a device-mapper mapping
> (dm-crypt?): dm-0
It sounds like device-mapper still thinks it's using the volume,
you might be able t
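If that's what happened, something along these lines may help you track down
and clear the stale mapping (double-check nothing is actually using it first):

  dmsetup ls                     # list active device-mapper mappings
  dmsetup info /dev/dm-0         # find the mapping name holding /dev/sdb2
  dmsetup remove <mapping-name>  # or: cryptsetup remove <mapping-name>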
> I need to add a extend server, which reside several osds, to a
> running ceph cluster. During add osds, ceph would not automatically modify
> the ceph.conf. So I manually modify the ceph.conf
>
> And restart the whole ceph cluster with command: 'service ceph -a restart'.
> I just confuse
> One downside of the above arrangement: I read that support for mapping
> newer-format RBDs is only present in fairly recent kernels. I'm running
> Ubuntu 12.04 on the cluster at present with its stock 3.2 kernel. There
> is a PPA for the 3.11 kernel used in Ubuntu 13.10, but if you're looking
>
> If I want to use a disk dedicated for osd, can I just use something like
> /dev/sdb instead of /dev/sdb1? Is there any negative impact on performance?
You can pass /dev/sdb to ceph-disk-prepare and it will create two
partitions, one for the journal (raw partition) and one for the data
volume (de
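As a quick illustration (the device name is just an example), you can verify
the resulting layout afterwards:

  ceph-disk-prepare --fs-type xfs --cluster ceph -- /dev/sdb
  sgdisk -p /dev/sdb   # should now show a "ceph data" and a "ceph journal" partition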
> We use /dev/disk/by-path for this reason, but we confirmed that is stable
> for our HBAs. Maybe /dev/disk/by-something is consistent with your
> controller.
The upstart/udev scripts will handle mounting and osd id detection, at
least on Ubuntu.
--
Kyle
> This is in my lab. Plain passthrough setup with automap enabled on the F5. s3
> & curl work fine as far as queries go. But file transfer rate degrades badly
> once I start file up/download.
Maybe the difference can be attributed to LAN client traffic with
jumbo frames vs F5 using a smaller WAN
> You're right. Sorry didn't specify I was trying this for Radosgw. Even for
> this I'm seeing performance degrade once my clients start to hit the LB VIP.
Could you tell us more about your load balancer and configuration?
--
Kyle
> Anybody has a good practice on how to set up a ceph cluster behind a pair of
> load balancer?
The only place you would want to put a load balancer in the context of
a Ceph cluster would be north of RGW nodes. You can do L3 transparent
load balancing or balance with an L7 proxy, i.e. Linux Virtual
> I tried rbd-fuse and it's throughput using fio is approx. 1/4 that of the
> kernel client.
>
> Can you please let me know how to setup RBD backend for FIO? I'm assuming
> this RBD backend is also based on librbd?
You will probably have to build fio from source since the rbd engine is new:
htt
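Once you have an fio build with the rbd engine, a minimal job file looks
roughly like this (pool, client and image names are illustrative, and the
image has to exist beforehand, e.g. 'rbd create fio-test --size 10240'):

  [rbd-test]
  ioengine=rbd
  clientname=admin
  pool=rbd
  rbdname=fio-test
  rw=randwrite
  bs=4k
  iodepth=32
  runtime=60
  time_based

Then run 'fio rbd-test.fio'; since this goes through librbd there is no need
to map the image with the kernel client.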
> 1. Is it possible to install Ceph and Ceph monitors on the XCP
> (XEN) Dom0 or would we need to install it on the DomU containing the
> Openstack components?
I'm not a Xen guru but in the case of KVM I would run the OSDs on the
hypervisor to avoid virtualization overhead.
> 2. I
> There could be millions of tenants. Looking deeper at the docs, it looks
> like Ceph prefers to have one OSD per disk. We're aiming at having
> backblazes, so will be looking at 45 OSDs per machine, many machines. I want
> to separate the tenants and separately encrypt their data. The enc
> Why the limit of 6 OSDs per SSD?
SATA/SAS throughput generally.
> I am doing testing with a PCI-e based SSD, and showing that even with 15
OSD disk drives per SSD that the SSD is keeping up.
That will probably be fine performance-wise, but it's worth noting that all
OSDs will fail if the flash
> Ceph is seriously badass, but my requirements are to create a cluster in
> which I can host my customer's data in separate areas which are independently
> encrypted, with passphrases which we as cloud admins do not have access to.
>
> My current thoughts are:
> 1. Create an OSD per machine stre
> Is there an issue with IO performance?
Ceph monitors store cluster maps and various other things in leveldb,
which persists to disk. I wouldn't recommend using SD/USB cards for
the monitor store because they tend to be slow and have poor
durability.
--
Kyle
> What could be the best replication?
Are you using two sites to increase availability, durability, or both?
For availability you're really better off using three sites and using
CRUSH to place each of three replicas in a different datacenter. In
this setup you can survive losing 1 of 3 datacenters.
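A minimal sketch of such a rule, assuming your CRUSH map already declares
datacenter buckets under the default root:

  rule replicated_3dc {
          ruleset 1
          type replicated
          min_size 3
          max_size 3
          step take default
          step chooseleaf firstn 0 type datacenter
          step emit
  }

Point the pool at it with 'ceph osd pool set <pool> crush_ruleset 1' and
size 3; each replica then lands in a different datacenter, so losing one
site still leaves two copies.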
> Why would it help? Since it's not that ONE OSD will be primary for all
> objects. There will be 1 primary OSD per PG and you'll probably have a
> couple of thousand PGs.
The primary may be across an oversubscribed/expensive link, in which case a
local replica with a common ancestor to the client may
> Changing pg_num for .rgw.buckets to a power of 2, and 'crush tunables
> optimal', didn't help :(
Did you bump pgp_num as well? The split PGs will stay in place until
pgp_num is bumped; if you do this, be prepared for (potentially lots of)
data movement.
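For example (the target count is illustrative - pick the power of two that
fits your OSD count):

  ceph osd pool set .rgw.buckets pg_num 1024
  ceph osd pool set .rgw.buckets pgp_num 1024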
> HEALTH_WARN 1 pgs down; 3 pgs incomplete; 3 pgs stuck inactive; 3 pgs
stuck unclean; 7 requests are blocked > 32 sec; 3 osds have slow requests;
pool cloudstack has too few pgs; pool .rgw.buckets has too few pgs
> pg 14.0 is stuck inactive since forever, current state incomplete, last
acting [5,0
> wants to use Ceph for VM storage in the future, we need to find a solution.
That's a shame, but at least you will be better prepared if it happens
again, hopefully your luck is not as unfortunate as mine!
--
Kyle Bader
> Do monitors have to be on the cluster network as well or is it sufficient
> for them to be on the public network as
> http://ceph.com/docs/master/rados/configuration/network-config-ref/
> suggests?
Monitors only need to be on the public network.
> Also would the OSDs re-route their traffic over
> Yes, that also makes perfect sense, so the aforementioned 12500 objects
> for a 50GB image, at a 60 TB cluster/pool with 72 disk/OSDs and 3 way
> replication that makes 2400 PGs, following the recommended formula.
>
>> > What amount of disks (OSDs) did you punch in for the following run?
>> >> Di
> Is an object a CephFS file or a RBD image or is it the 4MB blob on the
> actual OSD FS?
Objects are at the RADOS level, CephFS filesystems, RBD images and RGW
objects are all composed by striping RADOS objects - default is 4MB.
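You can see the object size on any RBD image; 'order' is log2 of the object
size (image name illustrative):

  rbd create test-image --size 1024
  rbd info test-image    # shows a line like "order 22 (4096 kB objects)"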
> In my case, I'm only looking at RBD images for KVM volume storage
Using your data as inputs to the Ceph reliability calculator [1]
results in the following:
Disk Modeling Parameters
    size:     3TiB
    FIT rate: 826 (MTBF = 138.1 years)
    NRE rate: 1.0E-16
RAID parameters
    replace:       6 hours
    recovery rate: 500MiB/s (100 mi
> The area I'm currently investigating is how to configure the
> networking. To avoid a SPOF I'd like to have redundant switches for
> both the public network and the internal network, most likely running
> at 10Gb. I'm considering splitting the nodes in to two separate racks
> and connecting each
> Do you have any futher detail on this radosgw bug?
https://github.com/ceph/ceph/commit/0f36eddbe7e745665a634a16bf3bf35a3d0ac424
https://github.com/ceph/ceph/commit/0b9dc0e5890237368ba3dc34cb029010cb0b67fd
> Does it only apply to emperor?
The bug is present in dumpling too.
>> Has anyone tried scaling a VMs io by adding additional disks and
>> striping them in the guest os? I am curious what effect this would have
>> on io performance?
> Why would it? You can also change the stripe size of the RBD image.
Depending on the workload you might change it from 4MB to some
For you holiday pleasure I've prepared a SysAdvent article on Ceph:
http://sysadvent.blogspot.com/2013/12/day-15-distributed-storage-with-ceph.html
Check it out!
--
Kyle
> Introduction of Savanna for those who haven't heard of it:
>
> Savanna project aims to provide users with simple means to provision a Hadoop
> cluster at OpenStack by specifying several parameters like Hadoop version,
> cluster topology, nodes hardware details and a few more.
>
> For now, Sav
It seems that NUMA can be problematic for ceph-osd daemons in certain
circumstances. Namely, if a NUMA zone is running out of memory due to
uneven allocation, it is possible for that zone to enter reclaim mode
when threads/processes are scheduled on a core in that zone, and those
proce
> I've been running similar calculations recently. I've been using this
> tool from Inktank to calculate RADOS reliabilities with different
> assumptions:
> https://github.com/ceph/ceph-tools/tree/master/models/reliability
>
> But I've also had similar questions about RBD (or any multi-part files
> We're running OpenStack (KVM) with local disk for ephemeral storage.
> Currently we use local RAID10 arrays of 10k SAS drives, so we're quite rich
> for IOPS and have 20GE across the board. Some recent patches in OpenStack
> Havana make it possible to use Ceph RBD as the source of ephemeral VM
>
> looking at tcpdump all the traffic is going exactly where it is supposed to
> go, in particular an osd on the 192.168.228.x network appears to talk to an
> osd on the 192.168.229.x network without anything strange happening. I was
> just wondering if there was anything about ceph that could ma
>> Is having two cluster networks like this a supported configuration? Every
>> osd and mon can reach every other so I think it should be.
>
> Maybe. If your back end network is a supernet and each cluster network is a
> subnet of that supernet. For example:
>
> Ceph.conf cluster network (supernet)
> Is having two cluster networks like this a supported configuration? Every
osd and mon can reach every other so I think it should be.
Maybe. If your back end network is a supernet and each cluster network is a
subnet of that supernet. For example:
Ceph.conf cluster network (supernet): 10.0.0.0/8
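In ceph.conf terms, something along these lines (addresses illustrative; for
the 192.168.228.x/192.168.229.x networks above the equivalent supernet would
be 192.168.228.0/23):

  [global]
      public network  = 172.16.0.0/16
      cluster network = 10.0.0.0/8    ; the supernet
  # rack 1 OSDs take cluster addresses from 10.1.0.0/24, rack 2 from
  # 10.2.0.0/24; both are subnets of the supernet, so every OSD treats
  # its peers' addresses as being on the cluster network.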
> This journal problem is a bit of wizardry to me, I even had weird
intermittent issues with OSDs not starting because the journal was not
found, so please do not hesitate to suggest a better journal setup.
You mentioned using SAS for journal; if your OSDs are SATA and an expander
is in the data pa
> > Is the OS doing anything apart from ceph? Would booting a ramdisk-only
system from USB or compact flash work?
I haven't tested this kind of configuration myself but I can't think of
anything that would preclude this type of setup. I'd probably use squashfs
layered with a tmpfs via aufs to avoid
> How much performance can be improved if use SSDs to storage journals?
You will see roughly twice the throughput unless you are using btrfs
(still improved but not as dramatic). You will also see lower latency
because the disk head doesn't have to seek back and forth between
journal and data par
> Is there any way to manually configure which OSDs are started on which
> machines? The osd configuration block includes the osd name and host, so is
> there a way to say that, say, osd.0 should only be started on host vashti
> and osd.1 should only be started on host zadok? I tried using thi
Several people have reported issues with combining OS and OSD journals
on the same SSD drives/RAID due to contention. If you do something
like this I would definitely test to make sure it meets your
expectations. Ceph logs are going to compose the majority of the
writes to the OS storage devices.
> So quick correction based on Michael's response. In question 4, I should
> have not made any reference to Ceph objects, since objects are not striped
> (per Michael's response). Instead, I should simply have used the words "Ceph
> VM Image" instead of "Ceph objects". A Ceph VM image would constit
> We have a plan to run Ceph as block storage for OpenStack, but from testing
> we found that IOPS are slow.
>
> Our apps primarily use the block storage for saving logs (i.e, nginx's
> access logs).
> How to improve this?
There are a number of things you can do, notably:
1. Tuning cache on the hype
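To illustrate point 1, these are the librbd cache options typically tuned on
the hypervisor side (values are illustrative; the QEMU drives should also use
cache=writeback):

  [client]
      rbd cache = true
      rbd cache size = 67108864                   # 64 MB per volume
      rbd cache max dirty = 50331648
      rbd cache writethrough until flush = true   # stays safe with older guests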
> 3).Comment out, #hashtag the bad OSD drives in the “/etc/fstab”.
This is unnecessary if you're using the provided upstart and udev
scripts; OSD data devices will be identified by label and mounted. If
you choose not to use the upstart and udev scripts then you should
write init scripts that do si
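If you want to see what the udev rules key off, something like this works
(device name illustrative, and 'ceph-disk list' needs a reasonably recent
ceph-disk):

  ceph-disk list          # shows which partitions are "ceph data" / "ceph journal"
  sgdisk -i 1 /dev/sdb    # the partition type GUID is what the udev rules match on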
> Would this be something like
> http://wiki.ceph.com/01Planning/02Blueprints/Firefly/Ceph-Brag ?
Something very much like that :)
--
Kyle
> I think this is a great idea. One of the big questions users have is
> "what kind of hardware should I buy." An easy way for users to publish
> information about their setup (hardware, software versions, use-case,
> performance) when they have successful deployments would be very valuable.
> Ma
> Zackc, Loicd, and I have been the main participants in a weekly Teuthology
> call the past few weeks. We've talked mostly about methods to extend
> Teuthology to capture performance metrics. Would you be willing to join us
> during the Teuthology and Ceph-Brag sessions at the Firefly Developer
>
> ST240FN0021 connected via a SAS2x36 to a LSI 9207-8i.
The problem might be SATA transport protocol overhead at the expander.
Have you tried directly connecting the SSDs to SATA2/3 ports on the
mainboard?
--
Kyle
> 1. To build a high performance yet cheap radosgw storage, which pools should
> be placed on ssd and which on hdd backed pools? Upon installation of
> radosgw, it created the following pools: .rgw, .rgw.buckets,
> .rgw.buckets.index, .rgw.control, .rgw.gc, .rgw.root, .usage, .users,
> .users.email
>> Once I know a drive has had a head failure, do I trust that the rest of the
>> drive isn't going to go at an inconvenient moment vs just fixing it right
>> now when it's not 3AM on Christmas morning? (true story) As good as Ceph
>> is, do I trust that Ceph is smart enough to prevent spreadin
The quick start guide is linked below, it should help you hit the ground
running.
http://ceph.com/docs/master/start/quick-ceph-deploy/
Let us know if you have questions or bump into trouble!
Recovering from a degraded state by copying existing replicas to other OSDs
is going to cause reads on existing replicas and writes to the new
locations. If you have slow media then this is going to be felt more
acutely. Tuning the backfill options I posted is one way to lessen the
impact, another
The bobtail release added udev/upstart capabilities that allowed you
to not have per-OSD entries in ceph.conf. Under the covers the new
udev/upstart scripts look for a special label on OSD data volumes;
matching volumes are mounted and then a few files are inspected:
journal_uuid and whoami.
The jour
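If you want to poke at those files yourself, a mounted OSD data partition
looks roughly like this (device and mount point illustrative):

  mkdir -p /mnt/osd-inspect
  mount /dev/sdb1 /mnt/osd-inspect
  cat /mnt/osd-inspect/whoami          # the OSD id, e.g. 12
  cat /mnt/osd-inspect/journal_uuid    # partition uuid of the matching journal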
You can change some OSD tunables to lower the priority of backfills:
osd recovery max chunk: 8388608
osd recovery op priority: 2
In general a lower op priority means it will take longer for your
placement groups to go from degraded to active+clean; the idea is to
balance recover
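These can go in the [osd] section of ceph.conf, or be injected into running
daemons, e.g.:

  ceph tell osd.* injectargs '--osd-recovery-max-chunk 8388608 --osd-recovery-op-priority 2'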
> I know that 10GBase-T has more delay than SFP+ with direct attached
> cables (.3 usec vs 2.6 usec per link), but does that matter? Some
> sites say it is a huge hit, but we are talking usec, not ms, so I
> find it hard to believe that it causes that much of an issue. I like
> the lower cost and
>> This is going to get horribly ugly when you add neutron into the mix, so
>> much so I'd consider this option a non-starter. If someone is using
>> openvswitch to create network overlays to isolate each tenant I can't
>> imagine this ever working.
>
> I'm not following here. Is this only needed
> Option 1) The service plugs your filesystem's IP into the VM's network
> and provides direct IP access. For a shared box (like an NFS server)
> this is fairly straightforward and works well (*everything* has a
> working NFS client). It's more troublesome for CephFS, since we'd need
> to include a
Besides what Mark and Greg said it could be due to additional hops through
network devices. What network devices are you using, what is the network
topology and does your CRUSH map reflect the network topology?
On Oct 21, 2013 9:43 AM, "Gregory Farnum" wrote:
> On Mon, Oct 21, 2013 at 7:13 AM, Gu
> > * The IP address of at least one MON in the Ceph cluster
>
If you configure nodes with a single monitor in the "mon hosts" directive
then I believe your nodes will have issues if that one monitor goes down.
With Chef I've gone back and forth between using Chef search and having
monitors be dec
My first guess would be that it's due to LXC dropping capabilities, I'd
investigate whether CAP_SYS_ADMIN is being dropped. You need CAP_SYS_ADMIN
for mount and block ioctls, if the container doesn't have those privs a map
will likely fail. Maybe try tracing the command with strace?
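A quick way to check, assuming the rbd CLI is available inside the container
(image and container names illustrative):

  strace -f -o /tmp/rbd-map.trace rbd map rbd/test-image
  grep -E 'EPERM|EACCES' /tmp/rbd-map.trace       # dropped capabilities show up here
  grep cap.drop /var/lib/lxc/mycontainer/config   # see what the container config drops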
On Thu, Oct 17
I've personally saturated 1Gbps links on multiple radosgw nodes on a large
cluster; if I remember correctly, Yehuda has tested it up into the 7Gbps
range with 10Gbps gear. Could you describe your cluster's hardware and
connectivity?
On Mon, Oct 14, 2013 at 3:34 AM, Chu Duc Minh wrote:
> Hi sorry
I've contracted and expanded clusters by up to a rack of 216 OSDs - 18
nodes, 12 drives each. New disks are configured with a CRUSH weight of 0
and I slowly add weight (0.1 to 0.01 increments), wait for the cluster to
become active+clean and then add more weight. I was expanding after
contraction
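The loop looks roughly like this (OSD id, host and increments are
illustrative):

  ceph osd crush add osd.216 0 root=default host=node19   # new OSD enters with weight 0
  ceph osd crush reweight osd.216 0.1                     # nudge the weight up
  ceph -s                                                 # wait for active+clean, then repeat
  ceph osd crush reweight osd.216 0.2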
ges to implement a faster tier
> via SSD.
>
> --
> Warren
>
> On Oct 9, 2013, at 5:52 PM, Kyle Bader wrote:
>
> Journal on SSD should effectively double your throughput because data will
> not be written to the same device twice to ensure transactional integrity.
> Additional
You can certainly use a similarly named device to back an OSD journal if
the OSDs are on separate hosts. If you want to take a single SSD device and
utilize it as a journal for many OSDs on the same machine then you would
want to partition the SSD device and use a different partition for each OSD
j
Journal on SSD should effectively double your throughput because data will
not be written to the same device twice to ensure transactional integrity.
Additionally, by placing the OSD journal on an SSD you should see less
latency, the disk head no longer has to seek back and forth between the
journa