Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's

2016-08-06 Thread Ilya Dryomov
On Sat, Aug 6, 2016 at 1:10 AM, Alex Gorbachev  wrote:
> Is there a way to perhaps increase the discard granularity?  The way I see
> it, based on the discussion so far, here is why discard/unmap is failing to
> work with VMware:
>
> - RBD provides space in 4MB blocks, which must be discarded entirely, or at
> least have their tail hit.
>
> - SCST communicates to ESXi that discard alignment is 4MB and discard
> granularity is also 4MB
>
> - ESXi's VMFS5 is aligned on 1MB, so 4MB discards never actually free
> anything
>
> What if it were possible to use a 6MB discard granularity?

I'm confused.  How can a 4M discard not free anything?  It's either
going to hit an entire object or two adjacent objects, truncating the
tail of one and zeroing the head of another.  Using rbd diff:

$ rbd diff test | grep -A 1 25165824
25165824  4194304 data
29360128  4194304 data
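# both extents are fully allocated: 4M at 25165824 and 4M at 29360128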

# a 4M discard at 1M into a RADOS object
$ blkdiscard -o $((25165824 + (1 << 20))) -l $((4 << 20)) /dev/rbd0

$ rbd diff test | grep -A 1 25165824
25165824  1048576 data
29360128  4194304 data
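# the 3M tail of the first object was freed; the zeroed 1M at the head of
# the second object leaves it allocated

For completeness, the discard parameters the kernel client advertises can
be checked directly (a sketch, assuming the mapped device is /dev/rbd0;
these are standard block-layer sysfs attributes):

$ cat /sys/block/rbd0/queue/discard_granularity
$ cat /sys/block/rbd0/queue/discard_max_bytes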

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd-mirror questions

2016-08-06 Thread Shain Miley
Thank you both for the detailed answers...this gives me a starting point to 
work from!

Shain

Sent from my iPhone

> On Aug 5, 2016, at 8:25 AM, Jason Dillaman  wrote:
> 
>> On Fri, Aug 5, 2016 at 3:42 AM, Wido den Hollander  wrote:
>> 
>>> On 4 August 2016 at 18:17, Shain Miley  wrote:
>>> 
>>> 
>>> Hello,
>>> 
>>> I am thinking about setting up a second Ceph cluster in the near future,
>>> and I was wondering about the current status of rbd-mirror.
>> 
>> I don't have all the answers, but I will give it a try.
>> 
>>> 1) Is it production ready at this point?
>> 
>> Yes, but rbd-mirror is a single process at the moment. So mirroring a very 
>> large number of images might become a bottleneck at some point. I don't know 
>> where that point is.
> 
> Production ready could mean different things to different people. We
> haven't had any reports of data corruption or similar issues. The
> forthcoming 10.2.3 release will include several rbd-mirror daemon
> stability and performance improvements (especially in terms of memory
> usage) that were uncovered during heavy stress testing beyond our
> normal automated test cases.
> 
> It is not currently HA nor horizontally scalable, but we have a design
> blueprint in place to start addressing this for the upcoming Kraken
> release. It is also missing a "deep scrub"-like utility to
> periodically verify that your replicated images match your primary
> images, which I am hoping to include in the Luminous release. Finally,
> we are still tuning the default journal settings for performance, but in
> the meantime setting the "rbd_journal_object_flush_age" config option to
> a non-zero value (in seconds) will improve IOPS noticeably when
> journaling is enabled.
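>
> For example, a minimal ceph.conf fragment (a sketch only; the [client]
> section and the 1-second value are illustrative assumptions, not
> recommendations):
>
>   [client]
>   rbd journal object flush age = 1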
> 
>>> 2) Can it be used when you have a cluster with existing data, in order to
>>> replicate onto a new cluster?
>> 
>> iirc, images need the fast-diff feature enabled to be able to support
>> mirroring; more on that in the docs:
>> http://docs.ceph.com/docs/master/rbd/rbd-mirroring/
>> 
>> The problem is, if you have old RBD images, maybe even format 1, you will 
>> not be able to mirror those.
>> 
>> Some RBD format 2 images won't work either, since they don't have
>> journaling and don't have fast-diff.
>> 
>> So whether mirroring is able to run will depend on the individual image.
> 
> Yes, it will automatically "bootstrap" existing images to the new
> cluster by performing a full, deep copy of the images. The default
> setting is to synchronize a maximum of 5 images concurrently, but for
> huge images you may want to tweak that setting down. This requires
> only the exclusive-lock and journaling features on the images -- which
> can be dynamically enabled/disabled on existing v2 images if needed.
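>
> As a sketch, with placeholder pool/image names, the required features
> can be enabled per image like this:
>
>   $ rbd feature enable mypool/myimage exclusive-lock
>   $ rbd feature enable mypool/myimage journaling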
> 
>>> 3) We have some rather large RBD images at this point...several in the
>>> 90TB range...would there be any concern using rbd-mirror given the size
>>> of our images?
>> 
>> The initial sync might be slow and block the single rbd-mirror process. 
>> Afterwards, if fast-diff is enabled it shouldn't be a real problem.
> 
> Agreed -- the initial sync will take the longest. By default it copies
> up to 10 backing object blocks concurrently for each syncing image,
> but if your cluster has enough capacity you can adjust that up using
> the "rbd_concurrent_management_ops" config setting to increase the
> transfer throughput. While the initial sync is in-progress, the
> journal will continue to grow since the remote rbd-mirror process
> won't be able to replay events until after the sync is complete.
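>
> For example (illustrative only; this assumes the option is picked up
> from a [client] section read by the rbd-mirror daemon):
>
>   [client]
>   rbd concurrent management ops = 20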
> 
> As with any new feature or release of Ceph, I would recommend first
> playing around with it on non-production workloads. Since RBD
> mirroring is configured on a per-pool and per-image basis, there is a
> potentially lower barrier for testing.
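>
> A minimal per-pool setup sketch (pool, client and cluster names are
> placeholders; see the rbd-mirroring docs linked earlier in the thread
> for the full procedure):
>
>   $ rbd mirror pool enable mypool image
>   $ rbd mirror pool peer add mypool client.remote@remote-cluster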
> 
>> Wido
>> 
>>> Thanks,
>>> 
>>> Shain
>>> 
>>> --
>>> NPR | Shain Miley | Manager of Infrastructure, Digital Media |
>>> smi...@npr.org | 202.513.3649
>>> 
> 
> 
> 
> -- 
> Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSDs going down when we bring down some OSD nodes or cut off the cluster network link between OSD nodes

2016-08-06 Thread Venkata Manojawa Paritala
Hi,

We have configured a single Ceph cluster in a lab with the specification
below.

1. Divided the cluster into 3 logical sites (SiteA, SiteB & SiteC). This is
to simulate nodes being part of different data centers, with network
connectivity between them for DR.
2. Each site operates in a different subnet and each subnet is part of one
VLAN. We have configured routing so that OSD nodes in one site can
communicate with OSD nodes in the other 2 sites.
3. Each site will have one monitor node, 2 OSD nodes (to which we have
disks attached) and IO-generating clients.
4. We have configured 2 networks.
4.1. Public network - To which all the clients, monitors and OSD nodes are
connected
4.2. Cluster network - To which only the OSD nodes are connected, for
replication/recovery/heartbeat traffic.

5. We have 2 issues here.
5.1. We are unable to sustain client IO from individual sites when we
isolate the OSD nodes by bringing down ONLY the cluster network between
sites. Logically this puts the individual sites in isolation with respect
to the cluster network. Please note that the public network is still
connected between the sites.
5.2. In a fully functional cluster, when we bring down 2 sites (shut down
the OSD services of 2 sites, say Site A OSDs and Site B OSDs), the OSDs
in the third site (Site C) also go down (OSD flapping).

We need workarounds/solutions to fix the above 2 issues.

Below are some of the parameters we have already set in ceph.conf to
sustain the cluster for a longer time when we cut off the links between
sites, but they were not successful.

--
[global]
public_network = 10.10.0.0/16
cluster_network = 192.168.100.0/16,192.168.150.0/16,192.168.200.0/16
osd heartbeat addr = 172.16.0.0/16

[mon]
mon osd report timeout = 1800

[osd]
osd heartbeat interval = 12
osd heartbeat grace = 60
osd mon heartbeat interval = 60
osd mon report interval max = 300
osd mon report interval min = 10
osd mon ack timeout = 60
.
.


We also configured the parameter "osd_heartbeat_addr" and tried two
values: 1) the Ceph public network (assuming that when we bring down the
cluster network, heartbeats should happen via the public network), and
2) a different network range altogether, with its own physical
connections. Neither option worked.

We have a total of 49 OSDs (14 in SiteA, 14 in SiteB, 21 in SiteC) in the
cluster, and one monitor in each site.

We need to try the below two options.

A) Increase the "mon osd min down reporters" value. The question is by how
much. Say, if I set this value to 49, will client IO be sustained when we
cut off the cluster network links between sites? One issue in this case is
that if an OSD really is down, we wouldn't know. (A config sketch follows
after option B.)

B) Add 2 monitors to each site. This would give each site 3 monitors, and
the overall cluster would have 9 monitors. The reason we want to try this
is that we think the OSDs are going down because the quorum is unable to
find the minimum number of nodes (maybe monitors) to sustain itself.
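
For option A, a minimal sketch of the change being considered (49 is just
the value discussed above, not a tested recommendation):

[mon]
mon osd min down reporters = 49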

Thanks & Regards,
Manoj
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Giant to Jewel poor read performance with Rados bench

2016-08-06 Thread David
Hi All

I've just installed Jewel 10.2.2 on hardware that had previously been
running Giant. Rados bench with the default rand and seq tests is giving me
approximately 40% of the throughput I used to achieve. On Giant I would get
~1000MB/s (so probably limited by the 10GbE interface); now I'm getting
300-400MB/s.

I can see there is no activity on the disks during the bench, so the data
is all coming out of cache. The cluster isn't doing anything else during
the test. I'm fairly sure my network is sound; I've done the usual testing
with iperf etc. The write test seems about the same as I used to get
(~400MB/s).
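
For reference, the tests were along these lines (a sketch; the pool name
and duration are placeholders, and the write with --no-cleanup is what
leaves objects behind for the seq/rand reads):

$ rados bench -p testpool 60 write --no-cleanup
$ rados bench -p testpool 60 seq
$ rados bench -p testpool 60 rand
$ rados -p testpool cleanup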

This was a fresh install rather than an upgrade.

Are there any gotchas I should be aware of?

Some more details:

OS: CentOS 7
Kernel: 3.10.0-327.28.2.el7.x86_64
5 nodes (each 10 * 4TB SATA, 2 * Intel dc3700 SSD partitioned up for
journals).
10GbE public network
10GbE cluster network
MTU 9000 on all interfaces and switch
Ceph installed from ceph repo

Ceph.conf is pretty basic (IPs, hosts etc omitted):

filestore_xattr_use_omap = true
osd_journal_size = 1
osd_pool_default_size = 3
osd_pool_default_min_size = 2
osd_pool_default_pg_num = 4096
osd_pool_default_pgp_num = 4096
osd_crush_chooseleaf_type = 1
max_open_files = 131072
mon_clock_drift_allowed = .15
mon_clock_drift_warn_backoff = 30
mon_osd_down_out_interval = 300
mon_osd_report_timeout = 300
mon_osd_full_ratio = .95
mon_osd_nearfull_ratio = .80
osd_backfill_full_ratio = .80

Thanks
David
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com