[ceph-users] Merging CephFS data pools

2016-08-18 Thread Burkhard Linke

Hi,

the current setup for CephFS at our site uses two data pools due to 
different requirements in the past. I want to merge these two pools now, 
eliminating the second pool completely.


I've written a small script to locate all files on the second pool using 
their file layout attributes and replace them with a copy on the correct 
pool. This works well for files, but modifies the timestamps of the 
directories.
Do you have any idea for a better solution that does not modify 
timestamps and plays well with active CephFS clients (e.g. no problems 
with files that are currently in use)? A simple 'rados cppool' probably 
does not work, since the pool id/name is part of a file's metadata and 
clients would not be aware of the moved files.
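For reference, the script does roughly the following (a simplified sketch, not 
the actual script; the pool name and mount point are placeholders):

  OLD_POOL=old_data
  find /cephfs -type f | while read -r f; do
      pool=$(getfattr --only-values -n ceph.file.layout.pool "$f" 2>/dev/null)
      if [ "$pool" = "$OLD_POOL" ]; then
          # copy to a temp file, which inherits the directory's (new) default
          # layout, then rename over the original -- creating and renaming the
          # temp file is what touches the directory timestamps
          cp -p "$f" "$f.migrate" && mv "$f.migrate" "$f"
      fi
  done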


Regards,
Burkhard

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How can we repair OSD leveldb?

2016-08-18 Thread Wido den Hollander

> Op 17 augustus 2016 om 23:54 schreef Dan Jakubiec :
> 
> 
> Hi Wido,
> 
> Thank you for the response:
> 
> > On Aug 17, 2016, at 16:25, Wido den Hollander  wrote:
> > 
> > 
> >> Op 17 augustus 2016 om 17:44 schreef Dan Jakubiec :
> >> 
> >> 
> >> Hello, we have a Ceph cluster with 8 OSD that recently lost power to all 8 
> >> machines.  We've managed to recover the XFS filesystems on 7 of the 
> >> machines, but the OSD service is only starting on 1 of them.
> >> 
> >> The other 5 machines all have complaints similar to the following:
> >> 
> >>2016-08-17 09:32:15.549588 7fa2f4666800 -1 
> >> filestore(/var/lib/ceph/osd/ceph-1) Error initializing leveldb : 
> >> Corruption: 6 missing files; e.g.: 
> >> /var/lib/ceph/osd/ceph-1/current/omap/042421.ldb
> >> 
> >> How can we repair the leveldb to allow the OSDs to startup?  
> >> 
> > 
> > My first question would be: How did this happen?
> > 
> > What hardware are you using underneath? Is there a RAID controller which is 
> > not flushing properly? Since this should not happen during a power failure.
> > 
> 
> Each OSD drive is connected to an onboard hardware RAID controller and 
> configured in RAID 0 mode as individual virtual disks.  The RAID controller 
> is an LSI 3108.
> 

Was that controller in writeback mode without a BBU?

> I agree -- I am finding it bizarre that 7 of our 8 OSDs (one per machine) did 
> not survive the power outage.  
> 

As Christian already asked: was the FS mounted with nobarrier?

> We did have some problems with the stock Ubuntu xfs_repair (3.1.9)
> segfaulting, which we eventually overcame by building a newer version of
> xfs_repair (4.7.0). But it did finally repair clean.
> 

Not good. An xfs_repair should not be required after a power failure. A 
properly mounted journaling filesystem on top of a good controller should 
mount and simply replay its journal.

> We actually have some different errors on other OSDs.  A few of them are 
> failing with "Missing map in load_pgs" errors.  But generally speaking it 
> appears to be missing files of various types causing different kinds of 
> failures.
> 

Missing files are not good, very bad actually. This should never happen and 
points to something that is not Ceph's fault: a controller in writeback mode, 
the nobarrier mount option, etc.

> I'm really nervous now about the OSD's inability to start with any 
> inconsistencies and no repair utilities (that I can find).  Any advice on how 
> to recover?
> 

I am afraid that you won't be able to recover from this. You are missing 
essential files from the OSDs. Without them they won't be able to start.

Maybe, maybe, maybe something will be able to reconstruct the leveldb of the 
other OSDs with data from the one surviving OSD, but that's a very big maybe.
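If someone does attempt that, the tooling involved would be 
ceph-objectstore-tool, e.g. exporting intact PGs from the surviving OSD and 
importing them into a rebuilt one. Only a rough sketch (paths and the PG id 
are placeholders, and the OSD daemons must be stopped while you do this):

  # on the host with the surviving OSD
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
      --journal-path /var/lib/ceph/osd/ceph-0/journal \
      --op export --pgid 1.2f --file /tmp/1.2f.export

  # on a freshly re-created OSD
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 \
      --journal-path /var/lib/ceph/osd/ceph-1/journal \
      --op import --file /tmp/1.2f.export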

Wido

> > I don't know the answer to your question, but lost files are not good.
> > 
> > You might find them in a lost+found directory if XFS repair worked?
> > 
> 
> Sadly this directory is empty.
> 
> -- Dan
> 
> > Wido
> > 
> >> Thanks,
> >> 
> >> -- Dan J___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph all NVME Cluster sequential read speed

2016-08-18 Thread nick
Hi,
we are currently building a new ceph cluster with only NVME devices. One Node 
consists of 4x Intel P3600 2TB devices. Journal and filestore are on the same 
device. Each server has a 10 core CPU and uses 10 GBit ethernet NICs for 
public and ceph storage traffic. We are currently testing with 4 nodes overall. 

The cluster will be used only for virtual machine images via RBD. The pools 
are replicated (no EC).

Although we are pretty happy with the single-threaded write performance, the 
single-threaded (iodepth=1) sequential read performance is a bit 
disappointing.

We are testing with fio and the rbd engine. After creating a 10GB RBD image, we 
use the following fio params to test:
"""
[global]
invalidate=1
ioengine=rbd
iodepth=1
ramp_time=2
size=2G
bs=4k
direct=1
buffered=0
"""

For a 4k workload we are reaching 1382 IOPS. Testing one NVME device directly 
(with psync engine and iodepth of 1) we can reach up to 84176 IOPS. This is a 
big difference.

I already read that the read_ahead setting might improve the situation, 
although this would only be true when using buffered reads, right?

Does anyone have other suggestions to get better serial read performance?

Cheers
Nick
 
-- 
Sebastian Nickel
Nine Internet Solutions AG, Albisriederstr. 243a, CH-8047 Zuerich
Tel +41 44 637 40 00 | Support +41 44 637 40 40 | www.nine.ch

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph all NVME Cluster sequential read speed

2016-08-18 Thread w...@42on.com


> Op 18 aug. 2016 om 10:15 heeft nick  het volgende geschreven:
> 
> Hi,
> we are currently building a new ceph cluster with only NVME devices. One Node 
> consists of 4x Intel P3600 2TB devices. Journal and filestore are on the same 
> device. Each server has a 10 core CPU and uses 10 GBit ethernet NICs for 
> public and ceph storage traffic. We are currently testing with 4 nodes 
> overall. 
> 
> The cluster will be used only for virtual machine images via RBD. The pools 
> are replicated (no EC).
> 
> Altough we are pretty happy with the single threaded write performance, the 
> single threaded (iodepth=1) sequential read performance is a bit 
> disappointing.
> 
> We are testing with fio and the rbd engine. After creating a 10GB RBD image, 
> we 
> use the following fio params to test:
> """
> [global]
> invalidate=1
> ioengine=rbd
> iodepth=1
> ramp_time=2
> size=2G
> bs=4k
> direct=1
> buffered=0
> """
> 
> For a 4k workload we are reaching 1382 IOPS. Testing one NVME device directly 
> (with psync engine and iodepth of 1) we can reach up to 84176 IOPS. This is a 
> big difference.
> 

The network makes a big difference as well. Keep in mind that the Ceph OSDs 
also have to process the I/O.

For example, with a network latency of 0.200 ms you can potentially do at 
most 5,000 IOPS in 1,000 ms (1 s), and that is before the OSD or any other 
layer does any work.


> I already read that the read_ahead setting might improve the situation, 
> although this would only be true when using buffered reads, right?
> 
> Does anyone have other suggestions to get better serial read performance?
> 

You might want to disable all logging and look at AsyncMessenger. Disabling 
cephx might help, but that is not very safe to do.
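For the logging part, zeroing the debug levels in ceph.conf is the usual 
approach, e.g. (only a sketch, showing just the most chatty subsystems):

  [global]
  debug ms = 0/0
  debug osd = 0/0
  debug filestore = 0/0
  debug journal = 0/0
  debug auth = 0/0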

Wido

> Cheers
> Nick
> 
> -- 
> Sebastian Nickel
> Nine Internet Solutions AG, Albisriederstr. 243a, CH-8047 Zuerich
> Tel +41 44 637 40 00 | Support +41 44 637 40 40 | www.nine.ch
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] radosgw error in its log rgw_bucket_sync_user_stats()

2016-08-18 Thread zhu tong
Hi all,

Version: 0.94.7
radosgw has reported the following error:

2016-08-16 15:26:06.883957 7fc2f0bfe700  0 ERROR: rgw_bucket_sync_user_stats() 
for user=user1, 
bucket=2537e61b32ca783432138237f234e610d1ee186e(@{i=.rgw.buckets.index,e=.rgw.buckets.extra}.rgw.buckets[default.4151.167])
 returned -2
2016-08-16 15:26:06.883989 7fc2f0bfe700  0 WARNING: sync_bucket() returned r=-2

Errors like this happen for all of user1's buckets during that time.

What caused this error, and what would it affect?


Thanks.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph all NVME Cluster sequential read speed

2016-08-18 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> w...@42on.com
> Sent: 18 August 2016 09:35
> To: nick 
> Cc: ceph-users 
> Subject: Re: [ceph-users] Ceph all NVME Cluster sequential read speed
> 
> 
> 
> > Op 18 aug. 2016 om 10:15 heeft nick  het volgende geschreven:
> >
> > Hi,
> > we are currently building a new ceph cluster with only NVME devices.
> > One Node consists of 4x Intel P3600 2TB devices. Journal and filestore
> > are on the same device. Each server has a 10 core CPU and uses 10 GBit
> > ethernet NICs for public and ceph storage traffic. We are currently testing 
> > with 4 nodes overall.
> >
> > The cluster will be used only for virtual machine images via RBD. The
> > pools are replicated (no EC).
> >
> > Altough we are pretty happy with the single threaded write
> > performance, the single threaded (iodepth=1) sequential read
> > performance is a bit disappointing.
> >
> > We are testing with fio and the rbd engine. After creating a 10GB RBD
> > image, we use the following fio params to test:
> > """
> > [global]
> > invalidate=1
> > ioengine=rbd
> > iodepth=1
> > ramp_time=2
> > size=2G
> > bs=4k
> > direct=1
> > buffered=0
> > """
> >
> > For a 4k workload we are reaching 1382 IOPS. Testing one NVME device
> > directly (with psync engine and iodepth of 1) we can reach up to 84176
> > IOPS. This is a big difference.
> >
> 
> Network is a big difference as well. Keep in mind the Ceph OSDs have to 
> process the I/O as well.
> 
> For example, if you have a network latency of 0.200ms, in 1.000ms (1 sec) you 
> will be able to potentially do 5.000 IOps, but that
is
> without the OSD or any other layers doing any work.
> 
> 
> > I already read that the read_ahead setting might improve the
> > situation, although this would only be true when using buffered reads, 
> > right?
> >
> > Does anyone have other suggestions to get better serial read performance?
> >
> 
> You might want to disable all logging and look at AsyncMessenger. Disabling 
> cephx might help, but that is not very safe to do.

Just to add to what Wido has mentioned: the problem is serialised latency. 
The effect of the network and the Ceph code means that each IO request has 
to travel much further than it would over a local SATA cable.

The trick is to try and remove as much of this as possible where you can. 
Wido has already mentioned one good option: turning off logging. One thing I 
have found helps massively is to force the CPU C-state to 1 and pin the CPUs 
at their maximum frequency. Otherwise the CPUs can spend up to 200us waking 
up from deep sleep several times per IO. Doing this I managed to get my 4kb 
write latency for a 3x replica pool down to 600us!!

So stick this on your kernel boot line 

intel_idle.max_cstate=1

and stick this somewhere like your rc.local

echo 100 > /sys/devices/system/cpu/intel_pstate/min_perf_pct

Although there may be some gain in setting it to 90-95%, so that when only 
one core is active it can turbo slightly higher.

Also, since you are using the RBD engine in fio, you should be able to use 
readahead caching even with direct I/O. You just need to enable it in the 
ceph.conf on the client machine where you are running fio.
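Something along these lines in the client-side ceph.conf should do it (a 
sketch; the values are just examples and worth tuning):

  [client]
  rbd cache = true
  rbd readahead trigger requests = 1
  rbd readahead max bytes = 4194304
  rbd readahead disable after bytes = 0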

Nick

> 
> Wido
> 
> > Cheers
> > Nick
> >
> > --
> > Sebastian Nickel
> > Nine Internet Solutions AG, Albisriederstr. 243a, CH-8047 Zuerich Tel
> > +41 44 637 40 00 | Support +41 44 637 40 40 | www.nine.ch
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph all NVME Cluster sequential read speed

2016-08-18 Thread nick
Thanks for all the answers,
we will disable logging and check the C-state/CPU frequency pinning. I had 
not heard of the async messenger so far. After checking the mailing list it 
looks like one can test it with the ms_type = async option. I did not find 
any documentation for it (it looks like a quite recently added feature). We 
might try it out as well.
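From what I can tell it is just a ceph.conf setting that needs a daemon 
restart, something like this (untested on our side so far):

  [global]
  ms type = async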

I will post our new benchmark results once we have tested.

Cheers
Nick


On Thursday, August 18, 2016 10:35:06 AM w...@42on.com wrote:
> AsyncMessenger
 
-- 
Sebastian Nickel
Nine Internet Solutions AG, Albisriederstr. 243a, CH-8047 Zuerich
Tel +41 44 637 40 00 | Support +41 44 637 40 40 | www.nine.ch

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Designing ceph cluster

2016-08-18 Thread Mart van Santen
Dear Gaurav,

Please respect everyone's time and timezone differences. Flooding the
mailing list won't help.

see below,



On 08/18/2016 01:39 AM, Gaurav Goyal wrote:
> Dear Ceph Users,
>
> Awaiting some suggestion please!
>
>
>
> On Wed, Aug 17, 2016 at 11:15 AM, Gaurav Goyal
> mailto:er.gauravgo...@gmail.com>> wrote:
>
> Hello Mart,
>
> Thanks a lot for the detailed information!
> Please find my response inline and help me to get more knowledge on it
>
>
> Ceph works best with more hardware. It is not really designed for
> small scale setups. Of course small setups can work for a PoC or
> testing, but I would not advise this for production.
>
> [Gaurav] : We need this setup for PoC or testing. 
>
> If you want to proceed however, have a good look the manuals or
> this mailinglist archive and do invest some time to understand the
> logic and workings of ceph before working or ordering hardware
>
> At least you want: 
> - 3 monitors, preferable on dedicated servers
> [Gaurav] : With my current setup, can i install MON on Host 1 -->
> Controller + Compute1, Host 2 and Host 3
>
> - Per disk you will be running an ceph-osd instance. So a host
> with 2 disks will run 2 osd instances. More OSD process is better
> performance, but also more memory and cpu usage.
>
> [Gaurav] : Understood, That means having 1T x 4 would be better
> than 2T x 2.
>
Yes, more disks will do more IO
>
>
> - Per default ceph uses a replication factor of 3 (it is possible
> to set this to 2, but is not advised)
> - You can not fill up disks to 100%, also data will not distribute
> even over all disks, expect disks to be filled up (on average)
> maximum to 60-70%. You want to add more disks once you reach this
> limit.
>
> All on all, with a setup of 3 hosts, with 2x2TB disks, this will
> result in a net data availablity of (3x2x2TBx0.6)/3 = 2.4 TB 
>
> [Gaurav] : As this is going to be a test lab environment, can we
> change the configuration to have more capacity rather than
> redundancy? How can we achieve it?
>

Ceph has excellent documentation and this is easy to find; search for "the 
number of replicas". You want to set both "size" and "min_size" to 1 in 
this case.
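For an existing pool that boils down to something like the following (the 
pool name is just an example; with a replication factor of 1 any single disk 
failure loses data, so only do this on a throwaway test setup):

  ceph osd pool set rbd size 1
  ceph osd pool set rbd min_size 1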

> If speed is required, consider SSD's (for data & journals, or only
> journals).
>
> In you email you mention "compute1/2/3", please note, if you use
> the rbd kernel driver, this can interfere with the OSD process and
> is not advised to run OSD and Kernel driver on the same hardware.
> If you still want to do that, split it up using VMs (we have a
> small testing cluster where we do mix compute and storage, there
> we have the OSDs running in VMs)
>
> [Gaurav] : within my mentioned environment, How can we split rbd
> kernel driver and OSD process? Should it be like rbd kernel driver
> on controller and OSD processes on compute hosts?
>
> Since my host 1 is controller + Compute1, Can you please share the
> steps to split it up using VMs and suggested by you.
>

We are running kernel RBD in dom0 and OSDs in domU, as well as a monitor
in domU.

Regards,

Mart



>
> Regards
> Gaurav Goyal 
>
>
> On Wed, Aug 17, 2016 at 9:28 AM, Mart van Santen
> mailto:m...@greenhost.nl>> wrote:
>
>
> Dear Gaurav,
>
> Ceph works best with more hardware. It is not really designed
> for small scale setups. Of course small setups can work for a
> PoC or testing, but I would not advise this for production.
>
> If you want to proceed however, have a good look the manuals
> or this mailinglist archive and do invest some time to
> understand the logic and workings of ceph before working or
> ordering hardware
>
> At least you want:
> - 3 monitors, preferable on dedicated servers
> - Per disk you will be running an ceph-osd instance. So a host
> with 2 disks will run 2 osd instances. More OSD process is
> better performance, but also more memory and cpu usage.
> - Per default ceph uses a replication factor of 3 (it is
> possible to set this to 2, but is not advised)
> - You can not fill up disks to 100%, also data will not
> distribute even over all disks, expect disks to be filled up
> (on average) maximum to 60-70%. You want to add more disks
> once you reach this limit.
>
> All on all, with a setup of 3 hosts, with 2x2TB disks, this
> will result in a net data availablity of (3x2x2TBx0.6)/3 = 2.4 TB
>
>
> If speed is required, consider SSD's (for data & journals, or
> only journals).
>
> In you email you mention "compute1/2/3", please note, if you
> use the rbd kernel driver, this can interfere with the OSD
> process and is not advised to run OSD and Kernel driver on the
> same hardware. If you still 

[ceph-users] Signature V2

2016-08-18 Thread jan hugo prins
Hi everyone.

To connect to my S3 gateways using s3cmd I had to set the option
signature_v2 in my s3cfg to true.
If I didn't do that I would get signature mismatch errors, and this seems
to be because Amazon uses Signature Version 4 while the S3 gateway of
Ceph only supports Signature Version 2.
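For reference, that is this line in ~/.s3cfg (the same can also be passed as
the --signature-v2 command-line switch of s3cmd):

  signature_v2 = True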

Now I see the following error in a Java project we are building that
should talk to S3.

Aug 18, 2016 12:12:38 PM org.apache.catalina.core.StandardWrapperValve
invoke
SEVERE: Servlet.service() for servlet [Default] in context with path
[/VehicleData] threw exception
com.betterbe.vd.web.servlet.LsExceptionWrapper: xxx
caused: com.amazonaws.services.s3.model.AmazonS3Exception: null
(Service: Amazon S3; Status Code: 400; Error Code:
XAmzContentSHA256Mismatch; Request ID:
tx02cc6-0057b58a15-25bba-default), S3 Extended Request
ID: 25bba-default-default
at
com.betterbe.vd.web.dataset.requesthandler.DatasetRequestHandler.handle(DatasetRequestHandler.java:262)
at com.betterbe.vd.web.servlet.Servlet.handler(Servlet.java:141)
at com.betterbe.vd.web.servlet.Servlet.doPost(Servlet.java:110)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:646)

To me this looks a bit the same, though I'm not a Java developer.
Am I correct, and if so, can I tell the Java S3 client to use Version 2
signatures?


-- 
Met vriendelijke groet / Best regards,

Jan Hugo Prins
Infra and Isilon storage consultant

Better.be B.V.
Auke Vleerstraat 140 E | 7547 AN Enschede | KvK 08097527
T +31 (0) 53 48 00 694 | M +31 (0)6 26 358 951
jpr...@betterbe.com | www.betterbe.com

This e-mail is intended exclusively for the addressee(s), and may not
be passed on to, or made available for use by any person other than 
the addressee(s). Better.be B.V. rules out any and every liability 
resulting from any electronic transmission.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Signature V2

2016-08-18 Thread jan hugo prins
I did some more searching, and according to some info I found, RGW should
support V4 signatures.

http://tracker.ceph.com/issues/10333
http://tracker.ceph.com/issues/11858

The fact that everyone still modifies s3cmd to use Version 2 Signatures
suggests to me that we have a bug in this code.

If I use V4 signatures most of my requests work fine, but some requests
fail on a signature error.

Thanks,
Jan Hugo Prins


On 08/18/2016 12:46 PM, jan hugo prins wrote:
> Hi everyone.
>
> To connect to my S3 gateways using s3cmd I had to set the option
> signature_v2 in my s3cfg to true.
> If I didn't do that I would get Signature mismatch errors and this seems
> to be because Amazon uses Signature version 4 while the S3 gateway of
> Ceph only supports Signature Version 2.
>
> Now I see the following error in a Jave project we are building that
> should talk to S3.
>
> Aug 18, 2016 12:12:38 PM org.apache.catalina.core.StandardWrapperValve
> invoke
> SEVERE: Servlet.service() for servlet [Default] in context with path
> [/VehicleData] threw exception
> com.betterbe.vd.web.servlet.LsExceptionWrapper: xxx
> caused: com.amazonaws.services.s3.model.AmazonS3Exception: null
> (Service: Amazon S3; Status Code: 400; Error Code:
> XAmzContentSHA256Mismatch; Request ID:
> tx02cc6-0057b58a15-25bba-default), S3 Extended Request
> ID: 25bba-default-default
> at
> com.betterbe.vd.web.dataset.requesthandler.DatasetRequestHandler.handle(DatasetRequestHandler.java:262)
> at com.betterbe.vd.web.servlet.Servlet.handler(Servlet.java:141)
> at com.betterbe.vd.web.servlet.Servlet.doPost(Servlet.java:110)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:646)
>
> To me this looks a bit the same, though I'm not a Java developer.
> Am I correct, and if so, can I tell the Java S3 client to use Version 2
> signatures?
>
>

-- 
Met vriendelijke groet / Best regards,

Jan Hugo Prins
Infra and Isilon storage consultant

Better.be B.V.
Auke Vleerstraat 140 E | 7547 AN Enschede | KvK 08097527
T +31 (0) 53 48 00 694 | M +31 (0)6 26 358 951
jpr...@betterbe.com | www.betterbe.com

This e-mail is intended exclusively for the addressee(s), and may not
be passed on to, or made available for use by any person other than 
the addressee(s). Better.be B.V. rules out any and every liability 
resulting from any electronic transmission.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph all NVME Cluster sequential read speed

2016-08-18 Thread nick
So after disabling logging and setting intel_idle.max_cstate=1 we reach 1953 
IOPS for 4k blocksizes (with an iodepth of 1) instead of 1382. This is an 
increase of 41%. Very cool.

Furthermore I played a bit with striping in RBD images. When choosing a 1MB 
stripe unit and a stripe count of 4 there is a huge difference when 
benchmarking with bigger block sizes (with a 4MB blocksize I get twice the 
speed). Benchmarking with 4k blocksizes I see almost no difference from the 
default images (stripe-unit=4M and stripe-count=1).

Did anyone else play with different stripe units in the images? I guess the 
stripe unit depends on the expected work pattern in the virtual machine.
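For reference, the striped test images were created roughly like this (a 
sketch; the pool/image name and size are just examples):

  rbd create testpool/striped-img --size 10240 --image-format 2 \
      --stripe-unit 1048576 --stripe-count 4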

Cheers
Nick

On Thursday, August 18, 2016 10:23:34 AM Nick Fisk wrote:
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> > w...@42on.com Sent: 18 August 2016 09:35
> > To: nick 
> > Cc: ceph-users 
> > Subject: Re: [ceph-users] Ceph all NVME Cluster sequential read speed
> > 
> > > Op 18 aug. 2016 om 10:15 heeft nick  het volgende
> > > geschreven:
> > > 
> > > Hi,
> > > we are currently building a new ceph cluster with only NVME devices.
> > > One Node consists of 4x Intel P3600 2TB devices. Journal and filestore
> > > are on the same device. Each server has a 10 core CPU and uses 10 GBit
> > > ethernet NICs for public and ceph storage traffic. We are currently
> > > testing with 4 nodes overall.
> > > 
> > > The cluster will be used only for virtual machine images via RBD. The
> > > pools are replicated (no EC).
> > > 
> > > Altough we are pretty happy with the single threaded write
> > > performance, the single threaded (iodepth=1) sequential read
> > > performance is a bit disappointing.
> > > 
> > > We are testing with fio and the rbd engine. After creating a 10GB RBD
> > > image, we use the following fio params to test:
> > > """
> > > [global]
> > > invalidate=1
> > > ioengine=rbd
> > > iodepth=1
> > > ramp_time=2
> > > size=2G
> > > bs=4k
> > > direct=1
> > > buffered=0
> > > """
> > > 
> > > For a 4k workload we are reaching 1382 IOPS. Testing one NVME device
> > > directly (with psync engine and iodepth of 1) we can reach up to 84176
> > > IOPS. This is a big difference.
> > 
> > Network is a big difference as well. Keep in mind the Ceph OSDs have to
> > process the I/O as well.
> > 
> > For example, if you have a network latency of 0.200ms, in 1.000ms (1 sec)
> > you will be able to potentially do 5.000 IOps, but that
> is
> 
> > without the OSD or any other layers doing any work.
> > 
> > > I already read that the read_ahead setting might improve the
> > > situation, although this would only be true when using buffered reads,
> > > right?
> > > 
> > > Does anyone have other suggestions to get better serial read
> > > performance?
> > 
> > You might want to disable all logging and look at AsyncMessenger.
> > Disabling cephx might help, but that is not very safe to do.
> Just to add what Wido has mentioned. The problem is latency serialisation,
> the effect of the network, ceph code means that each IO request has to
> travel further than if you are comparing to a local SATA cable.
> 
> The trick is to try and remove as much of this as possible where you can.
> Wido has mentioned 1 good option of turning off logging. One thing I have
> found which helps massively is to force the CPU c-state to 1 and pin the
> CPU's at their max frequency. Otherwise the CPU's can spend up to 200us
> waking up from deep sleep several times every IO. Doing this I managed to
> get my 4kb write latency for a 3x replica pool down to 600us!!
> 
> So stick this on your kernel boot line
> 
> intel_idle.max_cstate=1
> 
> and stick this somewhere like your rc.local
> 
> echo 100 > /sys/devices/system/cpu/intel_pstate/min_perf_pct
> 
> Although there maybe some gains to setting it to 90-95%, so that when only 1
> core is active it can turbo slightly higher.
> 
> Also since you are using the RBD engine in fio you should be able to use
> readahead caching with directio. You just need to enable it in your
> ceph.conf on the client machine where you are running fio.
> 
> Nick
> 
> > Wido
> > 
> > > Cheers
> > > Nick
> > > 
> > > --
> > > Sebastian Nickel
> > > Nine Internet Solutions AG, Albisriederstr. 243a, CH-8047 Zuerich Tel
> > > +41 44 637 40 00 | Support +41 44 637 40 40 | www.nine.ch
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > 
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
-- 
Sebastian Nickel
Nine Internet Solutions AG, Albisriederstr. 243a, CH-8047 Zuerich
Tel +41 44 637 40 00 | Support +41 44 637 40 40 | www.nine.ch

___

Re: [ceph-users] Ceph all NVME Cluster sequential read speed

2016-08-18 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of nick
> Sent: 18 August 2016 12:39
> To: n...@fisk.me.uk
> Cc: 'ceph-users' 
> Subject: Re: [ceph-users] Ceph all NVME Cluster sequential read speed
> 
> So after disabling logging and setting intel_idle.max_cstate=1 we reach 1953 
> IOPS for 4k blocksizes (with an iodepth of 1) instead
of
> 1382. This is an increase of 41%. Very cool.
> 
> Furthermore I played a bit with striping in RBD images. When choosing a 1MB 
> stripe unit and a stripe count of 4 there is a huge
> difference when benchmarking with bigger block sizes (with 4MB blocksize I 
> get twice the speed). Benchmarking this with 4k
> blocksizes I can see almost no difference to the default images 
> (stripe-unit=4M and stripe-count=1).
> 
> Did anyone else play with different stripe units in the images? I guess the 
> stripe unit depends on the expected work pattern in
the
> virtual machine.

An RBD image is already striped in object-sized chunks; the difference from 
RAID stripes is the size of the chunks/objects involved. A RAID array might 
chunk into 64kb stripes, which means that even a small readahead will likely 
cause a read across all chunks of the stripe, giving very good performance. 
In Ceph the chunks are 4MB, which means that if you want to read across 
multiple objects you need a readahead larger than 4MB.

Image-level striping is more about lowering contention on a single PG than 
about improving sequential performance. I.e. you might have a couple of MB 
worth of data that is being hit by thousands of IO requests; by using 
striping you can spread these requests over more PGs. There is a point in 
the data path of a PG that is effectively single-threaded.

If you want to improve sequential reads you want to use buffered IO and use a 
large read ahead (>16M).
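For a kernel-mapped RBD (e.g. inside the VM, rather than the fio rbd engine) 
that just means bumping the block device readahead, for example (the device 
name is a placeholder):

  # roughly 16MB of readahead on a mapped rbd device
  echo 16384 > /sys/block/rbd0/queue/read_ahead_kb
  # or equivalently (value is in 512-byte sectors)
  blockdev --setra 32768 /dev/rbd0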

> 
> Cheers
> Nick
> 
> On Thursday, August 18, 2016 10:23:34 AM Nick Fisk wrote:
> > > -Original Message-
> > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> > > Behalf Of w...@42on.com Sent: 18 August 2016 09:35
> > > To: nick 
> > > Cc: ceph-users 
> > > Subject: Re: [ceph-users] Ceph all NVME Cluster sequential read
> > > speed
> > >
> > > > Op 18 aug. 2016 om 10:15 heeft nick  het volgende
> > > > geschreven:
> > > >
> > > > Hi,
> > > > we are currently building a new ceph cluster with only NVME devices.
> > > > One Node consists of 4x Intel P3600 2TB devices. Journal and
> > > > filestore are on the same device. Each server has a 10 core CPU
> > > > and uses 10 GBit ethernet NICs for public and ceph storage
> > > > traffic. We are currently testing with 4 nodes overall.
> > > >
> > > > The cluster will be used only for virtual machine images via RBD.
> > > > The pools are replicated (no EC).
> > > >
> > > > Altough we are pretty happy with the single threaded write
> > > > performance, the single threaded (iodepth=1) sequential read
> > > > performance is a bit disappointing.
> > > >
> > > > We are testing with fio and the rbd engine. After creating a 10GB
> > > > RBD image, we use the following fio params to test:
> > > > """
> > > > [global]
> > > > invalidate=1
> > > > ioengine=rbd
> > > > iodepth=1
> > > > ramp_time=2
> > > > size=2G
> > > > bs=4k
> > > > direct=1
> > > > buffered=0
> > > > """
> > > >
> > > > For a 4k workload we are reaching 1382 IOPS. Testing one NVME
> > > > device directly (with psync engine and iodepth of 1) we can reach
> > > > up to 84176 IOPS. This is a big difference.
> > >
> > > Network is a big difference as well. Keep in mind the Ceph OSDs have
> > > to process the I/O as well.
> > >
> > > For example, if you have a network latency of 0.200ms, in 1.000ms (1
> > > sec) you will be able to potentially do 5.000 IOps, but that
> > is
> >
> > > without the OSD or any other layers doing any work.
> > >
> > > > I already read that the read_ahead setting might improve the
> > > > situation, although this would only be true when using buffered
> > > > reads, right?
> > > >
> > > > Does anyone have other suggestions to get better serial read
> > > > performance?
> > >
> > > You might want to disable all logging and look at AsyncMessenger.
> > > Disabling cephx might help, but that is not very safe to do.
> > Just to add what Wido has mentioned. The problem is latency
> > serialisation, the effect of the network, ceph code means that each IO
> > request has to travel further than if you are comparing to a local SATA 
> > cable.
> >
> > The trick is to try and remove as much of this as possible where you can.
> > Wido has mentioned 1 good option of turning off logging. One thing I
> > have found which helps massively is to force the CPU c-state to 1 and
> > pin the CPU's at their max frequency. Otherwise the CPU's can spend up
> > to 200us waking up from deep sleep several times every IO. Doing this
> > I managed to get my 4kb write latency for a 3x replica pool down to 600us!!
> >
> > So stick t

Re: [ceph-users] Ceph all NVME Cluster sequential read speed

2016-08-18 Thread nick
Thanks for the explanation. I thought that when using a striped image, 4MB 
of written data would be placed in 4 objects (with a 4MB object size, a 1MB 
stripe unit and a stripe count of 4). With that, a single 4MB read would hit 
4 objects, which might be in different PGs, so the read speed should 
increase. Maybe I got that part wrong :-)
I might get the same speed improvement by using an object size of 1MB 
directly on the image.

Cheers
Nick

On Thursday, August 18, 2016 01:37:46 PM Nick Fisk wrote:
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> > nick Sent: 18 August 2016 12:39
> > To: n...@fisk.me.uk
> > Cc: 'ceph-users' 
> > Subject: Re: [ceph-users] Ceph all NVME Cluster sequential read speed
> > 
> > So after disabling logging and setting intel_idle.max_cstate=1 we reach
> > 1953 IOPS for 4k blocksizes (with an iodepth of 1) instead
> of
> 
> > 1382. This is an increase of 41%. Very cool.
> > 
> > Furthermore I played a bit with striping in RBD images. When choosing a
> > 1MB stripe unit and a stripe count of 4 there is a huge difference when
> > benchmarking with bigger block sizes (with 4MB blocksize I get twice the
> > speed). Benchmarking this with 4k blocksizes I can see almost no
> > difference to the default images (stripe-unit=4M and stripe-count=1).
> > 
> > Did anyone else play with different stripe units in the images? I guess
> > the stripe unit depends on the expected work pattern in
> the
> 
> > virtual machine.
> 
> The RBD is already striped in object sized chunks, the difference to RAID
> stripes is the size of the chunks/objects involved. A RAID array might
> chunk into 64kb chunks, this will mean that even a small readahead will
> likely cause a read across all chunks of the stripe, giving very good
> performance. In Ceph, the chunks are 4MB which means if you want to read
> across multiple objects, you will need a readahead at least bigger than
> 4MB.
> 
> The image level striping is more to do with lowering contention on a single
> PG, rather than to improve sequential performance. Ie you might have a
> couple of MB worth of data that is being hit by thousands of IO requests.
> By using striping you can try and spread these requests over more PG's.
> There is a point in the data path of a PG that is effectively single
> threaded.
> 
> If you want to improve sequential reads you want to use buffered IO and use
> a large read ahead (>16M).
> > Cheers
> > Nick
> > 
> > On Thursday, August 18, 2016 10:23:34 AM Nick Fisk wrote:
> > > > -Original Message-
> > > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> > > > Behalf Of w...@42on.com Sent: 18 August 2016 09:35
> > > > To: nick 
> > > > Cc: ceph-users 
> > > > Subject: Re: [ceph-users] Ceph all NVME Cluster sequential read
> > > > speed
> > > > 
> > > > > Op 18 aug. 2016 om 10:15 heeft nick  het volgende
> > > > > geschreven:
> > > > > 
> > > > > Hi,
> > > > > we are currently building a new ceph cluster with only NVME devices.
> > > > > One Node consists of 4x Intel P3600 2TB devices. Journal and
> > > > > filestore are on the same device. Each server has a 10 core CPU
> > > > > and uses 10 GBit ethernet NICs for public and ceph storage
> > > > > traffic. We are currently testing with 4 nodes overall.
> > > > > 
> > > > > The cluster will be used only for virtual machine images via RBD.
> > > > > The pools are replicated (no EC).
> > > > > 
> > > > > Altough we are pretty happy with the single threaded write
> > > > > performance, the single threaded (iodepth=1) sequential read
> > > > > performance is a bit disappointing.
> > > > > 
> > > > > We are testing with fio and the rbd engine. After creating a 10GB
> > > > > RBD image, we use the following fio params to test:
> > > > > """
> > > > > [global]
> > > > > invalidate=1
> > > > > ioengine=rbd
> > > > > iodepth=1
> > > > > ramp_time=2
> > > > > size=2G
> > > > > bs=4k
> > > > > direct=1
> > > > > buffered=0
> > > > > """
> > > > > 
> > > > > For a 4k workload we are reaching 1382 IOPS. Testing one NVME
> > > > > device directly (with psync engine and iodepth of 1) we can reach
> > > > > up to 84176 IOPS. This is a big difference.
> > > > 
> > > > Network is a big difference as well. Keep in mind the Ceph OSDs have
> > > > to process the I/O as well.
> > > > 
> > > > For example, if you have a network latency of 0.200ms, in 1.000ms (1
> > > > sec) you will be able to potentially do 5.000 IOps, but that
> > > 
> > > is
> > > 
> > > > without the OSD or any other layers doing any work.
> > > > 
> > > > > I already read that the read_ahead setting might improve the
> > > > > situation, although this would only be true when using buffered
> > > > > reads, right?
> > > > > 
> > > > > Does anyone have other suggestions to get better serial read
> > > > > performance?
> > > > 
> > > > You might want to disable all logging and look at AsyncMessenger.
> > > > Disabling cephx might help,

Re: [ceph-users] Ceph all NVME Cluster sequential read speed

2016-08-18 Thread Nick Fisk


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of nick
> Sent: 18 August 2016 14:02
> To: n...@fisk.me.uk
> Cc: 'ceph-users' 
> Subject: Re: [ceph-users] Ceph all NVME Cluster sequential read speed
> 
> Thanks for the explanation. I thought that when using a striped image 4MB of 
> written data will be placed in 4 objects (with 4MB
object
> size and when using 1MB of stripe unit and a count of 4). With that a single 
> read of 4MB will hit
> 4 objects which might be in different PGs. So the read speed should be 
> increased. Maybe I got that part wrong :-) I might have the
> same speed improvement when using an object size of 1MB directly on the image.

Yes, that is correct. But you were sending 4k IOs, so it wouldn't have 
changed much, apart from the data possibly not being in the OSD page cache 
because you are jumping around PGs. Another factor is latency again: with a 
4MB object you do one read to fetch 4MB, whereas with 4x1MB objects you have 
to issue more IOs through Ceph, which incurs a slight latency penalty and 
might be why you see slightly lower performance.

> 
> Cheers
> Nick
> 
> On Thursday, August 18, 2016 01:37:46 PM Nick Fisk wrote:
> > > -Original Message-
> > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> > > Behalf Of nick Sent: 18 August 2016 12:39
> > > To: n...@fisk.me.uk
> > > Cc: 'ceph-users' 
> > > Subject: Re: [ceph-users] Ceph all NVME Cluster sequential read
> > > speed
> > >
> > > So after disabling logging and setting intel_idle.max_cstate=1 we
> > > reach
> > > 1953 IOPS for 4k blocksizes (with an iodepth of 1) instead
> > of
> >
> > > 1382. This is an increase of 41%. Very cool.
> > >
> > > Furthermore I played a bit with striping in RBD images. When
> > > choosing a 1MB stripe unit and a stripe count of 4 there is a huge
> > > difference when benchmarking with bigger block sizes (with 4MB
> > > blocksize I get twice the speed). Benchmarking this with 4k
> > > blocksizes I can see almost no difference to the default images 
> > > (stripe-unit=4M and stripe-count=1).
> > >
> > > Did anyone else play with different stripe units in the images? I
> > > guess the stripe unit depends on the expected work pattern in
> > the
> >
> > > virtual machine.
> >
> > The RBD is already striped in object sized chunks, the difference to
> > RAID stripes is the size of the chunks/objects involved. A RAID array
> > might chunk into 64kb chunks, this will mean that even a small
> > readahead will likely cause a read across all chunks of the stripe,
> > giving very good performance. In Ceph, the chunks are 4MB which means
> > if you want to read across multiple objects, you will need a readahead
> > at least bigger than 4MB.
> >
> > The image level striping is more to do with lowering contention on a
> > single PG, rather than to improve sequential performance. Ie you might
> > have a couple of MB worth of data that is being hit by thousands of IO 
> > requests.
> > By using striping you can try and spread these requests over more PG's.
> > There is a point in the data path of a PG that is effectively single
> > threaded.
> >
> > If you want to improve sequential reads you want to use buffered IO
> > and use a large read ahead (>16M).
> > > Cheers
> > > Nick
> > >
> > > On Thursday, August 18, 2016 10:23:34 AM Nick Fisk wrote:
> > > > > -Original Message-
> > > > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> > > > > Behalf Of w...@42on.com Sent: 18 August 2016 09:35
> > > > > To: nick 
> > > > > Cc: ceph-users 
> > > > > Subject: Re: [ceph-users] Ceph all NVME Cluster sequential read
> > > > > speed
> > > > >
> > > > > > Op 18 aug. 2016 om 10:15 heeft nick  het
> > > > > > volgende
> > > > > > geschreven:
> > > > > >
> > > > > > Hi,
> > > > > > we are currently building a new ceph cluster with only NVME devices.
> > > > > > One Node consists of 4x Intel P3600 2TB devices. Journal and
> > > > > > filestore are on the same device. Each server has a 10 core
> > > > > > CPU and uses 10 GBit ethernet NICs for public and ceph storage
> > > > > > traffic. We are currently testing with 4 nodes overall.
> > > > > >
> > > > > > The cluster will be used only for virtual machine images via RBD.
> > > > > > The pools are replicated (no EC).
> > > > > >
> > > > > > Altough we are pretty happy with the single threaded write
> > > > > > performance, the single threaded (iodepth=1) sequential read
> > > > > > performance is a bit disappointing.
> > > > > >
> > > > > > We are testing with fio and the rbd engine. After creating a
> > > > > > 10GB RBD image, we use the following fio params to test:
> > > > > > """
> > > > > > [global]
> > > > > > invalidate=1
> > > > > > ioengine=rbd
> > > > > > iodepth=1
> > > > > > ramp_time=2
> > > > > > size=2G
> > > > > > bs=4k
> > > > > > direct=1
> > > > > > buffered=0
> > > > > > """
> > > > > >
> > > > > > For a 4k workload we are reaching 1382 IOPS. Testi

Re: [ceph-users] Signature V2

2016-08-18 Thread Chris Jones
I believe RGW Hammer and below use V2 and Jewel and above use V4.

Thanks

On Thu, Aug 18, 2016 at 7:32 AM, jan hugo prins  wrote:

> did some more searching and according to some info I found RGW should
> support V4 signatures.
>
> http://tracker.ceph.com/issues/10333
> http://tracker.ceph.com/issues/11858
>
> The fact that everyone still modifies s3cmd to use Version 2 Signatures
> suggests to me that we have a bug in this code.
>
> If I use V4 signatures most of my requests work fine, but some requests
> fail on a signature error.
>
> Thanks,
> Jan Hugo Prins
>
>
> On 08/18/2016 12:46 PM, jan hugo prins wrote:
> > Hi everyone.
> >
> > To connect to my S3 gateways using s3cmd I had to set the option
> > signature_v2 in my s3cfg to true.
> > If I didn't do that I would get Signature mismatch errors and this seems
> > to be because Amazon uses Signature version 4 while the S3 gateway of
> > Ceph only supports Signature Version 2.
> >
> > Now I see the following error in a Jave project we are building that
> > should talk to S3.
> >
> > Aug 18, 2016 12:12:38 PM org.apache.catalina.core.StandardWrapperValve
> > invoke
> > SEVERE: Servlet.service() for servlet [Default] in context with path
> > [/VehicleData] threw exception
> > com.betterbe.vd.web.servlet.LsExceptionWrapper: xxx
> > caused: com.amazonaws.services.s3.model.AmazonS3Exception: null
> > (Service: Amazon S3; Status Code: 400; Error Code:
> > XAmzContentSHA256Mismatch; Request ID:
> > tx02cc6-0057b58a15-25bba-default), S3 Extended Request
> > ID: 25bba-default-default
> > at
> > com.betterbe.vd.web.dataset.requesthandler.DatasetRequestHandler.handle(
> DatasetRequestHandler.java:262)
> > at com.betterbe.vd.web.servlet.Servlet.handler(Servlet.java:141)
> > at com.betterbe.vd.web.servlet.Servlet.doPost(Servlet.java:110)
> > at javax.servlet.http.HttpServlet.service(HttpServlet.java:646)
> >
> > To me this looks a bit the same, though I'm not a Java developer.
> > Am I correct, and if so, can I tell the Java S3 client to use Version 2
> > signatures?
> >
> >
>
> --
> Met vriendelijke groet / Best regards,
>
> Jan Hugo Prins
> Infra and Isilon storage consultant
>
> Better.be B.V.
> Auke Vleerstraat 140 E | 7547 AN Enschede | KvK 08097527
> T +31 (0) 53 48 00 694 | M +31 (0)6 26 358 951
> jpr...@betterbe.com | www.betterbe.com
>
> This e-mail is intended exclusively for the addressee(s), and may not
> be passed on to, or made available for use by any person other than
> the addressee(s). Better.be B.V. rules out any and every liability
> resulting from any electronic transmission.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Best Regards,
Chris Jones

cjo...@cloudm2.com
(p) 770.655.0770
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Signature V2

2016-08-18 Thread jan hugo prins
I believe the same, but when you use V4 from s3cmd or the AWS S3 Java
API you get intermittent signature errors.
Only after returning to V2 do those errors go away.

Jan Hugo


On 08/18/2016 03:51 PM, Chris Jones wrote:
> I believe RGW Hammer and below use V2 and Jewel and above use V4.
>
> Thanks
>
> On Thu, Aug 18, 2016 at 7:32 AM, jan hugo prins  > wrote:
>
> did some more searching and according to some info I found RGW should
> support V4 signatures.
>
> http://tracker.ceph.com/issues/10333
> 
> http://tracker.ceph.com/issues/11858
> 
>
> The fact that everyone still modifies s3cmd to use Version 2
> Signatures
> suggests to me that we have a bug in this code.
>
> If I use V4 signatures most of my requests work fine, but some
> requests
> fail on a signature error.
>
> Thanks,
> Jan Hugo Prins
>
>
> On 08/18/2016 12:46 PM, jan hugo prins wrote:
> > Hi everyone.
> >
> > To connect to my S3 gateways using s3cmd I had to set the option
> > signature_v2 in my s3cfg to true.
> > If I didn't do that I would get Signature mismatch errors and
> this seems
> > to be because Amazon uses Signature version 4 while the S3
> gateway of
> > Ceph only supports Signature Version 2.
> >
> > Now I see the following error in a Jave project we are building that
> > should talk to S3.
> >
> > Aug 18, 2016 12:12:38 PM
> org.apache.catalina.core.StandardWrapperValve
> > invoke
> > SEVERE: Servlet.service() for servlet [Default] in context with path
> > [/VehicleData] threw exception
> > com.betterbe.vd.web.servlet.LsExceptionWrapper:
> xxx
> > caused: com.amazonaws.services.s3.model.AmazonS3Exception: null
> > (Service: Amazon S3; Status Code: 400; Error Code:
> > XAmzContentSHA256Mismatch; Request ID:
> > tx02cc6-0057b58a15-25bba-default), S3 Extended
> Request
> > ID: 25bba-default-default
> > at
> >
> 
> com.betterbe.vd.web.dataset.requesthandler.DatasetRequestHandler.handle(DatasetRequestHandler.java:262)
> > at
> com.betterbe.vd.web.servlet.Servlet.handler(Servlet.java:141)
> > at
> com.betterbe.vd.web.servlet.Servlet.doPost(Servlet.java:110)
> > at
> javax.servlet.http.HttpServlet.service(HttpServlet.java:646)
> >
> > To me this looks a bit the same, though I'm not a Java developer.
> > Am I correct, and if so, can I tell the Java S3 client to use
> Version 2
> > signatures?
> >
> >
>
> --
> Met vriendelijke groet / Best regards,
>
> Jan Hugo Prins
> Infra and Isilon storage consultant
>
> Better.be B.V.
> Auke Vleerstraat 140 E | 7547 AN Enschede | KvK 08097527
> T +31 (0) 53 48 00 694 | M +31 (0)6 26 358 951
> jpr...@betterbe.com  |
> www.betterbe.com 
>
> This e-mail is intended exclusively for the addressee(s), and may not
> be passed on to, or made available for use by any person other than
> the addressee(s). Better.be B.V. rules out any and every liability
> resulting from any electronic transmission.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
>
>
>
>
> -- 
> Best Regards,
> Chris Jones
>
> cjo...@cloudm2.com 
> (p) 770.655.0770
>

-- 
Met vriendelijke groet / Best regards,

Jan Hugo Prins
Infra and Isilon storage consultant

Better.be B.V.
Auke Vleerstraat 140 E | 7547 AN Enschede | KvK 08097527
T +31 (0) 53 48 00 694 | M +31 (0)6 26 358 951
jpr...@betterbe.com | www.betterbe.com

This e-mail is intended exclusively for the addressee(s), and may not
be passed on to, or made available for use by any person other than 
the addressee(s). Better.be B.V. rules out any and every liability 
resulting from any electronic transmission.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Reading payload from rados_watchcb2_t callback

2016-08-18 Thread Nick Fisk
Just to answer myself in case anyone stumbles across this in the future. I was 
on the right track, but I think there are null
characters before the text payload which was tricking printf.

In the end I managed to work it out and came up with this:

char *temp = (char*)data+4;

This skips the first few bytes of the payload. No idea what they are, but 
skipping 4 bytes takes you straight to the start of the text part that you 
send with notify.

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Nick 
> Fisk
> Sent: 17 August 2016 21:49
> To: 'ceph-users' 
> Subject: [ceph-users] Reading payload from rados_watchcb2_t callback
> 
> Hi All,
> 
> I'm writing a small piece of code to call fsfreeze/unfreeze that can be 
> invoked by a RADOS notify. I have the basic watch/notify
> functionality working but I need to be able to determine if the notify 
> message is to freeze or unfreeze, or maybe something
> completely unrelated.
> 
> I'm looking at the rados_watchcb2_t callback and can see that the data 
> payload is returned as a void pointer. This is where it all
starts
> to go a little pear shaped for my basic C skills. I think I have to cast the 
> pointer to a (char *) but I still can't seem to get
anything useful
> from it.
> 
> I've been following some of the tests in the Ceph source and they seem to use 
> some sort of typedef called a bufferlist, is this
what I
> need to try and look into?
> 
> Does anyone have any pointers (excuse the pun) as to how I would read the 
> text part of the payload from it?
> 
> void watch_notify2_cb(void *arg, uint64_t notify_id, uint64_t cookie, 
> uint64_t notifier_gid, void *data, size_t data_len)
> 
> Many Thanks,
> Nick
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RGW multisite - second cluster woes

2016-08-18 Thread Ben Morrice
Hello,

I am trying to configure a second cluster into an existing Jewel RGW
installation.

I do not get the expected output when I perform a 'radosgw-admin realm
pull'. My realm on the first cluster is called 'gold'; however, when doing
a realm pull the output doesn't reflect the 'gold' name or id, and I get an
error related to latest_epoch (?).

The documentation seems straightforward, so I'm not quite sure what I'm
missing here.

Please see below for the full output.

# radosgw-admin realm pull --url=http://cluster1:80 --access-key=access
--secret=secret

2016-08-18 17:20:09.585261 7fb939d879c0  0 error read_lastest_epoch
.rgw.root:periods.8c64a4dd-ccd8-4975-b63b-324fbb24aab6.latest_epoch
{
"id": "98a7b356-83fd-4d42-b895-b58d45fa4233",
"name": "",
"current_period": "8c64a4dd-ccd8-4975-b63b-324fbb24aab6",
"epoch": 1
}

# radosgw-admin period pull --url=http://cluster1:80 --access-key=access
secret=secret
2016-08-18 17:21:33.277719 7f5dbc7849c0  0 error read_lastest_epoch
.rgw.root:periods..latest_epoch
{
"id": "",
"epoch": 0,
"predecessor_uuid": "",
"sync_status": [],
"period_map": {
"id": "",
"zonegroups": [],
"short_zone_ids": []
},
"master_zonegroup": "",
"master_zone": "",
"period_config": {
"bucket_quota": {
"enabled": false,
"max_size_kb": -1,
"max_objects": -1
},
"user_quota": {
"enabled": false,
"max_size_kb": -1,
"max_objects": -1
}
},
"realm_id": "",
"realm_name": "",
"realm_epoch": 0
}

# radosgw-admin realm default --rgw-realm=gold
failed to init realm: (2) No such file or directory
2016-08-18 17:21:46.220181 7f720defa9c0  0 error in read_id for id  : (2)
No such file or directory

# radosgw-admin zonegroup default --rgw-zonegroup=us
failed to init zonegroup: (2) No such file or directory
2016-08-18 17:22:10.348984 7f9b2da699c0  0 error in read_id for id  :
(2) No such file or directory


-- 
Kind regards,

Ben Morrice

__
Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670
EPFL ENT CBS BBP
Biotech Campus
Chemin des Mines 9
1202 Geneva
Switzerland

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Reading payload from rados_watchcb2_t callback

2016-08-18 Thread LOPEZ Jean-Charles
Hi Nick,

a good read to see what’s in it.

http://dachary.org/?p=1904

JC

> On Aug 18, 2016, at 08:28, Nick Fisk  wrote:
> 
> Just to answer myself in case anyone stumbles across this in the future. I 
> was on the right track, but I think there are null
> characters before the text payload which was tricking printf.
> 
> In the end I managed to work it out and came up with this:
> 
> char *temp = (char*)data+4;
> 
> Which skips the 1st few bytes of the payload.no idea what they are, but 
> skipping 4 bytes takes you straight to the start of the
> text part that you send with notify.
> 
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
>> Nick Fisk
>> Sent: 17 August 2016 21:49
>> To: 'ceph-users' 
>> Subject: [ceph-users] Reading payload from rados_watchcb2_t callback
>> 
>> Hi All,
>> 
>> I'm writing a small piece of code to call fsfreeze/unfreeze that can be 
>> invoked by a RADOS notify. I have the basic watch/notify
>> functionality working but I need to be able to determine if the notify 
>> message is to freeze or unfreeze, or maybe something
>> completely unrelated.
>> 
>> I'm looking at the rados_watchcb2_t callback and can see that the data 
>> payload is returned as a void pointer. This is where it all
> starts
>> to go a little pear shaped for my basic C skills. I think I have to cast the 
>> pointer to a (char *) but I still can't seem to get
> anything useful
>> from it.
>> 
>> I've been following some of the tests in the Ceph source and they seem to 
>> use some sort of typedef called a bufferlist, is this
> what I
>> need to try and look into?
>> 
>> Does anyone have any pointers (excuse the pun) as to how I would read the 
>> text part of the payload from it?
>> 
>> void watch_notify2_cb(void *arg, uint64_t notify_id, uint64_t cookie, 
>> uint64_t notifier_gid, void *data, size_t data_len)
>> 
>> Many Thanks,
>> Nick
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Designing ceph cluster

2016-08-18 Thread Gaurav Goyal
Hello Mart,

My apologies for that!

We are a couple of office colleagues using a shared Gmail account, which
has caused the nuisance.

Thanks for your response!

On Thu, Aug 18, 2016 at 6:00 AM, Mart van Santen  wrote:

> Dear Guarav,
>
> Please respect everyones time & timezone differences. Flooding the
> mail-list won't help
>
> see below,
>
>
>
> On 08/18/2016 01:39 AM, Gaurav Goyal wrote:
>
> Dear Ceph Users,
>
> Awaiting some suggestion please!
>
>
>
> On Wed, Aug 17, 2016 at 11:15 AM, Gaurav Goyal 
> wrote:
>
>> Hello Mart,
>>
>> Thanks a lot for the detailed information!
>> Please find my response inline and help me to get more knowledge on it
>>
>>
>> Ceph works best with more hardware. It is not really designed for small
>> scale setups. Of course small setups can work for a PoC or testing, but I
>> would not advise this for production.
>>
>> [Gaurav] : We need this setup for PoC or testing.
>>
>> If you want to proceed however, have a good look the manuals or this
>> mailinglist archive and do invest some time to understand the logic and
>> workings of ceph before working or ordering hardware
>>
>> At least you want:
>> - 3 monitors, preferable on dedicated servers
>> [Gaurav] : With my current setup, can i install MON on Host 1 -->
>> Controller + Compute1, Host 2 and Host 3
>>
>> - Per disk you will be running an ceph-osd instance. So a host with 2
>> disks will run 2 osd instances. More OSD process is better performance, but
>> also more memory and cpu usage.
>>
>> [Gaurav] : Understood, That means having 1T x 4 would be better than 2T x
>> 2.
>>
> Yes, more disks will do more IO
>
>
>> - Per default ceph uses a replication factor of 3 (it is possible to set
>> this to 2, but is not advised)
>> - You can not fill up disks to 100%, also data will not distribute even
>> over all disks, expect disks to be filled up (on average) maximum to
>> 60-70%. You want to add more disks once you reach this limit.
>>
>> All on all, with a setup of 3 hosts, with 2x2TB disks, this will result
>> in a net data availablity of (3x2x2TBx0.6)/3 = 2.4 TB
>>
>> [Gaurav] : As this is going to be a test lab environment, can we change
>> the configuration to have more capacity rather than redundancy? How can we
>> achieve it?
>>
>
> Ceph has an excellent documentation. This is easy to find and search for
> "the number of replicas", you want to set both "size" and "min_size" to 1
> on this case
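
(For illustration, with the default "rbd" pool that would be something
along the lines of:

   ceph osd pool set rbd size 1
   ceph osd pool set rbd min_size 1

keeping in mind that with a single copy any one disk failure loses data.)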
>
> If speed is required, consider SSD's (for data & journals, or only
>> journals).
>>
>> In you email you mention "compute1/2/3", please note, if you use the rbd
>> kernel driver, this can interfere with the OSD process and is not advised
>> to run OSD and Kernel driver on the same hardware. If you still want to do
>> that, split it up using VMs (we have a small testing cluster where we do
>> mix compute and storage, there we have the OSDs running in VMs)
>>
>> [Gaurav] : within my mentioned environment, How can we split rbd kernel
>> driver and OSD process? Should it be like rbd kernel driver on controller
>> and OSD processes on compute hosts?
>>
>> Since my host 1 is controller + Compute1, Can you please share the steps
>> to split it up using VMs and suggested by you.
>>
>
> We are running kernel rbd on dom0 and osd's in domu, as well a monitor in
> domu.
>
> Regards,
>
> Mart
>
>
>
>
>
>> Regards
>> Gaurav Goyal
>>
>>
>> On Wed, Aug 17, 2016 at 9:28 AM, Mart van Santen < 
>> m...@greenhost.nl> wrote:
>>
>>>
>>> Dear Gaurav,
>>>
>>> Ceph works best with more hardware. It is not really designed for small
>>> scale setups. Of course small setups can work for a PoC or testing, but I
>>> would not advise this for production.
>>>
>>> If you want to proceed however, have a good look the manuals or this
>>> mailinglist archive and do invest some time to understand the logic and
>>> workings of ceph before working or ordering hardware
>>>
>>> At least you want:
>>> - 3 monitors, preferable on dedicated servers
>>> - Per disk you will be running an ceph-osd instance. So a host with 2
>>> disks will run 2 osd instances. More OSD process is better performance, but
>>> also more memory and cpu usage.
>>> - Per default ceph uses a replication factor of 3 (it is possible to set
>>> this to 2, but is not advised)
>>> - You can not fill up disks to 100%, also data will not distribute even
>>> over all disks, expect disks to be filled up (on average) maximum to
>>> 60-70%. You want to add more disks once you reach this limit.
>>>
>>> All on all, with a setup of 3 hosts, with 2x2TB disks, this will result
>>> in a net data availablity of (3x2x2TBx0.6)/3 = 2.4 TB
>>>
>>>
>>> If speed is required, consider SSD's (for data & journals, or only
>>> journals).
>>>
>>> In you email you mention "compute1/2/3", please note, if you use the rbd
>>> kernel driver, this can interfere with the OSD process and is not advised
>>> to run OSD and Kernel driver on the same hardware. If you still want to do
>>> that, split it up using VMs (we have a small testin

Re: [ceph-users] Designing ceph cluster

2016-08-18 Thread Peter Hinman
If you are wanting to run VMs, OSD, and Monitors all on the same 
hardware in a lab environment, it sounds like Proxmox might simplify 
things for you.


Peter

On 8/18/2016 9:57 AM, Gaurav Goyal wrote:

Hello Mart,

My Apologies for that!

We are couple of office colleagues using the common gmail account. 
That has caused the nuisance.


Thanks for your response!

On Thu, Aug 18, 2016 at 6:00 AM, Mart van Santen > wrote:


Dear Guarav,

Please respect everyones time & timezone differences. Flooding the
mail-list won't help

see below,



On 08/18/2016 01:39 AM, Gaurav Goyal wrote:

Dear Ceph Users,

Awaiting some suggestion please!



On Wed, Aug 17, 2016 at 11:15 AM, Gaurav Goyal
mailto:er.gauravgo...@gmail.com>> wrote:

Hello Mart,

Thanks a lot for the detailed information!
Please find my response inline and help me to get more
knowledge on it


Ceph works best with more hardware. It is not really designed
for small scale setups. Of course small setups can work for a
PoC or testing, but I would not advise this for production.

[Gaurav] : We need this setup for PoC or testing.

If you want to proceed however, have a good look the manuals
or this mailinglist archive and do invest some time to
understand the logic and workings of ceph before working or
ordering hardware

At least you want:
- 3 monitors, preferable on dedicated servers
[Gaurav] : With my current setup, can i install MON on Host 1
--> Controller + Compute1, Host 2 and Host 3

- Per disk you will be running an ceph-osd instance. So a
host with 2 disks will run 2 osd instances. More OSD process
is better performance, but also more memory and cpu usage.

[Gaurav] : Understood, That means having 1T x 4 would be
better than 2T x 2.


Yes, more disks will do more IO



- Per default ceph uses a replication factor of 3 (it is
possible to set this to 2, but is not advised)
- You can not fill up disks to 100%, also data will not
distribute even over all disks, expect disks to be filled up
(on average) maximum to 60-70%. You want to add more disks
once you reach this limit.

All on all, with a setup of 3 hosts, with 2x2TB disks, this
will result in a net data availablity of (3x2x2TBx0.6)/3 =
2.4 TB

[Gaurav] : As this is going to be a test lab environment, can
we change the configuration to have more capacity rather than
redundancy? How can we achieve it?



Ceph has an excellent documentation. This is easy to find and
search for "the number of replicas", you want to set both "size"
and "min_size" to 1 on this case


If speed is required, consider SSD's (for data & journals, or
only journals).

In you email you mention "compute1/2/3", please note, if you
use the rbd kernel driver, this can interfere with the OSD
process and is not advised to run OSD and Kernel driver on
the same hardware. If you still want to do that, split it up
using VMs (we have a small testing cluster where we do mix
compute and storage, there we have the OSDs running in VMs)

[Gaurav] : within my mentioned environment, How can we split
rbd kernel driver and OSD process? Should it be like rbd
kernel driver on controller and OSD processes on compute hosts?

Since my host 1 is controller + Compute1, Can you please
share the steps to split it up using VMs and suggested by you.



We are running kernel rbd on dom0 and osd's in domu, as well a
monitor in domu.

Regards,

Mart






Regards
Gaurav Goyal


On Wed, Aug 17, 2016 at 9:28 AM, Mart van Santen
mailto:m...@greenhost.nl>> wrote:


Dear Gaurav,

Ceph works best with more hardware. It is not really
designed for small scale setups. Of course small setups
can work for a PoC or testing, but I would not advise
this for production.

If you want to proceed however, have a good look the
manuals or this mailinglist archive and do invest some
time to understand the logic and workings of ceph before
working or ordering hardware

At least you want:
- 3 monitors, preferable on dedicated servers
- Per disk you will be running an ceph-osd instance. So a
host with 2 disks will run 2 osd instances. More OSD
process is better performance, but also more memory and
cpu usage.
- Per default ceph uses a replication factor of 3 (it is
possible to set this to 2, but is not advised)
- You can not fill up disks to 100%, also data will not
distrib

Re: [ceph-users] Designing ceph cluster

2016-08-18 Thread Vasu Kulkarni
Also, most of the terminology looks like it comes from OpenStack and SAN. Here
is the right terminology that should be used for Ceph:
http://docs.ceph.com/docs/master/glossary/


On Thu, Aug 18, 2016 at 8:57 AM, Gaurav Goyal  wrote:
> Hello Mart,
>
> My Apologies for that!
>
> We are couple of office colleagues using the common gmail account. That has
> caused the nuisance.
>
> Thanks for your response!
>
> On Thu, Aug 18, 2016 at 6:00 AM, Mart van Santen  wrote:
>>
>> Dear Guarav,
>>
>> Please respect everyones time & timezone differences. Flooding the
>> mail-list won't help
>>
>> see below,
>>
>>
>>
>> On 08/18/2016 01:39 AM, Gaurav Goyal wrote:
>>
>> Dear Ceph Users,
>>
>> Awaiting some suggestion please!
>>
>>
>>
>> On Wed, Aug 17, 2016 at 11:15 AM, Gaurav Goyal 
>> wrote:
>>>
>>> Hello Mart,
>>>
>>> Thanks a lot for the detailed information!
>>> Please find my response inline and help me to get more knowledge on it
>>>
>>>
>>> Ceph works best with more hardware. It is not really designed for small
>>> scale setups. Of course small setups can work for a PoC or testing, but I
>>> would not advise this for production.
>>>
>>> [Gaurav] : We need this setup for PoC or testing.
>>>
>>> If you want to proceed however, have a good look the manuals or this
>>> mailinglist archive and do invest some time to understand the logic and
>>> workings of ceph before working or ordering hardware
>>>
>>> At least you want:
>>> - 3 monitors, preferable on dedicated servers
>>> [Gaurav] : With my current setup, can i install MON on Host 1 -->
>>> Controller + Compute1, Host 2 and Host 3
>>>
>>> - Per disk you will be running an ceph-osd instance. So a host with 2
>>> disks will run 2 osd instances. More OSD process is better performance, but
>>> also more memory and cpu usage.
>>>
>>> [Gaurav] : Understood, That means having 1T x 4 would be better than 2T x
>>> 2.
>>
>> Yes, more disks will do more IO
>>>
>>>
>>> - Per default ceph uses a replication factor of 3 (it is possible to set
>>> this to 2, but is not advised)
>>> - You can not fill up disks to 100%, also data will not distribute even
>>> over all disks, expect disks to be filled up (on average) maximum to 60-70%.
>>> You want to add more disks once you reach this limit.
>>>
>>> All on all, with a setup of 3 hosts, with 2x2TB disks, this will result
>>> in a net data availablity of (3x2x2TBx0.6)/3 = 2.4 TB
>>>
>>> [Gaurav] : As this is going to be a test lab environment, can we change
>>> the configuration to have more capacity rather than redundancy? How can we
>>> achieve it?
>>
>>
>> Ceph has an excellent documentation. This is easy to find and search for
>> "the number of replicas", you want to set both "size" and "min_size" to 1 on
>> this case
>>
>>> If speed is required, consider SSD's (for data & journals, or only
>>> journals).
>>>
>>> In you email you mention "compute1/2/3", please note, if you use the rbd
>>> kernel driver, this can interfere with the OSD process and is not advised to
>>> run OSD and Kernel driver on the same hardware. If you still want to do
>>> that, split it up using VMs (we have a small testing cluster where we do mix
>>> compute and storage, there we have the OSDs running in VMs)
>>>
>>> [Gaurav] : within my mentioned environment, How can we split rbd kernel
>>> driver and OSD process? Should it be like rbd kernel driver on controller
>>> and OSD processes on compute hosts?
>>>
>>> Since my host 1 is controller + Compute1, Can you please share the steps
>>> to split it up using VMs and suggested by you.
>>
>>
>> We are running kernel rbd on dom0 and osd's in domu, as well a monitor in
>> domu.
>>
>> Regards,
>>
>> Mart
>>
>>
>>
>>
>>>
>>> Regards
>>> Gaurav Goyal
>>>
>>>
>>> On Wed, Aug 17, 2016 at 9:28 AM, Mart van Santen 
>>> wrote:


 Dear Gaurav,

 Ceph works best with more hardware. It is not really designed for small
 scale setups. Of course small setups can work for a PoC or testing, but I
 would not advise this for production.

 If you want to proceed however, have a good look the manuals or this
 mailinglist archive and do invest some time to understand the logic and
 workings of ceph before working or ordering hardware

 At least you want:
 - 3 monitors, preferable on dedicated servers
 - Per disk you will be running an ceph-osd instance. So a host with 2
 disks will run 2 osd instances. More OSD process is better performance, but
 also more memory and cpu usage.
 - Per default ceph uses a replication factor of 3 (it is possible to set
 this to 2, but is not advised)
 - You can not fill up disks to 100%, also data will not distribute even
 over all disks, expect disks to be filled up (on average) maximum to 
 60-70%.
 You want to add more disks once you reach this limit.

 All on all, with a setup of 3 hosts, with 2x2TB disks, this will result
 in a net data availablity of (3x2x2TBx0.6)/3 = 2.4 TB


 If speed is req

Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's

2016-08-18 Thread Alex Gorbachev
On Sat, Aug 13, 2016 at 4:51 PM, Alex Gorbachev  
wrote:
> On Sat, Aug 13, 2016 at 12:36 PM, Alex Gorbachev  
> wrote:
>> On Mon, Aug 8, 2016 at 7:56 AM, Ilya Dryomov  wrote:
>>> On Sun, Aug 7, 2016 at 7:57 PM, Alex Gorbachev  
>>> wrote:
> I'm confused.  How can a 4M discard not free anything?  It's either
> going to hit an entire object or two adjacent objects, truncating the
> tail of one and zeroing the head of another.  Using rbd diff:
>
> $ rbd diff test | grep -A 1 25165824
> 25165824  4194304 data
> 29360128  4194304 data
>
> # a 4M discard at 1M into a RADOS object
> $ blkdiscard -o $((25165824 + (1 << 20))) -l $((4 << 20)) /dev/rbd0
>
> $ rbd diff test | grep -A 1 25165824
> 25165824  1048576 data
> 29360128  4194304 data

 I have tested this on a small RBD device with such offsets and indeed,
 the discard works as you describe, Ilya.

 Looking more into why ESXi's discard is not working.  I found this
 message in kern.log on Ubuntu on creation of the SCST LUN, which shows
 unmap_alignment 0:

 Aug  6 22:02:33 e1 kernel: [300378.136765] virt_id 33 (p_iSCSILun_sclun945)
 Aug  6 22:02:33 e1 kernel: [300378.136782] dev_vdisk: Auto enable thin
 provisioning for device /dev/rbd/spin1/unmap1t
 Aug  6 22:02:33 e1 kernel: [300378.136784] unmap_gran 8192,
 unmap_alignment 0, max_unmap_lba 8192, discard_zeroes_data 1
 Aug  6 22:02:33 e1 kernel: [300378.136786] dev_vdisk: Attached SCSI
 target virtual disk p_iSCSILun_sclun945
 (file="/dev/rbd/spin1/unmap1t", fs=409600MB, bs=512,
 nblocks=838860800, cyln=409600)
 Aug  6 22:02:33 e1 kernel: [300378.136847] [4682]:
 scst_alloc_add_tgt_dev:5287:Device p_iSCSILun_sclun945 on SCST lun=32
 Aug  6 22:02:33 e1 kernel: [300378.136853] [4682]: scst:
 scst_alloc_set_UA:12711:Queuing new UA 8810251f3a90 (6:29:0,
 d_sense 0) to tgt_dev 88102583ad00 (dev p_iSCSILun_sclun945,
 initiator copy_manager_sess)

 even though:

 root@e1:/sys/block/rbd29# cat discard_alignment
 4194304

 So somehow the discard_alignment is not making it into the LUN.  Could
 this be the issue?
>>>
>>> No, if you are not seeing *any* effect, the alignment is pretty much
>>> irrelevant.  Can you do the following on a small test image?
>>>
>>> - capture "rbd diff" output
>>> - blktrace -d /dev/rbd0 -o - | blkparse -i - -o rbd0.trace
>>> - issue a few discards with blkdiscard
>>> - issue a few unmaps with ESXi, preferrably with SCST debugging enabled
>>> - capture "rbd diff" output again
>>>
>>> and attach all of the above?  (You might need to install a blktrace
>>> package.)
>>>
>>
>> Latest results from VMWare validation tests:
>>
>> Each test creates and deletes a virtual disk, then calls ESXi unmap
>> for what ESXi maps to that volume:
>>
>> Test 1: 10GB reclaim, rbd diff size: 3GB, discards: 4829
>>
>> Test 2: 100GB reclaim, rbd diff size: 50GB, discards: 197837
>>
>> Test 3: 175GB reclaim, rbd diff size: 47 GB, discards: 197824
>>
>> Test 4: 250GB reclaim, rbd diff size: 125GB, discards: 197837
>>
>> Test 5: 250GB reclaim, rbd diff size: 80GB, discards: 197837
>>
>> At the end, the compounded used size via rbd diff is 608 GB from 775GB
>> of data.  So we release only about 20% via discards in the end.
>
> Ilya has analyzed the discard pattern, and indeed the problem is that
> ESXi appears to disregard the discard alignment attribute.  Therefore,
> discards are shifted by 1M, and are not hitting the tail of objects.
>
> Discards work much better on the EagerZeroedThick volumes, likely due
> to contiguous data.
>
> I will proceed with the rest of testing, and will post any tips or
> best practice results as they become available.
>
> Thank you for everyone's help and advice!

Testing completed - the discards definitely follow the alignment pattern:

- 4MB objects and VMFS5 - only some discards due to 1MB discard not
often hitting the tail of object

- 1MB objects - practically 100% space reclaim

I have not tried shifting the VMFS5 filesystem, as the test will not
work with that.  Also not sure how to properly incorporate into VMWare
routine operation.  So, as a best practice:

If you want efficient ESXi space reclaim with RBD and VMFS5, use 1 MB
object size in Ceph
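
For example, something along these lines (image name and size are just
placeholders; --order 20 gives 2^20 = 1 MB objects, and newer rbd versions
also accept --object-size 1M):

   rbd create spin1/esxi-lun1 --size 102400 --order 20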

Best regards,
--
Alex Gorbachev
Storcium
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Signature V2

2016-08-18 Thread jan hugo prins
I have been able to reproduce the error and create a debug log from the
failure.
I can't post the debug log here because there is sensitive information
in the debug log like access keys etc.
Where can I send this log for analysis? And who is able to have a look
at this?
A small part of the debug log, with the sensitive information stripped:

2016-08-18 17:26:33.864658 7ff155ffb700 10 -
Verifying signatures
2016-08-18 17:26:33.864659 7ff155ffb700 10 Signature =
abbeb6af798b2aad58cd398491698f863253f3859d22b4c9558cc808159d256d
2016-08-18 17:26:33.864660 7ff155ffb700 10 New Signature =
e13d83bcd1f52103e9056add844e0037accb71436faee1a3e0048dd6c25cd4b6
2016-08-18 17:26:33.864661 7ff155ffb700 10 -
2016-08-18 17:26:33.864664 7ff155ffb700 20 delayed aws4 auth failed
2016-08-18 17:26:33.864674 7ff155ffb700  2 req 624:0.000642:s3:PUT
/Photos/Options/x/180x102.jpg:put_obj:completing
2016-08-18 17:26:33.864749 7ff155ffb700  2 req 624:0.000717:s3:PUT
/Photos/Options/x/180x102.jpg:put_obj:op status=-2027
2016-08-18 17:26:33.864757 7ff155ffb700  2 req 624:0.000726:s3:PUT
/Photos/Options/x/180x102.jpg:put_obj:http status=403
2016-08-18 17:26:33.864762 7ff155ffb700  1 == req done
req=0x7ff155ff5710 op status=-2027 http_status=403 ==
2016-08-18 17:26:33.864776 7ff155ffb700 20 process_request() returned -2027
2016-08-18 17:26:33.864801 7ff155ffb700  1 civetweb: 0x7ff1f8003e80:
192.168.2.59 - - [18/Aug/2016:17:26:33 +0200] "PUT
/Photos/Options/x/180x102.jpg HTTP/1.1" 403 0 - -


Jan Hugo Prins


On 08/18/2016 01:32 PM, jan hugo prins wrote:
> did some more searching and according to some info I found RGW should
> support V4 signatures.
>
> http://tracker.ceph.com/issues/10333
> http://tracker.ceph.com/issues/11858
>
> The fact that everyone still modifies s3cmd to use Version 2 Signatures
> suggests to me that we have a bug in this code.
>
> If I use V4 signatures most of my requests work fine, but some requests
> fail on a signature error.
>
> Thanks,
> Jan Hugo Prins
>
>
> On 08/18/2016 12:46 PM, jan hugo prins wrote:
>> Hi everyone.
>>
>> To connect to my S3 gateways using s3cmd I had to set the option
>> signature_v2 in my s3cfg to true.
>> If I didn't do that I would get Signature mismatch errors and this seems
>> to be because Amazon uses Signature version 4 while the S3 gateway of
>> Ceph only supports Signature Version 2.
>>
>> Now I see the following error in a Jave project we are building that
>> should talk to S3.
>>
>> Aug 18, 2016 12:12:38 PM org.apache.catalina.core.StandardWrapperValve
>> invoke
>> SEVERE: Servlet.service() for servlet [Default] in context with path
>> [/VehicleData] threw exception
>> com.betterbe.vd.web.servlet.LsExceptionWrapper: xxx
>> caused: com.amazonaws.services.s3.model.AmazonS3Exception: null
>> (Service: Amazon S3; Status Code: 400; Error Code:
>> XAmzContentSHA256Mismatch; Request ID:
>> tx02cc6-0057b58a15-25bba-default), S3 Extended Request
>> ID: 25bba-default-default
>> at
>> com.betterbe.vd.web.dataset.requesthandler.DatasetRequestHandler.handle(DatasetRequestHandler.java:262)
>> at com.betterbe.vd.web.servlet.Servlet.handler(Servlet.java:141)
>> at com.betterbe.vd.web.servlet.Servlet.doPost(Servlet.java:110)
>> at javax.servlet.http.HttpServlet.service(HttpServlet.java:646)
>>
>> To me this looks a bit the same, though I'm not a Java developer.
>> Am I correct, and if so, can I tell the Java S3 client to use Version 2
>> signatures?
>>
>>

-- 
Met vriendelijke groet / Best regards,

Jan Hugo Prins
Infra and Isilon storage consultant

Better.be B.V.
Auke Vleerstraat 140 E | 7547 AN Enschede | KvK 08097527
T +31 (0) 53 48 00 694 | M +31 (0)6 26 358 951
jpr...@betterbe.com | www.betterbe.com

This e-mail is intended exclusively for the addressee(s), and may not
be passed on to, or made available for use by any person other than 
the addressee(s). Better.be B.V. rules out any and every liability 
resulting from any electronic transmission.




signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Understanding write performance

2016-08-18 Thread lewis.geo...@innoscale.net
Hi,
 So, I have really been trying to find information about this without 
annoying the list, but I just can't seem to get any clear picture of it. I 
was going to try to search the mailing list archive, but it seems there is 
an error when trying to search it right now(posting below, and sending to 
listed address in error). 
  
 I have been working for a couple of months now(slowly) on testing out 
Ceph. I only have a small PoC setup. I have 6 hosts, but I am only using 3 
of them in the cluster at the moment. They each have 6xSSDs(only 5 usable 
by Ceph), but the networks(1 public, 1 cluster) are only 1Gbps. I have the 
MONs running on the same 3 hosts, and I have an OSD process running for 
each of the 5 disks per host. The cluster shows in good health, with 15 
OSDs. I have one pool there, the default rbd, which I setup with 512 PGs. 
  
 I have created an rbd image on the pool, and I have it mapped and mounted 
on another client host. When doing write tests, like with 'dd', I am 
getting rather spotty performance. Not only is it up and down, but even 
when it is up, the performance isn't that great. On large'ish(4GB 
sequential) writes, it averages about 65MB/s, and on repeated smaller(40MB) 
sequential writes, it is jumping around between 20MB/s and 80MB/s. 
  
 However, with read tests, I am able to completely max out the network 
there, easily reaching 125MB/s. Tests on the disks directly are able to get 
up to 550MB/s reads and 350MB/s writes. So, I know it isn't a problem with 
the disks.
  
 I guess my question is, are there any additional optimizations or tuning I 
should review here. I have read over all the docs, but I don't know which, 
if any, of the values would need tweaking. Also, I am not sure if this is 
just how it is with Ceph, given the need to write multiple copies of each 
object. Is the slower write performance(averaging ~1/2 of the network 
throughput) to be expected? I haven't seen any clear answer on that in the 
docs or in articles I have found around. So, I am not sure if my 
expectation is just wrong. 
  
 Anyway, some basic idea on those concepts or some pointers to some good 
docs or articles would be wonderful. Thank you!
  
 Lewis George
  
  

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How can we repair OSD leveldb?

2016-08-18 Thread Sean Sullivan
We have a hammer cluster that experienced a similar power failure and ended
up corrupting our monitors leveldb stores. I am still trying to repair ours
but I can give you a few tips that seem to help.

1.)  I would copy the database off to somewhere safe right away. Just
opening it seems to change it.

2.) check out ceph-test tools (ceph-objectstore-tool, ceph-kvstore-tool,
ceph-osdmap-tool, etc).  It lets you list the keys/data in your
osd leveldb, possibly export them and get some barings on what you need to
do to recover your map.


   3.) I am making a few assumptions here. a.) You are using
replication for your pools. b.) you are using either S3 or rbd, not cephFS.
From here, worst case, chances are your data is recoverable sans the osd and
monitor leveldb store, so long as the rest of the data is okay. (The actual
rados objects are spread across each osd in '/var/lib/ceph/osd/ceph-*/
current/blah_head')

If you use RBD there is a tool out there that lets you recover your RBD
images:: https://github.com/ceph/ceph/tree/master/src/tools/rbd_recover_tool
We only use S3 but this seems to be doable as well:

As an example we have a 9MB file that was stored in ceph::
I ran a find across all of the osds in my cluster and compiled a list of
files::

find /var/lib/ceph/osd/ceph-*/current/ -type f -iname \*this_is_my_File\.
gzip\*

From here I ended up with a list that looks like the following::

This is the head. It's usually the bucket.id\file__head__

default.20283.1\ud975ef9e-c7b1-42c5-938b-d746fc2c7996\
sC1635.TCGA-DJ-A3UP-10A-01D-A22D-08.1.bam__head_CA57D598__1
[__A]\[_B___
_].[__C__]

default.20283.1\u\umultipart\ud975ef9e-c7b1-42c5-938b-
d746fc2c7996\sC1635.TCGA-DJ-A3UP-10A-01D-A22D-08.1.bam.2\
sYDDf8Qip4tn5YxQWfOmTt5fgm7o9Tw6.1__head_C338075C__1
[__A]\[_D___]\[__B_
_].[__C_
_]

And for each of those you'll have matching shadow files::
default.20283.1\u\ushadow\ud975ef9e-c7b1-42c5-938b-
d746fc2c7996\sC1635.TCGA-DJ-A3UP-10A-01D-A22D-08.1.bam.2\
sYDDf8Qip4tn5YxQWfOmTt5fgm7o9Tw6.1\u1__head_02F05634__1
[__A]\[_E__]\[__B___
].[__C__
__]

Here is another part of the multipart (this file only had 1 multipart and
we use multipart for all files larger than 5MB irrespective of size)::

default.20283.1\u\ushadow\ud975ef9e-c7b1-42c5-938b-
d746fc2c7996\sC1635.TCGA-DJ-A3UP-10A-01D-A22D-08.1.bam.2\
sYDDf8Qip4tn5YxQWfOmTt5fgm7o9Tw6.1\u2__head_1EA07BDF__1
[__A]\[_E__]\[__B___
].[__C__
__]



   ^^ notice the different part number
here.

A is the bucket.id and is the same for every object in the same bucket.
Even if you don't know what the bucket id for your bucket is, you should be
able to assume with good certainty after you review your list which is which

B is our object name. We generate uuids for each object so I can not be
certain how much of this is ceph or us but the tail of your object name
should exist and be the same across all of your parts.

C.) Is their suffix for each object. From here you may have suffix' like
the above

D.) Is your upload chunks

E.) Is your shadow chunks for each part of the multipart (i think)

I'm sure it's much more complicated than that but that's what worked for
me.  From here I just scanned through all of my osds and slowly pulled all
of the individual parts via ssh and concatinated them all to their
respective files. So far the md5 sums match our md5 of the file prior to
uploading them to ceph in the first place.

We have a python tool to do this but it's kind of specific to us. I can ask
the author and see if I can post a gist of the code if that helps. Please
let me know.



I can't speak for CephFS unfortunately as we do not use it but I wouldn't
be surprised if it is similar. So if you set up ssh-keys across all of your
osd nodes you should be able to export all of the data to another
server/cluster/etc.


I am working on trying to rebuild leveldb for our monitors with the correct
keys/values but I have a feeling this is going to be a long way off. I
wouldn't be surprised if the leveldb structure for the mon databse is
similar to the osd omap database.

On Wed, Aug 17, 2016 at 4:54 PM, Dan Jakubiec 
wrote:

> Hi Wido,
>
> Thank you for the response:
>
> > On Aug 17, 2016, at 16:25, Wido den Hollander  wrote:
> >
> >
> >> Op 17 augustus 2016 om 17:44 schreef Dan Jakubiec <
> dan.jakub...@gmail.com>:
> >>
> >>
> >> Hello, we have a Ceph cluster with 8 OSD that recently lost power to
> all 8 machines.  We've managed to recover the XFS filesystems on 7 of the
> machines, but the O

Re: [ceph-users] Rbd map command doesn't work

2016-08-18 Thread EP Komarla
I changed the profile to Hammer and it works.  This brings up a question: by 
changing the profile to “Hammer” am I going to lose some of the performance 
optimizations done in ‘Jewel’?
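
(For reference, the knobs each profile changes can be compared by dumping
the tunables currently in effect, e.g.:

   ceph osd crush show-tunables

before and after switching profiles.)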

- epk

From: Bruce McFarland [mailto:bkmcfarl...@earthlink.net]
Sent: Tuesday, August 16, 2016 4:52 PM
To: Somnath Roy 
Cc: EP Komarla ; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Rbd map command doesn't work

EP,
Try setting the crush map to use legacy tunables. I've had the same issue with 
the"feature mismatch" errors when using krbd that didn't support format 2 and 
running jewel 10.2.2 on the storage nodes.

From the command line:
ceph osd crush tunables legacy

Bruce

On Aug 16, 2016, at 4:21 PM, Somnath Roy 
mailto:somnath@sandisk.com>> wrote:
This is the usual feature mismatch stuff, the stock krbd you are using does not 
support Jewel.
Try googling the error and I am sure you will find a lot of prior discussion 
around it.

From: EP Komarla [mailto:ep.koma...@flextronics.com]
Sent: Tuesday, August 16, 2016 4:15 PM
To: Somnath Roy; ceph-users@lists.ceph.com
Subject: RE: Rbd map command doesn't work

Somnath,

Thanks.

I am trying your suggestion.  See the commands below.  Still it doesn’t seem to 
go.

I am missing something here…

Thanks,

- epk

=
[test@ep-c2-client-01 ~]$ rbd create rbd/test1 --size 1G --image-format 1
rbd: image format 1 is deprecated
[test@ep-c2-client-01 ~]$ rbd map rbd/test1
rbd: sysfs write failed
In some cases useful info is found in syslog - try "dmesg | tail" or so.
rbd: map failed: (13) Permission denied
[test@ep-c2-client-01 ~]$ sudo rbd map rbd/test1
^C[test@ep-c2-client-01 ~]$
[test@ep-c2-client-01 ~]$
[test@ep-c2-client-01 ~]$
[test@ep-c2-client-01 ~]$
[test@ep-c2-client-01 ~]$ dmesg|tail -20
[1201954.248195] libceph: mon0 172.20.60.51:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1201954.253365] libceph: mon0 172.20.60.51:6789 missing required protocol 
features
[1201964.274082] libceph: mon0 172.20.60.51:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1201964.281195] libceph: mon0 172.20.60.51:6789 missing required protocol 
features
[1201974.298195] libceph: mon0 172.20.60.51:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1201974.305300] libceph: mon0 172.20.60.51:6789 missing required protocol 
features
[1204128.917562] libceph: mon0 172.20.60.51:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1204128.924173] libceph: mon0 172.20.60.51:6789 missing required protocol 
features
[1204138.956737] libceph: mon0 172.20.60.51:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1204138.964011] libceph: mon0 172.20.60.51:6789 missing required protocol 
features
[1204148.980701] libceph: mon0 172.20.60.51:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1204148.987892] libceph: mon0 172.20.60.51:6789 missing required protocol 
features
[1204159.004939] libceph: mon2 172.20.60.53:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1204159.012136] libceph: mon2 172.20.60.53:6789 missing required protocol 
features
[1204169.028802] libceph: mon0 172.20.60.51:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1204169.035992] libceph: mon0 172.20.60.51:6789 missing required protocol 
features
[1204476.803192] libceph: mon0 172.20.60.51:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1204476.810578] libceph: mon0 172.20.60.51:6789 missing required protocol 
features
[1204486.821279] libceph: mon0 172.20.60.51:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400



From: Somnath Roy [mailto:somnath@sandisk.com]
Sent: Tuesday, August 16, 2016 3:59 PM
To: EP Komarla mailto:ep.koma...@flextronics.com>>; 
ceph-users@lists.ceph.com
Subject: RE: Rbd map command doesn't work

The default format of an rbd image in jewel is 2, along with a bunch of other 
features enabled, so you have the following two options:

1. Create a format 1 image with --image-format 1

2. Or, set this in the ceph.conf file under [client] or [global] before creating the 
image:
rbd_default_features = 3

Thanks & Regards
Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of EP 
Komarla
Sent: Tuesday, August 16, 2016 2:52 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Rbd map command doesn't work

All,

I am creating an image and mapping it.  The below commands used to work in 
Hammer, now the same is not working in Jewel.  I see the message about some 
feature set mismatch – what features are we talking about here

[ceph-users] CephFS Fuse ACLs

2016-08-18 Thread Brady Deetz
I'm having an issue with ACLs on my CephFS test environment. Am I an idiot
or is something weird going on?

TLDR;
I setfacl as root for a local user and the user still can't access the file.

Example:
root@test-client:/media/cephfs/storage/labs# touch test
root@test-client:/media/cephfs/storage/labs# chown root:root test
root@test-client:/media/cephfs/storage/labs# chmod 660 test
root@test-client:/media/cephfs/storage/labs# setfacl -m u:brady:rwx test

other shell as local user:
brady@test-client:/media/cephfs/storage/labs$ getfacl test
# file: test
# owner: root
# group: root
user::rw-
user:brady:rwx
group::rw-
mask::rwx
other::---

brady@test-client:/media/cephfs/storage/labs$ cat test
cat: test: Permission denied



Configuration details:
Ubuntu 16.04.1
fuse 2.9.4-1ubuntu3.1
ceph-fuse 10.2.2-0ubuntu0.16.04.2
acl 2.2.52-3
kernel 4.4.0-34-generic (from ubuntu)

fstab entry:
mount.fuse.ceph#id=admin,conf=/etc/ceph/ceph.conf   /media/cephfs
fusedefaults,_netdev0   0

ceph.conf:
[global]
fsid = 6f91f60c-7bc0-4aaa-a136-4a90851fbe10
mon_initial_members = mon0
mon_host = 10.124.103.60
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
public_network = 10.124.103.0/24
cluster_network = 10.124.104.0/24
osd_pool_default_size = 3

[client]
fuse_default_permission=0
client_acl_type=posix_acl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS Fuse ACLs

2016-08-18 Thread Brady Deetz
apparently fuse_default_permission and client_acl_type have to be in the
fstab entry instead of the ceph.conf.

Sorry for polluting the mailing list with an amateur mis-configuration.

On Thu, Aug 18, 2016 at 4:26 PM, Brady Deetz  wrote:

> I'm having an issue with ACLs on my CephFS test environment. Am I an idiot
> or is something weird going on?
>
> TLDR;
> I setfacl as root for a local user and the user still can't access the
> file.
>
> Example:
> root@test-client:/media/cephfs/storage/labs# touch test
> root@test-client:/media/cephfs/storage/labs# chown root:root test
> root@test-client:/media/cephfs/storage/labs# chmod 660 test
> root@test-client:/media/cephfs/storage/labs# setfacl -m u:brady:rwx test
>
> other shell as local user:
> brady@test-client:/media/cephfs/storage/labs$ getfacl test
> # file: test
> # owner: root
> # group: root
> user::rw-
> user:brady:rwx
> group::rw-
> mask::rwx
> other::---
>
> brady@test-client:/media/cephfs/storage/labs$ cat test
> cat: test: Permission denied
>
>
>
> Configuration details:
> Ubuntu 16.04.1
> fuse 2.9.4-1ubuntu3.1
> ceph-fuse 10.2.2-0ubuntu0.16.04.2
> acl 2.2.52-3
> kernel 4.4.0-34-generic (from ubuntu)
>
> fstab entry:
> mount.fuse.ceph#id=admin,conf=/etc/ceph/ceph.conf   /media/cephfs
> fusedefaults,_netdev0   0
>
> ceph.conf:
> [global]
> fsid = 6f91f60c-7bc0-4aaa-a136-4a90851fbe10
> mon_initial_members = mon0
> mon_host = 10.124.103.60
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> public_network = 10.124.103.0/24
> cluster_network = 10.124.104.0/24
> osd_pool_default_size = 3
>
> [client]
> fuse_default_permission=0
> client_acl_type=posix_acl
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Simple question about primary-affinity

2016-08-18 Thread Christian Balzer


Hello,

completely ignoring your question about primary-affinity (which always
struck me as a corner case thing). ^o^

If you're adding SSDs to your cluster you will want to:

a) use them for OSD journals (if you're not doing so already)
b) create dedicated pools for high speed data (i.e. RBD images for DB
storage) 
c) use them for cache-tiering.

The last one is a much more efficient approach than primary-affinity,
since hot objects will wind up on the SSDs, as opposed to random ones.
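
As a rough sketch of the commands involved, assuming a new SSD-backed pool
called "ssd-cache" placed in front of the existing "rbd" pool (proper
sizing and the various cache target/eviction settings are omitted here):

   ceph osd tier add rbd ssd-cache
   ceph osd tier cache-mode ssd-cache writeback
   ceph osd tier set-overlay rbd ssd-cache
   ceph osd pool set ssd-cache hit_set_type bloom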

Christian
On Thu, 18 Aug 2016 11:07:50 +0200 Florent B wrote:

> Hi everyone,
> 
> I begin to insert some SSD disks in my Ceph setup.
> 
> For now I only have 600GB on SSD (14 000 GB total space).
> 
> So my SSDs can't store *each* PG of my setup, for now.
> 
> If I set primary-affinity to 0 on non-SSD disks, will I get a problem
> for PGs stored on standard spinning disks ?
> 
> For example, a PG is on OSD 4,15,18, if they have primary-affinity to
> 0.0, will it be a problem to elect a primary ?
> 
> Do I have to set primary-affinity to 0.1 for non-SSD disks ?
> 
> Thank you ;)
> 
> Flo
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Understanding write performance

2016-08-18 Thread Christian Balzer

Hello,

On Thu, 18 Aug 2016 12:03:36 -0700 lewis.geo...@innoscale.net wrote:

> Hi,
>  So, I have really been trying to find information about this without 
> annoying the list, but I just can't seem to get any clear picture of it. I 
> was going to try to search the mailing list archive, but it seems there is 
> an error when trying to search it right now(posting below, and sending to 
> listed address in error). 
>
Google (as in all the various archives of this ML) works well for me,
as always the results depend on picking "good" search strings.
   
>  I have been working for a couple of months now(slowly) on testing out 
> Ceph. I only have a small PoC setup. I have 6 hosts, but I am only using 3 
> of them in the cluster at the moment. They each have 6xSSDs(only 5 usable 
> by Ceph), but the networks(1 public, 1 cluster) are only 1Gbps. I have the 
> MONs running on the same 3 hosts, and I have an OSD process running for 
> each of the 5 disks per host. The cluster shows in good health, with 15 
> OSDs. I have one pool there, the default rbd, which I setup with 512 PGs. 
>   
Exact SSD models, please.
Also CPU, though at 1GbE that isn't going to be your problem. 

>  I have create an rbd image on the pool, and I have it mapped and mounted 
> on another client host. 
Mapped via the kernel interface?

>When doing write tests, like with 'dd', I am 
> getting rather spotty performance. 
Example dd command line please.

>Not only is it up and down, but even 
> when it is up, the performance isn't that great. On large'ish(4GB 
> sequential) writes, it averages about 65MB/s, and on repeated smaller(40MB) 
> sequential writes, it is jumping around between 20MB/s and 80MB/s. 
>
Monitor your storage nodes during these test runs with atop (or iostat)
and see how busy your actual SSDs are then.
Also test with "rados bench" to get a base line.
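
For example (60 second write test against the default rbd pool, followed by
a sequential read test on the objects left behind):

   rados bench -p rbd 60 write --no-cleanup
   rados bench -p rbd 60 seq
   rados -p rbd cleanup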
   
>  However, with read tests, I am able to completely max out the network 
> there, easily reaching 125MB/s. Tests on the disks directly are able to get 
> up to 550MB/s reads and 350MB/s writes. So, I know it isn't a problem with 
> the disks.
>
How did you test these speed, exact command line please.
There are SSDs that can write very fast with buffered I/O but are
abysmally slow with sync/direct I/O. 
Which is what Ceph journals use.

See the various threads in here and the "classic" link:
https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
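
The short version of that test is a small direct/sync write run against the
raw SSD (destructive for whatever is on the target, so use a spare
partition), for example:

   dd if=/dev/zero of=/dev/sdX bs=4k count=100000 oflag=direct,dsync

or an equivalent fio job along the lines of:

   fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
       --numjobs=1 --iodepth=1 --runtime=60 --time_based \
       --group_reporting --name=journal-test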

>  I guess my question is, is there any additional optimizations or tuning I 
> should review here. I have read over all the docs, but I don't know which, 
> if any, of the values would need tweaking. Also, I am not sure if this is 
> just how it is with Ceph, given the need to write multiple copies of each 
> object. Is the slower write performance(averaging ~1/2 of the network 
> throughput) to be expected? I haven't seen any clear answer on that in the 
> docs or in articles I have found around. So, I am not sure if my 
> expectation is just wrong. 
>   
While the replication incurs some performance penalties, this is mostly an
issue with small I/Os, not the type of large sequential writes you're
doing.
I'd expect a setup like yours to deliver more or less full line speed, if
your network and SSDs are working correctly. 

In my crappy test cluster with an identical network setup to yours, 4
nodes with 4 crappy SATA disks each (so 16 OSDs), I can get better and
more consistent write speed than you, around 100MB/s.

Christian

>  Anyway, some basic idea on those concepts or some pointers to some good 
> docs or articles would be wonderful. Thank you!
>   
>  Lewis George
>   
>   
> 


-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Understanding osd default min size

2016-08-18 Thread Erick Lazaro
Hi.

I would like to understand how the default OSD pool min size parameter works.
For example, I set up:

osd default pool size = 3
osd default min pool size = 2

If 1 OSD fails, will Ceph block writes to the degraded PGs?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Fail to automount osd after reboot when the /var Partition is ext4 but success automount when /var Partition is xfs

2016-08-18 Thread Leo Yu
hi cephers,
 i have deployed a jewel 10.2.2 cluster, and the OSDs fail to automount after
reboot when the /var Partition is ext4:

[root@node1 ~]# lsblk -f
NAME FSTYPE  LABEL UUID
MOUNTPOINT
fd0
sda
├─sda1   ext4  497a4f82-3cbf-4e27-b026-cdd3c5ecc2dd
/boot
└─sda2   LVM2_member   EXsHQz-Lee3-SST4-3ska-Do4R-sDzg-ymN3yd
  ├─bclinux-root ext4  b5a29747-9a1b-4bdb-bfc0-ccb4eb947a48   /
  ├─bclinux-swap swap  745fa7a2-e992-44ca-a587-8406f7487773
[SWAP]
  ├─bclinux-home ext4  af856cf6-e20a-4c09-b055-5bc38ddec0a5
/home
  └─bclinux-var  ext4  70a9c2e2-e6b0-4cd2-a607-016ddb5d8b7d
/var
sdb
├─sdb1   xfs   695ee560-826b-48d4-bbc4-cb2f78465db5
└─sdb2
sdc
├─sdc1   xfs   f8cb77d6-99db-4ff5-b31e-41cae315ca6e
└─sdc2
sdd
├─sdd1   xfs   e64ba20e-4248-44c1-be18-32b204f9e43e
└─sdd2


but the OSDs automount successfully after reboot when the /var Partition is xfs:
[root@node2 ~]# lsblk -f
NAME FSTYPE  LABEL UUID
MOUNTPOINT
fd0
sda
├─sda1   xfs   ce0656f8-cda9-4639-81dd-320088d17dba
/boot
└─sda2   LVM2_member   f5n2BM-IZAX-4wa1-hipa-4uZ2-v9eJ-PntKfq
  ├─bclinux-root xfs   4b022772-53c3-4420-8702-626f220ca344   /
  ├─bclinux-swap swap  492c1f8e-b857-43a8-ab29-bdbeef1515e1
[SWAP]
  └─bclinux-var  xfs   d08feb66-d3d3-4eb1-b43e-45fdba243bd2
/var
sdb
├─sdb1   xfs   bcded8bf-7965-4673-a732-464efdd0ecc8
/var/lib/ceph/osd/ceph-0
└─sdb2
sdc
├─sdc1   xfs   772772fe-f8b3-4597-b3f3-69be70b3ca3f
/var/lib/ceph/osd/ceph-1
└─sdc2
sdd
├─sdd1   xfs   19688192-caf4-498b-96c7-42de63b1ac53
/var/lib/ceph/osd/ceph-2
└─sdd2


is it a bug with the ext4 file system? Any solution to auto mount OSDs when /var
is formatted as ext4?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Understanding osd default min size

2016-08-18 Thread Christian Balzer
On Fri, 19 Aug 2016 01:56:13 + Erick Lazaro wrote:

> Hi.
> 
> I would like to understand how the parameter default OSD min size pool.
Did you read this?
http://docs.ceph.com/docs/hammer/rados/operations/pools/#set-the-number-of-object-replicas

> For example, I set up:
> 
> osd default pool size = 3
> osd default min pool size = 2
> 
> Failing 1 OSD, the ceph will block writing in degraded pgs?

No.
If a 2nd OSD would fail at the same time AND if it had replicas of some of
the PGs on the first failed OSD (not a given in a large cluster), then
your I/O will be blocked given the above values.
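
For reference, the value can be checked and changed per pool (pool name
"rbd" used as an example):

   ceph osd pool get rbd min_size
   ceph osd pool set rbd min_size 1

with the usual caveat that min_size 1 lets writes proceed with only a
single copy left.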

Christian
-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Spreading deep-scrubbing load

2016-08-18 Thread Mark Kirkwood

On 15/06/16 13:18, Christian Balzer wrote:


 "osd_scrub_min_interval": "86400",
 "osd_scrub_max_interval": "604800",
 "osd_scrub_interval_randomize_ratio": "0.5",
Latest Hammer and afterwards can randomize things (spreading the load out),
but if you want things to happen within a certain time frame this might
not be helpful.



We are on Hammer (0.94.7) and have osd_scrub_interval_randomize_ratio 
set - it does not appear to be helping much. E.g here are our deep scrub 
dates aggregated by day for one region:


Day          num     total size
-------------------------------
2016-07-19  511 5357167224372
2016-07-20 2129 9325996976904
2016-07-21   81   10544023951
2016-07-2295754126355
2016-07-25   21 0
2016-07-26113
2016-07-27   19 0
2016-08-01   52 0
2016-08-051 0
2016-08-086 0
2016-08-10  230  711935323538
2016-08-11 1211 5190285214797
2016-08-12 1543 1900998763365
2016-08-13   20   41687250384
2016-08-14   26   61185603382
2016-08-15  647 1963297626685
2016-08-16 2133 8354221242053
--
Totals 8640 3282445266


We have the deep scrub interval set to 4 weeks but only 17 days are used 
for deep scrubbing, and clearly the workload is quite uneven due to the 
distribution. I am attacking these two issues with a pre-emptive deep 
scrubbing script that we will run nightly.
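
As a very rough sketch, such a script can boil down to pacing out manual
deep scrubs over a prepared list of PGs (the list itself would be built
beforehand from "ceph pg dump", sorted by last_deep_scrub_stamp; the column
position of that stamp differs between versions, so no hard-coded parsing
here):

   while read -r pg; do
       ceph pg deep-scrub "$pg"
       sleep 60             # pace the scrubs so they don't all pile up
   done < pgs_to_deep_scrub.txt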


It would be cool to have a command or api to alter/set the last deep 
scrub timestamp - as it seems to me that the only way to change the 
distribution of deep scrubs is to perform deep scrubs...


regards

Mark
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Understanding write performance

2016-08-18 Thread lewis.geo...@innoscale.net
Hi Christian,
 Thank you for the follow-up on this. 
  
 I answered those questions inline below.
  
 Have a good day,
  
 Lewis George
  


 From: "Christian Balzer" 
Sent: Thursday, August 18, 2016 6:31 PM
To: ceph-users@lists.ceph.com
Cc: "lewis.geo...@innoscale.net" 
Subject: Re: [ceph-users] Understanding write performance   

Hello,

On Thu, 18 Aug 2016 12:03:36 -0700 lewis.geo...@innoscale.net wrote:

>> Hi,
>> So, I have really been trying to find information about this without
>> annoying the list, but I just can't seem to get any clear picture of it. 
I
>> was going to try to search the mailing list archive, but it seems there 
is
>> an error when trying to search it right now(posting below, and sending 
to
>> listed address in error).
>>
>Google (as in all the various archives of this ML) works well for me,
>as always the results depend on picking "good" search strings.
>
>> I have been working for a couple of months now(slowly) on testing out
>> Ceph. I only have a small PoC setup. I have 6 hosts, but I am only using 
3
>> of them in the cluster at the moment. They each have 6xSSDs(only 5 
usable
>> by Ceph), but the networks(1 public, 1 cluster) are only 1Gbps. I have 
the
>> MONs running on the same 3 hosts, and I have an OSD process running for
>> each of the 5 disks per host. The cluster shows in good health, with 15
>> OSDs. I have one pool there, the default rbd, which I setup with 512 
PGs.
>>
>Exact SSD models, please.
>Also CPU, though at 1GbE that isn't going to be your problem.
  
 #Lewis: Each SSD is of model:
 Model Family: Samsung based SSDs
Device Model: Samsung SSD 840 PRO Series
  
 Each of the 3 nodes has 2 x Intel E5645, with 48GB of memory.

>> I have create an rbd image on the pool, and I have it mapped and 
mounted
>> on another client host.
 >Mapped via the kernel interface?
  
 # Lewis On the client node(which is same specs as the 3 others), I used 
the 'rbd map' command to map a 100GB rbd image to rbd0, then created an xfs 
FS on there, and mounted it.

>>When doing write tests, like with 'dd', I am
>> getting rather spotty performance.
>Example dd command line please.
  
 #Lewis: I put those below.

>>Not only is it up and down, but even
>> when it is up, the performance isn't that great. On large'ish(4GB
>> sequential) writes, it averages about 65MB/s, and on repeated 
smaller(40MB)
>> sequential writes, it is jumping around between 20MB/s and 80MB/s.
>>
>Monitor your storage nodes during these test runs with atop (or iostat)
>and see how busy your actual SSDs are then.
>Also test with "rados bench" to get a base line.
  
 #Lewis: I have all the nodes instrumented with collectd. I am seeing each 
disk only writing at ~25MB/s during the write tests. I will check out the 
'rados bench' command, as I have not checked it yet.

>> However, with read tests, I am able to completely max out the network
>> there, easily reaching 125MB/s. Tests on the disks directly are able to 
get
>> up to 550MB/s reads and 350MB/s writes. So, I know it isn't a problem 
with
>> the disks.
>>
>How did you test these speed, exact command line please.
>There are SSDs that can write very fast with buffered I/O but are
>abysmally slow with sync/direct I/O.
>Which is what Ceph journals use.
  
 #Lewis: I have mostly been testing with just dd, though I have also tested 
using several fio tests too. With dd, I have tested writing 4GB files, with 
both 4k and 1M block sizes(get about the same results, on average).
  
 dd if=/dev/zero of=/mnt/set1/testfile700 bs=4k count=100 conv=fsync
 dd if=/dev/zero of=/mnt/set1/testfile700 bs=1M count=4000 conv=fsync

>See the various threads in here and the "classic" link:
>https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-i
s-suitable-as-a-journal-device/
  
 #Lewis: I have been reading over a lot of his articles. They are really 
good. I did not see that one. Thank you for pointing it out.

>> I guess my question is, is there any additional optimizations or tuning 
I
>> should review here. I have read over all the docs, but I don't know 
which,
>> if any, of the values would need tweaking. Also, I am not sure if this 
is
>> just how it is with Ceph, given the need to write multiple copies of 
each
>> object. Is the slower write performance(averaging ~1/2 of the network
>> throughput) to be expected? I haven't seen any clear answer on that in 
the
>> docs or in articles I have found around. So, I am not sure if my
>> expectation is just wrong.
>>
>While the replication incurs some performance penalties, this is mostly 
an
>issue with small I/Os, not the type of large sequential writes you're
>doing.
>I'd expect a setup like yours to deliver more or less full line speed, if
>your network and SSDs are working correctly.
>
>In my crappy test cluster with an identical network setup to yours, 4
>nodes with 4 crappy SATA disks each (so 16 OSDs), I can get better and
>more consistent write speed

Re: [ceph-users] Spreading deep-scrubbing load

2016-08-18 Thread Christian Balzer

Holy thread necromancy, Batman!

On Fri, 19 Aug 2016 15:39:13 +1200 Mark Kirkwood wrote:

> On 15/06/16 13:18, Christian Balzer wrote:
> >
> >  "osd_scrub_min_interval": "86400",
> >  "osd_scrub_max_interval": "604800",
> >  "osd_scrub_interval_randomize_ratio": "0.5",
> > Latest Hammer and afterwards can randomize things (spreading the load out),
> > but if you want things to happen within a certain time frame this might
> > not be helpful.
> >
> 
> We are on Hammer (0.94.7) and have osd_scrub_interval_randomize_ratio 
> set - it does not appear to be helping much. E.g here are our deep scrub 
> dates aggregated by day for one region:
> 
> Day          num     total size
> -------------------------------
> 2016-07-19  511 5357167224372
> 2016-07-20 2129 9325996976904
> 2016-07-21   81   10544023951
> 2016-07-2295754126355
> 2016-07-25   21 0
> 2016-07-26113
> 2016-07-27   19 0
> 2016-08-01   52 0
> 2016-08-051 0
> 2016-08-086 0
> 2016-08-10  230  711935323538
> 2016-08-11 1211 5190285214797
> 2016-08-12 1543 1900998763365
> 2016-08-13   20   41687250384
> 2016-08-14   26   61185603382
> 2016-08-15  647 1963297626685
> 2016-08-16 2133 8354221242053
> --
> Totals 8640 3282445266
> 
> 
> We have the deep scrub interval set to 4 weeks but only 17 days are used 
> for deep scrubbing, and clearly the workload is quite uneven due to the 
> distribution. I am attacking these two issues with a pre-emptive deep 
> scrubbing script that we will run nightly.
> 
> It would be cool to have a command or api to alter/set the last deep 
> scrub timestamp - as it seems to me that the only way to change the 
> distribution of deep scrubs is to perform deep scrubs...
> 
Yes, and if I didn't mention it in the original thread (I usually do when
it comes to this topic), that's what I do.
As in, no randomization (I hatez it), fixed window, kick off deep scrubs
once at the opportune time on a weekend night.
And if it fits in that time window, that's all you need, no need for further cron
jobs.

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Understanding write performance

2016-08-18 Thread Christian Balzer

Hello,

see below, inline.

On Thu, 18 Aug 2016 21:41:33 -0700 lewis.geo...@innoscale.net wrote:

> Hi Christian,
>  Thank you for the follow-up on this. 
>   
>  I answered those questions inline below.
>   
>  Have a good day,
>   
>  Lewis George
>   
> 
> 
>  From: "Christian Balzer" 
> Sent: Thursday, August 18, 2016 6:31 PM
> To: ceph-users@lists.ceph.com
> Cc: "lewis.geo...@innoscale.net" 
> Subject: Re: [ceph-users] Understanding write performance   
> 
> Hello,
> 
> On Thu, 18 Aug 2016 12:03:36 -0700 lewis.geo...@innoscale.net wrote:
> 
> >> Hi,
> >> So, I have really been trying to find information about this without
> >> annoying the list, but I just can't seem to get any clear picture of it. I
> >> was going to try to search the mailing list archive, but it seems there is
> >> an error when trying to search it right now (posting below, and sending to
> >> listed address in error).
> >>
> >Google (as in all the various archives of this ML) works well for me;
> >as always, the results depend on picking "good" search strings.
> >
> >> I have been working for a couple of months now (slowly) on testing out
> >> Ceph. I only have a small PoC setup. I have 6 hosts, but I am only using 3
> >> of them in the cluster at the moment. They each have 6xSSDs (only 5 usable
> >> by Ceph), but the networks (1 public, 1 cluster) are only 1Gbps. I have the
> >> MONs running on the same 3 hosts, and I have an OSD process running for
> >> each of the 5 disks per host. The cluster shows in good health, with 15
> >> OSDs. I have one pool there, the default rbd, which I set up with 512 PGs.
> >>
> >Exact SSD models, please.
> >Also CPU, though at 1GbE that isn't going to be your problem.
>   
>  #Lewis: Each SSD is of model:
>  Model Family: Samsung based SSDs
> Device Model: Samsung SSD 840 PRO Series
> 
Consumer model, known to be deadly slow with dsync/direct writes.

And even if it didn't have those issues, endurance would make it a no-go
outside a PoC environment. 

>  Each of the 3 nodes has 2 x Intel E5645, with 48GB of memory.
>
That's plenty then.
 
> >> I have created an rbd image on the pool, and I have it mapped and mounted
> >> on another client host.
>  >Mapped via the kernel interface?
>   
>  # Lewis: On the client node (which has the same specs as the 3 others), I used 
> the 'rbd map' command to map a 100GB rbd image to rbd0, then created an xfs 
> FS on there, and mounted it.
>
Kernel then, OK.
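
(For reference, the client-side steps described above boil down to something
like this sketch; the image name, size and mount point are assumptions, not
taken from the thread.)

  rbd create rbd/testimg --size 102400 --image-feature layering   # 100 GB image, kernel-mappable
  rbd map rbd/testimg                                             # shows up as /dev/rbd0
  mkfs.xfs /dev/rbd0
  mount /dev/rbd0 /mnt/set1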
 
> >>When doing write tests, like with 'dd', I am
> >> getting rather spotty performance.
> >Example dd command line please.
>   
>  #Lewis: I put those below.
> 
> >> Not only is it up and down, but even
> >> when it is up, the performance isn't that great. On large-ish (4GB
> >> sequential) writes, it averages about 65MB/s, and on repeated smaller (40MB)
> >> sequential writes, it is jumping around between 20MB/s and 80MB/s.
> >>
> >Monitor your storage nodes during these test runs with atop (or iostat)
> >and see how busy your actual SSDs are then.
> >Also test with "rados bench" to get a base line.
>   
>  #Lewis: I have all the nodes instrumented with collectd. I am seeing each 
> disk only writing at ~25MB/s during the write tests.

That won't show you how busy the drives are; in fact, I'm not aware of any
collectd plugin that will give you that info.

Use atop (or iostat) locally as I said, though I know what the output will
be now of course.

> I will check out the 
> 'rados bench' command, as I have not checked it yet.
> 
It will be in the same ballpark, now knowing what SSDs you have.
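
(A minimal baseline run looks roughly like the following; the pool name and
duration are assumptions.)

  rados bench -p rbd 30 write --no-cleanup   # 30 s of 4 MB object writes
  rados bench -p rbd 30 seq                  # sequential reads of those objects
  rados -p rbd cleanup                       # remove the benchmark objects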

> >> However, with read tests, I am able to completely max out the network
> >> there, easily reaching 125MB/s. Tests on the disks directly are able to get
> >> up to 550MB/s reads and 350MB/s writes. So, I know it isn't a problem with
> >> the disks.
> >>
> >How did you test these speed, exact command line please.
> >There are SSDs that can write very fast with buffered I/O but are
> >abysmally slow with sync/direct I/O.
> >Which is what Ceph journals use.
>   
>  #Lewis: I have mostly been testing with just dd, though I have also tested 
> using several fio tests too. With dd, I have tested writing 4GB files, with 
> both 4k and 1M block sizes (get about the same results, on average).
>   
>  dd if=/dev/zero of=/mnt/set1/testfile700 bs=4k count=100 conv=fsync
>  dd if=/dev/zero of=/mnt/set1/testfile700 bs=1M count=4000 conv=fsync
> 

You're using fsync, but as per the cited article below, this is not what the
journal code uses.
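
(The test that article boils down to is a direct, dsync'ed dd against the raw
device, roughly as sketched here; /dev/sdX is a placeholder and the command
destroys data on that device.)

  dd if=/dev/zero of=/dev/sdX bs=4k count=100000 oflag=direct,dsync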

> >See the various threads in here and the "classic" link:
> >https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>   
>  #Lewis: I have been reading over a lot of his articles. They are really 
> good. I did not see that one. Thank you for pointing it out.
> 
I wouldn't trust all the results and numbers there, some of them are
clearly wrong or were taken with differing m

[ceph-users] Using S3 java SDK to change a bucket acl fails. ceph version 10.2.2

2016-08-18 Thread zhu tong
The error is:

Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception: 
Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; 
Request ID: null), S3 Extended Request ID: null

I have tried it with multiple SDK versions; some show a different description 
("Content is not allowed in prolog"), but all get 400 Bad Request.
I have tried using curl, which succeeded.
The same Java code works fine with ceph 0.94.7.

Thanks.
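
(As a command-line cross-check independent of the Java SDK, something like the
following s3cmd calls can exercise the same ACL path against the RGW endpoint;
the bucket name and the configured endpoint/credentials are assumptions.)

  s3cmd setacl s3://mybucket --acl-public   # change the bucket ACL
  s3cmd info s3://mybucket                  # show the resulting ACL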
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Spreading deep-scrubbing load

2016-08-18 Thread Mark Kirkwood

On 19/08/16 17:33, Christian Balzer wrote:


On Fri, 19 Aug 2016 15:39:13 +1200 Mark Kirkwood wrote:



It would be cool to have a command or api to alter/set the last deep
scrub timestamp - as it seems to me that the only way to change the
distribution of deep scrubs is to perform deep scrubs...


Yes, and if I didn't mention it in the original thread (I usually do when
it comes to this topic), that's what I do.
As in, no randomization (I hatez it), fixed window, kick off deep scrubs
once at the opportune time on a weekend night.
And if it fits in that time window, that's neat, no need for further cron
jobs.




Righty, well I'm testing a python rados api script that works out how 
many (bytes worth of) pgs it should deep scrub each night to move the 
distribution to be (eventually) uniform and also spread 'em to use all 
the days in the deep scrub interval... will post here after satisfying 
myself that it is getting the maths right!
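
(Not Mark's script, but for a quick look at one's own distribution something
along these lines works; the JSON layout of "ceph pg dump" differs between
releases, so the jq filter may need '.pg_stats[]' instead of '.[]'.)

  # List pgid, last deep-scrub stamp and size, oldest first.
  ceph pg dump pgs --format=json 2>/dev/null \
    | jq -r '.[] | [.pgid, .last_deep_scrub_stamp, .stat_sum.num_bytes] | @tsv' \
    | sort -k2 | head -20
  # ...then queue tonight's share with: ceph pg deep-scrub <pgid>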


Cheers

Mark

P.S.: Sage, if you are reading this... a command to alter deep scrub dates 
for pgs would be real cool :-)

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW multisite - second cluster woes

2016-08-18 Thread Shilpa Manjarabad Jagannath


- Original Message -
> From: "Ben Morrice" 
> To: ceph-users@lists.ceph.com
> Sent: Thursday, August 18, 2016 8:59:30 PM
> Subject: [ceph-users] RGW multisite - second cluster woes
> 
> Hello,
> 
> I am trying to configure a second cluster into an existing Jewel RGW
> installation.
> 
> I do not get the expected output when I perform a 'radosgw-admin realm
> pull'. My realm on the first cluster is called 'gold'; however, when
> doing a realm pull it doesn't reflect the 'gold' name or id, and I get an
> error related to latest_epoch (?).
> 
> The documentation seems straightforward, so I'm not quite sure what I'm
> missing here.
> 
> Please see below for the full output.
> 
> # radosgw-admin realm pull --url=http://cluster1:80 --access-key=access
> --secret=secret
> 
> 2016-08-18 17:20:09.585261 7fb939d879c0  0 error read_lastest_epoch
> .rgw.root:periods.8c64a4dd-ccd8-4975-b63b-324fbb24aab6.latest_epoch
> {
> "id": "98a7b356-83fd-4d42-b895-b58d45fa4233",
> "name": "",
> "current_period": "8c64a4dd-ccd8-4975-b63b-324fbb24aab6",
> "epoch": 1
> }
> 

The realm name is empty here. Could you share the output of "radosgw-admin 
period get" from the first cluster?


> # radosgw-admin period pull --url=http://cluster1:80 --access-key=access
> --secret=secret
> 2016-08-18 17:21:33.277719 7f5dbc7849c0  0 error read_lastest_epoch
> .rgw.root:periods..latest_epoch
> {
> "id": "",
> "epoch": 0,
> "predecessor_uuid": "",
> "sync_status": [],
> "period_map": {
> "id": "",
> "zonegroups": [],
> "short_zone_ids": []
> },
> "master_zonegroup": "",
> "master_zone": "",
> "period_config": {
> "bucket_quota": {
> "enabled": false,
> "max_size_kb": -1,
> "max_objects": -1
> },
> "user_quota": {
> "enabled": false,
> "max_size_kb": -1,
> "max_objects": -1
> }
> },
> "realm_id": "",
> "realm_name": "",
> "realm_epoch": 0
> }
> 
> # radosgw-admin realm default --rgw-realm=gold
> failed to init realm: (2) No such file or directory
> 2016-08-18 17:21:46.220181 7f720defa9c0  0 error in read_id for id  : (2) No such
> file or directory
> 
> # radosgw-admin zonegroup default --rgw-zonegroup=us
> failed to init zonegroup: (2) No such file or directory
> 2016-08-18 17:22:10.348984 7f9b2da699c0  0 error in read_id for id  :
> (2) No such file or directory
> 
> 
> --
> Kind regards,
> 
> Ben Morrice
> 
> __
> Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670
> EPFL ENT CBS BBP
> Biotech Campus
> Chemin des Mines 9
> 1202 Geneva
> Switzerland
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com