Re: [ceph-users] Please help me get rid of Slow / blocked requests

2018-05-01 Thread Shantur Rathore
Hi Paul,

Thanks for replying to my query.
I am not sure the benchmark is overloading the cluster: in 3 out of
5 runs the benchmark sits at around 37K IOPS, but in the problematic
runs it suddenly drops to 0 IOPS for a couple of minutes and then
resumes. This is a test cluster, so nothing else is running on it.

OSD 2 is the same as all the other OSDs, and it is a different OSD
every time, on both nodes.

Any pointers?

Regards,
Shantur

On Mon, Apr 30, 2018 at 6:34 PM, Paul Emmerich  wrote:
> Hi,
>
> blocked requests are just requests that took longer than 30 seconds to
> complete; this means your cluster is completely overloaded by the
> benchmark.
> Also, OSD 2 might be slower than your other OSDs.
>
> Paul
>
> 2018-04-30 15:36 GMT+02:00 Shantur Rathore :
>>
>> Hi all,
>>
>> I am trying to get my first test Ceph cluster working.
>>
>> Centos 7 with Elrepo 4.16.3-1.el7.elrepo.x86_64 kernel ( for iSCSI HA )
>> Configured using - ceph-ansible
>> 3 Mons ( including 2 OSD nodes )
>> 2 OSD nodes
>> 20 OSDs ( 10 per node )
>>
>> Each OSD node has 256GB of memory and a 2x 10GbE bonded interface.
>> For simplicity it uses the public network only.
>>
>> During testing of the cluster from one of the OSD nodes, whenever I run
>> a test I see slow / blocked requests on both nodes, which clear up
>> after some time.
>>
>> I have checked the disks and the network, and both are working as expected.
>> I have been reading up and trying to find what could be the issue, but I am
>> unable to find any fix or solution to the problem.
>>
>> #Test Command
>> [root@storage-29 ~]# rbd bench --io-type write -p test --image disk1
>> --io-pattern seq --io-size 4K --io-total 10G
>>
>> In this run I saw in "ceph health detail" that osd.2 had blocked
>> requests, so I ran:
>>
>> [root@storage-29 ~]# ceph daemon osd.2 dump_blocked_ops
>> .. Last op from the output
>>
>> {
>> "description": "osd_op(client.181675.0:933 6.e1
>> 6:8736f1d3:::rbd_data.20d9674b0dc51.06b7:head [write
>> 1150976~4096] snapc 0=[] ondisk+write+known_if_redirected e434)",
>> "initiated_at": "2018-04-30 14:04:37.656717",
>> "age": 79.228713,
>> "duration": 79.230355,
>> "type_data": {
>> "flag_point": "waiting for sub ops",
>> "client_info": {
>> "client": "client.181675",
>> "client_addr": "10.187.21.212:0/342865484",
>> "tid": 933
>> },
>> "events": [
>> {
>> "time": "2018-04-30 14:04:37.656717",
>> "event": "initiated"
>> },
>> {
>> "time": "2018-04-30 14:04:37.656789",
>> "event": "queued_for_pg"
>> },
>> {
>> "time": "2018-04-30 14:04:37.656869",
>> "event": "reached_pg"
>> },
>> {
>> "time": "2018-04-30 14:04:37.656917",
>> "event": "started"
>> },
>> {
>> "time": "2018-04-30 14:04:37.656970",
>> "event": "waiting for subops from 10"
>> },
>> {
>> "time": "2018-04-30 14:04:37.669473",
>> "event": "op_commit"
>> },
>> {
>> "time": "2018-04-30 14:04:37.669475",
>> "event": "op_applied"
>> }
>> ]
>> }
>> }
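
Since the op above is stuck at "waiting for subops from 10", the replica OSD
can also be queried directly; roughly, with Luminous admin-socket commands run
on the node hosting osd.10:

# ops currently in flight on the replica OSD
ceph daemon osd.10 dump_ops_in_flight
# recently completed slow ops, with per-event timings
ceph daemon osd.10 dump_historic_ops
# commit/apply latency for every OSD in the cluster
ceph osd perf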
>>
>> I checked the logs from
>>
>> [root@storage-30 ~]# tail -n 1000 /var/log/ceph/ceph-osd.10.log
>>
>> And around that time nothing is printed in the logs; the most recent
>> entries are from ~13:35:
>>
>> 2018-04-30 13:34:59.986731 7fa91d3c5700  0 --
>> 10.187.21.211:6810/343380 >> 10.187.21.211:6818/344034
>> conn(0x55b79db85000 :6810 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH
>> pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 56 vs existing
>> csq=55 existing_state=STATE_STANDBY
>> 2018-04-30 13:35:00.992309 7fa91d3c5700  0 --
>> 10.187.21.211:6810/343380 >> 10.187.21.212:6825/94560
>> conn(0x55b79e3fd000 :6810 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH
>> pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 9 vs existing
>> csq=9 existing_state=STATE_STANDBY
>> 2018-04-30 13:35:00.992711 7fa91d3c5700  0 --
>> 10.187.21.211:6810/343380 >> 10.187.21.212:6825/94560
>> conn(0x55b79e3fd000 :6810 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH
>> pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 10 vs existing
>> csq=9 existing_state=STATE_STANDBY
>> 2018-04-30 13:35:01.328882 7fa91d3c5700  0 --
>> 10.187.21.211:6810/343380 >> 10.187.21.212:6821/94497
>> conn(0x55b79e288000 :6810 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH
>> pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 7 vs existing
>> c

Re: [ceph-users] Please help me get rid of Slow / blocked requests

2018-05-01 Thread Van Leeuwen, Robert
> On 5/1/18, 12:02 PM, "ceph-users on behalf of Shantur Rathore" wrote:
>
>    I am not sure if the benchmark is overloading the cluster as 3 out of
>    5 runs the benchmark goes around 37K IOPS and suddenly for the
>    problematic runs it drops to 0 IOPS for a couple of minutes and then
>    resumes. This is a test cluster so nothing else is running off it.

Sounds like one of the following could be happening:
1) RBD write caching is absorbing the 37K IOPS; it will need to flush at some
point, which causes the drop.

2) Hardware performance drops over time.
You could be hitting hardware write cache on RAID or disk controllers.
SSDs in particular can show a performance drop after sustained writes, due to
either SSD housekeeping or caches filling up.
So always run benchmarks over longer periods to make sure you measure the
actual sustainable performance of your cluster.
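
A quick way to rule (1) in or out is to repeat the run with the librbd client
cache disabled; passing the option as a command-line override (shown below)
should work, but it is worth confirming it is actually picked up, e.g. via the
client admin socket:

rbd bench --io-type write -p test --image disk1 --io-pattern seq \
    --io-size 4K --io-total 10G --rbd-cache=false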

Cheers,
Robert van Leeuwen

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Please help me get rid of Slow / blocked requests

2018-05-01 Thread John Hearns
>Sounds like one of the following could be happening:
> 1) RBD write caching doing the 37K IOPS, which will need to flush at some
point which causes the drop.

I am not sure this will help, Shantur, but you could try running 'watch cat
/proc/meminfo' during a benchmark run; you might be able to spot caches
being flushed.
iostat is probably a better tool.
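
For example, something like this on the client during a run, plus iostat on
the OSD nodes (device names and intervals are arbitrary):

# dirty / writeback page cache on the client
watch -n1 'grep -E "Dirty|Writeback" /proc/meminfo'
# per-device utilisation, queue depth and latency on the OSD nodes
iostat -xmt 1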




On 1 May 2018 at 13:13, Van Leeuwen, Robert  wrote:

> > On 5/1/18, 12:02 PM, "ceph-users on behalf of Shantur Rathore" <
> ceph-users-boun...@lists.ceph.com on behalf of shantur.rath...@gmail.com>
> wrote:
> >I am not sure if the benchmark is overloading the cluster as 3 out of
> >   5 runs the benchmark goes around 37K IOPS and suddenly for the
> >problematic runs it drops to 0 IOPS for a couple of minutes and then
> >   resumes. This is a test cluster so nothing else is running off it.
>
> Sounds like one of the following could be happening:
> 1) RBD write caching doing the 37K IOPS, which will need to flush at some
> point which causes the drop.
>
> 2) Hardware performance drops over time.
> You could be hitting hardware write cache on RAID or disk controllers.
> Especially SSDs can have a performance drop after writing to them for a
> while due to either SSD housekeeping or caches filling up.
> So always run benchmarks over longer periods to make sure you get the
> actual sustainable performance of your cluster.
>
> Cheers,
> Robert van Leeuwen
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] troubleshooting librados error with concurrent requests

2018-05-01 Thread Sam Whitlock
I am using librados in application to read and write many small files
(<128MB) concurrently, both in the same process and in different processes
(across many nodes). The application is built on Tensorflow (the read and
write operations are custom kernels I wrote).

I'm having an issue with this application where, after a few minutes, all
of my processes stop reading and writing to RADOS. While debugging I can
see that they're all waiting, with some variation of the following stack
trace (edited for brevity), for various stat/read/write/write_full
operations:

#0 in pthread_cond_wait@@GLIBC_2.3.2 () from
/lib/x86_64-linux-gnu/libpthread.so.0
#1 in Cond::Wait (this=this@entry=0x7f7f977dce20, mutex=...) at
./common/Cond.h:56
#2 in librados::IoCtxImpl::operate_read (this=this@entry=0x7f7ed40b4190,
oid=..., o=o@entry=0x7f7f977dd050, pbl=pbl@entry=0x0, flags=flags@entry=0)
at librados/IoCtxImpl.cc:725
#3 in librados::IoCtxImpl::stat (this=0x7f7ed40b4190, oid=...,
psize=psize@entry=0x7f7f977dd198, pmtime=pmtime@entry=0x7f7f977dd1a0) at
librados/IoCtxImpl.cc:1238
#4 in librados::IoCtx::stat (this=0x7f7f977dd290, oid=...,
psize=0x7f7f977dd198, pmtime=0x7f7f977dd1a0) at librados/librados.cc:1260

The application then proceeds to complete requests at a glacial pace (~3-5
an hour) indefinitely.

When I run the application with a very low level of concurrency, it works
properly. This "lock up" doesn't happen.

All reads and writes are to a single pool from the same user. No files are
concurrently modified by different requests (i.e. completely independent /
embarrassingly parallel architecture in my app).

How might I go about troubleshooting this? I'm not sure which logs to look
at and what I might be looking for (if it is even logged).

I'm running Ceph 12.2.2, all machines running Ubuntu 16.04.

--
Sam Whitlock
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] radosgw bucket listing (s3 ls s3://$bucketname) slow with ~2 billion objects

2018-05-01 Thread Katie Holly
One of our radosgw buckets has grown a lot in size: `rgw bucket stats --bucket 
$bucketname` reports a total of 2,110,269,538 objects, with the bucket index 
sharded across 32768 shards. Listing the root context of the bucket with `s3 ls 
s3://$bucketname` takes more than an hour, which is the hard limit to first-byte 
on our nginx reverse proxy, and the aws-cli times out long before that timeout 
limit is hit.

The software we use supports sharding the data across multiple s3 buckets but 
before I go ahead and enable this, has anyone ever had that many objects in a 
single RGW bucket and can let me know how you solved the problem of RGW taking 
a long time to read the full index?

-- 
Best regards

Katie Holly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw bucket listing (s3 ls s3://$bucketname) slow with ~2 billion objects

2018-05-01 Thread Robert Stanford
 Listing will always take forever when using a high shard number, AFAIK.
That's the tradeoff for sharding.  Are those 2B objects in one bucket?
How's your read and write performance compared to a bucket with a lower
number (thousands) of objects, with that shard number?

On Tue, May 1, 2018 at 7:59 AM, Katie Holly <8ld3j...@meo.ws> wrote:

> One of our radosgw buckets has grown a lot in size, `rgw bucket stats
> --bucket $bucketname` reports a total of 2,110,269,538 objects with the
> bucket index sharded across 32768 shards, listing the root context of the
> bucket with `s3 ls s3://$bucketname` takes more than an hour which is the
> hard limit to first-byte on our nginx reverse proxy and the aws-cli times
> out long before that timeout limit is hit.
>
> The software we use supports sharding the data across multiple s3 buckets
> but before I go ahead and enable this, has anyone ever had that many
> objects in a single RGW bucket and can let me know how you solved the
> problem of RGW taking a long time to read the full index?
>
> --
> Best regards
>
> Katie Holly
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Collecting BlueStore per Object DB overhead

2018-05-01 Thread David Turner
Primary RGW usage.  270M objects, 857TB data/1195TB raw, EC 8+3 in the RGW
data pool, less than 200K objects in all other pools.  OSDs 366 and 367 are
NVMe OSDs, the rest are 10TB disks for data/DB and 2GB WAL NVMe partition.
The only things on the NVMe OSDs are the RGW metadata pools.  I only have 2
servers with bluestore, the rest are currently filestore in the cluster.

osd.319 onodes=164010 db_used_bytes=14433648640 avg_obj_size=23392454
overhead_per_obj=88004
osd.352 onodes=162395 db_used_bytes=12957253632 avg_obj_size=23440441
overhead_per_obj=79788
osd.357 onodes=159920 db_used_bytes=14039384064 avg_obj_size=24208736
overhead_per_obj=87790
osd.356 onodes=164420 db_used_bytes=13006536704 avg_obj_size=23155304
overhead_per_obj=79105
osd.355 onodes=164086 db_used_bytes=13021216768 avg_obj_size=23448898
overhead_per_obj=79356
osd.354 onodes=164665 db_used_bytes=13026459648 avg_obj_size=23357786
overhead_per_obj=79108
osd.353 onodes=164575 db_used_bytes=14099152896 avg_obj_size=23377114
overhead_per_obj=85670
osd.359 onodes=163922 db_used_bytes=13991149568 avg_obj_size=23397323
overhead_per_obj=85352
osd.358 onodes=164805 db_used_bytes=12706643968 avg_obj_size=23160121
overhead_per_obj=77101
osd.364 onodes=163009 db_used_bytes=14926479360 avg_obj_size=23552838
overhead_per_obj=91568
osd.365 onodes=163639 db_used_bytes=13615759360 avg_obj_size=23541130
overhead_per_obj=83206
osd.362 onodes=164505 db_used_bytes=13152288768 avg_obj_size=23324698
overhead_per_obj=79950
osd.363 onodes=164395 db_used_bytes=13104054272 avg_obj_size=23157437
overhead_per_obj=79710
osd.360 onodes=163484 db_used_bytes=14292090880 avg_obj_size=23347543
overhead_per_obj=87421
osd.361 onodes=164140 db_used_bytes=12977176576 avg_obj_size=23498778
overhead_per_obj=79061
osd.366 onodes=1516 db_used_bytes=7509901312 avg_obj_size=5743370
overhead_per_obj=4953760
osd.367 onodes=1435 db_used_bytes=7992246272 avg_obj_size=6419719
overhead_per_obj=5569509
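
As a sanity check, overhead_per_obj above is simply db_used_bytes divided by
onodes; reproducing osd.319's figure:

echo $(( 14433648640 / 164010 ))    # -> 88004, matching osd.319 above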

On Tue, May 1, 2018 at 1:57 AM Wido den Hollander  wrote:

>
>
> On 04/30/2018 10:25 PM, Gregory Farnum wrote:
> >
> >
> > On Thu, Apr 26, 2018 at 11:36 AM Wido den Hollander wrote:
> >
> > Hi,
> >
> > I've been investigating the per object overhead for BlueStore as I've
> > seen this has become a topic for a lot of people who want to store a
> lot
> > of small objects in Ceph using BlueStore.
> >
> > I've written a piece of Python code which can be run on a server
> > running OSDs and will print the overhead.
> >
> > https://gist.github.com/wido/b1328dd45aae07c45cb8075a24de9f1f
> >
> > Feedback on this script is welcome, but also the output of what
> people
> > are observing.
> >
> > The results from my tests are below, but what I see is that the
> overhead
> > seems to range from 10kB to 30kB per object.
> >
> > On RBD-only clusters the overhead seems to be around 11kB, but on
> > clusters with a RGW workload the overhead goes higher to 20kB.
> >
> >
> > This change seems implausible as RGW always writes full objects, whereas
> > RBD will frequently write pieces of them and do overwrites.
> > I'm not sure what all knobs are available and which diagnostics
> > BlueStore exports, but is it possible you're looking at the total
> > RocksDB data store rather than the per-object overhead? The distinction
> > here being that the RocksDB instance will also store "client" (ie, RGW)
> > omap data and xattrs, in addition to the actual BlueStore onodes.
>
> Yes, that is possible. But in the end, the amount of onodes is the
> objects you store and then you want to know how many bytes the RocksDB
> database uses.
>
> I do agree that RGW doesn't do partial writes and has more metadata, but
> eventually that all has to be stored.
>
> We just need to come up with some good numbers on how to size the DB.
>
> Currently I assume a 10GB:1TB ratio and that is working out, but with
> people wanting to use 12TB disks we need to drill those numbers down
> even more. Otherwise you will need a lot of SSD space to store the DB in
> SSD if you want to.
>
> Wido
>
> > -Greg
> >
> >
> >
> > I know that partial overwrites and appends contribute to higher
> overhead
> > on objects and I'm trying to investigate this and share my
> information
> > with the community.
> >
> > I have two use-cases who want to store >2 billion objects with a avg
> > object size of 50kB (8 - 80kB) and the RocksDB overhead is likely to
> > become a big problem.
> >
> > Anybody willing to share the overhead they are seeing with what
> > use-case?
> >
> > The more data we have on this the better we can estimate how DBs
> need to
> > be sized for BlueStore deployments.
> >
> > Wido
> >
> > # Cluster #1
> > osd.25 onodes=178572 db_used_bytes=2188378112 avg_obj_size=6196529
> > overhead=12254
> > osd.20 onodes=209871 db_used_bytes=2307915776 avg_obj_size=5452002
> > overhead=10996

Re: [ceph-users] radosgw bucket listing (s3 ls s3://$bucketname) slow with ~2 billion objects

2018-05-01 Thread David Turner
Any time I am using shared storage like S3 or cephfs/nfs/gluster/etc, the
absolute rule that I refuse to break is to never rely on a directory
listing to know where objects/files are.  You should be maintaining a
database of some sort or a deterministic naming scheme.  The only time a
full listing of a directory should be required is if you feel like your
tooling is orphaning files and you want to clean them up.  If I had someone
with a bucket with 2B objects, I would force them to use an index-less
bucket.

That's me, though.  I'm sure there are ways to manage a bucket in other
ways, but it sounds awful.

On Tue, May 1, 2018 at 10:10 AM Robert Stanford 
wrote:

>
>  Listing will always take forever when using a high shard number, AFAIK.
> That's the tradeoff for sharding.  Are those 2B objects in one bucket?
> How's your read and write performance compared to a bucket with a lower
> number (thousands) of objects, with that shard number?
>
> On Tue, May 1, 2018 at 7:59 AM, Katie Holly <8ld3j...@meo.ws> wrote:
>
>> One of our radosgw buckets has grown a lot in size, `rgw bucket stats
>> --bucket $bucketname` reports a total of 2,110,269,538 objects with the
>> bucket index sharded across 32768 shards, listing the root context of the
>> bucket with `s3 ls s3://$bucketname` takes more than an hour which is the
>> hard limit to first-byte on our nginx reverse proxy and the aws-cli times
>> out long before that timeout limit is hit.
>>
>> The software we use supports sharding the data across multiple s3 buckets
>> but before I go ahead and enable this, has anyone ever had that many
>> objects in a single RGW bucket and can let me know how you solved the
>> problem of RGW taking a long time to read the full index?
>>
>> --
>> Best regards
>>
>> Katie Holly
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Intel Xeon Scalable and CPU frequency scaling on NVMe/SSD Ceph OSDs

2018-05-01 Thread Wido den Hollander
Hi,

I've been trying to get the lowest latency possible out of the new Xeon
Scalable CPUs and so far I got down to 1.3ms with the help of Nick.

However, I can't seem to pin the CPUs to always run at their maximum
frequency.

If I disable power saving in the BIOS they stay at 2.1Ghz (Silver 4110),
but that disables the boost.

With the Power Saving enabled in the BIOS and when giving the OS all
control for some reason the CPUs keep scaling down.

$ echo 100 > /sys/devices/system/cpu/intel_pstate/min_perf_pct

cpufrequtils 008: cpufreq-info (C) Dominik Brodowski 2004-2009
Report errors and bugs to cpuf...@vger.kernel.org, please.
analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: 0.97 ms.
  hardware limits: 800 MHz - 3.00 GHz
  available cpufreq governors: performance, powersave
  current policy: frequency should be within 800 MHz and 3.00 GHz.
  The governor "performance" may decide which speed to use
  within this range.
  current CPU frequency is 800 MHz.

I do see the CPUs scale up to 2.1Ghz, but they quickly scale down again
to 800Mhz and that hurts latency. (50% difference!)

With the CPUs scaling down to 800Mhz my latency jumps from 1.3ms to
2.4ms on avg. With turbo enabled I hope to get down to 1.1~1.2ms on avg.

$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
performance

Everything seems to be OK and I would expect the CPUs to stay at
2.10Ghz, but they aren't.

C-States are also pinned to 0 as a boot parameter for the kernel:

processor.max_cstate=1 intel_idle.max_cstate=0

Running Ubuntu 16.04.4 with the 4.13 kernel from the HWE from Ubuntu.
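
For anyone reproducing this, turbostat (from the linux-tools packages) is a
quick way to confirm what the cores actually ran at during a benchmark,
independent of what cpufreq reports; it needs root and the msr module loaded:

# sample actual per-core MHz and C-state residency over a 10 second window
turbostat sleep 10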

Has anybody tried this yet with the recent Intel Xeon Scalable CPUs?

Thanks,

Wido
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw bucket listing (s3 ls s3://$bucketname) slow with ~2 billion objects

2018-05-01 Thread Robert Stanford
 I second the indexless bucket suggestion.  The downside being that you
can't use bucket policies like object expiration in that case.
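
For reference, an indexless placement target can be set up along these lines
(pool and placement names below are only illustrative, and the exact flags
should be checked against `radosgw-admin zone placement add --help` on your
release; existing buckets keep whatever index they were created with):

radosgw-admin zonegroup placement add --rgw-zonegroup default \
    --placement-id indexless-placement
radosgw-admin zone placement add --rgw-zone default \
    --placement-id indexless-placement \
    --data-pool default.rgw.buckets.data \
    --index-pool default.rgw.buckets.index \
    --data-extra-pool default.rgw.buckets.non-ec \
    --placement-index-type indexless
# optionally make it the zonegroup default for newly created buckets,
# then restart the radosgw instances
radosgw-admin zonegroup placement default --placement-id indexless-placement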

On Tue, May 1, 2018 at 10:02 AM, David Turner  wrote:

> Any time using shared storage like S3 or cephfs/nfs/gluster/etc the
> absolute rule that I refuse to break is to never rely on a directory
> listing to know where objects/files are.  You should be maintaining a
> database of some sort or a deterministic naming scheme.  The only time a
> full listing of a directory should be required is if you feel like your
> tooling is orphaning files and you want to clean them up.  If I had someone
> with a bucket with 2B objects, I would force them to use an index-less
> bucket.
>
> That's me, though.  I'm sure there are ways to manage a bucket in other
> ways, but it sounds awful.
>
> On Tue, May 1, 2018 at 10:10 AM Robert Stanford 
> wrote:
>
>>
>>  Listing will always take forever when using a high shard number, AFAIK.
>> That's the tradeoff for sharding.  Are those 2B objects in one bucket?
>> How's your read and write performance compared to a bucket with a lower
>> number (thousands) of objects, with that shard number?
>>
>> On Tue, May 1, 2018 at 7:59 AM, Katie Holly <8ld3j...@meo.ws> wrote:
>>
>>> One of our radosgw buckets has grown a lot in size, `rgw bucket stats
>>> --bucket $bucketname` reports a total of 2,110,269,538 objects with the
>>> bucket index sharded across 32768 shards, listing the root context of the
>>> bucket with `s3 ls s3://$bucketname` takes more than an hour which is the
>>> hard limit to first-byte on our nginx reverse proxy and the aws-cli times
>>> out long before that timeout limit is hit.
>>>
>>> The software we use supports sharding the data across multiple s3
>>> buckets but before I go ahead and enable this, has anyone ever had that
>>> many objects in a single RGW bucket and can let me know how you solved the
>>> problem of RGW taking a long time to read the full index?
>>>
>>> --
>>> Best regards
>>>
>>> Katie Holly
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Intel Xeon Scalable and CPU frequency scaling on NVMe/SSD Ceph OSDs

2018-05-01 Thread Blair Bethwaite
Also curious about this over here. We've got a rack's worth of R740XDs
with Xeon 4114's running RHEL 7.4 and intel-pstate isn't even active
on them, though I don't believe they are any different at the OS level
to our Broadwell nodes (where it is loaded).

Have you tried poking the kernel's pmqos interface for your use-case?
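
For what it's worth, a minimal sketch of poking that interface from a shell:
writing a 32-bit value (the target exit latency in microseconds) to
/dev/cpu_dma_latency keeps deep C-states off, but only for as long as the file
descriptor stays open (needs root; a value of 0 is shown here):

# keep fd 3 open on /dev/cpu_dma_latency for the duration of the test
exec 3>/dev/cpu_dma_latency
printf '\x00\x00\x00\x00' >&3      # request 0us C-state exit latency
# ... run the benchmark ...
exec 3>&-                          # close the fd, releasing the request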

On 2 May 2018 at 01:07, Wido den Hollander  wrote:
> Hi,
>
> I've been trying to get the lowest latency possible out of the new Xeon
> Scalable CPUs and so far I got down to 1.3ms with the help of Nick.
>
> However, I can't seem to pin the CPUs to always run at their maximum
> frequency.
>
> If I disable power saving in the BIOS they stay at 2.1Ghz (Silver 4110),
> but that disables the boost.
>
> With the Power Saving enabled in the BIOS and when giving the OS all
> control for some reason the CPUs keep scaling down.
>
> $ echo 100 > /sys/devices/system/cpu/intel_pstate/min_perf_pct
>
> cpufrequtils 008: cpufreq-info (C) Dominik Brodowski 2004-2009
> Report errors and bugs to cpuf...@vger.kernel.org, please.
> analyzing CPU 0:
>   driver: intel_pstate
>   CPUs which run at the same hardware frequency: 0
>   CPUs which need to have their frequency coordinated by software: 0
>   maximum transition latency: 0.97 ms.
>   hardware limits: 800 MHz - 3.00 GHz
>   available cpufreq governors: performance, powersave
>   current policy: frequency should be within 800 MHz and 3.00 GHz.
>   The governor "performance" may decide which speed to use
>   within this range.
>   current CPU frequency is 800 MHz.
>
> I do see the CPUs scale up to 2.1Ghz, but they quickly scale down again
> to 800Mhz and that hurts latency. (50% difference!)
>
> With the CPUs scaling down to 800Mhz my latency jumps from 1.3ms to
> 2.4ms on avg. With turbo enabled I hope to get down to 1.1~1.2ms on avg.
>
> $ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
> performance
>
> Everything seems to be OK and I would expect the CPUs to stay at
> 2.10Ghz, but they aren't.
>
> C-States are also pinned to 0 as a boot parameter for the kernel:
>
> processor.max_cstate=1 intel_idle.max_cstate=0
>
> Running Ubuntu 16.04.4 with the 4.13 kernel from the HWE from Ubuntu.
>
> Has anybody tried this yet with the recent Intel Xeon Scalable CPUs?
>
> Thanks,
>
> Wido
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Cheers,
~Blairo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw bucket listing (s3 ls s3://$bucketname) slow with ~2 billion objects

2018-05-01 Thread Casey Bodley
The main problem with efficiently listing many-sharded buckets is the 
requirement to provide entries in sorted order. This means that each 
http request has to fetch ~1000 entries from every shard, combine them 
into a sorted order, and throw out the leftovers. The next request to 
continue the listing will advance its position slightly, but still end 
up fetching many of the same entries from each shard. As the number of 
shards increases, these shard listings overlap more and more, and 
performance falls off.


Eric Ivancich recently added s3 and swift extensions for unordered 
bucket listing in https://github.com/ceph/ceph/pull/21026 (for mimic). 
That allows radosgw to list each shard separately, and avoid the step 
that throws away extra entries. If your application can tolerate 
unsorted listings, that could be a big help without having to resort to 
indexless buckets.
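
For anyone who wants to try it, the s3 extension in that PR is exposed as an
extra query parameter on the bucket listing request; the sketch below assumes
a build that includes the PR and a bucket readable without authentication
(otherwise sign the request as usual), and the parameter name should be
double-checked against the docs for your version:

# unordered listing of one "page" of the bucket
curl -s "http://rgw.example.com/$bucketname/?allow-unordered=true"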



On 05/01/2018 11:09 AM, Robert Stanford wrote:


 I second the indexless bucket suggestion.  The downside being that 
you can't use bucket policies like object expiration in that case.


On Tue, May 1, 2018 at 10:02 AM, David Turner wrote:


Any time using shared storage like S3 or cephfs/nfs/gluster/etc
the absolute rule that I refuse to break is to never rely on a
directory listing to know where objects/files are.  You should be
maintaining a database of some sort or a deterministic naming
scheme. The only time a full listing of a directory should be
required is if you feel like your tooling is orphaning files and
you want to clean them up.  If I had someone with a bucket with 2B
objects, I would force them to use an index-less bucket.

That's me, though.  I'm sure there are ways to manage a bucket in
other ways, but it sounds awful.

On Tue, May 1, 2018 at 10:10 AM Robert Stanford <rstanford8...@gmail.com> wrote:


 Listing will always take forever when using a high shard
number, AFAIK.  That's the tradeoff for sharding.  Are those
2B objects in one bucket? How's your read and write
performance compared to a bucket with a lower number
(thousands) of objects, with that shard number?

On Tue, May 1, 2018 at 7:59 AM, Katie Holly <8ld3j...@meo.ws> wrote:

One of our radosgw buckets has grown a lot in size, `rgw
bucket stats --bucket $bucketname` reports a total of
2,110,269,538 objects with the bucket index sharded across
32768 shards, listing the root context of the bucket with
`s3 ls s3://$bucketname` takes more than an hour which is
the hard limit to first-byte on our nginx reverse proxy
and the aws-cli times out long before that timeout limit
is hit.

The software we use supports sharding the data across
multiple s3 buckets but before I go ahead and enable this,
has anyone ever had that many objects in a single RGW
bucket and can let me know how you solved the problem of
RGW taking a long time to read the full index?

-- 
Best regards


Katie Holly
___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS MDS stuck (failed to rdlock when getattr / lookup)

2018-05-01 Thread Oliver Freyermuth
Hi all,

Am 17.04.2018 um 19:38 schrieb Oliver Freyermuth:
> Am 17.04.2018 um 19:35 schrieb Daniel Gryniewicz:
>> On 04/17/2018 11:40 AM, Oliver Freyermuth wrote:
>>> Am 17.04.2018 um 17:34 schrieb Paul Emmerich:
>> 
>>> [...]

  We are right now using the packages from 
 https://eu.ceph.com/nfs-ganesha/  since 
 we would like not to have to build NFS Ganesha against Ceph ourselves,
  but would love to just follow upstream to save the maintenance 
 burden. Are you building packages yourself, or using a repo maintained 
 upstream?


 We are building it ourselves. We plan to soon publish our own repository 
 for it.
>>>
>>> This is certainly interesting for us!
>>> Ideally, we would love something with a similar maintenance-promise as the 
>>> eu.ceph.com repositories, always in sync with upstream's ceph releases.
>>> I still hope NFS Ganesha will play a larger role in Mimic and later 
>>> releases any maybe become a more integral part of the Ceph ecosystem.
>>>
>>> This is also the main problem preventing it from building it ourselves - if 
>>> we had to do it, we would have to build Ceph and NFS Ganesha, e.g. on the 
>>> Open Build Service,
>>> and closely monitor upstream ourselves.
>>>
>>> Many thanks and looking forward towards the publishing of your repo,
>>> Oliver
>>
>> 2.6 packages are now up on download.ceph.com, thanks to Ali.
> 
> Many thanks to Ali also from my side! 
> We'll schedule the upgrade within the next days and observe closely whether 
> the problem reappears. 
> 
> Many thanks and all the best,
>   Oliver

Finally, after watching the situation for ~2 weeks after the update, I can say:
upgrading NFS Ganesha to 2.6 appears to have solved the issue.
I have not observed any further lockups like this, even though our users
tried their best to re-trigger the issue!
So it seems this has been solved for good by the update :-).

So many thanks and all the best,
Oliver

> 
>>
>> Daniel
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] troubleshooting librados error with concurrent requests

2018-05-01 Thread Gregory Farnum
Can you provide the full backtrace? It kinda looks like you've left
something out.

In general though, a Wait inside of an operate call just means the thread
has submitted its request and is waiting for the answer to come back. It
could be blocked locally or remotely. If it's blocked remotely, the OSDs
should be reporting to the mon/mgr that they have slow requests, which you
can observe in "ceph -w" or whatever. If it's local, hrmm, not sure the
easiest way to debug without just cranking up logging.
Generically, I'd use the admin socket on your clients to look at what the
status of in-flight requests is, and to check the value of the throttle
limits in the perfcounter. If the requests are being handled slowly on the
OSD, do the same there. That will probably give you some clues.
-Greg
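
Concretely, something along these lines (the client only exposes an admin
socket if one is configured for it, e.g. `admin socket =
/var/run/ceph/$cluster-$name.$pid.asok` in the [client] section of ceph.conf;
the socket path below is just an example):

# what the client's Objecter currently has in flight
ceph --admin-daemon /var/run/ceph/ceph-client.admin.12345.asok objecter_requests
# throttle counters (objecter ops/bytes) from the client perfcounters
ceph --admin-daemon /var/run/ceph/ceph-client.admin.12345.asok perf dump | grep -A 5 throttle-objecter
# and on an OSD that "ceph health detail" flags as having slow requests
ceph daemon osd.<id> dump_ops_in_flight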

On Tue, May 1, 2018 at 5:20 AM Sam Whitlock  wrote:

> I am using librados in application to read and write many small files
> (<128MB) concurrently, both in the same process and in different processes
> (across many nodes). The application is built on Tensorflow (the read and
> write operations are custom kernels I wrote).
>
> I'm having an issue with this application where, after a few minutes, the
> all of my processes stop reading and writing to RADOS. In the debugging I
> can see that they're all waiting, with some variation of the following
> stack trace (edited for brevity), for various stat/read/write/write_full
> operations:
>
> #0 in pthread_cond_wait@@GLIBC_2.3.2 () from
> /lib/x86_64-linux-gnu/libpthread.so.0
> #1 in Cond::Wait (this=this@entry=0x7f7f977dce20, mutex=...) at
> ./common/Cond.h:56
> #2 in librados::IoCtxImpl::operate_read (this=this@entry=0x7f7ed40b4190,
> oid=..., o=o@entry=0x7f7f977dd050, pbl=pbl@entry=0x0, flags=flags@entry=0)
> at librados/IoCtxImpl.cc:725
> #3 in librados::IoCtxImpl::stat (this=0x7f7ed40b4190, oid=...,
> psize=psize@entry=0x7f7f977dd198, pmtime=pmtime@entry=0x7f7f977dd1a0) at
> librados/IoCtxImpl.cc:1238
> #4 in librados::IoCtx::stat (this=0x7f7f977dd290, oid=...,
> psize=0x7f7f977dd198, pmtime=0x7f7f977dd1a0) at librados/librados.cc:1260
>
> The application then proceeds to complete requests at a glacial pace (~3-5
> an hour) indefinitely.
>
> When I run the application with a very low level of concurrency, it works
> properly. This "lock up" doesn't happen.
>
> All reads and writes are to a single pool from the same user. No files are
> concurrently modified by different requests (i.e. completely independent /
> embarrassingly parallel architecture in my app).
>
> How might I go about troubleshooting this? I'm not sure which logs to look
> at and what I might be looking for (if it is even logged).
>
> I'm running Ceph 12.2.2, all machines running Ubuntu 16.04.
>
> --
> Sam Whitlock
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Collecting BlueStore per Object DB overhead

2018-05-01 Thread Gregory Farnum
On Mon, Apr 30, 2018 at 10:57 PM Wido den Hollander  wrote:

>
>
> On 04/30/2018 10:25 PM, Gregory Farnum wrote:
> >
> >
> > On Thu, Apr 26, 2018 at 11:36 AM Wido den Hollander wrote:
> >
> > Hi,
> >
> > I've been investigating the per object overhead for BlueStore as I've
> > seen this has become a topic for a lot of people who want to store a
> lot
> > of small objects in Ceph using BlueStore.
> >
> > I've written a piece of Python code which can be run on a server
> > running OSDs and will print the overhead.
> >
> > https://gist.github.com/wido/b1328dd45aae07c45cb8075a24de9f1f
> >
> > Feedback on this script is welcome, but also the output of what
> people
> > are observing.
> >
> > The results from my tests are below, but what I see is that the
> overhead
> > seems to range from 10kB to 30kB per object.
> >
> > On RBD-only clusters the overhead seems to be around 11kB, but on
> > clusters with a RGW workload the overhead goes higher to 20kB.
> >
> >
> > This change seems implausible as RGW always writes full objects, whereas
> > RBD will frequently write pieces of them and do overwrites.
> > I'm not sure what all knobs are available and which diagnostics
> > BlueStore exports, but is it possible you're looking at the total
> > RocksDB data store rather than the per-object overhead? The distinction
> > here being that the RocksDB instance will also store "client" (ie, RGW)
> > omap data and xattrs, in addition to the actual BlueStore onodes.
>
> Yes, that is possible. But in the end, the amount of onodes is the
> objects you store and then you want to know how many bytes the RocksDB
> database uses.
>
> I do agree that RGW doesn't do partial writes and has more metadata, but
> eventually that all has to be stored.
>
> We just need to come up with some good numbers on how to size the DB.
>

Ah yeah, this makes sense if you're trying to size for the DB partitions. I
just don't want people to look at it and go "RADOS + BlueStore require 30KB
per object!?!?!?" ;)
(And on a similar vein, the RGW-imposed overhead will depend a great deal
on the object names you use; they can get pretty large and have to get
written down in a few different places...)
-Greg



>
> Currently I assume a 10GB:1TB ratio and that is working out, but with
> people wanting to use 12TB disks we need to drill those numbers down
> even more. Otherwise you will need a lot of SSD space to store the DB in
> SSD if you want to.
>
> Wido
>
> > -Greg
> >
> >
> >
> > I know that partial overwrites and appends contribute to higher
> overhead
> > on objects and I'm trying to investigate this and share my
> information
> > with the community.
> >
> > I have two use-cases who want to store >2 billion objects with a avg
> > object size of 50kB (8 - 80kB) and the RocksDB overhead is likely to
> > become a big problem.
> >
> > Anybody willing to share the overhead they are seeing with what
> > use-case?
> >
> > The more data we have on this the better we can estimate how DBs
> need to
> > be sized for BlueStore deployments.
> >
> > Wido
> >
> > # Cluster #1
> > osd.25 onodes=178572 db_used_bytes=2188378112 avg_obj_size=6196529
> > overhead=12254
> > osd.20 onodes=209871 db_used_bytes=2307915776 avg_obj_size=5452002
> > overhead=10996
> > osd.10 onodes=195502 db_used_bytes=2395996160 avg_obj_size=6013645
> > overhead=12255
> > osd.30 onodes=186172 db_used_bytes=2393899008 avg_obj_size=6359453
> > overhead=12858
> > osd.1 onodes=169911 db_used_bytes=1799356416 avg_obj_size=4890883
> > overhead=10589
> > osd.0 onodes=199658 db_used_bytes=2028994560 avg_obj_size=4835928
> > overhead=10162
> > osd.15 onodes=204015 db_used_bytes=2384461824 avg_obj_size=5722715
> > overhead=11687
> >
> > # Cluster #2
> > osd.1 onodes=221735 db_used_bytes=2773483520 avg_obj_size=5742992
> > overhead_per_obj=12508
> > osd.0 onodes=196817 db_used_bytes=2651848704 avg_obj_size=6454248
> > overhead_per_obj=13473
> > osd.3 onodes=212401 db_used_bytes=2745171968 avg_obj_size=6004150
> > overhead_per_obj=12924
> > osd.2 onodes=185757 db_used_bytes=356722 avg_obj_size=5359974
> > overhead_per_obj=19203
> > osd.5 onodes=198822 db_used_bytes=3033530368 avg_obj_size=6765679
> > overhead_per_obj=15257
> > osd.4 onodes=161142 db_used_bytes=2136997888 avg_obj_size=6377323
> > overhead_per_obj=13261
> > osd.7 onodes=158951 db_used_bytes=1836056576 avg_obj_size=5247527
> > overhead_per_obj=11551
> > osd.6 onodes=178874 db_used_bytes=2542796800 avg_obj_size=6539688
> > overhead_per_obj=14215
> > osd.9 onodes=195166 db_used_bytes=2

[ceph-users] Configuration multi region

2018-05-01 Thread Anatoliy Guskov
Hello all,

I created an S3 RGW cluster scheme like this:

Master:
   — Slave1 (region EU)
   — Slave2 (region US)

The master is used as the store for user logins and bucket info. Do you think 
this is a good idea, or is there a better way to store user data and bucket 
info? One last question: is it possible to disable the redirect to the region 
where a bucket was created?

--
Best regards
Anatoliy Guskov

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] troubleshooting librados error with concurrent requests

2018-05-01 Thread Sam Whitlock
Thank you! I will try what you suggested.

Here is the full backtrace:
#0  0x7f816529b360 in pthread_cond_wait@@GLIBC_2.3.2 () from
/lib/x86_64-linux-gnu/libpthread.so.0
#1  0x7f80c716bb82 in Cond::Wait (this=this@entry=0x7f7f977dce20,
mutex=...) at ./common/Cond.h:56
#2  0x7f80c717656f in librados::IoCtxImpl::operate_read
(this=this@entry=0x7f7ed40b4190,
oid=..., o=o@entry=0x7f7f977dd050, pbl=pbl@entry=0x0, flags=flags@entry=0)
at librados/IoCtxImpl.cc:725
#3  0x7f80c7180032 in librados::IoCtxImpl::stat (this=0x7f7ed40b4190,
oid=..., psize=psize@entry=0x7f7f977dd198, pmtime=pmtime@entry=0x7f7f977dd1a0)
at librados/IoCtxImpl.cc:1238
#4  0x7f80c7142606 in librados::IoCtx::stat (this=0x7f7f977dd290,
oid=..., psize=0x7f7f977dd198, pmtime=0x7f7f977dd1a0) at
librados/librados.cc:1260

My custom tensorflow kernels

#5  0x7f80d0e8fd23 in tensorflow::CephReaderOp::CephReadObject
(this=0x7f7e9c0607d0, file_key="ERR174324_1_05_2060",
name_space="ERR174324_1_05", ref_buffer=0x7f7f18000cd0,
io_ctx=...) at tensorflow/contrib/ceph_src/kernels/ceph_reader_op.cc:104
#6  0x7f80d0e8f854 in tensorflow::CephReaderOp::Compute
(this=0x7f7e9c0607d0, ctx=0x7f7f977dd8f0) at
tensorflow/contrib/ceph_src/kernels/ceph_reader_op.cc:64

Tensorflow runtime

#7  0x7f80d873d5c1 in tensorflow::ThreadPoolDevice::Compute
(this=0x7f812077a270, op_kernel=0x7f7e9c0607d0, context=0x7f7f977dd8f0)
at tensorflow/core/common_runtime/threadpool_device.cc:60
#8  0x7f80d86c80e4 in tensorflow::(anonymous
namespace)::ExecutorState::Process (this=0x7f78d80014b0, tagged_node=...,
scheduled_usec=0)
at tensorflow/core/common_runtime/executor.cc:1658
#9  0x7f80d86ca634 in tensorflow::(anonymous
namespace)::ExecutorStateoperator()(void) const
(__closure=0x7f7edee0) at
tensorflow/core/common_runtime/executor.cc:2094
#10 0x7f80d86d1882 in std::_Function_handler
>::_M_invoke(const std::_Any_data &) (__functor=...) at
/usr/include/c++/5/functional:1871
#11 0x7f80d7f449f6 in std::function::operator()() const
(this=0x7f7ed0030ef0) at /usr/include/c++/5/functional:2267

Tensorflow threadpool

#12 0x7f80d8188c86 in tensorflow::thread::EigenEnvironment::ExecuteTask
(this=0x7f81209e1778, t=...) at tensorflow/core/lib/core/threadpool.cc:83
#13 0x7f80d818b6f4 in
Eigen::NonBlockingThreadPoolTempl::WorkerLoop
(this=0x7f81209e1770, thread_id=46)
at
external/eigen_archive/unsupported/Eigen/CXX11/src/ThreadPool/NonBlockingThreadPool.h:232
#14 0x7f80d8189b46 in
Eigen::NonBlockingThreadPoolTempl::NonBlockingThreadPoolTempl(int,
bool, tensorflow::thread::EigenEnvironment)::{lambda()#1}::operator()()
const () at
external/eigen_archive/unsupported/Eigen/CXX11/src/ThreadPool/NonBlockingThreadPool.h:65
#15 0x7f80d818cd14 in std::_Function_handler::NonBlockingThreadPoolTempl(int,
bool,
tensorflow::thread::EigenEnvironment)::{lambda()#1}>::_M_invoke(std::_Any_data
const&) (__functor=...) at /usr/include/c++/5/functional:1871
#16 0x7f80d7f449f6 in std::function::operator()() const
(this=0x7f8120aac960) at /usr/include/c++/5/functional:2267
#17 0x7f80d81889d3 in
tensorflow::thread::EigenEnvironment::CreateThread(std::function)::{lambda()#1}::operator()() const (__closure=0x7f8120aac960)
at tensorflow/core/lib/core/threadpool.cc:56
#18 0x7f80d818a970 in std::_Function_handler)::{lambda()#1}>::_M_invoke(std::_Any_data const&) (
__functor=...) at /usr/include/c++/5/functional:1871
#19 0x7f80d7f449f6 in std::function::operator()() const
(this=0x7f8120aac9b8) at /usr/include/c++/5/functional:2267
#20 0x7f80d81d1446 in std::_Bind_simple
()>::_M_invoke<>(std::_Index_tuple<>) (this=0x7f8120aac9b8) at
/usr/include/c++/5/functional:1531
#21 0x7f80d81d13af in std::_Bind_simple
()>::operator()() (this=0x7f8120aac9b8) at
/usr/include/c++/5/functional:1520
#22 0x7f80d81d134e in
std::thread::_Impl ()> >::_M_run()
(this=0x7f8120aac9a0) at /usr/include/c++/5/thread:115
#23 0x7f80d6c66c80 in ?? () from
/usr/lib/x86_64-linux-gnu/libstdc++.so.6
#24 0x7f81652956ba in start_thread () from
/lib/x86_64-linux-gnu/libpthread.so.0
#25 0x7f8164fcb41d in clone () from /lib/x86_64-linux-gnu/libc.so.6

On Tue, May 1, 2018 at 7:47 PM Gregory Farnum  wrote:

> Can you provide the full backtrace? It kinda looks like you've left
> something out.
>
> In general though, a Wait inside of an operate call just means the thread
> has submitted its request and is waiting for the answer to come back. It
> could be blocked locally or remotely. If it's blocked remotely, the OSDs
> should be reporting to the mon/mgr that they have slow requests, which you
> can observe in "ceph -w" or whatever. If it's local, hrmm, not sure the
> easiest way to debug without just cranking up logging.
> Generically, I'd use the admin socket on your clients to look at what the
> status of in-flight requests is, and to check the value of the throttle
> limits in the perfcounter. If the requests are being handled

Re: [ceph-users] CephFS MDS stuck (failed to rdlock when getattr / lookup)

2018-05-01 Thread Daniel Gryniewicz

On 05/01/2018 01:43 PM, Oliver Freyermuth wrote:

Hi all,

Am 17.04.2018 um 19:38 schrieb Oliver Freyermuth:

Am 17.04.2018 um 19:35 schrieb Daniel Gryniewicz:

On 04/17/2018 11:40 AM, Oliver Freyermuth wrote:

Am 17.04.2018 um 17:34 schrieb Paul Emmerich:



[...]


  We are right now using the packages from https://eu.ceph.com/nfs-ganesha/ 
 since we would like not to have to build NFS 
Ganesha against Ceph ourselves,
  but would love to just follow upstream to save the maintenance burden. 
Are you building packages yourself, or using a repo maintained upstream?


We are building it ourselves. We plan to soon publish our own repository for it.


This is certainly interesting for us!
Ideally, we would love something with a similar maintenance-promise as the 
eu.ceph.com repositories, always in sync with upstream's ceph releases.
I still hope NFS Ganesha will play a larger role in Mimic and later releases 
any maybe become a more integral part of the Ceph ecosystem.

This is also the main problem preventing it from building it ourselves - if we 
had to do it, we would have to build Ceph and NFS Ganesha, e.g. on the Open 
Build Service,
and closely monitor upstream ourselves.

Many thanks and looking forward towards the publishing of your repo,
 Oliver


2.6 packages are now up on download.ceph.com, thanks to Ali.


Many thanks to Ali also from my side!
We'll schedule the upgrade within the next days and observe closely whether the 
problem reappears.

Many thanks and all the best,
Oliver


Finally, after watching the situation for ~2 weeks after the update, I can say:
Upgrading NFS Ganesha to 2.6 appears to have solved the issue.
I did not observe any further lockup like this anymore, even though our users 
tried their best to re-trigger the issue!
So it seems this has been solved for good by the update :-).

So many thanks and all the best,
Oliver


Thanks, that's good to know.

Daniel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Intel Xeon Scalable and CPU frequency scaling on NVMe/SSD Ceph OSDs

2018-05-01 Thread Nick Fisk
4.16 required?
https://www.phoronix.com/scan.php?page=news_item&px=Skylake-X-P-State-Linux-4.16


-Original Message-
From: ceph-users  On Behalf Of Blair
Bethwaite
Sent: 01 May 2018 16:46
To: Wido den Hollander 
Cc: ceph-users ; Nick Fisk 
Subject: Re: [ceph-users] Intel Xeon Scalable and CPU frequency scaling on
NVMe/SSD Ceph OSDs

Also curious about this over here. We've got a rack's worth of R740XDs with
Xeon 4114's running RHEL 7.4 and intel-pstate isn't even active on them,
though I don't believe they are any different at the OS level to our
Broadwell nodes (where it is loaded).

Have you tried poking the kernel's pmqos interface for your use-case?

On 2 May 2018 at 01:07, Wido den Hollander  wrote:
> Hi,
>
> I've been trying to get the lowest latency possible out of the new 
> Xeon Scalable CPUs and so far I got down to 1.3ms with the help of Nick.
>
> However, I can't seem to pin the CPUs to always run at their maximum 
> frequency.
>
> If I disable power saving in the BIOS they stay at 2.1Ghz (Silver 
> 4110), but that disables the boost.
>
> With the Power Saving enabled in the BIOS and when giving the OS all 
> control for some reason the CPUs keep scaling down.
>
> $ echo 100 > /sys/devices/system/cpu/intel_pstate/min_perf_pct
>
> cpufrequtils 008: cpufreq-info (C) Dominik Brodowski 2004-2009 Report 
> errors and bugs to cpuf...@vger.kernel.org, please.
> analyzing CPU 0:
>   driver: intel_pstate
>   CPUs which run at the same hardware frequency: 0
>   CPUs which need to have their frequency coordinated by software: 0
>   maximum transition latency: 0.97 ms.
>   hardware limits: 800 MHz - 3.00 GHz
>   available cpufreq governors: performance, powersave
>   current policy: frequency should be within 800 MHz and 3.00 GHz.
>   The governor "performance" may decide which speed to use
>   within this range.
>   current CPU frequency is 800 MHz.
>
> I do see the CPUs scale up to 2.1Ghz, but they quickly scale down 
> again to 800Mhz and that hurts latency. (50% difference!)
>
> With the CPUs scaling down to 800Mhz my latency jumps from 1.3ms to 
> 2.4ms on avg. With turbo enabled I hope to get down to 1.1~1.2ms on avg.
>
> $ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
> performance
>
> Everything seems to be OK and I would expect the CPUs to stay at 
> 2.10Ghz, but they aren't.
>
> C-States are also pinned to 0 as a boot parameter for the kernel:
>
> processor.max_cstate=1 intel_idle.max_cstate=0
>
> Running Ubuntu 16.04.4 with the 4.13 kernel from the HWE from Ubuntu.
>
> Has anybody tried this yet with the recent Intel Xeon Scalable CPUs?
>
> Thanks,
>
> Wido
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--
Cheers,
~Blairo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Bluestore on HDD+SSD sync write latency experiences

2018-05-01 Thread Nick Fisk
Hi all,

 

Slowly getting round to migrating clusters to Bluestore but I am interested
in how people are handling the potential change in write latency coming from
Filestore? Or maybe nobody is really seeing much difference?

 

As we all know, in Bluestore, writes are not double written and in most
cases go straight to disk. Whilst this is awesome for people with pure SSD
or pure HDD clusters as the amount of overhead is drastically reduced, for
people with HDD+SSD journals in Filestore land, the double write had the
side effect of acting like a battery backed cache, accelerating writes when
not under saturation.

 

In some brief testing I am seeing Filestore OSD's with NVME journal show an
average apply latency of around 1-2ms whereas some new Bluestore OSD's in
the same cluster are showing 20-40ms. I am fairly certain this is due to
writes exhibiting the latency of the underlying 7.2k disk. Note, cluster is
very lightly loaded, this is not anything being driven into saturation.

 

I know there is a deferred write tuning knob which adjusts the cutover for
when an object is double written, but at the default of 32kb, I suspect a
lot of IO's even in the 1MB area are still drastically slower going straight
to disk than if double written to NVME 1st. Has anybody else done any
investigation in this area? Is there any long turn harm at running a cluster
deferring writes up to 1MB+ in size to mimic the Filestore double write
approach?
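
The knob in question is bluestore_prefer_deferred_size (with _hdd/_ssd
variants; the HDD default is the 32kb mentioned above). A minimal sketch of
raising it on one OSD for testing, assuming osd.10 is a Bluestore OSD and
noting that some bluestore options only take effect after an OSD restart:

# current cutover below which writes are deferred (double written) on HDD OSDs
ceph daemon osd.10 config get bluestore_prefer_deferred_size_hdd
# raise the cutover to 1MB on this OSD for testing
ceph daemon osd.10 config set bluestore_prefer_deferred_size_hdd 1048576
# or persist it in ceph.conf under [osd]:
#   bluestore prefer deferred size hdd = 1048576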

 

I also suspect after looking through github that deferred writes only happen
when overwriting an existing object or blob (not sure which case applies),
so new allocations are still written straight to disk. Can anyone confirm?

 

PS. If your spinning disks are connected via a RAID controller with BBWC
then you are not affected by this.

 

Thanks,

Nick

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v12.2.5 Luminous released

2018-05-01 Thread Sergey Malinin
Useless due to http://tracker.ceph.com/issues/22102 



> On 24.04.2018, at 23:29, Abhishek  wrote:
> 
> We're glad to announce the fifth bugfix release of Luminous v12.2.x long term 
> stable

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 12.2.4 Both Ceph MDS nodes crashed. Please help.

2018-05-01 Thread Sean Sullivan
Forgot to reply to all:

Sure thing!

I couldn't install the ceph-mds-dbg packages without upgrading. I just
finished upgrading the cluster to 12.2.5. The issue still persists in 12.2.5

From here I'm not really sure how to generate the backtrace, so I hope I
did it right. For others on Ubuntu this is what I did:

* firstly up the debug_mds to 20 and debug_ms to 1:
ceph tell mds.* injectargs '--debug-mds 20 --debug-ms 1'

* install the debug packages
ceph-mds-dbg in my case

* I also added these options to /etc/ceph/ceph.conf just in case they
restart.

* Now allow pids to dump (stolen partly from redhat docs and partly from
ubuntu)
echo -e 'DefaultLimitCORE=infinity\nPrivateTmp=true' | tee -a
/etc/systemd/system.conf
sysctl fs.suid_dumpable=2
sysctl kernel.core_pattern=/tmp/core
systemctl daemon-reload
systemctl restart ceph-mds@$(hostname -s)

* A crash was created in /var/crash by apport but gdb can't read it. I used
apport-unpack and then ran GDB on what is inside:

apport-unpack /var/crash/$(ls /var/crash/*mds*) /root/crash_dump/
cd /root/crash_dump/
gdb $(cat ExecutablePath) CoreDump -ex 'thr a a bt' | tee
/root/ceph_mds_$(hostname -s)_backtrace

* This left me with the attached backtraces (which I think are wrong as I
see a lot of ?? yet gdb says
/usr/lib/debug/.build-id/1d/23dc5ef4fec1dacebba2c6445f05c8fe6b8a7c.debug was loaded)

 kh10-8 mds backtrace -- https://pastebin.com/bwqZGcfD
 kh09-8 mds backtrace -- https://pastebin.com/vvGiXYVY


The log files are pretty large (one 4.1G and the other 200MB)

kh10-8 (200MB) mds log -- https://griffin-objstore.opensciencedatacloud.org/logs/ceph-mds.kh10-8.log
kh09-8 (4.1GB) mds log -- https://griffin-objstore.opensciencedatacloud.org/logs/ceph-mds.kh09-8.log

On Tue, May 1, 2018 at 12:09 AM, Patrick Donnelly 
wrote:

> Hello Sean,
>
> On Mon, Apr 30, 2018 at 2:32 PM, Sean Sullivan 
> wrote:
> > I was creating a new user and mount point. On another hardware node I
> > mounted CephFS as admin to mount as root. I created /aufstest and then
> > unmounted. From there it seems that both of my mds nodes crashed for some
> > reason and I can't start them any more.
> >
> > https://pastebin.com/1ZgkL9fa -- my mds log
> >
> > I have never had this happen in my tests so now I have live data here. If
> > anyone can lend a hand or point me in the right direction while
> > troubleshooting that would be a godsend!
>
> Thanks for keeping the list apprised of your efforts. Since this is so
> easily reproduced for you, I would suggest that you next get higher
> debug logs (debug_mds=20/debug_ms=1) from the MDS. And, since this is
> a segmentation fault, a backtrace with debug symbols from gdb would
> also be helpful.
>
> --
> Patrick Donnelly
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com