Re: [ceph-users] Help: pool not responding

2016-03-04 Thread Mario Giammarco
I have restarted each host using init scripts. Is there another way?
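
For reference, what I have done so far is roughly this (just a sketch; the PG
id below is a placeholder, the real ones come from "ceph health detail"):

ceph health detail | grep incomplete     # list the stuck PGs
ceph pg force_create_pg 2.3f             # force-create one of them
# restart order: monitors first, then OSDs, one node at a time
/etc/init.d/ceph restart mon
/etc/init.d/ceph restart osd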

2016-03-03 21:51 GMT+01:00 Dimitar Boichev :

> But the whole cluster or what ?
>
> Regards.
>
> *Dimitar Boichev*
> SysAdmin Team Lead
> AXSMarine Sofia
> Phone: +359 889 22 55 42
> Skype: dimitar.boichev.axsmarine
> E-mail: dimitar.boic...@axsmarine.com
>
> On Mar 3, 2016, at 22:47, Mario Giammarco  wrote:
>
> Used the init scripts to restart
>
> *Da: *Dimitar Boichev
> *Inviato: *giovedì 3 marzo 2016 21:44
> *A: *Mario Giammarco
> *Cc: *Oliver Dzombic; ceph-users@lists.ceph.com
> *Oggetto: *Re: [ceph-users] Help: pool not responding
>
> I see a lot of people (including myself) ending with PGs that are stuck in
> “creating” state when you force create them.
>
> How did you restart ceph ?
> Mine were created fine after I restarted the monitor nodes after a minor
> version upgrade.
> Did you do it monitors first, OSDs second, etc etc…?
>
> Regards.
>
>
> On Mar 3, 2016, at 13:13, Mario Giammarco  wrote:
>
> I have tried "force create". It says "creating" but in the end the problem
> persists.
> I have restarted ceph as usual.
> I am evaluating Ceph and I am shocked because it seemed a very robust
> filesystem, and now because of a glitch I have an entire pool blocked and there is
> no simple procedure to force a recovery.
>
> 2016-03-02 18:31 GMT+01:00 Oliver Dzombic :
>
>> Hi,
>>
>> I could also not find any delete, only a create.
>>
>> I found this here, it's basically your situation:
>>
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-July/032412.html
>>
>> --
>> Mit freundlichen Gruessen / Best regards
>>
>> Oliver Dzombic
>> IP-Interactive
>>
>> mailto:i...@ip-interactive.de
>>
>> Anschrift:
>>
>> IP Interactive UG ( haftungsbeschraenkt )
>> Zum Sonnenberg 1-3
>> 63571 Gelnhausen
>>
>> HRB 93402 beim Amtsgericht Hanau
>> Geschäftsführung: Oliver Dzombic
>>
>> Steuer Nr.: 35 236 3622 1
>> UST ID: DE274086107
>>
>>
>> Am 02.03.2016 um 18:28 schrieb Mario Giammarco:
>> > Thanks for the info, even if it is bad news.
>> > Anyway I am reading docs again and I do not see a way to delete PGs.
>> > How can I remove them?
>> > Thanks,
>> > Mario
>> >
>> > 2016-03-02 17:59 GMT+01:00 Oliver Dzombic :
>> >
>> > Hi,
>> >
>> > As I see your situation, somehow these 4 PGs got lost.
>> >
>> > They will not recover, because they are incomplete. So there is no data
>> > from which they could be recovered.
>> >
>> > So all that is left is to delete these PGs.
>> >
>> > Since all 3 OSDs are in and up, it does not seem like you can somehow
>> > access these lost PGs.
>> >
>> > --
>> > Mit freundlichen Gruessen / Best regards
>> >
>> > Oliver Dzombic
>> > IP-Interactive
>> >
>> > mailto:i...@ip-interactive.de
>> >
>> > Anschrift:
>> >
>> > IP Interactive UG ( haftungsbeschraenkt )
>> > Zum Sonnenberg 1-3
>> > 63571 Gelnhausen
>> >
>> > HRB 93402 beim Amtsgericht Hanau
>> > Geschäftsführung: Oliver Dzombic
>> >
>> > Steuer Nr.: 35 236 3622 1 
>> > UST ID: DE274086107
>> >
>> >
>> > Am 02.03.2016  um 17:45 schrieb Mario Giammarco:
>> > >
>> > >
>> > > Here it is:
>> > >
>> > >  cluster ac7bc476-3a02-453d-8e5c-606ab6f022ca
>> > >  health HEALTH_WARN
>> > > 4 pgs incomplete
>> > > 4 pgs stuck inactive
>> > > 4 pgs stuck unclean
>> > > 1 requests are blocked > 32 sec
>> > >  monmap e8: 3 mons at
>> > > {0=10.1.0.12:6789/0,1=10.1.0.14:6789/0,2=10.1.0.17:6789/0}
>> > > election epoch 840, quorum 0,1,2 0,1,2
>> > >  osdmap e2405: 3 osds: 3 up, 3 in
>> > >   pgmap v5904430: 288 pgs, 4 pools, 391 GB data, 100 kobjects
>> > > 1090 GB used, 4481 GB / 5571 GB avail
>> > >  284 active+clean
>> > >4 incomplete
>> > >   client io 4008 B/s rd, 446 kB/s wr, 23 op/s
>> > >
>> > >
>> > > 2016-03-02 9:31 GMT+01:00 Shinobu Kinjo :
>> > >
>> > > Is "ceph -s" still showing you same output?
>> > >
>> > > > cluster ac7bc476-3a02-453d-8e5c-606ab6f022ca
>> > > >  health HEALTH_WARN
>> > > > 4 pgs incomplete
>> > > > 4 pgs stuck inactive
>> > > > 4 pgs stuck unclean
>> > > >  monmap e8: 3 mons at
>> > > > {0=10.1.0.12:6789/0,1=10.1.0.14:6789/0,2=10.1.0.17:6789/0}
>> > > > election epoch 832, quorum 

[ceph-users] Cache tier operation clarifications

2016-03-04 Thread Christian Balzer

Hello,

Unlike what the subject may suggest, I'm mostly going to try and explain how
things work with cache tiers, as far as I understand them.
Something of a reference to point to.
Of course if you spot something that's wrong or have additional
information, by all means please do comment.

While the documentation in master now correctly warns that you HAVE to set
target_max_bytes (the size of your cache pool) for any of the relative
sizing bits to work, let's repeat that here since it wasn't mentioned there
previously.
And without that value being set, none of the flushing or eviction will
happen, resulting in blocked IOs when it gets full.
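
For example (just a sketch; "cachepool" and the values are placeholders, and
Ceph wants full bytes here):

ceph osd pool set cachepool target_max_bytes 1099511627776   # 1024GB
ceph osd pool set cachepool cache_target_dirty_ratio 0.4
ceph osd pool set cachepool cache_target_full_ratio 0.8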

The other thing to remember about target_max_bytes (documented nowhere) is
that this space calculation is done per PG.
So if you have a 1024GB cache pool and target_max_bytes set accordingly
(one of the most annoying things about Ceph is having to specify full bytes
in most places instead of human friendly shortcuts like "1TB"), Ceph
(the cache tiering agent to be precise) will think that the cache is 50%
full as soon as a single PG has reached 50% of its share (with, say, 1024
PGs in the pool, that would be 512MB).
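
A back-of-the-envelope way to see where the agent will actually kick in (a
sketch; the PG count of the cache pool is an assumption here):

TARGET=1099511627776   # target_max_bytes (1024GB)
PGS=1024               # pg_num of the cache pool
echo "per-PG share:  $(( TARGET / PGS )) bytes"
echo "dirty trigger: $(( TARGET / PGS * 4 / 10 )) bytes per PG"   # ratio 0.4
echo "full trigger:  $(( TARGET / PGS * 8 / 10 )) bytes per PG"   # ratio 0.8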

In short, expect things to happen quite a bit before you reach the usage
that you think you specified in cache_target_dirty_ratio and
cache_target_full_ratio.
Annoying, but at least failing safe.

I'm ignoring target_max_objects for this, as it's the same for object
count instead of space.
min_read_recency_for_promote and min_write_recency_for_promote I shall
ignore for now as well, since I have no cluster to test them with.

Flush
Either way, once Ceph thinks you've reached the specified
cache_target_dirty_ratio, it copies dirty objects to the backing storage.
If they never existed there before, they will be created (so keep that in
mind if you see an increase in objects).
This (additional object) is similar to tier promotion, when an existing
object is copied from the base pool to the cache pool the first time it's
accessed.

In versions after Hammer there is also cache_target_dirty_high_ratio,
which specifies at which point more aggressive flushing starts.

Note that flushing keeps objects in the cache.
So that object you wrote to some days ago and have kept reading frequently
ever since isn't just going away to the slower base pool.

Evict
Next is eviction. This is where things became a bit more muddled for me and
I had to do some testing and staring at objects in PGs.
So your cache pool is now hitting the cache_target_full_ratio (or so the
wonky space per PG algorithm thinks).
Remember that all IO will stop once the cache pool gets 100% full, so you
want this to happen at some safe, sane point before that.
What that point is depends of course on the maximum write speed to your
pool, how fast your cache can flush to the base pool, etc.
Now here is the fun part: clean objects (ones that have not been modified
since they were promoted from the base pool or last flushed) are eligible
for eviction. 
When reading about this the first time I thought this involved more moving
of data from the cache pool to the base pool.
However, what happens is that since the object is "clean" (a copy exists in
the base pool), it is simply zero'd (after demotion), leaving an empty
rados object in the cache pool and consequently releasing space.

So as far as IO and network traffic are concerned, your enemy is flushing,
not eviction.

In clusters that have a clear usage pattern and idle times, a command
to trigger flushes for a specified ratio and with settable IO limits would
be most welcome. (hint-hint)
Lacking this for now, I've been pondering a cron job that sets
cache_target_dirty_ratio from .7 (my current value) down to .6 (or more
likely a smaller step, like .65) for a few hours during the night and then
back up again.
This is based on our cache typically not growing more than 2% per day.
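
Something along these lines (a sketch, untested; pool name and values are
placeholders):

# /etc/cron.d/ceph-cache-flush
# lower the dirty ratio at night so flushing happens off-peak, raise it again
0 2 * * *  root  ceph osd pool set cachepool cache_target_dirty_ratio 0.65
0 6 * * *  root  ceph osd pool set cachepool cache_target_dirty_ratio 0.7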

Lastly we come to cache_min_flush_age and cache_min_evict_age.
It is my understanding that in Hammer and later a truly full cache pool
will cause these to be ignored to prevent IO deadlocks, correct?

The largest source of cache pollution for us is VM reboots (all those
objects holding the kernel and other things only read at startup, never to
be needed again for months) while on the other hand we have about 10k
truly hot objects that are constantly being read/written. 
Lacking min_write_recency_for_promote for now, I've been thinking of setting
cache_min_evict_age to several hours.
Truly cold objects will still be subject to eviction, while even lukewarm
ones get to stay.
Note that for objects that more or less belong in the cache we're using
less than 15% of its capacity.

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: List of SSDs

2016-03-04 Thread Christian Balzer
On Fri, 4 Mar 2016 16:09:17 +0900 Shinobu Kinjo wrote:

> Comparing with these SSDs,
> 
>  S3710s
>  S3610s
>  SM863
>  845DC Pro
> 
> which one is more reasonable in terms of performance, cost or whatever?
> S3710s does not sound reasonable to me.
>
Apples and Oranges. 
I use S3700s (I would use 3710s only if larger than 200GB, which I have no
use case for now) exclusively for journals, especially when I can't
control the write usage/patterns. 
Their speed and endurance are worth the money in my book.
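
(For completeness, the "suitable for Ceph journals" check mentioned further
down is usually a sync write test along these lines; just a sketch, /dev/sdX
is a scratch device and will be overwritten:)

fio --name=journal-test --filename=/dev/sdX --direct=1 --sync=1 \
    --rw=write --bs=4k --numjobs=1 --iodepth=1 \
    --runtime=60 --time_based --group_reporting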

I use S3610s for a cache pool, because the price/performance is right, the
endurance is sufficient and the write patterns/volume is well known and
predictable. 
   
> > And I had no luck at all getting the newer versions into a generic
> > kernel or Debian.
> 
> So it's not always better to use newer version. Is my understanding
> right? If I don't understand that properly, point it out to me. I'm
> pretty serious about that.
> 
The problem was getting their module to compile/integrate as it was
against kernel versions I did not/could not use.
Newer LSI/Avago kernel drivers and firmware are definitely recommended,
given the problems the older stuff has.

Christian
> Cheers,
> Shinobu
> 
> 
> On Fri, Mar 4, 2016 at 3:17 PM, Christian Balzer  wrote:
> >
> > Hello,
> >
> > On Mon, 29 Feb 2016 15:00:08 -0800 Heath Albritton wrote:
> >
> >> > Did you just do these tests or did you also do the "suitable for
> >> > Ceph" song and dance, as in sync write speed?
> >>
> >> These were done with libaio, so async.  I can do a sync test if that
> >> helps.  My goal for testing wasn't specifically suitability with ceph,
> >> but overall suitability in my environment, much of which uses async
> >> IO.
> >>
> > Fair enough.
> > Sync tests would be nice, if nothing else to confirm that the Samsung
> > DC level SSDs are suitable and how they compare in that respect to the
> > Intels.
> >
> >>
> >> >> SM863 Pro (default over-provisioning) ~7k IOPS per thread (4
> >> >> threads, QD32) Intel S3710 ~10k IOPS per thread
> >> >> 845DC Pro ~12k IOPS per thread
> >> >> SM863 (28% over-provisioning) ~18k IOPS per thread
> >> >>
> >> > Very interesting.
> >> > To qualify your values up there, could you provide us with the exact
> >> > models, well size of the SSD will do.
> >>
> >> SM863 was 960GB, I've many of these and the 1.92TB models deployed
> >> 845DC Pro, 800GB
> >> S3710, 800GB
> >>
> > Thanks, pretty much an oranges with oranges comparison then. ^o^
> >
> >> > Also did you test with a S3700 (I find the 3710s to be a slight
> >> > regression in some ways)?
> >> > And for kicks, did you try over-provisioning with an Intel SSD to
> >> > see the effects there?
> >>
> >> These tests were performed mid-2015.  I requested an S3700, but at
> >> that point, I could only get the S3710.  I didn't test the Intel with
> >> increased over-provisioning.  I suspect it wouldn't have performed
> >> much better as it was already over-provisioned by 28% or thereabouts.
> >>
> > Yeah, my curiosity was mostly if there is similar ratio at work here
> > (might have made more sense for testing purposes to REDUCE the
> > overprovisioning of the Intel) and where the point of diminishing
> > returns is.
> >
> >> It's easy to guess at these sort of things.  The total capacity of
> >> flash is in some power of two and the advertised capacity is some
> >> power of ten.  Manufacturer's use the difference to buy themselves
> >> some space for garbage collection.  So, a terabyte worth of flash is
> >> 1099511627776 bytes.  800GB is 8e+11 bytes with the difference of
> >> about 299GB, which is the space they've set aside for GC.
> >>
> > Ayup, that I was quite aware of.
> >
> >> Again, if there's some tests you'd like to see done, let me know.
> >> It's relatively easy for me to get samples and the tests are a benefit
> >> to me as much as any other.
> >>
> > Well, see above, diminishing returns and all.
> >
> >>
> >> >> I'm seeing the S3710s at ~$1.20/GB and the SM863 around $.63/GB.
> >> >> As such, I'm buying quite a lot of the latter.
> >> >
> >> > I assume those numbers are before over-provisioning the SM863, still
> >> > quite a difference indeed.
> >>
> >> Yes, that's correct.  Here's some current pricing:  Newegg has the
> >> SM863 960GB at $565 or ~$.59/GB raw.  With 28% OP, that yields around
> >> 800GB and around $.71/GB
> >>
> > If I'm reading the (well hidden and only in the PDF) full specs of the
> > 960GB 863 correctly it has an endurance of about 3 DWPD, so the
> > comparable Intel model would be the 3610s.
> > At least when it comes to endurance.
> > Would be interesting to see those two in comparison. ^.^
> >
> >
> >> >> I've not had them deployed
> >> >> for very long, so I can't attest to anything beyond my synthetic
> >> >> benchmarks.  I'm using the LSI 3008 based HBA as well and I've had
> >> >> to use updated firmware and kernel module for it.  I haven't
> >> >> checked the kernel that comes with EL7.2, but 7.1 still had
> >> >> problems

Re: [ceph-users] Upgrade from Hammer LTS to Infernalis or wait for Jewel LTS?

2016-03-04 Thread Luis Periquito
On Wed, Mar 2, 2016 at 9:32 AM, Mihai Gheorghe  wrote:
> Hi,
>
> I've got two questions!
>
> First. We are currently running Hammer in production. You are thinking of
> upgrading to Infernalis. Should we upgrade now or wait for the next LTS,
> Jewel? On ceph releases i can see Hammers EOL is estimated in november 2016
> while Infernalis is June 2016.

I don't know where you got this information but it seems wrong. From
previous history the last 2 LTS versions are supported (currently
Firefly and Hammer). That would mean that Hammer should be supported
until the L version is released. Infernalis should be supported until
the release of Jewel.

> If i follow the upgrade procedure there should not be any problems, right?

So far we've upgraded every version without issues. But past performance...

>
> Second. When Jewel LTS will be released, does anybody know if we can upgrade
> straight from Hammer or first we need to upgrade to Infernalis and then
> Jewel. If the latter is the case i see no reason not to upgrade now to
> Infernalis and wait for Jewel release to upgrade again. This way we can take
> advantage of the new features in Infernalis.

Usually you can upgrade LTS -> LTS, so you should be able to go from
Hammer to Jewel. The same should be true for Infernalis. However,
minimum versions may apply (e.g. you need at least version 0.94.4 to
upgrade to Infernalis).

>
> Also what is the correct order of upgrading? Mons first then OSDs?

Usually mons, then OSDs, and then MDS and radosgw. But if there's
something different it'll be published in the release notes.
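
Roughly, per node, something like this (a sketch assuming Debian/Ubuntu
packages and the sysvinit scripts; always check the release notes for
version-specific steps, e.g. Infernalis switches the daemons to run as the
"ceph" user):

# monitors first, one node at a time, waiting for quorum/HEALTH_OK in between
apt-get update && apt-get install -y ceph
/etc/init.d/ceph restart mon
ceph -s
# then each OSD node
/etc/init.d/ceph restart osd
# finally the MDS and radosgw nodes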

>
> Any input on the matter would be greatly apreciated.

If it were me, it would depend on what you value most: if you prefer
stability and a conservative approach I'd install Hammer. If you
prefer features and performance I'd install Infernalis.
As an example, all major players (like Red Hat, Fujitsu, SUSE, etc.) use
only the LTS versions for their distros.

>
> Thank you.
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: List of SSDs

2016-03-04 Thread Shinobu Kinjo
On Mar 4, 2016 5:30 PM, "Christian Balzer"  wrote:
>
> On Fri, 4 Mar 2016 16:09:17 +0900 Shinobu Kinjo wrote:
>
> > Comparing with these SSDs,
> >
> >  S3710s
> >  S3610s
> >  SM863
> >  845DC Pro
> >
> > which one is more reasonable in terms of performance, cost or whatever?
> > S3710s does not sound reasonable to me.
> >
> Apples and Oranges.
> I use S3700s (I would use 3710s only if larger than 200GB, which I have no

This evaluation is interesting to me.

> use case for now) exclusively for journals, especially when I can't
> control the write usage/patterns.
> Their speed and endurance is worth the money in my book.
>
> I use S3610s for a cache pool, because the price/performance is right, the
> endurance is sufficient and the write patterns/volume is well known and
> predictable.

I am just thinking of this for my next testing cluster.

>
> > > And I had no luck at all getting the newer versions into a generic
> > > kernel or Debian.
> >
> > So it's not always better to use newer version. Is my understanding
> > right? If I don't understand that properly, point it out to me. I'm
> > pretty serious about that.
> >
> The problem was getting their module to compile/integrate as it was
> against kernel versions I did not/could not use.

This is good to know.

> Newer LSI/Avago kernel drivers and firmware are definitely recommended,
> given the problems the older stuff has.
>

Thanks for your suggestion.
I will definitely do this.

S

> Christian
> > Cheers,
> > Shinobu
> >
> >
> > On Fri, Mar 4, 2016 at 3:17 PM, Christian Balzer  wrote:
> > >
> > > Hello,
> > >
> > > On Mon, 29 Feb 2016 15:00:08 -0800 Heath Albritton wrote:
> > >
> > >> > Did you just do these tests or did you also do the "suitable for
> > >> > Ceph" song and dance, as in sync write speed?
> > >>
> > >> These were done with libaio, so async.  I can do a sync test if that
> > >> helps.  My goal for testing wasn't specifically suitability with
ceph,
> > >> but overall suitability in my environment, much of which uses async
> > >> IO.
> > >>
> > > Fair enough.
> > > Sync tests would be nice, if nothing else to confirm that the Samsung
> > > DC level SSDs are suitable and how they compare in that respect to the
> > > Intels.
> > >
> > >>
> > >> >> SM863 Pro (default over-provisioning) ~7k IOPS per thread (4
> > >> >> threads, QD32) Intel S3710 ~10k IOPS per thread
> > >> >> 845DC Pro ~12k IOPS per thread
> > >> >> SM863 (28% over-provisioning) ~18k IOPS per thread
> > >> >>
> > >> > Very interesting.
> > >> > To qualify your values up there, could you provide us with the
exact
> > >> > models, well size of the SSD will do.
> > >>
> > >> SM863 was 960GB, I've many of these and the 1.92TB models deployed
> > >> 845DC Pro, 800GB
> > >> S3710, 800GB
> > >>
> > > Thanks, pretty much an oranges with oranges comparison then. ^o^
> > >
> > >> > Also did you test with a S3700 (I find the 3710s to be a slight
> > >> > regression in some ways)?
> > >> > And for kicks, did you try over-provisioning with an Intel SSD to
> > >> > see the effects there?
> > >>
> > >> These tests were performed mid-2015.  I requested an S3700, but at
> > >> that point, I could only get the S3710.  I didn't test the Intel with
> > >> increased over-provisioning.  I suspect it wouldn't have performed
> > >> much better as it was already over-provisioned by 28% or thereabouts.
> > >>
> > > Yeah, my curiosity was mostly if there is similar ratio at work here
> > > (might have made more sense for testing purposes to REDUCE the
> > > overprovisioning of the Intel) and where the point of diminishing
> > > returns is.
> > >
> > >> It's easy to guess at these sort of things.  The total capacity of
> > >> flash is in some power of two and the advertised capacity is some
> > >> power of ten.  Manufacturer's use the difference to buy themselves
> > >> some space for garbage collection.  So, a terabyte worth of flash is
> > >> 1099511627776 bytes.  800GB is 8e+11 bytes with the difference of
> > >> about 299GB, which is the space they've set aside for GC.
> > >>
> > > Ayup, that I was quite aware of.
> > >
> > >> Again, if there's some tests you'd like to see done, let me know.
> > >> It's relatively easy for me to get samples and the tests are a
benefit
> > >> to me as much as any other.
> > >>
> > > Well, see above, diminishing returns and all.
> > >
> > >>
> > >> >> I'm seeing the S3710s at ~$1.20/GB and the SM863 around $.63/GB.
> > >> >> As such, I'm buying quite a lot of the latter.
> > >> >
> > >> > I assume those numbers are before over-provisioning the SM863,
still
> > >> > quite a difference indeed.
> > >>
> > >> Yes, that's correct.  Here's some current pricing:  Newegg has the
> > >> SM863 960GB at $565 or ~$.59/GB raw.  With 28% OP, that yields around
> > >> 800GB and around $.71/GB
> > >>
> > > If I'm reading the (well hidden and only in the PDF) full specs of the
> > > 960GB 863 correctly it has an endurance of about 3 DWPD, so the
> > > comparable Intel model would be th

Re: [ceph-users] abort slow requests ?

2016-03-04 Thread Luis Periquito
you should really fix the peering objects.

So far what I've seen in ceph is that it prefers data integrity over
availability. So if it thinks that it can't keep all working properly
it tends to stop (i.e. blocked requests), thus I don't believe there's
a way to do this.

On Fri, Mar 4, 2016 at 1:04 AM, Ben Hines  wrote:
> I have a few bad objects in ceph which are 'stuck on peering'.  The clients
> hit them and they build up and eventually stop all traffic to the OSD.   I
> can open up traffic by resetting the OSD (aborting those requests)
> temporarily.
>
> Is there a way to tell ceph to cancel/abort these 'slow requests' once they
> get to certain amount of time? Rather than building up and blocking
> everything..
>
> -Ben
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] abort slow requests ?

2016-03-04 Thread Ben Hines
Thanks, working on fixing the peering objects. Going to attempt a recovery
on the bad PGs tomorrow.

The corrupt OSD which they were on was marked 'lost', so I expected it
wouldn't try to peer with it anymore. Anyway, I do have the data, at least.

-Ben

On Fri, Mar 4, 2016 at 1:04 AM, Luis Periquito  wrote:

> you should really fix the peering objects.
>
> So far what I've seen in ceph is that it prefers data integrity over
> availability. So if it thinks that it can't keep all working properly
> it tends to stop (i.e. blocked requests), thus I don't believe there's
> a way to do this.
>
> On Fri, Mar 4, 2016 at 1:04 AM, Ben Hines  wrote:
> > I have a few bad objects in ceph which are 'stuck on peering'.  The
> clients
> > hit them and they build up and eventually stop all traffic to the OSD.
>  I
> > can open up traffic by resetting the OSD (aborting those requests)
> > temporarily.
> >
> > Is there a way to tell ceph to cancel/abort these 'slow requests' once
> they
> > get to certain amount of time? Rather than building up and blocking
> > everything..
> >
> > -Ben
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Upgrade from Hammer LTS to Infernalis or wait for Jewel LTS?

2016-03-04 Thread Mihai Gheorghe
Here is the roadmap: http://docs.ceph.com/docs/master/releases/

The EOL dates are estimated, or at least that is how I read the estimated
retirement column.

We are already running Hammer. No issues here, except for the cache tier pool
with the promotion bug. I don't think the fix was backported to Hammer as of
the time of writing, although it might have been. I didn't test it again.

Are there stability issues with Infernalis? As I understand it, the difference
between an LTS and a non-LTS release is that an LTS has a longer life span and
fixes will be backported, so there is no need to upgrade to the next release to
get the patches. I thought Infernalis should be as stable as Hammer because it
is an intermediary release between LTS versions, not a cutting-edge or nightly
release.

So in your experience no problems were encountered in upgrading from Hammer to
Infernalis, if I understand right?
On 4 Mar 2016 10:53, "Luis Periquito"  wrote:

> On Wed, Mar 2, 2016 at 9:32 AM, Mihai Gheorghe  wrote:
> > Hi,
> >
> > I've got two questions!
> >
> > First. We are currently running Hammer in production. You are thinking of
> > upgrading to Infernalis. Should we upgrade now or wait for the next LTS,
> > Jewel? On ceph releases i can see Hammers EOL is estimated in november
> 2016
> > while Infernalis is June 2016.
>
> I don't know where you got this information but it seems wrong. From
> previous history the last 2 LTS versions are supported (currently
> Firefly and Hammer). That would mean that Hammer should be supported
> until the L version is released. Infernalis should be supported until
> the release of Jewel.
>
> > If i follow the upgrade procedure there should not be any problems,
> right?
>
> So far we've upgraded every version without issues. But past performance...
>
> >
> > Second. When Jewel LTS will be released, does anybody know if we can
> upgrade
> > straight from Hammer or first we need to upgrade to Infernalis and then
> > Jewel. If the latter is the case i see no reason not to upgrade now to
> > Infernalis and wait for Jewel release to upgrade again. This way we can
> take
> > advantage of the new features in Infernalis.
>
> Usually you can upgrade LTS -> LTS, so you should be able to go from
> Hammer to Jewel. The same should be true to Infernalis. However
> minimum versions may apply (like you need at least version 0.94.4 to
> upgrade to infernalis).
>
> >
> > Also what is the correct order of upgrading? Mons first then OSDs?
>
> Usually mons, then osds and then mds and radosgw. But if there's
> something different it'll be published in the release notes.
>
> >
> > Any input on the matter would be greatly apreciated.
>
> If it was me, depending on what you value most: if you prefer
> stability and a conservative approach I'd install Hammer. If you
> prefer features and performance I'd install Infernalis.
> As an example all major players (like Redhat, Fujitsu, Suse, etc) use
> only the LTS versions for their distros.
>
> >
> > Thank you.
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cache tier operation clarifications

2016-03-04 Thread Shinobu Kinjo
Great feedback (at least for me).
I would like to know if the behaviours you seeing are expected things or not.

BTW I will do some test regarding to cache tier with my new toy.

Cheers,
S

On Fri, Mar 4, 2016 at 5:17 PM, Christian Balzer  wrote:
>
> Hello,
>
> Unlike the subject may suggest, I'm mostly going to try and explain how
> things work with cache tiers, as far as I understand them.
> Something of a reference to point to.
> Of course if you spot something that's wrong or have additional
> information, by all means please do comment.
>
> While the documentation in master now correctly warns that you HAVE to set
> target_max_bytes (the size of your cache pool) for any of the relative
> sizing bits to work, lets repeat that here since it wasn't mentioned there
> previously.
> And without that value being set, none of the flushing or eviction will
> happen, resulting in blocked IOs when it gets full.
>
> The other thing about target_max_bytes is to remember (documented nowhere)
> that this space calculation is base per PG.
> So if you have a 1024GB cache pool and target_max_bytes set accordingly
> (one of the most annoying things about Ceph is have to specify full bytes
> in most places instead of human friendly shortcuts like "1TB"), Ceph
> (the cache tiering agent to be precise) will think that the cache is 50%
> full when just one PG has reached 512MB.
>
> In short, expect things to happen quite a bit before you reach the usage
> that you think you specified in cache_target_dirty_ratio and
> cache_target_full_ratio.
> Annoying, but at least failing safe.
>
> I'm ignoring target_max_objects for this, as it's the same for object
> count instead of space.
> min_read_recency_for_promote and min_write_recency_for_promote I shall
> ignore for now as well, since I have no cluster to test them with.
>
> Flush
> Either way once Ceph thinks you've reached the cache_target_dirty_ratio
> specified, it copies dirty objects to the backing storage.
> If they never existed there before, they will be created (so keep that in
> mind if you see an increase in objects).
> This (additional object) is similar to tier promotion, when an existing
> object is copied from the base pool to the cache pool the first time it's
> accessed.
>
> In versions after Hammer there is also cache_target_dirty_high_ratio,
> which specifies at which point more aggressive flushing starts.
>
> Note that flushing keeps objects in the cache.
> So that object you wrote too some days ago and kept reading frequently
> ever since isn't just going away to the slower base pool.
>
> Evict
> Next is eviction. This is where things became bit more muddled for me and
> I had to do some testing and staring at objects in PGs.
> So your cache pool is now hitting the cache_target_full_ratio (or so the
> wonky space per PG algorithm thinks).
> Remember that all IO will stop once the cache pool gets 100% full, so you
> want this to happen at some safe, sane point before this.
> What that point is depends of course on the maximum write speed to your
> pool, how fast your cache can flush to the base pool, etc.
> Now here is the fun part, clean objects (ones that have not been modified
> since they were promoted from the base pool or last flushed) are eligible
> for eviction.
> When reading about this the first time I thought this involved more moving
> of data from the cache pool to the base pool.
> However what happens is that since the object is "clean" (copy exists on
> the base pool), it is simply zero'd (after demotion), leaving an empty
> rados object in the cache pool and consequently releasing space.
>
> So as far as IO and network traffic is concerned, your enemy is flushing,
> not eviction.
>
> In clusters that have a clear usage pattern and idle times, a command
> to trigger flushes for a specified ratio and with settable IO limits would
> be most welcome. (hint-hint)
> Lacking this for now, I've be pondering a cron job that sets
> cache_target_dirty_ratio from .7 (my current value) to .6 (or more
> likely something smaller, like .65) for a few hours during night and then
> back up again.
> This is based on our cache typically not growing more than 2% per day.
>
> Lastly we come to cache_min_flush_age and cache_min_evict_age.
> It is my understanding that in Hammer and later a truly full cache pool
> will cause these to be ignored to prevent IO deadlocks, correct?
>
> The largest source of cache pollution for us are VM reboots (all those
> objects holding the kernel and other things only read at startup, never to
> be needed again for months) while on the other hand we have about 10k
> truly hot objects that are constantly being read/written.
> Lacking min_write_recency_for_promote for now, I've been thinking to set
> cache_min_evict_age to several hours.
> Truly cold objects will be subject to eviction, even lukewarm ones get to
> stay.
> Note that for objects that more or less belong in the cache we're using
> less than 15% of its capacity.
>
> Chri

[ceph-users] slow requests with rbd

2016-03-04 Thread Jan Krcmar
hi,

I have rbd0 mapped to a client, xfs formatted. I'm putting a lot of data on it.
The following messages appear in the logs and the 'ceph -s' output:

osd.255 [WRN] 1 slow requests, 1 included below; oldest blocked for >
51.726881 secs
osd.255 [WRN] slow request 51.726881 seconds old, received at
2016-03-04 12:22:23.549737: osd_op(client.14296.1:389333
rbd_data.37d230c8153.000d1cc8 [set-alloc-hint object_size
4194304 write_size 4194304,writefull 0~4194304] 2.fc8c5908
ondisk+write e7523) currently waiting for subops from 120,239

It causes slowdowns on writes. iostat, load and dmesg on the OSDs show nothing odd.

Could anyone give me a hint?

server:
linux 3.16.0-0.bpo.4-amd64
ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
running in docker container

client:
linux 4.4.1-2-ARCH
ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)

thanks
fous
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] slow requests with rbd

2016-03-04 Thread Max A. Krasilnikov
Hello!

On Fri, Mar 04, 2016 at 01:33:24PM +0100, honza801 wrote:

> hi,

> i have rbd0 mapped to client, xfs formatted. i'm putting a lot of data on it.
> following messages appear in logs and 'ceph -s' output

> osd.255 [WRN] 1 slow requests, 1 included below; oldest blocked for >
> 51.726881 secs
> osd.255 [WRN] slow request 51.726881 seconds old, received at
> 2016-03-04 12:22:23.549737: osd_op(client.14296.1:389333
> rbd_data.37d230c8153.000d1cc8 [set-alloc-hint object_size
> 4194304 write_size 4194304,writefull 0~4194304] 2.fc8c5908
> ondisk+write e7523) currently waiting for subops from 120,239

> it causes slow downs on writes. iostat, load, dmesg on osds shows nothing odd.

> could anyone give me a hint?

I spent a lot of time on this trouble because of "overtuning" of the Linux TCP/IP
stack using sysctl. If your disks are not overloaded and your network is not
overloaded, take a look at the network configuration, including sysctl.

BTW, the default sysctl settings are quite good :) Things can be better, but they
are stable enough.
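
A couple of things worth checking (just a sketch; run the daemon command on
the host that carries the OSD and adjust the id):

# see what the blocked requests are actually waiting for
ceph daemon osd.255 dump_ops_in_flight
# compare the current TCP tuning against a known-good or default host
sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem net.core.rmem_max net.core.wmem_max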

-- 
WBR, Max A. Krasilnikov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG's stuck inactive, stuck unclean, incomplete, imports cause osd segfaults

2016-03-04 Thread Philip S. Hempel

On 03/03/2016 03:52 PM, Philip S. Hempel wrote:


Thanks, appreciate the help.
That is where I have gotten as well, so if we have a developer out 
there that can help please let me know.

There is budget to pay someone for the help.

We are still looking for someone to help us, if possible. I believe that 
with the data I have exported, there is a way to get this data back 
into Ceph. But these strange segfaults are beyond my understanding.


If there is a developer that is willing to be paid to help work with 
this, please let me know.


Thanks

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Problem: silently corrupted RadosGW objects caused by slow requests

2016-03-04 Thread Ritter Sławomir
Thanks for the reply.

> > 2016-02-23 13:49:58.818640 osd.260 10.176.67.27:6800/688083 2119 : [WRN] 4
> > slow requests, 4 included below; oldest blocked for > 30.727096 secs
> > 2016-02-23 13:49:58.818673 osd.260 10.176.67.27:6800/688083 2120 : [WRN]
> > slow request 30.727096 seconds old, received at 2016-02-23 13:49:28.091460:
> > osd_op(client.47792965.0:185007087 
> > default.14654.445__shadow_c9f8db1b-cee2-4ec8-8fb3-8b4bc7585d80.1456231572.877051.ismv.b73N3OmW4OhCjDYSR-RTkZNNIKA1C9Z.57_2
> > [writefull 0~524288] 10.ce729ebe e107594) v4 currently waiting for subops 
> > from
> > [469,9]
> Did these requests ever finish?
There is no more info in ceph.log (any other way to check it?).
...but the related RADOS object is complete and it seems to have the correct mtime
(2016-02-23T12:49:28+00:00 = time of the HTTP_500 and the "received at" time from
the slow request).
.rgw.buckets/default.14654.445__shadow_c9f8db1b-cee2-4ec8-8fb3-8b4bc7585d80.1456231572.877051.ismv.b73N3OmW4OhCjDYSR-RTkZNNIKA1C9Z.57_2
 mtime 1456231768, size 2097152   

The previous object also has the same mtime:
.rgw.buckets/default.14654.445__shadow_c9f8db1b-cee2-4ec8-8fb3-8b4bc7585d80.1456231572.877051.ismv.b73N3OmW4OhCjDYSR-RTkZNNIKA1C9Z.57_1
 mtime 1456231768, size 4194304

But the first object from this multipart is empty and has a different mtime
(2016-02-23T12:50:00+00:00 - 22s later, during the slow request and before the next
HTTP_200 request).
.rgw.buckets/default.14654.445__multipart_c9f8db1b-cee2-4ec8-8fb3-8b4bc7585d80.1456231572.877051.ismv.b73N3OmW4OhCjDYSR-RTkZNNIKA1C9Z.57
 mtime 1456231800, size 0
There wasn't any slow request info about this object or its OSDs. It seems that the
"empty" state was caused by the slow request on that latter object.

> > 127.0.0.1 - - [23/Feb/2016:13:49:28 +0100] "PUT 
> > /video-shbc/c9f8db1b-cee2-4ec8-8fb3-8b4bc7585d80.1456231572.877051.ismv?uploadId=b73N3OmW4OhCjDYSR-RTkZNNIKA1C9Z&partNumber=57
> >  HTTP/1.0" 500 751 "-" "Boto/2.31.1 Python/2.7.3
> > Linux/3.13.0-39-generic(syncworker)" > >
> > 127.0.0.1 - - [23/Feb/2016:13:49:58 +0100] "PUT 
> > /video-shbc/c9f8db1b-cee2-4ec8-8fb3-8b4bc7585d80.1456231572.877051.ismv?uploadId=b73N3OmW4OhCjDYSR-RTkZNNIKA1C9Z&partNumber=57
> >  HTTP/1.0" 200 221 "-" "Boto/2.31.1 Python/2.7.3
> > Linux/3.13.0-39-generic(syncworker)"

> Thank you. I think you provided some info here that will hopefully
> allow us to identify the root cause.
We have a lot of such S3 objects with empty or missing RADOS parts, but of
course only limited logs (rotation).
Right now we are installing a test cluster. We have methods to trigger floods of
slow requests :).

Regards,
SR
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Problem: silently corrupted RadosGW objects caused by slow requests

2016-03-04 Thread Ritter Sławomir
> From: Robin H. Johnson [mailto:robb...@gentoo.org]
> Sent: Friday, March 04, 2016 12:40 AM
> To: Ritter Sławomir
> Cc: ceph-us...@ceph.com; ceph-devel
> Subject: Re: [ceph-users] Problem: silently corrupted RadosGW objects caused
> by slow requests
> 
> On Thu, Mar 03, 2016 at 01:55:13PM +0100, Ritter Sławomir wrote:
> > Hi,
> >
> > I think this is really serious problem - again:
> >
> > - we silently lost S3/RGW objects in clusters
> >
> > Moreover, it our situation looks very similiar to described in
> > uncorrected bug #13764 (Hammer) and in corrected #8269 (Dumpling).
> FYI fix in #8269 _is_ present in Hammer:
> commit bd8e026f88b rgw: don't allow multiple writers to same multiobject part
> 
> --
> Robin Hugh Johnson
> Gentoo Linux: Developer, Infrastructure Lead, Foundation Trustee
> E-Mail : robb...@gentoo.org
> GnuPG FP   : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
Yes,

the fix for #8269 has also been included in our version: Dumpling 0.67.11.
The guys from #13764 are using a patched Hammer version.

Both situations with corrupted files are very similar to the one described in
#8269.
There was a problem with 2 threads writing to the same RADOS objects.

Maybe there is another, unknown and specific exception to fix?

Cheers,
SR

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Data inaccessible after single OSD down, default size is 3 min size is 1

2016-03-04 Thread Oliver Dzombic
Hi,

we are seeing the effect that single OSDs are getting marked down/out because
they are sometimes too slow.

osd_pool_default_size = 3
osd_pool_default_min_size = 1


pool 0 'rbd' replicated size 3 min_size 1 crush_ruleset 0 object_hash
rjenkins pg_num 512 pgp_num 512 last_change 15391 flags hashpspool
stripe_width 0

pool 6 'cephfs_data' replicated size 3 min_size 1 crush_ruleset 0
object_hash rjenkins pg_num 256 pgp_num 256 last_change 10945 flags
hashpspool crash_replay_interval 45 stripe_width 0

pool 7 'cephfs_metadata' replicated size 3 min_size 1 crush_ruleset 0
object_hash rjenkins pg_num 128 pgp_num 128 last_change 10943 flags
hashpspool stripe_width 0

max_osd 18


If a single OSD goes out/down, I expect the cluster to continue to work,
because we have replicated everything 3 times.

But the virtual servers (KVM), some accessing via librbd and some via
cephfs, are getting cut off from their virtual hard disks.

Why is that?

To my understanding, if 1 OSD is gone, and we replicate everything 3
times, and I assume that Ceph is not so stupid as to put all 3 replicas
on the same OSD, how can it go down like that?
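
(Side note: these are the read-only checks I have been using to look at it,
in case it helps; just a sketch, run from a mon/admin node:)

ceph health detail          # lists blocked requests and the PGs/OSDs involved
ceph osd tree               # shows how the replicas are spread across hosts
ceph pg dump_stuck unclean  # lists PGs that are not active+clean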


Thank you !

-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Upgrade from Hammer LTS to Infernalis or wait for Jewel LTS?

2016-03-04 Thread Ken Dreyer
On Fri, Mar 4, 2016 at 1:53 AM, Luis Periquito  wrote:
> On Wed, Mar 2, 2016 at 9:32 AM, Mihai Gheorghe  wrote:
> From previous history the last 2 LTS versions are supported (currently
> Firefly and Hammer).

Note that Firefly reached end-of-life in January, and we're no longer
issuing releases for it.

- Ken
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Problem: silently corrupted RadosGW objects caused by slow requests

2016-03-04 Thread Yehuda Sadeh-Weinraub
On Fri, Mar 4, 2016 at 7:26 AM, Ritter Sławomir
 wrote:
>> From: Robin H. Johnson [mailto:robb...@gentoo.org]
>> Sent: Friday, March 04, 2016 12:40 AM
>> To: Ritter Sławomir
>> Cc: ceph-us...@ceph.com; ceph-devel
>> Subject: Re: [ceph-users] Problem: silently corrupted RadosGW objects caused
>> by slow requests
>>
>> On Thu, Mar 03, 2016 at 01:55:13PM +0100, Ritter Sławomir wrote:
>> > Hi,
>> >
>> > I think this is really serious problem - again:
>> >
>> > - we silently lost S3/RGW objects in clusters
>> >
>> > Moreover, it our situation looks very similiar to described in
>> > uncorrected bug #13764 (Hammer) and in corrected #8269 (Dumpling).
>> FYI fix in #8269 _is_ present in Hammer:
>> commit bd8e026f88b rgw: don't allow multiple writers to same multiobject part
>>
>> --
>> Robin Hugh Johnson
>> Gentoo Linux: Developer, Infrastructure Lead, Foundation Trustee
>> E-Mail : robb...@gentoo.org
>> GnuPG FP   : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
> Yes,
>
> fix for #8269 also has been included in our version: Dumpling 0.67.11.
> Guys from #13764 are using patched Hammer version

I didn't notice that you were actually running Dumpling (for which we
haven't been supporting and backporting fixes for a while). Here's one issue
that you might have hit:

http://tracker.ceph.com/issues/11604

Yehuda

>
> Both situations with corrupted files are very similiar to that described in 
> #8269.
> There was a problem with 2 threads writing to the same RADOS objects.
>
> Maybe there is another one uknown and specific exception to fix?
>
> Cheers,
> SR
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Can I rebuild object maps while VMs are running ?

2016-03-04 Thread Christoph Adomeit
Hi there,

I just updated our Ceph cluster to Infernalis and now I want to enable the new
image features.

I wonder if I can add the features to the rbd images while the VMs are running.

I want to do something like this:

rbd feature enable $IMG exclusive-lock
rbd feature enable $IMG object-map
rbd feature enable $IMG fast-diff
rbd object-map rebuild $IMG 
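
For reference, afterwards I would verify the result with something like this
(just a sketch):

rbd info $IMG     # the "features:" line should now list the new features
rbd du $IMG       # should be fast once fast-diff/object-map are in place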

I am afraid of corrupting my RBDs when doing this on images that are in use.

What do you think ?

Thanks
  Christoph


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-mon - mon daemon issues

2016-03-04 Thread M Ranga Swami Reddy
Hello,
I have a couple of questions on ceph-mon and the mon daemon:

Q1:   Working command:  /etc/init.d/ceph status mon
      Not working:      status ceph-mon id=node-13

      Why is the first command working while the second one is not?


status ceph-mon id=node-13
status: Unknown instance: ceph/node-13

node-13:~# /etc/init.d/ceph status mon
=== mon.node-13 ===
mon.node-13: running {"version":"0.80.7"}
===
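
The only extra check I could think of is whether upstart even knows about the
mon instance (just a guess on my part; a sysvinit-started daemon would not
show up here):

initctl list | grep ceph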

Q2:

      The ceph-mon command is running with --pid-file, but I have not seen the
      --pid-file option in either the Ceph docs or the ceph-mon help.
      Why is the --pid-file option required? If it is not passed, what is the
      default for this option?

===
node-13:~# ps -ef | grep ceph-mon

root 43508 1  1 18:31 ?00:00:04 /usr/bin/ceph-mon -i
node-13 --pid-file /var/run/ceph/mon.node-13.pid -c
/etc/ceph/ceph.conf --cluster ceph

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cache tier operation clarifications

2016-03-04 Thread Francois Lafont
Hello,

On 04/03/2016 09:17, Christian Balzer wrote:

> Unlike the subject may suggest, I'm mostly going to try and explain how
> things work with cache tiers, as far as I understand them.
> Something of a reference to point to. [...]

I'm currently unqualified concerning cache tiering, but I'm pretty
sure that your post is very relevant and I think you should make
a pull request against the Ceph documentation where you could contribute
all these insights. Here, your explanations will be lost in the depths
of the mailing list. ;)

Regards.

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph RBD latencies

2016-03-04 Thread Christian Balzer

Hello,

On Thu, 3 Mar 2016 23:26:13 + Adrian Saul wrote:

> 
> > Samsung EVO...
> > Which exact model, I presume this is not a DC one?
> >
> > If you had put your journals on those, you would already be pulling
> > your hairs out due to abysmal performance.
> >
> > Also with Evo ones, I'd be worried about endurance.
> 
> No,  I am using the P3700DCs for journals.  

Yup, thats why I wrote "If you had...". ^o^

>The Samsungs are the 850 2TB
> (MZ-75E2T0BW).  Chosen primarily on price.  

These are spec'ed at 150TBW, or an amazingly low 0.04 DWPD (over 5 years).
Unless you have a read-only cluster, you will wind up spending MORE on
replacing them (and/or losing data when 2 fail at the same time) than
going with something more sensible like Samsung's DC models or the Intel
DC ones (S3610s come to mind for "normal" use). 
See also the current "List of SSDs" thread in this ML.

>We already built a system
> using the 1TB models with Solaris+ZFS and I have little faith in them.
> Certainly their write performance is erratic and not ideal.  We have
> other vendor options which are what they call "Enterprise Value" SSDs,
> but still 4x the price.   I would prefer a higher grade drive but
> unfortunately cost is being driven from above me.
>
Fast, reliable, cheap. Pick any 2. 

On your test setup or even better the Solaris one, have a look at their
media wearout, or  Wear_Leveling_Count as Samsung calls it.
I bet that makes for some scary reading.
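
(Something along these lines, assuming smartmontools is installed and the
drive is /dev/sdX:)

smartctl -A /dev/sdX | egrep -i 'wear|media'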

> > > On the ceph side each disk in the OSD servers are setup as an
> > > individual OSD, with a 12G journal created on the flash mirror.   I
> > > setup the SSD servers into one root, and the SATA servers into
> > > another and created pools using hosts as fault boundaries, with the
> > > pools set for 2 copies.
> > Risky. If you have very reliable and well monitored SSDs you can get
> > away with 2 (I do so), but with HDDs and the combination of their
> > reliability and recovery time it's asking for trouble.
> > I realize that this is testbed, but if your production has a
> > replication of 3 you will be disappointed by the additional latency.
> 
> Again, cost - the end goal will be we build metro based dual site pools
> which will be 2+2 replication.  
Note that Ceph (RBD/RADOS to be precise) isn't particularly suited for
"long" distance replication due to the incurred latencies.

That's unless your replication is happening "above" Ceph in the iSCSI bits
with something that's more optimized for this. 

Something along the lines of the DRBD proxy has been suggested for Ceph,
but if anything it is a back-burner project at best, from what I gather.


> I am aware of the risks but already
> presenting numbers based on buying 4x the disk we are able to use gets
> questioned hard.
> 
There are some ways around this, which may or may not be suitable for your
use case.
EC pools (or RAID'ed OSDs, which I prefer) for HDD based pools.
Of course this comes at a performance penalty, which you can offset again
with, for example, fast RAID controllers with HW cache to some extent.
But it may well turn out to be a zero-sum game.

Another thing is to use a cache pool (with top of the line SSDs), this is
of course only a sensible course of action if your hot objects will fit in
there.
In my case they do (about 10-20% of the 2.4TB raw pool capacity) and
everything is as fast as can be expected and the VMs (their time
critical/sensitive application to be precise) are happy campers.

> > This smells like garbage collection on your SSDs, especially since it
> > matches time wise what you saw on them below.
> 
> I concur.   I am just not sure why that impacts back to the client when
> from the client perspective the journal should hide this.   If the
> journal is struggling to keep up and has to flush constantly then
> perhaps, but  on the current steady state IO rate I am testing with I
> don't think the journal should be that saturated.
>
There's a counter in Ceph (counter-filestore_journal_bytes) that you can
graph for journal usage. 
The highest I have ever seen is about 100MB for HDD based OSDs, less than
8MB for SSD based ones with default(ish) Ceph parameters. 
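
(To eyeball it without a graphing setup, a sketch; run on the OSD host, and
note that the exact counter name may vary a bit between versions:)

ceph daemon osd.0 perf dump | python -m json.tool | grep -i journal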

Since you seem to have experience with ZFS (I don't really, but I read
a lot ^o^), consider the Ceph journal the equivalent of the ZIL.
It is a write-only journal; it never gets read from unless there is a
crash.
That is why sequential, sync write speed is the utmost criterion for a Ceph
journal device.

If I recall correctly you were testing with 4MB block streams, thus pretty
much filling the pipe to capacity; atop on your storage nodes will give a
good insight.

The journal is great to cover some bursts, but the Ceph OSD is flushing
things from RAM to the backing storage on configurable time limits and
once these are exceeded and/or you run out of RAM (pagecache), you are
limited to what your backing storage can sustain.

Now in real life, you would want a cluster and especially OSDs that are
lightly to medium loaded on average and i

[ceph-users] Infernalis 9.2.1: the "rados df" command shows wrong data

2016-03-04 Thread Mike Almateia

Hello Cephers!

On my small cluster I see this:

[root@c1 ~]# rados df
pool name KB  objects   clones degraded 
 unfound   rdrd KB   wrwr KB
data   0000 
   064   158212215700473
hotec  797656118 2534907500 
   0   370557163024145 69629631  17786794779
rbd0000 
   00000

  total used  2080528140 25349075
  total avail   234455230732
  total space   236596810264
[root@c1 ~]#


Why does the "total used" line show a different value? It should be
"797656118", I think.

Pools:
* data - EC pool 3+2
* hotec - replicated pool size 2, cache tier for 'data'

Can anyone explain this?

--
Mike, runs!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com