[ceph-users] Cache Tier 1 vs. Journal

2015-02-12 Thread Mike
Hello!

If I am using a cache tier pool in writeback mode, is it a good idea to turn
off the journal on the backing OSDs?

I think in this situation the journal only helps if you hit a rebalance in the
"cold" storage. In any other situation the journal is useless, I think.
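
For reference, this is the sort of writeback-tier setup I mean - only a sketch,
with made-up pool names ("cold-pool" for the slow HDD pool, "hot-pool" for the SSDs):

ceph osd tier add cold-pool hot-pool
ceph osd tier cache-mode hot-pool writeback
ceph osd tier set-overlay cold-pool hot-pool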

Any comments?


Re: [ceph-users] Ceph Supermicro hardware recommendation

2015-02-17 Thread Mike
17.02.2015 04:11, Christian Balzer wrote:
> 
> Hello,
> 
> re-adding the mailing list.
> 
> On Mon, 16 Feb 2015 17:54:01 +0300 Mike wrote:
> 
>> Hello
>>
>> 05.02.2015 08:35, Christian Balzer wrote:
>>>
>>> Hello,
>>>
>>>>>
>>>>>> LSI 2308 IT
>>>>>> 2 x SSD Intel DC S3700 400GB
>>>>>> 2 x SSD Intel DC S3700 200GB
>>>>> Why the separation of SSDs? 
>>>>> They aren't going to be that busy with regards to the OS.
>>>>
>>>> We would like to use 400GB SSD for a cache pool, and 200GB SSD for
>>>> the journaling.
>>>>
>>> Don't, at least not like that.
>>> First and foremost, SSD based OSDs/pools have different requirements,
>>> especially when it comes to CPU. 
>>> Mixing your HDD and SSD based OSDs in the same chassis is generally
>>> a bad idea.
>>
>> Why? If we have, for example, a SuperServer 6028U-TR4+ with a proper
>> configuration (4 x SSD DC S3700 for the cache pool / 8 x 6-8TB SATA HDD for
>> cold storage / E5-2695V3 CPU / 128GB RAM), why is it still a bad idea? Is it
>> something inside Ceph that doesn't work well?
>>
> 
> Ceph in and by itself will of course work.
> 
> But your example up there is total overkill on one hand and simply not
> balanced on the other hand.
> You'd be much better off (both performance and price wise) if you'd go
> with something less powerful for a HDD storage node like this:
> http://www.supermicro.com/products/system/2U/6027/SSG-6027R-E1R12T.cfm
> with 2 400GB Intels in the back for journals and 16 cores total.
> 
> While your SSD based storage nodes would be nicely dense by using
> something like:
> http://www.supermicro.com/products/system/2U/2028/SYS-2028TP-DC0TR.cfm
> with 2 E5-2690 v3 per node (I'd actually rather prefer E5-2687W v3, but
> those are running too hot).
> Alternatively one of the 1U cases with up to 10 SSDs.
> 
> Also maintaining a crush map that separates the SSD from HDD pools is made
> a lot easier, less error prone by segregating nodes into SSD and HDD ones.
> 
> There are several more reasons below.
> 
> 

Yes, these are normal configuration variants. But this way you have 2
different node types instead of 1, which requires more support inside the company.

In the whole setup you end up with one configuration each for the MON, OSD and
SSD-cache servers, and yet another configuration for the compute nodes.

A lot of support, supplies, attention.

That's why we are still trying to reduce the number of configurations we have to
support. It's a balance of support versus cost/speed/etc.

>> For me a cache pool is a first, fast, small storage tier in front of the big, slow storage.
>>
> That's the idea, yes.
> But besides the problems with performance I'm listing again below, that
> "small" is another, very difficult to judge in advance problem.
> By mixing your cache pool SSD OSDs into the HDD OSD chassis, you're
> making yourself inflexible in that area (as in just add another SSD cache
> pool node when needed). 
> 

Yes, it is somewhat inflexible, but I have one configuration instead of two and
can grow the cluster simply by adding nodes.

>> You don't need the journal anymore, and if you need to, you can enlarge the
>> fast storage.
>>
> You still need the journal of course, it's (unfortunately in some cases)
> a basic requirement in Ceph. 
> I suppose what you meant is "don't need journal on SSDs anymore".
> And while that is true, this makes your slow storage at least twice as
> slow, which at some point (deep-scrub, data re-balancing, very busy
> cluster) is likely to make you wish you had those journal SSDs.
> 
>  

Yes, the journal on the cold storage is needed for re-balancing the cluster when
some node/HDD fails, or when objects are promoted to / evicted from the SSD cache.

I remember an email on this list from one of the Inktank guys (sorry, I don't
remember his full email address and name) where he wrote that "you don't need a
journal if you use a cache pool".

>>> If you really want to use SSD based OSDs, go at least with Giant,
>>> probably better even to wait for Hammer. 
>>> Otherwise your performance will be nowhere near the investment you're
>>> making. 
>>> Read up in the ML archives about SSD based clusters and their
>>> performance, as well as cache pools.
>>>
>>> Which brings us to the second point, cache pools are pretty pointless
>>> currently when it comes to performance. So unless you're planning to
>>> use EC pools, you will gain very little from them.
>>
>> So, is an SSD cache pool useless at all?
>>
> They're (currently) not perf

[ceph-users] What a maximum theoretical and practical capacity in ceph cluster?

2014-10-27 Thread Mike
Hello,
My company is planning to build a big Ceph cluster for archiving and
storing data.
The customer's requirement: 70% of the capacity is SATA, 30% SSD.
On the first day data is stored on the SSD storage; the next day it is moved to
the SATA storage.

For now we have decided to use a SuperMicro SKU with 72 bays per chassis = 22 SSD +
50 SATA drives.
Our racks can hold 10 of these servers, and with 50 such racks in the Ceph cluster =
36,000 OSDs.
With 4TB SATA drives, replica = 2 and a nearfull ratio of 0.8 we get 40
petabytes of useful capacity.
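
For reference, the arithmetic behind those numbers:

SATA OSDs: 50 drives x 10 servers x 50 racks = 25,000
SSD OSDs:  22 drives x 10 servers x 50 racks = 11,000  (36,000 OSDs total)
Raw SATA:  25,000 x 4TB = 100 PB
Usable:    100 PB / 2 (replica) x 0.8 (nearfull) = 40 PB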

Is that too big, or a normal use case for Ceph?


[ceph-users] Again: full ssd ceph cluster

2014-12-10 Thread Mike
Hello all!
One of our customers has asked for SSD-only storage.
For now we are looking at a 2027R-AR24NV w/ 3 x HBA controllers (LSI3008 chip,
8 internal 12Gb ports on each), 24 x Intel DC S3700 800GB SSD drives, 2
x Mellanox 40Gbit ConnectX-3 (maybe the newer ConnectX-4 100Gbit) and a Xeon
E5-2660V2 with 64GB RAM.
Replica is 2.
Or something like that, but in 1U w/ 8 SSDs.

We see a small bottleneck in the network cards, but the biggest question is:
can Ceph (the Giant release), with IO sharding and the other new cool stuff,
unlock this potential?

Any ideas?


Re: [ceph-users] Again: full ssd ceph cluster

2014-12-11 Thread Mike
Hello,
On 12/11/2014 11:35 AM, Christian Balzer wrote:
> 
> Hello,
> 
> On Wed, 10 Dec 2014 18:08:23 +0300 Mike wrote:
> 
>> Hello all!
>> One of our customers has asked for SSD-only storage.
>> For now we are looking at a 2027R-AR24NV w/ 3 x HBA controllers (LSI3008 chip,
>> 8 internal 12Gb ports on each), 24 x Intel DC S3700 800GB SSD drives, 2
>> x Mellanox 40Gbit ConnectX-3 (maybe the newer ConnectX-4 100Gbit) and a Xeon
>> E5-2660V2 with 64GB RAM.
> 
> A bit skimpy on the RAM given the amount of money you're willing to spend
> otherwise.
I think a larger amount of RAM can help with the re-balance process in the
cluster when one node fails.

> And while you're giving it 20 2.2GHz cores, that's not going to cut it, not
> by a long shot. 
> I did some brief tests with a machine having 8 DC S3700 100GB for OSDs
> (replica 1) under 0.80.6 and the right (make that wrong) type of load
> (small, 4k I/Os) did melt all of the 8 3.5GHz cores in that box.
We can choose something more powerful from the E5-266x v3 family.

> The suggest 1GHz per OSD by the Ceph team is for pure HDD based OSDs, the
> moment you add journals on SSDs it already becomes barely enough with 3GHz
> cores when dealing with many small I/Os.
> 
>> Replica is 2.
>> Or something like that but in 1U w/ 8 SSD's.
>>
> The potential CPU power to OSD ratio will be much better with this.
> 
Yes, that looks more reasonable.

>> We see a small bottleneck in the network cards, but the biggest question is:
>> can Ceph (the Giant release), with IO sharding and the other new cool stuff,
>> unlock this potential?
>>
> You shouldn't worry too much about network bandwidth unless you're going
> to use this super expensive setup for streaming backups. ^o^ 
> I'm certain you'll run out of IOPS long before you'll run out of network
> bandwidth.
> 
I am also thinking about a possible bottleneck in the kernel IO subsystem.

> Given that what I recall of the last SSD cluster discussion, most of the
> Giant benefits were for read operations and the write improvement was
> about double that of Firefly. While nice, given my limited tests that is
> still a far cry away from what those SSDs can do, see above.
> 
I have also read all those threads about Giant read performance. But on writes
it is still only about double Firefly?

>> Any ideas?
>>
> Somebody who actually has upgraded an SSD cluster from Firefly to Giant
> would be in the correct position to answer that.
> 
> Christian
> 

Thank you for the useful opinion, Christian!


Re: [ceph-users] Number of SSD for OSD journal

2014-12-15 Thread Mike
15.12.2014 23:45, Sebastien Han wrote:
> Salut,
> 
> The general recommended ratio (for me at least) is 3 journals per SSD. Using 
> 200GB Intel DC S3700 is great.
> If you’re going with a low perf scenario I don’t think you should bother 
> buying SSD, just remove them from the picture and do 12 SATA 7.2K 4TB.
> 
> For medium and medium++ perf, using a ratio of 1:11 is way too high; the SSD will 
> definitely be the bottleneck here.
> Please also note that (bandwidth wise) with 22 drives you're already hitting 
> the theoretical limit of a 10Gbps network (~50MB/s * 22 ~= 1.1GB/s, i.e. roughly 10Gbps).
> You can theoretically up that value with LACP (depending on the 
> xmit_hash_policy you’re using of course).
> 
> Btw what’s the network? (since I’m only assuming here).
> 
> 
>> On 15 Dec 2014, at 20:44, Florent MONTHEL  wrote:
>>
>> Hi,
>>
>> I’m buying several servers to test CEPH and I would like to configure 
>> journal on SSD drives (maybe it’s not necessary for all use cases)
>> Could you help me to identify number of SSD I need (SSD are very expensive 
>> and GB price business case killer… ) ? I don’t want to experience SSD 
>> bottleneck (some abacus ?).
>> I think I will be with below CONF 2 & 3
>>
>>
>> CONF 1 DELL 730XC "Low Perf":
>> 10 SATA 7.2K 3.5" 4TB + 2 SSD 2.5" 200GB "intensive write"
>>
>> CONF 2 DELL 730XC "Medium Perf":
>> 22 SATA 7.2K 2.5" 1TB + 2 SSD 2.5" 200GB "intensive write"
>>
>> CONF 3 DELL 730XC "Medium Perf ++":
>> 22 SAS 10K 2.5" 1TB + 2 SSD 2.5" 200GB "intensive write"
>>
>> Thanks
>>
>> Florent Monthel
>>

There is also another way:
* for CONF 2 and 3, replace the 200GB SSDs with 800GB ones and add another 1-2
SSDs to each node
* make a tier-1 read-write cache on the SSDs
* you can also add journal partitions on them if you wish - then data will move
from SSD to SSD before being flushed down to the HDDs
* on the HDDs you can make an erasure-coded pool or a replicated pool

You have 10Gbit Ethernet and 4 SSDs that are also used for journals - so you may
end up with a bottleneck in the NIC rather than the SSDs, and in the future it is
easy to avoid that by replacing the NIC.

In my opinion the backend network must be equivalent to or faster than the
frontend one, because the time spent rebalancing the cluster is very important
and must be kept very low, aiming towards zero.
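
Rough numbers behind that NIC concern (assuming the vendor's sequential-write
figure of roughly 460MB/s for an 800GB DC S3700, so treat this as an estimate):

4 x SSD journals: 4 x ~460MB/s ~= 1.8GB/s
10Gbit Ethernet:  ~1.25GB/s

so a single 10Gbit link saturates before the journal SSDs do.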




Re: [ceph-users] Number of SSD for OSD journal

2014-12-16 Thread Mike
16.12.2014 10:53, Daniel Schwager wrote:
> Hallo Mike,
> 
>> There is also another way:
>> * for CONF 2 and 3, replace the 200GB SSDs with 800GB ones and add another 1-2
>> SSDs to each node
>> * make a tier-1 read-write cache on the SSDs
>> * you can also add journal partitions on them if you wish - then data will move
>> from SSD to SSD before being flushed down to the HDDs
>> * on the HDDs you can make an erasure-coded pool or a replicated pool
> 
> Do you have some experience (performance?) with SSDs as a caching tier? Maybe 
> some small benchmarks? From the mailing list, I "feel" that SSD tiering is 
> not much used in production.
> 
> regards
> Danny
> 
> 

No. But I think it's better than using SSDs only for journals. Look at
StorPool or Nutanix (in some ways) - they use SSDs as storage / as a long-lived
cache in front of the storage.

Cache pool tiering is a new feature in Ceph, introduced in Firefly.
That explains why cache tiering hasn't been used much in production yet.



[ceph-users] Change size journal's blocks from 4k to another.

2014-05-07 Thread Mike
Hello.

In my Ceph installation I use an SSD drive for the journal, with direct
access to a block device.

When an OSD is started I see the following line in the log file:
...
1 journal _open /dev/sda1 fd 22: 19327352832 bytes, block size 4096
bytes, directio = 1, aio = 1
...

How can I change the block size from 4k to 512k? My SSD shows
better performance with large blocks:
* With 4K (sdr6 - source, sda8 - target)

dd if=/mnt/from/random of=/mnt/sda8/random bs=4k oflag=direct,dsync

iostat show me the statistic:
iostat -cdm 1 /dev/sda /dev/sdr

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda           16198.00         0.00       126.54          0        126
sdr             126.00        15.75         0.00         15          0


* With 512K (sdr6 - source, sda8 - target)
Sync: sync
Clear cache: echo 1 > /proc/sys/vm/drop_caches
Clear cache LSI controller: megacli -AdpCacheFlush -a0

dd if=/mnt/from/random of=/mnt/sda8/random bs=512k oflag=direct,dsync

iostat show me the statistic:
iostat -cdm 1 /dev/sda /dev/sdr

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda            3021.00         0.01       318.00          0        318
sdr            2410.00       301.25         0.00        301          0

I think my cluster has a bottleneck in the journal block size. How can I
increase the journal block size?
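
If it helps to reproduce the comparison, something like these fio runs could be
used instead of dd (illustrative only - /dev/sda8 is my journal test partition
and the runs overwrite it, so don't point this at live data):

fio --name=journal-4k   --filename=/dev/sda8 --rw=write --bs=4k \
    --direct=1 --sync=1 --ioengine=libaio --iodepth=1 --runtime=30 --time_based
fio --name=journal-512k --filename=/dev/sda8 --rw=write --bs=512k \
    --direct=1 --sync=1 --ioengine=libaio --iodepth=1 --runtime=30 --time_based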

--
Best regards, Mike.


[ceph-users] 5Tb useful space based on Erasure Coded Pool

2015-09-11 Thread Mike
Hello Cephers!
I have an interesting task from one of our clients.
The client has 3000+ video cams (monitoring streets, porches, entrances,
etc.), and we need to store the data from these cams for 30 days.

Each cam generates 1.3TB of data over 30 days; the total bandwidth is 14Gbit/s.
In total we need (1.3TB x 3000) ~4PB+ of data on storage, plus 20% for
recovery if one JBOD fails.
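
(Rough arithmetic: 3000 cams x 1.3TB = ~3.9PB of data, and ~3.9PB x 1.2 for the
extra 20% headroom is roughly 4.7PB of usable capacity needed.)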

The number of cams may increase over time.

Another thing to keep in mind is to make the storage as cheap as possible.

My point of view:
* Pair each Ceph server with a fat JBOD
* Make ~15 such pairs
* On the JBODs make an erasure-coded pool with a reasonable failure domain
* On the Ceph servers make a read-only cache tier, because an erasure-coded pool
can't be accessed directly by clients.

Hardware:
Ceph server
* 2 x e5-2690v3 Xeon (may be 2697)
* 256Gb RAM
* some Intel SSD DCS36xxx series
* 2 x Dualport 10Gbit/s NIC (may be 1 x dualport 10Gbit plus 1 x
Dualport 40Gbit/s for storage network)
* 2 x 4 SAS external port HBA SAS controllers

JBOD
* DATAon DNS-2670/DNS-2684 each can carry 70 or 84 drives or Supermicro
946ED-R2KJBOD that can carry 90 drives.

Ceph settings
* Use the lrc plugin (?), with k=6, m=3, l=3, ruleset-failure-domain=host,
ruleset-locality=rack (a rough sketch of the profile commands is below)
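
For illustration only - the profile/pool names are made up and the PG count is
just a placeholder:

ceph osd erasure-code-profile set cctv-lrc plugin=lrc k=6 m=3 l=3 \
    ruleset-failure-domain=host ruleset-locality=rack
ceph osd pool create cctv-ec 4096 4096 erasure cctv-lrc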

I have not yet learned much about the differences between the erasure plugins,
their performance, or the low-level configuration.

Do you have any advice about this? Can it work at all or not? Can erasure
coding, as implemented in Ceph, solve this task?

For any advice thanks.

--
Mike, yes.


Re: [ceph-users] Building a Pb EC cluster for a cheaper cold storage

2015-11-10 Thread Mike
10.11.2015 19:40, Paul Evans wrote:
> Mike - unless things have changed in the latest version(s) of Ceph, I do *not* 
> believe CRUSH will be successful in creating a valid PG map if the 'n' value 
> is 10 (k+m), your host count is 6, and your failure domain is set to host.  
> You'll need to increase your host count to match or exceed 'n', change the 
> failure domain to OSD, or alter the k+m config to something that is more 
> compatible with your host count…otherwise you'll end up with incomplete PGs.
> Also note that having more failure domains (i.e. - hosts) than your 'n' value 
> is recommended.
> 
> Beyond that, you're likely to run into operational challenges putting that many 
> drives behind a single CPU-complex when the host count is quite low. My $.02.
> --
> Paul

Thanks, Paul!
I hadn't considered that! It's a golden $.02 from you :)

> 
> On Nov 10, 2015, at 2:29 AM, Mike Almateia <mike.almat...@gmail.com> wrote:
> 
> Hello.
> 
> For our CCTV stream-storage project we decided to use a Ceph cluster with an EC 
> pool.
> The input requirements are not scary: max. 15Gbit/s of input traffic from the CCTV, 
> 30 days of storage, 99% write operations, and the cluster must be able to grow 
> without downtime.
> 
> For now our vision of the architecture looks like this:
> * 6 JBOD with 90 HDD 8Tb capacity each (540 HDD total)
> * 6 Ceph servers, each connected to its own JBOD (we will have 6 pairs: 1 Server + 1 
> JBOD).
> 
> Ceph servers hardware details:
> * 2 x E5-2690v3 : 24 core (w/o HT), 2.6 Ghz each
> * 256 Gb RAM DDR4
> * 4 x 10Gbit/s NIC port (2 for Client network and 2 for Cluster Network)
> * servers also have 4 (8) x 2.5" HDD SATA on board for Cache Tiering Feature 
> (because ceph clients can't directly talk with EC pool)
> * Two HBA SAS controllers work with multipathing feature, for HA scenario.
> * For Ceph monitor functionality 3 servers have 2 SSD in Software RAID1
> 
> Some Ceph configuration rules:
> * EC pools with K=7 and M=3
> * EC plugin - ISA
> * technique = reed_sol_van
> * ruleset-failure-domain = host
> * near full ratio = 0.75
> * OSD journal partition on the same disk
> 
> We think the first and second problems will be CPU and RAM on the Ceph 
> servers.
> 
> Any ideas? Can it fly?
> 
> 



Re: [ceph-users] Building a Pb EC cluster for a cheaper cold storage

2015-11-11 Thread Mike
10.11.2015 19:40, Paul Evans wrote:
> Mike - unless things have changed in the latest version(s) of Ceph, I do *not* 
> believe CRUSH will be successful in creating a valid PG map if the 'n' value 
> is 10 (k+m), your host count is 6, and your failure domain is set to host.  
> You'll need to increase your host count to match or exceed 'n', change the 
> failure domain to OSD, or alter the k+m config to something that is more 
> compatible with your host count…otherwise you'll end up with incomplete PGs.
> Also note that having more failure domains (i.e. - hosts) than your 'n' value 
> is recommended.
> 
> Beyond that, you're likely to run into operational challenges putting that many 
> drives behind a single CPU-complex when the host count is quite low. My $.02.
> --
> Paul

What if we make buckets inside each JBOD, for example each JBOD having
3 buckets with 30 HDDs each, and make the failure domain the bucket, not the host?

Yes, if a JBOD fails completely we will lose 3 buckets, but the chance of
two disks failing within one bucket is lower than within the whole JBOD.

Is that reasonable?
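
As a sketch of what I mean (the custom type name "jbodgroup" and the file names
are made up; the real work is editing the decompiled CRUSH map to add the new
bucket type, create 3 such buckets per JBOD, and point the EC rule's failure
domain at them):

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit crushmap.txt: add the "jbodgroup" type, move 30 OSDs under each bucket,
# set the rule's failure domain to jbodgroup instead of host
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new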

> 
> On Nov 10, 2015, at 2:29 AM, Mike Almateia <mike.almat...@gmail.com> wrote:
> 
> Hello.
> 
> For our CCTV stream-storage project we decided to use a Ceph cluster with an EC 
> pool.
> The input requirements are not scary: max. 15Gbit/s of input traffic from the CCTV, 
> 30 days of storage, 99% write operations, and the cluster must be able to grow 
> without downtime.
> 
> For now our vision of the architecture looks like this:
> * 6 JBOD with 90 HDD 8Tb capacity each (540 HDD total)
> * 6 Ceph servers, each connected to its own JBOD (we will have 6 pairs: 1 
> Server + 1 JBOD).
> 
> Ceph servers hardware details:
> * 2 x E5-2690v3 : 24 core (w/o HT), 2.6 Ghz each
> * 256 Gb RAM DDR4
> * 4 x 10Gbit/s NIC port (2 for Client network and 2 for Cluster Network)
> * servers also have 4 (8) x 2.5" HDD SATA on board for Cache Tiering Feature 
> (because ceph clients can't directly talk with EC pool)
> * Two HBA SAS controllers work with multipathing feature, for HA scenario.
> * For Ceph monitor functionality 3 servers have 2 SSD in Software RAID1
> 
> Some Ceph configuration rules:
> * EC pools with K=7 and M=3
> * EC plugin - ISA
> * technique = reed_sol_van
> * ruleset-failure-domain = host
> * near full ratio = 0.75
> * OSD journal partition on the same disk
> 
> We think the first and second problems will be CPU and RAM on the Ceph 
> servers.
> 
> Any ideas? Can it fly?
> 
> 
> 
> 
> 
> 
> --
> Paul Evans
> Principal Architect
> Daystrom Technology Group
> m: 707-479-1034   o: 800-656-3224 x511
> f: 650-472-4005   e: paul.ev...@daystrom.com
> 



Re: [ceph-users] Building a Pb EC cluster for a cheaper cold storage

2015-11-11 Thread Mike
11.11.2015 06:14, Christian Balzer wrote:
> 
> Hello,
> 
> On Tue, 10 Nov 2015 13:29:31 +0300 Mike Almateia wrote:
> 
>> Hello.
>>
>> For our CCTV stream-storage project we decided to use a Ceph cluster with 
>> an EC pool.
>> The input requirements are not scary: max. 15Gbit/s of input traffic from CCTV, 
>> 30 days of storage,
>> 99% write operations, and the cluster must be able to grow without downtime.
>>
> I have a production cluster that is also nearly write only.
> 
> I'd say that 1.5GB/s is a pretty significant amount of traffic, but not
> scary in and by itself. 
> The question is how many streams are we talking about, how are you writing
> that data (to CephFS, RBD volumes)?

The special CCTV software, running on at most 70 Windows KVM VMs, will store the
traffic on a local drive. No CephFS, only RBD.

The traffic will be balanced across the 70 VMs.

So I am looking at a maximum of 70 streams into the cluster, each stream around
200Mbit/s.

> 
> All of this will decide how IOPS intense (as opposed to throughput)
> storing your streams will be.
> 

99% writes, only big blocks, sequential writes.
I think around 2000 IOPS with 1MB blocks and QD=32.
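
(Rough arithmetic behind that estimate: 15Gbit/s is about 1.9GB/s, which at
1MB per write is roughly 1900-2000 writes/s across the whole cluster.)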

>> For now our vision of the architecture looks like this:
>> * 6 JBOD with 90 HDD 8TB capacity each (540 HDD total)
>> * 6 Ceph servers, each connected to its own JBOD (we will have 6 pairs: 1 
>> Server + 1 JBOD).
>>
> As you guessed yourself and as Paul suspects as well, I think the amount
> of OSDs per node is too dense, more of a CPU than RAM problem, plus the
> other tuning it will require.

Yes, CPU is needed for the cache tier and EC, but if we only have around
250MB/s per server, maybe it will be enough?

> 
> Also the cache tier HDDs (unless they're SSDs) are likely going to be
> another bottleneck.
> 

There is no need for a fast cache tier if we use the read-only cache policy.
An SSD drive also eats more CPU.

> Consider this alternative:
> 
> * Same JBOD chassis
> * Quite different Ceph nodes:
> - 1 or 2 RAID controllers with the most cache you can get (I like Areca's
>   with 4GB, YMMV). That cache (and the journal SSDs suggested below)
>   should take care of things if your 15GBit/s is sufficiently fragmented
>   to cause large amounts of IOPS.
> - 8x 11 disk RAID6, depending on how many controllers you have 1 or 2
>   global hotspares. 
> - 256GB RAM or more, tuned to .
> - If you can afford it, use FAST SSDs (or NVMe) as journals. You want to
>   be able to saturate your network, so around 2GB/s. 
>   Four Intel DC S3700 400GB will get you close to that.
> - Since you now only have 8 OSDs per node, your CPU requirements are more
>   to the tune of 12 (fast, 2.5GHz++) cores.
> 
> With "failproof" OSDs, you can choose 2x (not the default 3x) replication.
> 
> Another bonus is that you'll likely never have a failed OSD and the
> resulting traffic storm.

It's interesting, but much more expensive, and we lose too much capacity:
replicas instead of EC, plus RAID6 (N-2), plus at least 2 hot spares per JBOD.

It is beyond the limits of our budget.

> 
> The trick to keep things happy here are to have enough RAM for all hot
> objects that need to be read, especially inodes and other FS metadata.
> 
> Of course if you can afford it (price/space), having less dense nodes will
> significantly reduce the impact of a node failure.
> 
>> Ceph servers hardware details:
>> * 2 x E5-2690v3 : 24 core (w/o HT), 2.6 Ghz each
>> * 256 Gb RAM DDR4
>> * 4 x 10Gbit/s NIC port (2 for Client network and 2 for Cluster Network)
>> * servers also have 4 (8) x 2.5" HDD SATA on board for Cache Tiering 
>> Feature (because ceph clients can't directly talk with EC pool)
>> * Two HBA SAS controllers work with multipathing feature, for HA
>> scenario.
> A bit of overkill, given how your failure domain will still be at least
> per storage node, worse depending on network/switch topology.
> 
> Regards,
> 
> Christian
> 




[ceph-users] OSD/BTRFS: OSD didn't start after change btrfs mount options

2016-09-19 Thread Mike
.fsid =
f1685d33-9572-4500-b61d-91d8462f8df0
2016-09-19 17:14:59.781344 7f4edc7c8840  2 journal No further valid
entries found, journal is most likely valid
2016-09-19 17:14:59.781355 7f4edc7c8840 10 journal open reached end of
journal.
2016-09-19 17:14:59.781404 7f4edc7c8840 -1 journal Unable to read past
sequence 2 but header indicates the journal has committed up through
150825, journal is corrupt
2016-09-19 17:14:59.785143 7f4edc7c8840 -1 os/filestore/FileJournal.cc:
In function 'bool FileJournal::read_entry(ceph::bufferlist&, uint64_t&,
bool*)' thread 7f4edc7c8840 time 2016-09-19 17:14:59.781414
os/filestore/FileJournal.cc: 2031: FAILED assert(0)

 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x85) [0x7f4edd1f55b5]
 2: (FileJournal::read_entry(ceph::buffer::list&, unsigned long&,
bool*)+0x90c) [0x7f4edcf99d0c]
 3: (JournalingObjectStore::journal_replay(unsigned long)+0x1ee)
[0x7f4edcee9cde]
 4: (FileStore::mount()+0x3cd6) [0x7f4edcec14d6]
 5: (OSD::init()+0x27d) [0x7f4edcb8504d]
 6: (main()+0x2c55) [0x7f4edcaeabe5]
 7: (__libc_start_main()+0xf5) [0x7f4ed94a5b15]
 8: (()+0x353009) [0x7f4edcb35009]
 NOTE: a copy of the executable, or `objdump -rdS ` is
needed to interpret this.
***

I'm stuck. I can't understand what I did wrong or how to recover the OSDs.
Googling didn't help me.

-- 
Mike, runs!



[ceph-users] PG calculate for cluster with a huge small objects

2016-11-24 Thread Mike
Hello.

We have a cluster: 32 OSDs, 80TB of used space, Firefly 0.80.9 release.

We use this cluster for RBD and for RadosGW object storage with our OpenStack.

The data pools (volumes, compute, images) are fine, but the .rgw pool uses 101MB of 
space and has _462446_ objects. The average object size in this pool is just 4.2k.

If we just keep increasing the PG/PGP num for this pool, after some time we will hit 
the limit on the number of PGs per OSD.

What should we do? The object count keeps increasing, while the space grows only a little.

How do we live with a cluster holding a huge number of small objects? Any suggestions?
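
For example, checking and raising the PG count looks roughly like this (the
value 256 is only a placeholder, not a recommendation):

ceph osd pool get .rgw pg_num
ceph osd pool set .rgw pg_num 256
ceph osd pool set .rgw pgp_num 256

and, as far as I understand it, the per-OSD warning we keep running into is
governed by the "mon pg warn max per osd" setting.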

-- 
Mike, run!


Re: [ceph-users] Multiple L2 LAN segments with Ceph

2014-05-28 Thread Mike Dawson

Travis,

We run a routed ECMP spine-leaf network architecture with Ceph and have 
no issues on the network side whatsoever. Each leaf switch has an L2 
cidr block inside a common L3 supernet.


We do not currently split cluster_network and public_network. If we did, 
we'd likely build a separate spine-leaf network with its own L3 supernet.


A simple IPv4 example:

- ceph-cluster: 10.1.0.0/16
- cluster-leaf1: 10.1.1.0/24
- node1: 10.1.1.1/24
- node2: 10.1.1.2/24
- cluster-leaf2: 10.1.2.0/24

- ceph-public: 10.2.0.0/16
- public-leaf1: 10.2.1.0/24
- node1: 10.2.1.1/24
- node2: 10.2.1.2/24
- public-leaf2: 10.2.2.0/24

ceph.conf would be:

cluster_network: 10.1.0.0/255.255.0.0
public_network: 10.2.0.0/255.255.0.0
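
In plain ceph.conf syntax that would look roughly like this:

[global]
    public network  = 10.2.0.0/16
    cluster network = 10.1.0.0/16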

- Mike Dawson

On 5/28/2014 1:01 PM, Travis Rhoden wrote:

Hi folks,

Does anybody know if there are any issues running Ceph with multiple L2
LAN segments?  I'm picturing a large multi-rack/multi-row deployment
where you may give each rack (or row) it's own L2 segment, then connect
them all with L3/ECMP in a leaf-spine architecture.

I'm wondering how cluster_network (or public_network) in ceph.conf works
in this case.  Does that directive just tell a daemon starting on a
particular node which network to bind to?  Or is it a CIDR that has to be
accurate for every OSD and MON in the entire cluster?

Thanks,

  - Travis






Re: [ceph-users] Calamari Goes Open Source

2014-05-30 Thread Mike Dawson
Great work Inktank / Red Hat! An open source Calamari will be a great 
benefit to the community!


Cheers,
Mike Dawson


On 5/30/2014 6:04 PM, Patrick McGarry wrote:

Hey cephers,

Sorry to push this announcement so late on a Friday but...

Calamari has arrived!

The source code bits have been flipped, the ticket tracker has been
moved, and we have even given you a little bit of background from both
a technical and vision point of view:

Technical (ceph.com):
http://ceph.com/community/ceph-calamari-goes-open-source/

Vision (inktank.com):
http://www.inktank.com/software/future-of-calamari/

The ceph.com link should give you everything you need to know about
what tech comprises Calamari, where the source lives, and where the
discussions will take place.  If you have any questions feel free to
hit the new ceph-calamari list or stop by IRC and we'll get you
started.  Hope you all enjoy the GUI!



Best Regards,

Patrick McGarry
Director, Community || Inktank
http://ceph.com  ||  http://inktank.com
@scuttlemonkey || @ceph || @inktank




Re: [ceph-users] How to avoid deep-scrubbing performance hit?

2014-06-09 Thread Mike Dawson

Craig,

I've struggled with the same issue for quite a while. If your i/o is 
similar to mine, I believe you are on the right track. For the past 
month or so, I have been running this cronjob:


* * * * *   for strPg in `ceph pg dump | egrep 
'^[0-9]\.[0-9a-f]{1,4}' | sort -k20 | awk '{ print $1 }' | head -2`; do 
ceph pg deep-scrub $strPg; done


That roughly handles my 20672 PGs that are set to be deep-scrubbed every 
7 days. Your script may be a bit better, but this quick and dirty method 
has helped my cluster maintain more consistency.


The real key for me is to avoid the "clumpiness" I have observed without 
that hack where concurrent deep-scrubs sit at zero for a long period of 
time (despite having PGs that were months overdue for a deep-scrub), 
then concurrent deep-scrubs suddenly spike up and stay in the teens for 
hours, killing client writes/second.


The scrubbing behavior table[0] indicates that a periodic tick initiates 
scrubs on a per-PG basis. Perhaps the timing of ticks aren't 
sufficiently randomized when you restart lots of OSDs concurrently (for 
instance via pdsh).


On my cluster I suffer a significant drag on client writes/second when I 
exceed perhaps four or five concurrent PGs in deep-scrub. When 
concurrent deep-scrubs get into the teens, I get a massive drop in 
client writes/second.
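
For reference, these are the scrub knobs I'd look at for spreading the load out
(the values here are illustrative, not necessarily what I run):

[osd]
    osd max scrubs = 1                  # per-OSD limit on concurrent scrubs
    osd deep scrub interval = 604800    # once per 7 days
    osd scrub load threshold = 0.5      # skip scrubs when the host is loaded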


Greg, is there locking involved when a PG enters deep-scrub? If so, is 
the entire PG locked for the duration or is each individual object 
inside the PG locked as it is processed? Some of my PGs will be in 
deep-scrub for minutes at a time.


0: http://ceph.com/docs/master/dev/osd_internals/scrub/

Thanks,
Mike Dawson


On 6/9/2014 6:22 PM, Craig Lewis wrote:

I've correlated a large deep scrubbing operation to cluster stability
problems.

My primary cluster does a small amount of deep scrubs all the time,
spread out over the whole week.  It has no stability problems.

My secondary cluster doesn't spread them out.  It saves them up, and
tries to do all of the deep scrubs over the weekend.  The secondary
starts losing OSDs about an hour after these deep scrubs start.

To avoid this, I'm thinking of writing a script that continuously scrubs
the oldest outstanding PG.  In pseudo-bash:
# Sort by the deep-scrub timestamp, taking the single oldest PG
while ceph pg dump | awk '$1 ~ /[0-9a-f]+\.[0-9a-f]+/ {print $20, $21, $1}' \
      | sort | head -1 | read date time pg
do
    ceph pg deep-scrub ${pg}
    while ceph status | grep scrubbing+deep
    do
        sleep 5
    done
    sleep 30
done


Does anybody think this will solve my problem?

I'm also considering disabling deep-scrubbing until the secondary
finishes replicating from the primary.  Once it's caught up, the write
load should drop enough that opportunistic deep scrubs should have a
chance to run.  It should only take another week or two to catch up.






Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-28 Thread Mike Dawson

On 8/28/2014 12:23 AM, Christian Balzer wrote:

On Wed, 27 Aug 2014 13:04:48 +0200 Loic Dachary wrote:




On 27/08/2014 04:34, Christian Balzer wrote:


Hello,

On Tue, 26 Aug 2014 20:21:39 +0200 Loic Dachary wrote:


Hi Craig,

I assume the reason for the 48 hours recovery time is to keep the cost
of the cluster low ? I wrote "1h recovery time" because it is roughly
the time it would take to move 4TB over a 10Gb/s link. Could you
upgrade your hardware to reduce the recovery time to less than two
hours ? Or are there factors other than cost that prevent this ?



I doubt Craig is operating on a shoestring budget.
And even if his network were to be just GbE, that would still make it
only 10 hours according to your wishful thinking formula.

He probably has set the max_backfills to 1 because that is the level of
I/O his OSDs can handle w/o degrading cluster performance too much.
The network is unlikely to be the limiting factor.

The way I see it most Ceph clusters are in sort of steady state when
operating normally, i.e. a few hundred VM RBD images ticking over, most
actual OSD disk ops are writes, as nearly all hot objects that are
being read are in the page cache of the storage nodes.
Easy peasy.

Until something happens that breaks this routine, like a deep scrub,
all those VMs rebooting at the same time or a backfill caused by a
failed OSD. Now all of a sudden client ops compete with the backfill
ops, page caches are no longer hot, the spinners are seeking left and
right. Pandemonium.

I doubt very much that even with a SSD backed cluster you would get
away with less than 2 hours for 4TB.

To give you some real life numbers, I currently am building a new
cluster but for the time being have only one storage node to play with.
It consists of 32GB RAM, plenty of CPU oomph, 4 journal SSDs  and 8
actual OSD HDDs (3TB, 7200RPM). 90GB of (test) data on it.

So I took out one OSD (reweight 0 first, then the usual removal steps)
because the actual disk was wonky. Replaced the disk and re-added the
OSD. Both operations took about the same time, 4 minutes for
evacuating the OSD (having 7 write targets clearly helped) for measly
12GB or about 50MB/s and 5 minutes or about 35MB/s for refilling the
OSD. And that is on one node (thus no network latency) that has the
default parameters (so a max_backfill of 10) which was otherwise
totally idle.

In other words, in this pretty ideal case it would have taken 22 hours
to re-distribute 4TB.


That makes sense to me :-)

When I wrote 1h, I thought about what happens when an OSD becomes
unavailable with no planning in advance. In the scenario you describe
the risk of a data loss does not increase since the objects are evicted
gradually from the disk being decommissioned and the number of replica
stays the same at all times. There is not a sudden drop in the number of
replica  which is what I had in mind.


That may be, but I'm rather certain that there is no difference in speed
and priority of a rebalancing caused by an OSD set to weight 0 or one
being set out.


If the lost OSD was part of 100 PG, the other disks (let say 50 of them)
will start transferring a new replica of the objects they have to the
new OSD in their PG. The replacement will not be a single OSD although
nothing prevents the same OSD to be used in more than one PG as a
replacement for the lost one. If the cluster network is connected at
10Gb/s and is 50% busy at all times, that leaves 5Gb/s. Since the new
duplicates do not originate from a single OSD but from at least dozens
of them and since they target more than one OSD, I assume we can expect
an actual throughput of 5Gb/s. I should have written 2h instead of 1h to
account for the fact that the cluster network is never idle.

Am I being too optimistic ?

Vastly.


Do you see another blocking factor that
would significantly slow down recovery ?


As Craig and I keep telling you, the network is not the limiting factor.
Concurrent disk IO is, as I pointed out in the other thread.


Completely agree.

On a production cluster with OSDs backed by spindles, even with OSD 
journals on SSDs, it is insufficient to calculate single-disk 
replacement backfill time based solely on network throughput. IOPS will 
likely be the limiting factor when backfilling a single failed spinner 
in a production cluster.


Last week I replaced a 3TB 7200rpm drive that was ~75% full in a 72-osd 
cluster, 24 hosts, rbd pool with 3 replicas, osd journals on SSDs (ratio 
of 3:1), with dual 1GbE bonded NICs.


Using only the throughput math, backfill could theoretically have 
completed in a bit over 2.5 hours, but it actually took 15 hours. I've 
done this a few times with similar results.
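
(For reference, the throughput-only math: ~75% of 3TB is ~2.25TB, and dual
bonded 1GbE is at best ~250MB/s, so 2.25TB / 250MB/s is roughly 2.5 hours.)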


Why? Spindle contention on the replacement drive. Graph the '%util' 
metric from something like 'iostat -xt 2' during a single disk backfill 
to get a very clear view that spindle contention is the true limiting 
factor. It'll be pegged at or near 100%

Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-28 Thread Mike Dawson

On 8/28/2014 11:17 AM, Loic Dachary wrote:



On 28/08/2014 16:29, Mike Dawson wrote:

On 8/28/2014 12:23 AM, Christian Balzer wrote:

On Wed, 27 Aug 2014 13:04:48 +0200 Loic Dachary wrote:




On 27/08/2014 04:34, Christian Balzer wrote:


Hello,

On Tue, 26 Aug 2014 20:21:39 +0200 Loic Dachary wrote:


Hi Craig,

I assume the reason for the 48 hours recovery time is to keep the cost
of the cluster low ? I wrote "1h recovery time" because it is roughly
the time it would take to move 4TB over a 10Gb/s link. Could you
upgrade your hardware to reduce the recovery time to less than two
hours ? Or are there factors other than cost that prevent this ?



I doubt Craig is operating on a shoestring budget.
And even if his network were to be just GbE, that would still make it
only 10 hours according to your wishful thinking formula.

He probably has set the max_backfills to 1 because that is the level of
I/O his OSDs can handle w/o degrading cluster performance too much.
The network is unlikely to be the limiting factor.

The way I see it most Ceph clusters are in sort of steady state when
operating normally, i.e. a few hundred VM RBD images ticking over, most
actual OSD disk ops are writes, as nearly all hot objects that are
being read are in the page cache of the storage nodes.
Easy peasy.

Until something happens that breaks this routine, like a deep scrub,
all those VMs rebooting at the same time or a backfill caused by a
failed OSD. Now all of a sudden client ops compete with the backfill
ops, page caches are no longer hot, the spinners are seeking left and
right. Pandemonium.

I doubt very much that even with a SSD backed cluster you would get
away with less than 2 hours for 4TB.

To give you some real life numbers, I currently am building a new
cluster but for the time being have only one storage node to play with.
It consists of 32GB RAM, plenty of CPU oomph, 4 journal SSDs  and 8
actual OSD HDDs (3TB, 7200RPM). 90GB of (test) data on it.

So I took out one OSD (reweight 0 first, then the usual removal steps)
because the actual disk was wonky. Replaced the disk and re-added the
OSD. Both operations took about the same time, 4 minutes for
evacuating the OSD (having 7 write targets clearly helped) for measly
12GB or about 50MB/s and 5 minutes or about 35MB/s for refilling the
OSD. And that is on one node (thus no network latency) that has the
default parameters (so a max_backfill of 10) which was otherwise
totally idle.

In other words, in this pretty ideal case it would have taken 22 hours
to re-distribute 4TB.


That makes sense to me :-)

When I wrote 1h, I thought about what happens when an OSD becomes
unavailable with no planning in advance. In the scenario you describe
the risk of a data loss does not increase since the objects are evicted
gradually from the disk being decommissioned and the number of replica
stays the same at all times. There is not a sudden drop in the number of
replica  which is what I had in mind.


That may be, but I'm rather certain that there is no difference in speed
and priority of a rebalancing caused by an OSD set to weight 0 or one
being set out.


If the lost OSD was part of 100 PG, the other disks (let say 50 of them)
will start transferring a new replica of the objects they have to the
new OSD in their PG. The replacement will not be a single OSD although
nothing prevents the same OSD to be used in more than one PG as a
replacement for the lost one. If the cluster network is connected at
10Gb/s and is 50% busy at all times, that leaves 5Gb/s. Since the new
duplicates do not originate from a single OSD but from at least dozens
of them and since they target more than one OSD, I assume we can expect
an actual throughput of 5Gb/s. I should have written 2h instead of 1h to
account for the fact that the cluster network is never idle.

Am I being too optimistic ?

Vastly.


Do you see another blocking factor that
would significantly slow down recovery ?


As Craig and I keep telling you, the network is not the limiting factor.
Concurrent disk IO is, as I pointed out in the other thread.


Completely agree.

On a production cluster with OSDs backed by spindles, even with OSD journals on 
SSDs, it is insufficient to calculate single-disk replacement backfill time 
based solely on network throughput. IOPS will likely be the limiting factor 
when backfilling a single failed spinner in a production cluster.

Last week I replaced a 3TB 7200rpm drive that was ~75% full in a 72-osd 
cluster, 24 hosts, rbd pool with 3 replicas, osd journals on SSDs (ratio of 
3:1), with dual 1GbE bonded NICs.

Using only the throughput math, backfill could theoretically have completed in 
a bit over 2.5 hours, but it actually took 15 hours. I've done this a few times 
with similar results.

Why? Spindle contention on the replacement drive. Graph the '%util' metric from 
something like 'iostat -xt 2' during a single disk backfill to get a very clear 
view that

Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-28 Thread Mike Dawson


On 8/28/2014 4:17 PM, Craig Lewis wrote:

My initial experience was similar to Mike's, causing a similar level of
paranoia.  :-)  I'm dealing with RadosGW though, so I can tolerate
higher latencies.

I was running my cluster with noout and nodown set for weeks at a time.


I'm sure Craig will agree, but wanted to add this for other readers:

I find value in the noout flag for temporary intervention, but prefer to 
set "mon osd down out interval" for dealing with events that may occur 
in the future to give an operator time to intervene.
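
For example (the 4-hour value below is just what we happen to use, mentioned
further down in this thread):

# temporary, operator-driven intervention around maintenance:
ceph osd set noout
# ... do the work ...
ceph osd unset noout

# standing policy in ceph.conf, giving operators time before Ceph marks OSDs out:
[mon]
    mon osd down out interval = 14400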


The nodown flag is another beast altogether. The nodown flag tends to be 
*a bad thing* when attempting to provide reliable client io. For our use 
case, we want OSDs to be marked down quickly if they are in fact 
unavailable for any reason, so client io doesn't hang waiting for them.


If OSDs are flapping during recovery (i.e. the "wrongly marked me down" 
log messages), I've found far superior results by tuning the recovery 
knobs than by permanently setting the nodown flag.
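
The recovery knobs I mean are along these lines (values illustrative only):

[osd]
    osd max backfills = 1
    osd recovery max active = 1
    osd recovery op priority = 1
    osd client op priority = 63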


- Mike



  Recovery of a single OSD might cause other OSDs to crash. In the
primary cluster, I was always able to get it under control before it
cascaded too wide.  In my secondary cluster, it did spiral out to 40% of
the OSDs, with 2-5 OSDs down at any time.






I traced my problems to a combination of osd max backfills being too high
for my cluster and my mkfs.xfs arguments causing memory starvation
issues.  I lowered osd max backfills, added SSD journals,
and reformatted every OSD with better mkfs.xfs arguments.  Now both
clusters are stable, and I don't want to break it.

I only have 45 OSDs, so the risk with a 24-48 hours recovery time is
acceptable to me.  It will be a problem as I scale up, but scaling up
will also help with the latency problems.




On Thu, Aug 28, 2014 at 10:38 AM, Mike Dawson <mike.daw...@cloudapt.com> wrote:


We use 3x replication and have drives that have relatively high
steady-state IOPS. Therefore, we tend to prioritize client-side IO
more than a reduction from 3 copies to 2 during the loss of one
disk. The disruption to client io is so great on our cluster, we
don't want our cluster to be in a recovery state without
operator-supervision.

Letting OSDs get marked out without operator intervention was a
disaster in the early going of our cluster. For example, an OSD
daemon crash would trigger automatic recovery where it was unneeded.
Ironically, often times the unneeded recovery would often trigger
additional daemons to crash, making a bad situation worse. During
the recovery, rbd client io would often times go to 0.

To deal with this issue, we set "mon osd down out interval = 14400",
so as operators we have 4 hours to intervene before Ceph attempts to
self-heal. When hardware is at fault, we remove the osd, replace the
drive, re-add the osd, then allow backfill to begin, thereby
completely skipping step B in your timeline above.

- Mike







Re: [ceph-users] ISCSI LIO hang after 2-3 days of working

2015-02-05 Thread Mike Christie
Not sure if there are multiple problems.

On 02/05/2015 04:46 AM, reistlin87 wrote:
> Feb  3 13:17:01 is-k13bi32e2s6vdi-002 CRON[10237]: (root) CMD (   cd / && 
> run-parts --report /etc/cron.hourly)
> Feb  3 13:25:01 is-k13bi32e2s6vdi-002 CRON[10242]: (root) CMD (command -v 
> debian-sa1 > /dev/null && debian-sa1 1 1)
> Feb  3 13:35:01 is-k13bi32e2s6vdi-002 CRON[10247]: (root) CMD (command -v 
> debian-sa1 > /dev/null && debian-sa1 1 1)
> Feb  3 13:45:01 is-k13bi32e2s6vdi-002 CRON[10252]: (root) CMD (command -v 
> debian-sa1 > /dev/null && debian-sa1 1 1)
> Feb  3 13:55:01 is-k13bi32e2s6vdi-002 CRON[10258]: (root) CMD (command -v 
> debian-sa1 > /dev/null && debian-sa1 1 1)
> Feb  3 14:02:48 is-k13bi32e2s6vdi-002 kernel: [699523.553713] Detected 
> MISCOMPARE for addr: 88007989d000 buf: 88007b1ac800
> Feb  3 14:02:48 is-k13bi32e2s6vdi-002 kernel: [699523.553969] Target/iblock: 
> Send MISCOMPARE check condition and sense

For some reason COMPARE_AND_WRITE/ATS failed indicating data
miscompared. I am not sure if that is related to the hang or not.


> Feb  3 14:05:01 is-k13bi32e2s6vdi-002 CRON[10263]: (root) CMD (command -v 
> debian-sa1 > /dev/null && debian-sa1 1 1)
> Feb  3 14:05:17 is-k13bi32e2s6vdi-002 kernel: [699672.627202] ABORT_TASK: 
> Found referenced iSCSI task_tag: 5216104

..

A couple minutes later we see commands have timed out and ESXi is trying
to abort them to figure out what is going on and clean them up.


> Feb  3 14:22:17 is-k13bi32e2s6vdi-002 kernel: [700693.927187] iSCSI Login 
> timeout on Network Portal 10.1.1.7:3260

That did not go well. ESXi did not get the responses it wanted and it
looks like it escalated its error handling process and is trying to
relogin. This hangs, because it looks like LIO is waiting for commands
it has sent to RBD to complete.



> Feb  3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.613521] NMI watchdog: 
> BUG: soft lockup - CPU#1 stuck for 22s! [iscsi_ttx:2002]
> Feb  3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.613764] Modules linked 
> in: ib_srpt(E) ib_cm(E) ib_sa(E) ib_mad(E) ib_core(E) ib_addr(E) 
> tcm_qla2xxx(E) qla2xxx(E) tcm_loop(
> E) tcm_fc(E) libfc(E) scsi_transport_fc(E) iscsi_target_mod(E) 
> target_core_pscsi(E) target_core_file(E) target_core_iblock(E) 
> target_core_mod(E) configfs(E) rbd(E) libceph(E) li
> bcrc32c(E) ppdev(E) coretemp(E) microcode(E) serio_raw(E) i2c_piix4(E) 
> shpchp(E) parport_pc(E) lp(E) parport(E) psmouse(E) floppy(E) vmw_pvscsi(E) 
> vmxnet3(E)
> Feb  3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.614912] irq event 
> stamp: 24644920
> Feb  3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.615021] hardirqs last  
> enabled at (24644919): [] __local_bh_enable_ip+0x6d/0xd0
> Feb  3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.615307] hardirqs last 
> disabled at (24644920): [] _raw_spin_lock_irq+0x17/0x50
> Feb  3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.615560] softirqs last  
> enabled at (24644918): [] 
> iscsit_conn_all_queues_empty+0x72/0x90 [iscsi_target_mod]
> Feb  3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.615904] softirqs last 
> disabled at (24644916): [] 
> iscsit_conn_all_queues_empty+0x58/0x90 [iscsi_target_mod]
> Feb  3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.616212] CPU: 1 PID: 
> 2002 Comm: iscsi_ttx Tainted: GE  3.18.0-ceph #1
> Feb  3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.616429] Hardware name: 
> VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 
> 6.00 07/09/2012
> Feb  3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.616705] task: 
> 88007a9921d0 ti: 88007a99c000 task.ti: 88007a99c000
> Feb  3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.616916] RIP: 
> 0010:[]  [] 
> _raw_spin_unlock_irqrestore+0x41/0x70
> Feb  3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.617173] RSP: 
> 0018:88007a99fc78  EFLAGS: 0282
> Feb  3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.617311] RAX: 
>  RBX: 0001 RCX: 47e347e2
> Feb  3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.617516] RDX: 
> 47e3 RSI: 0001 RDI: 8173cf8f
> Feb  3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.617720] RBP: 
> 88007a99fc88 R08: 0001 R09: 
> Feb  3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.617924] R10: 
>  R11:  R12: a01137bb
> Feb  3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.618128] R13: 
>  R14: 0046 R15: 88007a99fc58
> Feb  3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.618445] FS:  
> () GS:88007fd0() knlGS:
> Feb  3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.618669] CS:  0010 DS: 
>  ES:  CR0: 8005003b
> Feb  3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.618817] CR2: 
> 7f30e19f2000 CR3: 36bfa000 CR4: 07e0
> Feb  3 14:22:30 is-k13bi32e2s6vdi-002

Re: [ceph-users] rbd: I/O Errors in low memory situations

2015-02-19 Thread Mike Christie
On 02/18/2015 06:05 PM, "Sebastian Köhler [Alfahosting GmbH]" wrote:
> Hi,
> 
> yesterday we had had the problem that one of our cluster clients
> remounted a rbd device in read-only mode. We found this[1] stack trace
> in the logs. We investigated further and found similar traces on all
> other machines that are using the rbd kernel module. It seems to me that
> whenever there is a swapping situation on a client those I/O errors occur.
> Is there anything we can do or is this something that needs to be fixed
> in the code?

Hi,

I was looking at that code the other day and was thinking rbd.c might
need some changes.

1. We cannot use GFP_KERNEL in the main IO path (requests that are sent
down rbd_request_fn and related helper IO), because the allocation could
come back on rbd_request_fn.
2. We should use GFP_NOIO instead of GFP_ATOMIC if we have the proper
context and are not holding a spin lock.
3. We should be using a mempool or preallocate enough mem, so we can
make forward progress on at least one IO at a time.

I started to make the attached patch (attached version is built over
linus's tree today). I think it can be further refined, so we pass in
the gfp_t to some functions, because I think in some cases we could use
GFP_KERNEL and/or we do not need to use the mempool. For example, I do
not think we could use GFP_KERNEL and not use the mempool in the
rbd_obj_watch_request_helper code paths.

I was not done with evaluating all the paths, so had not yet posted it.
Patch is not tested.

Hey Ilya, I was not sure about the layered related code. I thought
functions like rbd_img_obj_parent_read_full could get called as a result
of a IO getting sent down the rbd_request_fn, but was not 100% sure. I
meant to test it out, but have been busy with other stuff.
[PATCH] ceph/rbd: use GFP_NOIO and mempool

1. We cannot use GFP_KERNEL in the main IO path, because it could come back on us.
2. We should use GFP_NOIO instead of GFP_ATOMIC if we have the proper context and are not holding a spin lock.
3. We should be using a mempool or preallocate enough mem, so we can make forward progress on at least one IO at a time.


diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 8a86b62..c01ecaf 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1915,8 +1915,8 @@ static struct ceph_osd_request *rbd_osd_req_create(
 	/* Allocate and initialize the request, for the num_ops ops */
 
 	osdc = &rbd_dev->rbd_client->client->osdc;
-	osd_req = ceph_osdc_alloc_request(osdc, snapc, num_ops, false,
-	  GFP_ATOMIC);
+	osd_req = ceph_osdc_alloc_request(osdc, snapc, num_ops, true,
+	  GFP_NOIO);
 	if (!osd_req)
 		return NULL;	/* ENOMEM */
 
@@ -1998,11 +1998,11 @@ static struct rbd_obj_request *rbd_obj_request_create(const char *object_name,
 	rbd_assert(obj_request_type_valid(type));
 
 	size = strlen(object_name) + 1;
-	name = kmalloc(size, GFP_KERNEL);
+	name = kmalloc(size, GFP_NOIO);
 	if (!name)
 		return NULL;
 
-	obj_request = kmem_cache_zalloc(rbd_obj_request_cache, GFP_KERNEL);
+	obj_request = kmem_cache_zalloc(rbd_obj_request_cache, GFP_NOIO);
 	if (!obj_request) {
 		kfree(name);
 		return NULL;
@@ -2456,7 +2456,7 @@ static int rbd_img_request_fill(struct rbd_img_request *img_request,
 	bio_chain_clone_range(&bio_list,
 &bio_offset,
 clone_size,
-GFP_ATOMIC);
+GFP_NOIO);
 			if (!obj_request->bio_list)
 goto out_unwind;
 		} else if (type == OBJ_REQUEST_PAGES) {
@@ -2687,7 +2687,7 @@ static int rbd_img_obj_parent_read_full(struct rbd_obj_request *obj_request)
 	 * from the parent.
 	 */
 	page_count = (u32)calc_pages_for(0, length);
-	pages = ceph_alloc_page_vector(page_count, GFP_KERNEL);
+	pages = ceph_alloc_page_vector(page_count, GFP_NOIO);
 	if (IS_ERR(pages)) {
 		result = PTR_ERR(pages);
 		pages = NULL;
@@ -2814,7 +2814,7 @@ static int rbd_img_obj_exists_submit(struct rbd_obj_request *obj_request)
 	 */
 	size = sizeof (__le64) + sizeof (__le32) + sizeof (__le32);
 	page_count = (u32)calc_pages_for(0, size);
-	pages = ceph_alloc_page_vector(page_count, GFP_KERNEL);
+	pages = ceph_alloc_page_vector(page_count, GFP_NOIO);
 	if (IS_ERR(pages))
 		return PTR_ERR(pages);
 


Re: [ceph-users] tgt and krbd

2015-03-06 Thread Mike Christie
On 03/06/2015 06:51 AM, Jake Young wrote:
> 
> 
> On Thursday, March 5, 2015, Nick Fisk  > wrote:
> 
> Hi All,
> 
> Just a heads up after a day's experimentation.
> 
> I believe tgt with its default settings has a small write cache when
> exporting a kernel mapped RBD. Doing some write tests I saw 4 times
> the write throughput when using tgt aio + krbd compared to tgt with
> the builtin librbd.
> 
> After running the following command against the LUN, which
> apparently disables write cache, performance dropped back to what I
> am seeing using tgt+librbd and also the same as fio.
> 
> tgtadm --op update --mode logicalunit --tid 2 --lun 3 -P
> mode_page=8:0:18:0x10:0:0xff:0xff:0:0:0xff:0xff:0xff:0xff:0x80:0x14:0:0:0:0:0:0
> 
> From that I can only deduce that using tgt + krbd in its default
> state is not 100% safe to use, especially in an HA environment.
> 
> Nick
> 
> 
> 
> 
> Hey Nick,
> 
> tgt actually does not have any caches. No read, no write.  tgt's design
> is to passthrough all commands to the backend as efficiently as possible. 
> 
> http://lists.wpkg.org/pipermail/stgt/2013-May/005788.html
> 



tgt itself does not do any type of caching, but depending on how you
have tgt access the underlying block device you might end up using the
normal old linux page cache like you would if you did

dd if=/dev/rbd0 of=/dev/null bs=4K count=1
dd if=/dev/rbd0 of=/dev/null bs=4K count=1

This is what Ronnie meant in that thread when he was saying there might
be caching in the underlying device.

If you use tgt bs_rdwr.c (--bstype=rdwr) with the default settings and
with krbd then you will end up doing caching, because the krbd's block
device will be accessed like in the dd example above (no direct bits set).

You can tell tgt bs_rdwr devices to use O_DIRECT or O_SYNC. When you
create the lun pass in the "--bsoflags {direct | sync }". Here is an
example from the man page:

tgtadm --lld iscsi --op new --mode logicalunit --tid 1 --lun 1
--bsoflags="sync" --backing-store=/data/100m_image.raw


If you use bs_aio.c then we always set O_DIRECT when opening the krbd
device, so no page caching is done. I think linux aio might require this
or at least it did at the time it was written.
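
For comparison, an aio-backed LUN is created the same way as the rdwr example
above, just with a different bstype (the tid/lun numbers and the rbd device
path here are only examples):

tgtadm --lld iscsi --op new --mode logicalunit --tid 1 --lun 1 --bstype=aio --backing-store=/dev/rbd0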

Also the cache settings exported to the other OS's initiator with that
modepage command might affect performance then too. It might change how
that OS does writes like send cache syncs down or do some sort of
barrier or FUA.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] tgt and krbd

2015-03-15 Thread Mike Christie
On 03/09/2015 11:15 AM, Nick Fisk wrote:
> Hi Mike,
> 
> I was using bs_aio with the krbd and still saw a small caching effect. I'm
> not sure if it was on the ESXi or tgt/krbd page cache side, but I was
> definitely seeing the IO's being coalesced into larger ones on the krbd

I am not sure what you mean here. By coalescing you mean merging right?
That is not the same as caching. Coalescing/merging is expected for both
aio and rdwr.


> device in iostat. Either way, it would make me potentially nervous to run it
> like that in a HA setup.
> 
> 
>> tgt itself does not do any type of caching, but depending on how you have
>> tgt access the underlying block device you might end up using the normal
> old
>> linux page cache like you would if you did
>>
>> dd if=/dev/rbd0 of=/dev/null bs=4K count=1 dd if=/dev/rbd0 of=/dev/null
>> bs=4K count=1
>>
>> This is what Ronnie meant in that thread when he was saying there might be
>> caching in the underlying device.
>>
>> If you use tgt bs_rdwr.c (--bstype=rdwr) with the default settings and
> with
>> krbd then you will end up doing caching, because the krbd's block device
> will
>> be accessed like in the dd example above (no direct bits set).
>>
>> You can tell tgt bs_rdwr devices to use O_DIRECT or O_SYNC. When you
>> create the lun pass in the "--bsoflags {direct | sync }". Here is an
> example
>> from the man page:
>>
>> tgtadm --lld iscsi --op new --mode logicalunit --tid 1 --lun 1
> --bsoflags="sync" -
>> -backing-store=/data/100m_image.raw
>>
>>
>> If you use bs_aio.c then we always set O_DIRECT when opening the krbd
>> device, so no page caching is done. I think linux aio might require this
> or at
>> least it did at the time it was written.
>>
>> Also the cache settings exported to the other OS's initiator with that
>> modepage command might affect performance then too. It might change
>> how that OS does writes like send cache syncs down or do some sort of
>> barrier or FUA.
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] tgt and krbd

2015-03-15 Thread Mike Christie
On 03/15/2015 07:54 PM, Mike Christie wrote:
> On 03/09/2015 11:15 AM, Nick Fisk wrote:
>> Hi Mike,
>>
>> I was using bs_aio with the krbd and still saw a small caching effect. I'm
>> not sure if it was on the ESXi or tgt/krbd page cache side, but I was
>> definitely seeing the IO's being coalesced into larger ones on the krbd
> 
> I am not sure what you mean here. By coalescing you mean merging right?
> That is not the same as caching. Coalescing/merging is expected for both
> aio and rdwr.

For being able to see caching with aio though, I think you are right and
there might be a case where it can use buffered writes even when using
O_DIRECT to the rbd device. I am not too familiar with that code, so let
me ping a person that works here and get back to the list.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] tgt and krbd

2015-03-17 Thread Mike Christie
On 03/15/2015 08:42 PM, Mike Christie wrote:
> On 03/15/2015 07:54 PM, Mike Christie wrote:
>> On 03/09/2015 11:15 AM, Nick Fisk wrote:
>>> Hi Mike,
>>>
>>> I was using bs_aio with the krbd and still saw a small caching effect. I'm
>>> not sure if it was on the ESXi or tgt/krbd page cache side, but I was
>>> definitely seeing the IO's being coalesced into larger ones on the krbd
>>
>> I am not sure what you mean here. By coalescing you mean merging right?
>> That is not the same as caching. Coalescing/merging is expected for both
>> aio and rdwr.
> 
> For being able to see caching with aio though, I think you are right and
> there might be a case where it can use buffered writes even when using
> O_DIRECT to the rbd device. I am not too familiar with that code, so let
> me ping a person that works here and get back to the list.

I talked to the AIO person here and confirmed that the code can drop
down to buffered writes. However, if that happens it will then make it
still look like O_DIRECT by making sure writes are written back and
pages are invalidated.

If you want I can send you a patch to also do O_SYNC when doing AIO, so
we can make sure flushes/barriers are also done.

Send me the iometer workload when you get a chance, so I can test it out
here.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph migration to AWS

2015-05-04 Thread Mike Travis
To those interested in a tricky problem,

We have a Ceph cluster running at one of our data centers. One of our
client's requirements is to have them hosted at AWS. My question is: How do
we effectively migrate our data from our internal Ceph cluster to an AWS Ceph
cluster?

Ideas currently on the table:

1. Build OSDs at AWS and add them to our current Ceph cluster. Build quorum
at AWS then sever the connection between AWS and our data center.

2. Build a Ceph cluster at AWS and send snapshots from our data center to
our AWS cluster allowing us to "migrate" to AWS.
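
(Concretely, I imagine option 2 boils down to rbd snapshot shipping along
these lines; pool/image/host names are just placeholders:

rbd snap create volumes/vm1@migrate1
rbd export-diff volumes/vm1@migrate1 - | ssh aws-gateway rbd import-diff - volumes/vm1
)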

Is this a good idea? Suggestions? Has anyone done something like this
before?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Discuss: New default recovery config settings

2015-06-04 Thread Mike Dawson

With a write-heavy RBD workload, I add the following to ceph.conf:

osd_max_backfills = 2
osd_recovery_max_active = 2

If things are going well during recovery (i.e. guests happy and no slow 
requests), I will often bump both up to three:


# ceph tell osd.* injectargs '--osd-max-backfills 3 
--osd-recovery-max-active 3'


If I see slow requests, I drop them down.
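
Dropping back down is just the same injectargs command with lower values,
e.g.:

# ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'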

The biggest downside to setting either to 1 seems to be the long tail 
issue detailed in:


http://tracker.ceph.com/issues/9566

Thanks,
Mike Dawson


On 6/3/2015 6:44 PM, Sage Weil wrote:

On Mon, 1 Jun 2015, Gregory Farnum wrote:

On Mon, Jun 1, 2015 at 6:39 PM, Paul Von-Stamwitz
 wrote:

On Fri, May 29, 2015 at 4:18 PM, Gregory Farnum  wrote:

On Fri, May 29, 2015 at 2:47 PM, Samuel Just  wrote:

Many people have reported that they need to lower the osd recovery config 
options to minimize the impact of recovery on client io.  We are talking about 
changing the defaults as follows:

osd_max_backfills to 1 (from 10)
osd_recovery_max_active to 3 (from 15)
osd_recovery_op_priority to 1 (from 10)
osd_recovery_max_single_start to 1 (from 5)


I'm under the (possibly erroneous) impression that reducing the number of max 
backfills doesn't actually reduce recovery speed much (but will reduce memory 
use), but that dropping the op priority can. I'd rather we make users manually 
adjust values which can have a material impact on their data safety, even if 
most of them choose to do so.

After all, even under our worst behavior we're still doing a lot better than a 
resilvering RAID array. ;) -Greg
--



Greg,
When we set...

osd recovery max active = 1
osd max backfills = 1

We see rebalance times go down by more than half and client write performance 
increase significantly while rebalancing. We initially played with these 
settings to improve client IO expecting recovery time to get worse, but we got 
a 2-for-1.
This was with firefly using replication, downing an entire node with lots of 
SAS drives. We left osd_recovery_threads, osd_recovery_op_priority, and 
osd_recovery_max_single_start default.

We dropped osd_recovery_max_active and osd_max_backfills together. If you're 
right, do you think osd_recovery_max_active=1 is primary reason for the 
improvement? (higher osd_max_backfills helps recovery time with erasure coding.)


Well, recovery max active and max backfills are similar in many ways.
Both are about moving data into a new or outdated copy of the PG; the
difference is that recovery refers to our log-based recovery (where we
compare the PG logs and move over the objects which have changed)
whereas backfill requires us to incrementally move through the entire
PG's hash space and compare.
I suspect dropping down max backfills is more important than reducing
max recovery (gathering recovery metadata happens largely in memory)
but I don't really know either way.

My comment was meant to convey that I'd prefer we not reduce the
recovery op priority levels. :)


We could make a less extreme move than to 1, but IMO we have to reduce it
one way or another.  Every major operator I've talked to does this, our PS
folks have been recommending it for years, and I've yet to see a single
complaint about recovery times... meanwhile we're drowning in a sea of
complaints about the impact on clients.

How about

  osd_max_backfills to 1 (from 10)
  osd_recovery_max_active to 3 (from 15)
  osd_recovery_op_priority to 3 (from 10)
  osd_recovery_max_single_start to 1 (from 5)

(same as above, but 1/3rd the recovery op prio instead of 1/10th)
?
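
(For anyone who wants to try these values on a running cluster before the
defaults change, runtime injection works, e.g.:

ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 3 --osd-recovery-op-priority 3 --osd-recovery-max-single-start 1'

or the equivalent lines under [osd] in ceph.conf.)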

sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] .New Ceph cluster - cannot add additional monitor

2015-06-09 Thread Mike Carlson
ldn't
decrypt with error: error decoding block for decryption
2015-06-09 11:33:24.661478 7fef2a806700  0 -- 10.5.68.229:6789/0 >>
10.5.68.236:6789/0 pipe(0x3571000 sd=13 :40912 s=1 pgs=0 cs=0 l=0
c=0x34083c0).failed verifying authorize reply
2015-06-09 11:33:24.763579 7fef2eb83700  0 log_channel(audit) log [DBG] :
from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch
2015-06-09 11:33:24.763651 7fef2eb83700  0 log_channel(audit) log [DBG] :
from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished
2015-06-09 11:33:25.825163 7fef2eb83700  0 log_channel(audit) log [DBG] :
from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch
2015-06-09 11:33:25.825259 7fef2eb83700  0 log_channel(audit) log [DBG] :
from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished
2015-06-09 11:33:26.661737 7fef2a806700  0 cephx: verify_reply couldn't
decrypt with error: error decoding block for decryption
2015-06-09 11:33:26.661750 7fef2a806700  0 -- 10.5.68.229:6789/0 >>
10.5.68.236:6789/0 pipe(0x3571000 sd=13 :40914 s=1 pgs=0 cs=0 l=0
c=0x34083c0).failed verifying authorize reply
2015-06-09 11:33:26.887973 7fef2eb83700  0 log_channel(audit) log [DBG] :
from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch
2015-06-09 11:33:26.888047 7fef2eb83700  0 log_channel(audit) log [DBG] :
from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished
2015-06-09 11:33:27.950014 7fef2eb83700  0 log_channel(audit) log [DBG] :
from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch
2015-06-09 11:33:27.950113 7fef2eb83700  0 log_channel(audit) log [DBG] :
from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished


All of our google searching seems to indicate that there may be a clock
skew, but the clocks are matched within .001 seconds.


Any assistance is much appreciated, thanks,

Mike C
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] .New Ceph cluster - cannot add additional monitor

2015-06-14 Thread Mike Carlson
Thank you for the reply Alex, I'm going to check into that and see if it
helps resolve the issue.

Mike C

On Fri, Jun 12, 2015 at 11:57 PM, Alex Muntada  wrote:

> We've recently found similar problems creating a new cluster over an older
> one, even after using "ceph-deploy purge", because some of the data
> remained on /var/lib/ceph/*/* (ubuntu trusty) and the nodes were trying to
> use old keyrings.
>
> Hope it helps,
> Alex
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] .New Ceph cluster - cannot add additional monitor

2015-06-17 Thread Mike Carlson
Just to follow up, I started from scratch, and I think the key was to run
ceph-deploy purge (nodes), ceph-deploy purgedata (nodes) and finally
ceph-deploy forgetkeys.
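
For reference, with placeholder hostnames node1..node3 that sequence is:

ceph-deploy purge node1 node2 node3
ceph-deploy purgedata node1 node2 node3
ceph-deploy forgetkeys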

Thanks for the replies Alex and Alex!
Mike C
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What is this HEALTH_WARN indicating?

2013-07-08 Thread Mike Bryant
Run "ceph health detail" and it should give you more information.
(I'd guess an osd or mon has a full hard disk)
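
If it does turn out to be disk space, a quick check on each node is simply
(paths assume the default layout):

df -h /var/lib/ceph /var/log/ceph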

Cheers
Mike

On 8 July 2013 21:16, Jordi Llonch  wrote:
> Hello,
>
> I am testing ceph using ubuntu raring with ceph version 0.61.4
> (1669132fcfc27d0c0b5e5bb93ade59d147e23404) on 3 virtualbox nodes.
>
> What is this HEALTH_WARN indicating?
>
> # ceph -s
>health HEALTH_WARN
>monmap e3: 3 mons at
> {node1=192.168.56.191:6789/0,node2=192.168.56.192:6789/0,node3=192.168.56.193:6789/0},
> election epoch 52, quorum 0,1,2 node1,node2,node3
>osdmap e84: 3 osds: 3 up, 3 in
> pgmap v3209: 192 pgs: 192 active+clean; 460 MB data, 1112 MB used, 135
> GB / 136 GB avail
>mdsmap e37: 1/1/1 up {0=node3=up:active}, 1 up:standby
>
>
> Thanks,
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Mike Bryant | Systems Administrator | Ocado Technology
mike.bry...@ocado.com | 01707 382148 | www.ocadotechnology.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph & hbase:

2013-07-17 Thread Mike Bryant
Yup, that was me.
We have hbase working here.
You'll want to disable localized reads, as per bug #5388. That bug
will cause your regionservers to crash fairly often when doing
compaction.
You'll also want to restart each of the regionservers and masters
often (We're doing it once a day) to mitigate the effects of bug
#5039, being that your data pool is growing much faster than you might
expect, and it being much larger than the visible filesize in cephfs.

With those workarounds in place we're running a stable install of
openTSDB on top of hbase.

Mike


On 17 July 2013 23:47, Noah Watkins  wrote:
> Yeh, check if the merge removed createNonRecursive. I specifically
> remember adding that function for someone on the mailing list that was
> trying to get HBase working.
>
> http://tracker.ceph.com/issues/4555
>
> On Wed, Jul 17, 2013 at 3:41 PM, ker can  wrote:
>> this is probably something i introduced in my private version ... when i
>> merged the 1.0 branch with the hadoop-topo branch.  Let me fix this and try
>> again.
>>
>>
>> On Wed, Jul 17, 2013 at 5:35 PM, ker can  wrote:
>>>
>>> Some more from lastIOE.printStackTrace():
>>>
>>> Caused by: java.io.IOException: java.io.IOException: createNonRecursive
>>> unsupported for this filesystem class
>>> org.apache.hadoop.fs.ceph.CephFileSystem
>>> at
>>> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.init(SequenceFileLogWriter.java:175)
>>> at
>>> org.apache.hadoop.hbase.regionserver.wal.HLog.createWriter(HLog.java:741)
>>> ... 17 more
>>> Caused by: java.io.IOException: createNonRecursive unsupported for this
>>> filesystem class org.apache.hadoop.fs.ceph.CephFileSystem
>>> at
>>> org.apache.hadoop.fs.FileSystem.createNonRecursive(FileSystem.java:626)
>>> at
>>> org.apache.hadoop.fs.FileSystem.createNonRecursive(FileSystem.java:601)
>>> at
>>> org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:442)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> at
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>> at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>> at java.lang.reflect.Method.invoke(Method.java:597)
>>> at
>>> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.init(SequenceFileLogWriter.java:156)
>>> ... 18 more
>>>
>>>
>>>
>>> On Wed, Jul 17, 2013 at 1:49 PM, Noah Watkins 
>>> wrote:
>>>>
>>>> On Wed, Jul 17, 2013 at 11:07 AM, ker can  wrote:
>>>> > Hi,
>>>> >
>>>> > Has anyone got hbase working on ceph ? I've got ceph (cuttlefish) and
>>>> > hbase-0.94.9.
>>>> > My setup is erroring out looking for getDefaultReplication &
>>>> > getDefaultBlockSize ... but I can see those defined in
>>>> > core/org/apache/hadoop/fs/ceph/CephFileSystem.java
>>>>
>>>> It looks like HBase is giving up trying to create a writer... The
>>>> lastIOE is probably an exception generated by CephFileSystem. Would it
>>>> be possible to add some debug info so we could see where the exception
>>>> is coming from? This try block seems to be masking that.
>>>>
>>>>   public Writer createWriter(FileSystem fs, Configuration conf, Path
>>>> hlogFile) throws IOException {
>>>> int i = 0;
>>>> IOException lastIOE = null;
>>>> do {
>>>>   try {
>>>> return HLog.createWriter(fs, hlogFile, conf);
>>>>   } catch (IOException ioe) {
>>>> lastIOE = ioe;
>>>> sleepBeforeRetry("Create Writer", i+1);
>>>>   }
>>>> } while (++i <= hdfsClientRetriesNumber);
>>>> throw new IOException("Exception in createWriter", lastIOE);
>>>
>>>
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Mike Bryant | Systems Administrator | Ocado Technology
mike.bry...@ocado.com | 01707 382148 | www.ocadotechnology.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unclean PGs in active+degrared or active+remapped

2013-07-19 Thread Mike Lowe
I'm by no means an expert, but from what I understand you do need to stick to 
numbering from zero if you want things to work out in the long term.  Is there 
a chance that the cluster didn't finish bringing things back up to full 
replication before osd's were removed?  

If I were moving from 0,1 to 2,3 I'd bring both 2 and 3 up, set the weight of 
0,1 to zero and let all of the pg's get active+clean again then remove 0,1.  
Doing your swap I might bring up 2 under rack az2, set 1 to weight 0, stop 1 
after getting active+clean and remake what is now 3 as 1 and bring it back in 
as 1 with full weight, then finally drop 2 to weight zero and remove after 
active+clean.  I'd follow on doing a similar shuffle for the now inactive 
former osd 1 the current osd 0 and the future osd 0 which was osd 2.  Clear as 
mud?
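
In command form, the weight-to-zero-then-remove step looks roughly like this
(the osd number is only for illustration):

ceph osd crush reweight osd.1 0
# wait until all pgs are active+clean again, then:
ceph osd out 1
ceph osd crush remove osd.1
ceph auth del osd.1
ceph osd rm 1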


On Jul 19, 2013, at 7:03 PM, Pawel Veselov  wrote:

> On Fri, Jul 19, 2013 at 3:54 PM, Mike Lowe  wrote:
> I'm not sure how to get you out of the situation you are in but what you have 
> in your crush map is osd 2 and osd 3 but ceph starts counting from 0 so I'm 
> guessing it's probably gotten confused.  Some history on your cluster might 
> give somebody an idea for a fix. 
> 
> We had osd.0 and osd.1 first, then we added osd.2. We then removed osd.1, 
> added osd.3 and removed osd.0.
> Do you think that adding back a new osd.0 and osd.1, and then removing osd.2 
> and osd.3 will solve that confusion? I'm a bit concerned that proper osd 
> numbering is required to maintain a healthy cluster...
>  
>  
> On Jul 19, 2013, at 6:44 PM, Pawel Veselov  wrote:
> 
>> Hi.
>> 
>> I'm trying to understand the reason behind some of my unclean PGs, after 
>> moving some OSDs around. Any help would be greatly appreciated. I'm sure we 
>> are missing something, but can't quite figure out what.
>> 
>> [root@ip-10-16-43-12 ec2-user]# ceph health detail
>> HEALTH_WARN 29 pgs degraded; 68 pgs stuck unclean; recovery 4071/217370 
>> degraded (1.873%)
>> pg 0.50 is stuck unclean since forever, current state active+degraded, last 
>> acting [2]
>> ...
>> pg 2.4b is stuck unclean for 836.989336, current state active+remapped, last 
>> acting [3,2]
>> ...
>> pg 0.6 is active+degraded, acting [3]
>> 
>> These are distinct examples of problems. There are a total of 676 placement groups.
>> Query shows pretty much the same on them: .
>> 
>> crush map: http://pastebin.com/4Hkkgau6
>> There are some pg_temps (I don't quite understand what those are), that are 
>> mapped to non-existing OSDs. osdmap: http://pastebin.com/irbRNYJz
>> queries for all stuck placement groups: http://pastebin.com/kzYa6s2G
>> 
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
> 
> -- 
> With best of best regards
> Pawel S. Veselov

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cinder volume creation issues

2013-07-26 Thread Mike Dawson

You can specify the uuid in the secret.xml file like:


<secret ephemeral='no' private='no'>
  <uuid>bdf77f5d-bf0b-1053-5f56-cd76b32520dc</uuid>
  <usage type='ceph'>
    <name>client.volumes secret</name>
  </usage>
</secret>


Then use that same uuid on all machines in cinder.conf:

rbd_secret_uuid=bdf77f5d-bf0b-1053-5f56-cd76b32520dc
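
The secret itself gets loaded into libvirt on each compute node with something
like this (assuming the client.volumes key was saved out to client.volumes.key):

virsh secret-define --file secret.xml
virsh secret-set-value --secret bdf77f5d-bf0b-1053-5f56-cd76b32520dc --base64 $(cat client.volumes.key)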


Also, the column you are referring to in the OpenStack Dashboard lists 
the machine running the Cinder APIs, not specifically the server hosting 
the storage. Like Greg stated, Ceph stripes the storage across your cluster.


Fix your uuids and cinder.conf and you'll be moving in the right direction.

Cheers,
Mike


On 7/26/2013 1:32 PM, johnu wrote:

Greg,
 :) I am not getting where the mistake in the
configuration was. virsh secret-define gave different secrets

sudo virsh secret-define --file secret.xml

sudo virsh secret-set-value --secret {uuid of secret} --base64 $(cat 
client.volumes.key)



On Fri, Jul 26, 2013 at 10:16 AM, Gregory Farnum wrote:

On Fri, Jul 26, 2013 at 10:11 AM, johnu wrote:
 > Greg,
 > Yes, the outputs match

Nope, they don't. :) You need the secret_uuid to be the same on each
node, because OpenStack is generating configuration snippets on one
node (which contain these secrets) and then shipping them to another
node where they're actually used.

Your secrets are also different despite having the same rbd user
specified, so that's broken too; not quite sure how you got there...
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

 >
 > master node:
 >
 > ceph auth get-key client.volumes
 > AQC/ze1R2EOWNBAAmLUE4U7zO1KafZ/CzVVTqQ==
 >
 > virsh secret-get-value bdf77f5d-bf0b-1053-5f56-cd76b32520dc
 > AQC/ze1R2EOWNBAAmLUE4U7zO1KafZ/CzVVTqQ==
 >
 > /etc/cinder/cinder.conf
 >
 > volume_driver=cinder.volume.drivers.rbd.RBDDriver
 > rbd_pool=volumes
 > glance_api_version=2
 > rbd_user=volumes
 > rbd_secret_uuid=bdf77f5d-bf0b-1053-5f56-cd76b32520dc
 >
 >
 > slave1
 >
 > /etc/cinder/cinder.conf
 >
 > volume_driver=cinder.volume.drivers.rbd.RBDDriver
 > rbd_pool=volumes
 > glance_api_version=2
 > rbd_user=volumes
 > rbd_secret_uuid=62d0b384-50ad-2e17-15ed-66bfeda40252
 >
 >
 > virsh secret-get-value 62d0b384-50ad-2e17-15ed-66bfeda40252
 > AQC/ze1R2EOWNBAAmLUE4U7zO1KafZ/CzVVTqQ==
 >
 > slave2
 >
 > /etc/cinder/cinder.conf
 >
 > volume_driver=cinder.volume.drivers.rbd.RBDDriver
 > rbd_pool=volumes
 > glance_api_version=2
 > rbd_user=volumes
 > rbd_secret_uuid=33651ba9-5145-1fda-3e61-df6a5e6051f5
 >
 > virsh secret-get-value 33651ba9-5145-1fda-3e61-df6a5e6051f5
 > AQC/ze1R2EOWNBAAmLUE4U7zO1KafZ/CzVVTqQ==
 >
 >
 > Yes, Openstack horizon is showing same host for all volumes.
Somehow, if
 > volume is attached to an instance lying on the same host, it works;
 > otherwise, it doesn't. Might be a coincidence. And I am surprised
that no
 > one else has seen or reported this issue. Any idea?
 >
 >> On Fri, Jul 26, 2013 at 9:45 AM, Gregory Farnum wrote:
 >>
 >> On Fri, Jul 26, 2013 at 9:35 AM, johnu wrote:
 >> > Greg,
 >> > I verified in all cluster nodes that rbd_secret_uuid
is same as
 >> > virsh secret-list. And If I do virsh secret-get-value of this
uuid, i
 >> > getting back the auth key for client.volumes.  What did you
mean by same
 >> > configuration?. Did you mean same secret for all compute nodes?
 >>
 >> If you run "virsh secret-get-value" with that rbd_secret_uuid on
each
 >> compute node, does it return the right secret for client.volumes?
 >>
 >> > when we login as admin, There is a column in admin
panel which
 >> > gives
 >> > the 'host' where the volumes lie. I know that volumes are
striped across
 >> > the
 >> > cluster but it gives same host for all volumes. That is why ,I got
 >> > little
 >> > confused.
 >>
 >> That's not something you can get out of the RBD stack itself; is
this
 >> something that OpenStack is showing you? I suspect it's just
making up
 >> information to fit some API expectations, but somebody more familiar
 >> with the OpenStack guts can probably chime in.
 >> -Greg
 >> Software Engineer #42 @ http://inktank.com | http://ceph.com
 >
 >




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Defective ceph startup script

2013-07-31 Thread Mike Dawson

Greg,

You can check the currently running version (and much more) using the 
admin socket:


http://ceph.com/docs/master/rados/operations/monitoring/#using-the-admin-socket

For me, this looks like:

# ceph --admin-daemon /var/run/ceph/ceph-mon.a.asok version
{"version":"0.61.7"}

# ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok version
{"version":"0.61.7"}


Also, I use 'service ceph restart' on Ubuntu 13.04 running a mkcephfs 
deployment. It may be different when using ceph-deploy.



Thanks,

Mike Dawson
Co-Founder & Director of Cloud Architecture
Cloudapt LLC
6330 East 75th Street, Suite 170
Indianapolis, IN 46250

On 7/31/2013 2:51 PM, Greg Chavez wrote:


I am running on Ubuntu 13.04.

There is something amiss with /etc/init.d/ceph on all of my ceph nodes.

I was upgrading to 0.61.7 from what I *thought* was 0.61.5 today when I
realized that "service ceph-all restart" wasn't actually doing anything.
  I saw nothing in /var/log/ceph.log - it just kept printing pg statuses
- and the PIDs of the osd and mon daemons did not change.  Stops failed
as well.

Then, when I tried to do individual osd restarts like this:

root@kvm-cs-sn-14i:/var/lib/ceph/osd# service ceph -v status osd.10
/etc/init.d/ceph: osd.10 not found (/etc/ceph/ceph.conf defines ,
/var/lib/ceph defines )


Despite the fact that I have this directory: /var/lib/ceph/osd/ceph-10/.

I have the same issue with mon restarts:

root@kvm-cs-sn-14i:/var/lib/ceph/mon# ls
ceph-kvm-cs-sn-14i

root@kvm-cs-sn-14i:/var/lib/ceph/mon# service ceph -v status
mon.kvm-cs-sn-14i
/etc/init.d/ceph: mon.kvm-cs-sn-14i not found (/etc/ceph/ceph.conf
defines , /var/lib/ceph defines )


I'm very worried that I have all my packages at  0.61.7 while my osd and
mon daemons could be running as old as  0.61.1!

Can anyone help me figure this out?  Thanks.


--
\*..+.-
--Greg Chavez
+//..;};


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Production/Non-production segmentation

2013-07-31 Thread Mike Dawson

Greg,

IMO the most critical risks when running Ceph are bugs that affect 
daemon stability and the upgrade process.


Due to the speed of releases in the Ceph project, I feel having separate 
physical hardware is the safer way to go, especially in light of your 
mention of an SLA for your production services.


A separate non-production cluster will allow you to test and validate 
new versions (including point releases within a stable series) before 
you attempt to upgrade your production cluster.


Cheers,

Mike Dawson
Co-Founder & Director of Cloud Architecture
Cloudapt LLC
6330 East 75th Street, Suite 170
Indianapolis, IN 46250

On 7/31/2013 10:47 AM, Greg Poirier wrote:

Does anyone here have multiple clusters or segment their single cluster
in such a way as to try to maintain different SLAs for production vs
non-production services?

We have been toying with the idea of running separate clusters (on the
same hardware, but reserve a portion of the OSDs for the production
cluster), but I'd rather have a single cluster in order to more evenly
distribute load across all of the spindles.

Thoughts or observations from people with Ceph in production would be
greatly appreciated.

Greg


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Production/Non-production segmentation

2013-07-31 Thread Mike Dawson


On 7/31/2013 3:34 PM, Greg Poirier wrote:

On Wed, Jul 31, 2013 at 12:19 PM, Mike Dawson wrote:

Due to the speed of releases in the Ceph project, I feel having
separate physical hardware is the safer way to go, especially in
light of your mention of an SLA for your production services.

Ah. I guess I should offer a little more background as to what I mean by
production vs. non-production: customer-facing, and not.


That makes more sense.



We're using Ceph primarily for volume storage with OpenStack at the
moment and operate two OS clusters: one for all of our customer-facing
services (which require a higher SLA) and one for all of our internal
services. The idea being that all of the customer-facing stuff is
segmented physically from anything our developers might be testing
internally.

What I'm wondering:

Does anyone else here do this?


Have you looked at Ceph Pools? I think you may find they address many of 
your concerns while maintaining a single cluster.
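
As a rough sketch (pool names, PG counts and caps here are made up), separate
pools plus per-pool cephx users give you most of the isolation without a
second cluster:

ceph osd pool create prod-volumes 2048
ceph osd pool create dev-volumes 512
ceph auth get-or-create client.prod-volumes mon 'allow r' osd 'allow rwx pool=prod-volumes'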




If so, do you run multiple Ceph clusters?
Do you let Ceph sort itself out?
Can this be done with a single physical cluster, but multiple logical
clusters?
Should it be?

I know that, mathematically speaking, the larger your Ceph cluster is,
the more evenly distributed the load (thanks to CRUSH). I'm wondering
if, in practice, RBD can still create hotspots (say from a runaway
service with multiple instances and volumes that is suddenly doing a ton
of IO). This would increase IO latency across the Ceph cluster, I'd
assume, and could impact the performance of customer-facing services.

So, to some degree, physical segmentation makes sense to me. But can we
simply reserve some OSDs per physical host for a "production" logical
cluster and then use the rest for the "development" logical cluster
(separate MON clusters for each, but all running on the same hardware).
Or, given a sufficiently large cluster, is this not even a concern?

I'm also interested in hearing about experience using CephFS, Swift, and
RBD all on a single cluster or if people have chosen to use multiple
clusters for these as well. For example, if you need faster volume
storage in RBD, so you go for more spindles and smaller disks vs. larger
disks with fewer spindles for object storage, which can have a higher
allowance for latency than volume storage.


See the response from Greg F. from Inktank to a similar question:

http://comments.gmane.org/gmane.comp.file-systems.ceph.user/2090




A separate non-production cluster will allow you to test and
validate new versions (including point releases within a stable
series) before you attempt to upgrade your production cluster.


Oh yeah. I'm doing that for sure.
Thanks,

Greg


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Why is my mon store.db is 220GB?

2013-08-01 Thread Mike Dawson
220GB is way, way too big. I suspect your monitors need to go through a 
successful leveldb compaction. The early releases of Cuttlefish suffered 
several issues with store.db growing unbounded. Most were fixed by 
0.61.5, I believe.


You may have luck stopping all Ceph daemons, then starting the monitor by 
itself. When there were bugs, leveldb compaction tended to work better 
without OSD traffic hitting the monitors. Also, there are some settings 
to force a compact on startup like 'mon compact on start = true' and 'mon 
compact on trim = true'. I don't think either is required anymore, 
though. See some history here:


http://tracker.ceph.com/issues/4895
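
If you want to force it, the knobs look roughly like this (assuming a monitor
named 'a'; the tell variant compacts a running mon):

[mon]
    mon compact on start = true

ceph tell mon.a compact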


Thanks,

Mike Dawson
Co-Founder & Director of Cloud Architecture
Cloudapt LLC
6330 East 75th Street, Suite 170
Indianapolis, IN 46250

On 8/1/2013 6:52 PM, Jeppesen, Nelson wrote:

My Mon store.db has been at 220GB for a few months now. Why is this and
how can I fix it? I have one monitor in this cluster and I suspect that
I can’t  add monitors to the cluster because it is too big. Thank you.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] qemu-1.4.0 and onwards, linux kernel 3.2.x, ceph-RBD, heavy I/O leads to kernel_hung_tasks_timout_secs message and unresponsive qemu-process

2013-08-02 Thread Mike Dawson

Oliver,

We've had a similar situation occur. For about three months, we've run 
several Windows 2008 R2 guests with virtio drivers that record video 
surveillance. We have long suffered an issue where the guest appears to 
hang indefinitely (or until we intervene). For the sake of this 
conversation, we call this state "wedged", because it appears something 
(rbd, qemu, virtio, etc) gets stuck on a deadlock. When a guest gets 
wedged, we see the following:


- the guest will not respond to pings
- the qemu-system-x86_64 process drops to 0% cpu
- graphite graphs show the interface traffic dropping to 0bps
- the guest will stay wedged forever (or until we intervene)
- strace of qemu-system-x86_64 shows QEMU is making progress [1][2]

We can "un-wedge" the guest by opening a NoVNC session or running a 
'virsh screenshot' command. After that, the guest resumes and runs as 
expected. At that point we can examine the guest. Each time we'll see:


- No Windows error logs whatsoever while the guest is wedged
- A time sync typically occurs right after the guest gets un-wedged
- Scheduled tasks do not run while wedged
- Windows error logs do not show any evidence of suspend, sleep, etc

We had so many issue with guests becoming wedged, we wrote a script to 
'virsh screenshot' them via cron. Then we installed some updates and had 
a month or so of higher stability (wedging happened maybe 1/10th as 
often). Until today we couldn't figure out why.
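
For the curious, the cron'd workaround is nothing fancy; roughly:

for dom in $(virsh list --name); do virsh screenshot "$dom" /tmp/"$dom".ppm >/dev/null; done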


Yesterday, I realized qemu was starting the instances without specifying 
cache=writeback. We corrected that, and let them run overnight. With RBD 
writeback re-enabled, wedging came back as often as we had seen in the 
past. I've counted ~40 occurrences in the past 12-hour period. So I feel 
like writeback caching in RBD certainly makes the deadlock more likely 
to occur.


Joshd asked us to gather RBD client logs:

"joshd> it could very well be the writeback cache not doing a callback 
at some point - if you could gather logs of a vm getting stuck with 
debug rbd = 20, debug ms = 1, and debug objectcacher = 30 that would be 
great"


We'll do that over the weekend. If you could as well, we'd love the help!

[1] http://www.gammacode.com/kvm/wedged-with-timestamps.txt
[2] http://www.gammacode.com/kvm/not-wedged.txt
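
For anyone who wants to capture the same thing, the client-side debug settings
go in ceph.conf on the hypervisor, roughly like this (the log path is just an
example):

[client]
    rbd cache = true
    debug rbd = 20
    debug ms = 1
    debug objectcacher = 30
    log file = /var/log/ceph/client.$name.$pid.log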

Thanks,

Mike Dawson
Co-Founder & Director of Cloud Architecture
Cloudapt LLC
6330 East 75th Street, Suite 170
Indianapolis, IN 46250

On 8/2/2013 6:22 AM, Oliver Francke wrote:

Well,

I believe, I'm the winner of buzzwords-bingo for today.

But seriously speaking... as I don't have this particular problem with
qcow2 with kernel 3.2 nor qemu-1.2.2 nor newer kernels, I hope I'm not
alone here?
We have a rising number of tickets from people reinstalling from ISO's
with 3.2-kernel.

Fast fallback is to start all VM's with qemu-1.2.2, but we then lose
some features ala latency-free-RBD-cache ;)

I just opened a bug for qemu per:

https://bugs.launchpad.net/qemu/+bug/1207686

with all dirty details.

Installing a backport kernel 3.9.x or upgrading the Ubuntu kernel to 3.8.x
"fixes" it. So we have a bad combination for all distros with a 3.2 kernel
and rbd as storage-backend, I assume.

Any similar findings?
Any idea of tracing/debugging ( Josh? ;) ) very welcome,

Oliver.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Large storage nodes - best practices

2013-08-05 Thread Mike Dawson

Brian,

Short answer: Ceph generally is used with multiple OSDs per node. One 
OSD per storage drive with no RAID is the most common setup. At 24- or 
36-drives per chassis, there are several potential bottlenecks to consider.


Mark Nelson, the Ceph performance guy at Inktank, has published several 
articles you should consider reading. A few of interest are [0], [1], 
and [2]. The last link is a 5-part series.


There are lots of considerations:

- HBA performance
- Total OSD throughput vs network throughput (rough numbers below)
- SSD throughput vs. OSD throughput
- CPU / RAM overhead for the OSD processes
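
To put rough numbers on the throughput point above (assuming ~100 MB/s
sequential per 7200rpm SATA spinner and a single 10GbE link; your drives and
NICs may differ):

36 drives x ~100 MB/s = ~3600 MB/s aggregate disk bandwidth
1 x 10GbE             = ~1250 MB/s
                      -> the network saturates well before the disks do,
                         and journal writes plus replication traffic eat
                         into that further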

Also, note that there is on-going work to add erasure coding as a 
optional backend (as opposed to the current replication scheme). If you 
prioritize bulk storage over performance, you may be interested in 
following the progress [3], [4], and [5].


[0]: 
http://ceph.com/community/ceph-performance-part-1-disk-controller-write-throughput/
[1]: 
http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/
[2]: 
http://ceph.com/performance-2/ceph-cuttlefish-vs-bobtail-part-1-introduction-and-rados-bench/
[3]: 
http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend
[4]: 
http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend

[5]: http://www.inktank.com/about-inktank/roadmap/


Cheers,
Mike Dawson


On 8/5/2013 9:50 AM, Brian Candler wrote:

I am looking at evaluating ceph for use with large storage nodes (24-36
SATA disks per node, 3 or 4TB per disk, HBAs, 10G ethernet).

What would be the best practice for deploying this? I can see two main
options.

(1) Run 24-36 osds per node. Configure ceph to replicate data to one or
more other nodes. This means that if a disk fails, there will have to be
an operational process to stop the osd, unmount and replace the disk,
mkfs a new filesystem, mount it, and restart the osd - which could be
more complicated and error-prone than a RAID swap would be.

(2) Combine the disks using some sort of RAID (or ZFS raidz/raidz2), and
run one osd per node. In this case:
* if I use RAID0 or LVM, then a single disk failure will cause all the
data on the node to be lost and rebuilt
* if I use RAID5/6, then write performance is likely to be poor
* if I use RAID10, then capacity is reduced by half; with ceph
replication each piece of data will be replicated 4 times (twice on one
node, twice on the replica node)

It seems to me that (1) is what ceph was designed to achieve, maybe with
2 or 3 replicas. Is this what's recommended?

I have seen some postings which imply one osd per node: e.g.
http://www.sebastien-han.fr/blog/2012/08/17/ceph-storage-node-maintenance/
shows three nodes each with one OSD - but maybe this was just a trivial
example for simplicity.

Looking at
http://ceph.com/docs/next/install/hardware-recommendations/
it says " You *may* run multiple OSDs per host" (my emphasis), and goes
on to caution against having more disk bandwidth than network bandwidth.
Ah, but at another point it says " We recommend using a dedicated drive
for the operating system and software, and one drive for each OSD daemon
you run on the host." So I guess that's fairly clear.

Anything other options I should be considering?

Regards,

Brian.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Large storage nodes - best practices

2013-08-05 Thread Mike Dawson


On 8/5/2013 12:51 PM, Brian Candler wrote:

On 05/08/2013 17:15, Mike Dawson wrote:


Short answer: Ceph generally is used with multiple OSDs per node. One
OSD per storage drive with no RAID is the most common setup. At 24- or
36-drives per chassis, there are several potential bottlenecks to
consider.

Mark Nelson, the Ceph performance guy at Inktank, has published
several articles you should consider reading. A few of interest are
[0], [1], and [2]. The last link is a 5-part series.


Yep, I saw [0] and [1]. He tries a 6-disk RAID0 array and generally gets
lower throughput than 6 x JBOD disks (although I think he's using the
controller RAID0 functionality, rather than mdraid).

AFAICS he has a 36-disk chassis but only runs tests with 6 disks, which
is a shame as it would be nice to know which other bottleneck you could
hit first with this type of setup.


The third link I sent shows Mark's results with 24 spinners and 8 SSDs 
for journals. Specifically read:


http://ceph.com/performance-2/ceph-cuttlefish-vs-bobtail-part-1-introduction-and-rados-bench/#setup

Florian Haas has also published some thoughts on bottenecks:

http://www.hastexo.com/resources/hints-and-kinks/solid-state-drives-and-ceph-osd-journals




Also, note that there is on-going work to add erasure coding as a
optional backend (as opposed to the current replication scheme). If
you prioritize bulk storage over performance, you may be interested in
following the progress [3], [4], and [5].

[0]:
http://ceph.com/community/ceph-performance-part-1-disk-controller-write-throughput/

[1]:
http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/

[2]:
http://ceph.com/performance-2/ceph-cuttlefish-vs-bobtail-part-1-introduction-and-rados-bench/

[3]:
http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend

[4]:
http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend

[5]: http://www.inktank.com/about-inktank/roadmap/


Thank you - erasure coding is very much of interest for the
archival-type storage I'm looking at. However your links [3] and [4] are
identical, did you mean to link to another one?


Oops.

http://wiki.ceph.com/01Planning/02Blueprints/Emperor/Erasure_coded_storage_backend_%28step_2%29




Cheers,

Brian.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] qemu-1.4.0 and onwards, linux kernel 3.2.x, ceph-RBD, heavy I/O leads to kernel_hung_tasks_timout_secs message and unresponsive qemu-process, [Qemu-devel] [Bug 1207686]

2013-08-05 Thread Mike Dawson

Josh,

Logs are uploaded to cephdrop with the file name 
mikedawson-rbd-qemu-deadlock.


- At about 2013-08-05 19:46 or 47, we hit the issue, traffic went to 0
- At about 2013-08-05 19:53:51, ran a 'virsh screenshot'


Environment is:

- Ceph 0.61.7 (client is co-mingled with three OSDs)
- rbd cache = true and cache=writeback
- qemu 1.4.0 1.4.0+dfsg-1expubuntu4
- Ubuntu Raring with 3.8.0-25-generic

This issue is reproducible in my environment, and I'm willing to run any 
wip branch you need. What else can I provide to help?


Thanks,
Mike Dawson


On 8/5/2013 3:48 AM, Stefan Hajnoczi wrote:

On Sun, Aug 04, 2013 at 03:36:52PM +0200, Oliver Francke wrote:

On 02.08.2013 at 23:47, Mike Dawson wrote:

We can "un-wedge" the guest by opening a NoVNC session or running a 'virsh 
screenshot' command. After that, the guest resumes and runs as expected. At that point we 
can examine the guest. Each time we'll see:


If virsh screenshot works then this confirms that QEMU itself is still
responding.  Its main loop cannot be blocked since it was able to
process the screendump command.

This supports Josh's theory that a callback is not being invoked.  The
virtio-blk I/O request would be left in a pending state.

Now here is where the behavior varies between configurations:

On a Windows guest with 1 vCPU, you may see the symptom that the guest no
longer responds to ping.

On a Linux guest with multiple vCPUs, you may see the hung task message
from the guest kernel because other vCPUs are still making progress.
Just the vCPU that issued the I/O request and whose task is in
UNINTERRUPTIBLE state would really be stuck.

Basically, the symptoms depend not just on how QEMU is behaving but also
on the guest kernel and how many vCPUs you have configured.

I think this can explain how both problems you are observing, Oliver and
Mike, are a result of the same bug.  At least I hope they are :).

Stefan


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Openstack glance ceph rbd_store_user authentification problem

2013-08-08 Thread Mike Dawson

Steffan,

It works for me. I have:

user@node:/etc/ceph# cat /etc/glance/glance-api.conf | grep rbd
default_store = rbd
#   glance.store.rbd.Store,
rbd_store_ceph_conf = /etc/ceph/ceph.conf
rbd_store_user = images
rbd_store_pool = images
rbd_store_chunk_size = 4


Thanks,
Mike Dawson


On 8/8/2013 9:01 AM, Steffen Thorhauer wrote:

Hi,
recently I had a problem with openstack glance and ceph.
I used the
http://ceph.com/docs/master/rbd/rbd-openstack/#configuring-glance
documentation and
http://docs.openstack.org/developer/glance/configuring.html documentation
I'm using ubuntu 12.04 LTS with grizzly from Ubuntu Cloud Archive and
ceph 61.7.

glance-api.conf had following config options

default_store = rbd
rbd_store_user=images
rbd_store_pool = images
rbd_store_ceph_conf = /etc/ceph/ceph.conf


All the time when doing glance image create I get errors. In the glance
api log I only found error like

2013-08-08 10:25:38.021 5725 TRACE glance.api.v1.images Traceback (most
recent call last):
2013-08-08 10:25:38.021 5725 TRACE glance.api.v1.images   File
"/usr/lib/python2.7/dist-packages/glance/api/v1/images.py", line 444, in
_upload
2013-08-08 10:25:38.021 5725 TRACE glance.api.v1.images image_meta['size'])
2013-08-08 10:25:38.021 5725 TRACE glance.api.v1.images   File
"/usr/lib/python2.7/dist-packages/glance/store/rbd.py", line 241, in add
2013-08-08 10:25:38.021 5725 TRACE glance.api.v1.images with
rados.Rados(conffile=self.conf_file, rados_id=self.user) as conn:
2013-08-08 10:25:38.021 5725 TRACE glance.api.v1.images   File
"/usr/lib/python2.7/dist-packages/rados.py", line 134, in __enter__
2013-08-08 10:25:38.021 5725 TRACE glance.api.v1.images self.connect()
2013-08-08 10:25:38.021 5725 TRACE glance.api.v1.images   File
"/usr/lib/python2.7/dist-packages/rados.py", line 192, in connect
2013-08-08 10:25:38.021 5725 TRACE glance.api.v1.images raise
make_ex(ret, "error calling connect")
2013-08-08 10:25:38.021 5725 TRACE glance.api.v1.images ObjectNotFound:
error calling connect

This trace message helped me not very much :-(
My google search "glance.api.v1.images ObjectNotFound: error calling
connect" only found
http://irclogs.ceph.widodh.nl/index.php?date=2012-10-26
This points me to a ceph authentication problem. But the ceph tools
worked fine for me.
Then I tried the debug option in glance-api.conf and I found the following
entry.

DEBUG glance.common.config [-] rbd_store_pool = images
log_opt_values /usr/lib/python2.7/dist-packages/oslo/config/cfg.py:1485
DEBUG glance.common.config [-] rbd_store_user = glance
log_opt_values /usr/lib/python2.7/dist-packages/oslo/config/cfg.py:1485

The glance-api service  did not use my rbd_store_user = images option!!
Then I configured a client.glance auth and it worked with the
"implicit" glance user!!!

Now my question: Am I the only one with this problem??

Regards,
   Steffen Thorhauer
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to recover the osd.

2013-08-08 Thread Mike Dawson

Looks like you didn't get osd.0 deployed properly. Can you show:

- ls /var/lib/ceph/osd/ceph-0
- cat /etc/ceph/ceph.conf


Thanks,

Mike Dawson
Co-Founder & Director of Cloud Architecture
Cloudapt LLC
6330 East 75th Street, Suite 170
Indianapolis, IN 46250

On 8/8/2013 9:13 AM, Suresh Sadhu wrote:

HI,

My storage cluster health is in warning state; one of the OSDs is down
and even if I try to start the OSD it fails to start

sadhu@ubuntu3:~$ ceph osd stat

e22: 2 osds: 1 up, 1 in

sadhu@ubuntu3:~$ ls /var/lib/ceph/osd/

ceph-0  ceph-1

sadhu@ubuntu3:~$ ceph osd tree

# idweight  type name   up/down reweight

-1  0.14root default

-2  0.14host ubuntu3

0   0.06999 osd.0   down0

1   0.06999 osd.1   up  1

sadhu@ubuntu3:~$ sudo /etc/init.d/ceph -a start 0

/etc/init.d/ceph: 0. not found (/etc/ceph/ceph.conf defines ,
/var/lib/ceph defines )

sadhu@ubuntu3:~$ sudo /etc/init.d/ceph -a start osd.0

/etc/init.d/ceph: osd.0 not found (/etc/ceph/ceph.conf defines ,
/var/lib/ceph defines )

Ceph health status in warning mode.

pg 4.10 is active+degraded, acting [1]

pg 3.17 is active+degraded, acting [1]

pg 5.16 is active+degraded, acting [1]

pg 4.17 is active+degraded, acting [1]

pg 3.10 is active+degraded, acting [1]

recovery 62/124 degraded (50.000%)

mds.ceph@ubuntu3 at 10.147.41.3:6803/2148 is laggy/unresponsi

regards

sadhu



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to recover the osd.

2013-08-08 Thread Mike Dawson


On 8/8/2013 12:30 PM, Suresh Sadhu wrote:

Thanks Mike,Please find the output of two commands

sadhu@ubuntu3:~$ ls /var/lib/ceph/osd/ceph-0


^^^ that is a problem. It appears that osd.0 didn't get deployed 
properly. To see an example of what structure should be there, do:


ls /var/lib/ceph/osd/ceph-1

ceph-0 should be similar to the apparently working ceph-1 on your cluster.

It should look similar to:

#ls /var/lib/ceph/osd/ceph-0
ceph_fsid
current
fsid
keyring
magic
ready
store_version
whoami

- Mike


sadhu@ubuntu3:~$ cat /etc/ceph/ceph.conf
[global]
fsid = 593dac9e-ce55-4803-acb4-2d32b4e0d3be
mon_initial_members = ubuntu3
mon_host = 10.147.41.3
#auth_supported = cephx
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
osd_journal_size = 1024
filestore_xattr_use_omap = true

-Original Message-
From: Mike Dawson [mailto:mike.daw...@cloudapt.com]
Sent: 08 August 2013 18:50
To: Suresh Sadhu
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] how to recover the osd.

Looks like you didn't get osd.0 deployed properly. Can you show:

- ls /var/lib/ceph/osd/ceph-0
- cat /etc/ceph/ceph.conf


Thanks,

Mike Dawson
Co-Founder & Director of Cloud Architecture Cloudapt LLC
6330 East 75th Street, Suite 170
Indianapolis, IN 46250

On 8/8/2013 9:13 AM, Suresh Sadhu wrote:

HI,

My storage cluster health is in warning state; one of the OSDs is down
and even if I try to start the OSD it fails to start

sadhu@ubuntu3:~$ ceph osd stat

e22: 2 osds: 1 up, 1 in

sadhu@ubuntu3:~$ ls /var/lib/ceph/osd/

ceph-0  ceph-1

sadhu@ubuntu3:~$ ceph osd tree

# idweight  type name   up/down reweight

-1  0.14root default

-2  0.14host ubuntu3

0   0.06999 osd.0   down0

1   0.06999 osd.1   up  1

sadhu@ubuntu3:~$ sudo /etc/init.d/ceph -a start 0

/etc/init.d/ceph: 0. not found (/etc/ceph/ceph.conf defines ,
/var/lib/ceph defines )

sadhu@ubuntu3:~$ sudo /etc/init.d/ceph -a start osd.0

/etc/init.d/ceph: osd.0 not found (/etc/ceph/ceph.conf defines ,
/var/lib/ceph defines )

Ceph health status in warning mode.

pg 4.10 is active+degraded, acting [1]

pg 3.17 is active+degraded, acting [1]

pg 5.16 is active+degraded, acting [1]

pg 4.17 is active+degraded, acting [1]

pg 3.10 is active+degraded, acting [1]

recovery 62/124 degraded (50.000%)

mds.ceph@ubuntu3 at 10.147.41.3:6803/2148 is laggy/unresponsi

regards

sadhu



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Storage, File Systems and Data Scrubbing

2013-08-21 Thread Mike Lowe
I think you are missing the distinction between metadata journaling and data 
journaling.  In most cases a journaling filesystem is one that journals its 
own metadata but your data is on its own.  Consider the case where you have a 
replication level of two, the osd filesystems have journaling disabled and you 
append a block to a file (which is an object in terms of ceph) but only one 
commits the change in file size to disk.  Later you scrub and discover a 
discrepancy in object sizes; with a replication level of 2 there is no way to 
authoritatively say which one is correct just based on what's in ceph.  This is 
a similar scenario to a btrfs bug that caused me to lose data with ceph.  
Journaling your metadata is the absolute minimum level of assurance you need to 
make a transactional system like ceph work.
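
To make the metadata/data distinction concrete with ext4 (purely illustrative;
the device and mount point are made up):

mount -o data=ordered /dev/sdb1 /var/lib/ceph/osd/ceph-0   # default: only metadata is journaled
mount -o data=journal /dev/sdb1 /var/lib/ceph/osd/ceph-0   # data and metadata both journaled
mkfs.ext4 -O ^has_journal /dev/sdb1                        # no journal at all, which is what the question asks about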

On Aug 21, 2013, at 4:23 PM, Johannes Klarenbeek  
wrote:

> Dear ceph-users,
>  
> I read a lot of documentation today about ceph architecture and linux file 
> system benchmarks in particular and I could not help noticing something that I 
> would like to clear up for myself. Take into account that it has been a while since 
> I actually touched linux, but I did some programming on php2b12 and apache 
> back in the days so I’m not a complete newbie. The real question is below if 
> you do not like reading the rest ;)
>  
> What I have come to understand about file systems for OSD’s is that in theory 
> btrfs is the file system of choice. However, due to its young age it’s not 
> considered stable yet. Therefore EXT4 but preferably XFS is used in most 
> cases. It seems that most people choose this system because of its journaling 
> feature and XFS for its additional attribute storage which has a 64kb limit 
> which should be sufficient for most operations.
>  
> But when you look at file system benchmarks btrfs is really, really slow. 
> Then comes XFS, then EXT4, but EXT2 really dwarfs all other throughput 
> results. On journaling systems (like XFS, EXT4 and btrfs) disabling 
> journaling actually helps throughput as well. Sometimes more then 2 times for 
> write actions.
>  
> The preferred configuration for OSD’s is one OSD per disk. Each object is 
> striped among all Object Storage Daemons in a cluster. So if I would take one 
> disk for the cluster and check its data, chances are slim that I will find a 
> complete object there (a non-striped, full object I mean).
>  
> When a client issues an object write (I assume a full object/file write in 
> this case) it is the client’s responsibility to stripe it among the object 
> storage daemons. When a stripe is successfully stored by the daemon an ACK 
> signal is send to (?) the client and all participating OSD’s. When all 
> participating OSD’s for the object have completed the client assumes all is 
> well and returns control to the application
>  
> If I’m not mistaken, then journaling is meant for the rare occasions that a 
> hardware failure will occur and the data is corrupted. Ceph does this too in 
> another way of course. But ceph should be able to notice when a block/stripe 
> is correct or not. In the rare occasion that a node is failing while doing a 
> write; an ACK signal is not send to the caller and therefor the client can 
> resend the block/stripe to another OSD. Therefor I fail to see the purpose of 
> this extra journaling feature.
>  
> Also ceph schedules a data scrubbing process every day (or however it is 
> configured) that should be able to tackle bad sectors or other errors on the 
> file system and accordingly repair them on the same daemon or flag the whole 
> block as bad. Since everything is replicated the block is still in the 
> storage cluster so no harm is done.
>  
> In a normal/single file system I truly see the value of journaling and the 
> potential for btrfs (although it’s still very slow). However in a system like 
> ceph, journaling seems to me more like a paranoid super fail save.
>  
> Did anyone experiment with file systems that disabled journaling and how did 
> it perform?
>  
> Regards,
> Johannes
>  
>  
>  
>  
> 
> 
> __ Informatie van ESET Endpoint Antivirus, versie van database 
> viruskenmerken 8713 (20130821) __
> 
> Het bericht is gecontroleerd door ESET Endpoint Antivirus.
> 
> http://www.eset.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Storage, File Systems and Data Scrubbing

2013-08-21 Thread Mike Lowe
Let me make a simpler case, to do ACID (https://en.wikipedia.org/wiki/ACID) 
which are all properties you want in a filesystem or a database, you need a 
journal.  You need a journaled filesystem to make the object store's file 
operations safe.  You need a journal in ceph to make sure the object operations 
are safe.  Flipped bits are a separate problem that may be aided by journaling 
but the primary objective of a journal is to make guarantees about concurrent 
operations and interrupted operations.  There isn't a person on this list who 
hasn't had an osd die, without a journal starting that osd up again and getting 
it usable would be impractical.
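
For reference, the Ceph journal Mike means is configured per OSD in ceph.conf; a
minimal hedged sketch (path and size are only examples):

[osd]
    osd journal size = 10240        ; MB

[osd.0]
    osd journal = /var/lib/ceph/osd/ceph-0/journal   ; a file, or a raw SSD partition

Pointing "osd journal" at a small raw partition on an SSD is the usual way to keep
the double write off the data disk.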

On Aug 21, 2013, at 8:00 PM, Johannes Klarenbeek  
wrote:

>  
>  
>  
> I think you are missing the distinction between metadata journaling and data 
> journaling.  In most cases a journaling filesystem is one that journal's it's 
> own metadata but your data is on its own.  Consider the case where you have a 
> replication level of two, the osd filesystems have journaling disabled and 
> you append a block to a file (which is an object in terms of ceph) but only 
> one commits the change in file size to disk.  Later you scrub and discover a 
> discrepancy in object sizes, with a replication level of 2 there is no way to 
> authoritatively say which one is correct just based on what's in ceph.  This 
> is a similar scenario to a btrfs bug that caused me to lose data with ceph.  
> Journaling your metadata is the absolute minimum level of assurance you need 
> to make a transactional system like ceph work.
>  
> Hey Mike J
>  
> I get your point. However, isn’t it then possible to authoritatively say 
> which one is the correct one in case of 3 OSD’s?
> Or is the replication level a configuration setting that tells the cluster 
> that the object needs to be replicated 3 times?
> In both cases, data scrubbing chooses the majority of the same-same 
> replicated objects in order to know which one is authorative.
>  
> But I also believe (!) that each object has a checksum and each PG too so 
> that it should be easy to find the corrupted object on any of the OSD’s.
> How else would scrubbing find corrupted sectors? Especially when I think 
> about 2TB SATA disks being hit by cosmic-rays that flip a bit somewhere.
> It happens more often with big cheap TB disks, but that doesn’t mean the 
> corrupted sector is a bad sector (in not useable anymore). Journaling is not 
> going to help anyone with this.
> Therefor I believe (again) that the data scrubber must have a mechanism to 
> detect these types of corruptions even in a 2 OSD setup by means of checksums 
> (or better, with a hashed checksum id).
>  
> Also, aren’t there 2 types of transactions; one for writing and one for 
> replicating?
>  
> On Aug 21, 2013, at 4:23 PM, Johannes Klarenbeek 
>  wrote:
>  
> 
> Dear ceph-users,
>  
> I read a lot of documentation today about ceph architecture and linux file 
> system benchmarks in particular and I could not help notice something that I 
> like to clear up for myself. Take into account that it has been a while that 
> I actually touched linux, but I did some programming on php2b12 and apache 
> back in the days so I’m not a complete newbie. The real question is below if 
> you do not like reading the rest ;)
>  
> What I have come to understand about file systems for OSD’s is that in theory 
> btrfs is the file system of choice. However, due to its young age it’s not 
> considered stable yet. Therefore EXT4 but preferably XFS is used in most 
> cases. It seems that most people choose this system because of its journaling 
> feature and XFS for its additional attribute storage which has a 64kb limit 
> which should be sufficient for most operations.
>  
> But when you look at file system benchmarks btrfs is really, really slow. 
> Then comes XFS, then EXT4, but EXT2 really dwarfs all other throughput 
> results. On journaling systems (like XFS, EXT4 and btrfs) disabling 
> journaling actually helps throughput as well. Sometimes more then 2 times for 
> write actions.
>  
> The preferred configuration for OSD’s is one OSD per disk. Each object is 
> striped among all Object Storage Daemons in a cluster. So if I would take one 
> disk for the cluster and check its data, chances are slim that I will find a 
> complete object there (a non-striped, full object I mean).
>  
> When a client issues an object write (I assume a full object/file write in 
> this case) it is the client’s responsibility to stripe it among the object 
> storage daemons. When a stripe is successfully stored by the daemon an ACK 
> signal is send to (?) the client and all participating OSD’s. When all 
> participating OSD’s for the obje

Re: [ceph-users] RBD hole punching

2013-08-22 Thread Mike Lowe
There is TRIM/discard support and I use it with some success. There are some 
details here http://ceph.com/docs/master/rbd/qemu-rbd/  The one caveat I have 
is that I've sometimes been able to crash an osd by doing fstrim inside a guest.
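
For anyone wanting to try it, a hedged sketch of wiring discard through with
libvirt (needs a reasonably recent libvirt/qemu, cephx auth elements omitted,
and the guest generally has to see the disk via IDE or virtio-scsi rather than
virtio-blk for TRIM to reach qemu):

<disk type='network' device='disk'>
  <driver name='qemu' type='raw' cache='writeback' discard='unmap'/>
  <source protocol='rbd' name='rbd/myvolume'/>
  <target dev='sda' bus='scsi'/>
</disk>

With that in place, fstrim (or mount -o discard) inside the guest should punch
unused space back out of the RBD image, with the OSD-crash caveat above in mind.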

On Aug 22, 2013, at 10:24 AM, Guido Winkelmann  
wrote:

> Hi,
> 
> RBD has had support for sparse allocation for some time now. However, when 
> using an RBD volume as a virtual disk for a virtual machine, the RBD volume 
> will inevitably grow until it reaches its actual nominal size, even if the 
> filesystem in the guest machine never reaches full utilization.
> 
> Is there some way to reverse this? Like going through the whole image, 
> looking 
> for large consecutive areas of zeroes and just deleting the objects for that 
> area? How about support for TRIM/discard commands used by some modern 
> filesystems?
> 
>   Guido
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Significant slowdown of osds since v0.67 Dumpling

2013-08-22 Thread Mike Dawson
Jumping in pretty late on this thread, but I can confirm much higher CPU 
load on ceph-osd using 0.67.1 compared to 0.61.7 under a write-heavy RBD 
workload. Under my workload, it seems like it might be 2x-5x higher CPU 
load per process.


Thanks,
Mike Dawson


On 8/22/2013 4:41 AM, Oliver Daudey wrote:

Hey Samuel,

On wo, 2013-08-21 at 20:27 -0700, Samuel Just wrote:

I think the rbd cache one you'd need to run for a few minutes to get
meaningful results.  It should stabilize somewhere around the actual
throughput of your hardware.


Ok, I now also ran this test on Cuttlefish as well as Dumpling.

Cuttlefish:
# rbd bench-write test --rbd-cache
bench-write  io_size 4096 io_threads 16 bytes 1073741824 pattern seq
   SEC   OPS   OPS/SEC   BYTES/SEC
 1 13265  13252.45  3466029.67
 2 25956  12975.60  3589479.95
 3 38475  12818.61  3598590.70
 4 50184  12545.16  3530516.34
 5 59263  11852.22  3292258.13
<...>
   300   3421530  11405.08  3191555.35
   301   3430755  11397.83  3189251.09
   302   3443345  11401.73  3190694.98
   303   3455230  11403.37  3191478.97
   304   3467014  11404.62  3192136.82
   305   3475355  11394.57  3189525.71
   306   3488067  11398.90  3190553.96
   307   3499789  11399.96  3190770.21
   308   3510566  11397.93  3190289.49
   309   3519829  11390.98  3188620.93
   310   3532539  11395.25  3189544.03

Dumpling:
# rbd bench-write test --rbd-cache
bench-write  io_size 4096 io_threads 16 bytes 1073741824 pattern seq
   SEC   OPS   OPS/SEC   BYTES/SEC
 1 13201  13194.63  3353004.50
 2 25926  12957.05  3379695.03
 3 36624  12206.06  3182087.11
 4 46547  11635.35  3035794.95
 5 59290  11856.27  3090389.79
<...>
   300   3405215  11350.66  3130092.00
   301   3417789  11354.76  3131106.34
   302   3430067  11357.83  3131933.41
   303   3438792  11349.14  3129734.88
   304   3450237  11349.45  3129689.62
   305   3462840  11353.53  3130406.43
   306   3473151  11350.17  3128942.32
   307   3482327  11343.00  3126771.34
   308   3495020  11347.44  3127502.07
   309   3506894  11349.13  3127781.70
   310   3516532  11343.65  3126714.62

As you can see, the result is virtually identical.  What jumps out
during the cached tests, is that the CPU used by the OSDs is negligible
in both cases, while without caching, the OSDs get loaded quite well.
Perhaps the cache masks the problem we're seeing in Dumpling somehow?
And I'm not changing anything but the OSD-binary during my tests, so
cache-settings used in VMs are identical in both scenarios.



Hmm, 10k ios I guess is only 10 rbd chunks.  What replication level
are you using?  Try setting them to 1000 (you only need to set the
xfs ones).

For the rand test, try increasing
filestore_wbthrottle_xfs_inodes_hard_limit and
filestore_wbthrottle_xfs_inodes_start_flusher to 1 as well as
setting the above ios limits.


Ok, my current config:
 filestore wbthrottle xfs ios start flusher = 1000
 filestore wbthrottle xfs ios hard limit = 1000
 filestore wbthrottle xfs inodes hard limit = 1
 filestore wbthrottle xfs inodes start flusher = 1

Unfortunately, that still makes no difference at all in the original
standard-tests.

Random IO on Dumpling, after 120 secs of runtime:
# rbd bench-write test --io-pattern=rand
bench-write  io_size 4096 io_threads 16 bytes 1073741824 pattern rand
   SEC   OPS   OPS/SEC   BYTES/SEC
     1       545    534.98  1515804.02
     2      1162    580.80  1662416.60
     3      1731    576.52  1662966.61
     4      2317    579.04  1695129.94
     5      2817    562.56  1672754.87
<...>
   120     43564    362.91  1080512.00
   121     43774    361.76  1077368.28
   122     44419    364.06  1083894.31
   123     45046    366.22  1090518.68
   124     45287    364.01  1084437.37
   125     45334    361.54  1077035.12
   126     45336    359.40  1070678.36
   127     45797    360.60  1073985.78
   128     46388    362.40  1080056.75
   129     46984    364.21  1086068.63
   130     47604    366.11  1092712.51

Random IO on Cuttlefish, after 120 secs of runtime:
rbd bench-write test --io-pattern=rand
bench-write  io_size 4096 io_threads 16 bytes 1073741824 pattern rand
   SEC   OPS   OPS/SEC   BYTES/SEC
 1  1066   1065.54  3115713.13
 2  2099   1049.31  2936300.53
 3  3218   1072.32  3028707.50
 4  4026   1003.23  2807859.15
     5      4272    793.80  2226962.63
<...>
   120     66935    557.79  1612483.74
   121     68011    562.01  1625419.34
   122     68428    558.59  1615376.62
   123     68579    557.06  1610780.38
   125     68777    549.73  1589816.94
   126     69745    553.52  1601671.46
   127     70855    557.91  1614293.12
   128     71962    562.20  1627070.81
   129     72529    562.22  1627120.59
   130     73146    562.66  1628818.79

Confirming your setting took properly:
# ceph --admin-daem

Re: [ceph-users] Significant slowdown of osds since v0.67 Dumpling

2013-08-29 Thread Mike Dawson

Sam and Oliver,

We've had tons of issues with Dumpling rbd volumes showing sporadic 
periods of high latency for Windows guests doing lots of small writes. 
We saw the issue occasionally with Cuttlefish, but it got significantly 
worse with Dumpling. Initial results with wip-dumpling-perf2 appear very 
promising.


Thanks for your work! I'll report back tomorrow if I have any new results.

Thanks,

Mike Dawson
Co-Founder & Director of Cloud Architecture
Cloudapt LLC
6330 East 75th Street, Suite 170
Indianapolis, IN 46250

On 8/29/2013 2:52 PM, Oliver Daudey wrote:

Hey Mark and list,

FYI for you and the list: Samuel and I seem to have found and fixed the
remaining performance-problems.  For those who can't wait, fixes are in
"wip-dumpling-perf2" and will probably be in the next point-release.


Regards,

  Oliver

On 27-08-13 17:13, Mark Nelson wrote:

Ok, definitely let us know how it goes!  For what it's worth, I'm
testing Sam's wip-dumpling-perf branch with the wbthrottle code disabled
now and comparing it both to that same branch with it enabled along with
0.67.1.  Don't have any perf data, but quite a bit of other data to look
through, both in terms of RADOS bench and RBD.

Mark

On 08/27/2013 10:07 AM, Oliver Daudey wrote:

Hey Mark,

That will take a day or so for me to know with enough certainty.  With
the low CPU-usage and preliminary results today, I'm confident enough to
upgrade all OSDs in production and test the cluster all-Dumpling
tomorrow.  For now, I only upgraded a single OSD and measured CPU-usage
and whatever performance-effects that had on the cluster, so if I would
lose that OSD, I could recover. :-)  Will get back to you.


 Regards,

 Oliver

On 27-08-13 15:04, Mark Nelson wrote:

Hi Olver/Matthew,

Ignoring CPU usage, has speed remained slower as well?

Mark

On 08/27/2013 03:08 AM, Oliver Daudey wrote:

Hey Samuel,

The "PGLog::check()" is now no longer visible in profiling, so it
helped
for that.  Unfortunately, it doesn't seem to have helped to bring down
the OSD's CPU-loading much.  Leveldb still uses much more than in
Cuttlefish.  On my test-cluster, I didn't notice any difference in the
RBD bench-results, either, so I have to assume that it didn't help
performance much.

Here's the `perf top' I took just now on my production-cluster with
your
new version, under regular load.  Also note the "memcmp" and "memcpy",
which also don't show up when running a Cuttlefish-OSD:
15.65%  [kernel]              [k] intel_idle
 7.20%  libleveldb.so.1.9     [.] 0x3ceae
 6.28%  libc-2.11.3.so        [.] memcmp
 5.22%  [kernel]              [k] find_busiest_group
 3.92%  kvm                   [.] 0x2cf006
 2.40%  libleveldb.so.1.9     [.] leveldb::InternalKeyComparator::Compar
 1.95%  [kernel]              [k] _raw_spin_lock
 1.69%  [kernel]              [k] default_send_IPI_mask_sequence_phys
 1.46%  libc-2.11.3.so        [.] memcpy
 1.17%  libleveldb.so.1.9     [.] leveldb::Block::Iter::Next()
 1.16%  [kernel]              [k] hrtimer_interrupt
 1.07%  [kernel]              [k] native_write_cr0
 1.01%  [kernel]              [k] __hrtimer_start_range_ns
 1.00%  [kernel]              [k] clockevents_program_event
 0.93%  [kernel]              [k] find_next_bit
 0.93%  libstdc++.so.6.0.13   [.] std::string::_M_mutate(unsigned long,
 0.89%  [kernel]              [k] cpumask_next_and
 0.87%  [kernel]              [k] __schedule
 0.85%  [kernel]              [k] _raw_spin_unlock_irqrestore
 0.85%  [kernel]              [k] do_select
 0.84%  [kernel]              [k] apic_timer_interrupt
 0.80%  [kernel]              [k] fget_light
 0.79%  [kernel]              [k] native_write_msr_safe
 0.76%  [kernel]              [k] _raw_spin_lock_irqsave
 0.66%  libc-2.11.3.so        [.] 0xdc6d8
 0.61%  libpthread-2.11.3.so  [.] pthread_mutex_lock
 0.61%  [kernel]              [k] tg_load_down
 0.59%  [kernel]              [k] reschedule_interrupt
 0.59%  libsnappy.so.1.1.2    [.] snappy::RawUncompress(snappy::Source*,
 0.56%  libstdc++.so.6.0.13   [.] std::string::append(char const*, unsig
 0.54%  [kvm_intel]           [k] vmx_vcpu_run
 0.53%  [kernel]              [k] copy_user_generic_string
 0.53%  [kernel]              [k] load_balance
 0.50%  [kernel]              [k] rcu_needs_cpu
 0.45%  [kernel]              [k] fput


  Regards,

Oliver

On ma, 2013-08-26 at 23:33 -0700, Samuel Just wrote:

I just pushed a patch to wip-dumpling-log-assert (based on current
dumpling head

Re: [ceph-users] status of glance/cinder/nova integration in openstack grizzly

2013-09-10 Thread Mike Dawson

Darren,

I can confirm Copy on Write (show_image_direct_url = True) does work in 
Grizzly.


It sounds like you are close. To check permissions, run 'ceph auth 
list', and reply with "client.images" and "client.volumes" (or whatever 
keys you use in Glance and Cinder).


Cheers,
Mike Dawson


On 9/10/2013 10:12 AM, Darren Birkett wrote:

Hi All,

tl;dr - does glance/rbd and cinder/rbd play together nicely in grizzly?

I'm currently testing a ceph/rados back end with an openstack
installation.  I have the following things working OK:

1. cinder configured to create volumes in RBD
2. nova configured to boot from RBD backed cinder volumes (libvirt UUID
secret set etc)
3. glance configured to use RBD as a back end store for images

With this setup, when I create a bootable volume in cinder, passing an
id of an image in glance, the image gets downloaded, converted to raw,
and then created as an RBD object and made available to cinder.  The
correct metadata field for the cinder volume is populated
(volume_image_metadata) and so the cinder client marks the volume as
bootable.  This is all fine.

If I want to take advantage of the fact that both glance images and
cinder volumes are stored in RBD, I can add the following flag to the
glance-api.conf:

show_image_direct_url = True

This enables cinder to see that the glance image is stored in RBD, and
the cinder rbd driver then, instead of downloading the image and
creating an RBD image from it, just issues an 'rbd clone' command (seen
in the cinder-volume.log):

rbd clone --pool images --image dcb2f16d-a09d-4064-9198-1965274e214d
--snap snap --dest-pool volumes --dest
volume-20987f9d-b4fb-463d-8b8f-fa667bd47c6d

This is all very nice, and the cinder volume is available immediately as
you'd expect.  The problem is that the metadata field is not populated
so it's not seen as bootable.  Even manually populating this field
leaves the volume unbootable.  The volume can not even be attached to
another instance for inspection.

libvirt doesn't seem to be able to access the rbd device. From
nova-compute.log:

qemu-system-x86_64: -drive
file=rbd:volumes/volume-20987f9d-b4fb-463d-8b8f-fa667bd47c6d:id=volumes:key=AQAnAy9ScPB4IRAAtxD/V1rDciqFiT9AMPPr+A==:auth_supported=cephx\;none,if=none,id=drive-virtio-disk0,format=raw,serial=20987f9d-b4fb-463d-8b8f-fa667bd47c6d,cache=none:
error reading header from volume-20987f9d-b4fb-463d-8b8f-fa667bd47c6d

qemu-system-x86_64: -drive
file=rbd:volumes/volume-20987f9d-b4fb-463d-8b8f-fa667bd47c6d:id=volumes:key=AQAnAy9ScPB4IRAAtxD/V1rDciqFiT9AMPPr+A==:auth_supported=cephx\;none,if=none,id=drive-virtio-disk0,format=raw,serial=20987f9d-b4fb-463d-8b8f-fa667bd47c6d,cache=none:
could not open disk image
rbd:volumes/volume-20987f9d-b4fb-463d-8b8f-fa667bd47c6d:id=volumes:key=AQAnAy9ScPB4IRAAtxD/V1rDciqFiT9AMPPr+A==:auth_supported=cephx\;none:
Operation not permitted

It's almost like a permission issue, but my ceph/rbd knowledge is still
fledgeling.

I know that the cinder rbd driver has been rewritten to use librbd in
havana, and I'm wondering if this will change any of this behaviour?
  I'm also wondering if anyone has actually got this working with
grizzly, and how?

Many thanks
Darren



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] status of glance/cinder/nova integration in openstack grizzly

2013-09-10 Thread Mike Dawson


On 9/10/2013 4:50 PM, Darren Birkett wrote:

Hi Mike,

That led me to realise what the issue was.  My cinder (volumes) client
did not have the correct perms on the images pool.  I ran the following
to update the perms for that client:

ceph auth caps client.volumes mon 'allow r' osd 'allow class-read
object_prefix rbd_children, allow rwx pool=volumes, allow rx pool=images'

...and was then able to successfully boot an instance from a cinder
volume that was created by cloning a glance image from the images pool!

Glad you found it. This has been a sticking point for several people.



One last question: I presume the fact that the 'volume_image_metadata'
field is not populated when cloning a glance image into a cinder volume
is a bug?  It means that the cinder client doesn't show the volume as
bootable, though I'm not sure what other detrimental effect it actually
has (clearly the volume can be booted from).
I think you are talking about data in the cinder table of your database 
backend (mysql?). I don't have 'volume_image_metadata' at all here. I 
don't think this is the issue.


To create a Cinder volume from Glance, I do something like:

cinder --os_tenant_name MyTenantName create --image-id 
00e0042e-d007-400a-918a-d5e00cea8b0f --display-name MyVolumeName 40


I can then spin up an instance backed by MyVolumeName and boot as expected.



Thanks
Darren


On 10 September 2013 21:04, Darren Birkett mailto:darren.birk...@gmail.com>> wrote:

Hi Mike,

Thanks - glad to hear it definitely works as expected!  Here's my
client.glance and client.volumes from 'ceph auth list':

client.glance
key: AQAWFi9SOKzAABAAPV1ZrpWkx72tmJ5E7nOi3A==
caps: [mon] allow r
caps: [osd] allow rwx pool=images, allow class-read object_prefix
rbd_children
client.volumes
key: AQAnAy9ScPB4IRAAtxD/V1rDciqFiT9AMPPr+A==
caps: [mon] allow r
caps: [osd] allow class-read object_prefix rbd_children, allow rwx
pool=volumes

Thanks
Darren


On 10 September 2013 20:08, Mike Dawson mailto:mike.daw...@cloudapt.com>> wrote:

Darren,

I can confirm Copy on Write (show_image_direct_url = True) does
work in Grizzly.

It sounds like you are close. To check permissions, run 'ceph
auth list', and reply with "client.images" and "client.volumes"
(or whatever keys you use in Glance and Cinder).

Cheers,
Mike Dawson



On 9/10/2013 10:12 AM, Darren Birkett wrote:

Hi All,

tl;dr - does glance/rbd and cinder/rbd play together nicely
in grizzly?

I'm currently testing a ceph/rados back end with an openstack
installation.  I have the following things working OK:

1. cinder configured to create volumes in RBD
2. nova configured to boot from RBD backed cinder volumes
(libvirt UUID
secret set etc)
3. glance configured to use RBD as a back end store for images

With this setup, when I create a bootable volume in cinder,
passing an
id of an image in glance, the image gets downloaded,
converted to raw,
and then created as an RBD object and made available to
cinder.  The
correct metadata field for the cinder volume is populated
(volume_image_metadata) and so the cinder client marks the
volume as
bootable.  This is all fine.

If I want to take advantage of the fact that both glance
images and
cinder volumes are stored in RBD, I can add the following
flag to the
glance-api.conf:

show_image_direct_url = True

This enables cinder to see that the glance image is stored
in RBD, and
the cinder rbd driver then, instead of downloading the image and
creating an RBD image from it, just issues an 'rbd clone'
command (seen
in the cinder-volume.log):

rbd clone --pool images --image
dcb2f16d-a09d-4064-9198-__1965274e214d
--snap snap --dest-pool volumes --dest
volume-20987f9d-b4fb-463d-__8b8f-fa667bd47c6d

This is all very nice, and the cinder volume is available
immediately as
you'd expect.  The problem is that the metadata field is not
populated
so it's not seen as bootable.  Even manually populating this
field
leaves the volume unbootable.  The volume can not even be
attached to
another instance for inspection.

libvirt doesn't seem to be able to access the rbd device. From
nova-compute.log:

qemu-system-x86_64: -drive


Re: [ceph-users] Pause i/o from time to time

2013-09-17 Thread Mike Dawson
You could be suffering from a known but unfixed issue [1] where spindle 
contention from scrub and deep-scrub causes periodic stalls in RBD. You 
can try to disable scrub and deep-scrub with:


# ceph osd set noscrub
# ceph osd set nodeep-scrub

If your problem stops, Issue #6278 is likely the cause. To re-enable 
scrub and deep-scrub:


# ceph osd unset noscrub
# ceph osd unset nodeep-scrub

Because you seem to only have two OSDs, you may also be saturating your 
disks even without scrub or deep-scrub.
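
A quick way to tell the two apart is to watch per-disk utilisation on the OSD 
hosts while a stall is happening and check whether a scrub is running at the 
same time, e.g.:

# iostat -xm 5                  (sustained %util near 100 on the OSD disks = spindle contention)
# ceph pg dump | grep scrubbing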


http://tracker.ceph.com/issues/6278

Cheers,
Mike Dawson


On 9/16/2013 12:30 PM, Timofey wrote:

I use ceph for HA-cluster.
Some time ceph rbd go to have pause in work (stop i/o operations). Sometime it 
can be when one of OSD slow response to requests. Sometime it can be my mistake 
(xfs_freeze -f for one of OSD-drive).
I have 2 storage servers with one osd on each. This pauses can be few minutes.

1. Is any settings for fast change primary osd if current osd work bad (slow, 
don't response).
2. Can I use ceph-rbd in software raid-array with local drive, for use local 
drive instead of ceph if ceph cluster fail?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD and Journal Files

2013-09-18 Thread Mike Dawson

Ian,

There are two schools of thought here. Some people say, run the journal 
on a separate partition on the spinner alongside the OSD partition, and 
don't mess with SSDs for journals. This may be the best practice for an 
architecture of high-density chassis.


The other design is to use SSDs for journals, but design with an 
appropriate ratio of journals per SSD. Plus, you need to understand 
losing an SSD will cause the loss of ALL of the OSDs which had their 
journal on the failed SSD.


For now, I'll assume you want to use SSDs and offer some suggestions.

First, you probably don't want RAID1 for the journal SSDs. It isn't 
particularly needed for resiliency and certainly isn't beneficial from a 
throughput perspective.


Next, the best practice is to have enough throughput in the Journals 
(SSDs) so your OSDs (spinners) aren't starved. Let's assume your SSDs 
sustain writes at 450MB/s and the spinners can do 120MB/s.


450MB/s divided by 120MB/s = 3.75

Which I would round to a ratio of four OSD Journals on each SSD.

Since it appears you are using 24-drive chassis and the first two drives 
are taken by the RAID1 set for the OS, you have 22 drives left. You 
could do:


- 4 SSDs, each with 4 Journals
- 16 spinners, each running an OSD process
- 2 RAID1 OS
- 2 Empty

Or, if you want to push the ratio a bit farther (6 OSD journals on an SSD):

- 3 SSDs, each with 6 Journals
- 18 spinners, each running an OSD process
- 1 spinner for OS (no RAID1)

Because your 10Gb network will peak at 1,250MB/s the 6:1 ratio shown 
above should be fine (as you're limited to ~70MB/s for each OSD by the 
network anyway).


I think you'll be OK on CPU and RAM.

Journals are small (default of 1GB, I run 10GB). Create a 10GB 
unformatted partition for each journal and leave the rest of the SSD 
unallocated (it will be used for wear-leveling). If you use 
high-endurance SSDs, you could certainly consider smaller drives as long 
as they maintain sufficient performance characteristics.
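
As a hedged example of carving those journal partitions (device names and the 
ceph-deploy syntax are illustrative only):

parted -s /dev/sde mklabel gpt
parted -s /dev/sde mkpart journal-0 1MiB 10GiB
parted -s /dev/sde mkpart journal-1 10GiB 20GiB
parted -s /dev/sde mkpart journal-2 20GiB 30GiB
parted -s /dev/sde mkpart journal-3 30GiB 40GiB

ceph-deploy osd create node1:sdb:/dev/sde1      # one spinner : one journal partition

together with "osd journal size = 10240" (MB) in ceph.conf so the journal fills 
its partition.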


Thanks,

Mike Dawson
Co-Founder & Director of Cloud Architecture
Cloudapt LLC


On 9/18/2013 9:52 AM, ian_m_por...@dell.com wrote:

*Dell - Internal Use - Confidential *

Hi,

I read in the ceph documentation that one of the main performance snags
in ceph was running the OSDs and journal files on the same disks and you
should consider at a minimum running the journals on SSDs.

Given I am looking to design a 150 TB cluster, I’m considering the
following configuration for the storage nodes

No of replicas: 3

Each node

·18 x 1 TB for storage (1 OSD per node, journals for each OSD are stored
to volume on SSD)

·2  x 512 GB SSD drives configured as RAID 1  to store the journal files
(assuming journal files are not replicated, correct me if Im wrong)

·2 x 300 GB drives for OS/software (RAID 1)

·48 GB RAM

·2 x 10 Gb for public and storage network

·1 x 1 Gb for management network

·Dual E2660 CPU

No of nodes required for 150 TB = 150*3/(18*1) = 25

Unfortunately I don’t have any metrics on the throughput into the
cluster so I can’t tell whether 512 GB for journal files will be
sufficient so it’s a best guess and may be overkill. Also, any concerns
regarding number of OSDs running on each node, ive seen some articles on
the web saying the sweet spot is around 8 OSDs per node?

Thanks

Ian

Dell Corporation Limited is registered in England and Wales. Company
Registration Number: 2081369
Registered address: Dell House, The Boulevard, Cain Road, Bracknell,
Berkshire, RG12 1LF, UK.
Company details for other Dell UK entities can be found on  www.dell.co.uk.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD and Journal Files

2013-09-18 Thread Mike Dawson

Joseph,

With properly architected failure domains and replication in a Ceph 
cluster, RAID1 has diminishing returns.


A well-designed CRUSH map should allow for failures at any level of your 
hierarchy (OSDs, hosts, racks, rows, etc) while protecting the data with 
a configurable number of copies.


That being said, losing a series of six OSDs is certainly a hassle and 
journals on a RAID1 set could help prevent that senerio.


But where do you stop? 3 monitors, 5, 7? RAID1 for OSDs, too? 3x 
replication, 4x, 10x? I suppose each operator gets to decide how far to 
chase the diminishing returns.


Thanks,

Mike Dawson
Co-Founder & Director of Cloud Architecture
Cloudapt LLC

On 9/18/2013 1:27 PM, Gruher, Joseph R wrote:




-Original Message-
From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users-
boun...@lists.ceph.com] On Behalf Of Mike Dawson

you need to understand losing an SSD will cause
the loss of ALL of the OSDs which had their journal on the failed SSD.

First, you probably don't want RAID1 for the journal SSDs. It isn't particularly
needed for resiliency and certainly isn't beneficial from a throughput
perspective.


Sorry, can you clarify this further for me?  If losing the SSD would cause 
losing all the OSDs journaling on it why would you not want to RAID it?


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph performance with 8K blocks.

2013-09-18 Thread Mike Lowe
Well, in a word, yes. You really expect a network replicated storage system in 
user space to be comparable to direct attached ssd storage?  For what it's 
worth, I've got a pile of regular spinning rust, this is what my cluster will 
do inside a vm with rbd writeback caching on.  As you can see, latency is 
everything.

dd if=/dev/zero of=1g bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 6.26289 s, 171 MB/s
dd if=/dev/zero of=1g bs=1M count=1024 oflag=dsync
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 37.4144 s, 28.7 MB/s

As you can see, latency is a killer.
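
If you want to take qemu and the guest out of the picture entirely, rados bench 
can drive the cluster at the same small block size directly; a hedged example 
(pool name, runtime and thread count are arbitrary):

rados bench -p rbd 30 write -b 8192 -t 16

That at least tells you whether the factor-of-10 gap is cluster/network latency 
or something in the VM stack.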

On Sep 18, 2013, at 3:23 PM, Jason Villalta  wrote:

> Any other thoughts on this thread guys.  I am just crazy to want near native 
> SSD performance on a small SSD cluster?
> 
> 
> On Wed, Sep 18, 2013 at 8:21 AM, Jason Villalta  wrote:
> That dd give me this.
> 
> dd if=ddbenchfile of=- bs=8K | dd if=- of=/dev/null bs=8K
> 819200 bytes (8.2 GB) copied, 31.1807 s, 263 MB/s 
> 
> Which makes sense because the SSD is running as SATA 2 which should give 
> 3Gbps or ~300MBps
> 
> I am still trying to better understand the speed difference between the small 
> block speeds seen with dd vs the same small object size with rados.  It is 
> not a difference of a few MB per sec.  It seems to nearly be a factor of 10.  
> I just want to know if this is a hard limit in Ceph or a factor of the 
> underlying disk speed.  Meaning if I use spindles to read data would the 
> speed be the same or would the read speed be a factor of 10 less than the 
> speed of the underlying disk?
> 
> 
> On Wed, Sep 18, 2013 at 4:27 AM, Alex Bligh  wrote:
> 
> On 17 Sep 2013, at 21:47, Jason Villalta wrote:
> 
> > dd if=ddbenchfile of=/dev/null bs=8K
> > 819200 bytes (8.2 GB) copied, 19.7318 s, 415 MB/s
> 
> As a general point, this benchmark may not do what you think it does, 
> depending on the version of dd, as writes to /dev/null can be heavily 
> optimised.
> 
> Try:
>   dd if=ddbenchfile of=- bs=8K | dd if=- of=/dev/null bs=8K
> 
> --
> Alex Bligh
> 
> 
> 
> 
> 
> 
> 
> -- 
> -- 
> Jason Villalta
> Co-founder
> 
> 800.799.4407x1230 | www.RubixTechnology.com
> 
> 
> 
> -- 
> -- 
> Jason Villalta
> Co-founder
> 
> 800.799.4407x1230 | www.RubixTechnology.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] radosgw-admin unable to list or store user info after upgrade

2013-09-26 Thread Mike O'Toole
After upgrading from Cuttlefish to Dumpling I am no longer able to obtain user 
information from the rados gateway.

radosgw-admin user info
could not fetch user info: no user info saved

radosgw-admin user create --uid=bob --display-name="bob"
could not create user: unable to create user, unable to store user info
Has anyone else experienced this problem?  Thanks, Mike 
  ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD Snap removal priority

2013-09-27 Thread Mike Dawson

[cc ceph-devel]

Travis,

RBD doesn't behave well when Ceph maintenance operations create spindle 
contention (i.e. 100% util from iostat). More about that below.


Do you run XFS under your OSDs? If so, can you check for extent 
fragmentation? Should be something like:


xfs_db -c frag -r /dev/sdb1

We recently saw a fragmentation factor of over 80%, with lots of ino's 
having hundreds of extents. After 24 hours+ of defrag'ing, we got it 
under control, but we're seeing the fragmentation factor grow by ~1.5% 
daily. We experienced spindle contention issues even after the defrag.
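
For anyone wanting to do the same, the defragmenter is xfs_fsr from xfsprogs; a 
hedged example, best run during a quiet window and one OSD at a time:

xfs_fsr -v /var/lib/ceph/osd/ceph-0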




Sage, Sam, etc,

I think the real issue is that Ceph has several states where it performs what 
I would call "maintenance operations" that saturate the underlying 
storage without properly yielding to client i/o (which should have a 
higher priority).


I have experienced or seen reports of Ceph maintenance affecting rbd 
client i/o in many ways:


- QEMU/RBD Client I/O Stalls or Halts Due to Spindle Contention from 
Ceph Maintenance [1]

- Recovery and/or Backfill Cause QEMU/RBD Reads to Hang [2]
- rbd snap rm (Travis' report below)

[1] http://tracker.ceph.com/issues/6278
[2] http://tracker.ceph.com/issues/6333

I think this family of issues speaks to the need for Ceph to have more 
visibility into the underlying storage's limitations (especially spindle 
contention) when performing known expensive maintenance operations.
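
For what it's worth, the throttles that exist today only cover recovery and 
backfill, not snap trimming or scrub; a hedged example of dialing them down 
(values are illustrative):

[osd]
    osd max backfills = 1
    osd recovery max active = 1
    osd recovery op priority = 1
    osd client op priority = 63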


Thanks,
Mike Dawson

On 9/27/2013 12:25 PM, Travis Rhoden wrote:

Hello everyone,

I'm running a Cuttlefish cluster that hosts a lot of RBDs.  I recently
removed a snapshot of a large one (rbd snap rm -- 12TB), and I noticed
that all of the clients had markedly decreased performance.  Looking
at iostat on the OSD nodes had most disks pegged at 100% util.

I know there are thread priorities that can be set for clients vs
recovery, but I'm not sure what deleting a snapshot falls under.  I
couldn't really find anything relevant.  Is there anything I can tweak
to lower the priority of such an operation?  I didn't need it to
complete fast, as "rbd snap rm" returns immediately and the actual
deletion is done asynchronously.  I'd be fine with it taking longer at
a lower priority, but as it stands now it brings my cluster to a crawl
and is causing issues with several VMs.

I see an "osd snap trim thread timeout" option in the docs -- Is the
operation occuring here what you would call snap trimming?  If so, any
chance of adding an option for "osd snap trim priority" just like
there is for osd client op and osd recovery op?

Hope what I am saying makes sense...

  - Travis
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] issues with 'https://ceph.com/git/?p=ceph.git; a=blob_plain; f=keys/release.asc'

2013-09-30 Thread Mike O'Toole
I have had the same issues.  

From: qgra...@onq.com.au
To: ceph-users@lists.ceph.com
Date: Mon, 30 Sep 2013 00:01:11 +
Subject: [ceph-users] issues with 'https://ceph.com/git/?p=ceph.git; 
a=blob_plain; f=keys/release.asc'









Hey Guys,
 
Looks like 'https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc' 
is down.
 
Regards,
Quenten Grasso




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com  
  ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Weird behavior of PG distribution

2013-10-01 Thread Mike Dawson

Ching-Cheng,

Data placement is handled by CRUSH. Please examine the following:

ceph osd getcrushmap -o crushmap && crushtool -d crushmap -o 
crushmap.txt && cat crushmap.txt


That will show the topology and placement rules Ceph is using.
Pay close attention to the "step chooseleaf" lines inside the rule for 
each pool. Under certain configurations, I believe the placement that 
you describe is in fact the expected behavior.
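
For example, the difference between these two (generic, hedged) variants is 
whether replicas merely land on different OSDs or are forced onto different hosts:

step chooseleaf firstn 0 type osd       # replicas may share a host
step chooseleaf firstn 0 type host      # replicas must be on different hosts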


Thanks,

Mike Dawson
Co-Founder, Cloudapt LLC


On 10/1/2013 10:46 AM, Chen, Ching-Cheng (KFRM 1) wrote:

Found a weird behavior (or looks like weird) with ceph 0.67.3

I have 5 servers.  Monitor runs on server 1.   And server 2 to 5 have
one OSD running each (osd.0 - osd.3)

I did a 'ceph pg dump'.  I can see PGs got somehow randomly distributed
to all 4 OSDs which is expected behavior.

However, if I bring up one OSD in the same server running monitor.   It
seems all PGs has their primary ODS move to this new OSD.  After I add a
new OSD (osd.4) to the same server running monitor, the 'ceph pg dump'
command showing active OSDs as [4,x] for all PGs.

Is this expected behavior??

Regards,

Chen

Ching-Cheng Chen

*CREDIT SUISSE*

Information Technology | MDS - New York, KVBB 41

One Madison Avenue | 10010 New York | United States

Phone +1 212 538 8031 | Mobile +1 732 216 7939

chingcheng.c...@credit-suisse.com
<mailto:chingcheng.c...@credit-suisse.com> | www.credit-suisse.com
<http://www.credit-suisse.com>



==
Please access the attached hyperlink for an important electronic
communications disclaimer:
http://www.credit-suisse.com/legal/en/disclaimer_email_ib.html
==




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph and RAID

2013-10-03 Thread Mike Dawson
Currently Ceph uses replication. Each pool is set with a replication 
factor. A replication factor of 1 obviously offers no redundancy. 
Replication factors of 2 or 3 are common. So, Ceph currently halves or 
thirds your usable storage, accordingly. Also, note you can co-mingle 
pools of various replication factors, so the actual math can get more 
complicated.
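
The replication factor is just a per-pool setting, e.g. (pool name is an example):

ceph osd pool set rbd size 3
ceph osd pool set rbd min_size 2
ceph osd pool get rbd size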


There is a team of developers building an Erasure Coding backend for 
Ceph that will allow for more options.


http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend

http://wiki.ceph.com/01Planning/02Blueprints/Emperor/Erasure_coded_storage_backend_%28step_2%29

Initial release is scheduled for Ceph's Firefly release in February 2014.


Thanks,

Mike Dawson
Co-Founder & Director of Cloud Architecture
Cloudapt LLC

On 10/3/2013 2:44 PM, Aronesty, Erik wrote:

Does Ceph really halve your storage like that?

If you specify N+1, does it really store two copies, or just compute 
checksums across MxN stripes?  I guess Raid5+Ceph with a large array (12 disks 
say) would be not too bad (2.2TB for each 1).

But it would be nicer, if I had 12 storage units in a single rack on a single 
network, for me to tell CEPH to stripe across them in a RAIDZ fashion, so that 
I'm only losing 10% of my storage to redundancy... not 50%.

-Original Message-
From: ceph-users-boun...@lists.ceph.com 
[mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of John-Paul Robinson
Sent: Thursday, October 03, 2013 12:08 PM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph and RAID

What's the take on such a configuration?

Is it worth the effort of tracking "rebalancing" at two layers, RAID
mirror and possibly Ceph if the pool has a redundancy policy.  Or is it
better to just let ceph rebalance itself when you lose a non-mirrored disk?

If following the "raid mirror" approach, would you then skip redundancy
at the ceph layer to keep your total overhead the same?  It seems that
would be risky in the event you lose your storage server with the
raid-1'd drives.  No Ceph level redundancy would then be fatal.  But if
you do raid-1 plus ceph redundancy, doesn't that mean it takes 4TB for
each 1 real TB?

~jpr

On 10/02/2013 10:03 AM, Dimitri Maziuk wrote:

I would consider (mdadm) raid-1, dep. on the hardware & budget,
because this way a single disk failure will not trigger a cluster-wide
rebalance.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] About Ceph SSD and HDD strategy

2013-10-07 Thread Mike Lowe
Based on my experience I think you are grossly underestimating the expense and 
frequency of flushes issued from your vm's.  This will be especially bad if you 
aren't using the async flush from qemu >= 1.4.2 as the vm is suspended while 
qemu waits for the flush to finish.  I think your best course of action until 
the caching pool work is completed (I think I remember correctly that this is 
currently in development) is to either use the ssd's as large caches with 
bcache or to use them for journal devices.  I'm sure there are some other more 
informed opinions out there on the best use of ssd's in a ceph cluster and 
hopefully they will chime in.
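
If you go the bcache route, the setup is roughly the following hedged sketch 
(bcache-tools plus a 3.10+ kernel; device names are placeholders):

make-bcache -C /dev/sdf                                # the SSD becomes the cache device
make-bcache -B /dev/sdb                                # the spinner becomes the backing device
echo <cset-uuid> > /sys/block/bcache0/bcache/attach    # attach backing device to the cache set
mkfs.xfs /dev/bcache0                                  # then build the OSD filesystem on bcache0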

On Oct 6, 2013, at 9:23 PM, Martin Catudal  wrote:

> Hi Guys,
> I read all Ceph documentation more than twice. I'm now very 
> comfortable with all the aspect of Ceph except for the strategy of using 
> my SSD and HDD.
> 
> Here is my reflexion
> 
> I've two approach in my understanding about use fast SSD (900 GB) for my 
> primary storage and huge but slower HDD (4 TB) for replicas.
> 
> FIRST APPROACH
> 1. I can use PG with cache write enable as my primary storage that's 
> goes on my SSD and let replicas goes on my 7200 RPM.
>  With the cache write enable, I will gain performance for my VM 
> user machine in VDI environment since Ceph client will not have to wait 
> for the replicas write confirmation on the slower HDD.
> 
> SECOND APPROACH
> 2. Use pools hierarchies and let's have one pool for the SSD as primary 
> and lets the replicas goes to a second pool name platter for HDD 
> replication.
> As explain in the Ceph documentation
> rule ssd-primary {
>   ruleset 4
>   type replicated
>   min_size 5
>   max_size 10
>   step take ssd
>   step chooseleaf firstn 1 type host
>   step emit
>   step take platter
>   step chooseleaf firstn -1 type host
>   step emit
>   }
> 
> At this point, I could not figure out what approach could have the most 
> advantage.
> 
> Your point of view would definitely help me.
> 
> Sincerely,
> Martin
> 
> -- 
> Martin Catudal
> Responsable TIC
> Ressources Metanor Inc
> Ligne directe: (819) 218-2708
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Expanding ceph cluster by adding more OSDs

2013-10-09 Thread Mike Lowe
You can add PGs,  the process is called splitting.  I don't think PG merging, 
the reduction in the number of PGs, is ready yet.
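
The split itself is just a pool setting, applied in two steps and not reversible 
(pool name and count are examples):

ceph osd pool set data pg_num 1024
ceph osd pool set data pgp_num 1024

pgp_num has to follow pg_num before the new PGs are actually used for placement.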

On Oct 8, 2013, at 11:58 PM, Guang  wrote:

> Hi ceph-users,
> Ceph recommends the PGs number of a pool is (100 * OSDs) / Replicas, per my 
> understanding, the number of PGs for a pool should be fixed even we scale out 
> / in the cluster by adding / removing OSDs, does that mean if we double the 
> OSD numbers, the PG number for a pool is not optimal any more and there is no 
> chance to correct it?
> 
> 
> Thanks,
> Guang
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] indexing object store with SOLR

2013-10-09 Thread Mike O'Toole
All, I have been prototyping an object store and am looking at a way to index 
content and metadata.  Has anyone looked at doing anything similar?  I would be 
interested in kicking around some ideas. I'd really like to implement something 
with Apache Solr or something similar.  
Thanks, Mike  ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] retrieving usage information via admin API

2013-10-10 Thread Mike O'Toole

Yes, you need to enable usage logging in your ceph.conf ... something like:

rgw enable usage log = true
rgw usage log tick interval = 30
rgw usage log flush threshold = 1024
rgw usage max shards = 32
rgw usage max user shards = 1

You can find more info here: http://ceph.com/docs/master/radosgw/config-ref/
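
Once the usage log is on and some requests have gone through the gateway, 
something like this (uid and date are examples) should start returning entries:

radosgw-admin usage show --uid=bob --start-date=2013-10-01
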
Mike

Date: Thu, 10 Oct 2013 11:17:37 +0100
From: watering...@gmail.com
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] retrieving usage information via admin API

Hi All,
The admin API is working fine for me (I've tested adding users, etc.) however 
when I attempt to query (GET) /admin/usage?format=json I simply receive the 
following back:

{"entries":[],"summary":[]}

Is there some configuration I need to make to radosgw or elsewhere to capture 
and or return the responses outlined on 
http://ceph.com/docs/master/radosgw/adminops/#response-entities?

Thanks!

-Matt

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com  
  ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] osds and gateway not coming up on restart

2013-10-10 Thread Mike O'Toole
So I took a power hit today and after coming back up 3 of my osds and my 
radosgw are not coming back up.  The logs show no clue as to what may have 
happened. 
When I manually try to restart the gateway I see the following in the logs:
2013-10-10 16:04:23.166046 7f8480d9a700  2 RGWDataChangesLog::ChangesRenewThread: start
2013-10-10 16:04:45.166193 7f8480d9a700  2 RGWDataChangesLog::ChangesRenewThread: start
2013-10-10 16:05:07.166335 7f8480d9a700  2 RGWDataChangesLog::ChangesRenewThread: start
2013-10-10 16:05:29.166501 7f8480d9a700  2 RGWDataChangesLog::ChangesRenewThread: start
2013-10-10 16:05:51.166638 7f8480d9a700  2 RGWDataChangesLog::ChangesRenewThread: start
2013-10-10 16:06:13.166762 7f8480d9a700  2 RGWDataChangesLog::ChangesRenewThread: start
2013-10-10 16:06:35.166914 7f8480d9a700  2 RGWDataChangesLog::ChangesRenewThread: start
2013-10-10 16:06:57.167055 7f8480d9a700  2 RGWDataChangesLog::ChangesRenewThread: start
2013-10-10 16:07:10.196475 7f848535c700 -1 Initialization timeout, failed to initialize
and then the process dies.
As for the OSDs, there is no logging.  I try to manually start them and it
reports they are already running, although there are no OSD pids on that server.

$ sudo start ceph-all
start: Job is already running: ceph-all

Any ideas where to look for more info on these two issues?  I am running ceph
0.67.3.

Cluster status:
HEALTH_WARN 78 pgs down; 78 pgs peering; 78 pgs stuck inactive;
78 pgs stuck unclean; 16 requests are blocked > 32 sec; 1 osds have slow
requests

ceph osd stat
e134: 18 osds: 15 up, 15 in
Thanks, Mike  ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osds and gateway not coming up on restart

2013-10-10 Thread Mike O'Toole





Sorry, I didn't mean it logged nothing, I just saw no clues that were apparent
to me. Subsequent restart attempts log nothing. Here are the last few lines:
2013-10-10 15:19:49.832776 7f94567ac700 10 -- 10.10.2.202:6809/15169 reaper deleted pipe 0x1e04500
2013-10-10 15:19:49.832785 7f94567ac700 10 -- 10.10.2.202:6809/15169 reaper done
2013-10-10 15:19:49.832805 7f945c5d97c0 10 -- 10.10.2.202:6809/15169 wait: waiting for dispatch queue
2013-10-10 15:19:49.832829 7f945c5d97c0 10 -- 10.10.2.202:6809/15169 wait: dispatch queue is stopped
2013-10-10 15:19:49.832837 7f945c5d97c0 20 -- 10.10.2.202:6809/15169 wait: stopping accepter thread
2013-10-10 15:19:49.832841 7f945c5d97c0 10 accepter.stop accepter
2013-10-10 15:19:49.832864 7f944c17a700 20 accepter.accepter poll got 1
2013-10-10 15:19:49.832874 7f944c17a700 20 accepter.accepter closing
2013-10-10 15:19:49.832891 7f944c17a700 10 accepter.accepter stopping
2013-10-10 15:19:49.832956 7f945c5d97c0 20 -- 10.10.2.202:6809/15169 wait: stopped accepter thread
2013-10-10 15:19:49.832969 7f945c5d97c0 20 -- 10.10.2.202:6809/15169 wait: stopping reaper thread
2013-10-10 15:19:49.832995 7f94567ac700 10 -- 10.10.2.202:6809/15169 reaper_entry done
2013-10-10 15:19:49.833072 7f945c5d97c0 20 -- 10.10.2.202:6809/15169 wait: stopped reaper thread
2013-10-10 15:19:49.833082 7f945c5d97c0 10 -- 10.10.2.202:6809/15169 wait: closing pipes
2013-10-10 15:19:49.833086 7f945c5d97c0 10 -- 10.10.2.202:6809/15169 reaper
2013-10-10 15:19:49.833090 7f945c5d97c0 10 -- 10.10.2.202:6809/15169 reaper done
2013-10-10 15:19:49.833093 7f945c5d97c0 10 -- 10.10.2.202:6809/15169 wait: waiting for pipes  to close
2013-10-10 15:19:49.833097 7f945c5d97c0 10 -- 10.10.2.202:6809/15169 wait: done.
2013-10-10 15:19:49.833101 7f945c5d97c0  1 -- 10.10.2.202:6809/15169 shutdown complete.



> Date: Thu, 10 Oct 2013 23:04:05 +0200
> From: w...@42on.com
> To: ceph-users@lists.ceph.com
> CC: mike.oto...@outlook.com
> Subject: Re: [ceph-users] osds and gateway not coming up on restart
> 
> On 10/10/2013 11:01 PM, Mike O'Toole wrote:
> >
> > I created them with ceph-deploy and there are no OSD entries in the
> > ceph.conf.  Trying to start them that way doesnt work.
> >
> 
> (bringing discussion back to the list)
> 
> Are you sure there is no logging? Because there should be in /var/log/ceph
> 
> Wido
> 
> >
> >  > Date: Thu, 10 Oct 2013 22:57:29 +0200
> >  > From: w...@42on.com
> >  > To: mike.oto...@outlook.com
> >  > Subject: Re: [ceph-users] osds and gateway not coming up on restart
> >  >
> >  > On 10/10/2013 10:54 PM, Mike O'Toole wrote:
> >  > > I verified the OSDs were not running and I issued "sudo stop ceph-all"
> >  > > and "sudo start ceph-all" but nothing comes up. The OSDS are all on the
> >  > > same server. The file systems are xfs and I am able to mount them.
> >  >
> >  > Could you try starting them manually via:
> >  >
> >  > $ service ceph start osd.X
> >  >
> >  > where X is the OSD number of those three OSDs.
> >  >
> >  > If that doesn't work, check the logs of the OSDs why they aren't
> > starting.
> >  >
> >  > I'm not so familiar with the upstart scripts from Ceph, but I think it
> >  > only starts the OSDs when they have been created via ceph-deploy thus
> >  > ceph-disk-prepare and ceph-disk-activate
> >  >
> >  > Wido
> >  >
> >  > >
> >  > > /dev/sdb1 931G 1.1G 930G 1% /data-1
> >  > > /dev/sdb2 931G 1.1G 930G 1% /data-2
> >  > > /dev/sdb3 931G 1.1G 930G 1% /data-3
> >  > >
> >  > > Interestingly though they are empty.
> >  > >
> >  > > > Date: Thu, 10 Oct 2013 22:46:26 +0200
> >  > > > From: w...@42on.com
> >  > > > To: ceph-users@lists.ceph.com
> >  > > > Subject: Re: [ceph-users] osds and gateway not coming up on restart
> >  > > >
> >  > > > On 10/10/2013 10:43 PM, Mike O'Toole wrote:
> >  > > > > So I took a power hit today and after coming back up 3 of my osds
> >  > > and my
> >  > > > > radosgw are not coming back up. The logs show no clue as to
> > what may
> >  > > > > have happened.
> >  > > > >
> >  > > > > When I manually try to restart the gateway I see the following in
> >  > > the logs:
> >  > > > >
> >  > > > > 2013-10-10 16:04:23.166046 7f8480d9a700 2
> >  > > > > RGWDataChangesLog::ChangesRenewThread: start
> >  > > > 

Re: [ceph-users] osds and gateway not coming up on restart

2013-10-10 Thread Mike O'Toole
I think I just figured out the issue.  When I installed ceph I had already 
prepared the three partitions for these OSDS and had fstab entries for mounting 
them.  When I first did this install I didn't realize that ceph would mount the 
file system for you.  I removed the entries and everything came back up.
Thanks for your help!


From: mike.oto...@outlook.com
To: w...@42on.com; ceph-users@lists.ceph.com
Date: Thu, 10 Oct 2013 17:27:50 -0400
Subject: Re: [ceph-users] osds and gateway not coming up on restart









Sorry, I didn't mean it logged nothing, I just saw no clues that were apparent
to me. Subsequent restart attempts log nothing. Here are the last few lines:

2013-10-10 15:19:49.832776 7f94567ac700 10 -- 10.10.2.202:6809/15169 reaper deleted pipe 0x1e04500
2013-10-10 15:19:49.832785 7f94567ac700 10 -- 10.10.2.202:6809/15169 reaper done
2013-10-10 15:19:49.832805 7f945c5d97c0 10 -- 10.10.2.202:6809/15169 wait: waiting for dispatch queue
2013-10-10 15:19:49.832829 7f945c5d97c0 10 -- 10.10.2.202:6809/15169 wait: dispatch queue is stopped
2013-10-10 15:19:49.832837 7f945c5d97c0 20 -- 10.10.2.202:6809/15169 wait: stopping accepter thread
2013-10-10 15:19:49.832841 7f945c5d97c0 10 accepter.stop accepter
2013-10-10 15:19:49.832864 7f944c17a700 20 accepter.accepter poll got 1
2013-10-10 15:19:49.832874 7f944c17a700 20 accepter.accepter closing
2013-10-10 15:19:49.832891 7f944c17a700 10 accepter.accepter stopping
2013-10-10 15:19:49.832956 7f945c5d97c0 20 -- 10.10.2.202:6809/15169 wait: stopped accepter thread
2013-10-10 15:19:49.832969 7f945c5d97c0 20 -- 10.10.2.202:6809/15169 wait: stopping reaper thread
2013-10-10 15:19:49.832995 7f94567ac700 10 -- 10.10.2.202:6809/15169 reaper_entry done
2013-10-10 15:19:49.833072 7f945c5d97c0 20 -- 10.10.2.202:6809/15169 wait: stopped reaper thread
2013-10-10 15:19:49.833082 7f945c5d97c0 10 -- 10.10.2.202:6809/15169 wait: closing pipes
2013-10-10 15:19:49.833086 7f945c5d97c0 10 -- 10.10.2.202:6809/15169 reaper
2013-10-10 15:19:49.833090 7f945c5d97c0 10 -- 10.10.2.202:6809/15169 reaper done
2013-10-10 15:19:49.833093 7f945c5d97c0 10 -- 10.10.2.202:6809/15169 wait: waiting for pipes  to close
2013-10-10 15:19:49.833097 7f945c5d97c0 10 -- 10.10.2.202:6809/15169 wait: done.
2013-10-10 15:19:49.833101 7f945c5d97c0  1 -- 10.10.2.202:6809/15169 shutdown complete.



> Date: Thu, 10 Oct 2013 23:04:05 +0200
> From: w...@42on.com
> To: ceph-users@lists.ceph.com
> CC: mike.oto...@outlook.com
> Subject: Re: [ceph-users] osds and gateway not coming up on restart
> 
> On 10/10/2013 11:01 PM, Mike O'Toole wrote:
> >
> > I created them with ceph-deploy and there are no OSD entries in the
> > ceph.conf.  Trying to start them that way doesnt work.
> >
> 
> (bringing discussion back to the list)
> 
> Are you sure there is no logging? Because there should be in /var/log/ceph
> 
> Wido
> 
> >
> >  > Date: Thu, 10 Oct 2013 22:57:29 +0200
> >  > From: w...@42on.com
> >  > To: mike.oto...@outlook.com
> >  > Subject: Re: [ceph-users] osds and gateway not coming up on restart
> >  >
> >  > On 10/10/2013 10:54 PM, Mike O'Toole wrote:
> >  > > I verified the OSDs were not running and I issued "sudo stop ceph-all"
> >  > > and "sudo start ceph-all" but nothing comes up. The OSDS are all on the
> >  > > same server. The file systems are xfs and I am able to mount them.
> >  >
> >  > Could you try starting them manually via:
> >  >
> >  > $ service ceph start osd.X
> >  >
> >  > where X is the OSD number of those three OSDs.
> >  >
> >  > If that doesn't work, check the logs of the OSDs why they aren't
> > starting.
> >  >
> >  > I'm not so familiar with the upstart scripts from Ceph, but I think it
> >  > only starts the OSDs when they have been created via ceph-deploy thus
> >  > ceph-disk-prepare and ceph-disk-activate
> >  >
> >  > Wido
> >  >
> >  > >
> >  > > /dev/sdb1 931G 1.1G 930G 1% /data-1
> >  > > /dev/sdb2 931G 1.1G 930G 1% /data-2
> >  > > /dev/sdb3 931G 1.1G 930G 1% /data-3
> >  > >
> >  > > Interestingly though they are empty.
> >  > >
> >  > > > Date: Thu, 10 Oct 2013 22:46:26 +0200
> >  > > > From: w...@42on.com
> >  > > > To: ceph-users@lists.ceph.com
> >  > > > Subject: Re: [ceph-users] osds and gateway not coming up on restart
> >  > > >
> >  > > > On 10/10/2013 10:43 PM, Mike O'Toole wrote:
> >  > > > > So I took a power hit today and after coming back up 3 of my osds
> >  > > and my
> >

Re: [ceph-users] kvm live migrate wil ceph

2013-10-16 Thread Mike Lowe
I wouldn't go so far as to say putting a vm in a file on a networked filesystem 
is wrong.  It is just not the best choice if you have a ceph cluster at hand, 
in my opinion.  Networked filesystems have a bunch of extra stuff to implement 
posix semantics and live in kernel space.  You just need simple block device 
semantics and you don't need to entangle the hypervisor's kernel space.  What 
it boils down to is the engineering first principle of selecting the least 
complicated solution that satisfies the requirements of the problem. You don't 
get anything when you trade the simplicity of rbd for the complexity of a 
networked filesystem.

For format 2 I think the only caveat is that it requires newer clients and the 
kernel client takes some time to catch up to the user space clients.  You may 
not be able to mount filesystems on rbd devices with the kernel client 
depending on kernel version, this may or may not be important to you.  You can 
always use a vm to mount a filesystem on a rbd device as a work around.  

On Oct 16, 2013, at 9:11 AM, Jon  wrote:

> Hello Michael,
> 
> Thanks for the reply.  It seems like ceph isn't actually "mounting" the rbd 
> to the vm host which is where I think I was getting hung up (I had previously 
> been attempting to mount rbds directly to multiple hosts and as you can 
> imagine having issues).
> 
> Could you possibly expound on why using a clustered filesystem approach is 
> wrong (or conversely why using RBDs is the correct approach)?
> 
> As for format2 rbd images, it looks like they provide exactly the 
> Copy-On-Write functionality that I am looking for.  Any caveats or things I 
> should look out for when going from format 1 to format 2 images? (I think I 
> read something about not being able to use both at the same time...)
> 
> Thanks Again,
> Jon A
> 
> 
> On Mon, Oct 14, 2013 at 4:42 PM, Michael Lowe  
> wrote:
> I live migrate all the time using the rbd driver in qemu, no problems.  Qemu 
> will issue a flush as part of the migration so everything is consistent.  
> It's the right way to use ceph to back vm's. I would strongly recommend 
> against a network file system approach.  You may want to look into format 2 
> rbd images, the cloning and writable snapshots may be what you are looking 
> for.
> 
> Sent from my iPad
> 
> On Oct 14, 2013, at 5:37 AM, Jon  wrote:
> 
>> Hello,
>> 
>> I would like to live migrate a VM between two "hypervisors".  Is it possible 
>> to do this with a rbd disk or should the vm disks be created as qcow images 
>> on a CephFS/NFS share (is it possible to do clvm over rbds? OR GlusterFS 
>> over rbds?) and point kvm at the network directory.  As I understand it, rbds 
>> aren't "cluster aware" so you can't mount an rbd on multiple hosts at once, 
>> but maybe libvirt has a way to handle the transfer...?  I like the idea of 
>> "master" or "golden" images where guests write any changes to a new image, I 
>> don't think rbds are able to handle copy-on-write in the same way kvm does 
>> so maybe a clustered filesystem approach is the ideal way to go.
>> 
>> Thanks for your input. I think I'm just missing some piece. .. I just don't 
>> grok...
>> 
>> Best Regards,
>> Jon A
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multiply OSDs per host strategy ?

2013-10-16 Thread Mike Dawson

Andrija,

You can use a single pool and the proper CRUSH rule


step chooseleaf firstn 0 type host


to accomplish your goal.

http://ceph.com/docs/master/rados/operations/crush-map/
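
For illustration only (the rule name, ruleset number, and root bucket "default" are assumptions, not taken from Andrija's map), a complete rule built around that step would look roughly like:

rule replicate-per-host {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}

and the pool would then be pointed at it with something like "ceph osd pool set <pool> crush_ruleset 1".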


Cheers,
Mike Dawson


On 10/16/2013 5:16 PM, Andrija Panic wrote:

Hi,

I have 2 x  2TB disks, in 3 servers, so total of 6 disks... I have
deployed total of 6 OSDs.
ie:
host1 = osd.0 and osd.1
host2 = osd.2 and osd.3
host4 = osd.4 and osd.5

Now, since I will have total of 3 replica (original + 2 replicas), I
want my replica placement to be such, that I don't end up having 2
replicas on 1 host (replica on osd0, osd1 (both on host1) and replica on
osd2. I want all 3 replicas spread on different hosts...

I know this is to be done via crush maps, but I'm not sure if it would
be better to have 2 pools, 1 pool on osd0,2,4 and and another pool on
osd1,3,5.

If possible, I would want only 1 pool, spread across all 6 OSDs, but
with data placement such, that I don't end up having 2 replicas on 1
host...not sure if this is possible at all...

Is that possible, or maybe I should go for RAID0 in each server (2 x 2Tb
= 4TB for osd0) or maybe JBOD  (1 volume, so 1 OSD per host) ?

Any suggesting about best practice ?

Regards,

--

Andrija Panić


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] poor read performance on rbd+LVM, LVM overload

2013-10-17 Thread Mike Snitzer
On Wed, Oct 16 2013 at 12:16pm -0400,
Sage Weil  wrote:

> Hi,
> 
> On Wed, 16 Oct 2013, Ugis wrote:
> > 
> > What could make so great difference when LVM is used and what/how to
> > tune? As write performance does not differ, DM extent lookup should
> > not be lagging, where is the trick?
> 
> My first guess is that LVM is shifting the content of the device such that 
> it no longer aligns well with the RBD striping (by default, 4MB).  The 
> non-aligned reads/writes would need to touch two objects instead of 
> one, and dd is generally doing these synchronously (i.e., lots of 
> waiting).
> 
> I'm not sure what options LVM provides for aligning things to the 
> underlying storage...

LVM will consume the underlying storage's device limits.  So if rbd
establishes appropriate minimum_io_size and optimal_io_size that reflect
the striping config LVM will pick it up -- provided
'data_alignment_detection' is enabled in lvm.conf (which it is by
default).

Ugis, please provide the output of:

RBD_DEVICE=
pvs -o pe_start $RBD_DEVICE
cat /sys/block/$RBD_DEVICE/queue/minimum_io_size
cat /sys/block/$RBD_DEVICE/queue/optimal_io_size

The 'pvs' command will tell you where LVM aligned the start of the data
area (which follows the LVM metadata area).  Hopefully it reflects what
was published in sysfs for rbd's striping.
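
For example, if the image were mapped as /dev/rbd0 (device name purely illustrative), that would be:

pvs -o pe_start /dev/rbd0
cat /sys/block/rbd0/queue/minimum_io_size
cat /sys/block/rbd0/queue/optimal_io_size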
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] poor read performance on rbd+LVM, LVM overload

2013-10-21 Thread Mike Snitzer
On Mon, Oct 21 2013 at 11:01am -0400,
Mike Snitzer  wrote:

> On Mon, Oct 21 2013 at 10:11am -0400,
> Christoph Hellwig  wrote:
> 
> > On Sun, Oct 20, 2013 at 08:58:58PM -0700, Sage Weil wrote:
> > > It looks like without LVM we're getting 128KB requests (which IIRC is 
> > > typical), but with LVM it's only 4KB.  Unfortunately my memory is a bit 
> > > fuzzy here, but I seem to recall a property on the request_queue or 
> > > device 
> > > that affected this.  RBD is currently doing
> > 
> > Unfortunately most device mapper modules still split all I/O into 4k
> > chunks before handling them.  They rely on the elevator to merge them
> > back together down the line, which isn't overly efficient but should at
> > least provide larger segments for the common cases.
> 
> It isn't DM that splits the IO into 4K chunks; it is the VM subsystem
> no?  Unless care is taken to assemble larger bios (higher up the IO
> stack, e.g. in XFS), all buffered IO will come to bio-based DM targets
> in $PAGE_SIZE granularity.
> 
> I would expect direct IO to before better here because it will make use
> of bio_add_page to build up larger IOs.

s/before/perform/ ;)
 
> Taking a step back, the rbd driver is exposing both the minimum_io_size
> and optimal_io_size as 4M.  This symmetry will cause XFS to _not_ detect
> the exposed limits as striping.  Therefore, AFAIK, XFS won't take steps
> to respect the limits when it assembles its bios (via bio_add_page).
> 
> Sage, any reason why you don't use traditional raid geometry based IO
> limits? E.g.:
> 
> minimum_io_size = raid chunk size
> optimal_io_size = raid chunk size * N stripes (aka full stripe)
> 
> ___
> linux-lvm mailing list
> linux-...@redhat.com
> https://www.redhat.com/mailman/listinfo/linux-lvm
> read the LVM HOW-TO at http://tldp.org/HOWTO/LVM-HOWTO/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] poor read performance on rbd+LVM, LVM overload

2013-10-21 Thread Mike Snitzer
On Mon, Oct 21 2013 at 10:11am -0400,
Christoph Hellwig  wrote:

> On Sun, Oct 20, 2013 at 08:58:58PM -0700, Sage Weil wrote:
> > It looks like without LVM we're getting 128KB requests (which IIRC is 
> > typical), but with LVM it's only 4KB.  Unfortunately my memory is a bit 
> > fuzzy here, but I seem to recall a property on the request_queue or device 
> > that affected this.  RBD is currently doing
> 
> Unfortunately most device mapper modules still split all I/O into 4k
> chunks before handling them.  They rely on the elevator to merge them
> back together down the line, which isn't overly efficient but should at
> least provide larger segments for the common cases.

It isn't DM that splits the IO into 4K chunks; it is the VM subsystem
no?  Unless care is taken to assemble larger bios (higher up the IO
stack, e.g. in XFS), all buffered IO will come to bio-based DM targets
in $PAGE_SIZE granularity.

I would expect direct IO to before better here because it will make use
of bio_add_page to build up larger IOs.

Taking a step back, the rbd driver is exposing both the minimum_io_size
and optimal_io_size as 4M.  This symmetry will cause XFS to _not_ detect
the exposed limits as striping.  Therefore, AFAIK, XFS won't take steps
to respect the limits when it assembles its bios (via bio_add_page).

Sage, any reason why you don't use traditional raid geometry based IO
limits? E.g.:

minimum_io_size = raid chunk size
optimal_io_size = raid chunk size * N stripes (aka full stripe)
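
For instance (numbers purely illustrative, not a proposal for rbd defaults), a
4-disk stripe with a 64 KB chunk would advertise:

minimum_io_size = 65536      # one 64 KB chunk
optimal_io_size = 262144     # 64 KB * 4 = one full stripe

whereas the current 4M/4M symmetry gives XFS no stripe geometry to latch onto.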
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] poor read performance on rbd+LVM, LVM overload

2013-10-21 Thread Mike Snitzer
On Mon, Oct 21 2013 at 12:02pm -0400,
Sage Weil  wrote:

> On Mon, 21 Oct 2013, Mike Snitzer wrote:
> > On Mon, Oct 21 2013 at 10:11am -0400,
> > Christoph Hellwig  wrote:
> > 
> > > On Sun, Oct 20, 2013 at 08:58:58PM -0700, Sage Weil wrote:
> > > > It looks like without LVM we're getting 128KB requests (which IIRC is 
> > > > typical), but with LVM it's only 4KB.  Unfortunately my memory is a bit 
> > > > fuzzy here, but I seem to recall a property on the request_queue or 
> > > > device 
> > > > that affected this.  RBD is currently doing
> > > 
> > > Unfortunately most device mapper modules still split all I/O into 4k
> > > chunks before handling them.  They rely on the elevator to merge them
> > > back together down the line, which isn't overly efficient but should at
> > > least provide larger segments for the common cases.
> > 
> > It isn't DM that splits the IO into 4K chunks; it is the VM subsystem
> > no?  Unless care is taken to assemble larger bios (higher up the IO
> > stack, e.g. in XFS), all buffered IO will come to bio-based DM targets
> > in $PAGE_SIZE granularity.
> > 
> > I would expect direct IO to before better here because it will make use
> > of bio_add_page to build up larger IOs.
> 
> I do know that we regularly see 128 KB requests when we put XFS (or 
> whatever else) directly on top of /dev/rbd*.

Should be pretty straight-forward to identify any limits that are
different by walking sysfs/queue, e.g.:

grep -r . /sys/block/rdbXXX/queue
vs
grep -r . /sys/block/dm-X/queue

Could be there is an unexpected difference.  For instance, there was
this fix recently: http://patchwork.usersys.redhat.com/patch/69661/

> > Taking a step back, the rbd driver is exposing both the minimum_io_size
> > and optimal_io_size as 4M.  This symmetry will cause XFS to _not_ detect
> > the exposed limits as striping.  Therefore, AFAIK, XFS won't take steps
> > to respect the limits when it assembles its bios (via bio_add_page).
> > 
> > Sage, any reason why you don't use traditional raid geometry based IO
> > limits? E.g.:
> > 
> > minimum_io_size = raid chunk size
> > optimal_io_size = raid chunk size * N stripes (aka full stripe)
> 
> We are... by default we stripe 4M chunks across 4M objects.  You're 
> suggesting it would actually help to advertise a smaller minimim_io_size 
> (say, 1MB)?  This could easily be made tunable.

You're striping 4MB chunks across 4 million stripes?

So the full stripe size in bytes is 17592186044416 (or 16TB)?  Yeah
cannot see how XFS could make use of that ;)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] poor read performance on rbd+LVM, LVM overload

2013-10-21 Thread Mike Snitzer
On Mon, Oct 21 2013 at  2:06pm -0400,
Christoph Hellwig  wrote:

> On Mon, Oct 21, 2013 at 11:01:29AM -0400, Mike Snitzer wrote:
> > It isn't DM that splits the IO into 4K chunks; it is the VM subsystem
> > no?
> 
> Well, it's the block layer based on what DM tells it.  Take a look at
> dm_merge_bvec
> 
> >From dm_merge_bvec:
> 
>   /*
>  * If the target doesn't support merge method and some of the devices
>  * provided their merge_bvec method (we know this by looking at
>  * queue_max_hw_sectors), then we can't allow bios with multiple 
> vector
>  * entries.  So always set max_size to 0, and the code below allows
>  * just one page.
>  */
>   
> Although it's not the general case, just if the driver has a
> merge_bvec method.  But this happens if you using DM ontop of MD where I
> saw it aswell as on rbd, which is why it's correct in this context, too.

Right, but only if the DM target that is being used doesn't have a
.merge method.  I don't think it was ever shared which DM target is in
use here.. but both the linear and stripe DM targets provide a .merge
method.
 
> Sorry for over generalizing a bit.

No problem.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] saucy salamander support?

2013-10-22 Thread Mike Dawson

For the time being, you can install the Raring debs on Saucy without issue.

echo deb http://ceph.com/debian-dumpling/ raring main | sudo tee 
/etc/apt/sources.list.d/ceph.list
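
After that, a normal refresh and install should pull in the Dumpling packages 
(assuming the ceph.com release key is already trusted by apt):

sudo apt-get update
sudo apt-get install ceph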


I'd also like to register a +1 request for official builds targeted at 
Saucy.


Cheers,
Mike


On 10/22/2013 11:42 AM, LaSalle, Jurvis wrote:

Hi,

I accidentally installed Saucy Salamander.  Does the project have a
timeframe for supporting this Ubuntu release?

Thanks,
JL

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] saucy salamander support?

2013-10-22 Thread Mike Lowe
And a +1 from me as well.  It would appear that ubuntu has picked up the 0.67.4 
source and included a build of it in their official repo, so you may be able to 
get by until the next point release with those.

http://packages.ubuntu.com/search?keywords=ceph

On Oct 22, 2013, at 11:46 AM, Mike Dawson  wrote:

> For the time being, you can install the Raring debs on Saucy without issue.
> 
> echo deb http://ceph.com/debian-dumpling/ raring main | sudo tee 
> /etc/apt/sources.list.d/ceph.list
> 
> I'd also like to register a +1 request for official builds targeted at Saucy.
> 
> Cheers,
> Mike
> 
> 
> On 10/22/2013 11:42 AM, LaSalle, Jurvis wrote:
>> Hi,
>> 
>>  I accidentally installed Saucy Salamander.  Does the project have a
>> timeframe for supporting this Ubuntu release?
>> 
>> Thanks,
>> JL
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] About use same SSD for OS and Journal

2013-10-25 Thread Mike Dawson

Kurt,

When you had OS and osd journals co-located, how many osd journals were 
on the SSD containing the OS?


You mention you now use a 5:1 ratio. Was the ratio something like 11:1 
before (one SSD for OS plus 11 osd journals to 11 OSDs in a 12-disk 
chassis)?


Also, what throughput per drive were you seeing on the cluster during 
the periods where things got laggy due to backfills, etc?


Last, did you attempt to throttle using ceph config setting in the old 
setup? Do you need to throttle in your current setup?


Thanks,
Mike Dawson


On 10/24/2013 10:40 AM, Kurt Bauer wrote:

Hi,

we had a setup like this and ran into trouble, so I would strongly
discourage you from setting it up like this. Under normal circumstances
there's no problem, but when the cluster is under heavy load, for
example when it has a lot of pgs backfilling, for whatever reason
(increasing num of pgs, adding OSDs,..), there's obviously a lot of
entries written to the journals.
What we saw then was extremely laggy behavior of the cluster and when
looking at the iostats of the SSD, they were at 100% most of the time. I
don't exactly know what causes this and why the SSDs can't cope with the
amount of IOs, but separating OS and journals did the trick. We now have
quick 15k HDDs in RAID1 for the OS and monitor journal, and one SSD per 5
OSD journals, with one partition per journal (used as a raw partition).

Hope that helps,
best regards,
Kurt

Martin Catudal schrieb:

Hi,
  Here is my scenario:
I will have a small cluster (4 nodes) with 4 (4 TB) OSD's per node.

I will have OS installed on two SSD in raid 1 configuration.

Have any of you successfully and efficiently run a Ceph cluster that is
built with the journal on a separate partition of the OS SSDs?

I know that a lot of IO may occur on the journal SSD and I'm scared
of having my OS suffer from too much IO.

Any background experience?

Martin



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How can I check the image's IO ?

2013-10-30 Thread Mike Dawson

Vernon,

You can use the rbd command bench-write documented here:

http://ceph.com/docs/next/man/8/rbd/#commands

The command might looks something like:

rbd --pool test-pool bench-write --io-size 4096 --io-threads 16 
--io-total 1GB test-image


Some other interesting flags are --rbd-cache, --no-rbd-cache, and 
--io-pattern {seq|rand}
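
For example, combining those flags to measure uncached random writes (same 
illustrative pool and image names as above) might look like:

rbd --pool test-pool bench-write --io-size 4096 --io-threads 16 \
    --io-total 1GB --io-pattern rand --no-rbd-cache test-image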



Cheers,
Mike

On 10/30/2013 3:23 AM, vernon1987 wrote:

Hi cephers,
I use "qemu-img create -f rbd rbd:test-pool/test-image" to create a
image. I want to know how can I check this image's IO. Or how to check
the IO for each block?
Thanks.
2013-10-30

vernon


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD numbering

2013-10-30 Thread Mike Lowe
You really should; I believe the OSD number is used in computing CRUSH 
placement.  Bad things will happen if you don't use sequential numbers.

On Oct 30, 2013, at 11:37 AM, Glen Aidukas  wrote:

> I wanted to know, does the OSD numbering have to be sequential and what is 
> the highest usable number (2^16 or 2^32)?
> 
> The reason is, I would like to use a numbering convention that reflects the 
> cluster number (assuming I will have more than one down the road; test, dev, 
> prod), the host and disk used by a given OSD.
> 
> So, for example: osd.CHHHDD  where:
>   C   Cluster number 1-9
>   HHH Host number  IE: ceph001, ceph002, ...
>   DD  Disk number on a given host Ex: 00 = /dev/sda or something like 
> this.
> 
> If the highest number usable is 65534 or near that (2^16) then maybe I could 
> use format CHHDD or CHHHD where I could have clusters 1-5.
> 
> The up side to this is I quickly know where osd.200503 is.  It's on cluster 2 
> host ceph005 and the third disk.  Also, if I add a new disk on a middle host, 
> it doesn't scatter the numbering to where I don't easily know where an OSD is. 
>  I know I can always look this up but having it as part of the OSD number 
> makes life easier. :)
> 
> Also, it might seem silly to have the first digit as a cluster number but I 
> think we probable can't pad the number with zeros so using an initial digit 
> of 1-9 cleans this up so I might as well use it to identify the cluster.  
> 
> This numbering system is not important for the monitors or metadata but could 
> help with the OSDs.
> 
> -Glen
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Red Hat clients

2013-10-30 Thread Mike Lowe
If you were to run your Red Hat based client in a vm you could run an 
unmodified kernel.  If you are using rhel 6.4 then you get the extra goodies 
in the virtio-scsi qemu driver.
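
As a rough sketch (pool and image names are made up), qemu can attach an RBD 
image to such a vm entirely in userspace via librbd, so neither the host nor 
the guest needs the rbd kernel module:

qemu -m 1024 -drive format=raw,file=rbd:rbd/redhat-client-disk,if=virtio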

On Oct 30, 2013, at 2:47 PM,  
 wrote:

> Now that my ceph cluster seems to be happy and stable, I have been looking at 
> different ways of using it.   Object, block and file.
>  
> Object is relatively easy and I will use different ones to test with Ceph.
>  
> When I look at block, I’m getting the impression from a lot of Googling that 
> deploying clients on Red Hat to connect to a Ceph cluster can be complex.   
> As I understand it, the rbd module is not currently in the Red Hat kernel 
> (and I am not allowed to make changes to our standard kernel as is suggested 
> in places as a possible solution).  Does this mean I can’t connect a Red Hat 
> machine to Ceph as a block client?
>  
> ___
> 
> This message is for information purposes only, it is not a recommendation, 
> advice, offer or solicitation to buy or sell a product or service nor an 
> official confirmation of any transaction. It is directed at persons who are 
> professionals and is not intended for retail customer use. Intended for 
> recipient only. This message is subject to the terms at: 
> www.barclays.com/emaildisclaimer.
> 
> For important disclosures, please see: 
> www.barclays.com/salesandtradingdisclaimer regarding market commentary from 
> Barclays Sales and/or Trading, who are active market participants; and in 
> respect of Barclays Research, including disclosures relating to specific 
> issuers, please see http://publicresearch.barclays.com.
> 
> ___
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph monitor problems

2013-10-30 Thread Mike Dawson

Aaron,

Don't mistake valid for advisable.

For documentation purposes, three monitors is the advisable initial 
configuration for multi-node ceph clusters. If there is a valid need for 
more than three monitors, it is advisable to add them two at a time to 
maintain an odd number of total monitors.


-Mike

On 10/30/2013 4:46 PM, Aaron Ten Clay wrote:

On Wed, Oct 30, 2013 at 1:43 PM, Joao Eduardo Luis <joao.l...@inktank.com> wrote:


A quorum of 2 monitors is completely fine as long as both monitors
are up.  A quorum is always possible regardless of how many monitors
you have, as long as a majority is up and able to form it (1 out of
1, 2 out of 2, 2 out of 3, 3 out of 4, 3 out of 5, 4 out of 6,...).

   -Joao


Joao,

The page at http://ceph.com/docs/master/rados/operations/add-or-rm-mons/
only lists "1; 3 out of 5; 4 out of 6; etc.". Perhaps it should be
updated if 2 out of 2 is a valid configuration?

-Aaron



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph health checkup

2013-10-31 Thread Mike Dawson

Narendra,

This is an issue. You really want your cluster to be HEALTH_OK with all 
PGs active+clean. Some exceptions apply (like scrub / deep-scrub).


What do 'ceph health detail' and 'ceph osd tree' show?

Thanks,

Mike Dawson
Co-Founder & Director of Cloud Architecture
Cloudapt LLC
6330 East 75th Street, Suite 170
Indianapolis, IN 46250

On 10/31/2013 6:53 PM, Trivedi, Narendra wrote:

My Ceph cluster health checkup tells me the following. Should I be concerned? 
What's the remedy? What is missing? I issued this command from the monitor 
node. Please correct me if I am wrong, but I think the admin node's job is done 
after the installation, unless I want to add additional OSDs/MONs.

[ceph@ceph-node1-mon-centos-6-4 ceph]$ sudo ceph health
HEALTH_WARN 145 pgs degraded; 43 pgs down; 47 pgs peering; 76 pgs stale; 47 pgs 
stuck inactive; 76 pgs stuck stale; 192 pgs stuck unclean

Thanks a lot in advance!
Narendra


This message contains information which may be confidential and/or privileged. 
Unless you are the intended recipient (or authorized to receive for the 
intended recipient), you may not read, use, copy or disclose to anyone the 
message or any information contained in the message. If you have received the 
message in error, please advise the sender by reply e-mail and delete the 
message and any attachment(s) thereto without retaining any copies.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph User Committee

2013-11-06 Thread Mike Dawson

I also have time I could spend. Thanks for getting this started Loic!

Thanks,
Mike Dawson


On 11/6/2013 12:35 PM, Loic Dachary wrote:

Hi Ceph,

I would like to open a discussion about organizing a Ceph User Committee. We 
briefly discussed the idea with Ross Turk, Patrick McGarry and Sage Weil today 
during the OpenStack summit. A pad was created and roughly summarizes the idea:

http://pad.ceph.com/p/user-committee

If there is enough interest, I'm willing to devote one day a week working for 
the Ceph User Committee. And yes, that includes sitting at the Ceph booth 
during the FOSDEM :-) And interviewing Ceph users and describing their use 
cases, which I enjoy very much. But also contribute to a user centric roadmap, 
which is what ultimately matters for the company I work for.

If you'd like to see this happen but don't have time to participate in this 
discussion, please add your name + email at the end of the pad.

What do you think ?



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph cluster performance

2013-11-06 Thread Mike Dawson
We just fixed a performance issue on our cluster related to spikes of 
high latency on some of our SSDs used for osd journals. In our case, the 
slow SSDs showed spikes of 100x higher latency than expected.


What SSDs were you using that were so slow?

Cheers,
Mike

On 11/6/2013 12:39 PM, Dinu Vlad wrote:

I'm using the latest 3.8.0 branch from raring. Is there a more recent/better 
kernel recommended?

Meanwhile, I think I might have identified the culprit - my SSD drives are 
extremely slow on sync writes, doing 5-600 iops max with 4k blocksize. By 
comparison, an Intel 530 in another server (also installed behind a SAS 
expander is doing the same test with ~ 8k iops. I guess I'm good for replacing 
them.

Removing the SSD drives from the setup and re-testing with ceph => 595 MB/s 
throughput under the same conditions (only mechanical drives, journal on a 
separate partition on each one, 8 rados bench processes, 16 threads each).


On Nov 5, 2013, at 4:38 PM, Mark Nelson  wrote:


Ok, some more thoughts:

1) What kernel are you using?

> 2) Mixing SATA and SAS on an expander backplane can sometimes have bad 
effects.  We don't really know how bad this is and in what circumstances, but 
the Nexenta folks have seen problems with ZFS on solaris and it's not 
impossible linux may suffer too:

http://gdamore.blogspot.com/2010/08/why-sas-sata-is-not-such-great-idea.html

3) If you are doing tests and look at disk throughput with something like "collectl 
-sD -oT"  do the writes look balanced across the spinning disks?  Do any devices 
have much really high service times or queue times?

4) Also, after the test is done, you can try:

find /var/run/ceph/*.asok -maxdepth 1 -exec sudo ceph --admin-daemon {} 
dump_historic_ops \; > foo

and then grep for "duration" in foo.  You'll get a list of the slowest 
operations over the last 10 minutes from every osd on the node.  Once you identify a slow 
duration, you can go back and in an editor search for the slow duration and look at where 
in the OSD it hung up.  That might tell us more about slow/latent operations.

5) Something interesting here is that I've heard from another party that in a 
36 drive Supermicro SC847E16 chassis they had 30 7.2K RPM disks and 6 SSDs on a 
SAS9207-8i controller and were pushing significantly faster throughput than you 
are seeing (even given the greater number of drives).  So it's very interesting 
to me that you are pushing so much less.  The 36 drive supermicro chassis I 
have with no expanders and 30 drives with 6 SSDs can push about 2100MB/s with a 
bunch of 9207-8i controllers and XFS (no replication).

Mark

On 11/05/2013 05:15 AM, Dinu Vlad wrote:

Ok, so after tweaking the deadline scheduler and the filestore_wbthrottle* ceph 
settings I was able to get 440 MB/s from 8 rados bench instances, over a single 
osd node (pool pg_num = 1800, size = 1)

This still looks awfully slow to me - fio throughput across all disks reaches 
2.8 GB/s!!

I'd appreciate any suggestion, where to look for the issue. Thanks!


On Oct 31, 2013, at 6:35 PM, Dinu Vlad  wrote:



I tested the osd performance from a single node. For this purpose I deployed a new cluster 
(using ceph-deploy, as before) and on fresh/repartitioned drives. I created a single pool, 
1800 pgs. I ran the rados bench both on the osd server and on a remote one. Cluster 
configuration stayed "default", with the same additions about xfs mount & 
mkfs.xfs as before.

With a single host, the pgs were "stuck unclean" (active only, not 
active+clean):

# ceph -s
  cluster ffd16afa-6348-4877-b6bc-d7f9d82a4062
   health HEALTH_WARN 1800 pgs stuck unclean
   monmap e1: 3 mons at 
{cephmon1=10.4.0.250:6789/0,cephmon2=10.4.0.251:6789/0,cephmon3=10.4.0.252:6789/0},
 election epoch 4, quorum 0,1,2 cephmon1,cephmon2,cephmon3
   osdmap e101: 18 osds: 18 up, 18 in
pgmap v1055: 1800 pgs: 1800 active; 0 bytes data, 732 MB used, 16758 GB / 
16759 GB avail
   mdsmap e1: 0/0/1 up


Test results:
Local test, 1 process, 16 threads: 241.7 MB/s
Local test, 8 processes, 128 threads: 374.8 MB/s
Remote test, 1 process, 16 threads: 231.8 MB/s
Remote test, 8 processes, 128 threads: 366.1 MB/s

Maybe it's just me, but it seems on the low side too.

Thanks,
Dinu


On Oct 30, 2013, at 8:59 PM, Mark Nelson  wrote:


On 10/30/2013 01:51 PM, Dinu Vlad wrote:

Mark,

The SSDs are 
http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/ssd/enterprise-sata-ssd/?sku=ST240FN0021
 and the HDDs are 
http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/hdd/constellation/?sku=ST91000640SS.

The chasis is a "SiliconMechanics C602" - but I don't have the exact model. 
It's based on Supermicro, has 24 slots front and 2 in the back and a SAS expander.

I did a fio test (raw partitions, 4M blocksize, ioqueue maxed out according to 
what the driver reports in dmesg). here are the resu

Re: [ceph-users] ceph cluster performance

2013-11-06 Thread Mike Dawson
No, in our case flashing the firmware to the latest release cured the 
problem.


If you build a new cluster with the slow SSDs, I'd be interested in the 
results of ioping[0] or fsync-tester[1]. I theorize that you may see 
spikes of high latency.


[0] https://code.google.com/p/ioping/
[1] https://github.com/gregsfortytwo/fsync-tester
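
For example (the path is purely illustrative), a quick run against an osd data 
or journal directory while the cluster is under load will make such latency 
spikes visible:

ioping -c 10 /var/lib/ceph/osd/ceph-0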

Thanks,
Mike Dawson


On 11/6/2013 4:18 PM, Dinu Vlad wrote:

ST240FN0021 connected via a SAS2x36 to a LSI 9207-8i.

By "fixed" - you mean replaced the SSDs?

Thanks,
Dinu

On Nov 6, 2013, at 10:25 PM, Mike Dawson  wrote:


We just fixed a performance issue on our cluster related to spikes of high 
latency on some of our SSDs used for osd journals. In our case, the slow SSDs 
showed spikes of 100x higher latency than expected.

What SSDs were you using that were so slow?

Cheers,
Mike

On 11/6/2013 12:39 PM, Dinu Vlad wrote:

I'm using the latest 3.8.0 branch from raring. Is there a more recent/better 
kernel recommended?

Meanwhile, I think I might have identified the culprit - my SSD drives are 
extremely slow on sync writes, doing 5-600 iops max with 4k blocksize. By 
comparison, an Intel 530 in another server (also installed behind a SAS 
expander) is doing the same test with ~ 8k iops. I guess I'm good for replacing 
them.

Removing the SSD drives from the setup and re-testing with ceph => 595 MB/s 
throughput under the same conditions (only mechanical drives, journal on a 
separate partition on each one, 8 rados bench processes, 16 threads each).


On Nov 5, 2013, at 4:38 PM, Mark Nelson  wrote:


Ok, some more thoughts:

1) What kernel are you using?

> 2) Mixing SATA and SAS on an expander backplane can sometimes have bad 
effects.  We don't really know how bad this is and in what circumstances, but 
the Nexenta folks have seen problems with ZFS on solaris and it's not 
impossible linux may suffer too:

http://gdamore.blogspot.com/2010/08/why-sas-sata-is-not-such-great-idea.html

3) If you are doing tests and look at disk throughput with something like "collectl 
-sD -oT"  do the writes look balanced across the spinning disks?  Do any devices 
have much really high service times or queue times?

4) Also, after the test is done, you can try:

find /var/run/ceph/*.asok -maxdepth 1 -exec sudo ceph --admin-daemon {} 
dump_historic_ops \; > foo

and then grep for "duration" in foo.  You'll get a list of the slowest 
operations over the last 10 minutes from every osd on the node.  Once you identify a slow 
duration, you can go back and in an editor search for the slow duration and look at where 
in the OSD it hung up.  That might tell us more about slow/latent operations.

5) Something interesting here is that I've heard from another party that in a 
36 drive Supermicro SC847E16 chassis they had 30 7.2K RPM disks and 6 SSDs on a 
SAS9207-8i controller and were pushing significantly faster throughput than you 
are seeing (even given the greater number of drives).  So it's very interesting 
to me that you are pushing so much less.  The 36 drive supermicro chassis I 
have with no expanders and 30 drives with 6 SSDs can push about 2100MB/s with a 
bunch of 9207-8i controllers and XFS (no replication).

Mark

On 11/05/2013 05:15 AM, Dinu Vlad wrote:

Ok, so after tweaking the deadline scheduler and the filestore_wbthrottle* ceph 
settings I was able to get 440 MB/s from 8 rados bench instances, over a single 
osd node (pool pg_num = 1800, size = 1)

This still looks awfully slow to me - fio throughput across all disks reaches 
2.8 GB/s!!

I'd appreciate any suggestion, where to look for the issue. Thanks!


On Oct 31, 2013, at 6:35 PM, Dinu Vlad  wrote:



I tested the osd performance from a single node. For this purpose I deployed a new cluster 
(using ceph-deploy, as before) and on fresh/repartitioned drives. I created a single pool, 
1800 pgs. I ran the rados bench both on the osd server and on a remote one. Cluster 
configuration stayed "default", with the same additions about xfs mount & 
mkfs.xfs as before.

With a single host, the pgs were "stuck unclean" (active only, not 
active+clean):

# ceph -s
  cluster ffd16afa-6348-4877-b6bc-d7f9d82a4062
   health HEALTH_WARN 1800 pgs stuck unclean
   monmap e1: 3 mons at 
{cephmon1=10.4.0.250:6789/0,cephmon2=10.4.0.251:6789/0,cephmon3=10.4.0.252:6789/0},
 election epoch 4, quorum 0,1,2 cephmon1,cephmon2,cephmon3
   osdmap e101: 18 osds: 18 up, 18 in
pgmap v1055: 1800 pgs: 1800 active; 0 bytes data, 732 MB used, 16758 GB / 
16759 GB avail
   mdsmap e1: 0/0/1 up


Test results:
Local test, 1 process, 16 threads: 241.7 MB/s
Local test, 8 processes, 128 threads: 374.8 MB/s
Remote test, 1 process, 16 threads: 231.8 MB/s
Remote test, 8 processes, 128 threads: 366.1 MB/s

Maybe it's just me, but it seems on the low side too.

Thanks,
Dinu


On Oct 30, 2013, at 8:59 PM, Mark Nelson  wrote:


On 10/30/2013 01:51 PM,

Re: [ceph-users] Running on disks that lose their head

2013-11-07 Thread Mike Dawson



Thanks,

Mike Dawson
Co-Founder & Director of Cloud Architecture
Cloudapt LLC
6330 East 75th Street, Suite 170
Indianapolis, IN 46250

On 11/7/2013 2:12 PM, Kyle Bader wrote:

Once I know a drive has had a head failure, do I trust that the rest of the 
drive isn't going to go at an inconvenient moment vs just fixing it right now 
when it's not 3AM on Christmas morning? (true story)  As good as Ceph is, do I 
trust that Ceph is smart enough to prevent spreading corrupt data all over the 
cluster if I leave bad disks in place and they start doing terrible things to 
the data?


I have a lot more disks than I have trust in disks. If a drive lost a
head then I want it gone.

I love the idea of using smart data but can foresee some
implementation issues. We have seen some raid configurations where
polling smart will halt all raid operations momentarily. Also, some
controllers require you to use their CLI tool to poll for smart vs
smartmontools.

It would be similarly awesome to embed something like an apdex score
against each osd, especially if it factored in hierarchy to identify
poor performing osds, nodes, racks, etc..


Kyle,

I think you are spot-on here. Apdex or similar scoring for gear 
performance is important for Ceph, IMO. Due to pseudo-random placement 
and replication, it can be quite difficult to identify 1) if hardware, 
software, or configuration are the cause of slowness, and 2) which 
hardware (if any) is slow. I recently discovered a method that seems 
to address both points.


Zackc, Loicd, and I have been the main participants in a weekly 
Teuthology call the past few weeks. We've talked mostly about methods to 
extend Teuthology to capture performance metrics. Would you be willing 
to join us during the Teuthology and Ceph-Brag sessions at the Firefly 
Developer Summit?


Cheers,
Mike
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to enable rbd cache

2013-11-25 Thread Mike Dawson
Greg is right, you need to enable RBD admin sockets. This can be a bit 
tricky though, so here are a few tips:


1) In ceph.conf on the compute node, explicitly set a location for the 
admin socket:


[client.volumes]
admin socket = /var/run/ceph/rbd-$pid.asok

In this example, libvirt/qemu is running with permissions from 
ceph.client.volumes.keyring. If you use something different, adjust 
accordingly. You can put this under a more generic [client] section, but 
there are some downsides (like a new admin socket for each ceph cli 
command).


2) Watch for permissions issues creating the admin socket at the path 
you used above. For me, I needed to explicitly grant some permissions in 
/etc/apparmor.d/abstractions/libvirt-qemu, specifically I had to add:


  # for rbd
  capability mknod,

and

  # for rbd
  /etc/ceph/ceph.conf r,
  /var/log/ceph/* rw,
  /{,var/}run/ceph/** rw,

3) Be aware that if you have multiple rbd volumes attached to a single 
guest, you'll only get an admin socket for the volume mounted last. 
If you can set admin_socket via the libvirt xml for each volume, you can 
avoid this issue. This thread will explain better:


http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg16168.html

4) Once you get an RBD admin socket, query it like:

ceph --admin-daemon /var/run/ceph/rbd-29050.asok config show | grep rbd
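
If the socket is live, the cache's perf counters can be dumped from it as 
well (same hypothetical socket path as above):

ceph --admin-daemon /var/run/ceph/rbd-29050.asok perf dump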


Cheers,
Mike Dawson


On 11/25/2013 11:12 AM, Gregory Farnum wrote:

On Mon, Nov 25, 2013 at 5:58 AM, Mark Nelson  wrote:

On 11/25/2013 07:21 AM, Shu, Xinxin wrote:


Recently, I wanted to enable rbd cache to identify the performance benefit. I
added the rbd_cache=true option in my ceph configuration file and used 'virsh
attach-device' to attach the rbd to the vm; below is my vdb xml file.



Ceph configuration files are a bit confusing because sometimes you'll see
something like "rbd_cache" listed somewhere but in the ceph.conf file you'll
want a space instead:

rbd cache = true

with no underscore.  That should (hopefully) fix it for you!


I believe the config file will take either format.

The RBD cache is a client-side thing, though, so it's not ever going
to show up in the OSD! You want to look at the admin socket created by
QEMU (via librbd) to see if it's working. :)
-Greg













[The vdb disk XML from the original message was stripped by the archive; only the volume serial 6b5ff6f4-9f8c-4fe0-84d6-9d795967c7dd survives.]



I do not know if this is ok to enable rbd cache. I see perf counters for rbd
cache in the source code, but when I used the admin daemon to check rbd cache
statistics,

ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump

But I did not get any rbd cache flags.

My question is how to enable rbd cache and check the rbd cache perf counters,
or how I can make sure rbd cache is enabled. Any tips will be
appreciated. Thanks in advance.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CEPH HA with Ubuntu OpenStack and Highly Available Controller Nodes

2013-12-02 Thread Mike Dawson

Gary,

Yes, Ceph can provide a highly available infrastructure. A few pointers:

- IO will stop if anything less than a majority of the monitors are 
functioning properly. So, run a minimum of 3 Ceph monitors distributed 
across your data center's failure domains. If you choose to add more 
monitors, it is advisable to add them in pairs to maintain an odd number.


- Run at least a replication size of 2 (lots of operators choose 3 for 
more redundancy; example commands are sketched after this list). Ceph will 
gracefully handle failure conditions of the primary OSD once it is 
automatically or manually marked as "down".


- Design your CRUSH map to mimic the failure domains in your datacenter 
(root, room, row, rack, chassis, host, etc). Use the CRUSH chooseleaf 
rules to spread replicas across the largest failure domain that will 
have more entries than your replication factor. For instance, try to 
replicate across racks rather than hosts if your cluster will be large 
enough.


- Don't set the "ceph osd set nodown" flag on your cluster, as it will 
prevent osds from being marked as down automatically if unavailable, 
substantially diminishing the HA capabilities.
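
As a concrete sketch of the replication and flag points above (the pool name 
"rbd" is only an example):

ceph osd pool set rbd size 3
ceph osd pool set rbd min_size 2
ceph osd unset nodown        # clears the flag if it was ever set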


Cheers,
Mike Dawson


On 12/2/2013 11:43 AM, Gary Harris (gharris) wrote:

Hi, I have a question about CEP high availability when integrated with
OpenStack.


Assuming you have all OpenStack controller nodes in HA mode, would you
actually have an HA Ceph implementation as well, meaning two primary OSDs,
or both pointing to the primary?

Or,  do the client requests get forwarded automatically to the secondary
OSD should the primary fail?
(Excuse the simplifications):)

So my assumption is that if the primary fails, the mon would detect it and
a new primary OSD candidate would be presented to clients?

Thanks,
Gary




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Adding new OSDs, need to increase PGs?

2013-12-03 Thread Mike Dawson

Robert,

Interesting results on the effect of # of PG/PGPs. My cluster struggles 
a bit under the strain of heavy random small-sized writes.


The IOPS you mention seem high to me given 30 drives and 3x replication 
unless they were pure reads or on high-rpm drives. Instead of assuming, 
I want to pose a few questions:


- How are you testing? rados bench, rbd bench, rbd bench with writeback 
cache, etc?


- Were the 2000-2500 random 4k IOPS more reads than writes? If you test 
100% 4k random reads, what do you get? If you test 100% 4k random 
writes, what do you get?


- What drives do you have? Any RAID involved under your OSDs?

Thanks,
Mike Dawson


On 12/3/2013 1:31 AM, Robert van Leeuwen wrote:



On 2 dec. 2013, at 18:26, "Brian Andrus"  wrote:



  Setting your pg_num and pgp_num to say... 1024 would A) increase data 
granularity, B) likely lend no noticeable increase to resource consumption, and 
C) allow some room for future OSDs to be added while still within range of 
acceptable pg numbers. You could probably safely double even that number if you 
plan on expanding at a rapid rate and want to avoid splitting PGs every time a 
node is added.

In general, you can conservatively err on the larger side when it comes to 
pg/p_num. Any excess resource utilization will be negligible (up to a certain 
point). If you have a comfortable amount of available RAM, you could experiment 
with increasing the multiplier in the equation you are using and see how it 
affects your final number.

The pg_num and pgp_num parameters can safely be changed before or after your 
new nodes are integrated.


I would be a bit conservative with the PGs / PGPs.
I've experimented with the PG number a bit and noticed the following random IO 
performance drop.
( this could be something to our specific setup but since the PG is easily 
increased and impossible to decrease I would be conservative)

  The setup:
3 OSD nodes with 128GB ram, 2 * 6 core CPU (12 with ht).
Nodes have 10 OSDs running on 1 tb disks and 2 SSDs for Journals.

We use a replica count of 3 so optimum according to formula is about 1000
With 1000 PGs I got about 2000-2500 random 4k IOPS.

Because the nodes are fast enough and I expect the cluster to be expanded with 
3 more nodes I set the PGs to 2000.
Performance dropped to about 1200-1400 IOPS.

I noticed that the spinning disks were no longer maxing out at 100% usage.
Memory and CPU did not seem to be a problem.
Since I had the option to recreate the pool and I was not using the recommended 
settings I did not really dive into the issue.
I will not stray to far from the recommended settings in the future though :)

Cheers,
Robert van Leeuwen
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Adding new OSDs, need to increase PGs?

2013-12-03 Thread Mike Dawson

Robert,

Do you have rbd writeback cache enabled on these volumes? That could 
certainly explain the higher than expected write performance. Any chance 
you could re-test with rbd writeback on vs. off?
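
For reference, that cache is toggled on the client/hypervisor side of 
ceph.conf -- a minimal sketch for the comparison runs:

[client]
rbd cache = true     # set to false for the second run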


Thanks,
Mike Dawson

On 12/3/2013 10:37 AM, Robert van Leeuwen wrote:

Hi Mike,

I am using filebench within a kvm virtual. (Like an actual workload we will 
have)
Using 100% synchronous 4k writes with a 50GB file on a 100GB volume with 32 
writer threads.
Also tried from multiple KVM machines from multiple hosts.
Aggregated performance keeps at 2k+ IOPS

The disks are 7200RPM 2.5 inch drives, no RAID whatsoever.
I agree the amount of IOPS seem high.
Maybe the journal on SSD (2 x Intel 3500) helps a bit in this regard but the 
SSDs were not maxed out yet.
The writes seem to be limited by the spinning disks:
As soon as the benchmark starts they are at 100% utilization.
Also the usage dropped to 0% pretty much immediately after the benchmark so it 
looks like it's not lagging behind the journal.

I did not really test reads yet; since we have so much read cache (128 GB per 
node) I assume we will mostly be write limited.

Cheers,
Robert van Leeuwen



Sent from my iPad


On 3 dec. 2013, at 16:15, "Mike Dawson"  wrote:

Robert,

Interesting results on the effect of # of PG/PGPs. My cluster struggles a bit 
under the strain of heavy random small-sized writes.

The IOPS you mention seem high to me given 30 drives and 3x replication unless 
they were pure reads or on high-rpm drives. Instead of assuming, I want to pose 
a few questions:

- How are you testing? rados bench, rbd bench, rbd bench with writeback cache, 
etc?

- Were the 2000-2500 random 4k IOPS more reads than writes? If you test 100% 4k 
random reads, what do you get? If you test 100% 4k random writes, what do you 
get?

- What drives do you have? Any RAID involved under your OSDs?

Thanks,
Mike Dawson



On 12/3/2013 1:31 AM, Robert van Leeuwen wrote:


On 2 dec. 2013, at 18:26, "Brian Andrus"  wrote:


  Setting your pg_num and pgp_num to say... 1024 would A) increase data 
granularity, B) likely lend no noticeable increase to resource consumption, and 
C) allow some room for future OSDs to be added while still within range of 
acceptable pg numbers. You could probably safely double even that number if you 
plan on expanding at a rapid rate and want to avoid splitting PGs every time a 
node is added.

In general, you can conservatively err on the larger side when it comes to 
pg/p_num. Any excess resource utilization will be negligible (up to a certain 
point). If you have a comfortable amount of available RAM, you could experiment 
with increasing the multiplier in the equation you are using and see how it 
affects your final number.

The pg_num and pgp_num parameters can safely be changed before or after your 
new nodes are integrated.


I would be a bit conservative with the PGs / PGPs.
I've experimented with the PG number a bit and noticed the following random IO 
performance drop.
(this could be something specific to our setup, but since the PG count is easily 
increased and impossible to decrease I would be conservative)

  The setup:
3 OSD nodes with 128GB ram, 2 * 6 core CPU (12 with ht).
Nodes have 10 OSDs running on 1 tb disks and 2 SSDs for Journals.

We use a replica count of 3 so optimum according to formula is about 1000
With 1000 PGs I got about 2000-2500 random 4k IOPS.

Because the nodes are fast enough and I expect the cluster to be expanded with 
3 more nodes I set the PGs to 2000.
Performance dropped to about 1200-1400 IOPS.

I noticed that the spinning disks were no longer maxing out at 100% usage.
Memory and CPU did not seem to be a problem.
Since I had the option to recreate the pool and I was not using the recommended 
settings I did not really dive into the issue.
I will not stray to far from the recommended settings in the future though :)

Cheers,
Robert van Leeuwen
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.67.11 dumpling released

2014-09-25 Thread Mike Dawson

On 9/25/2014 11:09 AM, Sage Weil wrote:

v0.67.11 "Dumpling"
===

This stable update for Dumpling fixes several important bugs that affect a
small set of users.

We recommend that all Dumpling users upgrade at their convenience.  If
none of these issues are affecting your deployment there is no urgency.


Notable Changes
---

* common: fix sending dup cluster log items (#9080 Sage Weil)
* doc: several doc updates (Alfredo Deza)
* libcephfs-java: fix build against older JNI headers (Greg Farnum)
* librados: fix crash in op timeout path (#9362 Matthias Kiefer, Sage Weil)
* librbd: fix crash using clone of flattened image (#8845 Josh Durgin)
* librbd: fix error path cleanup when failing to open image (#8912 Josh Durgin)
* mon: fix crash when adjusting pg_num before any OSDs are added (#9052
   Sage Weil)
* mon: reduce log noise from paxos (Aanchal Agrawal, Sage Weil)
* osd: allow scrub and snap trim thread pool IO priority to be adjusted
   (Sage Weil)


Sage,

Thanks for the great work! Could you provide any links describing how to 
tune the scrub and snap trim thread pool IO priority? I couldn't find 
these settings in the docs.


IIUC, 0.67.11 does not include the proposed changes to address #9487 or 
#9503, right?


Thanks,
Mike Dawson



* osd: fix mount/remount sync race (#9144 Sage Weil)

Getting Ceph


* Git at git://github.com/ceph/ceph.git
* Tarball at http://ceph.com/download/ceph-0.67.11.tar.gz
* For packages, see http://ceph.com/docs/master/install/get-packages
* For ceph-deploy, see http://ceph.com/docs/master/install/install-ceph-deploy
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.67.11 dumpling released

2014-09-25 Thread Mike Dawson
Looks like the packages have partially hit the repo, but at least the 
following are missing:


Failed to fetch 
http://ceph.com/debian-dumpling/pool/main/c/ceph/librbd1_0.67.11-1precise_amd64.deb 
 404  Not Found
Failed to fetch 
http://ceph.com/debian-dumpling/pool/main/c/ceph/librados2_0.67.11-1precise_amd64.deb 
 404  Not Found
Failed to fetch 
http://ceph.com/debian-dumpling/pool/main/c/ceph/python-ceph_0.67.11-1precise_amd64.deb 
 404  Not Found
Failed to fetch 
http://ceph.com/debian-dumpling/pool/main/c/ceph/ceph_0.67.11-1precise_amd64.deb 
 404  Not Found
Failed to fetch 
http://ceph.com/debian-dumpling/pool/main/c/ceph/libcephfs1_0.67.11-1precise_amd64.deb 
 404  Not Found


Based on the timestamps of the files that made it, it looks like the 
process to publish the packages isn't still in progress, but rather 
failed yesterday.


Thanks,
Mike Dawson


On 9/25/2014 11:09 AM, Sage Weil wrote:

v0.67.11 "Dumpling"
===================

This stable update for Dumpling fixes several important bugs that affect a
small set of users.

We recommend that all Dumpling users upgrade at their convenience.  If
none of these issues are affecting your deployment there is no urgency.


Notable Changes
---------------

* common: fix sending dup cluster log items (#9080 Sage Weil)
* doc: several doc updates (Alfredo Deza)
* libcephfs-java: fix build against older JNI headers (Greg Farnum)
* librados: fix crash in op timeout path (#9362 Matthias Kiefer, Sage Weil)
* librbd: fix crash using clone of flattened image (#8845 Josh Durgin)
* librbd: fix error path cleanup when failing to open image (#8912 Josh Durgin)
* mon: fix crash when adjusting pg_num before any OSDs are added (#9052
   Sage Weil)
* mon: reduce log noise from paxos (Aanchal Agrawal, Sage Weil)
* osd: allow scrub and snap trim thread pool IO priority to be adjusted
   (Sage Weil)
* osd: fix mount/remount sync race (#9144 Sage Weil)

Getting Ceph
------------

* Git at git://github.com/ceph/ceph.git
* Tarball at http://ceph.com/download/ceph-0.67.11.tar.gz
* For packages, see http://ceph.com/docs/master/install/get-packages
* For ceph-deploy, see http://ceph.com/docs/master/install/install-ceph-deploy
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




Re: [ceph-users] converting legacy puppet-ceph configured OSDs to look like ceph-deployed OSDs

2014-10-15 Thread Mike Dawson

On 10/15/2014 4:20 PM, Dan van der Ster wrote:

Hi Ceph users,

(sorry for the novel, but perhaps this might be useful for someone)

During our current project to upgrade our cluster from disks-only to
SSD journals, we've found it useful to convert our legacy puppet-ceph
deployed cluster (using something like the enovance module) to one that
looks like it has had its OSDs created with ceph-disk prepare. It's been
educational for me, and I thought it would be good experience to share.

To start, the "old" puppet-ceph configures OSDs explicitly in
ceph.conf, like this:

[osd.211]
host = p05151113489275
devs = /dev/disk/by-path/pci-:02:00.0-sas-...-lun-0-part1

and ceph-disk list says this about the disks:

/dev/sdh :
  /dev/sdh1 other, xfs, mounted on /var/lib/ceph/osd/osd.211

In other words, ceph-disk doesn't know anything about the OSD living
on that disk.

Before deploying our SSD journals I was trying to find the best way to
map OSDs to SSD journal partitions (in puppet!), but basically there is
no good way to do this with the legacy puppet-ceph module. (What we'd
have to do is puppetize the partitioning of SSDs, then manually map OSDs
to SSD partitions. This would be tedious, and also error prone after
disk replacements and reboots).

However, I've found that by using ceph-deploy, i.e. ceph-disk, to
prepare and activate OSDs, this becomes very simple, trivial even. Using
ceph-disk we keep the OSD/SSD mapping out of puppet; instead the state
is stored in the OSD itself. (1.5 years ago when we deployed this
cluster, ceph-deploy was advertised as a quick tool to spin up small
clusters, so we didn't dare
use it. I realize now that it (or the puppet/chef/... recipes based on
it) is _the_only_way_ to build a cluster if you're starting out today.)

Now our problem was that I couldn't go and re-ceph-deploy the whole
cluster, since we've got some precious user data there. Instead, I
needed to learn how ceph-disk is labeling and preparing disks, and
modify our existing OSDs in place to look like they'd been prepared and
activated with ceph-disk.

In the end, I've worked out all the configuration and sgdisk magic and
put the recipes into a couple of scripts here [1]. Note that I do not
expect these to work for any other cluster unmodified. In fact, that
would be dangerous, so don't blame me if you break something. But they
might be helpful for understanding how the ceph-disk udev magic works and
could be a basis for upgrading other clusters.

The scripts are:

ceph-deployifier/ceph-create-journals.sh:
   - this script partitions SSDs (assuming sda to sdd) with 5 partitions
each
   - the only trick is to add the partition name 'ceph journal' and set
the typecode to the magic JOURNAL_UUID along with a random partition guid

ceph-deployifier/ceph-label-disks.sh:
   - this script discovers the next OSD which is not prepared with
ceph-disk, finds an appropriate unused journal partition, and converts
the OSD to a ceph-disk prepared lookalike.
   - aside from the discovery part, the main magic is to:
 - create the files active, sysvinit and journal_uuid on the OSD
 - rename the partition to 'ceph data', set the typecode to the
magic OSD_UUID, and the partition guid to the OSD's uuid.
 - link to the /dev/disk/by-partuuid/ journal symlink, and make the
new journal
   - at the end, udev is triggered and the OSD is started (via the
ceph-disk activation magic)
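
(To make the sgdisk/typecode magic above concrete, a rough per-partition sketch
of what the two scripts do -- device names, sizes and paths here are made up;
the type GUIDs are the ones ceph-disk expects:)

  # journal partition: name it 'ceph journal', random partition GUID, journal typecode
  sgdisk --new=1:0:+10G --change-name=1:'ceph journal' \
         --partition-guid=1:$(uuidgen) \
         --typecode=1:45b0969e-9b03-4f30-b4c6-b4b80ceff106 /dev/sda

  # existing OSD data partition: rename to 'ceph data', set the partition GUID to
  # the OSD's uuid and the typecode to OSD_UUID so the udev/ceph-disk rules fire
  sgdisk --change-name=1:'ceph data' \
         --partition-guid=1:$(cat /var/lib/ceph/osd/osd.211/fsid) \
         --typecode=1:4fbd7e29-9d25-41b8-afd0-062c0ceff05d /dev/sdh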

The complete details are of course in the scripts. (I also have
another version of ceph-label-disks.sh that doesn't expect an SSD
journal but instead prepares the single disk 2 partitions scheme.)

After running these scripts you'll get a nice shiny ceph-disk list output:

/dev/sda :
  /dev/sda1 ceph journal, for /dev/sde1
  /dev/sda2 ceph journal, for /dev/sdf1
  /dev/sda3 ceph journal, for /dev/sdg1
...
/dev/sde :
  /dev/sde1 ceph data, active, cluster ceph, osd.2, journal /dev/sda1
/dev/sdf :
  /dev/sdf1 ceph data, active, cluster ceph, osd.8, journal /dev/sda2
/dev/sdg :
  /dev/sdg1 ceph data, active, cluster ceph, osd.12, journal /dev/sda3
...

And all of the udev magic is working perfectly. I've tested all of the
reboot, failed OSD, and failed SSD scenarios and it all works as it
should. And the puppet-ceph manifest for osd's is now just a very simple
wrapper around ceph-disk prepare. (I haven't published ours to github
yet, but it is very similar to the stackforge puppet-ceph manifest).

There you go, sorry that was so long. I hope someone finds this useful :)

Best Regards,
Dan

[1]
https://github.com/cernceph/ceph-scripts/tree/master/tools/ceph-deployifier



Dan,

Thank you for publishing this! I put some time into this very issue 
earlier this year, but got pulled in another direction before completing 
the work. I'd like to bring a production cluster deployed with mkcephfs 
out of the stone ages, so your work will 

Re: [ceph-users] Poor RBD performance as LIO iSCSI target

2014-10-28 Thread Mike Christie
On 10/27/2014 04:24 PM, Christopher Spearman wrote:
> 
>  - What was tested with bad performance (Reads ~25-50MB/s - Writes ~25-50MB/s)
>* RBD setup as target using LIO
>* RBD -> LVM -> LIO target
>* RBD -> RAID0/1 -> LIO target
>  - What was tested with good performance (Reads ~700-800MB/s - Writes
> ~400-700MB/s)
>* RBD on local system, no iSCSI
>* Ramdisk (No RBD) -> LIO target
>* RBD -> Mounted ext4 -> disk image -> LIO fileio target
>* RBD -> Mounted ext4 -> disk image -> loop device -> LIO blockio target
>* RBD -> loop device -> LIO target
> 

Hi Christopher,

Could you send me the LIO target configs for "RBD setup as target using
LIO" and for "RBD -> Mounted ext4 -> disk image -> LIO fileio target"
and "RBD -> loop device -> LIO target".

If you are using targetcli, just send the /etc/target/saveconfig.json.
You can probably find configs for all setups in /etc/target/backup. If
you do not have those files anymore, can you tell me what settings you
used for the storage object attributes (block size, max sectors, optimal
sectors, cache, etc.), and which backing store driver (block or fileio)
you used for the setups above?
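
(For reference, a minimal way to capture that -- a sketch assuming the
targetcli-fb shell:)

  # persist the running LIO config to /etc/target/saveconfig.json
  targetcli saveconfig
  # show the backstore tree (block vs. fileio and the attached devices)
  targetcli ls /backstores
  cat /etc/target/saveconfig.json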


What tool did you use to test performance and how did you run it?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Negative amount of objects degraded

2014-10-30 Thread Mike Dawson

Erik,

I reported a similar issue 22 months ago. I don't think any developer 
has ever really prioritized these issues.


http://tracker.ceph.com/issues/3720

I was able to recover that cluster. The method I used is in the 
comments. I have no idea if my cluster was broken for the same reason as 
yours. Your results may vary.


- Mike Dawson


On 10/30/2014 4:50 PM, Erik Logtenberg wrote:

Thanks for pointing that out. Unfortunately, those tickets contain only
a description of the problem, but no solution or workaround. One was
opened 8 months ago and the other more than a year ago. No love since.

Is there any way I can get my cluster back in a healthy state?

Thanks,

Erik.


On 10/30/2014 05:13 PM, John Spray wrote:

There are a couple of open tickets about bogus (negative) stats on PGs:
http://tracker.ceph.com/issues/5884
http://tracker.ceph.com/issues/7737

Cheers,
John

On Thu, Oct 30, 2014 at 12:38 PM, Erik Logtenberg  wrote:

Hi,

Yesterday I removed two OSDs, to replace them with new disks. Ceph was
not able to fully return to an all active+clean state; some degraded
objects remain. However, the amount of degraded objects is negative
(-82), see below:

2014-10-30 13:31:32.862083 mon.0 [INF] pgmap v209175: 768 pgs: 761
active+clean, 7 active+remapped; 1644 GB data, 2524 GB used, 17210 GB /
19755 GB avail; 2799 B/s wr, 1 op/s; -82/1439391 objects degraded (-0.006%)

According to "rados df", the -82 degraded objects are part of the
cephfs-data-cache pool, which is an SSD-backed replicated pool that
functions as a cache pool for an HDD-backed erasure coded pool for cephfs.

The cache should be empty, because I issued the "rados
cache-flush-evict-all" command, and "rados -p cephfs-data-cache ls"
indeed shows zero objects in this pool.
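
(For completeness, the commands in question -- a sketch using the pool name above:)

  rados -p cephfs-data-cache cache-flush-evict-all
  rados -p cephfs-data-cache ls | wc -l    # returns 0
  rados df | grep cephfs-data-cache        # still reports 192 objects / -82 degraded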

"rados df" however does show 192 objects for this pool, with just 35KB
used and -82 degraded:

pool name          category   KB   objects   clones   degraded   unfound   rd     rd KB    wr        wr KB
cephfs-data-cache  -          35   192       0        -82        0         1119   348800   1198371   1703673493

Please advise...

Thanks,

Erik.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Poor RBD performance as LIO iSCSI target

2014-11-13 Thread Mike Christie
On 11/13/2014 10:17 AM, David Moreau Simard wrote:
> Running into weird issues here as well in a test environment. I don't have a 
> solution either but perhaps we can find some things in common..
> 
> Setup in a nutshell:
> - Ceph cluster: Ubuntu 14.04, Kernel 3.16.7, Ceph 0.87-1 (OSDs with separate 
> public/cluster network in 10 Gbps)
> - iSCSI Proxy node (targetcli/LIO): Ubuntu 14.04, Kernel 3.16.7, Ceph 0.87-1 
> (10 Gbps)
> - Client node: Ubuntu 12.04, Kernel 3.11 (10 Gbps)
> 
> Relevant cluster config: Writeback cache tiering with NVMe PCI-E cards (2 
> replicas) in front of an erasure coded pool (k=3,m=2) backed by spindles.
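> (For reference, a tier like this is wired up roughly as follows -- a sketch,
> pool and profile names are made up:)
> 
>   ceph osd erasure-code-profile set ecprofile k=3 m=2
>   ceph osd pool create cold-ec 2048 2048 erasure ecprofile
>   ceph osd pool create hot-nvme 2048
>   ceph osd pool set hot-nvme size 2
>   ceph osd tier add cold-ec hot-nvme
>   ceph osd tier cache-mode hot-nvme writeback
>   ceph osd tier set-overlay cold-ec hot-nvme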
> 
> I'm following the instructions here: 
> http://www.hastexo.com/resources/hints-and-kinks/turning-ceph-rbd-images-san-storage-devices
> No issues with creating and mapping a 100GB RBD image and then creating the 
> target.
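> (Roughly these steps -- a sketch, image name and IQN are invented:)
> 
>   rbd create --size 102400 rbd/iscsi-test        # 100 GB image
>   rbd map rbd/iscsi-test                         # appears as e.g. /dev/rbd0
>   targetcli /backstores/block create name=iscsi-test dev=/dev/rbd0
>   targetcli /iscsi create iqn.2014-11.com.example:iscsi-test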
> 
> I'm interested in finding out the overhead/performance impact of re-exporting 
> through iSCSI so the idea is to run benchmarks.
> Here's a fio test I'm trying to run on the client node on the mounted iscsi 
> device:
> fio --name=writefile --size=100G --filesize=100G --filename=/dev/sdu --bs=1M 
> --nrfiles=1 --direct=1 --sync=0 --randrepeat=0 --rw=write --refill_buffers 
> --end_fsync=1 --iodepth=200 --ioengine=libaio
> 
> The benchmark will eventually hang towards the end of the test for quite a 
> few seconds before completing.
> On the proxy node, the kernel complains with iscsi portal login timeout: 
> http://pastebin.com/Q49UnTPr and I also see irqbalance errors in syslog: 
> http://pastebin.com/AiRTWDwR
> 

You are hitting a different issue. German Anders is most likely correct
and you hit the rbd hang. That then caused the iSCSI/SCSI command to time
out, which caused the SCSI error handler to run. In your logs we see
the LIO error handler has received a task abort from the initiator and
that timed out which caused the escalation (iscsi portal login related
messages).
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Poor RBD performance as LIO iSCSI target

2014-12-08 Thread Mike Christie
lems. I want to run a few
>>>>>> more tests with different settings to see if I can reproduce your
>>>>>> problem. I will let you know if I find anything.
>>>>>>
>>>>>> If there is anything you would like me to try, please let me know.
>>>>>>
>>>>>> Nick
>>>>>>
>>>>>> -Original Message-
>>>>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
>>>>>> Of David Moreau Simard
>>>>>> Sent: 19 November 2014 10:48
>>>>>> To: Ramakrishna Nishtala (rnishtal)
>>>>>> Cc: ceph-users@lists.ceph.com; Nick Fisk
>>>>>> Subject: Re: [ceph-users] Poor RBD performance as LIO iSCSI target
>>>>>>
>>>>>> Rama,
>>>>>>
>>>>>> Thanks for your reply.
>>>>>>
>>>>>> My end goal is to use iSCSI (with LIO/targetcli) to export rbd block
>>>>>> devices.
>>>>>>
>>>>>> I was encountering issues with iSCSI which are explained in my
>>>>>> previous emails.
>>>>>> I ended up being able to reproduce the problem at will on various
>>>>>> Kernel and OS combinations, even on raw RBD devices - thus ruling out
>>>>>> the hypothesis that the problem was with iSCSI - it lies with Ceph instead.
>>>>>> I'm even running 0.88 now and the issue is still there.
>>>>>>
>>>>>> I haven't isolated the issue just yet.
>>>>>> My next tests involve disabling the cache tiering.
>>>>>>
>>>>>> I do have client krbd cache as well, i'll try to disable it too if
>>>>>> cache tiering isn't enough.
>>>>>> --
>>>>>> David Moreau Simard
>>>>>>
>>>>>>
>>>>>>> On Nov 18, 2014, at 8:10 PM, Ramakrishna Nishtala (rnishtal)
>>>>>>  wrote:
>>>>>>>
>>>>>>> Hi Dave
>>>>>>> Did you say iscsi only? The tracker issue does not say though.
>>>>>>> I am on giant, with both client and ceph on RHEL 7 and seems to work
>>>>>>> ok,
>>>>>> unless I am missing something here. RBD on baremetal with kmod-rbd
>>>>>> and caching disabled.
>>>>>>>
>>>>>>> [root@compute4 ~]# time fio --name=writefile --size=100G
>>>>>>> --filesize=100G --filename=/dev/rbd0 --bs=1M --nrfiles=1 --direct=1
>>>>>>> --sync=0 --randrepeat=0 --rw=write --refill_buffers --end_fsync=1
>>>>>>> --iodepth=200 --ioengine=libaio
>>>>>>> writefile: (g=0): rw=write, bs=1M-1M/1M-1M/1M-1M, ioengine=libaio,
>>>>>>> iodepth=200
>>>>>>> fio-2.1.11
>>>>>>> Starting 1 process
>>>>>>> Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/853.0MB/0KB /s] [0/853/0
>>>>>>> iops] [eta 00m:00s] ...
>>>>>>> Disk stats (read/write):
>>>>>>> rbd0: ios=184/204800, merge=0/0, ticks=70/16164931,
>>>>>>> in_queue=16164942, util=99.98%
>>>>>>>
>>>>>>> real1m56.175s
>>>>>>> user0m18.115s
>>>>>>> sys 0m10.430s
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Rama
>>>>>>>
>>>>>>>
>>>>>>> -Original Message-
>>>>>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
>>>>>>> Behalf Of David Moreau Simard
>>>>>>> Sent: Tuesday, November 18, 2014 3:49 PM
>>>>>>> To: Nick Fisk
>>>>>>> Cc: ceph-users@lists.ceph.com
>>>>>>> Subject: Re: [ceph-users] Poor RBD performance as LIO iSCSI target
>>>>>>>
>>>>>>> Testing without the cache tiering is the next test I want to do when
>>>>>>> I
>>>>>> have time..
>>>>>>>
>>>>>>> When it's hanging, there is no activity at all on the cluster.
>>>>>>> Nothing in "ceph -w", nothing in "ceph osd pool stats".
>>>>>>>
>>>>>>> I'll provide an update when I have a chance to test withou

Re: [ceph-users] Poor RBD performance as LIO iSCSI target

2014-12-08 Thread Mike Christie
Oh yeah, for the iscsi fio full write test, did you experiment with bs
and numjobs? For just 10 Gb iSCSI, I think numjobs > 1 (around 4 is when
I stop seeing benefits) and bs < 1MB (around 64K to 256K) works better.
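Something along these lines, for example (a sketch -- device path as in the
earlier tests):

  fio --name=writefile --size=10G --filename=/dev/sdu --bs=256k --numjobs=4 \
      --direct=1 --rw=write --ioengine=libaio --iodepth=64 --group_reporting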

On 12/08/2014 05:22 PM, Mike Christie wrote:
> Some distros have LIO setup by default to use 64 for the session wide
> (default_cmdsn_depth) so increase that (session wide means all LUs
> accessed through that session will be limited by total of 64 requests
> across all LUs).
> 
> If you are using linux on the initiator side increase
> node.session.queue_depth and node.session.cmds_max.
> 
> MTU should be increased too.
> 
> Unless you are using a single processor, changing the
> /sys/block/sdX/queue/rq_affinity to 2 might help if you are noticing
> only one CPU is getting bogged down with all the IO processing.
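> (Concretely, something like this -- a sketch, the values are only illustrative:)
> 
>   # open-iscsi initiator side, /etc/iscsi/iscsid.conf (then log out/in again):
>   node.session.queue_depth = 128
>   node.session.cmds_max = 1024
> 
>   # LIO side, raise the session-wide depth on the TPG:
>   targetcli /iscsi/<target_iqn>/tpg1 set attribute default_cmdsn_depth=128
> 
>   # spread block-layer completion work across CPUs on the initiator:
>   echo 2 > /sys/block/sdX/queue/rq_affinity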
> 
> For the random IO tests, the iscsi initiator is probably at some fault
> for the poor performance. It does not do well with small IO like 4K.
> However, here is probably something else in play, because it should not
> be as low as you are seeing.
> 
> For the iscsi+krbd setup, is the storage being used the WD disks in the
> RAID0 config?
> 
> 
> On 12/08/2014 01:21 PM, David Moreau Simard wrote:
>> Not a bad idea.. which reminds me there might be some benefits to toying 
>> with the MTU settings as well..
>>
>> I'll check when I have a chance.
>> --
>> David Moreau Simard
>>
>>> On Dec 8, 2014, at 2:13 PM, Nick Fisk  wrote:
>>>
>>> Hi David,
>>>
>>> This is a long shot, but have you checked the Max queue depth on the iscsi 
>>> side. I've got a feeling that lio might be set at 32 as default.
>>>
>>> This would definitely have an effect at the high queue depths you are 
>>> testing with.
>>>
>>> On 8 Dec 2014 16:53, David Moreau Simard  wrote:
>>>>
>>>> Haven't tried other iSCSI implementations (yet).
>>>>
>>>> LIO/targetcli makes it very easy to implement/integrate/wrap/automate 
>>>> around so I'm really trying to get this right.
>>>>
>>>> PCI-E SSD cache tier in front of spindles-backed erasure coded pool in 10 
>>>> Gbps across the board yields results slightly better or very similar to 
>>>> two spindles in hardware RAID-0 with writeback caching.
>>>> With that in mind, the performance is not outright awful by any means, 
>>>> there's just a lot of overhead we have to be reminded about.
>>>>
>>>> What I'd like to further test but am unable to right now is to see what 
>>>> happens if you scale up the cluster. Right now I'm testing on only two 
>>>> nodes.
>>>> Do the IOPS scale linearly with an increasing number of OSDs/servers? Or 
>>>> is it more about capacity?
>>>>
>>>> Perhaps if someone else can chime in, I'm really curious.
>>>> --
>>>> David Moreau Simard
>>>>
>>>>> On Dec 6, 2014, at 11:18 AM, Nick Fisk  wrote:
>>>>>
>>>>> Hi David,
>>>>>
>>>>> Very strange, but  I'm glad you managed to finally get the cluster working
>>>>> normally. Thank you for posting the benchmarks figures, it's interesting 
>>>>> to
>>>>> see the overhead of LIO over pure RBD performance.
>>>>>
>>>>> I should have the hardware for our cluster up and running early next 
>>>>> year, I
>>>>> will be in a better position to test the iSCSI performance then. I will
>>>>> report back once I have some numbers.
>>>>>
>>>>> Just out of interest, have you tried any of the other iSCSI 
>>>>> implementations
>>>>> to see if they show the same performance drop?
>>>>>
>>>>> Nick
>>>>>
>>>>> -Original Message-
>>>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>>>>> David Moreau Simard
>>>>> Sent: 05 December 2014 16:03
>>>>> To: Nick Fisk
>>>>> Cc: ceph-users@lists.ceph.com
>>>>> Subject: Re: [ceph-users] Poor RBD performance as LIO iSCSI target
>>>>>
>>>>> I've flushed everything - data, pools, configs and reconfigured the whole
>>>
