Re: [ceph-users] disk timeouts in libvirt/qemu VMs...

2017-03-30 Thread Peter Maloney
On 03/28/17 17:28, Brian Andrus wrote:
> Just adding some anecdotal input. It likely won't be ultimately
> helpful other than a +1..
>
> Seemingly, we also have the same issue since enabling exclusive-lock
> on images. We experienced these messages at a large scale when making
> a CRUSH map change a few weeks ago that resulted in many many VMs
> experiencing the blocked task kernel messages, requiring reboots.
>
> We've since disabled on all images we can, but there are still
> jewel-era instances that cannot have the feature disabled. Since
> disabling the feature, I have not observed any cases of blocked tasks,
> but so far given the limited timeframe I'd consider that anecdotal.
>
>

Why do you need it enabled on the jewel-era instances? With jewel you can
change the features on the fly, and live migrate the VM to get the client to
pick up the change.
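
For reference, disabling the features on a live image is just something like
the following (the pool/image name is only a placeholder):

  # fast-diff and object-map depend on exclusive-lock, so drop them first
  rbd feature disable rbd/vm-disk-01 fast-diff
  rbd feature disable rbd/vm-disk-01 object-map
  rbd feature disable rbd/vm-disk-01 exclusive-lock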

I couldn't find any difference except removing big images is faster with
object-map (which depends on exclusive-lock). So I can't imagine why it
can be required.

And how long did you test it? I tested it a few weeks ago for about a
week, with no hangs; normally there are hangs after a few days. I have had it
permanently disabled since the 20th, without any hangs since, and I'm
gradually adding back the VMs that used to die while the feature was enabled,
starting with the worst offenders. Even with that short a window, I'm already
quite convinced.

And did you test other features? I suspected exclusive-lock, so I only
tested removing that one, which required removing object-map and
fast-diff too, so I didn't test those 2 separately.


[ceph-users] FreeBSD port net/ceph-devel released

2017-03-30 Thread Willem Jan Withagen
Hi,

I'm pleased to announce that my efforts to port to FreeBSD have resulted
in a ceph-devel port commit in the ports tree.

https://www.freshports.org/net/ceph-devel/

I'd like to thank everybody who helped me by answering my questions,
fixing my mistakes, and undoing my Git mess. Sage, Kefu and
Haomei in particular gave a lot of support.

The next step will be to release a net/ceph port when the
'Luminous' version is officially released.

In the meantime I'll be updating the ceph-devel port to keep it closer to the
current state of affairs.

Thanx,
--WjW


[ceph-users] Question about unfound objects

2017-03-30 Thread Steve Taylor
We've had a couple of puzzling experiences recently with unfound
objects, and I wonder if anyone can shed some light.

This happened with Hammer 0.94.7 on a cluster with 1,309 OSDs. Our use
case is exclusively RBD in this cluster, so it's naturally replicated.
The rbd pool size is 3, min_size is 2. The crush map is flat, so each
host is a failure domain. The OSD hosts are 4U Supermicro chassis with
32 OSDs each. Drive failures have caused the OSD count to be 1,309
instead of 1,312.

Twice in the last few weeks we've experienced issues where the cluster
was HEALTH_OK but was frequently getting some blocked requests. In each
of the two occurrences we investigated and discovered that the blocked
requests resulted from two drives in the same host that were
misbehaving (different set of 2 drives in each occurrence). We decided
to remove the misbehaving OSDs and let things backfill to see if that
would address the issue. Removing the drives resulted in a small number
of unfound objects, which was surprising. We were able to add the OSDs
back with 0 weight and recover the unfound objects in both cases, but
removing two OSDs from a single failure domain shouldn't have resulted
in unfound objects in an otherwise healthy cluster, correct?







[ceph-users] radosgw leaking objects

2017-03-30 Thread Luis Periquito
I have a cluster that has been leaking objects in radosgw and I've
upgraded it to 10.2.6.

After that I ran
radosgw-admin orphans find --pool=default.rgw.buckets.data --job-id=orphans

which found a bunch of objects. And ran
radosgw-admin orphans finish --pool=default.rgw.buckets.data --job-id=orphans

which returned quickly, but no real space was recovered. I ran the orphans
find again and it still reported quite a few leaked objects.

Shouldn't this be working, and finding/deleting all the leaked objects?

If it helps, this cluster has a cache tiering setup... I've run a
cache-flush-evict-all on the cache tier pool, but no change...

Should I open a bug for this? The indications on
http://tracker.ceph.com/issues/18331 and
http://tracker.ceph.com/issues/18258 suggest this was fixed in jewel
10.2.6...

thanks


Re: [ceph-users] Question about unfound objects

2017-03-30 Thread Nick Fisk
Hi Steve,

 

If you can recreate it, or if you can remember the object names, it might be worth 
trying to run "ceph osd map" on the objects and see
where it thinks they map to. And/or maybe a pg query might show something?
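
E.g. something along these lines; the pool, object and PG names are just
placeholders:

  ceph osd map <pool> <object-name>   # shows the PG and the acting set for that object
  ceph pg <pgid> query                # the recovery_state section may hint at which peers were probed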

 

Nick

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Steve 
Taylor
Sent: 30 March 2017 16:24
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Question about unfound objects

 

We've had a couple of puzzling experiences recently with unfound
objects, and I wonder if anyone can shed some light.

This happened with Hammer 0.94.7 on a cluster with 1,309 OSDs. Our use
case is exclusively RBD in this cluster, so it's naturally replicated.
The rbd pool size is 3, min_size is 2. The crush map is flat, so each
host is a failure domain. The OSD hosts are 4U Supermicro chassis with
32 OSDs each. Drive failures have caused the OSD count to be 1,309
instead of 1,312.

Twice in the last few weeks we've experienced issues where the cluster
was HEALTH_OK but was frequently getting some blocked requests. In each
of the two occurrences we investigated and discovered that the blocked
requests resulted from two drives in the same host that were
misbehaving (different set of 2 drives in each occurrence). We decided
to remove the misbehaving OSDs and let things backfill to see if that
would address the issue. Removing the drives resulted in a small number
of unfound objects, which was surprising. We were able to add the OSDs
back with 0 weight and recover the unfound objects in both cases, but
removing two OSDs from a single failure domain shouldn't have resulted
in unfound objects in an otherwise healthy cluster, correct?




Re: [ceph-users] Question about unfound objects

2017-03-30 Thread Steve Taylor
Good suggestion, Nick. I actually did that at the time. The "ceph osd map" 
wasn't all that interesting because the OSDs had been outed and their PGs had 
been mapped to new OSDs. Everything appeared to be in order with the PGs being 
mapped to the right number of new OSDs. The PG mappings looked fine, but the 
objects just didn't exist anywhere except on the OSDs that had been marked out.

The PG queries were a little more useful, but still didn't really help in the 
end. In all cases (unfound objects from 2 OSDs in each of 2 occurrences), the 
PGs showed 5 or so OSDs where they thought the unfound objects might be, one of 
which was an OSD that had been marked out. In both cases we even waited until 
backfilling completed to see if perhaps the missing objects would turn up 
somewhere else, but none ever did.

In the first instance we were simply able to reattach the 2 OSDs to the cluster 
with 0 weight and recover the unfound objects. The second instance involved 
drive problems and was a little bit trickier. The drives had experienced errors 
and the XFS filesystems had both become corrupt and wouldn't even mount. We 
didn't have any spare drives large enough, so I ended up using dd, ignoring 
errors, to copy the disks to RBDs in a different Ceph cluster. I then kernel 
mapped the RBDs on the host with the failed drives, ran XFS repairs on them, 
mouted them to the OSD directories, started the OSDs, and put them back in the 
cluster with 0 weight. I was lucky enough that those objects were available and 
they were recovered. Of course I immediately removed those OSDs once the 
unfound objects cleared up.
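
Roughly, the salvage steps were along these lines (device, pool and image
names are just placeholders):

  # on the other cluster: an image big enough to hold the failed 8TB disk
  rbd create rescue/osd-123 --size 8388608             # size in MB on hammer-era rbd
  rbd map rescue/osd-123                                # e.g. appears as /dev/rbd0
  dd if=/dev/sdX of=/dev/rbd0 bs=1M conv=noerror,sync   # keep going past read errors
  xfs_repair /dev/rbd0
  mount /dev/rbd0 /var/lib/ceph/osd/ceph-123
  # then start the OSD (init-system dependent) and keep its crush weight at 0
  ceph osd crush reweight osd.123 0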

That's the other interesting aspect of this problem. This cluster had 4TB HGST 
drives for its OSDs, but we had to expand it fairly urgently and didn't have 
enough drives. We added two new servers, each with 16 4TB drives and 16 8TB 
HGST He8 drives. In both instances the problems we encountered were with the 
8TB drives. We have since acquired more 4TB drives and have replaced all of the 
8TB drives in the cluster. We have a total of 8 production clusters globally 
and have been running Ceph in production for 2 years. These two occurrences 
recently are the only times we've seen these types of issues, and it was 
exclusive to the 8TB OSDs. I'm not sure how that would cause such a problem, 
but it's an interesting data point.

On Thu, 2017-03-30 at 17:33 +0100, Nick Fisk wrote:
Hi Steve,

If you can recreate or if you can remember the object name, it might be worth 
trying to run “ceph osd map” on the objects and see where it thinks they map 
to. And/or maybe pg query might show something?

Nick






From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Steve 
Taylor
Sent: 30 March 2017 16:24
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Question about unfound objects

We've had a couple of puzzling experiences recently with unfound
objects, and I wonder if anyone can shed some light.

This happened with Hammer 0.94.7 on a cluster with 1,309 OSDs. Our use
case is exclusively RBD in this cluster, so it's naturally replicated.
The rbd pool size is 3, min_size is 2. The crush map is flat, so each
host is a failure domain. The OSD hosts are 4U Supermicro chassis with
32 OSDs each. Drive failures have caused the OSD count to be 1,309
instead of 1,312.

Twice in the last few weeks we've experienced issues where the cluster
was HEALTH_OK but was frequently getting some blocked requests. In each
of the two occurrences we investigated and discovered that the blocked
requests resulted from two drives in the same host that were
misbehaving (different set of 2 drives in each occurrence). We decided
to remove the misbehaving OSDs and let things backfill to see if that
would address the issue. Removing the drives resulted in a small number
of unfound objects, which was surprising. We were able to add the OSDs
back with 0 weight and recover the unfound objects in both cases, but
removing two OSDs from a single failure domain shouldn't have resulted
in unfound objects in an otherwise healthy cluster, correct?


Re: [ceph-users] FreeBSD port net/ceph-devel released

2017-03-30 Thread kefu chai
On Thu, Mar 30, 2017 at 7:56 PM, Willem Jan Withagen  wrote:
> Hi,
>
> I'm pleased to announce that my efforts to port to FreeBSD have resulted
> in a ceph-devel port commit in the ports tree.
>
> https://www.freshports.org/net/ceph-devel/
>
> I'd like to thank everybody that helped me by answering my questions,
> fixing my mistakes, undoing my Git mess. Especially Sage, Kefu and
> Haomei gave a lot of support
>
> Next release step will be to release an net/ceph port when the
> 'Luminous' version goes officially in release.
>
> In the meantime I'll be updating the ceph-devel port to a more current
> state of affairs

Great job, Willem!


-- 
Regards
Kefu Chai


Re: [ceph-users] Ceph OSD network with IPv6 SLAAC networks?

2017-03-30 Thread Richard Hesse
Thanks for the reply Wido! How do you handle IPv6 routes and routing with
IPv6 on public and cluster networks? You mentioned that your cluster
network is routed, so they will need routes to reach the other racks. But
you can't have more than 1 default gateway. Are you running a routing
protocol to handle that?

We're using classless static routes via DHCP on v4 to solve this problem,
and I'm curious what the v6 SLAAC equivalent was.

Thanks,
-richard

On Tue, Mar 28, 2017 at 8:30 AM, Wido den Hollander  wrote:

>
> > Op 27 maart 2017 om 21:49 schreef Richard Hesse <
> richard.he...@weebly.com>:
> >
> >
> > Has anyone run their Ceph OSD cluster network on IPv6 using SLAAC? I know
> > that ceph supports IPv6, but I'm not sure how it would deal with the
> > address rotation in SLAAC, permanent vs outgoing address, etc. It would
> be
> > very nice for me, as I wouldn't have to run any kind of DHCP server or
> use
> > static addressing -- just configure RA's and go.
> >
>
> Yes, I do in many clusters. Works fine! SLAAC doesn't generate random
> addresses which change over time. That's a feature called 'Privacy
> Extensions' and is controlled on Linux by:
>
> - net.ipv6.conf.all.use_tempaddr
> - net.ipv6.conf.default.use_tempaddr
> - net.ipv6.conf.X.use_tempaddr
>
> Set this to 0 and the kernel will generate one address based on the
> MAC-Address (EUI64) of the interface. This address is stable and will not
> change.
>
> I like this very much as I don't have any static or complex network
> configurations on the hosts. It moves the whole responsibility of
> networking and addresses to the network. A host just boots and obtains an IP.
>
> The OSDs contact the MONs on boot and they will tell them their address.
> OSDs do not need a fixed address for Ceph.
>
> However, using SLAAC without Privacy Extensions means that in practice the
> address of a machine will not change, so you don't need to worry about it
> that much.
>
> The biggest system I have running this way is 400 nodes running IPv6-only.
> 10 racks, 40 nodes per rack. Each rack has a Top-of-Rack switch running in
> Layer 3 and a /64 is assigned per rack.
>
> Layer 3 routing is used between the racks, so based on the IPv6 address
> we can even determine in which rack the host/OSD is.
>
> Layer 2 domains don't expand over racks which makes a rack a true failure
> domain in our case.
>
> Wido
>
> > On that note, does anyone have any experience with running ceph in a
> mixed
> > v4 and v6 environment?
> >
> > Thanks,
> > -richard
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
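
For reference, a minimal sketch of pinning the EUI-64 behaviour Wido describes
(eth0 is just a placeholder for the storage-facing interface):

  # /etc/sysctl.d/99-ceph-ipv6.conf -- disable SLAAC privacy extensions
  net.ipv6.conf.all.use_tempaddr = 0
  net.ipv6.conf.default.use_tempaddr = 0
  net.ipv6.conf.eth0.use_tempaddr = 0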


[ceph-users] How do I mix drive sizes in a CEPH cluster?

2017-03-30 Thread Adam Carheden
When mixing hard drives of different sizes, what are the advantages
and disadvantages of one big pool vs multiple pools with matching
drives within each pool?

-= Long Story =-
Using a mix of new and existing hardware, I'm going to end up with
10x8T HDD and 42x600G@15krpm HDD. I can distribute drives evenly among
5 nodes for the 8T drives and among 7 nodes for the 600G drives. All
drives will have journals on SSD. 2x10G LAG for the ceph network.
Usage will be rbd for VMs.

Is the following correct?

-= 1 big pool =-
* Should work fine, but performance is in question
* Smaller I/O could be inconsistent when under load. Normally small
writes will all go to the SSDs, but under load that saturates the SSDs,
smaller writes may be slower if the bits happen to be on the slower 8T
drives.
* Larger I/O should get the average performance of all drives,
assuming images are created with appropriate striping
* Rebuilds will be bottle-necked by the 8T drives

-= 2 pools with matching disks =-
* Should work fine
* Smaller I/O should be the same for both pools due to SSD journals
* Larger I/O will be faster for pool with 600G@15krpm drives due both
to drive speed and count
* Larger I/O will be slower for pool with 8T drives for the same reasons
* Rebuilds will be significantly faster on the 600G/42-drive pool

Is either configuration a bad idea, or is it just a matter of my
space/speed needs?

It should be possible to have 3 pools:
1) 8T only (slow pool)
2) 600G only (fast pool)
3) all OSDs (medium speed pool)
...but the rebuild would impact performance on the "fast" 600G drive
pool if an 8T drive failed, since the medium speed pool would be
rebuilding across all drives, correct?

Thanks
-- 
Adam Carheden


Re: [ceph-users] Question about unfound objects

2017-03-30 Thread Steve Taylor
One other thing to note with this experience is that we do a LOT of RBD snap 
trimming, like hundreds of millions of objects per day added to our snap_trimqs 
globally. All of the unfound objects in these cases were found on other OSDs in 
the cluster with identical contents, but associated with different snapshots. 
In other words, the file contents matched exactly, but the xattrs differed and 
the filenames indicated that the objects belonged to different snapshots.

Some of the unfound objects belonged to head, so I don't necessarily believe 
that they were in the process of being trimmed, but I imagine there is some 
possibility that this issue is related to snap trimming or deleting snapshots. 
Just more information...

On Thu, 2017-03-30 at 17:13 +, Steve Taylor wrote:

Good suggestion, Nick. I actually did that at the time. The "ceph osd map" 
wasn't all that interesting because the OSDs had been outed and their PGs had 
been mapped to new OSDs. Everything appeared to be in order with the PGs being 
mapped to the right number of new OSDs. The PG mappings looked fine, but the 
objects just didn't exist anywhere except on the OSDs that had been marked out.

The PG queries were a little more useful, but still didn't really help in the 
end. In all cases (unfound objects from 2 OSDs in each of 2 occurrences), the 
PGs showed 5 or so OSDs where they thought the unfound objects might be, one of 
which was an OSD that had been marked out. In both cases we even waited until 
backfilling completed to see if perhaps the missing objects would turn up 
somewhere else, but none ever did.

In the first instance we were simply able to reattach the 2 OSDs to the cluster 
with 0 weight and recover the unfound objects. The second instance involved 
drive problems and was a little bit trickier. The drives had experienced errors 
and the XFS filesystems had both become corrupt and wouldn't even mount. We 
didn't have any spare drives large enough, so I ended up using dd, ignoring 
errors, to copy the disks to RBDs in a different Ceph cluster. I then kernel 
mapped the RBDs on the host with the failed drives, ran XFS repairs on them, 
mounted them to the OSD directories, started the OSDs, and put them back in the 
cluster with 0 weight. I was lucky enough that those objects were available and 
they were recovered. Of course I immediately removed those OSDs once the 
unfound objects cleared up.

That's the other interesting aspect of this problem. This cluster had 4TB HGST 
drives for its OSDs, but we had to expand it fairly urgently and didn't have 
enough drives. We added two new servers, each with 16 4TB drives and 16 8TB 
HGST He8 drives. In both instances the problems we encountered were with the 
8TB drives. We have since acquired more 4TB drives and have replaced all of the 
8TB drives in the cluster. We have a total of 8 production clusters globally 
and have been running Ceph in production for 2 years. These two occurrences 
recently are the only times we've seen these types of issues, and it was 
exclusive to the 8TB OSDs. I'm not sure how that would cause such a problem, 
but it's an interesting data point.

On Thu, 2017-03-30 at 17:33 +0100, Nick Fisk wrote:
Hi Steve,

If you can recreate or if you can remember the object name, it might be worth 
trying to run “ceph osd map” on the objects and see where it thinks they map 
to. And/or maybe pg query might show something?

Nick






From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Steve 
Taylor
Sent: 30 March 2017 16:24
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Question about unfound objects

We've had a couple of puzzling experiences recently with unfound
objects, and I wonder if anyone can shed some light.

This happened with Hammer 0.94.7 on a cluster with 1,309 

Re: [ceph-users] Troubleshooting incomplete PG's

2017-03-30 Thread nokia ceph
Hello Brad,

Many thanks for the info :)

ENV:-- Kracken - bluestore - EC 4+1 - 5 node cluster : RHEL7

What is the status of the down+out osd? Only one OSD, osd.6, is down and out
of the cluster.
What role did/does it play? Most importantly, is it osd.6? Yes; due to an
underlying I/O error we removed this device from the cluster.

I put the parameter "osd_find_best_info_ignore_history_les = true" in
ceph.conf and found that those 22 PGs changed to "down+remapped". Now all
have reverted to the "remapped+incomplete" state.

#ceph pg stat 2> /dev/null
v2731828: 4096 pgs: 1 incomplete, 21 remapped+incomplete, 4074
active+clean; 268 TB data, 371 TB used, 267 TB / 638 TB avail

## ceph -s
2017-03-30 19:02:14.350242 7f8b0415f700 -1 WARNING: the following dangerous
and experimental features are enabled: bluestore,rocksdb
2017-03-30 19:02:14.366545 7f8b0415f700 -1 WARNING: the following dangerous
and experimental features are enabled: bluestore,rocksdb
cluster bd8adcd0-c36d-4367-9efe-f48f5ab5f108
 health HEALTH_ERR
22 pgs are stuck inactive for more than 300 seconds
22 pgs incomplete
22 pgs stuck inactive
22 pgs stuck unclean
 monmap e2: 5 mons at {au-adelaide=
10.50.21.24:6789/0,au-brisbane=10.50.21.22:6789/0,au-canberra=10.50.21.23:6789/0,au-melbourne=10.50.21.21:6789/0,au-sydney=10.50.21.20:6789/0
}
election epoch 180, quorum 0,1,2,3,4
au-sydney,au-melbourne,au-brisbane,au-canberra,au-adelaide
mgr active: au-adelaide
 osdmap e6506: 117 osds: 117 up, 117 in; 21 remapped pgs
flags sortbitwise,require_jewel_osds,require_kraken_osds
  pgmap v2731828: 4096 pgs, 1 pools, 268 TB data, 197 Mobjects
371 TB used, 267 TB / 638 TB avail
4074 active+clean
  21 remapped+incomplete
   1 incomplete


## ceph osd dump 2>/dev/null | grep cdvr
pool 1 'cdvr_ec' erasure size 5 min_size 4 crush_ruleset 1 object_hash
rjenkins pg_num 4096 pgp_num 4096 last_change 456 flags
hashpspool,nodeep-scrub stripe_width 65536

Inspecting affected PG *1.e4b*

# ceph pg dump 2> /dev/null | grep 1.e4b
1.e4b 50832  00 0   0 73013340821
1000610006 remapped+incomplete 2017-03-30 14:14:26.297098 3844'161662
 6506:325748 [113,66,15,73,103]113  [NONE,NONE,NONE,73,NONE]
  73 1643'139486 2017-03-21 04:56:16.683953 0'0 2017-02-21
10:33:50.012922

When I trigger below command.

#ceph pg force_create_pg 1.e4b
pg 1.e4b now creating, ok

It went into the creating state, but there has been no change since. Can you
explain why this PG shows null values after triggering "force_create_pg"?

]# ceph pg dump 2> /dev/null | grep 1.e4b
1.e4b 0  00 0   0   0
  00creating 2017-03-30 19:07:00.982178 0'0
 0:0 [] -1[]
  -1 0'0   0.00 0'0
  0.00

Then I triggered below command

# ceph pg  repair 1.e4b
Error EAGAIN: pg 1.e4b has no primary osd  --<<

Could you please provide answer for below queries.

1. How do we fix this "incomplete+remapped" PG issue? Here all OSDs are up
and running, and the affected OSD was marked out and removed from the cluster.
2. Will reducing min_size help? It is currently set to 4. Could you please
explain the impact of reducing min_size for the current EC 4+1 config?
3. Is there any procedure to safely remove an affected PG? As far as I
understand, the command for that is the one below.

===
#ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph --pgid 1.e4b --op
remove
===

Awaiting your suggestions on how to proceed.

Thanks






On Thu, Mar 30, 2017 at 7:32 AM, Brad Hubbard  wrote:

>
>
> On Thu, Mar 30, 2017 at 4:53 AM, nokia ceph 
> wrote:
> > Hello,
> >
> > Env:-
> > 5 node, EC 4+1 bluestore kraken v11.2.0 , RHEL7.2
> >
> > As part of our resillency testing with kraken bluestore, we face more
> PG's
> > were in incomplete+remapped state. We tried to repair each PG using
> "ceph pg
> > repair " still no luck. Then we planned to remove incomplete PG's
> > using below procedure.
> >
> >
> > #ceph health detail | grep  1.e4b
> > pg 1.e4b is remapped+incomplete, acting [2147483647,66,15,73,2147483647]
> > (reducing pool cdvr_ec min_size from 4 may help; search ceph.com/docs
> for
> > 'incomplete')
>
> "Incomplete Ceph detects that a placement group is missing information
> about
> writes that may have occurred, or does not have any healthy copies. If you
> see
> this state, try to start any failed OSDs that may contain the needed
> information."
>
> >
> > Here we shutdown the OSD's 66,15 and 73 then proceeded with below
> operation.
> >
> > #ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-135 --op
> list-pgs
> > #ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-135 --pgid
> 1.e4b
> > --op remove
> >
> > Please confirm that we are following the correct procedure to rem

Re: [ceph-users] Question about unfound objects

2017-03-30 Thread Nick Fisk
That’s interesting. The only time I have experienced unfound objects has also
been related to snapshots and highly likely snap trimming. I had a number of
OSDs start flapping under the load of snap trimming, and 2 of them on the same
host died with an assert.

 

From memory the unfound objects related to objects that had been trimmed, so
I could just delete them. I assume that when the PG is remapped/recovered, as
the objects have already been removed on the other OSDs, it tries to roll back
the transaction and fails, hence it wants the now-down OSDs to try and roll
back?
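
For reference, dropping unfound objects that only referenced already-trimmed
snaps is something like the following (the PG id is a placeholder):

  ceph pg <pgid> mark_unfound_lost delete   # or 'revert' to fall back to a prior version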

 

From: Steve Taylor [mailto:steve.tay...@storagecraft.com] 
Sent: 30 March 2017 20:07
To: n...@fisk.me.uk; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Question about unfound objects

 

One other thing to note with this experience is that we do a LOT of RBD snap 
trimming, like hundreds of millions of objects per day added to our snap_trimqs 
globally. All of the unfound objects in these cases were found on other OSDs in 
the cluster with identical contents, but associated with different snapshots. 
In other words, the file contents matched exactly, but the xattrs differed and 
the filenames indicated that the objects belonged to different snapshots.

 

Some of the unfound objects belonged to head, so I don't necessarily believe 
that they were in the process of being trimmed, but I imagine there is some 
possibility that this issue is related to snap trimming or deleting snapshots. 
Just more information...

 

On Thu, 2017-03-30 at 17:13 +, Steve Taylor wrote:

Good suggestion, Nick. I actually did that at the time. The "ceph osd map" 
wasn't all that interesting because the OSDs had been outed and their PGs had 
been mapped to new OSDs. Everything appeared to be in order with the PGs being 
mapped to the right number of new OSDs. The PG mappings looked fine, but the 
objects just didn't exist anywhere except on the OSDs that had been marked out.

 

The PG queries were a little more useful, but still didn't really help in the 
end. In all cases (unfound objects from 2 OSDs in each of 2 occurrences), the 
PGs showed 5 or so OSDs where they thought the unfound objects might be, one of 
which was an OSD that had been marked out. In both cases we even waited until 
backfilling completed to see if perhaps the missing objects would turn up 
somewhere else, but none ever did.

 

In the first instance we were simply able to reattach the 2 OSDs to the cluster 
with 0 weight and recover the unfound objects. The second instance involved 
drive problems and was a little bit trickier. The drives had experienced errors 
and the XFS filesystems had both become corrupt and wouldn't even mount. We 
didn't have any spare drives large enough, so I ended up using dd, ignoring 
errors, to copy the disks to RBDs in a different Ceph cluster. I then kernel 
mapped the RBDs on the host with the failed drives, ran XFS repairs on them, 
mouted them to the OSD directories, started the OSDs, and put them back in the 
cluster with 0 weight. I was lucky enough that those objects were available and 
they were recovered. Of course I immediately removed those OSDs once the 
unfound objects cleared up.

 

That's the other interesting aspect of this problem. This cluster had 4TB HGST 
drives for its OSDs, but we had to expand it fairly urgently and didn't have 
enough drives. We added two new servers, each with 16 4TB drives and 16 8TB 
HGST He8 drives. In both instances the problems we encountered were with the 
8TB drives. We have since acquired more 4TB drives and have replaced all of the 
8TB drives in the cluster. We have a total of 8 production clusters globally 
and have been running Ceph in production for 2 years. These two occurrences 
recently are the only times we've seen these types of issues, and it was 
exclusive to the 8TB OSDs. I'm not sure how that would cause such a problem, 
but it's an interesting data point.

 

On Thu, 2017-03-30 at 17:33 +0100, Nick Fisk wrote:

Hi Steve,

 

If you can recreate or if you can remember the object name, it might be worth 
trying to run “ceph osd map” on the objects and see where it thinks they map 
to. And/or maybe pg query might show something?

 

Nick

 



Re: [ceph-users] How to mount different ceph FS using ceph-fuse or kernel cephfs mount

2017-03-30 Thread Deepak Naidu
Hi John, any idea what's wrong? Any info is appreciated.
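
For reference, the two invocations in question look roughly like this (monitor
address, secret file and mount point are placeholders):

  ceph-fuse --client_mds_namespace=dataX /mnt/dataX
  mount -t ceph 10.0.0.1:6789:/ /mnt/dataX -o name=admin,secretfile=/etc/ceph/admin.secret,mds_namespace=dataX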

--
Deepak

-Original Message-
From: Deepak Naidu 
Sent: Thursday, March 23, 2017 2:20 PM
To: John Spray
Cc: ceph-users
Subject: RE: [ceph-users] How to mount different ceph FS using ceph-fuse or 
kernel cephfs mount

Fixing typo

 What version of ceph-fuse?
ceph-fuse-10.2.6-0.el7.x86_64

--
Deepak

-Original Message-
From: Deepak Naidu
Sent: Thursday, March 23, 2017 9:49 AM
To: John Spray
Cc: ceph-users
Subject: Re: [ceph-users] How to mount different ceph FS using ceph-fuse or 
kernel cephfs mount

>> What version of ceph-fuse?

I have ceph-common-10.2.6-0 ( on CentOS 7.3.1611)
 
--
Deepak

>> On Mar 23, 2017, at 6:28 AM, John Spray  wrote:
>> 
>> On Wed, Mar 22, 2017 at 3:30 PM, Deepak Naidu  wrote:
>> Hi John,
>> 
>> 
>> 
>> I tried the below option for ceph-fuse & kernel mount. Below is what 
>> I see/error.
>> 
>> 
>> 
>> 1)  When trying using ceph-fuse, the mount command succeeds but I see
>> parse error setting 'client_mds_namespace' to 'dataX' .  Not sure if 
>> this is normal message or some error
> 
> What version of ceph-fuse?
> 
> John
> 
>> 
>> 2)  When trying the kernel mount, the mount command just hangs & after
>> few seconds I see mount error 5 = Input/output error. I am using 
>> 4.9.15-040915-generic kernel on Ubuntu 16.x
>> 
>> 
>> 
>> --
>> 
>> Deepak
>> 
>> 
>> 
>> -Original Message-
>> From: John Spray [mailto:jsp...@redhat.com]
>> Sent: Wednesday, March 22, 2017 6:16 AM
>> To: Deepak Naidu
>> Cc: ceph-users
>> Subject: Re: [ceph-users] How to mount different ceph FS using 
>> ceph-fuse or kernel cephfs mount
>> 
>> 
>> 
>>> On Tue, Mar 21, 2017 at 5:31 PM, Deepak Naidu  wrote:
>>> 
>>> Greetings,
>> 
>> 
>> 
>> 
>>> I have below two cephFS "volumes/filesystem" created on my ceph
>> 
>>> cluster. Yes I used the "enable_multiple" flag to enable the 
>>> multiple
>> 
>>> cephFS feature. My question
>> 
>> 
>> 
>> 
>>> 1)  How do I mention the fs name ie dataX or data1 during cephFS mount
>> 
>>> either using kernel mount of ceph-fuse mount.
>> 
>> 
>> 
>> The option for ceph_fuse is --client_mds_namespace=dataX (you can do 
>> this on the command line or in your ceph.conf)
>> 
>> 
>> 
>> With the kernel client use "-o mds_namespace=DataX" (assuming you 
>> have a sufficiently recent kernel)
>> 
>> 
>> 
>> Cheers,
>> 
>> John
>> 
>> 
>> 
>> 
>>> 2)  When using kernel / ceph-fuse how do I mention dataX or data1
>>> during
>> 
>>> the fuse mount or kernel mount
>> 
>> 
>> 
>> 
>> 
>> 
>>> [root@Admin ~]# ceph fs ls
>> 
>> 
>>> name: dataX, metadata pool: rcpool_cepfsMeta, data pools:
>> 
>>> [rcpool_cepfsData ]
>> 
>> 
>>> name: data1, metadata pool: rcpool_cepfsMeta, data pools:
>> 
>>> [rcpool_cepfsData ]
>> 
>> 
>> 
>> 
>> 
>> 
>>> --
>> 
>> 
>>> Deepak
>> 
>> 
>>> 
>> 
>>> This email message is for the sole use of the intended recipient(s)
>> 
>>> and may contain confidential information.  Any unauthorized review,
>> 
>>> use, disclosure or distribution is prohibited.  If you are not the
>> 
>>> intended recipient, please contact the sender by reply email and
>> 
>>> destroy all copies of the original message.
>> 
>>> 
>> 
>> 
>>> ___
>> 
>>> ceph-users mailing list
>> 
>>> ceph-users@lists.ceph.com
>> 
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
>>> 


Re: [ceph-users] Posix AIO vs libaio read performance

2017-03-30 Thread Xavier Trilla
Hi Alexandre,

But can we use aio=native with a librbd volume, or will it simply be ignored
by QEMU? (My understanding is that for networked volumes, like Ceph,
aio=native doesn't make a difference and it can only be used with raw
local disks.)
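
For context, what I mean is the difference between these two qemu -drive lines
(the image name is a placeholder; I'm not sure the aio flag does anything at
all for an rbd: backend):

  -drive file=rbd:rbd/vm-disk-01,format=raw,if=virtio,cache=writeback
  -drive file=rbd:rbd/vm-disk-01,format=raw,if=virtio,cache=none,aio=native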

Thanks!
Xavi

-Original Message-
From: Alexandre DERUMIER [mailto:aderum...@odiso.com] 
Sent: Saturday, 11 March 2017 7:25
To: Xavier Trilla 
CC: ceph-users 
Subject: Re: [ceph-users] Posix AIO vs libaio read performance

>>Regarding rbd cache, is something I will try -today I was thinking about it- 
>>but I did not try it yet because I don't want to reduce write speed.

Note that rbd_cache only works for sequential writes, so it doesn't help with
random writes.

Also, internally, QEMU forces aio=threads when cache=writeback is enabled, but
can use aio=native with cache=none.



- Original message -
From: "Xavier Trilla" 
To: "aderumier" 
Cc: "ceph-users" 
Sent: Friday, 10 March 2017 14:12:59
Subject: Re: [ceph-users] Posix AIO vs libaio read performance

Hi Alexandre, 

Debugging is disabled in client and osds. 

Regarding rbd cache, it is something I will try -today I was thinking about it-
but I have not tried it yet because I don't want to reduce write speed.

I also tried iothreads, but no benefit. 

I tried as well with virtio-blk and virtio-scsi, there is a small improvement 
with virtio-blk, but it's around a 10%. 

This is becoming quite a strange issue, as it only affects POSIX AIO read
performance. Nothing else seems to be affected -although POSIX AIO write is
nowhere near libaio performance-.

Thanks for your help; if you have any other ideas they will be really
appreciated.

Also if somebody could run in their cluster from inside a VM the following 
command: 



fio --name=randread-posix --output ./test --runtime 60 --ioengine=posixaio 
--buffered=0 --direct=1 --rw=randread --bs=4k --size=1024m --iodepth=32 



It would be really helpful to know if I'm the only one affected or this is 
happening in all qemu + ceph setups. 

Thanks! 
Xavier 

On 10 Mar 2017, at 8:07, Alexandre DERUMIER < [ mailto:aderum...@odiso.com |
aderum...@odiso.com ] > wrote:


>> But it still looks like there is some bottleneck in QEMU or librbd I cannot
>> manage to find.

You can improve latency on the client by disabling debug logging.

On your client, create a /etc/ceph/ceph.conf with:

[global]
debug asok = 0/0
debug auth = 0/0
debug buffer = 0/0
debug client = 0/0
debug context = 0/0
debug crush = 0/0
debug filer = 0/0
debug filestore = 0/0
debug finisher = 0/0
debug heartbeatmap = 0/0
debug journal = 0/0
debug journaler = 0/0
debug lockdep = 0/0
debug mds = 0/0
debug mds balancer = 0/0
debug mds locker = 0/0
debug mds log = 0/0
debug mds log expire = 0/0
debug mds migrator = 0/0
debug mon = 0/0
debug monc = 0/0
debug ms = 0/0
debug objclass = 0/0
debug objectcacher = 0/0
debug objecter = 0/0
debug optracker = 0/0
debug osd = 0/0
debug paxos = 0/0
debug perfcounter = 0/0
debug rados = 0/0
debug rbd = 0/0
debug rgw = 0/0
debug throttle = 0/0
debug timer = 0/0
debug tp = 0/0 


You can also disable the rbd cache (rbd_cache=false), or set cache=none in qemu.

Using an iothread on the qemu drive should help a little bit too.

- Original message -
From: "Xavier Trilla" < [ mailto:xavier.tri...@silicontower.net | 
xavier.tri...@silicontower.net ] >
To: "ceph-users" < [ mailto:ceph-users@lists.ceph.com | 
ceph-users@lists.ceph.com ] >
Sent: Friday, 10 March 2017 05:37:01
Subject: Re: [ceph-users] Posix AIO vs libaio read performance



Hi, 



We compiled Hammer .10 to use jemalloc and now the cluster performance improved 
a lot, but POSIX AIO operations are still quite slower than libaio. 



Now with a single thread read operations are about 1000 per second and write 
operations about 5000 per second. 



Using same FIO configuration, but libaio read operations are about 15K per 
second and writes 12K per second. 



I’m compiling QEMU with jemalloc support as well, and I’m planning to replace 
librbd in QEMU hosts to the new one using jemalloc. 



But it still looks like there is some bottleneck in QEMU or librbd that I cannot
manage to find.



Any help will be much appreciated. 



Thanks. 






From: ceph-users [ [ mailto:ceph-users-boun...@lists.ceph.com | 
mailto:ceph-users-boun...@lists.ceph.com ] ] On behalf of Xavier Trilla
Sent: Thursday, 9 March 2017 6:56
To: [ mailto:ceph-users@lists.ceph.com | ceph-users@lists.ceph.com ]
Subject: [ceph-users] Posix AIO vs libaio read performance




Hi, 



I’m trying to debug why there is a big difference using POSIX AIO and libaio 
when performing read tests from inside a VM using librbd. 



The results I’m getting using FIO are: 



POSIX AIO Read: 



Type: Random Read - IO Engine: POSIX AIO - Buffered: No - Direct: Yes - Block 
Size: 4KB - Disk Target: /: 



Average: 2.54 MB/s 

Average: 632 IOPS 



Libaio Read: 



Type: Random Read - IO Engine: Libaio - Buffered:

Re: [ceph-users] Posix AIO vs libaio read performance

2017-03-30 Thread Xavier Trilla
After some tests I just wanted to post my findings about this. Looks like for 
some reason POSIX AIO reads -at least using FIO- are not really asynchronous, 
as the results I'm getting are quite similar to using SYNC engine instead of 
POSIX AIO engine. 

The biggest improvement for this has been using jemalloc instead of TCMalloc.
It really improved the latency -and the CPU usage of the OSDs- but I don't
really get why POSIX AIO reads using FIO give such bad results, whereas writes
using POSIX AIO are a lot faster.

But as I've said, it looks like there is something wrong with FIO and POSIX AIO.
I've been checking, and it looks like the only way I can run a different POSIX
AIO test will be to write one myself -or have one of our developers do it-
(my developer days are way behind me...).

Anyway, at least this issue helped us improve the latency and overall
performance of our Ceph cluster by a huge margin :)

Thanks for all your help!
Xavi.

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On behalf of Xavier
Trilla
Sent: Friday, 10 March 2017 20:28
To: dilla...@redhat.com
CC: ceph-users 
Subject: Re: [ceph-users] Posix AIO vs libaio read performance

Hi Jason,


Just to add more information: 

- The issue doesn't seem to be fio or glibc (guest) related, as it is working 
properly on other environments using the same software versions. Also I've 
tried using Ubuntu 14.04 and 16.04 and I'm getting really similar results, but 
I'll run more tests just to be 100% sure.
- If I increase the number of concurrent jobs in fio (F.e. 16) results are much 
better (They get above 10k IOPS)
- I'm seeing similar bad results when using KRBD, but I still need to run more 
tests on this front (I'm using KRBD from inside a VM, because in our 
infrastructure getting your hands on a test physical machine it's quite 
difficult, but I'll manage. The VM has 10G connection, and I'm mounting the RBD 
volume from inside the VM using the kernel module -4.4- so the result should 
give an idea of how KRBD will perform)
- I'm not seeing improvements with librbd compiled with jemalloc support.
- No difference between QEMU 2.0, 2.5 or 2.7

Looks like it's related with an interaction of how POSIX AIO handles the direct 
reads and how Ceph works -but it could also be KVM related-. I could argue it's 
related with being a networked storage, but for example in other environments 
like Amazon EBS I'm not seeing this issue, but obviously I don't have any idea 
about EBS internals (But I guess that's what we are trying to match... if it 
works properly on EBS it should work properly on Ceph too ;) Also, I'm still 
trying to verify if this is just related to my setup or affects all ceph 
installations. 

One of the things I find more strange, is the performance difference in the 
read department. Libaio performance is way better in both read and write, but 
the biggest difference is between posix aio read and librbd read.

BTW: Do you have a test environment where you could test fio using posix aio? 
I've been running tests in our production and test cluster, but they run almost 
the same version (hammer) of everything :/ Maybe I'll try to deploy a new 
cluster using jewel -if I can get my hands on enough hardware-. Here are the 
command lines for FIO:

POSIX AIO:
fio --name=randread-posix --runtime 60 --ioengine=posixaio --buffered=0 
--direct=1 --rw=randread --bs=4k --size=1024m --iodepth=32

Libaio:

fio --name=randread-libaio --runtime 60 --ioengine=libaio --buffered=0 
--direct=1 --rw=randread --bs=4k --size=1024m --iodepth=32

Also thanks for the blktrace tip, on Monday I'll start playing with it and I'll 
post my findings.

Thanks!
Xavier

-Original Message-
From: Jason Dillaman [mailto:jdill...@redhat.com] Sent: Friday, 10 March
2017 19:18
To: Xavier Trilla 
CC: Alexandre DERUMIER ; ceph-users 

Subject: Re: [ceph-users] Posix AIO vs libaio read performance

librbd doesn't know that you are using libaio vs POSIX AIO. Therefore, the best 
bet is that the issue is in fio or glibc. As a first step, I would recommend 
using blktrace (or similar) within your VM to determine if there is a delta 
between libaio and POSIX AIO at the block level.
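
A rough sketch of doing that inside the guest (the device name and runtime are
placeholders):

  blktrace -d /dev/vda -o aio-trace -w 60    # capture 60s while the fio job runs
  blkparse -i aio-trace | less               # compare per-request timing for each engine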

On Fri, Mar 10, 2017 at 12:28 PM, Xavier Trilla 
 wrote:
> I disabled rbd cache but no improvement, just a huge performance drop 
> in writes (Which proves the cache was properly disabled).
>
>
>
> Now I’m working on two other fronts:
>
>
>
> -Using librbd with jemalloc in the Hypervisors (Hammer .10)
>
> -Compiling QEMU with jemalloc (QEMU 2.6)
>
> -Running some tests from a Bare Metal server using FIO tool, but it
> will use the librbd directly so no way to simulate POSIX AIO (Maybe 
> I’ll try via KRBD)
>
>
>
> I’m quite sure is something on the client side, but I don’t know 
> enough about the Ceph internals to totally discard the issue being related to 
> OSDs.
> But so far perfo

Re: [ceph-users] Posix AIO vs libaio read performance

2017-03-30 Thread Xavier Trilla
Hi Michal,

Yeah, it looks like there is something wrong with FIO and POSIX AIO, as the reads
don’t seem to be really asynchronous. I don’t know why this is happening -it
might be related to the parameters I’m using- but it is really bothering me.

What is bothering me even more is that with Amazon EBS volumes I get
considerably better results than with our Ceph cluster. We compiled Hammer with
jemalloc support and now we are getting much better results, but we are still
not there. Obviously I don’t have any idea how EBS works internally, but IMO
Ceph should get close to EBS, and I would love to identify the bottleneck -if
there is one-.

At least we really improved the latency with those changes, but I still have to
invest more time in this. Right now I’m busy with other projects -and the
improvements we managed to get are quite substantial- but I definitely want to
spend more time debugging this.

Thanks!
Xavi.


From: Michal Kozanecki [mailto:michal.kozane...@live.ca]
Sent: Saturday, 11 March 2017 1:36
To: dilla...@redhat.com; Xavier Trilla 
CC: ceph-users 
Subject: Re: [ceph-users] Posix AIO vs libaio read performance

Hi Xavier,

Are you sure this is due to Ceph? I get similar results on my bare-metal hosts
(no Ceph anywhere in sight) with posix-aio vs libaio:

POSIX-AIO on baremetal (E3-1240v2, Debian Jessie 8.7, Linux 4.9.13, S3500 80GB):
andread-posix: (groupid=0, jobs=1): err= 0: pid=4644: Fri Mar 10 19:26:23 2017
  read : io=1024.0MB, bw=21243KB/s, iops=5310, runt= 49361msec


LIBAIO on baremetal (E3-1240v2, Debian Jessie 8.7, Linux 4.9.13, S3500 80GB):
randread-libaio: (groupid=0, jobs=1): err= 0: pid=32712: Fri Mar 10 19:24:33 
2017
  read : io=1024.0MB, bw=272570KB/s, iops=68142, runt=  3847msec

Cheers,
--
Michal Kozanecki


On March 10, 2017 at 2:28:23 PM, Xavier Trilla 
(xavier.tri...@silicontower.net) wrote:
Hi Jason,


Just to add more information:

- The issue doesn't seem to be fio or glibc (guest) related, as it is working 
properly on other environments using the same software versions. Also I've 
tried using Ubuntu 14.04 and 16.04 and I'm getting really similar results, but 
I'll run more tests just to be 100% sure.
- If I increase the number of concurrent jobs in fio (F.e. 16) results are much 
better (They get above 10k IOPS)
- I'm seeing similar bad results when using KRBD, but I still need to run more 
tests on this front (I'm using KRBD from inside a VM, because in our 
infrastructure getting your hands on a test physical machine it's quite 
difficult, but I'll manage. The VM has 10G connection, and I'm mounting the RBD 
volume from inside the VM using the kernel module -4.4- so the result should 
give an idea of how KRBD will perform)
- I'm not seeing improvements with librbd compiled with jemalloc support.
- No difference between QEMU 2.0, 2.5 or 2.7

Looks like it's related with an interaction of how POSIX AIO handles the direct 
reads and how Ceph works -but it could also be KVM related-. I could argue it's 
related with being a networked storage, but for example in other environments 
like Amazon EBS I'm not seeing this issue, but obviously I don't have any idea 
about EBS internals (But I guess that's what we are trying to match... if it 
works properly on EBS it should work properly on Ceph too ;) Also, I'm still 
trying to verify if this is just related to my setup or affects all ceph 
installations.

One of the things I find more strange, is the performance difference in the 
read department. Libaio performance is way better in both read and write, but 
the biggest difference is between posix aio read and librbd read.

BTW: Do you have a test environment where you could test fio using posix aio? 
I've been running tests in our production and test cluster, but they run almost 
the same version (hammer) of everything :/ Maybe I'll try to deploy a new 
cluster using jewel -if I can get my hands on enough hardware-. Here are the 
command lines for FIO:

POSIX AIO:
fio --name=randread-posix --runtime 60 --ioengine=posixaio --buffered=0 
--direct=1 --rw=randread --bs=4k --size=1024m --iodepth=32

Libaio:

fio --name=randread-libaio --runtime 60 --ioengine=libaio --buffered=0 
--direct=1 --rw=randread --bs=4k --size=1024m --iodepth=32

Also thanks for the blktrace tip, on Monday I'll start playing with it and I'll 
post my findings.

Thanks!
Xavier

-Original Message-
From: Jason Dillaman [mailto:jdill...@redhat.com]
Sent: Friday, 10 March 2017 19:18
To: Xavier Trilla 
mailto:xavier.tri...@silicontower.net>>
CC: Alexandre DERUMIER mailto:aderum...@odiso.com>>; 
ceph-users mailto:ceph-users@lists.ceph.com>>
Subject: Re: [ceph-users] Posix AIO vs libaio read performance

librbd doesn't know that you are using libaio vs POSIX AIO. Therefore, the best 
bet is that the issue is in fio or glibc. As a first step, I would recommend 
using blktrace (or similar) within your VM to determine if there is a delta 
between libaio and POSIX AIO

Re: [ceph-users] How do I mix drive sizes in a CEPH cluster?

2017-03-30 Thread Xavier Trilla
My opinion: go for the 2 pools option. And try to use SSDs for journals. In our
tests HDDs and VMs don't really work well together (too many small IOs), but
obviously it depends on what the VMs are running.

Another option would be to have an SSD cache tier in front of the HDDs. That
would really help.
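
If you go the cache tier route, the basic wiring is something like the
following (pool names are placeholders, and you'd still want to set
target_max_bytes etc. before putting real load on it):

  ceph osd tier add slow-pool ssd-cache-pool
  ceph osd tier cache-mode ssd-cache-pool writeback
  ceph osd tier set-overlay slow-pool ssd-cache-pool
  ceph osd pool set ssd-cache-pool hit_set_type bloom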

But even with that, I would hesitate to have both slow and fast HDDs in the
same pool. As the slow HDDs are quite a bit bigger, you'll have to assign them a
fairly high weight, meaning plenty of PGs will end up there and you won't
really benefit from the fast HDDs you have.
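
A sketch of what the two-pool split looks like in pre-Luminous CRUSH terms
(bucket, rule and pool names are just examples, and hosts holding both drive
types would need to be split into two host buckets):

  ceph osd crush add-bucket fast-root root
  ceph osd crush add-bucket slow-root root
  ceph osd crush move node1-fast root=fast-root      # repeat per host bucket
  ceph osd crush move node1-slow root=slow-root
  ceph osd crush rule create-simple fast-rule fast-root host
  ceph osd crush rule create-simple slow-rule slow-root host
  ceph osd pool set fast-pool crush_ruleset <fast-rule-id>
  ceph osd pool set slow-pool crush_ruleset <slow-rule-id>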

Another option would be to use the 15k HDDs to cache the slow ones... but then
you'll lose plenty of space (whereas you could get better results with some
SSDs for a cache tier).

Cheers!
Xavi.


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On behalf of Adam
Carheden
Sent: Thursday, 30 March 2017 20:37
To: ceph-users 
Subject: [ceph-users] How do I mix drive sizes in a CEPH cluster?

When mixing hard drives of different sizes, what are the advantages and 
disadvantages of one big pool vs multiple pools with matching drives within 
each pool?

-= Long Story =-
Using a mix of new and existing hardware, I'm going to end up with 10x8T HDD 
and 42x600G@15krpm HDD. I can distribute drives evenly among
5 nodes for the 8T drives and among 7 nodes for the 600G drives. All drives 
will have journals on SSD. 2x10G LAG for the ceph network.
Usage will be rbd for VMs.

Is the following correct?

-= 1 big pool =-
* Should work fine, but performance is in question
* Smaller I/O could be inconsistent when under load. Normally small writes will 
all go to the SSDs, but under load that saturates the SSDs smaller writes may 
be slower if the bits happen to be on the slower 8T drives.
* Larger I/O should get the average performance off all drives assuming images 
are created with appropriate striping
* Rebuilds will be bottle-necked by the 8T drives

-= 2 pools with matching disks =-
* Should work fine
* Smaller I/O should be the same for both pools due to SSD journals
* Larger I/O will be faster for pool with 600G@15krpm drives due both to drive 
speed and count
* Larger I/O will be slower for pool with 8T drives for the same reasons
* Rebuilds will be significantly faster on the 600G/42-drive pool

Is either configuration a bad idea, or is it just a matter of my space/speed 
needs?

It should be possible to have 3 pools:
1) 8T only (slow pool)
2) 600G only (fast pool)
3) all OSDs (medium speed pool)
...but the rebuild would impact performance on the "fast" 600G drive pool if a 
8T drive failed since the medium speed pool would be rebuilding across all 
drives, correct?

Thanks
--
Adam Carheden


[ceph-users] Ceph Giant Repo problem

2017-03-30 Thread Vlad Blando
Hi Guys,

I encountered an issue installing the ceph package for Giant. Was there
a change somewhere, or am I using the wrong repo information?

ceph.repo
-
[Ceph]
name=Ceph packages for $basearch
baseurl=http://download.ceph.com/rpm-giant/rhel7/$basearch
enabled=1
priority=1
gpgcheck=1
type=rpm-md
gpgkey=https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc

[Ceph-noarch]
name=Ceph noarch packages
baseurl=http://download.ceph.com/rpm-giant/rhel7/noarch
enabled=1
priority=1
gpgcheck=1
type=rpm-md
gpgkey=https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc

[ceph-source]
name=Ceph source packages
baseurl=http://download.ceph.com/rpm-giant/rhel7/SRPMS
enabled=1
priority=1
gpgcheck=1
type=rpm-md
gpgkey=https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc



installation error

[root@ceph-test yum.repos.d]# yum install ceph
Loaded plugins: priorities
Ceph

  | 2.9 kB  00:00:00
Ceph-noarch

 | 2.9 kB  00:00:00
ceph-source

 | 2.9 kB  00:00:00
rhel-7-server-optional-rpms

 | 3.5 kB  00:00:00
rhel-7-server-rpms

  | 3.5 kB  00:00:00
(1/9): ceph-source/primary_db

 | 4.4 kB  00:00:00
(2/9): Ceph-noarch/primary_db

 | 4.1 kB  00:00:00
(3/9): Ceph/x86_64/primary_db

 |  61 kB  00:00:01
(4/9): rhel-7-server-optional-rpms/7Server/x86_64/group

 |  25 kB  00:00:01
(5/9): rhel-7-server-rpms/7Server/x86_64/group

  | 701 kB  00:00:05
(6/9): rhel-7-server-optional-rpms/7Server/x86_64/updateinfo

  | 1.3 MB  00:00:07
(7/9): rhel-7-server-rpms/7Server/x86_64/updateinfo

 | 1.8 MB  00:00:06
(8/9): rhel-7-server-optional-rpms/7Server/x86_64/primary_db

  | 5.0 MB  00:00:25
(9/9): rhel-7-server-rpms/7Server/x86_64/primary_db

 |  34 MB  00:00:50
9 packages excluded due to repository priority protections
Resolving Dependencies
--> Running transaction check
---> Package ceph.x86_64 1:0.87.2-0.el7.centos will be installed
--> Processing Dependency: python-ceph = 1:0.87.2-0.el7.centos for package:
1:ceph-0.87.2-0.el7.centos.x86_64
Package python-ceph is obsoleted by python-rados, but obsoleting package
does not provide for requirements
--> Processing Dependency: ceph-common = 1:0.87.2-0.el7.centos for package:
1:ceph-0.87.2-0.el7.centos.x86_64
--> Processing Dependency: libcephfs1 = 1:0.87.2-0.el7.centos for package:
1:ceph-0.87.2-0.el7.centos.x86_64
--> Processing Dependency: librbd1 = 1:0.87.2-0.el7.centos for package:
1:ceph-0.87.2-0.el7.centos.x86_64
--> Processing Dependency: librados2 = 1:0.87.2-0.el7.centos for package:
1:ceph-0.87.2-0.el7.centos.x86_64
--> Processing Dependency: libaio.so.1(LIBAIO_0.4)(64bit) for package:
1:ceph-0.87.2-0.el7.centos.x86_64
--> Processing Dependency: cryptsetup for package:
1:ceph-0.87.2-0.el7.centos.x86_64
--> Processing Dependency: python-flask for package:
1:ceph-0.87.2-0.el7.centos.x86_64
--> Processing Dependency: hdparm for package:
1:ceph-0.87.2-0.el7.centos.x86_64
--> Processing Dependency: libaio.so.1(LIBAIO_0.1)(64bit) for package:
1:ceph-0.87.2-0.el7.centos.x86_64
--> Processing Dependency: libcephfs.so.1()(64bit) for package:
1:ceph-0.87.2-0.el7.centos.x86_64
--> Processing Dependency: libboost_system-mt.so.1.53.0()(64bit) for
package: 1:ceph-0.87.2-0.el7.centos.x86_64
--> Processing Dependency: libaio.so.1()(64bit) for package:
1:ceph-0.87.2-0.el7.centos.x86_64
--> Processing Dependency: libboost_thread-mt.so.1.53.0()(64bit) for
package: 1:ceph-0.87.2-0.el7.centos.x86_64
--> Processing Dependency: librados.so.2()(64bit) for package:
1:ceph-0.87.2-0.el7.centos.x86_64
--> Running transaction check
---> Package boost-system.x86_64 0:1.53.0-26.el7 will be installed
---> Package boost-thread.x86_64 0:1.53.0-26.el7 will be installed
---> Package ceph.x86_64 1:0.87.2-0.el7.centos will be installed
--> Processing Dependency: python-ceph = 1:0.87.2-0.el7.centos for package:
1:ceph-0.87.2-0.el7.centos.x86_64
Package python-ceph is obsoleted by python-rados, but obsoleting package
does not provide for requirements
--> Processing Dependency: python-flask for package:
1:ceph-0.87.2-0.el7.centos.x86_64
---> Package ceph-common.x86_64 1:0.87.2-0.el7.centos will be installed
--> Processing Dependency: python-ceph = 1:0.87.2-0.el7.centos for package:
1:ceph-common-0.87.2-0.el7.centos.x86_64
Package python-ceph is obsoleted by python-rados, but obsoleting package
does not provide for requirements
--> Processing Dependency: redhat-lsb-core for package:
1:ceph-common-0.87.2-0.el7.centos.x86_64
---> Package cryptsetup.x86_64 0:1.7.2-1.el7 will be installed
--> Processing Dependency: cryptsetup-libs(x86-64) = 1.7.2-1.el7 for
package: cryptsetup-1.7.2-1.el7.x86_64
---> Package hdparm.x86_64 0:9.43-5.el7 will be installed
---> Package libaio.x86_64 0:0.3.109-13.el7 will be installed
---> Package libcephfs1.x86_64 1:0.87.2-0.el7.centos will be installed
---> Package librados2.x86_64 1:0.87.2-0.el7.centos will be installed
---> Package librbd1.x86_64 1:0.87.2-0.el7.centos will be installed
--> Running transaction check
---> Package ceph.x86_64

Re: [ceph-users] Ceph Giant Repo problem

2017-03-30 Thread Erik McCormick
Try setting

obsoletes=0

in /etc/yum.conf and see if that doesn't make it happier. The package is
clearly there; yum even shows it as available in your log.
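
Roughly, either globally or just for this one transaction (a sketch only,
adjust to taste):

# /etc/yum.conf
[main]
obsoletes=0

# or as a one-off, without editing yum.conf:
yum clean all
yum --setopt=obsoletes=0 install ceph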

-Erik

On Thu, Mar 30, 2017 at 8:55 PM, Vlad Blando  wrote:

> Hi Guys,
>
> I encountered an issue installing the ceph package for Giant. Was there a
> change somewhere, or am I using the wrong repo information?
>
> ceph.repo
> -
> [Ceph]
> name=Ceph packages for $basearch
> baseurl=http://download.ceph.com/rpm-giant/rhel7/$basearch
> enabled=1
> priority=1
> gpgcheck=1
> type=rpm-md
> gpgkey=https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc
>
> [Ceph-noarch]
> name=Ceph noarch packages
> baseurl=http://download.ceph.com/rpm-giant/rhel7/noarch
> enabled=1
> priority=1
> gpgcheck=1
> type=rpm-md
> gpgkey=https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc
>
> [ceph-source]
> name=Ceph source packages
> baseurl=http://download.ceph.com/rpm-giant/rhel7/SRPMS
> enabled=1
> priority=1
> gpgcheck=1
> type=rpm-md
> gpgkey=https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc
>
> 
>
> installation error
> 
> [root@ceph-test yum.repos.d]# yum install ceph
> Loaded plugins: priorities
> Ceph                                                          | 2.9 kB  00:00:00
> Ceph-noarch                                                   | 2.9 kB  00:00:00
> ceph-source                                                   | 2.9 kB  00:00:00
> rhel-7-server-optional-rpms                                   | 3.5 kB  00:00:00
> rhel-7-server-rpms                                            | 3.5 kB  00:00:00
> (1/9): ceph-source/primary_db                                 | 4.4 kB  00:00:00
> (2/9): Ceph-noarch/primary_db                                 | 4.1 kB  00:00:00
> (3/9): Ceph/x86_64/primary_db                                 |  61 kB  00:00:01
> (4/9): rhel-7-server-optional-rpms/7Server/x86_64/group       |  25 kB  00:00:01
> (5/9): rhel-7-server-rpms/7Server/x86_64/group                | 701 kB  00:00:05
> (6/9): rhel-7-server-optional-rpms/7Server/x86_64/updateinfo  | 1.3 MB  00:00:07
> (7/9): rhel-7-server-rpms/7Server/x86_64/updateinfo           | 1.8 MB  00:00:06
> (8/9): rhel-7-server-optional-rpms/7Server/x86_64/primary_db  | 5.0 MB  00:00:25
> (9/9): rhel-7-server-rpms/7Server/x86_64/primary_db           |  34 MB  00:00:50
> 9 packages excluded due to repository priority protections
> Resolving Dependencies
> --> Running transaction check
> ---> Package ceph.x86_64 1:0.87.2-0.el7.centos will be installed
> --> Processing Dependency: python-ceph = 1:0.87.2-0.el7.centos for
> package: 1:ceph-0.87.2-0.el7.centos.x86_64
> Package python-ceph is obsoleted by python-rados, but obsoleting package
> does not provide for requirements
> --> Processing Dependency: ceph-common = 1:0.87.2-0.el7.centos for
> package: 1:ceph-0.87.2-0.el7.centos.x86_64
> --> Processing Dependency: libcephfs1 = 1:0.87.2-0.el7.centos for package:
> 1:ceph-0.87.2-0.el7.centos.x86_64
> --> Processing Dependency: librbd1 = 1:0.87.2-0.el7.centos for package:
> 1:ceph-0.87.2-0.el7.centos.x86_64
> --> Processing Dependency: librados2 = 1:0.87.2-0.el7.centos for package:
> 1:ceph-0.87.2-0.el7.centos.x86_64
> --> Processing Dependency: libaio.so.1(LIBAIO_0.4)(64bit) for package:
> 1:ceph-0.87.2-0.el7.centos.x86_64
> --> Processing Dependency: cryptsetup for package:
> 1:ceph-0.87.2-0.el7.centos.x86_64
> --> Processing Dependency: python-flask for package:
> 1:ceph-0.87.2-0.el7.centos.x86_64
> --> Processing Dependency: hdparm for package: 1:ceph-0.87.2-0.el7.centos.
> x86_64
> --> Processing Dependency: libaio.so.1(LIBAIO_0.1)(64bit) for package:
> 1:ceph-0.87.2-0.el7.centos.x86_64
> --> Processing Dependency: libcephfs.so.1()(64bit) for package:
> 1:ceph-0.87.2-0.el7.centos.x86_64
> --> Processing Dependency: libboost_system-mt.so.1.53.0()(64bit) for
> package: 1:ceph-0.87.2-0.el7.centos.x86_64
> --> Processing Dependency: libaio.so.1()(64bit) for package:
> 1:ceph-0.87.2-0.el7.centos.x86_64
> --> Processing Dependency: libboost_thread-mt.so.1.53.0()(64bit) for
> package: 1:ceph-0.87.2-0.el7.centos.x86_64
> --> Processing Dependency: librados.so.2()(64bit) for package:
> 1:ceph-0.87.2-0.el7.centos.x86_64
> --> Running transaction check
> ---> Package boost-system.x86_64 0:1.53.0-26.el7 will be installed
> ---> Package boost-thread.x86_64 0:1.53.0-26.el7 will be installed
> ---> Package ceph.x86_64 1:0.87.2-0.el7.centos will be installed
> --> Processing Dependency: python-ceph = 1:0.87.2-0.el7.centos for
> package: 1:ceph-0.87.2-0.el7.centos.x86_64
> Package python-ceph is obsoleted by python-rados, but obsoleting package
> does not provide for requirements
> --> Processing Dependency: python-flask for package:
> 1:ceph-0.87.2-0.el7.centos.x86_64
> ---> Package ceph-common.x86_64 1:0.87.2-0.el7.centos will be installed
> --> Processing Dependency: python-ceph = 1:0.87.2-0.el7.centos for
> package: 1:ceph-common-0.87.2-0.el7.centos.x86_64
> Package python-ceph is obsoleted by python-rados, but obsoleting package
> does not provide for requirements
> --> Processing Dependency: redhat-lsb-core for package:
> 1:ceph-common-0.87.2-0.el7.centos.x86_64
> ---> Package cryptsetup.x86_64 0:1.7.2-1.el7 will be installed

Re: [ceph-users] S3 Multi-part upload broken with newer AWS Java SDK and Kraken RGW

2017-03-30 Thread Ben Hines
Hey Yehuda,

Are there plans to port this fix to Kraken? (Or is there even another
Kraken release planned? :)
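
For anyone else stuck on Kraken in the meantime: Yehuda's suggested workaround
below is to fall back to AWS2 signing, which in the Java SDK is presumably the
ClientConfiguration signer override (e.g. "S3SignerType"). Turning up RGW debug
logging should also confirm whether it is the aws4 auth path that rejects the
PUT. A rough ceph.conf sketch, with a made-up instance name:

[client.rgw.gateway1]      # whatever your RGW instance section is called
debug rgw = 20             # verbose request/auth logging
debug civetweb = 10        # frontend request logging

Restart radosgw (or inject the settings at runtime) and re-run the failing
multipart upload to capture the auth failure in the gateway log.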

thanks!

-Ben

On Wed, Mar 1, 2017 at 11:33 AM, Yehuda Sadeh-Weinraub 
wrote:

> This sounds like this bug:
> http://tracker.ceph.com/issues/17076
>
> Will be fixed in 10.2.6. It's triggered by aws4 auth, so a workaround
> would be to use aws2 instead.
>
> Yehuda
>
>
> On Wed, Mar 1, 2017 at 10:46 AM, John Nielsen  wrote:
> > Hi all-
> >
> > We use Amazon S3 quite a bit at $WORK but are evaluating Ceph+radosgw as
> an alternative for some things. We have an "S3 smoke test" written using
> the AWS Java SDK that we use to validate a number of operations. On my
> Kraken cluster, multi-part uploads work fine for s3cmd. Our smoke test also
> passes fine using version 1.9.27 of the AWS SDK. However in SDK 1.11.69 the
> multi-part upload fails. The initial POST (to reserve the object name and
> start the upload) succeeds, but the first PUT fails with a 403 error.
> >
> > So, does anyone know offhand what might be going on here? If not, how
> can I get more details about the 403 error and what is causing it?
> >
> > The cluster was installed with Jewel and recently updated to Kraken.
> Using the built-in civetweb server.
> >
> > Here is the log output for three multi-part uploads. The first two are
> s3cmd and the older SDK, respectively. The last is the failing one with the
> newer SDK.
> >
> > S3cmd, Succeeds.
> > 2017-03-01 17:33:16.845613 7f80b06de700  1 == starting new request
> req=0x7f80b06d8340 =
> > 2017-03-01 17:33:16.856522 7f80b06de700  1 == req done
> req=0x7f80b06d8340 op status=0 http_status=200 ==
> > 2017-03-01 17:33:16.856628 7f80b06de700  1 civetweb: 0x7f81131fd000:
> 10.251.50.7 - - [01/Mar/2017:17:33:16 +] "POST /
> testdomobucket10x3x104x64250438/multipartStreamTest?uploads HTTP/1.1" 1 0
> - -
> > 2017-03-01 17:33:16.953967 7f80b06de700  1 == starting new request
> req=0x7f80b06d8340 =
> > 2017-03-01 17:33:24.094134 7f80b06de700  1 == req done
> req=0x7f80b06d8340 op status=0 http_status=200 ==
> > 2017-03-01 17:33:24.094211 7f80b06de700  1 civetweb: 0x7f81131fd000:
> 10.251.50.7 - - [01/Mar/2017:17:33:16 +] "PUT /
> testdomobucket10x3x104x64250438/multipartStreamTest?
> partNumber=1&uploadId=2~IGYuZC4uDC27TGWfpFkKk-Makqvk_XB HTTP/1.1" 1 0 - -
> > 2017-03-01 17:33:24.193747 7f80b06de700  1 == starting new request
> req=0x7f80b06d8340 =
> > 2017-03-01 17:33:30.002050 7f80b06de700  1 == req done
> req=0x7f80b06d8340 op status=0 http_status=200 ==
> > 2017-03-01 17:33:30.002124 7f80b06de700  1 civetweb: 0x7f81131fd000:
> 10.251.50.7 - - [01/Mar/2017:17:33:16 +] "PUT /
> testdomobucket10x3x104x64250438/multipartStreamTest?
> partNumber=2&uploadId=2~IGYuZC4uDC27TGWfpFkKk-Makqvk_XB HTTP/1.1" 1 0 - -
> > 2017-03-01 17:33:30.085033 7f80b06de700  1 == starting new request
> req=0x7f80b06d8340 =
> > 2017-03-01 17:33:30.104944 7f80b06de700  1 == req done
> req=0x7f80b06d8340 op status=0 http_status=200 ==
> > 2017-03-01 17:33:30.105007 7f80b06de700  1 civetweb: 0x7f81131fd000:
> 10.251.50.7 - - [01/Mar/2017:17:33:16 +] "POST /
> testdomobucket10x3x104x64250438/multipartStreamTest?uploadId=2~
> IGYuZC4uDC27TGWfpFkKk-Makqvk_XB HTTP/1.1" 1 0 - -
> >
> > AWS SDK (1.9.27). Succeeds.
> > 2017-03-01 17:54:50.720093 7f80c0eff700  1 == starting new request
> req=0x7f80c0ef9340 =
> > 2017-03-01 17:54:50.733109 7f80c0eff700  1 == req done
> req=0x7f80c0ef9340 op status=0 http_status=200 ==
> > 2017-03-01 17:54:50.733188 7f80c0eff700  1 civetweb: 0x7f811314c000:
> 10.251.50.7 - - [01/Mar/2017:17:54:42 +] "POST /
> testdomobucket10x3x104x6443285/multipartStreamTest?uploads HTTP/1.1" 1 0
> - aws-sdk-java/1.9.27 Mac_OS_X/10.10.5 Java_HotSpot(TM)_64-Bit_
> Server_VM/24.71-b01/1.7.0_71
> > 2017-03-01 17:54:50.831618 7f80c0eff700  1 == starting new request
> req=0x7f80c0ef9340 =
> > 2017-03-01 17:54:58.057011 7f80c0eff700  1 == req done
> req=0x7f80c0ef9340 op status=0 http_status=200 ==
> > 2017-03-01 17:54:58.057082 7f80c0eff700  1 civetweb: 0x7f811314c000:
> 10.251.50.7 - - [01/Mar/2017:17:54:42 +] "PUT /
> testdomobucket10x3x104x6443285/multipartStreamTest?uploadId=2%
> 7EPlNR4meSvAvCYtvbqz8JLlSKu5_laxo&partNumber=1 HTTP/1.1" 1 0 -
> aws-sdk-java/1.9.27 Mac_OS_X/10.10.5 Java_HotSpot(TM)_64-Bit_
> Server_VM/24.71-b01/1.7.0_71
> > 2017-03-01 17:54:58.143235 7f80c0eff700  1 == starting new request
> req=0x7f80c0ef9340 =
> > 2017-03-01 17:54:58.328351 7f80c0eff700  1 == req done
> req=0x7f80c0ef9340 op status=0 http_status=200 ==
> > 2017-03-01 17:54:58.328437 7f80c0eff700  1 civetweb: 0x7f811314c000:
> 10.251.50.7 - - [01/Mar/2017:17:54:42 +] "PUT /
> testdomobucket10x3x104x6443285/multipartStreamTest?uploadId=2%
> 7EPlNR4meSvAvCYtvbqz8JLlSKu5_laxo&partNumber=2 HTTP/1.1" 1 0 -
> aws-sdk-java/1.9.27 Mac_OS_X/10.10.5 Java_HotSpot(TM)_64-Bit_
> Server_VM/24.71-b01/1.7.0_71