Re: [ceph-users] Dell R515/510 with H710 PERC RAID | JBOD

2014-04-09 Thread Punit Dambiwal
Hi,

I have changed my plan and now I want to use the following Supermicro
server:

SuperStorage Server 6047R-E1R24L

Can anyone tell me whether this server is a good fit for the OSD nodes:
two SSDs in RAID1 (OS & journal) and 24 HDDs for OSDs (JBOD on the
motherboard controller)?






On Fri, Apr 4, 2014 at 11:51 AM, Ирек Фасихов  wrote:

> You need to use Dell OpenManage:
>
> https://linux.dell.com/repo/hardware/.
>
>
>
> 2014-04-04 7:26 GMT+04:00 Punit Dambiwal :
>
>> Hi,
>>
>> I want to use Dell R515/R510 for the OSD nodes:
>>
>> 1. 2x SSD for the OS (RAID 1)
>> 2. 10x Seagate 3.5" 3TB HDD for OSDs (no RAID... JBOD)
>>
>> To create the JBOD I configured each of the 10 HDDs as a RAID 0 volume, but
>> the problem is that when I pull an HDD out of the server and plug it back
>> in, I need to import the RAID configuration again to get that OSD working.
>>
>> Can anyone suggest a good way to do this?
>>
>> Thanks,
>> Punit
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
>
> --
> Best regards, Фасихов Ирек Нургаязович
> Mob.: +79229045757
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Dell R515/510 with H710 PERC RAID | JBOD

2014-04-09 Thread Josef Johansson
Hi,

The server would be fine as an OSD node I believe, even though it's a
tad bigger than you originally set out for. You talked about using 10
disks before, so
http://www.supermicro.nl/products/system/2U/6027/SSG-6027R-E1R12T.cfm or
http://www.supermicro.nl/products/system/2U/6027/SSG-6027R-E1R12L.cfm
may be a better fit in that case. The first one can be flashed to JBOD
as well. Remember that you want at least three of these nodes. It all
comes down to whether you want storage capacity or IOPS, and how much of it.

I think the recommendation is to have separate disks for the OS and the
journals, and not to run the journals in RAID1 but rather to use
partitions on them: if you have 3 journal SSDs and 12 disk slots, you put
3 OSD disks on each journal; if you have 2 journal SSDs and 12 disk
slots, you put 5 OSD disks on each journal.
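
A minimal sketch of that second layout (the device names, sizes and the host
name "node1" are just placeholder examples, and ceph-deploy is only one way
to do it):

  # carve five 25 GB journal partitions on each of the two journal SSDs
  sgdisk -n 0:0:+25G /dev/sda    # repeat until sda1..sda5 exist, same for sdb
  # create each OSD on a data disk, pointing it at one journal partition
  ceph-deploy osd create node1:sdc:/dev/sda1
  ceph-deploy osd create node1:sdd:/dev/sda2
  # ...and so on for the remaining data disks and journal partitions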

If you have lots of disks you could also keep the journals directly on the
data disks. I don't know at what point that becomes the better option,
though; someone else surely knows this ;)

If you're running an Intel shop, the Intel DC S3500 works well for OS
disks, and the Intel DC S3700 for journals.

Cheers,
Josef

On 09/04/14 08:59, Punit Dambiwal wrote:
> Hi,
>
> I have changed my plan and now I want to use the following Supermicro
> server:
>
>   SuperStorage Server 6047R-E1R24L
>
> Can anyone tell me whether this server is a good fit for the OSD nodes:
> two SSDs in RAID1 (OS & journal) and 24 HDDs for OSDs (JBOD on the
> motherboard controller)?
>
>
>
>
>
>
On Fri, Apr 4, 2014 at 11:51 AM, Ирек Фасихов wrote:
>
> You need to use Dell OpenManage:
>
> https://linux.dell.com/repo/hardware/.
>
>
>
2014-04-04 7:26 GMT+04:00 Punit Dambiwal:
>
> Hi,
>
> I want to use Dell R515/R510 for the OSD nodes:
>
> 1. 2x SSD for the OS (RAID 1)
> 2. 10x Seagate 3.5" 3TB HDD for OSDs (no RAID... JBOD)
>
> To create the JBOD I configured each of the 10 HDDs as a RAID 0
> volume, but the problem is that when I pull an HDD out of the server
> and plug it back in, I need to import the RAID configuration again
> to get that OSD working.
>
> Can anyone suggest a good way to do this?
>
> Thanks,
> Punit
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
> -- 
> Best regards, Фасихов Ирек Нургаязович
> Mob.: +79229045757
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Dell R515/510 with H710 PERC RAID | JBOD

2014-04-09 Thread Christian Balzer
On Wed, 9 Apr 2014 14:59:30 +0800 Punit Dambiwal wrote:

> Hi,
> 
> I have changed my plan and now I want to use the following Supermicro
> server:
> 
> SuperStorage Server 6047R-E1R24L
> 
> Can anyone tell me whether this server is a good fit for the OSD nodes:
> two SSDs in RAID1 (OS & journal) and 24 HDDs for OSDs (JBOD on the
> motherboard controller)?
>
Wrong on so many levels. 

Firstly, that is 2 SSDs (really just one if you're using RAID1 for the journal
partitions as well) for 24 OSDs.
That SSD will be a speed bottleneck and will also have to absorb ALL the writes
that ever happen on the whole machine (meaning it will wear out quickly).

If you want/need SSD journals, a sensible ratio would be 3-4 OSDs per
journal SSD (partitioned into the corresponding number of journals).
So something like 6 SSDs and 18 HDDs.

Secondly, that backplane is connected to the HBA with a single mini-SAS link.
That means at best 4 lanes of 6Gb/s for 24 drives, and it might be just
one lane; the manual is of typical Supermicro quality. =.=
Another, potentially massive, bottleneck.
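
A rough back-of-envelope check (assuming ~150MB/s sequential per HDD and
~600MB/s per 6Gb/s SAS lane):

  24 HDDs x ~150MB/s  = ~3600MB/s aggregate disk bandwidth
  4 lanes x ~600MB/s  = ~2400MB/s raw, less after protocol overhead
  1 lane              = ~600MB/s raw

So even in the best case the uplink sits well below what 24 spinners can
stream, and with a single lane it would be roughly a sixth of it.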

Also, what are your goals here in terms of throughput and IOPS?
If you're planning on getting lots of these 24-disk boxes, fine.
Otherwise you might be better off getting smaller nodes.

Regards,

Christian.
> 
> 
> 
> 
> 
> On Fri, Apr 4, 2014 at 11:51 AM, Ирек Фасихов  wrote:
> 
> > You need to use Dell OpenManage:
> >
> > https://linux.dell.com/repo/hardware/.
> >
> >
> >
> > 2014-04-04 7:26 GMT+04:00 Punit Dambiwal :
> >
> >> Hi,
> >>
> >> I want to use Dell R515/R510 for the OSD nodes:
> >>
> >> 1. 2x SSD for the OS (RAID 1)
> >> 2. 10x Seagate 3.5" 3TB HDD for OSDs (no RAID... JBOD)
> >>
> >> To create the JBOD I configured each of the 10 HDDs as a RAID 0 volume,
> >> but the problem is that when I pull an HDD out of the server and plug it
> >> back in, I need to import the RAID configuration again to get that OSD
> >> working.
> >>
> >> Can anyone suggest a good way to do this?
> >>
> >> Thanks,
> >> Punit
> >>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> >>
> >
> >
> > --
> > Best regards, Фасихов Ирек Нургаязович
> > Mob.: +79229045757
> >


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Debian 7 : fuse unable to resolve monitor hostname

2014-04-09 Thread Yan, Zheng
On Wed, Apr 9, 2014 at 3:47 PM, Florent B  wrote:
> Hi,
>
> I'm trying again and my system has now a load average of 3.15.
>
> I did a perf report, 91,42% of CPU time is used by :
>
> +  91.42% 63080   swapper  [kernel.kallsyms]  [k]
> native_safe_halt
>

Could you pass -g (enables call-graph recording) to perf record and try again?
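
For example, one possible invocation (adjust the duration to taste):

  perf record -a -g -- sleep 30
  perf report -g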

Regards
Yan, Zheng

> My system has 2 vCore, and does not use all 4 GB of RAM...
>
> When I do a "top", I can see "kworker/0:0" always first...
>
> Debian 7 with 3.13 kernel backport.
>
> Do you have an idea of what is consuming so much?
>
> Thank you :)
>
> On 04/04/2014 04:50 PM, Yan, Zheng wrote:
>> On Fri, Apr 4, 2014 at 6:33 PM, Florent B  wrote:
>>> Hi all,
>>>
>>> My machines are all running Debian Wheezy.
>>>
>>> After a few days using kernel driver to mount my Ceph pools (with backports
>>> 3.13 kernel), I'm now switching to FUSE because of very high CPU usage with
>>> kernel driver (load average > 35).
>>>
>> I'm curious which code uses CPU time. could you use perf to get a profile
>>
>> Regards
>> Yan, Zheng
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Questions about federated gateways configure

2014-04-09 Thread wsnote
Thank you very much!
I did as you said, but there are some errors.


 [root@ceph69 ceph]# radosgw-agent -c region-data-sync.conf
Traceback (most recent call last):
  File "/usr/bin/radosgw-agent", line 5, in 
from pkg_resources import load_entry_point
  File "/usr/lib/python2.6/site-packages/pkg_resources.py", line 2659, in 

parse_requirements(__requires__), Environment()
  File "/usr/lib/python2.6/site-packages/pkg_resources.py", line 546, in resolve
raise DistributionNotFound(req)
pkg_resources.DistributionNotFound: requests>=1.2.1 
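
The DistributionNotFound error means the installed python-requests is older
than the 1.2.1 that radosgw-agent declares as a requirement; upgrading it
should get past this, for example with something like:

  pip install 'requests>=1.2.1'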





At 2014-04-09 12:11:09,"Craig Lewis"  wrote:
I posted inline.


1. Create Pools
there are many us-east and us-west pools.
Do I have to create both us-east and us-west pools in a ceph instance? Or, I 
just create us-east pools in us-east zone and create us-west pools in us-west 
zone?

No, just create the us-east pools in the us-east cluster, and the us-west pools 
in the us-west cluster.




2. Create a keyring

Generate a Ceph Object Gateway user name and key for each instance.

sudo ceph-authtool /etc/ceph/ceph.client.radosgw.keyring -n 
client.radosgw.us-east-1 --gen-key
sudo ceph-authtool /etc/ceph/ceph.client.radosgw.keyring -n 
client.radosgw.us-west-1 --gen-key
Do I use all of the above commands in every ceph instance, or the first in 
the us-east zone and the second in the us-west zone?

For the keyrings, you should only need to do the key in the respective zone.  
I'm not 100% sure though, as I'm not using CephX.





3. add instances to ceph config file

[client.radosgw.us-east-1]
rgw region = us
rgw region root pool = .us.rgw.root
rgw zone = us-east
rgw zone root pool = .us-east.rgw.root
keyring = /etc/ceph/ceph.client.radosgw.keyring
rgw dns name = {hostname}
rgw socket path = /var/run/ceph/$name.sock
host = {host-name}

[client.radosgw.us-west-1]
rgw region = us
rgw region root pool = .us.rgw.root
rgw zone = us-west
rgw zone root pool = .us-west.rgw.root
keyring = /etc/ceph/ceph.client.radosgw.keyring
rgw dns name = {hostname}
rgw socket path = /var/run/ceph/$name.sock
host = {host-name}


Do both of the above configs go in one ceph.conf, or does us-east go in the 
us-east zone and us-west in the us-west zone?

Each one only needs to be in its own cluster's ceph.conf.  Assuming your 
client names are globally unique, it won't hurt if you put both in every 
cluster's ceph.conf. 




4. Create Zones
radosgw-admin zone set --rgw-zone=us-east --infile us-east.json --name 
client.radosgw.us-east-1
radosgw-admin zone set --rgw-zone=us-east --infile us-east.json --name 
client.radosgw.us-west-1
Use both commands in every instance or separately?

Yes, the zones need to know about each other.  The slaves definitely need to 
know the master zone information.  The master might be able to get away with 
not knowing about the slave zones, but I haven't tested it.  I ran both 
commands in both zones, using the respective --name argument for the node in 
the zone I was running the command on.




5. Create Zone Users


radosgw-admin user create --uid="us-east" --display-name="Region-US Zone-East" 
--name client.radosgw.us-east-1 --system
radosgw-admin user create --uid="us-west" --display-name="Region-US Zone-West" 
--name client.radosgw.us-west-1 --system
Does us-east zone have to create uid us-west?
Does us-west zone have to create uid us-east?

When you create the system users, you do need to create all users in all zones.  
I think you don't need the master user in the slave zones, but I haven't taken 
the time to test it.  You do need the access keys to match in all zones.  So if 
you create the users in the master zone with

radosgw-admin user create --uid="$name" --display-name="$display_name" --name 
client.radosgw.us-west-1 --gen-access-key --gen-secret --system

you'll copy the access and secret keys to the slave zone with

radosgw-admin user create --uid="$name" --display-name="$display_name" --name 
client.radosgw.us-east-1 --access_key="$access_key" --secret="$secret_key" 
--system




6. about secondary region


Create zones from master region in the secondary region.
Create zones from secondary region in the master region.


Are these two steps meant to give the two regions the same pools?

I haven't tried multiple regions yet, but since the two regions are in two 
different clusters, they can't share pools.  They could use the same pool names 
in different clusters, but I recommend against that.  You really want all pools 
in all locations to be named uniquely.  Having the same names in different 
locations is a recipe for human error.

I'm pretty sure you just need to load the region and zone maps in all of the 
clusters.  Since the other regions will only be storing metadata about the 
other regions and zones, they shouldn't need extra pools.  Similar to my answer 
to question #1.




The best advice I can give is to setup a pair of virtual machines, and start 
messing around.  Make liberal use of VM snapshots.  I broke my test clusters 
several times.  I could've fixed the

Re: [ceph-users] Ceph v0.79 Firefly RC :: erasure-code-profile command set not present

2014-04-09 Thread Mark Kirkwood

Hi Karan,

Just to double check - run the same command after ssh'ing into each of 
the osd hosts, and maybe again on the monitor hosts too (just in case 
you have *some* hosts successfully updated to 0.79 and some still on < 
0.78).
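
For example (the host names here are just placeholders):

  for h in osd1 osd2 osd3 mon1; do ssh $h ceph --version; done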


Regards

Mark

On 08/04/14 22:32, Karan Singh wrote:

Hi Loic

Here is the output

# ceph --version
ceph version 0.79 (4c2d73a5095f527c3a2168deb5fa54b3c8991a6e)
#



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSD space usage 2x object size after rados put

2014-04-09 Thread Mark Kirkwood

Hi all,

I've noticed that objects are using twice their actual space for a few 
minutes after they are 'put' via rados:


$ ceph -v
ceph version 0.79-42-g010dff1 (010dff12c38882238591bb042f8e497a1f7ba020)

$ ceph osd tree
# id    weight      type name         up/down  reweight
-1      0.03998     root default
-2      0.009995        host ceph1
0       0.009995            osd.0     up       1
-3      0.009995        host ceph2
1       0.009995            osd.1     up       1
-4      0.009995        host ceph3
2       0.009995            osd.2     up       1
-5      0.009995        host ceph4
3       0.009995            osd.3     up       1

$ ceph osd dump|grep repool
pool 5 'repool' replicated size 3 min_size 2 crush_ruleset 0 object_hash 
rjenkins pg_num 64 pgp_num 64 last_change 57 owner 0 flags hashpspool 
stripe_width 0


$ du -m  file
1025    file

$ rados put -p repool file file

$ cd /var/lib/ceph/osd/ceph-1/current/
$ du -m 5.1a_head
2048  5.1a_head

[later]

$ du -m 5.1a_head
1024  5.1a_head

The above situation is repeated on the other two OSDs where this pg is
mapped. So after about 5 minutes we have (as expected) the 1G file using 1G
on each of the 3 OSDs it is mapped to; however, for a short period of time
it is using twice this! I am very interested to know what activity is
happening that causes the 2x space use, as this could be a significant foot
gun when uploading large files if we don't have 2x the space available on
each OSD.


Regards

Mark
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph and shared backend storage.

2014-04-09 Thread Franks Andy (RLZ) IT Systems Engineer
Hi,
  The logic of going with a clustered file system is that CTDB needs it. The 
brief is simply to provide a "Windows file sharing" cluster without using 
Windows, which would require us to buy loads of CALs for 2012 and so isn't an 
option. The SAN would provide this, but only if we bought the standalone head 
units that do the job I'm trying to achieve here, and they've been quoted at 
£30-40k.

So .. the idea was that Ceph would provide the required clustered filesystem 
element, and it was the only FS that provided the "resize on the fly and 
snapshotting" capabilities that were needed.
I can't see it working with one shared LUN. In theory I can't see why it 
couldn't work, but I have no idea how the likes of VMFS achieve the locking 
across the cluster with single LUNs, and I certainly don't know of something 
within Linux that could do it.

I guess I could see a clustered FS like Ceph providing something similar to a 
software RAID 1, where two volumes are replicated and access from a couple of 
hosts points to the two different backend "halves" of the RAID via a load 
balancer?

The other option is to ditch the HA features and just go with Samba on top of 
ZFS, which could still provide the snapshots we need. It's a step backwards, 
but then I don't like the sound of the CTDB complexity or the performance 
problems cited.

I guess people just don't do HA in file sharing roles.. What about NFS or the 
like for VM provisioning though - it's pretty similar, just with CTDB bolted on 
top?

Thanks
Andy


>Ceph is designed to handle reliability in its system rather than in an 
>external one. You could set it up to use that storage and not do its own 
>replication, but then you lose availability if the OSD process hosts 
>disappear, etc. And the filesystem (which I guess is the part you're 
>interested in) is the least stable component of the overall system. Maybe if 
>you could describe more about what you think the system stack will look like?
>-Greg
>Software Engineer #42 @ http://inktank.com | http://ceph.com

> We have a need to provide HA storage to a few thousand users, 
> replacing our aging windows storage server.
>
> Our storage all comes from a group of equallogic SANs, and since we've 
> invested in these and vmware, the obvious
>
> Our storage cluster mentioned above needs to export SMB and maybe NFS, 
> using samba CTDB and whatever NFS needs (not looked into that yet). My 
> question is how to present the storage ceph needs given that I'd like 
> the SAN itself to provide the resilience through it's replication and 
> snapshot capabilities, but for ceph to provide the logical HA (active/active 
> if possible).

For me it does not seem that Ceph is the most logical solution.
Currently you could look at Ceph as a SAN replacement. 
It can also act like an object store, similar to Amazon S3 / Openstack Swift.

The distributed filesystem part (CephFS) might be a fit but is not really 
production ready yet as far as I know.
(I think people are using it but I would not put 1000s of users on it yet, 
e.g. it is missing an active-active HA option.)

Since you want to keep using the SAN and are using SMB and NFS clients  (e.g. 
no native/ ceph kernel client/ qemu clients) it seems to me you are just adding 
another layer of complexity without any of the benefits that Ceph can bring.

To be brutally honest I would look if the SAN supports NFS / SMB exports.

CTDB is nice but it requires a shared filesystem so you would have to look at 
GFS or something similar.
You can get it to work but it is a bit of a PITA. 
There are also some performance considerations with those filesystems so you 
should really do some proper testing before any large scale deployments.

Cheers,
Robert van Leeuwen
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to detect journal problems

2014-04-09 Thread Christian Balzer
On Tue, 8 Apr 2014 09:35:19 -0700 Gregory Farnum wrote:

> On Tuesday, April 8, 2014, Christian Balzer  wrote:
> 
> > On Tue, 08 Apr 2014 14:19:20 +0200 Josef Johansson wrote:
> > >
> > > On 08/04/14 10:39, Christian Balzer wrote:
> > > > On Tue, 08 Apr 2014 10:31:44 +0200 Josef Johansson wrote:
> > > >
> > > >> On 08/04/14 10:04, Christian Balzer wrote:
> > > >>> Hello,
> > > >>>
> > > >>> On Tue, 08 Apr 2014 09:31:18 +0200 Josef Johansson wrote:
> > > >>>
> > >  Hi all,
> > > 
> > >  I am currently benchmarking a standard setup with Intel DC S3700
> > >  disks as journals and Hitachi 4TB-disks as data-drives, all on
> > >  LACP 10GbE network.
> > > 
> > > >>> Unless that is the 400GB version of the DC3700, you're already
> > > >>> limiting yourself to 365MB/s throughput with the 200GB variant.
> > > >>> If sequential write speed is that important to you and you think
> > > >>> you'll ever get those 5 HDs to write at full speed with Ceph
> > > >>> (unlikely).
> > > >> It's the 400GB version of the DC3700, and yes, I'm aware that I
> > > >> need a 1:3 ratio to max out these disks, as they write sequential
> > > >> data at about 150MB/s.
> > > >> But our thoughts are that it would cover the current demand with
> > > >> a 1:5 ratio, but we could upgrade.
> > > > I'd reckon you'll do fine, as in run out of steam and IOPS before
> > > > hitting that limit.
> > > >
> > >  The size of my journals are 25GB each, and I have two journals
> > >  per machine, with 5 OSDs per journal, with 5 machines in total.
> > >  We currently use the tunables optimal and the version of ceph
> > >  is the latest dumpling.
> > > 
> > >  Benchmarking writes with rbd show that there's no problem
> > >  hitting upper levels on the 4TB-disks with sequential data,
> > >  thus maxing out 10GbE. At this moment we see full utilization
> > >  on the journals.
> > > 
> > >  But lowering the byte-size to 4k shows that the journals are
> > >  utilized to about 20%, and the 4TB-disks 100%. (rados -p 
> > >  -b 4096 -t 256 100 write)
> > > 
> > > >>> When you're saying utilization I assume you're talking about
> > > >>> iostat or atop output?
> > > >> Yes, the utilization is iostat.
> > > >>> That's not a bug, that's comparing apple to oranges.
> > > >> You mean comparing iostat-results with the ones from rados
> > > >> benchmark?
> > > >>> The rados bench default is 4MB, which not only happens to be the
> > > >>> default RBD objectsize but also to generate a nice amount of
> > > >>> bandwidth.
> > > >>>
> > > >>> While at 4k writes your SDD is obviously bored, but actual OSD
> > > >>> needs to handle all those writes and becomes limited by the IOPS
> > > >>> it can peform.
> > > >> Yes, it's quite bored and just shuffles data.
> > > >> Maybe I've been thinking about this the wrong way,
> > > >> but shouldn't the Journal buffer more until the Journal partition
> > > >> is full or when the flush_interval is met.
> > > >>
> > > > Take a look at "journal queue max ops", which has a default of a
> > > > mere 500, so that's full after 2 seconds. ^o^
> > > Hm, that makes sense.
> > >
> > > So, tested out both low values ( 5000 )  and large value ( 6553600 ),
> > > but it didn't seem that change anything.
> > > Any chance I could dump the current values from a running OSD, to
> > > actually see what is in use?
> > >
> > The value can be checked like this (for example):
> > ceph --admin-daemon /var/run/ceph/ceph-osd.2.asok config show
> >
> > If you restarted your OSD after updating ceph.conf I'm sure you will
> > find the values you set there.
> >
> > However you are seriously underestimating the packet storm you're
> > unleashing with 256 threads of 4KB packets over a 10Gb/s link.
> >
> > That's theoretically 256K packets/s, very quickly filling even your
> > "large" max ops setting.
> > Also the "journal max write entries" will need to be adjusted to suit
> > the abilities (speed and merge wise) of your OSDs.
> >
> > With 40 million max ops and 2048 max write I get this (instead of
> > similar values to you with the defaults):
> >
> >   1  256   2963   2707  10.5707   10.5742   0.125177  0.0830565
> >   2  256   5278   5022  9.80635   9.04297   0.247757  0.0968146
> >   3  256   7276   7020  9.13867   7.80469   0.002813  0.0994022
> >   4  256   8774   8518  8.31665   5.85156   0.002976  0.107339
> >   5  256  10121   9865  7.70548   5.26172   0.002569  0.117767
> >   6  256  11363  11107  7.22969   4.85156   0.38909   0.130649
> >   7  256  12354  12098  6.7498    3.87109   0.002857  0.137199
> >   8  256  12392  12136  5.92465   0.148438  1.15075   0.138359
> >   9  256  12551  12295  5.33538   0.621094  0.003575  0.151978
> >  10  256  13099  12843  5.0159    2.14062   0.146283  0.17639
> >
> > Of course this tails off eventually, but the effect is obvious and the
> > bandwidth is double that of the 

Re: [ceph-users] Questions about federated gateways configure

2014-04-09 Thread wsnote
Now I can get it configured, but it still doesn't seem to work.
The following is the error info.


 [root@ceph69 ceph]# radosgw-agent -c /etc/ceph/cluster-data-sync.conf
INFO:urllib3.connectionpool:Starting new HTTPS connection (1): s3.ceph71.com
ERROR:root:Could not retrieve region map from destination
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/radosgw_agent/cli.py", line 269, in 
main
region_map = client.get_region_map(dest_conn)
  File "/usr/lib/python2.6/site-packages/radosgw_agent/client.py", line 391, in 
get_region_map
region_map = request(connection, 'get', 'admin/config')
  File "/usr/lib/python2.6/site-packages/radosgw_agent/client.py", line 153, in 
request
result = handler(url, params=params, headers=request.headers, data=data)
  File "/usr/lib/python2.6/site-packages/requests/api.py", line 55, in get
return request('get', url, **kwargs)
  File "/usr/lib/python2.6/site-packages/requests/api.py", line 44, in request
return session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python2.6/site-packages/requests/sessions.py", line 279, in 
request
resp = self.send(prep, stream=stream, timeout=timeout, verify=verify, 
cert=cert, proxies=proxies)
  File "/usr/lib/python2.6/site-packages/requests/sessions.py", line 374, in 
send
r = adapter.send(request, **kwargs)
  File "/usr/lib/python2.6/site-packages/requests/adapters.py", line 213, in 
send
raise SSLError(e)
SSLError: hostname 's3.ceph71.com' doesn't match u'ceph71' 


What's the probable reason?
Thanks!




At 2014-04-09 16:24:48,wsnote  wrote:

Thank you very much!
I did as you said, but there are some errors.


 [root@ceph69 ceph]# radosgw-agent -c region-data-sync.conf
Traceback (most recent call last):
  File "/usr/bin/radosgw-agent", line 5, in 
from pkg_resources import load_entry_point
  File "/usr/lib/python2.6/site-packages/pkg_resources.py", line 2659, in 

parse_requirements(__requires__), Environment()
  File "/usr/lib/python2.6/site-packages/pkg_resources.py", line 546, in resolve
raise DistributionNotFound(req)
pkg_resources.DistributionNotFound: requests>=1.2.1 





At 2014-04-09 12:11:09,"Craig Lewis"  wrote:
I posted inline.


1. Create Pools
there are many us-east and us-west pools.
Do I have to create both us-east and us-west pools in a ceph instance? Or, I 
just create us-east pools in us-east zone and create us-west pools in us-west 
zone?

No, just create the us-east pools in the us-east cluster, and the us-west pools 
in the us-west cluster.




2. Create a keyring

Generate a Ceph Object Gateway user name and key for each instance.

sudo ceph-authtool /etc/ceph/ceph.client.radosgw.keyring -n 
client.radosgw.us-east-1 --gen-key
sudo ceph-authtool /etc/ceph/ceph.client.radosgw.keyring -n 
client.radosgw.us-west-1 --gen-key
Do I use all of the above commands in every ceph instance, or the first in 
the us-east zone and the second in the us-west zone?

For the keyrings, you should only need to do the key in the respective zone.  
I'm not 100% sure though, as I'm not using CephX.





3. add instances to ceph config file
[client.radosgw.us-east-1]
rgw region = us
rgw region root pool = .us.rgw.root
rgw zone = us-east
rgw zone root pool = .us-east.rgw.root
keyring = /etc/ceph/ceph.client.radosgw.keyring
rgw dns name = {hostname}
rgw socket path = /var/run/ceph/$name.sock
host = {host-name}

[client.radosgw.us-west-1]
rgw region = us
rgw region root pool = .us.rgw.root
rgw zone = us-west
rgw zone root pool = .us-west.rgw.root
keyring = /etc/ceph/ceph.client.radosgw.keyring
rgw dns name = {hostname}
rgw socket path = /var/run/ceph/$name.sock
host = {host-name}


Do both of the above configs go in one ceph.conf, or does us-east go in the 
us-east zone and us-west in the us-west zone?

Each one only needs to be in its own cluster's ceph.conf.  Assuming your 
client names are globally unique, it won't hurt if you put both in every 
cluster's ceph.conf. 




4. Create Zones
radosgw-admin zone set --rgw-zone=us-east --infile us-east.json --name 
client.radosgw.us-east-1
radosgw-admin zone set --rgw-zone=us-east --infile us-east.json --name 
client.radosgw.us-west-1
Use both commands in every instance or separately?

Yes, the zones need to know about each other.  The slaves definitely need to 
know the master zone information.  The master might be able to get away with 
not knowing about the slave zones, but I haven't tested it.  I ran both 
commands in both zones, using the respective --name argument for the node in 
the zone I was running the command on.




5. Create Zone Users


radosgw-admin user create --uid="us-east" --display-name="Region-US Zone-East" 
--name client.radosgw.us-east-1 --system
radosgw-admin user create --uid="us-west" --display-name="Region-US Zone-West" 
--name client.radosgw.us-west-1 --system
Does us-east zone have to create uid us-west?
Does us-west zone have to create uid us-east?

When you create the system users, you do need to create all users in all zones.  

Re: [ceph-users] Ceph and shared backend storage.

2014-04-09 Thread Robert van Leeuwen
> So .. the idea was that ceph would provide the required clustered filesystem 
> element,
>  and it was the only FS that provided the required "resize on the fly and 
> snapshotting" things that were needed.
> I can't see it working with one shared lun. In theory I can't see why it 
> couldn't work, but have no idea how the likes of
> vmfs achieve the locking across the cluster with single luns, I certainly 
> don't know of something within linux that could do it.

Linux also has a few clustered filesystems, e.g. GFS2 or OCFS.
I'm not sure how suited these are for fileservers though, since they will 
have to do lots of file locking.

> I guess I could see a clustered FS like Ceph providing something similar to a 
> software raid 1, where 
> two volumes were replicated and access from a couple of hosts point to the 
> two different backend "halves" of the raid via a load balancer?
* Ceph RBD (which is stable) just gives you a block device. 
This is "similar" to ISCSI except that the data is distributed across x Ceph 
nodes. 
Just as ISCSI you should mount this on two locations unless you run a clustered 
filesystem (e.g. GFS / OCFS)
* CephFS gives you a clustered posix filesystem. You can run NFS/CTDB directly 
on top of this.
In theory this is what you are looking for except that it isn't fully mature 
yet.

> The other option is to ditch the HA features and just go with samba on the 
> top of zfs, which could still provide the snapshots we need,
> although it's a step backwards, but then I don't like the sound of the ctdb 
> complexity or performance problems cited.
> I guess people just don't do HA in file sharing roles.. 
Most clusters will be HA but not active-active.
As mentioned above it is possible, but you might run into performance issues 
with file locking. (It's been a while since I did things with GFS.)

> What about NFS or the like for VM provision though - it's pretty similar just 
> with CTDB bolted on top?
With NFS you have the same issues: the posix filesystem it runs on needs to be 
clustered.
When there is a Ceph driver for vmware (not sure what the status is but I think 
they are working on it) you could plug your hypervisors directly into Ceph.
Still, running Ceph on a SAN defeats the purpose; you might as well just 
use ISCSI + vmfs directly on the SAN.

Cheers,
Robert



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph and shared backend storage.

2014-04-09 Thread Robert van Leeuwen
> This is "similar" to ISCSI except that the data is distributed accross x ceph 
> nodes.
> Just as ISCSI you should mount this on two locations unless you run a 
> clustered filesystem (e.g. GFS / OCFS)

Oops I meant, should NOT mount this on two locations unles... :)

Cheers,
Robert
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RadosGW: bad request

2014-04-09 Thread Gandalf Corvotempesta
2014-04-07 20:24 GMT+02:00 Yehuda Sadeh :
> Try bumping up logs (debug rgw = 20, debug ms = 1). Not enough info
> here to say much, note that it takes exactly 30 seconds for the
> gateway to send the error response, may be some timeout. I'd verify
> that the correct fastcgi module is running.
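
(For reference, one way to bump those settings is in the gateway's section of
ceph.conf, e.g.:

  [client.radosgw.gateway]
      debug rgw = 20
      debug ms = 1

followed by a restart of radosgw; the section name should match whatever -n
the gateway runs with.)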

Sorry for the huge delay.
http://pastebin.com/raw.php?i=JAFZtbRg

# ps aux | grep rados
root  9412  0.1  1.5 1062288 7836 ?Ssl  14:25   0:00
/usr/bin/radosgw -n client.radosgw.gateway
root  9547  0.0  0.1   7848   876 pts/0S+   14:30   0:00 grep rados

# ceph -s -n client.radosgw.gateway 2>/dev/null | grep HEALTH
 health HEALTH_OK
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph v0.79 Firefly RC :: erasure-code-profile command set not present

2014-04-09 Thread Alfredo Deza
On Wed, Apr 9, 2014 at 4:45 AM, Mark Kirkwood
 wrote:
> Hi Karan,
>
> Just to double check - run the same command after ssh'ing into each of the
> osd hosts, and maybe again on the monitor hosts too (just in case you have
> *some* hosts successfully updated to 0.79 and some still on < 0.78).

Just as a sanity check, erasure coding *is* present in 0.79 as
installed from scratch from the packages
that were built.

The problem seems to be a combination of an upgrade with monitors that
have not been restarted. This cannot
be replicated otherwise.
>
> Regards
>
> Mark
>
>
> On 08/04/14 22:32, Karan Singh wrote:
>>
>> Hi Loic
>>
>> Here is the output
>>
>> # ceph --version
>> ceph version 0.79 (4c2d73a5095f527c3a2168deb5fa54b3c8991a6e)
>> #
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How to bypass existing clusters when running teuthology

2014-04-09 Thread Jian Zhang
Hi lists,
Recently we have been trying to use teuthology for some tests; however, we ran
into some issues when trying to get it to use an existing cluster.

We found it pretty hard to find the related documents, and we had to go
through the code to understand which parameter to set. But even with
use_existing_cluster: true, it seems teuthology still tries to download
Ceph.

Can anyone point us to a link for the related source, or give an example
YAML file?
Thanks for the help in advance.


Here is the yaml file used for the tests.

check-locks: false
roles:
- [mon.0,  osd.0]
targets:
  ceph01: ssh-rsa xxx
use_existing_cluster: true
tasks:
- install: null
- ceph: null
- interactive:
verbose: true
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] 0.79 dependency issue on RPM packages

2014-04-09 Thread Alfredo Deza
Yesterday we found out that there was a dependency issue for the init
script on CentOS/RHEL
distros where we depend on some functions that are available through
redhat-lsb-core but were
not declared in the ceph.spec file.

This will cause daemons not to start at all since the init script will
attempt to source a file that is not
there (if that package is not installed).

The workaround for this issue is to just install that one package:

sudo yum install redhat-lsb-core

And make sure that `/lib/lsb/init-functions` is present.
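
A quick way to confirm both (on an RPM-based host):

  rpm -q redhat-lsb-core
  test -f /lib/lsb/init-functions && echo OK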

This should not affect Debian (and Debian based) distros.

Ticket reference: http://tracker.ceph.com/issues/8028
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD space usage 2x object size after rados put

2014-04-09 Thread Gregory Farnum
I don't think the backing store should be seeing any effects like
that. What are the filenames which are using up that space inside the
folders?
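
For example, something like the following (using the pg path from your mail)
would list them:

  ls -lh /var/lib/ceph/osd/ceph-1/current/5.1a_head/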
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Wed, Apr 9, 2014 at 1:58 AM, Mark Kirkwood
 wrote:
> Hi all,
>
> I've noticed that objects are using twice their actual space for a few
> minutes after they are 'put' via rados:
>
> $ ceph -v
> ceph version 0.79-42-g010dff1 (010dff12c38882238591bb042f8e497a1f7ba020)
>
> $ ceph osd tree
> # id    weight      type name         up/down  reweight
> -1      0.03998     root default
> -2      0.009995        host ceph1
> 0       0.009995            osd.0     up       1
> -3      0.009995        host ceph2
> 1       0.009995            osd.1     up       1
> -4      0.009995        host ceph3
> 2       0.009995            osd.2     up       1
> -5      0.009995        host ceph4
> 3       0.009995            osd.3     up       1
>
> $ ceph osd dump|grep repool
> pool 5 'repool' replicated size 3 min_size 2 crush_ruleset 0 object_hash
> rjenkins pg_num 64 pgp_num 64 last_change 57 owner 0 flags hashpspool
> stripe_width 0
>
> $ du -m  file
> 1025    file
>
> $ rados put -p repool file file
>
> $ cd /var/lib/ceph/osd/ceph-1/current/
> $ du -m 5.1a_head
> 2048  5.1a_head
>
> [later]
>
> $ du -m 5.1a_head
> 1024  5.1a_head
>
> The above situation is repeated on the other two OSDs where this pg is
> mapped. So after about 5 minutes we have (as expected) the 1G file using 1G
> on each of the 3 OSDs it is mapped to; however, for a short period of time
> it is using twice this! I am very interested to know what activity is
> happening that causes the 2x space use, as this could be a significant foot
> gun when uploading large files if we don't have 2x the space available on
> each OSD.
>
> Regards
>
> Mark
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to detect journal problems

2014-04-09 Thread Gregory Farnum
journal_max_write_bytes: the maximum amount of data the journal will
try to write at once when it's coalescing multiple pending ops in the
journal queue.
journal_queue_max_bytes: the maximum amount of data allowed to be
queued for journal writing.

In particular, both of those are about how much is waiting to get into
the durable journal, not waiting to get flushed out of it.
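
For reference, both live in the [osd] section of ceph.conf; a minimal sketch
(the values here are purely illustrative, not recommendations):

  [osd]
      journal max write bytes = 10485760
      journal queue max bytes = 33554432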
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Wed, Apr 9, 2014 at 3:06 AM, Christian Balzer  wrote:
> On Tue, 8 Apr 2014 09:35:19 -0700 Gregory Farnum wrote:
>
>> On Tuesday, April 8, 2014, Christian Balzer  wrote:
>>
>> > On Tue, 08 Apr 2014 14:19:20 +0200 Josef Johansson wrote:
>> > >
>> > > On 08/04/14 10:39, Christian Balzer wrote:
>> > > > On Tue, 08 Apr 2014 10:31:44 +0200 Josef Johansson wrote:
>> > > >
>> > > >> On 08/04/14 10:04, Christian Balzer wrote:
>> > > >>> Hello,
>> > > >>>
>> > > >>> On Tue, 08 Apr 2014 09:31:18 +0200 Josef Johansson wrote:
>> > > >>>
>> > >  Hi all,
>> > > 
>> > >  I am currently benchmarking a standard setup with Intel DC S3700
>> > >  disks as journals and Hitachi 4TB-disks as data-drives, all on
>> > >  LACP 10GbE network.
>> > > 
>> > > >>> Unless that is the 400GB version of the DC3700, you're already
>> > > >>> limiting yourself to 365MB/s throughput with the 200GB variant.
>> > > >>> If sequential write speed is that important to you and you think
>> > > >>> you'll ever get those 5 HDs to write at full speed with Ceph
>> > > >>> (unlikely).
>> > > >> It's the 400GB version of the DC3700, and yes, I'm aware that I
>> > > >> need a 1:3 ratio to max out these disks, as they write sequential
>> > > >> data at about 150MB/s.
>> > > >> But our thoughts are that it would cover the current demand with
>> > > >> a 1:5 ratio, but we could upgrade.
>> > > > I'd reckon you'll do fine, as in run out of steam and IOPS before
>> > > > hitting that limit.
>> > > >
>> > >  The size of my journals are 25GB each, and I have two journals
>> > >  per machine, with 5 OSDs per journal, with 5 machines in total.
>> > >  We currently use the tunables optimal and the version of ceph
>> > >  is the latest dumpling.
>> > > 
>> > >  Benchmarking writes with rbd show that there's no problem
>> > >  hitting upper levels on the 4TB-disks with sequential data,
>> > >  thus maxing out 10GbE. At this moment we see full utilization
>> > >  on the journals.
>> > > 
>> > >  But lowering the byte-size to 4k shows that the journals are
>> > >  utilized to about 20%, and the 4TB-disks 100%. (rados -p 
>> > >  -b 4096 -t 256 100 write)
>> > > 
>> > > >>> When you're saying utilization I assume you're talking about
>> > > >>> iostat or atop output?
>> > > >> Yes, the utilization is iostat.
>> > > >>> That's not a bug, that's comparing apple to oranges.
>> > > >> You mean comparing iostat-results with the ones from rados
>> > > >> benchmark?
>> > > >>> The rados bench default is 4MB, which not only happens to be the
>> > > >>> default RBD objectsize but also to generate a nice amount of
>> > > >>> bandwidth.
>> > > >>>
>> > > >>> While at 4k writes your SDD is obviously bored, but actual OSD
>> > > >>> needs to handle all those writes and becomes limited by the IOPS
>> > > >>> it can peform.
>> > > >> Yes, it's quite bored and just shuffles data.
>> > > >> Maybe I've been thinking about this the wrong way,
>> > > >> but shouldn't the Journal buffer more until the Journal partition
>> > > >> is full or when the flush_interval is met.
>> > > >>
>> > > > Take a look at "journal queue max ops", which has a default of a
>> > > > mere 500, so that's full after 2 seconds. ^o^
>> > > Hm, that makes sense.
>> > >
>> > > So, tested out both low values ( 5000 )  and large value ( 6553600 ),
>> > > but it didn't seem that change anything.
>> > > Any chance I could dump the current values from a running OSD, to
>> > > actually see what is in use?
>> > >
>> > The value can be checked like this (for example):
>> > ceph --admin-daemon /var/run/ceph/ceph-osd.2.asok config show
>> >
>> > If you restarted your OSD after updating ceph.conf I'm sure you will
>> > find the values you set there.
>> >
>> > However you are seriously underestimating the packet storm you're
>> > unleashing with 256 threads of 4KB packets over a 10Gb/s link.
>> >
>> > That's theoretically 256K packets/s, very quickly filling even your
>> > "large" max ops setting.
>> > Also the "journal max write entries" will need to be adjusted to suit
>> > the abilities (speed and merge wise) of your OSDs.
>> >
>> > With 40 million max ops and 2048 max write I get this (instead of
>> > similar values to you with the defaults):
>> >
>> >  1 256  2963  2707   10.5707   10.5742  0.125177
>> > 0.0830565 2 256  5278  5022   9.80635   9.04297  0.247757
>> > 0.0968146 3 256  7276  7020   9.13867   7.80469  0.002813
>> > 0.0994022 4 256  8774  85

Re: [ceph-users] Question about mark_unfound_lost on RGW metadata.

2014-04-09 Thread Gregory Farnum
Yeah, the log's not super helpful, but that and your description give
us something to talk about. Thanks!
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Tue, Apr 8, 2014 at 8:20 PM, Craig Lewis  wrote:
>
> Craig Lewis
> Senior Systems Engineer
> Office +1.714.602.1309
> Email cle...@centraldesktop.com
>
> Central Desktop. Work together in ways you never thought possible.
> Connect with us   Website  |  Twitter  |  Facebook  |  LinkedIn  |  Blog
>
> On 4/8/14 18:27 , Gregory Farnum wrote:
>
> On Tue, Apr 8, 2014 at 4:57 PM, Craig Lewis 
> wrote:
>
> pg query says the recovery state is:
>   "might_have_unfound": [
> { "osd": 11,
>   "status": "querying"},
> { "osd": 13,
>   "status": "already probed"}],
>
> I figured out why it wasn't probing osd.11.
>
> When I manually replaced the disk, I added the OSD to the cluster with a
> CRUSH weight of 0.
>
> As soon as I fixed the CRUSH weight, some PGs were allocated to the
> OSD, and the probing completed.  My PG that was stuck in recovery mode for
> 24h has been remapped to be on osd.11.  I believe this will allow the
> recovery to complete.
>
> Glad you worked it out. That sounds odd to me, though. Do you have any
> logs from osd.11?
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
> Sure, but I don't think they'll be very helpful.  I only had the default
> logging levels.  Here are the logs from today:
> https://cd.centraldesktop.com/p/eAAADQ70AEBvDJY
>
>
> At 2014-04-08 16:15, I restarted the OSD.  That was to force the stalled
> recovery to yield to another recovery/backfill.  It seems to get hung up
> every so often.  Whenever I only saw this one PG in recovery state for more
> than 15 minutes, I'd restart osd.11, and it would recover/backfill other PGs
> for another ~12 hours.  It's probably not helping that I have max backfills
> set to 1.
>
>
> I didn't record the exact time, but I ran a few of these, trying to zero in
> on the right weight for the device.  The final command was:
> ceph osd crush reweight osd.11 3.64
> around 17:00 PDT (timezone in the logs). The logs show a scrub
> starting at 2014-04-08 16:50:40.682409, so I'd say it was just before that.
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to detect journal problems

2014-04-09 Thread Christian Balzer

Hello,

On Wed, 9 Apr 2014 07:31:53 -0700 Gregory Farnum wrote:

> journal_max_write_bytes: the maximum amount of data the journal will
> try to write at once when it's coalescing multiple pending ops in the
> journal queue.
> journal_queue_max_bytes: the maximum amount of data allowed to be
> queued for journal writing.
> 
> In particular, both of those are about how much is waiting to get into
> the durable journal, not waiting to get flushed out of it.

Thanks a bundle for that clarification Greg.

So the tunable to play with when trying to push the backing storage to its
throughput limits would be "filestore min sync interval" then?

Or can something else cause the journal to be flushed long before it
becomes full?

Christian

> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
> 
> 
> On Wed, Apr 9, 2014 at 3:06 AM, Christian Balzer  wrote:
> > On Tue, 8 Apr 2014 09:35:19 -0700 Gregory Farnum wrote:
> >
> >> On Tuesday, April 8, 2014, Christian Balzer  wrote:
> >>
> >> > On Tue, 08 Apr 2014 14:19:20 +0200 Josef Johansson wrote:
> >> > >
> >> > > On 08/04/14 10:39, Christian Balzer wrote:
> >> > > > On Tue, 08 Apr 2014 10:31:44 +0200 Josef Johansson wrote:
> >> > > >
> >> > > >> On 08/04/14 10:04, Christian Balzer wrote:
> >> > > >>> Hello,
> >> > > >>>
> >> > > >>> On Tue, 08 Apr 2014 09:31:18 +0200 Josef Johansson wrote:
> >> > > >>>
> >> > >  Hi all,
> >> > > 
> >> > >  I am currently benchmarking a standard setup with Intel DC
> >> > >  S3700 disks as journals and Hitachi 4TB-disks as
> >> > >  data-drives, all on LACP 10GbE network.
> >> > > 
> >> > > >>> Unless that is the 400GB version of the DC3700, you're already
> >> > > >>> limiting yourself to 365MB/s throughput with the 200GB
> >> > > >>> variant. If sequential write speed is that important to you
> >> > > >>> and you think you'll ever get those 5 HDs to write at full
> >> > > >>> speed with Ceph (unlikely).
> >> > > >> It's the 400GB version of the DC3700, and yes, I'm aware that I
> >> > > >> need a 1:3 ratio to max out these disks, as they write
> >> > > >> sequential data at about 150MB/s.
> >> > > >> But our thoughts are that it would cover the current demand
> >> > > >> with a 1:5 ratio, but we could upgrade.
> >> > > > I'd reckon you'll do fine, as in run out of steam and IOPS
> >> > > > before hitting that limit.
> >> > > >
> >> > >  The size of my journals are 25GB each, and I have two
> >> > >  journals per machine, with 5 OSDs per journal, with 5
> >> > >  machines in total. We currently use the tunables optimal and
> >> > >  the version of ceph is the latest dumpling.
> >> > > 
> >> > >  Benchmarking writes with rbd show that there's no problem
> >> > >  hitting upper levels on the 4TB-disks with sequential data,
> >> > >  thus maxing out 10GbE. At this moment we see full utilization
> >> > >  on the journals.
> >> > > 
> >> > >  But lowering the byte-size to 4k shows that the journals are
> >> > >  utilized to about 20%, and the 4TB-disks 100%. (rados -p
> >> > >   -b 4096 -t 256 100 write)
> >> > > 
> >> > > >>> When you're saying utilization I assume you're talking about
> >> > > >>> iostat or atop output?
> >> > > >> Yes, the utilization is iostat.
> >> > > >>> That's not a bug, that's comparing apple to oranges.
> >> > > >> You mean comparing iostat-results with the ones from rados
> >> > > >> benchmark?
> >> > > >>> The rados bench default is 4MB, which not only happens to be
> >> > > >>> the default RBD objectsize but also to generate a nice amount
> >> > > >>> of bandwidth.
> >> > > >>>
> >> > > >>> While at 4k writes your SDD is obviously bored, but actual OSD
> >> > > >>> needs to handle all those writes and becomes limited by the
> >> > > >>> IOPS it can peform.
> >> > > >> Yes, it's quite bored and just shuffles data.
> >> > > >> Maybe I've been thinking about this the wrong way,
> >> > > >> but shouldn't the Journal buffer more until the Journal
> >> > > >> partition is full or when the flush_interval is met.
> >> > > >>
> >> > > > Take a look at "journal queue max ops", which has a default of a
> >> > > > mere 500, so that's full after 2 seconds. ^o^
> >> > > Hm, that makes sense.
> >> > >
> >> > > So, tested out both low values ( 5000 )  and large value
> >> > > ( 6553600 ), but it didn't seem that change anything.
> >> > > Any chance I could dump the current values from a running OSD, to
> >> > > actually see what is in use?
> >> > >
> >> > The value can be checked like this (for example):
> >> > ceph --admin-daemon /var/run/ceph/ceph-osd.2.asok config show
> >> >
> >> > If you restarted your OSD after updating ceph.conf I'm sure you will
> >> > find the values you set there.
> >> >
> >> > However you are seriously underestimating the packet storm you're
> >> > unleashing with 256 threads of 4KB packets over a 10Gb/s link.
> >> >
> >> > That's theoretically 256K packets/s, very quickly filling even your
> >> > "la

Re: [ceph-users] How to detect journal problems

2014-04-09 Thread Josef Johansson
Thanks all for helping to clarify in this matter :)

On 09/04/14 17:03, Christian Balzer wrote:
> Hello,
>
> On Wed, 9 Apr 2014 07:31:53 -0700 Gregory Farnum wrote:
>
>> journal_max_write_bytes: the maximum amount of data the journal will
>> try to write at once when it's coalescing multiple pending ops in the
>> journal queue.
>> journal_queue_max_bytes: the maximum amount of data allowed to be
>> queued for journal writing.
>>
>> In particular, both of those are about how much is waiting to get into
>> the durable journal, not waiting to get flushed out of it.
> Thanks a bundle for that clarification Greg.
>
> So the tunable to play with when trying to push the backing storage to its
> throughput limits would be "filestore min sync interval" then?
>
> Or can something else cause the journal to be flushed long before it
> becomes full?
This. Because this is what I see: the OSDs writing at 1-3MB/s with
300 w/s, at 100% util, which makes me want to optimize the journal further.

Even if I crank the journal_queue settings higher, it seems to stay that way.

My idea of the journal making everything sequential was that the data
would merge inside the journal and go out to the disk as nice
sequential I/O.

I assume it could also be that it didn't manage to merge the ops
because they were spread out too much. As the objects are 4M, maybe the
4K writes are spread out over different objects.

Cheers,
Josef
> Christian
>
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>
>>
>> On Wed, Apr 9, 2014 at 3:06 AM, Christian Balzer  wrote:
>>> On Tue, 8 Apr 2014 09:35:19 -0700 Gregory Farnum wrote:
>>>
 On Tuesday, April 8, 2014, Christian Balzer  wrote:

> On Tue, 08 Apr 2014 14:19:20 +0200 Josef Johansson wrote:
>> On 08/04/14 10:39, Christian Balzer wrote:
>>> On Tue, 08 Apr 2014 10:31:44 +0200 Josef Johansson wrote:
>>>
 On 08/04/14 10:04, Christian Balzer wrote:
> Hello,
>
> On Tue, 08 Apr 2014 09:31:18 +0200 Josef Johansson wrote:
>
>> Hi all,
>>
>> I am currently benchmarking a standard setup with Intel DC
>> S3700 disks as journals and Hitachi 4TB-disks as
>> data-drives, all on LACP 10GbE network.
>>
> Unless that is the 400GB version of the DC3700, you're already
> limiting yourself to 365MB/s throughput with the 200GB
> variant. If sequential write speed is that important to you
> and you think you'll ever get those 5 HDs to write at full
> speed with Ceph (unlikely).
 It's the 400GB version of the DC3700, and yes, I'm aware that I
 need a 1:3 ratio to max out these disks, as they write
 sequential data at about 150MB/s.
 But our thoughts are that it would cover the current demand
 with a 1:5 ratio, but we could upgrade.
>>> I'd reckon you'll do fine, as in run out of steam and IOPS
>>> before hitting that limit.
>>>
>> The size of my journals are 25GB each, and I have two
>> journals per machine, with 5 OSDs per journal, with 5
>> machines in total. We currently use the tunables optimal and
>> the version of ceph is the latest dumpling.
>>
>> Benchmarking writes with rbd show that there's no problem
>> hitting upper levels on the 4TB-disks with sequential data,
>> thus maxing out 10GbE. At this moment we see full utilization
>> on the journals.
>>
>> But lowering the byte-size to 4k shows that the journals are
>> utilized to about 20%, and the 4TB-disks 100%. (rados -p
>>  -b 4096 -t 256 100 write)
>>
> When you're saying utilization I assume you're talking about
> iostat or atop output?
 Yes, the utilization is iostat.
> That's not a bug, that's comparing apple to oranges.
 You mean comparing iostat-results with the ones from rados
 benchmark?
> The rados bench default is 4MB, which not only happens to be
> the default RBD objectsize but also to generate a nice amount
> of bandwidth.
>
> While at 4k writes your SDD is obviously bored, but actual OSD
> needs to handle all those writes and becomes limited by the
> IOPS it can peform.
 Yes, it's quite bored and just shuffles data.
 Maybe I've been thinking about this the wrong way,
 but shouldn't the Journal buffer more until the Journal
 partition is full or when the flush_interval is met.

>>> Take a look at "journal queue max ops", which has a default of a
>>> mere 500, so that's full after 2 seconds. ^o^
>> Hm, that makes sense.
>>
>> So, tested out both low values ( 5000 )  and large value
>> ( 6553600 ), but it didn't seem that change anything.
>> Any chance I could dump the current values from a running OSD, to
>> actually see what is in u

Re: [ceph-users] How to detect journal problems

2014-04-09 Thread Gregory Farnum
On Wed, Apr 9, 2014 at 8:03 AM, Christian Balzer  wrote:
>
> Hello,
>
> On Wed, 9 Apr 2014 07:31:53 -0700 Gregory Farnum wrote:
>
>> journal_max_write_bytes: the maximum amount of data the journal will
>> try to write at once when it's coalescing multiple pending ops in the
>> journal queue.
>> journal_queue_max_bytes: the maximum amount of data allowed to be
>> queued for journal writing.
>>
>> In particular, both of those are about how much is waiting to get into
>> the durable journal, not waiting to get flushed out of it.
>
> Thanks a bundle for that clarification Greg.
>
> So the tunable to play with when trying to push the backing storage to its
> throughput limits would be "filestore min sync interval" then?
>
> Or can something else cause the journal to be flushed long before it
> becomes full?

The min and max sync intervals are the principal controls (and it will
in general sync on the min interval), yes. It will also flush when it
reaches half full (not entirely full) so that it can continue
accepting incoming writes.
The tradeoff with those sync intervals is that turning them up means
the sync could take longer, and I think there are some impacts on
applying writes to the filesystem, but I don't remember for sure.
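
For reference, a minimal ceph.conf sketch of those knobs (the values here are
only illustrative, not recommendations):

  [osd]
      filestore min sync interval = 0.01
      filestore max sync interval = 5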
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Questions about federated gateways configure

2014-04-09 Thread Craig Lewis

On 4/9/2014 3:33 AM, wsnote wrote:

Now I can get it configured, but it still doesn't seem to work.
The following is the error info.

[root@ceph69 ceph]# radosgw-agent -c /etc/ceph/cluster-data-sync.conf
INFO:urllib3.connectionpool:Starting new HTTPS connection (1): 
s3.ceph71.com

ERROR:root:Could not retrieve region map from destination


This error means that radosgw-agent can't retrieve the region and zone 
maps from the slave zone.


In cluster-data-sync.conf, double check that the destination URL is 
correct, and that ceph69 can connect to that URL.


Next, verify that dest_access_key and dest_secret_key are correct. Compare 
them to radosgw-admin user show --name client.radosgw.us-east-1; 
radosgw-agent uses those credentials to pull that data. Before I 
started, I made sure that none of my secret keys had backslashes.


One of the issues I ran into was making sure I created everything in the 
zone RGW pools, not the default RGW pools. Sometimes my users would end 
up in .users.uid, not .us.east-1.users.uid, because I forgot to add the 
--name parameter to the radosgw-admin commands.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS feature set mismatch with v0.79 and recent kernel

2014-04-09 Thread Michael Nelson
Actually my intent is to use EC with RGW pools :). If I fiddle around with 
cap bits temporarily will I be able to get things to work, or will 
protocol issues / CRUSH map parsing get me into trouble?


Is there an idea of when this might work in general? That is, even if the 
kernel doesn't support EC pools directly, could it still work in a cluster 
with EC pools in use?


Thanks,
-mike

On Wed, 9 Apr 2014, Gregory Farnum wrote:


This flag won't be listed as required if you don't have any erasure
coding parameters in your OSD/crush maps. So if you aren't using it,
you should remove the EC rules and the kernel should be happy.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Tue, Apr 8, 2014 at 6:08 PM, Aaron Ten Clay  wrote:

On Tue, Apr 8, 2014 at 4:50 PM, Michael Nelson 
wrote:


I am trying to mount CephFS from a freshly installed v0.79 cluster using a
kernel built from git.kernel.org:kernel/git/sage/ceph-client.git (for-linus
a30be7cb) and running into the following dmesg errors on mount:

libceph: mon0 198.18.32.12:6789 feature set mismatch, my 2b84a042aca <
server's 2f84a042aca, missing 40
libceph: mon0 198.18.32.12:6789 socket error on read

which maps to:

ceph_features.h:#define CEPH_FEATURE_OSD_ERASURE_CODES (1ULL<<38)
ceph_features.h:#define CEPH_FEATURE_OSD_TMAP2OMAP (1ULL<<38)   /* overlap
with EC */

The same issue happens on the official 3.14 kernel.



According to the documentation, this is only supported by 3.15:
http://ceph.com/docs/master/rados/operations/crush-map/#which-client-versions-support-crush-tunables3

I don't know what kernel patches implement support for this, but you can
work around the problem by using the FUSE client until patches are released.
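
A minimal sketch of the FUSE workaround, assuming ceph-fuse is installed and
/etc/ceph holds a usable ceph.conf and keyring (the monitor address is the one
from the error above):

$ sudo mkdir -p /mnt/cephfs
$ sudo ceph-fuse -m 198.18.32.12:6789 /mnt/cephfs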



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS feature set mismatch with v0.79 and recent kernel

2014-04-09 Thread Gregory Farnum
I'm not sure when that'll happen -- supporting partial usage isn't
something we're targeting right now. Most users are segregated into
one kind of client (userspace or kernel).
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Wed, Apr 9, 2014 at 12:10 PM, Michael Nelson  wrote:
> Actually my intent is to use EC with RGW pools :). If I fiddle around with
> cap bits temporarily will I be able to get things to work, or will protocol
> issues / CRUSH map parsing get me into trouble?
>
> Is there an idea of when this might work in general? Even if the kernel
> doesn't support EC pools directly, but would work in a cluster with EC pools
> in use?
>
> Thanks,
> -mike
>
>
> On Wed, 9 Apr 2014, Gregory Farnum wrote:
>
>> This flag won't be listed as required if you don't have any erasure
>> coding parameters in your OSD/crush maps. So if you aren't using it,
>> you should remove the EC rules and the kernel should be happy.
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>
>>
>> On Tue, Apr 8, 2014 at 6:08 PM, Aaron Ten Clay 
>> wrote:
>>>
>>> On Tue, Apr 8, 2014 at 4:50 PM, Michael Nelson 
>>> wrote:


 I am trying to mount CephFS from a freshly installed v0.79 cluster using
 a
 kernel built from git.kernel.org:kernel/git/sage/ceph-client.git
 (for-linus
 a30be7cb) and running into the following dmesg errors on mount:

 libceph: mon0 198.18.32.12:6789 feature set mismatch, my 2b84a042aca <
 server's 2f84a042aca, missing 40
 libceph: mon0 198.18.32.12:6789 socket error on read

 which maps to:

 ceph_features.h:#define CEPH_FEATURE_OSD_ERASURE_CODES (1ULL<<38)
 ceph_features.h:#define CEPH_FEATURE_OSD_TMAP2OMAP (1ULL<<38)   /*
 overlap
 with EC */

 The same issue happens on the official 3.14 kernel.
>>>
>>>
>>>
>>> According to the documentation, this is only supported by 3.15:
>>>
>>> http://ceph.com/docs/master/rados/operations/crush-map/#which-client-versions-support-crush-tunables3
>>>
>>> I don't know what kernel patches implement support for this, but you can
>>> work around the problem by using the FUSE client until patches are
>>> released.
>>>
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] unsubscribe

2014-04-09 Thread Steve Carter


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Largest Production Ceph Cluster

2014-04-09 Thread Craig Lewis
I had a question about one of the points on the slide.  Slide 24, last 
bullet point, says:


 * If you use XFS, don't put your OSD journal as a file on the disk
   - Use a separate partition, the first partition!
   - We still need to reinstall our whole cluster to re-partition the
     OSDs


Do you have any details on that?

I am doing that, mostly because my rados bench test indicated that it 
was faster than using a journal partition.  Now I'm having time-out 
issues with OSDs during recovery, when there really shouldn't be any, so 
I'm curious.
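
For reference, a quick way to tell the two layouts apart on an existing OSD,
assuming the default data path:

$ ls -l /var/lib/ceph/osd/ceph-0/journal

A file-backed journal shows up there as a regular file, while a
partition-backed one is typically a symlink to a device such as /dev/sdb1 or
/dev/disk/by-partuuid/....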



Thanks for any info.


*Craig Lewis*
Senior Systems Engineer
Office +1.714.602.1309
Email cle...@centraldesktop.com 

*Central Desktop. Work together in ways you never thought possible.*



On 4/1/14 07:18 , Dan Van Der Ster wrote:

Hi,

On 1 Apr 2014 at 15:59:07, Andrey Korolyov (and...@xdel.ru 
) wrote:

On 04/01/2014 03:51 PM, Robert Sander wrote:
> On 01.04.2014 13:38, Karol Kozubal wrote:
>
>> I am curious to know what is the largest known ceph production 
>> deployment?

>
> I would assume it is the CERN installation.
>
> Have a look at the slides from Frankfurt Ceph Day:
>
> http://www.slideshare.net/Inktank_Ceph/scaling-ceph-at-cern
>
> Regards
>

Just curious how the CERN guys built the network topology to prevent
possible cluster splits, because a split down the middle would cause huge
downtime even for a relatively short split - just long enough for the
remaining MON majority to mark half of those 1k OSDs as down.


The mons are distributed around the data centre, across N switches.
The OSDs are across a few switches --- actually, we could use CRUSH 
rules to replicate across switches but didn't do so because of an 
(unconfirmed) fear that the uplinks would become a bottleneck.
So a switch or routing outage scenario is clearly a point of failure 
where some PGs could become stale, but we've been lucky enough not to 
suffer from that yet.
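
(For anyone curious what such a rule would look like, here is a rough sketch.
It assumes the CRUSH map defines a "switch" bucket type between host and root,
which is not there by default:

rule replicate_across_switches {
        ruleset 3
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type switch
        step emit
}

Each replica then lands under a different switch bucket, at the cost of
pushing replication traffic over the uplinks.)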


BTW, this 3PB cluster was built to test the scalability of Ceph's 
implementation, not because we have 3PB of data to store in Ceph today 
(most of the results of those tests are discussed in that 
presentation.). And we are currently partitioning this cluster down 
into a smaller production instance for Cinder and other instances for 
ongoing tests.


BTW#2, I don't think the CERN cluster is the largest. Isn't 
DreamHost's bigger?


Cheers, Dan

-- Dan van der Ster || Data & Storage Services || CERN IT Department --


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Questions about federated gateways configure

2014-04-09 Thread wsnote
In cluster-data-sync.conf, if I use https, then it will show the error:


INFO:urllib3.connectionpool:Starting new HTTPS connection (1): s3.ceph71.com
ERROR:root:Could not retrieve region map from destination
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/radosgw_agent/cli.py", line 269, in 
main
region_map = client.get_region_map(dest_conn)
  File "/usr/lib/python2.6/site-packages/radosgw_agent/client.py", line 391, in 
get_region_map
region_map = request(connection, 'get', 'admin/config')
  File "/usr/lib/python2.6/site-packages/radosgw_agent/client.py", line 153, in 
request
result = handler(url, params=params, headers=request.headers, data=data)
  File "/usr/lib/python2.6/site-packages/requests/api.py", line 55, in get
return request('get', url, **kwargs)
  File "/usr/lib/python2.6/site-packages/requests/api.py", line 44, in request
return session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python2.6/site-packages/requests/sessions.py", line 279, in 
request
resp = self.send(prep, stream=stream, timeout=timeout, verify=verify, 
cert=cert, proxies=proxies)
  File "/usr/lib/python2.6/site-packages/requests/sessions.py", line 374, in 
send
r = adapter.send(request, **kwargs)
  File "/usr/lib/python2.6/site-packages/requests/adapters.py", line 213, in 
send
raise SSLError(e)
SSLError: hostname 's3.ceph71.com' doesn't match u'ceph71'
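
(The SSLError above suggests the certificate served by s3.ceph71.com was
issued for the bare name "ceph71" rather than the hostname being requested.
One way to confirm what the gateway presents, assuming openssl is available
on the client:

$ echo | openssl s_client -connect s3.ceph71.com:443 2>/dev/null | openssl x509 -noout -subject

The certificate needs a CN or subjectAltName matching s3.ceph71.com, or the
destination URL has to use whatever name the certificate was issued for.)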


If I use http, there is no error, and the log is
INFO:radosgw_agent.worker:finished processing shard 26
INFO:radosgw_agent.sync:27/128 items processed
INFO:radosgw_agent.worker:15413 is processing shard number 27
INFO:urllib3.connectionpool:Starting new HTTP connection (1): s3.ceph71.com
INFO:urllib3.connectionpool:Starting new HTTP connection (1): s3.ceph71.com
INFO:radosgw_agent.worker:finished processing shard 27
INFO:radosgw_agent.sync:28/128 items processed
INFO:radosgw_agent.worker:15413 is processing shard number 28
INFO:urllib3.connectionpool:Starting new HTTP connection (1): s3.ceph71.com
INFO:urllib3.connectionpool:Starting new HTTP connection (1): s3.ceph71.com
INFO:urllib3.connectionpool:Starting new HTTP connection (1): s3.ceph69.com
INFO:radosgw_agent.worker:syncing bucket "zhangyt6"
INFO:urllib3.connectionpool:Starting new HTTP connection (1): s3.ceph71.com
INFO:urllib3.connectionpool:Starting new HTTP connection (1): s3.ceph71.com
INFO:urllib3.connectionpool:Starting new HTTP connection (1): s3.ceph71.com
INFO:urllib3.connectionpool:Starting new HTTP connection (1): s3.ceph71.com
INFO:urllib3.connectionpool:Starting new HTTP connection (1): s3.ceph71.com
INFO:urllib3.connectionpool:Starting new HTTP connection (1): s3.ceph71.com
INFO:urllib3.connectionpool:Starting new HTTP connection (1): s3.ceph71.com
INFO:urllib3.connectionpool:Starting new HTTP connection (1): s3.ceph71.com


I can see the bucket being synced, but the sync fails:
 INFO:radosgw_agent.worker:syncing bucket "zhangyt6"




Another question: it will create a pool called .rgw.root automatically.
Does that have any effect?


Thanks!





At 2014-04-10 00:40:15,"Craig Lewis"  wrote:

On 4/9/2014 3:33 AM, wsnote wrote:

Now I can configure it, but it doesn't seem to work.
The following is the Error info.


 [root@ceph69 ceph]# radosgw-agent -c /etc/ceph/cluster-data-sync.conf
INFO:urllib3.connectionpool:Starting new HTTPS connection (1): s3.ceph71.com
ERROR:root:Could not retrieve region map from destination


This error means that radosgw-agent can't retrieve the region and zone maps 
from the slave zone.

In cluster-data-sync.conf, double check that the destination URL is correct, 
and that ceph69 can connect to that URL. 

Next verify dest_access_key and dest_secret_key are correct.  Compare them to 
radosgw-admin user show --name client.radosgw.us-east-1.  radosgw-agent uses 
those credentials to pull that data.  Before I started, I made sure that all of 
my secret keys did not have backslashes.

One of the issues I ran into was making sure I created everything in the zone 
RGW pools, not the default RGW pools.  Sometimes my users would end up in 
.users.uid, not .us.east-1.users.uid, because I forgot to add the --name 
parameter to the radosgw-admin commands.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD space usage 2x object size after rados put

2014-04-09 Thread Mark Kirkwood

It is only that single pg using the space (see attached) - but essentially:

$ du -m /var/lib/ceph/osd/ceph-1
...
2048    /var/lib/ceph/osd/ceph-1/current/5.1a_head
2053    /var/lib/ceph/osd/ceph-1/current
2053    /var/lib/ceph/osd/ceph-1/

This drops back to 1025 soon after. Interestingly, I am not seeing this 
effect (same ceph version) on a single-host setup with 2 osds using 
preexisting partitions... it's only on these multi-host configurations 
that have osds using whole devices (both setups were installed using 
ceph-deploy, so in theory there is nothing exotic about them, except that 
the multi-host 'hosts' are actually VMs).


Regards

Mark
On 10/04/14 02:27, Gregory Farnum wrote:

I don't think the backing store should be seeing any effects like
that. What are the filenames which are using up that space inside the
folders?
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Wed, Apr 9, 2014 at 1:58 AM, Mark Kirkwood
 wrote:

Hi all,

I've noticed that objects are using twice their actual space for a few
minutes after they are 'put' via rados:

$ ceph -v
ceph version 0.79-42-g010dff1 (010dff12c38882238591bb042f8e497a1f7ba020)

$ ceph osd tree
# id    weight      type name       up/down reweight
-1      0.03998     root default
-2      0.009995        host ceph1
0       0.009995            osd.0   up      1
-3      0.009995        host ceph2
1       0.009995            osd.1   up      1
-4      0.009995        host ceph3
2       0.009995            osd.2   up      1
-5      0.009995        host ceph4
3       0.009995            osd.3   up      1

$ ceph osd dump|grep repool
pool 5 'repool' replicated size 3 min_size 2 crush_ruleset 0 object_hash
rjenkins pg_num 64 pgp_num 64 last_change 57 owner 0 flags hashpspool
stripe_width 0

$ du -m  file
1025    file

$ rados put -p repool file file

$ cd /var/lib/ceph/osd/ceph-1/current/
$ du -m 5.1a_head
2048  5.1a_head

[later]

$ du -m 5.1a_head
1024  5.1a_head

The above situation is repeated on the other two OSD's where this pg is
mapped. So after about 5 minutes or so we have (as expected) that the 1G
file is using 1G on each of the 3 OSD's it is mapped to, however for a short
period of time it is using twice this! I'm very interested to know what
activity is happening that causes the 2x space use - as this could be a
significant foot gun if uploading large files when we don't have 2x the
space available on each OSD.

Regards

Mark
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


5   /var/lib/ceph/osd/ceph-1/current/omap
1   /var/lib/ceph/osd/ceph-1/current/meta/DIR_3
1   /var/lib/ceph/osd/ceph-1/current/meta/DIR_B
1   /var/lib/ceph/osd/ceph-1/current/meta/DIR_C
1   /var/lib/ceph/osd/ceph-1/current/meta/DIR_E
0   /var/lib/ceph/osd/ceph-1/current/meta/DIR_F
1   /var/lib/ceph/osd/ceph-1/current/meta
0   /var/lib/ceph/osd/ceph-1/current/2.9_head
0   /var/lib/ceph/osd/ceph-1/current/2.e_head
0   /var/lib/ceph/osd/ceph-1/current/0.c_head
0   /var/lib/ceph/osd/ceph-1/current/1.d_head
0   /var/lib/ceph/osd/ceph-1/current/1.c_head
0   /var/lib/ceph/osd/ceph-1/current/2.c_head
0   /var/lib/ceph/osd/ceph-1/current/1.f_head
0   /var/lib/ceph/osd/ceph-1/current/2.d_head
0   /var/lib/ceph/osd/ceph-1/current/1.11_head
0   /var/lib/ceph/osd/ceph-1/current/2.13_head
0   /var/lib/ceph/osd/ceph-1/current/0.11_head
0   /var/lib/ceph/osd/ceph-1/current/3.1_head
0   /var/lib/ceph/osd/ceph-1/current/3.27_head
0   /var/lib/ceph/osd/ceph-1/current/1.13_head
0   /var/lib/ceph/osd/ceph-1/current/3.4_head
0   /var/lib/ceph/osd/ceph-1/current/3.29_head
0   /var/lib/ceph/osd/ceph-1/current/1.14_head
0   /var/lib/ceph/osd/ceph-1/current/0.16_head
0   /var/lib/ceph/osd/ceph-1/current/1.16_head
0   /var/lib/ceph/osd/ceph-1/current/3.2e_head
0   /var/lib/ceph/osd/ceph-1/current/1.19_head
0   /var/lib/ceph/osd/ceph-1/current/1.18_head
0   /var/lib/ceph/osd/ceph-1/current/3.32_head
0   /var/lib/ceph/osd/ceph-1/current/0.1a_head
0   /var/lib/ceph/osd/ceph-1/current/1.1b_head
0   /var/lib/ceph/osd/ceph-1/current/2.19_head
0   /var/lib/ceph/osd/ceph-1/current/1.1a_head
0   /var/lib/ceph/osd/ceph-1/current/0.1b_head
0   /var/lib/ceph/osd/ceph-1/current/2.1e_head
0   /var/lib/ceph/osd/ceph-1/current/0.1c_head
0   /var/lib/ceph/osd/ceph-1/current/1.1d_head
0   /var/lib/ceph/osd/ceph-1/current/3.6_head
0   /var/lib/ceph/osd/ceph-1/current/3.38_head
0   /var/lib/ceph/osd/ceph-1/current/0.1e_head
0   /var/lib/ceph/osd/ceph-1/current/2.1d_head
0   /var/lib/ceph/osd/ceph-1/current/1.1e_head
0   /var/lib/ceph/osd/ceph-1/current/3.42_head
0   /var/lib/ceph/osd/ceph-1/current/0.20_head
0   /var/lib/ceph/osd/ceph-1/current/2.23_head

Re: [ceph-users] OSD space usage 2x object size after rados put

2014-04-09 Thread Gregory Farnum
Right, but I'm interested in the space allocation within the PG. The
best guess I can come up with without trawling through the code is
that some layer in the stack is preallocated and then trimmed the
objects back down once writing stops, but I'd like some more data
points before I dig.
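
(One way to collect that data point, while the extra space is still visible:
list the per-file sizes inside the PG directory from the earlier mails, e.g.

$ ls -lhs /var/lib/ceph/osd/ceph-1/current/5.1a_head/
$ find /var/lib/ceph/osd/ceph-1/current/5.1a_head -type f -exec du -m {} + | sort -rn | head

Comparing apparent size (ls -l) with allocated blocks (ls -s / du) should show
whether the extra gigabyte is preallocation inside the object files or extra
temporary files.)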
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Wed, Apr 9, 2014 at 7:59 PM, Mark Kirkwood
 wrote:
> It is only that single pg using the space (see attached) - but essentially:
>
> $ du -m /var/lib/ceph/osd/ceph-1
> ...
> 2048    /var/lib/ceph/osd/ceph-1/current/5.1a_head
> 2053    /var/lib/ceph/osd/ceph-1/current
> 2053    /var/lib/ceph/osd/ceph-1/
>
> Which is resized to 1025 soon after. Interestingly I am not seeing this
> effect (same ceph version) on a single host setup with 2 osds using
> preexisting partitions... it's only on these multi host configurations that
> have osd's using whole devices (both setups installed using ceph-deploy, so
> in theory nothing exotic about 'em except for the multi 'hosts' are actually
> VMs).
>
> Regards
>
> Mark
>
> On 10/04/14 02:27, Gregory Farnum wrote:
>>
>> I don't think the backing store should be seeing any effects like
>> that. What are the filenames which are using up that space inside the
>> folders?
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>
>>
>> On Wed, Apr 9, 2014 at 1:58 AM, Mark Kirkwood
>>  wrote:
>>>
>>> Hi all,
>>>
>>> I've noticed that objects are using twice their actual space for a few
>>> minutes after they are 'put' via rados:
>>>
>>> $ ceph -v
>>> ceph version 0.79-42-g010dff1 (010dff12c38882238591bb042f8e497a1f7ba020)
>>>
>>> $ ceph osd tree
>>> # id    weight      type name       up/down reweight
>>> -1      0.03998     root default
>>> -2      0.009995        host ceph1
>>> 0       0.009995            osd.0   up      1
>>> -3      0.009995        host ceph2
>>> 1       0.009995            osd.1   up      1
>>> -4      0.009995        host ceph3
>>> 2       0.009995            osd.2   up      1
>>> -5      0.009995        host ceph4
>>> 3       0.009995            osd.3   up      1
>>>
>>> $ ceph osd dump|grep repool
>>> pool 5 'repool' replicated size 3 min_size 2 crush_ruleset 0 object_hash
>>> rjenkins pg_num 64 pgp_num 64 last_change 57 owner 0 flags hashpspool
>>> stripe_width 0
>>>
>>> $ du -m  file
>>> 1025    file
>>>
>>> $ rados put -p repool file file
>>>
>>> $ cd /var/lib/ceph/osd/ceph-1/current/
>>> $ du -m 5.1a_head
>>> 2048  5.1a_head
>>>
>>> [later]
>>>
>>> $ du -m 5.1a_head
>>> 1024  5.1a_head
>>>
>>> The above situation is repeated on the other two OSD's where this pg is
>>> mapped. So after about 5 minutes or so we have (as expected) that the 1G
>>> file is using 1G on each of the 3 OSD's it is mapped to, however for a
>>> short
>>> period of time it is using twice this! I'm very interested to know what
>>> activity is happening that causes the 2x space use - as this could be a
>>> significant foot gun if uploading large files when we don't have 2x the
>>> space available on each OSD.
>>>
>>> Regards
>>>
>>> Mark
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD space usage 2x object size after rados put

2014-04-09 Thread Mark Kirkwood
Ah right - sorry, I didn't realize that my 'du' was missing the files! I 
will retest and post updated output.


Cheers

Mark

On 10/04/14 15:04, Gregory Farnum wrote:

Right, but I'm interested in the space allocation within the PG. The
best guess I can come up with without trawling through the code is
that some layer in the stack is preallocated and then trimmed the
objects back down once writing stops, but I'd like some more data
points before I dig.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Wed, Apr 9, 2014 at 7:59 PM, Mark Kirkwood
 wrote:

It is only that single pg using the space (see attached) - but essentially:

$ du -m /var/lib/ceph/osd/ceph-1
...
2048    /var/lib/ceph/osd/ceph-1/current/5.1a_head
2053    /var/lib/ceph/osd/ceph-1/current
2053    /var/lib/ceph/osd/ceph-1/

Which is resized to 1025 soon after. Interestingly I am not seeing this
effect (same ceph version) on a single host setup with 2 osds using
preexisting partitions... it's only on these multi host configurations that
have osd's using whole devices (both setups installed using ceph-deploy, so
in theory nothing exotic about 'em except for the multi 'hosts' are actually
VMs).

Regards

Mark

On 10/04/14 02:27, Gregory Farnum wrote:

I don't think the backing store should be seeing any effects like
that. What are the filenames which are using up that space inside the
folders?
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Wed, Apr 9, 2014 at 1:58 AM, Mark Kirkwood
 wrote:

Hi all,

I've noticed that objects are using twice their actual space for a few
minutes after they are 'put' via rados:

$ ceph -v
ceph version 0.79-42-g010dff1 (010dff12c38882238591bb042f8e497a1f7ba020)

$ ceph osd tree
# id    weight      type name       up/down reweight
-1      0.03998     root default
-2      0.009995        host ceph1
0       0.009995            osd.0   up      1
-3      0.009995        host ceph2
1       0.009995            osd.1   up      1
-4      0.009995        host ceph3
2       0.009995            osd.2   up      1
-5      0.009995        host ceph4
3       0.009995            osd.3   up      1

$ ceph osd dump|grep repool
pool 5 'repool' replicated size 3 min_size 2 crush_ruleset 0 object_hash
rjenkins pg_num 64 pgp_num 64 last_change 57 owner 0 flags hashpspool
stripe_width 0

$ du -m  file
1025    file

$ rados put -p repool file file

$ cd /var/lib/ceph/osd/ceph-1/current/
$ du -m 5.1a_head
2048  5.1a_head

[later]

$ du -m 5.1a_head
1024  5.1a_head

The above situation is repeated on the other two OSD's where this pg is
mapped. So after about 5 minutes or so we have (as expected) that the 1G
file is using 1G on each of the 3 OSD's it is mapped to, however for a
short
period of time it is using twice this! I'm very interested to know what
activity is happening that causes the 2x space use - as this could be a
significant foot gun if uploading large files when we don't have 2x the
space available on each OSD.

Regards

Mark
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Mixing Ceph OSDs and hypervisor/compute nodes

2014-04-09 Thread Blair Bethwaite
Hi all,

We're building a new OpenStack zone very soon. Our compute hosts are spec'd
with direct attached disk for ephemeral instance storage and we have a
bunch of other storage nodes for Ceph serving Cinder volumes.

We're wondering about the feasibility of setting up the compute nodes as
OSDs and running a Ceph shared ephemeral storage pool across those nodes.
We understand this implies some more resource overheads, but we tend to
have plenty of CPU available on our compute nodes so don't anticipate a
major issue there.

Is anyone else doing this successfully, and is there anything else to consider?
Oh, and we're ICE customers, so would such a setup be supportable?

Cheers, Blair
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mixing Ceph OSDs and hypervisor/compute nodes

2014-04-09 Thread Haomai Wang
I think you need to bind the OSDs to specific cores and bind qemu-kvm to
the other cores. Memory size is another factor you need to take care of. If
your VMs' root disks live on local files, the IO contention may be hard to
manage.
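
A rough sketch of one way to do that pinning (core numbers and the libvirt
domain name are placeholders; cgroups or numad are equally valid routes):

# pin every ceph-osd process to cores 0-3
$ for pid in $(pidof ceph-osd); do sudo taskset -pc 0-3 "$pid"; done

# pin a guest's vcpus onto the remaining cores via libvirt
$ virsh vcpupin mydomain 0 4
$ virsh vcpupin mydomain 1 5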

On Thu, Apr 10, 2014 at 12:47 PM, Blair Bethwaite
 wrote:
> Hi all,
>
> We're building a new OpenStack zone very soon. Our compute hosts are spec'd
> with direct attached disk for ephemeral instance storage and we have a bunch
> of other storage nodes for Ceph serving Cinder volumes.
>
> We're wondering about the feasibility of setting up the compute nodes as
> OSDs and running a Ceph shared ephemeral storage pool across those nodes. We
> understand this implies some more resource overheads, but we tend to have
> plenty of CPU available on our compute nodes so don't anticipate a major
> issue there.
>
> Is anyone else doing this successfully, is there anything else to consider?
> Oh, and we're ICE customers, so would such a setup be supportable?
>
> Cheers, Blair
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Best Regards,

Wheat
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com