On Thu, Oct 18, 2018 at 3:35 PM Florent B wrote:
>
> I'm not familiar with gdb, what do I need to do? Install the "-gdb" version
> of the ceph-mds package? Then what?
> Thank you
>
Install ceph with debug info, install gdb, then run 'gdb attach '
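For example (assuming the debuginfo packages are installed and there is a single ceph-mds on the host):
  gdb -p $(pidof ceph-mds)
  # then, inside gdb, dump all thread backtraces:
  (gdb) thread apply all bt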
> On 18/10/2018 03:40, Yan, Zheng wrote:
> > On Thu, Oct 18, 201
Hi,
I copy some big files to radosgw with awscli, but I found that some copies
fail, like:
aws s3 --endpoint=XXX cp ./bigfile s3://mybucket/bigfile
upload failed: ./bigfile to s3://mybucket/bigfile An error occurred
(InternalError) when calling the CompleteMultipartUpload operation
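For reference, the failing copy can be rerun with --debug for more detail, and
the awscli multipart settings can be tuned (the values below are only examples,
not a recommendation):
  aws s3 --endpoint=XXX cp ./bigfile s3://mybucket/bigfile --debug
  aws configure set default.s3.multipart_threshold 64MB
  aws configure set default.s3.multipart_chunksize 64MB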
On 17/10/18 15:23, Paul Emmerich wrote:
[apropos building Mimic on Debian 9]
apt-get install -y g++ libc6-dbg libc6 -t testing
apt-get install -y git build-essential cmake
I wonder if you could avoid the "need a newer libc" issue by using
backported versions of cmake/g++?
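Something like this is what I had in mind (untested, and I haven't checked
whether new-enough cmake/g++ actually exist in stretch-backports):
  echo 'deb http://deb.debian.org/debian stretch-backports main' >> /etc/apt/sources.list.d/backports.list
  apt-get update
  apt-get install -y -t stretch-backports cmake g++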
Regards,
Matthew
Not all pools need the same number of PGs. When you get to this many pools
you want to start calculating how much data each pool will have. If one of
your pools will have 80% of your data in it, it should have 80% of your
PGs. The metadata pools for rgw likely won't need more than 8 or so PGs
each. If
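As a rough worked example (numbers made up): with a target of ~100 PGs per OSD,
100 OSDs, and 3x replication, you have roughly 100 * 100 / 3 ≈ 3333 PGs to hand
out; the pool holding 80% of the data would get about 2666 of them (rounded to a
power of two, so 2048 or 4096), and the small rgw metadata pools 8 each.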
What are your OSD node stats? CPU, RAM, quantity and size of OSD disks?
You might need to modify some bluestore settings to speed up the time it
takes to peer, or perhaps you're just underpowering the number of OSD
disks you're running and your servers and OSD daemons are going as
fast as
Hi Tom,
I used a slightly modified version of your script to generate a list to
compare against mine (echoing out the bucket name, id and actual_id), which has
returned substantially more indexes than mine, including a number that
show no indication of resharding having been run, or versioning
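For clarity, the gist of what I ran was roughly this (not your exact script; I'm
using the 'id' and 'marker' fields from bucket stats, which may not map exactly
onto your id/actual_id):
  for b in $(radosgw-admin bucket list | jq -r '.[]'); do
      radosgw-admin bucket stats --bucket="$b" | jq -r '[.bucket, .id, .marker] | @tsv'
  done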
I had the same problem (or a problem with the same symptoms).
In my case the problem was wrong ownership of the log file.
You might want to check whether you are having the same issue.
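Roughly what I looked at and fixed (paths assume the default /var/log/ceph layout):
  ls -ln /var/log/ceph/
  chown -R ceph:ceph /var/log/ceph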
Cheers, Massimo
On Mon, Oct 15, 2018 at 6:00 AM Zhenshi Zhou wrote:
> Hi,
>
> I added some OSDs into cluster(lum
Hi,
Ceph Version = 12.2.8
8TB spinner with 20G SSD partition
Perf dump shows the following:
"bluefs": {
"gift_bytes": 0,
"reclaim_bytes": 0,
"db_total_bytes": 21472731136,
"db_used_bytes": 3467640832,
"wal_total_bytes": 0,
"wal_used_bytes": 0,
On Thu, Oct 18, 2018 at 1:01 PM Matthew Vernon wrote:
>
> On 17/10/18 15:23, Paul Emmerich wrote:
>
> [apropos building Mimic on Debian 9]
>
> > apt-get install -y g++ libc6-dbg libc6 -t testing
> > apt-get install -y git build-essential cmake
>
> I wonder if you could avoid the "need a newer
Hi,
today we had an issue with our 6-node Ceph cluster.
We had to shut down one node (Ceph-03) to replace a disk (because we did not
know the slot). We set the noout flag and did a graceful shutdown. All was OK.
After the disk was replaced, the node came up and our VMs had big I/O
latency
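For reference, the sequence was essentially this (a sketch; the systemd target
name is an assumption about our deployment):
  ceph osd set noout
  systemctl stop ceph.target        # on Ceph-03, before swapping the disk
  # ... replace the disk, boot the node, wait for all OSDs to come back up ...
  ceph osd unset noout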
re-adding the list.
I'm glad to hear you got things back to a working state. One thing you
might want to check is the hit_set_history in the pg data. If the missing
hit sets are no longer in the history, then it is probably safe to go back
to the normal builds. That is, until you have to mark anoth
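For example, something like this (field path from memory, on one of the cache
pool PGs):
  ceph pg <pgid> query | jq '.info.hit_set_history'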
After we upgraded from Jewel (10.2.10) to Luminous (12.2.5) we started seeing a
problem where the new ceph-mgr would sometimes hang indefinitely when doing
commands like 'ceph pg dump' on our largest cluster (~1,300 OSDs). The rest of
our clusters (10+) aren't seeing the same issue, but they ar
On Wed, Oct 17, 2018 at 1:14 AM Yang Yang wrote:
>
> Hi,
> A few weeks ago I found that the radosgw index has become inconsistent with reality.
> Some objects I cannot list, but I can get them by key. Please see the details
> below:
>
> BACKGROUND:
> Ceph version 12.2.4 (52085d5249a80c5f5121a76d628
I left some of the 'ceph pg dump' commands running and twice they returned
results after 30 minutes, and three times it took 45 minutes. Is there
something that runs every 15 minutes that would let these commands finish?
Bryan
From: Bryan Stillwell
Date: Thursday, October 18, 2018 at 11:16 AM
15 minutes suggests the ms tcp read timeout could be related.
Try shortening that and see if it works around the issue...
(We use ms tcp read timeout = 60 over here -- the 900s default seems
really long to keep idle connections open)
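i.e. in ceph.conf, something like:
  [global]
  ms tcp read timeout = 60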
-- dan
On Thu, Oct 18, 2018 at 9:39 PM Bryan Stillwell wr
Thanks Dan!
It does look like we're hitting the ms_tcp_read_timeout. I changed it to 79
seconds and I've had a couple dumps that were hung for ~2m40s
(2*ms_tcp_read_timeout) and one that was hung for 8 minutes
(6*ms_tcp_read_timeout).
I agree that 15 minutes (900s) is a long timeout. Anyone
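In case anyone wants to try the same thing, I believe the runtime change can be
injected with something like this (whether the mon or the client end is the one
that actually matters here is a guess on my part):
  ceph tell mon.* injectargs '--ms_tcp_read_timeout=79'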
On Thu, Oct 18, 2018 at 6:17 PM Bryan Stillwell wrote:
>
> After we upgraded from Jewel (10.2.10) to Luminous (12.2.5) we started seeing
> a problem where the new ceph-mgr would sometimes hang indefinitely when doing
> commands like 'ceph pg dump' on our largest cluster (~1,300 OSDs). The rest
On Thu, Oct 18, 2018 at 1:35 PM Bryan Stillwell wrote:
>
> Thanks Dan!
>
>
>
> It does look like we're hitting the ms_tcp_read_timeout. I changed it to 79
> seconds and I've had a couple dumps that were hung for ~2m40s
> (2*ms_tcp_read_timeout) and one that was hung for 8 minutes
> (6*ms_tcp_r
Thanks Greg,
This did get resolved, though I'm not 100% certain why!
For one of the suspect shards which caused a crash on backfill, I
attempted to delete the associated object via S3 late last week. I then
examined the filestore OSDs and the file shards were still present...
maybe for an hour followi
I could see something related to that bug might be happening, but we're not
seeing the "clock skew" or "signal: Hangup" messages in our logs.
One reason that this cluster might be running into this problem is that we
appear to have a script that is gathering stats for collectd which is running
On Thu, Oct 18, 2018 at 10:31 PM Bryan Stillwell wrote:
>
> I could see something related to that bug might be happening, but we're not
> seeing the "clock skew" or "signal: Hangup" messages in our logs.
>
>
>
> One reason that this cluster might be running into this problem is that we
> appear
On 10/18/2018 7:49 PM, Nick Fisk wrote:
Hi,
Ceph Version = 12.2.8
8TB spinner with 20G SSD partition
Perf dump shows the following:
"bluefs": {
"gift_bytes": 0,
"reclaim_bytes": 0,
"db_total_bytes": 21472731136,
"db_used_bytes": 3467640832,
"wal_t
On Mon, Oct 15, 2018 at 9:54 PM Dietmar Rieder
wrote:
>
> On 10/15/18 1:17 PM, jes...@krogh.cc wrote:
> >> On 10/15/18 12:41 PM, Dietmar Rieder wrote:
> >>> No big difference here.
> >>> all CentOS 7.5 official kernel 3.10.0-862.11.6.el7.x86_64
> >>
> >> ...forgot to mention: all is luminous ceph-
Hmm, it's useful to rebuild the index by rewriting an object.
But first, I need to know all the keys of the objects. If I want to know all
keys, I need list_objects ...
Maybe I can make a union set of the instances, then copy all of them onto
themselves.
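The "rewrite an object" step could be as simple as a server-side copy onto
itself with awscli (a sketch; the bucket/key names are placeholders, and whether
this actually repairs the index entry is exactly what I need to verify):
  aws s3 cp s3://mybucket/somekey s3://mybucket/somekey --metadata-directive REPLACE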
Anyway, I want to find out more about why it happened.
Hi David,
Thanks for the explanation!
I'll look into how much data each pool will use.
Thanks!
On Thu, Oct 18, 2018 at 9:26 PM David Turner wrote:
> Not all pools need the same amount of PGs. When you get to so many pools
> you want to start calculating how much data each pool will have. If 1 of
> yo
Hi Massimo,
I checked the ownership of the files as well as the log directory.
The files are owned by ceph with permission 644, and the
log directory is owned by ceph with permissions 'drwxrws--T'.
I suppose the ownership and file permissions should be enough for
ceph to write logs.
T
After the RGW upgrade from Jewel to Luminous, one S3 user started to receive
errors from his Postgres wal-e solution. The error is like this: "Server Side
Encryption with KMS managed key requires HTTP header
x-amz-server-side-encryption : aws:kms".
This can be resolved via a simple patch of wal-e/wal-g. I
I want to ask whether you had a similar experience upgrading RGW from Jewel to
Luminous. After upgrading the monitors and OSDs, I started two new Luminous
RGWs and put them into the LB together with the Jewel ones. And then interesting
things started to happen. Some of our jobs started to fail with "
fatal error: An err
Hi!
I use ceph 13.2.1 (5533ecdc0fda920179d7ad84e0aa65a127b20d77) mimic (stable),
and found the following:
when expanding the whole cluster, I updated pg_num and it all succeeded, but the
status is as below:
  cluster:
    id:     41ef913c-2351-4794-b9ac-dd340e3fbc75
    health: HEALTH_WARN
            3 pools have pg_num > pgp_num
Then I
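My understanding is that the warning should clear once pgp_num is raised to
match, e.g.:
  ceph osd pool set <pool-name> pgp_num <new-pg-num>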
Yes, that's understandable, but the question was about the "transition period"
when at some point we had part of the RGWs upgraded and some of them were still
on Jewel. At that time we had a lot of complaints from S3 users, who randomly
couldn't access their buckets. We did several upgrades in the last years and it
w
Hi, yes, we did it two days ago too. There is a PR for this, but it's not
committed yet.
Thanks, anyway!
Arvydas
On Fri, Oct 19, 2018 at 7:15 AM Konstantin Shalygin wrote:
> After RGW upgrade from Jewel to Luminous, one S3 user started to receive
> errors from his postgre wal-e solution. Error is l
On 10/19/18 1:37 PM, Arvydas Opulskis wrote:
Yes, that's understandable, but the question was about the "transition period"
when at some point we had part of the RGWs upgraded and some of them were
still on Jewel. At that time we had a lot of complaints from S3 users,
who couldn't access their buckets randoml
Hi,
we have the same question when trying to understand the output of bucket stats.
Maybe you have found an explanation somewhere else?
Thanks,
Arvydas
On Mon, Aug 6, 2018 at 10:28 AM Tomasz Płaza
wrote:
> Hi all,
>
> I have a bucket with a very big num_objects in rgw.none:
>
> {
> "bucket": "dyna",
>
Yes, we know it now :) But it was a surprise at the moment we started the RGW
upgrade, because it was not mentioned in the release notes, or I missed it
somehow.
On Fri, Oct 19, 2018 at 9:41 AM Konstantin Shalygin wrote:
> On 10/19/18 1:37 PM, Arvydas Opulskis wrote:
> > Yes, that's understandable, but qu