[pve-devel] [PATCH manager] vzdump: Fix typo in UPID error message

2020-09-29 Thread Dominic Jäger
Signed-off-by: Dominic Jäger 
---
 PVE/VZDump.pm | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/PVE/VZDump.pm b/PVE/VZDump.pm
index 6e0d3dbf..aea7389b 100644
--- a/PVE/VZDump.pm
+++ b/PVE/VZDump.pm
@@ -522,7 +522,7 @@ sub getlock {
 
 my $maxwait = $self->{opts}->{lockwait} || $self->{lockwait};
 
-die "missimg UPID" if !$upid; # should not happen
+die "missing UPID" if !$upid; # should not happen
 
 if (!open (SERVER_FLCK, ">>$lockfile")) {
debugmsg ('err', "can't open lock on file '$lockfile' - $!", undef, 1);
-- 
2.20.1


___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


[pve-devel] [PATCH v5 manager 1/2] Allow prune-backups as an alternative to maxfiles

2020-09-29 Thread Fabian Ebner
and make the two options mutually exclusive as long
as they are specified on the same level (e.g. both
from the storage configuration). Otherwise prefer
option > storage config > default (only maxfiles has a default currently).
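
As a rough illustration only (the helper name and structure below are
hypothetical, not part of the patch), the intended resolution could look
like this:

    sub resolve_retention {
        my ($opts, $scfg, $defaults) = @_;

        # the two options are mutually exclusive on the same level
        for my $level ($opts, $scfg) {
            die "cannot have 'maxfiles' and 'prune-backups' configured at the same time\n"
                if defined($level->{maxfiles}) && defined($level->{'prune-backups'});
        }

        # the first level that sets either option wins: option > storage config
        for my $level ($opts, $scfg) {
            return { 'prune-backups' => $level->{'prune-backups'} }
                if defined($level->{'prune-backups'});
            return { maxfiles => $level->{maxfiles} }
                if defined($level->{maxfiles});
        }

        return { maxfiles => $defaults->{maxfiles} }; # only maxfiles has a default
    }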

Defines the backup limit for prune-backups as the sum of all
keep-values.

There is no perfect way to determine whether a
new backup would trigger a removal with prune later:
1. we would need a way to include the not yet existing backup
   in a 'prune --dry-run' check.
2. even if we had that check: if it's executed right before
   a full hour and the actual backup happens after the full
   hour, the information from the check is no longer correct.

So in some cases we allow backup jobs with remove=0 that
will lead to a removal when the next prune is executed.
Still, the job with remove=0 does not execute a prune, so:
1. There is a well-defined limit.
2. A job with remove=0 never removes an old backup.
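
To illustrate the resulting limit check (sketch only, simplified from the
patch; $count stands for a hypothetical count of already existing backups
for the guest):

    # the backup limit is the sum of all keep-* values
    my $backup_limit = 0;
    $backup_limit += $_ for values %{$prune_options};

    # a job with remove=0 is refused once that limit is reached
    die "backup limit reached ($count >= $backup_limit)\n"
        if $backup_limit && !$opts->{remove} && $count >= $backup_limit;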

Signed-off-by: Fabian Ebner 
---

Changes from v4:
* add newline to 'cannot have ... at the same time' error message
* fix typo and correctly assign to $opts->{'prune-backups'} instead
  of $opts->{'prune_backups'}. Because of this, the mapping of
  maxfiles to keep-last had no effect in v4

 PVE/API2/VZDump.pm |  4 +--
 PVE/VZDump.pm  | 88 --
 2 files changed, 64 insertions(+), 28 deletions(-)

diff --git a/PVE/API2/VZDump.pm b/PVE/API2/VZDump.pm
index 2eda973e..19fa1e3b 100644
--- a/PVE/API2/VZDump.pm
+++ b/PVE/API2/VZDump.pm
@@ -25,7 +25,7 @@ __PACKAGE__->register_method ({
 method => 'POST',
 description => "Create backup.",
 permissions => {
-   description => "The user needs 'VM.Backup' permissions on any VM, and 'Datastore.AllocateSpace' on the backup storage. The 'maxfiles', 'tmpdir', 'dumpdir', 'script', 'bwlimit' and 'ionice' parameters are restricted to the 'root\@pam' user.",
+   description => "The user needs 'VM.Backup' permissions on any VM, and 'Datastore.AllocateSpace' on the backup storage. The 'maxfiles', 'prune-backups', 'tmpdir', 'dumpdir', 'script', 'bwlimit' and 'ionice' parameters are restricted to the 'root\@pam' user.",
user => 'all',
 },
 protected => 1,
@@ -58,7 +58,7 @@ __PACKAGE__->register_method ({
if $param->{stdout};
}
 
-   foreach my $key (qw(maxfiles tmpdir dumpdir script bwlimit ionice)) {
+   foreach my $key (qw(maxfiles prune-backups tmpdir dumpdir script bwlimit ionice)) {
raise_param_exc({ $key => "Only root may set this option."})
if defined($param->{$key}) && ($user ne 'root@pam');
}
diff --git a/PVE/VZDump.pm b/PVE/VZDump.pm
index 6e0d3dbf..1fe4c4ee 100644
--- a/PVE/VZDump.pm
+++ b/PVE/VZDump.pm
@@ -89,6 +89,12 @@ sub storage_info {
maxfiles => $scfg->{maxfiles},
 };
 
+$info->{'prune-backups'} = PVE::JSONSchema::parse_property_string('prune-backups', $scfg->{'prune-backups'})
+   if defined($scfg->{'prune-backups'});
+
+die "cannot have 'maxfiles' and 'prune-backups' configured at the same time\n"
+   if defined($info->{'prune-backups'}) && defined($info->{maxfiles});
+
 if ($type eq 'pbs') {
$info->{pbs} = 1;
 } else {
@@ -459,12 +465,18 @@ sub new {
 
 if ($opts->{storage}) {
my $info = eval { storage_info ($opts->{storage}) };
-   $errors .= "could not get storage information for '$opts->{storage}': 
$@"
-   if ($@);
-   $opts->{dumpdir} = $info->{dumpdir};
-   $opts->{scfg} = $info->{scfg};
-   $opts->{pbs} = $info->{pbs};
-   $opts->{maxfiles} //= $info->{maxfiles};
+   if (my $err = $@) {
+   $errors .= "could not get storage information for 
'$opts->{storage}': $err";
+   } else {
+   $opts->{dumpdir} = $info->{dumpdir};
+   $opts->{scfg} = $info->{scfg};
+   $opts->{pbs} = $info->{pbs};
+
+   if (!defined($opts->{'prune-backups'}) && !defined($opts->{maxfiles})) {
+   $opts->{'prune-backups'} = $info->{'prune-backups'};
+   $opts->{maxfiles} = $info->{maxfiles};
+   }
+   }
 } elsif ($opts->{dumpdir}) {
$errors .= "dumpdir '$opts->{dumpdir}' does not exist"
if ! -d $opts->{dumpdir};
@@ -472,7 +484,9 @@ sub new {
die "internal error";
 }
 
-$opts->{maxfiles} //= $defaults->{maxfiles};
+if (!defined($opts->{'prune-backups'}) && !defined($opts->{maxfiles})) {
+   $opts->{maxfiles} = $defaults->{maxfiles};
+}
 
 if ($opts->{tmpdir} && ! -d $opts->{tmpdir}) {
$errors .= "\n" if $errors;
@@ -653,6 +667,7 @@ sub exec_backup_task {
 
 my $opts = $self->{opts};
 
+my $cfg = PVE::Storage::config();
 my $vmid = $task->{vmid};
 my $plugin = $task->{plugin};
 my $vmtype = $plugin->type();
@@ -706,8 +721,18 @@ sub exec_backup_task {
my $basename = $bkname . strftime("-%Y_%m_%d-%H_%M_%S", localtime($task->{backup_time}));

[pve-devel] [PATCH-SERIES v5] fix #2649: introduce prune-backups property for storages supporting backups

2020-09-29 Thread Fabian Ebner
Make use of the new 'prune-backups' storage property with vzdump.

Changes from v4:
* drop already applied patches
* rebase on current master
* fix typo
* add newline to error message

Fabian Ebner (2):
  Allow prune-backups as an alternative to maxfiles
  Always use prune-backups instead of maxfiles internally

 PVE/API2/VZDump.pm |  4 +--
 PVE/VZDump.pm  | 72 +++---
 2 files changed, 51 insertions(+), 25 deletions(-)

-- 
2.20.1



___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel



[pve-devel] [PATCH v5 manager 2/2] Always use prune-backups instead of maxfiles internally

2020-09-29 Thread Fabian Ebner
For the use case with '--dumpdir', it's not possible to call prune_backups
directly, so a little bit of special handling is required there.
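
In short (a simplified sketch of the hunks below): with '--dumpdir' there is
no storage ID, so instead of calling PVE::Storage::prune_backups(), the local
file list gets marked and the marked archives are removed directly:

    my $bklist = get_backup_file_list($opts->{dumpdir}, $bkname, $task->{target});
    PVE::Storage::prune_mark_backup_group($bklist, $prune_options);

    foreach my $entry (@{$bklist}) {
        next if $entry->{mark} ne 'remove';
        debugmsg('info', "delete old backup '$entry->{path}'", $logfd);
        PVE::Storage::archive_remove($entry->{path});
    }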

Signed-off-by: Fabian Ebner 
---
 PVE/VZDump.pm | 42 --
 1 file changed, 16 insertions(+), 26 deletions(-)

diff --git a/PVE/VZDump.pm b/PVE/VZDump.pm
index 1fe4c4ee..c8f37d04 100644
--- a/PVE/VZDump.pm
+++ b/PVE/VZDump.pm
@@ -484,8 +484,10 @@ sub new {
die "internal error";
 }
 
-if (!defined($opts->{'prune-backups'}) && !defined($opts->{maxfiles})) {
-   $opts->{maxfiles} = $defaults->{maxfiles};
+if (!defined($opts->{'prune-backups'})) {
+   $opts->{maxfiles} //= $defaults->{maxfiles};
+   $opts->{'prune-backups'} = { 'keep-last' => $opts->{maxfiles} };
+   delete $opts->{maxfiles};
 }
 
 if ($opts->{tmpdir} && ! -d $opts->{tmpdir}) {
@@ -720,16 +722,11 @@ sub exec_backup_task {
my $bkname = "vzdump-$vmtype-$vmid";
my $basename = $bkname . strftime("-%Y_%m_%d-%H_%M_%S", localtime($task->{backup_time}));
 
-   my $maxfiles = $opts->{maxfiles};
my $prune_options = $opts->{'prune-backups'};
 
my $backup_limit = 0;
-   if (defined($maxfiles)) {
-   $backup_limit = $maxfiles;
-   } elsif (defined($prune_options)) {
-   foreach my $keep (values %{$prune_options}) {
-   $backup_limit += $keep;
-   }
+   foreach my $keep (values %{$prune_options}) {
+   $backup_limit += $keep;
}
 
if ($backup_limit && !$opts->{remove}) {
@@ -952,25 +949,18 @@ sub exec_backup_task {
 
# purge older backup
if ($opts->{remove}) {
-   if ($maxfiles) {
+   if (!defined($opts->{storage})) {
+   my $bklist = get_backup_file_list($opts->{dumpdir}, $bkname, $task->{target});
+   PVE::Storage::prune_mark_backup_group($bklist, $prune_options);
 
-   if ($self->{opts}->{pbs}) {
-   my $args = [$pbs_group_name, '--quiet', '1', '--keep-last', $maxfiles];
-   my $logfunc = sub { my $line = shift; debugmsg ('info', $line, $logfd); };
-   PVE::Storage::PBSPlugin::run_raw_client_cmd(
-   $opts->{scfg}, $opts->{storage}, 'prune', $args, logfunc => $logfunc);
-   } else {
-   my $bklist = get_backup_file_list($opts->{dumpdir}, $bkname, $task->{target});
-   $bklist = [ sort { $b->{ctime} <=> $a->{ctime} } @$bklist ];
-
-   while (scalar (@$bklist) >= $maxfiles) {
-   my $d = pop @$bklist;
-   my $archive_path = $d->{path};
-   debugmsg ('info', "delete old backup '$archive_path'", 
$logfd);
-   PVE::Storage::archive_remove($archive_path);
-   }
+   foreach my $prune_entry (@{$bklist}) {
+   next if $prune_entry->{mark} ne 'remove';
+
+   my $archive_path = $prune_entry->{path};
+   debugmsg ('info', "delete old backup '$archive_path'", 
$logfd);
+   PVE::Storage::archive_remove($archive_path);
}
-   } elsif (defined($prune_options)) {
+   } else {
my $logfunc = sub { debugmsg($_[0], $_[1], $logfd) };
PVE::Storage::prune_backups($cfg, $opts->{storage}, $prune_options, $vmid, $vmtype, 0, $logfunc);
}
-- 
2.20.1



___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel



Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

2020-09-29 Thread Fabian Grünbichler
On September 28, 2020 5:59 pm, Alexandre DERUMIER wrote:
> Here a new test http://odisoweb1.odiso.net/test5
> 
> This has occured at corosync start
> 
> 
> node1:
> -
> start corosync : 17:30:19
> 
> 
> node2: /etc/pve locked
> --
> Current time : 17:30:24
> 
> 
> I have done backtrace of all nodes at same time with parallel ssh at 17:35:22 
> 
> and a coredump of all nodes at same time with parallel ssh at 17:42:26
> 
> 
> (Note that this time, /etc/pve was still locked after backtrace/coredump)

okay, so this time two more log lines got printed on the (again) problem 
causing node #13, but it still stops logging at a point where this makes 
no sense.

I rebuilt the packages:

f318f12e5983cb09d186c2ee37743203f599d103b6abb2d00c78d312b4f12df942d8ed1ff5de6e6c194785d0a81eb881e80f7bbfd4865ca1a5a509acd40f64aa
  pve-cluster_6.1-8_amd64.deb
b220ee95303e22704793412e83ac5191ba0e53c2f41d85358a247c248d2a6856e5b791b1d12c36007a297056388224acf4e5a1250ef1dd019aee97e8ac4bcac7
  pve-cluster-dbgsym_6.1-8_amd64.deb

with a change of how the logging is set up (I now suspect that some 
messages might get dropped if the logging throughput is high enough), 
let's hope this gets us the information we need. please repeat the test5 
again with these packages.

is there anything special about node 13? network topology, slower 
hardware, ... ?


___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel



[pve-devel] applied: [PATCH manager] vzdump: Fix typo in UPID error message

2020-09-29 Thread Thomas Lamprecht
On 29.09.20 10:07, Dominic Jäger wrote:
> Signed-off-by: Dominic Jäger 
> ---
>  PVE/VZDump.pm | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
>

applied, thanks!



___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

2020-09-29 Thread Alexandre DERUMIER
>>with a change of how the logging is set up (I now suspect that some 
>>messages might get dropped if the logging throughput is high enough), 
>>let's hope this gets us the information we need. please repeat the test5 
>>again with these packages. 

I'll test this afternoon

>>is there anything special about node 13? network topology, slower
>>hardware, ... ?

no nothing special, all nodes have exactly same hardware/cpu (24cores/48threads 
3ghz)/memory/disk.

this node is around 10% cpu usage, load is around 5.

- Mail original -
De: "Fabian Grünbichler" 
À: "Proxmox VE development discussion" 
Envoyé: Mardi 29 Septembre 2020 10:51:32
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

On September 28, 2020 5:59 pm, Alexandre DERUMIER wrote: 
> Here a new test http://odisoweb1.odiso.net/test5 
> 
> This has occured at corosync start 
> 
> 
> node1: 
> - 
> start corosync : 17:30:19 
> 
> 
> node2: /etc/pve locked 
> -- 
> Current time : 17:30:24 
> 
> 
> I have done backtrace of all nodes at same time with parallel ssh at 17:35:22 
> 
> and a coredump of all nodes at same time with parallel ssh at 17:42:26 
> 
> 
> (Note that this time, /etc/pve was still locked after backtrace/coredump) 

okay, so this time two more log lines got printed on the (again) problem 
causing node #13, but it still stops logging at a point where this makes 
no sense. 

I rebuilt the packages: 

f318f12e5983cb09d186c2ee37743203f599d103b6abb2d00c78d312b4f12df942d8ed1ff5de6e6c194785d0a81eb881e80f7bbfd4865ca1a5a509acd40f64aa
 pve-cluster_6.1-8_amd64.deb 
b220ee95303e22704793412e83ac5191ba0e53c2f41d85358a247c248d2a6856e5b791b1d12c36007a297056388224acf4e5a1250ef1dd019aee97e8ac4bcac7
 pve-cluster-dbgsym_6.1-8_amd64.deb 

with a change of how the logging is set up (I now suspect that some 
messages might get dropped if the logging throughput is high enough), 
let's hope this gets us the information we need. please repeat the test5 
again with these packages. 

is there anything special about node 13? network topology, slower 
hardware, ... ? 


___ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 


___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

2020-09-29 Thread Alexandre DERUMIER
here a new test:

http://odisoweb1.odiso.net/test6/

node1
-
start corosync : 12:08:33


node2 (/etc/pve lock)
-
Current time : 12:08:39


node1 (stop corosync : unlock /etc/pve)
-
12:28:11 : systemctl stop corosync


backtraces: 12:26:30


coredump : 12:27:21


- Mail original -
De: "aderumier" 
À: "Proxmox VE development discussion" 
Envoyé: Mardi 29 Septembre 2020 11:37:41
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

>>with a change of how the logging is set up (I now suspect that some 
>>messages might get dropped if the logging throughput is high enough), 
>>let's hope this gets us the information we need. please repeat the test5 
>>again with these packages. 

I'll test this afternoon 

>>is there anything special about node 13? network topology, slower 
>>hardware, ... ? 

no nothing special, all nodes have exactly same hardware/cpu (24cores/48threads 
3ghz)/memory/disk. 

this node is around 10% cpu usage, load is around 5. 

- Mail original - 
De: "Fabian Grünbichler"  
À: "Proxmox VE development discussion"  
Envoyé: Mardi 29 Septembre 2020 10:51:32 
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 

On September 28, 2020 5:59 pm, Alexandre DERUMIER wrote: 
> Here a new test http://odisoweb1.odiso.net/test5 
> 
> This has occured at corosync start 
> 
> 
> node1: 
> - 
> start corosync : 17:30:19 
> 
> 
> node2: /etc/pve locked 
> -- 
> Current time : 17:30:24 
> 
> 
> I have done backtrace of all nodes at same time with parallel ssh at 17:35:22 
> 
> and a coredump of all nodes at same time with parallel ssh at 17:42:26 
> 
> 
> (Note that this time, /etc/pve was still locked after backtrace/coredump) 

okay, so this time two more log lines got printed on the (again) problem 
causing node #13, but it still stops logging at a point where this makes 
no sense. 

I rebuilt the packages: 

f318f12e5983cb09d186c2ee37743203f599d103b6abb2d00c78d312b4f12df942d8ed1ff5de6e6c194785d0a81eb881e80f7bbfd4865ca1a5a509acd40f64aa
 pve-cluster_6.1-8_amd64.deb 
b220ee95303e22704793412e83ac5191ba0e53c2f41d85358a247c248d2a6856e5b791b1d12c36007a297056388224acf4e5a1250ef1dd019aee97e8ac4bcac7
 pve-cluster-dbgsym_6.1-8_amd64.deb 

with a change of how the logging is set up (I now suspect that some 
messages might get dropped if the logging throughput is high enough), 
let's hope this gets us the information we need. please repeat the test5 
again with these packages. 

is there anything special about node 13? network topology, slower 
hardware, ... ? 


___ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 


___ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 


___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

2020-09-29 Thread Alexandre DERUMIER
>>
>>node1 (stop corosync : unlock /etc/pve)
>>-
>>12:28:11 : systemctl stop corosync

sorry, this was wrong, I need to start corosync after the stop to get it working
again.
I'll reupload these logs


- Mail original -
De: "aderumier" 
À: "Proxmox VE development discussion" 
Envoyé: Mardi 29 Septembre 2020 12:52:44
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

here a new test: 

http://odisoweb1.odiso.net/test6/ 

node1 
- 
start corosync : 12:08:33 


node2 (/etc/pve lock) 
- 
Current time : 12:08:39 


node1 (stop corosync : unlock /etc/pve) 
- 
12:28:11 : systemctl stop corosync 


backtraces: 12:26:30 


coredump : 12:27:21 


- Mail original - 
De: "aderumier"  
À: "Proxmox VE development discussion"  
Envoyé: Mardi 29 Septembre 2020 11:37:41 
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 

>>with a change of how the logging is set up (I now suspect that some 
>>messages might get dropped if the logging throughput is high enough), 
>>let's hope this gets us the information we need. please repeat the test5 
>>again with these packages. 

I'll test this afternoon 

>>is there anything special about node 13? network topology, slower 
>>hardware, ... ? 

no nothing special, all nodes have exactly same hardware/cpu (24cores/48threads 
3ghz)/memory/disk. 

this node is around 10% cpu usage, load is around 5. 

- Mail original - 
De: "Fabian Grünbichler"  
À: "Proxmox VE development discussion"  
Envoyé: Mardi 29 Septembre 2020 10:51:32 
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 

On September 28, 2020 5:59 pm, Alexandre DERUMIER wrote: 
> Here a new test http://odisoweb1.odiso.net/test5 
> 
> This has occured at corosync start 
> 
> 
> node1: 
> - 
> start corosync : 17:30:19 
> 
> 
> node2: /etc/pve locked 
> -- 
> Current time : 17:30:24 
> 
> 
> I have done backtrace of all nodes at same time with parallel ssh at 17:35:22 
> 
> and a coredump of all nodes at same time with parallel ssh at 17:42:26 
> 
> 
> (Note that this time, /etc/pve was still locked after backtrace/coredump) 

okay, so this time two more log lines got printed on the (again) problem 
causing node #13, but it still stops logging at a point where this makes 
no sense. 

I rebuilt the packages: 

f318f12e5983cb09d186c2ee37743203f599d103b6abb2d00c78d312b4f12df942d8ed1ff5de6e6c194785d0a81eb881e80f7bbfd4865ca1a5a509acd40f64aa
 pve-cluster_6.1-8_amd64.deb 
b220ee95303e22704793412e83ac5191ba0e53c2f41d85358a247c248d2a6856e5b791b1d12c36007a297056388224acf4e5a1250ef1dd019aee97e8ac4bcac7
 pve-cluster-dbgsym_6.1-8_amd64.deb 

with a change of how the logging is set up (I now suspect that some 
messages might get dropped if the logging throughput is high enough), 
let's hope this gets us the information we need. please repeat the test5 
again with these packages. 

is there anything special about node 13? network topology, slower 
hardware, ... ? 


___ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 


___ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 


___ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 


___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

2020-09-29 Thread Alexandre DERUMIER
I have reuploaded the logs

node1
-
start corosync : 12:08:33   (corosync.log)


node2 (/etc/pve lock)
-
Current time : 12:08:39


node1 (stop corosync : ---> not unlocked)   (corosync-stop.log)
-
12:28:11 : systemctl stop corosync

node2 (start corosync: ---> /etc/pve unlocked)   (corosync-start.log)


13:41:16 : systemctl start corosync


- Mail original -
De: "aderumier" 
À: "Proxmox VE development discussion" 
Envoyé: Mardi 29 Septembre 2020 13:43:08
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

>> 
>>node1 (stop corosync : unlock /etc/pve) 
>>- 
>>12:28:11 : systemctl stop corosync 

sorry, this was wrong, I need to start corosync after the stop to get it working 
again. 
I'll reupload these logs 


- Mail original - 
De: "aderumier"  
À: "Proxmox VE development discussion"  
Envoyé: Mardi 29 Septembre 2020 12:52:44 
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 

here a new test: 

http://odisoweb1.odiso.net/test6/ 

node1 
- 
start corosync : 12:08:33 


node2 (/etc/pve lock) 
- 
Current time : 12:08:39 


node1 (stop corosync : unlock /etc/pve) 
- 
12:28:11 : systemctl stop corosync 


backtraces: 12:26:30 


coredump : 12:27:21 


- Mail original - 
De: "aderumier"  
À: "Proxmox VE development discussion"  
Envoyé: Mardi 29 Septembre 2020 11:37:41 
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 

>>with a change of how the logging is set up (I now suspect that some 
>>messages might get dropped if the logging throughput is high enough), 
>>let's hope this gets us the information we need. please repeat the test5 
>>again with these packages. 

I'll test this afternoon 

>>is there anything special about node 13? network topology, slower 
>>hardware, ... ? 

no nothing special, all nodes have exactly same hardware/cpu (24cores/48threads 
3ghz)/memory/disk. 

this node is around 10% cpu usage, load is around 5. 

- Mail original - 
De: "Fabian Grünbichler"  
À: "Proxmox VE development discussion"  
Envoyé: Mardi 29 Septembre 2020 10:51:32 
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 

On September 28, 2020 5:59 pm, Alexandre DERUMIER wrote: 
> Here a new test http://odisoweb1.odiso.net/test5 
> 
> This has occured at corosync start 
> 
> 
> node1: 
> - 
> start corosync : 17:30:19 
> 
> 
> node2: /etc/pve locked 
> -- 
> Current time : 17:30:24 
> 
> 
> I have done backtrace of all nodes at same time with parallel ssh at 17:35:22 
> 
> and a coredump of all nodes at same time with parallel ssh at 17:42:26 
> 
> 
> (Note that this time, /etc/pve was still locked after backtrace/coredump) 

okay, so this time two more log lines got printed on the (again) problem 
causing node #13, but it still stops logging at a point where this makes 
no sense. 

I rebuilt the packages: 

f318f12e5983cb09d186c2ee37743203f599d103b6abb2d00c78d312b4f12df942d8ed1ff5de6e6c194785d0a81eb881e80f7bbfd4865ca1a5a509acd40f64aa
 pve-cluster_6.1-8_amd64.deb 
b220ee95303e22704793412e83ac5191ba0e53c2f41d85358a247c248d2a6856e5b791b1d12c36007a297056388224acf4e5a1250ef1dd019aee97e8ac4bcac7
 pve-cluster-dbgsym_6.1-8_amd64.deb 

with a change of how the logging is set up (I now suspect that some 
messages might get dropped if the logging throughput is high enough), 
let's hope this gets us the information we need. please repeat the test5 
again with these packages. 

is there anything special about node 13? network topology, slower 
hardware, ... ? 


___ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 


___ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 


___ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 


___ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 


___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


[pve-devel] applied: [RFC zfsonlinux 1/1] Add systemd-unit for importing specific pools

2020-09-29 Thread Thomas Lamprecht
On 16.09.20 14:14, Stoiko Ivanov wrote:
> This patch addresses the problems some users experience when some zpools are
> created/imported with cachefile (which then causes other pools not to get
> imported during boot) - when our tooling creates a pool we explicitly
> instantiate the service with the pool's name, ensuring that it will get
> imported by scanning.
> 
> Suggested-by: Fabian Grünbichler 
> Signed-off-by: Stoiko Ivanov 
> ---
>  ...md-unit-for-importing-specific-pools.patch | 75 +++
>  debian/patches/series |  1 +
>  debian/zfsutils-linux.install |  1 +
>  3 files changed, 77 insertions(+)
>  create mode 100644 
> debian/patches/0008-Add-systemd-unit-for-importing-specific-pools.patch
> 
>

applied, thanks! Dropped the "Require=systemd-udev-settle.service", though.
But, it's still ordered after.



___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

2020-09-29 Thread Fabian Grünbichler
huge thanks for all the work on this btw!

I think I've found a likely culprit (a missing lock around a 
non-thread-safe corosync library call) based on the last logs (which 
were now finally complete!).

rebuilt packages with a proof-of-concept-fix:

23b03a48d3aa9c14e86fe8cf9bbb7b00bd8fe9483084b9e0fd75fd67f29f10bec00e317e2a66758713050f36c165d72f107ee3449f9efeb842d3a57c25f8bca7
  pve-cluster_6.1-8_amd64.deb
9e1addd676513b176f5afb67cc6d85630e7da9bbbf63562421b4fd2a3916b3b2af922df555059b99f8b0b9e64171101a1c9973846e25f9144ded9d487450baef
  pve-cluster-dbgsym_6.1-8_amd64.deb

I removed some logging statements which are no longer needed, so output 
is a bit less verbose again. if you are not able to trigger the issue 
with this package, feel free to remove the -debug and let it run for a 
little longer without the massive logs.

if feedback from your end is positive, I'll whip up a proper patch 
tomorrow or on Thursday.


___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel



Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

2020-09-29 Thread Alexandre DERUMIER

>>huge thanks for all the work on this btw! 

huge thanks to you ! ;)


>>I think I've found a likely culprit (a missing lock around a 
>>non-thread-safe corosync library call) based on the last logs (which 
>>were now finally complete!).

YES :)  


>>if feedback from your end is positive, I'll whip up a proper patch 
>>tomorrow or on Thursday. 

I'm going to launch a new test right now !


- Mail original -
De: "Fabian Grünbichler" 
À: "Proxmox VE development discussion" 
Envoyé: Mardi 29 Septembre 2020 15:28:19
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

huge thanks for all the work on this btw! 

I think I've found a likely culprit (a missing lock around a 
non-thread-safe corosync library call) based on the last logs (which 
were now finally complete!). 

rebuilt packages with a proof-of-concept-fix: 

23b03a48d3aa9c14e86fe8cf9bbb7b00bd8fe9483084b9e0fd75fd67f29f10bec00e317e2a66758713050f36c165d72f107ee3449f9efeb842d3a57c25f8bca7
 pve-cluster_6.1-8_amd64.deb 
9e1addd676513b176f5afb67cc6d85630e7da9bbbf63562421b4fd2a3916b3b2af922df555059b99f8b0b9e64171101a1c9973846e25f9144ded9d487450baef
 pve-cluster-dbgsym_6.1-8_amd64.deb 

I removed some logging statements which are no longer needed, so output 
is a bit less verbose again. if you are not able to trigger the issue 
with this package, feel free to remove the -debug and let it run for a 
little longer without the massive logs. 

if feedback from your end is positive, I'll whip up a proper patch 
tomorrow or on Thursday. 


___ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 


___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


[pve-devel] applied: [PATCH pve-qemu 1/5] Add transaction patches and fix for blocking finish

2020-09-29 Thread Thomas Lamprecht
On 28.09.20 17:48, Stefan Reiter wrote:
> With the transaction patches, patch 0026-PVE-Backup-modify-job-api.patch
> is no longer necessary, so drop it and rebase all following patches on
> top.
> 
> Signed-off-by: Stefan Reiter 
> ---
> 

applied, thanks!


___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel



[pve-devel] applied: [PATCH qemu-server 5/5] vzdump: log 'finishing' state

2020-09-29 Thread Thomas Lamprecht
On 28.09.20 17:48, Stefan Reiter wrote:
> ...and avoid printing 100% status twice
> 
> Signed-off-by: Stefan Reiter 
> ---
>  PVE/VZDump/QemuServer.pm | 10 +-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
>

applied, thanks! But, I did s/verification/backup validation/ to avoid some
possible confusion with the more costly/in-depth server verification.



___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel



[pve-devel] [PATCH storage] fix regression in zfs volume activation

2020-09-29 Thread Stoiko Ivanov
commit 815df2dd08ac4c7295135262e60d64fbb57b8f5c introduced a small issue
when activating linked clone volumes - the volname passed contains
basevol/subvol, which needs to be translated to subvol.

using the path method should be a robust way to get the actual path for
activation.
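
In short (sketch of the change in the diff below): resolve the volume to its
actual path first and use that both for the 'mounted' check and for mounting:

    my ($path) = $class->path($scfg, $volname, $storeid);
    my $mounted = $class->zfs_get_properties($scfg, 'mounted', $path);
    $class->zfs_request($scfg, undef, 'mount', $path) if $mounted !~ m/^yes$/;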

Found and tested by building the package as root (otherwise the zfs
regression tests are skipped).

Reported-by: Thomas Lamprecht 
Signed-off-by: Stoiko Ivanov 
---
 PVE/Storage/ZFSPoolPlugin.pm | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/PVE/Storage/ZFSPoolPlugin.pm b/PVE/Storage/ZFSPoolPlugin.pm
index 4f8df5e..6ac05b4 100644
--- a/PVE/Storage/ZFSPoolPlugin.pm
+++ b/PVE/Storage/ZFSPoolPlugin.pm
@@ -554,9 +554,10 @@ sub activate_volume {
 if ($format eq 'raw') {
$class->zfs_wait_for_zvol_link($scfg, $volname);
 } elsif ($format eq 'subvol') {
-   my $mounted = $class->zfs_get_properties($scfg, 'mounted', "$scfg->{pool}/$volname");
+   my ($path, undef, undef) = $class->path($scfg, $volname, $storeid);
+   my $mounted = $class->zfs_get_properties($scfg, 'mounted', "$path");
if ($mounted !~ m/^yes$/) {
-   $class->zfs_request($scfg, undef, 'mount', "$scfg->{pool}/$volname");
+   $class->zfs_request($scfg, undef, 'mount', "$path");
}
 }
 
-- 
2.20.1



___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel



[pve-devel] applied: [RFC storage 1/1] Disks: instantiate import unit for created zpool

2020-09-29 Thread Thomas Lamprecht
On 16.09.20 14:14, Stoiko Ivanov wrote:
> When creating a new ZFS storage, also instantiate an import-unit for the pool.
> This should help mitigate the case where some pools don't get imported during
> boot, because they are not listed in an existing zpool.cache file.
> 
> This patch needs the corresponding addition of 'zfs-import@.service' in
> the zfsonlinux repository.
> 
> Suggested-by: Fabian Grünbichler 
> Signed-off-by: Stoiko Ivanov 
> ---
>  PVE/API2/Disks/ZFS.pm | 6 ++
>  1 file changed, 6 insertions(+)
> 
>

applied, thanks! As we have no dependency on zfsutils-linux here to do a
bump to the versioned dependency, I added a simple check if the zfs-import@
template service exists (roughly, if -e 
'/lib/systemd/system/zfs-import@.service')



___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


[pve-devel] applied: [PATCH storage] fix regression in zfs volume activation

2020-09-29 Thread Thomas Lamprecht
On 29.09.20 18:49, Stoiko Ivanov wrote:
> commit 815df2dd08ac4c7295135262e60d64fbb57b8f5c introduced a small issue
> when activating linked clone volumes - the volname passed contains
> basevol/subvol, which needs to be translated to subvol.
> 
> using the path method should be a robust way to get the actual path for
> activation.
> 
> Found and tested by building the package as root (otherwise the zfs
> regression tests are skipped).
> 
> Reported-by: Thomas Lamprecht 
> Signed-off-by: Stoiko Ivanov 
> ---
>  PVE/Storage/ZFSPoolPlugin.pm | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
>

applied, thanks!



___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel



Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

2020-09-29 Thread Alexandre DERUMIER
Hi,

some news: my last test has been running for 14h now, and I haven't had any
problem :)

So, it seems this is indeed fixed! Congratulations!



I wonder if it could be related to this forum user
https://forum.proxmox.com/threads/proxmox-6-2-corosync-3-rare-and-spontaneous-disruptive-udp-5405-storm-flood.75871/

His problem is that after a corosync lag (he has one cluster stretched over 2 DCs 
with 10km distance, so I think he sometimes gets some small lag),
one node floods the other nodes with a lot of udp packets (making things 
worse, as corosync cpu usage goes to 100% / overloaded, and then it can't see the 
other nodes).

I had this problem 6 months ago after shutting down a node, that's why I'm 
thinking it could "maybe" be related.

So, I wonder if it could be the same pmxcfs bug, where something loops or keeps 
sending packets again and again.

The forum user seems to hit the problem multiple times in some weeks, so maybe 
he'll be able to test the new fixed pmxcfs and tell us if it fixes this bug 
too.



- Mail original -
De: "aderumier" 
À: "Proxmox VE development discussion" 
Envoyé: Mardi 29 Septembre 2020 15:52:18
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

>>huge thanks for all the work on this btw! 

huge thanks to you ! ;) 


>>I think I've found a likely culprit (a missing lock around a 
>>non-thread-safe corosync library call) based on the last logs (which 
>>were now finally complete!). 

YES :) 


>>if feedback from your end is positive, I'll whip up a proper patch 
>>tomorrow or on Thursday. 

I'm going to launch a new test right now ! 


- Mail original - 
De: "Fabian Grünbichler"  
À: "Proxmox VE development discussion"  
Envoyé: Mardi 29 Septembre 2020 15:28:19 
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 

huge thanks for all the work on this btw! 

I think I've found a likely culprit (a missing lock around a 
non-thread-safe corosync library call) based on the last logs (which 
were now finally complete!). 

rebuilt packages with a proof-of-concept-fix: 

23b03a48d3aa9c14e86fe8cf9bbb7b00bd8fe9483084b9e0fd75fd67f29f10bec00e317e2a66758713050f36c165d72f107ee3449f9efeb842d3a57c25f8bca7
 pve-cluster_6.1-8_amd64.deb 
9e1addd676513b176f5afb67cc6d85630e7da9bbbf63562421b4fd2a3916b3b2af922df555059b99f8b0b9e64171101a1c9973846e25f9144ded9d487450baef
 pve-cluster-dbgsym_6.1-8_amd64.deb 

I removed some logging statements which are no longer needed, so output 
is a bit less verbose again. if you are not able to trigger the issue 
with this package, feel free to remove the -debug and let it run for a 
little longer without the massive logs. 

if feedback from your end is positive, I'll whip up a proper patch 
tomorrow or on Thursday. 


___ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 


___ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 


___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

2020-09-29 Thread Thomas Lamprecht
Hi,

On 30.09.20 08:09, Alexandre DERUMIER wrote:
> some news: my last test has been running for 14h now, and I haven't had any 
> problem :)
> 

great! Thanks for all your testing time, this would have been much harder,
if even possible at all, without you providing so much testing effort on a
production(!) cluster - appreciated!

Naturally many thanks to Fabian too, for reading so many logs without going
insane :-)

> So, it seems this is indeed fixed! Congratulations!
> 

honza confirmed Fabian's suspicion about the lacking guarantees of thread safety
for cpg_mcast_joined, which was sadly not documented, so this is surely
a bug; let's hope the last of such hard-to-reproduce ones.

> 
> 
> I wonder if it could be related to this forum user
> https://forum.proxmox.com/threads/proxmox-6-2-corosync-3-rare-and-spontaneous-disruptive-udp-5405-storm-flood.75871/
> 
> His problem is that after a corosync lag (he has one cluster stretched over 2 DCs 
> with 10km distance, so I think he sometimes gets some small lag),
> one node floods the other nodes with a lot of udp packets (making things 
> worse, as corosync cpu usage goes to 100% / overloaded, and then it can't see 
> the other nodes).

I can imagine this problem showing up as a side effect of a flood where
partition changes happen. Not so sure that this can be the cause of that directly.

> 
> I had this problem 6 months ago after shutting down a node, that's why I'm 
> thinking it could "maybe" be related.
> 
> So, I wonder if it could be the same pmxcfs bug, where something loops or keeps 
> sending packets again and again.
> 
> The forum user seems to hit the problem multiple times in some weeks, so maybe 
> he'll be able to test the new fixed pmxcfs and tell us if it fixes this 
> bug too.

Testing once it's available would surely be a good idea for them.



___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel