Re: [pve-devel] [PATCH close #545, #5203: Allow bridges to have any valid interface name 0/2]

2024-02-22 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
>>It might make sense to check for any possible conflicts with the SDN
>>config (running & staged).

Technically, ifupdown2 will try to merge the config options if the
interface is defined in both /etc/network/interfaces &&
/etc/network/interfaces.d/.


I have seen users doing specific tuning on vnets like this:

/etc/network/interfaces

iface vnet1
  otheroptions ..


/etc/network/interfaces.d/

iface vnet1
  address 
  bridge_ports ...




 Original Message 
From: Stefan Hanreich 
Reply-To: Proxmox VE development discussion 
To: Proxmox VE development discussion ,
Jillian Morgan 
Subject: Re: [pve-devel] [PATCH close #545, #5203: Allow bridges to have
any valid interface name 0/2]
Date: 21/02/2024 15:21:06

It might make sense to check for any possible conflicts with the SDN
config (running & staged).

--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH pve-network 0/7] add dhcp support for all zones

2024-02-22 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
Hi Stefan!

I don't know the roadmap for DHCP, but I'll have time to help in
March. I haven't looked at QinQ yet.




>>I've had another look at this patch series and I think I found the
>>reason for the issue(s) I encountered during my testing.
>>
>>One issue is related to the new IP forwarding settings. It seems like
>>they are not applying. I've looked at the ifquery output after
>>creating
>>a QinQ / VLAN zone with DHCP enabled:

>>It seems like the ip-forward settings do not get applied and
>>therefore
>>the command 'fails'. The bridges are up and working but IP forwarding
>>is
>>enabled:
>>
>>root@hoan-02:~# cat /proc/sys/net/ipv4/conf/vlan4/forwarding
>>1
>>
>>root@hoan-02:~# cat /proc/sys/net/ipv4/conf/qinq5/forwarding
>>1

What is the output of "ifreload -a -d"?



>>The other issue was using QinQ zone with a bridge that has no bridge
>>port configured and is not vlan-aware. In that case status is
>>checking
>>for the existence of the sv_ interface but it doesn't exist since
>>there isn't a bridge port.
>>This is also occurring without this patch, so no showstopper here
>>imo.

Mmm, good catch, I'll look into adding a check for this.
I'll also check the VLAN zone with a non-vlan-aware bridge without an
interface.



BTW, for dnsmasq + IPv6 with EVPN/VRF, we need to add a patch.

What do you think about providing a Proxmox package for dnsmasq with
this patch, and removing the default dnsmasq service? (Currently it's
a little bit hacky, with manually disabling the main service.)



> It's just missing a DHCP server IP different from the gateway for
> IPv6 handling for vlan/qinq/vxlan, but it should be easy to
> implement.
> 
> Also, for IPv6 in VRF, it needs a patch for dnsmasq, so I think this
> will need a Proxmox dnsmasq package version.


--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH pve-network 0/7] add dhcp support for all zones

2024-02-22 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
> what is the output of "ifreload -a -d"  ?

>>nothing mentioning ip-forward sadly, I had already looked at
>>/var/log/ifupdown2 to get an idea of what's going wrong but I
>>couldn't
>>find anything mentioned there as well (I think the output is the
>>same..). I think it might just get ignored when applying - but not
>>when
>>checking.
>>
>>I guess I'll have to dig deeper into ifupdown2...


OK, I have found it.

ip-forward 1|0 is not applied on bridges.

Currently it's only set dynamically, depending on whether the bridge
has an IP address or not.

I have sent a patch upstream:
https://github.com/CumulusNetworks/ifupdown2/pull/292


I'll send a patch to pve-devel today or tomorrow.


--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH pve-manager] sdn: evpn: allow empty primary exit node in zone form

2024-02-22 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
 Original Message 
From: Stefan Hanreich 

>>This was broken when adding a new EVPN zone and there's an easier way
>>built-in to our widget toolkit. I've taken the liberty of sending a
>>v2 and mentioning you [1].


Oh, great, thanks! I was banging my head trying to find a clean way to
fix it :)

--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH v2 firewall 6/6] simulator: use new bridge naming scheme

2024-02-26 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
Hi, I think you should limit it to 8 characters, like for SDN vnets,

as we need to leave space for a VLAN tag for example (vmbrY.<vlan>),
or other SDN constructs.
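
Rough sketch of the length arithmetic I have in mind (assuming the
usual 15 usable characters per Linux interface name; the derived-name
pattern is just an illustration):

# a minimal sketch, assuming 15 usable chars per interface name (IFNAMSIZ - 1)
my $ifname_max     = 15;
my $vlan_suffix    = '.4094';   # worst-case vlan tag suffix on a derived name
my $max_bridge_len = 8;         # proposed limit, like for sdn vnets
printf "headroom left for other sdn prefixes/suffixes: %d chars\n",
    $ifname_max - $max_bridge_len - length($vlan_suffix);   # 2 chars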


 Original Message 
From: Stefan Hanreich 
Reply-To: Proxmox VE development discussion 
To: pve-devel@lists.proxmox.com
Subject: [pve-devel] [PATCH v2 firewall 6/6] simulator: use new bridge
naming scheme
Date: 23/02/2024 15:36:51

We now allow bridges without the vmbr prefix, so we need to allow them
here in the simulator as well.

Signed-off-by: Stefan Hanreich 
---
 src/PVE/FirewallSimulator.pm    | 20 ++--
 src/PVE/Service/pve_firewall.pm |  4 ++--
 2 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/src/PVE/FirewallSimulator.pm b/src/PVE/FirewallSimulator.pm
index 140c46e..bd297d5 100644
--- a/src/PVE/FirewallSimulator.pm
+++ b/src/PVE/FirewallSimulator.pm
@@ -397,7 +397,7 @@ sub route_packet {
        $pkg->{physdev_in} = $target->{fwln} || die 'internal error';
        $pkg->{physdev_out} = $target->{tapdev} || die 'internal error';
 
-   } elsif ($route_state =~ m/^vmbr\d+$/) {
+   } elsif ($route_state =~ m/^[a-zA-Z][a-zA-Z0-9]{0,14}$/) {
 
        die "missing physdev_in - internal error?" if
!$physdev_in;
        $pkg->{physdev_in} = $physdev_in;
@@ -531,11 +531,6 @@ sub simulate_firewall {
    $from_info->{type} = 'host';
    $start_state = 'host';
    $pkg->{source} = $host_ip if !defined($pkg->{source});
-    } elsif ($from =~ m|^(vmbr\d+)/(\S+)$|) {
-   $from_info->{type} = 'bport';
-   $from_info->{bridge} = $1;
-   $from_info->{iface} = $2;
-   $start_state = 'from-bport';
 } elsif ($from eq 'outside') {
    $from_info->{type} = 'bport';
    $from_info->{bridge} = 'vmbr0';
@@ -559,6 +554,11 @@ sub simulate_firewall {
    $from_info = extract_vm_info($vmdata, $vmid, $netnum);
    $start_state = 'fwbr-out';
    $pkg->{mac_source} = $from_info->{macaddr};
+    } elsif ($from =~ m|^([a-zA-Z][a-zA-Z0-9]{0,14})/(\S+)$|) {
+   $from_info->{type} = 'bport';
+   $from_info->{bridge} = $1;
+   $from_info->{iface} = $2;
+   $start_state = 'from-bport';
 } else {
    die "unable to parse \"from => '$from'\"\n";
 }
@@ -569,10 +569,6 @@ sub simulate_firewall {
    $target->{type} = 'host';
    $target->{iface} = 'host';
    $pkg->{dest} = $host_ip if !defined($pkg->{dest});
-    } elsif ($to =~ m|^(vmbr\d+)/(\S+)$|) {
-   $target->{type} = 'bport';
-   $target->{bridge} = $1;
-   $target->{iface} = $2;
 } elsif ($to eq 'outside') {
    $target->{type} = 'bport';
    $target->{bridge} = 'vmbr0';
@@ -591,6 +587,10 @@ sub simulate_firewall {
    my $vmid = $1;
    $target = extract_vm_info($vmdata, $vmid, 0);
    $target->{iface} = $target->{tapdev};
+    } elsif ($to =~ m|^([a-zA-Z][a-zA-Z0-9]{0,14})/(\S+)$|) {
+   $target->{type} = 'bport';
+   $target->{bridge} = $1;
+   $target->{iface} = $2;
 } else {
    die "unable to parse \"to => '$to'\"\n";
 }
diff --git a/src/PVE/Service/pve_firewall.pm b/src/PVE/Service/pve_firewall.pm
index 30d14d9..20fbc31 100755
--- a/src/PVE/Service/pve_firewall.pm
+++ b/src/PVE/Service/pve_firewall.pm
@@ -312,14 +312,14 @@ __PACKAGE__->register_method ({
        from => {
    description => "Source zone.",
    type => 'string',
-   pattern => '(host|outside|vm\d+|ct\d+|vmbr\d+/\S+)',
+   pattern => '(host|outside|vm\d+|ct\d+|([a-zA-Z][a-zA-Z0-9]{0,14})/(\S+))',
    optional => 1,
    default => 'outside',
        },
        to => {
    description => "Destination zone.",
    type => 'string',
-   pattern => '(host|outside|vm\d+|ct\d+|vmbr\d+/\S+)',
+   pattern => '(host|outside|vm\d+|ct\d+|([a-zA-Z][a-zA-Z0-9]{0,14})/(\S+))',
    optional => 1,
    default => 'host',
        },

--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] Proxmox support - to develop custom storage plugin

2024-03-12 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
Hi,

In general, the qemu process has access to a file path
(/mnt/.../*.raw or *.qcow2) or a block device (/dev/...).

In /usr/share/perl5/PVE/QemuServer.pm:

sub print_drivedevice_full {
    my ($storecfg, $conf, $vmid, $drive, $bridges, $arch, $machine_type) = @_;
    ...
    $path = PVE::Storage::path($storecfg, $drive->{file});


The path function is either defined in your plugin, or the default
inherited from PVE/Storage/Plugin.pm is used.




Depending on your storage, you also have activate_volume(), to expose
the /dev/... device on the host, for example,

and activate_storage(), to mount an NFS filesystem, for example.


Also, if you use iSCSI, it's possible for qemu to map an iSCSI LUN
directly, without needing to expose it on the host
(have a look at ZFS over iSCSI, for example).
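
A very rough sketch of those two hooks (not a working plugin; the
device path is an assumption and the signatures follow
PVE/Storage/Plugin.pm as far as I remember them):

package PVE::Storage::Custom::HpePlugin;

use strict;
use warnings;
use base qw(PVE::Storage::Plugin);

# return the path/device that qemu will open for this volume
sub path {
    my ($class, $scfg, $volname, $storeid, $snapname) = @_;
    my $dev = "/dev/mapper/$volname";   # assumption: the LUN shows up as a multipath device
    return wantarray ? ($dev, undef, 'images') : $dev;
}

# make sure the device is actually visible on this host before the VM starts
sub activate_volume {
    my ($class, $storeid, $scfg, $volname, $snapname, $cache) = @_;
    # e.g. trigger a LUN rescan / iscsi login here
    return 1;
}

1;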





What is your storage protocol ?


 Original Message 
From: "Lothe, Jyotsna Pankaj via pve-devel" 
Reply-To: Proxmox VE development discussion 
To: pve-devel@lists.proxmox.com 
Cc: "Lothe, Jyotsna Pankaj" 
Subject: [pve-devel] Proxmox support - to develop custom storage plugin
Date: 12/03/2024 14:28:22

Hi Team,

I am from an HPE team where we are looking at a POC to use Proxmox as a
replacement for VMware virtualization.
We were able to write a small custom storage plugin by adding a plugin
file under /usr/share/perl5/PVE/Storage/Custom/HpePlugin.pm.
We are able to connect to our storage and fetch the list of volumes
from it.

The pvesm status command shows plugin as active, see below output:

root@pmvm:~# pvesm status
Name         Type    Status      Total      Used   Available        %
local        dir     active    6733592   3761440     2608840   55.86%
prim_220     hpe     active    6733592   3761440     2608840   55.86%
root@pmvm:~#

and pvesm list shows the volumes below from our storage:

root@pmvm:~# pvesm list three_par
Volid    Format    Type      Size    VMID
0        raw       images      10    101
1        raw       images      80    101
3        raw       images      20    101
4        raw       images      32    101
5        raw       images     100    101

Now we would like to understand the flow: how can we map these volumes
to create VMs?

Can someone from the team guide us to understand the volume-to-VM
mapping?

Regards,
Jyotsna
--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] Proxmox support - to develop custom storage plugin

2024-03-12 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
You can have a look at this old storage plugin I wrote for NetApp
10 years ago:
https://github.com/odiso/proxmox-pve-storage-netapp/blob/master/PVE/Storage/NetappPlugin.pm

I don't think it still works, but the concepts should be the same:


create volumes, list volumes, activate volume/storage, return the
volume path.



Then add extra features: snapshot, rollback, clone, volume resize...
--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


[pve-devel] release a new pve-network package ? (evpn fix)

2024-03-21 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
Hi,
a critical bug in EVPN with multiple nodes is fixed in git:
https://git.proxmox.com/?p=pve-network.git;a=commit;h=e614da43f13e3c61f9b78ee9984364495eff91b6
but the package is still not released.

I have seen a lot of user bug reports about this over the last 4
months, like this recent one:
https://forum.proxmox.com/threads/evpn-sdn-issues-after-upgrade-proxmox-ve-from-7-to-8.131236/post-645618


Could it be possible to release a 0.9.6 version?
--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH pve-manager v2] sdn: evpn: allow empty primary exit node in zone form

2024-04-02 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
Hi,

Could it be possible to merge this patch?

I have seen another report about it on the forum:

https://forum.proxmox.com/threads/bugfix-for-evpn-sdn-multiple-exit-nodes.137784/post-649071


 Original Message 
From: Stefan Hanreich 
Reply-To: Proxmox VE development discussion 
To: pve-devel@lists.proxmox.com
Subject: [pve-devel] [PATCH pve-manager v2] sdn: evpn: allow empty
primary exit node in zone form
Date: 22/02/2024 17:40:47

 sdn: evpn: allow empty primary exit node in zone form

--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [RFC container/firewall/manager/proxmox-firewall/qemu-server 00/37] proxmox firewall nftables implementation

2024-04-02 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
Hi Stefan,

I'll really take time to test it (I was super busy these last months
with a datacenter migration), as I have been waiting for nftables for
a while.

I can't help much with Rust, but I really appreciate it, as I have
some servers with a lot of VMs && rules where it takes more than 10s
to generate the rules with the current Perl implementation.


I would really like to not have fwbr bridges anymore, because I have
seen a big performance bug with them:

if you have a lot of VMs in the same vlan/network, with a lot of fwbr
plugged on the vmbr, broadcast traffic like ARP is replicated over
each fwbr; it doesn't use a fast path and uses a lot of CPU in
ksoftirqd.

I have also sometimes seen EVPN bugs, putting stress on MAC tracking
when a lot of fwbr are used.

Also, users have reported conntrack problems with fwbr, where a TCP
reset sent by the guest is not received on the other side.



I'll try your code, look at the generated rules, and see if I can get
REJECT working.


Regards,

Alexandre




 Original Message 
From: Stefan Hanreich 
Reply-To: Proxmox VE development discussion 
To: pve-devel@lists.proxmox.com
Subject: [pve-devel] [RFC container/firewall/manager/proxmox-
firewall/qemu-server 00/37] proxmox firewall nftables implementation
Date: 02/04/2024 19:15:52

## Introduction
This RFC provides a drop-in replacement for the current pve-firewall
package
that is based on Rust and nftables.

It consists of three crates:
* proxmox-ve-config
  for parsing firewall and guest configuration files, as well as some
helpers
  to access host configuration (particularly networking)
* proxmox-nftables
  contains bindings for libnftables as well as types that implement the
JSON
  schema defined by libnftables-json
* proxmox-firewall
  uses the other two crates to read the firewall configuration and
create the
  respective nftables configuration


## Installation
* Build & install all deb packages on your PVE instance
* Enable the nftables firewall by going to
  Web UI >  > Firewall > Options > proxmox-nftables
* Enable the firewall datacenter-wide if you haven't already
* Restarting running VMs/CTs is required so the changes to the fwbr
creation
  go into effect

For your convenience I have provided pre-built packages on our share
under
`shanreich-proxmox-firewall`.

The source code is also available on my staff repo as `proxmox-
firewall`.


## Configuration
The firewall should work as a drop-in replacement for the pve-firewall,
so you
should be able to configure the firewall as usual via the Web UI or
configuration files.


## Known Issues
There is currently one major issue that we still need to solve:
REJECTing
packets from the guest firewalls is currently not possible for incoming
traffic
(it will instead be dropped).

This is due to the fact that we are using the postrouting hook of
nftables in a
table with type bridge for incoming traffic. In the bridge table in the
postrouting hook we cannot tell whether the packet has also been sent
to other
ports in the bridge (e.g. when a MAC has not yet been learned and the
packet
then gets flooded to all bridge ports). If we would then REJECT a
packet in the
postrouting hook this can lead to a bug where the firewall rules for
one guest
REJECT a packet and send a response (RST for TCP, ICMP port/host-
unreachable
otherwise).

This has also been explained in the respective commit introducing the
restriction [1].

We were able to circumvent this restriction in the old firewall due to
using
firewall bridges and rejecting in the firewall bridge itself. Doing
this leads
to the behavior described above, which has tripped up some of our users
before
[2] [3] and which is, frankly, wrong.

I currently see two possible solutions for this, both of which carry
downsides.
Your input on this matter would be much appreciated, particularly if
you can
think of another solution which I cannot currently see:

1. Only REJECT packets in the prerouting chain of the firewall bridge
with the
destination MAC address set to the MAC address of the network device,
otherwise
DROP

The downside of this is that we, once again, will have to resort to
using
firewall bridges, which we wanted to eliminate. This would also be the
sole
reason for still having to resort to using firewall bridges.

2. Only allow DROP in the guest firewall for incoming traffic

This would be quite awkward since, well, rejecting traffic would be
quite nice
for a firewall I'd say ;)

I'm happy for all input regarding this matter.


## Useful Commands

You can check if firewall rules got created by running

```
nft list ruleset
```

You can also check that `iptables` rules are not created via
```
iptables-save
```

Further info about the services:
```
systemctl status proxmox-firewall.{service,timer}
```

You can grab the debug output from the new firewall like so:

```
RUST_LOG=trace proxmox-firewall
```

## Upcoming

There are some (very minor) features missing:
* automatically generating an ipf

Re: [pve-devel] [RFC container/firewall/manager/proxmox-firewall/qemu-server 00/37] proxmox firewall nftables implementation

2024-04-02 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
>>
>>## Known Issues
>>There is currently one major issue that we still need to solve:
>>REJECTing
>>packets from the guest firewalls is currently not possible for
>>incoming traffic
>>(it will instead be dropped).


That reminds me of a Hetzner bug

--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [RFC container/firewall/manager/proxmox-firewall/qemu-server 00/37] proxmox firewall nftables implementation

2024-04-02 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
>>## Known Issues
>>There is currently one major issue that we still need to solve:
>>REJECTing
>>packets from the guest firewalls is currently not possible for
>>incoming traffic
>>(it will instead be dropped).

That reminds me of an old Hetzner bug (Hetzner flooding bad packets
with a wrong destination MAC to all ports), where the firewall then
rejected with a tcp-reset using a random bridge MAC:

https://forum.proxmox.com/threads/proxmox-claiming-mac-address.52601/page-3#post-416219



Personally, I'm not sure that using REJECT / tcp-reset on a bridge is
a good idea. (Even if I'm personally using it in production, I have no
problem switching to DROP if it avoids other problems.)



>>
>>This is due to the fact that we are using the postrouting hook of
>>nftables in a
>>table with type bridge for incoming traffic. In the bridge table in
>>the
>>postrouting hook we cannot tell whether the packet has also been sent
>>to other
>>ports in the bridge (e.g. when a MAC has not yet been learned and the
>>packet
>>then gets flooded to all bridge ports). 


Maybe it is time to disable dynamic MAC learning by default?
The code is already there and works fine.

AFAIK, other hypervisors like VMware also disable port flooding by
default, with static MAC registration.


--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [RFC container/firewall/manager/proxmox-firewall/qemu-server 00/37] proxmox firewall nftables implementation

2024-04-03 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---

> Maybe it is time to disable dynamic mac-learning  by default ? 
> The code is already here and works fine.
> 
> AFAIK, other hypervisor like vmware disable port flooding by default
> with static mac registration too.

>>Might be a good idea, although it still wouldn't solve the problem -
>>sadly (since we're still not allowed to do REJECT then).

Maybe revert the kernel patch? ^_^
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/net/bridge/netfilter/nft_reject_bridge.c?h=v6.8.2&id=127917c29a432c3b798e014a1714e9c1af0f87fe


Or improve it upstream, something like:

    if !bridge_unicast_flooding && !bridge_mac_learning && proto = tcp|udp
        allow_use_of_reject


as the original commit message seems to be about unicast flooding:

"
 If we allow this to be used from forward or any other later
bridge hook, if the frame is flooded to several ports, we'll end up
sending several reject packets,
"




___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [RFC container/firewall/manager/proxmox-firewall/qemu-server 00/37] proxmox firewall nftables implementation

2024-04-03 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
 Original Message 
From: Stefan Hanreich 
Reply-To: Proxmox VE development discussion 
To: pve-devel@lists.proxmox.com
Subject: Re: [pve-devel] [RFC container/firewall/manager/proxmox-
firewall/qemu-server 00/37] proxmox firewall nftables implementation
Date: 03/04/2024 14:25:40

On 4/3/24 14:03, DERUMIER, Alexandre via pve-devel wrote:
> maybe revert the kernel patch ? ^_^
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/net/bridge/netfilter/nft_reject_bridge.c?h=v6.8.2&id=127917c29a432c3b798e014a1714e9c1af0f87fe

>>I also thought about it shortly. If we can ensure that certain
>>conditions are met that might be an option. We would have to think
>>about
>>broadcast/multicast traffic like ARP / DHCP I would assume. It seems
>>a
>>bit drastic from my POV, which is why dropped that thought.

I think you can just use DROP for this kind of traffic, as you don't
expect to receive a response like a tcp-reset or ICMP port unreachable
anyway.

Only tcp/udp/icmp unicast traffic should be interesting (and it can be
protected from unicast flooding).




> Or Improve it for upstream, something like:
> 
> if !bridge_unicast_flooding && !bridge_mac_learning && proto =
> tcp|udp
>     allow_use_of_reject

>>that might be a possibility, although I'm not sure that information
>>about the bridge is available in the netfilter modules.

Yes, I'm not sure it's possible to have this kind of information.
Maybe simply add a sysctl flag to allow|disallow the usage of REJECT
in forward/postrouting.
That way, the userland software can enable it if it knows that it's
safe.



___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH pve-network/pve-common/pve-manager] fix #4300 : sdn: add bridge ports isolation

2024-06-17 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
Hi,

Could it be possible to apply this patch series? (Or review it, if it
needs cleanup.)

(I see a lot of users requesting it.)

Thanks!

Alexandre



BTW: I'm a little bit off currently; I'm working on VM LUKS
encryption, and I'll send a patch series soon.



 Original Message 
From: Alexandre Derumier via pve-devel 
Reply-To: Proxmox VE development discussion 
To: pve-devel@lists.proxmox.com
Cc: Alexandre Derumier 
Subject: [pve-devel] [PATCH pve-network/pve-common/pve-manager] fix #4300
: sdn: add bridge ports isolation
Date: 25/04/2024 16:43:49

___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel

--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


[pve-devel] missing proxmox-backup-client_3.2.5-1 package in no subscription repo

2024-06-17 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
Hi,

A lot of users are complaining about dist-upgrade trying to remove
pve-manager:

https://forum.proxmox.com/threads/upgrading-pve-tries-to-remove-proxmox-ve-package.149101/page-3


It seems that's because of a missing proxmox-backup-client_3.2.5-1
version dependency. (It's OK in pvetest.)


Alexandre
--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH pve-network/pve-common/pve-manager] fix #4300 : sdn: add bridge ports isolation

2024-06-27 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
Hi!


>>Hi! I gave this a quick test on my machine and everything worked
>>well.
>>Would we maybe want to expose this setting on the NIC level as well?

I don't think that can work, because a non-isolated port has access to
all other ports, including isolated ports.


"
isolated on or isolated off
Controls whether a given port will be isolated, which means it will be
able to communicate with non-isolated ports only. By default this flag
is off."


For example:
vm1: isolated
vm2: isolated
vm3: non-isolated


vm1 can't access vm2
vm2 can't access vm1

vm3 has access to the isolated vm1 && vm2 (while the user thinks vm1
&& vm2 are secure), and vm1/vm2 have access to vm3 too.


That's why I have done it at the bridge/vnet level, all or nothing.

The main usage is to have only one upstream port that is non-isolated
(traffic coming from outside).
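
For illustration, the isolation itself boils down to this kind of
iproute2 call on each guest port of the vnet (a sketch only; the tap
device name is hypothetical):

use strict;
use warnings;
use PVE::Tools;

# a guest port marked as isolated can only talk to non-isolated ports
my $iface = 'tap100i0';   # hypothetical tap device of a guest nic
PVE::Tools::run_command(['bridge', 'link', 'set', 'dev', $iface, 'isolated', 'on']);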


>>Also I think 'Isolate Ports' or 'Port Isolation' would be the better
>>label, 'Ports Isolation' sounds a bit wrong to me.

I'll send a v2 with "Port Isolation"



Otherwise, consider this:

>>Tested-By: Stefan Hanreich 
>>Reviewed-By: Stefan Hanreich 

Thanks !

On 4/25/24 16:43, Alexandre Derumier via pve-devel wrote:
> ___
> pve-devel mailing list
> pve-devel@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH v2 pve-manager 2/2] ui: qemu : memoryedit: add new max && virtio fields

2024-07-09 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
> 
> and for virtio-mem, dimmsize can be replace by chunk size
>>Sorry about the very late response!
>>
>>When calculating from the DIMM size with 1 GiB step size, we can only
>>get to max values N * 64 GiB. We could still a have a separate max
>>field
>>with smaller step size, e.g. a max value of 100 GiB would be using
>>DIMM
>>size of 2 GiB and reject any API request trying to plug more memory
>>(so
>>a total of 50 DIMMs can be plugged). I'm just a bit worried the
>>auto-update might get a bit confusing, e.g. user might just want to
>>change DIMM size without expecting that this will override the max
>>value. Should DIMM size maybe be a separate setting?
>>
>>Maybe it's easier to just start with max and wait for use
>>cases/requests
>>where changing DIMM size is actually required?

>>@Alexandre: I came back to this since
>>https://bugzilla.proxmox.com/show_bug.cgi?id=5585 came in and wanted to
>>ask if you'd still like to send out a new version with the improved
>>max
>>granularity or if this should be picked up by a developer at Proxmox
>>instead?


Hi Fiona! I'm very busy currently working on shared LVM SAN snapshots
&& thin provisioning (I'll submit an RFC soon).

I don't think I'll have time to rework this until September/October
(with the holidays coming).
So, if a Proxmox developer could work on this again, that would be
great :)

(BTW, there is also the new hv-balloon for Windows ^_^)


--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] issues with Virtio-SCSI devicde on Proxmox...

2024-08-13 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
Hi,

Currently there is no way to add custom options to the virtio device
command line from the VM config, so it would need to be patched to add
support for the OpenVMS OS and add special tuning.

For example, in
https://git.proxmox.com/?p=qemu-server.git;a=blob_plain;f=PVE/QemuServer.pm;hb=HEAD

sub print_netdevice_full {
    ...
    # example: special-case an 'openvms' ostype to append extra device options
    if ($conf->{ostype} && $conf->{ostype} eq 'openvms') {
        $tmpstr .= ",disable_legacy=on";
    }


>>I have tried both “virtio-scsi-pci” and “virtio-scsi-single”, but no
>>luck. The emulated virtio-scsi device is a legacy device
>>
>>
>>Questions:
>>* is there a way to configure a modern virtio-scsi devcie (i.e.
>>disable_legacy=on) ?
>>

Do you mean that currently on Proxmox the default is legacy mode and
not modern? I was pretty sure that qemu defaults to modern,

and that you need to use ",disable-legacy=off,disable-modern=true" to
switch to legacy.

(And AFAIK, the Proxmox qemu version doesn't have any patch forcing
legacy.)

But maybe I'm wrong.



Maybe it's only enabled by default with PCI Express (using the q35
machine model)?

https://github.com/qemu/qemu/blob/9eb51530c12ae645b91e308d16196c68563ea883/docs/pcie.txt
"Virtio devices plugged into PCI Express ports are PCI Express devices
and
have "1.0" behavior by default without IO support.
In both cases disable-legacy and disable-modern properties can be used
to override the behaviour."


in qemu code I see:

https://github.com/qemu/qemu/blob/9eb51530c12ae645b91e308d16196c68563ea883/hw/core/machine.c#L258
GlobalProperty hw_compat_2_6[] = {
{ "virtio-mmio", "format_transport_address", "off" },
/* Optional because not all virtio-pci devices support legacy mode
*/
{ "virtio-pci", "disable-modern", "on",  .optional = true },
{ "virtio-pci", "disable-legacy", "off", .optional = true },
};
const size_t hw_compat_2_6_len = G_N_ELEMENTS(hw_compat_2_6);

So I'm really not sure what the real default is.


Do you have the qemu command line of another KVM hypervisor
(oVirt, RHEV, ...) where it's working correctly? Maybe they enable the
modern bits manually when a virtio device supports them?



 Original Message 
From: Christian Moser 
Reply-To: Proxmox VE development discussion 
To: pve-devel@lists.proxmox.com
Cc: Christian Moser 
Subject: [pve-devel] issues with Virtio-SCSI devicde on Proxmox...
Date: 12/08/2024 12:40:10

Hello,

I work for VSI (VMS Software Inc) which is porting the OpenVMS
operating system to x86. At this point we successfully on various
hypervisors, but have some issues on the KVM running on Proxmox.

The OpenVMS VM works just fine with SATA disks and it also works with
for example virtio-network device etc., but trying to use
virtio-scsi hangs  the driver. I have debugged this and I can
successfully configure the port/controller, send the IO request to the
device. It then gets processed by the device, which posts the results
and sets the interrupt bit in the ISR register, but it never
asserts the interrupt hence the driver never gets notified and the I/O
hangs.

I have tried both “virtio-scsi-pci” and “virtio-scsi-single”, but no
luck. The emulated virtio-scsi device is a legacy device. But
then again, the virtio-network device is also a legacy device and here
we are getting interrupts. One thing which bothers me is the
fact that the “legacy interrupt disable” bit is set in the PCI config
space of the virtio-scsi device (i.e. bit 10 at offset 4)

Questions:
*   is there a way to configure a modern virtio-scsi devcie (i.e.
disable_legacy=on) ?
*   why is the legacy interrupt bit set in the PCI config space ?
*   Are there any working driver for virtio-scsi on this KVM using
Q35 machine? i.e. any other OS

Any thoughts why these interrupts are not getting delivered on the PCIE
bus?

thanks

/cmos

___
Christian Moser
Mobile:    +358-40-5022105  
Email:  c...@maklee.com
URL:   www.maklee.com

___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel

--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] issues with Virtio-SCSI devicde on Proxmox...

2024-08-13 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
Hi,

I didn't see Fiona's response, but indeed:

https://lists.gnu.org/archive/html/qemu-devel/2021-09/msg01567.html

"virtio devices can be exposed in upto three ways

 - Legacy - follows virtio 0.9 specification. always uses PCI
ID range 0x1000-0x103F
 - Transitional - follows virtio 0.9 specification by default, but
  can auto-negotiate with guest for 1.0 spce. Always
  uses PCI ID range 0x1000-0x103F
 - Modern - follows virtio 1.0 specification. always uses PCI
ID range 0x1040-0x107F

With QEMU, historically devices placed on a PCI bus will always default
to being in transitional mode, while devices placed on a PCI-E bus will
always dfault to being in modern mode.
"


 Original Message 
From: Fiona Ebner 
Reply-To: Proxmox VE development discussion 
To: Proxmox VE development discussion ,
Christian Moser 
Subject: Re: [pve-devel] issues with Virtio-SCSI devicde on Proxmox...
Date: 13/08/2024 10:55:47

Hi,

Am 12.08.24 um 12:40 schrieb Christian Moser:
> Hello,
> 
> I work for VSI (VMS Software Inc) which is porting the OpenVMS
> operating system to x86. At this point we successfully on various
> hypervisors, but have some issues on the KVM running on Proxmox.
> 
> The OpenVMS VM works just fine with SATA disks and it also works with
> for example virtio-network device etc., but trying to use
> virtio-scsi hangs  the driver. I have debugged this and I can
> successfully configure the port/controller, send the IO request to
> the
> device. It then gets processed by the device, which posts the results
> and sets the interrupt bit in the ISR register, but it never
> asserts the interrupt hence the driver never gets notified and the
> I/O hangs.
> 
> I have tried both “virtio-scsi-pci” and “virtio-scsi-single”, but no
> luck. The emulated virtio-scsi device is a legacy device. But
> then again, the virtio-network device is also a legacy device and
> here we are getting interrupts. One thing which bothers me is the
> fact that the “legacy interrupt disable” bit is set in the PCI config
> space of the virtio-scsi device (i.e. bit 10 at offset 4)
> 
> Questions:
> * is there a way to configure a modern virtio-scsi devcie (i.e.
> disable_legacy=on) ?

I've already answered this question when you asked in a mail addressed
directly to me:

Am 12.08.24 um 11:58 schrieb Fiona Ebner:
> Hi,
> 
> It seems that you either need to attach the "virtio-scsi-pci" device
> to
> a pcie bus or explicitly set the "disable_legacy=on" option for the
> device, neither of which Proxmox VE currently does or allows
> configuring. The only way right now seems to be to attach the disk
> yourself via custom arguments (the 'args' in the Proxmox VE VM
> configuration), but then the disk will be invisible to Proxmox VE
> operations which look at specific disk keys in the configuration!
> 
> Feel free to open a feature request on our bug tracker to make this
> configurable:
> https://bugzilla.proxmox.com/
> 
> P.S. Please write to the developer list rather than individual
> developers for such questions in the feature:
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
> 
> Best Regards,
> Fiona

> * why is the legacy interrupt bit set in the PCI config space ?

Most likely because the virtio-scsi-pci is configured without the
"disable_legacy=on" option. If not explicitly set, the option will be
"disable_legacy=auto" and when not attached to PCIe (which is the case
for Proxmox VE currently), then legacy mode will be enabled.

> * Are there any working driver for virtio-scsi on this KVM using Q35
> machine? i.e. any other OS

The virtio drivers for Windows and the ones in Linux work just fine
with
our configuration.


> Any thoughts why these interrupts are not getting delivered on the
> PCIE bus?

We do not configure the virtio-scsi-pci on a PCIe bus currently, see my
initial answer.

Best Regards,
Fiona


___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel

Re: [pve-devel] FW: issues with Virtio-SCSI devicde on Proxmox...

2024-08-14 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
>>Alexandre,
>>
>>the statement below is not true for our case. The OpenVMS guest OS is
>>using a PCIE bus, so the virtio-scsi device should be exposed as
>>"modern", but is not. Not sure why not at this point

See Fiona's response:

the PCI Express bridge is present, but the virtio-net device is
plugged into a simple PCI bridge.

PCI Express slots are mostly used for passthrough devices.

/usr/share/perl5/PVE/QemuServer/PCI.pm

my $pcie_addr_map;
sub get_pcie_addr_map {
$pcie_addr_map = {
vga => { bus => 'pcie.0', addr => 1 },
hostpci0 => { bus => "ich9-pcie-port-1", addr => 0 },
hostpci1 => { bus => "ich9-pcie-port-2", addr => 0 },
hostpci2 => { bus => "ich9-pcie-port-3", addr => 0 },
hostpci3 => { bus => "ich9-pcie-port-4", addr => 0 },
hostpci4 => { bus => "ich9-pcie-port-5", addr => 0 },
hostpci5 => { bus => "ich9-pcie-port-6", addr => 0 },
hostpci6 => { bus => "ich9-pcie-port-7", addr => 0 },
hostpci7 => { bus => "ich9-pcie-port-8", addr => 0 },
hostpci8 => { bus => "ich9-pcie-port-9", addr => 0 },
hostpci9 => { bus => "ich9-pcie-port-10", addr => 0 },
hostpci10 => { bus => "ich9-pcie-port-11", addr => 0 },
hostpci11 => { bus => "ich9-pcie-port-12", addr => 0 },
hostpci12 => { bus => "ich9-pcie-port-13", addr => 0 },
hostpci13 => { bus => "ich9-pcie-port-14", addr => 0 },
hostpci14 => { bus => "ich9-pcie-port-15", addr => 0 },
hostpci15 => { bus => "ich9-pcie-port-16", addr => 0 },
# win7 is picky about pcie assignments
hostpci0bus0 => { bus => "pcie.0", addr => 16 },
hostpci1bus0 => { bus => "pcie.0", addr => 17 },
hostpci2bus0 => { bus => "pcie.0", addr => 18 },
hostpci3bus0 => { bus => "pcie.0", addr => 19 },
ivshmem => { bus => 'pcie.0', addr => 20 },
hostpci4bus0 => { bus => "pcie.0", addr => 9 },
hostpci5bus0 => { bus => "pcie.0", addr => 10 },
hostpci6bus0 => { bus => "pcie.0", addr => 11 },
hostpci7bus0 => { bus => "pcie.0", addr => 12 },
hostpci8bus0 => { bus => "pcie.0", addr => 13 },
hostpci9bus0 => { bus => "pcie.0", addr => 14 },
hostpci10bus0 => { bus => "pcie.0", addr => 15 },
hostpci11bus0 => { bus => "pcie.0", addr => 21 },
hostpci12bus0 => { bus => "pcie.0", addr => 22 },
hostpci13bus0 => { bus => "pcie.0", addr => 23 },
hostpci14bus0 => { bus => "pcie.0", addr => 24 },
hostpci15bus0 => { bus => "pcie.0", addr => 25 },
}

___
Christian Moser
Mobile:    +358-40-5022105  
Email: c...@maklee.com
URL:   www.maklee.com
-Original Message-
From: DERUMIER, Alexandre  
Sent: Wednesday, August 14, 2024 09:45
To: pve-devel@lists.proxmox.com; c...@maklee.com
Subject: Re: [pve-devel] issues with Virtio-SCSI devicde on Proxmox...

Hi,

I didn't see the responde of Fiona but indeed:

https://lists.gnu.org/archive/html/qemu-devel/2021-09/msg01567.html

"virtio devices can be exposed in upto three ways

 - Legacy - follows virtio 0.9 specification. always uses PCI
    ID range 0x1000-0x103F
 - Transitional - follows virtio 0.9 specification by default, but
  can auto-negotiate with guest for 1.0 spce. Always
  uses PCI ID range 0x1000-0x103F
 - Modern - follows virtio 1.0 specification. always uses PCI
    ID range 0x1040-0x107F

With QEMU, historically devices placed on a PCI bus will always default
to being in transitional mode, while devices placed on a PCI-E bus will
always dfault to being in modern mode.
"


 Original Message 
From: Fiona Ebner 
Reply-To: Proxmox VE development discussion 
To: Proxmox VE development discussion ,
Christian Moser 
Subject: Re: [pve-devel] issues with Virtio-SCSI devicde on Proxmox...
Date: 13/08/2024 10:55:47

Hi,

Am 12.08.24 um 12:40 schrieb Christian Moser:
> Hello,
> 
> I work for VSI (VMS Software Inc) which is porting the OpenVMS 
> operating system to x86. At this point we successfully on various 
> hypervisors, but have some issues on the KVM running on Proxmox.
> 
> The OpenVMS VM works just fine with SATA disks and it also works with
> for example virtio-network device etc., but trying to use virtio-scsi
> hangs  the driver. I have debugged this and I can successfully 
> configure the port/controller, send the IO request to the device. It 
> then gets processed by the device, which posts the results and sets 
> the interrupt bit in the ISR register, but it never asserts the 
> interrupt hence the driver never gets notified and the I/O hangs.
> 
> I have tried both “virtio-scsi-pci” and “virtio-scsi-single”, but no 
> luck. The emulated virtio-scsi device is a legacy device. But then 
> again, the virtio-network device is also a legacy device and here we 
> are getting interrupts. One thing which bothers me is the fact t

Re: [pve-devel] FW: issues with Virtio-SCSI devicde on Proxmox...

2024-08-14 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
>>I'm talking about virtio-scsi. Our virtio-network device is working
>>fine 

Yes, sorry, I meant to say virtio-scsi.

All PCI devices, excluding passthrough devices (with the pcie=on
flag), are actually plugged into a PCI bridge:


sub get_pci_addr_map {
$pci_addr_map = {
piix3 => { bus => 0, addr => 1, conflict_ok => qw(ehci)  },
ehci => { bus => 0, addr => 1, conflict_ok => qw(piix3) }, #
instead of piix3 on arm
vga => { bus => 0, addr => 2, conflict_ok => qw(legacy-igd) },
'legacy-igd' => { bus => 0, addr => 2, conflict_ok => qw(vga)
}, # legacy-igd requires vga=none
balloon0 => { bus => 0, addr => 3 },
watchdog => { bus => 0, addr => 4 },
scsihw0 => { bus => 0, addr => 5, conflict_ok => qw(pci.3) },
'pci.3' => { bus => 0, addr => 5, conflict_ok => qw(scsihw0)
}, # also used for virtio-scsi-single bridge
scsihw1 => { bus => 0, addr => 6 },
ahci0 => { bus => 0, addr => 7 },
qga0 => { bus => 0, addr => 8 },
spice => { bus => 0, addr => 9 },
virtio0 => { bus => 0, addr => 10 },
virtio1 => { bus => 0, addr => 11 },
virtio2 => { bus => 0, addr => 12 },
virtio3 => { bus => 0, addr => 13 },
virtio4 => { bus => 0, addr => 14 },
virtio5 => { bus => 0, addr => 15 },
hostpci0 => { bus => 0, addr => 16 },
hostpci1 => { bus => 0, addr => 17 },
net0 => { bus => 0, addr => 18 },
net1 => { bus => 0, addr => 19 },
net2 => { bus => 0, addr => 20 },
net3 => { bus => 0, addr => 21 },
net4 => { bus => 0, addr => 22 },
net5 => { bus => 0, addr => 23 },
vga1 => { bus => 0, addr => 24 },
vga2 => { bus => 0, addr => 25 },
vga3 => { bus => 0, addr => 26 },
hostpci2 => { bus => 0, addr => 27 },
hostpci3 => { bus => 0, addr => 28 },
#addr29 : usb-host (pve-usb.cfg)
'pci.1' => { bus => 0, addr => 30 },
'pci.2' => { bus => 0, addr => 31 },
'net6' => { bus => 1, addr => 1 },
'net7' => { bus => 1, addr => 2 },
'net8' => { bus => 1, addr => 3 },
'net9' => { bus => 1, addr => 4 },
'net10' => { bus => 1, addr => 5 },
'net11' => { bus => 1, addr => 6 },
'net12' => { bus => 1, addr => 7 },
'net13' => { bus => 1, addr => 8 },
'net14' => { bus => 1, addr => 9 },
'net15' => { bus => 1, addr => 10 },
'net16' => { bus => 1, addr => 11 },
'net17' => { bus => 1, addr => 12 },
'net18' => { bus => 1, addr => 13 },
'net19' => { bus => 1, addr => 14 },
'net20' => { bus => 1, addr => 15 },
'net21' => { bus => 1, addr => 16 },
'net22' => { bus => 1, addr => 17 },
'net23' => { bus => 1, addr => 18 },
'net24' => { bus => 1, addr => 19 },
'net25' => { bus => 1, addr => 20 },
'net26' => { bus => 1, addr => 21 },
'net27' => { bus => 1, addr => 22 },
'net28' => { bus => 1, addr => 23 },
'net29' => { bus => 1, addr => 24 },
'net30' => { bus => 1, addr => 25 },
'net31' => { bus => 1, addr => 26 },
'xhci' => { bus => 1, addr => 27 },
'pci.4' => { bus => 1, addr => 28 },
'rng0' => { bus => 1, addr => 29 },
'pci.2-igd' => { bus => 1, addr => 30 }, # replaces pci.2 in
case a legacy IGD device is passed through
'virtio6' => { bus => 2, addr => 1 },
'virtio7' => { bus => 2, addr => 2 },
'virtio8' => { bus => 2, addr => 3 },
'virtio9' => { bus => 2, addr => 4 },
'virtio10' => { bus => 2, addr => 5 },
'virtio11' => { bus => 2, addr => 6 },
'virtio12' => { bus => 2, addr => 7 },
'virtio13' => { bus => 2, addr => 8 },
'virtio14' => { bus => 2, addr => 9 },
'virtio15' => { bus => 2, addr => 10 },
'ivshmem' => { bus => 2, addr => 11 },
'audio0' => { bus => 2, addr => 12 },
hostpci4 => { bus => 2, addr => 13 },
hostpci5 => { bus => 2, addr => 14 },
hostpci6 => { bus => 2, addr => 15 },
hostpci7 => { bus => 2, addr => 16 },
hostpci8 => { bus => 2, addr => 17 },
hostpci9 => { bus => 2, addr => 18 },
hostpci10 => { bus => 2, addr => 19 },
hostpci11 => { bus => 2, addr => 20 },
hostpci12 => { bus => 2, addr => 21 },
hostpci13 => { bus => 2, addr => 22 },
hostpci14 => { bus => 2, addr => 23 },
hostpci15 => { bus => 2, addr => 24 },
'virtioscsi0' => { bus => 3, addr => 1 },
'virtioscsi1' => { bus => 3, addr => 2 },
'virtioscsi2' => { bus => 3, addr => 3 },
'virtioscsi3' => { bus => 3, addr => 4 },
'virtioscsi4' => { bus => 3, addr => 5 },
'virtioscsi5' => { bus => 3, addr => 6 },
'virtioscsi6' => { bus => 3, addr => 7 },
'virtioscsi7' => { bus => 3, addr => 8 },
'virtioscsi8' => { bus => 3, addr 

Re: [pve-devel] [PATCH SERIES storage/qemu-server/-manager] RFC : add lvmqcow2 storage support

2024-08-29 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---



>>just my personal opinion, maybe you also want to wait for more
>>feedback from somebody else...
>>(also i just glanced over the patches, so correct me if I'm wrong)

Hi Dominik !

>>i see some problems with this approach (some are maybe fixable, some
>>probably not?)

>>* as you mentioned, if the storage is fast enough you have a runaway
>>VM
>>   this is IMHO not acceptable, as that leads to VMs that are
>>completely blocked and
>>   can't do anything. I fear this will generate many support calls
>>why their guests
>>   are stopped/hanging...

If the chunk size is correctly configured, it shouldn't happen.
(For example, if the storage is able to write 500MB/s, use a
chunk size of 5~10GB; this gives you a 10~20s window.)
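
(Back-of-the-envelope, with the same made-up numbers as above:)

# how long the guest needs to fill a freshly added chunk at full write speed
my $write_speed_mb_s = 500;       # assumed sustained write speed of the storage
my $chunk_size_gb    = 10;
printf "window before the next extend: ~%ds\n",
    ($chunk_size_gb * 1024) / $write_speed_mb_s;    # ~20s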


>>* the code says containers are supported (rootdir => 1) but i don't
>>see how?
>>   there is AFAICS no code to handle them in any way...
>>   (maybe just falsely copied?)

Oh, indeed, I haven't checked CTs yet. (It could be implemented with a
storage usage check every X seconds, but I'm not sure it scales well
with a lot of CT volumes.)



>>* you lock the local blockextend call, but give it a timeout of 60
>>seconds.
>>   what if that timeout expires? the vm again gets completely blocked
>>until it's
>>   resized by pvestatd

I'm locking it to avoid multiple extends. I set an arbitrary 60s, but
it could be a lot lower (an LVM extend doesn't take more than 1s for me).




>>* IMHO pvestatd is the wrong place to make such a call. It's already
>>doing much
>>   stuff in a way where a single storage operation blocks many other
>>things
>>   (metrics, storage/vm status, ballooning, etc..)
>>
>>   cramming another thing in there seems wrong and will only lead to
>>even more people
>>   complaining about the pvestatd not working, only in this case the
>>vms
>>   will be in an io-error state indefinitely then.
>>
>>   I'd rather make a separate daemon/program, or somehow integrate it
>>into
>>   qmeventd (but then it would have to become multi
>>threaded/processes/etc.
>>   to not block it's other purposes)

Yes, I agree with this. (BTW, if one day we could have threading,
queues, or a separate daemon for each storage monitor, it would help a
lot with hanging storages.)




>>* there is no cluster locking?
>>   you only mention
>>
>>   ---8<---
>>   #don't·use·global·cluster·lock·here,·use·on·native·local·lvm·lock
>>   --->8---
>>
>>   but don't configure any lock? (AFAIR lvm cluster locking needs
>>additional
>>   configuration/daemons?)
>>
>>   this *will* lead to errors if multiple VMs on different hosts try
>>   to resize at the same time.
>>
>>   even with cluster locking, this will very soon lead to contention,
>>since
>>   storage operations are inherently expensive, e.g. if i have
>>   10-100 VMs wanting to resize at the same time, some of them will
>>run
>>   into a timeout or at least into the blocking state.
>>
>>   That does not even need much IO, just bad luck when multiple VMs
>>go
>>   over the threshold within a short time.

Mmm, OK, this one could indeed be a problem.
I need to look at the oVirt code (because they have really been using
this in production for 10 years), to see how they handle locks.


>>All in all, I'm not really sure if the gain (snapshots on shared LVM)
>>is worth
>>the potential cost in maintenance, support and customer
>>dissatisfaction with
>>stalled/blocked VMs.

>>Generally a better approach could be for your customers to use some
>>kind of shared filesystem (GFS2/OCFS/?). I know those are not really
>>tested or supported by us, but i would hope that they scale and
>>behave
>>better than qcow2-on-lvm-with-dynamic-resize.

Yes, if we can get it working fine, it could be *a lot* better. I'm
still afraid of kernel bugs/regressions. (At least with OCFS2, 10
years ago, it was a nightmare. I used it in production for 1~2 years.)

For GFS2, there is a user on the Proxmox forum using it in production
since 2019 without any problem:
https://forum.proxmox.com/threads/pve-7-x-cluster-setup-of-shared-lvm-lv-with-msa2040-sas-partial-howto.57536/


I need to test whether we get storage timeouts if one node goes down.
(For OCFS2 it was the case; the forum user told me that it was OK with
GFS2.)

I'll do tests on my side.

I really need this feature for a lot of on-prem customers migrating
from VMware. They are mostly small clusters (2~3 nodes with a
direct-attach SAN).

So even if GFS2 doesn't scale too well with many nodes, personally it
would be enough for me if we limit the number of supported nodes.


>>best regards
>>Dominik

Thanks again for the review!



(BTW, I have some small fixes to do to the pvestatd code in this patch
series.)

--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH SERIES storage/qemu-server/-manager] RFC : add lvmqcow2 storage support

2024-08-30 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
> 
Hi,
I have done more tests

> > * there is no cluster locking?
> >    you only mention
> > 
> >    ---8<---
> >   
> > #don't·use·global·cluster·lock·here,·use·on·native·local·lvm·lock
> >    --->8---
> > 
> >    but don't configure any lock? (AFAIR lvm cluster locking needs
> > additional
> >    configuration/daemons?)
> > 
> >    this *will* lead to errors if multiple VMs on different hosts
> > try
> >    to resize at the same time.
> > 
> >    even with cluster locking, this will very soon lead to
> > contention,
> > since
> >    storage operations are inherently expensive, e.g. if i have
> >    10-100 VMs wanting to resize at the same time, some of them will
> > run
> >    into a timeout or at least into the blocking state.
> > 
> >    That does not even need much IO, just bad luck when multiple VMs
> > go
> >    over the threshold within a short time.

>>>mmm,ok, This one could be a problem indeed.  
>>>I need to look at ovirt code. (because they are really using it in
>>>production since 10year), to see how they handle locks.

OK, you are right, we need a cluster lock here. Red Hat is using the
sanlock daemon, or dlm with corosync, for coordination.



* IMHO pvestatd is the wrong place to make such a call. It's
already
doing much
   stuff in a way where a single storage operation blocks many
other
things
   (metrics, storage/vm status, ballooning, etc..)

   cramming another thing in there seems wrong and will only lead
to
even more people
   complaining about the pvestatd not working, only in this case
the
vms
   will be in an io-error state indefinitely then.

   I'd rather make a separate daemon/program, or somehow integrate
it
into
   qmeventd (but then it would have to become multi
threaded/processes/etc.
   to not block it's other purposes)

>>Yes, I agree with this.  (BTW, if one day, we could have threading,
>>queues  or seperate daemon for each storage monitor, it could help a
>>lot with hangy storage)

OK, I think we could manage a queue of disks to resize somewhere.
pvestatd could fill the queue on io-error, and it could be processed
by qmeventd (or maybe another daemon).
It could be done sequentially, as we need a cluster lock anyway.
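
Very rough sketch of that split (all names are hypothetical, just to
show the idea of detection vs. a sequential worker):

# pvestatd side: only record which disk hit io-error, don't block the status loop
my @resize_queue;

sub enqueue_resize {
    my ($vmid, $drive) = @_;
    push @resize_queue, { vmid => $vmid, drive => $drive };
}

# worker side (qmeventd or a separate daemon): process sequentially
sub process_resize_queue {
    while (my $job = shift @resize_queue) {
        # take the cluster-wide lock, lvextend the LV,
        # then tell qemu about the new size
    }
}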


> > All in all, I'm not really sure if the gain (snapshots on shared
> > LVM)
> > is worth
> > the potential cost in maintenance, support and customer
> > dissatisfaction with
> > stalled/blocked VMs.

> > Generally a better approach could be for your customers to use some
> > kind of shared filesystem (GFS2/OCFS/?). I know those are not
> > really
> > tested or supported by us, but i would hope that they scale and
> > behave
> > better than qcow2-on-lvm-with-dynamic-resize.

>>>Yes, if we can get it working fine, it could be *a lot* better.  I'm
>>>still afraid about kernel bug/regression. (at least in ocfs2, 10
>>>year
>>>ago, it was a knightmare. I have used it in prod for 1~2 year).

>>>For gfs2, they are a user on proxmox forum, using it in production
>>>since 2019 without any problem.


>>>I need to test if we can have storage timeout if one node goes down.
>>>(for ocfs2, it was the case, the forum user tell me that it was ok
>>>with
>>gfs2)

>>>I'll do test on my side.

OK, I have done tests with gfs2. Installation is easy, and it's well
integrated with corosync (using the dlm daemon to manage locks on top
of corosync). (Note: it needs fencing if corosync dies; it's currently
not able to resume the lock.)
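
(For context, the setup itself is basically just a mkfs + mount on top
of a shared LV; the names below are placeholders, and the lock table
prefix has to match the corosync cluster name:)

    mkfs.gfs2 -p lock_dlm -t <clustername>:<fsname> -j <nr-of-nodes> /dev/<vg>/<lv>
    mount -t gfs2 -o noatime /dev/<vg>/<lv> /mnt/pve/gfs2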

It's working fine with preallocated qcow2. I get almost the same
performance as a raw device, around 20k iops at 4k && 3GB/s on my test
storage.

But when the file is not preallocated (or when you take a snapshot of
a preallocated drive, so new writes are not preallocated anymore), the
performance is abysmal (60 iops at 4k, 40MB/s). It seems to be a
well-known problem with gfs2, caused by the cluster lock on block
allocation.


I'll do tests with ocfs2 to compare


--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


[pve-devel] qcow2 internal snapshot: 4k write speed reduce to 100~200 iops on ssd ?

2024-08-30 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
Hi,

I was doing tests with gfs2 && ocfs2,

and I have noticed 4k randwrite iops going from 2 iops to
100~200 iops when a snapshot is present.

I thought it was related to gfs2 && ocfs2 allocation,
but I can reproduce it too with a simple qcow2 file on
a local SSD drive.

Is this expected?

(I haven't used qcow2 snapshots in 10 years, so I really don't
remember the performance.)


Using an external snapshot file, I'm around 1iops.

test inside the vm:

fio --filename=/dev/sdX --ioengine=libaio --direct=1 --bs=4k --rw=randwrite --numjobs=64


--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH ifupdown2 0/1] Support IPv6 in vxlan

2024-10-09 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
Hi,

Thanks for the patch !

Have you also submitted it on the ifupdown2 GitHub?



 Message initial 
De: apalrd via pve-devel 
Répondre à: Proxmox VE development discussion 
À: pve-devel@lists.proxmox.com
Cc: apalrd 
Objet: [pve-devel] [PATCH ifupdown2 0/1] Support IPv6 in vxlan
Date: 08/10/2024 06:01:08

___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel

--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH ifupdown2 1/1] Correctly handle IPv6 addresses in vxlan

2024-10-09 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
Have a look at the ifupdown2 GitHub; there are 2 old pull requests
about this (never merged / never completed):



https://github.com/CumulusNetworks/ifupdown2/pull/172

"
For this we would need a new attribute vxlan-local-tunnelip6, we don't
want to reuse the same attribute for ipv6.
We are using netlink to configure vxlans, so it's important to use a
different attribute to set the proper netlink attribute (I don't want
to have things like if IPAddress(value).version == 6:  set
Link.IFLA_VXLAN_LOCAL
"

https://github.com/CumulusNetworks/ifupdown2/pull/182


So, at minimum, this needs to use a different "vxlan-local-tunnelip6"
attribute for IPv6.


 Message initial 
De: apalrd 
À: pve-devel@lists.proxmox.com
Cc: apalrd 
Objet: [PATCH ifupdown2 1/1] Correctly handle IPv6 addresses in vxlan
Date: 08/10/2024 06:01:09

---
 ifupdown2/addons/vxlan.py | 26 --
 1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/ifupdown2/addons/vxlan.py b/ifupdown2/addons/vxlan.py
index 084aec9..4aa8e50 100644
--- a/ifupdown2/addons/vxlan.py
+++ b/ifupdown2/addons/vxlan.py
@@ -51,7 +51,7 @@ class vxlan(Vxlan, moduleBase):
 },
 "vxlan-local-tunnelip": {
 "help": "vxlan local tunnel ip",
-    "validvals": [""],
+    "validvals": [","],
 "example": ["vxlan-local-tunnelip 172.16.20.103"]
 },
 "vxlan-svcnodeip": {
@@ -66,7 +66,7 @@ class vxlan(Vxlan, moduleBase):
 },
 "vxlan-remoteip": {
 "help": "vxlan remote ip",
-    "validvals": [""],
+    "validvals": [","],
 "example": ["vxlan-remoteip 172.16.22.127"],
 "multiline": True
 },
@@ -521,7 +521,7 @@ class vxlan(Vxlan, moduleBase):
 local = self._vxlan_local_tunnelip
 
 if link_exists:
-    cached_ifla_vxlan_local =
cached_vxlan_ifla_info_data.get(Link.IFLA_VXLAN_LOCAL)
+    cached_ifla_vxlan_local =
cached_vxlan_ifla_info_data.get(Link.IFLA_VXLAN_LOCAL) or
cached_vxlan_ifla_info_data.get(Link.IFLA_VXLAN_LOCAL6)
 
 # on ifreload do not overwrite anycast_ip to individual ip
 # if clagd has modified
@@ -547,7 +547,7 @@ class vxlan(Vxlan, moduleBase):
 
 if local:
 try:
-    local = ipnetwork.IPv4Address(local)
+    local = ipnetwork.IPAddress(local)
 
 if local.initialized_with_prefixlen:
 self.logger.warning("%s: vxlan-local-tunnelip %s:
netmask ignored" % (ifname, local))
@@ -559,13 +559,19 @@ class vxlan(Vxlan, moduleBase):
 if local:
 if local != cached_ifla_vxlan_local:
 self.logger.info("%s: set vxlan-local-tunnelip %s" %
(ifname, local))
-    user_request_vxlan_info_data[Link.IFLA_VXLAN_LOCAL] =
local
+    if local.version == 6:
+   
user_request_vxlan_info_data[Link.IFLA_VXLAN_LOCAL6] = local
+    else:
+   
user_request_vxlan_info_data[Link.IFLA_VXLAN_LOCAL] = local
 
 # if both local-ip and anycast-ip are identical the
function prints a warning
 self.syntax_check_localip_anycastip_equal(ifname,
local, self._clagd_vxlan_anycast_ip)
 elif cached_ifla_vxlan_local:
 self.logger.info("%s: removing vxlan-local-tunnelip (cache
%s)" % (ifname, cached_ifla_vxlan_local))
-    user_request_vxlan_info_data[Link.IFLA_VXLAN_LOCAL] = None
+    if cached_ifla_vxlan_local.version == 6:
+    user_request_vxlan_info_data[Link.IFLA_VXLAN_LOCAL6] =
None
+    else:
+    user_request_vxlan_info_data[Link.IFLA_VXLAN_LOCAL] =
None
 
 return local
 
@@ -1236,7 +1242,7 @@ class vxlan(Vxlan, moduleBase):
 if remoteips:
 try:
 for remoteip in remoteips:
-    ipnetwork.IPv4Address(remoteip)
+    ipnetwork.IPAddress(remoteip)
 except Exception as e:
 self.log_error('%s: vxlan-remoteip: %s' %
(ifaceobj.name, str(e)))
 
@@ -1244,7 +1250,7 @@ class vxlan(Vxlan, moduleBase):
 # purge any removed remote ip
 old_remoteips = self.get_old_remote_ips(ifaceobj.name)
 
-    if vxlan_purge_remotes or remoteips or (remoteips !=
old_remoteips):
+    if vxlan_purge_remotes or (isinstance(remoteips,list) and
remoteips != old_remoteips):
 # figure out the diff for remotes and do the bridge fdb
updates
 # only if provisioned by user and not by an vxlan external
 # controller.
@@ -1281,8 +1287,8 @@ class vxlan(Vxlan, moduleBase):
 "00:00:00:00:00:00",
 None, True, addr
 )
-    except Exception:
-    pass
+    except Exception as e:
+

Re: [pve-devel] [PATCH ifupdown2 1/1] Correctly handle IPv6 addresses in vxlan

2024-10-09 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
Personally, I'm OK with your patch.

>>Ultimately I disagreed with the solution to use a separate parameter
>>for IPv6, for the following reasons:
>>- We can only have one local tunnel IP, so having two parameters
>>means we need to check if the other one has been set (since setting
>>both would be invalid)
>>- There are already other cases in ifupdown2 which do ip.version == 6
>>including common parameters like address and gateway, and most
>>parameters do take both IPv4 and IPv6 addresses (such as the remoteip
>>field in vxlan), so having one parameter for both families would be
>>consistent with other parameters in the interfaces file which take
>>both families (regardless of the kernel implementation having two
>>fields instead of one in this case)
>>- ifupdown-ng already solved this problem using a single parameter
>>instead of two, so doing something else would diverge ifupdown2 vs
>>ifupdown-ng syntax which should be the same


But we need to be careful not to diverge too much from upstream, in
case official support is added one day by Cumulus/NVIDIA (OK, we have
been waiting for that for 4 years now... and to be honest, upstream is
a lot less active since NVIDIA bought Cumulus/Mellanox).

That would mean maintaining both syntaxes, or planning a config update
between 2 major PVE releases. Not a big deal.


But yes, we already use "address ..." for example for both IPv4/IPv6,
and here, as you said, we can only have either IPv4 or IPv6 for the
local tunnel.


>>I could add additional error checking to ensure that remoteip[] and
>>localip are the same address family if you’d like. Currently that
>>results in a Netlink exception which gets passed back as an error
>>message. The kernel only allows one address family for a vxlan
>>interface.

Yes, it would be great to test that.



BTW, I haven't followed the ifupdown-ng project for a long time (I
just followed the early days). Is the project really active, and does
it have almost the same features as ifupdown2 (netlink support, reload
support with a diff of the running config, ...)?

 Message initial 
De: Andrew 
À: "DERUMIER, Alexandre" 
Cc: pve-devel@lists.proxmox.com 
Objet: Re: [PATCH ifupdown2 1/1] Correctly handle IPv6 addresses in
vxlan
Date: 09/10/2024 18:55:50

Yes, I read all of the PRs and discussion on ifupdown2 GitHub before
implementing this.

Ultimately I disagreed with the solution to use a separate parameter
for IPv6, for the following reasons:
- We can only have one local tunnel IP, so having two parameters means
we need to check if the other one has been set (since setting both
would be invalid)
- There are already other cases in ifupdown2 which do ip.version == 6
including common parameters like address and gateway, and most
parameters do take both IPv4 and IPv6 addresses (such as the remoteip
field in vxlan), so having one parameter for both families would be
consistent with other parameters in the interfaces file which take both
families (regardless of the kernel implementation having two fields
instead of one in this case)
- ifupdown-ng already solved this problem using a single parameter
instead of two, so doing something else would diverge ifupdown2 vs
ifupdown-ng syntax which should be the same

I could add additional error checking to ensure that remoteip[] and
localip are the same address family if you’d like. Currently that
results in a Netlink exception which gets passed back as an error
message. The kernel only allows one address family for a vxlan
interface.

Thanks,

Andrew

> On Oct 9, 2024, at 11:37, DERUMIER, Alexandre
>  wrote:
> 
> Try to look at ifupdown2 github, their are 2 old pull request about
> this (never merged/ never completed)
> 
> 
> 
> https://github.com/CumulusNetworks/ifupdown2/pull/172
> 
> "
> For this we would need a new attribute vxlan-local-tunnelip6, we
> don't
> want to reuse the same attribute for ipv6.
> We are using netlink to configure vxlans, so it's important to use a
> different attribute to set the proper netlink attribute (I don't want
> to have things like if IPAddress(value).version == 6:  set
> Link.IFLA_VXLAN_LOCAL
> "
> 
> https://github.com/CumulusNetworks/ifupdown2/pull/182
> 
> 
> so, at minimum, this need to use a different "vxlan-local-tunnelip6"
> attribute for ipv6
> 
> 
>  Message initial 
> De: apalrd 
> À: pve-devel@lists.proxmox.com
> Cc: apalrd 
> Objet: [PATCH ifupdown2 1/1] Correctly handle IPv6 addresses in vxlan
> Date: 08/10/2024 06:01:09
> 
> ---
>  ifupdown2/addons/vxlan.py | 26 --
>  1 file changed, 16 insertions(+), 10 deletions(-)
> 
> diff --git a/ifupdown2/addons/vxlan.py b/ifupdown2/addons/vxlan.py
> index 084aec9..4aa8e50 100644
> --- a/ifupdown2/addons/vxlan.py
> +++ b/ifupdown2/addons/vxlan.py
> @@ -51,7 +51,7 @@ class vxlan(Vxlan, moduleBase):
>  },
>  "vxlan-local-tunnelip": {
>  "help": "vxlan local tunnel ip",
> -    "validvals": [""],
> +    "validvals": [","],
>   

Re: [pve-devel] [PATCH pve-network 1/1] Update vxlan plugin to emit local tunnel IP

2024-10-09 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
The patch logic seems OK to me (I haven't tested it).

>>
>>     for my $address (@peers) {
>>- next if $address eq $ifaceip;
>>- push @iface_config, "vxlan_remoteip $address";
>>+     push @iface_config, "vxlan_local_tunnelip $address" if $address eq 
>>$ifaceip;
>>+    push @iface_config, "vxlan_remoteip $address" if $address ne 
>>$ifaceip;
>>     }

Just be careful about the indentation, and maybe use an if/else.
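
i.e. something like this (same logic as in the patch, just
restructured; untested):

    for my $address (@peers) {
        if ($address eq $ifaceip) {
            push @iface_config, "vxlan_local_tunnelip $address";
        } else {
            push @iface_config, "vxlan_remoteip $address";
        }
    }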


--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH manager 1/2] close #4369: api: optionally only suggest unique IDs

2024-09-30 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
>>The pmxcfs filesystem has limits, and I do not really want to waste
>>space for such things. I would prefer the run-length encoded list.

>>@Alexandre: Why do you want to keep a backup of old config files?

I don't need the content of old configs (I have backups anyway).

I was thinking more about some kind of "atomic" delete, like a simple
move of vmid.conf, so that we can do a lot of deletes in parallel
without needing a global lock.

To be honest, I don't know the internals of pmxcfs well enough to know
what the best method is.

My main problem currently is the limited parallelism for VM
create/delete when we create VMs with automation scripts.




Talking about unique IDs, is there any (long-term) plan to implement
(optional) UUIDs one day? (I know that it's a lot of work, but it would
be useful to be able to share storage between different clusters, and
to move VMs between them without needing to rewrite the volume data
with a different ID.)



 Message initial 
De: Dietmar Maurer 
À: Severen Redwood , Proxmox VE
development discussion , Alexandre
Derumier 
Objet: Re: [pve-devel] [PATCH manager 1/2] close #4369: api: optionally
only suggest unique IDs
Date: 30/09/2024 09:50:25

> This approach would also use more storage as you now have the
> overhead
> of FS metadata for every single ID you have marked as used.
> 
> Dietmar, what do you think is the best option here? I'm personally
> leaning towards using the list with your run-length encoding
> suggestion,
> but I'm open to implementing Alexandre's idea if you think it's worth
> it.

The pmxcfs filesystem has limits, and I do not really want to waste
space for such things. I would prefer the run-length encoded list.

@Alexandre: Why do you want to keep a backup of old config files?


--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH SERIES openid/access-control/docs/manager] fix #4411: add support for openid groups

2024-10-06 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
I'm also looking for this feature for my production setup :)

 Message initial 
De: Thomas Skinner 
Répondre à: Proxmox VE development discussion 
À: pve-devel@lists.proxmox.com
Objet: Re: [pve-devel] [PATCH SERIES openid/access-
control/docs/manager] fix #4411: add support for openid groups
Date: 03/10/2024 03:45:15

This is still applicable to the latest master for the referenced
repositories. Any movement?

On Sun, Sep 1, 2024, 11:55 AM Thomas Skinner 
wrote:

> This patch series adds support for groups for OpenID logins.
> 
> The following options are implemented:
>   - Configurable claim for retrieving a list of groups and adding
> them to
> the
>     user in PVE
>   - Allowing "synchronization" of groups on login by overriding
> groups
> assigned
>     to the user in PVE (this option is off by default)
>   - Replacing invalid characters in group names with a configurable
> valid
> characters
>     (by default, this is an underscore '_')
> 
> The logic implemented by this patch expects the groups claim in the
> ID
> token/userinfo
> endpoint to send a list of groups.
> 
> This patch also adds support for using additional claims from the
> OpenID
> provider
> by exposing all additional claims and their values from the ID token
> (previously
> only available for the userinfo endpoint). This is necessary for
> OpenID
> providers
> that do not support additional information in the userinfo endpoint.
> 
> 
> proxmox/proxmox-openid:
> 
> Thomas Skinner (1):
>   fix #4411: openid: add library code for openid groups support
> 
>  proxmox-openid/src/lib.rs | 55 +
> --
>  1 file changed, 47 insertions(+), 8 deletions(-)
> 
> 
> pve-access-control:
> 
> Thomas Skinner (1):
>   fix #4411: openid: add logic for openid groups support
> 
>  src/PVE/API2/OpenId.pm | 32 
>  src/PVE/Auth/OpenId.pm | 21 +
>  2 files changed, 53 insertions(+)
> 
> 
> pve-docs:
> 
> Thomas Skinner (1):
>   fix #4411: openid: add docs for openid groups support
> 
>  api-viewer/apidata.js | 40 
>  pveum.adoc    | 32 
>  2 files changed, 72 insertions(+)
> 
> 
> pve-manager:
> 
> Thomas Skinner (1):
>   fix #4411: openid: add ui config for openid groups support
> 
>  www/manager6/dc/AuthEditOpenId.js | 35 -
> --
>  1 file changed, 32 insertions(+), 3 deletions(-)
> 
> --
> 2.39.2
> 
> 
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel

--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] Proposal: support for atomic snapshot of all VM disks at once

2024-10-07 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
I'm also interested. I have already seen the time drift when taking
snapshots; not great for a database where transaction time matters.


Ceph RBD supports group snapshots too.


 Message initial 
De: Ivaylo Markov via pve-devel 
Répondre à: Proxmox VE development discussion 
À: pve-devel@lists.proxmox.com
Cc: Ivaylo Markov 
Objet: Re: [pve-devel] Proposal: support for atomic snapshot of all VM
disks at once
Date: 07/10/2024 09:27:09

___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel

--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH SERIES v2 pve-storage/qemu-server] add external qcow2 snapshot support

2024-10-20 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
Hi,

Any comments about this patch series?

I really think that external snapshots could be a great feature (I
still see reports on the forum about freezes on snapshot deletion), and
support for LVM and shared SAN is really a feature that enterprise
users are waiting for. (To be honest, I have a lot of customers stuck
on VMware because of this.)


About my previous patch series with LVM dynamic extension, I think
I'll give up; it seems to be too complex, with too many corner cases.
(So: keeping shared LVM + external snapshots without thin
provisioning.)


In parallel, I have done more tests with gfs2/ocfs2, and I finally
found a way to get good performance on block allocation for thin,
non-preallocated qcow2 files.

Currently, I'm around 200 iops on gfs2 with 4k randwrite (instead of
2iops...)

(fio --rw=randwrite --direct=1 --bs=4k --ioengine=libaio --iodepth=64
--filename=/dev/sdX)

QEMU has a "preallocate" filter feature:
https://patchwork.kernel.org/project/qemu-devel/cover/20200814130348.20625-1-vsement...@virtuozzo.com/

-drive driver=qcow2,file.driver=preallocate,file.prealloc-size=1073741824,file.file.driver=file,file.file.filename=/mnt/pve/gfs2san/images/100/vm-100-disk-0.qcow2,id=drive-scsi2,if=none


which allows preallocating more blocks on the fly than requested (for
example, if you need to write a 4k block to an unallocated block, it
will reserve 128MB). This greatly reduces the number of locks and
network round-trips for filesystems like ocfs2/gfs2 when you have a lot
of writes.

Since the qcow2 format allocates random blocks consecutively, this
works very well.

I have done a small test; the fio result is around 20~30k write iops.
I'll send a patch soon, after doing more tests.


Regards,

Alexandre

 Message initial 
De: Alexandre Derumier 
À: pve-devel@lists.proxmox.com
Objet: [PATCH SERIES v2 pve-storage/qemu-server] add external qcow2
snapshot support
Date: 30/09/2024 13:31:50

This patch series implement qcow2 external snapshot support for files
&& lvm volumes

The current internal qcow2 snapshots have a lot of performance
problems.

I have tested through nfs and also local filesystem


I see that Fiona don't have same result than me, but I got something
like 200~300iops
vs 2 iops with 4k randwrite when a snapshot exist.

The result is even worst on a shared filesystem like ocfs2 or gfs2.
(around 80 iops)
I think (I'm not 100% sure) this is mostly because metadatas are not
preallocated
anymore with qcow2 internal snap.

With external snapshot, I almost don't have any performance impact when
a snapshot exist.

Also, other bugs have been reported by users for years: freezes/locks
on snapshot deletion on NFS:
https://forum.proxmox.com/threads/snapshot-removal-jams-the-vm.111648/
(The disk access seems to be frozen for the whole duration of the
delete.)

External qcow2 snapshots also allow snapshot of raw devices ,so 0
performance impact without snapshots.

This also open doors for remote snapshot export-import for storage
replication.

This V2 introduce support for qcow2 external snapshot for lvm, extra
lvm
volume is created for each snapsphot and formated with qcow2.
This is a lot more performant than lvm (non-thin/nomedata) snapshot,
and allow to use
it for shared lvm.  (I have another patch series for thick lvm dynamic
extend, but if we could have at minimum
snapshot working, it could great :)

I have tested: snasphot, snap rollback, snap delete, clone, move disk,
rename disk, create_base. (online && offline)

lxc is not yet supported, but I think we could look at implementing
the recent dm-qcow2 kernel block driver:
https://lore.kernel.org/lkml/164846619932.251310.3668540533992131988.stgit@pro/T/


storage.cfg example:

dir: local2
    path /var/lib/vz
    content snippets,vztmpl,backup,images,iso,rootdir
    snapext 1

lvmqcow2: test
    vgname test
    snapext 1
    content images


changelog v2:
 implement lvm with external qcow2 snapshots

pve-storage:

Alexandre Derumier (2):
  add external snasphot support
  add lvmqcow2 plugin: (lvm with external qcow2 snapshot)

 src/PVE/Storage.pm    |   2 +
 src/PVE/Storage/DirPlugin.pm  |   1 +
 src/PVE/Storage/LvmQcow2Plugin.pm | 460 ++
 src/PVE/Storage/Makefile  |   3 +-
 src/PVE/Storage/Plugin.pm 

Re: [pve-devel] [PATCH SERIES v2 pve-storage/qemu-server] add external qcow2 snapshot support

2024-10-22 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
 Message initial 
De: Esi Y via pve-devel 
Répondre à: Proxmox VE development discussion 
À: Proxmox VE development discussion 
Cc: Esi Y 
Objet: Re: [pve-devel] [PATCH SERIES v2 pve-storage/qemu-server] add
external qcow2 snapshot support
Date: 22/10/2024 11:51:31

> wasting developers time that then, among other things, delays actual

>>I hoped this would bump it up for Alexandre to get a response.
As far as I remember, when we implemented snapshots for qcow2 (I think
in 2010~2011, I'm getting old ^_^), only internal snapshots were
possible, because there was no block-commit job (to merge data into the
parent on snapshot deletion).

Only the block-stream job was available at that time (merging a
snapshot into its child snapshot).

I think that Red Hat has mostly worked on external snapshots these
last 10 years (mostly because they use them for backups, but also for
replication, which is not possible with internal snapshots).

And the missing block job to merge data for internal snapshots is also
why the I/O needs to be frozen during the merge.


So that's why I have never used qcow2 in production (mostly Ceph, or a
custom NetApp API for customers using NFS with raw files).

That means we currently don't have a clean snapshot solution for
shared SAN/NAS without an API.

I'm trying to fix/improve the snapshot implementation for both NAS
(NFS) && SAN (iSCSI, LVM).

Mainly because I have a lot of on-prem customers coming from VMware
with small SANs (iSCSI/Fibre Channel) who need the snapshot feature.











--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH v2 pve-storage 1/2] add external snasphot support

2024-10-24 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
> 
> 
> But even with that, you can still have performance impact.
> So yes, I think they are really usecase for workload when you only
> need
> snapshot time to time (before an upgrade for example), but max
> performance with no snaphot exist.

>>my main point here is - all other storages treat snapshots as
>>"cheap". if you combine raw+qcow2 snapshot overlays, suddenly
>>performance will get worse if you keep a snapshot around for whatever
>>reason.. 
 

OK, I redid a lot of benchmarks yesterday with a real SAN storage, and
I don't see too much difference between qcow2 and raw (something like
3iops on raw and 28000~29000 iops on qcow2).
I tested with a 2TB qcow2 file to be sure, and with the new qcow2
sub-cluster feature (extended_l2) the overhead is not too big.

The difference is a bit bigger on a local NVMe (I think because of the
low latency), but since the use case is network storage, it's OK.
Let's go for full .qcow2, it'll be easier ;)


> > > it's a bit confusing to have a volid ending with raw, with the
> > > current volume and all but the first snapshot actually being
> > > stored
> > > in qcow2 files, with the raw file being the "oldest" snapshot in
> > > the
> > > chain..

> if it's too confusing, we could use for example an .snap extension.
> (as we known that it's qcow2 behind)

>>I haven't thought yet about how to encode the snapshot name into the
>>snapshot file name, but yeah, maybe something like  that would be
>>good. or maybe snap-VMID-disk-DISK.qcow2 ?

OK, we can use snap-VMID-disk-DISK.qcow2; it'll be easier for the
regex :p
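
(Just to illustrate what I mean, the parsing side could then be as
simple as something like this; the exact capture groups are of course
still open:)

    # hypothetical example for the proposed snap-VMID-disk-DISK.qcow2 scheme
    die "unable to parse snapshot volume name '$volname'\n"
        if $volname !~ m/^snap-(\d+)-disk-(\d+)\.qcow2$/;
    my ($vmid, $disknr) = ($1, $2);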


> > > storage_migrate needs to handle external snapshots, or at least
> > > error
> > > out. 
> it should already work. (I have tested move_disk, and live migration
> +
> storage migration). qemu_img_convert offline and qemu block job for
> live.

>>but don't all of those lose the snapshots? did you test it with
>>snapshots and rollback afterwards?

OK, sorry, I tested cloning a new VM from a snapshot (which uses the
same code). I don't remember how it works with a move disk of a running
VM when snapshots exist.

> 
> The main problem is when you start a vm on a specific snasphot,
> we don't send the $snapname param.
> 
> One way could be that qemu-server check the current snapshot from
> config when doing specific action like start.

>>if we manage to find a way to make the volid always point at the top
>>overlay, then that wouldn't be needed..

Yes, indeed, if we are able to rename the current snapshot file to
vm-100-disk-0.qcow2, it's super easy :)

I need to do more tests, because drive-reopen only seems to work if
the original drive is defined with the -blockdev syntax (it seems to
clash on the node-name when it's defined with -drive).

I have begun looking into implementing -blockdev; it doesn't seem too
difficult for the start command line, but I need to check the hotplug
part.
Maybe for PVE 9? (It could also open the door to features like LUKS
encryption.)
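
For reference, a rough idea of what the start command line could look
like with -blockdev instead of -drive (node names, path and options
here are only an illustration, not a final naming scheme):

    -blockdev driver=file,filename=/var/lib/vz/images/100/vm-100-disk-0.qcow2,node-name=drive-scsi0-file,cache.direct=on \
    -blockdev driver=qcow2,file=drive-scsi0-file,node-name=drive-scsi0 \
    -device scsi-hd,drive=drive-scsi0,id=scsi0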



I'll rework all the patches after my holiday, with both the renaming
of the current snapshot and using only the .qcow2 format; it should be
a lot cleaner and more KISS.

Thanks again for the review !


--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH v2 pve-storage 2/2] add lvmqcow2 plugin: (lvm with external qcow2 snapshot)

2024-10-24 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
Thanks Fabian for your time !

I have tried to respond as much as possible. (I'm going on holiday for
1 week starting tomorrow, so sorry if I don't reply to all your
comments.)



 Message initial 
De: Fabian Grünbichler 
À: "DERUMIER, Alexandre" , pve-
de...@lists.proxmox.com 
Objet: Re: [pve-devel] [PATCH v2 pve-storage 2/2] add lvmqcow2 plugin:
(lvm with external qcow2 snapshot)
Date: 24/10/2024 09:42:00


> DERUMIER, Alexandre  hat am
> 23.10.2024 15:45 CEST geschrieben:
> 
>  
> > > I am not yet convinced this is somehow a good idea, but maybe you
> > > can
> > > convince me otherwise ;)

>>I maybe judged this too quickly - I thought this was combining LVM +
>>a dir-based storage, but this is putting the qcow2 overlays on LVs,
>>which I missed on the first pass!

Ah OK ! ^_^  Yes, this is really 100% LVM (for shared LVM).


> > > variant A: this is just useful for very short-lived snapshots
> > > variant B: these snapshots are supposed to be long-lived
> 
> Can you defined "short "/ "long"  ? and the different usecase ?
> 
> because for me, a snapshot is a snapshot. Sometime I take a snapshot
> before doing some critical changes, but I can't known if I need to
> rollback in next 2h, or next month.

>>yes, this would be an example of a short-lived snapshot
 
ok
> I think that "long-lived" usecase is backup (but we don't need it),
> or replication (this could apply here, if we want to add replication
> for disaster recovery)

>>backup would also be short-lived usually (the snapshot is just to
>>take the backup, not to keep a backup). long-lived usually is
>>something like "take daily snapshot and keep for a few weeks for file
>>recovery", in addition to regular backups. or "snapshot because we
>>just had an incidence and might need this for forensics in a few
>>months" (can also be solved with backups, of course ;)).

>>the main difference between the two is - for short-lived snapshots
>>performance implications related to snapshots existing are not that
>>important. I can live with a few hours of degraded performance, if
>>the snapshot is part of some important process/work flow. with long-
>>lived snapshots there is a requirement for them to not hurt
>>performance just by existing, because otherwise you can't use them.
>>there is a second reason why long-lived snapshots can be impossible 

OK, so here, with the qcow2 format, performance shouldn't be a
problem. (That's the whole point of this patch: using the qcow2 format
instead of the basic, slow LVM snapshots.)

>>if you need to decide up-front how "big" the delta of that snapshot
>>can grow at most, then in PVE context, you always need to allocate
>>the full volume size (regular thick LVM had both issues - bad
>>performance, and new writes going into a thick snapshot volume).

About thick snapshot volumes: technically, it would be possible to
create an LVM volume smaller than the qcow2 virtual size and
dynamically extend it; oVirt is doing it like this. (I sent a
preliminary patch in September, but for now I'd like to keep it simple
with thick snapshot volumes.)


>>if you can support long-lived snapshots, then you automatically also
>>support short-lived snapshots. the other direction doesn't hold.
>>since PVE only has one kind of snapshots, they need to be useful for
>>long-lived snapshots.

ok got it.
> > > A is not something we want. we intentionally don't have non-thin
> > > LVM
> > > snapshots for example.
> 
> AFAIK, we never had implemented it because LVM snasphot is slow as
> hell.(as a lvm extent are around 4MB, if you want 4k on a snapshot,
> you
> need to reread and rewrite the 4MB,  so around 1000x over-
> amplification and slow iops)

>>see above - there's two issues, one is performance, the other is that
>>you need to either
>>- make the snapshot smaller than the original volume (and risk
>>running out of space)
>>- make the snapshot as big as the original volume (and blow up space
>>requirements)
>>
>>(thick) LVM snapshots basically barely work for the "take a
>>consistent backup during quiet periods" use case, and not much else.


> This is really the main blocker for my customers migrating from
> vmware
> (and to be honest I have some of them going to oracle ovlm (with
> ovirt), because ovirt support it this way). 

> > > B once I create a single snapshot, the "original" storage only
> > > contains the data written up to that point, anything else is
> > > stored
> > > on the "snapshot" storage. this means my snapshot storage must be
> > > at
> > > least as fast/good/shared/.. as my original storage. in that
> > > case, I
> > > can just use the snapshot storage directly and ditch the original
> > > storage?
> 
> Sorry, but I don't understand why you are talking about
> original/snapshot storage ? I never have thinked to use another
> storage
> for external snapshot.
> 
> The patch is really to add external snapshot on same lvm storage,
> through lvm additional volume, but with qcow2 format to have good
> performance (vs 

Re: [pve-devel] [PATCH v2 pve-storage 2/2] add lvmqcow2 plugin: (lvm with external qcow2 snapshot)

2024-10-23 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
>>I am not yet convinced this is somehow a good idea, but maybe you can
>>convince me otherwise ;)
>>
>>variant A: this is just useful for very short-lived snapshots
>>variant B: these snapshots are supposed to be long-lived

Can you define "short" / "long", and the different use cases?

Because for me, a snapshot is a snapshot. Sometimes I take a snapshot
before doing some critical changes, but I can't know whether I'll need
to roll back in the next 2 hours or the next month.

I think the "long-lived" use case is backup (but we don't need it), or
replication (this could apply here, if we want to add replication for
disaster recovery).


>>A is not something we want. we intentionally don't have non-thin LVM
>>snapshots for example.

AFAIK, we never implemented it because LVM snapshots are slow as hell
(as LVM extents are around 4MB: if you want to write 4k to a snapshot,
you need to re-read and rewrite the whole 4MB, so around 1000x write
amplification and slow iops).

This is really the main blocker for my customers migrating from VMware
(and to be honest, I have some of them going to Oracle OLVM (based on
oVirt), because oVirt supports it this way).



>>B once I create a single snapshot, the "original" storage only
>>contains the data written up to that point, anything else is stored
>>on the "snapshot" storage. this means my snapshot storage must be at
>>least as fast/good/shared/.. as my original storage. in that case, I
>>can just use the snapshot storage directly and ditch the original
>>storage?

Sorry, but I don't understand why you are talking about
original/snapshot storage? I never thought of using another storage for
external snapshots.

The patch really just adds external snapshots on the same LVM storage,
through an additional LVM volume, but with the qcow2 format to get good
performance (vs. slow LVM snapshots).









> Alexandre Derumier via pve-devel  hat am
> 30.09.2024 13:31 CEST geschrieben:
> Signed-off-by: Alexandre Derumier  cyllene.com>
> ---
>  src/PVE/Storage.pm    |   2 +
>  src/PVE/Storage/LvmQcow2Plugin.pm | 460
> ++
>  src/PVE/Storage/Makefile  |   3 +-
>  3 files changed, 464 insertions(+), 1 deletion(-)
>  create mode 100644 src/PVE/Storage/LvmQcow2Plugin.pm
> 
> diff --git a/src/PVE/Storage.pm b/src/PVE/Storage.pm
> index 57b2038..119998f 100755
> --- a/src/PVE/Storage.pm
> +++ b/src/PVE/Storage.pm
> @@ -28,6 +28,7 @@ use PVE::Storage::Plugin;
>  use PVE::Storage::DirPlugin;
>  use PVE::Storage::LVMPlugin;
>  use PVE::Storage::LvmThinPlugin;
> +use PVE::Storage::LvmQcow2Plugin;
>  use PVE::Storage::NFSPlugin;
>  use PVE::Storage::CIFSPlugin;
>  use PVE::Storage::ISCSIPlugin;
> @@ -54,6 +55,7 @@ our $KNOWN_EXPORT_FORMATS = ['raw+size',
> 'tar+size', 'qcow2+size', 'vmdk+size',
>  PVE::Storage::DirPlugin->register();
>  PVE::Storage::LVMPlugin->register();
>  PVE::Storage::LvmThinPlugin->register();
> +PVE::Storage::LvmQcow2Plugin->register();
>  PVE::Storage::NFSPlugin->register();
>  PVE::Storage::CIFSPlugin->register();
>  PVE::Storage::ISCSIPlugin->register();
> diff --git a/src/PVE/Storage/LvmQcow2Plugin.pm
> b/src/PVE/Storage/LvmQcow2Plugin.pm
> new file mode 100644
> index 000..68c8686
> --- /dev/null
> +++ b/src/PVE/Storage/LvmQcow2Plugin.pm
> @@ -0,0 +1,460 @@
> +package PVE::Storage::LvmQcow2Plugin;
> +
> +use strict;
> +use warnings;
> +
> +use IO::File;
> +
> +use PVE::Tools qw(run_command trim);
> +use PVE::Storage::Plugin;
> +use PVE::Storage::LVMPlugin;
> +use PVE::JSONSchema qw(get_standard_option);
> +
> +use base qw(PVE::Storage::LVMPlugin);
> +
> +# Configuration
> +
> +sub type {
> +    return 'lvmqcow2';
> +}
> +
> +sub plugindata {
> +    return {
> + #container not yet implemented #need to implemented dm-qcow2
> + content => [ {images => 1, rootdir => 1}, { images => 1 }],
> +    };
> +}
> +
> +sub properties {
> +    return {
> +    };
> +}
> +
> +sub options {
> +    return {
> + vgname => { fixed => 1 },
> + nodes => { optional => 1 },
> + shared => { optional => 1 },
> + disable => { optional => 1 },
> + saferemove => { optional => 1 },
> + saferemove_throughput => { optional => 1 },
> + content => { optional => 1 },
> + base => { fixed => 1, optional => 1 },
> + tagged_only => { optional => 1 },
> + bwlimit => { optional => 1 },
> + snapext => { fixed => 1 },
> +    };
> +}
> +
> +# Storage implementation
> +
> +sub parse_volname {
> +    my ($class, $volname) = @_;
> +
> +    PVE::Storage::Plugin::parse_lvm_name($volname);
> +    my $format = $volname =~ m/^(.*)-snap-/ ? 'qcow2' : 'raw';
> +
> +    if ($volname =~ m/^((vm|base)-(\d+)-\S+)$/) {
> +    return ('images', $1, $3, undef, undef, $2 eq 'base',
> $format);
> +    }
> +
> +    die "unable to parse lvm volume name '$volname'\n";
> +}
> +
> +sub filesystem_path {
> +    my ($class, $scfg, $volname, $snapname, $current_snap) = @_;
> +
> +    my ($vtype, $name, $vmid) = $class->parse_volname($volname);
> +
> +    my $vg = $scfg->{vgname};
> +
> +    my $path = "

Re: [pve-devel] [PATCH v2 pve-storage 1/2] add external snasphot support

2024-10-23 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
Hi Fabian,

thanks for the review !

>> Message initial 
>>De: Fabian Grünbichler 
>>À: Proxmox VE development discussion 
>>Cc: Alexandre Derumier 
>>Objet: Re: [pve-devel] [PATCH v2 pve-storage 1/2] add external
>>snasphot support
>>Date: 23/10/2024 12:12:46
>>
>>some high level comments:
>>
>>I am not sure how much we gain here with the raw support?

There really is qcow2 overhead, mostly with big disks: for good
performance, the qcow2 L2 cache (l2-cache-size) needs to be kept in
memory, and it's only 1MB per disk by default:
https://events.static.linuxfound.org/sites/events/files/slides/kvm-forum-2017-slides.pdf

Fortunately, there are improvements with the "new" sub-cluster
feature:
https://people.igalia.com/berto/files/kvm-forum-2020-slides.pdf
I'm already using it at snapshot creation, but I think we should also
use it for the main qcow2 volume.
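
For the main volume that would basically just mean passing the extra
creation options, e.g. something like (size and filename only as an
example):

    qemu-img create -f qcow2 -o cluster_size=128k,extended_l2=on,preallocation=metadata vm-100-disk-0.qcow2 100G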


But even with that, you can still have a performance impact.
So yes, I think there is a real use case for workloads where you only
need a snapshot from time to time (before an upgrade, for example), but
want maximum performance while no snapshot exists.



>> it's a bit confusing to have a volid ending with raw, with the
>>current volume and all but the first snapshot actually being stored
>>in qcow2 files, with the raw file being the "oldest" snapshot in the
>>chain..
If it's too confusing, we could use for example a .snap extension (as
we know that it's qcow2 behind it).


if possible, I'd be much happier with the snapshot name in the snapshot
file being a 1:1 match, see comments inline

>>- makes it a lot easier to understand (admin wants to manually remove
>>snapshot "foo", if "foo" was the last snapshot then right now the
>>volume called "foo" is actually the current contents!)

This part is really difficult, because you can't know in advance the
name of the snapshot you'll take in the future. The only way would be
to create a "current" volume and rename it when you take another
snapshot (I'm not sure it's possible to do that live, and it could
break the backing chain too).

Also, I don't know how to manage the main volume when you take the
first snapshot; we would have to rename it too.

So "vm-disk-100-disk-0.(raw|qcow2)" becomes
"vm-disk-100-disk-0-snap1.(raw|qcow2)" + a new
"vm-disk-100-disk-0-current.qcow2"?


I'll run more tests to see what is possible.




>>- means we don't have to do lookups via the full snapshot list all
>>the time (e.g., if I want to do a full clone from a snapshot "foo", I
>>can just pass the snap-foo volume to qemu-img)

ok got it



>>the naming scheme for snapshots needs to be adapted to not clash with
>>regular volumes:

>>$ pvesm alloc extsnap 131314 vm-131314-disk-foobar.qcow2 2G
>>Formatting '/mnt/pve/ext4/extsnap/images/131314/vm-131314-disk-
>>foobar.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=off
>>preallocation=off compression_type=zlib size=2147483648
>>lazy_refcounts=off refcount_bits=16
>>successfully created 'extsnap:131314/vm-131314-disk-foobar.qcow2'
>>$ qm rescan --vmid 131314
>>rescan volumes...
>>can't parse snapname from path at
>>/usr/share/perl5/PVE/Storage/Plugin.pm line 1934.

Any preference for the naming scheme? For LVM external snapshots, I
have used "vm-131314-disk-0-snap-";

>>storage_migrate needs to handle external snapshots, or at least error
>>out. 
It should already work (I have tested move_disk, and live migration +
storage migration): qemu_img_convert offline, and a QEMU block job for
live.


>>I haven't tested that part or linked clones or a lot of other
>>advanced related actions at all ;)

For linked clones, we can't have a base image with snapshots (other
than __base__), so it'll be safe.



> Alexandre Derumier via pve-devel  hat am
> 30.09.2024 13:31 CEST geschrieben:
> Signed-off-by: Alexandre Derumier  cyllene.com>
> ---
>  src/PVE/Storage/DirPlugin.pm |   1 +
>  src/PVE/Storage/Plugin.pm    | 225 +++--
> --
>  2 files changed, 201 insertions(+), 25 deletions(-)
> 
> diff --git a/src/PVE/Storage/DirPlugin.pm
> b/src/PVE/Storage/DirPlugin.pm
> index 2efa8d5..2bef673 100644
> --- a/src/PVE/Storage/DirPlugin.pm
> +++ b/src/PVE/Storage/DirPlugin.pm
> @@ -80,6 +80,7 @@ sub options {
>   is_mountpoint => { optional => 1 },
>   bwlimit => { optional => 1 },
>   preallocation => { optional => 1 },
> + snapext => { optional => 1 },
>     };
>  }
>  
> diff --git a/src/PVE/Storage/Plugin.pm b/src/PVE/Storage/Plugin.pm
> index 6444390..5e5197a 100644
> --- a/src/PVE/Storage/Plugin.pm
> +++ b/src/PVE/Storage/Plugin.pm
> @@ -214,6 +214,11 @@ my $defaultData = {
>       maximum => 65535,
>       optional => 1,
>   },
> +    'snapext' => {
> +     type => 'boolean',
> +     description => 'enable external snapshot.',
> +     optional => 1,
> +    },
>  },
>  };
>  
> @@ -695,7 +700,7 @@ sub get_subdir {
>  }
>  
>  sub filesystem_path {
> -    my ($class, $scfg, $volname, $snapname) = @_;
> +    my ($class, $scfg, $volname, $snapname, $current_snap) = @_;

see comment below

>  
>  my ($

Re: [pve-devel] [PATCH v2 qemu-server 1/1] implement external snapshot

2024-10-23 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
>>if we want the current volume to keep its name, and the snapshot
>>volume to actually contain *that* snapshot's data, we need some sort
>>of rename dance here as well.. i.e., rename the current volume to
>>have the snapshot volume name, then snapshot it back into the
>>"current" name. not sure what the proper qmp runes would be to
>>achieve that?

I really need to check that, but I would like to keep it as atomic as
possible (avoid stream, double snapshot, and all the fancy stuff just
to take a snapshot; because shit happens ^_^, and generally it'll
happen when you take a snapshot with multiple disks and then have to
manage recovery and the current state of the different volumes).

Another (maybe stupid) way would be to use a generic name for the
snapshot file (maybe with the ctime inside the name) and create
symlinks when a snapshot is taken (QEMU follows symlinks and uses
realpaths for the backing chain). (And for LVM, it can be done with
metadata.)

I'll really try to find a way with renaming the volume; I'll keep you
posted.







--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH v2 qemu-server 1/1] implement external snapshot

2024-10-23 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
OK, I think it could be possible to use blockdev-reopen to rename the
current filename:
https://lists.gnu.org/archive/html/qemu-devel/2021-05/msg04455.html

Example: take snapshot "snap1" on vm-disk-100-disk-0.qcow2:

- create a hardlink: ln vm-disk-100-disk-0.qcow2 vm-disk-100-disk-0-
snap1.qcow2

- qmp blockdev-reopen file vm-disk-100-disk-0-snap1.qcow2

- rm vm-disk-100-disk-0.qcow2

- create a snapshot with a new file vm-disk-100-disk-0.qcow2 with vm-
disk-100-disk-0-snap1.qcow2 as backing file



I'll try to test this tomorrow !





--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH pve-network/pve-common/pve-manager] fix #4300 : sdn: add bridge ports isolation

2024-10-24 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
Hi,

Any news about this patch series?

I think it's still not applied? (I see a lot of requests about it on
the forum and in the Bugzilla.)

Regards,

Alexandre


 Message initial 
De: "DERUMIER, Alexandre" 
À: pve-devel@lists.proxmox.com ,
s.hanre...@proxmox.com 
Objet: Re: [pve-devel] [PATCH pve-network/pve-common/pve-manager] fix
#4300 : sdn: add bridge ports isolation
Date: 27/06/2024 18:23:56

Hi!


> > Hi! I gave this a quick test on my machine and everything worked
well.
> > Would we maybe want to expose this setting on the NIC level as
> > well?

I don't think that can work, because a non-isolated port has access to
all other ports, including isolated ports.

"
isolated on or isolated off
Controls whether a given port will be isolated, which means it will be
able to communicate with non-isolated ports only. By default this flag
is off."


for example:
vm1: isolated
vm2: isolated
vm3: non isolated


vm1: can't access vm2
vm2: can't access vm1

vm3 has access to the isolated vm1 && vm2 (but the user thinks that
vm1 && vm2 are secure), and vm1/vm2 have access to vm3 too.


That's why I have done it at the bridge/vnet level: all or nothing.

The main usage is to have only 1 non-isolated upstream port (for
traffic coming from outside).
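
(Under the hood this just maps to the per-port flag from iproute2,
e.g., with an example tap port:)

    bridge link set dev tap100i0 isolated on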


> > Also I think 'Isolate Ports' or 'Port Isolation' would be the
> > better
> > label, 'Ports Isolation' sounds a bit wrong to me.

I'll send a v2 with "Port Isolation"



Otherwise, consider this:

> > Tested-By: Stefan Hanreich 
> > Reviewed-By: Stefan Hanreich 

Thanks !

 4/25/24 16:43, Alexandre Derumier via pve-devel wrote:
> ___
> pve-devel mailing list
> pve-devel@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel



--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH manager 1/2] close #4369: api: optionally only suggest unique IDs

2024-09-29 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
Hi,

I'm very interested in this patch series too.

My 2 cents:

Couldn't we simply move the deleted VM config file
to a trash/tombstone directory?

/etc/pve/.deleted/.conf ?


(It would be great to be able to mass-delete VMs in parallel without
having a big lock on a single file.)


I'm not sure about the read performance with a lot of deleted VMs,
though?


Regards,

Alexandre


 Message initial 
De: Dietmar Maurer 
Répondre à: Proxmox VE development discussion 
À: Proxmox VE development discussion 
Objet: Re: [pve-devel] [PATCH manager 1/2] close #4369: api: optionally
only suggest unique IDs
Date: 26/09/2024 17:34:20

The format of the used_vmids.list is simply, but can lead to
a very large file over time (we want to avoid large files on
/etc/pev/).

>    PVE::Cluster::cfs_write_file('used_vmids.list', join("\n",
> @$vmid_list));

A future version could compress that list, by using integer ranges,
for example:

---used_vmids.list---
100-10005
10007-2
-

I guess this would be simple to implement...


___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] qcow2 internal snapshot: 4k write speed reduce to 100~200 iops on ssd ?

2024-09-19 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
>>Hi,

Hi Fiona, 

sorry I didn't see your response.


>>Is the performance drop also this big with the local storage?

Yes. (The result is even worse on NFS, or gfs2/ocfs2.)


>>A performance drop is of course expected, because AFAIU it needs to
>>do
>>COW for the sectors that the snapshot references, but the drop does
>>sound rather dramatic in your case. The performance should be better
>>again once the COW is done, e.g. when running another test
>>afterwards. I
>>quickly tested in a nested VM with qcow2 disk on a directory storage
and
>>--numjobs=4:²
>>
>>ran test, got ~33k IOPS
>>made a snapshot
>>ran test, got ~20k IOPS (it starts out very slow and then recovers)
>>ran test, got ~33k IOPS again
>>

I have tested with another server: I get around 60k write iops, going
down to 600~1000 iops max when a snapshot is taken.


It does recover slowly after some time: maybe 2~3 min to get back to
20k iops, and 7~10 min to get 50k iops again.
and 7~10min to get 50kiops again

I'm testing with a 100GB disk.

Maybe it is related to metadata preallocation? (I'm not sure, but when
you have a new snapshot, the preallocated metadata is not used
anymore?)


Anyway, I just sent the patches for external qcow2 snapshots :)
If somebody has time to review them ^_^


Thanks again for your response !




> is it expected ?
> 
> (I don't have use qcow2 snapshot since 10year, so I really don't 
>  remember performance).
> 

Best Regards,
Fiona

> 
> Using external snapshot file, I'm around 1iops.
> 
> test inside the vm:
> 
> fio --filename=/dev/sdX --ioengine=libaio --direct 1 --bs=4k  --
> rw=randwrite --numjobs=64
> 
> 


--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH v2 pve-storage 1/2] add external snasphot support

2024-10-25 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
 Message initial 
De: Fabian Grünbichler 
À: Proxmox VE development discussion ,
"DERUMIER, Alexandre" 
Cc: Giotta Simon RUAGH 
Objet: Re: [pve-devel] [PATCH v2 pve-storage 1/2] add external snasphot
support
Date: 24/10/2024 11:48:03


> Giotta Simon RUAGH via pve-devel  hat am
> 24.10.2024 09:59 CEST geschrieben:
> > I mean, if we don't allow .raw files to be snapshotted then this
> > problem doesn't exist ;)
> 
> Quick comment from the bleacher; Adding a mechanism to shapshot raw
> disks might solve the TPM (tpmstate) snapshotting issue, as well as
> allowing containers to be snapshot.
> 
> For context: 
> When using a storage that does not natively support snapshotting (NFS
> on NetApp or similar enterprise storage, in particular), raw disks
> cannot be snapshot. 
> Since tpmstate disks can only be stored as raw (as I understand they
> are just a binary blob?), this makes it impossible to snapshot or
> (link-)clone any VMs that have a TPM. This especially is an issue for
> current Windows clients.
> Same issue for LXC containers, as their storage format is raw only as
> well.
>  
> https://bugzilla.proxmox.com/show_bug.cgi?id=4693

>>no it does not - with the mechanisms proposed in this patch series,
>>only the initial volume can be raw, if it is snapshotted, the
>>overlays are qcow2. so anything reading from the volume needs qcow2
>>support, including swtpm. 
>>
>>that's why containers are not on the table for now either..

Hi, I really don't know how swtpm works, but for containers maybe it
could be possible:


Not yet merged into the kernel, but a dm-qcow2 driver is on the
roadmap :)
https://www.youtube.com/watch?v=Z7jPpWydEC8

Another possibility is qemu-storage-daemon + NBD or vDPA export:
https://blog.deckhouse.io/lvm-qcow-csi-driver-shared-san-kubernetes-81455201590e


About vTPM, is it really a problem not to be able to snapshot it? (I
mean, does its content change regularly? Can't we just skip the disk? I
really don't know how it works, I don't use TPM :p)

--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH pve-storage] qcow2 format: enable subcluster allocation by default

2024-11-14 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
Hi Fiona,

I'm really sorry, I didn't see your response; it got lost in the flood
of email :(


>>How does read performance compare for you (with 128 KiB cluster
size)?

>>I don't see any noticeable difference in my testing with an ext4
>>directory storage on an SSD, attaching the qcow2 images as SCSI disks
>>to
>>the VM, neither for reading nor writing. I only tested without your
>>change and with your change using 4k (rand)read and (rand)write.
>>
>>I'm not sure we should enable this for everybody, there's always a
>>risk
>>to break stuff with added complexity. Maybe it's better to have a
>>storage configuration option that people can opt-in to, e.g.
>>
>>qcow2-create-opts extended_l2=on,cluster_size=128k
>>
>>If we get enough positive feedback, we can still change the default
>>in a
>>future (major) release.

What disk size do you use for your bench?
It's really important, because generally, the bigger the image, the
slower reads in qcow2 will be, because the L2 metadata needs to be
cached in memory (and QEMU has a 1 MiB cache by default).

Without subclusters, for a 1 TB image, that's around 128 MB of L2
metadata in the qcow2 file.

For writes, it mostly matters for the first block allocation, so it
depends on the filesystem behind, and whether fallocate is supported.
(I have seen a really big difference on shared ocfs2/gfs2 filesystems.)

Also, for writes, this helps a lot with backing files (so for linked
qcow2 clones, and maybe in the future for external snapshots).
Currently, if you write 4k to an overlay, you need to read the full 64k
cluster from the base image and rewrite it (so 8x write amplification).

There is good information from the developer:
https://blogs.igalia.com/berto/2020/12/03/subcluster-allocation-for-qcow2-images/
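
For reference, it's easy to reproduce manually (sizes/paths are just
examples):

  # image with sub-cluster allocation enabled (qemu-img >= 5.2)
  qemu-img create -f qcow2 -o extended_l2=on,cluster_size=128k \
      /var/lib/vz/images/100/vm-100-disk-0.qcow2 1T

  # compare against the default 64k clusters without extended L2
  qemu-img create -f qcow2 /var/lib/vz/images/100/vm-100-disk-1.qcow2 1T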


I really don't think it hurts, but maybe only make it the default with
the 9.0 release, to be safe. (I'll try to have external snapshots ready
for that date too.)


Regards,

Alexandre

> Signed-off-by: Alexandre Derumier  cyllene.com>
> ---
>  src/PVE/Storage/Plugin.pm | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/src/PVE/Storage/Plugin.pm b/src/PVE/Storage/Plugin.pm
> index 6444390..31b20fe 100644
> --- a/src/PVE/Storage/Plugin.pm
> +++ b/src/PVE/Storage/Plugin.pm
> @@ -561,7 +561,7 @@ sub preallocation_cmd_option {
>   die "preallocation mode '$prealloc' not supported by format
> '$fmt'\n"
>       if !$QCOW2_PREALLOCATION->{$prealloc};
>  
> - return "preallocation=$prealloc";
> + return "preallocation=$prealloc,extended_l2=on,cluster_size=128k";

Also, it doesn't really fit here in the preallocation helper as the
helper is specific to that setting.

>  } elsif ($fmt eq 'raw') {
>   $prealloc = $prealloc // 'off';
>   $prealloc = 'off' if $prealloc eq 'metadata';
> -- 
> 2.39.2
> 
> 


--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH pve-storage] qcow2 format: enable subcluster allocation by default

2024-11-25 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
Hi Fiona,

Thanks for your tests! Indeed, it doesn't seem to be a silver bullet for
the classic use case.


>>We could also think about only using it for linked clones by default
>>initially.
>>
>>Independently, you can make it the default for LvmQcow2Plugin of
>>course,
>>since you see much better results for that use case :)

Yes, I wonder if it couldn't be enabled only for linked clones &&
external snapshots.

I'll send a patch for the LvmQcow2Plugin soon; still working on the
blockdev && qemu-server side, it's almost done.

Regards,

Alexandre
--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH pve-storage] qcow2: resize: add preallocation support

2025-02-03 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
 Message initial 
De: Fiona Ebner 
À: Proxmox VE development discussion 
Cc: Alexandre Derumier 
Objet: Re: [pve-devel] [PATCH pve-storage] qcow2: resize: add
preallocation support
Date: 03/02/2025 15:39:41

On 19.12.24 at 17:18, Alexandre Derumier via pve-devel wrote:
> Seems that we totally forgot to add it; it's available since 2017:
> https://www.mail-archive.com/qemu-devel@nongnu.org/msg436979.html

Nit: it's better to link to the commit rather than the mailing list for
things that are already applied.

>>Missing your Signed-off-by.

Oops, sorry.

>>Hmm, I wanted to suggest to query the image to see what kind of
>>preallocation it was created with and then use that setting to stay
>>consistent. 
>>But that information doesn't seem to get recorded (on an
>>image-wide level) AFAICS.

For full preallocation, I think we can simply check the current qcow2
usage vs. the configured size.

For qcow2 metadata preallocation, I really don't know any way to do it.
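
A rough check for the full preallocation case could be something like
(the jq filter is just an example):

  # if "actual-size" is close to "virtual-size", the image was fully
  # preallocated
  qemu-img info --output=json /path/to/vm-100-disk-0.qcow2 \
      | jq '.["actual-size"], .["virtual-size"]'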

>> It might be surprising that changes to the
>>storage configuration setting will also apply to already existing
>>images

Personally, I was more surprised that this never worked on resize
before ^_^.

It doesn't shock me that it respects the currently assigned option at
the moment of the resize.

>>and we should document the behavior for resize in the description of
>>the
>>'preallocation' setting.

But yes, it should be documented.
I'll write a patch for pve-docs.


>>Seems like the "block_resize" QMP command does not have the setting
>>at
>>all, so if we add it here, the behavior would still be inconsistent
>>in
>>that regard :/ But oh well, could still be added on top later if we
>>can
>>get that feature in upstream. But should also be documented, that it
>>doesn't apply for live resize.

Yes, indeed, it doesn't exist for a live running image. (I think I have
seen a discussion on the qemu mailing list about it, but it requires
some kind of block job if I remember correctly.)

A preallocate filter does exist:

https://qemu.googlesource.com/qemu/+/refs/tags/v8.0.3/block/preallocate.c

but it's a little bit different, it preallocates live (allocating in
chunks of 1 MB, for example, when a 4k write reaches EOF).



Thanks for the review !



> ---
>  src/PVE/Storage/Plugin.pm | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/src/PVE/Storage/Plugin.pm b/src/PVE/Storage/Plugin.pm
> index 6a6a804..d0ce3ae 100644
> --- a/src/PVE/Storage/Plugin.pm
> +++ b/src/PVE/Storage/Plugin.pm
> @@ -1166,7 +1166,9 @@ sub volume_resize {
>  
>  my $format = ($class->parse_volname($volname))[6];
>  
> -    my $cmd = ['/usr/bin/qemu-img', 'resize', '-f', $format, $path ,
> $size];
> +    my $prealloc_opt = preallocation_cmd_option($scfg, $format);
> +
> +    my $cmd = ['/usr/bin/qemu-img', 'resize', "--$prealloc_opt", '-
> f', $format, $path , $size];
>  
>  run_command($cmd, timeout => 10);
>  
> -- 
> 2.39.5
> 
> 



--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


[pve-devel] fosdem 2025 feeback

2025-02-06 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
Hi,


First, thanks to the whole Proxmox team for the help during FOSDEM!

It was 2 long days for only 2 people, so your help gave us some time to
rest a little bit && eat.

And thanks for the dinner, it was great to meet you again.



Here my notes:




users feedback:

- a lot of "thanks you" from a lot of proxmox VE users.

- 3/4 of users would like to contribute, but they don't know how.
  Maybe something better than the developer wiki page would be great.

- a lot of interest for pdm (lot of people waiting for the final
release)

- users would like to have a proxmox talk during the fosdem 

- we have seen a lot of homelabbers, but also a lot of admins of big
companies with big deployments of 1000~4000 VMs

- 1 admin from a hosting company asked me if proxmox VE could handle
10 vms ^_^

- Still some people asking about "what is proxmox ve ?"  (but less than
last year)

- Lots of people currently migrating from VMware or comparing
alternatives




features request:

No new features were requested, just the classic missing ones:

- replication && disaster recovery between 2 clusters. (pdm ?)
- snapshot on san  
- parallel backup limitation
- affinity/anti-affinity in HA
- different requests for SDN (mostly from telco companies with crazy
network setups ^_^)
- ARM support (but mostly from Raspberry Pi homelabbers, and from an
Ampere company guy too)
- "official" Terraform provider (or at least "officially supported")



Merch:

stickers: not enough (almost out on Saturday at 2pm)
t-shirts: too many S && XS sizes ^_^
bottle openers: enough for the 2 days (some guys brought us beers in
exchange ^_^)
having some presentation flyers could be great too
--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH pve-network] Fix #5496: NetBox add ip range

2024-12-11 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
Hi,

I don't remember exactly when I wrote that code (because DHCP ranges
were added after my initial implementation, which only looked at the
full prefix),

but shouldn't these ranges be added when the DHCP ranges are submitted
in the subnet create/update API call? (I'm not 100% sure, but I think a
hook was missing to call the SDN API when submitting the DHCP ranges.)


 Original Message 
From: nurohman via pve-devel 
Reply-To: Proxmox VE development discussion 
To: pve-devel@lists.proxmox.com
Cc: nurohman 
Subject: [pve-devel] [PATCH pve-network] Fix #5496: NetBox add ip range
Date: 12/12/2024 00:38:04

When the IP range is not available in NetBox, the function responds with
"can't find free ip in range".
Add a new IP range if the IP range ID can't be found, and return the new
IP range ID.

Co-authored-by: Jacob Green 
Co-authored-by: Lou Lecrivain 
Signed-off-by: Nurohman 
---
 src/PVE/Network/SDN/Ipams/NetboxPlugin.pm | 34 +--
 1 file changed, 32 insertions(+), 2 deletions(-)

diff --git a/src/PVE/Network/SDN/Ipams/NetboxPlugin.pm b/src/PVE/Network/SDN/Ipams/NetboxPlugin.pm
index d923269..5591b0b 100644
--- a/src/PVE/Network/SDN/Ipams/NetboxPlugin.pm
+++ b/src/PVE/Network/SDN/Ipams/NetboxPlugin.pm
@@ -151,7 +151,7 @@ sub add_next_freeip {
 
     my $params = { dns_name => $hostname, description => $description };
 
-    eval {
+    my $ip = eval {
 	my $result = PVE::Network::SDN::api_request("POST", "$url/ipam/prefixes/$internalid/available-ips/", $headers, $params);
 	my ($ip, undef) = split(/\//, $result->{address});
 	return $ip;
@@ -160,6 +160,8 @@ sub add_next_freeip {
     if ($@) {
 	die "can't find free ip in subnet $cidr: $@" if !$noerr;
     }
+
+    return $ip;
 }
 
 sub add_range_next_freeip {
@@ -170,11 +172,14 @@ sub add_range_next_freeip {
     my $headers = ['Content-Type' => 'application/json; charset=UTF-8', 'Authorization' => "token $token"];
 
     my $internalid = get_iprange_id($url, $range, $headers);
+    if ($internalid eq "") {
+	$internalid = add_iprange($url, $range, $subnet, $headers);
+    }
     my $description = "mac:$data->{mac}" if $data->{mac};
 
     my $params = { dns_name => $data->{hostname}, description => $description };
 
-    eval {
+    my $ip = eval {
 	my $result = PVE::Network::SDN::api_request("POST", "$url/ipam/ip-ranges/$internalid/available-ips/", $headers, $params);
 	my ($ip, undef) = split(/\//, $result->{address});
 	print "found ip free $ip in range $range->{'start-address'}-$range->{'end-address'}\n" if $ip;
@@ -184,6 +189,8 @@ sub add_range_next_freeip {
     if ($@) {
 	die "can't find free ip in range $range->{'start-address'}-$range->{'end-address'}: $@" if !$noerr;
     }
+
+    return $ip;
 }
 
 sub del_ip {
@@ -290,6 +297,29 @@ sub is_ip_gateway {
     return $is_gateway;
 }
 
+sub add_iprange {
+    my ($url, $range, $subnet, $headers) = @_;
+
+    my ($start, $smask) = split('/', $range->{'start-address'});
+    my ($end, $emask) = split('/', $range->{'end-address'});
+    my $params = {
+	start_address => "$start/$subnet->{mask}",
+	end_address => "$end/$subnet->{mask}",
+	description => "prefix:$subnet->{cidr}"
+    };
+
+    my $data = eval {
+	my $result = PVE::Network::SDN::api_request("POST", "$url/ipam/ip-ranges/", $headers, $params);
+	return $result;
+    };
+
+    if ($@) {
+	die "Can't add ip range $start/$subnet->{mask} -> $end/$subnet->{mask} $@";
+    }
+
+    return $data->{id};
+}
+
 1;
 
 

--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH v2 storage 05/10] rbd plugin: implement volume import/export

2024-12-18 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
>>Am 18.12.24 um 15:20 schrieb Daniel Kral:
> >>- When exporting with "pvesm export ...", the volume has the same
> checksum as with "rbd export ..." with the size header prepended

>>Well, I totally missed the existence of "rbd export" in my hurry to
>>get
>>this working. Seems to be about 1.5 times faster than mapping+dd from
>>some initial testing. Will use that in v3.

Hi Fiona, rbd export|import is the way, like zfs send|receive
(with snapshot support too, via export-diff|import-diff).

It's really fast because, if I remember correctly, it uses big block
sizes and is able to do parallelism (it can be tuned with
--rbd-concurrent-management-ops).
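
For reference, the kind of pipeline I mean (pool/image names are just
examples):

  # full copy
  rbd export --rbd-concurrent-management-ops 20 pool/vm-100-disk-0 - \
      | rbd import - pool2/vm-100-disk-0

  # incremental, between two snapshots
  rbd export-diff --from-snap snap1 pool/vm-100-disk-0@snap2 - \
      | rbd import-diff - pool2/vm-100-disk-0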



Not related, but could it be possible to implement this for simple
VM/template full cloning when source+target are both RBD? It's really
faster than 'qemu-img convert'.




--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH v2 storage 05/10] rbd plugin: implement volume import/export

2024-12-19 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
> Not related, but could it be possible to implement this for simple
> VM/template full cloning when source+target are both RBD? It's
> really
> faster than 'qemu-img convert'.

>>Hmm, we could shift offline copy of images to the storage layer (at
>>least in some cases). We just need a version of storage_migrate()
>>that
>>goes to the local node instead of SSH to a different one. Could you
>>please open a feature request for this?

Done:

https://bugzilla.proxmox.com/show_bug.cgi?id=6002

(Another way could be clone + flatten if the storage supports it (like
Ceph); this avoids network reads/writes and really offloads the copy.)
--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH v1 pve-qemu 1/1] add block-commit-replaces option patch

2025-01-09 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
 Original Message 
From: Fabian Grünbichler 
To: Proxmox VE development discussion 
Cc: Alexandre Derumier , Fiona
Ebner 
Subject: Re: [pve-devel] [PATCH v1 pve-qemu 1/1] add block-commit-
replaces option patch
Date: 08/01/2025 14:27:02


> Alexandre Derumier via pve-devel wrote on 16.12.2024 at 10:12 CET:

> This is needed for external snapshot live commit,
> when the top blocknode is not the fmt-node.
> (in our case, the throttle-group node is the topnode)

>>so this is needed to workaround a limitation in block-commit? I think
>>if we need this it should probably be submitted upstream for
>>inclusion, or we provide our own copy of block-commit with it in the
>>meantime?
Yes, it could be submitted upstream (after a little bit of review, I'm
not too good at C ;)).

It's more a missing option in the QMP syntax, as it already uses the
blockdev-mirror code behind the scenes.

(Red Hat didn't use the throttle-group feature until recently, so I
think they never saw this problem with block-commit, as their top root
node was the disk directly, and not the throttle group.)


--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH v3 pve-storage 2/3] lvmplugin: add qcow2 snapshot

2025-01-10 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
>>one downside with this part in particular - we have to always
>>allocate full-size LVs (+qcow2 overhead), even if most of them will
>>end up storing just a single snapshot delta which might be a tiny
>>part of that full-size.. hopefully if discard is working across the
>>whole stack this doesn't actually explode space usage on the storage
>>side, but it makes everything a bit hard to track.. OTOH, while we
>>could in theory extend/reduce the LVs and qcow2 images on them when
>>modifying the backing chain, the additional complexity is probably
>>not worth it at the moment..

See this RFC with dynamic extend (no shrink/no discard):
https://lore.proxmox.com/pve-devel/mailman.475.1725007456.302.pve-de...@lists.proxmox.com/t/
(I think the tricky part (as Dominic's quick review showed) is to handle
the resize cluster lock correctly, and to handle timeout/retry with a
queue through a specific daemon.)

But technically, this is how oVirt manages it (and it works in
production, I have had customers using it for multiple years).


> 
>  sub plugindata {
>  return {
>   content => [ {images => 1, rootdir => 1}, { images => 1 }],
> + format => [ { raw => 1, qcow2 => 1 } , 'raw' ],

>>I wonder if we want to guard the snapshotting-related parts below
>>with an additional "snapext" option here as well? 

I really don't know; it's not possible to do snapshots with .raw
anyway.
On the GUI side, it could allow enabling/displaying the format field,
for example, if snapext is defined on the storage.

>>or even the usage >>of qcow2 altogether?

I think we should keep the possibility to choose .raw vs .qcow2 on the
same storage, because a user may really need max performance for a
specific VM without needing snapshots.


> 
> +
> +    #add extra space for qcow2 metadatas
> +    #without sub-allocated clusters : For 1TB storage : l2_size =
> disk_size × 8 / cluster_size
> +    #with sub-allocated clusters : For 1TB storage : l2_size =
> disk_size × 8 / cluster_size / 16
> +   #4MB overhead for 1TB with
> extented l2 clustersize=128k
> +
> +    my $qcow2_overhead = ceil($size/1024/1024/1024) * 4096;

>>there's "qemu-img measure", which seems like it would do exactly what
>>we want ;)

"Calculate the file size required for a new image. This information can
be used to size logical volumes or SAN LUNs appropriately for the image
that will be placed in them."

Indeed, lol. I knew the command, but I thought it was for measuring the
content of an existing file. I'll do tests to see if I get the same
results (and if sub-allocated clusters are handled correctly).
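
For reference, the check I have in mind (example values only):

  # required file size for a 100G qcow2 with extended L2 / 128k clusters
  qemu-img measure --size 100G -O qcow2 -o extended_l2=on,cluster_size=128k

It prints "required size" and "fully allocated size", which should map
directly to the LVM volume size we need to allocate.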

> +
> +    my $lvmsize = $size;
> +    $lvmsize += $qcow2_overhead if $fmt eq 'qcow2';
> +
>  die "not enough free space ($free < $size)\n" if $free < $size;
>  
> -    $name = $class->find_free_diskname($storeid, $scfg, $vmid)
> +    $name = $class->find_free_diskname($storeid, $scfg, $vmid, $fmt)
>   if !$name;
>  
> -    lvcreate($vg, $name, $size, ["pve-vm-$vmid"]);
> -
> +    my $tags = ["pve-vm-$vmid"];
> +    push @$tags, "\@pve-$name" if $fmt eq 'qcow2';

>>that's a creative way to avoid the need to discover and activate
>>snapshots one by one below, but it might warrant a comment ;)

Ah sorry (but yes, the idea was to activate/deactivate the whole
chain in one call).

> >>
> +
> +    #rename current lvm volume to snap volume
> +    my $vg = $scfg->{vgname};
> +    print"rename $volname to $snap_volname\n";
> +    eval { lvrename($vg, $volname, $snap_volname) } ;
>> missing error handling..

> +
> +
> +    #allocate a new lvm volume
> +    $class->alloc_new_image($storeid, $scfg, $vmid, 'qcow2',
> $volname, $size/1024);

>>missing error handling

Ah, sorry, it should be included in the following eval.

> +    eval {
> +    $class->format_qcow2($storeid, $scfg, $volname, undef,
> $snap_path);
> +    };
> +
> +    if ($@) {
> +    eval { $class->free_image($storeid, $scfg, $volname, 0) };
> +    warn $@ if $@;
> +    }
> +}
> +
> +sub volume_rollback_is_possible {
> +    my ($class, $scfg, $storeid, $volname, $snap, $blockers) = @_;
> +
> +    my $snap_path = $class->path($scfg, $volname, $storeid, $snap);
> +
> +    my $snapshots = $class->volume_snapshot_info($scfg, $storeid,
> $volname);
> +    my $parent_snap = $snapshots->{current}->{parent};
> +
> +    return 1 if !-e $snap_path || $snapshots->{$parent_snap}->{file}
> eq $snap_path;

>>the first condition here seems wrong, see storage patch #1

yes

> +    die "can't rollback, '$snap' is not most recent snapshot on
> '$volname'\n";
> +
> +    return 1;
>  }
>  
> +
>  sub volume_snapshot_rollback {
>  my ($class, $scfg, $storeid, $volname, $snap) = @_;
>  
> -    die "lvm snapshot rollback is not implemented";
> +    die "can't rollback snapshot this image format\n" if $volname !~
> m/\.(qcow2|qed)$/;

>>above we only have qcow2, which IMHO makes more sense..

We could remove .qed everywhere; it has been deprecated since 2017 and
we never exposed it.

Re: [pve-devel] [PATCH-SERIES v3 pve-storage/qemu-server/pve-qemu] add external qcow2 snapshot support

2025-01-09 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
 Original Message 
From: Fabian Grünbichler 
To: Proxmox VE development discussion 
Cc: Alexandre Derumier 
Subject: Re: [pve-devel] [PATCH-SERIES v3 pve-storage/qemu-server/pve-
qemu] add external qcow2 snapshot support
Date: 09/01/2025 15:13:14

> Alexandre Derumier via pve-devel wrote on 16.12.2024 at 10:12 CET:

> This patch series implement qcow2 external snapshot support for files
> && lvm volumes
> 
> The current internal qcow2 snapshots have bad write performance
> because no metadatas can be preallocated.
> 
> This is particulary visible on a shared filesystem like ocfs2 or
> gfs2.
> 
> Also other bugs are freeze/lock reported by users since years on
> snapshots delete on nfs
> (The disk access seem to be frozen during all the delete duration)
> 
> This also open doors for remote snapshot export-import for storage
> replication.
>>
>>a few high level remarks:
>>- I am not sure whether we want to/can switch over to blockdev on the
>>fly (i.e., without some sort of opt-in phase to iron out kinks). what
>>about upgrades with running VMs? I guess some sort of flag and per-VM
>>switching would be a better plan..

I have tested live migration, and it works for the small tests I have
done (I was surprised myself). I'll try to do longer tests to be 100%
sure that there is no data corruption.

On the guest side, it's transparent. On the QEMU side, the devices and
PCI plumbing are still the same, and QEMU already uses blockdev behind
the scenes.

If needed, we could make a switch based on the QEMU version, or a manual
option.


>>- I am also not sure whether we want to mix and match internal and
>>external snapshots, we probably should require a storage to use it or
>>not at creation time and fix it, to allow us to keep the handling
>>code from exploding complexity-wise..

For qcow2, I have a "snapext" option on the storage, so you can't mix
internal/external at the same time.
(We just need to only allow defining the snapext option at storage
creation.)

For LVM, only external snapshots are possible (if the format is qcow2).



>>- if you see a way to name the block graph nodes in a deterministic
>>fashion (i.e., have a 1:1 mapping between snapshot and block graph
>>node name) that would be wonderful, else we'd have to find another
>>way to improve the lookup there..

A 1:1 mapping with the snapshot is not possible (I have tried a lot),
because:
  - the snapshot name can be too long (a blockdev name is 31 characters
max, and a hash based on the filename is difficult)
  - with external snapshot file renaming, this doesn't work (snap ->
current). We can't rename a blockdev, so the mapping will drift.

So, I don't think it's possible to avoid the lookup. (I really have no
idea how to manage it otherwise.)
I'm not sure it's really a problem? It's just an extra QMP call, and
it's super fast.


>>- the relevant commits and code would really benefit from a summary
>>of the design/semantics, it's not easy to reconstruct this from the
>>code alone

ok will do

>>- some tests would be great as well, both to verify that the code
>>actually behaves like we expect, and also to catch regressions when
>>it is touched again in the future

Yes, I was planning to add tests; as we don't currently have many tests
of the QEMU command line, this is a good time to add them.


>>- some performance numbers would also be great :)
>>-- internal vs external snapshots using the same underlying dir/fs
>>-- qcow2 vs qcow2 with snapshots vs raw on LVM using the same
>>underlying disk

I'll prepare different hardware SANs to compare. (Note that it's not
only about performance, but also about disk locks/freezes on snapshot
rollback, for example with internal snapshots.)

I'll try to re-benchmark ocfs2 && gfs2 too with external snapshots (as
metadata preallocation is really important with them, and it's not
possible with internal snapshots).



>>- did you verify the space consumption on the SAN side with the LVM
>>plugin? would be interesting to know how it holds up in practice :)
Well, currently it reserves a full LVM volume for each snapshot.

If your storage supports thin provisioning, it's not too much of a
problem, you can create a LUN bigger than your physical storage. I have
tried with a NetApp SAN, it works.

But for low-cost SANs, where thin provisioning doesn't exist, I'm
planning to add dynamic extension of small LVM volumes.

I sent some test patches in September 2024: the qcow2 size can be bigger
than the LVM size (like qcow2=100G && lvm=1G), and then it's possible to
have QEMU send an event on a usage threshold and extend the LVM volume
(through a custom daemon).
But I was planning to add this separately, after the external snapshot
code is committed.
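
For reference, a rough sketch of the mechanism (node name and sizes are
just examples): the daemon arms a write threshold on the format node and
reacts to the event by extending the LV and re-arming it.

  # arm a threshold, e.g. 512M below the current LV size
  { "execute": "block-set-write-threshold",
    "arguments": { "node-name": "fmt-drive-scsi0",
                   "write-threshold": 536870912 } }

  # QEMU emits this event once the guest writes past the threshold
  { "event": "BLOCK_WRITE_THRESHOLD",
    "data": { "node-name": "fmt-drive-scsi0",
              "amount-exceeded": 4096,
              "write-threshold": 536870912 } }

  # daemon side (sketch): lvextend the underlying LV, then re-arm the
  # threshold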

>>I haven't done any real world testing yet since I expect that the
>>delta for the next version won't be small either ;)

Yes, the biggest part is done. (I hope ^_^)

I'll try to finish the blockdev conversion of some missing parts (iSCSI
paths, UEFI, ...).
The main blocking point for me is the proxmox backup code, I 

Re: [pve-devel] [PATCH v1 pve-qemu 1/1] add block-commit-replaces option patch

2025-01-10 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
>>Maybe it could even be a bug then? 

Yes, it's a bug. I just think that libvirt currently only implements
block-commit with the disk blockdev as the top node.

Throttle groups are not currently implemented in libvirt (but I have
seen some commits to add support recently); they still use the old
throttling method.

>>In many situations, the filter >>nodes
>>on top (like throttle groups) are ignored/skipped to get to the
>>actually
>>interesting block node for certain block operations. 

Yes, and this option exists in the QMP blockdev-mirror (and block-commit
reuses the blockdev-mirror code behind the scenes).


>>Are there any situations where you wouldn't want to do that in the
>>block-commit case?
Mmm, I think it should always be reattached to the disk (the format
block node, or the file block node if no format node exists). I really
don't know how to code this; I have just reused the blockdev-mirror
approach.


Feel free to clean up this patch and submit it to the QEMU devs, you are
a better C developer than me ^_^


--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH v3 qemu-server 11/11] qcow2: add external snapshot support

2025-01-20 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
Hi Fabio !

>>
>>In this implementation I don't see the possibility of using them on
>>raw 
>>disks (on files) from a fast look, or am I wrong? If so, why? I think
>>the main use would be in cases like that where you don't have
>>snapshot 
--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH v3 qemu-server 08/11] blockdev: convert drive_mirror to blockdev_mirror

2025-01-20 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
>>the path referenced in the running VM is stable. the path you are
>>looking for in the graph is not. e.g., the path might be something
>>some storage software returns. or udev. or .. and that can change
>>with any software upgrade or not be 100% deterministic in the first
>>place.
>>
>>let's say you start the VM today and the path returned by the RBD
>>storage plugin is /dev/rbd/XYZ, so that is how the blockdev is
>>opened/the path is recorded. in two weeks, ceph gets updated and now
>>the udev rule or the storage plugin code changes to return the more
>>deterministic /dev/rbd/POOL/XYZ. now the paths don't match anymore.
>>(this is just an example where such a thing happened in practice
>>already ;)).

Yes, got it, thanks!


I have begun to work on the v4 with all your comments. I think it
should work with a hash of the volid, now that I have fixed the live
renaming (that was really the main blocker).

I'll try to send a v4 before FOSDEM.

Thanks for the review and your time ! 




--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH v3 qemu-server 11/11] qcow2: add external snapshot support

2025-01-20 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
>>out of curiosity, besides the obvious cases where external snapshots 
>>would be useful, i.e. on raw files and on lvm (not lvm-thin), what
>>other 
>>cases would be useful given that they already have snapshot support 
>>(qcow2 with the internal one, zfs, btrfs, lvm-thin etc)?

>>not having it on raw files I think would take away a lot of its 
>>usefulness, but I could be wrong if there are uses I can't imagine, 
>>maybe make something more possible with replicas and migrations? (I 
>>haven't looked into it enough)

I think they could be used for remote replication with snapshot
export/import if the storage doesn't support it (lvm-thin for example).

I'm planning to add pseudo thin-provisioning too on shared LVM block
storage (an LVM volume smaller than the qcow2 format, with dynamic LVM
size increases through a special Proxmox daemon).



Note that for .raw, it's still possible to add support later, if users
really want it.

Just keep it simple for a first version, then add more features over
time :)

Also, the target is really network storage (qcow2 on NFS/CIFS, LVM +
qcow2 on SAN block); that's why I think network latency also hides the
performance difference between qcow2 && raw.

For local storage with super fast NVMe, it doesn't make too much sense
to use qcow2: you can use ZFS, lvm-thin, ..., where internal snapshots
work great :)







--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH v3 qemu-server 09/11] blockdev: mirror: change aio on target if io_uring is not default.

2025-01-13 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
> +    #change aio if io_uring is not supported on target
> +    if ($dst_drive->{aio} && $dst_drive->{aio} eq 'io_uring') {
> + my ($dst_storeid) = PVE::Storage::parse_volume_id($dst_drive-
> >{file});
> + my $dst_scfg = PVE::Storage::storage_config($storecfg,
> $dst_storeid);
> + my $cache_direct = drive_uses_cache_direct($dst_drive, $dst_scfg);
> + if(!storage_allows_io_uring_default($dst_scfg, $cache_direct)) {
> +     $dst_drive->{aio} = $cache_direct ? 'native' : 'threads';
> + }
> +    }

>>couldn't/shouldn't we just handle this in generate_file_blockdev?

Yes, better to reuse the existing code to avoid differences. I'll do it.





--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH v3 qemu-server 08/11] blockdev: convert drive_mirror to blockdev_mirror

2025-01-13 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---


> +    my $path = PVE::Storage::path($storecfg, $volid);

>>is this guaranteed to be stable? also across versions? and including
>>external storage plugins?

It can't be different from the value we used for command line
generation. But I think I should use $path directly (it works for
block/file, but I think it won't work with ceph, gluster, ...).
I need to reuse the code used to generate the blockdev command line.

Another way, maybe a better one, is to parse the tree from the top node
(the throttle group), where the name is fixed, and look for the fmt|file
chain attached to this node.

(I just need to check the live renaming case, where we have 2 file
nodes, with the newer file node not attached to the tree before the
switch.)



> +
> +    my $node = find_blockdev_node($nodes, $path, 'fmt');

>>that one is only added in a later patch.. but I don't think lookups
>>by path are a good idea, we should probably have a deterministic node
>>naming concept instead? e.g., encode the drive + snapshot name?

I really would like to have something deterministic, but:

- device node names are 31 characters max (snapshot names can be longer)
- we can't rename a node (but we are renaming files for snapshots over
time)

As Fiona said, we could have random names and do one lookup each time to
list them.

(I really need to have our own names, because blockdev-reopen, for live
renaming of files, does not work with the autogenerated #block names.)





> +    return $node->{'node-name'};
> +}
> +
> +sub get_blockdev_nextid {
> +    my ($nodename, $nodes) = @_;
> +    my $version = 0;
> +    for my $nodeid (keys %$nodes) {
> + if ($nodeid =~ m/^$nodename-(\d+)$/) {
> +     my $current_version = $1;
> +     $version = $current_version if $current_version >= $version;
> + }
> +    }
> +    $version++;
> +    return "$nodename-$version";

>>since we shouldn't ever have more than one job for a drive running
(right?), couldn't we just have a deterministic name for this? that
>>would also simplify cleanup, including cleanup of a failed cleanup ;)

Still the same:
- you need 2 file nodes at the same time for live renaming
- you can have 2 fmt nodes at the same time for blockdev-mirror








--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH v3 qemu-server 10/11] blockdev: add backing_chain support

2025-01-13 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
>  
> +sub generate_backing_blockdev {
> +    my ($storecfg, $snapshots, $deviceid, $drive, $id) = @_;
> +
> +    my $snapshot = $snapshots->{$id};
> +    my $order = $snapshot->{order};
> +    my $parentid = $snapshot->{parent};
> +    my $snap_fmt_nodename = "fmt-$deviceid-$order";
> +    my $snap_file_nodename = "file-$deviceid-$order";

>>would it make sense to use the snapshot name here instead of the
>>order? that would allow a deterministic mapping even when snapshots
>>are removed..

31 character limit for node names :(

If we were able to encode the full path, I think it would be easier. I
don't know if QEMU could be patched to allow more characters for node
names.

With a unique id per file/path (not per drive), I think it could work
with mirror/replace/...




> +
> +    my $snap_file_blockdev = generate_file_blockdev($storecfg,
> $drive, $snap_file_nodename);
> +    $snap_file_blockdev->{filename} = $snapshot->{file};
> +    my $snap_fmt_blockdev = generate_format_blockdev($storecfg,
> $drive, $snap_fmt_nodename, $snap_file_blockdev, 1);
> +    $snap_fmt_blockdev->{backing} =
> generate_backing_blockdev($storecfg, $snapshots, $deviceid, $drive,
> $parentid) if $parentid;
> +    return $snap_fmt_blockdev;
> +}
> +
> +sub generate_backing_chain_blockdev {
> +    my ($storecfg, $deviceid, $drive) = @_;
> +
> +    my $volid = $drive->{file};
> +    my $do_snapshots_with_qemu = do_snapshots_with_qemu($storecfg,
> $volid, $deviceid);
> +    return if !$do_snapshots_with_qemu || $do_snapshots_with_qemu !=
> 2;
> +
> +    my $chain_blockdev = undef;
> +    PVE::Storage::activate_volumes($storecfg, [$volid]);
> +    #should we use qemu config to list snapshots ?

>>from a data consistency PoV, trusting the qcow2 metadata is probably
>>safer.. 
(I asked about this because we need to activate volumes for this, and
currently we activate them after the config generation.)
With this code, the volumes are activated when we generate the command
line, but if the user has a wrong config, we need to handle deactivation
too.

>>but we could check that the storage and the config agree, and error
>>out otherwise?

Yes, I was thinking about this. We can list volumes from the storage
without needing to activate them, then check that all volumes from the
VM config are present.
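
Something like this, just to illustrate the idea (the helper name is
made up, not the final API):

  # sketch: make sure every snapshot referenced in the VM config
  # has a matching volume/overlay on the storage
  my $storage_snaps = storage_list_snapshots($storecfg, $volid);  # no activation needed
  for my $snap (sort keys %{$conf->{snapshots}}) {
      die "snapshot '$snap' of '$volid' not found on storage\n"
          if !$storage_snaps->{$snap};
  }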




--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH v3 qemu-server 04/11] blockdev: vm_devices_list : fix block-query

2025-01-12 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
 Original Message 
From: Fabian Grünbichler 
To: Proxmox VE development discussion 
Cc: Alexandre Derumier 
Subject: Re: [pve-devel] [PATCH v3 qemu-server 04/11] blockdev:
vm_devices_list : fix block-query
Date: 08/01/2025 15:31:36


> Alexandre Derumier via pve-devel wrote on 16.12.2024 at 10:12 CET:

> Look at qdev value, as cdrom drives can be empty
> without any inserted media


>>is this needed if we don't drive_del the cdrom drive when ejecting
>>the medium?

The original code is buggy in my opinion, because vm_devices_list should
list devices (the media device reader), not (drives/blockdevs) -> the
media.

We can't list an empty device without media without this.


(We don't drive_del the CD-ROM drive (device); they are IDE devices,
and can't be removed online.)

> 
> Signed-off-by: Alexandre Derumier  cyllene.com>
> ---
>  PVE/QemuServer.pm | 7 +++
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/PVE/QemuServer.pm b/PVE/QemuServer.pm
> index baf78ec0..3b33fd7d 100644
> --- a/PVE/QemuServer.pm
> +++ b/PVE/QemuServer.pm
> @@ -4425,10 +4425,9 @@ sub vm_devices_list {
>  }
>  
>  my $resblock = mon_cmd($vmid, 'query-block');
> -    foreach my $block (@$resblock) {
> - if($block->{device} =~ m/^drive-(\S+)/){
> - $devices->{$1} = 1;
> - }
> +    $resblock = { map { $_->{qdev} => $_ } $resblock->@* };
> +    foreach my $blockid (keys %$resblock) {
> + $devices->{$blockid} = 1;
>  }
>  
>  my $resmice = mon_cmd($vmid, 'query-mice');
> -- 
> 2.39.5


--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH-SERIES v3 pve-storage/qemu-server/pve-qemu] add external qcow2 snapshot support

2025-01-13 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---


>>For almost all QMP commands, we only need to care about the node
>>that's
>>inserted for the drive.
(Yes, that's the throttle group in my implementation, and I have a
fixed name; I'm reusing the "drive-(ide|scsi|virtio)X" naming.)

>>And for your use-case, checking that the top
>>node of the chain matches what we expect is already a good first
>>step.
>>The lookup itself is a different question, I'll also answer to the
>>other
>>mail.


Maybe this could help to understand the problem:

Here is a small summary of the 2 workflows, snapshot && block mirror,
where we have a switch between nodes:



snapshot

a) renaming current

1)
device--->throttle-group--->fmt-node1--->file-node1--->vm-100-disk-0.qcow2

2) create a new file node with the new file name

device--->throttle-group--->fmt-node1--->file-node1--->vm-100-disk-0.qcow2

file-node2--->vm-100-disk-0-snap1.qcow2

3) switch the file node with blockdev-reopen

device--->throttle-group--->fmt-node1--->file-node2--->vm-100-disk-0-snap1.qcow2

file-node1--->vm-100-disk-0.qcow2

4) delete the old file node

device--->throttle-group--->fmt-node1--->file-node2--->vm-100-disk-0-snap1.qcow2


b) create the new current

1) add a new fmt node

device--->throttle-group--->fmt-node1--->file-node2--->vm-100-disk-0-snap1.qcow2

fmt-node3--->file-node3--->vm-100-disk-0.qcow2

2) blockdev-snapshot -> set fmt-node3 active with fmt-node1 as backing

device--->throttle-group--->fmt-node3--->file-node3--->vm-100-disk-0.qcow2
                                |
                                |--> fmt-node1--->file-node2--->vm-100-disk-0-snap1.qcow2


mirror

1)
device--->throttle-group--->fmt-node1--->file-node1--->vm-100-disk-0.qcow2

2) add a new target fmt + file node

device--->throttle-group--->fmt-node1--->file-node1--->vm-100-disk-0.qcow2

fmt-node2--->file-node2--->vm-100-disk-1.qcow2

3) blockdev-mirror (mirror + switch the fmt node on complete)

device--->throttle-group--->fmt-node2--->file-node2--->vm-100-disk-1.qcow2

fmt-node1--->file-node1--->vm-100-disk-0.qcow2

4) delete the old fmt+file node

device--->throttle-group--->fmt-node2--->file-node2--->vm-100-disk-1.qcow2
--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH-SERIES v3 pve-storage/qemu-server/pve-qemu] add external qcow2 snapshot support

2025-01-13 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
> Hmm, sounds like it might a bug, I can look into it. If really
> required
> to make it work, we can still set fixed node-names on the
> commandline,
> but also query them before usage to be sure we have the correct, i.e.
> currently inserted node.

>>AFAICT, this is because the node name is not set in the original
>>options
>>for the block driver state and then it wrongly detects an attempt to
>>change the node name (even if specifying the correct auto-generated
>>one
>>during reopen). However, it is rather ugly to try and use a -drive
>>together with blockdev-reopen in any case, blockdev-reopen is really
>>written with -blockdev in mind and -blockdev necessarily requires
>>setting node-name up front.

you are too fast ^_^

>>I don't think it's even worth fixing that bug. We should use
>>blockdev-reopen only after switching to -blockdev, it's much nicer
>>like
>>that :)
>>
>>I'd still be in favor of querying the node-name of drives with
>>query-block before doing QMP operations though. Like that, we can
>>warn/error if it doesn't match what we expect for example, to catch
>>unexpected situations.

OK, so the question is: how to query the node-names of the different
disks (+ snapshot chain)?

Fabian is against using the path.

So, I'm a little bit out of ideas ^_^






--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH v3 qemu-server 11/11] qcow2: add external snapshot support

2025-01-13 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
> 
> 
> > > should this maybe have been vdisk_alloc and it just works by
> accident?
> It's not works fine with vdisk_alloc, because the volume need to be
> created without the size specified but with backing file param
> instead.
> (if I remember, qemu-img is looking at the backing file size+metadas
> and set the correct total size for the new volume)

>>I am not sure I follow.. we create a snapshot, but then we pretend it
>>isn't a file with backing file when passing it to qemu? this seems
>>wrong.. IMHO we should just allocate (+format) here, and then let
>>qemu do the backing link up instead of this confusing (and error-
>>prone, as it masks problems that should be a hard error!) call..
>>
>>
>>
>>
> >>Maybe a better way could be to reuse vdisk_alloc, and add backing
> >>file
> >>as param ?
>>
>>that seems.. wrong as well? the file can never be bigger just because
>>it has a backing file right? why can't we just allocate and format
>>the regular volume?

I need to redo the tests as I don't remember exactly.

From memory (I wrote it 2 months ago, sorry ^_^), it was maybe:
 - related to the metadata prealloc size + LVM size

 or (but I need to verify)

 The "blockdev-snapshot" QMP command later only changes the backing file
in memory in the block graph, but not inside the file itself (so after a
restart of the process, the chain is broken).


> > +    my $new_file_blockdev = generate_file_blockdev($storecfg,
> > $drive, $new_current_file_nodename);
> > +    my $new_fmt_blockdev = generate_format_blockdev($storecfg,
> > $drive, $new_current_fmt_nodename, $new_file_blockdev);
> > +
> > +    $new_fmt_blockdev->{backing} = undef;
> 
> > > generate_format_blockdev doesn't set backing? 
> yes, it's adding backing
> 
> > > maybe this should be >>converted into an assertion?
> 
But there is a limitation of the QMP blockdev-add + blockdev-snapshot
combination, where the backing attribute needs to be undef in the
blockdev-add, or the blockdev-snapshot will fail because it tries to set
the backing file itself when doing the switch.

From my tests, it was related to this

>>yeah, what I mean is:
>>
>>generate_format_blockdev doesn't (and should never) set backing. so
>>setting it to undef has no effect. but we might want to assert that
>>it *is* undef, so that if we ever mistakenly change
>>generate_format_blockdev to set backing, it will be caught instead of
>>silently being papered over..

You need it if you want to have known block node-names for snapshots.
If backing is not defined, the backing chain is auto-generated with
random #block node-names (so rename/blockdev-reopen can't work).





--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH v3 qemu-server 11/11] qcow2: add external snapshot support

2025-01-13 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---

> Alexandre Derumier via pve-devel wrote on 16.12.2024 at 10:12 CET:

>>it would be great if there'd be a summary of the design choices and a
>>high level summary of what happens to the files and block-node-graph
>>here. it's a bit hard to judge from the code below whether it would
>>be possible to eliminate the dynamically named block nodes, for
>>example ;)

Yes, sorry, I'll add more info about the QEMU limitations and why I'm
doing it like this.

>>a few more comments documenting the behaviour and ideally also some
>>tests (mocking the QMP interactions?) would be nice
Yes, I'll add tests; I need to find a good way to mock it.


> +
> +    #preallocate add a new current file
> +    my $new_current_fmt_nodename = get_blockdev_nextid("fmt-
> $deviceid", $nodes);
> +    my $new_current_file_nodename = get_blockdev_nextid("file-
> $deviceid", $nodes);

>>okay, so here we have a dynamic node name because the desired target
>>name is still occupied. could we rename the old block node first?

we can't rename a node, that's the problem.


> +    PVE::Storage::volume_snapshot($storecfg, $volid, $snap);

>>(continued from above) and this invocation here are the same??
The invocation is the same, but it doesn't do the same thing if it's an
external snapshot.

>> wouldn't this already create the snapshot on the storage layer? 
Yes, it creates the (LVM volume) + qcow2 file with preallocation.

>>and didn't we just hardlink + reopen + unlink to transform the
>>previous current volume into the snap volume?
Yes.
Here, we are creating the new current volume, adding it to the graph
with blockdev-add, and then finally switching to it with blockdev-snapshot.

The ugly trick in pve-storage is in Plugin.pm:
#rename current volume to snap volume
rename($path, $snappath) if -e $path && !-e $snappath;
or in the LVM plugin:
eval { lvrename($vg, $volname, $snap_volname) } ;


(and you have already made comments about them ;))


because I'm re-using volume_snapshot (so I didn't have to add a new
method in pve-storage) to allocate the snapshot file, but indeed, we
have already done the rename online.


>>should this maybe have been vdisk_alloc and it just works by
accident?
It doesn't work with vdisk_alloc, because the volume needs to be
created without the size specified but with the backing file param
instead.
(If I remember correctly, qemu-img looks at the backing file size +
metadata and sets the correct total size for the new volume.)


Maybe a better way could be to reuse vdisk_alloc and add the backing
file as a param?



> +    my $new_file_blockdev = generate_file_blockdev($storecfg,
> $drive, $new_current_file_nodename);
> +    my $new_fmt_blockdev = generate_format_blockdev($storecfg,
> $drive, $new_current_fmt_nodename, $new_file_blockdev);
> +
> +    $new_fmt_blockdev->{backing} = undef;

>>generate_format_blockdev doesn't set backing? 
yes, it's adding backing

>>maybe this should be >>converted into an assertion?

but they are a limitation of the qmp blockdev-ad ++blockdev-snapshot
where the backing attribute need undef in the blockdev-add or the
blockdev-snapshot will fail because it's trying itself to set the
backing file when doing the switch.

From my test, it was related to this
https://lists.gnu.org/archive/html/qemu-block/2019-10/msg01404.html



> +    PVE::QemuServer::Monitor::mon_cmd($vmid, 'blockdev-add',
> %$new_fmt_blockdev);
> +    mon_cmd($vmid, 'blockdev-snapshot', node => $format_nodename,
> overlay => $new_current_fmt_nodename);
> +}
> +
> +sub blockdev_snap_rename {
> +    my ($storecfg, $vmid, $deviceid, $drive, $src_path,
> $target_path) = @_;

>>I think this whole thing needs more error handling and thought about
>>how to recover from various points failing.. 
Yes, that's the problem with renaming, it's not atomic.

Also, if we need to recover (rollback), how do we manage multiple disks?



>>there's also quite some overlap with blockdev_current_rename, I
>>wonder whether it would be possible to simplify the code further by
>merging the two? but see below, I think we can even get away with
>>dropping this altogether if we switch from block-commit to block-
>>stream..
Yes, I have separated them because I was not sure about the different
workflows, and it was simpler to fix one method without breaking the
other.

I'll look into implementing block-stream (and keep commit into the
initial image for the last snapshot delete).


> +    #untaint
> +    if ($src_path =~ m/^(\S+)$/) {
> + $src_path = $1;
> +    }
> +    if ($target_path =~ m/^(\S+)$/) {
> + $target_path = $1;
> +    }

>>shouldn't that have happened in the storage plugin?
>>
> +
> +    #create a hardlink
> +    link($src_path, $target_path);

>>should this maybe be done by the storage plugin?

This was to avoid introducing a new sub method, but yes, it would
indeed be better.

PVE::Storage::link  ?

> 
> +    #delete old $path link
> +    unlink($src_path);

and this

PVE::Storage::unlink  ?
(can't use free_image here, because we really want to remove the li

Re: [pve-devel] [PATCH v3 qemu-server 08/11] blockdev: convert drive_mirror to blockdev_mirror

2025-01-13 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
>>something like this was what I was afraid of ;) this basically means
>>we need to have some way to lookup the nodes based on the structure
>>of the graph, which probably also means verifying that the structure
>>matches the expected one (e.g., if we have X snapshots, we expect N
>>nodes, if we currently have operation A going on, there should be an
>>extra node, etc.pp. - and then we can "know" that the seventh node
>>from the bottom must be snapshot 'foobar' ;)). 

I think it's not much of a problem to follow the graph from top to
bottom (as everything attached has a parent-child relationship),
and

- for snapshots, we have the snapshot name in the filename,
so we can know whether it's a specific snap or the live image.


For the temporary nodes (when we do blockdev-add, before a mirror or
switch), we define the node name, so we don't need to parse the graph
there.



>>relying on $path being stable definitely won't work.

I really don't see why the path couldn't be stable?

Over time, if something changes in QEMU (for example, rbd:// with a new
param), it will only apply to the new QEMU process (after a restart or
live migration), so the path in the block node will be updated too.
(Live migration does not keep the old block-node info; the values used
are the QEMU command line arguments.)


And the file path is the only attribute of a node that you can't
update.





--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH v1 pve-qemu 1/1] add block-commit-replaces option patch

2025-01-14 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
> 
> Feel free to cleanup this patch and submit it to qemu devs, you are a
> better C developper than me ^_^

>>I can try to look into it, but could you give some more details how
>>exactly the issue manifests? What parameters are you using for
>>block-commit, how does the graph look like at the time of the
>>operation?
>>What error do you get without your patch or what exactly does not
>>work
>>in the block graph?

I'll try to redo the test today, but if I remember correctly, the
block-commit works, but the new fmt node (your top node node0) is not
attached to the throttle group.

Try to dump the graph before/after.
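
(e.g. something like this over QMP, to compare the graph before and
after the block-commit:

  {"execute": "query-named-block-nodes", "arguments": {"flat": true}}
  {"execute": "x-debug-query-block-graph"}
)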




My first try did not result in an error:

> #!/bin/bash
> rm -f /tmp/backing.qcow2
> rm -f /tmp/top.qcow2
> qemu-img create /tmp/backing.qcow2 -f qcow2 64M
> qemu-img create /tmp/top.qcow2 -f qcow2 64M
> qemu-system-x86_64 --qmp stdio \
> --nodefaults \
> --object throttle-group,id=thrgr0 \
> --blockdev qcow2,node-name=backing0,file.driver=file,file.filename=/tmp/backing.qcow2 \
> --blockdev throttle,node-name=drive-scsi0,throttle-group=thrgr0,file.driver=qcow2,file.node-name=node0,file.file.driver=file,file.file.filename=/tmp/top.qcow2,file.backing=backing0 \
> --device 'virtio-scsi-pci,id=virtioscsi0,bus=pci.0,addr=0x2' \
> --device 'scsi-hd,bus=virtioscsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0' \
> <<EOF
> {"execute": "qmp_capabilities"}
> {"execute": "query-block"}
> {"execute": "block-commit", "arguments": { "device": "drive-scsi0", "top-node": "node0", "base-node": "backing0", "job-id": "commit0" } }
> {"execute": "query-block"}
> {"execute": "quit"}
> EOF



--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH v3 qemu-server 08/11] blockdev: convert drive_mirror to blockdev_mirror

2025-01-14 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
> 

>>Upgrading libpve-storage-perl or an external storage plugin while the
>>VM
>>is running could lead to a different result for path() and thus
>>breakage, right?
mmm, yes, you are right


>>If we do need lookup, an idea to get around the character limit is
>>using
>>a hash of the information to generate the node name, e.g.
>>hash("fmt-$volid@$snapname"), hash("file-$volid@$snapname") or
>>whatever

Yes, I think it should work.

>>is actually needed as unique information. Even if we only use
>>lowercase
>>letters, we have 26 base chars, so 26^31 possible values.

Yes, I was thinking about a hash too, but I was not sure how to convert
it to the allowed characters (valid chars: alphanumeric, '-', '.' and
'_').



>>So hashes with up to
>>
> > > math.log2(26**31)
>>145.71363126237387
>>
>>bits can still fit, which should be more than enough. Even with an
>>enormous number of 2^50 block nodes (realistically, the max values we
>>expect to encounter are more like 2^10), the collision probability
>>(using a simple approximation for the birthday problem) would only be
>>
> > > d=2**145
> > > n=2**50
> > > 1 - math.exp(-(n*n)/(2*d))
>>1.4210854715202004e-14

Yes, that should be enough.

A simple MD5 is 128 bits,
SHA-1 is 160 bits (and it's ~150 bits of space with the extra '-', '.',
'_' characters).

Do you know a good hash algorithm?
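
Just as a sketch of what I have in mind (helper and prefix names are
examples only):

  use Digest::SHA qw(sha1_hex);

  # derive a stable node-name (max 31 chars) from the volid (+ snapname);
  # the hex digest only uses [0-9a-f], which is valid for node-names
  sub encode_nodename {
      my ($type, $volid, $snapname) = @_;
      my $key = defined($snapname) ? "$volid\@$snapname" : $volid;
      my $hash = substr(sha1_hex($key), 0, 30 - length($type));
      return "$type-$hash";   # e.g. "fmt-..." or "file-..."
  }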


--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH v3 qemu-server 11/11] qcow2: add external snapshot support

2025-01-13 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---


> 
> > > should this maybe have been vdisk_alloc and it just works by
> accident?
> It's not works fine with vdisk_alloc, because the volume need to be
> created without the size specified but with backing file param
> instead.
> (if I remember, qemu-img is looking at the backing file size+metadas
> and set the correct total size for the new volume)

>>I am not sure I follow.. we create a snapshot, but then we pretend it
>>isn't a file with backing file when passing it to qemu? this seems
>>wrong.. IMHO we should just allocate (+format) here, and then let
>>qemu do the backing link up instead of this confusing (and error-
>>prone, as it masks problems that should be a hard error!) call..
>>
>>
>>
>>
> >>Maybe a better way could be to reuse vdisk_alloc, and add backing
> >>file
> >>as param ?
>>
>>that seems.. wrong as well? the file can never be bigger just because
>>it has a backing file right? why can't we just allocate and format
>>the regular volume?

>>I need to redo tests as I don't remember exactly.
>>
>>From memory (I wrote it 2 months ago, sorry ^_^ ), it was maybe:
>> - related to the metadata prealloc size + lvm size.

>> or (but I need to verify)
>>
>> the "blockdev-snapshot" qmp call later only changes the backing-file in
>>memory in the block graph, but not inside the file itself. (so after a
>>restart of the process, the chain is broken).

Ok, I have redone the tests and verified it: it is the second case.

The problem is that blockdev-snapshot doesn't actually rewrite the
backing file inside the qcow2 file.

It is still explained at the same link:
https://lists.gnu.org/archive/html/qemu-block/2019-10/msg01404.html


So the only way is to record the backing file at qcow2 creation time.

And that's why we need to pass backing=undef when doing blockdev-add,
because otherwise qemu will try to open the backing file that is
already opened/locked.

This doesn't seem to be fixed in the latest qemu version, and libvirt is
still doing it this way too.
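
For reference, a rough sketch of that sequence (untested; the paths, node
names and $vmid are made up, using the usual run_command/mon_cmd helpers):

use PVE::Tools qw(run_command);
use PVE::QemuServer::Monitor qw(mon_cmd);

my $vmid = 100;
my $snappath = "/var/lib/vz/images/100/vm-100-disk-0-snap1.qcow2"; # already open in qemu
my $path     = "/var/lib/vz/images/100/vm-100-disk-0.qcow2";       # new current overlay

# 1) create the new overlay with the backing file recorded in the qcow2
#    metadata (qemu-img takes the size from the backing file)
run_command(['/usr/bin/qemu-img', 'create', '-f', 'qcow2',
    '-b', $snappath, '-F', 'qcow2', $path]);

# 2) attach it with backing => undef (serialized as JSON null), so qemu does
#    not try to open the already opened/locked snapshot file a second time
mon_cmd($vmid, 'blockdev-add',
    driver => 'qcow2',
    'node-name' => 'fmt-new-current',
    file => { driver => 'file', filename => $path },
    backing => undef,
);

# 3) link the new overlay on top of the already open snapshot node
mon_cmd($vmid, 'blockdev-snapshot',
    node => 'fmt-snap1',
    overlay => 'fmt-new-current',
);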




--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH v3 qemu-server 08/11] blockdev: convert drive_mirror to blockdev_mirror

2025-01-15 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---

> Fiona Ebner wrote on 15.01.2025 at 11:06 CET:
> 
>  
> On 15.01.25 at 10:51, Fabian Grünbichler wrote:
> > 
> > basically what we have is the following situation:
> > 
> > - we have some input data (volid+snapname)
> > - we have a key derived from the input data (block node name)
> > - we have a value (block node)
> > - we need to be able to map back the block node (name) to the
> > input data
> 
> Oh, we need to map back too? But that can be done via filename in the
> block node, or not?

>>but that filename is the result of PVE::Storage::path which is not
>>stable, so we can't compare that?

>>for snapshot operations, we need to find out "which block node is the
>>one for the snapshot volume". we can't rely on the filename in the
>>block graph for that, because how we map from volid+snapname to that
>>filename might have changed on our end since that block node was set
>>up. 

The "filename" attribute never change in an existing file-nodename.
(It's impossible to update this attribute).

but we can replace/reopen the file-nodename by another file-nodename as
child of the fmt-nodename.


and to respond to my previous mail, for live rename, I have done tests,
I can reopen the throttle-filter with a new fmt+file nodes couple with
the new filename.

so a hash of (fmt|file)-hash(volid+snapname) should be enough.



workflow for snapshot create:

1)

throttlefilter (drive-scsi0) ---> fmt-(hash(local:vm-disk-0)) ---> file-(hash(local:vm-disk-0)) ---> filename=/path/to/vm-disk-0.qcow2

take a snap1

2) a) create a hardlink && add a new fmt|file couple

throttlefilter (drive-scsi0) ---> fmt-(hash(local:vm-disk-0)) ---> file-(hash(local:vm-disk-0)) ---> filename=/path/to/vm-disk-0.qcow2

fmt-(hash(local:vm-disk-0-snap1)) ---> file-(hash(local:vm-disk-0-snap1)) ---> filename=/path/to/vm-disk-0-snap1.qcow2


b) swap the fmt node with blockdev-reopen && delete the old fmt node

throttlefilter (drive-scsi0) ---> fmt-(hash(local:vm-disk-0-snap1)) ---> file-(hash(local:vm-disk-0-snap1)) ---> filename=/path/to/vm-disk-0-snap1.qcow2

c) create a new current fmt-node with snap1 as backing fmt-node with
blockdev-snapshot


throttlefilter (drive-scsi0) ---> fmt-(hash(local:vm-disk-0)) ---> file-(hash(local:vm-disk-0)) ---> filename=/path/to/vm-disk-0.qcow2
   |
   |-> fmt-(hash(local:vm-disk-0-snap1)) ---> file-(hash(local:vm-disk-0-snap1)) ---> filename=/path/to/vm-disk-0-snap1.qcow2
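
(step b) in rough QMP terms, as a sketch using the usual mon_cmd helper;
the node names are the hypothetical hash-based ones, and the throttle
node/group follow the drive-scsi0 / throttle-drive-scsi0 naming from above:)

# reopen the throttle filter so that its child becomes the new fmt node
# created in step a); blockdev-reopen takes the full options of the node
mon_cmd($vmid, 'blockdev-reopen', options => [{
    driver => 'throttle',
    'node-name' => 'drive-scsi0',
    'throttle-group' => 'throttle-drive-scsi0',
    file => 'fmt-hash-vm-disk-0-snap1',
}]);

# the old fmt/file couple is now unreferenced and can be deleted
mon_cmd($vmid, 'blockdev-del', 'node-name' => 'fmt-hash-vm-disk-0');
mon_cmd($vmid, 'blockdev-del', 'node-name' => 'file-hash-vm-disk-0');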







--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH v3 qemu-server 08/11] blockdev: convert drive_mirror to blockdev_mirror

2025-01-16 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
>>Yes, we don't need much to get enough collision-resistance. Just
>>wanted
>>to make sure and check it explicitly.


I have done some tests with sha1 and base62 encoding ('0'..'9', 'A'..'Z', 'a'..'z')


the node-name is required to start with an alphabetic character prefix


encode_base62(sha1("$volid-$snapid")) : 5zU4nVxN7gIUWMaskKc4y6EawWu

at most 27 characters

so we have space for a prefix, for example:

f-5zU4nVxN7gIUWMaskKc4y6EawWu  for the fmt-node
e-5zU4nVxN7gIUWMaskKc4y6EawWu  for the file-node




use Math::BigInt;

sub encode_base62 {
    my ($input) = @_;

    my @chars = ('0'..'9', 'A'..'Z', 'a'..'z');
    my $base = 62;

    # a 160-bit sha1 digest does not fit into a native perl integer,
    # so accumulate the value as a big integer to avoid precision loss
    my $value = Math::BigInt->new(0);
    foreach my $byte (unpack('C*', $input)) {
	$value = $value * 256 + $byte;
    }

    # repeatedly divide by 62 and collect the remainders as digits
    my $result = '';
    while ($value > 0) {
	$result = $chars[$value % $base] . $result;
	$value = $value / $base; # Math::BigInt division is integer division
    }

    return $result || '0';
}
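
and then the node names could be derived with something like this
(sketch, the helper name is made up):

use Digest::SHA qw(sha1);

# hypothetical helper: derive stable fmt-/file- node names for a volume
# (optionally at a given snapshot)
sub get_blockdev_nodenames {
    my ($volid, $snapname) = @_;
    my $key = defined($snapname) ? "$volid-$snapname" : $volid;
    my $hash = encode_base62(sha1($key));
    return {
	fmt  => "f-$hash",   # format node (qcow2/raw/... driver)
	file => "e-$hash",   # file node (path to the image/device)
    };
}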

--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH v3 pve-storage 1/3] qcow2: add external snapshot support

2025-01-10 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
>>yes, for the "first" snapshot that is true (since that one is
>>basically the baseline data, which will often be huge compared to the
>>snapshot delta). but streaming (rebasing) saves us the rename, which
>>makes the error handling a lot easier/less risky. maybe we could
>special case the first snapshot as a performance optimization? ;)

Ah, that's a good point indeed. Yes, I think it's a good idea: commit
into the "first" snapshot, and rebase/stream for the others. I'll look
into implementing this.

> 
> 
> This one was for when we want to retry in case of error, if we have
> multiple disks. (for example, on the first snapshot delete api call the
> first disk removes the snapshot, but a bug occurs and the second disk
> doesn't remove its snapshot). 
> 
> The user could want to unlock the vm-snapshot lock and fix it
> manually
> by calling the snapshot delete again.
> 
> I'm not sure how to handle this correctly ?

>>I think the force parameter for snapshot deletion covers this
>>already, and it should be fine for this to die..

Ah, ok, I was not aware of this parameter! thanks.

> 
> 
> > + } else {
> > +     print"commit $snappath\n";
> > +     $cmd = ['/usr/bin/qemu-img', 'commit', $snappath];
> 
> > > leftover from previous version? not used/overwritten below ;)
> 
> no, this is really to commit the snapshot to the parent

>>but it is not executed..

Ah, ok ! sorry ! I think I dropped some code during the rebase before
sending the patches, because I had tested it a lot!



> this is for this usecase :
> 
> A <- B <- C
> you commit B to A, then you need to change the backing file of C to A
> (instead of B)
> 
> A <- C

>>but this is the wrong semantics.. the writes/delta in B need to go to
>>C (they happened after A), not to A!

I think they can go to A (commit) or C (stream)

here is an example:

current (1TB)
 - take snap A

(A (1TB) <-- new current 500MB (backing file A))

- take snap B

(A (1TB) <-- B 500MB (backing file A) <-- new current 10MB (backing file B))


Then, you want to delete B.


so, you stream it into current. (so you copy 500MB into current in this
example)

Then, you want to delete snapshot A.
you don't want to stream A into current, because A is the big initial image.
So, instead, you need to commit current into A (with the extra 500MB).


So, if you have a lot of snapshots to delete, you would copy the same
data each time into the upper snapshot for nothing, because at the end
we are going to commit into the initial "first" snapshot/image anyway.
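
(a rough offline illustration of the example above using qemu-img; online
it would be the block-stream / block-commit jobs instead, and the paths
here are made up:)

use PVE::Tools qw(run_command);

my $a_path  = "/path/to/vm-disk-0-snapA.qcow2";  # big initial snapshot
my $b_path  = "/path/to/vm-disk-0-snapB.qcow2";  # small intermediate snapshot
my $current = "/path/to/vm-disk-0.qcow2";        # active overlay

# delete B: stream/rebase its (small) delta into the child, then drop it
run_command(['/usr/bin/qemu-img', 'rebase', '-b', $a_path, '-F', 'qcow2', $current]);
unlink($b_path) or die "unlink '$b_path' failed - $!\n";

# delete A: don't stream the huge base into current; commit current down
# into A instead and let A take current's place
run_command(['/usr/bin/qemu-img', 'commit', $current]);
rename($a_path, $current) or die "rename '$a_path' to '$current' failed - $!\n";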





--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH v3 qemu-server 03/11] blockdev : convert qemu_driveadd && qemu_drivedel

2025-01-10 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
 Message initial 
De: Fabian Grünbichler 
À: Proxmox VE development discussion 
Cc: Alexandre Derumier 
Objet: Re: [pve-devel] [PATCH v3 qemu-server 03/11] blockdev : convert
qemu_driveadd && qemu_drivedel
Date: 08/01/2025 15:26:37


> Alexandre Derumier via pve-devel wrote on 16.12.2024 at 10:12 CET:
> fixme/testme :
> PVE/VZDump/QemuServer.pm:    eval {
> PVE::QemuServer::qemu_drivedel($vmid, "tpmstate0-backup"); };
> 
> Signed-off-by: Alexandre Derumier  cyllene.com>
> ---
>  PVE/QemuServer.pm | 64 +
> --
>  1 file changed, 45 insertions(+), 19 deletions(-)
> 
> diff --git a/PVE/QemuServer.pm b/PVE/QemuServer.pm
> index 2832ed09..baf78ec0 100644
> --- a/PVE/QemuServer.pm
> +++ b/PVE/QemuServer.pm
> @@ -1582,6 +1582,42 @@ sub print_drive_throttle_group {
>     return $throttle_group;
>  }
>  
> +sub generate_throttle_group {
> +    my ($drive) = @_;
> +
> +    my $drive_id = get_drive_id($drive);
> +
> +    my $throttle_group = { id => "throttle-drive-$drive_id" };
> +    my $limits = {};
> +
> +    foreach my $type (['', '-total'], [_rd => '-read'], [_wr => '-write']) {
> + my ($dir, $qmpname) = @$type;
> +
> + if (my $v = $drive->{"mbps$dir"}) {
> +     $limits->{"bps$qmpname"} = int($v*1024*1024);
> + }
> + if (my $v = $drive->{"mbps${dir}_max"}) {
> +     $limits->{"bps$qmpname-max"} = int($v*1024*1024);
> + }
> + if (my $v = $drive->{"bps${dir}_max_length"}) {
> +     $limits->{"bps$qmpname-max-length"} = int($v)
> + }
> + if (my $v = $drive->{"iops${dir}"}) {
> +     $limits->{"iops$qmpname"} = int($v);
> + }
> + if (my $v = $drive->{"iops${dir}_max"}) {
> +     $limits->{"iops$qmpname-max"} = int($v);
> + }
> + if (my $v = $drive->{"iops${dir}_max_length"}) {
> +     $limits->{"iops$qmpname-max-length"} = int($v);
> + }
> +   }
> +
> +   $throttle_group->{limits} = $limits;
> +
> +   return $throttle_group;

>>this and the corresponding print sub are exactly the same, so the
>>print sub could call this and join the limits with the `x-` prefix
>>added? 
yes we could merge them.

Currently, the command line can't define complex QOM objects (this
should be available soon, the qemu devs are working on it). That's why
it's using a different syntax with x-.
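
for example, the same limits expressed in both forms would look roughly
like this (values made up):

# QMP / object-add form, with the structured 'limits' member:
my $qmp_form = {
    'qom-type' => 'throttle-group',
    id         => 'throttle-drive-scsi0',
    limits     => { 'bps-read' => 100 * 1024 * 1024, 'iops-write' => 1000 },
};

# command line form, where nested members can't be expressed (yet), so the
# flat, unstable x- properties have to be used instead:
my $cmdline_form = 'throttle-group,id=throttle-drive-scsi0'
    . ',x-bps-read=' . (100 * 1024 * 1024)
    . ',x-iops-write=1000';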


>>how does this interact with the qemu_block_set_io_throttle helper
>>used when updating the limits at runtime?
It still works with block_set_io_throttle, where you specify the
device. (the throttling values are applied to the top node attached to
the device)


> +}
> +
>  sub generate_file_blockdev {
>  my ($storecfg, $drive, $nodename) = @_;
>  
> @@ -4595,32 +4631,22 @@ sub qemu_iothread_del {
>  }
>  
>  sub qemu_driveadd {
> -    my ($storecfg, $vmid, $device) = @_;
> +    my ($storecfg, $vmid, $drive) = @_;
>  
> -    my $kvmver = get_running_qemu_version($vmid);
> -    my $io_uring = min_version($kvmver, 6, 0);
> -    my $drive = print_drive_commandline_full($storecfg, $vmid,
> $device, undef, $io_uring);
> -    $drive =~ s/\\//g;
> -    my $ret = PVE::QemuServer::Monitor::hmp_cmd($vmid, "drive_add
> auto \"$drive\"", 60);
> -
> -    # If the command succeeds qemu prints: "OK"
> -    return 1 if $ret =~ m/OK/s;
> +    my $drive_id = get_drive_id($drive);
> +    my $throttle_group = generate_throttle_group($drive);

>>do we always need a throttle group? or would we benefit from only
>>adding it when limits are set, and skip that node when I/O is
>>unlimited?

Not having it always would add a lot of complexity, because it's not
always possible to insert a new blockdev (here the throttle group)
between the device and the drive blockdev later, when a blockdev is
already the top node attached to the device.

the other benefit is to have a stable name for the top block node
(drive node names can change when you switch them), and it means fewer
lookups for some qmp actions, like mirror/commit for example, where you
need to know the top node's nodename.


There is no performance impact from having a throttle group without limits



--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH-SERIES v3 pve-storage/qemu-server/pve-qemu] add external qcow2 snapshot support

2025-01-10 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
 Message initial 
De: Fiona Ebner 
À: Proxmox VE development discussion ,
f.gruenbich...@proxmox.com 
Cc: "DERUMIER, Alexandre" 
Objet: Re: [pve-devel] [PATCH-SERIES v3 pve-storage/qemu-server/pve-
qemu] add external qcow2 snapshot support
Date: 10/01/2025 10:55:14

On 10.01.25 at 08:44, DERUMIER, Alexandre via pve-devel wrote:
>  Message initial 
> De: Fabian Grünbichler 
> À: Proxmox VE development discussion 
> Cc: Alexandre Derumier 
> Objet: Re: [pve-devel] [PATCH-SERIES v3 pve-storage/qemu-server/pve-
> qemu] add external qcow2 snapshot support
> Date: 09/01/2025 15:13:14
> 
> > Alexandre Derumier via pve-devel wrote on 16.12.2024 at 10:12 CET:
> 
> > This patch series implement qcow2 external snapshot support for
> > files
> > && lvm volumes
> > 
> > The current internal qcow2 snapshots have bad write performance
> > because no metadatas can be preallocated.
> > 
> > This is particulary visible on a shared filesystem like ocfs2 or
> > gfs2.
> > 
> > Also other bugs are freeze/lock reported by users since years on
> > snapshots delete on nfs
> > (The disk access seem to be frozen during all the delete duration)
> > 
> > This also open doors for remote snapshot export-import for storage
> > replication.
> > > 
> > > a few high level remarks:
> > > - I am not sure whether we want to/can switch over to blockdev on
> > > the
> > > fly (i.e., without some sort of opt-in phase to iron out kinks).
> > > what
> > > about upgrades with running VMs? I guess some sort of flag and
> > > per-VM
> > > switching would be a better plan..
> 
> I have tested live migration, and it works for the small tests I have
> done. (I was surprised myself). I'll try to do longer tests to be
> 100% sure that there is no data corruption.
> 
> on the guest side, it's transparent. on the qemu side, the devices and
> pci plumbing are still the same, and qemu already uses blockdev behind
> the scenes.
> 
> If needed, we could make a switch based on the qemu version, or a
> manual option.

>>Yes, we need to be very careful that all the defaults/behavior would be
>>the same to not break live-migration. A known difference is format
>>autodetection, which happens with "-drive file=" but not with
>>"-blockdev", but not relevant as we explicitly set the format.
>>Dumping
>>the QObject JSON configs of the drives might be a good heuristic to
>>check that the properties really are the same before and after the
>>switch.
I had manually looked at the qdev info and dumped the blockdevs
generated by the -drive command to compare; I didn't see any difference
(only the node names and the additional throttle group node)
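
(e.g. by diffing the output of something like this, with the usual
mon_cmd helper, before and after the switch:)

# dump the block graph as qemu sees it; apart from the node names and the
# extra throttle filter, the drive properties should be identical
my $nodes = mon_cmd($vmid, 'query-named-block-nodes');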

>>Switching based on QEMU version would need to be the creation QEMU
>>from
>>the meta config property. Using machine or running binary version
>>would
>>mean we would automatically switch for non-Windows OSes which are not
>>version pinned by default, so that doesn't help if there would be
>>breakage.

That's why I was thinking of implementing this for pve9 (based on the
qemu version).

>>I'd really hope it is compatible, because for a per-VM flag,
>>for backwards-compat reasons (e.g. rolling back to a snapshot with
>>VMstate) it would need to start out as being off by default.

I think the vmstate is not a problem, because it is only the guest
memory, right ? and the devices are not changing.


>>We wouldn't even need to switch to using '-blockdev' right now (still
>>good thing to do long-term wise, but if it is opt-in, we can't rely
>>on
>>all VMs having it, which is bad), 
>>you could also set the node-name for the '-drive'. 

Are you sure about this ? I haven't seen any documentation about
adding a node-name to -drive.   (and we need it for hotplug via hmp
drive_add too :/ )

I'm not even sure this could set a specific name for the file nodename,
which is needed for the snapshot renaming with blockdev-reopen.


>>I.e. switching to '-blockdev' can be done separately to
>>switching to 'blockdev-*' QMP operations.


I really don't know if you can implement the qmp blockdev-* commands
with the --drive syntax (even if it were possible to define a nodename
there).

I know that the qmp blockdev-* commands accept "device" (for --drive) or
"node-name" (for --blockdev)


(BTW, switching to -blockdev already breaks the qmp proxmox backup ^_^
possibly because of the throttle-group top node, I don't remember
exactly).



I'll take time to retest  live migrat

Re: [pve-devel] [PATCH v3 qemu-server 01/11] blockdev: cmdline: convert drive to blockdev syntax

2025-01-10 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---

> - $device .= ",drive=drive-$drive_id,id=$drive_id";
> + $device .= ",id=$drive_id";
> + $device .= ",drive=drive-$drive_id" if $device_type ne 'cd' ||
> $drive->{file} ne 'none';

>>is this just because you remove the whole drive when ejecting? not
>>sure whether that is really needed..

with blockdev, "no drive" (no disc inserted in the cdrom device) really
means no blockdev is defined at all.
So we don't pass a drive/cdrom medium to the cdrom device.


>  
> -sub print_drive_commandline_full {
> -    my ($storecfg, $vmid, $drive, $live_restore_name, $io_uring) =
> @_;
> +sub print_drive_throttle_group {
> +    my ($drive) = @_;
> +    #command line can't use the structured json limits option,
> +    #so limit params need to use with x- as it's unstable api

>>this comment should be below the early return, or above the whole
>>sub.
ok

> +    return if drive_is_cdrom($drive) && $drive->{file} eq 'none';

>>is this needed if we keep empty cdrom drives around like before? I
>>know throttling practically makes no sense in that case, but it might
>>make the code in general more simple?

yes, this is to keep it like before, but I can put it behind a
throttle-group, no problem. 

> 
> +sub generate_file_blockdev {
> +    my ($storecfg, $drive, $nodename) = @_;
> +
> +    my $volid = $drive->{file};
>  my ($storeid, $volname) = PVE::Storage::parse_volume_id($volid,
> 1);
> -    my $scfg = $storeid ? PVE::Storage::storage_config($storecfg,
> $storeid) : undef;
>  
> -    if (drive_is_cdrom($drive)) {
> - $path = get_iso_path($storecfg, $vmid, $volid);
> - die "$drive_id: cannot back cdrom drive with a live restore
> image\n" if $live_restore_name;
> +    my $scfg = undef;
> +    my $path = $volid;

I think this should only happen if the parse_volume_id above told us
this is an absolute path and not a PVE-managed volume..

> +    if($storeid && $storeid ne 'nbd') {

>>this is wrong.. I guess it's also somewhat wrong in the old
>>qemu_drive_mirror code.. we should probably check using a more
>>specific RE that the "volid" is an NBD URI, and not attempt to parse
>>it as a regular volid in that case..

ok. I'm already parsing the nbd uri later, I'll adapt the code.




> +    my $format = $drive->{format};
> +    $format //= "raw";

>>the format handling here is very sensitive, and I think this broke
>>it. see the big comment this patch removed ;)
>>
>>short summary: for PVE-managed volumes we want the format from the
>>storage layer (via checked_volume_format). if the drive has a format
>>set that disagrees, that is a hard error. for absolute paths we us
>>the format from the drive with a fallback to raw.

yes, I have seen the commits during my rebase before sending patches.
I'll fix that.


>  
> -    if ($live_restore_name) {
> - $format = "rbd" if $is_rbd;
> - die "$drive_id: Proxmox Backup Server backed drive cannot auto-
> detect the format\n"
> -     if !$format;
> - $opts .= ",format=alloc-track,file.driver=$format";
> -    } elsif ($format) {
> - $opts .= ",format=$format";
> +    my ($storeid, $volname) = PVE::Storage::parse_volume_id($volid,
> 1);

>>so I guess this should never be called with nbd-URI-volids?

until we want to live restore to an nbd uri, no ^_^


> +    my $readonly = defined($drive->{ro}) || $force_readonly ?
> JSON::true : JSON::false;
> +
> +    #libvirt define cache option on both format && file
>  my $cache_direct = drive_uses_cache_direct($drive, $scfg);
> +    my $cache = {};
> +    $cache->{direct} = $cache_direct ? JSON::true : JSON::false;
> +    $cache->{'no-flush'} = $drive->{cache} && $drive->{cache} eq
> 'unsafe' ? JSON::true : JSON::false;

>>so we have the same code in two places? should probably be a helper
>>then to not have them go out of sync..

Ah, yes, I forgot to do the helper. Libvirt defines it on both the file
&& format blockdev, not sure why exactly.
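
something like this maybe (sketch):

# shared helper for the cache options used on both the file and the format
# blockdev (mirroring what libvirt generates on both layers)
sub generate_blockdev_cache {
    my ($drive, $scfg) = @_;
    my $cache_direct = drive_uses_cache_direct($drive, $scfg);
    return {
	direct     => $cache_direct ? JSON::true : JSON::false,
	'no-flush' => $drive->{cache} && $drive->{cache} eq 'unsafe'
	    ? JSON::true : JSON::false,
    };
}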


> 
> -    # my $file_param = $live_restore_name ? "file.file.filename" :
> "file";
> -    my $file_param = "file";
> +    my $file_nodename = "file-drive-$drive_id";
> +    my $blockdev_file = generate_file_blockdev($storecfg, $drive,
> $file_nodename);
> +    my $fmt_nodename = "fmt-drive-$drive_id";
> +    my $blockdev_format = generate_format_blockdev($storecfg,
> $drive, $fmt_nodename, $blockdev_file, $force_readonly);
> +
> +    my $blockdev_live_restore = undef;
>  if ($live_restore_name) {
> - # non-rbd drivers require the underlying file to be a separate
> block
> - # node, so add a second .file indirection
> - $file_param .= ".file" if !$is_rbd;
> - $file_param .= ".filename";
> +    die "$drive_id: Proxmox Backup Server backed drive cannot
> auto-detect the format\n"
> +    if !$format;

>>for this check, but it is not actually set anywhere here.. so is
>>something missing or can the check go?

it can be removed, this is older code that I forgot to remove.
(I haven't tested the backup/restore yet, as backup is not working)

> 


--- End Message ---
__

Re: [pve-devel] [PATCH v3 pve-storage 1/3] qcow2: add external snapshot support

2025-01-10 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
>>but you don't know up front that you want to collapse all the
>>snapshots. for each single removal, you have to merge the delta
>>towards the overlay, not the base, else the base contents is no
>>longer matching its name.
>>
>>think about it this way:
>>
>>you take a snapshot B at time X. this snapshot must never contain a
>>modification that happened after X. that means you cannot ever commit
>>a newer snapshot into B, unless you are removing and renaming B.

>>if you start with a chain A -> B -> C -> D (with A being the first
>>snapshot/base, and D being the current active overlay. if you want to
>>remove B, you can either
>>- stream B into C, remove B
>>- commit C into B, remove C, rename B to C
>>
>>in both cases you will end up with a chain A -> C' -> D where C' is
>>the combination of the old B and C.
>>
>>the downside of the streaming variant is that if B's delta is bigger
>>than C's, you have more I/O. the upside is that there is no inbetween
>>state where the backing chain is broken and error handling can go
>>very wrong.
>>
>>what you are doing right now is:
>>
>>chain A->B->C->D as before. remove B by commiting B into A and then
>>rebasing C on top of A. that means you end up with:
>>A'->C->D where A' is A+B. but now this snapshot A contains writes
>>that happened after the original A was taken. this is wrong.


Ah yes, you are right. 

I was just thinking about it, and came to the same conclusion.

for example:

A 1TB (12:00) ---> B 500MB (13:00) ---> C 10MB (14:00) ---> current (now)

if I delete B and merge it into A, it will not be the A view at 12:00,
but at 13:00.
so we indeed need to merge it into C.

(Sorry, I have been using zfs/ceph for too long, where merges never
occur, and blocks are only referenced && destroyed in the background.)


Ok, I'll rework this with a stream implementation. (I need to do it
anyway for multi-branch, but later please)
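
(online, that would be a block-stream job along these lines, via the
usual mon_cmd helper; node names are only examples:)

# copy the delta of the to-be-deleted snapshot into the node above it and
# drop it from the backing chain
mon_cmd($vmid, 'block-stream',
    'job-id'    => 'stream-drive-scsi0',
    device      => 'fmt-node-current',   # node sitting on top of the deleted snapshot
    'base-node' => 'fmt-node-snapA',     # the deleted snapshot's own backing node
);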



--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH v3 pve-storage 1/3] qcow2: add external snapshot support

2025-01-10 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
> @@ -710,11 +715,15 @@ sub filesystem_path {
>  # Note: qcow2/qed has internal snapshot, so path is always
>  # the same (with or without snapshot => same file).
>  die "can't snapshot this image format\n"
> - if defined($snapname) && $format !~ m/^(qcow2|qed)$/;

>>I am not sure if we want to allow snapshots for non-qcow2 files just
>>because snapext is enabled? I know it's technically possible to have
>>a raw base image and then a qcow2 backing chain on top, but this
>>quickly becomes confusing (how is the volume named then? which format
>>does it have in which context)..

in the V2 I was allowing it, but for this V3 series I only manage
external snapshots with qcow2 files. (with the snapshot file renaming,
it would be too complex to manage, and indeed confusing for the user... )

I think I forgot to clean this up in the V3, the check should simply be

die "can't snapshot this image format\n" if defined($snapname) &&
$format !~ m/^(qcow2|qed)$/;





> 
>  die "can't snapshot this image format\n" if $volname !~
> m/\.(qcow2|qed)$/;
>  
> -    my $path = $class->filesystem_path($scfg, $volname);
> +    if($scfg->{snapext}) {
>  
> -    my $cmd = ['/usr/bin/qemu-img', 'snapshot','-c', $snap, $path];
> + my $path = $class->path($scfg, $volname, $storeid);
> + my $snappath = $class->path($scfg, $volname, $storeid, $snap);
> + my $format = ($class->parse_volname($volname))[6];
> + #rename current volume to snap volume
> + rename($path, $snappath) if -e $path && !-e $snappath;

>>I think this should die if the snappath already exists, and the one
>>(IMHO wrong) call in qemu-server should switch to
>>vdisk_alloc/alloc_image.. this is rather dangerous otherwise!

right !
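
i.e. something like (sketch):

# refuse to silently skip the rename if the snapshot file already exists
die "snapshot file '$snappath' already exists\n" if -e $snappath;
rename($path, $snappath)
    or die "failed to rename '$path' to '$snappath' - $!\n";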



> +    if ($scfg->{snapext}) {
> + #technically, we could manage multibranch, but it needs a lot more
> work for snapshot delete
> + #we need to implement block-stream from the deleted snapshot to all
> other child branches

>>see my comments in qemu-server - I think we actually want block-
>>stream anyway, since it has the semantics we want..

I don't agree, we don't always want that, because with block-stream you
need to copy the parent into the child.

for example: you have a 1TB image, you take a snapshot, write 5MB in
the snapshot, then delete the snapshot: you'll need to read/copy 1TB of
data from the parent into the snapshot file.
I haven't read your qemu-server comment yet ;)


> + #when online, we need to do a transaction for multiple disk when
> delete the last snapshot
> + #and need to merge in current running file
> +
> + my $snappath = $class->path($scfg, $volname, $storeid, $snap);
> + my $snapshots = $class->volume_snapshot_info($scfg, $storeid,
> $volname);
> + my $parentsnap = $snapshots->{current}->{parent};
> +
> + return 1 if !-e $snappath || $snapshots->{$parentsnap}->{file} eq
> $snappath;

>>why do we return 1 here if the snapshot doesn't exist? if we only
>>allow rollback to the most recent snapshot for now, then we could
>>just query the current path and see if it is backed by our snapshot?

I think I forgot to remove this from the V2. But the idea is indeed to
check if the snapshot backs the current image (with
$snapshots->{current}->{parent}).

> +
> + die "can't rollback, '$snap' is not most recent snapshot on
> '$volname'\n";
> +    }
> +
>  return 1;
>  }
>  
> @@ -1201,13 +1257,52 @@ sub volume_snapshot_delete {
>  
>  return 1 if $running;
>  
> +    my $cmd = "";
>  my $path = $class->filesystem_path($scfg, $volname);
>  
> -    $class->deactivate_volume($storeid, $scfg, $volname, $snap, {});
> +    if ($scfg->{snapext}) {
>  
> -    my $cmd = ['/usr/bin/qemu-img', 'snapshot','-d', $snap, $path];
> + my $snapshots = $class->volume_snapshot_info($scfg, $storeid,
> $volname);
> + my $snappath = $snapshots->{$snap}->{file};
> + return if !-e $snappath;  #already deleted ?

>>shouldn't this be an error?

This one was for when we want to retry in case of error, if we have
multiple disks. (for example, on the first snapshot delete api call the
first disk removes the snapshot, but a bug occurs and the second disk
doesn't remove its snapshot). 

The user could want to unlock the vm-snapshot lock and fix it manually
by calling the snapshot delete again.

I'm not sure how to handle this correctly ?

> +     print"commit $childpath\n";
> +     $cmd = ['/usr/bin/qemu-img', 'commit', $childpath];
> +     run_command($cmd);
> +     print"delete $childpath\n";
> +
> +     unlink($childpath);

this unlink can be skipped?

> +     print"rename $snappath to $childpath\n";
> +     rename($snappath, $childpath);

>>since this will overwrite $childpath anyway.. this also reduces the
>>chance of something going wrong:
>>
>>- if the commit fails halfway through, nothing bad should have
>>happened, other than some data is now stored in two snapshots and
>>takes up extra space
>>- if the rename fails, then all of the data of $snap is stored twice,
>>but the backing chain is still valid
>>
>>notable, there is no longer a gap where $

Re: [pve-devel] [PATCH v3 qemu-server 08/11] blockdev: convert drive_mirror to blockdev_mirror

2025-01-15 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
IMHO this isn't really a cryptographic use case, so I'd not worry too
much about any of that.

basically what we have is the following situation:

- we have some input data (volid+snapname)
- we have a key derived from the input data (block node name)
- we have a value (block node)
- we need to be able to map back the block node (name) to the input
data

>>sometimes we need to allocate a second block node temporarily for a
>>given input data (right?), 

yes, with a unique volume path in the file-node
 

note that a "block-node" , is a couple of fmt-node (handling
.qcow2,.raw,..)->file-node (with the path to file/block/...


For snapshot rename (current->snap1 for xample), I only switch the file
nodes currently (with the blockdev-reopen), so the fmt-node is not
changing. 

The current behaviour is something like:

1)  fmt-node-current->file-node-current

take snap1


2)
  a) rename

fmt-node-current-->file-node-snap1
(and delete file-node-current)



 b) create a new fmt node current with snap1 as backing

fmt-node-current2-->file-node-current
  |
  |->fmt-node-current>file-node-snap1


I'm not sure that I can swap a whole fmt+file couple with another
fmt+file couple for the renaming part. (blockdev-mirror does it, for
example)

I'll try it to be sure (doing the blockdev-reopen at the fmt level).

if it's not possible:

for the file node, the hash would work 100%, as the volid+snapname
shouldn't be in 2 nodes at the same time.

but for the fmt-node, it would need a lookup into the graph, to find the
parent of the file-node (the file-node being found with the hash)




--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH pve-network] controllers: bgp: split v4 && v6 peers in different groups

2025-03-11 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
Hi,

just a reminder to not lose this patch.

could it be possible to apply it ?


 Message initial 
De: Alexandre Derumier via pve-devel 
Répondre à: Proxmox VE development discussion 
À: pve-devel@lists.proxmox.com
Cc: Alexandre Derumier 
Objet: [pve-devel] [PATCH pve-network] controllers: bgp: split v4 && v6
peers in different groups
Date: 19/12/2024 17:17:23

reported by a user on the forum:


This is for dual-stack, when evpn is ipv4 and bgp is ipv6+(ipv4)

Signed-off-by: Alexandre Derumier <
>
---
 src/PVE/Network/SDN/Controllers/BgpPlugin.pm  | 43 -
 .../bgp_ipv4_ipv6/expected_controller_config  | 63 +++
 .../bgp_ipv4_ipv6/expected_sdn_interfaces | 41 
 src/test/zones/evpn/bgp_ipv4_ipv6/interfaces  | 11 
 src/test/zones/evpn/bgp_ipv4_ipv6/sdn_config  | 48 ++
 5 files changed, 192 insertions(+), 14 deletions(-)
 create mode 100644
src/test/zones/evpn/bgp_ipv4_ipv6/expected_controller_config
 create mode 100644
src/test/zones/evpn/bgp_ipv4_ipv6/expected_sdn_interfaces
 create mode 100644 src/test/zones/evpn/bgp_ipv4_ipv6/interfaces
 create mode 100644 src/test/zones/evpn/bgp_ipv4_ipv6/sdn_config

--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH 1/4] frr: bump from 8.5.2 to 10.2.1

2025-03-11 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
Hi,

Nice to see that new proxmox developers are working on it.


Personally, I would wait for the major pve9 release before bumping frr,

because from my experience, I have had a lot of regressions with frr
(really a lot ^_^)

and this really needs a lot of testing with different setups to be sure
everything is working fine.


I can begin to do tests on my production with this version if it can
help


 Message initial 
De: Gabriel Goller 
Répondre à: Proxmox VE development discussion 
À: pve-devel@lists.proxmox.com
Cc: Thomas Lamprecht 
Objet: Re: [pve-devel] [PATCH 1/4] frr: bump from 8.5.2 to 10.2.1
Date: 07/03/2025 13:31:28

Oops, forgot to add the repo to the header. This obviously applies to
the `frr` repo.


___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH 1/4] frr: bump from 8.5.2 to 10.2.1

2025-03-11 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
>>Yeah, it's definitively a bigger jump, but having late PVE 8 and
>>future
>>PVE 9 start out with the same FRR version would be nice and should
>>make
>>backporting fixes easier; also PVE 9 should then start out more
>>stable
>>w.r.t. FRR and SDN stack. And yes, we looked into backporting some
>>patches
>>that Gabriel needs for the SDN Fabrics work to the FRR version we
>>had,
>>but too much changed. Also, IIRC updating FRR to newer minor version
>>releases (e.g., from 8.5.2 to 8.5.7) often caused also some
>>regressions,
>>so a bigger jump should make addressing those at least a bit more
>>worthwile; at least that's what I'm hoping ^^

yes, minor releases also cause regressions ^_^.  They really need an LTS release


>>FYI: I just uploaded the build to the pvetest repository, would move
>>it
>>to no-subscription sometime next week, if nothing comes up until
>>then.
>>I can naturally delay the move, or fasten it up, depending on your
>>feedback or if you would like to have some more time, I just would
>>like
>>to move it to no-subscription no later than the 28th March, to ensure
>>it
>>gets long enough time for broad exposure in no-subscription before
>>the
>>next point releases.

ok no problem, I'll try to test on my side this week on a test cluster
for some days.

I have already seen regressions where problems only appear after
multiple hours, with arp/mac timeouts. so I'll try to test as soon as
possible.

I don't remember if we have a package with a dependency on frr (I don't
see it in pve-network), but if possible please don't force something
like
Depends: frr (>= 10~)   , as I would like to be able to pin the version
in case of problems
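
(e.g. with something like this in /etc/apt/preferences.d/frr, if pinning
becomes necessary; the exact package names are from memory:)

Package: frr frr-pythontools
Pin: version 8.5.2*
Pin-Priority: 1001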


(I'll try to test the fabric code too :)


Alexandre








> and this really need a lot of test with different setup to be sure
> everything is working fine with differents setup.
> 
> 
> I can begin to do test on my production with this version if it can
> help

That would be really appreciated, thanks.


--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Re: [pve-devel] [PATCH pve-network] controllers: bgp: split v4 && v6 peers in different groups

2025-03-12 Thread DERUMIER, Alexandre via pve-devel
--- Begin Message ---
Hi Stefan !


>>It should work for redistributing EVPN routes via BGP, but if you
>>want
>>to use the BGP controller with loopback + multiple address families
>>this
>>doesn't seem to work.
ah shit, I never tested multiple address families on the loopback on my
side. (Need to add a unit test for it)


>> My generated configuration looks like this if I
>>try to do dual-stack BGP:
>>
>> address-family ipv6 unicast
>>  network 172.20.1.1/32
>>  neighbor BGP6 activate
>>  neighbor BGP6 soft-reconfiguration inbound
>> exit-address-family

>>This should take the IPv6 from the loopback, right? 
yes, definitely


>>We would also need
>>to create a correct_src_ipv6 route map then I suppose. Not sure how
>>much
>>sense a dual-stack underlay makes, maybe when transitioning from 4 to
>>6?


>>If I have no IPv4 on my loopback and try an IPv6 only BGP underlay
>>(peers are only IPv6, loopback is IPv6 /128), then it fails on
>>creating
>>a router-id:
>>
>>TASK ERROR: can't autofind a router-id value from ip or mac at
>>/usr/share/perl5/PVE/Network/SDN/Controllers/Plugin.pm line 135.
>>
>>
>>Not 100% sure why that is, I will need to check tomorrow, I think it
>>is
>>because we are only checking the address field of the interfaces file
>>(in find_local_ip_interface_peers), but IPv6 addresses are in the
>>address6 field. That seems to break when using IPv6.

yes, it must be something like that.


>>Reading the MAC from "/sys/class/net/$iface/master/address" also
>>doesn't
>>always work if the interface is not part of a bridge. I have my ptp
>>links configured directly on the interfaces, so that might also be a
>>problem.
yes, it is common to use interfaces directly for ptp links. (I'm doing it
myself, ipv4 only). it should be /sys/class/net/$IFACE/address .
(a bridge should inherit it from the enslaved iface, so it should also
work for an iface enslaved in a bridge)

Redistributing IPv4 and IPv6 routes from an EVPN zone exit-node worked
on my machine with this patch.



(I'll not be able to look at this until monday)

--- End Message ---
___
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel