Re: bhyve disk performance issue

2024-02-27 Thread Vitaliy Gusev
Hi,


> On 23 Feb 2024, at 18:37, Matthew Grooms  wrote:
> 
>> ...
> The problem occurs when an image file is used on either ZFS or UFS. The 
> problem also occurs when the virtual disk is backed by a raw disk partition 
> or a ZVOL. This issue isn't related to a specific underlying filesystem.
> 

Do I understand correctly that you ran the tests inside the guest VM on an ext4 
filesystem? If so, you should be aware of the additional overhead compared to 
running the same tests on the host.

I would suggest running fio (or even dd) against the raw disk device inside the VM, 
i.e. without any filesystem at all. Just do not forget to run “echo 3 > 
/proc/sys/vm/drop_caches” in the Linux guest VM before each test. 

Could you also give more information about:

 1. What results did you get (decoded bonnie++ output)?
 2. What results were you expecting?
 3. VM configuration, virtio-blk disk size, etc.
 4. The full command lines for the tests (including the size of the test set), bhyve, etc.
 5. Did you expose virtio-blk with 512-byte or 4K sectors? If 512, you should probably try 4K.
 6. Linux has several read-ahead options in its I/O scheduler, and they could be 
related too.

Additionally, could you also experiment with the “sync=disabled” volume/zvol option? Of 
course, that is only relevant for write testing.

——
Vitaliy



Re: bhyve disk performance issue

2024-02-28 Thread Vitaliy Gusev
Hi, Matthew.

I still do not know what command line was used for bhyve. I couldn't find it 
in the thread, sorry. I also couldn't find the virtual disk size that you used.

Could you please simplify the bonnie++ output? It is hard to decode due to the 
alignment. Use exact numbers for:

READ seq  - I see you had 1.6 GB/s at the good times and ~500 MB/s at the worst.
WRITE seq - ...

If you have slow results for both read and write operations, you should probably 
test only READs first and not touch anything else until READs are 
fine.

Again, if you see slow performance for an ext4 filesystem in the guest VM placed on 
the passed-through disk image, you should try testing against the raw disk image, i.e. 
without ext4, because the filesystem could be a factor.

If you run the test inside the VM on a filesystem, you may be dealing with filesystem 
bottlenecks, bugs, fragmentation, etc. Do you want to fix them all? I don’t 
think so.

For example, if you pass a 40G disk image and create an ext4 filesystem on it, and 
during testing the filesystem becomes more than 80% full, I/O may not perform 
well.

You should probably rule out that guest filesystem behaviour when you hit an I/O 
performance slowdown.

Also, please look at TRIM operations when you perform write testing. They 
could also be related to the slow write I/O.

——
Vitaliy

> On 28 Feb 2024, at 21:29, Matthew Grooms  wrote:
> 
> On 2/27/24 04:21, Vitaliy Gusev wrote:
>> Hi,
>> 
>> 
>>> On 23 Feb 2024, at 18:37, Matthew Grooms  
>>> <mailto:mgro...@shrew.net> wrote:
>>> 
>>>> ...
>>> The problem occurs when an image file is used on either ZFS or UFS. The 
>>> problem also occurs when the virtual disk is backed by a raw disk partition 
>>> or a ZVOL. This issue isn't related to a specific underlying filesystem.
>>> 
>> 
>> Do I understand right, you ran testing inside VM inside guest VM  on ext4 
>> filesystem? If so you should be aware about additional overhead in 
>> comparison when you were running tests on the hosts.
>> 
> Hi Vitaliy,
> 
> I appreciate you providing the feedback and suggestions. I spent over a week 
> trying as many combinations of host and guest options as possible to narrow 
> this issue down to a specific host storage or a guest device model option. 
> Unfortunately the problem occurred with every combination I tested while 
> running Linux as the guest. Note, I only tested RHEL8 & RHEL9 compatible 
> distributions ( Alma & Rocky ). The problem did not occur when I ran FreeBSD 
> as the guest. The problem did not occur when I ran KVM in the host and Linux 
> as the guest.
> 
>> I would suggest to run fio (or even dd) on raw disk device inside VM, i.e. 
>> without filesystem at all.  Just do not forget do “echo 3 > 
>> /proc/sys/vm/drop_caches” in Linux Guest VM before you run tests.
> The two servers I was using to test with are no longer available. 
> However, I'll have two more identical servers arriving in the next week or 
> so. I'll try to run additional tests and report back here. I used bonnie++ as 
> that was easily installed from the package repos on all the systems I tested.
> 
>> 
>> Could you also give more information about:
>> 
>>  1. What results did you get (decode bonnie++ output)?
> If you look back at this email thread, there are many examples of running 
> bonnie++ on the guest. I first ran the tests on the host system using Linux + 
> ext4 and FreeBSD 14 + UFS & ZFS to get a baseline of performance. Then I ran 
> bonnie++ tests using bhyve as the hypervisor and Linux & FreeBSD as the 
> guest. The combination of host and guest storage options included ...
> 
> 1) block device + virtio blk
> 2) block device + nvme
> 3) UFS disk image + virtio blk
> 4) UFS disk image + nvme
> 5) ZFS disk image + virtio blk
> 6) ZFS disk image + nvme
> 7) ZVOL + virtio blk
> 8) ZVOL + nvme
> 
> In every instance, I observed the Linux guest disk IO often perform very well 
> for some time after the guest was first booted. Then the performance of the 
> guest would drop to a fraction of the original performance. The benchmark 
> test was run every 5 or 10 minutes in a cron job. Sometimes the guest would 
> perform well for up to an hour before performance would drop off. Most of the 
> time it would only perform well for a few cycles ( 10 - 30 mins ) before 
> performance would drop off. The only way to restore the performance was to 
> reboot the guest. Once I determined that the problem was not specific to a 
> particular host or guest storage option, I switched my testing to only use a 
> block device as backing storage on the host to avoid hitting any system disk 
> caches.
> 
> Here is the test script I us

Re: bhyve disk performance issue

2024-02-28 Thread Vitaliy Gusev


> On 28 Feb 2024, at 23:03, Matthew Grooms  wrote:
> 
> ...
> The virtual disks were provisioned with either a 128G disk image or a 1TB raw 
> partition, so I don't think space was an issue.
> Trim is definitely not an issue. I'm using a tiny fraction of the 32TB array and 
> have tried both heavily under-provisioned HW RAID10 and SW RAID10 using GEOM. 
> The latter was tested after sending full trim resets to all drives 
> individually.
> 
It could be, then, that TRIM/UNMAP is not used and the zvol (for instance) becomes full 
for a while. ZFS then considers all of its blocks used, and write operations 
can run into trouble. I believe this was recently fixed.

Also look at this one:

GuestFS->UNMAP->bhyve->Host-FS->PhysicalDisk

The problem with UNMAP is that it can cause unpredictable slowdowns at any time. So 
I would suggest comparing results with UNMAP enabled and disabled in the guest.

> I will try to incorporate the rest of your feedback into my next round of 
> testing. If I can find a benchmark tool that works with a raw block device, 
> that would be ideal.
> 
> 
Use “dd” as a first step for read testing:

   ~# dd if=/dev/nvme0n2 of=/dev/null bs=1M status=progress iflag=direct
   ~# dd if=/dev/nvme0n2 of=/dev/null bs=1M status=progress

Compare the results with and without direct I/O.

The “fio” tool:
 
  1) write prepare:

   ~# fio --name=prep --rw=write --verify=crc32 --loops=1 --numjobs=2 \
        --time_based --thread --bs=1M --iodepth=32 --ioengine=libaio --direct=1 \
        --group_reporting --size=20G --filename=/dev/nvme0n2

  2) read test:

   ~# fio --name=readtest --rw=read --loops=30 --numjobs=2 --time_based \
        --thread --bs=256K --iodepth=32 --ioengine=libaio --direct=1 \
        --group_reporting --size=20G --filename=/dev/nvme0n2
 
—
Vitaliy  
> Thanks,
> 
> -Matthew
> 
> 
> 
>> ——
>> Vitaliy
>> 
>>> On 28 Feb 2024, at 21:29, Matthew Grooms  
>>> <mailto:mgro...@shrew.net> wrote:
>>> 
>>> On 2/27/24 04:21, Vitaliy Gusev wrote:
>>>> Hi,
>>>> 
>>>> 
>>>>> On 23 Feb 2024, at 18:37, Matthew Grooms  
>>>>> <mailto:mgro...@shrew.net> wrote:
>>>>> 
>>>>>> ...
>>>>> The problem occurs when an image file is used on either ZFS or UFS. The 
>>>>> problem also occurs when the virtual disk is backed by a raw disk 
>>>>> partition or a ZVOL. This issue isn't related to a specific underlying 
>>>>> filesystem.
>>>>> 
>>>> 
>>>> Do I understand right, you ran testing inside VM inside guest VM  on ext4 
>>>> filesystem? If so you should be aware about additional overhead in 
>>>> comparison when you were running tests on the hosts.
>>>> 
>>> Hi Vitaliy,
>>> 
>>> I appreciate you providing the feedback and suggestions. I spent over a 
>>> week trying as many combinations of host and guest options as possible to 
>>> narrow this issue down to a specific host storage or a guest device model 
>>> option. Unfortunately the problem occurred with every combination I tested 
>>> while running Linux as the guest. Note, I only tested RHEL8 & RHEL9 
>>> compatible distributions ( Alma & Rocky ). The problem did not occur when I 
>>> ran FreeBSD as the guest. The problem did not occur when I ran KVM in the 
>>> host and Linux as the guest.
>>> 
>>>> I would suggest to run fio (or even dd) on raw disk device inside VM, i.e. 
>>>> without filesystem at all.  Just do not forget do “echo 3 > 
>>>> /proc/sys/vm/drop_caches” in Linux Guest VM before you run tests. 
>>> The two servers I was using to test with are no longer available. 
>>> However, I'll have two more identical servers arriving in the next week or 
>>> so. I'll try to run additional tests and report back here. I used bonnie++ 
>>> as that was easily installed from the package repos on all the systems I 
>>> tested.
>>> 
>>>> 
>>>> Could you also give more information about:
>>>> 
>>>>  1. What results did you get (decode bonnie++ output)?
>>> If you look back at this email thread, there are many examples of running 
>>> bonnie++ on the guest. I first ran the tests on the host system using Linux 
>>> + ext4 and FreeBSD 14 + UFS & ZFS to get a baseline of performance. Then I 
>>> ran bonnie++ tests using bhyve as the hypervisor and Linux & FreeBSD as the 
>>> guest. The combination of host and guest storage options included ...
>>> 
>>> 1) block device + virtio blk
>&

Re: problem with bhyve, ryzen 5800x, freebsd guest

2022-07-07 Thread Vitaliy Gusev
You probably should set up a dump device to get crash info, stack, etc.

——
Vitaliy Gusev

> On 7 Jul 2022, at 15:29, Andriy Gapon  wrote:
> 
> 
> I have a strange issue with running an 'appliance' image based on FreeBSD 12 
> in bhyve on a machine with Ryzen 5800x processor.
> 
> The problem is that the guest would run for a while and then the host would 
> suddenly reset itself.  It appears like a triple fault or something with 
> similar consequences.
> 
> The time may be from a few dozens of minutes to many hours.
> 
> Just to be clear, no such thing occurs if I do not run the guest.
> Also, I have an older AMD system (pre-Zen), the problem does not happen there.
> A vanilla FreeBSD 12.3 installation that just sits idle also does not cause 
> the problem.
> 
> Does anyone have an idea what the problem could be?
> What workaround or diagnostics to try?
> Anybody else seen something like this?
> 
> Since it's the host that resets it would be hard to capture any traces.
> Thank you.
> -- 
> Andriy Gapon
> 
> 
> https://standforukraine.com
> https://razomforukraine.org
> 





Re: problem with bhyve, ryzen 5800x, freebsd guest

2022-08-01 Thread Vitaliy Gusev
Interestingly enough. It would be nice if you could find out what exactly triggered 
your VM to reset.

—
Vitaliy

> On 1 Aug 2022, at 17:39, Andriy Gapon  wrote:
> 
> On 2022-07-10 20:28, Gleb Smirnoff wrote:
>> On Thu, Jul 07, 2022 at 03:29:04PM +0300, Andriy Gapon wrote:
>> A> I have a strange issue with running an 'appliance' image based on
>> A> FreeBSD 12 in bhyve on a machine with Ryzen 5800x processor.
>> A>
>> A> The problem is that the guest would run for a while and then the host
>> A> would suddenly reset itself. It appears like a triple fault or
>> A> something with similar consequences.
>> A>
>> A> The time may be from a few dozens of minutes to many hours.
>> A>
>> A> Just to be clear, no such thing occurs if I do not run the guest.
>> A> Also, I have an older AMD system (pre-Zen), the problem does not happen
>> A> there.
>> A> A vanilla FreeBSD 12.3 installation that just sits idle also does not
>> A> cause the problem.
>> A>
>> A> Does anyone have an idea what the problem could be?
>> A> What workaround or diagnostics to try?
>> A> Anybody else seen something like this?
>> A>
>> A> Since it's the host that resets it would be hard to capture any traces.
>> I also run bhyve on Ryzen since late 2021 and never had such an issue.
>> But not FreeBSD 12, I run the head.
> 
> 
> Thank you everyone who responded. It seems that the problem was with some 
> BIOS configuration changes, probably related to the power settings.
> Once I reset everything to factory defaults (plus some minimum "safe" and 
> well-understood changes) the problem went away.
> It's really surprising that I saw it only with bhyve and only with the 
> particular kind of VMs. Perhaps there was a workload pattern that triggered a 
> hardware bug or overloaded some specific module.
> 
> Anyways, sorry for the noise and thank you for the help.
> 
> -- 
> Andriy Gapon
> 
> 
> https://standforukraine.com 
> https://razomforukraine.org 


Re: BHYVE_SNAPSHOT

2023-05-02 Thread Vitaliy Gusev
Just adding some plans from my side:

 1. Describe the snapshot file format: one file per snapshot.

 2. Implement snapshot/resume via nvlist.

   The nvlist implementation brings:

Versioning
Easy debugging, inspection of saved values, etc.
Validation of restored variables: types, sizes, etc.
Adding optional variables without breaking backward compatibility (resume can 
still be performed with old snapshots)
Removing variables without breaking backward compatibility
One file per snapshot
An improved restore command line: "bhyve -r $snapshot", i.e. without additional 
arguments
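The backward-compatibility points above can be illustrated with a short sketch. Plain Python dicts stand in for packed nvlists here, and every name in it (save_v1, vds_version, irq_pending) is hypothetical, not the actual bhyve code:

```python
# Sketch: why a key/value snapshot tolerates schema changes, unlike a
# fixed-order binary stream. Plain dicts stand in for packed nvlists.

def save_v1(dev):
    # Version 1 of a device model saved only the registers.
    return {"vds_version": 1, "regs": dev["regs"]}

def save_v2(dev):
    # Version 2 adds an optional key; order and extra keys do not matter.
    snap = save_v1(dev)
    snap["vds_version"] = 2
    snap["irq_pending"] = dev.get("irq_pending", 0)
    return snap

def restore(snap):
    # Explicit version check: refuse images that are too new.
    if snap["vds_version"] > 2:
        raise ValueError("snapshot too new to resume")
    dev = {"regs": snap["regs"]}
    # Optional variable: fall back to a default for old images.
    dev["irq_pending"] = snap.get("irq_pending", 0)
    return dev

old_image = save_v1({"regs": [1, 2]})          # pre-upgrade snapshot
new_image = save_v2({"regs": [3], "irq_pending": 1})
assert restore(old_image)["irq_pending"] == 0  # old image still resumes
assert restore(new_image)["irq_pending"] == 1
```

A fixed-order stream, by contrast, would misread every field after the first added or removed one.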

——
Vitaliy Gusev

>>> before extending the functionality off the top of it.
>> 
>> Yup. See above. I appreciate your input, but the goal of live
>> migration was set in 2016 with a prototype first demonstrated in
>> 2018. How long do you suggest a developer wait without review
>> feedback before moving forward out of tree?
> 
> The snapshot feature isn't compiled in by default. So, it's likely that
> changes break it and only a few people are testing it.
> 
> We have to focus on getting this into the tree.
> 
>>>> There are experimental patches for all these features that were
>>>> developed by students at UPB. In a lot of cases, there are open
>>>> reviews that have been waiting on feedback for ages.
>>> 
>>> In general, most people don't want to review large experimental
>>> patches.
>> 
>> Yup. That approach was attempted with the Warm Migration patches.
>> From slide 17 in Elena's presentation:
>> 
>>  First review opened in 2021: https://reviews.freebsd.org/D28270
>>  5 reviews from 2022 starting with https://reviews.freebsd.org/D34717
>> (same feature split in multiple parts)
>>  
>>  A similar request was made recently to Gusev Vitaliy WRT the
>> multiple device support patch which he took ownership of. Thanks for
>> adding feedback to that review BTW. We'll see how that pans out ...
>> 
>>  https://reviews.freebsd.org/D35590
>> 
> 
> I've already reviewed Vitaliy's multi device support patch and people
> had more than enough time to complain about it. I'm going to commit it
> as soon as he splits his commit.
>   
>>>>  The case is quite plain. I'm not sure what the solution is to
>>>> this 
>>>>  problem. I'd love to hear feedback from the community about how
>>>> I've got 
>>>>  this completely wrong and how the course could be corrected.
>>>> That would 
>>>>  be something.
>>>> 
>>> 
>>> My perspective is that it would have been better to focus student
>>> efforts on completing the snapshot feature. By completing the
>>> snapshot feature, I mean getting the code into a state where it's
>>> compiled in by default and no longer considered an experimental
>>> feature.
>>> 
>> I'm not sure what more to say here regarding the snapshot feature or
>> what might have been done in the past. We need a solution for the
>> present. If you have any comments related to the follow up reviews
>> submitted by UPB, I'm sure they'd love to hear them.
>> And lastly: I get that FreeBSD is a non paid volunteer project for
>> most. Without the efforts of folks like Peter, Neel, John and others,
>> there would be no bhyve. I'm not saying that they, as project
>> maintainers, should somehow be doing more. We all have limited time
>> to invest, paid work to do and families to feed. I'm asking if there
>> are other developers that might be willing and able to help with
>> reviews? Is there something the FreeBSD Foundation can do to help out in
>> situations like these?
>> Thanks,
>> -Matthew
>>  
> 
> UPB has developed some interesting features and I'd like to see those
> in tree. I can take some time to review the patches. Nevertheless, we
> really need the snapshot feature compiled in by default. Otherwise,
> it's wasted time for all of us.
> 
> 
> -- 
> Kind regards,
> Corvin



Re: BHYVE_SNAPSHOT

2023-05-02 Thread Vitaliy Gusev


> On 2 May 2023, at 18:38, Rob Wing  wrote:
> 
> Do you plan on saving the guest memory into the nvlist? If I have VM with 8 
> gigs of memory, will the nvlist implementation allocate 8 gigs of memory for 
> the nvlist then write it out to disk? Or..?
> 
> All the bullet points look good to me.
> 
> -Rob


Of course not. Guest memory should be saved as is. A possible improvement is to 
save only dirty memory/pages. 

I was talking about saving the device’s variables, registers, internal 
data, etc. with nvlist. 

An overall description will be provided shortly.

———
Vitaliy Gusev




BHYVE SNAPSHOT image format proposal

2023-05-23 Thread Vitaliy Gusev
Hi,

Here is a proposal for bhyve snapshot/checkpoint image format improvements.

It implies moving the snapshot code to an nvlist engine. 

The current snapshot implementation has disadvantages:

3 files per snapshot: .meta, .kern, vram
Binary stream format of data
Adding an optional variable breaks resume
Removing a variable breaks resume
Changing the saved order of variables breaks resume
Hard to get information about what is saved and to decode it
Hard to debug if something goes wrong
No versions: if the code changes, resume of an old image can be
attempted, but with undefined behaviour

The new nvlist implementation should solve all of the above. The first step is to
improve the snapshot/checkpoint saving format, which eliminates the use of three
files per snapshot.


1. BHYVE SNAPSHOT image format:  

 +---------------------------+
 | HEADER PHYS - 4096 BYTES  |
 +---------------------------+
 |                           |
 |           DATA            |
 |                           |
 +---------------------------+


2. HEADER PHYS format: 

    0 +-------------------------+
      | IDENT STRING - 64 BYTES |
   64 +-------------------------+
      | NVLIST SIZE  - 4 BYTES  |
   72 +-------------------------+
      |                         |
      |       NVLIST DATA       |
      |                         |
 4096 +-------------------------+


IDENT STRING - each producer can set its own value to identify the image.
NVLIST SIZE  - the size of the packed header nvlist data that follows.
NVLIST DATA  - the packed nvlist header data.

4 KB should be enough for the HEADER to keep basic information about the Sections. 
However, it can be enlarged later without breaking backward compatibility. 

3. The NVLIST HEADER consists of Sections in the following format:

Name   - string
Type   - string, one of:
    "text"   - plain text
    "nvlist" - packed nvlist
    "binary" - raw binary data
Size   - size of the Section - uint64
Offset - offset within the image - uint64

Predefined sections: "config", "devices", "kernel", "memory". 
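As an illustration only, the fixed 4 KB header could be packed and parsed as below. The byte order and the exact start of the nvlist data (here, right after a little-endian 4-byte size field at offset 64) are assumptions of this sketch, not part of the proposal:

```python
import struct

HDR_SIZE = 4096   # HEADER PHYS - 4096 bytes, per the proposal
IDENT_LEN = 64    # IDENT STRING - 64 bytes

def pack_header(ident: bytes, nvlist_data: bytes) -> bytes:
    # IDENT (NUL-padded) + 4-byte size + packed nvlist, padded to 4 KB.
    assert len(ident) <= IDENT_LEN
    hdr = ident.ljust(IDENT_LEN, b"\0")
    hdr += struct.pack("<I", len(nvlist_data)) + nvlist_data
    assert len(hdr) <= HDR_SIZE, "header nvlist does not fit in 4 KB"
    return hdr.ljust(HDR_SIZE, b"\0")

def parse_header(hdr: bytes):
    # Recover the ident string and the packed header nvlist bytes.
    ident = hdr[:IDENT_LEN].rstrip(b"\0")
    (size,) = struct.unpack_from("<I", hdr, IDENT_LEN)
    data = hdr[IDENT_LEN + 4:IDENT_LEN + 4 + size]
    return ident, data

hdr = pack_header(b"BHYVE CHECKPOINT IMAGE VERSION 1", b"packed-nvlist")
ident, data = parse_header(hdr)
assert len(hdr) == HDR_SIZE
assert ident == b"BHYVE CHECKPOINT IMAGE VERSION 1"
assert data == b"packed-nvlist"
```

Because the header is a fixed 4 KB, a reader can fetch it with a single read before deciding how to handle the rest of the image.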


4. EXAMPLE:


 IDENT STRING:

   "BHYVE CHECKPOINT IMAGE VERSION 1"

 NVLIST HEADER: 

  [config]
config.offset = 0x1000 (4096)
config.size = 0x1f6 (502)
config.type = "text"
  [kernel]
kernel.offset = 0x11f6 (4598)
kernel.size = 0x19a7 (6567)
kernel.type = "nvlist"
  [devices]
devices.offset = 0x2b9d (11165)
devices.size = 0x10145ba (16860602)
devices.type = "nvlist"
  [memory]
memory.offset = 0x1200000 (18874368)
memory.size = 0x3ce00000 (1021313024)
memory.type = "binary"

 SECTIONS:

 [section "config" size 0x1f6 offset 0x1000]:
memory.size=1024M
x86.strictmsr=true
x86.vmexit_on_hlt=true
cpus=2
acpi_tables=true
pci.0.0.0.device=hostbridge
pci.0.31.0.device=lpc
pci.0.4.0.device=virtio-net
pci.0.4.0.backend=tap0
pci.0.7.0.device=fbuf
pci.0.7.0.tcp=10.42.0.78:5900
pci.0.7.0.w=1024
pci.0.7.0.h=768
pci.0.5.0.device=ahci
pci.0.5.0.port.0.type=cd
pci.0.5.0.port.0.path=/ISO/ubuntu-22.04.1-live-server-amd64.iso
lpc.bootrom=/usr/local/share/uefi-firmware/BHYVE_UEFI.fd
checkpoint.date="Wed Jan 25 23:48:29 2023"
name=ubuntu22

 [section "kernel" size 0x19a7 offset 0x11f6]:
   [vm]
vm.vds_version = 0x1 (1)
vm.cpu0.data(BINARY): 00 00 00 00 0D 00 00 00 01 00 00 00 00 00 00 00 
...  size=0x28
vm.cpu1.data(BINARY): 00 00 00 00 0D 00 00 00 01 00 00 00 00 00 00 00 
...  size=0x28
vm.checkpoint_tsc = 0xe2e0ac6fbe456 (3991273496896598)
   [hpet]
hpet.vds_version = 0x1 (1)
hpet.data(BINARY): 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ...  
size=0x118
   [vmx]
vmx.vds_version = 0x1 (1)
vmx.cpu_features = 0 (0)
vmx.cpu0.vmx_data(BINARY): F0 CC 15 B8 FF FF FF FF 40 B4 21 B9 FF FF FF 
FF ...  size=0x288
vmx.cpu1.vmx_data(BINARY): F0 CC 15 B8 FF FF FF FF 00 00 67 41 D8 9B FF 
FF ...  size=0x288
   [ioapic]
ioapic.vds_version = 0x1 (1)
ioapic.data(BINARY): 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 
...  size=0x208
   [lapic]
lapic.vds_version = 0x1 (1)
lapic.cpu0.data(BINARY): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 ...  size=0x460
lapic.cpu1.data(BINARY): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 ...  size=0x460
   [atpit]
atpit.vds_version = 0x1 (1)
atpit.data(BINARY): 00 00 00 00 00 00 00 00 54 AD 51 97 0F 0E 00 00 ... 
 size=0xa0
   [atpic]
atpic.vds_version = 0x1 (1)
atpic.data(BINARY): 01 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 ... 
 size=0x84
   [pmtimer]
pmtimer.vds_version = 0x1 (1)
pmtimer.uptime = 0x26fd133e5cc (2679274464716)
   [rtc]
rtc.vds_version = 0x1 (1)
rtc.data(BINARY): 0A 00 00 00 

Re: BHYVE SNAPSHOT image format proposal

2023-05-23 Thread Vitaliy Gusev
Hi,

> On 23 May 2023, at 19:45, Poul-Henning Kamp  wrote:
> 
> 
>> 1. BHYVE SNAPSHOT image format:
> 
> Please do not invent Yet Another Format, please ?
> 
> Why not make it a tar(5) file ?
> 

Tar cannot solve the issues mentioned under “disadvantages”. Tar doesn’t have 
versions; it is just a container for files and would introduce another level of 
indirection. Snapshot/resume doesn’t need just a container. It needs 
information about what is saved and in what format. For example, virtual memory can 
be saved in different ways: binary, diff pages, etc.

The virtual memory of a VM should be saved quickly and without additional cost. The 
same goes for the restore stage. Do you like the idea of a tar file 8 GB in size? 
And how can it be saved efficiently without double copying of the data?

Yes, tar is powerful and convenient for many purposes, but it is not well suited 
to the suspend/resume process and would just introduce another level of complexity.
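The double-copy concern can be sketched as follows: if the memory Section starts at an aligned offset, restore can map it straight from the image. This is only a minimal Python illustration; a real implementation would operate on the vmm memory segment with pwrite(2)/mmap(2):

```python
import mmap
import os
import tempfile

# mmap offsets must be multiples of the allocation granularity.
ALIGN = mmap.ALLOCATIONGRANULARITY

def align_up(n, a=ALIGN):
    return (n + a - 1) // a * a

# Write the "guest memory" section at an aligned offset so restore can
# mmap it directly from the image instead of copying through a buffer.
fd, path = tempfile.mkstemp()
header = b"HDR".ljust(128, b"\0")       # stand-in for the real header
mem_off = align_up(len(header))
guest_mem = b"\x42" * (2 * ALIGN)       # stand-in for guest RAM

os.pwrite(fd, header, 0)
os.pwrite(fd, guest_mem, mem_off)       # single copy: RAM -> image

with mmap.mmap(fd, len(guest_mem), offset=mem_off) as m:
    restored_first = m[0]               # restore can map the section in place
os.close(fd)
os.unlink(path)
assert restored_first == 0x42
```

A tar member, by contrast, sits at an offset chosen by the archiver, so the restorer would generally have to copy the payload out before it could be mapped.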

——
Vitaliy Gusev


Re: BHYVE SNAPSHOT image format proposal

2023-05-24 Thread Vitaliy Gusev
Hi Tomek,

I will try to answer all the questions below. Please let me know if I missed 
anything important.


> On 23 May 2023, at 21:58, Tomek CEDRO  wrote:
> 
> On Tue, May 23, 2023 at 6:06 PM Vitaliy Gusev wrote:
>> Hi,
>> Here is a proposal for bhyve snapshot/checkpoint image format improvements.
>> It implies moving snapshot code to nvlist engine.
> 
> Hey there Vitaliy :-) bhyve getting more and more traction, I am new
> user of bhyve and no expert, but new and missing features are welcome
> I guess.. there was a discussion on the mailing lists recently on
> better snapshots mechanism :-)
> 
> 
>> Current snapshot implementation has disadvantages:
>> 3 files per snapshot: .meta, .kern, vram
> 
> No problem, unless new single file will be protected against
> corruption (filesystem, transfer, application crash) and possible to
> be easily and cheaply modified in place?

The current snapshot implementation doesn’t have that. I would say more: the current
pkg implementation doesn’t track/notify you if some files are changed. Binary 
files on a system, for example ELF files, can be changed without any notification.

Tar doesn’t protect the data it keeps. Some filesystems, like ZFS,
guarantee that data is not modified by the underlying disks.

Protection requires more effort, and the purpose should be clearly defined. If the 
purpose is a checksum with 99.9% reliability, the NVLIST HEADER can be widened
to hold a “checksum” key/value per Section.

If the purpose is cryptographic verification, I believe sha256 should be 
your choice.

> 
>> Binary Stream format of data.
> 
> This is small and fast? Will new format too?

Small is not the same as good. As a first attempt the snapshot code is fine. But if 
you want to get values related to a specific device, for example a NIC or the HPET, 
you cannot get them easily. Please try :)

A stream doesn’t have flexibility. It is good for well-specified and long-discussed 
protocols like XDR (NFS), where there is an RFC and each position in the stream is 
described. Example: RFC 1813.

The new format with NVLIST has flexibility and is fast enough. Note that ZFS uses 
nvlist for keeping attributes 
and many other things.


>> Adding  optional variable - breaks resume
>> Removing variable - breaks resume
>> Changing saved order of variables - breaks resume
> 
> Obviously need improvement :-)
> 
>> Hard to get information about what is saved and decode.
>> Hard to debug if somethings goes wrong
> 
> Additional tools missing? Will new format allow text editor interaction?

Why do you need to modify the snapshot image? Could you describe that more? Do you
modify the current 3 snapshot files?


>> No versions. If change code, resume of an old images can be
>> passed, but with UB.
> 
> Is new format future proof and provides backward compatibility?

The intention of moving to the new format is to have backward compatibility when 
some code is changed:

Adding an optional variable 
Removing a variable that is not used anymore
Changing the order in which variables are saved
“Hot fixes”

If the changes are critical and incompatible, the restore stage should have clear 
information about the incompatibility and abort the resume. Ideally it should be 
possible to learn this even before starting the restore process. For this purpose, 
the new format introduces versions.


> 
>> New nvlist implementation should solve all things above. The first step -
>> improve snapshot/checkpoint saving format. It eliminates three files usage
>> per a snapshot.
>> 
>> (..)
> 
> So this will be new text config based format with variable = value and 
> sections?

This is the NVLIST approach with key=value pairs, where the key is a string and the 
value can be an integer, array, string, etc.

> 
> How much bigger will be the overal file size increase?

Not so huge. The NVLIST internals are well specified. For example, for my VM:

  [kernel]
kernel.offset = 0x11f6 (4598)
kernel.size = 0x19a7 (6567)
kernel.type = "nvlist"
  [devices]
devices.offset = 0x2b9d (11165)
devices.size = 0x10145ba (16860602)
devices.type = "nvlist"

So the packed size for kernel is 6567 bytes and for devices 16860602 bytes, including
a 16 MB framebuffer. Without fbuf, the packed nvlist devices Section is 83386 
bytes.


> 
> How much longer it will take do decode/encode/process files?

It is fast, just several milliseconds. NVLIST is a very fast format. It is already 
integrated
into bhyve as the config engine.


> 
> What is the possibility of format change and backward/foward compatibility?

If you are talking about compatibility of the image format itself, it should be 
compatible in
both directions, at least for modest format changes.

If you mean overall snapshot/resume compatibility, I believe forward 
compatibility
is not the case or the target. Indeed, why would you need to resume an image cre

Re: BHYVE SNAPSHOT image format proposal

2023-05-24 Thread Vitaliy Gusev
It is ready, but in an internal git repo, and it needs time to port to the CURRENT 
version. Probably a week or so.

When I create review, I will notify you.

Thanks,
Vitaliy Gusev


> On 24 May 2023, at 20:33, Mario Marietto  wrote:
> 
> @gusev.vita...@gmail.com <mailto:gusev.vita...@gmail.com> : Do you want to 
> explain to me how to test the new "snapshot" feature ? I'm interested to test 
> and stress it on my system. Is it ready to be used ?
> 
>> 
>> Thank you for your questions. If you would like, you can try to test the new 
>> implementation and give feedback.
>> 
>> ———
>> Vitaliy Gusev
>> 
> 
> 
> -- 
> Mario.



Re: BHYVE SNAPSHOT image format proposal

2023-05-24 Thread Vitaliy Gusev
Hi, 

> On 24 May 2023, at 20:46, Miroslav Lachman <000.f...@quip.cz> wrote:
> 
> On 24/05/2023 17:10, Vitaliy Gusev wrote:
> 
>>>> Current snapshot implementation has disadvantages:
>>>> 3 files per snapshot: .meta, .kern, vram
>>> 
>>> No problem, unless new single file will be protected against
>>> corruption (filesystem, transfer, application crash) and possible to
>>> be easily and cheaply modified in place?
>> Current snapshot implementation doesn’t have it. I would say more, current
>> pkg implementation doesn’t track/notify if some of files are changed.   
>> Binary files on a
>> system can be changed, for example ELF files, without any notification.
> 
> pkg stores checksums for installed files. You can check them with pkg check 
> -s -a or pkg check --checksums -a. Changes are reported by daily periodic 
> script.


Yep, my fault. However, I found that it doesn’t track the sticky bit:

# chmod u+t /usr/local/bin/vim

# pkg check -s vim
Checking vim: 100%

My point was that if the snapshot image needs checksum verification, it could be 
done by another program, because there are many purposes (plain integrity, security, 
etc.) and building it into the snapshot image
could mean doing the work twice.

And additionally, note that the NVLIST header can be widened to hold a checksum for 
Section data.
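A sketch of what a per-Section checksum in the header could look like (the key names here are hypothetical, not part of the proposal):

```python
import hashlib

def add_checksums(sections):
    # sections: name -> raw Section bytes. Store a digest next to each
    # entry so restore can verify integrity before unpacking a Section.
    return {name: {"size": len(data),
                   "sha256": hashlib.sha256(data).hexdigest()}
            for name, data in sections.items()}

def verify(header, name, data):
    # Cheap size check first, then the digest comparison.
    ent = header[name]
    return (ent["size"] == len(data) and
            ent["sha256"] == hashlib.sha256(data).hexdigest())

hdr = add_checksums({"devices": b"packed-devices", "memory": b"\0" * 64})
assert verify(hdr, "devices", b"packed-devices")
assert not verify(hdr, "memory", b"\0" * 63)   # truncated section detected
```

Because the digests live in the header nvlist as ordinary key/value pairs, adding them later would not break readers that ignore unknown keys.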

Thanks,
Vitaliy Gusev

> Kind regards
> Miroslav Lachman
> 



Re: BHYVE SNAPSHOT image format proposal

2023-05-24 Thread Vitaliy Gusev
That is not possible. However, as a first attempt you can verify that the multidev 
branch works fine for you.

https://github.com/gusev-vitaliy/freebsd/tree/dev/D35590


——
Vitaliy Gusev

> On 24 May 2023, at 20:47, Mario Marietto  wrote:
> 
> Give me the internal git. I want to test it asap :)
> 
> On Wed, 24 May 2023 at 19:39, Vitaliy Gusev  <mailto:gusev.vita...@gmail.com>> wrote:
>> It is ready but in internal git repo, and it needs time to port it to 
>> CURRENT version. Probably week or so.
>> 
>> When I create review, I will notify you.
>> 
>> Thanks,
>> Vitaliy Gusev
>> 
>> 
>>> On 24 May 2023, at 20:33, Mario Marietto >> <mailto:marietto2...@gmail.com>> wrote:
>>> 
>>> @gusev.vita...@gmail.com <mailto:gusev.vita...@gmail.com> : Do you want to 
>>> explain to me how to test the new "snapshot" feature ? I'm interested to 
>>> test and stress it on my system. Is it ready to be used ?
>>> 
>>>> 
>>>> Thank you for your questions. If you would like, you can try to test the 
>>>> new implementation and give feedback.
>>>> 
>>>> ———
>>>> Vitaliy Gusev
>>>> 
>>> 
>>> 
>>> -- 
>>> Mario.
>> 



Re: BHYVE SNAPSHOT image format proposal

2023-05-25 Thread Vitaliy Gusev


> On 25 May 2023, at 04:30, Tomek CEDRO  wrote:
> 
> On Wed, May 24, 2023 at 5:11 PM Vitaliy Gusev wrote:
>> Protecting requires more efforts and it should be clearly defined: what is 
>> purpose. If
>> purpose is having checksum with 99.9% reliability, NVLIST HEADER can be widen
>> to have “checksum” key/value for a Section.
> 
> Well, this could be optional but useful to make sure snapshot did not
> break somehow for instance backup medium error or something like
> that.. even more maybe a way to fix it.. just a design stage idea :-

Yes, the new format can have a checksum of Section data if implemented.

> 
> 
>> If purpose is having crypto verification - I believe sha256 program should 
>> be your choice.
> 
> My question was more specific to availability of that feature
> (integrity + repair) rather than specific format :-)
> 
> The use case here is having a virtual machine (it was VirtualBox) with
> a bare os installed, plus some common applications, that is snapshoted
> at some point in time, then experimented a lot, restored from
> snapshot, etc. I had a backup of such vm + snapshot backed up that got
> broken somehow. It would be nice to know that something is broken,
> what is broken, maybe a way to fix :-)


“Integrity” is a very broad term. Which checksum algorithm is good enough?
For instance, ZFS has several options for checksum:

checksum=on|off|fletcher2|fletcher4|sha256|noparity|sha512|skein|edonr
   

Having a checksum for a filesystem is strongly recommended. However, if we
consider an image format, it doesn’t need to care about consistency of the file
itself. As an example (!), binary files in a system don’t have an integrated
checksum; validation is done by another program, such as pkg.


> 
> 
>> Why do you need modify snapshot image ? Could you describe more? Do you
>> modify current 3 snapshot files?
> 
> Analysis that require ram / nvram modification? Not sure if this is
> already possible, but may come handy for experimenting with uefi and
> maybe some OS (features) that will not run with unmodified nvram :-P


Sorry, I don’t get it: why do you need to modify the snapshot image rather than
modifying vmem directly on the running VM?

Another point: a checksum and modifying the image are two mutually exclusive
things.

> 
> 
>> If you are talking about compatibility of a Image format - it should be 
>> compatible in
>> both directions, at least for not so big format changes.
>> 
>> If you consider overall snapshot/resume compatibility - I believe forward
>> compatibility is not the target. Indeed, why would you need to resume an
>> image created by a higher version of a program?
> 
> This happens quite often. For instance there is a bug in application
> and I need to revert to (at least) one step older version. Then I am
> unable to work on a file that I just saved (or was autosaved for me).
> Firefox profile settings let be the first example. KiCAD file format
> is another example (sometimes I need to switch to a devel build to
> evade a nasty blocker bug then anyone else that uses a release is
> blocked for some months including me myself).

Any additional feature has a cost of development, testing and support. The
current implementation doesn’t support compatibility at all, and having
compatibility in both directions can be hard.

For example, if some variable is removed in bhyve, backward compatibility is
fine, but forward compatibility is not possible unless that removed variable
keeps being saved into the snapshot image just for forward compatibility. And
of course, it would have to be tested and verified as working.

Do you like that approach? I don’t think so. So I guess only backward
compatibility should be supported, to keep the snapshot code simple and robust.

Thanks,
Vitaliy Gusev




Re: BHYVE SNAPSHOT image format proposal

2023-05-25 Thread Vitaliy Gusev


> On 25 May 2023, at 19:22, Mario Marietto  wrote:
> 
> Vitaliy,
> 
> what happens if I clone your repo as source code on my FreeBSD system? Can I 
> test your code directly or not? Anyway, I think that before doing this I need 
> to follow some kind of tutorial to understand how the workflow is. Otherwise 
> I will not be able to test and stress it. 


You should build the kernel and tools and install them. Then you can run bhyve
and bhyvectl to deal with suspend/resume.

Please follow 

 9.5. Building and Installing a Custom Kernel

https://docs.freebsd.org/en/books/handbook/book/#kernelconfig-building


Make sure that BHYVE_SNAPSHOT is enabled.

Also look at build(7):

https://man.freebsd.org/cgi/man.cgi?build(7)
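For reference, the steps above amount to roughly the following kernel config and src.conf fragments. The config file name is hypothetical, and you should verify the exact option spelling against your source tree:

```
# /usr/src/sys/amd64/conf/BHYVESNAP  (hypothetical config name)
include GENERIC
ident   BHYVESNAP
options BHYVE_SNAPSHOT

# /etc/src.conf -- enable snapshot support in the bhyve userland tools
WITH_BHYVE_SNAPSHOT=yes
```

After that, the usual `make buildkernel KERNCONF=BHYVESNAP` plus buildworld/installworld from the handbook applies.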


> 
> On Thu, May 25, 2023 at 3:40 PM Vitaliy Gusev  <mailto:gusev.vita...@gmail.com>> wrote:
>> [...]
> 
> 
> -- 
> Mario.



Re: BHYVE SNAPSHOT image format proposal

2023-06-06 Thread Vitaliy Gusev
Hi Corvin, 

Thanks for your comments and advices. 

Answers are below,

> On 5 Jun 2023, at 18:32, Corvin Köhne  wrote:
> 
> On Tue, 2023-05-23 at 19:05 +0300, Vitaliy Gusev wrote:
>> 2. HEADER PHYS format: 
>> 
>>     0 +-----------------------------------------+
>>       | IDENT STRING - 64 BYTES                 |
>>    64 +--------------------------+--------------+
>>       | NVLIST SIZE - 4 BYTES    | NVLIST DATA  |
>>    72 +--------------------------+              |
>>       |                                         |
>>       |               NVLIST DATA               |
>>       |                                         |
>>  4096 +-----------------------------------------+
>> 
>>> 
>>> IDENT STRING - Each producer can set its own value to identify the
>>> image.
>>> NVLIST SIZE  - The size of the following packed header nvlist data.
>>> NVLIST DATA  - Packed nvlist header data.
>>> 
>>> 4KB should be enough for the HEADER to keep basic information about
>>> Sections. However, it can be enlarged later without breaking
>>> backward compatibility. 
>>> 
> 
> I can't see an advantage of using a fixed-size header of 4KB. You have
> to parse the offset and size of every section anyways. If it's for
> alignment requirements you can still align all sections on save and set
> the offset accordingly. So, why complicate things by using a fixed
> header size?

You are right about the 4KB restriction; I will correct it in the updated
format proposal. The idea is to reserve enough space for the HEADER and, after
all stages have finished, write it at the beginning of the snapshot file.

The implementation (snapshot path) should know the estimated maximum size of
the header and can use the possible maximum. Currently 4KB is enough, and it
can easily be increased in bhyve’s code without any problem.

Alignment is useful for debugging and for looking into a snapshot image file.

> 
> The IDENT STRING seems to be very large. Even a GUID which should be a
> global unique identifier uses just 16 Bytes. Additionally, it might be
> worth using a dedicated ident and version field for an easier version
> parsing. E.g.:

The intention is to reserve enough space for future versions and for other
producers and companies to specify their own ID string with possible add-on
information. So spending 64 bytes for the future is not a huge price, and it
can be very useful.

During resume, if the IDENT string is not the same as bhyve’s, resume can fail
before parsing any other data, because the internal format may not be as
expected.

I would not fix the IDENT string format and would just apply this rule:

During resume, bhyve compares its own IDENT string with the IDENT string from
the snapshot image. If they are not the same, no further assumption about the
format can be made, and resume should fail.

> 
> +-------------------+--------------------+
> | IDENT - 56 BYTES  | VERSION - 8 BYTES  |
> +-------------------+--------------------+
> 
> IDENT - "BHYVE CHECKPOINT IMAGE"
> VERSION - 1 (as uint64_t)
> 
> Btw: I don't care but here we could leave some free space for possible
> forward compatibility. E.g.:
> 
> +-------------------+--------------------+--------------------------+
> | IDENT - 16 BYTES  | VERSION - 8 BYTES  | _FREE_SPACE_ - 40 BYTES  |
> +-------------------+--------------------+--------------------------+
...
>> 4. EXAMPLE:
>> 
>> 
>>  IDENT STRING:
>> 
>>"BHYVE CHECKPOINT IMAGE VERSION 1"
>> 
>>  NVLIST HEADER: 
>> 
>>   [config]
>> config.offset = 0x1000 (4096)
>> config.size = 0x1f6 (502)
>> config.type = "text"
>> 
> 
> Not sure if it's just an example for the "text" type. bhyve converts it
> into a nvlist, so it could be saved directly as nvlist.
> Btw: I would only implement the "text" type if there's a use case that
> can't be solved by one of the other types.


Intention is to use current engine to dump bhyve’s config and read config
from a file (-k option).

Advantage of using “text” type - simple implementation and as an example
of flexibility of proposed image format. Image file can keep any types that
a producer would like to use: text, nvlist, binary, diff-pages, etc.

> 
> All in all, it looks good. Keep on your work!
> 
> Regards checksum feature:
> We should focus on enabling this feature by default before adding
> advanced features. So, keep it simple and small.

Could you give more detail about what you meant by the “checksum” feature? Did
you mean something like TAR’s checksum, i.e. the header only?


> 
> Regards forward compatibility:
> Backward compatibility is way more important than forward
> compatibility. Nevertheless, forward compatibility would be nice to
> have. So, we should keep it in mind when modifying the layout. For the
> moment, just focus on a format which is backward compatible.
> 

It seems that having information about forward compatibility could be very
useful, at least to learn in advance that a restore is impossible. I will add
it while implementing this format.

Thanks,
Vitaliy Gusev



Re: BHYVE SNAPSHOT image format proposal

2023-06-07 Thread Vitaliy Gusev
+-------------------------------+
| MAGIC ID                      |
+-------------------------------+
| PRODUCER ID                   |
+-------------------------------+
| NVLIST HEADER SIZE            |
+-------------------------------+
| NVLIST HEADER DATA (SECTIONS) |
+-------------------------------+
| SNAPSHOT DATA                 |
+-------------------------------+


MAGIC ID: should be hardcoded: "BHYVE CHECKPOINT IMAGE”.

PRODUCER ID: can be empty and is defined by the producer, i.e. reserved. 

NVLIST HEADER SIZE: has enough capacity; in general the size is less than 4KB

NVLIST HEADER DATA: Packed nvlist data, contains Sections:  “config”, “kernel”, 
“devices”, “memory”, … :

[config]
offset = 0x1000 (4096)
size = 0x1f6 (502)
type = text
vers = 1
subvers = 5
[kernel]
offset = 0x11f6 (4598)
size = 0x19a7 (6567)
type = nvlist
vers = 1
subvers = 0
[devices]
offset = 0x2b9d (11165)
size = 0x10145ba (16860602)
type = nvlist
vers = 2
subvers = 1
[memory]
offset = 0x1200000 (18874368)
size = 0x3ce00000 (1021313024)
type = pages
vers = 1
subvers = 0 


I hope this gives a complete picture.
Thanks,
Vitaliy Gusev



Re: bhyve: how to keep the host from starving the guest

2023-06-26 Thread Vitaliy Gusev
Hi Aryeh,

Have you wired the guest memory with bhyve's -S option?

-S  Wire guest memory

Anyway, the OS has no choice other than to kill a process to free some memory
when RAM+swap is fully used (assuming the kernel has already scanned Inactive
memory).

As recommendations:
Look at other memory consumers such as ZFS and other processes
Increase swap
Tune the vm.overcommit sysctl. See tuning(7) for details.

So in short, there is no good way to run applications that fully use 10 GB of
memory on a system with just 1 GB RAM + 1 GB swap. You should have enough
resources to do that.

And it would be nice if you provided more information and metrics for your
system: total RAM, memory assigned to the VM, and additional statistics for all
memory-intensive processes (SIZE, RES), etc.

—
Vitaliy Gusev

> On 26 Jun 2023, at 13:44, Aryeh Friedman  wrote:
> 
> I have a 12 core machine where I want to allocate only 4 CPUs to the host
> and 8 to a VM (the host is my desktop FreeBSD machine and the guest is
> debian 11 used for playing around with learning AI model making) I
> have passed my GeForce 1030 (bottom of the line GPU for AI work it
> seems) but since it appears that no one can get tensorflow, pytorch or
> anything else that runs ANN's on a GPU to work on FreeBSD (I have
> tracked down to the fact nvidia never ported CUDA to FreeBSD)... the
> problem is sometimes during heavy work on the guest then the host
> slows down and if the the host is doing heavy work (especially
> resource intensive things like compiling lang/rust) that the host will
> kill the guest if it runs out of available memory+swap (speaking of
> that does wiring the memory at least prevent this?)
> 
> -- 
> Aryeh M. Friedman, Lead Developer, http://www.PetiteCloud.org
> 



Re: bhyve: how to keep the host from starving the guest

2023-06-26 Thread Vitaliy Gusev

> On 26 Jun 2023, at 15:06, Aryeh Friedman  wrote:
> 
> On Mon, Jun 26, 2023 at 7:50 AM Vitaliy Gusev  <mailto:gusev.vita...@gmail.com>> wrote:
>> ...
>> As recommendation:
>> 
>> Look at an another memory consumers like ZFS, another processes
>> Increase swap
>> Tune vm.overcommit sysctl. See tuning(7) for details.
>> ...
> You completely mischaracterize the situation: I want to reserve 16GB or
> 24GB for the VM and the other 8 are for the host (and the host
> alone). I have already used the -S flag since it is required by
> passthru.
> 
> Also the memory is successfully reserved according to top(1), but it
> still runs out (i.e. it shows 19GB are wired).
> 


You can try with protect(1):

 protect – protect processes from being killed when swap space is
 exhausted