On April 12, 2023 10:22:25 AM PDT, Charlie Li <[email protected]> wrote:
>Charlie Li wrote:
>> Cy Schubert wrote:
>>> On April 12, 2023 8:51:09 AM PDT, Charlie Li <[email protected]> wrote:
>>>> Cy Schubert wrote:
>>>>> I have a "sandhbox" pool, called t, used for /usr/obj and ports wrkdirs,
>>>>> and other writes I can easily recreate on my laptop. Here are the results
>>>>> of my tests.
>>>>>
>>>>> Method:
>>>>>
>>>>> Initially I copied my /usr/obj from my two build machines (one
>>>>> amd64.amd64 and an i386.i386) to my "sandbox" zpool.
>>>>>
>>>>> Next, with block_cloning disabled I did cp -R of the /usr/obj test files.
>>>>> Then a diff -qr. They source and target directories were the same.
>>>>>
>>>>> Next, I cleaned up (rm -rf) the target directory to prepare for the
>>>>> block_clone enabled test.
>>>>>
>>>>> Next, I did zpool checkpoint t. After this, zpool upgrade t. Pool t now
>>>>> has block_cloning enabled.
>>>>>
>>>>> I repeated the cp -R test from above followed by a diff -qr. Almost
>>>>> every file was different. The pool was corrupted.
>>>>>
>>>>> I restored the pool by the following removing the corruption:
>>>>>
>>>>>
>>>>> slippy# zpool export t
>>>>> slippy# zpool import --rewind-to-checkpoint t
>>>>> slippy#
>>>>>
>>>>> It is recommended that people avoid upgrading their zpools until the
>>>>> problem is fixed.
>>>>>
>>>> As of af7624ed3145, I just did this with an md(4)-backed test pool, though
>>>> with the second `cp -R` landing in a separate dataset, created and
>>>> destroyed for each test. No corruption either way. However, my poudriere
>>>> builds still output/package corrupted files (particularly those with null
>>>> characters), probably after install(1) invocations (not cp(1)).
>>>>
>>>
>>> You need to copy from/to the same dataset to reproduce the problem. Copying
>>> from a source dataset to a different dataset will avoid block_cloning.
>>>
>> Got the corruption now.
>>
>Clarify: no corruption without block_cloning, corruption with.
>
>What is still a mystery to me is how corruption happens even without
>block_cloning in the poudriere scenario. cp(1)/install(1) always happen within
>the same dataset, as this test.
>

This is because your pool has previously corrupted blocks. Even though you
backed up the old pool, created a new pool without block_cloning and
restored your data, the backup contained corrupted blocks from your old
pool, so they were restored as is.

ZFS can only fix corruption if the checksum says a block is corrupt. The
restored blocks were written with checksums that match their (already
wrong) contents, so as far as ZFS was concerned at the time, those blocks
were not corrupted. You will need to delete the files with corruption and
recreate them.

Even after this regression is fixed and people build and install a fixed
kernel, whatever was corrupted will remain so until the corrupted files are
either removed and recreated or fixed manually. This regression will have
long-lasting effects.

As Kirk McKusick has reiterated many times, back in the old days people
didn't trust EXT*FS because of the data corruption they experienced. Sadly,
ZFS will need to earn people's trust back again. This is unfortunate.

-- 
Cheers,
Cy Schubert <[email protected]>
FreeBSD UNIX: <[email protected]> Web: https://FreeBSD.org
NTP: <[email protected]> Web: https://nwtime.org

e^(i*pi)+1=0

Pardon the typos. Small keyboard in use.
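
P.S. Since a scrub will not flag these files, one way to find what needs
recreating is to compare the suspect tree against a copy known to be good
(for example, a fresh build on a pool that never had block_cloning
enabled). What follows is only a rough sketch under that assumption: the
function name list_corrupt and the example paths are made up for
illustration, and file names containing newlines are not handled.

```shell
#!/bin/sh
# Hypothetical helper (not part of FreeBSD or ZFS): list files whose
# contents in a suspect tree differ from a known-good copy of the tree.
# Usage: list_corrupt GOOD_DIR SUSPECT_DIR
list_corrupt() {
    good=$1
    suspect=$2
    # Walk the known-good tree and compare each regular file byte for byte.
    (cd "$good" && find . -type f) | while IFS= read -r f; do
        rel=${f#./}
        if [ ! -f "$suspect/$rel" ]; then
            echo "missing: $rel"
        elif ! cmp -s "$good/$rel" "$suspect/$rel"; then
            echo "differs: $rel"
        fi
    done
}
```

Something like `list_corrupt /path/to/good/obj /usr/obj` (both paths are
placeholders) would then print the candidates to remove and recreate. This
does the same job as the diff -qr in the test above, just one file per
line so the output is easy to feed to other tools.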
