Gentlemen, I found a solution, as well as the reason why lseek(SEEK_DATA) returns invalid data.
I found a way to craft a file that gets corrupted when we try to make a sparse copy. It’s based on mmap:

1. Open src and dst.
2. Obtain the size of src, and truncate dst to the same size.
3. Use mmap to map dst to RAM.
4. Read src and memcpy all non-blank pages to dst.
5. Use munmap and then close src and dst.
6. If src contained any blank pages, dst is now sparse.
7. Since we did not call msync, the changes are still cached in RAM and lseek(SEEK_DATA) is not aware that some of the pages contain valid data.

Workaround:

1. open(dst, O_RDONLY)
2. Obtain the file size = lseek(dst, 0, SEEK_END)
3. Use mmap to map dst to RAM.
4. Call msync.
5. Use munmap and then close dst.
6. The memory view of dst is now synchronised with the underlying filesystem.
7. lseek(SEEK_DATA) returns valid data.

Proof of Concept:

1. Start by extracting cc1
   https://httpstorm.com/share/.openwrt/test/2023-02-06_coreutils-9.1/cc1.tgz

2. The extracted file is not sparse, so we need to craft a sparse copy, cc1-mmap, without synchronising the filesystem with the memory view

   gcc m.c && ./a.out
   src 3
   dst 4
   00001000 skip
   total bytes copied 1a45640 / 1a46640 1a46640

3. Now sparse copy cc1-mmap to cc1-sparse

   gcc d.c && ./a.out
   src 3
   dst 4
   c a46640 p 1a46640 h 1a46640 d 1000000
   total bytes copied a46640 / 1a46640

4. cc1-sparse is corrupted

   sha1sum cc1*
   16d835378ab973a114082a585cc76958bdbccec0 cc1
   16d835378ab973a114082a585cc76958bdbccec0 cc1-mmap
   75e6f6cb303cb5d3909c6d7830c417fc6ed658c3 cc1-sparse

5. Use msync to update the filesystem

   gcc n.c && ./a.out
   dst 3
   MS_ASYNC 1
   MS_INVALIDATE 2
   MS_SYNC 16
   size bytes 1a46640
   msync ok

6. Make a sparse copy again, cc1-mmap to cc1-sparse

   gcc d.c && ./a.out
   src 3
   dst 4
   c 1000 p 1000 h 1000 d 0
   c 1a44640 p 1a46640 h 1a46640 d 2000
   total bytes copied 1a45640 / 1a46640

7. cc1-sparse is good

   sha1sum cc1*
   16d835378ab973a114082a585cc76958bdbccec0 cc1
   16d835378ab973a114082a585cc76958bdbccec0 cc1-mmap
   16d835378ab973a114082a585cc76958bdbccec0 cc1-sparse

PoC samples
d.c
m.c
n.c
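
For readers who cannot open the attachments, the workaround steps listed earlier (open read-only, get the size with lseek, mmap, msync, munmap, close) could be sketched roughly as below. This is not the attached n.c. The function name flush_mapped_file is hypothetical, and the choice of MS_SYNC and of a PROT_READ/MAP_SHARED mapping is an assumption, since the message does not say which msync flag the original program passes.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not the attached n.c): map the file read-only and
 * call msync on the mapping, following the workaround steps.
 * Returns 0 on success, -1 on failure. */
int flush_mapped_file(const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        perror("open");
        return -1;
    }

    /* The file size is the offset of the end of the file. */
    off_t size = lseek(fd, 0, SEEK_END);
    if (size < 0) {
        perror("lseek");
        close(fd);
        return -1;
    }
    if (size == 0) {
        /* Nothing to map: mmap rejects a zero length. */
        close(fd);
        return 0;
    }

    void *p = mmap(NULL, (size_t)size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return -1;
    }

    /* Assumption: MS_SYNC; the original may use a different flag. */
    int rc = msync(p, (size_t)size, MS_SYNC);
    if (rc != 0)
        perror("msync");

    munmap(p, (size_t)size);
    close(fd);
    return rc == 0 ? 0 : -1;
}
```

Calling flush_mapped_file("cc1-mmap") before the sparse copy would correspond to step 5 of the Proof of Concept. Whether this is sufficient on a given kernel and filesystem is something to verify with the sha1sum check shown there.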
Cheers!
Georgi Valkov

httpstorm.com
nano RTOS