# Background

I am creating archives using `--sort=name`, with files deliberately named so that some smaller file(s) appear before the larger file(s) in the archive. My goal is to be able to extract only the smaller file(s) under certain circumstances, without scanning the entire archive. This should be achievable using the `--occurrence` flag.
This ordinarily works fine, but I have found a case in which tar continues to scan the archive after it has found the first match, thus wasting time.

# To replicate

The problem can be illustrated using the following example archive. I create a small file and a large file two directories deep in the archive. The small file is named such that it will be sorted to appear in the archive before the large file.

```
# Clear the staging directory
rm -rf /tmp/occurrence-example-staging-dir

# Create subdirectories
mkdir -p /tmp/occurrence-example-staging-dir/example-archive/another-dir

# Create a small and large file (large file is about 2GB in this example)
dd bs=4096 count=1 if=/dev/random of=/tmp/occurrence-example-staging-dir/example-archive/another-dir/example-file-aaaaaaa-small
dd bs=4096 count=500000 if=/dev/random of=/tmp/occurrence-example-staging-dir/example-archive/another-dir/example-file-b-large

# Create the archive
cd /tmp/occurrence-example-staging-dir
tar -czf /tmp/example-archive.tar.gz example-archive --sort=name
```

You can then time the following extractions to see if tar is reading the entire archive or stopping after extracting the first file.

In this first example, the wildcard search name "*example-file-aaaaaaa-small" is the same string length as the directory in which it is found in the archive: "example-archive/another-dir" (27 characters).

```
time tar xzv -C /tmp --occurrence=1 --wildcards '*example-file-aaaaaaa-small' -f /tmp/example-archive.tar.gz
```

The above command demonstrates the bug. The extraction of the small file takes several seconds, which is the same amount of time tar would take to extract the entire archive.
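
The length coincidence described above can be checked without tar at all. The following is only a minimal sketch using plain shell string operations; the variable names are made up for illustration and do not come from tar. It prints the length of the search name, the length of the directory, the length of the full member name, and the character of the member name that follows its first 27 characters.

```shell
# Values copied from the example above; variable names are hypothetical.
pattern='*example-file-aaaaaaa-small'
dir='example-archive/another-dir'
member="$dir/example-file-aaaaaaa-small"

# ${#var} is the POSIX shell string-length expansion.
pattern_len=${#pattern}
dir_len=${#dir}
member_len=${#member}

# cut counts characters from 1, so position pattern_len + 1 is the
# character that comes right after the first pattern_len characters.
next_char=$(printf '%s' "$member" | cut -c"$((pattern_len + 1))")

echo "search name length: $pattern_len"
echo "directory length:   $dir_len"
echo "member name length: $member_len"
echo "character after the first $pattern_len characters: $next_char"
```

With the names used in this report, the search name and the directory are both 27 characters long, and the character that follows is the slash separating the directory from the file name.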

If I alter the search name by even a single character, so that it is not the same length as the directory in the archive, the extraction is nearly instantaneous:

```
time tar xzv -C /tmp --occurrence=1 --wildcards 'e*example-file-aaaaaaa-small' -f /tmp/example-archive.tar.gz
```

(Note that above I added the first character of the directory, "e", to the search name. Any search name whose length is not 27 characters would work just as well.)

# Behavior I observe

I observe tar scanning the entire archive even after it has found the matching occurrence.

# Behavior I desire

I desire that tar would stop scanning the archive after it has found the matching occurrence.

# Versions tested

This issue occurs in the latest released tar, 1.35. I also built the latest unreleased source code (commit 430306673049f4ce8cc9db7578cd6e1f4200a9c2 from May 14, 2025) and confirmed that the bug still occurs in the latest code.

# Looking at the code

I tracked down the behavior to a check in `all_names_found()` in names.c. The following check is evaluating as true and making the function return false (as in, all names have not been found, even though they have):

```
len >= cursor->length && ISSLASH (p->file_name[cursor->length])
```

In this case `len` is the length of the name of the file in the archive just scanned ("example-archive/another-dir/example-file-aaaaaaa-small", 54), `cursor` is the search name "*example-file-aaaaaaa-small" (length 27), and there is a slash at character 27 of the name just scanned (the slash immediately after "example-archive/another-dir").

Because the search name contains wildcards, it is not necessarily (and in this case, definitely not) the full path of the file being scanned in the archive, so I think the comparison is invalid. A more correct comparison might be to check whether the wildcard could match only a directory or directories in the path of the archive file just scanned. In that case, scanning must continue.
Otherwise, a match has been found.

Regards,
Andrew Langefeld