# Background
I am creating archives using `--sort=name` with deliberately named files sorted
such that some smaller file(s) appear before the larger file(s) in the archive.
My goal is to be able to extract only the smaller file(s) under certain
circumstances, without scanning the entire archive. This should be achievable
using the `--occurrence` flag.
This ordinarily works fine, but I have found a case in which tar continues to
scan the archive after it has found the first match, thus wasting time.
# To replicate
The problem can be illustrated using the following example archive. I create a
small file and a large file two directories deep in the archive. The small file
is named such that it will be sorted to appear in the archive before the large
file.
```
### Clear the staging directory
rm -rf /tmp/occurrence-example-staging-dir
### Create subdirectories
mkdir -p /tmp/occurrence-example-staging-dir/example-archive/another-dir
### Create a small and large file (large file is about 2GB in this example)
dd bs=4096 count=1 if=/dev/random \
  of=/tmp/occurrence-example-staging-dir/example-archive/another-dir/example-file-aaaaaaa-small
dd bs=4096 count=500000 if=/dev/random \
  of=/tmp/occurrence-example-staging-dir/example-archive/another-dir/example-file-b-large
### Create the archive
cd /tmp/occurrence-example-staging-dir
tar -czf /tmp/example-archive.tar.gz example-archive --sort=name
```
You can then time the following extractions to see if tar is reading the entire
archive or stopping after extracting the first file.
In this first example, the wildcard search name "*example-file-aaaaaaa-small"
has the same string length as the directory path that contains the file in the
archive: "example-archive/another-dir" (27 characters).
```
time tar xzv -C /tmp --occurrence=1 --wildcards '*example-file-aaaaaaa-small' \
  -f /tmp/example-archive.tar.gz
```
The above command demonstrates the bug. The extraction of the small file takes
several seconds, which is the same amount of time tar would take to extract the
entire archive.
If I alter the search name by even a single character, so that it is not the
same length as the directory in the archive, the extraction is nearly
instantaneous:
```
time tar xzv -C /tmp --occurrence=1 --wildcards 'e*example-file-aaaaaaa-small' \
  -f /tmp/example-archive.tar.gz
```
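(Note that above I added the first character of the directory, "e", to the
search name, making it 28 characters long. Judging by the check described under
"Looking at the code", any length that does not coincide with the position of a
slash in the archived path should work just as well.)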
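The lengths quoted above can be double-checked with a short shell sketch. The variable names here are only illustrative and are not part of tar.

```shell
# Directory prefix as stored in the archive, plus the two search names.
dir='example-archive/another-dir'
slow_pattern='*example-file-aaaaaaa-small'
fast_pattern='e*example-file-aaaaaaa-small'

# ${#var} expands to the number of characters in var (POSIX shell).
echo "directory:    ${#dir}"
echo "slow pattern: ${#slow_pattern}"
echo "fast pattern: ${#fast_pattern}"
```

This prints 27, 27, and 28, confirming that only the slow search name matches the directory's length.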
# Behavior I observe
I observe tar scanning the entire archive even after it has found the matching
occurrence.
# Behavior I desire
I desire that tar would stop scanning the archive after it has found the
matching occurrence.
# Versions tested
This issue occurs in the latest released tar, 1.35. I also built the latest
unreleased source code (commit 430306673049f4ce8cc9db7578cd6e1f4200a9c2 from
May 14, 2025) and confirmed that the bug still occurs in the latest code.
# Looking at the code
I tracked down the behavior to a check in `all_names_found()` in names.c. The
following check evaluates as true and makes the function return false (that
is, it reports that not all names have been found, even though they have):
```
len >= cursor->length && ISSLASH (p->file_name[cursor->length])
```
In this case `len` is the length of the name of the archive member just scanned
("example-archive/another-dir/example-file-aaaaaaa-small", 54), `cursor` is the
search name "*example-file-aaaaaaa-small" (length 27), and there is a slash at
zero-based index 27 of the member name, immediately after
"example-archive/another-dir".
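The arithmetic behind that check can be reproduced outside tar. The following is only a shell model of the quoted expression under the values given above, not tar's own code; `char_at` and `check_trips` are hypothetical helper names introduced for illustration.

```shell
member='example-archive/another-dir/example-file-aaaaaaa-small'

# Print the character of STRING at a zero-based OFFSET (empty if out of range).
char_at() {
  printf '%s' "$1" | cut -c "$(( $2 + 1 ))"
}

# Succeed when the modelled condition holds for MEMBER and PATTERN:
# the member name is at least as long as the pattern, and the member name
# has a slash at the offset equal to the pattern's length.
check_trips() {
  [ "${#1}" -ge "${#2}" ] && [ "$(char_at "$1" "${#2}")" = "/" ]
}

for pattern in '*example-file-aaaaaaa-small' 'e*example-file-aaaaaaa-small'; do
  if check_trips "$member" "$pattern"; then
    echo "length ${#pattern}: condition true, scanning continues"
  else
    echo "length ${#pattern}: condition false"
  fi
done
```

With the 27-character search name the modelled condition is true, and with the 28-character one it is false, which is consistent with the timings reported above.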
Because the search name contains wildcards, it is not necessarily (and here,
definitely not) the full path of the file being scanned in the archive, so I
think the comparison is invalid. A more correct comparison might be to check
whether the wildcard could stand for only a directory or directories in the
path of the archive file just scanned. If so, scanning must continue.
Otherwise, a match has been found.
Regards,
Andrew Langefeld