# Background

I am creating archives with `--sort=name`, using deliberately chosen file names 
so that some smaller file(s) appear before the larger file(s) in the archive. 
My goal is to be able to extract only the smaller file(s) under certain 
circumstances, without scanning the entire archive. This should be achievable 
using the `--occurrence` option.

This ordinarily works fine, but I have found a case in which tar continues to 
scan the archive after it has found the first match, thus wasting time.

# To replicate

The problem can be illustrated using the following example archive. I create a 
small file and a large file two directories deep in the archive. The small file 
is named such that it will be sorted to appear in the archive before the large 
file.

```
### Clear the staging directory
rm -rf /tmp/occurrence-example-staging-dir

### Create subdirectories
mkdir -p /tmp/occurrence-example-staging-dir/example-archive/another-dir

### Create a small and large file (large file is about 2GB in this example)
dd bs=4096 count=1 if=/dev/random \
  of=/tmp/occurrence-example-staging-dir/example-archive/another-dir/example-file-aaaaaaa-small
dd bs=4096 count=500000 if=/dev/random \
  of=/tmp/occurrence-example-staging-dir/example-archive/another-dir/example-file-b-large

### Create the archive
cd /tmp/occurrence-example-staging-dir
tar -czf /tmp/example-archive.tar.gz example-archive --sort=name
```

You can then time the following extractions to see if tar is reading the entire 
archive or stopping after extracting the first file.

In this first example, the wildcard search name "*example-file-aaaaaaa-small" 
has the same length as the path of the directory that contains the file in the 
archive: "example-archive/another-dir" (27 characters).

```
time tar xzv -C /tmp --occurrence=1 --wildcards '*example-file-aaaaaaa-small' \
  -f /tmp/example-archive.tar.gz
```

The above command demonstrates the bug. The extraction of the small file takes 
several seconds, which is the same amount of time tar would take to extract the 
entire archive.
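
The length coincidence described above can be confirmed in the shell before running tar. This is only a sketch; the variable names are mine and are not part of tar.

```shell
# Search pattern and the directory path of the member inside the archive
pattern='*example-file-aaaaaaa-small'
member_dir='example-archive/another-dir'

# ${#var} expands to the number of characters in var
pattern_len=${#pattern}
dir_len=${#member_dir}
echo "pattern: $pattern_len, directory: $dir_len"
```

Both lengths come out as 27, which is the coincidence that triggers the slow extraction.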

If I alter the search name by even a single character, so that it is not the 
same length as the directory in the archive, the extraction is nearly 
instantaneous:

```
time tar xzv -C /tmp --occurrence=1 --wildcards 'e*example-file-aaaaaaa-small' \
  -f /tmp/example-archive.tar.gz
```

(Note that above I added the first character of the directory, "e", to the 
search name. Other lengths should work just as well, as long as the archived 
path has no slash at that offset.)

# Behavior I observe

I observe tar scanning the entire archive even after it has found the matching 
occurrence.

# Behavior I desire

I desire that tar would stop scanning the archive after it has found the 
matching occurrence.

# Versions tested

This issue occurs in the latest released tar, 1.35. I also built the latest 
unreleased source code (commit 430306673049f4ce8cc9db7578cd6e1f4200a9c2 from 
May 14, 2025) and confirmed that the bug still occurs in the latest code.

# Looking at the code

I tracked down the behavior to a check in `all_names_found()` in names.c. The 
following check is evaluating as true and making the function return false (as 
in, all names have not been found, even though they have):

```
len >= cursor->length && ISSLASH (p->file_name[cursor->length])
```

Here `len` is the length of the name of the archive member just scanned 
("example-archive/another-dir/example-file-aaaaaaa-small", 54), `cursor` is the 
entry for the search name "*example-file-aaaaaaa-small" (length 27), and the 
member name has a slash at index 27 (counting from zero): it is the slash 
between "example-archive/another-dir" and "example-file-aaaaaaa-small".
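
To see why the two search names behave differently, the arithmetic of that check can be mimicked in POSIX shell. This is a sketch of the condition as quoted above, not tar's actual code: `prefix_check` is a hypothetical helper, and it assumes `cursor->length` is simply the length of the pattern string.

```shell
# Hypothetical helper: succeeds when the quoted condition would be true,
# i.e. the member name is at least as long as the pattern and has a '/'
# at the zero-based offset equal to the pattern length.
prefix_check() {
  pc_member=$1
  pc_pattern=$2
  pc_len=${#pc_pattern}
  [ "${#pc_member}" -ge "$pc_len" ] || return 1
  # cut numbers columns from 1, so zero-based offset N is column N + 1
  pc_char=$(printf '%s' "$pc_member" | cut -c"$((pc_len + 1))")
  [ "$pc_char" = "/" ]
}

member='example-archive/another-dir/example-file-aaaaaaa-small'
for pattern in '*example-file-aaaaaaa-small' 'e*example-file-aaaaaaa-small'; do
  if prefix_check "$member" "$pattern"; then
    echo "$pattern: check is true"
  else
    echo "$pattern: check is false"
  fi
done
```

The 27-character pattern lands exactly on the slash after "another-dir", so the check is true; the 28-character pattern lands on the "e" that follows it, so the check is false.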

Because the search name contains wildcards, it is not necessarily (and here, 
definitely not) the literal path of the file being scanned in the archive, so I 
think the comparison is invalid. A more correct comparison might be to check 
whether the wildcard pattern could match only a directory (or directories) in 
the path of the archive member just scanned. If it could, scanning must 
continue; otherwise, a match has been found.

Regards,
Andrew Langefeld

