Issue 140860
Summary [C++20][Modules] The search complexity for header file info is too expensive with a lot of modules.
Labels new issue
Assignees
Reporter mpark
    Currently, for each header file `#include` or `import`, there is a look-up of this header file on each of the loaded PCMs. This results in a search complexity of `O(N*M)` where `N` is the # of includes, and `M` is the # of loaded PCMs. For large numbers, e.g. `N` = 10,000 and `M` = 1,000, this becomes very costly.

## Test Case

I wrote a bash script to generate the test case. It produces 1,000 empty header units, and 10,000 empty header files. The main file `import`s the 1,000 header units, then `#include`s 10,000 header files. We can measure the preprocessing time on this file. On my machine, it takes about **14s** user time. This is a significant regression compared to without modules which takes about **0.06s**. In a real world scenario, the difference for us was **3s** without modules and **18s** with modules.

The script for the synthetic repro:

```bash
#/usr/bin/env bash

CLANG=${CLANG:-clang++}
NUM_MODULES=${NUM_MODULES:-1000}
NUM_HEADERS=${NUM_HEADERS:-10000}

mkdir build && cd build

for i in $(seq 1 $NUM_MODULES); do > m${i}.h; done
seq 1 $NUM_MODULES | xargs -I {} -P $(nproc) bash -c "$CLANG -std=c++20 -fmodule-header m{}.h"

for i in $(seq 1 $NUM_HEADERS); do > h${i}.h; done

> a.cpp
for i in $(seq 1 $NUM_MODULES); do echo "import \"m${i}.h\";" >> a.cpp; done
for i in $(seq 1 $NUM_HEADERS); do echo "#include \"h${i}.h\"" >> a.cpp; done
echo "int main() {}" >> a.cpp

time $CLANG -std=c++20 -Wno-experimental-header-units -E a.cpp -o /dev/null \
    $(for i in $(seq 1 $NUM_MODULES); do echo -n "-fmodule-file=m${i}.pcm "; done)
```

> Note: I am aware that `-fmodule-file=<pcm>` eagerly loads modules and that it's not encouraged. In reality we actually use the `-fmodule-file=<name>=<pcm>` form with header units.

## Code Path

We more or less start [here](https://github.com/llvm/llvm-project/blob/645846d43b1e6b71b376589d146043db2ed2be54/clang/lib/Lex/PPDirectives.cpp#L1063-L1067):

```cpp
 // Do a standard file entry lookup.
  OptionalFileEntryRef FE = HeaderInfo.LookupFile(...);
```
and proceed through the following:

- [`It->LookupFile(...);`](https://github.com/llvm/llvm-project/blob/645846d43b1e6b71b376589d146043db2ed2be54/clang/lib/Lex/HeaderSearch.cpp#L1079): `It` is an iterator over the search directories.
- [`HS.findUsableModuleForHeader(...);`](https://github.com/llvm/llvm-project/blob/645846d43b1e6b71b376589d146043db2ed2be54/clang/lib/Lex/HeaderSearch.cpp#L532-L534)
- [`suggestModule(...);`](https://github.com/llvm/llvm-project/blob/645846d43b1e6b71b376589d146043db2ed2be54/clang/lib/Lex/HeaderSearch.cpp#L1676)
- [`HS.findModuleForHeader(...);`](https://github.com/llvm/llvm-project/blob/645846d43b1e6b71b376589d146043db2ed2be54/clang/lib/Lex/HeaderSearch.cpp#L1639)

Here we encounter this:
```cpp
ModuleMap::KnownHeader
HeaderSearch::findModuleForHeader(FileEntryRef File, bool AllowTextual,
                                  bool AllowExcluded) const {
  if (ExternalSource) {
    // Make sure the external source has handled header info about this file,
    // which includes whether the file is part of a module.
 (void)getExistingFileInfo(File);
  }
  return ModMap.findModuleForHeader(File, AllowTextual, AllowExcluded);
}
```
where there is an invocation to [`(void)getExistingFileInfo(File);`](https://github.com/llvm/llvm-project/blob/645846d43b1e6b71b376589d146043db2ed2be54/clang/lib/Lex/HeaderSearch.cpp#L1610), which calls [`ExternalSource->GetHeaderFileInfo(FE);`](https://github.com/llvm/llvm-project/blob/d50c85df255c6f0ba195bcf3f9c5236120e3984d/clang/lib/Lex/HeaderSearch.cpp#L1342).

We finally arrive at:
```cpp
HeaderFileInfo ASTReader::GetHeaderFileInfo(FileEntryRef FE) {
  HeaderFileInfoVisitor Visitor(FE);
  ModuleMgr.visit(Visitor);
  if (std::optional<HeaderFileInfo> HFI = Visitor.getHeaderFileInfo())
 return *HFI;

  return HeaderFileInfo();
}
```
where [`ModuleMgr.visit(Visitor);`](https://github.com/llvm/llvm-project/blob/645846d43b1e6b71b376589d146043db2ed2be54/clang/lib/Serialization/ASTReader.cpp#L6866) will visit every currently loaded PCM.

---

Every PCM file contains a `HEADER_SEARCH_TABLE` record, which is an on-disk hash table. For header units, it has at least one entry in that table which is the header file itself. For named modules, we shouldn't need this, but it's still present today (see #78495). This means that every header unit has a `HEADER_SEARCH_TABLE` with at least one entry in it.
_______________________________________________
llvm-bugs mailing list
llvm-bugs@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs

Reply via email to