On Sunday, 16 July 2023 at 23.32.21 EEST, Lynne wrote:
> Introducing additional overhead in the form of a dereference is a point
> where instability can creep in. Can you guarantee that a context will
> always remain in L1D cache,

L1D is not involved here. In version 2, the pointers are cached locally.

> as opposed to just reading the raw CPU timing
> directly where that's supported.

Of course not. Raw CPU timing is subject to noise from interrupts (and
whatever those interrupts trigger). And that's not just theoretical: I've
experienced it, and it sucks. Raw CPU timing is much noisier than Linux perf.

And because it has also been proven vastly insecure, it has been disabled on
Arm for a long time, and is being disabled on RISC-V too now.

> > But I still argue that that is, either way, completely negligible compared
> > to the *existing* overhead. Each loop is making 4 system calls, and each
> > of those system calls requires a direct call (to PLT) and an indirect
> > branch (from GOT). If you have a problem with the two additional function
> > calls, then you can't be using Linux perf in the first place.
>
> You don't want to ever use linux perf in the first place, it's second class.

No it isn't. The interface is more involved than just reading a CSR, and
sure, I would prefer the simple interface that RDCYCLE is, all other things
being equal. But other things are not equal.

Linux perf is in fact *more* accurate, by virtue of not *wrongly* counting
other things. And it does not threaten the security of the entire system, so
it works inside a rented VM or in an unprivileged process.
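To illustrate what "more involved than reading a CSR" amounts to in practice,
here is a bare-bones, stand-alone sketch of the perf route (my own toy code,
not the actual checkasm wrapper; error handling and event setup are
simplified, and it remains subject to perf_event_paranoid):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* No glibc wrapper exists for perf_event_open(), hence the raw syscall. */
static int open_cycle_counter(void)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof (attr));
    attr.size = sizeof (attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;
    attr.exclude_kernel = 1; /* do not count interrupts and other kernel work */
    attr.exclude_hv = 1;

    return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
    uint64_t start, stop;
    int fd = open_cycle_counter();

    if (fd < 0) {
        perror("perf_event_open");
        return 1;
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    read(fd, &start, sizeof (start));
    /* ... code under test goes here ... */
    read(fd, &stop, sizeof (stop));

    printf("%"PRIu64" cycles\n", stop - start);
    close(fd);
    return 0;
}

The read() calls bracketing the measured section are among the per-iteration
system calls mentioned above; their cost is roughly constant, so it can be
measured and subtracted.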
> I don't think it's worth changing the direct inlining we had before. You're
> not interested in whether or not the same exact code is ran between
> platforms,

Err, I am definitely interested in doing exactly that. I don't want to have
to reconfigure and recompile all of FFmpeg just to switch between Linux perf
and the raw cycle counter. On the contrary, I *do* want to compare
performance between vendors once the hardware is available.

> just that the code that's measuring timing is as efficient and
> low overhead as possible.

Of course not. Low overhead is irrelevant here. The measurement overhead is
known and is subtracted. What we need is stable, reproducible overhead and
accurate measurements.

And that's assuming the stuff works at all. You can argue that we should use
the Arm PMU and RISC-V RDCYCLE, and that Linux perf sucks, all you want. PMU
access will just throw a SIGILL and end the checkasm process with zero
measurements (see the sketch at the end of this mail).

The rest of the industry wants to use system calls for informed reasons. I
don't think you, or even the whole FFmpeg project, can win that argument
against OS and CPU vendors.

-- 
Rémi Denis-Courmont
http://www.remlab.net/
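For comparison, the raw-counter path boils down to something like the
following (illustrative only, RV64 assumed, not the checkasm code). On
kernels that disable user-level access to the cycle CSR, the very first
rdcycle traps and the process dies with SIGILL before producing a single
number:

#include <stdint.h>
#include <stdio.h>

static inline uint64_t rdcycle(void)
{
    uint64_t cycles;

    /* Traps with an illegal-instruction exception (SIGILL) when user-level
     * access to the cycle CSR is disabled by the kernel. */
    __asm__ volatile ("rdcycle %0" : "=r" (cycles));
    return cycles;
}

int main(void)
{
    uint64_t start = rdcycle(); /* may never return: SIGILL */
    /* ... code under test goes here ... */
    uint64_t stop = rdcycle();

    printf("%llu cycles\n", (unsigned long long)(stop - start));
    return 0;
}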