Hi David, Sorry for such late reply. I've been busy with classes and exams.
As the contributor applications are opening, I would like to put forward a proposal for a medium project for extending the static analyzer to work with POSIX file descriptor APIs. As evident in this thread, I've been interested in this project for quite some time, and have in fact written some proof-of-concept code. I'm pretty much familiar with the parts of the code base which are relevant for this project and am confident in completing it in given time frame. Do you have any inputs on how to start writing the proposal? Should I start by reading the past year proposals? Regards, Immad MIr On Mon, Feb 14, 2022 at 4:35 AM David Malcolm <dmalc...@redhat.com> wrote: > On Sun, 2022-02-13 at 17:57 -0500, David Malcolm wrote: > > On Sun, 2022-02-13 at 21:16 +0530, Mir Immad wrote: > > > Hi, > > > > > > I wanted some clarification on bifurcating the exploded graph at > > > call > > > to > > > open(). > > > Should the analyzer warn for code like this "when open fails" (like > > > strchr > > > does when 'strchr' returns NULL) > > > > > > int fd = open("NOFILE", O_RDONLY); > > > write(fd, "a", 1); > > > > > > because of the bad file descriptor. > > > unless it is written like this: > > > if (!errno) > > > write(fd, "a", 1); > > > > "open" returns a non-negative integer on success, and returns -1 and > > sets errno on an error. As I understand it, errno is never cleared > > "modified" would be a better word than "cleared" here, sorry > > Dave > > > unless an error occurs, so errno could already be set due to a prior > > failure. Hence I believe the correct way to write such code is to > > check fd for being -1, rather than checking errno. > > > > It's probably best to bifurcate the state at the open call, into > > "success" and "failure" outcomes, so that we can track that the > > caller > > "owns" the filedescriptor on success, and thus we can complain about > > fd > > leaks on paths that discard the value without closing it. > > > > We can then warn if we see -1 passed as the fd value into "write", > > since that implies that the open call failed. If we bifurcate the > > state at the open call, then for: > > > > int fd = open("NOFILE", O_RDONLY); > > write(fd, "a", 1); > > > > we'd have the execution paths: > > > > (a) "when 'open' succeeds" (setting fd to a conjured_svalue >=0, and > > setting sm-state on that value) > > > > (b) "when 'open' fails" (setting fd to -1, and errno to a > > conjured_svalue known to be nonzero) > > > > and at the call to "write" on path (b) we see the -1 passed to it, > > and > > can complain accordingly. > > > > I'm not sure if we also want to track that the file was O_RDONLY > > (which > > suggests that the "write" should fail), but that would be a nice > > bonus. > > > > Hope this is helpful > > Dave > > > > > > > > > > > > Thank you. > > > > > > On Tue, Feb 1, 2022 at 8:28 PM Mir Immad <mirimnan...@gmail.com> > > > wrote: > > > > > > > I worked around the leak detection and also handled the case > > > > where > > > > the fd > > > > is not saved. I wonder why sm-file hasn't implemented it yet. I'm > > > > attaching > > > > a text file with analyzer warnings. I'm still on gcc-11.2.0, will > > > > move to > > > > v12 next thing. > > > > > > > > > I wonder if it's worth checking for attempts to write to a fd > > > > > that > > > > > was > > > > > opened with O_RDONLY, or the converse? > > > > > > > > Yes, it does not make sense to warn for write on read-only closed > > > > fd > > > > or > > > > converse. But, it does make sense to warn for write operation on > > > > valid > > > > read-only fd. For this, we might have to analyze calls to the > > > > open() > > > > carefully -- get the literal value of the second argument to the > > > > call > > > > and > > > > do bit-mask decomposition on it to find if O_RDONLY or O_WRONLY > > > > are > > > > in it. > > > > We might have to maintain separate vectors for these and check > > > > for > > > > them in > > > > calls to write and read -- warn if write is being called on read- > > > > only > > > > fd > > > > and converse. How do we equate them? Does gimple store unique ids > > > > for > > > > each > > > > tree? or something like SAME_TREE? Just thinking out loud. > > > > > > > > > how much state does it make sense to track for a fd? > > > > Sorry, I didnt get it. Do you mean how may states I'm using? > > > > unchecked, > > > > closed, null and leak_by_not_saved (for the call-tree that does > > > > not > > > > have a > > > > lhs). > > > > > > > > > Also, at some point, we're going to have to handle "errno" - > > > > > but > > > > > given > > > > > that might be somewhat fiddly it's OK to defer that until > > > > > you're > > > > > more > > > > > familiar with the code > > > > > > > > I'm looking forward to figure it out. > > > > > > > > Thank you. > > > > > > > > > > > > On Sat, Jan 29, 2022 at 11:09 PM David Malcolm < > > > > dmalc...@redhat.com> > > > > wrote: > > > > > > > > > On Sat, 2022-01-29 at 20:22 +0530, Mir Immad wrote: > > > > > > Thank you for the detailed information. > > > > > > > > > > > > I've been looking into the integer posix file descriptor > > > > > > APIs > > > > > > and I > > > > > > decided to write proof-of-concept checker for them. (not > > > > > > caring > > > > > > about > > > > > > errno). The checker tracks the fd returned by open(), warns > > > > > > if > > > > > > dup() > > > > > > is > > > > > > called with closed fd otherwise tracks the fd returned by > > > > > > dup(), > > > > > > it > > > > > > also > > > > > > warns if read() and write() functions were called on closed > > > > > > fd. > > > > > > I'm > > > > > > attaching a text file that lists some c sources and warnings > > > > > > by > > > > > > the > > > > > > static > > > > > > analyzer. I've used the diagnostic meta-data from sm-file. Is > > > > > > this > > > > > > something that could also be added to the analyzer? > > > > > > > > > > This looks great, and very promising as both new functionality > > > > > for > > > > > GCC > > > > > 13, and as a GSoC 2022 project. > > > > > > > > > > BTW, it looks like you're working with GCC 11, but the analyzer > > > > > has > > > > > changed quite a bit on trunk for GCC 12, so it's worth trying > > > > > to > > > > > track > > > > > trunk. > > > > > > > > > > I wonder if it's worth checking for attempts to write to a fd > > > > > that > > > > > was > > > > > opened with O_RDONLY, or the converse? (I'm not sure, just > > > > > thinking > > > > > aloud - how much state does it make sense to track for a fd?). > > > > > > > > > > Also, at some point, we're going to have to handle "errno" - > > > > > but > > > > > given > > > > > that might be somewhat fiddly it's OK to defer that until > > > > > you're > > > > > more > > > > > familiar with the code. > > > > > > > > > > > > > > > > > About the fd leak, that's the next thing I'll try to get > > > > > > working. > > > > > > Since > > > > > > you've mentioned that it could be a GSoC project, this is > > > > > > what > > > > > > I'm > > > > > > going to > > > > > > focus on. > > > > > > > > > > Excellent. > > > > > > > > > > Let me know (via this mailing list) if you have any questions. > > > > > > > > > > Thanks > > > > > Dave > > > > > > > > > > > > > > > > > Regards. > > > > > > > > > > > > > > > > > > > > > > > > On Wed, Jan 26, 2022 at 7:56 PM David Malcolm < > > > > > > dmalc...@redhat.com> > > > > > > wrote: > > > > > > > > > > > > > On Mon, 2022-01-24 at 01:41 +0530, Mir Immad wrote: > > > > > > > > Hi, sir. > > > > > > > > > > > > > > > > I've been trying to understand the static analyzer's > > > > > > > > code. I > > > > > > > > spent most > > > > > > > > of > > > > > > > > my time learning the state machine's API. I learned how > > > > > > > > state > > > > > > > > machine's > > > > > > > > on_stmt is supposed to "recognize" specific functions and > > > > > > > > how > > > > > > > > on_transition > > > > > > > > takes a specific tree from one state to another, and how > > > > > > > > the > > > > > > > > captured > > > > > > > > states are used by pending_diagnostics to report the > > > > > > > > errors. > > > > > > > > Furthermore, I > > > > > > > > was able to create a dummy checker that mimicked the > > > > > > > > behaviour of > > > > > > > > sm- > > > > > > > > file's > > > > > > > > double_fclose and compile GCC with these changes. Is this > > > > > > > > the > > > > > > > > right way > > > > > > > > of > > > > > > > > learning? > > > > > > > > > > > > > > This sounds great. > > > > > > > > > > > > > > > > > > > > > > > As you've mentioned on the projects page that you would > > > > > > > > like > > > > > > > > to > > > > > > > > add > > > > > > > > more > > > > > > > > support for some POSIX APIs. Can you please write (or > > > > > > > > refer > > > > > > > > me to > > > > > > > > a) a > > > > > > > > simple C program that uses such an API (and also what the > > > > > > > > analyzer > > > > > > > > should > > > > > > > > have done) so that I can attempt to add such a checker to > > > > > > > > the > > > > > > > > analyzer. > > > > > > > > > > > > > > A couple of project ideas: > > > > > > > > > > > > > > (i) treat data coming from a network connection as tainted, > > > > > > > by > > > > > > > somehow > > > > > > > teaching the analyzer about networking APIs. Ideally: look > > > > > > > at > > > > > > > some > > > > > > > subset of historical CVEs involving network-facing attacks > > > > > > > on > > > > > > > user- > > > > > > > space daemons, and find a way to detect them in the > > > > > > > analyzer > > > > > > > (need > > > > > > > to > > > > > > > find a way to mark the incoming data as tainted, so that > > > > > > > the > > > > > > > analyer > > > > > > > "know" about the trust boundary - that the incoming data > > > > > > > needs > > > > > > > to > > > > > > > be > > > > > > > sanitized and treated with extra caution; see > > > > > > > > https://gcc.gnu.org/pipermail/gcc-patches/2021-November/584372.html > > > > > > > for my attempts to do this for the Linux kernel). > > > > > > > > > > > > > > Obviously this is potentially a huge project, so maybe just > > > > > > > picking > > > > > > > a > > > > > > > tiny subset and getting that working as a proof-of-concept > > > > > > > would be > > > > > > > a > > > > > > > good GSoC project. Maybe find an old CVE that someone has > > > > > > > written > > > > > > > a > > > > > > > good write-up for, and think about "how could GCC/- > > > > > > > fanalyzer > > > > > > > have > > > > > > > spotted it?" > > > > > > > > > > > > > > (ii) add leak-detection for POSIX file descriptors: i.e. > > > > > > > the > > > > > > > integer > > > > > > > values returned by "open", "dup", etc. It would be good to > > > > > > > have a > > > > > > > check that the user's code doesn't leak these values e.g. > > > > > > > on > > > > > > > error- > > > > > > > handling paths, by failing to close a file-descriptor (and > > > > > > > not > > > > > > > storing > > > > > > > it anywhere). I think that much of this could be done by > > > > > > > analogy > > > > > > > with > > > > > > > the sm-file.cc code. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Also, I didn't realize the complexity of adding SARIF > > > > > > > > when I > > > > > > > > mentioned > > > > > > > > it. > > > > > > > > I'd rather work on adding more checkers. > > > > > > > > > > > > > > Fair enough. > > > > > > > > > > > > > > Hope this above is constructive. > > > > > > > > > > > > > > Dave > > > > > > > > > > > > > > > > > > > > > > > Regards. > > > > > > > > > > > > > > > > Mir Immad > > > > > > > > > > > > > > > > On Sun, Jan 23, 2022, 11:04 PM Mir Immad < > > > > > > > > mirimnan...@gmail.com> > > > > > > > > wrote: > > > > > > > > > > > > > > > > > Hi Sir, > > > > > > > > > > > > > > > > > > I've been trying to understand the static analyzer's > > > > > > > > > code. > > > > > > > > > I > > > > > > > > > spent > > > > > > > > > most of > > > > > > > > > my time learning the state machine's API. I learned how > > > > > > > > > state > > > > > > > > > machine's > > > > > > > > > on_stmt is supposed to "recognize" specific functions > > > > > > > > > and > > > > > > > > > takes > > > > > > > > > a > > > > > > > > > specific > > > > > > > > > tree from one state to another, and how the captured > > > > > > > > > states > > > > > > > > > are > > > > > > > > > used > > > > > > > > > by > > > > > > > > > pending_diagnostics to report the errors. Furthermore, > > > > > > > > > I > > > > > > > > > was > > > > > > > > > able to > > > > > > > > > create > > > > > > > > > a dummy checker that mimicked the behaviour of sm- > > > > > > > > > file's > > > > > > > > > double_fclose and > > > > > > > > > compile GCC with these changes. Is this the right way > > > > > > > > > of > > > > > > > > > learning? > > > > > > > > > > > > > > > > > > As you've mentioned on the projects page that you would > > > > > > > > > like to > > > > > > > > > add > > > > > > > > > more > > > > > > > > > support for some POSIX APIs. Can you please write (or > > > > > > > > > refer > > > > > > > > > me > > > > > > > > > to a) > > > > > > > > > a > > > > > > > > > simple C program that uses such an API (and also what > > > > > > > > > the > > > > > > > > > analyzer > > > > > > > > > should > > > > > > > > > have done) so that I can attempt to add such a checker > > > > > > > > > to > > > > > > > > > the > > > > > > > > > analyzer. > > > > > > > > > > > > > > > > > > Also, I didn't realize the complexity of adding SARIF > > > > > > > > > when > > > > > > > > > I > > > > > > > > > mentioned it. > > > > > > > > > I'd rather work on adding more checkers. > > > > > > > > > > > > > > > > > > Regards. > > > > > > > > > Mir Immad > > > > > > > > > > > > > > > > > > On Mon, Jan 17, 2022 at 5:41 AM David Malcolm < > > > > > > > > > dmalc...@redhat.com> > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > On Fri, 2022-01-14 at 22:15 +0530, Mir Immad wrote: > > > > > > > > > > > HI David, > > > > > > > > > > > I've been tinkering with the static analyzer for > > > > > > > > > > > the > > > > > > > > > > > last > > > > > > > > > > > few > > > > > > > > > > > days. I > > > > > > > > > > > find > > > > > > > > > > > the project of adding SARIF output to the analyzer > > > > > > > > > > > intresting. > > > > > > > > > > > I'm > > > > > > > > > > > writing > > > > > > > > > > > this to let you know that I'm trying to learn the > > > > > > > > > > > codebase. > > > > > > > > > > > Thank you. > > > > > > > > > > > > > > > > > > > > Excellent. > > > > > > > > > > > > > > > > > > > > BTW, I think adding SARIF output would involve > > > > > > > > > > working > > > > > > > > > > more > > > > > > > > > > with > > > > > > > > > > GCC's > > > > > > > > > > diagnostics subsystem than with the static analyzer, > > > > > > > > > > since > > > > > > > > > > (in > > > > > > > > > > theory) > > > > > > > > > > all of the static analyzer's output is passing > > > > > > > > > > through > > > > > > > > > > the > > > > > > > > > > diagnostics > > > > > > > > > > subsystem - though the static analyzer is probably > > > > > > > > > > the > > > > > > > > > > only > > > > > > > > > > GCC > > > > > > > > > > component generating diagnostic paths. > > > > > > > > > > > > > > > > > > > > I'm happy to mentor such a project as I maintain both > > > > > > > > > > subsystems > > > > > > > > > > and > > > > > > > > > > SARIF output would benefit both - but it would be > > > > > > > > > > rather > > > > > > > > > > tangential > > > > > > > > > > to > > > > > > > > > > the analyzer - so if you had specifically wanted to > > > > > > > > > > be > > > > > > > > > > working on > > > > > > > > > > the > > > > > > > > > > guts of the analyzer itself, you may want to pick a > > > > > > > > > > different > > > > > > > > > > subproject. > > > > > > > > > > > > > > > > > > > > The SARIF standard is rather long and complicated, > > > > > > > > > > and we > > > > > > > > > > would > > > > > > > > > > want to > > > > > > > > > > be compatible with clang's implementation. > > > > > > > > > > > > > > > > > > > > It would be very cool if gcc could also accept SARIF > > > > > > > > > > files as > > > > > > > > > > an > > > > > > > > > > *input* format, and emit them as diagnostics; that > > > > > > > > > > might > > > > > > > > > > help > > > > > > > > > > with > > > > > > > > > > debugging SARIF output. (I have a old patch for > > > > > > > > > > adding > > > > > > > > > > JSON > > > > > > > > > > parsing > > > > > > > > > > support to GCC that could be used as a starting point > > > > > > > > > > for > > > > > > > > > > this). > > > > > > > > > > > > > > > > > > > > Hope the above makes sense > > > > > > > > > > Dave > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Jan 11, 2022, 7:09 PM David Malcolm < > > > > > > > > > > > dmalc...@redhat.com> > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > On Tue, 2022-01-11 at 11:03 +0530, Mir Immad via > > > > > > > > > > > > Gcc > > > > > > > > > > > > wrote: > > > > > > > > > > > > > Hi everyone, > > > > > > > > > > > > > > > > > > > > > > > > Hi, and welcome. > > > > > > > > > > > > > > > > > > > > > > > > > I intend to work on the static analyzer. Are > > > > > > > > > > > > > these > > > > > > > > > > > > > documents > > > > > > > > > > > > > enough to > > > > > > > > > > > > > get > > > > > > > > > > > > > started: > > > > > > > > > > > > > https://gcc.gnu.org/onlinedocs/gccint and > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://gcc.gnu.org/onlinedocs/gccint/Analyzer-Internals.html#Analyzer-Internals > > > > > > > > > > > > > > > > > > > > > > > > Yes. > > > > > > > > > > > > > > > > > > > > > > > > There are also some high-level notes here: > > > > > > > > > > > > > > > > > > > > > > > > https://gcc.gnu.org/wiki/DavidMalcolm/StaticAnalyzer > > > > > > > > > > > > > > > > > > > > > > > > Also, given that the analyzer is part of GCC, the > > > > > > > > > > > > more > > > > > > > > > > > > general > > > > > > > > > > > > introductions to hacking on GCC will be useful. > > > > > > > > > > > > > > > > > > > > > > > > I recommend creating a trivial C source file with > > > > > > > > > > > > a > > > > > > > > > > > > bug > > > > > > > > > > > > in it > > > > > > > > > > > > (e.g. > > > > > > > > > > > > a > > > > > > > > > > > > 3-line function with a use-after-free), and > > > > > > > > > > > > stepping > > > > > > > > > > > > through > > > > > > > > > > > > the > > > > > > > > > > > > analyzer to get a sense of how it works. > > > > > > > > > > > > > > > > > > > > > > > > Hope this is helpful; don't hesitate to ask > > > > > > > > > > > > questions. > > > > > > > > > > > > Dave > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >