## What is this?

It's a draft patch that replaces code like this:
```c
pg_file_unlink(PG_FUNCTION_ARGS)
{
char   *filename;
requireSuperuser();
filename = convert_and_check_filename(PG_GETARG_TEXT_PP(0));
```

With this shorter version:
```c
pg_file_unlink(PG_FUNCTION_ARGS)
{
requireSuperuser();
char      *filename = convert_and_check_filename(PG_GETARG_TEXT_PP(0));
```

## Why would we want this?
1. It removes 23328 lines of code that don't have any impact on how the code 
behaves [1]. This is roughly 2.7% of all the lines of code in the codebase.
2. Declarations are closer to the actual usage. This is advised by the "Code 
Complete" book [2] and has the following advantages:
   a. This limits variable scope to what is necessary. Which in turn makes the 
mental model you have to keep of a function when reading the code simpler.
   b. In most cases it allows you to see the the type of a variable without 
needing to go to the top of the function.
3. You can do input checking and assertions at the top of the function, instead 
of having to put declarations in front of it. This makes it clear which 
invariants hold for the function.  (as seen in the example above and the 
changes for pg_file_rename[3])
4. Declaring variables after statements is allowed in C99. Postgres already 
requires C99, so it might as well use this feature too.

## How was this changeset created?

I created a script that modifies all C files in the following way:

1. Find a non `static` declaration.
2. If it has an initializer and it is not a single variable/constant, don't 
consider replacing it. (reason: a function call initializer might have 
sideffects).
3. Otherwise (it's a simple initializer or it has no initializer at all), take 
the type and variable from that declaration.
4. Find the next use of the variable.
5. If the next use is not an assignment, don't do anything (the value of the 
original initialization is used).
6. If the next use is an assignment:
    1. Remove the old declaration
    2. Prefix the found assignment with the type
    3. Unless the variable is also used in the same line of the new 
initialization, e.g:
      ```c
      int a = 1;
      a = a + a;
      ```

## How does this script work?

It uses a Perl regex to search and replace! (obligatory jokes at the bottom of 
the email) This regex is based on the ones used in this PR to citus[4] and the 
similar PR to pg_auto_failover[5]. The script is in the 3rd commit of this 
patchset.

To visualize the regex in the script in a reasonable way, copy paste the search 
part of it:
```
\n\t(?!(return|static)\b)(?P<type>(\w+[\t ])+[\t *]*)(?>(?P<variable>\w+)( = 
[\w>\s\n-]*?)?;\n(?P<code_between>(?>(?P<comment_or_string_or_not_preprocessor>\/\*.*?\*\/|"(?>\\"|.)*?"|(?!goto)[^#]))*?)(\t)?(?=\b(?P=variable)\b))(?<=\n\t)(?<!:\n\t)(?P=variable)
 =(?![^;]*?[^>_]\b(?P=variable)\b[^_])
```

And paste that into https://www.debuggex.com/, then select PCRE from the 
dropdown. (Sharing seems to be down at this point, so this is the only way to 
do it at the moment) Try it out! The regex is not as crazy as it looks.

There's two important assumptions this regex makes:
1. Code is indented using tabs, and the intent level determines the scope. 
(this is true, because of `pgindent`)
2. Declared variables are actually used. (this is true, because we would get 
compiler warnings otherwise)

There's two cases where this regex has some special behaviour:
1. Stop searching for the first usage of a variable when either a `goto` or a 
preprocessor command is found (outside a string or comment). These things 
affect the control flow in a way that the regex does not understand. (any `#` 
character is assumed to be a preprocessor commands).
2. Don't replace if the assignment is right after a `label:`, by checking if 
there was a `:` as the last character on the previous line. This works around 
errors like this:
```
hashovfl.c:865:3: error: a label can only be part of a statement and a 
declaration is not a statement
   OffsetNumber maxroffnum = PageGetMaxOffsetNumber(rpage);
   ^~~~~~~~~~~~
```
Detecting this case in this way is not perfect, because sometimes there is an 
extra newline or a comment between the label and the assignment. This is not 
easily detectable by the regex, because lookbehinds cannot have a variable 
length in Perl (or most regex engines for that matter). For these few cases (5 
in the entire codebase) a manual change was done either before or after the 
automatic replacement to fix them so the code compiles again. (2nd and 5th 
commits of this patchset)

After all these changes `make -s world` doesn't show any warnings or errors and 
`make check-world` succeeds. I configured compilation like this:
```
./configure --enable-cassert --enable-depend --enable-debug --with-openssl 
--with-libxml --with-libxslt --with-uuid=e2fs --with-icu
```

## What I want to achieve with this email

1. For people to look at a small sample of the changes made by this script. I 
will try to send patches with the actual changes made to the code as 
attachments in a followup email. However, I don't know if that will succeed 
since the patch files are very big and it might run into some mailserver 
limits. So in case that doesn't work, the full changeset is available on 
Github[6]. This make viewing this enormous diff quite a bit easier IMHO 
anyaway. If you see something weird that is not covered in the "Known issues" 
section below, please share it. That way it can be discussed and/or fixed.
2. A discussion on if this type of code change would be a net positive for 
Postgres codebase. Please explain clearly why or why not.
3. Some links to resources on what's necessary to get a big refactoring patch 
like this merged.

## What I don't want to achieve with this email

1. For someone to go over all the changes right now. There's likely to be 
changes to the script or something else. Doing a full review of the changes 
would be better saved for later during a final review.

## Known issues with the currently generated code

There's a few issues with the final generated code that I've already spotted. 
These should all be relatively easy to fix in an automated way. However, I 
think doing so is probably better done by `pgindent` or some other auto 
formatting tool, instead of with the regex. Note that I did run `pgindent`, it 
just didn't address these things:

1. Whitespace between type and variable is kept the same as it was before 
moving the declaration. If the original declaration had whitespace added in 
such a way that all variable names of declarations aligned, then this 
whitespace is preserved. This looks weird in various cases. See [3] for an 
example.
2. `pgindent` adds a newline after each block of declarations, even if they are 
not at the start of function. If this is desired is debatable, but to me it 
seems like it adds excessive newlines. See [7] for an example.
3. If all declarations are moved away from the top of the function, then an 
empty newline is kept at the top of the function. This seems like an 
unnecessary newline to me. See [3] for an example.

Sources:
1. `tokei`[8] results of lines of code:
Before:
```
$ tokei --type C
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C                    1389      1361309       866514       330933       163862
===============================================================================
 Total                1389      1361309       866514       330933       163862
===============================================================================
```
After:
```
$ tokei --type C
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C                    1389      1347515       843186       330952       173377
===============================================================================
 Total                1389      1347515       843186       330952       173377
===============================================================================
```
2. 
https://github.com/mgp/book-notes/blob/master/code-complete.markdown#103-guidelines-for-initializing-variables
3.
```patch
@@ -401,11 +389,10 @@ pg_file_rename_internal(text *file1, text *file2, text 
*file3)
 Datum
 pg_file_unlink(PG_FUNCTION_ARGS)
 {
- char   *filename;

  requireSuperuser();

- filename = convert_and_check_filename(PG_GETARG_TEXT_PP(0));
+ char   *filename = convert_and_check_filename(PG_GETARG_TEXT_PP(0));

  if (access(filename, W_OK) < 0)
```
4. https://github.com/citusdata/citus/pull/3181
5. https://github.com/citusdata/pg_auto_failover/pull/266
6. https://github.com/JelteF/postgres/pull/2
7.
```patch
@@ -2863,7 +2886,8 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber 
blocknum)
  * longer than we must.
  */
  Buffer buffer = ReadBufferExtended(state->rel, MAIN_FORKNUM, blocknum, 
RBM_NORMAL,
- state->checkstrategy);
+ state->checkstrategy);
+
  LockBuffer(buffer, BT_READ);

  /*
```
8. https://github.com/XAMPPRocky/tokei


Obligatory jokes:
1. https://xkcd.com/1171/
2. https://xkcd.com/208/
3. https://stackoverflow.com/a/1732454/2570866 (and serious response to it 
https://neilmadden.blog/2019/02/24/why-you-really-can-parse-html-and-anything-else-with-regular-expressions/)

Attachment: 0002-Move-comment-to-make-refactor-script-work.patch
Description: 0002-Move-comment-to-make-refactor-script-work.patch

Attachment: 0003-Add-script-to-remove-useless-declarations.patch
Description: 0003-Add-script-to-remove-useless-declarations.patch

Attachment: 0001-Allow-declaration-after-statement.patch
Description: 0001-Allow-declaration-after-statement.patch

Reply via email to