On 6/4/24 04:45, Hongyi Zhao wrote:
> Hi there,
> 
> I have a question regarding the best practices for setting environment
> variables when executing commands in Unix-like operating systems.
> Specifically, I would like to understand the differences and use cases
> for using the env command versus directly setting the environment
> variable before a command.

By "directly" do you mean in the shell or in a C program?

Running a new program boils down to an execve() syscall, which takes char
*envp[] as its third argument, which is an array of "key=value" strings
terminated by a NULL pointer. Under the covers, that's how each new executable
gets its environment strings. So basically there's a second argv[] for
environment variables.

In C, envp is actually the third argument to main(), which most programs ignore.
The entry point of ELF programs is actually _start() which does some setup
before calling main(), and one of the things it does is copy envp into the
global variable "environ", so even if main() ignored envp you can still get at
it through that global.

There are several different "man 3 exec" wrappers, which all boil down to a "man
2 execve" syscall invocation. Any wrapper that doesn't require you to provide
envp as an argument is using "environ" for it.

In Linux, both char **argv and char **envp actually point into the stack
segment, where the kernel copied a bunch of data during setup (the argv[]
pointer array with its NULL terminator, a bunch of NUL-terminated strings, the
envp[] pointer array, and more strings). They're basically local variables in a
function call before main(), set up by the kernel when allocating the process's
initial stack space.
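
If you want to look at that block from the shell: on Linux (and this is
Linux-specific, other Unixes won't necessarily have it) the kernel exposes a
process's initial environment strings as /proc/PID/environ, still as
NUL-terminated "key=value" strings. A quick sketch, where FOO and BAZ are just
made-up names and /bin/cat is assumed to exist at that path:

```shell
# Start a child with exactly two variables and have it dump its own
# initial environment block. tr turns the NUL terminators into newlines.
raw_env=$(env -i FOO=bar BAZ=qux /bin/cat /proc/self/environ | tr '\0' '\n')
printf '%s\n' "$raw_env"
```

You should see one "key=value" string per line, and nothing else, because
env -i started from an empty array.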

Because of this, replacing environment variables in a long-running program gets
a little complicated. The stack space that argv[] and envp[] live in is
writeable, but it's not managed by any allocator. Meaning you can't free() any
of the envp[] variables you inherited because they're stack space, not heap
space. Meaning when you replace a variable in environ[] it generally leaks the
old one, which is annoying if it was _already_ replaced with malloc() memory.
(You just leaked memory.) And if you need to expand the environ[] array itself
you can't do it in place, you need to allocate a new copy of the pointer array
and leak the old one. (Except again, if you've already _done_ that you'd be
leaking malloc() memory if you don't free() it...)

Some libc implementations make an allocator that tracks or figures out when
environ[] entries (and the array[] itself) are stack space and which ones have
already been replaced, and lets you do lots of on the fly updates without
leaking memory. And some just don't bother on the theory that it doesn't happen
enough to care.

> I have observed that both methods can be used to set environment
> variables for command execution. For example:
> 
> Using env command:
> 
> env PATH="/custom/path:$PATH" my_command

Which is going to assemble a new envp[] array and pass it to exec().

> Directly setting the environment variable:
> 
> PATH="/custom/path:$PATH" my_command

This is shell prefix assignment. It _also_ creates a new (temporary) envp[]
array to pass to the new child process's exec, but the process is a bit more
complicated: the shell has to check which variables have the "export" flag set,
and so on. (Yes, you can have an exported local variable that's otherwise only
visible in your current function.)

You can think of prefix assignment like a transparent function call:

potato() {
  local PATH="/custom/path:$PATH"
  export PATH
  my_command
}
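
A small sketch of the "temporary" part when the command is an external
program. DEMO_GREETING is a made-up name, and I'm assuming it isn't already in
your environment:

```shell
DEMO_GREETING=outer    # plain shell variable, never exported

# The prefix assignment is exported to this one child, then discarded.
DEMO_GREETING=inner /bin/sh -c 'echo "child saw: $DEMO_GREETING"'

# The shell's own copy is untouched...
echo "shell has: $DEMO_GREETING"

# ...and still not exported, so a later child doesn't see it at all.
/bin/sh -c 'echo "later child saw: ${DEMO_GREETING-unset}"'
```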

> My Questions
> 
> Functional Differences:
> 
> Are there any functional differences between these two methods in
> terms of how the environment variables are set and utilized by
> my_command?

As far as your command is concerned, exec received an array of strings which the
kernel copied onto the stack, and main() got a pointer to the start of it as one
of its arguments.
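
A quick way to convince yourself, as a sketch with a made-up variable name
(DEMO_COLOR):

```shell
# Same child program, same variable, set two different ways.
via_env=$(env DEMO_COLOR=teal /bin/sh -c 'echo "$DEMO_COLOR"')
via_prefix=$(DEMO_COLOR=teal /bin/sh -c 'echo "$DEMO_COLOR"')

if [ "$via_env" = "$via_prefix" ]; then
  echo "child saw the same value both ways: $via_env"
fi
```

The one difference on the calling side is that env is itself an external
program that execs its argument, so it can only launch something it can find on
disk, not a shell function or builtin. Prefix assignment is handled by the
shell and works in front of either.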

> Best Practices:
> 
> In what scenarios would it be more appropriate to use env versus
> directly setting the environment variable? Are there specific
> advantages or disadvantages associated with each method?

The shell doesn't have an easy way to do env -i, and doing env -u pretty much
requires a function call.
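
For reference, here's roughly what those two env options do, as a sketch.
DEMO_VAR is a made-up name, and note that -u isn't in POSIX, although the
common implementations (GNU coreutils, the BSDs) have it:

```shell
export DEMO_VAR=present

# -u removes one variable from the child's environment.
with_u=$(env -u DEMO_VAR /bin/sh -c 'echo "${DEMO_VAR-unset}"')

# -i starts the child from an empty environment. PATH is gone too,
# hence the absolute path to the shell.
with_i=$(env -i /bin/sh -c 'echo "${DEMO_VAR-unset}"')

echo "env -u: $with_u"
echo "env -i: $with_i"
```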

You CAN do it in bash, because local variables have whiteouts:

  $ x=abc; y() { local x; unset x; echo x1=$x;}; y; echo x2=$x
  x1=
  x2=abc

But I dunno how you'd do that _without_ an explicit function context. Prefix
assignment doesn't really provide a syntax for it.
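
One thing that may do the job without a function is a subshell, at the cost
of an extra fork: the unset happens in the forked copy of the shell and never
reaches the parent. A sketch with a made-up variable name (DEMO_X):

```shell
export DEMO_X=abc

# Parentheses run the list in a subshell, so the unset stays in there.
(unset DEMO_X; /bin/sh -c 'echo "child: ${DEMO_X-unset}"')

echo "parent: $DEMO_X"
```

If the extra process bothers you, putting "exec" in front of the command
inside the parentheses lets the subshell replace itself with it.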

You can set an environment variable to _blank_ with prefix assignment, but
that's not the same as UNSET:

  $ env -i X= env
  X=

Which is why bash has the optional : in its variable expansions, to treat
"unset" and "blank" the same. An example using the ${X-default} expansion,
first without the colon and then with it:

  $ x(){ echo x=${X-abc}; }; x; X=xyz x; X= x
  x=abc
  x=xyz
  x=
  $ x(){ echo x=${X:-abc}; }; x; X=xyz x; X= x
  x=abc
  x=xyz
  x=abc

> Complex Commands:
> 
> When dealing with more complex commands or scripts, does one method
> provide better readability or maintainability over the other?

I believe the "env" command was invented long before shell prefix assignment,
and it can still do things the shell doesn't provide an easy syntax for, but for
basic temporary variable assignment either should work.

The env feature I keep wanting is -i whitelisting à la "env -I PATH,TERM,HOME" so
I can clear everything EXCEPT a list of variables I want to keep the parent
definition of. Right now I have to go env -i PATH="$PATH" TERM="$TERM"
HOME="$HOME" and so on, which gets tedious...
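
In the meantime a shell function can fake it. This is only a sketch: env_keep
is a name I made up (it is not a real env option), it reads the shell's
current values whether or not they're exported, and it uses names/name/is_set/
value as scratch variables, so don't try to whitelist those.

```shell
# env_keep NAME1,NAME2,... command [args...]
# Run command with an environment containing only the listed variables.
# The body is in parentheses, so it runs in a subshell and nothing leaks out.
env_keep() (
  names=$1
  shift
  IFS=,
  for name in $names; do
    # Refuse anything that isn't a plain identifier before handing it to eval.
    case $name in
      ''|[0-9]*|*[!A-Za-z0-9_]*) echo "env_keep: bad name: $name" >&2; exit 2 ;;
    esac
    eval "is_set=\${$name+x} value=\${$name-}"
    # Variables that are unset stay unset instead of becoming blank.
    if [ -n "$is_set" ]; then
      set -- "$name=$value" "$@"
    fi
  done
  exec env -i "$@"
)
```

Used like: env_keep PATH,TERM,HOME my_command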

Rob
