April 27, 2021

cd is not a program

Or rather, it is not a standalone executable. It is a shell builtin. Let me explain what that means and why it makes a difference.

Note: From now on, when I write shell I mean either bash or fish. I assume that most things are also true for most other shells, but I haven’t actually checked for any of them.

To explain the difference between a shell builtin and a standalone executable, let’s look at two concrete examples and how they are executed from the viewpoint of the shell. Let’s compare

cd some-dir

and

cat some-dir/some-file

cd is a shell builtin, which means that it is a function in the shell code that is exposed to the user. So if we execute cd some-dir, the shell can just call this function with some-dir as an argument. On the other hand, cat is a standalone executable. If we execute cat some-dir/some-file, the shell first tries to find the absolute path to cat using the PATH variable; usually, it is located at /bin/cat. It then creates a new process using the clone or fork system calls, and within the new process calls the execve system call, passing it the absolute path to the command (/bin/cat), the list of arguments (cat and some-dir/some-file), and the environment (more on the environment later). So calling a shell builtin is not only much more lightweight than calling an external executable, the builtin also has access to internals of the shell since it is called within the same process.
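A quick way to see this distinction for yourself is to ask the shell how it resolves each name. This is a minimal sketch, assuming a POSIX-style shell such as bash (fish uses different syntax); the variable names are made up for illustration.

```shell
# command -v prints how the shell would resolve a command name:
# just the name for a builtin, the absolute path for an executable found via PATH.
cd_resolution=$(command -v cd)
cat_resolution=$(command -v cat)

echo "cd  resolves to: $cd_resolution"    # the bare name: handled inside the shell
echo "cat resolves to: $cat_resolution"   # a path such as /bin/cat or /usr/bin/cat
```

In bash, type cd and type cat give a more descriptive answer to the same question.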

The cd builtin is used to change the working directory of the shell. The concept of a working directory exists on the kernel level. Every process has one. On a Linux system you can see it in the /proc file system: /proc/<pid>/cwd is a symbolic link to the working directory of the process with id <pid>. You can also see the working directory of the shell by executing pwd (another shell builtin). The working directory is used by the kernel whenever the process wants to access a file or a directory with a relative path. For example, let’s look at

cat some-dir/some-file

again. When this is executed, cat passes some-dir/some-file to the open system call (see the cat source code). The kernel then takes the current working directory of cat, appends some-dir/some-file, and tries to open that file (the details can be a bit more complicated since there might be symbolic links involved; the man page on path_resolution has the full details).
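The following sketch makes the path resolution visible. It assumes a Linux system with /proc mounted, and it builds its own throwaway directory tree with mktemp; the variable names are only for illustration.

```shell
# Build a throwaway directory tree (the names mirror the example above).
tmp=$(mktemp -d)
mkdir "$tmp/some-dir"
echo "hello" > "$tmp/some-dir/some-file"

cd "$tmp" || exit 1

# The kernel's view of this shell's working directory (Linux only) ...
kernel_cwd=$(readlink "/proc/$$/cwd")
# ... and the shell's own view, with symbolic links resolved.
shell_cwd=$(pwd -P)

# cat inherits the working directory, so both paths name the same file.
relative_read=$(cat some-dir/some-file)
absolute_read=$(cat "$kernel_cwd/some-dir/some-file")

echo "kernel: $kernel_cwd"
echo "shell:  $shell_cwd"
echo "relative read: $relative_read, absolute read: $absolute_read"
```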

When a process is created, it inherits the working directory from its parent. In the example, since cat is created as a child process of the shell, it inherits the shell’s working directory. The shell in turn inherits its working directory from its parent, and so forth, all the way to the init process which has / as a working directory (you can verify this by looking at /proc/1/cwd). If that were all, then every process would have working directory /, which would not be very useful. But a process can change its working directory with the chdir system call. And that’s exactly what cd does! (See for yourself in the bash source code and fish source code.)
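Here is a small sketch of that inheritance chain. It uses sh as a stand-in for an arbitrary child process, and a temporary directory so that it does not depend on any particular path existing.

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1
parent_cwd=$(pwd -P)

# A child process starts out in its parent's working directory ...
child_cwd=$(sh -c 'pwd -P')

# ... and so does a grandchild, as long as nothing in between calls chdir.
grandchild_cwd=$(sh -c 'sh -c "pwd -P"')

echo "parent:     $parent_cwd"
echo "child:      $child_cwd"
echo "grandchild: $grandchild_cwd"
```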

In the cat example, if we execute

cd /var && cat log/syslog

then cd /var changes the working directory of the shell to /var. Afterwards, the shell creates cat as a child process and passes log/syslog as an argument to it; cat takes this argument and passes it to open. cat inherited /var as a working directory from the shell, so when the kernel sees the relative path log/syslog passed to open, it prepends the working directory and tries to open /var/log/syslog. In other words, this inheritance of the working directory means that everything magically works just as expected.

Note that a process can only change its own working directory. It has no control over the working directory of any other process (apart from determining the directory that its child processes start in). That is why cd cannot be a standalone executable. The chdir call has to be executed within the shell process. And that’s why cd must be a shell builtin. (Note that this isn’t the case for all shell builtins; some of them could also be standalone executables. For example pwd is a shell builtin, but it wouldn’t have to be. It could also be a standalone executable that reads the working directory of its parent process. In fact, there is the pwdx executable which shows the working directory of any process.)
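To illustrate the point about pwd, here is a sketch of a child process that reads its parent's working directory. It assumes Linux with /proc mounted; the child first moves itself to /, so the path it reports cannot be its own working directory.

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1
own_cwd=$(pwd -P)

# The child changes its own working directory to / and then looks up the
# working directory of its parent through /proc (Linux only).
seen_by_child=$(sh -c 'cd / && readlink "/proc/$PPID/cwd"')

echo "shell is in:        $own_cwd"
echo "child reports that: $seen_by_child"
```

Reading works across process boundaries; changing does not, which is the asymmetry that forces cd to be a builtin.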

So why does it make a difference whether cd is a builtin or a standalone executable? Let me give three examples.

Unnecessary `cd` in shell scripts

I occasionally see shell scripts which have a structure like this:

#!/bin/bash

cur_loc=$(pwd)

# ... main part of the script which includes some calls to cd ...

# return to original subdirectory
cd "$cur_loc"

The intention is that when the script is executed from the shell, we don’t end up in some random directory once the script returns. But this is completely unnecessary. When the script is executed, a new bash process is created which has its own working directory. Any cd in the script will only affect the working directory of the new bash process; the parent process (our interactive shell from which we executed the script) is unaffected. So the last cd is essentially a no-op: it changes the working directory of the bash process, and immediately afterwards the process ends. It is not harmful either; but usually less code is better, especially if the code in question has no purpose, so we should just remove a trailing cd.
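This is easy to check. The sketch below writes a tiny script (the file name wander.sh is made up) that changes its directory and never changes back, runs it as a separate process, and compares the caller's working directory before and afterwards.

```shell
tmp=$(mktemp -d)
script="$tmp/wander.sh"   # hypothetical script name

# A script that wanders off to / and does not return.
cat > "$script" <<'EOF'
cd / || exit 1
# ... main part of the script ...
pwd
EOF

cd "$tmp" || exit 1
before=$(pwd -P)
script_saw=$(sh "$script")   # the script reports its own working directory
after=$(pwd -P)

echo "script ended up in: $script_saw"
echo "caller before: $before"
echo "caller after:  $after"
```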

Shortcuts to change a directory

Sometimes there might be a directory that you have to go to frequently. Especially if it is deeply nested, typing this out every time might be tedious. A tedious task in the shell is often something that can be simplified with a shell script. But we just saw that changing directories with a shell script is not possible. We have to execute the cd within the current process of the shell. This is a good candidate for an alias. For example, if we realize that we often go to the directory /usr/share/fish/completions, we can add an alias to the shell init files, like alias completions="cd /usr/share/fish/completions".
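A shell function is a close cousin of the alias, since its body also runs inside the current shell process. The sketch below uses a function only because bash does not expand aliases in non-interactive scripts by default; the function name and the directory are hypothetical stand-ins.

```shell
# Stand-in for a deeply nested directory that you visit often.
target=$(mktemp -d)/deeply/nested/dir
mkdir -p "$target"

# The function body runs inside the current shell process, just like an
# alias expansion, so the cd affects the shell that calls it.
goto_target() {
    cd "$target" || return
}

goto_target
landed=$(pwd -P)
echo "now in: $landed"
```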

The `CDPATH` variable

CDPATH is a shell feature that changes the behavior of cd. By default, if we execute

cd some-dir

then the shell will look for the directory some-dir in its current working directory. If the directory exists, the shell changes its working directory to this subdirectory, otherwise it returns an error. This behavior can be changed with the CDPATH variable. If it is nonempty, its contents are interpreted as a list of directories, and those are used as the search path instead of the current working directory.

For example, assume that CDPATH is set to /usr/local:/var/local. If we now execute cd some-dir, then the shell will first check whether /usr/local has a subdirectory some-dir, and if so, change its current working directory to /usr/local/some-dir. Otherwise, it checks whether /var/local has a subdirectory some-dir; if so, it changes the current working directory to /var/local/some-dir. If neither lookup succeeds, both bash and fish then check whether some-dir is a subdirectory of the current directory, and change the working directory to this subdirectory if it is (in other words, bash and fish implicitly add . to the end of the CDPATH list). Only if none of those directories exists will cd fail.
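The search order can be tried out safely in a scratch directory. In this sketch the directory names (first, second, here, alpha, beta, gamma) are invented, and each lookup runs inside a command substitution so that every search starts from the same place.

```shell
tmp=$(mktemp -d)
mkdir -p "$tmp/first/alpha" "$tmp/second/alpha" "$tmp/second/beta" "$tmp/here/gamma"
cd "$tmp/here" || exit 1

# A plain (non-exported) shell variable is enough for the cd builtin.
CDPATH="$tmp/first:$tmp/second"

# cd prints the directory it found through CDPATH; that output is discarded
# here so that only the result of pwd is captured.
alpha=$(cd alpha >/dev/null && pwd -P)   # exists in both entries; the first wins
beta=$(cd beta >/dev/null && pwd -P)     # only exists in the second entry
gamma=$(cd gamma >/dev/null && pwd -P)   # not in CDPATH: falls back to the current directory

# A name that exists nowhere makes cd fail.
if (cd no-such-dir) 2>/dev/null; then missing=found; else missing=failed; fi

printf '%s\n' "$alpha" "$beta" "$gamma" "$missing"
```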

I find CDPATH incredibly useful. I keep all my projects in ~/projects and set my CDPATH to .:~/projects. This way, I can get to any of my project directories from anywhere in the file system. Furthermore, fish supports tab completion across CDPATH. So if my current working directory is /var/lib and I type cd mb<tab>, then fish will auto-complete this to cd my-blog and take me to ~/projects/my-blog.

But there is one potential problem with this feature, and that comes from the confusion of shell variables and environment variables (at least I was confused by this).

Environment variables

Just like the working directory, the environment is a concept that is defined on the kernel level. Every process has an environment; on Linux systems you can see it at /proc/<pid>/environ. It is an array of pointers to strings. By convention the strings have the form key=value, although this is not a requirement. Strings of this form are interpreted as environment variables, where key is the name of the variable and value is its value. And just like the working directory, a process inherits the environment from its parent.

If a process executes another program with the execve system call, it can change the environment for the process with the last parameter of the system call (the last e stands for environment, the v stands for argument vector). This is for example how env works. env allows you to execute a program with the environment of your choosing. It will compile the list of environment variables into an array and pass it (together with the program to execute and the arguments) to the execve call.
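As a small illustration of env, the sketch below runs the same command twice: once with the inherited environment plus one extra variable, and once with -i, which starts from an empty environment. The variable names are made up, and the sketch assumes that a shell is available at /bin/sh.

```shell
export OUTER_VAR=outer

# Inherited environment plus one additional variable.
inherited=$(env INNER_VAR=inner /bin/sh -c 'echo "${OUTER_VAR:-unset} ${INNER_VAR:-unset}"')

# env -i starts from an empty environment, so INNER_VAR is the only
# variable that is passed on to the new program.
fresh=$(env -i INNER_VAR=inner /bin/sh -c 'echo "${OUTER_VAR:-unset} ${INNER_VAR:-unset}"')

echo "inherited: $inherited"
echo "fresh:     $fresh"
```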

Now to access the environment, a program will usually use the C standard library (either directly, or through a function in a higher level language that internally uses the C standard library). It provides access to the environment through the environ variable and through the third argument of the main function int main(int argc, char *argv[], char *envp[]) (they initially point at the same array). It also provides the functions getenv, setenv, and putenv, which are used to read and modify environ. These functions do not however modify the environment that the kernel sees (and which we can see through /proc/<pid>/environ). For example, if you use setenv to add a new variable and call fork() to create a child process, the child process will inherit the original environment which does not include the effects of the setenv. In practice this is not really relevant, since the child will also inherit the environ variable which does include the effects of the setenv; so on a C standard library level, inheriting environments between parent and child works as expected. But it shows that there are two slightly different views on the environment.
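The two views can also be observed from a shell, which is a rough analogue of the setenv case. This sketch assumes Linux with /proc mounted and a variable name (LATE_ADDITION) that is not already present in the environment.

```shell
# Add a variable to the shell's exported set after the shell has started.
export LATE_ADDITION=1

# /proc/<pid>/environ is NUL-separated and reflects the environment the
# process was started with, so the new variable does not show up there.
snapshot_hits=$(tr '\0' '\n' < "/proc/$$/environ" | grep -c '^LATE_ADDITION=' || true)

# A child process receives the variable anyway, because the shell passes
# its current list of exported variables along when it executes a program.
child_value=$(sh -c 'echo "${LATE_ADDITION:-unset}"')

echo "matches in /proc/$$/environ: $snapshot_hits"
echo "value seen by a child:       $child_value"
```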

The purpose of environment variables is to control the behavior of programs or libraries. For example, git uses the EDITOR environment variable to determine which editor to use to create commit messages. A common way to define environment variables is to create shell variables and to tell the shell that these variables should be added to the environment of any program that is executed.

Shell variables

Shell variables are like variables in any other programming language. They have a name and a value, and whenever we specify the name somewhere, the shell will replace it with the corresponding value. For example, if we have a shell variable named var with value contents and then execute

echo $var

the shell will replace $var with contents and execute echo contents instead.

Some shell variables are already created when the shell is initialized; for example, bash creates BASH_VERSION, and fish creates FISH_VERSION. We can also create new variables, using NAME=VALUE in bash or set NAME VALUE in fish. And then there is another source for shell variables: the environment. When the shell is initialized, it reads environ and makes its contents available as shell variables. This means that every environment variable is also a shell variable. That’s why we can access the HOME environment variable as $HOME.

Not only will the shell make environment variables available as shell variables during startup, we can also tell the shell to turn shell variables into environment variables for new processes. For a shell variable to become an environment variable for a child process, it needs to be marked as exported (for example, using export VARIABLE in bash or set -x VARIABLE in fish; the variables that come from environ are automatically marked as exported). When we instruct the shell to execute another program (not a shell builtin), it compiles a list of all exported shell variables, and those define the environment for the executed program (it does this by executing the program with the execve system call and passing it the exported shell variables as the environment parameter). So in the example above, if we want to define an EDITOR environment variable for git, we can do so by creating a shell variable EDITOR and then exporting it.
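The effect of exporting is easy to observe with a child process. This is a bash-style sketch (fish would use set and set -x instead); DEMO_EDITOR is a made-up variable name, chosen so that it does not collide with a real EDITOR setting.

```shell
# A plain shell variable: visible to the shell itself ...
DEMO_EDITOR=vim
in_shell=$DEMO_EDITOR

# ... but not part of the environment that a child process receives.
before_export=$(sh -c 'echo "${DEMO_EDITOR:-unset}"')

# Marking it as exported adds it to the environment of every later child.
export DEMO_EDITOR
after_export=$(sh -c 'echo "${DEMO_EDITOR:-unset}"')

echo "in the shell:           $in_shell"
echo "child before export:    $before_export"
echo "child after export:     $after_export"
```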

Environment variables control the behavior of programs and libraries. Similarly, shell variables can control the behavior of the shell and its builtins. For example, the fc bash builtin uses the EDITOR variable to determine which editor to use to edit commands from the history list. This is similar to the git example above. The difference is that fc is a builtin, so EDITOR does not have to be exported. Since fc is a function within bash, it has direct access to the shell variables and doesn’t need environment variables. In fact, after bash has read the environ variable during startup, it will not read any variables from the environment again; it will always use the shell variables instead.

So if we define EDITOR as a (non-exported) shell variable, then fc will see it and git won’t. This might create confusion. To understand whether a variable needs to be exported or not, you need to know whether the command you want to be affected is a shell builtin or a program. It might be tempting to just export all variables so that any command can see them. But this is problematic, which brings us back to CDPATH.

The problem with `CDPATH` as an environment variable

On first thought it might make sense to export the CDPATH variable. There exist similar PATH and MANPATH variables which need to be exported, and even the man page on environ lists CDPATH as an example of an environment variable. But cd is not a program; it is a shell builtin. So as we just saw, it does not read CDPATH from the environment, it uses it as a shell variable. This means that it is not only unnecessary to export CDPATH, doing so can lead to undesired behavior and weird bugs as described in this blog post.

One problem is that when CDPATH is set and cd is called with a relative path that is not . or .., then bash will output the absolute path of the new working directory. This leads to bugs in some bash scripts that try to get the parent directory of the script like this:

#!/bin/bash

SCRIPT_DIR=$(cd "$(dirname "$0")" && pwd)

This is intended to work like this: dirname "$0" prints the (relative) path of the directory containing the script ($0 contains the name of the script that is called); the relative path is passed to cd which changes the working directory of the subshell; then pwd outputs the absolute path of the current directory, which is captured in SCRIPT_DIR. This mostly works, unless the user has defined CDPATH and exported it. Then it becomes an environment variable for any child of the shell. In particular, it is visible to the bash process that executes this script, which means that cd "$(dirname "$0")" might print the absolute path of the script’s directory. Since pwd also prints the absolute path, $SCRIPT_DIR now contains two lines, both containing the path, which will most likely lead to problems further down in the script.

Now this might be seen as a bug in the script which should guard against an exported CDPATH. In fact, this Stack Overflow answer gives a more robust snippet to compute the parent directory of a bash script that guards against an exported CDPATH and a range of other problems (although I think in a lot of cases it would be enough to get a relative path, so the snippet above could be replaced with SCRIPT_DIR=$(dirname "$0")). But the truth is that not all shell scripts are written in a robust way, which makes a CDPATH environment variable problematic.
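The doubled output is easy to reproduce in a scratch directory. The sketch below simulates an inherited CDPATH and then shows one possible guard, an empty CDPATH for that single cd invocation; this guard is my own illustration and not necessarily what the linked answer uses. The directory name scripts is a stand-in for the output of dirname "$0".

```shell
tmp=$(mktemp -d)
mkdir "$tmp/scripts"
cd "$tmp" || exit 1

# Simulate a CDPATH that the script inherited from the user's environment.
CDPATH="$tmp"

# cd finds "scripts" through CDPATH and prints the directory, then pwd
# prints it again: the captured value has two lines.
broken=$(cd "scripts" && pwd)
broken_lines=$(printf '%s\n' "$broken" | grep -c '')

# One possible guard: an empty CDPATH for this single cd invocation.
guarded=$(CDPATH= cd -- "scripts" && pwd)
guarded_lines=$(printf '%s\n' "$guarded" | grep -c '')

echo "unguarded capture has $broken_lines lines"
echo "guarded capture has $guarded_lines line"
```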

Another problem comes with non-existing directories. ShellCheck warns against a line in a shell script that’s a single cd <somedir> and advises to change it to cd <somedir> || exit. Otherwise, there might be dangerous consequences if <somedir> does not exist; for example some later line might be something like rm *. The || exit ensures that the shell script does not continue if the directory does not exist. But a CDPATH environment variable can defeat the || exit guard: if <somedir> does not exist in the working directory of the script but it does exist somewhere in the CDPATH, then the cd will still succeed, and the rest of the script will run in that other directory, where it might remove files or do other destructive or at least unintended things.
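Here is a harmless reproduction of the defeated guard. The directory names (work, elsewhere, build) are invented, and instead of deleting anything the sketch only records where the script would have ended up.

```shell
tmp=$(mktemp -d)
mkdir -p "$tmp/work" "$tmp/elsewhere/build"

# Simulate an exported CDPATH that the script inherited.
CDPATH="$tmp/elsewhere"

# "build" does not exist in $tmp/work, yet cd succeeds through CDPATH, so
# the || exit guard never fires and the script carries on somewhere else.
with_cdpath=$(cd "$tmp/work" && { cd build >/dev/null || exit; } && pwd -P)

# Without CDPATH the same cd fails, and the guard would stop the script.
without_cdpath=$(unset CDPATH; cd "$tmp/work" && cd build 2>/dev/null && pwd -P || echo "cd failed")

echo "with CDPATH:    $with_cdpath"
echo "without CDPATH: $without_cdpath"
```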

So CDPATH should not be exported. It is enough to define it as a shell variable, and since cd is a shell builtin, it will be able to pick it up. Additionally, it should be kept out of reach of shell scripts. In bash, this can be accomplished by defining it in .bashrc which is only read by interactive shells. In fish, the config files are also read by fish scripts, so here it is necessary to guard the CDPATH definition as follows.

if status --is-interactive
  set CDPATH <list of paths>
end

Other shell variables

CDPATH is an especially problematic instance of a shell variable that shouldn’t be an environment variable, but there are other ones as well. For example, bash uses PS1 to control the prompt. This variable only makes sense in interactive shells. In fact, a common test to check whether bash is run interactively is to check whether $PS1 is defined. So this variable should not be exported either; but a quick search on GitHub shows that it is not uncommon for it to be exported in bashrcs. Other variables like HISTSIZE are not actively harmful when they are exported, but they pollute the environment of child processes (this is more of an aesthetic issue, like the trailing cd in shell scripts).

So I think a good rule of thumb is to not export any variables, unless you are sure that they are used by another program or library.

Useful resources

I found it very helpful to look at the shell source code. The bash source code can be quite intimidating; the fish source code is much more accessible. But both of them were surprisingly easy to experiment with. For example, it took me about 10 minutes to hack in a new builtin to bash that prints out the contents of the environ variable. (I wanted to find out whether bash updates environ when an exported shell variable is changed; I couldn’t say for sure by just looking at the code. It turns out that bash updates environ, whereas fish does not.)

I also found this Stack Exchange answer very helpful to understand the different views of the kernel and the C standard library on the environment.

—Written by Sebastian Jambor. Follow me on Mastodon @crepels@mastodon.social for updates on new blog posts.