Never Delete PID Files!

The best way to delete a PID file (or rather a lock file) is not to delete it. Deleting it almost always introduces race conditions that void the protection against parallel execution.

The Normal Locking Workflow

The normal locking procedure goes like this:

  1. Process A opens the PID file read/write (creating it if it doesn’t exist).
  2. Process A successfully acquires an exclusive lock on the PID file in non-blocking mode.
  3. Process A starts doing its work.
  4. Process B opens the PID file read/write.
  5. Process B tries to acquire an exclusive lock on the PID file in non-blocking mode but that immediately fails because A already has a lock.
  6. Process B terminates with an error message.
  7. Process A finishes its work and terminates.
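
The steps above can be sketched in a few lines of C. This is only a minimal sketch, assuming flock(2) is used for locking as in the listing at the end of this page; the name acquire_pid_lock() is made up for illustration.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/file.h>
#include <unistd.h>

/* Hypothetical helper: returns a locked file descriptor, or -1 if the
 * file cannot be opened or somebody else already holds the lock.  */
int
acquire_pid_lock(const char *path)
{
    /* Steps 1 and 4: open read/write, creating the file if needed.  */
    int fd = open(path, O_CREAT | O_RDWR, 0644);

    if (fd < 0) {
        fprintf(stderr, "error opening '%s': %s\n", path, strerror(errno));
        return -1;
    }

    /* Steps 2 and 5: LOCK_NB makes flock() fail right away instead of
     * waiting when the lock is already taken.  */
    if (flock(fd, LOCK_EX | LOCK_NB) < 0) {
        fprintf(stderr, "cannot lock '%s': %s\n", path, strerror(errno));
        close(fd);
        return -1;
    }

    /* The caller keeps the descriptor open while working.  The lock
     * goes away when the descriptor is closed or the process exits.  */
    return fd;
}
```

A process would call acquire_pid_lock("daemon.pid") once at startup, exit with an error if it gets -1, and otherwise simply never close the descriptor.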

Most of the time the process also writes its own PID into the file after successfully acquiring the lock, but that is not important for the general technique. Besides, it is a bogus concept. The idea is to allow killing the process by looking up its PID in the file. But the process may no longer be alive, and the operating system may have reused that PID for a new process. You should rather use lsof process.pid instead of cat process.pid for that purpose, because lsof only lists processes that still have the file open.

But why should the process not remove its PID file, once it has finished, and avoid that problem? In order to understand the problems caused by this, you have to make sure that you know what deleting a file actually means under the hood.

Why is the function to delete a file on POSIX systems called unlink() and not delete()?

A file is commonly understood as something with a name. But that is not always true. A file can exist without having a name. The name is a property of the directory containing the file, a so-called directory entry.

Deleting a file has two effects. The visible effect is that the directory entry vanishes. And under the hood, the kernel also decrements the link count of the associated inode by one.

What is the link count? When you create a file a, and then a hard link to it with ln a b, the associated inode has a link count of 2:

$ >a
$ ln a b
$ ls -li
total 0
8597082957 -rw-r--r--  2 user   wheel  0 Jan 23 23:58 a
8597082957 -rw-r--r--  2 user   wheel  0 Jan 23 23:58 b

The integer 8597082957 in the first column is the inode number. You can see that a and b represent the same file/inode.
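
The same experiment can be done from C with stat(2), which reports the link count in the st_nlink field. The following is just an illustrative sketch; link_count() and trace_link_count() are made-up names, and the file names are whatever the caller passes in.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: link count of the inode behind path, -1 on error.  */
long
link_count(const char *path)
{
    struct stat sb;

    if (stat(path, &sb) < 0)
        return -1;
    return (long) sb.st_nlink;
}

/* Mirrors ">a", "ln a b" and "rm b" and stores the link count of a
 * after each step.  Returns 0 on success, -1 on error.  */
int
trace_link_count(const char *a, const char *b, long counts[3])
{
    int fd = open(a, O_CREAT | O_EXCL | O_WRONLY, 0644);

    if (fd < 0)
        return -1;
    close(fd);
    counts[0] = link_count(a);      /* one directory entry: 1 */

    if (link(a, b) < 0) {
        unlink(a);
        return -1;
    }
    counts[1] = link_count(a);      /* two directory entries: 2 */

    if (unlink(b) < 0) {
        unlink(a);
        return -1;
    }
    counts[2] = link_count(a);      /* back to one entry: 1 */

    unlink(a);
    return 0;
}
```

Note that unlink(b) does not touch the file contents at all. It only removes one name and decrements the counter.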

For the kernel to reclaim all resources associated with an inode, two things must happen:

  1. The link count has to drop to zero, and
  2. No process has an open file descriptor on it.

That feature is commonly used to implement automatic cleaning up of temporary files. A process creates the file, unlinks it right away, and then starts using it. Because the unlink() has deleted the directory entry, an ls in that directory would no longer list the file. That works without problems on most operating systems, but probably not on MS-DOS aka Windows.

When the process terminates, its file descriptor is closed. The link count is already zero, so both conditions are met and the kernel reclaims the inode.
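
The temporary file trick looks roughly like this in C. It is a sketch under the assumption of a local POSIX file system; open_unlinked() is a hypothetical name, not a library function.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: creates path, unlinks it right away and returns
 * the descriptor, which remains fully usable.  Returns -1 on error.  */
int
open_unlinked(const char *path)
{
    /* O_EXCL: refuse to reuse a file that somebody else has created.  */
    int fd = open(path, O_CREAT | O_EXCL | O_RDWR, 0600);

    if (fd < 0)
        return -1;

    /* Remove the directory entry.  The link count drops to zero but
     * the inode stays alive because fd still refers to it.  */
    if (unlink(path) < 0) {
        close(fd);
        return -1;
    }

    return fd;
}
```

After the call, ls no longer shows the file, but the process can still read and write through the descriptor. An fstat() on the descriptor should report an st_nlink of 0 at that point.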

Different Faces of the Race

If you want to clean up a PID file before terminating, you have two options. You can first release the lock and then unlink the file. Or you first unlink it, and then release the lock. Both variants are wrong! The right way is to leave the PID file alone and just terminate.

The race condition resulting from the first variant (unlock, then unlink) is easy to understand:

  1. Process A opens the PID file read/write (creating it if it doesn’t exist).
  2. Process A successfully acquires an exclusive lock.
  3. Process A starts work.
  4. Process A finishes work.
  5. Process A releases the lock on the file.
  6. Process B opens the PID file read/write.
  7. Process A unlinks the file.
  8. Process C opens the PID file read/write, creating it anew because it no longer exists.
  9. Process B and C successfully acquire an exclusive lock and now run simultaneously.

But how can B get a lock in the last step? Re-read the above! B gets the lock for the file descriptor pointing to A’s file. And C gets it for the one it has created itself after A has unlinked the other one.

Unlinking first and unlocking afterwards exhibits a similar race condition:

  1. Process A opens the PID file read/write (creating it if it doesn’t exist).
  2. Process A successfully acquires an exclusive lock.
  3. Process A starts work.
  4. Process B opens the PID file read/write.
  5. Process A finishes work.
  6. Process A unlinks the PID file.
  7. Process C creates the PID file read/write.
  8. Process A unlocks the PID file.
  9. Process B and C successfully acquire an exclusive lock and now run simultaneously.

Again, processes B and C now run simultaneously, wreaking havoc on what the PID file was supposed to protect.
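
Both orderings can be replayed inside a single process, without three terminals. The sketch below relies on the fact that flock(2) locks belong to the open file description, so three separate open() calls contend for the lock just like three processes would. The function replay_race() and its unlink_first parameter are invented for this illustration, and the descriptors a, b, and c stand in for processes A, B, and C.

```c
#include <fcntl.h>
#include <sys/file.h>
#include <sys/stat.h>
#include <unistd.h>

static int
open_pid_file(const char *path)
{
    return open(path, O_CREAT | O_RDWR, 0644);
}

/* Returns 1 if "B" and "C" both end up with an exclusive lock on two
 * different inodes, 0 if they do not, and -1 on unexpected errors.  */
int
replay_race(const char *path, int unlink_first)
{
    struct stat sb, sc;
    int a = -1, b = -1, c = -1, result = -1;

    a = open_pid_file(path);                    /* A opens ...  */
    if (a < 0 || flock(a, LOCK_EX | LOCK_NB) < 0)
        goto out;                               /* ... and locks.  */

    if (unlink_first) {
        b = open_pid_file(path);                /* B opens A's file.  */
        if (b < 0 || unlink(path) < 0)          /* A unlinks.  */
            goto out;
        c = open_pid_file(path);                /* C creates a new file.  */
        if (c < 0 || flock(a, LOCK_UN) < 0)     /* A unlocks.  */
            goto out;
    } else {
        if (flock(a, LOCK_UN) < 0)              /* A unlocks.  */
            goto out;
        b = open_pid_file(path);                /* B opens A's file.  */
        if (b < 0 || unlink(path) < 0)          /* A unlinks.  */
            goto out;
        c = open_pid_file(path);                /* C creates a new file.  */
        if (c < 0)
            goto out;
    }

    if (fstat(b, &sb) < 0 || fstat(c, &sc) < 0)
        goto out;

    /* B locks the old, now nameless inode; C locks the new one.  */
    result = flock(b, LOCK_EX | LOCK_NB) == 0
             && flock(c, LOCK_EX | LOCK_NB) == 0
             && sb.st_ino != sc.st_ino;

out:
    if (a >= 0) close(a);
    if (b >= 0) close(b);
    if (c >= 0) close(c);
    unlink(path);
    return result;
}
```

Calling replay_race("daemon.pid", 0) replays the first sequence and replay_race("daemon.pid", 1) the second one. In both cases the two locks are granted on two different inodes, which is exactly why they do not conflict.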

What About O_EXLOCK and O_SHLOCK?

If you examine the reason for the race, you see that it is caused by opening and locking the PID file not being an atomic operation. Therefore, *BSD systems (and that includes Mac OS X) have the additional open(2) flags O_EXLOCK and O_SHLOCK that open and lock the file in one step.

Problem solved? Not quite. Linux does not support these flags, and your software may have acceptance problems if it does not run on the most widespread operating system in the world.

Well, maybe with some additional fiddling there is a way to prevent the race, but why run risks? What if you ultimately find out that your smart workaround had a thinko? That will — spoiler-alert! — most probably happen after disaster has struck.

Reproduce The Race?

That is simple. Compile the quick and dirty implementation in C at the end of this page. Save it as delete-pid.c and run make delete-pid.

Open three terminal windows, start the program in all three windows as ./delete-pid daemon.pid (daemon.pid is the name of the pid file) and you can step through the individual steps “open lock file”, “lock file”, “working”, “unlink”, and “unlock” interactively.

If you follow the instructions for the three processes A, B, and C above, you will always end up with processes B and C being in the “working” state simultaneously. If you swap the invocations of unlock_file() and unlink_file() at the end, you can switch between the two orders of unlinking and unlocking.

Too much hassle? Then simply believe it! Deleting PID files is a recipe for trouble.

Here’s the source code for the DIY folks:

#include <stdio.h>
#include <sys/file.h>
#include <fcntl.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <stdlib.h>

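static void
prompt(const char *message)
{
    char buffer[1];

    printf("PID %ld: %s ", (long) getpid(), message);
    /* The prompt has no trailing newline, so flush it explicitly.  */
    fflush(stdout);

    /* Block until the user hits return; the input itself is ignored.  */
    if (read(0, buffer, sizeof buffer) < 0) {
        fprintf(stderr, "error reading from stdin: %s\n", strerror(errno));
        exit(1);
    }
}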

static int
open_lock_file(char *path)
{
    int fd;

    prompt("hit return to open lock file!");
    fd = open(path, O_CREAT | O_RDWR, 0644);
    if (fd < 0) {
        fprintf(stderr, "error opening '%s': %s\n", path, strerror(errno));
        exit(1);
    }

    return fd;
}

static void
lock_file(int fd)
{
    prompt("hit return to lock file!");
    if (flock(fd, LOCK_EX | LOCK_NB) < 0) {
        fprintf(stderr, "cannot get exclusive lock: %s\n", strerror(errno));
        exit(1);
    }
}

static void
work(void)
{
    prompt("working, hit return to finish!");
    printf("PID %ld: finished work.\n", (long) getpid());
}

static void
unlock_file(int fd)
{
    prompt("hit return to unlock file!");
    if (flock(fd, LOCK_UN) < 0) {
        fprintf(stderr, "cannot unlock: %s\n", strerror(errno));
        exit(1);
    }
}

static void
unlink_file(char *path)
{
    prompt("hit return to unlink file!");
    if (unlink(path) < 0) {
        fprintf(stderr, "warning: unlinking '%s' failed: %s\n",
                        path, strerror(errno));
    }
}

int
main (int argc, char *argv[])
{
    char *pidfile;
    int fd;

    if (argc < 2) {
        fprintf(stderr, "usage: %s PIDFILE\n", argv[0]);
        return 1;
    }

    pidfile = argv[1];

    fd = open_lock_file(pidfile);
    lock_file(fd);
    work();

    /* Feel free to swap the next two steps.  */
    unlock_file(fd);
    unlink_file(pidfile);

    return 0;
}
