10 January 2009 @ 06:14 am
using flock to protect critical sections in shell scripts  
This isn't about a shell script, it's about a really cool technique to apply in shell scripts. Have you ever been worried about multiple instances of a shell script running because they might overwrite or corrupt the data or devices they are working on? Here's a way to prevent that.

There's a Unix system call named flock(2). It's used to apply advisory locks to open files. Without exhausting the subject, it can be used to synchronize access to resources across multiple running processes. Note that I said access to resources, not just access to files. While flock(2) does solely act on files (actually, on file handles), the file itself need not be the resource to which access is being controlled. Instead, the file can be used as a semaphore to control access to a critical section, in which any resource can be accessed without concurrency concerns. (Howdja like that pair of sentences?) Note that flock(2) performs advisory locking, which is another way of saying that all parties accessing the resource in question have to agree to abide by the locking protocol in order for it to work. That's still useful to us. flock(2) is used in this manner to protect critical sections in lots of executable programs.

It turns out that, in addition to the flock(2) system call, there is also a flock(1) command line tool. It's a simple wrapper around the flock(2) system call, making it accessible to shell scripts. It's part of the util-linux-ng package.  You can certainly use it to restrict write access to files, a reasonable thing to do in some shell scripts, but there's a more general technique to be had: flock(1) can be used as a semaphore!

I'll start with the general form, discuss a couple of alternative forms, and then give some practical examples. Here's the general form, which is good for serializing access to a resource. It creates a queue, such that each process waits its turn to utilize the resource:


#!/bin/sh
ME=`basename "$0"`;
LCK="/tmp/$ME.LCK";
exec 8>"$LCK";

flock -x 8;
echo "I'm in ($$)";
sleep 20; # XXX: Do something interesting here.
echo "I'm done ($$)";

Everything after the call to flock is the critical section, where you can operate on whatever resource that you need to control access to.

You may be wondering about the use of "exec". Normally "exec" in a shell script is used to turn over control of the script to some other program. But it can also be used to open a file and name a file handle for it. Normally, every script has standard input (file handle 0), standard output (file handle 1) and standard error (file handle 2) opened for it. The call "exec 8>$LCK" will open the file named in $LCK for writing (creating it if it doesn't exist), and assign it file handle 8. I picked 8 arbitrarily. The call to "flock -x 8" tells flock to exclusively lock the file referenced by file handle 8. The state of being locked lasts after the flock call, because the file handle is still valid (think of it as still in scope). That state will last until the file handle is closed, typically when the script exits.
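Here's a minimal sketch of just the "exec" part, with no locking involved. The scratch file from mktemp is only a stand-in for $LCK, and the variable names are mine.

```shell
#!/bin/sh
# Sketch: "exec" with only a redirection opens a file handle in the
# current shell instead of replacing the shell with another program.
DEMO=$(mktemp)                 # hypothetical scratch file, stands in for $LCK

exec 8>"$DEMO"                 # open (and truncate) the file for writing as fd 8
echo "written via fd 8" >&8    # any later command can write to that handle
exec 8>&-                      # close fd 8 explicitly

CONTENT=$(cat "$DEMO")         # read back what went through fd 8
echo "$CONTENT"
rm -f "$DEMO"
```

The handle stays open across commands until it is closed, which is exactly the property the lock depends on.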

You can see the locking in action if you run this script twice, ensuring that the second one is started before the first one finishes its call to sleep. I do this in the following example by running the script (called flocktest0) once in the background (by using the "&" to background it), then immediately running it again. Because the script sleeps for 20 seconds, the second call will start before the first one is done (and before it's given up the lock). The output is messed up because the first call is put in the background, but then prints to output, causing it to interfere with the shell's output.

jdimpson@argentina:~$ flocktest0 &
[1] 13978
jdimpson@argentina:~$ I'm in (13978)
I'm done (13978)
I'm in (13982)
I'm done (13982)
[1]+ Done flocktest0

Notice that the second call to flocktest0 doesn't say "I'm in (...)" until after the first call to flocktest0 says "I'm done (...)", even though the second call was started before the first call was finished.

So you can imagine a real script doing something interesting in the critical section rather than "sleep 20", and you can be sure that only one call to that script is doing its thing at a time, even if it's invoked several times in a row, and each call will eventually get its turn.
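If you'd rather see the queue in a single script than in two terminals, here's a rough sketch. It assumes flock(1) is installed; the worker function, the log file, and the one-second sleep are all made up for the demonstration.

```shell
#!/bin/sh
# Sketch: three background workers queue on one lock file, so each
# "in"/"done" pair lands in the log without interleaving.
LCK=$(mktemp)                  # hypothetical lock file
LOG=$(mktemp)                  # hypothetical log shared by the workers

worker() {
    (
        exec 8>"$LCK"          # each worker opens its own handle on the file
        flock -x 8             # block here until this worker owns the lock
        echo "in $1" >>"$LOG"
        sleep 1                # stand-in for the interesting work
        echo "done $1" >>"$LOG"
    )                          # the subshell exits, fd 8 closes, lock released
}

worker a &
worker b &
worker c &
wait                           # wait for all three workers to finish

RESULT=$(cat "$LOG")
echo "$RESULT"
rm -f "$LCK" "$LOG"
```

The order in which a, b and c get their turn isn't guaranteed, but every "in" line should be followed directly by its own "done" line.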

Whereas the general form above is for serializing parallel access to some resource by creating a queue, the following alternate form is used when you want only one process to be accessing a resource (or performing some function) at a time, but you don't want to create a queue of processes. Instead, subsequent invocations will exit (or do something else) rather than queue up. Once the initial invocation exits, the next invocation will be allowed to execute. Here's the alternate form.


#!/bin/sh
ME=`basename "$0"`;
LCK="/tmp/$ME.LCK";
exec 8>"$LCK";

if flock -n -x 8; then
  echo "I'm in ($$)";
  sleep 20; # XXX: Do something interesting here.
  echo "I'm done ($$)";
else
  echo "I'm rejected ($$)";
fi

The primary difference is the "-n" flag to flock, which tells it not to block, but to exit with an error value. It's put in an if statement, which will do the "interesting work" (call sleep in this example) in the true clause, and will report that it can't do interesting work in the false clause.
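Here's a sketch that shows both outcomes of "flock -n" in one run. It assumes flock(1) is available; the holder subshell, the variable names and the timings are invented for the demo, and the one-second pause is a crude way of letting the holder get the lock first.

```shell
#!/bin/sh
# Sketch: one process holds the lock for a few seconds while the main
# script tries "flock -n" once during that time and once afterwards.
LCK=$(mktemp)                  # hypothetical lock file

(
    exec 8>"$LCK"
    flock -x 8                 # holder takes the lock...
    sleep 3                    # ...and keeps it for three seconds
) &
HOLDER=$!
sleep 1                        # crude: give the holder time to lock

exec 8>"$LCK"                  # the contender's own handle on the same file
if flock -n -x 8; then FIRST="in"; else FIRST="rejected"; fi

wait "$HOLDER"                 # holder has exited, so its lock is gone
if flock -n -x 8; then SECOND="in"; else SECOND="rejected"; fi

exec 8>&-
echo "first try: $FIRST, second try: $SECOND"
rm -f "$LCK"
```

The first attempt returns immediately with a non-zero status instead of waiting, which is what lets the script choose what to do in the false clause.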

And here's what happens when I invoke it four times in rapid succession, then a fifth time after waiting 20 seconds:

jdimpson@argentina:~$ flocktest1 &
[1] 14644
jdimpson@argentina:~$ I'm in (14644)
I'm rejected (14648)
jdimpson@argentina:~$ flocktest1
I'm rejected (14651)
jdimpson@argentina:~$ flocktest1
I'm rejected (14654)
jdimpson@argentina:~$ I'm done (14644)
I'm in (14657)
I'm done (14657)

Again, you can imagine a real script that does something more interesting than sleep, which will benefit from the fact that only one invocation will actually do anything, and every other invocation will just exit. Or you could create a reliable modal function by having the false clause send a message (or call "kill") to the first invocation to make it shut down. Then calling the script the first time starts the function, and calling it the second time stops the function. It won't lose track of state.

OK, maybe you need help imagining these things. Here are two realistic examples, one for the general form, and one for the alternate. Both examples center around my use of MythTV, specifically, around the mythfrontend program. MythTV is an open source PVR/DVR application, and mythfrontend is the component of MythTV that plays the videos, and with which the user directly interacts.

MythTV also comes with a simple command line tool called mythtvosd, which can be used to send messages to mythfrontend, which will write the message to the screen by overlaying it over the video being played. I decided it would be cool to display the sender and subject of all email that I receive. I already use procmail to process my incoming email, so it was easy to insert one procmail rule that strips out the sender and the subject and calls mythtvosd with that information, so I could see it on my TV. It's kind of like biff, for those of you who really know your historic Unix applications.

Trouble is, I tend to get two or three email messages at a time, because I use IMAP to download email from the SMTP servers via a cron job. procmail and mythtvosd are able to process all three messages faster than it takes mythfrontend to scroll the sender and subject strings across the screen. So if procmail calls mythtvosd three times in rapid succession, I will only see on the screen the results for the last email (because subsequent calls to mythtvosd have the effect of canceling the previous one). So I used the general form to create a queue, ensuring that all three emails get scrolled across my screen.

The relevant part of .procmailrc to invoke the following code is

| $HOME/bin/mythemail

And mythemail looks like this:

#!/bin/sh
ME=`basename "$0"`;
(
     exec 8>/tmp/$ME.LCK;
     flock -x 8
     mythtvosd --template=scroller --scroll_text="mail from $FROM, regarding $SUBJ"
     sleep 10;
) &

The code that sets the values of $FROM and $SUBJ has been removed; it's a complicated hack that doesn't do anything to make my point about flock.

So that I don't create a long queue that backs up all my email just to display a notification on my TV, I use process control ("&" again) to background all the processes blocked on the lock file waiting their turn.

The "sleep 10" is needed because there's no way to know when the text has finished its scroll across the screen, but 10 seconds works well for me. It's actually not long enough for very long from/subject strings (and/or very wide screens), but it's enough to give a sense of how many emails have been received.

The other example has to do with my new Logitech LX710 keyboard, which has lots and lots of extra buttons for playing music and starting email clients and so forth (although half the buttons don't work under X/GNOME--showkeys sees them, but xev does not). I mapped some of the buttons to control mythfrontend. One button activates mythfrontend, one pauses the video and others forward and rewind through the video. This time the trouble is when I accidentally press the activate button more than once. Repeated presses of the on button would cause mythfrontend to start multiple times. That wasn't what I wanted. Once it is on, I don't want it to turn on again.

So I used the alternate form to make that happen. On my system, mythfrontend is already a shell script which eventually calls mythfrontend.real, so I only had to add the following code snippet to the top of the script:


ME=`basename "$0"`;
LOCK="/tmp/$ME.LCK";
exec 8>"$LOCK";

if flock -n -e 8; then :
else
  echo "Can't get file lock, mythfrontend already running";
  exit 1;
fi

# XXX: rest of mythfrontend...

I left the true clause blank (just a colon), and put an exit in the false clause. This let me keep all the lock handling stuff at the top, and makes it very easy to insert into the beginning of a shell script, which is nice because I'm inserting this into a script maintained by someone else, so I'll have to re-insert it during every upgrade. I could have created my own script to contain the locking code, then have it call mythfrontend, but then anyone/anything that calls mythfrontend directly wouldn't go through the locking code, and my scheme wouldn't work.

We could make this script even more useful by doing something more interesting in the false clause than just exiting. If I wanted to make my keyboard button work like a modal on/off switch, I would have the false clause shut down the running instance of mythfrontend. However, I don't want that, because my initial problem was that I was pressing the activate button by mistake, and I wouldn't want to interrupt the video accidentally.

Some notes and limitations:

In the general form, there's no guarantee in what order the processes that are queuing up to access the resource will be served. It's probably a function of the scheduling algorithm used on your system, but will also be affected by how long each process holds the lock. Starvation is a possibility. I probably should mention that these locking techniques aren't intended to scale to high demand or for long-running processes. Use an enterprise-quality software framework for that sort of thing. These techniques, like all shell scripts (in my personal scripting philosophy), are for short-lived or infrequently demanded tasks.

The locked file remains locked until it is explicitly unlocked or until the script holding the lock closes the file handle. "flock -u N" will explicitly unlock file handle N. Also, all shell scripts (indeed, all processes) close any file handles that remain open when they exit. Finally, a script can explicitly close file handle N by doing "exec N>&-". I tend to design my scripts so that there's no need to explicitly close file handles or perform the unlock call, for reasons of scale similar to those in the previous paragraph.
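Here's a small sketch of both ways of letting go of the lock. The try_lock helper is something I made up for the demo: it plays the part of a second process by opening its own handle (fd 9, also arbitrary) in a subshell and attempting a non-blocking lock.

```shell
#!/bin/sh
# Sketch: release a lock with "flock -u", and again by closing the handle.
LCK=$(mktemp)                  # hypothetical lock file

# try_lock (hypothetical helper): report whether someone else could
# take the lock right now, using a separate handle in a subshell.
try_lock() {
    if ( exec 9>"$LCK"; flock -n -x 9 ); then echo free; else echo busy; fi
}

exec 8>"$LCK"
flock -x 8
HELD=$(try_lock)               # lock is held through fd 8

flock -u 8                     # explicit unlock; fd 8 stays open
UNLOCKED=$(try_lock)

flock -x 8                     # take the lock again
exec 8>&-                      # closing the handle drops the lock too
CLOSED=$(try_lock)

echo "held: $HELD, after -u: $UNLOCKED, after close: $CLOSED"
rm -f "$LCK"
```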

While I prefer using "exec" to create and name lock file handles, another alternative is to use subshells. Instead of

exec 8>lockfile;
flock -x 8;
# XXX: do interesting work here

you can do

(
flock -x 8;
# XXX: do interesting work here
) 8>lockfile;

I prefer the former, because it has less impact on the overall structure of your script and it keeps most of the lock file handling code in one place. Subshells also do funny things with variable scope, so I don't use them unless I need them. I'm not sure which one is easier to understand; they both use esoteric behaviours of the shell, specifically how you work with file handles.
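To illustrate the variable scope issue with the subshell form, here's a sketch. COUNT and the lock file from mktemp are hypothetical, and the final check uses a second handle (fd 9) to confirm that the lock went away with the subshell.

```shell
#!/bin/sh
# Sketch: the subshell form holds the lock only inside the parentheses,
# and variable changes made there don't reach the rest of the script.
LCK=$(mktemp)                  # hypothetical lock file
COUNT=0

(
    flock -x 8
    COUNT=1                    # changes only the subshell's copy
    echo "inside: COUNT=$COUNT"
) 8>"$LCK"                     # fd 8 exists only for the subshell

echo "outside: COUNT=$COUNT"   # the outer COUNT was never touched

# The lock went away with the subshell, so a fresh attempt succeeds.
if ( exec 9>"$LCK"; flock -n -x 9 ); then AFTER=free; else AFTER=busy; fi
echo "lock after subshell: $AFTER"
rm -f "$LCK"
```

If the critical section needs to hand a result back to the rest of the script, the "exec" form avoids this problem because everything runs in the same shell.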

For reasons I don't understand, at least with bash version 3.2.39 as compiled for Ubuntu 8.04.1, you're limited to single digit file handle numbers. You should be able to get up to 255, but I get various errors when I use anything greater than 9. While there are a few command line tools that know about more than stdin, stdout, and stderr file handles, they are rare, and knowledge of how to use file handles in the shell is rarer still, so running out of file handle numbers shouldn't be a problem.

The flock(2) system call isn't part of POSIX (it comes from BSD), and it has a limitation in that it isn't guaranteed to work over remotely mounted Network File Systems (NFS). The flock(1) command, being a wrapper around the system call, inherits this limitation if present for your system. There's another system call, fcntl(2), which will work over NFS. Unfortunately, I don't know of any command line way to utilize fcntl(2).

fcntl(2) also has a mandatory locking capability. But be aware that if you were to use mandatory file locking as a semaphore to control access to another, arbitrary resource, the mandatory quality is not transitive to the arbitrary resource, for the simple fact that the resource can still be accessed by a rogue process that chooses not to use the lock. It's a moot point at the moment, but something to keep in mind should someone make a fcntl(1) command line tool.

Finally, if a locked file gets deleted, subsequent lock attempts will succeed even if something is holding an old lock. So if you're protecting access to an arbitrary resource, be aware of the access control permissions of the lock file. You don't want anyone to delete the file from under you.
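Here's a sketch of that last pitfall. The directory, file name and try_lock helper are all hypothetical; the helper stands in for a later invocation that opens the lock file by name and tries a non-blocking lock.

```shell
#!/bin/sh
# Sketch: deleting the lock file lets a newcomer "lock" a brand-new file
# of the same name while the original lock is still held.
DIR=$(mktemp -d)               # hypothetical scratch directory
LCK="$DIR/demo.LCK"

# try_lock (hypothetical helper): open the path afresh and try to lock it.
try_lock() {
    if ( exec 9>"$LCK"; flock -n -x 9 ); then echo free; else echo busy; fi
}

exec 8>"$LCK"
flock -x 8                     # we hold the lock on the original file
BEFORE=$(try_lock)

rm -f "$LCK"                   # someone deletes the file out from under us
AFTER=$(try_lock)              # the path now names a different, new file

echo "before rm: $BEFORE, after rm: $AFTER"
exec 8>&-
rm -rf "$DIR"
```

We never released our lock, yet the second attempt gets in, because the lock belongs to the open file and not to its name.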
dawerbuch on February 24th, 2011 04:55 pm (UTC)
Running section of sh code in background
Hello, I found your posting by accident during another Google search, but I am intrigued:

BEGIN CODE
#!/bin/sh
ME=`basename $0`;
(
exec 8>/tmp/$ME.LCK;
flock -x 8
mythtvosd --template=scroller --scroll_text="mail from $FROM, regarding $SUBJ"
sleep 10;
) &
END CODE

Does this code really run in background in the active process, or does the shell fork off a new process for it? Can it be waited for using "wait $!" ?

Thanks, Dave
jdimpson on February 24th, 2011 08:41 pm (UTC)
Re: Running section of sh code in background
Hey Dave, thanks for the comment. By "run in the background" I do mean fork off a new process. So yes, "wait $!" should let you wait for it to complete, because the new process is a child of the shell that started it.

I guess the use of "background" comes from the "bg" shell built-in command, which along with "&" and "fg" (and control-z), lets you manage processes interactively as you type on the command line. In a script, the background/foreground nomenclature makes less sense, although they work exactly the same.
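A quick sketch of the "wait $!" part, with a plain sleep standing in for the flock/mythtvosd sequence. The exit status of 3 is arbitrary, chosen only so it's visible that wait reports it.

```shell
#!/bin/sh
# Sketch: "&" forks a child process, "$!" holds its PID, and "wait"
# blocks until that child exits and returns its exit status.
(
    sleep 1                    # stand-in for the flock/mythtvosd/sleep sequence
    exit 3                     # arbitrary status, to show that wait reports it
) &
CHILD=$!                       # PID of the background subshell

wait "$CHILD"                  # blocks until that child has exited
STATUS=$?
echo "child $CHILD exited with status $STATUS"
```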
crizzo.myopenid.com on March 8th, 2012 10:57 pm (UTC)
Wondering about fd 8
This is a great article and really helped me understand the use of flock. But I have one question... You use fd 8 in your examples. What happens if there is another already open file that is using fd 8 and how do you guarantee that there is not when you perform the flock?

jdimpson on July 3rd, 2012 06:33 pm (UTC)
Re: Wondering about fd 8
fd 8 was arbitrarily chosen. My first reaction to your question was "if you have so many file descriptors that you can't manage them manually, then you are probably trying to do some task better suited to a more featureful programming language (python, perl, etc)."

Detecting in-use file descriptors, and programmatically determining an unused fd, don't seem to be done easily in bash. That's why I made that statement, and I think it's true. You're likely writing something that will be easier to develop, troubleshoot, and maintain in another language.

But, technically your question is interesting. The first thing we have to do is stop using hard-coded fd numbers and use variables instead. This turns out to be trickier than I expected. I tried this first:

exec $N>$LCK

This turns out not to work. bash gives an error saying something like "exec: 5: 8: not found". The "5" refers to the line number in the script. The rest suggests that exec (and thus bash, since exec is a built-in) is trying to execute "8" as though it were a command (per the typical behaviour of the exec command). exec must decide whether it's opening an fd or running a command before the variable interpolation takes place (very atypical for bash, but possible, again because exec is a bash built-in). After some experimentation, the following works:

eval "exec $N>$LCK"

The use of eval allows variable interpolation to happen before the exec command determines if it is opening an fd or running a command. Note that you should always be wary using the eval command, because it has a history of providing security holes in shell scripts.
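Here's a slightly fuller sketch of the eval approach. N=7 and the scratch lock file are arbitrary choices of mine, and the escaped quotes keep the file name from being re-parsed when eval runs the string. The check at the end uses a second handle (fd 9) to confirm the lock is really held.

```shell
#!/bin/sh
# Sketch: open, lock and close a lock file through an fd number that is
# held in a variable, using eval to build each redirection.
N=7                            # arbitrary fd number held in a variable
LCK=$(mktemp)                  # hypothetical lock file

eval "exec $N>\"\$LCK\""       # eval runs the string: exec 7>"$LCK"
flock -x "$N"                  # flock takes the fd number as an ordinary argument

# Confirm the lock is held by trying it from a separate handle.
if ( exec 9>"$LCK"; flock -n -x 9 ); then STATE=free; else STATE=busy; fi
echo "lock taken through fd $N is $STATE"

eval "exec $N>&-"              # close the handle the same way
rm -f "$LCK"
```

Only $N is expanded before eval sees the string, so the only thing that must be trusted is the fd number itself.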

So that works (even for other built-ins, like using the "-u" flag to "read"), but we still haven't answered your questions: how to tell if an fd is already in use, and how to determine the next available one. The bash man page cautions that fds over 9 may be used internally by bash. And fds 0, 1, and 2 (input, output, and error) are opened by default and typically needed in your script. So that leaves 3-9, which, I mention again, should be enough for simple scripts. Anyway...

Presumably if we can find a way to test for ones in use, then we can just enumerate numbers from 0 to the maximum number of fds available, testing each one, until we find an unused one. "ulimit -n" will tell you how many fds are possible. (1024 on my system.)

(Might be better to start at 1024 and work down to 0, or at least start at 3 rather than at 0.)

But how to test whether an fd is in use?

The -t flag to the test builtin detects open file descriptors, but only if the fds are opened on terminals. That will work with 0, 1, and 2, but not for any fd opened on a file or pipe.

The -u flag for the read builtin can read from fd numbers. It returns non-zero if there's a read error. However, this will have the effect of reading a line of text from the fd and incrementing the position within the file, which would be bad if some other process is using the fd for some important purpose. If the fd is opened on a binary file, read may never find end of line, which would cause an error also (probably a different error number, though). Also, presumably you can't read on a write fd (I didn't test).

In fact, the only way I can think of to check what fds are in use requires a Linux-like /proc filesystem. Just look at "/proc/$$/fd/" to see every open fd. ("$$" is a shell variable containing the shell's own process ID.) You can use "test -e" on "/proc/$$/fd/N" to see if fd N exists. If it does, then the fd is open, so try another one. But that's not guaranteed to be portable, and is out of control of bash.
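Putting those pieces together, here's a Linux-only sketch. The first_free_fd function is a hypothetical helper of mine, it only looks at fds 3 through 9 for the reasons given above, and it assumes /proc is mounted.

```shell
#!/bin/sh
# Linux-only sketch: relies on /proc/<pid>/fd, which is not portable.

# first_free_fd (hypothetical helper): print the lowest fd in 3..9 that
# the shell identified by $$ does not have open, or return 1 if none.
first_free_fd() {
    fd=3
    while [ "$fd" -le 9 ]; do
        if [ ! -e "/proc/$$/fd/$fd" ]; then
            echo "$fd"
            return 0
        fi
        fd=$((fd + 1))
    done
    return 1
}

LCK=$(mktemp)                  # hypothetical lock file
if N=$(first_free_fd); then
    eval "exec $N>\"\$LCK\""   # open the lock file on the free fd
    flock -x "$N"
    echo "locked using fd $N"
    eval "exec $N>&-"          # close it again, dropping the lock
else
    echo "no free fd between 3 and 9" >&2
fi
rm -f "$LCK"
```

Note that "$$" still names the main shell even inside the command substitution, so the helper inspects the right process.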
Adam Danischewski on September 21st, 2017 03:43 am (UTC)
Re: Wondering about fd 8
Bash, since??, has a feature of allowing you to simply specify a variable to be populated BY Bash.

The syntax is: exec {fd_lock}>"${LOCK_FILE}"

Thereafter, ${fd_lock} contains the available fd that Bash found.

Some other things I added to my code regarding flock, is some cleanup:
exec {fd_lock}>"${LOCK_FILE}"
flock -x "${fd_lock}" || { echo "ERROR: flock() failed." >&2; exit 1; }
# [[critical code]]
flock -u "${fd_lock}" && echo "removed lock here" || echo "lock busy didn't remove lock .. "
[[ -e "/proc/$$/fd/${fd_lock}" ]] && eval "exec ${fd_lock}>&-"
if lsof "${LOCK_FILE}" &>/dev/null; then
  echo "lock file still in use, not deleting"
else
  rm -f "${LOCK_FILE}"
  echo "removed lock file"
fi

Also note that flock does not guarantee any unblock ordering; this is something that troubles me about flock. I'm probably going to end up rewriting it one day, if someone doesn't do it sooner.

It would be nice if flock blocks got unblocked sequentially FIFO style, instead they pop off LIFO? This raises the specter of a potential starvation problem if you have a lot of processes hammering away.
Ikem Krueger on January 11th, 2015 08:09 pm (UTC)
> which will do _the the_ "interesting work"

Ed Greenberg on October 4th, 2018 11:17 am (UTC)
Just a thank you
Of about five articles, this is the one that actually got me going, so I can have an rsync that runs on schedule, but doesn't start if the previous one hasn't finished.

Many thanks

Ed Greenberg