
Learning socat in terms of netcat

In my previous post on sslrsh I wrote about a script to allow remote shell access over SSL. The script made extensive use of socat, which reminded me of how feature-complete socat is and motivated me to capture some socat recipes. Note that these aren't general purpose scripts; they are just snippets of functionality listed here for future reference.

I'm not the only person who has a socat tutorial, but I think this post is unique because it will attempt to describe socat by comparing it to a tool that is doubtless a major inspiration for socat, namely, netcat. Hopefully, it will clarify how to use socat, demonstrate how much more featureful socat is, but also show why you shouldn't go ahead and delete netcat outright.

This is part one of a three part series. This one compares socat with netcat. The next one will delve into UDP with socat, and the last one will get into some advanced topics.

Final comment before we start. As of this writing, socat version 2.0 is in beta release. socat 2.0 addresses a limitation in socat 1.x, which is that "addresses" in socat 1.x are not completely uniform, and they are not layerable. For example, there's no way to run SSL over UDP, even though socat knows about both protocols. Similarly, there's no way to have an SSL connection be tunneled through a web PROXY, meaning you have to resort to the hack found in sslrsh. socat 2.0 addresses these limitations, but uses an enhanced syntax, which means 1) it will be even more complicated to use, and 2) this post may become obsolete rather sooner than expected.

But before we compare socat to netcat, let's compare it to their common namesake, cat.

Use cat to display a file on standard output.

jdimpson@artoo:~$ cat file.txt
This is the content of file.txt


Use socat to display a file on standard output.

jdimpson@artoo:~$ socat FILE:file.txt STDOUT
This is the content of file.txt


In general, socat takes two arguments. Both are called addresses. In the above example, FILE:... is one address, and STDOUT is the second. It's customary but not required to spell the address name in upper case. We'll see lots of address types in this post, as well as in a couple follow-on posts that I've got planned.

If you just run "cat" by itself, it will read from standard input and write to standard output, and you have to press control-D to end.

jdimpson@artoo:~$ cat
hello, world!
hello, world!


The first line after the command is what I typed in; the second is printed by the command.

Here's the equivalent using socat.

jdimpson@artoo:~$ socat STDIN STDOUT
hello, world!
hello, world!


Apparently, STDIN and STDOUT are both synonyms for STDIO, and socat doesn't care if you send input to the STDOUT address, or read output from the STDIN address. "socat STDOUT STDOUT", "socat STDIN STDIN", and "socat STDIO STDIO" all appear to work identically.


But even here socat can improve the situation. We can add a history, so that we can just hit up arrow to repeat what we've typed in earlier, just like bash can do. It utilizes the GNU Readline library.

jdimpson@artoo:~$ socat -u READLINE STDOUT
hello, world!
hello, world!
hello, world!
hello, world!
hello, world!
hello, world!


To get this output, I first typed "hello, world!", then pressed enter. socat wrote the second line. Then I pressed up arrow to get the third line, and enter to get the fourth. Finally, one more up arrow for the fifth, and again enter for the sixth.

The "-u" flag tells socat to run in unidirectional mode. As we'll see later on, socat usually passes data between the first and second addresses in either direction, something cat does not do. When both addresses end up I connecting to the terminal, as is the case here, it's undertermined as to whether the line you type is being read by the first address an sent to the second, or vice versa. It took me a while to figure this out (over a month after I originally posted this!). By forcing unidirectional mode, only the first address reads what you types, and passes it to the second one.

When you quit using control-c or control-d, the terminal gets messed up, and you have to type "reset" (even though you may not be able to see what you're typing) to fix it. The READLINE address has a couple of options, one of which lets you set a history file, which stores the input history across invocations, just like your "~/.bash_history" file. This example isn't how you'd normally use READLINE, but I'm postponing further discussion on READLINE to another post.
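Still, here's a taste of a more typical use, putting line editing and history in front of a TCP connection (a sketch, assuming I have the name of the history option right; the history file name is arbitrary):

socat READLINE,history=$HOME/.socat_history TCP:localhost:80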

Use cat to create a file (then again to display it)

jdimpson@artoo:~$ cat > file.txt
I like writing files using cat and control-D!!
jdimpson@artoo:~$ cat file.txt
I like writing files using cat and control-D!!


Note that, technically, the shell is actually writing the file by virtue of the redirect symbol (greater than sign).

Use socat to create a file (then use cat to display it)

jdimpson@artoo:~$ socat -u STDIN OPEN:file.txt,creat,trunc
socat needs some funny commands to write files!
jdimpson@artoo:~$ cat file.txt
socat needs some funny commands to write files!


Again, a huge difference between socat and cat is that socat, by default, is bidirectional. So both addresses are read from and written to. cat is always unidirectional. And, in socat, when either one of the addresses sends an EOF (End of File), it waits some amount of time and then exits. And again, the "-u" flag tells socat to be unidirectional. Without it, the above socat invocation will read from the file, get EOF, and exit. Or, if the file doesn't exist, it will quit with an error. There would be no time to type anything in. If instead you pipe something into socat, like this: "echo foo | socat STDIN OPEN:file.txt,creat,trunc", the -u isn't needed. Presumably, when invoked within a shell pipe, socat realizes this fact and knows that pipes are always unidirectional, and will behave as if the -u flag were given.

Note the options used, creat and trunc. You could also use append, and lots of the other options available to the open() system call. Also, without the trunc option, socat will write bytes into the file in-place. Omitting trunc and using the seek option, you can change arbitrary bytes in the file. There are also rdonly and wronly options (read-only and write-only, respectively). I had thought that if I used the wronly option, I wouldn't need the -u flag. That didn't work, because socat still tried to read from the file, got an error, and exited. Probably the determination of uni- or bi-directionality is done without input from address-specific options. It does work as expected if you pipe input into socat. socat also has a CREATE address based on the creat() system call, but this is equivalent to OPEN with the creat option.
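For instance, here's a sketch of that in-place editing (assuming the file.txt from earlier, and assuming the seek option takes a byte offset):

echo -n "X" | socat -u STDIN OPEN:file.txt,seek=8

This should overwrite the single byte at offset 8, leaving the rest of the file untouched.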

That covers the major forms of cat, and how socat emulates them, and in some cases enhances them. I don't suggest ever using socat to do what cat can do, but you should have a better sense for how to invoke socat. Now let's compare socat with netcat.

In netcat, connect to TCP port 80 on localhost, as a poor man's web browser.

jdimpson@artoo:~$ nc localhost 80
HEAD / HTTP/1.0
User-agent: netcat, baby!

HTTP/1.1 200 OK
Date: Wed, 28 Jan 2009 13:06:43 GMT
Server: Apache/2.2.8 (Ubuntu) PHP/5.2.4-2ubuntu5.4 with Suhosin-Patch mod_ssl/2.2.8 OpenSSL/0.9.8g mod_perl/2.0.3 Perl/v5.8.8
Last-Modified: Sat, 10 Jan 2009 22:01:08 GMT
ETag: "24008-369-4602802c0f100"
Accept-Ranges: bytes
Content-Length: 873
Connection: close
Content-Type: text/html



I typed in the first three lines (third one is an empty line). The rest is output from the server.

In socat, connect to TCP port 80 on localhost, as a poor man's web browser.

jdimpson@artoo:~$ socat - TCP:localhost:80
HEAD / HTTP/1.0
User-agent: socat, natch!

HTTP/1.1 200 OK
Date: Wed, 28 Jan 2009 13:07:51 GMT
Server: Apache/2.2.8 (Ubuntu) PHP/5.2.4-2ubuntu5.4 with Suhosin-Patch mod_ssl/2.2.8 OpenSSL/0.9.8g mod_perl/2.0.3 Perl/v5.8.8
Last-Modified: Sat, 10 Jan 2009 22:01:08 GMT
ETag: "24008-369-4602802c0f100"
Accept-Ranges: bytes
Content-Length: 873
Connection: close
Content-Type: text/html



Note the "-". That's a shortcut for writing "STDIO", so the above command is equivalent to "socat STDIO TCP:localhost:80".

netcat as a server, listening on TCP port 11111.

nc -l -p 11111


Use "nc localhost 11111" from another window to connect to it. You can type in both windows, and should see each others input in both

There are a couple versions of netcat out there, and some versions (like the OpenBSD one) have had the command flag syntax changed. If you get an error running netcat as described above, try it without the -p flag, like this: "nc -l 11111". Some people really don't get the ideas of compatibility and portability--if they want to change the way a program works (presumably because they think they are improving it), fine. But they should also change the name of the program so that hundreds of scripts don't break, and so that future script writers don't have to test for each version. Anyways...

socat as a server, listening on TCP port 11111.

socat STDIO TCP-LISTEN:11111,reuseaddr


Use "nc localhost 11111" from another window to connect to it. You can type in both windows, and should see each others input in both.

Note that because socat is bidirectional, it doesn't matter which order you put the addresses. The above is equivalent to "socat TCP-LISTEN:11111,reuseaddr STDIO".

TCP-L can be used as a shortcut for TCP-LISTEN. The reuseaddr option lets you quit socat and run it again immediately. netcat does that by default.

netcat as a server, listening on TCP port 11111, handling multiple connections. This one is untested, and created from memory.

nc -L -p 11111


There used to be some versions of netcat that could handle more than one incoming connection when given the "-L" flag, but I can't find a copy of netcat that works that way, nor even any documentation for it. (Maybe I've imagined it!) It's almost equivalent to this shell script snippet: "while true; do nc -l -p 11111; done", except that this snippet only handles one connection at a time, not multiple ones. The OpenBSD variant of netcat has a -k option which works just like the shell snippet, but still doesn't handle multiple simultaneous connections.

socat as a server, listening on TCP port 11111, handling multiple connections.

socat STDIO TCP-L:11111,reuseaddr,fork


Now open two more windows and run "nc localhost 11111" in each. These are clients to your socat server. What you type in each client window gets displayed in the server window. But what you type in the server window only goes to one of the clients. Each line alternates between each client.  The fork option to TCP-L tells socat to fork a new process for each received connection on port 11111.  Each new process then reads and writes on the standard input/output.

TCP: will use IPv4 or IPv6 depending on which type of address you provide. TCP-LISTEN: will listen on all local addresses (IPv4 and IPv6) unless limited by the bind option. There exist TCP4, TCP6, TCP4-LISTEN, and TCP6-LISTEN variations, as well.

netcat as a UDP server on port 11111.

nc -u -l -p 11111

and then as a UDP client.

nc localhost 11111


socat as a UDP server on port 11111.

socat - UDP-LISTEN:11111

and then as a UDP client.

socat - UDP:localhost:11111


Again, UDP-L can be used instead of UDP-LISTEN. UDP will use IPv4 or IPv6 depending on which type of address you provide. UDP-LISTEN will listen on all local addresses (IPv4 and IPv6) unless limited by the bind option. There exist UDP4, UDP6, UDP4-LISTEN, and UDP6-LISTEN variations, as well.

socat has other UDP-based addresses that implement other communication patterns beyond what netcat can do. I started to enumerate them here, but the UDP subject ended up dominating this article, so I've pulled it out and link to it here, so this one can remain focused on comparison with netcat.

The coolest, and most dangerous, netcat option is -e, which causes netcat to execute a command when it connects out or receives a connection. A simple remote access server looks like this:


nc -l -p 2323 -e /bin/bash


The strict equivalent simple remote access server in socat is:

socat TCP-LISTEN:2323,reuseaddr EXEC:/bin/bash


However, you can improve on this in several ways. First, the argument to -e in netcat has to be the name of an executable program, found somewhere on the disk. It can't be multiple commands, and can't rely on shell behaviours, like variable handling or wildcard expansion. Not a major impediment, because you can always write out your commands into a shell script, but sometimes doing that is inconvenient. But socat has the SYSTEM address, which uses the system() call rather than a call to exec(), which is what -e in netcat and EXEC in socat do. It enables something like this:


socat TCP-LISTEN:2323,reuseaddr SYSTEM:'echo $HOME; ls -la'


As always whenever the system() call is involved, be aware when writing scripts to not allow unchecked input to be invoked by the system() call. If you try the above in netcat ("nc -l -p 2323 -e 'echo $HOME; ls -la'"), you'll get an error like this: "exec echo $HOME; ls -la failed : No such file or directory", because netcat tried to execute a program called, literally, "echo $HOME; ls -la", spaces and all. Some versions of netcat have a "-c" option, which uses system() instead of exec(), which would allow multiple commands and shell behaviours to work. But again, it depends on which version you have.

netcat is often employed as a data forwarder, aka a simple proxy, listening for incoming connections only to redirect data to another destination port and/or address. It does so by going into listen mode with -l, then using -e to invoke itself as a client. Because of the use of exec() instead of system(), you have to put the client call into a shell script. First, the client script, "nc-cli", looks like this:

#!/bin/sh
nc localhost 22


Then the call to netcat looks like this:

nc -l -p 2323 -e "./nc-cli"


This redirects incoming connections to port 2323 around to port 22.

(Sometimes you see inetd or xinetd configured to use netcat to do redirecting.)

Of course, you can implement the exact netcat behaviour with "socat STDIO EXEC:nc-cli", or even "socat TCP-L:2323 SYSTEM:'socat STDIO TCP:localhost:22'". However, there's a better way to do data forwarding with socat, which doesn't need a client shell script or even a recursive call to socat. By now you should have enough information about socat to figure it out yourself, but here's one way to do it.
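Using the fork option means that, unlike the netcat version, this forwarder can serve multiple connections at once:

socat TCP-LISTEN:2323,reuseaddr,fork TCP:localhost:22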

And of course, you can replace either address with any other socat address we've already talked about (UDP, UDP-L), or ones we'll talk about in another post (e.g. SSL).

socat can also handle common forwarding requests that netcat doesn't handle. While netcat can bridge between TCP and UDP (insert the -u flag in the above netcat example as appropriate), it can only handle UDP data that is essentially connection-oriented. With socat, any other communication patterns for which UDP is commonly used are also do-able. Just replace the STDIO address in any of the examples in the socat UDP article with TCP or TCP-L addresses as appropriate.
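For instance, here's a sketch of a simple bridge that accepts TCP connections and relays each one over UDP (ports and destination arbitrary):

socat TCP-LISTEN:11111,reuseaddr,fork UDP:localhost:11111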

socat can even behave as a socket gender changer! This part might be a bit confusing to understand; there used to be a file called "TCP-IP_GenderChanger_CSNC_V1.0.pdf" that described the problem, but it seems to be absent from its original location. So I shall try to describe it. The "gender" of a socket is, in this analogy, whether it is a client or a server socket. So a socket gender changer allows two client sockets to connect to each other, or two server sockets to connect to each other. In either case, the gender changer must be running on a host reachable by both clients or both servers. It can run on the same host as either pair, or on a third host. netcat can do this, but with some limitations.

Why would you need this? Off-hand, I can't think of any network protocols that would allow two clients or two servers to just start communicating. So it's not a capability in demand as often as an audio cable gender changer is. But there is one case where it may be useful. Say you have a host running a service that's hidden behind a firewall. No one can connect to the service because the firewall prevents incoming connections. It will allow outgoing connections. Now imagine you can run software on a system outside of the firewall. If you run a server-server gender changer on the external host, and a client-client gender change on the internal host (with one client connecting to the internal service, and the other to one of the server ports on the external host) you have in effect fooled the firewall into allowing access to the internal service despite its access control rules forbidding incoming connections. The above-referenced URL has the specifics of how to use socat to do this. Notice that socat has all sort of retry and timing options to get the desired behaviour. netcat doesn't have all these options, although you may be able to compensate for their absence with a shell script.
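Here's a rough sketch of that firewall scenario (hostnames hypothetical, and handling only one connection at a time). On the external host, the server-server half:

socat TCP-LISTEN:2323,reuseaddr TCP-LISTEN:2424,reuseaddr

On the internal host, the client-client half, joining the external host to the internal service on port 22:

socat TCP:external.example.com:2323,retry=10,interval=5 TCP:localhost:22

A real client then connects to port 2424 on the external host and is relayed through to port 22 on the internal host. In practice you'd lean on the retry, interval, and forever options mentioned above to keep both halves alive across connections.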

That brings to an end the direct comparison of socat and netcat functionality. There's a lot more that socat can do, which I'll address in other articles (one on UDP and multicast, the other on everything else). There are some things netcat can do that I didn't discuss, like how it can do telnet negotiation or port scanning. I really consider those out of place in netcat, because they're too application focused. I tried to point out all the netcat options that are only available in some versions of netcat where appropriate. I didn't talk about the source routing ability of (again, some versions of) netcat. socat can do this too, using the ipoptions option, but it's difficult to use. Mostly, though, I don't know enough about source routing to compare the two; something to add to my list of things to figure out. Don't forget, here's the socat & UDP article.

using flock to protect critical sections in shell scripts

This isn't about a shell script, it's about a really cool technique to apply in shell scripts. Have you ever been worried about multiple instances of a shell script running because they might overwrite or corrupt the data or devices they are working on? Here's a way to prevent that.

There's a Unix system call named flock(2). It's used to apply advisory locks to open files. Without exhausting the subject, it can be used to synchronize access to resources across multiple running processes. Note that I said access to resources, not just access to files. While flock(2) does solely act on files (actually, on file handles), the file itself need not be the resource to which access is being controlled. Instead, the file can be used as a semaphore to control access to a critical section, in which any resource can be accessed without concurrency concerns. (Howdja like that pair of sentences?) Note that flock(2) performs advisory locking, which is another way of saying that all parties accessing the resource in question have to agree to abide by the locking protocol in order for it to work. That's still useful to us. flock(2) is used in this manner to protect critical sections in lots of executable programs.

It turns out that, in addition to the flock(2) system call, there is also a flock(1) command line tool. It's a simple wrapper around the flock(2) system call, making it accessible to shell scripts. It's part of the util-linux-ng package.  You can certainly use it to restrict write access to files, a reasonable thing to do in some shell scripts, but there's a more general technique to be had: flock(1) can be used as a semaphore!
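As a quick taste, flock(1) can also wrap a single command directly, creating and managing the lock file for you (the lock file name here is arbitrary):

flock -x /tmp/example.lck -c "echo got the lock; sleep 5"

Run that in two windows and the second one waits for the first. The rest of this post uses the file-handle form instead, because it gives the script itself control over the lock.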

I'll start with the general form, discuss a couple alternative forms, and then give some practical examples. Here's the general form, which is good for serializing access to a resource. It creates a queue, such that each process waits its turn to utilize the resource:

BEGIN GENERAL FORM
#!/bin/sh

ME=`basename "$0"`;
LCK="/tmp/${ME}.LCK";
exec 8>$LCK;

flock -x 8;
echo "I'm in ($$)";
sleep 20; # XXX: Do something interesting here.
echo "I'm done ($$)";
END GENERAL FORM

Everything after the call to flock is the critical section, where you can operate on whatever resource that you need to control access to.

You may be wondering about the use of "exec". Normally "exec" in a shell script is used to turn over control of the script to some other program. But it can also be used to open a file and name a file handle for it. Normally, every script has standard input (file handle 0), standard output (file handle 1) and standard error (file handle 2) opened for it. The call "exec 8>$LCK" will open the file named in $LCK for writing, and assign it file handle 8. I picked 8 arbitrarily. The call to "flock -x 8" tells flock to exclusively lock the file referenced by file handle 8. The state of being locked lasts after the flock call, because the file handle is still valid (think of it as still in scope). That state will last until the file handle is closed, typically when the script exits.

You can see the locking in action if you run this script twice, ensuring that the second one is started before the first one finishes its call to sleep. I do this in the following example by running the script (called flocktest0) once in the background (by using the "&" to background it), then immediately running it again. Because the script sleeps for 20 seconds, the second call will start before the first one is done (and before it's given up the lock). The output is messed up because the first call is put in the background, but then prints to output, causing it to interfere with the shell's output.

BEGIN EXAMPLE
jdimpson@argentina:~$ flocktest0 &
[1] 13978
jdimpson@argentina:~$ I'm in (13978)
flocktest0
I'm done (13978)
I'm in (13982)
I'm done (13982)
[1]+ Done flocktest0
jdimpson@argentina:~$
END EXAMPLE

Notice that the second call to flocktest0 doesn't say "I'm in (...)" until after the first call to flocktest0 says "I'm done (...)", even though the second call was started before the first call was finished.

So you can imagine a real script doing something interesting in the critical section rather than "sleep 20", and you can be sure that only one call to that script is doing its thing at a time, even if it's invoked several times in a row, and each call will eventually get its turn.

Whereas the general form above is for serializing parallel access to some resource by creating a queue, this following alternate form is used when you want only one process to be accessing a resource (or performing some function) at a time, but you don't want to create a queue of processes. Instead, subsequent invocations will exit (or do something else) rather than queue up. If the initial process does exit, then the next invocation will be allowed to execute. Here's the alternate form.

BEGIN ALTERNATE FORM
#!/bin/sh

ME=`basename "$0"`;
LCK="/tmp/${ME}.LCK";
exec 8>$LCK;

if flock -n -x 8; then
echo "I'm in ($$)";
sleep 20; # XXX: Do something interesting here.
echo "I'm done ($$)";
else
echo "I'm rejected ($$)";
fi
END ALTERNATE FORM

The primary difference is the "-n" flag to flock, which tells it not to block, but to exit with an error value. It's put in an if statement, which will do the "interesting work" (call sleep in this example) in the true clause, and will report that it can't do interesting work in the false clause.

And here's what happens when I invoke it four times in rapid succession, then a fifth time after waiting 20 seconds:

BEGIN EXAMPLE
jdimpson@argentina:~$ flocktest1 &
[1] 14644
jdimpson@argentina:~$ I'm in (14644)
flocktest1
I'm rejected (14648)
jdimpson@argentina:~$ flocktest1
I'm rejected (14651)
jdimpson@argentina:~$ flocktest1
I'm rejected (14654)
jdimpson@argentina:~$ I'm done (14644)
flocktest1
I'm in (14657)
I'm done (14657)
END EXAMPLE

Again, you can imagine a real script that does something more interesting than sleep, which will benefit from the fact that only one invocation will actually do anything, and every other invocation will just exit. Or you could create a reliable modal function by having the false clause send a message (or call "kill") to the first invocation, to make it shut down. So calling the script the first time starts the function, and calling it the second time stops the function. It won't lose track of state.
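Here's a sketch of that modal idea; the pid file is my own addition, so the second invocation knows which process to signal:

BEGIN CODE
#!/bin/sh

ME=`basename "$0"`;
LCK="/tmp/${ME}.LCK";
PID="/tmp/${ME}.PID"; # records who holds the lock, so the "off" press knows whom to kill
exec 8>$LCK;

if flock -n -x 8; then
echo $$ > $PID;
echo "I'm on ($$)";
sleep 20; # XXX: do the long-running work here
else
echo "I'm the off switch ($$)";
kill `cat $PID`;
fi
END CODE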

OK, maybe you need help imagining these things. Here are two realistic examples, one for the general form, and one for the alternate. Both examples center around my use of MythTV, specifically, around the mythfrontend program. MythTV is an open source PVR/DVR application, and mythfrontend is the component of MythTV that plays the videos, and with which the user directly interacts.

MythTV also comes with a simple command line tool called mythtvosd, which can be used to send messages to mythfrontend, which will write the message to the screen by overlaying it over the video being played. I decided it would be cool to display the sender and subject of all email that I receive. I already use procmail to process my incoming email, so it was easy to insert one procmail rule that strips out the sender and the subject and calls mythtvosd with that information, so I could see it on my TV. It's kind of like biff, for those of you who really know your historic Unix applications.

Trouble is, I tend to get two or three email messages at a time, because I use IMAP to download email from the SMTP servers via a cron job. procmail and mythtvosd are able to process all three messages faster than it takes mythfrontend to scroll the sender and subject strings across the screen. So if procmail calls mythtvosd three times in rapid succession, I will only see on the screen the results for the last email (because subsequent calls to mythtvosd have the effect of canceling the previous one). So I used the general form to create a queue, ensuring that all three emails get scrolled across my screen.

The relevant part of .procmailrc to invoke the following code is

:0c
| $HOME/bin/mythemail

And mythemail looks like this:

BEGIN CODE
#!/bin/sh
ME=`basename $0`;
(
     exec 8>/tmp/$ME.LCK;
     flock -x 8
     mythtvosd --template=scroller --scroll_text="mail from $FROM, regarding $SUBJ"
     sleep 10;
) &
END CODE

The code that sets the values of $FROM and $SUBJ has been removed; it's a complicated hack that doesn't do anything to make my point about flock.

So that I don't create a long queue that backs up all my email just to display notification on my tv, I use process control ("&" again) to background all the processes blocked on the lock file waiting their turn.

The "sleep 10" is needed because there's no way to know when the text has finished its scroll across the screen, but 10 seconds works well for me. It's actually not long enough for very long from/subject strings (and/or very wide screens), but it's enough to give a sense of how many emails have been received.

The other example has to do with my new Logitech LX710 keyboard, which has lots and lots of extra buttons for playing music and starting email clients and so forth (although half the buttons don't work under X/GNOME--showkey sees them, but xev does not). I mapped some of the buttons to control mythfrontend. One button activates mythfrontend, one pauses the video and others forward and rewind through the video. This time the trouble is when I accidentally press the activate button more than once. Repeated presses of that button would cause mythfrontend to start multiple times. That wasn't what I wanted. Once it is on, I don't want it to turn on again.

So I used the alternate form to make that happen. On my system, mythfrontend is already a shell script which eventually calls mythfrontend.real, so I only had to add the following code snippet to the top of the script:

BEGIN CODE
#!/bin/sh

ME=`basename $0`;
LOCK="/tmp/${ME}.LCK";
exec 8>$LOCK;

if flock -n -e 8; then :
else
echo "Can't get file lock, mythfrontend already running";
exit 1;
fi

# XXX: rest of mythfrontend...
END CODE

I left the true clause blank (just a colon), and put an exit in the false clause. This let me keep all the lock handling stuff at the top, and makes it very easy to insert into the beginning of a shell script, which is nice because I'm inserting this into a script maintained by someone else, so I'll have to re-insert it during every upgrade. I could have created my own script to contain the locking code, then have it call mythfrontend, but then anyone/anything that calls mythfrontend directly wouldn't go through the locking code, and my scheme wouldn't work.

We could make this script even more useful by doing something more interesting in the false clause than just exiting. If I wanted to make my keyboard button work like a modal on/off switch, I would have the false clause shut down the running instance of mythfrontend. However, I don't want that, because my initial problem was that I was pressing the activate button by mistake, and I wouldn't want to interrupt the video accidentally.

Some notes and limitations:

In the general form, there's no guarantee in what order the processes that are queuing up to access the resource will be served. It's probably a function of the scheduling algorithm used on your system, but will also be affected by how long each process holds the lock. Starvation is a possibility. I probably should mention that these locking techniques aren't intended to scale to high demand or for long-running processes. Use an enterprise-quality software framework for that sort of thing. These techniques, like all shell scripts (in my personal scripting philosophy) are for short lived or infrequently demanded tasks.

The locked file remains locked either until it is explicitly unlocked, or when the script holding the lock closes the file handle. "flock -u N" will explicitly unlock file handle N. Also, all shell scripts (indeed, all processes) close any file handles that remain open when they exit. Finally, a script can explicitly close file handle N by doing "exec N>&-". I tend to design my scripts so that there's no need to explicitly close file handles or perform the unlock call, for reasons of scalability similar to those in the previous paragraph.
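In concrete terms, using file handle 8 from the earlier examples:

flock -u 8;  # explicitly release the lock held via file handle 8
exec 8>&-;   # explicitly close file handle 8, which also releases the lock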

While I prefer using "exec" to create and name lock file handles, another alternative is to use subshells. Instead of

exec 8>lockfile;
flock -x 8;
# XXX: do interesting work here

you can do

(
flock -x 8;
# XXX: do interesting work here
) 8>lockfile;

I prefer the former, because it has less impact on the overall structure of your script and it keeps most of the lock file handling code in one place. Subshells also do funny things with variable scope, so I don't use them unless I need them. I'm not sure which one is easier to understand; they both use esoteric behaviours of the shell, specifically how you work with file handles.

For reasons I don't understand, at least with bash version 3.2.39 as compiled for ubuntu 8.04.1, you're limited to single digit file handle numbers. You should be able to get up to 255, but I get various errors when I use anything greater than 9. While there are a few command line tools that know about more than stdin, stdout, and stderr file handles, they are rare, and knowledge of how to use file handles in the shell is rarer still, so running out of file handle numbers shouldn't be a problem.

According to POSIX, the flock(2) system call has a limitation in that it isn't required to work over remote mounted Network File Systems (NFS). The flock(1) command, being a wrapper around the system call, inherits this limitation if present for your system. There's another system call, fcntl(2), which will work over NFS. Unfortunately, I don't know of any command line way to utilize fcntl(2).

fcntl(2) also has a mandatory locking capability. But be aware that if you were to use mandatory file locking as a semaphore to control access to another, arbitrary resource, the mandatory quality is not transitive to the arbitrary resource, for the simple fact that the resource can still be accessed by a rogue process that chooses not to use the lock. It's a moot point at the moment, but something to keep in mind should someone make an fcntl(1) command line tool.

Finally, if a locked file gets deleted, subsequent lock attempts will succeed even if something is holding an old lock. So if you're protecting access to an arbitrary resource, be aware of the access control permissions of the lock file. You don't want anyone to delete the file from under you.

foreachmail: Run a program on each email in an mbox-style mailbox

Here's one I used today to split out a single mbox file into multiple ones based on the date the email was sent. foreachmail reads an mbox file on standard input, and takes a command (including a shell command line) as an argument. It iterates through the mbox file, finding each individual email message. It then executes the command for each individual email message, sending the email into the command as standard input. It was written in December 2004.

Here's a simple but trivial example:

BEGIN EXAMPLE
cat mbox-file | foreachmail "grep '^From:' | sed -e 's/^From: //' "
END EXAMPLE

In the end, this is equivalent to "cat mbox-file | grep '^From:' | sed -e 's/^From: //' ", BUT, internally, there is a major difference. foreachmail determines the start and finish of each single email, and runs the given command only on that message. The given command only has to worry about processing the contents of one mail message. That may not be an issue if you can implement your solution in a line-oriented, single pass over the data. But if you must process each message differently based on some content in the message, foreachmail makes it easier.

Here's a complex example, which is the one I used today. Not only is it complex, it is for a very specific purpose unique to my environment. Also it relies on another custom command (dateformat) that is out of scope for this article. But I wanted to show another example of how it can be used. I used it to break a giant list of spam into a bunch of smaller spam files. Each of the smaller files contains every spam received on one day.

BEGIN COMPLEX EXAMPLE
foreachmail '(cat > /tmp/mytmp; DAY=`cat /tmp/mytmp | egrep "(single-drop|yahoo.com with SMTP|for <.*@impson.tzo.com>)" | sed -e "s/.*; //" | dateformat -t "%Y-%m-%d%n"`; cat /tmp/mytmp >> spam.$DAY ) ' < allspam
END COMPLEX EXAMPLE

I won't explain the example command. It's sufficient to understand that it reads the email message on standard input, figures out the date the email was received, and appends it to a file named for that date.

As you can see, it is possible to write arbitrarily complex scripts in the first argument to foreachmail.

BEGIN ANOTHER EXAMPLE
foreachmail 'i=0; while read l; do if [ "$l" = "" ]; then i=1; fi; if [ $i -gt 0 ]; then echo "$l"; fi; done;' < mbox-file
END ANOTHER EXAMPLE

This one strips out the headers from every email in the mbox-file, printing only the bodies. If you didn't have foreachmail, you might try to do something like this:

grep -v '^[a-z0-9A-Z-]*: ' < mbox-file

This omits any line that begins with a word ending in a colon. This is a reasonable heuristic for removing email headers in an mbox file, but it's imperfect. It will also strip out any such lines if they fall within the body of the message. And it won't strip out the initial "From " line of the header (note the space), which is particular to the mbox format. Nor does it realize that some headers can be multi-line, where subsequent lines are indented but don't repeat the header tag (e.g. Return-path:), therefore it won't strip out every line of multi-line header tags. foreachmail, knowing the structure of email, doesn't have these problems.

Here's the code.
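The original listing sat behind a cut that has since gone missing, so what follows is a minimal sketch reconstructed from the description below; treat it as an approximation rather than the exact original.

BEGIN CODE
#!/usr/bin/perl
# foreachmail: run a command on each message in an mbox read on standard input.
use strict;
use warnings;

my $cmd = shift || "procmail";  # default when no script is given (see below)

my @msg;
while (my $line = <STDIN>) {
    # In mbox format, a line beginning with "From " starts a new message.
    if ($line =~ /^From / and @msg) {
        mailproc(\@msg);
        @msg = ();
    }
    push @msg, $line;
}
mailproc(\@msg) if @msg;

sub mailproc {
    my ($msg) = @_;
    # Skip the bogus metadata message that pine/alpine keep at the start of the mbox.
    for my $l (@$msg) {
        last if $l =~ /^$/;  # blank line ends the headers
        return if $l =~ /^Subject: DON'T DELETE THIS MESSAGE -- FOLDER INTERNAL DATA/;
    }
    # The parentheses make the shell run the whole command list in a subshell,
    # so every command in it shares the message as standard input.
    open(my $out, "| ( $cmd )") or die "can't run $cmd: $!";
    print $out @$msg;
    close $out;
}
END CODE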

If you don't give it a script to run, it defaults to running procmail. It would probably be better to not do that, but instead provide usage information and exit.

This skips messages containing the subject "DON'T DELETE THIS MESSAGE -- FOLDER INTERNAL DATA", which might be unique to my environment. It comes from the use of pine and alpine email clients which keep metadata in a bogus email message at the beginning of the mbox file. The bogus message always has this subject line, so it can be used to identify and ignore the message.

The while loop finds the start and end of each email sent from standard input. It then passes the message to the mailproc() subroutine. mailproc() finds certain headers (currently only using the subject header, as discussed in the previous paragraph), then executes the given command and sends it the contents of the current message.

foreachmail avoids having to fork()/exec() and do child process maintenance by taking advantage of one of the many ways Perl lets you do interprocess communication. In this case, it uses the open() call, with a pipe ("|") followed by the command. See the "perlipc" man page for more information.

Note the parentheses following the pipe in the open() call. This creates a subshell, which has the advantage of letting you pass multi-command scripts and have all the standard I/O work the way you'd expect. The significance of this is illustrated in the following example. Assume you run foreachmail like this:

foreachmail "rm /tmp/file; cat > /tmp/file"

You'd be expecting foreachmail to delete the temporary file, then write the contents of each email message into the same file. (There's no reason you'd do exactly this, but you can imagine a more complicated example that would follow this pattern.) Without the parentheses in the open() command above, the email message gets sent ONLY to the standard input of the "rm" command (which ignores it), and does not get sent to that of the "cat" command (which will therefore block indefinitely waiting for input). With the parentheses, the above command works as expected, because it causes the pipe to pass the message to the standard input of the entire command list, not just to the first command in the list. This lets you do stuff like read the header up to the blank line used to separate header from body, then discard the body.
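For example (file name and line count arbitrary):

BEGIN EXAMPLE
foreachmail 'head -5; cat > /dev/null' < mbox-file
END EXAMPLE

head prints the first five lines of each message, and cat quietly drains the rest; that only works because the subshell hands both commands the same standard input.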

As it turns out, the "formail" utility, part of the "procmail" package, has a "-s" flag, which works similarly to foreachmail. (Eerie name similarity, too.) Major difference is that formail -s needs to be passed the name of an executable binary, not a string representing a shell script. However, that can be dealt with by giving it the name of a shell binary. So the equivalent to the first example using formail is

cat mbox-file | formail -s bash -c "grep '^From:' | sed -e 's/^From: //' "

So use formail if you have it, but if you don't and do have Perl installed, foreachmail is a good option.

Some good improvements would be a usage statement (instead of defaulting to procmail), some signal handling so interrupts kill the entire program rather than just the running subshell, and some better error handling (that last one is always true).