04 January 2009 @ 11:49 pm
foreachmail: Run a program on each email in an mbox-style mailbox  
Here's one I used today to split out a single mbox file into multiple ones based on the date the email was sent. foreachmail reads an mbox file on standard input, and takes a command (including a shell command line) as an argument. It iterates through the mbox file, finding each individual email message. It then executes the command for each individual email message, sending the email into the command as standard input. It was written in December 2004.

Here's a simple, if trivial, example:

BEGIN EXAMPLE
cat mbox-file | foreachmail "grep '^From:' | sed -e 's/^From: //' "
END EXAMPLE

In the end, this produces the same output as "cat mbox-file | grep '^From:' | sed -e 's/^From: //' ", but internally there is a major difference. foreachmail determines the start and finish of each single email, and runs the given command only on that message. The given command only has to worry about processing the contents of one mail message. That may not matter if you can implement your solution as a line-oriented, single pass over the data. But if you must process each message based on some content in that message, foreachmail makes it easier.
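As a sketch of the kind of per-message decision that is awkward in a single line-oriented pass, here is a command that prints a message's subject only when its From: header matches a given address. The function name subject_if_from and the addresses are made up for illustration; it is written as a shell function so it can be tried on one message without foreachmail.

```shell
# Hypothetical per-message command: reads ONE message on standard input and
# prints its Subject only if the From: header contains the wanted address.
subject_if_from() {
    wanted=$1
    msg=$(cat)                                      # the whole message
    headers=$(printf '%s\n' "$msg" | sed '/^$/q')   # everything up to the first blank line
    if printf '%s\n' "$headers" | grep -q "^From: .*$wanted"; then
        printf '%s\n' "$headers" | grep '^Subject:' | sed -e 's/^Subject: //'
    fi
}
```

Because foreachmail runs its argument in a fresh shell, a function defined in your interactive shell would not be visible to it. Saved as an executable script (say, subject-if-from, also a made-up name), the same logic should be usable as: foreachmail 'subject-if-from alice@example.com' < mbox-file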

Here's a complex example, which is the one I used today. Not only is it complex, it is for a very specific purpose unique to my environment. Also it relies on another custom command (dateformat) that is out of scope for this article. But I wanted to show another example of how it can be used. I used it to break a giant list of spam into a bunch of smaller spam files. Each of the smaller files contains every spam received on one day.

BEGIN COMPLEX EXAMPLE
foreachmail '(cat > /tmp/mytmp; DAY=`cat /tmp/mytmp | egrep "(single-drop|yahoo.com with SMTP|for <.*@impson.tzo.com>)" | sed -e "s/.*; //" | dateformat -t "%Y-%m-%d%n"`; cat /tmp/mytmp >> spam.$DAY ) ' < allspam
END COMPLEX EXAMPLE

I won't explain the example command. It's sufficient to understand that it reads the email message on standard input, figures out the date the email was received, and appends it to a file named for that date.
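For readers without dateformat, here is a rough sketch of the same pattern using only sed, awk and mktemp. It differs from my command in one important way: it files each message by the sender's Date: header rather than by the date the message was received. The function names message_day and file_by_day are invented for this sketch, and it assumes the usual "Date: Sun, 04 Jan 2009 23:49:00 -0500" layout.

```shell
# Print YYYY-MM-DD taken from the Date: header of ONE message on stdin.
message_day() {
    sed '/^$/q' | awk '
        tolower($1) == "date:" {
            i = 2
            if ($i ~ /,$/) i++      # skip the optional day-of-week field
            day = $i; mon = $(i + 1); year = $(i + 2)
            # Position of the month name in the string gives the month number.
            m = (index("JanFebMarAprMayJunJulAugSepOctNovDec", mon) + 2) / 3
            printf "%04d-%02d-%02d\n", year, m, day
            exit
        }'
}

# Save the message to a temporary file, work out its day, then append the
# message to spam.<day> in the current directory.
file_by_day() {
    tmp=$(mktemp) || return 1
    cat > "$tmp"
    day=$(message_day < "$tmp")
    cat "$tmp" >> "spam.${day:-unknown}"
    rm -f "$tmp"
}
```

Messages with no parsable Date: header end up in spam.unknown instead of being lost.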

As you can see, it is possible to write arbitrarily complex scripts in the first argument to foreachmail.

BEGIN ANOTHER EXAMPLE
foreachmail 'i=0; while read l; do if [ "$l" = "" ]; then i=1; fi; if [ $i -gt 0 ]; then echo "$l"; fi; done;' < mbox-file
END ANOTHER EXAMPLE

This one strips out the headers from every email in the mbox-file, printing only the bodies. If you didn't have foreachmail, you might try to do something like this:

grep -v '^[a-z0-9A-Z-]*: ' < mbox-file

This omits any line that has a word ending with a colon at the beginning of the line. This is a reasonable heuristic for removing email headers in an mbox file, but it's imperfect. It will also strip out any such lines if they fall within the body of the message. And it won't strip out the initial "From " line of the header (note the space), which is particular to the mbox format. Nor does it realize that some headers can be multi-line, where subsequent lines are indented but don't repeat the header tag (e.g. Return-path:), so it won't strip out every line of a multi-line header. foreachmail, knowing the structure of email, doesn't have these problems.
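To make those failure modes concrete, here is a small self-contained sketch. The sample message is invented, and strip_headers is a made-up name for a header-aware alternative that only works on a single message, which is exactly the situation foreachmail puts a command in. Unlike the loop in my example above, it also drops the blank separator line.

```shell
# A one-message mbox; addresses and text are made up for this sketch.
sample_mbox() {
    printf '%s\n' \
        'From alice@example.com Sun Jan  4 23:49:00 2009' \
        'Received: from mail.example.com' \
        '        by mx.example.org; Sun, 04 Jan 2009 23:49:00 -0500' \
        'Subject: hello' \
        '' \
        'Note: this line is part of the body.' \
        'Plain body line.'
}

# The grep heuristic from the text.
strip_with_grep() {
    grep -v '^[a-z0-9A-Z-]*: '
}

# Header-aware alternative for ONE message: delete everything from the
# first line through the first blank line.
strip_headers() {
    sed '1,/^$/d'
}
```

Running "sample_mbox | strip_with_grep" keeps the "From " line and the indented Received: continuation, and loses the "Note:" body line. Running "sample_mbox | strip_headers" prints just the two body lines.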

Here's the code.
BEGIN CODE
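#!/usr/bin/perl -w
use strict;

# Command to run for each message; defaults to procmail if none is given.
my $procmail = $ARGV[0];
$procmail = "procmail" unless $procmail;
my @message;      # lines of the message currently being collected
my $times = 0;    # number of messages handed to the command

while (my $line = <STDIN>) {
    # A "From " line starts the next message, so hand off the one collected
    # so far before storing the new line.
    if ($line =~ /^From / and @message) {
        $times += mailproc(prog => $procmail, mess => \@message);
        @message = ();
    }
    push @message, $line;
}
# Hand off the final message.
if (@message) {
    $times += mailproc(prog => $procmail, mess => \@message);
}
warn "$times messages\n";

# Pipe one message to the command. Returns 1 if the command was run,
# 0 if the message was skipped.
sub mailproc {
    my %args = @_;
    my $prog = $args{'prog'};
    my $mess = $args{'mess'};

    my $id = '';
    my $date = '';
    my $subj = '';
    my $from = '';
    foreach my $l (@$mess) {
        $id   = $l if $l =~ /^Message-ID:/i;
        $date = $l if $l =~ /^Date:/i;
        $subj = $l if $l =~ /^Subject:/i;
        $from = $l if $l =~ /^From:/i;
    }
    # Skip the folder metadata pseudo-message.
    return 0 if $subj =~ /DON'T DELETE THIS MESSAGE -- FOLDER INTERNAL DATA/;
    # The parentheses run the command in a subshell.
    open OUT, "| ( $prog )" or die "$prog: $!\n";
    print OUT @$mess;
    close OUT;
    return 1;
}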

END CODE

If you don't give it a script to run, it defaults to running procmail. It would probably be better to not do that, but instead provide usage information and exit.

This skips messages containing the subject "DON'T DELETE THIS MESSAGE -- FOLDER INTERNAL DATA", which might be unique to my environment. It comes from the use of pine and alpine email clients which keep metadata in a bogus email message at the beginning of the mbox file. The bogus message always has this subject line, so it can be used to identify and ignore the message.

The while loop finds the start and end of each email sent from standard input. It then passes the message to the mailproc() subroutine. mailproc() finds certain headers (currently only using the subject header, as discussed in the previous paragraph), then executes the given command and sends it the contents of the current message.
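The same splitting idea can be sketched with awk, for comparison. This is a different technique from the Perl above, not a drop-in replacement: it does not skip the pine metadata message, does not default to procmail, and does not report a message count. The name foreachmail_awk is made up for this sketch.

```shell
# Run the command in $1 once per message of the mbox on standard input.
foreachmail_awk() {
    awk -v cmd="$1" '
        # A "From " line after the first line means the previous message is
        # complete: closing the pipe ends that command and waits for it.
        /^From / && started { close(cmd) }
        # Printing to a closed pipe starts a fresh copy of the command.
        { started = 1; print | cmd }
        END { if (started) close(cmd) }
    '
}
```

For example, foreachmail_awk "grep '^Subject:'" < mbox-file runs grep once per message.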

foreachmail avoids having to fork()/exec() and do child process maintenance by taking advantage of one of the many ways Perl lets you do interprocess communication. In this case, it uses the open() call, with a pipe ("|") followed by the command. See the "perlipc" man page for more information.

Note the parentheses following the pipe in the open() call. They create a subshell, which has the advantage of letting you pass multi-command scripts and have all the standard I/O work the way you'd expect. The significance of this is illustrated in the following example. Assume you run foreachmail like this:

foreachmail "rm /tmp/file; cat > /tmp/file"

You'd expect foreachmail to delete the temporary file, then write the contents of each email message into the same file. (There's no reason you'd do exactly this, but you can imagine a more complicated example that would follow this pattern.) Without the parens in the open() command above, the email message gets sent ONLY to the standard input of the "rm" command (which ignores it), and does not get sent to the standard input of the "cat" command (which will therefore block indefinitely waiting for input). With the parentheses, the above command works as expected, because they cause the pipe to pass the message to the standard input of the entire command list, not just to the first command in the list. This lets you do stuff like read the header up to the blank line used to separate header from body, then discard the body.
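A minimal sketch of that last idea, written as a shell function so it can be tried on a single message without foreachmail. The name headers_only is made up for illustration. The loop and the final cat are two separate commands reading from the same standard input, one after the other.

```shell
# Print the header of ONE message read from standard input, discard the body.
headers_only() {
    # First command: copy lines until the blank line that ends the header.
    while IFS= read -r line; do
        [ -z "$line" ] && break
        printf '%s\n' "$line"
    done
    # Second command: read whatever is left (the body) and throw it away,
    # so the writer is never left with a closed pipe.
    cat > /dev/null
}
```

Using "IFS= read -r" keeps the leading whitespace of continuation lines and leaves backslashes alone. The body of the function should also work when passed to foreachmail as a one-line command string.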

As it turns out, the "formail" utility, part of the "procmail" package, has a "-s" flag, which works similarly to foreachmail. (Eerie name similarity, too.) Major difference is that formail -s needs to be passed the name of an executable binary, not a string representing a shell script. However, that can be dealt with by giving it the name of a shell binary. So the equivalent to the first example using formail is

cat mbox-file | formail -s bash -c "grep '^From:' | sed -e 's/^From: //' "

So use formail if you have it, but if you don't and do have Perl installed, foreachmail is a good option.

Some good improvements would be a usage statement (instead of defaulting to procmail), some signal handling so interrupts kill the entire program rather than just the running subshell, and some better error handling (that last one is always true).