28 May 2008 @ 09:01 pm
mailsort.pl: Sort mbox style email messages according to date  
OK, this post is quite self-indulgent, so it's best to just get this over with.  I wanted to write about my oldest script, which is dated 1998-07-02.  Unfortunately, it's not such a great script.  First off, I never use it, so it's just luck that it's been in my bin directory for almost 10 years.  Secondly, I'm kind of disappointed that I have nothing older--I subtitled this blog "scripts & hacks since 1994" because that's the year I entered college and was first introduced to Unix and programming.  At that time I didn't own a computer, so I moved scripts around on floppy disks (then Zip disks, then CD-ROMs).  It's possible I do have older stuff in an archive somewhere.  Thirdly, while it seems to still work, I can't really recommend it.  I'd be very, very wary of how well it identifies the end of one email message and the start of the next one.  Similarly, the code to interpret different ways of writing dates is bound to break spectacularly.  Let me know if you use it with any success.

mailsort.pl reads an mbox-style mail file named on the command line, sorts the emails by date, and spits them out onto standard output.  It's written in Perl.  I'm sort of proud of this, because while I can see lots of inefficient memory usage, no handling of standard input (which Perl makes trivial to do), and lots of other naive, inefficient, or nonsensical constructs, I did have some reasonable discipline and sense of structure.

Amazingly, I do remember writing this.  I had just graduated from Syracuse University, was taking a couple of months off before going to work, and between visiting friends and going to New Orleans, I spent some time organizing the stuff I saved from college.  One of those things was email.  For most of college, I would archive my email each year, because at first I had no computer and had to use the multi-user time-sharing Solaris hosts.  We had very limited disk quotas (I want to say 5 MB!).  But after I got my own computer with (wait for it) a 40 MB hard disk, I realized that offline archiving of email was unnecessary.  So I went about merging my yearly email archives together.  I didn't just restore from external media; I also reorganized the email folder structure, which meant some emails got out of chronological order.  Thus, I wrote this script.

mbox-style email is one of the ways Unix/Linux systems store email in files.  Basically, each email "folder" is a single file.  Every distinct email begins with the string "\nFrom ".  That's the word "From" at the beginning of a line, followed by a space.  The rest of the line varies, but usually contains the envelope sender's address and the delivery date.  Note that this is not the same as the "From:" or "Date:" headers of an email.  Those come later.
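
To make that rule concrete, here's a minimal sketch of splitting an mbox on those "From " lines.  It only illustrates the rule as described above and is not necessarily how Mail::Util's read_mbox works internally; split_mbox and the sample addresses are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of mailsort.pl): split an mbox stream into
# messages. Returns a list of array refs, one per message, each holding
# that message's lines, beginning with its "From " separator line.
sub split_mbox {
    my ($fh) = @_;
    my @messages;
    while (my $line = <$fh>) {
        # "From" plus a space at the start of a line opens a new message.
        # A "From:" header does not match, because of the colon.
        push @messages, [] if $line =~ /^From /;
        # Lines before the first separator (if any) are dropped.
        push @{ $messages[-1] }, $line if @messages;
    }
    return @messages;
}

# Demo on an in-memory mbox holding two messages.
my $mbox = <<'END';
From alice@example.com Thu Jul  2 10:00:00 1998
From: alice@example.com
Subject: first

Hello.

From bob@example.com Wed Jul  1 09:00:00 1998
From: bob@example.com
Subject: second

Hi there.
END

open(my $fh, '<', \$mbox) or die "cannot open in-memory mbox: $!\n";
my @messages = split_mbox($fh);
close($fh);

printf "%d messages\n", scalar @messages;
print "  ", $_->[0] for @messages;    # the separator line of each message
```

Because the sub takes a filehandle, the same code would work on \*STDIN.  The weakness is also plain to see: a body line that happens to start with "From " would be mistaken for a new message, which is why mbox writers conventionally escape such lines as ">From ".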

I don't recall the details, but some Unix email software used different formats, ranging from using the Content-length: header to delineate messages, to a ridiculously sane approach of putting each email in its own file, then grouping emails together in subdirectories.  (I know, that's crazy.)

I like the mbox style, not because it's easy to use (it isn't) or because it's more efficient (it's not) but because it's what sendmail uses.  And I am the master of sendmail (http://www.sendmail.org).  But that's another rant entirely, and I'm already egregiously off-topic.

OK, enough of that wank.  This code uses Perl's Mail::Util and Mail::Header modules, which is overkill, but at the time I thought code reuse was the ultimate accomplishment.  Here it is:

BEGIN mailsort.pl
#!/usr/local/bin/perl

use Mail::Util qw(read_mbox);
use Mail::Header;

%nmon = (
        'Jan'   => 1,
        'Feb'   => 2,
        'Mar'   => 3,
        'Apr'   => 4,
        'May'   => 5,
        'Jun'   => 6,
        'Jul'   => 7,
        'Aug'   => 8,
        'Sep'   => 9,
        'Oct'   => 10,
        'Nov'   => 11,
        'Dec'   => 12,
);

die usage() unless @ARGV;

$filein = shift @ARGV;

@msgrefs = Mail::Util::read_mbox ($filein);

$i = 0;
foreach $msgref (@msgrefs) {
        @tmphead = ();
        $date = '';
        foreach $line (@{$msgref}) {

                if (( $new_msg == 0 and $line =~ /^\s*$/ )
                        or $body[$i] ) {
                # start of body

                        $body[$i] .= $line;

                } else {

                        push @tmphead, $line;
                        $new_msg = 0;

                }
        }

        $head[$i] = new Mail::Header (\@tmphead, 'MailFrom' => 'KEEP');
        $new_msg = 1;
        chomp($date = $head[$i]->get ('Date'));
        #print "$i: no date from header obj\n" unless $date =~ /.+/;
        $date[$i] = [ splitmaildate($date) ];
        #print "$i: from splitmaildate($date): @{$date[$i]}\n";
        $nums[$i] = $i;

        $i++;
}

@nums = sort by_date @nums;

foreach $i (@nums) {
        #print "\nSD $i: @{$date[$i]}\n\n";
        $head[$i]->print();
        print "\n", $body[$i];
}

sub by_date {
        my @adate = @{$date[$a]};
        my @bdate = @{$date[$b]};
        my ($afoo, $bfoo);

        foreach $afoo (@adate) {
                $bfoo = shift @bdate;
                $o = ($afoo <=> $bfoo);
                if ($o != 0) { return ($o); }
        }

        0;
}
sub splitmaildate {
        my ($date) = @_;

        if ($date =~
/\s*(\w\w\w,? )?(\d\d?) (\w\w\w) (\d\d+) (\d\d?:\d\d?:\d\d?)\s*(.*)/) {
                $day = $1; $ndate = $2; $mon = $3;
                $year = $4; $time = $5; $foo = $6;
        } elsif ($date =~
/\s*(\w\w\w,? )?(\d\d?) (\w\w\w) (\d\d?:\d\d?:\d\d?) (\d\d+)\s*(.*)/) {
                $day = $1; $ndate = $2; $mon = $3;
                $year = $5; $time = $4; $foo = $6;
        }

        $mon = $nmon{$mon};

        if ($year < 100) { $year += 1900; }

        $time =~ /(\d\d?):(\d\d?):(\d\d?)/;
        $sec = $3; $min = $2; $hour = $1;

        ($year, $mon, $ndate, $hour, $min, $sec);
}

sub usage { "usage: $0 <mail-file-to-be-sorted>\n"; }

END mailsort.pl

All this does is read in the mbox file, dig out the header and body of each message, sort the messages by date, and print them out.  The date parsing code is an accident waiting to happen--there's no good solution, because although RFC 822 specifies a format for the Date: header, plenty of email clients write it however they want.
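
For what it's worth, here's a sketch of a somewhat sturdier way to get a sort key out of a Date: header, assuming the header looks like the RFC 822/2822 form ("Thu, 2 Jul 1998 10:00:00 -0400").  date_to_epoch is a hypothetical helper, not part of mailsort.pl.  It uses Time::Local's timegm to turn the date into seconds since the epoch in UTC, applies a numeric time zone offset when one is present (mailsort.pl ignores time zones), and returns undef rather than guessing when the header doesn't match.  Two-digit years follow the RFC 2822 rule (00-49 means 20xx, 50-99 means 19xx), and a missing or named zone such as "EST" is treated as UTC, which is a simplification.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

my %MONTH = (
    Jan => 0, Feb => 1, Mar => 2, Apr => 3, May => 4,  Jun => 5,
    Jul => 6, Aug => 7, Sep => 8, Oct => 9, Nov => 10, Dec => 11,
);

# Hypothetical helper, not part of mailsort.pl: convert a Date: header
# value to seconds since the epoch (UTC), or undef if it can't be parsed.
sub date_to_epoch {
    my ($date) = @_;
    return undef unless defined $date;

    my ($mday, $mon, $year, $hour, $min, $sec, $zone) = $date =~ m{
        ^\s* (?: [A-Za-z]{3} ,? \s+ )?          # optional day of week
        (\d{1,2}) \s+ ([A-Za-z]{3}) \s+ (\d{2,4}) \s+
        (\d{1,2}) : (\d{2}) (?: : (\d{2}) )?    # seconds are optional
        (?: \s+ ([+-]\d{4}) )?                  # optional numeric zone
    }x or return undef;

    my $m = $MONTH{ ucfirst lc $mon };
    return undef unless defined $m;

    # Assumption: expand short years the way RFC 2822 suggests.
    if    ($year < 50)   { $year += 2000 }
    elsif ($year < 1000) { $year += 1900 }

    # timegm dies on impossible values (day 32, hour 25), so trap that.
    my $epoch = eval {
        timegm(defined($sec) ? $sec : 0, $min, $hour, $mday, $m, $year);
    };
    return undef unless defined $epoch;

    # "-0400" means the clock was 4 hours behind UTC, so add 4 hours back.
    if (defined $zone && $zone =~ /^([+-])(\d\d)(\d\d)$/) {
        my $offset = ($2 * 60 + $3) * 60;
        $epoch -= ($1 eq '-') ? -$offset : $offset;
    }
    return $epoch;
}

# Demo: three dates on the same day, written in different zones.
my @dates = (
    'Thu, 2 Jul 1998 10:00:00 -0400',    # 14:00 UTC
    'Thu, 2 Jul 1998 15:30:00 +0200',    # 13:30 UTC
    '2 Jul 98 13:45:00',                 # no zone: treated as UTC
);
my @sorted = sort { date_to_epoch($a) <=> date_to_epoch($b) } @dates;
print "$_\n" for @sorted;
```

With these three sample dates, sorting by epoch puts the +0200 message first, whereas comparing the clock fields alone, as mailsort.pl does, would put the -0400 one first.  Dates that put the time before the year (the second pattern in splitmaildate) are not handled here and would come back as undef.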