
14 April 2008 @ 09:26 am
cp_urlencoded: Copy files while renaming to remove special characters  
Dated 2005-08-24, cp_urlencoded was created when I was trying to copy a bunch of MP3 files to an MP3 player.  The player presented itself as a removable USB drive.  As is typical, the drive was formatted as Virtual FAT. Virtual FAT is a hack to allow longer file names that can contain special characters.  However, colons (":") are still not allowed, and I do have a few audiobooks that contain colons in the file name.  So I created a script that creates a new name for each such file and then copies it (under the new name) into a new location. It doesn't rename the original file.  I decided to URL encode the file names rather than strip the offending character(s) in the hopes that the MP3 player might interpret the URL encoding correctly and display the original file name when playing the file (hope springs eternal).

This is a Perl script, and requires the CGI::Enurl Perl Module.  URL encoding basically identifies each special character (mostly punctuation) and substitutes it with three characters: the percent sign ("%") followed by a two-character hexadecimal number representing the ASCII value of the original character.


$ touch 'fo" o'
$ cp_urlencoded 'fo" o' .
copying fo" o to ./fo%22%20o


Here the double quote got changed to %22 and the space got changed to %20.
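
To make the substitution concrete, here is a minimal sketch of percent-encoding done by hand. It is not what cp_urlencoded does (the script relies on CGI::Enurl), and percent_encode is a hypothetical helper. The choice to leave letters, digits, ".", "_" and "-" alone is my assumption; CGI::Enurl's set of untouched characters may differ. It also assumes the name is a plain byte string, as file names from the command line are.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of cp_urlencoded): replace every byte that is
# not an ASCII letter, digit, ".", "_" or "-" with "%" plus two hex digits.
sub percent_encode {
    my ($name) = @_;
    $name =~ s/([^A-Za-z0-9._-])/sprintf('%%%02X', ord($1))/ge;
    return $name;
}

print percent_encode('fo" o'), "\n";   # prints fo%22%20o
```

The double quote is ASCII 0x22 and the space is ASCII 0x20, which is where the %22 and %20 in the example above come from.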

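BEGIN cp_urlencoded
use warnings;
use strict;

use CGI::Enurl;
#use Shell;

# cp-style arguments: the last one is the destination directory.
my $dest = pop;

die usage() unless defined $dest && -d $dest;

# Every remaining argument is a file to copy (processed last to first).
while (defined(my $s = pop)) {
        my $d = myenurl(basename($s));
        $d = "$dest/$d";
        $d =~ s#//#/#g;
        print "copying $s to $d\n";
        #cp($s, $d);
        `cp '$s' '$d'`;
}

# URL encode a name, using %20 rather than + for spaces.
sub myenurl {
        my $out = enurl($_[0]);
        $out =~ s/\+/%20/g;
        return $out;
}

# Strip any leading directory components.
sub basename {
        my ($s) = @_;
        $s =~ s#^.*/##;
        return $s;
}

sub usage {
        my $me = basename($0);
        return "
Usage: $me <file1> [<file2> .. <fileN>] <destdir>\n
";
}
END cp_urlencoded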

I wrote it in Perl mostly so I could use the CGI::Enurl module to do the URL encoding, but also so I could more easily implement the command-line processing logic.  I wanted this to work like the standard cp Unix command, which wants the last argument to be the destination.  It should treat every other argument as a file to be copied into the destination.  If I wrote this in the shell, I'd have to iterate through the argument list trying to figure out what to do with each one, and would have to build up a data structure to hold them until I got to the end.  With Perl, I could just pop the last argument off the list.  Data structure manipulation and the shell are the subject of an upcoming scripting philosophy post (hopefully).
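
As a small illustration of that argument handling, here is a sketch that separates the destination from the sources. split_args is a hypothetical helper, not something in cp_urlencoded, and the file names are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: treat the last argument as the destination and
# everything before it as a source file, the way cp does.
sub split_args {
    my @args = @_;
    my $dest = pop @args;      # last argument
    return ($dest, @args);     # whatever is left are the sources
}

my ($dest, @sources) = split_args('a.mp3', 'b.mp3', '/mnt/player');
print "dest=$dest sources=@sources\n";   # prints dest=/mnt/player sources=a.mp3 b.mp3
```

Unlike the while/pop loop in the script, this keeps the sources in their original order.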

After determining and checking the destination directory, cp_urlencoded loops through the remaining arguments, treating each as a file name, URL encodes the name of each, and shells out to the real cp to do the copy.  Originally, I wanted to use Perl's Shell module, which is a clever hack that makes command-line commands available as Perl functions.  Trouble was that it doesn't handle single or double quotes in file names correctly, so in the above example it would choke on the embedded double quote.  So I used Perl's backticks to shell out to cp myself.  I put single quotes around the arguments to get around the problem.  This will still fail if the source file has an embedded single quote.  The right solution is to remove the call to cp and implement the copy internally, like this (untested):

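open(my $src, '<', $s) or die qq/Can't open "$s" for reading: $!\n/;
open(my $dst, '>', $d) or die qq/Can't open "$d" for writing: $!\n/;
binmode($src);  # MP3s are binary data
binmode($dst);
while (<$src>) {
    print $dst $_;
}
close($dst) or die qq/Can't finish writing "$d": $!\n/;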

Note that there's no problem with quoting here.  Perl has sane variable semantics while the shell does not.  All the quoting problems above are the result of subshelling, not of Perl's handling of string variables or quotes.
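
Another way to avoid the subshell, sketched here on the assumption that the core File::Copy module is acceptable (the original script doesn't use it), is to let copy() do the work. The temporary directory and the awkward file name are made up for the demonstration.

```perl
use strict;
use warnings;
use File::Copy qw(copy);
use File::Temp qw(tempdir);

# Scratch directory that is removed when the program exits.
my $dir = tempdir(CLEANUP => 1);

# A source name with both kinds of quote, which would break `cp '$s' '$d'`.
my $src = qq{$dir/it's a "test"};
open(my $fh, '>', $src) or die qq/Can't create "$src": $!\n/;
print {$fh} "hello\n";
close($fh) or die qq/Can't finish writing "$src": $!\n/;

# copy() takes the names as plain Perl strings, so no shell quoting is involved.
my $dst = "$dir/copy";
copy($src, $dst) or die qq/Can't copy "$src" to "$dst": $!\n/;

my $size = -s $dst;
print "copied $size bytes\n";   # prints copied 6 bytes
```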

I chose to wrap the enurl function from CGI::Enurl because it translates spaces to pluses (presumably because that's what the encoding used for web form data calls for).  But I decided that the only non-alphanumeric character I wanted to see was the percent sign.  I don't think it really matters, though.

By the way,  Unix cp has two behaviours:  cp old new, which creates a copy of the contents of old to new, and cp f1 f2 f3 dest_dir, which copies files f1, f2, and f3 into the directory dest_dir, re-using the original file names.  cp_urlencoded only emulates the latter of cp's behaviours (but obviously translates the original file name).  Emulating the former behaviour doesn't really make sense, since the underlying premise for using this script is that you don't want to think about what the new names should be. 

I just realized that this doesn't check to make sure the files really are files before copying them.  Of course, in the bugfix, that would cause the first open to fail with a mostly-useful error message. (It would say why the read failed,  but would leave it up to the user to understand that cp_urlencoded isn't sophisticated enough to do what they want, which presumably would be a recursive copy of a directory.)
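
For what it's worth, the missing check could look something like this sketch. regular_files is a hypothetical helper, not part of cp_urlencoded, and skipping with a warning (rather than dying) is my assumption about the desired behaviour.

```perl
use strict;
use warnings;

# Hypothetical helper, not in the original script: keep only the arguments
# that name regular files, warning about anything else (e.g. directories).
sub regular_files {
    my @kept;
    for my $path (@_) {
        if (-f $path) {
            push @kept, $path;
        } else {
            warn "skipping $path: not a regular file\n";
        }
    }
    return @kept;
}

# Example use: filter the command-line arguments before copying anything.
my @to_copy = regular_files(@ARGV);
print "would copy: $_\n" for @to_copy;
```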

I suppose I should patch cp_urlencoded up; I just didn't want to change the timestamp.  I like having that as an historical record. I suppose I should be more rational about this, patch it up, and remember how to make ls list creation times rather than modification times when I get nostalgic :)