I have a bunch of directories and subdirectories that contain files with special characters in their names, like this one:
robbie@phil:~$ ls test�sktest.txt
test?sktest.txt
Find reveals an escape sequence:
robbie@phil:~$ find test�sktest.txt -ls
424512 4000 -rwxr--r-x 1 robbie robbie 4091743 Jan 26 00:34 test\323sktest.txt
The only reason I can even type their names on the console is because of tab completion. This also means I can rename them manually (and strip the special character).
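A glob can also stand in for the untypeable byte when renaming a single file by hand, for example:

mv test?sktest.txt testsktest.txt

(though whether ? matches the stray byte may depend on the shell and locale).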
I've set LC_ALL to UTF-8, which does not seem to help (not even in a fresh shell):
robbie@phil:~$ echo $LC_ALL
en_US.UTF-8
I'm connecting to the machine using ssh from my mac. It's an Ubuntu install:
robbie@phil:~$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=7.10
DISTRIB_CODENAME=gutsy
DISTRIB_DESCRIPTION="Ubuntu 7.10"
Shell is Bash, TERM is set to xterm-color.
These files have been there for quite a while, and they were not created using this install of Ubuntu, so I don't know what the system encoding settings used to be.
I've tried things along the lines of:
find . -type f -ls | sed 's/[^a-zA-Z0-9]//g'
But I can't find a solution that does everything I want: identify the files with undisplayable characters in their names, then rename them (stripping or replacing the offending bytes). I have bits and pieces, like iterating over all files and moving them, but identifying the files and formatting them correctly for the mv command seems to be the hard part.
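The closest I have is something along these lines (a rough sketch that replaces every byte outside printable ASCII with an underscore; I'm not sure how reliably the patterns match invalid bytes in a UTF-8 locale, and it breaks on names containing newlines):

# Find files whose name contains a byte outside the printable ASCII
# range (space through tilde) and replace those bytes with underscores.
find . -type f -name '*[! -~]*' | while IFS= read -r f; do
    dir=${f%/*}
    base=${f##*/}
    mv -i -- "$f" "$dir/${base//[! -~]/_}"
done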
Any extra information as to why they do not display correctly, or how to "guess" the correct encoding, is also welcome. (I've tried convmv but it doesn't seem to do exactly what I want: http://j3e.de/linux/convmv/)
I guess you see this invalid character (�) because the name contains a byte sequence that isn't valid UTF-8. File names on typical unix filesystems (including yours) are byte strings, and it's up to applications to decide what encoding to use. Nowadays the trend is to use UTF-8, but it's not universal, especially in locales that could never live with plain ASCII and have been using other encodings since before UTF-8 even existed.
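Since the names are just bytes, you can inspect them without any decoding at all, for example:

# od -c prints each byte, showing non-ASCII bytes as octal escapes
ls | od -c

In your find output the stray byte is \323, i.e. 0xD3, which would be Ó if the name were ISO-8859-1.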
Try

LC_CTYPE=en_US.iso88591 ls

to see if the file name makes sense in ISO-8859-1 (latin-1). If it doesn't, try other locales. Note that only the LC_CTYPE locale setting matters here.
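To try several candidate encodings in one go, something like this works (a sketch; locale names vary between systems, and each locale must actually be generated/installed for the setting to take effect):

for loc in en_US.iso88591 ru_RU.koi8r ja_JP.eucjp; do
    printf '=== %s ===\n' "$loc"
    LC_CTYPE=$loc ls
done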
In a UTF-8 locale, the following command will show you all files whose name is not valid UTF-8:
grep-invalid-utf8 () {
  # Print each input line that is not entirely composed of valid UTF-8
  # sequences (the alternation covers 1- to 6-byte sequences, following
  # the original UTF-8 definition).
  perl -l -ne '/^([\000-\177]|[\300-\337][\200-\277]|[\340-\357][\200-\277]{2}|[\360-\367][\200-\277]{3}|[\370-\373][\200-\277]{4}|[\374-\375][\200-\277]{5})*$/ or print'
}
find | grep-invalid-utf8
You can check if they make more sense in another locale with recode or iconv:
find | grep-invalid-utf8 | recode latin1..utf8
find | grep-invalid-utf8 | iconv -f latin1 -t utf8
Once you've determined that a bunch of file names are in a certain encoding (e.g. latin1), one way to rename them is:

find | grep-invalid-utf8 |
rename 'BEGIN {binmode STDIN, ":encoding(latin1)"; use Encode;}
        $_ = encode("utf8", $_)'
This uses the perl rename command available on Debian and Ubuntu. You can pass it -n to show what it would do without actually renaming the files.
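A dry run of the same pipeline is just:

find | grep-invalid-utf8 |
rename -n 'BEGIN {binmode STDIN, ":encoding(latin1)"; use Encode;}
        $_ = encode("utf8", $_)'

And since convmv came up in the question: once you know the source encoding, it can do the same rename in bulk. It defaults to a dry run and needs --notest to actually rename:

convmv -f latin1 -t utf8 -r .           # preview only
convmv -f latin1 -t utf8 -r --notest .  # actually rename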