I have a bunch of directories and subdirectories that contain files with special characters in their names, like this one:
robbie@phil:~$ ls test�sktest.txt
test?sktest.txt
Find reveals an escape sequence:
robbie@phil:~$ find test�sktest.txt -ls
424512 4000 -rwxr--r-x 1 robbie robbie 4091743 Jan 26 00:34 test\323sktest.txt
The only reason I can even type their names on the console is because of tab completion. This also means I can rename them manually (and strip the special character).
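A glob can also stand in for the untypeable byte when renaming a single file by hand, for example:

mv test?sktest.txt testsktest.txt

(though whether ? matches the stray byte may depend on the shell and locale).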
I've set LC_ALL to UTF-8, which does not seem to help (not even in a fresh shell):
robbie@phil:~$ echo $LC_ALL
en_US.UTF-8
I'm connecting to the machine using ssh from my mac. It's an Ubuntu install:
robbie@phil:~$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=7.10
DISTRIB_CODENAME=gutsy
DISTRIB_DESCRIPTION="Ubuntu 7.10"
Shell is Bash, TERM is set to xterm-color.
These files have been there for quite a while, and they were not created using this install of Ubuntu, so I don't know what the system encoding settings used to be.
I've tried things along the lines of:
find . -type f -ls | sed 's/[^a-zA-Z0-9]//g'
But I can't find a solution that does everything I want: identify the files with undisplayable characters in their names, then rename them (stripping or replacing the offending bytes). I have bits and pieces, like iterating over all files and moving them, but identifying the files and formatting them correctly for the mv command seems to be the hard part.
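The closest I have is something along these lines (a rough sketch that replaces every byte outside printable ASCII with an underscore; I'm not sure how reliably the patterns match invalid bytes in a UTF-8 locale, and it breaks on names containing newlines):

# Find files whose name contains a byte outside the printable ASCII
# range (space through tilde) and replace those bytes with underscores.
find . -type f -name '*[! -~]*' | while IFS= read -r f; do
    dir=${f%/*}
    base=${f##*/}
    mv -i -- "$f" "$dir/${base//[! -~]/_}"
done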
Any extra information as to why they do not display correctly, or how to "guess" the correct encoding, is also welcome. (I've tried convmv but it doesn't seem to do exactly what I want: http://j3e.de/linux/convmv/)
I guess you see this invalid character (�) because the name contains a byte sequence that isn't valid UTF-8. File names on typical unix filesystems (including yours) are byte strings, and it's up to applications to decide what encoding to use. Nowadays the trend is to use UTF-8, but it's not universal, especially in locales that could never live with plain ASCII and have been using other encodings since before UTF-8 even existed.
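Since the names are just bytes, you can inspect them without any decoding at all, for example:

# od -c prints each byte, showing non-ASCII bytes as octal escapes
ls | od -c

In your find output the stray byte is \323, i.e. 0xD3, which would be Ó if the name were ISO-8859-1.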
Try

LC_CTYPE=en_US.iso88591 ls

to see if the file name makes sense in ISO-8859-1 (latin-1). If it doesn't, try other locales. Note that only the LC_CTYPE locale setting matters here.
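To try several candidate encodings in one go, something like this works (a sketch; locale names vary between systems, and each locale must actually be generated/installed for the setting to take effect):

for loc in en_US.iso88591 ru_RU.koi8r ja_JP.eucjp; do
    printf '=== %s ===\n' "$loc"
    LC_CTYPE=$loc ls
done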
In a UTF-8 locale, the following command will show you all files whose name is not valid UTF-8:
grep-invalid-utf8 () {
  # Print each input line that is not entirely composed of valid UTF-8
  # sequences (the alternation covers 1- to 6-byte sequences, following
  # the original UTF-8 definition).
  perl -l -ne '/^([\000-\177]|[\300-\337][\200-\277]|[\340-\357][\200-\277]{2}|[\360-\367][\200-\277]{3}|[\370-\373][\200-\277]{4}|[\374-\375][\200-\277]{5})*$/ or print'
}
find | grep-invalid-utf8
You can check if they make more sense in another locale with recode or iconv:
find | grep-invalid-utf8 | recode latin1..utf8
find | grep-invalid-utf8 | iconv -f latin1 -t utf8
Once you've determined that a bunch of file names are in a certain encoding (e.g. latin1), one way to rename them is:

find | grep-invalid-utf8 |
rename 'BEGIN {binmode STDIN, ":encoding(latin1)"; use Encode;}
        $_ = encode("utf8", $_)'
This uses the perl rename command available on Debian and Ubuntu. You can pass it -n to show what it would do without actually renaming the files.
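A dry run of the same pipeline is just:

find | grep-invalid-utf8 |
rename -n 'BEGIN {binmode STDIN, ":encoding(latin1)"; use Encode;}
        $_ = encode("utf8", $_)'

And since convmv came up in the question: once you know the source encoding, it can do the same rename in bulk. It defaults to a dry run and needs --notest to actually rename:

convmv -f latin1 -t utf8 -r .           # preview only
convmv -f latin1 -t utf8 -r --notest .  # actually rename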