[vox-tech] vim and utf-8 support (newbie alert)

Peter Jay Salzman vox-tech@lists.lugod.org
Mon, 9 Jun 2003 17:13:59 -0700


On Mon 09 Jun 03,  4:57 PM, Mark K. Kim <markslist@cbreak.org> said:
> On Mon, 9 Jun 2003, Peter Jay Salzman wrote:
> 
> > the language i'm thinking of is hebrew, but with some important issues.
> >
> > 1. i need vowel support.
> > 2. i really want to have mixed hebrew/english
> >
> > i believe taken together, i want to use ISO 10646 which can represent
> > all languages at the same time.
> 
> Unfortunately I don't know the hebrew language so I don't know what the
> difficulties are.  For both Korean and Japanese, we use two bytes to
> represent a single Asian "character", while maintaining backwards
> compatibility with ASCII by using the MSb on the first byte to flag a
> multibyte character.
> 
> If Hebrew does the same thing, there is no technical reason why it can't
> use both English and Hebrew.

fwiw, i happen to know this is the case.   :)    i think you just
described (part of the) utf-8 encoding... the portion of the encoding
which ensures backwards compatibility.

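the msb being set is also part of utf-8 encoding, and is necessary
because the fixed-width unicode encodings pad ascii characters with zero
bytes, which would wreak havoc on C string handling (a zero byte ends a
C string).  that's why you don't see ucs-2 and ucs-4 encoding used for
plain text on linux.  they don't set the msb, and they're full of zero
bytes; utf-8 never emits a zero byte except for NUL itself.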
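
here's a minimal sketch of that point in python.  the sample string is
my own choice ("hi " plus a hebrew word with vowel points), and i'm
assuming utf-16-le and utf-32-le as stand-ins for ucs-2 and ucs-4, which
give the same bytes for these characters.

```python
# "hi " followed by the hebrew word shalom written with vowel points
# (escapes are used so the example does not depend on the mail charset)
text = "hi \u05e9\u05c1\u05b8\u05dc\u05d5\u05b9\u05dd"

utf8 = text.encode("utf-8")
ucs2 = text.encode("utf-16-le")   # stand-in for ucs-2 (assumption)
ucs4 = text.encode("utf-32-le")   # stand-in for ucs-4 (assumption)

def msb_set(byte):
    """True if the most significant bit of a byte value is set."""
    return (byte & 0x80) != 0

# the first three bytes are plain ascii, the rest belong to hebrew characters
ascii_part, hebrew_part = utf8[:3], utf8[3:]

print(all(not msb_set(b) for b in ascii_part))   # ascii bytes: msb clear
print(all(msb_set(b) for b in hebrew_part))      # multibyte bytes: msb set
print(0 in utf8, 0 in ucs2, 0 in ucs4)           # which encodings hold zero bytes
print(len(utf8), len(ucs2), len(ucs4))           # encoded sizes in bytes
```

the ascii part comes through unchanged in utf-8, every byte of the
hebrew part has the msb set, and only the two wide encodings contain
zero bytes that would cut a C string short.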

> > as a first stab at getting utf-8 capable xterms, i set:
> >
> >    LC_CTYPE=en_US.UTF-8
> >
> > but weird things started to happen, like mutt's threading lines turned
> > into really strange characters.  i guess the applications themselves
> > need to be utf-8 aware too.
> 
> UTF-8 is compatible only with the standard ASCII set.  The threading lines
> are in the extended ASCII set (it uses the MSb), not the standard ASCII
> set.  They clash because UTF-8 uses the MSb to signal a multibyte character,
> while the extended ASCII set uses the MSb for its own extra characters.
> 
> I recommend just ignoring it (you get used to it).  If not, I think you
> can tell Mutt to use standard ASCII for threading lines (using +, -, |,
> etc.)
 
unicode includes mathematical and scientific symbols, and it has a
box-drawing block too, so those extended characters are in there
somewhere.  it's probably just a matter of whether you can tell mutt
which characters to use for threading (and how, of course).
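
a rough sketch of the clash, assuming the threading lines show up as
code page 437 style line-drawing bytes.  that's only an assumption for
illustration; what mutt really sends depends on the terminal and curses
setup.  the helper name decodes_as_utf8 is made up for this example.

```python
# hypothetical thread-tree fragment as single "extended ascii" bytes:
# 0xC0 is a corner and 0xC4 a horizontal line in code page 437
legacy = b"\xc0\xc4>"

def decodes_as_utf8(data):
    """Return True if data is a valid utf-8 byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# the same two line characters exist in unicode's box-drawing block
as_unicode = legacy.decode("cp437")
as_utf8 = as_unicode.encode("utf-8")

print(decodes_as_utf8(legacy))    # the raw legacy bytes
print(decodes_as_utf8(as_utf8))   # the re-encoded line characters
print(as_utf8)
```

the raw bytes are not valid utf-8, which is why a utf-8 terminal shows
junk, but the same line characters re-encoded from unicode are fine
(three bytes each instead of one).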

heh.  i read that klingon and the tengwar were even proposed for unicode,
though as far as i know neither made it into the standard itself.

unicode has everything we need.  it's "just" a matter of getting
software to use it correctly.  but boy oh boy are there a lot of details
in that word "just"...   :(

> It's one of the reasons I have WindowsXP.  The international language
> support is so amazing.  I can read multi-language data files with so much
> ease.  I've seen Windows2000 also do a very nice job.

you're breaking my heart...   :(

pete

-- 
GPG Instructions: http://www.dirac.org/linux/gpg
GPG Fingerprint: B9F1 6CF3 47C4 7CD8 D33E 70A9 A3B9 1945 67EA 951D