[vox-tech] vim and utf-8 support (newbie alert)
Peter Jay Salzman
vox-tech@lists.lugod.org
Mon, 9 Jun 2003 17:13:59 -0700
On Mon 09 Jun 03, 4:57 PM, Mark K. Kim <markslist@cbreak.org> said:
> On Mon, 9 Jun 2003, Peter Jay Salzman wrote:
>
> > the language i'm thinking of is hebrew, but with some important issues.
> >
> > 1. i need vowel support.
> > 2. i really want to have mixed hebrew/english
> >
> > i believe taken together, i want to use ISO 10646 which can represent
> > all languages at the same time.
>
> Unfortunately I don't know the hebrew language so I don't know what the
> difficulties are. For both Korean and Japanese, we use two-bytes to
> represent a single Asian "character", while maintaining backwards
> compatibility with ASCII by using the MSb on the first character to flag a
> multibyte character.
>
> If Hebrew does the same thing, there is no technical reason why it can't
> use both English and Hebrew.
fwiw, i happen to know this is the case. :) i think you just
described (part of) the utf-8 encoding... the portion of the encoding
which ensures backwards compatibility with ascii.
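
here's a rough python sketch of what that buys you for mixed hebrew/english. the sample string ("shalom" with vowel points, plus some english) and the helper name describe() are just made up for illustration. the idea is that the english part stays plain ascii bytes, while the hebrew letters and their vowel points each become two bytes with the msb set.

```python
import unicodedata

# "shalom" with vowel points, then english. made-up sample text:
# shin, shin dot, qamats, lamed, vav, holam, final mem.
text = "\u05e9\u05c1\u05b8\u05dc\u05d5\u05b9\u05dd means peace"


def describe(s):
    """Return (code points, utf-8 bytes, ascii bytes, bytes with msb set)."""
    encoded = s.encode("utf-8")
    ascii_bytes = sum(1 for b in encoded if b < 0x80)
    high_bytes = sum(1 for b in encoded if b >= 0x80)
    return len(s), len(encoded), ascii_bytes, high_bytes


# the vowel points are separate combining characters, not part of the letter
vowel_points = [c for c in text if unicodedata.combining(c)]

if __name__ == "__main__":
    print(describe(text))
    print([unicodedata.name(c) for c in vowel_points])
    # pure ascii text is byte-for-byte identical in utf-8
    print("peace".encode("utf-8") == "peace".encode("ascii"))
```

so vowels and mixed text are fine at the encoding level. whether the editor and terminal draw the points and handle right-to-left text properly is a separate question that this doesn't touch.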
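the msb being set on every byte of a multibyte character is also part of
utf-8 encoding, and it matters because the fixed-width unicode encodings
put zero bytes inside ordinary text (an ascii letter in ucs-2 is the
letter plus a 0x00 byte), which would wreak havoc on C string handling.
that's why you don't see ucs-2 and ucs-4 used for text files and
terminals on linux: they don't keep the zero bytes out. utf-8 only
produces a NUL byte for the NUL character itself.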
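
a quick python sketch of the zero-byte problem, under a couple of assumptions: python has no codec literally named ucs-2, so "utf-16-le" stands in for it (same bytes for these characters) and "utf-32-le" stands in for ucs-4. c_strlen() is a made-up helper that mimics how C's strlen() stops at the first zero byte.

```python
def c_strlen(data: bytes) -> int:
    """Length a C string function would see: bytes before the first NUL."""
    pos = data.find(b"\x00")
    return len(data) if pos == -1 else pos


sample = "vim"
utf8 = sample.encode("utf-8")       # one byte per ascii character
ucs2 = sample.encode("utf-16-le")   # stand-in for ucs-2: two bytes each
ucs4 = sample.encode("utf-32-le")   # stand-in for ucs-4: four bytes each

# a hebrew letter in utf-8: every byte has the msb set, none is zero
shin = "\u05e9".encode("utf-8")

if __name__ == "__main__":
    for name, data in [("utf-8", utf8), ("ucs-2", ucs2), ("ucs-4", ucs4)]:
        print(name, data, "C sees", c_strlen(data), "of", len(data), "bytes")
    print(all(b & 0x80 for b in shin))
```

with the wide encodings, C string code sees the string end after the first letter. with utf-8 it sees the whole thing.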
> > as a first stab at getting utf-8 capable xterms, i set:
> >
> > LC_CTYPE=en_US.UTF-8
> >
> > but wierd things started to happen, like mutt's threading lines turned
> > into really strange characters. i guess the applications themselves
> > need to be utf-8 aware too.
>
> UTF-8 is compatible only with the standard ASCII set. The threading lines
> are in the extended ASCII set (it uses the MSb), not the standard ASCII
> set. They clash because UTF-8 uses the MSb to signal multibyte character,
> while the extended ASCII set use the MSb.
>
> I recommend just ignoring it (you get used to it). If not, I think you
> can tell Mutt to use standard ASCII for threading lines (using +, -, |,
> etc.)
unicode includes mathematical and scientific symbols, so those extended
characters are in there somewhere. it's probably just a matter of
whether you can tell mutt which characters to use for threading (and
how, of course).
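
a small python sketch of the clash. big assumption here: i'm using the old cp437 code page as a stand-in for "extended ascii", and i don't actually know which mechanism mutt uses to draw its threading lines. the helper name decodes_as_utf8() is made up too. the point is only that a line-drawing character is a single high byte in a code page like that, the same byte on its own isn't valid utf-8, and unicode has its own code point for the same character.

```python
import unicodedata


def decodes_as_utf8(data: bytes) -> bool:
    """True if the bytes form valid utf-8, False otherwise."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


legacy_byte = b"\xc4"                      # horizontal line in cp437
as_unicode = legacy_byte.decode("cp437")   # the matching unicode character
as_utf8 = as_unicode.encode("utf-8")       # three bytes, all with msb set

if __name__ == "__main__":
    print(unicodedata.name(as_unicode))
    print(decodes_as_utf8(legacy_byte))    # the lone high byte is not utf-8
    print(as_utf8, decodes_as_utf8(as_utf8))
    print(decodes_as_utf8(b"+-|"))         # plain ascii fallback always works
```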
heh. i read that even klingon and the tengwar have been proposed for unicode.
unicode has everything we need. it's "just" a matter of getting
software to use it correctly. but boy oh boy are there a lot of details
in that word "just"... :(
> It's one of the reasons I have WindowsXP. The international language
> support is so amazing. I can read multi-language data file with so much
> ease. I've seen Windows2000 also do a very nice job.
you're breaking my heart... :(
pete
--
GPG Instructions: http://www.dirac.org/linux/gpg
GPG Fingerprint: B9F1 6CF3 47C4 7CD8 D33E 70A9 A3B9 1945 67EA 951D