[vox-tech] vim and utf-8 support (newbie alert)

Mark K. Kim vox-tech@lists.lugod.org
Mon, 9 Jun 2003 16:57:32 -0700 (PDT)


On Mon, 9 Jun 2003, Peter Jay Salzman wrote:

> right-to-left languages are really, really, really well supported in
> vim.  at least, they seem to be.  check out:
>
>    :set rl

Cool~

> the language i'm thinking of is hebrew, but with some important issues.
>
> 1. i need vowel support.
> 2. i really want to have mixed hebrew/english
>
> i believe taken together, i want to use ISO 10646 which can represent
> all languages at the same time.

Unfortunately I don't know the Hebrew language so I don't know what the
difficulties are.  For both Korean and Japanese, we use two bytes to
represent a single Asian "character", while maintaining backwards
compatibility with ASCII by using the MSb of the first byte to flag a
multibyte character.
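
To make the MSb idea concrete, here is a small Python sketch.  It assumes
EUC-KR and EUC-JP as the two-byte encodings (I didn't name a specific one
above, so treat those as stand-ins), and `high_bit_flags` is just a
made-up helper name.  It lists each byte and whether its MSb is set:

```python
# Sketch only: EUC-KR / EUC-JP are assumed examples of the two-byte
# encodings described above; high_bit_flags is a hypothetical helper.

def high_bit_flags(text, encoding):
    """Return (byte_value, msb_is_set) pairs for text in the given encoding."""
    return [(b, bool(b & 0x80)) for b in text.encode(encoding)]


if __name__ == "__main__":
    samples = [
        ("ASCII", "A", "euc_kr"),
        ("Korean", "\ud55c", "euc_kr"),    # HANGUL SYLLABLE HAN
        ("Japanese", "\u3042", "euc_jp"),  # HIRAGANA LETTER A
    ]
    for label, text, enc in samples:
        pairs = high_bit_flags(text, enc)
        print(label, enc, [(hex(b), msb) for b, msb in pairs])
```

The ASCII letter stays a single byte with the MSb clear, while each Asian
character comes out as two bytes with the MSb set, which is how the two
can share one byte stream.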

If Hebrew does the same thing, there is no technical reason why it can't
use both English and Hebrew.  There should even be enough character space
to include vowels.  But I wouldn't know how to access them using the
Hebrew input method.

With your Hebrew terminal, try `cat`-ing any binary file.  If you see any
Hebrew vowels and letters, then you just gotta figure out how to type
them using the Hebrew input method.  If not, then the Hebrew encoding
probably isn't compatible with English and can't do vowels.

In that case, your best bet is to use Unicode with a universal encoding
like UTF-8 (Unicode does have Hebrew vowel points).  But I'm not sure how
to set that up so good luck...
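
For what it's worth, you can check the vowel question from Python without
any terminal setup.  This is only a sketch: the sample word is an
illustrative string of Hebrew letters and points that I wrote as escapes
(not a vetted spelling), and `describe` is a made-up helper name.

```python
import unicodedata

# Illustrative only: shin + qamats, lamed, vav + holam, final mem.
# Written as escapes so this source stays plain ASCII.
word = "\u05e9\u05b8\u05dc\u05d5\u05b9\u05dd"
mixed = "hello " + word  # English and Hebrew in one string


def describe(text):
    """Return (code point, Unicode name, UTF-8 byte count) per character."""
    return [
        ("U+%04X" % ord(ch), unicodedata.name(ch), len(ch.encode("utf-8")))
        for ch in text
    ]


if __name__ == "__main__":
    for row in describe(word):
        print(row)
    # The ASCII part of the mixed string is unchanged in UTF-8.
    print(mixed.encode("utf-8"))
```

The vowel marks show up under names starting with "HEBREW POINT", and the
English part of the mixed string encodes to the same bytes as plain ASCII.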

> as a first stab at getting utf-8 capable xterms, i set:
>
>    LC_CTYPE=en_US.UTF-8
>
> but weird things started to happen, like mutt's threading lines turned
> into really strange characters.  i guess the applications themselves
> need to be utf-8 aware too.

UTF-8 is compatible only with the standard (7-bit) ASCII set.  The
threading lines are in the extended ASCII set (which uses the MSb), not
the standard ASCII set.  They clash because UTF-8 uses the MSb to signal a
multibyte character, while the extended ASCII set uses the MSb for its
extra single-byte characters.
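
Here's a rough Python illustration of that clash.  It assumes code page
437 as the "extended ASCII" set, which is only a stand-in (I don't know
exactly which bytes Mutt sends to your terminal), and `decode_report` is a
made-up helper name.

```python
# Sketch only: cp437 is an assumed stand-in for "extended ASCII".

def decode_report(raw):
    """Return (is_valid_utf8, text_as_cp437) for a byte string."""
    try:
        raw.decode("utf-8")
        valid = True
    except UnicodeDecodeError:
        valid = False
    return valid, raw.decode("cp437")


if __name__ == "__main__":
    # Two single-byte line-drawing characters followed by ">".
    tree = b"\xc0\xc4>"
    print(decode_report(tree))
    # A UTF-8 reader takes each high byte as the start of a multibyte
    # sequence, finds no valid continuation, and shows junk instead.
    print(repr(tree.decode("utf-8", errors="replace")))
    # Plain 7-bit ASCII is valid either way.
    print(decode_report(b"+->"))
```

The high bytes are fine as single-byte line-drawing characters but are not
valid UTF-8, while the pure-ASCII version decodes the same under both.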

I recommend just ignoring it (you get used to it).  If not, I think you
can tell Mutt to use standard ASCII for threading lines (using +, -, |,
etc.)

> > Works great under WindowsXP (everything's in unicode; just make sure you
> > got the fonts installed.)
>
> that makes me very sad...   :(

It's one of the reasons I have WindowsXP.  The international language
support is so amazing.  I can read multi-language data files with so much
ease.  I've seen Windows2000 also do a very nice job.

One of these days, Linux needs to be totally UTF-8 based, with all
software ported to UTF-8.  But I'm thinking that's gonna require lots of
effort by too many people to happen quickly.

> it totally sucks that mixed hebrew-with-vowels/english turned out to be
> such a hard thing to do.  :( sucks even worse that it's easy on windows
> xp.   :(

I think it shows more about the shortcomings of X than about Windows
(although I'm amazed that MS did such a good job with it).  Something that
the OSS community needs to work on...  I think a part of the problem is
there isn't much information available about internationalization, and it
certainly isn't taught in schools or covered in many books.

Maybe we should start rallying for some standard everyone can work with.
One XIM to handle all languages, one way to display any language, etc.
Put 'em together and we'll have at least some structure people can port
their programs into.  Or maybe there's something already out there that
just needs good documentation.  Whatever the case, I'm tired of working
with all these hacks to get internationalization support working under X
and I bet there are many more people that want better support, too.

Just a thought.

-Mark

-- 
Mark K. Kim
http://www.cbreak.org/
PGP key available upon request.