[vox-tech] vim and utf-8 support (newbie alert)

Micah J. Cowan vox-tech@lists.lugod.org
Mon, 9 Jun 2003 16:04:01 -0700


On Mon, Jun 09, 2003 at 03:15:58PM -0700, Peter Jay Salzman wrote:
> note: in what follows, i'm a bit schizophrenic about "iso 10646" and
> "unicode".  since the tables and encodings are compatible in the most
> recent versions of the standard, i'm using them interchangeably.

Yeah, I pretty much always use "Unicode" to mean both of those. As far
as I'm concerned, there's not enough difference between them to
warrant careful distinction (in informal conversation, anyway), and
"Unicode" is so much easier to remember/pronounce than "ISO 10646"...

> 
> On Mon 09 Jun 03,  2:35 PM, Micah J. Cowan <micah@cowan.name> said:
> > On Mon, Jun 09, 2003 at 04:06:01PM -0500, Jay Strauss wrote:
> > 
> > OOC, Pete, are you planning on doing Hebrew homework or something like
> > that with vim?
>  
> i have some notes on vocabulary and grammar in dead tree format that i'd
> like to convert into magnetic format.   ;-)
> 
> >   2. I don't believe you can get the Hebrew vowels; but I haven't
> >      tried.
>  
> i only learned what ISO 10646 and utf is a few hours ago, but i thought
> that was the whole point of the ISO standard and unicode.

I was speaking of Emacs specifically... (check OM for context).

> i read that some of the characters in the 31 bit characterset were
> designated "combination characters" which provide accents for
> characters.

Yeah; what's great is there's the "combination characters" (the
standard calls them "combining characters"), and, for reasons of
compatibility with existing encoding standards, there are also
characters which already have the vowels combined in (IIRC). I have a
copy of Unicode 3.0 in dead-tree format, but not with me. (I believe
the latest version is 4.0, released very recently.)
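
To make the two flavours concrete, here is a minimal sketch, assuming
Python 3 and its standard unicodedata module (the variable names are
just for illustration). It puts a Hebrew letter followed by a separate
combining vowel point next to the precombined compatibility character
for the same pair:

```python
import unicodedata

# ALEF followed by the combining vowel point PATAH (two code points).
combined = "\u05d0\u05b7"
# Precombined compatibility character for the same letter + vowel.
precombined = "\ufb2e"

# Print the code point and official name of each character involved.
for ch in combined + precombined:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# A nonzero canonical combining class marks a combining character.
is_combining = unicodedata.combining("\u05b7") != 0

# The precombined character records its decomposition as hex code points.
decomp = unicodedata.decomposition(precombined)
print(is_combining, decomp)
```

Escape sequences are used on purpose, so the sketch does not depend on
the terminal or editor being able to display Hebrew.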

> a) these are included in unicode for backwards compatibility
> b) you can always use two characters (combination characters) to
> represent pre-composed characters.

True. However, some formats will insist on one or the other, wherever
possible. For example, XML 1.1 asks that characters be precombined
to the extent possible. The main reason was that this happens to be
the form most documents are already in (at least for Latin-script
languages, which were probably converted over from ISO 8859). They
also wanted to settle on a single canonical representation, so that
they could still use a byte-by-byte comparison without having to
worry about whether there are two versions of "resume" (sorry, the
station I'm at doesn't have mule, so pretend there are accents).
Technically, in XML 1.0, it is quite possible to have two completely
separate names that are equivalent when normalized but differ in
their byte representation (i.e., one might use combining characters,
the other precombined ones).
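
The byte-comparison problem can be sketched in a few lines, assuming
Python 3 and its standard unicodedata module. The helper nfc() and the
variable names are made up for illustration, and NFC is used here as
the "precombined wherever possible" form:

```python
import unicodedata

# "resume" with accents, written two ways.
precombined = "r\u00e9sum\u00e9"    # e-acute as the single code point U+00E9
decomposed = "re\u0301sume\u0301"   # plain e followed by U+0301 COMBINING ACUTE ACCENT

# The raw UTF-8 byte sequences differ, so a naive comparison fails.
same_bytes = precombined.encode("utf-8") == decomposed.encode("utf-8")

def nfc(text):
    """Return text in Normalization Form C (composed where possible)."""
    return unicodedata.normalize("NFC", text)

# After normalizing both sides, the byte-by-byte comparison succeeds.
same_after_nfc = nfc(precombined).encode("utf-8") == nfc(decomposed).encode("utf-8")

print(same_bytes, same_after_nfc)
```

If everything is stored in one normalized form up front, the plain
byte comparison is enough and nobody has to normalize at compare time.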

> > Doesn't help you much, though, does it? ;)
>  
> heh.  well, before all this, i had zip, zero, nada knowledge of unicode,
> iso 10646, encodings, character tables, utf-2, utf-4, utf-8 and all
> sorts of non-english non-sense.

Unicode rocks, doesn't it? :)

-Micah