"Plan" for unicode

Post by **Z-Man** » Tue Feb 21, 2006 9:18 am

Main information source: The UNIX unicode FAQ at http://www.cl.cam.ac.uk/~mgk25/unicode.html
Windows wants us to handle unicode differently, but unless we want to allow unicode filenames, we can ignore it.

There will be two ways to switch from iso latin 1 to unicode.

The first way is to use wide characters. We'll need to replace std::string with std::wstring or something similar (rumor has it that on some systems, std::wstring does not exist, is identical to std::string, or uses 16-bit characters that are not wide enough). We'll have to replace most chars with the appropriate wide chars. We'll have to convert stuff to UTF-8 on input and output, because that's what the outside world expects. Memory usage will increase a lot.
You'd think we'd get easier string handling algorithms out of it, because every token in the string would be one character, so the string length in tokens would also be the character length; but that is wrong. Accents can be handled by combining characters: ä can be represented as `a` followed by a token that says `apply " to the previous character`, and an arbitrary number of those can follow.
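
To illustrate the combining-character point, here is a minimal sketch in C. The names `is_combining` and `visible_length` are made up for this example, and treating only the Combining Diacritical Marks block (U+0300 to U+036F) as combining is a deliberate simplification; real text has more combining ranges than that.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplifying assumption: only U+0300..U+036F (Combining Diacritical
   Marks) counts as combining. Real Unicode has further ranges. */
int is_combining(uint32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

/* Count the characters a user would see in an array of code points:
   combining marks attach to the previous character and add nothing. */
size_t visible_length(const uint32_t *s, size_t n) {
    size_t count = 0;
    size_t i;
    for (i = 0; i < n; ++i)
        if (!is_combining(s[i]))
            ++count;
    return count;
}

/* Two spellings of the same visible character. */
const uint32_t precomposed[] = { 0x00E4 };         /* ä as one code point */
const uint32_t decomposed[]  = { 0x0061, 0x0308 }; /* a + combining diaeresis */
```

Both arrays show one character on screen, but the second holds two tokens, so even with wide characters the token count is not the character count.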

The second way is to use UTF-8 internally. It'll save tons of memory. "We'll have to adapt all string handling functions to cope with that! A nightmare!" I hear you scream. But think a bit: in how many places do you actually need to decode UTF-8? UTF-8 preserves ASCII in both directions: ASCII characters appear as they are in a UTF-8 stream, and if you see an ASCII character in a UTF-8 stream, it really is one. So searches for "/", "\n" and ":", which is most of what we do, will continue to work as they should.
There are two places where we need to be aware of UTF-8. The obvious one is the font rendering, which will need to decode the stream. The other one is string formatting; only there do we need to know the exact number of characters, not bytes, a string has up to a certain point, to align the score table and stuff. Luckily, we already have embedded color codes that pose exactly the same problem, so we have half a handful of functions that handle formatting and that are actually used. So we'll only have to adapt those functions.
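
A rough sketch of the counting that the formatting functions would need, in C. The function name `utf8_codepoints` is hypothetical and the input is assumed to be valid, NUL-terminated UTF-8. It relies on the fact that continuation bytes always have the form 10xxxxxx, so every other byte starts a new character.

```c
#include <stddef.h>
#include <string.h>

/* Count code points in a valid UTF-8 string: every byte that is not a
   continuation byte (10xxxxxx) begins a new one. */
size_t utf8_codepoints(const char *s) {
    size_t count = 0;
    for (; *s; ++s)
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    return count;
}
```

For "J\xc3\xa4" "ger" (Jäger) `strlen` reports 6 bytes while `utf8_codepoints` reports 5 characters. Plain byte searches such as `strchr(s, ':')` need no change, because a byte below 0x80 never occurs inside a multi-byte sequence.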

I guess the choice is obvious.

The plan is to make the font rendering decode UTF-8, adapt the formatting functions in tString, convert the language files, and be done. The difficult bit is the font, but we were toying with the thought of using an external library anyway, and if that supports UTF-8, we'll have almost zero extra work to do. If we keep rolling our own, then the really difficult bit is to make the font renderer able to render some ten thousand characters, not the decoding of UTF-8.

This plan also means that casts from xmlChar * to char * to std::string are safe. I came to the conclusions above while searching for a good way to do a safe conversion.

Post by **Luke-Jr** » Tue Feb 21, 2006 10:31 am

z-man wrote: "We'll have to adapt all string handling functions to cope with that! A nightmare!" I hear you scream.

Not me... I've done this before, so I have a nice set of C UTF-8 code.

Here's my contribution:

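```c
/* Return the length in bytes of the UTF-8 sequence introduced by the
   lead byte `start`, or -1 if `start` cannot begin a sequence
   (continuation bytes 0x80-0xbf, and 0xfe/0xff). */
int utf8_charlen(unsigned char start) {
	if      ((start & 0x80) ==    0)	/* 0xxxxxxx: plain ASCII */
		return 1;
	else if ((start & 0xe0) == 0xc0)	/* 110xxxxx */
		return 2;
	else if ((start & 0xf0) == 0xe0)	/* 1110xxxx */
		return 3;
	else if ((start & 0xf8) == 0xf0)	/* 11110xxx */
		return 4;
	else if ((start & 0xfc) == 0xf8)	/* 111110xx: no longer valid under RFC 3629 */
		return 5;
	else if ((start & 0xfe) == 0xfc)	/* 1111110x: no longer valid under RFC 3629 */
		return 6;
	return -1;
}
```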

(Edit: the code didn't... you know... work.)
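
Going one step beyond a length function like the one above, the font renderer needs the actual code point. The following is only a sketch under stated assumptions: the name `utf8_decode` is invented here, it accepts sequences of at most four bytes (the RFC 3629 limit), and it does not reject overlong forms or surrogates, which a production decoder probably should.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode the code point starting at s into *out.
   Returns the number of bytes consumed, or -1 if the sequence is
   malformed. Assumption: at most 4 bytes per sequence. */
int utf8_decode(const unsigned char *s, uint32_t *out) {
    int len, i;
    uint32_t cp;

    if      (s[0] < 0x80)           { len = 1; cp = s[0]; }
    else if ((s[0] & 0xe0) == 0xc0) { len = 2; cp = s[0] & 0x1f; }
    else if ((s[0] & 0xf0) == 0xe0) { len = 3; cp = s[0] & 0x0f; }
    else if ((s[0] & 0xf8) == 0xf0) { len = 4; cp = s[0] & 0x07; }
    else return -1;                 /* stray continuation or invalid lead byte */

    for (i = 1; i < len; ++i) {
        /* Each tail byte must be 10xxxxxx; this also stops at a
           premature NUL terminator. */
        if ((s[i] & 0xc0) != 0x80)
            return -1;
        cp = (cp << 6) | (s[i] & 0x3f);
    }
    *out = cp;
    return len;
}
```

A renderer could call this in a loop, advancing by the returned length and looking up a glyph for each decoded value.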

Post by **joda.bot** » Tue Feb 21, 2006 10:58 am

Something I found ... which is pretty big but also powerful:
http://icu.sourceforge.net/userguide/icufaq.html

Java is also based on this AFAIK.

Post by **Z-Man** » Tue Feb 21, 2006 12:11 pm

Luke: yep, that's it. Now add color code length detection and we're all set

We'll probably want to give color codes a new start code, it's annoying not to be able to write "320x200".

Oh, and network transfer of strings will need translation for old versions, obviously. A small piece of work I forgot.

ICU probably is a bit too big for our purposes, we won't need conversion to different character sets or date and time handling.

Post by **Luke-Jr** » Tue Feb 21, 2006 1:51 pm

z-man wrote: Luke: yep, that's it. Now add color code length detection and we're all set. We'll probably want to give color codes a new start code, it's annoying not to be able to write "320x200".

While we're at it, how about just having the client handle colour code insertion? Just send them over the network as a four byte sequence: "I am colour", red, green, blue.
For "I am colour", we have quite a few choices (or so my quick-and-dirty utf8 tester says): 128-191, 254, and 255. 128-191 are reserved for tail bytes, so we want to avoid using them. 254 and 255 are (I think) not used by UTF-8 in order to ensure BOM works, but unless we plan to ever support UTF-16, we don't need to worry about that.
Now for another idea that adds a bit of complexity, but makes it work with a 'charlen' function properly:
Encode the colour by taking 01FFFFFF (dissection: 01 = colour extension identifier; FF = red; FF = green; FF = blue) and applying UTF-8 to *that*, violating the RFC's rule re too-long sequences (so we get something unique to AA): fc81bfbfbfbf
fc81bfbfbfbf = white
fc8180808080 = black
fc81bfb08080 = red
fc81808fbc80 = green
fc81808083bf = blue
Yeah, not too pretty, but... shrug
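
The packed encoding described above can be reproduced with a few lines of C. This is a sketch only; the function name `encode_colour` is made up for illustration. It places the 25-bit value 0x01RRGGBB into the six-byte UTF-8 layout (1111110x followed by five 10xxxxxx bytes), which is longer than the value needs and therefore an overlong sequence.

```c
#include <stdint.h>

/* Write the six-byte overlong UTF-8 form of 0x01RRGGBB into out. */
void encode_colour(uint8_t r, uint8_t g, uint8_t b, unsigned char out[6]) {
    uint32_t v = 0x01000000u | ((uint32_t)r << 16) | ((uint32_t)g << 8) | b;
    int i;

    /* Lead byte 1111110x carries bit 30 of the 31-bit payload. */
    out[0] = (unsigned char)(0xfc | (v >> 30));

    /* Five tail bytes carry six payload bits each, highest first. */
    for (i = 1; i < 6; ++i)
        out[i] = (unsigned char)(0x80 | ((v >> (6 * (5 - i))) & 0x3f));
}
```

Calling it with (255, 0, 0) yields fc 81 bf b0 80 80, matching the red entry in the list. Strict decoders that follow RFC 3629 reject five- and six-byte sequences, so ordinary text tools would treat these bytes as invalid.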

z-man wrote:Oh, and network transfer of strings will need translation for old versions, obviously. A small piece of work I forgot.

It will if we change colour-code stuff... not sure why else it would, if we just assume old versions used ASCII (which we probably can't)

z-man wrote:ICU probably is a bit too big for our purposes, we won't need conversion to different character sets or date and time handling.

Besides, what does it do that iconv doesn't? =p

Post by **Tank Program** » Tue Feb 21, 2006 6:36 pm

We could use the unicode color codes that IRC uses...

Either way of using unicode sounds good to me, z-man. As long as what the user has to do does not change too much, anything should be fine really.

Post by **Luke-Jr** » Tue Feb 21, 2006 7:59 pm

Tank Program wrote:We could use the unicode color codes that IRC uses...

IRC doesn't use unicode color codes. IRC doesn't even support unicode.
IRC just uses something like %K followed by a DOS colour code which is 16 colour.

Post by **Tank Program** » Tue Feb 21, 2006 11:10 pm

Not when I last looked it up, unless \uXXXX isn't unicode.

Post by **Luke-Jr** » Wed Feb 22, 2006 2:41 am

Tank Program wrote:Not when I last looked it up, unless \uXXXX isn't unicode.

http://www.irchelp.org/irchelp/rfc/rfc.html
Where do you see any mention of Unicode or \uXXXX?

Post by **Tank Program** » Wed Feb 22, 2006 8:02 pm

I believe I was packet capturing at the time. Doing some simple google searching revealed this:

a bit of a how-to having to do with java and irc wrote: String plain = "A plain message";

String red1 = "\u000304A red message";

String red2 = "\u0003" + "04" + "A red message";

String whiteOnBlack = "\u000300,01" + "White text on black background";

So not strictly unicode, but expressed as unicode is a possibility.

Post by **Luke-Jr** » Thu Feb 23, 2006 8:12 am

Tank Program wrote:I believe I was packet capturing at the time. Doing some simple google searching revealed this:
a bit of a how-to having to do with java and irc wrote: String plain = "A plain message";

String red1 = "\u000304A red message";

String red2 = "\u0003" + "04" + "A red message";

String whiteOnBlack = "\u000300,01" + "White text on black background";
So not strictly unicode, but expressed as unicode is a possibility.

Not really Unicode any more than ASCII. It's just a char with value 3. And again, it limits us to 16 colours... Anything wrong with my suggestions?

Post by **Z-Man** » Thu Feb 23, 2006 9:12 am

Luke-Jr wrote:Anything wrong with my suggestions?

To be honest, I don't want to think about it right now. I'd probably prefer a scheme that is valid UTF-8 even for normal tools (so it can be inserted into text files), just using one of the certainly existing unused characters as lead-in. Like one of the control characters from 01 to 31 we don't have a use for.

Post by **Tank Program** » Thu Feb 23, 2006 5:55 pm

I'd vote for something escaped for color codes, or even for keeping the current system. Something that's user friendly should, of course, remain a top priority.

As for unicode in general, sounds like you've got a good plan, z-man.

Post by **Jonathan** » Thu Feb 23, 2006 6:40 pm

I was thinking about escapable escapes (if that's how you'd call it):
\cff0000red => red
\\cff0000red => \cff0000red
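
A minimal C sketch of how such escapable escapes could be scanned when computing what the user actually sees. The helper names `has_six_hex` and `strip_colour_escapes` are hypothetical, and the exact syntax (`\c` followed by exactly six hex digits) is assumed from the two examples above.

```c
#include <ctype.h>
#include <stddef.h>

/* True if s starts with six hexadecimal digits (stops safely at NUL). */
int has_six_hex(const char *s) {
    int i;
    for (i = 0; i < 6; ++i)
        if (!isxdigit((unsigned char)s[i]))
            return 0;
    return 1;
}

/* Copy src to dst, dropping \cRRGGBB colour escapes and turning an
   escaped backslash (two backslashes) into a single literal one.
   dst must have room for strlen(src) + 1 bytes. */
void strip_colour_escapes(const char *src, char *dst) {
    while (*src) {
        if (src[0] == '\\' && src[1] == '\\') {
            *dst++ = '\\';          /* escaped backslash: keep one */
            src += 2;
        } else if (src[0] == '\\' && src[1] == 'c' && has_six_hex(src + 2)) {
            src += 8;               /* skip backslash, 'c', six digits */
        } else {
            *dst++ = *src++;        /* ordinary byte, including UTF-8 tails */
        }
    }
    *dst = '\0';
}
```

Because the escape consists only of ASCII bytes, the scan works byte by byte on UTF-8 text without decoding it.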

Post by **Z-Man** » Fri Feb 24, 2006 12:28 am

Escape codes sound perfect. They'd still be editable by users (curse and blessing, as today). Sure, they'll take more space than the packed version.

Armagetron Forums