"Plan" for unicode

Z-Man
God & Project Admin
Posts: 11233
Joined: Sun Jan 23, 2005 6:01 pm
Location: Cologne, Jabber: [email protected]

Post by Z-Man »

Main information source: The UNIX unicode FAQ at http://www.cl.cam.ac.uk/~mgk25/unicode.html
Windows wants us to handle unicode differently, but unless we want to allow unicode filenames, we can ignore it.

There will be two ways to switch from iso latin 1 to unicode.

The first way is to use full width characters. We'll need to replace std::string with std::wstring or something similar (rumor has it that on some systems, std::wstring does not exist, or is identical to std::string, or uses 16 bit characters that are not wide enough). We'll have to replace most chars with the appropriate wide chars. We'll have to convert stuff to UTF-8 on input and output, because that's what the outside world expects. Memory usage will increase a lot.
You'd think we'd get easier string handling algorithms by it because every token in the string will be one character, so the string length in tokens also is the character length; but that is wrong. Accents can be handled by combining characters. ä can be represented as `a` followed by a token that says `apply " to the previous character`, and an arbitrary number of those can follow.
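
To make the combining character point concrete, here is a minimal C sketch. The names `is_combining_mark` and `visible_length` are made up for illustration, and the check only covers the Combining Diacritical Marks block (U+0300 to U+036F); real text has more combining ranges, so treat it as a demonstration rather than a complete test.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: true for code points in the Combining Diacritical
   Marks block (U+0300..U+036F). Other combining ranges are ignored. */
static int is_combining_mark(wchar_t c) {
    return c >= 0x0300 && c <= 0x036F;
}

/* Counts only the wide characters that start a new visible character,
   i.e. everything that is not a combining mark. */
size_t visible_length(const wchar_t *s) {
    size_t n = 0;
    for (; *s != L'\0'; ++s)
        if (!is_combining_mark(*s))
            ++n;
    return n;
}
```

For the string `L"a\u0308"` (an `a` followed by a combining diaeresis), `wcslen` reports two tokens while `visible_length` reports one, so even with wide characters the token count and the character count can differ.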

The second way is to use UTF-8 internally. It'll save tons of memory. "We'll have to adapt all string handling functions to cope with that! A nightmare!" I hear you scream. But think a bit: in how many places do you actually need to decode UTF-8? UTF-8 preserves ASCII both ways: ASCII characters appear unchanged in a UTF-8 stream, and if you see an ASCII character in a UTF-8 stream, it really is one. So searches for "/", "\n" and ":", which is most of what we do, will continue to work as they should.
There are two places where we need to be aware of UTF-8. The obvious one is the font rendering; it'll need to decode the stream. The other one is string formatting; only there do we need to know the exact number of characters, not bytes, a string has up to a certain point, to align the score table and stuff. Luckily, we already have embedded color codes that pose exactly the same problem, so we have half a handful of functions that handle formatting and that are actually used. So we'll only have to adapt those functions.
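
As a rough illustration of the formatting side, counting characters instead of bytes does not need a full decoder: every byte that is not a continuation byte (bit pattern 10xxxxxx) starts a new character. The function name `utf8_chars_before` is an assumption for this sketch and is not taken from tString.

```c
#include <stddef.h>

/* Number of characters (code points) within the first `bytes` bytes of a
   UTF-8 string. Stops early at the terminating NUL. Continuation bytes
   (10xxxxxx) are skipped, so each multi-byte character counts once. */
size_t utf8_chars_before(const char *s, size_t bytes) {
    size_t count = 0;
    size_t i;
    for (i = 0; i < bytes && s[i] != '\0'; ++i)
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            ++count;
    return count;
}
```

With the five-byte string `"K\xC3\xA4se"` ("Käse"), the function returns 4 for the whole string, which is the number a column alignment routine would want.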

I guess the choice is obvious :) The plan is to make the font rendering decode UTF-8, adapt the formatting functions in tString, convert the language files, and be done. The difficult bit is the font, but we were toying with the thought of using an external library anyway, and if that supports UTF-8, we'll have almost zero work to do extra. If we keep rolling our own, then the really difficult bit is to make the font renderer able to render some ten thousand characters, not the decoding of UTF-8.

This plan also means that casts from xmlChar * to char * to std::string are safe. I came to the conclusions above while searching for a good way to do a safe conversion :)

Luke-Jr
Dr Z Level
Posts: 2246
Joined: Sun Mar 20, 2005 4:03 pm
Location: IM: [email protected]

Re: "Plan" for unicode

Post by Luke-Jr »

z-man wrote:"We'll have to adapt all string handling functions to cope with that! A nightmare!" I hear you scream.
Not me... I've done this before, so I have a nice set of C UTF-8 code :)
Here's my contribution:

Code:

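/* Returns the length in bytes of the UTF-8 sequence that begins with the
   lead byte `start`, or -1 if `start` cannot begin a sequence. */
int utf8_charlen(unsigned char start) {
	if      ((start & 0x80) ==    0)	/* 0xxxxxxx: plain ASCII */
		return 1;
	else if ((start & 0xe0) == 0xc0)	/* 110xxxxx */
		return 2;
	else if ((start & 0xf0) == 0xe0)	/* 1110xxxx */
		return 3;
	else if ((start & 0xf8) == 0xf0)	/* 11110xxx */
		return 4;
	else if ((start & 0xfc) == 0xf8)	/* 111110xx: original 31-bit scheme only */
		return 5;
	else if ((start & 0xfe) == 0xfc)	/* 1111110x: original 31-bit scheme only */
		return 6;
	return -1;	/* continuation byte (10xxxxxx), 0xfe or 0xff */
}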
(Edit: the code didn't... you know... work ;)
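
For the font rendering side, a length function like the one above is half of a decoder. The following is a minimal sketch of the other half; `utf8_decode` is a hypothetical name, it only handles the 1 to 4 byte forms that RFC 3629 allows, and it does not reject overlong encodings or surrogates, so it is not a validating decoder.

```c
/* Decodes one code point from the NUL-terminated string `s` into *out.
   Returns the number of bytes consumed, or -1 if the sequence is
   malformed or truncated. */
int utf8_decode(const unsigned char *s, unsigned long *out) {
    unsigned long cp;
    int len, i;

    if      ((s[0] & 0x80) == 0x00) { cp = s[0];        len = 1; }
    else if ((s[0] & 0xe0) == 0xc0) { cp = s[0] & 0x1f; len = 2; }
    else if ((s[0] & 0xf0) == 0xe0) { cp = s[0] & 0x0f; len = 3; }
    else if ((s[0] & 0xf8) == 0xf0) { cp = s[0] & 0x07; len = 4; }
    else return -1;   /* continuation byte or unsupported lead byte */

    for (i = 1; i < len; ++i) {
        if ((s[i] & 0xc0) != 0x80)   /* also catches the terminating NUL */
            return -1;
        cp = (cp << 6) | (s[i] & 0x3f);   /* append six payload bits */
    }
    *out = cp;
    return len;
}
```

A renderer would call this in a loop, advancing by the returned length and looking up a glyph for each code point.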
Last edited by Luke-Jr on Tue Feb 21, 2006 1:52 pm, edited 1 time in total.

joda.bot
Match Winner
Posts: 421
Joined: Sun Jun 20, 2004 11:00 am
Location: Germany

Post by joda.bot »

Something I found ... which is pretty big but also powerful:
http://icu.sourceforge.net/userguide/icufaq.html

Java is also based on this AFAIK.

Z-Man
God & Project Admin
Posts: 11233
Joined: Sun Jan 23, 2005 6:01 pm
Location: Cologne, Jabber: [email protected]

Post by Z-Man »

Luke: yep, that's it. Now add color code length detection and we're all set :) We'll probably want to give color codes a new start code, it's annoying not to be able to write "320x200".
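
A sketch of what length detection with color codes could look like, under a loud assumption: the color code format here is taken to be "0x" followed by exactly six hex digits, which is a guess based on the "320x200" remark and not a description of the real parser. The names `is_color_code` and `display_length` are made up for illustration.

```c
#include <ctype.h>
#include <stddef.h>

/* Assumed format: "0x" followed by exactly six hex digits. */
static int is_color_code(const char *s) {
    int i;
    if (s[0] != '0' || s[1] != 'x')
        return 0;
    for (i = 2; i < 8; ++i)
        if (!isxdigit((unsigned char)s[i]))   /* NUL ends the check early */
            return 0;
    return 1;
}

/* Number of characters that end up on screen: color codes are skipped
   entirely and UTF-8 continuation bytes (10xxxxxx) are not counted. */
size_t display_length(const char *s) {
    size_t n = 0;
    while (*s != '\0') {
        if (is_color_code(s)) {
            s += 8;
        } else {
            if (((unsigned char)*s & 0xC0) != 0x80)
                ++n;
            ++s;
        }
    }
    return n;
}
```

Under that assumption, `"0xff0000K\xC3\xA4se"` has a display length of 4, and `"320x200"` counts as 7 ordinary characters because only three hex digits follow the "0x".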

Oh, and network transfer of strings will need translation for old versions, obviously. A small piece of work I forgot.

ICU probably is a bit too big for our purposes, we won't need conversion to different character sets or date and time handling.

Luke-Jr
Dr Z Level
Posts: 2246
Joined: Sun Mar 20, 2005 4:03 pm
Location: IM: [email protected]

Post by Luke-Jr »

z-man wrote:Luke: yep, that's it. Now add color code length detection and we're all set :) We'll probably want to give color codes a new start code, it's annoying not to be able to write "320x200".
While we're at it, how about just having the client handle colour code insertion? Just send them over the network as a four byte sequence: "I am colour", red, green, blue.
For "I am colour", we have quite a few choices (or so my quick-and-dirty utf8 tester says): 128-191, 254, and 255. 128-191 are reserved for tail bytes, so we want to avoid using them. 254 and 255 are (I think) not used by UTF-8 in order to ensure BOM works, but unless we plan to ever support UTF-16, we don't need to worry about that.
Now for another idea that adds a bit of complexity, but makes it work with a 'charlen' function properly:
Encode the colour by taking 01FFFFFF (dissection: 01 = colour extension identifier; FF = red; FF = green; FF = blue) and applying UTF-8 to *that*, violating the RFC's rule against too-long sequences (so we get something unique to AA): fc81bfbfbfbf
fc81bfbfbfbf = white
fc8180808080 = black
fc81bfb08080 = red
fc81808fbc80 = green
fc81808083bf = blue
Yeah, not too pretty, but... shrug
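
For reference, the byte values listed above can be reproduced with a few lines of C. This is only a sketch of the proposal as described in this post; the name `encode_color` is made up. Note that six-byte sequences are outside what RFC 3629 permits, so a strict UTF-8 decoder would reject them.

```c
/* Packs a colour as the 31-bit value 0x01RRGGBB and writes it as a
   six-byte sequence in the original (pre-RFC 3629) UTF-8 layout:
   1111110x followed by five 10xxxxxx bytes. */
void encode_color(unsigned char r, unsigned char g, unsigned char b,
                  unsigned char out[6]) {
    unsigned long v = 0x01000000UL
                    | ((unsigned long)r << 16)
                    | ((unsigned long)g << 8)
                    | (unsigned long)b;
    int i;

    out[0] = (unsigned char)(0xFC | (v >> 30));   /* top bit, always 0 here */
    for (i = 1; i < 6; ++i)                       /* six payload bits per byte */
        out[i] = (unsigned char)(0x80 | ((v >> (6 * (5 - i))) & 0x3F));
}
```

Calling `encode_color(255, 0, 0, buf)` fills `buf` with fc 81 bf b0 80 80, matching the red entry in the list.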
z-man wrote:Oh, and network transfer of strings will need translation for old versions, obviously. A small piece of work I forgot.
It will if we change colour-code stuff... not sure why else it would, if we just assume old versions used ASCII (which we probably can't) ;)
z-man wrote:ICU probably is a bit too big for our purposes, we won't need conversion to different character sets or date and time handling.
Besides, what does it do that iconv doesn't? =p

Tank Program
Forum & Project Admin, PhD
Posts: 6698
Joined: Thu Dec 18, 2003 7:03 pm

Post by Tank Program »

We could use the unicode color codes that IRC uses...

Either way of using unicode sounds good to me, z-man. As long as what the user has to do does not change too much, anything should be fine really.

Luke-Jr
Dr Z Level
Posts: 2246
Joined: Sun Mar 20, 2005 4:03 pm
Location: IM: [email protected]

Post by Luke-Jr »

Tank Program wrote:We could use the unicode color codes that IRC uses...
IRC doesn't use unicode color codes. IRC doesn't even support unicode.
IRC just uses something like %K followed by a DOS colour code which is 16 colour.

Tank Program
Forum & Project Admin, PhD
Posts: 6698
Joined: Thu Dec 18, 2003 7:03 pm

Post by Tank Program »

Not when I last looked it up, unless \uXXXX isn't unicode.

Luke-Jr
Dr Z Level
Posts: 2246
Joined: Sun Mar 20, 2005 4:03 pm
Location: IM: [email protected]

Post by Luke-Jr »

Tank Program wrote:Not when I last looked it up, unless \uXXXX isn't unicode.
http://www.irchelp.org/irchelp/rfc/rfc.html
Where do you see any mention of Unicode or \uXXXX?

Tank Program
Forum & Project Admin, PhD
Posts: 6698
Joined: Thu Dec 18, 2003 7:03 pm

Post by Tank Program »

I believe I was packet capturing at the time. Doing some simple google searching revealed this:
a bit of a how-to having to do with Java and IRC wrote:
String plain = "A plain message";

String red1 = "\u000304A red message";

String red2 = "\u0003" + "04" + "A red message";

String whiteOnBlack = "\u000300,01" + "White text on black background";
So not strictly unicode, but expressed as unicode is a possibility.

Luke-Jr
Dr Z Level
Posts: 2246
Joined: Sun Mar 20, 2005 4:03 pm
Location: IM: [email protected]

Post by Luke-Jr »

Tank Program wrote:I believe I was packet capturing at the time. Doing some simple google searching revealed this:
a bit of a how-to having to do with Java and IRC wrote:
String plain = "A plain message";

String red1 = "\u000304A red message";

String red2 = "\u0003" + "04" + "A red message";

String whiteOnBlack = "\u000300,01" + "White text on black background";
So not strictly unicode, but expressed as unicode is a possibility.
Not really Unicode any more than ASCII. It's just a char with value 3. And again, it limits us to 16 colours... Anything wrong with my suggestions?

Z-Man
God & Project Admin
Posts: 11233
Joined: Sun Jan 23, 2005 6:01 pm
Location: Cologne, Jabber: [email protected]

Post by Z-Man »

Luke-Jr wrote:Anything wrong with my suggestions?
To be honest, I don't want to think about it right now. I'd probably prefer a scheme that is valid UTF-8 even for normal tools (so it can be inserted into text files), just using one of the certainly existing unused characters as leadin. Like one of the control characters from 01 to 31 we don't have a use for.

Tank Program
Forum & Project Admin, PhD
Posts: 6698
Joined: Thu Dec 18, 2003 7:03 pm

Post by Tank Program »

I'd vote for something escaped for color codes, or even for keeping the current system. Something that's user friendly should, of course, remain a top priority.

As for unicode in general, sounds like you've got a good plan z-man.

Jonathan
A Brave Victim
Posts: 3392
Joined: Thu Feb 03, 2005 12:50 am
Location: Not really lurking anymore

Post by Jonathan »

I was thinking about escapable escapes (if that's how you'd call it):
\cff0000red => red
\\cff0000red => \cff0000red
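
A minimal sketch of how such escapes could be resolved, assuming exactly the two forms shown: `\c` followed by six hex digits for a colour, and `\\` for a literal backslash. The names `six_hex` and `strip_escapes` are hypothetical; a renderer would switch colour where this version simply drops the code.

```c
#include <ctype.h>
#include <stddef.h>

/* True if the next six characters are all hex digits. */
static int six_hex(const char *s) {
    int i;
    for (i = 0; i < 6; ++i)
        if (!isxdigit((unsigned char)s[i]))   /* NUL ends the check early */
            return 0;
    return 1;
}

/* Copies `in` to `out` with escapes resolved: "\cRRGGBB" is dropped and
   "\\" becomes a single backslash. `out` needs room for strlen(in) + 1
   bytes. Returns the length of the resulting text. */
size_t strip_escapes(const char *in, char *out) {
    size_t o = 0;
    while (*in != '\0') {
        if (in[0] == '\\' && in[1] == '\\') {
            out[o++] = '\\';            /* escaped backslash */
            in += 2;
        } else if (in[0] == '\\' && in[1] == 'c' && six_hex(in + 2)) {
            in += 8;                    /* skip the colour escape */
        } else {
            out[o++] = *in++;           /* ordinary byte, UTF-8 included */
        }
    }
    out[o] = '\0';
    return o;
}
```

Because the backslash is plain ASCII, this scan works on UTF-8 text without decoding it.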

Z-Man
God & Project Admin
Posts: 11233
Joined: Sun Jan 23, 2005 6:01 pm
Location: Cologne, Jabber: [email protected]

Post by Z-Man »

Escape codes sound perfect. They'd still be editable by users (curse and blessing, as today). Sure, they'll take more space than the packed version.
