"Plan" for unicode

Post by **Z-Man** » Sun May 14, 2006 5:34 pm

Using wchars is the wrong way

First, wchars are system dependent and often not wide enough to hold all characters. Second, the apparent benefit that one wchar corresponds to one printed character doesn't exist: there are still composed characters (base letter plus accent). Wasting memory when storing ASCII strings is just a minor detail. Working with UTF-8 internally is the right way, and it is also the least amount of work. In case the font library can't handle UTF-8, we should convert to wchars, or whatever the library demands, at the last possible moment.
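To illustrate the "convert at the last possible moment" idea, here is a minimal sketch of a decoder that turns a UTF-8 string into code points right before it is handed to a font library. The name `decodeUtf8` is hypothetical and is not taken from the Armagetron source; the choice of `char32_t` instead of `wchar_t` is also an assumption, made because `wchar_t` is only 16 bits wide on some systems.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not from the utf8 branch): decode UTF-8 into code points.
// Malformed bytes become U+FFFD so that bad input cannot derail the caller.
std::u32string decodeUtf8(const std::string& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size())
    {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t extra = 0;   // continuation bytes expected after the lead byte
        char32_t cp = 0;
        bool ok = true;

        if (lead < 0x80)                { cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
        else                            { ok = false; }   // stray continuation or invalid lead

        if (ok && i + extra >= in.size())
            ok = false;                                   // sequence is cut off

        for (std::size_t k = 1; ok && k <= extra; ++k)
        {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (c & 0x3F);
        }

        // Reject overlong encodings, surrogates and values beyond U+10FFFF.
        static const char32_t minValue[4] = { 0x0, 0x80, 0x800, 0x10000 };
        if (ok && (cp < minValue[extra] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)))
            ok = false;

        if (ok) { out.push_back(cp); i += extra + 1; }
        else    { out.push_back(0xFFFD); ++i; }
    }
    // Every code point consumes at least one input byte.
    assert(out.size() <= in.size());
    return out;
}
```

Note that `decodeUtf8("e\xCC\x81")` yields two code points (the base letter and a combining accent) for what is printed as a single character, which is the composed-character case mentioned above.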

Yes, all the current string transfer quirks can be fixed in a backwards compatible way.

Post by **wrtlprnft** » Tue Aug 29, 2006 10:13 pm

*bump*

The utf8 branch is in /armagetronad/armagetronad/branches/utf8/.
The font system now treats all incoming text as utf-8 and displays it correctly. The menu system converts input to utf-8 too (though SDL returns 0 as the codepoint for chars not in latin-1), and the language files are converted to utf-8.

Still missing the color codes and network translation.

Post by **Luke-Jr** » Wed Aug 30, 2006 1:39 am

Please don't introduce yet another dependency when it is totally unnecessary. The UNIX98 standard defines the iconv(3) API for translation between different character encodings.

Post by **wrtlprnft** » Wed Aug 30, 2006 9:12 am

It isn't a dependency. It's a 20K header file in src/thirdparty, that's it. It has some nice c++ goodness to make dealing with utf-8 easier, it's NOT just for conversion.
To use iconv to convert from utf-16 to utf-8 I'd basically need to allocate a string three times the length of the original string which would probably be way too big in most cases... We can use iconv for translating between utf-8 and latin-1 though, I guess.

Post by **Luke-Jr** » Wed Aug 30, 2006 7:24 pm

wrtlprnft wrote:It isn't a dependency. It's a 20K header file in src/thirdparty, that's it.

Yes, but how much does the binary size grow? Our binaries are huge, and make certain desirable targets (OpenWrt, handheld gaming devices) impractical.

wrtlprnft wrote:It has some nice c++ goodness to make dealing with utf-8 easier, it's NOT just for conversion.

I have a bunch of really simple dealing-with-utf8 code anyway used for MOO. IIRC, some is already being used.

wrtlprnft wrote:To use iconv to convert from utf-16 to utf-8 I'd basically need to allocate a string three times the length of the original string which would probably be way too big in most cases...

Why? iconv converts one character at a time. For conversion to utf-8, you could use either a char[8] output buffer or malloc a char* with the utf-16 size divided by two and realloc it if you run out of room.
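The small-output-buffer idea can be sketched roughly as follows. This is an illustration only, not code from either branch: the function name `utf16leToUtf8` and the 64-byte chunk size are assumptions, and it relies on the POSIX `iconv_open`/`iconv`/`iconv_close` calls as declared by glibc (some older platforms declare the input pointer as `const char**`). The result grows in a `std::string`, so no worst-case allocation is needed up front. For reference, one 2-byte UTF-16 unit becomes at most 3 bytes of UTF-8.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <iconv.h>
#include <stdexcept>
#include <string>

// Hypothetical helper: convert UTF-16LE bytes to UTF-8 through a small,
// fixed-size output chunk. The input is taken by value because iconv
// wants a mutable char*.
std::string utf16leToUtf8(std::string in)
{
    iconv_t cd = iconv_open("UTF-8", "UTF-16LE");   // arguments are (to, from)
    if (cd == (iconv_t)(-1))
        throw std::runtime_error("iconv_open failed");

    std::string out;
    char* inPtr = &in[0];
    std::size_t inLeft = in.size();
    char chunk[64];

    while (inLeft > 0)
    {
        char* outPtr = chunk;
        std::size_t outLeft = sizeof chunk;
        const std::size_t before = inLeft;

        const std::size_t rc = iconv(cd, &inPtr, &inLeft, &outPtr, &outLeft);
        const int err = errno;                       // save before anything else can touch it
        out.append(chunk, sizeof chunk - outLeft);   // keep whatever was converted

        // E2BIG only means the chunk is full; loop again with an empty chunk.
        if (rc == (std::size_t)(-1) && err != E2BIG)
        {
            iconv_close(cd);
            throw std::runtime_error("invalid or truncated UTF-16 input");
        }
        // The chunk holds at least one full UTF-8 sequence, so input was consumed.
        assert(inLeft < before);
    }

    iconv_close(cd);
    return out;
}
```

UTF-8 is a stateless target encoding, so the sketch skips the final flush call that stateful encodings would need.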

Post by **Tank Program** » Wed Aug 30, 2006 9:01 pm

Luke-Jr wrote:handheld gaming devices

The lack of OpenGL on such devices is a bigger issue.

Post by **Luke-Jr** » Wed Aug 30, 2006 10:16 pm

Tank Program wrote:
Luke-Jr wrote:handheld gaming devices
The lack of OpenGL on such devices is a bigger issue.

You assume they in fact do lack OpenGL. I know at least one of the modern portable gaming devices has a GL library... Also, I doubt our GL code is the majority; stripping it and replacing it with something else (2D or such) won't save much binary size, so the issue remains.

Post by **Z-Man** » Sat Jan 17, 2009 1:17 am

Epic bump

I restarted the branch, merging it with trunk and committing the result into a new branch. It was hell. Anyway, now we can finish this. The restarted branch is a full one (with winlibs and stuff, it's zero cost in svn) so it can be mirrored to bzr and merged there.

I'm no longer convinced that the current approach, "convert strings on the network layer in the current network system", is the correct one. It's simply too fragile: one can't know for sure which format the sender uses for its strings. Instead, I'd say we just stick to sending strings in latin1 in the current network system all of the time. I'm actively thinking about the Google Protocol Buffers thing; that would send utf8 strings with new color codes. If we do things this way, all that's left to do before we can merge is to have the network code always convert to and from latin1 instead of doing it selectively.
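A rough sketch of what "always convert to and from latin1" at the network boundary could look like. The function names are hypothetical and not taken from the netcode, and replacing characters outside latin1 with `?` is an assumption about the desired fallback, not a decision made in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: incoming latin1 from the wire becomes UTF-8 for internal use.
std::string latin1ToUtf8(const std::string& latin1)
{
    std::string out;
    for (unsigned char c : latin1)
    {
        if (c < 0x80)
        {
            out.push_back(static_cast<char>(c));
        }
        else
        {
            // Code points 0x80..0xFF need exactly two bytes in UTF-8.
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    assert(out.size() <= 2 * latin1.size());
    return out;
}

// Hypothetical helper: internal UTF-8 becomes latin1 before sending.
// Anything latin1 cannot represent (or malformed input) turns into '?'.
std::string utf8ToLatin1(const std::string& utf8)
{
    std::string out;
    std::size_t i = 0;
    while (i < utf8.size())
    {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        const unsigned char next =
            (i + 1 < utf8.size()) ? static_cast<unsigned char>(utf8[i + 1]) : 0;

        if (lead < 0x80)
        {
            out.push_back(static_cast<char>(lead));
            ++i;
        }
        else if ((lead == 0xC2 || lead == 0xC3) && (next & 0xC0) == 0x80)
        {
            // Lead bytes 0xC2 and 0xC3 cover exactly U+0080..U+00FF.
            out.push_back(static_cast<char>(((lead & 0x03) << 6) | (next & 0x3F)));
            i += 2;
        }
        else
        {
            // Emit one '?' and skip the continuation bytes of this sequence.
            out.push_back('?');
            ++i;
            while (i < utf8.size() &&
                   (static_cast<unsigned char>(utf8[i]) & 0xC0) == 0x80)
                ++i;
        }
    }
    assert(out.size() <= utf8.size());
    return out;
}
```

Every latin1 string survives a round trip through `latin1ToUtf8` and `utf8ToLatin1` unchanged; the lossy direction is only UTF-8 text that contains characters beyond U+00FF.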

Oh, and the font files have diverged and can't be merged. Luke added some characters on the Trunk. I don't know what wrtl did on the branch. What shall we do there?

Edit: because the branch version uses wchar strings to communicate with FTGL, it also isn't affected by the recent silent addition of utf8 "support" there (which apparently can't be disabled and crashes when fed with invalid strings).

Post by **wrtlprnft** » Sat Jan 17, 2009 11:44 am

Z-Man wrote:Oh, and the font files have diverged and can't be merged. Luke added some characters on the Trunk. I don't know what wrtl did on the branch. What shall we do there?

Heh, I also added a couple of new characters there… I'll try to have a look at it.

Z-Man wrote:Edit: because the branch version uses wchar strings to communicate with FTGL, it also isn't affected by the recent silent addition of utf8 "support" there (which apparently can't be disabled and crashes when fed with invalid strings).

That's quite inefficient though, because it has to convert all characters every time a string is rendered… I'd prefer depending on (or possibly including) the newer version and always using utf-8 internally.

Post by **Z-Man** » Sat Jan 17, 2009 12:27 pm

The branch arrived on launchpad: https://code.launchpad.net/~armagetrona ... ronad-work

wrtlprnft wrote:I'd prefer depending on (or possibly including) the newer version and always using utf-8 internally.

I'm cool with that. FTGL is statically linked anyway in our autopackages, so there's no backward compatibility headache there. And apparently, I already changed the include files on the trunk so they require the new version.

So here are our options for dealing with this branch, 0.3.1 and the FTGL crash:
a) stay calm, make it so that 0.3.1 can't link with the broken new FTGL, distribute versions of 0.3.1 statically linked with a good FTGL.
b) on the utf8 branch, make it so that the new FTGL is required, get rid of the conversion to wstrings, always convert to/from latin1 in the netcode, merge it to 0.3.1 (could be a tad difficult because there are changes in utf8 now that we don't want in 0.3.1) and release that.
c) same as b), but leave the conversion to wstrings on rendering intact, thus not requiring bleeding edge FTGL.

I'm leaning towards a) and d), which is the same as b), just merging into the trunk instead of 0.3.1.

Post by **Lucifer** » Sat Jan 17, 2009 4:07 pm

As far as 0.3.1 is concerned, I say don't merge. If we need to, we can always pinch off 0.3.2 a few weeks after to release utf8 support.

Post by **wrtlprnft** » Sat Jan 17, 2009 4:36 pm

The font files are merged now… Please consider the trunk one outdated and don't change it anymore. Adding new characters is pointless, anyways, as the trunk can't display them unless ftgl is messing up.

Post by **Z-Man** » Sat Jan 17, 2009 8:39 pm

Think we should handle FTGL flexibly? I smell people complaining if 0.3.1 doesn't build/work with FTGL >= 2.1.3 and the trunk doesn't build with FTGL < 2.1.3. Shall I check whether one can avoid the utf8->wstring conversion selectively, depending on the version of FTGL compiled against, without too much chaos?

Post by **wrtlprnft** » Sat Jan 17, 2009 8:52 pm

if you think it's not a huge hack, knock yourself out

Post by **Z-Man** » Sat Jan 17, 2009 9:02 pm

Oh yeah, another thing: I think we should leave the trunk language files that are also on 0.2.8 in latin1. The reason being merge hell, of course. I'll make it so that the loader is flexible.
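One way a "flexible" loader might work, sketched under assumptions: check whether the file content is structurally valid UTF-8 and, if it is not, treat it as latin1. The names `isValidUtf8` and `normalizeToUtf8` are hypothetical, and this is a heuristic rather than the loader that was actually written. A latin1 file whose high bytes happen to form valid UTF-8 sequences would be misdetected, although that is rare in real text.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: structural UTF-8 check (lead byte plus the right number
// of continuation bytes). It is a heuristic; it does not reject every overlong
// form or surrogate.
bool isValidUtf8(const std::string& s)
{
    std::size_t i = 0;
    while (i < s.size())
    {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;
        if (lead < 0x80)                        extra = 0;
        else if (lead >= 0xC2 && lead <= 0xDF)  extra = 1;
        else if ((lead & 0xF0) == 0xE0)         extra = 2;
        else if (lead >= 0xF0 && lead <= 0xF4)  extra = 3;
        else return false;                      // stray continuation or invalid lead

        if (i + extra >= s.size())
            return false;                       // sequence is cut off

        for (std::size_t k = 1; k <= extra; ++k)
            if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80)
                return false;

        i += extra + 1;
    }
    return true;
}

// Hypothetical helper: each latin1 byte maps to the code point of the same value.
std::string latin1ToUtf8(const std::string& latin1)
{
    std::string out;
    for (unsigned char c : latin1)
    {
        if (c < 0x80)
        {
            out.push_back(static_cast<char>(c));
        }
        else
        {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    assert(out.size() <= 2 * latin1.size());
    return out;
}

// Accept a language file in either encoding and return UTF-8.
std::string normalizeToUtf8(const std::string& raw)
{
    return isValidUtf8(raw) ? raw : latin1ToUtf8(raw);
}
```

Pure ASCII files pass through unchanged under either interpretation, so only files that contain accented characters are affected by the guess.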