UTF8 and authentication

Z-Man
God & Project Admin
Posts: 11585
Joined: Sun Jan 23, 2005 6:01 pm
Location: Cologne

Post by Z-Man »

See this bug: https://bugs.launchpad.net/armagetronad/+bug/332713
Basically, right now, no version handles non-ascii characters in usernames correctly (where by correctly, I mean 'anything that works', not adhering to standards or anything). I think we said back then 'if you have a non-ascii username, you're screwed, sorry'.

Question: is there a good way to solve this? In authentication URLs, non-ascii latin1 characters are now (after a preliminary fix) encoded as %HEXCODE. That works, but does not extend to non-ascii non-latin1 unicode characters, of course, because the hexcode is fixed at two digits. Is there a standard way to encode wide characters in HTTP queries?
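
To make the two-digit limitation concrete, here is a minimal Python sketch. Python's urllib.parse.quote only stands in for the game's own URL encoding function, which is not shown in this thread, and encode_latin1 is a name made up for the illustration. A latin1 character is exactly one byte and so fits one %XX code; a character outside latin1 has no single-byte value to write down.

```python
from urllib.parse import quote

def encode_latin1(username: str) -> str:
    """Percent-encode a username byte by byte, treating it as latin1."""
    return quote(username, safe="", encoding="latin-1")

# 'é' is the single byte 0xE9 in latin1, so it becomes one %XX code.
print(encode_latin1("Zéro"))  # Z%E9ro

# 'ł' (U+0142) does not exist in latin1: there is no byte to hex-encode.
try:
    encode_latin1("Paweł")
except UnicodeEncodeError as err:
    print("cannot encode as latin1:", err.reason)
```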

I guess not, my search turned up nothing, and frankly, it wouldn't make too much sense if there was a way :) So, should we switch to UTF8 for authentication right away, even on 0.2.8? The change would be minimal, just the URL encoding function, no other system needs to know about UTF8. I think we should. We have this situation:

Right now, as of 0.2.8.3_beta1, non-ascii username logins don't work, ever.
After the just committed fix, non-ascii username logins work IF the authority interprets them as latin1 strings.
On trunk versions, non-latin1 usernames don't work, ever.

We could change that to:
On 0.2.8.3_beta2, non-ascii username logins work, provided the authority interprets passed usernames as utf8.
Same for trunk versions and all unicode usernames (provided the authority allows you to set them in the first place, of course).

The disadvantage of the second way is that it will take slightly longer for latin1 usernames to work, authorities need to be checked and adapted. The advantage is that they won't have to be adapted again for full unicode usernames later.

Right now, however, the forums authority expects latin1 usernames, and I expect most others will do, too. Grumble. So maybe it's best to stick to latin1 wherever possible and change the request for unicode usernames (now: user=<latin1 username>, then: uuser=<utf8 username>).

Oh yeah, and then there's also the issue with the internal usernames used for ladderlog :) With the current implementation, they changed when we switched from latin1 to utf8. They use the same encoding for non-ascii characters as the http queries. Question there: any good ideas how to adapt se_(Un)EscapeName so latin1 names get escaped the same way as before and se_UnEscapeName( se_EscapeName( x ) ) == x still always holds, even for non-latin1 strings? We don't have to worry about any standard there.
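
One possible shape for such a scheme, as a rough Python sketch rather than the real se_EscapeName code: keep %XX for everything below U+0100, so latin1 names come out as before, and give everything above it a distinct %uXXXXXX form. The safe character set, the %u prefix, and the function names are all assumptions made for illustration, and the extra log-file escapes the real function applies are left out.

```python
import re

# Hypothetical scheme, NOT the actual se_EscapeName implementation.
# Assumption: the existing escape writes non-ASCII latin1 characters as %XX.
SAFE = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_-.")

def escape_name(name: str) -> str:
    out = []
    for ch in name:
        code = ord(ch)
        if ch in SAFE:
            out.append(ch)
        elif code < 0x100:
            out.append(f"%{code:02X}")    # same form as the latin1 escape
        else:
            out.append(f"%u{code:06X}")   # 'u' is not a hex digit, so no clash
    return "".join(out)

# '%' itself is always escaped (as %25), so every '%' starts an escape.
_ESCAPE = re.compile(r"%u([0-9A-Fa-f]{6})|%([0-9A-Fa-f]{2})")

def unescape_name(escaped: str) -> str:
    return _ESCAPE.sub(lambda m: chr(int(m.group(1) or m.group(2), 16)), escaped)

for name in ["Zéro", "Paweł", "100%u000142", "joda bot"]:
    assert unescape_name(escape_name(name)) == name
```

Fixed-width hex after %u keeps the decoder unambiguous: a variable-length code followed by a literal hex digit could not be split apart again.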
Luke-Jr
Dr Z Level
Posts: 2246
Joined: Sun Mar 20, 2005 4:03 pm
Location: IM: [email protected]

Post by Luke-Jr »

Aren't usernames still always ASCII, even with UTF-8? Though I can see there might be a problem deciding how to ASCII-ify UTF-8-only names...
Z-Man
God & Project Admin
Posts: 11585
Joined: Sun Jan 23, 2005 6:01 pm
Location: Cologne

Post by Z-Man »

Luke-Jr wrote:Aren't usernames still always ASCII, even with UTF-8? Though I can see there might be a problem deciding how to ASCII-ify UTF-8-only names...
Precisely :) Usernames, as sent to the authority via the http-request, are ASCII-ified by replacing non-ASCII-bytes with hexcodes. At least php automatically translates those codes back to bytes. Whether the resulting string then is utf8 or latin1 depends on whether it was utf8 or latin1 before the hexcodification. Currently, on 0.2.8, it's in latin1, so that's what arrives on the authority side. The trunk is probably still broken, but after the next merge and without further changes, it will send the usernames as utf8. Clearly, that's a recipe for chaos and something needs to be done. It's just a question in which direction we're moving, whether we want to actively support full unicode authentication usernames and if yes, how quickly we want to force the corresponding interface to the authorities.

And that's only the authentication usernames. There's also the internal usernames for ladderlog and stuff that are ASCII-ified internally using the same hexcode stuff and a couple of more escapes.
joda.bot
Match Winner
Posts: 421
Joined: Sun Jun 20, 2004 11:00 am
Location: Germany

Post by joda.bot »

My view on fixing this bug is:
1) Fix: the signed int problem (this is a blocker bug right?) e.g. send pure latin1 (no weird escape?) and add encoding=<charset> to the authentication call
2) switch to UTF-8 as soon as possible

The first part makes this solution future-proof, as the authorities can still detect and handle older clients. Someone might even code something to handle authority calls without "encoding=<charset>", as long as every old client always uses latin1 and uses the (broken) signed int escape.

EDIT: I'll send a PM to Agent Zéro :D, so he can express his desire to get this fixed (he can't login :-)).
Tank Program
Forum & Project Admin, PhD
Posts: 6711
Joined: Thu Dec 18, 2003 7:03 pm

Post by Tank Program »

Would it be possible to recognize UTF-8 characters and re-encode them as %XX%XX?
Z-Man
God & Project Admin
Posts: 11585
Joined: Sun Jan 23, 2005 6:01 pm
Location: Cologne

Post by Z-Man »

Of course, and it's the standard way to encode utf8 in URIs :) It's even easy enough on 0.2.8. The only reason to do otherwise is that right now, many authorities simply expect latin1 encoded usernames. We don't have to care too much, though. Let's just switch the username encoding to utf8 and be done with it.
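
As a small illustration of that %XX%XX form, here is a Python sketch using the standard library's quote_plus and unquote_plus. The helper names encode_username and decode_username are made up for the example; the game's own encoding function is not shown in this thread.

```python
from urllib.parse import quote_plus, unquote_plus

def encode_username(username: str) -> str:
    """Percent-encode the UTF-8 bytes of a username, one %XX per byte."""
    return quote_plus(username, encoding="utf-8")

def decode_username(encoded: str) -> str:
    """Reverse of encode_username, as a UTF-8 aware receiver would do it."""
    return unquote_plus(encoded, encoding="utf-8")

print(encode_username("Zéro"))      # Z%C3%A9ro  ('é' is two bytes in UTF-8)
print(encode_username("Paweł"))     # Pawe%C5%82
print(encode_username("joda bot"))  # joda+bot
```

The receiver has to know the bytes are UTF-8: the same Z%C3%A9ro decoded as latin1 would yield two garbage characters instead of one 'é'.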
Tank Program
Forum & Project Admin, PhD
Posts: 6711
Joined: Thu Dec 18, 2003 7:03 pm

Post by Tank Program »

Right-o. I imagine for those of us poor fools with latin1 databases, that it should be relatively trivial to convert the incoming arguments from utf8 to latin1 in php.
Z-Man
God & Project Admin
Posts: 11585
Joined: Sun Jan 23, 2005 6:01 pm
Location: Cologne

Post by Z-Man »

Blergh, I've grossly underestimated the problem. The URI is not the only place the username encoding matters. The username is also included in the password hash. On the client. In whatever encoding the client uses internally. The same goes for the password, of course. And for the relevant calculations on the authority. This is way too complicated to get working together for all cases.
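
Why the encoding leaks into the hash can be shown in a few lines of Python. The scramble function below is a hypothetical stand-in (MD5 over "password:username"), not the real scrambling code; the only point is that the hash sees bytes, and the bytes depend on the encoding.

```python
import hashlib

def scramble(username: str, password: str, encoding: str) -> str:
    # Hypothetical stand-in for the real scrambling function.
    # Only the dependence on `encoding` matters for this illustration.
    data = f"{password}:{username}".encode(encoding)
    return hashlib.md5(data).hexdigest()

# Pure ASCII input is byte-identical in latin1 and UTF-8.
ascii_same = scramble("joda", "secret", "latin-1") == scramble("joda", "secret", "utf-8")

# 'é' is one byte in latin1 and two in UTF-8, so the digests differ.
accent_same = scramble("Zéro", "secret", "latin-1") == scramble("Zéro", "secret", "utf-8")

print(ascii_same, accent_same)  # True False
```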

So screw this. I'll just:
a) switch the relevant scrambling client code to utf8 even on 0.2.8 (it's centralized in one function)
b) switch the username in the URI to utf8
c) adapt the reference authentication (by which I mean: add a utf8 encoded sample user)

That way, if server and client are 0.2.8.3_beta2 or later (or 0.3.whatever_first_includes_utf8) and the authority uses utf8 internally, non-ascii usernames and passwords will work. If one component doesn't meet the requirements, you're restricted to ASCII. The documentation did warn you, so don't complain :)

Tank: that means for your authentication script that you need to a) convert usernames coming in from server requests from utf8 to latin1 and b) you need to transform usernames and passwords from latin1 to utf8 for the hash generations. Users with non-ascii password or username will need to log out and log in again, of course.
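
The two conversions could look roughly like this, sketched in Python rather than in the PHP of an actual authority script. Both function names are hypothetical, and the database is assumed to hold latin1 bytes.

```python
from typing import Optional
from urllib.parse import unquote_plus

def username_for_latin1_db(raw_query_value: str) -> Optional[bytes]:
    """a) Decode a percent-encoded UTF-8 username and re-encode it as latin1.

    Returns None when the name contains characters latin1 cannot represent,
    so the caller can reject the login instead of looking up a mangled name.
    """
    name = unquote_plus(raw_query_value, encoding="utf-8", errors="strict")
    try:
        return name.encode("latin-1")
    except UnicodeEncodeError:
        return None

def hash_input_utf8(stored_latin1: bytes) -> bytes:
    """b) Convert a latin1 value from the database to UTF-8 bytes for hashing."""
    return stored_latin1.decode("latin-1").encode("utf-8")

print(username_for_latin1_db("Z%C3%A9ro"))   # b'Z\xe9ro'
print(username_for_latin1_db("Pawe%C5%82"))  # None
print(hash_input_utf8(b"Z\xe9ro"))           # b'Z\xc3\xa9ro'
```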

Also, the bmd5 protocol will NOT change internally, it'll still use the latin1 password. It doesn't hash the username, so we're clear there. However, I'll remove support for it on trunk clients, I'm not in the mood to convert passwords back to latin1 there. Trunk servers just pass the info on, so they can keep allowing it for a while.
Tank Program
Forum & Project Admin, PhD
Posts: 6711
Joined: Thu Dec 18, 2003 7:03 pm

Post by Tank Program »

OK, makes sense.

Good thing the old clients use bmd5, otherwise all hell would break loose on upgrading the authority.
Agent Zéro
Posts: 1
Joined: Thu Feb 19, 2009 12:06 am
Location: France / Bretagne

Post by Agent Zéro »

That seems good !

And Joda greatly exposed my problem. That stupid "é" !

Surely it's better to use the same languages both on the forum and in game. I'll follow this topic. I don't think I'm the only one with that bug, but I have never seen anybody else with an accent like à, é or even ñ.

I hope it will be good at the next version =)

Thanks for all you do for us !
My name is Dincht, Zell Dincht
epsy
Adjust Outside Corner Grinder
Posts: 2003
Joined: Tue Nov 07, 2006 6:02 pm
Location: paris

Post by epsy »

Erm, before talking about UTF8 and fancy stuff..could the behavior of spaces and _'s be defined?
Z-Man
God & Project Admin
Posts: 11585
Joined: Sun Jan 23, 2005 6:01 pm
Location: Cologne

Post by Z-Man »

Those use the conventions of urlencode (http://de2.php.net/urlencode): Spaces are encoded as + for some reason or other, every other non-ascii byte as %XX, where XX is the byte's hexcode.

For escaping those characters for the log files, well, it's a bit weird, but we can't change it without Luke complaining. Underscores are left as they are, and spaces are transformed to \_. The rationale was, I think, that usernames that can technically be logged verbatim should be logged verbatim, and those who can't, well, it's their own fault for being weird.
joda.bot
Match Winner
Posts: 421
Joined: Sun Jun 20, 2004 11:00 am
Location: Germany

Post by joda.bot »

Does "joda bot"@forums work (if I had that nick) ? I had the impression spaces are correctly handled. But I noticed your bug report about escaping ' and " characters :-)

I suggest the command-line escaping use either repeated "" escapes and/or a backslash escape \" inside ". The same escapes might be needed for ' for that matter.

But this is a command-line-only escape problem, right ? Because once the arbitrary string is parsed into login and authority (and encoded as utf-8 ) there is no problem on the client side with sending them to the authority ?

I hope the parameters for the authority URL are escaped such that parameter delimiters can be contained in the nickname ?

Nick Testlist
  • Q?Par=a
  • I'm?
  • A/h&t=?a
  • /"'"'"\ (commandline test name :D)
Can't think of more weird name problems right now.
Z-Man
God & Project Admin
Posts: 11585
Joined: Sun Jan 23, 2005 6:01 pm
Location: Cologne

Post by Z-Man »

joda.bot wrote:Does "joda bot"@forums work (if I had that nick) ? I had the impression spaces are correctly handled.
/login reads in the complete line, so the quotes are not needed there. Unless they're part of the username :) Both cases should work.
joda.bot wrote:But I noticed your bug report about escaping ' and " characters :-)
That's a bug in the autocompleter, I think. It would need to escape those special characters.
joda.bot wrote:I suggest the command-line escaping use either repeated "" escapes and/or a backslash escape \" inside ". The same escapes might be needed for ' for that matter.
It does already, sort of :) To message a player named "joda bot" (quotes not part of the name), you can use either of
/msg "joda bot" <bla>
/msg 'joda bot' <bla>
/msg joda\ bot <bla>
And if the player is named joda"bot, you can
/msg 'joda"bot' <bla> (works because the " does not match the opening ')
/msg "joda\"bot" <bla> (works because it's escaped)
/msg joda\"bot <bla> (works because it's escaped)
/msg joda"bot <bla> (works because the " is not at the beginning of the string and thus not interpreted as opening quote. Obviously, this is the variant you should not rely on.)

So repeated quotes currently don't work the way they do in C or on the shell, which is sort of good, because they're handled differently; on the shell, only "x""y" gets concatenated to "xy", "x" "y" gets split into two arguments, whereas in C both variants produce just one string.
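
The shell half of that comparison can be checked with Python's shlex module, which splits strings by POSIX shell rules. This only illustrates the shell behaviour; it says nothing about how the game's own /msg parser is implemented.

```python
import shlex

adjacent = shlex.split('"x""y"')    # quotes touch: concatenated into one argument
separated = shlex.split('"x" "y"')  # quotes separated by a space: two arguments

print(adjacent, separated)  # ['xy'] ['x', 'y']
```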
joda.bot wrote:But this is a command-line-only escape problem, right ? Because once the arbitrary string is parsed into login and authority (and encoded as utf-8 ) there is no problem on the client side with sending them to the authority ?
Right.
joda.bot wrote:I hope the parameters for the authority URL are escaped such that parameter delimiters can be contained in the nickname ?
Yep, urlencode does that.
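
A quick way to convince yourself, sketched in Python: its urllib.parse.urlencode follows the same conventions as PHP's urlencode for these characters. The nicks are the ones from the test list above; the parameter name user is the one mentioned earlier in the thread, and survives_query is a helper invented for the example.

```python
from urllib.parse import urlencode, parse_qs

def survives_query(nick: str) -> bool:
    """Encode a nick as a query parameter and check it parses back unchanged."""
    query = urlencode({"user": nick})
    return parse_qs(query)["user"] == [nick]

# The last entry is the command-line test name: / " ' " ' " \
nicks = ["Q?Par=a", "I'm?", "A/h&t=?a", "/\"'\"'\"\\"]
for nick in nicks:
    print(urlencode({"user": nick}), survives_query(nick))
```

Because ?, &, = and / are all percent-encoded inside the value, the delimiters of the surrounding URL stay unambiguous.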
wrtlprnft
Reverse Outside Corner Grinder
Posts: 1679
Joined: Wed Jan 04, 2006 4:42 am
Location: 0x08048000

Post by wrtlprnft »

On the other hand you can't /msg-escape 'joda"bot completely.
There's no place like ::1