Opened 11 years ago

Closed 8 years ago

Last modified 7 years ago

#1645 closed defect (fixed)

ICQ Encoding Problems

Reported by: I4ko
Owned by: elb
Milestone:
Component: ICQ
Version: 2.0.1
Keywords:
Cc: markdoliner, chroneus, datallah, pva, rpolach, ivan

Description

I've noticed some problems with Cyrillic text when using Pidgin, and before that Gaim.

  1. Status messages in Cyrillic (UTF-8) do not appear correctly in the contact list; they seem to be interpreted as a single-byte encoding rather than the UTF-8 they really are. This happens on WinXP Pro SP2 (official Pidgin build) and on Linux (NLD9/Suse 9.1, Pidgin compiled from source). Seen on the ICQ protocol; not verified on others. You may see screenshots from WinXP at http://hristo.todorov.googlepages.com/pidgin-probl.zip or in the attachment. QIP and ICQ6 both display them correctly.
  2. Offline messages in Cyrillic (UTF-8), again on the ICQ protocol, do not display correctly. They are correctly identified as two-byte characters, but each one is shown as a "?" (question mark) instead of the actual character. Otherwise Cyrillic works fine. A log file with such an offline message is in the archive above; it contains no other Cyrillic text. It was taken under Linux, but the same happens on Windows.

I am available for any tests required.

Attachments (12)

pidgin-probl.zip (40.7 KB) - added by I4ko 11 years ago.
small.pcap (7.2 KB) - added by I4ko 11 years ago.
oscar-icq-encoding-hack.diff (8.2 KB) - added by elb 11 years ago.
trial fix
halfok.PNG (36.0 KB) - added by I4ko 11 years ago.
Buddy list and pop-up are OK, buddy info window is not.
icqauth.PNG (12.5 KB) - added by I4ko 11 years ago.
Cyrillic text garbled in auth request, the text is "Разрешете моето искане и ме добавете към вашия списък с контакти ...."
offline-new (2.8 KB) - added by I4ko 11 years ago.
p.jpg (253.5 KB) - added by liorwohl 11 years ago.
Pidgin 2.1.1 (Ubuntu Gutsy) and ICQ away message in Hebrew. The encoding is set to CP1255.
paket (2.7 KB) - added by Tafkadasom2k5 11 years ago.
Here's an away message with broken encoding. Captured with Wireshark's "follow tcp-string" and "SAVE AS". The user is using ICQ5.1. Correct would be "Der Benutzer ist zurzeit nicht verfügbar."
oscar_utf8-1.diff (357 bytes) - added by goyko 11 years ago.
Does the same, but is less intrusive
oscar_utf8.diff (3.0 KB) - added by goyko 11 years ago.
Fixes empty fields in Buddy Info when their encoding is not UTF-8
oscar_status_userinfo.diff (952 bytes) - added by yuralol 11 years ago.
Fix status messages in user info window
DSC00384.JPG (84.2 KB) - added by liorwohl 11 years ago.

Change History (115)

Changed 11 years ago by I4ko

comment:1 Changed 11 years ago by elb

What do you have your encoding set to in the account preferences? Can you get a packet dump of a connection that shows these problems?

comment:2 Changed 11 years ago by I4ko

In account preferences: on Windows it is CP1251, which was selected automatically by Pidgin and would be right if the program were non-Unicode; on Linux it is UTF-8. After manually switching to UTF-8 on WinXP there is no change.

However, there is a way to detect Unicode, and I believe it's Pidgin that should transcode if there are different encodings. Both systems are in Unicode mode. The full packet dump is around 1400 packets (a full Pidgin start, with the Yahoo and Jabber protocols filtered out). I'll attach it here if you advise me how to remove the login info.

The attached dump only contains packets containing the UINs below. It should have info for the following two UINs: for UIN 64378489, "Status: Away: Потребителят в момента е излязъл." in the list and buddy info; for UIN 127846416, "Status: Away: Потребителят в момента е излязъл." in the list and buddy info.

Changed 11 years ago by I4ko

comment:3 Changed 11 years ago by elb

What client is buddy 64378489 using? As best I can tell, the away message for this buddy *really is* gibberish. If Pidgin is showing РџРѕС..., it is showing what is truly in the packet. I'm not sure what is going on here. The message is well-formed 2-byte Unicode for the HTML framing, but the text body contains junk.

comment:4 Changed 11 years ago by I4ko

Both buddies are using ICQ 5.1 Abv.bg, the official ICQ client localized for Bulgaria. Allow me to disagree with you that what's in the packet is gibberish. That's exactly what you see when interpreting UTF-8 as windows-1251 without transcoding. (I just double-checked it with Notepad txt files and my browser.)

The text is "Потребителят в момента е излязъл.", and it is recognized and shown OK by other clients. When interpreting

In fact, UIN 127846416 is the same jarja from the screenshots.

comment:5 Changed 11 years ago by elb

What is in those packets is *not* UTF-8, that is what I'm telling you; the text inside the away message block is in a two-byte encoding (look at it -- the HTML bytes (which are ASCII) are all preceded by an 0x0 byte; the character Р is encoded as 0x04, 0x20, which is U+0420 in UCS-2BE, which the encoding tag claims is the encoding of that message). There should be no UTF-8 involved anywhere in this process; I have been overlooking your mention of it, assuming that you simply believe that all Unicode is UTF-8. The fact that you recognize that string as a Windows-1251 rendering of UTF-8 leads me to believe that perhaps the remote client converted the away message from Windows-1251 to UTF-8, and then converted the UTF-8 string from Windows-1251 to UCS-2BE; this is, of course, nonsensical, but it would produce the string which is _actually stored_ in that capture file. (Please do not "disagree" with me about what is in the packet, please look at it yourself and confirm that it is, in fact, what I am claiming it is. You can do this by converting the text in the away message to Windows-1251, and then displaying it as UTF-8 -- you will see the correct away message.)

There is no way for us to predict this complete brokenness. If they are truly using the official client, then it is broken and mishandling encodings. I do not know how we can detect this situation and correct for it.
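
The chain of conversions described in this comment can be reproduced in a few lines. The following Python sketch is an illustration only: it is not Pidgin code, the variable names are made up for the example, and it assumes the remote client performed exactly the double conversion hypothesized above. Python's `utf-16-be` codec stands in for UCS-2BE, which is equivalent for the characters involved here.

```python
# Illustration only: reproduce the suspected double conversion.
original = "Потребителят в момента е излязъл."

# Step 1: the away message is converted to UTF-8 bytes (correct so far).
utf8_bytes = original.encode("utf-8")

# Step 2 (the suspected bug): the UTF-8 bytes are treated as if they
# were Windows-1251 text, which yields mojibake.
mojibake = utf8_bytes.decode("cp1251")

# Step 3: the mojibake is encoded as two-byte Unicode and sent.
wire = mojibake.encode("utf-16-be")

# Reversing the chain: decode the wire bytes, convert the text back to
# Windows-1251 bytes, then read those bytes as UTF-8.
recovered = wire.decode("utf-16-be").encode("cp1251").decode("utf-8")

print(mojibake[:6])           # the garbled prefix seen in the client
print(wire[:2].hex())         # first character on the wire
print(recovered == original)
```

The first two bytes on the wire come out as 0x04, 0x20, matching the U+0420 observation above, and the reverse chain returns the intended away message.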

comment:6 Changed 11 years ago by seanegan

I confirm that converting that UTF-16 to Windows-1251 results in that valid UTF-8 string.

Also, that does indeed seem to be an official ICQ client; we've worked around plenty of broken ICQ i18n problems before, and we should try to work around this one.

We already special-case "unicode-2.0," as it's not the canonical name of any encoding iconv understands. We should, therefore, assume ICQ reserves the right to choose how it should be interpreted (especially since other official clients can handle it), instead of just making it a synonym for UCS2-BE.

In this case, it seems like we should go through the list of fallback encodings, convert the "unicode-2.0" string to them, and see if the result validates as UTF-8. If none of them do, treat it as UTF-16. It's definitely crack.
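
A minimal sketch of that heuristic, written in Python rather than in Pidgin's C and not taken from any committed patch. The function name `decode_unicode_2_0` and the default fallback list are assumptions made for illustration; `utf-16-be` is used as the final interpretation, as suggested above.

```python
def decode_unicode_2_0(payload: bytes, fallbacks=("cp1251", "cp1252")) -> str:
    """Decode a 'unicode-2.0' payload, undoing a possible double conversion."""
    # First take the payload at face value as two-byte Unicode.
    text = payload.decode("utf-16-be")

    for encoding in fallbacks:
        try:
            # Convert the decoded text to the fallback encoding...
            raw = text.encode(encoding)
            # ...and check whether the resulting bytes are valid UTF-8.
            return raw.decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            # Not representable in this encoding, or not valid UTF-8.
            continue

    # None of the fallbacks produced valid UTF-8: keep the plain decoding.
    return text


if __name__ == "__main__":
    broken = "Потребителят".encode("utf-8").decode("cp1251").encode("utf-16-be")
    print(decode_unicode_2_0(broken))
    print(decode_unicode_2_0("Привет".encode("utf-16-be")))
```

Pure ASCII text passes the UTF-8 check trivially and comes back unchanged, so the heuristic only alters strings that contain non-ASCII characters.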

It seems this particular string is not a user-set string, but a hard-coded "auto-away" status message. I4ko, does this happen for all status messages, or just auto-status messages? If the latter, we may even be able to just special-case the few status messages that have this problem.

comment:7 Changed 11 years ago by seanegan

  • Milestone set to 2.0.2

This is a duplicate of #544, but I'm closing that one since this has much more information (Great bug report, I4ko!).

There are other, similar, "ICQ clients don't bother with decent i18n" bugs that are probably also duplicates of this. I'll go through them on a case-by-case basis.

comment:8 Changed 11 years ago by elb

I4ko, can users of the official client read your away messages properly?

comment:9 Changed 11 years ago by I4ko

It looks much more like a protocol/ICQ client related bug now.

elb: Neither the official ICQ6 nor the unofficial client (QIP) shows Pidgin's away messages, whether preset in English (Latin) or custom (Cyrillic). The official localized ICQ 5.1 abv.bg shows its own predefined message for a Pidgin away status and does not show custom text either.

Pidgin to Pidgin seems OK (tested in the same application, with the ICQ UIN added as a buddy in an AIM account). Pidgin does not show ICQ6 statuses at all, either custom or preset. QIP shows only Latin ICQ6 statuses. Pidgin correctly shows ICQ 5.1 abv.bg statuses in Latin.

A custom status set in QIP shows OK in ICQ 5.1 abv.bg, both Latin and Cyrillic, but does not show in ICQ6.

Jabber does not seem to be affected.

seanegan: These seem like predefined messages to me, but the build I got today had predefined messages in English after install. That may be because it was not a clean install: I had a generic ICQ 5.1 before, and some registry entries were not deleted, as I still had a list of the custom statuses I used.

I'm testing with these clients since the majority of Windows users in Bulgaria are on ICQ 5.1 abv, most of the rest on QIP, and just a few on ICQ6, Miranda, Trillian... Pidgin and old versions of Gaim are used by the Linux/Unix folks.

comment:10 Changed 11 years ago by elb

I4ko: thanks a lot for clarifying that, I really appreciate you keeping with us on this bug. You're providing a lot of very valuable information that we often have trouble getting.

I think we'll implement Sean's suggestion for viewing ICQ 5.1 away messages; we'll have to look more closely at ICQ6 statuses, I don't have any idea why they don't show (MarkDoliner might, hopefully he'll weigh in here at some point). With any luck, the 5.1 fix will be in 2.0.2 (although I'm running out of time to make that happen).

Changed 11 years ago by elb

trial fix

comment:11 Changed 11 years ago by elb

If you can, please try the diff I just attached to this ticket, and see if it helps things out.

comment:12 Changed 11 years ago by elb

I went ahead and committed, this is hopefully fixed for 2.0.2.

comment:13 Changed 11 years ago by elb

  • Resolution set to fixed
  • Status changed from new to closed

comment:14 Changed 11 years ago by hdima

Offline messages still don't display correctly in 2.0.2. For example, if the client disconnects and somebody sends an offline message with Cyrillic characters, then after reconnecting the message looks like this (ICQ):

(10:45:00 PM) Somebody: ??. ????? ??? english ? ???????? ???? (There was an error receiving this message. Either you and XXXXXXXX have different encodings selected, or XXXXXXXX has a buggy client.)

But when Cyrillic messages are sent while the client is online, they are received as expected. This holds for almost all the different clients on the other side.

comment:15 follow-up: Changed 11 years ago by I4ko

Hello, sorry for reopening but I was away from computers and internet for some days.

  1. The fix that Elb implemented does not work for me. I've tested with 2.0.1 patched and 2.0.2. However this is under linux and I compiled from sources. I cannot be quite sure since the system is NLD9 and is quite broken by itself. I'll test the official windows build this evening.

Btw what is the proper name for *1251, is it CP1251, CP-1251, or windows-1251, iconv seems to understand CP1251. This is because I've tried changing the UTF-8 in the oscar protocol encoding to 1251, but still no luck, so I'm back to UTF-8.

  2. The offline messages still behave the same, but they were not discussed, right?

comment:16 Changed 11 years ago by lschiere

  • Milestone changed from 2.0.2 to 2.1.2
  • Resolution fixed deleted
  • Status changed from closed to reopened

comment:17 in reply to: ↑ 15 Changed 11 years ago by elb

Replying to I4ko:

  1. The fix that Elb implemented does not work for me. I've tested with 2.0.1 patched and 2.0.2. However this is under linux and I compiled from sources. I cannot be quite sure since the system is NLD9 and is quite broken by itself. I'll test the official windows build this evening.

Does it not work any better than what was in before at all, or does it fix some things, but not everything, or what?

Btw what is the proper name for *1251, is it CP1251, CP-1251, or windows-1251, iconv seems to understand CP1251. This is because I've tried changing the UTF-8 in the oscar protocol encoding to 1251, but still no luck, so I'm back to UTF-8.

There is no proper name, it is a nonstandard encoding. Use whatever works with your iconv. Note that for the changes I committed to have any effect, you _must_ have your encoding set to CPwhatever.

  1. The offline messages still behave the same, but they were not discussed, right?

I don't know; if they were similarly broken, they should have been fixed. If they are not, they wouldn't be.

comment:18 Changed 11 years ago by I4ko

Hello Elb, I've tested with the official Windows build of Pidgin. It fixes some of the things when the protocol encoding is set to "CP1251"; although "WINDOWS-1251" is a valid iconv encoding, it yields no effect. With "CP1251" the message is correctly displayed in the buddy list, but is still not correct in the Buddy Information window.

Offline messages still do not seem OK; this is what I received for a random string of Cyrillic letters: "::9E:A409E:49".

This is not related to this ticket, but I see messages only with Away status. When the user has Not Available status, custom messages are not shown in the buddy list (it shows "Idle XXmin - Not Available"); maybe this is the way it is supposed to be. Also, other clients show the status message when the conversation window is opened for the contact; Pidgin does not seem to do that.

comment:19 Changed 11 years ago by elb

WINDOWS-1251 may not be a valid iconv encoding on Windows; these things vary from system to system. Furthermore, I'm not *really* sure what g_convert uses to do its dirty work on Windows, it may not even be iconv. In any event, if CP1251 works, so much the better.

I'm not sure I understand what's happening to offline messages ... is there any way you could get me a packet capture of one of those? That looks like UCS-2 being displayed as UTF-8 directly. Looking at our offline message decoding, though, I'm certain it's wrong; if I can get a sample of an offline message, I can probably fix it.

As for the N/A thing, you can file a separate ticket for that, someone will eventually get to it. :-)

comment:20 follow-up: Changed 11 years ago by hdima

For me on Windows XP the problem disappeared with the encoding set to CP1251. I've captured some packets and found an offline message encoded as CP1251. There may be a related problem: I tried the encoding setting "UTF-8,CP1251" but it doesn't work. That's strange, since UTF-8 can be easily recognized. It seems that if the UTF-8 codec can't decode some character, it's just replaced by "?", which is wrong behavior.

comment:21 in reply to: ↑ 20 ; follow-up: Changed 11 years ago by elb

Replying to hdima:

For me on Windows XP the problem disappeared with the encoding set to CP1251. I've captured some packets and found an offline message encoded as CP1251. There may be a related problem: I tried the encoding setting "UTF-8,CP1251" but it doesn't work. That's strange, since UTF-8 can be easily recognized. It seems that if the UTF-8 codec can't decode some character, it's just replaced by "?", which is wrong behavior.

ICQ does not support fallback encodings. Replacing an unrecognized character with '?' is not "wrong behavior"; what would you suggest we do?

comment:22 in reply to: ↑ 21 Changed 11 years ago by hdima

Replying to elb:

Replying to hdima:

For me on Windows XP the problem disappeared with the encoding set to CP1251. I've captured some packets and found an offline message encoded as CP1251. There may be a related problem: I tried the encoding setting "UTF-8,CP1251" but it doesn't work. That's strange, since UTF-8 can be easily recognized. It seems that if the UTF-8 codec can't decode some character, it's just replaced by "?", which is wrong behavior.

ICQ does not support fallback encodings. Replacing an unrecognized character with '?' is not "wrong behavior"; what would you suggest we do?

Ugh, if ICQ doesn't support fallback encodings then there is nothing to do. :-) I thought if UTF-8 can't be properly decoded the decoder should skip to the next available codec since UTF-8 can be differentiated from other encodings.
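
For illustration, the codec fallback hdima has in mind could look like the Python sketch below. This is not how Pidgin behaves (as noted above, ICQ does not support fallback encodings), and the function name `decode_with_fallbacks` is hypothetical. It relies on the fact that a strict UTF-8 decoder rejects most text in single-byte encodings such as CP1251.

```python
def decode_with_fallbacks(data: bytes, encodings=("utf-8", "cp1251")) -> str:
    """Try each encoding strictly, in order; return the first that decodes."""
    for encoding in encodings:
        try:
            return data.decode(encoding)  # strict: raises on invalid bytes
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the first encoding and substitute a
    # replacement character for every undecodable byte.
    return data.decode(encodings[0], errors="replace")


if __name__ == "__main__":
    print(decode_with_fallbacks("Привет".encode("utf-8")))
    print(decode_with_fallbacks("Привет".encode("cp1251")))
```

The order matters: UTF-8 has to come first, because a single-byte codec like CP1251 accepts almost any byte sequence and would otherwise mask valid UTF-8.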

comment:23 Changed 11 years ago by I4ko

I'll do the packet capture for the offline packets soon.

As for the incorrect representation of UTF-8 as 1251, the patch and 2.0.2 have corrected the problem in the buddy list. As I previously found, it's still present in the buddy information window, and now I have also found it in authorization requests: when someone asks you for authorization and enters a Cyrillic message in the request, the request displayed by Pidgin has the same UTF-8-as-1251 garbling.

Also, my Linux Pidgin is working after setting CP1251; it seems it was necessary to restart the whole program rather than just disable/enable the account.

comment:24 Changed 11 years ago by I4ko

Hello I'm unable to reproduce a problem with offline messages now.

I'm attaching some more screenshots of places with broken encoding.

Changed 11 years ago by I4ko

Buddy list and pop-up are OK, buddy info window is not.

Changed 11 years ago by I4ko

Cyrillic text garbled in auth request, the text is "Разрешете моето искане и ме добавете към вашия списък с контакти ...."

comment:25 follow-up: Changed 11 years ago by elb

Does that mean that offline messages work properly?

I'll look into the auth and info problems.

comment:26 in reply to: ↑ 25 Changed 11 years ago by I4ko

The offline messages I've been able to test are all displayed correctly. I've received some offline messages from contacts using third-party clients that are still not OK, but I can't reproduce them easily. I've done a capture with one client whose online messages are correct but whose offline messages are not. However, it may be that the client is broken. The UIN of the contact is 402329335.

Changed 11 years ago by I4ko

comment:27 Changed 11 years ago by seanegan

  • Owner set to elb
  • Status changed from reopened to new

comment:28 Changed 11 years ago by seanegan

  • priority changed from minor to major

#695 #2161 #1277 #1119 #829 #762 #2670 and #936 have been closed as duplicates of this issue. I'm raising the priority as it is a major issue. Hopefully it's possible to fix it.

comment:29 Changed 11 years ago by seanegan

#2132 also

comment:30 Changed 11 years ago by MarkDoliner

  • Cc markdoliner added

Changed 11 years ago by liorwohl

Pidgin 2.1.1 (Ubuntu Gutsy) and ICQ away message in Hebrew. The encoding is set to CP1255.

comment:31 Changed 11 years ago by seanegan

comment:32 Changed 11 years ago by seanegan

  • priority changed from major to critical

comment:33 Changed 11 years ago by seanegan

  • Component changed from pidgin (gtk) to ICQ

comment:34 Changed 11 years ago by Tafkadasom2k5

...just being jealous: Is there any information we can provide that would help you fix this? Using Vista/XP Pro with an ICQ account here.

comment:35 follow-up: Changed 11 years ago by elb

What I need is packet dumps of the particular packets which contain messages which are displayed incorrectly. So far, when presented with such information, we've been able to figure something out to fix it in most cases. For now, we also need to be sure that the remote client is using the official ICQ client -- the various third-party implementations are known to be more or less broken with respect to i18n, and we do not want to make the situation worse, if we can help it.

I would really like to see this issue resolved, as well.

comment:36 in reply to: ↑ 35 Changed 11 years ago by I4ko

Elb, sorry, I don't have time to take packet captures now, but the last 2 versions of Pidgin were broken on Win and Linux. I've set up two accounts on a spare computer which is going to be online for a while. You can take packet captures from these. They should be set with lowest privacy and no authentication.

456423097 - nickname Лилав; name Гълъб; surname Тестов. ICQ client - ABV ICQ 6 Will have status message for online "Тест за лилавия гълъб" as subject and text.

497905632 - nickname Лилав; surname Тестов. ICQ client - ABV ICQ 5.1 Will have status message for online "Тест за лилавия гълъ" as subject and "Тест за лилавия гълъб" as text.

I've set them not to go away, but if that happens I'll transcribe the messages.

Changed 11 years ago by Tafkadasom2k5

Here's an away message with broken encoding. Captured with Wireshark's "follow tcp-string" and "SAVE AS". The user is using ICQ5.1. Correct would be "Der Benutzer ist zurzeit nicht verfügbar."

comment:37 Changed 11 years ago by Tafkadasom2k5

Attached a tiny packet with broken encoding. I haven't done that before, so if other things are needed, please let me know. I am using Wireshark.

comment:38 Changed 11 years ago by Tafkadasom2k5

http://img219.imageshack.us/img219/359/qipawaydg7.png

Mh just recognized that with qip-users only the "BUDDY-info"-window (the one, which you can activate by clicking on the blue I-icon) doesn't seem to work. The pop-up-window in the "buddy-list" (lower right corner in the screenshot) has the correct encoding! Just like the "AIM-Info" -window.

comment:39 Changed 11 years ago by elb

If qip is a third-party client, we really need only information for the official ICQ clients. Third-party clients mess this up in any manner of weird and wonderful ways.

comment:40 Changed 11 years ago by Tafkadasom2k5

No comment about my network capture? That was an official ICQ5.1 user!

comment:41 Changed 11 years ago by elb

I appreciate your capture, I simply have not had time to look at it.

comment:42 Changed 11 years ago by elb

  • Summary changed from Some cyrillic problems to ICQ Encoding Problems

comment:43 Changed 11 years ago by reh

I analyzed the problem with the wrong German umlauts (see Tafkadasom2k5's post). It seems that the crap ICQ 5.1 client sends the away message in UTF-8 although it says >>> text/x-aolrtf; charset="iso-8859-1" <<< in the received packet. Don't know how to work around this bug. Maybe we have to somehow figure out the client used, or always interpret it as UTF-8 (this could break other stuff).

comment:44 Changed 11 years ago by Tafkadasom2k5

Sounds logical: In the away-messages, every german umlaut is shown as strange letters. The 5.1 client sends UTF-8 (2 bytes afaik?) and pidgin converts it into >2< ISO-charset letters.
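
The effect described in the last two comments is easy to reproduce. The Python sketch below is an illustration only, using the away message from the attached capture; it assumes the sender emits UTF-8 while the receiver trusts the declared `iso-8859-1` charset.

```python
# Illustration only: UTF-8 bytes decoded with the declared charset.
original = "Der Benutzer ist zurzeit nicht verfügbar."

# What the ICQ 5.1 client appears to put on the wire: UTF-8 bytes.
sent = original.encode("utf-8")

# What a receiver shows if it trusts charset="iso-8859-1": each byte of
# the two-byte UTF-8 sequence for the umlaut becomes its own character.
shown = sent.decode("iso-8859-1")

# Undoing the damage: turn the characters back into bytes, read as UTF-8.
repaired = shown.encode("iso-8859-1").decode("utf-8")

print(len("ü".encode("utf-8")))  # number of UTF-8 bytes in one umlaut
print(shown)
print(repaired == original)
```

Because ISO-8859-1 maps every byte to a character, this misdecoding never fails outright, which is why the result is displayed as garbage instead of raising an error.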

comment:45 Changed 11 years ago by goyko

Hi, I had the same problem as in #829, but I saw that you closed it as a duplicate of this bug, so I'm writing here. Here is a patch that solved my problems with non-ASCII strings in the Nick, FirstName and LastName fields (they were disappearing). The problem was in oscar_user_info_convert_and_add().

Changed 11 years ago by goyko

Does the same, but is less intrusive

comment:46 Changed 11 years ago by UbuPetr

I have seen quite similar behaviour with first versions of Pidgin. With current version I did not notice it. But I'll have to check it. My system is Windows-1250 (czech language) Windows XP SP2 Professional.

Changed 11 years ago by goyko

Fixes empty fields in Buddy Info when their encoding is not UTF-8

comment:47 Changed 11 years ago by goyko

Updated oscar_utf8.diff against 2.2.2
oscar_user_info_convert_and_add() does nothing, allocates buffer for conversion, then frees, without using it. Otherwise, same as oscar_user_info_convert_and_add_pair()

comment:48 Changed 11 years ago by alchark

Status messages and offline messages were fine for me with 2.2.1 when I set encoding to CP1251 in ICQ account preferences, but now with 2.2.2 I get status messages looking like UTF-8 text displayed with CP1251 character table. Offline messages are fine.

comment:49 follow-up: Changed 11 years ago by elb@…

(In 6df6c1c77de03ad4f66d5104b6262418260096f2) This is a fix from goyko for ICQ character set conversion in user info.

References #1645

comment:50 in reply to: ↑ 49 Changed 11 years ago by schoen

Replying to elb@pidgin.im:

References #1645

... and fixes #829.

comment:51 Changed 11 years ago by yuralol

I tested Pidgin 2.3.0. When I set the UTF-8 charset, the status is displayed fine in the buddy list and in user info. When I set cp1251, the status in user info is displayed with the cp1251 character table, like "Занят", but it is fine in the buddy list. I think the away message is always encoded in UTF-8.

Changed 11 years ago by yuralol

Fix status messages in user info window

comment:52 Changed 11 years ago by yuralol

I tried both UTF-8 and cp1251 encodings for my ICQ account. In both cases status messages are displayed correctly. Please try the diff. I tested this patch with the official ICQ client. Please test this patch between two Pidgin clients.

comment:53 Changed 11 years ago by loptosko

I did some tests about icq encoding bugs in Pidgin 2.3.0 I was able to reproduce 2 bugs see: http://pastebin.com/f48ac1e24 Don't know if it is helpful

comment:54 Changed 11 years ago by beret

Group names have similar problems. Please see ticket #3874

comment:55 Changed 11 years ago by UbuPetr

I have written a QIP fix plugin for Pidgin under the GPL license: http://fialky.com/drupal-5.0/?q=node/13. I would be delighted by new fixes and improvements to this plugin, or better still, its integration into the next release of Pidgin.

comment:56 Changed 11 years ago by k.joe

Please see #4978 and related tickets.

comment:57 Changed 11 years ago by SlavaN

Hello. Today I downloaded and installed the latest official version of Pidgin, 2.4.0. After installation, I cannot read incoming messages. I get output like this:

x4;?x4;0x4;@x4;0x4;<(An error in the admission of the message. Either you have chosen and 414367596 different encodings or 414367596 client contains errors.)

In version 2.3.1, it was not such an error, and the message read perfectly. The journal reports are displayed correctly. My system: WinXP SP2. How can I correct this error?

comment:58 Changed 11 years ago by SlavaN

I apologize. I forgot to point out that I have the encoding set to CP1251 in the ICQ protocol settings.

comment:59 Changed 11 years ago by SlavaN

Sorry. I found this link: http://developer.pidgin.im/ticket/4980. The method described there did not work for me. I had to revert to 2.3.1.

comment:60 Changed 11 years ago by liorwohl

In 2.4.0 the problem is gone if the contact uses ICQ6.

Changed 11 years ago by liorwohl

comment:61 Changed 11 years ago by Gaming

Okay, so far nobody wrote in plain text what is the workaround, so I will do:

  1. (Re)install pidgin-2.3.1; when asked to re-install older GTK version say "yes".
  2. Install current pidgin-2.4.0; when asked to install newer GTK version say "no".

That's it.

comment:62 Changed 11 years ago by GroennDemon

I'm experiencing the problems mentioned in ticket #4980 ("Error receiving this message") with German umlauts and both a friend who uses some MacOS client and another friend who uses Pidgin 2.3.1. Encoding *should* be cp1252, at least I never had any problems before upgrading to Pidgin 2.4.0. Upgrading to 2.4.1 did not help at all.

Also, with another buddy who uses Trillian, umlauts are displayed as completely different characters. ä=d, ö=v etc.

I'll try the workaround now, but please fix this.

comment:63 Changed 11 years ago by GroennDemon

Workaround fixed it!

Please incorporate this into the next version.

comment:64 Changed 11 years ago by Sim-on

  • Milestone set to 2.4.2

comment:65 follow-up: Changed 11 years ago by kalin

Now that this is set for 2.4.2, I'll do my best to help. I am using 2.4.1 on Gentoo linux and am often experiencing icq i18n issues as the following:

  1. cannot read Bulgarian from different people, although they read me fine:

???????? (There was an error receiving this message. Either you and 2726 have different encodings selected, or 27*26 has a buggy client.)

  2. if I send MSG in Bulgarian when a buddy is offline, it gets garbled at their end, although we communicate in Bulgarian when on-line
  3. some status MSGs in Bulgarian (I suppose) are shown garbled in pidgin.

My client is set to UTF-8 all of the time. I can take a packet dump, if somebody is willing to cooperate on the Windows side, preferably knowing Bulgarian/Russian. My ICQ is 289594325, online usually evenings in the JST (GMT+09:00) timezone.

comment:66 in reply to: ↑ 65 ; follow-up: Changed 11 years ago by elb

Replying to kalin:

  1. cannot read Bulgarian from different people, although they read me fine:

???????? (There was an error receiving this message. Either you and 2726 have different encodings selected, or 27*26 has a buggy client.)

This is not exactly a bug -- it's a usability problem, but it's simply a configuration problem. You need to set your encoding (in the account modify dialog) to an appropriate setting. Probably CP1251 or Windows-1251 for Bulgarian, although I can't be sure of that.

  2. if I send MSG in Bulgarian when a buddy is offline, it gets garbled at their end, although we communicate in Bulgarian when on-line

I believe this is in fact a bug, although I believe offline messages work sometimes. Offline messages, status messages, invitations, etc. are where I believe we have remaining encoding problems which are more than simply difficult configuration problems.

  3. some status MSGs in Bulgarian (I suppose) are shown garbled in pidgin.

Ditto above.

My client is set to UTF-8 all of the time. I can take a packet dump, if somebody is willing to cooperate on the Windows side, preferably knowing Bulgarian/Russian. My ICQ is 289594325, online usually evenings in the JST (GMT+09:00) timezone.

This is the biggest problem we have, is finding people using official clients to test against.

Ethan

comment:67 in reply to: ↑ 66 ; follow-up: Changed 11 years ago by kalin

Replying to elb:

Replying to kalin:

  1. cannot read Bulgarian from different people, although they read me fine:

???????? (There was an error receiving this message. Either you and 2726 have different encodings selected, or 27*26 has a buggy client.)

This is not exactly a bug -- it's a usability problem, but it's simply a configuration problem. You need to set your encoding (in the account modify dialog) to an appropriate setting. Probably CP1251 or Windows-1251 for Bulgarian, although I can't be sure of that.

Well, I partially agree with that, it will be more of a RFE, although I guess it will not be too difficult to implement if you think that the encoding in the prefs is what your client sends, and the encoding coming from the network is either autodetected or is included in the stream (I guess, for some protocols at least). In almost all cases I can write in Bulgarian or Japanese and people read properly (since pidgin is doing the right thing, I guess). Problem is reading what they say.

Speaking 5 languages, Cyrillic with 5 encodings and Japanese with 3 is not a fun soup to be drowned into. And it means that I cannot talk to a Shift_JIS and Win-1251 clients at the same time anyway.

Sorry to bloat this bug, this should be a separate bug, IMHO.

Fine for 2. and 3.

This is the biggest problem we have, is finding people using official clients to test against.

OK, at least I know the problem ;-)

It seems somebody has to go deep into enemy territory, no other way... fine with me, I'll do it.

I'll set up a VirtualBox with WinXP guest and put there the official client working with another icq#. Then if you help me, we can do some testing and debugging.

Kalin.

comment:68 in reply to: ↑ 67 Changed 11 years ago by elb

Replying to kalin:

Well, I partially agree with that, it will be more of a RFE, although I guess it will not be too difficult to implement if you think that the encoding in the prefs is what your client sends, and the encoding coming from the network is either autodetected or is included in the stream (I guess, for some protocols at least). In almost all cases I can write in Bulgarian or Japanese and people read properly (since pidgin is doing the right thing, I guess). Problem is reading what they say.

The problem here is that ICQ, specifically, does not include the encoding in the stream -- and autodetection is more or less useless. You can, for example, autodetect with some success if you know that the stream is "Cyrillic" or "Japanese", and you have a table of encodings and some sort of heuristic to say "oh, too many capitals, this must be KOI8-R and not Windows-1251" or whatever (with Japanese encodings it's even a bit easier, because of the shift encodings). However, to throw characters on the ground and say "what is this?" without some pretty specific hints is not feasible.
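
To make the "too many capitals" idea concrete, here is a rough Python sketch of such a heuristic. It is purely illustrative and not something Pidgin implements; the function names are hypothetical. It assumes the caller already knows the text is Cyrillic and has a short candidate list, which is exactly the kind of hint the comment above says is required. It works because KOI8-R and Windows-1251 place upper- and lowercase Cyrillic letters in swapped byte ranges, so decoding with the wrong one of the two flips the case of most letters.

```python
def uppercase_ratio(text: str) -> float:
    """Fraction of alphabetic characters that are uppercase."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch.isupper() for ch in letters) / len(letters)


def guess_cyrillic_encoding(data: bytes, candidates=("cp1251", "koi8_r")) -> str:
    """Pick the candidate whose decoding has the fewest capital letters."""
    best, best_ratio = candidates[0], None
    for encoding in candidates:
        try:
            text = data.decode(encoding)
        except UnicodeDecodeError:
            continue  # bytes are not valid in this encoding at all
        ratio = uppercase_ratio(text)
        if best_ratio is None or ratio < best_ratio:
            best, best_ratio = encoding, ratio
    return best


if __name__ == "__main__":
    sample = "Привет, как дела?"
    print(guess_cyrillic_encoding(sample.encode("cp1251")))
    print(guess_cyrillic_encoding(sample.encode("koi8_r")))
```

A heuristic like this is easily fooled by short strings or text written in all capitals, which illustrates why blind autodetection is described above as not feasible.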

If you set your outgoing encoding to UTF-8, most of the official clients will handle that; this is why people can read what you send. However, as you see, UTF-8 is not what the official clients send.

What you are seeing is a severe limitation (I would say bug) in the ICQ protocol itself. Unfortunately, we can't do a whole lot about it. In your case, you would need per-buddy or per-group (or similar) encoding preferences -- and we're not willing to compromise that much.

Speaking 5 languages, Cyrillic with 5 encodings and Japanese with 3 is not a fun soup to be drowned into. And it means that I cannot talk to a Shift_JIS and Win-1251 clients at the same time anyway.

In ICQ, specifically, you shouldn't see that much diversity. I think the ICQ Windows and Mac clients do use different encodings for Cyrillic languages (though I don't remember for sure), but in our experience you should not see more than one Japanese encoding. There are also some official client encoding bugs in ICQ which cause other "encodings" (gibberish strings, really) to show up, but I believe we identified and worked around those some time ago.

This is the biggest problem we have, is finding people using official clients to test against.

OK, at least I know the problem ;-)

It seems somebody has to go deep into enemy territory, no other way... fine with me, I'll do it.

I'll set up a VirtualBox with WinXP guest and put there the official client working with another icq#. Then if you help me, we can do some testing and debugging.

That would be helpful. I cannot guarantee how much time I can spend on it in the short term, but if we can get a solid inventory of what does and doesn't work against the official client (i.e., a matrix of the various features -- offline, status, invitation, normal message, etc. -- and whether or not they work when the encoding preference is set appropriately) and some data on what encodings the official client uses where, that would be awesome. At this point, at least from my point of view, we're data-limited.

comment:69 follow-up: Changed 11 years ago by Gaming

If this is an ICQ protocol bug, how come using an old GTK solves the problem?

comment:70 in reply to: ↑ 69 Changed 11 years ago by elb

Replying to Gaming:

If this is an ICQ protocol bug, how come using an old GTK solves the problem?

kalin is describing a different problem than you are -- he is using Linux, where the various character conversion mechanisms work. You are using Windows, where there are specific versions of the glib/gtk libraries which cause conversion problems. If you are using Windows, and Pidgin 2.4.0 or 2.4.1, there is indeed a conversion problem which can be fixed by downgrading Gtk+. This is not what we are talking about.

comment:71 Changed 11 years ago by MarkDoliner

elb: FYI that conversion problem should be fixed in 2.4.1 (by specifying UTF-16BE instead of UCS-2BE when converting stuff).

comment:72 Changed 11 years ago by kalin

All right, I went all the way down to hell: set up a fresh English WinXP, hacked the registry to say it is SP2, installed icq6 from icq.abv.bg, and am running it with # 338008815 and default settings. But my brain is dead, now going to bed...

I have pasted some Japanese, Bulgarian and English from my website (didn't bother to install proper Cyrillic KBD).

Status is perfectly readable in pidgin-2.4.1/Gentoo, somebody please check on pidgin/Windows (Original text from http://thinrope.net/, check the languages)

Sending Bulgarian and Japanese both ways is fine.

The only thing that does not seem to work so far is when icq6 is offline and I send Bulgarian from pidgin -> it gets garbled.

I'll leave the icq6 on, chat me for test. I'll respond if around.

Kalin.

comment:73 Changed 10 years ago by elb

OK, I've tried to distill the information in this ticket into some tables in the ICQEncodingProgress wiki page. This ticket is very valuable and has been helpful in fixing a number of problems, but unfortunately the information is rather hard to sift through. Please feel free to update the table on that page appropriately if you are an original poster and can clarify or correct data which I entered from this bug. Please do not modify the table based on someone else's comment, as I'd rather not propagate any more errors from misunderstanding than I might have already entered. ;-)

Those of you with access to original clients, feel free to run additional checks and fill out the table and provide packet dump data as possible.

I'd like to thank everyone on this ticket for helping out, and for providing such a wealth of information. With your continued help, we'll get this problem licked once and for all.

comment:74 Changed 10 years ago by Vorkronor

Encountered ICQ encoding problems after upgrading from 2.3.0 to 2.4.2 on Windows XP.

The original Czech text (contains all lowercase special chars) was:

žluťoučký kůň úpěl ďábelské ódy

It is received from Trillian as (the code block thing introduces an extra newline into it):

x1e;lu
x1d;ouhk} kyr zpll oabelski sdy

the same message from QIP 2005 reads:

žx9e;lux9d;ouиkэ kщт ъpмl пбbelskй уd

Pidgin 2.3.0 did not exhibit any of these errors. Encoding in account settings is set to CP1250 (did not touch that).

comment:75 Changed 10 years ago by beret

I must remark that the latter is a bug in QIP 2005, not in Pidgin. Here is a packet dump of a similar message from QIP 2005:

0000  00 00 e8 9f 26 ee 00 a0  c5 69 98 99 08 00 45 00   ....&... .i....E.
0010  00 fd 1f fb 40 00 67 06  5b b3 cd bc 07 e6 c0 a8   ....@.g. [.......
0020  01 02 14 46 e1 73 9d 1b  88 2d b2 7f b0 19 50 18   ...F.s.. .-....P.
0030  40 00 cc aa 00 00 2a 02  e1 71 00 cf 00 04 00 07   @.....*. .q......
0040  00 00 c5 51 68 00 00 00  00 00 00 00 00 00 00 01   ...Qh... ........
0050  09 32 30 37 32 34 34 30  31 30 00 00 00 06 00 01   .2072440 10......
0060  00 02 00 50 00 06 00 04  10 01 00 20 00 05 00 04   ...P.... ... ....
0070  43 55 33 7e 00 1d 00 14  00 08 01 10 e0 86 c9 5c   CU3~.... .......\
0080  61 36 97 3c de b4 e3 13  c5 cb 3c 57 00 0f 00 04   a6.<.... ..<W....
0090  00 00 66 cc 00 03 00 04  48 36 5b 51 00 02 00 5a   ..f..... H6[Q...Z
00a0  05 01 00 02 01 06 01 01  00 50 00 02 00 00 00 70   ........ .P.....p
00b0  04 48 04 3d 00 6c 00 69  00 9a 00 20 00 9e 00 6c   .H.=.l.i ... ...l
00c0  00 75 00 9d 00 6f 00 75  04 38 00 6b 04 4d 00 20   .u...o.u .8.k.M. 
00d0  00 6b 04 49 04 41 00 20  04 4a 00 70 04 3c 00 6c   .k.I.A.  .J.p.<.l
00e0  00 20 04 3f 04 31 00 62  00 65 00 6c 00 73 00 6b   . .?.1.b .e.l.s.k
00f0  04 39 00 20 04 43 00 64  00 79 00 0b 00 00 00 16   .9. .C.d .y......
0100  00 04 48 36 c2 1d 00 13  00 01 3a                  ..H6.... ..:     

QIP has a hard-coded conversion from CP1251 – you can see UTF-16 code points in the Cyrillic range, although the original message contained only characters from the range U+0020 through U+017F.
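As a rough illustration of what the dump suggests, the sketch below reverses that widening for the first word of the payload. It assumes, from this one dump only, that bytes 0xC0 through 0xFF were shifted into the Cyrillic block while lower bytes such as 0x9E were passed through unchanged, and that the sender's real charset was CP1250. The function name is hypothetical; this is not QIP's or Pidgin's code.

```python
def undo_qip_widening(utf16be: bytes, real_charset: str = "cp1250") -> str:
    """Recover the sender's single-byte text from the widened UTF-16BE payload."""
    text = utf16be.decode("utf-16-be")
    raw = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0x0410 <= cp <= 0x044F:
            # U+0410..U+044F correspond to CP1251 bytes 0xC0..0xFF.
            raw.append(0xC0 + (cp - 0x0410))
        elif cp <= 0xFF:
            # Bytes that appear to have been passed through as-is.
            raw.append(cp)
        else:
            raise ValueError(f"unexpected code point U+{cp:04X}")
    # Decode the recovered bytes with the charset the sender really used.
    return raw.decode(real_charset)


# Bytes copied from offsets 0x00bc..0x00cd of the dump above.
payload = bytes.fromhex("009e 006c 0075 009d 006f 0075 0438 006b 044d")
print(undo_qip_widening(payload))  # žluťoučký
```

The catch is the `real_charset` argument: nothing in the packet says it was CP1250, so a receiver would have to guess it or take it from a preference.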

comment:76 follow-up: Changed 10 years ago by beret

Additional discovery: To most other clients (notably ICQ 5.1, ICQ 6, Miranda, or another QIP), QIP 2005 sends messages in UTF-8 (not in UTF-16BE as seen above) and with correct conversion. Something makes QIP determine which encoding it will be using. If we find out how to make QIP send UTF-8 messages to Pidgin too, this problem could go away – it is very painful for Czech Pidgin users because there could be about 40% of them using QIP 2005 every day.

The key might be client capabilities. Pidgin shows up in QIP having only a few of them (however, UTF-8 is listed among those), and having protocol version 0, whilst Miranda has version 8, official ICQ client has 9 and QIP has version 11.

(It has been affirmed by the QIP developers that the bug in QIP 2005 is present, but they refuse to fix it because the upcoming version of QIP is free of this bug. However, QIP Infium has not yet been released.)

comment:77 follow-up: Changed 10 years ago by birger

I had the same problem. It seems that it was a problem with GTK. Things started to work again only after I manually cleansed my Windows Machine from GTK (removed registry entries and deleted files in %ProgramFiles%\Common Files\GTK) and reinstalled pidgin with GTK included.

comment:78 in reply to: ↑ 77 Changed 10 years ago by elb

Replying to birger:

I had the same problem. It seems that it was a problem with GTK. Things started to work again only after I manually cleansed my Windows Machine from GTK (removed registry entries and deleted files in %ProgramFiles%\Common Files\GTK) and reinstalled pidgin with GTK included.

You were seeing an unrelated Gtk+ bug with similar effects; there are other instances of that bug cropping up in this discussion. We have since worked around it, but other encoding problems remain.

comment:79 Changed 10 years ago by Sim-on

For Reference: a patch in #5943 fixes the wrong encoding in the buddy-info-dialog...

comment:80 in reply to: ↑ 76 Changed 10 years ago by beret

The key might be client capabilities. Pidgin shows up in QIP having only a few of them (however, UTF-8 is listed among those), and having protocol version 0, whilst Miranda has version 8, official ICQ client has 9 and QIP has version 11.

I did some research (http://live.jabbim.cz/1734-qip_2005_mojibake_the_cause) to validate this, and it turns out I was right. I have created a separate ticket (#6208) because I think this is just an enhancement, not a critical bug.

comment:81 Changed 10 years ago by liorwohl

I think the bug is fixed now. There are no problems at all with ICQ6 statuses; I can read Hebrew there. The problem was with ICQ5.1, which is not in use anymore.

comment:82 Changed 10 years ago by chroneus

Still have a problem. Pidgin 2.4.3 with CP1251 on the latest Ubuntu (Hardy) cannot receive online messages from various mobile ICQ clients, such as jimm. They show up as: ????? (There was an error receiving this message. Either you and XXXXXXXX have different encodings selected, or XXXXXXXX has a buggy client.)

<angry> This problem is with Pidgin only; ICQ clients on other stations (Adium, ICQ6, Miranda) work like a charm. I don't know who got it all wrong, the ICQ protocol developers, GTK, or the QIP and ICQ plugin programmers, but if you guys are not able to separate Unicode text from a single-byte encoding after 14 months, please stay away from developing text messaging programs. </angry>

comment:83 follow-up: Changed 10 years ago by GroennDemon

The problem mentioned in this ticket and ticket #4980 still exists with Pidgin 2.5.0. Instead of German umlauts I receive ascii letters ("v" instead of "ö") and so on.

I'm going to use the workaround mentioned in #4980 (installing Pidgin 2.3.1 and then upgrading to 2.5.0 without upgrading GTK+), but this really is a PITA. Obviously someone on the GTK team screwed this up, but how can a bug this annoying not be fixed after half a year?

comment:84 in reply to: ↑ 83 Changed 10 years ago by datallah

Replying to GroennDemon:

The problem mentioned in this ticket and ticket #4980 still exists with Pidgin 2.5.0. Instead of German umlauts I receive ascii letters ("v" instead of "ö") and so on.

I'm going to use the workaround mentioned in #4980 (installing Pidgin 2.3.1 and then upgrading to 2.5.0 without upgrading GTK+), but this really is a PITA. Obviously someone on the GTK team screwed this up, but how can a bug this annoying not be fixed after half a year?

Yours is a different issue, not at all related to this.

comment:85 Changed 10 years ago by GroennDemon

Could you reopen #4980 then?

comment:86 Changed 10 years ago by igor4u

Confirm encoding problem under 2.5.0. Messages from QIP broken. Messages from ICQ OK. WinXP.

comment:87 Changed 10 years ago by qdinar

Hello. I use Pidgin 2.2.1 on Ubuntu 7.10. The encoding problem shows up here: when somebody sends me a message in win1251 while my IM client is offline, then when I come online it is displayed as if it were encoded in iso-8859-1.

Sometimes this also happens with QIP (old non-Unicode versions, I think): their messages are decoded as iso-8859-1.
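The symptom described here (win1251 bytes displayed as iso-8859-1) can be reproduced and reversed in a few lines. This is only a sketch of the mechanism; the helper name is made up for illustration, and it assumes the garbled text has not been altered further after the wrong decode.

```python
def redecode(garbled: str, wrong: str = "latin-1", right: str = "cp1251") -> str:
    """Undo a wrong single-byte decode by restoring the bytes and decoding again."""
    return garbled.encode(wrong).decode(right)


original = "привет"
# What the receiver sees when CP1251 bytes are treated as ISO-8859-1.
garbled = original.encode("cp1251").decode("latin-1")
print(garbled)            # ïðèâåò
print(redecode(garbled))  # привет
```

Because ISO-8859-1 maps every byte to a code point, the wrong decode loses no information, which is why the round trip is exact.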

A group name in Pidgin is shown as Собеседники . I'll try to rename it again, but the one time I tried to rename it, I could not; it was not renamed properly.

comment:88 Changed 10 years ago by qdinar

All three of these things are about ICQ.

comment:89 Changed 10 years ago by eugene2k

Here are the contents of the text in the ICQ auth request. Pidgin 2.5.3 on Ubuntu Intrepid. This is UTF-8. Pidgin apparently tries to decode it as CP1251.

0000   09 34 34 32 30 39 35 33 36 39 00 ad d0 a1 20 d0  .442095369.... .
0010   bf d0 be d0 bc d0 be d1 89 d1 8c d1 8e 20 d1 8d  ............. ..
0020   d1 82 d0 be d0 b9 20 d0 bf d1 80 d0 be d0 b3 d1  ...... .........
0030   80 d0 b0 d0 bc d0 bc d1 8b 2c d0 b2 d1 8b 20 d0  .........,.... .
0040   bc d0 be d0 b6 d0 b8 d1 82 d0 b5 20 d0 bf d0 be  ........... ....
0050   d1 81 d0 bc d0 be d1 82 d1 80 d0 b5 d1 82 d1 8c  ................
0060   20 d1 81 20 d0 ba d0 b5 d0 bc 20 d0 b8 20 d0 be   .. ...... .. ..
0070   20 d1 87 d0 b5 d0 bc 20 d0 be d0 b1 d1 89 d0 b0   ...... ........
0080   d0 b5 d1 82 d1 81 d1 8f 20 d0 b2 d0 b0 d1 88 20  ........ ...... 
0090   d0 b4 d1 80 d1 83 d0 b3 20 d0 bf d0 be 20 69 63  ........ .... ic
00a0   71 2e 20 68 74 74 70 3a 2f 2f 75 70 77 61 70 2e  q. http://upwap.
00b0   72 75 2f 32 31 36 38 34 32 00 00                 ru/216842..

The message in UTF-8 is "С помощью этой программы, вы можите посмотреть с кем и о чем общается ваш друг по icq." (roughly: "With this program, you can see whom your friend talks to on ICQ and what about.") plus the URL of the trojan.
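One common way to handle a field that may or may not be UTF-8 is to try a strict UTF-8 decode first and fall back to the account's configured charset only if that fails. The sketch below shows the idea on the first bytes of the dump above; it is an assumption about a possible approach, not a description of what Pidgin does, and the function name is hypothetical.

```python
def decode_with_fallback(raw: bytes, account_charset: str) -> str:
    """Prefer strict UTF-8; otherwise use the account's legacy charset."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so assume the legacy single-byte charset.
        return raw.decode(account_charset, errors="replace")


# First 17 payload bytes of the auth request dump (offset 0x0c onward).
sample = bytes.fromhex("d0a120d0bfd0bed0bcd0bed189d18cd18e")
print(decode_with_fallback(sample, "cp1251"))                      # С помощью
print(decode_with_fallback("привет".encode("cp1251"), "cp1251"))   # привет
```

Strict UTF-8 validation rejects most legacy single-byte text containing non-ASCII letters, but very short strings can pass by accident, so this remains a heuristic.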

comment:90 Changed 10 years ago by datallah

Ticket #7645 has been marked as a duplicate of this ticket.

comment:91 follow-up: Changed 10 years ago by kostyantyn

Hi, it's not UTF-8, it's UTF-16. As we can see from these examples (the main text is in UTF-16): 1) word "hedgehog" - (ї)жак

00 BF 04 36 04 30 04 3A

2) word "Europa" - (Є)вропа

00 AA 04 32 04 40 04 3E 04 3F 04 30

3) word "city" - м(і)сто

04 3C 00 B3 04 41 04 42 04 3E

  • Note that all the strangely displayed symbols start with 00. That means the main codepage is used to interpret them. For Windows configured to use Cyrillic, the main codepage is cp1251. So when we look these symbols up in it we get ї, Є and і.

comment:92 in reply to: ↑ 91 ; follow-up: Changed 10 years ago by elb

Replying to kostyantyn:

it's not UTF-8, it's UTF-16. As we can see from these examples (the main text is in UTF-16): 1) word "hedgehog" - (ї)жак

00 BF 04 36 04 30 04 3A

What you are describing isn't UTF-16, it's madness. U+00BF is not ї, it is ¿. Same for the other text you posted. Neither UTF-16 nor any other Unicode encoding works the way you describe. The text that eugene2k pasted is indeed valid UTF-8.

  • Note that all the strangely displayed symbols start with 00. That means the main codepage is used to interpret them. For Windows configured to use Cyrillic, the main codepage is cp1251. So when we look these symbols up in it we get ї, Є and і.

What client is producing these messages that you're pasting? I strongly suspect it is not the official ICQ client.

comment:93 in reply to: ↑ 92 Changed 10 years ago by kostyantyn

Replying to elb:

What you are describing isn't UTF-16, it's madness. U+00BF is not ї, it is ¿. Same for the other text you posted. Neither UTF-16 nor any other Unicode encoding works the way you describe. The text that eugene2k pasted is indeed valid UTF-8.

00 means that it is just an "ASCII" code (a symbol from the main codepage). On Cyrillic Windows, the ASCII codepage = the cp1251 codepage. So on Windows you will see ї, not this strange "¿" symbol. But on my Linux box, instead of ї, є, і (and maybe some others, plus their capital equivalents) I see symbols from the normal ASCII codepage, which is sometimes confusing (when you see squares or other strange symbols).

What client is producing these messages that you're pasting? I strongly suspect it is not the official ICQ client.

Yes. It's a QIP problem; I have seen a lot of people with this problem. I have cp1251 configured for ICQ in Pidgin, but this doesn't help. In some cases the QIP option CI (you can find it in the config file) can change the situation, but by default it isn't correct (I think it changes the symbol coding method so that all symbols are in UTF-16).

comment:94 Changed 10 years ago by kostyantyn

And one more thing. I think these UTF-16 chars (the ones that start with 00) need to be converted using the charset setting for this account in Pidgin, or the one from the server's description of the person you are writing to.
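A minimal sketch of that proposal, using the "hedgehog" bytes quoted earlier in this thread: decode the payload as UTF-16BE, then reinterpret only the code points in U+0080..U+00FF as bytes of the account charset. The function name is hypothetical, and this is not something Pidgin is known to do; it would also mangle messages that legitimately contain characters such as ¿ or é.

```python
def reinterpret_high_latin1(text: str, account_charset: str = "cp1251") -> str:
    """Treat U+0080..U+00FF as raw bytes of the account charset."""
    out = []
    for ch in text:
        cp = ord(ch)
        if 0x80 <= cp <= 0xFF:
            try:
                out.append(bytes([cp]).decode(account_charset))
            except UnicodeDecodeError:
                out.append(ch)  # byte is undefined in that charset; keep as-is
        else:
            out.append(ch)
    return "".join(out)


payload = bytes.fromhex("00BF 0436 0430 043A")
plain = payload.decode("utf-16-be")
print(plain)                           # ¿жак
print(reinterpret_high_latin1(plain))  # їжак
```

The first print shows why the message looks wrong to a standards-following client: U+00BF really is the inverted question mark, and only the cp1251 reinterpretation turns it into ї.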

comment:95 follow-up: Changed 10 years ago by eugene2k

00 means that it will be just "ASCII" code(main codepage symbol) --- ASCII is a character table that goes up to 127. The rest is reserved.

So in windows you will see ї, but not this "¿" strange symbol. --- You don't have different characters in Unicode depending on the active codepage. That's the whole point of the standard: to not have codepages. What you have instead is a set of standardized Unicode symbols, with the inverted question mark being 0x00BF. Check the symbol table in Windows yourself if you don't believe me.

If QIP displays 0x00BF as (ї), then it's a QIP problem, because that surely isn't in the UTF-16 standard. Moreover, QIP must be doing some additional processing to replace the characters it receives.

comment:96 in reply to: ↑ 95 Changed 10 years ago by kostyantyn

Replying to eugene2k:

If QIP displays 0x00BF as (ї), then it's a QIP problem, because that surely isn't in the UTF-16 standard. Moreover, QIP must be doing some additional processing to replace the characters it receives.

It's just my interpretation of what happened and why Windows shows Cyrillic messages from QIP normally. I understand a lot of things in the Linux/Unix world (I'm porting Mono right now), so I don't want to speak about implementation details. I found why this happens and how we can fix it. If only I had the free time to do it myself. Also, I know that it is a QIP bug; I dislike this client and ask my friends to use other clients or protocols, but that can't solve the problem for now. Some users have this problem with Pidgin/QIP (I have already seen other Slavonic users with it), but Kopete/QIP, for example, doesn't have such a problem. I'm a GNOME/GTK+ user so I prefer Pidgin, but there are some small bugs.

PS: Also I have a problem with Cyrillic letters in authorization request.

comment:97 Changed 10 years ago by sedaha

I have made a plugin to correct the "broken WINDOWS-1251 to UTF8 QIP 2005 encoding". Its name is "QIP Decoder"; you can find it in wiki:ThirdPartyPlugins and see the source for details. The Russian QIP authors did not make even the WINDOWS-1251 encoding fully working. I think the "QIP problem" can be solved only in a local scope (one selectable common encoding, like WINDOWS-1250); this is what I am doing in the plugin (with auto detection).

(edit by darkrain42: wiki-link to the third-party plugins page)

comment:98 Changed 9 years ago by darkrain42

Ticket #8025 has been marked as a duplicate of this ticket.

comment:99 Changed 9 years ago by darkrain42

Ticket #6466 has been marked as a duplicate of this ticket.

comment:100 Changed 9 years ago by rekkanoryo

Mark, what are the chances that this situation is at least somewhat improved with your most recent changes for auth request encoding?

comment:101 Changed 9 years ago by MarkDoliner

My change for auth request was very specific for auth request encoding and should not affect anything else.

comment:102 Changed 8 years ago by MarkDoliner

  • Cc ivan added

Adding Ivan to the cc list. As far as I know all ICQ encoding issues have been fixed. That is to say, Pidgin behaves in a way that is 100% compatible with official ICQ clients, and as compatible as possible with 3rd party clients while not breaking compatibility with the official client.

If people still encounter problems I suspect it's the fault of the other client. But feel free to post specific problems, and include information about what client the other person was using.

comment:103 Changed 8 years ago by Robby

  • Resolution set to fixed
  • Status changed from new to closed

I would suggest we track new, specific problems in new tickets. This ticket is a mess. :)
