= Statscollector for libpurple-based clients =

First off, I would like to thank all the Pidgin/libpurple developers who gave me the opportunity to work on this GSoC project.

This project aims to collect useful statistics about users of libpurple-based clients. As it is tied to Pidgin, I have focused mainly on Pidgin and Finch, which both use libpurple. The motivation is twofold: first, to let developers know which features to work on or optimize, and second, to gather some interesting facts about how people use IM services these days. This page first describes the types of statistics collected, then covers the client (plugin) and the server.

[[TOC(inline, noheading)]]

== For the crazy and impatient ==

For those who just want to see the final stats website, feel free to visit [http://stats.pidgin.im]. The plugin source for the 2.x.y Pidgin branch is housed at [http://hg.pidgin.im/soc/2012/sanket/statscollector-2.x.y 2.x.y-plugin] and the server source at [http://hg.pidgin.im/soc/2012/sanket/www-statscollector server].

== Statistics collected ==

If you visit [http://stats.pidgin.im] you can see a host of statistics that are currently collected. I will summarize them in the form of a list here:

 * System information
  * Type of Operating System -- Windows (breakdown), Apple (breakdown), Linux
  * Architecture information
   * Hardware
   * Operating System
   * Pidgin Code
  * Type of processor -- x86, x86_64, ppc, ppc64, etc.
 * Client information
  * Version of libpurple in use
  * UI in use -- Pidgin/Finch (haven't tested with Adium et al.)
 * Protocols
  * Purple Protocols -- jabber/irc/...
  * Average user count for each protocol
  * Breakdown on servers for jabber/irc (see note(1) below)
 * Plugins
  * Count of plugins

NOTE(1): The breakdown on servers can leak private information if a server is not public. For that reason I am developing a simple hash-based mechanism to determine whether a server is public before accepting raw names. This avoids sharing any private information!

== Plugin ==
It's a plain old libpurple plugin which does some crazy stuff to collect information about the client (both native and libpurple). You can always have a look at the source, so I will only mention a few challenges associated with writing the client.

=== OS/Hardware specific information ===
Operating systems such as Windows, Macintosh, Linux (in its myriad flavors) and some crazy ones make it difficult to collect common information such as the architecture type or the bitness of the hardware/OS. I had to go through a complicated maze of #ifdefs to complete this task. One interesting observation, though, is that POSIX compliance can generally save your day. In my case, I could classify the systems as either POSIX or Windows, much like IE vs. the rest of the world :-)

=== Privacy Concerns ===
As the plugin runs on the client side, it could potentially collect secret information. No worries; you should believe the disclaimer we are about to flash, though ;-). Ensuring that ONLY public information is published was an important consideration throughout. For example, in order to track whether the user is enabling the same account twice, we store only a hash of the account's uid instead of the uname@service string. This ensures that we do not store any sensitive information inside stats.xml (the file which contains all the stats data)! You should definitely have a look inside stats.xml (it resides in your Pidgin/libpurple home directory, ~/.purple in my case).
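
To illustrate the idea of tracking accounts by hash only, here is a minimal sketch. It is in Python for brevity (the real plugin is C), and the choice of SHA-256 as well as the names `account_fingerprint` and `is_duplicate` are my assumptions for illustration, not what the plugin actually uses.

```python
import hashlib


def account_fingerprint(username: str, service: str) -> str:
    """Return a hex digest that stands in for the uname@service string.

    SHA-256 is an assumption made for this sketch.
    """
    identifier = f"{username}@{service}".encode("utf-8")
    return hashlib.sha256(identifier).hexdigest()


def is_duplicate(username: str, service: str, seen: set) -> bool:
    """Report whether this account was already counted.

    Only fingerprints are kept in `seen`, never the raw identifier.
    """
    fingerprint = account_fingerprint(username, service)
    if fingerprint in seen:
        return True
    seen.add(fingerprint)
    return False
```

With this shape, whatever ends up persisted contains only digests, so reading the stored data does not reveal the account name.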


== Server ==
The server is basically a collator which collects all the submitted stats.xml files and transforms them into a useful database (we obviously don't want to be working on raw XML). For those interested, it's written in [http://djangoproject.com/ Django] and uses the awesome [http://highcharts.com/ Highcharts] JavaScript library. Thanks, Eion, for the recommendation on the charts library :-)!

=== Processing Stats ===
One major challenge for this server was to process the XML files efficiently, because ultimately it is going to see a lot of traffic and rendering the information has to be fast. I have followed this workflow: when a stats.xml is submitted, the server breaks the file down and stores it in a database schema. All user queries for date ranges are then of the simple form {{{select * from db where date >= d1 and date <= d2}}}. MySQL or any other RDBMS is ideally suited for such queries. I had to make sure that Django's abstraction did not hurt efficiency, because your logic can change the type of query you make -- without you knowing it!
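
The date-range query described above can be sketched roughly as follows. This uses Python's built-in sqlite3 instead of Django and MySQL so that it runs on its own; the `submission` table, its columns and the function names are hypothetical and not the actual server schema.

```python
import sqlite3


def build_db(rows):
    """Create an in-memory table of already broken-down submissions.

    rows: iterable of (submitted_date, os_name); dates are ISO strings
    (YYYY-MM-DD), which compare correctly as plain text.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE submission ("
        "id INTEGER PRIMARY KEY, submitted TEXT NOT NULL, os TEXT NOT NULL)"
    )
    # An index on the date column lets range scans avoid reading every row.
    conn.execute("CREATE INDEX idx_submission_date ON submission (submitted)")
    conn.executemany(
        "INSERT INTO submission (submitted, os) VALUES (?, ?)", rows
    )
    return conn


def os_counts(conn, d1, d2):
    """Count submissions per OS with d1 <= submitted <= d2."""
    cursor = conn.execute(
        "SELECT os, COUNT(*) FROM submission "
        "WHERE submitted >= ? AND submitted <= ? "
        "GROUP BY os ORDER BY os",
        (d1, d2),
    )
    return dict(cursor.fetchall())
```

The point of the sketch is that the expensive XML work happens once at submission time, while every chart request is a single range query over an indexed column.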

== Ensuring server names in prpl-jabber/irc are public ==

One problem, very rightly pointed out by elb (Ethan), with displaying the Jabber/IRC breakdown is that it may reveal private servers, which in turn can reveal the identity of users. We obviously don't want that :-). Also, if a user is running a local server for development purposes, we don't want to show that either. To solve these problems, the following mechanism has been provided:

=== Plugin ===
 * The plugin asks the Stats server for a trusted list of servers
  * This list actually contains md5 hashes of the server names
 * If the hash of the current server is in the list, then we can simply put its name in stats.xml; otherwise only the hash is sent, because
  * The server is yet to be determined as public
 * In both cases, the current stats count as evidence towards its being public
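
The plugin-side steps above can be sketched like this. Again it is in Python rather than the plugin's C, and the report layout (a dict with `hash` and `name` keys) and the function names are assumptions for illustration only.

```python
import hashlib


def md5_hex(server_name: str) -> str:
    """md5 digest of a server name, as used in the trusted list."""
    return hashlib.md5(server_name.encode("utf-8")).hexdigest()


def server_report(server_name: str, trusted_hashes: set) -> dict:
    """Build the per-server entry that would go into stats.xml.

    The hash is always included so the submission counts as evidence;
    the raw name is added only when the hash is already trusted.
    """
    digest = md5_hex(server_name)
    report = {"hash": digest}
    if digest in trusted_hashes:
        report["name"] = server_name
    return report
```

Because the trusted list itself holds only hashes, downloading it does not tell the plugin (or anyone sniffing it) the names of servers it does not already know.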

=== Server ===
 * The server checks whether the incoming stats contain only the hash, or both the hash and the name
 * If only the hash is present, it increments the confidence that the server is public
 * If both the hash and the name are present, it checks that md5(name) == hash and that the hash is in trusted_list
  * If both conditions are satisfied, the name is displayed; otherwise it is not
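
A rough sketch of the server-side check, following the list above literally. The function name, the report layout and the use of a `Counter` for confidence are hypothetical; how a hash eventually gets promoted into the trusted list is not covered here.

```python
import hashlib
from collections import Counter


def process_report(report: dict, trusted_hashes: set, confidence: Counter):
    """Return the server name if it may be displayed, else None.

    report: {"hash": ...} or {"hash": ..., "name": ...}
    confidence: per-hash count of hash-only sightings (updated in place).
    """
    digest = report["hash"]
    name = report.get("name")

    if name is None:
        # Hash only: count it as evidence that the server is public.
        confidence[digest] += 1
        return None

    # Hash and name: the name must really hash to the claimed digest,
    # and that digest must already be trusted.
    name_matches = hashlib.md5(name.encode("utf-8")).hexdigest() == digest
    if name_matches and digest in trusted_hashes:
        return name
    return None
```

Checking md5(name) against the submitted hash guards against a client attaching an arbitrary name to a trusted hash.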