Fixing Government Data Duplication at DataKind Bangalore


Voters worldwide seldom interact with their chosen leaders, except around 5-yearly elections. However, the advent of advanced Information and Communication Technologies (ICT) might shrink this interval considerably. They may even turn back the clock towards the seminal Athenian model of democratic decision-making: directly by the people rather than their representatives. With some political discretion, today’s online forums can similarly incorporate crowdsourced public opinion into policy design. This could contribute to nationally important initiatives (such as preparing Morocco’s 2011 Draft Constitution or debates on Spain’s Plaza Podemos, Brazil’s E-democracia portal and India’s own equivalent). Nonetheless, we will concern ourselves with far more universal and local problem-solving at the municipal level.

But just who has access to such platforms? While internet penetration in rural India is rising dramatically, the lion’s share (67%) still resides with urban denizens. Moreover, as highlighted by the Wall Street Journal, India boasted of a quarter of the world’s fastest growing urban zones and 8 qualifying ‘MegaCities’ as per India’s 2011 Census definition. The demands on municipal governments are likely to be considerable, and even more likely to be mediated by internet platforms.

Regardless of this explosion of population and the associated challenges, the structure of municipal bodies has remained unchanged since Lord Ripon’s 1882 Resolution on self-government. Furthermore, as Ramesh Ramanathan of Janaagraha points out, the responsibility for action is de facto scattered across acronyms of acrimonious accusing agencies. For example, Bangalore’s (deep breath advised) BDA, BMRDA, BWSSB, BMTC, KSB and BESCOM together juggle the city’s water, transport, electricity, traffic police and development needs. Many authorities, little authority. Increasingly internet-savvy and increasingly increasing residents. Where can they all turn for help?

Enter DataKind Bangalore Partners.

The 15-year-old Janaagraha has endeavoured to improve the quality of urban life (in terms of infrastructure, services and civic engagement) by coordinating government and citizen-led efforts. Of their various initiatives, the IChangeMyCity portal earned Discover ISIF Asia’s award under the Rights and People’s Choice categories.

Next up, eGovernments Foundation, brainchild of Nandan Nilekani and Srikanth Nadhamuni (Silicon Valley technologist), has since 2003 sought to transform urban governance across 275 municipalities with scalable and replicable technology solutions (for Financial Accounting, Property & Professional Taxes, Public Works, etc.). Their Public Grievance and Redressal system for the Municipal Corporation of Chennai, recipient of the 2010 Skoch Award, has fielded over 0.22 million complaints over 6 years.

Though these organizations joined hands with DataKind in two distinct ‘Sprints’, the similarities are remarkable. Both their platforms allow citizens to primarily flag problems (garbage, city lighting, potholes) at the neighbourhood level for resolution by government agencies.

Then again, the differences are noteworthy too. As an advocacy-oriented organization, Janaagraha aimed to understand the factors that led to certain complaints being closed promptly by a third party. eGovernments, on the other hand, being within the system, aimed to keep officials and engineers adequately prepared for business as usual and to alert them immediately to anomalies. So both sought predictions around complaints: one on their creation, the other on their likelihood of closure.

Clearly, quite a campaign lay ahead. If we forget Ancient Greek democracy and hitch a caravan to China, then Sun Tzu’s wisdom from the Art of War pops in: knowing oneself is the key to victory. Always open to relevant philosophy, the DataKinders looked into their own ranks to assess their strengths. The team assigned to eGovernments coincidentally included Ambassadors (Chapter Leader, Vinod Chandrashekhar) and Data Experts (Samarth Bhargav, Sahil Maheshwari) from the Janaagraha project. At different junctures, the teams were also joined by the latter’s Vice President (Manu Srivastava) and two of his interns, plus a multidisciplinary mob of volunteers from backgrounds in business consulting, UX design, data warehousing, development economics and digital ethnography. Let’s see how they waged war.

Progress to Date

Back in March 2015, IChangeMyCity presented a set of 18,533 complaints carrying rich metadata on Category, Complainant Details, Comments, etc. You’d assume this level of detail opens doors to appetizing analyses. Perhaps. Unfortunately, the information dwelt in a database of 10 different tables. Sahil Maheshwari, then working as a Product Specialist, busied himself with the onerous task of unraveling the relationships between them, drawing up an ER diagram and ‘flattening’ records into one combined table. The team then fished out missing or anomalous values.

Conversely, eGovernments users report their problems online, through SMS, on paper forms or by calling the special ‘1913’ helpline, where operators transcribe complainants’ inputs. With digital data being entered through drop-down menus rather than free text (either directly by users or by call centre employees), no major missing data was to be found. Except, of course, for unresolved cases: a mere 8% of the 0.18 million complaints. Some entries, amounting to 0.8%, were exactly identical, clearly a technical glitch. Moreover, all data resided in one table. So in November 2015’s DataJam, this structure allowed the team to plunge immediately into exploratory analysis.

Across the 200 wards of Chennai, 93 kinds of complaints (grouped further into 9 categories) could be assigned to departments at either the City or Zone level. Although the numbers initially seemed staggering, Samarth Bhargav ran basic visualizations in the R programming language. The result? Another instance of Pareto’s rule: 15 of these complaint types were contributing to 82% of grievances. Several DataKind first-timers like Aditya Garg and Venkat Reddy ran similar analyses for the 10 most given-to-grumbling wards, and found trouble emanating from roughly the same top 5 sources. Apparently, malfunctioning street lights blow everyone’s fuse. These common bugbears intriguingly became less bearable (and more numerous) in the second half of the year, while others related to taxes seemed more even across the year.
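The Pareto check described above boils down to a cumulative share over complaint types sorted by frequency. The team worked in R; what follows is only a minimal Python sketch of the same idea, and the function name and sample labels are hypothetical rather than taken from the eGovernments data.

```python
from collections import Counter


def top_types_covering(complaint_types, target_share):
    """Return the most frequent complaint types whose combined share
    first reaches target_share, plus the share actually covered."""
    counts = Counter(complaint_types)
    total = sum(counts.values())
    if total == 0:
        return [], 0.0
    covered = 0
    selected = []
    # most_common() yields types from most to least frequent
    for complaint_type, n in counts.most_common():
        selected.append(complaint_type)
        covered += n
        if covered / total >= target_share:
            break
    return selected, covered / total


# Illustrative (made-up) sample: 10 complaints across 3 types
sample = ["street light"] * 6 + ["garbage"] * 3 + ["tax"] * 1
types, share = top_types_covering(sample, 0.8)
print(types, share)
```

Run against the real complaint-type column with a target of about 0.8, a check like this would show how few types account for most grievances.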

Even so, how could there be 10 broken lights in an area with only one on record? So had ten people all indicated the same light? Like with data analysis, learning from Chinese classics (literally) involves reading the fine print. Sun Tzu’s actual words: ‘If you know the enemy AND know yourself, you need not fear the result of a hundred battles.’ Clearly, this enemy was a lot more complicated than the decoy flanks that DataKinders had speared. Tzu and George Lucas may well have hung out over green tea.

Attack of the Clones

In usual data science settings, duplicates are often easy to identify and provide little intrinsic value. However, the game changes in the world of crowdsourced data, especially data highlighting the criticality of an issue. So to achieve victory, the team would have to understand and strike at its core: dynamic social feedback. We could assess its importance at four levels.

The first involves messages from the platform itself to indicate that a complaint has been registered and no further inputs are necessary. In their absence, citizens could well create duplicates by hitting the Submit button either accidentally (not knowing if their complaint was logged) or deliberately (hoping that repeating the complaint may lead to quicker action). This is more of a concern for web platforms than for call centres. By matching against columns involving email, phone and postal contact details and the date, time and type of the complaint, DataKind had already been able to quickly hurl out these obvious clones.
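Removing these obvious clones amounts to keeping the first record for each combination of contact details, timestamp and complaint type. Here is a minimal sketch, assuming each complaint is a dictionary; the field names (`email`, `phone`, `address`, `filed_at`, `complaint_type`) are hypothetical placeholders, not the partners’ actual column names.

```python
def drop_exact_clones(complaints,
                      key_fields=("email", "phone", "address",
                                  "filed_at", "complaint_type")):
    """Keep the first complaint seen for each combination of key fields.

    Values are compared after trimming whitespace and lower-casing, so
    trivially different spellings of the same entry still match.
    """
    seen = set()
    unique = []
    for complaint in complaints:
        key = tuple(str(complaint.get(field, "")).strip().lower()
                    for field in key_fields)
        if key in seen:
            continue  # same person, same moment, same type: an obvious clone
        seen.add(key)
        unique.append(complaint)
    return unique


# Illustrative records (made up): the second is a double submission
records = [
    {"email": "a@example.com", "phone": "111", "address": "1 Main Rd",
     "filed_at": "2015-03-01 10:00", "complaint_type": "Street light"},
    {"email": "A@example.com ", "phone": "111", "address": "1 Main Rd",
     "filed_at": "2015-03-01 10:00", "complaint_type": "street light"},
    {"email": "b@example.com", "phone": "222", "address": "9 Lake Rd",
     "filed_at": "2015-03-02 09:30", "complaint_type": "Garbage"},
]
print(len(drop_exact_clones(records)))
```

This only catches repeat submissions by the same person; the same issue reported by different residents passes straight through.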

The second level of feedback is where the Force truly awakens- from other citizens. The ability to see that other fellow residents have experienced the same concern may prevent its repetition. But this rests on two assumptions. First, that they can view already posted complaints, as exists with IChangeMyCity. They may rally behind this shared cause by ‘upvoting’- an indicator to authorities of its increased importance.

Even if this feature does not exist, as with eGovernments, all is not lost. High priority might still be inferred from large absolute numbers of complaints. But these would provide an idea of the severity of the problem across the ward (45 potholes in Adyar) rather than one specific instance of it (that life-threatening one before the flyover). Secondly, if peeved citizens do not put in the effort of checking the roster of existing complaints, as inevitably occurred even with IChangeMyCity, then the Upvotes option alone cannot guarantee being Clone-free.

The third and most obvious feedback comes from authorities via the digital platform- to indicate closure. This is provided by both partners, with IChangeMyCity also appending contact details of which official has been assigned the task.

The fourth and final level is where a citizen can verify that a complaint marked as ‘closed’ has truly been resolved. After all, accountability forms part of the foundation of democracy. In this manner, the same poorly tended-to complaint could be reopened, rather than another one being filed. This feature currently exists only with IChangeMyCity, which not only allows municipal authorities to mark a complaint as ‘closed’ (as eGovernments also does), but also allows users to reopen it if unsatisfied.

IChangeMyCity’s resolution rates lie close to 50%, a figure probably reached after allowing for this reopening scenario. eGovernments, on the other hand, closed a commendable 97%. Closure times ranged from the same day (up to 13% of complaints) to an outlier of 1043 days (almost 3 years), with the majority (56%) closed in under 3 days. Mr Srivastava emphasized that these efficiency statistics had improved dramatically in the last 2 years. But as we just explored, perhaps a confounding factor is that multiple duplicate complaints are being closed by engineers who have identified their Clone nature.

How to Fix It?

Thus, it was the second category- unintended duplication- which bled into the fourth. How could the DataKind team exploit the enemies’ own weakness? They decided to unsheathe their two logical light sabers: text and location. Either one in isolation didn’t necessarily pinpoint a duplicate. But in combination, they could quickly incinerate a Clone’s trooper suit.

Saber 1: WHERE was the complaint registered? For IChangeMyCity, users can log in, peer through a map of Bangalore and plant a pin on the spot where they’d like to direct the authorities’ attention. Using that pin, analysts can procure exact latitude and longitude coordinates. It’s still entirely possible that different people place the pins some distance apart even when referring to the same issue. But it would seem like a safe bet that two closely located complaints might just be Clones.

eGovernments currently doesn’t use maps, but asks users for a fairly detailed, 6-level description of addresses (City, Regions, Zones, Wards, Area, Locality, Street). Such text might help direct an engineer gallivanting outdoors, but not a computer that speaks code. Attempting to translate the text addresses into associated geocodes, the team split the data into 10 parts and ran the Google Maps API with an R script on each one. Despite their best efforts, accuracy could not be guaranteed. Though eGovernments will soon be introducing such coordinates, geocoding seemed like a closed line of attack.

Saber 2: HOW was the complaint registered? The way people express themselves on a particular local issue may vary, but could feature some words in common. However, with the eGovernments system, pre-loaded tags from the website were automatically attached to complaints. Result? Nearly 40,000 entries demanding ‘NECESSARY ACTION’ (in capitals, no less) with only minor differences. Others exist, but simply restate the category of complaint (‘Removal of Garbage’). With so little variability and no hidden clues, this strategy failed too.

However, for IChangeMyCity, citizens are free to fill in complaint titles and descriptions as they please. So the DataKind team broke the text of both the complaint’s title and description into sentences and then into words. Then they ran an unsupervised learning algorithm, which helped generate the Jaccard Index: a measure of how ‘close’ two complaints are, based on the share of words they have in common.
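For two complaints, the Jaccard Index is the number of distinct words they share divided by the number of distinct words appearing in either. The sketch below shows only that calculation; the tokenizer is a simple assumption (lower-cased alphanumeric runs) and does not reproduce the team’s actual preprocessing or algorithm.

```python
import re


def tokenize(text):
    """Split text into a set of lower-cased alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(text_a, text_b):
    """Jaccard Index of the two texts' word sets: |A & B| / |A | B|.

    Returns 0.0 when both texts are empty, so that blank complaints
    are never treated as matches.
    """
    words_a, words_b = tokenize(text_a), tokenize(text_b)
    union = words_a | words_b
    if not union:
        return 0.0
    return len(words_a & words_b) / len(union)


# Two made-up complaint titles sharing 2 of 5 distinct words
print(jaccard("Street light not working", "street light broken"))  # 0.4
```

A score of 1.0 means identical word sets and 0.0 means no words in common; where to draw the line for a ‘duplicate’ is a judgment call that would need tuning on real complaints.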

But to check this ‘distance’ for N complaints against each other would require N*N operations. Far too long for a dataset of this size. To assist with this more abstract sense of ‘distance’, the team decided to turn to the more intuitive geographical meaning of the term. The clearly listed geocode saber we mentioned above.

The team decided that any two complaints within 250m of each other on a map could be considered as potential duplicates, while the rest could be ignored. Plugging these codes into the MongoDB geospatial index, Samarth ingeniously reduced the computation time for this process from 2 hours to 10 minutes. He also later developed a REST API that could be queried to detect the 10 nearest complaints. Going forward, the team hopes to set a threshold of such ‘similarity’ beyond which a new entry could automatically be flagged as a duplicate, much like answered programming queries on Stack Overflow.
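The saving comes from comparing text only for complaints that are already close on the map. The team used MongoDB’s geospatial index for this. The sketch below swaps in a plain Python grid as a stand-in, to illustrate the idea without a database: complaints are bucketed into cells slightly larger than the radius, and only neighbouring cells are compared. The function names and sample coordinates are hypothetical, and the grid approximation assumes city-scale data well away from the poles and the 180° meridian.

```python
import math
from collections import defaultdict

EARTH_RADIUS_M = 6_371_000.0


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearby_pairs(points, radius_m=250.0):
    """Return sorted id pairs of points lying within radius_m of each other.

    points: list of (id, lat, lon) tuples with comparable ids.
    """
    if not points:
        return []
    # Cell sides are a little larger than radius_m (one degree of latitude
    # is roughly 111 km), so any match lies in the same or an adjacent cell.
    lat_step = radius_m / 110_000.0
    max_lat = max(abs(lat) for _, lat, _ in points)
    lon_step = lat_step / max(math.cos(math.radians(max_lat)), 1e-6)
    grid = defaultdict(list)
    for pid, lat, lon in points:
        cell = (math.floor(lat / lat_step), math.floor(lon / lon_step))
        grid[cell].append((pid, lat, lon))
    pairs = set()
    for (gx, gy), members in grid.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                neighbours = grid.get((gx + dx, gy + dy), ())
                for id_a, lat_a, lon_a in members:
                    for id_b, lat_b, lon_b in neighbours:
                        if id_a < id_b and haversine_m(
                                lat_a, lon_a, lat_b, lon_b) <= radius_m:
                            pairs.add((id_a, id_b))
    return sorted(pairs)


# Made-up pins: 1 and 2 are about 100 m apart, 3 is several km away
pins = [(1, 12.9716, 77.5946), (2, 12.9725, 77.5946), (3, 13.0500, 77.6000)]
print(nearby_pairs(pins))
```

Only the pairs returned here would then need a text-similarity comparison, which is what cuts the work down from every complaint against every other.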

Onward to De-Duplication Success

At first glance, it may seem like the Attack of the Clones had stamped defeat over the eGovernments project, while IChangeMyCity had dodged the bullet. But let’s not jump to conclusions. The importance of this first battle is relative. Since Janaagraha is focused on closure of a single complaint, it makes sense not to muddy the waters with repeats of the same issue. eGovernments, on the other hand, is interested in the total number of complaints likely to arise, not the underlying problems. Also, as we’ll soon see in the next installment, the larger numbers of complaints (including duplicates) would prove crucial in helping generate valid forecasts for the Chennai Municipal Corporation.

So at the end of this first DataJam session, what had the team discovered? On a flight that carried along Sun Tzu, 2 mayors, George Lucas and random Athenians in Business Class, we learnt the philosophical complexities of the idea of ‘duplication’, especially in the contexts of crowdsourcing and democratic processes in strained local governments.

Abhishek Pandit is a Strategy Consultant at ChaseFuture

National Portal Delivers eServices for Bangladesh Citizens


The Government of Bangladesh has made substantial strides towards achieving its long-term Perspective Plan (2010-2021) by introducing the National Portal, or NP, which is primarily intended to serve as an information dissemination mechanism for the population, especially the underserved.

The National Portal’s journey started in 2007 when the government introduced a central portal by way of a preliminary endeavour. In 2010, a countrywide initiative was undertaken to introduce portals for all of the country’s 64 districts. Based on the lessons learned from these experiences, the ‘Guidelines on Content Preparation’ and ‘Training Guidelines’ on the same subject were prepared to widen the portals’ scope and reach.

Subsequently, some 22,000 government officials were trained on developing and maintaining the Portal, which created an enabling environment to further advance the effort. Finally, in 2013 and 2014, some 25,000+ websites, adhering to a common architecture, design, and structure in terms of their contents, were integrated within the National Portal and introduced in all tiers of public offices (Union Parishad, the lowest tier of local government, Upazila or Sub-district, district, division, directorate and ministries).

In the new National Portal, citizens are finding a convenient channel for obtaining information from public offices at lower cost and with less hassle. The Portal is also mobile-friendly, thereby ensuring greater access to information since the country enjoys over 70 per cent mobile penetration, with over 80 per cent of Internet access happening over mobile phones. Citizens who are unable to access the websites directly can go to the nearest digital centre, of which there are some 5000+ countrywide.

It is worth noting here that there are complementary initiatives in progress to upgrade the Portal further so that it can host all electronic versions of government services. Mobile applications are also being developed to make it easily accessible to persons with disabilities.

At present, some 100+ services (selected on the basis of importance and public demand), including online passport applications and electricity bill payments, have already been incorporated, and more services will soon be fully automated and provided via the Portal further to a mandatory government directive that will shortly be coming into effect.

Compiled from WSIS Stocktaking: Success Stories 2015