Can India's Covid data be trusted? A Big Data investigation into what the numbers say (and hide)

In this piece, we try to make sense of the large volume of Covid-related data that is being generated day in and day out.

But before we get into the details, let’s first try and explain what we want to achieve through this piece.

The aim

The peak of the second wave of Covid-19 seems to be behind us. Both daily Covid cases and daily deaths have been coming down over the last few weeks. Having said that, the lack of preparedness across all levels of the government in handling the second wave was stark and much of the loss of life was avoidable.

The collection of accurate data is challenging due to reasons we touch upon at the outset. Nevertheless, a thorough data-driven analysis of both the severity of the pandemic and the government’s response to it is critical in order to identify areas for improvement and be better prepared for any possible future waves.

This piece is one such attempt to try and make sense of the enormous amount of data that is collected related to Covid. We focus on two fundamental metrics: Covid testing and positivity rate, and the link between them.

There are two kinds of Covid tests, both of which are performed on nasal swabs: RT-PCR tests and antigen tests. Of the two, the RT-PCR test is more reliable, but both tests are used to detect Covid infections. The data reported includes results from both tests. Positivity rate is the number of people who test positive for the infection divided by the total number of people tested.
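The positivity rate can be written as a one-line calculation. Here is a minimal sketch in Python; the function name and the sample figures are ours, chosen purely for illustration, and are not taken from the data set analysed in this piece.

```python
def positivity_rate(positives: int, tests: int) -> float:
    """Return the share of tests that came back positive, as a fraction of 1."""
    if tests <= 0:
        raise ValueError("tests must be greater than zero")
    if not 0 <= positives <= tests:
        raise ValueError("positives must lie between 0 and tests")
    return positives / tests


# Hypothetical figures, for illustration only: 50 positives out of 1,000 tests.
example_rate = positivity_rate(positives=50, tests=1000)
print(f"{example_rate:.1%}")  # prints 5.0%
```

Since the reported data combines RT-PCR and antigen results, a calculation like this would be run on the combined counts.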

The total number of tests conducted in a state, even in the absence of a significant number of Covid cases, is an important metric. It indicates the state’s ability to detect any uptick in infection and allows a window for early intervention. When the testing rate is high, more people are tested. This increases the chances of early detection if there is a surge in infections, compared to a scenario with less testing. Thus, the testing rate of a state can be viewed as one measure of a state’s vigilance level, or the extent to which it is prepared to quickly catch any increase in the spread of Covid.

While increased testing is good in all scenarios, it becomes even more important in the face of rising Covid positive cases. Experts agree that as positivity increases, testing should be increased too. This is because a very high positivity rate indicates the possibility of only people with severe symptoms being tested. As a result, many others who may be infected, but are showing fewer symptoms or no symptoms at all, are likely not being tested. These undetected infections can cause rapid spread of the virus because people carrying these infections do not isolate themselves, given that they are unaware that they are carrying the virus in the first place.

Thus, the way a state responds when positivity rates start to surge, as was the case during the second wave peak in India, can tell us how responsive it was. Ideally, testing rates should increase significantly within a few days of a surge in positivity rates. This tells us that the government is aware of the prevailing situation and is trying to do something about it.

Using data collected from states between January 1, 2021 and June 10, 2021, we conducted a detailed analysis of testing numbers and positivity rates. We found wide differences between states across these two metrics, both in absolute numbers and in their progression over time, which we will discuss in detail.

Interestingly, not all states that were vigilant were responsive. And not all states that responded well to rising cases were that vigilant. Of course, making such broad statements to describe what is a minefield of data is risky business. The analysis also reveals data from a few states that differs so sharply from nationwide averages as to be questionable, pointing to potential manipulation. Flagging potential inaccuracies in Covid data, wilful or not, can perhaps discourage such practices.

On the topic of data fudging, there has been a spate of recent reports highlighting massive underreporting of Covid deaths in many states. Investigative work by data scientist Rukmini S along with Chinmay Tumbe of IIM Ahmedabad reveals significant fudging of Covid death numbers by states. Taking a cue from them, other media reports have also surfaced showing similar fudging of numbers in more states. We had discussed this in detail in an earlier piece titled “But How Do You Hide the Dead”.

Our current analysis in a way confirms what the media is already highlighting about fatality numbers through other data and anecdotal evidence.

Because in the end, data is all we have got. So, let’s start.


The second wave of Covid, which had paralysed the nation for over one and a half months, is receding. According to Covid19india.org, the all-India number of new cases in a single day is down from a peak of 4,14,280 on May 6, 2021 to 37,070 on June 28, 2021, a decline of over 91 percent in less than two months. To give some context of how severe the second wave was, the highest number recorded for daily infections in the first wave was 97,680 on September 17, 2020. The steep drop in infections in recent weeks is encouraging.
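For readers who want to check the arithmetic, here is a short Python sketch using only the three figures quoted above. The variable names are ours.

```python
# Daily new-case figures quoted in the text (source: Covid19india.org).
peak_cases = 414_280       # May 6, 2021 (4,14,280 in Indian digit grouping)
recent_cases = 37_070      # June 28, 2021
first_wave_peak = 97_680   # September 17, 2020

# Fall from the second-wave peak, as a share of that peak.
decline = (peak_cases - recent_cases) / peak_cases

# How many times higher the second-wave peak was than the first-wave peak.
ratio_to_first_wave = peak_cases / first_wave_peak

print(f"decline from peak: {decline:.1%}")                               # 91.1%
print(f"second-wave peak vs first-wave peak: {ratio_to_first_wave:.1f}x")  # 4.2x
```

In other words, the second-wave peak was a little over four times the first-wave peak.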

The big question on the minds of many is: can Covid data be trusted? Any scepticism towards the accuracy of Covid data, and thus the utility of data-driven analysis of the pandemic, is understandable. For a variety of reasons, there are huge gaps in our ability to gather data on Covid infections. Let’s list them out one by one.

First, as we all know by now, not everyone who gets infected shows symptoms. Folks who show symptoms of Covid when infected are called symptomatic patients, whereas those who do not show symptoms despite being infected are called asymptomatic patients. This phenomenon of asymptomatic infections automatically causes a significant number of cases to go unreported. People who don’t feel sick will generally not get tested and hence, won’t be counted as being infected.

It is important to understand this, simply because asymptomatic patients also spread Covid. As Anirban Mahapatra writes in Covid 19: Separating Fact from Fiction: “During this pandemic, it became clear that people who were infected but not sick were spreading the disease silently. A significant proportion of spread of SARS-CoV-2 is by asymptomatic carriers who can spread virus-laden particles as aerosols from anywhere between three to twelve days.”

Second, the surge of sickness brought by the second wave clearly overwhelmed our health infrastructure, including testing capabilities. The system won’t record those it cannot serve. This became more important given that in some states, even getting admitted into a hospital was made difficult by the bureaucratic regulations that were in place.

In Uttar Pradesh, for instance, in order to get admitted into a hospital, a patient required a reference letter from the chief medical officer “who heads the integrated command and control centres set up by the government in all districts”. Due to this rule, patients were turned away from hospitals. And if such a patient died, they wouldn’t be counted in the Covid deaths. Of course, this was over and above whether medical infrastructure was available and the patient had the ability to access it in the first place.

Third, while the Covid testing infrastructure in urban and semi-urban areas is over-stretched, it is either absent or completely inadequate in rural India in many states. Thus, the spread of the virus in the hinterland does not show up properly in the numbers.

As Dr Chandrakant Lahariya, a Delhi-based epidemiologist and public policy and health systems expert, told India Today in June: “In the absence of reliable Covid surveillance and data from rural India, we cannot be sure about the extent and severity of the pandemic. National aggregates may indicate a declining spread in urban settings, but it is possible the virus is still spreading in rural India.”

Fourth, even in places where testing is available, people often avoid getting tested due to the fear of restrictions imposed if they test positive. Prior beliefs, and beliefs shaped by WhatsApp forwards, are at play as well. This, coupled with the fact that there is a small (but not insignificant) chance of the test returning positive even if one is not infected, which is referred to as a false positive, fuels a reluctance among people to get tested unless absolutely needed.

Finally, one can’t rule out the possibility of data being fudged by authorities to avoid embarrassment and/or public and political backlash. (Again, something we had documented in our earlier piece.)

Thus, data-driven analysis of Covid-19 testing and infections has quite a few limitations. Yet, this piece will do just that. Our rationale is simple. While the numbers do not reflect 100 percent reality, they are a useful proxy. Most of the limitations of data described above don’t change much over time, so trends and comparisons can remain informative even where absolute levels are off. Thus, the data collected can inform mitigation measures and provide insight into the severity of the disease, efficacy of the government response, and perhaps even flag instances of fudging.

All that said, let’s dive into some cold, hard numbers to understand what they tell us about how different states and regions have fared in the second wave. Specifically, we examine the data on testing, positive infections, and the dynamics that link them.

Let’s first start with some aggregate level data analysis. While there are many ways of grouping states to create aggregate data, we picked two.

  1. Geographic division: We bundled together the data from the northern and the eastern part of the country, which is the less developed part, on one side, and the southern and the western part of the country, which is the more developed part, on the other.

  2. Political division: The states governed by National Democratic Alliance (NDA) parties versus the non-NDA-governed states. Almost all big NDA-governed states are governed by the Bharatiya Janata Party, except Bihar, where the party is in alliance with the Janata Dal (United).

The data, when cleaved in this fashion, is very striking.

(1) Geographic division

Uttar Pradesh, Bihar, Madhya Pradesh, Rajasthan, Delhi, Haryana, Punjab, West Bengal, Assam, Uttarakhand, Jammu and Kashmir, Jharkhand and Chhattisgarh comprise the north and east group.

Maharashtra, Tamil Nadu, Gujarat, Karnataka, Andhra Pradesh, Odisha, Telangana and Kerala comprise the south and west group.

In both cases, we only considered states with a population of more than one crore.
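To make the geographic grouping concrete, here is a minimal Python sketch of how state-level test counts could be rolled up into the two groups. It is an illustration under stated assumptions, not the code used for this analysis: the `(state, date, tests)` record layout, the function names, and the sample numbers are all hypothetical.

```python
from collections import defaultdict

# Group membership as listed in the text.
NORTH_EAST = {
    "Uttar Pradesh", "Bihar", "Madhya Pradesh", "Rajasthan", "Delhi",
    "Haryana", "Punjab", "West Bengal", "Assam", "Uttarakhand",
    "Jammu and Kashmir", "Jharkhand", "Chhattisgarh",
}
SOUTH_WEST = {
    "Maharashtra", "Tamil Nadu", "Gujarat", "Karnataka",
    "Andhra Pradesh", "Odisha", "Telangana", "Kerala",
}


def group_of(state):
    """Return the group label for a state, or None if it is in neither list."""
    if state in NORTH_EAST:
        return "north and east"
    if state in SOUTH_WEST:
        return "south and west"
    return None  # states outside both lists are left out of the totals


def total_tests_by_group(records):
    """Sum tests per group from (state, date, tests) tuples (assumed layout)."""
    totals = defaultdict(int)
    for state, _date, tests in records:
        group = group_of(state)
        if group is not None:
            totals[group] += tests
    return dict(totals)


# Made-up rows, only to show the shape of the input; these are not real counts.
sample = [
    ("Bihar", "2021-01-01", 100),
    ("Kerala", "2021-01-01", 200),
    ("Goa", "2021-01-01", 50),      # not in either group, so skipped
    ("Bihar", "2021-01-02", 150),
]
print(total_tests_by_group(sample))
# {'north and east': 250, 'south and west': 200}
```

The political division could be handled the same way by swapping in two different sets of states.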

First, let’s look at the total testing for these two groups from January 1, 2021 to June 10, 2021.
