Covid-19 Outbreak Exploratory Data Analysis and Prediction

Aadhil imam
Published in Analytics Vidhya · 5 min read · Mar 29, 2020


The outbreak of Coronavirus Disease 2019 (COVID-19) began in Wuhan City, Hubei Province, China. Its rapid spread quickly became a major global crisis, leading the World Health Organization (WHO) to declare COVID-19 a pandemic.

While companies like Google’s DeepMind are using AI to generate predictions about the Coronavirus that could help researchers stem the global outbreak, smaller companies and individual data scientists and data analysts are also tinkering with Coronavirus data in order to understand the outbreak better.

Below is my exploratory data analysis of the COVID-19 dataset, along with the inferences I drew from it. The dataset comes from the Johns Hopkins GitHub repository; it contains the cumulative number of confirmed cases, recovered cases, and deaths. It also includes the Province/State, the ObservationDate (the date of observation), and the Country of the infected patients.

Let’s import the needed Python libraries and the dataset:

Then we’ll read the dataset. It covers updates through March 29. Let’s print its first five rows.

covid19.head()
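
A minimal sketch of the import, load, and preview steps. The in-memory CSV is a made-up stand-in for the real file, and the column names SNo, Country/Region, Confirmed, Deaths, and Recovered are assumptions about the dataset layout, not something confirmed by the text above.

```python
import io

import pandas as pd

# Toy stand-in for the real CSV file; every value below is made up.
SAMPLE_CSV = """SNo,ObservationDate,Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered
1,01/22/2020,Hubei,Mainland China,01/22/2020 17:00,400,10,20
2,01/22/2020,,Japan,01/22/2020 17:00,2,0,0
3,01/23/2020,Hubei,Mainland China,01/23/2020 17:00,450,12,25
"""

# With a real file on disk this would be pd.read_csv("<path to the csv>").
covid19 = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(covid19.shape)   # (number of rows, number of columns)
print(covid19.head())  # first five rows (here only three exist)
```

Calling `covid19.info()` at this point is a quick way to check the column count and the data type of each column before going further.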

As you can see, we have 8 columns containing confirmed, recovered, and death cases across countries, and the dataset has 67 entries.

Next, we want to convert the ObservationDate and Last Update columns to datetime objects.
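
The conversion could look roughly like this with `pd.to_datetime`. The two-row dataframe is a made-up example, and the date string formats are an assumption about how the raw file stores them.

```python
import pandas as pd

# Toy frame holding only the two date columns (illustrative values).
covid19 = pd.DataFrame({
    "ObservationDate": ["01/22/2020", "01/23/2020"],
    "Last Update": ["01/22/2020 17:00", "01/23/2020 17:00"],
})

# Parse the string columns into pandas datetime64 columns.
for col in ["ObservationDate", "Last Update"]:
    covid19[col] = pd.to_datetime(covid19[col])

print(covid19.dtypes)
```

Once the columns are datetimes, sorting, grouping, and plotting by date behave chronologically rather than alphabetically.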

Then let’s group the dataset by date and calculate the worldwide totals of confirmed cases and deaths from 22/01/2020 onward.
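
One way to sketch this grouping step is a `groupby` on the date column followed by a sum. The rows are toy values, and the column names Country/Region, Confirmed, and Deaths are assumed.

```python
import pandas as pd

# Toy data: two countries observed on two dates (made-up numbers).
covid19 = pd.DataFrame({
    "ObservationDate": pd.to_datetime(
        ["2020-01-22", "2020-01-22", "2020-01-23", "2020-01-23"]
    ),
    "Country/Region": ["Mainland China", "Japan", "Mainland China", "Japan"],
    "Confirmed": [500, 2, 600, 3],
    "Deaths": [17, 0, 18, 0],
})

# Sum confirmed cases and deaths over all countries for each date.
daily_totals = (
    covid19.groupby("ObservationDate")[["Confirmed", "Deaths"]]
    .sum()
    .reset_index()
)

print(daily_totals)
```

Because the source columns are cumulative counts, each row of `daily_totals` is the running worldwide total on that date, not the number of new cases that day.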

From the above table, we can infer that in a little over two months the number of confirmed cases rose from 555 on January 22, 2020, to 660,706 as of March 28, 2020.

Let’s make a new column named active by subtracting the deaths and recovered columns from the confirmed column, and a separate dataframe without China, to make the data visualization and analysis easier.
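
A small sketch of both steps. It assumes that active cases mean confirmed cases minus deaths and recoveries, and that China appears under the label "Mainland China"; both are assumptions, and the numbers are made up.

```python
import pandas as pd

# Toy snapshot with one row per country (illustrative values only).
covid19 = pd.DataFrame({
    "Country/Region": ["Mainland China", "Italy", "Japan"],
    "Confirmed": [1000, 200, 50],
    "Deaths": [40, 20, 1],
    "Recovered": [300, 30, 10],
})

# Assumed definition: active = confirmed - deaths - recovered.
covid19["Active"] = (
    covid19["Confirmed"] - covid19["Deaths"] - covid19["Recovered"]
)

# Separate frame that excludes China (label assumed to be "Mainland China").
without_china = covid19[covid19["Country/Region"] != "Mainland China"]

print(covid19)
print(without_china)
```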

Let’s calculate the whole world’s total active, recovered, and death cases, and their percentages, over time.

Using the same method, we can calculate the total active, recovered, and death cases and their percentages over time for China and Italy.

Now let’s make a distribution plot of confirmed cases around the world. A distribution plot displays the distribution and range of a set of numeric values plotted against a dimension. We make a new dataframe for the distributions.
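
The worldwide and per-country totals can share one helper, sketched below. The function name `totals_by_date` is hypothetical, the column names are assumed, and the data is a toy example.

```python
from typing import Optional

import pandas as pd


def totals_by_date(df: pd.DataFrame, country: Optional[str] = None) -> pd.DataFrame:
    """Sum Active, Recovered and Deaths per date, optionally for one country."""
    if country is not None:
        df = df[df["Country/Region"] == country]
    return (
        df.groupby("ObservationDate")[["Active", "Recovered", "Deaths"]]
        .sum()
        .reset_index()
    )


# Toy data: two countries on two dates (made-up numbers).
covid19 = pd.DataFrame({
    "ObservationDate": pd.to_datetime(
        ["2020-03-01", "2020-03-01", "2020-03-02", "2020-03-02"]
    ),
    "Country/Region": ["Mainland China", "Italy", "Mainland China", "Italy"],
    "Active": [900, 100, 800, 150],
    "Recovered": [500, 10, 620, 15],
    "Deaths": [30, 5, 32, 8],
})

world = totals_by_date(covid19)            # all countries combined
italy = totals_by_date(covid19, "Italy")   # a single country

print(world)
print(italy)
```

Passing no country gives the worldwide series, while passing a name restricts the sum to that country's rows, so the same call pattern covers China and Italy.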

Distribution plot of confirmed cases around the world

Then let’s look at the mortality and recovery rates around the world. The mortality rate, or death rate, is the number of deaths as a share of total confirmed cases, and the recovery rate is the number of recoveries as a share of total confirmed cases, both tracked over time.
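
Expressed as percentages, the two rates could be computed like this. The column names MortalityRate and RecoveryRate are made up for illustration, as are the daily totals.

```python
import pandas as pd

# Toy worldwide totals per date (made-up numbers).
daily_totals = pd.DataFrame({
    "ObservationDate": pd.to_datetime(["2020-03-01", "2020-03-02"]),
    "Confirmed": [1000, 2000],
    "Deaths": [20, 50],
    "Recovered": [100, 400],
})

# Rate = cases in that category / total confirmed cases, as a percentage.
daily_totals["MortalityRate"] = (
    daily_totals["Deaths"] / daily_totals["Confirmed"] * 100
)
daily_totals["RecoveryRate"] = (
    daily_totals["Recovered"] / daily_totals["Confirmed"] * 100
)

print(daily_totals)
```

Both rates use cumulative confirmed cases as the denominator, so they describe outcomes so far and can shift as open cases resolve.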

Mortality and Recovery Rate analysis around the World

Now let’s start to play with some interactive data visualization. I used Plotly for the data visualization. Plotly’s Python and R graphing libraries make interactive, publication-quality graphs.

Whole world cases over time

We can see from the graph above that the number of confirmed cases keeps increasing, thereby increasing the number of death cases, the number of recovered instances, and the number of patients still infected. Fortunately, there is a higher likelihood of recovering than dying as a result of COVID-19.

Besides, we can also see that between February 16, 2020, and March 8, 2020, there was a decline in the recovery rate of infected patients. We may recommend that the governments of the affected nations put measures in place to ensure a significant increase in the number of recovered patients, sustained over a long time frame.

China cases over time

In the graph above, we can see that China’s active cases are going down over time.

Pie chart for active, recovered, and death cases

Let’s visualize the total cases across the whole world on a geographical map.

The above graph is just an illustration of how the virus is spread out across the globe.

Making prediction

To create our simulations, we extracted coronavirus data dating back to January 22 from an online repository provided by the Johns Hopkins University Center for Systems Science and Engineering.

This time-stamped data detailed the number and locations of confirmed cases of COVID-19, including people who recovered, and those who died.

Choosing an appropriate modelling technique was integral to the reliability of our results. We used time series forecasting, a method that predicts future values based on previously observed values.

I used a support vector machine algorithm to forecast future values.

The graph above shows the real confirmed cases against my model’s predictions. You can get the source code here: https://github.com/aadhil96/covid19_analysis_and_prediction.
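
A rough sketch of what support vector regression on a cumulative case series can look like with scikit-learn's `SVR`. The kernel, the hyperparameters, the train/test split, and the case counts are all assumptions chosen for illustration; they are not taken from the linked source code.

```python
import numpy as np
from sklearn.svm import SVR

# Toy cumulative series (made-up values), one entry per day since day 0.
days = np.arange(10).reshape(-1, 1)  # feature: day index, shape (10, 1)
confirmed = np.array([5, 8, 13, 20, 30, 44, 62, 85, 113, 146], dtype=float)

# Keep the last three days aside to compare predictions with real values.
train_days, test_days = days[:7], days[7:]
train_y, test_y = confirmed[:7], confirmed[7:]

# Assumed settings: polynomial kernel of degree 3.
model = SVR(kernel="poly", degree=3, C=10.0, gamma="scale", epsilon=0.1)
model.fit(train_days, train_y)

# Predictions for the held-out days and for five days beyond the data.
predicted = model.predict(test_days)
future_days = np.arange(10, 15).reshape(-1, 1)
forecast = model.predict(future_days)

print("held-out actual:   ", test_y)
print("held-out predicted:", predicted)
print("forecast:          ", forecast)
```

Here the model regresses case counts on the day index alone, so the forecast only extrapolates the fitted curve; comparing `predicted` with `test_y` gives a first check on how far that extrapolation can be trusted.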
