Updated: Jul 10
In this post, I will show you how to create a choropleth map in Stata that shows the spread of COVID-19 in Mainland China. I obtained the ESRI shapefile from https://www.diva-gis.org/datadown. I have saved the database file (.dbf) and the shapefile (.shp) in my Documents folder, and both are called "china".
First, I will change my current working directory to the folder that holds these files and set up the map data in Stata. If you prefer not to change the working directory, specify the full path to the files instead. To do this, I use the following code:
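The setup step can be sketched as follows. This is a minimal sketch, assuming Stata 15 or later (where spshape2dta is available) and a placeholder folder path that you should replace with your own.

```stata
* Placeholder path (an assumption): point this at the folder holding china.shp and china.dbf
cd "C:/Users/yourname/Documents"

* Convert the shapefile into Stata format; this writes china.dta and china_shp.dta
* The replace option overwrites earlier copies if you have run this step before
spshape2dta china, replace
```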
You will see the following output in the Results window:
The output shows that two Stata datasets, china_shp.dta and china.dta, have been created and saved in my current working directory. Please make sure that both files are always kept in the same folder, because they are linked. The file that we are going to use to create the map is china.dta, the regular dataset with one row per province; china_shp.dta holds the boundary coordinates.
Now that the shapefile has been converted into Stata datasets, I am going to draw a blank map of Mainland China to test whether the map dataset works properly. To do this, I will use the grmap command. In the command pane:
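A sketch of this test is below. It assumes grmap has not been installed yet; grmap is a community-contributed command available from SSC, and it has to be activated once before first use.

```stata
* Install grmap from SSC (only needed once per Stata installation)
ssc install grmap, replace

* Load the regular map dataset created by spshape2dta
use china, clear

* Activate grmap once, then call it without a variable to draw the outline only
grmap, activate
grmap
```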
This will give me a blank map:
To get the choropleth map of COVID-19 cases in Mainland China, we need to merge this map dataset with the COVID-19 data for Mainland China, which I have previously obtained from The Humanitarian Data Exchange website. The dataset you will need to download is called covid-19 china cases by adm 1.csv. It opens in Excel as a spreadsheet with province names and the corresponding case numbers. I need to convert this .csv file to a .dta file that contains the two variables needed for this analysis: provinces and cases. To do this, I select the relevant columns in the .csv file, right-click, and select Copy. I then open Stata, clear the previous dataset, open the Data Editor in edit mode, right-click the top-left cell, and click Paste. Stata asks whether the top row contains variable names or data; I select "Variable Names" and the data is pasted correctly. I then change the column names to "province" and "cases". You can use any names you prefer, but before you merge this dataset with the map dataset, the variable that represents provinces must have exactly the same name in both datasets. I then save this data as coviddata_china.dta. Here is the code I used:
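The commands around the paste step might look like the sketch below. The pasted column names (name and confirmed) are hypothetical, since they depend on the headers in the downloaded file; replace them with the names you see in the Variables window.

```stata
* Start from an empty dataset, then paste the two columns in the Data Editor
clear

* (paste here, choosing "Variable Names" for the first row)

* Hypothetical pasted names: substitute the names that actually appear after pasting
rename name province
rename confirmed cases

* Confirm that province is stored as text and cases as a number
describe province cases

save coviddata_china, replace
```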
Now I need to merge this data with my map data. I first need to make the province variable in my map data match the one in the COVID-19 data: I generate a new variable called 'province', then I make sure that the spelling of every province matches the spelling in my COVID-19 dataset. Then I save this dataset as china_map.dta:
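As a rough sketch of this step: the code below assumes the province names in the map data are stored in a variable called NAME_1, which is a guess about the shapefile's attribute table, so check the variable list first. The spelling change shown is only an illustration of the pattern, not a list of the actual differences.

```stata
use china, clear

* List the variables to find the one holding province names
describe

* NAME_1 is an assumed variable name: replace it with the one in your data
generate province = NAME_1

* Illustrative spelling fix (hypothetical values): repeat for each mismatch you find
replace province = "Inner Mongolia" if province == "Nei Mongol"

save china_map, replace
```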
Now I am going to merge the two datasets:
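A sketch of the merge and the first map follows. It assumes each province appears exactly once in both files, so a 1:1 merge applies, and the Reds palette is just one possible choice for fcolor().

```stata
use china_map, clear

* Match each map row to its case count by province name
merge 1:1 province using coviddata_china

* Inspect the result: _merge == 3 marks provinces found in both datasets
tabulate _merge

* Drop rows that exist only in the COVID-19 data, since they have no map shape
drop if _merge == 2
drop _merge

* Shade each province by its number of cases
grmap cases, fcolor(Reds)
```

If tabulate shows provinces that did not match, the usual cause is a spelling difference in the province variable.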
NOTE: I use the fcolor() option to change the map's color. Then I get the map below:
This map shows the infection counts in Mainland China. It does not take into account the different population sizes of the provinces. We would naturally expect a province with a larger population to have a higher number of cases. So now I am going to draw another map showing cases as a percentage of each province's population. The percentage is the number of cases divided by the population of that province, multiplied by 100.
To do this, I am first going to merge the COVID-19 dataset with a population dataset. I obtained the population figures from https://www.worldatlas.com/articles/chinese-provinces-by-population.html
I copied the provinces and the population data, pasted them into an Excel spreadsheet, and saved the file as china_population in .csv format. Next, I need to import the dataset into Stata. As with the COVID-19 dataset, I copy the data from the .csv file and paste it using the Data Editor in edit mode. I then change the column names to "province" and "population". Then I save the dataset as china_population.dta:
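One detail worth checking at this step is whether population arrived as a number. The sketch below assumes the columns are already named province and population, and that any non-numeric characters in the pasted figures are thousands separators (commas).

```stata
* If population was pasted as text (for example "104,303,132"), convert it to a number
capture confirm numeric variable population
if _rc {
    destring population, replace ignore(",")
}

save china_population, replace
```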
Now that I have the three datasets ready, I will first merge the COVID-19 dataset with the population dataset, then save the merged dataset as coviddata_population_china.dta. In the command pane:
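This merge can be sketched as below, again assuming one row per province in each file and identical spellings in the province variable.

```stata
use coviddata_china, clear

* Attach the population figure to each province's case count
merge 1:1 province using china_population

* Check for provinces that failed to match, then keep only the matched rows
tabulate _merge
keep if _merge == 3
drop _merge

save coviddata_population_china, replace
```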
Finally, I merge this dataset with the map dataset. The map will show the percentage of the population that contracted COVID-19 in each province, so I generate another variable called 'percent' to hold this value:
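A sketch of this final step is below. It assumes population is numeric and measured in persons, and it reuses the Reds palette, which is an arbitrary choice.

```stata
use china_map, clear

* Bring cases and population onto the map dataset
merge 1:1 province using coviddata_population_china
drop if _merge == 2
drop _merge

* Cases as a percentage of each province's population
generate percent = 100 * cases / population

* Shade each province by the percentage instead of the raw count
grmap percent, fcolor(Reds)
```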
We can see from the map above that, once population is taken into account, the picture changes slightly: the color of some provinces becomes lighter. For example, even though Henan Province (Central China) has one of the highest numbers of cases in Mainland China, its percentage falls in the second tier when we take the population into account. Beijing shows the opposite pattern: its total number of cases is in the second tier, but its percentage rises to the first tier.