While digging deeper into the data, we found some interesting data ‘issues’ in the form of differences between data sources.
We compared an important metric (confirmed cases) across the multiple data sources of Outbreaklocation and found that the numbers from different sources do not always align. For example, for New York State, see the plot in the screenshot:
- Clearly, there are a lot of strange data points in ECDC. My guess is that this EU institute may not be very good at collecting US data.
- But even among the two JHU series (Confirmed and Interpolated) and CovidTrackingProject, there are some fairly large differences in certain time periods.
So I just want to discuss this with the community here: does anyone have ideas about what could be causing these differences? And are there any suggestions on data cleaning and treatment: just stick with the one source believed to be most reliable (JHU for US, ECDC for EU), or remove the outliers and take the mean/median, etc.?
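
To make the "remove the outliers and take the median" option concrete, here is a rough sketch of what I have in mind. It is only an illustration, not a recommendation: the function name `consensus`, the 20% threshold `rel_tol`, and the numbers in `example` are all made up for this post and are not real counts from any of the sources.

```python
from statistics import median


def consensus(sources, rel_tol=0.2):
    """Per-day median across sources, plus the points that deviate from it.

    sources: dict mapping a source name to a list of cumulative counts
             (all lists the same length, one entry per day, None if missing).
    rel_tol: hypothetical threshold; a point is flagged when it differs
             from that day's median by more than rel_tol * median.
    """
    n_days = len(next(iter(sources.values())))
    medians, outliers = [], []
    for day in range(n_days):
        # Collect the values that are actually available on this day.
        values = {name: series[day] for name, series in sources.items()
                  if series[day] is not None}
        if not values:
            medians.append(None)
            continue
        med = median(values.values())
        medians.append(med)
        for name, value in values.items():
            # max(med, 1) avoids a zero threshold early in the outbreak.
            if abs(value - med) > rel_tol * max(med, 1):
                outliers.append((day, name, value))
    return medians, outliers


# Made-up numbers for illustration only.
example = {
    "JHU": [100, 120, 150, 180],
    "CovidTrackingProject": [98, 118, 149, 185],
    "ECDC": [100, 40, 150, 400],
}
medians, outliers = consensus(example)
print(medians)
print(outliers)
```

With three or more sources the median already ignores a single bad value on a given day, so the outlier list is mostly useful for seeing which source misbehaves and when. With only two sources the median is just their average, so this approach would not help much there.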