Discussion on data cleansing for a key metric from multiple sources

As we dig deeper into the data, we are finding some interesting data ‘issues’ in the form of differences between data sources:
We compared an important metric (confirmed cases) across multiple data sources for the Outbreaklocation and found that the numbers from different sources do not always align. For an example, see the plot for New York State in the screenshot.

  • Clearly, there are a lot of strange data points in ECDC. I guess this EU institute may not be the best source for US data.
  • But even among the 2 JHU sources (Confirmed and Interpolated) and CovidTrackingProject, there are some quite big differences in certain time periods.

So I just want to discuss this with the community here: does anyone have ideas about what could be causing these differences? Suggestions on data cleaning and treatment are also welcome: should we stay with the one source believed to be most reliable (JHU for US, ECDC for EU), or remove the outliers and take the mean/median, etc.?

Many thanks

Hello @Haonan, we’re looking into this and will get back to you soon regarding the specific issues with JHU and ECDC counts.

Regarding your general question, when multiple data sources present the same data (e.g. case counts and death counts), we’d suggest selecting the data source that is most reliable for the geographic areas that you are investigating rather than using outlier detection or mean/median calculations between sources.
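As a rough illustration of that suggestion, the sketch below keeps only the rows that come from the preferred source for each geography. The mapping, the field names, and the sample numbers are all made up for the example; they are not the platform's schema or real counts.

```python
# Hypothetical mapping: which source we trust most for each region.
PREFERRED_SOURCE = {"US": "JHU", "EU": "ECDC"}


def select_source(records, preferred=PREFERRED_SOURCE):
    """Keep only the rows whose source is the preferred one for their region."""
    return [r for r in records if preferred.get(r["region"]) == r["source"]]


# Made-up sample rows, one per (region, source) pair.
records = [
    {"region": "US", "source": "JHU", "confirmed": 100},
    {"region": "US", "source": "ECDC", "confirmed": 140},
    {"region": "EU", "source": "ECDC", "confirmed": 50},
    {"region": "EU", "source": "JHU", "confirmed": 55},
]

selected = select_source(records)
for row in selected:
    print(row["region"], row["source"], row["confirmed"])
```

Rows from a non-preferred source are simply dropped rather than averaged in, so one unreliable source cannot shift the numbers for a region.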

Thanks a lot, we are planning to use JHU data for the US.
But there are other interesting differences you may also find. For example, again for New York confirmed cases, JHU_ConfirmedCases and JHU_ConfirmedCasesInterpolated show a huge difference at 31/08/2020. We double-checked, and JHU_ConfirmedCasesInterpolated seems to be closer to what the JHU website reports (https://coronavirus.jhu.edu/region/us/new-york), judging by the most recent date. However, if we use the JHU_ConfirmedCases curve to calculate daily confirmed cases, it gives a negative value on 31/08/2020. Would that be possible? Or is it just a data bug, in which case we can drop the 31/08/2020 data point for New York from our model?
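To make the check concrete, here is a minimal sketch of how daily counts can be derived from a cumulative series and how days with a negative value can be flagged. The function names and the numbers are invented for illustration; they are not the actual JHU_ConfirmedCases values.

```python
def daily_new_cases(cumulative):
    """Turn a date-ordered list of (date, cumulative_count) into (date, daily_count)."""
    return [
        (date, count - prev_count)
        for (_, prev_count), (date, count) in zip(cumulative, cumulative[1:])
    ]


def negative_days(cumulative):
    """Return the dates on which the cumulative count went down."""
    return [date for date, delta in daily_new_cases(cumulative) if delta < 0]


# Made-up cumulative series with a dip on 2020-08-31.
series = [
    ("2020-08-29", 1000),
    ("2020-08-30", 1010),
    ("2020-08-31", 990),
    ("2020-09-01", 1005),
]

print(daily_new_cases(series))
print(negative_days(series))
```

A cumulative count should only fall when the source revises earlier figures, so a flagged date is a candidate for review or exclusion rather than a real negative number of new cases.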


Hi @Haonan, the ECDC data issue you noted has been corrected – ECDC is now only available at the country level.

Regarding JHU data for New York, the particular cause of this problem was a change in reporting structure by JHU – at different times, cases have been reported for each of the five boroughs of New York City and for New York City as a whole. Our interpolation scheme, which normally prevents incorrect data, may have double-counted some of these locations due to the change in reporting. This problem has been partially fixed, but we’re still looking into the decrease on Aug 31. We will provide an update once the fix is finalized.

Please note that JHU_ConfirmedCasesInterpolated is no longer available – you should instead only use JHU_ConfirmedCases.