Hi C3.ai and fellow contest participants,

I thought I must be mistaken somewhere, but OutbreakLocation labels Washington, D.C. as a “state” (it is in fact not a state), while Washington has a locationType of nan. If this kind of mislabelling exists, what other kinds of mislabelling might also exist?

Sample Python code:

import c3aidatalake

# Fetch every OutbreakLocation whose id contains 'UnitedStates'
a = c3aidatalake.fetch(
    "outbreaklocation",
    {"spec": {"filter": "contains(id, 'UnitedStates')"}},
    get_all=True,
)
print(f'Washington is of type: {a[a.id == "Washington_UnitedStates"].locationType.values[0]}')
print(f'Washington D.C. is of type: {a[a.id == "Washington,D.C._UnitedStates"].locationType.values[0]}')


Washington is of type: nan
Washington D.C. is of type: state

I just want to know whether I’m just being stupid or whether the data lake might also contain other unreliable entries like this.

Hi @joy13975, Washington, D.C. is a unique case because of its classification as a Federal District. Most OutbreakLocation IDs take the form “city_county_state_country”, and Washington, D.C. falls above the county level but below the country level, so we treat it as a state within the hierarchy. Most data sources we ingest report case counts for Washington, D.C. alongside state-level data, and we do so as well.
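
To make the hierarchy concrete, here is a rough sketch of how an ID might be split into levels. It is not part of c3aidatalake: the helper name `split_location_id` is made up for illustration, and it assumes that shorter IDs drop the leftmost levels and that no place name contains an underscore.

```python
# Levels in the order they appear in a full OutbreakLocation ID.
LEVELS = ["city", "county", "state", "country"]

def split_location_id(location_id):
    """Map the parts of an ID onto levels, aligned from the right.

    Assumption (not confirmed by the data lake docs): an ID with fewer
    than four parts omits the leftmost levels, so a two-part ID is
    read as state_country.
    """
    parts = location_id.split("_")
    if len(parts) > len(LEVELS):
        raise ValueError(f"unexpected ID format: {location_id!r}")
    # Keep only the rightmost len(parts) level names.
    levels = LEVELS[len(LEVELS) - len(parts):]
    return dict(zip(levels, parts))

print(split_location_id("Washington_UnitedStates"))
print(split_location_id("Washington,D.C._UnitedStates"))
```

Under these assumptions, both IDs from the question resolve to a "state" part followed by a "country" part, which matches how Washington, D.C. is placed in the hierarchy.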

We will investigate why Washington is not labeled as a state and get back to you soon.

Regarding unreliable entries at large, please continue to bring them to our attention so that they can be corrected. As a large and constantly expanding collection of 40 data sources, the C3.ai COVID-19 Data Lake is imperfect and contains inaccuracies, but we aim to correct these as soon as possible after finding them.
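
For anyone who wants to look for similar entries, a minimal sketch of one possible check follows. The helper names are hypothetical, and the sample holds only the two rows reported in this thread rather than real fetch output.

```python
import math

def is_missing(value):
    """True for None or a float NaN (how pandas displays an empty field)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def missing_location_types(pairs):
    """Return the ids whose locationType is missing.

    `pairs` is any iterable of (id, locationType) tuples.
    """
    return [loc_id for loc_id, loc_type in pairs if is_missing(loc_type)]

# Stand-in data: only the two rows mentioned in this thread.
sample = [
    ("Washington_UnitedStates", float("nan")),
    ("Washington,D.C._UnitedStates", "state"),
]
print(missing_location_types(sample))
```

With the DataFrame `a` from the snippet in the question, the same check could be run as `missing_location_types(zip(a["id"], a["locationType"]))`.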

