Assessing Biodiversity Data Quality: A GBIF Kenya Case Study
The Data Quality Challenge in Biodiversity Science
The Global Biodiversity Information Facility (GBIF) aggregates hundreds of millions of species occurrence records from institutions worldwide. For Kenya alone, GBIF holds millions of records spanning decades of observation. But aggregation at this scale introduces systematic data quality challenges that can undermine conservation decisions if left unaddressed.
In our work with environmental impact assessments and biodiversity monitoring across East Africa, we have seen how uncritical use of GBIF data leads to flawed baseline assessments. This post walks through our approach to systematic data quality assessment — a reproducible R pipeline that we apply to every biodiversity project.
Why Data Quality Matters
Biodiversity data feeds directly into conservation planning, EIA baseline studies, and policy decisions. When that data contains coordinate errors, taxonomic misidentifications, or temporal gaps, the downstream consequences are real: protected areas drawn around phantom populations, impact assessments that miss genuinely sensitive species, and monitoring programmes built on unreliable baselines.
The challenge is that GBIF data comes from heterogeneous sources — museum collections, citizen science platforms, academic surveys — each with different quality standards and error profiles.
A Reproducible Quality Assessment Workflow
Our pipeline operates in four stages, each implemented as testable R functions:
Stage 1: Data Retrieval and Initial Profiling
We use the rgbif package to access GBIF’s API programmatically, then profile the dataset to understand its composition before applying any filters.
```r
library(rgbif)
library(dplyr)

# Retrieve occurrence data for Kenya; occ_search() pages up to the GBIF
# search API's cap, so larger pulls should go through occ_download()
occurrences <- occ_search(
  country = "KE",
  limit = 50000
)$data

# Initial profiling, before any filters are applied
profiling <- occurrences |>
  summarise(
    n_records       = n(),
    n_species       = n_distinct(species),
    date_range      = paste(min(year, na.rm = TRUE),
                            max(year, na.rm = TRUE), sep = "-"),
    pct_with_coords = mean(!is.na(decimalLatitude)) * 100
  )
```

Stage 2: Coordinate Validation
Coordinate errors are the most common and consequential data quality issue. We check for:
- Country centroid records — coordinates that fall exactly on Kenya’s geographic centroid, typically indicating missing data that was auto-filled
- Zero coordinates — records at (0, 0) or with zero latitude/longitude
- Precision issues — coordinates rounded to integer degrees, indicating low spatial precision
- Out-of-boundary records — coordinates that fall outside Kenya’s borders despite being tagged as Kenyan occurrences
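The CoordinateCleaner pipeline below handles the first three checks; the out-of-boundary test needs a country polygon. A minimal sketch using sf, with rnaturalearth assumed here as the boundary source (any authoritative Kenya polygon would work):

```r
library(dplyr)
library(sf)
library(rnaturalearth)  # assumed source for the Kenya boundary polygon

kenya <- ne_countries(country = "Kenya", returnclass = "sf")

# Convert records with coordinates to spatial points
pts <- occurrences |>
  filter(!is.na(decimalLatitude), !is.na(decimalLongitude)) |>
  st_as_sf(
    coords = c("decimalLongitude", "decimalLatitude"),
    crs = 4326  # WGS84, the datum GBIF coordinates are published in
  )

# Records whose coordinates do not intersect Kenya's boundary
out_of_bounds <- pts[lengths(st_intersects(pts, kenya)) == 0, ]
```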
```r
library(dplyr)
library(CoordinateCleaner)

# CoordinateCleaner's cc_*() defaults expect lowercase column names,
# while rgbif returns camelCase, so rename first
occ <- occurrences |>
  rename(decimallongitude = decimalLongitude,
         decimallatitude  = decimalLatitude)

# Each cc_*() call drops flagged records by default (value = "clean")
cleaned <- occ |>
  cc_val() |>  # Invalid or missing coordinates
  cc_zero() |> # Zero or near-zero coordinates around (0, 0)
  cc_cen() |>  # Country and province centroids
  cc_dupl() |> # Exact duplicates
  cc_gbif() |> # Records at GBIF headquarters
  cc_inst()    # Records at biodiversity institutions
```

Stage 3: Taxonomic Verification
Taxonomic names change over time as species are reclassified. We validate names against the GBIF backbone taxonomy and flag records with unresolved or disputed classifications.
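Individual names can also be resolved directly against the backbone with rgbif's name_backbone(); the species used here is just an illustration:

```r
library(rgbif)

# Resolve a single name against the GBIF backbone taxonomy
match <- name_backbone(name = "Loxodonta africana")
match$status     # taxonomic status of the matched name, e.g. "ACCEPTED"
match$matchType  # how the match was made, e.g. "EXACT"
```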
```r
# Flag records whose stored names are not accepted, species-level taxa
taxa_check <- occurrences |>
  distinct(species, taxonomicStatus, taxonRank) |>
  filter(
    taxonomicStatus != "ACCEPTED" |
      taxonRank != "SPECIES"
  )
```

Stage 4: Temporal and Sampling Bias Assessment
Even clean data can be misleading if it is unevenly distributed in time or space. We assess:
- Temporal coverage — Are records concentrated in particular years or seasons?
- Spatial sampling bias — Are records clustered near roads, cities, or research stations?
- Taxonomic bias — Are certain taxa over-represented relative to expected diversity?
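These three questions reduce to simple aggregations. One possible sketch, binning records by decade, by half-degree grid cell, and by taxonomic class (the bin sizes are arbitrary choices, not part of the pipeline above):

```r
library(dplyr)

# Temporal coverage: records per decade
by_decade <- occurrences |>
  filter(!is.na(year)) |>
  count(decade = floor(year / 10) * 10)

# Spatial sampling: records per 0.5-degree grid cell
by_cell <- occurrences |>
  filter(!is.na(decimalLatitude), !is.na(decimalLongitude)) |>
  count(
    lat_bin = floor(decimalLatitude / 0.5) * 0.5,
    lon_bin = floor(decimalLongitude / 0.5) * 0.5,
    sort = TRUE
  )

# Taxonomic bias: records per class
by_class <- count(occurrences, class, sort = TRUE)
```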
Key Findings from the Kenya Dataset
Applying this pipeline to Kenyan GBIF data consistently reveals several patterns:
- A significant fraction of records fall on country or county centroids, indicating geocoding from administrative labels rather than actual observation coordinates
- Temporal coverage is heavily skewed toward recent decades, with sparse data before 2000
- Spatial sampling is concentrated along major transport corridors and near Nairobi, with large areas of northern Kenya essentially unsampled
- Bird records dominate the dataset, with invertebrates and plants substantially under-represented relative to their actual diversity
Implications for Practice
These findings have direct implications for how GBIF data should be used in environmental assessments:
Never use raw GBIF data without quality filtering. The proportion of records flagged by our pipeline typically falls between 15% and 30%, enough to materially change any downstream analysis.
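Given the occurrences and cleaned objects from the pipeline above, the flagged fraction is a one-line check:

```r
# Share of records removed by the coordinate-cleaning stage
pct_flagged <- (1 - nrow(cleaned) / nrow(occurrences)) * 100
```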
Document your filtering decisions. Reproducible pipelines ensure that quality decisions are transparent and auditable — critical for EIA submissions.
Acknowledge sampling gaps. Absence of records does not mean absence of species. Reporting what the data cannot tell you is as important as what it can.
Combine sources. GBIF data should be supplemented with targeted field surveys, especially in under-sampled regions.
Building Better Tools
This work informs our broader commitment to environmental data transparency. Our open-source kenyaEIAFetcher R package applies similar principles to EIA data access, and we are working on additional tools to make environmental data quality assessment more accessible to practitioners across East Africa.
Reproducible, transparent data quality assessment is not just good practice — it is essential for environmental governance that works.