Assessing Biodiversity Data Quality: A GBIF Kenya Case Study
The Data Quality Challenge in Biodiversity Science
The Global Biodiversity Information Facility (GBIF) aggregates hundreds of millions of species occurrence records from institutions worldwide. For Kenya alone, GBIF holds millions of records spanning decades of observation. But aggregation at this scale introduces systematic data quality challenges that can undermine conservation decisions if left unaddressed.
In our work with environmental impact assessments and biodiversity monitoring across East Africa, we have seen how uncritical use of GBIF data leads to flawed baseline assessments. This post walks through our approach to systematic data quality assessment — a reproducible R pipeline that we apply to every biodiversity project.
Why Data Quality Matters
Biodiversity data feeds directly into conservation planning, EIA baseline studies, and policy decisions. When that data contains coordinate errors, taxonomic misidentifications, or temporal gaps, the downstream consequences are real: protected areas drawn around phantom populations, impact assessments that miss genuinely sensitive species, and monitoring programmes built on unreliable baselines.
The challenge is that GBIF data comes from heterogeneous sources — museum collections, citizen science platforms, academic surveys — each with different quality standards and error profiles.
A Reproducible Quality Assessment Workflow
Our pipeline operates in four stages, each implemented as testable R functions:
Stage 1: Data Retrieval and Initial Profiling
We use the rgbif package to access GBIF’s API programmatically, then profile the dataset to understand its composition before applying any filters.
```r
library(rgbif)
library(dplyr)

# Retrieve occurrence data for Kenya; occ_search() pages up to the GBIF
# search API's cap, so larger pulls should go through occ_download()
occurrences <- occ_search(
  country = "KE",
  limit = 50000
)$data

# Initial profiling, before any filters are applied
profiling <- occurrences |>
  summarise(
    n_records       = n(),
    n_species       = n_distinct(species),
    date_range      = paste(min(year, na.rm = TRUE),
                            max(year, na.rm = TRUE), sep = "-"),
    pct_with_coords = mean(!is.na(decimalLatitude)) * 100
  )
```

Stage 2: Coordinate Validation
Coordinate errors are the most common and consequential data quality issue. We check for:
- Country centroid records — coordinates that fall exactly on Kenya’s geographic centroid, typically indicating missing data that was auto-filled
- Zero coordinates — records at (0, 0) or with zero latitude/longitude
- Precision issues — coordinates rounded to integer degrees, indicating low spatial precision
- Out-of-boundary records — coordinates that fall outside Kenya’s borders despite being tagged as Kenyan occurrences
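The CoordinateCleaner pipeline below handles the first three checks; the out-of-boundary test needs a country polygon. A minimal sketch using sf, with rnaturalearth assumed here as the boundary source (any authoritative Kenya polygon would work):

```r
library(dplyr)
library(sf)
library(rnaturalearth)  # assumed source for the Kenya boundary polygon

kenya <- ne_countries(country = "Kenya", returnclass = "sf")

# Convert records with coordinates to spatial points
pts <- occurrences |>
  filter(!is.na(decimalLatitude), !is.na(decimalLongitude)) |>
  st_as_sf(
    coords = c("decimalLongitude", "decimalLatitude"),
    crs = 4326  # WGS84, the datum GBIF coordinates are published in
  )

# Records whose coordinates do not intersect Kenya's boundary
out_of_bounds <- pts[lengths(st_intersects(pts, kenya)) == 0, ]
```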
```r
library(dplyr)
library(CoordinateCleaner)

# CoordinateCleaner's cc_*() defaults expect lowercase column names,
# while rgbif returns camelCase, so rename first
occ <- occurrences |>
  rename(decimallongitude = decimalLongitude,
         decimallatitude  = decimalLatitude)

# Each cc_*() call drops flagged records by default (value = "clean")
cleaned <- occ |>
  cc_val() |>  # Invalid or missing coordinates
  cc_zero() |> # Zero or near-zero coordinates around (0, 0)
  cc_cen() |>  # Country and province centroids
  cc_dupl() |> # Exact duplicates
  cc_gbif() |> # Records at GBIF headquarters
  cc_inst()    # Records at biodiversity institutions
```

Stage 3: Taxonomic Verification
Taxonomic names change over time as species are reclassified. We validate names against the GBIF backbone taxonomy and flag records with unresolved or disputed classifications.
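Individual names can also be resolved directly against the backbone with rgbif's name_backbone(); the species used here is just an illustration:

```r
library(rgbif)

# Resolve a single name against the GBIF backbone taxonomy
match <- name_backbone(name = "Loxodonta africana")
match$status     # taxonomic status of the matched name, e.g. "ACCEPTED"
match$matchType  # how the match was made, e.g. "EXACT"
```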
```r
# Flag records whose stored names are not accepted, species-level taxa
taxa_check <- occurrences |>
  distinct(species, taxonomicStatus, taxonRank) |>
  filter(
    taxonomicStatus != "ACCEPTED" |
      taxonRank != "SPECIES"
  )
```

Stage 4: Temporal and Sampling Bias Assessment
Even clean data can be misleading if it is unevenly distributed in time or space. We assess:
- Temporal coverage — Are records concentrated in particular years or seasons?
- Spatial sampling bias — Are records clustered near roads, cities, or research stations?
- Taxonomic bias — Are certain taxa over-represented relative to expected diversity?
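These three questions reduce to simple aggregations. One possible sketch, binning records by decade, by half-degree grid cell, and by taxonomic class (the bin sizes are arbitrary choices, not part of the pipeline above):

```r
library(dplyr)

# Temporal coverage: records per decade
by_decade <- occurrences |>
  filter(!is.na(year)) |>
  count(decade = floor(year / 10) * 10)

# Spatial sampling: records per 0.5-degree grid cell
by_cell <- occurrences |>
  filter(!is.na(decimalLatitude), !is.na(decimalLongitude)) |>
  count(
    lat_bin = floor(decimalLatitude / 0.5) * 0.5,
    lon_bin = floor(decimalLongitude / 0.5) * 0.5,
    sort = TRUE
  )

# Taxonomic bias: records per class
by_class <- count(occurrences, class, sort = TRUE)
```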
Key Findings from the Kenya Dataset
Applying this pipeline to Kenyan GBIF data consistently reveals several patterns:
- A significant fraction of records fall on country or county centroids, indicating geocoding from administrative labels rather than actual observation coordinates
- Temporal coverage is heavily skewed toward recent decades, with sparse data before 2000
- Spatial sampling is concentrated along major transport corridors and near Nairobi, with large areas of northern Kenya essentially unsampled
- Bird records dominate the dataset, with invertebrates and plants substantially under-represented relative to their actual diversity
Implications for Practice
These findings have direct implications for how GBIF data should be used in environmental assessments:
Never use raw GBIF data without quality filtering. The proportion of records flagged by our pipeline typically falls between 15% and 30%, enough to materially change any downstream analysis.
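Given the occurrences and cleaned objects from the pipeline above, the flagged fraction is a one-line check:

```r
# Share of records removed by the coordinate-cleaning stage
pct_flagged <- (1 - nrow(cleaned) / nrow(occurrences)) * 100
```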
Document your filtering decisions. Reproducible pipelines ensure that quality decisions are transparent and auditable — critical for EIA submissions.
Acknowledge sampling gaps. Absence of records does not mean absence of species. Reporting what the data cannot tell you is as important as what it can.
Combine sources. GBIF data should be supplemented with targeted field surveys, especially in under-sampled regions.
Building Better Tools
This work informs our broader commitment to environmental data transparency. Our open-source kenyaEIAFetcher R package applies similar principles to EIA data access, and we are working on additional tools to make environmental data quality assessment more accessible to practitioners across East Africa.
Reproducible, transparent data quality assessment is not just good practice — it is essential for environmental governance that works.