
GitHub Agentic PR Dataset

A large-scale, open dataset of ~1.96 million GitHub pull requests authored by AI coding agents — Claude Code, Cursor, GitHub Copilot, and Devin — and by human developers, paired with their commits and file-level diffs. Built for research on agentic AI, automated code generation, bug-fixing, and mining software repositories.

This dataset compares how autonomous coding agents and humans contribute to real open-source projects. It links 1,959,649 pull requests (773,513 agent-authored, 1,186,136 human-authored) to 6.7M+ commits and 55M+ file-level change records with raw patch diffs — and flags 422,618 of those PRs as bug-fixes.

Extends AIDev (Li et al., 2025) — please cite the original work too.

1.96M pull requests
773K agent-authored PRs
6.7M commits
55M file-level diffs
422K bug-fix PRs
4 coding agents

License: CC-BY-4.0 · Format: Parquet · Size: ~87 GB

Coding agents covered

Every agent-authored pull request is labelled with the tool that produced it. Human-authored PRs are listed alongside for the full split.

Claude Code: 419,965 PRs
Cursor: 200,166 PRs
GitHub Copilot: 117,863 PRs
Devin: 35,519 PRs
Humans: 1,186,136 PRs
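
As a quick sanity check on the figures above, the per-agent counts can be summed and turned into shares. This is only an arithmetic sketch over the numbers quoted on this page; the variable and function names are illustrative and not part of the dataset.

```python
# PR counts per author type, copied from the figures listed above.
AGENT_PRS = {
    "Claude Code": 419_965,
    "Cursor": 200_166,
    "GitHub Copilot": 117_863,
    "Devin": 35_519,
}
HUMAN_PRS = 1_186_136

# Totals derived from the per-agent counts.
agent_total = sum(AGENT_PRS.values())
corpus_total = agent_total + HUMAN_PRS


def share(count: int, total: int) -> float:
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


# Each agent's share of all agent-authored PRs, in percent.
agent_shares = {name: share(n, agent_total) for name, n in AGENT_PRS.items()}

# Agent-authored PRs as a share of the whole corpus, in percent.
agent_share_of_corpus = share(agent_total, corpus_total)
```

The two totals come out to the 773,513 agent-authored PRs and 1,959,649 PRs overall stated in the description, so the per-agent breakdown is consistent with the headline numbers.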

What’s inside

Nine Parquet tables span three levels (pull requests, commits, and file-level diffs), joined on id, pr_id, and sha.

File                    Rows          What it is
all_pull_requests       1,959,649     The full corpus: every PR (human + agent).
agent_pull_requests     275,377       A focused collection of agent-authored PRs.
human_pull_requests     1,186,136     Human-authored PRs only.
fix_classified_prs      1,959,649     All PRs labelled with type (fix/other) and source (human/agent).
fix_prs_only            422,618       Only the PRs classified as bug-fixes.
pr_commits              6,737,000     Commit metadata linked to PRs.
pr_commit_details       55,040,478    File-level changes with raw patch diffs.
fix_pr_commits          1,156,238     Commits belonging to bug-fix PRs.
fix_pr_commit_details   7,451,150     File-level changes for bug-fix PRs.
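
The three levels can be walked from PR to commit to file with two joins. The sketch below uses tiny made-up frames rather than the real tables, and it assumes the join keys are id (on PRs), pr_id (on commits), and sha (on commits and file-level rows). The other column names (agent, filename, additions) are hypothetical as well, so check the actual schema before adapting it.

```python
import pandas as pd

# Tiny made-up rows standing in for the three levels. The column names
# (id, pr_id, sha, agent, filename, additions) are assumptions for
# illustration, not the dataset's confirmed schema.
prs = pd.DataFrame({"id": [1, 2], "agent": ["Devin", "Cursor"]})
commits = pd.DataFrame({"pr_id": [1, 1, 2], "sha": ["a1", "b2", "c3"]})
details = pd.DataFrame(
    {
        "sha": ["a1", "a1", "b2", "c3"],
        "filename": ["app.py", "util.py", "app.py", "README.md"],
        "additions": [10, 2, 5, 1],
    }
)

# Level 1 -> 2: attach commits to PRs (PR id matches the commit's pr_id).
pr_commits = prs.merge(commits, left_on="id", right_on="pr_id", how="inner")

# Level 2 -> 3: attach file-level rows to commits via the commit sha.
pr_files = pr_commits.merge(details, on="sha", how="inner")

# Aggregate file-level rows back up to one number per PR.
added_per_pr = pr_files.groupby("id")["additions"].sum().to_dict()
```

On the real data, pr_commit_details has about 55 million rows, so it is probably worth filtering the PR table first (for example to one agent) before joining down to the file level.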

What you can do with it

Agent vs. human code analysis
AI-generated PR detection
Automated program repair
Code generation & instruction tuning
Code-review modelling
Mining software repositories (MSR)
SWE-style benchmarks

Load it in seconds

Works out of the box with 🤗 Datasets, Pandas, Polars, and DuckDB.

# Hugging Face Datasets — pick any table by config name
from datasets import load_dataset

prs = load_dataset("mabujadallah/GitHub-Agentic-PR-Dataset", split="train")
agent_prs = load_dataset(
    "mabujadallah/GitHub-Agentic-PR-Dataset",
    "agent_pull_requests", split="train",
)

# Pandas (reading hf:// paths requires the huggingface_hub package)
import pandas as pd
base = "hf://datasets/mabujadallah/GitHub-Agentic-PR-Dataset/"
df = pd.read_parquet(base + "agent_pull_requests.parquet")
print(df["agent"].value_counts())
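
The same per-agent count can be pushed down into DuckDB so that only the needed column is read. This is a sketch under assumptions: it presumes every table is stored as a root-level <table>.parquet file (as the Pandas path above suggests for agent_pull_requests), that the agent column exists as used above, and that your DuckDB version can read hf:// paths. The helper names are hypothetical.

```python
REPO = "mabujadallah/GitHub-Agentic-PR-Dataset"


def parquet_url(table: str) -> str:
    """Build an hf:// path, assuming each table is a root-level <table>.parquet file."""
    return f"hf://datasets/{REPO}/{table}.parquet"


def agent_counts_query(table: str = "agent_pull_requests") -> str:
    """Return SQL that counts PRs per agent in the given table."""
    return (
        "SELECT agent, COUNT(*) AS n_prs "
        f"FROM read_parquet('{parquet_url(table)}') "
        "GROUP BY agent ORDER BY n_prs DESC"
    )


def run_agent_counts():
    """Execute the query with DuckDB and return a pandas DataFrame (needs network access)."""
    import duckdb  # third-party; imported lazily so the helpers above work without it

    return duckdb.sql(agent_counts_query()).df()
```

Calling run_agent_counts() should give the same breakdown as the value_counts() call above, without loading the full table into memory first.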

Cite this dataset

Dataset: Hugging Face · CC-BY-4.0

GitHub Agentic PR Dataset: Pull Requests from AI Coding Agents and Humans

Abujadallah, M. & Sayagh, M. (2026)

Hugging Face Datasets · huggingface.co/datasets/mabujadallah/GitHub-Agentic-PR-Dataset

@misc{abujadallah_github_agentic_pr_dataset,
  title        = {GitHub Agentic PR Dataset: Pull Requests from AI Coding Agents and Humans},
  author       = {Abujadallah, Mahmoud and Sayagh, Mohammed},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/datasets/mabujadallah/GitHub-Agentic-PR-Dataset}},
  note         = {Hugging Face Datasets}
}

Extends: AIDev · arXiv:2507.15003

The Rise of AI Teammates in Software Engineering (SE) 3.0: How Autonomous Coding Agents Are Reshaping Software Engineering

Li, H., Zhang, H., & Hassan, A. E. (2025)

This dataset extends AIDev — please cite the original work too.

@misc{li2025aiteammates,
  title         = {The Rise of AI Teammates in Software Engineering (SE) 3.0: How Autonomous Coding Agents Are Reshaping Software Engineering},
  author        = {Li, Hao and Zhang, Haoxiang and Hassan, Ahmed E.},
  year          = {2025},
  eprint        = {2507.15003},
  archivePrefix = {arXiv},
  primaryClass  = {cs.SE},
  howpublished  = {\url{https://huggingface.co/datasets/hao-li/AIDev}}
}

Related work: my MSR 2026 study on why agentic pull-request fixes get rejected.