GitHub Agentic PR Dataset

A large-scale, open dataset of ~1.96 million GitHub pull requests authored by AI coding agents — Claude Code, Cursor, GitHub Copilot, and Devin — and by human developers, paired with their commits and file-level diffs. Built for research on agentic AI, automated code generation, bug-fixing, and mining software repositories.

This dataset compares how autonomous coding agents and humans contribute to real open-source projects. It links 1,959,649 pull requests (773,513 agent-authored, 1,186,136 human-authored) to 6.7M+ commits and 55M+ file-level change records with raw patch diffs, and it flags 422,618 of those PRs as bug fixes.

Extends AIDev (Li et al., 2025) — please cite the original work too.

1.96M pull requests

773K agent-authored PRs

6.7M commits

55M file-level diffs

422K bug-fix PRs

4 coding agents

License: CC-BY-4.0 · Format: Parquet · Size: ~87 GB

Coding agents covered

Every agent-authored pull request is labelled with the tool that produced it.

Author	PRs
Claude Code	419,965
Cursor	200,166
GitHub Copilot	117,863
Devin	35,519
Humans	1,186,136
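As a quick consistency check, the per-author counts above can be summed and compared against the headline totals. The sketch below uses only the figures listed on this page; the percentage shares it prints are derived values, not numbers published with the dataset.

```python
# Per-agent PR counts as listed on this page.
AGENT_PRS = {
    "Claude Code": 419_965,
    "Cursor": 200_166,
    "GitHub Copilot": 117_863,
    "Devin": 35_519,
}
HUMAN_PRS = 1_186_136

# Totals implied by the per-author counts.
agent_total = sum(AGENT_PRS.values())
corpus_total = agent_total + HUMAN_PRS

# Share of agent-authored PRs attributed to each tool, in percent.
agent_share = {
    name: round(100 * count / agent_total, 1)
    for name, count in AGENT_PRS.items()
}

print(f"agent-authored PRs: {agent_total:,}")
print(f"all PRs:            {corpus_total:,}")
for name, share in agent_share.items():
    print(f"{name:15s} {share:5.1f}% of agent PRs")
```

The two totals match the 773,513 agent-authored and 1,959,649 overall pull requests quoted earlier on the page.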

What’s inside

Nine Parquet tables span three levels (pull requests, commits, and file-level diffs), joined on id → pr_id → sha.

File	Rows	What it is
`all_pull_requests`	1,959,649	The full corpus — every PR (human + agent).
`agent_pull_requests`	275,377	A focused collection of agent-authored PRs.
`human_pull_requests`	1,186,136	Human-authored PRs only.
`fix_classified_prs`	1,959,649	All PRs labelled with `type` (fix/other) and `source` (human/agent).
`fix_prs_only`	422,618	Only the PRs classified as bug fixes.
`pr_commits`	6,737,000	Commit metadata linked to PRs.
`pr_commit_details`	55,040,478	File-level changes with raw `patch` diffs.
`fix_pr_commits`	1,156,238	Commits belonging to bug-fix PRs.
`fix_pr_commit_details`	7,451,150	File-level changes for bug-fix PRs.
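The id → pr_id → sha join chain can be illustrated on toy data. The sketch below is an assumption-laden example, not the dataset's actual schema: it assumes the PR table exposes `id`, `type`, and `source`, the commit table exposes `pr_id` and `sha`, and the file-level table exposes `sha` plus a hypothetical `filename` column alongside `patch`. It uses Python's built-in `sqlite3` with a handful of made-up rows so that it runs without downloading anything; a query of the same shape should carry over to DuckDB over the Parquet files once the real column names are confirmed.

```python
import sqlite3

# In-memory toy tables mirroring the three levels (column names are assumptions).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE pull_requests (id INTEGER PRIMARY KEY, source TEXT, type TEXT);
    CREATE TABLE pr_commits (pr_id INTEGER, sha TEXT);
    CREATE TABLE pr_commit_details (sha TEXT, filename TEXT, patch TEXT);
""")

# Made-up rows for illustration only.
con.executemany("INSERT INTO pull_requests VALUES (?, ?, ?)", [
    (1, "agent", "fix"),
    (2, "human", "fix"),
    (3, "agent", "other"),
])
con.executemany("INSERT INTO pr_commits VALUES (?, ?)", [
    (1, "a1"), (1, "a2"), (2, "b1"), (3, "c1"),
])
con.executemany("INSERT INTO pr_commit_details VALUES (?, ?, ?)", [
    ("a1", "app.py", "@@ -1 +1 @@"),
    ("a2", "app.py", "@@ -2 +2 @@"),
    ("a2", "test_app.py", "@@ -1 +1 @@"),
    ("b1", "README.md", "@@ -1 +1 @@"),
    ("c1", "setup.py", "@@ -1 +1 @@"),
])

# Walk PR -> commits -> file changes, keeping only bug-fix PRs.
rows = con.execute("""
    SELECT pr.source,
           COUNT(DISTINCT pr.id) AS prs,
           COUNT(DISTINCT c.sha) AS commits,
           COUNT(*)              AS file_changes
    FROM pull_requests AS pr
    JOIN pr_commits AS c        ON c.pr_id = pr.id
    JOIN pr_commit_details AS d ON d.sha = c.sha
    WHERE pr.type = 'fix'
    GROUP BY pr.source
    ORDER BY pr.source
""").fetchall()

for source, prs, commits, file_changes in rows:
    print(source, prs, commits, file_changes)
```

In the real data a commit SHA could in principle appear under more than one PR, so it is worth checking key uniqueness before relying on counts produced by a join like this.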

What you can do with it

Agent vs. human code analysis · AI-generated PR detection · Automated program repair · Code generation & instruction tuning · Code-review modelling · Mining software repositories (MSR) · SWE-style benchmarks

Load it in seconds

Works out of the box with 🤗 Datasets, Pandas, Polars, and DuckDB.

# Hugging Face Datasets — pick any table by config name
from datasets import load_dataset

prs = load_dataset("mabujadallah/GitHub-Agentic-PR-Dataset", split="train")
agent_prs = load_dataset(
    "mabujadallah/GitHub-Agentic-PR-Dataset",
    "agent_pull_requests", split="train",
)

# Pandas (reading hf:// paths requires the huggingface_hub package)
import pandas as pd
base = "hf://datasets/mabujadallah/GitHub-Agentic-PR-Dataset/"
df = pd.read_parquet(base + "agent_pull_requests.parquet")
print(df["agent"].value_counts())
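Polars and DuckDB are listed above but not shown, so here is a minimal sketch of how they might be used. It assumes every table is stored as `<table name>.parquet` at the repository root, which the page only confirms for `agent_pull_requests`; the helper names `table_url`, `scan_with_polars`, and `count_rows_with_duckdb` are hypothetical. Both readers need a reasonably recent release with `hf://` support and network access, so the third-party imports are deferred until a reader is actually called.

```python
REPO = "mabujadallah/GitHub-Agentic-PR-Dataset"

# Table names as listed on this page.
TABLES = {
    "all_pull_requests", "agent_pull_requests", "human_pull_requests",
    "fix_classified_prs", "fix_prs_only", "pr_commits",
    "pr_commit_details", "fix_pr_commits", "fix_pr_commit_details",
}


def table_url(name: str) -> str:
    """Build an hf:// path for a table (assumes one .parquet file per table)."""
    if name not in TABLES:
        raise ValueError(f"unknown table: {name!r}")
    return f"hf://datasets/{REPO}/{name}.parquet"


def scan_with_polars(name: str):
    """Return a lazy Polars frame; nothing is downloaded until .collect()."""
    import polars as pl  # imported lazily so the helper above has no dependencies
    return pl.scan_parquet(table_url(name))


def count_rows_with_duckdb(name: str) -> int:
    """Count rows with DuckDB, which reads hf:// paths through its httpfs extension."""
    import duckdb  # imported lazily for the same reason
    query = f"SELECT COUNT(*) FROM '{table_url(name)}'"
    return duckdb.sql(query).fetchone()[0]


print(table_url("fix_prs_only"))
```

For the larger tables such as `pr_commit_details`, a lazy scan or a SQL aggregate is likely a better starting point than loading the whole file into memory.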

Cite this dataset

Dataset · Hugging Face · CC-BY-4.0

GitHub Agentic PR Dataset: Pull Requests from AI Coding Agents and Humans

Abujadallah, M. & Sayagh, M. (2026)

Hugging Face Datasets · huggingface.co/datasets/mabujadallah/GitHub-Agentic-PR-Dataset

@misc{abujadallah_github_agentic_pr_dataset,
  title        = {GitHub Agentic PR Dataset: Pull Requests from AI Coding Agents and Humans},
  author       = {Abujadallah, Mahmoud and Sayagh, Mohammed},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/datasets/mabujadallah/GitHub-Agentic-PR-Dataset}},
  note         = {Hugging Face Datasets}
}

Extends · AIDev · arXiv:2507.15003

The Rise of AI Teammates in Software Engineering (SE) 3.0: How Autonomous Coding Agents Are Reshaping Software Engineering

Li, H., Zhang, H., & Hassan, A. E. (2025)

@misc{li2025aiteammates,
  title         = {The Rise of AI Teammates in Software Engineering (SE) 3.0: How Autonomous Coding Agents Are Reshaping Software Engineering},
  author        = {Li, Hao and Zhang, Haoxiang and Hassan, Ahmed E.},
  year          = {2025},
  eprint        = {2507.15003},
  archivePrefix = {arXiv},
  primaryClass  = {cs.SE},
  howpublished  = {\url{https://huggingface.co/datasets/hao-li/AIDev}}
}