Bitcoin Research with a Transaction Graph Dataset

The Bitcoin ecosystem is one of the most transparent and data-rich digital economies: every transaction is recorded on a public blockchain, giving researchers direct access to real-world financial interactions. Despite this transparency, a major bottleneck remains: the lack of large-scale, structured, labeled datasets that support deep analysis. This article introduces a Bitcoin transaction graph dataset spanning nearly 13 years, with 252 million nodes, 785 million edges, and over 670 million transactions, that supports research into network dynamics, entity behavior, and financial security.

Designed for scalability and reproducibility, this dataset models real-world entities as nodes and value transfers as directed edges, enriched with temporal timestamps and semantic labels. By combining blockchain analysis with AI-driven labeling techniques, it offers a powerful foundation for studying everything from ransomware operations to institutional adoption patterns.

Why Public Bitcoin Datasets Are Crucial

Bitcoin operates on a decentralized ledger where all transactions are publicly visible. While this transparency fuels innovation in forensic analysis and economic modeling, extracting meaningful insights requires more than raw data—it demands curation.

Existing datasets often focus narrowly on address labeling or small transaction subsets, limiting their utility for broader research. For instance:

Without standardized, comprehensive datasets, researchers must spend months downloading blockchain data, parsing transactions, and reverse-engineering user behaviors—barriers that exclude many from contributing meaningfully to the field.

This new dataset bridges that gap by offering a ready-to-use, large-scale representation of the Bitcoin economy—fully timestamped, richly annotated, and optimized for machine learning applications.

Core Components of the Bitcoin Transaction Graph

Node Definition: From Addresses to Real Entities

At the heart of the dataset is a sophisticated clustering methodology that groups Bitcoin addresses into real-world entities—such as individuals, exchanges, miners, or ransomware operators—based on behavioral heuristics.

Rather than treating each address as an independent node, the system identifies clusters of addresses likely controlled by the same entity using established patterns such as:

This process results in 252 million entity clusters, each represented as a unique node—making it the largest publicly available entity-level Bitcoin graph to date.
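
The clustering step can be illustrated with a minimal sketch. The heuristics the dataset actually uses are not listed in this text, so the example assumes the widely used common-input-ownership heuristic (addresses spent together as inputs of one transaction are treated as one entity) and implements it with a union-find structure. The function and class names are hypothetical and are not taken from the dataset's code.

```python
from collections import defaultdict


class UnionFind:
    """Disjoint-set structure with path compression and union by size."""

    def __init__(self):
        self.parent = {}
        self.size = {}

    def find(self, x):
        # Register unseen addresses as singleton clusters.
        if x not in self.parent:
            self.parent[x] = x
            self.size[x] = 1
        # Walk up to the root of the cluster.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Compress the path so later lookups are faster.
        while self.parent[x] != root:
            nxt = self.parent[x]
            self.parent[x] = root
            x = nxt
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        # Attach the smaller cluster under the larger one.
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]


def cluster_addresses(transactions):
    """Group addresses that co-occur as inputs (assumed heuristic).

    transactions: iterable of lists, each holding the input addresses
    of one transaction.
    """
    uf = UnionFind()
    for input_addresses in transactions:
        if not input_addresses:
            continue  # e.g. coinbase transactions have no input addresses
        first = input_addresses[0]
        uf.find(first)
        for addr in input_addresses[1:]:
            uf.union(first, addr)
    clusters = defaultdict(set)
    for addr in list(uf.parent):
        clusters[uf.find(addr)].add(addr)
    return list(clusters.values())


# Toy input: "A" and "B" are co-spent, then "B" and "C", so all three merge.
example = [["A", "B"], ["B", "C"], ["D"]]
print(sorted(sorted(c) for c in cluster_addresses(example)))
```

Union-find keeps each merge close to constant time, which matters when the input is hundreds of millions of transactions. A production pipeline would likely add further heuristics and safeguards against false merges.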

Edge Construction: Modeling Value Flow

Each directed edge represents a net transfer of value between two entities. For a transaction Δ with inputs Δ_in and outputs Δ_out, each a collection of (value, entity) pairs, the net value received by entity a is the total of the outputs paid to a minus the total of the inputs spent by a:

$$ v_{\Delta}(a) = \sum_{(v', a') \in \Delta_{\mathrm{out}},\ a' = a} v' \; - \sum_{(v', a') \in \Delta_{\mathrm{in}},\ a' = a} v' $$

A negative value marks a as a net sender in Δ, and a positive value marks it as a net recipient.

Edges are timestamped using block height, enabling temporal analysis of fund flows and behavioral evolution over time.
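
The net-value formula above can be sketched directly in Python. The function name, the (value, entity) tuple layout, and the toy transaction are illustrative assumptions; values are taken to be integers in satoshis.

```python
from collections import defaultdict


def net_values(inputs, outputs):
    """Return the net value v_Delta(a) for every entity a in one transaction.

    inputs, outputs: lists of (value, entity) pairs.
    """
    net = defaultdict(int)
    for value, entity in outputs:
        net[entity] += value  # value received by the entity
    for value, entity in inputs:
        net[entity] -= value  # value spent by the entity
    return dict(net)


# Hypothetical transaction: alice spends 100,000 sat, bob receives 70,000,
# alice gets 29,000 back as change, and 1,000 is left as the miner fee.
tx_inputs = [(100_000, "alice")]
tx_outputs = [(70_000, "bob"), (29_000, "alice")]
result = net_values(tx_inputs, tx_outputs)
print(result)  # {'bob': 70000, 'alice': -71000}
```

The net values sum to minus the transaction fee (here -1,000), because the fee is the part of the inputs that no output claims.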

Exclusion Criteria: Ensuring Data Integrity

To maintain clarity and analytical accuracy, certain complex transaction types are excluded:

These exclusions ensure the graph reflects standard transaction logic while preserving interpretability.

Entity Labeling: Bridging On-Chain Data with Real-World Context

One of the dataset’s most valuable features is its labeled nodes—34,000 entities categorized into types such as:

Labels were derived through a hybrid approach combining:

  1. Forum data scraping from BitcoinTalk (over 14 million messages analyzed).
  2. AI-assisted classification using ChatGPT to extract entity names and address roles from unstructured text.
  3. External validation sources, including:

    • Publicly disclosed exchange wallets
    • Sanctions lists (e.g., OFAC SDN)
    • Academic ransomware datasets
    • Blockchain analytics platforms like DefiLlama

This multi-source labeling strategy ensures high accuracy while mitigating bias from any single source.

Frequently Asked Questions

Q: Can this dataset identify individual users?
A: No. All data is pseudonymous. Nodes represent clusters of addresses linked to behaviors—not personal identities—ensuring privacy compliance.

Q: How accurate is the AI-based labeling?
A: Evaluation shows 83–96% accuracy across different prompt types, with highest precision in detecting withdrawals and deposit addresses.

Q: Is the dataset suitable for machine learning?
A: Absolutely. It includes engineered features like transaction frequency, age-normalized activity, and USD-converted values—ideal for training GNNs and classifiers.

Q: What time period does the data cover?
A: Transactions from the first 700,000 blocks—covering Bitcoin’s history up to just before the Taproot upgrade (~2021).

Q: How can I access the dataset?
A: It's publicly available under a Creative Commons license at figshare (DOI: 10.6084/m9.figshare.26305093.v3).

Q: Are there performance benchmarks included?
A: Yes. The study includes trained GNN models (GCN, GraphSAGE, GAT, GIN) with macro-F1 scores up to 0.64, establishing baselines for future research.

Machine Learning Applications and Technical Validation

To validate the dataset’s utility, researchers trained Graph Neural Networks (GNNs) to classify unlabeled nodes based on their structural and feature context.

Model Performance Overview

Model                        Macro-F1 Score
GAT                          0.64
GIN                          0.63
GraphSAGE                    0.61
GCN                          0.58
Gradient Boosting (tabular)  0.57

All four graph-based models score above the tabular gradient-boosting baseline (0.57), by margins of 0.01 to 0.07 macro-F1. This suggests that network topology adds predictive signal, especially for mining pools and betting platforms.
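
For readers unfamiliar with the metric, macro-F1 is the unweighted mean of the per-class F1 scores, so rare classes count as much as common ones. The sketch below computes it in plain Python; the class names and predictions are made-up toy data, not results from the study.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy imbalanced case: a classifier that always predicts the majority class.
y_true = ["exchange"] * 8 + ["mining"] * 2
y_pred = ["exchange"] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {accuracy:.3f}")                   # accuracy = 0.800
print(f"macro-F1 = {macro_f1(y_true, y_pred):.3f}")   # macro-F1 = 0.444
```

The gap between 0.800 accuracy and 0.444 macro-F1 shows why macro-F1 is the stricter choice when entity types are heavily imbalanced.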

Key Predictive Features

Feature importance analysis revealed that the most influential factors include:

These engineered features highlight the importance of contextual normalization when analyzing long-term blockchain data.

Use Cases Beyond Bitcoin Research

While designed for cryptocurrency analysis, this dataset has far-reaching applications:

Financial Crime Detection

By identifying dense subgraphs associated with ransomware or Ponzi schemes, law enforcement and compliance teams can develop early-warning systems using algorithms like Fraudar or Spade+.
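
As a rough illustration of dense-subgraph detection, the sketch below implements plain greedy peeling: repeatedly remove the lowest-degree node and keep the intermediate node set with the highest edges-per-node density. This is the unweighted baseline that peeling-based detectors build on, not an implementation of Fraudar or Spade+ themselves. It treats edges as undirected, and the toy graph is invented for the example.

```python
def densest_subgraph(edges):
    """Greedy peeling for an approximately densest subgraph.

    edges: iterable of (u, v) pairs, treated as undirected.
    Returns (node_set, density) where density = edges / nodes.
    """
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    num_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    best_density, best_nodes = 0.0, set()
    while adj:
        density = num_edges / len(adj)
        if density > best_density:
            best_density, best_nodes = density, set(adj)
        # Remove the node with the fewest remaining neighbours.
        victim = min(adj, key=lambda n: len(adj[n]))
        for nbr in adj.pop(victim):
            adj[nbr].discard(victim)
            num_edges -= 1
    return best_nodes, best_density


# Toy graph: a 4-clique (a, b, c, d) with a sparse tail d-e-f attached.
toy_edges = [
    ("a", "b"), ("a", "c"), ("a", "d"),
    ("b", "c"), ("b", "d"), ("c", "d"),
    ("d", "e"), ("e", "f"),
]
nodes, density = densest_subgraph(toy_edges)
print(sorted(nodes), density)  # ['a', 'b', 'c', 'd'] 1.5
```

The linear scan for the minimum-degree node keeps the code short but makes it quadratic. On a graph of this dataset's size, a bucket queue or heap would be needed.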

Economic Behavior Studies

Researchers can analyze how macroeconomic events—regulatory changes, market crashes, or technological upgrades—affect transaction volume and network structure over time.

Cross-Network Comparisons

The graph’s scale allows comparisons with traditional financial networks or social systems, offering insights into universal patterns of value exchange.

Pre-training for Graph AI

With over 785 million edges, this dataset serves as an ideal pre-training corpus for graph neural networks applied to fraud detection, recommendation engines, or supply chain modeling.

Practical Implementation Guide

System Requirements

Use pg_restore to load the compressed database dump in parallel across multiple jobs (the -j option).

Getting Started

  1. Download the dataset from figshare
  2. Restore using:
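    pg_restore -j 8 -Fd -C -U user -d btc_db dataset_folder/
     Here -j 8 runs eight parallel restore jobs and -Fd reads a directory-format dump. Because -C is set, pg_restore connects to btc_db only to create the database named in the dump, then restores the data into that database, so btc_db must already exist.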
  3. Query sample data:

    SELECT * FROM node_features LIMIT 10;
    SELECT * FROM transaction_edges WHERE block_index > 600000 LIMIT 10;

All code for preprocessing, sampling, and model training is open-source and written in Python using PyTorch Geometric and Scikit-learn.
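
As a hedged sketch of how queried rows might be prepared for a graph library, the function below maps entity identifiers to contiguous indices and builds a two-row edge list in COO layout, the shape PyTorch Geometric expects for edge_index. The column layout of transaction_edges is not documented in this text, so the example assumes each row reduces to a (source_entity, target_entity) pair. The function name and sample ids are hypothetical.

```python
def build_edge_index(rows):
    """Convert (source, target) entity-id pairs to a COO edge list.

    Returns ([sources, targets], index_of), where index_of maps each
    original entity id to a contiguous integer index starting at 0.
    """
    index_of = {}
    sources, targets = [], []
    for src, dst in rows:
        for node in (src, dst):
            if node not in index_of:
                index_of[node] = len(index_of)
        sources.append(index_of[src])
        targets.append(index_of[dst])
    return [sources, targets], index_of


# Made-up entity ids standing in for rows fetched from the database.
sample_rows = [(1001, 2002), (2002, 3003), (1001, 3003)]
edge_index, index_of = build_edge_index(sample_rows)
print(edge_index)  # [[0, 1, 0], [1, 2, 2]]
print(index_of)    # {1001: 0, 2002: 1, 3003: 2}
```

With PyTorch installed, torch.tensor(edge_index, dtype=torch.long) turns the nested list into a tensor of shape [2, num_edges].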

Conclusion

This large-scale, temporally resolved Bitcoin transaction graph dataset marks a significant leap forward in blockchain research infrastructure. By transforming raw blockchain logs into a semantically rich, machine-readable network, it lowers entry barriers for academics, developers, and analysts alike.

With its robust labeling framework, temporal depth, and compatibility with modern AI techniques, it sets a new standard for empirical studies in digital finance. Whether you're detecting illicit activity, modeling economic trends, or exploring graph theory applications, this resource offers unparalleled analytical potential.

As Bitcoin continues to mature as both technology and asset class, datasets like this will be essential for ensuring transparency, security, and innovation across the ecosystem.

