Building a Scalable Blockchain ETL Pipeline
How we process terabytes of blockchain data daily using Airflow, Snowflake, and dbt.
Processing blockchain data at scale is no small feat. With the Bitcoin blockchain alone weighing in at over 400GB of raw transaction data and growing daily, you need a robust ETL pipeline. Here's how we built ours.
The Challenge
Every day, our pipeline needs to:
- Download new blocks from multiple chains
- Parse and validate transaction data
- Transform raw data into analytics-ready tables
- Handle failures gracefully
- Scale horizontally as data grows
Our Architecture
Data Ingestion Layer
We use a custom Python-based async downloader:

```python
# Simplified example of our download approach
async def download_chain_data(chain: str, date: str):
    async with RateLimiter(requests_per_second=4):
        data = await fetch_from_api(chain, date)
        await validate_and_store(data)
```
Key features:
- Rate limiting to respect API constraints
- Checkpointing for resumable downloads
- Parallel processing across chains
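The checkpointing mentioned above can be sketched in a few lines. This is an illustrative sketch, not our production code; `Checkpoint` and `download_day` are hypothetical names. The idea is simply that progress is recorded per (chain, date) key, so a restarted run skips work that already finished.

```python
import json
import os

class Checkpoint:
    """Tracks which (chain, date) downloads have completed, persisted to disk."""

    def __init__(self, path: str):
        self.path = path
        self.done: set[str] = set()
        if os.path.exists(path):
            with open(path) as f:
                self.done = set(json.load(f))

    def is_done(self, key: str) -> bool:
        return key in self.done

    def mark_done(self, key: str) -> None:
        self.done.add(key)
        # Write atomically so a crash mid-write never corrupts the checkpoint.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(sorted(self.done), f)
        os.replace(tmp, self.path)

def download_day(chain: str, date: str, ckpt: Checkpoint) -> bool:
    """Returns True if work was done, False if this day was already complete."""
    key = f"{chain}:{date}"
    if ckpt.is_done(key):
        return False  # already downloaded; skip on re-run
    # ... fetch and store the day's blocks here ...
    ckpt.mark_done(key)
    return True
```

Because the checkpoint survives process restarts, a failed run can be re-triggered and will resume exactly where it left off.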
Orchestration with Airflow
Apache Airflow manages our entire pipeline:
download_task >> validate_task >> upload_task >> load_task >> transform_task
We use:
- Dynamic task mapping for parallel chain processing
- SLA monitoring with alerts
- Automatic retries with exponential backoff
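Airflow supports exponential backoff natively via the `retries`, `retry_delay`, and `retry_exponential_backoff` parameters on any operator. The delay schedule it aims for looks roughly like this sketch, with full jitter added so failed tasks don't all retry in lockstep (the base and cap values here are illustrative, not our actual settings):

```python
import random

def backoff_delay(attempt: int, base: float = 30.0, cap: float = 600.0) -> float:
    """Seconds to wait before retry number `attempt` (0-indexed).

    The nominal delay doubles each attempt up to `cap`; full jitter
    picks a uniform value below it to spread retries out.
    """
    delay = min(cap, base * (2 ** attempt))
    return random.uniform(0, delay)
```

With these defaults, attempt 0 waits up to 30s, attempt 1 up to 60s, and so on until the 600s ceiling.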
Transformation with dbt
Our dbt models follow a layered architecture:
- Staging - Clean raw data
- Intermediate - Join and enrich
- Marts - Business-ready tables
Example mart model:
```sql
-- models/marts/fct_wallet_daily_activity.sql
SELECT
    wallet_address,
    activity_date,
    SUM(received_amount) AS total_received,
    SUM(sent_amount) AS total_sent,
    COUNT(DISTINCT transaction_hash) AS tx_count
FROM {{ ref('int_wallet_transactions') }}
GROUP BY 1, 2
```
Storage in Snowflake
Snowflake handles our analytical workloads with:
- Automatic clustering on frequently queried columns
- Time travel for debugging and recovery
- Separation of storage and compute for cost efficiency
Performance Optimizations
Over time, we've implemented several optimizations:
| Optimization | Impact |
|--------------|--------|
| Polars for schema inference | 3-5x faster DDL generation |
| 1MB download chunks | 15-20% faster downloads |
| Connection pooling | 30-50% faster Snowflake loads |
| Incremental dbt models | 20-40% faster builds |
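The 1MB chunk size in the table above is just a fixed-size streaming read; a minimal sketch of the pattern (a hypothetical helper, not our production downloader):

```python
import io

CHUNK_SIZE = 1024 * 1024  # 1 MB

def iter_chunks(stream: io.BufferedIOBase, chunk_size: int = CHUNK_SIZE):
    """Yield fixed-size chunks from a binary stream until EOF.

    Larger chunks mean fewer syscalls and network round trips per byte;
    past a point, bigger buffers stop helping and just use more memory.
    """
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk
```

The same loop works for HTTP response bodies, local files, or object-store streams, as long as the source exposes a `read()` method.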
Lessons Learned
1. Start Simple, Scale Later
We started with a simple Python script. Only when we hit scaling limits did we add Airflow. Don't over-engineer from day one.
2. Idempotency is Key
Every operation in our pipeline is idempotent. Re-running a failed task produces the same result without duplicating data.
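Concretely, our loads behave like a keyed upsert (think of a warehouse MERGE on natural keys): writing the same batch twice leaves the table unchanged. Here is a toy in-memory sketch of that idea, with hypothetical names:

```python
def upsert_rows(table: dict, rows: list[dict], key_cols: tuple[str, ...]) -> None:
    """Insert-or-replace each row keyed on key_cols.

    Re-running the same batch overwrites identical rows instead of
    appending duplicates, which is what makes the load idempotent.
    """
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        table[key] = row
```

Keying on something like `(chain, date)` means a retried task simply rewrites the same partition rather than doubling its contents.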
3. Monitor Everything
We track:
- Download success rates
- Processing latency
- Data quality metrics
- Cost per chain
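A per-chain success-rate counter like the one we track can be sketched in a few lines (a hypothetical class, not our actual monitoring stack):

```python
from collections import Counter

class PipelineMetrics:
    """Counts per-chain outcomes and derives a success rate."""

    def __init__(self):
        self.counts = Counter()

    def record(self, chain: str, ok: bool) -> None:
        self.counts[(chain, "ok" if ok else "fail")] += 1

    def success_rate(self, chain: str) -> float:
        ok = self.counts[(chain, "ok")]
        fail = self.counts[(chain, "fail")]
        total = ok + fail
        # No observations yet: report 1.0 rather than divide by zero.
        return ok / total if total else 1.0
```

In practice these counters would be emitted to a metrics backend and alerted on, but the aggregation itself is this simple.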
4. Test Your Transforms
dbt tests are essential:
```yaml
models:
  - name: fct_wallet_daily_activity
    tests:
      - unique:
          column_name: "wallet_address || activity_date"
      - not_null:
          column_name: activity_date
```
What's Next
We're currently working on:
- Streaming ingestion with Kafka
- Real-time dashboards with Streamlit
- ML-powered anomaly detection
Stay tuned for more technical deep-dives!