
Building a Scalable Blockchain ETL Pipeline

How we process terabytes of blockchain data daily using Airflow, Snowflake, and dbt.

Lokeswar Kudumula - Founder & Lead Engineer
Tags: etl, airflow, snowflake, dbt, architecture

Processing blockchain data at scale is no small feat. The Bitcoin blockchain alone holds over 400GB of raw transaction data, and it grows every day. You need a robust ETL pipeline; here's how we built ours.

The Challenge

Every day, our pipeline needs to:

  • Download new blocks from multiple chains
  • Parse and validate transaction data
  • Transform raw data into analytics-ready tables
  • Handle failures gracefully
  • Scale horizontally as data grows

Our Architecture

Data Ingestion Layer

We use a custom Python-based asynchronous downloader:

# Simplified example of our download approach
async def download_chain_data(chain: str, date: str):
    async with RateLimiter(requests_per_second=4):
        data = await fetch_from_api(chain, date)
        await validate_and_store(data)

Key features:

  • Rate limiting to respect API constraints
  • Checkpointing for resumable downloads
  • Parallel processing across chains (sketched below)
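
Here's a rough sketch of how the checkpointing and parallel pieces fit together. It reuses the download_chain_data coroutine above; the checkpoint file format and paths are illustrative, not our exact implementation:

import asyncio
import json
from pathlib import Path

CHECKPOINT = Path("checkpoints/downloads.json")

def already_done(chain: str, date: str) -> bool:
    # Skip work that a previous (possibly interrupted) run already finished
    done = json.loads(CHECKPOINT.read_text()) if CHECKPOINT.exists() else {}
    return date in done.get(chain, [])

def mark_done(chain: str, date: str) -> None:
    done = json.loads(CHECKPOINT.read_text()) if CHECKPOINT.exists() else {}
    done.setdefault(chain, []).append(date)
    CHECKPOINT.parent.mkdir(parents=True, exist_ok=True)
    CHECKPOINT.write_text(json.dumps(done))

async def download_all(chains: list[str], date: str) -> None:
    # One coroutine per chain; asyncio.gather runs the downloads concurrently
    pending = [c for c in chains if not already_done(c, date)]
    await asyncio.gather(*(download_chain_data(c, date) for c in pending))
    for chain in pending:
        mark_done(chain, date)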

Orchestration with Airflow

Apache Airflow manages our entire pipeline:

download_task >> validate_task >> upload_task >> load_task >> transform_task

We use:

  • Dynamic task mapping for parallel chain processing (example DAG below)
  • SLA monitoring with alerts
  • Automatic retries with exponential backoff
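
Here's a trimmed-down sketch of what that looks like as DAG code, assuming Airflow 2.4+ and the TaskFlow API. Only the first two stages are shown, and the chain list and task bodies are placeholders rather than our production code:

from datetime import timedelta

import pendulum
from airflow.decorators import dag, task

CHAINS = ["bitcoin", "ethereum", "solana"]  # illustrative

default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "retry_exponential_backoff": True,  # 5m, 10m, 20m, ...
    "sla": timedelta(hours=2),          # flag tasks that run past their SLA
}

@dag(
    schedule="@daily",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,
    default_args=default_args,
)
def blockchain_etl():
    @task
    def download(chain: str) -> str:
        # Pull the day's raw blocks for one chain and return the staging path
        return f"s3://staging/{chain}"

    @task
    def validate(path: str) -> str:
        # Schema and integrity checks before loading
        return path

    # Dynamic task mapping: one download/validate instance per chain
    validate.expand(path=download.expand(chain=CHAINS))

blockchain_etl()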

Transformation with dbt

Our dbt models follow a layered architecture:

  1. Staging - Clean raw data
  2. Intermediate - Join and enrich
  3. Marts - Business-ready tables

Example mart model:

-- models/marts/fct_wallet_daily_activity.sql
SELECT
    wallet_address,
    activity_date,
    SUM(received_amount) as total_received,
    SUM(sent_amount) as total_sent,
    COUNT(DISTINCT transaction_hash) as tx_count
FROM {{ ref('int_wallet_transactions') }}
GROUP BY 1, 2

Storage in Snowflake

Snowflake handles our analytical workloads with:

  • Automatic clustering on frequently-queried columns
  • Time travel for debugging and recovery
  • Separation of storage and compute for cost efficiency
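
For a sense of what that looks like day to day, here's an illustrative snippet using the Snowflake Python connector; the connection parameters and table names are placeholders, not our production setup:

import os

import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account",
    user="etl_user",
    password=os.environ["SNOWFLAKE_PASSWORD"],
    warehouse="ETL_WH",
    database="ANALYTICS",
    schema="MARTS",
)
cur = conn.cursor()

# Cluster the mart on the columns dashboards filter by most often
cur.execute(
    "ALTER TABLE fct_wallet_daily_activity CLUSTER BY (activity_date, wallet_address)"
)

# Time travel: query the table as it looked an hour ago while debugging
cur.execute("SELECT COUNT(*) FROM fct_wallet_daily_activity AT(OFFSET => -3600)")
print(cur.fetchone())

cur.close()
conn.close()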

Performance Optimizations

Over time, we've implemented several optimizations:

| Optimization | Impact |
|--------------|--------|
| Polars for schema inference | 3-5x faster DDL generation |
| 1MB download chunks | 15-20% faster downloads |
| Connection pooling | 30-50% faster Snowflake loads |
| Incremental dbt models | 20-40% faster builds |
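
To make the first row concrete, here's roughly how Polars-based schema inference can feed DDL generation. This is a simplified sketch assuming a recent Polars version; the file path, type mapping, and table name are illustrative:

import polars as pl

def to_snowflake_type(dtype) -> str:
    # Very rough mapping; extend for decimals, nested types, etc.
    if dtype in (pl.Int8, pl.Int16, pl.Int32, pl.Int64):
        return "NUMBER(38,0)"
    if dtype in (pl.Float32, pl.Float64):
        return "FLOAT"
    if dtype == pl.Boolean:
        return "BOOLEAN"
    if dtype == pl.Date:
        return "DATE"
    if dtype == pl.Datetime:
        return "TIMESTAMP_NTZ"
    if dtype == pl.Utf8:
        return "VARCHAR"
    return "VARIANT"

def ddl_from_parquet(path: str, table: str) -> str:
    # Scanning reads only Parquet metadata, so inference stays fast on large files
    schema = pl.scan_parquet(path).collect_schema()
    columns = ",\n  ".join(
        f"{name} {to_snowflake_type(dtype)}" for name, dtype in schema.items()
    )
    return f"CREATE TABLE IF NOT EXISTS {table} (\n  {columns}\n);"

print(ddl_from_parquet("bitcoin_2024-01-01.parquet", "raw_btc_transactions"))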

Lessons Learned

1. Start Simple, Scale Later

We started with a simple Python script. Only when we hit scaling limits did we add Airflow. Don't over-engineer from day one.

2. Idempotency is Key

Every operation in our pipeline is idempotent. Re-running a failed task produces the same result without duplicating data.
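
The pattern itself is simple: delete whatever a previous attempt may have written for a partition, then insert. A minimal sketch, assuming a DB-API style connection and an illustrative raw_transactions table (placeholder names; the parameter placeholder style depends on your driver):

def load_partition(conn, chain: str, block_date: str, rows: list[tuple]) -> None:
    cur = conn.cursor()
    # Wipe anything a previous (possibly partial) run wrote for this partition,
    # so re-running the task can never duplicate data
    cur.execute(
        "DELETE FROM raw_transactions WHERE chain = %s AND block_date = %s",
        (chain, block_date),
    )
    cur.executemany(
        "INSERT INTO raw_transactions (chain, block_date, tx_hash, payload) "
        "VALUES (%s, %s, %s, %s)",
        [(chain, block_date, tx_hash, payload) for tx_hash, payload in rows],
    )
    conn.commit()
    cur.close()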

3. Monitor Everything

We track:

  • Download success rates
  • Processing latency
  • Data quality metrics
  • Cost per chain

4. Test Your Transforms

dbt tests are essential:

models:
  - name: fct_wallet_daily_activity
    tests:
      - unique:
          column_name: "wallet_address || activity_date"
      - not_null:
          column_name: activity_date

What's Next

We're currently working on:

  • Streaming ingestion with Kafka
  • Real-time dashboards with Streamlit
  • ML-powered anomaly detection

Stay tuned for more technical deep-dives!