Production Streaming Snapshot Restore

A robust bash script for streaming large tar archives directly to disk, with automatic retry logic, stall detection, and progress monitoring.

Features

  • Streaming extraction - Downloads and extracts simultaneously, no temporary files
  • Production-grade reliability - Automatic retries with increasing backoff
  • Stall detection - Watchdog automatically kills and retries stalled downloads
  • Progress monitoring - Real-time download ETA via pv, or extraction speed fallback
  • Compression support - Auto-detects zstd, lz4, gzip, bzip2, xz, and plain tar (by extension or magic bytes)
  • Minimal disk usage - No temporary tar file, extracts on-the-fly
  • Connection resilience - TCP keepalive, nodelay, and aggressive timeout handling
  • Structured logging - Text or JSON output for log aggregation
  • Checksum verification - Optional SHA-256 integrity check

Use Cases

Perfect for:

  • Blockchain snapshot restoration (Ethereum, Cosmos, etc.)
  • Large database backups
  • CI/CD deployment of large archives
  • Any scenario where disk space is limited but reliability is critical

Requirements

  • bash 4.4+
  • curl
  • tar
  • Compression tools (zstd, lz4, gzip, bzip2, xz) - only needed if using compressed archives
  • Standard Unix utilities: du, awk, numfmt
  • Optional for accurate ETA: pv (auto-detected)
  • Optional for checksum verification: sha256sum or shasum, tee, mktemp

Usage

Basic Usage

export RESTORE_SNAPSHOT=true
export URL="https://example.com/snapshot.tar"
export DIR="/data"

./stream-download.sh

Docker Usage

FROM alpine:3.22

RUN apk add --no-cache \
  bash curl tar \
  zstd lz4 gzip bzip2 xz \
  coreutils pv ca-certificates dumb-init

COPY stream-download.sh /usr/local/bin/stream-download.sh
RUN chmod +x /usr/local/bin/stream-download.sh

ENTRYPOINT ["/usr/bin/dumb-init", "--"]
CMD ["/bin/bash", "-c", "/usr/local/bin/stream-download.sh"]

Kubernetes Usage

apiVersion: v1
kind: Pod
metadata:
  name: snapshot-restore
spec:
  initContainers:
  - name: restore-snapshot
    image: your-image:latest
    env:
    - name: RESTORE_SNAPSHOT
      value: "true"
    - name: URL
      value: "https://snapshot.arbitrum.foundation/arb1/classic-archive.tar"
    - name: DIR
      value: "/storage"
    - name: SUBPATH
      value: "db"
    - name: TAR_ARGS
      value: "--strip-components=1"
    volumeMounts:
    - name: data
      mountPath: /storage
  containers:
  - name: main
    image: your-app:latest
    volumeMounts:
    - name: data
      mountPath: /storage
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: your-pvc

Environment Variables

Variable                 Default  Description
RESTORE_SNAPSHOT         false    Set to true to enable snapshot restore
URL                      -        Required. URL of the snapshot to download
DIR                      -        Required. Absolute path to extract the snapshot into
SUBPATH                  ""       Subdirectory within DIR to extract into (e.g., db)
TAR_ARGS                 ""       Additional arguments passed to tar (e.g., --strip-components=1)
COMPRESSION              auto     Compression format: auto, none, gzip, bzip2, xz, zstd, lz4
RM_SUBPATH               true     Remove the SUBPATH directory before extraction (set to false to keep it)
MAX_RETRIES              10       Number of retry attempts before giving up
STALL_MINUTES            3        Minutes of no progress before the watchdog kills curl
CURL_SPEED_LIMIT         102400   Minimum bytes/sec before curl considers the connection stalled
CURL_SPEED_TIME          180      Seconds at low speed before curl aborts
DEBUG                    false    Set to true to enable verbose shell tracing
CURL_INSECURE            false    Set to true to skip TLS verification
CACERT                   ""       Path to a CA bundle for TLS verification
CURL_EXTRA_ARGS          ""       Extra arguments appended to curl (advanced use)
CHECKSUM_SHA256          ""       Expected SHA-256 of the downloaded stream
USE_PV                   auto     Use pv for download ETA when available
STATUS_INTERVAL_SECONDS  30       Progress update interval in seconds
LOG_FORMAT               text     Logging format: text or json

How It Works

Streaming Architecture

┌─────────┐    ┌──────────────┐    ┌─────┐    ┌────────────┐
│  curl   │───▶│ decompressor │───▶│ tar │───▶│ /storage/* │
└─────────┘    └──────────────┘    └─────┘    └────────────┘
     │              │                  │              │
     └──────────────┴──────────────────┴──────────────┘
                          │
                    ┌─────▼──────┐
                    │  monitors  │
                    │ watchdog + │
                    │   status   │
                    └────────────┘
  1. curl streams data from URL with connection monitoring
  2. decompressor (if needed) decompresses on-the-fly
  3. tar extracts files directly to disk
  4. monitors track progress and detect stalls
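
In shell terms, one attempt is roughly equivalent to the pipeline below (a minimal sketch assuming a zstd-compressed archive; the real script also wires in the watchdog, status reporting, and the optional pv and checksum stages):

# Minimal sketch of a single streaming attempt (illustrative only)
curl -fSL \
  --speed-limit "${CURL_SPEED_LIMIT:-102400}" \
  --speed-time "${CURL_SPEED_TIME:-180}" \
  "$URL" \
| zstd -dc \
| tar -xf - -C "$DIR" ${TAR_ARGS}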

Retry Logic

  • Automatic retry - Up to 10 attempts (MAX_RETRIES) with a backoff that grows by 10s per attempt (10s, 20s, 30s, ...); see the sketch below
  • Stall detection - Watchdog kills the download if there is no progress for 3 minutes (STALL_MINUTES)
  • Connection monitoring - curl aborts if transfer speed stays below 100 KiB/s (CURL_SPEED_LIMIT) for 180 seconds (CURL_SPEED_TIME)
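
A simplified version of this loop (an illustrative sketch; stream_and_extract is the script's streaming function, but the surrounding control flow here is assumed, not copied from the source):

# Sketch of the retry loop; the delay grows by 10 seconds per attempt
attempt=1
until stream_and_extract; do
  if (( attempt >= ${MAX_RETRIES:-10} )); then
    echo "Giving up after ${attempt} attempts" >&2
    exit 1
  fi
  sleep $(( attempt * 10 ))  # 10s, 20s, 30s, ...
  (( attempt++ ))
done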

Progress Monitoring

With pv available and known file size (text mode):

Download: 45% | 278GiB / 613GiB | Speed: 245MiB/s | ETA: 23m

With pv available and known file size (JSON mode):

{"ts":"2026-02-05T12:00:00Z","level":"info","event":"download","percent":45,"bytes":298521149440,"total":658280898560,"speed_bps":256901120,"eta_seconds":1380}

Stall warnings:

No progress detected for 1 minute(s) (278GiB extracted)
No progress detected for 2 minute(s) (278GiB extracted)
WATCHDOG: Detected stall for 3 minutes, killing download to trigger retry
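
When pv is used, the download stage is wired up roughly as follows (a sketch; it assumes the server reports Content-Length and omits the decompressor for brevity):

# Sketch: probe the size, then let pv report rate/ETA on the raw stream
size=$(curl -sIL "$URL" | awk 'tolower($1) == "content-length:" { print $2 + 0 }' | tail -n 1)
curl -fSL "$URL" | pv -s "$size" | tar -xf - -C "$DIR"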

Examples

Arbitrum Snapshot Restoration

export RESTORE_SNAPSHOT=true
export URL="https://snapshot.arbitrum.foundation/arb1/classic-archive.tar"
export DIR="/storage"
export SUBPATH="db"
export TAR_ARGS="--strip-components=1"

./stream-download.sh

Compressed Snapshot with Custom Settings

export RESTORE_SNAPSHOT=true
export URL="https://example.com/snapshot.tar.zst"
export DIR="/data"
export COMPRESSION="zstd"  # or use "auto" to auto-detect
export MAX_RETRIES=5

./stream-download.sh

Skip Existing Data

export RESTORE_SNAPSHOT=true
export URL="https://example.com/snapshot.tar"
export DIR="/data"
export SUBPATH="database"
export RM_SUBPATH="false"  # Don't delete existing data

./stream-download.sh

Troubleshooting

Download keeps failing

Check connection stability:

# Measure average download speed without saving the file
curl -sS -o /dev/null -w 'avg speed: %{speed_download} bytes/s\n' https://your-snapshot-url.tar

# Check whether the server supports HTTP keep-alive
curl -sI https://your-snapshot-url.tar | grep -i "keep-alive"

Increase retry attempts:

export MAX_RETRIES=20

Stalls frequently

The watchdog detects stalls after 3 minutes of no progress by default. Tune it:

export STALL_MINUTES=5  # Wait 5 minutes instead of 3

Out of disk space

This script uses minimal space (extracts on-the-fly), but you need enough space for the extracted data. The script warns on startup if free space is less than the file size.

df -h /storage

Limitations

No Resume Capability

This streaming approach cannot resume from a specific byte position. If the download fails, it restarts from the beginning.

Why? Tar archives (and compressed streams) must be read sequentially. Resuming at an arbitrary byte offset would hand tar and the decompressor data with no valid header context, so extraction would fail.

Mitigation:

  • Automatic retries with increasing backoff
  • Stall detection and auto-recovery
  • Connection speed monitoring
  • Most downloads succeed on the first attempt over a stable connection

Acceptance of Trade-offs

This streaming approach prioritizes:

  • Minimal disk space usage
  • Immediate file availability
  • Simple, predictable behavior

The trade-off is that failed downloads restart from the beginning. However, with retry logic, stall detection, and connection monitoring, the vast majority of downloads complete successfully.

Performance

Typical Performance

Snapshot Size  Network Speed  Extraction Time
100 GB         100 Mbps       ~2.5 hours
500 GB         100 Mbps       ~12 hours
1 TB           1 Gbps         ~2.5 hours

Bottlenecks

  • Network - Usually the limiting factor
  • Disk I/O - Can bottleneck on slow disks (HDD vs SSD)
  • CPU - Decompression (zstd, lz4) can be CPU-intensive

Security Considerations

  • TLS verification is enabled by default; use CURL_INSECURE=true only when required
  • Prefer CACERT=/path/to/ca-bundle.crt for custom CAs
  • No authentication - assumes public snapshot URLs
  • Optional SHA-256 verification via CHECKSUM_SHA256
  • DIR must be an absolute path; SUBPATH must be relative with no .. components (see the sketch after this list)
  • --compressed is intentionally omitted from curl to avoid double-decode on misconfigured CDNs
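
A defensive check along these lines enforces the path constraints (illustrative only; the script's actual validation may differ):

# Illustrative path validation mirroring the constraints above
case "$DIR" in
  /*) ;;  # absolute path: OK
  *) echo "DIR must be an absolute path" >&2; exit 1 ;;
esac
case "/$SUBPATH/" in
  //*|*/../*) echo "SUBPATH must be relative, without .." >&2; exit 1 ;;
esac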

Accurate ETA

Download ETA is computed from bytes received, not extracted. For accurate ETA:

  • Ensure the server provides Content-Length or supports HTTP Range requests (verified with the check below)
  • pv is used automatically when available and file size is known (USE_PV=auto)
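
To confirm a server advertises the size up front:

# Check for a Content-Length header (printed once per redirect hop)
curl -sIL "https://example.com/snapshot.tar" | grep -i '^content-length'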

Log Format

Set LOG_FORMAT=json for line-delimited JSON logs (useful for log aggregation):

export LOG_FORMAT="json"
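
The resulting stream can be filtered with standard tooling, for example jq (field names taken from the sample download event above; fromjson? skips any non-JSON lines):

# Pull progress fields out of the JSON log stream
./stream-download.sh 2>&1 \
  | jq -rR 'fromjson? | select(.event == "download") | "\(.percent)% eta \(.eta_seconds)s"'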

Advanced Configuration

Custom curl Options

Use environment variables to customize curl:

export CACERT="/path/to/ca-bundle.crt"
export CURL_EXTRA_ARGS="--retry 2 --retry-delay 5"

Checksum Verification

Verify the download stream with SHA-256:

export CHECKSUM_SHA256="abc123...yourchecksum..."

Changing the checksum re-triggers the download even if the stamp file exists.
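
Conceptually, the stream can be hashed while it is being extracted, so no second pass over the data is needed. A sketch of the technique using bash process substitution (not necessarily the script's exact implementation):

# Sketch: tee the stream into tar while sha256sum consumes the main pipe.
# NB: a real script would also confirm that the tar process exited cleanly.
actual=$(curl -fSL "$URL" | tee >(tar -xf - -C "$DIR") | sha256sum | awk '{ print $1 }')
if [ "$actual" != "$CHECKSUM_SHA256" ]; then
  echo "Checksum mismatch: got $actual" >&2
  exit 1
fi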

Custom Watchdog Timing

Tune stall detection threshold:

export STALL_MINUTES=5  # Wait 5 minutes instead of 3

Disable Watchdog

Comment out watchdog in stream_and_extract function:

# watchdog &
# WATCHDOG_PID=$!

Support

For issues, questions, or contributions, please refer to your internal documentation or contact your DevOps team.

Credits

This project is based on the excellent init-stream-download tool by GraphOps. We've extended it to support additional compression formats while maintaining full backward compatibility with the original.
