Ethereum Nonce Management — Problem Analysis & Solution Proposals

Current Architecture

┌──────────────────────┐      HTTP (Basic Auth)      ┌────────────────────────┐
│  apps/core           │ ──────────────────────────▶  │  apps/relayer          │
│                      │                              │                        │
│  WithdrawService     │  1. getNonceForAddress()     │  EvmController         │
│  ContractInteractions│  2. getCurrentFeeEVM()       │       │                │
│  TokenizationService │  3. sendRawEVMTransaction()  │       ▼                │
│                      │                              │  EvmService            │
│  Signs tx locally    │                              │       │                │
│  with ethers.Wallet  │                              │       ▼                │
└──────────────────────┘                              │  EvmProviderPool       │
                                                      │       │                │
                                                      │  ┌────▼─────┐         │
                                                      │  │  Redis   │ (nonce) │
                                                      │  └──────────┘         │
                                                      │       │                │
                                                      │  ┌────▼─────┐         │
                                                      │  │  RPC Node│         │
                                                      │  └──────────┘         │
                                                      └────────────────────────┘
Transaction lifecycle:
  1. core calls relayer’s POST /evm/address/nonce to get the next nonce
  2. core calls relayer’s POST /evm/getCurrentFee to get gas prices
  3. core signs the transaction locally with ethers.Wallet.signTransaction()
  4. core calls relayer’s POST /evm/transaction/sendRaw with the signed tx
  5. Relayer broadcasts via eth_sendRawTransaction and increments tracked nonce in Redis
Nonce tracking (relayer):
  • Redis key: relayer:evm:nonce:{networkId}:{address}
  • getNonce(): max(onChainNonce, redisTrackedNonce) → persisted via SET
  • sendRawTransaction() success: INCR on the nonce key (atomic)
  • Error recovery: DEL the key on nonce errors, forcing re-fetch from chain

Identified Problems

Problem 1: TOCTOU Race in getNonce() (Critical)

File: apps/relayer/src/evm/evm.service.ts:49-119
The getNonce → sign → sendRawTransaction → INCR flow is not atomic. Two concurrent callers can get the same nonce:
Instance A: getNonce() → reads Redis=5, chain=5 → returns 5
Instance B: getNonce() → reads Redis=5, chain=5 → returns 5  (A hasn't sent yet)
Instance A: signs tx with nonce 5, sends → INCR → Redis=6
Instance B: signs tx with nonce 5, sends → "nonce already used" ✗
This is the core bug. The nonce is read and returned without reservation. Between the read and the subsequent INCR (which only happens after broadcast), another caller can obtain the same value.
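The interleaving above can be reproduced with a small in-memory model. This is only a sketch: `NonceStore` is a hypothetical stand-in for the Redis key plus the chain, not the relayer's actual code, and it assumes a broadcast transaction is mined immediately.

```typescript
// Hypothetical in-memory stand-in for the Redis nonce key and the chain.
class NonceStore {
  private tracked: number | null = null;
  private chainNonce: number;

  constructor(chainNonce: number) {
    this.chainNonce = chainNonce;
  }

  // Current behaviour: max(onChainNonce, redisTrackedNonce), persisted via SET.
  // Nothing is reserved here, so repeated calls return the same value.
  getNonce(): number {
    const next = Math.max(this.chainNonce, this.tracked ?? 0);
    this.tracked = next;
    return next;
  }

  // Current behaviour: the tracked nonce is incremented only after a
  // successful broadcast. Returns false when the nonce was already used.
  sendRaw(nonce: number): boolean {
    if (nonce !== this.chainNonce) return false;
    this.chainNonce += 1; // simplification: the tx is mined immediately
    this.tracked = (this.tracked ?? 0) + 1;
    return true;
  }
}

const store = new NonceStore(5);
const nonceA = store.getNonce(); // Instance A reads 5
const nonceB = store.getNonce(); // Instance B reads 5 as well: A has not sent yet
const sentA = store.sendRaw(nonceA); // accepted, tracked nonce becomes 6
const sentB = store.sendRaw(nonceB); // rejected: nonce 5 is already used
console.log({ nonceA, nonceB, sentA, sentB });
```

Both callers receive nonce 5 because `getNonce()` never advances the tracked value; only the first send succeeds.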

Problem 2: Split Read/Write Across Services

core fetches nonce via HTTP, signs locally, then sends signed tx back. This means:
  • The nonce is “in flight” for the entire duration of HTTP round-trip + signing + another HTTP round-trip
  • The relayer has no knowledge that a nonce has been “claimed” until the signed tx arrives
  • No way to recycle a nonce if core crashes between getNonce and sendRaw

Problem 3: No Nonce Recycling on Pre-Broadcast Failures

If core gets nonce 5 and then signing fails (bad key, encoding error), or the HTTP call to sendRaw times out, nonce 5 is consumed in Redis (via setTrackedNonce) but never broadcast. This creates a nonce gap: all subsequent transactions queue behind a phantom nonce 5 that will never be mined.

Problem 4: setTrackedNonce() Overwrites Without CAS

File: apps/relayer/src/evm/evm.provider-pool.service.ts:243-250
setTrackedNonce is a plain SET, not a compare-and-swap. Concurrent getNonce() calls can clobber each other:
A: reads chain=7, tracked=5 → calculates max=7
B: reads chain=7, tracked=5 → calculates max=7
A: SET nonce=7
B: SET nonce=7  (harmless here, but if B had computed its value from an older chain read, this SET would overwrite A's newer value)
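To make the stale-write case concrete, here is a minimal in-memory sketch. `TrackedNonce` is a hypothetical model of the Redis key, and `setIfHigher` stands in for a guarded write that would have to run atomically on the Redis side (for example inside a Lua script).

```typescript
// Hypothetical in-memory model of the tracked-nonce key.
class TrackedNonce {
  private value: number | null = null;

  // Plain SET: the last writer wins, even if its value is stale.
  set(candidate: number): void {
    this.value = candidate;
  }

  // Guarded write: only store the candidate if the key is missing
  // or the candidate is higher than what is already stored.
  setIfHigher(candidate: number): boolean {
    if (this.value === null || candidate > this.value) {
      this.value = candidate;
      return true;
    }
    return false;
  }

  get(): number | null {
    return this.value;
  }
}

// Interleaving: A computed 7 from an older chain read, B computed 8,
// and B happens to write first.
const plain = new TrackedNonce();
plain.set(8); // B
plain.set(7); // A: stale value clobbers B's write

const guarded = new TrackedNonce();
guarded.setIfHigher(8); // B
const staleApplied = guarded.setIfHigher(7); // A: rejected, 8 is kept

console.log(plain.get(), guarded.get(), staleApplied);
```

With the plain SET the tracked nonce moves backwards to 7; with the guarded write it stays at 8.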

Problem 5: Circuit Breaker Was Tripping on Nonce Errors (Fixed in ca79d480)

The recent fix ca79d480 corrected an issue where nonce errors (a tx-level problem) were being recorded as provider failures, tripping the circuit breaker and blocking all subsequent transactions on that network. This is now fixed — only connection errors trip the breaker.

Problem 6: No Stuck Transaction Recovery

There is no mechanism to detect or recover from transactions that are:
  • Stuck in mempool (gas too low for current conditions)
  • Dropped by the mempool (Geth evicts queued, non-executable transactions after 3h by default)
  • Creating nonce gaps that block subsequent transactions

Solution Approaches

Approach A: Atomic Nonce Reservation (Redis Lua Script)

Summary: Replace the read-then-write nonce flow with a single atomic acquireNonce operation using a Redis Lua script. The nonce is reserved at read time, not after broadcast. Changes required:
  1. Relayer — new Lua-based acquireNonce() method in evm.provider-pool.service.ts:
    async acquireNonce(address: string, networkId: string): Promise<number> {
      const key = this.nonceKey(address, networkId);
      const lua = `
        local current = redis.call('GET', KEYS[1])
        if current == false then return nil end
        redis.call('SET', KEYS[1], tonumber(current) + 1)
        return current
      `;
      const result = await this.redis.client.eval(lua, 1, key);
      if (result === null) {
        // First call or after reset — seed from chain
        await this.syncNonceFromChain(address, networkId);
        return this.acquireNonce(address, networkId);
      }
      return Number(result);
    }
    
  2. Relayer — new syncNonceFromChain() method to initialize/re-seed:
    async syncNonceFromChain(address: string, networkId: string): Promise<void> {
      // rpcUrl is not defined in this snippet; resolve it from the network's config before this call
      const provider = await this.getProvider(networkId, rpcUrl);
      const chainNonce = await provider.getTransactionCount(address);
      const key = this.nonceKey(address, networkId);
      // Only set if key doesn't exist or chain is higher (Lua for atomicity)
      const lua = `
        local current = redis.call('GET', KEYS[1])
        if current == false or tonumber(current) < tonumber(ARGV[1]) then
          redis.call('SET', KEYS[1], ARGV[1])
        end
      `;
      await this.redis.client.eval(lua, 1, key, chainNonce);
    }
    
  3. Relayer — new recycleNonce() method for failed-before-broadcast:
    async recycleNonce(address: string, networkId: string, nonce: number): Promise<void> {
      const key = this.nonceKey(address, networkId);
      // Only recycle if it's still the "next" nonce (no one else took a later one)
      const lua = `
        local current = tonumber(redis.call('GET', KEYS[1]))
        if current == tonumber(ARGV[1]) + 1 then
          redis.call('SET', KEYS[1], ARGV[1])
          return 1
        end
        return 0
      `;
      await this.redis.client.eval(lua, 1, key, nonce);
    }
    
  4. Relayer — modify getNonce() in evm.service.ts to call acquireNonce instead of separate get + set.
  5. Relayer — remove incrementNonce() call from sendRawTransaction() — the nonce was already reserved at acquire time.
  6. Relayer — add POST /evm/address/nonce/recycle endpoint so core can release a nonce if signing fails.
  7. Core — add nonce recycling on pre-broadcast failure in withdraw.service.ts and contract-interactions.service.ts:
    const nonce = await this.relayerClient.getNonceForAddress(from, networkId);
    try {
      const tx = await signer.signTransaction({ ...params, nonce });
      return await this.relayerClient.sendRawEVMTransaction(tx, networkId);
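    } catch (err) {
      // Note: a sendRaw timeout does not prove the tx was never broadcast, so in
      // practice this catch should be narrowed to failures known to be pre-broadcast.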
      await this.relayerClient.recycleNonce(from, networkId, nonce);
      throw err;
    }
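The semantics of the three Lua scripts can be checked with a plain in-memory model before touching Redis. The sketch below is an assumption-laden illustration: `NonceReservation` is a hypothetical class, and each method corresponds to one script body that Redis would run atomically.

```typescript
// Hypothetical in-memory model of the acquire / sync / recycle scripts.
class NonceReservation {
  private next: number | null = null;

  // Mirrors syncNonceFromChain: seed when missing, or move forward
  // when the chain is ahead of the tracked value.
  sync(chainNonce: number): void {
    if (this.next === null || this.next < chainNonce) {
      this.next = chainNonce;
    }
  }

  // Mirrors acquireNonce: hand out the current value and advance by one
  // in a single step. Returns null when unseeded so the caller syncs first.
  acquire(): number | null {
    if (this.next === null) return null;
    const reserved = this.next;
    this.next = reserved + 1;
    return reserved;
  }

  // Mirrors recycleNonce: release only if no later nonce was handed out.
  recycle(nonce: number): boolean {
    if (this.next === nonce + 1) {
      this.next = nonce;
      return true;
    }
    return false;
  }
}

const reservation = new NonceReservation();
const unseeded = reservation.acquire(); // null: key does not exist yet
reservation.sync(5);
const first = reservation.acquire() as number; // 5
const second = reservation.acquire() as number; // 6
const recycledFirst = reservation.recycle(first); // false: 6 was already handed out
const recycledSecond = reservation.recycle(second); // true: 6 was the latest reservation
const third = reservation.acquire(); // 6 again

console.log({ unseeded, first, second, recycledFirst, recycledSecond, third });
```

Note that recycling nonce 5 is refused once nonce 6 has been reserved. Under this scheme a failed earlier reservation with a later one outstanding still leaves a gap, which is why periodic sync is needed as well.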
    
Pros:
  • Eliminates the TOCTOU race — nonce is atomically reserved via Lua
  • Lock-free: Redis runs each Lua script atomically, so no distributed lock is needed
  • Nonce recycling prevents gaps from pre-broadcast failures
  • Minimal architectural change — same HTTP API contract between core and relayer
  • Same pattern used by thirdweb Engine and OZ Defender in production
Cons:
  • Still a window between acquire and broadcast where the nonce is “claimed but unused” — if core crashes, the nonce is burned (mitigated by recycling + periodic sync)
  • Lua scripts add operational complexity to Redis (need to monitor script load)
  • Doesn’t solve the broader problem of stuck-in-mempool transactions
Effort: Medium (2–3 days). Mostly relayer changes + one new endpoint + try/catch in core.

Approach B: Relayer-Side Signing (Move Nonce + Sign + Send Into Single Atomic Operation)

Summary: Move transaction signing into the relayer. Core sends unsigned transaction intent (to, value, data, gasLimit). Relayer acquires nonce, fetches fee, signs, and broadcasts — all in one atomic operation with no cross-service race window. Changes required:
  1. Relayer — new POST /evm/transaction/send endpoint that accepts unsigned tx params:
    // Request: { from, to, value, data?, gasLimit?, networkId }
    // Relayer: acquireNonce → getFee → sign → broadcast (all local, atomic)
    
  2. Relayer — key management integration. The relayer needs access to signing keys. Options:
    • a) Relayer calls core’s decryption service to get the key for each tx (adds latency, key in memory briefly)
    • b) Relayer holds a hot wallet key for relayer-owned wallets only (not user wallets)
    • c) Core sends the private key along with the tx params (security risk — key transits over HTTP)
  3. Core — replace getNonce + sign + sendRaw flow with single sendTransaction(from, to, value, ...) call.
  4. Core — remove ethers.Wallet signing logic from withdraw and contract services.
Pros:
  • Completely eliminates the cross-service nonce race — nonce is acquired and consumed in the same process
  • Simplifies core’s transaction code significantly
  • Relayer can implement sophisticated retry with re-signing at new nonce/gas
  • Relayer can do estimateGas before committing the nonce (simulation-before-nonce pattern)
  • Natural place to add transaction queuing, stuck tx recovery, gas bumping
Cons:
  • Major security concern: relayer needs access to signing keys. Current architecture deliberately isolates crypto operations in signing-server and decryption-server. Moving signing to relayer breaks this security boundary.
  • Large architectural change — touching withdraw, contract-interactions, tokenization services
  • Need to handle the decryption/threshold signature flow (clientShare + masterWallet share) somehow
  • Risk of introducing new bugs during migration
  • Doesn’t work for user-initiated transactions that require client-side shares (the signer is derived from user-provided clientShare + server encryptedShare)
Effort: Large (1–2 weeks). Requires careful security review and migration of all tx flows.
Verdict: Not viable for user wallet transactions because the key derivation requires a client share that only exists during the user’s request. Could work for system-owned hot wallets (e.g., gas station, contract deployer) but would require a parallel signing path.

Approach C: Pessimistic Distributed Lock on getNonce

Summary: Wrap the entire getNonce → sign → sendRaw flow in a per-address distributed lock (Redis Redlock or simpler SET NX EX). Only one transaction per address can be in flight at a time. Changes required:
  1. Relayer — add POST /evm/address/nonce/lock and /unlock endpoints:
    // Lock: SET relayer:evm:lock:{networkId}:{address} {requestId} NX EX 30
    // Returns: { nonce, lockId }
    // Unlock: DEL if lockId matches (Lua CAS)
    
  2. Core — wrap tx flow in lock/unlock:
    const { nonce, lockId } = await this.relayerClient.lockAndGetNonce(from, networkId);
    try {
      const tx = await signer.signTransaction({ ...params, nonce });
      await this.relayerClient.sendRawEVMTransaction(tx, networkId);
    } finally {
      await this.relayerClient.unlockNonce(from, networkId, lockId);
    }
    
  3. Relayer — auto-expire locks after TTL to prevent deadlocks from crashed callers.
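The lock semantics described above (SET NX EX to acquire, compare-and-delete to release, TTL to recover from crashed callers) can be modelled in memory. This is a sketch only: `AddressLocks` is a hypothetical class, and the clock is passed in explicitly so the expiry behaviour is deterministic.

```typescript
interface LockEntry {
  owner: string;
  expiresAt: number; // milliseconds
}

// Hypothetical in-memory model of the per-address lock key.
class AddressLocks {
  private locks = new Map<string, LockEntry>();

  // Models SET key owner NX EX ttl: succeeds only if no live lock exists.
  tryLock(key: string, owner: string, ttlMs: number, now: number): boolean {
    const existing = this.locks.get(key);
    if (existing !== undefined && existing.expiresAt > now) return false;
    this.locks.set(key, { owner, expiresAt: now + ttlMs });
    return true;
  }

  // Models the compare-and-delete: only the current holder may unlock.
  unlock(key: string, owner: string, now: number): boolean {
    const existing = this.locks.get(key);
    if (existing === undefined || existing.expiresAt <= now) return false;
    if (existing.owner !== owner) return false;
    this.locks.delete(key);
    return true;
  }
}

const locks = new AddressLocks();
const lockKey = "relayer:evm:lock:1:0xabc"; // illustrative networkId and address

const aLocked = locks.tryLock(lockKey, "req-A", 30000, 0); // true
const bBlocked = locks.tryLock(lockKey, "req-B", 30000, 1000); // false: A holds the lock
const bLocked = locks.tryLock(lockKey, "req-B", 30000, 31000); // true: A's lock expired
const aUnlockLate = locks.unlock(lockKey, "req-A", 32000); // false: lock now belongs to B
const bUnlock = locks.unlock(lockKey, "req-B", 33000); // true

console.log({ aLocked, bBlocked, bLocked, aUnlockLate, bUnlock });
```

The late unlock by A is the case the lockId comparison exists for: without it, A would delete B's lock after its own had already expired.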
Pros:
  • Simple mental model — only one tx per address at a time, no races possible
  • No Lua scripts needed beyond a simple CAS for unlock
  • Easy to reason about correctness
Cons:
  • Serializes all transactions per address — massive throughput bottleneck. If a user has 3 pending withdrawals, they execute sequentially (getNonce → sign → send → wait → next). Each cycle is ~2-5 seconds minimum.
  • Lock expiry is a tradeoff: too short → lock expires during legitimate signing (especially with slow decryption); too long → stuck locks block subsequent txs
  • Distributed lock algorithms (Redlock) are complex and have known issues (see Martin Kleppmann’s critique)
  • Adds two more HTTP round-trips per transaction (lock + unlock)
  • If core crashes between lock and unlock, the address is locked until TTL expires
Effort: Medium (2–3 days). But the throughput cost is permanent.

Approach D: Optimistic Nonce with Conflict Detection and Auto-Retry

Summary: Keep the current architecture but add a nonce conflict detection layer. When sendRawTransaction fails with “nonce already used,” the relayer automatically re-fetches the nonce from chain, and core retries the entire sign-and-send flow. Changes required:
  1. Relayer — enhance error responses to return structured nonce-conflict errors:
    {
      "error": "NONCE_CONFLICT",
      "currentNonce": 7,
      "attemptedNonce": 5,
      "shouldRetry": true
    }
    
  2. Core — add retry loop around the full tx flow (not just sendRaw):
    for (let attempt = 0; attempt < MAX_NONCE_RETRIES; attempt++) {
      const nonce = await this.relayerClient.getNonceForAddress(from, networkId);
      const tx = await signer.signTransaction({ ...params, nonce });
      try {
        return await this.relayerClient.sendRawEVMTransaction(tx, networkId);
      } catch (err) {
        if (isNonceConflict(err) && attempt < MAX_NONCE_RETRIES - 1) {
          this.logger.warn(`Nonce conflict (attempt ${attempt + 1}), retrying`);
          continue;
        }
        throw err;
      }
    }
    
  3. Relayer — change getNonce() to always re-sync from chain on call (not just max(tracked, chain)) to get the freshest value after a conflict.
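The retry loop relies on an isNonceConflict check that is not defined above. One possible shape, assuming the relayer client surfaces the structured error body from step 1 as the thrown value (that behaviour and the `shouldRetryNonce` helper are assumptions, not existing code), is:

```typescript
// Shape of the structured error proposed in step 1.
interface NonceConflictError {
  error: "NONCE_CONFLICT";
  currentNonce: number;
  attemptedNonce: number;
  shouldRetry: boolean;
}

// Type guard: true only for values that match the structured error shape.
function isNonceConflict(err: unknown): err is NonceConflictError {
  if (typeof err !== "object" || err === null) return false;
  const candidate = err as Record<string, unknown>;
  return (
    candidate["error"] === "NONCE_CONFLICT" &&
    typeof candidate["currentNonce"] === "number" &&
    typeof candidate["attemptedNonce"] === "number" &&
    typeof candidate["shouldRetry"] === "boolean"
  );
}

// Hypothetical helper: retry only on a retryable conflict, and only while
// attempts remain (attempt is zero-based, as in the loop above).
function shouldRetryNonce(err: unknown, attempt: number, maxRetries: number): boolean {
  return isNonceConflict(err) && err.shouldRetry && attempt < maxRetries - 1;
}

const conflict = {
  error: "NONCE_CONFLICT",
  currentNonce: 7,
  attemptedNonce: 5,
  shouldRetry: true,
};

console.log(shouldRetryNonce(conflict, 0, 3)); // first attempt of three
console.log(shouldRetryNonce(conflict, 2, 3)); // last attempt: give up
console.log(shouldRetryNonce(new Error("timeout"), 0, 3)); // not a nonce conflict
```

Keeping the attempt bound inside the helper makes it harder for a caller to retry indefinitely when several transactions from the same address keep colliding.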
Pros:
  • Minimal changes to existing architecture
  • No Lua scripts, no distributed locks
  • Works with the existing security model (core holds keys, relayer broadcasts)
  • Handles the common case (low concurrency per address) efficiently
Cons:
  • Optimistic approach means conflicts are detected after wasting a signing round-trip
  • Under high concurrency (multiple txs from same address), retry storms can occur — N concurrent txs means up to N-1 retries each
  • Each retry requires re-deriving the signer (decryption round-trip), which is expensive
  • Doesn’t prevent nonce gaps from dropped mempool txs
  • Relies on the RPC node returning an accurate nonce, which is unreliable with load-balanced providers
Effort: Small (1–2 days). But doesn’t fully solve the problem — it papers over the race with retries.

Approach E: Hybrid (Approach A + Background Nonce Monitor)

Summary: Combine Approach A (atomic Lua reservation) with a background monitor that detects and resolves stuck transactions and nonce gaps. This is the production-grade solution used by thirdweb Engine and OZ Defender. Changes required (in addition to Approach A):
  1. Relayer — add transaction tracking table (Postgres or Redis hash):
    {
      txHash, from, networkId, nonce, gasPrice,
      submittedAt, status: 'pending' | 'confirmed' | 'dropped',
      lastCheckedAt
    }
    
  2. Relayer — background nonce health monitor (runs every 30s per active address):
    async monitorNonceHealth(address: string, networkId: string) {
      const chainNonce = await provider.getTransactionCount(address, 'latest');
      const trackedNonce = await this.getTrackedNonce(address, networkId);
      const pendingTxs = await this.getPendingTxs(address, networkId);
      
      // Detect gaps: chain says nonce=5, but we have no pending tx for nonce 5
      // Detect stuck: tx submitted > 5 min ago, receipt is null
      // Detect drift: tracked nonce diverged significantly from chain
    }
    
  3. Relayer — stuck transaction recovery:
    • If a pending tx is older than threshold (5 min): attempt gas bump (replace with +12% fee, same nonce)
    • If a pending tx is older than hard limit (30 min): cancel via self-transfer at same nonce with high gas
    • If nonce gap detected: fill with self-transfer
  4. Relayer — periodic nonce reconciliation on startup and every N minutes:
    // On startup: syncNonceFromChain() for all known addresses
    // Periodic: compare tracked vs chain, log drift, auto-correct if safe
    
  5. Relayer — nonce health metrics endpoint for observability:
    GET /evm/nonce/health
    {
      "address": "0x...",
      "trackedNonce": 42,
      "chainNonce": 40,
      "pendingCount": 2,
      "oldestPendingAge": "45s",
      "gapCount": 0
    }
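The gap and stuck checks sketched in the monitor comments can be expressed as a pure function over the chain nonce, the tracked nonce, and the pending set. This is an illustrative sketch: `findNonceIssues`, `PendingTx`, and `Finding` are hypothetical names, and it assumes every nonce in [chainNonce, trackedNonce) should have a pending transaction.

```typescript
interface PendingTx {
  nonce: number;
  submittedAt: number; // milliseconds
}

type Finding =
  | { kind: "gap"; nonce: number }
  | { kind: "stuck"; nonce: number; ageMs: number };

// Every reserved-but-unmined nonce should have a pending tx. A missing one
// is a gap; one older than the threshold is reported as stuck.
function findNonceIssues(
  chainNonce: number,
  trackedNonce: number,
  pending: PendingTx[],
  now: number,
  stuckAfterMs: number,
): Finding[] {
  const findings: Finding[] = [];
  for (let nonce = chainNonce; nonce < trackedNonce; nonce++) {
    const tx = pending.find((p) => p.nonce === nonce);
    if (tx === undefined) {
      findings.push({ kind: "gap", nonce });
    } else if (now - tx.submittedAt > stuckAfterMs) {
      findings.push({ kind: "stuck", nonce, ageMs: now - tx.submittedAt });
    }
  }
  return findings;
}

const FIVE_MINUTES_MS = 300000;

// Matches the health example: tracked=42, chain=40, two pending txs, 45s old.
const healthy = findNonceIssues(
  40,
  42,
  [
    { nonce: 40, submittedAt: 0 },
    { nonce: 41, submittedAt: 0 },
  ],
  45000,
  FIVE_MINUTES_MS,
);

// Nonce 40 was never broadcast and nonce 41 has been pending for 400s.
const unhealthy = findNonceIssues(40, 42, [{ nonce: 41, submittedAt: 0 }], 400000, FIVE_MINUTES_MS);

console.log(healthy, unhealthy);
```

Keeping the classification free of I/O makes it straightforward to unit test; the monitor would only need to fetch the three inputs and act on the findings.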
    
Pros:
  • Solves both the race condition (Lua) AND the stuck transaction problem (monitor)
  • Self-healing — the system automatically recovers from nonce gaps and stuck txs
  • Observable — metrics and health endpoint for ops
  • Battle-tested pattern — this is what production relayer systems use
  • Incremental — can deploy Approach A first, add the monitor later
Cons:
  • Most complex to implement — touches relayer storage, adds background jobs
  • Stuck tx recovery requires the relayer to sign self-transfer transactions (needs a key for the relayer’s own address, not user addresses)
  • Gas bump for stuck user transactions requires re-signing, which needs the user’s key (not available)
  • The monitor adds load on RPC nodes (getTransactionCount + getTransactionReceipt per address)
Effort: Large (1–2 weeks for full implementation). But can be staged: Approach A first (2–3 days), monitor later (3–5 days).

Comparison Matrix

Criteria                 | A: Lua Atomic | B: Relayer Signs | C: Dist Lock | D: Optimistic | E: Hybrid (A+Monitor)
Eliminates nonce race    |               |                  |              | ❌ (retries)  |
Prevents nonce gaps      | Partial       |                  |              |               |
Handles stuck txs        |               | ✅ (can re-sign) |              |               |
Preserves security model |               |                  |              |               |
Concurrent tx throughput | High          | High             | Low (serial) | Medium        | High
Implementation effort    | Medium        | Large            | Medium       | Small         | Large
Works for user wallets   |               | ❌ (needs key)   |              |               | ✅ (monitor limited)
Operational complexity   | Low           | High             | Medium       | Low           | Medium

Recommendation

Start with Approach A (Atomic Nonce Reservation) — it solves the critical race condition with moderate effort and zero architectural disruption. The key changes are:
  1. Replace getTrackedNonce + setTrackedNonce with atomic acquireNonce (Lua)
  2. Remove incrementNonce from sendRawTransaction (already reserved)
  3. Add recycleNonce endpoint for pre-broadcast failures
  4. Add try/catch with recycle in core’s tx flows
Then incrementally add Approach E’s monitor for stuck transaction detection and nonce gap recovery. The monitor is valuable but not blocking: the atomic reservation alone fixes the most common failure mode (concurrent nonce collision).
The per-address stuck transaction recovery (gas bumping, cancellation) is limited for user wallets since the relayer doesn’t hold user keys. For user wallets, the practical recovery path is: detect the stuck nonce, alert the user or system, and have core re-derive the signer to submit a replacement. This can be a Temporal workflow that periodically checks pending txs.