Ethereum Nonce Management — Problem Analysis & Solution Proposals

Current Architecture

┌──────────────────────┐      HTTP (Basic Auth)      ┌────────────────────────┐
│  apps/core           │ ──────────────────────────▶  │  apps/relayer          │
│                      │                              │                        │
│  WithdrawService     │  1. getNonceForAddress()     │  EvmController         │
│  ContractInteractions│  2. getCurrentFeeEVM()       │       │                │
│  TokenizationService │  3. sendRawEVMTransaction()  │       ▼                │
│                      │                              │  EvmService            │
│  Signs tx locally    │                              │       │                │
│  with ethers.Wallet  │                              │       ▼                │
└──────────────────────┘                              │  EvmProviderPool       │
                                                      │       │                │
                                                      │  ┌────▼─────┐         │
                                                      │  │  Redis   │ (nonce) │
                                                      │  └──────────┘         │
                                                      │       │                │
                                                      │  ┌────▼─────┐         │
                                                      │  │  RPC Node│         │
                                                      │  └──────────┘         │
                                                      └────────────────────────┘

Transaction lifecycle:

1. core calls relayer’s POST /evm/address/nonce to get the next nonce
2. core calls relayer’s POST /evm/getCurrentFee to get gas prices
3. core signs the transaction locally with ethers.Wallet.signTransaction()
4. core calls relayer’s POST /evm/transaction/sendRaw with the signed tx
5. Relayer broadcasts via eth_sendRawTransaction and increments tracked nonce in Redis

Nonce tracking (relayer):

Redis key: relayer:evm:nonce:{networkId}:{address}
getNonce(): max(onChainNonce, redisTrackedNonce) → persisted via SET
sendRawTransaction() success: INCR on the nonce key (atomic)
Error recovery: DEL the key on nonce errors, forcing re-fetch from chain

Identified Problems

Problem 1: TOCTOU Race in `getNonce()` (Critical)

File: apps/relayer/src/evm/evm.service.ts:49-119

The getNonce → sign → sendRawTransaction → INCR flow is not atomic. Two concurrent callers can get the same nonce:

Instance A: getNonce() → reads Redis=5, chain=5 → returns 5
Instance B: getNonce() → reads Redis=5, chain=5 → returns 5  (A hasn't sent yet)
Instance A: signs tx with nonce 5, sends → INCR → Redis=6
Instance B: signs tx with nonce 5, sends → "nonce already used" ✗

This is the core bug. The nonce is read and returned without reservation. Between the read and the subsequent INCR (which only happens after broadcast), another caller can obtain the same value.
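
The interleaving above can be reproduced without Redis or an RPC node. The sketch below is an in-memory model, not the relayer's code: NaiveNonceTracker and its method names are hypothetical stand-ins for the getNonce() and post-broadcast INCR steps described in this document.

```typescript
// In-memory stand-in for the relayer's nonce tracking (hypothetical names).
class NaiveNonceTracker {
  private readonly tracked = new Map<string, number>();
  private readonly chainNonce: number;

  constructor(chainNonce: number) {
    this.chainNonce = chainNonce;
  }

  // Models getNonce(): max(onChainNonce, trackedNonce), persisted with a plain SET.
  // Nothing is reserved, so a second caller sees the same value.
  getNonce(address: string): number {
    const next = Math.max(this.chainNonce, this.tracked.get(address) ?? 0);
    this.tracked.set(address, next);
    return next;
  }

  // Models the INCR that runs only after a successful broadcast.
  markBroadcast(address: string): void {
    this.tracked.set(address, (this.tracked.get(address) ?? 0) + 1);
  }
}

const tracker = new NaiveNonceTracker(5);
const sender = "0xabc"; // placeholder address

const nonceA = tracker.getNonce(sender); // instance A reads
const nonceB = tracker.getNonce(sender); // instance B reads before A has sent
tracker.markBroadcast(sender);           // A broadcasts; tracked nonce advances

const collided = nonceA === nonceB;
console.log({ nonceA, nonceB, collided });
```

Both callers hold nonce 5 by the time A broadcasts, so B's transaction can only fail or replace A's.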

Problem 2: Split Read/Write Across Services

core fetches nonce via HTTP, signs locally, then sends signed tx back. This means:

The nonce is “in flight” for the entire duration of HTTP round-trip + signing + another HTTP round-trip
The relayer has no knowledge that a nonce has been “claimed” until the signed tx arrives
No way to recycle a nonce if core crashes between getNonce and sendRaw

Problem 3: No Nonce Recycling on Pre-Broadcast Failures

If core gets nonce 5 and then signing fails (bad key, encoding error) or the HTTP call to sendRaw times out, nonce 5 is consumed in Redis (via setTrackedNonce) but never broadcast. This creates a nonce gap: all subsequent transactions queue behind a phantom nonce 5 that will never be mined.

Problem 4: `setTrackedNonce()` Overwrites Without CAS

File: apps/relayer/src/evm/evm.provider-pool.service.ts:243-250

setTrackedNonce is a plain SET, not a compare-and-swap. Concurrent getNonce() calls can clobber each other:

A: reads chain=7, tracked=5 → calculates max=7
B: reads chain=7, tracked=5 → calculates max=7
A: SET nonce=7
B: SET nonce=7  (harmless here, but if timing differs, could SET a stale value)
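
One timing in which the plain SET does write a stale value: a successful broadcast increments the key between a caller's read and its write. The sketch below models that with a Map in place of Redis. The key contents and the setIfHigher helper are illustrative assumptions, not code from the relayer.

```typescript
const trackedKey = "relayer:evm:nonce:1:0xabc"; // placeholder networkId and address
const redisModel = new Map<string, number>([[trackedKey, 5]]);
const onChainNonce = 5;

// Caller A, read phase of getNonce(): computes max(chain, tracked) = 5.
const computedByA = Math.max(onChainNonce, redisModel.get(trackedKey) ?? 0);

// Meanwhile another transaction is broadcast and the key is incremented to 6.
redisModel.set(trackedKey, (redisModel.get(trackedKey) ?? 0) + 1);

// Caller A, write phase: a plain SET moves the tracked nonce back to 5.
redisModel.set(trackedKey, computedByA);
const afterPlainSet = redisModel.get(trackedKey);

// Guarded write: store the candidate only if it does not lower the value.
function setIfHigher(key: string, candidate: number): boolean {
  const current = redisModel.get(key);
  if (current !== undefined && current >= candidate) return false;
  redisModel.set(key, candidate);
  return true;
}

redisModel.set(trackedKey, 6); // restore the state from before A's stale write
const applied = setIfHigher(trackedKey, computedByA);
const afterGuardedSet = redisModel.get(trackedKey);

console.log({ afterPlainSet, applied, afterGuardedSet });
```

In Redis the guard only helps if the read and the write happen in one atomic step, which a Map cannot demonstrate; the model only shows the comparison that step has to make.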

Problem 5: Circuit Breaker Was Tripping on Nonce Errors (Fixed in ca79d480)

The recent fix ca79d480 corrected an issue where nonce errors (a tx-level problem) were being recorded as provider failures, tripping the circuit breaker and blocking all subsequent transactions on that network. This is now fixed — only connection errors trip the breaker.

Problem 6: No Stuck Transaction Recovery

There is no mechanism to detect or recover from transactions that are:

Stuck in mempool (gas too low for current conditions)
Dropped by the mempool (Geth evicts queued, non-executable transactions after 3 hours by default; see txpool.lifetime)
Creating nonce gaps that block subsequent transactions

Solution Approaches

Approach A: Atomic Nonce Reservation via Redis Lua Script (Recommended)

Summary: Replace the read-then-write nonce flow with a single atomic acquireNonce operation using a Redis Lua script. The nonce is reserved at read time, not after broadcast.

Changes required:

Relayer — new Lua-based acquireNonce() method in evm.provider-pool.service.ts:

async acquireNonce(address: string, networkId: string): Promise<number> {
  const key = this.nonceKey(address, networkId);
  const lua = `
    local current = redis.call('GET', KEYS[1])
    if current == false then return nil end
    redis.call('SET', KEYS[1], tonumber(current) + 1)
    return current
  `;
  const result = await this.redis.client.eval(lua, 1, key);
  if (result === null) {
    // First call or after reset — seed from chain
    await this.syncNonceFromChain(address, networkId);
    return this.acquireNonce(address, networkId);
  }
  return Number(result);
}

Relayer — new syncNonceFromChain() method to initialize/re-seed:

async syncNonceFromChain(address: string, networkId: string): Promise<void> {
  // NOTE: rpcUrl is not defined in this sketch; resolve it for networkId before calling getProvider.
  const provider = await this.getProvider(networkId, rpcUrl);
  const chainNonce = await provider.getTransactionCount(address);
  const key = this.nonceKey(address, networkId);
  // Only set if key doesn't exist or chain is higher (Lua for atomicity)
  const lua = `
    local current = redis.call('GET', KEYS[1])
    if current == false or tonumber(current) < tonumber(ARGV[1]) then
      redis.call('SET', KEYS[1], ARGV[1])
    end
  `;
  await this.redis.client.eval(lua, 1, key, chainNonce);
}

Relayer — new recycleNonce() method for failed-before-broadcast:

async recycleNonce(address: string, networkId: string, nonce: number): Promise<void> {
  const key = this.nonceKey(address, networkId);
  // Only recycle if it's still the "next" nonce (no one else took a later one)
  const lua = `
    local current = tonumber(redis.call('GET', KEYS[1]))
    if current == tonumber(ARGV[1]) + 1 then
      redis.call('SET', KEYS[1], ARGV[1])
      return 1
    end
    return 0
  `;
  await this.redis.client.eval(lua, 1, key, nonce);
}

Relayer — modify getNonce() in evm.service.ts to call acquireNonce instead of separate get + set.
Relayer — remove incrementNonce() call from sendRawTransaction() — the nonce was already reserved at acquire time.
Relayer — add POST /evm/address/nonce/recycle endpoint so core can release a nonce if signing fails.
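
To reason about how the three scripts interact, it can help to model their semantics without Redis. The class below is a hypothetical in-memory equivalent (InMemoryNonceStore is not part of the relayer) that applies the same conditions as the Lua snippets: seed only upward, hand out and advance in one step, and roll back only the most recently issued nonce.

```typescript
// In-memory model of the three Lua scripts, for reasoning and unit tests.
class InMemoryNonceStore {
  private readonly nonces = new Map<string, number>();

  // Models syncNonceFromChain(): seed only if missing or the chain is ahead.
  syncFromChain(key: string, chainNonce: number): void {
    const current = this.nonces.get(key);
    if (current === undefined || current < chainNonce) {
      this.nonces.set(key, chainNonce);
    }
  }

  // Models acquireNonce(): return the current value and advance in one step.
  // Returns null when the key has not been seeded yet.
  acquire(key: string): number | null {
    const current = this.nonces.get(key);
    if (current === undefined) return null;
    this.nonces.set(key, current + 1);
    return current;
  }

  // Models recycleNonce(): roll back only if `nonce` was the last one handed out.
  recycle(key: string, nonce: number): boolean {
    if (this.nonces.get(key) !== nonce + 1) return false;
    this.nonces.set(key, nonce);
    return true;
  }
}

const nonceStore = new InMemoryNonceStore();
const nonceKey = "relayer:evm:nonce:1:0xabc"; // placeholder networkId and address

const beforeSeed = nonceStore.acquire(nonceKey);     // unseeded key
nonceStore.syncFromChain(nonceKey, 5);
const first = nonceStore.acquire(nonceKey);          // hands out 5
const second = nonceStore.acquire(nonceKey);         // hands out 6
const lateRecycle = nonceStore.recycle(nonceKey, 5); // rejected: 6 was issued after 5
const lastRecycle = nonceStore.recycle(nonceKey, 6); // accepted: 6 was the latest
const reissued = nonceStore.acquire(nonceKey);       // hands out 6 again

console.log({ beforeSeed, first, second, lateRecycle, lastRecycle, reissued });
```

The rejected recycle shows a limit of the scheme: once a later nonce has been issued, an earlier failed nonce cannot be returned and leaves a gap that has to be handled some other way.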

Core — add nonce recycling on pre-broadcast failure in withdraw.service.ts and contract-interactions.service.ts:

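const nonce = await this.relayerClient.getNonceForAddress(from, networkId);
let tx: string;
try {
  tx = await signer.signTransaction({ ...params, nonce });
} catch (err) {
  // Signing failed, so nothing was broadcast: safe to release the nonce.
  await this.relayerClient.recycleNonce(from, networkId, nonce);
  throw err;
}
// Do not recycle blindly on send errors: a timeout can mean the relayer
// already broadcast the tx, and reusing that nonce would cause a collision.
return await this.relayerClient.sendRawEVMTransaction(tx, networkId);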

Pros:

Eliminates the TOCTOU race — nonce is atomically reserved via Lua
Lock-free: Redis runs each Lua script atomically (no other command executes while a script runs), so no distributed lock is needed
Nonce recycling prevents gaps from pre-broadcast failures
Minimal architectural change — same HTTP API contract between core and relayer
Same pattern used by thirdweb Engine and OZ Defender in production

Cons:

Still a window between acquire and broadcast where the nonce is “claimed but unused” — if core crashes, the nonce is burned (mitigated by recycling + periodic sync)
Lua scripts add operational complexity to Redis (need to monitor script load)
Doesn’t solve the broader problem of stuck-in-mempool transactions

Effort: Medium (2–3 days). Mostly relayer changes + one new endpoint + try/catch in core.

Approach B: Relayer-Side Signing (Move Nonce + Sign + Send Into Single Atomic Operation)

Summary: Move transaction signing into the relayer. Core sends unsigned transaction intent (to, value, data, gasLimit). Relayer acquires nonce, fetches fee, signs, and broadcasts, all in one atomic operation with no cross-service race window.

Changes required:

Relayer — new POST /evm/transaction/send endpoint that accepts unsigned tx params:

// Request: { from, to, value, data?, gasLimit?, networkId }
// Relayer: acquireNonce → getFee → sign → broadcast (all local, atomic)

Relayer — key management integration. The relayer needs access to signing keys. Options:
- a) Relayer calls core’s decryption service to get the key for each tx (adds latency, key in memory briefly)
- b) Relayer holds a hot wallet key for relayer-owned wallets only (not user wallets)
- c) Core sends the private key along with the tx params (security risk — key transits over HTTP)
Core — replace getNonce + sign + sendRaw flow with single sendTransaction(from, to, value, ...) call.
Core — remove ethers.Wallet signing logic from withdraw and contract services.

Pros:

Completely eliminates the cross-service nonce race — nonce is acquired and consumed in the same process
Simplifies core’s transaction code significantly
Relayer can implement sophisticated retry with re-signing at new nonce/gas
Relayer can do estimateGas before committing the nonce (simulation-before-nonce pattern)
Natural place to add transaction queuing, stuck tx recovery, gas bumping

Cons:

Major security concern: relayer needs access to signing keys. Current architecture deliberately isolates crypto operations in signing-server and decryption-server. Moving signing to relayer breaks this security boundary.
Large architectural change — touching withdraw, contract-interactions, tokenization services
Need to handle the decryption/threshold signature flow (clientShare + masterWallet share) somehow
Risk of introducing new bugs during migration
Doesn’t work for user-initiated transactions that require client-side shares (the signer is derived from user-provided clientShare + server encryptedShare)

Effort: Large (1–2 weeks). Requires careful security review and migration of all tx flows.

Verdict: Not viable for user wallet transactions because the key derivation requires a client share that only exists during the user’s request. Could work for system-owned hot wallets (e.g., gas station, contract deployer) but would require a parallel signing path.

Approach C: Pessimistic Distributed Lock on getNonce

Summary: Wrap the entire getNonce → sign → sendRaw flow in a per-address distributed lock (Redis Redlock or simpler SET NX EX). Only one transaction per address can be in flight at a time.

Changes required:

Relayer — add POST /evm/address/nonce/lock and /unlock endpoints:

// Lock: SET relayer:evm:lock:{networkId}:{address} {requestId} NX EX 30
// Returns: { nonce, lockId }
// Unlock: DEL if lockId matches (Lua CAS)

Core — wrap tx flow in lock/unlock:

const { nonce, lockId } = await this.relayerClient.lockAndGetNonce(from, networkId);
try {
  const tx = await signer.signTransaction({ ...params, nonce });
  await this.relayerClient.sendRawEVMTransaction(tx, networkId);
} finally {
  await this.relayerClient.unlockNonce(from, networkId, lockId);
}

Relayer — auto-expire locks after TTL to prevent deadlocks from crashed callers.
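
The reason unlock must check the lockId can be shown with a small in-memory model of SET NX EX plus a compare-and-delete. LockTable and the request IDs below are hypothetical, and time is passed in explicitly so the expiry behaviour is deterministic. UNLOCK_LUA is the usual Redis compare-and-delete script, included for reference only and not executed here.

```typescript
// Reference only: delete the lock key if it still holds the caller's lockId.
const UNLOCK_LUA = `
  if redis.call('GET', KEYS[1]) == ARGV[1] then
    return redis.call('DEL', KEYS[1])
  end
  return 0
`;

interface LockEntry {
  owner: string;
  expiresAt: number; // epoch milliseconds
}

class LockTable {
  private readonly locks = new Map<string, LockEntry>();

  // Models SET key owner NX EX ttl: succeeds only if no unexpired lock exists.
  acquire(key: string, owner: string, ttlMs: number, now: number): boolean {
    const existing = this.locks.get(key);
    if (existing !== undefined && existing.expiresAt > now) return false;
    this.locks.set(key, { owner, expiresAt: now + ttlMs });
    return true;
  }

  // Models the compare-and-delete: only the current owner may release.
  release(key: string, owner: string, now: number): boolean {
    const existing = this.locks.get(key);
    if (existing === undefined || existing.expiresAt <= now) return false;
    if (existing.owner !== owner) return false;
    this.locks.delete(key);
    return true;
  }
}

const locks = new LockTable();
const lockKey = "relayer:evm:lock:1:0xabc"; // placeholder networkId and address
const TTL_MS = 30000;

const gotA = locks.acquire(lockKey, "req-A", TTL_MS, 0);          // A takes the lock
const gotB = locks.acquire(lockKey, "req-B", TTL_MS, 1000);       // B is blocked
const gotBLater = locks.acquire(lockKey, "req-B", TTL_MS, 31000); // A's lock has expired
const staleUnlock = locks.release(lockKey, "req-A", 32000);       // A's late unlock is ignored
const heldByB = !locks.acquire(lockKey, "req-C", TTL_MS, 33000);  // B still holds the lock
const unlockB = locks.release(lockKey, "req-B", 34000);

console.log({ gotA, gotB, gotBLater, staleUnlock, heldByB, unlockB });
```

Without the owner check, A's late unlock at t=32s would delete B's lock and let a third caller in while B is still signing.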

Pros:

Simple mental model — only one tx per address at a time, no races possible
No Lua scripts needed beyond a simple CAS for unlock
Easy to reason about correctness

Cons:

Serializes all transactions per address — massive throughput bottleneck. If a user has 3 pending withdrawals, they execute sequentially (getNonce → sign → send → wait → next). Each cycle is ~2-5 seconds minimum.
Lock expiry is a tradeoff: too short → lock expires during legitimate signing (especially with slow decryption); too long → stuck locks block subsequent txs
Distributed lock algorithms (Redlock) are complex and have known issues (see Martin Kleppmann’s critique)
Adds two more HTTP round-trips per transaction (lock + unlock)
If core crashes between lock and unlock, the address is locked until TTL expires

Effort: Medium (2–3 days). But the throughput cost is permanent.

Approach D: Optimistic Nonce with Conflict Detection and Auto-Retry

Summary: Keep the current architecture but add a nonce conflict detection layer. When sendRawTransaction fails with “nonce already used,” the relayer automatically re-fetches the nonce from chain, and core retries the entire sign-and-send flow.

Changes required:

Relayer — enhance error responses to return structured nonce-conflict errors:

{
  "error": "NONCE_CONFLICT",
  "currentNonce": 7,
  "attemptedNonce": 5,
  "shouldRetry": true
}

Core — add retry loop around the full tx flow (not just sendRaw):

for (let attempt = 0; attempt < MAX_NONCE_RETRIES; attempt++) {
  const nonce = await this.relayerClient.getNonceForAddress(from, networkId);
  const tx = await signer.signTransaction({ ...params, nonce });
  try {
    return await this.relayerClient.sendRawEVMTransaction(tx, networkId);
  } catch (err) {
    if (isNonceConflict(err) && attempt < MAX_NONCE_RETRIES - 1) {
      this.logger.warn(`Nonce conflict (attempt ${attempt + 1}), retrying`);
      continue;
    }
    throw err;
  }
}

Relayer — change getNonce() to always re-sync from chain on call (not just max(tracked, chain)) to get the freshest value after a conflict.
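
The retry loop calls isNonceConflict without defining it. A minimal sketch of such a guard follows, assuming core's HTTP client surfaces the relayer's JSON error body as the thrown value; that behaviour, and the NonceConflictError type name, are assumptions rather than something this document specifies.

```typescript
// Shape of the structured error proposed for the relayer response.
interface NonceConflictError {
  error: "NONCE_CONFLICT";
  currentNonce: number;
  attemptedNonce: number;
  shouldRetry: boolean;
}

// Type guard: true only when the value has every field of the structured error.
function isNonceConflict(err: unknown): err is NonceConflictError {
  if (typeof err !== "object" || err === null) return false;
  const candidate = err as Record<string, unknown>;
  return (
    candidate.error === "NONCE_CONFLICT" &&
    typeof candidate.currentNonce === "number" &&
    typeof candidate.attemptedNonce === "number" &&
    typeof candidate.shouldRetry === "boolean"
  );
}

const structured: unknown = {
  error: "NONCE_CONFLICT",
  currentNonce: 7,
  attemptedNonce: 5,
  shouldRetry: true,
};

const matchesStructured = isNonceConflict(structured);
const matchesPlainError = isNonceConflict(new Error("nonce too low"));
const matchesNull = isNonceConflict(null);

console.log({ matchesStructured, matchesPlainError, matchesNull });
```

A caller would typically also check shouldRetry before looping, since the guard only confirms the shape of the error.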

Pros:

Minimal changes to existing architecture
No Lua scripts, no distributed locks
Works with the existing security model (core holds keys, relayer broadcasts)
Handles the common case (low concurrency per address) efficiently

Cons:

Optimistic approach means conflicts are detected after wasting a signing round-trip
Under high concurrency (multiple txs from same address), retry storms can occur — N concurrent txs means up to N-1 retries each
Each retry requires re-deriving the signer (decryption round-trip), which is expensive
Doesn’t prevent nonce gaps from dropped mempool txs
Relies on the RPC node returning an accurate nonce, which is unreliable with load-balanced providers

Effort: Small (1–2 days). But doesn’t fully solve the problem — it papers over the race with retries.

Approach E: Hybrid — Atomic Reservation + Stuck Transaction Monitor (Recommended for Production)

Summary: Combine Approach A (atomic Lua reservation) with a background monitor that detects and resolves stuck transactions and nonce gaps. This is the production-grade solution used by thirdweb Engine and OZ Defender.

Changes required (in addition to Approach A):

Relayer — add transaction tracking table (Postgres or Redis hash):

{
  txHash, from, networkId, nonce, gasPrice,
  submittedAt, status: 'pending' | 'confirmed' | 'dropped',
  lastCheckedAt
}

Relayer — background nonce health monitor (runs every 30s per active address):

async monitorNonceHealth(address: string, networkId: string) {
  const chainNonce = await provider.getTransactionCount(address, 'latest');
  const trackedNonce = await this.getTrackedNonce(address, networkId);
  const pendingTxs = await this.getPendingTxs(address, networkId);
  
  // Detect gaps: chain says nonce=5, but we have no pending tx for nonce 5
  // Detect stuck: tx submitted > 5 min ago, receipt is null
  // Detect drift: tracked nonce diverged significantly from chain
}
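
The three checks named in the comments can be written as a pure function over data the monitor has already fetched. This is a sketch under assumptions: assessNonceHealth, TrackedTx and NonceHealth are hypothetical names, a gap is taken to mean a reserved nonce at or above the chain nonce with no pending transaction, and timestamps are epoch milliseconds.

```typescript
type TxStatus = "pending" | "confirmed" | "dropped";

interface TrackedTx {
  nonce: number;
  submittedAt: number; // epoch milliseconds
  status: TxStatus;
}

interface NonceHealth {
  gaps: number[];  // reserved nonces with no pending tx
  stuck: number[]; // nonces of pending txs older than the threshold
  drift: number;   // trackedNonce - chainNonce
}

function assessNonceHealth(
  chainNonce: number,
  trackedNonce: number,
  txs: TrackedTx[],
  now: number,
  stuckAfterMs: number
): NonceHealth {
  // Only pending txs at or above the chain nonce can still be mined.
  const pending = txs.filter((tx) => tx.status === "pending" && tx.nonce >= chainNonce);
  const pendingNonces = new Set(pending.map((tx) => tx.nonce));

  const gaps: number[] = [];
  for (let n = chainNonce; n < trackedNonce; n++) {
    if (!pendingNonces.has(n)) gaps.push(n);
  }

  const stuck = pending
    .filter((tx) => now - tx.submittedAt > stuckAfterMs)
    .map((tx) => tx.nonce)
    .sort((a, b) => a - b);

  return { gaps, stuck, drift: trackedNonce - chainNonce };
}

const FIVE_MINUTES_MS = 5 * 60 * 1000;

// Two recent pending txs cover nonces 40 and 41: nothing to repair.
const healthy = assessNonceHealth(
  40,
  42,
  [
    { nonce: 40, submittedAt: 0, status: "pending" },
    { nonce: 41, submittedAt: 0, status: "pending" },
  ],
  45000,
  FIVE_MINUTES_MS
);

// Nonce 5 was reserved but never broadcast; nonce 6 has been pending for 6 minutes.
const unhealthy = assessNonceHealth(
  5,
  8,
  [
    { nonce: 6, submittedAt: 0, status: "pending" },
    { nonce: 7, submittedAt: 300000, status: "pending" },
  ],
  360000,
  FIVE_MINUTES_MS
);

console.log(healthy, unhealthy);
```

Keeping the classification separate from the RPC and storage calls makes the thresholds easy to unit test.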

Relayer — stuck transaction recovery:
- If a pending tx is older than threshold (5 min): attempt gas bump (replace with +12% fee, same nonce; Geth rejects replacements that raise the fee by less than its default 10% price bump)
- If a pending tx is older than hard limit (30 min): cancel via self-transfer at same nonce with high gas
- If nonce gap detected: fill with self-transfer
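
The +12% replacement fee is easiest to get right with integer arithmetic on wei values. The helper below is an illustrative sketch: bumpFee and bumpEip1559Fees are hypothetical names, rounding up is a choice made here so the result never falls below the requested percentage, and for EIP-1559 transactions both fee fields are bumped.

```typescript
interface Eip1559Fees {
  maxFeePerGas: bigint;
  maxPriorityFeePerGas: bigint;
}

// ceil(fee * (100 + percent) / 100) using bigint arithmetic, so no precision is lost.
function bumpFee(feeWei: bigint, percent: bigint): bigint {
  return (feeWei * (100n + percent) + 99n) / 100n;
}

// Bump both fee fields by the same percentage for a same-nonce replacement.
function bumpEip1559Fees(fees: Eip1559Fees, percent: bigint): Eip1559Fees {
  return {
    maxFeePerGas: bumpFee(fees.maxFeePerGas, percent),
    maxPriorityFeePerGas: bumpFee(fees.maxPriorityFeePerGas, percent),
  };
}

// 30 gwei max fee and 1.5 gwei tip, bumped by 12%.
const replacementFees = bumpEip1559Fees(
  { maxFeePerGas: 30000000000n, maxPriorityFeePerGas: 1500000000n },
  12n
);

console.log(replacementFees.maxFeePerGas, replacementFees.maxPriorityFeePerGas);
```

As noted under Cons, the replacement still has to be signed with the sending address's key, so this only applies where that key is available.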

Relayer — periodic nonce reconciliation on startup and every N minutes:

// On startup: syncNonceFromChain() for all known addresses
// Periodic: compare tracked vs chain, log drift, auto-correct if safe

Relayer — nonce health metrics endpoint for observability:

GET /evm/nonce/health
{
  "address": "0x...",
  "trackedNonce": 42,
  "chainNonce": 40,
  "pendingCount": 2,
  "oldestPendingAge": "45s",
  "gapCount": 0
}

Pros:

Solves both the race condition (Lua) AND the stuck transaction problem (monitor)
Self-healing — the system automatically recovers from nonce gaps and stuck txs
Observable — metrics and health endpoint for ops
Battle-tested pattern — this is what production relayer systems use
Incremental — can deploy Approach A first, add the monitor later

Cons:

Most complex to implement — touches relayer storage, adds background jobs
Stuck tx recovery requires the relayer to sign self-transfer transactions (needs a key for the relayer’s own address, not user addresses)
Gas bump for stuck user transactions requires re-signing, which needs the user’s key (not available)
The monitor adds load on RPC nodes (getTransactionCount + getTransactionReceipt per address)

Effort: Large (1–2 weeks for full implementation). But can be staged: Approach A first (2–3 days), monitor later (3–5 days).

Comparison Matrix

Criteria	A: Lua Atomic	B: Relayer Signs	C: Dist Lock	D: Optimistic	E: Hybrid (A+Monitor)
Eliminates nonce race	✅	✅	✅	❌ (retries)	✅
Prevents nonce gaps	Partial	✅	✅	❌	✅
Handles stuck txs	❌	✅ (can re-sign)	❌	❌	✅
Preserves security model	✅	❌	✅	✅	✅
Concurrent tx throughput	High	High	Low (serial)	Medium	High
Implementation effort	Medium	Large	Medium	Small	Large
Works for user wallets	✅	❌ (needs key)	✅	✅	✅ (monitor limited)
Operational complexity	Low	High	Medium	Low	Medium

Recommendation

Start with Approach A (Atomic Nonce Reservation). It solves the critical race condition with moderate effort and minimal architectural disruption: the HTTP contract between core and relayer stays the same apart from one new recycle endpoint. The key changes are:

Replace getTrackedNonce + setTrackedNonce with atomic acquireNonce (Lua)
Remove incrementNonce from sendRawTransaction (already reserved)
Add recycleNonce endpoint for pre-broadcast failures
Add try/catch with recycle in core’s tx flows

Then incrementally add Approach E’s monitor for stuck transaction detection and nonce gap recovery. The monitor is valuable but not blocking: the atomic reservation alone fixes the most common failure mode (concurrent nonce collision).

The per-address stuck transaction recovery (gas bumping, cancellation) is limited for user wallets since the relayer doesn’t hold user keys. For user wallets, the practical recovery path is: detect the stuck nonce, alert the user or system, and have core re-derive the signer to submit a replacement. This can be a Temporal workflow that periodically checks pending txs.
