Ethereum Nonce Management — Problem Analysis & Solution Proposals
Current Architecture
Transaction flow:
- core calls relayer’s POST /evm/address/nonce to get the next nonce
- core calls relayer’s POST /evm/getCurrentFee to get gas prices
- core signs the transaction locally with ethers.Wallet.signTransaction()
- core calls relayer’s POST /evm/transaction/sendRaw with the signed tx
- Relayer broadcasts via eth_sendRawTransaction and increments tracked nonce in Redis

Nonce tracking:
- Redis key: relayer:evm:nonce:{networkId}:{address}
- getNonce(): max(onChainNonce, redisTrackedNonce) → persisted via SET
- sendRawTransaction() success: INCR on the nonce key (atomic)
- Error recovery: DEL the key on nonce errors, forcing re-fetch from chain
Identified Problems
Problem 1: TOCTOU Race in getNonce() (Critical)
File: apps/relayer/src/evm/evm.service.ts:49-119
The getNonce → sign → sendRawTransaction → INCR flow is not atomic, so two concurrent callers can get the same nonce. Until the first caller's transaction reaches INCR (which only happens after broadcast), another caller can obtain the same value.
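The race can be reproduced with a small in-memory model. This is an illustrative sketch only: NonceStore, getNonce, and onBroadcast are hypothetical stand-ins for the Redis-backed logic described above, not the relayer's actual code.

```typescript
// In-memory stand-in for the Redis-tracked nonce (hypothetical, for illustration).
class NonceStore {
  private tracked = new Map<string, number>();

  // Mirrors the current read path: max(onChainNonce, trackedNonce), persisted via a plain SET.
  getNonce(address: string, onChainNonce: number): number {
    const next = Math.max(onChainNonce, this.tracked.get(address) ?? 0);
    this.tracked.set(address, next);
    return next;
  }

  // Mirrors the INCR that only runs after a successful broadcast.
  onBroadcast(address: string): void {
    this.tracked.set(address, (this.tracked.get(address) ?? 0) + 1);
  }
}

const store = new NonceStore();
const addr = "0xabc"; // placeholder address

// Two callers fetch a nonce before either has broadcast: both receive 5.
const nonceA = store.getNonce(addr, 5);
const nonceB = store.getNonce(addr, 5);

// Caller A broadcasts, which is too late for caller B. Only later callers see 6.
store.onBroadcast(addr);
const nonceC = store.getNonce(addr, 5);
```

Callers A and B end up signing with the same nonce, so at most one of their transactions can be mined.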
Problem 2: Split Read/Write Across Services
core fetches nonce via HTTP, signs locally, then sends signed tx back. This means:
- The nonce is “in flight” for the entire duration of HTTP round-trip + signing + another HTTP round-trip
- The relayer has no knowledge that a nonce has been “claimed” until the signed tx arrives
- No way to recycle a nonce if core crashes between getNonce and sendRaw
Problem 3: No Nonce Recycling on Pre-Broadcast Failures
If core gets nonce 5 and then signing fails (bad key, encoding error) or the HTTP call to sendRaw times out, nonce 5 is consumed in Redis (via setTrackedNonce) but never broadcast. This creates a nonce gap: all subsequent transactions queue behind a phantom nonce 5 that will never be mined.
Problem 4: setTrackedNonce() Overwrites Without CAS
File: apps/relayer/src/evm/evm.provider-pool.service.ts:243-250
setTrackedNonce is a plain SET, not a compare-and-swap, so concurrent getNonce() calls can clobber each other's writes.
Problem 5: Circuit Breaker Was Tripping on Nonce Errors (Fixed in ca79d480)
The recent fix ca79d480 corrected an issue where nonce errors (a tx-level problem) were being recorded as provider failures, tripping the circuit breaker and blocking all subsequent transactions on that network. This is now fixed: only connection errors trip the breaker.
Problem 6: No Stuck Transaction Recovery
There is no mechanism to detect or recover from transactions that are:
- Stuck in mempool (gas too low for current conditions)
- Dropped by the mempool (eviction after 3hr in Geth)
- Creating nonce gaps that block subsequent transactions
Solution Approaches
Approach A: Atomic Nonce Reservation via Redis Lua Script (Recommended)
Summary: Replace the read-then-write nonce flow with a single atomic acquireNonce operation using a Redis Lua script. The nonce is reserved at read time, not after broadcast.
Changes required:
- Relayer: new Lua-based acquireNonce() method in evm.provider-pool.service.ts
- Relayer: new syncNonceFromChain() method to initialize/re-seed
- Relayer: new recycleNonce() method for failed-before-broadcast
- Relayer: modify getNonce() in evm.service.ts to call acquireNonce instead of separate get + set.
- Relayer: remove incrementNonce() call from sendRawTransaction(), since the nonce was already reserved at acquire time.
- Relayer: add POST /evm/address/nonce/recycle endpoint so core can release a nonce if signing fails.
- Core: add nonce recycling on pre-broadcast failure in withdraw.service.ts and contract-interactions.service.ts
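A minimal sketch of the reservation logic, under stated assumptions. The Lua source, the key layout (a tracked-nonce key plus a sorted set of recycled nonces), and the InMemoryNonces class are illustrative, not the project's implementation. The class mirrors the script's semantics in memory so the behaviour can be checked without Redis.

```typescript
// Lua source a relayer could load with EVAL/EVALSHA (illustrative only).
// KEYS[1] = tracked nonce key, KEYS[2] = sorted set of recycled nonces,
// ARGV[1] = on-chain nonce fetched by the caller.
const ACQUIRE_NONCE_LUA = `
local recycled = redis.call('ZPOPMIN', KEYS[2])
if #recycled > 0 then return tonumber(recycled[1]) end
local tracked = tonumber(redis.call('GET', KEYS[1]))
local nonce = tonumber(ARGV[1])
if tracked and tracked > nonce then nonce = tracked end
redis.call('SET', KEYS[1], nonce + 1)
return nonce
`;

// In-memory model with the same semantics as the script above.
class InMemoryNonces {
  private next = new Map<string, number>();
  private recycled = new Map<string, number[]>();

  // Reserve a nonce at read time: reuse a recycled one if present, else advance the counter.
  acquireNonce(address: string, onChainNonce: number): number {
    const pool = this.recycled.get(address) ?? [];
    if (pool.length > 0) {
      pool.sort((a, b) => a - b);
      return pool.shift() as number; // lowest released nonce first
    }
    const nonce = Math.max(onChainNonce, this.next.get(address) ?? 0);
    this.next.set(address, nonce + 1);
    return nonce;
  }

  // Hand back a nonce that was reserved but never broadcast.
  recycleNonce(address: string, nonce: number): void {
    const pool = this.recycled.get(address) ?? [];
    if (!pool.includes(nonce)) pool.push(nonce);
    this.recycled.set(address, pool);
  }
}

const nonces = new InMemoryNonces();
const first = nonces.acquireNonce("0xabc", 5);  // 5
const second = nonces.acquireNonce("0xabc", 5); // 6
nonces.recycleNonce("0xabc", first);            // signing failed, release 5
const third = nonces.acquireNonce("0xabc", 5);  // 5 again, so no gap is left behind
```

One design point this sketch leaves open: a recycled nonce can go stale if the chain nonce moves past it, so a periodic sync would presumably also need to drop recycled entries below the on-chain value.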
Pros:
- Eliminates the TOCTOU race: the nonce is atomically reserved via Lua
- Lock-free: Redis executes Lua scripts atomically on a single thread, so no distributed lock is needed
- Nonce recycling prevents gaps from pre-broadcast failures
- Minimal architectural change: same HTTP API contract between core and relayer
- Same pattern used by thirdweb Engine and OZ Defender in production

Cons:
- Still a window between acquire and broadcast where the nonce is “claimed but unused”: if core crashes, the nonce is burned (mitigated by recycling + periodic sync)
- Lua scripts add operational complexity to Redis (need to monitor script load)
- Doesn’t solve the broader problem of stuck-in-mempool transactions
Approach B: Relayer-Side Signing (Move Nonce + Sign + Send Into Single Atomic Operation)
Summary: Move transaction signing into the relayer. Core sends unsigned transaction intent (to, value, data, gasLimit). Relayer acquires nonce, fetches fee, signs, and broadcasts, all in one atomic operation with no cross-service race window.
Changes required:
- Relayer: new POST /evm/transaction/send endpoint that accepts unsigned tx params
- Relayer: key management integration. The relayer needs access to signing keys. Options:
  - a) Relayer calls core’s decryption service to get the key for each tx (adds latency, key in memory briefly)
  - b) Relayer holds a hot wallet key for relayer-owned wallets only (not user wallets)
  - c) Core sends the private key along with the tx params (security risk: key transits over HTTP)
- Core: replace getNonce + sign + sendRaw flow with single sendTransaction(from, to, value, ...) call.
- Core: remove ethers.Wallet signing logic from withdraw and contract services.
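As a rough illustration of what the unsigned intent could look like, the sketch below defines a hypothetical request body and a validator. The field names follow the parameters listed above. networkId, the string encodings, and validateIntent itself are assumptions made for this example.

```typescript
// Hypothetical request body for POST /evm/transaction/send.
interface UnsignedTxIntent {
  networkId: string;  // assumed: the relayer already keys nonces by networkId
  from: string;
  to: string;
  value: string;      // amount in wei as a decimal string (assumed encoding)
  data?: string;      // 0x-prefixed calldata, optional for plain transfers
  gasLimit?: string;  // decimal string, optional if the relayer estimates gas
}

const ADDRESS_RE = /^0x[0-9a-fA-F]{40}$/;
const HEX_DATA_RE = /^0x([0-9a-fA-F]{2})*$/;
const UINT_RE = /^(0|[1-9][0-9]*)$/;

// Returns a list of validation problems; an empty list means the intent is well-formed.
function validateIntent(tx: UnsignedTxIntent): string[] {
  const errors: string[] = [];
  if (tx.networkId.length === 0) errors.push("networkId is required");
  if (!ADDRESS_RE.test(tx.from)) errors.push("from must be a 20-byte hex address");
  if (!ADDRESS_RE.test(tx.to)) errors.push("to must be a 20-byte hex address");
  if (!UINT_RE.test(tx.value)) errors.push("value must be a non-negative integer string");
  if (tx.data !== undefined && !HEX_DATA_RE.test(tx.data)) {
    errors.push("data must be 0x-prefixed hex with an even number of digits");
  }
  if (tx.gasLimit !== undefined && !UINT_RE.test(tx.gasLimit)) {
    errors.push("gasLimit must be a non-negative integer string");
  }
  return errors;
}
```

Validating before the nonce is acquired keeps malformed requests from consuming a reservation.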
Pros:
- Completely eliminates the cross-service nonce race: nonce is acquired and consumed in the same process
- Simplifies core’s transaction code significantly
- Relayer can implement sophisticated retry with re-signing at new nonce/gas
- Relayer can do estimateGas before committing the nonce (simulation-before-nonce pattern)
- Natural place to add transaction queuing, stuck tx recovery, gas bumping

Cons:
- Major security concern: relayer needs access to signing keys. Current architecture deliberately isolates crypto operations in signing-server and decryption-server. Moving signing to relayer breaks this security boundary.
- Large architectural change: touching withdraw, contract-interactions, tokenization services
- Need to handle the decryption/threshold signature flow (clientShare + masterWallet share) somehow
- Risk of introducing new bugs during migration
- Doesn’t work for user-initiated transactions that require client-side shares (the signer is derived from user-provided clientShare + server encryptedShare)
Approach C: Pessimistic Distributed Lock on getNonce
Summary: Wrap the entire getNonce → sign → sendRaw flow in a per-address distributed lock (Redis Redlock or simpler SET NX EX). Only one transaction per address can be in flight at a time.
Changes required:
- Relayer: add POST /evm/address/nonce/lock and /unlock endpoints
- Core: wrap tx flow in lock/unlock
- Relayer: auto-expire locks after TTL to prevent deadlocks from crashed callers.
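The lock semantics can be sketched with an in-memory model of SET NX EX plus a compare-and-delete unlock. AddressLocks and its method signatures are hypothetical. The nowMs parameter is injected only to make the example deterministic, since Redis handles expiry itself.

```typescript
// Illustrative in-memory model of a per-address lock with TTL.
interface LockEntry {
  token: string;
  expiresAt: number;
}

class AddressLocks {
  private locks = new Map<string, LockEntry>();

  // Equivalent in spirit to: SET lock:{address} {token} NX EX {ttlSeconds}
  lock(address: string, token: string, ttlSeconds: number, nowMs: number): boolean {
    const held = this.locks.get(address);
    if (held && held.expiresAt > nowMs) return false; // still held by someone
    this.locks.set(address, { token, expiresAt: nowMs + ttlSeconds * 1000 });
    return true;
  }

  // Compare-and-delete: only the current holder's token may release the lock.
  unlock(address: string, token: string, nowMs: number): boolean {
    const held = this.locks.get(address);
    if (!held || held.expiresAt <= nowMs || held.token !== token) return false;
    this.locks.delete(address);
    return true;
  }
}

const locks = new AddressLocks();
const gotA = locks.lock("0xabc", "caller-A", 30, 0);            // true
const gotB = locks.lock("0xabc", "caller-B", 30, 1_000);        // false, A still holds it
const lateB = locks.lock("0xabc", "caller-B", 30, 31_000);      // true, A's lock expired
const staleUnlock = locks.unlock("0xabc", "caller-A", 32_000);  // false, A no longer owns it
```

The last two lines show the TTL hazard: if caller A is still signing at 31 s, caller B proceeds anyway, and A's later unlock is rejected rather than releasing B's lock.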
Pros:
- Simple mental model: only one tx per address at a time, no races possible
- No Lua scripts needed beyond a simple CAS for unlock
- Easy to reason about correctness

Cons:
- Serializes all transactions per address, which is a massive throughput bottleneck. If a user has 3 pending withdrawals, they execute sequentially (getNonce → sign → send → wait → next). Each cycle is ~2-5 seconds minimum.
- Lock expiry is a tradeoff: too short → lock expires during legitimate signing (especially with slow decryption); too long → stuck locks block subsequent txs
- Distributed lock algorithms (Redlock) are complex and have known issues (see Martin Kleppmann’s critique)
- Adds two more HTTP round-trips per transaction (lock + unlock)
- If core crashes between lock and unlock, the address is locked until TTL expires
Approach D: Optimistic Nonce with Conflict Detection and Auto-Retry
Summary: Keep the current architecture but add a nonce conflict detection layer. When sendRawTransaction fails with “nonce already used,” the relayer automatically re-fetches the nonce from chain, and core retries the entire sign-and-send flow.
Changes required:
- Relayer: enhance error responses to return structured nonce-conflict errors
- Core: add retry loop around the full tx flow (not just sendRaw)
- Relayer: change getNonce() to always re-sync from chain on call (not just max(tracked, chain)) to get the freshest value after a conflict.
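A minimal sketch of the core-side retry loop, assuming the relayer surfaces conflicts as an object with code "NONCE_CONFLICT". That error shape, isNonceConflict, and sendWithNonceRetry are hypothetical names chosen for this example.

```typescript
// Assumed shape of the relayer's structured nonce-conflict error.
interface NonceConflict {
  code: "NONCE_CONFLICT";
  address: string;
}

function isNonceConflict(err: unknown): err is NonceConflict {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { code?: unknown }).code === "NONCE_CONFLICT"
  );
}

// Retries the whole getNonce -> sign -> sendRaw flow, because a new nonce
// invalidates the previous signature. Other errors are rethrown immediately.
async function sendWithNonceRetry<T>(
  attempt: () => Promise<T>,
  maxAttempts: number = 3,
): Promise<T> {
  let lastError: unknown = new Error("maxAttempts must be at least 1");
  for (let i = 0; i < maxAttempts; i++) {
    try {
      return await attempt();
    } catch (err) {
      if (!isNonceConflict(err)) throw err;
      lastError = err;
    }
  }
  throw lastError;
}
```

A production version would likely add jittered backoff between attempts to dampen the retry storms described under this approach's drawbacks. It is omitted here to keep the sketch short.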
Pros:
- Minimal changes to existing architecture
- No Lua scripts, no distributed locks
- Works with the existing security model (core holds keys, relayer broadcasts)
- Handles the common case (low concurrency per address) efficiently

Cons:
- Optimistic approach means conflicts are detected after wasting a signing round-trip
- Under high concurrency (multiple txs from same address), retry storms can occur: N concurrent txs means up to N-1 retries each
- Each retry requires re-deriving the signer (decryption round-trip), which is expensive
- Doesn’t prevent nonce gaps from dropped mempool txs
- Relies on the RPC node returning an accurate nonce, which is unreliable with load-balanced providers
Approach E: Hybrid — Atomic Reservation + Stuck Transaction Monitor (Recommended for Production)
Summary: Combine Approach A (atomic Lua reservation) with a background monitor that detects and resolves stuck transactions and nonce gaps. This is the production-grade solution used by thirdweb Engine and OZ Defender.
Changes required (in addition to Approach A):
- Relayer: add transaction tracking table (Postgres or Redis hash)
- Relayer: background nonce health monitor (runs every 30s per active address)
- Relayer: stuck transaction recovery:
  - If a pending tx is older than threshold (5 min): attempt gas bump (replace with +12% fee, same nonce)
  - If a pending tx is older than hard limit (30 min): cancel via self-transfer at same nonce with high gas
  - If nonce gap detected: fill with self-transfer
- Relayer: periodic nonce reconciliation on startup and every N minutes
- Relayer: nonce health metrics endpoint for observability
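The monitor's decisions reduce to a few pure functions, sketched below. The function names and signatures are assumptions made for illustration. The thresholds (5 min, 30 min, +12%) are the ones listed above.

```typescript
// Bump a fee by a percentage using integer math, rounding up so the
// replacement is never below the intended increase.
function bumpFee(feeWei: bigint, percent: bigint = 12n): bigint {
  return (feeWei * (100n + percent) + 99n) / 100n;
}

type StuckAction = "none" | "bump" | "cancel";

// Decide what to do with a pending tx based on its age.
function classifyPendingTx(
  ageMs: number,
  bumpAfterMs: number = 5 * 60 * 1000,
  cancelAfterMs: number = 30 * 60 * 1000,
): StuckAction {
  if (ageMs >= cancelAfterMs) return "cancel";
  if (ageMs >= bumpAfterMs) return "bump";
  return "none";
}

// Nonces in [onChainNonce, nextReserved) with no tracked pending tx are gaps
// that would block every later transaction from the same address.
function findNonceGaps(
  onChainNonce: number,
  nextReserved: number,
  pendingNonces: number[],
): number[] {
  const pending = new Set(pendingNonces);
  const gaps: number[] = [];
  for (let n = onChainNonce; n < nextReserved; n++) {
    if (!pending.has(n)) gaps.push(n);
  }
  return gaps;
}

const bumped = bumpFee(1000000000n);            // 1 gwei -> 1120000000n
const gaps = findNonceGaps(5, 9, [5, 7]);       // [6, 8]
const action = classifyPendingTx(6 * 60 * 1000); // "bump"
```

For EIP-1559 transactions, nodes such as Geth generally expect both maxFeePerGas and maxPriorityFeePerGas to be raised for a replacement to be accepted, so bumpFee would be applied to each field.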
Pros:
- Solves both the race condition (Lua) AND the stuck transaction problem (monitor)
- Self-healing: the system automatically recovers from nonce gaps and stuck txs
- Observable: metrics and health endpoint for ops
- Battle-tested pattern: this is what production relayer systems use
- Incremental: can deploy Approach A first, add the monitor later

Cons:
- Most complex to implement: touches relayer storage, adds background jobs
- Stuck tx recovery requires the relayer to sign self-transfer transactions (needs a key for the relayer’s own address, not user addresses)
- Gas bump for stuck user transactions requires re-signing, which needs the user’s key (not available)
- The monitor adds load on RPC nodes (getTransactionCount + getTransactionReceipt per address)
Comparison Matrix
| Criteria | A: Lua Atomic | B: Relayer Signs | C: Dist Lock | D: Optimistic | E: Hybrid (A+Monitor) |
|---|---|---|---|---|---|
| Eliminates nonce race | ✅ | ✅ | ✅ | ❌ (retries) | ✅ |
| Prevents nonce gaps | Partial | ✅ | ✅ | ❌ | ✅ |
| Handles stuck txs | ❌ | ✅ (can re-sign) | ❌ | ❌ | ✅ |
| Preserves security model | ✅ | ❌ | ✅ | ✅ | ✅ |
| Concurrent tx throughput | High | High | Low (serial) | Medium | High |
| Implementation effort | Medium | Large | Medium | Small | Large |
| Works for user wallets | ✅ | ❌ (needs key) | ✅ | ✅ | ✅ (monitor limited) |
| Operational complexity | Low | High | Medium | Low | Medium |
Recommendation
Start with Approach A (Atomic Nonce Reservation). It solves the critical race condition with moderate effort and zero architectural disruption. The key changes are:
- Replace getTrackedNonce + setTrackedNonce with atomic acquireNonce (Lua)
- Remove incrementNonce from sendRawTransaction (already reserved)
- Add recycleNonce endpoint for pre-broadcast failures
- Add try/catch with recycle in core’s tx flows
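The core-side try/catch could look roughly like the sketch below. NonceClient and signAndSend are hypothetical names, and the relayer methods are assumed to wrap the HTTP endpoints described earlier. The sketch recycles only when signing fails, because a failed or timed-out sendRaw call may still have reached the network.

```typescript
// Assumed thin client over the relayer's HTTP API.
interface NonceClient {
  acquireNonce(address: string): Promise<number>;
  recycleNonce(address: string, nonce: number): Promise<void>;
  sendRaw(signedTx: string): Promise<string>;
}

async function signAndSend(
  relayer: NonceClient,
  address: string,
  sign: (nonce: number) => Promise<string>,
): Promise<string> {
  const nonce = await relayer.acquireNonce(address);
  let signedTx: string;
  try {
    signedTx = await sign(nonce);
  } catch (err) {
    // Nothing was broadcast, so the reservation can be handed back safely.
    await relayer.recycleNonce(address, nonce);
    throw err;
  }
  // From here on the tx may have reached the network; do not recycle blindly.
  return relayer.sendRaw(signedTx);
}
```

Whether a sendRaw timeout should also trigger a recycle is a judgment call: doing so safely would probably require first checking whether the transaction hash is known to the node.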