# Solution

# Ethereum Nonce Management — Problem Analysis & Solution Proposals

## Current Architecture

```
┌──────────────────────┐      HTTP (Basic Auth)      ┌────────────────────────┐
│  apps/core           │ ──────────────────────────▶  │  apps/relayer          │
│                      │                              │                        │
│  WithdrawService     │  1. getNonceForAddress()     │  EvmController         │
│  ContractInteractions│  2. getCurrentFeeEVM()       │       │                │
│  TokenizationService │  3. sendRawEVMTransaction()  │       ▼                │
│                      │                              │  EvmService            │
│  Signs tx locally    │                              │       │                │
│  with ethers.Wallet  │                              │       ▼                │
└──────────────────────┘                              │  EvmProviderPool       │
                                                      │       │                │
                                                      │  ┌────▼─────┐         │
                                                      │  │  Redis   │ (nonce) │
                                                      │  └──────────┘         │
                                                      │       │                │
                                                      │  ┌────▼─────┐         │
                                                      │  │  RPC Node│         │
                                                      │  └──────────┘         │
                                                      └────────────────────────┘
```

**Transaction lifecycle:**

1. `core` calls relayer's `POST /evm/address/nonce` to get the next nonce
2. `core` calls relayer's `POST /evm/getCurrentFee` to get gas prices
3. `core` signs the transaction locally with `ethers.Wallet.signTransaction()`
4. `core` calls relayer's `POST /evm/transaction/sendRaw` with the signed tx
5. Relayer broadcasts via `eth_sendRawTransaction` and increments tracked nonce in Redis

**Nonce tracking (relayer):**

* Redis key: `relayer:evm:nonce:{networkId}:{address}`
* `getNonce()`: `max(onChainNonce, redisTrackedNonce)` → persisted via `SET`
* `sendRawTransaction()` success: `INCR` on the nonce key (atomic)
* Error recovery: `DEL` the key on nonce errors, forcing re-fetch from chain

***

## Identified Problems

### Problem 1: TOCTOU Race in `getNonce()` (Critical)

**File:** `apps/relayer/src/evm/evm.service.ts:49-119`

The `getNonce` → sign → `sendRawTransaction` → `INCR` flow is **not atomic**. Two concurrent callers can get the same nonce:

```
Instance A: getNonce() → reads Redis=5, chain=5 → returns 5
Instance B: getNonce() → reads Redis=5, chain=5 → returns 5  (A hasn't sent yet)
Instance A: signs tx with nonce 5, sends → INCR → Redis=6
Instance B: signs tx with nonce 5, sends → "nonce already used" ✗
```

This is the **core bug**. The nonce is read and returned without reservation. Between the read and the subsequent `INCR` (which only happens after broadcast), another caller can obtain the same value.
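
The interleaving above can be reproduced with a small in-memory model. This is a sketch, not the relayer's code: the `Map` stands in for the Redis key, and `getNonce` / `incrementAfterBroadcast` are simplified stand-ins for the behaviour described under "Nonce tracking".

```typescript
// Hypothetical in-memory stand-in for the Redis nonce key (not the real relayer code).
const store = new Map<string, number>();
const key = "relayer:evm:nonce:1:0xabc";

// Current behaviour: read max(chain, tracked) and persist it, without reserving anything.
function getNonce(chainNonce: number): number {
  const tracked = store.get(key) ?? 0;
  const nonce = Math.max(chainNonce, tracked);
  store.set(key, nonce);
  return nonce;
}

// Runs only after a successful broadcast (the INCR step).
function incrementAfterBroadcast(): void {
  store.set(key, (store.get(key) ?? 0) + 1);
}

const chainNonce = 5;
const nonceA = getNonce(chainNonce); // Instance A reads
const nonceB = getNonce(chainNonce); // Instance B reads before A has sent
incrementAfterBroadcast();           // A broadcasts, tracked nonce becomes 6

console.log({ nonceA, nonceB, collided: nonceA === nonceB });
```

Both callers receive 5 because nothing between the read and the post-broadcast increment marks the value as taken.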

### Problem 2: Split Read/Write Across Services

`core` fetches nonce via HTTP, signs locally, then sends signed tx back. This means:

* The nonce is "in flight" for the entire duration of HTTP round-trip + signing + another HTTP round-trip
* The relayer has no knowledge that a nonce has been "claimed" until the signed tx arrives
* No way to recycle a nonce if `core` crashes between getNonce and sendRaw

### Problem 3: No Nonce Recycling on Pre-Broadcast Failures

If `core` gets nonce 5, signs a tx, but the signing fails (bad key, encoding error) or the HTTP call to `sendRaw` times out, nonce 5 is consumed in Redis (via `setTrackedNonce`) but never broadcast. This creates a **nonce gap** — all subsequent transactions queue behind a phantom nonce 5 that will never be mined.

### Problem 4: `setTrackedNonce()` Overwrites Without CAS

**File:** `apps/relayer/src/evm/evm.provider-pool.service.ts:243-250`

`setTrackedNonce` is a plain `SET`, not a compare-and-swap. Concurrent `getNonce()` calls can clobber each other:

```
A: reads chain=7, tracked=5 → calculates max=7
B: reads chain=7, tracked=5 → calculates max=7
A: SET nonce=7
B: SET nonce=7  (harmless here; but had A read an older chain value than B, A's later SET would roll the tracked nonce back)
```
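
A minimal sketch of the harmful interleaving, assuming a block is mined between the two reads so that A sees chain=7 and B sees chain=8. The variables model the tracked nonce in memory, and `setIfHigher` is a hypothetical guard with the same "only move forward" rule a compare-and-set write would enforce.

```typescript
// Illustrative in-memory model; variable and function names are assumptions.
let trackedPlain = 5;

// A observes chain=7; B observes chain=8 slightly later.
const seenByA = Math.max(7, trackedPlain);
const seenByB = Math.max(8, trackedPlain);

trackedPlain = seenByB; // B's SET lands first
trackedPlain = seenByA; // A's delayed plain SET rolls the value back to 7

// Guarded write: the tracked nonce may only move forward.
function setIfHigher(current: number, candidate: number): number {
  return candidate > current ? candidate : current;
}

let trackedGuarded = 5;
trackedGuarded = setIfHigher(trackedGuarded, seenByB);
trackedGuarded = setIfHigher(trackedGuarded, seenByA); // ignored, 7 < 8

console.log({ trackedPlain, trackedGuarded });
```

In Redis the guard has to run inside a single script or transaction; performing the comparison in application code would reintroduce the same race.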

### Problem 5: Circuit Breaker Was Tripping on Nonce Errors (Fixed in ca79d480)

The recent fix `ca79d480` corrected an issue where nonce errors (a tx-level problem) were being recorded as provider failures, tripping the circuit breaker and blocking **all** subsequent transactions on that network. This is now fixed — only connection errors trip the breaker.

### Problem 6: No Stuck Transaction Recovery

There is no mechanism to detect or recover from transactions that are:

* Stuck in mempool (gas too low for current conditions)
* Dropped by the mempool (eviction after 3hr in Geth)
* Creating nonce gaps that block subsequent transactions

***

## Solution Approaches

### Approach A: Atomic Nonce Reservation via Redis Lua Script (Recommended)

**Summary:** Replace the read-then-write nonce flow with a single atomic `acquireNonce` operation using a Redis Lua script. The nonce is reserved at read time, not after broadcast.

**Changes required:**

1. **Relayer — new Lua-based `acquireNonce()` method** in `evm.provider-pool.service.ts`:
   ```typescript
   async acquireNonce(address: string, networkId: string): Promise<number> {
     const key = this.nonceKey(address, networkId);
     const lua = `
       local current = redis.call('GET', KEYS[1])
       if current == false then return nil end
       redis.call('SET', KEYS[1], tonumber(current) + 1)
       return current
     `;
     const result = await this.redis.client.eval(lua, 1, key);
     if (result === null) {
       // First call or after reset — seed from chain
       await this.syncNonceFromChain(address, networkId);
       return this.acquireNonce(address, networkId);
     }
     return Number(result);
   }
   ```

2. **Relayer — new `syncNonceFromChain()` method** to initialize/re-seed:
   ```typescript
   async syncNonceFromChain(address: string, networkId: string): Promise<void> {
     // NOTE: `rpcUrl` is not defined in this snippet; resolve it from the network configuration first.
     const provider = await this.getProvider(networkId, rpcUrl);
     const chainNonce = await provider.getTransactionCount(address);
     const key = this.nonceKey(address, networkId);
     // Only set if key doesn't exist or chain is higher (Lua for atomicity)
     const lua = `
       local current = redis.call('GET', KEYS[1])
       if current == false or tonumber(current) < tonumber(ARGV[1]) then
         redis.call('SET', KEYS[1], ARGV[1])
       end
     `;
     await this.redis.client.eval(lua, 1, key, chainNonce);
   }
   ```

3. **Relayer — new `recycleNonce()` method** for failed-before-broadcast:
   ```typescript
   async recycleNonce(address: string, networkId: string, nonce: number): Promise<void> {
     const key = this.nonceKey(address, networkId);
     // Only recycle if it's still the "next" nonce (no one else took a later one)
     const lua = `
       local current = tonumber(redis.call('GET', KEYS[1]))
       if current == tonumber(ARGV[1]) + 1 then
         redis.call('SET', KEYS[1], ARGV[1])
         return 1
       end
       return 0
     `;
     await this.redis.client.eval(lua, 1, key, nonce);
   }
   ```

4. **Relayer — modify `getNonce()` in `evm.service.ts`** to call `acquireNonce` instead of separate get + set.

5. **Relayer — remove `incrementNonce()` call from `sendRawTransaction()`** — the nonce was already reserved at acquire time.

6. **Relayer — add `POST /evm/address/nonce/recycle` endpoint** so `core` can release a nonce if signing fails.

7. **Core — add nonce recycling on pre-broadcast failure** in `withdraw.service.ts` and `contract-interactions.service.ts`:
   ```typescript
   const nonce = await this.relayerClient.getNonceForAddress(from, networkId);
   try {
     const tx = await signer.signTransaction({ ...params, nonce });
     return await this.relayerClient.sendRawEVMTransaction(tx, networkId);
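   } catch (err) {
     // Caution: a sendRaw timeout does not prove the tx was never broadcast.
     // Recycling is only safe when the failure is known to be pre-broadcast.
     await this.relayerClient.recycleNonce(from, networkId, nonce);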
     throw err;
   }
   ```
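
The interplay of the three scripts can be checked with an in-memory model. This is a sketch under assumptions: `NonceCounter` is a hypothetical class, and each method body stands in for one atomic Lua execution. It also makes one limitation visible: only the most recently reserved nonce can be recycled.

```typescript
// In-memory model of the acquire / sync / recycle scripts; names are illustrative.
class NonceCounter {
  private next: number | null = null;

  // Mirrors syncNonceFromChain: seed the key, or move forward if the chain is ahead.
  sync(chainNonce: number): void {
    if (this.next === null || this.next < chainNonce) this.next = chainNonce;
  }

  // Mirrors acquireNonce: return the current value and reserve it by incrementing.
  acquire(chainNonce: number): number {
    if (this.next === null) this.sync(chainNonce);
    const reserved = this.next as number;
    this.next = reserved + 1;
    return reserved;
  }

  // Mirrors recycleNonce: succeeds only if no later nonce has been handed out.
  recycle(nonce: number): boolean {
    if (this.next === nonce + 1) {
      this.next = nonce;
      return true;
    }
    return false;
  }
}

const counter = new NonceCounter();
const first = counter.acquire(5);             // 5
const second = counter.acquire(5);            // 6, never the same as `first`
const staleRecycle = counter.recycle(first);  // false: 6 was reserved after 5
const freshRecycle = counter.recycle(second); // true: 6 is the latest reservation
const third = counter.acquire(5);             // 6 again, reused after recycling

console.log({ first, second, staleRecycle, freshRecycle, third });
```

A nonce that fails to recycle (here, 5) remains a gap until something else fills it, which is the case the monitor in Approach E is meant to cover.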

**Pros:**

* Eliminates the TOCTOU race — nonce is atomically reserved via Lua
* Lock-free: Redis runs each Lua script atomically on its single command thread, so no distributed lock is needed
* Nonce recycling prevents gaps from pre-broadcast failures
* Minimal architectural change: the existing core-to-relayer endpoints keep their contract, and only the recycle endpoint is new
* Same pattern used by thirdweb Engine and OZ Defender in production

**Cons:**

* Still a window between acquire and broadcast where the nonce is "claimed but unused" — if core crashes, the nonce is burned (mitigated by recycling + periodic sync)
* Lua scripts add operational complexity to Redis (need to monitor script load)
* Doesn't solve the broader problem of stuck-in-mempool transactions

**Effort:** Medium (2–3 days). Mostly relayer changes + one new endpoint + try/catch in core.

***

### Approach B: Relayer-Side Signing (Move Nonce + Sign + Send Into Single Atomic Operation)

**Summary:** Move transaction signing into the relayer. Core sends *unsigned* transaction intent (to, value, data, gasLimit). Relayer acquires nonce, fetches fee, signs, and broadcasts — all in one atomic operation with no cross-service race window.

**Changes required:**

1. **Relayer — new `POST /evm/transaction/send` endpoint** that accepts unsigned tx params:
   ```typescript
   // Request: { from, to, value, data?, gasLimit?, networkId }
   // Relayer: acquireNonce → getFee → sign → broadcast (all local, atomic)
   ```

2. **Relayer — key management integration.** The relayer needs access to signing keys. Options:
   * a) Relayer calls `core`'s decryption service to get the key for each tx (adds latency, key in memory briefly)
   * b) Relayer holds a hot wallet key for relayer-owned wallets only (not user wallets)
   * c) Core sends the private key along with the tx params (security risk — key transits over HTTP)

3. **Core — replace `getNonce + sign + sendRaw` flow** with single `sendTransaction(from, to, value, ...)` call.

4. **Core — remove ethers.Wallet signing logic** from withdraw and contract services.
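
One way to picture the single relayer-side operation from step 1 is sketched below. Everything here is hypothetical: `TxIntent`, `RelayerDeps`, and `sendFromIntent` are illustrative names, the dependencies are injected so the flow can be exercised without Redis or an RPC node, and estimating gas before reserving the nonce is an assumed ordering.

```typescript
// Hypothetical request shape, based on the fields listed in step 1.
interface TxIntent {
  from: string;
  to: string;
  value: string; // decimal wei as a string, to keep the sketch free of bigint
  data?: string;
  gasLimit?: number;
  networkId: string;
}

// Hypothetical dependencies; the real relayer services may look different.
interface RelayerDeps {
  estimateGas(intent: TxIntent): Promise<number>;
  acquireNonce(from: string, networkId: string): Promise<number>;
  signAndBroadcast(intent: TxIntent, nonce: number, gasLimit: number): Promise<string>;
}

// Simulate first so a reverting call never consumes a nonce, then reserve and send.
async function sendFromIntent(intent: TxIntent, deps: RelayerDeps): Promise<string> {
  const estimated = await deps.estimateGas(intent); // rejects before any nonce is reserved
  const gasLimit = intent.gasLimit ?? estimated;
  const nonce = await deps.acquireNonce(intent.from, intent.networkId);
  return deps.signAndBroadcast(intent, nonce, gasLimit);
}

// Demo: a reverting simulation leaves the nonce counter untouched.
const reserved: number[] = [];
const revertingDeps: RelayerDeps = {
  estimateGas: async () => {
    throw new Error("execution reverted");
  },
  acquireNonce: async () => {
    reserved.push(7);
    return 7;
  },
  signAndBroadcast: async () => "0xunused",
};

sendFromIntent({ from: "0xabc", to: "0xdef", value: "0", networkId: "1" }, revertingDeps)
  .catch((err: Error) => console.log(err.message, "| nonces reserved:", reserved.length));
```

The sketch leaves out key access entirely, which is the part that makes this approach hard to apply to user wallets.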

**Pros:**

* Completely eliminates the cross-service nonce race — nonce is acquired and consumed in the same process
* Simplifies core's transaction code significantly
* Relayer can implement sophisticated retry with re-signing at new nonce/gas
* Relayer can do `estimateGas` before committing the nonce (simulation-before-nonce pattern)
* Natural place to add transaction queuing, stuck tx recovery, gas bumping

**Cons:**

* **Major security concern**: relayer needs access to signing keys. Current architecture deliberately isolates crypto operations in `signing-server` and `decryption-server`. Moving signing to relayer breaks this security boundary.
* Large architectural change — touching withdraw, contract-interactions, tokenization services
* Need to handle the decryption/threshold signature flow (clientShare + masterWallet share) somehow
* Risk of introducing new bugs during migration
* Doesn't work for user-initiated transactions that require client-side shares (the signer is derived from user-provided `clientShare` + server `encryptedShare`)

**Effort:** Large (1–2 weeks). Requires careful security review and migration of all tx flows.

**Verdict:** Not viable for user wallet transactions because the key derivation requires a client share that only exists during the user's request. Could work for system-owned hot wallets (e.g., gas station, contract deployer) but would require a parallel signing path.

***

### Approach C: Pessimistic Distributed Lock on getNonce

**Summary:** Wrap the entire getNonce → sign → sendRaw flow in a per-address distributed lock (Redis Redlock or simpler `SET NX EX`). Only one transaction per address can be in flight at a time.

**Changes required:**

1. **Relayer — add `POST /evm/address/nonce/lock` and `/unlock` endpoints:**
   ```typescript
   // Lock: SET relayer:evm:lock:{networkId}:{address} {requestId} NX EX 30
   // Returns: { nonce, lockId }
   // Unlock: DEL if lockId matches (Lua CAS)
   ```

2. **Core — wrap tx flow in lock/unlock:**
   ```typescript
   const { nonce, lockId } = await this.relayerClient.lockAndGetNonce(from, networkId);
   try {
     const tx = await signer.signTransaction({ ...params, nonce });
     await this.relayerClient.sendRawEVMTransaction(tx, networkId);
   } finally {
     await this.relayerClient.unlockNonce(from, networkId, lockId);
   }
   ```

3. **Relayer — auto-expire locks** after TTL to prevent deadlocks from crashed callers.
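
The lock semantics in steps 1 and 3 can be modelled in memory to see how TTL expiry and owner-checked unlock interact. This is a sketch: `LockTable` is a hypothetical class, and time is passed in explicitly instead of relying on Redis expiry.

```typescript
// In-memory model of `SET key value NX EX ttl` plus an owner-checked unlock.
interface LockEntry {
  lockId: string;
  expiresAt: number;
}

class LockTable {
  private locks = new Map<string, LockEntry>();

  // Succeeds only if no unexpired lock exists (the NX + EX semantics).
  tryLock(lockKey: string, lockId: string, ttlMs: number, now: number): boolean {
    const existing = this.locks.get(lockKey);
    if (existing && existing.expiresAt > now) return false;
    this.locks.set(lockKey, { lockId, expiresAt: now + ttlMs });
    return true;
  }

  // Deletes the lock only if the caller still owns it (the compare-then-DEL step).
  unlock(lockKey: string, lockId: string, now: number): boolean {
    const existing = this.locks.get(lockKey);
    if (!existing || existing.expiresAt <= now || existing.lockId !== lockId) return false;
    this.locks.delete(lockKey);
    return true;
  }
}

const table = new LockTable();
const addressLock = "relayer:evm:lock:1:0xabc";

const gotA = table.tryLock(addressLock, "req-A", 30000, 0);          // A takes the lock
const gotB = table.tryLock(addressLock, "req-B", 30000, 1000);       // refused, A holds it
const gotBLater = table.tryLock(addressLock, "req-B", 30000, 31000); // A's lock has expired
const lateUnlockA = table.unlock(addressLock, "req-A", 32000);       // refused, B owns it now

console.log({ gotA, gotB, gotBLater, lateUnlockA });
```

The last line is the reason for the owner check: without it, a slow caller whose lock has already expired would release a lock that now belongs to someone else.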

**Pros:**

* Simple mental model — only one tx per address at a time, no races possible
* No Lua scripts needed beyond a simple CAS for unlock
* Easy to reason about correctness

**Cons:**

* **Serializes all transactions per address** — massive throughput bottleneck. If a user has 3 pending withdrawals, they execute sequentially (getNonce → sign → send → wait → next). Each cycle is \~2-5 seconds minimum.
* Lock expiry is a tradeoff: too short → lock expires during legitimate signing (especially with slow decryption); too long → stuck locks block subsequent txs
* Distributed lock algorithms (Redlock) are complex and have known issues (see Martin Kleppmann's critique)
* Adds an extra HTTP round-trip per transaction for the unlock call (two if locking is kept separate from the nonce fetch)
* If core crashes between lock and unlock, the address is locked until TTL expires

**Effort:** Medium (2–3 days). But the throughput cost is permanent.

***

### Approach D: Optimistic Nonce with Conflict Detection and Auto-Retry

**Summary:** Keep the current architecture but add a nonce conflict detection layer. When `sendRawTransaction` fails with "nonce already used," the relayer automatically re-fetches the nonce from chain, and core retries the entire sign-and-send flow.

**Changes required:**

1. **Relayer — enhance error responses** to return structured nonce-conflict errors:
   ```json
   {
     "error": "NONCE_CONFLICT",
     "currentNonce": 7,
     "attemptedNonce": 5,
     "shouldRetry": true
   }
   ```

2. **Core — add retry loop around the full tx flow** (not just sendRaw):
   ```typescript
   for (let attempt = 0; attempt < MAX_NONCE_RETRIES; attempt++) {
     const nonce = await this.relayerClient.getNonceForAddress(from, networkId);
     const tx = await signer.signTransaction({ ...params, nonce });
     try {
       return await this.relayerClient.sendRawEVMTransaction(tx, networkId);
     } catch (err) {
       if (isNonceConflict(err) && attempt < MAX_NONCE_RETRIES - 1) {
         this.logger.warn(`Nonce conflict (attempt ${attempt + 1}), retrying`);
         continue;
       }
       throw err;
     }
   }
   ```

3. **Relayer — change `getNonce()` to always re-sync from chain on call** (not just `max(tracked, chain)`) to get the freshest value after a conflict.
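
The retry loop in step 2 relies on an `isNonceConflict` check that is not shown. A minimal sketch of such a type guard follows, assuming the relayer client surfaces the structured JSON body from step 1 as the thrown value; that behaviour is an assumption about the client, not something the current code is known to do.

```typescript
// Shape taken from the structured error in step 1.
interface NonceConflictError {
  error: "NONCE_CONFLICT";
  currentNonce: number;
  attemptedNonce: number;
  shouldRetry: boolean;
}

// Hypothetical guard: narrows an unknown thrown value to the structured error.
function isNonceConflict(err: unknown): err is NonceConflictError {
  if (typeof err !== "object" || err === null) return false;
  const candidate = err as Record<string, unknown>;
  return (
    candidate["error"] === "NONCE_CONFLICT" &&
    typeof candidate["currentNonce"] === "number" &&
    typeof candidate["attemptedNonce"] === "number" &&
    typeof candidate["shouldRetry"] === "boolean"
  );
}

const conflict: unknown = {
  error: "NONCE_CONFLICT",
  currentNonce: 7,
  attemptedNonce: 5,
  shouldRetry: true,
};

console.log(isNonceConflict(conflict), isNonceConflict(new Error("timeout")));
```

Checking every field keeps unrelated failures (timeouts, connection errors) out of the retry path, so only genuine conflicts trigger a re-sign.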

**Pros:**

* Minimal changes to existing architecture
* No Lua scripts, no distributed locks
* Works with the existing security model (core holds keys, relayer broadcasts)
* Handles the common case (low concurrency per address) efficiently

**Cons:**

* Optimistic approach means conflicts are detected *after* wasting a signing round-trip
* Under high concurrency (multiple txs from same address), retry storms can occur — N concurrent txs means up to N-1 retries each
* Each retry requires re-deriving the signer (decryption round-trip), which is expensive
* Doesn't prevent nonce gaps from dropped mempool txs
* Relies on the RPC node returning an accurate nonce, which is unreliable with load-balanced providers

**Effort:** Small (1–2 days). But doesn't fully solve the problem — it papers over the race with retries.

***

### Approach E: Hybrid — Atomic Reservation + Stuck Transaction Monitor (Recommended for Production)

**Summary:** Combine Approach A (atomic Lua reservation) with a background monitor that detects and resolves stuck transactions and nonce gaps. This is the production-grade solution used by thirdweb Engine and OZ Defender.

**Changes required (in addition to Approach A):**

1. **Relayer — add transaction tracking table** (Postgres or Redis hash):
   ```
   {
     txHash, from, networkId, nonce, gasPrice,
     submittedAt, status: 'pending' | 'confirmed' | 'dropped',
     lastCheckedAt
   }
   ```

2. **Relayer — background nonce health monitor** (runs every 30s per active address):
   ```typescript
   async monitorNonceHealth(address: string, networkId: string) {
     // `provider` is assumed to be the pooled provider already resolved for `networkId`
     const chainNonce = await provider.getTransactionCount(address, 'latest');
     const trackedNonce = await this.getTrackedNonce(address, networkId);
     const pendingTxs = await this.getPendingTxs(address, networkId);
     
     // Detect gaps: chain says nonce=5, but we have no pending tx for nonce 5
     // Detect stuck: tx submitted > 5 min ago, receipt is null
     // Detect drift: tracked nonce diverged significantly from chain
   }
   ```

3. **Relayer — stuck transaction recovery:**
   * If a pending tx is older than threshold (5 min): attempt gas bump (replace with +12% fee, same nonce)
   * If a pending tx is older than hard limit (30 min): cancel via self-transfer at same nonce with high gas
   * If nonce gap detected: fill with self-transfer

4. **Relayer — periodic nonce reconciliation** on startup and every N minutes:
   ```typescript
   // On startup: syncNonceFromChain() for all known addresses
   // Periodic: compare tracked vs chain, log drift, auto-correct if safe
   ```

5. **Relayer — nonce health metrics endpoint** for observability:

   Example response for `GET /evm/nonce/health`:

   ```json
   {
     "address": "0x...",
     "trackedNonce": 42,
     "chainNonce": 40,
     "pendingCount": 2,
     "oldestPendingAge": "45s",
     "gapCount": 0
   }
   ```
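
The gap, stuck, and drift checks from step 2, together with the +12% bump from step 3, can be expressed as pure functions. This is a sketch under assumptions: `PendingTx`, `NonceHealth`, `assessNonceHealth`, and `bumpFee` are hypothetical names, and fees are plain numbers for brevity where production code would more likely use `bigint` wei values.

```typescript
// Hypothetical record shape, reduced from the tracking table in step 1.
interface PendingTx {
  txHash: string;
  nonce: number;
  submittedAtMs: number;
}

interface NonceHealth {
  gaps: number[];  // nonces in [chainNonce, trackedNonce) with no pending tx
  stuck: string[]; // hashes of txs pending longer than the threshold
  drift: number;   // trackedNonce - chainNonce
}

function assessNonceHealth(
  chainNonce: number,
  trackedNonce: number,
  pending: PendingTx[],
  nowMs: number,
  stuckAfterMs: number,
): NonceHealth {
  const pendingNonces = new Set(pending.map((tx) => tx.nonce));
  const gaps: number[] = [];
  for (let n = chainNonce; n < trackedNonce; n++) {
    if (!pendingNonces.has(n)) gaps.push(n);
  }
  const stuck = pending
    .filter((tx) => nowMs - tx.submittedAtMs > stuckAfterMs)
    .map((tx) => tx.txHash);
  return { gaps, stuck, drift: trackedNonce - chainNonce };
}

// Replacement fee for a gas bump: +12%, in integer math and rounded up.
function bumpFee(fee: number): number {
  return Math.ceil((fee * 112) / 100);
}

const health = assessNonceHealth(
  40, // chain nonce
  43, // tracked nonce
  [
    { txHash: "0xaaa", nonce: 40, submittedAtMs: 0 },
    { txHash: "0xbbb", nonce: 42, submittedAtMs: 290000 },
  ],
  400000, // now
  300000, // 5 minute stuck threshold
);

console.log(health, bumpFee(100));
```

Here nonce 41 is reported as a gap, `0xaaa` as stuck, and the drift is 3. Keeping the checks pure leaves the RPC and storage calls in the monitor itself and makes the thresholds straightforward to test.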

**Pros:**

* Solves both the race condition (Lua) AND the stuck transaction problem (monitor)
* Self-healing — the system automatically recovers from nonce gaps and stuck txs
* Observable — metrics and health endpoint for ops
* Battle-tested pattern — this is what production relayer systems use
* Incremental — can deploy Approach A first, add the monitor later

**Cons:**

* Most complex to implement — touches relayer storage, adds background jobs
* Stuck tx recovery requires the relayer to sign self-transfer transactions (needs a key for the relayer's own address, not user addresses)
* Gas bump for stuck user transactions requires re-signing, which needs the user's key (not available)
* The monitor adds load on RPC nodes (getTransactionCount + getTransactionReceipt per address)

**Effort:** Large (1–2 weeks for full implementation). But can be staged: Approach A first (2–3 days), monitor later (3–5 days).

***

## Comparison Matrix

| Criteria                 | A: Lua Atomic | B: Relayer Signs | C: Dist Lock | D: Optimistic | E: Hybrid (A+Monitor) |
| ------------------------ | :-----------: | :--------------: | :----------: | :-----------: | :-------------------: |
| Eliminates nonce race    |       ✅       |         ✅        |       ✅      |  ❌ (retries)  |           ✅           |
| Prevents nonce gaps      |    Partial    |         ✅        |       ✅      |       ❌       |           ✅           |
| Handles stuck txs        |       ❌       |  ✅ (can re-sign) |       ❌      |       ❌       |           ✅           |
| Preserves security model |       ✅       |         ❌        |       ✅      |       ✅       |           ✅           |
| Concurrent tx throughput |      High     |       High       | Low (serial) |     Medium    |          High         |
| Implementation effort    |     Medium    |       Large      |    Medium    |     Small     |         Large         |
| Works for user wallets   |       ✅       |   ❌ (needs key)  |       ✅      |       ✅       |  ✅ (monitor limited)  |
| Operational complexity   |      Low      |       High       |    Medium    |      Low      |         Medium        |

## Recommendation

**Start with Approach A (Atomic Nonce Reservation).** It solves the critical race condition with moderate effort and minimal architectural disruption. The key changes are:

1. Replace `getTrackedNonce + setTrackedNonce` with atomic `acquireNonce` (Lua)
2. Remove `incrementNonce` from `sendRawTransaction` (already reserved)
3. Add `recycleNonce` endpoint for pre-broadcast failures
4. Add try/catch with recycle in core's tx flows

**Then incrementally add Approach E's monitor** for stuck transaction detection and nonce gap recovery. The monitor is valuable but not blocking — the atomic reservation alone fixes the most common failure mode (concurrent nonce collision).

The per-address stuck transaction recovery (gas bumping, cancellation) is limited for user wallets since the relayer doesn't hold user keys. For user wallets, the practical recovery path is: detect the stuck nonce, alert the user or system, and have core re-derive the signer to submit a replacement. This can be a Temporal workflow that periodically checks pending txs.