fix(tables): surface real error causes on cell-execution failures (diagnostics)#4868
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
PR SummaryLow Risk Overview Logging-only updates across schedule execution, workflow group cell execution, execution logging session, and pause persistence: error logs now include structured Vitest coverage for Reviewed by Cursor Bugbot for commit 332223b. Bugbot is set up for automated code reviews on this repo. Configure here. |
Greptile SummaryThis PR adds diagnostics-only logging for transient DB/network failures in cell-execution paths. The core addition is
Confidence Score: 5/5Safe to merge — all changes are purely additive to log payloads with no behaviour modifications. The change is logging-only: No files require special attention. Important Files Changed
Reviews (5): Last reviewed commit: "refactor(tables): drop in-process retry,..." | Re-trigger Greptile |
…surface error causes Workflow-group-cell runs intermittently failed on trivial DB reads/writes under heavy fan-out, stranding cells in `running`. Investigation showed the PlanetScale and ElastiCache backends were healthy at the time — the failures are transient connection-level faults that the cell (maxAttempts: 1) had no tolerance for, and the real cause was never logged (Drizzle wraps it as "Failed query: ..." and the driver cause lives in error.cause). Resilience: - Add retryTransient (lib/table/retry-transient.ts): retries only transient infra errors (reuses isRetryableInfrastructureError; adds an ioredis command-timeout match) with jittered backoff, then rethrows. Fail-fast for everything else. - Wrap the cell's getTableById/getRowById reads, the terminal write (cell-write updateRow — idempotent via the executionId guard), and the Redis cascade-lock acquire. Diagnostics: - Add describeError (lib/core/errors/retryable-infrastructure.ts): walks the .cause chain and always returns the underlying driver cause (code/errno/ syscall + causeChain), including for unclassified errors like AbortError. - Log `cause` + a `retryable` flag (and aborted/timedOut in the cell's main catch) across the cell + finalization error paths, mirroring the existing schedule-execution pattern. Logging-only; no behavior change. This lets the next recurrence reveal the real cause and whether the retry applies. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
7db8a62 to
91f00bd
Compare
|
@greptile review |
- retryTransient: re-check the abort signal after the backoff sleep so a cancellation during sleep stops the next attempt (don't run/return work for an already-cancelled task). - isRetryableRedisError: walk the .cause chain (mirroring the infra classifier) so wrapped Redis timeouts are recognized; drop "Connection is in subscriber mode" — that's a connection-state programming error, not a transient drop, and would just fail identically every retry. - cascade-lock: stop wrapping acquireLock in retryTransient. acquireLock is a non-idempotent SET NX, so retrying after a timed-out-but-applied first SET returns false (key already ours) and yields a false `contended` that skips the cascade. A transient Redis blip here just fails the run before pickup (no stranded cell); the dispatcher re-drives it. - Tests: cause-chain Redis match, subscriber-mode exclusion, abort-during-sleep. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Thanks for the reviews — addressed in 8812154: Fixed
Not a bug (won't change)
@greptile review |
The main catch logged `aborted`/`timedOut` from `abortSignal`/`timeoutController`, but those are declared inside the outer try block (the inner try around executeWorkflow is try/finally, so this catch belongs to the outer try) and are not in scope in the catch — `next build`'s type-check failed with "Cannot find name 'abortSignal'". Local incremental `tsc --noEmit` had skipped the file and falsely passed; the Cursor/Greptile reviewers flagged this correctly. Removed the two fields. Abort/timeout is still surfaced via `cause: describeError(err)` (an aborted run shows `name: 'AbortError'` / the timeout message), so no diagnostic signal is lost. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Correction on the abort-vars finding — you were right, I was wrong. I claimed it was a false positive based on a local Fixed in the latest commit by removing |
|
Greptile's P1 on |
In-process retry is the wrong layer for this path: the cell task is maxAttempts:1 by design, retrying on a possibly-degraded worker may not help, and it masks the very transient-failure signal we're trying to capture before we understand the root cause. Removed retryTransient entirely (file + all wrapping in cell-write, the cascade reads, and the lock acquire) and kept only the diagnostic logging. - Deleted lib/table/retry-transient.ts (+ test); cell-write and the cascade reads call getTableById/getRowById/updateRow directly again, fail-fast. - Kept describeError + `cause`/`retryable` fields across the cell + finalization catch blocks; the cell-path `retryable` flag now sources from isRetryableInfrastructureError (the canonical classifier) for consistency. Diagnostics-first: surface the real driver cause on the next recurrence, then decide the actual fix (e.g. task-level maxAttempts, or addressing the worker- side cause) from evidence rather than a speculative in-process retry. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Simplified per discussion: removed the in-process retry entirely (wrong layer for a |
The scheduled-job failure paths logged the raw error (.message/stack only) — its `.cause` (the real driver error behind a Drizzle "Failed query: ..." wrapper) was never recorded, and the classified-only `describeRetryableInfrastructureError` returns undefined for unrecognized errors. A real failed run (same incident window as the cell failures) failed in `applyScheduleUpdate` with exactly this unrecorded cause. Added `cause: describeError(error)` (always-on, walks the cause chain) to the applyScheduleUpdate catch, the early-failure catch, and the unhandled-error catch — passed as a second arg so the existing message+stack still emit. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`describeError` is a general-purpose error/cause-chain helper — it didn't belong in `lib/core/errors/retryable-infrastructure.ts` (that module is specifically about classifying retryable infra errors, and the name read wrong for a generic diagnostic). Moved it to `@sim/utils/errors` alongside `toError`/ `getErrorMessage`/`getPostgresErrorCode`, with its own cycle-safe cause walk. - Added describeError + DescribedError + tests to packages/utils/src/errors.ts. - Reverted the describeError addition from retryable-infrastructure.ts (it keeps only isRetryableInfrastructureError / describeRetryableInfrastructureError, which are accurately named and still used by the schedule retry path). - Re-pointed all consumers (cell, logging-session, pause-persistence, schedule) to import describeError from @sim/utils/errors. The `retryable` classification flag still sources from isRetryableInfrastructureError where used. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 332223b. Configure here.
| typeof value === 'string' ? value : undefined | ||
| const code = asString(deepest.code) | ||
| const errno = asString(deepest.errno) | ||
| const syscall = asString(deepest.syscall) |
There was a problem hiding this comment.
Numeric errno omitted from logs
Low Severity
describeError only copies errno when it is a string, but Node SystemError values typically expose errno as a number. Structured logs can then omit errno even when the driver set it, weakening the diagnostics this change is meant to add.
Reviewed by Cursor Bugbot for commit 332223b. Configure here.


Problem
workflow-group-cellruns intermittently fail on trivial DB reads/writes under heavy fan-out, leaving cells stranded inrunning. Investigation showed the PlanetScale and ElastiCache backends were healthy at the time, and the cell doesn't checkpoint — so these are transient connection-level faults. But we could never identify the actual cause: the catch blocks logtoError(error).message, which for a Drizzle error is the wrapper"Failed query: ..."— the real driver cause (ECONNRESET/ statement timeout /AbortError/…) lives inerror.cause, which is never logged (and@sim/loggerdrops the non-enumerable.cause).Approach: diagnostics-first (no retry)
We deliberately do not add in-process retry. It's the wrong layer here — the task is
maxAttempts: 1by design, retrying on a possibly-degraded worker may not help, and it masks the very signal we're trying to capture before the root cause is understood. Instead: catch, log the real cause, fail fast. Once the logs tell us what's actually happening, we decide the real fix from evidence (e.g. task-levelmaxAttempts, or addressing the worker-side cause).Changes
describeError()inlib/core/errors/retryable-infrastructure.ts— walks the.causechain (reuses the existinggetErrorChain) and always returns the underlying driver cause (code/errno/syscall+causeChain), including for unclassified errors likeAbortError.cause: describeError(err)+ aretryable: isRetryableInfrastructureError(err)classification flag across the cell + finalization catch blocks (workflow-column-execution.ts,logging-session.ts,pause-persistence.ts), mirroring the existingschedule-execution.tspattern. Logging-only; no behavior change.Testing
retryable-infrastructure.test.ts(5) — cause-chain walk, AbortError, non-Error input, depth cap. All pass.tscclean (incremental tsc had previously masked a scope error — verified with--incremental false).Notes
🤖 Generated with Claude Code