fix(tables): surface real error causes on cell-execution failures (diagnostics) by TheodoreSpeaks · Pull Request #4868 · simstudioai/sim

TheodoreSpeaks · 2026-06-03T19:55:35Z

Problem

workflow-group-cell runs intermittently fail on trivial DB reads/writes under heavy fan-out, leaving cells stranded in running. Investigation showed the PlanetScale and ElastiCache backends were healthy at the time, and the cell doesn't checkpoint — so these are transient connection-level faults. But we could never identify the actual cause: the catch blocks log toError(error).message, which for a Drizzle error is the wrapper "Failed query: ..." — the real driver cause (ECONNRESET / statement timeout / AbortError/…) lives in error.cause, which is never logged (and @sim/logger drops the non-enumerable .cause).

Approach: diagnostics-first (no retry)

We deliberately do not add in-process retry. It's the wrong layer here — the task is maxAttempts: 1 by design, retrying on a possibly-degraded worker may not help, and it masks the very signal we're trying to capture before the root cause is understood. Instead: catch, log the real cause, fail fast. Once the logs tell us what's actually happening, we decide the real fix from evidence (e.g. task-level maxAttempts, or addressing the worker-side cause).

Changes

describeError() in lib/core/errors/retryable-infrastructure.ts — walks the .cause chain (reuses the existing getErrorChain) and always returns the underlying driver cause (code/errno/syscall + causeChain), including for unclassified errors like AbortError.
Added cause: describeError(err) + a retryable: isRetryableInfrastructureError(err) classification flag across the cell + finalization catch blocks (workflow-column-execution.ts, logging-session.ts, pause-persistence.ts), mirroring the existing schedule-execution.ts pattern. Logging-only; no behavior change.

Testing

retryable-infrastructure.test.ts (5) — cause-chain walk, AbortError, non-Error input, depth cap. All pass.
biome clean; full non-incremental tsc clean (incremental tsc had previously masked a scope error — verified with --incremental false).

Notes

Diagnostics-only by design: this does not stop a transient blip from stranding a cell — it makes the cause visible so the right fix can be chosen next. Zombie cleanup of already-stranded cells is a separate follow-up.

🤖 Generated with Claude Code

vercel · 2026-06-03T19:55:41Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
docs	Ready	Preview, Comment	Jun 3, 2026 9:55pm

cursor · 2026-06-03T19:55:45Z

PR Summary

Low Risk
Observability-only logging changes; no control-flow, auth, or retry policy changes.

Overview
Adds describeError() in @sim/utils/errors to walk .cause chains and expose the deepest driver fields (code, errno, syscall) plus an optional causeChain, because loggers only see wrapper messages like Drizzle’s "Failed query: ...".

Logging-only updates across schedule execution, workflow group cell execution, execution logging session, and pause persistence: error logs now include structured cause: describeError(err) and, where relevant, retryable: isRetryableInfrastructureError(err). No retry or execution behavior changes.

Vitest coverage for describeError (wrapped driver errors, AbortError, cycles, depth cap).

^{Reviewed by Cursor Bugbot for commit 332223b. Bugbot is set up for automated code reviews on this repo. Configure here.}

greptile-apps · 2026-06-03T20:01:18Z

Greptile Summary

This PR adds diagnostics-only logging for transient DB/network failures in cell-execution paths. The core addition is describeError(), which walks the .cause chain and surfaces the deepest driver-level error fields (code/errno/syscall) that @sim/logger previously dropped because Error.prototype.cause is non-enumerable. No retry logic is added; catch blocks now emit the richer structured fields alongside the existing error message string.

retryable-infrastructure.ts: adds describeError() (reuses the existing getErrorChain helper) and the DescribedError interface; cycles are bounded by the pre-existing depth-10 guard.
retryable-infrastructure.test.ts: five new tests cover single errors, wrapped driver errors, AbortError, non-Error inputs, and cyclic cause chains.
workflow-column-execution.ts, logging-session.ts, pause-persistence.ts: each catch block now passes cause: describeError(err) and retryable: isRetryableInfrastructureError(err) as explicit structured log fields.

Confidence Score: 5/5

Safe to merge — all changes are purely additive to log payloads with no behaviour modifications.

The change is logging-only: describeError cannot throw (the test confirms this for non-Error inputs and cycles), and its output is always a plain object. Every call site passes the result as an extra field in an existing log statement; none of the surrounding error handling or control flow is altered. The five new unit tests cover the cases that matter for correctness.

No files require special attention.

Important Files Changed

Filename	Overview
apps/sim/lib/core/errors/retryable-infrastructure.ts	New `describeError()` function walks the `.cause` chain via the existing `getErrorChain` helper, reports the deepest link's diagnostics fields, and guards against cycles with a depth-10 cap. Logic is correct for all tested paths.
apps/sim/lib/core/errors/retryable-infrastructure.test.ts	New test file covers single error, wrapped driver error (ECONNRESET), AbortError, non-Error input, and cyclic cause chain — all five cases are meaningful and distinct.
apps/sim/background/workflow-column-execution.ts	Three catch blocks enriched with `cause: describeError(err)` and `retryable: isRetryableInfrastructureError(err)` fields; purely additive to log payloads, no behaviour change.
apps/sim/lib/logs/execution/logging-session.ts	Four catch blocks in `LoggingSession` now emit `cause`/`retryable` alongside the existing `error` string; pattern is consistent across all sites.
apps/sim/lib/workflows/executor/pause-persistence.ts	Two catch blocks in `handlePostExecutionPauseState` enriched with the same `cause`/`retryable` diagnostic fields; no behaviour change.

_{Reviews (5): Last reviewed commit: "refactor(tables): drop in-process retry,..." | Re-trigger Greptile}

…surface error causes Workflow-group-cell runs intermittently failed on trivial DB reads/writes under heavy fan-out, stranding cells in `running`. Investigation showed the PlanetScale and ElastiCache backends were healthy at the time — the failures are transient connection-level faults that the cell (maxAttempts: 1) had no tolerance for, and the real cause was never logged (Drizzle wraps it as "Failed query: ..." and the driver cause lives in error.cause). Resilience: - Add retryTransient (lib/table/retry-transient.ts): retries only transient infra errors (reuses isRetryableInfrastructureError; adds an ioredis command-timeout match) with jittered backoff, then rethrows. Fail-fast for everything else. - Wrap the cell's getTableById/getRowById reads, the terminal write (cell-write updateRow — idempotent via the executionId guard), and the Redis cascade-lock acquire. Diagnostics: - Add describeError (lib/core/errors/retryable-infrastructure.ts): walks the .cause chain and always returns the underlying driver cause (code/errno/ syscall + causeChain), including for unclassified errors like AbortError. - Log `cause` + a `retryable` flag (and aborted/timedOut in the cell's main catch) across the cell + finalization error paths, mirroring the existing schedule-execution pattern. Logging-only; no behavior change. This lets the next recurrence reveal the real cause and whether the retry applies. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

TheodoreSpeaks · 2026-06-03T20:06:41Z

@greptile review

- retryTransient: re-check the abort signal after the backoff sleep so a cancellation during sleep stops the next attempt (don't run/return work for an already-cancelled task). - isRetryableRedisError: walk the .cause chain (mirroring the infra classifier) so wrapped Redis timeouts are recognized; drop "Connection is in subscriber mode" — that's a connection-state programming error, not a transient drop, and would just fail identically every retry. - cascade-lock: stop wrapping acquireLock in retryTransient. acquireLock is a non-idempotent SET NX, so retrying after a timed-out-but-applied first SET returns false (key already ours) and yields a false `contended` that skips the cascade. A transient Redis blip here just fails the run before pickup (no stranded cell); the dispatcher re-drives it. - Tests: cause-chain Redis match, subscriber-mode exclusion, abort-during-sleep. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

TheodoreSpeaks · 2026-06-03T20:15:56Z

Thanks for the reviews — addressed in 8812154:

Fixed

Signal not re-checked after backoff sleep (Greptile P1 / Cursor): now re-checks signal.aborted after sleep, so a cancellation mid-backoff stops the next attempt. Added a test.
"Connection is in subscriber mode" retried (Greptile P2): removed — it's a connection-state programming error, not a transient drop. Added an exclusion test.
Redis matcher skips cause chain (Cursor): isRetryableRedisError now walks the .cause chain (depth-bounded), mirroring the infra classifier. Added a wrapped-cause test.
Lock retry → false contention (Cursor): stopped wrapping acquireLock in retryTransient. It's a non-idempotent SET NX, so retrying after a timed-out-but-applied first SET returns false (key already ours) → a false contended that skips the cascade. A transient blip there now just fails the run before pickup (no stranded cell); the dispatcher re-drives it.

Not a bug (won't change)

Catch references uninitialized abortSignal/timeoutController (Cursor, High): the catch that logs aborted/timedOut closes the inner try (opened at L663, around executeWorkflow), which begins after both bindings are declared (L657–658). Failures before that point are handled by the enclosing try, not this catch. tsc confirms the bindings are in scope and definitely assigned at that catch (no TS2304/TS2454), so there's no reachable path where they're uninitialized.

@greptile review

The main catch logged `aborted`/`timedOut` from `abortSignal`/`timeoutController`, but those are declared inside the outer try block (the inner try around executeWorkflow is try/finally, so this catch belongs to the outer try) and are not in scope in the catch — `next build`'s type-check failed with "Cannot find name 'abortSignal'". Local incremental `tsc --noEmit` had skipped the file and falsely passed; the Cursor/Greptile reviewers flagged this correctly. Removed the two fields. Abort/timeout is still surfaced via `cause: describeError(err)` (an aborted run shows `name: 'AbortError'` / the timeout message), so no diagnostic signal is lost. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

TheodoreSpeaks · 2026-06-03T20:26:17Z

Correction on the abort-vars finding — you were right, I was wrong. I claimed it was a false positive based on a local tsc --noEmit that passed, but that was incremental tsc skipping the file. CI's next build type-check correctly failed with Cannot find name 'abortSignal': the inner try around executeWorkflow is try/finally, so the logging catch belongs to the outer try, where abortSignal/timeoutController (declared inside it) are out of scope.

Fixed in the latest commit by removing aborted/timedOut from that catch. Abort/timeout is still captured via cause: describeError(err) (name: 'AbortError' / the timeout message), so no signal is lost. Thanks for catching it.

TheodoreSpeaks · 2026-06-03T20:35:32Z

Greptile's P1 on acquireLock (non-idempotent SET NX wrapped in retry) was already resolved in 8812154 — the retryTransient wrapper was removed, so it's now a plain acquireLock call. @greptile review

In-process retry is the wrong layer for this path: the cell task is maxAttempts:1 by design, retrying on a possibly-degraded worker may not help, and it masks the very transient-failure signal we're trying to capture before we understand the root cause. Removed retryTransient entirely (file + all wrapping in cell-write, the cascade reads, and the lock acquire) and kept only the diagnostic logging. - Deleted lib/table/retry-transient.ts (+ test); cell-write and the cascade reads call getTableById/getRowById/updateRow directly again, fail-fast. - Kept describeError + `cause`/`retryable` fields across the cell + finalization catch blocks; the cell-path `retryable` flag now sources from isRetryableInfrastructureError (the canonical classifier) for consistency. Diagnostics-first: surface the real driver cause on the next recurrence, then decide the actual fix (e.g. task-level maxAttempts, or addressing the worker- side cause) from evidence rather than a speculative in-process retry. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

TheodoreSpeaks · 2026-06-03T21:11:46Z

Simplified per discussion: removed the in-process retry entirely (wrong layer for a maxAttempts: 1 task, and premature before we know the root cause) and kept only the cause diagnostics. −240 lines. @greptile review

The scheduled-job failure paths logged the raw error (.message/stack only) — its `.cause` (the real driver error behind a Drizzle "Failed query: ..." wrapper) was never recorded, and the classified-only `describeRetryableInfrastructureError` returns undefined for unrecognized errors. A real failed run (same incident window as the cell failures) failed in `applyScheduleUpdate` with exactly this unrecorded cause. Added `cause: describeError(error)` (always-on, walks the cause chain) to the applyScheduleUpdate catch, the early-failure catch, and the unhandled-error catch — passed as a second arg so the existing message+stack still emit. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

`describeError` is a general-purpose error/cause-chain helper — it didn't belong in `lib/core/errors/retryable-infrastructure.ts` (that module is specifically about classifying retryable infra errors, and the name read wrong for a generic diagnostic). Moved it to `@sim/utils/errors` alongside `toError`/ `getErrorMessage`/`getPostgresErrorCode`, with its own cycle-safe cause walk. - Added describeError + DescribedError + tests to packages/utils/src/errors.ts. - Reverted the describeError addition from retryable-infrastructure.ts (it keeps only isRetryableInfrastructureError / describeRetryableInfrastructureError, which are accurately named and still used by the schedule retry path). - Re-pointed all consumers (cell, logging-session, pause-persistence, schedule) to import describeError from @sim/utils/errors. The `retryable` classification flag still sources from isRetryableInfrastructureError where used. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit 332223b. Configure here.}

cursor · 2026-06-03T21:52:39Z

+    typeof value === 'string' ? value : undefined
+  const code = asString(deepest.code)
+  const errno = asString(deepest.errno)
+  const syscall = asString(deepest.syscall)


Numeric errno omitted from logs

Low Severity

describeError only copies errno when it is a string, but Node SystemError values typically expose errno as a number. Structured logs can then omit errno even when the driver set it, weakening the diagnostics this change is meant to add.

^{Reviewed by Cursor Bugbot for commit 332223b. Configure here.}

cursor Bot reviewed Jun 3, 2026

View reviewed changes

Comment thread apps/sim/background/workflow-column-execution.ts Outdated

Comment thread apps/sim/lib/table/retry-transient.ts Outdated

greptile-apps Bot reviewed Jun 3, 2026

View reviewed changes

Comment thread apps/sim/lib/table/retry-transient.ts Outdated

Comment thread apps/sim/lib/table/retry-transient.ts Outdated

TheodoreSpeaks force-pushed the fix/cell-db-failure branch from 7db8a62 to 91f00bd Compare June 3, 2026 20:02

TheodoreSpeaks changed the base branch from main to staging June 3, 2026 20:02

vercel Bot temporarily deployed to Preview June 3, 2026 20:02 Inactive

cursor Bot reviewed Jun 3, 2026

View reviewed changes

Comment thread apps/sim/background/workflow-column-execution.ts Outdated

Comment thread apps/sim/lib/table/retry-transient.ts Outdated

Comment thread apps/sim/lib/table/retry-transient.ts Outdated

cursor Bot reviewed Jun 3, 2026

View reviewed changes

Comment thread apps/sim/lib/table/cascade-lock.ts Outdated

greptile-apps Bot reviewed Jun 3, 2026

View reviewed changes

Comment thread apps/sim/lib/table/cascade-lock.ts Outdated

vercel Bot temporarily deployed to Preview June 3, 2026 20:15 Inactive

vercel Bot temporarily deployed to Preview June 3, 2026 20:26 Inactive

vercel Bot temporarily deployed to Preview June 3, 2026 21:11 Inactive

TheodoreSpeaks changed the title ~~fix(tables): retry transient DB/Redis failures in cell execution and surface error causes~~ fix(tables): surface real error causes on cell-execution failures (diagnostics) Jun 3, 2026

cursor Bot reviewed Jun 3, 2026

View reviewed changes

Comment thread apps/sim/background/workflow-column-execution.ts

vercel Bot temporarily deployed to Preview June 3, 2026 21:34 Inactive

cursor Bot reviewed Jun 3, 2026

View reviewed changes

vercel Bot deployed to Preview June 3, 2026 21:55 View deployment

TheodoreSpeaks merged commit aed4402 into staging Jun 3, 2026
14 checks passed

waleedlatif1 mentioned this pull request Jun 3, 2026

v0.6.102: support S3-compatible in object storage, GitLab code knowledgebase connector, SSO provider ID allowlist, singleton memory leak fix #4871

Merged

waleedlatif1 deleted the fix/cell-db-failure branch June 4, 2026 16:20

Conversation

TheodoreSpeaks commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Approach: diagnostics-first (no retry)

Changes

Testing

Notes

Uh oh!

vercel Bot commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cursor Bot commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Summary

Uh oh!

Uh oh!

Uh oh!

greptile-apps Bot commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

TheodoreSpeaks commented Jun 3, 2026

Uh oh!

Uh oh!

TheodoreSpeaks commented Jun 3, 2026

Uh oh!

TheodoreSpeaks commented Jun 3, 2026

Uh oh!

TheodoreSpeaks commented Jun 3, 2026

Uh oh!

TheodoreSpeaks commented Jun 3, 2026

Uh oh!

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot Jun 3, 2026

Choose a reason for hiding this comment

Numeric errno omitted from logs

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

TheodoreSpeaks commented Jun 3, 2026 •

edited

Loading

vercel Bot commented Jun 3, 2026 •

edited

Loading

cursor Bot commented Jun 3, 2026 •

edited

Loading

greptile-apps Bot commented Jun 3, 2026 •

edited

Loading