fix(otel): make service.instance.id unique per process #4891
All app replicas shared a hardcoded service.instance.id ("mothership-sim"),
so OTel metrics from every process collapsed into one Prometheus series.
Their independent cumulative counters then interleaved, producing phantom
counter resets that corrupt rate()/increase() — staging hosted-key cost
inflated to ~$0.72 from a few cents, while no-`key` metrics (cost_charged,
throttled, queue_wait_*) were affected fleet-wide.
Append the hostname (the container id under ECS, unique per task) so each
replica gets its own series and sum(rate(...)) / sum(increase(...)) aggregate
correctly. The mothership-sim prefix is kept so Jaeger's clock-skew adjuster
still separates Sim from Go.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
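The change described above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of `instrumentation-node.ts`: the names `SERVICE_SLUG` and `buildServiceInstanceId` are hypothetical, and the `-` separator is an assumption based on the ids shown in the review diagram (`mothership-sim-abc123`). Only `hostname()` from `node:os` and the `service.instance.id` attribute key come from the PR itself.

```typescript
import { hostname } from "node:os";

// Hypothetical constant: the PR only says the id was built from a constant slug.
const SERVICE_SLUG = "mothership-sim";

// Hypothetical helper: appends the host name (the container id under ECS,
// per the PR description) so each process reports a distinct instance id.
// The "-" separator is an assumption.
function buildServiceInstanceId(slug: string, host: string = hostname()): string {
  return `${slug}-${host}`;
}

const instanceId: string = buildServiceInstanceId(SERVICE_SLUG);

// Resource attributes as plain key/value pairs. How they are handed to the
// OTel SDK is outside the scope of this sketch.
const resourceAttributes: Record<string, string> = {
  "service.instance.id": instanceId,
};

console.log(resourceAttributes);
```

Keeping the slug as a prefix means anything that matches on `mothership-sim` still works, while the host suffix makes the full id differ between replicas.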
PR Summary
Low Risk
Overview
Comments were updated to explain the per-process uniqueness requirement (cumulative counter interleaving / bad ...
Reviewed by Cursor Bugbot for commit e4aa8ae.
Greptile Summary
This PR fixes a Prometheus counter-collision bug where all Fargate replicas of the Sim app were sharing the same hardcoded ...
Confidence Score: 5/5
Safe to merge: a targeted one-line change to the OTel bootstrap with no runtime risk and clear upside in production observability.
No files require special attention.
Sequence Diagram
sequenceDiagram
participant R1 as Replica 1 (hostname: abc123)
participant R2 as Replica 2 (hostname: def456)
participant Prom as Prometheus/Grafana Cloud
Note over R1,R2: Before fix — both use service.instance.id = mothership-sim
R1->>Prom: "counter=10 {instance=mothership-sim}"
R2->>Prom: "counter=3 {instance=mothership-sim}"
Note over Prom: Series interleave → phantom reset → rate() inflated
Note over R1,R2: After fix — unique service.instance.id per container
R1->>Prom: "counter=10 {instance=mothership-sim-abc123}"
R2->>Prom: "counter=3 {instance=mothership-sim-def456}"
Note over Prom: Two clean series → sum(rate()) correct
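The effect in the diagram can be reproduced with a toy model. The sketch below assumes a simplified version of Prometheus counter handling: any drop between consecutive samples counts as a reset, and the value before the drop is added back. Real `increase()` also extrapolates to the window boundaries, which is left out here. The starting values 10 and 3 come from the diagram; the later samples and all function names are hypothetical.

```typescript
// Count drops between consecutive samples (a simplified stand-in for resets()).
function countResets(samples: readonly number[]): number {
  let resets = 0;
  let prev: number | undefined;
  for (const value of samples) {
    if (prev !== undefined && value < prev) resets++;
    prev = value;
  }
  return resets;
}

// Simplified increase(): last - first, plus the pre-drop value for every reset.
function rawIncrease(samples: readonly number[]): number {
  let first: number | undefined;
  let prev: number | undefined;
  let correction = 0;
  for (const value of samples) {
    if (first === undefined) first = value;
    if (prev !== undefined && value < prev) correction += prev;
    prev = value;
  }
  if (first === undefined || prev === undefined) return 0;
  return prev - first + correction;
}

// Alternate samples from two replicas, as happens when both write to one series.
function interleave(a: readonly number[], b: readonly number[]): number[] {
  const merged: number[] = [];
  const n = Math.max(a.length, b.length);
  for (let i = 0; i < n; i++) {
    const x = a[i];
    const y = b[i];
    if (x !== undefined) merged.push(x);
    if (y !== undefined) merged.push(y);
  }
  return merged;
}

const replica1 = [10, 11, 12]; // true increase: 2
const replica2 = [3, 4, 5]; // true increase: 2

const merged = interleave(replica1, replica2); // shared instance id
console.log("merged resets:", countResets(merged));
console.log("merged increase:", rawIncrease(merged));
console.log("per-replica sum:", rawIncrease(replica1) + rawIncrease(replica2));
```

In this model the merged series shows 3 resets and an increase of 28, while the two separate series show no resets and sum to the true increase of 4. That is the same shape of inflation the PR reports, on made-up numbers.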
Problem

Every Sim app replica reports OTel telemetry with the same hardcoded `service.instance.id` (`mothership-sim`), because `instrumentation-node.ts` builds it from a constant slug. With >1 replica (staging `app` runs 2 tasks), all replicas write to the same Prometheus series. Each process keeps its own independent cumulative counter, so the merged series interleaves values from two sources → phantom counter resets → `rate()`/`increase()` "add back" the drops and inflate.

Observed on staging (`grafanacloud-prom`):

- `resets(hosted_key_cost_charged_USD_total[40m])` = 7+ while `resets(hosted_key_used_total[...])` = 0
- `service_instance_id` / `instance` each have exactly one value despite 2 running tasks

No-`key` metrics (`cost_charged`, `throttled`, `queue_wait_*`, `queue_wait_exceeded`) collide fully; `key`-labeled ones (`used`/`failed`/`upstream`) are only probabilistically protected by the differing `key` label.

Fix

Append `hostname()` (the container id under ECS/Fargate, unique per task) to `service.instance.id`. Each replica becomes its own clean cumulative-counter series, so `sum(rate(...))` / `sum(increase(...))` aggregate correctly across replicas. The `mothership-sim` prefix is preserved so Jaeger's clock-skew adjuster still separates Sim from Go spans.

Notes

`sum(rate(...))` / `sum(increase(...))` queries become correct once instance ids are unique.

🤖 Generated with Claude Code