Logging & Observability

Structured logging, distributed tracing, and metrics for production systems. Covers the full observability stack from log formatting to alert routing.

When to Use

Activate on: "structured logging", "distributed tracing", "OpenTelemetry", "OTel", "correlation ID", "log levels", "Grafana dashboard", "alerting thresholds", "SLI SLO", "Prometheus metrics", "PagerDuty integration", "observability stack", "Winston setup", "Pino logger", "log aggregation", "Datadog", "Honeycomb"

NOT for: Performance profiling (CPU/memory flamegraphs) | Load testing | Database query optimization | Security auditing

Decision Tree: What to Log at Each Level

flowchart TD
    E[Event Occurs] --> Q1{Does it represent\na system failure?}
    Q1 -->|Yes| Q2{Is it recoverable\nwithout human?}
    Q2 -->|No| FATAL[FATAL: Service cannot\ncontinue — trigger pager]
    Q2 -->|Yes| ERROR[ERROR: Operation failed,\nwill retry or degrade]
    Q1 -->|No| Q3{Is it unexpected\nbut not failing?}
    Q3 -->|Yes| WARN[WARN: Unusual condition,\ncircuit breaker open,\ndeprecation used]
    Q3 -->|No| Q4{Is it a meaningful\nbusiness event?}
    Q4 -->|Yes| INFO[INFO: User action,\npayment processed,\nservice started]
    Q4 -->|No| Q5{Needed to debug\na specific issue?}
    Q5 -->|Yes| DEBUG[DEBUG: DB queries,\ncache hits/misses,\nfunction inputs]
    Q5 -->|No| TRACE[TRACE: Fine-grained\nloop iterations,\nOTel spans]

Rule of thumb: Production runs INFO and above. DEBUG only enabled per-service via dynamic config, never always-on in prod.

Core Patterns

Structured Log Format (JSON)

Every log line must be parseable. String concatenation is not a log.

Node.js with Pino:

import pino from 'pino';

const logger = pino({
  level: process.env.LOG_LEVEL ?? 'info',
  base: {
    service: 'payment-service',
    version: process.env.SERVICE_VERSION,
    env: process.env.NODE_ENV,
  },
  redact: {
    paths: ['req.headers.authorization', 'body.password', 'body.cardNumber', '*.ssn'],
    censor: '[REDACTED]',
  },
});

// Good: structured fields
logger.info({ orderId, userId, amountCents }, 'Payment processed');

// Bad: string interpolation
logger.info(`Payment processed for user ${userId} order ${orderId}`);

Python with structlog:

import structlog

log = structlog.get_logger()

structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.stdlib.add_log_level,
        structlog.stdlib.add_logger_name,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ]
)

# Good: key-value pairs
log.info("payment_processed", order_id=order_id, user_id=user_id, amount_cents=amount)

Correlation IDs

Every request needs a trace ID that flows through all downstream calls. This is the minimum viable distributed tracing without OTel.

// Express middleware
import { randomUUID } from 'crypto';
import { AsyncLocalStorage } from 'async_hooks';

const requestContext = new AsyncLocalStorage<{ traceId: string; spanId: string }>();

export function correlationMiddleware(req, res, next) {
  const traceId = req.headers['x-trace-id'] ?? randomUUID();
  const spanId = randomUUID().slice(0, 8);

  requestContext.run({ traceId, spanId }, () => {
    res.setHeader('x-trace-id', traceId);
    next();
  });
}

// Logger that auto-includes context
export function getLogger(name: string) {
  return {
    info: (msg: string, fields?: object) => {
      const ctx = requestContext.getStore();
      logger.info({ ...ctx, ...fields, logger: name }, msg);
    },
    // ... error, warn, debug
  };
}

OpenTelemetry Setup

See references/opentelemetry-setup.md for complete OTel collector config, SDK initialization per language, and span attribute conventions.

Minimal Node.js bootstrap:

// Must be first import in entrypoint
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  serviceName: 'payment-service',
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

Distributed Trace Propagation

sequenceDiagram
    participant C as Client
    participant GW as API Gateway
    participant SVC as Payment Service
    participant DB as Database

    C->>GW: POST /checkout<br/>(no trace header)
    Note over GW: Generate trace-id: abc123<br/>span-id: 0001
    GW->>SVC: POST /payment<br/>traceparent: 00-abc123-0001-01
    Note over SVC: Inherit trace-id: abc123<br/>New span-id: 0002
    SVC->>DB: INSERT payment<br/>traceparent: 00-abc123-0002-01
    Note over DB: Inherit trace-id: abc123<br/>New span-id: 0003
    DB-->>SVC: OK (span 0003 ends)
    SVC-->>GW: 200 OK (span 0002 ends)
    GW-->>C: 200 OK (span 0001 ends)<br/>x-trace-id: abc123

The W3C traceparent header format: 00-{traceId}-{spanId}-{flags}. Always propagate this header on every downstream HTTP call.

Reference Files

File	Contents
`references/opentelemetry-setup.md`	OTel SDK init per language, collector YAML config, span attributes, context propagation
`references/alerting-patterns.md`	SLI/SLO definitions, alert routing, PagerDuty severity mapping, alert fatigue prevention

Anti-Patterns (Shibboleths)

Anti-Pattern 1: Logging PII or Secrets in Production

Novice thinking: "I'll just log the full request body to debug this auth issue."

Why wrong: GDPR/CCPA violations carry fines up to 4% of global revenue. Secrets in logs propagate to log aggregators, S3 exports, audit trails — all places with different access controls. A single console.log(req.body) can expose thousands of user passwords in your Datadog dashboard.

Detection signature: Search your logs for password, ssn, cardNumber, authorization as field values (not keys). If any appear, you have a PII leak.

Fix — Allowlist approach:

// Never log what you don't explicitly approve
const SAFE_BODY_FIELDS = ['orderId', 'productId', 'quantity', 'currency'];

logger.info({
  body: pick(req.body, SAFE_BODY_FIELDS), // only known-safe fields
  path: req.path,
  method: req.method,
}, 'Request received');

Fix — Redaction in logger config:

// Pino's redact runs before any transport
const logger = pino({
  redact: {
    paths: [
      'req.headers.authorization',
      'req.headers.cookie',
      'body.password',
      'body.*.password',   // nested objects too
      'body.cardNumber',
      'body.ssn',
      '*.token',
      '*.secret',
    ],
    censor: '[REDACTED]',
  },
});

Shibboleth: An expert sets up redaction at logger initialization, not as a reminder comment. Redaction must be structural, not ad-hoc.

Anti-Pattern 2: Unstructured String Logs Instead of Structured JSON

Novice thinking: logger.info('User ' + userId + ' purchased ' + productId + ' for $' + amount)

Why wrong: You cannot filter, aggregate, or alert on string-interpolated data in any log aggregator. A Grafana query for amount > 1000 requires amount to be a numeric field, not embedded in a sentence. String logs are write-only — you can read them but not query them at scale.

Impact: Your 10 million daily log lines become unsearchable. MTTR (mean time to recovery) during incidents doubles because engineers grep through strings instead of filtering structured fields.

Fix:

// Bad: string log — amount is buried in text
logger.info(`User ${userId} purchased ${productId} for $${amount}`);

// Good: structured — every field is queryable
logger.info({ userId, productId, amountDollars: amount / 100 }, 'purchase_completed');

Consistent event naming: Use snake_case verb-noun event names (payment_processed, user_signed_up, order_failed) as the message string. This creates a stable vocabulary for dashboards and alerts.

Shibboleth: The log message string is for humans scanning log tails. All queryable data lives in structured fields. A logger that produces {} as its output shape is better than one that produces readable strings.

Anti-Pattern 3: Log-and-Throw (Duplicate Log Entries)

Novice thinking: Log the error, then re-throw so the caller also knows about it.

// BAD: log-and-throw
async function processPayment(orderId: string) {
  try {
    return await chargeCard(orderId);
  } catch (err) {
    logger.error({ err, orderId }, 'Payment failed'); // Logged here
    throw err; // And the caller logs it again
  }
}

async function handleCheckout(req, res) {
  try {
    await processPayment(req.body.orderId);
  } catch (err) {
    logger.error({ err }, 'Checkout failed'); // Same error logged twice
    res.status(500).json({ error: 'Checkout failed' });
  }
}

Why wrong: Every error appears 2-5 times in your logs depending on call depth. Alerting on error count becomes unreliable. Incident review is confusing — engineers think there were multiple failures. Log volume costs money (Datadog charges per ingested GB).

Fix — Log only at the boundary where you handle the error:

// Good: log only where you decide what to do with the error
async function processPayment(orderId: string) {
  // No try-catch: let errors propagate naturally
  return await chargeCard(orderId);
}

async function handleCheckout(req, res) {
  try {
    await processPayment(req.body.orderId);
    res.json({ success: true });
  } catch (err) {
    // One log, at the boundary where we're deciding to return 500
    logger.error({ err, orderId: req.body.orderId }, 'checkout_failed');
    res.status(500).json({ error: 'Checkout failed' });
  }
}

Rule: Log where you handle. Don't log where you propagate. The call stack in the error object already tells you where it originated.

Shibboleth: If you see the same traceId appear in more than two error log lines for a single request, you have a log-and-throw chain somewhere.

Quality Checklist

[ ] All log lines are JSON (no string concatenation)
[ ] Log level strategy documented: what goes at each level
[ ] PII/secrets redacted at logger config level, not call site
[ ] Correlation IDs propagated on all outbound HTTP calls
[ ] OTel SDK initialized before any other imports
[ ] Error logs include the error object (not just message)
[ ] No log-and-throw patterns in error handling
[ ] DEBUG logs use conditional guards or sampling
[ ] SLI/SLO defined for each critical user journey
[ ] Alert routing: notify vs page threshold documented
[ ] Runbook linked from every paging alert
[ ] Log retention policy set (cost vs compliance)

Output Artifacts

Logger configuration — Pino/Winston/structlog setup with redaction rules
OTel bootstrap file — SDK init with auto-instrumentation
Correlation middleware — AsyncLocalStorage request context
Prometheus metrics module — Counter/histogram/gauge definitions
Grafana dashboard JSON — Four golden signals panels
Alertmanager rules YAML — SLO-based alert definitions

When to Use​

Decision Tree: What to Log at Each Level​

Core Patterns​

Structured Log Format (JSON)​

Correlation IDs​

OpenTelemetry Setup​

Distributed Trace Propagation​

Reference Files​

Anti-Patterns (Shibboleths)​

Anti-Pattern 1: Logging PII or Secrets in Production​

Anti-Pattern 2: Unstructured String Logs Instead of Structured JSON​

Anti-Pattern 3: Log-and-Throw (Duplicate Log Entries)​

Quality Checklist​

Output Artifacts​