Pync — Architecture Decision Records

One-page reference for each major technical decision. For future team members, founders, and diligence.


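Index

  • ADR-001: Railway over AWS through 25,000 users
  • ADR-002: Strangler pattern for Spring backend migration
  • ADR-003: Go for new backend services
  • ADR-004: TimescaleDB for time-series telemetry
  • ADR-005: HiveMQ Cloud for MQTT broker
  • ADR-006: Self-hosted LGTM for observability
  • ADR-007: Clerk for user authentication
  • ADR-008: Device authentication via bootstrap secret + JWT rotation
  • ADR-009: Stripe multi-entity instead of Paddle
  • ADR-010: C for Apollo3 firmware
  • ADR-011: On-device TFLM via Ambiq neuralSPOT
  • ADR-012: PostHog for product analytics and feature flags
  • ADR-013: Memfault for firmware observability
  • ADR-014: GitHub Actions + Railway native deploys through year 1
  • ADR-015: REST for mobile API, gRPC for internal services
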
ADR-001: Railway over AWS through 25,000 users

Status: Approved

Context: Product targets 5,000 users in 6 months, scaling toward 25,000 in year one. AWS is the industry default, but Railway offers better DX and cost at small-to-medium scale. Product workload (web services, Postgres, Redis, MQTT) is within Railway’s capability. Railway Metal Singapore is close to the pilot market (Thailand).

Decision: Production runs on Railway Metal Singapore through 25,000 users. Migrate to AWS only when a specific requirement emerges: HIPAA compliance, AWS IoT Core fleet scale, or multi-region complexity.

Consequences

  • Lower operational overhead; no VPC/IAM investment
  • ~3-5x lower infrastructure cost at current scale
  • Better egress economics than AWS (AWS base rate: $0.09/GB in NA)
  • Railway’s IaC and observability tooling less mature than AWS
  • Migration cost grows with service count; plan portability from day one

Alternatives Considered

  • AWS from day one: operational tax, cost, and hiring constraints outweigh current needs
  • GCP: viable, smaller ecosystem, no specific advantage
  • Fly.io: interesting but less mature managed Postgres and more opinionated architecture
  • Self-managed Kubernetes: operational cost not justified at team size

ADR-002: Strangler pattern for Spring backend migration

Status: Approved

Context: Inherited Java/Spring Boot backend built by Chinese contractor team. Works in pilot but has technical debt (two ORMs, Sa-Token auth, unclear dependency tree). A rewrite would cost months of lost velocity with no product value. CTO does not want to hire Spring engineers in NYC.

Decision: Spring monolith continues to serve production traffic. New features and high-volume paths are implemented as Go services. Carve out Spring functionality over 12-18 months in priority order. User accounts and billing remain in Spring indefinitely.
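
A minimal sketch of the routing rule this decision implies, assuming a path-prefix router sits in front of both stacks. The upstream names and prefixes are hypothetical; the actual carve-out order is set per feature and is not specified here.

```python
from dataclasses import dataclass

# Hypothetical upstream name for the inherited backend.
SPRING = "spring-monolith"


@dataclass(frozen=True)
class Route:
    prefix: str
    upstream: str


# Paths already carved out to Go services (illustrative only).
CARVED_OUT = (
    Route("/v1/telemetry", "go-telemetry"),
    Route("/v1/devices", "go-device-mgmt"),
)


def pick_upstream(path: str) -> str:
    """Return the upstream for a request path; default to the Spring monolith."""
    # Longest prefix first so a specific carve-out overrides a broader one.
    for route in sorted(CARVED_OUT, key=lambda r: len(r.prefix), reverse=True):
        if path == route.prefix or path.startswith(route.prefix + "/"):
            return route.upstream
    return SPRING
```

Anything not explicitly carved out keeps going to Spring, so each carve-out is a one-line change that can be reverted by removing the route.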

Consequences

  • No productivity loss from rewrite
  • Team can be built around preferred stack (Go, Python, TS) rather than Spring hiring
  • Dual-stack complexity during transition — mitigated by clear API boundary
  • Some Spring code will live in production for years — accept and manage
  • Observability must cover both stacks from day one

Alternatives Considered

  • Full rewrite in Go: 3-6 months of stopped feature work; unacceptable
  • Full embrace of Spring and hire Java engineers: contradicts CTO preference and NYC hiring pool signal
  • Microservices extraction to smaller Spring services: doesn’t solve hiring problem

ADR-003: Go for new backend services

Status: Approved

Context: Need language for new backend services in strangler pattern. CTO has production experience with Go at scale ($7B startup). NYC senior engineering pool skews Python/TS/Go. Product workload (MQTT ingest, telemetry pipeline, device management) fits Go’s concurrency model.

Decision: Go is the default for all new backend services unless a specific reason dictates otherwise (Python for ML-adjacent, TS for BFF or shared-type boundaries).

Consequences

  • Hiring pool strong in NYC
  • Operational simplicity (static binaries, low resource usage)
  • gRPC natural for internal service-to-service
  • Go data-access libraries (GORM, ent, sqlc) less ergonomic than Spring Data; use sqlc with raw SQL
  • Complex domain modeling requires more scaffolding than in JVM ecosystem

Alternatives Considered

  • Rust: mature on backend but hiring pool thinner, overkill for CRUD workloads
  • Node/TypeScript: viable but single-threaded event loop (a GIL-like bottleneck for CPU-bound work) and dependency hell concerns at scale
  • Python: fine for ML, not recommended for primary API layer at 25k+ users
  • Java (Spring Boot): rejected in ADR-002 rationale

ADR-004: TimescaleDB for time-series telemetry

Status: Approved

Context: Product generates significant time-series data: IMU activity events, location pings, device health, ML inference outputs. At 5,000 devices with periodic sync, 10M+ writes/day. Vanilla Postgres handles the volume, but queries degrade without partitioning. A separate time-series DB adds operational overhead.
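
The figures above can be checked with a short back-of-envelope calculation. This sketch assumes writes are spread evenly across devices and across the day; periodic sync is bursty, so real peaks will be higher than these averages.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def write_rates(devices: int, writes_per_day: int) -> dict:
    """Average write rates implied by a daily write volume."""
    per_device_per_day = writes_per_day / devices
    return {
        "per_device_per_day": per_device_per_day,
        "seconds_between_writes_per_device": SECONDS_PER_DAY / per_device_per_day,
        "fleet_writes_per_second": writes_per_day / SECONDS_PER_DAY,
    }


# 5,000 devices and 10M writes/day, as stated in the Context.
rates = write_rates(devices=5_000, writes_per_day=10_000_000)
for name, value in rates.items():
    print(f"{name}: {value:.1f}")
```

On these numbers each device averages 2,000 writes per day (one roughly every 43 seconds) and the fleet averages about 116 writes per second.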

Decision: TimescaleDB extension on Postgres for telemetry tables. Managed via Timescale Cloud (Singapore region). Operational Postgres (users, subscriptions) co-located or separate; evaluate based on vendor options.

Consequences

  • SQL-compatible, no new query language
  • Automatic partitioning, columnar compression, continuous aggregates
  • Single database technology reduces ops surface
  • Railway’s managed Postgres doesn’t include Timescale extension — use Timescale Cloud
  • Minor vendor lock-in to Timescale-specific features (hypertables, compression policies)

Alternatives Considered

  • InfluxDB: separate system, different query language, weaker SQL story
  • ClickHouse: excellent for analytics but operational overhead and write-pattern mismatch for IoT telemetry
  • Vanilla Postgres with manual partitioning: works but requires DIY tooling for retention and compression
  • AWS Timestream: ties to AWS, different API, locked in

ADR-005: HiveMQ Cloud for MQTT broker

Status: Approved

Context: Dock speaks MQTT to backend (persistent connection, bidirectional, for backend-to-dock commands like “ring the beeper”). At 5,000-25,000 docks, managed broker preferred over self-hosted.

Decision: HiveMQ Cloud (APAC region) for MQTT brokering. MQTT 5 support, strong auth and ACL model, reasonable pricing at target scale.
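
To illustrate what a per-dock ACL could look like, here is a sketch of standard MQTT topic-filter matching ("+" for one level, "#" for all remaining levels) applied to a hypothetical topic layout. The "docks/<id>/..." topic names are assumptions, and this is not HiveMQ Cloud’s configuration format; it only shows the matching semantics an ACL relies on.

```python
def topic_matches(topic_filter: str, topic: str) -> bool:
    """MQTT-style filter match: '+' is one level, '#' is all remaining levels."""
    f_levels = topic_filter.split("/")
    t_levels = topic.split("/")
    for i, f in enumerate(f_levels):
        if f == "#":
            # '#' is only valid as the last level; it also matches the parent.
            return i == len(f_levels) - 1
        if i >= len(t_levels):
            return False
        if f != "+" and f != t_levels[i]:
            return False
    return len(f_levels) == len(t_levels)


def dock_acl(dock_id: str) -> dict:
    """Hypothetical ACL: a dock publishes telemetry and receives its own commands."""
    return {
        "publish": [f"docks/{dock_id}/telemetry/#"],
        "subscribe": [f"docks/{dock_id}/commands/+"],
    }


def is_allowed(dock_id: str, action: str, topic: str) -> bool:
    return any(topic_matches(f, topic) for f in dock_acl(dock_id)[action])


print(is_allowed("d1", "subscribe", "docks/d1/commands/beep"))
print(is_allowed("d1", "subscribe", "docks/d2/commands/beep"))
```

Scoping every filter to the dock’s own ID keeps one compromised dock from reading or sending another dock’s commands. The sketch assumes dock IDs never contain "/", "+", or "#".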

Consequences

  • Managed operations; focus engineering on product
  • Best-in-class MQTT 5 support
  • Commercial German company, broad geographic regions
  • Vendor cost scales with connections — reasonable at 5k-25k
  • Alternative brokers readily available if needed (EMQX, NATS MQTT, AWS IoT)

Alternatives Considered

  • EMQX: strong product but Chinese-origin company concern given contractor exit strategy
  • NATS with MQTT adapter: consolidates messaging but requires self-host and ops
  • AWS IoT Core: ties to AWS, per-message pricing adds up
  • Mosquitto: doesn’t scale; fine for dev only
  • Cedalo Pro Mosquitto: wrong use case (edge broker product, not cloud)

ADR-006: Self-hosted LGTM for observability

Status: Approved

Context: Observability generates significant egress; Datadog at full pricing is expensive. CTO has production experience running LGTM stack (Loki, Grafana, Tempo, Mimir). Railway’s private networking means zero egress cost for observability traffic between services and self-hosted backend.

Decision: Self-host LGTM on Railway. OpenTelemetry Collector as aggregation layer. All services emit OTLP over private networking. Datadog credits ($100k available) applied surgically: synthetic monitoring, Database Monitoring (DBM) for Timescale, occasional debugging.

Consequences

  • No egress cost for observability data
  • No per-GB or per-host pricing cliff
  • Operational ownership of observability stack — acceptable given CTO experience
  • OpenTelemetry instrumentation keeps backend choice portable
  • More tuning required (Loki storage, Mimir retention, Tempo sampling) as volume grows

Alternatives Considered

  • Full Datadog on credits: attractive short-term; Railway egress and post-credit cost make it unsustainable
  • Grafana Cloud managed: good free tier but egress costs still apply, cost grows with scale
  • Honeycomb for tracing + Loki for logs: split-vendor complexity, higher combined cost

ADR-007: Clerk for user authentication

Status: Approved

Context: Existing backend uses Sa-Token (Chinese auth library, sparse English docs, hard to hire for). Replacing with managed identity provider required as part of contractor exit. CTO has used Supabase Auth and Better Auth before.

Decision: Clerk for user authentication. React Native-native SDK, supports email/password, social (Apple, Google), passwordless, MFA. Replace Sa-Token incrementally: new services validate Clerk JWTs, Spring continues Sa-Token validation for existing sessions, migrate users on re-authentication.
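
During the overlap period something has to decide which validator a bearer token goes to. A minimal sketch, assuming existing Sa-Token sessions use opaque (non-JWT) tokens; that is an assumption about the current configuration, and the validator names are hypothetical. The function only chooses a path. Signature verification against Clerk’s JWKS still has to happen in the chosen validator.

```python
import base64
import json


def looks_like_jwt(token: str) -> bool:
    """True if the token has three segments and a JSON header with an 'alg' field."""
    parts = token.split(".")
    if len(parts) != 3 or not all(parts):
        return False
    try:
        padded = parts[0] + "=" * (-len(parts[0]) % 4)
        header = json.loads(base64.urlsafe_b64decode(padded))
    except ValueError:
        # Covers bad base64, bad UTF-8, and bad JSON (all ValueError subclasses).
        return False
    return isinstance(header, dict) and "alg" in header


def pick_validator(bearer_token: str) -> str:
    """Route JWT-shaped tokens to Clerk validation, everything else to Spring."""
    return "clerk-jwks" if looks_like_jwt(bearer_token) else "spring-sa-token"
```

Because the check is shape-based, a malformed or forged JWT still lands on the Clerk path and is rejected there, rather than falling through to the legacy validator.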

Consequences

  • Modern DX, fast to ship
  • Strong React Native support (primary mobile platform)
  • Vendor dependency; Clerk pricing scales with MAU
  • JWT-based, JWKS endpoint for validation in any backend language
  • User auth separate from device auth (ADR-008)

Alternatives Considered

  • Auth0: enterprise-flavored, expensive, no advantage for a B2C product
  • Supabase Auth: awkward without broader Supabase adoption
  • Firebase Auth: Google ecosystem lock-in, Clerk has better consumer mobile UX
  • Better Auth (self-hosted): self-host complexity is the wrong trade at this stage
  • Keep Sa-Token: hiring and maintenance costs long-term exceed migration cost

ADR-008: Device authentication via bootstrap secret + JWT rotation

Status: Approved

Context: Collars and docks need authentication to backend separate from user auth. Full mTLS PKI is overkill for consumer IoT startup scale and operationally complex. Pre-shared secrets must not be fleet-wide (one leak compromises all). X.509 cert rotation is painful.

Decision: Each device provisioned at manufacture with unique device ID + random bootstrap secret, logged to provisioning database. On first connection, device presents bootstrap credential; backend validates, issues device-specific JWT for ongoing use. JWT rotates on firmware update. Revocation integrated with unpair, factory reset, and account deletion flows.
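
The flow above can be sketched end to end with the standard library. Everything named here is hypothetical: the in-memory dict stands in for the provisioning database, HS256 with a single signing key stands in for whatever signing scheme is chosen, and the "exp" claim with a one-hour TTL is an added assumption (the ADR only specifies rotation on firmware update).

```python
import base64
import hashlib
import hmac
import json
import secrets
import time

SIGNING_KEY = secrets.token_bytes(32)  # hypothetical; a real key lives in a secret store


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: str) -> bytes:
    return hmac.new(SIGNING_KEY, signing_input.encode(), hashlib.sha256).digest()


def provision_device(db: dict, device_id: str) -> str:
    """Factory step: generate a per-device secret and store only its hash."""
    bootstrap_secret = secrets.token_urlsafe(32)
    secret_hash = hashlib.sha256(bootstrap_secret.encode()).hexdigest()
    db[device_id] = {"secret_hash": secret_hash, "revoked": False}
    return bootstrap_secret  # burned into the device, never stored in plaintext


def issue_device_jwt(db, device_id, presented_secret, ttl_s=3600, now=None):
    """First connection: check the bootstrap secret, then issue a device JWT."""
    record = db.get(device_id)
    if record is None or record["revoked"]:
        return None
    presented_hash = hashlib.sha256(presented_secret.encode()).hexdigest()
    if not hmac.compare_digest(presented_hash, record["secret_hash"]):
        return None
    now = int(time.time()) if now is None else now
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    claims = {"sub": device_id, "iat": now, "exp": now + ttl_s}
    payload = _b64url(json.dumps(claims).encode())
    return f"{header}.{payload}.{_b64url(_sign(f'{header}.{payload}'))}"


def verify_device_jwt(db, token, now=None):
    """Return the device ID if the token is valid, unexpired, and not revoked."""
    try:
        header, payload, sig = token.split(".")
        if not hmac.compare_digest(_b64url_decode(sig), _sign(f"{header}.{payload}")):
            return None
        claims = json.loads(_b64url_decode(payload))
    except ValueError:
        return None
    now = int(time.time()) if now is None else now
    record = db.get(claims.get("sub"))
    if record is None or record["revoked"] or claims.get("exp", 0) <= now:
        return None
    return claims["sub"]
```

Setting "revoked" on the device record is the hook for the unpair, factory reset, and account deletion flows: verification consults the record on every call, so revocation takes effect without waiting for the token to expire.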

Consequences

  • Most of the security value of mTLS at a fraction of the operational cost (rough estimate: 90% of the value at 20% of the cost)
  • Provisioning database is a new secret store that must be protected
  • Factory manufacturing flow must include secure bootstrap credential burn-in
  • Dual-auth model (user JWT + device JWT) requires clear separation in backend code
  • Revisit if product direction moves toward enterprise/HIPAA compliance or hardware security modules

Alternatives Considered

  • Full mTLS with factory-signed client certificates: operationally expensive, cert rotation painful
  • Shared secret with per-device derivation: a leak of the master secret compromises the whole fleet, rejected
  • Per-device username/password: essentially equivalent but less standard than JWT

ADR-009: Stripe multi-entity instead of Paddle

Status: Approved

Context: Company is incorporated in Delaware, Thailand, and Singapore. Global subscription sales raise tax compliance issues (VAT, GST, Digital Services Tax). Paddle as merchant-of-record handles compliance but charges ~5% + $0.30 per transaction.

Decision: Stripe across three entities.

  • US sales via Delaware entity (Stripe Tax for US state sales tax)
  • Thailand sales via Thai entity (Thai VAT handled directly)
  • Global subscriptions via Singapore entity (Singapore GST + local registrations as expansion occurs)

Paddle remains fallback if expansion outpaces compliance capacity.
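
The entity split above reduces to a small lookup. A sketch, assuming the customer’s country arrives as an ISO 3166-1 alpha-2 code; the entity labels are hypothetical identifiers, not Stripe account IDs.

```python
# Countries with a dedicated local entity; everything else goes to Singapore.
ENTITY_BY_COUNTRY = {"US": "delaware", "TH": "thailand"}
DEFAULT_ENTITY = "singapore"


def stripe_entity_for(country_code: str) -> str:
    """Pick the selling entity (and thus the Stripe account) for a customer country."""
    return ENTITY_BY_COUNTRY.get(country_code.strip().upper(), DEFAULT_ENTITY)


for code in ("US", "th", "DE"):
    print(code, "->", stripe_entity_for(code))
```

Keeping the mapping in one table means a new local entity, or a country-by-country move to Paddle, is a data change rather than a code change.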

Consequences

  • Better margins than Paddle (2-3% savings on subscription revenue)
  • Three Stripe accounts, three compliance footprints — operational overhead
  • Singapore entity as merchant of record for rest-of-world sales enables Paddle avoidance
  • Stripe Tax covers automated calculation but legal compliance remains company responsibility
  • Hardware (physical goods) cleaner on Stripe than Paddle regardless
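
A rough check of the margin claim, using the ~5% + 0.30 Paddle fee from the Context. The Stripe rate (2.9% + 0.30) and the 10.00 monthly price are assumptions for illustration only; Stripe Tax fees, international card surcharges, and compliance costs are not included.

```python
from decimal import Decimal


def fee(price: Decimal, percent: Decimal, fixed: Decimal) -> Decimal:
    """Per-transaction fee: a percentage of the price plus a fixed amount."""
    return (price * percent / Decimal(100) + fixed).quantize(Decimal("0.01"))


price = Decimal("10.00")  # hypothetical monthly subscription price

paddle_fee = fee(price, Decimal("5"), Decimal("0.30"))    # rate from the Context
stripe_fee = fee(price, Decimal("2.9"), Decimal("0.30"))  # assumed card rate

savings_pct = (paddle_fee - stripe_fee) / price * 100
print(f"Paddle {paddle_fee}, Stripe {stripe_fee}, savings {savings_pct:.1f}% of price")
```

Under these assumptions the saving is about 2.1% of subscription revenue, at the low end of the 2-3% range.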

Alternatives Considered

  • Paddle for all subscriptions: 5% fee long-term exceeds compliance cost savings given Singapore entity
  • Stripe Delaware-only, defer global: limits TAM, rejected
  • Paddle for subs + Stripe for hardware: operational complexity not justified given Singapore entity option

ADR-010: C for Apollo3 firmware

Status: Approved

Context: Apollo3 Blue firmware needed. Ambiq SDK is C. neuralSPOT (AI SDK) is C. TFLM is C++ with C APIs. CMSIS-NN is C. Every reference implementation and vendor support path assumes C.

Decision: Firmware is C. Use vendor SDK (AmbiqSuite) and neuralSPOT. No new language introduction at this silicon.

Consequences

  • Vendor support and reference material directly applicable
  • Hiring pool for embedded C is broad
  • Standard toolchain (GCC ARM, CMake)
  • Memory safety concerns mitigated by static analysis, code review, and rigorous testing
  • Cannot leverage Rust’s safety guarantees — accept

Alternatives Considered

  • Rust: embedded-hal ecosystem immature for Apollo3, vendor SDK wrapping negates safety benefits
  • Zig: pre-1.0, toolchain immaturity, hobbyist embedded ecosystem
  • C++: viable but TFLM and CMSIS-NN APIs work fine from C; no meaningful benefit

ADR-011: On-device TFLM via Ambiq neuralSPOT

Status: Provisional — pending model size validation

Context: Target is on-device inference of TCN + Transformer + CNN + Cross-Attention model within ~664KB flash budget on Apollo3 Cortex-M4F, 768KB RAM (minus BLE stack and app). TensorFlow Lite for Microcontrollers is industry standard for MCU inference. Ambiq’s neuralSPOT builds on TFLM with Apollo-optimized kernels.

Decision: Use neuralSPOT as base runtime. Edge Impulse EON Compiler considered for further optimization. Custom kernels for Transformer attention blocks where TFLM operators are insufficient.

Consequences

  • Vendor-supported path, established for similar silicon/workloads
  • Pipeline: PyTorch → ONNX → TFLite (INT8 via QAT) → neuralSPOT integration → Apollo3 binary
  • Accuracy loss at quantization must be validated on real silicon, not just simulator
  • Aggressive quantization (INT4 weights) may be required
  • If model does not fit budget, architecture reduction or hardware change (Apollo4/5 with NN accelerator) required
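
The size question behind the provisional status is mostly arithmetic. The sketch below gives an upper bound on weight count for the ~664KB flash budget at different bit widths. It assumes KB means 1024 bytes and that the whole budget goes to weights; quantization parameters, operator code, and model metadata will reduce the real limit.

```python
FLASH_BUDGET_BYTES = 664 * 1024  # ~664KB budget from the Context


def max_weights(budget_bytes: int, bits_per_weight: int) -> int:
    """Upper bound on weight count if the entire budget held only weights."""
    return budget_bytes * 8 // bits_per_weight


def fits(param_count: int, bits_per_weight: int,
         budget_bytes: int = FLASH_BUDGET_BYTES, overhead_bytes: int = 0) -> bool:
    """True if the weights plus a caller-supplied overhead fit in the budget."""
    weight_bytes = (param_count * bits_per_weight + 7) // 8  # round up to bytes
    return weight_bytes + overhead_bytes <= budget_bytes


for bits in (32, 8, 4):
    print(f"{bits}-bit weights: at most {max_weights(FLASH_BUDGET_BYTES, bits):,}")
```

At INT8 the ceiling is roughly 680k weights and at INT4 roughly 1.36M, which is why INT4 weights come up as a possible requirement.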

Alternatives Considered

  • ExecuTorch (PyTorch Edge): newer, less mature on Apollo3, watch but don’t bet
  • Raw CMSIS-NN: hand-rolled, no runtime, rejected as engineering time sink
  • Cloud inference fallback: bandwidth, battery, and latency costs unacceptable for collar

ADR-012: PostHog for product analytics and feature flags

Status: Approved

Context: Need product analytics (event tracking), feature flags (for strangler rollout and paid tier gating), session replay (for UX debugging). Multiple tools possible; consolidation preferred to reduce SaaS sprawl.

Decision: PostHog Cloud for all three. Core events defined in React Native app. Feature flags exposed to both app and backend. Session replay enabled selectively with privacy review.
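
For readers new to percentage rollouts, the general technique is deterministic hash bucketing: the same user always lands in the same bucket, so raising the percentage only ever adds users. The sketch below illustrates that idea generically. It is not PostHog’s algorithm, and the flag key and user IDs are hypothetical.

```python
import hashlib

BUCKETS = 10_000  # resolution of 0.01%


def rollout_bucket(flag_key: str, distinct_id: str) -> int:
    """Stable bucket in [0, BUCKETS) derived from the flag key and user ID."""
    digest = hashlib.sha256(f"{flag_key}:{distinct_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def is_enabled(flag_key: str, distinct_id: str, rollout_percent: float) -> bool:
    """True if this user falls inside the rollout percentage for this flag."""
    return rollout_bucket(flag_key, distinct_id) < rollout_percent * (BUCKETS / 100)


# Hypothetical flag gating a strangler carve-out at 20% of users.
print(is_enabled("go-telemetry-path", "user-42", 20))
```

Hashing the flag key together with the user ID keeps rollouts independent: being in the first 20% for one flag says nothing about another flag.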

Consequences

  • Single vendor for product analytics, flags, experiments, session replay
  • Free tier covers pilot phase (1M events/month)
  • Self-host option exists if scale or data residency requires
  • Separate from infrastructure observability (LGTM) — different tools for different jobs
  • Experiment/A-B testing capability available when user count supports statistical power

Alternatives Considered

  • Mixpanel + LaunchDarkly + LogRocket: three vendors, higher combined cost, no integration advantage
  • Amplitude + Statsig: similar split-vendor issues
  • Self-hosted analytics (Plausible, Umami): doesn’t cover flags or session replay

ADR-013: Memfault for firmware observability

Status: Approved

Context: Firmware bugs in production are opaque without crash reporting. Memfault provides coredump reporting, firmware metrics, fleet-wide dashboards, OTA orchestration. Best-in-class for MCU fleet observability, supports Apollo3.

Decision: Memfault integrated into firmware before 5,000-user scale. Crash reports, battery metrics, BLE health, sync failure rates surfaced to fleet dashboards. Critical alerts piped to LGTM via webhook for unified alerting.

Consequences

  • Firmware engineering effort to integrate SDK (~1-2 weeks)
  • Vendor cost scales with device count
  • Irreplaceable observability for in-field firmware debugging
  • Custom telemetry alternatives would be underbuilt and underinvested compared with Memfault’s feature set
  • Non-negotiable for shipping firmware at scale

Alternatives Considered

  • Custom crash reporting: months of engineering for inferior result
  • Percepio Tracealyzer: dev-time tool, not production telemetry
  • No firmware observability: unacceptable at 5,000+ unit scale

ADR-014: GitHub Actions + Railway native deploys through year 1

Status: Approved

Context: CI/CD pipeline needed for backend and mobile. CTO has experience with Buildkite + Argo Rollouts at $7B startup scale and GitLab CI at prior roles. GitHub has had reliability issues recently. Team is small; operational overhead of self-managed CI is unjustified.

Decision: GitHub Actions for CI, Railway native integration for CD. Monitor GitHub reliability. Migrate to GitLab CI if GitHub reliability degrades materially; migrate to Buildkite + Argo Rollouts when moving to Kubernetes/AWS at scale.

Consequences

  • Zero operational overhead for CI infrastructure
  • Railway auto-deploys on main merge; appropriate simplicity for this stage
  • Staged rollouts via feature flags (PostHog) rather than infrastructure canaries
  • GitHub outages do affect dev velocity; have a migration plan
  • Eventual Buildkite/Argo migration in years 2-3 alongside K8s transition

Alternatives Considered

  • Buildkite from day one: operational overhead not justified at team size
  • GitLab CI from day one: viable but no compelling switch from GHA given established workflow
  • CircleCI: no advantage over GHA
  • Self-hosted Jenkins: rejected, operational tax

ADR-015: REST for mobile API, gRPC for internal services

Status: Approved

Context: Mobile-to-backend contract needs stable, debuggable, tooling-friendly protocol. Internal service-to-service calls benefit from efficient, strongly-typed RPC. CTO has experience with both REST and gRPC at scale.

Decision: REST with OpenAPI for mobile-to-backend. URL versioning (/v1/). Additive-only changes until major version. gRPC for service-to-service calls between Go services when multiple services exist. Protobuf schemas versioned in monorepo or shared schema repo.

Consequences

  • Mobile gets standard REST tooling, caching, debugging
  • Internal services get efficient RPC with strong typing
  • Two contract systems (OpenAPI for REST, Protobuf for gRPC) — acceptable overhead
  • Contract testing enforced in CI for both
  • Breaking change detection (oasdiff for OpenAPI, buf for Protobuf)
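
To make "additive-only" concrete, here is a toy checker over a simplified contract description. It is not oasdiff or buf and does not parse OpenAPI; the dict shape and endpoint names are hypothetical. It flags the three changes that break existing mobile clients: a removed endpoint, a removed response field, and a newly required request field.

```python
def breaking_changes(old: dict, new: dict) -> list:
    """Compare two simplified contracts and list non-additive changes."""
    problems = []
    for endpoint, old_spec in old.items():
        new_spec = new.get(endpoint)
        if new_spec is None:
            problems.append(f"{endpoint}: endpoint removed")
            continue
        removed = old_spec["response_fields"] - new_spec["response_fields"]
        for field in sorted(removed):
            problems.append(f"{endpoint}: response field '{field}' removed")
        added_required = new_spec["request_required"] - old_spec["request_required"]
        for field in sorted(added_required):
            problems.append(f"{endpoint}: new required request field '{field}'")
    return problems


v1 = {"GET /v1/pets": {"request_required": set(), "response_fields": {"id", "name"}}}
v1_next = {
    "GET /v1/pets": {"request_required": {"owner_id"}, "response_fields": {"id"}},
    "GET /v1/docks": {"request_required": set(), "response_fields": {"id"}},
}
for problem in breaking_changes(v1, v1_next):
    print(problem)
```

New endpoints and new optional or response fields pass silently, which matches the additive-only rule: old app versions in the field keep working until a major version bump.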

Alternatives Considered

  • gRPC for mobile: worse tooling and debugging; gRPC-Web still requires a gateway; rejected
  • GraphQL: unnecessary complexity for bounded query patterns, caching complexity
  • REST for internal services: less efficient, weaker typing, rejected where gRPC fits

Last updated: during CTO work trial. Update when new ADRs are added or status changes.