Pync — Architecture Decision Records

One-page reference for each major technical decision. For future team members, founders, and diligence.

Index

ADR-001: Railway over AWS through 25,000 users
ADR-002: Strangler pattern for Spring backend migration
ADR-003: Go for new backend services
ADR-004: TimescaleDB for time-series telemetry
ADR-005: HiveMQ Cloud for MQTT broker
ADR-006: Self-hosted LGTM for observability
ADR-007: Clerk for user authentication
ADR-008: Device authentication via bootstrap secret + JWT rotation
ADR-009: Stripe multi-entity instead of Paddle
ADR-010: C for Apollo3 firmware
ADR-011: On-device TFLM via Ambiq neuralSPOT
ADR-012: PostHog for product analytics and feature flags
ADR-013: Memfault for firmware observability
ADR-014: GitHub Actions + Railway native deploys through year 1
ADR-015: REST for mobile API, gRPC for internal services

ADR-001: Railway over AWS through 25,000 users

Status: Approved

Context: Product targets 5,000 users in 6 months, scaling toward 25,000 in year one. AWS is the industry default, but Railway offers better DX and lower cost at small-to-medium scale. Product workload (web services, Postgres, Redis, MQTT) is within Railway’s capability. Railway Metal Singapore places production close to the pilot market (Thailand).

Decision: Production runs on Railway Metal Singapore through 25,000 users. Migrate to AWS only when a specific requirement emerges: HIPAA compliance, AWS IoT Core fleet scale, or multi-region complexity.

Consequences

Lower operational overhead; no VPC/IAM investment
~3-5x lower infrastructure cost at current scale
Better egress economics ($0.05/GB vs AWS $0.09/GB base in NA)
Railway’s IaC and observability tooling less mature than AWS
Migration cost grows with service count; plan portability from day one

Alternatives Considered

AWS from day one: operational tax, cost, and hiring constraints outweigh current needs
GCP: viable, smaller ecosystem, no specific advantage
Fly.io: interesting but less mature managed Postgres and more opinionated architecture
Self-managed Kubernetes: operational cost not justified at team size

ADR-002: Strangler pattern for Spring backend migration

Status: Approved

Context: Inherited Java/Spring Boot backend built by a Chinese contractor team. It works in pilot but carries technical debt (two ORMs, Sa-Token auth, unclear dependency tree). Rewriting costs months of lost velocity with no product value. CTO does not want to hire Spring engineers in NYC.

Decision: Spring monolith continues to serve production traffic. New features and high-volume paths are implemented as Go services. Carve out Spring functionality over 12-18 months in priority order. User accounts and billing remain in Spring indefinitely.
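A minimal sketch of how the strangler boundary could be enforced at the edge, assuming a single routing layer in front of both stacks. The path prefix, the upstream stubs, and the function names (isCarvedOut, newStranglerRouter) are hypothetical and not part of the existing system; the ADR does not say where routing actually happens.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/http/httputil"
	"net/url"
	"strings"
)

// isCarvedOut reports whether a request path belongs to functionality
// that has already moved from Spring to a Go service.
func isCarvedOut(path string, prefixes []string) bool {
	for _, p := range prefixes {
		if strings.HasPrefix(path, p) {
			return true
		}
	}
	return false
}

// newStranglerRouter proxies carved-out paths to the Go service and
// everything else to the legacy Spring monolith.
func newStranglerRouter(legacy, carved *url.URL, prefixes []string) http.Handler {
	legacyProxy := httputil.NewSingleHostReverseProxy(legacy)
	carvedProxy := httputil.NewSingleHostReverseProxy(carved)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if isCarvedOut(r.URL.Path, prefixes) {
			carvedProxy.ServeHTTP(w, r)
			return
		}
		legacyProxy.ServeHTTP(w, r)
	})
}

// stubUpstream starts a local test server that answers every request with name.
func stubUpstream(name string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, name)
	}))
}

// fetch GETs a URL and returns the response body as a string.
func fetch(rawURL string) string {
	resp, err := http.Get(rawURL)
	if err != nil {
		return "error: " + err.Error()
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	return string(body)
}

func main() {
	spring := stubUpstream("spring")
	defer spring.Close()
	goSvc := stubUpstream("go")
	defer goSvc.Close()

	springURL, _ := url.Parse(spring.URL)
	goURL, _ := url.Parse(goSvc.URL)
	// Hypothetical: only the telemetry path has been carved out so far.
	edge := httptest.NewServer(newStranglerRouter(springURL, goURL, []string{"/v1/telemetry/"}))
	defer edge.Close()

	fmt.Println(fetch(edge.URL + "/v1/telemetry/batch"))  // handled by the Go stub
	fmt.Println(fetch(edge.URL + "/v1/billing/invoices")) // handled by the Spring stub
}
```

Keeping the carved-out prefixes in one list makes each migration step a small change that is easy to revert: adding a prefix moves traffic to Go, and removing it sends traffic back to Spring.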

Consequences

No productivity loss from rewrite
Team can be built around preferred stack (Go, Python, TS) rather than Spring hiring
Dual-stack complexity during transition — mitigated by clear API boundary
Some Spring code will live in production for years — accept and manage
Observability must cover both stacks from day one

Alternatives Considered

Full rewrite in Go: 3-6 months of stopped feature work; unacceptable
Full embrace of Spring and hire Java engineers: contradicts CTO preference and NYC hiring pool signal
Microservices extraction to smaller Spring services: doesn’t solve hiring problem

ADR-003: Go for new backend services

Status: Approved

Context: Need language for new backend services in strangler pattern. CTO has production experience with Go at scale ($7B startup). NYC senior engineering pool skews Python/TS/Go. Product workload (MQTT ingest, telemetry pipeline, device management) fits Go’s concurrency model.

Decision: Go is the default for all new backend services unless a specific reason dictates otherwise (Python for ML-adjacent, TS for BFF or shared-type boundaries).

Consequences

Hiring pool strong in NYC
Operational simplicity (static binaries, low resource usage)
gRPC natural for internal service-to-service
Go data-access libraries (GORM, ent) are less ergonomic than Spring Data; prefer sqlc, which generates typed Go code from raw SQL
Complex domain modeling requires more scaffolding than in JVM ecosystem

Alternatives Considered

Rust: mature on backend but hiring pool thinner, overkill for CRUD workloads
Node/TypeScript: viable, but the single-threaded event loop constrains CPU-bound work, and dependency sprawl is a concern at scale
Python: fine for ML, not recommended for primary API layer at 25k+ users
Java (Spring Boot): rejected in ADR-002 rationale

ADR-004: TimescaleDB for time-series telemetry

Status: Approved

Context: Product generates significant time-series data: IMU activity events, location pings, device health, ML inference outputs. At 5,000 devices with periodic sync, ~10M+ writes/day. Vanilla Postgres handles the volume, but queries degrade without partitioning. A separate time-series DB adds operational overhead.
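As a rough sanity check on the write volume above, the figures can be turned into per-device and per-second averages. This is only back-of-the-envelope arithmetic on the numbers stated in the context. The function names are illustrative, and the result is an average: periodic sync will produce bursts above it.

```go
package main

import "fmt"

// perDevicePerDay returns the average number of writes each device
// contributes per day for a given fleet-wide daily total.
func perDevicePerDay(writesPerDay, devices int) int {
	return writesPerDay / devices
}

// avgWritesPerSecond spreads a daily write total evenly over 86,400 seconds.
func avgWritesPerSecond(writesPerDay int) float64 {
	return float64(writesPerDay) / 86400.0
}

func main() {
	const writesPerDay = 10_000_000 // "~10M+ writes/day" from the context
	const devices = 5_000

	fmt.Printf("per device per day: %d\n", perDevicePerDay(writesPerDay, devices))
	fmt.Printf("average writes/s:   %.1f\n", avgWritesPerSecond(writesPerDay))
}
```

At these figures each device averages about 2,000 writes per day, and the fleet averages roughly 116 writes per second.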

Decision: TimescaleDB extension on Postgres for telemetry tables. Managed via Timescale Cloud (Singapore region). Operational Postgres (users, subscriptions) is either co-located or separate; evaluate based on vendor options.

Consequences

SQL-compatible, no new query language
Automatic partitioning, columnar compression, continuous aggregates
Single database technology reduces ops surface
Railway’s managed Postgres doesn’t include Timescale extension — use Timescale Cloud
Minor vendor lock-in to Timescale-specific features (hypertables, compression policies)

Alternatives Considered

InfluxDB: separate system, different query language, weaker SQL story
ClickHouse: excellent for analytics but operational overhead and write-pattern mismatch for IoT telemetry
Vanilla Postgres with manual partitioning: works but requires DIY tooling for retention and compression
AWS Timestream: ties to AWS, different API, locked in

ADR-005: HiveMQ Cloud for MQTT broker

Status: Approved

Context: Dock speaks MQTT to backend (persistent connection, bidirectional, for backend-to-dock commands like “ring the beeper”). At 5,000-25,000 docks, managed broker preferred over self-hosted.

Decision: HiveMQ Cloud (APAC region) for MQTT brokering. MQTT 5 support, strong auth and ACL model, reasonable pricing at target scale.

Consequences

Managed operations; focus engineering on product
Best-in-class MQTT 5 support
Commercial German company, broad geographic regions
Vendor cost scales with connections — reasonable at 5k-25k
Alternative brokers readily available if needed (EMQX, NATS MQTT, AWS IoT)

Alternatives Considered

EMQX: strong product but Chinese-origin company concern given contractor exit strategy
NATS with MQTT adapter: consolidates messaging but requires self-host and ops
AWS IoT Core: ties to AWS, per-message pricing adds up
Mosquitto: doesn’t scale; fine for dev only
Cedalo Pro Mosquitto: wrong use case (edge broker product, not cloud)

ADR-006: Self-hosted LGTM for observability

Status: Approved

Context: Observability generates significant egress; Datadog at full pricing is expensive. CTO has production experience running LGTM stack (Loki, Grafana, Tempo, Mimir). Railway’s private networking means zero egress for observability traffic between services and self-hosted backend.

Decision: Self-host LGTM on Railway. OpenTelemetry Collector as aggregation layer. All services emit OTLP over private networking. Datadog credits ($100k available) applied surgically: synthetic monitoring, Database Monitoring (DBM) for Timescale, occasional debugging.

Consequences

No egress cost for observability data
No per-GB or per-host pricing cliff
Operational ownership of observability stack — acceptable given CTO experience
OpenTelemetry instrumentation keeps backend choice portable
More tuning required (Loki storage, Mimir retention, Tempo sampling) as volume grows

Alternatives Considered

Full Datadog on credits: attractive short-term; Railway egress and post-credit cost make it unsustainable
Grafana Cloud managed: good free tier but egress costs still apply, cost grows with scale
Honeycomb for tracing + Loki for logs: split-vendor complexity, higher combined cost

ADR-007: Clerk for user authentication

Status: Approved

Context: Existing backend uses Sa-Token (Chinese auth library, sparse English docs, hard to hire for). Replacing it with a managed identity provider is required as part of contractor exit. CTO has used Supabase Auth and Better Auth before.

Decision: Clerk for user authentication. Native React Native SDK; supports email/password, social (Apple, Google), passwordless, MFA. Replace Sa-Token incrementally: new services validate Clerk JWTs, Spring continues Sa-Token validation for existing sessions, migrate users on re-authentication.

Consequences

Modern DX, fast to ship
Strong React Native support (primary mobile platform)
Vendor dependency; Clerk pricing scales with MAU
JWT-based, JWKS endpoint for validation in any backend language
User auth separate from device auth (ADR-008)

Alternatives Considered

Auth0: enterprise-flavored, expensive, no advantage for B2C consumer
Supabase Auth: awkward without broader Supabase adoption
Firebase Auth: Google ecosystem lock-in, Clerk has better consumer mobile UX
Better Auth (self-hosted): self-hosting complexity is the wrong trade-off at this stage
Keep Sa-Token: hiring and maintenance costs long-term exceed migration cost

ADR-008: Device authentication via bootstrap secret + JWT rotation

Status: Approved

Context: Collars and docks need authentication to backend separate from user auth. Full mTLS PKI is overkill for consumer IoT startup scale and operationally complex. Pre-shared secrets must not be fleet-wide (one leak compromises all). X.509 cert rotation is painful.

Decision: Each device provisioned at manufacture with unique device ID + random bootstrap secret, logged to provisioning database. On first connection, device presents bootstrap credential; backend validates, issues device-specific JWT for ongoing use. JWT rotates on firmware update. Revocation integrated with unpair, factory reset, and account deletion flows.

Consequences

Most of the security value of mTLS at a fraction of the operational cost (rough estimate: 90% of the value at 20% of the cost)
Provisioning database is a new secret store that must be protected
Factory manufacturing flow must include secure bootstrap credential burn-in
Dual-auth model (user JWT + device JWT) requires clear separation in backend code
Revisit at enterprise/HIPAA compliance or hardware-security-module product direction

Alternatives Considered

Full mTLS with factory-signed client certificates: operationally expensive, cert rotation painful
Fleet-wide master secret with per-device key derivation: a leak of the master secret compromises the whole fleet; rejected
Per-device username/password: essentially equivalent but less standard than JWT

ADR-009: Stripe multi-entity instead of Paddle

Status: Approved

Context: Company is incorporated in Delaware, Thailand, and Singapore. Global subscription sales raise tax compliance issues (VAT, GST, Digital Service Tax). Paddle as merchant-of-record handles compliance but charges ~5% + $0.50 vs Stripe 2.9% + $0.30.

Decision: Stripe across three entities:

US sales via Delaware entity (Stripe Tax for US state sales tax)
Thailand sales via Thai entity (Thai VAT handled directly)
Global subscriptions via Singapore entity (Singapore GST + local registrations as expansion occurs)

Paddle remains fallback if expansion outpaces compliance capacity.

Consequences

Better margins than Paddle (2-3% savings on subscription revenue)
Three Stripe accounts, three compliance footprints — operational overhead
Singapore entity as merchant of record for rest-of-world sales enables Paddle avoidance
Stripe Tax covers automated calculation but legal compliance remains company responsibility
Hardware (physical goods) cleaner on Stripe than Paddle regardless
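The margin claim can be checked with a short worked comparison of the two fee schedules quoted in the context. The subscription prices below are hypothetical, since the ADR does not state one. The comparison covers processing fees only, not Stripe Tax charges or the compliance overhead noted above.

```go
package main

import "fmt"

// feeCents returns a processor fee in cents for one charge, given a
// percentage in basis points (1 bp = 0.01%) and a fixed per-charge fee.
func feeCents(amountCents, rateBps, fixedCents int) int {
	return amountCents*rateBps/10000 + fixedCents
}

func main() {
	// Hypothetical monthly subscription prices, in cents.
	for _, price := range []int{1000, 2000, 5000} {
		paddle := feeCents(price, 500, 50) // ~5% + $0.50
		stripe := feeCents(price, 290, 30) // 2.9% + $0.30
		saved := paddle - stripe
		fmt.Printf("$%d.00: Paddle %d cents, Stripe %d cents, saved %d cents (%.1f%% of price)\n",
			price/100, paddle, stripe, saved, 100*float64(saved)/float64(price))
	}
}
```

With these inputs the saving is 4.1% of price at $10, 3.1% at $20, and 2.5% at $50, so the 2-3% range holds for prices of roughly $20 and up and is larger below that.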

Alternatives Considered

Paddle for all subscriptions: 5% fee long-term exceeds compliance cost savings given Singapore entity
Stripe Delaware-only, defer global: limits TAM, rejected
Paddle for subs + Stripe for hardware: operational complexity not justified given Singapore entity option

ADR-010: C for Apollo3 firmware

Status: Approved

Context: Apollo3 Blue firmware needed. Ambiq SDK is C. neuralSPOT (AI SDK) is C. TFLM is C++ with C APIs. CMSIS-NN is C. Every reference implementation and vendor support path assumes C.

Decision: Firmware is C. Use vendor SDK (AmbiqSuite) and neuralSPOT. No new language is introduced for this silicon.

Consequences

Vendor support and reference material directly applicable
Hiring pool for embedded C is broad
Standard toolchain (GCC ARM, CMake)
Memory safety concerns mitigated by static analysis, code review, and rigorous testing
Cannot leverage Rust’s safety guarantees — accept

Alternatives Considered

Rust: embedded-hal ecosystem immature for Apollo3, vendor SDK wrapping negates safety benefits
Zig: pre-1.0, toolchain immaturity, hobbyist embedded ecosystem
C++: viable but TFLM and CMSIS-NN APIs work fine from C; no meaningful benefit

ADR-011: On-device TFLM via Ambiq neuralSPOT

Status: Provisional — pending model size validation

Context: Target is on-device inference of TCN + Transformer + CNN + Cross-Attention model within ~664KB flash budget on Apollo3 Cortex-M4F, 768KB RAM (minus BLE stack and app). TensorFlow Lite for Microcontrollers is industry standard for MCU inference. Ambiq’s neuralSPOT builds on TFLM with Apollo-optimized kernels.

Decision: Use neuralSPOT as base runtime. Edge Impulse EON Compiler considered for further optimization. Custom kernels for Transformer attention blocks where TFLM operators are insufficient.

Consequences

Vendor-supported path, established for similar silicon/workloads
Pipeline: PyTorch → ONNX → TFLite (INT8 via QAT) → neuralSPOT integration → Apollo3 binary
Accuracy loss at quantization must be validated on real silicon, not just simulator
Aggressive quantization (INT4 weights) may be required
If model does not fit budget, architecture reduction or hardware change (Apollo4/5 with NN accelerator) required

Alternatives Considered

ExecuTorch (PyTorch Edge): newer, less mature on Apollo3, watch but don’t bet
Raw CMSIS-NN: hand-rolled, no runtime, rejected as engineering time sink
Cloud inference fallback: bandwidth, battery, and latency costs unacceptable for collar

ADR-012: PostHog for product analytics and feature flags

Status: Approved

Context: Need product analytics (event tracking), feature flags (for strangler rollout and paid tier gating), session replay (for UX debugging). Multiple tools possible; consolidation preferred to reduce SaaS sprawl.

Decision: PostHog Cloud for all three. Core events defined in React Native app. Feature flags exposed to both app and backend. Session replay enabled selectively with privacy review.

Consequences

Single vendor for product analytics, flags, experiments, session replay
Free tier covers pilot phase (1M events/month)
Self-host option exists if scale or data residency requires
Separate from infrastructure observability (LGTM) — different tools for different jobs
Experiment/A-B testing capability available when user count supports statistical power

Alternatives Considered

Mixpanel + LaunchDarkly + LogRocket: three vendors, higher combined cost, no integration advantage
Amplitude + Statsig: similar split-vendor issues
Self-hosted analytics (Plausible, Umami): doesn’t cover flags or session replay

ADR-013: Memfault for firmware observability

Status: Approved

Context: Firmware bugs in production are opaque without crash reporting. Memfault provides coredump reporting, firmware metrics, fleet-wide dashboards, OTA orchestration. Best-in-class for MCU fleet observability, supports Apollo3.

Decision: Memfault integrated into firmware before 5,000-user scale. Crash reports, battery metrics, BLE health, sync failure rates surfaced to fleet dashboards. Critical alerts piped to LGTM via webhook for unified alerting.

Consequences

Firmware engineering effort to integrate SDK (~1-2 weeks)
Vendor cost scales with device count
Irreplaceable observability for in-field firmware debugging
Alternative paths (custom telemetry) underbuilt and underinvested vs Memfault’s feature set
Non-negotiable for shipping firmware at scale

Alternatives Considered

Custom crash reporting: months of engineering for inferior result
Percepio Tracealyzer: dev-time tool, not production telemetry
No firmware observability: unacceptable at 5,000+ unit scale

ADR-014: GitHub Actions + Railway native deploys through year 1

Status: Approved

Context: CI/CD pipeline needed for backend and mobile. CTO has experience with Buildkite + Argo Rollouts at $7B startup scale and GitLab CI at prior roles. GitHub reliability has had issues recently. Team is small; operational overhead of self-managed CI unjustified.

Decision: GitHub Actions for CI, Railway native integration for CD. Monitor GitHub reliability. Migrate to GitLab CI if GitHub reliability degrades materially; migrate to Buildkite + Argo Rollouts when moving to Kubernetes/AWS at scale.

Consequences

Zero operational overhead for CI infrastructure
Railway auto-deploys on merge to main; appropriately simple for this stage
Staged rollouts via feature flags (PostHog) not infrastructure canary
GitHub outages do affect dev velocity; have a migration plan
Buildkite/Argo eventual migration in year 2-3 alongside K8s transition

Alternatives Considered

Buildkite from day one: operational overhead not justified at team size
GitLab CI from day one: viable, but no compelling reason to switch from GHA given the established workflow
CircleCI: no advantage over GHA
Self-hosted Jenkins: rejected, operational tax

ADR-015: REST for mobile API, gRPC for internal services

Status: Approved

Context: Mobile-to-backend contract needs stable, debuggable, tooling-friendly protocol. Internal service-to-service calls benefit from efficient, strongly-typed RPC. CTO has experience with both REST and gRPC at scale.

Decision: REST with OpenAPI for mobile-to-backend. URL versioning (/v1/). Additive-only changes until major version. gRPC for service-to-service calls between Go services when multiple services exist. Protobuf schemas versioned in monorepo or shared schema repo.

Consequences

Mobile gets standard REST tooling, caching, debugging
Internal services get efficient RPC with strong typing
Two contract systems (OpenAPI for REST, Protobuf for gRPC) — acceptable overhead
Contract testing enforced in CI for both
Breaking change detection (oasdiff for OpenAPI, buf for Protobuf)

Alternatives Considered

gRPC for mobile: worse tooling and debugging, and gRPC-Web still requires a gateway; rejected
GraphQL: unnecessary complexity for bounded query patterns, caching complexity
REST for internal services: less efficient, weaker typing, rejected where gRPC fits

Last updated: during CTO work trial. Update when new ADRs are added or status changes.
