Pync — Architecture Decision Records
One-page reference for each major technical decision. For future team members, founders, and diligence.
Index
- ADR-001: Railway over AWS through 25,000 users
- ADR-002: Strangler pattern for Spring backend migration
- ADR-003: Go for new backend services
- ADR-004: TimescaleDB for time-series telemetry
- ADR-005: HiveMQ Cloud for MQTT broker
- ADR-006: Self-hosted LGTM for observability
- ADR-007: Clerk for user authentication
- ADR-008: Device authentication via bootstrap secret + JWT rotation
- ADR-009: Stripe multi-entity instead of Paddle
- ADR-010: C for Apollo3 firmware
- ADR-011: On-device TFLM via Ambiq neuralSPOT
- ADR-012: PostHog for product analytics and feature flags
- ADR-013: Memfault for firmware observability
- ADR-014: GitHub Actions + Railway native deploys through year 1
- ADR-015: REST for mobile API, gRPC for internal services
ADR-001: Railway over AWS through 25,000 users
Status: Approved
Context Product targets 5,000 users in 6 months, scaling toward 25,000 in year one. AWS is the industry default but Railway offers better DX and cost at small-to-medium scale. Product workload (web services, Postgres, Redis, MQTT) is within Railway’s capability. Railway Metal Singapore co-locates with pilot market (Thailand).
Decision Production runs on Railway Metal Singapore through 25,000 users. Migrate to AWS only when a specific requirement emerges: HIPAA compliance, AWS IoT Core fleet scale, or multi-region complexity.
Consequences
- Lower operational overhead; no VPC/IAM investment
- ~3-5x lower infrastructure cost at current scale
- Better egress economics ($0.09/GB base in NA)
- Railway’s IaC and observability tooling less mature than AWS
- Migration cost grows with service count; plan portability from day one
Alternatives Considered
- AWS from day one: operational tax, cost, and hiring constraints outweigh current needs
- GCP: viable, smaller ecosystem, no specific advantage
- Fly.io: interesting but less mature managed Postgres and more opinionated architecture
- Self-managed Kubernetes: operational cost not justified at team size
ADR-002: Strangler pattern for Spring backend migration
Status: Approved
Context Inherited Java/Spring Boot backend built by Chinese contractor team. Works in pilot but has technical debt (two ORMs, Sa-Token auth, unclear dependency tree). Rewriting is months of lost velocity with no product value. CTO does not want to hire Spring engineers in NYC.
Decision Spring monolith continues to serve production traffic. New features and high-volume paths implemented as Go services. Carve out Spring functionality over 12-18 months in priority order. User accounts and billing remain in Spring indefinitely.
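A minimal sketch of the routing split described in this decision, assuming a thin Go edge proxy sits in front of both stacks (the ADR does not say where routing happens). The path prefixes and upstream URLs are hypothetical; the only library call is the standard library's `httputil.NewSingleHostReverseProxy`.

```go
package main

import (
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"
)

// isMigrated reports whether path falls under a prefix that has already been
// carved out of the Spring monolith. A prefix matches itself and anything
// below it, but not a sibling that merely shares the same leading characters.
func isMigrated(path string, migrated []string) bool {
	for _, p := range migrated {
		if path == p || strings.HasPrefix(path, p+"/") {
			return true
		}
	}
	return false
}

// newStranglerRouter forwards migrated paths to the Go service and every
// other request to Spring, so callers keep a single API boundary.
func newStranglerRouter(springURL, goURL *url.URL, migrated []string) http.Handler {
	spring := httputil.NewSingleHostReverseProxy(springURL)
	goSvc := httputil.NewSingleHostReverseProxy(goURL)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if isMigrated(r.URL.Path, migrated) {
			goSvc.ServeHTTP(w, r)
			return
		}
		spring.ServeHTTP(w, r)
	})
}
```

Carving out another slice of Spring then amounts to adding one prefix to the `migrated` list, for example `[]string{"/v1/telemetry", "/v1/devices"}`, and rolling back means removing it.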
Consequences
- No productivity loss from rewrite
- Team can be built around preferred stack (Go, Python, TS) rather than Spring hiring
- Dual-stack complexity during transition — mitigated by clear API boundary
- Some Spring code will live in production for years — accept and manage
- Observability must cover both stacks from day one
Alternatives Considered
- Full rewrite in Go: 3-6 months of stopped feature work; unacceptable
- Full embrace of Spring and hire Java engineers: contradicts CTO preference and NYC hiring pool signal
- Microservices extraction to smaller Spring services: doesn’t solve hiring problem
ADR-003: Go for new backend services
Status: Approved
Context Need language for new backend services in strangler pattern. CTO has production experience with Go at scale ($7B startup). NYC senior engineering pool skews Python/TS/Go. Product workload (MQTT ingest, telemetry pipeline, device management) fits Go’s concurrency model.
Decision Go is the default for all new backend services unless a specific reason dictates otherwise (Python for ML-adjacent, TS for BFF or shared-type boundaries).
Consequences
- Hiring pool strong in NYC
- Operational simplicity (static binaries, low resource usage)
- gRPC natural for internal service-to-service
- Go data-access libraries (GORM, ent) less ergonomic than Spring Data; use sqlc, which generates type-safe Go code from raw SQL rather than acting as an ORM
- Complex domain modeling requires more scaffolding than in JVM ecosystem
Alternatives Considered
- Rust: mature on backend but hiring pool thinner, overkill for CRUD workloads
- Node/TypeScript: viable, but the single-threaded event loop limits CPU-bound work and dependency sprawl is a concern at scale
- Python: fine for ML, not recommended for primary API layer at 25k+ users
- Java (Spring Boot): rejected in ADR-002 rationale
ADR-004: TimescaleDB for time-series telemetry
Status: Approved
Context Product generates significant time-series data: IMU activity events, location pings, device health, ML inference outputs. At 5,000 devices with periodic sync, 10M+ writes/day (roughly 2,000 writes per device per day). Vanilla Postgres handles volume but queries degrade without partitioning. Separate time-series DB adds operational overhead.
Decision TimescaleDB extension on Postgres for telemetry tables. Managed via Timescale Cloud (Singapore region). Operational Postgres (users, subscriptions) co-located or separate — evaluate based on vendor options.
Consequences
- SQL-compatible, no new query language
- Automatic partitioning, columnar compression, continuous aggregates
- Single database technology reduces ops surface
- Railway’s managed Postgres doesn’t include Timescale extension — use Timescale Cloud
- Minor vendor lock-in to Timescale-specific features (hypertables, compression policies)
Alternatives Considered
- InfluxDB: separate system, different query language, weaker SQL story
- ClickHouse: excellent for analytics but operational overhead and write-pattern mismatch for IoT telemetry
- Vanilla Postgres with manual partitioning: works but requires DIY tooling for retention and compression
- AWS Timestream: ties to AWS, different API, locked in
ADR-005: HiveMQ Cloud for MQTT broker
Status: Approved
Context Dock speaks MQTT to backend (persistent connection, bidirectional, for backend-to-dock commands like “ring the beeper”). At 5,000-25,000 docks, managed broker preferred over self-hosted.
Decision HiveMQ Cloud (APAC region) for MQTT brokering. MQTT 5 support, strong auth and ACL model, reasonable pricing at target scale.
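To illustrate how a wildcard-based ACL can scope each dock to its own topics, here is a small sketch of standard MQTT topic-filter matching. The topic layout (`docks/<id>/cmd/...`) is a hypothetical example, not a layout defined in this ADR, and in practice the broker evaluates these rules; the function only mirrors the semantics. It does not handle the special treatment of topics beginning with `$`.

```go
package main

import "strings"

// topicMatches reports whether an MQTT topic name matches a topic filter.
// "+" matches exactly one level; "#" is only valid as the last level and
// matches the parent level plus any number of child levels.
func topicMatches(filter, topic string) bool {
	f := strings.Split(filter, "/")
	t := strings.Split(topic, "/")
	for i, level := range f {
		if level == "#" {
			return i == len(f)-1
		}
		if i >= len(t) {
			return false
		}
		if level != "+" && level != t[i] {
			return false
		}
	}
	return len(f) == len(t)
}
```

Under the hypothetical layout, a dock granted only `docks/d1/cmd/#` would receive a "ring the beeper" command published to `docks/d1/cmd/beep` but nothing addressed to `docks/d2/...`. Device IDs used to build filters should never contain `+`, `#`, or `/`.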
Consequences
- Managed operations; focus engineering on product
- Best-in-class MQTT 5 support
- Commercial German company, broad geographic regions
- Vendor cost scales with connections — reasonable at 5k-25k
- Alternative brokers readily available if needed (EMQX, NATS MQTT, AWS IoT)
Alternatives Considered
- EMQX: strong product but Chinese-origin company concern given contractor exit strategy
- NATS with MQTT adapter: consolidates messaging but requires self-host and ops
- AWS IoT Core: ties to AWS, per-message pricing adds up
- Mosquitto: doesn’t scale; fine for dev only
- Cedalo Pro Mosquitto: wrong use case (edge broker product, not cloud)
ADR-006: Self-hosted LGTM for observability
Status: Approved
Context Observability generates significant egress; Datadog at full pricing is expensive. CTO has production experience running LGTM stack (Loki, Grafana, Tempo, Mimir). Railway’s private networking means zero egress for observability traffic between services and self-hosted backend.
Decision Self-host LGTM on Railway. OpenTelemetry Collector as aggregation layer. All services emit OTLP over private networking. Datadog credits ($100k available) applied surgically: synthetic monitoring, Database Monitoring (DBM) for Timescale, occasional debugging.
Consequences
- No egress cost for observability data
- No per-GB or per-host pricing cliff
- Operational ownership of observability stack — acceptable given CTO experience
- OpenTelemetry instrumentation keeps backend choice portable
- More tuning required (Loki storage, Mimir retention, Tempo sampling) as volume grows
Alternatives Considered
- Full Datadog on credits: attractive short-term; Railway egress and post-credit cost make it unsustainable
- Grafana Cloud managed: good free tier but egress costs still apply, cost grows with scale
- Honeycomb for tracing + Loki for logs: split-vendor complexity, higher combined cost
ADR-007: Clerk for user authentication
Status: Approved
Context Existing backend uses Sa-Token (Chinese auth library, sparse English docs, hard to hire for). Replacing with managed identity provider required as part of contractor exit. CTO has used Supabase Auth and Better Auth before.
Decision Clerk for user authentication. React Native-native SDK, supports email/password, social (Apple, Google), passwordless, MFA. Replace Sa-Token incrementally: new services validate Clerk JWTs, Spring continues Sa-Token validation for existing sessions, migrate users on re-authentication.
Consequences
- Modern DX, fast to ship
- Strong React Native support (primary mobile platform)
- Vendor dependency; Clerk pricing scales with MAU
- JWT-based, JWKS endpoint for validation in any backend language
- User auth separate from device auth (ADR-008)
Alternatives Considered
- Auth0: enterprise-flavored, expensive, no advantage for B2C consumer
- Supabase Auth: awkward without broader Supabase adoption
- Firebase Auth: Google ecosystem lock-in, Clerk has better consumer mobile UX
- Better Auth (self-hosted): self-host complexity wrong trade at stage
- Keep Sa-Token: hiring and maintenance costs long-term exceed migration cost
ADR-008: Device authentication via bootstrap secret + JWT rotation
Status: Approved
Context Collars and docks need authentication to backend separate from user auth. Full mTLS PKI is overkill for consumer IoT startup scale and operationally complex. Pre-shared secrets must not be fleet-wide (one leak compromises all). X.509 cert rotation is painful.
Decision Each device provisioned at manufacture with unique device ID + random bootstrap secret, logged to provisioning database. On first connection, device presents bootstrap credential; backend validates, issues device-specific JWT for ongoing use. JWT rotates on firmware update. Revocation integrated with unpair, factory reset, and account deletion flows.
Consequences
- Most of the security value of mTLS at a fraction of the operational cost (rough estimate: 90% of the value for 20% of the cost)
- Provisioning database is a new secret store that must be protected
- Factory manufacturing flow must include secure bootstrap credential burn-in
- Dual-auth model (user JWT + device JWT) requires clear separation in backend code
- Revisit at enterprise/HIPAA compliance or hardware-security-module product direction
Alternatives Considered
- Full mTLS with factory-signed client certificates: operationally expensive, cert rotation painful
- Fleet-wide master secret with per-device key derivation: a leak of the master secret compromises the whole fleet; rejected
- Per-device username/password: essentially equivalent but less standard than JWT
ADR-009: Stripe multi-entity instead of Paddle
Status: Approved
Context Company is incorporated in Delaware, Thailand, and Singapore. Global subscription sales raise tax compliance issues (VAT, GST, Digital Service Tax). Paddle as merchant-of-record handles compliance but charges ~5% + $0.30 per transaction.
Decision Stripe across three entities:
- US sales via Delaware entity (Stripe Tax for US state sales tax)
- Thailand sales via Thai entity (Thai VAT handled directly)
- Global subscriptions via Singapore entity (Singapore GST + local registrations as expansion occurs)
Paddle remains fallback if expansion outpaces compliance capacity.
Consequences
- Better margins than Paddle (2-3% savings on subscription revenue)
- Three Stripe accounts, three compliance footprints — operational overhead
- Using the Singapore entity as merchant of record for rest-of-world sales removes the need for Paddle
- Stripe Tax covers automated calculation but legal compliance remains company responsibility
- Hardware (physical goods) cleaner on Stripe than Paddle regardless
Alternatives Considered
- Paddle for all subscriptions: 5% fee long-term exceeds compliance cost savings given Singapore entity
- Stripe Delaware-only, defer global: limits TAM, rejected
- Paddle for subs + Stripe for hardware: operational complexity not justified given Singapore entity option
ADR-010: C for Apollo3 firmware
Status: Approved
Context Apollo3 Blue firmware needed. Ambiq SDK is C. neuralSPOT (AI SDK) is C. TFLM is C++, reachable from C through a thin extern "C" wrapper. CMSIS-NN is C. Every reference implementation and vendor support path assumes C.
Decision Firmware is C. Use vendor SDK (AmbiqSuite) and neuralSPOT. No new language introduction at this silicon.
Consequences
- Vendor support and reference material directly applicable
- Hiring pool for embedded C is broad
- Standard toolchain (GCC ARM, CMake)
- Memory safety concerns mitigated by static analysis, code review, and rigorous testing
- Cannot leverage Rust’s safety guarantees — accept
Alternatives Considered
- Rust: embedded-hal ecosystem immature for Apollo3, vendor SDK wrapping negates safety benefits
- Zig: pre-1.0, toolchain immaturity, hobbyist embedded ecosystem
- C++: viable, but CMSIS-NN is already C and TFLM can be called from C through a thin wrapper; no meaningful benefit
ADR-011: On-device TFLM via Ambiq neuralSPOT
Status: Provisional — pending model size validation
Context Target is on-device inference of TCN + Transformer + CNN + Cross-Attention model within ~664KB flash budget on Apollo3 Cortex-M4F, 768KB RAM (minus BLE stack and app). TensorFlow Lite for Microcontrollers is industry standard for MCU inference. Ambiq’s neuralSPOT builds on TFLM with Apollo-optimized kernels.
Decision Use neuralSPOT as base runtime. Edge Impulse EON Compiler considered for further optimization. Custom kernels for Transformer attention blocks where TFLM operators are insufficient.
Consequences
- Vendor-supported path, established for similar silicon/workloads
- Pipeline: PyTorch → ONNX → TFLite (INT8 via QAT) → neuralSPOT integration → Apollo3 binary
- Accuracy loss at quantization must be validated on real silicon, not just simulator
- Aggressive quantization (INT4 weights) may be required
- If model does not fit budget, architecture reduction or hardware change (Apollo4/5 with NN accelerator) required
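The flash budget translates into a hard ceiling on parameter count, which is why INT4 weights come up above. A back-of-the-envelope sketch, assuming "KB" means 1,024 bytes and that the whole ~664KB budget goes to weights (in reality runtime code, operator metadata, and quantization scales also consume flash, so real ceilings are lower):

```go
package main

// maxParams returns the largest number of weights that fit in budgetKB of
// flash at the given bit width. It is an upper bound only: it ignores
// runtime code, graph metadata, and per-tensor scales and zero points.
func maxParams(budgetKB, bitsPerWeight int) int {
	return budgetKB * 1024 * 8 / bitsPerWeight
}
```

With the 664KB figure from the Context, `maxParams(664, 8)` gives 679,936 weights at INT8, `maxParams(664, 4)` gives 1,359,872 at INT4, and `maxParams(664, 32)` gives 169,984 at float32. A model much above roughly 0.68M parameters therefore cannot fit at INT8 regardless of kernel optimizations.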
Alternatives Considered
- ExecuTorch (PyTorch Edge): newer, less mature on Apollo3, watch but don’t bet
- Raw CMSIS-NN: hand-rolled, no runtime, rejected as engineering time sink
- Cloud inference fallback: bandwidth, battery, and latency costs unacceptable for collar
ADR-012: PostHog for product analytics and feature flags
Status: Approved
Context Need product analytics (event tracking), feature flags (for strangler rollout and paid tier gating), session replay (for UX debugging). Multiple tools possible; consolidation preferred to reduce SaaS sprawl.
Decision PostHog Cloud for all three. Core events defined in React Native app. Feature flags exposed to both app and backend. Session replay enabled selectively with privacy review.
Consequences
- Single vendor for product analytics, flags, experiments, session replay
- Free tier covers pilot phase (1M events/month)
- Self-host option exists if scale or data residency requires
- Separate from infrastructure observability (LGTM) — different tools for different jobs
- Experiment/A-B testing capability available when user count supports statistical power
Alternatives Considered
- Mixpanel + LaunchDarkly + LogRocket: three vendors, higher combined cost, no integration advantage
- Amplitude + Statsig: similar split-vendor issues
- Self-hosted analytics (Plausible, Umami): doesn’t cover flags or session replay
ADR-013: Memfault for firmware observability
Status: Approved
Context Firmware bugs in production are opaque without crash reporting. Memfault provides coredump reporting, firmware metrics, fleet-wide dashboards, OTA orchestration. Best-in-class for MCU fleet observability, supports Apollo3.
Decision Memfault integrated into firmware before 5,000-user scale. Crash reports, battery metrics, BLE health, sync failure rates surfaced to fleet dashboards. Critical alerts piped to LGTM via webhook for unified alerting.
Consequences
- Firmware engineering effort to integrate SDK (~1-2 weeks)
- Vendor cost scales with device count
- Irreplaceable observability for in-field firmware debugging
- Custom telemetry alternatives would be underbuilt and underinvested relative to Memfault’s feature set
- Non-negotiable for shipping firmware at scale
Alternatives Considered
- Custom crash reporting: months of engineering for inferior result
- Percepio Tracealyzer: dev-time tool, not production telemetry
- No firmware observability: unacceptable at 5,000+ unit scale
ADR-014: GitHub Actions + Railway native deploys through year 1
Status: Approved
Context CI/CD pipeline needed for backend and mobile. CTO has experience with Buildkite + Argo Rollouts at $7B startup scale and GitLab CI at prior roles. GitHub reliability has had issues recently. Team is small; operational overhead of self-managed CI unjustified.
Decision GitHub Actions for CI, Railway native integration for CD. Monitor GitHub reliability. Migrate to GitLab CI if GitHub reliability degrades materially; migrate to Buildkite + Argo Rollouts when moving to Kubernetes/AWS at scale.
Consequences
- Zero operational overhead for CI infrastructure
- Railway auto-deploys on main merge — appropriate simplicity for stage
- Staged rollouts via feature flags (PostHog) not infrastructure canary
- GitHub outages do affect dev velocity; have a migration plan
- Buildkite/Argo eventual migration in year 2-3 alongside K8s transition
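Flag-based staged rollout, as opposed to infrastructure canaries, rests on assigning each user to a stable bucket. The sketch below shows the general idea with a generic hash; it is not PostHog's algorithm or API, and the flag key and user ID format are hypothetical.

```go
package main

import "hash/fnv"

// inRollout deterministically maps a (flag, user) pair to one of 100
// buckets and reports whether that bucket is below the rollout percentage.
// The same user always lands in the same bucket for a given flag, so
// raising the percentage only ever adds users, never removes them.
func inRollout(flagKey, userID string, percent uint32) bool {
	h := fnv.New32a()
	h.Write([]byte(flagKey + ":" + userID))
	return h.Sum32()%100 < percent
}
```

Moving a flag from 10 to 50 to 100 percent then widens exposure without a redeploy, and setting it back to 0 acts as the rollback.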
Alternatives Considered
- Buildkite from day one: operational overhead not justified at team size
- GitLab CI from day one: viable but no compelling switch from GHA given established workflow
- CircleCI: no advantage over GHA
- Self-hosted Jenkins: rejected, operational tax
ADR-015: REST for mobile API, gRPC for internal services
Status: Approved
Context Mobile-to-backend contract needs stable, debuggable, tooling-friendly protocol. Internal service-to-service calls benefit from efficient, strongly-typed RPC. CTO has experience with both REST and gRPC at scale.
Decision REST with OpenAPI for mobile-to-backend. URL versioning (/v1/). Additive-only changes until major version. gRPC for service-to-service calls between Go services when multiple services exist. Protobuf schemas versioned in monorepo or shared schema repo.
Consequences
- Mobile gets standard REST tooling, caching, debugging
- Internal services get efficient RPC with strong typing
- Two contract systems (OpenAPI for REST, Protobuf for gRPC) — acceptable overhead
- Contract testing enforced in CI for both
- Breaking change detection (oasdiff for OpenAPI, buf for Protobuf)
Alternatives Considered
- gRPC for mobile: worse tooling and debugging, and gRPC-Web still requires a gateway; rejected
- GraphQL: unnecessary complexity for bounded query patterns, caching complexity
- REST for internal services: less efficient, weaker typing, rejected where gRPC fits
Last updated: during CTO work trial. Update when new ADRs are added or status changes.