Problem Context
The goal was to create small accountability groups (4-6 people) focused on specific habits (quitting smoking, waking up early, etc.). The system needed:
- Real-time messaging (to create a sense of presence)
- Group state synchronization (typing indicators, online status)
- Automated moderation via AI to reduce manual oversight
- Persistence for messages and user progress
Constraint: This had to work reliably within the free-tier limits of the hosting providers (Vercel, Neon Postgres).
Initial Approach
The first version used HTTP polling every 2 seconds to fetch new messages. This worked but had problems:
What didn't work:
- Polling delay felt sluggish (2-second lag before seeing new messages)
- Server costs would scale poorly with more users
- No way to show "User is typing..." without spamming requests
- Connection state was ambiguous (was the user offline or just slow to poll?)
The system needed WebSockets for true real-time communication.
Design Decisions & Trade-offs
Real-Time Layer: Socket.IO on Next.js Custom Server
Next.js doesn't support WebSockets natively in the App Router. Options:
- Separate WebSocket server: More flexible but adds deployment complexity
- Custom Next.js server with Socket.IO: Keeps everything in one codebase
We chose Option 2 (custom server) because deployment simplicity mattered more than perfect separation of concerns.
Trade-off: This meant ejecting from Vercel's serverless model and self-hosting on Railway. We lost automatic scaling but gained persistent WebSocket support.
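Roughly, the custom server looks like this (a simplified sketch, not the exact production code; the file name, port, and handler bodies are placeholders):

```ts
// server.ts — minimal sketch of a custom Next.js server with Socket.IO attached.
import { createServer } from "http";
import next from "next";
import { Server } from "socket.io";

const dev = process.env.NODE_ENV !== "production";
const app = next({ dev });
const handle = app.getRequestHandler();

app.prepare().then(() => {
  // One HTTP server serves both Next.js pages and the WebSocket upgrade.
  const httpServer = createServer((req, res) => handle(req, res));
  const io = new Server(httpServer);

  io.on("connection", (socket) => {
    // Real-time handlers (join pod, send message, typing, presence) register here.
    socket.on("disconnect", () => {
      // Presence cleanup happens here (see the state-sync section below).
    });
  });

  httpServer.listen(3000, () => console.log("Ready on http://localhost:3000"));
});
```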
State Synchronization Challenge
Each "Pod" (group) needed to track:
- Who is currently online
- Who is typing
- Unread message counts per user
Problem: A WebSocket connection carries no application context of its own. If a user refreshes the page, the old socket is gone and the server no longer knows who they were or which Pod they belonged to.
Solution:
- On connection, the client sends user_id and pod_id
- The server maintains an in-memory mapping: { socket_id -> { user_id, pod_id } }
- On disconnect, remove the mapping and broadcast the updated online status
What went wrong initially: Rapid reconnections (a user switching tabs, a flaky mobile network) caused duplicate entries in the mapping. Solution: debounce disconnect events for 3 seconds before marking a user offline.
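A minimal sketch of the presence mapping and the debounced disconnect (event names, payload shapes, and the helper structure are illustrative, not the exact production code):

```ts
import { Server, Socket } from "socket.io";

type Presence = { userId: string; podId: string };

const presence = new Map<string, Presence>();                          // socket.id -> { userId, podId }
const offlineTimers = new Map<string, ReturnType<typeof setTimeout>>(); // userId -> pending "offline" timer

export function registerPresence(io: Server, socket: Socket) {
  // The client identifies itself right after connecting.
  socket.on("presence:join", ({ userId, podId }: Presence) => {
    presence.set(socket.id, { userId, podId });

    // A quick reconnect cancels any pending "offline" broadcast for this user.
    const timer = offlineTimers.get(userId);
    if (timer) {
      clearTimeout(timer);
      offlineTimers.delete(userId);
    }

    socket.join(podId);
    io.to(podId).emit("presence:online", { userId });
  });

  socket.on("disconnect", () => {
    const entry = presence.get(socket.id);
    if (!entry) return;
    presence.delete(socket.id);

    // Debounce: only mark the user offline if they don't reconnect within 3 seconds.
    offlineTimers.set(
      entry.userId,
      setTimeout(() => {
        offlineTimers.delete(entry.userId);
        io.to(entry.podId).emit("presence:offline", { userId: entry.userId });
      }, 3000)
    );
  });
}
```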
Database Schema: PostgreSQL with Prisma
The schema evolved significantly during development.
Initial schema (too simple):
```prisma
model Message {
  id      String @id
  content String
  userId  String
  podId   String
}
```
This didn't handle:
- Deleted users (orphan messages)
- Pod disbanding (what happens to messages?)
- Message reactions or edits
Final schema (stricter, with explicit relations and delete rules):
```prisma
model Message {
  id        String   @id
  content   String
  createdAt DateTime @default(now())
  author    User     @relation(fields: [authorId], references: [id], onDelete: Cascade)
  authorId  String
  pod       Pod      @relation(fields: [podId], references: [id], onDelete: Cascade)
  podId     String
}
```
The onDelete: Cascade ensures that if a Pod is deleted, all messages are removed automatically. This prevented orphan data.
Schema migration challenge:
Adding the cascade delete required a migration on production data. We had to manually reassign messages whose podId no longer pointed to an existing Pod before the migration could run, along the lines of the sketch below.
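The cleanup was a one-off script roughly like this (a sketch only; the placeholder archive Pod and its id are hypothetical, and the real reassignment target depended on the production data):

```ts
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

async function fixOrphanMessages(archivePodId: string) {
  // Find messages whose podId no longer points at an existing Pod.
  const pods = await prisma.pod.findMany({ select: { id: true } });
  const podIds = new Set(pods.map((p) => p.id));

  const messages = await prisma.message.findMany({ select: { id: true, podId: true } });
  const orphanIds = messages.filter((m) => !podIds.has(m.podId)).map((m) => m.id);

  // Reassign orphans to a placeholder Pod so the foreign-key constraint can be added.
  await prisma.message.updateMany({
    where: { id: { in: orphanIds } },
    data: { podId: archivePodId },
  });
}

fixOrphanMessages("archive-pod-id").finally(() => prisma.$disconnect());
```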
AI Moderation: Llama 3 via OpenRouter
Each Pod has an AI"member" powered by Llama 3. It's configured to intervene when:
- Engagement drops (no messages for 2 hours)
- Conflict arises (detected via sentiment analysis)
Configuration:
- Temperature: 0.1 (consistency over creativity)
- Max tokens: 150 (short, focused responses)
What didn't work: The AI sometimes misread sarcasm as conflict. We added a "humor flag" to messages where users can mark sarcasm, preventing false positives.
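The intervention itself is a standard chat-completion request. A sketch, assuming OpenRouter's OpenAI-compatible chat-completions endpoint; the model slug, system prompt, and function name are illustrative, not the exact production code:

```ts
// Generate a short, low-temperature intervention from recent pod messages.
async function generateIntervention(recentMessages: string[]): Promise<string> {
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "meta-llama/llama-3-70b-instruct", // illustrative slug
      temperature: 0.1, // consistency over creativity
      max_tokens: 150,  // short, focused responses
      messages: [
        {
          role: "system",
          content:
            "You are a supportive accountability-group member. Gently re-engage the group or defuse conflict.",
        },
        { role: "user", content: recentMessages.join("\n") },
      ],
    }),
  });

  const data = await res.json();
  return data.choices[0].message.content;
}
```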
Implementation Notes
Connection Status UI
Users needed to know if they were truly connected or just seeing stale data.
Visual states:
- 🟡 Connecting...
- 🟢 Connected
- 🔴 Disconnected
The status badge updates based on Socket.IO's built-in connection events. We added a "retry connection" button that appears after 10 seconds of disconnection.
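On the client, the badge can be driven by a small hook over Socket.IO's built-in connection events (a sketch; the hook name and retry timing mirror the 10-second rule above):

```ts
import { useEffect, useState } from "react";
import type { Socket } from "socket.io-client";

type Status = "connecting" | "connected" | "disconnected";

export function useConnectionStatus(socket: Socket) {
  const [status, setStatus] = useState<Status>("connecting");
  const [showRetry, setShowRetry] = useState(false);

  useEffect(() => {
    let retryTimer: ReturnType<typeof setTimeout> | undefined;

    const onConnect = () => {
      setStatus("connected");
      setShowRetry(false);
      if (retryTimer) clearTimeout(retryTimer);
    };
    const onDisconnect = () => {
      setStatus("disconnected");
      // Offer a manual retry if we stay disconnected for 10 seconds.
      retryTimer = setTimeout(() => setShowRetry(true), 10_000);
    };

    socket.on("connect", onConnect);
    socket.on("disconnect", onDisconnect);
    return () => {
      socket.off("connect", onConnect);
      socket.off("disconnect", onDisconnect);
      if (retryTimer) clearTimeout(retryTimer);
    };
  }, [socket]);

  return { status, showRetry, retry: () => socket.connect() };
}
```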
Message Persistence vs Real-Time
Messages are:
- Sent via WebSocket immediately (for real-time feel)
- Persisted to Postgres in the background
- Acknowledged back to the sender once saved
Trade-off: If the database write fails (rare but possible), the message appears in the UI but isn't saved. We added a "pending" indicator that clears once the database confirms the write.
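Server-side, the flow looks roughly like this (a sketch; event names, payload shapes, and the explicit id generation are illustrative):

```ts
import { randomUUID } from "crypto";
import { Server, Socket } from "socket.io";
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

export function registerMessageHandlers(io: Server, socket: Socket) {
  socket.on(
    "message:send",
    async (
      payload: { podId: string; authorId: string; content: string },
      ack: (result: { ok: boolean; id?: string }) => void
    ) => {
      // The schema above declares id without a default, so generate one here.
      const id = randomUUID();

      // 1. Broadcast immediately so the pod sees the message in real time.
      io.to(payload.podId).emit("message:new", { id, ...payload, pending: true });

      try {
        // 2. Persist to Postgres in the background.
        const saved = await prisma.message.create({
          data: {
            id,
            content: payload.content,
            authorId: payload.authorId,
            podId: payload.podId,
          },
        });
        // 3. Acknowledge to the sender so the "pending" indicator can clear.
        ack({ ok: true, id: saved.id });
      } catch {
        // The message is visible in the UI but not saved; the sender keeps the pending state.
        ack({ ok: false });
      }
    }
  );
}
```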
Results & Impact
The platform supports 40 active Pods with ~200 total users. Observations:
- Average message latency: <100ms
- Connection drop rate: ~2% (mostly due to poor mobile networks)
- AI intervention is used in ~30% of Pods (more than expected)
What stayed hard:
- Handling users who join multiple Pods (managing multiple WebSocket subscriptions)
- Time zones for scheduled check-ins (storing user timezone was added later)
- Scaling Socket.IO beyond a single server instance (not yet a problem but will need clustering)
What I Would Change
If I were to rebuild this with current knowledge:
- Use Supabase Realtime instead of Socket.IO: Would simplify state synchronization and leverage database subscriptions
- Add message queuing: Right now message persistence is synchronous. A queue would make failures more manageable.
- Better connection recovery: Currently, if a user disconnects and reconnects, they might see duplicate messages. Idempotency keys would fix this, as sketched below.
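A sketch of the idea, not existing code: each outgoing message carries a client-generated key, and incoming messages are de-duplicated against keys already rendered.

```ts
// Client-side sketch: de-duplicate messages across reconnects by idempotency key.
const seen = new Set<string>();

function makeMessage(content: string) {
  return { idempotencyKey: crypto.randomUUID(), content };
}

function shouldRender(message: { idempotencyKey: string }): boolean {
  if (seen.has(message.idempotencyKey)) return false; // already shown before the reconnect
  seen.add(message.idempotencyKey);
  return true;
}
```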
Takeaways
Building real-time social apps taught me that low latency isn't just about speed—it's about perceived responsiveness. A 2-second delay in seeing a message kills conversation flow.
State synchronization is harder than it looks. The in-memory socket mapping worked for our scale, but it wouldn't survive a server restart; a production-grade setup would need Redis or similar.
For anyone building WebSocket apps on Next.js: plan your deployment strategy early. The custom server requirement eliminates some hosting options, and that constraint affects architecture decisions.
