Prometheus

CI/CD & Production Infrastructure for a Social App

Production Infrastructure & CI/CD for a Social App Launch

Client
Puzzle Master, a social matching platform

Challenge
The startup had a production-ready Nest.js backend and Angular frontend, but zero infrastructure: deployments were manual, there was no CI/CD, no monitoring, no backups, and no separation between dev and prod environments. The goal was to build a complete DevOps stack from scratch before the public launch.

Solution

1. Application Containerization
- Multi-stage Dockerfile for backend (Nest.js + Prisma, non-root user)
- Multi-stage Dockerfile for frontend (Angular 12, legacy OpenSSL, Nginx for static assets)
- Docker Compose full stack: PostgreSQL 15, Redis 7, imgproxy, Nginx
- Healthchecks and depends_on for correct startup ordering
- Isolated dev and prod environments in /opt/dev and /opt/prod

2. GitLab CI/CD
- Migration of repository from Bitbucket to GitLab
- Pipeline for backend and frontend: build → push → deploy
- GitLab Container Registry for Docker image storage
- Automatic deploy to dev on every push; manual trigger for prod
- SSH deployment to VPS via SSH_PRIVATE_KEY

3. Nginx Reverse Proxy
- Environment-agnostic config via envsubst for dev/prod parity
- SSL/TLS (TLSv1.2, TLSv1.3) with Cloudflare certificates
- Routing: /api/* → backend:4000, /* → frontend:80
- www → root domain redirect (301)
- Separate imgproxy stack with SSL termination

4. Security (Ansible)
- Server hardening via Ansible: SSH key-only auth, root login disabled
- UFW firewall: only ports 80, 443, and custom SSH open
- Database accessible only via SSH tunnel
- All secrets stored in GitLab CI/CD variables

5. Monitoring
- Prometheus + Grafana with automated dashboard provisioning
- Exporters: Node, cAdvisor, Postgres, Redis, Nginx, Blackbox
- 5 Grafana dashboards: server, Docker containers, PostgreSQL, Redis, Nginx
- Alertmanager with Slack/webhook integration; alerts on CPU/RAM/Disk/API/SSL

6. Database Backups
- Automated pg_dump every hour
- gzip compression and upload to S3-compatible object storage (Cloudflare R2)
- Prometheus backup metrics: success status, size, timestamp
- Alerts: DatabaseBackupMissing, DatabaseBackupFailed, DatabaseBackupSizeAnomaly

Technologies
GitLab CI, Docker, Ansible, Prometheus, Nginx, PostgreSQL

Results
✅ Deploy: git push to main → automatic build and deploy to server
✅ Environments: full dev/prod isolation on a single VPS
✅ Monitoring: 5 dashboards, alerts across 6 categories
✅ Backups: automated hourly pg_dump to Cloudflare R2
✅ Security: UFW, key-based SSH, database inaccessible from outside
✅ Scalability: architecture ready for database extraction to a dedicated server
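
The backup metrics mentioned under Database Backups (success status, size, timestamp) could be produced along the following lines. This is a minimal sketch, not the project's actual script: the case study does not say how the metrics reach Prometheus, so the sketch assumes they are written in the Prometheus text exposition format (for example, for a textfile collector). The metric names `db_backup_success`, `db_backup_size_bytes`, and `db_backup_last_timestamp_seconds`, the helper names, and the two-hour staleness threshold are all hypothetical.

```python
def render_backup_metrics(success: bool, size_bytes: int, finished_at: float) -> str:
    """Render one backup run as Prometheus text exposition format.

    Metric names are illustrative assumptions, not taken from the case study.
    """
    lines = [
        "# HELP db_backup_success 1 if the last backup succeeded, 0 otherwise.",
        "# TYPE db_backup_success gauge",
        f"db_backup_success {1 if success else 0}",
        "# HELP db_backup_size_bytes Size of the last compressed dump in bytes.",
        "# TYPE db_backup_size_bytes gauge",
        f"db_backup_size_bytes {size_bytes}",
        "# HELP db_backup_last_timestamp_seconds Unix time the last backup finished.",
        "# TYPE db_backup_last_timestamp_seconds gauge",
        f"db_backup_last_timestamp_seconds {int(finished_at)}",
    ]
    # The exposition format expects a trailing newline after the last sample.
    return "\n".join(lines) + "\n"


def is_backup_missing(last_timestamp: float, now: float,
                      max_age_seconds: int = 2 * 3600) -> bool:
    """Return True when the newest backup is older than the allowed age.

    With hourly backups, a two-hour threshold (an assumption) tolerates
    one delayed run before a DatabaseBackupMissing-style alert would fire.
    """
    return now - last_timestamp > max_age_seconds


if __name__ == "__main__":
    # Example: a successful 1 MiB dump that finished at a fixed Unix time.
    print(render_backup_metrics(True, 1_048_576, 1_700_000_000.0), end="")
```

In a real setup the staleness check would more likely live in a Prometheus alerting rule that compares the timestamp gauge with the current time; the Python helper only illustrates the comparison.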


Prometheus + Grafana Monitoring Stack

Observability Stack for Microservices Architecture

Client
Early-stage startup

Challenge
After migrating to a microservices architecture (15+ services), the team had no centralized monitoring in place. Issues were discovered only through user complaints, typically 30+ minutes after they occurred. A full observability stack was needed to detect and diagnose problems proactively.

Solution

1. Monitoring Architecture
- Prometheus for metrics collection
- Grafana for visualization
- Loki for centralized log aggregation
- Jaeger for distributed tracing
- Alertmanager for notifications

2. Metrics Collection
- Automatic service discovery in Kubernetes
- Application-level custom metrics
- System metrics via node-exporter
- Database metrics via postgres-exporter and redis-exporter

3. Grafana Dashboards
- Per-service dashboards for each microservice
- Unified infrastructure overview dashboard
- SLA/SLO tracking metrics
- Business metrics (RPS, conversion rate)

4. Centralized Logging (Loki)
- Log aggregation across all services
- Full-text log search via Grafana
- Log-to-metric correlation

5. Distributed Tracing (Jaeger)
- HTTP request tracing across services
- Call chain visualization
- Bottleneck identification
- Per-service latency analysis

6. Alerting
- Alerts delivered to Slack / PagerDuty / custom webhooks
- Critical issue escalation
- On-call rotation support
- Automatic incident creation

Technologies
Prometheus, Grafana, Kubernetes, Docker, Helm, Linux

Results
✅ MTTD: reduced from 30 minutes to under 1 minute
✅ MTTR: recovery time reduced by 60%
✅ Alerts: proactive notifications before users are impacted
✅ Visibility: full observability across all services
✅ Capacity planning: data-driven resource forecasting
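
The SLA/SLO tracking mentioned under Grafana Dashboards usually comes down to error-budget arithmetic. The sketch below shows that calculation in isolation. It is an illustration only: the case study does not state which SLO targets or formulas were used, so the function name, the 99% target, and the request counts in the example are hypothetical.

```python
def error_budget_remaining(slo_target: float, total_requests: int,
                           failed_requests: int) -> float:
    """Return the fraction of the error budget still unspent.

    1.0 means no failures so far, 0.0 means the budget is exactly used up,
    and a negative value means the SLO has been violated.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests == 0:
        # No traffic yet, so nothing has been spent.
        return 1.0
    # Number of failures the SLO tolerates over this request volume.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Hypothetical numbers: a 99% availability SLO over 1,000 requests
    # tolerates about 10 failures, so 5 failures spend half of the budget.
    print(round(error_budget_remaining(0.99, 1000, 5), 3))
```

In a Prometheus-based stack the same ratio would typically be computed from request counters in a recording rule and plotted in Grafana, rather than in application code.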
