Buttons > blueprints: UI-first automation with real-time status
Position
Self-service doesn’t have to mean Service Catalog or a Terraform PR. For multi-step, failure-prone jobs—like VPN-as-a-Service, pre-staging equipment, and new site turn-ups—a tiny web UI with real-time status between steps ships faster, fails safer, and keeps costs tied to the workload.
This is not anti-IaC. Keep your golden plumbing (VPCs, TGWs, base routes, IAM) in Terraform/CDK. Use a UI-triggered runbook for app/site-specific work. Successful runs can still write back tfvars/outputs to keep your code of record clean.
Why a UI beats a catalog (sometimes)
Human inputs need validation. Peer IPs, ASNs, route lists, site names—validate in the form before you start.
External dependencies happen. If you’re waiting on a firewall change or carrier handoff, users should see where it’s stuck, not a generic “IN_PROGRESS.”
Checkpoint control. Create → attach → propagate → verify. If step 3 fails, retry step 3 only.
Explainability. Stream the step, log line, resource ID, and “what to fix” directly to the page.
Architecture (minimal, AWS-native)
UI: Static (S3 + CloudFront). One form. One progress pane.
Auth: Gate with Identity Center/Cognito/IAM—whatever you already run.
Orchestration: Step Functions state machine emitting an event per step.
Workers: Small Lambdas: validate, issue/export certs, generate vendor configs, health checks.
Live status: EventBridge → Lambda → API Gateway WebSocket → browser (no refresh); a broadcaster sketch follows this list.
State & audit: DynamoDB tables for runs, run_events, and WebSocket connections. CloudWatch logs with who/what/when.
Artifacts: S3 /routerconfigs/<site> (certs, configs).
Ownership & TCO: Tag everything (Workload, Owner, Env, CostCenter) so CUR + Athena can attribute network costs to the workload.
Flow in one paragraph:
The user submits site details → the UI calls a “start” API (Lambda) → we kick off a Step Functions execution with a run_id. The browser opens a WebSocket (?run_id=...). As each state runs (validate, certs, CGW, VPN, attach/propagate, vendor config, health checks), Step Functions emits events; a Lambda broadcasts them to the browser. If a step fails, the user sees which one and why, then retries just that step.
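A minimal sketch of that “start” Lambda, assuming an API Gateway proxy integration and a STATE_MACHINE_ARN environment variable (both names illustrative):

import json
import os
import uuid
from datetime import datetime, timezone

import boto3

sfn = boto3.client("stepfunctions")

def handler(event, context):
    """'Start' API: mint a run_id and kick off the state machine."""
    site = json.loads(event["body"])  # validated in the form; re-validate server-side in practice
    run_id = f"vpn-{datetime.now(timezone.utc):%Y-%m-%d-%H%MZ}-{uuid.uuid4().hex[:4]}"
    sfn.start_execution(
        stateMachineArn=os.environ["STATE_MACHINE_ARN"],
        name=run_id,  # execution names are unique, so a double-click can't start two runs
        input=json.dumps({"run_id": run_id, **site}),
    )
    return {"statusCode": 202, "body": json.dumps({"run_id": run_id})}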
VPN-as-a-Service (concrete runbook)
Validate (peer IP, ASN, routing mode, routes).
Issue & export CGW certificate (ACM Private CA) → stash in S3 under /routerconfigs/<site>/certs/*.
Create Customer Gateway.
Create VPN Connection (to a TGW or VGW); steps 3 and 4 are sketched after this list.
Attach + propagate to the right TGW route tables.
Generate vendor config (e.g., Juniper SRX / PAN / Forti) from a template that references your exported certs and parameters.
Health checks (tunnels, IKEv2, BGP, optional reachability).
Emit outputs (tunnel IPs, cert paths, route tables, config links).
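A hedged sketch of steps 3 and 4 in boto3, with certificate-based auth instead of a pre-shared key (parameter choices are illustrative):

import boto3

ec2 = boto3.client("ec2")

def create_vpn(peer_ip: str, asn: int, tgw_id: str, cert_arn: str) -> dict:
    """Runbook steps 3-4: Customer Gateway, then a cert-authenticated VPN to the TGW."""
    cgw = ec2.create_customer_gateway(
        BgpAsn=asn,
        PublicIp=peer_ip,
        Type="ipsec.1",
        CertificateArn=cert_arn,  # certificate-based auth instead of a PSK
    )["CustomerGateway"]
    vpn = ec2.create_vpn_connection(
        CustomerGatewayId=cgw["CustomerGatewayId"],
        Type="ipsec.1",
        TransitGatewayId=tgw_id,
        Options={"StaticRoutesOnly": False},  # BGP; flip to True for static routing mode
    )["VpnConnection"]
    return {"cgw_id": cgw["CustomerGatewayId"], "vpn_id": vpn["VpnConnectionId"]}

Step 5 then looks up the VPN’s TGW attachment ID and calls associate_transit_gateway_route_table and enable_transit_gateway_route_table_propagation against the right route tables.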
Real-time event (what the UI receives):
{
  "run_id": "vpn-2025-08-12-1830Z-9f2a",
  "step": "create_vpn_connection",
  "phase": "running",
  "message": "Creating VPN to TGW tgw-0abc...",
  "percent": 42,
  "workload": "payments-prod",
  "site": "companyname-denver-01",
  "outputs": { "vpn_id": "vpn-0def..." }
}
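One way a worker step might publish that payload (a sketch; the bus and source names are assumptions):

import json

import boto3

events = boto3.client("events")

def emit(run_id: str, step: str, phase: str, message: str, **extra) -> None:
    """Publish a step event; an EventBridge rule routes it to the WebSocket broadcaster."""
    events.put_events(Entries=[{
        "EventBusName": "runbook-events",  # assumed custom bus
        "Source": "runbook.vpn",           # assumed source string
        "DetailType": "StepStatus",
        "Detail": json.dumps({"run_id": run_id, "step": step, "phase": phase,
                              "message": message, **extra}),
    }])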
Certificate-based VPN: make AWS samples actually usable
AWS supports certificate-based auth for Site-to-Site VPN via AWS Private CA. The docs outline what’s required, but the downloadable device configs tend to be PSK-first and aren’t drop-in for cert flows. This pattern closes that gap:
Issue the right cert, export it, and store it under /routerconfigs/<site>/certs/ with the chain and a passphrase file (a sketch follows this list).
Transform AWS samples into real vendor configs using templates (SRX/PAN/Forti) that reference your certs and the correct IKE/IPsec parameters for your environment.
Validate the chain and IKEv2/IPsec parameters; if a field is wrong, the step fails fast and the UI shows the exact error.
Rotate cleanly with a visible, gated step (optional approval pause).
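A sketch of the issue-export-render path, assuming Jinja2 templates shipped alongside the function (the template path, vendor names, and params dict are illustrative):

import boto3
from jinja2 import Template

acm = boto3.client("acm")
s3 = boto3.client("s3")

def export_and_render(cert_arn: str, site: str, bucket: str, vendor: str, params: dict) -> str:
    """Export the private cert to /routerconfigs/<site>/certs/, then render the vendor config."""
    exported = acm.export_certificate(
        CertificateArn=cert_arn,
        Passphrase=params["passphrase"].encode(),  # protects the exported private key
    )
    prefix = f"routerconfigs/{site}/certs"
    for name, body in [
        ("device.crt", exported["Certificate"]),
        ("chain.crt", exported["CertificateChain"]),
        ("device.key.enc", exported["PrivateKey"]),
    ]:
        s3.put_object(Bucket=bucket, Key=f"{prefix}/{name}", Body=body.encode())
    with open(f"templates/{vendor}.j2") as f:  # e.g. srx.j2, pan.j2, forti.j2
        config = Template(f.read()).render(site=site, cert_prefix=prefix, **params)
    key = f"routerconfigs/{site}/{vendor}.conf"
    s3.put_object(Bucket=bucket, Key=key, Body=config.encode())
    return key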
Result: a cookie-cutter turn-up you can scale—pre-stage gear, ship it, click Create VPN, watch live status, and fix only what bumps in the night.
Where this shines: pre-staging + new site turn-ups
Pre-staging equipment. Issue device certs, render vendor configs, push test payloads to S3, and hand the installer a link. Artifacts + logs update live.
New site turn-ups. Collect IP/ASN/routes → build VPN → attach/propagate → health checks—while the team watches each step complete.
If a firewall rule is missing, the page shows the failing step and you retry that step only.
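A minimal tunnel health check for that last step (BGP and reachability probes would layer on top):

import boto3

ec2 = boto3.client("ec2")

def vpn_health(vpn_id: str) -> dict:
    """Read tunnel telemetry; the UI turns each tunnel green as it reports UP."""
    conn = ec2.describe_vpn_connections(VpnConnectionIds=[vpn_id])["VpnConnections"][0]
    tunnels = {t["OutsideIpAddress"]: t["Status"] for t in conn.get("VgwTelemetry", [])}
    return {
        "vpn_id": vpn_id,
        "tunnels": tunnels,  # e.g. {"52.0.0.1": "UP", "52.0.0.2": "DOWN"}
        "healthy": bool(tunnels) and all(s == "UP" for s in tunnels.values()),
    }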
Small workflows: Lambda + web + SQS (simple and cheap)
For straight-line jobs (<10 steps) where Step Functions would be overkill:
Authenticated Web UI → start Lambda → SQS job → worker Lambda
Optional: the worker pushes updates to the same WebSocket so the page stays live.
Pros: tiny footprint, low cost, easy to reason about.
Cons: you own the control flow and retries (trivial for small jobs).
One-liner takeaway: For smaller workflows, an authenticated web frontend + Lambda starter + SQS worker (with optional WebSocket for live status) works extremely well.
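A sketch of the whole pattern in two handlers (the queue URL env var and run_job are placeholders):

import json
import os

import boto3

sqs = boto3.client("sqs")

def start(event, context):
    """Authenticated 'start' Lambda: accept the form, enqueue the job."""
    sqs.send_message(
        QueueUrl=os.environ["JOB_QUEUE_URL"],
        MessageBody=event["body"],  # validated form fields
    )
    return {"statusCode": 202, "body": json.dumps({"queued": True})}

def worker(event, context):
    """SQS-triggered worker: one straight-line job per message; failures are retried by SQS."""
    for record in event["Records"]:
        run_job(json.loads(record["body"]))

def run_job(job: dict) -> None:
    ...  # do the work, optionally pushing WebSocket updates between steps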
Where the UI-first pattern fits—and where it doesn’t
| Pattern | Use UI-first when… | Prefer Catalog/Terraform when… |
| --- | --- | --- |
| Pre-staging hardware | Per-site inputs; artifact delivery; human checks | Bulk enrollment of identical devices |
| New site turn-up | External dependencies; live visibility; checkpoint retries | Stamping the same site module at scale |
| Cert issuance/rolls | Approvals; human handoff; artifact delivery | Fully automated rotation with strict SLOs |
| TGW route tweaks | Risky one-offs; rapid rollback | Baseline topology & standard route tables |
| Private endpoints | Per-workload flips with validation | Org-wide defaults and enforcement |
| Temporary access | Time-boxed grants with audit | Long-lived identity plumbing |
| Baseline VPC/Cluster | — | Always code it; GitOps owns lifecycle |
Rule of thumb: if humans, external systems, or long-running checks are involved—and you want real-time explainability—a tiny UI + events beats waiting on a PR queue.
Assortment of automation opportunities (beyond VPN)
Pre-stage hardware: issue device certs, render configs, publish installer packets.
New site turn-up: collect IP/ASN/routes, create VPN/TGW wiring, health checks, handoff artifacts.
TGW attachment hygiene: enable per-AZ subnets for attachments; verify no cross-AZ hairpins.
PrivateLink producer/consumer setup: create service, authorize principals, smoke-test endpoints.
Private endpoints rollout: S3/DynamoDB/ECR/STS/SSM endpoints with route and policy validation.
Route change “safe apply”: impact analysis, approval pause, apply, reachability tests, auto-rollback.
Certificate issuance/rotation: PCA issuance, device/app cert packaging, staged rollout with checks.
Just-in-time access: time-boxed IAM role grants or SSM Session Manager access with auto-revoke.
EKS namespace onboarding: create namespace, quotas, RBAC, run lint/smoke tests, emit kubeconfig.
AMI patch & bake: launch test host, run SSM tests, bake, promote, publish image details.
Windows on Spot (hibernate): interruption prep, hibernate/resume checks, post-resume validation.
S3 bucket bootstrap: bucket + KMS + lifecycle + replication + access policy, then drift checks.
SageMaker/Bedrock access: request flow with pre-canned guardrails and cost/accountability tags.
“Ask your bill” actions: Slack bot triggers curated runbooks (e.g., enable endpoints, fix AZ misalignment) with receipts.
Guardrails you want on day one
Idempotency everywhere. Each step uses tokens keyed by run_id and input so retries don’t duplicate resources (see the sketch after this list).
Rollback flow. A “Delete” action that reverses steps in order; leaves logs/artifacts intact.
Change windows + approvals. Optional WaitForTaskToken step before the risky part.
Quotas. Per-workload concurrency caps and daily limits.
Least privilege. Per-function IAM; encrypt artifacts; scope outbound.
Ownership tags. Workload, Owner, Env, CostCenter on everything created.
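A minimal idempotency-token sketch against DynamoDB (the table and key names are illustrative):

import boto3
from botocore.exceptions import ClientError

steps = boto3.resource("dynamodb").Table("run_steps")  # assumed table, pk = "run_id#step#input_hash"

def claim_step(run_id: str, step: str, input_hash: str) -> bool:
    """True if this attempt owns the step; False if an identical attempt already ran."""
    try:
        steps.put_item(
            Item={"pk": f"{run_id}#{step}#{input_hash}"},
            ConditionExpression="attribute_not_exists(pk)",  # first writer wins; retries become no-ops
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False
        raise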
What good looks like in minutes
You’re turning up or connecting new sites in seconds or minutes, not days—because the code lives centrally, and the UI is simple for teams who don’t live in IaC.
Ops pre-stages certs/configs for companyname-denver-01.
The on-site installer brings up tunnels, watches health checks turn green in the UI, and downloads the vendor config from S3.
Finance asks “what changed?” You paste one link: the run page and its log.
Bottom line
Service Catalog and Terraform are still the right tools for baseline infrastructure. But for multi-step jobs with human inputs, external dependencies, and failure modes you actually want to see, a tiny UI + evented runbook is the fastest, safest way to deliver self-service—and the most scalable way to turn vague docs and generic samples into cookie-cutter, vendor-ready configs with real-time status and clear cost ownership.