Troubleshooting Flowcharts

Follow the arrows to find the fix


1. Can't Reach a Server

🖥️ Reachability decision tree

Check: Does loopback respond?
ping -c 3 127.0.0.1
✓ Yes ✗ No
Action: Loopback / TCP stack issue: restart networking, check ip addr / ifconfig, and review VPN clients that capture traffic
Yes ↓
Check: Can you ping the default gateway?
ip route | awk '/default/ {print $3}' | xargs ping -c 3
✓ Yes ✗ No
Action: L2 / link problem (cable, Wi‑Fi, NIC driver, wrong VLAN/port): try ethtool <iface> and a different port or cable
Yes ↓
Check: Can you reach the public internet (routing)?
ping -c 3 8.8.8.8
✓ Yes ✗ No
Action: Routing / upstream problem: verify the default route, gateway ARP (arp -n), and the ISP or corporate path; run traceroute 8.8.8.8
Yes ↓
Check: Does the hostname resolve and respond?
ping -c 3 example.com
✓ Yes ✗ No
Action: DNS issue: compare dig @8.8.8.8 example.com with the system resolver; fix /etc/resolv.conf or the DHCP DNS options
Yes ↓
Check: Is the application port reachable on the target IP?
nc -vz 192.168.1.10 443
✓ Yes ✗ No
Action: Host or path firewall: check sudo iptables -L -n / sudo firewall-cmd --list-all and security groups / cloud firewall rules; also rule out a wrong IP or a stopped service
Yes ↓
Check: TLS / app layer (optional)
curl -vI https://example.com
✓ Yes ✗ No
Action: Certificate, SNI, proxy, or app config problem: inspect the curl error, server vhosts, and load balancer health
Yes ↓
Issue narrowed: L3/L4 path OK; focus on app credentials, HTTP errors, or server logs if the service still misbehaves
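
The tree above can be walked by a small script that stops at the first failing layer. This is a minimal sketch, not a definitive tool: the function names (probe, ping_gateway, first_failing_layer) and the TARGET_* variables are illustrative assumptions, and the gateway lookup assumes Linux iproute2.

```shell
#!/bin/sh
# Sketch: walk the reachability tree top-down and name the first failing layer.
# TARGET_IP, TARGET_HOST and TARGET_PORT are illustrative placeholders.
TARGET_IP=${TARGET_IP:-192.168.1.10}
TARGET_HOST=${TARGET_HOST:-example.com}
TARGET_PORT=${TARGET_PORT:-443}

# probe LAYER CMD...: run one check quietly and return its exit status.
# LAYER is only a label; it makes probe easy to stub for a dry run.
probe() {
  shift
  "$@" >/dev/null 2>&1
}

# Ping the first default gateway reported by `ip route` (Linux iproute2).
ping_gateway() {
  gw=$(ip route show default | awk '{print $3; exit}')
  [ -n "$gw" ] && ping -c 3 "$gw"
}

# Print the first failing layer, or "ok" when every check passes.
first_failing_layer() {
  probe loopback ping -c 3 127.0.0.1                     || { echo loopback; return 1; }
  probe gateway  ping_gateway                            || { echo gateway;  return 1; }
  probe routing  ping -c 3 8.8.8.8                       || { echo routing;  return 1; }
  probe dns      ping -c 3 "$TARGET_HOST"                || { echo dns;      return 1; }
  probe port     nc -vz -w 3 "$TARGET_IP" "$TARGET_PORT" || { echo port;     return 1; }
  probe tls      curl -sI "https://$TARGET_HOST"         || { echo tls;      return 1; }
  echo ok
}
```

Running first_failing_layer prints one word (loopback, gateway, routing, dns, port, tls, or ok) that maps to the matching Action line. Hosts that drop ICMP will be reported as failing at that layer even when they are reachable, so treat the result as a hint.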

2. DNS Not Resolving

🔀 Resolver → authority chain

Check: Flush local cache (stale answers)?
# macOS
sudo dscacheutil -flushcache; sudo killall -HUP mDNSResponder
# Linux with systemd-resolved (older releases: sudo systemd-resolve --flush-caches)
sudo resolvectl flush-caches
✓ Retest ✗ Still broken
Action: On Windows use ipconfig /flushdns; Linux varies by resolver (nscd, systemd-resolved, dnsmasq)
Continue ↓
Check: Does a public resolver return answers?
dig @8.8.8.8 +short example.com A
✓ Yes ✗ No
Action: Upstream / global DNS or the domain itself: try dig @1.1.1.1; verify the domain registration and the authoritative NS at the registrar
Yes ↓
Check: What is the OS using for DNS?
cat /etc/resolv.conf
resolvectl status
✓ Looks correct ✗ Wrong / empty
Action: Fix DHCP or static DNS settings, NetworkManager, or cloud metadata; disable rogue resolv.conf overrides
Continue ↓
Check: Trace delegation from root to your name
dig +trace example.com
✓ Completes ✗ Fails mid-chain
Action: Broken NS glue, lame delegation, or registrar NS mismatch: fix the NS records and parent-zone glue at the registrar
OK ↓
Check: Do authoritative nameservers answer directly?
dig NS example.com +short
dig @ns1.example.net example.com A +norecurse
✓ Yes ✗ No
Action: NS host down, firewall blocking 53/UDP+TCP, or wrong BIND/Cloud DNS zone: open the ports, sync the zone via AXFR/IXFR, check the SOA serial
Yes ↓
Check: Do the records you expect exist?
dig example.com A +noall +answer
dig example.com AAAA +noall +answer
✓ Yes ✗ No
Action: Add/fix A/AAAA/CNAME records and wait for old TTLs to expire; check for split-horizon DNS returning different answers internally vs externally
Yes ↓
DNS chain healthy: if apps still fail, check search domains, /etc/hosts, and application-specific resolver settings
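
The "public resolver vs system resolver" comparison from the steps above can be scripted. The sketch below is one possible approach; same_answers and compare_resolvers are hypothetical helper names, and it assumes dig is installed. A mismatch is not always a fault: CDNs and split-horizon setups can legitimately return different records per resolver.

```shell
#!/bin/sh
# Sketch: check whether the system resolver and a public resolver agree.

# same_answers A B: succeed when two newline-separated answer sets are equal,
# ignoring order (round-robin DNS often shuffles records).
same_answers() {
  [ "$(printf '%s\n' "$1" | sort)" = "$(printf '%s\n' "$2" | sort)" ]
}

# compare_resolvers NAME [PUBLIC_RESOLVER]: print match, mismatch, or no-answer.
compare_resolvers() {
  name=$1
  public=${2:-8.8.8.8}
  from_public=$(dig "@$public" +short "$name" A)
  from_system=$(dig +short "$name" A)
  if [ -z "$from_public" ] && [ -z "$from_system" ]; then
    echo "no-answer"
  elif same_answers "$from_public" "$from_system"; then
    echo "match"
  else
    echo "mismatch"
  fi
}
```

For example, compare_resolvers example.com 1.1.1.1 compares the system resolver against Cloudflare's public resolver instead of the default 8.8.8.8.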

3. Website Loads Slowly

🐢 Find the bottleneck

Check: DNS resolution time
dig example.com | grep 'Query time'
✓ Fast (<50 ms typical) ✗ Slow
Action: Use faster resolvers, reduce TTL churn, enable DNS prefetch, geo DNS or anycast NS closer to users
Fast ↓
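Check: TLS handshake duration (time_appconnect minus time_connect; both are cumulative from request start)
curl -w 'connect:%{time_connect} appconnect:%{time_appconnect}\n' -o /dev/null -s https://example.com
✓ OK ✗ High
Action: Enable session tickets and OCSP stapling, use HTTP/2 or QUIC, shrink the cert chain, terminate TLS at the CDN edge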
OK ↓
Check: Time to first byte (TTFB)
curl -w 'ttfb:%{time_starttransfer} total:%{time_total}\n' -o /dev/null -s https://example.com
✓ Low TTFB ✗ High TTFB
Action: Origin CPU/DB/cache, cold containers, or region latency: add caching, scale the app, optimize queries, move the origin closer to users
Good ↓
Check: Download size & compression
curl -sI https://example.com | grep -i content-length
curl -sI -H 'Accept-Encoding: gzip' https://example.com | grep -i content-encoding
✓ Reasonable ✗ Huge / uncompressed
Action: Enable gzip/Brotli, shrink images, code-split JS, lazy-load media, HTTP/2 multiplexing
OK ↓
Check: Is a CDN caching static assets?
curl -sI https://example.com/static/app.js | grep -iE 'cf-cache|x-cache|age|server'
✓ HIT / edge ✗ MISS / direct origin
Action: Put static on CDN, tune cache headers (Cache-Control), purge stale objects, enable image CDN
Optimized ↓
Check: Client-side render & third parties
(Chrome DevTools → Network → Disable cache → reload, check waterfall)
✓ Clean ✗ Blocking scripts
Action: Defer/async scripts, cut trackers, preconnect to critical origins, reduce main-thread work
↓
You’ve isolated the slow phase: apply the matching optimization above and re-measure with WebPageTest or Lighthouse
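
curl's timing variables are cumulative from the start of the request, so each phase is the difference between neighbouring values. The sketch below does that subtraction; phase_breakdown and measure_phases are hypothetical names, and the breakdown assumes an https URL (for plain HTTP, time_appconnect is 0 and the tls figure is meaningless).

```shell
#!/bin/sh
# Sketch: split curl's cumulative timings into per-phase durations (seconds).

# phase_breakdown NAMELOOKUP CONNECT APPCONNECT STARTTRANSFER TOTAL
phase_breakdown() {
  awk -v d="$1" -v c="$2" -v a="$3" -v s="$4" -v t="$5" 'BEGIN {
    printf "dns=%.3f tcp=%.3f tls=%.3f wait=%.3f download=%.3f\n",
           d, c - d, a - c, s - a, t - s
  }'
}

# measure_phases URL: fetch once and print the breakdown.
# The command substitution is left unquoted on purpose so the five
# numbers become five separate arguments.
measure_phases() {
  phase_breakdown $(curl -o /dev/null -s -w \
    '%{time_namelookup} %{time_connect} %{time_appconnect} %{time_starttransfer} %{time_total}' \
    "$1")
}
```

For instance, phase_breakdown 0.010 0.030 0.080 0.200 0.250 prints dns=0.010 tcp=0.020 tls=0.050 wait=0.120 download=0.050, which points at server wait time (TTFB) as the largest phase. A single run is noisy; repeat measure_phases several times before drawing conclusions.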

4. SSH Connection Refused

🔐 SSH path

Check: Is TCP 22 (or your port) open from the client?
nc -vz host.example.com 22
nmap -p 22 host.example.com
✓ Open ✗ Closed / filtered
Action: Security group / cloud firewall / edge ACL: allow 22 from your IP; confirm the target IP and NAT
Open ↓
Check: Is sshd listening on the server?
sudo systemctl status ssh sshd
sudo ss -tlnp | grep ':22'
✓ Active ✗ Down
Action: sudo systemctl start sshd (the unit is named ssh on Debian/Ubuntu), fix unit failures (journalctl -u sshd), or reinstall openssh-server
Running ↓
Check: Host firewall allows SSH?
sudo iptables -L INPUT -n -v
sudo firewall-cmd --list-services
✓ Allowed ✗ Blocked
Action: Add rule for tcp/22 (or custom port), check fail2ban jails (sudo fail2ban-client status sshd)
Allowed ↓
Check: sshd_config listen address & port
sudo sshd -T | grep -E '^(port|listenaddress)'
✓ Matches client ✗ Mismatch
Action: Edit /etc/ssh/sshd_config (Port, ListenAddress), then sudo systemctl reload sshd
Match ↓
Check: Pubkey / password auth policy
sudo sshd -T | grep -E 'passwordauthentication|pubkeyauthentication|permitemptypasswords'
✓ Expected mode on ✗ Locked down
Action: Enable key or password auth temporarily via a hardened config; verify permissions (700 on ~/.ssh, 600 on ~/.ssh/authorized_keys)
OK ↓
Check: Verbose client errors
ssh -vvv user@host
✓ Connects ✗ Fails late
Action: Host key changed (known_hosts), KEX/cipher mismatch, MaxStartups, or PAM: align client/server algorithms, or remove the stale known_hosts entry (ssh-keygen -R host) if the host key change was intentional
↓
SSH should work; if “refused” became “timeout”, revisit the network path and middleboxes
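
The client's error text usually says which step of this tree to revisit. The sketch below maps common OpenSSH client messages to a short label; classify_ssh_error and the labels are illustrative assumptions, and the message list is not exhaustive.

```shell
#!/bin/sh
# Sketch: map common OpenSSH client errors to the flowchart step to revisit.
# Reads the client's error output on stdin and prints one short label.
classify_ssh_error() {
  msg=$(cat)
  case $msg in
    # Nothing listening on the port, or a firewall sending REJECT
    *"Connection refused"*)                 echo "refused" ;;
    # Packets dropped on the path (security group, ACL, DROP rule)
    *"timed out"*|*"No route to host"*)     echo "timeout" ;;
    # known_hosts entry no longer matches the server's host key
    *"Host key verification failed"*|*"REMOTE HOST IDENTIFICATION HAS CHANGED"*)
                                            echo "host-key" ;;
    # Client and server share no KEX, cipher, or host key algorithm
    *"no matching"*)                        echo "algorithms" ;;
    # Server closed the connection before the banner exchange
    # (MaxStartups, TCP wrappers, fail2ban and similar)
    *"kex_exchange_identification"*|*"ssh_exchange_identification"*)
                                            echo "early-close" ;;
    # Keys or passwords rejected by the auth policy
    *"Permission denied"*|*"Too many authentication failures"*)
                                            echo "auth" ;;
    *)                                      echo "unknown" ;;
  esac
}
```

One way to feed it: ssh -o BatchMode=yes -o ConnectTimeout=5 user@host true 2>&1 | classify_ssh_error. A refused label points to the sshd and port checks, timeout to the firewall and path checks, and auth to the authentication policy check.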

5. GCP VM Not Accessible

☁️ GCP checklist

Check: Does the VM have an external IP or reachable path (IAP / VPN)?
gcloud compute instances describe VM_NAME --zone=ZONE --format='get(networkInterfaces[0].accessConfigs[0].natIP)'
✓ Yes / IAP ready ✗ No
Action: Attach an external IP, reach the VM over a private path (VPN or Interconnect; Cloud NAT covers outbound traffic only), or use gcloud compute ssh --tunnel-through-iap
Reachable path ↓
Check: VPC firewall allows your traffic?
gcloud compute firewall-rules list --filter='network:NETWORK' --format='table(name,direction,priority,allowed,sourceRanges)'
✓ Rule hits ✗ Deny / missing
Action: Create allow rule for tcp:22 or tcp:443 from your IP or IAP range 35.235.240.0/20; check priority & deny rules
Allowed ↓
Check: Routes send return traffic correctly (no blackhole)?
gcloud compute routes list --filter='network:NETWORK'
✓ Default via gw ✗ Wrong next hop
Action: Fix custom static routes or ILB next hops, or remove conflicting custom routes imported over peering
OK ↓
Check: Is the instance RUNNING and healthy?
gcloud compute instances describe VM_NAME --zone=ZONE --format='get(status)'
✓ RUNNING ✗ Stopped / crash
Action: Start the VM; fix startup scripts, a full disk, or a guest OS panic; use the serial console next
Running ↓
Check: Serial port output / guest boot
gcloud compute instances get-serial-port-output VM_NAME --zone=ZONE
✓ Clean boot ✗ Errors
Action: fsck/disk errors, cloud-init failures, or wrong NIC config: attach a recovery disk or fix the metadata/startup script
OK ↓
Check: OS guest firewall inside VM
gcloud compute ssh VM_NAME --zone=ZONE --command='sudo iptables -L -n || sudo nft list ruleset'
✓ Allows service ✗ Blocks
Action: Open service port in guest OS firewall; confirm app binds 0.0.0.0 not only localhost
↓
GCP path aligned: if still failing, capture tcpdump on the VM and verify load balancer backend health for managed instance groups
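
The instance status check returns one of several Compute Engine lifecycle values, not just running or stopped. The sketch below maps each value to a next step; next_step_for_status, check_instance, and the step labels are illustrative assumptions, and check_instance assumes an authenticated gcloud.

```shell
#!/bin/sh
# Sketch: turn a Compute Engine instance status into the next step to take.
next_step_for_status() {
  case $1 in
    # VM is up: move on to serial output and the guest firewall
    RUNNING)                                  echo "check-serial-output" ;;
    # VM is stopped or suspended: start or resume it first
    TERMINATED|SUSPENDED)                     echo "start-or-resume" ;;
    # VM is mid-transition: wait, then query the status again
    PROVISIONING|STAGING|STOPPING|SUSPENDING) echo "wait-and-recheck" ;;
    # Compute Engine is repairing the VM
    REPAIRING)                                echo "wait-for-repair" ;;
    *)                                        echo "unknown-status" ;;
  esac
}

# check_instance VM_NAME ZONE: query the status with gcloud and print the next step.
check_instance() {
  status=$(gcloud compute instances describe "$1" --zone="$2" --format='get(status)')
  next_step_for_status "$status"
}
```

A stopped VM reports TERMINATED rather than a literal "stopped", which is why the mapping treats TERMINATED as the "start the VM" case.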

6. Kubernetes Pod Can't Connect

☸️ Cluster networking

Check: Does a Service exist for the target?
kubectl get svc -A | grep my-svc
✓ Yes ✗ No
Action: Create Service (ClusterIP/NodePort/LoadBalancer) with correct selectors matching pod labels
Yes ↓
Check: In-cluster DNS resolves the service?
kubectl run -it --rm dbg --image=busybox:1.36 --restart=Never -- nslookup my-svc.my-ns.svc.cluster.local
✓ Yes ✗ NXDOMAIN / timeout
Action: Fix CoreDNS pods, kube-dns service, upstream resolvers, or NetworkPolicy blocking UDP/TCP 53
Yes ↓
Check: Do NetworkPolicies allow the required egress/ingress?
kubectl get networkpolicy -A
kubectl describe networkpolicy -n NAMESPACE POLICY_NAME
✓ Permits path ✗ Deny
Action: Add egress to Service CIDR/pod CIDR/443; allow DNS egress (UDP/TCP 53) to the cluster DNS namespace; verify namespace selectors and ports
OK ↓
Check: CNI pods healthy (Calico/Cilium/Amazon VPC CNI)?
kubectl get pods -n kube-system
✓ Running ✗ CrashLoop
Action: Review CNI logs, IP pool exhaustion, MTU issues on the overlay, or provider quotas; restart/reinstall the CNI per the vendor runbook
Healthy ↓
Check: Endpoints match ready pods?
kubectl get endpoints -n NAMESPACE my-svc -o wide
✓ Addresses listed ✗ Empty
Action: Fix readiness probes, wrong Service selector, pods not Ready, or headless vs ClusterIP confusion
Ready ↓
Check: Pod-to-pod / pod-to-Service test
kubectl exec -n NAMESPACE POD -- wget -qO- --timeout=3 http://my-svc:8080/health
✓ Works ✗ Fails
Action: kube-proxy mode (iptables/IPVS), stale conntrack, dual-stack mismatch, or app binding: compare with kubectl get svc -o wide
↓
Kubernetes service path OK: escalate to app logs and the ingress/controller if only external clients fail
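
Empty endpoints most often come from a Service selector that does not match the pod labels. The sketch below checks that every key=value pair of a selector appears in a pod's label list; selector_matches is a hypothetical helper, and it assumes both inputs use the comma-separated key=value form shown in the SELECTOR column of kubectl get svc -o wide and the LABELS column of kubectl get pods --show-labels.

```shell
#!/bin/sh
# Sketch: does a pod's label set satisfy a Service selector?
# selector_matches SELECTOR LABELS, both as "k1=v1,k2=v2".
# The body runs in a subshell so the IFS and globbing changes do not leak.
selector_matches() (
  # A Service without a selector gets no automatically managed endpoints.
  [ -n "$1" ] || exit 1
  labels=",$2,"
  IFS=,
  set -f
  for pair in $1; do
    case $labels in
      *",$pair,"*) ;;   # this selector pair is present on the pod
      *) exit 1 ;;      # missing or different value: no match
    esac
  done
  exit 0
)
```

For example, selector_matches "app=web,tier=frontend" "app=web,pod-template-hash=5d9c,tier=frontend" succeeds, while a pod labelled tier=backend fails the same selector. Extra labels on the pod are fine; every selector pair must be present.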

7. VPN Tunnel Down

🛡️ IPsec / IKE tunnel

Check: IKE Phase 1 (ISAKMP): peer identity, PSK/cert, UDP 500/4500
sudo journalctl -u strongswan -u ipsec -f
sudo tcpdump -ni any udp port 500 or udp port 4500
✓ SA established ✗ No SA
Action: Align IKE version, encryption suite, peer IDs, and NAT-T; verify the shared secret or trust chain; fix clock skew
Phase 1 OK ↓
Check: IKE Phase 2 (child SA): proxy IDs / traffic selectors
sudo ip xfrm state
sudo ip xfrm policy
✓ Installed ✗ Mismatch
Action: Match left/right subnets in config (GCP: classic VPN vs HA VPN traffic selectors); fix overlapping CIDRs
Phase 2 OK ↓
Check: Routes propagated to use tunnel (policy / static / BGP)
ip route | grep -E 'tun|vti|encap'
gcloud compute vpn-tunnels describe TUNNEL --region=REGION
✓ Routes present ✗ Missing
Action: Add Cloud Router BGP sessions, static routes with correct priority, or install shunts for on-prem ranges
Routed ↓
Check: Perimeter firewalls allow IPsec
# UDP 500 (ISAKMP), UDP 4500 (NAT-T), ESP IP proto 50
✓ Allowed ✗ Blocked
Action: Open UDP/500, UDP/4500, and ESP in edge firewall; disable SIP ALG or other helpers that break IKE
Open ↓
Check: MTU / fragmentation on encrypted path
ping -M do -s 1400 REMOTE_LAN_HOST
tracepath REMOTE_LAN_HOST
✓ Stable ✗ Black hole
Action: Clamp TCP MSS via iptables, set the interface MTU (often 1360–1392 over IPsec), and allow ICMP fragmentation-needed messages so PMTUD works
Stable ↓
Check: Keepalive / DPD / rekey behavior
sudo swanctl --list-sas
✓ Healthy ✗ Flapping
Action: Tune dpdaction, increase keylife overlap, fix asymmetric NAT or idle timeouts on middle firewalls
↓
Tunnel up and passing traffic: monitor with vendor/cloud VPN metrics and log correlation on both sides
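
The MTU check relies on two bits of header arithmetic: the ping payload is the packet size minus 28 bytes (IPv4 + ICMP headers), and the TCP MSS is the MTU minus 40 bytes (IPv4 + TCP headers without options). The sketch below encodes both and adds a simple downward probe; the function names are hypothetical, the numbers assume IPv4, and the ping flags assume Linux iputils.

```shell
#!/bin/sh
# Sketch: MTU arithmetic for an IPv4 tunnel path.

# ICMP payload size that yields a packet of exactly MTU bytes:
# MTU minus 20-byte IPv4 header minus 8-byte ICMP header.
ping_payload_for_mtu() { echo $(( $1 - 28 )); }

# Largest TCP MSS that fits in MTU:
# MTU minus 20-byte IPv4 header minus 20-byte TCP header (no options).
mss_for_mtu() { echo $(( $1 - 40 )); }

# find_path_mtu HOST [START] [FLOOR]: step the packet size down by 8 bytes
# until a don't-fragment ping gets through; print that size.
find_path_mtu() {
  host=$1
  mtu=${2:-1500}
  floor=${3:-1200}
  while [ "$mtu" -ge "$floor" ]; do
    if ping -M do -c 1 -W 2 -s "$(ping_payload_for_mtu "$mtu")" "$host" >/dev/null 2>&1; then
      echo "$mtu"
      return 0
    fi
    mtu=$(( mtu - 8 ))
  done
  return 1
}
```

The ping -M do -s 1400 test in the flowchart therefore sends 1428-byte packets (ping_payload_for_mtu 1428 gives 1400). If find_path_mtu REMOTE_LAN_HOST reports 1400, mss_for_mtu 1400 gives 1360 as a starting value for the MSS clamp.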