Troubleshooting Flowcharts

Follow the arrows to find the fix

1. Can't Reach a Server

🖥️ Reachability decision tree

Check: Does loopback respond?

ping -c 3 127.0.0.1

✓ Yes ✗ No

Action: Loopback / TCP stack issue. Restart networking, check ip addr / ifconfig, and review VPN clients that capture traffic.

Yes ↓

Check: Can you ping the default gateway?

ip route | awk '/default/ {print $3}' | xargs ping -c 3

✓ Yes ✗ No

Action: L2 / link problem (cable, Wi‑Fi, NIC driver, wrong VLAN/port). Try ethtool <iface> and a different port or cable.
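The gateway check picks the third column of the default route, which assumes a line shaped like "default via <gateway> ...". A minimal sketch that looks for the word after "via" instead, so routes without a gateway (for example point-to-point links) print nothing and multiple default routes are handled one per line. The function name default_gateways is hypothetical.

```shell
# Print every default-route gateway found in `ip route` output on stdin.
# Looks for the word after "via" instead of relying on a fixed column.
default_gateways() {
  awk '$1 == "default" {
    for (i = 2; i < NF; i++) if ($i == "via") print $(i + 1)
  }'
}

# Usage on a Linux host (pings each gateway in turn):
# ip route | default_gateways | while read -r gw; do ping -c 3 "$gw"; done
```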

Yes ↓

Check: Can you reach the public internet (routing)?

ping -c 3 8.8.8.8

✓ Yes ✗ No

Action: Routing / upstream problem. Verify the default route, gateway ARP (arp -n), and the ISP or corporate path; run traceroute 8.8.8.8.

Yes ↓

Check: Does the hostname resolve and respond?

ping -c 3 example.com

✓ Yes ✗ No

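Action: DNS issue. Compare dig @8.8.8.8 example.com with the system resolver; fix /etc/resolv.conf or DHCP DNS options. If the name resolves but the pings time out, the target may simply drop ICMP; continue to the port check.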

Yes ↓

Check: Is the application port reachable on the target IP?

nc -vz 192.168.1.10 443

✓ Yes ✗ No

Action: Host or path firewall, wrong IP, or service down. Check sudo iptables -L -n / sudo firewall-cmd --list-all, plus security groups / cloud firewall rules.

Yes ↓

Check: TLS / app layer (optional)

curl -vI https://example.com

✓ Yes ✗ No

Action: Certificate, SNI, proxy, or app config problem. Inspect the curl error, server vhosts, and load balancer health.

Yes ↓

Issue narrowed: the L3/L4 path is OK. If the service still misbehaves, focus on app credentials, HTTP errors, or server logs.
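The tree above is an ordered list of checks where the first failure names the layer to fix. A minimal sketch of that idea as a shell helper follows. The function name first_failing_check and the "label::command" argument format are assumptions made for this example, and the addresses in the usage comment are the placeholders already used in this section.

```shell
# Run checks in order and print the label of the first one that fails.
# Each argument has the form "label::command".
first_failing_check() {
  for entry in "$@"; do
    label=${entry%%::*}   # text before the first "::"
    cmd=${entry#*::}      # text after the first "::"
    if ! sh -c "$cmd" >/dev/null 2>&1; then
      printf '%s\n' "$label"
      return 1
    fi
  done
  printf 'all-passed\n'
}

# Usage (same order as the decision tree):
# first_failing_check \
#   "loopback::ping -c 3 127.0.0.1" \
#   "internet::ping -c 3 8.8.8.8" \
#   "dns::ping -c 3 example.com" \
#   "port::nc -vz 192.168.1.10 443" \
#   "tls::curl -sI https://example.com"
```

Stopping at the first failure matters: a lower-layer fault makes every later check fail too, so only the first label is informative.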

2. DNS Not Resolving

🔤 Resolver → authority chain

Check: Flush local cache (stale answers)?

# macOS
sudo dscacheutil -flushcache; sudo killall -HUP mDNSResponder

# Linux with systemd-resolved (older releases: sudo systemd-resolve --flush-caches)
sudo resolvectl flush-caches

✓ Retest ✗ Still broken

Action: On Windows use ipconfig /flushdns; Linux varies by resolver (nscd, systemd-resolved, dnsmasq)

Continue ↓

Check: Does a public resolver return answers?

dig @8.8.8.8 +short example.com A

✓ Yes ✗ No

Action: Upstream / global DNS or the domain itself is at fault. Try dig @1.1.1.1; verify the domain registration and the authoritative NS set at the registrar.

Yes ↓

Check: What is the OS using for DNS?

cat /etc/resolv.conf

resolvectl status

✓ Looks correct ✗ Wrong / empty

Action: Fix DHCP-supplied or static DNS settings, NetworkManager, or cloud metadata; remove rogue resolv.conf overrides
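When /etc/resolv.conf lists only a loopback address, the file is pointing at a local stub (systemd-resolved, dnsmasq, or similar) and the real upstream servers are configured elsewhere, which is why resolvectl status is worth checking as well. A small sketch that makes this distinction; both function names are hypothetical.

```shell
# Print the nameserver addresses from a resolv.conf-style file.
# $1 is the file path; defaults to /etc/resolv.conf.
list_nameservers() {
  awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Succeed if the file lists at least one nameserver and all of them are
# loopback addresses (127.x.x.x or ::1), i.e. a local stub resolver.
only_local_stub() {
  servers=$(list_nameservers "$1")
  [ -n "$servers" ] || return 1
  for s in $servers; do
    case $s in
      127.*|::1) ;;
      *) return 1 ;;
    esac
  done
  return 0
}

# Usage:
# only_local_stub && echo "stub resolver in use; check resolvectl status"
```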

Continue ↓

Check: Trace delegation from root to your name

dig +trace example.com

✓ Completes ✗ Fails mid-chain

Action: Broken NS glue, lame delegation, or registrar NS mismatch. Fix the NS records and the parent-zone glue at the registrar.

OK ↓

Check: Do authoritative nameservers answer directly?

dig NS example.com +short

dig @ns1.example.net example.com A +norecurse

✓ Yes ✗ No

Action: NS host down, firewall blocking 53/UDP+TCP, or wrong BIND/Cloud DNS zone. Open the ports, resync the zone (AXFR/IXFR), and check that the SOA serial matches on every nameserver.
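Checking the SOA serial means asking each authoritative server for the zone's SOA record and comparing the third field. A minimal sketch, assuming the usual dig +short SOA output of "mname rname serial refresh retry expire minimum"; the function names and the ns2.example.net host are hypothetical.

```shell
# Extract the serial (third field) from `dig +short SOA` output on stdin.
soa_serial() {
  awk 'NF >= 7 { print $3; exit }'
}

# Succeed only if at least one serial was given and all are identical.
serials_in_sync() {
  first=$1
  for s in "$@"; do
    [ "$s" = "$first" ] || return 1
  done
  [ -n "$first" ]
}

# Usage:
# a=$(dig @ns1.example.net example.com SOA +short +norecurse | soa_serial)
# b=$(dig @ns2.example.net example.com SOA +short +norecurse | soa_serial)
# serials_in_sync "$a" "$b" || echo "zone not synced: $a vs $b"
```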

Yes ↓

Check: Do the records you expect exist?

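# Query each record type explicitly; many servers return only a minimal answer to ANY
dig example.com A +noall +answer

dig example.com AAAA +noall +answer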

✓ Yes ✗ No

Action: Add or fix A/AAAA/CNAME records and wait out the old TTL; check for split-horizon DNS returning different answers internally vs externally

Yes ↓

DNS chain healthy. If apps still fail, check search domains, /etc/hosts, and application-specific resolver settings.
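Several steps in this chain come down to comparing what two resolvers return for the same name. A minimal sketch that compares two dig +short answer lists regardless of order; the function name same_answer_set is hypothetical. A mismatch is only a hint, since round-robin, geo DNS, and split-horizon setups can legitimately return different sets.

```shell
# Succeed if two newline-separated answer lists contain the same records,
# ignoring order and duplicates. Two empty lists count as a failure.
same_answer_set() {
  a=$(printf '%s\n' "$1" | sort -u)
  b=$(printf '%s\n' "$2" | sort -u)
  [ -n "$a" ] && [ "$a" = "$b" ]
}

# Usage:
# same_answer_set "$(dig +short example.com A)" \
#                 "$(dig @8.8.8.8 +short example.com A)" \
#   || echo "system resolver and 8.8.8.8 disagree"
```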

3. Website Loads Slowly

🐢 Find the bottleneck

Check: DNS resolution time

dig example.com | grep 'Query time'

✓ Fast (<50ms typical) ✗ Slow

Action: Use faster resolvers, raise very short TTLs so answers stay cached, enable DNS prefetch, and use geo DNS or anycast NS closer to users

Fast ↓

Check: TLS handshake duration

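# time_appconnect is measured from the start of the request; handshake time is roughly tls_done minus tcp
curl -w 'tcp:%{time_connect} tls_done:%{time_appconnect}\n' -o /dev/null -s https://example.com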

✓ OK ✗ High

Action: Enable session tickets and OCSP stapling, use HTTP/2 or QUIC, shrink the cert chain, or terminate TLS at the CDN edge

OK ↓

Check: Time to first byte (TTFB)

curl -w 'ttfb:%{time_starttransfer} total:%{time_total}\n' -o /dev/null -s https://example.com

✓ Low TTFB ✗ High TTFB

Action: Origin CPU/DB/cache pressure, cold containers, or region latency. Add caching, scale the app, optimize queries, or move the origin closer to users.

Good ↓

Check: Download size & compression

curl -sI https://example.com | grep -i content-length

curl -sI -H 'Accept-Encoding: gzip' https://example.com | grep -i content-encoding

✓ Reasonable ✗ Huge / uncompressed

Action: Enable gzip/Brotli, shrink images, code-split JS, lazy-load media, HTTP/2 multiplexing

OK ↓

Check: Is a CDN caching static assets?

curl -sI https://example.com/static/app.js | grep -iE '^(cf-cache-status|x-cache[^:]*|age|server):'

✓ HIT / edge ✗ MISS / direct origin

Action: Put static on CDN, tune cache headers (Cache-Control), purge stale objects, enable image CDN
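Reading the cache headers by eye gets tedious across many assets. A rough sketch that turns a header dump into HIT, MISS, or UNKNOWN follows. It only understands cf-cache-status and x-cache, and the status words it treats as a miss are an assumption based on common CDN values, so UNKNOWN does not prove there is no CDN in the path. The function name cache_verdict is hypothetical.

```shell
# Read HTTP response headers on stdin and print HIT, MISS, or UNKNOWN.
cache_verdict() {
  tr -d '\r' | awk -F': *' '
    BEGIN { verdict = "UNKNOWN" }
    tolower($1) == "cf-cache-status" || tolower($1) == "x-cache" {
      v = toupper($2)
      if (v ~ /HIT/) verdict = "HIT"
      else if (v ~ /MISS|EXPIRED|BYPASS|DYNAMIC/ && verdict != "HIT") verdict = "MISS"
    }
    END { print verdict }'
}

# Usage:
# curl -sI https://example.com/static/app.js | cache_verdict
```

Request the same asset twice before trusting a MISS, since the first request after a purge or deploy is expected to miss.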

Optimized ↓

Check: Client-side render & third parties

(Chrome DevTools → Network → Disable cache → reload, check waterfall)

✓ Clean ✗ Blocking scripts

Action: Defer/async scripts, cut trackers, preconnect to critical origins, reduce main-thread work

↓

You’ve isolated the slow phase. Apply the matching optimization above and re-measure with WebPageTest or Lighthouse.
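curl's timers are cumulative: each one counts from the start of the request, so the time spent in a phase is the difference between neighbouring timers. A minimal sketch that turns one line of timers into per-phase durations, assuming an https URL (for plain http, time_appconnect is 0 and the tls figure is meaningless). The function name phase_breakdown and the output labels are hypothetical.

```shell
# Convert curl's cumulative timers into per-phase durations.
# Input on stdin: "namelookup connect appconnect starttransfer total" (seconds).
phase_breakdown() {
  awk '{
    printf "dns=%.3f tcp=%.3f tls=%.3f wait=%.3f download=%.3f\n",
      $1, $2 - $1, $3 - $2, $4 - $3, $5 - $4
  }'
}

# Usage:
# curl -o /dev/null -s \
#   -w '%{time_namelookup} %{time_connect} %{time_appconnect} %{time_starttransfer} %{time_total}\n' \
#   https://example.com | phase_breakdown
```

Here "wait" is the gap between the end of the TLS handshake and the first response byte, which is the closest of these figures to origin processing time.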

4. SSH Connection Refused

🔐 SSH path

Check: Is TCP 22 (or your port) open from the client?

nc -vz host.example.com 22

nmap -p 22 host.example.com

✓ Open ✗ Closed / filtered

Action: Security group / cloud firewall / edge ACL is blocking. Allow 22 from your IP; confirm the target IP and NAT.

Open ↓

Check: Is sshd listening on the server?

sudo systemctl status ssh sshd

sudo ss -tlnp | grep ':22'

✓ Active ✗ Down

Action: sudo systemctl start sshd (the unit is named ssh on Debian/Ubuntu), fix unit failures (journalctl -u sshd), or reinstall openssh-server

Running ↓

Check: Host firewall allows SSH?

sudo iptables -L INPUT -n -v

sudo firewall-cmd --list-services

✓ Allowed ✗ Blocked

Action: Add rule for tcp/22 (or custom port), check fail2ban jails (sudo fail2ban-client status sshd)

Allowed ↓

Check: sshd_config listen address & port

sudo sshd -T | grep -E '^(port|listenaddress)'

✓ Matches client ✗ Mismatch

Action: Edit /etc/ssh/sshd_config (Port, ListenAddress), validate with sudo sshd -t, then sudo systemctl reload sshd

Match ↓

Check: Pubkey / password auth policy

sudo sshd -T | grep -E 'passwordauthentication|pubkeyauthentication|permitemptypasswords'

✓ Expected mode on ✗ Locked down

Action: Enable keys or password temporarily via hardened config; verify permissions (700 on ~/.ssh, 600 on ~/.ssh/authorized_keys)

OK ↓

Check: Verbose client errors

ssh -vvv user@host

✓ Connects ✗ Fails late

Action: Host key changed (known_hosts), KEX/cipher mismatch, MaxStartups, or PAM. Align client/server algorithms, or remove the stale known_hosts entry (ssh-keygen -R host) if the host key change was intentional.

↓

SSH should work. If “refused” became “timeout”, revisit the network path and middleboxes.
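The wording of the client error already says which branch of this tree to start from. A minimal sketch that maps common OpenSSH client messages to a starting point follows. The function name and the category labels are hypothetical, and the message patterns assume an English-locale OpenSSH client.

```shell
# Map an ssh client error message to the part of the path to inspect first.
classify_ssh_error() {
  case $1 in
    *"Connection refused"*)            echo "listener" ;;      # nothing listening, or a REJECT rule
    *"timed out"*|*"No route to host"*) echo "network-path" ;;  # dropped by firewall, routing, NAT
    *"Could not resolve hostname"*)    echo "dns" ;;
    *"Host key verification failed"*)  echo "known-hosts" ;;
    *"Permission denied"*)             echo "auth" ;;
    *)                                 echo "unknown" ;;
  esac
}

# Usage:
# classify_ssh_error "$(ssh -o BatchMode=yes -o ConnectTimeout=5 user@host true 2>&1)"
```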

5. GCP VM Not Accessible

☁️ GCP checklist

Check: Does the VM have an external IP or reachable path (IAP / VPN)?

gcloud compute instances describe VM_NAME --zone=ZONE --format='get(networkInterfaces[0].accessConfigs[0].natIP)'

✓ Yes / IAP ready ✗ No

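Action: Attach an external IP, reach the VM over a private path (VPN / Interconnect), or use gcloud compute ssh --tunnel-through-iap. Cloud NAT only covers outbound connections, so on its own it does not make the VM reachable.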

Reachable path ↓

Check: VPC firewall allows your traffic?

gcloud compute firewall-rules list --filter='network:NETWORK' --format='table(name,direction,priority,allowed,sourceRanges)'

✓ Rule hits ✗ Deny / missing

Action: Create allow rule for tcp:22 or tcp:443 from your IP or IAP range 35.235.240.0/20; check priority & deny rules

Allowed ↓

Check: Routes send return traffic correctly (no blackhole)?

gcloud compute routes list --filter='network:NETWORK'

✓ Default via gw ✗ Wrong next hop

Action: Fix custom static routes or ILB next hops, or remove conflicting custom routes imported over VPC peering

OK ↓

Check: Is the instance RUNNING and healthy?

gcloud compute instances describe VM_NAME --zone=ZONE --format='get(status)'

✓ RUNNING ✗ Stopped / crash

Action: Start the VM. If it stops or crashes again, suspect startup scripts, a full disk, or a guest OS panic, and check the serial console next.

Running ↓

Check: Serial port output / guest boot

gcloud compute instances get-serial-port-output VM_NAME --zone=ZONE

✓ Clean boot ✗ Errors

Action: fsck/disk errors, cloud-init failures, or wrong NIC config. Attach a recovery disk or fix the metadata/startup script.

OK ↓

Check: OS guest firewall inside VM

gcloud compute ssh VM_NAME --zone=ZONE --command='sudo iptables -L -n || sudo nft list ruleset'

✓ Allows service ✗ Blocks

Action: Open service port in guest OS firewall; confirm app binds 0.0.0.0 not only localhost

↓

GCP path aligned. If it is still failing, capture tcpdump on the VM and verify load balancer backend health for managed instance groups.
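Firewall rules match on source ranges, so it helps to test whether an address actually falls inside a CIDR such as the IAP range 35.235.240.0/20 mentioned above. A minimal IPv4-only sketch using shell arithmetic; the function names are hypothetical, and octets with leading zeros are not handled.

```shell
# Pack a dotted-quad IPv4 address into one 32-bit integer.
ip_to_int() {
  old_ifs=$IFS; IFS=.
  set -- $1              # split the address on "."
  IFS=$old_ifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeed if address $1 lies inside CIDR $2 (for example 35.235.240.0/20).
ip_in_cidr() {
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")
  bits=${2#*/}
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( ip & mask )) -eq $(( net & mask )) ]
}

# Usage:
# ip_in_cidr 35.235.241.7 35.235.240.0/20 && echo "source is in the IAP range"
```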

6. Kubernetes Pod Can't Connect

☸️ Cluster networking

Check: Does a Service exist for the target?

kubectl get svc -A | grep my-svc

✓ Yes ✗ No

Action: Create Service (ClusterIP/NodePort/LoadBalancer) with correct selectors matching pod labels

Yes ↓

Check: In-cluster DNS resolves the service?

kubectl run -it --rm dbg --image=busybox:1.36 --restart=Never -- nslookup my-svc.my-ns.svc.cluster.local

✓ Yes ✗ NXDOMAIN / timeout

Action: Fix CoreDNS pods, kube-dns service, upstream resolvers, or NetworkPolicy blocking UDP/TCP 53

Yes ↓

Check: Do NetworkPolicies allow the egress/ingress?

kubectl get networkpolicy -A

kubectl describe networkpolicy -n NAMESPACE POLICY_NAME

✓ Permits path ✗ Deny

Action: Add egress to the Service CIDR / pod CIDR on the needed port (e.g. 443); allow DNS (UDP/TCP 53) to the namespace running CoreDNS; verify namespace selectors and ports

OK ↓

Check: CNI pods healthy (Calico/Cilium/Amazon VPC CNI)?

kubectl get pods -n kube-system

✓ Running ✗ CrashLoop

Action: Review CNI logs for IP pool exhaustion, MTU issues on the overlay, or provider quotas. Restart or reinstall the CNI per the vendor runbook.

Healthy ↓

Check: Endpoints match ready pods?

kubectl get endpoints -n NAMESPACE my-svc -o wide

✓ Addresses listed ✗ Empty

Action: Fix readiness probes, a wrong Service selector, or pods that are not Ready; rule out headless vs ClusterIP confusion

Ready ↓

Check: Pod-to-pod / pod-to-Service test

kubectl exec -n NAMESPACE POD -- wget -qO- -T 3 http://my-svc:8080/health

✓ Works ✗ Fails

Action: Check kube-proxy mode (iptables/IPVS), stale conntrack, dual-stack mismatch, or app binding. Compare against kubectl get svc -o wide.

↓

Kubernetes service path OK. Escalate to app logs and the ingress/controller if only external clients fail.
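An empty endpoints list most often means the Service selector does not match the pod labels. A minimal sketch that checks whether every key=value pair in a selector appears in a pod's label list follows. It assumes the comma-separated form that kubectl get svc -o wide (SELECTOR column) and kubectl get pods --show-labels (LABELS column) print. The function name selector_matches is hypothetical.

```shell
# Succeed if every key=value pair in selector $1 is present in labels $2.
# Both arguments are comma-separated lists such as "app=web,tier=fe".
selector_matches() {
  case $1 in
    ''|'<none>') return 1 ;;   # a Service without a selector selects no pods
  esac
  labels=",$2,"
  old_ifs=$IFS; IFS=,
  for pair in $1; do
    case $labels in
      *",$pair,"*) ;;
      *) IFS=$old_ifs; return 1 ;;
    esac
  done
  IFS=$old_ifs
  return 0
}

# Usage:
# selector_matches "app=web,tier=fe" "app=web,pod-template-hash=abc12,tier=fe" \
#   || echo "selector does not match this pod"
```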

7. VPN Tunnel Down

🛡️ IPsec / IKE tunnel

Check: IKE Phase 1 (ISAKMP): peer identity, PSK/cert, UDP 500/4500

sudo journalctl -u strongswan -u ipsec -f

sudo tcpdump -ni any udp port 500 or udp port 4500

✓ SA established ✗ No SA

Action: Align IKE version, encryption suite, peer IDs, NAT-T; verify shared secret or trust chain; fix clock skew

Phase 1 OK ↓

Check: IKE Phase 2 (child SA): proxy IDs / traffic selectors

sudo ip xfrm state

sudo ip xfrm policy

✓ Installed ✗ Mismatch

Action: Match left/right subnets in config (GCP: classic VPN vs HA VPN traffic selectors); fix overlapping CIDRs

Phase 2 OK ↓

Check: Routes propagated to use tunnel (policy / static / BGP)

ip route | grep -E 'tun|vti|encap'

gcloud compute vpn-tunnels describe TUNNEL --region=REGION

✓ Routes present ✗ Missing

Action: Add Cloud Router BGP sessions, static routes with correct priority, or install shunts for on-prem ranges

Routed ↓

Check: Perimeter firewalls allow IPsec

# UDP 500 (ISAKMP), UDP 4500 (NAT-T), ESP IP proto 50

✓ Allowed ✗ Blocked

Action: Open UDP/500, UDP/4500, and ESP in edge firewall; disable SIP ALG or other helpers that break IKE

Open ↓

Check: MTU / fragmentation on encrypted path

ping -M do -s 1400 REMOTE_LAN_HOST

tracepath REMOTE_LAN_HOST

✓ Stable ✗ Black hole

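Action: Clamp TCP MSS with iptables (TCPMSS --clamp-mss-to-pmtu), lower the tunnel interface MTU (often 1360–1392 over IPsec), and make sure ICMP "fragmentation needed" messages are not filtered so PMTUD keeps working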
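The ping test uses -s for the ICMP payload size, not the packet size, so the numbers need converting before they can be compared with an MTU or an MSS. A small sketch of the IPv4 arithmetic, assuming a 20-byte IPv4 header with no options, an 8-byte ICMP header, and a 20-byte TCP header; the function names are hypothetical.

```shell
# Largest ping payload (-s value) that fits an IPv4 packet of size $1.
ping_payload_for_mtu() { echo $(( $1 - 28 )); }   # 20 IPv4 + 8 ICMP

# IPv4 packet size produced by `ping -s $1`.
packet_size_for_payload() { echo $(( $1 + 28 )); }

# TCP MSS that fits an IPv4 path MTU of $1.
mss_for_mtu() { echo $(( $1 - 40 )); }            # 20 IPv4 + 20 TCP

# Usage: if `ping -M do -s 1400` passes, the path carries at least
# packet_size_for_payload 1400 bytes, and mss_for_mtu of that value
# is a safe clamp.
```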

Stable ↓

Check: Keepalive / DPD / rekey behavior

sudo swanctl --list-sas

✓ Healthy ✗ Flapping

Action: Tune dpdaction, increase keylife overlap, fix asymmetric NAT or idle timeouts on middle firewalls

↓

Tunnel up and passing traffic. Monitor with vendor/cloud VPN metrics and correlate logs on both sides.