Kubernetes: Kubernetes Alert & Monitoring

Kubernetes brings flexibility and a high degree of automation to application deployment, but it also makes monitoring and incident handling more complex. A comprehensive monitoring and alerting system is therefore essential to maintain observability, stability, and performance.

The Kubernetes Monitoring & Alerting feature aims to provide an “observability by default” solution that helps DevOps engineers and developers:

Track real-time metrics across the entire Kubernetes infrastructure.
Detect incidents early and alert on them (high CPU, dead Pods, HTTP 5xx, network down…).
Respond proactively or remediate automatically (auto-recovery) to reduce downtime.
Perform Root Cause Analysis.
Optimize performance and resource usage to avoid overspending.

User behavior and real-world needs

Role	Usage behavior	Need
DevOps	Monitors cluster activity, configures alerts, handles incidents	Detect failures early and keep the system stable
SRE	Analyzes incident root causes, improves SLIs/SLOs	Enough logs, metrics, and alerts to respond quickly
Developer	Looks up logs when a Pod crashes, traces requests	An application-level view for debugging
Team Lead/PM	Reviews overall system health	Know when incidents occur and report uptime
Internal customer (business)	Receives notifications when the system has problems	Know when users are affected

Kubernetes monitoring system architecture

The system follows a modular, scalable, and extensible design made up of the following components:

Component	Description
Metrics Collector (Prometheus)	Collects metrics from nodes, Pods, containers, the kubelet, and so on
Log Collector	Collects application logs, system events, and audit logs
Alerting System (Alertmanager)	Manages the alert flow: alert conditions, thresholds, and notification delivery
Dashboard (Grafana or a custom integration)	Displays real-time charts, reports, and system status
Event Tracker	Records Kubernetes Events and analyzes anomalous behavior
Webhook/Notification Integrator	Sends alerts to Email, Slack, Telegram, and so on

LIST OF MONITORING METRICS & ALERTS

Track the real-time operating status of the system (Cluster, Pod, Node, App).
Detect incidents early and alert on them (high CPU, dead Pods, HTTP 5xx, network down…).
Respond proactively or remediate automatically (auto-recovery) to reduce downtime.
Optimize performance and resource usage to avoid overspending.

Feature group structure

A. Monitoring

Real-time metrics from Nodes, Pods, Containers, and Applications.
Overview and detail dashboards (by namespace, app, cluster).
Log retrieval by time range or labels.
Heatmaps and trends over time.

B. Alerting

Define alert conditions with rules (YAML or UI).
Classify alerts by severity: info, warning, critical.
Multi-channel notifications: Slack, Email, Webhook, Opsgenie…
Schedule silences (no alerts) during deploys/maintenance.
Alert statistics by time and by failure source.
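
The severity classification and silence scheduling listed above can be sketched as a small label matcher. This is a minimal illustration, not the product's implementation: the `Silence` class, the `should_notify` function, and the label names (`namespace`, `severity`) are hypothetical and only assume that alerts carry key/value labels.

```python
from dataclasses import dataclass
from datetime import datetime

# Severity levels named in the feature list.
SEVERITIES = ("info", "warning", "critical")


@dataclass(frozen=True)
class Silence:
    """A maintenance window during which matching alerts are not notified."""
    matchers: dict          # label name -> required value (hypothetical shape)
    starts_at: datetime
    ends_at: datetime

    def covers(self, labels: dict, now: datetime) -> bool:
        # The window is half-open: it ends exactly at ends_at.
        in_window = self.starts_at <= now < self.ends_at
        matches = all(labels.get(k) == v for k, v in self.matchers.items())
        return in_window and matches


def should_notify(labels: dict, now: datetime, silences: list) -> bool:
    """Return True when no active silence matches the alert's labels."""
    if labels.get("severity") not in SEVERITIES:
        raise ValueError(f"unknown severity: {labels.get('severity')!r}")
    return not any(s.covers(labels, now) for s in silences)
```

A deploy window for one namespace would then suppress only the alerts whose labels match it, while alerts from other namespaces keep flowing.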

C. Notification & Escalation

Integration with incident management systems (PagerDuty, Opsgenie).
Notification flow by priority (Critical → more channels).
Retry & Deduplication (avoids spam when an alert repeats).
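
Priority-based routing and deduplication can be sketched as follows. The channel lists in `ROUTES`, the `Deduplicator` class, and the 300-second repeat interval used in the usage note are assumptions for illustration; the source only states that critical alerts reach more channels and that repeated alerts should not spam recipients.

```python
# Hypothetical routing table: higher severity fans out to more channels.
ROUTES = {
    "info": ["email"],
    "warning": ["email", "slack"],
    "critical": ["email", "slack", "pagerduty"],
}


def channels_for(labels: dict) -> list:
    """Pick channels by severity; unknown severities fall back to the info route."""
    return ROUTES.get(labels.get("severity"), ROUTES["info"])


class Deduplicator:
    """Suppress an alert that repeats within repeat_interval_s seconds."""

    def __init__(self, repeat_interval_s: float):
        self.repeat_interval_s = repeat_interval_s
        self._last_sent = {}  # alert fingerprint -> time of last notification

    def should_send(self, labels: dict, now: float) -> bool:
        # Two alerts are "the same" when their full label sets are equal.
        fingerprint = tuple(sorted(labels.items()))
        last = self._last_sent.get(fingerprint)
        if last is not None and now - last < self.repeat_interval_s:
            return False
        self._last_sent[fingerprint] = now
        return True
```

With `Deduplicator(300)`, an alert seen at t=0 and again at t=120 is sent once, and it is sent again only if it is still repeating at t=300 or later.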

Detailed feature list (backlog task level)

1. Monitoring Core

ID	Feature	Description
M1	Cluster Health Overview	Aggregated node/Pod status and % of resources in use
M2	Resource Usage Panel	CPU, RAM, Disk, and Network charts by namespace/app
M3	Pod Status Heatmap	Shows Pod status over time

2. Alerting

ID	Feature	Description
A1	Alert Rule Config	Configure conditions (e.g. CPU > 80% for 5 minutes)
A2	Severity Mapping	Map alert severity levels to recipients
A3	Alert Silence	Temporarily mute alerts during deploys or maintenance
A4	Alert History	Stores the alert log and handling status
A5	Alert Template	Ready-made rule templates for quick creation
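
The A1 example (CPU > 80% for 5 minutes) implies that the condition must hold continuously before an alert fires. Prometheus models this with the alert states inactive, pending, and firing. The sketch below imitates that behavior for a single series; the `ThresholdRule` class and its field names are hypothetical and are not taken from the product.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ThresholdRule:
    threshold: float      # e.g. 80 (percent)
    for_seconds: float    # e.g. 300 (5 minutes)
    # Time at which the current uninterrupted breach began, if any.
    _breach_started: Optional[float] = field(default=None, init=False, repr=False)

    def evaluate(self, value: float, now: float) -> str:
        """Return 'inactive', 'pending', or 'firing' for one evaluation."""
        if value <= self.threshold:
            # Any sample at or below the threshold resets the timer.
            self._breach_started = None
            return "inactive"
        if self._breach_started is None:
            self._breach_started = now
        if now - self._breach_started >= self.for_seconds:
            return "firing"
        return "pending"


rule = ThresholdRule(threshold=80, for_seconds=300)
states = [rule.evaluate(v, t) for t, v in [(0, 85), (240, 90), (300, 90), (360, 70)]]
```

The pending state is what prevents a short CPU spike from paging anyone: only a breach that lasts the full `for_seconds` window reaches the firing state.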

3. Notification

ID	Feature	Description
N1	Slack/Email/Webhook integration	Connects the alert delivery channels
N2	Retry on failure	Resends if the first attempt fails
N3	Alert Summary Digest	Sends an hourly/daily alert summary
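
N2 (retry on failure) is sketched below with exponential backoff. The source does not specify a retry policy, so the attempt count, the doubling delay, and the `send_with_retry` name are assumptions; `send` stands for any hypothetical callable that delivers one notification and raises on failure.

```python
import time
from typing import Callable


def send_with_retry(
    send: Callable[[], None],
    attempts: int = 3,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it succeeds; return the number of attempts used.

    The delay doubles after each failure (base, 2*base, 4*base, ...).
    The last failure is re-raised so the caller can record it.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            send()
            return attempt
        except Exception:
            if attempt == attempts:
                raise
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the function testable without real waiting, and re-raising the final error lets an alert history record the delivery as failed.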

LIST OF MONITORING METRICS & ALERTS

Dashboard	Source	Purpose	Main charts	Representative PromQL
Kubernetes / Networking / Namespace (Pods)	`kubernetes-mixin`	Monitor traffic between Pods in a namespace	Bytes in/out; Packets drop/error	`sum(rate(container_network_receive_bytes_total{namespace=~"$namespace"}[5m])) by (namespace)`
Kubernetes / Networking / Namespace (Workload)	`kubernetes-mixin`	Identify workloads that generate heavy network traffic	Network usage by workload	`sum by (workload) (rate(container_network_transmit_bytes_total{namespace=~"$namespace"}[5m]) * on (namespace, pod) group_left(workload) namespace_workload_pod:kube_pod_owner:relabel{namespace=~"$namespace"})`
Kubernetes / Networking / Pod	`kubernetes-mixin`	Debug Pods with network problems	Bytes per Pod; Error packets	`rate(container_network_receive_errors_total{pod=~"$pod"}[5m])`
Kubernetes / Networking / Workload	`kubernetes-mixin`	Monitor workloads with abnormal traffic	Bytes per workload	`sum by (workload) (rate(container_network_receive_bytes_total[5m]) * on (namespace, pod) group_left(workload) namespace_workload_pod:kube_pod_owner:relabel)`
Kubernetes / Persistent Volumes	`kubernetes-mixin`	Track disk usage	Volume usage %; IOPS read/write	`kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes`
Kubernetes / Proxy	`kubernetes-mixin`	Monitor kube-proxy	Proxy rules count; Sync latency	`rate(kubeproxy_sync_proxy_rules_duration_seconds_sum[5m])`
Kubernetes / Scheduler	`kubernetes-mixin`	Diagnose Pods stuck in Pending	Scheduling attempts; Latency metrics	`rate(scheduler_schedule_attempts_total[5m])`
Kubernetes / USE Method / Cluster (Windows)	`kubernetes-mixin`	Overview of Windows system performance	CPU Utilization; Disk IO	`rate(windows_cpu_time_total[5m])`
Kubernetes / USE Method / Node (Windows)	`kubernetes-mixin`	Windows node performance	CPU/Memory/Disk Errors	`windows_logical_disk_free_bytes / windows_logical_disk_size_bytes`
Node Exporter / AIX	`node-exporter-mixin`	Monitor AIX nodes (if any)	CPU, Mem, Disk	`node_cpu_seconds_total`
Node Exporter / MacOS	`node-exporter-mixin`	Monitor Mac nodes	CPU/Mem per process	`node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes`
Node Exporter / Nodes	`node-exporter-mixin`	Node performance	CPU idle, loadavg; Disk IO	`rate(node_cpu_seconds_total{mode="idle"}[5m])`
Node Exporter / USE Method / Cluster	`node-exporter-mixin`	System health overview	Cluster resource usage	`sum by (cluster) (node_load1) / count by (cluster) (node_cpu_seconds_total{mode="idle"})`
Node Exporter / USE Method / Node	`node-exporter-mixin`	Detailed node analysis	Node bottleneck	`node_disk_io_time_seconds_total`
Prometheus / Overview	`prometheus-mixin`	Monitor the Prometheus server	TSDB size; Query latency; Scrape failures	`prometheus_engine_query_duration_seconds`, `rate(prometheus_tsdb_head_chunks_created_total[5m])`
Prometheus / Remote Write	`prometheus-mixin`	Monitor remote write connections	Send duration; Queue length	`rate(prometheus_remote_storage_sent_bytes_total[5m])`

📌 Notes:

$namespace, $pod, and $workload are Grafana dashboard variables.
PromQL may vary with the version or exporter in use; the examples above follow the metric names commonly exposed by kube-prometheus-stack.
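
To make the role of the dashboard variables concrete, the sketch below expands `$name` placeholders in a query template. It only approximates what Grafana does (a single value is inserted as-is, several values become a regex alternation) and assumes the values contain no regex metacharacters. The `expand_variables` function is hypothetical and is not part of Grafana or Prometheus.

```python
import re


def expand_variables(promql: str, variables: dict) -> str:
    """Replace $name placeholders with the selected dashboard values.

    variables maps a variable name to a list of selected values.
    Placeholders without a matching entry are left untouched.
    """
    def replace(match):
        name = match.group(1)
        if name not in variables:
            return match.group(0)
        values = variables[name]
        if len(values) == 1:
            return values[0]
        # Several selections become an alternation for the =~ matcher.
        return "(" + "|".join(values) + ")"

    return re.sub(r"\$(\w+)", replace, promql)


template = 'rate(container_network_receive_errors_total{pod=~"$pod"}[5m])'
single = expand_variables(template, {"pod": ["api-0"]})
multi = expand_variables(template, {"pod": ["api-0", "api-1"]})
```

Because the templates use the `=~` regex matcher, both the single-value and the multi-value expansion remain valid label selectors.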

LIST OF ALERTS

No.	Alert Name	Threshold	Condition	Corresponding PromQL
1	Kubernetes Node Not ready	`< 1`	`lt`	`max(kube_node_status_condition{condition="Ready", status="true"}) by (node) < 1`
2	Kubernetes Pod CPU Usage > 80%	`> 80`	`gt`	`sum(rate(container_cpu_usage_seconds_total{image!=""}[5m])) by (pod) / sum(kube_pod_container_resource_limits{resource="cpu"}) by (pod) * 100 > 80`
3	Kubernetes Pod OOM Killed	`> 0`	`gt`	`increase(kube_pod_container_status_terminated_reason{reason="OOMKilled"}[5m]) > 0`
4	Kubernetes Pod memory Usage > 80%	`> 80`	`gt`	`sum(container_memory_usage_bytes{image!=""}) by (pod) / sum(kube_pod_container_resource_limits{resource="memory"}) by (pod) * 100 > 80`
5	Kubernetes Pod restarted	`> 1 (within 3 minutes)`	`gt`	`increase(kube_pod_container_status_restarts_total[3m]) > 1`
6	Node CPU Load > 80%	`> 80`	`gt`	`100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80`
7	Node file system available space < 10%	`< 10`	`lt`	`(node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10`
8	Node available memory < 15%	`< 15`	`lt`	`(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 < 15`
9	Persistent Volume Claim used > 80%	`> 80`	`gt`	`kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes * 100 > 80`
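
The arithmetic behind rows 6 and 7 can be checked by hand. The sketch below mirrors the two expressions with plain numbers; the sample values are invented for illustration and the function names are hypothetical.

```python
def cpu_busy_percent(idle_rates):
    """Mirror `100 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100`.

    idle_rates holds one idle fraction (0.0 to 1.0) per CPU core of a node.
    """
    if not idle_rates:
        raise ValueError("need at least one CPU core sample")
    return 100 - (sum(idle_rates) / len(idle_rates)) * 100


def available_percent(available_bytes, total_bytes):
    """Mirror `available / total * 100` from the file system and memory rows."""
    return available_bytes / total_bytes * 100


# Invented 4-core node: idle fractions over the last 5 minutes.
busy = cpu_busy_percent([0.25, 0.125, 0.125, 0.0])
cpu_alert = busy > 80        # row 6 fires above 80% busy

# Invented 100 GiB root file system with 8 GiB still available.
free = available_percent(8 * 2**30, 100 * 2**30)
disk_alert = free < 10       # row 7 fires below 10% available
```

Rows 7 and 8 compare the available share rather than the used share, which is why their threshold is a lower bound while the CPU and PVC rows use an upper bound.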

Alert delivery channels:

Email, Slack, Telegram
Webhook to the operations system
Centralized alert management UI showing alerts over time