SkyWalking 8.4 provides infrastructure monitoring
Origin: Tetrate.io blog
Background
Apache SkyWalking, the APM tool for distributed systems, has historically focused on providing observability around tracing and metrics, but service performance is often affected by the host. The newest release, SkyWalking 8.4.0, introduces a new feature for monitoring virtual machines. Users can easily detect possible problems from the dashboard, for example when the CPU is overloaded, when memory or disk space is running low, or when the network is unhealthy.
How it works
SkyWalking leverages Prometheus and OpenTelemetry to collect metrics data, as we did for Istio control plane metrics. Prometheus is mature and widely used, and we expect to see increased adoption of the new CNCF project, OpenTelemetry. The SkyWalking OAP Server receives the metrics in OpenCensus format from OpenTelemetry. The process is as follows:
- Prometheus Node Exporter collects metrics data from the VMs.
- OpenTelemetry Collector fetches metrics from the Node Exporters via the Prometheus Receiver and pushes them to the SkyWalking OAP Server via the OpenCensus gRPC Exporter.
- The SkyWalking OAP Server parses the expressions with MAL (Meter Analysis Language) to filter/calculate/aggregate the metrics and store the results. The expression rules are in /config/otel-oc-rules/vm.yaml.
- We can now see the data on the SkyWalking WebUI dashboard.
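To make the filter/calculate/aggregate step more concrete, here is a minimal Python sketch of the idea. It is not MAL and it does not reproduce the rules shipped in vm.yaml. The sample values and the `host` label are hypothetical, chosen only to show how non-idle CPU samples could be filtered and then summed per machine.

```python
from collections import defaultdict

# Hypothetical samples shaped like scraped metrics: (metric name, labels, value).
# The "host" label is an assumption made for this illustration.
SAMPLES = [
    ("node_cpu_seconds_total", {"host": "vm1", "mode": "idle"}, 900.0),
    ("node_cpu_seconds_total", {"host": "vm1", "mode": "user"}, 60.0),
    ("node_cpu_seconds_total", {"host": "vm1", "mode": "system"}, 40.0),
    ("node_cpu_seconds_total", {"host": "vm2", "mode": "idle"}, 500.0),
    ("node_cpu_seconds_total", {"host": "vm2", "mode": "user"}, 500.0),
    ("node_memory_MemTotal_bytes", {"host": "vm1"}, 8e9),
]


def busy_cpu_seconds(samples):
    """Keep non-idle CPU samples (filter) and sum them per host (aggregate)."""
    totals = defaultdict(float)
    for name, labels, value in samples:
        if name != "node_cpu_seconds_total":
            continue  # filter by metric name
        if labels.get("mode") == "idle":
            continue  # filter out idle time
        totals[labels["host"]] += value  # aggregate by host
    return dict(totals)


print(busy_cpu_seconds(SAMPLES))
```

In the real pipeline this kind of logic is expressed declaratively in the rule file, and the OAP Server evaluates it on every batch of metrics it receives.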
What to monitor
SkyWalking provides default monitoring metrics including:
- CPU Usage (%)
- Memory RAM Usage (MB)
- Memory Swap Usage (MB)
- CPU Average Used
- CPU Load
- Memory RAM (total/available/used MB)
- Memory Swap (total/free MB)
- File System Mount point Usage (%)
- Disk R/W (KB/s)
- Network Bandwidth Usage (receive/transmit KB/s)
- Network Status (tcp_curr_estab/tcp_tw/tcp_alloc/sockets_used/udp_inuse)
- File fd Allocated
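Several of these metrics are rates (for example KB/s), while Node Exporter exposes cumulative counters such as node_network_receive_bytes_total. The sketch below shows the arithmetic for turning two counter readings into a rate. The function name is hypothetical, and the choice of 1024 bytes per KB is an assumption rather than a statement about how SkyWalking computes its values.

```python
def rate_kb_per_s(prev_bytes, curr_bytes, interval_s):
    """Convert two readings of a cumulative byte counter into KB/s.

    Assumes 1 KB = 1024 bytes. A reading lower than the previous one is
    treated as a counter reset, so only the new value is counted.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta = curr_bytes - prev_bytes
    if delta < 0:
        delta = curr_bytes  # counter restarted from zero
    return delta / interval_s / 1024


# Two scrapes taken 10 seconds apart, 102400 bytes received in between.
print(rate_kb_per_s(1_000_000, 1_102_400, 10))  # 10.0
```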
The following is how it looks when we monitor Linux:
How to use
To enable this feature, we need to install Prometheus Node Exporter and OpenTelemetry Collector and activate the VM monitoring rules in SkyWalking OAP Server.
Install Prometheus Node Exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.0.1/node_exporter-1.0.1.linux-amd64.tar.gz
tar xvfz node_exporter-1.0.1.linux-amd64.tar.gz
cd node_exporter-1.0.1.linux-amd64
./node_exporter
On Linux, Node Exporter exposes metrics on port 9100 by default. Once it is running, the metrics are available from the /metrics endpoint. Use a web browser or curl to verify:
curl http://localhost:9100/metrics
The output should list the exposed metrics, for example:
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 7.7777e-05
go_gc_duration_seconds{quantile="0.25"} 0.000113756
go_gc_duration_seconds{quantile="0.5"} 0.000127199
go_gc_duration_seconds{quantile="0.75"} 0.000147778
go_gc_duration_seconds{quantile="1"} 0.000371894
go_gc_duration_seconds_sum 0.292994058
go_gc_duration_seconds_count 2029
...
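If we prefer to check the endpoint from a script instead of reading the output by eye, a small parser for the text format is enough. This is a minimal sketch that assumes the default address used above. The helper names are hypothetical, and the parser only extracts metric names, not values or labels.

```python
from urllib.request import urlopen


def metric_names(exposition_text):
    """Return the set of metric names in Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The name ends at the first '{' (labels) or space (value).
        end = len(line)
        for sep in ("{", " "):
            idx = line.find(sep)
            if idx != -1:
                end = min(end, idx)
        names.add(line[:end])
    return names


def fetch_metric_names(url="http://localhost:9100/metrics"):
    """Fetch the endpoint and return the metric names it exposes."""
    with urlopen(url, timeout=5) as resp:
        return metric_names(resp.read().decode("utf-8"))
```

With Node Exporter running, `any(n.startswith("node_") for n in fetch_metric_names())` should be true, since host metrics use the node_ prefix.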
Note: We only need to install Node Exporter, not the Prometheus server. For more information about Prometheus Node Exporter, see: https://prometheus.io/docs/guides/node-exporter/
Install OpenTelemetry Collector
We can quickly install an OpenTelemetry Collector instance by using docker-compose with the following steps:
- Create a directory to store the configuration files, such as /usr/local/otel.
- Create docker-compose.yaml and otel-collector-config.yaml in this directory with the contents shown below:
docker-compose.yaml
version: "2"
services:
  # Collector
  otel-collector:
    # Specify the image to start the container from
    image: otel/opentelemetry-collector:0.19.0
    # Set the otel-collector config file
    command: ["--config=/etc/otel-collector-config.yaml"]
    # Map the config file from the host directory
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "13133:13133" # health_check extension
      - "55678" # OpenCensus receiver
otel-collector-config.yaml
extensions:
  health_check:
# A receiver is how data gets into the OpenTelemetry Collector
receivers:
  # Set the Prometheus Receiver to collect metrics from targets
  # It supports the full set of Prometheus configuration
  prometheus:
    config:
      scrape_configs:
        - job_name: 'otel-collector'
          scrape_interval: 10s
          static_configs:
            # Replace with the addresses of your VMs that have Node Exporter installed
            - targets: [ 'vm1:9100' ]
            - targets: [ 'vm2:9100' ]
            - targets: [ 'vm3:9100' ]
processors:
  batch:
# An exporter is how data gets sent to different systems/back-ends
exporters:
  # Exports metrics via gRPC using OpenCensus format
  opencensus:
    endpoint: "docker.for.mac.host.internal:11800" # The OAP Server address
    insecure: true
  logging:
    logLevel: debug
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [logging, opencensus]
  extensions: [health_check]
- In this directory, use docker-compose to start the container:
docker-compose up -d
After the container is up and running, you should see metrics already exported in the logs:
...
Metric #165
Descriptor:
-> Name: node_network_receive_compressed_total
-> Description: Network device statistic receive_compressed.
-> Unit:
-> DataType: DoubleSum
-> IsMonotonic: true
-> AggregationTemporality: AGGREGATION_TEMPORALITY_CUMULATIVE
DoubleDataPoints #0
Data point labels:
-> device: ens4
StartTime: 1612234754364000000
Timestamp: 1612235563448000000
Value: 0.000000
DoubleDataPoints #1
Data point labels:
-> device: lo
StartTime: 1612234754364000000
Timestamp: 1612235563448000000
Value: 0.000000
...
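Besides reading the logs, we can probe the health_check extension, which docker-compose.yaml publishes on port 13133. The sketch below assumes the extension answers plain HTTP on that port and returns status 200 when the Collector is up. The function name and the injectable `opener` parameter are hypothetical conveniences for this example.

```python
from urllib.request import urlopen


def collector_is_healthy(url="http://localhost:13133", opener=urlopen):
    """Return True if the health endpoint responds with HTTP 200.

    Connection failures and HTTP errors raise OSError subclasses in
    urllib, so they are reported as "not healthy" instead of crashing.
    """
    try:
        with opener(url, timeout=3) as resp:
            return resp.status == 200
    except OSError:
        return False


if __name__ == "__main__":
    print("collector healthy:", collector_is_healthy())
```

A check like this can run in a loop after docker-compose up -d to wait for the Collector before starting the OAP Server.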
For more information about OpenTelemetry Collector, see: https://opentelemetry.io/docs/collector/
Set up SkyWalking OAP Server
To activate the OpenCensus (oc) handler and the VM rules, set these environment variables:
SW_OTEL_RECEIVER=default
SW_OTEL_RECEIVER_ENABLED_OC_RULES=vm
Note: If other rules are already activated, add vm to them, using a comma (,) as the separator:
SW_OTEL_RECEIVER_ENABLED_OC_RULES=vm,oap
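When the variable is managed by a deployment script, the comma-separated value can be built without duplicating entries. This is a small sketch. The helper name is hypothetical, and it only manipulates the string; it does not talk to SkyWalking.

```python
def enabled_oc_rules(existing, new_rule="vm"):
    """Add new_rule to a comma-separated rule list unless already present."""
    rules = [r.strip() for r in existing.split(",") if r.strip()]
    if new_rule not in rules:
        rules.append(new_rule)
    return ",".join(rules)


print(enabled_oc_rules("oap"))     # oap,vm
print(enabled_oc_rules("vm,oap"))  # vm,oap
print(enabled_oc_rules(""))        # vm
```

The result can then be exported as SW_OTEL_RECEIVER_ENABLED_OC_RULES before the OAP Server starts.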
Start the SkyWalking OAP Server.
Done!
After all of the above steps are completed, check out the SkyWalking WebUI. The VM dashboard provides the default metrics of all observed virtual machines.
Note: Clear the browser local cache if you used it to access deployments of previous SkyWalking versions.
Additional Resources
- Read more about the SkyWalking 8.4 release highlights.
- Get more SkyWalking updates on Twitter.