Trace Sampling Configuration

Introduction

When APO OneAgent collects trace data, it saves every trace in full by default. You can enable data sampling by modifying the configuration. Sampling significantly reduces the volume of stored trace data and lowers storage costs. Depending on the environment and sampling settings, storage costs can drop to 30% or less of the original.

Unlike traditional head sampling and tail sampling, APO samples trace data with a distributed sampling strategy. Distributed sampling decides whether to save the current data at the edge (the host where the service runs), and only slow traces, error traces, and a small number of normal traces are saved, which reduces overall system resource overhead. After sampling is enabled, only the sampled data is visible in "full link".
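The edge-side decision described above can be pictured with a small sketch. This is an illustration under stated assumptions, not APO's actual implementation: the function name `keep_trace`, the `slow_threshold_s` parameter, and the idea that normal traces are thinned with probability 1 / 2^sample_value are all hypothetical, chosen to match the sampling values documented for apo-collector-config.

```python
import random


def keep_trace(duration_s, is_error, slow_threshold_s=10.0, sample_value=4, rng=random):
    """Hypothetical edge-side decision; NOT APO's real code.

    Error traces and slow traces are always kept. Normal traces are kept
    with probability 1 / 2^sample_value (an assumption for illustration).
    """
    if is_error:
        return True
    if duration_s >= slow_threshold_s:
        return True
    return rng.random() < 1.0 / (2 ** sample_value)


# Simulate 10,000 fast, successful traces at sample_value=4 (about 1 in 16 kept).
rng = random.Random(42)
traces = [(0.2, False)] * 10000
kept = sum(keep_trace(duration, error, rng=rng) for duration, error in traces)
print(f"kept {kept} of {len(traces)} normal traces")
```

Because the decision uses only information available on the host, no trace has to be shipped to a central component before it can be dropped, which is where the resource savings come from.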

Note

If you install APO by integrating it with an existing OpenTelemetry or SkyWalking installation, the sampling feature is not supported.

Configuration

To enable the sampling feature, follow these steps:

1. Open the configuration file apo-otel-collector-agent-config.

  • Kubernetes
kubectl edit cm apo-otel-collector-agent-config -n apo

2. Change processors.backsampling.adaptive.enable to true.

3. Open the configuration file apo-collector-config.

  • Kubernetes
kubectl edit cm apo-collector-config -n apo

4. Change sample.enable to true.

5. Restart apo-one-agent and apo-collector.

  • Kubernetes
kubectl rollout restart deployment apo-collector -n apo
kubectl rollout restart daemonset apo-one-agent -n apo

Advanced Configuration

Note

Please ensure you understand the parameters you are modifying before proceeding.

Additional settings in apo-otel-collector-agent-config. Restart apo-one-agent after editing.

processors:
  backsampling:
    adaptive:
      enable: true
      # With dynamic sampling enabled, some ultra-slow requests may not be captured.
      # Set the trigger threshold so that one ultra-slow request is collected every M seconds.
      span_slow_threshold: 10s
      # With dynamic sampling enabled, some low-frequency URL requests may also not be captured.
      # Set how many different URLs to collect every M seconds.
      service_sample_count: 1
      # Window reset time (M): data is collected once every M seconds.
      service_sample_window: 5s
      # Interval at which the probe's memory usage is sent to the Receiver to obtain a unified sampling rate.
      memory_check_interval: 2s
      # Memory threshold, in MiB.
      # The Receiver calculates the sampling rate from the current memory usage and this threshold.
      memory_limit_mib_threshold: 500
      # How long a sampled TraceId stays cached.
      # Some upstream requests in a trace may take a while to complete, so a hit TraceId must be cached for a period of time.
      traceid_holdtime: 60s
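To make the interaction between service_sample_count and service_sample_window concrete, here is a minimal sketch of a windowed limiter. It is an assumption about how such a limiter could behave, not the agent's actual code: the class `UrlWindowSampler`, its method `admit`, and the rule that each admitted URL is collected once per window are hypothetical.

```python
class UrlWindowSampler:
    """Hypothetical sketch: admit up to `count` distinct URLs per time window."""

    def __init__(self, count=1, window_s=5.0):
        self.count = count          # mirrors service_sample_count
        self.window_s = window_s    # mirrors service_sample_window, in seconds
        self.window_start = None
        self.seen = set()

    def admit(self, url, now_s):
        # Start a fresh window once the previous one has expired.
        if self.window_start is None or now_s - self.window_start >= self.window_s:
            self.window_start = now_s
            self.seen.clear()
        # A URL already collected in this window is not collected again.
        if url in self.seen:
            return False
        # Admit new URLs until the per-window budget is used up.
        if len(self.seen) < self.count:
            self.seen.add(url)
            return True
        return False


sampler = UrlWindowSampler(count=1, window_s=5.0)
for url, t in [("/a", 0.0), ("/b", 1.0), ("/a", 2.0), ("/b", 5.0)]:
    print(t, url, sampler.admit(url, t))
```

With the values shown in the configuration (count 1, window 5s), a rarely called URL still gets a chance to be collected whenever a new window opens, even when the global sampling rate is low.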

Additional settings in apo-collector-config. Restart apo-collector after editing.

sample:
  # Enable dynamic sampling.
  enable: true
  # All values below are sample_values, ranging from 0 to 10. Setting a value too high is pointless (at 20, only about one in a million is kept).
  # sample_rate = 1 / 2^sample_value
  # sample_value | sample_rate
  # 0            | 100%
  # 4            | 1 / 16
  # 10           | 1 / 1024

  # Default initial sampling value (0), i.e., 1 / 2^0 = 100%.
  min_sample: 0
  # Sampling value applied when memory usage first reaches 80% (4), i.e., 1 / 2^4 = 1 / 16.
  init_sample: 4
  # If memory keeps growing, the calculated sampling value rises quickly, up to max_sample.
  # At that point the sampling rate drops sharply and memory usage slowly decreases.
  # Maximum sampling value (10), i.e., 1 / 2^10 = 1 / 1024.
  max_sample: 10
  # At every reset_sample_period interval, check the current memory usage to decide whether to lower the sampling value (i.e., raise the sampling rate), so that enough data is sampled.
  # Because a higher sampling rate also raises memory usage, recovery is deliberately slow to avoid sharp swings and keep the data stable.
  reset_sample_period: 30m
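The relationship sample_rate = 1 / 2^sample_value from the comments above can be checked with a few lines of Python. The helper name `sample_rate` is only illustrative and is not part of APO.

```python
def sample_rate(sample_value: int) -> float:
    """Return the fraction of traces kept for a given sample_value."""
    if sample_value < 0:
        raise ValueError("sample_value must be non-negative")
    return 1.0 / (1 << sample_value)  # 1 << n equals 2^n


# Print the rate for min_sample, init_sample, max_sample, and the
# out-of-range value 20 mentioned in the comments.
for value in (0, 4, 10, 20):
    print(f"sample_value={value:>2} -> 1/{1 << value} ({sample_rate(value):.6%})")
```

Each step of 1 in sample_value halves the amount of data kept, which is why values far above max_sample leave almost nothing to inspect.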