Worker health - Temporal Cloud feature guide
This page is a guide to monitoring a Temporal Worker fleet and covers the following scenarios:
- Configuring minimal observations
- How to detect a backlog of Tasks
- How to detect greedy Worker resources
- How to detect misconfigured Workers
- How to configure Sticky cache
Minimal Observations
These alerts should be configured and understood first to gain insight into the health and behavior of your application.
- Create monitors and alerts for the Schedule-To-Start latency SDK metrics (for both Workflow Executions and Activity Executions). See the Detect Task Backlog section for sample queries and the appropriate responses to these values.
  - Alert at >200ms for your p99 value
  - Plot >100ms for your p95 value
- Create a Grafana panel called Sync Match Rate (a sample query follows this list). See the Sync Match Rate section for example queries and the appropriate responses to these values.
  - Alert at <95% for your p99 value
  - Plot <99% for your p95 value
- Create a Grafana panel called Poll Success Rate (a sample query follows this list). See the Detect greedy Workers section for example queries and the appropriate responses to these values.
  - Alert at <90% for your p99 value
  - Plot <95% for your p95 value
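The two Grafana panels above can be driven by the Temporal Cloud poll metrics. The queries below are a sketch rather than canonical dashboards: they assume the Cloud metrics endpoint is scraped into Prometheus, that temporal_cloud_v0_poll_timeout_count is available alongside the poll success counters listed under Detect Task Backlog, and that temporal_namespace is the grouping label you want; adjust the label names to match what your endpoint actually exposes.
Sync Match Rate, sample query
sum(rate(temporal_cloud_v0_poll_success_sync_count[5m])) by (temporal_namespace)
/
sum(rate(temporal_cloud_v0_poll_success_count[5m])) by (temporal_namespace)
Poll Success Rate, sample query
(sum(rate(temporal_cloud_v0_poll_success_count[5m])) by (temporal_namespace) + sum(rate(temporal_cloud_v0_poll_success_sync_count[5m])) by (temporal_namespace))
/
(sum(rate(temporal_cloud_v0_poll_success_count[5m])) by (temporal_namespace) + sum(rate(temporal_cloud_v0_poll_success_sync_count[5m])) by (temporal_namespace) + sum(rate(temporal_cloud_v0_poll_timeout_count[5m])) by (temporal_namespace))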
The following alerts build on the above to dig deeper into specific potential causes of the Worker-related issues you might be experiencing.
- Create monitors and alerts for the temporal_worker_task_slots_available SDK metric (sample queries follow this list). See the Detect misconfigured Workers section for appropriate responses based on the value.
  - Alert at 0 for your p99 value
- Create monitors for the temporal_sticky_cache_size SDK metric. See the Configure Sticky Cache section for more details on this configuration.
  - Plot at {value} > {WorkflowCacheSize.Value}
- Create monitors for the temporal_sticky_cache_total_forced_eviction SDK metric. This metric is available in the Go and Java SDKs only. See the Configure Sticky Cache section for more details and appropriate responses.
  - Alert at >{predetermined_high_number}
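As a sketch of how these monitors might look in Prometheus, the queries below assume the SDK metrics carry namespace, task_queue, and worker_type labels and use the metric names as emitted by the SDK; both the label names and any suffix added by your metrics exporter (for example a _total suffix on counters) vary by SDK and configuration, so verify them against your own scrape before alerting on them.
Task slots exhausted, sample alert expression (fires when any Worker type reports zero available slots)
min(temporal_worker_task_slots_available) by (namespace, task_queue, worker_type) == 0
Sticky cache size, sample query (plot against your configured WorkflowCacheSize)
max(temporal_sticky_cache_size) by (namespace, task_queue)
Sticky cache forced evictions, sample query (Go and Java SDKs only)
sum(rate(temporal_sticky_cache_total_forced_eviction[5m])) by (namespace, task_queue)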
Detect Task Backlog
How to detect a backlog of Tasks.
Metrics to monitor:
- SDK metric: workflow_task_schedule_to_start_latency
- SDK metric: activity_schedule_to_start_latency
- Temporal Cloud metric: temporal_cloud_v0_poll_success_count
- Temporal Cloud metric: temporal_cloud_v0_poll_success_sync_count
Schedule-To-Start latency
The Schedule-To-Start metric represents how long Tasks are sitting, unprocessed, in the Task Queues. Put differently, it is the time between when a Task is enqueued and when it is picked up by a Worker. A high value usually means that your Workers can't keep up: either increase the number of Workers (if the host load is already high) or increase the number of pollers per Worker.
If your Schedule-To-Start latency alert triggers or the latency is high, check the Sync Match Rate to decide whether you need to adjust your Worker fleet or contact Temporal Cloud support. If your Sync Match Rate is low, contact Temporal Cloud support.
The schedule_to_start_latency SDK metric for both Workflow Executions and Activity Executions should have alerts.
Prometheus query samples
Workflow Task Latency, 99th percentile
histogram_quantile(0.99, sum(rate(temporal_workflow_task_schedule_to_start_latency_seconds_bucket[5m])) by (le, namespace, task_queue))
Workflow Task Latency, average
sum(increase(temporal_workflow_task_schedule_to_start_latency_seconds_sum[5m])) by (namespace, task_queue)
/
sum(increase(temporal_workflow_task_schedule_to_start_latency_seconds_count[5m])) by (namespace, task_queue)
Activity Task Latency, 99th percentile
histogram_quantile(0.99, sum(rate(temporal_activity_schedule_to_start_latency_seconds_bucket[5m])) by (le, namespace, task_queue))
Activity Task Latency, average
sum(increase(temporal_activity_schedule_to_start_latency_seconds_sum[5m])) by (namespace, task_queue)
/
sum(increase(temporal_activity_schedule_to_start_latency_seconds_count[5m])) by (namespace, task_queue)
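To turn the Minimal Observations thresholds into alerts, wrap the percentile queries above in a comparison. The expressions below reuse the 99th-percentile queries from this section; the 0.2 threshold corresponds to the suggested 200ms p99 alert and should be tuned to your workload.
Workflow Task p99 above 200ms, sample alert expression
histogram_quantile(0.99, sum(rate(temporal_workflow_task_schedule_to_start_latency_seconds_bucket[5m])) by (le, namespace, task_queue)) > 0.2
Activity Task p99 above 200ms, sample alert expression
histogram_quantile(0.99, sum(rate(temporal_activity_schedule_to_start_latency_seconds_bucket[5m])) by (le, namespace, task_queue)) > 0.2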
Target
This latency should be very low, close to zero. Anything higher indicates a bottleneck: Workers are not picking up Tasks as fast as they are being scheduled.