OpenTelemetry Source Connector - Operational Runbook¶
This runbook provides operational guidance for monitoring, troubleshooting, and maintaining the OpenTelemetry Source Connector in production.
Table of Contents¶
- Incident Response Decision Tree
- Monitoring
- Common Issues
- Performance Tuning
- Troubleshooting
- Recovery Procedures
- Capacity Planning
- Contact and Escalation
Incident Response Decision Tree¶
Use this decision tree when an alert fires or an issue is reported:
START: Alert or Issue Reported
│
├─ Is connector in RUNNING state?
│ │ curl http://localhost:8083/connectors/<name>/status | jq '.connector.state'
│ │
│ ├─ NO → Go to [Connector Not Running](#issue-connector-not-running)
│ └─ YES → Continue ↓
│
├─ Are OTLP receivers listening?
│ │ netstat -an | grep 4317 (gRPC)
│ │ netstat -an | grep 4318 (HTTP)
│ │
│ ├─ NO → Go to [Receivers Not Started](#issue-1-otlp-receivers-not-listening)
│ └─ YES → Continue ↓
│
├─ Are messages being received? (TotalMessagesReceived increasing)
│ │ Check JMX: io.conduktor.connect.otel:*/TotalMessagesReceived
│ │
│ ├─ NO → Go to [No Messages Received](#issue-2-no-messages-received)
│ └─ YES → Continue ↓
│
├─ Are queues near capacity? (MaxQueueUtilizationPercent > 80%)
│ │ Check JMX: io.conduktor.connect.otel:*/MaxQueueUtilizationPercent
│ │
│ ├─ YES → Go to [High Message Drop Rate](#issue-3-high-message-drop-rate)
│ └─ NO → Continue ↓
│
├─ Is Kafka write lagging? (RecordsProduced << TotalMessagesReceived)
│ │ Check lag: TotalMessagesReceived - RecordsProduced > 10000
│ │
│ ├─ YES → Go to [Processing Lag](#issue-4-processing-lag-building-up)
│ └─ NO → Continue ↓
│
└─ Check logs for errors
│ grep "ERROR\|WARN" $KAFKA_HOME/logs/connect.log | tail -50
│
├─ Errors found → Match error to [Common Issues](#common-issues)
└─ No errors → Monitor, escalate if issue persists
Quick Commands Reference¶
# Check connector status
curl -s http://localhost:8083/connectors/<name>/status | jq .
# Restart connector
curl -X POST http://localhost:8083/connectors/<name>/restart
# Restart specific task
curl -X POST http://localhost:8083/connectors/<name>/tasks/0/restart
# Check recent logs
tail -100 $KAFKA_HOME/logs/connect.log | grep -E "OpenTelemetry|ERROR|WARN"
# Check OTLP endpoints
netstat -an | grep -E "4317|4318"
# Test gRPC endpoint
grpcurl -plaintext localhost:4317 list
# Test HTTP endpoint
curl -v http://localhost:4318/v1/traces
Issue: Connector Not Running¶
Symptoms: Connector state is FAILED or task state is FAILED
Quick Fix:
# Check the error message
curl -s http://localhost:8083/connectors/<name>/status | jq '.tasks[0].trace'
# Restart connector
curl -X POST http://localhost:8083/connectors/<name>/restart
# If still failing, check config and redeploy
curl -X DELETE http://localhost:8083/connectors/<name>
curl -X POST http://localhost:8083/connectors -H "Content-Type: application/json" -d @connector.json
Common Causes:
- Invalid configuration (bad port, invalid format)
- Port already in use (4317 or 4318 bound to another process)
- Missing dependencies (ClassNotFoundException)
- Insufficient permissions to bind to ports
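A quick triage sketch that scans the task trace for the usual Java failure signatures (the connector name is a placeholder; the exception names are standard JVM/Kafka Connect classes, not connector-specific strings):

```bash
# Triage sketch: look for the common failure signatures in the task trace.
curl -s http://localhost:8083/connectors/otlp-source/status \
  | jq -r '.tasks[].trace // empty' \
  | grep -E -m1 'BindException|ClassNotFoundException|ConfigException' \
  || echo "no known signature - read the full trace"
```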
Monitoring¶
Key JMX Metrics¶
The connector exposes JMX metrics under io.conduktor.connect.otel:type=OpenTelemetryConnector,name=<connector-name>:
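For ad-hoc reads without a JMX console, Kafka's bundled JmxTool can dump attributes from this MBean. A sketch, assuming the Connect worker exposes JMX on localhost:9999 and the connector is named "otlp-source" (substitute your own values):

```bash
# One-shot JMX read of the connector's headline counters.
kafka-run-class kafka.tools.JmxTool \
  --object-name 'io.conduktor.connect.otel:type=OpenTelemetryConnector,name=otlp-source' \
  --attributes TotalMessagesReceived,RecordsProduced,MaxQueueUtilizationPercent \
  --jmx-url service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi \
  --one-time true
```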
Counter Metrics¶
- TracesReceived: Total traces received from OTLP
  - Alert: Rate drops to 0 for > 5 minutes → Connection issue or no traffic
  - Action: Check OTLP endpoint accessibility, verify applications are sending
- MetricsReceived: Total metrics received from OTLP
  - Alert: Rate drops to 0 for > 5 minutes → Connection issue or no traffic
  - Action: Check OTLP endpoint accessibility, verify applications are sending
- LogsReceived: Total logs received from OTLP
  - Alert: Rate drops to 0 for > 5 minutes → Connection issue or no traffic
  - Action: Check OTLP endpoint accessibility, verify applications are sending
- TracesDropped: Traces dropped due to queue full
  - Alert: Drop rate > 1% → Queue capacity insufficient
  - Action: Increase otlp.message.queue.size or optimize Kafka throughput
- MetricsDropped: Metrics dropped due to queue full
  - Alert: Drop rate > 1% → Queue capacity insufficient
  - Action: Increase otlp.message.queue.size or optimize Kafka throughput
- LogsDropped: Logs dropped due to queue full
  - Alert: Drop rate > 1% → Queue capacity insufficient
  - Action: Increase otlp.message.queue.size or optimize Kafka throughput
- RecordsProduced: Total records written to Kafka
  - Alert: Lag (TotalMessagesReceived - RecordsProduced) > 10000 → Processing backlog
  - Action: Check Kafka broker health, review consumer lag
Queue Metrics¶
- TracesQueueSize: Current number of traces in queue
- MetricsQueueSize: Current number of metrics in queue
- LogsQueueSize: Current number of logs in queue
- QueueCapacity: Maximum queue capacity (per signal)
- MaxQueueUtilizationPercent: Highest utilization across all three queues
  - Alert: Utilization > 80% → Approaching capacity
  - Action: Monitor for drops, consider increasing queue size
Derived Metrics¶
- TotalMessagesReceived: TracesReceived + MetricsReceived + LogsReceived
- TotalMessagesDropped: TracesDropped + MetricsDropped + LogsDropped
- TotalLagCount: TotalMessagesReceived - RecordsProduced
  - Alert: > 10000 → Processing backlog
  - Action: Review Kafka producer performance
- DropRate: (TotalMessagesDropped / TotalMessagesReceived) × 100
  - Alert: > 1% → Significant message loss
  - Action: Increase queue size or optimize throughput
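The derived values are plain arithmetic over the counters. A sketch with hand-filled sample values (in practice they would come from a JMX sample, as in the JmxTool example above):

```bash
# Derived-metric arithmetic; the three inputs are illustrative.
received=120000   # TotalMessagesReceived
dropped=350       # TotalMessagesDropped
produced=112500   # RecordsProduced
echo "TotalLagCount: $((received - produced))"
awk -v r="$received" -v d="$dropped" \
  'BEGIN { printf "DropRate: %.2f%%\n", r ? d / r * 100 : 0 }'
```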
Recommended Alerts¶
# Example Prometheus alerting rules
groups:
  - name: otlp_connector
    rules:
      - alert: OTLPNoMessagesReceived
        expr: rate(otlp_TotalMessagesReceived[5m]) == 0
        for: 5m
        annotations:
          summary: "OTLP connector not receiving messages"
          description: "No messages received for 5 minutes on {{ $labels.connector_name }}"
      - alert: HighMessageDropRate
        expr: otlp_DropRate > 1
        for: 5m
        annotations:
          summary: "High message drop rate for {{ $labels.connector_name }}"
          description: "Dropping {{ $value }}% of messages - queue capacity insufficient"
      - alert: HighProcessingLag
        expr: otlp_TotalLagCount > 10000
        for: 5m
        annotations:
          summary: "High processing lag for {{ $labels.connector_name }}"
          description: "Lag count: {{ $value }} - Kafka throughput issue"
      - alert: HighQueueUtilization
        expr: otlp_MaxQueueUtilizationPercent > 80
        for: 3m
        annotations:
          summary: "High queue utilization for {{ $labels.connector_name }}"
          description: "Queue at {{ $value }}% capacity - risk of drops"
      - alert: TracesQueueDropping
        expr: increase(otlp_TracesDropped[5m]) > 0
        annotations:
          summary: "Traces being dropped on {{ $labels.connector_name }}"
          description: "{{ $value }} traces dropped in last 5 minutes"
      - alert: MetricsQueueDropping
        expr: increase(otlp_MetricsDropped[5m]) > 0
        annotations:
          summary: "Metrics being dropped on {{ $labels.connector_name }}"
          description: "{{ $value }} metrics dropped in last 5 minutes"
      - alert: LogsQueueDropping
        expr: increase(otlp_LogsDropped[5m]) > 0
        annotations:
          summary: "Logs being dropped on {{ $labels.connector_name }}"
          description: "{{ $value }} logs dropped in last 5 minutes"
Common Issues¶
Issue 1: OTLP Receivers Not Listening¶
Symptoms:
- netstat -an | grep 4317 shows nothing
- netstat -an | grep 4318 shows nothing
- Applications cannot connect to OTLP endpoints
- Logs showing "Failed to start OTLP receiver"
Root Causes:
1. Ports already in use by another process
2. Insufficient permissions to bind to ports
3. Firewall blocking ports
4. Connector failed to start
Resolution:
# Check what's using the ports
lsof -i :4317
lsof -i :4318
# If ports are in use, either:
# 1. Stop the conflicting process
# 2. Use different ports in connector config:
{
"otlp.grpc.port": "14317",
"otlp.http.port": "14318"
}
# Check firewall rules
sudo ufw status
sudo firewall-cmd --list-ports
# Allow ports if needed
sudo ufw allow 4317/tcp
sudo ufw allow 4318/tcp
# Restart connector
curl -X POST http://localhost:8083/connectors/otlp-source/restart
Issue 2: No Messages Received¶
Symptoms:
- TotalMessagesReceived metric not incrementing
- OTLP endpoints are listening
- No errors in logs
Root Causes:
1. Applications not configured to send to connector
2. Applications using wrong endpoint
3. Network connectivity issues
4. Authentication/firewall blocking
Resolution:
# Verify endpoints are accessible
grpcurl -plaintext localhost:4317 list
curl -v http://localhost:4318/v1/traces
# Check application configuration
# Should point to connector endpoints:
export OTEL_EXPORTER_OTLP_ENDPOINT=http://<connector-host>:4318
export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
# Test sending data manually
# Using otel-cli:
otel-cli exec --endpoint http://localhost:4318 \
--service my-test-service -- echo "test"
# Check connector logs for incoming requests
tail -f $KAFKA_HOME/logs/connect.log | grep "event=trace_received\|event=metric_received\|event=log_received"
# Actions:
1. Verify application OTEL_EXPORTER_OTLP_ENDPOINT points to connector
2. Check network connectivity from app to connector
3. Verify firewall allows inbound on 4317, 4318
4. Test with manual data send
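If grpcurl and otel-cli are unavailable, OTLP/HTTP also defines a JSON encoding, and an empty-but-valid payload is enough to confirm the receiver answers (assuming the connector's HTTP receiver accepts JSON; if it only accepts protobuf, a 400 still proves the listener is up):

```bash
# Minimal reachability probe using the OTLP/HTTP JSON encoding.
# An empty resourceSpans array is a valid request body.
curl -s -o /dev/null -w '%{http_code}\n' \
  -X POST http://localhost:4318/v1/traces \
  -H 'Content-Type: application/json' \
  -d '{"resourceSpans":[]}'
```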
Issue 3: High Message Drop Rate¶
Symptoms:
- DropRate metric > 1%
- TracesDropped, MetricsDropped, or LogsDropped incrementing
- MaxQueueUtilizationPercent at or near 100%
- Logs showing "Queue full, dropping message" or "event=trace_dropped"
Root Causes:
1. Queue capacity too small for traffic volume
2. Kafka broker throughput bottleneck
3. Inefficient message format (JSON vs Protobuf)
4. Consumer lag on target topics
Resolution:
# Check queue metrics
jconsole # Connect to JMX and view queue metrics
# Actions:
1. Increase queue size:
"otlp.message.queue.size": "50000" # default: 10000
2. Switch to Protobuf format (smaller, faster):
"otlp.message.format": "protobuf"
3. Check Kafka broker health:
kafka-broker-api-versions --bootstrap-server localhost:9092
4. Monitor consumer lag on topics:
kafka-consumer-groups --bootstrap-server localhost:9092 \
--group <group> --describe
5. Optimize Kafka producer (in connect-distributed.properties):
producer.linger.ms=10
producer.batch.size=32768
producer.compression.type=lz4
producer.acks=1
6. Scale Kafka brokers if needed
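To confirm a fix took, watch the drop counters live; JmxTool can poll on an interval (JMX port and connector name below are placeholders):

```bash
# Poll the drop counters every 5 s to verify drops stop after tuning.
kafka-run-class kafka.tools.JmxTool \
  --object-name 'io.conduktor.connect.otel:type=OpenTelemetryConnector,name=otlp-source' \
  --attributes TracesDropped,MetricsDropped,LogsDropped,MaxQueueUtilizationPercent \
  --reporting-interval 5000 \
  --jmx-url service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi
```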
Issue 4: Processing Lag Building Up¶
Symptoms:
- TotalLagCount increasing over time
- Queue sizes growing
- RecordsProduced not keeping pace with TotalMessagesReceived
Root Causes:
1. Kafka broker slowness or overload
2. Network issues to Kafka
3. Kafka topic partitions insufficient
4. Serialization bottleneck
Resolution:
# Check Kafka broker metrics
kafka-run-class kafka.tools.JmxTool \
--object-name kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec \
--jmx-url service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi
# Check topic configuration
kafka-topics --bootstrap-server localhost:9092 \
--describe --topic otlp-traces
# Actions:
1. Increase topic partitions for parallelism:
kafka-topics --bootstrap-server localhost:9092 --alter --topic otlp-traces --partitions 12
kafka-topics --bootstrap-server localhost:9092 --alter --topic otlp-metrics --partitions 12
kafka-topics --bootstrap-server localhost:9092 --alter --topic otlp-logs --partitions 12
2. Check Kafka broker disk I/O and network throughput:
iostat -xz 1
iftop
3. Monitor Kafka Connect producer metrics:
curl http://localhost:8083/connectors/otlp-source/status
4. Review connector logs for serialization errors:
grep "ERROR" $KAFKA_HOME/logs/connect.log
5. Switch to Protobuf format if using JSON:
"otlp.message.format": "protobuf"
Performance Tuning¶
Queue Sizing¶
Default: 10000 messages per signal type (30000 total)
Recommendations:
- Low-volume (< 100 msg/s): 1000-5000
- Medium-volume (100-1000 msg/s): 10000-20000 (default)
- High-volume (1000-5000 msg/s): 20000-50000
- Very high-volume (> 5000 msg/s): 50000-100000
Trade-offs:
- Larger queue = more memory usage, better burst handling
- Smaller queue = less memory, more drops under traffic spikes
Memory formula:
Memory ≈ queue size × average message size × 3 (one queue per signal type)
Example: 50,000 queue × 2 KB avg × 3 = ~300 MB
Message Format Selection¶
Production Recommendations:
| Traffic Volume | Recommended Format | Reason |
|---|---|---|
| Low (< 100 msg/s) | JSON | Easier debugging |
| Medium (100-1000 msg/s) | JSON or Protobuf | Either works |
| High (> 1000 msg/s) | Protobuf | 3-5x smaller, faster |
JSON Benefits:
- Human-readable for debugging
- Easy downstream processing (ksqlDB, Kafka Streams)
- No decoding required
Protobuf Benefits:
- 3-5x smaller message size
- Faster serialization/deserialization
- Lower bandwidth usage
Port Configuration¶
Default Ports:
- gRPC: 4317
- HTTP: 4318
Custom Ports (avoid conflicts):
# Example for staging environment
otlp.grpc.port=5317
otlp.http.port=5318
# Example for development
otlp.grpc.port=6317
otlp.http.port=6318
Troubleshooting¶
Debug Logging¶
Enable debug logging for detailed troubleshooting by adding these lines to the Connect worker's Log4j configuration (typically $KAFKA_HOME/config/connect-log4j.properties):
# Connector logging
log4j.logger.io.conduktor.connect.otel=DEBUG
# gRPC logging
log4j.logger.io.grpc=DEBUG
Common Log Messages¶
| Log Message | Severity | Meaning | Action |
|---|---|---|---|
| "event=task_starting" | INFO | Task initialization | Normal operation |
| "event=otlp_receiver_started" | INFO | OTLP receivers started | Normal operation |
| "event=trace_received" | DEBUG | Trace received | Normal operation |
| "event=trace_dropped" | WARN | Trace dropped (queue full) | Increase queue size |
| "event=task_metrics" | INFO | Periodic metrics log | Review metrics |
| "event=task_stopping" | INFO | Graceful shutdown | Normal shutdown |
| "Failed to start gRPC server" | ERROR | gRPC startup failed | Check port availability |
| "Failed to start HTTP server" | ERROR | HTTP startup failed | Check port availability |
Health Check¶
Create a health check script:
#!/bin/bash
# health-check.sh
CONNECTOR_NAME="otlp-source"
CONNECT_URL="http://localhost:8083"
# Check connector status
STATUS=$(curl -s $CONNECT_URL/connectors/$CONNECTOR_NAME/status | jq -r '.connector.state')
if [ "$STATUS" != "RUNNING" ]; then
echo "CRITICAL: Connector not running (state: $STATUS)"
exit 2
fi
# Check task status
TASK_STATUS=$(curl -s $CONNECT_URL/connectors/$CONNECTOR_NAME/status | jq -r '.tasks[0].state')
if [ "$TASK_STATUS" != "RUNNING" ]; then
echo "CRITICAL: Task not running (state: $TASK_STATUS)"
exit 2
fi
# Check OTLP endpoints
if ! netstat -an | grep 4317 | grep -q LISTEN; then
echo "WARNING: gRPC port 4317 not listening"
exit 1
fi
if ! netstat -an | grep 4318 | grep -q LISTEN; then
echo "WARNING: HTTP port 4318 not listening"
exit 1
fi
echo "OK: Connector healthy"
exit 0
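The exit codes follow the common monitoring convention (0 OK, 1 warning, 2 critical), so the script can be wired into cron or a Nagios-style checker. A cron example (paths are placeholders):

```bash
# Example crontab entry: run the health check every minute, keep a log.
* * * * * /opt/scripts/health-check.sh >> /var/log/otlp-health.log 2>&1
```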
OTLP Endpoint Testing¶
Test gRPC endpoint:
# List available services
grpcurl -plaintext localhost:4317 list
# Expected output:
# opentelemetry.proto.collector.trace.v1.TraceService
# opentelemetry.proto.collector.metrics.v1.MetricsService
# opentelemetry.proto.collector.logs.v1.LogsService
# Describe trace service
grpcurl -plaintext localhost:4317 describe opentelemetry.proto.collector.trace.v1.TraceService
Test HTTP endpoint:
# Test traces endpoint
curl -v http://localhost:4318/v1/traces \
-H "Content-Type: application/x-protobuf"
# Test metrics endpoint
curl -v http://localhost:4318/v1/metrics \
-H "Content-Type: application/x-protobuf"
# Test logs endpoint
curl -v http://localhost:4318/v1/logs \
-H "Content-Type: application/x-protobuf"
# A 200 or 400 response confirms the listener is reachable
Recovery Procedures¶
Restart Connector¶
# Pause connector (stops receiving new messages)
curl -X PUT http://localhost:8083/connectors/otlp-source/pause
# Wait for in-flight messages to flush
sleep 10
# Resume connector
curl -X PUT http://localhost:8083/connectors/otlp-source/resume
# Verify status
curl -s http://localhost:8083/connectors/otlp-source/status | jq .
Update Configuration¶
# Update connector configuration
cat > updated-config.json <<EOF
{
"connector.class": "io.conduktor.connect.otel.OpenTelemetrySourceConnector",
"tasks.max": "1",
"otlp.message.queue.size": "50000",
"otlp.message.format": "protobuf"
}
EOF
curl -X PUT http://localhost:8083/connectors/otlp-source/config \
-H "Content-Type: application/json" \
-d @updated-config.json
Emergency Shutdown¶
If connector is misbehaving and needs immediate shutdown:
# Option 1: Pause connector (graceful)
curl -X PUT http://localhost:8083/connectors/otlp-source/pause
# Option 2: Delete connector (removes from cluster)
curl -X DELETE http://localhost:8083/connectors/otlp-source
# Option 3: Restart Connect worker (nuclear option)
kubectl delete pod -l app=kafka-connect
# or
systemctl restart kafka-connect
Capacity Planning¶
Estimating Queue Size¶
Formula:
Queue size per signal = (peak msg/s × burst duration in seconds) / 3
Divide by 3 because there are 3 queues (traces, metrics, logs).
Example:
- Peak: 5000 msg/s
- Burst: 10 seconds
- Queue size: (5000 × 10) / 3 ≈ 16,667 per queue
Recommendation: Add a 50% buffer → 25,000 per queue
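The same arithmetic as shell, for plugging in your own peak and burst figures (integer division rounds down slightly):

```bash
# Queue-size calculator following the formula above.
peak=5000      # peak messages/second across all signals
burst=10       # burst duration to absorb, in seconds
per_queue=$(( peak * burst / 3 ))
echo "per-queue size: $per_queue"
echo "with 50% buffer: $(( per_queue * 3 / 2 ))"
```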
Estimating Memory¶
Formula:
Memory ≈ queue size × average message size × 3
Example:
- Queue: 25,000
- Avg message: 2 KB
- Memory: 25,000 × 2 KB × 3 = 150 MB + overhead ≈ 200 MB
Recommendation: Set heap to 4× the estimate = 800 MB minimum
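And the memory estimate as shell arithmetic:

```bash
# Memory estimator following the formula above (integer MB, pre-overhead).
queue=25000     # per-queue capacity
avg_kb=2        # average message size in KB
raw_mb=$(( queue * avg_kb * 3 / 1024 ))
echo "raw estimate: ${raw_mb} MB (~200 MB with overhead)"
echo "recommended heap floor: $(( 200 * 4 )) MB"
```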
Scaling Considerations¶
Single connector limitations:
- One task per connector (cannot scale horizontally via tasks.max)
- One pair of ports (gRPC + HTTP)
To scale:
1. Run multiple connector instances on different ports (see the sketch below)
2. Use a load balancer to distribute apps across connectors
3. Scale Kafka brokers to handle increased throughput
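A sketch of deploying a second instance on alternate ports (the connector class and port keys are taken from the examples above; the instance name is a placeholder):

```bash
# Deploy a second connector instance on alternate ports to scale out.
cat > otlp-source-2.json <<'EOF'
{
  "name": "otlp-source-2",
  "config": {
    "connector.class": "io.conduktor.connect.otel.OpenTelemetrySourceConnector",
    "tasks.max": "1",
    "otlp.grpc.port": "14317",
    "otlp.http.port": "14318"
  }
}
EOF
curl -X POST http://localhost:8083/connectors \
  -H "Content-Type: application/json" \
  -d @otlp-source-2.json
```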
Contact and Escalation¶
For issues not covered in this runbook:
- Review connector logs at DEBUG level
- Check Kafka Connect worker logs
- Verify Kafka broker health
- Check OTLP endpoint accessibility
- Consult project documentation: https://github.com/conduktor/kafka-connect-opentelemetry
- Open issue with:
- Connector configuration
- JMX metrics snapshot
- Relevant logs (last 1000 lines)
- Kafka Connect worker version
- Kafka broker version
- OpenTelemetry SDK version (if applicable)