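CloudWatch is the monitoring platform for AWS resources and applications. This article covers advanced features that the earlier operations article did not touch on.
CloudWatch Architecture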
flowchart TB
    subgraph CloudWatch["CloudWatch"]
        Metrics["Metrics"]
        Logs["Logs"]
        Alarms["Alarms"]
        Dashboards["Dashboards"]
        Insights["Contributor Insights"]
        Anomaly["Anomaly Detection"]
    end
    subgraph Sources["Data Sources"]
        EC2["EC2"]
        Lambda["Lambda"]
        App["Application"]
    end
    Sources --> Metrics
    Sources --> Logs
    Metrics --> Alarms
    Metrics --> Anomaly
    Logs --> Insights
    style CloudWatch fill:#3b82f6,color:#fff
Embedded Metric Format (EMF)
Overview
flowchart LR
    subgraph EMF["Embedded Metric Format"]
        App["Application"]
        JSON["Structured JSON"]
        CWLogs["CloudWatch Logs"]
        CWMetrics["CloudWatch Metrics"]
    end
    App --> |"EMF format"| JSON
    JSON --> CWLogs
    CWLogs --> |"Automatic extraction"| CWMetrics
    style EMF fill:#f59e0b,color:#000
EMF Implementation
import json
from datetime import datetime


def create_emf_log(namespace, metrics, dimensions):
    """Build an EMF-formatted log entry and write it to stdout."""
    emf_log = {
        "_aws": {
            # Epoch timestamp in milliseconds
            "Timestamp": int(datetime.now().timestamp() * 1000),
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    # One dimension set containing every dimension key
                    "Dimensions": [list(dimensions.keys())],
                    "Metrics": [
                        {"Name": name, "Unit": unit}
                        for name, (value, unit) in metrics.items()
                    ],
                }
            ],
        },
        # Dimension and metric values are top-level members of the same object
        **dimensions,
        **{name: value for name, (value, unit) in metrics.items()},
    }
    # Write to stdout; where stdout is shipped to CloudWatch Logs (for example in Lambda),
    # the metrics are extracted automatically
    print(json.dumps(emf_log))


# Usage example
create_emf_log(
    namespace="MyApp/Performance",
    metrics={
        "RequestLatency": (125.5, "Milliseconds"),
        "RequestCount": (1, "Count"),
        "ErrorCount": (0, "Count"),
    },
    dimensions={
        "Service": "OrderAPI",
        "Environment": "Production",
    },
)
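EMF extracts a metric only when the name declared under `_aws.CloudWatchMetrics` also exists as a top-level member with a numeric value, and the same reference rule applies to dimension keys. The following is a minimal sketch of a pre-flight check for that rule, based on my reading of the EMF specification. `validate_emf` and `_is_number` are hypothetical helpers written for this article, not part of any AWS SDK, and the check does not cover other limits of the format.

```python
import json
from numbers import Number


def _is_number(value):
    """True for ints and floats, but not for booleans."""
    return isinstance(value, Number) and not isinstance(value, bool)


def validate_emf(document: str) -> list:
    """Return a list of problems found in an EMF JSON document (empty if none)."""
    problems = []
    root = json.loads(document)
    directives = root.get("_aws", {}).get("CloudWatchMetrics", [])
    if not directives:
        problems.append("no CloudWatchMetrics directive found")
    for directive in directives:
        # Every dimension key must reference a top-level member.
        for dimension_set in directive.get("Dimensions", []):
            for key in dimension_set:
                if key not in root:
                    problems.append(f"dimension '{key}' has no top-level value")
        # Every metric name must reference a numeric top-level member
        # (a single number or a list of numbers).
        for metric in directive.get("Metrics", []):
            name = metric["Name"]
            if name not in root:
                problems.append(f"metric '{name}' has no top-level value")
                continue
            value = root[name]
            values = value if isinstance(value, list) else [value]
            if not all(_is_number(v) for v in values):
                problems.append(f"metric '{name}' is not numeric")
    return problems
```

Running a check like this in unit tests catches the most common failure mode, where a log line is written successfully but no metric ever appears.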
Lambda Powertools EMF
from aws_lambda_powertools import Metrics
from aws_lambda_powertools.metrics import MetricUnit

metrics = Metrics(namespace="MyApp", service="OrderService")


@metrics.log_metrics(capture_cold_start_metric=True)
def lambda_handler(event, context):
    # Add a dimension
    metrics.add_dimension(name="Environment", value="Production")
    # Record metrics
    metrics.add_metric(name="OrderProcessed", unit=MetricUnit.Count, value=1)
    metrics.add_metric(name="OrderValue", unit=MetricUnit.Count, value=event.get("amount", 0))
    # High-resolution metric
    metrics.add_metric(
        name="ProcessingTime",
        unit=MetricUnit.Milliseconds,
        value=150,
        resolution=1,  # 1-second resolution
    )
    return {"statusCode": 200}
Custom Metrics
CloudFormation Definition
# CloudWatch agent configuration for custom metrics
CloudWatchAgentConfig:
  Type: AWS::SSM::Parameter
  Properties:
    Name: /cloudwatch-agent/config
    Type: String
    Value: |
      {
        "metrics": {
          "namespace": "MyApp/EC2",
          "metrics_collected": {
            "cpu": {
              "measurement": ["cpu_usage_idle", "cpu_usage_user", "cpu_usage_system"],
              "totalcpu": true,
              "metrics_collection_interval": 60
            },
            "mem": {
              "measurement": ["mem_used_percent", "mem_available"],
              "metrics_collection_interval": 60
            },
            "disk": {
              "measurement": ["disk_used_percent"],
              "resources": ["/", "/data"],
              "metrics_collection_interval": 60
            },
            "statsd": {
              "service_address": ":8125",
              "metrics_collection_interval": 10,
              "metrics_aggregation_interval": 60
            }
          },
          "append_dimensions": {
            "AutoScalingGroupName": "${aws:AutoScalingGroupName}",
            "InstanceId": "${aws:InstanceId}",
            "InstanceType": "${aws:InstanceType}"
          }
        },
        "logs": {
          "logs_collected": {
            "files": {
              "collect_list": [
                {
                  "file_path": "/var/log/application/*.log",
                  "log_group_name": "/app/logs",
                  "log_stream_name": "{instance_id}",
                  "timestamp_format": "%Y-%m-%d %H:%M:%S"
                }
              ]
            }
          }
        }
      }
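The `statsd` section of the agent configuration opens a UDP listener on port 8125, so an application on the same host can publish custom metrics without calling the CloudWatch API. The following is a minimal sketch of a client for that listener. `format_statsd` and `send_statsd` are hypothetical helper names, the metric name and tag are made-up examples, and the `|#key:value` tag suffix assumes the agent's DogStatsD-style tag handling, which maps tags to dimensions.

```python
import socket


def format_statsd(name, value, metric_type="c", tags=None):
    """Build a StatsD datagram such as 'orders.processed:1|c|#env:prod'.

    metric_type: 'c' (counter), 'g' (gauge), 'ms' (timer), and so on.
    """
    line = f"{name}:{value}|{metric_type}"
    if tags:
        tag_text = ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
        line += f"|#{tag_text}"
    return line


def send_statsd(line, host="127.0.0.1", port=8125):
    """Send one datagram over UDP; delivery is fire-and-forget."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (host, port))


if __name__ == "__main__":
    # Hypothetical metric; the agent aggregates it per the configured interval.
    send_statsd(format_statsd("orders.processed", 1, "c", {"env": "prod"}))
```

Because UDP gives no delivery confirmation, a missing or misconfigured agent fails silently, which is worth keeping in mind when a metric does not show up.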
Metric Filters
# Extract metrics from logs
ErrorMetricFilter:
  Type: AWS::Logs::MetricFilter
  Properties:
    LogGroupName: /app/logs
    FilterPattern: "[timestamp, level=ERROR, ...]"
    MetricTransformations:
      - MetricName: ErrorCount
        MetricNamespace: MyApp/Logs
        MetricValue: "1"
        # DefaultValue cannot be combined with Dimensions, and a dimension value
        # must reference a field from the filter pattern (for example $level),
        # so no dimension is set on this filter
        DefaultValue: 0
LatencyMetricFilter:
  Type: AWS::Logs::MetricFilter
  Properties:
    LogGroupName: /app/logs
    FilterPattern: "[timestamp, level, message, latency]"
    MetricTransformations:
      - MetricName: RequestLatency
        MetricNamespace: MyApp/Logs
        MetricValue: "$latency"
        Unit: Milliseconds
CloudWatch Logs Insights
Basic Queries
# Search for error logs
fields @timestamp, @message, @logStream
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100

# Response time analysis
fields @timestamp, @message
| parse @message /latency=(?<latency>\d+)ms/
| stats avg(latency) as avg_latency,
        max(latency) as max_latency,
        min(latency) as min_latency,
        pct(latency, 95) as p95_latency,
        pct(latency, 99) as p99_latency
  by bin(5m)

# Error rate calculation
fields @timestamp, @message
| stats count(*) as total,
        sum(strcontains(@message, "ERROR")) as errors
  by bin(1h)
| display total, errors, (errors/total)*100 as error_rate
Advanced Queries
# Lambda function analysis
fields @timestamp, @requestId, @duration, @billedDuration, @memorySize, @maxMemoryUsed
| filter @type = "REPORT"
| stats avg(@duration) as avg_duration,
        max(@duration) as max_duration,
        avg(@maxMemoryUsed/@memorySize*100) as avg_memory_pct
  by bin(1h)

# API Gateway access analysis
fields @timestamp, httpMethod, path, status, responseLatency
| filter ispresent(status)
| stats count(*) as requests,
        avg(responseLatency) as avg_latency,
        sum(status >= 500) as server_errors,
        sum(status >= 400 and status < 500) as client_errors
  by httpMethod, path

# VPC Flow Logs analysis
fields @timestamp, srcAddr, dstAddr, srcPort, dstPort, protocol, action
| filter action = "REJECT"
| stats count(*) as rejected_count by srcAddr, dstPort
| sort rejected_count desc
| limit 20

# Correlating container logs by request ID
fields @timestamp, kubernetes.pod_name as pod, @message
| filter kubernetes.namespace_name = "production"
| parse @message /request_id=(?<request_id>\S+)/
| stats count(*) as log_count by request_id, pod
| filter log_count > 1
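Queries like these can also be run outside the console. Logs Insights queries are asynchronous: one call starts the query and a second call is polled until the status is final. The following is a minimal sketch of that loop. `run_insights_query` is a hypothetical helper; it takes the client as an argument and assumes it behaves like `boto3.client("logs")`, using only that client's `start_query` and `get_query_results` operations. The polling interval and retry count are arbitrary choices.

```python
import time


def run_insights_query(logs_client, log_group, query, start, end,
                       poll_seconds=1.0, max_polls=60):
    """Start a Logs Insights query and poll until it finishes.

    logs_client is expected to behave like boto3.client("logs").
    start and end are epoch seconds.
    """
    query_id = logs_client.start_query(
        logGroupName=log_group,
        startTime=int(start),
        endTime=int(end),
        queryString=query,
    )["queryId"]
    for _ in range(max_polls):
        response = logs_client.get_query_results(queryId=query_id)
        if response["status"] in ("Complete", "Failed", "Cancelled", "Timeout"):
            break
        time.sleep(poll_seconds)
    else:
        # The loop ran out of attempts without reaching a final status.
        raise TimeoutError(f"query {query_id} did not finish")
    if response["status"] != "Complete":
        raise RuntimeError(f"query ended with status {response['status']}")
    # Each result row is a list of {"field": ..., "value": ...} pairs.
    return [{c["field"]: c["value"] for c in row} for row in response["results"]]
```

With boto3 installed, a call might look like `run_insights_query(boto3.client("logs"), "/app/logs", "stats count(*) as total", start, end)`. Values come back as strings, so numeric columns need converting before further calculation.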
Visualizing Query Results
# Logs Insights query widget
DashboardWidget:
  Type: AWS::CloudWatch::Dashboard
  Properties:
    DashboardBody: !Sub |
      {
        "widgets": [
          {
            "type": "log",
            "x": 0,
            "y": 0,
            "width": 24,
            "height": 6,
            "properties": {
              "query": "SOURCE '/aws/lambda/my-function' | fields @timestamp, @duration | stats avg(@duration) by bin(5m)",
              "region": "${AWS::Region}",
              "title": "Lambda Duration Trend",
              "view": "timeSeries"
            }
          }
        ]
      }
Anomaly Detection
Overview
flowchart TB
    subgraph AnomalyDetection["Anomaly Detection"]
        Historical["Historical data"]
        ML["Machine learning model"]
        Band["Expected value band"]
        Alert["Anomaly alert"]
    end
    Historical --> ML
    ML --> Band
    Band --> |"Outside the band"| Alert
    style AnomalyDetection fill:#8b5cf6,color:#fff
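To make the idea of a band concrete, the following sketch uses a deliberately simplified stand-in for the model: the band is the historical mean plus or minus a number of standard deviations. This is not CloudWatch's actual algorithm, which also learns trend and seasonality. It only illustrates what the width parameter controls, that is, the second argument of `ANOMALY_DETECTION_BAND`. The function names and sample values are illustrative.

```python
from statistics import mean, pstdev


def expected_band(history, width=2.0):
    """Return (lower, upper) as mean +/- width * standard deviation."""
    center = mean(history)
    spread = pstdev(history)
    return center - width * spread, center + width * spread


def is_anomalous(value, history, width=2.0):
    """True when the value falls outside the expected band."""
    lower, upper = expected_band(history, width)
    return value < lower or value > upper


if __name__ == "__main__":
    latencies = [100, 102, 98, 101, 99]  # made-up sample datapoints
    print(expected_band(latencies))
    print(is_anomalous(110, latencies))
```

A larger width produces a wider band and fewer alerts; a smaller width makes the alarm more sensitive.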
Anomaly Detection Alarm
AnomalyDetectionAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: api-latency-anomaly
    AlarmDescription: API latency anomaly detected
    Metrics:
      - Id: m1
        MetricStat:
          Metric:
            Namespace: AWS/ApiGateway
            MetricName: Latency
            Dimensions:
              - Name: ApiName
                Value: my-api
          Period: 300
          Stat: Average
        # An anomaly detection alarm needs both the metric and the band to return data
        ReturnData: true
      - Id: ad1
        Expression: ANOMALY_DETECTION_BAND(m1, 2)
        Label: LatencyAnomaly
        ReturnData: true
    ThresholdMetricId: ad1
    ComparisonOperator: LessThanLowerOrGreaterThanUpperThreshold
    EvaluationPeriods: 3
    DatapointsToAlarm: 2
    TreatMissingData: missing
    ActionsEnabled: true
    AlarmActions:
      - !Ref AlertTopic
# Anomaly detector configuration
AnomalyDetector:
  Type: AWS::CloudWatch::AnomalyDetector
  Properties:
    MetricName: RequestCount
    Namespace: MyApp
    Stat: Sum
    Dimensions:
      - Name: Environment
        Value: Production
    Configuration:
      # Exclude periods with atypical traffic from model training
      ExcludedTimeRanges:
        - StartTime: "2024-12-24T00:00:00"
          EndTime: "2024-12-26T00:00:00"
      MetricTimeZone: Asia/Tokyo
Composite Alarms
# Composite alarm (combines multiple conditions)
CompositeAlarm:
  Type: AWS::CloudWatch::CompositeAlarm
  Properties:
    AlarmName: service-health-composite
    AlarmDescription: Service health composite alarm
    # The rule refers to alarm names, not logical IDs; Ref on an alarm returns its name.
    # High5xxErrorAlarm is assumed to be defined elsewhere in the template.
    AlarmRule: !Sub |
      ALARM("${HighLatencyAlarm}") AND
      (ALARM("${HighErrorRateAlarm}") OR ALARM("${High5xxErrorAlarm}"))
    ActionsEnabled: true
    AlarmActions:
      - !Ref PagerDutyTopic
    OKActions:
      - !Ref RecoveryTopic
    InsufficientDataActions:
      - !Ref AlertTopic
HighLatencyAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: high-latency
    MetricName: Latency
    Namespace: AWS/ApiGateway
    # API Gateway publishes Latency per API, so the alarm needs a dimension
    Dimensions:
      - Name: ApiName
        Value: my-api
    Statistic: Average
    Period: 60
    EvaluationPeriods: 3
    Threshold: 1000
    ComparisonOperator: GreaterThanThreshold
HighErrorRateAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: high-error-rate
    Metrics:
      - Id: errors
        MetricStat:
          Metric:
            Namespace: MyApp
            MetricName: ErrorCount
          Period: 60
          Stat: Sum
        # Only the expression below may return data to the alarm
        ReturnData: false
      - Id: total
        MetricStat:
          Metric:
            Namespace: MyApp
            MetricName: RequestCount
          Period: 60
          Stat: Sum
        ReturnData: false
      - Id: error_rate
        Expression: (errors/total)*100
        Label: ErrorRate
        ReturnData: true
    Threshold: 5
    ComparisonOperator: GreaterThanThreshold
    EvaluationPeriods: 3
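The composite rule above pages only when high latency coincides with an error condition, which is what cuts down on noisy alerts. The following sketch mirrors that boolean logic in plain Python so the behaviour can be checked with a truth table. The function and the alarm-name keys are illustrative assumptions (in particular `high-5xx-error`, which is not defined in the template), and the sketch ignores details such as action suppression.

```python
def service_unhealthy(states):
    """Mirror of: ALARM(latency) AND (ALARM(error rate) OR ALARM(5xx errors)).

    states maps an alarm name to "OK", "ALARM", or "INSUFFICIENT_DATA".
    """
    def in_alarm(name):
        return states.get(name) == "ALARM"

    return in_alarm("high-latency") and (
        in_alarm("high-error-rate") or in_alarm("high-5xx-error")
    )


if __name__ == "__main__":
    # Latency alone does not page; latency together with errors does.
    print(service_unhealthy({"high-latency": "ALARM"}))
    print(service_unhealthy({"high-latency": "ALARM", "high-error-rate": "ALARM"}))
```

Note that an alarm in `INSUFFICIENT_DATA` does not satisfy `ALARM(...)`, so missing data on a child alarm keeps the composite from firing.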
Contributor Insights
Rule Definitions
ContributorInsightsRule:
  Type: AWS::CloudWatch::InsightRule
  Properties:
    RuleName: top-api-consumers
    RuleState: ENABLED
    RuleBody: !Sub |
      {
        "Schema": {
          "Name": "CloudWatchLogRule",
          "Version": 1
        },
        "LogGroupNames": [
          "/aws/apigateway/my-api"
        ],
        "LogFormat": "JSON",
        "Contribution": {
          "Keys": ["$.sourceIp"],
          "ValueOf": "$.requestId",
          "Filters": [
            {
              "Match": "$.status",
              "GreaterThan": 499
            }
          ]
        },
        "AggregateOn": "Count"
      }
# DynamoDB throttling analysis
# (assumes a log group named DynamoDBThrottledRequests that receives JSON events
# containing tableName and partitionKey fields)
DynamoDBThrottleRule:
  Type: AWS::CloudWatch::InsightRule
  Properties:
    RuleName: dynamodb-throttled-keys
    RuleState: ENABLED
    RuleBody: |
      {
        "Schema": {
          "Name": "CloudWatchLogRule",
          "Version": 1
        },
        "AWSAccountId": "*",
        "LogGroupNames": [
          "DynamoDBThrottledRequests"
        ],
        "LogFormat": "JSON",
        "Contribution": {
          "Keys": ["$.tableName", "$.partitionKey"],
          "Filters": []
        },
        "AggregateOn": "Count"
      }
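Conceptually, a rule such as `top-api-consumers` groups matching log events by the contribution key and ranks the groups by count. The following sketch reproduces that idea locally over a handful of JSON log lines. It is an illustration of the top-N concept, not of how the service is implemented; `top_contributors` is a hypothetical helper and the sample events are made up.

```python
import json
from collections import Counter


def top_contributors(log_lines, key="sourceIp", min_status=500, n=3):
    """Count events per key where status >= min_status and return the top n."""
    counts = Counter()
    for line in log_lines:
        event = json.loads(line)
        # Equivalent of the rule's filter: "$.status" greater than 499
        if key in event and event.get("status", 0) >= min_status:
            counts[event[key]] += 1
    return counts.most_common(n)


if __name__ == "__main__":
    sample = [
        '{"sourceIp": "10.0.0.1", "status": 500}',
        '{"sourceIp": "10.0.0.1", "status": 502}',
        '{"sourceIp": "10.0.0.2", "status": 503}',
        '{"sourceIp": "10.0.0.3", "status": 200}',
    ]
    print(top_contributors(sample))
```

The managed rule does the same kind of ranking continuously, so the heaviest contributors are visible without running an ad hoc query.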
Dashboard Design
Production Dashboard
OperationalDashboard:
  Type: AWS::CloudWatch::Dashboard
  Properties:
    DashboardName: production-overview
    DashboardBody: !Sub |
      {
        "widgets": [
          {
            "type": "metric",
            "x": 0, "y": 0,
            "width": 8, "height": 6,
            "properties": {
              "title": "API Latency",
              "metrics": [
                ["AWS/ApiGateway", "Latency", "ApiName", "my-api", {"stat": "p50", "label": "p50"}],
                ["...", {"stat": "p90", "label": "p90"}],
                ["...", {"stat": "p99", "label": "p99"}]
              ],
              "period": 60,
              "region": "${AWS::Region}",
              "view": "timeSeries",
              "annotations": {
                "horizontal": [
                  {"value": 500, "label": "SLO", "color": "#ff7f0e"}
                ]
              }
            }
          },
          {
            "type": "metric",
            "x": 8, "y": 0,
            "width": 8, "height": 6,
            "properties": {
              "title": "Error Rate",
              "metrics": [
                [{"expression": "m2/m1*100", "label": "Error Rate %", "id": "e1"}],
                ["MyApp", "ErrorCount", {"id": "m2", "stat": "Sum", "visible": false}],
                ["MyApp", "RequestCount", {"id": "m1", "stat": "Sum", "visible": false}]
              ],
              "period": 60,
              "region": "${AWS::Region}",
              "yAxis": {"left": {"min": 0, "max": 10}}
            }
          },
          {
            "type": "metric",
            "x": 16, "y": 0,
            "width": 8, "height": 6,
            "properties": {
              "title": "Request Count with Anomaly Band",
              "metrics": [
                ["MyApp", "RequestCount", {"id": "m1", "stat": "Sum"}],
                [{"expression": "ANOMALY_DETECTION_BAND(m1)", "label": "Expected", "id": "ad1"}]
              ],
              "period": 300,
              "region": "${AWS::Region}"
            }
          },
          {
            "type": "alarm",
            "x": 0, "y": 6,
            "width": 24, "height": 3,
            "properties": {
              "title": "Alarm Status",
              "alarms": [
                "arn:aws:cloudwatch:${AWS::Region}:${AWS::AccountId}:alarm:high-latency",
                "arn:aws:cloudwatch:${AWS::Region}:${AWS::AccountId}:alarm:high-error-rate",
                "arn:aws:cloudwatch:${AWS::Region}:${AWS::AccountId}:alarm:api-latency-anomaly"
              ]
            }
          }
        ]
      }
Metric Streams
Kinesis Data Firehose Integration
MetricStream:
  Type: AWS::CloudWatch::MetricStream
  Properties:
    Name: metrics-to-s3
    FirehoseArn: !GetAtt DeliveryStream.Arn
    RoleArn: !GetAtt MetricStreamRole.Arn
    OutputFormat: opentelemetry0.7
    IncludeFilters:
      - Namespace: AWS/EC2
      - Namespace: AWS/Lambda
      # Required so that the Latency statistics configured below are actually streamed
      - Namespace: AWS/ApiGateway
      - Namespace: MyApp
    StatisticsConfigurations:
      - IncludeMetrics:
          - Namespace: AWS/ApiGateway
            MetricName: Latency
        AdditionalStatistics:
          - p90
          - p95
          - p99
DeliveryStream:
  Type: AWS::KinesisFirehose::DeliveryStream
  Properties:
    DeliveryStreamType: DirectPut
    S3DestinationConfiguration:
      BucketARN: !GetAtt MetricsBucket.Arn
      RoleARN: !GetAtt FirehoseRole.Arn
      Prefix: metrics/
      BufferingHints:
        IntervalInSeconds: 60
        SizeInMBs: 5
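Best Practices
flowchart TB
    subgraph BestPractices["Best Practices"]
        EMF["Efficient metric collection with EMF"]
        Anomaly["Dynamic thresholds with anomaly detection"]
        Composite["Higher alert precision with composite alarms"]
        Dashboard["Purpose-specific dashboards"]
    end
    style BestPractices fill:#22c55e,color:#fff
| Category | Practice |
|---|---|
| Metrics | Unify logs and metrics with EMF |
| Alarms | Use anomaly detection to handle seasonal variation |
| Analysis | Dig deeper with Logs Insights |
| Visualization | Design dashboards per role |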
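Summary
| Feature | Use case |
|---|---|
| EMF | Automatic metric generation from structured logs |
| Anomaly detection | ML-based dynamic thresholds |
| Contributor Insights | Top-N analysis |
| Logs Insights | Interactive log analysis |
Used together, these advanced CloudWatch features (EMF for metric collection, anomaly detection and composite alarms for alerting, Logs Insights and Contributor Insights for analysis) make monitoring and troubleshooting more effective.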