AWS Config
At our organisation we use custom config rules to help us achieve near real-time compliance and remediation.
This is achieved using custom config rules and associated lambdas that process the conditional logic you want to match to determine whether a resource is compliant or not.
The lambdas usually sit in a dedicated SecOps account so they can be updated easily if the logic changes. Child accounts in the organisation have custom config rules deployed that fire the central lambda.
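For readers who have not written one, here is a minimal sketch of what a change-triggered custom rule lambda can look like. It is not our production code: the rule itself (every resource must carry an `owner` tag) and the names `REQUIRED_TAG`, `evaluate` and `build_evaluation` are assumptions made up for illustration. The event fields (`invokingEvent`, `resultToken`) and the `put_evaluations` call are the standard Config custom rule contract.

```python
import json

REQUIRED_TAG = "owner"  # hypothetical rule: every evaluated resource must carry this tag


def evaluate(configuration_item):
    """Return the Config compliance type for one configuration item."""
    if configuration_item.get("configurationItemStatus") == "ResourceDeleted":
        return "NOT_APPLICABLE"
    tags = configuration_item.get("tags") or {}
    return "COMPLIANT" if REQUIRED_TAG in tags else "NON_COMPLIANT"


def build_evaluation(configuration_item):
    """Shape the result the way the Config PutEvaluations API expects it."""
    return {
        "ComplianceResourceType": configuration_item["resourceType"],
        "ComplianceResourceId": configuration_item["resourceId"],
        "ComplianceType": evaluate(configuration_item),
        "OrderingTimestamp": configuration_item["configurationItemCaptureTime"],
    }


def lambda_handler(event, context):
    # Imported here so the pure functions above can be tested without the AWS SDK.
    import boto3

    # For change-triggered rules, invokingEvent is a JSON string holding the item.
    invoking_event = json.loads(event["invokingEvent"])
    evaluation = build_evaluation(invoking_event["configurationItem"])

    # Assumption: the function's own role may call PutEvaluations directly.
    # A central function serving child accounts would first assume a role in
    # the account that fired the rule.
    boto3.client("config").put_evaluations(
        Evaluations=[evaluation],
        ResultToken=event["resultToken"],
    )
    return evaluation
```

Keeping the decision logic in a pure function is a deliberate choice in this sketch: the rule can be unit tested without deploying anything.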
There are lots of tools and useful blog posts to help you get started with this; however, that is not what this post is about!
- https://aws.amazon.com/blogs/mt/introducing-the-aws-config-rule-development-kit-rdk/
- https://aws.amazon.com/blogs/devops/aws-config-rdk-deploying-the-custom-rules-using-the-terraform/
Hey, why is Config costing us so much in this account?
As part of our regular cost review for platform resource use, we noticed something odd in one of our development accounts.
This account is used for many things, but in relation to AWS Config it's used to develop the custom rules described above.
Me: Why is that account costing us $$$$ a month for config?
Colleague: Hmm, that's odd. It does get expensive when there are lots of rules
Me: Not that expensive! There are next to no resources in there...
Colleague: Hmm... that needs investigation
Config is expensive!
AWS Config gets a lot of stick for being expensive; however, some time ago AWS optimized the pricing.
In our multi-account organisation, the bill is around 1k per month across the board, which in my opinion ain't so bad for an enterprise when you consider the capabilities the service enables:
- Resource Change Triggered and Time Based Rule execution
- Automated Remediation Integration
- EventBridge Integration
- Aggregator Query at the Organisation level
The above is pretty nice, right? Haters will hate.
AWS Config also drives Security Hub Security Standard Checks that help our users keep aligned to AWS best practices (or not :))
The investigation
Getting back to the point of this post, why was this account blowing up Config costs?
Looking at Cost Explorer, I could see that the APS2-ConfigurationItemRecorded($) UsageType was the culprit.
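The same breakdown can be pulled programmatically rather than clicked through in the console. The sketch below is an assumption-laden illustration, not what I ran at the time: the helper names are made up, and I am assuming the service appears in Cost Explorer under the name `AWS Config`. It uses the Cost Explorer `GetCostAndUsage` API grouped by usage type and ignores pagination for brevity.

```python
def config_cost_request(start, end):
    """Build the GetCostAndUsage arguments for AWS Config, grouped by usage type."""
    return {
        "TimePeriod": {"Start": start, "End": end},  # YYYY-MM-DD, End is exclusive
        "Granularity": "MONTHLY",
        "Metrics": ["UnblendedCost"],
        # Assumption: this is how the service is labelled in your Cost Explorer.
        "Filter": {"Dimensions": {"Key": "SERVICE", "Values": ["AWS Config"]}},
        "GroupBy": [{"Type": "DIMENSION", "Key": "USAGE_TYPE"}],
    }


def rank_usage_types(response):
    """Sum the cost per usage type and return (usage_type, cost), most expensive first."""
    totals = {}
    for period in response["ResultsByTime"]:
        for group in period["Groups"]:
            usage_type = group["Keys"][0]
            amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
            totals[usage_type] = totals.get(usage_type, 0.0) + amount
    return sorted(totals.items(), key=lambda pair: pair[1], reverse=True)


def fetch_ranked_usage_types(start, end):
    """Call Cost Explorer (needs ce:GetCostAndUsage) and rank the result."""
    import boto3  # imported lazily so the helpers above work without the SDK

    client = boto3.client("ce")
    return rank_usage_types(client.get_cost_and_usage(**config_cost_request(start, end)))
```

In an account like ours, the first entry returned would be the ConfigurationItemRecorded usage type.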
A quick AWS support ticket pointed me to a knowledge article that allowed me to debug this further.
BTW, I can't speak highly enough of the quick response we always get from support. Keep up the excellent work!
Athena
Created my table
CREATE EXTERNAL TABLE awsconfig (
fileversion string,
configSnapshotId string,
configurationitems ARRAY < STRUCT < configurationItemVersion : STRING,
configurationItemCaptureTime : STRING,
configurationStateId : BIGINT,
awsAccountId : STRING,
configurationItemStatus : STRING,
resourceType : STRING,
resourceId : STRING,
resourceName : STRING,
ARN : STRING,
awsRegion : STRING,
availabilityZone : STRING,
configurationStateMd5Hash : STRING,
resourceCreationTime : STRING > >
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://BUCKET/AWSLogs/ACCOUNT/Config/ap-southeast-2/';
Queried the number of changes per resource for the first week of October, sorted by most frequently changed:
SELECT configurationItem.resourceType,
configurationItem.resourceId,
COUNT(configurationItem.resourceId) AS NumberOfChanges
FROM default.awsconfig
CROSS JOIN UNNEST(configurationitems) AS t(configurationItem)
WHERE "$path" LIKE '%ConfigHistory%'
AND configurationItem.configurationItemCaptureTime >= '2021-10-01T%'
AND configurationItem.configurationItemCaptureTime <= '2021-10-07T%'
GROUP BY configurationItem.resourceType, configurationItem.resourceId
ORDER BY NumberOfChanges DESC
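If you would rather not set up Athena, the same count can be approximated locally once the ConfigHistory files have been downloaded and parsed. This is only a sketch under assumptions: `count_changes` is a hypothetical helper, and it expects each document to be the already-decoded JSON of one history file with a `configurationItems` list, as the table definition above implies.

```python
from collections import Counter


def count_changes(history_documents, start, end):
    """Count configuration items per (resourceType, resourceId).

    history_documents: iterable of dicts, each the parsed JSON of one history file.
    start / end: ISO 8601 UTC strings; start is inclusive, end is exclusive.
    Capture times are ISO 8601 UTC strings, so plain string comparison orders them.
    """
    counts = Counter()
    for document in history_documents:
        for item in document.get("configurationItems", []):
            captured = item["configurationItemCaptureTime"]
            if start <= captured < end:
                counts[(item["resourceType"], item["resourceId"])] += 1
    # most_common() sorts by count, highest first, like ORDER BY ... DESC
    return counts.most_common()
```

A call such as `count_changes(docs, "2021-10-01T", "2021-10-08T")` covers the first seven days of October.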
The plot thickens... we have a VPC, Subnet & Security Group with over 5 thousand changes.
CloudWatch Insights
Let's take a look at CloudTrail using Insights to see if there is something that is genuinely adding rules to that security group in some sort of loop. I'll be looking for events other than those from AWS Config.
Query targeting the security group id
fields @timestamp, eventSource, eventName
| filter @message like /(sg-012345678912345)/
| stats count() as EventNameCount by eventName
Most of the noise is coming from Config itself (PutEvaluations), but as the output shows, RunInstances is something we need to investigate more.
It's possible that something is just using the security group, triggering a state change and costing us in AWS Config each time the resource is re-evaluated. Let's find out!
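The per-event count above can also be run from a script, which is handy if you want to repeat it for several resource IDs. A rough sketch, assuming CloudTrail is delivered to a CloudWatch Logs log group: `count_events_by_name` and its parameters are names I made up, while `start_query` and `get_query_results` are the CloudWatch Logs API calls behind Insights.

```python
import time

QUERY = """fields @timestamp, eventSource, eventName
| filter @message like /{resource_id}/
| stats count() as EventNameCount by eventName"""


def count_events_by_name(logs, log_group, resource_id, start, end, poll_seconds=1.0):
    """Run the Insights query and return {eventName: count}.

    logs: a CloudWatch Logs client, e.g. boto3.client("logs").
    start / end: query window as epoch seconds.
    """
    query_id = logs.start_query(
        logGroupName=log_group,
        startTime=start,
        endTime=end,
        queryString=QUERY.format(resource_id=resource_id),
    )["queryId"]

    # Insights queries are asynchronous, so poll until a terminal status.
    while True:
        response = logs.get_query_results(queryId=query_id)
        if response["status"] in ("Complete", "Failed", "Cancelled", "Timeout"):
            break
        time.sleep(poll_seconds)

    if response["status"] != "Complete":
        raise RuntimeError("Insights query ended with status " + response["status"])

    counts = {}
    for row in response["results"]:
        # Each row is a list of {"field": ..., "value": ...} cells.
        fields = {cell["field"]: cell["value"] for cell in row}
        counts[fields["eventName"]] = int(fields["EventNameCount"])
    return counts
```

Passing the client in as an argument keeps the function easy to exercise with a stub before pointing it at a real log group.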
Query for only the RunInstances eventName
fields @timestamp, eventSource, eventName
| filter @message like /(sg-012345678912345)/
| filter eventName == "RunInstances"
I've removed a lot of the message output, but here are some of the values that helped me get to the source of the problem:
eventName RunInstances
userAgent spotfleet.amazonaws.com
requestParameters.iamInstanceProfile.arn arn:aws:iam::000000000000:instance-profile/ecsInstanceRole
userIdentity.sessionContext.sessionIssuer.userName AWSServiceRoleForEC2SpotFleet
So here is what we have so far:
- Something is launching Spot instances, judging by the userAgent and the AWSServiceRoleForEC2SpotFleet role
- The instance profile ARN in use has ECS in the name, so Elastic Container Service is likely
The smoking gun!
A quick look at the ECS Console and bingo, we have a Fargate Spot Cluster!
A single service is defined, using the VPC, Subnet & Security Group we identified via the Athena query.
Viewing the status of the service, I could see it was constantly restarting due to a registration failure with its GitHub target.
Wrap up
So while it appeared that AWS Config was just an expensive service, we actually found that we had inadvertently introduced a repeating failure that was genuinely triggering the ConfigurationItemRecorded charge frequently.
I have heard of similar conditions when using S3 event triggers, however, this was the first time for me.
Things for our team to consider and implement:-
- Anomaly detection alerts for increased service usage
- Dev accounts need the same level of FinOps attention, if not more!
- Clean up stuff once finished with it
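On the first point, AWS offers a managed Cost Anomaly Detection service, but the idea is simple enough to sketch by hand. The function below is an illustration only, not something we have deployed: the name `find_cost_anomalies`, the 7-day window and the thresholds are all assumptions. It flags any day whose cost is well above the average of the days before it.

```python
from statistics import mean, pstdev


def find_cost_anomalies(daily_costs, window=7, threshold=3.0, min_increase=1.0):
    """Flag days whose cost jumps well above the trailing average.

    daily_costs: list of daily costs for one service, oldest first.
    A day is flagged when it exceeds the mean of the previous `window` days by
    more than `threshold` standard deviations, and by at least `min_increase`
    (in the same currency), so a flat baseline does not alert on tiny changes.
    Returns a list of (day_index, cost, baseline_average).
    """
    anomalies = []
    for i in range(window, len(daily_costs)):
        baseline = daily_costs[i - window:i]
        average = mean(baseline)
        limit = average + max(threshold * pstdev(baseline), min_increase)
        if daily_costs[i] > limit:
            anomalies.append((i, daily_costs[i], average))
    return anomalies
```

Fed with daily Config costs per account, a check like this would have flagged our development account as soon as the restart loop began, rather than at the next cost review.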
Hope someone else finds this useful.
Cheers!