Unleashing the power of AWS Athena on Transit Gateway Flow Logs

Stephen Jones
Aws
March 10, 2023

Table of Contents

AWS Transit Gateway Flow Logs provide valuable insights into the traffic flowing through your network. However, analyzing this data can be challenging, especially if you have many logs to sift through.

In this article, we will show you how to use AWS Athena to quickly ingest and query your Transit Gateway Flow Logs, providing a convenient and cost-effective way to gain valuable insights into your network traffic.

Setting up Transit Gateway Flow Logs to s3

First, let’s complete the prerequisites to get our flow logs into s3.

To setup Transit Gateway Flow Logs to s3, follow these steps:
Open the Amazon VPC console at https://console.aws.amazon.com/vpc/.
In the left navigation pane, choose Transit Gateways.
Select the transit gateway that you want to enable flow logs.
On the details pane, select Flow Logs.
Choose Create flow log.
Give it a meaningful name.
For the destination, choose s3.
For the s3 bucket, add the arn of the bucket where you want to store the flow logs.
Select the default format.
Turn on Hive-compatible s3 prefix.
Enable Partition logs by the hour.
Add any Tags required.
In the Create Flow Log dialogue box, choose to Create.

Wait for the flow log to be created and associated with your transit gateway.

After about 5 minutes thn traffic will start to be delivered to your s3 bucket in the following format.

s3://s3bucket/AWSLogs/aws-account-id=123456..

Querying the data

Now we create our table using AWS Athena to query the logs.

Open the Amazon Athena console at https://console.aws.amazon.com/athena/.
In the left navigation pane, choose Query Editor.
In the Query Editor, past the following, updating the s3 location with the destination above.

CREATE EXTERNAL TABLE IF NOT EXISTS `twgflowlogs`.`awslogs` (
  version int, 
  resource_type string, 
  account_id string, 
  tgw_id string, 
  tgw_attachment_id string, 
  tgw_src_vpc_account_id string, 
  tgw_dst_vpc_account_id string, 
  tgw_src_vpc_id string, 
  tgw_dst_vpc_id string, 
  tgw_src_subnet_id string, 
  tgw_dst_subnet_id string, 
  tgw_src_eni string, 
  tgw_dst_eni string, 
  tgw_src_az_id string, 
  tgw_dst_az_id string, 
  tgw_pair_attachment_id string, 
  srcaddr string, 
  dstaddr string, 
  srcpor int, 
  dstport int, 
  protocol bigint, 
  packet bigint, 
  bytes bigint, 
  start bigint, 
  `end` bigint, 
  log_status string, 
  type string, 
  packets_lost_no_route bigint, 
  packets_lost_blackhole bigint, 
  packets_lost_mtu_exceeded bigint, 
  packets_lost_ttl_expired bigint, 
  tcp_flags int, 
  region string, 
  flow_direction string, 
  pkt_src_aws_service string, 
  pkt_dst_aws_service string)
PARTITIONED BY (
 `aws-account-id` string,
 `aws-service` string,
 `aws-region` string,
 year string,
 month string,
 day string,
 hour string)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ' ' 
LOCATION
  's3://BUCKET_NAME_REPLACE_ME/AWSLogs/'
TBLPROPERTIES (
  'skip.header.line.count'='1',
  'write.compression'='GZIP')

Choose Run query. This will create your table. which should now show Data navigation to the left.

Now we are all set to get access to our data!

To query for the top ten destination addresses by total bytes, you can use the following query:

SELECT dstaddr, SUM(bytes) AS total_bytes
FROM twgflowlogs.awslogs
GROUP BY dstaddr
ORDER BY total_bytes DESC
LIMIT 10;

This query groups the data by

dstaddr column,
calculates the total number of bytes for each group,
sorts the results in descending order by the total_bytes column.

The LIMIT clause returns only the top ten results.

As our table is partitioned, we can specify partition values in the FROM clause to limit queried data.

For example:

SELECT dstaddr, SUM(bytes) AS total_bytes FROM twgflowlogs.awslogs WHERE aws-account-id = ‘123456789012’ AND aws-region = ‘ap-southeast-2’ GROUP BY dstaddr ORDER BY total_bytes DESC LIMIT 10;

This query only considers data in the aws-account-id=123456789012 and aws-region=ap-southeast-2 partitions, which can improve performance and reduce costs.

If you want to query data for December 2022, you can use the following query:

SELECT dstaddr, SUM(bytes) AS total_bytes
FROM twgflowlogs.awslogs
WHERE year = '2022'
  AND month = '12'
GROUP BY dstaddr
ORDER BY total_bytes DESC
LIMIT 10;

This query only considers data in the year=2022 and month=12 partitions, which corresponds to the data for December 2022.

Note that the flow log format also has start and end columns that can also filter the data by date.

For example:

SELECT dstaddr, SUM(bytes) AS total_bytes
FROM twgflowlogs.awslogs
WHERE start >= 1669852800 AND end < 1670457600
GROUP BY dstaddr
ORDER BY total_bytes DESC
LIMIT 10;

This query only considers data where

the start timestamp is greater than or equal to the beginning of “Thursday, 01 December 2022, 12:00:00 AM GMT” (1669852800)
and the end timestamp is less than the end of “Thursday, 08 December 2022, 12:00:00 AM GMT” (1670457600).

Using this method, we can still get very fine-grained windows in the data without creating additional partitions.

Summary

This post shows Athena’s ability to access Transit Gateway Flow logs.

Next up, we will look at transforming the data with AWS Glue to make those source and destination addresses more meaningful!

I hope this helps someone else.

Cheers

Unleashing the power of AWS Athena on Transit Gateway Flow Logs

Setting up Transit Gateway Flow Logs to s3

Querying the data

Summary

Tags :

Share :

Related Posts

Process github Workflow Events with AWS Stepfunctions

Dude Scale My Runners

The Power of Self-Hosted github-actions