AWS blocked me?! And how it was always DNS...

Published: 2024-06-11

Tags: the-cyclical-chagrin, web-dev

I have written a lot recently, and I kept adding to it. So, I'll have a table of contents here.

This is just me writing down my live thoughts, so it does not have any structure. Sorry, if you ever come across my random ramblings...

Contents

Context about change of plans...

A few weeks ago, I had an interview for a React Frontend Developer position. They wanted me to write a calculator app from scratch in 2 hours. I thought that it would be easy, but I was dumbfounded. I had not developed in React in almost six months, and my brain was still in Svelte-world. I was able to make the calculator app in two hours, but it was not satisfactory. And, wouldn't you know it, I did not get the job. It was at that moment I realized I need to take this seriously.

So, I'm pivoting to practicing what I think I know, and what's prominently on my resume. Looking at my public projects, I don't have any significant React, Golang (maybe), or AWS projects. I'm changing that by trying to make a React app hosted in AWS. And so it begins...

Problem 1: AWS Blocked me?

  • Problem: AWS blocked me?
  • Solution: Just contact them LOL

I begin with my weakness: cloud infrastructure. I know AWS stuff, but I am missing a lot of the in-between knowledge that serves as "aha" moments for me. I took for granted the abstractions provided at work, so I only needed to think about "connecting major services". Now, I only have me, the internet, AWS' not-so-beginner-friendly docs with their differing opinions, and a few guys I hope to find who did the exact things I'm trying.

And it got off to a bad start. I was trying to make resources, and every attempt failed. So I went to provision resources manually, launching my own service, and it never got past the desired state. This went on for two days. Only when I provisioned a task did I see the 400 error status telling me that I was blocked.

So AWS blocked me???

Being blocked cost me two days, but even after that, setting up a basic ECS on Fargate service, just like at work, was a struggle. I think I could have easily done this by provisioning it manually, but I wanted to do it the CDK way.

Problem 2: ECR access

  • Problem: Can't access ECR from ECS Task
  • Solution:
    • Initial solution was to use NAT gateway, and I did not intend that
    • Later solution was to use the VPC endpoints, which I had hooked up to the wrong subnet. My applications were running in the private subnet with egress, but the Gateway endpoint for S3 (used for pulling the image) was in the private ISOLATED subnet. The ECS tasks were not in that subnet.

So, here's a link to the infrastructure diagram I wrote in Notion.

If the diagram has not changed, I have a Github repository, which should have a Github Action that pushes a Docker image to AWS ECR, AWS' Docker image repository. I want my ECS services to pull images from that repository. I knew it was not gonna be simple because my ECS service was in a VPC. I knew there were things like Private Links and IAM permissions involved just to do this thing I want. I could serve my ECS on a public subnet, but that would not reflect how businesses should do it.

Here's our first error in the Task pulling from ECR:

ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed: unable to retrieve ecr registry auth: service call has been retried 3 time(s): RequestError: send request failed caused by: Post "https://api.ecr.us-west-2.amazonaws.com/": dial tcp 34.223.25.165:443: i/o timeout. Please check your task network configuration.

After a bit of search, I found this StackOverflow answer that solved this problem: CannotPullContainerError in ECR when trying to use VPC Endpoints

I tried the solution here but it did not work. I only got the error below. (Hint: I will learn a hard lesson later that this was the correct solution. I only made a small mistake!)

CannotPullContainerError: ref pull has been retried 5 time(s): failed to copy: httpReadSeeker: failed open: failed to do request: Get AWS_ID.dkr.ecr.us-west-2.amazonaws.com/turnipreactinfrastack-turnipreact2be52df6-11pzkecigkb9:latest: dial tcp 52.92.212.154:443: i/o timeout

The task is failing to pull the container image from Amazon Elastic Container Registry. There appears to be a timeout occurring when trying to establish a connection to the registry to retrieve the image.

I tried Amazon Q, Amazon's built-in AI chat bot. I don't really trust it, but I want to see if it will help. Amazon Q recommends the following steps to resolve your error.


1. Go to the Amazon ECS console and navigate to the cluster containing the failing task
2. Select the failing task and click on the "Stop" button to stop the task
3. Navigate to the service containing the failing task
4. Click on the service name and then select the "Update" option
5. In the "Configure service" step, scroll down to the "Container definition" section
6. Click on the container name to expand the container details
7. Check if the image URI is correct and points to the correct Amazon ECR repository
8. If the image URI is correct, click "Next" through the remaining steps to update the service
If the issue persists after updating the service:

1. Go to the Amazon VPC console and check the security group rules for the ECS cluster
2. Ensure that outbound traffic is allowed on port 443 to the Amazon ECR endpoint (e.g., `AWS_ID.dkr.ecr.us-west-2.amazonaws.com`)
3. If the security group rules are correct, check the network ACL rules for the VPC and ensure that outbound traffic is allowed on port 443

If you don't have permissions to make the necessary changes, contact your AWS Administrator.

The solution above is close to what the actual solution is, but at the time I was reading this, I did not know how to translate this into CDK changes.

Then, I learned about this runbook: https://docs.aws.amazon.com/systems-manager-automation-runbooks/latest/userguide/automation-aws-troubleshootecstaskfailedtostart.html

But... The runbook did not help me out. It does not give me any diagnosis, or maybe I don't know how to find the results. The outputs don't make sense or point to what I need.

For this part, I found out I was trying to fix the permissions while my task was running in a public subnet, using instructions meant for a private subnet. (I wrote this without understanding that this was not exactly what I wanted. I wanted instructions for a private subnet with egress, not a public subnet.)

Okay, we switched from the public to the private subnet, 'cause our load balancer is in the public one anyway.

To do that, I did this:


this.loadBalancedFargateService = new ecsPatterns.ApplicationLoadBalancedFargateService(this, `${id}-service`, {
	cluster: this.cluster,
	memoryLimitMiB: 2048,
	desiredCount: 1,
	cpu: 512,
	taskImageOptions: {
		image: ecs.ContainerImage.fromEcrRepository(props.repository, "latest"),
		containerPort: 80,
		taskRole: props.taskExecutionRole,
		executionRole: props.taskExecutionRole,
	},
	// commented out the public subnets
	// taskSubnets: {
	//     subnets: this.vpc.publicSubnets,
	// },
	// assignPublicIp: true,
	loadBalancerName: `${id}-lb`,
	publicLoadBalancer: true
});

Problem 3: Docker and Fargate

  • Problem: How can the app in Docker be accessed from the internet.
  • Solution: Open and listen on a port outside the privileged range (ports below 1024). Tell the ECS Task that the container port is that port.

Fargate was now able to pull the image from ECR. But now, the task won't run. I get this:

CannotStartContainerError: ResourceInitializationError: failed to create new container runtime task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "80": executable file not found in $PATH: unknown

Here's what I got from StackOverflow: CannotStartContainerError: ResourceInitializationError: failed to create new container runtime task: failed to create shim: OCI runtime create failed:

Somewhere in between, while looking for answers, I had added an entry point. I commented it out:

this.loadBalancedFargateService = new ecsPatterns.ApplicationLoadBalancedFargateService(this, `${id}-service`, {
	cluster: this.cluster,
	memoryLimitMiB: 2048,
	desiredCount: 1,
	cpu: 512,
	taskImageOptions: {
		image: ecs.ContainerImage.fromEcrRepository(props.repository, "latest"),
		containerPort: 80,
		// commented out this one
		// entryPoint: ["80"],
		taskRole: props.taskExecutionRole,
		executionRole: props.taskExecutionRole,
	},
	publicLoadBalancer: true
});

After that, this was the CloudWatch log from the tasks, which were closing immediately after starting:

⨯ Failed to start server
	Error: listen EACCES: permission denied 0.0.0.0:80
	at Server.setupListenHandle [as _listen2] (node:net:1800:21)
	at listenInCluster (node:net:1865:12)
	at doListen (node:net:2014:7)
	at process.processTicksAndRejections (node:internal/process/task_queues:83:21) {
	code: 'EACCES',
	errno: -13,
	syscall: 'listen',
	address: '0.0.0.0',
	port: 80
}

I had an argument in my Dockerfile which overrides the hostname, so I removed it.


	# this was the previous code which has HOSTNAME
	# CMD HOSTNAME="0.0.0.0" node server.js

	# new CMD file
	CMD node server.js

We're so close! We got a new error:


⨯ Failed to start server
Error: listen EACCES: permission denied 172.31.14.242:80
at Server.setupListenHandle [as _listen2] (node:net:1800:21)
at listenInCluster (node:net:1865:12)
at GetAddrInfoReqWrap.doListen [as callback] (node:net:2014:7)
at GetAddrInfoReqWrap.onlookup [as oncomplete] (node:dns:109:8) {
code: 'EACCES',
errno: -13,
syscall: 'listen',
address: '172.31.14.242',
port: 80
}

It seems like the program run by our Dockerfile has issues. We're forcing it to open and listen on port 80. Apparently, ports in the lower range (below 1024) are privileged, and we are not allowed to listen on them without root or privileged permissions. (Reference: Container cannot bind to port 80 running as non-root user on ECS Fargate)
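To make the privileged-port rule concrete, here is a minimal TypeScript sketch of how a server could choose its listen port. The helper name `resolveListenPort` and its env-like parameter are assumptions for illustration, not code from the actual app; 3000 is the port this post ends up using, and 1024 is the first unprivileged port on Linux.

```typescript
// Ports below 1024 are "privileged" on Linux: binding to them needs root
// or an extra capability, which a non-root container process does not have.
const FIRST_UNPRIVILEGED_PORT = 1024;
const DEFAULT_LISTEN_PORT = 3000; // the port this post settles on

// Hypothetical helper: pick the listen port from an env-like object.
function resolveListenPort(env: Record<string, string | undefined>): number {
  const raw = env["PORT"];
  if (raw === undefined || raw.trim() === "") {
    return DEFAULT_LISTEN_PORT;
  }
  const port = Number(raw);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`PORT is not a valid port number: ${raw}`);
  }
  if (port < FIRST_UNPRIVILEGED_PORT) {
    throw new Error(
      `PORT ${port} is privileged; use ${FIRST_UNPRIVILEGED_PORT} or higher`
    );
  }
  return port;
}
```

With PORT unset this falls back to 3000, and with PORT=80 it fails fast with a readable message instead of an EACCES stack trace at listen time.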

So, we reverted the Exposed port and the PORT environment variable back to 3000. I also set the container port in the CDK definition to 3000.


this.loadBalancedFargateService = new ecsPatterns.ApplicationLoadBalancedFargateService(this, `${id}-service`, {
	cluster: this.cluster,
	memoryLimitMiB: 2048,
	desiredCount: 1,
	cpu: 512,
	taskImageOptions: {
		image: ecs.ContainerImage.fromEcrRepository(props.repository, "latest"),
		// old code
		// containerPort: 80,
		// new code
		containerPort: 3000,
		taskRole: props.taskExecutionRole,
		executionRole: props.taskExecutionRole,
	},
	publicLoadBalancer: true
});

I thought we might encounter problems in our ALB, but it works! Now... I just have to figure out how the ALB can get the port 3000 stuff.

We did it. It's live :crazy:

The problem I have is I only know CDK and the abstractions provided at work. I don't know the fundamentals of AWS cloud stuff; I only know how to connect services using the abstractions I've learned at work. And on top of the fundamentals, I also have to learn the CloudFormation mappings of these AWS resources. Knowing all three levels is essential to widen my search: I can look for results with terms that are solely AWS, CloudFormation, or CDK.

Problem 4: Deploying new images to ECS Fargate

When I was planning, I wanted my CDK infra to reflect the following properties: Security, Transparency, and Simplicity (vs Complexity). So, I was having a dilemma about which approach to take to make sure that our ECS tasks reflect the current code we have on main or in turnip/dev (my staging branch). For this one, I had two overlapping problems:

  • Problem 1: How do I deploy my infrastructure?
  • Solution: I ended up removing Github Actions since it was revealing too much information that I don't want people to know about. I've outlined some solutions if I ever need to collaborate with anyone. Since I'm the only person working on this, I'm not gonna go through the extra trouble of setting it up for collaboration.

Deploying using Github Actions had several issues:

  • The work of masking all secrets - too much work and fragile.
  • Should I use SSM params to hide the code? This would not hide the logs, but I can use the mask command in Github Actions to hide them. But I have to do this manually, which feels like it is against the principles of CDK. There might be a complex way of doing this, but I don't think a practical solution exists.
  • I could use Github Secrets instead of SSM params, but it does not change the situation much. It only changes the source of truth, and the complexity is just being shifted between the local workflow, Github repository, Github Action, and AWS' system.
  • Another idea I had was to just have local environment variables as the source of truth and use a git hook on main to somehow do magic, but like the previous solutions, it does not really get me out of this bad structure where I either have to manually update the values somewhere or have to come up with a complex system.
  • I looked into whether I can hide Github Action workflow logs, but it was not possible.
  • An alternative source of truth would be having a machine running outside Github Actions - too complex and bad UX for me.
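To show why the masking route felt manual and fragile, here is a minimal sketch of what it involves. GitHub Actions masks a value in later log output when a step prints an `::add-mask::` workflow command for it; the helper name `buildMaskCommands` and the example value are made up for illustration.

```typescript
// Hypothetical helper: build the workflow commands that ask the GitHub
// Actions runner to replace each value with *** in later log output.
// Every sensitive value has to be listed by hand, which is the fragile part.
function buildMaskCommands(sensitiveValues: string[]): string[] {
  return sensitiveValues
    .filter((value) => value.length > 0) // an empty mask is meaningless
    .map((value) => `::add-mask::${value}`);
}

// Made-up example value; a real workflow would print these lines to stdout
// before any step that might log the values.
const exampleMaskLines = buildMaskCommands(["123456789012", ""]);
```

Any value that is not passed in, or that only appears in the logs in a derived form (part of an ARN, for example), would still show up unmasked.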

Alternatives include:

  • Using a private mirror by having a Github Action that mirrors the repository in a private Github repository or in AWS CodeCommit for better convenience
  • The original code is public: no logs when working.
  • The original is private: we could create a hook that pushes to our public repo. I think this is my best option for making my code public while still allowing collaboration. Although, I think I can defer this task and stick to the final alternative.
  • Just deploy CDK locally! I'm the only person working here, and it's the best way to learn since I am still a CDK noob with no special tools that work provided.

  • Problem 2: How do I tell the ECS service to use the new docker images uploaded in ECR?
  • Solution: I opted to use a Github Action which pushes the Docker image to the ECR repo.

An alternative to this was to set up the events in CDK. I could set up a Cloudwatch event (or EventBridge event), that gets triggered when the ECR repository receives a new Docker image. This event would trigger a Lambda, which calls ECS Fargate to update itself. (Reference: ECR Image Change Detection )

The current solution isn't that different, but it uses Github Action to also call this Lambda after it pushes to ECR.
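For the event-driven alternative described above, the rule would have to pick out successful image pushes from everything else ECR reports. A minimal sketch of that filter, assuming the standard "ECR Image Action" event shape that ECR sends to EventBridge; the function name, the repository name `example-repo`, and the sample event are hypothetical.

```typescript
// Subset of the EventBridge event that ECR emits for image actions.
interface EcrImageActionEvent {
  source: string;
  "detail-type": string;
  detail: {
    "action-type": string;
    result: string;
    "repository-name": string;
    "image-tag"?: string;
  };
}

// True when the event is a successful push of `tag` to `repository`,
// which is the only case where the service should be redeployed.
function isDeployablePush(
  ev: EcrImageActionEvent,
  repository: string,
  tag: string
): boolean {
  return (
    ev.source === "aws.ecr" &&
    ev["detail-type"] === "ECR Image Action" &&
    ev.detail["action-type"] === "PUSH" &&
    ev.detail.result === "SUCCESS" &&
    ev.detail["repository-name"] === repository &&
    ev.detail["image-tag"] === tag
  );
}

// Hypothetical sample event, trimmed to the fields used above.
const samplePushEvent: EcrImageActionEvent = {
  source: "aws.ecr",
  "detail-type": "ECR Image Action",
  detail: {
    "action-type": "PUSH",
    result: "SUCCESS",
    "repository-name": "example-repo",
    "image-tag": "latest",
  },
};
```

The same conditions could be expressed declaratively as an EventBridge event pattern, so the Lambda only runs for matching events.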

Problem 5: Transferring DNS

  • Problem: How do we serve turnipxenon.com from Vercel servers while serving react.turnipxenon.com and staging-react.turnipxenon.com from AWS servers?
  • Solution: The simplest solution was to actually not switch to AWS completely. I still have the A and CNAME records for turnipxenon.com at Porkbun pointing towards Vercel. For *react.turnipxenon.com, we have NS (name server) records pointing towards the Route53 hosted zones, which serve as the authoritative servers for those subdomains.
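The split described in the solution can be sketched as a lookup: the most specific zone that contains a hostname decides who answers for it. This is a simplified model of NS delegation, assuming one delegated zone per subdomain; the type and function names are made up for illustration.

```typescript
type DnsProvider = "porkbun" | "route53";

interface ZoneAuthority {
  zone: string;
  provider: DnsProvider;
}

// The apex stays at Porkbun; the two subdomains are delegated to
// Route 53 through NS records (assumed: one hosted zone each).
const zoneAuthorities: ReadonlyArray<ZoneAuthority> = [
  { zone: "turnipxenon.com", provider: "porkbun" },
  { zone: "react.turnipxenon.com", provider: "route53" },
  { zone: "staging-react.turnipxenon.com", provider: "route53" },
];

// Return the provider of the most specific zone that contains `hostname`.
function authorityFor(hostname: string): DnsProvider | undefined {
  const host = hostname.toLowerCase().replace(/\.$/, "");
  let best: ZoneAuthority | undefined;
  for (const entry of zoneAuthorities) {
    const inZone = host === entry.zone || host.endsWith("." + entry.zone);
    if (inZone && (best === undefined || entry.zone.length > best.zone.length)) {
      best = entry;
    }
  }
  return best === undefined ? undefined : best.provider;
}
```

In this model, turnipxenon.com and any other subdomain keep resolving through Porkbun, while only the two delegated names are answered by Route 53.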

My initial plan was to transfer our DNS records to Amazon. I thought it was a good idea since I could create certificates (I did not understand how they worked at the time) and different subdomains via CDK. So, here's my initial plan:

  • Turn off our API calls on portfolio site, or update our UI to gracefully handle being unable to receive data.
  • Define AWS Route53 and certificates in CDK. I used this as reference: Route53 to ALB to ECS Fargate.
  • Transfer records from Porkbun to a CDK-provisioned hosted zone in AWS Route53. I read this as reference: DNS route traffic for subdomains.

I manually set up the DNS records for react.turnipxenon.com and turnipxenon.com in a single Route 53 hosted zone. It worked. At the time. It later broke down in Challenge 6. It also affected how my servers were served on the internet later. I had to pay the price of waiting 48 hours until staging-react.turnipxenon.com recovered from the error state that anyone visiting it would see.

My conclusion at the time was to host the DNS records on AWS. I have since updated the solution above to what I currently have, which is a mix of DNS records between Porkbun and AWS.

Challenge 6: DNS and SSL certs

  • Actual Problem: How to resolve SSL certificate conflict between Porkbun and AWS.
  • Perceived Problem: Why are the *.turnipxenon.com certificates not being validated?
  • Solution: Export the SSL cert from Porkbun and import it into AWS (ACM).

So, the solution in Challenge 5 did not work out. LOL


START RequestId: 263957c0-2dfc-45be-b2a2-8724bfa308d9 Version: $LATEST
2024-06-08T00:52:29.582Z 263957c0-2dfc-45be-b2a2-8724bfa308d9 ERROR AccessDeniedException: User: arn:aws:sts::AWS_ID:assumed-role/TurnipReactInfraStack-DeployerLambdaServiceRole1D29-VYdkqH51xMTG/TurnipReactInfraStack-DeployerLambdaE468B1A1-5UdlAvfRk12p is not authorized to perform: ecs:UpdateService on resource: arn:aws:ecs:us-west-2:AWS_ID:service/TurnipReactInfraStack-TurnipReactTurnipReactecsCluster4DBA1B1F-tkUP4mX1UoRg/TurnipReactInfraStack-TurnipReactTurnipReactserviceService937D1AF5-2x4MOF6Y1Uky because no identity-based policy allows the ecs:UpdateService action
at de_AccessDeniedExceptionRes (/var/runtime/node_modules/@aws-sdk/client-ecs/dist-cjs/index.js:2286:21)
at de_CommandError (/var/runtime/node_modules/@aws-sdk/client-ecs/dist-cjs/index.js:2216:19)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
at async /var/runtime/node_modules/@aws-sdk/node_modules/@smithy/middleware-serde/dist-cjs/index.js:35:20
at async /var/runtime/node_modules/@aws-sdk/node_modules/@smithy/core/dist-cjs/index.js:165:18
at async /var/runtime/node_modules/@aws-sdk/node_modules/@smithy/middleware-retry/dist-cjs/index.js:320:38
at async /var/runtime/node_modules/@aws-sdk/middleware-logger/dist-cjs/index.js:33:22
at async Runtime.handler (file:///var/task/index.mjs:16:25) {
	'$fault': 'client',
	'$metadata': {
		httpStatusCode: 400,
		requestId: '490e5462-4c2c-483e-9945-ad9e2cc7e2d5',
		extendedRequestId: undefined,
		cfId: undefined,
		attempts: 1,
		totalRetryDelay: 0
	},
	__type: 'AccessDeniedException'
}
END RequestId: 263957c0-2dfc-45be-b2a2-8724bfa308d9
REPORT RequestId: 263957c0-2dfc-45be-b2a2-8724bfa308d9 Duration: 773.97 ms Billed Duration: 774 ms Memory Size: 128 MB Max Memory Used: 87 MB Init Duration: 377.66 ms
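The log above says the deployer Lambda's role has no identity-based policy allowing `ecs:UpdateService` on the service. A minimal sketch of the policy document that would grant it, written as plain TypeScript data; the helper name and the example ARN are hypothetical, and the real ARN would be the one from the error message.

```typescript
interface IamStatement {
  Effect: "Allow" | "Deny";
  Action: string[];
  Resource: string[];
}

interface IamPolicyDocument {
  Version: "2012-10-17";
  Statement: IamStatement[];
}

// Hypothetical helper: allow only ecs:UpdateService, only on one service.
function allowUpdateService(serviceArn: string): IamPolicyDocument {
  // ECS service ARNs look like arn:aws:ecs:<region>:<account>:service/<cluster>/<service>
  if (!/^arn:aws:ecs:[^:]+:[^:]+:service\/.+/.test(serviceArn)) {
    throw new Error(`not an ECS service ARN: ${serviceArn}`);
  }
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Effect: "Allow",
        Action: ["ecs:UpdateService"],
        Resource: [serviceArn],
      },
    ],
  };
}

// Made-up ARN for illustration only.
const exampleUpdatePolicy = allowUpdateService(
  "arn:aws:ecs:us-west-2:123456789012:service/example-cluster/example-service"
);
```

Scoping the statement to the single service ARN keeps the Lambda from being able to update any other service in the account.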

We've been stuck on the "pending validation" part of our certificate. I saw some fine print I had missed about having a different DNS provider: Setting up DNS validation. It says something about Route 53 not being the domain's DNS server (step 3b in Setting up DNS validation).

Being stuck in pending validation was caused by the SSL certificate that exists in Porkbun. It covers *.turnipxenon.com, which is causing issues for react.turnipxenon.com. So whatever DNS magic happens when tracing through Porkbun for react.turnipxenon.com, it gets the SSL cert from Porkbun instead of AWS. As a solution, I imported the SSL cert generated by Porkbun into AWS and used it there. (Porkbun SSL cert to AWS ACM)
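As a side note on why the Porkbun certificate applies to these subdomains at all: a wildcard name such as *.turnipxenon.com matches exactly one extra label on the left. A minimal sketch of that matching rule (the function name is made up, and real TLS libraries handle more edge cases):

```typescript
// True when a certificate name covers the hostname. A leading "*." stands
// for exactly one label, so it never matches the apex or deeper names.
function certNameCovers(certName: string, hostname: string): boolean {
  const cert = certName.toLowerCase();
  const host = hostname.toLowerCase();
  if (!cert.startsWith("*.")) {
    return cert === host;
  }
  const suffix = cert.slice(1); // e.g. ".turnipxenon.com"
  if (!host.endsWith(suffix)) {
    return false;
  }
  const label = host.slice(0, host.length - suffix.length);
  return label.length > 0 && label.indexOf(".") === -1;
}
```

Both react.turnipxenon.com and staging-react.turnipxenon.com are a single label under turnipxenon.com, so the one wildcard certificate covers them.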

I don't know how much work it is to renew the certs for react.turnipxenon.com when Porkbun does it, but I'll figure it out when 18 Jul 2024 comes. That's a month from now!

react.turnipxenon.com had an invalid certificate at the time, but staging-react.turnipxenon.com had a valid one. So when I switched to using Porkbun's SSL certificate, the DNS records for the old Load Balancer were still cached. I had to wait 48 hours for the change to propagate to the new Load Balancer created from CDK.

Uh-oh, it's getting expensive

  • Problem: Pulling Docker images was very expensive since it's 200+ MB per image.
  • Solution: Fix the misconfigured Private Links or VPC endpoints from Challenge 2.

One of the answers I've found about expensive Docker pull to ECS Fargate was this link to Nathan's solution: ECS Cluster isolated VPC no NAT gateway

I saw I had a NAT gateway, so I removed it, but that caused the problem below:

CannotPullContainerError: ref pull has been retried 5 time(s): failed to copy: httpReadSeeker: failed open: failed to do request: Get AWS_ID.dkr.ecr.us-west-2.amazonaws.com/turnipreactinfrastack-turnipreact2be52df6-11pzkecigkb9:latest: dial tcp 52.92.212.154:443: i/o timeout

What I found out was that it was due to a missing VPC endpoint. It kinda confused me because, going by the link to Nathan's solution, I could not find which VPC endpoint was missing.

I investigated by looking at the resource map for the VPC the Fargate infrastructure was in. I used this troubleshooting page as a guide: Task cannot pull image. It also led me to this: ECS Fargate pull container error, based on the httpReadSeeker keyword I had in my error log.

I went through those documents step-by-step, trying to see where I made a mistake.

Flow showing that the subnet for application flows to route table and towards another VPC

The image above shows the application subnet that our ECS task was located in. I can see that it has a route to connect with other networks.

Flow showing that the isolated subnet only flows up to the route table and never towards another VPC

Then I looked at the other subnet, the isolated one. It doesn't go to other networks. This was when something kinda clicked in my mind.

I checked my S3 VPC endpoint.

Table of routes

There's the culprit! I had to understand the difference between my subnets. So I have three: public (ingress), private with egress (application), and isolated (internal).
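The whole story can be modelled in a few lines. A gateway endpoint works by adding a route to the route tables of the subnets it is associated with, so a task only benefits if it runs in one of those subnets. This sketch only models the S3 leg of an image pull (where the layers come from) and assumes tasks in the public subnet have a public IP; all names are made up for illustration.

```typescript
type SubnetKind = "public" | "privateWithEgress" | "privateIsolated";
type S3Path = "gateway-endpoint" | "nat-gateway" | "internet-gateway" | "unreachable";

interface VpcModel {
  // Subnet groups whose route tables got the S3 gateway endpoint route.
  s3EndpointSubnets: SubnetKind[];
  // Whether the egress subnets still have a route to a NAT gateway.
  hasNatGateway: boolean;
}

// How traffic to S3 leaves a task that runs in `taskSubnet`.
function s3PathFor(taskSubnet: SubnetKind, vpc: VpcModel): S3Path {
  const hasEndpointRoute = vpc.s3EndpointSubnets.some((s) => s === taskSubnet);
  if (hasEndpointRoute) {
    return "gateway-endpoint"; // the more specific endpoint route wins
  }
  if (taskSubnet === "public") {
    return "internet-gateway"; // assumes the task has a public IP
  }
  if (taskSubnet === "privateWithEgress" && vpc.hasNatGateway) {
    return "nat-gateway"; // works, but every pulled layer is billed NAT traffic
  }
  return "unreachable"; // shows up as the "i/o timeout" pull error
}
```

With the endpoint on the isolated subnets, a task in the egress subnet falls through to the NAT gateway (expensive) or, once the NAT gateway is removed, to "unreachable". Associating the endpoint with the egress subnets gives "gateway-endpoint".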

The problem was I just copy pasted the code here: CannotPullContainerError in ECR when trying to use VPC Endpoints.

The solution was to change the subnet where the s3 VPC endpoint is connected to.


new ec2.GatewayVpcEndpoint(this, 'S3GatewayEndpoint', {
	service: ec2.GatewayVpcEndpointAwsService.S3,
	vpc,
	// old code
	// subnets: [{ subnetType: cdk.aws_ec2.SubnetType.PRIVATE_ISOLATED }]
	// new code
	subnets: [{ subnetType: cdk.aws_ec2.SubnetType.PRIVATE_WITH_EGRESS }]
})

But it did not stop there.

I encountered a possible bug in CDK. The progress gets stuck when changing the subnet for the VPC endpoint. As a workaround, I had to manually add the routes to the application subnet, then delete the isolated subnet routes.

Future CDK shenanigans

Since everything is now working fine in the infrastructure side, I can finally try doing more complex things both in the logic and (still) infrastructure. I plan to use AWS Cognito and see if it doesn't break my bank. LOL

My code logic can be seen at github.com/TurnipXenon/turnip-react and the CDK infra code at github.com/TurnipXenon/turnip-infra.
