Last week I gave a talk at the AWS Community Day DACH 2023 in Munich.
During the presentation I showed the code live on screen. As there is no recording of the talk, I have added the examples to the slides.
When developing CDK constructs or libraries, there are often several ways to achieve similar outcomes. This talk will dive deep into some of these decisions, compare the options and thereby give the audience the information to pick the best option for themselves.
Examples of these decisions are project setup (projen or not), how to structure constructs and their properties, and how to hook into CDK for custom checks.
The slides are below and the example code is on GitHub.
Personio hosted its first “Platform & Pizzas” Meetup recently, where I presented my talk “Enabling teams in a fast-growing company to manage their own infrastructure - How we work with CDK in Personio”.
In the presentation I explain why we use CDK at Personio, why it’s the right choice for us and which modifications we made to make it useful in our context.
I recently gave a talk at O’Reilly Velocity Conference in Berlin.
My initial plan was to write a blog version of it as well, but I didn’t have enough time.
Luckily there is a recording and the slides are available! Here’s the abstract:
CI/CD systems are usually tightly coupled, and the CD part brings with it a lot of administrative privileges combined with network access to production systems. We tend to believe that we only execute trusted software within those systems, but it quickly becomes clear that code from a huge variety of sources that isn’t under your control is loaded and executed there.
In the talk I will walk you through how to identify the most relevant issues along the steps of actual pipelines. You’ll take a deep dive into the confused deputy, a trusted third party that can be tricked into abusing its privileges, and see how directly associating code with access permissions on a public cloud provider can help to eliminate the need to trust the components in between.
Who doesn’t know this situation: sometimes something weird is going on in your setup, but everything is so vague that you don’t know where to start. This is a story about a bug we experienced in a production setup and how we found its root cause.
The problem
Imagine an ECS cluster with multiple tasks running with an NFS-backed (EFS in this case) persistent root volume. Launching new tasks works just fine, but as soon as you do a cluster upgrade, entire hosts seem to be unable to start tasks, or can only start a few and then break. Other hosts, however, work just fine. So scaling your cluster up or down becomes a really scary thing that can no longer be done without manual intervention. That’s an unacceptable situation, but what to do?
Want to know how we handle the NFS mount? My colleague Philipp has a blog post on that.
Symptoms
Only some hosts are affected
During the upgrades we noticed that some hosts didn’t have any issues with launching tasks. But what was the common pattern? Number of tasks, availability zone, something else?
As many host instances in eu-west-1c were affected, the AZ theory was the first to verify.
Things that could go wrong with EFS in an AZ are:
EFS mount point not there
Wrong security group attached to the mount point
Per-subnet routing in the VPC gone wrong
Routing / subnet config on the instances
iptables on the instance misconfigured
Docker interfaces catching NFS traffic
I checked all of them and everything was configured correctly, but when I logged on to a host and tried to mount one of the EFS volumes manually, it still just hung forever.
Could it be something else network related that I did not think of? UDP vs TCP in NFS?
It’s tcpdump time!
tcpdump -v port not 443 -i any -s 65535 -w dump.pcap excludes the traffic of the SSM agents running on the ECS instance and should otherwise catch all traffic on all interfaces.
The procedure: start tcpdump, run the mount command manually, stop tcpdump, upload the pcap file to S3 and fetch it to analyse it locally with Wireshark.
Digging through the whole dump, I was unable to find any packet to the EFS mount point. Did I do something wrong? But after a second try there was still not a single packet related to NFS.
The only possible conclusion here: the system does not even try to mount, but why?
Stracing things down
ps shows many other hanging NFS mounts triggered by ECS / Docker. But why are they hanging?
It’s strace time! Let’s see what is going on:
strace -f mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport,tcp,intr $EFSID.efs.eu-west-1.amazonaws.com:/ /opt/debugmount/$EFSID &
We see that everything hangs on the mount syscall, and for even longer than 10 minutes, which should already have hit every timeout. (In this case the last timeout should actually have triggered after two minutes: 600/10*2 seconds.)
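As a quick sanity check of that estimate: the NFS timeo option is given in tenths of a second and retrans is the number of retries. The function name below is made up for illustration, and the formula is only the back-of-the-envelope estimate used in this post, not the exact retry schedule of the NFS client.

```shell
# Rough upper bound in seconds, following the post's rule of thumb:
# timeo (tenths of a second) converted to seconds, times the retry count.
# nfs_timeout_estimate is a hypothetical helper, not an existing tool.
nfs_timeout_estimate() {
  local timeo="$1" retrans="$2"
  echo $(( timeo / 10 * retrans ))
}

nfs_timeout_estimate 600 2   # prints 120, i.e. two minutes
```

A mount that is still hanging after more than ten minutes is therefore well past anything these options would explain.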
As I noticed before, there were many other hanging NFS mounts, so let’s kill them off and try again. Now the mount runs perfectly fine! Are we hitting an AWS service limit here? Is there some race condition where the mounts block each other? According to the AWS docs there is no such service limit, so let’s stick with the deadlock theory and try to build a test case.
function debugmount {
  mkdir -p "/opt/debugmount/$1"
  echo "start $1"
  mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport,tcp,intr "$1.efs.eu-west-1.amazonaws.com:/" "/opt/debugmount/$1"
  echo "finish $1"
}
# this is a random list of about 15 EFS mounts
for fs in $MYFSLIST; do
  debugmount "$fs"
done
All of this looks quite sane. There is some fiddling with the addr option, but we also tested mounting by IP and had the same issue.
Stepping back to our initial problem: The issues only occur on a cluster update where all tasks are re-scheduled to other instances. What if that happens not sequentially but in parallel?
Our simple test is just running in a loop, so one mount command runs after another.
mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport,tcp,intr $1.efs.eu-west-1.amazonaws.com:/ /opt/debugmount/$1 &
Now we are backgrounding the mount command, so we no longer wait for it to complete but immediately start the next one.
Here we go: after a few iterations things seem to block and mounting stops working. We now have a test case that we can use to verify whether we fixed the issue.
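Put together, the parallel variant of the test could look roughly like the sketch below. It uses the same mount options as above. MOUNT_BIN and DEBUG_BASE are knobs I made up for this sketch (so it can be dry-run with a stub instead of a real mount); they were not part of the original test.

```shell
# Sketch of the parallel reproduction.
# MOUNT_BIN and DEBUG_BASE are hypothetical knobs, added for illustration only.
MOUNT_BIN="${MOUNT_BIN:-mount}"
DEBUG_BASE="${DEBUG_BASE:-/opt/debugmount}"
NFS_OPTS="nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport,tcp,intr"

parallel_debugmount() {
  local fs
  for fs in "$@"; do
    mkdir -p "$DEBUG_BASE/$fs"
    echo "start $fs"
    # Background each mount so the next one starts immediately.
    "$MOUNT_BIN" -t nfs4 -o "$NFS_OPTS" "$fs.efs.eu-west-1.amazonaws.com:/" "$DEBUG_BASE/$fs" &
  done
  # Returns once every background mount has returned; never returns if they deadlock.
  wait
  echo "all mounts returned"
}

# Usage (as root, with the same word-split list as in the sequential test):
#   parallel_debugmount $MYFSLIST
```

If the final line never shows up, the mounts are stuck, which is exactly the symptom we are after.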
And now
We now know that the issue has to be in the kernel and we have a way to reproduce it, but I have no experience in tracing what is actually going on inside the kernel.
So Google has to help out: searching for nfs mount race condition deadlock rhel returns a bug report.
The issue described in the report looks very similar to what we experienced, and one of the comments mentions that this behaviour was introduced with NFS 4.1.
So let’s change our snippet to use NFS 4.0 and see if everything now works as expected:
mount -t nfs4 -o nfsvers=4.0,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport,tcp,intr $1.efs.eu-west-1.amazonaws.com:/ /opt/debugmount/$1 &
Awesome. Our issue, quite complex to debug, turned out to be easy to solve. We changed the mount options in our ECS cluster and everything is running smoothly now.
At the time of writing, the current kernel of the AWS ECS-optimized AMI does not seem to contain that bug fix.
Site reliability engineering (SRE) is a discipline that incorporates aspects of software engineering and applies them to operations, with the goal of creating ultra-scalable and highly reliable software systems.
A more detailed explanation can be found on Google’s SRE page.
Back in October 2016 I gave a talk at the Munich AWS User Group about managing AWS multi-account infrastructure at glomex. This post summarizes the talk and initially appeared on the glomex techblog.
As for the motivation to go ahead and deal with the complexity of such a setup - there are a lot of good reasons:
It’s recommended by AWS for security and billing concerns.
Concerning security, you can control access and security on a much more granular level. So for example you don’t have to deal with designing IAM rules so that one team can’t access another team’s resources – say EC2 resources - because you “physically” separate those resources from each other.
Billing-wise, you instantly get more insight into how much a team is spending or how much a certain product costs, without needing to build reports based on billing tags. You can still use those tags within the accounts to drill down further into how much a certain aspect of your infrastructure / product contributes to your AWS spending.
Mimic your organization’s hierarchy. Sometimes, different departments have different guidelines or budgets so they want to be “in control” and not share accounts in order to avoid interference.
Separation of concerns: Separate your staging environments from each other and, most importantly, from your production environment. This makes it much harder to interconnect services from different stages which results in better isolated systems and possibly more consistency. This also protects your production environment from hitting either AWS account limits or API rate limits. Nothing more embarrassing than an AutoScalingGroup that can’t scale up because a load test done by the QA team eats up all of your EC2 instance allowance, right? Or you can’t deploy because a rogue developer script triggers rate limiting for the AWS API in that account.
This separation also helps with decommissioning of services. Sure, you should have all your infrastructure as code but in some cases, it still makes sense to shut down a complete account including deleting all name spaces and start fresh.
In the end it all comes down to treating AWS accounts as just another volatile/non-static resource that can be “instantiated” as you wish.
In a nutshell, all of those measures also help to minimize the blast radius of things gone wrong. Account limits, API limits, rogue scripts, security incidents – all of those can be limited to a smaller section of your whole cloud infrastructure by having a multiple account strategy.
The image above shows the rough glomex setup. Each of our teams gets a set of up to four AWS accounts which they use for their environments. The “Team Ops” accounts have some special IAM roles set up for accessing the other accounts, but are otherwise no different from the development team accounts.
Furthermore, we have one master billing account which also serves as the account that holds our IAM users. This means that none of our other accounts have any users configured (except for some special 3rd party tools which can’t deal with STS). We enforce two-factor authentication, and users must then switch to one of a set of preconfigured roles to be able to access any of the other accounts. This way we can easily manage who gets access to which account. The users are kept directly within AWS with no active connection to our user management backend. Instead, we sync users from our LDAP to IAM whenever we make changes to a user. So in case our LDAP fails, we can still access AWS.
Finally, we have another special account which contains all of our CloudTrail logs. This account has very restricted access as it contains sensitive data.
For the management of our accounts, we have created an internal tool that handles the creation of VPCs, IAM federation and the deployment of basic resources. We are very much looking forward to the GA of AWS Organizations, since there is still a lot of manual work (such as the initial account registration on the AWS website) that we haven’t automated yet.
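To make the role switching described above a bit more tangible: with the AWS CLI, such a setup can be expressed as named profiles, where one profile holds the IAM user in the central account and the others assume a role in a team account with MFA. This is only a sketch of the general mechanism, not our actual configuration. All account IDs, profile names, role names and user names below are made up.

```ini
# ~/.aws/config
# All account IDs, profile names, role names and user names are hypothetical.

# The IAM user lives in the central account; its access keys are stored
# under [users] in ~/.aws/credentials.
[profile users]
region = eu-west-1

# Switching into a team account means assuming a preconfigured role there.
# mfa_serial makes the CLI ask for a two-factor code before assuming the role.
[profile team-a-prod]
role_arn = arn:aws:iam::222222222222:role/developer
source_profile = users
mfa_serial = arn:aws:iam::111111111111:mfa/jane.doe
region = eu-west-1
```

A call such as aws s3 ls --profile team-a-prod would then prompt for the MFA code and run with temporary STS credentials for the role in the team account.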
Pitfalls we discovered
Along the way we found some pitfalls that sometimes were surprising, but eventually we found ways to work around all of them:
Tool support for cross-account (STS) is not as good as expected. Since the various AWS SDKs usually handle authentication and authorization we were sometimes surprised that some tools explicitly wouldn’t understand STS credentials:
S3cmd
Serverless framework (at least in the beginning of 2016; we have since abandoned using it)
Kinesis agent (has been fixed)
AWS support for cross-account resource access is underwhelming:
API gateway is always public
VPC security groups cannot be used across accounts
S3 bucket permissions between more than 2 accounts are a mess
Complex trust relationships between accounts needed
Complex networking setup: We have decided against peering for all accounts but still do peer to some accounts
Hard to get a good overview over all accounts with AWS tools:
Billing works fairly well with Cost Explorer
Metrics are best used with a 3rd party tool, we use DataDog
Config Rules are too expensive
Costs do multiply
Config Rules
Support costs
VPN connections
User support and education is more demanding
User federation / STS hard to grasp for sporadic users
Tools we used
As there is surprisingly little tooling out there to support such setups, we have created some custom tools, which we are in the process of open-sourcing:
LDAP – IAM sync
LDAP SSH key user management for EC2 instances
Account / environment detection for services as a safety net
Base setup tool
Custom deployment tools (CloudFormation, Lambda, CodeDeploy, API Gateway)
Account creation automation
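To give an idea of what account detection as a safety net can mean, here is a minimal sketch (not our actual tool): a deployment script asks STS which account the current credentials belong to and refuses to continue if it is not the expected one. The function name and the account ID in the usage line are made up. The only real call is aws sts get-caller-identity.

```shell
# Hypothetical guard: succeed only if the current credentials belong to
# the expected AWS account. Returns 1 on a mismatch, 2 if the lookup fails.
require_account() {
  local expected="$1"
  local actual
  actual="$(aws sts get-caller-identity --query Account --output text)" || return 2
  if [ "$actual" != "$expected" ]; then
    echo "refusing to run: credentials are for account $actual, expected $expected" >&2
    return 1
  fi
}

# Usage at the top of a deployment script (account ID is a placeholder):
#   require_account 222222222222 || exit 1
```

The check is cheap, and it catches the classic mistake of running a staging script with production credentials still active in the shell.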
To sum it all up, we can say that from our experience the investment in this kind of account structure has already paid off. We had numerous occasions where somebody would have killed a production setup if it had been in the same account (deleting the wrong resources, eating up various AWS limits, etc.). With AWS Organizations, it should also become much easier to get rid of the cumbersome part of initial account creation; for us, that would mean that we will have even more accounts in the future.
Road to the Tajik border. The cars are at their limit: we are at 4300 m above sea level here.
Nice hot spring in Tajikistan. We gave the lady new clothes for her child. She seemed quite happy about that.
The Tajik / Afghan border