High Availability Fault Tolerant Architecture on AWS

1. Introduction

We shall go through the basic setup of a High Availability Fault Tolerant Architecture on AWS.

This setup was initially targeted at a web application in the Ad-Tech domain, but it can be customized to your requirements. It provides a template for hosting a customer-facing application (with High Availability, Fault Tolerance and Redundancy) and also takes a peek at an interface capable of handling streaming data.

Assumptions:

The reader is expected to have a general awareness of AWS basics such as Regions, Availability Zones, VPCs, Subnets, Security Groups, EC2, RDS, Load Balancers and Auto Scaling Groups.

2. Requirements

Highly available and fault-tolerant setup.
The web application needs to be deployed on a Linux server with a relational database (say Postgres) as backend.
Receive streaming data and display reports in near real time.

3. The Menu (AWS Architecture Diagram)

The architecture diagram is given for reference and we shall discuss each of the components and the planned data flow in detail.

4. Starters (with defaults)

4.1. Region & Availability Zone

We need to choose the AWS Region that is closest to the clients/business. We also need to ensure that the selected Region has at least 3 Availability Zones (AZ).

4.2. VPC

A default VPC is available in your AWS account. I think it is best to use that (there could be unexpected issues related to using a non-default VPC). The VPC comes with a CIDR block with a /16 netmask, giving 65,536 (approx. 65K) IP addresses.

4.3. Public/Private Subnets, CIDR Blocks and IP allocation (+ Route Table, Internet Gateway)

In our use case, we are planning to use 3 Availability Zones within the Region (let's call them AZ1A, AZ1B and AZ1C), shown with an orange dotted-line border in the diagram. In the AWS console, they appear as us-east-1a, us-east-1b, us-east-1c and so on.

By default, one subnet is already created for us in each Availability Zone within the Region. The default route table (RT) has a route to the default Internet Gateway (IGW), and all the default subnets are associated with this default route table. Each default subnet's CIDR block has a /20 netmask, which gives 4,096 addresses, of which 4,091 are available because AWS reserves 5 addresses in every subnet.
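The address counts for the VPC and its default subnets can be checked with Python's standard ipaddress module. This is only a sketch: the 172.31.0.0/16 range is the block a default VPC is normally created with, so treat it as an assumption and substitute your own CIDR.

```python
import ipaddress

# AWS keeps the first four addresses and the last address of every subnet.
AWS_RESERVED_PER_SUBNET = 5


def usable_ips(cidr: str) -> int:
    """Return how many addresses remain available in a subnet of this size."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


# Assumed default VPC range; check the actual value in your account.
vpc = ipaddress.ip_network("172.31.0.0/16")

print(vpc.num_addresses)            # 65536
print(usable_ips("172.31.0.0/20"))  # 4091
```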

For clarity, let us rename the route table to ‘Public-RT’ and mark the subnets as public by appending ‘public’ along with the AZ name to each of them. We mark these subnets as public since they already have a route to the internet. In the diagram, the subnets renamed this way are PBSN-AZ1A, PBSN-AZ1B and PBSN-AZ1C. The remaining subnets that we are not planning to use can be renamed as NOT-in-USE, so that we avoid any confusion when selecting a subnet later in the setup. Additional private subnets will be created later (in the Main Course).

4.4. Security Groups

The default security group allows all traffic originating from within the same security group. Let us rename this as ‘SG-Internal’. We shall create three more security groups in the Main Course (step 5.3).

5. Main Course (adding More dishes)

5.1. Private Route Table

Create a new Route table without an entry/route to the IGW. Name it as Private_RT.

5.2. Private Subnets

We need one private subnet in each of our selected AZs. As you may remember, the default subnets were renamed as public. Create one new subnet in each AZ with the same /20 netmask (adjust the 3rd octet of the CIDR block so that it does not overlap an existing subnet). Name them PRSN-AZ1A, PRSN-AZ1B and PRSN-AZ1C.

The new subnets will use the Public-RT by default. To make them private, change the assigned route table to the Private_RT that we created above. Now these subnets are truly private, with no direct access to the internet.
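The third-octet adjustment for the private subnets can be worked out rather than guessed. The sketch below lists every /20 block inside the VPC and picks the first three that are not already taken. The VPC range and the list of in-use blocks are assumptions for illustration; read the real values from the console, and include the subnets you renamed NOT-in-USE, since they still occupy their CIDR blocks.

```python
import ipaddress

# Assumed default VPC range.
vpc = ipaddress.ip_network("172.31.0.0/16")

# Assumed CIDR blocks of the subnets that already exist in the VPC.
in_use = {
    ipaddress.ip_network("172.31.0.0/20"),
    ipaddress.ip_network("172.31.16.0/20"),
    ipaddress.ip_network("172.31.32.0/20"),
}

# Every /20 that fits in the VPC, minus the blocks already taken.
free_blocks = [b for b in vpc.subnets(new_prefix=20) if b not in in_use]

# Pair the first three free blocks with the private subnet names.
names = ["PRSN-AZ1A", "PRSN-AZ1B", "PRSN-AZ1C"]
private_subnets = dict(zip(names, free_blocks))

for name, block in private_subnets.items():
    print(name, block)
```

With these assumed inputs, the three private subnets land on 172.31.48.0/20, 172.31.64.0/20 and 172.31.80.0/20.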

5.3. Security Groups (additional)

The existing/default security group was renamed as SG-Internal in step 4.4 above. We are creating specific and separate security groups for flexibility. Let us create 3 additional security groups with the below names and inbound rules.
1. SG-http80-ALL - In the inbound rules, allow HTTP traffic on port 80 from ALL sources.
2. SG-https443-ALL - In the inbound rules, allow HTTPS traffic on port 443 from ALL sources.
3. SG-admin-ssh - In the inbound rules, allow SSH traffic on port 22 from your custom IP.
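The three inbound rules can also be written down as data. The sketch below builds them in the dictionary shape that boto3's EC2 client accepts for the IpPermissions argument of authorize_security_group_ingress; it does not call AWS. SSH listens on port 22 by default, so that is the port used for SG-admin-ssh. The admin address is a hypothetical placeholder taken from a documentation-only range.

```python
def tcp_rule(port: int, source_cidr: str) -> dict:
    """Build one inbound TCP rule for a single port and source range."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "IpRanges": [{"CidrIp": source_cidr}],
    }


ANYWHERE = "0.0.0.0/0"
ADMIN_IP = "203.0.113.10/32"  # hypothetical admin address, replace with yours

security_groups = {
    "SG-http80-ALL": [tcp_rule(80, ANYWHERE)],
    "SG-https443-ALL": [tcp_rule(443, ANYWHERE)],
    "SG-admin-ssh": [tcp_rule(22, ADMIN_IP)],
}

for name, rules in security_groups.items():
    for rule in rules:
        print(name, rule["FromPort"], rule["IpRanges"][0]["CidrIp"])
```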

5.4. Load Balancers

Select Classic LB when creating a new Elastic Load Balancer (ELB).
Select your VPC and un-check the Internal option. We need an internet-facing ELB.
Enable advanced VPC options and this will enable the subnet selection section.
Listener: The listener can be HTTP (80). HTTPS (443) can be added if your application needs it. I am not selecting it for the time being, and am therefore skipping the steps related to certificate installation.
Subnet: Since our ELB is internet facing, select the public subnets in the three AZs that we are using. They should be easy to identify if we renamed them during the initial setup. Note: Behind the scenes, AWS creates load balancer nodes in these subnets to handle the traffic.
Security Group: From the existing security groups, select SG-Internal and SG-http80-ALL (plus SG-https443-ALL as per your requirement). You may get a warning on the next page if HTTPS (443) is not selected.
Health Check: Use HTTP or TCP for health check as per the design of your application. Also select the frequency of the ping.
EC2: Skip this step of adding EC2 instances since we would be configuring the same in the Auto Scaling Group later.
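The health check settings are worth reasoning about before clicking through, because the ping frequency and the failure threshold together decide how long a broken instance keeps receiving traffic. The sketch below uses the field names of the Classic ELB HealthCheck structure (Target, Interval, Timeout, UnhealthyThreshold, HealthyThreshold). The /health path and all the numbers are example values, not recommendations, and the detection time is a rough estimate.

```python
def health_check(target: str, interval: int, timeout: int,
                 unhealthy_threshold: int, healthy_threshold: int) -> dict:
    """Build a health check definition after a basic sanity check."""
    if timeout >= interval:
        raise ValueError("timeout must be shorter than the interval")
    return {
        "Target": target,
        "Interval": interval,                    # seconds between pings
        "Timeout": timeout,                      # seconds to wait for a reply
        "UnhealthyThreshold": unhealthy_threshold,
        "HealthyThreshold": healthy_threshold,
    }


def seconds_to_mark_unhealthy(check: dict) -> int:
    """Rough time for consecutive failed pings to take an instance out."""
    return check["Interval"] * check["UnhealthyThreshold"]


# "HTTP:80/health" assumes the application exposes a /health page.
check = health_check("HTTP:80/health", interval=30, timeout=5,
                     unhealthy_threshold=2, healthy_threshold=10)
print(seconds_to_mark_unhealthy(check))  # 60
```

For an application without an HTTP endpoint, a TCP target such as "TCP:80" only verifies that the port accepts connections.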

Note: ELB is outside Free-tier and chargeable. $$$

5.5. Launch Configuration

The launch configuration (LC) created here will be used by the Auto Scaling Group to spin up new EC2 instances as required. To attain high availability, we can save a fully configured EC2 instance of our application as a template. This can be saved as our Golden-AMI. The advantage is that a new instance, when launched, will have all the required software and the application installed and ready to serve. On the flip side, it takes effort to maintain the Golden-AMI if your application is going through frequent changes (not touching CI/CD here).
1. Select your Golden-AMI (or any other).
2. Select the required instance type.
3. Name your LC and, in the advanced section, select Do NOT assign Public IP for the IP Address type.
4. Select the storage, termination action and encryption options as required.
5. Security Group - select SG-Internal and SG-http80-ALL (plus SG-https443-ALL as per your requirement).

5.6. Auto Scaling Group

Give a group name and select the number of instances to start with. Ideally it should be a minimum of 2, since we are planning a high-availability setup behind a load balancer.
Subnet: Select the Private subnets that we created earlier. The new EC2 instances will be placed here and will not be accessible directly from outside.
In the Advanced section, select the checkbox to receive traffic from a load balancer and select the ELB we created in step 5.4 above.
The health-check can be ELB since it is already defined there.
Scaling Policy - use the scaling policy to scale the EC2 instances in and out. Also define the MIN and MAX number of EC2 instances that we may require. One example is to add or remove an instance based on CPU utilization. Alarms can also be configured here to keep us notified of the changes.
Notifications can also be sent when instances are launched or terminated, or when a launch or termination fails.
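The interplay between a CPU-based policy and the MIN/MAX limits is easy to see in a small simulation. This is a toy model rather than the actual Auto Scaling algorithm: the 70% and 30% thresholds, the one-instance step and the CPU readings are all made-up values.

```python
def desired_capacity(current: int, cpu_percent: float, *,
                     minimum: int, maximum: int,
                     scale_out_above: float = 70.0,
                     scale_in_below: float = 30.0) -> int:
    """Add or remove one instance based on CPU, clamped to MIN and MAX."""
    if cpu_percent > scale_out_above:
        target = current + 1
    elif cpu_percent < scale_in_below:
        target = current - 1
    else:
        target = current
    return max(minimum, min(maximum, target))


capacity = 2
history = []
# Hypothetical average CPU readings, one per evaluation period.
for cpu in [45, 82, 91, 88, 60, 20, 15, 10]:
    capacity = desired_capacity(capacity, cpu, minimum=2, maximum=4)
    history.append(capacity)

print(history)  # [2, 3, 4, 4, 4, 3, 2, 2]
```

The group never grows past 4 under sustained load and never drops below 2 when idle, which is what keeps the setup both cost-bounded and highly available.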

Note: Once the ASG is created, it will kick-in and start spinning up the new EC2 instances. $$$

5.7. RDS with Multi-AZ

The existing ‘default’ subnet group seems to span all AZs and subnets. Since our requirement is to have the RDS instance created in private subnets, we shall create a new subnet group with our selected Availability Zones and the private subnets where we want to place the RDS instance. Let us name it SubNet_Private.

When we go for Multi-AZ RDS, AWS creates a standby copy of our database in a different AZ behind the scenes and keeps it in sync with the primary. In case of failure, requests are redirected to the standby. The steps to create a Multi-AZ RDS instance are as below.
1. Select the RDS engine.
2. Use case - for this demo, we are going with Dev/Test.
3. Provide the instance specs as required (license, version, instance class, Multi-AZ - Create replica, type, storage).
4. Provide the settings (instance identifier, master user & password).
5. Select the VPC (default) and the subnet group (SubNet_Private) created above.
6. Public Accessibility: No.
7. Availability Zone: (greyed out if the Prod use case is selected in step 2 above).
8. Security Group: Select SG-Internal. Additionally, a new security group is required here to allow traffic on the database port (for example, 5432 for Postgres).
9. Select the remaining options as per the application requirements (name, port, parameter group, option group, IAM, encryption, backup, monitoring, maintenance).
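For reference, the console choices above map roughly onto the keyword arguments of boto3's RDS create_db_instance call. The sketch below only builds the argument dictionary and does not call AWS. The identifier, instance class, storage size, user name and security group IDs are hypothetical placeholders, and the password is read from an assumed DB_PASSWORD environment variable instead of being written into the script.

```python
import os

rds_params = {
    "DBInstanceIdentifier": "webapp-db",   # hypothetical name
    "Engine": "postgres",
    "DBInstanceClass": "db.t3.micro",      # hypothetical instance class
    "AllocatedStorage": 20,                # GiB, hypothetical size
    "MasterUsername": "appadmin",          # hypothetical user
    "MasterUserPassword": os.environ.get("DB_PASSWORD", "<set-me>"),
    "MultiAZ": True,                       # standby copy in another AZ
    "DBSubnetGroupName": "SubNet_Private", # private subnets only
    "PubliclyAccessible": False,
    "Port": 5432,                          # Postgres default port
    # Placeholder IDs for SG-Internal and the database-port group.
    "VpcSecurityGroupIds": ["sg-INTERNAL", "sg-DBPORT"],
}

for key in ("MultiAZ", "DBSubnetGroupName", "PubliclyAccessible", "Port"):
    print(key, rds_params[key])
```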

Note: Multi-AZ is outside Free-tier and chargeable. $$$

6. Desserts

6.1. Kinesis Firehose & S3 buckets

After considering some options (like Storm/Flink/Kafka), it was found to be easier to set up a Kinesis Delivery Stream that is managed by AWS. This is a detailed topic which could be documented separately. A few key points are noted below for reference.
1. Identify the source of data (it could be Direct PUT / other sources or a Kinesis Stream). There are multiple ways in which you may deploy Direct PUT.
2. Decide if the received data has to be processed (transform / convert).
3. Select the destination for the received data (S3, Splunk, RedShift, ES).
4. Set the buffer size, interval, compression and encryption.
5. Additionally, you may need to select the IAM role and AWS Cognito-ID to complete the setup.
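The buffer size and interval settings decide how "near real time" the reports can be: a batch is delivered when either limit is reached, whichever comes first. The class below is a toy model of that size-or-interval behaviour and is not the Firehose API; the 10-byte and 60-second limits and the timestamps are made up to keep the example small.

```python
class DeliveryBuffer:
    """Toy size-or-interval buffer; delivers a batch when either limit hits."""

    def __init__(self, max_bytes: int, max_seconds: float):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.records = []
        self.size = 0
        self.opened_at = None
        self.delivered = []  # each entry is one batch sent to the destination

    def put(self, record: bytes, now: float) -> None:
        # Deliver a batch that has been open too long before adding to it.
        if self.records and now - self.opened_at >= self.max_seconds:
            self.flush()
        if not self.records:
            self.opened_at = now
        self.records.append(record)
        self.size += len(record)
        # Deliver straight away once the size limit is reached.
        if self.size >= self.max_bytes:
            self.flush()

    def flush(self) -> None:
        if self.records:
            self.delivered.append(self.records)
            self.records, self.size, self.opened_at = [], 0, None


buf = DeliveryBuffer(max_bytes=10, max_seconds=60)
buf.put(b"aaaa", now=0)
buf.put(b"bbbb", now=10)
buf.put(b"cccc", now=20)   # 12 bytes buffered, delivered by size
buf.put(b"dddd", now=30)
buf.put(b"eeee", now=95)   # previous batch is 65 s old, delivered by interval
print(len(buf.delivered))  # 2
```

A smaller interval gives fresher reports at the cost of many small objects at the destination; a larger buffer does the opposite.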

A separate document will be created for this.

6.2. Route53

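Although the ELB provides a domain name for sending requests to it, that name does not resolve to a static IP. So we cannot use plain A or AAAA records in the domain registrar's control panel. Instead, point the domain at the ELB's DNS name, either with a CNAME record for a subdomain or with a Route 53 alias record, which also works at the root of the domain.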
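One way around the missing static IP is a Route 53 alias record that targets the ELB's DNS name. The sketch below builds a change batch in the shape that boto3's Route 53 client accepts for change_resource_record_sets; it does not call AWS. The domain, the ELB DNS name and the zone ID are hypothetical placeholders. Note that the HostedZoneId inside AliasTarget is the load balancer's own hosted zone ID, not the ID of your domain's zone.

```python
def alias_change_batch(record_name: str, elb_dns_name: str,
                       elb_hosted_zone_id: str) -> dict:
    """Build an UPSERT for an alias A record pointing at a load balancer."""
    return {
        "Comment": "Point the site at the load balancer",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "A",
                "AliasTarget": {
                    "HostedZoneId": elb_hosted_zone_id,
                    "DNSName": elb_dns_name,
                    "EvaluateTargetHealth": False,
                },
            },
        }],
    }


# All three values below are placeholders for illustration.
batch = alias_change_batch(
    "www.example.com.",
    "my-elb-1234567890.us-east-1.elb.amazonaws.com.",
    "ZEXAMPLE123",
)
print(batch["Changes"][0]["ResourceRecordSet"]["AliasTarget"]["DNSName"])
```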

7. Pay your bills

There are a few specific areas where the above setup will cost you money beyond the free-tier limits.
1. Multi-AZ RDS - Only standalone RDS is covered under the free tier. Multi-AZ RDS is chargeable.
2. ELB - ELB usage charges kick in from the time it is activated. AWS will charge you for both the uptime and the data volume passed through the ELB.
3. EC2 - Although one micro EC2 instance comes under the free tier for 750 hrs/month, this setup uses more than one EC2 instance. This means that if the total uptime of all the EC2 instances combined crosses 750 hrs/month, the excess is chargeable.
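The EC2 point is simple arithmetic, sketched below for a 30-day month. The 750-hour allowance is the figure quoted above; the instance counts are examples.

```python
FREE_TIER_HOURS = 750  # monthly allowance quoted above


def billable_hours(instances: int, hours_each: int) -> int:
    """Hours beyond the free-tier allowance across all instances combined."""
    return max(0, instances * hours_each - FREE_TIER_HOURS)


HOURS_IN_MONTH = 24 * 30

print(billable_hours(1, HOURS_IN_MONTH))  # 0
print(billable_hours(2, HOURS_IN_MONTH))  # 690
```

With the minimum of 2 instances recommended for the Auto Scaling Group, the combined uptime is 1,440 hours, so roughly 690 hours fall outside the allowance even before the JumpHost is counted.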

8. Tips please

Since our ASG EC2 instances do not have public IPs, we cannot access them directly from outside the VPC. We can create an EC2 instance (call it JumpHost) with a public IP in one of our public subnets and access the ASG EC2 instances from this server. Ensure that the JumpHost has SG-Internal and SG-admin-ssh attached to it.
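With a reasonably recent OpenSSH client, the hop through the JumpHost can be done in a single command using the -J (jump host) option. The helper below only assembles that command as a string. The addresses are hypothetical, and ec2-user is assumed as the login name (the usual default on Amazon Linux images); adjust both for your instances.

```python
import shlex


def ssh_via_jump(jump_host: str, target_ip: str, user: str = "ec2-user") -> list:
    """Return the ssh command that reaches a private instance via the JumpHost."""
    return ["ssh", "-J", f"{user}@{jump_host}", f"{user}@{target_ip}"]


# Hypothetical JumpHost public IP and private instance IP.
cmd = ssh_via_jump("203.0.113.10", "172.31.48.25")
print(shlex.join(cmd))  # ssh -J ec2-user@203.0.113.10 ec2-user@172.31.48.25
```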