Troubleshooting Incompatibilities


TL;DR

The AWS Application Load Balancer idle timeout should be lower than the keep-alive timeout on Node.js.

Full story

At Korelogic we’re all about high availability and service reliability, so monitoring is a priority. This article takes you through a real-life investigation using our monitoring tools. We recently started using the real user monitoring service Sentry, which flagged that some customers were receiving HTTP 502 responses.

We then checked the CloudWatch system metrics, which indicated the same issue.

Even though HTTP 502s affected only 0.1% of all requests, it was worth investigating and fixing them.

AWS provides a descriptive guide for troubleshooting these errors: Troubleshoot Application Load Balancer HTTP 502 errors.

After checking the access logs, we found our issue:

The target closed the connection with a TCP RST or a TCP FIN while the load balancer had an outstanding request to the target. The load balancer receives a request and forwards it to the target. The target receives the request and starts to process it, but closes the connection to the load balancer too early. This usually occurs when the duration of the keep-alive timeout for the target is shorter than the idle timeout value of the load balancer. Make sure that the duration of the keep-alive timeout is greater than the idle timeout value.

However, this doesn’t really tell us how to fix our issue.

Korelogic was started more than ten years ago on the principle of JavaScript everywhere, so it comes as no surprise that we use Node.js. In the Node.js world, the default keep-alive timeout is 5 seconds:

HTTP | Node.js v14.21.3 Documentation

The number of milliseconds of inactivity a server needs to wait for additional incoming data, after it has finished writing the last response, before a socket will be destroyed. If the server receives new data before the keep-alive timeout has fired, it will reset the regular inactivity timeout, i.e., server.timeout.
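You can see this default on any plain Node.js HTTP server; a quick sketch:

    const http = require('http');

    const server = http.createServer((req, res) => res.end('ok'));

    // Node's documented default: the server waits 5000 ms for more data
    // on an idle kept-alive socket before destroying it.
    console.log(server.keepAliveTimeout); // 5000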

What didn’t work for us?

We briefly tried decreasing the AWS Application Load Balancer’s idle timeout to 4 seconds, but this resulted in other issues; for example, legitimate requests that took longer than 4 seconds were terminated.
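For reference, the idle timeout is a load balancer attribute, so a change like the one we tried would look roughly like this with the AWS CLI (the ARN is a placeholder):

    aws elbv2 modify-load-balancer-attributes \
      --load-balancer-arn <load-balancer-arn> \
      --attributes Key=idle_timeout.timeout_seconds,Value=4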

This meant that we would have to increase Node’s keep-alive timeout instead. But first, let’s dig in to confirm whether our apps currently have the 5-second default keep-alive or something different.

We can open a TCP connection with 'telnet customer-v2 80'.

If you keep sending requests, the connection stays open, but if you stop for more than 5 seconds, you get 'Connection closed by foreign host', which is expected!
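An illustrative session (the hostname and response are placeholders; type the request by hand, then wait):

    $ telnet customer-v2 80
    Connected to customer-v2.
    Escape character is '^]'.
    GET / HTTP/1.1
    Host: customer-v2

    HTTP/1.1 200 OK
    ...
    (after more than 5 seconds of inactivity)
    Connection closed by foreign host.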

If we examine a packet capture taken with 'tcpdump -w' from the pod, we can see that the server sends a TCP FIN after 5 seconds, as expected!
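A typical capture invocation from inside the pod looks something like this (the interface and port filter are assumptions; '-w' writes the raw packets to a file you can inspect later in Wireshark):

    tcpdump -i any -w keepalive.pcap port 80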

This confirms that our servers currently have the 5-second default timeout.

What else didn’t work?

We were not able to trigger TCP session reuse with 'curl'. It would print 'Connection #0 to foo left intact', but a packet capture showed that the session was terminated. The '--keepalive-time 60' flag didn’t help either, since it controls TCP keepalive probes rather than HTTP connection reuse.

The reason is explained in the curl manual:

The curl command-line tool can, however, only keep connections alive for as long as it runs, so as soon as it exits back to your command line it has to close down all currently open connections
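Connection reuse within a single invocation does work, though: passing the same URL twice makes curl reuse the connection for the second transfer, which the verbose output reports with a 'Re-using existing connection' line (the hostname is a placeholder):

    curl -v http://customer-v2/ http://customer-v2/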

How to increase the keep-alive timeout on Node.js?
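The fix is to raise 'server.keepAliveTimeout' above the ALB’s 60-second idle timeout. A minimal sketch of the change (the exact wiring into our apps differs; the 61-second value matches the measurement below, and bumping 'server.headersTimeout' alongside it is a common precaution, since it must exceed 'keepAliveTimeout' on some Node.js versions):

    const http = require('http');

    // A stand-in handler; in our apps this is the existing request handler.
    const server = http.createServer((req, res) => res.end('ok'));

    // The ALB idle timeout defaults to 60 seconds, so Node must wait
    // longer than that before closing an idle kept-alive socket.
    server.keepAliveTimeout = 61 * 1000;

    // On some Node versions headersTimeout must exceed keepAliveTimeout,
    // or the server can still close the socket too early.
    server.headersTimeout = 65 * 1000;

    server.listen(process.env.PORT || 3000);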

After the deployment, we get a 61-second keep-alive.

The telnet connection now gets terminated with 'Connection closed by foreign host' only after just over 60 seconds, and a packet inspection confirms the same.

We have confirmed that Node.js behaves exactly as the documentation says, and that the fix increased the keep-alive timeout!


It is time for a production release so we can measure the impact.

But before that, let’s consider potential resource increases: keeping connections open for longer means each Node.js process holds idle sockets for longer, which costs some extra file descriptors and memory. This concern is valid, though the overhead is barely noticeable on our infrastructure, so no further changes were needed.

We released this to production and monitored the metrics. The HTTP 502 errors were gone immediately, and we have not noticed any extra resource consumption.

