Sunday, September 20, 2009

Linux: epoll performance and gotchas

This is a graph of a small test I have been running to compare the performance of synchronous vs. asynchronous I/O. The previous post already showed how lighttpd uses the epoll interface. I wanted to do some testing of my own, with my own implementation.
In this test, I'm using a single dual-core machine (E6850) with 4GB memory on a 1333 MHz bus with CL9. The forking server is very crude: it simply forks after accepting a connection. This could certainly be done more effectively by pooling threads and so on, but that is what the socket queue already simulates. The idea is to show the effect of creating new processes to handle new traffic and how this affects overall machine performance.

The forking server shows varying performance. At some point, the socket queue of the process within the kernel is full and the client ends up waiting 3, 9 or 12 seconds to obtain a connection, probably because the kernel stalls connections (SYN flood protection?) or needs to clean up resources before it continues to accept new ones. Actually, I didn't wait for this process to finish. Adding up the time, it ran for about 1 minute and did 5800 connections, so that performance is very poor. Programming-wise though, things are very easy, especially using the standard C available on every decent UNIX machine, since it was just an accept() and a fork(). (The gotcha was that you need to close the connected socket in the parent and waitpid() on child processes, otherwise you keep holding on to resources: leaked descriptors and zombie processes.)
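As a rough illustration, the accept-and-fork loop with both gotchas handled looks something like this. This is a minimal sketch with error handling trimmed and an arbitrary example port, not the exact code used in the test:

    /* Minimal forking server sketch: accept(), fork(), and the two gotchas:
     * close the connected socket in the parent and reap finished children. */
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        int srv = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr;

        memset(&addr, 0, sizeof addr);
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(8080);            /* example port */
        bind(srv, (struct sockaddr *)&addr, sizeof addr);
        listen(srv, 128);

        for (;;) {
            int conn = accept(srv, NULL, NULL);
            if (conn < 0)
                continue;
            if (fork() == 0) {                  /* child: handle this connection */
                close(srv);                     /* the child doesn't need the listener */
                write(conn, "hello\n", 6);
                close(conn);
                _exit(0);
            }
            close(conn);                        /* gotcha 1: parent drops its copy of the fd */
            while (waitpid(-1, NULL, WNOHANG) > 0)
                ;                               /* gotcha 2: reap children, no zombies */
        }
    }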

All processes had their file descriptor limit increased to 65536 (ulimit -n 65536) and were run as root to avoid having to build in a setrlimit() call and other complexities.
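For completeness, the in-process equivalent would be a setrlimit() call along these lines (just a sketch; raising the limit above the hard limit still requires root or the matching capability):

    /* Raise the per-process file descriptor limit from within the program,
     * roughly equivalent to `ulimit -n 65536` in the shell. */
    #include <stdio.h>
    #include <sys/resource.h>

    static int raise_fd_limit(rlim_t n)
    {
        struct rlimit rl = { n, n };            /* soft and hard limit */
        if (setrlimit(RLIMIT_NOFILE, &rl) != 0) {
            perror("setrlimit");
            return -1;
        }
        return 0;
    }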

The socket queue with 8 threads is basically a round-robin, fire-and-forget mechanism. It is implemented with glib and, like the forking version, it isn't difficult to implement. Basically, one thread continuously accepts new connections as fast as it can and hands each socket descriptor to one of the queues in a round-robin way. The threads at the other end consume from their queue whenever a descriptor becomes available and process the socket until it is empty. Since this is all running locally, there is virtually zero latency and we know that each client connects and writes as fast as possible. Since the server doesn't create new threads or sockets, this is actually quite efficient. Problems appear when sockets actually start blocking; then you may need more threads to handle them.
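A stripped-down sketch of the same idea follows. The original uses glib's asynchronous queues, but to keep the example self-contained this version uses plain pthreads, and the full-queue case is ignored:

    /* Round-robin socket queue sketch: one acceptor thread hands accepted
     * descriptors to NWORKERS consumer threads over simple locked queues. */
    #include <pthread.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #define NWORKERS 8
    #define QSIZE    1024

    struct fd_queue {
        int fds[QSIZE];
        int head, tail;                          /* full-queue handling omitted */
        pthread_mutex_t lock;
        pthread_cond_t  ready;
    };

    static struct fd_queue queues[NWORKERS];

    static void queue_push(struct fd_queue *q, int fd)
    {
        pthread_mutex_lock(&q->lock);
        q->fds[q->tail++ % QSIZE] = fd;
        pthread_cond_signal(&q->ready);
        pthread_mutex_unlock(&q->lock);
    }

    static int queue_pop(struct fd_queue *q)
    {
        pthread_mutex_lock(&q->lock);
        while (q->head == q->tail)               /* block until a descriptor arrives */
            pthread_cond_wait(&q->ready, &q->lock);
        int fd = q->fds[q->head++ % QSIZE];
        pthread_mutex_unlock(&q->lock);
        return fd;
    }

    static void *worker(void *arg)
    {
        struct fd_queue *q = arg;
        char buf[512];
        for (;;) {
            int fd = queue_pop(q);
            while (read(fd, buf, sizeof buf) > 0)
                ;                                /* drain the client, then close */
            close(fd);
        }
        return NULL;
    }

    static void run_server(int listen_fd)
    {
        for (int i = 0; i < NWORKERS; i++) {
            pthread_mutex_init(&queues[i].lock, NULL);
            pthread_cond_init(&queues[i].ready, NULL);
            pthread_t t;
            pthread_create(&t, NULL, worker, &queues[i]);
        }
        for (int next = 0;;) {                   /* acceptor: round-robin hand-off */
            int conn = accept(listen_fd, NULL, NULL);
            if (conn >= 0)
                queue_push(&queues[next++ % NWORKERS], conn);
        }
    }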

The epoll() method with 8 threads is the fanciest one and actually skips a couple of other possibilities in between. It strictly uses asynchronous (non-blocking, event-driven) I/O, which is also the most difficult to implement. The advantage is that none of the threads ever block, so they are always looking for something useful to do. In this case, I configured eight worker threads as well. The performance is very similar to the queue version, but in different circumstances the epoll() method will certainly outperform the others, because the queue version blocks while waiting for something to read. An alternative to epoll() is poll() with a self-maintained array of socket descriptors. That array needs maintenance and has to be scanned linearly in each thread, which is O(n) per iteration, so it doesn't scale as well as epoll(), whose readiness notification is O(1); on top of that, the array maintenance has to be coded by hand.
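Roughly, each worker's event loop looks like this. This is a level-triggered sketch, assuming the listening socket and every accepted socket are non-blocking, with only a trivial read-and-close handler shown:

    /* epoll worker loop sketch: one epoll instance per thread, non-blocking
     * sockets; the thread only wakes up when the kernel reports readiness. */
    #include <errno.h>
    #include <fcntl.h>
    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #define MAX_EVENTS 64

    static void set_nonblocking(int fd)
    {
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);
    }

    static void event_loop(int listen_fd)
    {
        int epfd = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
        epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

        struct epoll_event events[MAX_EVENTS];
        char buf[512];

        for (;;) {
            int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
            for (int i = 0; i < n; i++) {
                int fd = events[i].data.fd;
                if (fd == listen_fd) {
                    /* new connection: make it non-blocking and register it */
                    int conn = accept(listen_fd, NULL, NULL);
                    if (conn < 0)
                        continue;
                    set_nonblocking(conn);
                    struct epoll_event cev = { .events = EPOLLIN, .data.fd = conn };
                    epoll_ctl(epfd, EPOLL_CTL_ADD, conn, &cev);
                } else {
                    /* readable client: read what's there, close on EOF or error */
                    ssize_t r = read(fd, buf, sizeof buf);
                    if (r == 0 || (r < 0 && errno != EAGAIN)) {
                        epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL);
                        close(fd);
                    }
                }
            }
        }
    }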

The simple client used in these tests is an absolute psychopath. It attempts to keep up to 125 connections open at the same time (that means a barrage of 125 SYN packets in very quick succession from one thread at the start, and then as many as it can, up to 125, whenever connections get processed). This client also uses asynchronous sockets. The technique is similar to non-blocking connect() calls with select() or poll(), but in this case you feed the descriptors to epoll() and let the kernel figure everything out. The gotcha here is that the data member of the epoll_event structure is not a 4-attribute structure but a union. That means you can set only one of its fields, not two: for example, writing the fd field to store the socket descriptor overwrites the ptr field and vice versa.

Thus, the implementation for the client looks like this (a code sketch follows the list):
  1. open socket
  2. set socket to non-blocking
  3. call connect(); because the socket is non-blocking, this doesn't wait around until the connection completes and almost always returns EINPROGRESS.
  4. store the socket + connection state in a specialized structure and register it with epoll through the data.ptr attribute.
  5. open another socket.
After 125 sockets are open at the same time:
  1. see if any sockets have completed already, write a little bit of data and close the socket.
  2. this may free up a couple of places, after which new SYN's are sent.
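A sketch of that flow; the struct conn type, the constants and the payload are made up for illustration, and this is exactly where the union gotcha bites, since storing the pointer in data.ptr means data.fd can no longer be used:

    /* Client-side sketch: non-blocking connect() + epoll, tracking each
     * connection through epoll_event.data.ptr. Checking SO_ERROR for
     * failed connects is omitted to keep the sketch short. */
    #include <errno.h>
    #include <fcntl.h>
    #include <netinet/in.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #define MAX_IN_FLIGHT 125

    struct conn {
        int fd;
        int connected;
    };

    static int start_connect(int epfd, const struct sockaddr_in *addr)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);

        /* non-blocking connect: almost always returns -1 with EINPROGRESS */
        if (connect(fd, (const struct sockaddr *)addr, sizeof *addr) < 0 &&
            errno != EINPROGRESS) {
            close(fd);
            return -1;
        }

        struct conn *c = calloc(1, sizeof *c);
        c->fd = fd;

        /* gotcha: data is a union, so using data.ptr means data.fd is gone */
        struct epoll_event ev;
        memset(&ev, 0, sizeof ev);
        ev.events = EPOLLOUT;        /* writable == connect finished (or failed) */
        ev.data.ptr = c;
        epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
        return 0;
    }

    static void run_client(int epfd, const struct sockaddr_in *addr)
    {
        int in_flight = 0, done = 0;
        struct epoll_event events[MAX_IN_FLIGHT];

        while (done < 30000) {
            /* keep the barrage going: top up to 125 connects in flight */
            while (in_flight < MAX_IN_FLIGHT) {
                if (start_connect(epfd, addr) == 0)
                    in_flight++;
            }
            int n = epoll_wait(epfd, events, MAX_IN_FLIGHT, -1);
            for (int i = 0; i < n; i++) {
                struct conn *c = events[i].data.ptr;
                write(c->fd, "ping\n", 5);    /* write a little bit of data */
                epoll_ctl(epfd, EPOLL_CTL_DEL, c->fd, NULL);
                close(c->fd);                 /* ...and close, freeing a slot */
                free(c);
                in_flight--;
                done++;
            }
        }
    }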
The quicker the server accepts new sockets, the faster it can go, especially when clients are suffering from a bit of latency. With the epoll and queued-socket servers above, I cycled through 30,000 connection requests in 4-5 seconds. That's about 6,000 processed connections per second, in an environment where both client and server run on the same machine. The problem this load runs into is that the machine runs out of sockets: because of the speed, a lot of sockets sit in the ESTABLISHED state (connections are set up faster than they are processed and torn down :). Sockets in the ESTABLISHED state consume file descriptors, and you can only have so many open before you run out. Modern servers should nevertheless be able to handle 10,000 connections at the same time. Running the client and server on two different computers should give twice the number of sockets available.

The idea of the above is to develop software where the system becomes network-constrained. A lot of software, due to its architecture, doesn't use all of the network's bandwidth, but it probably does use all of the CPU or memory. Threads and processes require stack space to operate, because they call functions and leave other data on the stack for reference. The 90's way of handling traffic was to add threads and processes, which increases the demands on memory and CPU, and context switching and process management become a bigger bottleneck (since that's what the kernel then spends its time on). The Linux kernel now has a very nice scheduling system by default, which also has O(1) performance, so you can have 10,000 processes hanging around quite easily; but at some point you do noticeably see performance degradation, especially once the kernel needs to traverse the process lists every x (here 1) ms. That is why having one process/thread per socket is a bad idea.

Increasing throughput on LANs comes with some more considerations and options for tweaking. Most Linux distributions are pre-configured for internet use, which means their window sizes differ from what you'd choose on a LAN. This is because if the round-trip time is high and you send a large packet that somehow gets lost, the cost of resending that packet is high, so a smaller packet would do better. But if you only send 1 byte per packet, then you're not getting anywhere either. Linux actually seems to optimize that in the kernel now. High-performance servers sometimes need a few more local ports to deal with incoming traffic. This can be modified non-permanently in /proc/sys/net/ipv4/ip_local_port_range.

3 comments:

nick black said...

Interesting post. I've been thinking a lot about how edge-triggered events and threads tie best together. You might find this worth looking into, especially in a few weeks: libtorque

Anonymous said...

Even with epoll, keeping your PROCESSING code small and efficient is the key to performance.

One striking example is G-WAN, which outperforms everything else (ASP.Net, Java, PHP) by several orders of magnitude:

http://gwan/ch/

Yuk! Who believed it was possible to be millions of times faster than IIS?

Gerard Toonstra said...

I had a look at G-WAN. There's not a lot of information about the internal architecture, so we can only speculate how it gets its speed. My guess is that it uses an asynchronous I/O layer at its core with another handling layer on top that allows direct access to internal structures in the scripts.

Not sure if there's a lot of interest in web servers with more speed than lighttpd or cherokee. At some point you gain more cycles and better speeds, but you lose out in development time and clarity of the code. I think lighttpd and those kinds of servers are fast and good enough for general applications, but still offer significant flexibility in what you run on them.

Developing web applications in C script feels like programming your entire web front end in C.... that's only useful for very specific areas of applications. Surely no one would dream of implementing Amazon that way? :)