Hidden Costs of select() Without Threads

Journal started Jul 4, 2004

Today, I will be talking about the function select(). Never has such a boneheaded idea been more vital to programming efficiently. Select bothers me on so many levels that I have never successfully programmed with it without throwing my hands up and giving up on adding yet more bloat to my code. Select forces you to consolidate all your file descriptors into a single mysterious data structure, and then it tells you essentially one thing: how many sockets somewhere in there are ready for reading, writing, or in error.

THEN, you have to loop through all your file descriptors once more, checking each and every single one to see if it happened to be the one, the descriptor that select is talking about. Once you find it you have the dubious pleasure of extracting the available data. Here's where it gets even worse: under the select() paradigm you cannot block; that is, you cannot sit in a read waiting for data that has not arrived yet, even if you need it.
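For the record, here is roughly what that dance looks like in C. This is only a minimal sketch: the helper name read_first_ready and its parameters are made up for illustration, and it services only the first ready descriptor, where a real loop would have to handle every one that is set.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <unistd.h>

/* Hypothetical helper: block until one of fds[0..n-1] is readable, read
 * from the first ready one into buf, and report which descriptor it was
 * through *which. Returns the byte count from read(), or -1 on error. */
ssize_t read_first_ready(const int *fds, int n, char *buf, size_t len, int *which)
{
    fd_set readset;
    int maxfd = -1;

    /* Step 1: consolidate every descriptor into the fd_set. */
    FD_ZERO(&readset);
    for (int i = 0; i < n; i++) {
        assert(fds[i] >= 0 && fds[i] < FD_SETSIZE); /* fd_set's hard limit */
        FD_SET(fds[i], &readset);
        if (fds[i] > maxfd)
            maxfd = fds[i];
    }

    /* Step 2: wait. The return value is only a count of ready descriptors. */
    if (select(maxfd + 1, &readset, NULL, NULL, NULL) < 0)
        return -1;

    /* Step 3: loop over everything again to find out which one it meant. */
    for (int i = 0; i < n; i++) {
        if (FD_ISSET(fds[i], &readset)) {
            *which = fds[i];
            return read(fds[i], buf, len); /* may be only part of a message */
        }
    }
    return -1; /* not reached when select() reported at least one ready fd */
}
```

Note that select() overwrites the fd_set, so the set has to be rebuilt from the descriptor list before every call, which is why the helper does it each time.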

    It's like if I started
    say, lovely weather we're having lately
    talking in fragments o
    This example of Corriscerus abiliomari is approaching
    lovely indeed with grassy hills illumin
    f sentences.
    ated by the filtered sunlight.

I'm sure you can reconstruct that sentence, and so can a computer, but is it really worth it? What our minds do is form a construct of each entire sentence in memory. For a computer to do so would be wasteful, however, especially if the connection is unending or large amounts of data are transferred. And worse, the data is not transferred in meaningful chunks. Take a look at the third line especially: you can see that the word 'of' got chopped in half.

What is the computer supposed to do with that meaningless 'o'? Such half-received data must all be stored in some form of buffer or pending queue, or something equally stupid, because programs should not have to deal with data that they aren't built to understand. What the heck do you mean by 'o'? Well, it won't be clear until the next select call for that connection, when the 'f' will be returned.
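To make the "buffer or pending queue" concrete, here is a minimal sketch of the bookkeeping every connection ends up needing. It assumes messages are newline-terminated, which is my assumption for the example and not something select() cares about; the names linebuf, linebuf_push and linebuf_pop are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define LINEBUF_MAX 512

/* One of these has to exist per connection, just to park fragments. */
struct linebuf {
    char   data[LINEBUF_MAX];
    size_t used;
};

/* Append whatever fragment the last read returned.
 * Returns 0 on success, -1 if the fragment does not fit. */
int linebuf_push(struct linebuf *lb, const char *frag, size_t len)
{
    assert(lb->used <= LINEBUF_MAX);
    if (len > LINEBUF_MAX - lb->used)
        return -1;
    memcpy(lb->data + lb->used, frag, len);
    lb->used += len;
    return 0;
}

/* If a complete line is buffered, copy it (without the newline) into out,
 * NUL-terminate it, drop it from the buffer and return 1. Returns 0 if no
 * complete line has arrived yet, -1 if out is too small to hold it. */
int linebuf_pop(struct linebuf *lb, char *out, size_t outsz)
{
    char *nl = memchr(lb->data, '\n', lb->used);
    if (nl == NULL)
        return 0; /* still holding a meaningless 'o' */

    size_t linelen = (size_t)(nl - lb->data);
    if (linelen + 1 > outsz)
        return -1;
    memcpy(out, lb->data, linelen);
    out[linelen] = '\0';

    /* Slide any bytes after the newline down to the front of the buffer. */
    size_t consumed = linelen + 1;
    memmove(lb->data, lb->data + consumed, lb->used - consumed);
    lb->used -= consumed;
    return 1;
}
```

Pushing "talking in fragments o" leaves linebuf_pop with nothing to return; only after the next read delivers "f sentences." and a newline does the full sentence come out.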

So to summarize my case, using select() requires the program to search through a list of sockets multiple times, keep track of data that doesn't make sense, and cut itself off in the middle of an operation several times over the life of a connection. Each function for handling the data must be designed not based on the logic of a continuous connection, but on the logic of inconsistent data coming in on a line.

Wait, you might ask, why can't the select call just return the file descriptor that is ready? Well, that's because... I have no idea. The people who designed it enjoyed making us do detective work, that's why. Whether you store your sockets in a static array or a linked list, or a hash table, or a closure, you still have to determine yourself which socket is the ready one, and indeed more than one might be ready with data on the line.

Wait, you might ask, why can't select just call a function when data is ready on a per socket basis? Well you can write that version of select, but remember that the function called cannot be connection based. It must be able to handle any data from any connection, and manually separate that data based on the socket that is ready for reading/writing. Though the data per connection will be in order, the order the connections send in each of their data will be random and unpredictable.

Alright then, one function per connection! It only receives data from that connection, in order, and only does stuff based on that connection. You'll probably use a switch statement or hash mapping in the function handling all the data coming from all the sockets to determine which function to use for a piece of data, after you've found the socket that select is referring to, or rather the multiple sockets possible: you're going to have to loop through all available sockets doing this with each one per loop of the main select statement.

Confused yet? I sure am.

Wouldn't it be great if we could design one function call per connection? Not only would it not get cut off in the middle of an atomic operation, it could even be designed to flow from start of connection to end of connection, disregarding all other connections, new, old, pending or closing. To do this you could either use an elaborate set of setjmp/longjmp calls, where every 'read' and 'write' saves state information and jumps into the main select loop, along with an equally confusing array of system states per socket to decide which state select should return to (that is, after you've figured out which socket select is talking about).

...or you could drop select and use threads.

Threads are very efficient--a thread pool is one of the quickest ways to program, even outperforming select() in most cases, even on a single processor machine. At least 'tis so in Linux.

Efficient or no though, threads are the way to go. If you have to deal with hundreds of connections, well, you probably can't use select() anyway. Its fd_set structure has a fixed capacity, FD_SETSIZE, and on some operating systems that is as low as 256 sockets at a time, or even fewer; even where it is 1024, it is a hard ceiling. You could use a list of fd_sets, but that's another list on top of the list of file descriptors you have to keep. With threading, you can forget about the file descriptor, and it requires no lists.

What about joining? Just to let you know, detach is your friend. How to tell when threads are finished? Semaphores.

I'm talking POSIX semaphores btw, SysV semaphores are just plain scary.

Check down near "pthread_detach"
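A minimal sketch of the pattern, not the original listing, might look like this in C. The names slots, serve, worker, pool_init, spawn_worker and pool_drain are all hypothetical, and serve() just writes one byte as a stand-in for real per-connection work.

```c
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>
#include <stdint.h>
#include <unistd.h>

#define MAX_WORKERS 5

static sem_t slots; /* counts how many more workers may be started */

/* Stand-in for real per-connection work: write a single byte to fd. */
static void serve(int fd)
{
    ssize_t n = write(fd, "x", 1);
    (void)n;
}

static void *worker(void *arg)
{
    serve((int)(intptr_t)arg);
    sem_post(&slots); /* the only way this thread reports that it is done */
    return NULL;
}

/* Call once before spawning anything. Returns 0 on success. */
int pool_init(void)
{
    return sem_init(&slots, 0, MAX_WORKERS);
}

/* Start a detached worker for fd, blocking while MAX_WORKERS are busy. */
int spawn_worker(int fd)
{
    pthread_t tid;

    assert(fd >= 0);
    sem_wait(&slots); /* blocks here once 5 workers are running */
    if (pthread_create(&tid, NULL, worker, (void *)(intptr_t)fd) != 0) {
        sem_post(&slots); /* give the slot back on failure */
        return -1;
    }
    pthread_detach(tid); /* never joined; the thread cleans up after itself */
    return 0;
}

/* Wait until every started worker has posted by taking all the slots. */
void pool_drain(void)
{
    for (int i = 0; i < MAX_WORKERS; i++)
        sem_wait(&slots);
}
```

A server would call pool_init() once and then spawn_worker(fd) after each accept(). The sketch ignores sem_wait() being interrupted by a signal; a sturdier version would retry on EINTR.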

Pretty cool, huh? After initializing the thread, I didn't have to deal with it, or its data, at all. That one semaphore is the way all threads let the main thread know when they're done. In addition, since it starts at a value of 5, there will be a limit of 5 threads initialized before the main thread blocks on sem_wait. When a thread exits, its sem_post wakes the sem_wait, then another thread gets initialized, so you always maintain a queue of 5. Unlike pthread_join, where you might get hung up on a thread that refuses to finish, the semaphore system will make a new thread any time /any/ of the threads using that semaphore exit.

If you don't want to set a limit on the number of worker threads uh... I dunno. I always like to set a limit, even if it's really high.

ANYWAY. So using the above technique, you can create light, efficient per-connection worker threads, increasing the limit on workers as necessary for your application. In theory the extra code you have to write to use the select() call is much greater overhead than the behind-the-scenes stuff threads do. It's not quite a thread pool, but it is very fast and efficient. Just start a new thread every time a connection is made, and use semaphores to keep track of the threads you have running so you don't overload your machine.

I made some benchmarks, if you're interested in how I came to the realization that threads actually outperform select() due to their ease of flow control and lack of needed extra logic and buffering.
