If you ask me, I'm pretty good at troubleshooting. After all, I get lots of practice (hey, I never said I was smart). However, on July 27, 2006, I wrote a bug so simple yet so baffling that I had no hope of finding a solution. I put up this page out of desperation.
Most bugs are easy to track down with a simple application of the Scientific Method:
The above method, with its numerous variants, usually works really well. Sometimes, though, you just can't figure out what's going on. It's extremely irritating to whittle your bug down to a five-line program that looks perfect. You have absolutely no clue what could possibly be wrong in there.
To understand the bug, you need some UNIX background. I assume you know all about named pipes (fifos) and shells. Anyone who knows what's going on in this test program certainly knows more about this stuff than I do. Anyway, here's the broken code (mystery.sh):
#!/bin/sh
FIFO1=/tmp/testfifo1
FIFO2=/tmp/testfifo2
rm -f $FIFO1 $FIFO2
mkfifo $FIFO1 || exit
mkfifo $FIFO2 || exit
nc -v www.mcnabbs.org 80 <$FIFO1 >$FIFO2 &
./httpget.sh <$FIFO2 >$FIFO1
Where httpget.sh is as follows:
#!/bin/sh
echo "GET / HTTP/1.1"
echo "Host: www.mcnabbs.org"
echo
while read line
do
    echo $line >&2
done
The output of mystery.sh should be something like:
www.mcnabbs.org [64.62.190.91] 80 (http) open
HTTP/1.1 200 OK
Date: Fri, 28 Jul 2006 02:51:44 GMT
Server: Apache/2.0.52 (CentOS)
...
Instead, the script just hangs. No output at all. Netcat doesn't even say that it's opened up a TCP connection. I've tested this code on two different Linux distributions and on Mac OS X, and I've tried both Bash and ZSH as the interpreter. The results are identical in all of my tests.
Before you start making up stuff about blocking and deadlocks without thinking things through, look at an example of a similar script that works. Only one line has changed (mystery2.sh):
#!/bin/sh
FIFO1=/tmp/testfifo1
FIFO2=/tmp/testfifo2
rm -f $FIFO1 $FIFO2
mkfifo $FIFO1 || exit
mkfifo $FIFO2 || exit
nc -v www.mcnabbs.org 80 <$FIFO1 >$FIFO2 &
./httpget.sh >$FIFO1 <$FIFO2
What the heck? It works if you redirect standard output first, but it doesn't work if you redirect standard input first?? This is in the shell, before httpget.sh is execed! Everything I know about shells and file descriptors says this shouldn't make any difference.
Something consistent is happening, but I have no clue what it is. I've tried several completely different kernels and completely different shells. I've switched the order of executing Netcat and httpget.sh, and I've tried rewriting httpget.sh in Python. Something important is happening here.
The first person to give me a satisfying explanation of the situation was... Byron Clark. I guess I'm not too surprised. :) Don't read the explanation until you've been thoroughly stumped by the problem. If you cheat you won't appreciate the solution. Anyway, here are Byron's comments:
First, another example, then the explanation:
$ mkfifo foo.fifo
$ strace cat foo.fifo
Note that cat hangs while trying to open foo.fifo. In another shell:
$ echo foo > foo.fifo
Note that cat in the first shell finally succeeds in opening the fifo and catting the contents.
It appears that open(2) for reading or writing on a pipe will block until open(2) is called on the other end of the pipe.
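The same experiment can be scripted without strace. This is only a minimal sketch: the file names `demo.fifo` and `opened` are made up for illustration, and the marker file is just a way to see whether the reader's open has returned yet.

```shell
#!/bin/sh
# Sketch: opening a FIFO for reading blocks until a writer opens it.
dir=$(mktemp -d) || exit 1
fifo=$dir/demo.fifo      # hypothetical name
marker=$dir/opened       # hypothetical name
mkfifo "$fifo" || exit 1

# Background reader: the redirection opens the FIFO. The marker file is
# created only after that open has returned.
( : <"$fifo"; : >"$marker" ) &
reader=$!

sleep 1
if [ -e "$marker" ]; then before=opened; else before=blocked; fi

# Opening the write end lets both opens return.
: >"$fifo"
wait "$reader"
if [ -e "$marker" ]; then after=opened; else after=blocked; fi

echo "before writer: $before"
echo "after writer:  $after"
rm -rf "$dir"
```

With no writer around, the reader should still be stuck after the one-second pause, so the first line reports `blocked` and the second reports `opened`.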
So, here's what happens in the non-working version of mystery.sh and httpget.sh on the webpage you linked to:
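nc -v www.mcnabbs.org 80 <$FIFO1 >$FIFO2 &
The shell opens $FIFO1 for reading first and blocks, because nothing has opened it for writing yet.
./httpget.sh <$FIFO2 >$FIFO1
The shell opens $FIFO2 for reading first and blocks too, because the nc side never got as far as opening it for writing.
Hence, deadlock: each side is waiting for an open that the other side cannot reach.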
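The deadlock depends on the shell performing redirections strictly left to right and not moving on until each open has finished. Here is a small sketch of that ordering using ordinary files instead of FIFOs; the path `missing` and the output names are hypothetical, chosen so that one open fails on purpose.

```shell
#!/bin/sh
# Sketch: redirections are performed left to right, and a failed open
# stops the remaining ones from happening.
dir=$(mktemp -d) || exit 1
missing=$dir/missing     # hypothetical path that is never created

# Input first: opening $missing fails, so out1 is never created.
{ cat <"$missing" >"$dir/out1"; } 2>/dev/null || true
# Output first: out2 is created before the failing open of $missing.
{ cat >"$dir/out2" <"$missing"; } 2>/dev/null || true

if [ -e "$dir/out1" ]; then in_first=created; else in_first=absent; fi
if [ -e "$dir/out2" ]; then out_first=created; else out_first=absent; fi

echo "input first:  out1 $in_first"
echo "output first: out2 $out_first"
rm -rf "$dir"
```

Only `out2` should exist afterwards. With a FIFO the first open does not fail, it just waits, and everything to its right waits with it.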
The version that works:
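nc -v www.mcnabbs.org 80 <$FIFO1 >$FIFO2 &
The shell opens $FIFO1 for reading first and blocks, as before.
./httpget.sh >$FIFO1 <$FIFO2
The shell opens $FIFO1 for writing first, which releases the nc side. That side then opens $FIFO2 for writing while this one opens it for reading, so both get through.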
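Both orderings can be replayed without a network connection. The sketch below is not the original test: `cat` stands in for nc, a one-line `sh -c` script stands in for httpget.sh, and the function name `try_order` and the two-second wait are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: replay both redirection orders locally, with cat standing in for nc.

# try_order ORDER, where ORDER is in_first (<FIFO2 >FIFO1) or
# out_first (>FIFO1 <FIFO2). Sets $outcome to the echoed line or "deadlock".
try_order() {
    dir=$(mktemp -d) || return 1
    fifo1=$dir/fifo1 fifo2=$dir/fifo2 reply=$dir/reply
    mkfifo "$fifo1" "$fifo2" || return 1

    # Stand-in for nc: copy whatever arrives on FIFO1 back out on FIFO2.
    cat <"$fifo1" >"$fifo2" &
    peer=$!

    # Stand-in for httpget.sh: send one line, save the line that comes back.
    client='echo ping; read -r line; echo "$line" >"$0"'
    if [ "$1" = in_first ]; then
        sh -c "$client" "$reply" <"$fifo2" >"$fifo1" &
    else
        sh -c "$client" "$reply" >"$fifo1" <"$fifo2" &
    fi
    pid=$!

    # Two seconds is generous for a local echo; no reply by then means stuck.
    sleep 2
    if [ -s "$reply" ]; then outcome=$(cat "$reply"); else outcome=deadlock; fi
    kill "$pid" "$peer" 2>/dev/null || true
    wait "$pid" "$peer" 2>/dev/null || true
    rm -rf "$dir"
}

try_order in_first;  echo "stdin first:  $outcome"
try_order out_first; echo "stdout first: $outcome"
```

The first call should report `deadlock` and the second should report `ping`, which matches the behavior of mystery.sh and mystery2.sh.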
Here's the relevant text from fifo(4):
The kernel maintains exactly one pipe object for each FIFO special file that is opened by at least one process. The FIFO must be opened on both ends (reading and writing) before data can be passed. Normally, opening the FIFO blocks until the other end is opened also.
A process can open a FIFO in non-blocking mode. In this case, opening for read only will succeed even if no one has opened on the write side yet; opening for write only will fail with ENXIO (no such device or address) unless the other end has already been opened.
Note that the issue isn't opening for reading before opening for writing. The problem is that the first open will block until the open on the other end finishes. So the following is still wrong:
nc -v www.mcnabbs.org 80 >$FIFO2 <$FIFO1 &
./httpget.sh >$FIFO1 <$FIFO2
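The rule boils down to this: the two commands get through only if the first redirection on each side names the same FIFO. Here is a rough sketch of that check. The helper names `first_fifo` and `verdict` are made up, and it assumes each command opens both FIFOs in blocking mode, one for reading and one for writing, on opposite ends from its partner.

```shell
#!/bin/sh
# Sketch: predict the outcome from the first redirection on each side.

# first_fifo 'REDIRS': print the file named by the first redirection word.
first_fifo() {
    set -- $1                    # split the list into words (unquoted on purpose)
    printf '%s\n' "${1#[<>]}"    # drop the leading < or >
}

# verdict 'REDIRS_A' 'REDIRS_B': both sides block in their first open, so
# they can only meet if those first opens are on the same FIFO.
verdict() {
    if [ "$(first_fifo "$1")" = "$(first_fifo "$2")" ]; then
        echo ok
    else
        echo deadlock
    fi
}

verdict '<FIFO1 >FIFO2' '<FIFO2 >FIFO1'    # mystery.sh
verdict '<FIFO1 >FIFO2' '>FIFO1 <FIFO2'    # mystery2.sh
verdict '>FIFO2 <FIFO1' '>FIFO1 <FIFO2'    # the variant above
```

This prints `deadlock`, `ok`, `deadlock`. It only models the ordering of the opens, not what the commands do afterwards.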
Thanks again, Byron.