September 2007 Archives

RHEL 5 intermittent segfaults

For the past couple of months, on 12 servers, I have been seeing intermittent segmentation faults happening with the ssh, scp, and ntpstat commands. Those servers that weren't brand new had not exhibited that behaviour with RHEL 4 in the past, it was only when Red Hat Enterprise Linux 5 was installed that it began.

2 additional servers running RHEL 5 were not showing the same fault, but they weren't of the same type - all affected servers were IBM xSeries or System X with multiple processors and various model numbers, and all had ServeRAID cards.

I couldn't find any mention of such a fault anywhere except for on a CentOS bug tracker, bug ID 0002241.

After a few tests, it turned out that ntpstat would fail to run about 10 times in every 50000, or 0.02% of the time. Each failure, according to strace, was not actually with the program itself but with the attempt to run it - the execve call, which causes the program to be executed, was failing with an EINVAL error code, indicating some sort of problem to do with the ELF interpreter.

The only thing I could think of that would modify that sort of thing, and which would be nullified by the "replace RPMs with new ones, then replace new RPMs with old ones again" fix that the reporter described in the CentOS bug report, was prelink.

So, I turned it off, by editing /etc/sysconfig/prelink and by running prelink -au. Immediately after doing that, ntpstat worked 100% of the time instead of 99.98%.

I'm presuming that something to do with prelink's address space randomisation was breaking stuff on the servers I'm using, but I am not in a position to test that or to try to find a proper fix, so for now it remains disabled.

In summary then, if you're having weird random segmentation faults and you're sure it's not a fault with your RAM (having tried Memtest86 and Lucifer to check), then run:
prelink -au
sed -i s/^PRELINKING=yes/PRELINKING=no/ /etc/sysconfig/prelink
...and see if your problems disappear.



Update: I now have the results of testing with different parameters to prelink:

Options to prelinkTest resultsSuccess
-au50000/50000100.00%
-amR49988/5000099.98%
-aR49986/5000099.97%
-am50000/50000100.00%

Each test run had prelink -au run after it followed by another test to make sure success went back to 100%.

Basically, the -R option to prelink seems to be the one that's cocking everything up.



Update: Kernel 2.6.18-53.1.4.el5 appears to fix this problem.