Memory access violation
COBOL error at 000000 in [[program]]
Using strace, I found that the runtime interpreter was segfaulting immediately after loading the program, before even trying to run any COBOL code at all. The fix involved the kernel's noexec=off boot parameter.
In smb.conf:

vfs objects = full_audit
full_audit:priority = INFO
full_audit:facility = LOCAL1
full_audit:failure = none
full_audit:success = mkdir rename unlink rmdir pwrite
full_audit:prefix = %u|%U|%I|%m|%S
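With the priority and facility set as above, the audit records arrive via syslog as local1.info messages. As a sketch, a matching /etc/syslog.conf line might look like this (the log file path is my assumption, not from the original setup):

```
# route Samba full_audit records (facility local1, priority info) to their own file
local1.info                                     /var/log/samba_audit.log
```

Remember to restart syslogd after editing the file so it picks up the new rule.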
Adjust /etc/syslog.conf accordingly, and you have an audit trail.

The change log is:
- --average-rate (Henry Gebhardt)
- --help (Sebastian Kayser)
- pv_display
- "-O" for non-GCC (Kjetil Torgrim Homme)
- DESTDIR / suffix (Sam Nelson, Daniel Pape)
The main user-visible changes are the new -a / --average-rate option from Henry Gebhardt, which gives a much more sensible display for "bursty" traffic, and a consistent, documented exit status so it's easier for scripts to tell when there has been an error.
I ran top and mutt within it to see how it handled them. As a result I've fixed a few bugs in escape sequence handling and line wrapping, as well as adding TAB stop support and callbacks for title changes and other private message strings.

This means you can turn script logs or screen history into a flat file you can read with less, without all the cursor positioning stuff getting in the way.
scp and ntpstat commands. Those servers that weren't brand new had not exhibited that behaviour with RHEL 4 in the past; it was only when Red Hat Enterprise Linux 5 was installed that it began.

ntpstat would fail to run about 10 times in every 50000, or 0.02% of the time. Each failure, according to strace, was not actually with the program itself but with the attempt to run it - the execve call, which causes the program to be executed, was failing with an EINVAL error code, indicating some sort of problem to do with the ELF interpreter.

I disabled prelink, by editing /etc/sysconfig/prelink and by running prelink -au. Immediately after doing that, ntpstat worked 100% of the time instead of 99.98%.

My guess is that prelink's address space randomisation was breaking stuff on the servers I'm using, but I am not in a position to test that or to try to find a proper fix, so for now it remains disabled. If you are seeing similar problems, try disabling prelinking, run prelink -au ... and see if your problems disappear:

sed -i s/^PRELINKING=yes/PRELINKING=no/ /etc/sysconfig/prelink
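If you want to see what that sed invocation does before pointing it at the real config, here is the same substitution run on a throwaway copy (the /tmp path is just for the demo; sed -i is the GNU form, which is what the RHEL systems discussed here have):

```shell
# create a scratch file containing the line prelink's config uses
printf 'PRELINKING=yes\n' > /tmp/prelink-demo.conf
# same substitution as above, applied in place
sed -i 's/^PRELINKING=yes/PRELINKING=no/' /tmp/prelink-demo.conf
cat /tmp/prelink-demo.conf
# → PRELINKING=no
```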
Here are my test results for ntpstat with different options to prelink:

Options to prelink | Test results | Success
---|---|---
-au | 50000/50000 | 100.00%
-amR | 49988/50000 | 99.98%
-aR | 49986/50000 | 99.97%
-am | 50000/50000 | 100.00%
Each test with prelinking re-enabled had prelink -au run after it, followed by another test to make sure success went back to 100%. So the -R option to prelink seems to be the one that's cocking everything up.

This release includes a fix to the --rate-limit (-L) option, and two new features.

--line-mode (-l) was a Debian wishlist request.
This causes PV to count lines instead of bytes. While it's not something I have ever particularly wanted myself, it does sound like it might come in handy occasionally (and, more importantly, it didn't require much to be added to the code to make it work).

The --remote (-R) option allows the settings of an already-running PV to be altered. This can be used to change the rate limit while a transfer is in progress, for example, or set PV's idea of the total size of all data to something different.
awk '{print $NF}' /proc/*/maps \
| sort - <(for A in /proc/*/exe; do readlink $A; done) \
| uniq \
| grep / \
| while read FILE; do rpm -q --queryformat='%{NAME}\n' -f $FILE 2>/dev/null; done \
| grep -v 'is not owned' \
| sort \
| uniq
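The slightly unusual sort - <(...) step merges the pipeline's standard input with the output of the process substitution (a bash feature) before de-duplicating. Here is a toy version of that merge, with made-up data standing in for the /proc listings:

```shell
# stdin supplies one listing, the process substitution supplies another;
# sort merges them and uniq drops the duplicate entry (needs bash for <(...))
printf 'b\na\n' | sort - <(printf 'c\na\n') | uniq
# → a, b, c (one per line)
```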
Remove the --queryformat='%{NAME}\n' part if you want the RPM version numbers to be included, in case you have multiple versions of some packages installed. If you don't have readlink, you could try ls -l $A | awk '{print $NF}' instead.

A test (hdparm -t long /dev/hda) showed that it was failing, so I requested that the server operators replace the hard disk (it's a leased server not under my direct control).

The old disk's contents ended up in a /backup directory on the new disk, which they had installed a fresh copy of Fedora Core 4 on to. I then had to swap /backup/ with the current /, i.e. replace the whole new root filesystem with the old one to get everything back as it was before the change. I also had to do this live over the network, since I have no physical access to the server.

If you think you can just do "mv /bin /OLD/bin; mv /backup/bin /bin" you will run into major problems - after the first mv, the mv command won't work because it has moved to a different directory, and if you work around that, you will still eventually run into nasty problems related to /lib.
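You can reproduce that first failure harmlessly in a scratch directory (all paths here are invented for the demo):

```shell
# fresh scratch area with its own copy of mv, used via a restricted PATH
rm -rf /tmp/mvdemo
mkdir -p /tmp/mvdemo/bin /tmp/mvdemo/OLD
cp "$(command -v mv)" /tmp/mvdemo/bin/
(
  PATH=/tmp/mvdemo/bin
  mv /tmp/mvdemo/bin /tmp/mvdemo/OLD/bin   # works: this mv is already running
  mv /tmp/mvdemo/OLD/bin /tmp/mvdemo/bin   # fails: no mv left on the PATH
) || echo "second mv failed, as described above"
```

The second mv dies with "command not found", which is exactly the trap described above; the /lib problem is nastier still, because once the libraries are gone no new program can start at all.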
Instead, I used mount --bind / /backup/mnt to make the root filesystem visible under /backup, so that I could then do chroot /backup (making sure /backup/dev/ was populated with device files first). I was then able to work, under /backup, on /mnt/{bin,boot,etc,lib,...} to replace the "real" root filesystem's files (i.e. mv /mnt/bin /mnt/old-bin; cp -a /bin /mnt/bin, etc).

This doesn't work for /dev, /proc, /sys, or any other directories that are mount points or which have mount points under them (though you could investigate mount --move).
# telinit 2
# for SERVICE in \
`chkconfig --list | grep 2:on | awk '{print $1}' | grep -v -e sshd -e network -e rawdevices`; \
do service $SERVICE stop; done
# service nfs stop
# service rpcidmapd stop
# setenforce 0
# umount -a
# mkdir /tmp/tmproot
# mount none /tmp/tmproot -t tmpfs
# mkdir /tmp/tmproot/{proc,sys,usr,var,oldroot}
# cp -ax /{bin,etc,mnt,sbin,lib} /tmp/tmproot/
# cp -ax /usr/{bin,sbin,lib} /tmp/tmproot/usr/
# cp -ax /var/{account,empty,lib,local,lock,nis,opt,preserve,run,spool,tmp,yp} /tmp/tmproot/var/
# cp -a /dev /tmp/tmproot/dev

Note that this used up about 1.6GB of ramdisk on my Red Hat Enterprise Linux (AS) 4 server.
On 64-bit systems, copy /lib64 and /usr/lib64 as well, otherwise you will see errors like "lib64/ld-linux-x86-64.so.2: bad ELF interpreter: No such file or directory".

# pivot_root /tmp/tmproot/ /tmp/tmproot/oldroot
# mount none /proc -t proc
# mount none /sys -t sysfs (this may fail on 2.4 systems)
# mount none /dev/pts -t devpts
# service sshd restart

You should now try to make a new connection. If that succeeds, close your old one to release the old pty device. If it fails, get the SSH daemon properly restarted before proceeding.
# umount /oldroot/proc
# umount /oldroot/dev/pts
# umount /oldroot/selinux
# umount /oldroot/sys
# umount /oldroot/var/lib/nfs/rpc_pipefs

Now try to find other things that are still holding on to the old filesystem, particularly /dev:

# fuser -vm /oldroot/dev

Common processes that will need killing:

# killall udevd
# killall gconfd-2
# killall mingetty
# killall minilogd

Finally, you will need to re-execute init:

# telinit u
# umount -l /oldroot/dev
# umount /oldroot

Note that we use the umount -l ("lazy") option, available only with kernels 2.4.11 and later, because /oldroot is actually mounted using an entry in /oldroot/dev, so it would be difficult if not impossible to unmount either of them otherwise.

# e2fsck -C 0 -f /dev/VolGroup00/LogVol00
# resize2fs -p -f /dev/VolGroup00/LogVol00 8G
# lvresize /dev/VolGroup00/LogVol00 -L 8G
# resize2fs -p -f /dev/VolGroup00/LogVol00
# e2fsck -C 0 -f /dev/VolGroup00/LogVol00

In this example the root partition is /dev/VolGroup00/LogVol00 and it is being shrunk to 8GB. You don't necessarily have to run resize2fs twice; I just do in case my idea of the size differs from what lvresize thinks.

# mount /dev/VolGroup00/LogVol00 /oldroot
# pivot_root /oldroot /oldroot/tmp/tmproot
# umount /tmp/tmproot/proc
# mount none /proc -t proc
# cp -ax /tmp/tmproot/dev/* /dev/
# mount /dev/pts
# mount /sys
# killall mingetty
# telinit u
# service sshd restart

Now make a new SSH connection, and if it works, close the old one. Note that sshd may still be running in the temporary filesystem at this point because of the way the service scripts work - check this with fuser, and if this is the case, kill the oldest sshd process and then do service sshd start. Then log in again and disconnect all other connections.

# umount -l /tmp/tmproot/dev/pts

Now to re-mount our original filesystems and start services back up:
# umount -l /tmp/tmproot
# rmdir /tmp/tmproot
# mount -a
# umount /sys
# mount /sys
# for SERVICE in \
`chkconfig --list | grep 2:on | awk '{print $1}' | grep -v -e sshd -e network -e rawdevices`; \
do service $SERVICE start; done
# telinit 3

Replace 3 with your preferred runlevel. You may also want to start SELinux up again with setenforce.
This procedure should work with any distribution that has pivot_root, tmpfs, and umount -l, so long as you can replace the chkconfig and service parts with whatever is appropriate for your distribution.

One reader reported that they could not restart sshd in step 5 until they did this: "mount none /dev/pts -t devpts". Another pointed out that 64-bit systems will also need /lib64 and /usr/lib64, and that after pivot_root, 2.6 kernels will also need mount none /sys -t sysfs and mount none /dev/pts -t devpts. (The above steps have been modified accordingly.)
I started off with something like what this blog post describes, and it all seemed to be running fine, so I left it for a while.
When I came back, it seemed that the machine had just got bored and stopped part-way through the installation of the packages. It had installed about half of them, and not done any of the "cleanup" phase yet, when it just aborted (with no error message) and dropped back to the shell. I'm guessing it ran out of memory (that machine has no swap, though it has 512MB of RAM), but there were no logs to confirm that.
So, I switched on swap before continuing.
I tried a second yum update, but it failed with conflicts and missing dependencies. This turned out to be because lots of Fedora Core 6 packages had been installed, but the corresponding Fedora Core 5 packages had not been uninstalled (since yum died before the cleanup phase).
Here's the quick-and-dirty command line I used to find the duplicates:
rpm -qa --queryformat='%{NAME}\t%{INSTALLTID}\t%{VERSION}-%{RELEASE}\n' \
| sort \
| awk 'BEGIN{prev=""} {if ($1 == prev) { print prevline; } prev=$1; prevline=$0}' \
| awk '{print $1 "-" $3}' \
| grep -v -e ^kernel -e ^gpg-pubkey-
This will look for any packages that have multiple versions installed, not including anything starting with "kernel" or "gpg-pubkey" (since they normally have more than one version / key installed). For each duplicate package, all but the most recently installed version will be listed.
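To illustrate what the awk stages do, here is the same duplicate-detection logic run over a few made-up rpm -qa lines (the package names, transaction IDs, and versions are all invented for the demo):

```shell
# "foo" has two installed versions; after sorting, the awk one-liner prints
# every line whose name matches the line before it, i.e. all but the newest
printf 'foo\t100\t1.0-1\nfoo\t200\t2.0-1\nbar\t150\t1.1-2\n' \
| sort \
| awk 'BEGIN{prev=""} {if ($1 == prev) { print prevline; } prev=$1; prevline=$0}' \
| awk '{print $1 "-" $3}'
# → foo-1.0-1
```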
Running rpm -e --repackage --nodeps
on the package names output by the above took a while, but afterwards, yum update
worked properly again.