09 November 2008

9:44 PM Term::VT102 0.91
It's been a while, but now there's a new version of Term::VT102. A few people have contacted me about the module over the past few weeks, and then Jörg Walter sent a patch to fix Unicode handling, which resurrected my interest in clearing a few of the TODOs from the list.

So, I cleaned it up a bit and extended the example scripts enough that I could effectively use Term::VT102 as a terminal emulator, and ran things like top and mutt within it to see how it handled. As a result I've fixed a few bugs in escape sequence handling and line wrapping as well as adding TAB stop support and callbacks for title changes and other private message strings.

There is also now an example script to show scrollback buffer processing for things like converting script logs or screen history into a flat file you can read with less without all the cursor positioning stuff getting in the way.


25 June 2008

11:47 PM Server move and upgrade
Recently I moved this web server's services from London to Dallas, which meant building a new installation pretty much from scratch. So instead of being based on a very creaky initial base of Red Hat 7.3, customised and running under UML, it's all now running on CentOS 5 under Xen.

Last night I upgraded the virtual hosts to CentOS 5.2, which went reasonably smoothly, so tonight I went ahead and upgraded the "real" host as well. That didn't go so well. On rebooting, everything came back up, but I couldn't route to any of the virtual hosts any more.

It seems that the updated version of Xen had modified some scripts which meant I ended up with two bridge devices - my old one, virbr0, containing all of my virtual hosts and an alias for the real host, and a new one, xenbr0, containing a renamed version of the raw Ethernet device plus one more interface I've blotted from my memory. For some reason this caused all of the iptables DNAT rules to fail to work. SNAT / masquerading for outbound connections worked fine, but inbound data would only go in; the responses wouldn't go back out.

Anyway, if you are trying to get Xen working again after upgrading and are seeing mysterious DNAT failures, try applying these two patches:

--- /etc/xen/scripts/network-bridge.rpmnew 2008-06-21 23:09:32.000000000 +0100
+++ /etc/xen/scripts/network-bridge 2008-05-20 21:14:32.000000000 +0100
@@ -110,8 +110,7 @@
ip addr show dev ${src} | egrep '^ *inet ' | sed -e "
s/inet/ip addr add/
s@\([0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+/[0-9]\+\)@\1@
-s/${src}/dev ${dst} label ${dst}/
-s/secondary//
+s/${src}/dev ${dst}/
" | sh -e
# Remove automatic routes on destination device
ip route list | sed -ne "

--- /etc/xen/scripts/xen-network-common.sh.rpmnew 2008-06-21 23:09:32.000000000 +0100
+++ /etc/xen/scripts/xen-network-common.sh 2008-05-20 21:14:32.000000000 +0100
@@ -120,12 +120,7 @@
ip link set ${bridge} arp off
ip link set ${bridge} multicast off
fi
-
- # A small MTU disables IPv6 (and therefore IPv6 addrconf).
- mtu=$(ip link show ${bridge} | sed -n 's/.* mtu \([0-9]\+\).*/\1/p')
- ip link set ${bridge} mtu 68
ip link set ${bridge} up
- ip link set ${bridge} mtu ${mtu:-1500}
}

# Usage: add_to_bridge bridge dev

I've not looked into why it works; the above is just a reversion to the scripts as they were before upgrading to xen-3.0.3-64.el5_2.1, and it works for me, so I'm happy.


06 March 2008

7:30 AM PV 1.1.4
I've finally got around to releasing version 1.1.4 of PV. Elias Pipping and Patrick Collison have been sending patches to improve compilation on Mac OS X, and there are a couple of minor cleanups: left-over IPC resources are cleaned up on termination thanks to Laszlo Ersek, and if you supply a non-numeric argument to an option that needs a number, you now get an error thanks to Boris Lohner.

Incidentally I did finish that toy garage in time, just forgot to update here. The lift needs a bit of work still - some sort of ratchet is needed, as it just slips at the moment. But the rest of it is in active service.


04 September 2007

7:30 AM RHEL 5 intermittent segfaults
For the past couple of months, on 12 servers, I have been seeing intermittent segmentation faults with the ssh, scp, and ntpstat commands. Those servers that weren't brand new had not exhibited that behaviour under RHEL 4; it only began once Red Hat Enterprise Linux 5 was installed.

Two additional servers running RHEL 5 were not showing the same fault, but they weren't of the same type: all affected servers were IBM xSeries or System x with multiple processors and various model numbers, and all had ServeRAID cards.

I couldn't find any mention of such a fault anywhere except for on a CentOS bug tracker, bug ID 0002241.

After a few tests, it turned out that ntpstat would fail to run about 10 times in every 50000, or 0.02% of the time. Each failure, according to strace, was not actually with the program itself but with the attempt to run it - the execve call, which causes the program to be executed, was failing with an EINVAL error code, indicating some sort of problem to do with the ELF interpreter.
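
A failure rate like this can be measured with a simple loop. This is only a minimal sketch, not necessarily the exact test used here: the function names count_failures and failure_percent are invented for illustration, and it assumes the command normally exits with status 0, so any non-zero exit (including a failed execve or a segfault) counts as a failure.

```shell
#!/bin/sh
# Run a command repeatedly and count how often it exits non-zero.
# count_failures and failure_percent are illustrative names, not part of any tool.
count_failures() {
    runs=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Discard the command's output; only the exit status matters here.
        "$@" >/dev/null 2>&1 || failures=$((failures + 1))
        i=$((i + 1))
    done
    echo "$failures"
}

# Express "failures out of runs" as a percentage with two decimal places.
failure_percent() {
    awk -v f="$1" -v n="$2" 'BEGIN { printf "%.2f\n", 100 * f / n }'
}
```

For example, count_failures 50000 ntpstat prints the number of failed runs, and failure_percent 10 50000 turns the figures quoted above into a percentage.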

The only thing I could think of that would modify that sort of thing, and which would be nullified by the "replace RPMs with new ones, then replace new RPMs with old ones again" fix that the reporter described in the CentOS bug report, was prelink.

So, I turned it off, by editing /etc/sysconfig/prelink and by running prelink -au. Immediately after doing that, ntpstat worked 100% of the time instead of 99.98%.

I'm presuming that something to do with prelink's address space randomisation was breaking stuff on the servers I'm using, but I am not in a position to test that or to try to find a proper fix, so for now it remains disabled.

In summary then, if you're having weird random segmentation faults and you're sure it's not a fault with your RAM (having tried Memtest86 and Lucifer to check), then run:
prelink -au
sed -i 's/^PRELINKING=yes/PRELINKING=no/' /etc/sysconfig/prelink
...and see if your problems disappear.



Update: I now have the results of testing with different parameters to prelink:

Options to prelink   Test results   Success
-au                  50000/50000    100.00%
-amR                 49988/50000    99.98%
-aR                  49986/50000    99.97%
-am                  50000/50000    100.00%

Each test run had prelink -au run after it followed by another test to make sure success went back to 100%.

Basically, the -R option to prelink seems to be the one that's cocking everything up.



Update: Kernel 2.6.18-53.1.4.el5 appears to fix this problem.


30 August 2007

7:30 AM PV 1.1.0
Version 1.1.0 of PV has been released. This release incorporates some fixes for Mac OS X, a couple of packaging cleanups, a dramatic improvement in the resource usage of the --rate-limit (-L) option, and two new features.

The first new feature, --line-mode (-l), was a Debian wishlist request. This causes PV to count lines instead of bytes. While it's not something I have ever particularly wanted myself, it does sound like it might come in handy occasionally (and, more importantly, it didn't require much to be added to the code to make it work).

The second was one that I have occasionally found myself wanting, particularly during long network data transfers. The --remote (-R) option allows the settings of an already-running PV to be altered. This can be used to change the rate limit while a transfer is in progress, for example, or set PV's idea of the total size of all data to something different.


28 August 2007

7:30 AM QSF 1.2.7
Version 1.2.7 of QSF has been released. Like the recent PV release, this was prompted by inclusion in the Fedora Project and the resultant need to change the license to Artistic 2.0.

QSF's development is, again like PV, moving from SourceForge to Google Code.


07 August 2007

7:49 PM PV 1.0.1
Version 1.0.1 of PV has been released. This is a code cleanup release, prompted by the discovery that PV has been included in the Fedora Project - version 0.9.9 is available now in FC7 and as an "extra" package in FC6.

It can be interesting to go back to old code and see how the style has changed over time. With a fresh perspective, a few oddities were more obvious, so the occasional untidy section was rewritten and a few more comment blocks were added. The organisation of the functions was changed a bit so that the "command-line program" part is now distinct from the "PV functionality" part, which means if I decide in future to create a library to add progress indicators to other command-line programs it will be significantly easier. Not that it's likely, but it seemed to make things neater.


27 June 2007

7:30 AM Packages in use
A command line to find which RPM packages are in use by the system at this very moment. This can be useful if you are in the process of determining which packages to remove from a system that has a lot of unnecessary software installed, but you're also running nonstandard software such as the Sun JRE so you can't be sure that RPM's dependency tracking is enough.

To do this, we look at all libraries currently mapped in place by all running processes, as well as the file each process is executing, and then look at which RPMs those files belong to.
awk '{print $NF}' /proc/*/maps \
| sort - <(for A in /proc/*/exe; do readlink "$A"; done) \
| uniq \
| grep / \
| while read -r FILE; do rpm -q --queryformat='%{NAME}\n' -f "$FILE" 2>/dev/null; done \
| grep -v 'is not owned' \
| sort \
| uniq

You can omit the --queryformat='%{NAME}\n' part if you want the RPM version numbers to be included, in case you have multiple versions of some packages installed.

This has only been tested on Red Hat Enterprise Linux 5. If you don't have readlink, you could try ls -l $A | awk '{print $NF}' instead.

Note that this will only catch dependencies that are memory mapped, such as system libraries. It won't catch files that are only read occasionally or which aren't memory mapped.


09 March 2007

7:30 AM Rewriting root
Another root filesystem recovery HOWTO.

Recently I had a hard disk develop major faults such that the root filesystem went read-only. Although rebooting caused it to come back up fine, a SMART check (smartctl -t long /dev/hda) showed that it was failing, so I requested that the server operators replace the hard disk (it's a leased server not under my direct control).

Since they could not fit both the old and new disks in the server at once, they copied the contents of the failing disk to a /backup directory on the new disk, on which they had installed a fresh copy of Fedora Core 4.

Rather than try to upgrade the new installation to Fedora Core 6 and get everything configured back to the way it was, I opted to swap the files in /backup/ with the current /, i.e. replace the whole new root filesystem with the old one to get everything back as it was before the change. I also had to do this live over the network, since I have no physical access to the server.

If you try to do "mv /bin /OLD/bin; mv /backup/bin /bin" you will run into major problems - after the first mv, the mv command won't work because it has moved to a different directory, and if you work around that, you will still eventually run into nasty problems related to /lib.

Instead, I used mount --bind / /backup/mnt to make the root filesystem visible under /backup so that I could then do chroot /backup (making sure /backup/dev/ was populated with device files first). From inside the chroot, the "real" root filesystem was visible under /mnt, so I could replace /mnt/{bin,boot,etc,lib,...} with the old copies (i.e. mv /mnt/bin /mnt/old-bin; cp -a /bin /mnt/bin etc).
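
The per-directory swap can be previewed as a dry run before anything is touched. The sketch below only prints the mv and cp -a commands for each directory, as they would be run from inside the chroot; plan_swap is a made-up helper name, and the directory list is whatever you pass in.

```shell
#!/bin/sh
# Print (but do not run) the commands that swap top-level directories from
# inside the chroot, where the real root filesystem is bind-mounted on /mnt.
# plan_swap is an illustrative helper; review its output before running any of it.
plan_swap() {
    for dir in "$@"; do
        # Move the new installation's directory aside...
        echo "mv /mnt/$dir /mnt/old-$dir"
        # ...then copy the old directory into its place, preserving attributes.
        echo "cp -a /$dir /mnt/$dir"
    done
}
```

Running plan_swap bin boot etc lib lists eight commands, two per directory, which can be checked by eye and then pasted in one pair at a time.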

The only thing that went wrong with this approach was that rebooting failed, so I had to ask the server operators to restart the machine manually. I think this was probably due to a problem with the shutdown sequence.

If you have problems with files / resources being "busy", make sure you have stopped all non-essential services - i.e. anything other than the network interfaces and SSH - before you start the move. Also, note you won't be able to move /dev, /proc, /sys, or any other directories that are mount points or which have mount points under them (though you could investigate mount --move).
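
To see in advance which top-level directories cannot simply be moved, the mount table can be boiled down to a short list. This is a rough sketch assuming a table in /proc/mounts format (mount point in the second field); busy_top_dirs is an invented name, and it takes an optional file argument so it can be tried on a saved copy.

```shell
#!/bin/sh
# List the top-level directories that are mount points or contain one,
# reading a table in /proc/mounts format (mount point is the second field).
# busy_top_dirs is an illustrative name; pass another file to test it.
busy_top_dirs() {
    # Skip the root entry itself, keep only the first path component,
    # and remove duplicates such as /dev and /dev/pts.
    awk '$2 != "/" { split($2, part, "/"); print "/" part[2] }' "${1:-/proc/mounts}" \
    | sort -u
}
```

With /proc, /sys and /dev/pts mounted, this prints /dev, /proc and /sys, which are the directories to leave alone.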


31 January 2007

7:30 AM Resize a live root FS - a HOWTO
It is possible, though difficult, to resize a Linux root partition while it's still mounted. What's more, it can be done remotely, without having to be at the console. You'll need 2GB of RAM, but here is how:
  1. Stop all services other than the network and SSH, and stop SELinux interfering:
    # telinit 2
    # for SERVICE in \
    `chkconfig --list | grep 2:on | awk '{print $1}' | grep -v -e sshd -e network -e rawdevices`; \
    do service $SERVICE stop; done
    # service nfs stop
    # service rpcidmapd stop
    # setenforce 0

  2. Unmount all filesystems:
    # umount -a

  3. Create a temporary filesystem:
    # mkdir /tmp/tmproot
    # mount none /tmp/tmproot -t tmpfs
    # mkdir /tmp/tmproot/{proc,sys,usr,var,oldroot}
    # cp -ax /{bin,etc,mnt,sbin,lib} /tmp/tmproot/
    # cp -ax /usr/{bin,sbin,lib} /tmp/tmproot/usr/
    # cp -ax /var/{account,empty,lib,local,lock,nis,opt,preserve,run,spool,tmp,yp} /tmp/tmproot/var/
    # cp -a /dev /tmp/tmproot/dev
    Note that this used up about 1.6GB of ramdisk on my Red Hat Enterprise Linux (AS) 4 server.

    Also note that on 64-bit systems you will need to copy /lib64 and /usr/lib64 as well, otherwise you will see errors like "lib64/ld-linux-x86-64.so.2: bad ELF interpreter: No such file or directory".

  4. Switch the filesystem root to the temporary filesystem:
    # pivot_root /tmp/tmproot/ /tmp/tmproot/oldroot
    # mount none /proc -t proc
    # mount none /sys -t sysfs (this may fail on 2.4 systems)
    # mount none /dev/pts -t devpts

  5. Restart the SSH daemon to close the old pty devices:
    # service sshd restart
    You should now try to make a new connection. If that succeeds, close your old one to release the old pty device. If it fails, get the SSH daemon properly restarted before proceeding.

  6. Close everything that's still using the old filesystem:
    # umount /oldroot/proc
    # umount /oldroot/dev/pts
    # umount /oldroot/selinux
    # umount /oldroot/sys
    # umount /oldroot/var/lib/nfs/rpc_pipefs
    Now try to find other things that are still holding on to the old filesystem, particularly /dev:
    # fuser -vm /oldroot/dev
    Common processes that will need killing:
    # killall udevd
    # killall gconfd-2
    # killall mingetty
    # killall minilogd
    Finally, you will need to re-execute init:
    # telinit u

  7. Unmount the old filesystem:
    # umount -l /oldroot/dev
    # umount /oldroot
    Note that we use the umount -l ("lazy") option, available only with kernels 2.4.11 and later, because /oldroot is actually mounted using an entry in /oldroot/dev, so it would be difficult if not impossible to unmount either of them otherwise.

  8. Now resize the root filesystem:
    # e2fsck -C 0 -f /dev/VolGroup00/LogVol00
    # resize2fs -p -f /dev/VolGroup00/LogVol00 8G
    # lvresize /dev/VolGroup00/LogVol00 -L 8G
    # resize2fs -p -f /dev/VolGroup00/LogVol00
    # e2fsck -C 0 -f /dev/VolGroup00/LogVol00
    In this example the root partition is /dev/VolGroup00/LogVol00 and it is being shrunk to 8GB. You don't necessarily have to run resize2fs twice; I just do so in case my idea of the size differs from what lvresize thinks.

  9. We're done, so start putting everything back:
    # mount /dev/VolGroup00/LogVol00 /oldroot
    # pivot_root /oldroot /oldroot/tmp/tmproot
    # umount /tmp/tmproot/proc
    # mount none /proc -t proc
    # cp -ax /tmp/tmproot/dev/* /dev/
    # mount /dev/pts
    # mount /sys
    # killall mingetty
    # telinit u
    # service sshd restart
    Now make a new SSH connection, and if it works, close the old one. Note that sshd may still be running in the temporary filesystem at this point because of the way the service scripts work - check this with fuser, and if this is the case, kill the oldest sshd process and then do service sshd start. Then log in again and disconnect all other connections.

    Final steps to unmount the temporary filesystem:
    # umount -l /tmp/tmproot/dev/pts
    # umount -l /tmp/tmproot
    # rmdir /tmp/tmproot
    Now to re-mount our original filesystems and start services back up:
    # mount -a
    # umount /sys
    # mount /sys
    # for SERVICE in \
    `chkconfig --list | grep 2:on | awk '{print $1}' | grep -v -e sshd -e network -e rawdevices`; \
    do service $SERVICE start; done
    # telinit 3
    Replace 3 with your preferred runlevel. You may also want to start SELinux up again with setenforce.

The above has only been tested on RHEL AS 4, but something like it should work on most Linux variants that have pivot_root, tmpfs, and umount -l, so long as you can replace the chkconfig and service parts with whatever is appropriate for your distribution.
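
Before starting, it may be worth comparing how much data step 3 will copy into the tmpfs against how much RAM the machine has. The following is a minimal sketch under my own assumptions: mem_total_kb and copy_size_kb are invented helper names, and the figures are only a rough guide since a tmpfs may be limited to less than the total RAM.

```shell
#!/bin/sh
# Total RAM in kB, read from a file in /proc/meminfo format.
# mem_total_kb and copy_size_kb are illustrative helpers, not standard tools.
mem_total_kb() {
    awk '$1 == "MemTotal:" { print $2 }' "${1:-/proc/meminfo}"
}

# Combined size in kB of the given directories, staying on one filesystem
# for each (like cp -ax does). Paths that cannot be read are ignored.
copy_size_kb() {
    du -skx "$@" 2>/dev/null | awk '{ sum += $1 } END { print sum + 0 }'
}
```

For example, compare the output of copy_size_kb /bin /etc /mnt /sbin /lib /usr/bin /usr/sbin /usr/lib /dev (plus the /var subdirectories from step 3) with the output of mem_total_kb.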



Update: Lucas Chan says, for CentOS 4.4, "I was not able to login after restarting sshd in step 5 until I did this: mount none /dev/pts -t devpts".



Update: Simetrical suggests that 64-bit systems also need to copy /lib64 and /usr/lib64, and that after pivot_root 2.6 kernels will also need mount none /sys -t sysfs and mount none /dev/pts -t devpts. (The above steps have been modified accordingly).


30 January 2007

7:30 AM Don't release often
Until I released QSF 1.2.5 the other day, I'd forgotten one of the reasons I don't subscribe to the motto "release early, release often" - it's a pain in the arse. SourceForge really don't make it easy to release projects with multiple files, and they've also managed to hose the Compile Farm again so I can't produce anything other than Fedora Core 6 i386 binaries.

The other reason I don't follow The Great Prophet ESR is because I dislike wasting my time. Far too often I have looked for a package to do a particular job and ended up installing seven different half-finished pieces of complete garbage, none of which work, or have installed something that works one week and then, on upgrading to fix its many bugs, fails the next. Since the general philosophy of "be excellent to each other" implies that one should not do to others what you don't like having done to you, I would rather at least attempt to run some tests on my code before dumping it on the Internet, so as to not waste the time of users with buggy releases. Whether those tests work or not is another matter, but at least I give it a shot.

Then there's the co-operative development aspect. None of the projects I have ever worked on have particularly attracted other developers. I've had the odd bit of feedback, even the occasional patch, and have a couple of people managing Debian ports for some of my projects, for which I'm grateful - but I'm the only person making major changes to the codebase. I'm happy with this - it suits my temperament - but since "release often" is geared towards getting feedback and development, without developers or many users (and project mailing lists infested with tumbleweeds), it's a bit pointless in my case.


21 January 2007

8:32 PM QSF 1.2.5 released
QSF version 1.2.5 has been released. This version fixes a bug in the new list backend which causes tokens to slowly be randomly deleted on update. This can include the special token that keeps track of token aging, so databases may grow uncontrollably.

Although version 1.2.5 fixes this bug it cannot restore the lost data, so unless you rebuild your databases by retraining them from scratch they will continue to be inaccurate until the newly accumulated data starts outweighing the old.

I had wondered why certain users' databases were getting so large, but just assumed that it was due to the massive volume of email those users' accounts were processing.

These graphs show training and classification times using QSF 1.2.5 with various different backends:


The "training" graph shows how much CPU time it takes to build a new database from a set of emails, displaying CPU time versus the number of emails in the training set. The "classification" graph shows the CPU time it takes to decide whether a certain number of messages are spam or not.

As you can see, the list backend seems to be the quickest, so it's a shame it had this great big bug in it. There are further optimisations still to do - in particular, deleting multiple tokens at once (such as during a pruning cycle) is very inefficient - but they will have to wait, as it isn't critical and I'm more than busy enough.


17 November 2006

1:21 PM Tip when using yum to upgrade
Today I attempted to upgrade a PC from Fedora Core 5 to Fedora Core 6, using YUM.

I started off with something like what this blog post describes, and it all seemed to be running fine, so I left it for a while.

When I came back, it seemed that the machine had just got bored and stopped part-way through the installation of the packages. It had installed about half of them, and not done any of the "cleanup" phase yet, when it just aborted (with no error message) and dropped back to the shell. I'm guessing it ran out of memory (that machine has no swap, though it has 512MB of RAM), but there were no logs to confirm that.

So, I switched on swap before continuing.

I tried a second yum update, but it failed with conflicts and missing dependencies. This turned out to be because lots of Fedora Core 6 packages had been installed, but the corresponding Fedora Core 5 packages had not been uninstalled (since yum died before the cleanup phase).

Here's the quick-and-dirty command line I used to find the duplicates:

rpm -qa --queryformat='%{NAME}\t%{INSTALLTID}\t%{VERSION}-%{RELEASE}\n' \
| sort \
| awk 'BEGIN{prev=""} {if ($1 == prev) { print prevline; } prev=$1; prevline=$0}' \
| awk '{print $1 "-" $3}' \
| grep -v -e ^kernel -e ^gpg-pubkey-

This will look for any packages that have multiple versions installed, not including anything starting with "kernel" or "gpg-pubkey" (since they normally have more than one version / key installed). For each duplicate package, all but the most recently installed version will be listed.

Running rpm -e --repackage --nodeps on the package names output by the above took a while, but afterwards, yum update worked properly again.
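
To check what the duplicate-finding pipeline does without touching a real RPM database, its later stages can be wrapped in a function and fed sample data. This is just a sketch: older_duplicates is an invented name, and it reads the same name, install ID, and version-release lines (tab-separated) that the rpm -qa query produces.

```shell
#!/bin/sh
# The duplicate-finding stages of the pipeline, wrapped in a function that
# reads "name<TAB>install-id<TAB>version-release" lines on standard input.
# older_duplicates is an illustrative name.
older_duplicates() {
    # After sorting, entries for the same package are adjacent; whenever the
    # name repeats, the previous (earlier-sorted) line is printed, so the last
    # entry for each name is never listed.
    sort \
    | awk 'BEGIN{prev=""} {if ($1 == prev) { print prevline; } prev=$1; prevline=$0}' \
    | awk '{print $1 "-" $3}' \
    | grep -v -e ^kernel -e ^gpg-pubkey-
}
```

Feeding it two lines for a package "foo" (versions 1.0-1 and 2.0-1, with the second installed later) prints only foo-1.0-1. Note that sort compares the lines as text, so this relies on the install IDs having the same number of digits.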


27 February 2006

10:45 PM PV 0.9.6
PV 0.9.6 has been released, incorporating some minor bugfixes. A build problem on Cygwin was fixed, NLS was made static on systems without msgfmt, the -i option can now properly understand decimal numbers, and a cosmetic fix: when progress reaches 100%, the ETA is blanked instead of showing "ETA: 0:00:00".


06 February 2006

10:31 PM JMBA 0.5.5
Some minor feature enhancements to JMBA. New SENDER and RECIPIENT tags are available in the template, and the ORIGINAL tag includes slightly more of the original message. Also, files in the queue directory starting with "." are now ignored.

These updates were sponsored by SpamDefy.


03 February 2006

8:30 AM QSF 1.1.6
QSF version 1.1.6 has been released. It fixes a few tokenisation bugs (URLs at the start of messages, and nested message attachments) and improves the build process a little.


24 January 2006

2:16 PM postprox 0.2.0
Version 0.2.0 of postprox has been released. This adds much better logging, and allows filter scripts to see the remote host IP address, HELO, envelope sender, and first recipient.
