Portfolio: Batch framework

This page describes a proprietary framework and supporting scripts, which I built while employed as a Linux system administrator and Linux architect, that allow batch jobs to be fully defined by a shell script placed in a directory, with supporting functions for logging, status reporting, and failure alerting built in.

Key details
Brief description: A framework to easily extend existing shell scripts to support flexible scheduling, with dependencies, conflict avoidance, centralised logging, and status reporting built in.
Consumer: ERP development team, Linux system administration team
Impact to consumer:
  • Improved observability of scheduled jobs
  • Reduced risk of human error
  • Shorter time from code commit to deployment
  • Improved security posture
Technical features:
  • Automatic delivery of status and logs to a central point as well as to local files
  • Generates metrics and status files, plus a low-level discovery (LLD) file that lets the Zabbix monitoring system detect jobs automatically and raise alerts as appropriate
  • Minimal overhead with no complicated frameworks or dependencies
Technologies used: Bash

When packaging and deploying internal software, there is a division between the "application" side and the "operating system" side. For some internal development teams, particularly the ERP developers working on the core Cobol application, this can make it difficult to manage scheduled jobs - granting the developers direct access to the cron scheduler would put the whole system at risk of accidental compromise, and would be difficult to justify from an internal controls perspective. This means that developers have to co-ordinate with the Linux system administration team to set up or modify scheduled tasks.

Many scheduled jobs relate to the processing of financially relevant information, and so from an internal controls perspective, there is a need to ensure that all jobs have consistent logging and failure alerting mechanisms - so that, for example, if a critical invoice processing job fails, an alert will definitely be raised.

I developed the batch framework to close these gaps without delegating an unsafe amount of responsibility to developers, and without introducing a large, complex, maintenance-heavy ecosystem such as Control-M or UC4.

Once the framework has been set up by the system administration team, new scheduled jobs can be created by placing a shell script in a particular directory - usually via deployment of an OS-native package along with the rest of the internal application. The job's shell script contains the job itself, plus all information necessary for scheduling and alerting. The framework provides supporting functions for the script to call for logging, status reporting, and failure alerting, as sketched below.
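As an illustration, a job script might look something like the following. The helper function names (batch_log, batch_status, batch_fail) and the ERP command are invented for this sketch; they are not the framework's actual API.

```bash
#!/bin/bash
# Illustrative job script - the batch_* helper functions are
# hypothetical stand-ins for the framework's real helpers, which the
# framework makes available before the job body runs.

batch_log "Starting invoice import"

# Run the actual work; on failure, raise an alert through the
# framework and exit non-zero so the run is recorded as failed.
if ! /opt/erp/bin/import-invoices; then
    batch_fail "Invoice import returned an error"
    exit 1
fi

batch_status "OK"
batch_log "Finished invoice import"
```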

Each job is run by the framework rather than directly from cron, so all output is automatically captured to local log files regardless of whether the framework functions are used by the script. A batch status viewer tool I developed separately provides job status, output, the ETA of running jobs, the performance of a job relative to its previous run times, and more, giving IT service management, developers, and other teams greater visibility of these processes without having to grant them access to the servers on which the jobs run.
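The capture mechanism can be pictured as a small wrapper along these lines; the directory layout and file names here are assumptions made for the sketch, not the framework's real paths.

```bash
#!/bin/bash
# Sketch of the runner: cron invokes this wrapper, which runs each job
# script and captures all output to a local log file whether or not
# the job uses the framework's helper functions.
JOB_DIR=/etc/batch/jobs.d   # hypothetical job directory
LOG_DIR=/var/log/batch      # hypothetical log directory

for job in "$JOB_DIR"/*.sh; do
    name=$(basename "$job" .sh)
    log="$LOG_DIR/$name.$(date +%Y%m%d-%H%M%S).log"

    # Capture stdout and stderr unconditionally.
    bash "$job" >"$log" 2>&1
    rc=$?

    # Record the exit status for the status files and metrics.
    echo "$rc" > "$LOG_DIR/$name.status"
done
```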

The framework generates a low-level discovery file for the Zabbix monitoring system so that the template I created for Zabbix can automatically start monitoring any new jobs. Metrics are tracked for job running time, time since last successful run, success or failure of the last run, and so on; and alerts are automatically generated based on the parameters built into each job, reducing the likelihood of failures being missed because operators forgot to set up alerting.
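Zabbix low-level discovery expects a JSON document listing discovered entities as macro/value pairs. A minimal sketch of generating such a file might look like this; the {#JOBNAME} macro name and the paths are assumptions for illustration.

```bash
#!/bin/bash
# Sketch: emit a Zabbix LLD JSON file listing every job, so that a
# Zabbix template can create items and triggers for each one
# automatically.
JOB_DIR=/etc/batch/jobs.d           # hypothetical job directory
LLD_FILE=/var/lib/batch/jobs.lld    # hypothetical LLD output path

{
    printf '{"data":['
    sep=''
    for job in "$JOB_DIR"/*.sh; do
        name=$(basename "$job" .sh)
        printf '%s{"{#JOBNAME}":"%s"}' "$sep" "$name"
        sep=','
    done
    printf ']}\n'
} > "$LLD_FILE"
```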

Scheduling and alerting parameters are set for each job in the comment block at the top of the job's script.
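A header might look something like the following; the parameter names below are invented for illustration and are not the framework's actual keys.

```bash
#!/bin/bash
#% schedule: daily at 02:30    # when the job should run
#% timeout: 2h                 # kill the job and alert if exceeded
#% max-age: 26h                # alert if no success within this window
#% conflicts: invoice-export   # never run alongside this job
#% alert-priority: high        # severity of the resulting alert
```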

When administrators set up the batch framework, they can define multiple job locations, each with their own set of parameters. These parameters include which user to run the jobs as, which central host to transmit logs to, checks to run before starting any of the jobs (such as checking that this is the active node of an active/passive cluster), and commands to prepend to the start of every job (such as setting up Cobol runtime environment variables).
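A per-location configuration could be pictured along these lines; the file path, variable names, and values are all assumptions made for this sketch.

```bash
# /etc/batch/locations/erp.conf (hypothetical)
JOB_DIR=/opt/erp/batch/jobs.d            # where job scripts are dropped
RUN_AS=erpbatch                          # user to run the jobs as
LOG_HOST=batchlog01.example.com          # central host to receive logs
PRE_CHECK=/usr/local/bin/is-active-node  # skip all jobs unless this passes
PROLOGUE=". /opt/erp/cobol-env.sh"       # prepended to every job
```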

Introducing the batch framework to the core ERP system immediately improved the monitoring and alerting of the jobs that were moved over to it. I then started using it even on systems where no delegation of control was required, wherever jobs had more complex needs than a very basic maintenance cron job, and the system admin team are now comfortable using it for any scheduled task that would benefit from automatic monitoring and alerting.