If you don't know what HTCondor is about, I'll run you through a couple of scenarios where it may be useful even if you're not a data scientist.
- Thousands of `*.wav` files that you want converted to `*.mp3` via `ffmpeg`? Just dispatch a job and Condor will make use of the available hardware at your disposal!
- Renting a flat with friends; everyone has a beefed-up machine with a decent GPU. You tried to mine ETH for a while but gave up because you didn't have the discipline to start/stop the miner and, at night, the noise was too loud. Submit a job on a schedule that will only run during the day. Profit!
The team from UW–Madison and collaborators did impressive work on both the detail and depth of the manual, but I struggled a bit with the scenario where you don't have root privileges (company policy?) but do have access to a multitude of machines.
Can HTCondor leverage multiple machines as non-root? Definitely!
HTCondor 9 (and above)
Historically you'd secure the cluster after install; from version 9 onwards, security is tightened up by default. This was a welcome change, but it made installation more intricate. While the team behind Condor released proper tooling for it, as non-root there is some extra work involved.
HTCondor Daemons
You'll most likely have one machine that manages it all (central), one machine where jobs are submitted from (this can also be the central machine) and a lot of machines where jobs run.
- `MASTER`: runs on all machines, keeps tabs on everything
- `COLLECTOR`: collects status information (ClassAds) from the rest of the pool
- `NEGOTIATOR`: matches queued jobs to available machines
- `SCHEDD`: maintains the job queue; this is where jobs are submitted
- `STARTD`: starts (and runs!) jobs
| MACHINE | MASTER | COLLECTOR | NEGOTIATOR | SCHEDD | STARTD |
|---------|--------|-----------|------------|--------|--------|
| central | x      | x         | x          | x      |        |
| box-1   | x      |           |            |        | x      |
| box-2   | x      |           |            |        | x      |
| box-3   | x      |           |            |        | x      |
In this scenario `central` is the machine that manages the cluster and is also the only machine you can submit jobs from (it runs the schedd). The `box-n` machines are just working beasts; they can range from dedicated machines to idle desktops.
Next Steps:
- Generate SSL certificates
- Download and Install Condor
- Map SSL certificates to your user
- Configure Condor for all machines
- Sync Condor to all machines
Security? SSL to the rescue!
While HTCondor provides a multitude of ways to secure your cluster, we will use SSL. It's proven, fast, and one of the most versatile setups.
I previously wrote about leveraging SSL for company-internal services; for a deeper understanding of what we're doing, or if you need tips on installing OpenSSL, just jump over there.
We'll be setting up a single self-signed SSL certificate that will function as the single source of truth for authentication.
Create a directory called `certs` in your home directory with `mkdir ~/certs`. Now go to that directory and generate the certificate.
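A minimal sketch using OpenSSL; the key size, the ten-year validity and the `/CN=condor` subject are assumptions, adapt them to your setup:

```bash
cd ~/certs
# one self-signed certificate + key shared by the whole cluster
openssl req -x509 -newkey rsa:4096 -nodes \
    -keyout condor.key -out condor.crt \
    -days 3650 -subj "/CN=condor"
chmod 600 condor.key   # keep the private key private
```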
Go ahead and inspect the cert:
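For example:

```bash
# dump the whole certificate
openssl x509 -in ~/certs/condor.crt -noout -text
# or just the Subject, which we'll need for the mapfile below
openssl x509 -in ~/certs/condor.crt -noout -subject
```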
Inside `~/certs`, create a file called `condor_mapfile` with the following line:
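A minimal sketch, assuming the certificate Subject is `/CN=condor` (as generated above) and that it should map to the user `frankie`; the format is `<auth method> <subject regex> <mapped user>`:

```
SSL /CN=condor frankie
```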
HTCondor uses the `Subject` field to map certificates to users. On a multi-user installation, each user would have their own certificate with proper identifiable fields.
Download and Install HTCondor
Our file structure will be something like this:

```
/home/frankie
  certs/                          # ssl files and mapping
    condor.crt
    condor.key
    condor_mapfile
  condor/
    condor-9.0.9/                 # condor version we'll be running
    condor_config                 # main config file for condor
    condor_config_central.local
    condor_config_box-1.local
    condor_config_box-2.local
    condor_config_box-3.local
```
So create a directory called `~/condor`, head over to the HTCondor releases page, choose a version, download it, and unpack it inside `~/condor`.
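Something along these lines; the exact tarball name and URL are illustrative, pick the build matching your OS and architecture from the releases page:

```bash
mkdir -p ~/condor && cd ~/condor
# example only: substitute the tarball you actually chose
wget https://research.cs.wisc.edu/htcondor/tarball/9.0/9.0.9/release/condor-9.0.9-x86_64_CentOS7-stripped.tar.gz
tar -xzf condor-9.0.9-*.tar.gz
mv condor-9.0.9-*stripped condor-9.0.9   # keep the folder name short
```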
Create HTCondor config file
The main config file is `condor_config`. This file is present on all machines and configures much of Condor's expected behaviour. Go ahead and create a `~/condor/condor_config` file with what's below.
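The macro names below are standard HTCondor configuration macros, but treat the whole file as a minimal sketch: the hostname, the paths, and the fairly permissive `ALLOW_*` lists are assumptions for this walkthrough, not a hardened setup.

```
# the machine that manages the pool
CONDOR_HOST = central

# paths, following the file structure above
RELEASE_DIR = /home/frankie/condor/condor-9.0.9
LOCAL_DIR   = $(RELEASE_DIR)/local.$(HOSTNAME)
LOCAL_CONFIG_FILE = /home/frankie/condor/condor_config_$(HOSTNAME).local

# require SSL authentication, encryption and integrity everywhere
SEC_DEFAULT_AUTHENTICATION         = REQUIRED
SEC_DEFAULT_AUTHENTICATION_METHODS = SSL
SEC_DEFAULT_ENCRYPTION             = REQUIRED
SEC_DEFAULT_INTEGRITY              = REQUIRED

# the single self-signed cert acts as CA, server and client certificate
AUTH_SSL_SERVER_CAFILE   = /home/frankie/certs/condor.crt
AUTH_SSL_SERVER_CERTFILE = /home/frankie/certs/condor.crt
AUTH_SSL_SERVER_KEYFILE  = /home/frankie/certs/condor.key
AUTH_SSL_CLIENT_CAFILE   = /home/frankie/certs/condor.crt
AUTH_SSL_CLIENT_CERTFILE = /home/frankie/certs/condor.crt
AUTH_SSL_CLIENT_KEYFILE  = /home/frankie/certs/condor.key

# map the certificate Subject to a user (see the mapfile above)
CERTIFICATE_MAPFILE = /home/frankie/certs/condor_mapfile

# authorization: permissive for this walkthrough, tighten in production
ALLOW_READ          = *
ALLOW_WRITE         = *
ALLOW_DAEMON        = *
ALLOW_NEGOTIATOR    = $(CONDOR_HOST)
ALLOW_ADMINISTRATOR = $(CONDOR_HOST)
```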
Tell the system where HTCondor is
For Condor binaries to work, they must be on your `$PATH`. HTCondor also queries the `CONDOR_CONFIG` environment variable for the configuration location, so let's edit `~/.bashrc` and add those two.
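Assuming the layout above:

```bash
# append to ~/.bashrc
export PATH="$HOME/condor/condor-9.0.9/bin:$HOME/condor/condor-9.0.9/sbin:$PATH"
export CONDOR_CONFIG="$HOME/condor/condor_config"
```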
Reload your bash settings with `source ~/.bashrc`.
Run `condor_master` to see if the cluster starts.
Condor is telling you that it's expecting a file `condor_config_central.local` with config settings specific to this particular host, `central` (you will see either the hostname or the IP of your machine). Let's create such a file.
Notice that a local file can say a lot about a particular machine: you may want to dedicate fewer cores to the cluster from that machine, you may have a different path for Java, and so on. But, as a minimal example, you really just need to specify the running daemons.
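Per the table above, `central` runs everything except the `STARTD`. A minimal sketch:

```
# ~/condor/condor_config_central.local
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD
```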
Now let's run `condor_master` again.
Notice that condor is referencing a folder `local.central` inside `condor-9.0.9` that doesn't exist. If you were using an installation script, this would probably be created automatically for you. As we're not, we'll do it by hand.
Missing folders? Create them!
You can use the simple script below to create the missing folders for all machines.
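Something like the sketch below; the hostnames are this walkthrough's, and `log`, `spool` and `execute` are the directories Condor expects inside each `LOCAL_DIR`:

```bash
#!/usr/bin/env bash
# create the per-host local dirs referenced by LOCAL_DIR in condor_config
for host in central box-1 box-2 box-3; do
    mkdir -p ~/condor/condor-9.0.9/local.${host}/{log,spool,execute}
done
```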
Once again, let's go for `condor_master`.
Condor started! What about the daemons?
As per the table above, HTCondor runs five daemons we care about. Let's query `condor_q` to see if the scheduler is answering.
Now let's try `condor_status` to get some info on the cluster pool.
Condor working. Let's multi-machine it!
Remember that condor failed to start because it was missing a local config file for the `central` machine? We must add a configuration file for every single machine. My `box-n` machines will function as workers, so they only need the `MASTER` and `STARTD` daemons.
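A minimal sketch; the same content works for every worker:

```
# ~/condor/condor_config_box-1.local (likewise for box-2 and box-3)
DAEMON_LIST = MASTER, STARTD
```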
Sync file structure to all available machines
The directories `~/certs` and `~/condor` must be present on all cluster machines. The first folder holds the self-signed certificate we need for authentication, the second holds the HTCondor binaries.
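A sketch using `rsync` over SSH, assuming the same username and home layout on every box:

```bash
#!/usr/bin/env bash
# push certificates, binaries and configs to every worker
for host in box-1 box-2 box-3; do
    rsync -az ~/certs ~/condor "${host}:~/"
done
```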
Now you'll want to start condor on all nodes. While Condor has specific tools for restarting, you'll probably be stopping and starting a lot while the cluster is still not functional, so this script may help. It calls `pkill` to wipe all condor processes and `condor_master` to start them up again.
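A rough sketch; it assumes passwordless SSH to every node and the paths used throughout this post:

```bash
#!/usr/bin/env bash
# brute-force restart: kill anything condor, then bring the master back up
for host in central box-1 box-2 box-3; do
    ssh "${host}" 'pkill -u "$USER" condor; sleep 2;
        export CONDOR_CONFIG="$HOME/condor/condor_config";
        "$HOME/condor/condor-9.0.9/sbin/condor_master"'
done
```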
Is it running yet?
Let's start it (or restart it) cluster-wide and check. Let's see how many machines are available with `condor_status`:
```
frankie@box-1:~/condor$ condor_status
Name          OpSys  Arch   State     Activity  LoadAv  Mem  ActvtyTime

slot1@box-1   LINUX  X86_64 Unclaimed Idle       0.000 1966  0+00:00:00
slot2@box-1   LINUX  X86_64 Unclaimed Idle       0.000 1966  0+00:00:28
slot3@box-1   LINUX  X86_64 Unclaimed Idle       0.000 1966  0+00:00:28
slot4@box-1   LINUX  X86_64 Unclaimed Idle       0.000 1966  0+00:00:28
slot1@box-2   LINUX  X86_64 Unclaimed Idle       0.000  922  0+00:00:00
slot2@box-2   LINUX  X86_64 Unclaimed Idle       0.000  922  0+00:00:32
slot3@box-2   LINUX  X86_64 Unclaimed Idle       0.000  922  0+00:00:32
slot4@box-2   LINUX  X86_64 Unclaimed Idle       0.000  922  0+00:00:32
slot1@box-3   LINUX  X86_64 Unclaimed Idle       0.000  974  0+00:00:00
slot2@box-3   LINUX  X86_64 Unclaimed Idle       0.000  974  0+00:00:26
slot3@box-3   LINUX  X86_64 Unclaimed Idle       0.000  974  0+00:00:26
slot4@box-3   LINUX  X86_64 Unclaimed Idle       0.000  974  0+00:00:26

               Total Owner Claimed Unclaimed Matched Preempting Backfill  Drain
  X86_64/LINUX    12     0       0        12       0          0        0      0
         Total    12     0       0        12       0          0        0      0
```
Success!
This is exactly what we were aiming for. We now have HTCondor running as a non-root user, on a multitude of machines, secured by SSL.
Now you should probably try to submit your first job. There are great examples online, but if you need help, just comment below or send me an email.
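If you want a head start, here's a minimal sketch of a submit file (the file names are placeholders):

```
# hello.sub: run /bin/echo twice, once per queued process
executable = /bin/echo
arguments  = "hello from job $(Cluster).$(Process)"
output     = hello.$(Process).out
error      = hello.$(Process).err
log        = hello.log
queue 2
```

Submit it with `condor_submit hello.sub` and watch the queue with `condor_q`.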
What if you hit an error?
One thing I find particularly helpful is to run commands with the `-debug` flag.

```bash
condor_q -debug
condor_status -debug
```
Glancing over the configuration macros in the manual may be useful, and increasing log detail by putting `ALL_DEBUG = D_SECURITY` or `ALL_DEBUG = D_ALL` in the config files may also help!
As a last resort, if all else fails, you have the HTCondor mailing lists, where incredibly talented people like Todd Tannenbaum or John Knoeller may be able to help.
Specific variables per node on the job?
HTCondor runs jobs with an isolated set of environment variables. That means the machine's environment variables will not be accessible to jobs. But sometimes you need them.
Say you have a `data` folder that's going to be used by your job. You're running on 3 different boxes, Windows, LinuxSmall and LinuxLarge, and, on each machine, the path varies. How can you export this to condor?
Pass the value as a `ClassAd`. Just edit the `condor_config_$(hostname).local` config file, add the value, and publish it via `STARTD_ATTRS`:
```
# pass data folder as a ClassAd (quotes make it a ClassAd string)
DATA_FOLDER_PATH = "/mnt/volume1/data"
STARTD_ATTRS = $(STARTD_ATTRS) DATA_FOLDER_PATH
```
And then, in your job submit file, you can either pass it as an argument or add it to the environment:

```
# pass the data_folder as an argument
arguments = "--dataFolder=$$(DATA_FOLDER_PATH)"

# or add the data folder to the environment
environment = "DATA_FOLDER_PATH=$$(DATA_FOLDER_PATH)"
```
There's also the case where you may run jobs on machines where `DATA_FOLDER_PATH` was not specified. To cover such cases, you may add a default value:
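HTCondor's `$$(attribute:default)` submit syntax covers this; the fallback path below is just an example:

```
# fall back to /tmp/data when the machine doesn't advertise DATA_FOLDER_PATH
arguments = "--dataFolder=$$(DATA_FOLDER_PATH:/tmp/data)"
```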
I run a very small cluster of 15 machines connected by a TP-Link TL-SG1024S switch. For the price, you can't get better. With a metal enclosure, it's sturdy; being fanless, it's silent. Pushed to the max it gets warm, not burning hot like some other gear I've had before.