If you don't know what HTCondor is about, I'll run you through a couple of scenarios where it may be useful even if you're not a data scientist.
- Thousands of *.wav files that you want converted to *.mp3 via ffmpeg? Just dispatch a job and Condor will make use of the available hardware at your disposal!
- Renting a flat with friends; everyone has a beefed up machine with a decent GPU. You tried to mine ETH for a while but gave up because you didn't have the discipline to start/stop the miner and, at night, the noise was too loud. Submit a job on a schedule that will only run during the day. Profit!
The team from UW–Madison and collaborators did impressive work both on the detail and depth of the manual, but I struggled a bit with the scenario where you don't have root privilege (company policy?) but have access to a multitude of machines.
Can HTCondor leverage multiple machines as non-root? Definitely!
HTCondor 9 (and above)
Historically you'd secure the cluster after installing it. From version 9 onwards, security is tightened up by default. This was a welcome change, but the installation became more intricate. The team behind Condor released proper tools for it, but as non-root there is some extra work.
You'll most likely have one machine that manages it all (central), one machine where jobs are submitted (can also be the central) and a lot of machines where jobs run.
MASTER - runs on all machines, keeps tabs on everything
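COLLECTOR - collects information about the pool (machines, daemons, jobs)
NEGOTIATOR - matches queued jobs with available machines
SCHEDD - holds the job queue; this is where jobs are submitted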
STARTD - starts (and runs!) jobs
In this scenario, central is the machine that manages the cluster and is also the only machine you can submit jobs from (it runs the schedd). The box-n machines are just working beasts. They can range from dedicated machines to idle desktops.
- Generate SSL certificates
- Download and Install Condor
- Map SSL certificates to your user
- Configure Condor for all machines
- Sync Condor to all machines
Security? SSL to the rescue!
While HTCondor provides many ways to secure your cluster, we will use SSL. It's proven, fast, and one of the most versatile setups.
I previously wrote about how you can leverage SSL for company-internal services. For a deeper understanding of what we're doing, or if you need tips on installing OpenSSL, just jump over there.
We'll be setting up a single self-signed SSL certificate that will function as the sole source of truth for authentication.
Create a directory called certs in your home directory: mkdir ~/certs. Now go to that directory and run the following command:
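A command along these lines would do it. This is a minimal sketch: the file names condor.crt and condor.key match the layout used in this post, but the key size, the validity period and the subject /CN=condor are my assumptions, so adjust them to your needs.

```shell
# Directory that holds the certificate and key (override CERT_DIR to use another location)
CERT_DIR="${CERT_DIR:-$HOME/certs}"
mkdir -p "$CERT_DIR"

# One self-signed certificate plus an unencrypted private key.
# -nodes leaves the key without a passphrase so the daemons can read it unattended.
openssl req -x509 -newkey rsa:4096 -sha256 -days 365 -nodes \
  -keyout "$CERT_DIR/condor.key" \
  -out "$CERT_DIR/condor.crt" \
  -subj "/CN=condor"

# The key is a secret: keep it readable by your user only
chmod 600 "$CERT_DIR/condor.key"
```

Whatever you put in -subj is what you will later have to match when mapping the certificate to a user.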
Go ahead and inspect the cert:
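One way to do that is with openssl x509. The small helper below is only a sketch (inspect_cert is a name I made up); it prints the subject, the issuer and the validity dates. For a self-signed certificate the subject and the issuer are the same.

```shell
# Hypothetical helper: print the interesting fields of a certificate
inspect_cert() {
  cert="$1"
  if [ ! -f "$cert" ]; then
    echo "certificate not found: $cert" >&2
    return 1
  fi
  openssl x509 -in "$cert" -noout -subject -issuer -dates
}

# Inspect the certificate from this post, if it exists
inspect_cert "$HOME/certs/condor.crt" || true
```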
In ~/certs, create a file called condor_mapfile with the following line:
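As a rough sketch, a map file entry has three fields: the authentication method, a pattern matched against the authenticated name, and the user it maps to. The snippet below assumes the certificate subject is /CN=condor and maps it to whoever runs the command; both are assumptions, so use the subject your own certificate shows.

```shell
CERT_DIR="${CERT_DIR:-$HOME/certs}"
mkdir -p "$CERT_DIR"

# Unquoted heredoc on purpose: $(id -un) expands to the current user name.
# "/CN=condor" is an assumed subject; replace it with the one in your certificate.
cat > "$CERT_DIR/condor_mapfile" <<EOF
SSL "/CN=condor" $(id -un)
EOF

cat "$CERT_DIR/condor_mapfile"
```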
HTCondor uses the Subject field to map certificates to users. On a multi-user installation, each user would have their own certificate with properly identifiable fields.
Download and Install HTCondor
Our file structure will be something like this:
/home/frankie
    # ssl files and mapping
    /certs/condor.crt
    /certs/condor.key
    /certs/condor_mapfile
    # condor version we'll be running
    /condor/condor-9.0.9
    # config file for condor
    /condor_config
    /condor_config_central.local
    /condor_config_box-1.local
    /condor_config_box-2.local
    /condor_config_box-3.local
So create a directory called ~/condor, head over to the HTCondor releases page, choose a version, download it and unzip it inside that directory.
Create HTCondor config file
The main config file is condor_config. This file is present on all machines and configures much of Condor's expected behaviour. Go ahead and create a ~/condor/condor_config file with what's below.
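The sketch below writes a starting point for that file. Treat it as an untested outline rather than a known-good configuration: the per-host LOCAL_DIR layout, the CERTS helper macro and the choice to reuse the single self-signed certificate as CA, server and client certificate are all my assumptions, and authorization settings (the ALLOW_* family) are left out entirely. Check every macro against the manual for your version.

```shell
CONDOR_DIR="${CONDOR_DIR:-$HOME/condor}"
mkdir -p "$CONDOR_DIR"

# Quoted heredoc: the shell leaves $(...) and $ENV(...) alone so HTCondor expands them itself
cat > "$CONDOR_DIR/condor_config" <<'EOF'
# Where the unpacked release lives
RELEASE_DIR = $ENV(HOME)/condor/condor-9.0.9
# Per-host working directory (assumed layout)
LOCAL_DIR = $(RELEASE_DIR)/local.$(HOSTNAME)
# Per-host settings, one file per machine
LOCAL_CONFIG_FILE = $ENV(HOME)/condor/condor_config_$(HOSTNAME).local

# The machine that manages the pool
CONDOR_HOST = central

# Require authentication and only accept SSL
SEC_DEFAULT_AUTHENTICATION = REQUIRED
SEC_DEFAULT_AUTHENTICATION_METHODS = SSL

# CERTS is a helper macro defined here, not a built-in
CERTS = $ENV(HOME)/certs
AUTH_SSL_SERVER_CAFILE = $(CERTS)/condor.crt
AUTH_SSL_SERVER_CERTFILE = $(CERTS)/condor.crt
AUTH_SSL_SERVER_KEYFILE = $(CERTS)/condor.key
AUTH_SSL_CLIENT_CAFILE = $(CERTS)/condor.crt
AUTH_SSL_CLIENT_CERTFILE = $(CERTS)/condor.crt
AUTH_SSL_CLIENT_KEYFILE = $(CERTS)/condor.key
CERTIFICATE_MAPFILE = $(CERTS)/condor_mapfile
EOF
```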
Tell the system where HTCondor is
For Condor binaries to work, they must be on your shell's search path. Also, HTCondor queries the CONDOR_CONFIG environment variable for the configuration location, so let's edit ~/.bashrc and add those two.
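Something like the following lines at the end of ~/.bashrc should cover both. CONDOR_HOME is just a helper variable I introduce here, and I'm assuming the release was unpacked to ~/condor/condor-9.0.9 with the usual bin and sbin directories inside.

```shell
# Helper variable (not something HTCondor itself reads)
export CONDOR_HOME="$HOME/condor/condor-9.0.9"

# User tools such as condor_q live in bin, daemons such as condor_master in sbin
export PATH="$CONDOR_HOME/bin:$CONDOR_HOME/sbin:$PATH"

# Tell HTCondor where the main config file is
export CONDOR_CONFIG="$HOME/condor/condor_config"
```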
Reload your bash settings, then run condor_master to see if the cluster starts.
Condor is telling you that it's expecting a file with specific config settings for this particular host, central. You will get either the hostname or the IP of your machine. Let's create such a file.
Notice that a local file can state a lot about a particular machine. You may want to dedicate fewer cores to the cluster from that machine, you may have a different path for Java, etc. But, as a minimal example, you really just need to specify which daemons run there.
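A minimal local file for central could look like the sketch below. The daemon selection is my reading of the roles described earlier (central manages the pool and accepts submissions); whether central should also run jobs is your call, hence the commented line.

```shell
CONDOR_DIR="${CONDOR_DIR:-$HOME/condor}"
mkdir -p "$CONDOR_DIR"

# Quoted heredoc so $(DAEMON_LIST) reaches HTCondor untouched
cat > "$CONDOR_DIR/condor_config_central.local" <<'EOF'
# central manages the pool and is where jobs are submitted
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD
# Uncomment to let central run jobs as well
# DAEMON_LIST = $(DAEMON_LIST), STARTD
EOF
```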
Now let's run it again:
Notice that Condor is referencing a folder condor-9.0.9 that doesn't exist. If you were using an installation script, this would probably be automatically created for you. As we're not, we'll have to do it by hand.
Missing folders? Create them!
You can use the simple script below to create the missing folders for all machines.
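Here is a sketch of such a script. It assumes each machine gets its own working directory named local.<hostname> under the release folder, with the log, spool and execute subfolders HTCondor usually expects. The layout and the host names are assumptions; use whatever paths Condor complained about on your machines.

```shell
# Where the per-host folders go (assumed layout) and which hosts to prepare
BASE="${BASE:-$HOME/condor/condor-9.0.9}"
HOSTS="${HOSTS:-central box-1 box-2 box-3}"

for host in $HOSTS; do
  for sub in log spool execute; do
    # -p creates parents as needed and does not fail if the folder already exists
    mkdir -p "$BASE/local.$host/$sub"
  done
done
```

Because everything is created under ~/condor, the folders travel with the rest of the tree when you copy it to other machines.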
Once again, let's go for it:
Condor started! What about the daemons?
As per the table above, HTCondor runs 5 daemons which we do care about. Let's query condor_q to see if it's working:
Now let's try condor_status to get some info on the cluster pool:
Condor working. Let's multi-machine it!
Remember that Condor failed to start because it was missing a local config file for the central machine? We must add a configuration file for every single machine. My box-n machines will function as workers, so they only need the daemons required to run jobs.
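Going by the daemon list earlier in this post, that means MASTER (runs everywhere) and STARTD (runs the jobs). A small loop can write one local file per worker; the host names box-1 to box-3 come from the file layout above, and the rest is a sketch.

```shell
CONDOR_DIR="${CONDOR_DIR:-$HOME/condor}"
WORKERS="${WORKERS:-box-1 box-2 box-3}"
mkdir -p "$CONDOR_DIR"

for host in $WORKERS; do
  cat > "$CONDOR_DIR/condor_config_$host.local" <<'EOF'
# Worker node: the master supervises, the startd runs jobs
DAEMON_LIST = MASTER, STARTD
EOF
done
```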
Sync file structure to all available machines
Both ~/certs and ~/condor must be present on all cluster machines. The first folder has the self-signed certificate we need for authentication, the second has the HTCondor binaries.
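If you have SSH access to the workers, rsync is one way to push both folders. The sketch below assumes key-based SSH logins, the same user name on every machine and the host names box-1 to box-3. It only prints the commands by default; set DRY_RUN=0 to actually copy.

```shell
HOSTS="${HOSTS:-box-1 box-2 box-3}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only, 0 = really sync

sync_all() {
  for host in $HOSTS; do
    if [ "$DRY_RUN" = "1" ]; then
      echo "rsync -az $HOME/certs $HOME/condor $host:"
    else
      # No trailing slash on the sources: the folders themselves land in the remote home
      rsync -az "$HOME/certs" "$HOME/condor" "$host:"
    fi
  done
}

sync_all
```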
Now you'll want to start Condor on all nodes. While Condor has specific tools for restarting, you'll probably be stopping and starting a lot while the cluster is still not functional, so this script may help. It calls pkill to wipe all condor processes and condor_master to start them up.
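A sketch of such a script, under the same assumptions as before (key-based SSH, host names central and box-1 to box-3, release in ~/condor/condor-9.0.9). The environment is passed explicitly because a non-interactive SSH session may not read ~/.bashrc. It prints the commands by default; set DRY_RUN=0 to run them.

```shell
HOSTS="${HOSTS:-central box-1 box-2 box-3}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only, 0 = really restart

# Single quotes: $HOME and $(id -un) are expanded on the remote machine.
# pkill matches process names containing "condor_" that belong to the current user.
REMOTE_CMD='pkill -u "$(id -un)" condor_; sleep 2; CONDOR_CONFIG="$HOME/condor/condor_config" "$HOME/condor/condor-9.0.9/sbin/condor_master"'

restart_all() {
  for host in $HOSTS; do
    if [ "$DRY_RUN" = "1" ]; then
      echo "ssh $host $REMOTE_CMD"
    else
      ssh "$host" "$REMOTE_CMD"
    fi
  done
}

restart_all
```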
Is it running yet?
Let's start it (or re-start it) cluster wide and check.
Let's see how many machines are available with condor_status:
frankie@box-1:~/condor$ condor_status
Name            OpSys  Arch    State      Activity  LoadAv  Mem   ActvtyTime
[email protected]  LINUX  X86_64  Unclaimed  Idle      0.000   1966  0+00:00:00
[email protected]  LINUX  X86_64  Unclaimed  Idle      0.000   1966  0+00:00:28
[email protected]  LINUX  X86_64  Unclaimed  Idle      0.000   1966  0+00:00:28
[email protected]  LINUX  X86_64  Unclaimed  Idle      0.000   1966  0+00:00:28
[email protected]  LINUX  X86_64  Unclaimed  Idle      0.000   922   0+00:00:00
[email protected]  LINUX  X86_64  Unclaimed  Idle      0.000   922   0+00:00:32
[email protected]  LINUX  X86_64  Unclaimed  Idle      0.000   922   0+00:00:32
[email protected]  LINUX  X86_64  Unclaimed  Idle      0.000   922   0+00:00:32
[email protected]  LINUX  X86_64  Unclaimed  Idle      0.000   974   0+00:00:00
[email protected]  LINUX  X86_64  Unclaimed  Idle      0.000   974   0+00:00:26
[email protected]  LINUX  X86_64  Unclaimed  Idle      0.000   974   0+00:00:26
[email protected]  LINUX  X86_64  Unclaimed  Idle      0.000   974   0+00:00:26

              Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill  Drain
X86_64/LINUX  12     0      0        12         0        0           0         0
Total         12     0      0        12         0        0           0         0
This is exactly what we were aiming for. We now have HTCondor running as a non-root user, on a multitude of machines, secured by SSL.
Now you should probably try to submit your first job. There are great examples online, but if you need help, just comment below or send me an email.
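If you want something to start from, here is a minimal sketch of a first job that just runs /bin/echo on whichever worker picks it up. The job folder and file names are made up for this example. File transfer is switched on because nothing in this setup assumes a shared filesystem between machines, and transfer_executable is off so each worker uses its own /bin/echo.

```shell
JOB_DIR="${JOB_DIR:-$HOME/condor-jobs/hello}"   # hypothetical location
mkdir -p "$JOB_DIR"

cat > "$JOB_DIR/hello.sub" <<'EOF'
executable              = /bin/echo
transfer_executable     = false
arguments               = "Hello from HTCondor"
output                  = hello.out
error                   = hello.err
log                     = hello.log
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
queue
EOF

# Submit only if the HTCondor tools are reachable from this shell
if command -v condor_submit >/dev/null 2>&1; then
  (cd "$JOB_DIR" && condor_submit hello.sub)
else
  echo "condor_submit is not on the PATH; wrote $JOB_DIR/hello.sub only"
fi
```

After submitting, condor_q shows the job in the queue, and hello.out should appear in the job folder once it finishes.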
What if you hit an error?
One thing I find particularly helpful is to run commands with the -debug flag:
condor_q -debug
condor_status -debug
Glancing over the config_macros may be useful, and increasing log detail by putting ALL_DEBUG = D_SECURITY or ALL_DEBUG = D_ALL in the config files may also help!
As a last resort, if all else fails, you have the HTCondor mailing lists, where incredibly talented people like Todd Tannenbaum or John Knoeller may be able to help.