If you don't know what HTCondor is about, I'll run you through a couple of scenarios where it may be useful even if you're not a data scientist.
- Thousands of `*.wav` files that you want converted to `*.mp3` via `ffmpeg`? Just dispatch a job and Condor will make use of the available hardware at your disposal!
- Renting a flat with friends; everyone has a beefed-up machine with a decent GPU. You tried to mine ETH for a while but gave up because you didn't have the discipline to start/stop the miner and, at night, the noise was too loud. Submit a job on a schedule that will only run during the day. Profit!
The team from UW–Madison and collaborators did impressive work on both the detail and depth of the manual, but I struggled a bit with the scenario where you don't have root privileges (company policy?) but do have access to a multitude of machines.
Can HTCondor leverage multiple machines as non-root? Definitely!
HTCondor 9 (and above)
Historically you'd secure the cluster after install; from version 9 onwards, security is tightened up by default. This was a welcome change, but it made installation more intricate. While the team behind Condor released proper tooling for it, as non-root there is some extra work involved.
HTCondor Daemons
You'll most likely have one machine that manages it all (central), one machine where jobs are submitted from (this can also be the central machine) and a lot of machines where jobs run.
- `MASTER`: runs on all machines, keeps tabs on everything
- `COLLECTOR`: collects status information (ClassAds) from the rest of the pool
- `NEGOTIATOR`: matches queued jobs to available machines
- `SCHEDD`: maintains the job queue; this is where jobs are submitted
- `STARTD`: starts (and runs!) jobs
| MACHINE | MASTER | COLLECTOR | NEGOTIATOR | SCHEDD | STARTD |
|---------|--------|-----------|------------|--------|--------|
| central | x      | x         | x          | x      |        |
| box-1   | x      |           |            |        | x      |
| box-2   | x      |           |            |        | x      |
| box-3   | x      |           |            |        | x      |
In this scenario `central` is the machine that manages the cluster and is also the only machine you can submit jobs from (it runs the schedd). The `box-n` machines are just working beasts; they can range from dedicated machines to idle desktops.
Next Steps:
- Generate SSL certificates
- Download and Install Condor
- Map SSL certificates to your user
- Configure Condor for all machines
- Sync Condor to all machines
Security? SSL to the rescue!
While HTCondor provides a multitude of ways to secure your cluster, we will use SSL. It's proven, fast, and one of the most versatile setups.
I previously wrote about leveraging SSL for company-internal services; for a deeper understanding of what we're doing, or if you need tips on installing OpenSSL, just jump over there.
We'll be setting up a single self-signed SSL certificate that will function as the single source of truth for authentication.
Create a directory called `certs` in your home directory with `mkdir ~/certs`. Now go to that directory and generate the certificate.
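A minimal sketch using OpenSSL; the key size, the ten-year validity and the `/CN=condor` subject are assumptions, adapt them to your setup:

```bash
cd ~/certs
# one self-signed certificate + key shared by the whole cluster
openssl req -x509 -newkey rsa:4096 -nodes \
    -keyout condor.key -out condor.crt \
    -days 3650 -subj "/CN=condor"
chmod 600 condor.key   # keep the private key private
```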
Go ahead and inspect the cert:
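For example:

```bash
# dump the whole certificate
openssl x509 -in ~/certs/condor.crt -noout -text
# or just the Subject, which we'll need for the mapfile below
openssl x509 -in ~/certs/condor.crt -noout -subject
```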
Inside `~/certs`, create a file called `condor_mapfile` with the following line:
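A minimal sketch, assuming the certificate Subject is `/CN=condor` (as generated above) and that it should map to the user `frankie`; the format is `<auth method> <subject regex> <mapped user>`:

```
SSL /CN=condor frankie
```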
HTCondor uses the `Subject` field to map certificates to users. On a multi-user installation, each user would have their own certificate with proper identifiable fields.
Download and Install HTCondor
Our file structure will be something like this:

```
/home/frankie
  certs/                          # ssl files and mapping
    condor.crt
    condor.key
    condor_mapfile
  condor/
    condor-9.0.9/                 # condor version we'll be running
    condor_config                 # main config file for condor
    condor_config_central.local
    condor_config_box-1.local
    condor_config_box-2.local
    condor_config_box-3.local
```
So create a directory called `~/condor`, head over to the HTCondor releases page, choose a version, download it, and unpack it inside `~/condor`.
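Something along these lines; the exact tarball name and URL are illustrative, pick the build matching your OS and architecture from the releases page:

```bash
mkdir -p ~/condor && cd ~/condor
# example only: substitute the tarball you actually chose
wget https://research.cs.wisc.edu/htcondor/tarball/9.0/9.0.9/release/condor-9.0.9-x86_64_CentOS7-stripped.tar.gz
tar -xzf condor-9.0.9-*.tar.gz
mv condor-9.0.9-*stripped condor-9.0.9   # keep the folder name short
```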
Create HTCondor config file
The main config file is `condor_config`. This file is present on all machines and configures much of Condor's expected behaviour. Go ahead and create a `~/condor/condor_config` file with what's below.
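The macro names below are standard HTCondor configuration macros, but treat the whole file as a minimal sketch: the hostname, the paths, and the fairly permissive `ALLOW_*` lists are assumptions for this walkthrough, not a hardened setup.

```
# the machine that manages the pool
CONDOR_HOST = central

# paths, following the file structure above
RELEASE_DIR = /home/frankie/condor/condor-9.0.9
LOCAL_DIR   = $(RELEASE_DIR)/local.$(HOSTNAME)
LOCAL_CONFIG_FILE = /home/frankie/condor/condor_config_$(HOSTNAME).local

# require SSL authentication, encryption and integrity everywhere
SEC_DEFAULT_AUTHENTICATION         = REQUIRED
SEC_DEFAULT_AUTHENTICATION_METHODS = SSL
SEC_DEFAULT_ENCRYPTION             = REQUIRED
SEC_DEFAULT_INTEGRITY              = REQUIRED

# the single self-signed cert acts as CA, server and client certificate
AUTH_SSL_SERVER_CAFILE   = /home/frankie/certs/condor.crt
AUTH_SSL_SERVER_CERTFILE = /home/frankie/certs/condor.crt
AUTH_SSL_SERVER_KEYFILE  = /home/frankie/certs/condor.key
AUTH_SSL_CLIENT_CAFILE   = /home/frankie/certs/condor.crt
AUTH_SSL_CLIENT_CERTFILE = /home/frankie/certs/condor.crt
AUTH_SSL_CLIENT_KEYFILE  = /home/frankie/certs/condor.key

# map the certificate Subject to a user (see the mapfile above)
CERTIFICATE_MAPFILE = /home/frankie/certs/condor_mapfile

# authorization: permissive for this walkthrough, tighten in production
ALLOW_READ          = *
ALLOW_WRITE         = *
ALLOW_DAEMON        = *
ALLOW_NEGOTIATOR    = $(CONDOR_HOST)
ALLOW_ADMINISTRATOR = $(CONDOR_HOST)
```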
Tell the system where HTCondor is
For Condor binaries to work, they must be on your `$PATH`. HTCondor also queries the `CONDOR_CONFIG` environment variable for the configuration location, so let's edit `~/.bashrc` and add those two.
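Assuming the layout above:

```bash
# append to ~/.bashrc
export PATH="$HOME/condor/condor-9.0.9/bin:$HOME/condor/condor-9.0.9/sbin:$PATH"
export CONDOR_CONFIG="$HOME/condor/condor_config"
```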
Reload your bash settings with `source ~/.bashrc`.
Run `condor_master` to see if the cluster starts.
Condor is telling you that it's expecting a file `condor_config_central.local` with config settings specific to this particular host, `central` (you will see either the hostname or the IP of your machine). Let's create such a file.
Notice that a local file can say a lot about a particular machine: you may want to dedicate fewer cores to the cluster from that machine, you may have a different path for Java, and so on. But, as a minimal example, you really just need to specify the running daemons.
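Per the table above, `central` runs everything except the `STARTD`. A minimal sketch:

```
# ~/condor/condor_config_central.local
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD
```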
Now let's run `condor_master` again.
Notice that condor is referencing a folder `local.central` inside `condor-9.0.9` that doesn't exist. If you were using an installation script, this would probably be created automatically for you. As we're not, we'll do it by hand.
Missing folders? Create them!
You can use the simple script below to create the missing folders for all machines.
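Something like the sketch below; the hostnames are this walkthrough's, and `log`, `spool` and `execute` are the directories Condor expects inside each `LOCAL_DIR`:

```bash
#!/usr/bin/env bash
# create the per-host local dirs referenced by LOCAL_DIR in condor_config
for host in central box-1 box-2 box-3; do
    mkdir -p ~/condor/condor-9.0.9/local.${host}/{log,spool,execute}
done
```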
Once again, let's go for `condor_master`.
Condor started! What about the daemons?
As per the table above, HTCondor runs five daemons we care about. Let's query `condor_q` to see if the scheduler is answering.
Now let's try `condor_status` to get some info on the cluster pool.
Condor working. Let's multi-machine it!
Remember that condor failed to start because it was missing a local config file for the `central` machine? We must add a configuration file for every single machine. My `box-n` machines will function as workers, so they only need the `MASTER` and `STARTD` daemons.
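A minimal sketch; the same content works for every worker:

```
# ~/condor/condor_config_box-1.local (likewise for box-2 and box-3)
DAEMON_LIST = MASTER, STARTD
```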
Sync file structure to all available machines
The directories `~/certs` and `~/condor` must be present on all cluster machines. The first folder holds the self-signed certificate we need for authentication, the second holds the HTCondor binaries.
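A sketch using `rsync` over SSH, assuming the same username and home layout on every box:

```bash
#!/usr/bin/env bash
# push certificates, binaries and configs to every worker
for host in box-1 box-2 box-3; do
    rsync -az ~/certs ~/condor "${host}:~/"
done
```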
Now you'll want to start condor on all nodes. While Condor has specific tools for restarting, you'll probably be stopping and starting a lot while the cluster is still not functional, so this script may help. It calls `pkill` to wipe all condor processes and `condor_master` to start them up again.
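A rough sketch; it assumes passwordless SSH to every node and the paths used throughout this post:

```bash
#!/usr/bin/env bash
# brute-force restart: kill anything condor, then bring the master back up
for host in central box-1 box-2 box-3; do
    ssh "${host}" 'pkill -u "$USER" condor; sleep 2;
        export CONDOR_CONFIG="$HOME/condor/condor_config";
        "$HOME/condor/condor-9.0.9/sbin/condor_master"'
done
```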
Is it running yet?
Let's start it (or restart it) cluster-wide and check. Let's see how many machines are available with `condor_status`:
```
frankie@box-1:~/condor$ condor_status
Name          OpSys  Arch   State     Activity  LoadAv  Mem  ActvtyTime

slot1@box-1   LINUX  X86_64 Unclaimed Idle       0.000 1966  0+00:00:00
slot2@box-1   LINUX  X86_64 Unclaimed Idle       0.000 1966  0+00:00:28
slot3@box-1   LINUX  X86_64 Unclaimed Idle       0.000 1966  0+00:00:28
slot4@box-1   LINUX  X86_64 Unclaimed Idle       0.000 1966  0+00:00:28
slot1@box-2   LINUX  X86_64 Unclaimed Idle       0.000  922  0+00:00:00
slot2@box-2   LINUX  X86_64 Unclaimed Idle       0.000  922  0+00:00:32
slot3@box-2   LINUX  X86_64 Unclaimed Idle       0.000  922  0+00:00:32
slot4@box-2   LINUX  X86_64 Unclaimed Idle       0.000  922  0+00:00:32
slot1@box-3   LINUX  X86_64 Unclaimed Idle       0.000  974  0+00:00:00
slot2@box-3   LINUX  X86_64 Unclaimed Idle       0.000  974  0+00:00:26
slot3@box-3   LINUX  X86_64 Unclaimed Idle       0.000  974  0+00:00:26
slot4@box-3   LINUX  X86_64 Unclaimed Idle       0.000  974  0+00:00:26

               Total Owner Claimed Unclaimed Matched Preempting Backfill  Drain
  X86_64/LINUX    12     0       0        12       0          0        0      0
         Total    12     0       0        12       0          0        0      0
```
Success!
This is exactly what we were aiming for. We now have HTCondor running as a non-root user, on a multitude of machines, secured by SSL.
Now you should probably try to submit your first job. There are great examples online, but if you need help, just comment below or send me an email.
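If you want a head start, here's a minimal sketch of a submit file (the file names are placeholders):

```
# hello.sub: run /bin/echo twice, once per queued process
executable = /bin/echo
arguments  = "hello from job $(Cluster).$(Process)"
output     = hello.$(Process).out
error      = hello.$(Process).err
log        = hello.log
queue 2
```

Submit it with `condor_submit hello.sub` and watch the queue with `condor_q`.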
What if you hit an error?
One thing I find particularly helpful is to run commands with the `-debug` flag.

```bash
condor_q -debug
condor_status -debug
```
Glancing over the configuration macros in the manual may be useful, and increasing log detail by putting `ALL_DEBUG = D_SECURITY` or `ALL_DEBUG = D_ALL` in the config files may also help!
As a last resort, if all else fails, you have the HTCondor mailing lists, where incredibly talented people like Todd Tannenbaum or John Knoeller may be able to help.
Specific variables per node on the job?
HTCondor runs jobs with an isolated set of environment variables. That means the machine's environment variables will not be accessible to jobs. But sometimes you need them.
Say you have a `data` folder that's going to be used by your job. You're running on 3 different boxes, Windows, LinuxSmall and LinuxLarge, and, on each machine, the path varies. How can you export this to condor?
Pass the value as a `ClassAd`. Just edit the `condor_config_$(hostname).local` config file, add the value, and publish it via `STARTD_ATTRS`:
```
# pass data folder as a ClassAd (quotes make it a ClassAd string)
DATA_FOLDER_PATH = "/mnt/volume1/data"
STARTD_ATTRS = $(STARTD_ATTRS) DATA_FOLDER_PATH
```
And then, in your job submit file, you can either pass it as an argument or add it to the environment:

```
# pass the data_folder as an argument
arguments = "--dataFolder=$$(DATA_FOLDER_PATH)"

# or add the data folder to the environment
environment = "DATA_FOLDER_PATH=$$(DATA_FOLDER_PATH)"
```
There's also the case where you may run jobs on machines where `DATA_FOLDER_PATH` was not specified. To cover such cases, you may add a default value:
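HTCondor's `$$(attribute:default)` submit syntax covers this; the fallback path below is just an example:

```
# fall back to /tmp/data when the machine doesn't advertise DATA_FOLDER_PATH
arguments = "--dataFolder=$$(DATA_FOLDER_PATH:/tmp/data)"
```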
I run a very small cluster of 15 machines connected by a TP-Link TL-SG1024S switch. For the price, you can't get better. With a metal enclosure, it's sturdy; being fanless, it's silent. Pushed to the max it gets warm, not burning hot like some other gear I've had before.