GNU Parallel: How to run arbitrary shell commands in parallel

A few months ago, I switched from a computing cluster to a virtual machine in the cloud as my primary way to run computationally demanding software. One feature I missed was how easy the cluster’s scheduler made it to run scripts in parallel: if I wanted to run a script on 20 different input files, I would simply submit 20 separate jobs and they would run in parallel whenever possible. To achieve the same thing on a regular computer or virtual machine, all you need is GNU Parallel.

Basic usage

Let’s assume you have a bunch of FASTQ files on your hard drive and you want to compress them with the gzip command. Of course you could type gzip *.fastq, but this would compress one file after another, which can be quite slow if your files are big.

Using GNU Parallel as follows, you can gzip your files in parallel:

parallel gzip ::: *.fastq

The syntax above can also handle command line arguments and flags (e.g. parallel gzip --fast ::: *.fastq), and of course commands other than gzip work just as well. The triple colon separates the command from its list of input arguments; it is just one small part of GNU Parallel’s overwhelming set of options and arguments.
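
If you are not sure what GNU Parallel will actually run, its --dry-run flag prints each generated command instead of executing it. A minimal sanity check, using the same FASTQ files as above:

parallel --dry-run gzip ::: *.fastq   # prints the gzip commands, runs nothing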

Running complex commands

But what if you want to run a more complex command, with pipes and redirects? Imagine you have many large gzipped CSV files and you want to write the line count of every file to a separate output file. For a single file, you could use:

gzip -dc myfile.csv.gz | wc -l > myfile.csv.gz.linecount

If you want to run this command on multiple files that each get a separate output file, it gets a bit more complicated:

for f in *.csv.gz
do
  gzip -dc "$f" | wc -l > "${f}.linecount"
done

Since the above command is quite dangerous (one little mistake and you may overwrite your input files!), I usually wouldn’t run it directly. Instead, I echo the bash command inside the loop into a new file, which has the added benefit of keeping a record of all the commands I ran:

for f in *.csv.gz
do
  echo "gzip -dc $f | wc -l > ${f}.linecount"
done > count_lines.sh

The contents of count_lines.sh would then look something like this:

lukas@ubuntu: more count_lines.sh
gzip -dc file_1.csv.gz | wc -l > file_1.csv.gz.linecount
gzip -dc file_2.csv.gz | wc -l > file_2.csv.gz.linecount

Once you have a script like this, parallelizing it is trivial:

parallel -a count_lines.sh

and that’s it! It’s easy to remember, it speeds up your work, and it nudges you toward reproducibility, since you have to generate a small script of your commands first.
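
As an aside, GNU Parallel also reads commands from standard input when you don’t give it a command template, so, as far as I can tell, the following two forms are equivalent to the -a version above:

parallel < count_lines.sh        # read the commands from standard input
cat count_lines.sh | parallel    # same thing, via a pipe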

Limiting the number of parallel jobs

If your parallel jobs need a lot of RAM, it’s wise to restrict the number of simultaneous jobs with the option -j n. For example, to run at most 5 jobs in parallel, you’d type

parallel -j 5 -a count_lines.sh
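
The -j option also accepts a percentage of your CPU cores, which is handy if you move between machines with different core counts. A small sketch, reusing count_lines.sh from above:

nproc                                # show how many CPU cores this machine has
parallel -j 50% -a count_lines.sh    # run at most one job per two cores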
