A few months ago, I switched from using a computing cluster to using a virtual machine in the cloud as my primary method to run computationally demanding software. One feature that I missed is how easy the cluster’s software made it to run scripts in parallel: If I wanted to run a script on 20 different input files, I would just submit 20 separate jobs and they would run in parallel when possible. To achieve this on a regular computer or virtual machine, all you need is GNU Parallel.
Basic usage
Let’s assume you have a bunch of FASTQ files on your hard drive and you want to compress them with the gzip command. Of course you could type gzip *.fastq, but this would compress one file after another, which can be quite slow if your files are big. Using GNU Parallel as follows, you can gzip your files in parallel:
parallel gzip ::: *.fastq
The syntax above can also handle command-line arguments and flags (e.g. parallel gzip --fast ::: *.fastq). Of course commands other than gzip also work. The triple-colon is part of GNU Parallel’s overwhelming set of options and arguments.
Running complex commands
But what if you want to run a complex command with pipes and redirects? Imagine you have many large gzipped .CSV files and you want to dump the number of lines for every file to a separate file. For a single file, you could use:
gzip -dc myfile.csv.gz | wc -l > myfile.csv.gz.linecount
If you want to run this command on multiple files that each get a separate output file, it gets a bit more complicated:
for f in *.csv.gz
do
gzip -dc "$f" | wc -l > "${f}.linecount"
done
Since the above command is quite dangerous (one little mistake and you may overwrite your input files!), I usually wouldn’t run it directly. Instead, I echo the bash command inside the loop to a new file, which has the added benefit of keeping a record of all the commands I ran:
for f in *.csv.gz
do
echo "gzip -dc $f | wc -l > ${f}.linecount"
done > count_lines.sh
The contents of count_lines.sh would then look something like this:
lukas@ubuntu: more count_lines.sh
gzip -dc file_1.csv.gz | wc -l > file_1.csv.gz.linecount
gzip -dc file_2.csv.gz | wc -l > file_2.csv.gz.linecount
Once you have a script like this, parallelization becomes trivial:
parallel -a count_lines.sh
and that’s it! It’s easy to remember, speeds up your work, and forces you to be more reproducible, since you have to generate a small script first.
Limiting the number of parallel jobs
If your parallel jobs need a lot of RAM, it’s wise to restrict the number of jobs that run concurrently with the option -j n.
E.g. to run at most 5 jobs in parallel, you’d type:
parallel -j 5 -a count_lines.sh