More Unix tools

Haven’t posted anything in ages, just too busy with various things… Anyway, here I’m just going to paste a few usage examples of some more very useful standard Unix tools.

Awk

Often you have a lot of useful information encoded in file and directory names. In the past I used cut, tr, etc. to extract this information. This can sometimes get quite awkward. I knew there’s a tool called awk, but I never really bothered to use it. Until recently :-) It’s actually quite easy and very useful. Here’s an example:

Imagine you organized your holiday photos like this

/mnt/photos/2020/tenerife/DSC001.jpg
/mnt/photos/2020/tenerife/DSC002.jpg
/mnt/photos/2020/tenerife/DSC003.jpg
...

Now let’s say you want to create a CSV file with an inventory of your photos:

cd /mnt/photos
echo "Year,Location,Filename" > ~/my_photos.csv
find * -type f | awk 'BEGIN { FS = "/" } ; {print $1","$2","$3}' >> ~/my_photos.csv
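
With the directory layout above, my_photos.csv should end up looking something like this:

Year,Location,Filename
2020,tenerife,DSC001.jpg
2020,tenerife,DSC002.jpg
2020,tenerife,DSC003.jpg
...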

Note: This use case could be handled in various, probably simpler ways, but it hopefully demonstrates how awk works. As you can guess, awk is also quite handy for extracting information from CSV files.
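
For example, assuming the my_photos.csv created above, something along these lines should print the filenames of all photos taken in tenerife. -F, sets the field separator to a comma, just like FS in the BEGIN block earlier:

awk -F, '$2 == "tenerife" {print $3}' ~/my_photos.csv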

Find

I already used it in the previous section, but find is also useful for processing files serially. Example: calculate a checksum for all zip files:

find * -iname "*.zip" -exec sha1sum {} \; >> ~/checksums.sha1

Note: {} is replaced by each file that find finds. The command which is executed has to be terminated with \;.
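
A small aside, in case one sha1sum process per file feels wasteful: find also accepts + as the terminator, in which case it passes as many files as possible to each invocation of the command instead of running it once per file:

find * -iname "*.zip" -exec sha1sum {} + >> ~/checksums.sha1

The files can later be checked against the stored checksums with sha1sum -c ~/checksums.sha1.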

Parallel

Not really a standard tool, but very useful if you want to make the most of your multicore CPU! See GNU Parallel

Example: We’ll do the same, calculate the checksums of all zip files:

# First get all absolute paths to the zip files and put them in a file:
find * -iname "*.zip" -exec readlink -f {} \; >> ~/zip_files.txt

# Process them in 5 parallel threads:
parallel -a ~/zip_files.txt --eta -j5 --joblog log.txt --delay 2 -k sha1sum {} >> ~/checksums.sha1

Notes: With parallel you don’t need the \;. Please ignore that this example doesn’t make much sense, because sha1sum is so fast you wouldn’t use parallel for it. But it shows a few options which can be very useful in other cases:

--eta gives you some progress information.
-j5 means run 5 jobs in parallel.
--joblog log.txt saves a log of all jobs.
--delay 2 delays the start of each job by 2 seconds. This option can be very important. For example, if you kick off a process which right at the start hits a database (or another limited resource) very hard, you don’t want 5, 10, or however many jobs to do that at exactly the same time.
Finally, -k ensures that the output from each job is written in the same order as the input. Without this option the output order would be the order in which the jobs finish, which could be pretty random. Often you want to preserve the order so you can easily match up input and output files.
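
By the way, the intermediate file with the paths isn’t strictly necessary. GNU Parallel also reads its input from stdin, so piping find straight into it should give the same result:

find * -iname "*.zip" -exec readlink -f {} \; | parallel --eta -j5 --joblog log.txt --delay 2 -k sha1sum {} >> ~/checksums.sha1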