More Unix tools
Haven’t posted anything in ages, I’ve just been too busy with various things… Anyway, here I’m going to paste a few usage examples of some more very useful standard Unix tools.
Awk
Often you have a lot of useful information encoded in file and directory
names. In the past I used cut, tr, etc. to extract this information.
This can sometimes get quite awkward. I knew there’s a tool called awk,
but I never really bothered to use it. Until recently :-) It’s actually
quite easy and very useful. Here’s an example:
Imagine you organized your holiday photos like this:
/mnt/photos/2020/tenerife/DSC001.jpg
/mnt/photos/2020/tenerife/DSC002.jpg
/mnt/photos/2020/tenerife/DSC003.jpg
...
Now let’s say you want to create a CSV file with an inventory of your photos:
cd /mnt/photos
echo "Year,Location,Filename" > ~/my_photos.csv
find * -type f | awk 'BEGIN { FS = "/" } ; {print $1","$2","$3}' >> ~/my_photos.csv
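Note: The BEGIN block sets the field separator (FS) to /, so $1, $2 and $3
are the year, the location and the filename. This use case could be handled
in various, probably simpler ways, but it hopefully demonstrates how awk
works. As you can guess, awk is also quite handy for extracting information
from CSV files.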
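To make the CSV remark a bit more concrete, here is a minimal sketch that counts photos per location. The sample rows (including the oslo entry) are made up for illustration, and the temporary file just stands in for something like ~/my_photos.csv:

```shell
# Hypothetical sample inventory in the same format as the CSV created above.
csv=$(mktemp)
cat > "$csv" <<'EOF'
Year,Location,Filename
2020,tenerife,DSC001.jpg
2020,tenerife,DSC002.jpg
2019,oslo,IMG_0001.jpg
EOF

# -F, sets the field separator to a comma; NR > 1 skips the header row.
# n[$2]++ counts rows per location, the END block prints the totals.
# The order of "for (loc in n)" is unspecified, hence the sort.
counts=$(awk -F, 'NR > 1 { n[$2]++ } END { for (loc in n) print loc "," n[loc] }' "$csv" | sort)
printf '%s\n' "$counts"

rm -f "$csv"
```

Note that -F, is just a shorter way of writing BEGIN { FS = "," }, and that this simple splitting will not cope with quoted fields that contain commas.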
Find
I already used it in the previous section, but find is also useful for
processing files one after another. For example, to calculate a checksum for
all zip files:
find * -iname "*.zip" -exec sha1sum {} \; >> ~/checksums.sha1
Note: {} is replaced with the path of each file find finds, and the command
is run once per file. The command to be executed has to be terminated with \;.
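As a small aside, -exec can also be terminated with + instead of \;, in which case find passes many files to a single invocation of the command. The sketch below uses echo and a throwaway directory with made-up file names so that the difference is visible in the number of output lines:

```shell
# Throwaway directory with two zip files and one unrelated file (names are made up).
dir=$(mktemp -d)
touch "$dir/a.zip" "$dir/b.ZIP" "$dir/notes.txt"

# \; runs the command once per match, so echo is called twice: two lines.
per_file=$(find "$dir" -iname "*.zip" -exec echo {} \; | wc -l | tr -d ' ')

# + appends all matches to one command line, so echo is called once: one line.
batched=$(find "$dir" -iname "*.zip" -exec echo {} + | wc -l | tr -d ' ')

echo "per file: $per_file, batched: $batched"
rm -rf "$dir"
```

For a command like sha1sum, which accepts several files at once, the + form saves starting one process per file.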
Parallel
Not really a standard tool, but very useful if you want to make the most of your multicore CPU! See GNU Parallel
Example: We’ll do the same, calculate the checksums of all zip files:
# First get all absolute paths to the zip files and put them in a file:
find * -iname "*.zip" -exec readlink -f {} \; >> ~/zip_files.txt
# Process them with up to 5 jobs running in parallel:
parallel -a ~/zip_files.txt --eta -j5 --joblog log.txt --delay 2 -k sha1sum {} >> ~/checksums.sha1
Notes: With parallel you don’t need the \;. Please ignore that this example
doesn’t make much sense: sha1sum is so fast that you wouldn’t use parallel
for it. But it shows a few options which can be very useful in other cases.
-a reads the arguments (one per line) from the given file. --eta gives you
some progress information, including an estimate of when all jobs will be
done. -j5 means run up to 5 jobs at the same time. --joblog writes a log of
the jobs that were run (including runtime and exit value) to the given file.
--delay 2 delays the start of each job by 2 seconds. This option can be very
important. For example, if you kick off a process which right at the start
hits a database (or other limited resource) very hard, you don’t want 5, 10,
or however many jobs to do that at exactly the same time.
Finally, -k ensures that the output from each job is written in the same
order as the input. Without this option the output order would be the order in
which the jobs finish, which could be pretty random. Often you want to preserve
the order so you can easily match up input and output files.
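To see what -k does, here is a small sketch, assuming GNU Parallel is installed (the guard simply skips the demo otherwise). The sleep durations are made up so that the jobs finish in reverse order, and ::: passes the arguments on the command line instead of reading them from a file with -a:

```shell
# Only run the demo if GNU Parallel (not some other tool named parallel) is available.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    # All three jobs start together; the one sleeping 0.1s finishes first.
    # With -k the output still comes back in input order: 0.3, 0.2, 0.1.
    ordered=$(parallel -k -j3 'sleep {}; echo "slept {}"' ::: 0.3 0.2 0.1)
    printf '%s\n' "$ordered"
fi
```

If you drop -k, the lines should show up in the order in which the jobs finish.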