Handy to know shizzle: awk - multiple iterations on a single file with different logic (FS) per iteration

Holy smokin’ Toledo’s - what a blast from the past! I was doing some blog spring cleaning and found this unpublished draft from 2012-May-02. Here is the unedited awk paste (scroll down for a commented version):

awk -v loops=2 '
BEGIN {
  f = ARGV[ARGC-1]
  "wc -l "f"|egrep -o [0-9]+" | getline NL;
  while (++i < loops) {
    ARGV[ARGC++] = f
  }
}
FNR == 1 {
  iteration++; print "iteration: "iteration
}
FNR == NL {
  FS = "[0-9][0-9]:[0-9][0-9] "
}
iteration == 1 { print $1 }
iteration == 2 { print $NF }

' $media_files_file

In 2012 I must have been happy with my handiwork - because I had created a blog draft either as a scratch pad or because I wanted to share it...😊

So I searched my script library for the keyword media_files_file and got a hit on a script named update-dm-to-dc.sh which was created to batch update the date modified timestamp of files to match the date created timestamp.

This was also back in a time when I was probably storing media files on NTFS which stores creation timestamps for its files.

I'm guessing that I probably did something to change the modified timestamp on a bunch of media files and wanted to revert that change. Back in 2012, my storage did not support CoW or a file system that supported snapshots - so there was no easy way to roll back if the mass modification of files went wrong - backups would have been the main undo workflow.

Perhaps iTunes or similar media library had updated file modified timestamps in an undesirable way. E.g. causing differential backup scripts to see the media files as being modified and being selected during the next backup?

Maybe I was migrating files to a filesystem that didn't support creation timestamps (like XFS v4 or ext3) and wanted to set the modified stamp to the source filesystem's creation stamp?

<rabbit-hole>

Q: Did XFS support creation time (crtime) in 2012?
A: No - In 2012 the latest XFS release was v4 - the crtime code (XFS inode v3) was first committed on 2013-Apr-21 by Christoph Hellwig. Here is the commit.
An XFS status update in 2013-May mentions the release of Linux 3.10 with an "experimental XFS feature - CRC protection for on-disk metadata structures", which AFAIK was part of the inode v3 code.
In 2015-May xfsprogs-3.2.3-rc1 was released which mentioned properly supporting inode v3 format.
The XFS docs were updated in 2016-Jan with details of crtime and XFS v5 fields. Note the differentiation between XFS version and XFS inode version.

# on XFS v5
# xfs_db -r -c "inode $(stat -c '%i' yourfile.txt)" -c "print v3.crtime.sec" /dev/disk/by-uuid/1d5722e2-a5ac-XXXX-XXXX-392290480c23
v3.crtime.sec = Sun Aug 22 14:58:40 2021

stat -c '%w' or '%W' should display the file creation time on filesystems that store it. On Linux this requires coreutils 8.31 (released 2019-Mar), glibc 2.28 and kernel version 4.11 or newer.
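A minimal sketch of that stat call, assuming GNU coreutils stat on Linux. The temp file and the variable names are made up for illustration. On a filesystem or kernel that cannot report a birth time, GNU stat prints "-" for %w and 0 for %W.

```shell
# Assumption: GNU coreutils stat (the -c option is not available in BSD/macOS stat).
# The temp file is a stand-in for yourfile.txt.
target=$(mktemp)

# %w = birth time, human-readable ("-" if unknown)
birth_human=$(stat -c '%w' "$target")
# %W = birth time in seconds since the epoch (0 if unknown)
birth_epoch=$(stat -c '%W' "$target")

printf 'birth: %s (epoch %s)\n' "$birth_human" "$birth_epoch"
rm -f "$target"
```

A result of "-" or 0 does not necessarily mean the filesystem lacks crtime - it can also mean that one of the coreutils, glibc or kernel versions is too old to expose it.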

</rabbit-hole>

I've gone through the awk script and added comments - it took me a moment to figure out what I was trying to do back then... Unfortunately I don't have bash history going back to that exact point in time to know the exact contents or line format of $media_files_file.

What I was able to establish is that I probably ran this ad hoc awk script the day before I created the update-dm-to-dc.sh script (the research and debugging phase). So the line format of $media_files_file was probably an earlier iteration of the formats used in the update-dm-to-dc.sh script - the record formats in the script don't match the pattern: FS = "[0-9][0-9]:[0-9][0-9] "

My summary of the ad-hoc awk script: I was trying to run awk once but read the records (lines) more than once and do something different with the records based on the iteration.

For each record (a line in this case - default RS) in the $media_files_file, the ad-hoc awk would have:

Once at the start of each iteration - printed the iteration number.
For the #1 iteration - for each record - printed the first field (default FS).
For the #2 iteration - for each record - printed the last field, $NF (custom FS).

My hypothesis of the goal of the ad-hoc awk - was to help validate records (lines) and fields - to ensure that the fields were normalised and predictable for script input, with the goal of doing a mass update/change of timestamps of my media library. So this awk was part of the manual assertions to check that the planned script input would work in a predictable way. With comments:

awk -v loops=2 '
# this block is run once at the start of awk invocation
BEGIN {
  # store the last awk argument in var f - the file to process in this case
  f = ARGV[ARGC-1]
  
  # store the line count of f var in NL var (number of lines in the file to be processed).
  # awk does not have a built in variable for this.
  "wc -l "f"|egrep -o [0-9]+" | getline NL;

  # duplicate the command line argument to satisfy the number of specified loops.
  # this has the effect of telling awk to run more than once on the input file stored in f var.
  while (++i < loops) {
    ARGV[ARGC++] = f
  }
}

# run the following code for each iteration

# this block is executed at the start of each iteration in f var (first line of the file)
FNR == 1 {
  # increment and print the iteration counter
  iteration++;
  print "iteration: "iteration
}

# this block is executed at the end of an iteration in f var (last line of the file)
FNR == NL {
  # modify the awk FS (Field Separator) - the new FS applies from the next record onwards
  FS = "[0-9][0-9]:[0-9][0-9] "
}

# this block is executed only during iteration 1
iteration == 1 {
  # print the first field in the current input record
  print $1
}

# this block is executed only during iteration 2
iteration == 2 {
  # print the last field ($NF) in the current input record - NF on its own is the Number of Fields
  print $NF
}

' $media_files_file
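To see the ARGV trick in action, here is a minimal sketch run against made-up sample data. The original line format of $media_files_file is unknown, so the "date HH:MM path" layout is purely an assumption, as is the demo_output variable. The sketch also swaps the wc | egrep pipeline for "wc -l < file" so that only the count is read back.

```shell
# Hypothetical sample data - the real line format is unknown.
media_files_file=$(mktemp)
printf '%s\n' \
  '2012-05-01 10:15 /media/a.mp3' \
  '2012-05-02 11:30 /media/b.mp3' > "$media_files_file"

demo_output=$(awk -v loops=2 '
BEGIN {
  f = ARGV[ARGC-1]
  # assumption: "wc -l < file" swapped in for the original wc | egrep pipeline
  cmd = "wc -l < \"" f "\""
  cmd | getline NL
  close(cmd)
  NL += 0   # force a numeric value for the FNR == NL comparison
  # queue the same file again, once per extra loop
  while (++i < loops) ARGV[ARGC++] = f
}
FNR == 1       { iteration++; print "iteration: " iteration }
FNR == NL      { FS = "[0-9][0-9]:[0-9][0-9] " }
iteration == 1 { print $1 }
iteration == 2 { print $NF }
' "$media_files_file")

printf '%s\n' "$demo_output"
rm -f "$media_files_file"
```

With this sample, iteration 1 prints the date column (first field, default FS) and iteration 2 prints the path (last field, once the HH:MM pattern has become the separator).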

Overall thoughts

The script is pretty nifty - in retrospect there may have been better ways to achieve the same results but I like how it explores the possibilities of awk, demonstrating a practical modification of ARGV and how to run different logic on the same input records. Nice work 2012 me!
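One of those "better ways" might be to name the file twice on the command line and put an FS assignment between the two operands. POSIX awk applies such an assignment just before it opens the next file, so no line count or ARGV editing is needed. This is a sketch rather than what the 2012 script did, and the sample data, the pass counter and the variable names are all made up.

```shell
# Hypothetical sample data - the real line format is unknown.
sample_file=$(mktemp)
printf '%s\n' \
  '2012-05-01 10:15 /media/a.mp3' \
  '2012-05-02 11:30 /media/b.mp3' > "$sample_file"

# The FS=... operand sits between the two file operands, so pass 1 uses the
# default FS and pass 2 uses the HH:MM pattern as the field separator.
two_pass_output=$(awk '
FNR == 1  { pass++; print "iteration: " pass }
pass == 1 { print $1 }
pass == 2 { print $NF }
' "$sample_file" FS='[0-9][0-9]:[0-9][0-9] ' "$sample_file")

printf '%s\n' "$two_pass_output"
rm -f "$sample_file"
```

The trade-off is that the file name has to be typed (or expanded) twice, whereas the ARGV approach scales to any number of passes through -v loops=N.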

Thursday, January 26, 2023
