TDM 20100: Project 7 — 2022

Motivation: awk is a programming language designed for text processing. It can be a quick and efficient way to quickly parse through and process textual data. While Python and R definitely have their place in the data science world, it can be extremely satisfying to perform an operation extremely quickly using something like awk.

Context: This is the third of three projects where we introduce awk. awk is a powerful tool that can be used to perform a variety of the tasks that we’ve previously used other UNIX utilities for. After this project, we will continue to utilize all of the utilities, and bash scripts, to perform tasks in a repeatable manner.

Scope: awk, UNIX utilities

Learning Objectives
  • Use awk to process and manipulate textual data.

  • Use piping and redirection within the terminal to pass around data between utilities.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.csv

Questions

Question 1

Take a look at the dataset. You may have noticed that the "Store Location" column (8th column) contains latitude and longitude coordinates. That is some rich data that could be fun and useful.

The data will look something like the following:

1013 MAINKEOKUK 52632(40.39978, -91.387531)
900 E WASHINGTONCLARINDA 51632(40.739238, -95.02756)
1414 48TH STFORT MADISON 52627(40.624226, -91.373211)
812 S 1ST AVEIOWA CITY 52240

What this means is that you can’t just parse out the latitude and longitude coordinates and call it a day — you need to use awk functions like gsub and split to extract the latitude and longitude coordinates.

Use awk to print out the latitude and longitude for each line in the original dataset. Output should resemble the following.

lat;lon
1.23;4.56

Make sure to take care of rows that don’t have latitude and longitude coordinates — just skip them. So if your results look like this, you need to add logic to skip the "empty" rows:

40.39978, -91.387531
40.739238, -95.02756
40.624226, -91.373211
,
41.985887, -92.579244

To do this, just go ahead and wrap your print in an if statement similar to:

if (length(coords[1]) > ) {
    print coords[1]";"coords[2]
}

split and gsub will be useful awk functions to use for this question.

If we have a bunch of data formatted like the following:

1013 MAINKEOKUK 52632(40.39978, -91.387531)

If we first used split to split on "(", for example like:

split($8, coords, "(");

coords[2] would be:

40.39978, -91.387531)

Then, you could use gsub to remove any ")" characters from coords[2] like:

gsub(/\)/, "", coords[2]);

coords[2] would be:

40.39978, -91.387531

At this point I’m sure you can see how to use awk to extract and print the rest!

Don’t forget any lingering space after the first comma! We don’t want that.

To verify your awk command is correct, pipe the first 10 rows to your awk command. The output should be the following.

%%bash

head -n 10 /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.csv | awk -F';' '{}'
output
40.39978;-91.387531
40.739238;-95.02756
40.624226;-91.373211
41.985887;-92.579244
42.490073;-95.544793
42.459938;-92.327917
41.985887;-92.579244
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 2

Use awk to create a new dataset called sales_by_store.csv. Include the lat and lon you figured out how to parse in the previous question. The final columns should be the following.

columns
store_name;date;sold_usd;volume_sold;lat;lon

Please exclude all rows that do not have latitude and longitude values. Save volume sold as liters, not gallons.

You can output the results of the awk command to a new file called sales_by_store.csv as follows.

%%bash

awk -F';' {} /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.csv > $HOME/sales_by_store.csv

The > part is a redirect. You are redirecting the output from the awk command to a new file called sales_by_store.csv. If you were to replace > by >> it would append instead of replace. In other words, if you use a single > it will first erase the sales_by_store.csv file before adding the results of the awk command to the file. If you use >>, it will append the results. If you use >> and append results — if you were to run the command more than once, the sales_by_store.csv file would continue to grow.

To verify your output, the results from piping the first 10 lines of our dataset to your awk command should be the following.

%%bash

head -n 10 /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.csv | awk -F';' '{}'
output
store_name;date;sold_usd;volume_sold;lat;lon
Keokuk Spirits;11/20/2015;162.84;4.5;40.39978;-91.387531
Ding's Honk And Holler;11/21/2015;325.68;9.0;40.739238;-95.02756
Quicker Liquor Store;11/16/2015;19.20;0.3;40.624226;-91.373211
Twin Town Liquor;11/17/2015;19.20;0.3;41.985887;-92.579244
Spirits, Stogies and Stuff;11/11/2015;53.34;1.75;42.490073;-95.544793
Hy-Vee Food Store #3 / Waterloo;11/09/2015;104.58;9.0;42.459938;-92.327917
Twin Town Liquor;11/10/2015;106.68;3.5;41.985887;-92.579244
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 3

Believe it or not, awk even supports geometric calculations like sin and cos. Write a bash script that, given a pair of latitude and pair of longitude, calculates the distance between the two points.

Okay, so how to get started? To calculate this, we can use the Haversine formula. The formula is:

$2*r*arcsin(\sqrt{sin^2(\frac{\phi_2 - \phi_1}{2}) + cos(\phi_1)*cos(\phi_2)*sin^2(\frac{\lambda_2 - \lambda_1}{2})})$

Where:

  • $r$ is the radius of the Earth in kilometers, we can use: 6367.4447 kilometers

  • $\phi_1$ and $\phi_2$ are the latitude coordinates of the two points

  • $\lambda_1$ and $\lambda_2$ are the longitude coordinates of the two points

In awk, sin is sin, cos is cos, and sqrt is sqrt.

To get the arcsin use the following awk function:

function arcsin(x) { return atan2(x, sqrt(1-x*x)) }

To convert from degrees to radians, use the following awk function:

function dtor(x) { return x*atan2(0, -1)/180 }

The following is how the script should work (with a real example you can test):

%%bash

./question3.sh 40.39978 -91.387531 40.739238 -95.02756
Results
309.57

To include functions in your awk command, do as follows:

awk -v lat1=$1 -v lat2=$3 -v lon1=$2 -v lon2=$4 'function arcsin(x) { return atan2(x, sqrt(1-x*x)) }function dtor(x) { return x*atan2(0, -1)/180 }BEGIN{
    lat1 = dtor(lat1);
    print lat1;
    # rest of your code here!
}'

We want you to create a bash script called question3.sh in your $HOME directory. After you have your bash script, we want you to run it in a bash cell to see the output.

The following is some skeleton code that you can use to get started.

#!/bin/bash

lat1=$1
lat2=$3
lon1=$2
lon2=$4

awk -v lat1=$1 -v lat2=$3 -v lon1=$2 -v lon2=$4 'function arcsin(x) { return atan2(x, sqrt(1-x*x)) }function dtor(x) { return x*atan2(0, -1)/180 }BEGIN{
    lat1 = dtor(lat1);
    print lat1;
    # rest of your code here!
}'

You may need to give your script execute permissions like this.

%%bash

chmod +x $HOME/question3.sh

Read the shebang and arguments sections in the book.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 4

Find the latitude and longitude points for two interesting points on a map (it could be anywhere). Make a note of the locations and the latitude and longitude values for each point in a markdown cell.

Use your question.sh script to determine the distance. How close is the distance to the distance you get from an online map app? Pretty close?

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.