STAT 39000: Project 14 — Fall 2020
Motivation: As we learned earlier in the semester, bash scripts are a powerful tool when you need to perform repeated tasks in a UNIX-like system. In addition, sometimes preprocessing data using UNIX tools prior to analysis in R or Python is useful. Ample practice is integral to becoming proficient with these tools. As such, we will be reviewing topics learned earlier in the semester.
Context: We’ve just ended a series of projects focused on SQL. In this project we will begin to review topics learned throughout the semester, starting writing bash scripts using the various UNIX tools we learned about in Projects 3 through 8.
Scope: awk, UNIX utilities, bash scripts, fread
Dataset
The following questions will use PLOTSNAP.csv from the data folder found in Scholar:
/class/datamine/data/forest
To read more about PLOTSNAP.csv that you will be working with:
Questions
Question 1
Take a look at at PLOTSNAP.csv
. Write a line of awk code that displays the STATECD
followed by the number of rows with that STATECD
.
-
Code used to solve the problem.
-
Count of the following `STATECD`s: 1, 2, 4, 5, 6
Question 2
Unfortunately, there isn’t a very accessible list available that shows which state each STATECD
represents. This is no problem for us though, the dataset has LAT
and LON
! Write some bash that prints just the STATECD
, LAT
, and LON
.
There are 92 columns in our dataset: |
-
Code used to solve the problem.
-
The output of your code piped to
head
.
Question 3
fread
is a "Fast and Friendly File Finagler". It is part of the very popular data.table
package in R. We will learn more about this package next semester. For now, read the documentation here and use the cmd
argument in conjunction with your bash code from (2) to read the data of STATECD
, LAT
, and LON
into a data.table
in your R environment.
-
Code used to solve the problem.
-
The
head
of the resultingdata.table
.
Question 4
We are going to further understand the data from question (3) by finding the actual locations based on the LAT
and LON
columns. We can use the library revgeo
to get a location given a pair of longitude and latitude values. revgeo
uses a free API hosted by photon in order to do so.
For example:
library(revgeo)
revgeo(longitude=-86.926153, latitude=40.427055, output='frame')
The code above will give you the address information in six columns, from the most-granular housenumber
to the least-granular country
. Depending on the coordinates, revgeo
may or may not give you results for each column. For this question, we are going to keep only the state
column.
There are over 4 million rows in our dataset — we do not want to hit photon’s API that many times. Instead, we are going to do the following:
-
Unless you feel comfortable using
data.table
, convert yourdata.table
to adata.frame
:
my_dataframe <- data.frame(my_datatable)
-
Calculate the average
LAT
andLON
for eachSTATECD
, and call the newdata.frame
,dat
. This should result in 57 rows of lat/long pairs. -
For each row in
dat
, run a reverse geocode and append thestate
to a new column calledSTATE
.
To calculate the average |
|
Here is some extra help:
|
It is okay to get "Not Found" for some of the addresses. |
-
Code used to solve the problem.
-
The
head
of the resultingdata.frame
.
Question 5
Use the leaflet
, addTiles
, and addCircles
functions from the leaflet
package to map our average latitude and longitude data from question (4) to a map (should be a total of 57 lat/long pairs).
See here for an example of adding points to a map. |
-
Code used to create the map.
-
The map itself as output from running the code chunk.
Question 6
Write a bash script that accepts at least 1 argument, and performs a useful task using at least 1 dataset from the forest
folder in /class/datamine/data/forest
. An example of a useful task could be printing a report of summary statistics for the data. Feel free to get creative. Note that tasks must be non-trivial — a bash script that counts the number of lines in a file is not appropriate. Make sure to properly document (via comments) what your bash script does. Also ensure that your script returns columnar data with appropriate separating characters (for example a csv).
-
The content of your bash script starting from
#!/bin/bash
. -
Example output from running your script as intended.
-
A description of what your script does.
Question 7
You used fread
in question (2). Now use the cmd
argument in conjunction with your script from (6) to read the script output into a data.table
in your R environment.
-
The R code used to read in and preprocess your data using your bash script from (6).
-
The
head
of the resultingdata.table
.