STAT 29000: Project 3 — Fall 2021

Regular expressions, irregularly satisfying, introduction to grep and regular expressions

Motivation: The need to search files and datasets based on the text held within is common during various parts of the data wrangling process — after all, projects in industry will not typically provide you with a path to your dataset and call it a day. grep is an extremely powerful UNIX tool that allows you to search text using regular expressions. Regular expressions are a structured method for searching for specified patterns. Regular expressions can be very complicated, even professionals can make critical mistakes. With that being said, learning some of the basics is an incredible tool that will come in handy regardless of the language you are working in.

Regular expressions are not something you will be able to completely escape from. They exist in some way, shape, and form in all major programming languages. Even if you are less-interested in UNIX tools (which you shouldn’t be, they can be awesome), you should definitely take the time to learn regular expressions.

Context: We’ve just begun to learn the basics of navigating a file system in UNIX using various terminal commands. Now we will go into more depth with one of the most useful command line tools, grep, and experiment with regular expressions using grep, R, and later on, Python.

Scope: grep, regular expression basics, utilizing regular expression tools in R and Python

Learning Objectives
  • Use grep to search for patterns within a dataset.

  • Use cut to section off and slice up data from the command line.

  • Use wc to count the number of lines of input.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

The following questions will use the following dataset(s):

/depot/datamine/data/consumer_complaints/processed.csv

Questions

Question 1

grep stands for (g)lobally search for a (r)egular (e)xpression and (p)rint matching lines. As such, to best demonstrate grep, we will be using it with textual data.

Let’s assume for a second that we didn’t provide you with the location of this projects dataset, and you didn’t know the name of the file either. With all of that being said, you do know that it is the only dataset with the text "That’s the sort of fraudy fraudulent fraud that Wells Fargo defrauds its fraud-victim customers with. Fraudulently." in it. (When you search for this sentence in the file, make sure that you type the single quote in "That’s" so that you get a regular ASCII single quote. Otherwise, you will not find this sentence.)

Write a grep command that finds the dataset. You can start in the /depot/datamine/data directory to reduce the amount of text being searched. In addition, use a wildcard to reduce the directories we search to only directories that start with a c inside the /depot/datamine/data directory.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 2

In the previous project, you learned about a command that could quickly print out the first n lines of a file. A csv file typically has a header row to explain what data each column holds. Use the command you learned to print out the first line of the file, and only the first line of the file.

Great, now that you know what each column holds, repeat question (1), but, format the output so that it shows the complaint_id, consumer_complaint_narrative, and the state.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 3

Imagine a scenario where we are dealing with a much bigger dataset. Imagine that we live in the southeast and are really only interested in analyzing the data for Florida, Georgia, Mississippi, Alabama, and South Carolina. In addition, we are only interested in in the complaint_id, state, consumer_complaint_narrative, and tags.

Use UNIX tools to, in one line, create a new dataset called southeast.csv that only contains the data for the five states mentioned above, and only the columns listed above.

Be careful you don’t accidentally get lines with a word like "CAPITAL" in them (AL is the state code of Alabama and is present in the word "CAPITAL").

How many rows of data remain? How many megabytes is the new file? Use cut to isolate just the data we ask for. For example, just print the number of rows, and just print the value (in Mb) of the size of the file.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 4

We want to isolate some of our southeast complaints. Return rows from our new dataset, southeast.csv, that have one of the following words: "wow", "irritating", or "rude" followed by at least 1 exclamation mark. Do this with just a single grep command. Ignore case (whether or not parts of the "wow", "rude", or "irritating" words are capitalized or not).

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 5

If you pay attention to the consumer_complaint_narrative column, you’ll notice that some of the narratives contain dollar amounts in curly braces { and }. Use grep to find the narratives that contain at least one dollar amount enclosed in curly braces. Use head to limit output to only the first 5 results.

Use the option --color=auto to get some nice, colored output (if using a terminal).

Use the option -E to use extended regular expressions. This will make your regular expressions less messy (less escaping).

There are instances like {>= $1000000} and { XXXX }. The first example qualifies, but the second doesn’t. Make sure the following are matched:

  • {$0.00}

  • { $1,000.00 }

  • {>= $1000000}

  • { >= $1000000 }

And that the following are not matched:

  • { XXX }

  • {XXX}

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 6

As mentioned earlier on, every major language has some sort of regular expression package. Use either the re package in Python (or string methods in pandas, for example, findall), or the grep, grepl, and stringr packages in R to perform the same operation in question (5).

If you are using pandas, there will be 3 types of results: lists of strings, empty lists, and NA values. You can convert your empty lists to NA values like this.

dat['amounts'] = dat['amounts'].apply(lambda x: pd.NA if x==[] else x)

Then, dat['amounts'] will be a pandas Series with values pd.NA or a list of strings. Which you can filter like this.

dat['amounts'].loc[dat['amounts'].notna()]
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 7 (optional, 0 pts)

As mentioned earlier on, every major language has some sort of regular expression package. Use either the re package in Python, or the grep, grepl, and stringr packages in R to create a new column in your data frame (pandas or R data frame) named amounts that contains a semi-colon separated string of dollar amounts without the dollar sign. For example, if the dollar amounts are $100, $200, and $300, the amounts column should contain 100.00;200.00;300.00.

One good way to do this is to use the apply method on the pandas Series.

dat['amounts'] = dat['amounts'].apply(some_function)

This is one way to test if a value is NA or not.

isinstance(my_list, type(pd.NA))
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.