TDM 20100: Project 3 — 2022
Motivation: The need to search files and datasets based on the text they contain comes up often during data wrangling. After all, projects in industry will not typically provide you with a path to your dataset and call it a day. grep is an extremely powerful UNIX tool that allows you to search text using regular expressions. Regular expressions are a structured method for searching for specified patterns. They can be very complicated, and even professionals can make critical mistakes. That said, learning some of the basics will come in handy regardless of the language you are working in.
Regular expressions are not something you will be able to completely escape from. They exist in some way, shape, or form in all major programming languages. Even if you are less interested in UNIX tools (which you shouldn’t be, they can be awesome), you should definitely take the time to learn regular expressions.
Context: We’ve just begun to learn the basics of navigating a file system in UNIX using various terminal commands. Now we will go into more depth with one of the most useful command line tools, grep, and experiment with regular expressions using grep, R, and later on, Python.
Scope: grep, regular expression basics, utilizing regular expression tools in R and Python
Dataset(s)
The following questions will use the following dataset(s):
- /anvil/projects/tdm/data/consumer_complaints/processed.csv
Questions
Question 1
grep stands for (g)lobally search for a (r)egular (e)xpression and (p)rint matching lines. As such, to best demonstrate grep, we will be using it with textual data.
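As a minimal sketch of that behavior, the following runs grep on a small throwaway file. The file name and its contents are made up for illustration and have nothing to do with the project dataset.

```shell
# Build a tiny throwaway file so the example does not depend on any real data.
tmpdir=$(mktemp -d)
printf '%s\n' 'the cat sat' 'a dog barked' 'concatenate files' > "$tmpdir/sample.txt"

# Print every line that matches the regular expression "cat".
# Note that "concatenate" matches too: grep looks for the pattern anywhere in a line.
matches=$(grep 'cat' "$tmpdir/sample.txt")
echo "$matches"

# -c counts the matching lines instead of printing them.
count=$(grep -c 'cat' "$tmpdir/sample.txt")
echo "$count"

rm -rf "$tmpdir"
```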
Let’s assume for a second that we didn’t provide you with the location of this project’s dataset, and you didn’t know the name of the file either. You do know, however, that it is the only dataset with the text "That’s the sort of fraudy fraudulent fraud that Wells Fargo defrauds its fraud-victim customers with. Fraudulently." in it.
When you search for this sentence in the file, make sure that you type the single quote in "That’s" so that you get a regular ASCII single quote. Otherwise, you will not find this sentence. Or, just use a unique part of the sentence that will likely not exist in another file.
Write a grep command that finds the dataset. You can start in the /anvil/projects/tdm/data directory to reduce the amount of text being searched. In addition, use a wildcard to limit the search to directories inside /anvil/projects/tdm/data whose names start with con. Just know that you’d eventually find the file without using the wildcard, but we don’t want to waste your time.
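The idea of combining a recursive search with a wildcard can be sketched on a throwaway directory tree. The directory and file names below are hypothetical stand-ins, not the real layout of /anvil/projects/tdm/data.

```shell
# Hypothetical directory tree: two directories start with "con", one does not.
base=$(mktemp -d)
mkdir -p "$base/consumer_demo" "$base/contracts_demo" "$base/other_demo"
printf '%s\n' 'id,text' '1,a fraudy fraudulent fraud' > "$base/consumer_demo/a.csv"
printf '%s\n' 'id,text' '2,nothing to see' > "$base/contracts_demo/b.csv"
printf '%s\n' 'id,text' '3,a fraudy fraudulent fraud' > "$base/other_demo/c.csv"

# -r searches directories recursively; -l prints only the names of matching files.
# The shell expands con* before grep runs, so other_demo is never searched.
found=$(cd "$base" && grep -r -l 'fraudy fraudulent fraud' con*)
echo "$found"

rm -rf "$base"
```

Because the wildcard is expanded by the shell rather than by grep, the same trick works with any command that accepts a list of paths.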
Use |
- Code used to solve this problem.
- Output from running the code.
Question 2
In the previous project, you learned about a command that could quickly print out the first n lines of a file. A csv file typically has a header row to explain what data each column holds. Use the command you learned to print out the first line of the file, and only the first line of the file.
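A small sketch of reading a header row, using a made-up three-column file rather than the project dataset. Numbering the column names is an optional extra that makes it easier to refer to columns by position later.

```shell
# Made-up CSV with a header row and two data rows.
tmpdir=$(mktemp -d)
printf '%s\n' 'id,state,note' '1,FL,first' '2,GA,second' > "$tmpdir/demo.csv"

# head -n 1 prints the first line and nothing else.
header=$(head -n 1 "$tmpdir/demo.csv")
echo "$header"

# Turn commas into newlines, then let grep -n prefix each name with its position.
numbered=$(echo "$header" | tr ',' '\n' | grep -n '.')
echo "$numbered"

rm -rf "$tmpdir"
```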
Great, now that you know what each column holds, repeat Question 1, but format the output so that it shows the complaint_id, consumer_complaint_narrative, and state columns. Print only the first 100 lines (using head) so our notebook is not too full of text.
Now, use cat, head, tail, and cut to isolate those same 3 columns for the single line where we heard about the "fraudy fraudulent fraud".
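One way to chain these tools can be sketched on a made-up file. The column layout here is hypothetical, and the field numbers would differ for the real dataset. Keep in mind that cut splits on every delimiter it sees, including commas inside quoted text, so the result is only reliable when the delimiter does not appear inside a field.

```shell
# Made-up CSV: the interesting row is on line 3 of the file.
tmpdir=$(mktemp -d)
printf '%s\n' 'id,state,note,extra' '1,FL,fine,x' '2,GA,fraudy,y' '3,AL,fine,z' > "$tmpdir/demo.csv"

# grep -n prefixes each match with "LINE:"; cut keeps the part before the first colon.
lineno=$(grep -n 'fraudy' "$tmpdir/demo.csv" | cut -d ':' -f 1)
echo "$lineno"

# head keeps lines 1..lineno, tail keeps the last of those, cut keeps fields 1 and 3.
row=$(cat "$tmpdir/demo.csv" | head -n "$lineno" | tail -n 1 | cut -d ',' -f 1,3)
echo "$row"

rm -rf "$tmpdir"
```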
You can find the exact line from the file where the "fraudy fraudulent fraud" occurs, by using the |
- Code used to solve this problem.
- Output from running the code.
Question 3
Imagine a scenario where we are dealing with a much bigger dataset. Imagine that we live in the southeast and are really only interested in analyzing the data for Florida, Georgia, Mississippi, Alabama, and South Carolina. In addition, we are only interested in the consumer_complaint_narrative, state, tags, and complaint_id columns.
Use UNIX tools to, in one line, create a new dataset called southeast.csv that only contains the data for the five states mentioned above, and only the columns listed above.
Be careful you don’t accidentally get lines with a word like "CAPITAL" in them (AL is the state code of Alabama and is present in the word "CAPITAL").
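The pitfall can be reproduced on a made-up file. In this sketch the state code happens to be the last column, which is an assumption for illustration only, so the pattern is anchored with a leading comma and the end-of-line marker. For the real file, the surrounding delimiters would need to match wherever the state column actually sits.

```shell
# Made-up CSV in which "AL" appears both inside a word and as a state code.
tmpdir=$(mktemp -d)
printf '%s\n' 'id,company,state' '1,CAPITAL ONE,NY' '2,Small Bank,AL' '3,Other Bank,FL' > "$tmpdir/demo.csv"

# A bare pattern also matches the "AL" inside "CAPITAL".
loose=$(grep -c 'AL' "$tmpdir/demo.csv")
echo "$loose"

# Requiring a comma before and the end of the line after rules that out.
strict=$(grep -c ',AL$' "$tmpdir/demo.csv")
echo "$strict"

# -E enables alternation, so several state codes fit in a single pattern.
both=$(grep -c -E ',(AL|FL)$' "$tmpdir/demo.csv")
echo "$both"

rm -rf "$tmpdir"
```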
How many rows of data remain? How many megabytes is the new file? Use cut to isolate just the data we ask for. For example, just print the number of rows, and just print the size of the file (in MB).
20M
-rw-r--r-- 1 x-kamstut x-tdm-admin 20M Dec 13 10:59 /home/x-kamstut/southeast.csv
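A possible way to print only the number and only the size, sketched on a made-up file. This version uses du rather than ls because du separates the size from the path with a tab, which is cut's default delimiter. Note that du reports disk usage, which can differ slightly from the byte count that ls shows.

```shell
# Made-up CSV with a header and two data rows.
tmpdir=$(mktemp -d)
printf '%s\n' 'id,state' '1,FL' '2,GA' > "$tmpdir/demo.csv"

# Reading from standard input keeps wc from printing the file name.
# tr strips the padding spaces that some wc implementations add.
rows=$(wc -l < "$tmpdir/demo.csv" | tr -d ' ')
echo "$rows"

# du -h prints "SIZE<tab>PATH"; cut -f 1 keeps only the size.
size=$(du -h "$tmpdir/demo.csv" | cut -f 1)
echo "$size"

rm -rf "$tmpdir"
```

The row count here includes the header line, so subtract one if only data rows are wanted.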
- Code used to solve this problem.
- Output from running the code.
Question 4
We want to isolate some of our southeast complaints. Return rows from our new dataset, southeast.csv, that contain one of the following words: "wow", "irritating", or "rude", followed by at least 1 exclamation mark. Do this with just a single grep command. Ignore case (that is, match the words regardless of which letters are capitalized). Limit your output to only 5 rows (using head).
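The combination of alternation, repetition, and case-insensitive matching can be sketched on a few made-up lines. The sample sentences are invented for illustration.

```shell
# Made-up lines: three contain a target word directly followed by "!", two do not.
tmpdir=$(mktemp -d)
printf '%s\n' 'Wow! that was fast' 'so RUDE!!! never again' 'rude staff' 'wow, ok' 'Irritating! call back' > "$tmpdir/demo.txt"

# -i ignores case, -E enables (a|b|c) alternation and the + quantifier.
# !+ means "one or more exclamation marks" right after the word.
hits=$(grep -i -E '(wow|irritating|rude)!+' "$tmpdir/demo.txt" | head -n 5)
echo "$hits"

# Same pattern, counting matching lines instead of printing them.
n=$(grep -i -c -E '(wow|irritating|rude)!+' "$tmpdir/demo.txt")
echo "$n"

rm -rf "$tmpdir"
```

Single quotes around the pattern keep the shell from interpreting the parentheses, pipes, and exclamation mark.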
- Code used to solve this problem.
- Output from running the code.
Question 5
If you pay attention to the consumer_complaint_narrative column in our new dataset, southeast.csv, you’ll notice that some of the narratives contain dollar amounts in curly braces { and }. Use grep to find the narratives that contain at least one dollar amount enclosed in curly braces. Use head to limit output to only the first 5 results.
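A rough sketch of matching a braced dollar amount, under the assumption that an amount looks like {$100.00} or {$35}: an opening brace, a dollar sign, digits, an optional decimal part, and a closing brace. The real narratives may contain other forms, so treat the pattern as a starting point and not as the expected answer.

```shell
# Made-up narratives: two contain a braced dollar amount, two do not.
tmpdir=$(mktemp -d)
printf '%s\n' 'paid {$100.00} yesterday' 'paid $100.00 in cash' 'see {XXXX} for details' 'fee of {$35} charged twice' > "$tmpdir/demo.txt"

# With -E, braces and the dollar sign are special, so each is escaped with a backslash.
# (\.[0-9]+)? makes the decimal part optional.
n=$(grep -c -E '\{\$[0-9]+(\.[0-9]+)?\}' "$tmpdir/demo.txt")
echo "$n"

# -o prints only the matched text, which helps when checking what a pattern catches.
amounts=$(grep -o -E '\{\$[0-9]+(\.[0-9]+)?\}' "$tmpdir/demo.txt")
echo "$amounts"

rm -rf "$tmpdir"
```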
Use the option |
There are instances like
And that the following are not matched:
Regex is hard. Try the following logic.
To verify your answer, the following code should have the following result.
result
3185125 3184467 3183547 3183544 3182879
- Code used to solve this problem.
- Output from running the code.
Please make sure to double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure that what you think you submitted is what you actually submitted. In addition, please review our submission guidelines before submitting your project.