STAT 19000: Project 15 — Fall 2020

Motivation: Some people say it takes 20 hours to learn a skill; others say 10,000 hours. What is certain is that it takes time. In this project we will explore an interesting dataset and exercise some of the skills learned this semester.

Context: This is the final project of the semester. We sincerely hope that you’ve learned something, and that we’ve provided you with firsthand experience digging through data.

Scope: R

Learning objectives
  • Read and write basic (csv) data.

  • Explain and demonstrate: positional, named, and logical indexing.

  • Utilize apply functions in order to solve a data-driven problem.

  • Gain proficiency using split, merge, and subset.

Dataset

The following questions will use the dataset found in Scholar:

/class/datamine/data/donerschoose/

Questions

Question 1

Read the data /class/datamine/data/donerschoose/Projects.csv into a data.frame called projects. Make sure you use the function you learned in Project 13 (fread) from the data.table package to read the data. Don’t forget to then convert the data.table into a data.frame. Let’s do an initial exploration of this data. What types of projects (Project.Type) are there? How many resource categories (Project.Resource.Category) are there?
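The read-then-explore pattern described above can be sketched on a small made-up CSV rather than the real Projects.csv. The file contents below are invented for illustration, and the sketch assumes the data.table package is installed; it is one possible approach, not the expected solution.

```r
library(data.table)

# Write a tiny, made-up CSV to a temporary file (a stand-in for Projects.csv).
toy_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Project.ID,Project.Type,Project.Resource.Category",
  "p1,Teacher-Led,Books",
  "p2,Teacher-Led,Technology",
  "p3,Student-Led,Books",
  "p4,Professional Development,Supplies"
), toy_path)

# fread returns a data.table; convert it to a plain data.frame.
toy_projects <- as.data.frame(fread(toy_path))

# Distinct values of one column, and the number of distinct values of another.
project_types <- unique(toy_projects$Project.Type)
n_categories <- length(unique(toy_projects$Project.Resource.Category))

print(class(toy_projects))
print(project_types)
print(n_categories)
```

A call such as table(toy_projects$Project.Type) would additionally show how many rows fall under each type.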

Items to submit
  • R code used to solve the question.

  • 1-2 sentences containing the project’s types and how many resource categories are in the dataset.

Question 2

Create two new variables in projects: the number of days a project lasted and the number of days until the project was fully funded. Name those variables project_duration and time_until_funded, respectively. To calculate them, use the project’s posted date (Project.Posted.Date), expiration date (Project.Expiration.Date), and fully funded date (Project.Fully.Funded.Date). What are the shortest and longest times until a project is fully funded? As a consistency check, see whether any projects have a negative duration. If so, how many?

You may find the argument units in the function difftime useful.

Be sure to pay attention to the order of the arguments in difftime: it computes the first time minus the second.

Note that if you used the fread function from data.table to read in the data, you will not need to convert the columns to dates.

It is not required that you use difftime.
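As a hedged illustration of the hints above, the following sketch applies difftime to three made-up rows. The column names posted, expires, and funded are hypothetical stand-ins, and the dates are invented; the point is only to show the argument order and the units argument.

```r
# Made-up dates; the column names here are hypothetical.
toy <- data.frame(
  posted  = as.Date(c("2018-01-01", "2018-03-10", "2018-05-20")),
  expires = as.Date(c("2018-05-01", "2018-03-01", "2018-09-20")),
  funded  = as.Date(c("2018-01-15", NA, "2018-06-19"))
)

# difftime(time1, time2) computes time1 - time2, so the later date goes first.
toy$duration <- as.numeric(difftime(toy$expires, toy$posted, units = "days"))
toy$days_to_fund <- as.numeric(difftime(toy$funded, toy$posted, units = "days"))

# Shortest and longest funding times, ignoring rows with no funded date (NA).
funding_range <- range(toy$days_to_fund, na.rm = TRUE)

# Number of rows whose expiration date precedes the posting date.
n_negative <- sum(toy$duration < 0, na.rm = TRUE)

print(toy)
print(funding_range)
print(n_negative)
```

Swapping the two arguments flips the sign of every result, which is why the order matters. Subtracting two Date columns directly (expires - posted) is an alternative that avoids difftime altogether.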

Items to submit
  • R code used to solve the question.

  • Shortest and longest times until a project is fully funded.

  • 1-2 sentences stating whether any projects have a negative duration, and if so, how many.

Question 3

As you noted in Question 2, there may be some projects with a negative duration. Because these projects raise concerns about data quality, filter the projects data to exclude the projects with negative duration, and call this filtered data selected_projects. With that filtered data, make a dotchart of the mean time until the project is fully funded (time_until_funded) for the various resource categories (Project.Resource.Category). Make sure to comment on your results. Are they surprising? Could there be another variable influencing this result? If so, name at least one.

You will first need to average time until fully funded for the different categories before making your plot.

To make your dotchart look nicer, you may want to first order the average time until fully funded before passing it to the dotchart function. In addition, consider reducing the y-axis font size using the argument cex.
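The filter, average, order, and plot steps above can be sketched on made-up data as follows. The values are invented, and sapply over split is just one of several ways to compute group means (tapply or aggregate would also work); treat this as a sketch rather than the intended solution.

```r
# Made-up data using the variable names from the questions above.
toy <- data.frame(
  Project.Resource.Category = c("Books", "Books", "Technology",
                                "Supplies", "Technology", "Supplies"),
  project_duration  = c(100, 120, -3, 90, 110, 80),
  time_until_funded = c(10, 20, 40, 5, 60, NA),
  stringsAsFactors = FALSE
)

# Keep only the rows with a non-negative duration.
toy_selected <- subset(toy, project_duration >= 0)

# Mean time until funded per category; na.rm drops never-funded projects.
mean_time <- sapply(
  split(toy_selected$time_until_funded, toy_selected$Project.Resource.Category),
  mean, na.rm = TRUE
)

# Order the means so the dotchart reads from smallest to largest.
sorted_means <- sort(mean_time)

# cex below 1 shrinks the text, which helps when there are many categories.
dotchart(sorted_means, cex = 0.7, xlab = "Mean days until fully funded")
```

Because sapply returns a named numeric vector, dotchart picks up the category names as labels automatically.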

Items to submit
  • R code used to solve the question.

  • Resulting dotchart.

  • 1-2 sentences commenting on your plot. Make sure to mention whether you are surprised or not by the results. Don’t forget to add if you think there could be more factors influencing your answer, and if so, be sure to give examples.

Question 4

Read /class/datamine/data/donerschoose/Schools.csv into a data.frame called schools. Combine selected_projects and schools by School.ID, keeping only the School.IDs present in both datasets. Name the combined data.frame selected_projects. Use the newly combined data to determine the percentage of already fully funded projects (Project.Current.Status) for schools in West Lafayette, IN. In addition, determine the state (School.State) with the highest number of projects. Be sure to specify the number of projects this state has.

West Lafayette, IN zip codes are 47906 and 47907.
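A minimal sketch of the combine-then-summarize steps, using two made-up data.frames. The zip code column name School.Zip and the status label "Fully Funded" are assumptions made for illustration; check the actual column names and values in the real data before relying on them.

```r
# Made-up projects and schools; School.Zip and the status labels are assumed.
toy_projects <- data.frame(
  School.ID = c("s1", "s1", "s2", "s3", "s4"),
  Project.Current.Status = c("Fully Funded", "Expired", "Fully Funded",
                             "Fully Funded", "Live"),
  stringsAsFactors = FALSE
)
toy_schools <- data.frame(
  School.ID    = c("s1", "s2", "s3", "s5"),
  School.Zip   = c(47906, 47907, 10001, 60601),
  School.State = c("Indiana", "Indiana", "New York", "Illinois"),
  stringsAsFactors = FALSE
)

# merge keeps only keys present in both inputs by default (all = FALSE).
combined <- merge(toy_projects, toy_schools, by = "School.ID")

# Percentage of fully funded projects among the West Lafayette zip codes.
local <- combined[combined$School.Zip %in% c(47906, 47907), ]
pct_funded <- 100 * mean(local$Project.Current.Status == "Fully Funded")

# Projects per state, then the state with the most projects.
state_counts <- table(combined$School.State)
top_state <- names(which.max(state_counts))
top_count <- max(state_counts)

print(pct_funded)
print(top_state)
print(top_count)
```

School s4 (no matching school) and school s5 (no matching project) both drop out of the merged result, which is the behavior the question asks for.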

Items to submit
  • R code used to solve the question.

  • 1-2 sentences stating the percentage of already fully funded projects for schools in West Lafayette, IN, the state with the highest number of projects, and the number of projects this state has.

Question 5

Using the combined selected_projects data, get the school(s) (School.Name), city/cities (School.City) and state(s) (School.State) for the teacher with the highest percentage of fully funded projects (Project.Current.Status).

There are many ways to solve this problem. For example, one option to get the teacher’s ID is to create a variable indicating whether or not the project is fully funded and use tapply. Another option is to build a table of proportions with prop.table and select the corresponding column/row.

Note that each row in the data corresponds to a unique project ID.

Once you have the teacher’s ID, consider filtering the combined data to keep only the rows that belong to that teacher, and only the columns we are interested in: School.Name, School.City, and School.State. Then, you can get the unique values in this shortened data.

To get only certain columns when subsetting, you may find the argument select from subset useful.
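The tapply route described above might look like the following on made-up data. The column name Teacher.ID, the status label "Fully Funded", and all values are assumptions for illustration only; this is a sketch of one option, not the required approach.

```r
# Made-up data; Teacher.ID and the status labels are assumed names/values.
toy <- data.frame(
  Teacher.ID = c("t1", "t1", "t2", "t2", "t2", "t3"),
  Project.Current.Status = c("Fully Funded", "Expired", "Fully Funded",
                             "Fully Funded", "Fully Funded", "Expired"),
  School.Name  = c("Alpha Elementary", "Alpha Elementary", "Beta High",
                   "Beta High", "Gamma Middle", "Delta Academy"),
  School.City  = c("Springfield", "Springfield", "Riverton",
                   "Riverton", "Riverton", "Lakeside"),
  School.State = c("Indiana", "Indiana", "Ohio", "Ohio", "Ohio", "Texas"),
  stringsAsFactors = FALSE
)

# Indicator of fully funded projects; its mean per teacher is a proportion.
toy$is_funded <- toy$Project.Current.Status == "Fully Funded"
funded_share <- tapply(toy$is_funded, toy$Teacher.ID, mean)

# Keep every teacher tied for the highest proportion.
top_teachers <- names(funded_share)[funded_share == max(funded_share)]

# Rows for those teachers, restricted to the school columns, de-duplicated.
top_schools <- unique(subset(
  toy,
  Teacher.ID %in% top_teachers,
  select = c(School.Name, School.City, School.State)
))

print(funded_share)
print(top_schools)
```

Comparing against max(funded_share) instead of using which.max keeps all teachers when several share the top proportion, which can happen in real data.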

Items to submit
  • R code used to solve the question.

  • Output of your code containing the school(s), city/cities, and state(s) of the selected teacher.