A course in quantitative research workflow for students in the higher education administration program at the University of Florida
In this lesson, we’ll discuss how to organize both a project directory and an R script. While there’s no one exact way to do either, there are good practices that you should generally follow. We’ll begin with how to organize your project files.
Every data analysis project should have its own set of organized folders. Just like you might organize a kitchen so that ingredients, cookbooks, and prepared food all have a specific cabinet or shelf, so too should you organize your project.
But computers are pretty good at finding files, you say: you can use your machine’s search feature to look for what you need. If you don’t have that many files to look through, you might not be too bad at quickly scanning to find what you want either. If this is the case, then why bother organizing a project directory? Why not just dump everything (scripts, data, figures, tables, notes, etc.) into a single folder? (My downloads folder works just fine, thank you.) If you need something, the computer can definitely find it.
So what’s the big deal?
The big deal is that you are thinking from your computer’s perspective when you should be thinking from the perspective of you, your collaborators (which includes your future self), and future replicators (which also includes yourself). Search features are nice, but there’s no substitute for being able to look through a project’s files just by looking through the project folders. When a project is well organized, it’s much easier to understand how it all fits together: the inputs, processes, and outputs.
You already have a basic directory structure with your class repo:
student_<your_last_name>/
|
|__ assignments/
|__ data/
|__ lessons/
|__ scripts/
|__ working/
With a project, you probably won’t have assignments or lessons folders, but you almost certainly will have folders for your scripts and data as well as a working folder (which some people call scratch, like a scratch pad).
What are we missing? Following our analogy above, we have:
Ingredients (data)
Cookbooks (scripts)
We are missing prepared food (outputs)!
What kinds of outputs are we likely to have with our project? One type quickly comes to mind: figures. Both when we start exploring our data and when we want to make a final report to share, we’re likely to make a number of plots. Let’s make a folder.
Quick exercise
Make a folder in your student_* working directory called figures. There are two ways to do this: (1) you can use the point-and-click interface in the bottom right panel of RStudio, or (2) you can use the function dir.create() in the console.
Another reason we organize our project directory is so that we have a common framework for loading, analyzing, and saving our work. This common framework allows us to use relative rather than absolute paths in our scripts.
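A quick analogy. Imagine you ask a friend for directions: how do I get to the state park? Your friend has three ways they can give you directions: (1) starting from their house, (2) starting from your house, or (3) starting from a common landmark.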
The first way is really easy for your friend, but it puts extra work on you. Either you first must go to their house so that the directions make sense, or you have to translate them from your perspective. These directions are annoying and useless if you also don’t know how to get to their house.
The second way is really nice to you, but difficult for your friend. These directions also aren’t reusable for other people. If a third friend tries to get the directions from you, you’ll just replicate the first problem. Alternately, if you aren’t at your house the directions aren’t as useful.
The third way is the best compromise. If you know how to get to a common landmark, you can get there the way that is best for you. Once there, you can follow your friend’s directions to the park. These directions are also shareable: anyone who can get to the common landmark can use the directions to get to the park.
In the world of your project, the first two sets of directions are absolute paths while the third is a relative path. If your project is organized into folders and another person (collaborator, code reviewer, future you on another computer) has the full project directory, then as long as they can get to the correct starting point (e.g., your scripts folder), they can run your project easily.
/Users/btskinner/classes/edh7916/scripts
./scripts
The second line assumes we are in the class directory (your student_* folder). Notice the . (dot)? On almost all systems, dots in a file path work like this:
. (one dot): this directory/folder
.. (two dots): up one directory
What if we are in the scripts folder, but want to access a file in the data directory? We use two dots (..):
scripts to data: ../data
What this says is:
..: back up into the main folder, student_*
/data: from the main folder, go into data
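To see the dots at work, here is a minimal sketch that builds a throwaway project in a temporary folder and then resolves ., .., and ../data from inside its scripts folder. The folder name student_example is made up for illustration; nothing here touches your real class directory.

```r
## build a throwaway project in a temporary folder
## (student_example is a hypothetical name)
proj <- file.path(tempdir(), "student_example")
dir.create(file.path(proj, "scripts"), recursive = TRUE)
dir.create(file.path(proj, "data"))

## pretend we are working inside scripts/
old_wd <- setwd(file.path(proj, "scripts"))

## resolve each relative path and keep only the folder name
here <- basename(normalizePath("."))
up_one <- basename(normalizePath(".."))
over_to_data <- basename(normalizePath(file.path("..", "data")))

print(c(here, up_one, over_to_data))

## return to where we started
setwd(old_wd)
```

The three names printed are the folder we are in, the folder above it, and the sibling data folder reached by going up and then back down.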
Quick exercise
Use R’s list.files() command in the console to show all the files in the current directory. Next, use the same command, but use a relative link to see all the files inside the scripts folder. Finally, use a relative link to see all the files in the directory above.
Quick exercise
Create a file in your new figures directory called README.md. While you can use RStudio’s point-and-click interface, I want you to use the R command file.create(). Be sure to include the proper path so that it’s created in the figures folder.
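If you want to see how these file commands behave before trying them on your own project, here is a small sketch that works entirely inside a temporary sandbox. The folder and file names (sandbox_project, tables, notes.txt) are hypothetical and chosen only for illustration.

```r
## work in a temporary sandbox so nothing in your real project changes
sandbox <- file.path(tempdir(), "sandbox_project")   # hypothetical name
dir.create(sandbox)

## make a subfolder, then an empty file inside it
tab_dir <- file.path(sandbox, "tables")              # hypothetical folder
dir.create(tab_dir)
made_file <- file.create(file.path(tab_dir, "notes.txt"))

## file.create() returns TRUE when the file was created
print(made_file)

## list what is in each folder
print(list.files(sandbox))
print(list.files(tab_dir))
```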
A general philosophy for every script (and functions within a script, for that matter) should be to do one thing and do it well. Rather than have a single giant script with 5,000 lines that does your entire project, it makes more sense to break up your scripts into shorter modules that you can then bring together. Right now, we have small scripts so having just one is fine. But this idea applies within the script itself: organize your scripts into clear sections so that a person can easily scan them to see what is happening and where.
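One way to bring short modules together is a small driver script that runs each piece in order with source(). The sketch below is only an illustration under assumed names: the two stand-in scripts (01_read.R and 02_analyze.R) and their contents are made up and written to a temporary folder so the example can run on its own.

```r
## write two tiny stand-in scripts to a temporary folder
## (file names and contents are hypothetical)
scr_dir <- file.path(tempdir(), "scripts_example")
dir.create(scr_dir)
writeLines("scores <- c(10, 20, 30)", file.path(scr_dir, "01_read.R"))
writeLines("scores_mean <- mean(scores)", file.path(scr_dir, "02_analyze.R"))

## a short "run everything" script: each step lives in its own file
steps <- c("01_read.R", "02_analyze.R")
for (step in steps) {
    source(file.path(scr_dir, step))
}

## the object created by the second script is now available
print(scores_mean)
```

Numbering the file names is one possible convention for making the run order obvious; it is a choice, not a requirement.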
Here’s a template for an R script with clearly defined sections:
################################################################################
##
## [ PROJ ] < Name of the overall project >
## [ FILE ] < Name of this particular file >
## [ AUTH ] < Your name + email / Twitter / GitHub handle >
## [ INIT ] < Date you started the file >
##
################################################################################
## ---------------------------
## libraries
## ---------------------------
## ---------------------------
## directory paths
## ---------------------------
## ---------------------------
## settings/macros
## ---------------------------
## ---------------------------
## functions
## ---------------------------
## -----------------------------------------------------------------------------
## < BODY >
## -----------------------------------------------------------------------------
## ---------------------------
## input
## ---------------------------
## ---------------------------
## process
## ---------------------------
## ---------------------------
## output
## ---------------------------
## -----------------------------------------------------------------------------
## END SCRIPT
## -----------------------------------------------------------------------------
And here’s the template with a very simple project outline:
################################################################################
##
## [ PROJ ] EDH 7916: Organizing
## [ FILE ] organizing.R
## [ AUTH ] Benjamin Skinner (@btskinner)
## [ INIT ] 13 January 2020
##
################################################################################
## ---------------------------
## libraries
## ---------------------------
library(tidyverse)
## ---------------------------
## directory paths
## ---------------------------
dat_dir <- file.path(".", "data")
fig_dir <- file.path(".", "figures")
## ---------------------------
## settings/macros
## ---------------------------
old_to_new_score_ratio <- 1.1
## ---------------------------
## functions
## ---------------------------
old_to_new_score <- function(test_score, ratio) {
    return(test_score * ratio)
}
## -----------------------------------------------------------------------------
## BODY
## -----------------------------------------------------------------------------
## ---------------------------
## input
## ---------------------------
df <- readRDS(file.path(dat_dir, "test_scores.RDS"))
## ---------------------------
## process
## ---------------------------
## add a column for new test score
## <dataframe>$<new_scores_column> <- function(<old_scores_column>, <ratio>)
df$test_scores_new <- old_to_new_score(df$test_score, old_to_new_score_ratio)
## ---------------------------
## output
## ---------------------------
saveRDS(df, file.path(dat_dir, "test_scores_updated.RDS"))
## -----------------------------------------------------------------------------
## END SCRIPT
## -----------------------------------------------------------------------------
At the very top of your script, give all the relevant information about the script.
################################################################################
##
## [ PROJ ] EDH 7916: Organizing
## [ FILE ] organizing.R
## [ AUTH ] Benjamin Skinner (@btskinner)
## [ INIT ] 13 January 2020
##
################################################################################
Specifically:
[ PROJ ]: tell what project it belongs to
[ FILE ]: give the file’s name
[ AUTH ]: give your name and a way to contact you
[ INIT ]: give the date you started the file
If you aren’t using a version control system like git, then it would make sense to also include a line for the last time you revised the file: [ REVN ]. You can still do this if you want, but the date you started the file in conjunction with the changes you commit to git should suffice to give its history.
NOTE You don’t have to make your header look exactly like mine. This is just what I’ve landed on after a few years. As long as you have the relevant information, personalize the details as you will.
After the informational header, the first thing to include is the set of libraries your script needs in order to work. In this course, you will almost always call the {tidyverse} library.
## ---------------------------
## libraries
## ---------------------------
library(tidyverse)
Notice our relative links? Rather than hard-coding / rewriting all the paths in the script below, we can save the paths in an object. We use the file.path() command so that we don’t have to type the separators ourselves. Some computer operating systems write their file paths with forward slashes, /; others use backslashes, \. Rather than try to guess or assume what operating system future users will use, we can let R’s function, file.path(), join the pieces with a separator that R understands on the current system.
## ---------------------------
## directory paths
## ---------------------------
dat_dir <- file.path(".", "data")
fig_dir <- file.path(".", "figures")
Quick exercise
Run these lines of code and then print dat_dir and fig_dir to the console. What do you see?
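Once the directory paths are saved, they can be reused to build paths to individual files. Here is a short sketch; the figure file name score_histogram.png is hypothetical, while test_scores.RDS is the data file read in the input section of the script.

```r
## same directory objects as in the script
dat_dir <- file.path(".", "data")
fig_dir <- file.path(".", "figures")

## build full file paths from the saved directory paths
scores_path <- file.path(dat_dir, "test_scores.RDS")
plot_path <- file.path(fig_dir, "score_histogram.png")   # hypothetical file name

print(scores_path)
print(plot_path)
```

If the data folder ever moves, only the dat_dir line needs to change; every path built from it follows along.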
Programmers hate “magic” numbers. What are “magic” numbers (including “magic” strings and settings here as well)? They are values that are hard-coded in your analysis script.
Say you want to convert old test scores to a new test score based on some ratio, let’s say 1.1 (10% increase). You could multiply every old test score value throughout your script by 1.1, but what if you later decide it should be 1.2 or 0.9? You need to change every instance of 1.1. (Don’t miss any!) Alternately, does 1.1 have any inherent meaning? To me, not really.
## ---------------------------
## settings/macros
## ---------------------------
old_to_new_score_ratio <- 1.1
It’s better to store these constant reusable values at the top of your script in an object (or macro, as I call it here) that has a clear name. old_to_new_score_ratio <- 1.1 is clear in its meaning, so when we use it below, we’ll know what it means. Also, if we decide we need to fix / change the value later, we only have to change it once.
Put all such numbers, strings, and settings here.
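To make the contrast concrete, here is a small sketch that computes the same result twice: once with the hard-coded number and once with the named setting. The old scores are made up for illustration.

```r
## hypothetical old scores, made up for illustration
old_scores <- c(400, 500, 600)

## hard-coded: the meaning of 1.1 is unclear, and every copy of it
## would have to be found and changed by hand
new_scores_magic <- old_scores * 1.1

## named setting: defined once, used wherever it is needed
old_to_new_score_ratio <- 1.1
new_scores_named <- old_scores * old_to_new_score_ratio
top_new_score <- max(old_scores) * old_to_new_score_ratio

## both approaches give the same numbers
print(identical(new_scores_magic, new_scores_named))
```

The results match, but only the second version tells a reader what the number is for and needs just one edit if the ratio changes.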
Other than the functions that come with the packages we load, we might write functions ourselves. At this point in the course, I’m not concerned that you know how to write a function or how one works. Just notice how the function has a good name that tells what it does: converts old scores to new scores using a ratio (we’ll use the one we defined above).
## ---------------------------
## functions
## ---------------------------
old_to_new_score <- function(test_score, ratio) {
    return(test_score * ratio)
}
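If you are curious how the function behaves, here is a quick sketch that calls it on a single score and on several scores at once. The input scores are made up for illustration.

```r
## the function and ratio from the script
old_to_new_score <- function(test_score, ratio) {
    return(test_score * ratio)
}
old_to_new_score_ratio <- 1.1

## convert one (made-up) score
single <- old_to_new_score(500, old_to_new_score_ratio)

## convert several (made-up) scores at once
several <- old_to_new_score(c(400, 500, 600), old_to_new_score_ratio)

print(single)
print(several)
```

Because multiplication in R works element by element, the same function handles one score or a whole column of scores.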
The body of your script is where the main work happens. You may need more sections, but at the very least you should generally have dedicated spots for reading in your data, working with your data, and saving any output.
Here we read in a very small tibble (a tidyverse version of a data frame) with some student IDs and their test scores. We can see this if we look at the data.
## ---------------------------
## input
## ---------------------------
df <- readRDS(file.path(dat_dir, "test_scores.RDS"))
Quick exercise
Look at the data, both using RStudio’s View and by using some of the basic techniques we’ve learned so far.
Let’s say that all we want to do is add a new column in which we have the new test scores as computed with our function and ratio. We can use base R to add a new column using the $ notation.
## ---------------------------
## process
## ---------------------------
## add a column for new test score
## <dataframe>$<new_scores_column> <- function(<old_scores_column>, <ratio>)
df$test_scores_new <- old_to_new_score(df$test_score, old_to_new_score_ratio)
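Here is the same $ pattern on a tiny stand-in data frame, so you can watch the column appear. The data frame df_toy and its values are made up for illustration and are not the course data.

```r
## function and ratio from the script
old_to_new_score <- function(test_score, ratio) {
    return(test_score * ratio)
}
old_to_new_score_ratio <- 1.1

## a toy data frame (made-up values, not the course data)
df_toy <- data.frame(stu_id = 1:3,
                     test_score = c(400, 500, 600))
names_before <- names(df_toy)

## assigning to a column name that doesn't exist yet creates the column
df_toy$test_scores_new <- old_to_new_score(df_toy$test_score,
                                           old_to_new_score_ratio)
names_after <- names(df_toy)

print(names_before)
print(names_after)
```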
Finally, we save our tibble. Notice how we use a new name. We’ll talk more about data consistency in a couple of lessons, but we never want to overwrite our original data. Save a new file with a useful name to keep everything separate.
## ---------------------------
## output
## ---------------------------
saveRDS(df, file.path(dat_dir, "test_scores_updated.RDS"))
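To check that saving under a new name really does leave the original alone, here is a sketch that runs the whole read, change, save cycle in a temporary folder. The data frame is made up for illustration; only the two file names come from the script.

```r
## a temporary folder standing in for the data directory
out_dir <- file.path(tempdir(), "data_example")
dir.create(out_dir)

## a made-up "original" data file
df_orig <- data.frame(stu_id = 1:3, test_score = c(400, 500, 600))
saveRDS(df_orig, file.path(out_dir, "test_scores.RDS"))

## read, modify, and save under a NEW name
df_new <- readRDS(file.path(out_dir, "test_scores.RDS"))
df_new$test_scores_new <- df_new$test_score * 1.1
saveRDS(df_new, file.path(out_dir, "test_scores_updated.RDS"))

## both files exist, and the original is unchanged on disk
saved_files <- list.files(out_dir)
orig_on_disk <- readRDS(file.path(out_dir, "test_scores.RDS"))
orig_unchanged <- isTRUE(all.equal(orig_on_disk, df_orig))

print(saved_files)
print(orig_unchanged)
```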
When naming folders, files, objects, or macros, keep these naming rules in mind:
data_clean.R or old_to_new_score_ratio
data_1.RDS, data_2.RDS, data_3.RDS, etc.
Avoid: data clean.R
Prefer: data_clean.R
More person hours are lost than can be counted dealing with file names with spaces. Don’t do it! Use underscores or hyphens to separate words.
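If you inherit files with spaces in their names, a couple of base R functions can flag and repair the names. This is a minimal sketch on made-up file names; it only works with the names as text and does not rename anything on disk.

```r
## hypothetical file names, some with spaces
file_names <- c("data clean.R", "data_clean.R", "final report v2.pdf")

## flag names that contain a space
has_space <- grepl(" ", file_names, fixed = TRUE)

## swap each space for an underscore
safe_names <- gsub(" ", "_", file_names, fixed = TRUE)

print(has_space)
print(safe_names)
```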