A course in quantitative research workflow for students in the higher education administration program at the University of Florida
In this lesson, we’ll discuss how to organize both a project directory and an R script. While there’s no one exact way to do either, there are good practices that you should generally follow.
We’ll begin with how to organize your course and project files.
Every data analysis project should have its own set of organized folders. Just like you might organize a kitchen so that ingredients, cookbooks, and prepared food all have a specific cabinet or shelf, so too should you organize your project. We’ll organize our course directory in a similar fashion.
But computers are pretty good at finding files, you say: you can use your machine’s search feature to look for what you need. If you don’t have that many files to look through, you might not be too bad at quickly scanning to find what you want either. If this is the case, then why bother organizing a project directory? Why not just dump everything (scripts, data, figures, tables, notes, etc.) into a single folder? (My downloads folder works just fine, thank you…) If you need something, the computer can definitely find it.
So what’s the big deal?
The big deal is that you are thinking from your computer’s perspective when you should be thinking from the perspective of you, your collaborators (which includes your future self), and future replicators (which also includes yourself). Search features are nice, but there’s no substitute for being able to look through a project’s files just by looking through the project folders. When a project is well organized, it’s much easier to understand how everything — each input, process, and output — fits together.
Let’s set up a basic directory structure for this class:
past/edh7916/2020/summer/
|
|__ assignments/
|__ data/
|__ figures/
|__ final_project/
|__ lessons/
|__ scripts/
|__ working/
As you can see, we have a main directory (or folder; same thing) for the course called past/edh7916/2020/summer. This can live on your computer wherever you want to put it: on your Desktop, in your home directory, in another folder where you store materials for your other classes, or wherever makes sense for you.
Inside the main course directory, there are subdirectories (or subfolders; again, same thing) for the different types of files you’ll collect or create this term. These subdirectories have self-explanatory names: PDFs for assignments go into assignments, PDFs for lessons into lessons, and so on.
Note that this type of structure works well with research projects. Of course, you’re unlikely to have assignments or lessons subfolders within a research project directory, but you almost certainly will have subfolders for your scripts, data, and figures as well as a working folder (which some people call scratch, like a scratch pad) where you can store odds and ends or practice new ideas.
You may ask: why these folders in particular? Or, why should I have separate data, scripts, and figures folders in my project directory?
Think about it this way. Following our kitchen analogy from before, we have:
- data: the ingredients
- scripts: the cookbooks
- figures: the prepared food
Particular projects may require particular folders (for example, you may find it useful to have a special subfolder for tables or one for regression output called estimates). But in almost all cases, your project directory should have separate subfolders for your data, your analysis scripts, and any output you produce.
You can create new directories using your operating system. On both macOS and Windows, one of the easiest ways to make a new folder is to right-click on your Desktop and choose to create a new folder. You can then open this folder and keep right-clicking to create subfolders until you have what you need.
You can also use the RStudio Files tab (lower right facet) to create new folders. We’ll do it that way.
Let’s say I want to create a course directory on my Desktop.
When I open RStudio, notice that it starts in my Home folder (we’ll talk more about what it means to “start in” below). You can see this by looking at the menu bar in the Files facet. It also helps that I can see all the folders and files listed in the window. Because I know generally what’s in my home directory (Applications, Desktop, Documents), it’s clear to me that that’s where I am. Depending on your settings, RStudio may start somewhere else for you. That’s fine; just check the menu bar.
So I’m in my home folder, but I want to be in my Desktop folder. Easy enough: I can just click the Desktop link in the window.
Notice how the file list has changed. There’s nothing there! That’s because I have a very clean Desktop! But if you have files on your Desktop, you should see those now. Either way, I can see that the menu bar address has changed to add a folder: Home > Desktop.
To create a new folder, I can click on the New Folder button and then give my new folder a name in the drop down menu.
Now you can see my course folder, which I’ve called past/edh7916/2020/summer.
Clicking on that, I see the menu address change to Home > Desktop > past/edh7916/2020/summer and nothing inside, which is expected since it’s a newly created folder.
Time to add the subdirectories!
Quick exercise
Create your course directory (wherever you like — just keep track!) and add the following subdirectories to it:
- assignments
- data
- figures
- final_project
- lessons
- scripts
- working
When you’re finished, it should look something like this in the Files facet:
NB: Your menu path and course directory name may look different from mine, but the subdirectories in your course directory should look the same.
Once you’ve done that, place all the PDFs, scripts, and data you’ve downloaded so far into their proper folders.
NB: When you download course files, they most likely end up in your computer’s downloads folder by default. If that’s the case, just use the file system on your computer to drag and drop these files where they belong.
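If you prefer the console to clicking, the same folders can also be created with base R's dir.create(). This is only a sketch: it builds the structure inside a temporary folder, and the name course_demo is a placeholder for whatever you call your own course directory.

```r
## placeholder location: a temporary folder stands in for, e.g., your Desktop
course_dir <- file.path(tempdir(), "course_demo")

## the subdirectories used in this course
sub_dirs <- c("assignments", "data", "figures", "final_project",
              "lessons", "scripts", "working")

## create each subdirectory (recursive = TRUE also creates course_dir itself)
for (d in sub_dirs) {
    dir.create(file.path(course_dir, d), recursive = TRUE, showWarnings = FALSE)
}

## confirm what was created
list.files(course_dir)
```

To use this for real, you would swap the temporary folder for the location you actually want, such as your Desktop.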
Above, I said
When I open RStudio, notice that it starts in my Home folder…
What does that mean, starts in?
There are two common ways to open a file on your computer with the correct application:
- open the application first and then open the file from inside it
- double-click the file and let your computer choose the application (e.g., double-click an *.xlsx file and Excel will open with your file ready to go).
In this way, RStudio is an application just like any other application on your computer. If you double-click on *.R scripts or *.md or *.Rmd files, it’s likely that RStudio will open. However, if you open RStudio first, it begins by assuming you want to look for/store files in a default location, most likely your Home directory unless you’ve changed the settings.
This is what I mean by starts in.
As you might be guessing by this point, how people set up their computers or organize their files will affect where RStudio needs to look for the correct files. One skill you’ll have to develop is being able to navigate to the proper working directory, that is, telling RStudio where your files are located.
After downloading today’s scripts (the icons at the top of the page) and adding them to the scripts subdirectory of my course directory, I’ve opened one in RStudio.
Despite what I see in the Files facet, however, my working directory is still set to my home directory. This means that when RStudio starts looking for files, it’s going to start in the home folder.
How do I know that? Notice that the path shown on the Console facet doesn’t match the one on the Files facet. Also, when I run the R function list.files() in the console (by default, it lists all the files in the current directory), I see the files in my home (~) directory rather than those in the ~/Desktop/past/edh7916/2020/summer/scripts directory.
There are a number of ways to fix this, but the easiest way is to click the More button (with the gear) in the Files facet. In the drop down menu, you’ll see an option to “Set As Working Directory.” Because you are in the scripts directory, you can click it to correctly tell RStudio that you are working in scripts, not elsewhere.
Clicking “Set As Working Directory” sends the command setwd() to the Console with the correct path to get from where RStudio starts to where you want to be. Now you can see that the paths in each facet’s menu bar match and, when we run list.files() again, we now see the two R scripts listed on the left that we see listed on the right.
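The same check and fix can be typed directly into the Console. Here is a minimal sketch, assuming a stand-in scripts folder created in a temporary location (the wd_demo name is made up, and your real path will differ):

```r
## remember where R is currently looking
old_wd <- getwd()

## stand-in for your own scripts folder (hypothetical path)
scripts_dir <- file.path(tempdir(), "wd_demo", "scripts")
dir.create(scripts_dir, recursive = TRUE, showWarnings = FALSE)

## this is the command that "Set As Working Directory" sends for you
setwd(scripts_dir)

## confirm the change and list the files R now sees
new_wd <- getwd()
new_wd
list.files()

## go back to where we started
setwd(old_wd)
```

Because getwd() reports the current working directory, running it before and after setwd() is a quick way to confirm that R is looking where you think it is.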
Now that you’ve found your course script files in RStudio, the next step is to understand how to make sure they can find your data and output folders.
Every quantitative data analysis will need to coordinate the three ingredients discussed above (data, scripts, output). When your script attempts to read in data, it has to be able to find it on your computer. Similarly, when it’s time to make a figure, it needs to know where to save it. In other words, you need to give it paths to your other folders.
There are two basic types of paths — relative and absolute — and which type you choose to use will affect the transferability and replicability of your work.
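A quick analogy. Imagine you ask a friend for directions: how do I get to the state park? Your friend has three ways they can give you directions:
- from their house to the park
- from your house to the park
- from a landmark you both know to the park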
The first way is really easy for your friend, but it puts extra work on you. Either you have to go to their house first so that the directions make sense, or you have to translate them from your perspective. Directions like these are annoying at best and useless if you don’t know how to get to their house.
The second way is really nice to you, but difficult for your friend. However, the directions aren’t as useful if you don’t plan to start from your house. These directions also aren’t reusable for other people: if a third friend tries to get the directions from you, you’ll just replicate the first problem for them.
The third way is the best compromise. If you each know how to get to a common landmark, then you can head to the landmark however works best for you and, once there, follow your friend’s directions to the park. As a bonus, these directions are shareable: anyone who can get to the common landmark can use the directions to get to the park.
In the world of our class directory, the first two sets of directions are absolute paths while the third is a relative path. Since we all share the same course directory structure, as long as each of us can get to the primary directory (e.g. past/edh7916/2020/summer), then all the subdirectories will be in the same relative locations on each of our computers. In practice, this means I can give you all the same script and it should work.
The same holds true for a research project directory. If your project is organized into folders and you share the full project directory with another researcher, then as long as they can get to the correct starting point (e.g. your scripts folder), they can run your project easily.
Let’s say I want to give you the path to the course scripts directory. Here are two:
~/Desktop/past/edh7916/2020/summer/scripts
./scripts
The first, an absolute path, only works for you if you both named your course folder past/edh7916/2020/summer and put it on your desktop. For some of you, it might work, but that would be due to happenstance rather than good planning on my part.
The second, a relative path, assumes we are in the primary class directory (past/edh7916/2020/summer). Notice the . (dot)? On most systems, dots in a file path work like this:
- . (one dot): this directory/folder
- .. (two dots): up one directory (e.g. from past/edh7916/2020/summer/scripts to past/edh7916/2020/summer)
What if we are in the scripts folder, but want to access a file in the data directory? We use two dots (..):
- scripts to data: ../data
What this says is:
- ..: back up into the main folder, past/edh7916/2020/summer
- /data: from the main folder (past/edh7916/2020/summer), go into data
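To see how the dots resolve in practice, here is a small sketch that builds a throwaway project in a temporary folder and then looks around from inside its scripts folder. The names paths_demo and placeholder.txt are made up for illustration.

```r
## build a tiny stand-in project in a temporary folder (names are placeholders)
root <- file.path(tempdir(), "paths_demo")
dir.create(file.path(root, "scripts"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "data"), showWarnings = FALSE)
file.create(file.path(root, "data", "placeholder.txt"))

## pretend we are working in scripts; setwd() returns the previous directory
old_wd <- setwd(file.path(root, "scripts"))

here    <- basename(normalizePath("."))          # one dot: this folder
above   <- basename(normalizePath(".."))         # two dots: one folder up
in_data <- list.files(file.path("..", "data"))   # up one, then into data

## return to the original working directory
setwd(old_wd)

here      # "scripts"
above     # "paths_demo"
in_data   # "placeholder.txt"
```

The relative pieces never mention where the project lives on the computer, which is what makes them portable.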
Quick exercise
Use R’s list.files(".") command in the console to show all the files in the current directory. Next, use list.files("..") to show all the files in the directory above the current. Finally, use list.files("<...>") with a relative link to see all the files inside the data folder (replace <...> with the correct relative path).
A general philosophy for every project: your scripts should do one thing and do it well. Rather than have a single giant script with 5,000 lines that run your entire analysis, it’s generally clearer to have many smaller scripts that are run in conjunction with one another (e.g. one script that reads in the data and cleans it, one that performs the analysis, one that makes the figures, etc).
But even if you only need a single small script, this organizing principle still applies. Always organize your scripts into clear sections in which you perform specific analytic tasks.
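One common way to run several small scripts in order is a short "run everything" script that calls source() on each one. The sketch below writes two tiny throwaway scripts to a temporary folder so that it can be run as-is; the file names 01_clean.R and 02_analyze.R are made up for illustration and are not part of the course files.

```r
## throwaway project folder with two one-line scripts (hypothetical names)
proj <- file.path(tempdir(), "run_all_demo")
dir.create(proj, showWarnings = FALSE)
writeLines("cleaned <- c(1, 2, 3)", file.path(proj, "01_clean.R"))
writeLines("total <- sum(cleaned)", file.path(proj, "02_analyze.R"))

## run each script in order, sharing one environment so that objects
## made by the first script are visible to the second
env <- new.env()
for (script in c("01_clean.R", "02_analyze.R")) {
    source(file.path(proj, script), local = env)
}

## the result computed by the second script
env$total
```

Numbering the file names is one simple way to make the intended run order obvious to a reader.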
Here’s a template for an R script with clearly defined sections that you can use:
################################################################################
##
## [ PROJ ] < Name of the overall project >
## [ FILE ] < Name of this particular file >
## [ AUTH ] < Your name + email / Twitter / GitHub handle >
## [ INIT ] < Date you started the file >
##
################################################################################
## ---------------------------
## libraries
## ---------------------------
## ---------------------------
## directory paths
## ---------------------------
## ---------------------------
## settings/macros
## ---------------------------
## ---------------------------
## functions
## ---------------------------
## -----------------------------------------------------------------------------
## < BODY >
## -----------------------------------------------------------------------------
## ---------------------------
## input
## ---------------------------
## ---------------------------
## process
## ---------------------------
## ---------------------------
## output
## ---------------------------
## -----------------------------------------------------------------------------
## END SCRIPT
## -----------------------------------------------------------------------------
And here’s the template with a very simple project outline:
################################################################################
##
## [ PROJ ] EDH 7916: Organizing
## [ FILE ] organizing.R
## [ AUTH ] Benjamin Skinner (@btskinner)
## [ INIT ] 13 January 2020
##
################################################################################
## ---------------------------
## libraries
## ---------------------------
library(tidyverse)
## ---------------------------
## directory paths
## ---------------------------
dat_dir <- file.path("..", "data")
## ---------------------------
## settings/macros
## ---------------------------
old_to_new_score_ratio <- 1.1
## ---------------------------
## functions
## ---------------------------
old_to_new_score <- function(old_test_score, ratio) {
    return(old_test_score * ratio)
}
## -----------------------------------------------------------------------------
## BODY
## -----------------------------------------------------------------------------
## ---------------------------
## input
## ---------------------------
df <- readRDS(file.path(dat_dir, "test_scores.RDS"))
## ---------------------------
## process
## ---------------------------
## add a column for new test score using our function
df <- df %>%
    mutate(test_score_new = old_to_new_score(old_test_score = test_score,
                                             ratio = old_to_new_score_ratio))
## ---------------------------
## output
## ---------------------------
saveRDS(df, file.path(dat_dir, "test_scores_updated.RDS"))
## -----------------------------------------------------------------------------
## END SCRIPT
## -----------------------------------------------------------------------------
At the very top of your script, give all the relevant information about the script.
################################################################################
##
## [ PROJ ] EDH 7916: Organizing
## [ FILE ] organizing.R
## [ AUTH ] Benjamin Skinner (@btskinner)
## [ INIT ] 13 January 2020
##
################################################################################
Specifically:
- [ PROJ ]: tell what project it belongs to
- [ FILE ]: give the file’s name
- [ AUTH ]: give your name and a way to contact you
- [ INIT ]: give the date you started the file
If you aren’t using a version control system like git, then it would make sense to also include a line for the last time you revised the file: [ REVN ].
NOTE You don’t have to make your header look exactly like mine. This is just what I’ve landed on after a few years. As long as you have the relevant information, personalize the details as you will.
After the informational header, the first thing you want to include is the set of libraries your script needs in order to work. In this course, you will almost always call the tidyverse library.
## ---------------------------
## libraries
## ---------------------------
library(tidyverse)
Rather than hard-coding / rewriting all the paths in the script below, we can save the paths in an object. We use the file.path() command because it is smart. Some computer operating systems use forward slashes, /, for their file paths; others use backslashes, \. Rather than try to guess or assume what operating system future users will use, we can use R’s function, file.path(), to build the path from its pieces in a platform-independent way.
## ---------------------------
## directory paths
## ---------------------------
dat_dir <- file.path("..", "data")
Notice our relative link? This script assumes that the data file we need for the analysis, test_scores.RDS, is stored in another subfolder called data that is outside this subfolder, but in the same primary folder. Visually:
past/edh7916/2020/summer/
|
|__data/
| \--+ test_scores.RDS
|
|__scripts/
\--+ organizing.R
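This means two things:
- we assume the script is run from the scripts subdirectory: past/edh7916/2020/summer/scripts
- we assume the data file is stored in the data subdirectory: past/edh7916/2020/summer/data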
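Because the script relies on where it is run from, it can help to check that assumption before trying to read any data. The following is a sketch only: check_data_dir() is a hypothetical helper written for this example, not something the course scripts define.

```r
## relative path from the scripts folder to the data folder
dat_dir <- file.path("..", "data")
dat_dir    # "../data"

## hypothetical helper: can we see the data folder from where we are?
check_data_dir <- function(path) {
    if (!dir.exists(path)) {
        message("Can't find ", path, " from ", getwd(),
                ". Is the working directory set to scripts?")
        return(FALSE)
    }
    TRUE
}

## TRUE if the folder is found, FALSE (with a message) if not
found <- check_data_dir(dat_dir)
```

If the message appears, the usual fix is to set the working directory to the scripts folder and run the check again.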
Programmers hate “magic” numbers. What are “magic” numbers (or “magic” strings or “magic” settings)? They are values that are hard-coded in your analysis script.
Say you want to convert old test scores to a new test score based on some ratio, say 1.1 (10% increase). You could multiply every old test score value throughout your script by 1.1, but what if you later decide it should be 1.2? You need to change every instance of 1.1 (don’t miss any!).
In addition, does 1.1 have any inherent meaning? To me, not really. If later on you take a look at your script and see that you are multiplying by 1.1 (e.g. x * 1.1), you’ll have to go back to your notes or memory, or just guess (!), what that 1.1 represents.
## ---------------------------
## settings/macros
## ---------------------------
old_to_new_score_ratio <- 1.1
It’s better to store these constant reusable values at the top of your script in an object (or macro, as I call it here) that has a clear name. old_to_new_score_ratio <- 1.1 is clear in its meaning, so when we use it below, we’ll know what it means. Also, if we decide we need to fix / change the value later, we only have to change it once.
Put all such numbers, strings, and settings here.
Other than the functions that come with the libraries we load, we might write functions ourselves. At this point in the course, I’m not concerned that you know how to write a function or how one works. Just notice how the function has a good name that tells what it does: converts old scores to new scores using a ratio (we’ll use the one we defined above).
## ---------------------------
## functions
## ---------------------------
old_to_new_score <- function(old_test_score, ratio) {
    return(old_test_score * ratio)
}
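To see the function and the macro working together, here is a short sketch that repeats both definitions so it runs on its own. The scores in old_scores are made up for illustration.

```r
## the macro and function as defined in the script
old_to_new_score_ratio <- 1.1
old_to_new_score <- function(old_test_score, ratio) {
    return(old_test_score * ratio)
}

## made-up scores for illustration
old_scores <- c(100, 200, 300)

## convert using the named ratio rather than a bare 1.1
new_scores <- old_to_new_score(old_test_score = old_scores,
                               ratio = old_to_new_score_ratio)
new_scores
```

If the ratio later needs to be 1.2, only the line that defines old_to_new_score_ratio has to change.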
The body of your script is where the main work happens. You may need more sections, but at the very least you should generally have dedicated spots for reading in your data, working with your data, and saving any output.
Here we read in a very small tibble (a tidyverse version of a data frame) with some student IDs and their test scores. We can see this if we look at the data.
## ---------------------------
## input
## ---------------------------
df <- readRDS(file.path(dat_dir, "test_scores.RDS"))
Quick exercise
Look at the data, both using RStudio’s View and by just printing the object to the console.
Let’s say that all we want to do is add a new column to our tibble that holds the updated test scores as computed by our function and ratio. We will use the mutate() command from the dplyr library (part of the tidyverse; we’ll be using it a lot in this course) to create and add this column.
## ---------------------------
## process
## ---------------------------
## add a column for new test score using our function
df <- df %>%
    mutate(test_score_new = old_to_new_score(old_test_score = test_score,
                                             ratio = old_to_new_score_ratio))
Finally, we save our tibble. Notice how we use a new name. We’ll talk more about data consistency in a later lesson, but suffice it to say for now that we never want to overwrite our original data. Save a new file with a useful name to keep everything separate.
## ---------------------------
## output
## ---------------------------
saveRDS(df, file.path(dat_dir, "test_scores_updated.RDS"))
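The read, modify, save-under-a-new-name pattern can be tried without touching any course files. This sketch uses made-up data, a temporary folder in place of the data directory, and base R column assignment instead of mutate() so that it runs without the tidyverse; the file names scores_demo.RDS and scores_demo_updated.RDS are placeholders.

```r
## made-up data standing in for the course data file
scores <- data.frame(stu_id = 1:3, test_score = c(80, 90, 100))
old_to_new_score_ratio <- 1.1

## a temporary folder stands in for the data directory
demo_dir <- tempdir()
in_file  <- file.path(demo_dir, "scores_demo.RDS")
out_file <- file.path(demo_dir, "scores_demo_updated.RDS")
saveRDS(scores, in_file)

## read the original, add a column, and save under a NEW name
scores_new <- readRDS(in_file)
scores_new$test_score_new <- scores_new$test_score * old_to_new_score_ratio
saveRDS(scores_new, out_file)

## the original file still has only its original columns
names(readRDS(in_file))
names(readRDS(out_file))
```

Because the output goes to a different file, the script can be rerun from the start at any time and will always begin from the same untouched input.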
With that, we’re finished! While not necessary, it’s nice to have a clear indication of where your script ends. This is particularly handy down the road when you have many scripts called one after another. If you have a problem and have to go through lines of output, knowing where a script ends can be very helpful.
## -----------------------------------------------------------------------------
## END SCRIPT
## -----------------------------------------------------------------------------
And with that, we’re finished!
When naming folders, files, objects, or macros, keep these naming rules in mind:
- use clear, descriptive names: data_clean.R or old_to_new_score_ratio
- number related files consistently: data_1.RDS, data_2.RDS, data_3.RDS, etc.
- never use spaces: data_clean.R, not data clean.R
Per this last point, more person hours are lost than can be counted dealing (or failing to deal) with file names that have spaces. Don’t do it! Use underscores or hyphens to separate words.
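If you inherit files whose names already contain spaces, R can generate cleaned-up names for you. A small sketch, using made-up file names:

```r
## hypothetical file names, some with spaces
messy <- c("data clean.R", "final report draft.Rmd", "data_1.RDS")

## replace every space with an underscore
tidy <- gsub(" ", "_", messy, fixed = TRUE)
tidy
```

This only builds the new names. To rename actual files, base R's file.rename() takes the old and new names; it is worth trying on copies first.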