A course in quantitative research workflow for students in the higher education administration program at the University of Florida
From the git website http://git-scm.com/
Git is a free and open source distributed version control system designed to handle everything from small to very large projects with speed and efficiency.
With so many other (read: easier!) ways to share and collaborate on documents, why use git? Isn’t it a bit overkill to learn an entirely new syntax? Why not just email files or use something like DropBox? Because it is very easy to end up with something like this:
Credit: Jorge Chan
\[ Project = f(size, scope, collaborators) \]
As any part of that function grows, so too does the need for a work flow that:
Git was designed to do all these things, so it works better than other systems.
Yep, you guessed it. Git is the system/language that supports version control. GitHub is an online service that will host your remote repositories, for free if public or a small fee if private. (Students get an education discount and some free repos. Check out https://education.github.com/.)
RStudio is nice because it provides an nice point-and-click interface for git. (Sourcetree and GitHub Desktop are also really nice GUIs.) If you want to run git at the command line, go for it! But using RStudio or another GUI is fine.
pull
from your local repo with your remote repo before
starting any work. This makes sure you have the latest files.
RStudio has a pull
button in the Git tab.push
your commits. Just like you save your papers
every few lines or paragraphs, you should push to your remote repo.
This way, you’re less likely to lose work in the event of a computer
crash. Also, should you want to return to a prior version, small
changes make it easier to find what you want.commit
messages. RStudio will make you add something.
Don’t say “stuff.” A version history of 95 “stuff”s is pretty
non-helpful. Also, I would keep it clean. Even in a private
repository, work as if someone will see what you write someday..gitignore
to make sure that git doesn’t
track or push them.Every class and work session should go like this:
pull
from GitHub remoteadd
or stage
(if new file)commit
with messagepush
back to GitHub remoteIf you’d like you can also use terminal commands to accomplish the same things. Many people (including me) like to use the terminal commands directly.
git pull
git add <filenames>
git commit -m "<your message here>"
git push
That said, you should use whatever works best for you.
By far the most useful guide to working with R and git/GitHub is Jenny Bryan’s guide.
Git works best with plain text files. This is because it notes the differences across two plain text files rather than just copying and re-copying the same file over and over again as it changes. When you’ve only changed one word, git’s method of version control is much more efficient than making a whole new copy.
It’s also useful when you need to merge two files. If file B is just file A with new section, file B can be easily merged into file A by inserting the new section — just like you would add a paragraph in a paper you’re writing. R scripts are plain text. Some data files, like *.csv, are plain text. This is why git works really well with data analysis workflows.
On the other hand, git does not work as well with binary files. These files are stored in a format closer to what your computer understands, which comes with benefits. Data files, like Stata’s .dta and R’s *.Rdata, as well as MS Office files — *.docx, *.xlsx, *.pptx, *etc — are binary files. Git will keep track of your MS Word document, but due to its underlying structure, you won’t be able to merge and every small change will just make a whole new copy of the file. This is why we generally don’t commit large binary data files to Git: your repo just becomes larger and larger with each small change to your data.
Not-so-quick exercise
In this first exercise, we’ll practice using RStudio to pull down your student repository from GitHub. We’ll also practice making a few quick file changes, staging, commiting, and pushing them back to the remote.
R is a port of the S language, which was developed at Bell Labs. As a GNU project, R is open source and free to use and distribute. It can be installed and used on most major operating systems.
R is best thought of as an integrated language and environment that was designed with statistical computing and data analysis in mind. To that end, its structure is a balance between powerful mathematical computation and high-level functionality that can be used interactively (unlike compiled code). In other words, it’s a great tool for quantitative data analysis since it both allows you to investigate your data easily and, when the time comes, write robust programs.
Originally, R was probably best known for its graphing capabilities. As it has matured, it has grown in popularity among data scientists, who have increasingly extended its functionality through user-contributed packages. We will use a number of packages during this course.
To work with the R language, it helps to have an application. While R ships with one (you may see it on your computer as R.app), it’s pretty plain. RStudio, on the other hand, is a powerful IDE that does most everything R-related very well and with little fuss: run commands, write scripts, view output, interact with other languages and remote site, etc. There are other options for working with R, but RStudio is a great all-around program that we will use in this course.
RStudio has 3-4 main frames:
Each has a useful purpose, but for today, we’ll mostly focus on the console itself.
Quick exercise
If you haven’t already, try entering an equation in the console (like
1 + 1
). Next, open the script associated with this module and run the first line. Welcome to R!
R is a type of object-oriented programming environment. This means that R thinks of things in its world as objects, which are like virtual boxes in which we can put things: data, functions, and even other objects.
Before discussing data types and structures, the first lesson in R is
how to assign values to objects. In R (for quirky
reasons),
the primary means of assignment is the arrow, <-
, which is a less than
symbol, <
, followed by a hyphen, -
.
## assign value to object x using <-
x <- 1
NOTE: You can also use a single equals sign, =
, to assign a value
to an object: x = 1
. Keep in mind, however, that since =
sometimes
has other meanings in R and can be confused with ==
, which is
different, it’s generally clearer to use <-
.
R does exactly what you ask it to do — no more, no less. If you don’t ask it to return something, either explicitly from a function or implicitly by printing to the console, it won’t. This can be huge source of frustration to new users.
The good-ish news is that by default, R will print an object’s contents to the console if it’s the only thing you type in. Many functions similarly print to the console if you don’t assign the output to an object. You can see this when simply type a number or character into the console.
## when you input a number or character, R returns it back to you
1
[1] 1
"a"
[1] "a"
Basically, you’ve just told R “Here’s a 1
” and R said “The content of
1
is 1
”. Same for "a"
. The initial number in the square brackets
([1]
) is telling you the index (place within the object) of the first
item. Since we only have one item, it’s just [1]
.
When you store something in an object, you can type the object’s name into the console to see what’s in it.
## what's in x?
x
[1] 1
A neat trick if you want to both assign a value and see the results
printed to the output is to to wrap the entire line in ()
.
## wrap in () to print after assignment
(x <- 5)
[1] 5
Quick exercise
Using the arrow, assign the output of
1 + 1
tox
. Next subtract 1 fromx
and reassign the result tox
. Show the value inx
.
A NOTE ON “GOOD-ISH” Keep in mind how much data your object has /
might have when printing it to the console. For a small amount, printing
is just fine. But if you have, for example, a matrix with 1,000 columns
and 1 million rows, printing might not be a useful exercise. There are
other ways, such as the function head()
that might be more useful in
these situations.
You may have noticed already, but comments in R are set off using the
hash or pound character at the beginning of the line: #
. The comment
character tells R to ignore the line, that is, do not try to interpret
it as code you the user want run.
Quick exercise
Type the phrase “This is a comment” directly into the R console both with and without a leading “#”. What happens each time?
You may notice that I use two hashes. This is a stylistic tick that has
more to do with the editor I use than an R requirement. You can use only
a single #
for your comments if you like.
R uses variety of data types and structures to represent and work with data. There are many, but the major ones that you’ll use most often are:
logical
numeric
(integer
& double
)character
vector
matrix
list
dataframe
Understanding the nuanced differences between data types is not important right now. Just know that they exist and that you’ll gain an intuitive understanding of them as you become better aquainted with R. For more information, see the supplemental material on data types and structures.
User-submitted packages are a huge part of what makes R great. You may hear me use the phrases “base R” during class. What I mean by this is the R that comes as you download it with no packages loaded (sometimes also called “vanilla R”). While it’s powerful in and of itself — you can do everything you need with base R — most of your scripts will make use of one of more contributed packages. These will make your data analytic life much nicer. We’ll lean heavily on the {tidyverse} suite of packages this semester.
Many contributed packages are hosted on the CRAN package repository. What’s really nice about CRAN is that packages have to go through quite a few checks in order for CRAN to approve and host them. Checks include making sure the package has documentation, works on a variety of systems, and doesn’t try to do odd things to your computer. The upshot is that you should feel okay downloading these packages from CRAN.
To download a package from CRAN, use:
install.packages("<package name>")
NOTE Throughout this course, if you see something in triangle
brackets (<...>
), that means it’s a placeholder for you to change
accordingly.
Many packages rely on other packages to function properly. When you use
install.packages()
, the default option is to install all dependencies.
By default, R will check how you installed R and download the right
operating system file type.
Quick exercise
Install the
{tidyverse}
package, which is really a suite of packages that we’ll use throughout the semester. Don’t forget to use double quotation marks around the package name.
Package libraries can loaded in a number of ways, but the easiest it to write:
library("<library name>")
where "<library name>"
is the name of the package/library. You will
need to load these before you can use their functions in your scripts.
Typically, they are placed at the top of the script file.
Quick exercise
Load the
tidyverse
package, which you just installed. This will be a good test of the installation since we will use tidyverse libraries throughout the rest of the workshop.
Even I don’t have every R function and nuance memorized. With all the user-written packages, it would be difficult to keep up if I tried! When stuck, there are a few ways to get help.
In the console, typing a function name immediately after a question mark will bring up that function’s help file:
## get help file for function
?median
Median Value
Description:
Compute the sample median.
Usage:
median(x, na.rm = FALSE, ...)
Arguments:
x: an object for which a method has been defined, or a numeric
vector containing the values whose median is to be computed.
na.rm: a logical value indicating whether 'NA' values should be
stripped before the computation proceeds.
...: potentially further arguments for methods; not used in the
default method.
Details:
This is a generic function for which methods can be written.
However, the default method makes use of 'is.na', 'sort' and
'mean' from package 'base' all of which are generic, and so the
default method will work for most classes (e.g., '"Date"') for
which a median is a reasonable concept.
Value:
The default method returns a length-one object of the same type as
'x', except when 'x' is logical or integer of even length, when
the result will be double.
If there are no values or if 'na.rm = FALSE' and there are 'NA'
values the result is 'NA' of the same type as 'x' (or more
generally the result of 'x[FALSE][NA]').
References:
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S
Language_. Wadsworth & Brooks/Cole.
See Also:
'quantile' for general quantiles.
Examples:
median(1:4) # = 2.5 [even number]
median(c(1:3, 100, 1000)) # = 3 [odd, robust]
Two question marks will search for the command name in CRAN packages:
## search for function in CRAN
??median
Package | Topic | Type | Title |
---|---|---|---|
bit64 | qtile | help | (Q)uan(Tile)s |
ellipsis | safe_median | help | Safe version of median |
future | sticky_globals | help | Place a sticky-globals environment immediately after the global environment |
ggplot2 | hmisc | help | A selection of summary functions from Hmisc |
ggpubr | add_summary | help | Add Summary Statistics onto a ggplot. |
Hmisc | smean.cl.normal | help | Compute Summary Statistics on a Vector |
httr | guess_media | help | Guess the media type of a path from its extension. |
httr | parse_media | help | Parse a media type. |
igraph | time_bins.sir | help | SIR model on graphs |
matrixStats | rowMedians | help | Calculates the median for each row (column) in a matrix |
matrixStats | rowWeightedMedians | help | Calculates the weighted medians for each row (column) in a matrix |
matrixStats | weightedMad | help | Weighted Median Absolute Deviation (MAD) |
matrixStats | weightedMedian | help | Weighted Median Value |
olsrr | reference-media | vignette | Media |
posterior | ess_quantile | help | Effective sample sizes for quantiles |
posterior | fold_draws | help | Fold draws around their median |
posterior | mcse_quantile | help | Monte Carlo standard error for quantiles |
purrr | accumulate | help | Accumulate intermediate results of a vector reduction |
raster | Summary-methods | help | Summary methods |
recipes | step_medianimpute | help | Impute Numeric Data Using the Median |
zoo | rollmean | help | Rolling Means/Maximums/Medians/Sums |
zoo | zoo | help | Z’s Ordered Observations |
stats | mad | help | Median Absolute Deviation |
stats | median | help | Median Value |
stats | medpolish | help | Median Polish (Robust Twoway Decomposition) of a Matrix |
stats | runmed | help | Running Medians - Robust Scatter Plot Smoothing |
stats | smooth | help | Tukey’s (Running Median) Smoothing |
stats | smoothEnds | help | End Points Smoothing (for Running Medians) |
survival | Math.Surv | help | Methods for Surv objects |
At first, using help files may feel like trying to use a dictionary to see how to spell a word — if you knew how to spell it, you wouldn’t need the dictionary! Similarly, if you knew what you needed, you wouldn’t need the help file. But over time, they will become more useful, particularly when you want to figure out an obscure option that will give you exactly what you need.
Google is a coder’s best friend. If you are having a problem, odds are a 1,000+ other people have too and at least one of them has been brave enough to ask about it in a forum like StackOverflow, CrossValidated, or R-help mailing list.
If you are lucky, you’ll find the exact answer to your question. More likely, you’ll find a partial answer that you’ll need to modify for your needs. Sometimes, you’ll find multiple partial answers that, in combination, help you figure out a solution. It can feel overwhelming at first, particularly if it’s a way of problem-solving that’s different from what you’re used to. But it does become easier with practice.
Google it!
When needing help for this class, your order of operations should be:
This is not because I don’t want to help. My concern is the opposite: that I’m likely to just show you. Data analysis is tricky because no two problems are alike. But over time, they do rhyme. The time you put in now learning to figure things out on your own will be well paid in the future.