Notes - Things They Forgot to Teach You In R, rstudio::conf18
Jessica Minnier
1/31/2018 and 2/1/2018
Thanks to the excellent contributions (via pull
request!) from classmate Peter Higgins!
Links:
Materials: rstd.io/forgot
To use “in building” mirror: options(repos = c(CRAN = "https://cran.rstudio.com/"))
Random thoughts
Day 1: Morning - library exploration
- JB has us go to issues and put emoji on OS, tests that we’ve signed into github! also sees how many windows users there are
- it’s ok that things take a long time
- make things self explaining (good filenames - see https://speakerdeck.com/jennybc/how-to-name-files), don’t spend a lot of time writing wordy explainers that you have to maintain and won’t want to read later
- organize your files and keep your readme up to date!
- “unless you can keep it current (maintain it), don’t write it”
- it helps to know about your R installation and where packages go in order to make a package
- new package
fs
(for ‘file system’), helps us work with file paths. note, don’t use paste() to work with file paths like they are strings, they are not!
Day 1: Mid-morning - file copying and naming, projects
- turn off .RData loading and never ask to save .RData
- restart R (Shift-Cmd-F10) often!
- no absolute paths
- use a stable base:
here::here("data","raw-data.csv")
in your project directory, or fs::path_home()
- form paths at runtime relative to a stable base
- normalizePath() makes it work on both windows and mac
- Rstudio set up a mirror for downloading packages in the building! Whaaattt!
- use
basename
instead of strsplit
to get name of file from a path (don’t have to specify or /)
- name files and directories as thing_thing-info_thing2 so for instance day_session_topic so you can order and delimit them later
Side note from my own issues in moving this repo to github
How to add an existing Rstudio repo to github:
This has changed since git 2.9, now when you add remote and then pull you need a special option. This only needs to happen if you added a Readme. Just don’t do this and life will be easier.
git remote add origin https://github.com/jminnier/rstudioconf18_jennybryan_forgot.git
git pull origin master --allow-unrelated-histories
Most easy option from happywithgitr
Day 1: Afternoon - git/github
Day 1: Afternoon after break
- to upload an existing git directory to github without going through all the steps, just do
usethis::use_github()
!!!
- JB showing that .R and .Rmd can make same github_document output
- You can make your README.md into a webpage by changing settings in GitHub. Go to your repo, click ‘Settings’, scroll way down to GitHub pages, change Source to Master, and choose a theme. Can add links to other .md files in the same repo by inserting this into README.md file: displayname
- Good options for knitr:
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
out.width = "100%"
)
- git can detect simple push conflicts (i.e. you forgot to pull first)
- if you are collaborating on code: commit and push (and pull) often, so that you make small changes that git’s troubleshooting push fail algorithm can handle it; “sync to the mothership really often”
- also working with binary files like pdf makes this really hard since git can’t look inside them to figure out what the problem is. so, don’t put them in your repo very often! (.gitignore)
- use githooks to avoid pushing files that have conflicts (they can look for the markers from the merge conflict)
- if you have made conflicting commits to your local version and remote version, push will be rejected. Pull first and fix conflicts. If this does not work, consider git reset. If this does not work, consider renaming local repo to borked-repo, pull remote, then selectively copy over new useful files from borked-repo
- you can search github for code snippets (I need to do this more!), such as: “llply” user:cran language:R; Jim Hester will talk more about this tomorrow
Day 2 Morning - Jim Hester, searching github
- Jim Hester is teaching us how to use github search, for example
read_csv OR read_delim col_type language:R extension:.R
looks for examples of using the col_type
argument in read_csv
or read_delim
functions
- if the documentation doesn’t answer your questions, use github search to find examples
- example search:
user:cran onAttach
will search CRAN packages that use the function onAttach
- use
user:
field to search CRAN or tidyverse because those users/orgs keep track of all the packages on cran or tidyverse
- if you get an error, you can search that error code string in the repo to determine what is causing it
- The tidyverse packages typically name internal helpers with a trailing underscore.
- JH has a package for this! jimhester/lookup
library(lookup)
lookup(knitr::knit)
- JB: if you are reporting a bug or requesting an addition, be as specific as possible
Day 2 morning, JB start up files
- re: installing packages after an update: you can install them organically as you go, this is a good way to perform “spring cleaning” (JB and [@LucyStats](twitter.com/lucystats) are in favor of this). However, Jim Hester has a package for this, too! jimhester/autoinst (but maybe try not to break it/find issues ;))
- github search for “dotfiles” shows people’s repos of dotfiles
- very useful links from JB’s repo:
- examples of good options to turn on:
warnPartialMatchAttr = TRUE
, etc = no autocomplete!
utils::rc.settings(ipck=TRUE)
= yes autocomplete package names
- don’t load packages in .Rprofile because you will be using functions other people can’t without loading
- though JB still loads “workflow packages” since low risk for irreproducible code: devtools, usethis, remotes, testthat, reprex
- can populate description file automatically (see JB’s .Rprofile)
- interesting package to maintain these types of files: HenrikBengtsson/startup
Day 2 Morning after break - github project example
- github etiquette: use
#
in front of numbers to link to issues, words like “Fixes” are keywords that can close issues see more on github
- plan for revisiting analyses/projects (like 10x more than you thought you would)
- refine and extend! make your code more readable, efficient, resilient, general
- beware of monoliths! i.e one script and a workspace, much better to break logic and ouput into smaller pieces; divide and conquer; i.e. making figures is separate from modeling
- make scripts based on the different steps in the analysis pipeline (see slide 17 slides/day2_slides2_project-api.pdf)
- make outputs separate based on their uses rather than a giant RData file
- practice “safe paths”! use
here
package to do this; works on my machine and on yours, also works with knitr and rmarkdown! rmarkdown sets workign directory where .Rmd file is, but here
can get around this if you have subdirectories
library(here)
here()
here("day1_s1_explore-libraries")
cat(readLines(here("day1_s2_copy-files/","00_filesystem-practice_comfy.R")))
- we had an excerise of forking a repo and adding code to files along with documentation in the readme.
- there are R packages that are similar to
make
, such as remake (more to come after lunch)
Day 2 Afternoon - make-like, live purrr coding, API example
- you can view csv files in github (better to do this than opening in excel!)
- you can ammend commits if you haven’t pushed yet
- github can show you diffs of .png files, useful for seeing progression of plots
- having projects in smaller pieces allows people to come into the code at the middle
- JB’s repo solution
- the table of file organization in JB’s readme is quite useful
- Kirill Muller is telling us about
drake
, repo and slides
- drake will look at your code and run what needs to be updated
- make gives you all kinds of problems that drake does not
- use
drake_plan()
to map out your analysis plan that needs to be updated
drake::loadd()
can load data or an object or a plot, updated based on your plan
- how much complexity can this handle? KM used to use remake, where you specify plans with YAML files, and he had YAML files up to 6 mb, so very complex workflows
- JB live coding
purrr
here: rstd.io/jenny-live-code, using dropbox with url trick to make it render in the browser: use the the raw=1
query https://www.dropbox.com/s/2b8mi4rir23pvnx/jenny-live-code.R?raw=1
- it took JB a long time to embrace
plyr
, she stuck with apply
for a while, until she read the plyr
paper and finally “got it”
- get good at iteration, really really good!
repurrrsive
is a package to learn purrr
based on recursive lists
- API is good example since it comes as json which is inherently a list and cannot be a rectangle right away
- lists: one bracket is how you get multiple things, 2 brackets is how you get one thing (via twitter via hadley’s workshop)
- BRRR: package to play sound snippets
purrr:walk()
replaces a for loop + a function, either through a formula ~
or a function
View(listname)
is great in Rstudio, much easier to look at than super long str(listname)
str(listname, list.len=3)
and str(listname, max.level=1)
make str
more manageable
- side note,
wesanderson
color palettes in repurrrsive
are awesome!
- normally don’t grow things inside lists or else R will get slow, need to designate space in advance, but
map
and lapply
deal with those bookkeeping tasks
- return results “like so” =
map_*
such as map_dbl
map_dfr(minis, '[', c("pants","torso","head"))
= get all these things and put them in a df!
- why use
map
over lapply
? some syntaxical sugar that makes things simpler and shorter
flatten_int
can flatten lists
map_chr(,.default = NA)
sets a default value of NA if it is missing; but it doesn’t replace “” with NA
- can’t use
map_chr
if map
returns items with different lengths–you will get an error; if you tried sapply
on this it would just give you a list back (maybe not what you are expecting!)
- code that can make list columns from map:
library(repurrrsive)
tibble(
name = map_chr(got_chars, "name"),
titles = map(got_chars, "titles")
) %>% head
#> # A tibble: 6 x 2
#> name titles
#> <chr> <list>
#> 1 Theon Greyjoy <chr [3]>
#> 2 Tyrion Lannister <chr [2]>
#> 3 Victarion Greyjoy <chr [2]>
#> 4 Will <chr [1]>
#> 5 Areo Hotah <chr [1]>
#> 6 Chett <chr [1]>
- workflow:
- do something easy with iterative machine
- do real hard thing with one representative unit
- do hard thing with iterative machine
pmap
can iterate over every row of a data frame, including and most especially tibbles where you have list columns
- after the break we will have “choose your own adventure” time
Day 2 afternoon - choose your own adventure
- I chose to make a dotfiles repo
- JB made a gallery of the O’Rly? covers here along with hyperlinks!
fig_files <- c(
"https://github.com/jennybc/orly-full-res/blob/master/changinstuff-big.png",
"https://github.com/jennybc/orly-full-res/blob/master/uninformativegitcommit-big.png")
fig_files_smaller <- gsub("(^.*)(\\.png$)", "\\1-smaller\\2", fig_files)
tags <- sprintf("<a href=\"%s\"><img src=\"%s\"></a>",
fig_files, fig_files_smaller)
cat(tags)
#> <a href="https://github.com/jennybc/orly-full-res/blob/master/changinstuff-big.png"><img src="https://github.com/jennybc/orly-full-res/blob/master/changinstuff-big-smaller.png"></a> <a href="https://github.com/jennybc/orly-full-res/blob/master/uninformativegitcommit-big.png"><img src="https://github.com/jennybc/orly-full-res/blob/master/uninformativegitcommit-big-smaller.png"></a>
- also explored the weights of cats at SF SPCA, and found the fattest cat
url <- "https://raw.githubusercontent.com/jennybc/fattest-cat/master/fattest-cat.R"
rfile <- basename(url)
download.file(url, rfile)
source(rfile)
Q’s
- what about versioning data? for small to medium you can easily use github, might need to just keep it in a separate repo and git that, or might need to use gitfs for large file systems
- in the analysis pipeline, if I want to generate an html (or pdf etc) file along with the github_document, is there a way to always generate both? you can “keep md” but can you “keep github_document”? Sort of, but you need to use
rmarkdown::render
(in theory, though JB can’t get it to work yet for github_docs…)
- why is
fs
package making my Rstudio crash every time I want to view an object created with it? May be related to this issue