class: center, middle, inverse, title-slide

# Lab 03: CS631
## Working with Data
### Alison Hill

---

# Data for today

We'll use data from [Wordbank](http://wordbank.stanford.edu), an open source database of children's vocabulary development.

The tool used to measure children's language and communicative development in this database is the [MacArthur-Bates Communicative Development Inventories (MB-CDI)](http://mb-cdi.stanford.edu). The MB-CDI is a parent-reported questionnaire.

- R package [`wordbankr`](https://cran.r-project.org/web/packages/wordbankr/index.html)
- [`wordbankr` vignette](https://cran.r-project.org/web/packages/wordbankr/vignettes/wordbankr.html)
- More about [Wordbank](http://wordbank.stanford.edu)
- More about [MB-CDI](http://mb-cdi.stanford.edu)

---

# Get the data

Use this code chunk to import my cleaned CSV file:

```r
library(readr)
sounds <- read_csv("http://bit.ly/cs631-meow")
```

---

class: inverse, middle, center

<img src="../images/r-data-types.png" width="65%" style="display: block; margin: auto;" />

## RStudio Base R Cheatsheet

https://github.com/rstudio/cheatsheets/blob/master/base-r.pdf

---

## Know your data types

* Numeric (2 subtypes)
    - Integers (`1L, 50L`)
    - Double (`1.5, 50.25`, `?double`)
* Character (`"hello"`)
* Factor (`grade = "A" | grade = "B"`)
* Logical (`TRUE | FALSE`)

--

```r
typeof(sounds$age)
```

```
[1] "double"
```

```r
typeof(sounds$sound)
```

```
[1] "character"
```

```r
typeof(sounds$sound == "meow")
```

```
[1] "logical"
```

---

# Even better: `glimpse`

```r
library(dplyr) # glimpse() is re-exported by dplyr
glimpse(sounds)
```

```
Observations: 33
Variables: 4
$ age          <dbl> 8, 8, 8, 9, 9, 9, 10, 10, 10, 11, 11, 11, 12, 12, 12, 13…
$ sound        <chr> "cockadoodledoo", "meow", "woof woof", "cockadoodledoo",…
$ kids_produce <dbl> 1, 0, 3, 0, 2, 2, 0, 5, 4, 0, 5, 12, 0, 12, 28, 9, 125, …
$ kids_respond <dbl> 35, 35, 35, 91, 93, 93, 139, 145, 143, 94, 94, 94, 141, …
```

---

# `sounds` (a subset)

- `age`: child age in months
- `sound`: a string describing a type of animal sound
- `kids_produce`: the number of parents who answered "yes, my child produces this animal sound"
- `kids_respond`: the number of parents who responded to this question at all

<table>
<thead>
<tr> <th style="text-align:right;"> age </th> <th style="text-align:left;"> sound </th> <th style="text-align:right;"> kids_produce </th> <th style="text-align:right;"> kids_respond </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> 8 </td> <td style="text-align:left;"> cockadoodledoo </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 35 </td> </tr>
<tr> <td style="text-align:right;"> 8 </td> <td style="text-align:left;"> meow </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 35 </td> </tr>
<tr> <td style="text-align:right;"> 8 </td> <td style="text-align:left;"> woof woof </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 35 </td> </tr>
<tr> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> cockadoodledoo </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 91 </td> </tr>
<tr> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> meow </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 93 </td> </tr>
<tr> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> woof woof </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 93 </td> </tr>
</tbody>
</table>

---

# Data types

<img src="http://r4ds.had.co.nz/diagrams/data-structures-overview.png" width="65%" style="display: block; margin: auto;" />

---

class: middle, center, inverse

# ⌛️

## Let's review

---

# Data wrangling with `dplyr`

.pull-left[
From DataCamp Chapter 3

- `group_by`
- `summarize`
]

--

.pull-right[
Adding onto your arsenal of...

- `filter`
- `arrange`
- `mutate`
- `glimpse`
- `distinct`
- `count`
- `tally`
- `pull`
- `top_n`
]

---

class: middle, center, inverse

# 😈

## More on `mutate`

---

# 3 ways to `mutate`

1. <font color="#ED1941">Create a new variable with a specific value</font>
1. Create a new variable based on other variables
1. Change an existing variable

--

```r
sounds %>% 
  mutate(form = "WS")
```

```
# A tibble: 33 x 5
     age sound          kids_produce kids_respond form 
   <dbl> <chr>                 <dbl>        <dbl> <chr>
 1     8 cockadoodledoo            1           35 WS   
 2     8 meow                      0           35 WS   
 3     8 woof woof                 3           35 WS   
 4     9 cockadoodledoo            0           91 WS   
 5     9 meow                      2           93 WS   
 6     9 woof woof                 2           93 WS   
 7    10 cockadoodledoo            0          139 WS   
 8    10 meow                      5          145 WS   
 9    10 woof woof                 4          143 WS   
10    11 cockadoodledoo            0           94 WS   
# … with 23 more rows
```

---

# 3 ways to `mutate`

1. Create a new variable with a specific value
1. <font color="#ED1941">Create a new variable based on other variables</font>
1. Change an existing variable

--

```r
sounds %>% 
  mutate(prop_produce = kids_produce / kids_respond)
```

```
# A tibble: 33 x 5
     age sound          kids_produce kids_respond prop_produce
   <dbl> <chr>                 <dbl>        <dbl>        <dbl>
 1     8 cockadoodledoo            1           35       0.0286
 2     8 meow                      0           35       0     
 3     8 woof woof                 3           35       0.0857
 4     9 cockadoodledoo            0           91       0     
 5     9 meow                      2           93       0.0215
 6     9 woof woof                 2           93       0.0215
 7    10 cockadoodledoo            0          139       0     
 8    10 meow                      5          145       0.0345
 9    10 woof woof                 4          143       0.0280
10    11 cockadoodledoo            0           94       0     
# … with 23 more rows
```

---

# 3 ways to `mutate`

1. Create a new variable with a specific value
1. Create a new variable based on other variables
1. <font color="#ED1941">Change an existing variable</font>

--

```r
sounds %>% 
  mutate(prop_produce = kids_produce / kids_respond) %>% # create it first
  mutate(prop_produce = prop_produce * 100)              # then overwrite it
```

```
# A tibble: 33 x 5
     age sound          kids_produce kids_respond prop_produce
   <dbl> <chr>                 <dbl>        <dbl>        <dbl>
 1     8 cockadoodledoo            1           35         2.86
 2     8 meow                      0           35         0   
 3     8 woof woof                 3           35         8.57
 4     9 cockadoodledoo            0           91         0   
 5     9 meow                      2           93         2.15
 6     9 woof woof                 2           93         2.15
 7    10 cockadoodledoo            0          139         0   
 8    10 meow                      5          145         3.45
 9    10 woof woof                 4          143         2.80
10    11 cockadoodledoo            0           94         0   
# … with 23 more rows
```

---

class: middle, center, inverse

# ⌛️

## Let's review some helpful functions for `mutate` + `summarize`

---

class: inverse, bottom, center
background-image: url("../images/peapod.png")
background-size: 25%

## Remember:
## Base R + Tidyverse

---

class: middle, center, inverse

# 💡

## First:

## Arithmetic

*especially useful for* `mutate`

See: http://r4ds.had.co.nz/transform.html#mutate-funs

---

```r
?Arithmetic
```

<table>
<thead>
<tr> <th style="text-align:left;"> Operator </th> <th style="text-align:left;"> Description </th> <th style="text-align:left;"> Usage </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> + </td> <td style="text-align:left;"> addition </td> <td style="text-align:left;"> x + y </td> </tr>
<tr> <td style="text-align:left;"> - </td> <td style="text-align:left;"> subtraction </td> <td style="text-align:left;"> x - y </td> </tr>
<tr> <td style="text-align:left;"> * </td> <td style="text-align:left;"> multiplication </td> <td style="text-align:left;"> x * y </td> </tr>
<tr> <td style="text-align:left;"> / </td> <td style="text-align:left;"> division </td> <td style="text-align:left;"> x / y </td> </tr>
<tr> <td style="text-align:left;"> ^ </td> <td style="text-align:left;"> raised to the power of </td> <td style="text-align:left;"> x ^ y </td> </tr>
<tr> <td style="text-align:left;"> abs </td> <td style="text-align:left;"> absolute value </td> <td style="text-align:left;"> abs(x) </td> </tr>
<tr> <td style="text-align:left;"> %/% </td> <td style="text-align:left;"> integer division </td> <td style="text-align:left;"> x %/% y </td> </tr>
<tr> <td style="text-align:left;"> %% </td> <td style="text-align:left;"> remainder after division </td> <td style="text-align:left;"> x %% y </td> </tr>
</tbody>
</table>

```r
5 %/% 2 # 2 goes into 5 two times with...
```

```
[1] 2
```

```r
5 %% 2 # 1 left over
```

```
[1] 1
```

---

class: middle, center, inverse

# 💡

## Second:

## Summaries

*especially useful for* `summarize`

*even more useful after a* `group_by`

See: http://r4ds.had.co.nz/transform.html#summarise-funs

---

<table>
<thead>
<tr> <th style="text-align:left;"> Description </th> <th style="text-align:left;"> Usage </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> sum </td> <td style="text-align:left;"> sum(x) </td> </tr>
<tr> <td style="text-align:left;"> minimum </td> <td style="text-align:left;"> min(x) </td> </tr>
<tr> <td style="text-align:left;"> maximum </td> <td style="text-align:left;"> max(x) </td> </tr>
<tr> <td style="text-align:left;"> mean </td> <td style="text-align:left;"> mean(x) </td> </tr>
<tr> <td style="text-align:left;"> median </td> <td style="text-align:left;"> median(x) </td> </tr>
<tr> <td style="text-align:left;"> standard deviation </td> <td style="text-align:left;"> sd(x) </td> </tr>
<tr> <td style="text-align:left;"> variance </td> <td style="text-align:left;"> var(x) </td> </tr>
<tr> <td style="text-align:left;"> rank </td> <td style="text-align:left;"> rank(x) </td> </tr>
</tbody>
</table>

* All of these except `rank` accept an `na.rm` argument to remove `NA` values before summarizing. The default setting for this argument is *always* `na.rm = FALSE`, so if there is even one `NA` value the summary will be `NA`. (`rank` handles `NA` values through its `na.last` argument instead.)
* See "Maths Functions" in the RStudio Base R Cheatsheet: https://github.com/rstudio/cheatsheets/blob/master/base-r.pdf

---

class: inverse, middle, center

![](../images/alicedata-lego-colors.jpg)

## <small>"Spent day pondering grayscale vs colourscale using `ggplot`"</small>

*photo and caption courtesy [@alice_data](https://twitter.com/alice_data)*

---

# Today's lab: COLORS

Specifically, discrete colors.

At the end of today's lab, you'll see an extra section on continuous colors.

---

## But first: `shape`

<img src="03-slides_files/figure-html/unnamed-chunk-18-1.png" width="65%" style="display: block; margin: auto;" />

---

## Shapes with `color = "hotpink"`

<img src="03-slides_files/figure-html/unnamed-chunk-19-1.png" width="65%" style="display: block; margin: auto;" />

---

## Shapes with `fill = "gold"`

<img src="03-slides_files/figure-html/unnamed-chunk-20-1.png" width="65%" style="display: block; margin: auto;" />

---

## Default shape for `geom_point`

🕵🏽 Requires spelunking into the dark corners of the `ggplot2` code on [GitHub](https://github.com/tidyverse/ggplot2/blob/master/R/geom-point.r):

```r
default_aes = aes(
  shape = 19, colour = "black", size = 1.5, fill = NA,
  alpha = NA, stroke = 0.5
)
```

So the default is `geom_point(shape = 19)`!

This is important to remember: this shape only "understands" the *color* aesthetic, not the *fill* aesthetic.

---

class: inverse, middle, center

# 👇🏽

## R Markdown:

https://www.markdowntutorial.com

https://andrewbtran.github.io/NICAR/2018/workflow/docs/02-rmarkdown.html

https://yihui.name/tinytex/ *(install!)*

https://github.com/rstudio/cheatsheets/blob/master/rmarkdown-2.0.pdf

https://rmarkdown.rstudio.com/html_document_format.html

https://rmarkdown.rstudio.com/pdf_document_format.html
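
---

## Sketch: `color` vs. `fill` by shape

To follow up on the default `geom_point` shape slide, here is a minimal sketch of how `color` and `fill` behave for two different shapes. The toy data frame `demo` and the plot names `p_solid` and `p_filled` are made up for illustration and are not part of the lab data.

```r
library(ggplot2)

# Hypothetical toy data, not part of the `sounds` data set
demo <- data.frame(x = 1:3, y = c(2, 5, 3))

# Shape 19 (the geom_point default) is a solid circle: only `color` is drawn,
# so `fill = "gold"` has no visible effect here
p_solid <- ggplot(demo, aes(x = x, y = y)) +
  geom_point(shape = 19, color = "hotpink", fill = "gold", size = 5)

# Shape 21 is a circle with a border: `color` sets the outline,
# `fill` sets the interior
p_filled <- ggplot(demo, aes(x = x, y = y)) +
  geom_point(shape = 21, color = "hotpink", fill = "gold", size = 5)

p_solid
p_filled
```

Shapes 21 through 25 are the ones that respond to both `color` and `fill`, so pick one of those whenever you want to map or set `fill` on points.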