Yotta Data Visualization in Practice

Packages

library(magrittr)
library(readr)
library(tidyr)
library(dplyr)

Defining tidy data (long-format)

Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.

Tools for reshaping dataframe

Packages	to long	to wide
tidyr	gather	spread
reshape2	melt	dcast
pandas	melt	unstack / pivot_table / pivot
spreadsheets	unpivot	pivot
databases	fold	unfold

Example dataset

Wide data

name	treatmenta	treatmentb
John Smith	NA	18
Jane Doe	4	1
Mary Johnson	6	7

treatment	John Smith	Jane Doe	Mary Johnson
a	NA	4	6
b	18	1	7

Long data

name	treatment	n
Jane Doe	a	4
Jane Doe	b	1
John Smith	a	NA
John Smith	b	18
Mary Johnson	a	6
Mary Johnson	b	7
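
To connect the two layouts, the first wide table can be reshaped into the long table with gather(). This is only a minimal sketch: the tibble is typed in by hand from the table above, and the object names wide and long are illustrative, not part of the original material.

```r
library(dplyr)
library(tidyr)

# Wide layout from the example: one column per treatment
wide <- tibble(
  name = c("John Smith", "Jane Doe", "Mary Johnson"),
  treatmenta = c(NA, 4, 6),
  treatmentb = c(18, 1, 7)
)

long <- wide %>%
  # Stack the two treatment columns into key/value pairs
  gather(key = "treatment", value = "n", treatmenta, treatmentb) %>%
  # Drop the "treatment" prefix so the key holds just "a" or "b"
  mutate(treatment = sub("^treatment", "", treatment)) %>%
  # Sort to match the row order of the long table
  arrange(name, treatment)

long
```

The missing observation for John Smith under treatment a stays in the result as an explicit NA row.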

Tidying: Tidyr variables

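Opening moves

Decide which parts of the data are variables and which are values. (It depends on the analysis.)
gather(): Put all values into cells (values stored in column names become a key column).
spread(): Put variables back into columns (variable names stored in a key column become column names).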

gather: to long

data %>% gather(key, value, ...cols_to_gather)

df_wide <- read_csv("http://bit.ly/country-year-wide")
df_wide
#> # A tibble: 3 x 3
#>   country     `1999` `2000`
#>   <chr>        <int>  <int>
#> 1 Afghanistan    745   2666
#> 2 Brazil       37737  80488
#> 3 China       212258 213766

df_wide %>% gather(key = "year", value = "cases", -country)
#> # A tibble: 6 x 3
#>   country     year   cases
#>   <chr>       <chr>  <int>
#> 1 Afghanistan 1999     745
#> 2 Brazil      1999   37737
#> 3 China       1999  212258
#> 4 Afghanistan 2000    2666
#> 5 Brazil      2000   80488
#> 6 China       2000  213766
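
In the output above, year comes back as a character column because it was built from column names. One possible way to handle this, sketched below, is gather()'s convert argument, which asks tidyr to guess a better type for the key column. The data is re-entered by hand so the example does not depend on the URL; df_wide_local and df_long_local are illustrative names.

```r
library(dplyr)
library(tidyr)

# Hand-typed copy of the wide table shown above
df_wide_local <- tibble(
  country = c("Afghanistan", "Brazil", "China"),
  `1999` = c(745L, 37737L, 212258L),
  `2000` = c(2666L, 80488L, 213766L)
)

# convert = TRUE runs type conversion on the new key column,
# so "1999" and "2000" become integers instead of strings
df_long_local <- df_wide_local %>%
  gather(key = "year", value = "cases", -country, convert = TRUE)

class(df_long_local$year)
```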

spread: to wide

data %>% spread(key, value, fill = NA, convert = FALSE, drop = TRUE, sep = NULL)

df_long <- read_csv("http://bit.ly/cases-long")
df_long
#> # A tibble: 12 x 4
#>    country      year type            count
#>    <chr>       <int> <chr>           <int>
#>  1 Afghanistan  1999 cases             745
#>  2 Afghanistan  1999 population   19987071
#>  3 Afghanistan  2000 cases            2666
#>  4 Afghanistan  2000 population   20595360
#>  5 Brazil       1999 cases           37737
#>  6 Brazil       1999 population  172006362
#>  7 Brazil       2000 cases           80488
#>  8 Brazil       2000 population  174504898
#>  9 China        1999 cases          212258
#> 10 China        1999 population 1272915272
#> 11 China        2000 cases          213766
#> 12 China        2000 population 1280428583

df_long %>% 
  spread(key = type, value = count)
#> # A tibble: 6 x 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583
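
Because spread() roughly undoes gather(), a round trip is a quick sanity check on a reshape. The sketch below uses a hand-typed copy of the country/year table from the gather section; the names original and round_trip are illustrative.

```r
library(dplyr)
library(tidyr)

# Hand-typed copy of the wide country/year table
original <- tibble(
  country = c("Afghanistan", "Brazil", "China"),
  `1999` = c(745L, 37737L, 212258L),
  `2000` = c(2666L, 80488L, 213766L)
)

round_trip <- original %>%
  # Wide to long: year labels move from column names into a key column
  gather(key = "year", value = "cases", -country) %>%
  # Long to wide: the key column is turned back into column names
  spread(key = year, value = cases)

# Compare the result with the starting table
all.equal(as.data.frame(round_trip), as.data.frame(original))
```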

separate: split multiple values stored in one cell

data %>% separate(col, into, sep)

This commonly happens with data that originated in Excel.

df_to_sep <- read_csv("http://bit.ly/sep-raw")
df_to_sep
#> # A tibble: 6 x 3
#>   country      year rate             
#>   <chr>       <int> <chr>            
#> 1 Afghanistan  1999 745/19987071     
#> 2 Afghanistan  2000 2666/20595360    
#> 3 Brazil       1999 37737/172006362  
#> 4 Brazil       2000 80488/174504898  
#> 5 China        1999 212258/1272915272
#> 6 China        2000 213766/1280428583

df_to_sep %>% 
  separate(rate, into = c("cases", "pop"), sep = "/")
#> # A tibble: 6 x 4
#>   country      year cases  pop       
#>   <chr>       <int> <chr>  <chr>     
#> 1 Afghanistan  1999 745    19987071  
#> 2 Afghanistan  2000 2666   20595360  
#> 3 Brazil       1999 37737  172006362 
#> 4 Brazil       2000 80488  174504898 
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583
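
Note that cases and pop are still character columns after the split. A minimal sketch of one fix, assuming the pieces are clean numbers, is separate()'s convert argument. The rows below are a hand-typed subset of the table above, and df_sep_local and df_split are illustrative names.

```r
library(dplyr)
library(tidyr)

# Hand-typed subset of the rate table
df_sep_local <- tibble(
  country = c("Afghanistan", "Brazil", "China"),
  year = c(1999L, 1999L, 1999L),
  rate = c("745/19987071", "37737/172006362", "212258/1272915272")
)

# convert = TRUE runs type conversion on the new columns,
# so the pieces come back as numbers rather than strings
df_split <- df_sep_local %>%
  separate(rate, into = c("cases", "pop"), sep = "/", convert = TRUE)

sapply(df_split, class)
```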

data %>% separate_rows(col, sep) is roughly equivalent to separate() followed by gather()

df_to_sep %>% 
  separate_rows(rate, sep = "/")
#> # A tibble: 12 x 3
#>    country      year rate      
#>    <chr>       <int> <chr>     
#>  1 Afghanistan  1999 745       
#>  2 Afghanistan  1999 19987071  
#>  3 Afghanistan  2000 2666      
#>  4 Afghanistan  2000 20595360  
#>  5 Brazil       1999 37737     
#>  6 Brazil       1999 172006362 
#>  7 Brazil       2000 80488     
#>  8 Brazil       2000 174504898 
#>  9 China        1999 212258    
#> 10 China        1999 1272915272
#> 11 China        2000 213766    
#> 12 China        2000 1280428583
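
The equivalence between separate_rows() and separate() followed by gather() can be checked on a small table. This is a sketch under assumptions: the rows are a hand-typed subset of the table above, and the names df_small, via_rows, via_gather and the key column type are illustrative.

```r
library(dplyr)
library(tidyr)

# Hand-typed subset of the rate table, already sorted by country and year
df_small <- tibble(
  country = c("Afghanistan", "Afghanistan", "Brazil", "Brazil"),
  year = c(1999L, 2000L, 1999L, 2000L),
  rate = c("745/19987071", "2666/20595360",
           "37737/172006362", "80488/174504898")
)

# Route 1: split each cell directly into multiple rows
via_rows <- df_small %>%
  separate_rows(rate, sep = "/")

# Route 2: split into two columns, then stack them back into one
via_gather <- df_small %>%
  separate(rate, into = c("cases", "pop"), sep = "/") %>%
  gather(key = "type", value = "rate", cases, pop) %>%
  arrange(country, year, type)

# Both routes yield the same values in the same order
identical(via_rows$rate, via_gather$rate)
```

The one practical difference is that the separate() and gather() route keeps a key column recording which piece each value was, while separate_rows() leaves that to be inferred from row order.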

Case 1: column names are values (dataset: pew)