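Introduction to ggplot2
- ggplot2 is a powerful tool for data exploration and visualization, developed by Hadley Wickham, the author of many of the most influential R packages
- Every plotting function follows an underlying visualization logic, the Grammar of Graphics
- The Grammar of Graphics helps us break a chart down into individual elements and then work on each element logically, so that the chart does its job correctly and simply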
Learn to plot from one example: the mpg 🚗 fuel economy data
mpg dataset: Fuel economy data from 1999 and 2008 for 38 popular models of car.
variable | detail |
---|---|
manufacturer | manufacturer |
model | model name |
displ | engine displacement, in litres |
year | year of manufacture |
cyl | number of cylinders |
trans | type of transmission (automatic/manual) |
drv | f = front-wheel drive, r = rear wheel drive, 4 = 4wd |
cty | city miles per gallon |
hwy | highway miles per gallon |
fl | fuel type: ethanol E85, diesel, regular, premium, CNG |
class | type of car |
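Learn to plot from one example: mpg
Start with two variables:
- displ: engine displacement, in litres
- hwy: highway fuel efficiency, in miles per gallon
- Do cars with bigger engines use more fuel? If so, how much more?
- What is the relationship between engine size and fuel efficiency? Positive or negative? Linear or nonlinear? How strong?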
Scatterplots
library(ggplot2)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
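A few conclusions can be drawn from the plot:
- The two variables are strongly negatively correlated: bigger engine => lower efficiency
- Some cars are outliers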
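To put a number on "strongly negatively correlated", one option is the Pearson correlation between the two columns. This is only a rough sketch: cor() measures linear association, so it says nothing about the nonlinearity or the outliers visible in the plot.

```r
library(ggplot2)  # only needed for the mpg dataset

# Pearson correlation between engine displacement and highway mpg
r <- cor(mpg$displ, mpg$hwy)
r  # negative: larger engines go with lower highway mpg
```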
Exercise: draw different plots with the mpg data
Look at how different variables relate to each other
- Draw a scatterplot: hwy vs cyl
- Draw a scatterplot: class vs drv
Answer-1
## hwy vs cyl
ggplot(data = mpg) +
geom_point(mapping = aes(x = hwy, y = cyl))
Answer-2
## class vs drv
ggplot(data = mpg) +
geom_point(mapping = aes(x = class, y = drv))
Aesthetic Mapping
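- The most important concept in the Grammar of Graphics is "Aesthetic Mapping"
- Before plotting, let's practice spotting aesthetics with our eyes 👀
Exercise: observe the Aesthetic Mapping
- Which variables are there?
- Which aesthetic does each one map to?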
Aesthetics basics 1
Add a third aesthetic to the two-dimensional x-y scatterplot
- x = displ
- y = hwy
- color = class (map class to the color of the points)
- hint: ?geom_point lists the supported aesthetics
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))
Aesthetics basics 2
- x = displ
- y = hwy
- shape = class
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = class))
Plot Exercise
Try adding a third aesthetic to the two-dimensional x-y scatterplot
- Map class to the shape of the points
Answer
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = class))
#> Warning: The shape palette can deal with a maximum of 6 discrete values
#> because more than 6 becomes difficult to discriminate; you have 7.
#> Consider specifying shapes manually if you must have them.
#> Warning: Removed 62 rows containing missing values (geom_point).
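The warning above suggests specifying shapes manually. A minimal sketch of that, assuming the point shape codes 0 to 6 are acceptable for the seven classes (the choice of codes is arbitrary and not from the original slides):

```r
library(ggplot2)

p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class)) +
  scale_shape_manual(values = 0:6)  # one shape code per class (7 classes)
p
```

With seven shapes supplied, no class is left without a shape, so no rows should be dropped. Whether seven shapes are still easy to tell apart is a separate question, which is what the warning is hinting at.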
Aesthetic Mappings summary
ggplot(data = <DATA>) + # Data
  geom_<xxx>(
    mapping = aes(<MAPPINGS>), ## <= Aesthetic mappings
    stat = <STAT>,
    position = <POSITION>
  ) +
  scale_<xxx>() + coord_<xxx>() + facet_<xxx>() +
  theme_<xxx>()
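aes() can be placed:
- inside ggplot(): it is "remembered" (becomes the default for all layers)
- outside, added as + aes(): it is "remembered" (becomes the default for all layers)
- inside geom_<xxx>(): it is not "remembered" (applies only to that geom)
- geom_<xxx>(inherit.aes = FALSE): overrides the default aesthetics.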
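A small sketch of the difference, using the mpg scatterplot from earlier. The object names p_global and p_local are made up for illustration.

```r
library(ggplot2)

# aes() inside ggplot(): every layer inherits x and y
p_global <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

# aes() inside geom_point(): only that layer knows x and y,
# so another geom added later would need its own mapping
p_local <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

names(p_global$mapping)  # the plot-level default mapping
names(p_local$mapping)   # empty: nothing is "remembered"
```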
Static Aesthetic
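Sometimes you just want to set an aesthetic to a fixed value by hand. Such a setting is purely cosmetic and carries no extra information about the data.
- Put the aesthetic inside aes(): it is mapped to a variable and a legend is created automatically
- Put the aesthetic outside aes(): it is set by hand to a fixed value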
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
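A common slip is to put the fixed value inside aes(). The sketch below contrasts the two cases; the names p_set and p_mapped are illustrative only.

```r
library(ggplot2)

# Outside aes(): every point is drawn in blue, no legend
p_set <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# Inside aes(): "blue" is treated as a data value (a constant category),
# so ggplot2 picks a color from its default scale and adds a legend
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
```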
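ggplot()
There are two main plotting functions in ggplot2:
- qplot(): (quick plot) only for when you need a quick plot; it is used much like plot(), R's built-in plotting function (deprecated since ggplot2 3.4.0)
- ggplot(): the recommended way to plot; combine it with the other functions of the plotting workflow to build up the layers step by step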
ggplot2 starter template
ggplot(data = <DATA>) + # Data
  geom_<xxx>(
    mapping = aes(<MAPPINGS>),
    stat = <STAT>,
    position = <POSITION>
  ) + # Layers & Aesthetic mappings
  scale_<xxx>() + coord_<xxx>() + facet_<xxx>() + # Position
  theme_<xxx>()
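Data for Plot – ETL
- Every column is a (plotting) variable
- Every row is an observation
- Wide format -> long format (tidyr)
Data and charts are two sides of the same coin: the data comes first, then the chart
Take mpg as an example
- mpg has 11 variables and 234 observations
- The plotting variables (aesthetic mappings) needed here:
  - x: displ
  - y: hwy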
#> # A tibble: 234 x 11
#> manufacturer model displ year cyl trans drv cty hwy fl
#> <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr>
#> 1 audi a4 1.8 1999 4 auto(l… f 18 29 p
#> 2 audi a4 1.8 1999 4 manual… f 21 29 p
#> 3 audi a4 2 2008 4 manual… f 20 31 p
#> 4 audi a4 2 2008 4 auto(a… f 21 30 p
#> 5 audi a4 2.8 1999 6 auto(l… f 16 26 p
#> 6 audi a4 2.8 1999 6 manual… f 18 26 p
#> 7 audi a4 3.1 2008 6 auto(a… f 18 27 p
#> 8 audi a4 quat… 1.8 1999 4 manual… 4 18 26 p
#> 9 audi a4 quat… 1.8 1999 4 auto(l… 4 16 25 p
#> 10 audi a4 quat… 2 2008 4 manual… 4 20 28 p
#> # ... with 224 more rows, and 1 more variable: class <chr>
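As a sketch of the wide-to-long step mentioned above, cty and hwy in mpg can be seen as one measurement (miles per gallon) spread over two columns. tidyr::pivot_longer() stacks them into one column; the new column names road_type and miles_per_gallon are chosen here for illustration.

```r
library(ggplot2)  # for the mpg dataset
library(tidyr)

mpg_long <- pivot_longer(
  mpg[, c("manufacturer", "model", "cty", "hwy")],
  cols = c(cty, hwy),
  names_to = "road_type",         # holds "cty" or "hwy"
  values_to = "miles_per_gallon"  # holds the measured value
)

# Each original row becomes two rows, one per road type
nrow(mpg_long)
```

In the long format, road_type can then be mapped to an aesthetic such as color, or used as a facet.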
Geoms
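What is the difference between these two plots?
- A geom determines the "geometric object" the chart uses, that is, the way you see the data represented: geom_<xxx>()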
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy))
Different geom:
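- There are far too many geoms to memorize, so look them up when you need them: RStudio - ggplot2 Cheatsheet
- As mentioned earlier, different geoms support different aesthetics
Bar Charts
- How many cars are there of each type (class)? geom_bar()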
ggplot(data = mpg) +
geom_bar(mapping = aes(x = class))
The link between tables and visualization
Before plotting, you should probably first picture a table like this:
class | n |
---|---|
2seater | 5 |
compact | 47 |
midsize | 41 |
minivan | 11 |
pickup | 33 |
subcompact | 35 |
suv | 62 |
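- A table is itself a form of visualization; sometimes a good table already makes the information clear
- A plot simply emphasizes that information further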
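Stats (the other side of a Geom)
- x: class
- y: ?? count is not in the original mpg data
How exactly is count computed?
- You might have computed it in Excel
- Let R compute it: dplyr::summarise()
- Let ggplot2 geom_bar compute it? The default stat of geom_bar is "count"
- Some geoms (e.g. scatterplot) plot the raw values (stat_identity)
- Some geoms (e.g. geom_bar()) automatically compute a new stat (e.g., count) for plotting
- When using geom_<xxx>(), check what its default stat is
- To skip the stat computation: stat = "identity"
How are stats computed for you?
stat_<xxx>()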
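One way to look at what the stat computed is ggplot2::layer_data(), which returns the data a layer actually draws. A minimal sketch:

```r
library(ggplot2)

p <- ggplot(data = mpg) +
  geom_bar(mapping = aes(x = class))

# Data after stat = "count" has run: one row per class
computed <- layer_data(p)
computed[, c("x", "count")]
```

The count column here is the y value that geom_bar() draws, even though mpg itself has no such column.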
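Compute the stats by hand before plotting: dplyr
The five main verbs
- mutate(): compute (new or existing) columns
- select(): pick columns
- filter(): keep the rows that match a condition
- summarise(): collapse many values into a single value (mean, sum, …)
- arrange(): sort rows
- group_by(): used all the time
- SQL where == filter()
- SQL group by + count == group_by() %>% tally(sort = FALSE)
- SQL group by + mean == group_by() %>% summarize(mean_x = mean(x))
- SQL order by == arrange()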
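Putting the verbs together on mpg, as a sketch. The question asked here (average highway mpg of SUVs per manufacturer) is just an example, and suv_hwy is an illustrative name.

```r
library(dplyr)
library(ggplot2)  # for the mpg dataset

suv_hwy <- mpg %>%
  filter(class == "suv") %>%                    # SQL: where
  group_by(manufacturer) %>%                    # SQL: group by
  summarise(mean_hwy = mean(hwy), n = n()) %>%  # SQL: avg(), count(*)
  arrange(desc(mean_hwy))                       # SQL: order by ... desc
suv_hwy
```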
Exercise
- Plot a bar chart of cut in the diamonds dataset
- Use dplyr::summarise() to compute the count variable, then plot it with ggplot2
hint: stat = "identity"
library(dplyr)  # provides %>%, group_by(), summarise(), tally()
# Count the rows in each class
d <- mpg %>%
  group_by(class) %>%
  summarise(n = n())
# Equivalently, with tally():
d <- mpg %>%
group_by(class) %>%
tally()
d
#> # A tibble: 7 x 2
#> class n
#> <chr> <int>
#> 1 2seater 5
#> 2 compact 47
#> 3 midsize 41
#> 4 minivan 11
#> 5 pickup 33
#> 6 subcompact 35
#> 7 suv 62
ggplot(data = d) +
geom_bar(mapping = aes(x = class, y = n),
stat = "identity")
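Exercise: for more complex questions you sometimes need to compute the stats by hand first
- Draw a bar chart of the average fuel efficiency of each car type (class)
hint: dplyr::group_by(), dplyr::summarise(mean(xxx)), geom_bar(stat = "identity")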
#> # A tibble: 7 x 2
#> class mean_hwy
#> <chr> <dbl>
#> 1 2seater 24.8
#> 2 compact 28.3
#> 3 midsize 27.3
#> 4 minivan 22.4
#> 5 pickup 16.9
#> 6 subcompact 28.1
#> 7 suv 18.1
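One possible solution, following the hint. This is a sketch rather than the only answer; mean_hwy_by_class is an illustrative name.

```r
library(dplyr)
library(ggplot2)

# Step 1: compute the stat by hand, one mean per class
mean_hwy_by_class <- mpg %>%
  group_by(class) %>%
  summarise(mean_hwy = mean(hwy))

# Step 2: plot the precomputed values as they are
ggplot(data = mean_hwy_by_class) +
  geom_bar(mapping = aes(x = class, y = mean_hwy), stat = "identity")
```

geom_col() is a shorthand for geom_bar(stat = "identity") and would work equally well in step 2.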
An unsorted bar chart is hard to read
How do we sort it?
- Compute the order by hand first: reorder(<variable to sort>, <variable to sort by>)
d <- mpg %>%
group_by(class) %>%
summarise(n = n())
d
#> # A tibble: 7 x 2
#> class n
#> <chr> <int>
#> 1 2seater 5
#> 2 compact 47
#> 3 midsize 41
#> 4 minivan 11
#> 5 pickup 33
#> 6 subcompact 35
#> 7 suv 62
ggplot(data = d) +
geom_bar(mapping = aes(x = reorder(class, -n), y = n),
stat = "identity")
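To see what reorder() does on its own, the sketch below inspects the factor it returns. It uses dplyr::count() as a shortcut for the group_by() plus tally() step; ordered_class is an illustrative name.

```r
library(dplyr)
library(ggplot2)  # for the mpg dataset

d <- mpg %>% count(class)  # same result as group_by(class) %>% tally()

# reorder() returns a factor whose levels follow the second argument;
# -n sorts from the largest count to the smallest
ordered_class <- reorder(d$class, -d$n)
levels(ordered_class)
```

ggplot2 draws a discrete x axis in level order, which is why wrapping class in reorder() inside aes() sorts the bars.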
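Geoms + Stats examples
- bar charts, histograms: count the number of observations in each bin.
ggplot(data = mpg, aes(x = class)) +
  geom_bar() +
  # after_stat(count) is the count computed by the stat
  # (ggplot2 >= 3.3.0; replaces the older ..count.. notation)
  geom_text(stat = "count",
            aes(label = after_stat(count), y = after_stat(count)),
            vjust = "bottom")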
- boxplots: plot quartiles. (Look at the distribution of fuel efficiency for each car type)
ggplot(data = mpg) +
geom_boxplot(mapping = aes(x = class, y = hwy))
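Layers
How do we show several geometric objects (geoms) at once?
- Each geom_<xxx>() draws one layer on the plot
- Layers can be stacked one on top of another
- Each layer can even use different data, which is very common in more advanced charts
- But watch out for default aesthetics being mapped onto that layer unintentionally
Two layers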
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
geom_smooth(mapping= aes(x = displ, y = hwy))
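A sketch of a layer that uses its own data, as mentioned in the list above. Highlighting the two-seaters is just an example, and two_seaters is an illustrative name.

```r
library(ggplot2)

# A second data frame: only the two-seater cars
two_seaters <- subset(mpg, class == "2seater")

p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +                      # layer 1: all cars
  geom_point(data = two_seaters,      # layer 2: its own data,
             color = "red", size = 3) # still inherits x and y
p
```

The second layer inherits x and y from ggplot(), which is convenient here but is also the kind of default mapping to watch out for when the second data frame has different columns.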
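(Appendix) Positions: what to do when shapes fight over the same position?
Try Bar Charts
- Fill color: fill (a bad example: mapping the same variable to more than one aesthetic is not recommended)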
ggplot(data = mpg) +
geom_bar(mapping = aes(x = class, fill = class))
geom_bar(position = ?)
?geom_bar
- Stacked: position = "stack" (default)
ggplot(data = mpg) +
geom_bar(mapping = aes(x = class, fill = manufacturer),
position = "stack") +
ggtitle('Position = "stack"')
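position:
- "identity": same position (covers whatever is drawn behind)
- "stack": stacked
- "dodge": side by side
- "fill": stacked and scaled to 100%
- "jitter": points are nudged randomly so that they avoid each other
position = "identity"
- Same position (covers whatever is drawn behind)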
ggplot(data = mpg) +
geom_bar(mapping = aes(x = class, fill = manufacturer),
position = "identity", alpha = .4) +
ggtitle('Position = "identity"')
position = "dodge"
- Side by side
ggplot(data = mpg) +
geom_bar(mapping = aes(x = class, fill = manufacturer),
position = "dodge") +
ggtitle('Position = "dodge"')
position = "fill"
- Stacked and scaled to 100%
ggplot(data = mpg) +
geom_bar(mapping = aes(x = class, fill = manufacturer),
position = "fill") +
ggtitle('Position = "fill"')
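The "jitter" position from the list above applies to points rather than bars. A minimal sketch on the earlier scatterplot (p_jitter is an illustrative name):

```r
library(ggplot2)

# Many cars share the same (displ, hwy) values, so points overlap;
# position = "jitter" adds a little random noise to separate them
p_jitter <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")
p_jitter
```

Because the noise is random, the plot looks slightly different on each run, and the drawn positions are no longer the exact data values.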
Facets: Small-Multiples
Too many variables!!!!
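- Looking at the car fuel economy 🚗 plots so far, don't you find it is still hard to understand the data from the charts?
- The plots we just drew added a third variable, which made them even harder to read
- Facets are a very important way to present data and well worth learning
- Why use facets?
- When too many variables are crammed into a single coordinate plane, the brain gets overloaded
- Splitting up the information lets the brain fill in the gaps more efficiently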
“Illustrations of postage-stamp size are indexed by category or a label, sequenced over time like the frames of a movie, or ordered by a quantitative variable not used in the single image itself.” – Edward Tufte
Facet Exercise
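How many cars of each type (class) does each manufacturer have?
Think in terms of a table first
- How should you lay out your (visual) table so that others can read it clearly?
- Then put the table onto a chart
How should the three variables be placed:
- manufacturer
- class
- count
1. Long format: the table we need for plotting with ggplot2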
manufacturer | class | n |
---|---|---|
audi | compact | 15 |
audi | midsize | 3 |
chevrolet | 2seater | 5 |
chevrolet | midsize | 5 |
chevrolet | suv | 9 |
dodge | minivan | 11 |
dodge | pickup | 19 |
dodge | suv | 7 |
ford | pickup | 7 |
ford | subcompact | 9 |
2. A table that is clear at a glance
- Pivot table
manufacturer | 2seater | compact | midsize | minivan | pickup | subcompact | suv |
---|---|---|---|---|---|---|---|
audi | 0 | 15 | 3 | 0 | 0 | 0 | 0 |
chevrolet | 5 | 0 | 5 | 0 | 0 | 0 | 9 |
dodge | 0 | 0 | 0 | 11 | 19 | 0 | 7 |
ford | 0 | 0 | 0 | 0 | 7 | 9 | 9 |
honda | 0 | 0 | 0 | 0 | 0 | 9 | 0 |
hyundai | 0 | 0 | 7 | 0 | 0 | 7 | 0 |
jeep | 0 | 0 | 0 | 0 | 0 | 0 | 8 |
land rover | 0 | 0 | 0 | 0 | 0 | 0 | 4 |
lincoln | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
mercury | 0 | 0 | 0 | 0 | 0 | 0 | 4 |
nissan | 0 | 2 | 7 | 0 | 0 | 0 | 4 |
pontiac | 0 | 0 | 5 | 0 | 0 | 0 | 0 |
subaru | 0 | 4 | 0 | 0 | 0 | 4 | 6 |
toyota | 0 | 12 | 7 | 0 | 7 | 0 | 8 |
volkswagen | 0 | 14 | 7 | 0 | 0 | 6 | 0 |
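Both tables can be produced from mpg. The sketch below assumes dplyr and tidyr are installed; long_counts and wide_counts are illustrative names.

```r
library(dplyr)
library(tidyr)
library(ggplot2)  # for the mpg dataset

# 1. Long format: one row per (manufacturer, class) pair that occurs
long_counts <- mpg %>% count(manufacturer, class)

# 2. Pivot table: one column per class, missing pairs filled with 0
wide_counts <- long_counts %>%
  pivot_wider(names_from = class, values_from = n,
              values_fill = 0, names_sort = TRUE)
```

The long table is the one to hand to ggplot(); the wide one is for reading by eye.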
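3. Let visualization make the information clearer: facet_wrap()
Draw a bar chart of the number of cars of each type (class) for each manufacturer
- x: class
- y: count
- fill: a redundant mapping, but added anyway to emphasize the different car types
- facet: manufacturer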
ggplot(data = mpg) +
geom_bar(mapping = aes(x = class, fill = class)) +
facet_wrap( ~ manufacturer, ncol = 4) +
scale_fill_viridis_d() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
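Facet summary
- Rule of thumb: a single coordinate plane (panel) should hold no more than three variables
- Don't overplot
- Split out categorical (nominal) variables into separate small panels (facets)
When there are many variables, faceting is your best friend!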
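The same idea applied to the earlier displ vs hwy scatterplot, as a sketch: class moves from the color aesthetic into its own panels. The layout nrow = 2 is an arbitrary choice.

```r
library(ggplot2)

# Instead of mapping class to color, give each class its own panel
p_facets <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, nrow = 2)
p_facets
```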
4. Another approach: a table-like plot, for example
mpg %>%
group_by(manufacturer, class) %>%
tally() %>%
ggplot(aes(x = class, y = manufacturer, fill = n)) +
geom_tile() +
geom_text(aes(label = n)) +
scale_fill_distiller(direction = 1) +
scale_x_discrete(position = "top") +
ggtitle("Number of car class by manufacturers")
Labels
A chart must have a title so that others know what story you are telling
Title
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth() +
ggtitle("Fuel efficiency vs. Engine size")
Title + axis labels
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth() +
ggtitle("Fuel efficiency vs. Engine size") +
xlab("Engine displacement (L)") +
ylab("Highway fuel efficiency (mpg)")
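The title and axis labels can also be set in a single call with labs(). A sketch that should produce the same labels as the ggtitle() + xlab() + ylab() version:

```r
library(ggplot2)

p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +
  geom_smooth() +
  labs(
    title = "Fuel efficiency vs. Engine size",
    x = "Engine displacement (L)",
    y = "Highway fuel efficiency (mpg)"
  )
```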
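Learning from copying: start by copying other people's charts
- Google: "chart name + R"
- To get fluent, look at good charts by others regularly, so that you know where to start looking when you need one
- Google is your best friend when learning to plot
- Visualization resources list
Export Plots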
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth() +
theme_bw()
ggsave(filename = "my_plot.png",
       plot = p,
       device = "png",
       width = 3, height = 2,  # in inches (the default unit)
       type = "cairo")
ggplot2 visualization workflow summary
The ggplot2 plotting workflow
- Data (noun/subject)
- Aesthetic mappings (adjectives): x, y, color, size, …
- Layers: Geom (verb), Stat (adverb)
- Position (preposition): Scales, Coord, Facet
- Theme
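HW: draw your own chart
- Pick a chart you usually make, together with its data, and fill it into the template above
- Try to draw it with ggplot2
  - Import the data
  - Preprocess the data
  - Plot
  - Export as png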
Resources: visualization resources
Data Cleansing (ETL)
dplyr
tidyr
broom
ggplot2 Cookbook and Documentation
- Official ggplot2 documentation: complete examples and details
- Cookbook for R - Graphs: the other learning resources on this site are also highly recommended
ggplot2 plotting helpers
- RColorBrewer: color palettes
- ColorBrewer by PennState University: Web tool for guidance in choosing choropleth map color schemes, based on the research of Dr. Cynthia Brewer.
- cowplot: combining plots
- ggrepel: labeling ggplot
- directlabels: labeling ggplot (has few bugs)
- lemon: Refresh axis lines, facets, pointpath
ggplot2 extension packages
Cheatsheet
Other Viz Packages
R Plot Galleries
Other Plot Galleries
- The Data Visualisation Catalogue: examples of many chart types
- The Economist - Graphic detail: see how The Economist tells stories with charts