Blog

Free Stock Image and Icon Resource List

Jun 20, 2016

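A list of free stock image and icon resources for making slides. I use them in combination; they are listed below in order of how often I use them, with short notes from my own experience to help pick one quickly for a given situation.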

icon

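What to look for: available as a package, easy to search, recolorable backgrounds, convenient to use directly

Noun Project: black and white, easy to search
flaticon: wide variety, adjustable colors
Icon Finder: wide variety
Flat Icon Design (Japan): high quality, backgrounds available in different color styles, but easier to search in Japanese
ICONSDB: limited variety, adjustable colors
pictogram2
Google Material Icon: SVG, icon font
visualhunt: find images by color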

Reference

簡報藝術烘焙坊

Stock images

Unsplash: high-resolution photos
finda.photo: browse by color, collection, or source
STOCK UP: Searching 14,364 free stock photos across 28 websites
Visual Hunt: currently the stock image site most recommended by 大學生玩簡報
istockphoto: professional paid images

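Color palettes

For data visualization

Colorbrewer: RColorBrewer, maps

For design color schemes

Manual palette picker: highly recommended by Dataman
Palettes: a curated set of palettes, small in number

Browse other people's designs

Pick colors from an uploaded image

Templates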

Pinterest.com

Online image editing

Canva: online poster editor
Pixlr: online editor
Background Burner: background removal
Fotor: ready-made effect modules

[R] String Manipulation: Regular Expressions and the stringr Package

Jun 15, 2016

The stringr package provides handy functions that can replace the clunkier base R string manipulation functions.

Functions for string manipulation in base R and the `stringr` package

stringr	base	Description
`str_match`	`regmatches` + `regexec`	Extract matched groups from a string
`str_match_all`	`regmatches` + `gregexec` (R >= 4.1)	Extract matched groups from a string (globally)
`str_replace`	`sub`	Replace the first matched pattern in a string
`str_replace_all`	`gsub`	Replace all matched patterns in a string
`str_detect`	`grepl`	Detect the presence or absence of a pattern in a string
`str_subset`	`grep(value = TRUE)`	Keep the elements that match a pattern, i.e. `x[str_detect(x, pattern)]`
`str_split`	`strsplit`	Split up a string into pieces
`str_length`	`nchar`	The length of a string
`str_sub`	`substr`	Extract and replace substrings from a character vector
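A few of the pairs in the table can be compared side by side. The following is a minimal sketch, assuming stringr is installed; the vector `x` and the patterns are made up for illustration.

```r
library(stringr)

x <- c("apple-123", "banana-45", "cherry")

# Detect: which elements contain a digit?
detect_stringr <- str_detect(x, "[0-9]")
detect_base    <- grepl("[0-9]", x)

# Subset: keep only the matching elements
subset_stringr <- str_subset(x, "[0-9]")
subset_base    <- grep("[0-9]", x, value = TRUE)

# Replace: first match only
replace_stringr <- str_replace(x, "[0-9]", "#")
replace_base    <- sub("[0-9]", "#", x)

# Capture groups: str_match() returns a character matrix with the full
# match in column 1 and one column per group (NA where nothing matches)
m <- str_match(x, "([a-z]+)-([0-9]+)")

# Base R equivalent for a single string: a list of character vectors
groups_base <- regmatches(x[1], regexec("([a-z]+)-([0-9]+)", x[1]))[[1]]
```

The main practical difference is the argument order: stringr functions take the string first and the pattern second, which makes them easier to chain with `%>%`.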

A Tool for Viewing XML/HTML in RStudio: the xmlview Package

Jan 15, 2016


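When writing a crawler, you often need to inspect the HTML you have retrieved and extract the relevant blocks of data with XPath or CSS selectors. When scripting in an IDE, though, this means either printing the HTML text or saving it to an HTML file and opening it in a browser. And when testing XPath, the XML tree structure cannot be browsed clearly in the console, so I used to rely on Chrome's XPath Helper alongside.

The xmlview package provides a way to view XML interactively and test XPath within RStudio. Here is a simple XML example: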

# devtools::install_github("hrbrmstr/xmlview")
library(xml2)
library(xmlview)
library(magrittr)

## plain text XML
xml_view("<note><to>Dale</to><from>Chip</from><heading>Reminder</heading><body>Baby, don't forget tonight! xxxxx</body></note>")

Pass an XML string to the xml_view function to get the parsed display:

xml_view_test

Testing XPath

Using an arbitrary PTT post as an example, first fetch the page content and turn it into an xml_document object with read_html:

## read-in XML document
doc <- xml2::read_html("https://www.ptt.cc/bbs/Stock/M.1452818794.A.FEC.html",
                encoding = "UTF-8")
# xml_view(doc, add_filter = TRUE)

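Because read_html automatically converts the content to unmarked UTF-8, in my tests xml_view could not display it when it was passed in directly; it has to be converted to marked UTF-8 or to the system locale (e.g., Big5) first to display correctly. So here the xml_document is converted straight to character and its encoding is then adjusted: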

doc_string <- as.character(doc) %>% `Encoding<-`("UTF-8")
xml_view(doc_string, add_filter=TRUE)
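The difference between unmarked and marked UTF-8 can be seen with base R alone. This is a small sketch, not part of the original workflow: it builds the UTF-8 bytes of one character by hand so the string starts with no declared encoding.

```r
# The three UTF-8 bytes of the character U+4E2D, assembled from raw bytes
# so that R has no declared encoding for the resulting string
x <- rawToChar(as.raw(c(0xE4, 0xB8, 0xAD)))
enc_before <- Encoding(x)   # no encoding mark yet

# Mark the string as UTF-8. The bytes are unchanged; only the declared
# encoding is set. This is what `Encoding<-`("UTF-8") does in a pipe.
y <- `Encoding<-`(x, "UTF-8")
enc_after <- Encoding(y)

same_bytes <- identical(charToRaw(x), charToRaw(y))
```

Marking does not convert anything, so it is only appropriate when the bytes really are UTF-8, as they are for the output of read_html.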

xml_view_ptt_result

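Once passed to xml_view, the page content appears in RStudio's Viewer pane. Because of the add_filter=TRUE argument, an XPath input box appears at the top: type the XPath expression you want to test and press Enter to see the result immediately. You can also click the "R" icon to generate R code that is ready to copy and paste.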

xml_view_ptt_xpath

And finally we get the data we wanted!

xml_find_all(doc, '//span[@class="f3 push-content"]', ns=xml2::xml_ns(doc))

## or you want to use rvest package
# doc %>% rvest::html_nodes(xpath = '//span[@class="f3 push-content"]')

On Windows, though, the encoding issue still needs some extra handling:

doc %>%
  rvest::html_nodes(xpath = '//span[@class="f3 push-content"]') %>%
  rvest::html_text() %>%
  `Encoding<-`("UTF-8") %>%
  gsub("^: ", "",.)

## [1] "他，還有隱形眼鏡"
## [2] "她,還有雙鏡頭"
## [3] "蘋果怎麼可能容忍供應商EPS180但又不降價？"
## [4] "它，還有眼鏡蛇"
## [5] "不要看新聞做股票吧"
## [6] "外資要壓低吃貨囉"
## [7] "再爛股價還是會維持四位數啦"
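The same XPath step can be tried offline. The sketch below uses a made-up HTML fragment as a stand-in for a PTT page (only the class attribute is taken from the example above) and assumes xml2 is installed.

```r
library(xml2)

# Hypothetical stand-in for a PTT page: two comment spans that carry the
# same class attribute as in the XPath above, plus one unrelated span
html <- '<html><body>
  <div class="push"><span class="f3 push-content">: first comment</span></div>
  <div class="push"><span class="f3 push-content">: second comment</span></div>
  <span class="other">not a comment</span>
</body></html>'

doc_local <- read_html(html)

# Select the comment spans, take their text, and strip the leading ": "
nodes <- xml_find_all(doc_local, '//span[@class="f3 push-content"]')
comments <- gsub("^: ", "", xml_text(nodes))
```

Note that `@class="f3 push-content"` is an exact string comparison on the attribute, so a span with the classes in a different order would not match.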

[R crawler] 公開資訊觀測站 (Market Observation Post System): Implementation

Jan 05, 2016

In the previous post, [R crawler] 公開資訊觀測站 (觀察篇, the observation part), we located the data we need; the next step is to fetch it with R.

mops_post

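The crawling workflow can be split into two stages, Connection and Parsing. The packages used here are httr for the connection, and rvest and XML as parsers.

Image source: data-sci.info

Before starting, make sure the required packages are installed.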

library(magrittr)
library(httr)
library(rvest)
library(XML)  # readHTMLTable
library(dplyr) # data manipulation & pipe line
library(stringr)

Connection

Let's first try the query for「上市」(listed companies) and「水泥工業」(cement industry). The POST can be written directly like this:

res <- POST(
  "http://mops.twse.com.tw/mops/web/ajax_t51sb01",
  body = "encodeURIComponent=1&step=1&firstin=1&TYPEK=sii&code=01",
  encode = "form")

The content of the body argument is the form data.

As mentioned earlier, the form data can be seen in Chrome DevTools:

encodeURIComponent: 1
step: 1
firstin: 1
TYPEK: sii
code:

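To make it easier to swap parameters later, though, the body is written as a list here and the POST function is left to encode it. One thing to note: the default encode method, "multipart", can handle file uploads or strings, but it produces an error in this case; so it is set to "form" here, because this form data is a plain escaped string.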

res <- POST(
  "http://mops.twse.com.tw/mops/web/ajax_t51sb01",
  body = list(
    encodeURIComponent = 1,
    step = 1,
    firstin = 1,
    TYPEK = "sii",
    code = "01"
  ),
  encode = "form"
)
# ?httr::POST
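To see how the list form relates to the string form used earlier, the encoding can be imitated in base R. `form_encode` below is a hypothetical helper written for illustration, not a function from httr, and it is only a rough approximation (for instance, it encodes spaces as %20 rather than +).

```r
# Hypothetical helper (not part of httr): build an
# application/x-www-form-urlencoded string from a named list
form_encode <- function(fields) {
  keys <- vapply(names(fields), utils::URLencode, character(1),
                 reserved = TRUE)
  values <- vapply(
    fields,
    function(v) utils::URLencode(as.character(v), reserved = TRUE),
    character(1)
  )
  paste(paste0(keys, "=", values), collapse = "&")
}

body <- list(
  encodeURIComponent = 1,
  step = 1,
  firstin = 1,
  TYPEK = "sii",
  code = "01"
)

encoded <- form_encode(body)
```

For this body the result is the same string that was passed to POST directly in the first version, which is why the two calls are interchangeable here.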

Check whether the connection succeeded:

print(res)

# Response [http://mops.twse.com.tw/mops/web/ajax_t51sb01]
#   Date: 2016-01-06 13:34
#   Status: 200
#   Content-Type: text/html
#   Size: 18.2 kB

Parsing

Once we have the response, first make sure it contains what we want. Use the content function to look at the response content; if all is well, it should match what Chrome DevTools shows in the Response tab.

res_text <- content(res, "text", encoding = "UTF-8") %>%
  `Encoding<-`("UTF-8")  # Windows encoding issue
res_text

[1] "\r\n\r\n<html>\r\n<head>\r\n\t<title>公開資訊觀測站</title>\r\n<!--\t<link href=\"css/css1.css\" rel=\"stylesheet\" type=\"text/css\" Media=\"Screen\"/> -->\r\n<!--\t<script type=\"text/javascript\" src=\"js/mops1.js\"></script> -->\r\n</head>\r\n\r\n<body>\r\n<table class='noBorder'>\n<tr><td align='right'>\n<form action='/server-java/t56ques' method='post'>\n<input type='hidden' name='step' value='0'>\n<input type='hidden' name='Market' value='sii'>\n<input type='hidden' name='SysName' value='公司彙總報表'>\n<input type='hidden' name='reportName' value='公司基本資料查詢彙總表'>\n<input type='hidden' name='colorchg' value=''>\n<input type='hidden' name='QNum' value='1'>\n<input type='hidden' name='Q1N' value='公司類別'>\n<input type='hidden' name='Q1V' value='???d?u·~                '>\n</form></td></tr></table>...

Parsing html table

html_table() in the rvest package makes it easy to extract tables. First use read_html() to turn the HTML string into an xml_document, then select the table we want with html_nodes(), and finally extract the table:

res_text <- content(res, as = "text", encoding = "UTF-8")
dt <- res_text %>%
  read_html(encoding = "UTF-8") %>%
  html_nodes(xpath = "//table[2]") %>%
  html_table(header=TRUE) %>%
  .[[1]]

If you are unlucky enough to be on Windows, however, html_table runs into problems because of encoding issues, so readHTMLTable() from the XML package can be used instead:

## Windows
dt <- res_text %>%
  read_html(encoding = "UTF-8") %>%
  html_nodes(xpath = "//table[2]") %>%
  as.character %>%
  XML::readHTMLTable(encoding = "UTF-8") %>%
  .[[1]]

Take a look at what the data looks like:

View(dt)
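The table-extraction pipeline can be tried without a network connection. The sketch below assumes rvest is installed and uses a made-up HTML string with two sibling tables; the column names and values are placeholders, not MOPS data.

```r
library(rvest)

# Made-up page: the first table is layout only, the second holds the data
html <- '<html><body>
  <table><tr><td>layout table, not the data</td></tr></table>
  <table>
    <tr><th>id</th><th>name</th></tr>
    <tr><td>1101</td><td>Alpha</td></tr>
    <tr><td>1102</td><td>Beta</td></tr>
  </table>
</body></html>'

dt_local <- read_html(html) %>%
  html_nodes(xpath = "//table[2]") %>%   # second <table> under the same parent
  html_table(header = TRUE) %>%          # returns a list of tables
  .[[1]]                                 # take the only element
```

In XPath, `//table[2]` means "a table that is the second table child of its parent", which is why it skips the layout table here.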

Refactor

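To go further and fetch data for different markets and industries, we first need the code for each query option so that it can be substituted into the POST form data.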

mops_xpat

First build the key-value vector for the market types:

# different post data values

res_doc <- GET("http://mops.twse.com.tw/mops/web/t51sb01") %>%
  content(type="text", encoding = "UTF-8") %>%
  read_html(encoding = "UTF-8")

market_type <- setNames(
  res_doc %>%
    html_node(xpath = "//select[@name='TYPEK']") %>%
    html_children %>%
    html_attr("value"),
  res_doc %>%
    html_node(xpath = "//select[@name='TYPEK']") %>%
    html_children %>%
    html_text()
)

market_type
#    上市     上櫃     興櫃 公開發行
#   "sii"    "otc"   "rotc"    "pub"

industry_type <- setNames(
  res_doc %>%
    html_node(xpath = "//select[@name='code']") %>%
    html_children %>%
    html_attr("value"),
  res_doc %>%
    html_node(xpath = "//select[@name='code']") %>%
    html_children %>%
    html_text()
)[-1]

industry_type
# 水泥工業         食品工業         塑膠工業         紡織纖維
# "01"             "02"             "03"             "04"
# 電機機械         電器電纜         化學工業       生技醫療業
# "05"             "06"             "21"             "22"
# 化學生技醫療         玻璃陶瓷         造紙工業         鋼鐵工業
# "07"             "08"             "09"             "10"
# 橡膠工業         汽車工業         半導體業 電腦及週邊設備業
# "11"             "12"             "24"             "25"
# 光電業       通信網路業     電子零組件業       電子通路業
# "26"             "27"             "28"             "29"
# 資訊服務業       其他電子業         電子工業       油電燃氣業
# "30"             "31"             "13"             "23"
# 建材營造           航運業         觀光事業       金融保險業
# "14"             "15"             "16"             "17"
# 貿易百貨         綜合企業             其他         存託憑證
# "18"             "19"             "20"             "91"
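The two setNames blocks above repeat the same steps, so they could be folded into one function. `select_options` is a hypothetical helper name, and the inline HTML is a made-up stand-in for the query page so that the sketch runs offline; it assumes rvest is installed.

```r
library(rvest)

# Hypothetical helper: return a named vector (label -> value) for the
# <option> children of the <select> with the given name attribute
select_options <- function(doc, name) {
  options <- doc %>%
    html_node(xpath = sprintf("//select[@name='%s']", name)) %>%
    html_children()
  setNames(html_attr(options, "value"), html_text(options))
}

# Made-up stand-in for the query page
page <- read_html('<form>
  <select name="TYPEK">
    <option value="sii">listed</option>
    <option value="otc">otc</option>
  </select>
  <select name="code">
    <option value=""></option>
    <option value="01">cement</option>
  </select>
</form>')

market_local   <- select_options(page, "TYPEK")
industry_local <- select_options(page, "code")[-1]  # drop the blank option
```

With the real page, `select_options(res_doc, "TYPEK")` would be expected to give the same vector as `market_type` above, assuming the page structure is unchanged.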

Try fetching the data for「上櫃」(OTC) and「半導體業」(semiconductor industry):

# get different data
res <- POST(
  "http://mops.twse.com.tw/mops/web/ajax_t51sb01",
  body = list(
    encodeURIComponent = "1",
    step = "1",
    firstin = "1",
    TYPEK = market_type["上櫃"],
    code = industry_type["半導體業"]
  ),
  encode = "form"
)


res_text <- content(res, "text", encoding = "UTF-8")
dt2 <- res_text %>%
  read_html(encoding = "UTF-8") %>%
  html_nodes(xpath = "//table[2]") %>%
  html_table(header=TRUE) %>%
  .[[1]]
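Since only TYPEK and code change between queries, the request and the parsing can be wrapped in a function. This is a sketch under the assumption that the endpoint and page layout are still as described in this post; `mops_body` and `fetch_mops` are hypothetical names, and the network call is left commented out.

```r
library(httr)
library(rvest)

# Hypothetical helper: assemble the form data for one query
mops_body <- function(market, industry) {
  list(
    encodeURIComponent = "1",
    step = "1",
    firstin = "1",
    TYPEK = market,
    code = industry
  )
}

# Hypothetical helper: send the POST and return the second table.
# Assumes the URL and the position of the table have not changed.
fetch_mops <- function(market, industry) {
  res <- POST(
    "http://mops.twse.com.tw/mops/web/ajax_t51sb01",
    body = mops_body(market, industry),
    encode = "form"
  )
  stop_for_status(res)  # fail early on an HTTP error
  content(res, "text", encoding = "UTF-8") %>%
    read_html(encoding = "UTF-8") %>%
    html_nodes(xpath = "//table[2]") %>%
    html_table(header = TRUE) %>%
    .[[1]]
}

# Usage (requires network access); "otc" and "24" are the codes listed above
# dt2 <- fetch_mops("otc", "24")
```

Keeping the body construction separate from the request makes it possible to check the parameters without hitting the server.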

Data Cleansing

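Inspecting dt2 shows that the header row reappears every 15 rows. This is a very poor way to lay out data, so we have to remove those rows, and strip the extra whitespace (non-breaking spaces) from the data while we are at it: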

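## data cleansing
dt2 <- dt2 %>%
  filter(`公司代號` != "公司代號") %>%     # drop the repeated header rows
  mutate(across(everything(), str_trim))  # trim whitespace; unlike sapply(), keeps a data frame (dplyr >= 1.0)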
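The effect of the two cleaning steps is easier to see on a toy table. The data frame below is made up for illustration (it is not MOPS data), and the sketch assumes dplyr 1.0 or later and stringr.

```r
library(dplyr)
library(stringr)

# Toy stand-in for dt2: the header row reappears as a data row, and some
# values carry non-breaking spaces (\u00a0)
toy <- data.frame(
  id   = c("1101\u00a0", "id", "1102"),
  name = c("\u00a0Alpha", "name", "Beta\u00a0"),
  stringsAsFactors = FALSE
)

clean <- toy %>%
  filter(id != "id") %>%                   # drop the repeated header row
  mutate(across(everything(), str_trim))   # trim every column, keep a data frame
```

Using `mutate()` here rather than `sapply()` is a design choice: `sapply()` over a data frame returns a character matrix, which loses the data frame structure.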


[R crawler] 公開資訊觀測站 (Market Observation Post System): Observation

Jan 02, 2016

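The data we are going after this time is the company data on 公開資訊觀測站 (Market Observation Post System, MOPS). As discussed in the previous post:

"... Once the restrictions on obtaining data from many sources are no longer an unsolvable problem, a freer space for imagination opens up; the bigger challenge becomes how to integrate the information and put it to use, and how to mine gold from it."

This company data can actually be put to considerable use: analyzing it further can yield valuable information on cross-shareholdings, the relationship networks of directors and supervisors, or the overall structure of influence in corporate investment in Taiwan.

Target website

公開資訊觀測站 (Market Observation Post System)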

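Step 1. Define the target data

Before writing a crawler you need to know what data you want to crawl. "Data" here can be anything you see on the web page, or even the content of the entire page.

Let's first look at what the site offers. On the home page, the row of buttons at the top all lead to similar query pages, and some of the query results are the same data presented in a different way, so here we simply target the basic company information.

Go to 總彙報表 > 基本資料 > 基本資料查詢總彙報表 (Summary Reports > Basic Information > Basic Information Query Summary Report) at the top left, and you will see a query form with drop-down menus for market type and industry type.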

mops_index

mops_2

For the market type, select「上市」(listed) to start observing; the industry type can be left blank to query all companies at once.

The large table that results is the data we want.

mops_data_table

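Step 2. Observe

The most important skill in writing a crawler is observation. The tool used here is Chrome DevTools: on this page, press the shortcut Ctrl + Shift + I (Cmd + Option + I on macOS; if you forget it, click the hamburger icon at the top right > 更多工具 (More tools) > 開發人員工具 (Developer tools)), switch to the Network tab, and reload. You can then see every request made while browsing.

» Finding the data

A small trick when observing with DevTools: the data we want is usually loaded only when it appears in front of us, so:

Tip: first click the button next to the red button to clear the earlier requests. When you then press the「搜尋」(Search) button and the data appears, the request we need will be among the first few listed.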

mops_observe

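The moment the「搜尋」(Search) button is pressed, all the ~~integrity and kindness~~ data comes in. On inspection, the first request is very likely the one that holds our data.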

mops_click

Click into it and look at Preview: sure enough, this is a simple server-rendered (Page Render) page, and the data appears directly in the Document.

mops_res_data

» Request method

Open the Headers tab to check the request type. You can see that the request goes to this URL:

http://mops.twse.com.tw/mops/web/ajax_t51sb01

The request type is POST. Whenever you see the POST method, always go on to check the form data it posts.

mops_headers_1

The form data below has several parameters:

encodeURIComponent: 1
step: 1
firstin: 1
TYPEK: sii
code:

mops_headers_2

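Some of these values were presumably filled in when we selected the market type and industry type. There are two ways to confirm this. The first is to select other options and query again, but with many options that would take forever. The second is to go straight to the page's HTML source and find out.

Right-click the drop-down menu > 檢查 (Inspect) to view the HTML.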

mops_form

The name attributes of the select elements are TYPEK and code, and each option has a value attribute whose value is what appears in the form data parameters. You can also see that the option child elements contain the choices we saw earlier in the drop-down menus:

TYPEK: sii
code:

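That about completes the observations needed for this crawler. What remains is the actual implementation in R and repeated trial and error, going back and forth against the observation techniques covered in this post to write a complete crawler.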


Leoluyi 呂奕 Coding with Data
