Leoluyi 呂奕 Coding with Data

Web Crawling Is a Basic Skill

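If you need large amounts of data from the web but are still relying on manual copy and paste, I strongly recommend learning to write a crawler. While you hesitate, other people have already crawled who knows how many sites and put that data to work.

When analyzing data, simply obtaining it is often the first real hurdle: beyond the original data source, you usually want some additional data to support the analysis. Most useful information is in fact available online, and the data there is often quite rich, whether it is curated open data or unstructured content. If you can extract it systematically and efficiently, you not only save a great deal of time but also open up more opportunities to apply it.

If you follow the course schedule and put in the work on a few sites you want to crawl, a few hours of practice can take you from crawler novice to being able to handle most websites. (Taiwan Railways is not among them, so please do not send the authorities to my door.) The learning curve is, in other words, quite gentle.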

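Why learn to crawl?

In an era when data on the web is given away for free, walking into a treasure mountain and leaving empty-handed is a real loss. For data science today, I believe that in the near future using a crawler to obtain data will be a basic skill rather than a competitive advantage. As all kinds of data become easier to obtain, what sets people apart is only whether they can process and analyze that data. As for where the data comes from, you cannot expect others to hand it to you; you have to find a way yourself.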

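What exactly is a crawler?

A crawler takes the sequence of steps you would perform by hand in a browser to get data and automates them with a program. This is not a new technology, nor is it a question of being a hacker or not; the point is that time is valuable. With more and more convenient packages being developed, crawling no longer requires understanding the intricate principles of network connections or advanced coding techniques. Some basic programming skills are enough to get started quickly.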
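As a rough illustration of automating a manual "open the page, find the table, copy the values" routine, here is a minimal sketch. The course itself works in R; this sketch uses only Python's standard library. The page content, the `TableExtractor` class, and the `extract_rows` helper are all hypothetical names made up for this example, and a real crawler would download the HTML instead of using an inline string.

```python
from html.parser import HTMLParser

# Hypothetical page content. A real crawler would download this text first,
# for example with urllib.request.urlopen(url).read().decode("utf-8").
SAMPLE_HTML = """
<html><body>
<table id="votes">
  <tr><th>Name</th><th>Upvotes</th></tr>
  <tr><td>Alice</td><td>10</td></tr>
  <tr><td>Bob</td><td>4</td></tr>
</table>
</body></html>
"""


class TableExtractor(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows, each a list of cell strings
        self._row = None        # row being built, or None outside <tr>
        self._in_cell = False   # True while inside <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._in_cell and self._row is not None:
            self._row[-1] += data.strip()


def extract_rows(html):
    """Return all table rows found in the given HTML text."""
    parser = TableExtractor()
    parser.feed(html)
    return parser.rows


rows = extract_rows(SAMPLE_HTML)
print(rows)
```

The same three steps (get the page, locate the data, pull it into a structure you can analyze) apply whichever language or package is used.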

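The crawler's mantra: observe, observe, and observe again

The ultimate purpose of crawling for data is to analyze it, so starting with a familiar tool, R, is a good way in, provided you have a basic understanding of R data types and string handling. When connecting to a target site to get data, most of the time goes into working out where the data is hidden and how the site's connection blocking is designed. This is therefore a course that demands hands-on practice: by practicing on (attacking, XD) different websites and exchanging experience with mentors and fellow crawlers, you will improve rapidly.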
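One situation that "observing" can turn up, offered here only as an assumed example and not something the post describes, is that the visible page is nearly empty and the data sits in a JSON blob inside a script tag. The sketch below is Python (the course uses R); the page content, the script id `initial-data`, and the `find_embedded_json` helper are hypothetical.

```python
import json
import re

# Hypothetical page: the visible markup is empty and the data is embedded
# as JSON in a <script> tag (an assumed layout, for illustration only).
PAGE = """
<html><body>
<div id="app"></div>
<script id="initial-data" type="application/json">
{"items": [{"name": "Alice", "upvotes": 10}, {"name": "Bob", "upvotes": 4}]}
</script>
</body></html>
"""


def find_embedded_json(html, script_id):
    """Return the parsed JSON inside <script id=script_id>, or None if absent."""
    pattern = re.compile(
        r'<script[^>]*\bid="' + re.escape(script_id) + r'"[^>]*>(.*?)</script>',
        re.DOTALL,
    )
    match = pattern.search(html)
    if match is None:
        return None
    return json.loads(match.group(1))


data = find_embedded_json(PAGE, "initial-data")
names = [item["name"] for item in data["items"]]
print(names)
```

Viewing the page source or the browser's network panel is usually how such a hiding place gets noticed in the first place; the code only comes after the observation.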

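What I gained from R Crawler 101

Learning to crawl also prompted me to rethink what more a data analysis workflow could be. Once the old limits on obtaining data are no longer an unsolvable problem, more room for imagination opens up. The bigger challenge then becomes integrating and applying the information, and panning for gold in it.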

A word from our sponsor:

Course link: http://datasci.kktix.cc/events/rcrawler101-201512

Please share it widely with any friends you think might need it.


Example content

Howdy! This is an example blog post that shows several types of HTML content supported in this theme.

Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Aenean eu leo quam. Pellentesque ornare sem lacinia quam venenatis vestibulum. Sed posuere consectetur est at lobortis. Cras mattis consectetur purus sit amet fermentum.

Curabitur blandit tempus porttitor. Nullam quis risus eget urna mollis ornare vel eu leo. Nullam id dolor id nibh ultricies vehicula ut id elit.

Etiam porta sem malesuada magna mollis euismod. Cras mattis consectetur purus sit amet fermentum. Aenean lacinia bibendum nulla sed consectetur.

Inline HTML elements

HTML defines a long list of available inline tags, a complete list of which can be found on the Mozilla Developer Network.

  • To bold text, use <strong>.
  • To italicize text, use <em>.
  • Abbreviations, like HTML, should use <abbr>, with an optional title attribute for the full phrase.
  • Citations, like Mark Otto, should use <cite>.
  • Deleted text should use <del> and inserted text should use <ins>.
  • Superscript text uses <sup> and subscript text uses <sub>.

Most of these elements are styled by browsers with few modifications on our part.

Heading

Vivamus sagittis lacus vel augue rutrum faucibus dolor auctor. Duis mollis, est non commodo luctus, nisi erat porttitor ligula, eget lacinia odio sem nec elit. Morbi leo risus, porta ac consectetur ac, vestibulum at eros.

Code

Cum sociis natoque penatibus et magnis dis code element montes, nascetur ridiculus mus.

// Example can be run directly in your JavaScript console

// Create a function that takes two arguments and returns the sum of those arguments
var adder = new Function("a", "b", "return a + b");

// Call the function
adder(2, 6);
// > 8

# Python: build a small DataFrame
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]])

# R: convert the built-in iris data set to a data.table
library(data.table)
dt <- data.table(iris)
a <- "1"
b <- 5:20

Aenean lacinia bibendum nulla sed consectetur. Etiam porta sem malesuada magna mollis euismod. Fusce dapibus, tellus ac cursus commodo, tortor mauris condimentum nibh, ut fermentum massa.

Lists

Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Aenean lacinia bibendum nulla sed consectetur. Etiam porta sem malesuada magna mollis euismod. Fusce dapibus, tellus ac cursus commodo, tortor mauris condimentum nibh, ut fermentum massa justo sit amet risus.

  • Praesent commodo cursus magna, vel scelerisque nisl consectetur et.
  • Donec id elit non mi porta gravida at eget metus.
  • Nulla vitae elit libero, a pharetra augue.

Donec ullamcorper nulla non metus auctor fringilla. Nulla vitae elit libero, a pharetra augue.

  1. Vestibulum id ligula porta felis euismod semper.
  2. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus.
  3. Maecenas sed diam eget risus varius blandit sit amet non magna.

Cras mattis consectetur purus sit amet fermentum. Sed posuere consectetur est at lobortis.

HyperText Markup Language (HTML)
    The language used to describe and define the content of a Web page
Cascading Style Sheets (CSS)
    Used to describe the appearance of Web content
JavaScript (JS)
    The programming language used to build advanced Web sites and applications

Integer posuere erat a ante venenatis dapibus posuere velit aliquet. Morbi leo risus, porta ac consectetur ac, vestibulum at eros. Nullam quis risus eget urna mollis ornare vel eu leo.

Tables

Aenean lacinia bibendum nulla sed consectetur. Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Name     Upvotes  Downvotes
Alice    10       11
Bob      4        3
Charlie  7        9
Totals   21       23

Nullam id dolor id nibh ultricies vehicula ut id elit. Sed posuere consectetur est at lobortis. Nullam quis risus eget urna mollis ornare vel eu leo.


Want to see something else added? Open an issue.


A Full and Comprehensive Style Test (HTML)

This is the Markdown cheatsheet demo for Sustain, this Jekyll theme. Please check the raw content of this file for the Markdown usage.

This is just an ipsis verbis copy of the first example running on the Ghost Demo. It shows how you can use HTML styling to achieve your hopes.

Below is just about everything you'll need to style in the theme. Check the source code to see the many embedded elements within paragraphs.


Heading 1

Heading 2

Heading 3

Heading 4

Heading 5

Heading 6

Lorem ipsum dolor sit amet, test link adipiscing elit. This is strong. Nullam dignissim convallis est. Quisque aliquam. This is emphasized. Donec faucibus. Nunc iaculis suscipit dui. 5³ = 125. Water is H₂O. Nam sit amet sem. Aliquam libero nisi, imperdiet at, tincidunt nec, gravida vehicula, nisl. The New York Times (That's a citation). Underline. Maecenas ornare tortor. Donec sed tellus eget sapien fringilla nonummy. Mauris a ante. Suspendisse quam sem, consequat at, commodo vitae, feugiat in, nunc. Morbi imperdiet augue quis tellus.

HTML and CSS are our tools. Mauris a ante. Suspendisse quam sem, consequat at, commodo vitae, feugiat in, nunc. Morbi imperdiet augue quis tellus. Praesent mattis, massa quis luctus fermentum, turpis mi volutpat justo, eu volutpat enim diam eget metus. To copy a file type COPY filename. Dinner's at 5:00. Let's make that 7. This text has been struck.


Media

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore.

Big Image

Test Image

Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.

Small Image

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore.

Small Test Image

Labore et dolore.


List Types

Definition List

Definition List Title
    This is a definition list division.
Definition
    An exact statement or description of the nature, scope, or meaning of something: our definition of what constitutes poetry.

Ordered List

  1. List Item 1
  2. List Item 2
    1. Nested list item A
    2. Nested list item B
  3. List Item 3

Unordered List

  • List Item 1
  • List Item 2
    • Nested list item A
    • Nested list item B
  • List Item 3

Table

Table Header 1  Table Header 2  Table Header 3
Division 1      Division 2      Division 3
Division 1      Division 2      Division 3
Division 1      Division 2      Division 3

Preformatted Text

Typographically, preformatted text is not the same thing as code. Sometimes, a faithful execution of the text requires preformatted text that may not have anything to do with code. Most browsers use Courier and that's a good default --- with one slight adjustment, Courier 10 Pitch over regular Courier for Linux users.

Code

Code can be presented inline, like <?php bloginfo('stylesheet_url'); ?>, or within a <pre> block. Because we have more specific typographic needs for code, we'll specify Consolas and Monaco ahead of the browser-defined monospace font.

#container {
    float: left;
    margin: 0 -240px 0 0;
    width: 100%;
}

Blockquotes

Let's keep it simple. Italics are good to help set it off from the body text. Be sure to style the citation.

Good afternoon, gentlemen. I am a HAL 9000 computer. I became operational at the H.A.L. plant in Urbana, Illinois on the 12th of January 1992. My instructor was Mr. Langley, and he taught me to sing a song. If you'd like to hear it I can sing it for you. --- HAL 9000

And here's a bit of trailing text.


Text-level semantics

The a element example
The abbr element and abbr element with title examples
The b element example
The cite element example
The code element example
The del element example
The dfn element and dfn element with title examples
The em element example
The i element example
The ins element example
The kbd element example
The mark element example
The q element inside a q element example
The s element example
The samp element example
The small element example
The span element example
The strong element example
The sub element example
The sup element example
The var element example
The u element example


Forms

Inputs as descendants of labels (form legend)
Clickable inputs and buttons
box-sizing tests

Embeds

Sometimes all you want to do is embed a little love from another location and set your post alive.

Video

Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.

Culpa qui officia deserunt mollit anim id est laborum.

Slides

Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.

Culpa qui officia deserunt mollit anim id est laborum.

Audio

Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.

Culpa qui officia deserunt mollit anim id est laborum.

Code

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt.

var c = new Sketch.create({autoclear: false}),
    bigCircle = 50,
    littleCircle = 5,
    // The velocity value determines how much to move the spinner head (in radians).
    velocity = 0.105,
    hue = 0,
    // The alpha value below determines the length of the spinner's tail.
    bg = 'rgba(40,40,40,.075)',
    Spinner = function() {},
    spinner; // instance created in c.setup

Spinner.prototype.setup = function() {
  this.x = c.width / 2;
  this.y = c.height / 2 - bigCircle;
  this.rotation = 0;
}
Spinner.prototype.update = function() {
  this.rotation += velocity;
  this.rotation = this.rotation % TWO_PI;
  this.x = c.width /2 + cos(this.rotation) * bigCircle;
  this.y = c.height / 2 + sin(this.rotation) * bigCircle;
}
Spinner.prototype.draw = function() {
  c.fillStyle = 'hsl('+hue+',50%,50%)';
  c.beginPath();
  c.arc(this.x, this.y, littleCircle, 0, TWO_PI);
  c.fill();
  c.closePath();
}
c.setup = function() {
  spinner = new Spinner();
  spinner.setup();
}
c.update = function() {
  spinner.update();
  hue = ++hue % 360;
}
c.draw = function() {
  spinner.draw();
  c.fillStyle = bg;
  c.fillRect(0,0,c.width,c.height);
}

See the Pen Simple Rotating Spinner by Rob Glazebrook (@rglazebrook) on CodePen.

Isn't it beautiful.
