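In the years that followed I started writing web crawlers. At first I took over code written by my predecessor, and the early work was studying Taobao's anti-scraping defenses and page structure. Later we needed to compare other platforms, so I switched to collecting data from Shopee and Yahoo Auctions (Y拍), and I have maintained those crawlers ever since.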

Here are the Node.js packages I use most often: request, node-fetch, and chromeless (puppeteer).

I'll also cover got, another HTTP request package.

request

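This is the first package I used. It can send GET requests to a web page. If you look it up on the npm website today, you'll find that request has been deprecated. Maintenance has stopped, but for now it still works.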

Installation

$ npm install request

That installs the package. A simple example:

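const request = require('request');

// Placeholder URL for illustration; replace it with the page you want to fetch.
const url = 'https://httpbin.org/get';

request(url, function (error, response, body) {
    if (error) {
        // Network-level failure (DNS, timeout, connection refused, ...)
        console.log(error);
        return;
    }
    // const zJson = JSON.parse(body);
    console.log(response.statusCode);
    console.log(body);
});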

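response.statusCode tells you the status of the GET request, and body holds the content that came back. The maintainers no longer support the package (my impression is that it was too simple compared with newer options), but it still runs normally. If you need to set up a server, I'd recommend Node.js's built-in http module instead.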

Since support for request ended in 2020 and the advice I see online is to switch to got, I'll briefly describe got as well. It can send both GET and POST requests.

A simple example:

import got from 'got';

const {data} = await got.post('https://httpbin.org/anything', {
    json: { hello: 'world' }
}).json();

console.log(data);
// Prints: {"hello": "world"}

This is the example from the official documentation; it ends by printing the JSON {"hello": "world"}.

node-fetch

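This is the package I use specifically for POST requests. However, node-fetch v3 is published as an ES module only, so loading it with require() from CommonJS code fails with this error: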

Error [ERR_REQUIRE_ESM]: require() of ES Module .... to a dynamic import() which is available in all CommonJS modules.
at Object.<anonymous> ({file}) {
code: 'ERR_REQUIRE_ESM'
}

To get past this error, either switch to import or install the older version, which still works with require():

npm install node-fetch@2
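The error message itself points at dynamic import(), so if you would rather stay on node-fetch v3 without converting the whole file to ESM, something like the following sketch should work. loadFetch is a name made up for illustration, and the httpbin URL is only a placeholder.

```javascript
// CommonJS file; assumes node-fetch v3 is installed.
// loadFetch is a hypothetical helper name, not part of node-fetch.
async function loadFetch() {
    // import() is allowed inside CommonJS and returns a promise
    const mod = await import('node-fetch');
    return mod.default; // node-fetch v3 exposes fetch as the default export
}

async function main() {
    const fetch = await loadFetch();
    const response = await fetch('https://httpbin.org/anything'); // placeholder URL
    console.log(response.status);
}

// main().catch(console.error); // uncomment to run
```

The module is only loaded when loadFetch() is called, so the rest of the file can keep using require() for everything else.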

Taking the earlier got example, the node-fetch version looks like this:

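const fetch = require('node-fetch'); // needs node-fetch@2 for require()

const zUrl = 'https://httpbin.org/anything';
const zData = {
    method: 'POST',     // the method has to be set explicitly
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ hello: 'world' })   // the body must be a string, not a plain object
};

fetch(zUrl, zData).then((response) => {
    // response.body is a readable stream at this point
    // console.log(response);
    // blob(), json(), or text() turn it into usable data
    return response.json();
}).then((value) => {
    console.log(value);
}).catch((err) => {
    console.log(err);
});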

The data field of the returned object should match what the got example prints.
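One thing that is easy to miss with fetch is that it does not serialise a plain object for you: the body has to be a string (or a Buffer or stream), and the server needs a Content-Type header to know it is JSON. A small helper can keep that in one place. The sketch below is only an illustration; jsonPostOptions and postJson are made-up names, and fetchImpl is assumed to be node-fetch@2 or the global fetch that ships with Node.js 18 and later.

```javascript
// Hypothetical helper: builds the options object for a JSON POST.
function jsonPostOptions(payload) {
    return {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify(payload), // serialise the object ourselves
    };
}

// Hypothetical wrapper: fetchImpl is whichever fetch function you have available.
async function postJson(fetchImpl, url, payload) {
    const response = await fetchImpl(url, jsonPostOptions(payload));
    if (!response.ok) {
        // response.ok is false for any status outside 200-299
        throw new Error(`HTTP ${response.status}`);
    }
    return response.json();
}

console.log(jsonPostOptions({ hello: 'world' }).body);
```

Usage would be along the lines of postJson(fetch, 'https://httpbin.org/anything', { hello: 'world' }).then(console.log).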

chromeless

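Chromeless is a wrapper built on top of headless Chrome. Through its API you can drive the Chrome browser to do just about anything (clicking, typing, opening sites, and so on).

I originally set out to study Chromeless, but found that in Node.js Puppeteer achieves the same thing. With it you can load a site and read the result after the page's JavaScript has run.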

Installation:

$ npm i puppeteer

This automatically downloads Chromium, so the install is very large.

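If you don't want the Chromium download, install puppeteer-core instead of puppeteer:

$ npm i puppeteer-core

puppeteer-core skips the Chromium download, so you use a Chrome that is already installed on the machine.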
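Because puppeteer-core ships without a browser, launch() has to be told where Chrome lives. The following is a minimal sketch under a few assumptions: resolveChromePath is a made-up helper, CHROME_PATH is an environment variable name chosen for this example, and /usr/bin/google-chrome is just a common Linux location that may differ on your machine.

```javascript
// Hypothetical helper: read the Chrome location from an environment variable
// (CHROME_PATH is an assumed name), falling back to a common Linux path.
function resolveChromePath(env = process.env) {
    return env.CHROME_PATH || '/usr/bin/google-chrome';
}

async function main() {
    // Required lazily so the helper above can be used without puppeteer-core installed.
    const puppeteer = require('puppeteer-core');
    const browser = await puppeteer.launch({
        executablePath: resolveChromePath(), // point at the existing Chrome binary
    });
    const page = await browser.newPage();
    await page.goto('https://example.com');
    console.log(await page.title());
    await browser.close();
}

// main().catch(console.error); // uncomment to run
```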

The example from GitHub:

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://example.com');
    await page.screenshot({ path: 'example.png' });
    await browser.close();
})();

* Functions

Let's walk through the code in the example.

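* .launch()

The first step is launch(), which starts the Chrome browser process. It has to be called on every run.

* .newPage()

Opens a blank page.

* goto({URL Path})

Navigates to the given URL.

* screenshot({FilePath})

Takes a screenshot of the page.

The screenshot is then saved to the given file path. The default viewport size is 800x600, but you can change it with page.setViewport():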

let json = {
    width: 1024,
    height: 768,
    deviceScaleFactor: 1,
};

await page.setViewport(json);

* close()

Closes the browser and ends the session.
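Putting the functions above together, a screenshot script with a custom viewport might look like the sketch below. capture and VIEWPORT are names invented for this example, and the viewport numbers are arbitrary. The try/finally is a design choice of mine so the browser is closed even when goto() or screenshot() throws, which otherwise tends to leave Chrome processes behind.

```javascript
// Assumed viewport values for illustration; any width/height works.
const VIEWPORT = { width: 1024, height: 768, deviceScaleFactor: 1 };

// Hypothetical helper: open `url` and save a screenshot to `path`.
async function capture(url, path) {
    const puppeteer = require('puppeteer'); // loaded when capture() is called
    const browser = await puppeteer.launch();
    try {
        const page = await browser.newPage();
        await page.setViewport(VIEWPORT); // set before navigating so the page lays out at this size
        await page.goto(url);
        await page.screenshot({ path });
    } finally {
        await browser.close(); // always release the browser, even on errors
    }
}

// capture('https://example.com', 'example.png').catch(console.error); // uncomment to run
```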

My own example: getting cookies

Launch the browser and open Google:

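const puppeteer = require('puppeteer');

// await is not allowed at the top level of a CommonJS file, so wrap it in an async function
(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://www.google.com.tw');
    const cookies = await page.cookies();
    console.log(cookies);
    await browser.close();
})();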

This returns the cookies as an array of objects. You can loop over it to read each cookie's name and value.
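As a rough illustration of that loop, the sketch below turns the array into a name/value map and into a Cookie header string that could be passed to request, got, or node-fetch. cookiesToMap and cookiesToHeader are made-up helper names, and the sample array is invented data in the same shape as the page.cookies() result (real entries carry more fields, such as domain and path).

```javascript
// Hypothetical helper: collect name/value pairs with a plain for loop.
function cookiesToMap(cookies) {
    const map = {};
    for (const cookie of cookies) {
        map[cookie.name] = cookie.value;
    }
    return map;
}

// Hypothetical helper: build the value of a Cookie request header.
function cookiesToHeader(cookies) {
    return cookies.map((c) => `${c.name}=${c.value}`).join('; ');
}

// Invented sample data, not real Google cookies.
const sample = [
    { name: 'NID', value: 'abc' },
    { name: 'CONSENT', value: 'yes' },
];

console.log(cookiesToMap(sample));
console.log(cookiesToHeader(sample)); // NID=abc; CONSENT=yes
```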

Switching to mobile mode

Reference: https://github.com/puppeteer/puppeteer/blob/main/src/common/DeviceDescriptors.ts


const Android = puppeteer.devices['Nexus 5X'];
const page = await browser.newPage();
await page.emulate(Android);

// Override the user agent string if the emulated one is not what you want
await page.setUserAgent(
    'Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
);
