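Writing a Crawler in Golang (5): Using XPath

Earlier articles in this series introduced soup, a substitute for BeautifulSoup, and goquery, a substitute for Pyquery. But when I write crawlers in Python, the page-parsing combination I like best is actually lxml + XPath. Why? Let me go through the strengths of lxml and XPath in turn.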

lxml

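lxml is an HTML/XML parser. It is a Python binding for libxml2 and libxslt, both of which are implemented in C. Besides being fast, it is also very tolerant of malformed documents.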

XPath

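XPath stands for XML Path Language.

Compared with BeautifulSoup (soup) and Pyquery (goquery), XPath has a steeper learning curve, but learning it pays off, and you will come to love it. Take me as an example: I learned XPath while writing crawlers in Python, and now I can simply pick a library that supports XPath and use it right away.

One more thing: if you really like BeautifulSoup, be sure to use the BeautifulSoup + lxml combination. BeautifulSoup's default HTML parser is html.parser from the Python standard library; it is also quite tolerant of malformed documents, but it is much slower.

I learned XPath from w3school; the link is in the further reading.

XPath Libraries in Golang

There are quite a few XPath libraries written in Golang. Since I don't have much real-world development experience yet, I will try out the few libraries I could find and then draw a conclusion.

First, here is part of the HTML of the Douban Top250 page: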

<ol class="grid_view">
  <li>
    <div class="item">
      <div class="info">
        <div class="hd">
          <a href="https://movie.douban.com/subject/1292052/" class="">
            <span class="title">肖申克的救赎</span>
            <span class="title">&nbsp;/&nbsp;The Shawshank Redemption</span>
            <span class="other">&nbsp;/&nbsp;月黑高飞(港)  /  刺激 1995(台)</span>
          </a>
          <span class="playable">[可播放]</span>
        </div>
      </div>
    </div>
  </li>
  ....
</ol>

The requirement is the same as before: get each entry's ID and title.

github.com/lestrrat-go/libxml2

lestrrat-go/libxml2 is a Golang binding for libxml2.

First install it:

 go get github.com/lestrrat-go/libxml2

Then change the code:

import (
        "log"
        "time"
        "strings"
        "strconv"
        "net/http"

        "github.com/lestrrat-go/libxml2"
        "github.com/lestrrat-go/libxml2/types"
        "github.com/lestrrat-go/libxml2/xpath"
)

func fetch(url string) types.Document {
        log.Println("Fetch Url", url)
        client := &http.Client{}
        req, _ := http.NewRequest("GET", url, nil)
        req.Header.Set("User-Agent", "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
        resp, err := client.Do(req)
        if err != nil {
                log.Fatal("Http get err:", err)
        }
        if resp.StatusCode != 200 {
                log.Fatal("Http status code:", resp.StatusCode)
        }
        defer resp.Body.Close()
        doc, err := libxml2.ParseHTMLReader(resp.Body)
        if err != nil {
                log.Fatal(err)
        }
        return doc
}

The response body is now parsed with libxml2.ParseHTMLReader(resp.Body), and parseUrls changes accordingly:

func parseUrls(url string, ch chan bool) {
        doc := fetch(url)
        defer doc.Free()
        nodes := xpath.NodeList(doc.Find(`//ol[@class="grid_view"]/li//div[@class="hd"]`))
        for _, node := range nodes {
                urls, _ := node.Find("./a/@href")
                titles, _ := node.Find(`.//span[@class="title"]/text()`)
                log.Println(strings.Split(urls.NodeList()[0].TextContent(), "/")[4],
                        titles.NodeList()[0].TextContent())
        }
        time.Sleep(2 * time.Second)
        ch <- true
}

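Each result is read with NodeList()[index].TextContent().

The package also provides xpath.NewContext, and results can be converted with helpers such as xpath.String(ctx.Find("/foo/bar")).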
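The `strings.Split(..., "/")[4]` expression in parseUrls works because of how the href is laid out: splitting `https://movie.douban.com/subject/1292052/` on `/` puts the ID at index 4. Below is a minimal sketch of that step using only the standard library. `subjectID` is a hypothetical helper, not part of either XPath library; it looks for the segment after `subject` instead of relying on a fixed index, so an unexpected URL returns `false` rather than panicking.

```go
package main

import (
	"fmt"
	"strings"
)

// subjectID is a hypothetical helper: it returns the path segment that
// follows "subject" in a Douban entry URL, or false if there is none.
func subjectID(href string) (string, bool) {
	parts := strings.Split(href, "/")
	for i, p := range parts {
		if p == "subject" && i+1 < len(parts) && parts[i+1] != "" {
			return parts[i+1], true
		}
	}
	return "", false
}

func main() {
	href := "https://movie.douban.com/subject/1292052/"

	// Split yields: "https:", "", "movie.douban.com", "subject", "1292052", ""
	fmt.Println(strings.Split(href, "/")[4]) // prints 1292052

	if id, ok := subjectID(href); ok {
		fmt.Println(id) // prints 1292052
	}
}
```

The fixed index is fine for a quick script; the helper is only worth it if the crawler may meet links that do not follow the `/subject/<id>/` pattern.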

github.com/antchfx/htmlquery

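As its name suggests, htmlquery is a package for running XPath queries against HTML documents. Its core is antchfx/xpath; the project is updated frequently and the documentation is fairly complete.

First install it: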

 go get github.com/antchfx/htmlquery

Then modify the code to fit the requirement:

import (
    "log"
    "time"
    "strings"
    "strconv"
    "net/http"

    "golang.org/x/net/html"
    "github.com/antchfx/htmlquery"
)

func fetch(url string) *html.Node {
    log.Println("Fetch Url", url)
    client := &http.Client{}
    req, _ := http.NewRequest("GET", url, nil)
    req.Header.Set("User-Agent", "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
    resp, err := client.Do(req)
    if err != nil {
        log.Fatal("Http get err:", err)
    }
    if resp.StatusCode != 200 {
        log.Fatal("Http status code:", resp.StatusCode)
    }
    defer resp.Body.Close()
    doc, err := htmlquery.Parse(resp.Body)
    if err != nil {
        log.Fatal(err)
    }
    return doc
}

htmlquery.Parse(resp.Body) returns a *html.Node, so parseUrls also needs to change:

func parseUrls(url string, ch chan bool) {
    doc := fetch(url)
    nodes := htmlquery.Find(doc, `//ol[@class="grid_view"]/li//div[@class="hd"]`)
    for _, node := range nodes {
        url := htmlquery.FindOne(node, "./a/@href")
        title := htmlquery.FindOne(node, `.//span[@class="title"]/text()`)
        log.Println(strings.Split(htmlquery.InnerText(url), "/")[4],
            htmlquery.InnerText(title))
    }
    time.Sleep(2 * time.Second)
    ch <- true
}

antchfx/htmlquery, lestrrat-go/libxml2
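Both versions of parseUrls take a `ch chan bool`, and both import lists include strconv, yet neither is used in the snippets shown here. A plausible driver is sketched below under several assumptions: the listing URL `https://movie.douban.com/top250?start=N`, 25 entries per page, and the names `pageURL` and `run` are all hypothetical and not taken from this article. The parse function is passed in as a parameter so the sketch runs without touching the network.

```go
package main

import (
	"fmt"
	"strconv"
)

// pageURL builds the URL of the i-th listing page.
// Assumption: 25 entries per page, selected with the "start" query parameter.
func pageURL(i int) string {
	return "https://movie.douban.com/top250?start=" + strconv.Itoa(25*i)
}

// run starts one goroutine per page, then waits until every goroutine
// has sent its completion signal on ch.
func run(pages int, parse func(url string, ch chan bool)) {
	ch := make(chan bool)
	for i := 0; i < pages; i++ {
		go parse(pageURL(i), ch)
	}
	for i := 0; i < pages; i++ {
		<-ch
	}
}

func main() {
	// Stand-in for parseUrls: report the URL and signal completion.
	stub := func(url string, ch chan bool) {
		fmt.Println("would fetch", url)
		ch <- true
	}
	run(3, stub)
}
```

In a real program the stub would be replaced by either parseUrls shown above; the lines print in no fixed order because the goroutines run concurrently.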

Afterword

gopkg.in/xmlpath.v2

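By the way, a word about gopkg.in: gopkg is a way of managing packages. By an agreed convention, it acts as a "proxy" for a specific branch of the corresponding project on GitHub. See link 2 in the further reading for details.

The xmlpath.v2 package is go-xmlpath/xmlpath on GitHub, branch v2.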
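The naming convention can be sketched in a few lines. The function below is a hypothetical illustration, not anything gopkg.in itself ships: it maps `gopkg.in/pkg.vN` to `github.com/go-pkg/pkg` and `gopkg.in/user/pkg.vN` to `github.com/user/pkg`, and returns the version selector separately. It does not check that the selector is numeric and does not resolve which branch or tag the selector ends up pointing to.

```go
package main

import (
	"fmt"
	"strings"
)

// gopkgSource is a hypothetical helper that applies gopkg.in's naming
// convention to an import path and returns the GitHub repository path
// together with the version selector (for example "v2").
func gopkgSource(importPath string) (repo, version string, ok bool) {
	const prefix = "gopkg.in/"
	if !strings.HasPrefix(importPath, prefix) {
		return "", "", false
	}
	rest := strings.TrimPrefix(importPath, prefix)

	// The version selector is whatever follows the last ".v".
	dot := strings.LastIndex(rest, ".v")
	if dot <= 0 || dot+2 >= len(rest) {
		return "", "", false
	}
	name := rest[:dot]
	version = rest[dot+1:]

	if strings.Contains(name, "/") {
		// gopkg.in/user/pkg.vN -> github.com/user/pkg
		return "github.com/" + name, version, true
	}
	// gopkg.in/pkg.vN -> github.com/go-pkg/pkg
	return "github.com/go-" + name + "/" + name, version, true
}

func main() {
	repo, version, ok := gopkgSource("gopkg.in/xmlpath.v2")
	fmt.Println(repo, version, ok) // prints github.com/go-xmlpath/xmlpath v2 true
}
```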

antchfx/htmlquery

Code

The complete code can be found at this address.