Web Scraping Made Simple: Comparing Python, Golang, and GraphQuery


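Web Scraping Made Simple: Comparing Python, Golang, and GraphQuery
- I. Preface
  - 1. Semantic DOM structure
  - 2. Stable parsing code
- II. Parsing the Page
  - Parsing the page with Python
    - 1. Getting the title node
    - 2. Getting the size node
    - 3. The complete Python code
  - Parsing the page with Golang
  - Parsing with GraphQuery
    - 1. Calling GraphQuery from Golang
    - 2. Calling GraphQuery from Python
- III. Afterword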

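I. Preface

To avoid unnecessary confusion in later sections, this preface first introduces two basic programming ideas.

1. Semantic DOM structure

A semantic DOM structure means more than semantic HTML tags: every piece of dynamic text should sit in its own tag and carry a meaningful class or id attribute. Consider the following HTML: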

<div class="main-right fr">
    <p>编号：32490230</p>
    <p class="main-rightStage">模式：RGB</p>
    <p class="main-rightStage">体积：16.659 MB</p>
    <p class="main-rightStage">分辨率：72dpi</p>
</div>

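This markup is not semantic enough. The ID number, RGB, 16.659 MB, and 72dpi are all dynamically changing attributes that differ from page to page. Each of them should be wrapped in an inline tag such as <span> and given a semantic selector, which also makes the values easy to target when they are filled in through jsonp or Ajax. For example: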

<p class="main-rightStage property-mode">
    模式：<span>RGB</span>
</p>

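Here property-mode names the attribute and the span isolates its value, so the mode can be selected directly through its class.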
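As a minimal sketch of why this helps, the snippet below reads the mode through its semantic class using only the Python standard library. The HTML constant copies the fragment above; the helper name get_mode is hypothetical, and xml.etree.ElementTree is used only because this small fragment happens to be well-formed XML (real pages generally need an HTML parser such as pyquery).

```python
import xml.etree.ElementTree as ET

# The semantic fragment shown above.
HTML = """<p class="main-rightStage property-mode">
    模式：<span>RGB</span>
</p>"""


def get_mode(fragment):
    """Return the text of the <span> inside the property-mode node, or ""."""
    node = ET.fromstring(fragment)
    # Check the class token rather than the position of the node.
    if "property-mode" not in node.get("class", "").split():
        return ""
    span = node.find("span")
    if span is None or span.text is None:
        return ""
    return span.text.strip()


print(get_mode(HTML))
```

Because the lookup depends on the class name rather than on where the node sits, reordering or adding sibling attributes would not affect it.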

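2. Stable parsing code

With semantic DOM structure in mind, we can now look at what makes parsing code stable. Take the following DOM structure: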

<div class="main-right fr">
    <p>编号：32490230</p>
    <p class="main-rightStage">模式：RGB</p>
    <p class="main-rightStage">体积：16.659 MB</p>
    <p class="main-rightStage">分辨率：72dpi</p>
</div>

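Suppose we want to extract the value of the mode (模式) field. There are several ways to do it.

Approach 1: select the div whose class contains main-right, take the second p element inside that div, read its text 模式：RGB, and strip the label to get RGB.

This works on the page above, but it is unstable. On pages that also have a size (尺寸) attribute, the mode is no longer the second p element: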

<div class="main-right fr">
    <p>编号：32490230</p>
    <p class="main-rightStage">尺寸：4724×6299像素</p>
    <p class="main-rightStage">模式：RGB</p>
    <p class="main-rightStage">体积：16.659 MB</p>
    <p class="main-rightStage">分辨率：72dpi</p>
</div>

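Approach 2: iterate over the p elements whose class contains main-rightStage and keep the one whose text starts with 模式：.

Approach 3: apply a regular expression such as 模式：([A-Z]+) to the text of the parent node.

Approach 4: use the contains pseudo-selector, .main-rightStage:contains(模式), which selects nodes whose text contains 模式 and whose class contains main-rightStage.

Choosing among these approaches is a trade-off between stability, complexity, running efficiency, and compatibility.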
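The difference in stability can be sketched with the two fragments above. The function names below are hypothetical, and the standard-library XML parser stands in for a real HTML parser only because these fragments are well-formed.

```python
import re
import xml.etree.ElementTree as ET

PAGE_WITHOUT_SIZE = """<div class="main-right fr">
    <p>编号：32490230</p>
    <p class="main-rightStage">模式：RGB</p>
    <p class="main-rightStage">体积：16.659 MB</p>
    <p class="main-rightStage">分辨率：72dpi</p>
</div>"""

PAGE_WITH_SIZE = """<div class="main-right fr">
    <p>编号：32490230</p>
    <p class="main-rightStage">尺寸：4724×6299像素</p>
    <p class="main-rightStage">模式：RGB</p>
    <p class="main-rightStage">体积：16.659 MB</p>
    <p class="main-rightStage">分辨率：72dpi</p>
</div>"""


def mode_by_position(fragment):
    """Approach 1: take the second <p> and strip the label."""
    paragraphs = ET.fromstring(fragment).findall("p")
    return paragraphs[1].text.replace("模式：", "")


def mode_by_regex(fragment):
    """Approach 3: search the text of the parent node for the label."""
    text = "".join(ET.fromstring(fragment).itertext())
    match = re.search(r"模式：([A-Z]+)", text)
    return match.group(1) if match else ""


for page in (PAGE_WITHOUT_SIZE, PAGE_WITH_SIZE):
    print(mode_by_position(page), "|", mode_by_regex(page))
```

On the second fragment the positional lookup silently returns the size line instead of the mode, while the regular expression still returns RGB.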

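II. Parsing the Page

Looking at the target pages, we leave out dynamically loaded values such as the view count, favorite count, and download count, and we group attributes that may be absent, such as size and mode, under a single metadata node.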


From this we can quickly design our data structure:

{
    title
    pictype
    number
    type
    metadata {
        size
        volume
        mode
        resolution
    }
    author
    images []
    tags []
}

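Because size, volume, mode, and resolution may not exist on every page, they are grouped under metadata. images is an array of image URLs and tags is an array of tag strings.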
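One way to pin this structure down in code is a pair of dataclasses. This is only a sketch: the class names Metadata and Result are assumptions chosen for illustration, and every field defaults to an empty value so that attributes missing from a page need no special handling.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Metadata:
    # Attributes that may be absent on some pages default to "".
    size: str = ""
    volume: str = ""
    mode: str = ""
    resolution: str = ""


@dataclass
class Result:
    title: str = ""
    pictype: str = ""
    number: str = ""
    type: str = ""
    metadata: Metadata = field(default_factory=Metadata)
    author: str = ""
    images: List[str] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)


# asdict() converts the nested dataclasses into plain dicts and lists.
empty = asdict(Result())
print(empty)
```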

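Parsing the page with Python

Python has a huge number of libraries, and many excellent ones can help us here. When parsing pages with Python, we usually use the following:

1. Regular expressions: re
2. CSS selectors: pyquery, beautifulsoup4
3. XPath: lxml
4. JSON PATH: jsonpath_rw

Under Python 3, the third-party libraries in this list can be installed with pip install. CSS selector syntax is more concise than XPath, and of pyquery and beautifulsoup4 we use pyquery here. The sections below walk through title and one of the file attributes as examples; the other nodes are handled the same way. First, download the page source with requests: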

import requests
from pyquery import PyQuery as pq
response = requests.get("http://www.58pic.com/newpic/32504070.html")
document = pq(response.content.decode('gb2312'))

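All of the Python parsing below builds on this setup.

1. Getting the title node

Open the original page, right-click the title, and choose Inspect Element (查看元素) to see its DOM structure.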
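The explicit decode step deserves a short illustration. GBK is a superset of GB2312, so a page that contains characters outside GB2312 can only be decoded with the wider codec. The sketch below uses a hypothetical helper, decode_page, and a sample character (镕) chosen because it exists in GBK but not in GB2312; it is not taken from the target page.

```python
def decode_page(raw, encoding="gbk"):
    """Decode downloaded bytes with an explicit encoding (gbk by default)."""
    return raw.decode(encoding)


# Sample bytes standing in for response.content.
raw = "镕".encode("gbk")

print(decode_page(raw))

try:
    decode_page(raw, "gb2312")
    gb2312_ok = True
except UnicodeDecodeError:
    # The narrower codec rejects bytes that are only valid in GBK.
    gb2312_ok = False

print("gb2312 decodes the sample:", gb2312_ok)
```

If a page declared as gb2312 ever fails to decode, retrying with gbk is a reasonable first step.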


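The title text we want, 大侠海报金庸武侠水墨中国风黑白, is not wrapped in an HTML tag of its own, which goes against the semantic DOM structure described above. A CSS selector cannot select this text node directly, so there are two ways to proceed.

Approach 1: select the parent node, take its HTML, and use a regular expression to match the text between </div> and <p.

Approach 2: select the parent node, remove every child node other than the text node, and then read the text of the parent node.

We use approach 2 here: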

title_node = document.find(".detail-title")
title_node.find("div").remove()
title_node.find("p").remove()
print(title_node.text())

Output:

大侠海报金庸武侠水墨中国风黑白
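For comparison, approach 1 can be sketched with the standard library alone. The markup in PARENT_HTML is hypothetical: it only imitates the shape described above (a child div, then the bare title text, then a p element), and the function name title_between_tags is likewise an assumption.

```python
import re

# Hypothetical parent markup: a <div> child, the bare title text, a <p> child.
PARENT_HTML = (
    '<div class="detail-title">'
    '<div class="badge">原创</div>'
    '大侠海报金庸武侠水墨中国风黑白'
    '<p>编号：32504070</p>'
    '</div>'
)


def title_between_tags(parent_html):
    """Return the text that sits between the first </div> and the next <p."""
    match = re.search(r"</div>\s*([^<]+?)\s*<p", parent_html)
    return match.group(1) if match else ""


print(title_between_tags(PARENT_HTML))
```

This variant depends on the exact order of the surrounding tags, which is one reason to prefer approach 2 when the markup may change.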

2. Getting the size node

Right-click the size (尺寸) field and inspect the element in the same way.


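These nodes have no semantic selectors, and the attributes are not guaranteed to exist (compare Page1 and Page2). The section on stable parsing code listed several approaches for documents structured this way; here we use regular expressions: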

import re
# Text of every <p> node in the attribute panel
context = document.find(".main-right p").text()
size_matches = re.compile(r"尺寸：(.*?像素)").findall(context)
size = ""
if len(size_matches) > 0:
    size = size_matches[0]
print(size)

Because size, volume, mode, and resolution are all extracted in the same way, we can wrap the pattern in a helper function:

def regex_get(text, expr):
    matches = re.compile(expr).findall(text)
    if len(matches) == 0:
        return ""
    return matches[0]

Getting size then becomes a single call:

size = regex_get(context, r"尺寸：(.*?像素)")
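To show how the helper behaves across all four attributes, including when one is missing, here is a small self-contained sketch. The context strings are sample text assembled from the values shown earlier, not output captured from the live page, and the PATTERNS dictionary is an illustrative name.

```python
import re


def regex_get(text, expr):
    """Return the first match of expr in text, or "" when nothing matches."""
    matches = re.compile(expr).findall(text)
    if len(matches) == 0:
        return ""
    return matches[0]


# One pattern per metadata field, as used in this article.
PATTERNS = {
    "size": r"尺寸：(.*?像素)",
    "volume": r"体积：(.*? MB)",
    "mode": r"模式：([A-Z]+)",
    "resolution": r"分辨率：(\d+dpi)",
}

# Sample text standing in for document.find(".main-right p").text().
with_size = "编号：32490230 尺寸：4724×6299像素 模式：RGB 体积：16.659 MB 分辨率：72dpi"
without_size = "编号：32490230 模式：RGB 体积：16.659 MB 分辨率：72dpi"

for context in (with_size, without_size):
    print({name: regex_get(context, expr) for name, expr in PATTERNS.items()})
```

A missing attribute simply yields an empty string, so the caller does not need a separate existence check.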

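3. The complete Python code

At this point, most of the problems we might run into while parsing the page have been solved. The complete Python code is as follows: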

import re

import requests
from pyquery import PyQuery as pq


def regex_get(text, expr):
    """Return the first match of expr in text, or "" when nothing matches."""
    matches = re.compile(expr).findall(text)
    if len(matches) == 0:
        return ""
    return matches[0]


conseq = {}

# Download the document and decode it explicitly, as in the setup above
response = requests.get("http://www.58pic.com/newpic/32504070.html")
document = pq(response.content.decode('gb2312'))

# Title: remove the child nodes so that only the bare text node remains
title_node = document.find(".detail-title")
title_node.find("div").remove()
title_node.find("p").remove()
conseq["title"] = title_node.text()

# Picture type
conseq["pictype"] = document.find(".pic-type").text()

# ID number, read from the data-id attribute of the download button
conseq["number"] = document.find(".detailBtn-down").attr("data-id")

# File format
conseq["type"] = regex_get(document.find(".mainRight-file").text(), r"文件格式：([a-z]+)")

# Metadata: attributes that may be missing on some pages
context = document.find(".main-right p").text()
conseq["metadata"] = {
    "size": regex_get(context, r"尺寸：(.*?像素)"),
    "volume": regex_get(context, r"体积：(.*? MB)"),
    "mode": regex_get(context, r"模式：([A-Z]+)"),
    "resolution": regex_get(context, r"分辨率：(\d+dpi)"),
}

# Author
conseq["author"] = document.find(".user-name").text()

# Images
conseq["images"] = []
for node_image in document.find("#show-area-height img"):
    conseq["images"].append(pq(node_image).attr("src"))

# Tags
conseq["tags"] = []
for node_tag in document.find(".mainRight-tagBox .fl"):
    conseq["tags"].append(pq(node_tag).text())

print(conseq)

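Parsing the page with Golang

When parsing html and xml documents in Golang, the following libraries are commonly used:

1. Regular expressions: regexp
2. CSS selectors: github.com/PuerkitoBio/goquery
3. XPath: gopkg.in/xmlpath.v2
4. JSON PATH: github.com/tidwall/gjson

The third-party libraries in this list can be fetched with go get -u. Since the parsing logic was already worked out in the Python section, the Golang version only needs to reproduce it. Unlike in Python, it is best to define a struct for the data structure first: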

type Result struct {
    Title    string
    Pictype  string
    Number   string
    Type     string
    Metadata struct {
        Size       string
        Volume     string
        Mode       string
        Resolution string
    }
    Author string
    Images []string
    Tags   []string
}

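The source page is encoded in gbk, so it has to be converted to utf-8 before parsing. We use github.com/axgle/mahonia for the gbk conversion and wrap it in a decoderConvert function: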

func decoderConvert(name string, body string) string {
    return mahonia.NewDecoder(name).ConvertString(body)
}

The complete golang code is as follows:

package main

import (
    "encoding/json"
    "log"
    "regexp"
    "strings"

    "github.com/PuerkitoBio/goquery"
    "github.com/axgle/mahonia"
    "github.com/parnurzeal/gorequest"
)

type Result struct {
    Title    string
    Pictype  string
    Number   string
    Type     string
    Metadata struct {
        Size       string
        Volume     string
        Mode       string
        Resolution string
    }
    Author string
    Images []string
    Tags   []string
}

// RegexGet returns the first capture group of expr in text,
// or "" when there is no match.
func RegexGet(text string, expr string) string {
    matches := regexp.MustCompile(expr).FindStringSubmatch(text)
    if len(matches) < 2 {
        return ""
    }
    return matches[1]
}

// decoderConvert converts body from the named encoding to utf-8.
func decoderConvert(name string, body string) string {
    return mahonia.NewDecoder(name).ConvertString(body)
}

func main() {
    // Download the document
    request := gorequest.New()
    _, body, errs := request.Get("http://www.58pic.com/newpic/32504070.html").End()
    if len(errs) > 0 {
        log.Fatal(errs)
    }
    document, err := goquery.NewDocumentFromReader(strings.NewReader(decoderConvert("gbk", body)))
    if err != nil {
        panic(err)
    }
    conseq := &Result{}
    // Title: remove the child nodes so that only the bare text remains
    titleNode := document.Find(".detail-title")
    titleNode.Find("div").Remove()
    titleNode.Find("p").Remove()
    conseq.Title = titleNode.Text()

    // Picture type
    conseq.Pictype = document.Find(".pic-type").Text()
    // ID number, read from the data-id attribute of the download button
    conseq.Number = document.Find(".detailBtn-down").AttrOr("data-id", "")
    // File format
    conseq.Type = RegexGet(document.Find(".mainRight-file").Text(), `文件格式：([a-z]+)`)
    // Metadata: each field is matched by its own label
    context := document.Find(".main-right p").Text()
    conseq.Metadata.Size = RegexGet(context, `尺寸：(.*?)像素`)
    conseq.Metadata.Volume = RegexGet(context, `体积：(.*? MB)`)
    conseq.Metadata.Mode = RegexGet(context, `模式：([A-Z]+)`)
    conseq.Metadata.Resolution = RegexGet(context, `分辨率：(\d+dpi)`)
    // Author
    conseq.Author = document.Find(".user-name").Text()
    // Images
    document.Find("#show-area-height img").Each(func(i int, element *goquery.Selection) {
        if attribute, exists := element.Attr("src"); exists && attribute != "" {
            conseq.Images = append(conseq.Images, attribute)
        }
    })
    // Tags
    document.Find(".mainRight-tagBox .fl").Each(func(i int, element *goquery.Selection) {
        conseq.Tags = append(conseq.Tags, element.Text())
    })
    bytes, _ := json.Marshal(conseq)
    log.Println(string(bytes))
}

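The parsing logic is the same as in the Python version. Next, let's see how GraphQuery handles the same task.

Parsing with GraphQuery

We already know that the data structure we want looks like this: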

{
    title
    pictype
    number
    type
    metadata {
        size
        volume
        mode
        resolution
    }
    author
    images []
    tags []
}

The corresponding GraphQuery expression is:

{
    title `xpath("/html/body/div[4]/div[1]/div/div/div[1]/text()")`
    pictype `css(".pic-type")`
    number `css(".detailBtn-down");attr("data-id")`
    type `regex("文件格式：([a-z]+)")`
    metadata `css(".main-right p")` {
        size `regex("尺寸：(.*?)像素")`
        volume `regex("体积：(.*? MB)")`
        mode `regex("模式：([A-Z]+)")`
        resolution `regex("分辨率：(\d+dpi)")`  
    }
    author `css(".user-name")`
    images `css("#show-area-height img")` [
        src `attr("src")`
    ]
    tags `css(".mainRight-tagBox .fl")` [
        tag `text()`
    ]
}

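Compared with the Python and Golang code, the GraphQuery expression is considerably shorter. Executing it returns the following result: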

{
    "data": {
        "author": "Ice bear",
        "images": [
            "http://pic.qiantucdn.com/58pic/32/50/40/70d58PICZfkRTfbnM2UVe_PIC2018.jpg!/fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a0", 
            "http://pic.qiantucdn.com/58pic/32/50/40/70d58PICZfkRTfbnM2UVe_PIC2018.jpg!/fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a1024", 
            "http://pic.qiantucdn.com/58pic/32/50/40/70d58PICZfkRTfbnM2UVe_PIC2018.jpg!/fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a2048", 
            "http://pic.qiantucdn.com/58pic/32/50/40/70d58PICZfkRTfbnM2UVe_PIC2018.jpg!/fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a3072"
        ],
        "metadata": {
            "mode": "RGB",
            "resolution": "200dpi",
            "size": "4724×6299",
            "volume": "196.886 MB"
        },
        "number": "32504070",
        "pictype": "原创",
        "tags": ["大侠", "海报", "黑白", "金庸", "水墨", "武侠", "中国风"],
        "title": "大侠海报金庸武侠水墨中国风黑白",
        "type": "psd"
    },
    "error": "",
    "timecost": 10997800
}

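GraphQuery is a text query language. A GraphQuery expression can combine xpath, css, jsonpath, and regular-expression selectors, and the desired data structure, the parsing code, and the returned result all share the same shape.

Project address: github.com/storyicon/graphquery

GraphQuery's syntax is concise and intuitive.

1. Calling GraphQuery from Golang

In golang, first fetch the package with go get -u github.com/storyicon/graphquery, then call GraphQuery from your code: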

package main

import (
    "log"

    "github.com/axgle/mahonia"
    "github.com/parnurzeal/gorequest"
    "github.com/storyicon/graphquery"
)

func decoderConvert(name string, body string) string {
    return mahonia.NewDecoder(name).ConvertString(body)
}

func main() {
    request := gorequest.New()
    _, body, _ := request.Get("http://www.58pic.com/newpic/32504070.html").End()
    body = decoderConvert("gbk", body)
    response := graphquery.ParseFromString(body, "{ title `xpath(\"/html/body/div[4]/div[1]/div/div/div[1]/text()\")` pictype `css(\".pic-type\")` number `css(\".detailBtn-down\");attr(\"data-id\")` type `regex(\"文件格式：([a-z]+)\")` metadata `css(\".main-right p\")` { size `regex(\"尺寸：(.*?)像素\")` volume `regex(\"体积：(.*? MB)\")` mode `regex(\"模式：([A-Z]+)\")` resolution `regex(\"分辨率：(\\d+dpi)\")` } author `css(\".user-name\")` images `css(\"#show-area-height img\")` [ src `attr(\"src\")` ] tags `css(\".mainRight-tagBox .fl\")` [ tag `text()` ] }")
    log.Println(response)
}

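Here the GraphQuery expression is written on a single line and passed as the second argument to graphquery.ParseFromString.

2. Calling GraphQuery from Python

From Python and other backend languages, GraphQuery is called through its service, which has to be started first and is available for windows, mac, and linux. The code below assumes the service is listening on 127.0.0.1:8559. Once it is running, GraphQuery can be called like this: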

import requests

def GraphQuery(document, expr):
    response = requests.post("http://127.0.0.1:8559", data={
        "document": document,
        "expression": expr,
    })
    return response.text

response = requests.get("http://www.58pic.com/newpic/32504070.html")
conseq = GraphQuery(response.text, r"""
    {
        title `xpath("/html/body/div[4]/div[1]/div/div/div[1]/text()")`
        pictype `css(".pic-type")`
        number `css(".detailBtn-down");attr("data-id")`
        type `regex("文件格式：([a-z]+)")`
        metadata `css(".main-right p")` {
            size `regex("尺寸：(.*?)像素")`
            volume `regex("体积：(.*? MB)")`
            mode `regex("模式：([A-Z]+)")`
            resolution `regex("分辨率：(\d+dpi)")`  
        }
        author `css(".user-name")`
        images `css("#show-area-height img")` [
            src `attr("src")`
        ]
        tags `css(".mainRight-tagBox .fl")` [
            tag `text()`
        ]
    }
""")
print(conseq)

The output is:

{
    "data": {
        "author": "Ice bear",
        "images": [
            "http://pic.qiantucdn.com/58pic/32/50/40/70d58PICZfkRTfbnM2UVe_PIC2018.jpg!/fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a0", 
            "http://pic.qiantucdn.com/58pic/32/50/40/70d58PICZfkRTfbnM2UVe_PIC2018.jpg!/fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a1024", 
            "http://pic.qiantucdn.com/58pic/32/50/40/70d58PICZfkRTfbnM2UVe_PIC2018.jpg!/fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a2048", 
            "http://pic.qiantucdn.com/58pic/32/50/40/70d58PICZfkRTfbnM2UVe_PIC2018.jpg!/fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a3072"
        ],
        "metadata": {
            "mode": "RGB",
            "resolution": "200dpi",
            "size": "4724×6299",
            "volume": "196.886 MB"
        },
        "number": "32504070",
        "pictype": "原创",
        "tags": ["大侠", "海报", "黑白", "金庸", "水墨", "武侠", "中国风"],
        "title": "大侠海报金庸武侠水墨中国风黑白",
        "type": "psd"
    },
    "error": "",
    "timecost": 10997800
}
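Since the service returns a JSON envelope with data, error, and timecost fields, the caller will usually want to unwrap it before using the result. The sketch below is one way to do that; the function name unwrap and the rule that a non-empty error string means failure are assumptions based only on the output shown above, not on documented behavior.

```python
import json


def unwrap(response_text):
    """Parse the JSON envelope and return its "data" member.

    Assumption: a non-empty "error" string signals a failed query.
    """
    payload = json.loads(response_text)
    if payload.get("error"):
        raise ValueError(payload["error"])
    return payload["data"]


# A shortened copy of the output shown above.
SAMPLE = """{
    "data": {
        "metadata": {"mode": "RGB", "resolution": "200dpi",
                     "size": "4724×6299", "volume": "196.886 MB"},
        "number": "32504070",
        "title": "大侠海报金庸武侠水墨中国风黑白",
        "type": "psd"
    },
    "error": "",
    "timecost": 10997800
}"""

data = unwrap(SAMPLE)
print(data["title"], data["metadata"]["size"])
```

In the Python example above, this would be applied to the string returned by the GraphQuery function.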

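III. Afterword

Hand-written parsing logic has to be repeated in every language and library, as the Python and Golang versions above show. GraphQuery replaces it with a single expression that mirrors the desired data structure and can be called from either language.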