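- Web Scraping Made Approachable: Python, Golang and GraphQuery Compared
- I. Preface
- 1. Semantic DOM structure
- 2. Stable parsing code
- II. Parsing the page
- Parsing the page with Python
- 1. Getting the title node
- 2. Getting the size node
- 3. The complete Python code
- Parsing the page with Golang
- Parsing with GraphQuery
- 1. Calling GraphQuery from Golang
- 2. Calling GraphQuery from Python
- III. Afterword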
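I. Preface
To avoid unnecessary confusion in later chapters, this preface first covers a few basic programming concepts.
1. Semantic DOM structure
A semantic DOM structure means not only semantic HTML tags but also semantic selectors, that is, meaningful class and id attributes on the elements that carry data. Consider the following HTML, whose labels are 编号 (number), 模式 (mode), 体积 (volume) and 分辨率 (resolution):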
<div class="main-right fr">
<p>编号:32490230</p>
<p class="main-rightStage">模式:RGB</p>
<p class="main-rightStage">体积:16.659 MB</p>
<p class="main-rightStage">分辨率:72dpi</p>
</div>
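Here 32490230, RGB, 16.659 MB and 72dpi are all dynamically changing attributes. Ideally each of them would be wrapped in its own <span> tag, which also makes later changes easier, for example if the values are loaded through jsonp or Ajax: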
<p class="main-rightStage property-mode">
模式:<span>RGB</span>
</p>
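With the property-mode class on the paragraph and the value inside its own <span>, the value can be selected directly by class instead of by position.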
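To make the benefit concrete, here is a minimal sketch that uses only the Python standard library. The class SpanInClassParser is hypothetical, written just for this illustration, and it assumes well-nested markup without void elements such as <br>.

```python
from html.parser import HTMLParser

# The semantic fragment shown above.
HTML = """
<p class="main-rightStage property-mode">
    模式:<span>RGB</span>
</p>
"""


class SpanInClassParser(HTMLParser):
    """Collect the text of <span> elements nested in a tag that carries a given class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0        # > 0 while inside the element that carries class_name
        self.in_span = False  # True while inside a <span> within that element
        self.values = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0:
            self.depth += 1
            if tag == "span":
                self.in_span = True
        elif self.class_name in classes:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth > 0:
            self.depth -= 1
        if tag == "span":
            self.in_span = False

    def handle_data(self, data):
        if self.in_span:
            self.values.append(data.strip())


parser = SpanInClassParser("property-mode")
parser.feed(HTML)
print(parser.values)
```

Because the lookup is keyed on the class name, adding or reordering sibling paragraphs would not change what it finds.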
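2. Stable parsing code
With semantic DOM structure in mind, let us look at what makes parsing code stable. Take the following DOM: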
<div class="main-right fr">
<p>编号:32490230</p>
<p class="main-rightStage">模式:RGB</p>
<p class="main-rightStage">体积:16.659 MB</p>
<p class="main-rightStage">分辨率:72dpi</p>
</div>
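Suppose we want to extract the mode (模式) from this fragment. A straightforward approach is:
- select the div whose class contains main-right;
- take the second p inside that div, whose text is 模式:RGB;
- strip the 模式: prefix to get RGB.
This works on this page, but it is unstable. On another page an extra attribute, 尺寸 (size), appears before 模式: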
<div class="main-right fr">
<p>编号:32490230</p>
<p class="main-rightStage">尺寸:4724×6299像素</p>
<p class="main-rightStage">模式:RGB</p>
<p class="main-rightStage">体积:16.659 MB</p>
<p class="main-rightStage">分辨率:72dpi</p>
</div>
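Positional selection now returns the 尺寸 line instead of the mode. Several more stable approaches are available:
- iterate over the p nodes whose class is main-rightStage and keep the one whose text starts with 模式:;
- match the text with the regular expression 模式:([A-Z]+);
- use the contains selector, .main-rightStage:contains(模式), to pick the node with class main-rightStage whose text contains 模式.
When choosing among them, weigh stability, complexity, runtime efficiency and compatibility.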
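The difference between positional selection and the regular-expression approach can be sketched with the standard library alone. The two fragments are the ones shown above; the helper names extract_mode and second_paragraph are assumptions made for this illustration.

```python
import re

# The two fragments from above: one without and one with the 尺寸 (size) line.
PAGE_WITHOUT_SIZE = """
<div class="main-right fr">
<p>编号:32490230</p>
<p class="main-rightStage">模式:RGB</p>
<p class="main-rightStage">体积:16.659 MB</p>
<p class="main-rightStage">分辨率:72dpi</p>
</div>
"""

PAGE_WITH_SIZE = """
<div class="main-right fr">
<p>编号:32490230</p>
<p class="main-rightStage">尺寸:4724×6299像素</p>
<p class="main-rightStage">模式:RGB</p>
<p class="main-rightStage">体积:16.659 MB</p>
<p class="main-rightStage">分辨率:72dpi</p>
</div>
"""


def extract_mode(html):
    """Regex approach: return the mode value, or "" if there is no 模式 line."""
    match = re.search(r"模式:([A-Z]+)", html)
    return match.group(1) if match else ""


def second_paragraph(html):
    """Positional approach: return the text of the second <p>, whatever it holds."""
    paragraphs = re.findall(r"<p[^>]*>(.*?)</p>", html)
    return paragraphs[1] if len(paragraphs) > 1 else ""


print(extract_mode(PAGE_WITHOUT_SIZE), extract_mode(PAGE_WITH_SIZE))
print(second_paragraph(PAGE_WITHOUT_SIZE))
print(second_paragraph(PAGE_WITH_SIZE))
```

The regex lookup gives the same answer on both pages, while the positional lookup silently switches from the mode line to the size line.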
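II. Parsing the page
The page we parse shows, among other things, counters such as views (浏览量), favorites (收藏量) and downloads (下载量), plus file attributes such as size (尺寸) and mode (模式), which we group together as meta information (metainfo).
From this we can quickly design our data structure: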
{
    title
    pictype
    number
    type
    metadata {
        size
        volume
        mode
        resolution
    }
    author
    images []
    tags []
}
Here size, volume, mode and resolution are grouped under metadata, while images and tags are arrays.
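As a rough sketch, the same structure could be written down in Python with dataclasses. The class names Metadata and Result are assumptions for this illustration; the field names come from the structure above.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Metadata:
    size: str = ""
    volume: str = ""
    mode: str = ""
    resolution: str = ""


@dataclass
class Result:
    title: str = ""
    pictype: str = ""
    number: str = ""
    type: str = ""
    metadata: Metadata = field(default_factory=Metadata)
    author: str = ""
    # Mutable defaults need default_factory so instances do not share one list.
    images: List[str] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)


result = Result(title="大侠海报金庸武侠水墨中国风黑白")
result.metadata.mode = "RGB"
result.tags.append("武侠")
print(asdict(result))
```

Fields that are missing on a page simply keep their empty defaults, which matches how the parsing code below returns an empty string when nothing is found.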
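Parsing the page with Python
Python has a huge number of libraries, and many excellent ones can help us here. When parsing pages with Python, we usually use the following:
- Regular expressions: re
- CSS selectors: pyquery, beautifulsoup4
- XPath: lxml
- JSON Path: jsonpath_rw
All of these support Python 3 and can be installed with pip install. Between CSS selectors and XPath, and between pyquery and beautifulsoup4, we choose CSS selectors with pyquery here. The walkthrough below covers a few representative nodes such as title and type; the others are handled the same way. First, download the page source with requests: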
import requests
from pyquery import PyQuery as pq
response = requests.get("http://www.58pic.com/newpic/32504070.html")
document = pq(response.content.decode('gb2312'))
All of the Python parsing below builds on this document object.
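The explicit decode matters because the page is not served as utf-8. A minimal standard-library sketch of the difference, where the sample string is simply the first four characters of the page title:

```python
# Simulate the raw bytes of a gb2312-encoded page.
raw = "大侠海报".encode("gb2312")

# Decoding with the matching codec restores the text.
print(raw.decode("gb2312"))

# Decoding the same bytes as utf-8 fails, which is why the bytes are decoded explicitly.
try:
    raw.decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False
print(decoded_as_utf8)
```

gbk is a superset of gb2312, so decoding with "gbk" also works for these bytes and tolerates characters that gb2312 lacks.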
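1. Getting the title node
Use the browser's Inspect Element (查看元素) on the title to see its DOM structure.
We want the title text 大侠海报金庸武侠水墨中国风黑白, but it is a bare text node that sits between a closing </div> and an opening <p. There are two approaches:
- Approach 1: take the text between </div> and <p, for example with a regular expression;
- Approach 2: remove the div and p child nodes from the title node and read the text that remains.
Here we use approach 2: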
title_node = document.find(".detail-title")
title_node.find("div").remove()
title_node.find("p").remove()
print(title_node.text())
This prints 大侠海报金庸武侠水墨中国风黑白.
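For comparison, approach 1 might look like the sketch below. The fragment is hypothetical: only the .detail-title class and the title text come from the page, while the contents of the inner div and p are placeholders, and the helper name title_between_tags is an assumption.

```python
import re

# Hypothetical fragment: the inner <div> and <p> contents are placeholders.
FRAGMENT = (
    '<div class="detail-title">'
    "<div>placeholder</div>"
    "大侠海报金庸武侠水墨中国风黑白"
    "<p>placeholder</p>"
    "</div>"
)


def title_between_tags(html):
    """Return the bare text between a closing </div> and the next <p, or ""."""
    match = re.search(r"</div>\s*([^<]+?)\s*<p", html)
    return match.group(1) if match else ""


print(title_between_tags(FRAGMENT))
```

This ties the code to the exact tag order around the title, so approach 2 is usually the more stable of the two.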
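2. Getting the size node
Next we extract the size (尺寸) attribute.
These nodes have no semantic selectors, and not every attribute is present on every page (compare Page1 and Page2). The Stable parsing code section listed several approaches for documents structured like this; here we use regular expressions: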
import re
# Text of the attribute paragraphs, the same selector the complete code uses
context = document.find(".main-right p").text()
size_matches = re.compile(r"尺寸:(.*?像素)").findall(context)
size = ""
if len(size_matches) > 0:
    size = size_matches[0]
print(size)
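Since size, volume, mode and resolution are all extracted the same way, we wrap the logic in a helper function:
def regex_get(text, expr):
    # Return the first match of expr in text, or "" when nothing matches
    matches = re.compile(expr).findall(text)
    if len(matches) == 0:
        return ""
    return matches[0]
Getting size then becomes a single call:
size = regex_get(context, r"尺寸:(.*?像素)")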
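The same helper covers the other three fields. The sketch below runs without network access on a sample string; the sample is an assumption about what the joined paragraph text looks like, built from the values that appear in the output later in this article.

```python
import re


def regex_get(text, expr):
    """Return the first match of expr in text, or "" when nothing matches."""
    matches = re.compile(expr).findall(text)
    if len(matches) == 0:
        return ""
    return matches[0]


# Assumed sample of the joined paragraph text.
context = "编号:32504070 尺寸:4724×6299像素 模式:RGB 体积:196.886 MB 分辨率:200dpi"

patterns = {
    "size": r"尺寸:(.*?像素)",
    "volume": r"体积:(.*? MB)",
    "mode": r"模式:([A-Z]+)",
    "resolution": r"分辨率:(\d+dpi)",
}

metadata = {name: regex_get(context, expr) for name, expr in patterns.items()}
print(metadata)

# A page without the 尺寸 line yields an empty string instead of raising.
print(repr(regex_get("编号:32490230 模式:RGB", patterns["size"])))
```

Returning an empty string for a missing attribute keeps the caller free of per-field existence checks.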
3. The complete Python code
At this point, most of the problems we might run into while parsing the page are solved. The complete Python code is:
import re

import requests
from pyquery import PyQuery as pq


def regex_get(text, expr):
    # Return the first match of expr in text, or "" when nothing matches
    matches = re.compile(expr).findall(text)
    if len(matches) == 0:
        return ""
    return matches[0]


conseq = {}

# Download the document and decode it the same way as in the snippet above
response = requests.get("http://www.58pic.com/newpic/32504070.html")
document = pq(response.content.decode('gb2312'))

# Get the title: drop the child nodes and keep the bare text
title_node = document.find(".detail-title")
title_node.find("div").remove()
title_node.find("p").remove()
conseq["title"] = title_node.text()

# Get the picture type
conseq["pictype"] = document.find(".pic-type").text()

# Get the file format
conseq["type"] = regex_get(document.find(".mainRight-file").text(), r"文件格式:([a-z]+)")

# Get the metadata
context = document.find(".main-right p").text()
conseq["metadata"] = {
    "size": regex_get(context, r"尺寸:(.*?像素)"),
    "volume": regex_get(context, r"体积:(.*? MB)"),
    "mode": regex_get(context, r"模式:([A-Z]+)"),
    "resolution": regex_get(context, r"分辨率:(\d+dpi)"),
}

# Get the author
conseq["author"] = document.find(".user-name").text()

# Get the images
conseq["images"] = []
for node_image in document.find("#show-area-height img"):
    conseq["images"].append(pq(node_image).attr("src"))

# Get the tags
conseq["tags"] = []
for node_tag in document.find(".mainRight-tagBox .fl"):
    conseq["tags"].append(pq(node_tag).text())

print(conseq)
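Parsing the page with Golang
Golang likewise has libraries for parsing html and xml documents. The ones commonly used are:
- Regular expressions: regexp
- CSS selectors: github.com/PuerkitoBio/goquery
- XPath: gopkg.in/xmlpath.v2
- JSON Path: github.com/tidwall/gjson
All of these can be fetched with go get -u. The parsing logic is the same as in the Python version, so in Golang we only need to reproduce it. Unlike in Python, we first define a struct for our data structure: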
type Result struct {
	Title    string
	Pictype  string
	Number   string
	Type     string
	Metadata struct {
		Size       string
		Volume     string
		Mode       string
		Resolution string
	}
	Author string
	Images []string
	Tags   []string
}
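Golang strings are utf-8, but the page is encoded in gbk, so the body has to be converted from gbk first. We use github.com/axgle/mahonia for this and wrap it in a decoderConvert function: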
func decoderConvert(name string, body string) string {
	return mahonia.NewDecoder(name).ConvertString(body)
}
The complete golang code is:
package main

import (
	"encoding/json"
	"log"
	"regexp"
	"strings"

	"github.com/PuerkitoBio/goquery"
	"github.com/axgle/mahonia"
	"github.com/parnurzeal/gorequest"
)

// Result holds the fields extracted from the page.
type Result struct {
	Title    string
	Pictype  string
	Number   string
	Type     string
	Metadata struct {
		Size       string
		Volume     string
		Mode       string
		Resolution string
	}
	Author string
	Images []string
	Tags   []string
}

// RegexGet returns the first capture group of expr in text, or "" if nothing matches.
func RegexGet(text string, expr string) string {
	matches := regexp.MustCompile(expr).FindStringSubmatch(text)
	if len(matches) < 2 {
		return ""
	}
	return matches[1]
}

// decoderConvert converts body from the named encoding to utf-8.
func decoderConvert(name string, body string) string {
	return mahonia.NewDecoder(name).ConvertString(body)
}

func main() {
	// Download the document
	request := gorequest.New()
	_, body, errs := request.Get("http://www.58pic.com/newpic/32504070.html").End()
	if len(errs) > 0 {
		panic(errs[0])
	}
	document, err := goquery.NewDocumentFromReader(strings.NewReader(decoderConvert("gbk", body)))
	if err != nil {
		panic(err)
	}
	conseq := &Result{}

	// Get the title: drop the child nodes and keep the bare text
	titleNode := document.Find(".detail-title")
	titleNode.Find("div").Remove()
	titleNode.Find("p").Remove()
	conseq.Title = titleNode.Text()

	// Get the picture type
	conseq.Pictype = document.Find(".pic-type").Text()

	// Get the file format
	conseq.Type = RegexGet(document.Find(".mainRight-file").Text(), `文件格式:([a-z]+)`)

	// Get the metadata
	context := document.Find(".main-right p").Text()
	conseq.Metadata.Size = RegexGet(context, `尺寸:(.*?)像素`)
	conseq.Metadata.Volume = RegexGet(context, `体积:(.*? MB)`)
	conseq.Metadata.Mode = RegexGet(context, `模式:([A-Z]+)`)
	conseq.Metadata.Resolution = RegexGet(context, `分辨率:(\d+dpi)`)

	// Get the author
	conseq.Author = document.Find(".user-name").Text()

	// Get the images
	document.Find("#show-area-height img").Each(func(i int, element *goquery.Selection) {
		if attribute, exists := element.Attr("src"); exists && attribute != "" {
			conseq.Images = append(conseq.Images, attribute)
		}
	})

	// Get the tags
	document.Find(".mainRight-tagBox .fl").Each(func(i int, element *goquery.Selection) {
		conseq.Tags = append(conseq.Tags, element.Text())
	})

	output, err := json.Marshal(conseq)
	if err != nil {
		panic(err)
	}
	log.Println(string(output))
}
Next, let us see how GraphQuery approaches the same task.
Parsing with GraphQuery
We already know the data structure we want:
{
    title
    pictype
    number
    type
    metadata {
        size
        volume
        mode
        resolution
    }
    author
    images []
    tags []
}
The corresponding GraphQuery code is:
{
    title `xpath("/html/body/div[4]/div[1]/div/div/div[1]/text()")`
    pictype `css(".pic-type")`
    number `css(".detailBtn-down");attr("data-id")`
    type `regex("文件格式:([a-z]+)")`
    metadata `css(".main-right p")` {
        size `regex("尺寸:(.*?)像素")`
        volume `regex("体积:(.*? MB)")`
        mode `regex("模式:([A-Z]+)")`
        resolution `regex("分辨率:(\d+dpi)")`
    }
    author `css(".user-name")`
    images `css("#show-area-height img")` [
        src `attr("src")`
    ]
    tags `css(".mainRight-tagBox .fl")` [
        tag `text()`
    ]
}
The expression is the same whether GraphQuery is called from Python or from Golang. It returns the following:
{
"data": {
"author": "Ice bear",
"images": [
"http://pic.qiantucdn.com/58pic/32/50/40/70d58PICZfkRTfbnM2UVe_PIC2018.jpg!/fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a0",
"http://pic.qiantucdn.com/58pic/32/50/40/70d58PICZfkRTfbnM2UVe_PIC2018.jpg!/fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a1024",
"http://pic.qiantucdn.com/58pic/32/50/40/70d58PICZfkRTfbnM2UVe_PIC2018.jpg!/fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a2048",
"http://pic.qiantucdn.com/58pic/32/50/40/70d58PICZfkRTfbnM2UVe_PIC2018.jpg!/fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a3072"
],
"metadata": {
"mode": "RGB",
"resolution": "200dpi",
"size": "4724×6299",
"volume": "196.886 MB"
},
"number": "32504070",
"pictype": "原创",
"tags": ["大侠", "海报", "黑白", "金庸", "水墨", "武侠", "中国风"],
"title": "大侠海报金庸武侠水墨中国风黑白",
"type": "psd"
},
"error": "",
"timecost": 10997800
}
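GraphQuery is a text query language. A GraphQuery expression can combine xpath, css, jsonpath and regular expressions, and as the example shows, the data structure, the parsing code and the returned result all have the same shape.
Project address: github.com/storyicon/graphquery
GraphQuery's syntax is meant to be intuitive.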
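The idea that the query and the result share one shape can be illustrated with a toy sketch in plain Python. This is not GraphQuery and does not use its API: the extract function and the dict-based SPEC format are assumptions made purely for illustration, and only regular expressions are supported.

```python
import re


def extract(text, spec):
    """Apply a nested dict of regular expressions; return a dict of the same shape."""
    result = {}
    for key, rule in spec.items():
        if isinstance(rule, dict):
            # Nested block: recurse so the result mirrors the nesting of the spec.
            result[key] = extract(text, rule)
        else:
            match = re.search(rule, text)
            result[key] = match.group(1) if match else ""
    return result


SPEC = {
    "type": r"文件格式:([a-z]+)",
    "metadata": {
        "size": r"尺寸:(.*?)像素",
        "volume": r"体积:(.*? MB)",
        "mode": r"模式:([A-Z]+)",
        "resolution": r"分辨率:(\d+dpi)",
    },
}

# Assumed sample text, built from the values in the output above.
TEXT = "文件格式:psd 尺寸:4724×6299像素 体积:196.886 MB 模式:RGB 分辨率:200dpi"

print(extract(TEXT, SPEC))
```

Reading SPEC top to bottom tells you what the result will look like, which is the property the GraphQuery expression above has as well.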
1. Calling GraphQuery from Golang
In golang, first fetch the package with go get -u github.com/storyicon/graphquery, then call GraphQuery from your code:
package main

import (
	"log"

	"github.com/axgle/mahonia"
	"github.com/parnurzeal/gorequest"
	"github.com/storyicon/graphquery"
)

func decoderConvert(name string, body string) string {
	return mahonia.NewDecoder(name).ConvertString(body)
}

func main() {
	request := gorequest.New()
	_, body, _ := request.Get("http://www.58pic.com/newpic/32504070.html").End()
	body = decoderConvert("gbk", body)
	response := graphquery.ParseFromString(body, "{ title `xpath(\"/html/body/div[4]/div[1]/div/div/div[1]/text()\")` pictype `css(\".pic-type\")` number `css(\".detailBtn-down\");attr(\"data-id\")` type `regex(\"文件格式:([a-z]+)\")` metadata `css(\".main-right p\")` { size `regex(\"尺寸:(.*?)像素\")` volume `regex(\"体积:(.*? MB)\")` mode `regex(\"模式:([A-Z]+)\")` resolution `regex(\"分辨率:(\\d+dpi)\")` } author `css(\".user-name\")` images `css(\"#show-area-height img\")` [ src `attr(\"src\")` ] tags `css(\".mainRight-tagBox .fl\")` [ tag `text()` ] }")
	log.Println(response)
}
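Here the GraphQuery expression is passed to graphquery.ParseFromString as a single-line (单行) string.
2. Calling GraphQuery from Python
In Python and other backend languages, GraphQuery is called through a separate GraphQuery service, which is available for windows, mac and linux. With the service running locally, GraphQuery can be called from Python like this: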
import requests


def GraphQuery(document, expr):
    # Send the document and the expression to the local GraphQuery service
    response = requests.post("http://127.0.0.1:8559", data={
        "document": document,
        "expression": expr,
    })
    return response.text


response = requests.get("http://www.58pic.com/newpic/32504070.html")
conseq = GraphQuery(response.text, r"""
{
    title `xpath("/html/body/div[4]/div[1]/div/div/div[1]/text()")`
    pictype `css(".pic-type")`
    number `css(".detailBtn-down");attr("data-id")`
    type `regex("文件格式:([a-z]+)")`
    metadata `css(".main-right p")` {
        size `regex("尺寸:(.*?)像素")`
        volume `regex("体积:(.*? MB)")`
        mode `regex("模式:([A-Z]+)")`
        resolution `regex("分辨率:(\d+dpi)")`
    }
    author `css(".user-name")`
    images `css("#show-area-height img")` [
        src `attr("src")`
    ]
    tags `css(".mainRight-tagBox .fl")` [
        tag `text()`
    ]
}
""")
print(conseq)
The output is:
{
"data": {
"author": "Ice bear",
"images": [
"http://pic.qiantucdn.com/58pic/32/50/40/70d58PICZfkRTfbnM2UVe_PIC2018.jpg!/fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a0",
"http://pic.qiantucdn.com/58pic/32/50/40/70d58PICZfkRTfbnM2UVe_PIC2018.jpg!/fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a1024",
"http://pic.qiantucdn.com/58pic/32/50/40/70d58PICZfkRTfbnM2UVe_PIC2018.jpg!/fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a2048",
"http://pic.qiantucdn.com/58pic/32/50/40/70d58PICZfkRTfbnM2UVe_PIC2018.jpg!/fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a3072"
],
"metadata": {
"mode": "RGB",
"resolution": "200dpi",
"size": "4724×6299",
"volume": "196.886 MB"
},
"number": "32504070",
"pictype": "原创",
"tags": ["大侠", "海报", "黑白", "金庸", "水墨", "武侠", "中国风"],
"title": "大侠海报金庸武侠水墨中国风黑白",
"type": "psd"
},
"error": "",
"timecost": 10997800
}
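Since the service returns JSON text, the caller still has to decode it and check the error field. A minimal sketch, assuming the response always has the data, error and timecost keys shown above; the helper name unwrap and the shortened sample payload are assumptions for this illustration.

```python
import json


def unwrap(raw):
    """Decode a response string and return its data, raising if error is non-empty."""
    payload = json.loads(raw)
    if payload.get("error"):
        raise ValueError(payload["error"])
    return payload["data"]


# Shortened sample in the same shape as the output above.
SAMPLE = (
    '{"data": {"title": "大侠海报金庸武侠水墨中国风黑白", "type": "psd", '
    '"tags": ["大侠", "海报"]}, "error": "", "timecost": 10997800}'
)

data = unwrap(SAMPLE)
print(data["title"], data["type"], data["tags"])
```

In the Python example above, this would be applied to the string returned by the GraphQuery function.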
III. Afterword
For more on GraphQuery, see the project page at github.com/storyicon/graphquery.