抓取 App Store和 Google Play 应用评论并写入 Excel
这几天开始在利用业余时间用 Swift 5 重写公司项目,继上一次用 Swift 写项目隔了有两年了吧(中间陆续在学习),Swift 变得更加好用了,真是愈发嫌弃 OC,来新公司把所负责的 App 从规范到基础服务再到业务里外优化了一番,算是有所成果了。新项目到写得正 high,组长丢给我一个任务,产品经理需要竞品在 App Store 和 Google Play 的评论数据用于调研,简单实现了一番,趁热打铁,记录一下结果,以便日后需要。
想到第一方案
在未做调研前,想到的第一方案是利用 Appium + mitmproxy + python 实现,想怎么抓就怎么抓,唰唰唰先代理看看,结果发现 App Store 启用网络代理后无法联网,原因是 SSL Pinning,然后就解决这个问题,找到 ssl-kill-switch2,可以解决这个问题,在越狱手机上安装了 deb 包,关闭 itunesstored 服务,发现并不生效:
dpkg -i com.nablac0d3.sslkillswitch2_0.13.deb
ps -ef | grep itunesstored
kill -s KILL 1170
看 Issues 确实在 iOS 11.4.1 上无法生效,可惜我的越狱设备正好是这个版本,该方案夭折。
调研第三方服务商
既然一方案夭折,想想该需求市场上肯定有数据服务商,于是谷歌搜索了一 ha,果然前面几位都是“科学家”网站,进去看看,可以看到有广告 产品舆情 - WeTest腾讯质量开放平台 和 七麦数据 的,前者没去看,简单看了下后者,注册账号,两个问题
- 不能查看非中国区评论
- 不能自定义生成的 Excel 选项
于是,该方案 pass
App Store 从官方渠道获取最近500条
考虑到多款竞品在 App Store 对应国区的评论数均不足 500条,于是使用苹果官方提供的 API
https://itunes.apple.com/rss/customerreviews/page=1/id=414478124/sortby=mostrecent/json?l=en&&cc=cn
page 和 id 自行更改,该 API最多可以获取一个App的最近500条评论(每页最多50,最多10页),该方案顺利抓取数据,代码:
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import requests
import urllib.request
import re
import xlsxwriter
import json
import os
def getHTMLText(url):
response = urllib.request.urlopen(url)
myjson = json.loads(response.read().decode())
return myjson
# try:
# r = requests.get(url)
# r.raise_for_status()
# r.encoding = r.apparent_encoding
# return r.text
# except:
# return ''
def main():
appid = input("请输入应用id号:")
appName = input("请输入应用名称:")
if not os.path.exists(appid):
os.system('mkdir ' + appid)
workbook = xlsxwriter.Workbook(appid + '/' + appName + '_comments.xlsx')
worksheet = workbook.add_worksheet()
format = workbook.add_format()
format.set_border(1)
format.set_border(1)
format_title = workbook.add_format()
format_title.set_border(1)
format_title.set_bg_color('#cccccc')
format_title.set_align('left')
format_title.set_bold()
title = ['作者', '标题', '评论内容', '版本', '评级', '投票']
# 设置单元格宽度
worksheet.set_column(0, 0, 30)
worksheet.set_column(1, 1, 40)
worksheet.set_column(2, 2, 100)
worksheet.set_column(3, 3, 10)
worksheet.set_column(4, 4, 10)
worksheet.set_column(5, 5, 10)
worksheet.write_row('A1', title, format_title)
count = 0
total = 10
totalCount = 0
for n in range(total):
url = 'https://itunes.apple.com/rss/customerreviews/page=' + \
str(n+1) + '/id=' + str(appid) + \
'/sortby=mostrecent/json?l=en&&cc=kh'
print('当前地址:' + url)
jsonText = getHTMLText(url)
fileName = appid + '/' + str(n+1) + '.json'
data_feed = jsonText['feed']
entry = data_feed['entry']
for i in range(len(entry)):
value = entry[i]
fixedIndex = i + 1
startRow = totalCount + 1
worksheet.write(
startRow, 0, value['author']['name']['label'], format)
worksheet.write(startRow, 1, value['title']['label'], format)
worksheet.write(startRow, 2, value['content']['label'], format)
worksheet.write(
startRow, 3, value['im:version']['label'], format)
worksheet.write(
startRow, 4, value['im:rating']['label'], format)
worksheet.write(
startRow, 5, value['im:rating']['label'], format)
totalCount = totalCount + 1
with open(fileName, 'w') as file:
file.write(json.dumps(jsonText, sort_keys=True,
indent=4, ensure_ascii=False))
count = count + 1
print("当前进度: {:.2f}%".format(count * 100 / total), end="\n\n")
workbook.close()
if __name__ == '__main__':
main()
Google Play App评论获取
谷歌相比苹果而言这方面更加开放,App 数据获取 API 次数不限,而且有 现成的 npm 库,笔芯,感谢开源世界。新建个 node 项目,三下五除二实验了一下,库高效可用,于是开始简单分析数据编写代码,产出如下:
'use strict';
var gplay = require('google-play-scraper');
var fs = require("fs");
// const readline = require('readline').createInterface({
// input: process.stdin,
// output: process.stdout
// })
// readline.question(`请输入 apk 包名:`, (inputText) => {
// bundle_id = inputText
// console.log(`包名: ${bundle_id}!`)
// readline.close()
// fs.mkdirSync(bundle_id)
// gplay.reviews({
// appId: bundle_id,
// lang: 'kh'
// })
// .then((body) => {
// fs.writeFile(`${bundle_id}/comments.json`, JSON.stringify(body), (err) => {
// if (err) {
// console.error(err);
// return;
// };
// console.log("File has been created");
// });
// })
// })
var arr = [
{
'bundle_id': 'com.pipay.app.android',
'name': 'PiPay'
},
// {
// 'bundle_id': 'com.aeon_cambodia.rielpay',
// 'name': 'AEON Wallet'
// },
// {
// 'bundle_id': 'com.wingmoney.wingpay',
// 'name': 'Wing Money'
// },
// {
// 'bundle_id': 'th.co.truemoney.wallet',
// 'name': 'True Money'
// }
]
const total = 50
for (let i = 0; i < arr.length; i++) {
const object = arr[i]
var bundle_id, name
if (object.hasOwnProperty('bundle_id')) {
bundle_id = object['bundle_id'];
if (!fs.existsSync(bundle_id)) {
fs.mkdirSync(bundle_id)
}
}
if (object.hasOwnProperty('bundle_id')) {
name = object['name'];
}
console.log(`包名:${bundle_id} 应用名:${name}`)
for (let index = 0; index < total; index++) {
gplay.reviews({
appId: bundle_id,
lang: 'kh',
page: index
})
.then((body) => {
if (body.length > 0) {
fs.writeFile(`${bundle_id}/${index}.json`, JSON.stringify(body), (err) => {
if (err) {
console.error(err);
return;
};
console.log(`成功写入文件 ${bundle_id}/${index}.json`);
});
} else {
console.log('无更多评论')
}
})
}
}
抓取后评论数据本地存档,如下:
➜ ls th.co.truemoney.wallet
0.json 35.json 61.json
1.json 36.json 62.json
10.json 37.json 63.json
11.json 38.json 64.json
12.json 39.json 65.json
13.json 4.json 66.json
14.json 40.json 67.json
15.json 41.json 68.json
16.json 42.json 69.json
17.json 43.json 7.json
18.json 44.json 70.json
19.json 45.json 71.json
2.json 46.json 72.json
20.json 47.json 73.json
21.json 48.json 74.json
...
将数据生成 Excel 并简单美化
for AppStore
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import xlsxwriter
import json
import os
from googletrans import Translator
import sys
sys.path.append("..")
from mtranslate.mtranslate import translate
# translator = Translator()
# print(translator.translate('មិចញុមវាយលេខចូលហើយមិចក៏មិនអោយ', dest='zh-CN').text)
arr = [{
'id': '1234143591',
'name': 'PiPay'
},
{
'id': '1328330562',
'name': 'AEON Wallet'
},
{
'id': '1113286385',
'name': 'Wing Money'
},
{
'id': '1162466939',
'name': 'True Money'
}]
def translateToChinese(text):
translator = Translator()
try:
return translator.translate(text, dest='zh-CN').text
except print(0):
return text
def main():
for item in arr:
appid = item['id']
name = item['name']
# appid = input("请输入应用id号:")
# name = input("请输入应用名称:")
workbook = xlsxwriter.Workbook(
str(appid) + '/' + name + '_comments.xlsx')
worksheet = workbook.add_worksheet()
format = workbook.add_format()
format.set_border(1)
format.set_border(1)
format_title = workbook.add_format()
format_title.set_border(1)
format_title.set_bg_color('#cccccc')
format_title.set_align('left')
format_title.set_bold()
title = ['作者', '标题', '评论内容', '版本', '评级', '投票']
# 设置单元格宽度
worksheet.set_column(0, 0, 30)
worksheet.set_column(1, 1, 40)
worksheet.set_column(2, 2, 100)
worksheet.set_column(3, 3, 10)
worksheet.set_column(4, 4, 10)
worksheet.set_column(5, 5, 10)
worksheet.write_row('A1', title, format_title)
total = 50
totalCount = 0
for n in range(total):
fileName = appid + '/' + str(n) + '.json'
if not os.path.exists(fileName):
continue
with open(fileName, 'r') as file:
result = json.load(file)
data_feed = result['feed']
entry = data_feed['entry']
for i in range(len(entry)):
value = entry[i]
fixedIndex = i + 1
startRow = totalCount + 1
worksheet.write(
startRow, 0, value['author']['name']['label'], format)
worksheet.write(startRow, 1, value['title']['label'], format)
worksheet.write(
startRow, 2, translate(value['content']['label'], 'zh-CN'), format)
# startRow, 2,value['content']['label'], format)
worksheet.write(
startRow, 3, value['im:version']['label'], format)
worksheet.write(
startRow, 4, value['im:rating']['label'], format)
worksheet.write(
startRow, 5, value['im:rating']['label'], format)
totalCount = totalCount + 1
workbook.close()
if __name__ == '__main__':
main()
for Google Play
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import xlsxwriter
import json
import os
from googletrans import Translator
import sys
sys.path.append("..")
from mtranslate.mtranslate import translate
# translator = Translator()
# print(translator.translate('មិចញុមវាយលេខចូលហើយមិចក៏មិនអោយ', dest='zh-CN').text)
# print(translate('មិចញុមវាយលេខចូលហើយមិចក៏មិនអោយ', 'zh-CN'))
arr = [{
'bundle_id': 'com.pipay.app.android',
'name': 'PiPay'
# },
# {
# 'bundle_id': 'com.aeon_cambodia.rielpay',
# 'name': 'AEON Wallet'
# },
# {
# 'bundle_id': 'com.wingmoney.wingpay',
# 'name': 'Wing Money'
# },
# {
# 'bundle_id': 'th.co.truemoney.wallet',
# 'name': 'True Money'
}]
def translateToChinese(text):
translator = Translator()
try:
return translator.translate(text, dest='zh-CN').text
except print(0):
return text
def main():
for item in arr:
bundle_id = item['bundle_id']
name = item['name']
# appid = input("请输入应用id号:")
# appName = input("请输入应用名称:")
workbook = xlsxwriter.Workbook(
bundle_id + '/' + name + '_comments.xlsx')
worksheet = workbook.add_worksheet()
format = workbook.add_format()
format.set_border(1)
format.set_border(1)
format_title = workbook.add_format()
format_title.set_border(1)
format_title.set_bg_color('#cccccc')
format_title.set_align('left')
format_title.set_bold()
title = ['作者', '标题', '评论内容', '日期', '评分', '头像']
# 设置单元格宽度
worksheet.set_column(0, 0, 30)
worksheet.set_column(1, 1, 30)
worksheet.set_column(2, 2, 100)
worksheet.set_column(3, 3, 20)
worksheet.set_column(4, 4, 10)
worksheet.set_column(5, 5, 120)
worksheet.write_row('A1', title, format_title)
total = 50
totalCount = 0
for n in range(total):
fileName = bundle_id + '/' + str(n) + '.json'
if not os.path.exists(fileName):
continue
with open(fileName, 'r') as file:
result = json.load(file)
entry = result
for i in range(len(entry)):
value = entry[i]
fixedIndex = i + 1
startRow = totalCount + 1
worksheet.write(
startRow, 0, value['userName'], format)
worksheet.write(
startRow, 1, value['title'], format)
worksheet.write(
startRow, 2, translate(value['text'], 'zh-CN'), format)
worksheet.write(
startRow, 3, value['date'], format)
worksheet.write(
startRow, 4, value['score'], format)
worksheet.write(
startRow, 5, value['userImage'], format)
totalCount = totalCount + 1
workbook.close()
if __name__ == '__main__':
pass
main()
Excuse me?
整理下数据发给产品,准备继续撸 Swift 搬砖了,产品又跟我说发现个大问题,几千条评论全是英语和高棉语,看不懂!!!好吧,给你翻译(谷歌翻译)。随便一 google,谷歌有免费的 python API 的开源库 googletrans,看看介绍
Googletrans is a free and unlimited python library that implemented Google Translate API. This uses the Google Translate Ajax API to make calls to such methods as detect and translate.
Compatible with Python 2.7+ and 3.4+. (Note: Python 2 support will be dropped in the next major release.)
For details refer to the API Documentation.
太棒了,还无限制,两三行代码就能使用,关键代码:
from googletrans import Translator
translator = Translator()
print(translator.translate('មិចញុមវាយលេខចូលហើយមិចក៏មិនអោយ', dest='zh-CN').text)
但数据量比较大,亲测使用该方式翻译速度较慢,于是找到 mtranslate ,其是对 googletrans
的一个封装,可以稍微加快翻译速度,最后使用该库完成翻译,代码就几行:
from mtranslate.mtranslate import translate
print(translate('មិចញុមវាយលេខចូលហើយមិចក៏មិនអោយ', 'zh-CN'))
PS:翻译库 JavaScript 版本的可以使用 google-translate-api