抓取 App Store和 Google Play 应用评论并写入 Excel

这几天开始在利用业余时间用 Swift 5 重写公司项目,继上一次用 Swift 写项目隔了有两年了吧(中间陆续在学习),Swift 变得更加好用了,真是愈发嫌弃 OC,来新公司把所负责的 App 从规范到基础服务再到业务里外优化了一番,算是有所成果了。新项目到写得正 high,组长丢给我一个任务,产品经理需要竞品在 App Store 和 Google Play 的评论数据用于调研,简单实现了一番,趁热打铁,记录一下结果,以便日后需要。

想到第一方案

在未做调研前,想到的第一方案是利用 Appium + mitmproxy + python 实现,想怎么抓就怎么抓,唰唰唰先代理看看,结果发现 App Store 启用网络代理后无法联网,原因是 SSL Pinning,然后就解决这个问题,找到 ssl-kill-switch2,可以解决这个问题,在越狱手机上安装了 deb 包,关闭 itunesstored 服务,发现并不生效:

dpkg -i com.nablac0d3.sslkillswitch2_0.13.deb
ps -ef | grep itunesstored
kill -s KILL 1170

Issues 确实在 iOS 11.4.1 上无法生效,可惜我的越狱设备正好是这个版本,该方案夭折。

调研第三方服务商

既然一方案夭折,想想该需求市场上肯定有数据服务商,于是谷歌搜索了一 ha,果然前面几位都是“科学家”网站,进去看看,可以看到有广告 产品舆情 - WeTest腾讯质量开放平台七麦数据 的,前者没去看,简单看了下后者,注册账号,两个问题

- 不能查看非中国区评论
- 不能自定义生成的 Excel 选项

于是,该方案 pass

App Store 从官方渠道获取最近500条

考虑到多款竞品在 App Store 对应国区的评论数均不足 500条,于是使用苹果官方提供的 API

https://itunes.apple.com/rss/customerreviews/page=1/id=414478124/sortby=mostrecent/json?l=en&&cc=cn

page 和 id 自行更改,该 API最多可以获取一个App的最近500条评论(每页最多50,最多10页),该方案顺利抓取数据,代码:

#!/usr/bin/python3
# -*- coding: utf-8 -*-

import requests
import urllib.request
import re
import xlsxwriter
import json
import os


def getHTMLText(url):
    response = urllib.request.urlopen(url)
    myjson = json.loads(response.read().decode())

    return myjson

    # try:
    #     r = requests.get(url)
    #     r.raise_for_status()
    #     r.encoding = r.apparent_encoding
    #     return r.text
    # except:
    #     return ''


def main():

    appid = input("请输入应用id号:")
    appName = input("请输入应用名称:")

    if not os.path.exists(appid):
        os.system('mkdir ' + appid)

    workbook = xlsxwriter.Workbook(appid + '/' + appName + '_comments.xlsx')
    worksheet = workbook.add_worksheet()
    format = workbook.add_format()
    format.set_border(1)
    format.set_border(1)
    format_title = workbook.add_format()
    format_title.set_border(1)
    format_title.set_bg_color('#cccccc')
    format_title.set_align('left')
    format_title.set_bold()
    title = ['作者', '标题', '评论内容', '版本', '评级', '投票']

    # 设置单元格宽度
    worksheet.set_column(0, 0, 30)
    worksheet.set_column(1, 1, 40)
    worksheet.set_column(2, 2, 100)
    worksheet.set_column(3, 3, 10)
    worksheet.set_column(4, 4, 10)
    worksheet.set_column(5, 5, 10)

    worksheet.write_row('A1', title, format_title)

    count = 0

    total = 10
    totalCount = 0
    for n in range(total):
        url = 'https://itunes.apple.com/rss/customerreviews/page=' + \
            str(n+1) + '/id=' + str(appid) + \
            '/sortby=mostrecent/json?l=en&&cc=kh'

        print('当前地址:' + url)

        jsonText = getHTMLText(url)

        fileName = appid + '/' + str(n+1) + '.json'

        data_feed = jsonText['feed']
        entry = data_feed['entry']
        for i in range(len(entry)):
            value = entry[i]
            fixedIndex = i + 1
            startRow = totalCount + 1
            worksheet.write(
                startRow, 0, value['author']['name']['label'], format)
            worksheet.write(startRow, 1, value['title']['label'], format)
            worksheet.write(startRow, 2, value['content']['label'], format)
            worksheet.write(
                startRow, 3, value['im:version']['label'], format)
            worksheet.write(
                startRow, 4, value['im:rating']['label'], format)
            worksheet.write(
                startRow, 5, value['im:rating']['label'], format)
            totalCount = totalCount + 1

        with open(fileName, 'w') as file:
            file.write(json.dumps(jsonText, sort_keys=True,
                                  indent=4, ensure_ascii=False))

        count = count + 1
        print("当前进度: {:.2f}%".format(count * 100 / total), end="\n\n")

    workbook.close()


if __name__ == '__main__':
    main()

Google Play App评论获取

谷歌相比苹果而言这方面更加开放,App 数据获取 API 次数不限,而且有 现成的 npm 库,笔芯,感谢开源世界。新建个 node 项目,三下五除二实验了一下,库高效可用,于是开始简单分析数据编写代码,产出如下:

'use strict';

var gplay = require('google-play-scraper');
var fs = require("fs");

// const readline = require('readline').createInterface({
//   input: process.stdin,
//   output: process.stdout
// })

// readline.question(`请输入 apk 包名:`, (inputText) => {
//   bundle_id = inputText
//   console.log(`包名: ${bundle_id}!`)

//   readline.close()

//   fs.mkdirSync(bundle_id)

//   gplay.reviews({
//     appId: bundle_id,
//     lang: 'kh'
//   })
//     .then((body) => {
//       fs.writeFile(`${bundle_id}/comments.json`, JSON.stringify(body), (err) => {
//         if (err) {
//           console.error(err);
//           return;
//         };
//         console.log("File has been created");
//       });
//     })
// })

var arr = [
  {
  'bundle_id': 'com.pipay.app.android',
  'name': 'PiPay'
},
// {
//   'bundle_id': 'com.aeon_cambodia.rielpay',
//   'name': 'AEON Wallet'
// },
// {
//   'bundle_id': 'com.wingmoney.wingpay',
//   'name': 'Wing Money'
// },
// {
//   'bundle_id': 'th.co.truemoney.wallet',
//   'name': 'True Money'
// }
]

const total = 50

for (let i = 0; i < arr.length; i++) {
  const object = arr[i]
  var bundle_id, name
  if (object.hasOwnProperty('bundle_id')) {
    bundle_id = object['bundle_id'];

    if (!fs.existsSync(bundle_id)) {
      fs.mkdirSync(bundle_id)
    }
  }
  if (object.hasOwnProperty('bundle_id')) {
    name = object['name'];
  }

  console.log(`包名:${bundle_id}  应用名:${name}`)

  for (let index = 0; index < total; index++) {
    gplay.reviews({
      appId: bundle_id,
      lang: 'kh',
      page: index
    })
      .then((body) => {
        if (body.length > 0) {
          fs.writeFile(`${bundle_id}/${index}.json`, JSON.stringify(body), (err) => {
            if (err) {
              console.error(err);
              return;
            };
            console.log(`成功写入文件 ${bundle_id}/${index}.json`);
          });
        } else {
          console.log('无更多评论')
        }
      })
  }
}

抓取后评论数据本地存档,如下:

➜ ls th.co.truemoney.wallet
0.json                          35.json                         61.json
1.json                          36.json                         62.json
10.json                         37.json                         63.json
11.json                         38.json                         64.json
12.json                         39.json                         65.json
13.json                         4.json                          66.json
14.json                         40.json                         67.json
15.json                         41.json                         68.json
16.json                         42.json                         69.json
17.json                         43.json                         7.json
18.json                         44.json                         70.json
19.json                         45.json                         71.json
2.json                          46.json                         72.json
20.json                         47.json                         73.json
21.json                         48.json                         74.json
...

将数据生成 Excel 并简单美化

for AppStore

#!/usr/bin/python3
# -*- coding: utf-8 -*-

import xlsxwriter
import json
import os
from googletrans import Translator
import sys
sys.path.append("..")
from mtranslate.mtranslate import translate

# translator = Translator()
# print(translator.translate('មិចញុមវាយលេខចូលហើយមិចក៏មិនអោយ', dest='zh-CN').text)

arr = [{
    'id': '1234143591',
    'name': 'PiPay'
    },
        {
    'id': '1328330562',
    'name': 'AEON Wallet'
    },
        {
    'id': '1113286385',
    'name': 'Wing Money'
    },
        {
    'id': '1162466939',
        'name': 'True Money'
}]


def translateToChinese(text):
    translator = Translator()

    try:
        return translator.translate(text, dest='zh-CN').text
    except print(0):
        return text


def main():

    for item in arr:
        appid = item['id']
        name = item['name']

        # appid = input("请输入应用id号:")
        # name = input("请输入应用名称:")

        workbook = xlsxwriter.Workbook(
            str(appid) + '/' + name + '_comments.xlsx')
        worksheet = workbook.add_worksheet()
        format = workbook.add_format()
        format.set_border(1)
        format.set_border(1)
        format_title = workbook.add_format()
        format_title.set_border(1)
        format_title.set_bg_color('#cccccc')
        format_title.set_align('left')
        format_title.set_bold()
        title = ['作者', '标题', '评论内容', '版本', '评级', '投票']

        # 设置单元格宽度
        worksheet.set_column(0, 0, 30)
        worksheet.set_column(1, 1, 40)
        worksheet.set_column(2, 2, 100)
        worksheet.set_column(3, 3, 10)
        worksheet.set_column(4, 4, 10)
        worksheet.set_column(5, 5, 10)

        worksheet.write_row('A1', title, format_title)

        total = 50
        totalCount = 0
        for n in range(total):
            fileName = appid + '/' + str(n) + '.json'
            if not os.path.exists(fileName):
                continue

            with open(fileName, 'r') as file:
                result = json.load(file)
                data_feed = result['feed']
                entry = data_feed['entry']
                for i in range(len(entry)):
                    value = entry[i]
                    fixedIndex = i + 1
                    startRow = totalCount + 1
                    worksheet.write(
                        startRow, 0, value['author']['name']['label'], format)
                    worksheet.write(startRow, 1, value['title']['label'], format)
                    worksheet.write(
                        startRow, 2, translate(value['content']['label'], 'zh-CN'), format)
                        # startRow, 2,value['content']['label'], format)
                    worksheet.write(
                        startRow, 3, value['im:version']['label'], format)
                    worksheet.write(
                        startRow, 4, value['im:rating']['label'], format)
                    worksheet.write(
                        startRow, 5, value['im:rating']['label'], format)
                    totalCount = totalCount + 1

        workbook.close()


if __name__ == '__main__':
    main()

for Google Play

#!/usr/bin/python3
# -*- coding: utf-8 -*-

import xlsxwriter
import json
import os
from googletrans import Translator
import sys
sys.path.append("..")
from mtranslate.mtranslate import translate

# translator = Translator()
# print(translator.translate('មិចញុមវាយលេខចូលហើយមិចក៏មិនអោយ', dest='zh-CN').text)
# print(translate('មិចញុមវាយលេខចូលហើយមិចក៏មិនអោយ', 'zh-CN'))

arr = [{
    'bundle_id': 'com.pipay.app.android',
    'name': 'PiPay'
# },
#     {
    # 'bundle_id': 'com.aeon_cambodia.rielpay',
    # 'name': 'AEON Wallet'
# },
#     {
    # 'bundle_id': 'com.wingmoney.wingpay',
    # 'name': 'Wing Money'
# },
#     {
    # 'bundle_id': 'th.co.truemoney.wallet',
    # 'name': 'True Money'
}]

def translateToChinese(text):
    translator = Translator()

    try:
        return translator.translate(text, dest='zh-CN').text
    except print(0):
        return text


def main():

    for item in arr:
        bundle_id = item['bundle_id']
        name = item['name']

    # appid = input("请输入应用id号:")
    # appName = input("请输入应用名称:")

        workbook = xlsxwriter.Workbook(
            bundle_id + '/' + name + '_comments.xlsx')
        worksheet = workbook.add_worksheet()
        format = workbook.add_format()
        format.set_border(1)
        format.set_border(1)
        format_title = workbook.add_format()
        format_title.set_border(1)
        format_title.set_bg_color('#cccccc')
        format_title.set_align('left')
        format_title.set_bold()
        title = ['作者', '标题', '评论内容', '日期', '评分', '头像']

        # 设置单元格宽度
        worksheet.set_column(0, 0, 30)
        worksheet.set_column(1, 1, 30)
        worksheet.set_column(2, 2, 100)
        worksheet.set_column(3, 3, 20)
        worksheet.set_column(4, 4, 10)
        worksheet.set_column(5, 5, 120)

        worksheet.write_row('A1', title, format_title)

        total = 50
        totalCount = 0
        for n in range(total):
            fileName = bundle_id + '/' + str(n) + '.json'
            if not os.path.exists(fileName):
                continue

            with open(fileName, 'r') as file:
                result = json.load(file)
                entry = result
                for i in range(len(entry)):
                    value = entry[i]
                    fixedIndex = i + 1
                    startRow = totalCount + 1
                    worksheet.write(
                        startRow, 0, value['userName'], format)
                    worksheet.write(
                        startRow, 1, value['title'], format)
                    worksheet.write(
                        startRow, 2, translate(value['text'], 'zh-CN'), format)
                    worksheet.write(
                        startRow, 3, value['date'], format)
                    worksheet.write(
                        startRow, 4, value['score'], format)
                    worksheet.write(
                        startRow, 5, value['userImage'], format)
                    totalCount = totalCount + 1

        workbook.close()

if __name__ == '__main__':
    pass
    main()

Excuse me?

整理下数据发给产品,准备继续撸 Swift 搬砖了,产品又跟我说发现个大问题,几千条评论全是英语和高棉语,看不懂!!!好吧,给你翻译(谷歌翻译)。随便一 google,谷歌有免费的 python API 的开源库 googletrans,看看介绍

Googletrans is a free and unlimited python library that implemented Google Translate API. This uses the Google Translate Ajax API to make calls to such methods as detect and translate.
Compatible with Python 2.7+ and 3.4+. (Note: Python 2 support will be dropped in the next major release.)
For details refer to the API Documentation.

太棒了,还无限制,两三行代码就能使用,关键代码:

from googletrans import Translator

translator = Translator()
print(translator.translate('មិចញុមវាយលេខចូលហើយមិចក៏មិនអោយ', dest='zh-CN').text)

但数据量比较大,亲测使用该方式翻译速度较慢,于是找到 mtranslate ,其是对 googletrans 的一个封装,可以稍微加快翻译速度,最后使用该库完成翻译,代码就几行:

from mtranslate.mtranslate import translate

print(translate('មិចញុមវាយលេខចូលហើយមិចក៏មិនអោយ', 'zh-CN'))

PS:翻译库 JavaScript 版本的可以使用 google-translate-api

效果图

完整 Demo 地址

附上 github 完整项目地址