Web scraping Google Play store reviews

So, what is the easiest way to scrape reviews for an app in the Google Play store?

Figure 1: Screenshot of app reviews in Google Play store.

By the end of this article, you will be able to extract these reviews into a CSV file like the one shown in figure 2.

Figure 2: Screenshot of app reviews in Google Play store extracted as CSV file.

Option 1: Hire a fully managed web scraping service.

You can contact us for our fully managed web scraping service to get Google Play store app reviews data as a CSV or Excel file without writing any code.

Our pricing starts at $99 for fully managed Google Play store scraping.

You can simply sit back and let us handle all the complexities of scraping a site like Google, which has plenty of built-in anti-scraping protections to dissuade people from scraping it in bulk.

We can also create a REST API endpoint for you if you want structured data on demand.

Option 2: Scrape Google Play store on your own

We will use a browser automation library called Selenium to extract reviews for a particular app in the Play store.

Selenium has bindings for all major programming languages, so you can use whichever language you like; we will use Python here.

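# Using Selenium to extract Google Play store reviews

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

test_url = 'https://play.google.com/store/apps/details?id=itsolutionever.karanponda.dmbi'

# Open Chrome in incognito mode so existing cookies or logins do not affect the page
option = webdriver.ChromeOptions()
option.add_argument("--incognito")

# Path to the ChromeDriver executable; adjust it for your system.
# Selenium 4 takes the driver path through a Service object.
chromedriver = r'chromedriver.exe'
browser = webdriver.Chrome(service=Service(executable_path=chromedriver), options=option)

# Load the app page and keep a copy of the rendered HTML
browser.get(test_url)
html_source = browser.page_source
browser.quit()  # end the session and close the browser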

Using BeautifulSoup to extract Google Play reviews

Once we have the raw HTML source, we can parse it with a Python library called BeautifulSoup.

Extracting review author

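From inspecting the HTML source, we see that review authors are in span tags with the class ‘X43Kjb’. Class names like this are tied to the page markup at the time of writing, so check them against the live page before relying on them.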

# extracting authors

soup = BeautifulSoup(html_source, "html.parser")

# find every <span> whose class is 'X43Kjb' (the review author name)
review_author_list_src = soup.find_all('span', {'class': 'X43Kjb'})

# keep only the visible text of each tag
review_author_name_list = [val.get_text() for val in review_author_list_src]

review_author_name_list[:3]
# Output
['Baraka Mark Bright', 'Raj Kamdiya', 'Incognito Inventions']

Extracting review date

The next step is to extract the date of each review; these are in span tags with the class ‘p2TkOb’.

# extracting review dates

# find every <span> whose class is 'p2TkOb' (the review date)
date_src = soup.find_all('span', {'class': 'p2TkOb'})

date_list = [val.get_text() for val in date_src]

date_list[:3]
# Output
['September 24, 2019', 'March 25, 2019', 'April 9, 2019']
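
The dates come back as display strings. If you plan to sort or filter reviews by date later, it can help to normalize them first. Below is a minimal sketch, assuming the page is served in English so that the month names match Python's default locale. `to_iso_date` is a hypothetical helper name, not part of any library.

```python
from datetime import datetime

def to_iso_date(display_date):
    """Convert a date like 'September 24, 2019' to '2019-09-24'."""
    # %B = full month name, %d = day of month, %Y = four-digit year
    return datetime.strptime(display_date, "%B %d, %Y").date().isoformat()

sample_dates = ['September 24, 2019', 'March 25, 2019', 'April 9, 2019']
iso_dates = [to_iso_date(d) for d in sample_dates]
```

ISO dates sort correctly as plain strings, which is convenient once the data is in a CSV file.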

Extracting review contents

The review text sits in div tags with the class ‘UD7Dzf’. For brevity, we will only show the first three results; you can verify that the first one matches the text in figure 2 above.

# extracting review content

# find every <div> whose class is 'UD7Dzf' (the review text)
review_content_src = soup.find_all('div', {'class': 'UD7Dzf'})

review_content_list = [val.get_text() for val in review_content_src]

review_content_list[:3]
# Output
[" Everything is perfect. The UI and the content itself are all fantastic. Thanks so much. This app deserves a five 🌟 but I can't give you 100%",
 ' This is really useful app for me to learn all data mining concepts as well as data warehousing concept in this step by step explanation of all data mining concept with well n good figures.',
 ' this application is really useful for me to learn data mining tutorial with well n good examples and covered all data mining concepts really useful for me.']

Converting into CSV file

You can take the lists above and load them into a pandas DataFrame. Once you have the DataFrame, you can easily export it to CSV, Excel, or JSON.
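
As a dependency-free illustration of the same step, here is a minimal sketch that uses Python's built-in `csv` module instead of pandas. The function name `write_reviews_csv` and the column names are assumptions made for this example.

```python
import csv

def write_reviews_csv(path, authors, dates, contents):
    """Write one row per review to a CSV file and return the row count."""
    # zip() stops at the shortest list, so a field missing on the page
    # shortens the output instead of raising; compare lengths if that matters.
    rows = list(zip(authors, dates, contents))
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["author", "date", "review"])  # header row
        writer.writerows(rows)
    return len(rows)
```

Calling `write_reviews_csv('reviews.csv', review_author_name_list, date_list, review_content_list)` would write the file. With pandas, building a DataFrame from a dict of the three lists and calling `to_csv('reviews.csv', index=False)` produces an equivalent file, but pandas expects the lists to be the same length.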

Scaling up to a full crawler for extracting all Google Play reviews of an app

Pagination

To fetch all the reviews, you will have to paginate through the results.
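
One way to approach this with Selenium is sketched below, assuming the review page loads more reviews as you scroll down. The real page may behave differently, for example by also requiring a click on a "Show more" button. `scroll_until_stable` is a hypothetical helper, `driver` is the Selenium browser object, and `execute_script` is its standard method for running JavaScript in the page.

```python
import time

def scroll_until_stable(driver, pause=2.0, max_rounds=50):
    """Scroll to the bottom repeatedly until the page stops growing.

    Returns the number of scrolls that made the page taller.
    """
    grew = 0
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load the next batch
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so we have reached the end
        last_height = new_height
        grew += 1
    return grew
```

Call it after `browser.get(test_url)` and before reading `browser.page_source`, so that the saved HTML contains every review that was loaded. The `max_rounds` cap keeps the loop from running forever on a very long page.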

Implementing anti-CAPTCHA measures

After a few dozen requests, Google's servers will start blocking your IP address outright, or you will be flagged and start getting CAPTCHAs.
To fetch data successfully, you will have to:
- rotate proxy IP addresses, preferably using residential proxies
- rotate user agents
- use an external CAPTCHA solving service like 2captcha or anticaptcha.com
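
The first two items can be sketched as follows. This is a minimal illustration, not a complete anti-blocking setup: the proxy addresses and user-agent strings are placeholder values, `chrome_argument_sets` is a hypothetical helper, and CAPTCHA solving is not covered. `--proxy-server` and `--user-agent` are Chrome command-line switches.

```python
from itertools import cycle

# Hypothetical values for illustration; substitute your own proxy endpoints
# and user-agent strings.
PROXIES = ["203.0.113.10:8080", "203.0.113.11:8080"]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/2.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15) ExampleBrowser/3.0",
]

def chrome_argument_sets(proxies, user_agents):
    """Yield an endless sequence of Chrome argument lists, one per session."""
    # cycle() restarts each list when it runs out, so the two lists
    # rotate independently and can have different lengths.
    for proxy, agent in zip(cycle(proxies), cycle(user_agents)):
        yield [f"--proxy-server={proxy}", f"--user-agent={agent}"]

rotation = chrome_argument_sets(PROXIES, USER_AGENTS)
first_args = next(rotation)
second_args = next(rotation)
```

For each new browser session, take the next list from `rotation` and pass every entry to `option.add_argument(...)` before creating the Chrome driver.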

After you follow all the steps above, you will realize that our pricing for managed web scraping is one of the most competitive in the market.