Ctrip Popular Scenic Area Review Scraping


Introduction

Recently, I took part in a competition that required scraping reviews and information for popular scenic spots in the provincial capitals of Yunnan, Guizhou, and Sichuan. I looked at various projects online and tried most of them, but found them too cumbersome to operate. Some required finding request parameters one by one, which did not meet my needs, so I decided to write my own. First, let's take a look at the results.

The scraped data is saved in Excel

Scraping in progress

After the scraper had run for some time, I had collected reviews from popular scenic spots in all cities of the three provinces, 280,000 data points in total. Not easy! 😭😭😭

Now, let me walk through how the scraper works 🚀🚀🚀

Note: The code shared here is not complete. For the complete code, see aglorice/CtripSpider (github.com): a Ctrip review scraper that uses thread pools to scrape popular scenic area reviews, simple and easy to use, with one-click scraping of all popular scenic spots in any province.

1. Analyze the Page

First, go to the Ctrip Strategy.Scenic Spots page and hover the cursor over Domestic (including Hong Kong, Macau, and Taiwan) to list the cities of almost every province. This is the source of our city data.

Open the browser console and locate the corresponding elements in the page, as shown below.

With this, let's write the code. Here, I use BeautifulSoup to parse the page.

    import sys

    from bs4 import BeautifulSoup

    # Method of the spider class. GET_HOME, TIME_OUT, get_fake_user_agent and
    # my_get_proxy are defined elsewhere in the project (see the repository).
    def get_areas(self) -> list:
        """Return [{"name": area title, "city": [{"name": ..., "url": ...}]}, ...]."""
        city_list = []
        try:
            res = self.sees.get(
                url=GET_HOME,
                headers={
                    "User-Agent": get_fake_user_agent("pc")
                },
                proxies=my_get_proxy(),
                timeout=TIME_OUT
            )
        except Exception as e:
            self.console.print(f"[red]Failed to get city scenic area information, {e}, you can check your network or proxy.", style="bold red")
            sys.exit(1)
        res_shop = BeautifulSoup(res.text, "lxml")
        areas = res_shop.find_all("div", attrs={"class": "city-selector-tab-main-city"})

        for area in areas:
            title_tag = area.find("div", attrs={"class": "city-selector-tab-main-city-title"})
            area_items = area.find_all("div", attrs={"class": "city-selector-tab-main-city-list"})
            # Skip blocks that lack a plain-text title or a city list.
            if title_tag is None or title_tag.string is None or not area_items:
                continue
            area_items_list = [{"name": item.string, "url": item["href"]} for item in area_items[0].find_all("a")]
            city_list.append({
                "name": title_tag.string,
                "city": area_items_list
            })
        return city_list

With this method, the city names and URLs of the specified provinces are collected and saved as city.json. Saving them first makes customization easy: you can freely add or remove cities in the file depending on what you want to scrape. The scraped result is shown in the screenshot.

Next, open one of these city URLs. As shown, the city's homepage displays its popular scenic spots or attractions:
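The save-and-filter step could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper names `save_city_list` and `load_cities` are hypothetical, and the only things taken from the article are the file name city.json and the list-of-areas structure returned by `get_areas`.

```python
import json
from pathlib import Path

# Hypothetical default location; the article only says the file is city.json.
CITY_FILE = Path("city.json")


def save_city_list(city_list: list, path: Path = CITY_FILE) -> None:
    """Write the structure returned by get_areas() to disk as UTF-8 JSON."""
    # ensure_ascii=False keeps Chinese city names readable in the file.
    text = json.dumps(city_list, ensure_ascii=False, indent=2)
    path.write_text(text, encoding="utf-8")


def load_cities(wanted_areas: set, path: Path = CITY_FILE) -> list:
    """Return a flat list of the cities that belong to the selected areas."""
    city_list = json.loads(path.read_text(encoding="utf-8"))
    return [
        city
        for area in city_list
        if area["name"] in wanted_areas
        for city in area["city"]
    ]
```

Calling `save_city_list(spider.get_areas())` once and then `load_cities({"Yunnan", "Guizhou", "Sichuan"})` on later runs would avoid re-requesting the home page; the area names here are placeholders and have to match whatever titles the page actually yields. Because the file is plain JSON, cities can also be added or removed by hand.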

The preliminary work is done; now let's scrape the reviews for the corresponding scenic spots.

2. Scenic Area Review Scraping

Open the reviews of a random scenic spot and check the requests in the console. As shown below:

First, let's analyze the parameters. After multiple attempts, we can identify which ones are dynamic. The first is _fxpcqlniredt; a look at the cookies quickly reveals that it is the value of the GUID cookie.

The second is x-traceID. By reverse engineering the JS, I found the code that generates it. As shown:

Now that we know how it is generated (the GUID, a timestamp, and a random six-digit number joined by hyphens), the rest is simple. Let's write the code.

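    import random
    import time

    def generate_scene_comments_params(self) -> dict:
        """
        Generate params for requesting scenic area reviews
        :return: dict with the two dynamic query parameters
        """
        # The GUID comes from the session cookies, so a Ctrip page must have
        # been requested with this session before calling this method.
        guid = self.sees.cookies.get("GUID")
        if guid is None:
            raise RuntimeError("GUID cookie is missing from the session")
        random_number = random.randint(100000, 999999)
        return {
            "_fxpcqlniredt": guid,
            # GUID, microsecond timestamp, and random number joined by "-"
            "x-traceID": f"{guid}-{int(time.time() * 1000000)}-{random_number}"
        }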

Actually, we are almost done here. The only remaining problem is the poiId parameter. It is present on every scenic area page, in a script tag, but making an additional request just to read it is too time-consuming. It would be ideal to request the data directly without entering the scenic spot page, so let's change our approach. On Ctrip's h5 page, the comment retrieval interface is different from the pc version, as shown:

On the mobile side, the poiId parameter is not needed. At this point, the core work is done. What remains is handling the issues that arise during scraping, the most important of which is Ctrip's anti-scraping measures: because I used a thread pool for faster scraping, the request rate was high enough to trigger them. To keep the scraping stable, I employed a random UA, a proxy pool, and various fault tolerance mechanisms. The result of accessing the interface too frequently is shown in the screenshot.
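The combination of a thread pool with fault tolerance might look roughly like this. It is a sketch under stated assumptions rather than the project's implementation: `fetch_with_retry` and `scrape_all` are hypothetical names, `fetch` stands for whatever function requests one scenic spot's reviews, and the retry count and delays are arbitrary illustrative values.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_with_retry(fetch, spot, max_retries=3, base_delay=0.5):
    """Call fetch(spot), pausing longer after each failed attempt."""
    for attempt in range(1, max_retries + 1):
        try:
            return fetch(spot)
        except Exception:
            if attempt == max_retries:
                raise  # give up; the caller records the failure
            # Random jitter keeps worker threads from retrying in lockstep.
            time.sleep(base_delay * attempt + random.uniform(0, base_delay))


def scrape_all(fetch, spots, max_workers=4, **retry_kwargs):
    """Run fetch over all spots in a thread pool.

    Returns (results, failed): results maps spot -> data, failed maps
    spot -> the exception raised on the last attempt.
    """
    results, failed = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(fetch_with_retry, fetch, spot, **retry_kwargs): spot
            for spot in spots
        }
        for future in as_completed(futures):
            spot = futures[future]
            try:
                results[spot] = future.result()
            except Exception as exc:
                failed[spot] = exc  # kept so the spot can be re-queued later
    return results, failed
```

Keeping `max_workers` small and collecting failures instead of aborting means one blocked request does not stop the whole run; the failed spots can be retried later with a fresh proxy or User-Agent.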

3. Solutions to Ctrip's Anti-Scraping Measures

The first countermeasure is a random User-Agent (UA). Previously, I used fake-useragent, but after switching to the h5 interface the UA must be a mobile one, which this library does not support. So I wrote a simple but practical generator by hand.

    # -*- coding: utf-8 -*-
    # @Time :2023/7/13 21:32
    # @Author :Xiao Yue
    # @Email  :[email protected]
    # @PROJECT_NAME :scenic_spots_comment
    # @File :  fake_user_agent.py
    # Requires Python 3.10+ (match statement).
    from fake_useragent import UserAgent
    import random
    from config import IS_FAKE_USER_AGENT


    def get_fake_user_agent(ua: str, default=True) -> str:
        """Return a UA for "mobile" or "pc": random if enabled, otherwise fixed."""
        match ua:
            case "mobile":
                if IS_FAKE_USER_AGENT and default:
                    return get_mobile_user_agent()
                else:
                    return "Mozilla/5.0 (Linux; Android 8.0.0; SM-G955U Build/R16NW) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Mobile Safari/537.36 Edg/114.0.0.0"
            case "pc":
                if IS_FAKE_USER_AGENT and default:
                    return UserAgent().random
                else:
                    # Note: this fixed fallback is itself an Android (mobile) UA string.
                    return "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Mobile Safari/537.36 Edg/103.0.1264.49"
            case _:
                # Fail loudly instead of silently returning None.
                raise ValueError(f"unknown user agent type: {ua!r}")


    def get_mobile_user_agent() -> str:
        """Build a random mobile UA from a random platform and browser."""
        platforms = [
            'iPhone; CPU iPhone OS 14_6 like Mac OS X',
            'Linux; Android 11.0.0; Pixel 5 Build/RD1A.201105.003',
            'Linux; Android 8.0.0; Pixel 5 Build/RD1A.201105.003',
            'iPad; CPU OS 14_6 like Mac OS X',
            'iPad; CPU OS 15_6 like Mac OS X',
            'Linux; U; Android 9; en-us; SM-G960U Build/PPR1.180610.011',  # Samsung Galaxy S9
            'Linux; U; Android 10; en-us; SM-G975U Build/QP1A.190711.020',  # Samsung Galaxy S10
            'Linux; U; Android 11; en-us; SM-G998U Build/RP1A.200720.012',  # Samsung Galaxy S21 Ultra
            'Linux; U; Android 9; en-us; Mi A3 Build/PKQ1.180904.001',  # Xiaomi Mi A3
            'Linux; U; Android 10; en-us; Mi 10T Pro Build/QKQ1.200419.002',  # Xiaomi Mi 10T Pro
            'Linux; U; Android 11; en-us; LG-MG870 Build/RQ1A.210205.004',  # LG Velvet
            'Linux; U; Android 11; en-us; ASUS_I003D Build/RKQ1.200826.002',  # Asus ROG Phone 3
            'Linux; U; Android 10; en-us; CLT-L29 Build/10.0.1.161',  # Huawei P30 Pro
        ]

        browsers = [
            'Chrome',
            'Firefox',
            'Safari',
            'Opera',
            'Edge',
            'UCBrowser',
            'SamsungBrowser'
        ]

        platform = random.choice(platforms)
        browser = random.choice(browsers)

        # The stray "#" before each "{...}" in the version strings was removed:
        # in a Python f-string it is emitted literally (e.g. "90.0.#1234.#56").
        match browser:
            case 'Chrome':
                version = random.randint(70, 90)
                return f'Mozilla/5.0 ({platform}) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/{version}.0.{random.randint(1000, 9999)}.{random.randint(10, 99)} Mobile Safari/537.36'

            case 'Firefox':
                version = random.randint(60, 80)
                return f'Mozilla/5.0 ({platform}; rv:{version}.0) Gecko/20100101 Firefox/{version}.0'

            case 'Safari':
                version = random.randint(10, 14)
                return f'Mozilla/5.0 ({platform}) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/{version}.0 Safari/605.1.15'

            case 'Opera':
                version = random.randint(60, 80)
                return f'Mozilla/5.0 ({platform}) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/{version}.0.{random.randint(1000, 9999)}.{random.randint(10, 99)} Mobile Safari/537.36 OPR/{version}.0'

            case 'Edge':
                version = random.randint(80, 90)
                return f'Mozilla/5.0 ({platform}) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/{version}.0.{random.randint(1000, 9999)}.{random.randint(10, 99)} Mobile Safari/537.36 Edg/{version}.0'

            case 'UCBrowser':
                version = random.randint(12, 15)
                return f'Mozilla/5.0 ({platform}) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.26 UBrowser/{version}.1.2.49 Mobile Safari/537.36'

            case 'SamsungBrowser':
                version = random.randint(10, 14)
                return f'Mozilla/5.0 ({platform}) AppleWebKit/537.36 (KHTML, like Gecko) SamsungBrowser/{version}.0 Chrome/63.0.3239.26 Mobile Safari/537.36'

The remaining piece is the proxy pool, for which I use the open-source project jhao104/proxy_pool: Python Scraper Proxy IP Pool (github.com).
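A helper such as the `my_get_proxy` called in `get_areas` could be written along these lines. This is only a guess at its shape, not the project's code. It assumes proxy_pool is running locally at its default address (http://127.0.0.1:5010) and that its `/get/` endpoint returns JSON with a `proxy` field of the form `ip:port`, as described in that project's README. The injectable `fetch_json` parameter is an addition for testability.

```python
import json
from urllib.request import urlopen

# Assumption: proxy_pool is running locally on its default host and port.
PROXY_POOL_URL = "http://127.0.0.1:5010"


def _fetch_json(url: str, timeout: float = 5.0) -> dict:
    """GET a URL and decode the JSON body."""
    with urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def my_get_proxy(fetch_json=_fetch_json):
    """Return a requests-style proxies dict, or None if no proxy is available."""
    try:
        data = fetch_json(f"{PROXY_POOL_URL}/get/")
    except (OSError, ValueError):
        # Pool unreachable or reply not valid JSON: use a direct connection.
        return None
    proxy = data.get("proxy")
    if not proxy:
        return None  # the pool is empty
    return {"http": f"http://{proxy}", "https": f"http://{proxy}"}
```

Returning `None` when the pool is empty or unreachable lets the caller pass the result straight to `proxies=` in requests, which then falls back to a direct connection.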

With this, we are almost done. 👀👀👀

4. Conclusion

The main lesson from scraping Ctrip is that when you hit a problem, it helps to broaden your thinking and try several approaches, as switching from the pc interface to the h5 interface did here.

Project address: aglorice/CtripSpider (github.com): a Ctrip review scraper that uses thread pools to scrape popular scenic area reviews, simple and easy to use, with one-click scraping of all popular scenic spots in any province.