TripAdvisor is the largest travel website in the world, receiving more than 200 million visitors per month and maintaining a database of more than 9 million reviews. The site has over 400 employees and is worth $1 billion. To provide such extensive information on hotels worldwide, TripAdvisor uses data scraping to import reviews from its website into a central database, which already holds over 8 million entries.
This post covers how data scraping works with TripAdvisor, specifically for travel and hotel reviews, and briefly explains what it does for the company. It then discusses how you can scrape websites for review information to use in your own application.
Data scraping is the process of programmatically extracting data from a website. You can scrape the web in various ways, each with its own advantages and disadvantages. For example, DOM traversal finds specific elements on a page and parses them into an array. Another way is to call the same Ajax endpoints the page itself uses, which returns the underlying data without any need to parse HTML or CSS. Data can be extracted from any part of the page: the title, meta description, links, text in forms, and more.
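To make the DOM traversal idea concrete, here is a minimal sketch that uses Python's standard-library html.parser module. The class name PageSummaryParser and the sample page are made up for illustration; a real page will have messier markup.

```python
from html.parser import HTMLParser


class PageSummaryParser(HTMLParser):
    """Collects the title, meta description, and link targets of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.links = []
        self._in_title = False  # True while we are inside <title>...</title>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") == "description":
            self.description = attrs.get("content", "")
        elif tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Made-up sample page (an assumption for this example, not real site markup).
SAMPLE_HTML = """
<html><head>
<title>Example Hotel</title>
<meta name="description" content="A made-up hotel page.">
</head><body>
<a href="/reviews">Reviews</a>
<a href="/photos">Photos</a>
</body></html>
"""

parser = PageSummaryParser()
parser.feed(SAMPLE_HTML)
print(parser.title)
print(parser.description)
print(parser.links)
```

The parser walks the page tag by tag and keeps only the elements it was told to look for, which is the essence of DOM-style scraping.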
Data scraping is used in a wide variety of ways but can be generally broken down into the following categories:
This kind of scraping provides valuable data to web admins who want fresh content on their pages without creating it themselves. The content might be news articles found through relevant keyword searches, or recent tweets. The data extracted from the web can then feed sentiment analysis and keyword extraction.
Content is often updated at specific times by the website owner. Data scraping can mine the HTML for time-based information and store it in a schedule. That schedule can trigger events in your application at specific times (e.g., an email reminder), or prompt a request as soon as fresh content becomes available, instead of polling at regular intervals that ignore the site's schedule.
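One way to picture this is a page that marks its update times with HTML time elements. The sketch below assumes exactly that (many sites expose dates differently), and the class name TimeTagParser and the sample markup are hypothetical.

```python
from datetime import datetime
from html.parser import HTMLParser


class TimeTagParser(HTMLParser):
    """Collects (timestamp, label) pairs from <time datetime="..."> elements."""

    def __init__(self):
        super().__init__()
        self.schedule = []
        self._pending = None  # timestamp of the <time> element being read
        self._label = ""

    def handle_starttag(self, tag, attrs):
        if tag == "time":
            value = dict(attrs).get("datetime")
            if value:
                self._pending = datetime.fromisoformat(value)
                self._label = ""

    def handle_data(self, data):
        if self._pending is not None:
            self._label += data

    def handle_endtag(self, tag):
        if tag == "time" and self._pending is not None:
            self.schedule.append((self._pending, self._label.strip()))
            self._pending = None


# Made-up markup; the labels and dates are illustrative only.
SAMPLE_HTML = """
<ul>
  <li><time datetime="2024-03-05T09:00:00">Spring rates published</time></li>
  <li><time datetime="2024-03-01T18:30:00">Weekly deals updated</time></li>
</ul>
"""

parser = TimeTagParser()
parser.feed(SAMPLE_HTML)

# Sort by timestamp so the earliest event comes first.
schedule = sorted(parser.schedule)
for when, label in schedule:
    print(when.isoformat(), label)
```

Your application could then loop over the sorted schedule and fire a reminder or a fresh request when each timestamp comes due.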
It's sometimes necessary to snag email addresses from a web page directly. Data scraping can do this by processing the HTML and locating the email fields in forms. You can then import a newsletter signup list or update an in-app mailing list with the latest information about your service, which is useful for sending push notifications later.
TripAdvisor uses data scraping to process all of its reviews, one of the largest collections of related content on the Internet, with over 8 million entries at the last count. The review data is available for all of its hotels, and making sure the reviews are legitimate and recent is the company's top priority.
Reviews are collected either through automated scraping or manually by an employee. Before a review is added to the collection, it must either be written in English or pass a human check against what has already been scraped. Some reviews can be accessed only by specially verified accounts, and others require creating an account, so there are plenty of hoops to jump through before you can get your hands on them.
There is no automated process that combines all the reviews into a single page, which makes finding information about specific hotels challenging and time-consuming. Instead, you have to search for each hotel individually and then go through its reviews one by one to find the ones you need. It is an absolute nightmare when you're trying to browse hundreds of hotels at once!
Each hotel has its own page, which usually has all the information you want to find, including pictures and pricing. In the hotel's account settings, you can view your facility, corporate profile, and detailed information about your specific property. You can then find the reviews for that particular property by clicking "Hotel Reviews" under the "Reviews" tab.
Server-side processing gathers all the reviews and queues them up by the date they were last updated or created. A separate process then exports them into a data collection XML file (in a similar format to iTunes) that can be easily imported into a database or spreadsheet program for analysis. The figure below shows a sample data scraper in Python that pulls XML files from TripAdvisor's website and then saves them to the local filesystem, giving you access to all the information you need.
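As a rough illustration of the queue-and-export step described above (not TripAdvisor's actual pipeline or file format), the sketch below sorts a few made-up review records by their last-updated date and serializes them to XML with the standard library. The record fields, element names, and the export_reviews helper are all assumptions.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Hypothetical review records; a real scraper would fill these in.
reviews = [
    {"hotel": "Example Inn", "updated": date(2024, 2, 10), "rating": 4, "text": "Clean rooms."},
    {"hotel": "Example Inn", "updated": date(2024, 1, 5), "rating": 2, "text": "Noisy street."},
    {"hotel": "Sample Lodge", "updated": date(2024, 3, 1), "rating": 5, "text": "Great breakfast."},
]


def export_reviews(records):
    """Return an XML string with reviews ordered oldest-updated first."""
    root = ET.Element("reviews")
    for record in sorted(records, key=lambda r: r["updated"]):
        node = ET.SubElement(
            root,
            "review",
            hotel=record["hotel"],
            updated=record["updated"].isoformat(),
            rating=str(record["rating"]),
        )
        node.text = record["text"]
    return ET.tostring(root, encoding="unicode")


xml_text = export_reviews(reviews)
print(xml_text)
```

Writing xml_text to a file gives you something a database or spreadsheet import can pick up.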
There are many things you can do with this kind of data, but there are a couple of clear application areas that you should care about:
Give the correct information to your customers when they get to your website. It is essential to let them know whether the hotel is available and if they need to make an advance payment. It also gives them useful information, like how long they should expect to wait in line at check-in or how many hours of their vacation might be wasted if they drive across town and check in without finding out that there's no parking available.
Delivering the correct information at the right time improves your customers' experience of your service. For example, you could use data scraping to send out emails about new releases of your software or special deals that you're running. This can be done by emailing the recipients through a platform like Mailchimp or ActiveCampaign (as shown in the figure below). You can also take it one step further by sending out SMS text messages and voice calls to more specific groups of people (like those on a separate list).
You can also use data scraping to make new content available to your customers, like daily blog posts (topic-specific email blasts) and newsletter signups that get delivered automatically. This can be done through an in-house or external service such as Mailchimp. The latter gives you more flexibility, allowing you to pay by subscription or set up recurring payments (popular with companies).
If you want to collect reviews for your website or other purposes, it's best to create a new account at TripAdvisor (and make sure you're the only one logging in from the device where the review data will be scraped). This avoids the situation where you accidentally try to download a review that someone else has already written, which could lead TripAdvisor to flag your IP address and cause trouble for you later. You then need to build a scraper that navigates the TripAdvisor site and collects the necessary data.
You can now test the script by visiting any hotel's page and clicking on the link "Hotel Reviews" in the left panel. This brings up more information about all of the reviews for that hotel. To download all the reviews into a file, you can use a Python library called lxml to build a parser. In your scraper, include code like this:
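import urllib.request
from lxml import etree

# Hotel review page to scrape
url = "https://www.tripadvisor.com/Hotels-g575817-d126009-Reviews-Hotel_Name-Cobble_Point_West_Bay-Providenciales_Turks_and_Caicos.html"
# Download the raw HTML and parse it into an element tree
parser = etree.HTMLParser()
results = etree.fromstring(urllib.request.urlopen(url).read(), parser)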
The code creates an HTML parser that reads the page's markup into a tree, from which the relevant information can be identified (the reviews themselves are extracted from the tags). The scraped data can then be written to a file, copied into an application for further analysis, or sent to another system for processing.
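The extraction and file-writing steps might look something like the sketch below. It uses the standard-library html.parser and csv modules rather than lxml so that it runs without extra installs, and it assumes reviews sit in paragraph tags with a class of "review-text". That class name, the ReviewParser class, and the sample markup are hypothetical; the real page structure will differ and has to be inspected first.

```python
import csv
import io
from html.parser import HTMLParser


class ReviewParser(HTMLParser):
    """Collects the text of every <p class="review-text"> element."""

    def __init__(self):
        super().__init__()
        self.reviews = []
        self._buffer = None  # list of text chunks while inside a review element

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "p" and "review-text" in classes:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            self.reviews.append("".join(self._buffer).strip())
            self._buffer = None


# Made-up markup standing in for a downloaded review page.
SAMPLE_HTML = """
<div>
  <p class="review-text">Lovely stay, friendly staff.</p>
  <p class="intro">Not a review.</p>
  <p class="review-text highlighted">Room was small but clean.</p>
</div>
"""

parser = ReviewParser()
parser.feed(SAMPLE_HTML)

# Write to an in-memory buffer here; swap in
# open("reviews.csv", "w", newline="") to write to disk instead.
output = io.StringIO()
writer = csv.writer(output)
writer.writerow(["review"])
for text in parser.reviews:
    writer.writerow([text])
print(output.getvalue())
```

A CSV file is only one option; the same list of review strings could just as easily be sent to a database or another service.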
Data scraping is a simple and effective technique for extracting structured information such as reviews and other website content. Depending on the nature of your business, this data can be very valuable. Making the most of it requires careful planning about which scraper to build and how to use the data after it's retrieved. This guide will help anybody who wants to get more out of their website!