How Web Scraping Is Used In Sentiment Analysis And Scraping Restaurant Reviews?


  08 April 2022


Here, we classify the sentiment of reviews for London's top 300 restaurants and build a recommendation system that uses cosine similarity over each restaurant's description, summary, and user comments. Finally, we deploy the result with Flask.


Let's start by defining the issue.

Let us begin with a general overview, using sentiment analysis on reviews of the top 300 restaurants. What do people like most about these restaurants, and what upsets them most?


The other issue we look at is this: suppose we're hungry and go to the place we normally go to, but oh no! That day, the restaurant is closed.

Why not look at ten other restaurants that are similar to our favorite? So, let's get started!


Database for Data Collection and Storage (Store DB)

The data was scraped from tripadvisor.com using the BeautifulSoup package and stored in MongoDB.


MongoDB is a very adaptable database. In a relational table with a fixed schema, every row has the same set of columns, even when some values are unknown. This is not the case in MongoDB: each document can carry as many fields as we wish, and fields that do not apply can simply be left out.
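
As a rough illustration of the collection step, the sketch below parses review blocks from an HTML snippet with BeautifulSoup and builds one document per review, leaving out any field the page does not provide. The markup, CSS class names, and field names are invented for the example and are not TripAdvisor's real page structure; `save_reviews` assumes it receives a pymongo collection object.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the real page structure is not given in this article.
SAMPLE_HTML = """
<div class="review" data-rating="5">
  <span class="title">Great dinner</span>
  <p class="text">Lovely food and friendly service.</p>
</div>
<div class="review" data-rating="2">
  <p class="text">Cold food, rude waiter.</p>
</div>
"""


def parse_reviews(html):
    """Return one dict per review; fields missing from the page are left out."""
    soup = BeautifulSoup(html, "html.parser")
    docs = []
    for block in soup.select("div.review"):
        doc = {"rating": int(block["data-rating"])}
        title = block.select_one("span.title")
        text = block.select_one("p.text")
        if title is not None:
            doc["title"] = title.get_text(strip=True)
        if text is not None:
            doc["text"] = text.get_text(strip=True)
        docs.append(doc)
    return docs


def save_reviews(collection, docs):
    """Store the documents; `collection` is assumed to be a pymongo Collection."""
    if docs:
        collection.insert_many(docs)


reviews = parse_reviews(SAMPLE_HTML)
```

The second review has no title, so its document has no `title` key at all. MongoDB accepts both documents in the same collection.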

Exploring data


London has a total of 19k restaurants; however, we did not work with them all.


In the "Info" column, the amounts marked "TRY" are prices in Turkish lira; converting them to pounds sterling would be more accurate. Price would be a useful feature for the recommendation system, but it is not used here. The "Info" column also contains the price range, special diets (Vegetarian Friendly, Vegan Options, Gluten Free Options, etc.), meals (dinner, lunch, etc.), cuisines, and features (Delivery, Takeout, Reservations, Seating, Highchairs Available, Wheelchair Accessible, Free Wi-Fi, Accepts Credit Cards, Table Service, Digital Payments, BYOB, etc.). The restaurant introduction appears in the "Summary" section.

Image

When we look at the graph, we can see that the polarity is almost always greater than 0 and that the rating is almost always more than 4. User reviews are rarely longer than 85 words. Food and service are, unsurprisingly, the most often used words in feedback.


Preparing the Text

Before feeding the text to the model, we should remove special characters, emojis, stop words, and numerals and convert everything to lowercase, so that the model does not treat the same term as different words. Reducing words to their roots (stemming) also cuts down the number of distinct words the model has to handle. For stemming, we used the SnowballStemmer.

The text is now ready for text mining algorithms.
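
The cleaning steps described above can be sketched as follows. This is a minimal version, assuming NLTK's `SnowballStemmer` is the stemmer meant here; the tiny `STOP_WORDS` set is a placeholder for illustration, not the stop word list used in the original analysis.

```python
import re

from nltk.stem.snowball import SnowballStemmer

# Placeholder stop word list; a real run would use a fuller one.
STOP_WORDS = {"the", "a", "an", "was", "were", "is", "and", "it", "to", "of"}

stemmer = SnowballStemmer("english")


def prepare_text(text):
    """Lowercase, strip non-letters, drop stop words, and stem each token."""
    text = text.lower()
    # Anything that is not a-z or whitespace (digits, punctuation, emojis)
    # becomes a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return [stemmer.stem(tok) for tok in tokens]


tokens = prepare_text("The WAITERS were rude!!! 10/10 would not return 😡")
```

After this step, "Waiters" and "waiter" end up as the same token, so the model counts them together.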

Classification

As the EDA section showed, positive comments predominate.


Because of this imbalance, the model predicts positive comments accurately but performs poorly on negative ones. Random undersampling is therefore used to balance the classes.
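
Random undersampling keeps every example of the smallest class and draws an equally sized random subset from each larger class. The sketch below is one plain-Python way to do it, not necessarily the implementation used in the original analysis; the function name and the toy data are made up for illustration.

```python
import random
from collections import defaultdict


def random_undersample(texts, labels, seed=42):
    """Return (text, label) pairs with every class cut to the minority size."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)

    # Size of the smallest class decides how many examples each class keeps.
    n_min = min(len(items) for items in by_label.values())

    balanced = []
    for label, items in by_label.items():
        for text in rng.sample(items, n_min):
            balanced.append((text, label))
    rng.shuffle(balanced)
    return balanced


# Toy data: four positive reviews, two negative ones.
texts = ["great", "lovely", "amazing", "tasty", "bland", "rude"]
labels = ["positive"] * 4 + ["negative"] * 2
balanced = random_undersample(texts, labels)
```

The trade-off is that some positive reviews are thrown away, which is acceptable when the majority class is as large as it is here.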

Text data cannot be used directly in ML algorithms; we must first convert the text to numerical feature vectors. For this, we can use TfidfVectorizer or CountVectorizer. CountVectorizer counts the number of times each word appears in a document. TF-IDF starts from the same counts but weights each word down the more documents it appears in. CountVectorizer gave me the best results with Multinomial Naive Bayes.
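
A minimal sketch of this combination with scikit-learn is shown below. The six training sentences are invented toy data, far too few for a real model, and serve only to show how `CountVectorizer` and `MultinomialNB` fit together in a pipeline.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy reviews, not data from the article.
train_texts = [
    "great food friendly service",
    "lovely staff delicious food",
    "amazing meal great service",
    "cold food rude waiter",
    "bland meal bad service",
    "rude staff cold bland food",
]
train_labels = ["positive"] * 3 + ["negative"] * 3

# The vectorizer turns each review into word counts; Naive Bayes learns
# how often each word occurs in each class.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

predictions = model.predict([
    "delicious food and friendly staff",
    "cold bland food and a rude waiter",
]).tolist()
```

Swapping `CountVectorizer()` for `TfidfVectorizer()` in the pipeline is enough to compare the two representations on the same data.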


Feature Importance


The words most associated with negative comments about the top 300 restaurants are "bland," "cold," and "bad" (about the food) and "rude" (about the waiter).

Recommendation

1. Based on the Restaurant Description and Information:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# rc is the restaurant DataFrame; rc.full holds the combined text for each restaurant
tfidf_vec = TfidfVectorizer(stop_words="english", min_df=1, max_df=.85, ngram_range=(1, 3), token_pattern=r"\b[a-z][a-z][a-z]+\b")

# TfIdf matrix
matrix = tfidf_vec.fit_transform(rc.full)
# TfidfVectorizer L2-normalizes each row by default, so the linear kernel
# (plain dot product) gives the cosine similarity
cosine_sim = linear_kernel(matrix, matrix)

# Assumption: rc is indexed by restaurant name, so indices maps row position to name
indices = pd.Series(rc.index)


def get_n_recommendations(title, cosine_sim, top_n):
    # row position of the requested restaurant
    idx = indices[indices == title].index[0]
    # similarity scores against every restaurant, best first
    score_series = pd.Series(cosine_sim[idx]).sort_values(ascending=False)
    # skip the first entry (normally the restaurant itself) and keep the next top_n
    top_n_indexes = list(score_series.iloc[1:top_n + 1].index)
    return [indices.iloc[i] for i in top_n_indexes]
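
To see the approach work end to end, here is a small self-contained sketch on invented data. The restaurant names and texts are made up, and `most_similar` is a hypothetical helper written for this example. It also checks that `linear_kernel` and `cosine_similarity` agree on TF-IDF vectors, which holds because the vectorizer normalizes each row to unit length by default.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Invented restaurants and descriptions, for illustration only.
toy = pd.DataFrame({
    "name": ["Anatolia Grill", "Bosphorus Kebab", "Pasta Piazza", "Trattoria Roma"],
    "full": [
        "turkish kebab grill meze lamb",
        "turkish kebab lamb meze takeaway",
        "italian pasta pizza tiramisu",
        "italian pasta pizza wine",
    ],
})

matrix = TfidfVectorizer().fit_transform(toy["full"])
sim_linear = linear_kernel(matrix, matrix)
sim_cosine = cosine_similarity(matrix, matrix)


def most_similar(name, sim, top_n=1):
    """Return the top_n most similar restaurant names, excluding `name` itself."""
    idx = toy.index[toy["name"] == name][0]
    order = [i for i in np.argsort(sim[idx])[::-1] if i != idx][:top_n]
    return toy["name"].iloc[order].tolist()


closest = most_similar("Anatolia Grill", sim_linear)
```

Excluding the query row by position, rather than assuming it sorts first, keeps the result correct even when another restaurant has an identical description.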

2. Based on the Restaurant Description and User Reviews


Flask is a micro-framework for building web apps. All the recommendations were deployed with Flask at the end of this analysis.
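
The article does not show the Flask application itself. As a rough sketch of what serving recommendations could look like, the example below exposes a single `/recommend` route. The route name, query parameters, and the `PRECOMPUTED` lookup table are all hypothetical; in a real deployment the lookup would call the recommendation function instead.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical stand-in for the similarity lookup; names are invented.
PRECOMPUTED = {
    "Anatolia Grill": ["Bosphorus Kebab", "Meze House"],
}


@app.route("/recommend")
def recommend():
    """Return up to n similar restaurants for the given name as JSON."""
    name = request.args.get("name", "")
    top_n = request.args.get("n", default=10, type=int)
    if name not in PRECOMPUTED:
        return jsonify(error="unknown restaurant"), 404
    return jsonify(restaurant=name, recommendations=PRECOMPUTED[name][:top_n])
```

With the development server running, a request such as `/recommend?name=Anatolia%20Grill&n=1` returns the single closest match as JSON.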
