{
"cells": [
{
"cell_type": "markdown",
"id": "1fecd49e",
"metadata": {},
"source": [
"# 🗺️ Google Maps Review Summarizer\n",
"\n",
"This Python app automates fetching and summarizing Google Maps reviews for any business or location.\n",
"\n",
"## 🚀 Overview\n",
"The app performs two main tasks:\n",
"1. **Scrape Reviews:** Uses a web scraping script to extract reviews directly from Google Maps.\n",
"2. **Summarize Content:** Uses OpenAI's language models to generate a concise summary of the collected reviews and analyze their sentiment.\n",
"\n",
"## 🧠 Tech Stack\n",
"- **Python:** Core language\n",
"- **Playwright:** For scraping reviews\n",
"- **OpenAI API:** For natural language summarization\n",
"- **Jupyter Notebook:** For exploration, testing, and demonstration\n",
"\n",
"### 🙏 Credits\n",
"The web scraping logic is **inspired by [Antonello Zanini's blog post](https://blog.apify.com/how-to-scrape-google-reviews/)** on building a Google Reviews scraper. Special thanks for the valuable insights on **structuring and automating the scraping workflow**, which greatly informed the development of this improved scraper.\n",
"\n",
"This app, however, uses an **enhanced version of the scraper** that keeps scrolling to load more reviews until it has collected **at least 1,000 reviews**. If fewer reviews are available, the scraper stops scrolling earlier.\n",
"\n",
"## ✅ Sample Output\n",
"Here is a summary of a restaurant's reviews generated by the app.\n",
"\n",
"![Sample review summary](google-map-review-summary.jpg)\n",
"\n",
"\n",
"---\n",
"\n",
"**Note:** This project is intended for educational and research purposes. Please ensure compliance with Google's [Terms of Service](https://policies.google.com/terms) when scraping or using their data.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "df04a4aa",
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
"\n",
"# Note: each `!` command runs in its own subshell, so `!source .../activate` would not\n",
"# affect this kernel. Select the llm_engineering .venv as the notebook kernel instead.\n",
"\n",
"# Verify that the kernel is running from the venv (the path should contain /.venv/)\n",
"print(sys.executable)\n",
"\n",
"# Install Playwright into the environment the kernel is running in\n",
"%pip install playwright\n",
"\n",
"# Download the required browser binaries, using the kernel's own interpreter\n",
"!{sys.executable} -m playwright install"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "1c794cfd",
"metadata": {},
"outputs": [],
"source": [
"import asyncio\n",
"from playwright.async_api import async_playwright\n",
"from IPython.display import Markdown, display\n",
"import os\n",
"from dotenv import load_dotenv\n",
"from openai import OpenAI\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "317af2b8",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"API key found and looks good so far!\n"
]
}
],
"source": [
"# Load environment variables from a file called .env\n",
"\n",
"load_dotenv(override=True)\n",
"api_key = os.getenv('OPENAI_API_KEY')\n",
"\n",
"# Check the key\n",
"\n",
"if not api_key:\n",
"    print(\"No API key was found - please head over to the troubleshooting notebook in this folder to identify & fix!\")\n",
"elif not api_key.startswith(\"sk-proj-\"):\n",
"    print(\"An API key was found, but it doesn't start sk-proj-; please check you're using the right key - see troubleshooting notebook\")\n",
"elif api_key.strip() != api_key:\n",
"    print(\"An API key was found, but it looks like it might have space or tab characters at the start or end - please remove them - see troubleshooting notebook\")\n",
"else:\n",
"    print(\"API key found and looks good so far!\")\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "6f142c79",
"metadata": {},
"outputs": [],
"source": [
"async def scroll_reviews_panel(page, max_scrolls=50, max_reviews=10):\n",
"    \"\"\"\n",
"    Scrolls through the reviews panel to lazy-load reviews.\n",
"\n",
"    Args:\n",
"        page: Playwright page object\n",
"        max_scrolls: Maximum number of scroll attempts to prevent infinite loops\n",
"        max_reviews: Stop scrolling once at least this many reviews are loaded\n",
"\n",
"    Returns:\n",
"        Number of reviews loaded\n",
"    \"\"\"\n",
"    # Find the scrollable reviews container\n",
"    # Google Maps reviews are in a specific scrollable div\n",
"    scrollable_div = page.locator('div[role=\"main\"] div[jslog$=\"mutable:true;\"]').first\n",
"\n",
"    # Locator for individual reviews; it is re-evaluated on every count() call\n",
"    review_elements = page.locator(\"div[data-review-id][jsaction]\")\n",
"\n",
"    previous_review_count = 0\n",
"    scroll_attempts = 0\n",
"    no_change_count = 0\n",
"\n",
"    print(\"Starting to scroll and load reviews...\")\n",
"\n",
"    while scroll_attempts < max_scrolls:\n",
"        # Get current count of reviews\n",
"        current_review_count = await review_elements.count()\n",
"\n",
"        # If we have loaded max_reviews, stop scrolling\n",
"        if current_review_count >= max_reviews:\n",
"            break\n",
"\n",
"        print(f\"Scroll attempt {scroll_attempts + 1}: Found {current_review_count} reviews\")\n",
"\n",
"        # Scroll to the bottom of the reviews panel\n",
"        await scrollable_div.evaluate(\"\"\"\n",
"            (element) => {\n",
"                element.scrollTo(0, element.scrollHeight + 100);\n",
"            }\n",
"        \"\"\")\n",
"\n",
"        # Wait for potential new content to load\n",
"        await asyncio.sleep(2)\n",
"\n",
"        # Check if new reviews were loaded\n",
"        if current_review_count == previous_review_count:\n",
"            no_change_count += 1\n",
"            # If count hasn't changed for 3 consecutive scrolls, we've likely reached the end\n",
"            if no_change_count >= 3:\n",
"                print(f\"No new reviews loaded after {no_change_count} attempts. Finished loading.\")\n",
"                break\n",
"        else:\n",
"            no_change_count = 0\n",
"\n",
"        previous_review_count = current_review_count\n",
"        scroll_attempts += 1\n",
"\n",
"    final_count = await review_elements.count()\n",
"    print(f\"Finished scrolling. Total reviews loaded: {final_count}\")\n",
"    return final_count"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "f7f67b70",
"metadata": {},
"outputs": [],
"source": [
"async def scrape_google_reviews(url, max_reviews=1000):\n",
"    # Where to store the scraped data\n",
"    reviews = []\n",
"\n",
"    async with async_playwright() as p:\n",
"        # Launch a Chromium browser and open a new page\n",
"        browser = await p.chromium.launch(\n",
"            headless=True  # Set to False if you want to see the browser in action\n",
"        )\n",
"        context = await browser.new_context()\n",
"        page = await context.new_page()\n",
"\n",
"        # Navigate to the target Google Maps page\n",
"        print(\"Navigating to Google Maps page...\")\n",
"        await page.goto(url)\n",
"\n",
"        # Wait for initial reviews to load\n",
"        print(\"Waiting for initial reviews to load...\")\n",
"        review_html_elements = page.locator(\"div[data-review-id][jsaction]\")\n",
"        await review_html_elements.first.wait_for(state=\"visible\", timeout=10000)\n",
"\n",
"        # Scroll through the reviews panel to lazy-load reviews, up to max_reviews\n",
"        total_reviews = await scroll_reviews_panel(page, max_scrolls=100, max_reviews=max_reviews)\n",
"\n",
"        print(f\"\\nStarting to scrape {total_reviews} reviews...\")\n",
"\n",
"        # Get all review elements after scrolling\n",
"        review_html_elements = page.locator(\"div[data-review-id][jsaction]\")\n",
"        all_reviews = await review_html_elements.all()\n",
"\n",
"        # Iterate over the elements and scrape data from each of them\n",
"        for idx, review_html_element in enumerate(all_reviews, 1):\n",
"            try:\n",
"                # Read the star rating from the element's aria-label\n",
"                stars_element = review_html_element.locator(\"[aria-label*=\\\"star\\\"]\")\n",
"                stars_label = await stars_element.get_attribute(\"aria-label\")\n",
"\n",
"                # Extract the review score from the stars label\n",
"                stars = None\n",
"                for i in range(1, 6):\n",
"                    if stars_label and str(i) in stars_label:\n",
"                        stars = i\n",
"                        break\n",
"\n",
"                # The review date is the span that follows the stars element\n",
"                time_sibling = stars_element.locator(\"xpath=following-sibling::span\")\n",
"                time = await time_sibling.text_content()\n",
"\n",
"                # Select the \"See more\" button and, if it is present, click it\n",
"                more_element = review_html_element.locator(\"button[aria-label=\\\"See more\\\"]\").first\n",
"                if await more_element.is_visible():\n",
"                    await more_element.click()\n",
"                    await asyncio.sleep(0.3)  # Brief wait for text expansion\n",
"\n",
"                text_element = review_html_element.locator(\"div[tabindex=\\\"-1\\\"][id][lang]\")\n",
"                text = await text_element.text_content()\n",
"\n",
"                reviews.append(str(stars) + \" Stars: \\n\" + \"Reviewed On:\" + time + \"\\n\" + text)\n",
"\n",
"                if idx % 10 == 0:\n",
"                    print(f\"Scraped {idx}/{total_reviews} reviews...\")\n",
"\n",
"            except Exception as e:\n",
"                print(f\"Error scraping review {idx}: {str(e)}\")\n",
"                continue\n",
"\n",
"        print(f\"\\nSuccessfully scraped {len(reviews)} reviews!\")\n",
"\n",
"        # Close the browser and release its resources\n",
"        await browser.close()\n",
"\n",
"    return \"\\n\".join(reviews)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "cb160d5f",
"metadata": {},
"outputs": [],
"source": [
"system_prompt = \"\"\"\n",
"You are an expert assistant that analyzes Google reviews\n",
"and provides a summary and the sentiment of the reviews.\n",
"Respond in markdown. Do not wrap the markdown in a code block - respond just with the markdown.\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "69e08d4b",
"metadata": {},
"outputs": [],
"source": [
"# Define our user prompt\n",
"\n",
"user_prompt_prefix = \"\"\"\n",
"Here are the reviews of a Google Maps location or business.\n",
"Provide a short summary of the reviews and their overall sentiment.\n",
"\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "d710972d",
"metadata": {},
"outputs": [],
"source": [
"\n",
"def prepare_message(reviews):\n",
"    return [\n",
"        {\"role\": \"system\", \"content\": system_prompt},\n",
"        {\"role\": \"user\", \"content\": user_prompt_prefix + reviews}\n",
"    ]"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "cb51f436",
"metadata": {},
"outputs": [],
"source": [
"async def summarize(url):\n",
"    openai = OpenAI()\n",
"    reviews = await scrape_google_reviews(url)\n",
"    response = openai.chat.completions.create(\n",
"        model=\"gpt-4.1-mini\",\n",
"        messages=prepare_message(reviews)\n",
"    )\n",
"    return response.choices[0].message.content"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "2f09e2d2",
"metadata": {},
"outputs": [],
"source": [
"async def display_summary(url):\n",
"    summary = await summarize(url)\n",
"    display(Markdown(summary))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ca7995c9",
"metadata": {},
"outputs": [],
"source": [
"url = \"https://www.google.com/maps/place/Grace+Home+Nursing+%26+Assisted+Living/@12.32184,75.0853037,17z/data=!4m8!3m7!1s0x3ba47da1be6a0279:0x9e73181ab0827f7e!8m2!3d12.32184!4d75.0853037!9m1!1b1!16s%2Fg%2F11qjl430n_?entry=ttu&g_ep=EgoyMDI1MTAyMC4wIKXMDSoASAFQAw%3D%3D\"\n",
"await display_summary(url)\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}