Add Shijin's week 1 solution (cleared outputs)
@@ -0,0 +1,367 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "1fecd49e",
   "metadata": {},
   "source": [
    "# 🗺️ Google Maps Review Summarizer\n",
    "\n",
    "This Python app automates fetching and summarizing Google Maps reviews for any business or location.\n",
    "\n",
    "## 🚀 Overview\n",
    "The app performs two main tasks:\n",
    "1. **Scrape Reviews** – Uses a web scraping script to extract reviews directly from Google Maps.\n",
    "2. **Summarize Content** – Leverages OpenAI's language models to generate concise, insightful summaries of the collected reviews and analyze their sentiment.\n",
    "\n",
    "## 🧠 Tech Stack\n",
    "- **Python** – Core language\n",
    "- **Playwright** – For scraping reviews\n",
    "- **OpenAI API** – For natural language summarization\n",
    "- **Jupyter Notebook** – For exploration, testing, and demonstration\n",
    "\n",
    "### 🙏 Credits\n",
    "The web scraping logic is **inspired by [Antonello Zanini’s blog post](https://blog.apify.com/how-to-scrape-google-reviews/)** on building a Google Reviews scraper. Special thanks for the valuable insights on **structuring and automating the scraping workflow**, which greatly informed the development of this improved scraper.\n",
    "\n",
    "This app, however, uses an **enhanced version of the scraper** that keeps scrolling to load more reviews until it collects **at least 1,000 reviews**. If fewer reviews are available, the scraper stops scrolling earlier.\n",
    "\n",
    "## ✅ Sample Output\n",
    "Here is a summary of reviews of a restaurant generated by the app.\n",
    "\n",
    "---\n",
    "\n",
    "**Note:** This project is intended for educational and research purposes. Please ensure compliance with Google’s [Terms of Service](https://policies.google.com/terms) when scraping or using their data.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "df04a4aa",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Note: `!source` runs in a throwaway subshell, so it cannot activate the venv for\n",
    "# this notebook session - select the .venv interpreter as the notebook kernel instead\n",
    "!source ../../../.venv/bin/activate\n",
    "\n",
    "# Make sure pip is available and up to date inside the venv\n",
    "!python3 -m ensurepip --upgrade\n",
    "\n",
    "# Verify that pip now points to the venv path (should end with /.venv/bin/pip)\n",
    "!which pip3\n",
    "\n",
    "# Install Playwright inside the venv\n",
    "!pip3 install playwright\n",
    "\n",
    "# Download the required browser binaries and dependencies\n",
    "!python3 -m playwright install"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "1c794cfd",
   "metadata": {},
   "outputs": [],
   "source": [
    "import asyncio\n",
    "import os\n",
    "\n",
    "from dotenv import load_dotenv\n",
    "from IPython.display import Markdown, display\n",
    "from openai import OpenAI\n",
    "from playwright.async_api import async_playwright\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "317af2b8",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "API key found and looks good so far!\n"
     ]
    }
   ],
   "source": [
    "# Load environment variables from a file called .env\n",
    "\n",
    "load_dotenv(override=True)\n",
    "api_key = os.getenv('OPENAI_API_KEY')\n",
    "\n",
    "# Check the key\n",
    "\n",
    "if not api_key:\n",
    "    print(\"No API key was found - please head over to the troubleshooting notebook in this folder to identify & fix!\")\n",
    "elif not api_key.startswith(\"sk-proj-\"):\n",
    "    print(\"An API key was found, but it doesn't start with sk-proj-; please check you're using the right key - see the troubleshooting notebook\")\n",
    "elif api_key.strip() != api_key:\n",
    "    print(\"An API key was found, but it looks like it might have space or tab characters at the start or end - please remove them - see the troubleshooting notebook\")\n",
    "else:\n",
    "    print(\"API key found and looks good so far!\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "6f142c79",
   "metadata": {},
   "outputs": [],
   "source": [
    "async def scroll_reviews_panel(page, max_scrolls=50, max_reviews=10):\n",
    "    \"\"\"\n",
    "    Scrolls through the reviews panel to lazy load all reviews.\n",
    "\n",
    "    Args:\n",
    "        page: Playwright page object\n",
    "        max_scrolls: Maximum number of scroll attempts to prevent infinite loops\n",
    "        max_reviews: Stop scrolling once at least this many reviews are loaded\n",
    "\n",
    "    Returns:\n",
    "        Number of reviews loaded\n",
    "    \"\"\"\n",
    "    # Find the scrollable reviews container\n",
    "    # Google Maps reviews are in a specific scrollable div\n",
    "    scrollable_div = page.locator('div[role=\"main\"] div[jslog$=\"mutable:true;\"]').first\n",
    "\n",
    "    previous_review_count = 0\n",
    "    scroll_attempts = 0\n",
    "    no_change_count = 0\n",
    "\n",
    "    print(\"Starting to scroll and load reviews...\")\n",
    "\n",
    "    while scroll_attempts < max_scrolls:\n",
    "        # Get the current count of loaded reviews\n",
    "        review_elements = page.locator(\"div[data-review-id][jsaction]\")\n",
    "        current_review_count = await review_elements.count()\n",
    "\n",
    "        # If we have loaded max_reviews, stop scrolling\n",
    "        if current_review_count >= max_reviews:\n",
    "            break\n",
    "\n",
    "        print(f\"Scroll attempt {scroll_attempts + 1}: Found {current_review_count} reviews\")\n",
    "\n",
    "        # Scroll to the bottom of the reviews panel\n",
    "        await scrollable_div.evaluate(\"\"\"\n",
    "            (element) => {\n",
    "                element.scrollTo(0, element.scrollHeight + 100);\n",
    "            }\n",
    "        \"\"\")\n",
    "\n",
    "        # Wait for potential new content to load\n",
    "        await asyncio.sleep(2)\n",
    "\n",
    "        # Check if new reviews were loaded\n",
    "        if current_review_count == previous_review_count:\n",
    "            no_change_count += 1\n",
    "            # If the count hasn't changed for 3 consecutive scrolls, we've likely reached the end\n",
    "            if no_change_count >= 3:\n",
    "                print(f\"No new reviews loaded after {no_change_count} attempts. Finished loading.\")\n",
    "                break\n",
    "        else:\n",
    "            no_change_count = 0\n",
    "\n",
    "        previous_review_count = current_review_count\n",
    "        scroll_attempts += 1\n",
    "\n",
    "    final_count = await review_elements.count()\n",
    "    print(f\"Finished scrolling. Total reviews loaded: {final_count}\")\n",
    "    return final_count"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "f7f67b70",
   "metadata": {},
   "outputs": [],
   "source": [
    "async def scrape_google_reviews(url):\n",
    "    # Where to store the scraped data\n",
    "    reviews = []\n",
    "\n",
    "    async with async_playwright() as p:\n",
    "        # Initialize a new Playwright instance\n",
    "        browser = await p.chromium.launch(\n",
    "            headless=True  # Set to False if you want to see the browser in action\n",
    "        )\n",
    "        context = await browser.new_context()\n",
    "        page = await context.new_page()\n",
    "\n",
    "        # Navigate to the target Google Maps reviews page\n",
    "        print(\"Navigating to Google Maps page...\")\n",
    "        await page.goto(url)\n",
    "\n",
    "        # Wait for initial reviews to load\n",
    "        print(\"Waiting for initial reviews to load...\")\n",
    "        review_html_elements = page.locator(\"div[data-review-id][jsaction]\")\n",
    "        await review_html_elements.first.wait_for(state=\"visible\", timeout=10000)\n",
    "\n",
    "        # Scroll through the reviews panel to lazy load all reviews\n",
    "        total_reviews = await scroll_reviews_panel(page, max_scrolls=100)\n",
    "\n",
    "        print(f\"\\nStarting to scrape {total_reviews} reviews...\")\n",
    "\n",
    "        # Get all review elements after scrolling\n",
    "        review_html_elements = page.locator(\"div[data-review-id][jsaction]\")\n",
    "        all_reviews = await review_html_elements.all()\n",
    "\n",
    "        # Iterate over the elements and scrape data from each of them\n",
    "        for idx, review_html_element in enumerate(all_reviews, 1):\n",
    "            try:\n",
    "                stars_element = review_html_element.locator(\"[aria-label*=\\\"star\\\"]\")\n",
    "                stars_label = await stars_element.get_attribute(\"aria-label\")\n",
    "\n",
    "                # Extract the review score from the stars label\n",
    "                stars = None\n",
    "                for i in range(1, 6):\n",
    "                    if stars_label and str(i) in stars_label:\n",
    "                        stars = i\n",
    "                        break\n",
    "\n",
    "                # Get the next sibling of the stars element with an XPath expression\n",
    "                time_sibling = stars_element.locator(\"xpath=following-sibling::span\")\n",
    "                time = await time_sibling.text_content()\n",
    "\n",
    "                # Select the \"More\" button and, if it is present, click it\n",
    "                more_element = review_html_element.locator(\"button[aria-label=\\\"See more\\\"]\").first\n",
    "                if await more_element.is_visible():\n",
    "                    await more_element.click()\n",
    "                    await asyncio.sleep(0.3)  # Brief wait for text expansion\n",
    "\n",
    "                text_element = review_html_element.locator(\"div[tabindex=\\\"-1\\\"][id][lang]\")\n",
    "                text = await text_element.text_content()\n",
    "\n",
    "                reviews.append(f\"{stars} Stars:\\nReviewed On: {time}\\n{text}\")\n",
    "\n",
    "                if idx % 10 == 0:\n",
    "                    print(f\"Scraped {idx}/{total_reviews} reviews...\")\n",
    "\n",
    "            except Exception as e:\n",
    "                print(f\"Error scraping review {idx}: {str(e)}\")\n",
    "                continue\n",
    "\n",
    "        print(f\"\\nSuccessfully scraped {len(reviews)} reviews!\")\n",
    "\n",
    "        # Close the browser and release its resources\n",
    "        await browser.close()\n",
    "\n",
    "    return \"\\n\".join(reviews)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "cb160d5f",
   "metadata": {},
   "outputs": [],
   "source": [
    "system_prompt = \"\"\"\n",
    "You are an expert assistant that analyzes Google reviews,\n",
    "and provides a summary and sentiment of the reviews.\n",
    "Respond in markdown. Do not wrap the markdown in a code block - respond just with the markdown.\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "69e08d4b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define our user prompt\n",
    "\n",
    "user_prompt_prefix = \"\"\"\n",
    "Here are the reviews of a Google Maps location/business.\n",
    "Provide a short summary of the reviews and the sentiment of the reviews.\n",
    "\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "d710972d",
   "metadata": {},
   "outputs": [],
   "source": [
    "def prepare_message(reviews):\n",
    "    return [\n",
    "        {\"role\": \"system\", \"content\": system_prompt},\n",
    "        {\"role\": \"user\", \"content\": user_prompt_prefix + reviews}\n",
    "    ]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "cb51f436",
   "metadata": {},
   "outputs": [],
   "source": [
    "async def summarize(url):\n",
    "    openai = OpenAI()\n",
    "    reviews = await scrape_google_reviews(url)\n",
    "    response = openai.chat.completions.create(\n",
    "        model=\"gpt-4.1-mini\",\n",
    "        messages=prepare_message(reviews)\n",
    "    )\n",
    "    return response.choices[0].message.content"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "2f09e2d2",
   "metadata": {},
   "outputs": [],
   "source": [
    "async def display_summary(url):\n",
    "    summary = await summarize(url)\n",
    "    display(Markdown(summary))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ca7995c9",
   "metadata": {},
   "outputs": [],
   "source": [
    "url = \"https://www.google.com/maps/place/Grace+Home+Nursing+%26+Assisted+Living/@12.32184,75.0853037,17z/data=!4m8!3m7!1s0x3ba47da1be6a0279:0x9e73181ab0827f7e!8m2!3d12.32184!4d75.0853037!9m1!1b1!16s%2Fg%2F11qjl430n_?entry=ttu&g_ep=EgoyMDI1MTAyMC4wIKXMDSoASAFQAw%3D%3D\"\n",
    "await display_summary(url)\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
Binary file not shown.
After Width: | Height: | Size: 451 KiB