{
"cells": [
{
"cell_type": "markdown",
"id": "18d85036",
"metadata": {},
"source": [
"\n",
"![image](img/spider_bot.png)\n",
"\n",
"## Investor Relations Web Scraping bot\n",
"This code will pop up a Gradio interface to start scraping a website. This is a utility notebook, created to quickly gather documents from IR sites to create a KB. \n",
"I've tuned the scraper to go through the Investor Relations tree of a company website and save all documents with extensions (xls, pdf, word, etc), but not the HTML content.\n",
"\n",
"Due to the way scrapy works with async loops, I had to make a separate script and run it as a subprocess, in order for it to work in a Jupyter notebook.\n",
"\n",
"Can be used to scrape multiple websites (one at a time). Saves scraped files in a kb/{domain} subdirectory (it does **not** preserve website tree structure)\n",
"\n",
"Uses **spider_runner.py**, which needs to be in the same directory as the notebook (will check and abort if not present).\n",
"\n",
"\n",
"### Scraping logic\n",
"scrapy does a pretty decent job of getting the necessary files, although some dynamic sites will not yield the best results. For a more robust scraper I probably need to move to Selenium in a future upgrade. Still, the tool is quite practical for many occasions, as many companies keep their IR websites static. You may need to tweak the follow-on link scraping patterns, I have kept it very simple (it will follow whatever link has 'investor-relations/' in it and limit the links to follow per page to avoid infinite scraping)\n",
"\n",
"In a real application environment we would be running the spider class inside the application - this would enable simpler real-time updates in the output. For an interactive notebook I find this approach sufficient enough."
]
},
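{
"cell_type": "markdown",
"id": "a3c4f2e1",
"metadata": {},
"source": [
"### What spider_runner.py roughly does\n",
"The actual **spider_runner.py** ships alongside this notebook; the sketch below only illustrates the logic described above, assuming a plain `scrapy.Spider` driven by a `CrawlerProcess`. The extension list, the `MAX_LINKS_PER_PAGE` cap and the class/variable names are assumptions for illustration, not the shipped implementation.\n",
"\n",
"```python\n",
"# spider_runner.py: illustrative sketch, not the shipped script\n",
"import os\n",
"import sys\n",
"from urllib.parse import urlparse\n",
"\n",
"import scrapy\n",
"from scrapy.crawler import CrawlerProcess\n",
"\n",
"# Assumed document extensions and per-page link cap (tweak to taste)\n",
"DOC_EXTENSIONS = (\".pdf\", \".xls\", \".xlsx\", \".doc\", \".docx\", \".ppt\", \".pptx\", \".csv\")\n",
"MAX_LINKS_PER_PAGE = 20\n",
"\n",
"class IRSpider(scrapy.Spider):\n",
"    name = \"ir_spider\"\n",
"\n",
"    def __init__(self, start_url, domain, *args, **kwargs):\n",
"        super().__init__(*args, **kwargs)\n",
"        self.start_urls = [start_url]\n",
"        self.allowed_domains = [domain]\n",
"        self.out_dir = os.path.join(\"kb\", domain)\n",
"        os.makedirs(self.out_dir, exist_ok=True)\n",
"\n",
"    def parse(self, response):\n",
"        followed = 0\n",
"        for href in response.css(\"a::attr(href)\").getall():\n",
"            link = response.urljoin(href)\n",
"            if link.lower().endswith(DOC_EXTENSIONS):\n",
"                # Queue documents for download; HTML pages themselves are not saved\n",
"                yield scrapy.Request(link, callback=self.save_file)\n",
"            elif \"investor-relations/\" in link and followed < MAX_LINKS_PER_PAGE:\n",
"                followed += 1\n",
"                yield scrapy.Request(link, callback=self.parse)\n",
"\n",
"    def save_file(self, response):\n",
"        # Flat layout: kb/{domain}/{filename}; the site's tree structure is not preserved\n",
"        filename = os.path.basename(urlparse(response.url).path) or \"index\"\n",
"        with open(os.path.join(self.out_dir, filename), \"wb\") as f:\n",
"            f.write(response.body)\n",
"\n",
"if __name__ == \"__main__\":\n",
"    start_url, domain = sys.argv[1], sys.argv[2]\n",
"    process = CrawlerProcess(settings={\"LOG_LEVEL\": \"INFO\"})\n",
"    process.crawl(IRSpider, start_url=start_url, domain=domain)\n",
"    process.start()\n",
"```\n",
"\n",
"Running the spider in its own process keeps scrapy's Twisted reactor out of the notebook's event loop, which is why the Gradio callback below shells out via `subprocess` instead of importing the spider directly."
]
},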
{
"cell_type": "code",
"execution_count": null,
"id": "69f99b6a",
"metadata": {},
"outputs": [],
"source": [
"import subprocess, os, sys\n",
"import gradio as gr\n",
"from urllib.parse import urlparse, urljoin\n",
"\n",
"\n",
"# from urllib.parse import urljoin, urlparse\n",
"# from scrapy.crawler import CrawlerRunner\n",
"# from scrapy.utils.log import configure_logging\n",
"# from twisted.internet import reactor, defer\n",
"# import asyncio\n",
"\n",
"is_scraper_completed = False # global variable to check if the scraper has completed\n",
"status_value= \"Ready\"\n",
"\n",
"with gr.Blocks() as scraper_ui:\n",
" gr.Markdown(\"## Web Scraper\")\n",
" gr.Markdown(\"This is a simple web scraper that can be used to scrape investor relations pages.\")\n",
" \n",
" url = gr.Textbox(label=\"Enter URL\", placeholder=\"https://example.com\")\n",
" \n",
" status = gr.Textbox(label=\"Status\", interactive=False, value=\"Ready to scrape. Enter a URL and press Enter.\", lines=5)\n",
"\n",
" def run_scraper(url):\n",
" # Run the spider as a subprocess\n",
" if not url.startswith(\"http\"):\n",
" url = \"http://\" + url\n",
" # Extract the domain from the URL\n",
" parsed_url = urlparse(url)\n",
" domain = parsed_url.netloc.replace(\"www.\", \"\")\n",
" if not domain:\n",
" return \"Invalid URL. Please enter a valid URL.\"\n",
" # Check if the spider_runner.py file exists\n",
" if not os.path.exists('spider_runner.py'):\n",
" return \"Error: spider_runner.py not found. Please ensure it is in the current directory.\"\n",
" # Run the spider using subprocess\n",
" try:\n",
" result = subprocess.run([sys.executable, 'spider_runner.py', url, domain], check=True, text=True, capture_output=True)\n",
" status_value = f\"Scraping completed for {url}.\"\n",
" is_scraper_completed = True # Set the global variable to True\n",
" return result.stderr, status_value\n",
" except subprocess.CalledProcessError as e:\n",
" is_scraper_completed = True\n",
" status_value = \"Error during scraping. Check the logs for details.\"\n",
" return f\"Error: {e}\", status_value\n",
" \n",
" output = gr.Textbox(label=\"Output\", interactive=False)\n",
" \n",
" url.submit(run_scraper, inputs=url, outputs=[output,status]) \n",
"\n",
"scraper_ui.launch(inbrowser=True)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "llms",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.11.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}