{ "cells": [ { "cell_type": "markdown", "id": "18d85036", "metadata": {}, "source": [ "\n", "![image](img/spider_bot.png)\n", "\n", "## Investor Relations Web Scraping bot\n", "This code will pop up a Gradio interface to start scraping a website. This is a utility notebook, created to quickly gather documents from IR sites to create a KB. \n", "I've tuned the scraper to go through the Investor Relations tree of a company website and save all documents with extensions (xls, pdf, word, etc), but not the HTML content.\n", "\n", "Due to the way scrapy works with async loops, I had to make a separate script and run it as a subprocess, in order for it to work in a Jupyter notebook.\n", "\n", "Can be used to scrape multiple websites (one at a time). Saves scraped files in a kb/{domain} subdirectory (it does **not** preserve website tree structure)\n", "\n", "Uses **spider_runner.py**, which needs to be in the same directory as the notebook (will check and abort if not present).\n", "\n", "\n", "### Scraping logic\n", "scrapy does a pretty decent job of getting the necessary files, although some dynamic sites will not yield the best results. For a more robust scraper I probably need to move to Selenium in a future upgrade. Still, the tool is quite practical for many occasions, as many companies keep their IR websites static. You may need to tweak the follow-on link scraping patterns, I have kept it very simple (it will follow whatever link has 'investor-relations/' in it and limit the links to follow per page to avoid infinite scraping)\n", "\n", "In a real application environment we would be running the spider class inside the application - this would enable simpler real-time updates in the output. For an interactive notebook I find this approach sufficient enough." ] }, { "cell_type": "code", "execution_count": null, "id": "69f99b6a", "metadata": {}, "outputs": [], "source": [ "import subprocess, os, sys\n", "import gradio as gr\n", "from urllib.parse import urlparse, urljoin\n", "\n", "\n", "# from urllib.parse import urljoin, urlparse\n", "# from scrapy.crawler import CrawlerRunner\n", "# from scrapy.utils.log import configure_logging\n", "# from twisted.internet import reactor, defer\n", "# import asyncio\n", "\n", "is_scraper_completed = False # global variable to check if the scraper has completed\n", "status_value= \"Ready\"\n", "\n", "with gr.Blocks() as scraper_ui:\n", " gr.Markdown(\"## Web Scraper\")\n", " gr.Markdown(\"This is a simple web scraper that can be used to scrape investor relations pages.\")\n", " \n", " url = gr.Textbox(label=\"Enter URL\", placeholder=\"https://example.com\")\n", " \n", " status = gr.Textbox(label=\"Status\", interactive=False, value=\"Ready to scrape. Enter a URL and press Enter.\", lines=5)\n", "\n", " def run_scraper(url):\n", " # Run the spider as a subprocess\n", " if not url.startswith(\"http\"):\n", " url = \"http://\" + url\n", " # Extract the domain from the URL\n", " parsed_url = urlparse(url)\n", " domain = parsed_url.netloc.replace(\"www.\", \"\")\n", " if not domain:\n", " return \"Invalid URL. Please enter a valid URL.\"\n", " # Check if the spider_runner.py file exists\n", " if not os.path.exists('spider_runner.py'):\n", " return \"Error: spider_runner.py not found. Please ensure it is in the current directory.\"\n", " # Run the spider using subprocess\n", " try:\n", " result = subprocess.run([sys.executable, 'spider_runner.py', url, domain], check=True, text=True, capture_output=True)\n", " status_value = f\"Scraping completed for {url}.\"\n", " is_scraper_completed = True # Set the global variable to True\n", " return result.stderr, status_value\n", " except subprocess.CalledProcessError as e:\n", " is_scraper_completed = True\n", " status_value = \"Error during scraping. Check the logs for details.\"\n", " return f\"Error: {e}\", status_value\n", " \n", " output = gr.Textbox(label=\"Output\", interactive=False)\n", " \n", " url.submit(run_scraper, inputs=url, outputs=[output,status]) \n", "\n", "scraper_ui.launch(inbrowser=True)" ] } ], "metadata": { "kernelspec": { "display_name": "llms", "language": "python", "name": "python3" }, "language_info": { "name": "python", "version": "3.11.13" } }, "nbformat": 4, "nbformat_minor": 5 }