{
"cells": [
{
"cell_type": "markdown",
"id": "18d85036",
"metadata": {},
"source": [
"\n",
"![image](img/spider_bot.png)\n",
"\n",
"## Investor Relations Web Scraping bot\n",
"This code will pop up a Gradio interface to start scraping a website. This is a utility notebook, created to quickly gather documents from IR sites to create a KB. \n",
"I've tuned the scraper to go through the Investor Relations tree of a company website and save all documents with extensions (xls, pdf, word, etc), but not the HTML content.\n",
"\n",
"Due to the way scrapy works with async loops, I had to make a separate script and run it as a subprocess, in order for it to work in a Jupyter notebook.\n",
"\n",
"Can be used to scrape multiple websites (one at a time). Saves scraped files in a kb/{domain} subdirectory (it does **not** preserve website tree structure)\n",
"\n",
"Uses **spider_runner.py**, which needs to be in the same directory as the notebook (will check and abort if not present).\n",
"\n",
"\n",
"### Scraping logic\n",
"scrapy does a pretty decent job of getting the necessary files, although some dynamic sites will not yield the best results. For a more robust scraper I probably need to move to Selenium in a future upgrade. Still, the tool is quite practical for many occasions, as many companies keep their IR websites static. You may need to tweak the follow-on link scraping patterns, I have kept it very simple (it will follow whatever link has 'investor-relations/' in it and limit the links to follow per page to avoid infinite scraping)\n",
"\n",
"In a real application environment we would be running the spider class inside the application - this would enable simpler real-time updates in the output. For an interactive notebook I find this approach sufficient enough."
]
},
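{
"cell_type": "markdown",
"id": "a3c4f2e1",
"metadata": {},
"source": [
"### What spider_runner.py roughly does\n",
"The actual **spider_runner.py** ships alongside this notebook; the sketch below only illustrates the logic described above, assuming a plain `scrapy.Spider` driven by a `CrawlerProcess`. The extension list, the `MAX_LINKS_PER_PAGE` cap and the class/variable names are assumptions for illustration, not the shipped implementation.\n",
"\n",
"```python\n",
"# spider_runner.py: illustrative sketch, not the shipped script\n",
"import os\n",
"import sys\n",
"from urllib.parse import urlparse\n",
"\n",
"import scrapy\n",
"from scrapy.crawler import CrawlerProcess\n",
"\n",
"# Assumed document extensions and per-page link cap (tweak to taste)\n",
"DOC_EXTENSIONS = (\".pdf\", \".xls\", \".xlsx\", \".doc\", \".docx\", \".ppt\", \".pptx\", \".csv\")\n",
"MAX_LINKS_PER_PAGE = 20\n",
"\n",
"class IRSpider(scrapy.Spider):\n",
"    name = \"ir_spider\"\n",
"\n",
"    def __init__(self, start_url, domain, *args, **kwargs):\n",
"        super().__init__(*args, **kwargs)\n",
"        self.start_urls = [start_url]\n",
"        self.allowed_domains = [domain]\n",
"        self.out_dir = os.path.join(\"kb\", domain)\n",
"        os.makedirs(self.out_dir, exist_ok=True)\n",
"\n",
"    def parse(self, response):\n",
"        followed = 0\n",
"        for href in response.css(\"a::attr(href)\").getall():\n",
"            link = response.urljoin(href)\n",
"            if link.lower().endswith(DOC_EXTENSIONS):\n",
"                # Queue documents for download; HTML pages themselves are not saved\n",
"                yield scrapy.Request(link, callback=self.save_file)\n",
"            elif \"investor-relations/\" in link and followed < MAX_LINKS_PER_PAGE:\n",
"                followed += 1\n",
"                yield scrapy.Request(link, callback=self.parse)\n",
"\n",
"    def save_file(self, response):\n",
"        # Flat layout: kb/{domain}/{filename}; the site's tree structure is not preserved\n",
"        filename = os.path.basename(urlparse(response.url).path) or \"index\"\n",
"        with open(os.path.join(self.out_dir, filename), \"wb\") as f:\n",
"            f.write(response.body)\n",
"\n",
"if __name__ == \"__main__\":\n",
"    start_url, domain = sys.argv[1], sys.argv[2]\n",
"    process = CrawlerProcess(settings={\"LOG_LEVEL\": \"INFO\"})\n",
"    process.crawl(IRSpider, start_url=start_url, domain=domain)\n",
"    process.start()\n",
"```\n",
"\n",
"Running the spider in its own process keeps scrapy's Twisted reactor out of the notebook's event loop, which is why the Gradio callback below shells out via `subprocess` instead of importing the spider directly."
]
},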
{
"cell_type": "code",
"execution_count": null,
"id": "69f99b6a",
"metadata": {},
"outputs": [],
"source": [
"import subprocess, os, sys\n",
"import gradio as gr\n",
"from urllib.parse import urlparse, urljoin\n",
"\n",
"\n",
"# from urllib.parse import urljoin, urlparse\n",
"# from scrapy.crawler import CrawlerRunner\n",
"# from scrapy.utils.log import configure_logging\n",
"# from twisted.internet import reactor, defer\n",
"# import asyncio\n",
"\n",
"is_scraper_completed = False # global variable to check if the scraper has completed\n",
"status_value= \"Ready\"\n",
"\n",
"with gr.Blocks() as scraper_ui:\n",
" gr.Markdown(\"## Web Scraper\")\n",
" gr.Markdown(\"This is a simple web scraper that can be used to scrape investor relations pages.\")\n",
" \n",
" url = gr.Textbox(label=\"Enter URL\", placeholder=\"https://example.com\")\n",
" \n",
" status = gr.Textbox(label=\"Status\", interactive=False, value=\"Ready to scrape. Enter a URL and press Enter.\", lines=5)\n",
"\n",
" def run_scraper(url):\n",
" # Run the spider as a subprocess\n",
" if not url.startswith(\"http\"):\n",
" url = \"http://\" + url\n",
" # Extract the domain from the URL\n",
" parsed_url = urlparse(url)\n",
" domain = parsed_url.netloc.replace(\"www.\", \"\")\n",
" if not domain:\n",
" return \"Invalid URL. Please enter a valid URL.\"\n",
" # Check if the spider_runner.py file exists\n",
" if not os.path.exists('spider_runner.py'):\n",
" return \"Error: spider_runner.py not found. Please ensure it is in the current directory.\"\n",
" # Run the spider using subprocess\n",
" try:\n",
" result = subprocess.run([sys.executable, 'spider_runner.py', url, domain], check=True, text=True, capture_output=True)\n",
" status_value = f\"Scraping completed for {url}.\"\n",
" is_scraper_completed = True # Set the global variable to True\n",
" return result.stderr, status_value\n",
" except subprocess.CalledProcessError as e:\n",
" is_scraper_completed = True\n",
" status_value = \"Error during scraping. Check the logs for details.\"\n",
" return f\"Error: {e}\", status_value\n",
" \n",
" output = gr.Textbox(label=\"Output\", interactive=False)\n",
" \n",
" url.submit(run_scraper, inputs=url, outputs=[output,status]) \n",
"\n",
"scraper_ui.launch(inbrowser=True)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "llms",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.11.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}