{
"cells": [
{
"cell_type": "markdown",
"id": "18d85036",
"metadata": {},
"source": [
"\n",
"\n",
"\n",
"## Investor Relations Web Scraping bot\n",
|
|
"This code will pop up a Gradio interface to start scraping a website. This is a utility notebook, created to quickly gather documents from IR sites to create a KB. \n",
|
|
"I've tuned the scraper to go through the Investor Relations tree of a company website and save all documents with extensions (xls, pdf, word, etc), but not the HTML content.\n",
|
|
"\n",
|
|
"Due to the way scrapy works with async loops, I had to make a separate script and run it as a subprocess, in order for it to work in a Jupyter notebook.\n",
|
|
"\n",
|
|
"Can be used to scrape multiple websites (one at a time). Saves scraped files in a kb/{domain} subdirectory (it does **not** preserve website tree structure)\n",
|
|
"\n",
|
|
"Uses **spider_runner.py**, which needs to be in the same directory as the notebook (will check and abort if not present).\n",
|
|
"\n",
"\n",
"### Scraping logic\n",
"Scrapy does a pretty decent job of fetching the necessary files, although some dynamic sites will not yield the best results; for a more robust scraper I will probably need to move to Selenium in a future upgrade. Still, the tool is quite practical, as many companies keep their IR websites static. You may need to tweak the follow-on link patterns: I have kept them very simple (the spider follows any link containing 'investor-relations/' and limits the number of links followed per page to avoid infinite scraping).\n",
"\n",
"In a real application we would run the spider class inside the application itself, which would make real-time updates in the output simpler. For an interactive notebook I find the subprocess approach sufficient."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "69f99b6a",
"metadata": {},
"outputs": [],
"source": [
"import subprocess, os, sys\n",
"import gradio as gr\n",
"from urllib.parse import urlparse\n",
"\n",
"\n",
"# from urllib.parse import urljoin, urlparse\n",
"# from scrapy.crawler import CrawlerRunner\n",
"# from scrapy.utils.log import configure_logging\n",
"# from twisted.internet import reactor, defer\n",
"# import asyncio\n",
"\n",
"is_scraper_completed = False  # global variable to check if the scraper has completed\n",
"status_value = \"Ready\"  # global status message shown in the UI\n",
"\n",
"with gr.Blocks() as scraper_ui:\n",
"    gr.Markdown(\"## Web Scraper\")\n",
"    gr.Markdown(\"This is a simple web scraper that can be used to scrape investor relations pages.\")\n",
"    \n",
"    url = gr.Textbox(label=\"Enter URL\", placeholder=\"https://example.com\")\n",
"    \n",
"    status = gr.Textbox(label=\"Status\", interactive=False, value=\"Ready to scrape. Enter a URL and press Enter.\", lines=5)\n",
"\n",
"    def run_scraper(url):\n",
"        # Run the spider as a subprocess and update the module-level status flags\n",
"        global is_scraper_completed, status_value\n",
"        if not url.startswith(\"http\"):\n",
"            url = \"http://\" + url\n",
"        # Extract the domain from the URL\n",
"        parsed_url = urlparse(url)\n",
"        domain = parsed_url.netloc.replace(\"www.\", \"\")\n",
"        if not domain:\n",
"            return \"Invalid URL. Please enter a valid URL.\", \"Invalid URL.\"\n",
"        # Check if the spider_runner.py file exists\n",
"        if not os.path.exists('spider_runner.py'):\n",
"            return \"Error: spider_runner.py not found. Please ensure it is in the current directory.\", \"spider_runner.py missing.\"\n",
"        # Run the spider using subprocess\n",
"        try:\n",
"            result = subprocess.run([sys.executable, 'spider_runner.py', url, domain], check=True, text=True, capture_output=True)\n",
"            status_value = f\"Scraping completed for {url}.\"\n",
"            is_scraper_completed = True  # Set the global variable to True\n",
"            return result.stderr, status_value  # Scrapy logs to stderr, so show it as the output\n",
"        except subprocess.CalledProcessError as e:\n",
"            is_scraper_completed = True\n",
"            status_value = \"Error during scraping. Check the logs for details.\"\n",
"            return f\"Error: {e}\", status_value\n",
"    \n",
"    output = gr.Textbox(label=\"Output\", interactive=False)\n",
"    \n",
"    url.submit(run_scraper, inputs=url, outputs=[output, status])\n",
"\n",
"scraper_ui.launch(inbrowser=True)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "llms",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.11.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}