{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0bb7f4e9",
   "metadata": {},
   "source": [
    "### Prepare dependencies\n",
    "I had issues with Selenium and my ChromeDriver, so I had to install the exact versions below to make it work.\n",
    "First, add the Selenium dependency by executing the following commands:\n",
    "``` bash\n",
    "uv pip install selenium==4.11.2\n",
    "uv pip install urllib3==1.26.16\n",
    "```\n",
    "***Do not forget to restart the Jupyter kernel to make the packages available.***"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7a116541",
   "metadata": {},
   "source": [
    "### Preferred Web Browser\n",
    "This script uses Safari on macOS; set up the Safari driver when required: [Safari driver](https://webkit.org/blog/6900/webdriver-support-in-safari-10).\n",
    "It assumes that Chrome is used on Windows systems; install the Chrome driver from [ChromeDriver](https://googlechromelabs.github.io/chrome-for-testing/#stable).\n",
    "Feel free to add support for other browsers when required.\n",
    "\n",
    "I am on Windows, so I extracted the ChromeDriver and put it in my %USERPROFILE%\\AppData\\Local\\Microsoft\\WindowsApps folder to ensure that it is available via the PATH. Any folder listed in the PATH environment variable will work.\n",
    "To see all folders in the PATH environment variable, start cmd and execute `set PATH`.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "1353f8ef",
   "metadata": {},
   "outputs": [],
   "source": [
    "# imports\n",
    "\n",
    "import os\n",
    "import sys\n",
    "from dotenv import load_dotenv\n",
    "from scraper import fetch_website_contents\n",
    "from IPython.display import Markdown, display\n",
    "from openai import OpenAI\n",
    "from selenium import webdriver"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "24d86842",
   "metadata": {},
   "outputs": [],
   "source": [
    "def verify_openai_api_key():\n",
    "    \"\"\"Verify that the OpenAI API key is set in the environment variables.\"\"\"\n",
    "    load_dotenv(override=True)\n",
    "    api_key = os.getenv('OPENAI_API_KEY')\n",
    "\n",
    "    if not api_key:\n",
    "        raise ValueError(\"OPENAI_API_KEY is not set in environment variables.\")\n",
    "\n",
    "    # Dry run with a simple request to verify the key.\n",
    "    try:\n",
    "        client = OpenAI(api_key=api_key)\n",
    "        client.models.list()\n",
    "    except Exception as e:\n",
    "        # Keep the original error attached so the real cause stays visible.\n",
    "        raise ValueError(\"Invalid OPENAI_API_KEY.\") from e\n",
    "\n",
    "    return api_key\n",
    "\n",
    "def get_webdriver():\n",
    "    \"\"\"Initialize and return a Selenium WebDriver based on the operating system.\"\"\"\n",
    "\n",
    "    # Use Safari on macOS and Chrome on every other OS.\n",
    "    # os.name is 'posix' on Linux as well, so check sys.platform instead.\n",
    "    if sys.platform == 'darwin':\n",
    "        driver = webdriver.Safari()\n",
    "    else:\n",
    "        driver = webdriver.Chrome()\n",
    "    return driver\n",
    "\n",
    "def fetch_website_contents_selenium(url: str) -> str:\n",
    "    \"\"\"Fetch website contents using Selenium WebDriver.\"\"\"\n",
    "    driver = get_webdriver()\n",
    "    try:\n",
    "        driver.get(url)\n",
    "        content = driver.page_source\n",
    "    finally:\n",
    "        # Always close the browser, even if loading the page fails.\n",
    "        driver.quit()\n",
    "    return content\n",
    "\n",
    "def messages_for(website: str) -> list[dict]:\n",
    "    \"\"\"Build the system and user messages for the summarization request.\"\"\"\n",
    "    return [\n",
    "        {\n",
    "            \"role\": \"system\",\n",
    "            \"content\": \"\"\"You are a helpful assistant that summarizes website content.\n",
    "            You can perfectly understand HTML structure and extract meaningful information from it.\"\"\"\n",
    "        },\n",
    "        {\n",
    "            \"role\": \"user\",\n",
    "            \"content\": f\"Summarize the following website content:\\n\\n{website}\"\n",
    "        }\n",
    "    ]\n",
    "\n",
    "def summarize_website(url: str, api_key: str) -> str:\n",
    "    \"\"\"Fetch the page with Selenium and return the model's summary of it.\"\"\"\n",
    "    content = fetch_website_contents_selenium(url)\n",
    "    openai_client = OpenAI(api_key=api_key)\n",
    "    response = openai_client.chat.completions.create(\n",
    "        model=\"gpt-4.1-mini\",\n",
    "        messages=messages_for(content)\n",
    "    )\n",
    "    return response.choices[0].message.content"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "9a7e7f0a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The website content is from Forbes, a global media company focusing on business, investing, technology, entrepreneurship, leadership, and lifestyle. The homepage for the Europe edition features a wide range of news articles, opinion pieces, and multimedia related to these topics.\n",
      "\n",
      "Key Highlights:\n",
      "\n",
      "1. **Navigation Menu**: \n",
      "   - Sections include Featured, Billionaires, Innovation, Leadership, Money, Forbes Digital Assets, Business, Small Business, Lifestyle, Real Estate, Forbes Vetted (product reviews), Deals, Lists, Advisor, Health, Newsletters, Forbes Games, More from Forbes (Videos, Magazine, etc.).\n",
      "   - Each section has further subsections and related articles and lists.\n",
      "\n",
      "2. **Main Stories & Featured Articles**:\n",
      "   - Highlighted articles cover various topics including:\n",
      "     - \"This Gulf Nation Is Powering Trump’s Moneymaking Machine\"\n",
      "     - \"Inside Gavin Newsom’s Multimillion-Dollar Business Empire\"\n",
      "     - \"Monday Afternoon Air Traffic Staffing Issues Cause Flight Delays In Dallas\"\n",
      "     - \"How Loop Earplugs Turned Down The Volume For Gen Z And Dialed Up Massive Revenue\"\n",
      "     - \"Trump Crypto Partner Suspends CEO With No Explanation After Stock Falls 74%\"\n",
      "     - \"How The Shutdown Impacts Healthcare\"\n",
      "   - Articles span categories such as Billionaires, Money in Politics, Innovation, Lifestyle, and Travel.\n",
      "\n",
      "3. **Breaking News and Trending Picks**:\n",
      "   - News highlights include government shutdown updates, military campaigns, political news, and economic updates.\n",
      "   - Trending picks and breaking news are regularly updated with a variety of topics.\n",
      "\n",
      "4. **Special Sections**:\n",
      "   - “Forbes Vetted” for trusted product reviews and recommendations including fashion, furniture, tech, and health products.\n",
      "   - “Forbes Shorts” showcasing short video content on trending topics.\n",
      "   - “Live Poll” engaging readers on topics like political scams.\n",
      "\n",
      "5. **Video Content**:\n",
      "   - Includes Forbes Shorts and a video carousel featuring varied thematic stories.\n",
      "\n",
      "6. **Additional Features**:\n",
      "   - Quote of the Day featuring notable quotes from leaders.\n",
      "   - Polls on current issues for reader interaction.\n",
      "   - Links to subsidiaries, licensees, and global editions of Forbes.\n",
      "   - Footer with company info, products, councils, conferences, and privacy/legal information.\n",
      "\n",
      "7. **Focus Areas**:\n",
      "   - Business news and analysis.\n",
      "   - Profiles of billionaires and wealthy individuals.\n",
      "   - Innovation and technology including AI developments.\n",
      "   - Leadership advice and career development.\n",
      "   - Lifestyle including travel, fashion, and wellness.\n",
      "   - Financial advice and investment insights.\n",
      "   - Coverage of politics and global markets.\n",
      "\n",
      "Overall, the Forbes Europe homepage is a comprehensive destination for news and insights across business, finance, innovation, leadership, lifestyle, and more, supplemented by multimedia, interactive polls, and verified product reviews.\n"
     ]
    }
   ],
   "source": [
    "try:\n",
    "    api_key = verify_openai_api_key()\n",
    "    print(summarize_website(\"https://www.forbes.com/\", api_key))\n",
    "except Exception as e:\n",
    "    print(f\"Error: {e}\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "llm-engineering",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}