Add web scraping and summarization script using Playwright and OpenAI
This script allows users to input a URL, scrape the visible content using the Playwright framework, and summarize it using the OpenAI GPT-4o API. The summarized output is saved as a Markdown (.md) file, providing a clean and accessible format.

Key features:
- Prompts user for a URL at runtime
- Uses Playwright to scrape the page content
- Extracts visible text with BeautifulSoup
- Summarizes content using OpenAI's chat model
- Saves output to a user-friendly Markdown file

This contribution supports browser-based content summarization and expands the repo’s AI toolset for web interaction tasks.
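Because the URL comes from a runtime prompt, it may arrive without a scheme, while Playwright's `page.goto` expects a full URL such as `https://example.com`. The following is a minimal sketch, not part of this commit, of how the input could be normalized before scraping. The helper name `normalize_url` is hypothetical, and defaulting to HTTPS for bare hosts is an assumption.

```python
from urllib.parse import urlparse


def normalize_url(raw: str) -> str:
    """Return a URL with an explicit http(s) scheme, or raise ValueError.

    Hypothetical helper for illustration; not part of the committed script.
    """
    candidate = raw.strip()
    if not candidate:
        raise ValueError("empty URL")
    # Assumption: a bare host such as "example.com" is meant as HTTPS.
    if "://" not in candidate:
        candidate = "https://" + candidate
    parsed = urlparse(candidate)
    # Reject anything that is not a web URL with a host part.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"not an http(s) URL: {raw!r}")
    return candidate
```

In `main()`, the result of `input(...)` could be passed through this helper before calling `scrape_website`, so that a bad URL fails early with a readable message.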
@@ -0,0 +1,56 @@
import os
import openai
from dotenv import load_dotenv
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")  # Or set it directly

def scrape_website(url):
    # Load the page in headless Chromium and return its rendered HTML
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        content = page.content()
        browser.close()
    return content

def summarize_content(html_content):
    # Get only the text parts of the webpage
    soup = BeautifulSoup(html_content, 'html.parser')
    summary_text = soup.get_text(separator=' ', strip=True)
    # Summarize the extracted text using the OpenAI API
    system_prompt = "You summarize html content as markdown."
    user_prompt = (
        "You are a helpful assistant. Summarize the following HTML webpage content in markdown with simple terms:\n\n"
        + summary_text
    )
    response = openai.chat.completions.create(
        model="gpt-4o",
        # Send both the system and user prompts
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ]
    )
    return response.choices[0].message.content

def save_markdown(summary, filename="summary.md", url=None):
    # Write the summary to the Markdown file, with a heading that links to the source
    with open(filename, "w", encoding="utf-8") as f:
        if url:
            f.write(f"# Summary of [{url}]({url})\n\n")
        else:
            f.write("# Summary\n\n")
        f.write(summary.strip())

# Main logic
def main():
    url = input("Enter the URL to summarize: ").strip()
    html = scrape_website(url)
    summary = summarize_content(html)
    save_markdown(summary, filename="summary.md", url=url)
    print("✅ Summary saved to summary.md")

# Entry point
if __name__ == "__main__":
    main()
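The prompt assembly in `summarize_content` could also be factored into a small helper that caps how much page text goes into one request. This is a sketch under assumptions, not the committed code: `build_messages` and the `max_chars` default are hypothetical, and the cap is a plain character count, not a token count.

```python
def build_messages(page_text: str, max_chars: int = 20_000) -> list[dict[str, str]]:
    """Assemble chat messages from page text, capped at max_chars characters.

    Hypothetical helper; the default cap is an arbitrary illustrative value.
    """
    system_prompt = "You summarize html content as markdown."
    # Crude character cap; a token-based limit would be more precise.
    clipped = page_text[:max_chars]
    user_prompt = (
        "Summarize the following webpage content in markdown with simple terms:\n\n"
        + clipped
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
```

The returned list has the shape that the `messages=` argument of `openai.chat.completions.create` already receives in the script, so it could be passed there directly.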
@@ -0,0 +1,34 @@
# Summary of [https://www.willwight.com/](https://www.willwight.com/)

# Will Wight - New York Times Best-Selling Author

### Overview
Will Wight is a renowned author known for the "Cradle" series, alongside other works like "The Last Horizon" and "The Traveler's Gate Trilogy." He combines humor and storytelling in his blog and engages actively with his readers.

### Books
- **The Last Horizon**: Currently ongoing series.
- **Cradle**: A 12-book series, now complete.
- **The Traveler's Gate Trilogy**: Completed series.
- **The Elder Empire**: Consists of two trilogies with stories happening simultaneously, totaling 6 books.

### Recent Highlights
- **The Pilot Release**: The fourth book in "The Last Horizon" series, celebrated on July 4th, 2025. The 26th book by Will, marking a milestone as his next book will be his 27th.
- **Barnes & Noble Success**: A significant achievement of getting Will's books stocked nationwide in Barnes & Noble, marking a breakthrough for indie publishing.

### Blog Highlights
- Will shares personal anecdotes and behind-the-scenes insights into his creative process.
- A humorous tone is used, including whimsical stories about his life and writing challenges.
- Recent experiences at Epic Universe theme park with thoughts on its design and offerings.

### Connect
- **Mailing List**: Over 15,000 fans subscribe to receive updates on new stories and releases.
- **Hidden Gnome Publishing**: The entity behind Will's publications, working to bring his books to wider audiences.

### Extras
- **Merch**: Available for fans wanting to support and connect with Will's universe.
- **Podcast**: Offers sneak peeks, discussions, and insights into Will's works.

### Humorous Note
Will humorously describes himself transforming into a "monstrous mongoose" during a full moon, adding a quirky touch to his persona.

For more detailed information on books, blogs, and extras, visit Will's website and explore his engaging world of storytelling!
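The script always writes to `summary.md`, so summarizing a second site overwrites the first result. One possible alternative, sketched here under assumptions, derives the filename from the URL. `filename_for_url` is a hypothetical helper that is not part of this commit, and the slug rule is only one reasonable choice.

```python
import re
from urllib.parse import urlparse


def filename_for_url(url: str) -> str:
    """Build a Markdown filename from a URL's host and path.

    Hypothetical helper; falls back to 'summary-page.md' when nothing usable remains.
    """
    parsed = urlparse(url)
    raw = (parsed.netloc + parsed.path).lower()
    # Collapse every run of non-alphanumeric characters into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", raw).strip("-")
    return f"summary-{slug or 'page'}.md"
```

The result could be passed as the `filename` argument of `save_markdown`, which already accepts one.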
242
week1/day1.ipynb
@@ -110,7 +110,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 1,
"id": "4e2a9393-7767-488e-a8bf-27c12dca35bd",
"metadata": {},
"outputs": [],
@@ -151,10 +151,18 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 9,
"id": "7b87cadb-d513-4303-baee-a37b6f938e4d",
"metadata": {},
"outputs": [],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"API key found and looks good so far!\n"
]
}
],
"source": [
"# Load environment variables in a file called .env\n",
"\n",
@@ -175,7 +183,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 10,
"id": "019974d9-f3ad-4a8a-b5f9-0a3719aea2d3",
"metadata": {},
"outputs": [],
@@ -196,15 +204,26 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 11,
"id": "a58394bf-1e45-46af-9bfd-01e24da6f49a",
"metadata": {},
"outputs": [],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Hello! It's great to meet you. How can I assist you today?\n"
]
}
],
"source": [
"# To give you a preview -- calling OpenAI with these messages is this easy. Any problems, head over to the Troubleshooting notebook.\n",
"import openai\n",
"\n",
"message = \"Hello, GPT! This is my first ever message to you! Hi!\"\n",
"response = openai.chat.completions.create(model=\"gpt-4o-mini\", messages=[{\"role\":\"user\", \"content\":message}])\n",
"response = openai.chat.completions.create(\n",
" model=\"gpt-4o\",\n",
" messages=[{\"role\": \"user\", \"content\": message}]\n",
")\n",
"print(response.choices[0].message.content)"
]
},
@@ -251,7 +270,62 @@
"execution_count": null,
"id": "2ef960cf-6dc2-4cda-afb3-b38be12f4c97",
"metadata": {},
"outputs": [],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Home - Edward Donner\n",
"Home\n",
"Connect Four\n",
"Outsmart\n",
"An arena that pits LLMs against each other in a battle of diplomacy and deviousness\n",
"About\n",
"Posts\n",
"Well, hi there.\n",
"I’m Ed. I like writing code and experimenting with LLMs, and hopefully you’re here because you do too. I also enjoy DJing (but I’m badly out of practice), amateur electronic music production (\n",
"very\n",
"amateur) and losing myself in\n",
"Hacker News\n",
", nodding my head sagely to things I only half understand.\n",
"I’m the co-founder and CTO of\n",
"Nebula.io\n",
". We’re applying AI to a field where it can make a massive, positive impact: helping people discover their potential and pursue their reason for being. Recruiters use our product today to source, understand, engage and manage talent. I’m previously the founder and CEO of AI startup untapt,\n",
"acquired in 2021\n",
".\n",
"We work with groundbreaking, proprietary LLMs verticalized for talent, we’ve\n",
"patented\n",
"our matching model, and our award-winning platform has happy customers and tons of press coverage.\n",
"Connect\n",
"with me for more!\n",
"May 28, 2025\n",
"Connecting my courses – become an LLM expert and leader\n",
"May 18, 2025\n",
"2025 AI Executive Briefing\n",
"April 21, 2025\n",
"The Complete Agentic AI Engineering Course\n",
"January 23, 2025\n",
"LLM Workshop – Hands-on with Agents – resources\n",
"Navigation\n",
"Home\n",
"Connect Four\n",
"Outsmart\n",
"An arena that pits LLMs against each other in a battle of diplomacy and deviousness\n",
"About\n",
"Posts\n",
"Get in touch\n",
"ed [at] edwarddonner [dot] com\n",
"www.edwarddonner.com\n",
"Follow me\n",
"LinkedIn\n",
"Twitter\n",
"Facebook\n",
"Subscribe to newsletter\n",
"Type your email…\n",
"Subscribe\n"
]
}
],
"source": [
"# Let's try one out. Change the website and add print statements to follow along.\n",
"\n",
@@ -280,7 +354,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 15,
"id": "abdb8417-c5dc-44bc-9bee-2e059d162699",
"metadata": {},
"outputs": [],
@@ -294,7 +368,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 16,
"id": "f0275b1b-7cfe-4f9d-abfa-7650d378da0c",
"metadata": {},
"outputs": [],
@@ -312,10 +386,67 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 17,
"id": "26448ec4-5c00-4204-baec-7df91d11ff2e",
"metadata": {},
"outputs": [],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"You are looking at a website titled Home - Edward Donner\n",
"The contents of this website is as follows; please provide a short summary of this website in markdown. If it includes news or announcements, then summarize these too.\n",
"\n",
"Home\n",
"Connect Four\n",
"Outsmart\n",
"An arena that pits LLMs against each other in a battle of diplomacy and deviousness\n",
"About\n",
"Posts\n",
"Well, hi there.\n",
"I’m Ed. I like writing code and experimenting with LLMs, and hopefully you’re here because you do too. I also enjoy DJing (but I’m badly out of practice), amateur electronic music production (\n",
"very\n",
"amateur) and losing myself in\n",
"Hacker News\n",
", nodding my head sagely to things I only half understand.\n",
"I’m the co-founder and CTO of\n",
"Nebula.io\n",
". We’re applying AI to a field where it can make a massive, positive impact: helping people discover their potential and pursue their reason for being. Recruiters use our product today to source, understand, engage and manage talent. I’m previously the founder and CEO of AI startup untapt,\n",
"acquired in 2021\n",
".\n",
"We work with groundbreaking, proprietary LLMs verticalized for talent, we’ve\n",
"patented\n",
"our matching model, and our award-winning platform has happy customers and tons of press coverage.\n",
"Connect\n",
"with me for more!\n",
"May 28, 2025\n",
"Connecting my courses – become an LLM expert and leader\n",
"May 18, 2025\n",
"2025 AI Executive Briefing\n",
"April 21, 2025\n",
"The Complete Agentic AI Engineering Course\n",
"January 23, 2025\n",
"LLM Workshop – Hands-on with Agents – resources\n",
"Navigation\n",
"Home\n",
"Connect Four\n",
"Outsmart\n",
"An arena that pits LLMs against each other in a battle of diplomacy and deviousness\n",
"About\n",
"Posts\n",
"Get in touch\n",
"ed [at] edwarddonner [dot] com\n",
"www.edwarddonner.com\n",
"Follow me\n",
"LinkedIn\n",
"Twitter\n",
"Facebook\n",
"Subscribe to newsletter\n",
"Type your email…\n",
"Subscribe\n"
]
}
],
"source": [
"print(user_prompt_for(ed))"
]
@@ -341,7 +472,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 18,
"id": "f25dcd35-0cd0-4235-9f64-ac37ed9eaaa5",
"metadata": {},
"outputs": [],
@@ -354,10 +485,18 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 19,
"id": "21ed95c5-7001-47de-a36d-1d6673b403ce",
"metadata": {},
"outputs": [],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Oh, you’re really hitting me with the tough questions! But fine, I’ll play along. 2 + 2 equals 4. Happy now?\n"
]
}
],
"source": [
"# To give you a preview -- calling OpenAI with system and user messages:\n",
"\n",
@@ -375,7 +514,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 20,
"id": "0134dfa4-8299-48b5-b444-f2a8c3403c88",
"metadata": {},
"outputs": [],
@@ -391,10 +530,24 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 21,
"id": "36478464-39ee-485c-9f3f-6a4e458dbc9c",
"metadata": {},
"outputs": [],
"outputs": [
{
"data": {
"text/plain": [
"[{'role': 'system',\n",
" 'content': 'You are an assistant that analyzes the contents of a website and provides a short summary, ignoring text that might be navigation related. Respond in markdown.'},\n",
" {'role': 'user',\n",
" 'content': 'You are looking at a website titled Home - Edward Donner\\nThe contents of this website is as follows; please provide a short summary of this website in markdown. If it includes news or announcements, then summarize these too.\\n\\nHome\\nConnect Four\\nOutsmart\\nAn arena that pits LLMs against each other in a battle of diplomacy and deviousness\\nAbout\\nPosts\\nWell, hi there.\\nI’m Ed. I like writing code and experimenting with LLMs, and hopefully you’re here because you do too. I also enjoy DJing (but I’m badly out of practice), amateur electronic music production (\\nvery\\namateur) and losing myself in\\nHacker News\\n, nodding my head sagely to things I only half understand.\\nI’m the co-founder and CTO of\\nNebula.io\\n. We’re applying AI to a field where it can make a massive, positive impact: helping people discover their potential and pursue their reason for being. Recruiters use our product today to source, understand, engage and manage talent. I’m previously the founder and CEO of AI startup untapt,\\nacquired in 2021\\n.\\nWe work with groundbreaking, proprietary LLMs verticalized for talent, we’ve\\npatented\\nour matching model, and our award-winning platform has happy customers and tons of press coverage.\\nConnect\\nwith me for more!\\nMay 28, 2025\\nConnecting my courses – become an LLM expert and leader\\nMay 18, 2025\\n2025 AI Executive Briefing\\nApril 21, 2025\\nThe Complete Agentic AI Engineering Course\\nJanuary 23, 2025\\nLLM Workshop – Hands-on with Agents – resources\\nNavigation\\nHome\\nConnect Four\\nOutsmart\\nAn arena that pits LLMs against each other in a battle of diplomacy and deviousness\\nAbout\\nPosts\\nGet in touch\\ned [at] edwarddonner [dot] com\\nwww.edwarddonner.com\\nFollow me\\nLinkedIn\\nTwitter\\nFacebook\\nSubscribe to newsletter\\nType your email…\\nSubscribe'}]"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Try this out, and then try for a few more websites\n",
"\n",
@@ -411,7 +564,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 22,
"id": "905b9919-aba7-45b5-ae65-81b3d1d78e34",
"metadata": {},
"outputs": [],
@@ -429,17 +582,28 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 23,
"id": "05e38d41-dfa4-4b20-9c96-c46ea75d9fb5",
"metadata": {},
"outputs": [],
"outputs": [
{
"data": {
"text/plain": [
"'# Summary of Edward Donner\\'s Website\\n\\nThis website belongs to Ed Donner, a technology enthusiast focused on coding and experimenting with large language models (LLMs). He is the co-founder and CTO of Nebula.io, a company leveraging AI to help individuals discover their potential in the talent space. Ed has a background in AI startups, having previously founded untapt, which was acquired in 2021.\\n\\n## Key Features:\\n- **Personal Interests**: Ed enjoys DJing, electronic music production, and engaging with content on Hacker News.\\n- **Professional Focus**: Nebula.io specializes in developing proprietary LLMs tailored for talent matching and management.\\n\\n## Recent Announcements:\\n- **May 28, 2025**: Course introduction on becoming an LLM expert and leader.\\n- **May 18, 2025**: Announcement of a 2025 AI Executive Briefing.\\n- **April 21, 2025**: Launch of the \"Complete Agentic AI Engineering Course\".\\n- **January 23, 2025**: A workshop titled \"LLM Workshop – Hands-on with Agents\" offering related resources. \\n\\nThe website encourages visitors to connect with Ed and subscribe to his newsletter for updates and insights.'"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"summarize(\"https://edwarddonner.com\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 24,
"id": "3d926d59-450e-4609-92ba-2d6f244f1342",
"metadata": {},
"outputs": [],
@@ -453,10 +617,34 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 25,
"id": "3018853a-445f-41ff-9560-d925d1774b2f",
"metadata": {},
"outputs": [],
"outputs": [
{
"data": {
"text/markdown": [
"# Summary of Edward Donner's Website\n",
"\n",
"Edward Donner's personal website showcases his interests in coding, LLMs (Large Language Models), and electronic music. He is the co-founder and CTO of Nebula.io, a company focused on using AI to enhance talent discovery and management. He has previously founded another AI startup, untapt, which was acquired in 2021.\n",
"\n",
"## Recent Announcements:\n",
"\n",
"- **May 28, 2025**: Launch of courses aimed at helping individuals become experts and leaders in LLM technology.\n",
"- **May 18, 2025**: Announcement of the 2025 AI Executive Briefing.\n",
"- **April 21, 2025**: Introductory course titled \"The Complete Agentic AI Engineering Course.\"\n",
"- **January 23, 2025**: LLM Workshop offering hands-on resources for working with agents.\n",
"\n",
"The website encourages connection and collaboration, inviting visitors to reach out and engage."
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"display_summary(\"https://edwarddonner.com\")"
]
@@ -593,7 +781,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"display_name": "llms",
"language": "python",
"name": "python3"
},
@@ -607,7 +795,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.12"
"version": "3.11.13"
}
},
"nbformat": 4,
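Most of the `week1/day1.ipynb` hunks above record execution counts and cell outputs, not source changes. If that churn is not intended, outputs can be cleared before committing. The following is a minimal sketch, assuming the standard nbformat 4 layout with a top-level `cells` list. `strip_outputs` and `strip_notebook_file` are hypothetical helpers, not part of this commit.

```python
import json


def strip_outputs(notebook: dict) -> dict:
    """Clear outputs and execution counts from code cells, in place."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return notebook


def strip_notebook_file(path: str) -> None:
    """Rewrite the notebook at path with outputs removed."""
    with open(path, encoding="utf-8") as f:
        notebook = json.load(f)
    strip_outputs(notebook)
    with open(path, "w", encoding="utf-8") as f:
        # The indent and trailing newline are assumptions chosen to keep diffs small.
        json.dump(notebook, f, indent=1, ensure_ascii=False)
        f.write("\n")
```

Markdown cells and notebook-level metadata are left untouched, so changes such as the kernel display name would still show up in a diff.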