{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "603cd418-504a-4b4d-b1c3-be04febf3e79",
   "metadata": {},
   "source": [
    "# Article Title Generator (V2)\n",
    "\n",
    "Summarization use case in which the user provides an article, which the LLM analyzes to suggest an SEO-optimized title.\n",
    "\n",
    "**NOTES**:\n",
    "\n",
    "1. This version supports website scraping using Selenium (based on the code from **/week1/community-\n",
    "   contributions/day1-webscraping-selenium-for-javascript.ipynb** - thanks for the contribution!).\n",
    "2. Leverages streaming (OpenAI only).\n",
    "3. The following models were configured:\\\n",
    "   \n",
    "   a. OpenAI gpt-4o-mini\\\n",
    "   b. Llama llama3.2\\\n",
    "   c. Deepseek deepseek-r1:1.5b\\\n",
    "\n",
    "   It is possible to configure additional models by adding the new model to the MODELS dictionary and its\n",
    "   initialization to the CLIENTS dictionary (see the illustrative cell after the constants below). Then, call the model with --> ***answer =\n",
    "   get_title('NEW_MODEL')***.\n",
    "4. Improved system_prompt to provide specific SEO best practices to adopt during title generation.\n",
    "5. Rephrased the system_prompt to ensure the model provides a single title (not a list of suggestions).\n",
    "6. Includes a function to remove unneeded thinking/reasoning output from the model response (Deepseek).\n",
    "7. Users are encouraged to assess and rank the suggested titles using any online headline analyzer tool.\n",
    "   Example: https://www.isitwp.com/headline-analyzer/. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "115004a8-747a-4954-9580-1ed548f80336",
   "metadata": {},
   "outputs": [],
   "source": [
    "# install required libraries if they were not part of the requirements.txt\n",
    "!pip install selenium\n",
    "!pip install undetected-chromedriver"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e773daa6-d05e-49bf-ad8e-a8ed4882b77e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# confirming Llama is loaded\n",
    "!ollama pull llama3.2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "279b0c00-9bb0-4c7f-9c6d-aa0b108274b9",
   "metadata": {},
   "outputs": [],
   "source": [
    "# imports\n",
    "import os\n",
    "from dotenv import load_dotenv\n",
    "from IPython.display import Markdown, display, update_display\n",
    "from openai import OpenAI\n",
    "import undetected_chromedriver as uc\n",
    "from selenium.webdriver.common.by import By\n",
    "from selenium.webdriver.support.ui import WebDriverWait\n",
    "from selenium.webdriver.support import expected_conditions as EC\n",
    "import time\n",
    "from bs4 import BeautifulSoup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d4730d8d-3e20-4f3c-a4ff-ed2ac0a8aa27",
   "metadata": {},
   "outputs": [],
   "source": [
    "# set environment variables for OpenAI\n",
    "load_dotenv(override=True)\n",
    "api_key = os.getenv('OPENAI_API_KEY')\n",
    "\n",
    "# validate API Key\n",
    "if not api_key:\n",
    "    raise ValueError(\"No API key was found! Please check the .env file.\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1abbb826-de66-498c-94d8-33369ad01885",
   "metadata": {},
   "outputs": [],
   "source": [
    "# constants\n",
    "MODELS = { 'GPT': 'gpt-4o-mini', \n",
    "           'LLAMA': 'llama3.2', \n",
    "           'DEEPSEEK': 'deepseek-r1:1.5b'\n",
    "         }\n",
    "\n",
    "CLIENTS = { 'GPT': OpenAI(), \n",
    "            'LLAMA': OpenAI(base_url='http://localhost:11434/v1', api_key='ollama'),\n",
    "            'DEEPSEEK': OpenAI(base_url='http://localhost:11434/v1', api_key='ollama') \n",
    "          }\n",
    "\n",
    "# path to Chrome\n",
    "CHROME_PATH = \"C:/Program Files/Google/Chrome/Application/chrome.exe\""
   ]
  },
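  {
   "cell_type": "markdown",
   "id": "4f8a2c6e-9b1d-4e3a-8c5f-7d2b0a9e6c41",
   "metadata": {},
   "source": [
    "**Illustrative example** (referenced in the NOTES above): how an additional Ollama-served model could be registered. The names *QWEN* and *qwen2.5* are placeholder assumptions - substitute any model you have pulled locally."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9c3e7b5a-2d4f-4a6c-8e1b-0f5d7a9c3e2b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# illustrative sketch only - 'QWEN'/'qwen2.5' are placeholders for any locally pulled Ollama model\n",
    "MODELS['QWEN'] = 'qwen2.5'\n",
    "CLIENTS['QWEN'] = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')\n",
    "\n",
    "# once registered, the new model is called like the others, e.g.:\n",
    "# display_title(model='QWEN')"
   ]
  },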
  {
   "cell_type": "markdown",
   "id": "6f490fe4-32d5-41f3-890d-ecf4e5e01dd4",
   "metadata": {},
   "source": [
    "**Webcrawler** (based on the code from __/week1/community-contributions/day1-webscraping-selenium-for-javascript.ipynb__)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c2a1cf7a-044f-4a9c-b76e-8f112d384550",
   "metadata": {},
   "outputs": [],
   "source": [
    "class WebsiteCrawler:\n",
    "    def __init__(self, url, wait_time=20, chrome_path=None):\n",
    "        \"\"\"\n",
    "        Initialize the WebsiteCrawler using Selenium to scrape JavaScript-rendered content.\n",
    "        \"\"\"\n",
    "        self.url = url\n",
    "        self.wait_time = wait_time\n",
    "\n",
    "        options = uc.ChromeOptions()\n",
    "        options.add_argument(\"--disable-gpu\")\n",
    "        options.add_argument(\"--no-sandbox\")\n",
    "        options.add_argument(\"--disable-dev-shm-usage\")\n",
    "        options.add_argument(\"--disable-blink-features=AutomationControlled\")\n",
    "        # options.add_argument(\"--headless=new\")  # For Chrome >= 109 - unreliable on my end!\n",
    "        options.add_argument(\"start-maximized\")\n",
    "        options.add_argument(\n",
    "            \"user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36\"\n",
    "        )\n",
    "        if chrome_path:\n",
    "            options.binary_location = chrome_path\n",
    "\n",
    "        self.driver = uc.Chrome(options=options)\n",
    "\n",
    "        try:\n",
    "            # Load the URL\n",
    "            self.driver.get(url)\n",
    "\n",
    "            # Wait for Cloudflare or similar checks\n",
    "            time.sleep(10)\n",
    "\n",
    "            # Ensure the main content is loaded\n",
    "            WebDriverWait(self.driver, self.wait_time).until(\n",
    "                EC.presence_of_element_located((By.TAG_NAME, \"main\"))\n",
    "            )\n",
    "\n",
    "            # Extract the main content\n",
    "            main_content = self.driver.find_element(By.CSS_SELECTOR, \"main\").get_attribute(\"outerHTML\")\n",
    "\n",
    "            # Parse with BeautifulSoup\n",
    "            soup = BeautifulSoup(main_content, \"html.parser\")\n",
    "            self.title = self.driver.title if self.driver.title else \"No title found\"\n",
    "            self.text = soup.get_text(separator=\"\\n\", strip=True)\n",
    "\n",
    "        except Exception as e:\n",
    "            print(f\"Error occurred: {e}\")\n",
    "            self.title = \"Error occurred\"\n",
    "            self.text = \"\"\n",
    "\n",
    "        finally:\n",
    "            self.driver.quit()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "592d8f86-fbf7-4b16-a69d-468030d72dc4",
   "metadata": {},
   "source": [
    "### Prompts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1914afad-dbd8-4c1f-8e68-80b0e5d743a9",
   "metadata": {},
   "outputs": [],
   "source": [
    "# system prompt\n",
    "system_prompt = \"\"\"\n",
    "You are an experienced SEO-focused copywriter. The user will provide an article, and your task is to analyze its content and generate a single, most effective, keyword-optimized title to maximize SEO performance.\n",
    "\n",
    "Instructions:\n",
    "Ignore irrelevant content, such as the current title (if any), navigation menus, advertisements, or unrelated text.\n",
    "Prioritize SEO best practices, considering:\n",
    "Keyword relevance and search intent (informational, transactional, etc.).\n",
    "Readability and engagement.\n",
    "Avoiding keyword stuffing.\n",
    "Ensure conciseness and clarity, keeping the title under 60 characters when possible for optimal SERP display.\n",
    "Use a compelling structure that balances informativeness and engagement, leveraging formats like:\n",
    "Listicles (\"10 Best Strategies for…\")\n",
    "How-to guides (\"How to Boost…\")\n",
    "Questions (\"What Is the Best Way to…\")\n",
    "Power words to enhance click-through rates (e.g., \"Proven,\" \"Ultimate,\" \"Essential\").\n",
    "Provide only one single, best title—do not suggest multiple options.\n",
    "Limit the answer to the following Response Format (Markdown):\n",
    "Optimized Title: [Provide only one title here]\n",
    "Justification: [Explain why this title is effective for SEO]\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b0486867-6d38-4cb5-91d4-fb60952c3a9b",
   "metadata": {},
   "source": [
    "**Provide the article URL and get its content for analysis**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ddd76319-13ce-480b-baa7-cab6a5c88168",
   "metadata": {},
   "outputs": [],
   "source": [
    "# article url - change to any other article URL\n",
    "article_url = \"https://searchengineland.com/seo-trends-2025-447745\"\n",
    "\n",
    "# get article content\n",
    "article = WebsiteCrawler(url=article_url, chrome_path=CHROME_PATH)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "176cfac7-5e6d-4d4a-a1c4-1b63b60de1f7",
   "metadata": {},
   "outputs": [],
   "source": [
    "# user prompt\n",
    "user_prompt = \"\"\"\n",
    "Following is the article to be analyzed in order to suggest a title. Limit the answer to the following Response Format (Markdown): \n",
    "Optimized Title: [Provide only one title here]\n",
    "Justification: [Explain why this title is effective for SEO].\n",
    "\"\"\"\n",
    "\n",
    "# append the scraped article text to the prompt\n",
    "user_prompt = f\"{user_prompt} {article.text}\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c45fc7d7-08c9-4e34-b427-b928a219bb94",
   "metadata": {},
   "outputs": [],
   "source": [
    "# message list\n",
    "messages = [\n",
    "    {\"role\": \"system\", \"content\": system_prompt},\n",
    "    {\"role\": \"user\", \"content\": user_prompt}\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f67b881f-1040-4cf7-82c5-e85f4c0bd252",
   "metadata": {},
   "outputs": [],
   "source": [
    "# get suggested title\n",
    "def get_title(model, **kwargs):\n",
    "    # stream the response when requested (used for GPT)\n",
    "    if 'stream' in kwargs:\n",
    "        response = CLIENTS[model].chat.completions.create(\n",
    "            model=MODELS[model],\n",
    "            messages=messages,\n",
    "            stream=kwargs['stream']\n",
    "        )\n",
    "    else:\n",
    "        response = CLIENTS[model].chat.completions.create(\n",
    "            model=MODELS[model],\n",
    "            messages=messages,\n",
    "        )\n",
    "\n",
    "    return response"
   ]
  },
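  {
   "cell_type": "markdown",
   "id": "2b6d9f3c-7e1a-4c8b-9d0e-5a4f8c2b7d1e",
   "metadata": {},
   "source": [
    "Optional illustration of how **get_title** is consumed (assumes Ollama is running locally, as the cells below do): a non-streaming call returns a completion object whose text is in *choices[0].message.content*, while *stream=True* returns an iterable of chunks with the text in *choices[0].delta.content*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8e5c1a7f-3b9d-4f2e-8a6c-0d7b4e9f1a3c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# optional: a plain (non-streaming) call against the local Llama model;\n",
    "# the cells below do the same thing via display_title\n",
    "raw = get_title('LLAMA')\n",
    "print(raw.choices[0].message.content)"
   ]
  },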
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8988d6ff-076a-4eae-baf4-26a8d6a2bc44",
   "metadata": {},
   "outputs": [],
   "source": [
    "# filter verbose output from the model response - e.g. Deepseek's reasoning/thinking text\n",
    "def filter_response(response):\n",
    "    # Find the last occurrence of 'Optimized Title:' to avoid displaying the reasoning text\n",
    "    substring = 'Optimized Title:'\n",
    "    start = response.rfind(substring)\n",
    "    if start > -1:\n",
    "        response = response[start:]\n",
    "\n",
    "        # insert a line break to preserve the format\n",
    "        response = response.replace(\"**Justification:**\", \"\\n**Justification:**\")\n",
    "\n",
    "    # if the marker is not found, the response is returned unchanged\n",
    "    return response"
   ]
  },
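  {
   "cell_type": "markdown",
   "id": "5d2f8a1b-6c3e-4b9d-8f0a-7e4c1d9b2a6f",
   "metadata": {},
   "source": [
    "Quick sanity check of **filter_response** on a synthetic Deepseek-style response (illustrative only - real model output varies)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3a9b7d5e-1f4c-4e8a-9c2d-6b0e8f4a7c5d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# synthetic example: reasoning text followed by the expected answer format\n",
    "sample_response = (\n",
    "    \"<think>Considering the main keyword and search intent...</think>\\n\"\n",
    "    \"Optimized Title: Example Title Here\\n\"\n",
    "    \"Justification: It targets the main keyword and stays under 60 characters.\"\n",
    ")\n",
    "\n",
    "# everything before the last 'Optimized Title:' is stripped\n",
    "print(filter_response(sample_response))"
   ]
  },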
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0e9e99cf-5e25-4a1f-ab11-a2255e318671",
   "metadata": {},
   "outputs": [],
   "source": [
    "# display suggested title\n",
    "def display_title(model):\n",
    "    display(Markdown(f\"### {model} (___{MODELS[model]}___) Answer\\n\\n_______\"))\n",
    "\n",
    "    response = \"\"\n",
    "\n",
    "    if model == 'GPT':\n",
    "        # stream the response and render it as it arrives\n",
    "        display_handle = display(Markdown(\"\"), display_id=True)\n",
    "        for chunk in get_title(model=model, stream=True):\n",
    "            response += chunk.choices[0].delta.content or ''\n",
    "            response = (\n",
    "                response.replace(\"```\",\"\")\n",
    "                .replace(\"markdown\", \"\")\n",
    "                .replace(\"Optimized Title:\", \"**Optimized Title:**\")\n",
    "                .replace(\"Justification:\", \"**Justification:**\")\n",
    "            )\n",
    "            update_display(Markdown(response), display_id=display_handle.display_id)\n",
    "    else:\n",
    "        response = get_title(model=model)\n",
    "        response = response.choices[0].message.content\n",
    "        response = filter_response(response)\n",
    "        response = (\n",
    "            response.replace(\"Optimized Title:\", \"**Optimized Title:**\")\n",
    "            .replace(\"Justification:\", \"**Justification:**\")\n",
    "        )\n",
    "        display(Markdown(response))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "947b42ed-5b43-486d-8af3-e5b671c1fd0e",
   "metadata": {},
   "source": [
    "### Get OpenAI Suggested Title"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eb6f66e3-ab99-4f76-9358-896cb43c1fa1",
   "metadata": {},
   "outputs": [],
   "source": [
    "# get and display the OpenAI-suggested title\n",
    "display_title(model='GPT')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "70073ebf-a00a-416b-854d-642d450cd99b",
   "metadata": {},
   "source": [
    "### Get Llama Suggested Title"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "caa190bb-de5f-45cc-b671-5d62688f7b25",
   "metadata": {},
   "outputs": [],
   "source": [
    "# get and display the Llama-suggested title\n",
    "display_title(model='LLAMA')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "811edc4f-20e2-482d-ac89-fae9d1b70bed",
   "metadata": {},
   "source": [
    "### Get Deepseek Suggested Title"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "082628e4-ff4c-46dd-ae5f-76578eb017ad",
   "metadata": {},
   "outputs": [],
   "source": [
    "# get and display the Deepseek-suggested title\n",
    "display_title(model='DEEPSEEK')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7fc404a6-3a91-4c09-89de-867d3d69b4b2",
   "metadata": {
    "jp-MarkdownHeadingCollapsed": true
   },
   "source": [
    "### Observations\n",
    "\n",
    "1. **Selenium:** The headless option (__options.add_argument(\"--headless=new\")__), while ideal for speeding up the scraping process, presented problems when scraping several websites (including openai.com and canva.com).\n",
    "2. **Deepseek challenges:**\\\n",
    "   a. It always returns its thinking/reasoning output, which, while helpful for understanding how it works, is not always\n",
    "   required, as in this example code. A new function (**filter_response**) was created to remove this additional output.\\\n",
    "   b. Its responses are unreliable; it sometimes returns the required response format instead of the\n",
    "   actual response. For example, for the title, it may return:\n",
    "   \n",
    "   **Optimized Title:** \\[The user wants the suggested title here]\n",
    "   \n",
    "### Suggested future improvements\n",
    "\n",
    "1. Add logic that would allow each model to assess the recommendations from the different models and\n",
    "   select the best among them.\n",
    "2. Add logic to leverage an API (if available) that automatically assesses the suggested titles."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1af8260b-5ba1-4eeb-acd0-02de537b1bf4",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}