{
"cells": [
{
"cell_type": "code",
"execution_count": 15,
"id": "8ce13728-0040-43cc-82cd-e10c838ef71c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"🌍 Detected language: PT\n",
"🔗 Preview of extracted text:\n",
"\n",
"ITASAT2 irá atuar para aplicações científicas e de defesa\n",
"Publicado em 14/04/2025 - 14h15\n",
"O Instituto Tecnológico de Aeronáutica (ITA) realizou, entre os dias 17 e 19 de março, a Revisão Preliminar de Projeto (PDR) do ITASAT 2, novo microssatélite em desenvolvimento por pesquisadores do Centro Espacial ITA (CEI). A atividade representa uma etapa importante dos estudos e contou com a presença de instituições parceiras, tanto do Brasil quanto do exterior.\n",
"Participaram do encontro representantes do\n",
"...\n",
"\n",
"Amount of words: 526\n",
"\n",
"\n",
"📊 Usage Report\n",
"🧾 Prompt tokens: 927\n",
"🧠 Completion tokens: 309\n",
"🔢 Total tokens: 1236\n",
"💰 Total cost: $0.000927\n",
"\n",
"\n",
"\n"
]
},
{
"data": {
"text/markdown": [
"# 📝 Summary\n",
"\n",
"The ITA (Instituto Tecnológico de Aeronáutica) is working on the ITASAT 2 project, a new microsatellite geared towards scientific and defense applications! 🌟 This initiative was highlighted at the Preliminary Design Review (PDR) held from March 17 to 19, with participation from notable organizations such as NASA and the Brazilian Space Agency (AEB). This is a fantastic collaboration that spans both domestic and international partnerships. How exciting is that?\n",
"\n",
"ITASAT 2 will consist of a constellation of three CubeSats focusing on monitoring the Earth's ionosphere and assessing plasma bubble formation. Interestingly, it also has defense applications such as geolocating radio frequency sources and optical identification of uncooperative vessels, a crucial capability for maritime security!\n",
"\n",
"The PDR showcased the team's technical and managerial capabilities, receiving unanimous approval to proceed with the project. It's great to see such thorough preparation reflecting the dedication of the ITA team!\n",
"\n",
"The CubeSats themselves are cubic nano- or microsatellites, and the ITASAT 2 is of the 16U variety, meaning it's made up of 16 units measuring 10 cm each. It's amazing how compact these technologies can be! Additionally, the CEI is also developing another CubeSat called SelenITA, which will contribute to NASA's Artemis mission to study the Moon! 🌕\n",
"\n",
"Keep an eye on this remarkable project as it continues to develop. The future of space exploration and defense technology looks bright! 🚀"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Import Libraries\n",
"import os\n",
"import requests\n",
"from openai import OpenAI\n",
"\n",
"from bs4 import BeautifulSoup\n",
"from langdetect import detect, LangDetectException\n",
"from dotenv import load_dotenv\n",
"\n",
"from IPython.display import Markdown, display\n",
"\n",
"# Load .env variables\n",
"load_dotenv()\n",
"openai_api_key = os.getenv('OPENAI_API_KEY')\n",
"\n",
"# Check the key\n",
"if not openai_api_key:\n",
" raise ValueError(\"⚠️ OPENAI_API_KEY not found in .env file.\")\n",
"\n",
"# Create the OpenAI client (uses OPENAI_API_KEY from the environment)\n",
"openai = OpenAI()\n",
"\n",
"# Class to extract, process, and summarize text from a given URL\n",
"class WebPageSummarizer:\n",
" \"\"\"\n",
" Extracts, processes, and summarizes the text of a web page using BeautifulSoup,\n",
" then reports token usage and estimated cost for the selected model.\n",
" \"\"\"\n",
" def __init__(self, url: str, summary_detail: str = \"high\", show_summary: bool = True, language_of_reference: str = \"English\", model: str = \"gpt-4o-mini\") -> None:\n",
"\n",
" # Initial summarizer settings\n",
" self.url = url\n",
" self.model = model\n",
" self.show_summary = show_summary\n",
" self.summary_detail = summary_detail\n",
" self.language_of_reference = language_of_reference\n",
" self.language_code_map = {\n",
" \"english\": \"en\",\n",
" \"portuguese\": \"pt\",\n",
" \"spanish\": \"es\",\n",
" \"french\": \"fr\",\n",
" \"german\": \"de\",\n",
" \"italian\": \"it\",\n",
" \"japanese\": \"ja\",\n",
" \"chinese\": \"zh\",\n",
" \"korean\": \"ko\",\n",
" }\n",
" \n",
" # Approximate USD cost per 1,000 tokens; verify against current OpenAI pricing\n",
" self.model_pricing = {\n",
" \"gpt-4o-mini\": {\"input\": 0.0005, \"output\": 0.0015},\n",
" \"gpt-4o\": {\"input\": 0.005, \"output\": 0.015},\n",
" \"gpt-4-turbo\": {\"input\": 0.01, \"output\": 0.03},\n",
" \"gpt-4\": {\"input\": 0.03, \"output\": 0.06}, # Rarely used now\n",
" \"gpt-3.5-turbo\": {\"input\": 0.0005, \"output\": 0.0015}\n",
" }\n",
"\n",
" self.headers = {\n",
" \"User-Agent\": \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 \"\n",
" \"(KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36\"\n",
" }\n",
"\n",
" if self.summary_detail not in [\"high\", \"low\"]:\n",
" raise ValueError('Please select summary_detail as either \"high\" or \"low\".')\n",
"\n",
" def __extract_text(self):\n",
" response = requests.get(self.url, headers=self.headers, timeout=30)\n",
" if response.status_code != 200:\n",
" raise Exception(f\"Failed to fetch page. Status code: {response.status_code}\")\n",
" \n",
" soup = BeautifulSoup(response.text, \"html.parser\")\n",
" \n",
" # Try to extract meaningful content\n",
" paragraphs = soup.find_all(\"p\")\n",
" \n",
" # Join all paragraph text\n",
" self.text = \"\\n\".join([p.get_text() for p in paragraphs if p.get_text().strip() != \"\"])\n",
"\n",
" # Cap the amount of text sent to the model\n",
" max_words = 7000\n",
" if len(self.text.split()) > max_words:\n",
" self.text = \" \".join(self.text.split()[:max_words])\n",
" \n",
" def __detect_language(self):\n",
" # Detect language\n",
" try:\n",
" self.language_url = detect(self.text)\n",
" except LangDetectException:\n",
" self.language_url = \"unknown\"\n",
"\n",
" # Normalize and resolve target language code\n",
" target_language_name = self.language_of_reference.lower().strip()\n",
" self.target_language_code = self.language_code_map.get(target_language_name)\n",
" \n",
" if not self.target_language_code:\n",
" raise ValueError(f\"❌ Unsupported language: {self.language_of_reference}. Please use one of: {list(self.language_code_map.keys())}\")\n",
"\n",
" print(f\"🌍 Detected language: {self.language_url.upper()}\")\n",
" \n",
" if self.show_summary:\n",
" print(\"🔗 Preview of extracted text:\\n\")\n",
" print(self.text[:500] + \"\\n...\\n\")\n",
" print(f\"Amount of words: {len(self.text.split())}\\n\")\n",
"\n",
" def __calculate_cost(self, prompt_tokens: int, completion_tokens: int) -> float:\n",
" \"\"\"\n",
" Calculates total cost in USD based on selected model.\n",
" \"\"\"\n",
" pricing = self.model_pricing.get(self.model)\n",
" if pricing is None:\n",
" raise ValueError(f\"\"\"Pricing not available for model \"{self.model}\". Add it to model_pricing.\"\"\")\n",
" \n",
" input_cost = (prompt_tokens / 1000) * pricing[\"input\"]\n",
" output_cost = (completion_tokens / 1000) * pricing[\"output\"]\n",
" return input_cost + output_cost\n",
"\n",
" def summarize(self) -> str:\n",
" \"\"\"\n",
" Extracts the page text, detects its language, and returns a markdown summary.\n",
" \"\"\"\n",
" self.__extract_text()\n",
" self.__detect_language()\n",
" \n",
" # Prompt for system definition\n",
" self.system_prompt = f\"\"\" \n",
" You are an assistant that analyzes the contents of a website and provides a summary. \n",
" Providing a {self.summary_detail} level of summary detail is IMPORTANT.\n",
" If you find text that looks navigation-related or ad-related, please ignore it. Respond in markdown.\n",
" Please start your summary with the title \"📝 Summary\".\n",
" \n",
" Please keep an enthusiastic tone, adding brief comments with extra knowledge where relevant.\n",
" \"\"\"\n",
"\n",
" self.content = f\"\"\"The text to summarize is as follows: {self.text}\"\"\"\n",
"\n",
" if self.language_url != self.target_language_code:\n",
" self.system_prompt = f\"\"\"The website content is in {self.language_url.upper()}. Please first translate it to {self.language_of_reference}. \n",
" {self.system_prompt.strip()}\n",
" \"\"\"\n",
"\n",
" response = openai.chat.completions.create(model=self.model, messages=[{\"role\":\"system\", \"content\":self.system_prompt}, \n",
" {\"role\": \"user\", \"content\":self.content}])\n",
"\n",
" # Cost calculation and usage report\n",
" usage = response.usage\n",
" total_cost = self.__calculate_cost(usage.prompt_tokens, usage.completion_tokens)\n",
" \n",
" print(\"\\n📊 Usage Report\")\n",
" print(f\"🧾 Prompt tokens: {usage.prompt_tokens}\")\n",
" print(f\"🧠 Completion tokens: {usage.completion_tokens}\")\n",
" print(f\"🔢 Total tokens: {usage.total_tokens}\")\n",
" print(f\"💰 Total cost: ${total_cost:.6f}\\n\\n\\n\")\n",
"\n",
" return response.choices[0].message.content\n",
"\n",
"\n",
"web_page_summarizer = WebPageSummarizer(\"http://www.ita.br/noticias/revisodeprojetodonovomicrossatlitedoitaaprovada\", summary_detail = \"low\")\n",
"display(Markdown(web_page_summarizer.summarize()))"
]
},
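{
"cell_type": "markdown",
"id": "c0ffee00-usage-sketch-example",
"metadata": {},
"source": [
"A minimal usage sketch of the `WebPageSummarizer` class above. This cell is illustrative only: the URL is a placeholder and the keyword values are assumptions, not part of the original run.\n",
"\n",
"```python\n",
"# Hypothetical example: summarize another page with high detail, no text preview\n",
"summarizer = WebPageSummarizer(\n",
"    \"https://example.com/article\",  # placeholder URL\n",
"    summary_detail=\"high\",\n",
"    show_summary=False,\n",
")\n",
"display(Markdown(summarizer.summarize()))\n",
"```"
]
},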
{
"cell_type": "code",
"execution_count": null,
"id": "af5a186a-bb25-4cf4-a6d2-6034cd493bc4",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}