# 🥊 Summarization Battle: Ollama vs. OpenAI Judge
This mini-project pits multiple local LLMs (via Ollama) against each other in a web summarization contest, with an OpenAI model serving as the impartial judge.
It automatically fetches web articles, summarizes them with several models, and evaluates the results on coverage, faithfulness, clarity, and conciseness.
## 🚀 Features
- **Fetch Articles** – Download and clean text content from given URLs.
- **Summarize with Ollama** – Run multiple local models (e.g., `llama3.2`, `phi3`, `deepseek-r1`) via the Ollama API.
- **Judge with OpenAI** – Use `gpt-4o-mini` (or any other OpenAI model) to score summaries.
- **Battle Results** – Collect JSON results with per-model scores, rationales, and winners.
- **Timeout Handling & Warmup** – Keeps models loaded with `keep_alive` to avoid cold-start delays.
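The summarization step boils down to one call per model against Ollama's local REST API, with `keep_alive` set so the model stays resident between calls. A minimal sketch, assuming Ollama is serving on its default port — the helper names and prompt wording here are illustrative, not the script's actual code:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint


def build_summarize_request(model: str, article_text: str) -> dict:
    """Build the JSON payload for one summarization call (illustrative)."""
    return {
        "model": model,  # e.g. "llama3.2:latest"
        "prompt": f"Summarize the following article:\n\n{article_text}",
        "stream": False,       # return one complete response, not a token stream
        "keep_alive": "5m",    # keep the model warm to avoid cold-start delays
    }


def summarize(model: str, article_text: str) -> str:
    """POST the payload to a running Ollama server and return the summary text."""
    payload = json.dumps(build_summarize_request(model, article_text)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Setting `keep_alive` on every request is what implements the warmup feature above: the first call per model pays the load cost, later calls do not.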
## 📂 Project Structure
```
.
├── urls.txt             # Dictionary of categories → URLs
├── battle_results.json  # Summarization + judging results
├── main.py              # Main script
├── requirements.txt     # Dependencies
└── README.md            # You are here
```
## ⚙️ Installation
- Clone the repo:

  ```bash
  git clone https://github.com/khashayarbayati1/wikipedia-summarization-battle.git
  cd wikipedia-summarization-battle
  ```
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

  Minimal requirements:

  ```
  requests
  beautifulsoup4
  python-dotenv
  openai>=1.0.0
  httpx
  ```
- Install Ollama & models:
  - Install Ollama if not already installed.
  - Pull the models you want:

    ```bash
    ollama pull llama3.2:latest
    ollama pull deepseek-r1:1.5b
    ollama pull phi3:latest
    ```
- Set up your OpenAI API key by creating a `.env` file with:

  ```
  OPENAI_API_KEY=sk-proj-xxxx...
  ```
## ▶️ Usage
- Put your URL dictionary in `urls.txt`, e.g.:

  ```json
  {
    "sports": "https://en.wikipedia.org/wiki/Sabermetrics",
    "Politics": "https://en.wikipedia.org/wiki/Separation_of_powers",
    "History": "https://en.wikipedia.org/wiki/Industrial_Revolution"
  }
  ```
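Despite the `.txt` extension, the file holds a plain JSON object mapping category names to article URLs, so loading it is a single `json` call. A minimal sketch (the helper name is illustrative):

```python
import json
from pathlib import Path


def load_urls(path: str = "urls.txt") -> dict:
    """Read the category → URL mapping from the JSON-formatted urls.txt."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```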
- Run the script:

  ```bash
  python main.py
  ```
- Results are written to `battle_results.json` and printed in the terminal.
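Under the hood, the judging step asks the OpenAI model to score every summary on the four criteria named in the intro. A hedged sketch of assembling that prompt — the wording and function name are illustrative, not the script's exact prompt:

```python
# The four criteria the judge scores on (from the project description).
CRITERIA = ["coverage", "faithfulness", "clarity", "conciseness"]


def build_judge_prompt(article: str, summaries: dict) -> str:
    """Assemble one judging prompt covering every model's summary (illustrative)."""
    blocks = "\n\n".join(f"### {model}\n{text}" for model, text in summaries.items())
    return (
        "You are an impartial judge. Score each summary from 1-5 on "
        + ", ".join(CRITERIA)
        + ", give a one-sentence rationale for each, and name a single winner. "
        "Reply as JSON.\n\n"
        f"ARTICLE:\n{article}\n\nSUMMARIES:\n{blocks}"
    )
```

The resulting string would then be sent to `gpt-4o-mini` (or whichever OpenAI model is configured) via the OpenAI client.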
## 🏆 Example Results
Sample output (excerpt):
```json
{
  "category": "sports",
  "url": "https://en.wikipedia.org/wiki/Sabermetrics",
  "scores": {
    "llama3.2:latest": { "score": 4, "rationale": "Covers the main points..." },
    "deepseek-r1:1.5b": { "score": 3, "rationale": "Some inaccuracies..." },
    "phi3:latest": { "score": 5, "rationale": "Concise, accurate, well-organized." }
  },
  "winner": "phi3:latest"
}
```
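Given a `scores` object shaped like the excerpt above, deriving the winner is a single `max` over the per-model scores. A minimal sketch (function name is illustrative; ties go to the first model listed):

```python
def pick_winner(scores: dict) -> str:
    """Return the model name whose judge score is highest."""
    return max(scores, key=lambda model: scores[model]["score"])
```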
From the full run:
- 🥇 `phi3:latest` won in Sports, History, Productivity
- 🥇 `deepseek-r1:1.5b` won in Politics, Technology
## 💡 Ideas for Extension
- Add more Ollama models (e.g., `mistral`, `gemma`)
- Try different evaluation criteria (e.g., readability, length control)
- Visualize results with charts
- Benchmark runtime and token usage
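The runtime benchmark could start as a simple timing wrapper around each summarization call, using only the standard library. A minimal sketch (the helper name is illustrative):

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

Wrapping each model's summarize call like this yields per-model latencies that could be collected alongside the scores in `battle_results.json`.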
## 📜 License
MIT License – free to use, modify, and share.