Add juan contribution

2025-10-23 15:29:54 +01:00
parent a1a9bc0f95
commit 101b0baf62
18 changed files with 1426 additions and 0 deletions
--- a/week3/community-contributions/juan_synthetic_data/README.md
+++ b/week3/community-contributions/juan_synthetic_data/README.md
@@ -0,0 +1,254 @@
+# Synthetic Data Generator
+**NOTE:** This is a copy of the repository https://github.com/Jsrodrigue/synthetic-data-creator.
+
+An intelligent synthetic data generator that uses OpenAI models to create realistic tabular datasets based on reference data. This project includes an intuitive web interface built with Gradio.
+
+> **🎓 Educational Project**: This project was inspired by the highly regarded LLM Engineering course on Udemy: [LLM Engineering: Master AI and Large Language Models](https://www.udemy.com/course/llm-engineering-master-ai-and-large-language-models/learn/lecture/52941433#questions/23828099). It demonstrates practical applications of LLM engineering principles, prompt engineering, and synthetic data generation techniques.
+
+## Key highlights:
+- Built with Python & Gradio  
+- Uses OpenAI GPT-4 models for tabular data synthesis  
+- Focused on statistical consistency and controlled randomness  
+- Lightweight and easy to extend
+
+## 📸 Screenshots & Demo
+
+### Application Interface
+<p align="center">
+  <img src="screenshots/homepage.png" alt="Main Interface" width="70%">
+</p>
+<p align="center"><em>Main interface showing the synthetic data generator with all controls</em></p>
+
+### Generated Data Preview
+<p align="center">
+  <img src="screenshots/generated_table.png" alt="Generated table" width="70%">
+</p>
+<p align="center"><em> Generated CSV preview with the Wine dataset reference</em></p>
+
+### Histogram plots
+<p align="center">
+  <img src="screenshots/histogram.png" alt="Histogram plot" width="70%">
+</p>
+<p align="center"><em>Example of Histogram comparison plot in the Wine dataset</em></p>
+
+### Boxplots
+<p align="center">
+  <img src="screenshots/boxplot.png" alt="Boxplot" width="70%">
+</p>
+<p align="center"><em>Example of Boxplot comparison</em></p>
+
+
+### Video Demo
+[![Video Demo](https://img.youtube.com/vi/C7c8BbUGGBA/0.jpg)](https://youtu.be/C7c8BbUGGBA)
+
+*Click to watch a complete walkthrough of the application*
+
+
+## 📋 Features
+
+- **Intelligent Generation**: Generates synthetic data using OpenAI models (GPT-4o-mini, GPT-4.1-mini)
+- **Web Interface**: Provides an intuitive Gradio UI with real-time data preview
+- **Reference Data**: Optionally load CSV files to preserve statistical distributions
+- **Export Options**: Download generated datasets directly in CSV format
+- **Included Examples**: Comes with ready-to-use sample datasets for people and sentiment analysis
+- **Dynamic Batching**: Automatically adapts batch size based on prompt length and reference sample size
+- **Reference Sampling**: Uses random subsets of reference data to ensure variability and reduce API cost.  
+  The sample size (default `64`) can be modified in `src/constants.py` via `N_REFERENCE_ROWS`.
+
+## 🚀 Installation
+
+### Prerequisites
+- Python 3.12+
+- OpenAI account with API key
+
+### Installation with pip
+```bash
+# Create virtual environment
+python -m venv venv
+source venv/bin/activate  # On Windows: venv\Scripts\activate
+
+# Install dependencies
+pip install -r requirements.txt
+```
+
+### Installation with uv
+```bash
+# Clone the repository
+git clone https://github.com/Jsrodrigue/synthetic-data-creator.git
+cd synthetic_data
+
+# Install dependencies
+uv sync
+
+# Activate virtual environment
+uv shell
+```
+
+### Configuration
+1. Copy the environment variables example file:
+```bash
+cp .env_example .env
+```
+
+2. Edit `.env` and add your OpenAI API key:
+```
+OPENAI_API_KEY=your_api_key_here
+```
+
+
+
+## 🎯 Usage
+
+### Start the application
+```bash
+python app.py
+```
+
+The script will print a local URL (e.g., http://localhost:7860) — open that link in your browser.
+
+### How to use the interface
+
+1. **Configure Prompts**:
+   - **System Prompt**: Uses the default rules defined in `src/constants.py` or can be edited there for custom generation.
+   - **User Prompt**: Specifies what type of data to generate (default: 15 rows, defined in `src/constants.py`).
+
+
+2. **Select Model**:
+   - `gpt-4o-mini`: Faster and more economical
+   - `gpt-4.1-mini`: Higher reasoning capacity
+
+3. **Load Reference Data** (optional):
+   - Upload a CSV file with similar data
+   - Use included examples: `people_reference.csv`, `sentiment_reference.csv` or `wine_reference.csv`
+
+4. **Generate Data**:
+   - Click "🚀 Generate Data"
+   - Review results in the gradio UI
+   - Download the generated CSV
+
+
+
+## 📊 Quality Evaluation
+
+### Simple Evaluation System
+
+The project includes a simple evaluation system focused on basic metrics and visualizations:
+
+#### Features
+- **Simple Metrics**: Basic statistical comparisons and quality checks
+- **Integrated Visualizations**: Automatic generation of comparison plots in the app
+- **Easy to Understand**: Clear scores and simple reports
+- **Scale Invariant**: Works with datasets of different sizes
+- **Temporary Files**: Visualizations are generated in temp files and cleaned up automatically
+
+
+
+## 🛠️ Improvements and Next Steps
+
+### Immediate Improvements
+
+1. **Advanced Validation**:
+   - Implement specific validators by data type
+   - Create evaluation reports
+
+2. **Advanced Quality Metrics**
+   - Include more advanced metrics to compare multivariate similarity (for future work), e.g.:
+      - C2ST (Classifier Two‑Sample Test): train a classifier to distinguish real vs synthetic — report AUROC (ideal ≈ 0.5).
+      - MMD (Maximum Mean Discrepancy): kernel-based multivariate distance.
+      - Multivariate Wasserstein / Optimal Transport: joint-distribution distance (use POT).
+     
+3. **More Models**:
+   - Integrate Hugging Face models
+   - Support for local models (Ollama)
+   - Comparison between different models
+
+### Advanced Features
+
+1. **Conditional Generation**:
+   - Data based on specific conditions
+   - Controlled outlier generation
+   - Maintaining complex relationships
+
+2. **Privacy Analysis**:
+   - Differential privacy metrics
+   - Sensitive data detection
+   - Automatic anonymization
+
+3. **Database Integration**:
+   - Direct database connection
+   - Massive data generation
+   - Automatic synchronization
+
+### Scalable Architecture
+
+1. **REST API**:
+   - Endpoints for integration
+   - Authentication and rate limiting
+   - OpenAPI documentation
+
+2. **Asynchronous Processing**:
+   - Work queues for long generations
+   - Progress notifications
+   - Robust error handling
+
+3. **Monitoring and Logging**:
+   - Usage and performance metrics
+   - Detailed generation logs
+   - Quality alerts
+
+## 📁 Project Structure
+
+```
+synthetic_data/
+├── app.py                 # Main Gradio application for synthetic data generation
+├── README.md              # Project documentation
+├── pyproject.toml         # Project configuration
+├── requirements.txt       # Python dependencies
+├── data/                  # Reference CSV datasets used for generating synthetic data
+│   ├── people_reference.csv
+│   ├── sentiment_reference.csv
+│   └── wine_reference.csv
+├── notebooks/             # Jupyter notebooks for experiments and development
+│   └── notebook.ipynb
+├── src/                   # Python source code
+│   ├── __init__.py
+    ├── constants.py       # Default constants, reference sample size, and default prompts
+│   ├── data_generation.py  # Core functions for batch generation and evaluation
+│   ├── evaluator.py        # Evaluation logic and metrics
+│   ├── IO_utils.py         # Utilities for file management and temp directories
+│   ├── openai_utils.py     # Wrappers for OpenAI API calls
+│   └── plot_utils.py  
+     # Functions to create visualizations from data
+└──  temp_plots/            # Temporary folder for generated plot images (auto-cleaned)
+```
+
+## 📄 License
+
+This project is under the MIT License. See the `LICENSE` file for more details.
+
+
+
+
+## 🎓 Course Context & Learning Outcomes
+
+This project was developed as part of the [LLM Engineering: Master AI and Large Language Models](https://www.udemy.com/course/llm-engineering-master-ai-and-large-language-models/learn/lecture/52941433#questions/23828099) course on Udemy. It demonstrates practical implementation of:
+
+### Key Learning Objectives:
+- **Prompt Engineering Mastery**: Creating effective system and user prompts for consistent outputs
+- **API Integration**: Working with OpenAI's API for production applications
+- **Data Processing**: Handling JSON parsing, validation, and error management
+- **Web Application Development**: Building user interfaces with Gradio
+
+### Course Insights Applied:
+- **Why OpenAI over Open Source**: This project was developed as an alternative to open-source models due to consistency issues in prompt following with models like Llama 3.2. OpenAI provides more reliable and faster results for this specific task.
+- **Production Considerations**: Focus on error handling, output validation, and user experience
+- **Scalability Planning**: Architecture designed for future enhancements and integrations
+
+### Related Course Topics:
+- Prompt engineering techniques
+- LLM API integration and optimization
+- Selection of best models for each usecase.
+
+---
+
+**📚 Course Link**: [LLM Engineering: Master AI and Large Language Models](https://www.udemy.com/course/llm-engineering-master-ai-and-large-language-models/learn/lecture/52941433#questions/23828099)