Add juan contribution

This commit is contained in:
Jsrodrigue
2025-10-23 15:29:54 +01:00
parent a1a9bc0f95
commit 101b0baf62
18 changed files with 1426 additions and 0 deletions

View File

@@ -0,0 +1,254 @@
# Synthetic Data Generator
**NOTE:** This is a copy of the repository https://github.com/Jsrodrigue/synthetic-data-creator.
An intelligent synthetic data generator that uses OpenAI models to create realistic tabular datasets based on reference data. This project includes an intuitive web interface built with Gradio.
> **🎓 Educational Project**: This project was inspired by the highly regarded LLM Engineering course on Udemy: [LLM Engineering: Master AI and Large Language Models](https://www.udemy.com/course/llm-engineering-master-ai-and-large-language-models/learn/lecture/52941433#questions/23828099). It demonstrates practical applications of LLM engineering principles, prompt engineering, and synthetic data generation techniques.
## Key highlights:
- Built with Python & Gradio
- Uses OpenAI GPT-4 models for tabular data synthesis
- Focused on statistical consistency and controlled randomness
- Lightweight and easy to extend
## 📸 Screenshots & Demo
### Application Interface
<p align="center">
<img src="screenshots/homepage.png" alt="Main Interface" width="70%">
</p>
<p align="center"><em>Main interface showing the synthetic data generator with all controls</em></p>
### Generated Data Preview
<p align="center">
<img src="screenshots/generated_table.png" alt="Generated table" width="70%">
</p>
<p align="center"><em> Generated CSV preview with the Wine dataset reference</em></p>
### Histogram plots
<p align="center">
<img src="screenshots/histogram.png" alt="Histogram plot" width="70%">
</p>
<p align="center"><em>Example of Histogram comparison plot in the Wine dataset</em></p>
### Boxplots
<p align="center">
<img src="screenshots/boxplot.png" alt="Boxplot" width="70%">
</p>
<p align="center"><em>Example of Boxplot comparison</em></p>
### Video Demo
[![Video Demo](https://img.youtube.com/vi/C7c8BbUGGBA/0.jpg)](https://youtu.be/C7c8BbUGGBA)
*Click to watch a complete walkthrough of the application*
## 📋 Features
- **Intelligent Generation**: Generates synthetic data using OpenAI models (GPT-4o-mini, GPT-4.1-mini)
- **Web Interface**: Provides an intuitive Gradio UI with real-time data preview
- **Reference Data**: Optionally load CSV files to preserve statistical distributions
- **Export Options**: Download generated datasets directly in CSV format
- **Included Examples**: Comes with ready-to-use sample datasets for people and sentiment analysis
- **Dynamic Batching**: Automatically adapts batch size based on prompt length and reference sample size
- **Reference Sampling**: Uses random subsets of reference data to ensure variability and reduce API cost.
The sample size (default `64`) can be modified in `src/constants.py` via `N_REFERENCE_ROWS`.
## 🚀 Installation
### Prerequisites
- Python 3.12+
- OpenAI account with API key
### Installation with pip
```bash
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
```
### Installation with uv
```bash
# Clone the repository
git clone https://github.com/Jsrodrigue/synthetic-data-creator.git
cd synthetic_data
# Install dependencies
uv sync
# Activate virtual environment
uv shell
```
### Configuration
1. Copy the environment variables example file:
```bash
cp .env_example .env
```
2. Edit `.env` and add your OpenAI API key:
```
OPENAI_API_KEY=your_api_key_here
```
## 🎯 Usage
### Start the application
```bash
python app.py
```
The script will print a local URL (e.g., http://localhost:7860) — open that link in your browser.
### How to use the interface
1. **Configure Prompts**:
- **System Prompt**: Uses the default rules defined in `src/constants.py` or can be edited there for custom generation.
- **User Prompt**: Specifies what type of data to generate (default: 15 rows, defined in `src/constants.py`).
2. **Select Model**:
- `gpt-4o-mini`: Faster and more economical
- `gpt-4.1-mini`: Higher reasoning capacity
3. **Load Reference Data** (optional):
- Upload a CSV file with similar data
- Use included examples: `people_reference.csv`, `sentiment_reference.csv` or `wine_reference.csv`
4. **Generate Data**:
- Click "🚀 Generate Data"
- Review results in the gradio UI
- Download the generated CSV
## 📊 Quality Evaluation
### Simple Evaluation System
The project includes a simple evaluation system focused on basic metrics and visualizations:
#### Features
- **Simple Metrics**: Basic statistical comparisons and quality checks
- **Integrated Visualizations**: Automatic generation of comparison plots in the app
- **Easy to Understand**: Clear scores and simple reports
- **Scale Invariant**: Works with datasets of different sizes
- **Temporary Files**: Visualizations are generated in temp files and cleaned up automatically
## 🛠️ Improvements and Next Steps
### Immediate Improvements
1. **Advanced Validation**:
- Implement specific validators by data type
- Create evaluation reports
2. **Advanced Quality Metrics**
- Include more advanced metrics to compare multivariate similarity (for future work), e.g.:
- C2ST (Classifier TwoSample Test): train a classifier to distinguish real vs synthetic — report AUROC (ideal ≈ 0.5).
- MMD (Maximum Mean Discrepancy): kernel-based multivariate distance.
- Multivariate Wasserstein / Optimal Transport: joint-distribution distance (use POT).
3. **More Models**:
- Integrate Hugging Face models
- Support for local models (Ollama)
- Comparison between different models
### Advanced Features
1. **Conditional Generation**:
- Data based on specific conditions
- Controlled outlier generation
- Maintaining complex relationships
2. **Privacy Analysis**:
- Differential privacy metrics
- Sensitive data detection
- Automatic anonymization
3. **Database Integration**:
- Direct database connection
- Massive data generation
- Automatic synchronization
### Scalable Architecture
1. **REST API**:
- Endpoints for integration
- Authentication and rate limiting
- OpenAPI documentation
2. **Asynchronous Processing**:
- Work queues for long generations
- Progress notifications
- Robust error handling
3. **Monitoring and Logging**:
- Usage and performance metrics
- Detailed generation logs
- Quality alerts
## 📁 Project Structure
```
synthetic_data/
├── app.py # Main Gradio application for synthetic data generation
├── README.md # Project documentation
├── pyproject.toml # Project configuration
├── requirements.txt # Python dependencies
├── data/ # Reference CSV datasets used for generating synthetic data
│ ├── people_reference.csv
│ ├── sentiment_reference.csv
│ └── wine_reference.csv
├── notebooks/ # Jupyter notebooks for experiments and development
│ └── notebook.ipynb
├── src/ # Python source code
│ ├── __init__.py
├── constants.py # Default constants, reference sample size, and default prompts
│ ├── data_generation.py # Core functions for batch generation and evaluation
│ ├── evaluator.py # Evaluation logic and metrics
│ ├── IO_utils.py # Utilities for file management and temp directories
│ ├── openai_utils.py # Wrappers for OpenAI API calls
│ └── plot_utils.py
# Functions to create visualizations from data
└── temp_plots/ # Temporary folder for generated plot images (auto-cleaned)
```
## 📄 License
This project is under the MIT License. See the `LICENSE` file for more details.
## 🎓 Course Context & Learning Outcomes
This project was developed as part of the [LLM Engineering: Master AI and Large Language Models](https://www.udemy.com/course/llm-engineering-master-ai-and-large-language-models/learn/lecture/52941433#questions/23828099) course on Udemy. It demonstrates practical implementation of:
### Key Learning Objectives:
- **Prompt Engineering Mastery**: Creating effective system and user prompts for consistent outputs
- **API Integration**: Working with OpenAI's API for production applications
- **Data Processing**: Handling JSON parsing, validation, and error management
- **Web Application Development**: Building user interfaces with Gradio
### Course Insights Applied:
- **Why OpenAI over Open Source**: This project was developed as an alternative to open-source models due to consistency issues in prompt following with models like Llama 3.2. OpenAI provides more reliable and faster results for this specific task.
- **Production Considerations**: Focus on error handling, output validation, and user experience
- **Scalability Planning**: Architecture designed for future enhancements and integrations
### Related Course Topics:
- Prompt engineering techniques
- LLM API integration and optimization
- Selection of best models for each usecase.
---
**📚 Course Link**: [LLM Engineering: Master AI and Large Language Models](https://www.udemy.com/course/llm-engineering-master-ai-and-large-language-models/learn/lecture/52941433#questions/23828099)