# Synthetic Data Generator

**NOTE:** This is a copy of the repository https://github.com/Jsrodrigue/synthetic-data-creator.

An intelligent synthetic data generator that uses OpenAI models to create realistic tabular datasets based on reference data. This project includes an intuitive web interface built with Gradio.

> **🎓 Educational Project**: This project was inspired by the highly regarded LLM Engineering course on Udemy: [LLM Engineering: Master AI and Large Language Models](https://www.udemy.com/course/llm-engineering-master-ai-and-large-language-models/learn/lecture/52941433#questions/23828099). It demonstrates practical applications of LLM engineering principles, prompt engineering, and synthetic data generation techniques.

## Key highlights:

- Built with Python & Gradio
- Uses OpenAI GPT-4 models for tabular data synthesis
- Focused on statistical consistency and controlled randomness
- Lightweight and easy to extend

## 📸 Screenshots & Demo

### Application Interface

Main Interface

Main interface showing the synthetic data generator with all controls

### Generated Data Preview

Generated table

Generated CSV preview with the Wine dataset reference

### Histogram plots

Histogram plot

Example of Histogram comparison plot in the Wine dataset

### Boxplots

Boxplot

Example of Boxplot comparison

### Video Demo

[![Video Demo](https://img.youtube.com/vi/C7c8BbUGGBA/0.jpg)](https://youtu.be/C7c8BbUGGBA)

*Click to watch a complete walkthrough of the application*

## 📋 Features

- **Intelligent Generation**: Generates synthetic data using OpenAI models (GPT-4o-mini, GPT-4.1-mini)
- **Web Interface**: Provides an intuitive Gradio UI with real-time data preview
- **Reference Data**: Optionally load CSV files to preserve statistical distributions
- **Export Options**: Download generated datasets directly in CSV format
- **Included Examples**: Comes with ready-to-use sample datasets for people, sentiment analysis, and wine
- **Dynamic Batching**: Automatically adapts batch size based on prompt length and reference sample size
- **Reference Sampling**: Uses random subsets of reference data to ensure variability and reduce API cost. The sample size (default `64`) can be modified in `src/constants.py` via `N_REFERENCE_ROWS`.

## 🚀 Installation

### Prerequisites

- Python 3.12+
- OpenAI account with API key

### Installation with pip

```bash
# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

### Installation with uv

```bash
# Clone the repository (git creates a folder named after the repository)
git clone https://github.com/Jsrodrigue/synthetic-data-creator.git
cd synthetic-data-creator

# Install dependencies (uv creates the environment in .venv)
uv sync

# Activate virtual environment
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
```

### Configuration

1. Copy the environment variables example file:

   ```bash
   cp .env_example .env
   ```

2. Edit `.env` and add your OpenAI API key:

   ```
   OPENAI_API_KEY=your_api_key_here
   ```

## 🎯 Usage

### Start the application

```bash
python app.py
```

The script will print a local URL (e.g., http://localhost:7860); open that link in your browser.

### How to use the interface

1. **Configure Prompts**:
   - **System Prompt**: Uses the default rules defined in `src/constants.py`; edit them there to customize generation.
   - **User Prompt**: Specifies what type of data to generate (default: 15 rows, defined in `src/constants.py`).

2. **Select Model**:
   - `gpt-4o-mini`: Faster and more economical
   - `gpt-4.1-mini`: Higher reasoning capacity

3. **Load Reference Data** (optional):
   - Upload a CSV file with similar data
   - Use included examples: `people_reference.csv`, `sentiment_reference.csv` or `wine_reference.csv`

4. **Generate Data**:
   - Click "🚀 Generate Data"
   - Review results in the Gradio UI
   - Download the generated CSV

## 📊 Quality Evaluation

### Simple Evaluation System

The project includes a simple evaluation system focused on basic metrics and visualizations:

#### Features

- **Simple Metrics**: Basic statistical comparisons and quality checks
- **Integrated Visualizations**: Automatic generation of comparison plots in the app
- **Easy to Understand**: Clear scores and simple reports
- **Scale Invariant**: Works with datasets of different sizes
- **Temporary Files**: Visualizations are generated in temp files and cleaned up automatically

## 🛠️ Improvements and Next Steps

### Immediate Improvements

1. **Advanced Validation**:
   - Implement specific validators by data type
   - Create evaluation reports

2. **Advanced Quality Metrics**:
   - Include more advanced metrics to compare multivariate similarity (for future work), e.g.:
     - C2ST (Classifier Two-Sample Test): train a classifier to distinguish real vs synthetic and report AUROC (ideal ≈ 0.5).
     - MMD (Maximum Mean Discrepancy): kernel-based multivariate distance.
     - Multivariate Wasserstein / Optimal Transport: joint-distribution distance (use POT).

3. **More Models**:
   - Integrate Hugging Face models
   - Support for local models (Ollama)
   - Comparison between different models

### Advanced Features

1. **Conditional Generation**:
   - Data based on specific conditions
   - Controlled outlier generation
   - Maintaining complex relationships

2. **Privacy Analysis**:
   - Differential privacy metrics
   - Sensitive data detection
   - Automatic anonymization

3. **Database Integration**:
   - Direct database connection
   - Massive data generation
   - Automatic synchronization

### Scalable Architecture

1. **REST API**:
   - Endpoints for integration
   - Authentication and rate limiting
   - OpenAPI documentation

2. **Asynchronous Processing**:
   - Work queues for long generations
   - Progress notifications
   - Robust error handling

3. **Monitoring and Logging**:
   - Usage and performance metrics
   - Detailed generation logs
   - Quality alerts

## 📁 Project Structure

```
synthetic_data/
├── app.py                  # Main Gradio application for synthetic data generation
├── README.md               # Project documentation
├── pyproject.toml          # Project configuration
├── requirements.txt        # Python dependencies
├── data/                   # Reference CSV datasets used for generating synthetic data
│   ├── people_reference.csv
│   ├── sentiment_reference.csv
│   └── wine_reference.csv
├── notebooks/              # Jupyter notebooks for experiments and development
│   └── notebook.ipynb
├── src/                    # Python source code
│   ├── __init__.py
│   ├── constants.py        # Default constants, reference sample size, and default prompts
│   ├── data_generation.py  # Core functions for batch generation and evaluation
│   ├── evaluator.py        # Evaluation logic and metrics
│   ├── IO_utils.py         # Utilities for file management and temp directories
│   ├── openai_utils.py     # Wrappers for OpenAI API calls
│   └── plot_utils.py       # Functions to create visualizations from data
└── temp_plots/             # Temporary folder for generated plot images (auto-cleaned)
```

## 📄 License

This project is under the MIT License. See the `LICENSE` file for more details.
## 🎓 Course Context & Learning Outcomes

This project was developed as part of the [LLM Engineering: Master AI and Large Language Models](https://www.udemy.com/course/llm-engineering-master-ai-and-large-language-models/learn/lecture/52941433#questions/23828099) course on Udemy. It demonstrates practical implementation of:

### Key Learning Objectives:

- **Prompt Engineering Mastery**: Creating effective system and user prompts for consistent outputs
- **API Integration**: Working with OpenAI's API for production applications
- **Data Processing**: Handling JSON parsing, validation, and error management
- **Web Application Development**: Building user interfaces with Gradio

### Course Insights Applied:

- **Why OpenAI over Open Source**: This project uses OpenAI models instead of open-source ones because models like Llama 3.2 followed the prompts inconsistently. OpenAI provides more reliable and faster results for this specific task.
- **Production Considerations**: Focus on error handling, output validation, and user experience
- **Scalability Planning**: Architecture designed for future enhancements and integrations

### Related Course Topics:

- Prompt engineering techniques
- LLM API integration and optimization
- Selecting the best model for each use case

---

**📚 Course Link**: [LLM Engineering: Master AI and Large Language Models](https://www.udemy.com/course/llm-engineering-master-ai-and-large-language-models/learn/lecture/52941433#questions/23828099)
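
## Appendix: Reference Sampling Sketch

The **Reference Sampling** feature listed under Features (a random subset of reference rows, 64 by default, controlled by `N_REFERENCE_ROWS`) could look roughly like the sketch below. This is not the project's code. The helper names `sample_reference_rows` and `rows_to_csv` are hypothetical, the reference rows are made up for illustration, and only the standard library is used so the snippet runs on its own.

```python
import csv
import io
import random

# Default named in this README; the real value lives in src/constants.py.
N_REFERENCE_ROWS = 64


def sample_reference_rows(rows, n=N_REFERENCE_ROWS, seed=None):
    """Return up to n rows drawn without replacement (hypothetical helper)."""
    if len(rows) <= n:
        # Small reference sets are used in full.
        return list(rows)
    rng = random.Random(seed)
    return rng.sample(rows, n)


def rows_to_csv(rows):
    """Serialize a list of dicts to CSV text that can be pasted into a prompt."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up rows standing in for a file such as data/wine_reference.csv.
reference = [{"id": i, "alcohol": round(12 + (i % 30) / 10, 1)} for i in range(200)]

subset = sample_reference_rows(reference, seed=0)
prompt_block = rows_to_csv(subset)
```

Sending a different random subset with each request keeps the prompt short, which limits token cost, while still exposing the model to varied examples across batches.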