# Synthetic Data Generator **NOTE:** This is a copy of the repository https://github.com/Jsrodrigue/synthetic-data-creator. # Synthetic Data Generator An intelligent synthetic data generator that uses OpenAI models to create realistic tabular datasets based on reference data. This project includes an intuitive web interface built with Gradio. > **π Educational Project**: This project was inspired by the highly regarded LLM Engineering course on Udemy: [LLM Engineering: Master AI and Large Language Models](https://www.udemy.com/course/llm-engineering-master-ai-and-large-language-models/learn/lecture/52941433#questions/23828099). It demonstrates practical applications of LLM engineering principles, prompt engineering, and synthetic data generation techniques. ## Key highlights: - Built with Python & Gradio - Uses OpenAI GPT-4 models for tabular data synthesis - Focused on statistical consistency and controlled randomness - Lightweight and easy to extend ## πΈ Screenshots & Demo ### Application Interface
Main interface showing the synthetic data generator with all controls
### Generated Data Preview
Generated CSV preview with the Wine dataset reference
### Histogram plots
Example of Histogram comparison plot in the Wine dataset
### Boxplots
Example of Boxplot comparison
### Video Demo [](https://youtu.be/C7c8BbUGGBA) *Click to watch a complete walkthrough of the application* ## π Features - **Intelligent Generation**: Generates synthetic data using OpenAI models (GPT-4o-mini, GPT-4.1-mini) - **Web Interface**: Provides an intuitive Gradio UI with real-time data preview - **Reference Data**: Optionally load CSV files to preserve statistical distributions - **Export Options**: Download generated datasets directly in CSV format - **Included Examples**: Comes with ready-to-use sample datasets for people and sentiment analysis - **Dynamic Batching**: Automatically adapts batch size based on prompt length and reference sample size - **Reference Sampling**: Uses random subsets of reference data to ensure variability and reduce API cost. The sample size (default `64`) can be modified in `src/constants.py` via `N_REFERENCE_ROWS`. ## π Installation ### Prerequisites - Python 3.12+ - OpenAI account with API key ### Option 1: Using pip ```bash # Create virtual environment python -m venv venv source venv/bin/activate # On Windows: venv\Scripts\activate # Install dependencies pip install -r requirements.txt ``` ### Option 2: Using uv ```bash # Clone the repository git clone https://github.com/Jsrodrigue/synthetic-data-creator.git cd synthetic-data-creator # Install dependencies uv sync # Activate virtual environment uv shell ``` ### Configuration 1. Copy the environment variables example file: ```bash cp .env_example .env ``` 2. Edit `.env` and add your OpenAI API key: ``` OPENAI_API_KEY=your_api_key_here ``` ## π― Usage ### Start the application You can run the app either with **Python** or with **uv** (recommended if you installed dependencies using `uv sync`): ```bash # Option 1: using Python python app.py # Option 2: using uv (no need to activate venv manually) uv run app.py ``` The script will print a local URL (e.g., http://localhost:7860) β open that link in your browser. ### How to use the interface 1. **Configure Prompts**: - **System Prompt**: Uses the default rules defined in `src/constants.py` or can be edited there for custom generation. - **User Prompt**: Specifies what type of data to generate (default: 15 rows, defined in `src/constants.py`). 2. **Select Model**: - `gpt-4o-mini`: Faster and more economical - `gpt-4.1-mini`: Higher reasoning capacity 3. **Load Reference Data** (optional): - Upload a CSV file with similar data - Use included examples: `people_reference.csv`, `sentiment_reference.csv` or `wine_reference.csv` 4. **Generate Data**: - Click "π Generate Data" - Review results in the gradio UI - Download the generated CSV ## π Quality Evaluation ### Simple Evaluation System The project includes a simple evaluation system focused on basic metrics and visualizations: #### Features - **Simple Metrics**: Basic statistical comparisons and quality checks - **Integrated Visualizations**: Automatic generation of comparison plots in the app - **Easy to Understand**: Clear scores and simple reports - **Scale Invariant**: Works with datasets of different sizes - **Temporary Files**: Visualizations are generated in temp files and cleaned up automatically ## π οΈ Improvements and Next Steps ### Immediate Improvements 1. **Advanced Validation**: - Implement specific validators by data type - Create evaluation reports 2. **Advanced Quality Metrics** - Include more advanced metrics to compare multivariate similarity (for future work), e.g.: - C2ST (Classifier TwoβSample Test): train a classifier to distinguish real vs synthetic β report AUROC (ideal β 0.5). - MMD (Maximum Mean Discrepancy): kernel-based multivariate distance. - Multivariate Wasserstein / Optimal Transport: joint-distribution distance (use POT). 3. **More Models**: - Integrate Hugging Face models - Support for local models (Ollama) - Comparison between different models ### Advanced Features 1. **Conditional Generation**: - Data based on specific conditions - Controlled outlier generation - Maintaining complex relationships 2. **Privacy Analysis**: - Differential privacy metrics - Sensitive data detection - Automatic anonymization 3. **Database Integration**: - Direct database connection - Massive data generation - Automatic synchronization ### Scalable Architecture 1. **REST API**: - Endpoints for integration - Authentication and rate limiting - OpenAPI documentation 2. **Asynchronous Processing**: - Work queues for long generations - Progress notifications - Robust error handling 3. **Monitoring and Logging**: - Usage and performance metrics - Detailed generation logs - Quality alerts ## π Project Structure ``` synthetic_data/ βββ app.py # Main Gradio application for synthetic data generation βββ README.md # Project documentation βββ pyproject.toml # Project configuration βββ requirements.txt # Python dependencies βββ data/ # Reference CSV datasets used for generating synthetic data β βββ people_reference.csv β βββ sentiment_reference.csv β βββ wine_reference.csv βββ notebooks/ # Jupyter notebooks for experiments and development β βββ notebook.ipynb βββ src/ # Python source code β βββ __init__.py βββ constants.py # Default constants, reference sample size, and default prompts β βββ data_generation.py # Core functions for batch generation and evaluation β βββ evaluator.py # Evaluation logic and metrics β βββ IO_utils.py # Utilities for file management and temp directories β βββ openai_utils.py # Wrappers for OpenAI API calls β βββ plot_utils.py # Functions to create visualizations from data βββ temp_plots/ # Temporary folder for generated plot images (auto-cleaned) ``` ## π License This project is under the MIT License. See the `LICENSE` file for more details. ## π Course Context & Learning Outcomes This project was developed as part of the [LLM Engineering: Master AI and Large Language Models](https://www.udemy.com/course/llm-engineering-master-ai-and-large-language-models/learn/lecture/52941433#questions/23828099) course on Udemy. It demonstrates practical implementation of: ### Key Learning Objectives: - **Prompt Engineering Mastery**: Creating effective system and user prompts for consistent outputs - **API Integration**: Working with OpenAI's API for production applications - **Data Processing**: Handling JSON parsing, validation, and error management - **Web Application Development**: Building user interfaces with Gradio ### Course Insights Applied: - **Why OpenAI over Open Source**: This project was developed as an alternative to open-source models due to consistency issues in prompt following with models like Llama 3.2. OpenAI provides more reliable and faster results for this specific task. - **Production Considerations**: Focus on error handling, output validation, and user experience - **Scalability Planning**: Architecture designed for future enhancements and integrations ### Related Course Topics: - Prompt engineering techniques - LLM API integration and optimization - Selection of best models for each usecase. --- **π Course Link**: [LLM Engineering: Master AI and Large Language Models](https://www.udemy.com/course/llm-engineering-master-ai-and-large-language-models/learn/lecture/52941433#questions/23828099)