From 2e7e59a98ada5d07cd977734936d485b120e6af1 Mon Sep 17 00:00:00 2001
From: Bharat Puri
Date: Wed, 22 Oct 2025 13:52:48 +0530
Subject: [PATCH] Assignment week3 by Bharat Puri

---
 .../bharat_puri/synthetic_data_generator.ipynb | 1 +
 1 file changed, 1 insertion(+)
 create mode 100644 week3/community-contributions/bharat_puri/synthetic_data_generator.ipynb

diff --git a/week3/community-contributions/bharat_puri/synthetic_data_generator.ipynb b/week3/community-contributions/bharat_puri/synthetic_data_generator.ipynb
new file mode 100644
index 0000000..19af672
--- /dev/null
+++ b/week3/community-contributions/bharat_puri/synthetic_data_generator.ipynb
@@ -0,0 +1 @@
+{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"provenance":[{"file_id":"1DjcrYDZldAXKJ08x1uYIVCtItoLPk1Wr","timestamp":1761118409825}],"gpuType":"T4"},"kernelspec":{"name":"python3","display_name":"Python 3"},"language_info":{"name":"python"},"accelerator":"GPU"},"cells":[{"cell_type":"markdown","source":["# Synthetic Data Generator - Week 3 Assignment\n","\n","Submitted by: Bharat Puri\n","\n","## ✅ Summary\n","- Implemented a **synthetic data generator** using the **transformer architecture directly**.\n","- Used `AutoTokenizer` and `AutoModelForCausalLM` for manual inference.\n","- Demonstrated the core transformer flow: Tokenize → Generate → Decode.\n","- Wrapped the logic in a **Gradio UI** for usability.\n","- Used a small model (`gpt2-medium`) to ensure it runs on the free Colab CPU/GPU.\n","- Fully aligned with the Week 3 challenge: *“Write models that generate datasets and explore model APIs.”*\n","\n","\n"],"metadata":{"id":"JTygxy-RAn1f"}},{"cell_type":"markdown","source":["Basic pip installations"],"metadata":{"id":"ovoHky6M2fho"}},{"cell_type":"code","source":["!pip install -q transformers gradio torch"],"metadata":{"id":"iQqYgGVYnhco","executionInfo":{"status":"ok","timestamp":1761121098786,"user_tz":-330,"elapsed":13451,"user":{"displayName":"Bharat Puri","userId":"13621281326895888713"}}},"execution_count":1,"outputs":[]},{"cell_type":"markdown","source":["Validate the Google Colab T4 instance"],"metadata":{"id":"Rcj47nAL2qwD"}},{"cell_type":"code","source":["# @title Check the GPU\n","# Let's check the GPU - it should be a Tesla T4\n","\n","gpu_info = !nvidia-smi\n","gpu_info = '\\n'.join(gpu_info)\n","if gpu_info.find('failed') >= 0:\n","  print('Not connected to a GPU')\n","else:\n","  print(gpu_info)\n","  if gpu_info.find('Tesla T4') >= 0:\n","    print(\"Success - Connected to a T4\")\n","  else:\n","    print(\"NOT CONNECTED TO A T4\")"],"metadata":{"id":"E2aO6PbB0WU3","executionInfo":{"status":"ok","timestamp":1761121098897,"user_tz":-330,"elapsed":109,"user":{"displayName":"Bharat Puri","userId":"13621281326895888713"}},"outputId":"73cfc6c9-2248-4796-a9ae-3b2e5cb85598","colab":{"base_uri":"https://localhost:8080/"}},"execution_count":2,"outputs":[{"output_type":"stream","name":"stdout","text":["Wed Oct 22 08:18:18 2025 \n","+-----------------------------------------------------------------------------------------+\n","| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |\n","|-----------------------------------------+------------------------+----------------------+\n","| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |\n","| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |\n","| | | MIG M. 
|\n","|=========================================+========================+======================|\n","| 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |\n","| N/A 43C P8 9W / 70W | 0MiB / 15360MiB | 0% Default |\n","| | | N/A |\n","+-----------------------------------------+------------------------+----------------------+\n"," \n","+-----------------------------------------------------------------------------------------+\n","| Processes: |\n","| GPU GI CI PID Type Process name GPU Memory |\n","| ID ID Usage |\n","|=========================================================================================|\n","| No running processes found |\n","+-----------------------------------------------------------------------------------------+\n","Success - Connected to a T4\n"]}]},{"cell_type":"markdown","source":["Import the required Python libraries"],"metadata":{"id":"I7kioiEz2x1j"}},{"cell_type":"code","source":["import torch\n","from transformers import AutoTokenizer, AutoModelForCausalLM\n","import gradio as gr"],"metadata":{"executionInfo":{"status":"ok","timestamp":1761121119633,"user_tz":-330,"elapsed":20734,"user":{"displayName":"Bharat Puri","userId":"13621281326895888713"}},"id":"xqGrPpCP2b0N"},"execution_count":3,"outputs":[]},{"cell_type":"markdown","source":["# Connecting to Hugging Face\n","\n","You'll need to log in to the Hugging Face Hub if you've not done so before.\n","\n","1. If you haven't already done so, create a **free** Hugging Face account at https://huggingface.co and navigate to Settings from the user menu on the top right. Then create a new API token, giving yourself write permissions. \n","\n","**IMPORTANT**: when you create your Hugging Face API key, please be sure to select WRITE permissions for your key by clicking on the WRITE tab, otherwise you may get problems later. Not \"fine-grained\" but \"write\".\n","\n","2. Back here in Colab, press the \"key\" icon on the side panel to the left, and add a new secret: \n"," In the name field put `HF_TOKEN` \n"," In the value field put your actual token: `hf_...` \n"," Ensure the notebook access switch is turned ON.\n","\n","3. Execute the cell below to log in. You'll need to do this in each of your Colab notebooks. 
It's a really useful way to manage your secrets without needing to type them into Colab."],"metadata":{"id":"TV8_hr1rCGUr"}},{"cell_type":"code","source":["from huggingface_hub import login\n","from google.colab import userdata\n","\n","hf_token = userdata.get('HF_TOKEN')\n","login(hf_token, add_to_git_credential=True)"],"metadata":{"id":"ZR-wgFH-CKtO","executionInfo":{"status":"ok","timestamp":1761121120770,"user_tz":-330,"elapsed":1135,"user":{"displayName":"Bharat Puri","userId":"13621281326895888713"}}},"execution_count":4,"outputs":[]},{"cell_type":"markdown","source":["## Load Model and Tokenizer\n","\n","We’ll use a relatively small model (`gpt2-medium`) so it’s light and fast, but we’ll handle everything manually — just like a full transformer workflow."],"metadata":{"id":"8bG3a_Xr3DrM"}},{"cell_type":"code","source":["# Load the model and tokenizer once, so every generation reuses them\n","model_name = \"gpt2-medium\"\n","tokenizer = AutoTokenizer.from_pretrained(model_name)\n","model = AutoModelForCausalLM.from_pretrained(model_name)"],"metadata":{"id":"9jTthxWyAJJZ","executionInfo":{"status":"ok","timestamp":1761121132779,"user_tz":-330,"elapsed":12007,"user":{"displayName":"Bharat Puri","userId":"13621281326895888713"}}},"execution_count":5,"outputs":[]},
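{"cell_type":"markdown","source":["### Optional: move the model to the GPU\n","\n","`from_pretrained` loads the model on the CPU by default, so the T4 we validated above would otherwise sit idle. Here is a minimal sketch that moves the model onto the GPU when one is available; the generation step below reads the device back from `model.device`, so it works either way:"],"metadata":{}},{"cell_type":"code","source":["# Use the GPU if Colab gave us one, otherwise stay on the CPU\n","device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n","model = model.to(device)\n","print(f\"Model is on: {model.device}\")"],"metadata":{},"execution_count":null,"outputs":[]},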
{"cell_type":"markdown","source":["## Build a Prompt\n","We create a simple function to structure the generation task. The generator in the next section expands this idea into a few-shot prompt, giving GPT-2 concrete examples to imitate."],"metadata":{"id":"mLkpfycP3IME"}},{"cell_type":"code","source":["def build_prompt(region, count):\n","    return (\n","        f\"Generate {count} unique Indian names from the {region} region. \"\n","        f\"Include both male and female names. \"\n","        f\"Return the list numbered 1 to {count}.\"\n","    )"],"metadata":{"id":"HAeRMxVdJMDF","executionInfo":{"status":"ok","timestamp":1761121132802,"user_tz":-330,"elapsed":20,"user":{"displayName":"Bharat Puri","userId":"13621281326895888713"}}},"execution_count":6,"outputs":[]},{"cell_type":"markdown","source":["## Tokenize → Generate → Decode\n","\n","Here’s the key “transformer logic”:\n","\n","1. Tokenize the input (convert text → tensors)\n","2. Generate new tokens with the model\n","3. Decode the tokens back to text"],"metadata":{"id":"LhYbFsuA3Lmp"}},{"cell_type":"code","source":["def generate_names(region, count):\n","    # Gradio's Number component may hand us a float - slicing needs an int\n","    count = int(count)\n","\n","    # Few-shot example prompt to guide GPT-2\n","    prompt = f\"\"\"\n","Generate {count} unique Indian names from the {region} region.\n","Each name should be realistic and common in that region.\n","Include both male and female names.\n","Here are some examples:\n","\n","1. Arjun Kumar\n","2. Priya Sharma\n","3. Karthik Reddy\n","4. Meena Devi\n","5. Suresh Babu\n","\n","Now continue with more names:\n","\"\"\"\n","\n","    print(\"Prompt sent to model:\\n\", prompt)\n","\n","    # --- Encode input (reusing the globally loaded model and tokenizer) ---\n","    inputs = tokenizer(prompt, return_tensors=\"pt\").to(model.device)\n","\n","    # --- Generate ---\n","    outputs = model.generate(\n","        **inputs,\n","        max_new_tokens=100,\n","        temperature=0.9,\n","        do_sample=True,\n","        pad_token_id=tokenizer.eos_token_id\n","    )\n","\n","    # --- Decode output ---\n","    text = tokenizer.decode(outputs[0], skip_special_tokens=True)\n","\n","    # --- Extract possible names ---\n","    lines = text.split(\"\\n\")\n","    names = []\n","    for line in lines:\n","        if any(ch.isalpha() for ch in line):\n","            clean = line.strip()\n","            if \".\" in clean:\n","                clean = clean.split(\".\", 1)[1].strip()\n","            if len(clean.split()) <= 3 and not clean.lower().startswith(\"generate\"):\n","                names.append(clean)\n","    # remove duplicates and limit\n","    names = list(dict.fromkeys(names))[:count]\n","\n","    if not names:\n","        names = [\"Model didn't generate recognizable names. Try again.\"]\n","\n","    return \"\\n\".join(names)\n"],"metadata":{"id":"UubQ06ZvEOj-","executionInfo":{"status":"ok","timestamp":1761121132826,"user_tz":-330,"elapsed":23,"user":{"displayName":"Bharat Puri","userId":"13621281326895888713"}}},"execution_count":7,"outputs":[]},{"cell_type":"markdown","source":["## Gradio Interface"],"metadata":{"id":"dGrV0RiR6-hb"}},{"cell_type":"code","source":["def run_app():\n","    with gr.Blocks() as demo:\n","        gr.Markdown(\"# 🇮🇳 Indian Name Generator using Transformers (Week 3 Assignment)\")\n","        gr.Markdown(\"Generates synthetic Indian names using Hugging Face Transformers with manual tokenization and decoding.\")\n","\n","        region = gr.Dropdown(\n","            [\"North India\", \"South India\", \"East India\", \"West India\"],\n","            label=\"Select Region\",\n","            value=\"North India\"\n","        )\n","        count = gr.Number(label=\"Number of Names\", value=10, precision=0)  # precision=0 keeps the count an integer\n","        output = gr.Textbox(label=\"Generated Indian Names\", lines=10)\n","        generate_btn = gr.Button(\"Generate Names\")\n","\n","        generate_btn.click(fn=generate_names, inputs=[region, count], outputs=output)\n","    demo.launch()"],"metadata":{"id":"L9F4Gpnu7AQ3","executionInfo":{"status":"ok","timestamp":1761121132853,"user_tz":-330,"elapsed":25,"user":{"displayName":"Bharat Puri","userId":"13621281326895888713"}}},"execution_count":8,"outputs":[]},
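{"cell_type":"markdown","source":["### Quick sanity check (optional)\n","\n","Before launching the UI, we can call the generator directly. A minimal sketch, assuming the cells above have run; the region string is just one of the dropdown options:"],"metadata":{}},{"cell_type":"code","source":["# Call the generator directly, bypassing Gradio\n","print(generate_names(\"South India\", 5))"],"metadata":{},"execution_count":null,"outputs":[]},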
{"cell_type":"markdown","source":["## Run App"],"metadata":{"id":"-12xy-R1-tfm"}},{"cell_type":"code","source":["run_app()"],"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":626},"id":"Izj8VmXG-ukg","executionInfo":{"status":"ok","timestamp":1761121135385,"user_tz":-330,"elapsed":2530,"user":{"displayName":"Bharat Puri","userId":"13621281326895888713"}},"outputId":"bc212815-78e7-49fa-b92d-05c38221ae0b"},"execution_count":9,"outputs":[{"output_type":"stream","name":"stdout","text":["It looks like you are running Gradio on a hosted Jupyter notebook, which requires `share=True`. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).\n","\n","Colab notebook detected. To show errors in colab notebook, set debug=True in launch()\n","* Running on public URL: https://0876ef599f401ea674.gradio.live\n","\n","This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)\n"]},{"output_type":"display_data","data":{"text/plain":[""],"text/html":["
"]},"metadata":{}}]}]} \ No newline at end of file