Refreshed notebooks, particularly with new Week 1

This commit is contained in:
Edward Donner
2024-11-13 15:46:22 +00:00
parent 6ba1875cd3
commit 21c7a8155c
34 changed files with 2331 additions and 410 deletions

View File

@@ -259,7 +259,22 @@
"\n",
"And will create a prompt to be used during Training.\n",
"\n",
"Items will be rejected if they don't have sufficient characters."
"Items will be rejected if they don't have sufficient characters.\n",
"\n",
"## But why 180 tokens??\n",
"\n",
"A student asked me a great question - why are we truncating to 180 tokens? How did we determine that number? (Thank you Moataz A. for the excellent question).\n",
"\n",
"The answer: this is an example of a \"hyper-parameter\". In other words, it's basically trial and error! We want a sufficiently large number of tokens so that we have enough useful information to gauge the price. But we also want to keep the number low so that we can train efficiently. You'll see this in action in Week 7.\n",
"\n",
"I started with a number that seemed reasonable, and experimented with a few variations before settling on 180. If you have time, you should do the same! You might find that you can beat my results by finding a better balance. This kind of trial-and-error might sound a bit unsatisfactory, but it's a crucial part of the data science R&D process.\n",
"\n",
"There's another interesting reason why we might favor a lower number of tokens in the training data. When we eventually get to use our model at inference time, we'll want to provide new products and have it estimate a price. And we'll be using short descriptions of products - like 1-2 sentences. For best performance, we should size our training data to be similar to the inputs we will provide at inference time.\n",
"\n",
"## But I see in items.py it constrains inputs to 160 tokens?\n",
"\n",
"Another great question from Moataz A.! The description of the products is limited to 160 tokens because we add some more text before and after the description to turn it into a prompt. That brings it to around 180 tokens in total.\n",
"\n"
]
},
{
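To make the token arithmetic above concrete, here is a minimal sketch, assuming the Llama 3.1 tokenizer and an illustrative prompt wrapper (the wording and helper names are not the course's actual code): it truncates a description to 160 tokens, wraps it in prompt text, and checks that the total lands near 180.

```python
# Minimal sketch of the token budget -- illustrative, not the course code.
from transformers import AutoTokenizer

BASE_MODEL = "meta-llama/Meta-Llama-3.1-8B"
MAX_TOKENS = 160  # description budget before the prompt text is added

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

def truncate(description: str) -> str:
    # Encode, cut at the token budget, and decode back to text
    tokens = tokenizer.encode(description, add_special_tokens=False)
    return tokenizer.decode(tokens[:MAX_TOKENS])

def make_prompt(description: str) -> str:
    # Hypothetical wrapper text; the real prompt differs
    return f"How much does this cost?\n\n{truncate(description)}\n\nPrice is $"

sample = "Stainless steel 8-quart stock pot with a tempered glass lid. " * 20
print(len(tokenizer.encode(make_prompt(sample))))  # lands roughly at 180
```

The exact total depends on the wrapper text, which is why the description budget is 160 rather than 180: the remaining ~20 tokens are reserved for the prompt scaffolding.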

View File

@@ -155,8 +155,6 @@
"metadata": {},
"outputs": [],
"source": [
"\n",
"\n",
"with open('train.pkl', 'rb') as file:\n",
" train = pickle.load(file)\n",
"\n",

View File

@@ -529,77 +529,6 @@
"source": [
"Tester.test(gpt_fine_tuned, test)"
]
},
{
"cell_type": "code",
"execution_count": 320,
"id": "03ff4b48-3788-4370-9e34-6592f23d1bce",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"DNS resolution for api.gradio.app: 54.68.118.249\n",
"Gradio API Status: 500\n",
"Gradio API Response: Internal Server Error\n",
"HuggingFace CDN Status: 403\n"
]
}
],
"source": [
"import requests\n",
"import socket\n",
"\n",
"def check_connectivity():\n",
" try:\n",
" # Check DNS resolution\n",
" ip = socket.gethostbyname('api.gradio.app')\n",
" print(f\"DNS resolution for api.gradio.app: {ip}\")\n",
"\n",
" # Check connection to Gradio API\n",
" response = requests.get(\"https://api.gradio.app/v2/tunnel/\", timeout=5)\n",
" print(f\"Gradio API Status: {response.status_code}\")\n",
" print(f\"Gradio API Response: {response.text}\")\n",
"\n",
" # Check connection to HuggingFace CDN\n",
" cdn_response = requests.get(\"https://cdn-media.huggingface.co/frpc-gradio-0.2/frpc_linux_aarch64\", timeout=5)\n",
" print(f\"HuggingFace CDN Status: {cdn_response.status_code}\")\n",
" except Exception as e:\n",
" print(f\"Error in connectivity check: {e}\")\n",
"\n",
"check_connectivity()"
]
},
{
"cell_type": "code",
"execution_count": 323,
"id": "f7d4eec4-da5e-4fbf-ba3e-fbbcfb399d6c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'4.44.0'"
]
},
"execution_count": 323,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import gradio\n",
"gradio.__version__"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cad08a54-912b-43d2-9280-f00b5a7775a6",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {

View File

@@ -3,8 +3,10 @@ from transformers import AutoTokenizer
import re
BASE_MODEL = "meta-llama/Meta-Llama-3.1-8B"
MIN_TOKENS = 150 # Any less than this, and we don't have enough useful content
MAX_TOKENS = 160 # Truncate after this many tokens. Then after adding in prompt text, we will get to around 180 tokens
MIN_CHARS = 300
CEILING_CHARS = MAX_TOKENS * 7
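
For context, here is a sketch of how these four constants plausibly fit together; the filtering logic is an assumption for illustration, not the actual items.py implementation. The character limits act as a cheap pre-filter so that tokenization only ever runs on bounded strings.

```python
# Assumed filtering logic, for illustration only -- not the real items.py.
from typing import Optional
from transformers import AutoTokenizer

BASE_MODEL = "meta-llama/Meta-Llama-3.1-8B"
MIN_TOKENS = 150                 # below this, not enough content to gauge a price
MAX_TOKENS = 160                 # truncate here; prompt text brings the total to ~180
MIN_CHARS = 300                  # cheap reject before tokenizing at all
CEILING_CHARS = MAX_TOKENS * 7   # generous chars-per-token bound keeps encoding fast

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

def usable_text(text: str) -> Optional[str]:
    if len(text) < MIN_CHARS:
        return None                        # too short to be useful
    text = text[:CEILING_CHARS]            # bound the cost of tokenization
    tokens = tokenizer.encode(text, add_special_tokens=False)
    if len(tokens) < MIN_TOKENS:
        return None                        # still not enough useful content
    return tokenizer.decode(tokens[:MAX_TOKENS])
```

The factor of 7 characters per token is deliberately generous (English text averages roughly 4), so the character ceiling never cuts a description below what the token-level truncation would keep anyway.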