Improvements to explanations and minor edits

Edward Donner
2025-01-05 12:51:20 -05:00
parent 90f3fe774a
commit e334b841ca
16 changed files with 59 additions and 26 deletions

View File

@@ -11,7 +11,8 @@
"\n",
"## Data Curation Part 2\n",
"\n",
"Today we'll extend our dataset to a greater coverage, and craft it into an excellent dataset for training.\n",
"Today we'll extend our dataset to a greater coverage, and craft it into an excellent dataset for training. \n",
"Data curation can seem less exciting than other things we work on, but it's a crucial part of the LLM engineers' responsibility and an important craft to hone, so that you can build your own commercial solutions with high quality datasets.\n",
"\n",
"The dataset is here: \n",
"https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023\n",
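Since this cell points at the Amazon Reviews 2023 dataset, here is a minimal sketch of pulling one category with the `datasets` library. The config name `raw_meta_Appliances` and the `full` split follow the dataset card's naming scheme, but treat the exact strings as assumptions to verify against the card:

```python
# Sketch: load one product category's metadata from Amazon Reviews 2023.
# Config names like "raw_meta_Appliances" follow the dataset card; adjust
# the category (e.g. "raw_meta_Electronics") as needed.
from datasets import load_dataset

dataset = load_dataset(
    "McAuley-Lab/Amazon-Reviews-2023",
    "raw_meta_Appliances",
    split="full",
    trust_remote_code=True,
)
print(f"{len(dataset):,} items loaded")
```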
@@ -23,7 +24,9 @@
"\n",
"We are about to craft a massive dataset of 400,000 items covering multiple types of product. In Week 7 we will be using this data to train our own model. It's a pretty big dataset, and depending on the GPU you select, training could take 20+ hours. It will be really good fun, but it could cost a few dollars in compute units.\n",
"\n",
"As an alternative, if you want to keep things quick & low cost, you can work with a smaller dataset focused only on Home Appliances. You'll be able to cover the same learning points; the results will be good -- not quite as good as the full dataset, but still pretty amazing! If you'd prefer to do this, I've set up an alternative jupyter notebook in this folder called `lite.ipynb` that you should use in place of this one."
"As an alternative, if you want to keep things quick & low cost, you can work with a smaller dataset focused only on Home Appliances. You'll be able to cover the same learning points; the results will be good -- not quite as good as the full dataset, but still pretty amazing! If you'd prefer to do this, I've set up an alternative jupyter notebook in this folder called `lite.ipynb` that you should use in place of this one.\n",
"\n",
"Also, if you'd prefer, you can shortcut running all this data curation by downloading the pickle files that we save in the last cell. The pickle files are available here: https://drive.google.com/drive/folders/1f_IZGybvs9o0J5sb3xmtTEQB3BXllzrW"
]
},
{
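For that shortcut, here is a minimal sketch of writing and reloading the curated items with pickle. The filenames `train.pkl` and `test.pkl` are assumptions; match whatever the notebook's last cell actually writes:

```python
import pickle

# Save the curated datasets, as the notebook's final cell does.
# "train" and "test" are the curated item lists; the filenames are
# illustrative stand-ins for whatever that cell uses.
with open("train.pkl", "wb") as f:
    pickle.dump(train, f)
with open("test.pkl", "wb") as f:
    pickle.dump(test, f)

# After downloading the same files from the Drive folder, reload them
# and skip the curation steps entirely.
with open("train.pkl", "rb") as f:
    train = pickle.load(f)
with open("test.pkl", "rb") as f:
    test = pickle.load(f)
```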
@@ -610,7 +613,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.10"
"version": "3.11.11"
}
},
"nbformat": 4,

View File

@@ -52,6 +52,20 @@
"from sklearn.preprocessing import StandardScaler"
]
},
+{
+"cell_type": "markdown",
+"id": "b3c87c11-8dbe-4b8c-8989-01e3d3a60026",
+"metadata": {},
+"source": [
+"## NLP imports\n",
+"\n",
+"In the next cell, we have more imports for our NLP-related machine learning. \n",
+"If the gensim import gives you an error like \"Cannot import name 'triu' from 'scipy.linalg'\", then please run this in another cell: \n",
+"`!pip install \"scipy<1.13\"` \n",
+"as described on StackOverflow [here](https://stackoverflow.com/questions/78279136/importerror-cannot-import-name-triu-from-scipy-linalg-when-importing-gens). \n",
+"Many thanks to students Arnaldo G and Ard V for sorting this."
+]
+},
{
"cell_type": "code",
"execution_count": null,
@@ -59,7 +73,7 @@
"metadata": {},
"outputs": [],
"source": [
"# And more imports for our NLP related machine learning\n",
"# NLP related imports\n",
"\n",
"from sklearn.feature_extraction.text import CountVectorizer\n",
"from gensim.models import Word2Vec\n",

View File

@@ -1508,7 +1508,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.10"
"version": "3.11.11"
}
},
"nbformat": 4,