What is Hugging Face?
Hugging Face (HF) is an organization and a platform that provides machine learning models and datasets with a focus on natural language processing. To get started, try working through this demonstration on Google Colab.
Tips for Working with HF on the Research Computing Clusters
- Before beginning your work, make sure that you have sufficient space by running the checkquota command.
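To see how much space the HF cache itself is consuming, you can use the cache-scanning utility in the huggingface_hub library. A minimal sketch (it reads the cache location from the HF_HOME variable described in the next tip):

from huggingface_hub import scan_cache_dir

# Summarize every model/dataset repo currently in the HF cache.
cache_info = scan_cache_dir()
print(f"total cache size: {cache_info.size_on_disk / 1e9:.2f} GB")
for repo in cache_info.repos:
    print(f"{repo.repo_id}: {repo.size_on_disk / 1e9:.2f} GB")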
- By default, your HF cache directory is in your /home directory, which is too small to hold large models and datasets. Set the cache directory to a path where you have more space by setting this environment variable:
export HF_HOME=/scratch/network/<YourNetID>/.cache/huggingface/ # adroit
export HF_HOME=/scratch/gpfs/<YourNetID>/.cache/huggingface/ # della
It is recommended to set this environment variable in your ~/.bashrc file so that it takes effect in all subsequent sessions. This can be done by running the appropriate command for your cluster:
Adroit$ echo "export HF_HOME=/scratch/network/$USER/.cache/huggingface/" >> $HOME/.bashrc
Della$ echo "export HF_HOME=/scratch/gpfs/$USER/.cache/huggingface/" >> $HOME/.bashrc
To make the change take effect, log out and then back in again. Here is the full sequence of commands for Della:
della8:~$ echo "export HF_HOME=/scratch/gpfs/$USER/.cache/huggingface/" >> $HOME/.bashrc
della8:~$ exit
$ ssh <YourNetID>@della.princeton.edu
You can make sure that the variable is correctly defined by running this command:
della8:~$ echo ${HF_HOME}
/scratch/gpfs/<YourNetID>/.cache/huggingface/
- You must download models and datasets on the login node before submitting jobs to Slurm, since the compute nodes, where the jobs run, do not have internet access. HF downloads files when your code calls one of these functions (see the Python sketch after this list):
.from_pretrained(...)
pipeline(...)
load_dataset(...)
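A minimal sketch of this workflow using the transformers library (the model name here is just an example; substitute your own, and make sure HF_HOME is exported in your shell as described above):

# Run on the login node so that files are downloaded into the HF_HOME cache.
from transformers import AutoModel, AutoTokenizer

model_name = "distilbert-base-uncased"  # example model; substitute your own
tokenizer = AutoTokenizer.from_pretrained(model_name)  # downloads on first call
model = AutoModel.from_pretrained(model_name)          # subsequent calls read the cache

# Inside the Slurm job on a compute node, the same calls read from the cache.
# Passing local_files_only=True makes them fail fast if a file is missing
# instead of attempting to download:
tokenizer = AutoTokenizer.from_pretrained(model_name, local_files_only=True)
model = AutoModel.from_pretrained(model_name, local_files_only=True)

Setting the environment variable HF_HUB_OFFLINE=1 in your Slurm script has a similar effect for all HF Hub calls.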
For more details, see this GitHub repo and specifically slides.pptx by David Turner of PNI.
Datasets
The "en" variant of the c4 dataset is available on Della. Research Computing can add additional datasets if they will be used by a sufficient number of users.