Hugging Face

What is Hugging Face?

Hugging Face (HF) is an organization and a platform that provides machine learning models and datasets with a focus on natural language processing. To get started, try working through this demonstration on Google Colab.


Tips for Working with HF on the Research Computing Clusters

  1. Before beginning your work, make sure that you have sufficient space by running the checkquota command.
  2. By default, your HF cache directory is in your /home directory, which is too small for most models. Set the cache directory to a path where you have more space by setting this environment variable:

    export HF_HOME=/scratch/network/<YourNetID>/.cache/huggingface/  # adroit
    export HF_HOME=/scratch/gpfs/<YourNetID>/.cache/huggingface/     # della

    We recommend setting this environment variable in your ~/.bashrc file so that it takes effect in all subsequent sessions. Run only the command that matches your cluster:

    $ echo "export HF_HOME=/scratch/network/$USER/.cache/huggingface/" >> $HOME/.bashrc  # adroit
    $ echo "export HF_HOME=/scratch/gpfs/$USER/.cache/huggingface/" >> $HOME/.bashrc     # della
    To make the change take effect, log out and then back in again. Here is the full sequence of commands for Della:
    della8:~$ echo "export HF_HOME=/scratch/gpfs/$USER/.cache/huggingface/" >> $HOME/.bashrc
    della8:~$ exit
    $ ssh <YourNetID>@della.princeton.edu

    You can make sure that the variable is correctly defined by running this command:
    della8:~$ echo ${HF_HOME}
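    You can also confirm from Python that the cache location will be picked up. A minimal sketch (the fallback path shown is the Hugging Face default used when HF_HOME is unset):

```python
import os

# HF_HOME controls where huggingface_hub and transformers store downloads.
# If it is unset, Hugging Face falls back to ~/.cache/huggingface.
cache_root = os.environ.get("HF_HOME", os.path.expanduser("~/.cache/huggingface"))
print(cache_root)
```

    If this prints a /home path rather than your /scratch path, the export in ~/.bashrc has not taken effect in the current session.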
  3. You must download models and datasets on the login node before submitting jobs to Slurm, since the compute nodes, where the jobs run, do not have internet access.

    Here is an example of downloading a model on the login node:

    from transformers import AutoTokenizer, AutoModelForCausalLM

    # Download once on the login node; subsequent calls read from the cache
    tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf", cache_dir=".cache")
    model = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-7b-hf", cache_dir=".cache")

    When your Slurm job runs, make sure that these functions do not trigger any downloads on the compute nodes. To guarantee this, be aware of offline mode in HF: setting the environment variables HF_HUB_OFFLINE=1 and TRANSFORMERS_OFFLINE=1 makes the libraries read only from the local cache and fail immediately instead of attempting network access.
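    As a concrete sketch of offline mode (the variables must be set before the HF libraries are imported; the model name is carried over from the example above, and the from_pretrained call is shown commented out so the sketch runs without transformers installed):

```python
import os

# Setting these before importing transformers forces cache-only operation;
# any call that would need the network raises an error instead of retrying.
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"

# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf", cache_dir=".cache")
```

    A convenient alternative is to export the same two variables in your Slurm script, so every process in the job inherits them.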
  4. Learn about methods and tools for efficient training on a single GPU.
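    One technique covered by that guide, gradient accumulation, can be illustrated without a framework: gradients from several small micro-batches are summed before each optimizer step, simulating a larger batch that would not fit in GPU memory. A toy sketch, using f(w) = w**2 as a stand-in loss (all names here are illustrative, not HF API):

```python
# Toy sketch of gradient accumulation for the loss f(w) = w**2 (gradient 2*w).
# Real training would use torch or accelerate, but the control flow is the same.
def train(w, micro_batches, accum_steps, lr=0.1):
    grad_sum = 0.0
    for step, _batch in enumerate(micro_batches, start=1):
        grad_sum += 2 * w                      # accumulate the gradient
        if step % accum_steps == 0:            # update only every accum_steps
            w -= lr * grad_sum / accum_steps   # average over the micro-batches
            grad_sum = 0.0
    return w

print(train(1.0, range(8), accum_steps=4))
```

    With accum_steps=4, the 8 micro-batches produce only 2 optimizer updates, each equivalent to one step on a 4x larger batch.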

For more details, see this GitHub repo and specifically slides.pptx by David Turner of PNI.



The "en" variant of the c4 dataset is available on Della. Research Computing can add additional datasets if they will be used by a sufficient number of users.