Hugging Face

What is Hugging Face?

Hugging Face (HF) is an organization and a platform that provides machine learning models and datasets with a focus on natural language processing. To get started, try working through this demonstration on Google Colab.

 

Tips for Working with HF on the Research Computing Clusters

  1. Before beginning your work, make sure that you have sufficient space by running the checkquota command.
  2. By default, the HF cache directory is in your /home directory, which is too small for most models and datasets. Point the cache to a filesystem where you have more space by setting the HF_HOME environment variable:

    export HF_HOME=/scratch/network/<YourNetID>/.cache/huggingface/  # adroit
    export HF_HOME=/scratch/gpfs/<YourNetID>/.cache/huggingface/     # della

    Set this environment variable in your ~/.bashrc file so that it takes effect in all subsequent sessions. Run the command for your cluster once:

    Adroit:
    $ echo "export HF_HOME=/scratch/network/$USER/.cache/huggingface/" >> $HOME/.bashrc

    Della:
    $ echo "export HF_HOME=/scratch/gpfs/$USER/.cache/huggingface/" >> $HOME/.bashrc

    To make the change take effect, log out and then log back in. Here is the full sequence of commands for Della:

    della8:~$ echo "export HF_HOME=/scratch/gpfs/$USER/.cache/huggingface/" >> $HOME/.bashrc
    della8:~$ exit
    $ ssh <YourNetID>@della.princeton.edu

    You can confirm that the variable is set correctly by running this command, which prints the path:

    della8:~$ echo ${HF_HOME}
    /scratch/gpfs/<YourNetID>/.cache/huggingface/
  3. You must download models and datasets on the login node before submitting jobs to Slurm since the compute nodes, where the jobs run, do not have internet access.

    Here is an example of downloading a model on the login node:

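    from transformers import AutoTokenizer, AutoModelForCausalLM

    # cache_dir=".cache" stores the files in ./.cache relative to the current
    # directory; omit cache_dir to use the location given by HF_HOME.
    AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf", cache_dir=".cache")
    AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-7b-hf", cache_dir=".cache")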
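
    Make sure that none of these functions needs to download anything when called on the compute nodes:

    .from_pretrained(...)
    .pipeline(...)
    .load_dataset(...)

    HF also provides an offline mode: setting the environment variable HF_HUB_OFFLINE=1 in your Slurm script makes the libraries use only files that are already in the cache instead of trying to reach the internet.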
  4. Learn about methods and tools for efficient training on a single GPU.
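
As a sketch of tip 3, the following Python shows one way to guard a job script against accidental downloads. The helper names enable_offline_mode and load_cached_model are hypothetical and exist only for this illustration; the environment variables and the local_files_only argument come from the HF libraries. The sketch assumes the model was already downloaded on the login node into the same cache location that the job uses.

```python
import os

# Environment variables that put the Hugging Face libraries in offline mode.
OFFLINE_VARS = {
    "HF_HUB_OFFLINE": "1",        # huggingface_hub: do not contact the Hub
    "TRANSFORMERS_OFFLINE": "1",  # transformers: use cached files only
    "HF_DATASETS_OFFLINE": "1",   # datasets: use cached files only
}


def enable_offline_mode(env=os.environ):
    """Set the offline variables in `env` and return the names that changed."""
    changed = [name for name, value in OFFLINE_VARS.items() if env.get(name) != value]
    env.update(OFFLINE_VARS)
    return changed


def load_cached_model(model_id, cache_dir=None):
    """Load a tokenizer and model from the local cache without network access.

    Requires the transformers package. The import is deferred so that
    enable_offline_mode() can run before the library is first imported.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # local_files_only=True raises an error instead of attempting a download.
    tokenizer = AutoTokenizer.from_pretrained(
        model_id, cache_dir=cache_dir, local_files_only=True
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_id, cache_dir=cache_dir, local_files_only=True
    )
    return tokenizer, model
```

In a job script, call enable_offline_mode() at the very top, before transformers or datasets is imported, because the libraries generally read these variables when they are first imported. If load_cached_model then fails, the files are missing from the cache and need to be downloaded on the login node first.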

For more details, see this GitHub repo and specifically slides.pptx by David Turner of PNI.
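
Tips 1 and 2 can also be checked from Python before starting a large download. The sketch below is an illustration and not part of the HF libraries: resolve_hf_home, free_gib and cache_is_under_home are hypothetical helpers. It assumes the usual default cache root (HF_HOME if set, otherwise the huggingface directory under XDG_CACHE_HOME or ~/.cache). Note that shutil.disk_usage reports free space on the whole filesystem, not your personal quota, so checkquota remains the authoritative check.

```python
import os
import shutil
from pathlib import Path


def resolve_hf_home(env=None):
    """Return the directory the HF libraries would use as their cache root."""
    env = os.environ if env is None else env
    if env.get("HF_HOME"):
        return Path(env["HF_HOME"]).expanduser()
    # Fallback when HF_HOME is unset: $XDG_CACHE_HOME/huggingface,
    # with XDG_CACHE_HOME defaulting to ~/.cache.
    xdg_cache = env.get("XDG_CACHE_HOME") or str(Path.home() / ".cache")
    return Path(xdg_cache).expanduser() / "huggingface"


def free_gib(path):
    """Free space in GiB on the filesystem that holds `path`.

    The cache directory may not exist yet, so walk up to the nearest
    existing parent before asking the operating system.
    """
    probe = Path(path)
    while not probe.exists() and probe != probe.parent:
        probe = probe.parent
    return shutil.disk_usage(probe).free / 2**30


def cache_is_under_home(env=None):
    """True if the cache root is inside the home directory."""
    cache = resolve_hf_home(env)
    home = Path.home()
    return cache == home or home in cache.parents


if __name__ == "__main__":
    cache = resolve_hf_home()
    print(f"HF cache root: {cache}")
    print(f"Free space on that filesystem: {free_gib(cache):.1f} GiB")
    if cache_is_under_home():
        print("Warning: the cache is under /home; consider setting HF_HOME.")
```

Run on the login node, the script prints where downloads will land and warns if HF_HOME still points into the home directory.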

 

Datasets

The "en" variant of the c4 dataset is available on Della. Research Computing can add more datasets if enough users will use them.