Hugging Face 🤗¶
What is Hugging Face? 🤔¶
- An open-source AI company founded in 2016
- Develops tools and resources for natural language processing and other machine learning tasks.
- Hosts over 350k models, 75k datasets, and 150k demo apps on its Hub.
- "Hugging Face is to machine learning what GitHub is to software engineering"
Why Hugging Face? 😎¶
- Easy-to-use APIs and pipelines
- Tons of pre-trained models
- Ability to fine-tune models to suit your purpose
- Organizations such as Google, Microsoft, AWS, Nvidia, Facebook and many more are using Hugging Face
Let's get started! 🚀¶
Website link - https://huggingface.co
Models (https://huggingface.co/models)¶
A collection of state-of-the-art pretrained models for NLP, vision, audio, and many other tasks
Datasets (https://huggingface.co/datasets)¶
Thousands of datasets in more than 100 languages, covering a wide range of ML tasks
Spaces (https://huggingface.co/spaces)¶
A simple way to host ML demo apps on the Hugging Face Hub. Users can create demos using Gradio, Streamlit, or even plain HTML/CSS/JS.
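For example, a Space often wraps a model in just a few lines of Gradio code. A minimal sketch (assuming gradio and transformers are installed; the sentiment model below is simply an example checkpoint from the Hub):
import gradio as gr
from transformers import pipeline

# Load an example sentiment model from the Hub
classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")

def predict(text):
    result = classifier(text)[0]
    return f"{result['label']} ({result['score']:.3f})"

# A simple text-in, text-out interface; launch() starts the web app
demo = gr.Interface(fn=predict, inputs="text", outputs="text")
demo.launch()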
Enough theory! Time to build something 🛠¶
Transformers¶
!pip install -q transformers
Transformers is a library that provides tools and APIs for accessing pre-trained models.
But before we continue, let's talk about Transformers!
One is a Python library, while the other is a breakthrough neural network architecture.
For more details about the architecture, see the original paper, "Attention Is All You Need".
Tokenizers¶
Tokenizers convert input text into an array of numbers (token IDs) that a model can process
from transformers import AutoTokenizer
# https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment
tokenizer_1 = AutoTokenizer.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")
# https://huggingface.co/bert-base-uncased
tokenizer_2 = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "We are very happy to show you the very famous 🤗 Transformers library."
encoding_1 = tokenizer_1([text,"yes"],max_length=12,padding=True)
encoding_2 = tokenizer_2(text)
print('First tokenizer:')
for e in encoding_1:
    print(str(e) + ' : ' + str(encoding_1[e]))
print('Second tokenizer:')
for e in encoding_2:
    print(str(e) + ' : ' + str(encoding_2[e]))
First tokenizer:
input_ids : [[101, 11312, 10320, 12495, 19308, 10114, 11391, 10855, 10103, 12495, 17377, 100, 58263, 13299, 119, 102], [101, 31617, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
token_type_ids : [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
attention_mask : [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
Second tokenizer:
input_ids : [101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 2200, 3297, 100, 19081, 3075, 1012, 102]
token_type_ids : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
attention_mask : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:2632: UserWarning: `max_length` is ignored when `padding`=`True` and there is no truncation strategy. To pad to max length, use `padding='max_length'`.
encoding_1 = tokenizer_1(["Welcome to the session","Enjoy"],max_length=20,padding=True)
for e in encoding_1:
    print(str(e) + ' : ' + str(encoding_1[e]))
input_ids : [[101, 32252, 10114, 10103, 23734, 102], [101, 61530, 102, 0, 0, 0]]
token_type_ids : [[0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]]
attention_mask : [[1, 1, 1, 1, 1, 1], [1, 1, 1, 0, 0, 0]]
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
seq = "Welcome to BITS"
print(tokenizer(seq))
{'input_ids': [101, 6160, 2000, 9017, 102], 'token_type_ids': [0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1]}
seq = tokenizer("Welcome to the session", "Hope you have a good time!")
for key, value in seq.items():
    print(f"{key}: {value}")
input_ids: [101, 6160, 2000, 1996, 5219, 102, 3246, 2017, 2031, 1037, 2204, 2051, 999, 102]
token_type_ids: [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
attention_mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
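To see what those IDs stand for, you can map them back to tokens or text with the same tokenizer (a quick sketch reusing the objects above):
# Convert the input_ids back to tokens and to a decoded string
print(tokenizer.convert_ids_to_tokens(seq["input_ids"]))
print(tokenizer.decode(seq["input_ids"]))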
NOTE: AutoTokenizer only works on textual data. To preprocess image and audio inputs into the correct format, use AutoImageProcessor and AutoFeatureExtractor respectively
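A minimal sketch of those image and audio counterparts (the checkpoint names here are just example models from the Hub):
from transformers import AutoImageProcessor, AutoFeatureExtractor

# Turns images into pixel tensors for a vision model
image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")

# Turns raw audio into feature tensors for an audio model
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")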
Pipeline¶
The pipeline API makes it very easy to use pre-trained models from the Hub
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
classifier("We are very happy to show you the 🤗 Transformers library.")
No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english). Using a pipeline without specifying a model name and revision in production is not recommended.
[{'label': 'POSITIVE', 'score': 0.9997795224189758}]
Parameters -
- model
- task
- tokenizer
- batch_size
- device
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("typeform/distilbert-base-uncased-mnli")
model = AutoModelForSequenceClassification.from_pretrained("typeform/distilbert-base-uncased-mnli")
sa_pipeline = pipeline(
    task="sentiment-analysis",
    model=model,
    tokenizer=tokenizer,
    batch_size=32,
    device=-1
)
sentence = "I love this movie!"
sentiment = sa_pipeline(sentence)
print(sentiment)
The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.
[{'label': 'ENTAILMENT', 'score': 0.6847606301307678}]
To view the list of accepted tasks, visit https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.pipeline.task
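As a quick illustration of another task from that list, the NLI checkpoint loaded above can also power zero-shot classification (a sketch reusing the same model name):
from transformers import pipeline

zero_shot = pipeline("zero-shot-classification", model="typeform/distilbert-base-uncased-mnli")
# Classify a sentence against labels the model was never explicitly trained on
zero_shot("I love playing football on weekends", candidate_labels=["sports", "cooking", "politics"])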
Models¶
What is an architecture and what is a checkpoint? (BERT vs. bert-base-uncased)
- BERT is an architecture based on the Transformer.
- bert-base-uncased is a checkpoint of the BERT architecture, pretrained on English text (https://huggingface.co/bert-base-uncased). See the short sketch below.
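A short sketch of the distinction, using the same transformers library:
from transformers import BertConfig, BertModel

# Architecture only: a BERT model with randomly initialized weights
config = BertConfig()
untrained_bert = BertModel(config)

# Checkpoint: the same architecture loaded with pretrained weights
pretrained_bert = BertModel.from_pretrained("bert-base-uncased")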
from transformers import pipeline
pipe = pipeline("fill-mask", model="bert-base-uncased")
text = "Delhi is [MASK] of India."
prediction = pipe(text)
for p in prediction:
    print(p)
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'cls.seq_relationship.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
{'score': 0.9531505107879639, 'token': 3007, 'token_str': 'capital', 'sequence': 'delhi is capital of india.'}
{'score': 0.025008611381053925, 'token': 2110, 'token_str': 'state', 'sequence': 'delhi is state of india.'}
{'score': 0.00889446958899498, 'token': 2112, 'token_str': 'part', 'sequence': 'delhi is part of india.'}
{'score': 0.0013557332567870617, 'token': 2103, 'token_str': 'city', 'sequence': 'delhi is city of india.'}
{'score': 0.0013061006320640445, 'token': 3072, 'token_str': 'republic', 'sequence': 'delhi is republic of india.'}
Use Cases¶
1. Text Classification 💬¶
from transformers import pipeline
pipe = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")
# results = pipe('I feel very happy')
results = pipe(['I am not enjoying','I feel very happy'])
# results[0]['label']
results
[{'label': 'NEGATIVE', 'score': 0.9996455907821655}, {'label': 'POSITIVE', 'score': 0.999884843826294}]
2. Image Classification 🖼️¶
# Use a pipeline as a high-level helper
from transformers import pipeline
from PIL import Image
import requests
# https://huggingface.co/google/vit-base-patch16-224
pipe = pipeline("image-classification", model="google/vit-base-patch16-224")
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
results = pipe(image)
results
[{'score': 0.9374414086341858, 'label': 'Egyptian cat'}, {'score': 0.038442570716142654, 'label': 'tabby, tabby cat'}, {'score': 0.014411387033760548, 'label': 'tiger cat'}, {'score': 0.0032743187621235847, 'label': 'lynx, catamount'}, {'score': 0.0006795920198783278, 'label': 'Siamese cat, Siamese'}]
from transformers import ViTImageProcessor, ViTForImageClassification
from PIL import Image
import requests
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224')
model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits
# model predicts one of the 1000 ImageNet classes
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
Predicted class: Egyptian cat
3. Text to speech 🔊¶
!pip install -q fairseq
!pip install -q g2p_en
!pip install huggingface-hub
# https://huggingface.co/facebook/fastspeech2-en-200_speaker-cv4
from fairseq.checkpoint_utils import load_model_ensemble_and_task_from_hf_hub
from fairseq.models.text_to_speech.hub_interface import TTSHubInterface
import IPython.display as ipd
models, cfg, task = load_model_ensemble_and_task_from_hf_hub(
    "facebook/fastspeech2-en-200_speaker-cv4",
    arg_overrides={"vocoder": "hifigan", "fp16": False}
)
model = models[0]
TTSHubInterface.update_cfg_with_data_cfg(cfg, task.data_cfg)
generator = task.build_generator([model], cfg)
# text = "Hello, this is a test run."
text = "Google Developer Student Club BITS Pilani Dubai"
sample = TTSHubInterface.get_model_input(task, text)
wav, rate = TTSHubInterface.get_prediction(task, model, generator, sample)
ipd.Audio(wav, rate=rate)
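If you want to keep the generated audio, here is a minimal sketch for writing it to a WAV file (this assumes wav is the waveform tensor returned above and that the soundfile package is available, e.g. via pip install -q soundfile):
import soundfile as sf

# Convert the waveform tensor to a NumPy array and save it to disk
sf.write("tts_output.wav", wav.cpu().numpy(), rate)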
Let's create a chatbot using Streamlit and Hugging Face! 🤖¶
!pip install -q streamlit
%%writefile generic_chatbot.py
import streamlit as st
import random
import time
from transformers import pipeline, Conversation

chatbot = pipeline(model="facebook/blenderbot-400M-distill")

message_list = []
response_list = []

st.title("Generic Chatbot")

# Initialize chat history
if "messages" not in st.session_state:
    st.session_state.messages = []

# Display chat messages from history on app rerun
for message in st.session_state.messages:
    with st.chat_message(message["role"], avatar=message["avatar"]):
        st.markdown(message["content"])

# Accept user input
if prompt := st.chat_input("Enter something..."):
    # Add user message to chat history
    st.session_state.messages.append({"role": "user", "content": prompt, "avatar": "🤗"})
    # Display user message in chat message container
    with st.chat_message("user", avatar='🤗'):
        st.markdown(prompt)

    # Display assistant response in chat message container
    with st.chat_message("assistant", avatar='🤖'):
        message_placeholder = st.empty()
        full_response = ""
        message_list.append(prompt)
        conversation = Conversation(text=prompt, past_user_inputs=message_list, generated_responses=response_list)
        conversation = chatbot(conversation)
        assistant_response = conversation.generated_responses[-1]
        response_list.append(assistant_response)
        # Simulate stream of response with milliseconds delay
        for chunk in assistant_response.split():
            full_response += chunk + " "
            time.sleep(0.05)
            # Add a blinking cursor to simulate typing
            message_placeholder.markdown(full_response + "▌")
        message_placeholder.markdown(full_response)

    # Add assistant response to chat history
    st.session_state.messages.append({"role": "assistant", "content": full_response, "avatar": "🤖"})
Overwriting generic_chatbot.py
!wget -q -O - ipv4.icanhazip.com  # print this machine's public IP (needed to access the localtunnel page)
! streamlit run generic_chatbot.py & npx localtunnel --port 8501
34.86.46.52
Collecting usage statistics. To deactivate, set browser.gatherUsageStats to False.
You can now view your Streamlit app in your browser.
Network URL: http://172.28.0.12:8501
External URL: http://34.86.46.52:8501
npx: installed 22 in 10.758s
your url is: https://odd-books-make.loca.lt
Stopping...
^C
How to run Streamlit in Colab?¶
1. Click on the generated localtunnel link¶
2. Enter the IP address printed by the wget command to access the Streamlit app¶
What to do if you don't know how to use a model? 😖¶
- Check Github
- Check HuggingFace Spaces
e.g. model="facebook/blenderbot-400M-distill"
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("conversational", model="facebook/blenderbot-400M-distill")
ans = pipe('Hi how are you?')
No chat template is defined for this tokenizer - using the default template for the BlenderbotTokenizerFast class. If the default is not appropriate for your model, please set `tokenizer.chat_template` to an appropriate template. See https://huggingface.co/docs/transformers/main/chat_templating for more information.
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-29-7861ef620052> in <cell line: 6>()
      4 pipe = pipeline("conversational", model="facebook/blenderbot-400M-distill")
      5
----> 6 ans = pipe('Hi how are you?')

/usr/local/lib/python3.10/dist-packages/transformers/pipelines/conversational.py in postprocess(self, model_outputs, clean_up_tokenization_spaces)
    321         )
    322         conversation = model_outputs["conversation"]
--> 323         conversation.add_message({"role": "assistant", "content": answer})
    324         return conversation

AttributeError: 'str' object has no attribute 'add_message'
The error occurs because this pipeline expects a Conversation object rather than a raw string, so wrap the input first:
from transformers import Conversation
conversation = Conversation("I'm looking for a movie - what's your favourite one?")
ans = pipe(conversation)
print(ans)
Conversation id: 093afc7a-edd8-488a-bbec-1bba22265344
user: I'm looking for a movie - what's your favourite one?
assistant: I don't really have a favorite movie, but I do like action movies. What about you?
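To keep the same conversation going, you can append another user turn to the Conversation object and run the pipeline again (a sketch assuming the transformers version used above, where Conversation.add_message is available):
# Add a follow-up user message to the existing conversation and generate a reply
conversation.add_message({"role": "user", "content": "I like action movies too. Any recommendations?"})
conversation = pipe(conversation)
print(conversation.generated_responses[-1])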
#import model class and tokenizer
from transformers import BlenderbotTokenizer, BlenderbotForConditionalGeneration
#download and setup the model and tokenizer
model_name = 'facebook/blenderbot-400M-distill'
tokenizer = BlenderbotTokenizer.from_pretrained(model_name)
model = BlenderbotForConditionalGeneration.from_pretrained(model_name)
def func(message):
    inputs = tokenizer(message, return_tensors="pt")
    result = model.generate(**inputs)
    return tokenizer.decode(result[0])
ans = func('Hi how are you')
print(ans)
<s> I'm doing well. How are you? What do you like to do in your free time?</s>
What's next?¶
- Learn how to fine-tune these pre-trained models (see the sketch below)
- Learn how to use Transformers in JavaScript (Transformers.js)
- Create an account on Hugging Face and start contributing!
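As a teaser for that first item, here is a minimal fine-tuning sketch using the Trainer API (the dataset, subset size, and hyperparameters are illustrative assumptions; you also need the datasets library installed):
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

# Load a small sentiment dataset and a base checkpoint to fine-tune
dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True)

tokenized = dataset.map(tokenize, batched=True)

# Train on a small subset just to illustrate the workflow
args = TrainingArguments(output_dir="finetuned-bert", num_train_epochs=1, per_device_train_batch_size=8)
trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized["train"].shuffle(seed=42).select(range(1000)))
trainer.train()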