Slide 1: Introduction to Vector Embedding
Vector embedding is a technique used to represent discrete objects as continuous vectors in a high-dimensional space. This method allows us to capture semantic relationships and similarities between objects, making it easier for machine learning models to process and analyze data.
import numpy as np
# Example: Word embedding
word_embedding = {
"cat": np.array([0.2, 0.5, -0.3]),
"dog": np.array([0.1, 0.4, -0.2]),
"fish": np.array([-0.3, 0.1, 0.6])
}
# Calculate cosine similarity between "cat" and "dog"
cat_dog_similarity = np.dot(word_embedding["cat"], word_embedding["dog"]) / (
np.linalg.norm(word_embedding["cat"]) * np.linalg.norm(word_embedding["dog"])
)
print(f"Similarity between 'cat' and 'dog': {cat_dog_similarity:.4f}")Slide 2: Word2Vec: A Popular Vector Embedding Technique
Word2Vec is a widely used method for creating word embeddings. It learns vector representations of words by predicting surrounding words in a context window. There are two main architectures: Continuous Bag of Words (CBOW) and Skip-gram.
from gensim.models import Word2Vec
# Sample corpus
corpus = [
["I", "love", "machine", "learning"],
["Vector", "embeddings", "are", "useful"],
["Natural", "language", "processing", "is", "fascinating"]
]
# Train Word2Vec model
model = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, workers=4)
# Get vector for a word
vector = model.wv["learning"]
print(f"Vector for 'learning': {vector[:5]}...") # Showing first 5 dimensions
# Find similar words
similar_words = model.wv.most_similar("machine", topn=3)
print("Words similar to 'machine':", similar_words)Slide 3: TF-IDF Vectorization
Term Frequency-Inverse Document Frequency (TF-IDF) is a numerical statistic that reflects the importance of a word in a document within a collection. It's commonly used for text classification and information retrieval.
from sklearn.feature_extraction.text import TfidfVectorizer
# Sample documents
documents = [
"The cat sat on the mat",
"The dog chased the cat",
"The bird flew over the mat"
]
# Create TF-IDF vectorizer
vectorizer = TfidfVectorizer()
# Transform documents to TF-IDF vectors
tfidf_matrix = vectorizer.fit_transform(documents)
# Get feature names (words)
feature_names = vectorizer.get_feature_names_out()
# Print TF-IDF scores for the first document
first_document_vector = tfidf_matrix[0]
for idx, score in zip(first_document_vector.indices, first_document_vector.data):
print(f"{feature_names[idx]}: {score:.4f}")Slide 4: Doc2Vec: Document Embedding
Doc2Vec extends the Word2Vec model to learn vector representations for entire documents. It's useful for tasks like document classification and clustering.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
# Sample documents
documents = [
"The quick brown fox jumps over the lazy dog",
"The cat sat on the mat",
"The bird flew over the rainbow"
]
# Prepare tagged documents
tagged_docs = [TaggedDocument(doc.split(), [i]) for i, doc in enumerate(documents)]
# Train Doc2Vec model
model = Doc2Vec(vector_size=50, min_count=1, epochs=40)
model.build_vocab(tagged_docs)
model.train(tagged_docs, total_examples=model.corpus_count, epochs=model.epochs)
# Infer vector for a new document
new_doc = "The quick brown fox"
inferred_vector = model.infer_vector(new_doc.split())
print(f"Inferred vector for '{new_doc}': {inferred_vector[:5]}...") # Showing first 5 dimensions
# Find similar documents
similar_docs = model.dv.most_similar([inferred_vector])
print("Similar documents:", similar_docs)Slide 5: BERT Embeddings
BERT (Bidirectional Encoder Representations from Transformers) provides contextualized word embeddings, capturing the context-dependent meaning of words in sentences.
from transformers import BertTokenizer, BertModel
import torch
# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
# Prepare input
text = "Vector embeddings are powerful for NLP tasks."
inputs = tokenizer(text, return_tensors="pt")
# Generate BERT embeddings
with torch.no_grad():
outputs = model(**inputs)
# Get the embeddings of the last layer
last_hidden_states = outputs.last_hidden_state
print(f"Shape of BERT embeddings: {last_hidden_states.shape}")
print(f"Embedding for the first token: {last_hidden_states[0][0][:5]}...") # Showing first 5 dimensionsSlide 6: Visualization of Vector Embeddings
Visualizing high-dimensional embeddings can help understand the relationships between objects. t-SNE (t-Distributed Stochastic Neighbor Embedding) is a popular technique for this purpose.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
# Generate sample word embeddings
words = ["cat", "dog", "fish", "bird", "lion", "tiger", "shark", "whale"]
embeddings = np.random.rand(len(words), 50) # 50-dimensional embeddings
# Apply t-SNE
tsne = TSNE(n_components=2, random_state=42)
reduced_embeddings = tsne.fit_transform(embeddings)
# Plot the reduced embeddings
plt.figure(figsize=(10, 8))
for i, word in enumerate(words):
x, y = reduced_embeddings[i]
plt.scatter(x, y)
plt.annotate(word, (x, y))
plt.title("t-SNE visualization of word embeddings")
plt.xlabel("Dimension 1")
plt.ylabel("Dimension 2")
plt.show()Slide 7: Sentence Embeddings with Transformers
Sentence embeddings represent entire sentences as vectors, capturing their semantic meaning. We can use pre-trained models like Sentence-BERT for this purpose.
from sentence_transformers import SentenceTransformer
# Load pre-trained Sentence-BERT model
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
# Sample sentences
sentences = [
"The cat sat on the mat",
"A feline rested on the rug",
"The dog chased the ball"
]
# Generate sentence embeddings
embeddings = model.encode(sentences)
print(f"Shape of sentence embeddings: {embeddings.shape}")
print(f"Embedding for the first sentence: {embeddings[0][:5]}...") # Showing first 5 dimensions
# Calculate cosine similarity between sentences
from sklearn.metrics.pairwise import cosine_similarity
similarity_matrix = cosine_similarity(embeddings)
print("Similarity matrix:")
print(similarity_matrix)Slide 8: Custom Embedding Layer in Neural Networks
In deep learning models, we often use custom embedding layers to learn task-specific embeddings for discrete objects like words or categories.
import torch
import torch.nn as nn
class SimpleEmbeddingModel(nn.Module):
def __init__(self, vocab_size, embedding_dim):
super(SimpleEmbeddingModel, self).__init__()
self.embedding = nn.Embedding(vocab_size, embedding_dim)
self.linear = nn.Linear(embedding_dim, 1)
def forward(self, x):
embedded = self.embedding(x)
return self.linear(embedded)
# Example usage
vocab_size = 1000
embedding_dim = 50
model = SimpleEmbeddingModel(vocab_size, embedding_dim)
# Sample input (batch of word indices)
input_tensor = torch.LongTensor([[1, 4, 2], [5, 3, 7]])
# Forward pass
output = model(input_tensor)
print(f"Model output shape: {output.shape}")
# Access learned embeddings
learned_embeddings = model.embedding.weight.data
print(f"Learned embedding for word 0: {learned_embeddings[0][:5]}...") # Showing first 5 dimensionsSlide 9: Word Analogies with Vector Embeddings
Vector embeddings can capture semantic relationships, allowing us to perform word analogies. For example, "king" - "man" + "woman" ≈ "queen".
import gensim.downloader as api
# Load pre-trained Word2Vec embeddings
word2vec_model = api.load('word2vec-google-news-300')
# Perform word analogy
result = word2vec_model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print(f"king - man + woman ≈ {result[0][0]} (similarity: {result[0][1]:.4f})")
# Find words similar to "computer"
similar_words = word2vec_model.most_similar("computer", topn=5)
print("Words similar to 'computer':")
for word, similarity in similar_words:
print(f"{word}: {similarity:.4f}")Slide 10: Image Embeddings with Convolutional Neural Networks
Convolutional Neural Networks (CNNs) can be used to generate embeddings for images, capturing their visual features in a compact vector representation.
import torch
import torchvision.models as models
import torchvision.transforms as transforms
from PIL import Image
# Load pre-trained ResNet model
resnet = models.resnet50(pretrained=True)
resnet.eval()
# Remove the last fully connected layer
embedding_model = torch.nn.Sequential(*list(resnet.children())[:-1])
# Prepare image transform
transform = transforms.Compose([
transforms.Resize(256),
transforms.CenterCrop(224),
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
# Load and preprocess an image
image = Image.open("sample_image.jpg")
input_tensor = transform(image).unsqueeze(0)
# Generate image embedding
with torch.no_grad():
embedding = embedding_model(input_tensor)
print(f"Image embedding shape: {embedding.shape}")
print(f"Image embedding (first 5 values): {embedding.squeeze()[:5]}")Slide 11: Real-life Example: Sentiment Analysis with Vector Embeddings
Vector embeddings are crucial for many natural language processing tasks, including sentiment analysis. Let's use pre-trained GloVe embeddings for a simple sentiment classifier.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GlobalAveragePooling1D, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
# Sample data
texts = ["I love this product", "This is terrible", "Not bad, but could be better"]
labels = [1, 0, 0.5] # 1: positive, 0: negative, 0.5: neutral
# Tokenize texts
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
padded_sequences = pad_sequences(sequences, maxlen=20)
# Load pre-trained GloVe embeddings
embeddings_index = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
for line in f:
values = line.split()
word = values[0]
coefs = np.asarray(values[1:], dtype='float32')
embeddings_index[word] = coefs
# Create embedding matrix
embedding_matrix = np.zeros((10000, 100))
for word, i in tokenizer.word_index.items():
if i < 10000:
embedding_vector = embeddings_index.get(word)
if embedding_vector is not None:
embedding_matrix[i] = embedding_vector
# Build model
model = Sequential([
Embedding(10000, 100, weights=[embedding_matrix], input_length=20, trainable=False),
GlobalAveragePooling1D(),
Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# Train model
model.fit(padded_sequences, np.array(labels), epochs=10, verbose=0)
# Make predictions
new_texts = ["This is amazing", "I hate this"]
new_sequences = tokenizer.texts_to_sequences(new_texts)
new_padded = pad_sequences(new_sequences, maxlen=20)
predictions = model.predict(new_padded)
for text, pred in zip(new_texts, predictions):
print(f"Text: '{text}', Sentiment: {pred[0]:.2f}")Slide 12: Real-life Example: Image Search with Vector Embeddings
Vector embeddings can be used to build efficient image search systems. Here's a simple example using pre-trained ResNet features.
import torch
import torchvision.models as models
import torchvision.transforms as transforms
from PIL import Image
import os
from sklearn.metrics.pairwise import cosine_similarity
# Load pre-trained ResNet model
resnet = models.resnet50(pretrained=True)
embedding_model = torch.nn.Sequential(*list(resnet.children())[:-1])
embedding_model.eval()
# Image transformation
transform = transforms.Compose([
transforms.Resize(256),
transforms.CenterCrop(224),
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
# Function to get image embedding
def get_image_embedding(image_path):
image = Image.open(image_path)
input_tensor = transform(image).unsqueeze(0)
with torch.no_grad():
embedding = embedding_model(input_tensor)
return embedding.squeeze().numpy()
# Create a database of image embeddings
image_database = {}
for image_file in os.listdir("image_database"):
if image_file.endswith((".jpg", ".png")):
image_path = os.path.join("image_database", image_file)
image_database[image_file] = get_image_embedding(image_path)
# Search for similar images
query_image = "query_image.jpg"
query_embedding = get_image_embedding(query_image)
similarities = {}
for image_name, embedding in image_database.items():
similarity = cosine_similarity(query_embedding.reshape(1, -1), embedding.reshape(1, -1))[0][0]
similarities[image_name] = similarity
# Get top 5 similar images
top_similar = sorted(similarities.items(), key=lambda x: x[1], reverse=True)[:5]
print("Top 5 similar images:")
for image_name, similarity in top_similar:
print(f"{image_name}: Similarity = {similarity:.4f}")Slide 13: Challenges and Future Directions in Vector Embedding
Vector embedding techniques continue to evolve, addressing challenges such as:
- Contextual embeddings: Capturing word meaning in different contexts.
- Multimodal embeddings: Representing different types of data (text, images, audio) in a shared embedding space.
- Efficient storage and retrieval: Handling large-scale embedding databases.
- Interpretability: Understanding the semantic meaning of embedding dimensions.
- Bias mitigation: Reducing unwanted biases in learned embeddings.
Future directions include developing more sophisticated models for context-aware embeddings and exploring unsupervised learning techniques for domain-specific embeddings.
Slide 14: Challenges and Future Directions in Vector Embedding
import numpy as np
# Simulating a contextual embedding model
def contextual_embedding(word, context):
# This is a simplified representation of how contextual embeddings might work
base_embedding = np.random.rand(100) # Base 100-dimensional embedding
context_influence = sum(hash(c) for c in context) % 100 # Context-based modification
return base_embedding + np.roll(base_embedding, context_influence) * 0.1
# Example usage
word = "bank"
context1 = "I went to the bank to deposit money"
context2 = "The river bank was covered in flowers"
embedding1 = contextual_embedding(word, context1)
embedding2 = contextual_embedding(word, context2)
print(f"Cosine similarity: {np.dot(embedding1, embedding2) / (np.linalg.norm(embedding1) * np.linalg.norm(embedding2)):.4f}")Slide 15: Ethical Considerations in Vector Embeddings
As vector embeddings become more prevalent in AI systems, it's crucial to consider their ethical implications:
- Privacy: Embeddings can potentially encode sensitive information about individuals.
- Fairness: Embeddings may perpetuate or amplify societal biases present in training data.
- Transparency: The black-box nature of embeddings can make it difficult to explain model decisions.
- Misuse: Embeddings could be used for surveillance or other potentially harmful applications.
Researchers and practitioners must address these concerns to ensure responsible development and deployment of embedding-based systems.
Slide 16: Ethical Considerations in Vector Embeddings
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
# Simulating word embeddings
embeddings = {
"man": np.array([0.2, 0.3, 0.1]),
"woman": np.array([0.2, 0.3, -0.1]),
"doctor": np.array([0.4, 0.2, 0.1]),
"nurse": np.array([0.4, 0.2, -0.1])
}
# Check for gender bias
def check_occupation_bias(occupation1, occupation2):
man_occ1 = cosine_similarity([embeddings["man"] + embeddings[occupation1]], [embeddings[occupation1]])[0][0]
woman_occ1 = cosine_similarity([embeddings["woman"] + embeddings[occupation1]], [embeddings[occupation1]])[0][0]
man_occ2 = cosine_similarity([embeddings["man"] + embeddings[occupation2]], [embeddings[occupation2]])[0][0]
woman_occ2 = cosine_similarity([embeddings["woman"] + embeddings[occupation2]], [embeddings[occupation2]])[0][0]
print(f"{occupation1.capitalize()} bias: {man_occ1 - woman_occ1:.4f}")
print(f"{occupation2.capitalize()} bias: {man_occ2 - woman_occ2:.4f}")
check_occupation_bias("doctor", "nurse")Slide 17: Additional Resources
For those interested in diving deeper into vector embeddings, here are some valuable resources:
- "Efficient Estimation of Word Representations in Vector Space" by Mikolov et al. (2013) ArXiv: https://arxiv.org/abs/1301.3781
- "GloVe: Global Vectors for Word Representation" by Pennington et al. (2014) ArXiv: https://arxiv.org/abs/1405.4053
- "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" by Devlin et al. (2018) ArXiv: https://arxiv.org/abs/1810.04805
- "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks" by Reimers and Gurevych (2019) ArXiv: https://arxiv.org/abs/1908.10084
These papers provide in-depth explanations of various embedding techniques and their applications in natural language processing and machine learning.