Vector Embedding in Python Representing Semantic Relationships

Slide 1: Introduction to Vector Embedding

Vector embedding is a technique used to represent discrete objects as continuous vectors in a high-dimensional space. This method allows us to capture semantic relationships and similarities between objects, making it easier for machine learning models to process and analyze data.

import numpy as np

# Example: Word embedding
word_embedding = {
    "cat": np.array([0.2, 0.5, -0.3]),
    "dog": np.array([0.1, 0.4, -0.2]),
    "fish": np.array([-0.3, 0.1, 0.6])
}

# Calculate cosine similarity between "cat" and "dog"
cat_dog_similarity = np.dot(word_embedding["cat"], word_embedding["dog"]) / (
    np.linalg.norm(word_embedding["cat"]) * np.linalg.norm(word_embedding["dog"])
)

print(f"Similarity between 'cat' and 'dog': {cat_dog_similarity:.4f}")
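
The same calculation can be wrapped in a reusable function and used to rank neighbors. The sketch below is plain Python (no NumPy) and reuses the toy vectors from the example above; the helper names `cosine_similarity` and `nearest` are illustrative, not part of any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Same toy vectors as in the example above
toy_embedding = {
    "cat": [0.2, 0.5, -0.3],
    "dog": [0.1, 0.4, -0.2],
    "fish": [-0.3, 0.1, 0.6],
}

def nearest(word, embedding):
    """Rank all other words by cosine similarity to `word`, most similar first."""
    query = embedding[word]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in embedding.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

for other, score in nearest("cat", toy_embedding):
    print(f"cat vs {other}: {score:.4f}")
```

With these hand-picked vectors, "dog" ranks above "fish" for the query "cat", which is the kind of neighborhood structure a trained embedding is expected to show.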

Slide 2: Word2Vec: A Popular Vector Embedding Technique

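Word2Vec is a widely used method for creating word embeddings. It learns vector representations by training a shallow neural network on word-context pairs drawn from a sliding context window. There are two main architectures: Continuous Bag of Words (CBOW), which predicts a word from its surrounding words, and Skip-gram, which predicts the surrounding words from a word.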

from gensim.models import Word2Vec

# Sample corpus
corpus = [
    ["I", "love", "machine", "learning"],
    ["Vector", "embeddings", "are", "useful"],
    ["Natural", "language", "processing", "is", "fascinating"]
]

# Train Word2Vec model
model = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, workers=4)

# Get vector for a word
vector = model.wv["learning"]
print(f"Vector for 'learning': {vector[:5]}...")  # Showing first 5 dimensions

# Find similar words
similar_words = model.wv.most_similar("machine", topn=3)
print("Words similar to 'machine':", similar_words)
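
To make the context window concrete, here is a minimal sketch of how Skip-gram style (center, context) training pairs could be generated from one sentence. The function name `skipgram_pairs` is hypothetical, and real implementations add details such as subsampling and negative sampling that are omitted here.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs for each token and every neighbor within `window`."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

sentence = ["I", "love", "machine", "learning"]
pairs = list(skipgram_pairs(sentence, window=1))
print(pairs)
```

In gensim, the architecture is chosen with the `sg` parameter of `Word2Vec`: `sg=1` selects Skip-gram, while the default `sg=0` selects CBOW.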

Slide 3: TF-IDF Vectorization

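Term Frequency-Inverse Document Frequency (TF-IDF) is a numerical statistic that reflects how important a word is to a document relative to a collection of documents: the weight rises with the word's frequency in the document and falls with the number of documents that contain it. It's commonly used for text classification and information retrieval.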

from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
documents = [
    "The cat sat on the mat",
    "The dog chased the cat",
    "The bird flew over the mat"
]

# Create TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Transform documents to TF-IDF vectors
tfidf_matrix = vectorizer.fit_transform(documents)

# Get feature names (words)
feature_names = vectorizer.get_feature_names_out()

# Print TF-IDF scores for the first document
first_document_vector = tfidf_matrix[0]
for idx, score in zip(first_document_vector.indices, first_document_vector.data):
    print(f"{feature_names[idx]}: {score:.4f}")
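
The weighting can also be computed by hand. The sketch below assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which is intended to mirror the `TfidfVectorizer` defaults; tokenization is simplified to lowercasing and splitting on whitespace.

```python
import math
from collections import Counter

documents = [
    "The cat sat on the mat",
    "The dog chased the cat",
    "The bird flew over the mat",
]
tokenized = [doc.lower().split() for doc in documents]
n_docs = len(tokenized)

# Document frequency: the number of documents in which each term appears
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """TF-IDF weights for one document, L2-normalized."""
    tf = Counter(tokens)
    # Raw term count times smoothed IDF: ln((1 + n) / (1 + df)) + 1
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, count in tf.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

first = tfidf(tokenized[0])
for term, weight in sorted(first.items()):
    print(f"{term}: {weight:.4f}")
```

In the first document, "sat" outweighs "cat" because "cat" also appears in the second document, while "the" still scores highly because it occurs twice and the smoothed IDF never drops below 1.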

Slide 4: Doc2Vec: Document Embedding

Doc2Vec extends the Word2Vec model to learn vector representations for entire documents. It's useful for tasks like document classification and clustering.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Sample documents
documents = [
    "The quick brown fox jumps over the lazy dog",
    "The cat sat on the mat",
    "The bird flew over the rainbow"
]

# Prepare tagged documents
tagged_docs = [TaggedDocument(doc.split(), [i]) for i, doc in enumerate(documents)]

# Train Doc2Vec model
model = Doc2Vec(vector_size=50, min_count=1, epochs=40)
model.build_vocab(tagged_docs)
model.train(tagged_docs, total_examples=model.corpus_count, epochs=model.epochs)

# Infer vector for a new document
new_doc = "The quick brown fox"
inferred_vector = model.infer_vector(new_doc.split())
print(f"Inferred vector for '{new_doc}': {inferred_vector[:5]}...")  # Showing first 5 dimensions

# Find similar documents
similar_docs = model.dv.most_similar([inferred_vector])
print("Similar documents:", similar_docs)

Slide 5: BERT Embeddings

BERT (Bidirectional Encoder Representations from Transformers) provides contextualized word embeddings, capturing the context-dependent meaning of words in sentences.

from transformers import BertTokenizer, BertModel
import torch

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Prepare input
text = "Vector embeddings are powerful for NLP tasks."
inputs = tokenizer(text, return_tensors="pt")

# Generate BERT embeddings
with torch.no_grad():
    outputs = model(**inputs)

# Get the embeddings of the last layer
last_hidden_states = outputs.last_hidden_state
print(f"Shape of BERT embeddings: {last_hidden_states.shape}")
print(f"Embedding for the first token: {last_hidden_states[0][0][:5]}...")  # Showing first 5 dimensions

Slide 6: Visualization of Vector Embeddings

Visualizing high-dimensional embeddings can help understand the relationships between objects. t-SNE (t-Distributed Stochastic Neighbor Embedding) is a popular technique for this purpose.

import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

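# Generate random placeholder embeddings (no semantic structure; substitute real vectors to see clusters)
words = ["cat", "dog", "fish", "bird", "lion", "tiger", "shark", "whale"]
rng = np.random.default_rng(42)
embeddings = rng.random((len(words), 50))  # 50-dimensional embeddings

# Apply t-SNE (perplexity must be smaller than the number of samples; the default of 30 fails for 8 points)
tsne = TSNE(n_components=2, perplexity=3, random_state=42)
reduced_embeddings = tsne.fit_transform(embeddings)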

# Plot the reduced embeddings
plt.figure(figsize=(10, 8))
for i, word in enumerate(words):
    x, y = reduced_embeddings[i]
    plt.scatter(x, y)
    plt.annotate(word, (x, y))

plt.title("t-SNE visualization of word embeddings")
plt.xlabel("Dimension 1")
plt.ylabel("Dimension 2")
plt.show()

Slide 7: Sentence Embeddings with Transformers

Sentence embeddings represent entire sentences as vectors, capturing their semantic meaning. We can use pre-trained models like Sentence-BERT for this purpose.

from sentence_transformers import SentenceTransformer

# Load pre-trained Sentence-BERT model
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Sample sentences
sentences = [
    "The cat sat on the mat",
    "A feline rested on the rug",
    "The dog chased the ball"
]

# Generate sentence embeddings
embeddings = model.encode(sentences)

print(f"Shape of sentence embeddings: {embeddings.shape}")
print(f"Embedding for the first sentence: {embeddings[0][:5]}...")  # Showing first 5 dimensions

# Calculate cosine similarity between sentences
from sklearn.metrics.pairwise import cosine_similarity

similarity_matrix = cosine_similarity(embeddings)
print("Similarity matrix:")
print(similarity_matrix)
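
Many sentence-embedding models derive the sentence vector by averaging the token vectors produced by the underlying transformer while ignoring padding positions. The sketch below illustrates that masked mean pooling step on made-up 2-dimensional token vectors; whether a particular pre-trained model uses this exact pooling is an assumption to check against its configuration.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, skipping padding."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            total = [t + v for t, v in zip(total, vec)]
            count += 1
    return [t / count for t in total]

# Hypothetical token vectors: three real tokens followed by one padding slot
token_vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]

sentence_vector = mean_pool(token_vectors, attention_mask)
print(sentence_vector)  # [3.0, 4.0]
```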

Slide 8: Custom Embedding Layer in Neural Networks

In deep learning models, we often use custom embedding layers to learn task-specific embeddings for discrete objects like words or categories.

import torch
import torch.nn as nn

class SimpleEmbeddingModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim, 1)
        
    def forward(self, x):
        embedded = self.embedding(x)
        return self.linear(embedded)

# Example usage
vocab_size = 1000
embedding_dim = 50
model = SimpleEmbeddingModel(vocab_size, embedding_dim)

# Sample input (batch of word indices)
input_tensor = torch.LongTensor([[1, 4, 2], [5, 3, 7]])

# Forward pass
output = model(input_tensor)
print(f"Model output shape: {output.shape}")

# Access the embedding table (randomly initialized here, since the model has not been trained)
learned_embeddings = model.embedding.weight.detach()
print(f"Embedding for word 0: {learned_embeddings[0][:5]}...")  # Showing first 5 dimensions
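
Conceptually, an embedding layer is a trainable lookup table: selecting row `i` gives the same result as multiplying a one-hot vector for `i` by the weight matrix. The sketch below shows that equivalence in plain Python with a small made-up table.

```python
# Hypothetical embedding table: 4 tokens, 3 dimensions each
embedding_table = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

def lookup(index):
    """What an embedding layer does: return row `index` of the table."""
    return embedding_table[index]

def one_hot_matmul(index):
    """The same result via a one-hot vector multiplied by the table."""
    rows = len(embedding_table)
    dim = len(embedding_table[0])
    one_hot = [1.0 if i == index else 0.0 for i in range(rows)]
    return [
        sum(one_hot[i] * embedding_table[i][d] for i in range(rows))
        for d in range(dim)
    ]

print(lookup(2))
print(one_hot_matmul(2))
```

The direct lookup avoids the cost of the matrix product, which is why frameworks implement embeddings as indexing rather than as a dense layer over one-hot inputs.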

Slide 9: Word Analogies with Vector Embeddings

Vector embeddings can capture semantic relationships, allowing us to perform word analogies. For example, "king" - "man" + "woman" ≈ "queen".

import gensim.downloader as api

# Load pre-trained Word2Vec embeddings
word2vec_model = api.load('word2vec-google-news-300')

# Perform word analogy
result = word2vec_model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print(f"king - man + woman ≈ {result[0][0]} (similarity: {result[0][1]:.4f})")

# Find words similar to "computer"
similar_words = word2vec_model.most_similar("computer", topn=5)
print("Words similar to 'computer':")
for word, similarity in similar_words:
    print(f"{word}: {similarity:.4f}")
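
The arithmetic behind an analogy query can be sketched without downloading a model. The 2-dimensional vectors below are invented for illustration (one axis loosely standing for royalty, the other for gender), and the `analogy` helper is hypothetical; like gensim's `most_similar`, it excludes the input words from the candidates.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Invented vectors: axis 0 ~ "royalty", axis 1 ~ "gender"
vectors = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.1],
}

def analogy(a, b, c, vectors):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding the inputs."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

print(analogy("man", "king", "woman", vectors))  # queen
```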

Slide 10: Image Embeddings with Convolutional Neural Networks

Convolutional Neural Networks (CNNs) can be used to generate embeddings for images, capturing their visual features in a compact vector representation.

import torch
import torchvision.models as models
import torchvision.transforms as transforms
from PIL import Image

# Load pre-trained ResNet model (the weights= argument replaces the deprecated pretrained=True)
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
resnet.eval()

# Remove the last fully connected layer
embedding_model = torch.nn.Sequential(*list(resnet.children())[:-1])

# Prepare image transform
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Load and preprocess an image (convert to RGB so grayscale or RGBA files still yield 3 channels)
image = Image.open("sample_image.jpg").convert("RGB")
input_tensor = transform(image).unsqueeze(0)

# Generate image embedding
with torch.no_grad():
    embedding = embedding_model(input_tensor)

print(f"Image embedding shape: {embedding.shape}")
print(f"Image embedding (first 5 values): {embedding.squeeze()[:5]}")

Slide 11: Real-life Example: Sentiment Analysis with Vector Embeddings

Vector embeddings are crucial for many natural language processing tasks, including sentiment analysis. Let's use pre-trained GloVe embeddings for a simple sentiment classifier.

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GlobalAveragePooling1D, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Sample data
texts = ["I love this product", "This is terrible", "Not bad, but could be better"]
labels = [1, 0, 0.5]  # 1: positive, 0: negative, 0.5: neutral

# Tokenize texts
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
padded_sequences = pad_sequences(sequences, maxlen=20)

# Load pre-trained GloVe embeddings
embeddings_index = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

# Create embedding matrix
embedding_matrix = np.zeros((10000, 100))
for word, i in tokenizer.word_index.items():
    if i < 10000:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

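# Build model
# A Constant initializer loads the GloVe matrix; this form avoids the weights= and
# input_length= arguments, which some newer Keras releases deprecate or reject
from tensorflow.keras.initializers import Constant

model = Sequential([
    Embedding(10000, 100, embeddings_initializer=Constant(embedding_matrix), trainable=False),
    GlobalAveragePooling1D(),
    Dense(1, activation='sigmoid')
])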

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train model
model.fit(padded_sequences, np.array(labels), epochs=10, verbose=0)

# Make predictions
new_texts = ["This is amazing", "I hate this"]
new_sequences = tokenizer.texts_to_sequences(new_texts)
new_padded = pad_sequences(new_sequences, maxlen=20)
predictions = model.predict(new_padded)

for text, pred in zip(new_texts, predictions):
    print(f"Text: '{text}', Sentiment: {pred[0]:.2f}")

Slide 12: Real-life Example: Image Search with Vector Embeddings

Vector embeddings can be used to build efficient image search systems. Here's a simple example using pre-trained ResNet features.

import torch
import torchvision.models as models
import torchvision.transforms as transforms
from PIL import Image
import os
from sklearn.metrics.pairwise import cosine_similarity

# Load pre-trained ResNet model (the weights= argument replaces the deprecated pretrained=True)
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
embedding_model = torch.nn.Sequential(*list(resnet.children())[:-1])
embedding_model.eval()

# Image transformation
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Function to get image embedding
def get_image_embedding(image_path):
    image = Image.open(image_path).convert("RGB")  # PNG files may be RGBA or grayscale
    input_tensor = transform(image).unsqueeze(0)
    with torch.no_grad():
        embedding = embedding_model(input_tensor)
    return embedding.squeeze().numpy()

# Create a database of image embeddings
image_database = {}
for image_file in os.listdir("image_database"):
    if image_file.endswith((".jpg", ".png")):
        image_path = os.path.join("image_database", image_file)
        image_database[image_file] = get_image_embedding(image_path)

# Search for similar images
query_image = "query_image.jpg"
query_embedding = get_image_embedding(query_image)

similarities = {}
for image_name, embedding in image_database.items():
    similarity = cosine_similarity(query_embedding.reshape(1, -1), embedding.reshape(1, -1))[0][0]
    similarities[image_name] = similarity

# Get top 5 similar images
top_similar = sorted(similarities.items(), key=lambda x: x[1], reverse=True)[:5]

print("Top 5 similar images:")
for image_name, similarity in top_similar:
    print(f"{image_name}: Similarity = {similarity:.4f}")
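
When the database grows, it helps to normalize every stored vector once so that each query needs only dot products, and to keep just the best `k` results instead of sorting everything. The sketch below shows that idea in plain Python; the file names and 3-dimensional vectors are placeholders standing in for real image embeddings, and `search` is a hypothetical helper.

```python
import heapq
import math

def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# Placeholder embeddings standing in for real image vectors
database = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "forest.jpg": [0.1, 0.9, 0.1],
    "sunset.jpg": [0.8, 0.2, 0.1],
    "city.jpg": [0.0, 0.2, 0.9],
}

# Normalize once at indexing time rather than on every query
index = {name: normalize(vec) for name, vec in database.items()}

def search(query, index, k=2):
    """Return the k most similar entries as (name, similarity) pairs."""
    q = normalize(query)
    scored = ((sum(a * b for a, b in zip(q, vec)), name) for name, vec in index.items())
    return [(name, score) for score, name in heapq.nlargest(k, scored)]

results = search([1.0, 0.1, 0.0], index, k=2)
for name, score in results:
    print(f"{name}: Similarity = {score:.4f}")
```

This is still an exhaustive scan; very large collections typically rely on approximate nearest-neighbor indexes instead.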

Slide 13: Challenges and Future Directions in Vector Embedding

Vector embedding techniques continue to evolve, addressing challenges such as:

Contextual embeddings: Capturing word meaning in different contexts.
Multimodal embeddings: Representing different types of data (text, images, audio) in a shared embedding space.
Efficient storage and retrieval: Handling large-scale embedding databases.
Interpretability: Understanding the semantic meaning of embedding dimensions.
Bias mitigation: Reducing unwanted biases in learned embeddings.

Future directions include developing more sophisticated models for context-aware embeddings and exploring unsupervised learning techniques for domain-specific embeddings.

Slide 14: Challenges and Future Directions in Vector Embedding

import numpy as np

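# Simulating a contextual embedding model
def contextual_embedding(word, context):
    # Toy stand-in only: a real contextual model (e.g. BERT) computes the vector from the whole sentence
    # Seed from the word so the same word always starts from the same base embedding
    rng = np.random.default_rng(sum(ord(ch) for ch in word))
    base_embedding = rng.random(100)  # Base 100-dimensional embedding
    # Use character codes rather than hash(), which is salted differently in each Python process
    context_influence = sum(ord(c) for c in context) % 100  # Context-based modification
    return base_embedding + np.roll(base_embedding, context_influence) * 0.1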

# Example usage
word = "bank"
context1 = "I went to the bank to deposit money"
context2 = "The river bank was covered in flowers"

embedding1 = contextual_embedding(word, context1)
embedding2 = contextual_embedding(word, context2)

print(f"Cosine similarity: {np.dot(embedding1, embedding2) / (np.linalg.norm(embedding1) * np.linalg.norm(embedding2)):.4f}")

Slide 15: Ethical Considerations in Vector Embeddings

As vector embeddings become more prevalent in AI systems, it's crucial to consider their ethical implications:

Privacy: Embeddings can potentially encode sensitive information about individuals.
Fairness: Embeddings may perpetuate or amplify societal biases present in training data.
Transparency: The black-box nature of embeddings can make it difficult to explain model decisions.
Misuse: Embeddings could be used for surveillance or other potentially harmful applications.

Researchers and practitioners must address these concerns to ensure responsible development and deployment of embedding-based systems.

Slide 16: Ethical Considerations in Vector Embeddings

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Simulating word embeddings
embeddings = {
    "man": np.array([0.2, 0.3, 0.1]),
    "woman": np.array([0.2, 0.3, -0.1]),
    "doctor": np.array([0.4, 0.2, 0.1]),
    "nurse": np.array([0.4, 0.2, -0.1])
}

# Check for gender bias
def check_occupation_bias(occupation1, occupation2):
    man_occ1 = cosine_similarity([embeddings["man"] + embeddings[occupation1]], [embeddings[occupation1]])[0][0]
    woman_occ1 = cosine_similarity([embeddings["woman"] + embeddings[occupation1]], [embeddings[occupation1]])[0][0]
    man_occ2 = cosine_similarity([embeddings["man"] + embeddings[occupation2]], [embeddings[occupation2]])[0][0]
    woman_occ2 = cosine_similarity([embeddings["woman"] + embeddings[occupation2]], [embeddings[occupation2]])[0][0]
    
    print(f"{occupation1.capitalize()} bias: {man_occ1 - woman_occ1:.4f}")
    print(f"{occupation2.capitalize()} bias: {man_occ2 - woman_occ2:.4f}")

check_occupation_bias("doctor", "nurse")
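
Another common way to probe this kind of bias is to project each occupation vector onto a gender direction, taken here as the normalized difference between "man" and "woman". The sketch below reuses the simulated vectors from the example above; the function name `gender_projection` is illustrative, and real audits use many word pairs rather than a single one.

```python
import math

# Same simulated embeddings as in the example above
embeddings = {
    "man": [0.2, 0.3, 0.1],
    "woman": [0.2, 0.3, -0.1],
    "doctor": [0.4, 0.2, 0.1],
    "nurse": [0.4, 0.2, -0.1],
}

# Gender direction: unit vector pointing from "woman" toward "man"
diff = [m - w for m, w in zip(embeddings["man"], embeddings["woman"])]
diff_norm = math.sqrt(sum(d * d for d in diff))
gender_direction = [d / diff_norm for d in diff]

def gender_projection(word):
    """Cosine between a word vector and the gender direction.

    Positive values lean toward "man", negative values toward "woman".
    """
    vec = embeddings[word]
    vec_norm = math.sqrt(sum(x * x for x in vec))
    return sum(x * g for x, g in zip(vec, gender_direction)) / vec_norm

for occupation in ("doctor", "nurse"):
    print(f"{occupation}: {gender_projection(occupation):+.4f}")
```

With these simulated vectors the two occupations land on opposite sides of the gender direction, which is the pattern a bias audit would flag.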

Slide 17: Additional Resources

For those interested in diving deeper into vector embeddings, here are some valuable resources:

"Efficient Estimation of Word Representations in Vector Space" by Mikolov et al. (2013) ArXiv: https://arxiv.org/abs/1301.3781
"Distributed Representations of Sentences and Documents" by Le and Mikolov (2014) ArXiv: https://arxiv.org/abs/1405.4053
"GloVe: Global Vectors for Word Representation" by Pennington et al. (2014) EMNLP: https://aclanthology.org/D14-1162/
"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" by Devlin et al. (2018) ArXiv: https://arxiv.org/abs/1810.04805
"Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks" by Reimers and Gurevych (2019) ArXiv: https://arxiv.org/abs/1908.10084

These papers provide in-depth explanations of various embedding techniques and their applications in natural language processing and machine learning.
