Production-Ready Natural Language Processing & Machine Learning Portfolio
Advanced NLP implementations spanning text analytics, machine learning classifiers, sequence modeling, and deep learning for named entity recognition
This repository demonstrates end-to-end machine learning and NLP work through eight assignments (ASN1 to ASN8) that implement algorithms from their mathematical foundations.

Core Focus Areas:

    nlp_pipeline = {
        "text_processing": ["Tokenization", "Stemming", "Lemmatization"],
        "ml_algorithms": ["Naive Bayes", "Logistic Regression"],
        "sequence_modeling": ["N-grams", "HMM", "CRF"],
        "deep_learning": ["LSTM", "Word2Vec", "NER"]
    }

THREE COMPLETE NLP SYSTEMS

▸ Corpus-Based Chatbot (TF-IDF Retrieval)
  • Custom TF-IDF implementation from scratch
  • NPS Chat corpus (~10K messages)
  • Cosine similarity-based response matching
  • Intelligent filtering (removes questions, short responses)
  • Evaluation: Engagingness 3/5, Making Sense 3/4, Fluency 4.5/5

▸ LSTM Slot Filling (ATIS Dataset)
  • Bidirectional LSTM architecture: Embedding → BiLSTM(128) → Dense
  • ATIS travel dataset: 4.4K train, 900 test sentences
  • 127 unique slot labels (locations, dates, airlines, etc.)
  • Performance: Precision 0.95, Recall 0.94, F1-Score 0.95
  • TimeDistributed output layer for sequence labeling

▸ Neural Machine Translation (German → English)
  • Seq2Seq architecture with attention mechanism
  • WMT14 dataset (de-en configuration)
  • Encoder: Embedding → LSTM with context vectors
  • Decoder: LSTM → Attention → Dense → Softmax
  • BLEU Score: 0.18 (greedy decoding)
  • 10K vocab for both German and English

Key Technologies: TensorFlow, Keras, NLTK, Hugging Face Datasets, NumPy, Pandas
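
The retrieval step of the corpus-based chatbot can be sketched as below. This is a minimal illustration, not the repository's code: the toy corpus, the `min_len` threshold, the smoothed IDF formula, and all function names are assumptions made for the example.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def vectorize(tokens, idf):
    """Sparse TF-IDF vector (dict) for one token list."""
    tf = Counter(tokens)
    total = len(tokens) or 1
    return {t: (c / total) * idf[t] for t, c in tf.items() if t in idf}

def build_tfidf(docs):
    """Return (vectors, idf) for a list of token lists."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Assumed variant: log(N / df) + 1 so terms present in every doc keep some weight.
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    return [vectorize(doc, idf) for doc in docs], idf

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def respond(query, corpus, vectors, idf, min_len=3):
    """Return the most similar corpus line, skipping questions and short lines."""
    q = vectorize(tokenize(query), idf)
    best, best_score = None, 0.0
    for text, vec in zip(corpus, vectors):
        if text.endswith("?") or len(tokenize(text)) < min_len:
            continue  # filtering step: drop questions and very short responses
        score = cosine(q, vec)
        if score > best_score:
            best, best_score = text, score
    return best

corpus = [
    "i love pizza with extra cheese",
    "where are you from?",
    "the weather is nice today",
    "ok",
]
vectors, idf = build_tfidf([tokenize(t) for t in corpus])
reply = respond("do you like pizza", corpus, vectors, idf)
```

Sparse dictionaries keep the example dependency-free; a matrix representation would be the natural choice for a corpus of roughly 10K messages.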
THREE SUMMARIZATION APPROACHES

▸ Abstractive (Encoder-Decoder with Beam Search)
  • Custom LSTM encoder-decoder architecture
  • CNN/DailyMail dataset (300K+ articles)
  • Beam search for text generation (beam width: 3)
  • ROUGE Scores: R-1: 0.25, R-2: 0.10, R-L: 0.20
  • Generate summaries of 10+ words

▸ Abstractive (Pre-trained T5)
  • T5-small model from Hugging Face
  • No training required - inference only
  • ROUGE Scores: R-1: 0.40, R-2: 0.18, R-L: 0.35
  • Superior performance vs custom encoder-decoder
  • Evaluation: Fluency 4/5, Coherence 4/5, Fact-preserving 2.4/3

▸ Extractive (PageRank Algorithm)
  • GloVe embeddings (Wikipedia 2014 + Gigaword 5)
  • Sentence ranking via NetworkX PageRank
  • BBC News Summary dataset (business category)
  • Cosine similarity for sentence comparison
  • ROUGE Scores: R-1: 0.35, R-2: 0.15, R-L: 0.30

Key Technologies: PyTorch, Transformers, NetworkX, GloVe, TorchMetrics, NLTK
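
The extractive approach (cosine similarity graph plus PageRank) can be illustrated with a dependency-free sketch. Two substitutions are made here and should be read as assumptions: bag-of-words counts stand in for the GloVe sentence embeddings, and a hand-written power iteration stands in for NetworkX PageRank. Function names and the toy sentences are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    dot = sum(c * b[t] for t, c in a.items() if t in b)
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def pagerank(weights, damping=0.85, iters=50):
    """Power iteration over a weighted adjacency matrix.
    Simplification: nodes with no outgoing weight do not redistribute rank."""
    n = len(weights)
    ranks = [1.0 / n] * n
    out_sum = [sum(row) for row in weights]
    for _ in range(iters):
        ranks = [
            (1 - damping) / n
            + damping * sum(ranks[i] * weights[i][j] / out_sum[i]
                            for i in range(n) if out_sum[i] > 0)
            for j in range(n)
        ]
    return ranks

def summarize(sentences, k=1):
    """Pick the k highest-ranked sentences, returned in document order."""
    bags = [Counter(s.lower().split()) for s in sentences]  # stand-in for GloVe vectors
    n = len(sentences)
    weights = [[cosine(bags[i], bags[j]) if i != j else 0.0 for j in range(n)]
               for i in range(n)]
    ranks = pagerank(weights)
    top = sorted(range(n), key=lambda i: ranks[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]

sentences = [
    "profits rose at the bank",
    "the bank said profits rose sharply",
    "it rained in paris",
]
summary = summarize(sentences, k=2)
```

Sentences that share vocabulary reinforce each other's rank, so the unrelated third sentence is left out of the two-sentence summary.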
SEMANTIC UNDERSTANDING & ROLE LABELING

▸ Word Sense Disambiguation
  • Simplified Lesk Algorithm: Overlap(C, D) = |C ∩ D|
  • Most Frequent Sense baseline: F-Score 0.54
  • Lesk with gloss overlap: F-Score 0.48
  • BiLSTM neural approach: F-Score 0.59 (best performance)
  • SemCor corpus evaluation (50 test sentences)

▸ Semantic Role Labeling
  • LSTM architecture: Word(100D) + Predicate(10D) → LSTM(128)
  • OntoNotes v5 dataset for SRL
  • Identifies predicate-argument structures
  • Performance: Precision 0.85, Recall 0.82, F1-Score 0.83
  • Handles complex argument types (A0, A1, AM-TMP, etc.)

Key Technologies: NLTK, WordNet, TensorFlow/Keras, BiLSTM, OntoNotes
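
The overlap score Overlap(C, D) = |C ∩ D| behind Simplified Lesk can be sketched without WordNet. The two-sense inventory, the glosses, and the stopword list below are hypothetical stand-ins for WordNet data; the fallback to the first listed sense mirrors the Most Frequent Sense baseline.

```python
STOPWORDS = {"a", "an", "the", "of", "in", "on", "at", "to", "and", "or", "that", "is", "we"}

def content_words(text):
    return {w for w in text.lower().split() if w not in STOPWORDS}

def simplified_lesk(context, senses):
    """senses: list of (sense_id, gloss), most frequent sense first.
    Returns the sense whose gloss D shares the most words with context C."""
    c = content_words(context)
    best_id, best_overlap = senses[0][0], 0   # default: most frequent sense
    for sense_id, gloss in senses:
        overlap = len(c & content_words(gloss))   # Overlap(C, D) = |C ∩ D|
        if overlap > best_overlap:
            best_id, best_overlap = sense_id, overlap
    return best_id

# Hypothetical sense inventory for illustration only.
BANK_SENSES = [
    ("bank.financial", "an institution that accepts deposits and lends money"),
    ("bank.river", "sloping land beside a river or lake"),
]

river = simplified_lesk("we sat on the bank of the river", BANK_SENSES)
fallback = simplified_lesk("the bank was closed", BANK_SENSES)
```

When no gloss overlaps the context, the sketch returns the first sense, which is one reason a Lesk variant can score below the Most Frequent Sense baseline on short contexts.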
PARSING ALGORITHMS & SYNTACTIC ANALYSIS

▸ Constituency Tree Visualization
  • Built parse trees using production rules
  • NLTK tree.draw() for graphical representation
  • Demonstrated S → VP, VP → NP V PP derivations

▸ CKY Parsing Algorithm
  • Full implementation from Jurafsky & Martin Section 13.4
  • Chomsky Normal Form conversion (5,517 → 13,500 rules)
  • Back-pointer tracking for parse tree reconstruction
  • Handles ambiguous grammars with multiple parse outputs

▸ Dependency Parsing with Stanford CoreNLP
  • NLTK CoreNLP interface integration
  • CoNLL format output (word, POS, head, relation)
  • Server-based parsing on port 9000

▸ Ambiguous Sentence Analysis
  • "Flying planes can be dangerous" - gerund vs adjective
  • "Amid the chaos I saw her duck" - noun vs verb
  • Parser limitation analysis

Key Technologies: NLTK, Stanford CoreNLP, CKY Algorithm, CFG, Chomsky Normal Form
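
CKY chart filling with back-pointers can be sketched on a toy grammar. The grammar below is a hypothetical five-word example already in Chomsky Normal Form, and the sketch keeps only the first back-pointer per cell entry, so it recovers one tree rather than every parse of an ambiguous sentence.

```python
from collections import defaultdict

# Hypothetical toy grammar in Chomsky Normal Form.
BINARY = {                      # (B, C) -> {A} for rules A -> B C
    ("NP", "VP"): {"S"},
    ("Det", "N"): {"NP"},
    ("V", "NP"): {"VP"},
}
LEXICAL = {                     # word -> {A} for rules A -> word
    "the": {"Det"}, "dog": {"N"}, "cat": {"N"}, "chased": {"V"},
}

def cky_parse(words):
    """Fill chart[(i, j)][A] with a back-pointer for A spanning words[i:j]."""
    n = len(words)
    chart = defaultdict(dict)
    for i, w in enumerate(words):
        for a in LEXICAL.get(w, ()):
            chart[(i, i + 1)][a] = w
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):              # split point
                for b in chart[(i, k)]:
                    for c in chart[(k, j)]:
                        for a in BINARY.get((b, c), ()):
                            chart[(i, j)].setdefault(a, (k, b, c))
    return chart

def build_tree(chart, i, j, label):
    """Follow back-pointers to rebuild a nested-tuple parse tree."""
    bp = chart[(i, j)][label]
    if isinstance(bp, str):
        return (label, bp)
    k, b, c = bp
    return (label, build_tree(chart, i, k, b), build_tree(chart, k, j, c))

words = "the dog chased the cat".split()
chart = cky_parse(words)
tree = build_tree(chart, 0, len(words), "S")
```

Storing a list of back-pointers per entry instead of a single one would be the usual way to enumerate all parses of an ambiguous sentence.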
DEEP LEARNING FOR SEQUENCE LABELING

▸ TF-IDF Vectorization & Cosine Similarity
  • Custom implementation from scratch
  • Processed 1,000 documents with 5,847 unique tokens
  • Achieved semantic similarity scoring on sentence pairs

▸ Positive Pointwise Mutual Information (PPMI)
  • Word association discovery through co-occurrence analysis
  • Implemented PMI calculation with probability estimation
  • Identified meaningful collocations in natural text

▸ LSTM-based Named Entity Recognition
  • 3-layer LSTM architecture with Word2Vec embeddings (300D)
  • Trained on CoNLL2003 dataset (5,000 samples)
  • BIO tagging scheme for 4 entity types (PER, ORG, LOC, MISC)
  • Model Performance: 94.2% accuracy, 86.6% F1-score

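
The PPMI computation can be sketched as follows, using PMI(w, c) = log2(P(w, c) / (P(w) P(c))) clipped at zero. The symmetric context window, its size, and the toy sentences are assumptions for illustration; the repository's windowing may differ.

```python
import math
from collections import Counter

def cooccurrence(sentences, window=1):
    """Count (word, context) pairs within a symmetric window."""
    counts = Counter()
    for tokens in sentences:
        for i, w in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(w, tokens[j])] += 1
    return counts

def ppmi(counts):
    """PPMI(w, c) = max(0, log2(P(w, c) / (P(w) * P(c))))."""
    total = sum(counts.values())
    row, col = Counter(), Counter()
    for (w, c), n in counts.items():
        row[w] += n
        col[c] += n
    return {
        (w, c): max(0.0, math.log2((n * total) / (row[w] * col[c])))
        for (w, c), n in counts.items()
    }

sentences = [
    ["new", "york", "city"],
    ["new", "york", "times"],
    ["the", "city", "sleeps"],
]
scores = ppmi(cooccurrence(sentences, window=1))
```

On this toy data the pair ("new", "york") scores log2(3), reflecting that the two words co-occur three times more often than independence would predict.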
Technical Implementation:

Architecture Design:

Input (100 tokens)
→ Embedding(300D Word2Vec)
→ LSTM(128, dropout=0.2)
→ LSTM(64, dropout=0.2)
→ LSTM(32, dropout=0.2)
→ Dense(64, ReLU)
→ Softmax(9 classes)

Key Technologies: TensorFlow, Keras, Gensim (Word2Vec), Hugging Face Datasets, NumPy, Pandas
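
The 9 output classes follow from the BIO scheme: a B- and an I- tag for each of the 4 entity types, plus O. The sketch below builds that label set and decodes a tag sequence into entity spans; `decode_spans` and the example sentence are hypothetical and only illustrate the tagging convention, not the model itself.

```python
ENTITY_TYPES = ["PER", "ORG", "LOC", "MISC"]
# "O" plus B-/I- for each type: 1 + 2 * 4 = 9 labels.
LABELS = ["O"] + [f"{prefix}-{t}" for t in ENTITY_TYPES for prefix in ("B", "I")]

def decode_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) spans."""
    spans, current, kind = [], [], None
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != kind)
        if starts_new:
            if current:
                spans.append((" ".join(current), kind))
            current, kind = [token], tag[2:]
        elif tag.startswith("I-"):
            current.append(token)          # continue the open entity
        else:                              # "O" closes any open entity
            if current:
                spans.append((" ".join(current), kind))
            current, kind = [], None
    if current:
        spans.append((" ".join(current), kind))
    return spans

tokens = ["Barack", "Obama", "visited", "New", "York"]
tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]
entities = decode_spans(tokens, tags)
```

An I- tag that does not continue an entity of the same type is treated as the start of a new span, which is one common way to tolerate inconsistent model output.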
[Source Code](ASN3/Assignment%203.py) | Corpus
STATISTICAL LANGUAGE MODELING & SEQUENCE LABELING

▸ Bigram Language Model
  • Built n-gram model from The Great Gatsby corpus
  • Conditional probability: p(w_i|w_{i-1}) calculation
  • Text generation with top-10 candidate sampling
  • Perplexity evaluation: 14.56 (lower is better)

▸ Hidden Markov Model (HMM) POS Tagging
  • Full HMM implementation with Viterbi decoding
  • Transition matrix A (tag→tag) and emission matrix B (tag→word)
  • Penn Treebank dataset (3,914 sentences, 80/20 split)
  • Achieved 91.25% accuracy on sequence labeling

▸ Conditional Random Fields (CRF) POS Tagging
  • Discriminative model with rich feature engineering
  • Features: word properties, character n-grams, contextual info
  • Achieved 95.20% accuracy (+3.95 percentage points over HMM)
  • Production integration with sklearn-crfsuite

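
A bigram model with top-N sampling and perplexity can be sketched as below. The add-k smoothing, the fixed random seed, and the nine-word toy text are assumptions for the example; the repository's estimator and corpus handling may differ.

```python
import math
import random
from collections import Counter

def train_bigram(tokens):
    unigrams = Counter(tokens[:-1])            # counts of each word as a history
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def prob(w_prev, w, unigrams, bigrams, vocab_size, k=1.0):
    """p(w | w_prev) with add-k smoothing (an assumption of this sketch)."""
    return (bigrams[(w_prev, w)] + k) / (unigrams[w_prev] + k * vocab_size)

def perplexity(tokens, unigrams, bigrams, vocab_size, k=1.0):
    pairs = list(zip(tokens, tokens[1:]))
    log_sum = sum(math.log(prob(p, w, unigrams, bigrams, vocab_size, k))
                  for p, w in pairs)
    return math.exp(-log_sum / len(pairs))

def generate(start, bigrams, length=10, top_n=10, seed=0):
    """Sample each next word from the top_n most frequent continuations."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        candidates = [(w, c) for (p, w), c in bigrams.items() if p == out[-1]]
        if not candidates:
            break
        candidates.sort(key=lambda wc: wc[1], reverse=True)
        words, counts = zip(*candidates[:top_n])
        out.append(rng.choices(words, weights=counts, k=1)[0])
    return out

tokens = "the cat sat on the mat the cat ate".split()
vocab = set(tokens)
unigrams, bigrams = train_bigram(tokens)
pp = perplexity(tokens, unigrams, bigrams, len(vocab))
sample = generate("the", bigrams, length=6)
```

Perplexity is the exponential of the average negative log-probability per bigram, so it can be read as the effective number of equally likely next words.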
Comparative Analysis:
| Model | Accuracy | Approach | Key Advantage |
|---|---|---|---|
| HMM + Viterbi | 91.25% | Generative | Fast inference, interpretable |
| CRF | 95.20% | Discriminative | Rich features, better accuracy |
Key Technologies: NLTK, sklearn-crfsuite, NumPy, Penn Treebank, Dynamic Programming
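
Viterbi decoding over a transition matrix A and an emission matrix B can be sketched in log space. The three-tag model and its probabilities are invented for illustration and are not estimated from the Penn Treebank.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for obs under an HMM."""
    def lg(p):
        return math.log(p) if p > 0 else float("-inf")

    # V[t][s]: best log-probability of any path ending in state s at time t.
    V = [{s: lg(start_p[s]) + lg(emit_p[s].get(obs[0], 0.0)) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, V[t - 1][p] + lg(trans_p[p][s])) for p in states),
                key=lambda x: x[1],
            )
            V[t][s] = score + lg(emit_p[s].get(obs[t], 0.0))
            back[t][s] = prev
    # Trace back-pointers from the best final state.
    path = [max(states, key=lambda s: V[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# Hypothetical toy model: A = trans_p (tag -> tag), B = emit_p (tag -> word).
states = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.8, "NOUN": 0.1, "VERB": 0.1}
trans_p = {
    "DET": {"DET": 0.0, "NOUN": 0.9, "VERB": 0.1},
    "NOUN": {"DET": 0.1, "NOUN": 0.2, "VERB": 0.7},
    "VERB": {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1},
}
emit_p = {
    "DET": {"the": 1.0},
    "NOUN": {"dog": 0.5, "cat": 0.4, "barks": 0.1},
    "VERB": {"barks": 0.8, "dog": 0.2},
}
tags = viterbi(["the", "dog", "barks"], states, start_p, trans_p, emit_p)
```

Working with log-probabilities avoids underflow on long sentences; unseen words would still need smoothing or an unknown-word emission, which this sketch leaves out.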
[Source Code](ASN2/Assignment%202.py) | Results Summary
FINANCIAL SENTIMENT ANALYSIS WITH CUSTOM ML MODELS

▸ Naive Bayes Classifier (Generative Model)
  • Built from mathematical foundations with Laplace smoothing
  • Conditional probability: p(word|class) estimation
  • Bag-of-words feature extraction (1,452 dimensions)
  • Trained on financial phrasebank (2,264 sentences)

▸ Logistic Regression (Discriminative Model)
  • Implemented gradient descent optimization from scratch
  • Custom cross-entropy loss with numerical stability
  • Hyperparameter tuning: learning rate α ∈ [0.0001, 0.1]
  • Achieved 75.6% accuracy on 3-way sentiment classification

▸ Production Pipeline
  • Data preprocessing: tokenization, lowercasing, vectorization
  • Train/validation/test split: 60/20/20
  • Comprehensive evaluation: accuracy, precision, recall, F1-score
  • Modular OOP design with reusable classifier classes

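
Naive Bayes with Laplace smoothing estimates p(word|class) = (count(word, class) + α) / (total words in class + α·|V|). The class below is a minimal sketch under that formula; the class name, the six training phrases, and the choice to ignore out-of-vocabulary words at prediction time are assumptions for illustration.

```python
import math
from collections import Counter

class NaiveBayes:
    """Multinomial Naive Bayes over token lists, with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.vocab = {w for doc in docs for w in doc}
        self.word_counts = {c: Counter() for c in self.classes}
        for doc, y in zip(docs, labels):
            self.word_counts[y].update(doc)
        class_counts = Counter(labels)
        self.log_prior = {c: math.log(class_counts[c] / len(labels))
                          for c in self.classes}
        self.totals = {c: sum(self.word_counts[c].values()) for c in self.classes}
        return self

    def log_likelihood(self, word, c):
        num = self.word_counts[c][word] + self.alpha
        den = self.totals[c] + self.alpha * len(self.vocab)
        return math.log(num / den)

    def predict(self, doc):
        scores = {
            c: self.log_prior[c]
            + sum(self.log_likelihood(w, c) for w in doc if w in self.vocab)
            for c in self.classes
        }
        return max(scores, key=scores.get)

docs = [
    ["profit", "rose"], ["sales", "rose", "sharply"],
    ["profit", "fell"], ["loss", "widened"],
    ["meeting", "held"], ["report", "published"],
]
labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]
model = NaiveBayes(alpha=1.0).fit(docs, labels)
prediction = model.predict(["sales", "rose"])
```

Summing log-probabilities instead of multiplying probabilities keeps the scores stable for long sentences.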
Model Performance:
| Metric | Value | Details |
|---|---|---|
| Accuracy | 75.6% | 3-way classification |
| Training Epochs | 500 | Gradient descent |
| Feature Space | 1,452D | Bag-of-words |
Key Technologies: NumPy, pandas, scikit-learn (CountVectorizer), Custom Gradient Descent
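
Three-way logistic regression trained by batch gradient descent can be sketched in plain Python. The softmax subtracts the maximum logit and the loss clips probabilities, which are two common ways to get the numerical stability mentioned above; the toy one-hot features and function names are assumptions, while the learning rate of 0.1 and 500 epochs come from the figures quoted in this section.

```python
import math

def softmax(z):
    m = max(z)                                  # subtract max to avoid overflow
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(probs, y, eps=1e-12):
    return -math.log(max(probs[y], eps))        # clip to avoid log(0)

def predict_proba(W, b, x):
    return softmax([sum(w * xj for w, xj in zip(row, x)) + bc
                    for row, bc in zip(W, b)])

def train(X, y, n_classes, lr=0.1, epochs=500):
    n_features = len(X[0])
    W = [[0.0] * n_features for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        gW = [[0.0] * n_features for _ in range(n_classes)]
        gb = [0.0] * n_classes
        for x, target in zip(X, y):
            p = predict_proba(W, b, x)
            for c in range(n_classes):
                err = p[c] - (1.0 if c == target else 0.0)   # dLoss/dlogit
                gb[c] += err
                for j, xj in enumerate(x):
                    gW[c][j] += err * xj
        for c in range(n_classes):
            b[c] -= lr * gb[c] / len(X)
            for j in range(n_features):
                W[c][j] -= lr * gW[c][j] / len(X)
    return W, b

# Toy bag-of-words style features: one indicator column per class.
X = [[1, 0, 0], [0, 1, 0], [0, 0, 1]] * 2
y = [0, 1, 2] * 2
W, b = train(X, y, n_classes=3)
loss = sum(cross_entropy(predict_proba(W, b, x), t) for x, t in zip(X, y)) / len(X)
```

With weights initialized to zero the starting loss is log(3), so any value below that indicates the gradient steps are reducing the loss.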
[Source Code](ASN1/Assignment%201.py) | Corpus Data
HEALTHCARE SOCIAL MEDIA NLP PIPELINE

▸ Multi-Source Data Integration
  • Aggregated 6,045 health tweets from CNN & Fox News
  • Robust error handling with configurable data quality checks
  • Regex-based cleaning: URLs, mentions, hashtags, special chars

▸ Advanced Text Processing
  • Hierarchical tokenization: sentences → words
  • Morphological analysis: WordNet lemmatization vs Porter stemming
  • Stopword filtering: 20,586 common words removed
  • Vocabulary reduction: 8,797 → 6,345 tokens (27.9% reduction)

▸ Intelligent Spell Correction
  • Minimum Edit Distance algorithm (dynamic programming)
  • Configurable costs: insertion, deletion, substitution
  • Corpus-based suggestions with top-N ranking
  • Domain-aware corrections for health terminology

▸ Social Media Analytics
  • Hashtag extraction: 914 unique tags, 3,572 total occurrences
  • Trend analysis: #getfit, #ebola, #cancer, #flu identification
  • Frequency distribution and statistical analysis

Data Processing Metrics:
| Metric | Value | Optimization |
|---|---|---|
| Total Documents | 6,045 tweets | Multi-source integration |
| Original Vocabulary | 8,797 words | - |
| After Stopword Removal | 8,670 words | 127 words removed |
| After Stemming | 6,345 stems | 27.9% reduction |
| Unique Lemmas | 7,657 lemmas | Quality preservation |
Key Technologies: NLTK, pandas, NumPy, RegEx, Collections, Dynamic Programming
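
Minimum edit distance with configurable costs, and top-N suggestions ranked by it, can be sketched as below. The default substitution cost of 2 follows the Jurafsky & Martin textbook convention and is an assumption here, as are the function names and the small vocabulary.

```python
def min_edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=2):
    """Dynamic-programming edit distance between two strings."""
    n, m = len(source), len(target)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + del_cost        # delete all of source prefix
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + ins_cost        # insert all of target prefix
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else sub_cost
            d[i][j] = min(
                d[i - 1][j] + del_cost,
                d[i][j - 1] + ins_cost,
                d[i - 1][j - 1] + sub,
            )
    return d[n][m]

def suggest(word, vocabulary, top_n=3):
    """Rank vocabulary entries by distance to word (ties broken alphabetically)."""
    ranked = sorted(vocabulary, key=lambda v: (min_edit_distance(word, v), v))
    return ranked[:top_n]

distance = min_edit_distance("intention", "execution")
corrections = suggest("ebolla", ["ebola", "cancer", "flu", "fever"], top_n=1)
```

Passing `sub_cost=1` gives plain Levenshtein distance, so the same function covers both cost schemes.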
Algorithms Implemented:
Supervised Learning:
- Naive Bayes (generative)
- Logistic Regression (discriminative)
- Hidden Markov Models (probabilistic)
- Conditional Random Fields (discriminative)
- LSTM Neural Networks (recurrent)
Optimization:
- Gradient Descent
- Adam Optimizer
- Viterbi Decoding (dynamic programming)
- Hyperparameter Tuning
Model Evaluation:
- Cross-validation
- Accuracy, Precision, Recall, F1-score
- Confusion matrices
- Perplexity measurement
Core NLP Techniques:
Text Preprocessing:
- Tokenization (sentence & word-level)
- Normalization (lowercasing, stemming)
- Lemmatization (WordNet-based)
- Stopword removal
Feature Engineering:
- TF-IDF vectorization
- Bag-of-words representation
- Word embeddings (Word2Vec)
- Character-level features
- Contextual features
Advanced Methods:
- Named Entity Recognition (NER)
- Part-of-Speech tagging
- N-gram language models
- PPMI word associations
- Edit distance algorithms
- Core Python: Python 3.8+, NumPy, pandas, Collections, RegEx
- ML/DL Frameworks: TensorFlow 2.x, Keras, scikit-learn, sklearn-crfsuite
- NLP Libraries: NLTK, Gensim (Word2Vec), Hugging Face, spaCy-compatible
- Data & Visualization: Jupyter Notebook, Matplotlib, Seaborn, Chart.js
Built ML models from mathematical foundations, including:
- Probability theory (Bayes' theorem)
Production-ready development practices:
- Object-oriented design (modular classes)
End-to-end ML pipeline expertise:
- Data acquisition and cleaning
Python 3.8 or higher
pip package manager

    # Clone repository
    git clone https://github.com/RamenMachine/Natural-Language-Processing.git
    cd Natural-Language-Processing
    # Install dependencies
    pip install -r requirements.txt
    # Download NLTK data (first run only)
    python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')"

    # Assignment 1: Text Analytics & Spell Correction
    cd ASN1
    python "Assignment 1.py"
    # Assignment 2: Machine Learning Classifiers
    cd ../ASN2
    python "Assignment 2.py"
    # Assignment 3: N-grams & POS Tagging
    cd ../ASN3
    python "Assignment 3.py"
    # Assignment 4: Named Entity Recognition with LSTM
    cd ../ASN4
    python HW4.py

    Natural-Language-Processing/
    │
    ├── ASN1/                           # Text Analytics & Spell Correction
    │   ├── Assignment 1.py             # Main implementation
    │   ├── corpus.csv                  # Processed health tweets (6K+ records)
    │   └── Health-Tweets/              # Raw data sources (CNN, Fox News)
    │
    ├── ASN2/                           # From-Scratch ML Classifiers
    │   ├── Assignment 2.py             # Naive Bayes & Logistic Regression
    │   ├── Assignment_2_Results_Summary.md
    │   └── FinancialPhraseBank-v1.0/   # Financial sentiment dataset
    │
    ├── ASN3/                           # N-gram Text Generation & POS Tagging
    │   ├── Assignment 3.py             # Bigram model, HMM, CRF implementation
    │   └── GreatGatsby.txt             # Project Gutenberg corpus
    │
    ├── ASN4/                           # Named Entity Recognition with LSTM
    │   ├── HW4.py                      # Deep learning NER model
    │   ├── assignment4_showcase.ipynb  # Interactive visualizations
    │   ├── index.html                  # GitHub Pages demo
    │   ├── README.md                   # Project documentation
    │   └── requirements.txt            # Python dependencies
    │
    ├── ASN5/                           # Constituency & Dependency Parsing
    │   ├── assignment5.py              # CKY algorithm, constituency trees
    │   ├── dep_parser.py               # Stanford CoreNLP dependency parser
    │   ├── start_corenlp.bat           # Server startup script (Windows)
    │   ├── README.md                   # Setup instructions
    │   └── stanford-corenlp-4.5.10/    # CoreNLP installation
    │
    ├── ASN6/                           # Word Sense Disambiguation & SRL
    │   ├── assignment6.py              # Lesk algorithm, BiLSTM WSD, SRL model
    │   └── README.md                   # Project documentation
    │
    ├── ASN7/                           # NLP Toolkit (Chatbot, Slot Filling, Translation)
    │   ├── assignment7.py              # All 3 questions: Chatbot, Slot Filling, Translation
    │   ├── q1_chatbot_evaluation.txt   # Written evaluation for Q1
    │   ├── atis.train(1).csv           # ATIS training data
    │   ├── atis.val(1).csv             # ATIS validation data
    │   ├── atis.test(1).csv            # ATIS test data
    │   ├── README.md                   # Complete documentation
    │   └── requirements.txt            # Dependencies for ASN7
    │
    ├── ASN8/                           # Text Summarization
    │   ├── assignment8.py              # All coding questions (Q1, Q2, Q4)
    │   ├── ASN8.txt                    # Written analysis for Q3
    │   ├── README.md                   # Project documentation
    │   └── requirements.txt            # Dependencies for ASN8
    │
    ├── index.html                      # Main portfolio page with tabs
    ├── README.md                       # This file
    ├── requirements.txt                # Global dependencies
    └── LICENSE                         # MIT License

Mastered Core NLP Concepts:
- Statistical Language Processing
- Machine Learning Algorithms
- Deep Learning for NLP
- Feature Engineering
- Model Evaluation & Optimization

Industry-Ready Solutions:

Healthcare Analytics:
- Social media health trend monitoring
- Medical entity extraction (NER)
- Patient sentiment analysis
Financial Technology:
- Real-time sentiment classification
- Automated trading signals
- Risk assessment from news
Content & Media:
- Automated content categorization
- Text generation systems
- Information extraction pipelines
Enterprise Search:
- Semantic similarity matching
- Document retrieval optimization
- Query understanding

- From Theory to Code: Core algorithms are implemented from their mathematical foundations rather than only through library calls, which demonstrates an understanding of ML/NLP internals.
- Production Quality: Clean, modular, documented code that follows software engineering best practices.
- Quantifiable Results: Performance metrics with benchmark comparisons, including 95.2% accuracy on POS tagging and 94.2% accuracy on NER.
- Full-Stack ML: End-to-end pipeline covering data collection → preprocessing → modeling → evaluation → deployment.

Interested in discussing NLP projects, machine learning systems, or collaboration opportunities?
Open to opportunities in:

▸ Machine Learning Engineering
▸ Natural Language Processing
▸ Deep Learning Research
▸ Data Science & Analytics

Star this repository if you find it valuable for NLP/ML learning!
Built with Python, TensorFlow, NLTK, and a passion for Natural Language Processing
From Mathematical Theory → Production ML Systems → Business Impact
Copyright © 2025 | CS 421: Natural Language Processing