Kunal Patil

Data Alchemist - Because turning raw, messy data into gold-standard insights is just medieval magic.

About Me

Meet your Friendly Neighbourhood Data Wrangler, a self-proclaimed master of turning dumpster-fire datasets into "actionable insights" for business growth, because buzzwords pay the bills. I juggle Python, SQL, and overhyped machine learning/generative AI tricks, excelling at predictive modeling and statistics so dazzling they might actually mean something. With cutting-edge AI and a knack for spotting trends (or mirages), I churn out "impactful" solutions and innovative strategies that sound brilliant; results may vary. Peek at my GitHub if you're brave enough.

Technical Skills

Python
Scikit-learn
Deep Learning
NLP
Data Visualization
Data Analytics
Pandas
NumPy
Excel
Matplotlib
Seaborn
Machine Learning
TensorFlow/Keras
GenAI
LangChain
Hugging Face
SQL (MySQL, Oracle)
Git/GitHub
Streamlit

ML Projects

RAG Chatbot with PDF Uploads and Chat History

GitHub Demo
  • Project Overview: Developed a Streamlit web app for uploading PDFs and querying their content with Retrieval-Augmented Generation (RAG); a condensed sketch of the pipeline follows this list.
  • PDF Handling: Used PyPDFLoader and RecursiveCharacterTextSplitter to process and split PDFs into chunks.
  • AI Integration: Leveraged ChatGroq with selectable models (gemma2-9b-it, llama-3.1-8b-instant) for generating responses.
  • Retrieval: Employed HuggingFaceEmbeddings and Chroma for vector-based document retrieval.
  • Chat Features: Implemented session-based chat history with ChatMessageHistory for context-aware conversations.
  • UI: Designed an intuitive interface with PDF uploads, model selection, and query input.
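
The repo's exact wiring isn't reproduced here; the sketch below condenses the core load-split-embed-retrieve-generate flow under a few assumptions: current LangChain package layout (langchain_community, langchain_groq, langchain_huggingface), a GROQ_API_KEY set in the environment, and session chat history omitted for brevity.

```python
# Condensed RAG pipeline sketch; package paths and defaults are assumptions,
# not the repo's exact code. Requires GROQ_API_KEY in the environment.
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import Chroma
from langchain_groq import ChatGroq
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# 1. Load the uploaded PDF and split it into overlapping chunks.
docs = PyPDFLoader("uploaded.pdf").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200).split_documents(docs)

# 2. Embed the chunks and index them in Chroma for vector retrieval.
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
retriever = Chroma.from_documents(chunks, embeddings).as_retriever()

# 3. Answer a question using only the retrieved context.
llm = ChatGroq(model="llama-3.1-8b-instant")  # or "gemma2-9b-it"
question = "What is this document about?"
context = "\n\n".join(doc.page_content for doc in retriever.invoke(question))
answer = llm.invoke(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
print(answer.content)
```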

Data Science Salary Estimator

GitHub Demo
  • Project Overview: Developed a salary prediction tool for data science roles using Glassdoor data, focusing on how skills like Python, Excel, AWS, and Spark impact salaries.
  • Data Cleaning: Parsed salary data, extracted company ratings, and engineered features for skills, job location, company age, and job description length.
  • Exploratory Analysis: Conducted exploratory data analysis (EDA), revealing correlations between job description length and company age, and visualizing job role distributions across sectors.
  • Model Building: Evaluated Lasso Regression and RandomForestRegressor, optimized both with RandomizedSearchCV, and selected the random forest for its superior performance (MAE: 13.69); a sketch of this step follows the list.
  • Deployment: Deployed the final model on Streamlit, creating an accessible tool for salary estimation.
  • Tools Used: Leveraged pandas, numpy, sklearn, matplotlib, seaborn, and pickle for data processing, modeling, and deployment.
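
A minimal sketch of the model-building step, assuming a cleaned Glassdoor export with an avg_salary target column (the file name and column are hypothetical, and the parameter grid is illustrative):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Hypothetical cleaned dataset; 'avg_salary' stands in for the real target.
df = pd.read_csv("glassdoor_cleaned.csv")
X = pd.get_dummies(df.drop(columns=["avg_salary"]))
y = df["avg_salary"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Randomized search over an illustrative grid, scored by (negative) MAE.
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions={
        "n_estimators": list(range(50, 301, 50)),
        "max_depth": [None, 5, 10, 20],
        "max_features": ["sqrt", "log2", None],
    },
    n_iter=20,
    scoring="neg_mean_absolute_error",
    cv=3,
    random_state=42,
)
search.fit(X_train, y_train)
print("CV MAE:", -search.best_score_)
print("Held-out MAE:", mean_absolute_error(y_test, search.predict(X_test)))
```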

Customer Churn Prediction (Imbalanced Dataset)

GitHub
  • Project Overview: Developed a predictive model for customer churn on a heavily imbalanced dataset (4000 "no" vs. 521 "yes"), where the target is whether a client subscribes to a term deposit.
  • Data Insights: Conducted exploratory analysis, revealing outliers in balance and duration, job type as a key predictor, and a low 17% defaulter rate, with further details in the notebook.
  • Data Preprocessing: Transformed binary features (yes/no) to 1/0, encoded categorical variables (e.g., job, education) with OneHotEncoder, and scaled numeric features using StandardScaler.
  • Model Selection: Evaluated multiple models and selected GradientBoostingClassifier, training it on both the original imbalanced data and a balanced variant via BalancedBaggingClassifier (sketched after this list).
  • Performance Outcomes: The first model accurately identified churners (non-subscribers); the second, trained on balanced data, effectively detected subscribers despite their minority status.
  • Business Value: Provided two tailored models, allowing businesses to prioritize either churn detection or subscriber identification based on strategic needs.
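
A minimal sketch of the two-model strategy, using a synthetic stand-in for the dataset so it runs anywhere; the estimator keyword assumes a recent imbalanced-learn release (older versions call it base_estimator):

```python
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in mirroring the ~4000 "no" vs. 521 "yes" imbalance.
X, y = make_classification(n_samples=4521, weights=[0.885], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Model 1: gradient boosting on the raw, imbalanced data (strong on the majority class).
gb = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)

# Model 2: balanced bagging resamples each bootstrap to even class counts
# (better recall on the minority class).
bb = BalancedBaggingClassifier(
    estimator=GradientBoostingClassifier(random_state=42),
    n_estimators=10,
    random_state=42,
).fit(X_train, y_train)

for name, model in [("imbalanced", gb), ("balanced", bb)]:
    print(name)
    print(classification_report(y_test, model.predict(X_test)))
```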

Time Series Analysis

GitHub
  • Project Overview: Developed Jupyter notebooks in Python for time series analysis of stock prices and trading volumes for companies such as Google (GOOGL), Tesla (TSLA), Ford, and GM.
  • Data Retrieval: Utilized pandas_datareader to retrieve historical stock data, enabling detailed analysis of prices, volumes, and market trends.
  • Exploratory Analysis: Conducted initial data exploration, visualized stock price trends, and examined trading volumes to highlight significant trading days.
  • Technical Analysis: Implemented simple and exponential moving averages, calculated market capitalization, and used rolling and expanding windows to study price trends and fluctuations (see the sketch after this list).
  • Correlation & Volatility: Explored correlations among stock prices using scatter matrices, computed daily percentage changes, and assessed volatility with histograms and KDE plots.
  • Advanced Insights: Created box plots to compare stock returns, calculated cumulative returns, and performed time period-specific analyses to reveal trends during key intervals.
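
A minimal sketch of the rolling-window computations, using synthetic prices in place of the pandas_datareader download so the example runs offline:

```python
import numpy as np
import pandas as pd

# Synthetic daily closing prices standing in for the downloaded stock data.
rng = np.random.default_rng(0)
idx = pd.date_range("2020-01-01", periods=500, freq="B")
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, len(idx)))),
                  index=idx, name="Close")

sma_50 = close.rolling(window=50).mean()          # simple moving average
ema_50 = close.ewm(span=50).mean()                # exponential moving average
expanding_mean = close.expanding().mean()         # all-history running mean
daily_ret = close.pct_change()                    # daily percentage change
volatility = daily_ret.rolling(window=50).std()   # rolling volatility
cumulative = (1 + daily_ret).cumprod()            # cumulative return

print(pd.DataFrame({"SMA50": sma_50, "EMA50": ema_50, "vol": volatility}).tail())
```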

IMDB Movie Review Sentiment Analysis

GitHub
  • Project Overview: Developed a Simple Recurrent Neural Network (RNN) to classify IMDB movie reviews as positive or negative based on their text content.
  • Core Technologies: Utilized TensorFlow for model development and Streamlit for creating an interactive web application.
  • Data Preprocessing: Prepared the IMDB dataset using tokenization to break text into words and padding to standardize input lengths.
  • Model Development: Designed and trained a Simple RNN to capture sequential patterns in text for sentiment classification (a minimal sketch follows this list).
  • NLP Exploration: Investigated word embeddings to understand text representation within the model.
  • Interactive Deployment: Deployed the model via a Streamlit app, enabling users to input reviews and receive real-time sentiment predictions.
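
A minimal sketch of the classifier; vocabulary size, sequence length, and layer widths are illustrative choices, not necessarily the repo's:

```python
from tensorflow.keras.datasets import imdb
from tensorflow.keras.layers import Dense, Embedding, SimpleRNN
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.sequence import pad_sequences

vocab_size, max_len = 10000, 500  # illustrative hyperparameters

# Reviews arrive pre-tokenized as word indices; pad to a fixed length.
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=vocab_size)
X_train = pad_sequences(X_train, maxlen=max_len)
X_test = pad_sequences(X_test, maxlen=max_len)

model = Sequential([
    Embedding(vocab_size, 128),          # learned word embeddings
    SimpleRNN(128, activation="tanh"),   # captures sequential patterns
    Dense(1, activation="sigmoid"),      # positive/negative probability
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, epochs=3, batch_size=64, validation_split=0.2)
print("Test accuracy:", model.evaluate(X_test, y_test)[1])
```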

Movie Recommendation System

GitHub
  • Project Overview: Developed a content-based movie recommendation system utilizing movie data from the TMDB API and natural language processing (NLP) techniques to generate personalized recommendations.
  • Data Collection: Collected movie metadata, such as overviews, cast, and genre details, directly from the TMDB API to serve as the foundation for the recommendation system.
  • Data Preprocessing: Processed and cleaned the raw data by removing spaces and applying stemming to standardize text, ensuring consistency for analysis.
  • Feature Extraction: Employed NLP methods to extract key features from the movie data, creating unique, meaningful representations for each movie.
  • Similarity Calculation: Used cosine similarity to measure the relationships between movies based on their extracted features, enabling accurate similarity comparisons.
  • Recommendation Functionality: Built a recommendation function that takes a movie title as input and returns the most similar movies ranked by similarity score, as sketched below.
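
A minimal sketch of the similarity machinery on a toy DataFrame; the tags column and CountVectorizer settings are assumptions about the pipeline, which in the project operates on TMDB metadata after cleaning and stemming:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-in; in the project this comes from the TMDB API after
# cleaning and stemming, combined into a single 'tags' column (assumed name).
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "tags": ["space adventure hero", "space opera war", "romantic comedy city"],
})

# Bag-of-words vectors, then pairwise cosine similarity between movies.
vectors = CountVectorizer(max_features=5000, stop_words="english").fit_transform(movies["tags"])
similarity = cosine_similarity(vectors)

def recommend(title, k=5):
    """Return the k movies most similar to `title` by cosine similarity."""
    idx = movies.index[movies["title"] == title][0]
    ranked = similarity[idx].argsort()[::-1][1:k + 1]  # skip the movie itself
    return movies["title"].iloc[ranked].tolist()

print(recommend("Movie A", k=2))
```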