Hello, I'm

Shubham

I’m a data scientist and analyst.

I turn complex datasets into actionable business insights.

About Me

I’m a data analyst with 3+ years of experience transforming raw numbers into clear, impactful business stories. My work ranges from building scalable ETL pipelines to designing intuitive dashboards that guide high-stakes decisions.

I’ve worked extensively with SQL, Python, Power BI, and cloud platforms like AWS, Azure, and GCP. Whether it’s uncovering retention patterns, automating reports, or designing predictive models, I love the challenge of connecting the dots between data and business impact.

Outside of work, I’m a curious learner, an occasional open-source contributor, and someone who enjoys exploring how AI and generative models can shape the future of analytics.

Here are a few technologies I’ve been working with recently:

  • SQL & BigQuery
  • Python
  • Power BI
  • Spark
  • AWS SageMaker & Redshift
  • LangChain
  • OpenAI API
  • Hugging Face Transformers

What I do

Data Analytics & Business Intelligence

Machine Learning & Predictive Modeling

Generative AI & Natural Language Processing

Where I've Worked

  • Center for Health Informatics, UIUC
  • University of Illinois
  • Quantiphi
  • BitGenie
  • J.F. Info Systems
  • RGM Tech

Student Researcher @ Center for Health Informatics, UIUC

Aug 2025 – Present

  • Engineering predictive models to forecast national drug demand by integrating multiple datasets, improving planning accuracy and supply chain resilience across South America.
  • Partnering with the Pan American Health Organization to implement scalable, data-driven workflows that support policy decisions and efficient resource allocation.

Research Assistant @ UIUC

Aug 2025 – Present

  • Assisted in fine-tuning pre-trained LLMs, improving model accuracy and contextual understanding for historical and cultural datasets using metadata-driven pipelines that simulate diverse contexts and timeframes for improved narrative reasoning.
  • Conducted experiments with prompt engineering and retrieval-augmented generation (RAG) techniques to explore ways of enhancing response quality.
  • Helped design and run evaluation benchmarks for historical and narrative reasoning, using metrics such as BLEU and ROUGE, enabling insights into cultural patterns and advancing AI safety and explainability.
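
For readers unfamiliar with the overlap metrics mentioned above, here is a minimal sketch of a ROUGE-1 style F1 score. It is an illustration only, not the benchmark code used in the research; the function name `rouge1_f1` and the whitespace tokenization are assumptions made for this example.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference and a candidate text.

    Hypothetical helper: tokenizes on whitespace and lowercases, which is
    simpler than what production ROUGE implementations do.
    """
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: a token counts only as often as it appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

A score near 1.0 means the candidate shares most of its words with the reference. It says nothing about factual or narrative correctness, which is why such metrics are usually paired with task-specific evaluation.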

Business Data Analyst @ Quantiphi

Jul 2021 – Aug 2024

  • Automated monthly sales and product usage reporting by creating ETL scripts and scheduled workflows, reducing reporting time from 3 days to a few hours and freeing up time for strategic analysis.
  • Collaborated with business and marketing teams to understand key goals, pulled data using SQL, and visualized campaign performance using Power BI, leading to a more informed budget reallocation that improved ROI by 20%.
  • Cleaned and merged user-level, transaction, and event-tracking data from various platforms to identify user retention patterns and create executive-ready reports that highlighted drop-off points across product features.
  • Assisted the product team in deciding feature changes by running A/B test analysis, comparing group behavior, and clearly explaining the impact of proposed updates through dashboards and reports.
  • Reviewed raw data outputs daily for inconsistencies, set up alert rules for anomalies in key performance metrics, and coordinated with engineering to resolve data issues faster and reduce reporting errors.
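
As an illustration of the kind of A/B comparison mentioned in this role, the sketch below runs a two-proportion z-test on conversion counts. It is a generic textbook example and not the analysis used at Quantiphi; the function name and the sample numbers are hypothetical.

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference in conversion rates.

    Hypothetical helper: uses the pooled-variance normal approximation,
    which assumes reasonably large samples in both groups.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both groups converted at 0% or 100%: no measurable difference.
        return 0.0, 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up example: control converts 100/1000, variant converts 150/1000.
z, p = two_proportion_z_test(100, 1000, 150, 1000)
```

A small p-value suggests the difference between groups is unlikely to be noise, but the effect size and its business impact still need to be reported alongside it.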

Business Data Analyst Intern @ Quantiphi

Jan 2021 – Jul 2021

  • Analyzed churn patterns by user segment, summarizing behavioral insights in Power BI visuals that led to better targeting of win-back campaigns and improved retention by over 10% within one quarter.
  • Partnered with the analytics engineering team to clean input data, standardize naming conventions, and prepare consistent datasets used for churn prediction models, improving model quality.
  • Created plug-and-play automated monitoring views and visualization templates for business managers to track region-level KPIs and sales activities, increasing team adoption of analytics tools and reducing dependency on technical staff.

Data Science Intern @ BitGenie

Apr 2020 – Jul 2020

  • Built interactive reports and visualizations that tracked user interactions by region and time of day, which helped the design team prioritize features and improve user experience for the most active time windows.
  • Designed easy-to-use SQL queries that answered frequently asked business questions, reducing the need for engineers to pull custom reports and enabling faster decision-making by operations teams.
  • Standardized key metrics and definitions across reports to ensure consistency in tracking performance and reduce confusion among teams.

Hardware Support Engineer Intern @ J.F. Info Systems

May 2019 – Jul 2019

  • Diagnosed and eliminated high-risk malware, ransomware, and security threats across company systems, preventing potential data breaches and minimizing operational downtime.
  • Implemented critical security patches and performed strategic software upgrades to protect sensitive business data and maintain compliance with IT security standards.
  • Engineered and deployed customized workstation setups with optimized hardware and software configurations, enabling employees to operate at peak productivity from day one.

Software QA & Testing Intern @ RGM Tech

Apr 2018 – Jul 2018

  • Conducted comprehensive website testing and troubleshooting prior to deployment, ensuring seamless performance and facilitating timely bug resolution.
  • Designed detailed test plans, scenarios, and procedures to execute unit testing, optimizing browser compatibility and user experience.
  • Documented software defects in the bug-tracking system, maintained an updated defect database, and collaborated with developers to expedite fixes and improve product quality.

Other Noteworthy Projects

Python Games Collection – Classic Arcade and Utility Games in Python

A collection of classic and educational games built with Python and Pygame. Designed for beginners and hobbyists to explore game development fundamentals while coding fun, interactive experiences.

This repository includes a variety of games such as Breakout, Pong, Tetris, and text-based utilities like Compare Documents and Markov text generation. Each game is implemented in Python using Pygame, making it easy to understand, modify, and extend.
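
To give a flavor of the Markov text generation utility, here is a minimal word-level sketch. It is written for this page as an illustration and is not copied from the repository; the function names `build_chain` and `generate` are assumptions.

```python
import random
from collections import defaultdict


def build_chain(text: str, order: int = 1) -> dict:
    """Map each tuple of `order` consecutive words to the words seen after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        chain[state].append(words[i + order])
    return dict(chain)


def generate(chain: dict, length: int = 20, seed=None) -> str:
    """Random-walk the chain to produce up to `length` words."""
    rng = random.Random(seed)
    state = rng.choice(sorted(chain))  # sorted for reproducibility with a seed
    output = list(state)
    for _ in range(length - len(state)):
        followers = chain.get(state)
        if not followers:
            break  # dead end: this state was never followed by anything
        nxt = rng.choice(followers)
        output.append(nxt)
        state = state[1:] + (nxt,)
    return " ".join(output)
```

Raising `order` makes the output more coherent but closer to the source text, since each state has fewer possible successors.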

  • Python
  • Pygame
  • Game Development
  • OOP
Check it out!

A showcase of multiple classic games recreated in Python — from arcade-style challenges to text-based generators.

Social Media Clone – Full-Stack Real-Time Social Platform

A full-stack social media application built to replicate the core features of modern platforms. Includes user authentication, post creation, likes, comments, and real-time updates for an engaging interactive experience.

Developed with React, Node.js, Express, and MongoDB, it features JWT-based authentication, protected routes, WebSocket-powered real-time interactions, and a responsive UI. Designed as a learning project to practice web development.

  • React
  • Node.js
  • Express
  • MongoDB
  • Socket.io
  • JWT
Check it out!

Full-stack social media platform featuring posts, comments, likes, and real-time updates.

NLP-Based Text Summarizer with Configurable Pipelines

A Python-based application for generating concise summaries from long texts using state-of-the-art NLP models. Designed for both experimentation and production use, it supports flexible configuration and deployment options.

The project offers notebook workflows for rapid prototyping, modular Python code for integration, and Dockerized environments for consistent deployments. Users can adjust summarization parameters via a simple YAML configuration file and run the tool from notebooks, scripts, or containers.

  • Python
  • Natural Language Processing
  • Transformers
  • PyTorch
  • Docker
Check it out!

Generates concise summaries of long documents using configurable NLP pipelines.

YouTube End-to-End ETL Data Pipeline

A complete end-to-end data engineering solution that extracts, processes, and analyzes YouTube data using Python and Apache Airflow. The pipeline automates data extraction via the YouTube Data API v3, cleans and transforms it, stores it locally, and orchestrates the entire workflow with Airflow — all running locally and free of cloud costs.

Designed for both experimentation and production, the project enables configurable parameters through environment variables, robust logging, and comprehensive testing. Data analysis is provided through Jupyter notebooks, revealing insights into channel performance, engagement metrics, and content trends over time.
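
The transform step of such a pipeline could look roughly like the sketch below, which flattens one item from a YouTube Data API v3 `videos.list` response into a row ready for Pandas. This is an illustration, not the project's code: the function name `flatten_video`, the output column names, and the `engagement_rate` definition are assumptions.

```python
def flatten_video(item: dict) -> dict:
    """Flatten one `videos.list` item into a flat, typed record.

    The API returns statistics as strings, and some fields (for example
    likeCount) can be absent, so missing values default to zero here.
    """
    snippet = item.get("snippet", {})
    stats = item.get("statistics", {})
    views = int(stats.get("viewCount", 0))
    likes = int(stats.get("likeCount", 0))
    comments = int(stats.get("commentCount", 0))
    return {
        "video_id": item.get("id"),
        "title": snippet.get("title", "").strip(),
        "published_at": snippet.get("publishedAt"),
        "views": views,
        "likes": likes,
        "comments": comments,
        # Hypothetical metric: interactions per view, guarded against zero views.
        "engagement_rate": (likes + comments) / views if views else 0.0,
    }
```

In an Airflow setup, a function like this would typically sit inside the transform task, between the API extraction task and the task that writes the cleaned rows to local storage.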

  • Python
  • Apache Airflow
  • YouTube Data API v3
  • Pandas
  • ETL
Check it out!

Automated ETL pipeline for extracting and analyzing YouTube data, orchestrated with Apache Airflow.

Student Exam Performance Indicator

This project builds a machine learning pipeline to predict students' Maths scores based on demographic and academic data. It covers data ingestion, preprocessing, model training, and deployment via a Flask web app, offering real-time predictions through a user-friendly interface.

Features include automated data processing, scaling and encoding of features, model evaluation, and an interactive frontend where users input student data to get instant Maths score predictions. The project demonstrates end-to-end ML workflows combining data science with web deployment.

  • Python
  • Flask
  • Scikit-learn
  • Pandas
Check it out!

Comprehensive SQL Data Warehousing and Analytics Solution

This project demonstrates an end-to-end data warehousing approach using the medallion architecture (Bronze, Silver, Gold layers) to ingest, cleanse, transform, and present business-ready data for analytics and reporting. It emphasizes scalable, structured data pipelines with ETL/ELT processes and data cataloging.

The solution supports building dimension and fact tables optimized for BI tools, delivering actionable insights and enabling robust data-driven decision-making across enterprises.
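
The Bronze, Silver, Gold layering can be sketched in a few lines using Python's built-in `sqlite3` module. This is a toy illustration under stated assumptions, not the project's actual schema: the table names, columns, and sample rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Bronze: raw data exactly as ingested (untyped text, duplicates, gaps).
cur.execute("CREATE TABLE bronze_sales (order_id TEXT, customer TEXT, amount TEXT)")
cur.executemany(
    "INSERT INTO bronze_sales VALUES (?, ?, ?)",
    [("1", " alice ", "10.50"), ("2", "BOB", "20"), ("2", "BOB", "20"), ("3", "alice", None)],
)

# Silver: cleansed, typed, and de-duplicated records.
cur.execute("""
    CREATE TABLE silver_sales AS
    SELECT DISTINCT
        CAST(order_id AS INTEGER) AS order_id,
        LOWER(TRIM(customer))     AS customer,
        CAST(amount AS REAL)      AS amount
    FROM bronze_sales
    WHERE amount IS NOT NULL
""")

# Gold: business-ready aggregate that a BI tool can query directly.
cur.execute("""
    CREATE TABLE gold_customer_revenue AS
    SELECT customer, COUNT(*) AS orders, SUM(amount) AS revenue
    FROM silver_sales
    GROUP BY customer
""")

gold = cur.execute(
    "SELECT customer, orders, revenue FROM gold_customer_revenue ORDER BY customer"
).fetchall()
```

Keeping the raw Bronze table untouched means a cleansing rule can be changed later and the Silver and Gold layers rebuilt without re-ingesting the source data.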

  • SQL
  • Data Warehousing
  • ETL
  • Data Modeling
Check it out!

Recipe Book – Angular CRUD App

A modern, responsive recipe book application built with Angular. Users can create, read, update, and delete recipes, browse a clean list view, and search or filter recipes by title or ingredient. The design adapts seamlessly between desktop and mobile.

Features include routing between pages, reactive form handling, localStorage persistence, and a sleek, menu-driven interface powered by Angular Material/Bootstrap.

  • Angular
  • TypeScript
  • HTML
  • SCSS/CSS
  • RxJS
  • LocalStorage
Check it out!

Collection of Personal Web Development Projects

This repository contains a series of personal web development projects designed to practice and improve front-end development skills. The projects range from layout exercises to interactive applications, focusing on HTML, CSS, JavaScript, and occasional frontend frameworks or libraries.

Each project explores different aspects of frontend development — from responsive design and UI components to DOM manipulation and API integration — building a practical foundation for creating modern, user-friendly websites.

  • HTML
  • CSS
  • JavaScript
Check it out!

Lights, Data, Action: An Analysis of the Entertainment Landscape

This project presents a comprehensive analytics suite of four interconnected Tableau dashboards. It delves into the vast world of entertainment by analyzing data from Netflix, IMDb's Top 1000 Movies, IMDb TV Shows, and the Oscars. The primary goal is to provide actionable insights for content strategy, production decisions, and audience targeting, offering a multi-dimensional view of the industry's trends, successes, and audience preferences.

Check it out!

Los Angeles Crime Prediction Project

This project analyzes and predicts crime in Los Angeles using the LAPD Crime Dataset (2020–2025). It focuses on two main goals: Crime Type Classification into categories like Assault, Burglary, and Other; and Crime Count Forecasting to predict monthly crime volumes for better law enforcement resource allocation.

Built in Python and deployed on AWS SageMaker, the project uses advanced feature engineering, dimensionality reduction (TruncatedSVD), class balancing (SMOTE), and machine learning models like Random Forest, LightGBM, and XGBoost for classification. Time series models, including Linear Regression, Random Forest Regressor, and SARIMAX, were applied to forecast crime trends.

XGBoost achieved ~83% accuracy for classification, outperforming others, while Linear Regression delivered the best forecasting performance, showing that engineered temporal and seasonal features effectively captured underlying patterns.
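
As a rough idea of what engineered temporal and seasonal features can look like for monthly counts, the sketch below builds a trend index, a one-month lag, and a cyclical month encoding. The exact features used in the project are not listed here, so these choices, the function name `make_features`, and the sample numbers in the usage line are assumptions for illustration only.

```python
import math


def make_features(monthly_counts, start_month=1):
    """Turn a list of monthly counts into feature rows for a regression model.

    `start_month` is the calendar month (1-12) of the first count. The first
    observation is dropped because it has no previous month to use as a lag.
    """
    rows = []
    for i in range(1, len(monthly_counts)):
        month = (start_month - 1 + i) % 12 + 1
        angle = 2 * math.pi * (month - 1) / 12
        rows.append({
            "month": month,
            "trend": i,                         # linear time index
            "lag_1": monthly_counts[i - 1],     # previous month's count
            "month_sin": math.sin(angle),       # cyclical encoding so that
            "month_cos": math.cos(angle),       # December sits next to January
            "target": monthly_counts[i],
        })
    return rows


# Made-up counts starting in December.
rows = make_features([100, 110, 120], start_month=12)
```

With features like these, even a plain linear model can pick up trend and seasonality, which is one plausible reason a simple regressor can compete with heavier time series models.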

  • Python
  • Time Series Forecasting
  • XGBoost
  • LightGBM
  • Random Forest
Check it out!
Crime Prediction Dashboard

Classification accuracy comparison between XGBoost, LightGBM, and Random Forest models.

Crime Forecast Results

Linear Regression outperforming other models in monthly crime count forecasting.

Interested in Working With Me?

I’m currently looking for new opportunities, and my inbox is always open. Whether you have a question or just want to say hi, get in touch!

Say Hello!