Cafe Sales Data Cleaning and Analysis
October 2025
Why I Took On This Project
Data cleaning and analysis are critical skills for cloud and data engineering roles. I chose a messy
10,000-row cafe sales dataset from Kaggle to practice cleaning, querying, and cloud deployment using
Python, SQL, and AWS RDS. This project showcases my ability to handle real-world data challenges and
deploy solutions in the cloud, aligning with my transition to IT and cloud security. See more at trisoncloudresume.com.
What I Built
I cleaned a Kaggle dataset (Cafe Sales - Dirty Data) with 10,000 rows, addressing
missing values, duplicates, and format issues. The cleaned data (8,733 rows) was loaded into SQLite and
AWS RDS (PostgreSQL) for analysis, where advanced SQL queries surfaced sales trends that are
visualized in an HTML page. Key components:
- Python script using pandas, numpy, SQLAlchemy, and psycopg2 to clean data and load into databases.
- SQLite and PostgreSQL databases for local and cloud analysis.
- Five advanced SQL queries: total sales by item, average spend by payment method, monthly sales
trends, top items by location, and sales percentage by item per location.
- HTML page (cafe_sales_analysis.html) presenting formatted query results.
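The clean-and-load step can be sketched roughly as below. The tiny inline DataFrame stands in for the Kaggle CSV, and the column names ("Transaction ID", "Price Per Unit", etc.) are assumptions based on the dataset, not a copy of my exact script:

```python
import pandas as pd
import numpy as np
from sqlalchemy import create_engine

# Tiny stand-in for the 10,000-row Kaggle CSV (column names assumed).
df = pd.DataFrame({
    "Transaction ID": ["T1", "T2", "T3", "T3"],
    "Item": ["Coffee", "ERROR", "Cake", "Cake"],
    "Quantity": ["2", "UNKNOWN", "1", "1"],
    "Price Per Unit": ["2.0", "3.0", "4.0", "4.0"],
    "Total Spent": ["4.0", "ERROR", np.nan, np.nan],
    "Payment Method": ["Cash", np.nan, "Card", "Card"],
    "Location": [np.nan, "In-store", "Takeaway", "Takeaway"],
})

# "ERROR"/"UNKNOWN" are placeholders, not data: treat them as missing.
df = df.replace(["ERROR", "UNKNOWN"], np.nan).drop_duplicates()

# Keep rows even when categorical fields are missing.
df[["Payment Method", "Location"]] = df[["Payment Method", "Location"]].fillna("Unknown")

# Numeric columns arrive as object dtype; convert before doing math.
for col in ["Quantity", "Price Per Unit", "Total Spent"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Quantity/price can't be recovered, so drop those rows;
# a missing Total Spent can be derived instead.
df = df.dropna(subset=["Quantity", "Price Per Unit"])
df["Total Spent"] = df["Total Spent"].fillna(df["Quantity"] * df["Price Per Unit"])

# Load into SQLite (swap the URL for the RDS PostgreSQL instance in the cloud step).
engine = create_engine("sqlite:///:memory:")
df.to_sql("cafe_sales", engine, index=False, if_exists="replace")
```

The same `to_sql()` call works unchanged against the PostgreSQL engine once the RDS connection string is supplied.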
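The monthly-trends query from the list above can be sketched with SQLite's `strftime()`; the table and column names here (`txn_date`, `total`) are placeholders, not the project's real schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cafe_sales (txn_date TEXT, total REAL);
    INSERT INTO cafe_sales VALUES
        ('2023-01-05', 4.0), ('2023-01-20', 6.0), ('2023-02-02', 5.0);
""")

# SQLite has no DATE_TRUNC; strftime('%Y-%m', ...) buckets rows by month.
# (On PostgreSQL the equivalent is DATE_TRUNC('month', txn_date).)
rows = conn.execute("""
    SELECT strftime('%Y-%m', txn_date) AS month, SUM(total) AS sales
    FROM cafe_sales
    GROUP BY month
    ORDER BY month;
""").fetchall()
```

Porting this query was one of the spots where the SQLite and PostgreSQL dialects diverge, as discussed below.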
What I Learned
This project deepened my understanding of data engineering and cloud database management:
- Data Cleaning: Systematic handling of missing values (e.g., 32.65% in Location),
invalid data ("ERROR"/"UNKNOWN"), and type conversions is critical for reliable analysis.
- SQL Differences: PostgreSQL folds unquoted identifiers to lowercase and enforces
stricter `GROUP BY` rules than SQLite, so queries need precise syntax.
- AWS RDS: Setting up and securing RDS instances involves IAM permissions, VPC
configuration, and connectivity troubleshooting.
- Chromebook Workflow: The Linux container on a Chromebook cannot see most of the host
file system, so files must be shared through the Files app.
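The missing-value percentages quoted above come from a simple audit pass like this one (shown on a small stand-in frame; placeholder strings are counted as missing before reporting):

```python
import numpy as np
import pandas as pd

# Small stand-in frame; the real project runs this on the 10,000-row CSV.
df = pd.DataFrame({
    "Item": ["Coffee", "ERROR", "Tea", "Cake"],
    "Location": [np.nan, "In-store", np.nan, "Takeaway"],
})

# Treat literal placeholders as missing, then report percent missing per column.
missing_pct = (df.replace(["ERROR", "UNKNOWN"], np.nan)
                 .isna()
                 .mean()
                 .mul(100)
                 .round(2))
print(missing_pct)
```

Running this audit first makes the later fill-versus-drop decisions explicit instead of ad hoc.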
Challenges and Solutions
The dataset and cloud setup presented several challenges:
- Chromebook File System: The Downloads folder was not visible from the Linux terminal.
Fixed by sharing the CSV with Linux via the Files app.
- Missing Values: High missing percentages (25.79% Payment Method, 32.65% Location).
Filled with "Unknown" to preserve data.
- Invalid Data: "ERROR"/"UNKNOWN" in numeric columns. Replaced them with NaN, imputed
Total Spent where possible, and dropped the ~10% of rows missing Quantity or Price Per Unit.
- Data Types: Numeric columns as `object`. Converted to `float64` after cleaning.
- SQLAlchemy Errors: Multi-statement SQL strings failed against SQLite. Wrapped each
statement in `text()` and ran separate `execute()` calls.
- SQLite Limitation: Window functions like `RANK()` cannot be used in `HAVING`. Moved the
ranking into a CTE and filtered on it.
- PostgreSQL Issues: Unquoted column names are folded to lowercase, and every
non-aggregated selected column must appear in `GROUP BY`. Fixed with quoted identifiers
and an adjusted `GROUP BY`.
- RDS Access: `--no-publicly-accessible` blocked external connections. Made the instance
publicly accessible and restricted its security group to my IP.
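The `text()` workaround above looks roughly like this sketch; the table and statements are illustrative, not the project's actual schema:

```python
from sqlalchemy import create_engine, text

engine = create_engine("sqlite:///:memory:")

# The sqlite3 driver rejects multiple statements in one execute(),
# so wrap each statement in text() and run it on its own.
statements = [
    "CREATE TABLE cafe_sales (item TEXT, total REAL)",
    "INSERT INTO cafe_sales VALUES ('Coffee', 4.0)",
    "INSERT INTO cafe_sales VALUES ('Cake', 8.0)",
]
with engine.begin() as conn:          # begin() commits automatically on success
    for stmt in statements:
        conn.execute(text(stmt))

with engine.connect() as conn:
    total = conn.execute(text("SELECT SUM(total) FROM cafe_sales")).scalar()
```

Using `engine.begin()` also keeps the statements in one transaction, so a failure midway rolls everything back.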
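The CTE fix for the `RANK()` limitation can be sketched with the standard-library `sqlite3` module (illustrative schema and data, not the real dataset):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cafe_sales (item TEXT, location TEXT, total REAL);
    INSERT INTO cafe_sales VALUES
        ('Coffee', 'In-store', 40.0),
        ('Cake',   'In-store', 25.0),
        ('Tea',    'Takeaway', 30.0),
        ('Coffee', 'Takeaway', 15.0);
""")

# RANK() can't appear in WHERE/HAVING, so aggregate first,
# rank inside a CTE, then filter on the rank in the outer query.
top_items = conn.execute("""
    WITH sales AS (
        SELECT location, item, SUM(total) AS sales
        FROM cafe_sales
        GROUP BY location, item
    ),
    ranked AS (
        SELECT location, item, sales,
               RANK() OVER (PARTITION BY location ORDER BY sales DESC) AS rnk
        FROM sales
    )
    SELECT location, item, sales FROM ranked WHERE rnk = 1
    ORDER BY location;
""").fetchall()
```

The same shape works unchanged on PostgreSQL, which is why the CTE version was the portable fix.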
Why It’s a Win
This project demonstrates my ability to clean and analyze large datasets, deploy cloud databases, and
visualize results. Starting from a messy 10,000-row dataset, I reduced it to 8,733 clean rows, ran
advanced SQL queries in SQLite and AWS RDS, and created a polished HTML visualization. The process honed
my Python, SQL, and AWS skills, aligning with junior cloud or data engineering roles. Check out the
results at cafe_sales_analysis.html and my portfolio at trisoncloudresume.com.