Cafe Sales Data Cleaning and Analysis
October 2025
Why I Took On This Project
Data cleaning and analysis are critical skills for cloud and data engineering roles. I chose a messy
10,000-row cafe sales dataset from Kaggle to practice cleaning, querying, and cloud deployment using
Python, SQL, and AWS RDS. This project showcases my ability to handle real-world data challenges and
deploy solutions in the cloud, aligning with my transition to IT and cloud security. See more at trisoncloudresume.com.
What I Built
I cleaned a Kaggle dataset (Cafe Sales - Dirty Data) with 10,000 rows, addressing
missing values, duplicates, and format issues. The cleaned data (8,733 rows) was loaded into SQLite and
AWS RDS (PostgreSQL) for analysis, where advanced SQL queries surfaced sales trends that are
visualized in an HTML page. Key components:
- Python script using pandas, numpy, SQLAlchemy, and psycopg2 to clean data and load into databases.
- SQLite and PostgreSQL databases for local and cloud analysis.
- Five advanced SQL queries: total sales by item, average spend by payment method, monthly sales
trends, top items by location, and sales percentage by item per location.
- HTML page (cafe_sales_analysis.html) presenting formatted query results.
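The clean-and-load step can be sketched roughly as below. The tiny inline DataFrame stands in for the Kaggle CSV, and the column names ("Transaction ID", "Price Per Unit", etc.) are assumptions based on the dataset, not a copy of my exact script:

```python
import pandas as pd
import numpy as np
from sqlalchemy import create_engine

# Tiny stand-in for the 10,000-row Kaggle CSV (column names assumed).
df = pd.DataFrame({
    "Transaction ID": ["T1", "T2", "T3", "T3"],
    "Item": ["Coffee", "ERROR", "Cake", "Cake"],
    "Quantity": ["2", "UNKNOWN", "1", "1"],
    "Price Per Unit": ["2.0", "3.0", "4.0", "4.0"],
    "Total Spent": ["4.0", "ERROR", np.nan, np.nan],
    "Payment Method": ["Cash", np.nan, "Card", "Card"],
    "Location": [np.nan, "In-store", "Takeaway", "Takeaway"],
})

# "ERROR"/"UNKNOWN" are placeholders, not data: treat them as missing.
df = df.replace(["ERROR", "UNKNOWN"], np.nan).drop_duplicates()

# Keep rows even when categorical fields are missing.
df[["Payment Method", "Location"]] = df[["Payment Method", "Location"]].fillna("Unknown")

# Numeric columns arrive as object dtype; convert before doing math.
for col in ["Quantity", "Price Per Unit", "Total Spent"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Quantity/price can't be recovered, so drop those rows;
# a missing Total Spent can be derived instead.
df = df.dropna(subset=["Quantity", "Price Per Unit"])
df["Total Spent"] = df["Total Spent"].fillna(df["Quantity"] * df["Price Per Unit"])

# Load into SQLite (swap the URL for the RDS PostgreSQL instance in the cloud step).
engine = create_engine("sqlite:///:memory:")
df.to_sql("cafe_sales", engine, index=False, if_exists="replace")
```

The same `to_sql()` call works unchanged against the PostgreSQL engine once the RDS connection string is supplied.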
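The monthly-trends query from the list above can be sketched with SQLite's `strftime()`; the table and column names here (`txn_date`, `total`) are placeholders, not the project's real schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cafe_sales (txn_date TEXT, total REAL);
    INSERT INTO cafe_sales VALUES
        ('2023-01-05', 4.0), ('2023-01-20', 6.0), ('2023-02-02', 5.0);
""")

# SQLite has no DATE_TRUNC; strftime('%Y-%m', ...) buckets rows by month.
# (On PostgreSQL the equivalent is DATE_TRUNC('month', txn_date).)
rows = conn.execute("""
    SELECT strftime('%Y-%m', txn_date) AS month, SUM(total) AS sales
    FROM cafe_sales
    GROUP BY month
    ORDER BY month;
""").fetchall()
```

Porting this query was one of the spots where the SQLite and PostgreSQL dialects diverge, as discussed below.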
What I Learned
This project deepened my understanding of data engineering and cloud database management:
- Data Cleaning: Systematic handling of missing values (e.g., 32.65% in Location),
invalid data ("ERROR"/"UNKNOWN"), and type conversions is critical for reliable analysis.
- SQL Differences: PostgreSQL folds unquoted identifiers to lowercase and enforces
stricter `GROUP BY` rules than SQLite, so queries need precise syntax.
- AWS RDS: Setting up and securing RDS instances involves IAM permissions, VPC
configuration, and connectivity troubleshooting.
- Chromebook Workflow: The Linux container on a Chromebook cannot see most of the host
file system, so files must be shared through the Files app.
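The missing-value percentages quoted above come from a simple audit pass like this one (shown on a small stand-in frame; placeholder strings are counted as missing before reporting):

```python
import numpy as np
import pandas as pd

# Small stand-in frame; the real project runs this on the 10,000-row CSV.
df = pd.DataFrame({
    "Item": ["Coffee", "ERROR", "Tea", "Cake"],
    "Location": [np.nan, "In-store", np.nan, "Takeaway"],
})

# Treat literal placeholders as missing, then report percent missing per column.
missing_pct = (df.replace(["ERROR", "UNKNOWN"], np.nan)
                 .isna()
                 .mean()
                 .mul(100)
                 .round(2))
print(missing_pct)
```

Running this audit first makes the later fill-versus-drop decisions explicit instead of ad hoc.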
Challenges and Solutions
The dataset and cloud setup presented several challenges:
- Chromebook File System: The Downloads folder was not visible from the Linux terminal.
Fixed by sharing the CSV with Linux via the Files app.
- Missing Values: High missing percentages (25.79% Payment Method, 32.65% Location).
Filled with "Unknown" to preserve data.
- Invalid Data: "ERROR"/"UNKNOWN" in numeric columns. Replaced them with NaN, imputed
Total Spent where possible, and dropped the ~10% of rows missing Quantity or Price Per Unit.
- Data Types: Numeric columns as `object`. Converted to `float64` after cleaning.
- SQLAlchemy Errors: Multi-statement SQL strings failed against SQLite. Wrapped each
statement in `text()` and ran separate `execute()` calls.
- SQLite Limitation: Window functions like `RANK()` cannot be used in `HAVING`. Moved the
ranking into a CTE and filtered on it.
- PostgreSQL Issues: Unquoted column names are folded to lowercase, and every
non-aggregated selected column must appear in `GROUP BY`. Fixed with quoted identifiers
and an adjusted `GROUP BY`.
- RDS Access: `--no-publicly-accessible` blocked external connections. Made the instance
publicly accessible and restricted its security group to my IP.
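The `text()` workaround above looks roughly like this sketch; the table and statements are illustrative, not the project's actual schema:

```python
from sqlalchemy import create_engine, text

engine = create_engine("sqlite:///:memory:")

# The sqlite3 driver rejects multiple statements in one execute(),
# so wrap each statement in text() and run it on its own.
statements = [
    "CREATE TABLE cafe_sales (item TEXT, total REAL)",
    "INSERT INTO cafe_sales VALUES ('Coffee', 4.0)",
    "INSERT INTO cafe_sales VALUES ('Cake', 8.0)",
]
with engine.begin() as conn:          # begin() commits automatically on success
    for stmt in statements:
        conn.execute(text(stmt))

with engine.connect() as conn:
    total = conn.execute(text("SELECT SUM(total) FROM cafe_sales")).scalar()
```

Using `engine.begin()` also keeps the statements in one transaction, so a failure midway rolls everything back.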
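The CTE fix for the `RANK()` limitation can be sketched with the standard-library `sqlite3` module (illustrative schema and data, not the real dataset):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cafe_sales (item TEXT, location TEXT, total REAL);
    INSERT INTO cafe_sales VALUES
        ('Coffee', 'In-store', 40.0),
        ('Cake',   'In-store', 25.0),
        ('Tea',    'Takeaway', 30.0),
        ('Coffee', 'Takeaway', 15.0);
""")

# RANK() can't appear in WHERE/HAVING, so aggregate first,
# rank inside a CTE, then filter on the rank in the outer query.
top_items = conn.execute("""
    WITH sales AS (
        SELECT location, item, SUM(total) AS sales
        FROM cafe_sales
        GROUP BY location, item
    ),
    ranked AS (
        SELECT location, item, sales,
               RANK() OVER (PARTITION BY location ORDER BY sales DESC) AS rnk
        FROM sales
    )
    SELECT location, item, sales FROM ranked WHERE rnk = 1
    ORDER BY location;
""").fetchall()
```

The same shape works unchanged on PostgreSQL, which is why the CTE version was the portable fix.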
Why It’s a Win
This project demonstrates my ability to clean and analyze large datasets, deploy cloud databases, and
visualize results. Starting from a messy 10,000-row dataset, I reduced it to 8,733 clean rows, ran
advanced SQL queries in SQLite and AWS RDS, and created a polished HTML visualization. The process honed
my Python, SQL, and AWS skills, aligning with junior cloud or data engineering roles. Check out the
results at cafe_sales_analysis.html and my portfolio at trisoncloudresume.com.