Welcome to DataSubway

A blog about Data Engineering — told through stories and technical deep-dives. Follow Conductor Mickey through the Invisible City, or jump straight into the Engineer’s Notebook for SQL and architecture references.

The Great Train Debate: Batch vs. Stream Processing

In the Invisible City, not every train travels at the same speed. Some are massive freight trains that move heavy loads once a night, while others are sleek, light subway cars that zip through the tunnels every single second. Choosing how to move your data is one of the most important decisions a Conductor can make. If you send a “Real-Time” message on a “Midnight Freight” train, the city won’t react in time. If you try to fit a mountain of cargo onto a tiny “Subway” car, the tracks will jam. Today, Conductor Mickey is going to show us the difference between Batch and Stream processing. ...

SQL Window Functions: A Practical Guide

Window functions are one of the most powerful and underused features in SQL. Unlike GROUP BY, they let you perform calculations across a set of rows related to the current row — without collapsing the result set. The Core Syntax function_name() OVER ( PARTITION BY column1 ORDER BY column2 ROWS/RANGE BETWEEN ... AND ... ) Clause Purpose Required? PARTITION BY Divides rows into groups (like GROUP BY, but rows are kept) No ORDER BY Defines the order within each partition Depends on function Frame (ROWS BETWEEN) Defines which rows to include relative to the current row No 1. Ranking Functions ROW_NUMBER() Assigns a unique sequential integer to each row within a partition. No ties. ...

The City Constitution: Establishing Order in the Invisible City

Before we dive into our next adventure with Conductor Mickey, we need to talk about the “law of the land.” In the tech world, we call this Data Governance. What is Data Governance? At its core, Data Governance is a collection of processes, roles, policies, standards, and metrics that ensure the effective and efficient use of information. It isn’t just a software tool; it’s a management framework that defines: Who has the authority and responsibility for specific data assets. What standards the data must meet (integrity, naming, etc.) before it is used. How data is protected, stored, and shared. When data should be archived, deleted, or updated. Without governance, a data system is just a pile of unorganized files. With it, data becomes a reliable asset that can power everything from basic reports to the most advanced AI. ...

Star vs. Snowflake

This is the final installment of our schema series. We’ve seen the Star Express and the Snowflake District. Now, it’s time for the ultimate showdown. How does Conductor Mickey decide which tracks to lay down for the “Invisible City”? In the previous posts, we explored two very different ways to organize a subway station: The Star: Fast, flat, and simple. The Snowflake: Detailed, organized, and space-saving. But in the real world of Data Engineering, you can’t always have both. You have to choose your priority: Speed or Storage? ...

The Snowflake Schema: The 'Detailed' District

In our Invisible City, as the population grows and the subway map expands, Conductor Mickey sometimes finds that a simple Star Hub isn’t enough. When the “Station” information becomes a giant, messy pile of repeating words, it’s time to move to the Snowflake Schema. If the Star Schema is the “Express Line,” the Snowflake is the “Highly Detailed District.” What is a Snowflake Schema? In the previous post, we saw how the Star Schema keeps things simple. But what happens when your data starts repeating itself? What if you have 10 million rows, and you’ve written “United States” and “North America” 10 million times? ...

The Star Schema—The 'Express' Hub

In our digital subway city, some stations are “Hubs.” They sit right in the middle, and every train line passes through them. In data engineering, if you don’t organize your stations, your subway maps will look like a plate of spaghetti. To keep things clean, we use Dimensional Modeling. The most famous layout for speed is the Star Schema. What is a Star Schema? The Star Schema is designed for one thing: Speed. It organizes your data into two distinct types of citizens: Facts and Dimensions. ...

The Medallion Architecture Journey

In our Invisible City, the subway doesn’t just move people from point A to point B. It moves them through a refinement process. When data first arrives at the station, it’s usually messy, tired, and covered in “digital mud.” We can’t let them go straight to the Mayor’s office (the Business Dashboard) looking like that! To fix this, Conductor Mickey uses the Medallion Architecture. It’s a three-stage filtration system that turns “Raw Junk” into “Pure Gold.” ...

Time Travel in Azure SQL Database

In our Invisible City subway system, we’ve talked about the “Time Machine” features in Delta Lake for big data. But what if your main terminal is built on a traditional, reliable foundation like Azure SQL Database? Usually, a database has short-term memory. If a passenger changes their destination from “Station A” to “Station B,” the database overwrites the old entry. Station A is gone forever. But sometimes, Conductor Mickey needs to know exactly what the city looked like last Tuesday at 2:00 PM. To do that without a magic wand, he uses Temporal Tables. ...

Understanding Delta Lake and Delta Tables

In our Invisible City subway system, we’ve already upgraded our shipping containers from old wooden crates (CSV) to futuristic, high-tech pods (Parquet). But a depot full of pods is still just a pile of boxes. If a train crashes while unloading, or if two conductors try to move the same crate at once, you get a “Data Crash.” To solve this, Conductor Mickey uses a special system: Delta Lake. But wait—is “Delta Lake” the same as a “Delta Table”? Let’s clear up the confusion with a trip to the Smart Depot. ...

Why Parquet is the Future of Data Shipping

In our Invisible City, if the Data Lake is the storage yard, then Parquet is the high-tech, space-saving shipping container we use to pack the goods. Most people are used to CSV files—they are like old-fashioned wooden crates. They work, but they are heavy, take up too much space, and are slow to move. Parquet is the “Flat-Pack” futuristic alternative. The Vertical Train: Why Data Engineers Love Parquet Imagine a subway train full of passengers. Each passenger has a Name, an Age, and a Destination. ...