<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Beshoy's Notes]]></title><description><![CDATA[A Syntopicon of Tech Notes]]></description><link>https://notes.beesho.me</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1745519332744/e1a3a226-bcd8-4344-b5a2-7611c857c269.png</url><title>Beshoy&apos;s Notes</title><link>https://notes.beesho.me</link></image><generator>RSS for Node</generator><lastBuildDate>Sat, 18 Apr 2026 11:35:09 GMT</lastBuildDate><atom:link href="https://notes.beesho.me/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Git Study Notes]]></title><description><![CDATA[Introduction to Git
Git is a distributed version control system that allows data engineering teams to track changes in code and collaborate effectively. It provides a full history of modifications, enabling easy rollbacks and parallel development. Ke...]]></description><link>https://notes.beesho.me/git-study-notes</link><guid isPermaLink="true">https://notes.beesho.me/git-study-notes</guid><category><![CDATA[Git]]></category><dc:creator><![CDATA[Beshoy Sabri]]></dc:creator><pubDate>Thu, 24 Apr 2025 19:59:38 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1745524747213/d6868410-567e-429f-9242-4705bf1edec8.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-introduction-to-git">Introduction to Git</h2>
<p>Git is a <strong>distributed version control system</strong> that allows data engineering teams to track changes in code and collaborate effectively. It provides a full history of modifications, enabling easy rollbacks and parallel development. Key benefits include:</p>
<ul>
<li><p><strong>Version Tracking:</strong> Every change is saved as a commit with metadata, making it easy to audit and revert changes when needed.</p>
</li>
<li><p><strong>Collaboration:</strong> Multiple engineers can work on different features or bug fixes simultaneously via branches, later merging their work without overwriting each other’s code.</p>
</li>
<li><p><strong>Reproducibility:</strong> Git ensures that code for data pipelines, ETL scripts, or infrastructure is versioned, so you can reproduce past configurations exactly – critical for debugging and regulatory compliance.</p>
</li>
</ul>
<p>Git’s advantages in <strong>version control and collaboration</strong> make it an essential tool for data engineers, helping streamline code management and sharing (<a target="_blank" href="https://medium.com/@vijaygadhave2014/git-for-data-engineers-essential-commands-c9216d3d0c8e#:~:text=OK%2C%20let%E2%80%99s%20break%20down%20the,benefits%20in%20clear%20bullet%20points">Git for Data Engineers: Essential Commands | by Vijay Gadhave | Medium</a>). By mastering Git’s command-line usage – from basic commits to advanced history editing – data engineers of all levels can efficiently manage project repositories.</p>
<h2 id="heading-getting-started-with-git-repository-setup-and-basics">Getting Started with Git: Repository Setup and Basics</h2>
<p><strong>Installing Git:</strong> First, install Git on your system (e.g., via package manager or the official installer). On Linux, you might use <code>sudo apt install git</code>, or on macOS <code>brew install git</code>. After installation, configure your identity:</p>
<ul>
<li><p><strong>Configure Identity:</strong> Set your name and email for commits:</p>
<pre><code class="lang-bash">  git config --global user.name <span class="hljs-string">"Your Name"</span>  
  git config --global user.email <span class="hljs-string">"you@example.com"</span>
</code></pre>
<p>  This ensures your commits are labeled properly. The <code>--global</code> flag applies this setting for all repositories (you can omit it to set per-project).</p>
</li>
</ul>
<p><strong>Initializing a Repository:</strong> To start version controlling a new or existing project, navigate into the project directory and initialize Git:</p>
<ul>
<li>Run <code>git init</code> – this creates a new <code>.git</code> subdirectory, containing all Git metadata for the repo ( <a target="_blank" href="https://www.atlassian.com/git/tutorials/setting-up-a-repository/git-init#:~:text=The%20,run%20in%20a%20new%20project">git init | Atlassian Git Tutorial</a> ). After <code>git init</code>, your directory becomes a Git repository; most Git commands can then be used inside it. (If the project already had a <code>.git</code> folder, re-initializing with <code>git init</code> is safe – it won’t overwrite existing version history ( <a target="_blank" href="https://www.atlassian.com/git/tutorials/setting-up-a-repository/git-init#:~:text=match%20at%20L556%20If%20you%27ve,configuration">git init | Atlassian Git Tutorial</a> ).)</li>
</ul>
<p><strong>Cloning a Repository:</strong> To obtain a working copy of an existing repository (for example, from GitHub or a shared network path), use <code>git clone</code>:</p>
<ul>
<li>Run <code>git clone &lt;repo-url&gt; [&lt;directory&gt;]</code>. This command copies an existing repository to your local machine. Internally, <code>git clone</code> first initializes a new repo then <strong>downloads all the data</strong> from the source repository and checks out the latest snapshot into your working directory ( <a target="_blank" href="https://www.atlassian.com/git/tutorials/setting-up-a-repository/git-init#:~:text=A%20quick%20note%3A%20,However%2C%20%60git">git init | Atlassian Git Tutorial</a> ). After cloning, you have the full history and the working files ready to use.</li>
</ul>
<p><strong>Basic Workflow – Stage and Commit Changes:</strong> Once a repo is initialized or cloned, the typical cycle of work is: edit files, stage changes, commit, and (when collaborating) push to a remote.</p>
<ol>
<li><p><strong>Check Repository Status:</strong> Use <code>git status</code> to review which files are modified, added, or removed. This helps you see what will be included in the next commit.</p>
</li>
<li><p><strong>Stage Changes:</strong> Use <code>git add &lt;file&gt;</code> to add a file’s changes to the staging area (index). For example:</p>
<ul>
<li><p><code>git add script.py</code> stages modifications in <em>script.py</em>.</p>
</li>
<li><p>Use <code>git add .</code> to stage <strong>all</strong> changes (new, modified, deleted files) in the current directory.</p>
</li>
</ul>
</li>
<li><p><strong>Commit Changes:</strong> Once staged, create a commit with <code>git commit -m "Message describing the change"</code>. A commit is a snapshot of the repository state. It’s good practice to write clear, concise commit messages.</p>
</li>
<li><p><strong>View Commit History:</strong> Use <code>git log</code> to view the history of commits. By default, it shows commits in reverse chronological order, with their hash, author, date, and message. For a one-line summary per commit, try <code>git log --oneline</code>.</p>
</li>
<li><p><strong>View Differences (Optional):</strong> To see what changes have been made, use <code>git diff</code> for unstaged changes or <code>git diff --staged</code> for staged ones. For example, <code>git diff HEAD</code> shows differences between the working directory and the latest commit.</p>
</li>
</ol>
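<p>Putting the numbered steps together, a minimal end-to-end session might look like this (the file name is illustrative):</p>
<pre><code class="lang-bash">git init                            # create the repository (if new)
echo "print('hello')" > script.py   # edit a file
git status                          # shows script.py as untracked
git add script.py                   # stage it
git commit -m "Add initial script"  # snapshot the staged changes
git log --oneline                   # one line per commit
git diff HEAD                       # compare working tree to last commit
</code></pre>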
<p>Each commit in Git is identified by a SHA-1 hash. You can refer to commits by these hashes (full or abbreviated) or by pointers like branch names and tags (more on these later).</p>
<p><strong>Using .gitignore:</strong> In any project, you’ll have files that shouldn’t be tracked (e.g., data outputs, environment files, credentials). Create a <code>.gitignore</code> file listing patterns of files to exclude (for example, <code>*.csv</code> to ignore all CSV files). Git will ignore those files when staging changes.</p>
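<p>For example, a data project’s <code>.gitignore</code> might contain (patterns are illustrative):</p>
<pre><code class="lang-text"># data outputs
*.csv
*.parquet

# environments and secrets
.venv/
.env
</code></pre>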
<h2 id="heading-branching-and-merging">Branching and Merging</h2>
<p>(<a target="_blank" href="https://commons.wikimedia.org/wiki/File:Basic_git_branching_workflow_%28GitLab%29.png">File:Basic git branching workflow (GitLab).png - Wikimedia Commons</a>) <em>Basic Git branching workflow: feature branches diverge from the main branch and are merged back after development. This allows parallel development without disrupting the main codebase.</em></p>
<p><strong>What is a Branch?</strong> A branch in Git is effectively a movable pointer to a commit (the tip of a line of development). Branches allow you to isolate work. For example, you might create a branch for a new feature or a bug fix. This way, unstable or experimental code is kept separate from the <code>main</code> branch (sometimes called <code>master</code> in older repos) until it’s ready. In Git, creating and switching branches is quick and cheap – it doesn’t copy the entire project, just points to a commit snapshot ( <a target="_blank" href="https://www.atlassian.com/git/tutorials/using-branches#:~:text=Git%20branches%20are%20effectively%20a,it%20into%20the%20main%20branch">Git Branch | Atlassian Git Tutorial</a> ).</p>
<ul>
<li><p>To <strong>create a new branch</strong>: <code>git branch &lt;branch-name&gt;</code>. This makes a new branch pointer at the current commit, but keeps you on the current branch.</p>
</li>
<li><p>To <strong>switch to a branch</strong>: <code>git checkout &lt;branch-name&gt;</code>. This updates your working directory to the state of that branch’s latest commit. (In newer Git versions, you can use <code>git switch &lt;branch-name&gt;</code> for clarity when just changing branches.)</p>
</li>
<li><p>To create and switch in one step: <code>git checkout -b &lt;branch-name&gt;</code> (or <code>git switch -c &lt;branch-name&gt;</code>). This is commonly used to start a new branch and begin working on it immediately.</p>
</li>
</ul>
<p>For example, <code>git checkout -b feature/data-cleanup</code> will create a branch called “feature/data-cleanup” and check it out. Now commits you make will advance this branch, while the <code>main</code> branch remains untouched until you merge the feature branch back.</p>
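<p>A typical branch session, continuing the example above:</p>
<pre><code class="lang-bash">git branch                            # list local branches; * marks the current one
git checkout -b feature/data-cleanup  # create the branch and switch to it
# ...edit, git add, git commit as usual...
git checkout main                     # switch back; main is unchanged
</code></pre>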
<p><strong>Why Branches?</strong> Branches enable parallel development. You can have multiple feature branches, bug-fix branches, or experiment branches all evolving independently. This isolation prevents incomplete or broken code on one branch from affecting others. When a branch’s work is complete and tested, it can be merged back into a main line (such as <code>main</code> or <code>develop</code> branch), integrating the changes.</p>
<h3 id="heading-common-branching-strategies">Common Branching Strategies</h3>
<p>Choosing a branching strategy helps define how your team collaborates and delivers code. Here are a few strategies data engineers may encounter:</p>
<ul>
<li><p><strong>Trunk-Based Development:</strong> All developers commit to a single long-lived branch (often <code>main</code> or <code>trunk</code>). Feature branches are short-lived; developers merge small, frequent updates back into <code>main</code>. This approach emphasizes continuous integration and is often considered a best practice for fast-paced DevOps environments ( <a target="_blank" href="https://www.atlassian.com/git/tutorials/comparing-workflows/gitflow-workflow#:~:text=Gitflow%20is%20a%20legacy%20Git,details%20Gitflow%20for%20historical%20purposes">Gitflow Workflow | Atlassian Git Tutorial</a> ). It minimizes merge conflicts by integrating changes often.</p>
</li>
<li><p><strong>Feature Branch Workflow:</strong> Each feature or issue is developed in its own branch off of <code>main</code> (or off a <code>develop</code> branch). When the work is done, it’s merged via a pull request (PR) into the main line. This is very common – it keeps work isolated and allows code review before merging. Feature branches are usually deleted after merge to keep the repository tidy.</p>
</li>
<li><p><strong>Gitflow Workflow:</strong> A more complex, legacy model with multiple long-lived branches (e.g. <code>develop</code>, <code>master</code>, plus supporting branches for features, releases, and hotfixes). In Gitflow, developers integrate feature branches into a central <code>develop</code> branch, releases are prepared on separate release branches, and finished releases are merged into <code>master</code> (and back into <code>develop</code>) ( <a target="_blank" href="https://www.atlassian.com/git/tutorials/comparing-workflows/gitflow-workflow#:~:text=Gitflow%20is%20an%20alternative%20Git,can%20also%20introduce%20conflicting%20updates">Gitflow Workflow | Atlassian Git Tutorial</a> ). Gitflow was popularized by Vincent Driessen and suits projects with scheduled release cycles. However, it has fallen out of favor in modern continuous delivery practices in favor of simpler trunk-based workflows ( <a target="_blank" href="https://www.atlassian.com/git/tutorials/comparing-workflows/gitflow-workflow#:~:text=Gitflow%20is%20a%20legacy%20Git,details%20Gitflow%20for%20historical%20purposes">Gitflow Workflow | Atlassian Git Tutorial</a> ), as Gitflow can be challenging to use with fast CI/CD pipelines.</p>
</li>
<li><p><strong>Forking Workflow:</strong> Common in open-source, this is where each contributor works on their personal fork (copy) of the repository, then submits changes back to the main repository via pull requests. Internally, they use branches for features, but collaboration happens through the fork rather than a shared repo. In corporate data engineering, forking may be less common than in open source, except when collaborating across organizations.</p>
</li>
</ul>
<p>In practice, many teams adopt a hybrid of these. For example, a data engineering team might use a simple feature branch model on a trunk-based approach: everyone branches off <code>main</code> for each task, then integrates back into <code>main</code> frequently (perhaps deploying changes to an analytics pipeline continuously). Choose a strategy that fits your release cadence and collaboration style. For stable data pipeline releases, you can also use <strong>release branches</strong> or tags (discussed below) to mark production versions.</p>
<h3 id="heading-merging-branches">Merging Branches</h3>
<p>Once work in a branch is complete, you’ll merge it back into a target branch (often <code>main</code> or an integration branch like <code>develop</code>). Merging takes the changes from one branch and integrates them into another. The result is that the target branch now contains all the work that was in the source branch.</p>
<p><strong>Basic merge:</strong> Check out the branch you want to merge into (e.g., <code>main</code>), then run <code>git merge &lt;source-branch&gt;</code>. This will bring the commits from <code>&lt;source-branch&gt;</code> into the current branch. There are a couple of outcomes:</p>
<ul>
<li><p>If the current branch’s HEAD is an ancestor of the source branch’s HEAD (in other words, the target branch has no new commits since the source branched off), Git will do a <strong>fast-forward merge</strong>. In a fast-forward, no new commit is created; the HEAD pointer of the target just moves forward to the source’s latest commit. The history stays linear (no merge commit). Fast-forwards often happen when one person was working on a feature branch and nothing changed in <code>main</code> in the meantime.</p>
</li>
<li><p>If the branches have diverged (both have new commits since they split), Git will perform a <strong>three-way merge</strong> and create a <strong>merge commit</strong> by default. This special commit has two parent commits (the tips of the branches being merged) and a combined set of changes. The merge commit message is auto-generated (you can edit it) to note the two branches that were merged. This merge commit keeps history non-linear but records the branch structure, which some teams prefer for traceability.</p>
</li>
</ul>
<p>You can control merge behavior:</p>
<ul>
<li><p>To <strong>force a merge commit</strong> even if a fast-forward is possible, use <code>git merge --no-ff &lt;branch&gt;</code>. This is sometimes done to ensure a merge commit exists (for example, to keep a history of all merges or to group a feature’s commits under one merge commit).</p>
</li>
<li><p>To <strong>squash merge</strong> (combine all of a branch’s commits into a single commit on the target), you can use <code>git merge --squash &lt;branch&gt;</code> and then commit. This doesn’t produce a merge commit with two parents; instead, it produces one new commit with the cumulative changes. Squash merging is used to keep history linear and concise (e.g., squashing “fixup” commits on a feature branch into one).</p>
</li>
</ul>
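<p>For example, merging a finished feature branch with an explicit merge commit and then cleaning up (branch name assumed from earlier):</p>
<pre><code class="lang-bash">git checkout main                      # switch to the target branch
git merge --no-ff feature/data-cleanup -m "Merge feature/data-cleanup"
git branch -d feature/data-cleanup     # remove the now-merged branch label
</code></pre>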
<p>After merging, it’s good practice to <strong>delete the feature branch</strong> if it’s no longer needed (<code>git branch -d &lt;branch&gt;</code>), especially if using short-lived feature branches, to reduce clutter. (The commits remain in history even if the branch label is removed.)</p>
<h3 id="heading-merge-conflicts-and-resolution">Merge Conflicts and Resolution</h3>
<p>Sometimes Git cannot automatically merge changes because the same part of the same file was edited differently in the two branches – this is a <strong>merge conflict</strong>. When a conflict occurs, Git will merge what it can and mark the conflicted files for manual resolution. You’ll see in the terminal output which files have conflicts, and <code>git status</code> will list those files as “unmerged” or with conflicts.</p>
<p><strong>What a conflict looks like:</strong> Git inserts conflict markers in the affected files to indicate the differing sections. For example, a conflicted file will contain lines like:</p>
<pre><code class="lang-diff">&lt;&lt;&lt;&lt;&lt;&lt;&lt; HEAD
your code on the current branch
<span class="hljs-comment">=======</span>
someone else’s code on the branch being merged
&gt;&gt;&gt;&gt;&gt;&gt;&gt; feature-branch
</code></pre>
<p>Everything between <code>&lt;&lt;&lt;&lt;&lt;&lt;&lt; HEAD</code> and <code>=======</code> is the content from the current branch, and between <code>=======</code> and <code>&gt;&gt;&gt;&gt;&gt;&gt;&gt; feature-branch</code> is the content from the other branch. Your job is to edit this file to reconcile the differences – decide what the final content should be.</p>
<p><strong>Steps to resolve a conflict:</strong></p>
<ol>
<li><p><strong>Identify conflicts:</strong> After a merge attempt, Git stops at conflicts. Run <code>git status</code> to see which files are in conflict. Open those files in a text editor or IDE.</p>
</li>
<li><p><strong>Edit the files:</strong> Find the conflict markers (<code>&lt;&lt;&lt;&lt;&lt;&lt;&lt;</code>, <code>=======</code>, <code>&gt;&gt;&gt;&gt;&gt;&gt;&gt;</code>). Decide how to combine the changes. You might keep one side, the other, or a mix of both. Remove the conflict marker lines and make the file look exactly as it should after the merge.</p>
</li>
<li><p><strong>Mark as resolved:</strong> After editing, stage the file with <code>git add &lt;file&gt;</code>. (Git now knows you’ve resolved that file’s conflict.)</p>
</li>
<li><p><strong>Finalize the merge:</strong> Once <em>all</em> conflicts in all files are resolved and staged, complete the merge by running <code>git commit</code> or <code>git merge --continue</code>. (If the conflict arose during a rebase or cherry-pick instead, use <code>git rebase --continue</code> or <code>git cherry-pick --continue</code> – see rebasing below.) The merge commit will be created with the resolutions you just made.</p>
</li>
</ol>
<p>If you want to abort the merge during a conflict (to return to the pre-merge state), you can use <code>git merge --abort</code>. This will abandon the merge and leave the branch as it was.</p>
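<p>The whole cycle can be reproduced in a scratch repository (file and branch names are illustrative; assumes an already-initialized repo with your identity configured):</p>
<pre><code class="lang-bash"># two branches edit the same line of config.py
echo "rate = 1" > config.py
git add config.py
git commit -m "base"
git checkout -b feature-branch
echo "rate = 2" > config.py
git commit -am "feature change"
git checkout -
echo "rate = 3" > config.py
git commit -am "main change"

git merge feature-branch        # stops with CONFLICT; markers written into config.py
echo "rate = 2" > config.py     # resolve: keep the feature value, remove the markers
git add config.py               # mark resolved
git commit -m "Merge feature-branch"   # or: git merge --abort to back out
</code></pre>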
<p><strong>Tip:</strong> To simplify conflict resolution, you can use graphical mergetools (<code>git mergetool</code>) or IDE features that present side-by-side comparisons. Tools like VSCode, GitKraken, or P4Merge can make resolving conflicts more visual.</p>
<h2 id="heading-rebasing-and-history-rewriting">Rebasing and History Rewriting</h2>
<p>Rebasing is another method to integrate changes from one branch into another, by <strong>moving</strong> the base of a branch. In contrast to merging (which adds a new merge commit), rebasing <em>reapplies</em> your branch commits on top of another base commit, producing a linear history.</p>
<p><strong>Basic Rebase:</strong> Suppose you have a feature branch that forked from <code>main</code>, and <code>main</code> has progressed (new commits) while you worked. To update your feature branch with the latest <code>main</code> changes, you can rebase:</p>
<pre><code class="lang-bash">git checkout feature
git rebase main
</code></pre>
<p>This takes the commits on <code>feature</code> and replays them as if you started <code>feature</code> at the tip of <code>main</code>. Technically, it finds the common ancestor of <code>feature</code> and <code>main</code>, takes the diff introduced by each commit on <code>feature</code> since that ancestor, and re-applies those diffs on top of the current <code>main</code> HEAD, one by one. The <code>feature</code> branch pointer is then moved to the last of these new commits ( <a target="_blank" href="https://www.atlassian.com/git/tutorials/merging-vs-rebasing#:~:text=This%20moves%20the%20entire%20,commit%20in%20the%20original%20branch">Merging vs. Rebasing | Atlassian Git Tutorial</a> ).</p>
<blockquote>
<p><em>Effectively,</em> <code>git rebase main</code> “moves the entire feature branch to begin on the tip of the main branch, incorporating all new commits in main. But instead of using a merge commit, rebasing re-writes the project history by creating brand new commits for each commit in the original branch” ( <a target="_blank" href="https://www.atlassian.com/git/tutorials/merging-vs-rebasing#:~:text=This%20moves%20the%20entire%20,commit%20in%20the%20original%20branch">Merging vs. Rebasing | Atlassian Git Tutorial</a> ).</p>
</blockquote>
<p>After the rebase, your feature branch will have a new set of commits (new hashes) that include the latest from <code>main</code> in its history. The history becomes one straight line, as if you had developed the feature on top of the updated <code>main</code> all along. This yields a cleaner, linear history with no merge commits ( <a target="_blank" href="https://www.atlassian.com/git/tutorials/merging-vs-rebasing#:~:text=The%20major%20benefit%20of%20rebasing,gitk">Merging vs. Rebasing | Atlassian Git Tutorial</a> ).</p>
<p><strong>When to rebase vs. merge:</strong> Rebasing achieves the same end result as merging (your feature has <code>main</code>’s updates), but the history looks different:</p>
<ul>
<li><p>A merge preserves the context of parallel development (with a merge commit joining two lines of history), whereas</p>
</li>
<li><p>a rebase flattens it into a single line, as if work was serial.</p>
</li>
</ul>
<p>Many teams rebase feature branches before merging to main (often via a pull request) to maintain a linear history in <code>main</code>. This can make tools like <code>git log</code> or <code>git bisect</code> simpler to use ( <a target="_blank" href="https://www.atlassian.com/git/tutorials/merging-vs-rebasing#:~:text=The%20major%20benefit%20of%20rebasing,gitk">Merging vs. Rebasing | Atlassian Git Tutorial</a> ). However, rebase is effectively changing history, which comes with caveats.</p>
<p><strong>Caution – The Golden Rule of Rebasing:</strong> <em>Do not rebase commits that exist outside your local repository.</em> In other words, <strong>never rebase a public/shared branch</strong>. If you rebase commits that others have already pulled (for example, rebasing the <code>main</code> branch or any published branch), you’ll force everyone else to reconcile the rewritten history, which is error-prone and confusing ( <a target="_blank" href="https://www.atlassian.com/git/tutorials/merging-vs-rebasing#:~:text=If%20you%20try%20to%20push,force%60%20flag%2C%20like%20so">Merging vs. Rebasing | Atlassian Git Tutorial</a> ). Only rebase local feature branches or branches that only you work on. The golden rule: <strong>never use</strong> <code>git rebase</code> on public branches ( <a target="_blank" href="https://www.atlassian.com/git/tutorials/merging-vs-rebasing#:~:text=If%20you%20try%20to%20push,force%60%20flag%2C%20like%20so">Merging vs. Rebasing | Atlassian Git Tutorial</a> ), because rebasing replaces old commits with new ones, and anyone who has the old ones will get conflicts pushing or pulling.</p>
<p><strong>Rebase Conflicts:</strong> Rebasing can also result in conflicts (similar to a merge) if your commits and the new base have touched the same lines. Git will stop at a conflict during the rebase. You resolve it by editing the file and using <code>git add</code> just like a merge, then continue the rebase with <code>git rebase --continue</code>. If things go wrong, <code>git rebase --abort</code> will return you to the state before the rebase started.</p>
<p><strong>Interactive Rebase – cleaning up commits:</strong> A powerful feature is <code>git rebase -i</code> (interactive). For example, <code>git rebase -i main</code> (while on <code>feature</code>) allows you to edit the commit history of your feature branch before finishing the rebase. Git opens an editor with a list of your commits and allows actions like reordering commits, squashing multiple commits into one, editing commit messages, or dropping commits. For instance, you can squash a “fix typo” commit into an earlier commit instead of having two separate commits for clarity. Interactive rebase is great for crafting a clean history (especially before merging to main) ( <a target="_blank" href="https://www.atlassian.com/git/tutorials/merging-vs-rebasing#:~:text=Interactive%20rebasing">Merging vs. Rebasing | Atlassian Git Tutorial</a> ) ( <a target="_blank" href="https://www.atlassian.com/git/tutorials/merging-vs-rebasing#:~:text=pick%C2%A033d5b7a%C2%A0Message%C2%A0for%C2%A0commit%C2%A0">Merging vs. Rebasing | Atlassian Git Tutorial</a> ). After you finalize the changes in the editor and save, Git will apply the commits as specified. You’ll end up with a rewritten branch history (make sure this is only done on your own branch, not something others depend on).</p>
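<p>When the editor opens during <code>git rebase -i</code>, you edit a “todo” list like the following (hashes and messages are illustrative); changing <code>pick</code> to another command controls what happens to each commit:</p>
<pre><code class="lang-text">pick a1b2c3d Add ingestion step
squash f4e5d6c Fix typo in ingestion step
reword 9b8c7d6 Update schema documentation

# Commands:
# p, pick   = use commit
# r, reword = use commit, but edit the commit message
# s, squash = use commit, but meld into previous commit
# d, drop   = remove commit
</code></pre>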
<p><strong>Reset (Rewriting History):</strong> While discussing history, another related tool is <code>git reset</code>. This command moves the HEAD of the current branch to a specified commit, optionally updating the working directory/index:</p>
<ul>
<li><p><code>git reset --soft &lt;commit&gt;</code>: move HEAD to <code>&lt;commit&gt;</code> but leave all changes after that commit staged (in index). Useful if you want to “uncommit” some changes but keep them ready to recommit.</p>
</li>
<li><p><code>git reset --mixed &lt;commit&gt;</code> (default): move HEAD to <code>&lt;commit&gt;</code>, keep changes after that commit in the working directory (unstaged). This essentially “unstages” and “uncommits” those changes.</p>
</li>
<li><p><code>git reset --hard &lt;commit&gt;</code>: move HEAD to <code>&lt;commit&gt;</code> <strong>and</strong> wipe out any changes in the index and working directory (all changes after that commit are lost, unless they exist somewhere else like another branch or the reflog). <strong>Use with caution</strong>, as <code>--hard</code> can discard work permanently if not saved elsewhere.</p>
</li>
</ul>
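<p>For example (each line is an alternative way to undo the most recent commit, not a sequence to run in order):</p>
<pre><code class="lang-bash">git reset --soft HEAD~1   # drop the last commit; its changes stay staged
git reset HEAD~1          # --mixed (default): changes kept, but unstaged
git reset --hard HEAD~1   # drop the last commit AND discard its changes
</code></pre>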
<p>Reset is useful for rewriting history locally (e.g., undoing a bad commit or several commits) before pushing. But like rebase, <strong>never reset public history</strong> (especially <code>--hard</code> on a shared branch) – others who pulled the old history will have issues. For public fixes, prefer <code>git revert</code> (discussed next) which preserves history.</p>
<p><strong>Reverting Commits:</strong> If you need to undo a commit that has already been pushed/shared, use <code>git revert &lt;commit&gt;</code>. This creates a new commit that inverses the changes of the specified commit (without altering history behind it). It’s the “safe” way to undo a commit because history remains chronological; you don’t remove the commit, you add a new one that says “undo that change.” This is especially relevant for production code or database migration scripts in data engineering – if a bad change was committed and pushed, revert will apply an opposite change as a new commit (which can itself be pushed). You can revert a single commit or a range of commits. Each revert will prompt for a commit message (by default, noting the commit hash being reverted).</p>
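<p>A sketch (the hash is illustrative):</p>
<pre><code class="lang-bash">git revert a1b2c3d         # new commit that inverts a1b2c3d's changes
git revert --no-edit HEAD  # undo the latest commit, skipping the message editor
git push                   # safe to share: history only moved forward
</code></pre>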
<h2 id="heading-stashing-changes-shelving-work-in-progress">Stashing Changes (Shelving Work in Progress)</h2>
<p>In data engineering, you might often find yourself in the middle of developing a pipeline when something urgent comes up (e.g., a bug on another branch needs fixing). You have uncommitted changes that aren’t ready to commit or push. <strong>Git stash</strong> is a handy feature to temporarily save this work and get back to a clean working state.</p>
<p><strong>What is stashing?</strong> Stashing takes your working directory <strong>changes</strong> (and by default, the staging area as well) and saves them on a stack of “stashes,” then reverts your working copy to match the HEAD commit. It’s like putting your work in a locker temporarily.</p>
<ul>
<li>Run <code>git stash</code> to stash current changes. Git will save all modifications to tracked files (staged or not, including deletions), then revert your working directory to a clean state (the last commit). Your changes aren’t lost – they’re stored on the stash stack. By default, stash entries are named “WIP on branch-name…” after the commit they were created on, but you can provide a description: <code>git stash push -m "message"</code>.</li>
</ul>
<blockquote>
<p><strong>Stash example:</strong> If you have edited 3 files but need to switch to <code>main</code> to hotfix something, do <code>git stash</code>. Now your repo is clean (no edits). After fixing on <code>main</code> and switching back to your feature, you can apply the stash to continue where you left off.</p>
</blockquote>
<p><strong>Stash list and apply:</strong> You can have multiple stashes:</p>
<ul>
<li><p><code>git stash list</code> – shows all stashed entries in a stack (indexed as stash@{0}, stash@{1}, ... with 0 being the most recent). You’ll see the message and branch info for each.</p>
</li>
<li><p><code>git stash show [stash@{n}]</code> – show what changes are in a stash (a summary diff). Add <code>-p</code> for a full diff.</p>
</li>
<li><p><code>git stash apply &lt;stash&gt;</code> – apply a stash’s changes back onto the working directory <em>without removing it</em> from the stash list (so it can be applied again if needed). If you omit the stash reference, it applies the latest stash.</p>
</li>
<li><p><code>git stash pop &lt;stash&gt;</code> – applies the stash and <strong>drops</strong> it from the stash list (applying and popping the most recent stash is the default if none specified). Use this when you’re done with that saved state.</p>
</li>
</ul>
<p>After applying a stash, your previously saved changes are restored in your working copy (you might need to resolve conflicts if the code changed in the meantime). If a stash is no longer needed, you can drop it manually: <code>git stash drop stash@{n}</code> or clear all with <code>git stash clear</code>.</p>
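<p>A typical stash round trip (branch names are illustrative):</p>
<pre><code class="lang-bash">git stash push -m "wip: cleanup step"  # save work, clean the working tree
git stash list                         # stash@{0}: On feature: wip: cleanup step
git switch main                        # ...handle the urgent fix...
git switch feature/data-cleanup
git stash pop                          # restore the changes, drop the stash
</code></pre>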
<p>Stashing is great for keeping work in progress aside. For instance, stashing allows you to switch branches (<code>git switch &lt;other-branch&gt;</code>) without committing half-done work, and then come back later and unstash to resume. Keep in mind that stash is local to your repository (stashes aren’t pushed to remotes). It’s essentially a convenience—some workflows prefer committing to a “WIP” branch instead—but stash is quick and tidy for short-term use.</p>
<blockquote>
<p><strong>In summary:</strong> <code>git stash</code> <strong>"saves your local changes to a separate place and reverts your working directory to the last commit"</strong> (<a target="_blank" href="https://dev.to/farhatsharifh/from-changes-to-safe-keeping-git-stash-m42#:~:text=Command%3A%20,or%20work%20on%20something%20else">From Changes to Safe Keeping: Git Stash - DEV Community</a>), letting you safely check out a different branch or pull updates. Later, <code>git stash apply</code> or <code>git stash pop</code> restores those saved changes so you can continue where you left off.</p>
</blockquote>
<h2 id="heading-recovering-lost-work-with-reflog">Recovering Lost Work with Reflog</h2>
<p>Git has a safety net called the <strong>reflog</strong> (reference log) that records when heads and branch tips are updated in your local repository. Even if you lose track of a commit (say you reset a branch or a commit isn’t on any branch), the reflog allows you to find its hash and recover it.</p>
<p>Every time HEAD moves, Git records an entry in the reflog. This includes commits, checkouts, resets, rebases, stash operations, etc. Reflog entries are local (they’re not transferred to remotes) and expire after 90 days by default (entries no longer reachable from any ref expire after 30 days).</p>
<ul>
<li><p>Run <code>git reflog</code> to see the log of HEAD changes (it’s essentially shorthand for <code>git reflog show HEAD</code>) ( <a target="_blank" href="https://www.atlassian.com/git/tutorials/rewriting-history/git-reflog#:~:text=The%20most%20basic%20Reflog%20use,case%20is%20invoking">Git Reflog Configuration | Atlassian Git Tutorial</a> ). The output will look like:</p>
<pre><code>  abc1234 HEAD@{0}: commit: added data validation step
  98f00b2 HEAD@{1}: checkout: moving from main to feature/cleanup
  0123fed HEAD@{2}: commit: WIP commit
  ...
</code></pre>
<p>  Each line shows a reference (here HEAD) with an index <code>{n}</code> (0 is the latest action, then 1, 2, etc. going back in time), the commit hash at that moment, and a description of the action. You can also view reflog for specific branches: <code>git reflog show branchName</code> (or <code>git reflog branchName</code> shorthand) ( <a target="_blank" href="https://www.atlassian.com/git/tutorials/rewriting-history/git-reflog#:~:text=By%20default%2C%20,the%20Git%20stash%20can%20be">Git Reflog Configuration | Atlassian Git Tutorial</a> ), and even the stash has a reflog (<code>git reflog show stash</code>).</p>
</li>
</ul>
<p>Using the reflog, you can recover from mistakes. For example, if you accidentally reset <code>main</code> to an earlier commit and “lost” some commits, look at <code>git reflog</code> to find the hash of where <code>main</code> <em>was</em> before the reset. Right after the reset, the newest entry (<code>main@{0}</code>) will read something like <code>reset: moving to &lt;old-sha&gt;</code>, and the entry just before it (<code>main@{1}</code>) points to the commit that got “lost”. You can then do <code>git checkout &lt;lost-commit-hash&gt;</code> or create a branch at that hash to recover it.</p>
<p>In short, <strong>reflog tracks updates to branch heads</strong> (and HEAD itself) in your repository ( <a target="_blank" href="https://www.atlassian.com/git/tutorials/rewriting-history/git-reflog#:~:text=This%20page%20provides%20a%20detailed,Common%20examples%20include">Git Reflog Configuration | Atlassian Git Tutorial</a> ). It’s extremely useful for undoing operations that seem irreversible (like an unintended hard reset or an overwritten branch). Keep in mind reflog entries expire over time (and some aggressive Git GC operations can clean them), so don’t rely on reflog for extremely long-term recovery – but for recent oopsies, it’s a lifesaver.</p>
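<p>Here’s a minimal recovery drill in a scratch repository: “lose” a commit with a hard reset, then get it back via the reflog (the branch name <code>rescue</code> is arbitrary):</p>
<pre><code class="lang-bash">set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo
git commit -q --allow-empty -m "first"
git commit -q --allow-empty -m "second"
lost=$(git rev-parse HEAD)          # remember the hash for comparison

git reset -q --hard HEAD~1          # oops: "second" is now on no branch
git reflog | head -3                # the reset and both commits are still listed

recovered=$(git rev-parse 'HEAD@{1}')   # where HEAD pointed before the reset
git branch rescue "$recovered"          # re-attach the lost commit to a branch
git log --oneline -1 rescue             # shows "second" again
</code></pre>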
<h2 id="heading-tagging-marking-specific-commits">Tagging: Marking Specific Commits</h2>
<p>Tags are references to specific commits, typically used to mark <strong>release points</strong> (v1.0, v2.0, etc.) or other important milestones in history. Unlike branches, tags don’t move – they always point to the same commit. There are two types of tags in Git:</p>
<ul>
<li><p><strong>Annotated Tags:</strong> These are full Git objects stored in the repo. An annotated tag has a tagger name, email, date, and a message (like a commit message) (<a target="_blank" href="https://confluence.atlassian.com/bbkb/tagger-date-and-message-fields-missing-in-webhook-payloads-and-the-rest-api-1142247761.html#:~:text=Annotated%20Tags">Tagger, Date, and Message fields missing in Webhook Payloads and the Rest API | Bitbucket Cloud | Atlassian Support</a>). Annotated tags are created with <code>git tag -a &lt;tagname&gt; -m "message"</code> (or <code>-s</code> to sign with GPG). They are recommended for releases because they can contain metadata and can be signed for security.</p>
</li>
<li><p><strong>Lightweight Tags:</strong> These are basically just named pointers (like a branch that never moves). A lightweight tag has no extra data – it’s just a name for a commit (<a target="_blank" href="https://confluence.atlassian.com/bbkb/tagger-date-and-message-fields-missing-in-webhook-payloads-and-the-rest-api-1142247761.html#:~:text=Lightweight%20Tags">Tagger, Date, and Message fields missing in Webhook Payloads and the Rest API | Bitbucket Cloud | Atlassian Support</a>). You create one with <code>git tag &lt;tagname&gt;</code> (no <code>-a</code>). It’s quick and simple but lacks the description and metadata.</p>
</li>
</ul>
<p>Example:</p>
<pre><code class="lang-bash">git tag -a v1.0 -m <span class="hljs-string">"Release version 1.0"</span>
</code></pre>
<p>This tags the current commit with an annotated tag “v1.0” with the given message. Alternatively, <code>git tag v1.0</code> (without <code>-a</code> or <code>-m</code>) would make a lightweight tag on the current commit.</p>
<p>Use <code>git tag</code> (with no arguments) to list all tags. You can also filter tags: <code>git tag -l "v1.*"</code> lists tags matching the pattern (e.g., all tags that start with v1.).</p>
<p>To see details of a tag (especially annotated tag message), use <code>git show &lt;tagname&gt;</code> which will display the tag message and the commit it points to.</p>
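<p>A quick illustration of the two tag types in a scratch repository (the tag names are examples); <code>git cat-file -t</code> reveals the difference in how they’re stored:</p>
<pre><code class="lang-bash">set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo
git commit -q --allow-empty -m "release candidate"

git tag -a v1.0 -m "Release version 1.0"   # annotated: a full tag object
git tag v1.0-rc1                           # lightweight: just a named pointer

git tag -l "v1.*"           # lists both
git cat-file -t v1.0        # tag    (its own object, with tagger and message)
git cat-file -t v1.0-rc1    # commit (resolves straight to the commit)
</code></pre>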
<p><strong>Pushing Tags:</strong> By default, <code>git push</code> <strong>does not</strong> send tags to the remote. You must explicitly push tags:</p>
<ul>
<li><p><code>git push origin &lt;tagname&gt;</code> to push a specific tag.</p>
</li>
<li><p><code>git push origin --tags</code> to push all tags at once.</p>
</li>
</ul>
<p>It’s important to push tags if others need to see them (e.g., teammates or CI systems that deploy specific tags). Tag names, once pushed, are global in the repository like branch names.</p>
<p><strong>Using Tags in Data Engineering:</strong> You might tag a commit that corresponds to a deployed version of a data pipeline. For example, if you release “Batch ETL pipeline v2”, you could tag that commit as <code>pipeline-v2-prod</code>. This makes it easy to check out exactly what code was running in production at that version, or to compare changes between versions.</p>
<blockquote>
<p><strong>Annotated vs Lightweight:</strong> <em>“Annotated tags are full objects stored within Git’s database and contain all information (checksum, tagger, date, message). Lightweight tags are simply pointers to a commit and contain no additional data.”</em> (<a target="_blank" href="https://confluence.atlassian.com/bbkb/tagger-date-and-message-fields-missing-in-webhook-payloads-and-the-rest-api-1142247761.html#:~:text=Annotated%20Tags">Tagger, Date, and Message fields missing in Webhook Payloads and the Rest API | Bitbucket Cloud | Atlassian Support</a>) (<a target="_blank" href="https://confluence.atlassian.com/bbkb/tagger-date-and-message-fields-missing-in-webhook-payloads-and-the-rest-api-1142247761.html#:~:text=Lightweight%20Tags">Tagger, Date, and Message fields missing in Webhook Payloads and the Rest API | Bitbucket Cloud | Atlassian Support</a>) For important milestones, annotated tags are preferred due to the extra context.</p>
</blockquote>
<p>If you ever need to delete a tag (for instance, a mistyped one): <code>git tag -d &lt;tagname&gt;</code> removes it locally. To remove it from the remote, push a deletion: <code>git push origin --delete &lt;tagname&gt;</code> (or the older refspec form, <code>git push origin :refs/tags/&lt;tagname&gt;</code>).</p>
<h2 id="heading-working-with-remote-repositories-github-gitlab-bitbucket-etc">Working with Remote Repositories (GitHub, GitLab, Bitbucket, etc.)</h2>
<p>In professional settings, your Git repositories are usually hosted on a remote server or service (such as <strong>GitHub, GitLab, or Bitbucket</strong>). These platforms provide a central location for your team’s Git repo and add features like pull requests, issue tracking, and CI integration.</p>
<p>After doing local commits, you’ll <strong>push</strong> them to the remote so others can pull them. Similarly, you <strong>pull</strong> others’ changes to update your local repository.</p>
<p><strong>Adding a Remote:</strong> If you cloned a repository, the remote (often named <code>origin</code>) is already configured. If you started from scratch with <code>git init</code>, you can add a remote:</p>
<pre><code class="lang-bash">git remote add origin &lt;repo-url&gt;
</code></pre>
<p>This links the name “origin” to the repository URL (could be an HTTPS URL or SSH like <code>git@github.com:account/repo.git</code>). You can check remotes with <code>git remote -v</code>.</p>
<p><strong>Pushing Changes:</strong> To share your commits, push them to a remote branch:</p>
<pre><code class="lang-bash">git push origin &lt;branch-name&gt;
</code></pre>
<p>This will send the commits on your local branch to the remote <code>origin</code>. If the remote branch doesn’t exist yet, Git will create it. The first time you push a new branch, you might use <code>git push -u origin &lt;branch&gt;</code> to set the “upstream tracking” – meaning your local branch will track that remote branch, so future <code>git pull</code> or <code>git push</code> commands know which branch to synchronize with by default.</p>
<p>For example, on a new repo:</p>
<pre><code class="lang-bash">git push -u origin main
</code></pre>
<p>pushes your local main branch to the remote’s main and sets it as the upstream. Next time, just <code>git push</code> is enough.</p>
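<p>You can rehearse this whole flow locally by letting a bare repository stand in for the hosted remote (all paths here are throwaway <code>mktemp</code> directories):</p>
<pre><code class="lang-bash">set -e
work=$(mktemp -d)
git init -q --bare "$work/central.git"   # plays the role of GitHub/GitLab

git init -q "$work/local"
cd "$work/local"
git config user.email demo@example.com
git config user.name demo
git commit -q --allow-empty -m "initial pipeline skeleton"

git remote add origin "$work/central.git"
git remote -v                     # origin ... (fetch) / origin ... (push)
git push -q -u origin HEAD        # first push creates the branch and sets upstream
git push -q                       # afterwards a plain push is enough
</code></pre>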
<p><strong>Fetching and Pulling:</strong> To get commits from others, there are two commands:</p>
<ul>
<li><p><code>git fetch origin</code> will fetch all new commits and update remote-tracking branches (like <code>origin/main</code>) in your local repo, but it <strong>won’t alter your working copy</strong> or local branches. After fetching, you can inspect what’s new (with <code>git log origin/main</code>, for example) and then merge or rebase manually.</p>
</li>
<li><p><code>git pull</code> is essentially a shortcut that does a <code>git fetch</code> followed by a <code>git merge</code> (or rebase) of the fetched branch into your current branch. Typically, running <code>git pull</code> in your local main branch will fetch updates from origin and then merge origin/main into your local main (bringing you up to date).</p>
</li>
</ul>
<p>Because <code>git pull</code> by default merges, some prefer to avoid potential merge commits by configuring pull to rebase (<code>git config pull.rebase true</code> or using <code>git pull --rebase</code>). This will replay your local commits on top of the fetched commits instead of a merge commit, similar to doing <code>fetch</code> then <code>rebase</code>. Use whichever strategy your team prefers.</p>
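<p>The fetch/merge split is easy to see with two clones of the same remote (a local bare repo stands in for the server here): clone <code>a</code> publishes work, and clone <code>b</code> picks it up in two explicit steps instead of one <code>git pull</code>:</p>
<pre><code class="lang-bash">set -e
tmp=$(mktemp -d)
git init -q --bare "$tmp/origin.git"

git clone -q "$tmp/origin.git" "$tmp/a" 2>/dev/null
git -C "$tmp/a" config user.email a@example.com
git -C "$tmp/a" config user.name a
git -C "$tmp/a" commit -q --allow-empty -m "shared history"
git -C "$tmp/a" push -q origin HEAD

git clone -q "$tmp/origin.git" "$tmp/b"
git -C "$tmp/a" commit -q --allow-empty -m "new upstream work"
git -C "$tmp/a" push -q origin HEAD

git -C "$tmp/b" fetch -q origin                # updates remote-tracking refs only
git -C "$tmp/b" log --oneline '..@{upstream}'  # the commit b doesn't have yet
git -C "$tmp/b" merge -q '@{upstream}'         # the second half of what 'git pull' does
</code></pre>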
<p><strong>Collaboration via Pull Requests:</strong> On platforms like GitHub/GitLab/Bitbucket, a common workflow is:</p>
<ul>
<li><p>Developer pushes a feature branch to the remote.</p>
</li>
<li><p>They open a Pull Request (GitHub/Bitbucket term) or Merge Request (GitLab term) to merge their branch into another (e.g., into <code>main</code>).</p>
</li>
<li><p>Teammates review the code via the platform, discuss, and eventually the PR is merged (often the platform will handle the merge by creating a merge commit or squashing, depending on settings).</p>
</li>
</ul>
<p>As a data engineer, you’ll use these features to collaborate on pipelines and analytics code. Pull requests enable code review which is valuable to maintain code quality in ETL scripts or SQL transformations.</p>
<p><strong>Remote Branch Tracking:</strong> When you clone, your local main is automatically set to track origin/main. You’ll see this with <code>git branch -vv</code> (which shows what each local branch is tracking and if it’s behind/ahead). If your local branch is tracking a remote, you can use <code>git pull</code> and <code>git push</code> without specifying the remote/branch every time.</p>
<p>If you have multiple remotes (say <code>origin</code> and another called <code>upstream</code>), you specify which remote in commands (e.g., <code>git fetch upstream</code>, <code>git push origin feature-branch</code>).</p>
<p><strong>Tip:</strong> It’s a good habit to pull the latest changes on <code>main</code> before starting new work, and to make sure your local branch is up to date with the remote before pushing; otherwise the push will be rejected and you’ll need to pull and reconcile first.</p>
<h2 id="heading-advanced-topics-and-tooling-integration">Advanced Topics and Tooling Integration</h2>
<p>This section touches on how Git interacts with other tools and workflows common in data engineering.</p>
<h3 id="heading-git-integration-with-cicd-and-automation-tools">Git Integration with CI/CD and Automation Tools</h3>
<p><strong>Jenkins (Continuous Integration server):</strong> Jenkins is a popular CI tool often used to automate builds, tests, and deployments in data pipelines. Jenkins has robust Git support – the Git plugin in Jenkins can poll repositories, fetch branches, and even push tags or code as part of a pipeline (<a target="_blank" href="https://plugins.jenkins.io/git/#:~:text=The%20git%20plugin%20provides%20fundamental,merge%2C%20tag%2C%20and%20push%20repositories">Git - Jenkins Plugins</a>). In a typical Jenkins pipeline (defined by a <code>Jenkinsfile</code> stored in the repo), you might see a step like <code>checkout scm</code> which clones the repository at the correct commit of a build. For data engineers, this means whenever code is pushed to a branch, Jenkins can automatically trigger jobs: run data pipeline tests, lint SQL scripts, build container images for data apps, etc. The integration is seamless – you point Jenkins to the repo and branch, and it handles the rest (it uses your Git credentials or webhooks to know when to pull new commits).</p>
<p>Jenkins can be configured to trigger on certain branch pushes or PR merges. For example, you might set up Jenkins to run a nightly ETL job; Jenkins will <code>git pull</code> the latest scripts from <code>main</code> before execution, ensuring the job uses the most up-to-date code. In summary, Jenkins + Git allows <strong>continuous integration</strong>: every change in Git can automatically go through a pipeline of tests/deployments. The Git plugin in Jenkins supports all fundamental operations (clone, fetch, branch, merge) needed to integrate your repository into CI workflows (<a target="_blank" href="https://plugins.jenkins.io/git/#:~:text=The%20git%20plugin%20provides%20fundamental,merge%2C%20tag%2C%20and%20push%20repositories">Git - Jenkins Plugins</a>).</p>
<p><strong>GitHub Actions, GitLab CI, Bitbucket Pipelines:</strong> These are CI/CD systems integrated into Git hosting platforms. They all revolve around Git events. For instance, GitHub Actions can be triggered on a push or PR; the workflow will automatically checkout the repository (<code>uses: actions/checkout@v3</code> in config) which grabs the code at that commit. Similarly, GitLab CI/CD is configured via a <code>.gitlab-ci.yml</code> in the repo – on each push, runners fetch the code and execute jobs. As a data engineer, you can leverage these to automate testing of data transformations or even deploy infrastructure as code from your repo.</p>
<p><strong>Apache Airflow (Workflow Orchestration):</strong> Airflow is used to schedule and manage data pipelines (DAGs). While Airflow itself isn’t a version control system, it’s a best practice to treat your Airflow <strong>DAG files as code</strong> and manage them in Git (<a target="_blank" href="https://www.restack.io/p/version-control-for-ai-answer-airflow-dag-api-cat-ai#:~:text=1,for%20different%20workflows%20or%20projects">Version Control For Ai Airflow Dag Api | Restackio</a>). Typically, you develop DAGs (Python scripts defining tasks and dependencies) in a repository. Benefits of using Git here include tracking changes to workflows, code reviews for DAG changes, and the ability to roll back to prior versions of pipelines if a new DAG run fails.</p>
<p>Many Airflow deployments integrate with Git:</p>
<ul>
<li><p>The <strong>Astronomer/Airflow Kubernetes setup</strong> can use a sidecar container to periodically <code>git pull</code> a repo of DAGs (using tools like <strong>git-sync</strong>). This way, updating the DAG code in Git automatically updates what Airflow runs.</p>
</li>
<li><p>In a simpler scenario, you might manually deploy DAGs by pulling from Git, or use a CI job to deploy: e.g., a GitHub Actions workflow detects changes in the <code>dags/</code> folder and pushes the updated files to your Airflow instance.</p>
</li>
</ul>
<p><strong>Best practices:</strong> Use branches and pull requests for DAG changes, just like any code. For instance, implement a new data pipeline on a feature branch, get it reviewed, merge to main, then your deployment mechanism pulls main to update Airflow. You can also use tags to mark stable DAG versions that correspond to production deployments (<a target="_blank" href="https://www.restack.io/p/version-control-for-ai-answer-airflow-dag-api-cat-ai#:~:text=3,previous%20state%20if%20issues%20arise">Version Control For Ai Airflow Dag Api | Restackio</a>).</p>
<p>In fact, guidelines for Airflow version control mirror normal software:</p>
<ul>
<li><p><em>“Store your DAG files in a Git repository. This allows you to track changes, collaborate with team members, and revert to previous versions if necessary.”</em> (<a target="_blank" href="https://www.restack.io/p/version-control-for-ai-answer-airflow-dag-api-cat-ai#:~:text=1,for%20different%20workflows%20or%20projects">Version Control For Ai Airflow Dag Api | Restackio</a>)</p>
</li>
<li><p><em>“Implement a branching strategy (feature branches for development, main branch for stable releases) to isolate changes and test them before merging into the main workflow.”</em> (<a target="_blank" href="https://www.restack.io/p/version-control-for-ai-answer-airflow-dag-api-cat-ai#:~:text=2,merging%20into%20the%20main%20workflow">Version Control For Ai Airflow Dag Api | Restackio</a>)</p>
</li>
<li><p><em>“Use Git tags to mark stable releases of your DAGs, making it simpler to reference specific versions or roll back if issues arise.”</em> (<a target="_blank" href="https://www.restack.io/p/version-control-for-ai-answer-airflow-dag-api-cat-ai#:~:text=3,previous%20state%20if%20issues%20arise">Version Control For Ai Airflow Dag Api | Restackio</a>).</p>
</li>
</ul>
<p>By following these, your data pipeline code benefits from the same rigor as application code.</p>
<p><strong>GitOps and Infrastructure as Code:</strong> GitOps is an approach where Git is the single source of truth for infrastructure and deployments. In GitOps workflows, all changes to infrastructure or configuration (Kubernetes manifests, Terraform files, etc.) are made via Git commits, and automated processes apply those changes to the environment.</p>
<p>For example, in a Kubernetes-based data platform, you might have all your cluster configs and pipeline manifests in a Git repo. A tool like <strong>Argo CD</strong> or <strong>Flux</strong> monitors the repo and applies changes to the cluster whenever a commit is made. This means <strong>deployment is triggered by a git push</strong>, not a manual kubectl or CLI call. GitOps heavily relies on <strong>Infrastructure as Code (IaC)</strong>: <em>“GitOps uses a Git repository as the single source of truth for infrastructure definitions”</em> (<a target="_blank" href="https://about.gitlab.com/topics/gitops/#:~:text=IaC%3A">What is GitOps?</a>) and uses pull/merge requests as the change mechanism. When changes are merged to the main config repo, a CI/CD pipeline or GitOps operator applies those changes (e.g., deploys new services, updates database config, etc.) ensuring the live environment always matches what’s in Git (<a target="_blank" href="https://about.gitlab.com/topics/gitops/#:~:text=CI%2FCD%3A">What is GitOps?</a>).</p>
<p>In data engineering, you can apply GitOps to things like: managing Kubernetes-based ETL services, configuration for data processing jobs, or even scheduling configurations. It brings benefits of <strong>auditability</strong> (every infra change is logged in Git history) and <strong>reliability</strong> (reducing ad-hoc manual changes). For instance, if you manage an Apache Kafka cluster config via code, any topic creation or ACL change could be done by editing a file in Git and letting an automated process enact it.</p>
<p>In summary, Git isn’t just for application code – it’s central to modern <strong>DataOps</strong> practices too. With Git-managed workflows, you achieve consistent, repeatable deployments for data pipelines and infrastructure, often via GitOps tooling.</p>
<h3 id="heading-submodules-including-external-repos-in-your-repo">Submodules (Including External Repos in Your Repo)</h3>
<p>Sometimes a project may want to include another repository within itself – for example, a data engineering team might maintain a separate repo for common SQL scripts or a shared library that they want to use in multiple pipeline projects. <strong>Git submodules</strong> allow you to embed one repository as a sub-directory of another repository. The submodule still retains its own history and can be developed somewhat independently.</p>
<p>A submodule is essentially a <strong>reference to another repo at a specific commit</strong> (a snapshot) ( <a target="_blank" href="https://www.atlassian.com/git/tutorials/git-submodule#:~:text=Git%20submodules%20allow%20you%20to,to%20another%20repository%20at%20a">Git submodule | Atlassian</a> ). Your main repo will store a pointer (the commit SHA) of the external repo that you’ve locked in. This enables you to version external code along with your project.</p>
<p><strong>Adding a submodule:</strong> In the parent repository, run:</p>
<pre><code class="lang-bash">git submodule add &lt;repo-url&gt; &lt;path&gt;
</code></pre>
<p>This will: clone the external repo into a sub-folder (the path you specified), and record that path and the current commit of the external repo in the parent repo’s index. After adding, you’ll see a new file <code>.gitmodules</code> that maps submodule paths to their URLs.</p>
<p>For example:</p>
<pre><code class="lang-bash">git submodule add https://github.com/org/data-utils.git libs/data-utils
</code></pre>
<p>This creates a folder <code>libs/data-utils</code> in your repo containing that repository. By default, it checks out the HEAD of that repository’s default branch at the time you add it. Commit the new submodule (it shows up with a special mode in <code>git status</code>). The commit in your repo doesn’t include the submodule’s files, just a reference to the commit ID of the sub-repo.</p>
<p><strong>Working with submodules:</strong> If the submodule repository changes (you want to update to a newer commit), you go into the submodule directory (<code>cd libs/data-utils</code>) which is itself a Git repo. You can pull or checkout a different commit/tag there. Then go back to the main repo and you’ll see that Git recognizes the submodule’s commit changed (like it sees that “libs/data-utils” now points to a new commit). You then commit that change in the parent to update the submodule reference.</p>
<p><strong>Cloning a repo with submodules:</strong> By default, <code>git clone</code> won’t fetch submodule contents. After cloning a repo that has submodules, you’ll see empty directories for them until you initialize them. To get them:</p>
<ul>
<li><p>Run <code>git submodule update --init --recursive</code> to fetch all submodules (and any nested submodules, hence <code>--recursive</code>) and check them out at the recorded commits.</p>
</li>
<li><p>Alternatively, clone with <code>--recurse-submodules</code> flag: <code>git clone --recurse-submodules &lt;repo-url&gt;</code>, which clones and automatically initializes and checks out submodules.</p>
</li>
</ul>
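<p>Here’s the round trip with two local scratch repositories standing in for the parent project and the shared library (note the <code>protocol.file.allow</code> override: recent Git versions block file-path submodules by default as a security measure):</p>
<pre><code class="lang-bash">set -e
tmp=$(mktemp -d)

git init -q "$tmp/data-utils"                 # stands in for the shared library
git -C "$tmp/data-utils" config user.email demo@example.com
git -C "$tmp/data-utils" config user.name demo
git -C "$tmp/data-utils" commit -q --allow-empty -m "shared helpers v1"

git init -q "$tmp/pipeline"                   # the parent project
cd "$tmp/pipeline"
git config user.email demo@example.com
git config user.name demo
git commit -q --allow-empty -m "pipeline skeleton"

git -c protocol.file.allow=always \
    submodule --quiet add "$tmp/data-utils" libs/data-utils
cat .gitmodules                               # maps the path to its URL
git commit -q -m "add data-utils submodule"   # submodule add already staged both

# The parent stores a commit pointer (mode 160000), not the submodule's files:
git ls-tree HEAD libs/data-utils
</code></pre>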
<p><strong>Common submodule commands:</strong></p>
<ul>
<li><p><code>git submodule status</code> – lists submodules with their current commit and whether they are up-to-date or modified.</p>
</li>
<li><p><code>git submodule update --remote --merge</code> – can update submodules to the latest commit from a branch (if configured), but generally submodules stick to specific commits until manually changed.</p>
</li>
<li><p>Each submodule is an independent git repo, so you can treat it like one: go into its folder, create branches, commit, push, etc., but remember to update the parent repo’s pointer if you want the parent to use those new commits.</p>
</li>
</ul>
<p><strong>When (not) to use submodules:</strong> Submodules are useful for pulling in external code without copying it, and keeping it versioned. However, they add complexity: developers need to remember to init/update them, and two repos’ workflows now intersect. For data engineers, a common use might be including a shared utils library across multiple projects, or maybe embedding a specific version of a third-party tool’s source for auditability. If the coupling isn’t strict, an alternative is to use package managers (pip, etc.) or service endpoints for shared resources. But if you need to vendor code with full history, submodules can work.</p>
<p>Submodules are considered an advanced feature – they can confuse new Git users. Weigh the pros/cons (ease of single-repo versus multiple) before using them ( <a target="_blank" href="https://www.atlassian.com/git/tutorials/git-submodule#:~:text=match%20at%20L687%20Git%20submodules,for%20team%20members%20to%20adopt">Git submodule | Atlassian</a> ). A quick recap: <em>“Git submodules allow you to keep a Git repository as a subdirectory of another. They are simply a reference to another repository at a particular snapshot in time, enabling you to incorporate and track version history of external code within your project.”</em> ( <a target="_blank" href="https://www.atlassian.com/git/tutorials/git-submodule#:~:text=Git%20submodules%20allow%20you%20to,to%20another%20repository%20at%20a">Git submodule | Atlassian</a> )</p>
<h3 id="heading-git-large-file-storage-lfs-for-big-data">Git Large File Storage (LFS) for Big Data</h3>
<p>Data engineers often deal with large data files, binaries, or machine learning models. <strong>Git LFS</strong> (Large File Storage) is an extension to Git that replaces large files in your repository with lightweight pointers, while storing the actual file contents on a remote server optimized for big files. If you find yourself needing to version datasets or other large artifacts, consider using Git LFS instead of normal Git.</p>
<p>For example, without LFS, adding a 100MB CSV to a repo will bloat the repo and every clone must download it. With LFS, that CSV would be replaced by a small text pointer in Git, and the actual 100MB content is stored on the LFS server (which may be provided by your Git host). When you clone or pull, LFS will download the big file separately (only if needed).</p>
<p>Git LFS was co-developed by GitHub and Atlassian, among others, to mitigate large file issues ( <a target="_blank" href="https://www.atlassian.com/git/tutorials/git-lfs#:~:text=file%20has%20to%20be%20downloaded,Specifically%2C%20large%20files%20are">Git LFS - large file storage | Atlassian Git Tutorial</a> ). To use it:</p>
<ul>
<li><p>Install the <code>git lfs</code> client and run <code>git lfs install</code> (one-time per repo or machine).</p>
</li>
<li><p>Track specific extensions: e.g., <code>git lfs track "*.csv"</code> – this adds patterns to an <code>.gitattributes</code> file indicating those files should use LFS.</p>
</li>
<li><p>From then on, any new CSV you add will be stored via LFS (you’ll see a pointer file in Git instead of actual content). Pushing will push through LFS.</p>
</li>
</ul>
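<p>For reference, <code>git lfs track "*.csv"</code> simply appends a line like this to <code>.gitattributes</code> (which you commit like any other file):</p>
<pre><code>*.csv filter=lfs diff=lfs merge=lfs -text
</code></pre>
<p>The <code>filter=lfs</code> attribute tells Git to run matching files through the LFS clean/smudge filters on add and checkout, which is how the pointer-file substitution happens.</p>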
<p>This keeps your repository lean and fast while still versioning large files. It’s worth noting that Git LFS requires server support (most major hosts support it, though there might be storage limits). If working with truly massive data (GBs), often it’s better to store data in proper data storage (cloud buckets, HDFS, etc.) and just keep references or scripts in Git, but LFS is a good middle-ground for moderately large files that need version tracking (like model binaries under 500MB, etc.).</p>
<hr />
<p>These study notes have covered Git from fundamentals to advanced topics, focusing on command-line usage which is the backbone of many data engineering workflows. We saw how to set up repos, manage branching and merging strategies, handle conflicts, rewrite history carefully, stash work for flexibility, use reflog to recover, tag releases, and even bring in external code or large files. We also touched on how Git underpins collaboration platforms (GitHub/GitLab/Bitbucket) and modern workflows with CI/CD (Jenkins, Airflow) and GitOps for infrastructure.</p>
<p>By mastering these Git commands and concepts, a data engineer can confidently version-control pipelines, collaborate with team members, and integrate with automation tools – ensuring that data projects are reproducible, auditable, and maintainable for the long run.</p>
<p><strong>Sources:</strong> Git documentation and tutorials ( <a target="_blank" href="https://www.atlassian.com/git/tutorials/setting-up-a-repository/git-init#:~:text=The%20,run%20in%20a%20new%20project">git init | Atlassian Git Tutorial</a> ) ( <a target="_blank" href="https://www.atlassian.com/git/tutorials/setting-up-a-repository/git-init#:~:text=A%20quick%20note%3A%20,However%2C%20%60git">git init | Atlassian Git Tutorial</a> ) ( <a target="_blank" href="https://www.atlassian.com/git/tutorials/using-branches#:~:text=Git%20branches%20are%20effectively%20a,it%20into%20the%20main%20branch">Git Branch | Atlassian Git Tutorial</a> ) ( <a target="_blank" href="https://www.atlassian.com/git/tutorials/comparing-workflows/gitflow-workflow#:~:text=Gitflow%20is%20a%20legacy%20Git,details%20Gitflow%20for%20historical%20purposes">Gitflow Workflow | Atlassian Git Tutorial</a> ) ( <a target="_blank" href="https://www.atlassian.com/git/tutorials/merging-vs-rebasing#:~:text=This%20moves%20the%20entire%20,commit%20in%20the%20original%20branch">Merging vs. 
Rebasing | Atlassian Git Tutorial</a> ) (<a target="_blank" href="https://dev.to/farhatsharifh/from-changes-to-safe-keeping-git-stash-m42#:~:text=Command%3A%20,or%20work%20on%20something%20else">From Changes to Safe Keeping: Git Stash - DEV Community</a>) ( <a target="_blank" href="https://www.atlassian.com/git/tutorials/rewriting-history/git-reflog#:~:text=This%20page%20provides%20a%20detailed,Common%20examples%20include">Git Reflog Configuration | Atlassian Git Tutorial</a> ) (<a target="_blank" href="https://confluence.atlassian.com/bbkb/tagger-date-and-message-fields-missing-in-webhook-payloads-and-the-rest-api-1142247761.html#:~:text=Annotated%20Tags">Tagger, Date, and Message fields missing in Webhook Payloads and the Rest API | Bitbucket Cloud | Atlassian Support</a>) (<a target="_blank" href="https://confluence.atlassian.com/bbkb/tagger-date-and-message-fields-missing-in-webhook-payloads-and-the-rest-api-1142247761.html#:~:text=Lightweight%20Tags">Tagger, Date, and Message fields missing in Webhook Payloads and the Rest API | Bitbucket Cloud | Atlassian Support</a>) (<a target="_blank" href="https://plugins.jenkins.io/git/#:~:text=The%20git%20plugin%20provides%20fundamental,merge%2C%20tag%2C%20and%20push%20repositories">Git - Jenkins Plugins</a>) (<a target="_blank" href="https://www.restack.io/p/version-control-for-ai-answer-airflow-dag-api-cat-ai#:~:text=1,for%20different%20workflows%20or%20projects">Version Control For Ai Airflow Dag Api | Restackio</a>) (<a target="_blank" href="https://www.restack.io/p/version-control-for-ai-answer-airflow-dag-api-cat-ai#:~:text=3,previous%20state%20if%20issues%20arise">Version Control For Ai Airflow Dag Api | Restackio</a>) (<a target="_blank" href="https://about.gitlab.com/topics/gitops/#:~:text=IaC%3A">What is GitOps?</a>)</p>
]]></content:encoded></item><item><title><![CDATA[DAX Study Notes]]></title><description><![CDATA[Introduction to DAX
Data Analysis Expressions (DAX) is the formula language for the data modeling layer in Microsoft Power BI, Excel Power Pivot, and SQL Server Analysis Services (Tabular mode). DAX allows analysts to create custom calculations over ...]]></description><link>https://notes.beesho.me/dax-study-notes</link><guid isPermaLink="true">https://notes.beesho.me/dax-study-notes</guid><category><![CDATA[dax]]></category><category><![CDATA[DAXFunctions]]></category><dc:creator><![CDATA[Beshoy Sabri]]></dc:creator><pubDate>Thu, 24 Apr 2025 18:13:42 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1745518386168/8def3217-280c-495e-bf1f-e3e2a9421848.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-introduction-to-dax">Introduction to DAX</h2>
<p><strong>Data Analysis Expressions (DAX)</strong> is the formula language for the data modeling layer in Microsoft Power BI, Excel Power Pivot, and SQL Server Analysis Services (Tabular mode). DAX allows analysts to create custom calculations over their data in a <em>relational</em> model. It is similar in syntax to Excel formulas, but operates on <strong>tables and columns</strong> rather than individual cells. DAX formulas are used to create calculated columns, measures, and even entire tables within the data model. This enables dynamic analyses where results respond to user selections and filters (<a target="_blank" href="https://support.microsoft.com/en-us/office/context-in-dax-formulas-2728fae0-8309-45b6-9d32-1d600440a7ad#:~:text=Context%20enables%20you%20to%20perform,for%20troubleshooting%20problems%20in%20formulas">Context in DAX Formulas - Microsoft Support</a>). In Power BI and Power Pivot, DAX is essential for building interactive reports and dashboards.</p>
<p><strong>Key characteristics of DAX:</strong></p>
<ul>
<li><strong>Relational &amp; Contextual:</strong> DAX works with tables that can be related. Calculations respect the relationships between tables (similar to a database join). The result of a DAX formula can change depending on the current <em>context</em> (such as filters applied in a report) (<a target="_blank" href="https://support.microsoft.com/en-us/office/context-in-dax-formulas-2728fae0-8309-45b6-9d32-1d600440a7ad#:~:text=Context%20enables%20you%20to%20perform,for%20troubleshooting%20problems%20in%20formulas">Context in DAX Formulas - Microsoft Support</a>).</li>
</ul>
<ul>
<li><strong>Calculated at Query Time:</strong> Many DAX calculations (especially <em>measures</em>) are not materialized until needed in a visual or query. This means they can adapt to user slicers or pivot selections in real-time.</li>
</ul>
<ul>
<li><strong>Use Cases:</strong> Commonly used for financial aggregations, time intelligence (year-to-date totals, year-over-year comparisons), filtering data for specific conditions, creating new classifications or groupings, and performing lookups between tables in your model.</li>
</ul>
<ul>
<li><strong>Power BI vs Excel:</strong> The DAX language is the same in Power BI Desktop and Excel’s Power Pivot (and Analysis Services). In Excel 2013, measures were called <em>Calculated Fields</em>, but since Excel 2016 they are once again called <em>Measures</em> ( <a target="_blank" href="https://www.sqlbi.com/articles/calculated-columns-and-measures-in-dax/#:~:text=aggregate%20values%20from%20many%20rows,Pivot%20for%20Excel%202010%2C%20too">Calculated Columns and Measures in DAX - SQLBI</a>) ( <a target="_blank" href="https://www.sqlbi.com/articles/calculated-columns-and-measures-in-dax/#:~:text=There%20is%20another%20way%20of,Pivot%20for%20Excel%202010%2C%20too">Calculated Columns and Measures in DAX - SQLBI</a>). In Power BI Desktop, you create DAX calculations using the <em>New Measure</em>, <em>New Column</em>, or <em>New Table</em> buttons in the modeling interface.</li>
</ul>
<blockquote>
<p><strong>Tip:</strong> If you're familiar with Excel formulas, think of DAX as serving a similar purpose but for <em>tables of data</em>. Unlike Excel’s cell-by-cell computation, DAX computes over entire columns and tables at once, leveraging an in-memory engine for fast aggregation. Always ensure you have a well-designed data model (with lookup tables and relationships) to get the most out of DAX.</p>
</blockquote>
<hr />
<h2 id="heading-calculated-columns">Calculated Columns</h2>
<p>A <strong>Calculated Column</strong> is a new column added to a table in the data model whose values are defined by a DAX formula (expression). When you create a calculated column, the formula is evaluated <strong>for each row</strong> of the table, and the results are stored in the model. In other words, a calculated column uses a <strong>row context</strong>: it knows the “current row” of the table and computes a value based on that row’s data ( <a target="_blank" href="https://www.sqlbi.com/articles/row-context-and-filter-context-in-dax/#:~:text=You%20have%20a%20row%20context,in%20a%20calculated%20column">Row Context and Filter Context in DAX - SQLBI</a>). Once computed, the values exist in memory just like any other column in the table.</p>
<ul>
<li><p><strong>How to create:</strong> In Power BI Desktop, use <em>New Column</em> and write an expression. For example, to add a profit margin column in a Sales table:</p>
<pre><code class="lang-sql">  Profit = Sales[SalesAmount] - Sales[TotalCost]
</code></pre>
<p>  This formula runs for every row of the Sales table, subtracting cost from sales amount for that row. In Power Pivot (Excel), you would insert a new column and provide a similar formula (without the “Sales[Column] =” part).</p>
</li>
</ul>
<ul>
<li><strong>Context:</strong> Calculated columns inherently evaluate row by row. You can reference other columns from the same row easily by name. If you reference a column from a related table, you typically need to use functions like <code>RELATED()</code> (more on that in the Relationships section) because the row context doesn’t automatically traverse to lookup tables without help.</li>
</ul>
<ul>
<li><strong>When they calculate:</strong> Calculated column values are computed at data refresh (or when the column is first created). They do <strong>not</strong> automatically respond to slicers or filters in a report. In other words, their values are static for each row until the data is refreshed. This is a key difference from measures.</li>
</ul>
<ul>
<li><strong>Storage and performance:</strong> Once computed, a calculated column’s values are stored in the model, increasing the model size. Every intermediate or extra calculated column consumes memory (<a target="_blank" href="https://endjin.com/blog/2022/04/measures-vs-calculated-columns-in-dax#:~:text=applied%20to%20the%20whole%20table,is%20used%20in%20the%20report">Measures vs Calculated Columns in DAX and Power BI</a>). Too many calculated columns can bloat your data model. Since they are pre-calculated, they don’t add query-time overhead (no recalculation per interaction), but they do use RAM and can slow down data refresh if complex.</li>
</ul>
<p><strong>Example – Creating a Calculated Column</strong><br />Suppose we want a category for each product indicating if it’s a high-value item (price &gt; $100). We have a Products table with a [Price] column. We create a new column:</p>
<pre><code class="lang-sql">HighValueFlag = IF(Products[Price] &gt; 100, "High Value", "Standard")
</code></pre>
<p>This will mark each product as “High Value” or “Standard” based on its own price. The result is stored in the Products table as a new field. We can use this field in our visuals like any other column (for example, count of High Value products).</p>
<blockquote>
<p><strong>Note:</strong> The row context means each row’s calculation is independent. You cannot “peek” at other rows from a calculated column without an explicit aggregation or function. For instance, in a calculated column, you can’t directly ask for the “max price in the table” unless you wrap it in a function like <code>MAX(Products[Price])</code>, which would ignore row context and give the same result for every row (the overall max). Generally, if you find yourself needing an aggregate in a calculated column, it might be a sign that you should use a measure instead.</p>
</blockquote>
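<p><em>Sketch:</em> the one situation where an aggregate inside a calculated column is deliberate is when you <em>want</em> the same table-wide value on every row, for example to compare each row against it. Using the Products table from the example above (the column name <code>PctOfMaxPrice</code> is just illustrative):</p>
<pre><code class="lang-sql">-- Calculated column on Products: row context, no filter context.
-- MAX() ignores the row context and returns the overall maximum
-- price for every row; DIVIDE then compares this row's price to it.
PctOfMaxPrice = DIVIDE( Products[Price], MAX( Products[Price] ) )
</code></pre>
<p>If instead you need an aggregate that responds to report filters, write it as a measure.</p>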
<p><strong>When to use Calculated Columns:</strong></p>
<ul>
<li>You need a fixed value per row that will be used for slicing or grouping in reports (e.g., classification, category, or an intermediate calculation that won’t change with filter context).</li>
</ul>
<ul>
<li>The calculation is simple and necessary for the data model (for example, a concatenation of two text fields for a unique ID, or extracting a year from a date if you don’t have a separate Date table).</li>
</ul>
<ul>
<li>Avoid using calculated columns for values that should dynamically respond to filters (those should be measures). Also avoid creating many calculated columns as a substitute for proper data transformation in Power Query or the source; if it can be done in ETL, often that’s preferable for model size.</li>
</ul>
<blockquote>
<p><strong>Quick tip:</strong> It’s generally better to perform row-by-row calculations in the data source or Power Query (during data load) if possible. Use calculated columns sparingly for model-specific needs or conditional flags needed in the model. Overuse of calculated columns can lead to large models and memory pressure.</p>
</blockquote>
<hr />
<h2 id="heading-measures">Measures</h2>
<p>A <strong>Measure</strong> (sometimes called a <em>calculated measure</em> or <em>calculation</em>) is a DAX formula that is evaluated on the fly, in the context of the filter selections in your report (the <em>filter context</em>). Measures are not stored as a new column; instead, a measure is a formula saved in the model that computes a result at query time. Because of this, measures <strong>always reflect the current slicers, filters, and rows/columns of your report visualization</strong> (<a target="_blank" href="https://endjin.com/blog/2022/04/measures-vs-calculated-columns-in-dax#:~:text=Another%20important%20difference%20between%20measures,user%20interaction%20in%20the%20report">Measures vs Calculated Columns in DAX and Power BI</a>). They are effectively re-calculated for each cell of a pivot table or visual.</p>
<ul>
<li><p><strong>How to create:</strong> In Power BI Desktop, clicking <em>New Measure</em> lets you write a formula that usually aggregates or computes over data. For example:</p>
<pre><code class="lang-sql">  Total Sales = SUM( Sales[SalesAmount] )
</code></pre>
<p>  This measure adds up the SalesAmount for whatever filters are applied. In an Excel pivot table or Power BI visual, if you put <code>Total Sales</code> and segment by Year, it will show the sum for each Year (each cell’s value differs based on the year filter).</p>
</li>
</ul>
<ul>
<li><strong>No inherent row context:</strong> Measures don’t evaluate per row of a table by default. If you try to reference a column directly in a measure (e.g. <code>Sales[SalesAmount]</code> alone), DAX will throw an error because it doesn’t know <em>which</em> row’s value to take ( <a target="_blank" href="https://www.sqlbi.com/articles/row-context-and-filter-context-in-dax/#:~:text=If%20a%20row%20context%20is,this%20measure%20is%20not%20valid">Row Context and Filter Context in DAX - SQLBI</a>). Measures require aggregation or an expression that can work on potentially many rows. That’s why most measures use aggregator functions like SUM, AVERAGE, COUNT, etc., or more complex expressions with <code>SUMX</code>, <code>CALCULATE</code>, etc. Measures always produce a single value (scalar) for the context in which they are evaluated.</li>
</ul>
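<p>A sketch of how iterators bridge this gap: an iterator function gives the measure a temporary row context for each row it scans, yet the measure still returns a single scalar for the current filter context (the <code>Sales[Price]</code> column here is assumed for illustration):</p>
<pre><code class="lang-sql">-- SUMX iterates the Sales table; inside its second argument there is
-- a row context, so naked column references are valid per row.
Total Line Revenue = SUMX( Sales, Sales[Quantity] * Sales[Price] )
</code></pre>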
<ul>
<li><strong>Filter context:</strong> A critical aspect of measures is that they are evaluated in a <strong>filter context</strong>. Filter context is essentially the set of filters (from slicers, visuals, or page/report filters) that define the current subset of data for evaluation ( <a target="_blank" href="https://www.sqlbi.com/articles/row-context-and-filter-context-in-dax/#:~:text=The%20filter%20context%20is%20the,%E2%80%9Cfilter%20context%E2%80%9D%20the%20set%20of">Row Context and Filter Context in DAX - SQLBI</a>). Measures automatically consider any filters on their underlying tables. For example, if a report page is filtered to Region = "Europe", the <code>Total Sales</code> measure will yield the sum of SalesAmount <em>only for European sales</em> without you having to explicitly specify that in the formula. Similarly, placing fields on rows/columns of a pivot creates a filter context for each cell (the intersection of that row and column’s values) ( <a target="_blank" href="https://www.sqlbi.com/articles/row-context-and-filter-context-in-dax/#:~:text=The%20filter%20context%20is%20the,%E2%80%9Cfilter%20context%E2%80%9D%20the%20set%20of">Row Context and Filter Context in DAX - SQLBI</a>).</li>
</ul>
<ul>
<li><strong>Calculated at query time:</strong> Unlike calculated columns, measures are calculated on the fly. They don’t increase the stored data size since only the formula is stored, not the results (<a target="_blank" href="https://endjin.com/blog/2022/04/measures-vs-calculated-columns-in-dax#:~:text=applied%20to%20the%20whole%20table,is%20used%20in%20the%20report">Measures vs Calculated Columns in DAX and Power BI</a>). The computation happens whenever the measure is used in a visual or PivotTable. If no visual is using a measure, it doesn’t calculate at all. This makes measures very powerful and flexible – they respond instantly to user interaction, enabling truly dynamic analysis.</li>
</ul>
<ul>
<li><p><strong>Example measure calculations:</strong></p>
<ul>
<li><em>Basic sum:</em> <code>Total Quantity = SUM(Sales[Quantity])</code></li>
<li><em>Calculated ratio:</em> <code>Avg Price = DIVIDE( [Total Sales], [Total Quantity] )</code> – Here we reused two measures to get an average unit price. Measures can reference other measures.</li>
<li><em>Conditional measure:</em> <code>Sales (High Value Customers) = CALCULATE( [Total Sales], Customers[HighValueFlag] = "High Value" )</code> – (Assuming there's a HighValueFlag column in Customers). This measure uses <code>CALCULATE</code> to modify filter context (more on CALCULATE later), so it only sums sales for customers flagged as High Value.</li>
</ul>
</li>
</ul>
<p><strong>Calculated Columns vs. Measures – Key Differences:</strong></p>
<ul>
<li><strong>Row-by-row vs. aggregate:</strong> Calculated columns compute a result <em>per each row</em> of a table (row context), whereas measures compute a result <em>per filter context</em> (which could be an aggregate of many rows). A measure like <code>[Total Sales]</code> aggregates many rows’ values into one number for whatever filters apply, but a calculated column has one result per row of the original table.</li>
</ul>
<ul>
<li><strong>Storage vs. on-the-fly:</strong> Calculated column results are stored in the model (using memory) and do not change until data is refreshed (<a target="_blank" href="https://endjin.com/blog/2022/04/measures-vs-calculated-columns-in-dax#:~:text=applied%20to%20the%20whole%20table,is%20used%20in%20the%20report">Measures vs Calculated Columns in DAX and Power BI</a>). Measures are formulas evaluated on the fly, so they don’t take up space per se and can react to user selections. A measure is stored only as an expression (similar to how a query or view works).</li>
</ul>
<ul>
<li><strong>Filter context availability:</strong> Measures are aware of and evaluated within the report’s filter context (slicers, rows, etc.) (<a target="_blank" href="https://endjin.com/blog/2022/04/measures-vs-calculated-columns-in-dax#:~:text=Another%20important%20difference%20between%20measures,user%20interaction%20in%20the%20report">Measures vs Calculated Columns in DAX and Power BI</a>). Calculated columns are evaluated without any interactive filter context – they cannot know what the user selects on a report; they only know the data in that single row (though they can lookup related data in the model). This means measures can respond to user interactions (e.g., “sales for whatever year the user selected”), while calculated columns cannot (they are fixed once computed).</li>
</ul>
<ul>
<li><strong>Usage:</strong> Use calculated columns for categories or values you need as part of the data model (especially if you need to slice/filter on them). Use measures for dynamic calculations, especially anything aggregated (sum, counts, ratios, etc.) or that needs to respect user filters. In many cases, you can choose either approach for a given result, but the general guidance is to prefer measures for aggregation and business metrics, and use columns only when necessary for slicing or as intermediate building blocks.</li>
</ul>
<p>(<a target="_blank" href="https://endjin.com/blog/2022/04/measures-vs-calculated-columns-in-dax#:~:text=Calculated%20columns%20are%20computed%20based,is%20used%20in%20the%20report">Measures vs Calculated Columns in DAX and Power BI</a>) (<a target="_blank" href="https://endjin.com/blog/2022/04/measures-vs-calculated-columns-in-dax#:~:text=Another%20important%20difference%20between%20measures,user%20interaction%20in%20the%20report">Measures vs Calculated Columns in DAX and Power BI</a>) summarizes these differences: a calculated column is computed for each row at data refresh (increasing model size), whereas a measure is computed at query time and is evaluated in the context of report filters and slicers (so it’s more dynamic).</p>
<blockquote>
<p><strong>Real-world example:</strong> If you wanted “Sales per Region”: You could add a calculated column on each sales transaction row for Region (via relationship) and then sum it, but that’s inefficient. Instead, create a measure that sums Sales and use Region from a related table to slice the data. The measure will automatically give the correct total per region in a chart or pivot, without needing a new stored column.</p>
</blockquote>
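<p>A sketch of what that looks like in practice (assuming a hypothetical Region dimension related to Sales): the plain measure plus the relationship already handles per-region slicing, and <code>CALCULATE</code> can pin a single region if needed:</p>
<pre><code class="lang-sql">-- Reuses the [Total Sales] measure defined earlier. Putting
-- Region[RegionName] on a visual's axis yields per-region totals
-- automatically; an explicitly filtered variant would be:
Europe Sales = CALCULATE( [Total Sales], Region[RegionName] = "Europe" )
</code></pre>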
<hr />
<h2 id="heading-calculated-tables">Calculated Tables</h2>
<p>A <strong>Calculated Table</strong> is a table added to your model by writing a DAX formula that returns a table. Unlike calculated columns or measures, which live inside an existing table, a calculated table is a new table in the field list. You create them using the <em>New Table</em> feature in Power BI (or in the model diagram of Power Pivot) (<a target="_blank" href="https://learn.microsoft.com/en-us/power-bi/transform-model/desktop-calculated-tables#:~:text=Most%20of%20the%20time%2C%20you,to%20define%20the%20table%27s%20values">Using calculated tables in Power BI Desktop - Power BI | Microsoft Learn</a>). Calculated tables are useful for intermediate results, staging tables, or supporting tables like a date/calendar table generated via DAX.</p>
<ul>
<li><p><strong>How to create:</strong> Write a DAX expression that returns a table (enclosed in curly braces for literal rows, or using functions that produce tables). For example, to combine two tables with identical structure (like appending rows):</p>
<pre><code class="lang-sql">  CombinedTable = UNION( 'Northwest Employees', 'Southwest Employees' )
</code></pre>
<p>  This creates a new table named <em>CombinedTable</em> with rows from both regional employee tables (<a target="_blank" href="https://learn.microsoft.com/en-us/power-bi/transform-model/desktop-calculated-tables#:~:text=2,in%20the%20formula%20bar">Using calculated tables in Power BI Desktop - Power BI | Microsoft Learn</a>). The calculated table will appear in the Fields list, and you can create relationships to it or use its fields just like an imported table.</p>
</li>
</ul>
<ul>
<li><strong>When they calculate:</strong> A calculated table is computed during the data refresh process (or when first created). The results are stored in the model. Essentially, it’s as if the data were loaded from an external source – except the source is a DAX expression based on other data already in the model. If the underlying data changes and the model is refreshed, the calculated table will be recalculated. Note that for DirectQuery sources, calculated tables do <strong>not</strong> update until the model itself is refreshed (they don’t dynamically reflect query changes) (<a target="_blank" href="https://learn.microsoft.com/en-us/power-bi/transform-model/desktop-calculated-tables#:~:text=Just%20like%20other%20Power%20BI,table%20in%20DirectQuery%20as%20well">Using calculated tables in Power BI Desktop - Power BI | Microsoft Learn</a>).</li>
</ul>
<ul>
<li><p><strong>Usage examples:</strong></p>
<ul>
<li><p><strong>Date Table:</strong> Many models create a Date table using DAX. For instance:</p>
<pre><code class="lang-sql">  Calendar = CALENDAR( DATE(2020,1,1), DATE(2025,12,31) )
</code></pre>
<p>  This produces a table with one date per day for the given range. You can then add calculated columns to this Calendar table for Year, Month, etc., or use built-in functions like <code>YEAR()</code> to create those columns. Mark this table as the official Date table in the model so time intelligence functions work properly.</p>
</li>
</ul>
</li>
</ul>
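<p>A common variant of the Calendar example (one sketch among several valid approaches) builds the helper columns in the same expression with <code>ADDCOLUMNS</code>, instead of adding Year and Month as separate calculated columns afterwards:</p>
<pre><code class="lang-sql">Calendar =
ADDCOLUMNS(
    CALENDAR( DATE(2020,1,1), DATE(2025,12,31) ),
    "Year", YEAR( [Date] ),
    "Month Number", MONTH( [Date] ),
    "Month", FORMAT( [Date], "MMM" )
)
</code></pre>
<p>Each added column is evaluated in a row context over the dates produced by <code>CALENDAR</code>, which outputs a single column named <code>[Date]</code>.</p>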
<ul>
<li><p><strong>Union or filtered tables:</strong> As in the previous example, combining two or more tables (<code>UNION</code>), or taking a subset of a table. For instance, you could create a table of Top 100 Customers by revenue:</p>
<pre><code class="lang-sql">  TopCustomers = TOPN( 100, SUMMARIZE( Sales, Sales[CustomerID], "TotalSales", SUM(Sales[Amount]) ), [TotalSales], DESC )
</code></pre>
<p>  This DAX snippet (advanced) creates a table of 100 customer IDs with highest sales. This can be used for special reporting or further analysis.</p>
</li>
</ul>
<ul>
<li><strong>Snapshot or summarized data:</strong> Sometimes you might create a calculated table to store a summary (e.g., a distinct list of values with some pre-calculated measures). However, remember these are static once computed at refresh – if you want truly dynamic summaries, a measure or visual-level calculation might be more appropriate.</li>
</ul>
<ul>
<li><strong>Relation to data model:</strong> Calculated tables can participate in relationships just like imported ones (<a target="_blank" href="https://learn.microsoft.com/en-us/power-bi/transform-model/desktop-calculated-tables#:~:text=Just%20like%20other%20Power%20BI,If%20a%20table%20needs%20to">Using calculated tables in Power BI Desktop - Power BI | Microsoft Learn</a>). For example, a Date table created via DAX can be related to a Sales table on the Date field. Calculated tables can also have calculated columns and measures of their own. Essentially, once created, they are normal tables in the model.</li>
</ul>
<p>(<a target="_blank" href="https://learn.microsoft.com/en-us/power-bi/transform-model/desktop-calculated-tables#:~:text=Most%20of%20the%20time%2C%20you,to%20define%20the%20table%27s%20values">Using calculated tables in Power BI Desktop - Power BI | Microsoft Learn</a>) (<a target="_blank" href="https://learn.microsoft.com/en-us/power-bi/transform-model/desktop-calculated-tables#:~:text=constructs%2C%20providing%20immense%20flexibility%20in,cross%20join%20two%20existing%20tables">Using calculated tables in Power BI Desktop - Power BI | Microsoft Learn</a>) — Calculated tables let you add new tables based on data already in your model, using DAX formulas to define their content. They are best used for intermediate calculations or data that you want as part of the model (as opposed to measures which are calculated on the fly).</p>
<blockquote>
<p><strong>Best practice:</strong> Use calculated tables for supporting structures like a calendar (date) table or a table of specific reference values that you cannot easily get from the data source. If your goal is purely to display a summary in a report, often a measure or visual-level aggregation is sufficient. Calculated tables shine when you need the table for further model logic (relationships, reusable subsets, etc.). Keep in mind they increase model size and refresh time, so ensure they’re truly needed (for instance, a small static table of dates or categories is fine, but creating a huge calculated table of millions of rows might be better done in the query or source).</p>
</blockquote>
<hr />
<h2 id="heading-relationships-in-dax">Relationships in DAX</h2>
<p>Data models in Power BI/Power Pivot are built with <strong>relationships</strong> between tables (often primary key to foreign key, like a Dimensional model with dimension tables and fact tables). Understanding how DAX works with relationships is critical:</p>
<ul>
<li><strong>Filter Propagation:</strong> Relationships allow <strong>filter context to flow</strong> from one table to another. For example, if you have a one-to-many relationship between a Product table (one side) and a Sales table (many side), placing a field like Product[Category] in a visual along with a measure summing Sales[Amount] will filter the Sales table to only those rows for the current category. The report’s filter context on Product filters down to Sales automatically via the relationship. This is why you typically create measures on fact tables (like Sales) and slice by dimensions (like Product or Date) – the relationships make the measure respond to dimension filters without additional coding.</li>
</ul>
<ul>
<li><strong>Directional context:</strong> In a one-to-many, the default filter propagation is from the one side to the many side (one -&gt; many). Many Power BI relationships are single-direction by default. If you have a bidirectional relationship (or set cross-filter to both), filter context can propagate both ways. Use bidirectional filtering carefully, as it can introduce ambiguity or performance issues. However, it’s useful for scenarios like many-to-many relationships via a bridge table.</li>
</ul>
<ul>
<li><p><strong>Using RELATED() in calculated columns:</strong> In a calculated column, you often have a row context on a fact table and want to fetch a value from a related lookup table. DAX won’t automatically pull that value just by referencing it (unlike a VLOOKUP in Excel, you must explicitly tell DAX to traverse the relationship). The function <code>RELATED(Table[column])</code> is used in a row context to get the corresponding value from the one-side (lookup) table for the current row’s relationship. For example, in a Sales table (many side) that has a relationship to Product (one side), if we want the product category name on each sales row, we could create:</p>
<pre><code class="lang-sql">  CategoryName = RELATED( Product[Category] )
</code></pre>
<p>  This will go to the Product table, find the row related to the current Sales row, and return the Category. <code>RELATED()</code> works when there is a single related row (which is true in a one-to-many where you’re on the many side and pulling from the one side) ( <a target="_blank" href="https://www.sqlbi.com/articles/using-related-and-relatedtable-in-dax/#:~:text=RELATED%20is%20one%20of%20the,functions%2C%20along%20with%20common%20misperceptions">Using RELATED and RELATEDTABLE in DAX - SQLBI</a>) ( <a target="_blank" href="https://www.sqlbi.com/articles/using-related-and-relatedtable-in-dax/#:~:text=When%20you%20have%20a%20row,following%20model%20as%20an%20example">Using RELATED and RELATEDTABLE in DAX - SQLBI</a>).</p>
</li>
</ul>
<ul>
<li><p><strong>RELATEDTABLE() for the inverse:</strong> If you are on the one side (say Product table) and want to bring in information from the many side (Sales), you can use <code>RELATEDTABLE(Sales)</code> which returns a table of all sales rows related to the current product ( <a target="_blank" href="https://www.sqlbi.com/articles/using-related-and-relatedtable-in-dax/#:~:text=RELATED%20is%20one%20of%20the,functions%2C%20along%20with%20common%20misperceptions">Using RELATED and RELATEDTABLE in DAX - SQLBI</a>). In a calculated column scenario, <code>RELATEDTABLE</code> can be used inside an aggregator. For instance, in the Product table you might define:</p>
<pre><code class="lang-sql">  TotalProductSales = SUMX( RELATEDTABLE(Sales), Sales[Amount] )
</code></pre>
<p>  This will iterate over all Sales rows related to the current product (via the relationship) and sum the Amount. Essentially, <code>RELATEDTABLE</code> gives you the table of related rows, and then you can aggregate or count them. This is less common in measures (measures have an easier way – they naturally sum by product if you slice by product), but it’s good for calculated columns that need facts.</p>
</li>
</ul>
<ul>
<li><p><strong>Active vs Inactive relationships:</strong> Power BI allows multiple relationships between the same two tables, but only one can be <em>active</em> at a time (the one that propagates filters by default). The others are <em>inactive</em> and do nothing unless explicitly invoked. For example, you might have an Order Date and a Ship Date both linking Sales to the Date table. One (say Order Date) is active. If you want to create a measure that uses the Ship Date, you use the function <code>USERELATIONSHIP</code> inside a measure to activate the inactive relationship for that calculation ( <a target="_blank" href="https://www.sqlbi.com/articles/using-userelationship-in-dax/#:~:text=If%20two%20tables%20are%20linked,like%20in%20the%20following%20picture">Using USERELATIONSHIP in DAX - SQLBI</a>) ( <a target="_blank" href="https://www.sqlbi.com/articles/using-userelationship-in-dax/#:~:text=Inactive%20relationships%20are%20%E2%80%93%20by,activate%20a%20relationship%20using%20USERELATIONSHIP">Using USERELATIONSHIP in DAX - SQLBI</a>). For example:</p>
<pre><code class="lang-sql">  Shipped Sales = 
    CALCULATE( [Total Sales],
       USERELATIONSHIP( Sales[ShipDate], 'Date'[Date] )
    )
</code></pre>
<p>  Here, within this measure, the relationship on Sales[ShipDate] to Date is used instead of the default Sales[OrderDate] -&gt; Date. <code>USERELATIONSHIP</code> is only valid inside <code>CALCULATE</code> (or functions that accept filter modifiers) ( <a target="_blank" href="https://www.sqlbi.com/articles/using-userelationship-in-dax/#:~:text=using%20USERELATIONSHIP">Using USERELATIONSHIP in DAX - SQLBI</a>). Once the CALCULATE is done, the active relationship reverts back. This technique is crucial for multi-date-table models (role-playing dimensions like Date).</p>
</li>
</ul>
<ul>
<li><strong>Cross-filtering direction:</strong> DAX also provides <code>CROSSFILTER(table1[col], table2[col], direction)</code> as a CALCULATE modifier to override the relationship’s filtering direction (useful for certain calculations where you temporarily need a different filter behavior). Another function, <code>TREATAS</code>, can apply the values of a table or column as filters on another table’s column – effectively creating a relationship on the fly for a specific measure.</li>
</ul>
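<p>A sketch of <code>TREATAS</code> (the <code>Budget</code> table and the <code>Region</code> columns are hypothetical, and no physical relationship between Budget and Sales is assumed):</p>
<pre><code class="lang-sql">-- Applies the distinct Region values currently visible in Budget
-- as if they were a filter on Sales[Region], creating a "virtual
-- relationship" just for this one calculation.
Sales for Budget Regions =
CALCULATE(
    [Total Sales],
    TREATAS( VALUES( Budget[Region] ), Sales[Region] )
)
</code></pre>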
<p><strong>Real-world perspective on relationships:</strong> If your model is well-designed with correct relationships, writing DAX measures becomes much easier, since you typically do not have to explicitly join tables. The filter context does the heavy lifting. For example, if you want “Sales for Contoso brand in 2021”, and your model has Sales -&gt; Product -&gt; Brand and Sales -&gt; Date relationships, simply filtering Brand = Contoso and Year = 2021 in the report or within a CALCULATE will yield the correct sales. No manual joining in the DAX formula is needed. Understanding this allows you to trust the model and focus on what you want to calculate (sum, average, etc.) rather than how to join tables.</p>
<blockquote>
<p><strong>Tip:</strong> Always ensure you have necessary relationships in place (and the correct active one for default behaviors). Use <code>USERELATIONSHIP</code> in measures to utilize any secondary relationships (like multiple dates), and consider marking your Date table as a “Date Table” in Power BI which helps time intelligence and makes sure relationships on date behave as expected. Avoid modeling pitfalls like many-to-many relationships unless necessary; if you do have them, be mindful that filter context might flow in less obvious ways.</p>
</blockquote>
<hr />
<h2 id="heading-row-context-vs-filter-context">Row Context vs Filter Context</h2>
<p>Understanding <strong>context</strong> is foundational to mastering DAX. There are two primary types of context in DAX: <strong>Row context</strong> and <strong>Filter context</strong>. The combination of these (at any point of execution) is sometimes called the <strong>evaluation context</strong>. Context determines <em>what data</em> a DAX expression is currently operating over.</p>
<ul>
<li><p><strong>Row Context:</strong> This is essentially “the current row” in a table that an expression is iterating over ( <a target="_blank" href="https://www.sqlbi.com/articles/row-context-and-filter-context-in-dax/#:~:text=When%20you%20use%20a%20column,row%E2%80%9D%20defines%20the%20Row%20Context">Row Context and Filter Context in DAX - SQLBI</a>). You have a row context whenever a formula is being evaluated for each row of a table. As discussed, calculated columns inherently have a row context (each row is calculated separately). Also, any <strong>iterator</strong> function (like <code>SUMX</code>, <code>FILTER</code>, <code>AVERAGEX</code>, etc.) creates a row context as it loops through a table ( <a target="_blank" href="https://www.sqlbi.com/articles/row-context-and-filter-context-in-dax/#:~:text=You%20have%20a%20row%20context,in%20a%20calculated%20column">Row Context and Filter Context in DAX - SQLBI</a>). Within a row context, you can directly reference columns of that table, and it will understand you mean “the value of this column in the current row.” For example, in a <code>SUMX(Sales, Sales[Quantity] * Sales[Price])</code>, the expression <code>Sales[Quantity] * Sales[Price]</code> is evaluated in a row context for each row of Sales, so it multiplies quantity and price of <em>that specific row</em>.</p>
<p>  If a row context is present, you <strong>must</strong> use functions like <code>SUMX</code>/<code>AVERAGEX</code> etc. to aggregate, or use <code>CALCULATE</code> to introduce a filter context (more on that soon) if you want to do operations across rows. Without a row context, a naked column reference is ambiguous and errors out (e.g., a measure cannot just say <code>Sales[Amount]</code> by itself) ( <a target="_blank" href="https://www.sqlbi.com/articles/row-context-and-filter-context-in-dax/#:~:text=If%20a%20row%20context%20is,this%20measure%20is%20not%20valid">Row Context and Filter Context in DAX - SQLBI</a>).</p>
<p>  Another key point: Row context by itself does not do filtering on other tables automatically. In a calculated column, if you need a value from a related table, you still need <code>RELATED()</code> as discussed. Row context <em>can</em> propagate through relationships when using certain functions (like <code>RELATEDTABLE</code> which uses the current row’s context to fetch related rows ( <a target="_blank" href="https://www.sqlbi.com/articles/row-context-and-filter-context-in-dax/#:~:text=In%20the%20innermost%20expression%2C%20you,definition%2C%20the%20current%20customer%20in">Row Context and Filter Context in DAX - SQLBI</a>)), but generally just having a row context on one table doesn’t filter other tables unless you explicitly tell DAX to do so.</p>
</li>
</ul>
<ul>
<li><p><strong>Filter Context:</strong> This is the set of filters applied to the data model before evaluating a DAX expression ( <a target="_blank" href="https://www.sqlbi.com/articles/row-context-and-filter-context-in-dax/#:~:text=The%20filter%20context%20is%20the,%E2%80%9Cfilter%20context%E2%80%9D%20the%20set%20of">Row Context and Filter Context in DAX - SQLBI</a>). Filter context comes from a few places: the user’s selections in the report (slicers, filters, the current row/column of a pivot or visual), and filters applied within DAX formulas (like filter arguments in CALCULATE, or filters from measures in filter visuals). Filter context can be thought of as “a subspace of the data” that’s currently active. For instance, if you put Year=2022 and Region=Europe as page filters in Power BI, those constitute a filter context that will affect all measures (only 2022, Europe data is visible to them). If you add a visual with Product Category on rows, each category value in a row has an additional filter (Category = that value) for that specific evaluation.</p>
<p>  In practical terms, when a measure is evaluated, first any relevant filters from the report are applied to the tables (this is like a WHERE clause limiting rows). Then the measure’s formula runs against that filtered data. If the measure is <code>[Total Sales]</code> and the filter context is Year=2022 and Region=Europe, it will sum only the sales that meet those criteria.</p>
<p>  Filter context can also be modified within DAX (the primary function to do this is <code>CALCULATE</code>, which we’ll cover next). The filter context is <em>additive</em> by default – multiple filters combine (they function like an AND of conditions). It’s possible to have no filter context (meaning all data is considered), such as a measure in a card with no filters applied, or you can explicitly clear filters using functions like <code>ALL</code>.</p>
</li>
</ul>
<p>In summary: <strong>Row context = current row (for iterative calculations)</strong>, <strong>Filter context = current filters (for aggregation)</strong>. A single DAX calculation can have both contexts at play: e.g., in a <code>SUMX</code>, you have a row context iterating the table, and also a filter context coming from outside (like slicers on other fields). The row context applies to that inner expression, while the outer filter context restricts which rows of the table are iterated.</p>
<p>( <a target="_blank" href="https://www.sqlbi.com/articles/row-context-and-filter-context-in-dax/#:~:text=The%20filter%20context%20is%20the,%E2%80%9Cfilter%20context%E2%80%9D%20the%20set%20of">Row Context and Filter Context in DAX - SQLBI</a>) defines filter context as “the set of filters applied to the evaluation of a DAX expression.” Any pivot table cell or visual corresponds to a filter context (even if it's "all data", that's essentially an empty filter context meaning no filters). And ( <a target="_blank" href="https://www.sqlbi.com/articles/row-context-and-filter-context-in-dax/#:~:text=You%20have%20a%20row%20context,in%20a%20calculated%20column">Row Context and Filter Context in DAX - SQLBI</a>) explains that a row context exists when iterating over each row in a calculated column or with an X iterator function.</p>
<p><strong>Example to illustrate contexts:</strong><br />Imagine a PivotTable showing <em>Total Sales</em> by <em>Year</em> and <em>Product Category</em>. When computing the measure for Year=2021 and Category=“Electronics”, the filter context is Year=2021 AND Category=Electronics (and any other global filters applied). The measure [Total Sales] = SUM(Sales[Amount]) will then sum only those sales that match those filters. There is no row-by-row loop in the measure; the Vertipaq engine (Power BI’s data engine) efficiently scans the pre-filtered data. Now, if we had a measure like <code>Average Sale per Order = AVERAGEX( Sales, Sales[Amount] )</code>, here AVERAGEX creates a row context iterating each Sale (each row in Sales table) within the <em>current filter context</em>. If Year=2021 and Category=Electronics filters are on, AVERAGEX will iterate only Sales rows for 2021 Electronics (filter context first), then within that, row context for each sale to pick <code>Sales[Amount]</code>, and then average those. So filter context and row context worked together in that calculation.</p>
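<p>Collected as code, the two measures from this example would read as follows (an <code>Amount</code> column on Sales is assumed):</p>
<pre><code class="lang-sql">Total Sales = SUM( Sales[Amount] )                         -- filter context only

Average Sale per Order = AVERAGEX( Sales, Sales[Amount] )  -- row context per Sales row,
                                                           -- inside the current filter context
</code></pre>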
<hr />
<h2 id="heading-context-transition-and-evaluation-context">Context Transition and Evaluation Context</h2>
<p><strong>Evaluation context</strong> is a term for the combined context that applies when an expression is evaluated – it includes any existing filter context and any row context(s) that might be in play. One of the more advanced concepts is <strong>context transition</strong>, which is how DAX converts a row context into a filter context (so that a measure or aggregation can be evaluated per row).</p>
<ul>
<li><p><strong>Context Transition:</strong> This typically occurs when you use the <code>CALCULATE</code> (or <code>CALCULATETABLE</code>) function. When <code>CALCULATE</code> is called from within a row context, it will <strong>take the values of the current row’s columns and treat them as filter context</strong> for the expression inside CALCULATE (<a target="_blank" href="https://endjin.com/blog/2022/01/evaluation-contexts-in-dax-context-transition#:~:text=TLDR%3B%20When%20used%20inside%20of,and%20relationships%20interact%20with%20it">Evaluation Contexts in DAX - Context Transition</a>). In effect, it transitions the existing row context into an equivalent set of filters. This is what allows measures to work when called in a row context. For instance, if you write a calculated column like <code>= CALCULATE( [Total Sales] )</code> in a Sales table, here [Total Sales] is a measure (which normally works on filter context). By wrapping it in CALCULATE, DAX will transition the row context (the current Sales row’s values for all relevant columns, such as Product, Date, etc.) into filters, then evaluate [Total Sales] under those filters – effectively giving you the sales of just that one row (which is actually just SalesAmount itself in this trivial example). This might seem pointless in a simple case, but it’s how a measure can be evaluated per row of another table or in an iterator. <strong>In fact, calling a measure within an iterator like SUMX automatically triggers context transition even without explicitly writing CALCULATE</strong> (<a target="_blank" href="https://learn.microsoft.com/en-us/dax/calculate-function-dax#:~:text=,context%2C%20context%20transition%20is%20automatic">CALCULATE function (DAX) - DAX | Microsoft Learn</a>) – DAX sees a measure needing filter context and converts the current row context to filters for that measure.</p>
<p>  Another example: Suppose you want a calculated column "Total Sales to Customer" in the Customers table. You can write: <code>Total Sales to Customer = CALCULATE( [Total Sales] )</code>. For each customer row, context transition turns that row’s values (including the CustomerID) into filters, and through the Customers -&gt; Sales relationship [Total Sales] sums only that customer’s sales. If the tables were not related, you could filter explicitly: <code>CALCULATE( [Total Sales], FILTER( Sales, Sales[CustomerID] = Customers[CustomerID] ) )</code> – inside FILTER, the outer row context of Customers lets you compare each Sales row’s CustomerID to the current customer’s (a plain Boolean filter argument cannot reference two tables like this). If instead you used a raw aggregation like <code>SUM(Sales[Amount])</code> without CALCULATE, it would return the grand total on every row, because no context transition occurs for a plain aggregation.</p>
<p>  In summary, <strong>context transition allows row context to be converted to filter context</strong>, usually via CALCULATE (<a target="_blank" href="https://endjin.com/blog/2022/01/evaluation-contexts-in-dax-context-transition#:~:text=TLDR%3B%20When%20used%20inside%20of,and%20relationships%20interact%20with%20it">Evaluation Contexts in DAX - Context Transition</a>). This is essential for many advanced calculations.</p>
</li>
</ul>
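<p>A minimal sketch of automatic context transition inside an iterator (the Customers table name is assumed for illustration):</p>
<pre><code class="lang-sql">Avg Sales per Customer =
AVERAGEX(
    Customers,
    [Total Sales]   -- measure reference: the row context on Customers is
                    -- transitioned into a filter context for each customer
)
</code></pre>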
<ul>
<li><strong>CALCULATE specifics:</strong> When you use <code>CALCULATE(Expression, filters...)</code> in general, it creates a new filter context for evaluating <em>Expression</em>: starting from the existing one, then applying the filters you give. If CALCULATE is invoked where there is a current row context (like inside an iterator or a calculated column), that row context becomes part of the new filter context (context transition) (<a target="_blank" href="https://learn.microsoft.com/en-us/dax/calculate-function-dax#:~:text=,context%2C%20context%20transition%20is%20automatic">CALCULATE function (DAX) - DAX | Microsoft Learn</a>). We will discuss CALCULATE more in the next section, but it’s important to highlight its role in context transition here. Also, a measure placed in a visual cell inherently has no row context, only filter context. Note that referencing any measure by name implicitly wraps its expression in CALCULATE, which is why measures trigger context transition automatically when used inside a row context.</li>
</ul>
<ul>
<li><strong>Evaluation context:</strong> At any point, DAX evaluates an expression with possibly <em>multiple row contexts and one filter context</em> in effect. For example, if you have nested iterators (like a SUMX inside an AVERAGEX), you could have two row contexts (one for the inner, one for the outer) plus the global filter context. The evaluation context is the union of all active contexts. If a column reference matches a current row context, it takes that; if it’s used where no row context exists, it tries the filter context. Understanding which context applies to each part of your formula is key to debugging DAX.</li>
</ul>
<p>One way to think about evaluation context: <strong>Filter context restricts <em>which rows</em> are visible to the calculation, while row context dictates <em>the current row from those rows</em> when needed.</strong> When you call an aggregation like SUM, it looks at all rows visible in the filter context (ignoring row context unless context transition happened). When you are in an iterator, the iterator moves row by row (that’s the row context), but inside it you can still call CALCULATE or measures to leverage filter context logic.</p>
<p>To reinforce context transition with a straightforward statement: <em>“When used inside a row context, CALCULATE performs context transition, transforming the current row context into a filter context.”</em> (<a target="_blank" href="https://endjin.com/blog/2022/01/evaluation-contexts-in-dax-context-transition#:~:text=TLDR%3B%20When%20used%20inside%20of,and%20relationships%20interact%20with%20it">Evaluation Contexts in DAX - Context Transition</a>) This comes up often in exams and DAX theory: if you are in a row context and need a measure or any calculation that requires a filter context, wrapping it in CALCULATE turns the current row’s values into filters so the calculation can happen.</p>
<blockquote>
<p><strong>Advanced note:</strong> Most DAX measures you write will implicitly handle context correctly without you needing to think about context transition. It mainly comes into play when writing calculated columns, or measures that iterate (using X functions or manual filters). A common pitfall for newcomers is copying a measure’s <em>expression</em> (say, <code>SUM(Sales[Amount])</code>) into a calculated column and expecting row-specific results – a raw aggregation repeats the grand total on every row, because no context transition occurs. Referencing the measure by name, or wrapping the expression in CALCULATE, triggers context transition and yields per-row results. If you ever see the same value repeated on every row of a calc column, it’s likely because context transition was not applied and the expression was evaluated in the overall filter context (probably empty) instead of per row.</p>
</blockquote>
<hr />
<h2 id="heading-calculate-and-filter-functions">CALCULATE and FILTER Functions</h2>
<p><strong>CALCULATE</strong> is the most powerful (and arguably most important) function in DAX. It allows you to modify the filter context under which an expression is evaluated. In other words, <code>CALCULATE</code> enables you to <strong>add or override filters</strong> for a calculation. This is how you perform tasks like “Sales for 2021 only” or “Number of customers in US, regardless of any country filter on the report” inside your DAX measures.</p>
<ul>
<li><strong>Syntax:</strong> <code>CALCULATE( &lt;Expression&gt;, &lt;Filter1&gt;, &lt;Filter2&gt;, ... )</code>.<br />  The <code>&lt;Expression&gt;</code> is typically a measure or an aggregation you want to compute. The filters can be given as Boolean filter expressions (like <code>Table[Column] = value</code>), table expressions (often using the <code>FILTER()</code> function or all/values functions), or special filter modifier functions (like <code>ALL(), REMOVEFILTERS(), USERELATIONSHIP()</code> etc.). Each filter argument modifies the context separately: they either add a new filter or override an existing one on the specified column/table.</li>
</ul>
<ul>
<li><strong>Adding vs overriding filters:</strong> By default, if you specify a filter on a column that isn’t already filtered in the current context, CALCULATE will add that filter. If that column is already being filtered (say by a slicer or visual), CALCULATE will override it with the one you specified (<a target="_blank" href="https://learn.microsoft.com/en-us/dax/calculate-function-dax#:~:text=,wrapped%20in%20the%20KEEPFILTERS%20function">CALCULATE function (DAX) - DAX | Microsoft Learn</a>). For example, if a report is filtering Region = "Europe", but your measure does <code>CALCULATE([Total Sales], Region[Country] = "USA")</code>, the result inside that CALCULATE will act as if Region = USA (the filter on Europe is replaced for that measure). This replacement behavior is often what you want for explicit overrides. If instead you want to <em>add</em> a filter without removing existing ones on the same column, there’s a function <code>KEEPFILTERS</code> we can use (more on that in advanced tips).</li>
</ul>
<ul>
<li><p><strong>Basic usage examples:</strong></p>
<ul>
<li><em>Explicit filter:</em> <code>Sales 2021 = CALCULATE( [Total Sales], 'Date'[CalendarYear] = 2021 )</code>. This measure will give the total sales for the year 2021, no matter what year is filtered in the report (it overrides year filter to 2021). You can list multiple filters: e.g. <code>CALCULATE( [Total Sales], 'Date'[CalendarYear]=2021, Product[Category]="Electronics" )</code> to apply both Year and Category filters at once (<a target="_blank" href="https://learn.microsoft.com/en-us/dax/calculate-function-dax#:~:text=,wrapped%20in%20the%20KEEPFILTERS%20function">CALCULATE function (DAX) - DAX | Microsoft Learn</a>).</li>
</ul>
</li>
</ul>
<ul>
<li><em>Removing filters:</em> <code>All Sales = CALCULATE( [Total Sales], REMOVEFILTERS( Product[Category] ) )</code>. This would calculate total sales ignoring any filter on Product Category (i.e., returning total across all categories even if a specific category is in context). Similarly, <code>ALL(Table)</code> can be used to ignore filters on a whole table or column. (ALL used inside CALCULATE is considered a <em>filter modifier</em>, telling CALCULATE to clear those filters rather than add a new one).</li>
</ul>
<ul>
<li><em>Using CALCULATE for context transition:</em> As mentioned earlier, if you use CALCULATE in a calculated column or within an X iterator, it will take the current row’s context and turn it into filters. For instance, a calculated column: <code>SalesAmount_Check = CALCULATE( [Total Sales] )</code> in the Sales table would for each row filter Sales to that row and compute [Total Sales], effectively just returning that row’s SalesAmount (this is a trivial example of context transition).</li>
</ul>
<ul>
<li><p><strong>FILTER function:</strong> <code>FILTER</code> is a DAX function that returns a table which is a subset of another table, filtering rows by a condition. Its syntax: <code>FILTER( &lt;Table&gt;, &lt;Condition&gt; )</code>. The condition is evaluated for each row of the table, and any row that returns TRUE is kept. FILTER is an <strong>iterator</strong> (it creates a row context over the table you provide and evaluates the condition for each row) (<a target="_blank" href="https://community.fabric.microsoft.com/t5/Desktop/Explanation-of-Context-in-Calculating-Cumulative-Values/m-p/257701#:~:text=Values%20community,is%20an%20iterator">Explanation of Context in Calculating Cumulative Values</a>). This means if the condition itself involves aggregates or references to other tables, it will still evaluate row by row (often you might see conditions like <code>FILTER( Sales, Sales[Amount] &gt; 1000 &amp;&amp; Sales[Amount] &lt; [Some Threshold] )</code> etc., where it filters on Sales rows meeting certain criteria).</p>
<p>  By itself, <code>FILTER</code> is most often used as an argument to other functions that take a table. Common use cases:</p>
<ul>
<li>Inside CALCULATE: e.g. <code>CALCULATE( [Total Sales], FILTER( Sales, Sales[Amount] &gt; 1000 ) )</code>. A simple predicate like <code>Sales[Amount] &gt; 1000</code> can also be passed directly to CALCULATE (Boolean filter arguments may use comparisons, as long as they reference a single column); the explicit FILTER form becomes necessary once the condition references multiple columns or a measure. Either way, the Sales table is filtered to rows where amount &gt; 1000, and CALCULATE evaluates [Total Sales] (which sums SalesAmount) under that modified context (effectively “Sales of transactions &gt; 1000”).</li>
</ul>
</li>
</ul>
<ul>
<li>In iterator combinations: e.g. <code>COUNTROWS( FILTER( Customers, Customers[HasEmail] = TRUE ) )</code> which would count how many customers have an email. (This could also be done with a CALCULATE: <code>CALCULATE( COUNTROWS(Customers), Customers[HasEmail] = TRUE )</code> – in fact, CALCULATE with simple conditions is often simpler.)</li>
</ul>
<p>    Remember that <code>FILTER(Table, condition)</code> returns the filtered table, but doesn’t on its own do anything to existing filter context unless wrapped in CALCULATE or used in an iterator. So you often see it paired with functions like SUMX, COUNTX, CALCULATE, etc.</p>
<ul>
<li><strong>Efficiency note:</strong> If a simple filter (like <code>Column = Value</code>) suffices in CALCULATE, use that rather than FILTER, because CALCULATE can set simple filters efficiently. <code>FILTER</code> should be used for more complex logic (such as multiple conditions, ranges, or involving measures/aggregates). For example, to filter on a measure or an aggregate, you must use FILTER (e.g., “filter products to those with sales &gt; X” – you can’t put <code>[Total Sales] &gt; X</code> directly as a CALCULATE filter without FILTER). But if you just need <code>Product[Color] = "Red"</code>, you can pass that directly to CALCULATE. Under the hood, direct CALCULATE filters and FILTER(...) achieve the same outcome (modifying filter context), but direct filters are simpler and often faster.</li>
</ul>
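<p>A side-by-side sketch of the two styles (column and threshold values assumed for illustration):</p>
<pre><code class="lang-sql">-- Simple single-column predicate: pass it to CALCULATE directly
Red Sales = CALCULATE( [Total Sales], Product[Color] = "Red" )

-- Condition involving a measure: FILTER is required
High Performer Sales =
CALCULATE( [Total Sales],
    FILTER( VALUES( Product[ProductKey] ), [Total Sales] &gt; 10000 )
)
</code></pre>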
<p><strong>Example – Using CALCULATE and FILTER:</strong><br />Let’s say we want a measure for <strong>High Value Sales</strong>: total sales where the transaction amount was over 1000. We have a measure [Total Sales]. We can do:</p>
<pre><code class="lang-sql">High Value Sales = 
CALCULATE( [Total Sales],
    FILTER( Sales, Sales[SalesAmount] &gt; 1000 )
)
</code></pre>
<p>Here, <code>FILTER(Sales, Sales[SalesAmount] &gt; 1000)</code> produces a table of only the sales transactions over 1000. CALCULATE then applies that as the filter context for [Total Sales]. The result is the sum of SalesAmount for those high-value transactions. We could also achieve this with an iterator: <code>High Value Sales = SUMX( FILTER(Sales, Sales[SalesAmount] &gt; 1000), Sales[SalesAmount] )</code> – which in effect does the same thing (iterates and sums). Using CALCULATE with an existing [Total Sales] measure is a bit cleaner.</p>
<p>Another example: <strong>Percentage of total</strong> often uses CALCULATE. Imagine <code>[Total Sales]</code> as current context sales, and you want a measure for “Percentage of All Sales that this context represents”. You could write:</p>
<pre><code class="lang-sql">% of All Sales = 
DIVIDE( [Total Sales],
        CALCULATE( [Total Sales], REMOVEFILTERS() )
      )
</code></pre>
<p>The <code>CALCULATE( [Total Sales], REMOVEFILTERS() )</code> part gives the total sales with all filters removed (i.e., the denominator is total sales for all data). This uses the filter modifier <code>REMOVEFILTERS()</code> to clear filters (<a target="_blank" href="https://learn.microsoft.com/en-us/dax/calculate-function-dax#:~:text=,wrapped%20in%20the%20KEEPFILTERS%20function">CALCULATE function (DAX) - DAX | Microsoft Learn</a>). The DIVIDE then gives the fraction. This pattern is common in DAX for percent of total, percent of parent, etc.</p>
<blockquote>
<p><strong>Think of CALCULATE as: "Evaluate this expression, but under these filter conditions (possibly in place of whatever external filters there were)."</strong> It’s how you explicitly control context inside a measure. Meanwhile, think of FILTER (function) as: "From this table, give me only the rows that meet this test." They often work together but serve different purposes.</p>
</blockquote>
<p><strong>Advanced filter modifiers:</strong> Within CALCULATE, you can use special functions:</p>
<ul>
<li><code>ALL(Table/Column)</code>: ignores filters on that table/column (<a target="_blank" href="https://learn.microsoft.com/en-us/dax/calculate-function-dax#:~:text=REMOVEFILTERS%20%20Remove%20all%20filters%2C,single%2C%20or%20from%20single%20to">CALCULATE function (DAX) - DAX | Microsoft Learn</a>). (Use <code>ALL</code> or <code>REMOVEFILTERS</code> to get overall totals or to override specific filters.)</li>
</ul>
<ul>
<li><code>ALLEXCEPT(Table, Column1, Column2, ...)</code>: clears all filters on a table except the ones explicitly listed.</li>
</ul>
<ul>
<li><code>KEEPFILTERS(FilterExpression)</code>: changes how filter application works by intersecting with existing filters instead of overwriting (<a target="_blank" href="https://maqsoftware.com/insights/dax-best-practices.html#:~:text=7,T">DAX Best Practices | MAQ Software Insights</a>). For example, <code>CALCULATE( [Measure], KEEPFILTERS( Table[Column] = "Value" ) )</code> will keep any existing filters on Table[Column] and require that "Value" also be true (i.e., an intersection), whereas normally CALCULATE would replace any filter on that column with "Value". Use case: maybe a measure that refines an existing report filter rather than replacing it.</li>
</ul>
<ul>
<li><code>USERELATIONSHIP(col1, col2)</code>: as mentioned, activates an inactive relationship for the calculation ( <a target="_blank" href="https://www.sqlbi.com/articles/using-userelationship-in-dax/#:~:text=using%20USERELATIONSHIP">Using USERELATIONSHIP in DAX - SQLBI</a>).</li>
</ul>
<ul>
<li><code>CROSSFILTER(col1, col2, direction)</code>: temporarily change relationship filter direction (or disable it) for the calc.</li>
</ul>
<ul>
<li>There are others like <code>ALLSELECTED</code>, <code>ALLNOBLANKROW</code>, but those are more specialized.</li>
</ul>
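<p>For instance, the difference between overriding and intersecting a filter can be sketched as (table names assumed):</p>
<pre><code class="lang-sql">-- Replaces any existing filter on Region[Country] with "USA"
USA Sales = CALCULATE( [Total Sales], Region[Country] = "USA" )

-- Keeps existing Country filters and intersects them with "USA":
-- if the report filters Country to "Canada", this returns blank
USA Sales (Kept) =
CALCULATE( [Total Sales], KEEPFILTERS( Region[Country] = "USA" ) )
</code></pre>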
<p>(<a target="_blank" href="https://learn.microsoft.com/en-us/dax/calculate-function-dax#:~:text=,wrapped%20in%20the%20KEEPFILTERS%20function">CALCULATE function (DAX) - DAX | Microsoft Learn</a>) (Microsoft docs) underscores how CALCULATE modifies filter context by adding or overriding filters provided in its arguments. And (<a target="_blank" href="https://community.fabric.microsoft.com/t5/Desktop/Explanation-of-Context-in-Calculating-Cumulative-Values/m-p/257701#:~:text=Values%20community,is%20an%20iterator">Explanation of Context in Calculating Cumulative Values</a>) reminds us that FILTER returns a table of rows that meet a condition (and acts as an iterator).</p>
<blockquote>
<p><strong>Important:</strong> CALCULATE’s Boolean filter arguments have restrictions: they <strong>cannot reference a measure, contain a nested CALCULATE, or use functions that scan or return a table</strong>. To filter by a measure, go through FILTER (e.g., wrap a condition like <code>[Total Sales] &gt; 0</code> inside a <code>FILTER</code> over the relevant column). Also, historically CALCULATE was not allowed in row-level security formulas on DirectQuery models. In general use, though, you’ll be writing CALCULATE a lot in measures and calculated columns.</p>
</blockquote>
<hr />
<h2 id="heading-aggregation-and-iterator-functions">Aggregation and Iterator Functions</h2>
<p>DAX has a variety of aggregation functions (like SUM, MIN, MAX, COUNT, AVERAGE) and their <strong>iterator</strong> counterparts (SUMX, MINX, MAXX, COUNTX, AVERAGEX, etc.). Understanding the difference between these is key to writing correct calculations:</p>
<ul>
<li><strong>Simple aggregators (non-X):</strong> Functions such as <code>SUM(column)</code>, <code>AVERAGE(column)</code>, <code>MIN(column)</code>, <code>MAX(column)</code>, <code>COUNT(column)</code> operate over a <em>column</em> in the current filter context. They take a column reference and compute the aggregate of all values currently visible in that column. For example, <code>SUM(Sales[Amount])</code> adds all Sales Amount values in whatever context (filters) is active. These functions do not have an inherent row context; they implicitly consider all rows allowed by the filter context. They are highly optimized to work on the column data.</li>
</ul>
<ul>
<li><p><strong>Iterator functions (X functions):</strong> These include <code>SUMX(table, expression)</code>, <code>AVERAGEX(table, expression)</code>, <code>MINX</code>, <code>MAXX</code>, <code>COUNTX</code>, <code>RANKX</code>, etc. These functions iterate over a specified <em>table</em>, evaluating the given expression for each row, then aggregate the results. During their execution, a <strong>row context</strong> is created for the table they iterate ( <a target="_blank" href="https://www.sqlbi.com/articles/row-context-and-filter-context-in-dax/#:~:text=You%20have%20a%20row%20context,in%20a%20calculated%20column">Row Context and Filter Context in DAX - SQLBI</a>). For instance, <code>SUMX(Sales, Sales[Quantity] * Sales[Price])</code> will go row by row in the Sales table, compute <code>Quantity * Price</code> for each row, and then sum up those computed values. If you tried to achieve the same with simple SUM, you couldn’t directly, because you’d need to multiply per row first (which SUM can’t do by itself).</p>
<p>  Another example: <code>AVERAGEX( Dates, [Daily Sales] )</code> (imagine [Daily Sales] is a measure that gives sales on a single date context). This would iterate each date in the Dates table (perhaps filtered to a year), retrieve [Daily Sales] for that date (context transition happens for the measure inside, effectively), and then average those values.</p>
</li>
</ul>
<ul>
<li><p><strong>Why use X iterators?</strong> They let you perform calculations per row that are more complex than just the raw column value, and then aggregate. Common scenarios:</p>
<ul>
<li>Calculated weighted averages or ratios per row then summed.</li>
</ul>
</li>
</ul>
<ul>
<li>Concatenate values (there’s even <code>CONCATENATEX</code> which joins strings from rows).</li>
</ul>
<ul>
<li>Applying a filter with a condition within an aggregation (though often there are alternatives, like CALCULATE with filters).</li>
</ul>
<ul>
<li>Ranking or sorting by an expression (RANKX lets you rank rows of a table by an expression, e.g., rank products by sales within a category).</li>
</ul>
<ul>
<li><strong>Performance considerations:</strong> If a simple aggregator can do the job, use it instead of an X iterator because it’s usually faster (it leverages the optimized column storage). For example, <code>SUM(Sales[Amount])</code> is preferred to <code>SUMX(Sales, Sales[Amount])</code> – they return the same result, but the latter introduces unnecessary row-by-row iteration. In fact, <code>SUM(column)</code> is essentially a shorthand for <code>SUMX(table, column)</code> where the engine can optimize it directly (<a target="_blank" href="https://www.sqlbi.com/tv/sum-sumx-dax-guide/#:~:text=SUM%2C%20SUMX%20%E2%80%93%20DAX%20Guide,each%20row%20in%20a%20table">SUM, SUMX – DAX Guide - SQLBI</a>). As a rule: use the X version only when you need to do something per row that a simple function can’t handle.</li>
</ul>
<ul>
<li><p><strong>Examples:</strong></p>
<ul>
<li>Basic sum vs. sumx:<br />  <strong>SUM</strong> – <code>Total Sales = SUM(Sales[Amount])</code> adds the Amount column directly (fast).<br />  <strong>SUMX</strong> – <code>Total Sales (SUMX) = SUMX( Sales, Sales[Amount] )</code> would theoretically do the same thing but less efficiently.<br />  A more meaningful SUMX example: <code>Total Revenue = SUMX( Sales, Sales[Quantity] * Sales[UnitPrice] )</code>. Here we multiply quantity and unit price per row (to get revenue per sale), then sum it up. There is no single column that directly has revenue, so SUMX is needed to evaluate the expression for each row.</li>
</ul>
</li>
</ul>
<ul>
<li><strong>AVERAGE vs AVERAGEX:</strong> If you want the average of a value calculated per row, use AVERAGEX. For instance, “average gross margin per product” might be <code>AVERAGEX( Products, Products[Margin] )</code> if [Margin] itself is a calculated column or expression. But if you want the overall gross margin (total profit / total sales), compute it as a ratio of totals in a measure – that is generally different from averaging each product’s margin. Knowing when to average per-row results versus taking a ratio of aggregates is important: AVERAGEX evaluates row by row and then averages, whereas AVERAGE simply averages a column of already-stored values.</li>
</ul>
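<p>To make the distinction concrete, here is a sketch contrasting the two approaches (the <code>[Total Profit]</code> and <code>[Total Sales]</code> measures are assumed to exist):</p>
<pre><code class="lang-sql">-- Averages each product's margin, weighting every product equally:
Avg Product Margin % =
AVERAGEX( VALUES( Products[ProductName] ),
    DIVIDE( [Total Profit], [Total Sales] ) )

-- Overall margin as a ratio of totals (usually what "gross margin" means):
Overall Margin % = DIVIDE( [Total Profit], [Total Sales] )
</code></pre>
<p>The two generally return different numbers unless every product has the same sales volume.</p>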
<ul>
<li><strong>COUNTX:</strong> You might not use COUNTX often; usually COUNT or COUNTROWS suffice. <code>COUNTX(table, expression)</code> will count the number of rows where the expression is not blank. A typical use might be something like counting non-empty results of an expression.</li>
</ul>
<ul>
<li><strong>RANKX:</strong> This is an iterator used for ranking. E.g., <code>Rank Sales by Product = RANKX( ALL(Product), [Total Sales], , DESC )</code> will rank each product’s total sales against all products. RANKX iterates over a table (here ALL(Product) which is the table of all products ignoring filters) and compares the [Total Sales] for each product to determine rank.</li>
</ul>
<p>(<a target="_blank" href="https://www.sqlbi.com/tv/sum-sumx-dax-guide/#:~:text=SUM%2C%20SUMX%20%E2%80%93%20DAX%20Guide,each%20row%20in%20a%20table">SUM, SUMX – DAX Guide - SQLBI</a>) succinctly says: <em>SUM adds all numbers in a column; SUMX returns the sum of an expression evaluated for each row of a table.</em> That applies to other aggregates too (e.g. AVERAGE vs AVERAGEX, etc.).</p>
<ul>
<li><strong>Iterators and filter context:</strong> An iterator will respect the outside filter context on the table it’s iterating. For instance, if you do <code>SUMX(Sales, ...)</code> and the report or outer CALCULATE has Year=2020 filter active, then that Sales table passed into SUMX is already filtered to 2020 sales. The iterator then goes through just those. You can also embed CALCULATE inside an iterator’s expression if needed to override context per row (advanced use case).</li>
</ul>
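<p>As a sketch of that advanced case (table and column names are illustrative): CALCULATE inside an iterator triggers context transition, turning each iterated row into a filter, so the inner aggregation is re-evaluated per row.</p>
<pre><code class="lang-sql">-- Highest single-product sales total within the current filters:
Largest Product Sales =
MAXX( VALUES( Product[ProductName] ),
    CALCULATE( SUM( Sales[Amount] ) )  -- re-evaluated for each product
)
</code></pre>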
<ul>
<li><strong>Avoiding common mistakes:</strong> A frequent error is writing a measure that tries to sum an expression without SUMX. For example, writing <code>Total Revenue = Sales[Quantity] * Sales[UnitPrice]</code> as a measure will error out – because there’s no row context in a measure by default. The correct approach is <code>SUMX(Sales, Sales[Quantity] * Sales[UnitPrice])</code>. Conversely, sometimes newcomers use SUMX when a SUM would do; e.g., <code>SUMX(Sales, Sales[Amount])</code> – this works but is unnecessarily verbose. Overuse of iterators can hurt performance if the table is large, because it forces a row-by-row evaluation in the formula engine rather than a swift columnar aggregation.</li>
</ul>
<p><strong>Table functions:</strong> In addition to X iterators, DAX has other functions that return tables (e.g., <code>FILTER</code>, <code>ALL</code>, <code>VALUES</code>, <code>SUMMARIZE</code>, <code>ADDCOLUMNS</code>, etc.). These aren't aggregators themselves but are used to shape data for either iterators or for producing calculated tables. For instance, <code>VALUES(Column)</code> returns a one-column table of distinct values (and in a measure context is often used to get a single value of a filter if present). <code>ADDCOLUMNS(Table, "NewCol", Expression)</code> can add a calculated column on the fly to a table (useful within measures to create a temporary table with extra data).</p>
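<p>A sketch of using these table functions inside a measure (table and column names are illustrative): ADDCOLUMNS builds a temporary per-category table, which an iterator then aggregates.</p>
<pre><code class="lang-sql">Best Category Sales =
VAR CategorySales =
    ADDCOLUMNS(
        VALUES( Product[Category] ),            -- one row per category in context
        "@Sales", CALCULATE( SUM( Sales[Amount] ) )
    )
RETURN
    MAXX( CategorySales, [@Sales] )             -- best-selling category's total
</code></pre>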
<p>For most everyday use, focus on choosing between non-X and X correctly:</p>
<ul>
<li>If you need a calculation per row -&gt; use an X iterator.</li>
</ul>
<ul>
<li>If you just need to aggregate a single existing column -&gt; use the simple aggregator.</li>
</ul>
<blockquote>
<p><strong>Quick Reference – Common Aggregators vs Iterators:</strong></p>
<ul>
<li><strong>SUM</strong> vs <strong>SUMX</strong>: <code>SUM(Column)</code> adds up column values; <code>SUMX(Table, Expr)</code> evaluates Expr for each row and sums. (<a target="_blank" href="https://www.sqlbi.com/tv/sum-sumx-dax-guide/#:~:text=SUM%2C%20SUMX%20%E2%80%93%20DAX%20Guide,each%20row%20in%20a%20table">SUM, SUMX – DAX Guide - SQLBI</a>)</li>
</ul>
<ul>
<li><strong>AVERAGE</strong> vs <strong>AVERAGEX</strong>: <code>AVERAGE(Column)</code> vs <code>AVERAGEX(Table, Expr)</code> – similar pattern. Use AVERAGEX if the thing you want to average isn’t a base column.</li>
</ul>
<ul>
<li><strong>COUNT/COUNTROWS</strong> vs <strong>COUNTX</strong>: <code>COUNTROWS(Table)</code> counts rows (optionally filtered via CALCULATE). <code>COUNTX(Table, Expr)</code> counts non-blank results of Expr per row. Often COUNTROWS combined with FILTER is sufficient for conditional counts (e.g., <code>CALCULATE(COUNTROWS(Sales), Sales[Amount] &gt; 1000)</code> to count transactions &gt;1000).</li>
</ul>
<ul>
<li><strong>MIN/MAX</strong> vs <strong>MINX/MAXX</strong>: min or max of a column vs of an expression per row. For example, <code>MAXX( Dates, [Daily Sales] )</code> might give the maximum daily sales value in the current context.</li>
</ul>
<ul>
<li><strong>DISTINCTCOUNT</strong> (counts distinct values in a column) doesn’t have an X version because the operation is inherently on a column set, but if you needed distinct count of an expression, you might do something like <code>COUNTROWS( DISTINCT( GENERATE( table, ... ) ) )</code> which is more advanced. Typically stick to provided functions.</li>
</ul>
</blockquote>
<hr />
<h2 id="heading-time-intelligence-functions">Time Intelligence Functions</h2>
<p>One of DAX’s strengths is built-in <strong>time intelligence</strong> – functions that make calculating time-based metrics easier. Time intelligence functions allow you to manipulate <strong>dates</strong> to get calculations like year-to-date, quarter-to-date, same period last year, year-over-year growth, moving averages, etc., without manually writing complex filter logic every time.</p>
<p><strong>Prerequisite: Date Table.</strong> In order to use most time intelligence functions properly, you should have a dedicated <strong>Date table</strong> in your model (a table with one row per date, covering a contiguous range with no gaps over the period of interest). Mark this table as the "Date Table" in Power BI. The Date table should have a relationship to your fact table (e.g., Sales) on the date field. The functions assume this setup. <em>If you don’t have a proper date table, functions like TOTALYTD or SAMEPERIODLASTYEAR might not work correctly.</em> (<a target="_blank" href="https://learn.microsoft.com/en-us/dax/time-intelligence-functions-dax#:~:text=Data%20Analysis%20Expressions%20,date%20column%20as%20Date%20Table">Time intelligence functions (DAX) - DAX | Microsoft Learn</a>) emphasizes marking a date table before using these functions.</p>
<ul>
<li><p><strong>Year-to-Date (YTD), Quarter-to-Date, Month-to-Date:</strong><br />  DAX provides functions like <code>TOTALYTD( &lt;Measure&gt;, &lt;DatesColumn&gt; [, &lt;FiscalYearEndDate&gt;] )</code>, <code>TOTALQTD</code>, <code>TOTALMTD</code> to accumulate a measure from the start of the year/quarter/month up to the current context date. For example:</p>
<pre><code class="lang-sql">  Sales YTD = TOTALYTD( [Total Sales], 'Date'[Date] )
</code></pre>
<p>  If you put this measure in a visual with Month, it will show the running total of sales from Jan 1 to that month’s end for each month (assuming 'Date'[Date] is a continuous date column in your Date table). There are equivalent TOTALQTD, TOTALMTD for quarter and month. These are convenience wrappers around a combination of CALCULATE and filter logic on the date. They automatically detect year boundaries (or you can specify a fiscal year end).</p>
</li>
</ul>
<ul>
<li><p><strong>Same Period Last Year / Previous Periods:</strong></p>
<ul>
<li><p><code>SAMEPERIODLASTYEAR(&lt;dates&gt;)</code>: returns the set of dates exactly one year before the dates in the current filter context. Often used like:</p>
<pre><code class="lang-sql">  Sales LY = CALCULATE( [Total Sales], SAMEPERIODLASTYEAR( 'Date'[Date] ) )
</code></pre>
<p>  This gives the sales for the same period last year corresponding to the current filter context. For example, if the current context is March 2025, it filters to the dates of March 2024 and returns that month’s sales.</p>
</li>
</ul>
</li>
</ul>
<ul>
<li>There are also <code>PREVIOUSYEAR</code>, <code>PREVIOUSQUARTER</code>, <code>PREVIOUSMONTH</code> which give the entire previous period (regardless of how it aligns with the current one). For instance, if you’re filtering a specific month, PREVIOUSMONTH would give the entire immediately preceding month.</li>
</ul>
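<p>For example, a sketch using the usual 'Date' table and a [Total Sales] measure:</p>
<pre><code class="lang-sql">-- In a March 2025 context, this returns sales for all of February 2025:
Sales PM = CALCULATE( [Total Sales], PREVIOUSMONTH( 'Date'[Date] ) )
</code></pre>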
<ul>
<li><code>DATEADD(&lt;dates&gt;, -1, YEAR)</code>: an alternative to SAMEPERIODLASTYEAR; DATEADD is more flexible since you can shift by any number of intervals (e.g., -1 year, -3 months, etc.). It returns a table of dates shifted by the interval. E.g., <code>CALCULATE( [Total Sales], DATEADD('Date'[Date], -1, YEAR) )</code> is effectively the same as above.</li>
</ul>
<ul>
<li><p><strong>Period-to-date vs period-over-period:</strong> YTD/QTD/MTD are cumulative from period start. For year-over-year or quarter-over-quarter comparisons, you typically use SAMEPERIODLASTYEAR or DATEADD to get a comparable period in the past, then maybe compute differences or percentages. Example:</p>
<pre><code class="lang-sql">  Sales YoY % =
    DIVIDE( [Total Sales] - CALCULATE( [Total Sales], SAMEPERIODLASTYEAR('Date'[Date]) ),
            CALCULATE( [Total Sales], SAMEPERIODLASTYEAR('Date'[Date]) )
          )
</code></pre>
<p>  This measure calculates year-over-year growth percentage. It takes current sales minus last year’s sales, divided by last year’s sales.</p>
</li>
</ul>
<ul>
<li><strong>First/Last/Opening/Closing balances:</strong> There are functions like <code>FIRSTDATE</code>, <code>LASTDATE</code>, which give the first or last date in the current filter context. And specialized ones like <code>CLOSINGBALANCEMONTH(&lt;expr&gt;, &lt;dates&gt;)</code>, <code>OPENINGBALANCEYEAR(&lt;expr&gt;, &lt;dates&gt;)</code> which evaluate an expression at the end or start of a period. These are used for things like inventory or account balances where you need the value at a period boundary.</li>
</ul>
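<p>A sketch for a month-end balance (the Inventory table and column are hypothetical):</p>
<pre><code class="lang-sql">-- Evaluates the expression on the last date of the month in context:
Month-End Stock =
CLOSINGBALANCEMONTH( SUM( Inventory[UnitsOnHand] ), 'Date'[Date] )
</code></pre>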
<ul>
<li><strong>Working with months, quarters, etc.:</strong> DAX’s time functions often expect a contiguous date range. They often internally figure out things like “all dates in the same year up to this date” etc. For example, <code>TOTALYTD</code> essentially does <code>CALCULATE( &lt;measure&gt;, DATESYTD(&lt;dates&gt;) )</code> where <code>DATESYTD</code> returns all dates from the start of the year to the max date in the current context. Similarly <code>SAMEPERIODLASTYEAR</code> is essentially shifting that set by -1 year. If your Date table has gaps or multiple entries per day, results may be off.</li>
</ul>
<p>(<a target="_blank" href="https://learn.microsoft.com/en-us/dax/time-intelligence-functions-dax#:~:text=Data%20Analysis%20Expressions%20,date%20column%20as%20Date%20Table">Time intelligence functions (DAX) - DAX | Microsoft Learn</a>) (Microsoft documentation) states: <em>DAX includes time-intelligence functions that enable you to manipulate data using time periods (days, months, quarters, years) and build and compare calculations over those periods.</em> It also reminds to mark a table as Date Table.</p>
<p><strong>Example – Year-over-Year Sales:</strong><br />Imagine you want a column chart showing sales this year vs last year by month. You’d have [Total Sales] and [Sales LY] as measures. We defined <code>Sales LY</code> above. You place Month on X-axis, and both measures as values. Thanks to the DAX time functions, for each month context, [Sales LY] automatically grabs the equivalent month last year’s data. This simplifies writing such comparisons.</p>
<p><strong>Example – Year-to-Date Total:</strong><br />If you show a cumulative line chart of sales across months, [Sales YTD] will give a running total. The measure we gave <code>Sales YTD = TOTALYTD([Total Sales], 'Date'[Date])</code> will accumulate from the start of each year. The function knows to reset at year boundaries (if using calendar year or the fiscal year you provide). Under the hood, <code>TOTALYTD</code> is doing something akin to:</p>
<pre><code class="lang-sql">Sales YTD =
CALCULATE( [Total Sales],
    DATESYTD( 'Date'[Date] )
)
</code></pre>
<p>where <code>DATESYTD</code> returns all dates from Jan 1 of the current year up to the current date context.</p>
<ul>
<li><p><strong>Rolling averages or moving sums:</strong> There isn’t a single built-in for moving average (like 30-day rolling), but you can combine functions. For example, to get last 30 days sales:</p>
<pre><code class="lang-sql">  Last30DaysSales =
  CALCULATE( [Total Sales],
      DATESINPERIOD( 'Date'[Date], MAX('Date'[Date]), -30, DAY )
  )
</code></pre>
<p>  Here <code>DATESINPERIOD</code> takes the max date in the current context (for example, March 31, 2025 if that is the last date shown in the visual) and goes back 30 days to produce that set of dates, and CALCULATE sums over that set. From there you can build a moving average by dividing by the number of days in the window.</p>
</li>
</ul>
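<p>Building on that pattern, a 30-day moving average can be sketched by dividing the rolling sum by the number of dates actually in the window (rather than a hard-coded 30, which would be wrong at the edges of the date range):</p>
<pre><code class="lang-sql">Sales 30-Day Avg =
VAR Last30 =
    DATESINPERIOD( 'Date'[Date], MAX( 'Date'[Date] ), -30, DAY )
RETURN
    DIVIDE(
        CALCULATE( [Total Sales], Last30 ),
        CALCULATE( COUNTROWS( 'Date' ), Last30 )  -- days in the window
    )
</code></pre>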
<p>Time intelligence functions rely on having that full range of dates. They can handle fiscal years (with year_end parameter in TOTALYTD, etc.), and can work with quarters, months, weeks (though week-based calculations might require more custom logic or a week number in the date table).</p>
<blockquote>
<p><strong>Tip:</strong> Always use a dedicated Date table and relate it to fact tables. Use the built-in functions like TOTALYTD, SAMEPERIODLASTYEAR, etc., to save time – they implement common patterns. If you need something custom (like “sales in the same month of the previous year but align by fiscal week” or “trailing 12 months”), you might have to use combinations of <code>DATEADD</code> or <code>DATESINPERIOD</code>. The DAX Patterns website and Microsoft docs have recipes for many of these scenarios (<a target="_blank" href="https://www.sqlbi.com/articles/week-based-time-intelligence-in-dax/#:~:text=Week,year%20%28YOY%29%20and%20so%20on">Week-Based Time Intelligence in DAX - SQLBI</a>). Start with simple ones and be sure to test the results to ensure they match expected values.</p>
</blockquote>
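<p>As one example of such a custom pattern, a trailing-12-months total can be sketched with <code>DATESINPERIOD</code>:</p>
<pre><code class="lang-sql">-- Sum of sales over the 12 months ending at the latest date in context:
Sales T12M =
CALCULATE( [Total Sales],
    DATESINPERIOD( 'Date'[Date], MAX( 'Date'[Date] ), -12, MONTH )
)
</code></pre>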
<hr />
<h2 id="heading-variables-and-debugging-techniques">Variables and Debugging Techniques</h2>
<p>DAX introduced the <strong>VAR</strong> and <strong>RETURN</strong> syntax to allow variables within expressions. Using <strong>variables</strong> can greatly improve the clarity, performance, and debuggability of your DAX formulas (<a target="_blank" href="https://www.sqlbi.com/articles/variables-in-dax/#:~:text=Using%20variables%20in%20DAX%20makes,when%20to%20use%20variables">Variables in DAX - SQLBI</a>). A variable lets you compute a sub-expression once, name it, and reuse it multiple times within the same measure (or calculated column). Variables do not persist beyond the single evaluation of that formula; they are not like Excel named ranges that stick around globally – they’re local to the measure or query.</p>
<ul>
<li><p><strong>Syntax:</strong></p>
<pre><code class="lang-sql">  VAR &lt;varName&gt; = &lt;expression&gt;  
  VAR &lt;varName2&gt; = &lt;expression2&gt;  
  ...  
  RETURN &lt;final_expression_using_vars&gt;
</code></pre>
<p>  You can define one or multiple VARs, and then after the RETURN keyword, put the expression that produces the final result (usually involving those variables). In the final expression, you refer to the variable by name (without any special quoting).</p>
</li>
</ul>
<ul>
<li><p><strong>Benefits of variables:</strong></p>
<ol>
<li><strong>Clarity:</strong> You can break a complex calculation into understandable parts. For example, instead of a nested monstrosity, you can do <code>VAR Intermediate = ... RETURN ...</code> to make it clear what each piece represents.</li>
<li><strong>Avoid repetition:</strong> If the same sub-calculation is used multiple times, put it in a variable so it is calculated once and reused. For example, <code>VAR totalRows = COUNTROWS(Sales)</code> used twice is better than writing <code>COUNTROWS(Sales)</code> in two places (<a target="_blank" href="https://maqsoftware.com/insights/dax-best-practices.html#:~:text=5,measures%20inside%20the%20IF%20branch">DAX Best Practices | MAQ Software Insights</a>). This avoids duplication and also ensures the value is identical and computed only once (which can improve performance).</li>
<li><strong>Debugging:</strong> You can use variables to inspect intermediate results. If a measure isn’t working, break it into parts with VAR, then temporarily set the RETURN expression to a single variable to see what that part evaluates to in a visual. Since variables are not output normally, returning them one at a time lets you verify each piece (<a target="_blank" href="https://medium.com/@rganesh0203/power-bi-dax-debugging-tricks-c02e1b01fe45#:~:text=Use%20the%20RETURN%20Statement,you%20to%20output%20intermediate%20results">Power BI DAX Debugging Tricks! - Medium</a>).</li>
</ol>
</li>
</ul>
<ul>
<li><p><strong>Example of using VAR for performance:</strong></p>
<pre><code class="lang-sql">  BigRatio =
  VAR totalOrders = COUNTROWS( Sales )
  VAR totalSales = [Total Sales]   <span class="hljs-comment">-- assume [Total Sales] is defined elsewhere</span>
  RETURN 
    IF( totalOrders = 0, 
        BLANK(), 
        totalSales / totalOrders 
    )
</code></pre>
<p>  In this measure, we calculate <code>totalOrders</code> (number of sales) once. We also capture <code>[Total Sales]</code> into <code>totalSales</code> variable (technically if [Total Sales] is a measure, calling it multiple times would give the same result in the same filter context – the engine might already optimize it, but using a VAR guarantees it’s only evaluated once here). Then we return an IF that uses those variables. If we did not use variables, we might write <code>IF(COUNTROWS(Sales)=0, BLANK(), [Total Sales] / COUNTROWS(Sales))</code>. In that original form, <code>COUNTROWS(Sales)</code> would be evaluated twice (once for the IF check, once for the division). Using VAR, it’s evaluated once (<a target="_blank" href="https://maqsoftware.com/insights/dax-best-practices.html#:~:text=5,measures%20inside%20the%20IF%20branch">DAX Best Practices | MAQ Software Insights</a>). This is a trivial example, but with heavier expressions this matters a lot.</p>
</li>
</ul>
<ul>
<li><strong>Scope of variables:</strong> A variable is evaluated in the context where it is defined, and its value is then fixed for that evaluation. If you define a VAR outside an iterator and use it inside, it stays constant across rows; if you define it inside the iterator’s expression, it is computed per row. Typically you define variables at the top of a measure, so each is evaluated once per overall evaluation. Variables are not looping counters that change as an iteration proceeds – think of them like let-bindings in math.</li>
</ul>
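<p>A sketch of that constancy (measure and column names are illustrative): the all-products total is computed once, before the iteration begins, and stays fixed for every product row.</p>
<pre><code class="lang-sql">Sum of Product Shares =
VAR AllProductsSales =
    CALCULATE( [Total Sales], ALL( Product ) )  -- evaluated once, up front
RETURN
    SUMX( VALUES( Product[ProductName] ),
        DIVIDE( [Total Sales], AllProductsSales )  -- constant inside the loop
    )
</code></pre>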
<ul>
<li><p><strong>Debugging with variables:</strong><br />  Let’s say you have a complex measure and you suspect the problem is in one part of it. You can refactor:</p>
<pre><code class="lang-sql">  ComplexMeasure =
  VAR part1 = ... some expression ...
  VAR part2 = ... another expression possibly using part1 ...
  RETURN part2  <span class="hljs-comment">-- temporarily return part2 to see if it's right</span>
</code></pre>
<p>  If the result of the measure in a visual seems off, you might set <code>RETURN part1</code> just to see what part1 is producing in that context. This way, you isolate which part is incorrect. Another trick: a measure can only return one scalar value, but for debugging you can concatenate several variables into a single string, create a temporary calculated table, or use DAX Studio to run a query that returns multiple variables at once. Tools like Tabular Editor and DAX Studio can also evaluate intermediate steps.</p>
</li>
</ul>
<p>(<a target="_blank" href="https://medium.com/@rganesh0203/dax-debugging-tricks-basic-to-advanced-power-bi-d738bfa8e463#:~:text=Use%20Variables%20for%20Debugging,intermediate%20results%20to%20be%20reused">DAX Debugging Tricks ( Basic to Advanced ) | Power BI! - Medium</a>) indicates: <em>Variables in DAX help split complex formulas into manageable pieces and allow intermediate results to be reused.</em> This captures why they are good for both debugging and performance.</p>
<ul>
<li><p><strong>Example of a complex measure using multiple variables:</strong><br />  Suppose we want a measure “% of category sales that are current product’s sales”. This could be done in steps:</p>
<pre><code class="lang-sql">  Product Sales % of Category =
  VAR currProductSales = [Total Sales]  <span class="hljs-comment">-- sales for the current product (filter context)</span>
  VAR categorySales = CALCULATE( [Total Sales], ALL( Product[ProductName] ) )
      <span class="hljs-comment">-- remove product filter, so it's total for the category (assuming Category is still filtered)</span>
  RETURN
    DIVIDE( currProductSales, categorySales )
</code></pre>
<p>  Here we used two variables to clarify our intent: one for the current product’s sales, one for the whole category’s sales (we removed only the product filter, so category context remains). This makes the measure easier to read and double-check. We could even debug by returning categorySales alone to ensure it’s doing what we expect.</p>
</li>
</ul>
<ul>
<li><strong>DAX debugging tools:</strong> Outside of writing variables, it’s worth mentioning <strong>DAX Studio</strong> (an external tool) which allows you to run DAX queries and see results or performance metrics, and <strong>Performance Analyzer</strong> in Power BI which shows how long measures take. In a purely formula sense, variables and the <code>RETURN</code> trick are your best friend to understand complex logic. SQLBI’s article on debugging DAX measures suggests techniques like returning table constructs or using the <code>ERROR()</code> function cleverly, but those are advanced. Usually, incrementally building the measure and testing as you go is the best approach.</li>
</ul>
<blockquote>
<p><strong>Tip:</strong> Use descriptive names for variables (and measures). This self-documents your code. For example, <code>VAR avgSales = DIVIDE([Total Sales], [Order Count])</code> is clearer than <code>VAR x = ...</code>. While DAX variables can’t be viewed outside, the measure definition itself becomes easier to understand when you or someone else revisits it. Also, keep in mind a variable is calculated <strong>once per filter context</strong> (not once per row, unless used inside an iterator per row). If you need a per-row calculation, you still need an iterator or a calc column.</p>
</blockquote>
<hr />
<h2 id="heading-performance-tuning-and-best-practices">Performance Tuning and Best Practices</h2>
<p>As you get into more complex DAX scenarios, it’s important to follow best practices to ensure your measures run efficiently and your results are correct. DAX is powerful, but performance can suffer if formulas are not optimized or if the model is not designed well. Here are some key practices and tips:</p>
<ol>
<li><strong>Model First, DAX Second:</strong> A good data model (star schema with proper relationships, necessary columns pre-computed in Power Query or source, no overly high cardinality columns for no reason) lays the foundation for simpler DAX. Whenever possible, shape your data so that DAX measures don’t have to do heavy lifting like string parsing or complex lookups at query time.</li>
</ol>
<ol>
<li><strong>Prefer Measures over Calculated Columns for Dynamic Calculations:</strong> If a value can be computed on the fly and especially if it needs to respond to filters, use a measure. Measures are evaluated lazily (only when needed) and don’t bloat your model. Calculated columns are best for static categorizations or when you explicitly need to slice data by that result. Measures keep your model lean. (As noted, calculated columns increase RAM usage and file size, and too many can slow refresh).</li>
</ol>
<ol>
<li><strong>Avoid Repeating Calculations – Use Variables or Separate Measures:</strong> If you notice the same expression used multiple times in a measure, factor it out. For example, instead of <code>[Result] = IF(X &gt; 0, X/total, X)</code> where X is some complex sum, do <code>VAR X = complex sum RETURN IF(X &gt; 0, X/total, X)</code>. This way <code>complex sum</code> runs once. Similarly, you can create helper measures for reusability. E.g., define <code>[Total Sales]</code> once and reuse it in many other measures rather than writing <code>SUM(Sales[Amount])</code> in every measure (and possibly adding filters each time incorrectly). This modular approach also makes maintenance easier. (<a target="_blank" href="https://maqsoftware.com/insights/dax-best-practices.html#:~:text=5,measures%20inside%20the%20IF%20branch">DAX Best Practices | MAQ Software Insights</a>) illustrated how using a variable prevented double calculation of a measure.</li>
</ol>
<ol>
<li><p><strong>Use Built-in Functions Optimally:</strong></p>
<ul>
<li>Use <code>DIVIDE(numerator, denominator, alternateResult)</code> instead of the <code>/</code> operator when dividing measures, to gracefully handle division by zero without extra IF logic (<a target="_blank" href="https://maqsoftware.com/insights/dax-best-practices.html#:~:text=6,">DAX Best Practices | MAQ Software Insights</a>). It’s both clearer and potentially a tiny bit more efficient than doing an IF yourself.</li>
</ul>
</li>
</ol>
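<p>For example, a quick sketch (assuming [Total Profit] and [Total Sales] measures):</p>
<pre><code class="lang-sql">Profit Margin % = DIVIDE( [Total Profit], [Total Sales] )
-- cleaner than the manual equivalent:
-- IF( [Total Sales] = 0, BLANK(), [Total Profit] / [Total Sales] )
</code></pre>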
<ul>
<li>Use <code>SELECTEDVALUE(column, alternate)</code> instead of the pattern <code>IF(HASONEVALUE(column), VALUES(column), alternate)</code> (<a target="_blank" href="https://maqsoftware.com/insights/dax-best-practices.html#:~:text=3,HASONEVALUE">DAX Best Practices | MAQ Software Insights</a>) (<a target="_blank" href="https://maqsoftware.com/insights/dax-best-practices.html#:~:text=4,VALUES">DAX Best Practices | MAQ Software Insights</a>). SELECTEDVALUE does exactly that check internally (returns the single value if one exists, otherwise the alternate or BLANK). This makes code cleaner and potentially avoids performance issues of accidentally getting an error from <code>VALUES()</code> when multiple values exist.</li>
</ul>
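<p>A quick sketch of the shorthand (the column name is illustrative):</p>
<pre><code class="lang-sql">-- One product in the filter context: its name; otherwise the alternate text:
Selected Product = SELECTEDVALUE( Product[ProductName], "Multiple products" )
</code></pre>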
<ul>
<li>Use <code>CONCATENATEX</code> for string aggregations instead of trying to stitch strings via hacks (like using PATH functions or so).</li>
</ul>
<ul>
<li>Leverage <code>COALESCE()</code> (as of newer updates) to handle blanks in a cleaner way than nested IFs for default values.</li>
</ul>
<ol>
<li><strong>Context Modification Functions:</strong> Be mindful with <code>FILTER</code> inside CALCULATE. Recall that giving a filter like <code>Table[Column] = value</code> directly is usually faster than using <code>FILTER(Table, Table[Column]=value)</code> because the latter iterates the whole table. If you have multiple conditions on the same table, you can often combine them in one FILTER. If you want to preserve an existing filter and add a new one, consider <code>KEEPFILTERS</code> instead of FILTER (<a target="_blank" href="https://maqsoftware.com/insights/dax-best-practices.html#:~:text=7,T">DAX Best Practices | MAQ Software Insights</a>). For example, <code>CALCULATE([Measure], KEEPFILTERS(Table[Col] = "X"))</code> will respect existing filters on Col and require "X" too, whereas FILTER would disregard the existing filter on Col. This can both affect correctness and performance (because of how queries get optimized).</li>
</ol>
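<p>A sketch of the difference (the Color column is illustrative):</p>
<pre><code class="lang-sql">-- Intersects with any existing filter on Color (e.g. from a slicer):
Red Sales (keep) =
CALCULATE( [Total Sales], KEEPFILTERS( Product[Color] = "Red" ) )

-- Replaces any existing filter on Color outright:
Red Sales (override) =
CALCULATE( [Total Sales], Product[Color] = "Red" )
</code></pre>
<p>If a slicer selects Color = "Blue", the first measure returns blank (the intersection of Blue and Red is empty), while the second still returns red sales.</p>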
<ol>
<li><strong>Avoid Iterating Over Huge Tables Unnecessarily:</strong> If your dataset is large (millions of rows), an iterator like <code>FILTER(Sales, ...)</code> or <code>SUMX(Sales, ...)</code> will iterate all those rows in the formula engine, which can be slow. If you can push a filter to the storage engine (VertiPaq) it’s faster. E.g., <code>CALCULATE([Measure], Table[Col] = "Value")</code> will let the engine use its internal indexes to filter, which is usually very fast. But <code>FILTER(Table, Table[Col]="Value")</code> would enumerate rows one by one. So use iterator functions judiciously. When possible, structure your measure using CALCULATE with filter arguments or use <code>COUNTROWS( FILTER(...) )</code> patterns that at least limit the rows early.</li>
</ol>
<ol>
<li><strong>Limit the Scope of Calculations:</strong> If you need a calculation at a certain granularity, consider doing it in a summarized or calculated table instead of on the fly for every detail row. For example, if you frequently need to calculate something per customer and then sum it up, you could create a calculated table or use SUMMARIZE to pre-calculate per customer. Alternatively, a measure with SUMX( VALUES(Customer[ID]), ... ) ensures you only iterate unique customers, not every transaction.</li>
</ol>
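<p>The per-customer pattern might be sketched like this (the Customer[ID] column and [Total Sales] measure are assumed):</p>
<pre><code class="lang-sql">-- Iterates the distinct customers in context, not individual transactions:
Avg Sales per Customer =
AVERAGEX( VALUES( Customer[ID] ), [Total Sales] )
</code></pre>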
<ol>
<li><strong>Use the right grain of context in measures:</strong> This is a bit conceptual: sometimes you can write a measure in different ways – one might do a heavy calculation on each row of a fact table, another might leverage an aggregated table. For example, to count customers who purchased something, rather than doing <code>COUNTROWS( DISTINCT( Sales[CustomerID] ) )</code> over a huge sales table, you could have a Customers table and do <code>COUNTROWS( FILTER( Customers, [Total Sales for Customer] &gt; 0 ) )</code>, where [Total Sales for Customer] is a calculated column or measure. The latter might iterate fewer rows (just customers). The right choice depends on the model and the need, but think about what you’re iterating over.</li>
</ol>
<ol>
<li><strong>Formatting and Readability:</strong> Use a consistent formatting style. Tools like <strong>DAX Formatter</strong> (by SQLBI) can format your DAX code for readability (indenting, line breaks). Clean, well-formatted DAX is easier to debug and often helps you spot logical issues. Also, as a best practice, name your measures descriptively (and include units or time frame if relevant, e.g., "Total Sales LY" for last year). You can also add descriptions to measures in the model for documentation.</li>
</ol>
<ol>
<li><strong>Measure Dependencies and Ordering:</strong> Be cautious of measures that depend on others that depend on others... While modularizing is good, a very long chain might be harder to troubleshoot. Try to keep measure logic coherent. But referencing a measure within another is fine (the engine calculates them in the right order as needed).</li>
</ol>
<ol>
<li><strong>Testing Performance:</strong> Use the <strong>Performance Analyzer</strong> in Power BI to see which measures are slow. If a particular measure is slow, consider if it’s doing something complex like nested iterators or scanning a huge table with FILTER. Try simplifying it or see if the heavy part can be precomputed in a column or table. Also, check your model size; large models might benefit from aggregations or reducing cardinality of columns (like splitting high cardinality columns if possible, or removing unneeded detail).</li>
<li><p><strong>Common Mistakes to Avoid:</strong></p>
<ul>
<li>Avoid using <code>VALUES()</code> in a context where it can return multiple values unless you wrap it in an aggregator or use SELECTEDVALUE; it raises an error when more than one value appears. If you expect a single value but sometimes get many, use SELECTEDVALUE (which lets you supply a fallback) or an iterator to handle it.</li>
<li>Be careful with <code>ALLSELECTED</code> vs <code>ALL</code> – ALLSELECTED is context-sensitive (it respects outer filters such as slicers but removes the filters of the current visual context), which can be confusing in percent-of-total calculations. Use ALL when you truly mean all rows, and ALLSELECTED when you want the total of the current selection (for subtotals, etc.).</li>
<li>Don’t assume a calculated column will update when a slicer changes – calculated columns are computed at data refresh, not at query time. If you need dynamic behavior, a measure is the way to go.</li>
<li>In measures, don’t fight the filter context. Rather than trying to store an intermediate result globally, think in terms of context: every evaluation answers “given the current filters, what is the output?”</li>
</ul>
</li>
</ol>
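<p>As a sketch of the SELECTEDVALUE pattern (the table and column names here are illustrative):</p>
<pre><code class="lang-dax">-- Returns the single selected product name, or a fallback label
-- when the filter context contains zero or multiple products
Selected Product :=
SELECTEDVALUE ( 'Product'[Product Name], "Multiple products" )
</code></pre>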
<ol>
<li><strong>Iterate only as needed:</strong> For example, instead of <code>SUMX( FILTER( Sales, ... ), Sales[Amount] )</code>, consider whether <code>CALCULATE( SUM( Sales[Amount] ), ... )</code> can do the job without explicit iteration. The DAX optimizer may produce similar plans under the hood, but writing the set-based form often aligns better with engine optimizations.</li>
</ol>
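<p>As a sketch of that rewrite (the table, column, and filter are illustrative):</p>
<pre><code class="lang-dax">-- Iterator version: FILTER explicitly materializes and scans the rows
Red Sales (iterated) :=
SUMX ( FILTER ( Sales, Sales[Color] = "Red" ), Sales[Amount] )

-- Set-based version: CALCULATE applies the filter and lets the engine optimize
Red Sales :=
CALCULATE ( SUM ( Sales[Amount] ), Sales[Color] = "Red" )
</code></pre>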
<ol>
<li><strong>Use of USERELATIONSHIP and relationships:</strong> Only activate alternate (inactive) relationships in the measures that require them, and isolate USERELATIONSHIP to those measures. That keeps your model’s default behavior predictable and pays the heavier cost only where it is needed. Similarly, avoid bi-directional cross-filtering across the whole model unless truly necessary; the few cases that need it can usually be handled with specific CALCULATE modifiers or measure logic (e.g., a many-to-many through a bridge table using TREATAS or CROSSFILTER).</li>
</ol>
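<p>A minimal sketch of isolating USERELATIONSHIP to a single measure, assuming the model has an inactive relationship between Sales[ShipDate] and the Date table:</p>
<pre><code class="lang-dax">-- Activates the inactive ShipDate relationship only inside this measure;
-- the model's default relationship (e.g. on OrderDate) is untouched elsewhere
Sales by Ship Date :=
CALCULATE (
    [Total Sales],
    USERELATIONSHIP ( Sales[ShipDate], 'Date'[Date] )
)
</code></pre>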
<p>In short, <strong>write DAX clearly and efficiently</strong>. Favor the powerful filter context over manual row iteration when possible. Test your measures with small filters to ensure they compute expected results before applying to big data. And remember, DAX is a mix of a query language and formula language – sometimes there are multiple ways to achieve something (set-based vs procedural). Aim for the approach that leverages the strengths of the VertiPaq engine (set-based filtering, compression) rather than brute-force calculations.</p>
<p>Finally, stay curious and keep learning: DAX has many nuances and functions (over 200 functions). Refer to official documentation and community blogs for best practices. Over time, you’ll develop an intuition for writing DAX that is not only correct but also performant and easy to maintain. Happy DAX-ing!</p>
]]></content:encoded></item><item><title><![CDATA[BigQuery Study Reference]]></title><description><![CDATA[I have just finished a study guide on BigQuery @ GCP. Below you can find it on Heptabase Whiteboard
In Summary, you can find it in here:
BigQuery SQL Statement Guide (Standard SQL & Legacy SQL)
Google BigQuery’s SQL dialect (Google Standard SQL) exte...]]></description><link>https://notes.beesho.me/bigquery-study-reference</link><guid isPermaLink="true">https://notes.beesho.me/bigquery-study-reference</guid><category><![CDATA[bigquery]]></category><category><![CDATA[GCP]]></category><category><![CDATA[SQL]]></category><dc:creator><![CDATA[Beshoy Sabri]]></dc:creator><pubDate>Thu, 24 Apr 2025 16:47:12 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1745513140973/c1190ba7-0377-4d80-a2fc-c467cc0ffedd.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745513213591/e26125fd-8e54-40dc-abd1-2915d38ec538.png" alt /></p>
<p>I have just finished a study guide on BigQuery @ GCP. Below you can find it on <a target="_blank" href="https://app.heptabase.com/w/0055443fa87dc599e6578fc486cb7ebecb7caeecad70e05e9d8780d344782142">Heptabase Whiteboard</a></p>
<p>In summary, here it is:</p>
<h1 id="heading-bigquery-sql-statement-guide-standard-sql-amp-legacy-sql">BigQuery SQL Statement Guide (Standard SQL &amp; Legacy SQL)</h1>
<p>Google BigQuery’s SQL dialect (Google Standard SQL) extends traditional SQL with unique features. Below, we explore BigQuery-specific SQL statements in the classic categories – DDL, DQL, DML, DCL, TCL – with examples and notes on legacy SQL where applicable, followed by additional topics like aggregations, window functions, and performance tips.</p>
<h2 id="heading-1-ddl-data-definition-language">1. DDL – Data Definition Language</h2>
<p><strong>Definition:</strong> DDL statements are used to define or alter the structure of the database objects. In BigQuery, DDL can create, modify, and delete <strong>datasets</strong> (schemas), <strong>tables</strong>, <strong>views</strong>, <strong>materialized views</strong>, <strong>table snapshots</strong>, <strong>table clones</strong>, <strong>functions (UDFs)</strong>, <strong>procedures</strong>, and <strong>row-level access policies</strong> (<a target="_blank" href="https://www.janbasktraining.com/tutorials/ddl-commands/#:~:text=Using%20Google%20Standard%20SQL%20query,the%20help%20of%20DDL%20commands">Learn All About DDL Commands</a>). DDL operations affect the <strong>schema/metadata</strong> but not the actual data contents (<a target="_blank" href="https://www.owox.com/blog/articles/bigquery-create-drop-tables-data-definition-language#:~:text=Data%20Definition%20Language%20,without%20affecting%20the%20data%20itself">Create and Delete Tables in BigQuery: A 2025 DDL Guide</a>).</p>
<p>BigQuery’s legacy SQL did not support DDL in queries (table management had to be done via the UI or CLI). BigQuery Standard SQL (also called GoogleSQL) introduced in-query DDL support. Key BigQuery DDL statements include <strong>CREATE</strong>, <strong>ALTER</strong>, and <strong>DROP</strong>, each extended with BigQuery-specific options:</p>
<ul>
<li><p><strong>CREATE</strong> – Used to create new datasets, tables, views, routines, etc. BigQuery’s <code>CREATE TABLE</code> supports additional clauses for partitioning and clustering data for performance. You can also use <code>CREATE TABLE ... AS SELECT</code> (CTAS) to create a table from a query result in one statement (<a target="_blank" href="https://www.aampe.com/blog/how-to-create-a-table-in-bigquery#:~:text=CREATE%20TABLE%20%60your,table">How to Create a Table in BigQuery: A Step-by-Step Guide</a>). For example, the following creates a partitioned, clustered table:</p>
<pre><code class="lang-sql">  <span class="hljs-comment">-- Create a partitioned table for sales data, partitioned by date and clustered by product</span>
  <span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> <span class="hljs-string">`project_id.my_dataset.sales`</span> (
    sale_date <span class="hljs-built_in">DATE</span>,
    product <span class="hljs-keyword">STRING</span>,
    quantity INT64,
    price <span class="hljs-built_in">NUMERIC</span>
  )
  <span class="hljs-keyword">PARTITION</span> <span class="hljs-keyword">BY</span> sale_date
  CLUSTER <span class="hljs-keyword">BY</span> product;
</code></pre>
<p>  In the above, <strong>PARTITION BY</strong> and <strong>CLUSTER BY</strong> are BigQuery-specific extensions to define date or integer range partitions and clustered indexing on a column, which help optimize query performance (<a target="_blank" href="https://www.kpipartners.com/blogs/bigquery-best-practices-to-optimize-cost-and-performance#:~:text=Use%20Partitioned%20tables%3A">BigQuery Best Practices to Optimize Cost and Performance</a>) (<a target="_blank" href="https://www.kpipartners.com/blogs/bigquery-best-practices-to-optimize-cost-and-performance#:~:text=Clustered%20tables%3A">BigQuery Best Practices to Optimize Cost and Performance</a>). BigQuery also allows adding a table description or labels via the <code>OPTIONS</code> clause in DDL. For example, you could append <code>OPTIONS(description="Sales data", labels=[("team","finance")])</code> to the <code>CREATE TABLE</code> to document the table.</p>
<p>  BigQuery <strong>datasets</strong> are analogous to schemas: use <code>CREATE SCHEMA dataset_name</code> (Standard SQL) to create a new dataset (<a target="_blank" href="https://www.owox.com/blog/articles/bigquery-create-drop-tables-data-definition-language#:~:text=CREATE%20SCHEMA%20Statement">Create and Delete Tables in BigQuery: A 2025 DDL Guide</a>) (<a target="_blank" href="https://www.owox.com/blog/articles/bigquery-create-drop-tables-data-definition-language#:~:text=,Statement">Create and Delete Tables in BigQuery: A 2025 DDL Guide</a>). Similarly, <code>CREATE VIEW</code> creates a logical view from a query, and <code>CREATE MATERIALIZED VIEW</code> creates a view that caches results for faster reuse.</p>
</li>
</ul>
<ul>
<li><p><strong>ALTER</strong> – Used to modify existing objects. BigQuery allows altering table schemas flexibly. For example, you can add a column to a table without rewriting it:</p>
<pre><code class="lang-sql">  <span class="hljs-keyword">ALTER</span> <span class="hljs-keyword">TABLE</span> <span class="hljs-string">`my_dataset.sales`</span> 
  <span class="hljs-keyword">ADD</span> <span class="hljs-keyword">COLUMN</span> <span class="hljs-keyword">IF</span> <span class="hljs-keyword">NOT</span> <span class="hljs-keyword">EXISTS</span> comments <span class="hljs-keyword">STRING</span>;
</code></pre>
<p>  This adds a new nullable field <code>comments</code> to the <strong>sales</strong> table. Other BigQuery ALTER operations include <code>ALTER TABLE SET OPTIONS</code> (to change table metadata like description or default expiration), <code>ALTER COLUMN</code> (to change column options like descriptions), or renaming tables/columns. You can also alter datasets (schemas) – e.g. <code>ALTER SCHEMA my_dataset SET OPTIONS(default_table_expiration_days=90)</code> to set a default expiration for tables in a dataset.</p>
</li>
</ul>
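<p>A short sketch of the OPTIONS-based alterations mentioned above (table and dataset names are illustrative):</p>
<pre><code class="lang-sql">-- Change table metadata without touching the data:
ALTER TABLE `my_dataset.sales`
SET OPTIONS (description = 'Daily sales fact table');

-- Set a default expiration for new tables in a dataset:
ALTER SCHEMA my_dataset
SET OPTIONS (default_table_expiration_days = 90);
</code></pre>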
<ul>
<li><strong>DROP</strong> – Removes BigQuery objects. For example, <code>DROP TABLE my_dataset.sales;</code> will delete the <strong>sales</strong> table. BigQuery also supports <code>DROP SCHEMA my_dataset CASCADE;</code> to delete a dataset along with all its tables. (Legacy SQL required using the API or console for such operations, since it lacked DDL.)</li>
</ul>
<p><strong>Example – Creating and Dropping a Table:</strong></p>
<pre><code class="lang-sql"><span class="hljs-comment">-- Create a new table from a query (CTAS example):</span>
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> <span class="hljs-string">`project_id.my_dataset.top_customers`</span> <span class="hljs-keyword">AS</span>
<span class="hljs-keyword">SELECT</span> customer_id, <span class="hljs-keyword">SUM</span>(total_spend) <span class="hljs-keyword">AS</span> total_spend
<span class="hljs-keyword">FROM</span> <span class="hljs-string">`project_id.my_dataset.orders`</span>
<span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> customer_id
<span class="hljs-keyword">HAVING</span> <span class="hljs-keyword">SUM</span>(total_spend) &gt; <span class="hljs-number">1000</span>;

<span class="hljs-comment">-- Later, drop the table if no longer needed:</span>
<span class="hljs-keyword">DROP</span> <span class="hljs-keyword">TABLE</span> <span class="hljs-string">`project_id.my_dataset.top_customers`</span>;
</code></pre>
<p>The above <code>CREATE TABLE ... AS SELECT</code> will create <strong>top_customers</strong> with the results of the query (<a target="_blank" href="https://www.aampe.com/blog/how-to-create-a-table-in-bigquery#:~:text=CREATE%20TABLE%20%60your,table">How to Create a Table in BigQuery: A Step-by-Step Guide</a>). The <code>DROP TABLE</code> statement then deletes it. BigQuery DDL statements can include <code>IF EXISTS</code>/<code>IF NOT EXISTS</code> to guard against errors when objects may or may not exist (<a target="_blank" href="https://www.owox.com/blog/articles/bigquery-create-drop-tables-data-definition-language#:~:text=A%20Data%20Definition%20Language%20,This%20includes%20creating%2C%20altering%2C%20and">Create and Delete Tables in BigQuery: A 2025 DDL Guide</a>).</p>
<h2 id="heading-2-dql-data-query-language">2. DQL – Data Query Language</h2>
<p><strong>Definition:</strong> DQL refers to data <em>query</em> statements, primarily the <strong>SELECT</strong> statement and its clauses used to retrieve data. In BigQuery, SELECT queries support standard SQL semantics and many BigQuery-only extensions. (Legacy SQL mode also allowed SELECT queries but with different functions and handling of nested data.)</p>
<p>A basic BigQuery SELECT query supports the usual clauses: <code>SELECT ... FROM ... JOIN ... WHERE ... GROUP BY ... HAVING ... ORDER BY ... LIMIT</code>. BigQuery’s Standard SQL has full support for SQL joins, subqueries, and set operations. The <strong>WHERE</strong> clause filters rows before aggregation, and <strong>HAVING</strong> filters groups after aggregation (more on HAVING in Section 6). BigQuery’s implementation follows standard evaluation order: WHERE filters, then GROUP BY aggregates, then HAVING filters aggregated groups (<a target="_blank" href="https://www.owox.com/blog/articles/bigquery-sql-where-vs-having-vs-qualify#:~:text=,LIMIT">A Complete Guide to WHERE, HAVING, and QUALIFY in SQL</a>) (<a target="_blank" href="https://www.owox.com/blog/articles/bigquery-sql-where-vs-having-vs-qualify#:~:text=The%20HAVING%20clause%20filters%20records,for%20conditions%20involving%20aggregate%20functions">A Complete Guide to WHERE, HAVING, and QUALIFY in SQL</a>).</p>
<p><strong>BigQuery-specific Query Features:</strong></p>
<ul>
<li><strong>SELECT * EXCEPT / REPLACE:</strong> BigQuery extends <code>SELECT *</code> with the ability to exclude or replace specific columns. For example, <code>SELECT * EXCEPT(password)</code> will select all columns except <code>password</code>, and <code>SELECT * REPLACE(customer_id * 100 AS customer_id)</code> would select all columns but use a calculated value in place of <code>customer_id</code>. This is unique to BigQuery’s SQL (and now adopted by some other databases). Usage is straightforward: <code>SELECT * EXCEPT(col1, col2) FROM ...</code> to drop columns (<a target="_blank" href="https://stackoverflow.com/questions/34056485/select-all-columns-except-some-in-google-bigquery#:~:text=SELECT%20,FROM">Select All Columns Except Some in Google BigQuery? - Stack Overflow</a>). You can even combine them, e.g. <code>SELECT * EXCEPT(id) REPLACE("widget" AS product_name) FROM products;</code> (<a target="_blank" href="https://stackoverflow.com/questions/34056485/select-all-columns-except-some-in-google-bigquery#:~:text=In%20addition%20to%20SELECT%20,and%20obvious%20as%20per%20documentation">Select All Columns Except Some in Google BigQuery? - Stack Overflow</a>). This saves time when you need most columns but not all.</li>
</ul>
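<p>A quick sketch of both modifiers (the column names are hypothetical):</p>
<pre><code class="lang-sql">-- All columns except the sensitive ones:
SELECT * EXCEPT (password, ssn)
FROM `my_dataset.users`;

-- All columns, but normalize email in place:
SELECT * REPLACE (LOWER(email) AS email)
FROM `my_dataset.users`;
</code></pre>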
<ul>
<li><strong>Qualify (Filtering on Window Functions):</strong> BigQuery supports the <code>QUALIFY</code> clause to filter the results of window functions (analytic functions) in the same query, rather than using a subquery. <code>QUALIFY</code> is analogous to HAVING but for window function results (<a target="_blank" href="https://www.owox.com/blog/articles/bigquery-sql-where-vs-having-vs-qualify#:~:text=The%20QUALIFY%20clause%20filters%20the,and%20WHERE%20is%20to%20FROM">A Complete Guide to WHERE, HAVING, and QUALIFY in SQL</a>). For example, after using a ranking function in the SELECT, you can add <code>QUALIFY ROW_NUMBER() OVER(PARTITION BY category ORDER BY sales DESC) = 1</code> to return only the top-selling row per category. This clause is evaluated after window functions are computed, letting you filter on their outcomes (<a target="_blank" href="https://gnarlyware.com/blog/qualify-clause-is-now-available-in-bigquery/#:~:text=The%20%60QUALIFY%60,I%E2%80%99ve%20had%20for%20a%20while">QUALIFY clause is now available in BigQuery - gnarlyware</a>) (<a target="_blank" href="https://gnarlyware.com/blog/qualify-clause-is-now-available-in-bigquery/#:~:text=match%20at%20L85%20QUALIFY%20ROW_NUMBER,%3D%201">QUALIFY clause is now available in BigQuery - gnarlyware</a>). (Without QUALIFY, you would need a subquery or CTE to filter by a window function result.)</li>
</ul>
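<p>A minimal QUALIFY sketch, assuming a <code>product_sales</code> table with <code>category</code>, <code>product</code>, and <code>sales</code> columns:</p>
<pre><code class="lang-sql">-- Keep only the top-selling row per category, with no subquery:
SELECT category, product, sales
FROM `my_dataset.product_sales`
WHERE sales IS NOT NULL
QUALIFY ROW_NUMBER() OVER (PARTITION BY category ORDER BY sales DESC) = 1;
</code></pre>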
<ul>
<li><strong>Window Functions (OVER ... PARTITION BY):</strong> BigQuery fully supports SQL window functions for running totals, ranking, lags/leads, etc. You can include expressions like <code>SUM(amount) OVER(PARTITION BY region ORDER BY date)</code> in the SELECT clause to compute a running total, or <code>RANK() OVER(PARTITION BY region ORDER BY sales DESC)</code> to rank rows within partitions. (We cover window functions in detail in Section 8.) These use the <code>OVER()</code> syntax with optional <code>PARTITION BY</code> and <code>ORDER BY</code>. BigQuery even allows filtering on these in the same query via QUALIFY, as noted above – a Standard SQL convenience that legacy SQL’s more limited analytic-function support did not offer.</li>
</ul>
<ul>
<li><p><strong>Arrays and Structs (Nested Data):</strong> BigQuery can store <strong>nested</strong> and <strong>repeated</strong> fields (arrays and structs) in tables, and its SQL can directly query such structures. For example, a table can have a column which is an ARRAY of STRUCTs. In Standard SQL, you use the <code>UNNEST()</code> function in the FROM clause to flatten arrays for analysis. BigQuery’s ability to handle nested data is a distinguishing feature – it allows <strong>denormalized</strong> schemas (e.g., an order with an array of line-items) that you can still query with SQL. (Legacy SQL had a different approach using <code>FLATTEN()</code> for repeated fields.) For instance:</p>
<pre><code class="lang-sql">  <span class="hljs-keyword">SELECT</span> order_id, item.product_name, item.quantity
  <span class="hljs-keyword">FROM</span> <span class="hljs-string">`project.dataset.orders`</span>, 
       <span class="hljs-keyword">UNNEST</span>(items) <span class="hljs-keyword">AS</span> item
  <span class="hljs-keyword">WHERE</span> item.product_category = <span class="hljs-string">'Electronics'</span>;
</code></pre>
<p>  Here, <strong>items</strong> is an ARRAY field in <strong>orders</strong>; <code>UNNEST(items)</code> produces a table of item structs, which we alias as <strong>item</strong>. This query will output one row per array element. BigQuery’s SQL treats each element as a row but still associates it with its parent order. This nested data capability allows BigQuery to avoid expensive joins by keeping related data together (<a target="_blank" href="https://www.kpipartners.com/blogs/bigquery-best-practices-to-optimize-cost-and-performance#:~:text=Use%20Nested%20and%20Repeated%20fields%3A">BigQuery Best Practices to Optimize Cost and Performance</a>) (the data is physically nested). It’s a powerful feature not found in many SQL dialects.</p>
</li>
</ul>
<ul>
<li><strong>Wildcard Table Queries:</strong> BigQuery can query multiple tables in one go using wildcards or table decorators. For example, you might have monthly sharded tables like <code>sales_2019Jan, sales_2019Feb, ...</code> and run a query across all of them with <code>FROM `project.dataset.sales_2019*`</code>. BigQuery provides a pseudo-column <code>_TABLE_SUFFIX</code> to filter which underlying tables to include. (In legacy SQL, functions like <code>TABLE_DATE_RANGE()</code> were used for similar effect.) This helps when data is sharded into many tables by date.</li>
</ul>
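<p>A sketch of a wildcard query with <code>_TABLE_SUFFIX</code> filtering (the table and column names are illustrative):</p>
<pre><code class="lang-sql">-- Scan only the Q1 shards of the monthly sales tables:
SELECT _TABLE_SUFFIX AS month_shard, SUM(amount) AS total
FROM `project.dataset.sales_2019*`
WHERE _TABLE_SUFFIX IN ('Jan', 'Feb', 'Mar')
GROUP BY month_shard;
</code></pre>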
<ul>
<li><strong>Set Operators:</strong> BigQuery supports the standard SQL set operators <code>UNION ALL</code>, <code>UNION DISTINCT</code>, <code>EXCEPT DISTINCT</code>, and <code>INTERSECT DISTINCT</code>, which combine the result sets of multiple SELECT queries. Note that BigQuery’s Standard SQL requires the ALL or DISTINCT keyword to be explicit: <code>UNION DISTINCT</code> deduplicates, while <code>UNION ALL</code> keeps duplicates.</li>
</ul>
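<p>For example (the table names are hypothetical):</p>
<pre><code class="lang-sql">-- Unique customers appearing in either year:
SELECT customer_id FROM `my_dataset.customers_2024`
UNION DISTINCT
SELECT customer_id FROM `my_dataset.customers_2025`;
</code></pre>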
<ul>
<li><strong>Advanced Grouping:</strong> BigQuery Standard SQL (recent versions) supports grouping sets, rollups, and cubes for aggregation. For example, you can do <code>GROUP BY ROLLUP(region, product)</code> to get subtotals by region and product combinations, or use <code>GROUP BY CUBE</code> or explicit <code>GROUPING SETS</code>. These were not available in legacy SQL. (Support for <code>GROUP BY ROLLUP, CUBE, GROUPING SETS</code> was added to BigQuery’s SQL (<a target="_blank" href="https://medium.com/codex/google-bigquery-now-supports-cubes-e50ecd39e447#:~:text=Google%20BigQuery%20now%20supports%20Cubes,BY%20CUBE%20clause%20completely%20new">Google BigQuery now supports Cubes | by Christianlauer - Medium</a>).)</li>
</ul>
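<p>A short ROLLUP sketch (the table and column names are hypothetical):</p>
<pre><code class="lang-sql">-- Totals per (region, product), subtotals per region, and a grand total;
-- NULLs in the output mark the rolled-up levels.
SELECT region, product, SUM(sales) AS total_sales
FROM `my_dataset.regional_sales`
GROUP BY ROLLUP (region, product);
</code></pre>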
<p><strong>Example – BigQuery Query with Unique Features:</strong></p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> 
  <span class="hljs-keyword">category</span>,
  product,
  <span class="hljs-keyword">SUM</span>(sales) <span class="hljs-keyword">AS</span> total_sales,
  <span class="hljs-keyword">RANK</span>() <span class="hljs-keyword">OVER</span>(<span class="hljs-keyword">PARTITION</span> <span class="hljs-keyword">BY</span> <span class="hljs-keyword">category</span> <span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> <span class="hljs-keyword">SUM</span>(sales) <span class="hljs-keyword">DESC</span>) <span class="hljs-keyword">AS</span> product_rank_in_cat
<span class="hljs-keyword">FROM</span> <span class="hljs-string">`my_dataset.product_sales`</span>
<span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> <span class="hljs-keyword">category</span>, product
<span class="hljs-keyword">HAVING</span> total_sales &gt; <span class="hljs-number">1000</span>
QUALIFY product_rank_in_cat = <span class="hljs-number">1</span>
</code></pre>
<p>In this (contrived) example, we use <strong>GROUP BY</strong> to aggregate sales by category and product, <strong>HAVING</strong> to keep only products with &gt;1000 sales in a category, and a <strong>window function</strong> (<code>RANK() OVER(PARTITION BY category ORDER BY SUM(sales) DESC)</code>) to rank products by sales within each category. Finally, <code>QUALIFY product_rank_in_cat = 1</code> filters the results to return only the top product per category. This single query finds the top-selling product in each category with over 1000 sales, demonstrating BigQuery’s ability to combine aggregation, window functions, and qualify filtering. (Note: The <code>SUM(sales)</code> inside the RANK() would actually be computed as an analytic function – BigQuery allows using aggregations in analytic functions by computing them over the partition – here it effectively ranks by the same SUM per partition.)</p>
<h2 id="heading-3-dml-data-manipulation-language">3. DML – Data Manipulation Language</h2>
<p><strong>Definition:</strong> DML statements modify table data (insert, update, delete rows). BigQuery’s DML enables adding or changing data in BigQuery tables via SQL, which was a major enhancement over the early append-only model. Standard DML in BigQuery includes <strong>INSERT</strong>, <strong>UPDATE</strong>, <strong>DELETE</strong>, and the combined <strong>MERGE</strong> statement (<a target="_blank" href="https://stackoverflow.com/questions/69146082/bigquery-sql-how-to-update-rows-and-insert-new-data#:~:text=You%20are%20looking%20for%20merge,statement">Bigquery SQL how to update rows AND Insert new data - Stack Overflow</a>). These statements operate in BigQuery with some constraints (for example, BigQuery executes DML on a <strong>snapshot</strong> of the table to avoid conflicts, and prior to 2020 there were quotas on the number of DML operations per day, which have since been removed (<a target="_blank" href="https://cloud.google.com/blog/products/data-analytics/dml-without-limits-now-in-bigquery#:~:text=DML%20without%20limits%2C%20now%20in,DML%20statements%20on%20a%20table">DML without limits, now in BigQuery | Google Cloud Blog</a>)).</p>
<p><strong>INSERT:</strong> Adds new rows to a table. BigQuery supports two forms: inserting explicit values, or inserting the results of a query. For example:</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- Insert a single row with explicit values:</span>
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> <span class="hljs-string">`my_dataset.users`</span> (user_id, <span class="hljs-keyword">name</span>, signup_date)
<span class="hljs-keyword">VALUES</span> (<span class="hljs-number">123</span>, <span class="hljs-string">'Alice'</span>, <span class="hljs-keyword">CURRENT_DATE</span>);

<span class="hljs-comment">-- Insert multiple rows using a subquery:</span>
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> <span class="hljs-string">`my_dataset.gold_customers`</span> (customer_id, total_spend)
<span class="hljs-keyword">SELECT</span> customer_id, <span class="hljs-keyword">SUM</span>(amount)
<span class="hljs-keyword">FROM</span> <span class="hljs-string">`my_dataset.sales`</span>
<span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> customer_id
<span class="hljs-keyword">HAVING</span> <span class="hljs-keyword">SUM</span>(amount) &gt; <span class="hljs-number">10000</span>;
</code></pre>
<p>The first <code>INSERT</code> adds one row to <strong>users</strong>. The second inserts the results of a query (all customers with &gt;10000 total spend) into the <strong>gold_customers</strong> table. BigQuery’s Standard SQL does not require a <code>VALUES</code> list for each row if using a SELECT; it will append all rows from the query. (Legacy SQL did not support DML; data was usually loaded in bulk or via streaming inserts.)</p>
<p><strong>UPDATE:</strong> Modifies existing rows, setting new values for some columns where a condition is met. BigQuery’s <code>UPDATE</code> syntax allows a FROM clause for more complex updates (e.g. updating a table based on a join with another table). Example:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">UPDATE</span> <span class="hljs-string">`my_dataset.users`</span>
<span class="hljs-keyword">SET</span> last_login = <span class="hljs-keyword">CURRENT_TIMESTAMP</span>
<span class="hljs-keyword">WHERE</span> user_id = <span class="hljs-number">123</span>;
</code></pre>
<p>This updates the <code>last_login</code> timestamp for the user with ID 123. If we needed to update based on another table (say we have a staging table of latest logins), we could do something like:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">UPDATE</span> <span class="hljs-string">`my_dataset.users`</span> <span class="hljs-keyword">AS</span> u
<span class="hljs-keyword">SET</span> last_login = s.new_login_time
<span class="hljs-keyword">FROM</span> <span class="hljs-string">`my_dataset.login_updates`</span> <span class="hljs-keyword">AS</span> s
<span class="hljs-keyword">WHERE</span> u.user_id = s.user_id;
</code></pre>
<p>BigQuery executes the update in a single pass (it's atomic). Under the hood, BigQuery may rewrite the entire table or the modified partitions, but the user sees it as an in-place update.</p>
<p><strong>DELETE:</strong> Removes rows that meet a condition. For example:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">DELETE</span> <span class="hljs-keyword">FROM</span> <span class="hljs-string">`my_dataset.users`</span>
<span class="hljs-keyword">WHERE</span> is_active = <span class="hljs-literal">FALSE</span> <span class="hljs-keyword">AND</span> last_login &lt; <span class="hljs-keyword">DATE_SUB</span>(<span class="hljs-keyword">CURRENT_DATE</span>(), <span class="hljs-built_in">INTERVAL</span> <span class="hljs-number">1</span> <span class="hljs-keyword">YEAR</span>);
</code></pre>
<p>This would delete users who are marked inactive and have not logged in for over a year. As with updates, deletes in BigQuery are atomic. (BigQuery also supports <code>TRUNCATE TABLE</code> to quickly remove all rows while keeping the table’s schema, or you can <code>DROP TABLE</code> to remove the table entirely.)</p>
<p><strong>MERGE:</strong> BigQuery’s MERGE statement combines insert, update, and delete logic in one operation – essentially an “upsert” capability. MERGE allows you to compare a target table with a source (such as a stage or delta table) and specify actions for when the rows match or don’t match. BigQuery performs the specified inserts/updates/deletes atomically as one transaction (<a target="_blank" href="https://stackoverflow.com/questions/69146082/bigquery-sql-how-to-update-rows-and-insert-new-data#:~:text=You%20are%20looking%20for%20merge,statement">Bigquery SQL how to update rows AND Insert new data - Stack Overflow</a>). This is particularly useful for <strong>incremental data pipelines</strong> where you apply changes (new records and updates) to a master table.</p>
<p>A typical MERGE example is merging daily new data into a master table:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">MERGE</span> <span class="hljs-string">`my_dataset.Inventory`</span> <span class="hljs-keyword">AS</span> T
<span class="hljs-keyword">USING</span> <span class="hljs-string">`my_dataset.NewArrivals`</span> <span class="hljs-keyword">AS</span> S
<span class="hljs-keyword">ON</span> T.product_id = S.product_id
<span class="hljs-keyword">WHEN</span> <span class="hljs-keyword">MATCHED</span> <span class="hljs-keyword">THEN</span> 
  <span class="hljs-keyword">UPDATE</span> <span class="hljs-keyword">SET</span> T.quantity = T.quantity + S.quantity
<span class="hljs-keyword">WHEN</span> <span class="hljs-keyword">NOT</span> <span class="hljs-keyword">MATCHED</span> <span class="hljs-keyword">THEN</span> 
  <span class="hljs-keyword">INSERT</span>(product_id, product_name, quantity) 
  <span class="hljs-keyword">VALUES</span>(S.product_id, S.product_name, S.quantity);
</code></pre>
<p>In this example, the <strong>Inventory</strong> table is updated by adding quantities for products that already exist (matched on product_id), and inserting new rows for products that are not yet in Inventory (<a target="_blank" href="https://stackoverflow.com/questions/59164578/transaction-management-in-google-bigquery#:~:text=MERGE%20mydataset,quantity">Transaction Management in Google Bigquery - Stack Overflow</a>). The MERGE can have multiple <code>WHEN</code> clauses, including <code>WHEN MATCHED [AND condition] THEN DELETE</code> to delete rows that meet some condition. All these changes (updates to some rows, inserts of others) happen in one combined operation. This MERGE effectively keeps the Inventory table in sync with the NewArrivals table.</p>
<p>BigQuery’s MERGE syntax and behavior are similar to ANSI SQL MERGE (as in Oracle or SQL Server). Note that when using MERGE, the source and target can each be a table or subquery, giving flexibility to, for example, merge the result of an aggregation into a table. Because MERGE is atomic, it can be used within a multi-statement transaction (see Section 5) or as a standalone way to ensure data consistency.</p>
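<p>As a sketch of a MERGE that also deletes, extending the Inventory example above (the delete condition is illustrative):</p>
<pre><code class="lang-sql">MERGE `my_dataset.Inventory` AS T
USING `my_dataset.NewArrivals` AS S
ON T.product_id = S.product_id
WHEN MATCHED AND S.quantity = 0 THEN
  DELETE  -- drop products reported as discontinued
WHEN MATCHED THEN
  UPDATE SET T.quantity = S.quantity
WHEN NOT MATCHED THEN
  INSERT (product_id, product_name, quantity)
  VALUES (S.product_id, S.product_name, S.quantity);
</code></pre>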
<p><strong>Legacy SQL vs Standard SQL for DML:</strong> Legacy SQL did not support these DML statements in queries. Data manipulation had to be done outside of query jobs (e.g., via the BigQuery API, or by writing query results to a new table). With Standard SQL, BigQuery can handle transactional modifications. Initially BigQuery imposed a limit of 1,000 DML statements per table per day (to maintain performance) (<a target="_blank" href="https://forum.knime.com/t/google-big-query-exceed-rate-limits-how-to-upload-large-tables/22444#:~:text=Google%20Big%20Query%3A%20,DB%20Loader%20node%20in">Google Big Query: "Exceed rate limits" - How to upload large tables</a>), but as of 2020 this limit has been removed, allowing unlimited DML on tables (<a target="_blank" href="https://cloud.google.com/blog/products/data-analytics/dml-without-limits-now-in-bigquery#:~:text=DML%20without%20limits%2C%20now%20in,DML%20statements%20on%20a%20table">DML without limits, now in BigQuery | Google Cloud Blog</a>). Still, very large or complex DML operations may be slower than using BigQuery’s fast load jobs or using MERGE to batch changes, so usage should be planned accordingly.</p>
<h2 id="heading-4-dcl-data-control-language">4. DCL – Data Control Language</h2>
<p><strong>Definition:</strong> DCL statements control access permissions on database objects (tables, views, etc.) – typically <strong>GRANT</strong> and <strong>REVOKE</strong>. In BigQuery, access control is usually managed by Google Cloud IAM (Identity and Access Management), where you assign roles like BigQuery Data Viewer or Data Editor at the project or dataset level. However, BigQuery has introduced SQL DCL syntax to allow granting and revoking fine-grained permissions at the SQL level for convenience. These statements effectively map to IAM under the hood and let you manage table or dataset access with SQL commands (<a target="_blank" href="https://www.aampe.com/blog/how-to-create-a-table-in-bigquery#:~:text=Assign%20appropriate%20permissions%20to%20control,required%20to%20perform%20a%20task">How to Create a Table in BigQuery: A Step-by-Step Guide</a>).</p>
<p>Using BigQuery’s DCL requires appropriate permissions (you generally need to be a BigQuery admin or the owner of the dataset/table to change its IAM policy). You can grant permissions to users, groups, or service accounts.</p>
<ul>
<li><p><strong>GRANT:</strong> Gives a user, group, or service account an IAM role on a BigQuery resource (such as read access to a table). Note that BigQuery’s DCL differs from most SQL dialects here: instead of individual privileges like SELECT, you grant <em>IAM roles</em> (e.g., <code>roles/bigquery.dataViewer</code>). For example, to grant read access on a table to a specific user:</p>
<pre><code class="lang-sql">  <span class="hljs-keyword">GRANT</span> <span class="hljs-string">`roles/bigquery.dataViewer`</span> <span class="hljs-keyword">ON</span> <span class="hljs-keyword">TABLE</span> <span class="hljs-string">`my_project.my_dataset.sales`</span> 
  <span class="hljs-keyword">TO</span> <span class="hljs-string">"user:john.doe@example.com"</span>;
</code></pre>
<p>  This statement grants read/query access on the <code>sales</code> table to the user <em>john.doe@example.com</em> (the IAM role <strong>BigQuery Data Viewer</strong>, scoped to that single table). Similarly, you could grant at the dataset (schema) level. For instance, to allow a group to read all tables in a dataset:</p>
<pre><code class="lang-sql">  <span class="hljs-keyword">GRANT</span> <span class="hljs-string">`roles/bigquery.dataViewer`</span> <span class="hljs-keyword">ON</span> <span class="hljs-keyword">SCHEMA</span> <span class="hljs-string">`my_project.finance_data`</span> 
  <span class="hljs-keyword">TO</span> <span class="hljs-string">"group:analysts@example.com"</span>;
</code></pre>
<p>  This would grant read access on all current and future tables in the <strong>finance_data</strong> dataset to all members of the <em>analysts@example.com</em> group. (Behind the scenes, BigQuery creates an entry in the dataset’s access list for that group.)</p>
<p>  BigQuery identifies the resource type with the <code>ON TABLE</code>, <code>ON VIEW</code>, or <code>ON SCHEMA</code> (dataset) keywords, and you can grant any applicable predefined or custom role – for example, <code>roles/bigquery.dataEditor</code> at the dataset level for write access.</p>
</li>
</ul>
<ul>
<li><p><strong>REVOKE:</strong> Removes a previously granted role. For example, to revoke the read access we granted above:</p>
<pre><code class="lang-sql">  <span class="hljs-keyword">REVOKE</span> <span class="hljs-string">`roles/bigquery.dataViewer`</span> <span class="hljs-keyword">ON</span> <span class="hljs-keyword">TABLE</span> <span class="hljs-string">`my_project.my_dataset.sales`</span> 
  <span class="hljs-keyword">FROM</span> <span class="hljs-string">"user:john.doe@example.com"</span>;
</code></pre>
<p>  This will strip that user’s access to the table (assuming they didn’t have access via some other route, such as a project-level role). Similarly, <code>REVOKE ... ON SCHEMA ... FROM ...</code> would remove dataset-level access. Revokes take effect immediately – the next query the user tries on that table will fail with a permission error.</p>
</li>
</ul>
<p><strong>Managing Permissions Example:</strong> Suppose we have a dataset <strong>analytics</strong> and we want to grant a data scientist read-only access to a specific table <strong>analytics.marketing_data</strong> without giving access to the whole project. We can run:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">GRANT</span> <span class="hljs-string">`roles/bigquery.dataViewer`</span> <span class="hljs-keyword">ON</span> <span class="hljs-keyword">TABLE</span> <span class="hljs-string">`my_project.analytics.marketing_data`</span> 
<span class="hljs-keyword">TO</span> <span class="hljs-string">"user:data.scientist@company.com"</span>;
</code></pre>
<p>Now that user can query <code>marketing_data</code> but no other tables (unless separately granted). If later we decide they shouldn’t see it, we do:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">REVOKE</span> <span class="hljs-string">`roles/bigquery.dataViewer`</span> <span class="hljs-keyword">ON</span> <span class="hljs-keyword">TABLE</span> <span class="hljs-string">`my_project.analytics.marketing_data`</span> 
<span class="hljs-keyword">FROM</span> <span class="hljs-string">"user:data.scientist@company.com"</span>;
</code></pre>
<p>Under the hood, these DCL statements update BigQuery’s IAM policy for that table (or dataset); a dataset-level grant becomes an entry in the dataset’s access list. BigQuery also has the concept of authorized views (a view that exposes certain data and can be shared instead of the raw table); those are managed by adding the view to the source dataset’s access list, via the console, API, or <code>bq</code> tool.</p>
<p><strong>Legacy SQL:</strong> Legacy BigQuery had no SQL GRANT/REVOKE; all permission management was via Cloud IAM in the console or command-line. The introduction of DCL in Standard SQL makes it easier to script and automate access control. It’s important to adhere to the principle of least privilege (only grant the minimum required access) (<a target="_blank" href="https://www.castordoc.com/how-to/how-to-use-grant-role-in-bigquery#:~:text=When%20granting%20roles%20in%20BigQuery%2C,unauthorized%20access%20or%20accidental%20data">How to use grant role in BigQuery?</a>), especially since BigQuery often holds sensitive data.</p>
<h2 id="heading-5-tcl-transaction-control-language">5. TCL – Transaction Control Language</h2>
<p><strong>Definition:</strong> TCL statements manage transactions – units of work that can be committed or rolled back together. In traditional SQL, this includes commands like <strong>BEGIN</strong> (start transaction), <strong>COMMIT</strong> (save changes), and <strong>ROLLBACK</strong> (undo changes). BigQuery historically operated on an <em>append-only</em> model without transactions (each query was its own atomic operation). However, BigQuery now supports <strong>multi-statement transactions</strong> inside BigQuery <strong>scripting</strong> (and within stored procedures or via API) for executing multiple DML statements as a single atomic unit (<a target="_blank" href="https://stackoverflow.com/questions/59164578/transaction-management-in-google-bigquery#:~:text=BigQuery%20supports%20multi,roll%20back%20the%20changes%20atomically">Transaction Management in Google Bigquery - Stack Overflow</a>).</p>
<p><strong>BigQuery Multi-statement Transactions:</strong> In BigQuery scripting, you can start a transaction with <code>BEGIN TRANSACTION;</code>, execute a series of statements (DML or even certain DDL on temporary tables), then end with <code>COMMIT TRANSACTION;</code> to apply changes atomically, or <code>ROLLBACK TRANSACTION;</code> to cancel if something went wrong (<a target="_blank" href="https://medium.com/codex/now-available-multi-statements-transactions-in-bigquery-35016d79b68a#:~:text=Now%20Available%3A%20Multi%20Statements%20Transactions,ROLLBACK%20TRANSACTION%20to%20abandons">Now Available: Multi Statements Transactions in BigQuery - Medium</a>) (<a target="_blank" href="https://dev.to/stack-labs/bigquery-transactions-over-multiple-queries-with-sessions-2ll5#:~:text=BigQuery%20transactions%20over%20multiple%20queries%2C,ending%20with%20COMMIT%20TRANSACTION%3B">BigQuery transactions over multiple queries, with sessions</a>). All the statements in between will either all succeed (on commit) or all be undone (on rollback). This is crucial when you need to ensure consistency across multiple operations – for example, updating two tables in sync, or doing a “delete then insert” (replace) safely.</p>
<p><strong>Example Transaction:</strong> Suppose we want to transfer inventory from one warehouse to another. We need to deduct from one table and add to another, and ensure both succeed or neither does. In a BigQuery script, we could do:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">BEGIN</span> <span class="hljs-keyword">TRANSACTION</span>;

<span class="hljs-comment">-- 1. Insert new stock into Warehouse B's table</span>
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> <span class="hljs-string">`my_dataset.warehouseB_inventory`</span> (product_id, qty)
<span class="hljs-keyword">SELECT</span> product_id, qty 
<span class="hljs-keyword">FROM</span> <span class="hljs-string">`my_dataset.new_stock_delivery`</span>
<span class="hljs-keyword">WHERE</span> warehouse = <span class="hljs-string">'B'</span>;

<span class="hljs-comment">-- 2. Remove that stock from Warehouse A's table</span>
<span class="hljs-keyword">DELETE</span> <span class="hljs-keyword">FROM</span> <span class="hljs-string">`my_dataset.warehouseA_inventory`</span>
<span class="hljs-keyword">WHERE</span> product_id <span class="hljs-keyword">IN</span> (
    <span class="hljs-keyword">SELECT</span> product_id <span class="hljs-keyword">FROM</span> <span class="hljs-string">`my_dataset.new_stock_delivery`</span> <span class="hljs-keyword">WHERE</span> warehouse = <span class="hljs-string">'B'</span>
);

<span class="hljs-keyword">COMMIT</span> <span class="hljs-keyword">TRANSACTION</span>;
</code></pre>
<p>In this example, we start a transaction. Then we <strong>INSERT</strong> rows for products delivered to Warehouse B and <strong>DELETE</strong> those products from Warehouse A’s inventory. Only when both statements execute successfully do we call <strong>COMMIT</strong> to finalize the changes. If any error occurred in between (or if a condition we check indicates a problem), we could issue <code>ROLLBACK TRANSACTION;</code> instead, and BigQuery would discard the partial changes from the insert/delete – nothing becomes visible to other queries until the commit.</p>
<p>BigQuery’s multi-statement transactions can span multiple tables (even across datasets or projects) and multiple DML operations (<a target="_blank" href="https://stackoverflow.com/questions/59164578/transaction-management-in-google-bigquery#:~:text=,stages%2C%20based%20on%20intermediate%20computations">Transaction Management in Google Bigquery - Stack Overflow</a>) (<a target="_blank" href="https://stackoverflow.com/questions/59164578/transaction-management-in-google-bigquery#:~:text=,DROP%20TABLE%20tmp">Transaction Management in Google Bigquery - Stack Overflow</a>). All locks are managed behind the scenes by BigQuery. Note that BigQuery transactions are scoped within a <strong>single script execution</strong> or session – you cannot yet have an interactive multi-step transaction across separate query jobs; it must be done in one script or procedure call.</p>
<p>Inside a transaction, you can also create temporary tables or use SELECT queries to assist your logic. BigQuery currently does <em>not</em> allow arbitrary DDL on permanent tables inside a transaction (you can only create temp tables or do certain DDL like creating a temp function).</p>
<p>If you don’t explicitly use <code>BEGIN/COMMIT</code>, each DML statement in BigQuery is by default atomic on its own. For many use cases, a single MERGE is enough to apply multiple changes atomically (obviating the need for an explicit transaction). But if you do need to break a complex operation into multiple steps, BigQuery transactions ensure all-or-nothing execution (<a target="_blank" href="https://stackoverflow.com/questions/59164578/transaction-management-in-google-bigquery#:~:text=BigQuery%20supports%20multi,roll%20back%20the%20changes%20atomically">Transaction Management in Google Bigquery - Stack Overflow</a>).</p>
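<p>For instance, a “delete then insert” refresh of a single table can often be collapsed into one atomic MERGE instead of a transaction. A minimal sketch, assuming hypothetical <code>inventory</code> (target) and <code>staging</code> (source) tables:</p>
<pre><code class="lang-sql">-- Atomic upsert: no explicit transaction needed for a single table
MERGE `my_dataset.inventory` AS T
USING `my_dataset.staging` AS S
ON T.product_id = S.product_id
WHEN MATCHED THEN
  UPDATE SET T.qty = S.qty                 -- replace existing rows
WHEN NOT MATCHED THEN
  INSERT (product_id, qty) VALUES (S.product_id, S.qty);
</code></pre>
<p>Because the whole statement commits or fails as a unit, readers never observe the table in a half-refreshed state.</p>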
<p><strong>Legacy SQL:</strong> There was no concept of user-controlled transactions in legacy BigQuery. Every query was standalone. The introduction of scripting and transactions in Standard SQL (around 2020) was a significant improvement for ETL workflows that require multi-step operations with rollback on failure.</p>
<h2 id="heading-6-aggregations-and-group-by-in-bigquery">6. Aggregations and GROUP BY in BigQuery</h2>
<p>Aggregations (using functions like <code>SUM</code>, <code>AVG</code>, <code>COUNT</code>, etc. along with <code>GROUP BY</code> clauses) work as in standard SQL, with some BigQuery enhancements.</p>
<p>When you use an aggregate function in a SELECT, you normally need a <code>GROUP BY</code> clause to define how rows are grouped (except when aggregating the entire table). BigQuery Standard SQL follows the standard rule: every non-aggregated select expression must be either in the GROUP BY or be an aggregation of a group. (Legacy SQL was more permissive: it would implicitly treat non-grouped fields as ANY_VALUE, but this could lead to indeterminate results. Standard SQL is stricter, enforcing proper grouping or aggregation.)</p>
<p><strong>Basic Example:</strong></p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> department, <span class="hljs-keyword">AVG</span>(salary) <span class="hljs-keyword">AS</span> avg_salary, <span class="hljs-keyword">MAX</span>(salary) <span class="hljs-keyword">AS</span> max_salary
<span class="hljs-keyword">FROM</span> <span class="hljs-string">`company.employees`</span>
<span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> department;
</code></pre>
<p>This returns one row per department with the average and maximum salary in each. BigQuery handles large-scale aggregations efficiently in its distributed engine.</p>
<p><strong>BigQuery-Specific Behavior and Functions:</strong></p>
<ul>
<li><strong>Aggregates over zero rows:</strong> If no rows survive the WHERE clause, an ungrouped <code>COUNT</code> returns 0 (not NULL), while <code>SUM</code> and <code>AVG</code> return NULL. With a GROUP BY, no groups are formed at all, so the query returns zero rows. This is standard SQL behavior, and BigQuery adheres to it.</li>
</ul>
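<p>A quick illustration, assuming a hypothetical <code>orders</code> table where no row matches the filter:</p>
<pre><code class="lang-sql">SELECT
  COUNT(*)    AS cnt,    -- returns 0
  SUM(amount) AS total   -- returns NULL
FROM `my_dataset.orders`
WHERE amount &lt; 0;        -- assume no rows have a negative amount
-- With a GROUP BY, the same filter would simply yield zero result rows.
</code></pre>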
<ul>
<li><strong>Approximate Aggregation:</strong> For very large datasets, exact distinct counts or quantiles can be expensive. BigQuery provides approximate aggregate functions that trade a tiny error for big speed gains. For example, <code>APPROX_COUNT_DISTINCT(column)</code> uses HyperLogLog++ to estimate the number of distinct values in a column (<a target="_blank" href="https://leadpanda.media/en/blog/yak-skorotiti-vitrati-na-bigquery-10-perevirenix-lifexakiv-dlya-optimizacii/#:~:text=BigQuery%20offers%20functions%20for%20approximate,reliable%20for%20most%20analytical%20tasks">How to reduce BigQuery costs: 10 effective life hacks for optimization | Lead Panda Media</a>). The result is usually within about 1% of the true value, but the computation is faster and uses less memory, which can be crucial at petabyte scale. Similarly, BigQuery has <code>APPROX_TOP_COUNT</code> and <code>APPROX_TOP_SUM</code> for approximate heavy-hitter analysis. Use these when exact precision isn’t required – it can <strong>dramatically</strong> reduce query cost on huge tables (<a target="_blank" href="https://leadpanda.media/en/blog/yak-skorotiti-vitrati-na-bigquery-10-perevirenix-lifexakiv-dlya-optimizacii/#:~:text=BigQuery%20offers%20functions%20for%20approximate,reliable%20for%20most%20analytical%20tasks">How to reduce BigQuery costs: 10 effective life hacks for optimization | Lead Panda Media</a>). (Legacy SQL had <code>COUNT(DISTINCT x)</code> but would error if the number of distinct elements was too high; approximate functions solve that problem in Standard SQL.)</li>
</ul>
<ul>
<li><strong>COUNTIF:</strong> BigQuery includes the convenient conditional aggregation function <code>COUNTIF(condition)</code>, which counts rows where the condition is true (<a target="_blank" href="https://leadpanda.media/en/blog/yak-skorotiti-vitrati-na-bigquery-10-perevirenix-lifexakiv-dlya-optimizacii/#:~:text=Use%20rough%20estimates%20for%20large,aggregations">How to reduce BigQuery costs: 10 effective life hacks for optimization | Lead Panda Media</a>). For example, <code>COUNTIF(status = "ERROR")</code> counts only error-status rows. There is no corresponding SUMIF in BigQuery; for conditional sums use <code>SUM(IF(condition, expr, 0))</code> or <code>SUM(CASE WHEN condition THEN expr ELSE 0 END)</code>.</li>
</ul>
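<p>To make the conditional-aggregation shorthand concrete, here is a sketch over a hypothetical <code>logs</code> table:</p>
<pre><code class="lang-sql">SELECT
  COUNTIF(status = 'ERROR') AS error_rows,
  -- conditional sum expressed with SUM(IF(...))
  SUM(IF(status = 'ERROR', bytes_processed, 0)) AS error_bytes
FROM `my_dataset.logs`;
</code></pre>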
<ul>
<li><strong>ANY_VALUE():</strong> BigQuery supports <code>ANY_VALUE(field)</code> which returns an arbitrary value of <code>field</code> from within each group. This is useful when you know all values in the group are the same (or you don’t care which one is picked) and you want to avoid grouping by it. (MySQL has a similar function.) For example, if you group by <code>user_id</code> and want to pick <em>any</em> single <code>email</code> for that user (assuming email doesn’t vary per row), you could do: <code>SELECT user_id, ANY_VALUE(email) FROM ... GROUP BY user_id;</code>. This function was introduced to help with migrating queries from legacy SQL’s behavior of non-grouped columns (<a target="_blank" href="https://stackoverflow.com/questions/34056485/select-all-columns-except-some-in-google-bigquery#:~:text=queries%20using%20the%20SELECT%20,column">Select All Columns Except Some in Google BigQuery? - Stack Overflow</a>).</li>
</ul>
<ul>
<li><p><strong>Grouping Sets / Rollup / Cube:</strong> As mentioned, BigQuery allows advanced grouping. For instance:</p>
<pre><code class="lang-sql">  <span class="hljs-keyword">SELECT</span> region, product, <span class="hljs-keyword">SUM</span>(sales) <span class="hljs-keyword">as</span> total_sales
  <span class="hljs-keyword">FROM</span> sales
  <span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> <span class="hljs-keyword">ROLLUP</span>(region, product);
</code></pre>
<p>  This query would produce aggregated totals at multiple levels: by (region, product), by region (overall product total per region), and a grand total (all regions/products) (<a target="_blank" href="https://medium.com/codex/google-bigquery-now-supports-cubes-e50ecd39e447#:~:text=Google%20BigQuery%20now%20supports%20Cubes,BY%20CUBE%20clause%20completely%20new">Google BigQuery now supports Cubes | by Christianlauer - Medium</a>). <code>GROUP BY CUBE(a, b)</code> would produce all combinations (by a, by b, by both, by neither). And <code>GROUPING SETS</code> allows explicit list of group combinations. BigQuery also provides the <code>GROUPING()</code> function to identify subtotal rows (returning 1 for subtotal vs 0 for detail row). These features let you produce pivot-table style summaries in a single query. (These capabilities did not exist in legacy SQL; they are part of Standard SQL improvements.)</p>
</li>
</ul>
<ul>
<li><strong>Distinct Aggregates:</strong> BigQuery can do <code>COUNT(DISTINCT col)</code> like other SQL engines, and in Standard SQL the result is exact. One nuance: BigQuery allows <strong>multiple distinct aggregates in the same query</strong> (e.g., <code>COUNT(DISTINCT col1), COUNT(DISTINCT col2)</code>), which some SQL databases do not support without more complex workarounds. BigQuery handles multiple distinct counts by adding internal query stages. Just be mindful of performance: each distinct aggregate may add overhead.</li>
</ul>
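<p>For example, counting distinct users and distinct sessions in one pass (over a hypothetical <code>events</code> table):</p>
<pre><code class="lang-sql">SELECT
  COUNT(DISTINCT user_id)    AS unique_users,
  COUNT(DISTINCT session_id) AS unique_sessions
FROM `my_dataset.events`;
</code></pre>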
<p><strong>Example – Using APPROX_COUNT_DISTINCT:</strong></p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> 
  country, 
  APPROX_COUNT_DISTINCT(user_id) <span class="hljs-keyword">AS</span> approx_unique_users
<span class="hljs-keyword">FROM</span> <span class="hljs-string">`my_dataset.web_logs`</span>
<span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> country;
</code></pre>
<p>This query quickly estimates the number of unique users per country in the web_logs table using HyperLogLog++ (<a target="_blank" href="https://leadpanda.media/en/blog/yak-skorotiti-vitrati-na-bigquery-10-perevirenix-lifexakiv-dlya-optimizacii/#:~:text=BigQuery%20offers%20functions%20for%20approximate,reliable%20for%20most%20analytical%20tasks">How to reduce BigQuery costs: 10 effective life hacks for optimization | Lead Panda Media</a>). The approximation significantly reduces computational load for extremely large log tables with minimal loss of accuracy (usually ~1% error). If exact counts are needed, you could use <code>COUNT(DISTINCT user_id)</code> at higher cost.</p>
<p><strong>Example – Grouping Sets (Rollup):</strong></p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> 
  region, product, 
  <span class="hljs-keyword">SUM</span>(sales) <span class="hljs-keyword">AS</span> total_sales,
  <span class="hljs-keyword">GROUPING</span>(region) <span class="hljs-keyword">AS</span> region_total_flag,
  <span class="hljs-keyword">GROUPING</span>(product) <span class="hljs-keyword">AS</span> product_total_flag
<span class="hljs-keyword">FROM</span> <span class="hljs-string">`my_dataset.sales`</span>
<span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> <span class="hljs-keyword">ROLLUP</span>(region, product);
</code></pre>
<p>This will produce rows for each (region, product) combination, plus subtotal rows where product is NULL (total per region) and a grand total row where both region and product are NULL (<a target="_blank" href="https://medium.com/codex/google-bigquery-now-supports-cubes-e50ecd39e447#:~:text=Google%20BigQuery%20now%20supports%20Cubes,BY%20CUBE%20clause%20completely%20new">Google BigQuery now supports Cubes | by Christianlauer - Medium</a>). The <code>GROUPING()</code> function returns 1 when the column is NULL because of a subtotal. For example, the grand total row will have <code>region_total_flag=1</code> and <code>product_total_flag=1</code>. Such SQL constructs can eliminate the need for manual UNION of multiple grouping queries.</p>
<h2 id="heading-7-using-having-with-aggregations">7. Using HAVING with Aggregations</h2>
<p>The <strong>HAVING</strong> clause is used to filter aggregated results. It is applied <strong>after</strong> the <code>GROUP BY</code> step, unlike <code>WHERE</code> which filters before grouping. In BigQuery (as in standard SQL), you use HAVING to impose conditions on aggregate values (SUM, COUNT, AVG, etc.) for each group (<a target="_blank" href="https://www.owox.com/blog/articles/bigquery-sql-where-vs-having-vs-qualify#:~:text=The%20HAVING%20clause%20filters%20records,for%20conditions%20involving%20aggregate%20functions">A Complete Guide to WHERE, HAVING, and QUALIFY in SQL</a>).</p>
<p>For example, if we want to find departments with more than 3 employees, we would use HAVING on a COUNT:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> department, <span class="hljs-keyword">COUNT</span>(employee_id) <span class="hljs-keyword">AS</span> num_employees
<span class="hljs-keyword">FROM</span> <span class="hljs-string">`owox-analytics.myDataset.employee_data`</span>
<span class="hljs-keyword">WHERE</span> <span class="hljs-keyword">status</span> = <span class="hljs-string">'active'</span>
<span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> department
<span class="hljs-keyword">HAVING</span> <span class="hljs-keyword">COUNT</span>(employee_id) &gt; <span class="hljs-number">3</span>;
</code></pre>
<p>(<a target="_blank" href="https://www.owox.com/blog/articles/bigquery-sql-where-vs-having-vs-qualify#:~:text=SELECT%20department%2C%20COUNT%28employee_id%29%20FROM%20%60owox,3">A Complete Guide to WHERE, HAVING, and QUALIFY in SQL</a>)</p>
<p>This query groups employees by department (considering only active employees due to the WHERE clause), then the HAVING clause filters out any groups that have 3 or fewer employees (<a target="_blank" href="https://www.owox.com/blog/articles/bigquery-sql-where-vs-having-vs-qualify#:~:text=,with%20more%20than%203%20employees">A Complete Guide to WHERE, HAVING, and QUALIFY in SQL</a>). The result will only include departments with <strong>count &gt; 3</strong>.</p>
<p>Let’s break down the logic:</p>
<ul>
<li>The <strong>WHERE</strong> filter (<code>status = 'active'</code>) runs first, limiting rows to active employees before grouping.</li>
</ul>
<ul>
<li><strong>GROUP BY department</strong> then aggregates the data so we have one row per department.</li>
</ul>
<ul>
<li><strong>COUNT(employee_id)</strong> computes the number of employees in each department.</li>
</ul>
<ul>
<li>The <strong>HAVING COUNT(employee_id) &gt; 3</strong> condition is evaluated on each aggregated group, and discards any department that doesn’t satisfy it (<a target="_blank" href="https://www.owox.com/blog/articles/bigquery-sql-where-vs-having-vs-qualify#:~:text=,with%20more%20than%203%20employees">A Complete Guide to WHERE, HAVING, and QUALIFY in SQL</a>).</li>
</ul>
<ul>
<li>The SELECT outputs the department name and the count (alias <em>num_employees</em>).</li>
</ul>
<p>So, if “HR” had 5 active employees and “Sales” had 2, the result would include “HR, 5” but not “Sales, 2”.</p>
<p>Important notes about HAVING in BigQuery Standard SQL:</p>
<ul>
<li>You can reference aggregate expressions by alias in HAVING. In our example, we could have written <code>HAVING num_employees &gt; 3</code> because we alias <code>COUNT(employee_id)</code> as <em>num_employees</em>. BigQuery (like many SQLs) allows this alias usage in HAVING.</li>
</ul>
<ul>
<li>If there is no GROUP BY, HAVING can still be used – it treats the entire result as one group. E.g., <code>SELECT SUM(x) AS total FROM table HAVING total &gt; 100</code> would return nothing if the total is not &gt; 100. But usually, HAVING is paired with GROUP BY.</li>
</ul>
<ul>
<li>Legacy SQL in BigQuery also had HAVING, with similar usage, but in legacy SQL you might see HAVING used without GROUP BY as a workaround to filter on an aggregate. In Standard SQL, you can often use a window function + QUALIFY or a subquery instead, but HAVING remains the direct way to filter grouped results.</li>
</ul>
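<p>As a sketch of the QUALIFY alternative: QUALIFY filters on a window-function result much as HAVING filters on an aggregate (over a hypothetical <code>sales</code> table):</p>
<pre><code class="lang-sql">-- Keep only the single highest-amount sale per product, without a subquery
SELECT product, amount
FROM `my_dataset.sales`
QUALIFY ROW_NUMBER() OVER (PARTITION BY product ORDER BY amount DESC) = 1;
</code></pre>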
<p><strong>Example – HAVING vs WHERE:</strong></p>
<p>Suppose you want all products that have total sales &gt; 1000 in a sales table:</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- Incorrect: This will fail or return wrong results</span>
<span class="hljs-keyword">SELECT</span> product, <span class="hljs-keyword">SUM</span>(amount) 
<span class="hljs-keyword">FROM</span> sales
<span class="hljs-keyword">WHERE</span> <span class="hljs-keyword">SUM</span>(amount) &gt; <span class="hljs-number">1000</span>   <span class="hljs-comment">-- not allowed, aggregate in WHERE</span>
<span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> product;
</code></pre>
<p>This is invalid because you cannot use an aggregate (<code>SUM(amount)</code>) in a WHERE clause. The correct approach is to use HAVING:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> product, <span class="hljs-keyword">SUM</span>(amount) <span class="hljs-keyword">as</span> total_sales
<span class="hljs-keyword">FROM</span> sales
<span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> product
<span class="hljs-keyword">HAVING</span> total_sales &gt; <span class="hljs-number">1000</span>;
</code></pre>
<p>Now, <code>HAVING total_sales &gt; 1000</code> will filter the grouped products, and only those with sum &gt; 1000 remain. In BigQuery, this works as expected (you could also write <code>HAVING SUM(amount) &gt; 1000</code> directly) (<a target="_blank" href="https://www.owox.com/blog/articles/bigquery-sql-where-vs-having-vs-qualify#:~:text=The%20HAVING%20clause%20filters%20records,for%20conditions%20involving%20aggregate%20functions">A Complete Guide to WHERE, HAVING, and QUALIFY in SQL</a>). If you had an initial filter on individual rows (say, <code>WHERE region = 'US'</code> to consider only US sales), that would be applied <em>before</em> the grouping.</p>
<p>To summarize, use <strong>WHERE</strong> for conditions on raw rows (especially non-aggregated columns) and <strong>HAVING</strong> for conditions on aggregated results. BigQuery enforces this order of execution, just like standard SQL, ensuring that HAVING only sees grouped results (<a target="_blank" href="https://www.owox.com/blog/articles/bigquery-sql-where-vs-having-vs-qualify#:~:text=,LIMIT">A Complete Guide to WHERE, HAVING, and QUALIFY in SQL</a>) (<a target="_blank" href="https://www.owox.com/blog/articles/bigquery-sql-where-vs-having-vs-qualify#:~:text=The%20HAVING%20clause%20filters%20records,for%20conditions%20involving%20aggregate%20functions">A Complete Guide to WHERE, HAVING, and QUALIFY in SQL</a>).</p>
<h2 id="heading-8-window-functions-over-partition-by-advanced-sql-analytics">8. Window Functions (OVER, PARTITION BY) – Advanced SQL Analytics</h2>
<p>Window functions (also known as analytic functions) are powerful in BigQuery for performing calculations across sets of rows related to the current row, without collapsing those rows into a single result. They use the <code>OVER(...)</code> clause to define a “window” of rows for the calculation. BigQuery’s implementation is fully compliant with standard SQL window functions and adds a few functions of its own.</p>
<p><strong>What Are Window Functions?</strong><br />Unlike GROUP BY aggregations which <strong>reduce</strong> rows (one result per group), window functions produce a value for <strong>each row</strong> while looking at a window of multiple rows (which you define using PARTITION BY and ORDER BY in the OVER clause). This lets you compute running totals, ranks, moving averages, percentiles, lead/lag comparisons, etc., all while retaining the detail of each row (<a target="_blank" href="https://www.owox.com/blog/articles/bigquery-window-functions#:~:text=Window%20functions%20are%20powerful%20tools,related%20to%20the%20current%20row">Using Window Functions in BigQuery: A 2025 Guide</a>) (<a target="_blank" href="https://www.owox.com/blog/articles/bigquery-window-functions#:~:text=Unlike%20aggregate%20functions%2C%20window%20functions,incredibly%20useful%20for%20analytical%20tasks">Using Window Functions in BigQuery: A 2025 Guide</a>). They are called “window” functions because each calculation considers a frame of rows (the window) relative to the current row’s position.</p>
<p><strong>Basic Syntax:</strong></p>
<pre><code class="lang-sql">function_name(expression) 
OVER (
  [PARTITION BY partition_columns...] 
  [ORDER BY sort_columns [ASC|DESC]] 
  [window_frame_clause]
)
</code></pre>
<ul>
<li><strong>PARTITION BY</strong> divides the data into partitions (sub-groups) for the function, similar to GROUP BY but without collapsing – the function resets at partition boundaries (<a target="_blank" href="https://www.owox.com/blog/articles/bigquery-window-functions#:~:text=PARTITION%20BY%20Clause">Using Window Functions in BigQuery: A 2025 Guide</a>) (<a target="_blank" href="https://www.owox.com/blog/articles/bigquery-window-functions#:~:text=,their%20salary%20in%20descending%20order">Using Window Functions in BigQuery: A 2025 Guide</a>).</li>
<li><strong>ORDER BY</strong> defines the ordering within each partition that the window function will use (e.g., for running totals or ranking).</li>
<li>The optional <strong>frame clause</strong> (like <code>ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW</code>) further refines which rows in the partition are considered for the current row’s calculation.</li>
</ul>
<p>BigQuery supports three main categories of window functions (<a target="_blank" href="https://www.owox.com/blog/articles/bigquery-window-functions#:~:text=BigQuery%20offers%20a%20variety%20of,navigation%20functions%2C%20and%20numbering%20functions">Using Window Functions in BigQuery: A 2025 Guide</a>):</p>
<ol>
<li><strong>Aggregate functions</strong> as window functions: e.g. <code>SUM(), AVG(), MIN(), MAX(), COUNT()</code> can be used as window functions by adding OVER(). This produces a running or total aggregate value <strong>per row</strong> rather than one per group. Without a frame clause, <code>SUM() OVER (PARTITION BY X)</code> gives the total sum per partition (a group total attached to each row of the group), and <code>SUM() OVER (ORDER BY Y)</code> gives a cumulative sum from the start up to the current row (when ORDER BY is present, the default frame is <code>RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW</code>).</li>
<li><strong>Ranking functions</strong>: <code>ROW_NUMBER(), RANK(), DENSE_RANK(), NTILE(n)</code>. These assign rank numbers based on ordering within partitions. For example, <code>ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC)</code> gives 1 to the highest-paid person in each department, 2 to the next, and so on.</li>
<li><strong>Analytic functions for navigation</strong>: e.g. <code>LAG(value, N)</code> and <code>LEAD(value, N)</code> to pull data from previous or next rows, and <code>FIRST_VALUE(value)</code>/<code>LAST_VALUE(value)</code> to get the first/last value in the window. These allow comparisons across row boundaries (e.g., comparing this row’s value to last week’s value in a time series).</li>
</ol>
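<p>As a sketch of the navigation functions, the following query (using the same hypothetical <code>my_dataset.daily_sales</code> table as the running-total examples below) compares each day’s sales to the previous day’s:</p>
<pre><code class="lang-sql">SELECT 
  sales_date,
  amount,
  LAG(amount, 1) OVER (ORDER BY sales_date) AS prev_day_amount,
  amount - LAG(amount, 1) OVER (ORDER BY sales_date) AS day_over_day_change
FROM `my_dataset.daily_sales`
ORDER BY sales_date;
</code></pre>
<p>LAG returns NULL for the first row (there is no previous row), so the change column is NULL there as well; an optional third argument to LAG supplies a default value instead.</p>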
<p><strong>Example 1 – Ranking:</strong> Let’s rank employees by salary within each department:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> 
  employee_id,
  department,
  salary,
  <span class="hljs-keyword">RANK</span>() <span class="hljs-keyword">OVER</span> (<span class="hljs-keyword">PARTITION</span> <span class="hljs-keyword">BY</span> department <span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> salary <span class="hljs-keyword">DESC</span>) <span class="hljs-keyword">AS</span> salary_rank
<span class="hljs-keyword">FROM</span> <span class="hljs-string">`company.employees`</span>;
</code></pre>
<p>This will output each employee and their rank within their department by salary (1 = highest salary in that dept, etc.). The <strong>PARTITION BY department</strong> means ranking restarts for each department (<a target="_blank" href="https://www.owox.com/blog/articles/bigquery-window-functions#:~:text=,Specifies%20the%20source%20table%20for">Using Window Functions in BigQuery: A 2025 Guide</a>). The <strong>ORDER BY salary DESC</strong> means highest salary gets rank 1. Employees with equal salaries get the same rank number (and rank numbers will have gaps if there's a tie, since RANK is being used; use DENSE_RANK() if you want no gaps). If we wanted a global rank ignoring departments, we’d omit the partition clause. If we wanted row number instead (no gaps, strict ordering), we’d use ROW_NUMBER(). This kind of query could help, for example, to find the top 3 earners in each department (you would then add <code>QUALIFY salary_rank &lt;= 3</code> in BigQuery to filter the top 3 per dept).</p>
<p><strong>Example 2 – Running Total:</strong> Suppose we have a table of daily sales and we want a running cumulative sales amount by date:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> 
  sales_date,
  amount,
  <span class="hljs-keyword">SUM</span>(amount) <span class="hljs-keyword">OVER</span> (<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> sales_date <span class="hljs-keyword">ROWS</span> <span class="hljs-keyword">BETWEEN</span> <span class="hljs-keyword">UNBOUNDED</span> <span class="hljs-keyword">PRECEDING</span> <span class="hljs-keyword">AND</span> <span class="hljs-keyword">CURRENT</span> <span class="hljs-keyword">ROW</span>) <span class="hljs-keyword">AS</span> running_total
<span class="hljs-keyword">FROM</span> <span class="hljs-string">`my_dataset.daily_sales`</span>
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> sales_date;
</code></pre>
<p>This uses <code>SUM(amount) OVER (ORDER BY sales_date ... CURRENT ROW)</code> to calculate a cumulative sum up to the current row (assuming one row per date, this is a running total over time). With an ORDER BY and no explicit frame, BigQuery defaults to <code>RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW</code> (equivalent to the ROWS frame here, since each date appears once), so the frame clause could be omitted and the result would still be a cumulative sum from the start. The result lists each date, the sales for that date, and the total sales from the beginning up through that date. Unlike a GROUP BY, we still see one row per date (not one final total). This is very useful for time-series analysis. Window aggregates like this also enable moving averages, by using a frame of a fixed width (e.g., the last 7 days).</p>
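<p>A moving average is just a narrower frame. Assuming one row per date, a 7-row window approximates a 7-day moving average:</p>
<pre><code class="lang-sql">SELECT 
  sales_date,
  amount,
  AVG(amount) OVER (
    ORDER BY sales_date 
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS moving_avg_7d
FROM `my_dataset.daily_sales`
ORDER BY sales_date;
</code></pre>
<p>Note that <code>ROWS BETWEEN 6 PRECEDING AND CURRENT ROW</code> counts rows, not days; if the date series has gaps, one option is to order by a numeric expression such as <code>UNIX_DATE(sales_date)</code> and use a <code>RANGE</code> frame so the window spans actual days.</p>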
<p><strong>Example 3 – Partitioned Window:</strong> Using the same sales example, if we partition by year and compute running totals within each year:</p>
<pre><code class="lang-sql">SELECT 
  EXTRACT(YEAR FROM sales_date) AS year,
  sales_date,
  amount,
  SUM(amount) OVER (
       PARTITION BY EXTRACT(YEAR FROM sales_date) 
       ORDER BY sales_date 
       ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS year_running_total
FROM `my_dataset.daily_sales`
ORDER BY sales_date;
</code></pre>
<p>Here <strong>PARTITION BY EXTRACT(YEAR FROM sales_date)</strong> means the running total resets at the start of each year (each year is treated separately) (<a target="_blank" href="https://www.owox.com/blog/articles/bigquery-window-functions#:~:text=,ROW%3A%20Ensures%20the%20sum%20includes">Using Window Functions in BigQuery: A 2025 Guide</a>). (BigQuery Standard SQL has no <code>YEAR()</code> function; <code>EXTRACT(YEAR FROM ...)</code> is the equivalent.) So Jan 1 of each year starts a new cumulative sum. Partitioning is helpful when you want window calculations within subgroups (per department, per year, per region, etc.).</p>
<p><strong>Window vs. GROUP BY:</strong> It’s worth noting the difference: <em>GROUP BY collapses rows</em>, but <em>window functions produce values that are attached to each row</em>. For instance, a GROUP BY year to get total sales per year would return 1 row per year. The window SUM...PARTITION BY year returns the total per year <strong>on each row of that year’s data</strong> (<a target="_blank" href="https://www.owox.com/blog/articles/bigquery-window-functions#:~:text=Understanding%20the%20Difference%20Between%20GROUP,BY%20and%20Window%20Functions">Using Window Functions in BigQuery: A 2025 Guide</a>) (<a target="_blank" href="https://www.owox.com/blog/articles/bigquery-window-functions#:~:text=match%20at%20L627%20Window%20functions%2C,cumulative%20sum">Using Window Functions in BigQuery: A 2025 Guide</a>). You might use that to calculate each day’s share of the annual total, for example. Window functions let you mix detail and summary in one result.</p>
<p><strong>BigQuery Specifics:</strong> BigQuery supports all standard window functions. In terms of performance, window functions are efficient in BigQuery, but do consider the data volume – partitioning by a high-cardinality field or not partitioning at all (window over the entire table) means a lot of data to process in each function. If you only need a grouped result, use GROUP BY; use window functions when you need the detail with the analytic result. BigQuery also now supports the <strong>QUALIFY</strong> clause (as discussed) to filter the output of window functions easily, which is particularly useful with ranking and row_number queries (e.g., QUALIFY RANK() ... = 1 to get top N per group).</p>
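<p>A minimal sketch of the QUALIFY pattern, reusing the employees example from above to keep only each department’s top earner:</p>
<pre><code class="lang-sql">SELECT 
  employee_id,
  department,
  salary
FROM `company.employees`
QUALIFY RANK() OVER (PARTITION BY department ORDER BY salary DESC) = 1;
</code></pre>
<p>Without QUALIFY, the same result would require wrapping the window function in a subquery and filtering on its alias in the outer query.</p>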
<p><strong>Practical Example – Combining Window and Aggregation:</strong></p>
<p>A common use-case is to find the contribution of each row to a group total. We can use a window sum (total per group on each row) along with the row’s value:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> 
  department,
  employee,
  salary,
  <span class="hljs-keyword">SUM</span>(salary) <span class="hljs-keyword">OVER</span> (<span class="hljs-keyword">PARTITION</span> <span class="hljs-keyword">BY</span> department) <span class="hljs-keyword">AS</span> total_dept_salary,
  salary * <span class="hljs-number">100.0</span> / <span class="hljs-keyword">SUM</span>(salary) <span class="hljs-keyword">OVER</span> (<span class="hljs-keyword">PARTITION</span> <span class="hljs-keyword">BY</span> department) <span class="hljs-keyword">AS</span> percent_of_department
<span class="hljs-keyword">FROM</span> <span class="hljs-string">`company.employees`</span>;
</code></pre>
<p>This produces each employee’s salary, the total salary of their department (repeated for all in dept), and the percentage of the department’s payroll that the employee’s salary represents. Here <code>SUM(salary) OVER (PARTITION BY department)</code> gives the department total attached to each row (<a target="_blank" href="https://www.owox.com/blog/articles/bigquery-window-functions#:~:text=,their%20salary%20in%20descending%20order">Using Window Functions in BigQuery: A 2025 Guide</a>), and we use it in a calculation. This kind of query would be difficult to do with pure GROUP BY (you’d have to join the aggregated results back to the detail), but window functions make it straightforward.</p>
<p><strong>Legacy SQL:</strong> Legacy BigQuery did not have window functions like this. Users often had to do self-joins or subqueries to achieve similar results, which is more cumbersome and sometimes less efficient. The introduction of Standard SQL with window functions greatly simplified analytical queries in BigQuery.</p>
<h2 id="heading-9-bigquery-performance-optimization-tips">9. BigQuery Performance Optimization Tips</h2>
<p>BigQuery is a columnar, massively parallel query engine. Query cost and performance are largely determined by how much data you scan and process. Here are some BigQuery-specific optimization tips to make queries efficient:</p>
<ul>
<li><strong>Select Only Needed Columns:</strong> <strong>Avoid</strong> <code>SELECT *</code> unless you truly need all columns. BigQuery charges by bytes scanned, and it reads every byte of each column you reference, so selecting unnecessary columns makes BigQuery read more data and slows the query down (<a target="_blank" href="https://leadpanda.media/en/blog/yak-skorotiti-vitrati-na-bigquery-10-perevirenix-lifexakiv-dlya-optimizacii/#:~:text=Avoid%20SELECT%20in%20queries">How to reduce BigQuery costs: 10 effective life hacks for optimization | Lead Panda Media</a>). Always project only the columns you need. For example, if you only need two columns from a 100-column table, write <code>SELECT col1, col2</code> – this can be the difference between scanning 1 GB and 100 GB. This is one of the simplest and most effective cost optimizations in BigQuery (<a target="_blank" href="https://galaxy.ai/youtube-summarizer/maximizing-efficiency-best-practices-for-bigquery-k81mLJVX08w#:~:text=Maximizing%20Efficiency%3A%20Best%20Practices%20for,of%20the%20columns%20you%20need">Maximizing Efficiency: Best Practices for BigQuery | Galaxy.ai</a>).</li>
</ul>
<ul>
<li><strong>Filter Early and Specifically:</strong> Use <strong>WHERE</strong> clauses to restrict data as much as possible. BigQuery will prune partitions (if partitioned) and skip irrelevant data. Also, filtering on clustered columns leverages sorted storage to read less. For example, if your table is partitioned by date, always include a date range filter in the WHERE clause so BigQuery can scan only the needed partitions.</li>
</ul>
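<p>As a sketch, for a table partitioned by <code>event_date</code> (a hypothetical <code>my_dataset.web_logs</code> table), a direct date filter lets BigQuery prune partitions:</p>
<pre><code class="lang-sql">SELECT user_id, page, event_timestamp
FROM `my_dataset.web_logs`
WHERE event_date BETWEEN '2025-03-01' AND '2025-03-31'  -- prunes to ~31 daily partitions
  AND country = 'DE';                                   -- applied after pruning
</code></pre>
<p>Pruning generally requires comparing the partition column directly; wrapping it in a function can prevent BigQuery from pruning.</p>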
<ul>
<li><strong>Partition Your Tables:</strong> When creating tables, use BigQuery’s partitioning feature for large fact tables. Partition by a date or integer range that you commonly filter on (e.g., <code>PARTITION BY date</code> for a daily log table). Queries with a filter on the partition column will then scan only the relevant partition rather than the entire table, greatly reducing I/O (<a target="_blank" href="https://www.kpipartners.com/blogs/bigquery-best-practices-to-optimize-cost-and-performance#:~:text=Use%20Partitioned%20tables%3A">BigQuery Best Practices to Optimize Cost and Performance</a>). Time-unit partitioning (DAY, MONTH, etc. on a TIMESTAMP/DATE) or integer range partitioning can be chosen based on data. For example, a web logs table partitioned by date might cut a year’s data (365 days) into 365 partitions; a query on one month would only read ~30 partitions (1/12 of data) (<a target="_blank" href="https://www.kpipartners.com/blogs/bigquery-best-practices-to-optimize-cost-and-performance#:~:text=Use%20Partitioned%20tables%3A">BigQuery Best Practices to Optimize Cost and Performance</a>). Partitioning also helps manage data lifecycle (you can set expiration for partitions).</li>
</ul>
<ul>
<li><strong>Cluster Your Tables:</strong> Clustering sorts data on specified columns, which can dramatically speed up filtering and aggregating on those columns (<a target="_blank" href="https://www.kpipartners.com/blogs/bigquery-best-practices-to-optimize-cost-and-performance#:~:text=Clustered%20tables%3A">BigQuery Best Practices to Optimize Cost and Performance</a>). For example, if you cluster a table by <code>user_id</code>, all rows with the same user_id are stored close together. A query like <code>WHERE user_id = X</code> will only read a small portion of each partition (only the blocks for that user). Clustering is often used in combination with partitioning: e.g., partition by date and cluster by user_id or product category. This way, BigQuery first prunes partitions by date, then within each partition it can binary search on the clustered column to find relevant data (<a target="_blank" href="https://www.kpipartners.com/blogs/bigquery-best-practices-to-optimize-cost-and-performance#:~:text=Clustered%20tables%3A">BigQuery Best Practices to Optimize Cost and Performance</a>). Clustering also helps GROUP BY performance on the clustered columns, because data comes pre-sorted (reducing shuffle).</li>
</ul>
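<p>Both features are declared at table creation. A hedged sketch of the combined pattern described above (hypothetical table and column names):</p>
<pre><code class="lang-sql">CREATE TABLE `my_dataset.web_logs`
PARTITION BY event_date
CLUSTER BY user_id, product_category
AS
SELECT * FROM `my_dataset.raw_logs`;
</code></pre>
<p>Queries filtering on <code>event_date</code> then prune partitions, and filters on <code>user_id</code> read only the relevant clustered blocks within each partition.</p>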
<ul>
<li><strong>Denormalize Data (Use Nested Fields):</strong> BigQuery performs best with fewer large tables rather than many joins. Where appropriate, <strong>denormalize</strong> your schema – e.g., instead of storing user info in one table and events in another (and joining), you might store user info as repeated/nested fields within the events table. BigQuery’s storage can handle repeated (array) fields efficiently. This avoids join overhead; querying nested data is essentially a local unpacking operation, not a distributed join (<a target="_blank" href="https://www.kpipartners.com/blogs/bigquery-best-practices-to-optimize-cost-and-performance#:~:text=BigQuery%20supports%20nested%20records%20within,try%20to%20access%20any%20fields">BigQuery Best Practices to Optimize Cost and Performance</a>) (<a target="_blank" href="https://stackoverflow.com/questions/59164578/transaction-management-in-google-bigquery#:~:text=The%20best%20way%20to%20avoid,to%20have%20your%20data%20denormalized">Transaction Management in Google Bigquery - Stack Overflow</a>). Joins in BigQuery are still fine for reasonably sized dimensions, but for very large datasets, reducing the number of joins (through nesting or pre-joining data) can improve performance. As a rule: small lookup tables (dimensions) are fine to join, but consider denormalizing big fact tables or using nested structures for one-to-many relations (like an order with an array of items) (<a target="_blank" href="https://www.kpipartners.com/blogs/bigquery-best-practices-to-optimize-cost-and-performance#:~:text=BigQuery%20supports%20nested%20records%20within,try%20to%20access%20any%20fields">BigQuery Best Practices to Optimize Cost and Performance</a>). 
This trades some storage space for speed, which is usually worthwhile in BigQuery’s cost model (storage is cheap; computation is comparatively expensive) (<a target="_blank" href="https://stackoverflow.com/questions/59164578/transaction-management-in-google-bigquery#:~:text=BigQuery%20performs%20best%20when%20your,normalized%29%20schema">Transaction Management in Google Bigquery - Stack Overflow</a>).</li>
</ul>
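<p>A sketch of the nested pattern described above: a hypothetical orders table storing its line items as a repeated field, queried with UNNEST instead of a join:</p>
<pre><code class="lang-sql">-- Hypothetical schema: orders(order_id, customer, items ARRAY&lt;STRUCT&lt;sku STRING, qty INT64, price NUMERIC&gt;&gt;)
SELECT 
  o.order_id,
  item.sku,
  item.qty * item.price AS line_total
FROM `my_dataset.orders` AS o,
  UNNEST(o.items) AS item;
</code></pre>
<p>The UNNEST is a local unpacking of each row’s array, so no distributed join or shuffle is needed.</p>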
<ul>
<li><strong>Avoid Excessive JOINs on Huge Tables:</strong> If you must join large tables, ensure the join keys are partitioned or clustered where possible. (In legacy SQL the <code>JOIN EACH</code> keyword had to be spelled out for big joins; Standard SQL performs the necessary shuffle join automatically.) Even so, joining two enormous tables multiplies the data that must be shuffled and processed. Wherever possible, filter both sides of the join heavily and select only the necessary columns before joining, for example with subqueries or CTEs that pre-filter.</li>
</ul>
<ul>
<li><strong>Use WITH (CTE) or Temp Tables to Break Queries:</strong> BigQuery will happily execute extremely complex queries in one go, but sometimes breaking a query into stages can help the optimizer or reduce duplicate work. For example, if you have a subquery that is used multiple times, consider materializing it as a CTE (Common Table Expression) or temp table so it’s computed once and reused. BigQuery will not automatically reuse results from identical subqueries unless you explicitly CTE them. Using CTEs can also improve readability, but note BigQuery does <em>not</em> materialize CTEs by default – they are inlined. If you want to materialize intermediate results (to reduce data scanned in subsequent steps), you might need to write to a temporary table in one step, then query it in the next.</li>
</ul>
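<p>A sketch of the staged approach using BigQuery scripting, materializing an intermediate result once (hypothetical table and column names):</p>
<pre><code class="lang-sql">-- Stage 1: materialize the filtered subset once
CREATE TEMP TABLE active_users AS
SELECT user_id, region
FROM `my_dataset.users`
WHERE last_seen >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY);

-- Stage 2: later statements reuse it without rescanning the base table
SELECT region, COUNT(*) AS n
FROM active_users
GROUP BY region;
</code></pre>
<p><code>CREATE TEMP TABLE</code> is valid inside a BigQuery script or session; the temp table is dropped automatically when the script or session ends.</p>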
<ul>
<li><strong>Leverage BigQuery’s Caching:</strong> By default, BigQuery caches the results of query for ~24 hours <strong>for the same user</strong> if the underlying data hasn’t changed. If you rerun an identical query, you won’t be charged and it will return faster (because it hits the cache) (<a target="_blank" href="https://leadpanda.media/en/blog/yak-skorotiti-vitrati-na-bigquery-10-perevirenix-lifexakiv-dlya-optimizacii/#:~:text=saves%20three%20additional%20query%20executions,%E2%80%9D">How to reduce BigQuery costs: 10 effective life hacks for optimization | Lead Panda Media</a>). This is automatic. However, note that the cache is per user — if your colleague runs the same query, they won’t get your cached results (<a target="_blank" href="https://leadpanda.media/en/blog/yak-skorotiti-vitrati-na-bigquery-10-perevirenix-lifexakiv-dlya-optimizacii/#:~:text=saves%20three%20additional%20query%20executions,%E2%80%9D">How to reduce BigQuery costs: 10 effective life hacks for optimization | Lead Panda Media</a>). For dashboards or repeated analyses by the same person, result caching is beneficial. For multi-user scenarios, consider caching at the application level or using scheduled queries/materialized views.</li>
</ul>
<ul>
<li><strong>Use Materialized Views for Frequent Aggregations:</strong> BigQuery supports <strong>materialized views</strong> which automatically cache the results of a query (typically an aggregation on a table) and incrementally update as the base table changes. If you have a very expensive aggregation that many queries run (like daily totals per category), a materialized view can serve those results quickly and save cost. Queries that can use the materialized view (e.g., they request data that the view pre-computed) will automatically do so. Design the materialized view on the heavy computation part of your data.</li>
</ul>
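<p>A hedged sketch of the materialized-view pattern for the “daily totals per category” example (hypothetical table and column names):</p>
<pre><code class="lang-sql">CREATE MATERIALIZED VIEW `my_dataset.daily_category_totals` AS
SELECT 
  sales_date,
  category,
  SUM(amount) AS total_amount,
  COUNT(*) AS order_count
FROM `my_dataset.daily_sales`
GROUP BY sales_date, category;
</code></pre>
<p>Queries aggregating over <code>sales_date</code> and <code>category</code> can then be answered from the view’s incrementally maintained results instead of rescanning the base table. Materialized views support only a limited subset of SQL (certain aggregates, restricted joins), so keep the definition simple.</p>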
<ul>
<li><strong>Consider BI Engine for Dashboards:</strong> If you are doing a lot of repetitive queries (especially dashboard queries that hit the same tables with filter variations), Google’s <strong>BI Engine</strong> can cache data in-memory for super-fast responses. It’s not a query tip per se, but a feature to be aware of for performance tuning in BigQuery when used with tools like Data Studio/Looker.</li>
</ul>
<ul>
<li><strong>Inspect the Query Execution Plan:</strong> BigQuery does not support a standalone <code>EXPLAIN</code> statement, but it records a detailed execution plan for every job: in the Cloud console it appears under the <strong>Execution details</strong> tab, and it is also available programmatically via the Jobs API and <code>INFORMATION_SCHEMA.JOBS</code>. The plan shows each stage’s input/output rows, shuffle volume, and timing, which can reveal an unexpectedly large scan or shuffle (for instance, whether a partition filter was actually applied, or whether a join forced a huge redistribution of data). If one stage accounts for 90% of the elapsed time, focus optimization efforts there.</li>
</ul>
<ul>
<li><strong>Use</strong> <code>TABLESAMPLE</code> for Testing: When developing queries on huge tables, use the <code>TABLESAMPLE SYSTEM()</code> clause to run on a fraction of the data (<a target="_blank" href="https://leadpanda.media/en/blog/yak-skorotiti-vitrati-na-bigquery-10-perevirenix-lifexakiv-dlya-optimizacii/#:~:text=However%2C%20we%20recommend%20using%20TABLESAMPLE%2C,comment%20out%20a%20single%20row">How to reduce BigQuery costs: 10 effective life hacks for optimization | Lead Panda Media</a>). For example, <code>FROM big_table TABLESAMPLE SYSTEM (1 PERCENT)</code> will read a 1% random sample of the table. This lets you test query logic quickly and cheaply on large data (with the caveat that results are approximate because it’s a sample). Once you are satisfied with the query on 1%, you can run it on 100%. This can save a lot of time and cost during development iterations.</li>
</ul>
<ul>
<li><strong>Approximate Results for Big Data:</strong> As mentioned, if you only need an estimate (like approx distinct count, top frequencies, etc.), use BigQuery’s approximate functions. For example, <code>APPROX_QUANTILES</code> can get percentile estimates without sorting all data, and <code>HLL_COUNT.INIT</code> / <code>HLL_COUNT.MERGE</code> functions allow creating your own HyperLogLog distinct counts over streams of data (advanced usage). Using these can cut down CPU time for large-scale analytics (<a target="_blank" href="https://leadpanda.media/en/blog/yak-skorotiti-vitrati-na-bigquery-10-perevirenix-lifexakiv-dlya-optimizacii/#:~:text=BigQuery%20offers%20functions%20for%20approximate,reliable%20for%20most%20analytical%20tasks">How to reduce BigQuery costs: 10 effective life hacks for optimization | Lead Panda Media</a>).</li>
</ul>
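<p>A sketch of the approximate functions mentioned above, against the hypothetical logs table used earlier:</p>
<pre><code class="lang-sql">SELECT 
  APPROX_COUNT_DISTINCT(user_id) AS approx_unique_users,
  APPROX_QUANTILES(session_duration, 100)[OFFSET(50)] AS approx_median_duration,
  APPROX_TOP_COUNT(page, 5) AS top_5_pages
FROM `my_dataset.web_logs`;
</code></pre>
<p><code>APPROX_QUANTILES(x, 100)</code> returns 101 boundary values, so <code>OFFSET(50)</code> is the approximate median; the trade-off is a small, bounded error in exchange for avoiding a full sort or exact distinct count.</p>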
<ul>
<li><strong>Monitoring and Slot Tuning:</strong> BigQuery automatically manages resources, but for critical queries or a consistent workload, consider reserving slots (with BigQuery Reservations) to get predictable capacity, and monitor slot utilization to size the reservation. For individual queries, avoid constructs that serialize otherwise parallel work, such as heavy JavaScript UDFs or external calls, which are far slower than native SQL functions.</li>
</ul>
<p>In summary, BigQuery performance is about <strong>scanning less data</strong> and <strong>distributing work efficiently</strong>. Techniques like partitioning, clustering, filtering, and pre-aggregating data (or using materialized views) directly reduce the amount of data scanned. Others, like using denormalized schemas and caching, reduce the amount of work per query. Following these best practices can lead to massive improvements in query speed and cost – for example, a query that originally scanned 1 TB daily could scan only 100 GB with proper partitioning and clustering, and further drop to 10 GB if you only select needed columns and use a WHERE clause, resulting in faster execution and 1/100th the cost.</p>
<p>Lastly, always test your optimizations. BigQuery’s web UI or CLI will show you how many bytes a query will process before you run it (when you click “Query Validator” or use <code>dry-run</code> flag). Use that as a guide: small changes in the query can sometimes accidentally increase bytes scanned. The goal is to minimize that while still getting correct results. BigQuery’s design (columnar + parallel) will handle large data if you follow these patterns to help it avoid unnecessary scans and shuffles.</p>
]]></content:encoded></item></channel></rss>