Git Study Notes for Data Engineers

Introduction to Git

Git is a distributed version control system that allows data engineering teams to track changes in code and collaborate effectively. It provides a full history of modifications, enabling easy rollbacks and parallel development. Key benefits include:

  • Version Tracking: Every change is saved as a commit with metadata, making it easy to audit and revert changes when needed.

  • Collaboration: Multiple engineers can work on different features or bug fixes simultaneously via branches, later merging their work without overwriting each other’s code.

  • Reproducibility: Git ensures that code for data pipelines, ETL scripts, or infrastructure is versioned, so you can reproduce past configurations exactly – critical for debugging and regulatory compliance.

Git’s advantages in version control and collaboration make it an essential tool for data engineers, helping streamline code management and sharing. By mastering Git’s command-line usage – from basic commits to advanced history editing – data engineers of all levels can efficiently manage project repositories.

Getting Started with Git: Repository Setup and Basics

Installing Git: First, install Git on your system (e.g., via package manager or the official installer). On Linux, you might use sudo apt install git, or on macOS brew install git. After installation, configure your identity:

  • Configure Identity: Set your name and email for commits:

      git config --global user.name "Your Name"  
      git config --global user.email "you@example.com"
    

    This ensures your commits are labeled properly. The --global flag applies this setting for all repositories (you can omit it to set per-project).

Initializing a Repository: To start version controlling a new or existing project, navigate into the project directory and initialize Git:

  • Run git init – this creates a new .git subdirectory containing all Git metadata for the repo. After git init, your directory becomes a Git repository; most Git commands can then be used inside it. (If the project already had a .git folder, re-initializing with git init is safe – it won’t overwrite existing version history.)

Cloning a Repository: To obtain a working copy of an existing repository (for example, from GitHub or a shared network path), use git clone:

  • Run git clone <repo-url> [<directory>]. This command copies an existing repository to your local machine. Internally, git clone first initializes a new repo, then downloads all the data from the source repository and checks out the latest snapshot into your working directory. After cloning, you have the full history and the working files ready to use.

Basic Workflow – Stage and Commit Changes: Once a repo is initialized or cloned, the typical cycle of work is: edit files, stage changes, commit, and (when collaborating) push to a remote.

  1. Check Repository Status: Use git status to review which files are modified, added, or removed. This helps you see what will be included in the next commit.

  2. Stage Changes: Use git add <file> to add a file’s changes to the staging area (index). For example:

    • git add script.py stages modifications in script.py.

    • Use git add . to stage all changes (new, modified, deleted files) in the current directory.

  3. Commit Changes: Once staged, create a commit with git commit -m "Message describing the change". A commit is a snapshot of the repository state. It’s good practice to write clear, concise commit messages.

  4. View Commit History: Use git log to view the history of commits. By default, it shows commits in reverse chronological order, with their hash, author, date, and message. For a one-line summary per commit, try git log --oneline.

  5. Difference (Optional): To see what changes have been made (before or after staging), use git diff. For example, git diff HEAD shows differences between the working directory and the latest commit.

Each commit in Git is identified by a SHA-1 hash. You can refer to commits by these hashes (full or abbreviated) or by pointers like branch names and tags (more on these later).

Using .gitignore: In any project, you’ll have files that shouldn’t be tracked (e.g., data outputs, environment files, credentials). Create a .gitignore file listing patterns of files to exclude (for example, *.csv to ignore all CSV files). Git will ignore those files when staging changes. (Note that .gitignore only affects untracked files – files already committed stay tracked until removed with git rm --cached.)
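As a sketch, the basic cycle above – init, configure, ignore, stage, commit – plays out like this in a throwaway repository (file names are invented for illustration):

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"         # local identity, this repo only
git config user.email "demo@example.com"

printf '*.csv\n' > .gitignore            # ignore CSV data outputs
echo "print('hello')" > script.py
echo "a,b" > output.csv                  # matches .gitignore, stays untracked

git add .                                # stages script.py and .gitignore only
git commit -q -m "Add script and ignore rules"

git log --oneline                        # one commit so far
COMMITS=$(git rev-list --count HEAD)
```

Note that output.csv never enters version control: git add . silently skips anything matched by .gitignore.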

Branching and Merging

Basic Git branching workflow: feature branches diverge from the main branch and are merged back after development. This allows parallel development without disrupting the main codebase.

What is a Branch? A branch in Git is effectively a movable pointer to a series of commits (a timeline of changes). Branches allow you to isolate work. For example, you might create a branch for a new feature or a bug fix. This way, unstable or experimental code is kept separate from the main branch (sometimes called master in older repos) until it’s ready. In Git, creating and switching branches is quick and cheap – it doesn’t copy the entire project, just points to a commit snapshot.

  • To create a new branch: git branch <branch-name>. This makes a new branch pointer at the current commit, but keeps you on the current branch.

  • To switch to a branch: git checkout <branch-name>. This updates your working directory to the state of that branch’s latest commit. (In newer Git versions, you can use git switch <branch-name> for clarity when just changing branches.)

  • To create and switch in one step: git checkout -b <branch-name> (or git switch -c <branch-name>). This is commonly used to start a new branch and begin working on it immediately.

For example, git checkout -b feature/data-cleanup will create a branch called “feature/data-cleanup” and check it out. Now commits you make will advance this branch, while the main branch remains untouched until you merge the feature branch back.

Why Branches? Branches enable parallel development. You can have multiple feature branches, bug-fix branches, or experiment branches all evolving independently. This isolation prevents incomplete or broken code on one branch from affecting others. When a branch’s work is complete and tested, it can be merged back into a main line (such as main or develop branch), integrating the changes.

Common Branching Strategies

Choosing a branching strategy helps define how your team collaborates and delivers code. Here are a few strategies data engineers may encounter:

  • Trunk-Based Development: All developers commit to a single long-lived branch (often main or trunk). Feature branches are short-lived; developers merge small, frequent updates back into main. This approach emphasizes continuous integration and is often considered a best practice for fast-paced DevOps environments. It minimizes merge conflicts by integrating changes often.

  • Feature Branch Workflow: Each feature or issue is developed in its own branch off of main (or off a develop branch). When the work is done, it’s merged via a pull request (PR) into the main line. This is very common – it keeps work isolated and allows code review before merging. Feature branches are usually deleted after merge to keep the repository tidy.

  • Gitflow Workflow: A more complex, legacy model with multiple long-lived branches (e.g. develop, master, plus supporting branches for features, releases, and hotfixes). In Gitflow, developers integrate feature branches into a central develop branch, releases are prepared on separate release branches, and finished releases are merged into master (and back into develop). Gitflow was popularized by Vincent Driessen and suits projects with scheduled release cycles. However, it has fallen out of favor in modern continuous delivery practices in favor of simpler trunk-based workflows, as Gitflow can be challenging to use with fast CI/CD pipelines.

  • Forking Workflow: Common in open-source, this is where each contributor works on their personal fork (copy) of the repository, then submits changes back to the main repository via pull requests. Internally, they use branches for features, but collaboration happens through the fork rather than a shared repo. In corporate data engineering, forking may be less common than in open source, except when collaborating across organizations.

In practice, many teams adopt a hybrid of these. For example, a data engineering team might use a simple feature branch model on top of a trunk-based approach: everyone branches off main for each task, then integrates back into main frequently (perhaps deploying changes to an analytics pipeline continuously). Choose a strategy that fits your release cadence and collaboration style. For stable data pipeline releases, you can also use release branches or tags (discussed below) to mark production versions.

Merging Branches

Once work in a branch is complete, you’ll merge it back into a target branch (often main or an integration branch like develop). Merging takes the changes from one branch and integrates them into another. The result is that the target branch now contains all the work that was in the source branch.

Basic merge: Check out the branch you want to merge into (e.g., main), then run git merge <source-branch>. This will bring the commits from <source-branch> into the current branch. There are a couple of outcomes:

  • If the current branch’s HEAD is an ancestor of the source branch’s HEAD (in other words, the target branch has no new commits since the source branched off), Git will do a fast-forward merge. In a fast-forward, no new commit is created; the HEAD pointer of the target just moves forward to the source’s latest commit. The history stays linear (no merge commit). Fast-forwards often happen when one person was working on a feature branch and nothing changed in main in the meantime.

  • If the branches have diverged (both have new commits since they split), Git will perform a three-way merge and create a merge commit by default. This special commit has two parent commits (the tips of the branches being merged) and a combined set of changes. The merge commit message is auto-generated (you can edit it) to note the two branches that were merged. This merge commit keeps history non-linear but records the branch structure, which some teams prefer for traceability.

You can control merge behavior:

  • To force a merge commit even if a fast-forward is possible, use git merge --no-ff <branch>. This is sometimes done to ensure a merge commit exists (for example, to keep a history of all merges or to group a feature’s commits under one merge commit).

  • To squash merge (combine all of a branch’s commits into a single commit on the target), you can use git merge --squash <branch> and then commit. This doesn’t produce a merge commit with two parents; instead, it produces one new commit with the cumulative changes. Squash merging is used to keep history linear and concise (e.g., squashing “fixup” commits on feature branch into one).

After merging, it’s good practice to delete the feature branch if it’s no longer needed (git branch -d <branch>), especially if using short-lived feature branches, to reduce clutter. (The commits remain in history even if the branch label is removed.)
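The merge options above can be sketched in a scratch repository (branch and file names are invented; the branch-name detection handles repos whose default branch is main or master):

```shell
set -e
repo=$(mktemp -d); cd "$repo"
git init -q
git config user.name "Demo"; git config user.email "demo@example.com"
echo base > file.txt; git add .; git commit -q -m "base"
main=$(git symbolic-ref --short HEAD)    # default branch name (main or master)

git checkout -q -b feature/cleanup
echo change >> file.txt
git commit -q -a -m "feature work"

git checkout -q "$main"
# --no-ff forces a merge commit even though a fast-forward was possible here
git merge -q --no-ff -m "Merge feature/cleanup" feature/cleanup
git branch -d feature/cleanup            # safe to delete: already merged

# a merge commit has two parents (rev-list prints: hash parent1 parent2)
WORDS=$(git rev-list --parents -n 1 HEAD | wc -w)
```

Without --no-ff, this merge would have fast-forwarded and the history would stay a single straight line with no merge commit.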

Merge Conflicts and Resolution

Sometimes Git cannot automatically merge changes because the same part of the same file was edited differently in the two branches – this is a merge conflict. When a conflict occurs, Git will merge what it can and mark the conflicted files for manual resolution. You’ll see in the terminal output which files have conflicts, and git status will list those files as “unmerged” or with conflicts.

What a conflict looks like: Git inserts conflict markers in the affected files to indicate the differing sections. For example, a conflicted file will contain lines like:

<<<<<<< HEAD
your code on the current branch
=======
someone else’s code on the branch being merged
>>>>>>> feature-branch

Everything between <<<<<<< HEAD and ======= is the content from the current branch, and between ======= and >>>>>>> feature-branch is the content from the other branch. Your job is to edit this file to reconcile the differences – decide what the final content should be.

Steps to resolve a conflict:

  1. Identify conflicts: After a merge attempt, Git stops at conflicts. Run git status to see which files are in conflict. Open those files in a text editor or IDE.

  2. Edit the files: Find the conflict markers (<<<<<<<, =======, >>>>>>>). Decide how to combine the changes. You might keep one side, the other, or a mix of both. Remove the conflict marker lines and make the file look exactly as it should after the merge.

  3. Mark as resolved: After editing, stage the file with git add <file>. (Git now knows you’ve resolved that file’s conflict.)

  4. Finalize the merge: Once all conflicts in all files are resolved and staged, complete the merge by running git commit (or git merge --continue, which does the same thing after checking that no conflicts remain). The merge commit will be created with the changes you just made. (If the conflict arose during a rebase or cherry-pick instead of a merge, use git rebase --continue or git cherry-pick --continue – see rebasing below.)

If you want to abort the merge during a conflict (to return to the pre-merge state), you can use git merge --abort. This will abandon the merge and leave the branch as it was.

Tip: To simplify conflict resolution, you can use graphical mergetools (git mergetool) or IDE features that present side-by-side comparisons. Tools like VSCode, GitKraken, or P4Merge can make resolving conflicts more visual.
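The resolution steps above can be walked through end to end in a scratch repository (file contents here are invented; the script resolves the conflict by keeping the feature branch’s version, standing in for the manual edit you would do):

```shell
set -e
repo=$(mktemp -d); cd "$repo"
git init -q
git config user.name "Demo"; git config user.email "demo@example.com"
echo "threshold = 5" > config.py; git add .; git commit -q -m "base"
main=$(git symbolic-ref --short HEAD)

git checkout -q -b feature
echo "threshold = 10" > config.py; git commit -q -a -m "raise threshold"

git checkout -q "$main"
echo "threshold = 1" > config.py; git commit -q -a -m "lower threshold"

# Both branches edited the same line, so this merge stops with a conflict
if ! git merge feature; then
    # at this point config.py contains <<<<<<< / ======= / >>>>>>> markers;
    # resolve by hand – here we simply keep the feature branch's version
    echo "threshold = 10" > config.py
    git add config.py                    # mark the conflict as resolved
    git commit -q -m "Merge feature, keep threshold = 10"
fi
RESULT=$(cat config.py)
```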

Rebasing and History Rewriting

Rebasing is another method to integrate changes from one branch into another, by moving the base of a branch. In contrast to merging (which adds a new merge commit), rebasing reapplies your branch commits on top of another base commit, producing a linear history.

Basic Rebase: Suppose you have a feature branch that forked from main, and main has progressed (new commits) while you worked. To update your feature branch with the latest main changes, you can rebase:

git checkout feature
git rebase main

This takes the commits on feature and replays them as if you started feature at the tip of main. Technically, it finds the common ancestor of feature and main, takes the diff introduced by each commit on feature since that ancestor, and re-applies those diffs on top of the current main HEAD, one by one. The feature branch pointer is then moved to the last of these new commits.

Effectively, git rebase main “moves the entire feature branch to begin on the tip of the main branch, incorporating all new commits in main. But instead of using a merge commit, rebasing re-writes the project history by creating brand new commits for each commit in the original branch” (Atlassian Git Tutorial).

After the rebase, your feature branch will have a new set of commits (new hashes) that include the latest from main in its history. The history becomes one straight line, as if you had developed the feature on top of the updated main all along. This yields a cleaner, linear history with no merge commits.

When to rebase vs. merge: Rebasing achieves the same end result as merging (your feature has main’s updates), but the history looks different:

  • A merge preserves the context of parallel development (with a merge commit joining two lines of history), whereas

  • a rebase flattens it into a single line, as if work was serial.

Many teams rebase feature branches before merging to main (often via a pull request) to maintain a linear history in main. This can make tools like git log or git bisect simpler to use. However, rebase is effectively changing history, which comes with caveats.

Caution – The Golden Rule of Rebasing: Do not rebase commits that exist outside your local repository. In other words, never rebase a public/shared branch. If you rebase commits that others have already pulled (for example, rebasing the main branch or any published branch), you’ll force everyone else to reconcile the rewritten history, which is error-prone and confusing. Only rebase local feature branches or branches that only you work on. Rebasing replaces old commits with new ones, so anyone who still has the old commits will hit conflicts when pushing or pulling.

Rebase Conflicts: Rebasing can also result in conflicts (similar to a merge) if your commits and the new base have touched the same lines. Git will stop at a conflict during the rebase. You resolve it by editing the file and using git add just like a merge, then continue the rebase with git rebase --continue. If things go wrong, git rebase --abort will return you to the state before the rebase started.
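The basic rebase above can be reproduced in a scratch repository (branch and file names are invented). The assertion at the end is the defining property of a rebase: after it, the feature commit sits directly on top of main’s tip:

```shell
set -e
repo=$(mktemp -d); cd "$repo"
git init -q
git config user.name "Demo"; git config user.email "demo@example.com"
echo a > a.txt; git add .; git commit -q -m "base"
main=$(git symbolic-ref --short HEAD)

git checkout -q -b feature
echo f > f.txt; git add .; git commit -q -m "feature commit"

git checkout -q "$main"
echo b > b.txt; git add .; git commit -q -m "main moved on"

git checkout -q feature
git rebase -q "$main"        # replay "feature commit" on top of main's new tip

# history is now linear: the rebased commit's parent is main's tip
PARENT=$(git rev-parse HEAD^)
TIP=$(git rev-parse "$main")
```

Note that the rebased commit has a new hash – the original "feature commit" object is replaced, which is exactly why rebasing shared branches causes trouble.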

Interactive Rebase – cleaning up commits: A powerful feature is git rebase -i (interactive). For example, git rebase -i main (while on feature) allows you to edit the commit history of your feature branch before finishing the rebase. Git opens an editor with a list of your commits and allows actions like reordering commits, squashing multiple commits into one, editing commit messages, or dropping commits. For instance, you can squash a “fix typo” commit into an earlier commit instead of keeping two separate commits. Interactive rebase is great for crafting a clean history (especially before merging to main). After you finalize the changes in the editor and save, Git will apply the commits as specified. You’ll end up with a rewritten branch history (make sure this is only done on your own branch, not something others depend on).

Reset (Rewriting History): While discussing history, another related tool is git reset. This command moves the HEAD of the current branch to a specified commit, optionally updating the working directory/index:

  • git reset --soft <commit>: move HEAD to <commit> but leave all changes after that commit staged (in index). Useful if you want to “uncommit” some changes but keep them ready to recommit.

  • git reset --mixed <commit> (default): move HEAD to <commit>, keep changes after that commit in the working directory (unstaged). This essentially “unstages” and “uncommits” those changes.

  • git reset --hard <commit>: move HEAD to <commit> and wipe out any changes in the index and working directory (all changes after that commit are lost, unless they exist somewhere else like another branch or the reflog). Use with caution, as --hard can discard work permanently if not saved elsewhere.

Reset is useful for rewriting history locally (e.g., undoing a bad commit or several commits) before pushing. But like rebase, never reset public history (especially --hard on a shared branch) – others who pulled the old history will have issues. For public fixes, prefer git revert (discussed next) which preserves history.
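The three reset modes can be compared in a scratch repository (file names are invented). --soft leaves the undone commit’s changes staged; --hard discards them from the working tree entirely:

```shell
set -e
repo=$(mktemp -d); cd "$repo"
git init -q
git config user.name "Demo"; git config user.email "demo@example.com"
echo one > f.txt; git add .; git commit -q -m "first"
echo two >> f.txt; git commit -q -a -m "second"

git reset -q --soft HEAD~1          # "uncommit" second; its changes stay staged
STAGED=$(git diff --cached --name-only)

git commit -q -m "second, recommitted"
git reset -q --hard HEAD~1          # discard it entirely; back to "first"
CONTENT=$(cat f.txt)
```

After the --hard reset the "second" commit is no longer on any branch, but it can still be recovered through the reflog for a while (see the reflog section below).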

Reverting Commits: If you need to undo a commit that has already been pushed/shared, use git revert <commit>. This creates a new commit that inverses the changes of the specified commit (without altering history behind it). It’s the “safe” way to undo a commit because history remains chronological; you don’t remove the commit, you add a new one that says “undo that change.” This is especially relevant for production code or database migration scripts in data engineering – if a bad change was committed and pushed, revert will apply an opposite change as a new commit (which can itself be pushed). You can revert a single commit or a range of commits. Each revert will prompt for a commit message (by default, noting the commit hash being reverted).
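A minimal revert sketch (file name invented; --no-edit accepts the default message so the example runs non-interactively). Note that nothing is removed – the history grows by one commit that undoes the bad one:

```shell
set -e
repo=$(mktemp -d); cd "$repo"
git init -q
git config user.name "Demo"; git config user.email "demo@example.com"
echo good > pipeline.sql; git add .; git commit -q -m "good change"
echo bad > pipeline.sql; git commit -q -a -m "bad change"

# revert adds a third commit that undoes "bad change"; no history is rewritten
git revert --no-edit HEAD
CONTENT=$(cat pipeline.sql)
COMMITS=$(git rev-list --count HEAD)
```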

Stashing Changes (Shelving Work in Progress)

In data engineering, you might often find yourself in the middle of developing a pipeline when something urgent comes up (e.g., a bug on another branch needs fixing). You have uncommitted changes that aren’t ready to commit or push. Git stash is a handy feature to temporarily save this work and get back to a clean working state.

What is stashing? Stashing takes your working directory changes (and by default, the staging area as well) and saves them on a stack of “stashes,” then reverts your working copy to match the HEAD commit. It’s like putting your work in a locker temporarily.

  • Run git stash to stash current changes. Git will save changes to tracked files (modified, staged, or deleted), then revert your working directory to a clean state (the last commit). Untracked files are not stashed by default; use git stash -u to include them. Your changes aren’t lost – they’re stored in the stash stack. By default, stash entries are named like “WIP on branch-name…” with the branch and commit info, but you can provide a description: git stash push -m "message".

Stash example: If you have edited 3 files but need to switch to main to hotfix something, do git stash. Now your repo is clean (no edits). After fixing on main and switching back to your feature, you can apply the stash to continue where you left off.

Stash list and apply: You can have multiple stashes:

  • git stash list – shows all stashed entries in a stack (indexed as stash@{0}, stash@{1}, ... with 0 being the most recent). You’ll see the message and branch info for each.

  • git stash show [stash@{n}] – show what changes are in a stash (a summary diff). Add -p for a full diff.

  • git stash apply <stash> – apply a stash’s changes back onto the working directory without removing it from the stash list (so it can be applied again if needed). If you omit the stash reference, it applies the latest stash.

  • git stash pop <stash> – applies the stash and drops it from the stash list (applying and popping the most recent stash is the default if none specified). Use this when you’re done with that saved state.

After applying a stash, your previously saved changes are restored in your working copy (you might need to resolve conflicts if the code changed in the meantime). If a stash is no longer needed, you can drop it manually: git stash drop stash@{n} or clear all with git stash clear.

Stashing is great for keeping work in progress aside. For instance, stashing allows you to switch branches (git switch <other-branch>) without committing half-done work, and then come back later and unstash to resume. Keep in mind that stash is local to your repository (stashes aren’t pushed to remotes). It’s essentially a convenience—some workflows prefer committing to a “WIP” branch instead—but stash is quick and tidy for short-term use.

In summary: git stash saves your local changes to a separate place and reverts your working directory to the last commit, letting you safely check out a different branch or pull updates. Later, git stash apply or git stash pop restores those saved changes so you can continue where you left off.
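The stash-then-resume cycle can be sketched in a scratch repository (file name invented; the hotfix detour is elided as a comment):

```shell
set -e
repo=$(mktemp -d); cd "$repo"
git init -q
git config user.name "Demo"; git config user.email "demo@example.com"
echo v1 > etl.py; git add .; git commit -q -m "base"

echo "work in progress" >> etl.py        # half-done, uncommitted edit
git stash push -q -m "half-done cleanup" # shelve it; working tree is clean again
CLEAN=$(cat etl.py)                      # file is back to just "v1"

# ...switch branches, hotfix, switch back...

git stash pop -q                         # restore the edit and drop the stash entry
RESTORED=$(tail -n 1 etl.py)
```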

Recovering Lost Work with Reflog

Git has a safety net called the reflog (reference log) that records when heads and branch tips are updated in your local repository. Even if you lose track of a commit (say you reset a branch or a commit isn’t on any branch), the reflog allows you to find its hash and recover it.

Every time HEAD moves, Git records an entry in the reflog. This includes commits, checkouts, resets, rebases, stash operations, etc. Reflog entries are local (they’re not transferred to remotes) and expire by default after 90 days (30 days for entries no longer reachable from any branch or tag).

  • Run git reflog to see the log of HEAD changes (it’s essentially shorthand for git reflog show HEAD). The output will look like:

      abc1234 HEAD@{0}: commit: added data validation step
      98f00b2 HEAD@{1}: checkout: moving from main to feature/cleanup
      0123fed HEAD@{2}: commit: WIP commit
      ...
    

    Each line shows a reference (here HEAD) with an index {n} (0 is the latest action, then 1, 2, etc. going back in time), the commit hash at that moment, and a description of the action. You can also view the reflog for specific branches: git reflog show branchName (or the shorthand git reflog branchName), and even the stash has a reflog (git reflog show stash).

Using the reflog, you can recover from mistakes. For example, if you accidentally reset main to an earlier commit and “lost” some commits, you can look at git reflog to find the hash of where main was before the reset (it will show a HEAD entry for the reset). You might see main@{1}: reset: moving to <old-sha>. The commit before that in reflog (main@{2} maybe) is the commit that got “lost”. You can then do git checkout <lost-commit-hash> or create a branch at that hash to recover it.

In short, reflog tracks updates to branch heads (and HEAD itself) in your repository. It’s extremely useful for undoing operations that seem irreversible (like an unintended hard reset or an overwritten branch). Keep in mind reflog entries expire over time (and some aggressive Git GC operations can clean them), so don’t rely on reflog for extremely long-term recovery – but for recent oopsies, it’s a lifesaver.
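The recovery scenario above can be reproduced in a scratch repository (file and branch names invented): a hard reset orphans a commit, and HEAD@{1} in the reflog points straight back at it:

```shell
set -e
repo=$(mktemp -d); cd "$repo"
git init -q
git config user.name "Demo"; git config user.email "demo@example.com"
echo a > f.txt; git add .; git commit -q -m "first"
echo b >> f.txt; git commit -q -a -m "second"
LOST=$(git rev-parse HEAD)

git reset -q --hard HEAD~1          # "second" is now on no branch...
FOUND=$(git rev-parse 'HEAD@{1}')   # ...but the reflog remembers where HEAD was
git branch rescue "$FOUND"          # re-attach the lost commit to a branch
```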

Tagging: Marking Specific Commits

Tags are references to specific commits, typically used to mark release points (v1.0, v2.0, etc.) or other important milestones in history. Unlike branches, tags don’t move – they always point to the same commit. There are two types of tags in Git: annotated tags (git tag -a), which are full objects carrying a tagger, date, and message, and lightweight tags (a bare git tag <name>), which are simple pointers to a commit.

Example:

git tag -a v1.0 -m "Release version 1.0"

This tags the current commit with an annotated tag “v1.0” with the given message. Alternatively, git tag v1.0 (no -m) would make a lightweight tag on the current commit.

Use git tag (with no arguments) to list all tags. You can also filter tags: git tag -l "v1.*" lists tags matching the pattern (e.g., all tags that start with v1.).

To see details of a tag (especially annotated tag message), use git show <tagname> which will display the tag message and the commit it points to.

Pushing Tags: By default, git push does not send tags to the remote. You must explicitly push tags:

  • git push origin <tagname> to push a specific tag.

  • git push origin --tags to push all tags at once.

It’s important to push tags if others need to see them (e.g., teammates or CI systems that deploy specific tags). Tag names, once pushed, are global in the repository like branch names.

Using Tags in Data Engineering: You might tag a commit that corresponds to a deployed version of a data pipeline. For example, if you release “Batch ETL pipeline v2”, you could tag that commit as pipeline-v2-prod. This makes it easy to check out exactly what code was running in production at that version, or to compare changes between versions.

Annotated vs Lightweight: “Annotated tags are full objects stored within Git’s database and contain all information (checksum, tagger, date, message). Lightweight tags are simply pointers to a commit and contain no additional data.” (Atlassian Bitbucket documentation) For important milestones, annotated tags are preferred due to the extra context.

If you ever need to delete a tag (for instance, a mistyped one): git tag -d <tagname> removes it locally. To remove it from the remote, you’d push a deletion: git push origin :refs/tags/<tagname>.
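The annotated-vs-lightweight difference is visible in the object database itself, as this sketch shows (tag names invented): an annotated tag is its own "tag" object, while a lightweight tag resolves straight to the commit:

```shell
set -e
repo=$(mktemp -d); cd "$repo"
git init -q
git config user.name "Demo"; git config user.email "demo@example.com"
echo a > f.txt; git add .; git commit -q -m "release candidate"

git tag -a v1.0 -m "Release version 1.0"   # annotated: its own object, with message
git tag v1.0-light                          # lightweight: bare pointer, no metadata

ANNOTATED=$(git cat-file -t v1.0)           # object type is "tag"
LIGHT=$(git cat-file -t v1.0-light)         # resolves directly to the "commit"
```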

Working with Remote Repositories (GitHub, GitLab, Bitbucket, etc.)

In professional settings, your Git repositories are usually hosted on a remote server or service (such as GitHub, GitLab, or Bitbucket). These platforms provide a central location for your team’s Git repo and add features like pull requests, issue tracking, and CI integration.

After doing local commits, you’ll push them to the remote so others can pull them. Similarly, you pull others’ changes to update your local repository.

Adding a Remote: If you cloned a repository, the remote (often named origin) is already configured. If you started from scratch with git init, you can add a remote:

git remote add origin <repo-url>

This links the name “origin” to the repository URL (could be an HTTPS URL or SSH like git@github.com:account/repo.git). You can check remotes with git remote -v.

Pushing Changes: To share your commits, push them to a remote branch:

git push origin <branch-name>

This will send the commits on your local branch to the remote origin. If the remote branch doesn’t exist yet, Git will create it. The first time you push a new branch, you might use git push -u origin <branch> to set the “upstream tracking” – meaning your local branch will track that remote branch, so future git pull or git push commands know which branch to synchronize with by default.

For example, on a new repo:

git push -u origin main

pushes your local main branch to the remote’s main and sets it as the upstream. Next time, just git push is enough.

Fetching and Pulling: To get commits from others, there are two commands:

  • git fetch origin will fetch all new commits and update remote-tracking branches (like origin/main) in your local repo, but it won’t alter your working copy or local branches. After fetching, you can inspect what’s new (with git log origin/main, for example) and then merge or rebase manually.

  • git pull is essentially a shortcut that does a git fetch followed by a git merge (or rebase) of the fetched branch into your current branch. Typically, running git pull in your local main branch will fetch updates from origin and then merge origin/main into your local main (bringing you up to date).

Because git pull by default merges, some prefer to avoid potential merge commits by configuring pull to rebase (git config pull.rebase true or using git pull --rebase). This will replay your local commits on top of the fetched commits instead of a merge commit, similar to doing fetch then rebase. Use whichever strategy your team prefers.
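The remote workflow above – add/clone a remote, push with upstream tracking, configure pull to rebase – can be exercised entirely locally by using a bare repository as a stand-in for a hosted remote (all paths here are invented):

```shell
set -e
work=$(mktemp -d)
git init -q --bare "$work/origin.git"      # local stand-in for a hosted remote

git clone -q "$work/origin.git" "$work/clone"
cd "$work/clone"
git config user.name "Demo"; git config user.email "demo@example.com"
git config pull.rebase true                # make git pull rebase instead of merge

echo a > f.txt; git add .; git commit -q -m "first"
branch=$(git symbolic-ref --short HEAD)    # default branch name (main or master)
git push -q -u origin "$branch"            # first push: create branch, set upstream

UPSTREAM=$(git rev-parse --abbrev-ref --symbolic-full-name '@{u}')
```

After the -u push, plain git push and git pull know which remote branch to synchronize with, which is what the upstream query at the end confirms.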

Collaboration via Pull Requests: On platforms like GitHub/GitLab/Bitbucket, a common workflow is:

  • Developer pushes a feature branch to the remote.

  • They open a Pull Request (GitHub/Bitbucket term) or Merge Request (GitLab term) to merge their branch into another (e.g., into main).

  • Teammates review the code via the platform, discuss, and eventually the PR is merged (often the platform will handle the merge by creating a merge commit or squashing, depending on settings).

As a data engineer, you’ll use these features to collaborate on pipelines and analytics code. Pull requests enable code review which is valuable to maintain code quality in ETL scripts or SQL transformations.

Remote Branch Tracking: When you clone, your local main is automatically set to track origin/main. You’ll see this with git branch -vv (which shows what each local branch is tracking and if it’s behind/ahead). If your local branch is tracking a remote, you can use git pull and git push without specifying the remote/branch every time.

If you have multiple remotes (say origin and another called upstream), you specify which remote in commands (e.g., git fetch upstream, git push origin feature-branch).

Tip: It’s a good habit to pull the latest changes on main before starting new work. Likewise, before pushing, make sure your local branch is up to date with the remote (i.e., it can be fast-forwarded) to avoid rejected pushes and conflicts.
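Tracking info and multiple remotes can be seen together in a small self-contained demo (local bare repos stand in for origin and upstream; names are illustrative):

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init --bare -q origin.git
git init --bare -q upstream.git

git clone -q origin.git work
cd work
git config user.email dev@example.com
git config user.name Dev
git commit -q --allow-empty -m "initial commit"
git push -q -u origin HEAD:main           # local branch now tracks origin/main

git remote add upstream ../upstream.git   # a second remote
git push -q upstream HEAD:main            # with two remotes, name the remote explicitly
git fetch -q upstream
git branch -vv                            # shows tracking info, e.g. "[origin/main]"
```

Because the local branch tracks origin/main, plain git pull and git push need no arguments, while anything touching upstream names it explicitly.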

Advanced Topics and Tooling Integration

This section touches on how Git interacts with other tools and workflows common in data engineering.

Git Integration with CI/CD and Automation Tools

Jenkins (Continuous Integration server): Jenkins is a popular CI tool often used to automate builds, tests, and deployments in data pipelines. Jenkins has robust Git support – the Git plugin in Jenkins can poll repositories, fetch branches, and even push tags or code as part of a pipeline (Git - Jenkins Plugins). In a typical Jenkins pipeline (defined by a Jenkinsfile stored in the repo), you might see a step like checkout scm which clones the repository at the correct commit of a build. For data engineers, this means whenever code is pushed to a branch, Jenkins can automatically trigger jobs: run data pipeline tests, lint SQL scripts, build container images for data apps, etc. The integration is seamless – you point Jenkins to the repo and branch, and it handles the rest (it uses your Git credentials or webhooks to know when to pull new commits).

Jenkins can be configured to trigger on certain branch pushes or PR merges. For example, you might set up Jenkins to run a nightly ETL job; Jenkins will git pull the latest scripts from main before execution, ensuring the job uses the most up-to-date code. In summary, Jenkins + Git allows continuous integration: every change in Git can automatically go through a pipeline of tests/deployments. The Git plugin in Jenkins supports all fundamental operations (clone, fetch, branch, merge) needed to integrate your repository into CI workflows (Git - Jenkins Plugins).

GitHub Actions, GitLab CI, Bitbucket Pipelines: These are CI/CD systems integrated into Git hosting platforms. They all revolve around Git events. For instance, GitHub Actions can be triggered on a push or PR; the workflow will automatically checkout the repository (uses: actions/checkout@v3 in config) which grabs the code at that commit. Similarly, GitLab CI/CD is configured via a .gitlab-ci.yml in the repo – on each push, runners fetch the code and execute jobs. As a data engineer, you can leverage these to automate testing of data transformations or even deploy infrastructure as code from your repo.
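Whatever the CI system, its checkout step reduces to a few Git commands: clone the repository, pin the exact commit the push or PR event refers to, and run the job. A hedged, self-contained sketch (a local repo stands in for the hosted one; run_tests.sh is a placeholder for a real test suite):

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"

# Stand-in for the hosted repository a CI runner would clone.
git init -q repo
cd repo
git config user.email ci@example.com
git config user.name CI
printf 'echo "tests passed"\n' > run_tests.sh
git add run_tests.sh
git commit -q -m "add test script"
sha=$(git rev-parse HEAD)           # the commit the push/PR event refers to
cd ..

# What a CI checkout step boils down to: clone, pin the commit, run.
git clone -q repo ci-workspace
cd ci-workspace
git checkout -q "$sha"              # detached at the exact commit under test
sh run_tests.sh
```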

Apache Airflow (Workflow Orchestration): Airflow is used to schedule and manage data pipelines (DAGs). While Airflow itself isn’t a version control system, it’s a best practice to treat your Airflow DAG files as code and manage them in Git (Version Control For Ai Airflow Dag Api | Restackio). Typically, you develop DAGs (Python scripts defining tasks and dependencies) in a repository. Benefits of using Git here include tracking changes to workflows, code reviews for DAG changes, and the ability to roll back to prior versions of pipelines if a new DAG run fails.

Many Airflow deployments integrate with Git:

  • The Astronomer/Airflow Kubernetes setup can use a sidecar container to periodically git pull a repo of DAGs (using tools like git-sync). This way, updating the DAG code in Git automatically updates what Airflow runs.

  • In a simpler scenario, you might manually deploy DAGs by pulling from Git, or use a CI job to deploy: e.g., a GitHub Actions workflow detects changes in the dags/ folder and pushes the updated files to your Airflow instance.

Best practices: Use branches and pull requests for DAG changes, just like any code. For instance, implement a new data pipeline on a feature branch, get it reviewed, merge to main, then your deployment mechanism pulls main to update Airflow. You can also use tags to mark stable DAG versions that correspond to production deployments (Version Control For Ai Airflow Dag Api | Restackio).
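The "manually deploy DAGs by pulling from Git" scenario above amounts to a clone plus periodic pulls on the Airflow host. A minimal local sketch (repo layout and file names are hypothetical):

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"

# Stand-in for the DAG repository.
git init -q dag-repo
cd dag-repo
git config user.email de@example.com
git config user.name DE
mkdir dags
printf '# placeholder DAG definition\n' > dags/etl_daily.py
git add dags
git commit -q -m "add etl_daily DAG"
cd ..

# On the Airflow host: clone once, then refresh on a schedule or from CI.
git clone -q dag-repo airflow-dags
cd airflow-dags
git pull -q          # picks up whatever was merged to the default branch
ls dags/
```

Tools like git-sync automate exactly this loop inside a sidecar container.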

In fact, the guidelines for Airflow version control mirror those for ordinary software: work on branches, review changes through pull requests, and tag stable versions. By following these, your data pipeline code benefits from the same rigor as application code.

GitOps and Infrastructure as Code: GitOps is an approach where Git is the single source of truth for infrastructure and deployments. In GitOps workflows, all changes to infrastructure or configuration (Kubernetes manifests, Terraform files, etc.) are made via Git commits, and automated processes apply those changes to the environment.

For example, in a Kubernetes-based data platform, you might have all your cluster configs and pipeline manifests in a Git repo. A tool like Argo CD or Flux monitors the repo and applies changes to the cluster whenever a commit is made. This means deployment is triggered by a git push, not a manual kubectl or CLI call. GitOps heavily relies on Infrastructure as Code (IaC): “GitOps uses a Git repository as the single source of truth for infrastructure definitions” (What is GitOps?) and uses pull/merge requests as the change mechanism. When changes are merged to the main config repo, a CI/CD pipeline or GitOps operator applies those changes (e.g., deploys new services, updates database config, etc.) ensuring the live environment always matches what’s in Git (What is GitOps?).

In data engineering, you can apply GitOps to things like: managing Kubernetes-based ETL services, configuration for data processing jobs, or even scheduling configurations. It brings benefits of auditability (every infra change is logged in Git history) and reliability (reducing ad-hoc manual changes). For instance, if you manage an Apache Kafka cluster config via code, any topic creation or ACL change could be done by editing a file in Git and letting an automated process enact it.
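From the operator's point of view, a GitOps change is nothing but a commit and a push; the cluster agent (Argo CD, Flux) does the applying. A self-contained sketch of the Git side (a local bare repo stands in for the config repository; the manifest is a placeholder, not a full Kubernetes resource):

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init --bare -q config.git       # stand-in for the shared config repo

git clone -q config.git infra
cd infra
git config user.email ops@example.com
git config user.name Ops
mkdir -p k8s
cat > k8s/etl-job.yaml <<'EOF'
# placeholder manifest; a real one would be a complete Kubernetes resource
replicas: 2
EOF
git add k8s
git commit -q -m "scale etl job to 2 replicas"
git push -q -u origin HEAD:main     # no kubectl: the agent syncs the cluster from main
```

In a real setup this push would go through a merge request, and the agent would notice the new commit on main and reconcile the cluster to match.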

In summary, Git isn’t just for application code – it’s central to modern DataOps practices too. With Git-managed workflows, you achieve consistent, repeatable deployments for data pipelines and infrastructure, often via GitOps tooling.

Submodules (Including External Repos in Your Repo)

Sometimes a project may want to include another repository within itself – for example, a data engineering team might maintain a separate repo for common SQL scripts or a shared library that they want to use in multiple pipeline projects. Git submodules allow you to embed one repository as a sub-directory of another repository. The submodule still retains its own history and can be developed somewhat independently.

A submodule is essentially a reference to another repo at a specific commit (a snapshot) (Git submodule | Atlassian). Your main repo will store a pointer (the commit SHA) of the external repo that you’ve locked in. This enables you to version external code along with your project.

Adding a submodule: In the parent repository, run:

git submodule add <repo-url> <path>

This will: clone the external repo into a sub-folder (the path you specified), and record that path and the current commit of the external repo in the parent repo’s index. After adding, you’ll see a new file .gitmodules that maps submodule paths to their URLs.

For example:

git submodule add https://github.com/org/data-utils.git libs/data-utils

This creates a folder libs/data-utils in your repo containing that repository. By default, it will check out the default branch’s HEAD of that repo at that time. Commit the new submodule (it appears as a new entry with a special file mode in git status). The commit in your repo doesn’t include all of the submodule’s files, just a reference to the commit ID of the sub-repo.

Working with submodules: If the submodule repository changes (you want to update to a newer commit), you go into the submodule directory (cd libs/data-utils), which is itself a Git repo. You can pull or check out a different commit/tag there. Then go back to the main repo and you’ll see that Git recognizes the submodule’s commit changed (libs/data-utils now points to a new commit). You then commit that change in the parent to update the submodule reference.
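That add-then-bump dance can be run end to end with throwaway local repos. Note the protocol.file.allow override is only needed because the submodule URL here is a local path (newer Git restricts the file protocol for submodules); repo and path names are illustrative:

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"

# The shared library repo.
git init -q utils
cd utils
git config user.email u@example.com
git config user.name U
git commit -q --allow-empty -m "utils v1"
cd ..

# The parent repo embeds it as a submodule.
git init -q parent
cd parent
git config user.email p@example.com
git config user.name P
git commit -q --allow-empty -m "initial commit"
git -c protocol.file.allow=always submodule add "$tmp/utils" libs/utils
git commit -q -m "add utils submodule"

# The library moves forward...
git -C "$tmp/utils" commit -q --allow-empty -m "utils v2"

# ...so pull inside the submodule, then record the new pointer in the parent.
git -c protocol.file.allow=always -C libs/utils pull -q
git add libs/utils
git commit -q -m "bump utils submodule to v2"
git submodule status                # now shows the v2 commit
```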

Cloning a repo with submodules: By default, git clone won’t fetch submodule contents. After cloning a repo that has submodules, you’ll see empty directories for them until you initialize them. To get them:

  • Run git submodule update --init --recursive to fetch all submodules (and any nested submodules, hence --recursive) and check them out at the recorded commits.

  • Alternatively, clone with --recurse-submodules flag: git clone --recurse-submodules <repo-url>, which clones and automatically initializes and checks out submodules.

Common submodule commands:

  • git submodule status – lists submodules with their current commit and whether they are up-to-date or modified.

  • git submodule update --remote --merge – can update submodules to the latest commit from a branch (if configured), but generally submodules stick to specific commits until manually changed.

  • Each submodule is an independent git repo, so you can treat it like one: go into its folder, create branches, commit, push, etc., but remember to update the parent repo’s pointer if you want the parent to use those new commits.

When (not) to use submodules: Submodules are useful for pulling in external code without copying it, and keeping it versioned. However, they add complexity: developers need to remember to init/update them, and two repos’ workflows now intersect. For data engineers, a common use might be including a shared utils library across multiple projects, or maybe embedding a specific version of a third-party tool’s source for auditability. If the coupling isn’t strict, an alternative is to use package managers (pip, etc.) or service endpoints for shared resources. But if you need to vendor code with full history, submodules can work.

Submodules are considered an advanced feature – they can confuse new Git users. Weigh the pros/cons (ease of single-repo versus multiple) before using them (Git submodule | Atlassian). A quick recap: “Git submodules allow you to keep a Git repository as a subdirectory of another. They are simply a reference to another repository at a particular snapshot in time, enabling you to incorporate and track version history of external code within your project.” (Git submodule | Atlassian)

Git Large File Storage (LFS) for Big Data

Data engineers often deal with large data files, binaries, or machine learning models. Git LFS (Large File Storage) is an extension to Git that replaces large files in your repository with lightweight pointers, while storing the actual file contents on a remote server optimized for big files. If you find yourself needing to version datasets or other large artifacts, consider using Git LFS instead of normal Git.

For example, without LFS, adding a 100MB CSV to a repo will bloat the repo and every clone must download it. With LFS, that CSV would be replaced by a small text pointer in Git, and the actual 100MB content is stored on the LFS server (which may be provided by your Git host). When you clone or pull, LFS will download the big file separately (only if needed).

Git LFS was co-developed by GitHub and Atlassian, among others, to mitigate large file issues (Git LFS - large file storage | Atlassian Git Tutorial). To use it:

  • Install the git lfs client and run git lfs install (one-time per repo or machine).

  • Track specific extensions: e.g., git lfs track "*.csv" – this adds patterns to a .gitattributes file indicating those files should use LFS.

  • From then on, any new CSV you add will be stored via LFS (you’ll see a pointer file in Git instead of the actual content), and pushes upload the large content through LFS.

This keeps your repository lean and fast while still versioning large files. It’s worth noting that Git LFS requires server support (most major hosts support it, though there might be storage limits). If working with truly massive data (GBs), often it’s better to store data in proper data storage (cloud buckets, HDFS, etc.) and just keep references or scripts in Git, but LFS is a good middle-ground for moderately large files that need version tracking (like model binaries under 500MB, etc.).
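For context, this is roughly what those two artifacts look like on disk. The first line is the .gitattributes entry written by git lfs track "*.csv"; the second snippet is the shape of the pointer file Git commits in place of a tracked 100MB CSV (the digest shown is a placeholder, not real output):

```
# .gitattributes entry written by `git lfs track "*.csv"`:
*.csv filter=lfs diff=lfs merge=lfs -text

# Pointer file committed in place of the tracked CSV:
version https://git-lfs.github.com/spec/v1
oid sha256:<64-hex-digest-of-the-file>
size 104857600
```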


These study notes have covered Git from fundamentals to advanced topics, focusing on command-line usage which is the backbone of many data engineering workflows. We saw how to set up repos, manage branching and merging strategies, handle conflicts, rewrite history carefully, stash work for flexibility, use reflog to recover, tag releases, and even bring in external code or large files. We also touched on how Git underpins collaboration platforms (GitHub/GitLab/Bitbucket) and modern workflows with CI/CD (Jenkins, Airflow) and GitOps for infrastructure.

By mastering these Git commands and concepts, a data engineer can confidently version-control pipelines, collaborate with team members, and integrate with automation tools – ensuring that data projects are reproducible, auditable, and maintainable for the long run.

Sources: Git documentation and tutorials, including git init (Atlassian Git Tutorial); Git Branch (Atlassian Git Tutorial); Gitflow Workflow (Atlassian Git Tutorial); Merging vs. Rebasing (Atlassian Git Tutorial); From Changes to Safe Keeping: Git Stash (DEV Community); Git Reflog Configuration (Atlassian Git Tutorial); Tagger, Date, and Message fields missing in Webhook Payloads and the Rest API (Bitbucket Cloud, Atlassian Support); Git (Jenkins Plugins); Version Control For Ai Airflow Dag Api (Restackio); What is GitOps?