Intro To Claude Code For Data Engineers (Skills, MCPs & Hooks for the Modern Data Stack)
MCP, Skills, and Hooks turn Claude Code into a control layer for the entire data stack. Here's the map.
Hi there! Alejandro here 👋
Subscribe if you like to read about technical data & AI learnings and deep dives!
Enjoy the reading, and let me know in the comments what you think about it 👨🏻‍💻
📌 TL;DR
Claude Code connects to the entire data stack through MCP servers, Skills, and Hooks. Orchestration, modeling, databases, quality, docs.
Data modeling has the most mature integrations: dbt Agent Skills with benchmarked accuracy gains and OpenMetadata MCP for impact analysis.
MCP Data Toolbox gives you a single configuration layer for 30+ databases instead of managing one MCP per database.
The real power is chaining: with multiple MCPs running in one session, you can build custom use cases that match your own workflows.
Coming Soon: Hands-On Claude Code for Data Engineers: Data Modeling with dbt, Miro & PostgreSQL Using Skills & MCPs
🧩 Across the Data Stack
Data engineers already work across 5-10 tools daily. Airflow for orchestration, dbt for transformations, Snowflake or BigQuery for warehousing, pytest for quality, Git for version control.
Claude Code can help here, since it connects those tools through three mechanisms:
MCP servers connect Claude Code to external tools in real time. Databases, orchestrators, metadata catalogs.
Skills encode best practices that Claude follows automatically. dbt modeling patterns, testing strategies, migration guides.
Hooks trigger actions before or after Claude does something. Run pytest before every commit, lint SQL on save.
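Concretely, MCP servers can be registered in a project-level `.mcp.json` file. Here's a minimal sketch of the shape, assuming a Postgres server run via `npx` (the connection string is a placeholder for your own database):

```json
{
  "mcpServers": {
    "postgres": {
      "command": "npx",
      "args": [
        "-y",
        "@modelcontextprotocol/server-postgres",
        "postgresql://localhost:5432/analytics"
      ]
    }
  }
}
```

Once configured, Claude Code picks the server up at session start and exposes its tools alongside the built-in ones.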
I've been tracking the integrations we'll discuss, and here's where things stand. Not all of them are equally mature: some are production-ready, others are early experiments.
Recommended: The Non-Coder's Guide to Claude Code if you need the fundamentals. This article goes deeper into data-specific integrations.
Letâs get started!
🛠️ Deep Dive: Data Modeling with Skills and MCPs
Data modeling is where Claude Code has the most mature ecosystem for data engineers. Two angles here: Skills that teach Claude how to model, and MCPs that give it access to metadata.
dbt Agent Skills
dbt Agent Skills encode analytics engineering workflows that Claude follows as instructions. Some of them require dbt Cloud, but here are the ones working with dbt Core:
Works with Core/OSS:
- `using-dbt-for-analytics-engineering` - build and modify models, debug errors, explore sources
- `adding-dbt-unit-test` - unit tests and test-driven development
- `running-dbt-commands` - CLI commands with correct flags and selectors
- `fetching-dbt-docs` - look up dbt documentation efficiently
Cloud-only: semantic layer building, natural language queries, job troubleshooting, Fusion migration.
The bottom line is that they improve documentation by default, help you write all dbt semantics and metadata much faster, and keep your project aligned with dbt's latest best practices.
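Under the hood, a skill is just a `SKILL.md` file with YAML frontmatter plus instructions, living in `.claude/skills/<name>/`. Here's an illustrative sketch of a hypothetical dbt-conventions skill (the name and rules are examples, not the actual dbt Agent Skills content):

```markdown
---
name: dbt-model-conventions
description: Conventions for creating or modifying dbt models in this project
---

When creating or modifying a dbt model:

1. Stage raw sources in `models/staging/` as `stg_<source>__<entity>.sql`.
2. Add every new model to a `schema.yml` entry with a description and at
   least `unique` + `not_null` tests on its primary key.
3. Run `dbt build --select <model>+` before considering the change done.
```

Claude loads the skill's instructions automatically when the task matches its description.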
Recommended: What's the real deal about SKILLs (This is not an "MCP is dead" post)
OpenMetadata MCP for Impact Analysis
The other side of data modeling is understanding what happens when you change something. OpenMetadata MCP connects Claude to your metadata catalog, so before modifying a dbt model, Claude can check what dashboards depend on it, who owns the downstream data, and whether there's PII involved.
Recommended: End To End Agentic Data Modeling - Using AI and OpenMetadata MCP where we built a full end-to-end setup with PostgreSQL, dbt, Metabase, and OpenMetadata. The use cases there (impact analysis on schema changes, lineage exploration, data discovery) are exactly what this MCP enables inside Claude Code.
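Recent OpenMetadata releases expose an MCP endpoint on the server itself, so wiring it into Claude Code looks roughly like this — the URL, token, and exact endpoint path are placeholders that depend on your deployment and version:

```json
{
  "mcpServers": {
    "openmetadata": {
      "type": "http",
      "url": "http://localhost:8585/mcp",
      "headers": {
        "Authorization": "Bearer <YOUR_OPENMETADATA_TOKEN>"
      }
    }
  }
}
```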
⚡ Hooks: Automating Quality Gates
Hooks are the part of Claude Code that most data engineers sleep on. They run shell commands before or after Claude does something. Think of them as CI/CD, but inside your AI workflow.
Three hooks that make a real difference:
pytest before every commit
Claude writes a transformation function, stages it, and tries to commit. The hook runs your test suite first. If tests fail, the commit doesn't happen. No bad code gets through.
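To make the gate concrete, here's a minimal sketch of the kind of test suite such a hook would run. `clean_orders` is a hypothetical transformation, not from any real project:

```python
# Hypothetical transformation the hook would guard:
# normalize raw order records before loading.
def clean_orders(rows):
    """Drop rows missing an id, lowercase emails, dedupe by id (keep first)."""
    seen, out = set(), []
    for row in rows:
        oid = row.get("order_id")
        if oid is None or oid in seen:
            continue
        seen.add(oid)
        out.append({**row, "email": row.get("email", "").lower()})
    return out


def test_clean_orders_dedupes_and_normalizes():
    raw = [
        {"order_id": 1, "email": "A@X.COM"},
        {"order_id": 1, "email": "dup@x.com"},   # duplicate id, dropped
        {"order_id": None, "email": "b@x.com"},  # missing id, dropped
    ]
    cleaned = clean_orders(raw)
    assert len(cleaned) == 1
    assert cleaned[0]["email"] == "a@x.com"
```

If this test fails, the hook aborts the commit before it ever reaches your branch.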
Claude Code doesn't have a `PreCommit` hook event; the closest equivalent is a `PreToolUse` hook that intercepts `Bash` calls and runs the suite when the command is a `git commit` (hooks receive the tool input as JSON on stdin, and exit code 2 blocks the tool call):

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          {
            "type": "command",
            "command": "cmd=$(jq -r '.tool_input.command'); if echo \"$cmd\" | grep -q 'git commit'; then python3 -m pytest tests/ -x -q || exit 2; fi"
          }
        ]
      }
    ]
  }
}
```

SQL lint on file save
Every time Claude writes or modifies a `.sql` file, sqlfluff checks it against your team's style rules. Catches formatting issues, naming conventions, anti-patterns before they reach code review.
Hooks don't expose a `$FILE_PATH` environment variable; the edited path arrives in the stdin JSON, so it has to be extracted with something like `jq`:

```json
{
  "hooks": [
    {
      "matcher": "Write|Edit",
      "hooks": [
        {
          "type": "command",
          "command": "f=$(jq -r '.tool_input.file_path'); case \"$f\" in *.sql) sqlfluff lint --dialect snowflake \"$f\";; esac"
        }
      ]
    }
  ]
}
```

dbt test after model changes
Claude creates or modifies a dbt model. The hook automatically runs `dbt test` on that model. If a schema test or data test fails, you know immediately instead of finding out in production.
```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Write|Edit",
        "hooks": [
          {
            "type": "command",
            "command": "f=$(jq -r '.tool_input.file_path'); case \"$f\" in */models/*.sql) dbt test --select \"$(basename \"$f\" .sql)\";; esac"
          }
        ]
      }
    ]
  }
}
```

The pattern here is simple. Don't trust the AI to remember quality checks. Automate them so they can't be skipped.
Pro Tip: Ask Claude Code to create a "validation" skill that runs before every new commit or edit, combining all of these checks into one.
Docs: https://code.claude.com/docs/en/hooks
Recommended: Pytest for Data - Fixtures, Mocks, CI-CD, and Claude Code Hooks for the full setup.
🗄️ Deep Dive: MCP Data Toolbox
Direct database access via MCP is the number one use case I see. But if your team uses PostgreSQL, BigQuery, and Snowflake, do you really want to manage three separate MCPs?
The Unified Approach
MCP Data Toolbox is Google's open-source unified layer. It supports 30+ databases through a single `tools.yaml` configuration.
What it gives you:
Schema discovery - Claude can explore tables, columns, and relationships across databases
Query execution - read and write operations with connection pooling
Result caching - avoid repeated expensive queries
It sits between Claude Code and your databases as a control plane. Instead of one MCP per database, you configure all connections in one place.
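A `tools.yaml` sketch for a single Postgres source, following the Toolbox configuration layout (the source name, credentials, and query are placeholders for your own setup):

```yaml
sources:
  analytics-pg:
    kind: postgres
    host: 127.0.0.1
    port: 5432
    database: analytics
    user: ${DB_USER}
    password: ${DB_PASSWORD}

tools:
  list-recent-orders:
    kind: postgres-sql
    source: analytics-pg
    description: List the most recent orders for a given customer.
    parameters:
      - name: customer_id
        type: integer
        description: Customer to filter on.
    statement: |
      SELECT order_id, amount, created_at
      FROM orders
      WHERE customer_id = $1
      ORDER BY created_at DESC
      LIMIT 20;

toolsets:
  default:
    - list-recent-orders
```

Adding a second warehouse is another entry under `sources` and a few more tools, rather than a whole new MCP server.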
When to Use Individual MCPs Instead
Data Toolbox covers the common ground. But some databases have features that go beyond SQL. Snowflake MCP for Cortex (RAG, semantic modeling, agentic orchestration), PostgreSQL MCP Pro for health analysis and index recommendations, MotherDuck MCP for local dev and prototyping.
What I'd recommend for most teams: Data Toolbox as the default, plus one specialized MCP for your primary warehouse.
🔄 Orchestration
Airflow MCP lives under Astronomer's GitHub but works with any open-source Apache Airflow instance (2.x and 3.x). No Astronomer account needed. It connects via the standard Airflow REST API, so just set AIRFLOW_API_URL, AIRFLOW_USERNAME, and AIRFLOW_PASSWORD and you're set.
The Airflow MCP tools I find most useful:
- `explore_dag` - get DAG metadata, tasks, recent runs, and source code in one call
- `diagnose_dag_run` - debug a failed run with task details and logs without opening the UI
- `get_system_health` - health status, import errors, warnings, DAG stats
Beyond read-only monitoring, it also handles DAG management (trigger, pause/unpause), pool and variable management, and connection handling. Pair it with Context7 so Claude uses the right operator signatures for your Airflow version when generating DAGs.
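Registering it in `.mcp.json` looks roughly like this. The `airflow-mcp` command name is an assumption on my part — check the Astronomer repo for the actual install instructions — and the credentials are placeholders:

```json
{
  "mcpServers": {
    "airflow": {
      "command": "uvx",
      "args": ["airflow-mcp"],
      "env": {
        "AIRFLOW_API_URL": "http://localhost:8080",
        "AIRFLOW_USERNAME": "admin",
        "AIRFLOW_PASSWORD": "admin"
      }
    }
  }
}
```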
Astronomer recently launched Airflow Skills. While some of them are Astronomer-exclusive, others are compatible with open-source Airflow setups, so it's worth taking a look.
🤯 Bonus: From Conceptual to Physical Models in One Session
If you design conceptual data models on Miro when drafting your dimensional and fact tables, you can wire that up to Claude Code with the Miro MCP. Browse existing boards, generate entity relationship diagrams from descriptions, and iterate on schemas visually.
But the real value is chaining it with everything else:
Read business requirements from a Jira or Notion MCP.
Explore raw database tables with MCP Data Toolbox to understand what's available.
Draft and iterate schemas visually on Miro MCP.
Generate the physical dbt models using dbt Agent Skills, fully documented and following your project conventions.
Run impact analysis with OpenMetadata MCP before merging.
Recommended: How AI Can Improve Your Data Modeling (And What's Still Missing) where we explored the gap between conceptual design and physical implementation.
📚 Docs-as-MCP
This is not only for Data Engineering projects but for most coding projects requiring documentation.
Claude will hallucinate API signatures when its training data doesn't match your tool versions.
You can inject current docs into context: use Context7, use gitingest to convert any repo into a queryable digest, or use tools like docs-mcp-server to make sure Claude works with the right API signatures.
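As one example, Context7 ships as an MCP server that can be registered like any other — a sketch assuming the `npx` distribution of `@upstash/context7-mcp`:

```json
{
  "mcpServers": {
    "context7": {
      "command": "npx",
      "args": ["-y", "@upstash/context7-mcp"]
    }
  }
}
```

With it in place, you can ask Claude to pull version-accurate docs for a library before generating code against it.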
Final Words
Now that we have the tools, we still need to figure out the best way to make them work together in our favor.
There's no single correct approach; the only valid one is the one that works for you!
If you enjoyed the content, hit the like ❤️ button, share, comment, repost, and all those nice things people do when they like stuff these days. Glad to know you made it to this part!
Hi, I am Alejandro Aboy. I am currently working as a Data Engineer. I started in digital marketing at 19. I gained experience in website tracking, advertising, and analytics. I also founded my agency. In 2021, I found my passion for data engineering. So, I shifted my career focus, despite lacking a CS degree. I'm now pursuing this path, leveraging my diverse experience and willingness to learn.



