Introduction To Fine Tuning Pipelines For Data Engineers
In the world of AI, fine tuning might be one of the most complex topics to pursue, especially for beginners. We will break down a small implementation here to see what happens behind the scenes.
Hi there! Alejandro here 👋
Subscribe if you like to read about technical data & AI learnings and deep dives!
Enjoy the read and let me know in the comments what you think about it 👨🏻‍💻
Today I want to talk about one of the most cool-fancy-shiny concepts in AI: fine tuning.
If you've read some of my recent articles, you may have noticed that I keep saying "for Data Engineers" like a broken record 😅
Why?
Because the AI market is not asking for agents, RAG and a bunch of other cool things in isolation.
They need data.
High quality, consistent, robust and well defined data points that will feed the best AI use cases for present and future.
They say it's 10% AI and 90% just cleaning data.
That's why, more than ever, Data Engineers are the biggest enablers in many companies, and AI Engineers should align with them or learn their skills to make the "boring but critical 90%" more exciting.
So let's get started, but let's cover some basics first.
⚠️ Disclaimer: I am a Data Engineer who's lucky enough to learn & experiment with a bunch of things. You will see that my angle on this topic is fully focused on the data aspects rather than the complex technical ones, so consider that when going through the article.
What's Fine Tuning? (And What's Not)
Let's compare what we know first:
Prompts can help the model define behaviour temporarily; it's context-window based.
RAG can help models access the outside world when they need more than just their training dataset.
Fine Tuning changes the model itself, like rewiring its brain.
Right now, Claude, ChatGPT and Gemini are trained on millions of documents, websites, papers, etc. just to be helpful assistants.
They don't have a special skill per se; they just help with different tasks like writing, coding and languages, all the tasks that are compatible with token prediction and word generation.
But at some point, that's not enough.
I am trying longer deep dives. Any thoughts on the format?
When Should You Consider Fine Tuning?
You need a model to write in a specific style consistently, e.g. legal terminology for a specific niche.
You want to embed company- or industry-specific knowledge directly into the model, e.g. understanding documents in legal tech.
You need to handle large volumes of similar requests efficiently.
Doing Fine Tuning is expensive, so try to avoid it as long as you can.
Probably 90% of use cases can be covered with good prompts, adding API outputs to your prompts, or even giving tools to LLMs, before jumping into RAG or Fine Tuning, so don't overdo it.
Check out this one on RAG alternatives:
In simple words, don't use it if:
Prompt engineering with few-shot examples works well
RAG provides sufficient context and accuracy
Your use case requires frequently updated information
You don't have enough high-quality training data
Types of Fine Tuning
The idea is not to get too technical, but a quick introduction to some definitions makes the upcoming content easier to digest.
This guide covers Supervised Fine Tuning (SFT), which might be the most common way of doing fine tuning.
Train on input-output pairs to learn specific behaviors, styles, or domain knowledge.
This is the best option if you can provide consistent response patterns.
Then you have Reinforcement Learning from Human Feedback (RLHF), for which you use human feedback to align model behavior. It's more complex to implement and is typically used for safety and alignment.
There are others that we won't go through, but I suggest going over the OpenAI Docs to learn more about them.
Fine Tuning Use Case: "Clone" Your Substack Writing
I created a project on GitHub: Github - Fine Tuning For Data Engineers
It consists of different moving parts, but the TL;DR of what the project does is:
Parses RSS Feed → Fetches articles from any Substack
Cleans Content → Removes HTML, extracts clean text
Generates Instructions → Uses GPT to create diverse instruction prompts for each article
Creates Training Data → Builds instruction-response pairs in OpenAI format
Saves Data → Outputs training_data.jsonl in proper JSONL format
Fine Tunes Model → Automatically uploads to OpenAI and creates the fine tuning job
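To make those steps tangible, here's a minimal sketch of the first two, fetching and cleaning, assuming feedparser and BeautifulSoup as the parsing libraries. The function names are mine for illustration, not necessarily the repo's:

import feedparser
from bs4 import BeautifulSoup

def fetch_articles(feed_url: str) -> list[str]:
    # Parse the Substack RSS feed and return each article's raw HTML body
    feed = feedparser.parse(feed_url)
    return [entry.content[0].value for entry in feed.entries if "content" in entry]

def clean_content(html: str) -> str:
    # Strip tags and collapse the article into plain text
    return BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)

The remaining steps (instruction generation, JSONL formatting and job creation) are sketched further down.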
The important stuff to follow along is in the OpenAI Fine Tuning Documentation.
The Instruction Dataset
Let's go over points #3 and #4, which are the backbone of fine tuning.
When talking about SFT, we mentioned input-output pairs. This means that you give the model examples of the expected output for a specific input.
Overall, around 100 instruction pairs are usually recommended as a good baseline.
In the repo, this is the prompt behind it:
This is the same principle that the few-shot prompting technique is based on. You can read more about it here:
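To ground the instruction-generation step, here's a rough sketch of how you could ask GPT-4o mini to produce instruction prompts for an article. The prompt wording and function name are mine; the repo's actual prompt differs:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_instructions(article_text: str, n: int = 5) -> list[str]:
    # Ask GPT to invent n diverse instructions that the article would answer well
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "You create diverse instruction prompts for fine tuning datasets.",
            },
            {
                "role": "user",
                "content": f"Write {n} varied instructions that the article below "
                           f"would be a good answer to, one per line.\n\n{article_text[:4000]}",
            },
        ],
    )
    lines = response.choices[0].message.content.splitlines()
    return [line.strip("- ").strip() for line in lines if line.strip()]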
OpenAI expects the fine tuning data in JSONL format, for example:
{
  "messages": [
    {
      "role": "user",
      "content": "Explain the core technologies in data engineering and their evolving roles in modern data pipelines."
    },
    {
      "role": "assistant",
      "content": "I have been working with Snowflake, dbt, PostgreSQL, Python, Airflow, and AWS tools for approximately 3 years. The most challenging part of this journey has been always feeling as though I was falling behind. The list (and FOMO) never ends..."
    }
  ]
}
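One detail worth calling out: JSONL means one JSON object per line, so the pretty-printed example above is for readability only. A minimal writer could look like this (save_jsonl is my name for it, not necessarily the repo's):

import json

def save_jsonl(pairs: list[tuple[str, str]], path: str = "training_data.jsonl") -> None:
    # Write one {"messages": [...]} object per line, the structure OpenAI expects
    with open(path, "w", encoding="utf-8") as f:
        for instruction, answer in pairs:
            record = {
                "messages": [
                    {"role": "user", "content": instruction},
                    {"role": "assistant", "content": answer},
                ]
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")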
The Fine Tuning Job
We run the full pipeline, which parses the Substack content, generates the training data with the instruction dataset, saves it in JSONL format and creates a fine tuning job on OpenAI.
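For reference, that last step boils down to two calls with the OpenAI Python SDK: upload the file, then start the job. The base model below is just an example; check the docs for current options:

from openai import OpenAI

client = OpenAI()

# 1. Upload the training file with the fine-tune purpose
upload = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune",
)

# 2. Create the fine tuning job against a base model
job = client.fine_tuning.jobs.create(
    training_file=upload.id,
    model="gpt-4o-mini-2024-07-18",  # example base model
)
print(job.id, job.status)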
Here's how it looks in the repo:
You end up with something like this:
⚠️ I have a tendency to use overly aggressive metaphors and had to add an extra step to clean them up, because OpenAI was rejecting my samples due to policy violations 😅
And this is how it looks in the OpenAI interface when the job runs properly:
Then you can go into the playground, add tools and prompts, and start playing around with it.
You can also call it from the API!
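Calling it from the API works like any other chat completion, just pointing at your fine tuned model id (the ft:... id below is a placeholder; use the one your job outputs):

from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="ft:gpt-4o-mini-2024-07-18:personal::abc123",  # placeholder model id
    messages=[
        {"role": "user", "content": "Explain the core technologies in data engineering."}
    ],
)
print(completion.choices[0].message.content)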
Monitoring Training Progress
After submitting your fine tuning job, monitor the training metrics in OpenAI's interface.
What Good Training Looks Like:
Accuracy increases consistently
Loss decreases and stabilizes
Some volatility is normal as the model adjusts
Final train loss around 0.8-1.2 is typical for successful fine tuning
Accuracy Chart: Shows model performance improving over training steps. You want to see an upward trend from ~45% to 80-90%, indicating the model is learning your data patterns.
Loss Chart: Training loss should decrease over time (e.g., from 2.7 to 0.6-1.0). Lower loss means better prediction accuracy on your training data.
If you see accuracy plateauing early or loss increasing, you may need more diverse training data or better instruction quality.
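If you prefer code over dashboards, the same signals can be polled from the API. A minimal sketch, with the polling interval as an arbitrary choice:

import time

from openai import OpenAI

client = OpenAI()

def wait_for_job(job_id: str) -> None:
    # Poll the job until it reaches a terminal state
    while True:
        job = client.fine_tuning.jobs.retrieve(job_id)
        print(f"status: {job.status}")
        if job.status in ("succeeded", "failed", "cancelled"):
            break
        time.sleep(60)
    # Recent events include step-level messages such as train loss updates
    events = client.fine_tuning.jobs.list_events(fine_tuning_job_id=job_id, limit=10)
    for event in events.data:
        print(event.message)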
⚠️ Disclaimers
100 high-quality examples often outperform 1000 mediocre ones.
A good fine tuned model with a poor system prompt is worthless. Experiment until you find the sweet spot.
Fine tuning and inference costs add up. Make sure simpler approaches (prompting, RAG) won't solve your problem first.
You will need to iterate, add more samples, and find the right instruction datasets to eventually make it work the way you want.
As you might guess by this point, beware of jumping into this if you can solve it with good prompting techniques like few-shot, or even with a simple RAG strategy.
This is not only expensive but also complex in many ways, since we are changing the way the model was trained and affecting its foundation; if not done properly, the results will be even worse than regular models with bad prompts.
This is just the tip of the fine tuning iceberg, based on a simple use case & experimentation. Hope it helps you gain more awareness of the fine tuning hype 💡
📝 TL;DR
Only fine tune if prompting/RAG won't work - for consistent style, domain patterns, or embedded knowledge
Build a data pipeline: Parse content → Clean HTML/text → Generate custom instructions with GPT-4o mini → Format as instruction-response pairs
Quality > quantity: 100 good examples beat 1000 mediocre ones
Monitor training: Look for accuracy increasing (45% → 80%+) and loss decreasing (2.7 → 0.8)
Iterate: Test in playground, measure performance, refine data and instructions
Watch for gotchas: OpenAI content policy, cost implications, need for good system prompts
If you enjoyed the content, hit the like ❤️ button, share, comment, repost, and all those nice things people do when they like stuff these days. Glad to know you made it to this part!
Hi, I am Alejandro Aboy. I am currently working as a Data Engineer. I started in digital marketing at 19, gaining experience in website tracking, advertising, and analytics, and I also founded my own agency. In 2021, I found my passion for data engineering, so I shifted my career focus despite lacking a CS degree. I'm now pursuing this path, leveraging my diverse experience and willingness to learn.