Introduction To Fine Tuning Pipelines For Data Engineers
In the world of AI, fine tuning might be one of the most complex topics to pursue, especially for beginners. We will break down a small implementation here to see what happens behind the scenes.
Hi there! Alejandro here 👋
Subscribe if you like to read about technical data & AI learnings and deep dives!
Enjoy the read and let me know in the comments what you think about it 👨🏻‍💻
Today I want to talk about one of the most cool-fancy-shiny concepts in AI: fine tuning.
If you've read some of my recent articles, you may have noticed that I keep saying "for Data Engineers" like a broken record 😅
Why?
Because the AI market is not asking for agents, RAG and a bunch of other cool things in isolation.
They need data.
High quality, consistent, robust and well defined data points that will feed the best AI use cases for present and future.
They say it's 10% AI and 90% just cleaning data.
That's why, more than ever, Data Engineers are the biggest enablers in many companies, and AI Engineers should align with them or learn their skills to make the "boring but critical 90%" more exciting.
So let's get started, but let's cover some basics first.
⚠️ Disclaimer: I am a Data Engineer who's lucky enough to learn & experiment with a bunch of things. You will see that my angle on this topic is fully focused on the data aspects rather than the complex technical ones, so consider that when going through the article.
What's Fine Tuning? (And What's Not)
Let's compare what we know first:
Prompts can help the model define behaviour temporarily; it's context-window based.
RAG can help models access the outside world when they need more than just their training dataset.
Fine Tuning changes the model itself, like rewiring its brain.
Right now, Claude, ChatGPT and Gemini are trained on millions of documents, websites, papers, etc. just to be helpful assistants.
They don't have a special skill per se; they just help with different tasks like writing, coding and languages, all the tasks that are compatible with token prediction and word generation.
But at some point, that's not enough.
I am trying longer deep dives. Any thoughts on the format?
When Should You Consider Fine Tuning?
You need a model to write in a specific style consistently, e.g. legal terminology for a specific niche.
You want to embed company- or industry-specific knowledge directly into the model, e.g. understanding documents in legal tech.
You need to handle large volumes of similar requests efficiently.
Doing Fine Tuning is expensive, so try to avoid it as long as you can.
Probably 90% of use cases can be covered with good prompts, adding API outputs to your prompts, or even giving tools to LLMs, before jumping into RAG or Fine Tuning, so don't overdo it.
Check out this one on RAG alternatives:
In simple words, don't use it if:
Prompt engineering with few-shot examples works well
RAG provides sufficient context and accuracy
Your use case requires frequently updated information
You don't have enough high-quality training data
Types of Fine Tuning
The idea is not to get too technical, but a quick introduction to some definitions makes the upcoming content easier to digest.
This guide covers Supervised Fine Tuning (SFT), which might be the most common way of doing fine tuning.
Train on input-output pairs to learn specific behaviors, styles, or domain knowledge.
This is the best option if you can provide consistent response patterns.
Then you have Reinforcement Learning from Human Feedback (RLHF), for which you use human feedback to align model behavior. It's more complex to implement and is typically used for safety and alignment.
There are others that we won't go through, but I suggest going over the OpenAI Docs to learn more about them.
Fine Tuning Use Case: "Clone" Your Substack Writing
I created a project on GitHub: Github - Fine Tuning For Data Engineers
It consists of different moving parts, but the TL;DR of what the project does is:
Parses RSS Feed → Fetches articles from any Substack
Cleans Content → Removes HTML, extracts clean text
Generates Instructions → Uses GPT to create diverse instruction prompts for each article
Creates Training Data → Builds instruction-response pairs in OpenAI format
Saves Data → Outputs training_data.jsonl in proper JSONL format
Fine Tunes Model → Automatically uploads to OpenAI and creates the fine tuning job
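To make those steps tangible, here's a minimal sketch of the first two, fetching and cleaning, assuming feedparser and BeautifulSoup as the parsing libraries. The function names are mine for illustration, not necessarily the repo's:

import feedparser
from bs4 import BeautifulSoup

def fetch_articles(feed_url: str) -> list[str]:
    # Parse the Substack RSS feed and return each article's raw HTML body
    feed = feedparser.parse(feed_url)
    return [entry.content[0].value for entry in feed.entries if "content" in entry]

def clean_content(html: str) -> str:
    # Strip tags and collapse the article into plain text
    return BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)

The remaining steps (instruction generation, JSONL formatting and job creation) are sketched further down.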
The important stuff to follow along is in the OpenAI Fine Tuning Documentation.
The Instruction Dataset
Let's go over points #3 and #4, which are the backbone of fine tuning.
When talking about SFT, we mentioned input-output pairs. This means that you give the model examples of the expected output for a specific input.
Overall, around 100 instruction pairs are usually recommended as a good baseline.
In the repo, this is the prompt behind it:
This is the same principle that the few-shot prompting technique is based on. You can read more about it here:
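To ground the instruction-generation step, here's a rough sketch of how you could ask GPT-4o mini to produce instruction prompts for an article. The prompt wording and function name are mine; the repo's actual prompt differs:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_instructions(article_text: str, n: int = 5) -> list[str]:
    # Ask GPT to invent n diverse instructions that the article would answer well
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "You create diverse instruction prompts for fine tuning datasets.",
            },
            {
                "role": "user",
                "content": f"Write {n} varied instructions that the article below "
                           f"would be a good answer to, one per line.\n\n{article_text[:4000]}",
            },
        ],
    )
    lines = response.choices[0].message.content.splitlines()
    return [line.strip("- ").strip() for line in lines if line.strip()]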
OpenAI expects the fine tuning data in JSONL format, for example:
{
  "messages": [
    {
      "role": "user",
      "content": "Explain the core technologies in data engineering and their evolving roles in modern data pipelines."
    },
    {
      "role": "assistant",
      "content": "I have been working with Snowflake, dbt, PostgreSQL, Python, Airflow, and AWS tools for approximately 3 years. The most challenging part of this journey has been always feeling as though I was falling behind. The list (and FOMO) never ends..."
    }
  ]
}
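One detail worth calling out: JSONL means one JSON object per line, so the pretty-printed example above is for readability only. A minimal writer could look like this (save_jsonl is my name for it, not necessarily the repo's):

import json

def save_jsonl(pairs: list[tuple[str, str]], path: str = "training_data.jsonl") -> None:
    # Write one {"messages": [...]} object per line, the structure OpenAI expects
    with open(path, "w", encoding="utf-8") as f:
        for instruction, answer in pairs:
            record = {
                "messages": [
                    {"role": "user", "content": instruction},
                    {"role": "assistant", "content": answer},
                ]
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")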
The Fine Tuning Job
We run the full pipeline, which parses the Substack content, generates the training data with the instruction dataset, saves it in JSONL format and creates a fine tuning job on OpenAI.
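For reference, that last step boils down to two calls with the OpenAI Python SDK: upload the file, then start the job. The base model below is just an example; check the docs for current options:

from openai import OpenAI

client = OpenAI()

# 1. Upload the training file with the fine-tune purpose
upload = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune",
)

# 2. Create the fine tuning job against a base model
job = client.fine_tuning.jobs.create(
    training_file=upload.id,
    model="gpt-4o-mini-2024-07-18",  # example base model
)
print(job.id, job.status)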
Here's how it looks in the repo:
You end up with something like this:
⚠️ I have a tendency to use overly aggressive metaphors and had to add an extra step to clean them up, because OpenAI was rejecting my samples due to policy violations 😅
And this is how it looks in the OpenAI interface when the job runs properly:
Then you can go into the playground, add tools and prompts, and start playing around with it.
You can also call it from the API!
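Calling it from the API works like any other chat completion, just pointing at your fine tuned model id (the ft:... id below is a placeholder; use the one your job outputs):

from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="ft:gpt-4o-mini-2024-07-18:personal::abc123",  # placeholder model id
    messages=[
        {"role": "user", "content": "Explain the core technologies in data engineering."}
    ],
)
print(completion.choices[0].message.content)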
Monitoring Training Progress
After submitting your fine tuning job, monitor the training metrics in OpenAI's interface.
What Good Training Looks Like:
Accuracy increases consistently
Loss decreases and stabilizes
Some volatility is normal as the model adjusts
Final train loss around 0.8-1.2 is typical for successful fine tuning
Accuracy Chart: Shows model performance improving over training steps. You want to see an upward trend from ~45% to 80-90%, indicating the model is learning your data patterns.
Loss Chart: Training loss should decrease over time (e.g., from 2.7 to 0.6-1.0). Lower loss means better prediction accuracy on your training data.
If you see accuracy plateauing early or loss increasing, you may need more diverse training data or better instruction quality.
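If you prefer code over dashboards, the same signals can be polled from the API. A minimal sketch, with the polling interval as an arbitrary choice:

import time

from openai import OpenAI

client = OpenAI()

def wait_for_job(job_id: str) -> None:
    # Poll the job until it reaches a terminal state
    while True:
        job = client.fine_tuning.jobs.retrieve(job_id)
        print(f"status: {job.status}")
        if job.status in ("succeeded", "failed", "cancelled"):
            break
        time.sleep(60)
    # Recent events include step-level messages such as train loss updates
    events = client.fine_tuning.jobs.list_events(fine_tuning_job_id=job_id, limit=10)
    for event in events.data:
        print(event.message)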
⚠️ Disclaimers
100 high-quality examples often outperform 1000 mediocre ones.
A good fine tuned model with a poor system prompt is worthless. Experiment until you find the sweet spot.
Fine tuning and inference costs add up. Make sure simpler approaches (prompting, RAG) won't solve your problem first.
You will need to iterate, add more samples, and find the right instruction datasets to eventually make it work the way you want.
As you might guess by this point, beware of jumping into this if you can solve it with good prompting techniques like few-shot, or even with a simple RAG strategy.
This is not only expensive but also complex in many ways, since we are changing the way the model was trained and affecting its foundation; if not done properly, the results will be even worse than regular models with bad prompts.
This is just the tip of the fine tuning iceberg, based on a simple use case & experimentation. Hope it helps you gain more awareness of the fine tuning hype 💡
📝 TL;DR
Only fine tune if prompting/RAG won't work - for consistent style, domain patterns, or embedded knowledge
Build a data pipeline: Parse content → Clean HTML/text → Generate custom instructions with GPT-4o mini → Format as instruction-response pairs
Quality > quantity: 100 good examples beat 1000 mediocre ones
Monitor training: Look for accuracy increasing (45% → 80%+) and loss decreasing (2.7 → 0.8)
Iterate: Test in playground, measure performance, refine data and instructions
Watch for gotchas: OpenAI content policy, cost implications, need for good system prompts
If you enjoyed the content, hit the like ❤️ button, share, comment, repost, and all those nice things people do when they like stuff these days. Glad to know you made it to this part!
Hi, I am Alejandro Aboy. I am currently working as a Data Engineer. I started in digital marketing at 19, gaining experience in website tracking, advertising, and analytics, and I also founded my own agency. In 2021, I found my passion for data engineering, so I shifted my career focus despite lacking a CS degree. I'm now pursuing this path, leveraging my diverse experience and willingness to learn.