Leading large data engineering teams in the age of LLMs raises existential questions about the future of the discipline, given its fragmented tooling ecosystem. Will we reach a state of “Arcadia”, a beacon of automation where every facet of data movement and processing is handled by precise AI driven by natural-language prompts, with no trace of today’s primitive data engineering tools or laborious manual input? Or will our tools evolve into federated agents that combine our input and precision with AI to amplify our productivity? Whether that prophecy comes to fruition remains to be seen. This is a developing story with new chapters still being written, but some basic ground truths will continue to hold –
- Tooling – ETL/ELT pipelines still need to process massive quantities of unstructured data at scale, and the infusion of LLMs means data engineers will adapt to new embedding paradigms and vector databases.
- Quality – With the current generation of LLMs prone to hallucinating at Bilbo Baggins’ levels, data quality becomes even more crucial as LLMs interact with data in new ways. Annotation and validation of data to establish ground truths for high-quality model input is key.
- Productivity and Innovation – The key drivers for data engineers adopting LLMs will be solving complex problems, quantifying the business value of data, and ensuring the efficiency of pipelines.
- Security – Protecting sensitive data during the inference flow, and ensuring the output is not colored by political leanings, trends, or interference, means techniques like tokenization, encryption, and differential privacy are key to implement in a data ecosystem.
- Privacy – Privacy and data protection remain key concerns, requiring data engineers to establish clear guardrails and appropriate access controls. Calls to external endpoints can expose holes in secure networks, an understandable worry in large organizations where non-technical leadership at the decision-making levels may not fully grasp the adequacies or inadequacies of their environments.
- Data Harmonization – LLMs trained on multimodal information (text, images, audio, video) will drive demand for data engineering techniques like multimodal feature engineering and entity resolution to ensure interoperability.
- Cost – Cost management is critical, as infrastructure can quickly become expensive. Massive datasets and compute-intensive workloads demand scalable, performant architectures. Capital expenditure should be justified with realistic forecasts of tangible returns rather than AI for the sake of AI.
- Continuous Learning – Collaboration between data engineers, AI scientists, and business stakeholders is essential for successful implementation. With the field evolving rapidly, a culture of continuous learning and development is key to maintaining competitive advantage and building innovative solutions.
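
A few of these points can be made concrete. On the tooling front, the embed-and-retrieve loop behind vector databases can be sketched in a few lines. The `embed` function below is a toy trigram-hashing stand-in for a real embedding model, and `VectorStore` is a hypothetical in-memory store; a production pipeline would call a model endpoint and a dedicated vector database instead.

```python
import hashlib
import math

DIM = 64  # toy embedding dimensionality

def embed(text: str) -> list[float]:
    # Toy stand-in for an embedding model: hash character trigrams
    # into a fixed-size vector, then L2-normalize it.
    vec = [0.0] * DIM
    for i in range(len(text) - 2):
        h = int(hashlib.md5(text[i : i + 3].encode()).hexdigest(), 16)
        vec[h % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

class VectorStore:
    """Minimal in-memory vector store with cosine-similarity search."""

    def __init__(self):
        self.items: list[tuple[str, list[float]]] = []

    def add(self, doc: str) -> None:
        self.items.append((doc, embed(doc)))

    def search(self, query: str, k: int = 1) -> list[str]:
        q = embed(query)
        # Vectors are normalized, so the dot product is cosine similarity.
        scored = sorted(
            self.items,
            key=lambda it: -sum(a * b for a, b in zip(q, it[1])),
        )
        return [doc for doc, _ in scored[:k]]

store = VectorStore()
store.add("invoice totals by region for Q3")
store.add("employee onboarding checklist")
print(store.search("regional invoice totals"))
```

The shape of the workflow (embed on ingest, embed the query, rank by similarity) is the part that carries over to real systems.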
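
For the quality point, validation rules that gate annotated records before they become model input might look like this minimal sketch. The field names and label set are illustrative assumptions; production teams would more likely reach for a framework such as Great Expectations or pydantic.

```python
def validate_record(rec: dict) -> list[str]:
    # Hypothetical validation rules for a labeled training record;
    # returns a list of problems (empty means the record is clean).
    errors = []
    if not rec.get("text", "").strip():
        errors.append("empty text")
    if rec.get("label") not in {"positive", "negative", "neutral"}:
        errors.append("unknown label")
    return errors

labeled = [
    {"text": "Great product", "label": "positive"},
    {"text": "", "label": "positive"},
    {"text": "Meh", "label": "mixed"},
]
# Only fully valid rows reach the model; the rest go back for re-annotation.
clean = [r for r in labeled if not validate_record(r)]
print(len(clean))
```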
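
Finally, entity resolution, mentioned under data harmonization, can be sketched as fuzzy matching plus greedy clustering. The `same_entity` heuristic and the 0.85 threshold are illustrative assumptions; real systems add blocking, multi-field scoring, and proper clustering.

```python
from difflib import SequenceMatcher

def same_entity(a: dict, b: dict, threshold: float = 0.85) -> bool:
    # Fuzzy name comparison as a stand-in for full pairwise scoring.
    score = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return score >= threshold

def resolve(records: list[dict]) -> list[list[dict]]:
    # Greedy clustering: attach each record to the first cluster whose
    # representative it matches, otherwise start a new cluster.
    clusters: list[list[dict]] = []
    for rec in records:
        for cluster in clusters:
            if same_entity(cluster[0], rec):
                cluster.append(rec)
                break
        else:
            clusters.append([rec])
    return clusters

rows = [
    {"name": "Acme Corp"},
    {"name": "ACME Corp."},
    {"name": "Globex Inc"},
]
print(len(resolve(rows)))  # the two Acme variants collapse into one cluster
```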