By Liqueo Senior Consultant, Arun Muttu.
The number of data engineers has more than doubled in the last ten years, and this growth shows no sign of slowing. We expect data engineering roles to overtake data science roles within the next ten years, driven by the same force behind data science's growth over that period: the almost universal recognition that actionable insights can be derived from cross-referencing unrelated data sets.
So where will all this data engineering talent come from? Some will come from those training directly in the discipline, not least because of the worldwide emphasis on coding languages such as Python, already among the most popular programming languages. However, the demand for data engineering talent will far outstrip the supply from academia. To make up the shortfall, firms will look to disciplines that have an organic relationship with data engineering. Database administrators are a natural fit, given their affinity with data structures and the technical challenges of managing large data sets. Others will come from application development, as demand for bespoke applications wanes. And although they are less inclined towards data structures, data scientists (and to some extent, app developers) will often be very familiar with data manipulation.
These disciplines feed into solving the two main challenges of big data: scale and complexity. Database professionals have long understood the need to store data in structures that scale well over time. Data scientists are well aware of the complexity of relating seemingly disparate data sets to derive previously undetected insights.
The core outcomes of data engineering are twofold: to make data accessible throughout the business, and to “productionise” algorithms that produce recognised, valuable insights. To those ends, a few skillsets are clear frontrunners for data engineering:
SQL – a declarative language that has long been the staple of data management systems. It was briefly considered outdated with the advent of big data platforms that favoured NoSQL interfaces and columnar storage. However, those platforms have not consigned SQL to history: newer data platforms, such as Snowflake, have recognised the competitive advantage of providing the same columnar scalability while supporting SQL syntax. This lets firms leverage the experience of RDBMS professionals while gaining the scalability of big data and the flexibility of semi-structured data. Indeed, these platforms are not far off supporting unstructured data in SQL too.
Java – until recently the most popular language for manipulating data and supporting cross-platform development. Java developers are well-versed in leveraging bespoke libraries to service the requirements of individual use cases.
Python – taking over from Java, Python is the new ‘Swiss army knife’ of data engineering. It too is cross-platform, but it is easily accessible, supported by most library developers, and provides a great platform for a consistent development ethos across an organisation. While R remains a favourite for statistical analysis in data science, Python is preferred for data engineering for its broader utility.
Data Lakehouse – these extend the concept of data warehousing to provide a single source of truth across an organisation. Data lakehouses minimise the number of hops a data set makes in transitioning from an external data set to an integrated one that is logically related to other data sets in the same domain. The result is a real-time streaming warehouse of data from disparate providers, with structure applied on read (schema-on-read).
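The overlap between SQL skills and semi-structured data can be sketched in a few lines. The example below is a minimal illustration, not Snowflake itself: it uses Python's built-in sqlite3 module (assuming a SQLite build that bundles the JSON functions, as modern Python distributions do) to run familiar SQL over raw JSON payloads, applying structure only on read. The table and field names are invented for the sketch.

```python
import sqlite3

# Store semi-structured events as raw JSON text: no upfront schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (payload TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?)",
    [
        ('{"user": "alice", "action": "login", "ms": 120}',),
        ('{"user": "bob", "action": "login", "ms": 340}',),
        ('{"user": "alice", "action": "logout", "ms": 80}',),
    ],
)

# Familiar SQL syntax over semi-structured rows: json_extract pulls
# fields out of each payload at query time (schema-on-read).
rows = conn.execute(
    """
    SELECT json_extract(payload, '$.user')    AS user,
           AVG(json_extract(payload, '$.ms')) AS avg_ms
    FROM events
    WHERE json_extract(payload, '$.action') = 'login'
    GROUP BY user
    ORDER BY user
    """
).fetchall()
print(rows)  # [('alice', 120.0), ('bob', 340.0)]
```

Platforms such as Snowflake offer the same idea at scale with a dedicated semi-structured type and path syntax, which is what allows an RDBMS professional's existing SQL fluency to carry over.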
The composition of the data team of the future is still taking shape, and its long-term form is unclear. In the short to medium term, however, data teams are likely to be composed of traditional software developers, data architects versed in scalable cloud implementations, and data scientists intimately familiar with machine learning algorithms. Each brings their own strengths (and weaknesses) to the team.
Software developers are skilled at building elaborate algorithms to address edge cases. However, ‘big data’ development requires a holistic approach to architecture. Rather than coding for specific use cases one at a time, big data architecture demands an approach that considers all the major use cases together and prioritises them. This is where the architects come in, applying data structures that account for the many use cases as well as the growth of the platform. Data scientists will be needed to understand and deconstruct the complexity of machine learning algorithms for implementation across a big data estate.
Over the longer term, we can expect these disciplines to merge into a composite skillset that is naturally aligned to the principles of big data and to the ever-widening applications of machine learning and artificial intelligence. Here at Liqueo we can assist clients with identifying and forming a data team, leveraging the strengths of members experienced in traditional roles, and advise on how to develop the team to become fluent in the new paradigm. For more information about how our Data Practice can help you, please contact us.