Building new solutions instantly transforms existing data into “Legacy Data.” What does this mean?
Software developers use the term “Legacy Code” to summarize the challenges that existing code poses to evolving or changing systems. In Working Effectively with Legacy Code, author Michael Feathers defines Legacy Code as “simply code without tests.” This definition highlights two consequences of missing tests:
- Making changes to legacy code is difficult
- Using legacy code in new contexts is difficult
What about Legacy Data?
Legacy Data seldom has tests. Builders creating new solutions from this data face the same problems that software developers face with Legacy Code. Today’s complex landscape of data lakes, data warehouses, and data meshes compounds these problems.
The best new systems have:
- Robust test data sets that exercise key features of a system without any PII/PHI or other controlled data
- A set of key integrity and correctness checks for incoming data
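As a rough illustration of the second point, incoming-data checks can be written as small named rules that each record must pass. This is a minimal sketch, not a description of any particular system: the field names (`order_id`, `amount`, `created_on`) and the rules themselves are hypothetical stand-ins for whatever a real feed requires.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable

Record = dict[str, object]


@dataclass(frozen=True)
class Check:
    """A named rule that an incoming record must satisfy."""
    name: str
    predicate: Callable[[Record], bool]


def is_iso_date(value: object) -> bool:
    """True if value is a string that parses as an ISO calendar date."""
    if not isinstance(value, str):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


def is_non_negative_number(value: object) -> bool:
    """True for ints and floats >= 0; bool is excluded on purpose."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    return value >= 0


# Hypothetical rules for a hypothetical "orders" feed.
CHECKS = [
    Check("order_id is present", lambda r: bool(r.get("order_id"))),
    Check("amount is a non-negative number",
          lambda r: is_non_negative_number(r.get("amount"))),
    Check("created_on is an ISO date",
          lambda r: is_iso_date(r.get("created_on"))),
]


def failed_checks(record: Record, checks: list[Check] = CHECKS) -> list[str]:
    """Return the names of every check the record fails, in rule order."""
    return [check.name for check in checks if not check.predicate(record)]
```

Calling `failed_checks` on each incoming record yields a list of rule names that can be logged or used to quarantine the record. Keeping the rules as data makes it easy to add new ones as hidden assumptions surface.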
One system we built re-ran all of its unit and integration tests against a sanitized version of the previous day’s production data. We changed the existing tests to be data-aware and to handle the new scale. This approach uncovered subtle, hidden data assumptions. We used this insight to correct quality problems in both the new and legacy code and data.
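The sketch below shows one way such a setup could look; it is an assumption-laden illustration, not the system described above. It replaces hypothetical PII fields (`name`, `email`) with stable tokens from a salted SHA-256 hash, then runs a "data-aware" check that asserts invariants rather than hard-coded expected values, so it works on whatever data arrives. Note that salted hashing is pseudonymization only; whether it is sufficient for PII/PHI depends on the applicable policy.

```python
import hashlib

# Hypothetical field names; a real system would list its own controlled fields.
PII_FIELDS = frozenset({"name", "email"})


def pseudonymize(value: str, salt: str) -> str:
    """Replace a value with a stable token from a salted SHA-256 hash."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:16]


def sanitize(record: dict, salt: str, pii_fields=PII_FIELDS) -> dict:
    """Copy a record, tokenizing PII fields and leaving other fields intact."""
    return {
        key: pseudonymize(str(value), salt) if key in pii_fields else value
        for key, value in record.items()
    }


def totals_by_customer(records: list[dict]) -> dict:
    """Example code under test: sum amount_cents per customer email."""
    totals: dict = {}
    for record in records:
        key = record["email"]
        totals[key] = totals.get(key, 0) + record["amount_cents"]
    return totals


def check_totals_are_conserved(records: list[dict]) -> None:
    """Data-aware check: asserts invariants, not fixed expected values."""
    grouped = totals_by_customer(records)
    # Grouping must neither lose nor invent money.
    assert sum(grouped.values()) == sum(r["amount_cents"] for r in records)
    # There must be exactly one total per distinct customer.
    assert len(grouped) == len({r["email"] for r in records})
```

Because the same input always maps to the same token, records for one customer still group together after sanitizing, so checks like `check_totals_are_conserved` remain meaningful on the sanitized copy.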
We are all building more and more innovative data products. As we do, we must get better at embracing the realities of Legacy Data. GistLabsAI can help. Please see our Data Strategy and Data Engineering services for more information.