Importing data
Whenever you need to bring data in from another system, even if it is one of your own, you will have choices to make.
This is a post about a work project: bringing data into a new system from a past system, integrating it with new enrichment data, and changing schema along the way. Fun...
We used importing as a strategy, but I do not much care if you call it ETL or migration.
Plan how to get there
As I joined a call today for the nth time, I reiterated some points that were causing pain in the migration. We have a plan; it is broad-phase, not an in-the-weeds documenting of the dust.
My own preferences on documentation
I fall somewhere between where we are and strong documentation. My preferences are unimportant until the lack of them causes pain and wastes time; it is not about me, it is about delivery. I am aware that some people believe documentation is not agile, and I am quite open and honest that I think they are thieves of your time and attention. I am very pragmatic: documentation is not a set of shackles; you can and should iterate on it, and it should not need to be ready for submission to an examining board unless there is a requirement as such.
Our problem is simple. We have data coming from an existing internal system into a newer system to help launch a new product. It might not always live there; we're not building a walking skeleton, but it needs to come in, and it needs to fit into a new schema, because the questions we are asking are different and there were known design flaws in the old system.
Not getting your own way
I did not get to fix all the things I wanted to in this iteration. Sometimes you can argue until you're blue in the face and a business or group of stakeholders will pull for something whose problems you can see a mile away. I have a simple strategy for this: I complain loudly, attempt to point out the flaws within 15-30 minutes, and then consign anything we cannot agree on to another iteration. Pontification is the enemy of delivery.
- We have ERD diagrams of the high-level concept
- We have code which broadly works enough for an MVP
- We have test cases passing.
- Some of us have strong desires, and we cannot reach agreement on everything. We can ship first, iterate later!
It is time for an import...
Defining a view of the world
Prior to import you will need to ensure you have views of what you need. We were quite spartan with this, almost to the point of rigidity. All plans came from an initial ERD I had done some months prior. We manage our schema, and some of the engineers have tooling to generate an ERD. Theirs is an application view, not a storage view, and it is not source-controllable. It has a lot of unimportant system-specific nuance.
I have worked with third-party and non-perfect data for a lot of my life. I have strategies to improve data, and experience of real-world projects which worked and did not work. One of the pivotal steps which separates the successful from the failed is how nuanced your world view of the data is. In all but the most slapdash of projects I like to avoid using the application at all when migrating; what I am migrating is data.
Some smells when you need to use an application
- Presentation format is derived from non-obvious or complex rules that are not suited to ETL
- You lack direct access to the place(s) where data is stored
- There are known storage bugs which are overcome by the application
Benefits of using direct storage -> storage migration
- There are fewer component parts
- There may be tooling you do not need to write or own
- You will reduce interstitial (often throwaway) system count
- Verification is often simpler
- There are often speed benefits
It is not always possible
One of the things you may find is that you need to defer or rule out the import of certain data. Businesses, especially in the age of data, tend to become needlessly attached to this data, but as a professional and a person who wants to help them, you have to remove the sweets from those you love at times.
My track record with this is hit-and-miss; you cannot win every situation, only use what you know to try to give the best advice.
- Try to tie advice to vision.
- Try to work through real-world cases or analogies.
- Do not shy away from the reductive if it helps.
- Bring data if you can. "This will effectively reduce 10 of 10,000,000 records" is an easier pill to swallow than the reverse.
- Try to avoid unilateral decisions.
Ultimately one thing businesses need to accept is that we are not magicians; we are engineers, so we should try our best to use a fixed pool of resources to achieve a goal within a context. If that context changes, people need to know.
Natural keys
At best, the word natural can be distracting; it can provoke strong reactions for and against. I am using it because it is a term from database systems, which I hope you will know.
"A natural key (also known as a business key) is a type of unique key in a database formed of attributes that exist and are used in the external world outside the database" (Wikipedia).
Our first imports did not go well. Some data which was not extraordinarily complex moved. We had 4, maybe 5, programmers at one time or another working from our initial plans to build an ETL, with high code coverage, the trappings of success. One of the problems was that our test data was dependent on foreign keys.
Because of my history I had attempted to make use of natural keys where possible, only using foreign keys as a signalling duplicate-protection mechanism where natural keys were absent, and to provide an extra layer of defence. This made my import scripts look different, but it also made them work, and enabled them to produce meaningful errors when they did not. I decided to change others' importer scripts to look and behave like mine. This helped; I was glad when the CTO noticed it, sad when syntactic sugar was added which I felt made my own work harder for me to read at a glance.
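To make that concrete, here is a minimal sketch of the shape those scripts took, keyed on a natural key rather than the old system's surrogate keys. The Contact model, the email column, and the row format are illustrative assumptions, not our actual schema.

```ruby
# Sketch: import one row keyed on a natural key (email), assuming an
# ActiveRecord model named Contact. Names are illustrative only.
def import_contact(row)
  email = row["email"]&.strip&.downcase
  # A missing natural key is a meaningful, reportable error, not a crash.
  return { status: :error, reason: "missing email", row: row } if email.nil? || email.empty?

  contact = Contact.find_or_initialize_by(email: email)
  contact.assign_attributes(
    name: row["name"],
    legacy_id: row["id"] # keep the old surrogate key purely as a duplicate guard / audit trail
  )

  if contact.save
    { status: :ok, email: email }
  else
    { status: :error, reason: contact.errors.full_messages.join(", "), email: email }
  end
end
```

Keyed this way, failures report in terms people recognise (an email address) rather than an opaque surrogate key.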
Perhaps a surprising thing I will say on ETL tasks is this: do not write them inside your application if you can help it! Ours was built into a monolithic application because we had a short time-to-deliver and a lot of passionate people, keen to use Ruby as much as possible. If this were a contract job, I would walk away. There should be no place for fuzzy feelings for a specific technology.
We also had a requirement to make the tools available to the wider business, which is another reason I tried to use natural keys. They just resonate a lot better with people, in my experience. "Oh, lewis {email} is not importing" gives a quick, transferable thing that I, or others, can scan artefacts for in 3, 6, 12, or 18 months, once we have all had a while with more interesting problems.
Imports changed meaning entirely to fit requirements better. We decided that each of the 5 main areas for import would succeed or fail at a record level. This gives the property of a document: a resource and its sub-resources must be created together, or not at all. Transactions are a powerful feature of databases which I have found can help to gain success in these matters (a minimal sketch follows the list):
- Open a transaction
- Try to create all interconnected objects
- Ensure all relationships between interconnected objects exist
- If there is success, commit the transaction, log success
- In all other cases roll back the transaction and report errors
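As a minimal sketch of those steps in the Ruby-monolith setting described above; the Order/OrderLine models and the document shape are assumptions for illustration:

```ruby
# Sketch: record-level, all-or-nothing import of one document using a
# database transaction. Order, OrderLine and the document layout are
# illustrative, not our schema.
def import_document(doc)
  ActiveRecord::Base.transaction do
    order = Order.create!(reference: doc["reference"], placed_on: doc["placed_on"])
    doc["lines"].each do |line|
      # Sub-resources are created inside the same transaction, so the
      # document either exists in full or not at all.
      order.order_lines.create!(sku: line["sku"], quantity: line["quantity"])
    end
  end
  Rails.logger.info("imported #{doc["reference"]}")
  true
rescue ActiveRecord::RecordInvalid, ActiveRecord::StatementInvalid => e
  # create! raised, so the transaction rolled back; report and move on.
  Rails.logger.error("failed #{doc["reference"]}: #{e.message}")
  false
end
```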
This helped us to make import documents more easily human-readable, and it meant we could perform set-based workloads for documents with multiple symmetric entries. I will not go into heavy detail, but read up on set-based operations, batching, and computation thrashing.
The TL;DR is that you can generally improve the ideal case, without impacting the worst case, by doing lots of one specific thing at once instead of doing many different things in cycles. Measure what is right for you; be pragmatic.
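As a rough illustration of the difference, assuming the same illustrative OrderLine model as above:

```ruby
# Sketch: row-at-a-time vs set-based writes for symmetric rows.
rows = [
  { sku: "A-1", quantity: 2 },
  { sku: "A-2", quantity: 1 },
]

# Row-at-a-time: one statement and one round trip per record.
# rows.each { |attrs| OrderLine.create!(attrs) }

# Set-based: one multi-row INSERT per batch. insert_all is ActiveRecord 6+
# and skips validations and callbacks, so it suits pre-validated import data.
rows.each_slice(500) { |batch| OrderLine.insert_all(batch) }
```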
Lying with good cause
I have presented this, up to this point, as if there were one prior system, analogous to an existing current project. There are in fact multiple systems, following a faux-microservices, distributed-monolith pattern. That was a decision made to limit the complexity of this write-up. Other decisions included importing by materialising cross-system representations into a single source of truth.
We used a tool called Chartio which is more about dashboards, analytics and reporting than ETL. All I can offer here is to let your team use what they are comfortable with. Attempt to steer towards tooling made for the job but be a part of the team.
Chartio has allowed us to annotate data with corrections, impose further filtering and logic, and it had a read-only connection to some of the data needed. It has enabled interoperability amongst a distributed team where tooling is not standard. We use interoperable export and import formats (CSV) to work past gaps in tooling or functionality. I offer this not as an excuse, but to add further context to aid understanding.
This has also led to pinning some data in an interstitial format. Remote, URI-accessible resources such as images are persisted using their URIs. These add time to imports and can often be a source of distraction through network failures and yet more third-party services. The core thing is not to delete any third-party resources and, in fact, to take regular snapshots of all resources, so you do not lose information.
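On the snapshot point, a minimal sketch of what I mean by pinning a remote, URI-addressed resource locally; the directory and hashing choice are arbitrary placeholders:

```ruby
require "open-uri"
require "fileutils"
require "digest"

# Sketch: persist a copy of a remote resource so the import never depends
# on a third-party service staying up. Paths are placeholders.
def snapshot(url, dir: "tmp/import_snapshots")
  FileUtils.mkdir_p(dir)
  path = File.join(dir, Digest::SHA256.hexdigest(url))
  return path if File.exist?(path) # already snapshotted; skip the network

  URI.open(url) { |remote| File.binwrite(path, remote.read) }
  path
end
```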
Troubleshooting
Our next set of problems, as yet unresolved and in progress, are about bounding data. The business has a clear requirement to bring in all of a certain domain, bounded to a few entities. Those entities rely on other entities, as is common when coming from an RDBMS, and this means we need to select the greatest outer scope first.
If you had soccer teams that had played each other and were in division 1, you would first need to select soccer teams, not division 1 teams, if teams could play other teams outside of their division and you wanted to honestly reflect their match history.
Just because a team is not part of division 1 does not mean you can ignore it; if data that relies on it points to it, you have some choices (a sketch follows the list):
- Do not import the data (can you do it later?)
- Accept that you will need to import more than just what you want, but also what is connected to it
- Do not import anything (rare, but giving up is a valid option)
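Sticking with the soccer analogy, here is a sketch of the second option, selecting the outer scope first; the Team and Match models and their columns are invented for illustration:

```ruby
# Sketch: select the outer scope (every team referenced by a division 1
# team's matches) before importing the matches themselves.
division_one_ids = Team.where(division: 1).pluck(:id)

matches = Match.where(home_team_id: division_one_ids)
               .or(Match.where(away_team_id: division_one_ids))

# The teams to import include opponents from outside division 1, otherwise
# the imported match history would point at teams that do not exist.
team_ids_to_import = (matches.pluck(:home_team_id) + matches.pluck(:away_team_id)).uniq
```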
Verification
It may seem strange, given that we are not done, that we are already at this stage. Essentially, any time you iterate you need to verify the outcomes. In my current role we use Slack to communicate and Google Drive for artefacts, as well as Chartio dashboards to track and produce data artefacts. I should probably link the Google Drive import artefacts, the Chartio source, and the import report logs to JIRA, which is the project management tool. In fact, I shall do this tomorrow. In a non-startup I might do this or have a dedicated resource do it for me. Some things to avoid:
- Clutter - How can I find what I want?
- Ambiguity - What is needed to reproduce results?
For other projects I have worked on, verification has ranged from checking every single record imported to checking a few known entries. Some advice I will offer: start by telling the business you need to check everything, but be open to negotiating downwards, and if you cannot verify the entire data set, get it in writing that this is acceptable. That writing can be as simple as an IM message from a stakeholder, but people's lives, jobs, and the success of the product will rest on this keystone.
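At the lighter end of that range, a verification pass can be as small as comparing counts and a handful of known natural keys between source and target. A hedged sketch, assuming a LegacyContact model pointing at the old system and the illustrative Contact model from earlier:

```ruby
# Sketch: cheap verification comparing the legacy source with the new target.
# The models, the natural key and the sample are all illustrative.
def verify_import(sample_emails)
  report = {
    source_count: LegacyContact.count,
    target_count: Contact.count,
  }
  # Known entries that a stakeholder cares about and can eyeball.
  report[:missing] = sample_emails.reject { |email| Contact.exists?(email: email) }
  report
end

# verify_import(["lewis@example.com", "finance@example.com"])
```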
Wrapping up
Having reduced some import timings and got many more records in than the initial import, you will be able to deliver iterative value, showing progress until it is time to launch, deliver, and move on. I hope you have learned some things; this was by no means exhaustive, but it covered some points I have had on my mind. I hope to look back at this and find either that I know better methods, or that it keeps the methods I do know fresh.