A critical aspect of scaling organizations is process. Process allows you to normalize and incorporate best practices to ensure things work smoothly and scalably even when no one is minding the controls. But process in analytics organizations is something that is frequently overlooked, and too often we default to the same processes that engineering abides by: i.e. using git and related patterns to share and store analytics work.
For some parts of data science / analytics, such a transference of engineering behaviors is suitable: analytics infra, analytics engineering, deployment of machine learning models, libraries — all of these workflows are inherently code-based and benefit from the rigorous test + PR culture familiar in engineering organizations. But for the remainder of analytics work — the kind that occurs daily in SQL IDEs and Jupyter notebooks — the fit is poor. 90% of our job, by the nature of our work as analysts and data scientists, is exploratory. And here, unfortunately, engineering practices not only fall short, but can be detrimental to the org. Why?
Blindly enforcing version control and code review as a gatekeeper for sharing exploratory work leads to unshared exploratory work.
So I’d argue we need a different process. And to understand what that process should be, we first need to establish the objectives of analytics organizations. In engineering, maintainability, reliability, and scalability are the objectives that underpin practices like version control, code reviews, code coverage, and validation testing. In analytics work, the underlying objectives are necessarily different: reliability, maintainability, and scalability still matter, but they manifest differently. Let’s ditch the emperor’s clothes and replace these concepts with what we really want: discoverability and reproducibility. In other words, we need to put the “science” back into data science (and analytics).
With these concepts in mind, I’ll discuss two things in this article: why git is the wrong place for sharing most analytics work (discoverability), and why SQL should be the default over Python/R for the bulk of that work (reproducibility).
This is an oversimplified engineering code base, where arrows indicate imports.
Anyone who’s spent any time in a modern IDE knows that traversing this graph is easy. Every modern IDE has “jump to” functionality where you can immediately jump to object references. Start in def pet(), and you can easily jump to the definition of class Llama, then follow the breadcrumb trail all the way back to the parent class Animal.
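To make this concrete, here's a minimal sketch of the kind of code base that diagram describes (only pet, Llama, and Animal come from the example above; the module layout is invented for illustration):

```python
# animals.py (hypothetical module layout)
class Animal:
    """Parent class: the far end of the jump-to-definition breadcrumb trail."""
    def __init__(self, name: str):
        self.name = name


# llama.py
class Llama(Animal):
    """Subclass: any reference to Llama can be jumped to, and from here
    you can jump again to the parent class Animal."""
    def hum(self) -> str:
        return f"{self.name} hums contentedly."


# petting_zoo.py
def pet(llama: Llama) -> str:
    # "Jump to definition" on Llama lands on the class above, and from there
    # on Animal. Every edge in the import graph is explicit in the code.
    return f"You pet {llama.name}. {llama.hum()}"


print(pet(Llama("Dolly")))  # You pet Dolly. Dolly hums contentedly.
```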
Your analytics code base, on the other hand, looks a bit different:
There is still interconnectivity (through the data itself), but these connections are not discoverable through your IDE (hence the dotted lines). This makes it substantially harder to see, say, table references in the same way engineers can see function references. So what’s the solution?
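Here's a sketch of what that looks like in practice. The connection string and the core.bookings table are made up; the point is that the only link between these two analyses is a string naming a warehouse table, which no IDE can follow:

```python
# Two separate analyses, connected only through the data they both read.
import pandas as pd
import sqlalchemy

# Hypothetical warehouse connection.
engine = sqlalchemy.create_engine("postgresql://user:pass@warehouse:5432/analytics")

# analysis_a.py: weekly bookings, built on core.bookings.
weekly_bookings = pd.read_sql(
    "SELECT date_trunc('week', booked_at) AS week, count(*) AS bookings "
    "FROM core.bookings GROUP BY 1 ORDER BY 1",
    engine,
)

# analysis_b.py: cancellation rate, also built on core.bookings, but nothing
# in the code says so. "Jump to definition" on 'core.bookings' goes nowhere;
# the dependency exists only in the warehouse, not in the IDE.
cancellation_rate = pd.read_sql(
    "SELECT avg(CASE WHEN cancelled THEN 1 ELSE 0 END) AS cancellation_rate "
    "FROM core.bookings",
    engine,
)
```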
Before diving into that, though, let’s talk about the go-to solution for analytics organizations appropriating engineering best practices: git. Many organizations turn to git to track any sort of query that drives an insight. This is reasonable, and it allows you to search past work by, say, table name, but in my experience, using git this way comes with a few real inconveniences.
My controversial conclusion:
Git is the wrong place for sharing most analytics work.
At the end of the day, these inconveniences are just that — inconveniences — but, perhaps surprisingly, they are often enough to cause people NOT to share their work. I’ve seen this happen first-hand at Airbnb, Wayfair, and a handful of other companies that attempted to enforce the same. Some very polished work does get shared (e.g. through Airbnb’s knowledge-repo), but this captures only ~1% of the work done. The rest lives in tabs in SQL IDEs, local files, and local Jupyter notebooks, and so it stands to be duplicated unnecessarily, over and over, by different analysts and data scientists.
The solution? Make the work discoverable at all costs. A reasonable way to do this is to share your work in a non-git-backed place. We’ve built Hyperquery.ai expressly for this sort of thing, but I’ve also seen reasonable success with more general note-taking solutions like Notion, Confluence, or Library (h/t Brittany Bennett), if you don’t mind keeping your IDE and your query-sharing environment separate. For Jupyter/R Markdown, git may unfortunately still be the least of all evils. But this brings me to my second objective: reproducibility.
Analytics / data science is science. And as such, it needs to be reproducible. Any insight you produce is two things: the insight itself and the steps you took to obtain that insight.
Reproducibility is particularly important precisely because I’ve told you not to gate sharing behind peer review in a hosted git platform. While it’s important to avoid blocking insight consumption with a mandatory review process, it is still very important to get critical, decision-driving pieces of work checked by colleagues, and prioritizing reproducibility enables this. Moreover, reproducibility fosters careful validation of results far better than code review does, since reviewing query code alone is generally opaque when it comes to data outputs.
How do we make work cheap to reproduce? SQL. SQL reduces the friction to reproduce work, and as with discoverability, reducing friction is everything: otherwise, no one will go through the pains of faithfully reproducing your efforts and validating your conclusions. As they say, a 5-line git commit gets 50 requested changes; a 500-line commit gets an LGTM. If reproducing your work means setting up virtual environments and wrestling with the hidden state that riddles code-based notebooks, others simply won’t try. And there’s nothing worse than pushing the business in the wrong direction because you didn’t get your work double-checked.
So now we arrive at my second controversial statement of this article:
Try to use SQL where possible, not Python/R.
Python/R aficionados: before you close your browser and block me forever, hear me out. I love Python and R, and was one of the earliest users of IPython and Jupyter (I spent the entirety of grad school studying rivers in Python). I’ve even released several open-source libraries in Python.
That said, you have to admit: Jupyter notebooks and R Markdown are not the best place to store reproducible work. Hidden state, awkward re-execution, requirements files, and cached data create numerous points of failure. At the end of the day, data in motion breaks, and SQL minimizes the motion.
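To illustrate the hidden-state problem with a deliberately contrived example, imagine these three notebook cells (everything here is made up):

```python
# Cell 1 -- run once, early in the session
sessions = [3, 12, 47, 102, 250]
threshold = 100

# Cell 2 -- later edited to `threshold = 10`, but never re-executed,
# so the kernel still holds threshold = 100

# Cell 3
active = [s for s in sessions if s > threshold]
print(active)  # [102, 250]: computed from state the notebook no longer shows
```

The shared .ipynb and the results it displays quietly disagree, and nobody can tell without re-running the whole thing.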
Of course, let’s say you don’t buy this — you are the master of making perfectly reproducible notebooks. This is entirely possible: if you use an odbc-based library that pulls data directly from your warehouse, and if you always re-execute all of your cells from the top before sharing your code, and if you have a good system for sharing these notebooks without accidentally storing massive amounts of data (oops, you displayed df and dumped several hundred MB of data in plaintext). But even if you get all of those things right, there’s a level of opaqueness (inaccessibility, particularly to stakeholders) to Jupyter notebooks that will inevitably degrade trust in your analyses. SQL snippets, on the other hand, are self-contained and work out of the box. Stakeholders can (and will) execute SQL if they’re in doubt, and reproducibility comes for free as long as your organization has reasonable data hygiene.
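For the sake of argument, here's roughly what that discipline looks like (the connection details and table are invented; this is a sketch, not a prescription):

```python
# A "perfectly reproducible" notebook, reduced to its essentials.
import pandas as pd
import sqlalchemy

# 1. Pull straight from the warehouse: no local CSVs, no cached pickles.
engine = sqlalchemy.create_engine("postgresql://user:pass@warehouse:5432/analytics")
signups = pd.read_sql(
    "SELECT date_trunc('week', created_at) AS week, count(*) AS signups "
    "FROM core.users GROUP BY 1 ORDER BY 1",
    engine,
)

# 2. Before sharing: Kernel -> Restart & Run All, so cell order can't lie.
# 3. Pin the environment (requirements.txt or a lockfile) so others can re-run it.
# 4. Clear bulky outputs; a stray display of `signups` ships hundreds of MB of
#    plaintext data along with the notebook.
signups.head()
```

Every one of those steps is a manual habit someone has to maintain; the equivalent SQL snippet pasted into a shared doc needs none of them.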
This may already be clear at this point in the article, but in my opinion there are only two necessary processes that, when implemented, will drive discoverability and reproducibility forward by leaps and bounds: for discoverability, share your work in a discoverable, non-git-backed place; for reproducibility, write that work in SQL wherever possible.
These suggestions are hopefully not as drastic as my earlier points might have primed you to expect.
The changes I’m suggesting are small. Explicitly state “SQL where possible” in your analytics onboarding docs, and set up a SQL-writing environment that enables knowledge sharing (see Hyperquery.ai). But beyond that, there’s not much else you need to change.
Give it a try, and let me know how these adjustments work out for you (or if you’ve had success driving discoverability and reproducibility forward by other means). These sorts of small changes can mean the difference between a frustrated, overworked data team and one that lightens its workload through re-usable, freely shared SQL.
Tweet @imrobertyi / @hyperquery to say hi.👋
Follow us on LinkedIn. 🙂
To learn more about hyperquery, visit hyperquery.ai.