Back when I was a data scientist, I spent a substantial amount of time doing product analytics work: opportunity sizing, experiment deep dives, ad-hoc checks. But although I worked across a wide range of tools (Jupyter/Python, tidyverse, Superset, internal tools, even Java UDFs), the bulk of this work was actually done in SQL. And what would I do? I’d write the query in Superset’s built-in IDE, tweak it until I got an answer to the question I was asking, dump it into a Google doc alongside other findings, then shoot it off as an email/Slack message to vanish forever in a sea of corporate noise.
And as you might expect, weeks or months later, I’d be faced with a similar question, and, unable to locate my past work, I’d rewrite the same query. And from this, I came to a stark realization about this workflow:
The hardest part about analytics work isn’t doing the work. It’s doing the work again next week.
I’ll discuss how to address this in what follows. This is an ode to the long-neglected virtues of organized collaboration within analytics teams, a love letter to the modern doc workspace (a la Notion, Dropbox Paper, Confluence, etc.), and a glimpse into the motivation behind our own analytics-first doc workspace, Hyperquery. But I’m getting ahead of myself — let’s talk about what’s broken before we try to fix it.
Once you’ve written some SQL, unearthed some insights, and written up your work, you know in your soul what you need to do (though perhaps you try to silence this voice from time to time): you need to make this work comprehensible for your work progeny (or for future you). The best of us will thus write up a Google doc documenting our process, our queries, and our insights, then share it via email or Slack for the viewing pleasure of our colleagues. “Scholarly collaboration, accomplished!” you might think. This workflow can be visualized simply as follows:
But it’s during the process of sharing work that things start to take a dark turn (note the ominous dotted lines in the diagram above). We relegate this step of the insight hand-off process to tools that are well-suited to point-in-time sharing, at the expense of subsequent discovery and reproducibility efforts. Slack, Gmail (don’t get me started on GitHub for insight sharing)? Here today, gone tomorrow (not to mention completely lacking as platforms to reproduce SQL-based work). And why is this a problem? Because the lifecycle of your work rarely ends here: it lives on as part of the discovery process in future analytics projects.
And so while we view our jobs as complete once we have left our team (and ourselves) a trail of evidence attesting that “yes! we worked on this before”, we are in fact leaving them with little more than breadcrumbs for rediscovering our work when it’s actually pertinent to them. In a previous blog post, I discussed why discoverability and reproducibility are critical to enabling scale for analytics organizations, but this is the crux of the pain: wasted and repeated effort resulting from lost knowledge.
The remedy to any broken loop is simple: close the loop. And the loop can be closed by simply adopting a tool that allows for organization and discovery of past work. There are a number of ways to do this, sure, but allow me to gush for a moment. That is, onto the love letter part of this post: though not novel, a wonderful [albeit partial, but I’ll get to that later] solution to this problem is the modern doc workspace. Tools like Confluence, Notion, Dropbox Paper, and Hyperquery offer users the ability to not just write up results, but organize them in a context that makes sense to the wider team. Moreover, rich search capabilities allow for much easier discovery than loose docs or git-backed notebooks do. It may seem small, but the user experience is vastly improved.
Where there once was only a thin trail of Slack whispers leading to an obscure Google doc, there now stands an almanac of past SQL queries, insights, and resulting business decisions.
While a small change, the discoverability and visibility afforded by this workflow are not simply a minor UX improvement in the lives of analysts. If you’ll allow me to pontificate for a moment, this is the key to unlocking analytics collaboration. Innovations around collaboration in past years have largely focused on making technical collaboration more feasible/accessible/powerful (e.g., Google Colab), but this is hardly collaboration.
Collaboration isn’t having two users in the same place, writing SQL/Python/R together. Collaboration in analytics is about knowledge-sharing, discoverability, and reproducibility.
That said, I am admittedly a bit biased regarding this particular solution shape. We’ve built our own doc workspace, called Hyperquery, which closes this loop even further by enabling SQL-writing and knowledge-sharing to take place within a single ecosystem. But I’d be remiss if I didn’t mention that there are other options out there as well. If you’re already set on the Google Docs world, I’ve talked to teams that have had some success using Library, and for Jupyter-based workflows, I’ve personally used knowledge-repo for this purpose at Airbnb (though the friction of a git-based sharing paradigm is painfully high). But whatever tool you choose to use, accept that these knowledge-sharing workflows are a critical step in reducing redundant work. Eat your vegetables: think more carefully about where you’re storing your work, and level up your analytics team.
My final call to arms?
The key to reducing repeat work is to make it easy to find past work, and a doc workspace is a great way to do this.
Tweet @imrobertyi / @hyperquery to say hi.👋
Follow us on LinkedIn. 🙂
To learn more about Hyperquery, visit hyperquery.ai.