Getting to Yes in Data

Time spent on getting data is not spent on creating value from it.

Getting to data carries a lot of hidden costs distributed across teams and functions. Notably, these costs sit in coordination, queueing effects and configuration, creating significant direct costs (more than €10k per request) and considerable opportunity costs. With PACE we are designing a way to iron out those wrinkles. Request access and read the documentation.

Pattern recognition 🧮

When we founded STRM in 2020 (then 🚂 Stream Machine), we planned to build a data company on the wave of increasing regulatory pressure. We anticipated data value creation was going to be preceded by the question of “But can we?”. Can we use customer emails? Can we distribute their names and shopping history throughout the entire data warehouse? What is even allowed now?

Off we went to build privacy-by-design data infrastructure, as we believed privacy was the most important of the regulatory pressures and data-related “can we?” questions organisations would face. Most importantly, we wanted to shape data pipelines so that all those questions no longer mattered, by “shifting left”.

Over the past years and in hundreds of talks with potential customers, engineers and data users, the question of “but can we?” still stands. But we found that question to be much more comprehensive, nuanced and deeperlyⁿ embedded in organisations than just privacy.

It turned out it is framed the other way round: “How can we just build with data?”

ⁿ We invented that grammar.

May the Force be with you 🪄

We see two forces in “using data”: one is concerned with ensuring data helps to meet (or even set) organisational goals, and another one is to ensure this is done correctly.

Digging deeper, these usually present as three groups. First, we have data consumers, like your teams building insights with data, and data products such as recommendation engines. They need as few obstacles as possible between them and the data to achieve velocity. The faster products go live, the more business value is generated and compounds.

Second, we have data producers, applications or teams that generate data (or derivatives of other data sources). In strongly data-driven organisations, these teams are tasked with capturing as much data as possible about behaviour, operations, and any relevant business event.

Both the producers and consumers are in the making-most-of-data-team. They are the engineers, teams and drivers in a Formula-1/E-race. They need to win the race for data. 🏎️

As the other Force (often perceived as opposing), we have governance functions, such as privacy, security, and dedicated data governance teams. Their key responsibility is to ensure data is used responsibly and aligned with organisational policies. They are the safety car, ensuring the race to data is driven within the game’s rules. 🚔

Opposing forces create a steady pace forward at best (directed velocity) but keep each other in the same place in the worst case. How does this play out in practice when you want to use data?

Oh, it’s complicated, we find.

Race report: getting to data in reality 🚧

Let’s walk through a practical example illustrating how this complexity materializes:

  • Discover and locate the data in a data catalogue (if it’s discovered and in there)
  • File a data usage request, usually in a separate ticketing tool like Jira or ServiceNow
  • Data steward(s) take a few days to ponder. Weeks if it is complicated.
  • Compliance and legal might have a few extra demands, such as a DPIA or transfer assessment
  • Access is granted/denied (provided there’s a group for that). The result arrives over chat or email.
  • A platform admin grants you the rights in the data platform (like Snowflake or Databricks) within their SLA.

All these steps are often repeated over and over, case by case, frustrating many of the people involved.

Mapping this to approximate costs, it becomes more interesting (assuming very average cost-to-organisation hourly rates; the amounts in brackets read as running totals):

  • Data discovery (if there’s a good catalogue): at best, 15 minutes. Likely an hour. (€50)
  • Plowing through the intake forms + handling the request: a few hours to days (€500)
  • Stewards discuss in a board. There are three of them. The requester provides subtitling; it takes a few hours to prep and follow up. (€1500)
  • A request on the table necessitates additional discussions (and most easy cases are already in prod in orgs). Two extra meetings with a few people + necessary preparations, perhaps a bill from a lawyer or (internal) SecOps expert. (€3000 to €6000)
  • Turns out you need a DPIA, which totals a few days + emails + time of a few people (€9000)
  • The same request passes the same board. This time it’s approved (€9500)
  • Admin grants access, but it takes a few days to pick up the request and configure the groups (there’s other work as well, you know) (€10k)
  • Finish! 🏁

Et voilà, you burned through €10k in direct process costs for a single request, let alone the value you haven’t captured in that time (often weeks to months!). The very mature data organisations have reduced this to a week max, but even there the complicated cases tend not to follow Yhprum's law and are caught in queueing patterns, simply because there’s a wide matrix of stakeholders, processes, tools and approvals involved.

Please note that €10k is an average, and likely on the low end. In organisations with a rich history (read: legacy), operating in strongly regulated, sensitive or safety-driven industries, or readying the business for IPO/institutional rounds, we’ve run into business cases where a single request racked up as much as €40k.

So, how many of those requests do you have per year (you can approximate)? Probably hundreds!
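
To make that back-of-the-envelope sum concrete, here is a minimal sketch in Python. It assumes the bracketed amounts in the cost walkthrough are running totals, takes the low end of the €3000 to €6000 range, and uses a made-up volume of 200 requests per year. The names (`RUNNING_TOTALS`, `step_costs`, `annual_cost`) are ours for illustration only.

```python
# Running totals in euros, taken from the cost walkthrough in this post.
# Assumption: the bracketed amounts are cumulative, and the legal/SecOps
# step uses the low end of its 3000 to 6000 range.
RUNNING_TOTALS = [
    ("data discovery", 50),
    ("intake forms and request handling", 500),
    ("steward board", 1500),
    ("extra discussions, legal or SecOps", 3000),
    ("DPIA", 9000),
    ("second board pass", 9500),
    ("admin grants access", 10000),
]


def step_costs(running_totals):
    """Turn running totals into the incremental cost of each step."""
    costs = []
    previous = 0
    for name, total in running_totals:
        costs.append((name, total - previous))
        previous = total
    return costs


def annual_cost(cost_per_request, requests_per_year):
    """Direct process cost per year, ignoring opportunity costs."""
    return cost_per_request * requests_per_year


if __name__ == "__main__":
    for name, cost in step_costs(RUNNING_TOTALS):
        print(f"{name}: EUR {cost}")
    per_request = RUNNING_TOTALS[-1][1]
    # 200 requests per year is a hypothetical volume, not a measured one.
    print(f"per year at 200 requests: EUR {annual_cost(per_request, 200)}")
```

Swap in your own hourly rates and request volume; the point is the order of magnitude, not the exact figure.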

Driving data at high speed: a safety car as auto-pilot? 🦾

Looking at this race as a spectator, this process is fundamentally broken. Involving three teams, four tools and platforms, and tens of emails, conversations and meetings to decide on the applicable policy, just to get to the data, seems a little over-dimensioned. Now imagine you run both AWS and GCP, or Snowflake and Databricks (which is a sizeable portion of the market!). Every extra platform multiplies the complexity. All while the tools involved lock you into their world and charge a hefty price for it.

And so we have been thinking, what if this was a push-button process?

Imagine there was a way of handling data policies that:

  • Lets you describe policies, both global and case-by-case, and collaborate and agree on them in a streamlined process (this could be data contracts!)
  • Is human-defined but machine-readable/programmatic, and integrated into data catalogues, perhaps data quality pipelines, and the platforms that store and process the data
  • Is coupled to access (group-level or based on attributes of the data and users)
  • Can connect any tool to any tool (catalogue to processing and everything in between)
  • In effect, provides a real-time view of all data usage (compliance and auditing - a Reality of Processing Activities-record!)

Imagine the value that would enable in direct costs and additional value capture!
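
To give a feel for what “human-defined but machine-readable” and “coupled to access” could mean, here is a toy sketch in Python. It is not how PACE works or what its policy format looks like. `Rule`, `Policy`, the group names and the masking behaviour are all hypothetical and made up for this illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Rule:
    """Who may see a column in the clear (hypothetical structure)."""
    column: str
    allowed_groups: frozenset  # groups that get the raw value
    fallback: str = "mask"     # "mask" or "deny" for everyone else


@dataclass
class Policy:
    """A per-dataset policy that also records every access it serves."""
    dataset: str
    rules: list
    audit_log: list = field(default_factory=list)

    def apply(self, row, user, groups):
        """Return the view of `row` this user may see and log the access."""
        view = {}
        for rule in self.rules:
            if rule.column not in row:
                continue
            if rule.allowed_groups & set(groups):
                view[rule.column] = row[rule.column]
            elif rule.fallback == "mask":
                view[rule.column] = "****"
            # fallback "deny": the column is left out entirely.
        # Columns without a rule never reach the view (default deny).
        self.audit_log.append((user, self.dataset, sorted(view)))
        return view


policy = Policy(
    dataset="orders",
    rules=[
        Rule("order_id", frozenset({"analytics", "marketing"})),
        Rule("email", frozenset({"marketing"}), fallback="mask"),
        Rule("shopping_history", frozenset({"analytics"}), fallback="deny"),
    ],
)

row = {"order_id": 1, "email": "a@example.com", "shopping_history": ["socks"]}
analyst_view = policy.apply(row, "alice", ["analytics"])
marketer_view = policy.apply(row, "bob", ["marketing"])
```

In this sketch the analyst gets a masked email and the marketer never sees the shopping history, while `policy.audit_log` keeps one entry per access. That last bit is the toy version of the real-time usage record in the final bullet.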

Keeping Pace: help us help you.

And that’s what we’re thinking about and have in development.

Because such a setup requires the combination of workflow, access, processing and compliance in a tool that integrates with your existing data tools (e.g. from Data Hub to Snowflake and Databricks), we can’t build it in isolation or on product trials. Productionizing such a setup needs real-world testing.

Therefore, we are looking for a select set of co-design partners with the need and ability to implement a much better and integrated way to drive the race to data in their organisations.

Reach out to work together on putting this on auto-pilot!

That auto-pilot might look something like this...