Data Science: the business picture

Today many companies claim to embrace the new data science era. Faced with a deluge of data, businesses big and small resort to data science to improve their products, increase their market share, gain efficiencies in production and/or distribution, and so on. Some publish awesome results, some are more cautious, but is it all truly data science?

A quick search on the web gives many answers to what a data scientist supposedly does (search “what is data science process”). Take, for example, the following answer from Quora. There, R.F. Squire, a self-described “Neuroscientist Turned Data Scientist”, briefly describes the process a data scientist follows: one starts with an interesting question, then gathers data, explores, models and finally communicates and visualises their findings. One can definitely draw a parallel between this summary and the description of the scientific method. So, is data science another name for science in general? Is it that simple? Anything missing?

Let’s provide some clarity around these things.

Three Stages of Data Science

First, not everybody is on the same page or has the same needs. At the most basic level, an infrastructure, an array of technologies, is needed to properly capture and organise all the data. Many companies are at the operational phase, where the basic need is to store and retrieve customer and business data, usually at a scale not possible before. The end goal in this stage is, supposedly, to assemble a data lake (see e.g. KDnuggets): a large, semi-structured collection of raw data (data in its “natural form”).

The term was introduced in opposition to that of the data warehouse, which is characterised by a more structured and “rigid” process. At the risk of getting ahead of ourselves, the truth is that a data warehouse is as good a starting point for data science as a data lake is. What makes it science is not the agility of the source, but the process itself. If the data science process is in place, whether it draws from a data lake, a data warehouse or a tarball, then there is science. Otherwise, there is not.

The next stage is what I call the analytics phase. In this stage companies are able to profit from the storage of data by offering simple statistics or metrics as a part of their products. For example, a mobile data provider could tell customers how many gigabytes they used per category over the last billing cycle (e.g. on social networks, browsing, video streaming, etc.). Some marketers dare to call these figures “insights”, but although they point in the right direction and are potentially informative, there is no (deep) science around them. Do they provide insights to customers about their usage of a service? Maybe. In that case, it is accurate to call them “insights”. Nevertheless, being able to drive business decisions based on numbers for all customers is a different kind of problem.

There is nothing wrong with analytics, of course. In fact, in many cases, from a product perspective, this is all a customer needs – a chart in a dashboard. However, this kind of task shouldn’t need to involve a data scientist. The problem is mainly an engineering issue, and in this case we should be content with following the standard agile process.
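To see why such a metric is an engineering task rather than a scientific one, here is a minimal sketch of the per-category usage figure from the mobile data provider example. The record layout, the customer identifiers and the function name are assumptions made up for illustration; a real provider would read this from its own data store.

```python
from collections import defaultdict

# Hypothetical usage records for one billing cycle:
# (customer_id, category, megabytes). Values are illustrative only.
USAGE_RECORDS = [
    ("c1", "social", 512.0),
    ("c1", "video", 2048.0),
    ("c1", "social", 256.0),
    ("c2", "browsing", 128.0),
]


def gigabytes_per_category(records, customer_id):
    """Sum one customer's usage per category, converted to gigabytes."""
    totals = defaultdict(float)
    for cust, category, megabytes in records:
        if cust == customer_id:
            totals[category] += megabytes / 1024.0
    return dict(totals)


if __name__ == "__main__":
    print(gigabytes_per_category(USAGE_RECORDS, "c1"))
```

There is no model, hypothesis or validation here: it is a plain aggregation that feeds a dashboard chart.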

The third and last stage – deriving solutions for all customers – is where true Data Science lives. Other common names for it are Machine Learning and Artificial Intelligence. Companies at this stage follow a process similar to the one mentioned above (on Quora): they gather and store data, do some exploration, build models and put them into production to drive business decisions and provide sophisticated “analytic” products. But, as we said before, this merely outlines the scientific method within an organisation. So is this enough to be categorised as “data science”? And how does the business come into play?

From a Business Point of View

Within a business that offers services through software (nowadays in the so-called Software as a Service, or “SaaS”, model) there are at least four elements to consider for any product feature:

1. Added value for the customer (i.e. efficacy)
2. Performance and efficiency
3. Maintenance and support costs
4. Overall relevance to know-how and IP (i.e. strategic fit and value)

At Hivery, the core of our products relies on features that depend strongly on Data Science. But those are not the only features. In this regard, every piece of functionality, whether it requires R&D or “just” engineering, needs to pass through the normal feature development cycle. This cycle is embodied in the Agile management approach: elaborate, design, implement, test, deliver – repeated over and over in small, tightly knit feedback cycles with the customer.

Of course, core Data Science features should follow the same approach. So, the company as a whole needs to ensure that a data-science-backed feature adds value to the customer. Thus, within the overall business process, it is imperative at the start of any engagement to clarify, to the best of our ability, the problem, beneficiaries, internal champion(s) and context of all data science activities.

In addition to basic goals, this scope would include minimal performance goals. For instance, if a Data Science model/algorithm needs hours to provide feedback when seconds is the only acceptable level of performance, then nothing of value has been achieved. I understand that this can be seen by some as an “engineering” task rather than a “scientific” concern. However, as data scientists, we cannot neglect this aspect of any feature being delivered. We also wouldn’t be doing any good by just throwing an algorithm over the fence to the implementation team and letting them deal with our poor forethought.
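A performance goal of this kind can be written down and checked from the first prototype onwards. The following is only a sketch under stated assumptions: the helper name, the toy model and the five-second budget are hypothetical, and a single timed call is a crude measure compared with proper load testing.

```python
import time


def within_latency_budget(predict, payload, budget_seconds):
    """Run predict(payload) once; return (result, elapsed, met_budget)."""
    start = time.perf_counter()
    result = predict(payload)
    elapsed = time.perf_counter() - start
    return result, elapsed, elapsed <= budget_seconds


def toy_model(values):
    # Stand-in for a real model: it simply averages its input.
    return sum(values) / len(values)


if __name__ == "__main__":
    # The 5-second budget is an assumed, illustrative target.
    result, elapsed, ok = within_latency_budget(toy_model, [1.0, 2.0, 3.0], 5.0)
    print(f"result={result}, elapsed={elapsed:.6f}s, within budget={ok}")
```

Agreeing on the budget up front gives the data scientist and the implementation team a shared, testable definition of “fast enough”.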

Once a feature is delivered and working, if it was created following a scientific process, it certainly has assumptions and a number of intrinsic performance metrics. Are those assumptions still valid? Have the model’s predictions aligned with new data? Are the optimal actions being followed? If so, do we know the impact? Monitoring performance is critical, and data science products are no exception. After all, a Data Science feature is not maintainable if we are not able to collect performance indicators on the delivered algorithm.
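As one possible illustration of such a performance indicator, the sketch below compares a delivered model’s predictions against newly observed values and flags the model for review when its error grows well beyond what was measured at delivery. The choice of mean absolute error, the function names and the tolerance factor of 1.5 are all assumptions made for the example, not a description of any particular product.

```python
def mean_absolute_error(predictions, actuals):
    """Average absolute difference between predictions and observed values."""
    if not predictions or len(predictions) != len(actuals):
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(abs(p - a) for p, a in zip(predictions, actuals))
    return total / len(predictions)


def needs_review(predictions, actuals, baseline_mae, tolerance=1.5):
    """Return (flag, live_mae); flag is True when the live error exceeds
    the error measured at delivery by more than the tolerance factor."""
    live_mae = mean_absolute_error(predictions, actuals)
    return live_mae > baseline_mae * tolerance, live_mae


if __name__ == "__main__":
    # Illustrative numbers only.
    print(needs_review([10, 12], [11, 11], baseline_mae=1.0))
    print(needs_review([10, 20], [14, 14], baseline_mae=1.0))
```

A check like this answers only the second question above, whether predictions still align with new data; validating assumptions and measuring business impact need their own indicators.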

Data science products are the fruits of a sophisticated process. But they live within a company that incorporates all sorts of processes and activities that ultimately define the colour, taste and ripeness of the fruit. Therefore, the work of a data scientist cannot be a black box or plugged in as a prosthesis to the whole body; otherwise, it would constitute a huge risk for the sustainability of the business. It has to be an integral part of a company’s know-how, its IP.

How do we achieve this at Hivery? Communication and documentation are the key aspects. Proactive communication of findings and issues during the research stage, in the same spirit as in Agile software development, builds the shared knowledge that splices data science research into the organisation’s DNA. Also, as we do with software, peer review becomes essential, and it can only be done when we document our scientific journey from data to algorithm. As a by-product, documentation provides a memory that makes processes reproducible and later becomes a part of Hivery’s Intellectual Property, its know-how.

Seen from this perspective, a simple rendering of the scientific method cannot be the definition of the data science function. Neither can a decontextualised Agile practice align the scientific method with the product development cycle. What we need is a structured R&D process that fits within the engineering practice of our products, with specific “contact points” and an API of sorts. This is the subject of the second chapter and the heart of this proposal.
