It's not about the algorithm

Let me start by addressing what I believe is the real problem we need to solve when it comes to analyzing diabetic patient data: finding meaningful patterns in the data that can guide physicians toward asking better questions of their patients. Patterns found this way should enhance, not replace, a doctor's ability to apply their clinical experience to each specific case.

Given that, I argue that the choice of analytical method is far less critical than we often make it seem.

Provided that we don't violate the assumptions of a given method, and that we don't draw causal conclusions from descriptive algorithms, many methodologies can support the relationship between doctor and patient. What matters most is choosing solutions that fit the way a specific professional thinks, while keeping in sight all the technical challenges involved in actually deploying an application.

The reality is that the most sophisticated statistical model in the world is useless if it can't be reliably explained, implemented, maintained, and integrated into existing healthcare workflows. For example, the ambulatory glucose profile has had far more "success" in actual use than any other statistics-based solution I know of. Infrastructure concerns such as data pipelines, system integration, user interfaces, and regulatory compliance are the factors that determine whether a solution succeeds or fails in practice.

Tools and Methods

So, regarding our options, I like to think in terms of three categories of analytical tools, each with distinct characteristics and applications.

Visualization Methods (example): These represent the most basic approach: plain visualizations of the data with minimal processing. Think of dashboards showing glucose trends over time, or simple scatter plots revealing correlations between different health metrics.
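
To make the category concrete, here is a minimal sketch in Python of the kind of trend plot I mean. The readings are synthetic, and the 70-180 mg/dL band is simply the commonly used target range; nothing here is a clinical recommendation.

```python
# A minimal sketch of the "simple visualization" category: plot a day of
# CGM readings as a glucose trend. The data is synthetic; in practice it
# would come from the patient's device export.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic 24h of readings at 5-minute intervals (timestamps + mg/dL values).
idx = pd.date_range("2024-01-01", periods=288, freq="5min")
rng = np.random.default_rng(0)
glucose = 120 + 35 * np.sin(np.linspace(0, 4 * np.pi, 288)) + rng.normal(0, 10, 288)
readings = pd.Series(glucose, index=idx, name="glucose_mg_dl")

# The "method" is just plotting: no model, minimal processing.
ax = readings.plot(figsize=(10, 3))
ax.axhspan(70, 180, alpha=0.15, label="target range")  # commonly used 70-180 mg/dL band
ax.set_ylabel("glucose (mg/dL)")
ax.legend()
plt.tight_layout()
plt.show()
```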

Deterministic Methods (example): These are methods that process the data through a fixed, rule-based algorithm: the same input always produces the same output, and every step can be written down and inspected.
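
A toy sketch of what falls into this category: a fixed rule that flags runs of low readings. The threshold and episode definition below are illustrative choices of mine, not clinical criteria.

```python
# A toy example of a deterministic method: a fixed rule that scans the
# readings and flags hypoglycemic episodes. The same input always gives
# the same output, and every step is inspectable.
from typing import List, Tuple

def flag_hypo_episodes(readings: List[Tuple[str, float]],
                       threshold: float = 70.0,
                       min_consecutive: int = 3) -> List[str]:
    """Return the start times of runs of at least `min_consecutive`
    consecutive readings below `threshold` mg/dL (values are illustrative)."""
    flagged, run = [], []
    for timestamp, value in readings:
        if value < threshold:
            run.append(timestamp)
            if len(run) == min_consecutive:
                flagged.append(run[0])  # report the start of the episode
        else:
            run = []
    return flagged

# Example: three consecutive readings under 70 starting at 02:10.
sample = [("02:00", 82), ("02:05", 71), ("02:10", 66),
          ("02:15", 64), ("02:20", 62), ("02:25", 75)]
print(flag_hypo_episodes(sample))  # -> ['02:10']
```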

Statistical Methods (example): These combine mathematical machinery that generates results with a body of theory that allows us to interpret those results across a spectrum, ranging from simple correlation to confident causality. These methods provide the theoretical framework to quantify uncertainty and make probabilistic statements about relationships in the data.
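
As a minimal sketch of what this machinery adds on top of a point estimate, here is an ordinary least squares fit with statsmodels on synthetic data (the variables and the relationship between them are made up for illustration): it reports confidence intervals and p-values alongside the coefficients.

```python
# A minimal statistical sketch: an OLS fit, which gives not just a point
# estimate but confidence intervals and p-values, i.e. the machinery for
# quantifying uncertainty. Data and variable names are synthetic.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
carbs_g = rng.uniform(20, 100, size=200)                      # hypothetical meal carbs
post_meal_rise = 0.8 * carbs_g + rng.normal(0, 15, size=200)  # hypothetical glucose rise

X = sm.add_constant(carbs_g)          # intercept + predictor
model = sm.OLS(post_meal_rise, X).fit()

print(model.params)       # point estimates
print(model.conf_int())   # 95% confidence intervals
print(model.pvalues)      # p-values, valid only under the model's assumptions
```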

Here's where I should probably recommend one approach over the others, but I don't have a definitive solution; I'm not even close. The only thing that has really helped me is the act of categorizing different methods by their tradeoffs. Based on my experience, if you want to investigate a problem that is well defined from a mathematical standpoint, then identifying the right tool is relatively straightforward once you have your categories ready to go.

The following are some pros and cons of these methods, from an implementation point of view, i.e. after it has been established that they fit a particular problem's constraints.

Pros and Cons

Simple Visualization Methods

Pros:

  • Widely used approach: appealing to many stakeholders
  • Easy to implement and maintain with minimal technical overhead
  • Immediate interpretability
  • Low computational requirements and fast execution, in most cases
  • Minimal risk of algorithmic bias or black-box decision-making

Cons:

  • Sometimes cannot answer the specific questions we're asking of the data
  • Limited ability to handle complex, multivariate relationships

Deterministic Methods

Pros:

  • No black box, each step is transparent and auditable
  • Easier to integrate into existing workflows, particularly given legal and regulatory considerations in healthcare
  • Customization is more straightforward: modifying a deterministic algorithm is far simpler than customizing a statistical method, which essentially requires conducting years of research
  • Results might be more intuitive for healthcare providers to understand and trust

Cons:

  • Can at most explain correlation, not causation
  • If the algorithm is very complex, auditing it can become very difficult
  • Limited ability to handle uncertainty and variability inherent in biological data
  • May oversimplify complex relationships

Statistical Methods

Pros:

  • Statistical theory provides a robust framework for interpreting results
  • Assumptions must be considered explicitly, which promotes methodological rigor
  • Statistical theory gives us systematic ways of managing and quantifying uncertainty
  • Can establish stronger evidence for causal relationships when assumptions are met

Cons:

  • Results are difficult to interpret precisely, even for technical people
  • Models are significantly harder to manage in production environments
  • Additional complexity is hard to justify in most practical cases where simpler methods suffice
  • Temptation to claim things that cannot legitimately be claimed
  • May be difficult to explain to non-technical stakeholders

Challenges

Regardless of the algorithm chosen, certain aspects have universal implications:

Batch Processing vs. Real-Time: Non-real-time processing is generally better, and large batch processing is even better. This allows for more thorough data validation, reduces computational overhead, and provides opportunities for human oversight before results are applied clinically. It is also easier to maintain in production environments.
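
As a sketch of what the batch flavour can look like, assuming a hypothetical CSV export with patient_id, timestamp, and glucose_mg_dl columns: a nightly job that validates the day's readings and produces a summary a human can review before anything reaches the clinic.

```python
# A sketch of batch processing: a nightly job that validates a day's
# readings and writes a summary for human review. File paths, column
# names, and thresholds are hypothetical.
import pandas as pd

def nightly_summary(day_csv: str) -> pd.DataFrame:
    df = pd.read_csv(day_csv, parse_dates=["timestamp"])

    # Batch processing leaves room for thorough validation up front.
    df = df.dropna(subset=["glucose_mg_dl"])
    df = df[df["glucose_mg_dl"].between(20, 600)]  # drop physiologically implausible values

    # One summary row per patient, ready for a clinician to look over.
    return (df.groupby("patient_id")["glucose_mg_dl"]
              .agg(mean="mean", std="std",
                   pct_below_70=lambda s: (s < 70).mean() * 100,
                   pct_above_180=lambda s: (s > 180).mean() * 100))

# summary = nightly_summary("readings_2024-01-01.csv")  # hypothetical export
# summary.to_csv("daily_report.csv")                    # reviewed before clinical use
```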

Privacy and Bias Management: Data privacy concerns and result biases are easier to manage when there's no complex model involved, even better when there's no algorithm at all. Simple approaches also reduce the risk of inadvertent discrimination or privacy breaches through model inference.

Data engineering also presents universal challenges regardless of methodology.

Complexity: Feature engineering is almost always difficult to get right, requiring deep domain expertise and iterative refinement. The features that work well for one patient population may not generalize to another.
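
To give a flavour of what such features look like for glucose data, here is a sketch of some per-day summaries; this is an illustrative feature set of mine, not a validated one.

```python
# A sketch of glucose feature engineering: per-day summaries that a
# downstream method might consume. The feature set is illustrative only.
import numpy as np
import pandas as pd

def daily_features(readings: pd.Series) -> pd.DataFrame:
    """`readings` is a mg/dL series indexed by timestamp."""
    daily = readings.groupby(readings.index.date)
    return pd.DataFrame({
        "mean": daily.mean(),
        "cv_pct": daily.std() / daily.mean() * 100,  # glycemic variability
        "time_in_range_pct": daily.apply(lambda s: s.between(70, 180).mean() * 100),
        "n_lows": daily.apply(lambda s: (s < 70).sum()),
    })

# Synthetic two days of 5-minute readings, just to show the output shape.
idx = pd.date_range("2024-01-01", periods=576, freq="5min")
values = 130 + 30 * np.sin(np.linspace(0, 8 * np.pi, 576))
print(daily_features(pd.Series(values, index=idx)))
```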

Data Uncertainty: We face two distinct types of uncertainty in our data: measurement error (how accurate our instruments are) and variance of error (how consistent that accuracy is over time and conditions). This jitter in the data affects all analytical approaches.
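
A tiny simulation of the distinction, with purely illustrative numbers: a constant sensor bias on one side, and an error spread that drifts over time on the other.

```python
# A tiny simulation of the two kinds of uncertainty (numbers are illustrative):
# a constant measurement bias versus an error variance that drifts over time.
import numpy as np

rng = np.random.default_rng(1)
true_glucose = np.full(288, 120.0)       # flat "true" signal, for clarity

bias = 5.0                               # measurement error: sensor reads 5 mg/dL high
sigma = np.linspace(5, 20, 288)          # variance of error: noise grows as the sensor ages
measured = true_glucose + bias + rng.normal(0, sigma)

print("mean offset:", (measured - true_glucose).mean())               # ~ the bias
print("early vs late spread:", measured[:50].std(), measured[-50:].std())
```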

Method-Specific Variables: Different analytical methods require different variables that highlight different aspects of the data. For instance, some methods benefit from hourly aggregated variables, while others work better with zone-change variables that capture transitions between different glucose ranges. This is not only difficult to implement but also highly subjective.
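
Here is a sketch of those two variable sets derived from the same synthetic readings; the zone boundaries and names are my own illustrative choices.

```python
# A sketch of two method-specific variable sets built from the same readings:
# hourly aggregates on one side, zone-change events on the other.
import numpy as np
import pandas as pd

# Synthetic day of 5-minute readings swinging between roughly 60 and 200 mg/dL.
idx = pd.date_range("2024-01-01", periods=288, freq="5min")
glucose = pd.Series(130 + 70 * np.sin(np.linspace(0, 6 * np.pi, 288)), index=idx)

# Variable set A: hourly aggregates, for methods that expect a regular grid.
hourly = glucose.resample("1h").agg(["mean", "min", "max"])

# Variable set B: zone-change events, capturing transitions between glucose
# ranges (zone boundaries here are illustrative).
zones = pd.cut(glucose, bins=[0, 70, 180, np.inf],
               labels=["low", "in_range", "high"]).astype(str)
changes = zones[zones != zones.shift()]  # timestamps where a new zone is entered

print(hourly.head())
print(changes.head())
```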

Thank you

As always, thank you for reading.

If you want to send me something, XY at gmail.com where X = tommaso and Y = bassignana, no dots between X and Y. I'm always interested in hearing any of your comments and suggestions. I respond to every email ;)