A Case Study for the Lleida Population Cancer Registry
University of Lleida

Context
This work was part of an Industrial PhD (Florensa Cazorla, 2023), a collaboration between the Population Cancer Registry, Arnau de Vilanova Hospital, and the University of Lleida.
Purpose
Develop an optimized platform that provides high-quality data for identifying associations between medications and cancer types within the Lleida population cancer registry.
Goal
Analyze how each combination of medication and cancer type affects patient survival (protective or harmful).
Challenge
Analyzing 79,931 combinations of medications and cancer types from 2007-2019.
Initial Approach
A single machine would require 61 days to complete this analysis, with each combination consuming 66 seconds.

Observation
Proposed solutions reduce query time (at the cost of disk space for indexes). However, deserialization time increases across all proposals.
Next Steps
How can we simultaneously minimize deserialization time and query execution time?
Solution
A redesigned, query-driven schema centered on patients, expositions, and cancers.

Results
isin(): 3.25 ms
loc with prefill: 464 μs
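A minimal sketch of this comparison in pandas, assuming an illustrative expositions frame keyed by patient_id; the column names, data sizes, and any timings obtained are hypothetical, not the registry's real schema or measurements:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Illustrative exposure table: one row per dispensed medication.
expositions = pd.DataFrame({
    "patient_id": rng.integers(0, 100_000, size=1_000_000),
    "medication": rng.integers(0, 500, size=1_000_000),
})
cohort = rng.choice(100_000, size=2_000, replace=False)

# Baseline: a boolean mask built with isin() for every combination.
subset_isin = expositions[expositions["patient_id"].isin(cohort)]

# Optimized: "prefill" the index once, then reuse cheap label lookups.
indexed = expositions.set_index("patient_id").sort_index()
cohort_present = pd.Index(cohort).intersection(indexed.index)
subset_loc = indexed.loc[cohort_present]

assert len(subset_isin) == len(subset_loc)
```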
Summary of Optimizations
The total time reduction is around 52 ms per combination; across 79,931 combinations this amounts to a total reduction of roughly one hour.
Problem
Disk I/O for inter-process communication (IPC) with R is a significant bottleneck.
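One way to remove this file-based exchange is to pass data frames to an embedded R interpreter in memory. The sketch below assumes rpy2 and the survival package, with a hypothetical cox_df containing time, status, and exposed columns; it illustrates the idea rather than the exact mechanism used in this work.

```python
import pandas as pd
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter

ro.r("suppressMessages(library(survival))")

def calculate_cox_analysis(cox_df: pd.DataFrame) -> str:
    """Fit a Cox model in the embedded R session, with no CSV round-trip to disk."""
    # Convert the pandas frame to an R data.frame entirely in memory.
    with localconverter(ro.default_converter + pandas2ri.converter):
        ro.globalenv["cox_df"] = ro.conversion.py2rpy(cox_df)
    # Hypothetical model: survival time versus exposure to one medication.
    summary_lines = ro.r(
        "capture.output(summary(coxph(Surv(time, status) ~ exposed, data = cox_df)))"
    )
    return "\n".join(summary_lines)
```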
Results
We reduced the processing time from 66 seconds to less than 1 second per combination.
| Function | Time (ms) |
|---|---|
| get_cox_df | 52 |
| calculate_cox_analysis | 776 |
| parse_cox_analysis | 22 |
| save_results | 21 |

Technical Insights
Hybrid Strategy
Since threads and processes are not mutually exclusive, we adopted a hybrid approach that combines multiprocessing with threading.
Resource Calibration
The hybrid approach allows fine-tuned calibration of threads and processes, adapting to the device’s CPU and memory capacity. This ensures optimal throughput without exceeding hardware limits.
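A minimal sketch of such a calibrated hybrid scheme, assuming Python's concurrent.futures; the run_combination placeholder, worker counts, and batch size are hypothetical, not the platform's actual configuration.

```python
import os
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def run_combination(combo):
    # Placeholder for one medication x cancer-type survival analysis.
    medication, cancer_type = combo
    return medication, cancer_type

def process_batch(batch, threads_per_process=4):
    # Inside each process, a thread pool overlaps the I/O-bound steps
    # (queries, result writes) of several combinations.
    with ThreadPoolExecutor(max_workers=threads_per_process) as threads:
        return list(threads.map(run_combination, batch))

def run_all(combinations, processes=None, batch_size=64):
    # CPU-bound work is spread over processes; memory use is bounded by
    # the number of processes times the batch size.
    processes = processes or os.cpu_count()
    batches = [combinations[i:i + batch_size]
               for i in range(0, len(combinations), batch_size)]
    results = []
    with ProcessPoolExecutor(max_workers=processes) as pool:
        for partial in pool.map(process_batch, batches):
            results.extend(partial)
    return results

if __name__ == "__main__":
    combos = [(m, c) for m in range(50) for c in range(20)]
    print(len(run_all(combos)))
```

The processes and threads_per_process knobs are the values that would be calibrated against the device's CPU and memory capacity.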

Architecture
Rationale
| Feature | Traditional MPI Cluster | Kubernetes |
|---|---|---|
| Resource Allocation | Static (fixed per job) | Dynamic (per-task) |
| Scaling | Manual intervention required | Auto-scaling (HPA + Cluster) |
| Fault Tolerance | Job fails if worker crashes | Self-healing |
Comparative Analysis
| Cloud | Instance type | CoreMark | Workers | vCPUs | Tasks/s | Total time |
|---|---|---|---|---|---|---|
| GKE | e2-highcpu-4 | 51937 | 1 | 4 | 1.0 | 22h 12min |
| GKE | e2-highcpu-4 | 51937 | 2 | 8 | 1.9 | 11h 41min |
| GKE | e2-highcpu-4 | 51937 | 4 | 16 | 3.6 | 06h 10min |
| GKE | e2-highcpu-4 | 51937 | 8 | 32 | 7.0 | 03h 10min |
| GKE | c2d-highcpu-4 | 86953 | 4 | 16 | 17.0 | 01h 18min |
| On-premise | opteron_6247 | 9634 | 1 | 10 | 0.4 | 2d 7h 30min |
| On-premise | opteron_6247 | 9634 | 2 | 20 | 0.88 | 1d 1h 13min |
| On-premise | opteron_6247 | 9634 | 4 | 40 | 2 | 11h 6min |
Key Optimizations
Schema Optimization
Query-driven design and better deserialization.
Precomputation & Storage
Eliminated redundant calculations and migrated from CSV to Parquet for columnar efficiency.
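A minimal sketch of such a CSV-to-Parquet migration, assuming pandas with a pyarrow backend; the file names and columns are illustrative, not the registry's actual files.

```python
import pandas as pd

# One-off migration: read the legacy CSV extract and write it as Parquet.
expositions = pd.read_csv("expositions.csv", parse_dates=["dispense_date"])
expositions.to_parquet("expositions.parquet", index=False)

# Later queries read only the columns they need, thanks to columnar storage.
subset = pd.read_parquet(
    "expositions.parquet",
    columns=["patient_id", "medication", "dispense_date"],
)
```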
Compute Efficiency and Communication Overhead
Index-aware queries and optimized pipelines.
Parallel Execution
Hybrid threading/multiprocessing to maximize resource utilization.
Distributed Scaling
Kubernetes-orchestrated workers with queue-based load balancing.
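A minimal sketch of a queue-consuming worker that such Kubernetes pods could run, assuming a Redis-backed task queue; the broker, key names, and process_combination helper are hypothetical, not the system's actual components.

```python
import json
import os

import redis

def process_combination(medication: str, cancer_type: str) -> dict:
    # Placeholder for the per-combination survival analysis.
    return {"medication": medication, "cancer_type": cancer_type, "status": "done"}

def main() -> None:
    # Each worker pod connects to the same broker; Kubernetes scales the
    # number of pods, and the queue balances the load between them.
    broker = redis.Redis(host=os.environ.get("REDIS_HOST", "redis"), port=6379)
    while True:
        item = broker.blpop("tasks", timeout=30)
        if item is None:  # queue drained: let the pod exit cleanly
            break
        _, payload = item
        task = json.loads(payload)
        result = process_combination(task["medication"], task["cancer_type"])
        broker.rpush("results", json.dumps(result))

if __name__ == "__main__":
    main()
```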
Take Home Messages

CMMSE 2025