Announcement of Lab Seminar (Prof. Marco Canini)

We are pleased to announce that Prof. Marco Canini (KAUST) will give a talk on observability techniques for large-scale cloud systems at The University of Tokyo, as part of the Laboratory Seminar Series. All are welcome to attend.
Event Details
- Title: Metrics, Mayhem, and Microservices: Taming the Cloud Observability Beast
- Speaker: Prof. Marco Canini (KAUST)
- Date: Friday, June 27, 2025
- Time: 10:30–11:30 (JST)
- Venue: Room 214, Building 7, School of Science, The University of Tokyo
- Format: Hybrid
- Registration: https://forms.gle/zDAj15M6iKDTeNLK7 (Zoom link will be provided to those who register for online attendance)
Speaker
Marco does not know what the next big thing will be. He asked ChatGPT, though the answer was underwhelming. But he’s sure that our future next-gen computing and networking infrastructure must be a viable platform for it. Marco’s research spans a number of areas in computer systems, including distributed systems, large-scale/cloud computing and computer networking with emphasis on programmable networks. His current focus is on designing better systems support for AI/ML and providing practical implementations deployable in the real world.
Marco is a Professor of Computer Science at KAUST. He obtained his Ph.D. in computer science and engineering from the University of Genoa in 2009, spending the final year of his doctorate as a visiting student at the University of Cambridge. He was a postdoctoral researcher at EPFL and a senior research scientist at Deutsche Telekom Innovation Labs & TU Berlin. Before joining KAUST, he was an assistant professor at UCLouvain. He has also held positions at Intel, Microsoft, and Google.
Abstract
Cloud applications scale their workloads on massively distributed software and hardware infrastructure to deliver swift performance and meet stringent service level objectives. The latest advancements in AI and the promising results delivered by flagship AI models have reinforced this trend, fueling considerable additional investments by all major cloud players to further scale their infrastructure. Yet fault tolerance and performance debugging remain among the few levers for managing this exponential growth, demanding ubiquitous system instrumentation and observability to anticipate failures or identify root causes post-mortem. Overall, observability has become mission-critical for operating cloud technology at scale.
In this talk, we introduce the first-ever offloading of observability operations to data centers' Infrastructure Processing Unit (IPU) accelerators. This novel architecture can significantly reduce the cost and increase the operational efficiency of managing large-scale cloud systems serving millions of users daily.
First, we examine the efficiency of observability for cloud-native microservice applications and quantify the impact of today's observability on application performance. We show that state-of-the-art observability frameworks fail to meet the demands of cloud-native environments, either incurring crippling complexity and high costs to collect and store huge data volumes, or sacrificing event coverage through coarse-grained sampling. Then, we present our framework, MicroView, which leverages the proximity between IPUs and the monitored services to tackle observability bloat. IPUs crucially enable MicroView to run continuous, real-time, high-resolution analysis of observability data in a lightweight data plane without hurting application performance.
Our performance evaluation on representative benchmark applications demonstrates that MicroView's real-time analysis helps to (1) anticipate SLO violations, (2) narrow the focus onto informative observability data, and (3) trigger useful signals about service performance.