Supervised Master's theses of Karl Fuerlinger
Theses and projects (PhD, MSc, BSc, Project)
Application-Specific GPU Scheduling.
April 2025.
BibTeX Entry
@misc{troj25,
  author      = {Paul Trojahn},
  title       = {{Application-Specific} {GPU} {Scheduling}},
  year        = {2025},
  key         = {troj25},
  month       = {4},
  school      = {Ludwig-Maximilians-Universität München},
  supervisors = {Karl Fuerlinger},
  type        = {Masterthesis},
}
Codama: A Configurable C Interpreter for the Clang AST.
March 2025.
BibTeX Entry
@misc{meln25,
  author      = {Andrian Melnikov},
  title       = {{Codama:} A {Configurable} C {Interpreter} for the {Clang} {AST}},
  year        = {2025},
  key         = {meln25},
  month       = {3},
  school      = {Ludwig-Maximilians-Universität München},
  supervisors = {Karl Fuerlinger},
  type        = {Masterthesis},
}
Quantization of LLMs in AI Compilers using MLIR.
March 2025.
BibTeX Entry
@misc{zier25,
  author      = {Thomas Ziereis},
  title       = {{Quantization} of {LLMs} in {AI} {Compilers} using {MLIR}},
  year        = {2025},
  key         = {zier25},
  month       = {3},
  school      = {Ludwig-Maximilians-Universität München},
  supervisors = {Karl Fuerlinger},
  type        = {Masterthesis},
}
CacheHound: Automated Reverse-Engineering of CPU Cache Policies in Modern Multiprocessors.
September 2024.
PDF
Abstract
In modern multiprocessors, hardware manufacturers employ a hierarchy of CPU caches to mitigate the considerable latency associated with accessing main memory. These CPU caches leverage the temporal and spatial locality of an application's data access patterns to serve a portion of the main memory at significantly reduced latencies. The operation of CPU caches is governed by cache policies. While this solution is effective in the majority of scenarios, an application may encounter difficulties in performing optimally under a given cache policy, potentially leading to issues such as thrashing. Awareness of the policy would facilitate the restructuring of the application to align with it. Such knowledge can be further applied to the domain of cache-based side-channels, from both a hardening and an attacker perspective. However, manufacturers typically refrain from disclosing the details of their cache policies, particularly those pertaining to the placement and replacement of data within the cache. Prior research has focused on the reverse-engineering of replacement policies, yet we are not aware of any investigation into placement policies. Moreover, to the best of our knowledge, there is currently no generic framework for the reverse-engineering of CPU caches. In this work, we devise such a framework and also develop a methodology for the reverse-engineering of placement policies. We provide a corresponding open-source implementation, called CacheHound, and benchmark it on several x86- and ARM-based systems. Finally, we employ the gained knowledge to explore use cases in the fields of security and high-performance computing (HPC).
BibTeX Entry
@misc{hilc24,
  author      = {Simon Hilchenbach},
  title       = {{CacheHound:} {Automated} {Reverse-Engineering} of {CPU} {Cache} {Policies} in {Modern} {Multiprocessors}},
  year        = {2024},
  pdf         = {https://bib.nm.ifi.lmu.de/pdf/hilc24.pdf},
  abstract    = {In modern multiprocessors, hardware manufacturers employ a hierarchy of CPU caches to mitigate the considerable latency associated with accessing main memory. These CPU caches leverage the temporal and spatial locality of an application's data access patterns to serve a portion of the main memory at significantly reduced latencies. The operation of CPU caches is governed by cache policies. While this solution is effective in the majority of scenarios, an application may encounter difficulties in performing optimally under a given cache policy, potentially leading to issues such as thrashing. Awareness of the policy would facilitate the restructuring of the application to align with it. Such knowledge can be further applied to the domain of cache-based side-channels, from both a hardening and an attacker perspective. However, manufacturers typically refrain from disclosing the details of their cache policies, particularly those pertaining to the placement and replacement of data within the cache. Prior research has focused on the reverse-engineering of replacement policies, yet we are not aware of any investigation into placement policies. Moreover, to the best of our knowledge, there is currently no generic framework for the reverse-engineering of CPU caches. In this work, we devise such a framework and also develop a methodology for the reverse-engineering of placement policies. We provide a corresponding open-source implementation, called CacheHound, and benchmark it on several x86- and ARM-based systems. Finally, we employ the gained knowledge to explore use cases in the fields of security and high-performance computing (HPC).},
  key         = {hilc24},
  month       = {9},
  school      = {Ludwig-Maximilians-Universität München},
  supervisors = {Karl Fuerlinger and Sergej Breiter},
  type        = {Masterthesis},
}
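The placement policies discussed in this abstract decide which cache set a given memory line maps to. As a point of reference only (not taken from the thesis), the textbook baseline that such reverse-engineering work compares hardware behavior against is simple modulo set indexing; a minimal sketch:

```python
def cache_set_index(addr: int, line_size: int = 64, num_sets: int = 1024) -> int:
    """Textbook modulo placement: the line number (address divided by the
    line size) taken modulo the number of sets. Real processors may apply
    hash-based placement instead, which is what tools like CacheHound aim
    to uncover. Parameter defaults are illustrative, not from the thesis."""
    return (addr // line_size) % num_sets

# Two addresses exactly num_sets * line_size apart collide in the same set.
assert cache_set_index(0) == cache_set_index(1024 * 64)
```

A measurement framework can detect deviations from this baseline by timing accesses to addresses that the modulo model predicts should (or should not) evict each other.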
Efficient Parallel Quantum Circuit Optimization using the ZX Calculus.
October 2023.
Abstract
Through the quantum effects of the so-called 'qubit', quantum computers are potentially able to solve a larger problem class than classical computers in polynomial time, called bounded-error quantum polynomial time (BQP). However, the practicality of scaling quantum computers to tackle larger problems comes with formidable challenges – increased costs, reduced accuracy, as well as longer design and construction times. These challenges can render a quantum computing design infeasible, emphasizing the critical need to ensure quantum circuits are only as large as they are required to be. In order to reduce the size of the circuit, various optimization techniques can be employed. Current quantum circuit optimization algorithms transform circuits into a more general representation in the ZX-calculus, perform various simplification operations on it, and then extract a reduced, but equivalent quantum circuit. Out of these, the simplification and extraction require the greatest amount of time, potentially even outclassing the computational class of BQP. Our research explores methods to parallelize the quantum circuit extraction process, rendering it suitable for high-performance systems. We present our implementation of a parallelized quantum circuit extraction algorithm, alongside surprising findings that have emerged during our work and evaluation, offering avenues for further investigation and refinement.
BibTeX Entry
@misc{quan23,
  author      = {Marcel Quanz},
  title       = {{Efficient} {Parallel} {Quantum} {Circuit} {Optimization} using the {ZX} {Calculus}},
  year        = {2023},
  abstract    = {Through the quantum effects of the so-called `qubit', quantum computers are potentially able to solve a larger problem class than classical computers in polynomial time, called bounded-error quantum polynomial time (BQP). However, the practicality of scaling quantum computers to tackle larger problems comes with formidable challenges – increased costs, reduced accuracy, as well as longer design and construction times. These challenges can render a quantum computing design infeasible, emphasizing the critical need to ensure quantum circuits are only as large as they are required to be. In order to reduce the size of the circuit, various optimization techniques can be employed. Current quantum circuit optimization algorithms transform circuits into a more general representation in the ZX-calculus, perform various simplification operations on it, and then extract a reduced, but equivalent quantum circuit. Out of these, the simplification and extraction require the greatest amount of time, potentially even outclassing the computational class of BQP. Our research explores methods to parallelize the quantum circuit extraction process, rendering it suitable for high-performance systems. We present our implementation of a parallelized quantum circuit extraction algorithm, alongside surprising findings that have emerged during our work and evaluation, offering avenues for further investigation and refinement.},
  key         = {quan23},
  month       = {10},
  school      = {Ludwig-Maximilians-Universität München},
  supervisors = {Karl Fuerlinger and Korbinian Staudacher},
  type        = {Masterthesis},
}
Application Fingerprinting using eBPF.
August 2023.
BibTeX Entry
@misc{szos23,
  author      = {Michael Szostak},
  title       = {{Application} {Fingerprinting} using {eBPF}},
  year        = {2023},
  key         = {szos23},
  month       = {8},
  school      = {Ludwig-Maximilians-Universität München},
  supervisors = {Karl Fuerlinger},
  type        = {Masterthesis},
}
Entwicklung leistungsfähiger RMA-Locks durch Portierung und Optimierung von NUMA-Algorithmen.
February 2022.
BibTeX Entry
@misc{uffm22,
  author      = {Adrian Uffmann},
  title       = {{Entwicklung} leistungsfähiger {RMA-Locks} durch {Portierung} und {Optimierung} von {NUMA-Algorithmen}},
  year        = {2022},
  key         = {uffm22},
  month       = {2},
  school      = {Ludwig-Maximilians-Universität München},
  supervisors = {Karl Fuerlinger and Dang Diep},
  type        = {Masterthesis},
}
Evaluating Sector Caches in High-Performance Computing.
February 2022.
PDF
Abstract
The sector cache is a hardware cache partitioning mechanism of the A64FX processor. The A64FX is used in the Fugaku system – currently the fastest supercomputer on the TOP500 list (as of November 2021). It allows application software to dynamically partition a cache and can reduce the occurrence of cache misses by protecting data with high temporal locality from eviction. Many cache partitioning techniques focus on optimizing the cache behavior of shared caches when multiple co-scheduled processes run on the same processor by assigning them to partitions. In contrast, the sector cache aims to improve the cache behavior of a single application by assigning its data to partitions. However, even the hardware manufacturer of the A64FX states that it is difficult to use the sector cache in a meaningful way. Therefore, a profiling tool based on the reuse distance metric is being developed using Intel’s PIN binary instrumentation framework. The profiling tool tries to provide programmers with opportunities where the sector cache can be usefully applied without requiring the programmer to have detailed knowledge of a program’s data locality. Using the parallel NAS benchmarks as an example, it is shown that the tool can indeed help programmers to find code regions where the sector cache can improve cache behaviour. In addition, it is shown that sector cache can significantly improve performance in certain typical situations and these as well as the sector cache behavior of the A64FX are explored and analyzed.
BibTeX Entry
@misc{brei22,
  author      = {Sergej Breiter},
  title       = {{Evaluating} {Sector} {Caches} in {High-Performance} {Computing}},
  year        = {2022},
  pdf         = {https://bib.nm.ifi.lmu.de/pdf/brei22.pdf},
  abstract    = {The sector cache is a hardware cache partitioning mechanism of the A64FX processor. The A64FX is used in the Fugaku system – currently the fastest supercomputer on the TOP500 list (as of November 2021). It allows application software to dynamically partition a cache and can reduce the occurrence of cache misses by protecting data with high temporal locality from eviction. Many cache partitioning techniques focus on optimizing the cache behavior of shared caches when multiple co-scheduled processes run on the same processor by assigning them to partitions. In contrast, the sector cache aims to improve the cache behavior of a single application by assigning its data to partitions. However, even the hardware manufacturer of the A64FX states that it is difficult to use the sector cache in a meaningful way. Therefore, a profiling tool based on the reuse distance metric is being developed using Intel’s PIN binary instrumentation framework. The profiling tool tries to provide programmers with opportunities where the sector cache can be usefully applied without requiring the programmer to have detailed knowledge of a program’s data locality. Using the parallel NAS benchmarks as an example, it is shown that the tool can indeed help programmers to find code regions where the sector cache can improve cache behaviour. In addition, it is shown that sector cache can significantly improve performance in certain typical situations and these as well as the sector cache behavior of the A64FX are explored and analyzed.},
  key         = {brei22},
  month       = {2},
  school      = {Ludwig-Maximilians-Universität München},
  supervisors = {Karl Fuerlinger and Josef Weidendorfer (LRZ)},
  type        = {Masterthesis},
}
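The reuse distance metric on which the profiling tool above is built counts the number of distinct addresses touched between two successive accesses to the same address; data with small reuse distances has high temporal locality and is a candidate for sector-cache protection. A minimal illustrative sketch (not the thesis implementation, which uses PIN-based binary instrumentation and is far more efficient):

```python
def reuse_distances(trace):
    """For each access in an address trace, return the number of *distinct*
    addresses accessed since the previous access to the same address, or
    infinity on a first access (a compulsory miss in any cache)."""
    last_pos = {}  # address -> index of its most recent access
    distances = []
    for i, addr in enumerate(trace):
        if addr in last_pos:
            # Distinct addresses touched strictly between the two uses.
            distances.append(len(set(trace[last_pos[addr] + 1 : i])))
        else:
            distances.append(float("inf"))
        last_pos[addr] = i
    return distances

# 'a' is reused after two distinct intervening addresses ('b' and 'c').
print(reuse_distances(["a", "b", "c", "a"]))  # [inf, inf, inf, 2]
```

Under an idealized fully associative LRU cache of capacity C lines, an access hits exactly when its reuse distance is below C, which is what makes the metric useful for predicting which data a cache partition should protect.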
Hybrid Data Race Detection using CLR Profiling and IL Instrumentation.
June 2021.
BibTeX Entry
@misc{dief21,
  author      = {Daniel Diefenthaler},
  title       = {{Hybrid} {Data} {Race} {Detection} using {CLR} {Profiling} and {IL} {Instrumentation}},
  year        = {2021},
  key         = {dief21},
  month       = {6},
  school      = {Ludwig-Maximilians-Universität München},
  supervisors = {Karl Fuerlinger and Felix Mößbauer (Siemens) and Dr. Christian Kern (Siemens)},
  type        = {Masterthesis},
}
Communication Offload Strategies for Multithreaded Distributed Memory Applications.
June 2021.
BibTeX Entry
@misc{pera21,
  author      = {Sebastian Peralta Friedburg},
  title       = {{Communication} {Offload} {Strategies} for {Multithreaded} {Distributed} {Memory} {Applications}},
  year        = {2021},
  key         = {pera21},
  month       = {6},
  school      = {Ludwig-Maximilians-Universität München},
  supervisors = {Karl Fuerlinger and Roger Kowalewski and Pascal Jungblut},
  type        = {Masterthesis},
}
Objektorientierte und Datenzentrische Parallelisierung einer Verkehrssimulation basierend auf dem Cell-Transmission-Model.
March 2018.
BibTeX Entry
@misc{sell18,
  author      = {Martin Sellmair},
  title       = {{Objektorientierte} und {Datenzentrische} {Parallelisierung} einer {Verkehrssimulation} basierend auf dem {Cell-Transmission-Model}},
  year        = {2018},
  key         = {sell18},
  month       = {3},
  school      = {Ludwig-Maximilians-Universität München},
  supervisors = {Karl Fuerlinger},
  type        = {Masterthesis},
}
Untersuchung des Laufzeitverhaltens von C++ Containern.
November 2017.
BibTeX Entry
@misc{jung17,
  author      = {Pascal Jungblut},
  title       = {{Untersuchung} des {Laufzeitverhaltens} von {C++} {Containern}},
  year        = {2017},
  key         = {jung17},
  month       = {11},
  school      = {Ludwig-Maximilians-Universität München},
  supervisors = {Karl Fuerlinger and Roger Kowalewski},
  type        = {Masterthesis},
}
Supporting the PGAS Model on Modern Shared Memory Systems.
February 2016.
BibTeX Entry
@misc{saum16,
  author      = {Bernhard Saumweber},
  title       = {{Supporting} the {PGAS} {Model} on {Modern} {Shared} {Memory} {Systems}},
  year        = {2016},
  key         = {saum16},
  month       = {2},
  school      = {Ludwig-Maximilians-Universität München},
  supervisors = {Karl Fuerlinger},
  type        = {Masterthesis},
}
Nasty-MPI: A Systematic Stress Testing Approach for Correctness Debugging in MPI-3 RMA.
November 2015.
BibTeX Entry
@misc{kowa15,
  author      = {Roger Kowalewski},
  title       = {{Nasty-MPI:} A {Systematic} {Stress} {Testing} {Approach} for {Correctness} {Debugging} in {MPI-3} {RMA}},
  year        = {2015},
  key         = {kowa15},
  month       = {11},
  school      = {Ludwig-Maximilians-Universität München},
  supervisors = {Dieter Kranzlmüller and Karl Fuerlinger},
  type        = {Masterthesis},
}
Implementation of the Partitioned Global Address Space Model for Multi-GPU Systems.
November 2014.
BibTeX Entry
@misc{zhou14,
  author      = {Lei Zhou},
  title       = {{Implementation} of the {Partitioned} {Global} {Address} {Space} {Model} for {Multi-GPU} {Systems}},
  year        = {2014},
  key         = {zhou14},
  month       = {11},
  school      = {Ludwig-Maximilians-Universität München},
  supervisors = {Karl Fuerlinger},
  type        = {Masterthesis},
}
Evaluation of Task Scheduling Algorithms and Wait-Free Data Structures for Embedded Multi-Core Systems.
October 2014.
BibTeX Entry
@misc{fuch14,
  author      = {Tobias Fuchs},
  title       = {{Evaluation} of {Task} {Scheduling} {Algorithms} and {Wait-Free} {Data} {Structures} for {Embedded} {Multi-Core} {Systems}},
  year        = {2014},
  key         = {fuch14},
  month       = {10},
  school      = {Ludwig-Maximilians-Universität München},
  supervisors = {Karl Fuerlinger},
  type        = {Masterthesis},
}
Hierarchical Arrays for Efficient and Productive Data-Intensive Parallel Computing.
August 2012.
BibTeX Entry
@misc{blei12,
  author      = {Benedikt Bleimhofer},
  title       = {{Hierarchical} {Arrays} for {Efficient} and {Productive} {Data-Intensive} {Parallel} {Computing}},
  year        = {2012},
  key         = {blei12},
  month       = {8},
  school      = {Ludwig-Maximilians-Universität München},
  supervisors = {Stephan Reiter and Karl Fuerlinger},
  type        = {Masterthesis},
}
Disclaimer:
This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by the authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.

