PhD defences Thursday December 19, 2024, 2PM, Amphithéâtre Turing and on Zoom
Srinidhi Nagendra
Automated Testing of Distributed Protocol Implementations

The growth of the modern internet is enabled by large-scale, highly available, and resilient distributed systems. These systems allow data to be replicated globally while ensuring availability under failures. To remain reliable and available in the presence of failures, these systems rely on intricate distributed protocols. Yet in practice, bugs in the implementations of these protocols have been the source of many outages in popular distributed databases. Ensuring the correctness of the implementations remains a significant challenge due to their large state spaces. Over the years, many techniques have been developed to check the correctness of implementations, ranging from systematic model checking to pure random exploration. However, a developer testing an implementation with current techniques has no control over the exploration; in effect, the techniques are agnostic to the developer's knowledge of the implementation. Furthermore, very few approaches utilize the formal models of the protocols when testing the implementations: efforts put into modeling and verifying the correctness of the model are not leveraged to ensure the correctness of the implementation.

To address these drawbacks, this thesis proposes three new approaches to test distributed protocol implementations: Netrix, WaypointRL, and ModelFuzz. The first two, Netrix and WaypointRL, are biased exploration approaches that accept developer input. Netrix is a novel unit testing algorithm that allows developers to bias the exploration of an existing testing algorithm: a developer writes low-level filters in a domain-specific language to fix specific events in an execution, and Netrix enforces these filters. WaypointRL improves on Netrix by accepting high-level state predicates as waypoints and using reinforcement learning to satisfy the predicates; it biases the exploration effectively while requiring less effort from the developer. Using popular distributed protocol implementations, we show that additional developer input leads to effective biased exploration and improved bug-finding capabilities. The third technique, ModelFuzz, is a new fuzzing algorithm that bridges the gap between the model and the implementation of a protocol: we use model states as coverage to guide the generation of inputs that are then executed on the implementation. We show on three industrial benchmarks that existing coverage notions are insufficient for testing distributed systems and that using TLA+ model coverage to test implementations leads to the discovery of new bugs.
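To make the model-coverage idea concrete, here is a minimal sketch of a coverage-guided fuzzing loop in which coverage is measured on TLA+ model states rather than on implementation code. All helper names (run_implementation, map_to_model_states, mutate) are hypothetical stand-ins for illustration, not the thesis's actual interface.

```python
import random

def model_coverage_fuzz(seed_tests, run_implementation, map_to_model_states,
                        mutate, budget):
    """Illustrative coverage-guided fuzzing loop: coverage is the set of TLA+
    model states visited by an execution (all callables are hypothetical)."""
    covered = set()              # model states reached so far
    corpus = list(seed_tests)    # inputs worth mutating further
    bugs = []
    for _ in range(budget):
        test = mutate(random.choice(corpus))
        trace, failed = run_implementation(test)   # run on the implementation
        states = set(map_to_model_states(trace))   # abstract trace -> model states
        if failed:
            bugs.append(test)
        if not states <= covered:    # the input reached new model states:
            covered |= states        # keep it in the corpus for future mutation
            corpus.append(test)
    return bugs, covered
```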
PhD defences Tuesday December 17, 2024, 2PM, Salle 3052, Bâtiment Sophie Germain
Maël Luce
Distributed Randomized and Quantum Algorithms for Cycle Detection in Networks

Distributed computing encompasses all models where a set of computing entities have to collectively solve a problem, each holding private inputs. This field of theoretical computer science underlies many real-world situations: multi-core processors, servers and clouds, the internet, swarms of robots, cars or other intelligent objects, and any network in which entities have some kind of independence. This thesis is dedicated to studying the impact of limited bandwidth on the ability to solve "local problems" in distributed networks. More precisely, in a graph-network where the amount of information exchanged between two neighbors is limited over time, how long does it take to detect a cycle?

First, we study a state-of-the-art algorithm for the detection of small even cycles and show that this technique becomes inefficient for larger cycle lengths: for any even cycle length greater than 10, there exists a family of graphs of growing size on which this technique cannot detect a cycle in reasonable time. Then, we exhibit a new algorithm that matches the complexity of the previous one and extends it to all even cycle lengths. To prove its correctness, we establish a general result in graph combinatorics, which can be seen as a lower bound on the "local density" of graphs without a 2k-cycle. Finally, we develop a new framework for quantum distributed computing, distributed quantum Monte-Carlo amplification, which allows us to obtain a quantum speed-up of our even-length cycle detection algorithm. The same technique can also be applied to speed up other cycle detection algorithms from the literature.
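For context only (this is the classical global statement, not the thesis's local result): the Bondy-Simonovits theorem captures the same flavor, namely that graphs without a 2k-cycle must be sparse. In its standard form,

\[
\mathrm{ex}(n, C_{2k}) \le 100\,k\, n^{1+1/k},
\]

i.e., every $n$-vertex graph with more than $100k \cdot n^{1+1/k}$ edges contains a cycle of length exactly $2k$.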
PhD defences Tuesday December 17, 2024, 5PM, Salle 3052, Bâtiment Sophie Germain & Zoom
Félix Castro
The ramified analytic hierarchy in second-order logic

In its first part, this thesis focuses on the study of the ramified analytic hierarchy (RAH) in second-order arithmetic (PA2). The ramified analytic hierarchy was defined by Kleene in 1960. It is an adaptation of the notion of constructibility (introduced by Gödel for set theory) to the framework of second-order arithmetic. The properties of this hierarchy, in relation to computability and to the study of set-theoretic models of PA2, have been studied in depth. It seems natural to formalize RAH in PA2 in an attempt to demonstrate that adding the axiom of choice or (a variant of) the axiom of constructibility to second-order arithmetic does not introduce a contradiction. However, the only written trace of such a formalization appears to be incorrect. In this thesis, we work on this formalization. To do so, we work in a version of arithmetic obtained by removing the axiom of induction from the axioms of PA2. In this system, a new variant of the axiom of choice appears: we call it the axiom of collection, in reference to the homonymous axiom of set theory. This axiom seems never to have been considered in the context of second-order logic. We show that it has good computational properties: its contrapositive is realized by the identity in classical realizability, while the axiom itself is realized by the identity in intuitionistic realizability. In addition, we show that it is equivalent to an axiom which is well-behaved with respect to a negative translation from classical logic into intuitionistic logic. Finally, we show that this variant of the axiom of collection is weaker than a variant of the axiom of choice in intuitionistic logic. We then work in a theory without induction but containing the axiom of collection, and study the ramified analytic hierarchy in it. We show that it is a model of PA2 satisfying a strong version of the axiom of choice: the principle of the well-ordered universe. The axiom of collection seems necessary to prove this result, and we explain this intuition thoroughly.

In the second part of the thesis, shorter than the first, we study extensional equality in finite type arithmetic. Higher Type Arithmetic (HAω) is a first-order many-sorted theory. It is a conservative extension of Heyting Arithmetic obtained by extending the syntax of terms to all of System T: the objects of interest here are the functionals of higher types. While equality between natural numbers is specified by the axioms of Peano, how can equality between functionals be defined? From this question, different versions of HAω arise, such as an extensional version (E-HAω) and an intensional version (I-HAω). In this work, we show how the study of partial equivalence relations leads us to design a translation by parametricity from E-HAω to HAω.

PhD defences Monday December 16, 2024, 10AM, Salle 3052
Avinandan Das (IRIF, CNRS, Université Paris Cité)
Computation with Partial Information: An Algorithmic Investigation of Graph Problems in Distributed Computing and Streaming

This thesis discusses the paradigm of computation with partial information, which is necessitated by the infeasibility of processing massive datasets in their entirety. Traditional algorithms, which require full access to input data, become impractical for large-scale problems due to time and space constraints. Several models of computation that allow for partial or incomplete data access, such as property testing, data stream models, distributed computing, and massively parallel computing, have been proposed in the literature. This thesis deals with two models in this paradigm: the distributed LOCAL model and the data stream model.

In the LOCAL model, we extend existing frameworks for the well-studied problem of $(\Delta+1)$-proper coloring to a more general $k$-partial $(k+1)$-coloring problem, providing novel algorithms and complexity results. Given an $n$-node graph $G=(V,E)$, a $k$-partial $c$-coloring asks for a coloring of the vertices with colors in $\{1,\dots,c\}$ such that every vertex $v\in V$ has at least $\min\{k,\deg(v)\}$ neighbors with a color different from its own. We extend the rounding-based algorithm of Ghaffari and Kuhn (2020) to give an $O(\log n\cdot \log^3 k)$-round algorithm for $k$-partial list coloring and an $O(\log n\cdot \log^2 k)$-round algorithm for $k$-partial $(k+1)$-coloring.
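The $k$-partial coloring condition is purely local, which makes it easy to state operationally; the following checker (an illustration, not part of the thesis) verifies the condition for a graph given as adjacency lists.

```python
def is_k_partial_coloring(adj, color, k):
    """Check the k-partial c-coloring condition: every vertex v must have at
    least min(k, deg(v)) neighbors whose color differs from color[v].
    adj:   dict mapping each vertex to an iterable of its neighbors.
    color: dict mapping each vertex to its color."""
    for v, neighbors in adj.items():
        neighbors = list(neighbors)
        differing = sum(1 for u in neighbors if color[u] != color[v])
        if differing < min(k, len(neighbors)):
            return False
    return True

# A proper coloring is the special case where k is at least the max degree:
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
assert is_k_partial_coloring(triangle, {0: 1, 1: 2, 2: 3}, k=2)
assert not is_k_partial_coloring(triangle, {0: 1, 1: 1, 2: 1}, k=1)
```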
In the data stream model, we introduce the "certification" of solutions to computing problems when access to the input is restricted. This topic has received a lot of attention in the distributed computing setting, and we introduce it here in the context of streaming algorithms, where the input is too large to be stored in memory. Given a property $P$, a streaming certification scheme for $P$ is a prover-verifier pair where the prover is a computationally unlimited but untrusted oracle, and the verifier is a streaming algorithm. For any input, the prover provides the verifier with a certificate. The verifier then receives its input as a stream in an adversarial order and must check whether the certificate is indeed a proof that the input satisfies $P$. The main complexity measure for a streaming certification scheme is its space complexity, defined as the sum of the size of the certificate provided by the oracle and the memory space used by the verifier. We give streaming certification schemes for several graph properties, including maximum matching, diameter, degeneracy, and coloring, with space complexity matching the requirement of semi-streaming, i.e., with space complexity $O(n \,\mathrm{polylog}\, n)$ for $n$-node graphs. For each of these properties, we provide upper and lower bounds on the space complexity of the corresponding certification schemes, many being tight up to logarithmic multiplicative factors. We also show that some graph properties are hard for streaming certification, in the sense that they cannot be certified in semi-streaming, as they require $\Omega(n^2)$-bit certificates. The results presented in this thesis contribute to the growing body of work on algorithms for large datasets, particularly in the contexts of graph theory and distributed systems, and open new avenues for future research in computation with partial information.
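As a toy instance of the prover-verifier setup (an illustration, not one of the thesis's schemes), certifying that a streamed graph admits a proper $c$-coloring fits in semi-streaming: the prover sends a claimed coloring as the certificate ($O(n \log c)$ bits), and the verifier stores only the certificate and checks every edge of the stream against it.

```python
def verify_coloring_certificate(certificate, edge_stream, c):
    """Toy streaming certification: the untrusted prover supplies a claimed
    proper c-coloring (the certificate); the verifier stores only the
    certificate and checks each edge of the adversarially ordered stream."""
    if any(not (1 <= col <= c) for col in certificate.values()):
        return False                      # malformed certificate
    for u, v in edge_stream:              # single pass over the stream
        if certificate[u] == certificate[v]:
            return False                  # certificate is not a proper coloring
    return True

edges = [(0, 1), (1, 2), (0, 2)]          # a triangle, streamed in any order
assert verify_coloring_certificate({0: 1, 1: 2, 2: 3}, iter(edges), c=3)
assert not verify_coloring_certificate({0: 1, 1: 2, 2: 1}, iter(edges), c=3)
```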
PhD defences Wednesday November 27, 2024, 2:30PM, Salle 3052, Bâtiment Sophie Germain
Alexandra Rogova (IRIF)
Design principles of property graph languages, a theoretical and experimental study

In the last 50 years, the amount and complexity of information stored in databases have grown and the needs of database users have evolved. These changes have led to the creation of new models such as XML, key-value stores, time series databases, and so on. The subject of this thesis is one such model: the graph database. In a graph database, data is stored as a "graph": a collection of nodes, representing the entities, connected by edges, representing the relationships between the entities. Additional information about the entities and their relationships can be attached to the nodes and edges. This powerful model, popularized by Neo4j in 2010, is now a staple in various industries such as biology, social networks, and banking. By putting the links between data points front and center, graph databases allow users to reason not only about individual elements but also about the structure of the graph. Accordingly, the goal of a typical graph query is to find a "path" connecting specific nodes. Because graph traversals inherently rely on transitivity, traditional query languages are not suitable in the graph context, and thus new languages have been created. In the theoretical community, the basic building blocks of a graph language are the Regular Path Queries (RPQs), which define path constraints by way of regular expressions. The expressive power and complexity of RPQs and their extensions (by union, conjunction, two-way navigation, data value comparisons, and path properties, for example) have been studied since the 1990s, but the properties of these extensions are only beginning to be understood.
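For illustration, an RPQ asks for the nodes reachable by a path whose sequence of edge labels matches a regular expression; the textbook way to evaluate one is a breadth-first search over the product of the graph and an automaton for the expression. The sketch below uses a hypothetical edge-list input format and is not code from the thesis.

```python
from collections import deque

def evaluate_rpq(edges, nfa, start_state, accept_states, source):
    """Evaluate a Regular Path Query from `source`: return all nodes reachable
    by a path whose label sequence is accepted by the NFA (textbook product
    construction, hypothetical input format).
    edges: iterable of (node, label, node) triples.
    nfa:   dict mapping (state, label) to a set of successor states."""
    out = {}
    for u, label, v in edges:
        out.setdefault(u, []).append((label, v))
    reached, frontier = set(), deque([(source, start_state)])
    seen = {(source, start_state)}
    while frontier:
        node, state = frontier.popleft()
        if state in accept_states:
            reached.add(node)
        for label, succ in out.get(node, []):
            for next_state in nfa.get((state, label), ()):
                if (succ, next_state) not in seen:
                    seen.add((succ, next_state))
                    frontier.append((succ, next_state))
    return reached

# Nodes reachable from "a" via one `knows` edge, then any number of `works_with` edges.
edges = [("a", "knows", "b"), ("b", "works_with", "c"), ("c", "works_with", "d")]
nfa = {(0, "knows"): {1}, (1, "works_with"): {1}}
print(evaluate_rpq(edges, nfa, 0, {1}, "a"))   # {'b', 'c', 'd'}
```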
The most popular practical graph language today is Neo4j's Cypher. In Cypher, a path can be described by a regular expression, but the language also includes many other features, among which aggregation, different path semantics, projection, subqueries, and updates. These elements differ from those of other languages, like TigerGraph's GSQL or Oracle's PGQL, but all graph systems share the same kernel: pattern matching. In 2019, a new International Organization for Standardization (ISO) group was created to oversee the standardization of practical graph languages. This led to two new standards: SQL/PGQ and GQL. The idea of SQL/PGQ is to add a view-based pattern matching mechanism to SQL and interpret the relational data as a graph only when necessary, whereas GQL is standalone and stores the data as a native graph. While this standardization work is a step in the right direction, there is one crucial ingredient missing: a corresponding theoretical model. The goal of this thesis is to define a theoretical language for graph databases, akin to relational algebra for SQL, that reflects the essential aspects of GQL and SQL/PGQ while being simple enough for theoretical study. We start by presenting in detail the pattern matching features shared by SQL/PGQ and GQL. We then identify and formalize the core of this language. Afterwards, we position our formalisations of SQL/PGQ and GQL in comparison to relational algebra, which allows us to highlight their distinct styles while also proving their equivalence. Finally, we explore the impact of extending pattern matching with list functions and show that this addition is not only dangerous in theory but also fails in practice.

PhD defences Friday November 15, 2024, 10AM, Salle 279F, Halle aux farines
Colin González (IRIF)
From spreadsheets to functional programs: compilation and efficient execution of structured spreadsheets

The development of applications with spreadsheets by non-professional programmers is a widespread practice both in industrial settings and in certain research fields. Owing to their simplicity and interactivity, spreadsheets account for a significant part of software development today. However, they can also be seen as poorly structured programs prone to programming errors. This thesis proposes a new perspective on spreadsheet development by adopting the principles of structured programming. Through the development of Lisa, a domain-specific language, we bring the structured programming approach to spreadsheets. The main feature of this approach is to organise sheets by columns rather than by cells; in this way, we can model columns as potentially infinite streams of data. The unique aspect of this language is the introduction of indexed streams, where the index captures the notion of the current row number of a given column's cell. A relative access operation (relative to the current index) preserves a programming style akin to that of formulas in traditional spreadsheets.

The remainder of the thesis addresses the design and implementation of Lisa, a functional language focused on programming with indexed streams. In particular, we cover the compilation of Lisa into executables for various execution targets, ranging from traditional hardware platforms to the blockchain. The programs produced by the compilation process depend on a runtime environment that provides indexed streams and the relative access operator. This operation is particularly slow on classical implementations of streams; to address this performance issue, we propose a new way of programming streams that is well suited to accelerating this key operation. Lisa is a fundamental component for the design of a specialised spreadsheet application tailored to the development of structured spreadsheets. In such an application, Lisa is used both to implement the formulas of a sheet and to develop extensions in the form of macro definitions. Finally, the compiled approach makes it possible to generate, from a structured sheet, an executable compatible with various software interfaces, thus enabling spreadsheet developments carried out by end users to be composed modularly with highly technical software components developed by specialists.
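To picture indexed streams and relative access, here is a small analogy in Python (not Lisa's syntax or its actual runtime): a column is a function from a row index to a value, and relative access shifts the current index. A real implementation would memoize cells; this naive version recomputes them, which is precisely the kind of slowness the thesis addresses.

```python
def offset(column, delta, default=0):
    """Relative access: read `column` at the current row shifted by `delta`
    (delta=-1 reads the cell just above, like referring to A1 from A2)."""
    return lambda i: column(i + delta) if i + delta >= 0 else default

def running_total(a):
    # A column defined like the spreadsheet formula  C_i = A_i + C_{i-1}.
    def c(i):
        return a(i) + offset(c, -1)(i)
    return c

sales = lambda i: [10, 20, 30, 40][i]      # a finite column for the demo
total = running_total(sales)
print([total(i) for i in range(4)])         # [10, 30, 60, 100]
```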
PhD defences Tuesday November 12, 2024, 4PM, Halle aux Farines, Room 580F (salle des thèses) & Zoom: https://u-paris.zoom.us/j/5848945659? (Password: 248469)
Clément Ducros
Multi-Party Computation Based on the Computational Hardness of Coding Theory

Since the beginning of modern cryptography, the question of secure multi-party computation (MPC) has been extensively studied. MPC allows a set of parties to perform some joint computation while ensuring that their inputs remain private, up to what is implied by the output. The massive growth of our daily Internet usage makes the need for secure computation methods ever more urgent: computation on private data is everywhere, from recommendation algorithms to electronic auctions. Modern MPC protocols greatly benefit from correlated randomness to achieve fast online phases. Correlated randomness has to be mass-produced beforehand for these protocols to work best, and doing so efficiently is still a challenge today.

In this thesis, we study pseudorandom correlation generators (PCGs). This construction transforms a small amount of correlated randomness into a large amount of correlated pseudorandomness with minimal interaction and only local computation. We present different PCG constructions whose security relies on variants of the Syndrome Decoding (SD) assumption, a classical assumption in coding theory. The intuition behind these constructions is to merge the SD assumption with MPC techniques that enable additively sharing sparse vectors.
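For reference, a standard (decisional) formulation of the syndrome decoding assumption, of which the thesis studies variants: for suitable code length $n$, dimension $k$, and noise weight $t$,

\[
(H,\; H e) \;\approx_c\; (H,\; u),
\]

where $H$ is a uniformly random matrix in $\mathbb{F}_q^{(n-k)\times n}$, $e \in \mathbb{F}_q^{n}$ is a uniformly random vector of Hamming weight $t$, and $u$ is uniform in $\mathbb{F}_q^{n-k}$; that is, the syndrome of a sparse error vector is computationally indistinguishable from random.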
We achieve state-of-the-art results for secure computation protocols with more than two players when the computation takes place over a finite field of size greater than 2, and we show how to remove this constraint at the cost of a small communication overhead. Next, we consider the construction of pseudorandom correlation functions (PCFs). PCFs produce correlations on the fly: each player holds a correlated function, and the outputs of these functions, when evaluated on the same input, are correlated. These objects offer more flexibility than PCGs but are also harder to construct. We again rely on variants of SD to construct PCFs. We build on a previous PCF construction, show that its proof of security was incorrect, and propose a correction. Additionally, we present optimizations of the construction, coming from a finer analysis and a precise tuning of parameters via simulations, and achieve parameters usable in practice. Finally, this thesis revolves around the use of particular codes and the SD assumption. The search for good complexity entails finding the most efficient codes while ensuring that security is not compromised. We consider a framework tailored to the study of promising attacks on SD and its variants. For each construction mentioned above, we conduct a thorough security analysis of the associated variant of SD and carry out cryptanalysis efforts to compute concrete parameters.

PhD defences Friday October 4, 2024, 2PM, Amphithéâtre Turing, Bâtiment Sophie Germain & Zoom
Klara Nosan
Zero problems in polynomial models

Polynomial models are ubiquitous in computer science, arising in the study of automata and formal languages, optimisation, game theory, control theory, and numerous other areas. In this thesis, we consider models described by polynomial systems of equations and difference equations, where the system evolves through a set of discrete time steps with polynomial updates at every step. We explore three aspects of "zero problems" for polynomial models: zero testing for algebraic expressions given by polynomials, determining the existence of zeros of polynomial systems, and determining the existence of zeros in sequences satisfying recurrences with polynomial coefficients.

In the first part, we study identity testing for algebraic expressions involving radicals. That is, given a k-variate polynomial represented by an algebraic circuit and k real radicals, we examine the complexity of determining whether the polynomial vanishes on the radical input. We improve on the existing PSPACE bound, placing the problem in coNP assuming the Generalised Riemann Hypothesis (GRH). We further consider a restricted version of the problem, where the inputs are square roots of odd primes, and show that it can be decided in randomised polynomial time assuming GRH. We next consider systems of polynomial equations and study the complexity of determining whether such a system has a solution. We present a number-theoretic approach to the problem, generalising techniques used for identity testing, and show that the problem belongs to the complexity class AM assuming GRH. We discuss how the problem relates to determining the dimension of a complex variety, which is also known to belong to AM assuming GRH. In the final part of this thesis, we turn our attention to sequences satisfying recurrences with polynomial coefficients. We study the question of whether zero is a member of a polynomially recursive sequence arising as a sum of two hypergeometric sequences. More specifically, we consider the problem for sequences where the polynomial coefficients split over the field of rationals Q. We show its relation to the values of the Gamma function evaluated at rational points, which allows us to establish decidability of the problem under the Rohrlich-Lang conjecture. We then propose a different approach based on studying the prime divisors of the sequence, allowing us to establish unconditional decidability of the problem.

PhD defences Friday June 21, 2024, 9:30AM, Salle 3052, Bâtiment Sophie Germain
Mickaël Laurent (IRIF, LMF)
Polymorphic type inference for dynamic languages: reconstructing types for systems combining parametric, ad-hoc, and subtyping polymorphism

Programming languages can be broadly categorized into two groups: static languages, such as C, Rust, and OCaml, and dynamic languages, such as Python, JavaScript, and Elixir. While static languages generally offer better performance and more security thanks to a phase of static typing that precedes compilation ("well-typed programs cannot go wrong"), this often comes at the expense of flexibility, making programs safer but prototyping more tedious. In this defense, I present a formal system for statically typing dynamic languages, offering a compromise between safety and flexibility.

Statically typing dynamic languages poses new challenges. In particular, functions in dynamic languages can be overloaded: they can express multiple behaviors that cannot be resolved statically, but are instead selected at runtime (either by explicit type-cases or by built-in dynamic dispatch). In addition, programs typically have no (or few) type annotations. To address these problems, I present a type system that combines, in a controlled way, first-order polymorphism with intersection types, union types, and subtyping, and I provide an algorithm to automatically reconstruct (infer) the types of functions. This results in a type discipline in which unannotated functions are given polymorphic types (thanks to Hindley-Milner) that can capture the overloaded behavior of the functions they type (thanks to intersection types) and that are deduced by applying advanced techniques of type narrowing (thanks to union types). This makes the system a prime candidate for typing dynamic languages.
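As an illustration of the kind of function such a system must type (written in Python syntax for readability; this example is not from the thesis), consider a function whose behavior is selected by a runtime type-case:

```python
def normalize(x):
    # A runtime type-case: the branch taken depends on the dynamic type of x,
    # so the function is overloaded in the sense described above.
    if isinstance(x, int):
        return x + 1          # on integers, behaves as Int -> Int
    else:
        return x.strip()      # on strings, behaves as Str -> Str

print(normalize(41))       # 42
print(normalize(" hi "))   # 'hi'
```

A type system with intersection types can give `normalize` a type of the form `(Int -> Int) & (Str -> Str)`, which is more precise than any single arrow type: each call site is checked against the branch that actually applies, capturing the overloaded behavior the abstract refers to.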