Dissertations and Theses - Computer Science
http://hdl.handle.net/2142/10761
Dissertations and Theses from the Dept. of Computer Science
Last updated: Tue, 13 Oct 2015 07:25:25 GMT
http://hdl.handle.net/2142/88139
Intraday market dynamics
The revolutionary technological and regulatory changes in financial markets over the first few years of the new millennium have radically altered trading routines and strategies. Algorithms have taken over trade executions in an environment where interactions between virtual traders happen faster than the blink of an eye. The trading strategies of long-term investors, e.g. institutional investors, have moved from submitting a few large orders to finely splitting orders over time and across trading venues in order to minimize market impact. Instead of human decisions, the instructions that algorithms follow in order to locate liquidity, exploit arbitrage opportunities, detect patterns, etc. determine the size and timing of transactions. As a result, individual transactions are far from reflecting economic decisions, and classical models of market microstructure may not be used to describe phenomena at the transaction level.
I develop a novel aggregation approach that accounts for features of modern markets; for a given stock, I identify successive sequences of transactions where the cumulative dollar volume of each sequence is a fraction of the previous month's market capitalization plus a fixed dollar amount. The time durations of these trade sequences measure trading activity, and the corresponding price changes reflect the market impacts of a fixed dollar volume traded at variable intensities. With this approach I (a) control for the temporal dependence across individual transactions induced by dynamic order-splitting, (b) finely isolate different market conditions, e.g. volume spikes from low trading activity, (c) distinguish trading activity from trading volume, (d) reduce the effect of the odd-lots bias that exists at the transaction level, and (e) provide a measure of trading activity that helps us study the intraday dynamics of trading activity and prices.
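The aggregation rule above can be sketched in a few lines. The following is a minimal illustration that uses a fixed dollar threshold per sequence (the thesis ties the threshold to the previous month's market capitalization plus a fixed amount); the function name and the `(time, price, shares)` tuple format are hypothetical:

```python
def aggregate_trades(trades, threshold):
    """Group successive trades into sequences whose cumulative dollar
    volume first reaches `threshold`; return one (duration, price_change)
    pair per completed sequence. `trades` is a list of (time, price, shares)
    tuples sorted by time; a trailing incomplete sequence is dropped."""
    sequences = []
    cum, start_time, start_price = 0.0, None, None
    for t, price, shares in trades:
        if start_time is None:
            start_time, start_price = t, price
        cum += price * shares
        if cum >= threshold:
            sequences.append((t - start_time, price - start_price))
            cum, start_time, start_price = 0.0, None, None
    return sequences
```

The durations in the result measure trading activity, and the price changes measure the impact of a fixed dollar volume, mirroring the construction described above.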
I first show that, for most stocks, price impacts of fixed dollar positions fall significantly with trading activity. But price impacts and trading activity, on average, are endogenously determined: trading activity rises when liquidity (depth near good prices) is unusually high, which presents itself as small price impacts. I then show that one can predict this variation using a simple instrument. Moreover, the relationships between price impacts (trading costs) and instrumented trading activity are very similar across differently-sized stocks post 2006, suggesting greater cross-stock homogeneity post Reg NMS. In sharp contrast, greater heterogeneity obtains if one examines the levels of price impacts (trading costs): smaller (less liquid) stocks became less liquid post 2007, but the opposite holds for larger (more liquid) stocks. Using a capital asset pricing model that includes four Fama-French factors and key stock characteristics, I show that this divergence in liquidity translated into greater liquidity premia post financial crisis. My findings indicate that the massive changes in the design of markets did not lead to uniform improvements in stock liquidity and that the asymmetric evolution of liquidity across different stocks affected investment decisions.
I then begin to investigate the intraday dynamics of trading activity and price movements by contrasting two separate cases of changes in trading activity: I capture a relative increase in trading activity by a pair of successive trade sequences whose first sequence has the longer time duration; the opposite pattern reflects a decline in trading activity. I show that, surprisingly, increases in trading activity are associated with return momentum, while declines in trading activity are associated with price reversals. Return momentum is stronger when starting/concluding activity levels are higher and signed trades are less balanced. In sharp contrast, price reversals are stronger when starting/concluding activity levels are lower and signed trades are more balanced. I conclude that these patterns are liquidity driven, e.g. the price reversals of falling activity reflect rewards to liquidity provision after a phase of high activity. I then document further time-of-day patterns: while increases in trading activity are least likely in earlier trading hours, the return momentum of rising activity is strongest at these times; similarly, while activity decreases are least likely near the close, the price reversals of falling activity are strongest in later trading hours. These findings highlight the highly variable nature of trading over the course of the trading day. Earlier hours witness the execution of overnight trading decisions, which raises trading activity and produces persistent price impacts. Later trading hours, however, feature lower competition to provide liquidity since traders target flat closing positions; thus greater rewards to liquidity provision are expected.
I conclude my work by modeling the dynamic structure of trading activity in the form in which I measure it. I employ the ACD models of Engle and Russell (1998), which were designed to model the time durations between individual transactions (inter-transaction durations). In today's markets, however, individual transactions are hard to reconcile with economic behavior. Thus, estimates of ACD models, or of any other dynamic structure that utilizes inter-transaction durations, have limited economic interpretations. An important contribution of my work is to introduce an alternative input to ACD models that fits the features of modern financial markets and can provide a basis for economic interpretation. Moreover, my approach indirectly addresses other computational and statistical challenges one would face in dealing with inter-transaction durations. Performing stock-year specific estimates of ACD models, I identify several interesting routes for future research in the fields of empirical market microstructure and financial econometrics.
Trading activity; Liquidity
Issue date: Tue, 02 Jun 2015
http://hdl.handle.net/2142/88103
A quantitative methodology for evaluating and deploying security monitors
Despite advances in intrusion detection and prevention systems, attacks on networked computer systems continue to succeed. Intrusion tolerance and forensic analysis are required to adequately detect and defend against attacks that succeed, and both techniques depend on monitors to collect information about possible attacks. Since monitoring can be expensive, however, monitors must be selectively deployed to maximize their overall utility. We identify a need for a methodology for evaluating monitor deployment to determine a placement of monitors that meets both security goals and cost constraints.
In this thesis, we introduce a methodology both to quantitatively evaluate monitor deployments in terms of security goals and to deploy monitors optimally based on cost constraints. First, we define a system and data model that describes the system we aim to protect, the monitors that can be deployed, and the relationship between intrusions and data generated by monitors. Second, we define a set of quantitative metrics that both quantify the utility and richness of monitor data with respect to intrusion detection, and quantify the cost associated with monitor deployment. We describe how a practitioner could characterize intrusion detection requirements in terms of target values of our metrics. Finally, we use our data model and metrics to formulate a method to determine the cost-optimal, maximum-utility placement of monitors. We illustrate our approach throughout the thesis with a working example, and demonstrate its practicality and expressiveness with a case study based on an enterprise Web service architecture. The value of our approach comes from its ability to determine optimal monitor placements, which can be counterintuitive or difficult to find, for nearly any set of cost and intrusion detection parameters.
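To make the placement problem concrete, here is a deliberately simplified sketch: exhaustive search for a maximum-utility monitor subset under a cost budget. It assumes utility is additive across monitors, which the thesis's richer data model and metrics do not require; all names are hypothetical:

```python
from itertools import combinations

def best_deployment(monitors, budget):
    """Exhaustively search monitor subsets for the maximum-utility
    deployment whose total cost fits within `budget`. `monitors` maps
    a monitor name to a (utility, cost) pair. Returns the best subset
    and its total utility."""
    names = list(monitors)
    best, best_util = frozenset(), 0.0
    for r in range(len(names) + 1):
        for subset in combinations(names, r):
            cost = sum(monitors[m][1] for m in subset)
            util = sum(monitors[m][0] for m in subset)
            if cost <= budget and util > best_util:
                best, best_util = frozenset(subset), util
    return best, best_util
```

Exhaustive search is exponential in the number of candidate monitors; the point of the thesis's formulation is precisely to find such cost-optimal, maximum-utility placements in cases where they are counterintuitive or hard to enumerate by hand.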
computer security; monitoring; monitor deployment; monitor placement; intrusion tolerance; intrusion detection; metrics; modeling; digital forensics
Issue date: Wed, 22 Jul 2015
http://hdl.handle.net/2142/88065
Unsupervised feature analysis for high dimensional big data
In practice we often encounter scenarios in which label information is unavailable, due either to the high cost of manual labeling or to the unwillingness of users to label. When label information is not available, traditional supervised learning cannot be directly applied, so we need to study unsupervised methods that work well even without supervision.
Feature analysis has been proven effective and important for many applications. Feature analysis is a broad research field, whose research topics include, but are not limited to, feature selection, feature extraction, feature construction, and feature composition (e.g., in topic discovery the learned topics can be viewed as compound features). In many real systems, it is often necessary and important to do feature analysis to determine which individual or compound features should be used for posterior learning tasks. The effectiveness of traditional feature analysis often relies on labels of the training data examples. However, in the era of big data, label information is often unavailable. In the unsupervised scenario, feature analysis becomes more challenging.
Two important research topics in unsupervised feature analysis are unsupervised feature selection and unsupervised feature composition, e.g., discovering topics as compound features. This naturally creates two lines of unsupervised feature analysis. Combined with the single-view or multi-view nature of the data, we obtain a table with four cells. Except for single-view feature composition (or topic discovery), where much work has already been done, e.g., PLSA, LDA, and NMF, the other three cells correspond to new research topics on which little work has been done.
For single view unsupervised feature analysis, we propose two unsupervised feature selection methods. For multi-view unsupervised feature analysis, we focus on text-image web news data and propose a multi-view unsupervised feature selection method and a text-image topic model.
Specifically, for single-view unsupervised feature selection, we propose a new method called Robust Unsupervised Feature Selection (RUFS), in which pseudo cluster labels are learned via local learning regularized robust NMF and feature selection is performed simultaneously by robust joint $l_{2, 1}$-norm minimization. Outliers can be effectively handled, and redundant or noisy features effectively reduced. We also design a (projected) limited-memory BFGS based linear-time iterative algorithm to efficiently solve the optimization problem.
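For reference, the $l_{2,1}$ norm that drives row sparsity in formulations like RUFS is simply the sum of row-wise $l_2$ norms, so rows (features) with near-zero norm can be discarded. A minimal computation (not the RUFS solver itself):

```python
import math

def l21_norm(W):
    """The l2,1 norm of a matrix W (list of rows): the sum over rows of
    each row's Euclidean (l2) norm. Penalizing this quantity pushes whole
    rows toward zero, which is what enables feature selection."""
    return sum(math.sqrt(sum(x * x for x in row)) for row in W)
```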
We also study how the choice of norms for the data fitting and feature selection terms affects the ultimate unsupervised feature selection performance. Specifically, we propose to use joint adaptive loss and $l_2/l_0$ minimization for data fitting and feature selection. We mathematically explain the desirable properties of joint adaptive loss and $l_2/l_0$ minimization over recent unsupervised feature selection models. We solve the optimization problem with an efficient iterative algorithm whose computational complexity and memory cost are linear in both sample size and feature size.
For multi-view unsupervised feature selection, we propose a more effective approach for high-dimensional text-image web news data. We propose to use raw text features in label learning to avoid information loss. We propose a new multi-view unsupervised feature selection method in which image local learning regularized orthogonal nonnegative matrix factorization is used to learn pseudo labels, and robust joint $l_{2,1}$-norm minimization is simultaneously performed to select discriminative features, so that as much cross-view consensus on the pseudo labels as possible is obtained.
For multi-view topic discovery, we study how to systematically mine topics from high-dimensional text-image web news data. The application problem is important because almost all news articles have an associated picture. Unlike traditional topic modeling, which considers text alone, the new task aims to discover heterogeneous topics from web news of multiple data types. We propose to tackle the problem with a regularized nonnegative constrained $l_{2,1}$-norm minimization framework. We also present a new iterative algorithm to solve the optimization problem.
The proposed single-view feature selection methods can be applied on almost all single-view data. The proposed multi-view methods are designed to process text-image web news data, but the idea can be naturally generalized to analyze any multi-view data. Practitioners could run the proposed methods to select features that will be used in posterior learning tasks. One can also run our multi-view topic model to analyze and visualize topics in text-image web news corpora to help interpret the data.
Feature selection; unsupervised feature selection; multi-view learning; multi-view topic discovery; multi-view unsupervised feature selection
Issue date: Fri, 17 Jul 2015
http://hdl.handle.net/2142/88059
DeNovo: rethinking the memory hierarchy for disciplined parallelism
As multicore systems become widespread, both software and hardware face a major challenge in efficiently exploiting and implementing parallelism. While shared memory remains a popular programming model due to its global address space, it is plagued with undisciplined programming practices that allow implicit communication and unstructured non-determinism. Such “wild” shared-memory behavior not only makes it difficult to test and maintain software but also complicates hardware, preventing it from scaling in a power-efficient manner. Recent research has proposed replacing wild shared-memory programming models with a more disciplined approach.
The DeNovo project asks the following question: if software is more disciplined, can we build more power-, performance-, and complexity-efficient shared-memory hardware? Focusing on deterministic programs as a discipline to drive DeNovo, we first show that coherence and communication can be made much simpler and more efficient than the current state of the art. The resulting protocol is without transient states, invalidation traffic, directory sharer lists, or false sharing, all significant sources of inefficiency in existing protocols. Widening the software space further, we then show how DeNovo can support software with disciplined non-determinism without giving up its benefits for deterministic programs. The remaining challenge is supporting synchronization accesses that are inherently “racy” on DeNovo without writer-initiated invalidation. We show that arbitrary synchronization can be supported on DeNovo with a simple yet efficient hardware mechanism, a big step toward our eventual goal of supporting legacy programs. Finally, we explore the potential for a comprehensive coherence solution that merges all previous DeNovo coherence mechanisms and adaptively switches between them depending on the level of “discipline” of the software.
In summary, DeNovo shows the potential for commercially viable software-driven shared-memory systems with higher complexity-, performance-, and energy-efficiency than today’s software-oblivious hardware.
shared memory; cache coherence; memory consistency; disciplined parallelism
Issue date: Thu, 16 Jul 2015
http://hdl.handle.net/2142/88050
Parallel algorithms for two-stage stochastic optimization
We develop scalable algorithms for two-stage stochastic program optimization. We propose performance optimizations such as a cut-window mechanism in Stage 1 and scenario clustering in Stage 2 of Benders method for solving two-stage stochastic programs. A naive implementation of Benders method has a slow convergence rate and does not scale well to a large number of processors, especially when the problem size is large and/or there are integer variables in Stage 1. Parallelization of stochastic integer programs poses unique challenges that make them difficult to parallelize. We develop a Parallel Stochastic Integer Program Solver (PSIPS) that exploits nested parallelism by exploring the branch-and-bound tree vertices in parallel along with scenario parallelization. PSIPS has been shown to have a high parallel efficiency of greater than 40% at 120 cores, significantly greater than the parallel efficiency of state-of-the-art mixed-integer program solvers. A significant portion of the time in this branch-and-bound solver is spent optimizing the stochastic linear program at the root vertex. Stochastic linear programs at other vertices of the branch-and-bound tree require far fewer iterations to converge because they can inherit Benders cuts from their parent vertices and/or the root. Therefore, it is important to reduce the optimization time of the stochastic linear program at the root vertex. We propose two decomposition schemes, namely the Split-and-Merge (SAM) method and the Lagrangian Decomposition and Merge (LDAM) method, that significantly increase the convergence rate of Benders decomposition. The SAM method gives up to 64% reduction in solution time while also giving significantly higher parallel speedups compared to the naive Benders method. The LDAM method, on the other hand, has made it possible to solve otherwise intractable stochastic programs.
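The Benders (L-shaped) iteration referred to above alternates between a master problem and scenario subproblems, with each subproblem evaluation contributing an optimality cut. The toy sketch below, with hypothetical names, solves min_x cost*x + E_s[max(0, d_s - x)] and restricts the master to a finite grid of candidate values so no LP solver is needed; the thesis's solver is, of course, far more general:

```python
def benders_toy(cost, scenarios, grid, n_iter=30):
    """Benders/L-shaped sketch for min_x cost*x + E_s[max(0, d_s - x)].
    `scenarios` is a list of (probability, demand) pairs; the master
    problem is restricted to the candidate values in `grid`. Each
    iteration adds one optimality cut of the form theta >= a + g*x."""
    cuts = []
    x = grid[0]
    for _ in range(n_iter):
        # Subproblem: expected recourse value and a subgradient at x.
        q = sum(p * max(0.0, d - x) for p, d in scenarios)
        g = -sum(p for p, d in scenarios if d > x)
        cuts.append((q - g * x, g))
        # Master: pick the grid point minimizing cost*x plus the cut envelope.
        x = min(grid, key=lambda v: cost * v + max(a + s * v for a, s in cuts))
    return x
```

Performance tricks like cut windows and scenario clustering act on exactly these two ingredients: which cuts the master keeps, and how subproblems are grouped and evaluated.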
We further provide a computational engine for many real-time and dynamic problems faced by US Air Mobility Command. We first propose a stochastic programming solution to the military aircraft allocation problem with consideration for disaster management. Then, we study US AMC's dynamic mission re-planning problem and propose a mathematical formulation that is computationally feasible and leads to significant savings in cost as compared to myopic and deterministic optimization. It is expected that this work will provide the springboard for more robust problem solving with HPC in many logistics and planning problems.
Stochastic Optimization; Integer Program Optimization; Benders Decomposition; Lagrangian Decomposition; Parallel Computing; High Performance Computing; Branch-and-bound; Resource Allocation; Large Scale Optimization; Air Mobility Command; Dynamic Mission Replanning; Aircraft Allocation; Itinerary Selection; Military Airlift Planning
Issue date: Fri, 17 Jul 2015
http://hdl.handle.net/2142/88048
In pursuit of linear complexity in discrete and computational geometry
Many computational problems arise naturally from geometric data. In this thesis, we consider three such problems: (i) distance optimization problems over point sets, (ii) computing contour trees over simplicial meshes, and (iii) bounding the expected complexity of weighted Voronoi diagrams. While these topics are broad, here the focus is on identifying structure which implies linear (or near linear) algorithmic and descriptive complexity.
The first topic we consider is in geometric optimization. More specifically, we define a large class of distance problems for which we provide linear-time exact or approximate solutions. Roughly speaking, the class of problems facilitates either clustering together close points (i.e., netting) or throwing out outliers (i.e., pruning), allowing for successively smaller summaries of the relevant information in the input. A surprising number of classical geometric optimization problems are unified under this framework, including finding the optimal k-center clustering, the kth ranked distance, the kth heaviest edge of the MST, the minimum-radius ball enclosing k points, and many others. In several cases we get the first known linear-time approximation algorithm for a given problem, where our approximation ratio matches that of previous work.
The second topic we investigate is contour trees, a fundamental structure in computational topology. Contour trees give a compact summary of the evolution of level sets on a mesh, and are typically used on massive data sets. Previous algorithms for computing contour trees took Θ(n log n) time and were worst-case optimal. Here we provide an algorithm whose running time lies between Θ(nα(n)) and Θ(n log n), and varies depending on the shape of the tree, where α(n) is the inverse Ackermann function. In particular, this is the first algorithm with O(nα(n)) running time on instances with balanced contour trees. Our algorithmic results are complemented by lower bounds indicating that, up to a factor of α(n), on all instance types our algorithm performs optimally.
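The inverse-Ackermann term in bounds like Θ(nα(n)) typically enters through a union-find (disjoint-set) structure used to merge evolving level-set components. A generic sketch of such a structure with path halving and union by rank, illustrative only and not the thesis's algorithm:

```python
def make_union_find(n):
    """Disjoint-set forest over elements 0..n-1. With path halving and
    union by rank, a sequence of m operations takes O(m * alpha(n)) time,
    the source of the alpha(n) factor in the stated bounds."""
    parent = list(range(n))
    rank = [0] * n

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra == rb:
            return False
        if rank[ra] < rank[rb]:
            ra, rb = rb, ra
        parent[rb] = ra                    # attach smaller rank under larger
        if rank[ra] == rank[rb]:
            rank[ra] += 1
        return True

    return find, union
```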
For the final topic, we consider the descriptive complexity of weighted Voronoi diagrams. Such diagrams have quadratic (or higher) worst-case complexity, however, as was the case for contour trees, here we push beyond worst-case analysis. A new diagram, called the candidate diagram, is introduced, which allows us to bound the complexity of weighted Voronoi diagrams arising from a particular probabilistic input model. Specifically, we assume weights are randomly permuted among fixed Voronoi sites, an assumption which is weaker than the more typical sampled locations assumption. Under this assumption, the expected complexity is shown to be near linear.
Computational Geometry; Discrete Geometry; Computational Topology; Geometric Optimization; Contour Trees; Voronoi Diagrams
Issue date: Wed, 15 Jul 2015
http://hdl.handle.net/2142/88049
Balance Optimization Subset Selection: a framework for causal inference with observational data
Observational data are prevalent in many fields of research, and it is desirable to use these data to explore potential causal relationships. Because such data are non-random, additional assumptions and methods for post-processing the data are needed to construct unbiased estimators of causal effects. This dissertation describes the Balance Optimization Subset Selection (BOSS) framework for applying causal inference to observational data.
BOSS is designed to identify the subset of observational data that is most appropriate for computing causal estimates. To do this, it compares the available treatment units to potential sets of control units on a set of confounding factors, called covariates, with the goal of identifying a control group that minimizes a measure of covariate imbalance. Which imbalance measure to use with BOSS is an important consideration that depends both on the quality of the available observational data and on the assumptions that a researcher is willing to make.
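As a concrete (and deliberately simple) example of an imbalance measure, one can sum the absolute differences of covariate means between the treatment group and a candidate control group; BOSS would then search for the control subset minimizing such a measure. The function name and data layout are hypothetical:

```python
def mean_imbalance(treated, control):
    """Sum over covariates of |treated mean - control mean|. Each group
    is a list of unit rows, each row a list of covariate values. A BOSS-
    style search would minimize this over candidate control groups."""
    k = len(treated[0])

    def col_mean(rows, j):
        return sum(r[j] for r in rows) / len(rows)

    return sum(abs(col_mean(treated, j) - col_mean(control, j))
               for j in range(k))
```

Mean imbalance is only one choice; as the paragraph notes, which measure is appropriate depends on the data quality and the assumptions a researcher is willing to make.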
The standard assumption for observational data, known as strong ignorability, is extended in several ways to be directly applicable to BOSS. Under these additional assumptions, specific levels of covariate balance are both necessary and sufficient for the treatment effect estimate to be unbiased. There is a trade-off in that weaker assumptions require a higher level of covariate balance in order to guarantee estimator unbiasedness. These additional assumptions bridge the gap between existing parametric and non-parametric methods.
Each imbalance measure for BOSS leads to an associated optimization problem. The computational complexity of these problems is discussed, and efficient algorithms are developed to handle several special cases. A constant factor approximation algorithm is also presented for one imbalance measure.
Given the potential applications of BOSS, identifying optimal or near-optimal solutions for these problems is of great practical interest. Heuristics and exact algorithms are considered, and computational tests demonstrate their effectiveness at minimizing imbalance. Additional tests validate BOSS on a well-studied dataset from the literature and highlight the value of alternate optima as a way to corroborate the assumptions that are made.
optimization; causal inference; operations research
Issue date: Thu, 16 Jul 2015
http://hdl.handle.net/2142/88038
Regression test selection: theory and practice
Software affects every aspect of our lives, and software developers write tests to check software correctness. Software also rapidly evolves due to never-ending requirement changes, and software developers practice regression testing – running tests against the latest project revision to check that project changes did not break any functionality. While regression testing is important, it is also time-consuming due to the number of both tests and revisions.
Regression test selection (RTS) speeds up regression testing by selecting to run only tests that are affected by project changes. RTS is efficient if the time to select tests is smaller than the time to run unselected tests; RTS is safe if it guarantees that unselected tests cannot be affected by the changes; and RTS is precise if tests that are not affected are also unselected. Although many RTS techniques have been proposed in research, these techniques have not been adopted in practice because they do not provide efficiency and safety at once.
This dissertation presents three main bodies of research to motivate, introduce, and improve a novel, efficient, and safe RTS technique, called Ekstazi. Ekstazi is the first RTS technique being adopted by popular open-source projects.
First, this dissertation reports on the first field study of test selection. The study of logs, recorded in real time from a diverse group of developers, finds that almost all developers perform manual RTS, i.e., manually select to run a subset of tests at each revision, and that they select these tests in mostly ad hoc ways. Specifically, the study finds that manual RTS is not safe 74% of the time and not precise 73% of the time. These findings showed the urgent need for better automated RTS techniques that could be adopted in practice.
Second, this dissertation introduces Ekstazi, a novel RTS technique that is efficient and safe. Ekstazi tracks dynamic dependencies of tests on files, and unlike most prior RTS techniques, Ekstazi requires no integration with version-control systems. Ekstazi computes for each test what files it depends on; the files can be either executable code or external resources. A test need not be run in the new project revision if none of its dependent files changed. This dissertation also describes an implementation of Ekstazi for the Java programming language and the JUnit testing framework, and presents an extensive evaluation of Ekstazi on 615 revisions of 32 open-source projects (totaling almost 5M lines of code) with shorter- and longer-running test suites. The results show that Ekstazi reduced the testing time by 32% on average (and by 54% for longer-running test suites) compared to executing all tests. Ekstazi also yields lower testing time than the existing RTS techniques, despite the fact that Ekstazi may select more tests. Ekstazi is the first RTS tool adopted by several popular open-source projects, including Apache Camel, Apache Commons Math, and Apache CXF.
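The core of file-level, checksum-based selection can be sketched as follows. This is an illustrative reconstruction, not Ekstazi's actual implementation (which hooks into JVM class loading and the JUnit framework); all names are hypothetical:

```python
import hashlib

def select_tests(deps, old_sums, files):
    """Select tests affected by file changes. `deps` maps a test name to
    the set of files it depended on at the old revision; `old_sums` maps
    a file to its checksum at the old revision; `files` maps a file to
    its current content (bytes). A test is selected iff the checksum of
    at least one of its dependent files changed (new files count as changed)."""
    new_sums = {f: hashlib.sha1(c).hexdigest() for f, c in files.items()}
    changed = {f for f in new_sums if old_sums.get(f) != new_sums[f]}
    return {t for t, fs in deps.items() if fs & changed}
```

Because dependencies are tracked at file granularity, the check is cheap at every revision, which is what makes the end-to-end time (selection plus execution) lower than running all tests.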
Third, this dissertation presents a novel approach that improves the precision of any RTS technique for projects with distributed software histories. The approach considers multiple old revisions, unlike all prior RTS techniques, which reasoned about changes between two revisions (an old revision and a new revision) when selecting tests, effectively assuming a development process where changes occur in a linear sequence (as was common for CVS and SVN). However, most projects nowadays follow a development process that uses distributed version-control systems (such as Git). Software histories are generally modeled as directed graphs; in addition to changes occurring linearly, multiple revisions can be related by other commands such as branch, merge, rebase, cherry-pick, revert, etc. The novel approach reasons about the commands that create each revision and selects tests for a new revision by considering multiple old revisions. This dissertation also proves the safety of the approach and presents an evaluation on several open-source projects. The results show that the approach can reduce the number of selected tests by over an order of magnitude for merge revisions.
Regression test selection; Regression testing; Ekstazi; Distributed software histories
Issue date: Wed, 15 Jul 2015
http://hdl.handle.net/2142/88034
Improving the end-to-end latency of datacenter applications using coordination across application components
To handle millions of user requests every second and process hundreds of terabytes of data each day, many organizations have turned to large datacenter-scale computing systems. The applications running in these datacenters consist of a multitude of dependent logical components or stages which perform specific functionality. These stages are connected to form a directed acyclic graph (DAG), with edges representing input-output dependencies. Each stage can run over tens to thousands of machines, and involves multiple cluster sub-systems such as storage, network and compute. The scale and complexity of these applications can lead to significant delays in their end-to-end latency. However, the organizations running these applications have strict requirements on this latency as it directly affects their revenue and operational costs.
Addressing this problem, the goal of this dissertation is to develop scheduling and resource allocation techniques that optimize the end-to-end latency of datacenter applications. The key idea behind these techniques is to utilize coordination between different application components, allowing us to allocate cluster resources efficiently. In particular, we develop planning algorithms that coordinate the storage and compute sub-systems in datacenters to determine how many resources should be allocated to each stage in an application and where in the cluster they should be allocated, in order to meet application requirements (e.g., completion time goals, minimizing average completion time, etc.). To further speed up applications at runtime, we develop a few latency reduction techniques: reissuing laggards elsewhere in the cluster, returning partial results, and speeding up laggards by giving them extra resources. We perform a global optimization to coordinate across all the stages in an application DAG and determine which of these techniques works best for each stage, while ensuring that the cost incurred by these techniques is within a given end-to-end budget. We use application characteristics to predict and determine how resources should be allocated to different application components to meet the end-to-end latency requirements.
We evaluate our techniques on two different kinds of datacenter applications: (a) web services, and (b) data analytics. With large-scale simulations and an implementation in Apache Yarn (Hadoop 2.0), we use workloads derived from production traces to show that our techniques can achieve more than 50% reduction in the 99th percentile latency of web services and up to 56% reduction in the median latency of data analytics jobs.
Datacenter applications; Resource allocation; Cluster schedulers; Cross-layer optimization; Joint data and compute placement; MapReduce; Interactive services; Tail latency; Resource Malleability;
Issue date: Tue, 14 Jul 2015
http://hdl.handle.net/2142/88026
Learning-based inductive invariant synthesis
The problem of synthesizing adequate inductive invariants to prove a program correct lies at the heart of automated program verification. We investigate, herein, learning approaches to synthesize inductive invariants of sequential programs towards automatically verifying them. To this end, we identify that prior learning approaches were unduly influenced by traditional machine learning models that learned concepts from positive and negative counterexamples. We argue that these models are not robust for invariant synthesis and, consequently, introduce ICE, a robust learning paradigm for synthesizing invariants that learns using positive, negative and implication counterexamples, and show that it admits honest teachers and strongly convergent mechanisms for invariant synthesis. We develop the first learning algorithms in this model with implication counterexamples for two domains, one for learning arbitrary Boolean combinations of numerical invariants over scalar variables and one for quantified invariants of linear data-structures including arrays and dynamic lists. We implement the ICE learners and an appropriate teacher, and show that the resulting invariant synthesis is robust, practical, convergent, and efficient.
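The ICE consistency condition described above can be stated compactly: a hypothesis invariant must hold on the positive examples, fail on the negative examples, and be closed under the implication pairs. A minimal check over concrete states (hypothetical names; real ICE learners synthesize the hypothesis rather than merely check it):

```python
def consistent(inv, pos, neg, imp):
    """Check ICE-consistency of a candidate invariant `inv` (a predicate
    over program states): inv must accept every positive state, reject
    every negative state, and for each implication pair (s, t) satisfy
    inv(s) implies inv(t)."""
    return (all(inv(s) for s in pos)
            and not any(inv(s) for s in neg)
            and all(inv(t) for s, t in imp if inv(s)))
```

Implication counterexamples are what distinguish ICE from classical positive/negative learning: they let an honest teacher report a loop step s -> t without having to decide which of the two states is reachable.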
In order to deductively verify shared-memory concurrent programs, we present a sequentialization result and show that synthesizing rely-guarantee annotations for them can be reduced to invariant synthesis for sequential programs. Further, for verifying asynchronous event-driven systems, we develop a new invariant synthesis technique that constructs almost-synchronous invariants over concrete system configurations. For most systems, these invariants are finitely representable and can thereby be constructed, including for the USB driver that ships with Microsoft Windows Phone.
verification; invariants; invariant synthesis; learning; machine learning; learning invariants using Implication Counter-Examples (ICE) learning model
Issue date: Tue, 14 Jul 2015