Call for Participation — Procedure

The competition will compare state-of-the-art tools for software verification with respect to effectiveness and efficiency. The competition consists of two phases: a training phase, in which benchmark programs are given to the tool developers, and an evaluation phase, in which all participating tools are executed on all training programs and both the number of solved instances and the runtime are measured. The competition is performed and presented by the TACAS Competition Chair (for TACAS 2012, this is Dirk Beyer).

Publication and Presentation of the Competition Candidates

Every qualified competition candidate can publish (in the LNCS proceedings of TACAS 2012) a contribution paper that gives an overview of the competition candidate. The contribution papers are limited to three pages. An overview paper by the competition organizers will describe the competition procedure and present the results. The papers are due on the start date of the evaluation phase.

In addition, every qualified competition candidate is granted a demonstration slot in the TACAS program to present the competition candidate to the TACAS audience.

Definition of Verification Task

A verification task consists of a C program and a safety property. For simplicity, the safety properties to be verified are reduced to reachability problems and encoded in the program source code (using the error label 'ERROR'). In other words, given a C program and the error label 'ERROR', the competition candidate is asked whether there is a finite path through the program that reaches the error label. A verification run is a non-interactive execution of a competition candidate on a single verification task.
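
For illustration, here is a minimal sketch of such an encoding (a hypothetical example, not one of the official benchmark verification tasks); the error label is reachable exactly when the intended property is violated, so the expected result for this program is SAFE:
int main(void) {
  int x = 0;
  int y = x + 1;
  if (y != 1) {       /* the property "y equals 1" is violated only if this branch is taken */
ERROR: return 1;      /* unreachable here, hence the expected verification result is SAFE */
  }
  return 0;
}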

The result of a verification run is one of the following:
SAFE: the property holds, i.e., there is no finite path through the program that reaches the error label.
UNSAFE: the property does not hold, i.e., there is a finite path through the program that reaches the error label; in this case, an error path has to be reported.
UNKNOWN: the tool cannot compute a verification result, e.g., because it gives up, runs out of resources, or crashes.

There is no particular fixed format for the error path. The path has to be written to a file or to the screen in a reasonable format so that its validity can be checked.

Benchmark Verification Tasks

The training verification tasks will be provided on this web page by the specified date. All training tasks will be included in the competition experiments.

Most programs are provided in CIL (C Intermediate Language) [1]. The programs are assumed to be written in GNU C (most of them adhere to ANSI C).

Potential competition participants are invited to submit benchmark verification tasks until the specified date. To be eligible for the competition, a program has to fulfill two requirements: (1) the program is written in GNU C or ANSI C and can be successfully processed by CIL with the parameters '--dosimplify --printCilAsIs --save-temps --domakeCFG', and (2) the property is instrumented into the program and is violated if the label 'ERROR' is reached.

As a further simplification, a verification tool can assume that a function call __VERIFIER_assume(expression) has the following meaning: if 'expression' evaluates to '0', then the function loops forever; otherwise, the function returns (without side effects). The verification tool can assume the following implementation:
void __VERIFIER_assume(int expression) { if (!expression) { LOOP: goto LOOP; }; return; }

In order to model nondeterministic values, the following functions can be assumed to return an arbitrary value of the indicated type: __VERIFIER_nondet_X() (and nondet_X(), deprecated) with X in {int, float, char, short, pointer} (no side effects, pointer for void *). The verification tool can assume that the functions are implemented according to the following template:
X __VERIFIER_nondet_X() { X val; return val; }
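
Combining these conventions, the following hypothetical verification task (a sketch for illustration, not one of the official benchmarks) uses a nondeterministic input and an assumption; here the error label is reachable, so the expected result is UNSAFE, together with an error path, e.g., for the input value 1023:
extern int __VERIFIER_nondet_int(void);
extern void __VERIFIER_assume(int expression);

int main(void) {
  int x = __VERIFIER_nondet_int();  /* x may take an arbitrary int value */
  __VERIFIER_assume(x >= 0);        /* restrict attention to runs with x >= 0 */
  if (x == 1023) {                  /* satisfiable under the assumption, e.g., for x == 1023 */
ERROR: return 1;                    /* reachable, hence the expected verification result is UNSAFE */
  }
  return 0;
}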

Setup

The competition experiments will be run on a dedicated unloaded compute server with a 3.4 GHz 64-bit Quad Core CPU (Intel i7-2600K) and a GNU/Linux operating system (x86_64-linux). The machine has 16 GB of RAM, of which exactly 15 GB will be made available to the competition candidate. Every experiment has a runtime limit of 15 min. The runtime is measured in seconds of CPU time.

Every verification run will be started by a batch script that collects statistics and reports, for each competition candidate and verification task, one of the following outcomes: SAFE (the property holds), UNSAFE (the property does not hold, a bug is reported), or UNKNOWN (the verification result is not known, e.g., because resources were exhausted or the tool crashed).

Qualification

A verification tool is qualified to participate as a competition candidate if the tool is publicly available (for the GNU/Linux platform; more specifically, it must run on an x86_64 machine) and, for more than 50 % of all training programs, succeeds in parsing the input and starting the verification process (a tool crash during the verification phase does not disqualify). A person (participant) is qualified as competition contributor for a competition candidate if the person is a contributing designer/developer of the submitted competition candidate (witnessed by the occurrence of the person's name on the tool's project web page, in a tool paper, or in the revision logs). A paper is qualified if the quality of the description of the competition candidate suffices to run the tool in the competition and is appropriate as a representation of the competition candidate in the TACAS proceedings.

A verification tool can participate several times as different competition candidates if a significant difference in the conceptual or technological basis of the implementations is justified in the accompanying description papers. This applies to different versions as well as different configurations, so that developers are not forced to create a new tool for every new concept.

Evaluation by Scores and Runtime

The scores will be assigned according to the following table:

Points   Reported result      Description
   0     UNKNOWN              Failure to compute a verification result, out of resources, or program crash.
  +1     UNSAFE (correct)     The error in the program was found and an error path was reported.
  −2     UNSAFE (incorrect)   An error is reported for a program that fulfills the property (false alarm, incomplete analysis).
  +2     SAFE (correct)       The program was analyzed to be free of errors.
  −4     SAFE (incorrect)     The program had an error, but the competition candidate did not find it (missed bug, unsound analysis).

Results of type SAFE yield higher absolute score values than results of type UNSAFE, because it is expected to be 'easier' to detect errors than to prove correctness. The absolute score values for incorrect results are higher than those for correct results, because a single correct answer should not be able to compensate for a wrong answer. (An imaginary competition candidate that generates random answers should be ranked with a negative sum of points.)
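
For example, a hypothetical competition candidate that correctly reports SAFE for 10 programs, correctly reports UNSAFE for 5 programs, reports UNKNOWN for 3 programs, and produces 1 false alarm would obtain 10 × 2 + 5 × 1 + 3 × 0 − 2 = 23 points.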

The participating competition candidates are ranked according to the sum of points. Competition candidates with the same sum of points are sub-ranked according to success-runtime. The success-runtime for a competition candidate is the total CPU time over all benchmarks for which the competition candidate reported a correct verification result.
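
For example, if two hypothetical candidates both end up with 23 points, and the correct results of the first took 900 s of CPU time in total while those of the second took 1500 s, then the first candidate would be sub-ranked above the second (assuming that a lower success-runtime ranks better).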

The participants have the opportunity to check the verification results against their own expected results and discuss inconsistencies with the competition chair.