Call for Participation — Procedure

The competition will compare state-of-the-art tools for software verification with respect to effectiveness and efficiency. The competition consists of two phases: a training phase, in which benchmark programs are given to the tool developers, and an evaluation phase, in which all participating verifiers will be executed on benchmark verification tasks, and the number of solved instances as well as the runtime is measured. The competition is performed and presented by the TACAS Competition Chair (for TACAS'14 this is Dirk Beyer).

Publication and Presentation of the Competition Candidates

The jury selects qualified competition candidates to publish (in the LNCS proceedings of TACAS) a contribution paper that gives an overview of the competition candidate. The contribution papers have a page limit of three pages. An overview paper by the competition organizers will describe the competition procedure and present the results. The papers are due on the start date of the evaluation phase.

In addition, every qualified competition candidate is granted a demonstration slot in the TACAS program to present the competition candidate to the TACAS audience.

Definition of Verification Task

A verification task consists of a C program and a specification. A verification run is a non-interactive execution of a competition candidate on a single verification task, in order to check if the following statement is correct: "The program satisfies the specification."

The result of a verification run is a triple (ANSWER, WITNESS, TIME). ANSWER is one of the following outcomes:

TRUE The specification is satisfied (i.e., there is no finite path that violates the specification).
FALSE + Error Path The specification is violated (i.e., there exists a finite path that violates the specification) and a counterexample path is produced.
UNKNOWN The tool cannot decide the problem or terminates by a tool crash, time-out, or out-of-memory (i.e., the competition candidate does not succeed in computing an answer TRUE or FALSE).

If the answer is FALSE, then a counterexample path must be produced and provided as WITNESS. There is no particular fixed format for the error path. The path has to be written to a file or on screen in a reasonable format to make it possible to check validity.

TIME is the consumed CPU time until the verifier terminates. It includes the consumed CPU time of all processes that the verifier starts. If TIME is equal to or larger than the time limit, then the verifier is terminated and the ANSWER is set to "timeout" (and interpreted as UNKNOWN).

The C programs are partitioned into categories, which are defined in category-set files. The categories and the contained programs are explained on the benchmark page. The requirements for the C programs are described below in the next section.


The specification to be verified for a program "path/filename" is given either in a file with the name "path/filename.prp" or in a file "ALL.prp" in directory "path/".

The definition 'init(main())' gives the initial states of the program by a call of function 'main' (with no parameters). The definition 'LTL(f)' specifies that formula 'f' holds at every initial state of the program. The LTL (linear-time temporal logic) operator 'G f' means that 'f' globally holds. The proposition 'label(ERROR)' is true if the C label 'ERROR' is reached.

IMPORTANT: Please note that participating verifiers do not need to support complicated parsing. The properties in the competition are restricted to the one form that is used below. (Simple pattern matching suffices.)

Label Unreachability:


G ! label(ERROR)
The C label 'ERROR' is not reachable on any finite execution of the program.

Memory Safety (only for category 'MemorySafety'):


G valid-freeAll memory deallocations are valid (counterexample: invalid free). More precisely: There exists no finite execution of the program on which an invalid memory deallocation occurs.
G valid-derefAll pointer dereferences are valid (counterexample: invalid dereference). More precisely: There exists no finite execution of the program on which an invalid pointer dereference occurs.
G valid-memtrackAll allocated memory is tracked, i.e., pointed to or deallocated (counterexample: memory leak). More precisely: There exists no finite execution of the program on which the program lost track of some previously allocated memory.

If a verification run detects that a property from {valid-free, valid-deref, valid-memtrack} is violated, then the violated (partial) property has to be given in the result:

FALSE(p), with p in {valid-free, valid-deref, valid-memtrack}, means that the (partial) property p is violated.

Agreement: All programs in category 'MemorySafety' violate at most one (partial) property p (p in {valid-free, valid-deref, valid-memtrack}). Function `malloc' is assumed to always return a valid pointer, i.e., the memory allocation never fails, and function `free' always deallocates the memory and makes the pointer invalid for further dereferences.

Benchmark Verification Tasks

The training verification tasks will be provided by a specified date on this web page. A subset of the training verification tasks will be used for the actual competition experiments. The jury agrees on a procedure to define the subset (if not all verification tasks are taken). We avoid using verification tasks without training, because the semantics of the language C is underspecified; the participants, jury, and community ensure that all involved parties agree on the intended meaning of the verification tasks. (This is in contrast to competitions over completely specified problems, such as SAT or SMT.)

The programs are assumed to be written in GNU C (some of them adhere to ANSI C). Some programs are provided in CIL (C Intermediate Language) [1].

Potential competition participants are invited to submit benchmark verification tasks until the specified date. Verification tasks have to fulfill two requirements, to be eligible for the competition: (1) the program has to be written in GNU C or ANSI C, and (2) the specification is instrumented into the program and is violated if the label 'ERROR' is reached, or an (implicit) memory-safety property does not hold. Other specifications are possible, but need to be proposed and discussed.

For each category, we specify whether the programs are written for a 32-bit or a 64-bit architecture, and if the verification tasks assume the simple or the precise memory model.
Simple Memory Model: All variables can only be modified using direct assignments, or via pointer access if the address was obtained using the operator '&'.
Precise Memory Model: All memory cells can be written to, even dereferencing (uninitialized) pointers.

In the following, we list a few conventions that are used in some of the verification tasks, in order to express special information that is difficult to capture with the C language.

__VERIFIER_assume(expression): A verification tool can assume that a function call __VERIFIER_assume(expression) has the following meaning: If 'expression' is evaluated to '0', then the function loops forever, otherwise the function returns (no side effects). The verification tool can assume the following implementation:
void __VERIFIER_assume(int expression) { if (!expression) { LOOP: goto LOOP; }; return; }

__VERIFIER_nondet_X(): In order to model nondeterministic values, the following functions can be assumed to return an arbitrary value of the indicated type: __VERIFIER_nondet_X() (and nondet_X(), deprecated) with X in {bool, char, int, float, loff_t, long, pchar, pointer, pthread_t, sector_t, short, size_t, u32, uchar, uint, ulong, unsigned, ushort} (no side effects, pointer for void *, etc.). The verification tool can assume that the functions are implemented according to the following template:
X __VERIFIER_nondet_X() { X val; return val; }

__VERIFIER_atomic_begin(): For modeling an atomic execution of a sequence of statements in a multi-threaded run-time environment, those statements can be placed between two function calls __VERIFIER_atomic_begin() and __VERIFIER_atomic_end() (deprecated but still valid: those statements can be placed in a function whose name matches __VERIFIER_atomic_). The verifiers are instructed to assume that the execution between those calls is not interrupted. The two calls need to occur within the same control-flow block; nesting or interleaving of those function calls is not allowed.

malloc(), free(): Function malloc is assumed to always return a valid pointer, i.e., the memory allocation never fails, and function free always deallocates the memory and makes the pointer invalid for further dereferences.


Technical Details

Each verification run will be started on a dedicated unloaded compute server with a 3.4 GHz 64-bit Quad Core CPU (Intel i7-2600) and a GNU/Linux operating system (x86_64-linux, Ubuntu 12.04). There are two ressource limits for each verification run: a memory limit of 15 GB (14.6 GiB) of RAM and a runtime limit of 15 min of CPU time. (If a verifier hangs, i.e., does not continue to consume CPU time, then the verifier is killed after 15 min of wall time, and the resulting runtime is set to 15 min.)


Every verification run will be started by a batch script that collects statistics and forwards for each competition candidate and verification task one of the outcomes: TRUE (specification holds), FALSE (specification does not hold, bug reported), UNKNOWN (verification result not known, resources exhausted, crash).


The competition environment feeds the candidates with parameters when started for a verification run. There is one global set of parameters that can be used to tune the verifier to the benchmark programs. This set of (command-line) parameters have to be defined in the competition contribution paper. One parameter defines the specification file. There are categories that need to be verified with the information that a 64-bit architecture is used, thus, if the verifier has a parameter to handle this aspect, it needs to be defined. There are also categories for which an additional parameter is allowed to specify the simple memory model.


A verification tool is qualified to participate as competition candidate if the tool is publicly available for download on the internet (the license must allow using it for the competition), works on the GNU/Linux plattform (more specifically, it must run on an x86_64 machine), is installable with user privileges (no root access required, except for required standard packages) and without hard-coded absolute paths for access to libraries and non-standard external tools, and succeeds for more than 50 % of all training programs to parse the input and start the verification process (a tool crash during the verification phase does not disqualify).

A person (participant) is qualified as competition contributor for a competition candidate if the person is a contributing designer/developer of the submitted competition candidate (witnessed by occurrence of the person’s name on the tool's project web page, a tool paper, or in the revision logs).

A paper is qualified if the quality of the description of the competition candidate suffices to run the tool in the competition and meets the scientific standards of TACAS as competition-candidate representation in the TACAS proceedings.

Note: A verification tool can participate several times as different competition candidates, if a significant difference of the conceptual or technological basis of the implementation is justified in the accompanying description paper. This applies to different versions as well as different configurations, in order to avoid forcing developers to create a new tool for every new concept.

Evaluation by Scores and Runtime

The scores will be assigned according to the following table:

Points Reported result Description
0 UNKNOWN Failure to compute verification result, out of resources, program crash.
The error in the program was found and an error path was reported.
An error is reported for a program that fulfills the specification (false alarm, incomplete analysis).
The program was analyzed to be free of errors.
The program had an error but the competition candidate did not find it (missed bug, unsound analysis).

The results of type TRUE yield higher absolute score values compared to type FALSE, because it is expected to be 'easier' to detect errors than it is to prove correctness. The absolute score values for incorrect results are higher compared to correct results, because a single correct answer should not be able to compensate for a wrong answer. (An imaginary competition candidate that generates random answers should be ranked with a negative sum of points.)

The participating competition candidates are ranked according to the sum of points. Competition candidates with the same sum of points are sub-ranked according to success-runtime. The success-runtime for a competition candidate is the total CPU time over all benchmarks for which the competition candidate reported a correct verification result.

The participants have the opportunity to check the verification results against their own expected results and discuss inconsistencies with the competition chair.

Opting-out from Categories

Every team can submit for every category (including meta categories, i.e., categories that consist of verification tasks from other categories) an opt-out statement. In the results table, a dash is entered for that category; no execution results are reported in that category. If a team participates (i.e., does not opt-out), *all* verification tasks that belong to that category are executed with the verifier. The obtained results are reported in the results table; the scores for meta categories are weighted according to the established procedure. (This means, a tool can participate in a meta category and at the same time opt-out from a sub-category, with having the real results of the tool counted towards the meta category, but not reported for the sub-category.)

Computation of Scores for Meta Categories

A meta category is a category that consists of several sub-categories. The scores in such a meta category is computed from the normalized scores in its sub-categories.


The scores for a (meta) category is computed from the scores of all k contained (sub-) categories using a normalization by the number of contained verification tasks: The normalized score sn_i of a verifier in category i is obtained by dividing the score s_i by the number of tasks n_i in category i (sn_i = s_i / n_i), then the sum st = SUM(sn_1, ..., sn_1, ..., sn_k) over the normalized scores of the categories is multiplied by the average number of tasks per category.


The goal is to reduce the influence of a verification task in a large category compared to a verification task in a small category, and thus, balance over the categories. Normalizing by score is not an option because we assigned higher positive scores for TRUE and higher negative scores for wrong results. (Normalizing by score would remove those desired differences.)


Category 1 has 10 verification tasks with result TRUE.
Category 2 has 10 verification tasks with result FALSE.

Verifier A correctly solved 5 verification tasks in category 1 -> 10 points, and 5 verification tasks in category 2 -> 5 points.
Overall score of Verifier A: (( (10/10 = 1) + (5/10 = 0.5) ) = 1.5) * (10 + 10 = 20) / 2 = 15

Verifier B correctly solved 10 verification tasks in category 1 -> 20 points, and 0 verification tasks in category 2 -> 0 points.
Overall score of Verifier B: (( (20/10 = 2) + (0/10 = 0.0) ) = 2.0) * (10 + 10 = 20) / 2 = 20

Verifier C correctly solved 0 verification tasks in category 1 -> 0 points, and 10 verification tasks in category 2 -> 10 points.
Overall score of Verifier C: (( (0/10 = 0) + (10/10 = 1.0) ) = 1.0) * (10 + 10 = 20) / 2 = 10

Verifier D correctly solved 8 verification tasks in category 1 -> 16 points, and 8 verification tasks in category 2 -> 8 points.
Overall score of Verifier D: (( (16/10 = 1.6) + (8/10 = 0.8) ) = 2.4) * (10 + 10 = 20) / 2 = 24

Obtained ranking:
1. Verifier D
2. Verifier B
3. Verifier A
4. Verifier C

Verifier D was strong in all categories, and thus, won the competition.
Verifier B was even stronger than Verifier D in one category, but performed not so well in the other category, and thus, only got second overall.