SV-COMP 2017 - 6th International Competition on Software Verification

Call for Participation — Procedure

The competition will compare state-of-the-art tools for software verification with respect to effectiveness and efficiency. The competition consists of two phases: a training phase, in which benchmark programs are given to the tool developers, and an evaluation phase, in which all participating verifiers will be executed on benchmark verification tasks, and the number of solved instances as well as the runtime is measured. The competition is performed and presented by the TACAS Competition Chair.

This page is the authoritative reference for the rules of the competition. If in doubt, please also read the FAQ with more descriptions.

Publication and Presentation of the Competition Candidates

The jury selects qualified competition candidates to publish (in the LNCS proceedings of TACAS) a contribution paper that gives an overview of the competition candidate. An overview paper by the competition organizers will describe the competition procedure and present the results. The papers are due on the start date of the evaluation phase.

In addition, every qualified competition candidate is granted a demonstration slot in the TACAS program to present the competition candidate to the TACAS audience.

Definition of Verification Task

A verification task consists of a C program and a specification. A verification run is a non-interactive execution of a competition candidate on a single verification task, in order to check if the following statement is correct: "The program satisfies the specification."

The result of a verification run is a triple (ANSWER, WITNESS, TIME). ANSWER is one of the following outcomes:

TRUE + Witness	The specification is satisfied (i.e., there is no path that violates the specification) and a correctness witness is produced.
FALSE + Witness	The specification is violated (i.e., there exists a path that violates the specification) and an violation witness is produced.
UNKNOWN	The tool cannot decide the problem or terminates by a tool crash, time-out, or out-of-memory (i.e., the competition candidate does not succeed in computing an answer TRUE or FALSE).

TIME is the consumed CPU time until the verifier terminates. It includes the consumed CPU time of all processes that the verifier starts. If TIME is equal to or larger than the time limit, then the verifier is terminated and the ANSWER is set to "timeout" (and interpreted as UNKNOWN).

The C programs are partitioned into categories, which are defined in category-set files. The categories and the contained programs are explained on the benchmark page. The requirements for the C programs are described below in the next section.

Witnesses

There is a fixed exchange format for the witnesses. The witness has to be written to a file, which is given to a witness validator to check validity. The result is counted as correct only if the validator successfully checked it. (There are currently two witness validators under active development, based on CPAchecker and Ultimate Automizer, respectively.) The resource limits for the witness validators are 2 processing units, 7 GB memory, and 10 % of the verification time (i.e., 90 s) for violation witnesses and 100 % (900 s) for correctness witnesses.

The following categories are excluded from witness validation: *-Arrays, *-Floats, *-Heap, *MemSafety*, ConcurrencySafety-*, and Termination-*.

Properties

The specification to be verified for a program "path/filename" is given either in a file with the name "path/filename.prp" or in a file "Category.prp".

The definition 'init(main())' gives the initial states of the program by a call of function 'main' (with no parameters). The definition 'LTL(f)' specifies that formula 'f' holds at every initial state of the program. The LTL (linear-time temporal logic) operator 'G f' means that 'f' globally holds. The proposition 'call(__VERIFIER_error())' is true if the function '__VERIFIER_error()' is called.

Error Function Unreachability:

Formula	Definition
G ! call(__VERIFIER_error())	The function '__VERIFIER_error' is not called in any finite execution of the program.

Memory Safety (only for category 'MemorySafety'):

Formula	Definition
G valid-free	All memory deallocations are valid (counterexample: invalid free). More precisely: There exists no finite execution of the program on which an invalid memory deallocation occurs.
G valid-deref	All pointer dereferences are valid (counterexample: invalid dereference). More precisely: There exists no finite execution of the program on which an invalid pointer dereference occurs.
G valid-memtrack	All allocated memory is tracked, i.e., pointed to or deallocated (counterexample: memory leak). More precisely: There exists no finite execution of the program on which the program lost track of some previously allocated memory.

If a verification run detects that a property from {valid-free, valid-deref, valid-memtrack} is violated, then the violated (partial) property has to be given in the result:

FALSE(p), with p in {valid-free, valid-deref, valid-memtrack}, means that the (partial) property p is violated.

Agreement: All programs in category 'MemorySafety' violate at most one (partial) property p (p in {valid-free, valid-deref, valid-memtrack}).

Overflow:

Formula	Definition
G ! overflow	It can never happen that the resulting type of an operation is a signed integer type but the resulting value is not in the range of values that are representable by that type.

Termination (only for category 'Termination'):

Formula	Definition
F end	Every path finally reaches the end of the program. The proposition "end" is true at the end of every finite program execution (return of main, exit, abort, etc.). A counterexample for this property is an infinite program execution.

Termination is a liveness property, and thus, counterexamples must have maximal execution length, i.e., are terminating or infinite; all terminating paths end with proposition "end".

For the result "FALSE + Error Path", the error path has to be a finite path that ends in a location that is repeated infinitely often in an infinite path. If there is a periodic infinite execution, a second path may be added. This second path has to lead from this infinitely repeated location back to this infinitely repeated location and this path has to correspond to the infinite execution. In SV-COMP 2017 a counterexample to termination does not have to adhere to the above-mentioned format for violation witnesses, a human readable representation of the error path is sufficient.

Benchmark Verification Tasks

The training verification tasks will be provided by a specified date on this web page. A subset of the training verification tasks will be used for the actual competition experiments. The jury agrees on a procedure to define the subset (if not all verification tasks are taken). We avoid using verification tasks without training, because the semantics of the language C is underspecified; the participants, jury, and community ensure that all involved parties agree on the intended meaning of the verification tasks. (This is in contrast to competitions over completely specified problems, such as SAT or SMT.)

The programs are assumed to be written in GNU C (some of them adhere to ANSI C). Each program contains all code that is needed for the verification, i.e., no includes (cpp -P). Some programs are provided in CIL (C Intermediate Language) [1].

Potential competition participants are invited to submit benchmark verification tasks until the specified date. Verification tasks have to fulfill two requirements, to be eligible for the competition: (1) the program has to be written in GNU C or ANSI C, and (2) the program has to come with a specification given by one of the properties stated above. Other specifications are possible, but need to be proposed and discussed.

New proposed categories will be included if at least three different tools or teams participate in the category (i.e., not the same tool twice with a different configuration).

For each category, we specify whether the programs are written for an ILP32 (32-bit) or an LP64 (64-bit) architecture (cf. http://www.unix.org/whitepapers/64bit.html).

In the following, we list a few conventions that are used in some of the verification tasks, in order to express special information that is difficult to capture with the C language.

__VERIFIER_error(): For checking (un)reachability we use the function __VERIFIER_error() (see property 'Error Function Unreachability' above). The verification tool can assume the following implementation:
void __VERIFIER_error() { abort(); }
Hence, a function call __VERIFIER_error() never returns and in the function __VERIFIER_error() the program terminates.

__VERIFIER_assume(expression): A verification tool can assume that a function call __VERIFIER_assume(expression) has the following meaning: If 'expression' is evaluated to '0', then the function loops forever, otherwise the function returns (no side effects). The verification tool can assume the following implementation:
void __VERIFIER_assume(int expression) { if (!expression) { LOOP: goto LOOP; }; return; }

__VERIFIER_nondet_X(): In order to model nondeterministic values, the following functions can be assumed to return an arbitrary value of the indicated type: __VERIFIER_nondet_X() (and nondet_X(), deprecated) with X in {bool, char, int, float, double, loff_t, long, pchar, pointer, pthread_t, sector_t, short, size_t, u32, uchar, uint, ulong, unsigned, ushort} (no side effects, pointer for void *, etc.). The verification tool can assume that the functions are implemented according to the following template:
X __VERIFIER_nondet_X() { X val; return val; }

__VERIFIER_atomic_begin(): For modeling an atomic execution of a sequence of statements in a multi-threaded run-time environment, those statements can be placed between two function calls __VERIFIER_atomic_begin() and __VERIFIER_atomic_end() (deprecated but still valid: those statements can be placed in a function whose name matches __VERIFIER_atomic_). The verifiers are instructed to assume that the execution between those calls is not interrupted. The two calls need to occur within the same control-flow block; nesting or interleaving of those function calls is not allowed.

malloc(), free(): We assume that the functions malloc and alloca always return a valid pointer, i.e., the memory allocation never fails, and function free always deallocates the memory and makes the pointer invalid for further dereferences.

Setup

Technical Details

Each verification run will be started on a machine with a GNU/Linux operating system (x86_64-linux, Ubuntu 16.04); there are three resource limits for each verification run: a memory limit of 15 GB (14.6 GiB) of RAM, a runtime limit of 15 min of CPU time, and a limit to 8 processing units of a CPU. (If a verifier hangs, i.e., does not continue to consume CPU time, then the verifier is killed after 15 min of wall time, and the resulting runtime is set to 15 min.)

Verifiers must not exceed an stdout/stderr limit of at most 2 MB.

Further requirements for the submission is explained on the submission page.

Parameters

The competition environment feeds the candidates with parameters when started for a verification run. There is one global set of parameters that can be used to tune the verifier to the benchmark programs. This set of (command-line) parameters have to be defined in the competition contribution paper. One parameter defines the specification file. There are categories that need to be verified with the information that a 64-bit architecture is used, thus, if the verifier has a parameter to handle this aspect, it needs to be defined. There are also categories for which an additional parameter is allowed to specify the simple memory model.

Qualification

Verifier. A verification tool is qualified to participate as competition candidate if the tool is (a) publicly available for download on the internet (the license must allow using it for the competition), (b) works on the GNU/Linux plattform (more specifically, it must run on an x86_64 machine), (c) is installable with user privileges (no root access required, except for required standard Ubuntu packages) and without hard-coded absolute paths for access to libraries and non-standard external tools, (d) succeeds for more than 50 % of all training programs to parse the input and start the verification process (a tool crash during the verification phase does not disqualify), and (e) produces witness files (for violation and correctness) that adhere to the witness exchange format (syntactically correct).

Person. A person (participant) is qualified as competition contributor for a competition candidate if the person is a contributing designer/developer of the submitted competition candidate (witnessed by occurrence of the person’s name on the tool's project web page, a tool paper, or in the revision logs).

Paper. A paper is qualified if the quality of the description of the competition candidate suffices to run the tool in the competition and meets the scientific standards of TACAS as competition-candidate representation in the TACAS proceedings.

Note: A verification tool can participate several times as different competition candidates, if a significant difference of the conceptual or technological basis of the implementation is justified in the accompanying description paper. This applies to different versions as well as different configurations, in order to avoid forcing developers to create a new tool for every new concept.

Evaluation by Scores and Runtime

The scores will be assigned according to the following table:

Points	Reported result	Description
0	UNKNOWN	Failure to compute verification result, out of resources, program crash.
+1	FALSE correct	The error in the program was found and a violation witness was confirmed.
−16	FALSE incorrect	An error is reported for a program that fulfills the specification (false alarm, incomplete analysis).
+2	TRUE correct	The program was analyzed to be free of errors and a correctness witness was confirmed.
+1	TRUE correct, witness unconfirmed	The program was analyzed to be free of errors but the correctness witness was not confirmed.
−32	TRUE incorrect	The program had an error but the competition candidate did not find it (missed bug, unsound analysis).

The results of type TRUE yield higher absolute score values compared to type FALSE, because it is expected to be 'easier' to detect errors than it is to prove correctness. The absolute score values for incorrect results are higher compared to correct results, because a single correct answer should not be able to compensate for a wrong answer. (An imaginary competition candidate that generates random answers should be ranked with a negative sum of points.)

The participating competition candidates are ranked according to the sum of points. Competition candidates with the same sum of points are sub-ranked according to success-runtime. The success-runtime for a competition candidate is the total CPU time over all benchmarks for which the competition candidate reported a correct verification result.

The participants have the opportunity to check the verification results against their own expected results and discuss inconsistencies with the competition chair.

Please review the following illustration for the scoring schema for the reachability property.

Opting-out from Categories

Every team can submit for every category (including meta categories, i.e., categories that consist of verification tasks from other categories) an opt-out statement. In the results table, a dash is entered for that category; no execution results are reported in that category. If a team participates (i.e., does not opt-out), *all* verification tasks that belong to that category are executed with the verifier. The obtained results are reported in the results table; the scores for meta categories are weighted according to the established procedure. (This means, a tool can participate in a meta category and at the same time opt-out from a sub-category, with having the real results of the tool counted towards the meta category, but not reported for the sub-category.)

Computation of Scores for Meta Categories

A meta category is a category that consists of several sub-categories. The scores in such a meta category is computed from the normalized scores in its sub-categories.

Procedure

The scores for a (meta) category is computed from the scores of all k contained (sub-) categories using a normalization by the number of contained verification tasks: The normalized score sn_i of a verifier in category i is obtained by dividing the score s_i by the number of tasks n_i in category i (sn_i = s_i / n_i), then the sum st = SUM(sn_1, ..., sn_k) over the normalized scores of the categories is multiplied by the average number of tasks per category.

Motivation

The goal is to reduce the influence of a verification task in a large category compared to a verification task in a small category, and thus, balance over the categories. Normalizing by score is not an option because we assigned higher positive scores for TRUE and higher negative scores for wrong results. (Normalizing by score would remove those desired differences.)

Example

Category 1 has 10 verification tasks with result TRUE.
Category 2 has 10 verification tasks with result FALSE.

Verifier A correctly solved 5 verification tasks in category 1 -> 10 points, and 5 verification tasks in category 2 -> 5 points.
Overall score of Verifier A: (( (10/10 = 1) + (5/10 = 0.5) ) = 1.5) * (10 + 10 = 20) / 2 = 15

Verifier B correctly solved 10 verification tasks in category 1 -> 20 points, and 0 verification tasks in category 2 -> 0 points.
Overall score of Verifier B: (( (20/10 = 2) + (0/10 = 0.0) ) = 2.0) * (10 + 10 = 20) / 2 = 20

Verifier C correctly solved 0 verification tasks in category 1 -> 0 points, and 10 verification tasks in category 2 -> 10 points.
Overall score of Verifier C: (( (0/10 = 0) + (10/10 = 1.0) ) = 1.0) * (10 + 10 = 20) / 2 = 10

Verifier D correctly solved 8 verification tasks in category 1 -> 16 points, and 8 verification tasks in category 2 -> 8 points.
Overall score of Verifier D: (( (16/10 = 1.6) + (8/10 = 0.8) ) = 2.4) * (10 + 10 = 20) / 2 = 24

Obtained ranking:
1. Verifier D
2. Verifier B
3. Verifier A
4. Verifier C

Verifier D was strong in all categories, and thus, won category Overall.
Verifier B was even stronger than Verifier D in one category, but performed not so well in the other category, and thus, only got second overall.

	TACAS 2017
Competition on Software Verification (SV-COMP)