Call for Participation — Procedure

The competition will compare state-of-the-art tools for software verification with respect to effectiveness and efficiency. The competition consists of two phases: a training phase, in which benchmark programs are given to the tool developers, and an evaluation phase, in which all participating verifiers are executed on benchmark verification tasks, and both the number of solved instances and the runtime are measured. The competition is performed and presented by the TACAS Competition Chair.

This page is the authoritative reference for the rules of the competition. For the most recent changes, please refer to the changelog. If in doubt, please also consult the FAQ, which provides further explanations.

Publication and Presentation of the Competition Candidates

The jury selects qualified competition candidates to publish (in the LNCS proceedings of TACAS) a contribution paper that gives an overview of the competition candidate. An overview paper by the competition organizers will describe the competition procedure and present the results. The paper-submission deadline is specified on the dates page; submission requirements are explained on the submission page.

In addition, every qualified competition candidate is granted a demonstration slot in the TACAS program to present the competition candidate to the TACAS audience.

Definition of Verification Task

A verification task consists of a C program and a specification. A verification run is a non-interactive execution of a competition candidate on a single verification task, in order to check if the following statement is correct: "The program satisfies the specification."

The result of a verification run is a triple (ANSWER, WITNESS, TIME). ANSWER is one of the following outcomes:

TRUE + Witness: The specification is satisfied (i.e., there is no path that violates the specification) and a correctness witness is produced.
FALSE + Witness: The specification is violated (i.e., there exists a path that violates the specification) and a violation witness is produced.
UNKNOWN: The tool cannot decide the problem or terminates due to a tool crash, time-out, or out-of-memory (i.e., the competition candidate does not succeed in computing an answer TRUE or FALSE).

TIME is the consumed CPU time until the verifier terminates. It includes the consumed CPU time of all processes that the verifier starts. If TIME is equal to or larger than the time limit, then the verifier is terminated and the ANSWER is set to "timeout" (and interpreted as UNKNOWN).

The C programs are partitioned into categories, which are defined in category-set files. The categories and the contained programs are explained on the benchmark page. The requirements for the C programs are described below in the section on benchmark verification tasks.

Witnesses

There is a fixed exchange format for the witnesses. The witness has to be written to a file, which is given to a witness validator to check validity. The result is counted as correct only if the validator successfully checked it. (There are currently four witness validators under active development, based on CPAchecker, CProver, and Ultimate Automizer.) The resource limits for the witness validators are 2 processing units, 7 GB memory, and 10 % of the verification time (i.e., 90 s) for violation witnesses and 100 % (900 s) for correctness witnesses.

No category is excluded from validation of violation witnesses. The following categories are excluded from validation of correctness witnesses: *-Arrays, *-Floats, *-Heap, *MemSafety*, ConcurrencySafety-*, and Termination-*.

Properties

The specification to be verified for a program "path/filename" is given either in a file with the name "path/filename.prp" or in a file "Category.prp".

The definition 'init(main())' gives the initial states of the program by a call of function 'main' (with no parameters). The definition 'LTL(f)' specifies that formula 'f' holds at every initial state of the program. The LTL (linear-time temporal logic) operator 'G f' means that 'f' globally holds. The proposition 'label(L)' is true if the C label 'L' is reached. The proposition 'call(func())' is true if the function 'func()' is called.

Unreachability of Error Function:

CHECK( init(main()), LTL(G ! call(__VERIFIER_error())) )    

Formula: G ! call(func())
Definition: The function 'func' is not called in any finite execution of the program.
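
As a hypothetical illustration (not taken from the benchmark set), a minimal verification task for this property could look as follows; the functions __VERIFIER_error() and __VERIFIER_nondet_int() follow the conventions described in the section on benchmark verification tasks below. Since the call to __VERIFIER_error() is reachable (for input 42), the expected answer is FALSE.

extern void __VERIFIER_error(void);
extern int __VERIFIER_nondet_int(void);

int main(void) {
  int x = __VERIFIER_nondet_int();
  if (x == 42) {
    __VERIFIER_error();   /* reachable, so G ! call(__VERIFIER_error()) is violated */
  }
  return 0;
}
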
Memory Safety (only for category 'MemorySafety'):

CHECK( init(main()), LTL(G valid-free) )
CHECK( init(main()), LTL(G valid-deref) )
CHECK( init(main()), LTL(G valid-memtrack) )    

Formula: G valid-free
Definition: All memory deallocations are valid (counterexample: invalid free). More precisely: There exists no finite execution of the program on which an invalid memory deallocation occurs.

Formula: G valid-deref
Definition: All pointer dereferences are valid (counterexample: invalid dereference). More precisely: There exists no finite execution of the program on which an invalid pointer dereference occurs.

Formula: G valid-memtrack
Definition: All allocated memory is tracked, i.e., pointed to or deallocated (counterexample: memory leak). More precisely: There exists no finite execution of the program on which the program lost track of some previously allocated memory. (Comparison to Valgrind: This property is violated if Valgrind reports 'definitely lost'.)

If a verification run detects that a property from {valid-free, valid-deref, valid-memtrack} is violated, then the violated (partial) property has to be given in the result:

FALSE(p), with p in {valid-free, valid-deref, valid-memtrack}, means that the (partial) property p is violated.

Agreement: All programs in category 'MemorySafety' either satisfy all (partial) properties or violate exactly one (partial) property p (p in {valid-free, valid-deref, valid-memtrack}) that is reachable from the program entry. We ignore further property violations on a path after finding the first violation (because the behavior is undefined after the first violation).
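
As a hypothetical illustration (not taken from the benchmark set), the following program contains an invalid deallocation (a double free), so the expected result would be FALSE(valid-free):

#include <stdlib.h>

int main(void) {
  int *p = malloc(sizeof(int));   /* by convention, the allocation never fails */
  free(p);
  free(p);                        /* invalid deallocation: p was already freed */
  return 0;
}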

Memory Cleanup (only for category 'MemorySafety'):

CHECK( init(main()), LTL(G valid-memcleanup) )    

Formula: G valid-memcleanup
Definition: All allocated memory is deallocated before the program terminates. In addition to valid-memtrack: There exists no finite execution of the program on which the program terminates but still points to allocated memory. (Comparison to Valgrind: This property can be violated even if Valgrind reports 'still reachable'.)
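
As a hypothetical illustration (not taken from the benchmark set), the following program satisfies valid-memtrack (the allocated block is still pointed to by the global variable p when the program terminates) but violates valid-memcleanup (the block is never deallocated):

#include <stdlib.h>

int *p;

int main(void) {
  p = malloc(sizeof(int));
  return 0;   /* terminates while still pointing to allocated, not deallocated memory */
}
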
Overflow:

CHECK( init(main()), LTL(G ! overflow) )    

Formula: G ! overflow
Definition: It can never happen that the resulting type of an operation is a signed integer type but the resulting value is not in the range of values that are representable by that type.
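
As a hypothetical illustration (not taken from the benchmark set), the following program violates this property, because for input INT_MAX the addition overflows the signed type int:

extern int __VERIFIER_nondet_int(void);

int main(void) {
  int x = __VERIFIER_nondet_int();
  int y = x + 1;   /* signed overflow if x == INT_MAX */
  return 0;
}
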
Termination (only for category 'Termination'):

CHECK( init(main()), LTL(F end) )    

Formula: F end
Definition: Every path finally reaches the end of the program. The proposition "end" is true at the end of every finite program execution (exit, abort, return from the initial call of main, etc.). A counterexample for this property is an infinite program execution.

Termination is a liveness property, and thus counterexamples must have maximal execution length, i.e., they are either terminating or infinite; all terminating paths end with the proposition "end".

For the result "FALSE + Error Path", the error path has to be a finite path that ends in a location that is repeated infinitely often in an infinite path. If there is a periodic infinite execution, a second path may be added. This second path has to lead from this infinitely repeated location back to this infinitely repeated location and this path has to correspond to the infinite execution.

Benchmark Verification Tasks

The training verification tasks will be provided by a specified date on this web page. A subset of the training verification tasks will be used for the actual competition experiments. The jury agrees on a procedure to define the subset (if not all verification tasks are taken). We avoid using verification tasks without training, because the semantics of the language C is underspecified; the participants, jury, and community ensure that all involved parties agree on the intended meaning of the verification tasks. (This is in contrast to competitions over completely specified problems, such as SAT or SMT.)

The programs are assumed to be written in GNU C (some of them adhere to ANSI C). Each program contains all code that is needed for the verification, i.e., the programs are already preprocessed and contain no #include directives (cpp -P). Some programs are provided in CIL (C Intermediate Language) [1].

Potential competition participants are invited to submit benchmark verification tasks until the specified date. Verification tasks have to fulfill two requirements to be eligible for the competition: (1) the program has to be written in GNU C or ANSI C, and (2) the program has to come with a specification given by one of the properties stated above. Other specifications are possible, but need to be proposed and discussed.

New proposed categories will be included if at least three different tools or teams participate in the category (i.e., not the same tool twice with a different configuration).

For each category, we specify whether the programs are written for an ILP32 (32-bit) or an LP64 (64-bit) architecture (cf. http://www.unix.org/whitepapers/64bit.html).
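
The architecture matters because it fixes, for example, the width of long and of pointers (ILP32: int, long, and pointers are 32 bits; LP64: long and pointers are 64 bits). As a hypothetical illustration (not taken from the benchmark set), the verdict for the following program depends on the declared architecture: under ILP32 the call to __VERIFIER_error() is reachable, under LP64 it is not.

extern void __VERIFIER_error(void);

int main(void) {
  if (sizeof(long) == 4) {   /* true under ILP32, false under LP64 */
    __VERIFIER_error();
  }
  return 0;
}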

In the following, we list a few conventions that are used in some of the verification tasks, in order to express special information that is difficult to capture with the C language.

__VERIFIER_error(): For checking (un)reachability we use the function __VERIFIER_error(). The verification tool can assume the following implementation:
void __VERIFIER_error() { abort(); }
Hence, a call to __VERIFIER_error() never returns; the program terminates inside the function __VERIFIER_error().

__VERIFIER_assume(expression): A verification tool can assume that a function call __VERIFIER_assume(expression) has the following meaning: if 'expression' evaluates to '0', then the function loops forever; otherwise, the function returns (no side effects). The verification tool can assume the following implementation:
void __VERIFIER_assume(int expression) { if (!expression) { LOOP: goto LOOP; }; return; }

__VERIFIER_nondet_X(): In order to model nondeterministic values, the following functions can be assumed to return an arbitrary value of the indicated type: __VERIFIER_nondet_X() (and nondet_X(), deprecated) with X in {bool, char, int, float, double, loff_t, long, pchar, pthread_t, sector_t, short, size_t, u32, uchar, uint, ulong, unsigned, ushort} (no side effects, pointer for void *, etc.). The verification tool can assume that the functions are implemented according to the following template:
X __VERIFIER_nondet_X() { X val; return val; }
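
As a hypothetical illustration (not taken from the benchmark set), __VERIFIER_nondet_X() and __VERIFIER_assume() are typically combined as follows; the assumption restricts the analyzed executions to those with x > 0, so the division cannot be a division by zero:

extern void __VERIFIER_assume(int expression);
extern int __VERIFIER_nondet_int(void);

int main(void) {
  int x = __VERIFIER_nondet_int();   /* arbitrary int value */
  __VERIFIER_assume(x > 0);          /* only executions with x > 0 are considered */
  return 100 / x;                    /* safe under the assumption: x cannot be 0 */
}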

__VERIFIER_atomic_begin(): For modeling an atomic execution of a sequence of statements in a multi-threaded run-time environment, those statements can be placed between two function calls __VERIFIER_atomic_begin() and __VERIFIER_atomic_end() (deprecated but still valid: those statements can be placed in a function whose name matches __VERIFIER_atomic_). The verifiers are instructed to assume that the execution between those calls is not interrupted. The two calls need to occur within the same control-flow block; nesting or interleaving of those function calls is not allowed.
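
A minimal sketch of this convention (hypothetical program, not taken from the benchmark set, assuming POSIX threads): the update of the shared counter is enclosed in __VERIFIER_atomic_begin() and __VERIFIER_atomic_end(), so verifiers treat it as executed without interruption by the other thread.

#include <pthread.h>

extern void __VERIFIER_atomic_begin(void);
extern void __VERIFIER_atomic_end(void);

int counter = 0;

void *worker(void *arg) {
  (void) arg;   /* unused */
  __VERIFIER_atomic_begin();
  counter = counter + 1;   /* executed atomically, not interleaved */
  __VERIFIER_atomic_end();
  return 0;
}

int main(void) {
  pthread_t t1, t2;
  pthread_create(&t1, 0, worker, 0);
  pthread_create(&t2, 0, worker, 0);
  pthread_join(t1, 0);
  pthread_join(t2, 0);
  return 0;
}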

malloc(), free(): We assume that the functions malloc and alloca always return a valid pointer, i.e., the memory allocation never fails, and function free always deallocates the memory and makes the pointer invalid for further dereferences.
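
Under this convention, a verifier may, for example, assume that the dereference in the following hypothetical program is valid even though the return value of malloc is not checked:

#include <stdlib.h>

int main(void) {
  int *p = malloc(sizeof(int));
  *p = 1;     /* valid: by convention, the allocation always succeeds */
  free(p);    /* after this call, dereferencing p would be invalid */
  return 0;
}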

Competition Environment and Requirements

Competition Environment

Each verification run will be started on a machine with a GNU/Linux operating system (x86_64-linux, Ubuntu 18.04); there are three resource limits for each verification run: a memory limit of 15 GB (14.6 GiB) of RAM, a runtime limit of 15 min of CPU time, and a limit of 8 processing units of a CPU. If a verifier hangs, i.e., does not continue to consume CPU time, then the verifier is killed after 15 min of wall time, and the resulting runtime is set to 15 min.

Benchmark Definition

The competition environment feeds the candidates with parameters when started for a verification run. There is one global set of parameters that can be used to tune the verifier to the benchmark programs. This set of (command-line) parameters has to be defined in the competition contribution paper and in the benchmark definition. One parameter defines the specification file. Some categories need to be verified with the information that a 64-bit architecture is used; thus, if the verifier has a parameter to handle this aspect, it needs to be defined.

In order to participate at SV-COMP, a benchmark definition in the SV-COMP repository is necessary. Technically, the benchmark definition needs to be integrated into the SV-COMP repository under directory benchmark-defs using a pull request.

The benchmark definition defines the categories that the verifier is to be executed on, the parameters to be passed, and the resource limits.

Tool-Info Module

In order to participate at SV-COMP, a tool-info module in the BenchExec repository is necessary. Technically, the tool-info module needs to be integrated into the BenchExec repository under directory benchexec/tools using a pull request.

The task of the tool-info module is (besides other tasks) to translate the output of a verifier to the results FALSE, TRUE, etc. For running the contributed verifier, the organizer follows the installation requirements and executes the verifier, relying on the tool-info module for correct execution of the verifier and correct interpretation of its results.

The tool-info module must be tested, cf. instructions on tool integration.

Verifier

The submitted system has to meet the following requirements:

Qualification

Verifier. A verification tool is qualified to participate as competition candidate if the tool is (a) publicly available for download and fulfills the above license requirements, (b) works on the GNU/Linux platform (more specifically, it must run on an x86_64 machine), (c) is installable with user privileges (no root access required, except for required standard Ubuntu packages) and without hard-coded absolute paths for access to libraries and non-standard external tools, (d) succeeds for more than 50 % of all training programs to parse the input and start the verification process (a tool crash during the verification phase does not disqualify), and (e) produces witness files (for violation and correctness) that adhere to the witness exchange format (syntactically correct). The competition organizer can always add verifiers from previous years as participants.

Person. A person (participant) is qualified as competition contributor for a competition candidate if the person (a) is a contributing designer/developer of the submitted competition candidate (witnessed by occurrence of the person’s name on the tool's project web page, a tool paper, or in the revision logs) or (b) is authorized by the competition organizer (after the designer/developer was contacted about the participation).

Paper. A paper is qualified if the quality of the description of the competition candidate suffices to run the tool in the competition and if it meets the scientific standards of TACAS for representing the competition candidate in the TACAS proceedings.

Note: A verification tool can participate several times as different competition candidates, if a significant difference of the conceptual or technological basis of the implementation is justified in the accompanying description paper. This applies to different versions as well as different configurations, in order to avoid forcing developers to create a new tool for every new concept.

Evaluation by Scores and Runtime

The scores will be assigned according to the following table:

Points  Reported result                        Description
  0     UNKNOWN                                Failure to compute verification result, out of resources, program crash.
 +1     FALSE (correct)                        The error in the program was found and a violation witness was confirmed.
−16     FALSE (incorrect)                      An error is reported for a program that fulfills the specification (false alarm, incomplete analysis).
 +2     TRUE (correct)                         The program was analyzed to be free of errors and a correctness witness was confirmed.
 +1     TRUE (correct, witness unconfirmed)    The program was analyzed to be free of errors but the correctness witness was not confirmed.
−32     TRUE (incorrect)                       The program had an error but the competition candidate did not find it (missed bug, unsound analysis).
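
The following small sketch (not part of the competition infrastructure; the enum, function name, and parameters are assumptions made for illustration) encodes the six rows of the table above. Cases that are not listed in the table, such as a correct FALSE answer whose violation witness was not confirmed, are not covered by the sketch.

enum answer { ANSWER_TRUE, ANSWER_FALSE, ANSWER_UNKNOWN };

/* 'correct' indicates whether the reported answer matches the expected
   verdict, 'witness_confirmed' whether a validator confirmed the witness. */
int score(enum answer reported, int correct, int witness_confirmed) {
  if (reported == ANSWER_UNKNOWN)
    return 0;                          /* failure to compute a result */
  if (reported == ANSWER_FALSE)
    return correct ? 1 : -16;          /* confirmed violation / false alarm */
  if (!correct)
    return -32;                        /* TRUE reported for a program with a bug */
  return witness_confirmed ? 2 : 1;    /* TRUE with confirmed / unconfirmed witness */
}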

The results of type TRUE yield higher absolute score values than those of type FALSE, because it is expected to be 'easier' to detect errors than to prove correctness. The absolute score values for incorrect results are higher than those for correct results, because a single correct answer should not be able to compensate for a wrong answer. (An imaginary competition candidate that generates random answers should be ranked with a negative sum of points.)

The participating competition candidates are ranked according to the sum of points. Competition candidates with the same sum of points are sub-ranked according to success-runtime. The success-runtime for a competition candidate is the total CPU time over all benchmarks for which the competition candidate reported a correct verification result.

The participants have the opportunity to check the verification results against their own expected results and discuss inconsistencies with the competition chair.

An illustration of the scoring schema for the reachability property is provided on the competition web site.

Opting-out from Categories

Every team can submit for every category (including meta categories, i.e., categories that consist of verification tasks from other categories) an opt-out statement. In the results table, a dash is entered for that category; no execution results are reported in that category. If a team participates (i.e., does not opt out), *all* verification tasks that belong to that category are executed with the verifier. The obtained results are reported in the results table; the scores for meta categories are weighted according to the established procedure. (This means that a tool can participate in a meta category and at the same time opt out from a sub-category: the real results of the tool are counted towards the meta category but not reported for the sub-category.)

Computation of Scores for Meta Categories

A meta category is a category that consists of several sub-categories. The score in such a meta category is computed from the normalized scores in its sub-categories.

Procedure

The score for a (meta) category is computed from the scores of all k contained (sub-) categories using a normalization by the number of contained verification tasks: the normalized score sn_i of a verifier in category i is obtained by dividing the score s_i by the number of tasks n_i in category i (sn_i = s_i / n_i). The sum st = sn_1 + ... + sn_k of the normalized scores is then multiplied by the average number of tasks per category, i.e., the score for the meta category is st * (n_1 + ... + n_k) / k.
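
A small sketch of this computation (illustration only; the function name and signature are assumptions), where s[i] is the score and n[i] the number of tasks in sub-category i:

/* Computes the meta-category score from k sub-category scores. */
double meta_score(int k, const double s[], const int n[]) {
  double st = 0.0;        /* sum of normalized scores sn_1 + ... + sn_k */
  int total_tasks = 0;    /* n_1 + ... + n_k */
  for (int i = 0; i < k; i++) {
    st += s[i] / n[i];    /* sn_i = s_i / n_i */
    total_tasks += n[i];
  }
  return st * ((double) total_tasks / k);   /* multiply by average tasks per category */
}

For Verifier A in the example below, this computation yields (10/10 + 5/10) * (10 + 10)/2 = 15.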

Motivation

The goal is to reduce the influence of a verification task in a large category compared to a verification task in a small category, and thus, balance over the categories. Normalizing by score is not an option because we assigned higher positive scores for TRUE and higher negative scores for wrong results. (Normalizing by score would remove those desired differences.)

Example

Category 1 has 10 verification tasks with result TRUE.
Category 2 has 10 verification tasks with result FALSE.

Verifier A correctly solved 5 verification tasks in category 1 (-> 10 points) and 5 verification tasks in category 2 (-> 5 points).
Overall score of Verifier A: (10/10 + 5/10) * (10 + 10)/2 = 1.5 * 10 = 15

Verifier B correctly solved 10 verification tasks in category 1 (-> 20 points) and 0 verification tasks in category 2 (-> 0 points).
Overall score of Verifier B: (20/10 + 0/10) * (10 + 10)/2 = 2.0 * 10 = 20

Verifier C correctly solved 0 verification tasks in category 1 (-> 0 points) and 10 verification tasks in category 2 (-> 10 points).
Overall score of Verifier C: (0/10 + 10/10) * (10 + 10)/2 = 1.0 * 10 = 10

Verifier D correctly solved 8 verification tasks in category 1 (-> 16 points) and 8 verification tasks in category 2 (-> 8 points).
Overall score of Verifier D: (16/10 + 8/10) * (10 + 10)/2 = 2.4 * 10 = 24

Obtained ranking:
1. Verifier D
2. Verifier B
3. Verifier A
4. Verifier C

Verifier D was strong in all categories, and thus, won category Overall.
Verifier B was even stronger than Verifier D in one category, but performed less well in the other category, and thus placed only second overall.
