Call for Participation — Procedure

The competition will compare state-of-the-art tools for software verification with respect to effectiveness and efficiency. The competition consists of two phases: a training phase, in which benchmark programs are given to the tool developers, and an evaluation phase, in which all participating verifiers are executed on benchmark verification tasks and both the number of solved instances and the runtime are measured. The competition is performed and presented by the TACAS Competition Chair.

This page is the authoritative reference for the rules of the competition. Please also read the FAQ, which provides further explanations.

Publication and Presentation of the Competition Candidates

The jury selects qualified competition candidates to publish (in the LNCS proceedings of TACAS) a contribution paper that gives an overview of the competition candidate. An overview paper by the competition organizers will describe the competition procedure and present the results. The paper-submission deadline is specified on the dates page; submission requirements are explained on the submission page.

In addition, every qualified competition candidate is granted a presentation slot in the TACAS program to present the competition candidate to the TACAS audience.

Definition of Verification Task

A verification task consists of a C program and a specification. A verification run is a non-interactive execution of a competition candidate on a single verification task, in order to check if the following statement is correct: "The program satisfies the specification."

The result of a verification run is a triple (ANSWER, WITNESS, TIME). ANSWER is one of the following outcomes:

TRUE + Witness: The specification is satisfied (i.e., there is no path that violates the specification) and a correctness witness is produced.
FALSE + Witness: The specification is violated (i.e., there exists a path that violates the specification) and a violation witness is produced.
UNKNOWN: The tool cannot decide the problem or terminates due to a tool crash, time-out, or out-of-memory (i.e., the competition candidate does not succeed in computing an answer TRUE or FALSE).

TIME is the consumed CPU time until the verifier terminates. It includes the consumed CPU time of all processes that the verifier starts. If TIME is equal to or larger than the time limit, then the verifier is terminated and the ANSWER is set to "timeout" (and interpreted as UNKNOWN).

The C programs are partitioned into categories, which are defined in category-set files. The categories and the contained programs are explained on the benchmark page. The requirements for the C programs are described below in the next section.

Witnesses

There is an exchange format for witnesses of verification results. We currently support three versions of the format: version 1.0, based on GraphML, and versions 2.0 and 2.1, based on YAML. Each verification result has to be accompanied by a witness in a format version specified by the following table (some exceptions are mentioned below).

| Language | Property             | Base Category              | Correctness      | Violation     |
|----------|----------------------|----------------------------|------------------|---------------|
| C        | unreach-call         | *.Arrays, *.Floats, *.Heap | not supported    | 1.0, 2.0, 2.1 |
| C        | unreach-call         | *Concurrency*              | 2.1* (demo mode) | 1.0           |
| C        | unreach-call         | all others                 | 2.0, 2.1*        | 1.0, 2.0, 2.1 |
| C        | valid-memsafety      | *Concurrency*              | not supported    | 1.0           |
| C        | valid-memsafety      | all others                 | not supported    | 1.0, 2.0**, 2.1** |
| C        | valid-memcleanup     | all                        | not supported    | 1.0           |
| C        | no-overflow          | *Concurrency*              | 2.1* (demo mode) | 1.0           |
| C        | no-overflow          | all others                 | 2.0, 2.1*        | 1.0, 2.0, 2.1 |
| C        | no-data-race         | all                        | not supported    | 1.0           |
| C        | termination          | all                        | 2.1* (demo mode) | 1.0, 2.1      |
| Java     | valid-assert         | all                        | 1.0 (demo mode)  | 1.0           |
| Java     | no-runtime-exception | all                        | not supported    | 1.0           |
* We do not support two features of the format 2.1, namely function contracts and inductive invariants, as they have been introduced only very recently and/or do not have sufficient tool support.
** The format versions 2.0 and 2.1 support only violations of the subproperties valid-deref and valid-free of valid-memsafety. If the violated subproperty is valid-memtrack, the format version 1.0 has to be used.

No witnesses are required for TRUE results in the base categories where the table says that correctness witnesses are not supported or are supported only in demo mode.

Each witness has to be written to a file witness.graphml (format 1.0) or witness.yml (format 2.0 or higher). Each tool is supposed to produce only one witness (if both files are produced, witness.graphml is ignored). The witness is given to witness validators to check its validity. The resource limits for a witness-validation task are 2 processing units, 7 GB of memory, and 90 s of CPU time (i.e., 10 % of the verification time limit) for violation witnesses and 300 s (33 %) for correctness witnesses. The verification result is counted as correct only if at least one validator successfully validates the corresponding witness (except for the cases where witnesses are not required).

Documentation of Witness Formats

Properties

The specification to be verified for a program "path/filename" is given either in a file with the name "path/filename.prp" or in a file "Category.prp".

The definition 'init(main())' gives the initial states of the program by a call of function 'main'. If the program defines the function main with parameters (as allowed by the C standard), we assume that the initial values of these parameters satisfy the conditions specified in the C standard (detailed in Section 5.1.2.2.1, "Program Startup", of the C99 standard).

The definition 'LTL(f)' specifies that formula 'f' holds at every initial state of the program. The LTL (linear-time temporal logic) operator 'G f' means that 'f' globally holds. The proposition 'label(L)' is true if the C label 'L' is reached. The proposition 'call(func())' is true if the function 'func()' is called.

Accepted Verdict Formats:

If a verification run detects that a property is violated, then the violated (potentially partial) property has to be given in the result: FALSE(p), with p in {unreach-call, termination, no-overflow, valid-free, valid-deref, valid-memtrack, valid-memcleanup, no-data-race}, meaning the (potentially partial) property p is violated.

If a verification run detects that no checked property is violated, then no property shall be given in the result: TRUE, meaning the input was found to satisfy all properties.

Unreachability of Error Function:

CHECK( init(main()), LTL(G ! call(func())) )    

Formula: G ! call(func())
Definition: The function 'func' is not called in any execution of the program.
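
For illustration (this is a sketch, not an official benchmark task), the following program violates the property when func is instantiated with the error function __VERIFIER_error described in the conventions below:

extern void __VERIFIER_error(void);
extern int __VERIFIER_nondet_int(void);

int main(void) {
    int x = __VERIFIER_nondet_int();
    if (x == 0) {
        __VERIFIER_error(); // reachable for input x == 0: the property is violated
    }
    return 0;
}

A verifier should report FALSE(unreach-call) for such a task, together with a violation witness.
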
Memory Safety (only for category 'MemorySafety'):

CHECK( init(main()), LTL(G valid-free) )
CHECK( init(main()), LTL(G valid-deref) )
CHECK( init(main()), LTL(G valid-memtrack) )    

Formula: G valid-free
Definition: All memory deallocations are valid (counterexample: invalid free). More precisely: there exists no finite execution of the program on which an invalid memory deallocation occurs.

Formula: G valid-deref
Definition: All pointer dereferences are valid (counterexample: invalid dereference). More precisely: there exists no finite execution of the program on which an invalid pointer dereference occurs.

Formula: G valid-memtrack
Definition: All allocated memory blocks are tracked. The set of tracked blocks is defined as the smallest set of blocks satisfying the following two rules:
  1. A block is tracked whenever a pointer to this block (not necessarily pointing to the beginning of the block) or to the first address after this block (see 6.5.6 of the C11 standard) is stored in a program variable. The variable can be of a pointer type or of a compound type containing a pointer. The variable does not have to be in the current scope; it can be global or on the call stack.
  2. If some pointer in a tracked block points to another block (again, not necessarily to the beginning of the block) or to the first address after this block, this pointed-to block is also tracked.
In particular, a leaked memory block is not tracked. Hence, a program with a memory leak does not satisfy this property.
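
As an illustrative sketch (not an official benchmark task), the following program violates G valid-memtrack: the only pointer to the allocated block is overwritten, so the block becomes untracked (leaked):

#include <stdlib.h>

int main(void) {
    int *p = malloc(sizeof(int)); // the block is tracked via p
    p = NULL;                     // the last pointer to the block is lost
    return 0;                     // the leaked block is untracked: violates G valid-memtrack
}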

Agreement: All programs in category 'MemorySafety' either satisfy all (partial) properties or violate exactly one (partial) property p (p in {valid-free, valid-deref, valid-memtrack}) that is reachable from the program entry. We ignore further property violations on a path after finding the first violation (because the behavior is undefined after the first violation).

Memory Cleanup (only for category 'MemorySafety'):

CHECK( init(main()), LTL(G valid-memcleanup) )    

Formula: G valid-memcleanup
Definition: All allocated memory is deallocated before the program terminates. In addition to valid-memtrack: there exists no finite execution of the program on which the program terminates while memory is still allocated. (Comparison to Valgrind: this property can be violated even if Valgrind reports 'still reachable'.)
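
As an illustrative sketch (not an official benchmark task), the following program satisfies G valid-memtrack (the block stays reachable via a global pointer) but violates G valid-memcleanup, because the block is still allocated when the program terminates; this is exactly the 'still reachable' situation mentioned in the Valgrind comparison:

#include <stdlib.h>

int *global; // keeps the allocated block tracked

int main(void) {
    global = malloc(sizeof(int));
    return 0; // the block is still allocated at termination: violates G valid-memcleanup
}
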
No Overflow:

CHECK( init(main()), LTL(G ! overflow) )    

Formula: G ! overflow
Definition: It can never happen that the resulting type of an operation is a signed-integer type but the resulting value is not in the range of values that are representable by that type. A violation of this property matches what C11 defines as undefined behavior. (Hence, conversions to signed-integer types do not violate this property.)
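
As an illustrative sketch (not an official benchmark task), the following program violates G ! overflow, because for the input INT_MAX the result of the signed addition is not representable in int:

#include <limits.h>

extern int __VERIFIER_nondet_int(void);

int main(void) {
    int x = __VERIFIER_nondet_int();
    if (x == INT_MAX) {
        x = x + 1; // resulting type is int, but the value is out of range: violates G ! overflow
    }
    return 0;
}
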
No Data Race (only for category 'ConcurrencySafety'):

CHECK( init(main()), LTL(G ! data-race) )    

Formula: G ! data-race
Definition: If there exist two or more concurrent accesses to the same memory location and at least one is a write access, then all accesses must be atomic.
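
As an illustrative sketch (not an official benchmark task), the following program violates G ! data-race: the main thread and the created thread write to the shared variable counter concurrently, without synchronization or atomicity:

#include <pthread.h>

int counter = 0; // shared memory location

void *worker(void *arg) {
    counter = counter + 1; // unsynchronized write
    return 0;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, 0, worker, 0);
    counter = counter + 1; // concurrent write to the same location: data race
    pthread_join(t, 0);
    return 0;
}
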
Termination (only for category 'Termination'):

CHECK( init(main()), LTL(F end) )    

Formula: F end
Definition: Every path finally reaches the end of the program. The proposition "end" is true at the end of every finite program execution (exit, abort, return from the initial call of main, etc.). A counterexample for this property is an infinite program execution.
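
As an illustrative sketch (not an official benchmark task), the following program violates F end: for any input x > 0, the loop condition never changes, so the program has an infinite execution:

extern int __VERIFIER_nondet_int(void);

int main(void) {
    int x = __VERIFIER_nondet_int();
    while (x > 0) {
        // x is never modified, so every execution entering the loop is infinite
    }
    return 0; // reached only for inputs x <= 0
}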

Termination is a liveness property, and thus, counterexamples must have maximal execution length, i.e., are terminating or infinite; all terminating paths end with proposition "end".

For the result "FALSE + Error Path", the error path has to be a finite path that ends in a location that is repeated infinitely often in an infinite path. If there is a periodic infinite execution, a second path may be added. This second path has to lead from this infinitely repeated location back to this infinitely repeated location and this path has to correspond to the infinite execution.

Benchmark Verification Tasks

The training verification tasks will be provided by a specified date on this web page. A subset of the training verification tasks will be used for the actual competition experiments. The jury agrees on a procedure to define the subset (if not all verification tasks are taken). We avoid using verification tasks without training, because the semantics of the language C is underspecified; the participants, jury, and community ensure that all involved parties agree on the intended meaning of the verification tasks. (This is in contrast to competitions over completely specified problems, such as SAT or SMT.)

The programs are assumed to be written in GNU C (some of them adhere to ANSI C). Each program consists of a single file, which is either a .i file (already preprocessed) or a .c file (possibly un-preprocessed). A verifier may distinguish between preprocessed and un-preprocessed programs using the given file extensions. Un-preprocessed programs fulfill the following requirements:

  1. #include directives only include headers from the C standard or pthread.h.
  2. No #define directives are used.
  3. All used macros are defined by the C standard or pthread.h.
A verifier may preprocess a .c file using cpp -m32 or cpp -m64, depending on the program architecture (see below), without requiring additional macro definitions (-D arguments) or include paths (-I arguments) to be specified. Note that witnesses should still refer to the un-preprocessed .c file (a verifier can rely on #line directives to achieve this). Each program contains all code that is needed for the verification, i.e., all non-standard functions are defined.
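
As an illustrative sketch, a well-formed un-preprocessed .c file could look as follows; it includes only standard headers, uses no #define directives, and uses only macros defined by the C standard:

#include <limits.h>
#include <stdlib.h>

int main(void) {
    int *p = malloc(sizeof(int)); // malloc is declared by a standard header
    *p = INT_MAX;                 // INT_MAX is a macro defined by the C standard
    free(p);
    return 0;
}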

Potential competition participants are invited to submit benchmark verification tasks until the specified date. Verification tasks have to fulfill two requirements, to be eligible for the competition: (1) the program has to be written in GNU C or ANSI C, and (2) the program has to come with a specification given by one of the properties stated above. Other specifications are possible, but need to be proposed and discussed.

Newly proposed categories will be included if at least three different tools or teams participate in the category (i.e., not the same tool twice with different configurations).

For each category, we specify whether the programs are written for an ILP32 (32-bit) or an LP64 (64-bit) architecture (cf. http://www.unix.org/whitepapers/64bit.html).

In the following, we list a few conventions that are used in some of the verification tasks, in order to express special information that is difficult to capture with the C language.

__VERIFIER_error(): For checking (un)reachability we use the function __VERIFIER_error(). The verification tool can assume the following implementation:
void __VERIFIER_error() { abort(); }
Hence, a call to __VERIFIER_error() never returns; the program terminates inside __VERIFIER_error().
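
A sketch of a typical use of this convention (the function divide is illustrative):

extern void __VERIFIER_error(void);

int divide(int a, int b) {
    if (b == 0) {
        __VERIFIER_error(); // never returns; the program terminates here
    }
    return a / b; // reached only if b != 0, because the call above does not return
}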

__VERIFIER_assume(expression): A verification tool can assume that a function call __VERIFIER_assume(expression) has the following meaning: If 'expression' is evaluated to '0', then the function loops forever, otherwise the function returns (no side effects). The verification tool can assume the following implementation:
void __VERIFIER_assume(int expression) { if (!expression) { LOOP: goto LOOP; }; return; }
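
A sketch of a typical use (the concrete expressions are illustrative): after the call, a verifier only needs to consider executions on which the assumed condition holds, because all other executions loop forever inside __VERIFIER_assume:

extern void __VERIFIER_assume(int expression);
extern int __VERIFIER_nondet_int(void);

int main(void) {
    int x = __VERIFIER_nondet_int();
    __VERIFIER_assume(x != 0); // executions with x == 0 never get past this call
    return 100 / x;            // division by zero is excluded by the assumption
}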

__VERIFIER_nondet_X(): In order to model nondeterministic values, the following functions can be assumed to return an arbitrary value of the indicated type: __VERIFIER_nondet_X() with X in {bool, char, int, int128, float, double, loff_t, long, longlong, pchar, pthread_t, sector_t, short, size_t, u32, uchar, uint, uint128, ulong, ulonglong, unsigned, ushort} (no side effects, pointer for void *, etc.). The verification tool can assume that the functions are implemented according to the following template:
X __VERIFIER_nondet_X() { X val; return val; }
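
A sketch of a typical use (the program itself is illustrative): the verifier has to consider every possible return value of __VERIFIER_nondet_uint, so, e.g., termination of the loop below has to be established for all values of n:

extern unsigned int __VERIFIER_nondet_uint(void);

int main(void) {
    unsigned int n = __VERIFIER_nondet_uint(); // arbitrary unsigned value
    unsigned int i = 0;
    while (i < n) { // terminates for every possible n
        i = i + 1;
    }
    return 0;
}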

__VERIFIER_nondet_memory(void *, size_t): This function initializes the given memory block with arbitrary values. The first argument must be a valid pointer to the start of a memory block of the given size. The second argument specifies the size of the memory to initialize and must match the size of the memory block that the first argument points to. Dereferencing any pointer value set through this function results in undefined behavior; this means that pointer values must be explicitly set through different means before they can be dereferenced. The verification tool can assume that __VERIFIER_nondet_memory is implemented as follows:

void __VERIFIER_nondet_memory(void *mem, size_t size) {
    unsigned char *p = mem;
    for (size_t i = 0; i < size; i++) {
        p[i] = __VERIFIER_nondet_uchar();
    }
}
    
Example uses of __VERIFIER_nondet_memory:
struct structType s;
__VERIFIER_nondet_memory(&s, sizeof(s));

int * values = malloc(sizeof(int) * 20);
__VERIFIER_nondet_memory(values, sizeof(int) * 20);
    

__VERIFIER_atomic_*(): These functions are deprecated, but still used in existing tasks. Please consider using standard C features from stdatomic.h and pthread.h to model atomicity (e.g., atomic types, atomic loads / stores, mutexes, ...).
For modeling an atomic execution of a sequence of statements in a multi-threaded run-time environment, those statements can be placed between two function calls __VERIFIER_atomic_begin() and __VERIFIER_atomic_end() or those statements can be placed in a function whose name starts with __VERIFIER_atomic_. The verifiers are instructed to assume that the execution between those calls is not interrupted. The two calls need to occur within the same control-flow block; nesting or interleaving of those function calls is not allowed.
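
A sketch of both styles (the function names worker and __VERIFIER_atomic_increment are illustrative):

extern void __VERIFIER_atomic_begin(void);
extern void __VERIFIER_atomic_end(void);

int counter = 0;

void *worker(void *arg) {
    __VERIFIER_atomic_begin();
    counter = counter + 1; // executed without interruption by other threads
    __VERIFIER_atomic_end();
    return 0;
}

// Equivalent alternative: the name prefix marks the whole function as atomic.
void __VERIFIER_atomic_increment(void) {
    counter = counter + 1;
}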

malloc(), free(): We assume that the functions malloc and alloca always return a valid pointer, i.e., the memory allocation never fails, and function free always deallocates the memory and makes the pointer invalid for further dereferences.
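
A sketch of the consequences of this convention (illustrative):

#include <stdlib.h>

int main(void) {
    int *p = malloc(sizeof(int)); // by convention never NULL: the allocation does not fail
    *p = 42;                      // hence this dereference is valid without a NULL check
    free(p);                      // p is now invalid; dereferencing it afterwards would violate valid-deref
    return 0;
}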

Competition Environment and Requirements

Competition Environment

Each verification run will be started on a machine with a GNU/Linux operating system (x86_64-linux, the latest LTS release of Ubuntu); there are three resource limits for each verification run: a memory limit of 15 GB (14.6 GiB) of RAM, a runtime limit of 15 min of CPU time, and a limit of 4 processing units of a CPU. If a verifier hangs, i.e., does not continue to consume CPU time, then the verifier is killed after 15 min of wall time, and the resulting runtime is set to 15 min.

Benchmark Definition

The competition environment feeds the candidates with parameters when started for a verification run. There is one global set of parameters that can be used to tune the verifier to the benchmark programs. Verifiers are forbidden from using the program name, its hash, or the current category to tune their parameters. This set of (command-line) parameters has to be defined in the competition contribution paper and in the benchmark definition. One parameter defines the specification file. There are categories that need to be verified with the information that a 64-bit architecture is used; thus, if the verifier has a parameter to handle this aspect, it needs to be defined.

In order to participate at SV-COMP, a benchmark definition in the SV-COMP repository is necessary. Technically, the benchmark definition needs to be integrated into the SV-COMP repository under directory benchmark-defs using a merge request.

The benchmark definition defines the categories that the verifier is to be executed on, the parameters to be passed, and the resource limits.

Tool-Info Module

In order to participate at SV-COMP, a tool-info module in the BenchExec repository is necessary. Technically, the tool-info module needs to be integrated into the BenchExec repository under directory benchexec/tools using a pull request.

The task of the tool-info module is (besides other tasks) to translate the output of a verifier to the results FALSE, TRUE, etc. For running the contributed verifier, the organizer follows the installation requirements and executes the verifier, relying on the tool-info module for correct execution of the verifier and correct interpretation of its results.

The tool-info module must be tested, cf. instructions on tool integration.

Verifier

The submitted system has to meet the requirements mentioned on the submission page.

Qualification

Verifier. A verification tool is qualified to participate as competition candidate if the tool is (a) publicly available for download and fulfills the above license requirements, (b) works on the GNU/Linux platform (more specifically, it must run on an x86_64 machine with the latest Ubuntu LTS, with only the self-declared packages installed), (c) is installable with user privileges (no root access required, except for required standard Ubuntu packages) and without hard-coded absolute paths for access to libraries and non-standard external tools, (d) succeeds for more than 50 % of all training programs to parse the input and start the verification process (a tool crash during the verification phase does not disqualify), and (e) produces witness files (for violation and correctness) that adhere to the witness exchange format (syntactically correct). The competition organizer can always add verifiers from previous years as participants.

Person. A person (participant) is qualified as competition contributor for a competition candidate if the person (a) is a contributing designer/developer of the submitted competition candidate (witnessed by occurrence of the person’s name on the tool's project web page, a tool paper, or in the revision logs) or (b) is authorized by the competition organizer (after the designer/developer was contacted about the participation).

Paper. A paper is qualified if the description of the competition candidate is of sufficient quality to run the tool in the competition and meets the scientific standards of TACAS for representing the competition candidate in the TACAS proceedings.

Note: A verification tool can participate several times as different competition candidates, if a significant difference of the conceptual or technological basis of the implementation is justified in the accompanying description paper. This applies to different versions as well as different configurations, in order to avoid forcing developers to create a new tool for every new concept.

Requirements on Verifier Behavior

The verifier should not use identifiers contained in the verification task to fingerprint and identify individual tasks or groups of tasks. It is acceptable to use the occurrence of calls to extern functions from standard libraries (for example, malloc, pthread_create, arithmetic functions, etc.) to detect which features a verification task uses and to change the verifier behavior based on this. The competition organizer is allowed to apply obfuscation to the C code and rename local identifiers (with the exception of extern functions and an allowlist of common identifiers).

Evaluation by Scores and Runtime

The scores will be assigned according to the following table:

| Points | Reported result    | Description |
|--------|--------------------|-------------|
| 0      | UNKNOWN            | Failure to compute verification result, out of resources, program crash. |
| +1     | FALSE (correct)    | The error in the program was found and a violation witness was confirmed. |
| −16    | FALSE (incorrect)  | An error is reported for a program that fulfills the specification (false alarm, incomplete analysis). |
| +2     | TRUE (correct)     | The program was analyzed to be free of errors and a correctness witness was confirmed. |
| −32    | TRUE (incorrect)   | The program had an error but the competition candidate did not find it (missed bug, unsound analysis). |

The results of type TRUE yield higher absolute score values compared to type FALSE, because it is expected to be 'easier' to detect errors than it is to prove correctness. The absolute score values for incorrect results are higher compared to correct results, because a single correct answer should not be able to compensate for a wrong answer. (An imaginary competition candidate that generates random answers should be ranked with a negative sum of points.)

The participating competition candidates are ranked according to the sum of points. Competition candidates with the same sum of points are sub-ranked according to success-runtime. The success-runtime for a competition candidate is the total CPU time over all benchmarks for which the competition candidate reported a correct verification result.

The participants have the opportunity to check the verification results against their own expected results and discuss inconsistencies with the competition chair.

See also the illustration of the scoring schema for the reachability property.

Opting-out from Categories

Every team can submit, for every category (including meta categories, i.e., categories that consist of verification tasks from other categories), an opt-out statement. In the results table, a dash is entered for that category; no execution results are reported in that category. If a team participates (i.e., does not opt out), *all* verification tasks that belong to that category are executed with the verifier. The obtained results are reported in the results table; the scores for meta categories are weighted according to the established procedure. (This means that a tool can participate in a meta category and at the same time opt out from a sub-category, with the tool's real results counted towards the meta category but not reported for the sub-category.)

Computation of Scores for Meta Categories

A meta category is a category that consists of several sub-categories. The score in such a meta category is computed from the normalized scores in its sub-categories.

Procedure

The score for a (meta) category is computed from the scores of all k contained (sub-)categories using a normalization by the number of contained verification tasks: the normalized score sn_i of a verifier in category i is obtained by dividing the score s_i by the number of tasks n_i in category i (sn_i = s_i / n_i). The sum st = sn_1 + ... + sn_k over the normalized scores of the categories is then multiplied by the average number of tasks per category, (n_1 + ... + n_k) / k.
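
A minimal sketch of this computation in C (function and parameter names are illustrative):

// Computes the meta-category score from per-category scores s[i]
// and task counts n[i] for k sub-categories.
double meta_score(const double s[], const int n[], int k) {
    double st = 0.0;     // sum of normalized scores sn_i = s_i / n_i
    int total_tasks = 0; // n_1 + ... + n_k
    for (int i = 0; i < k; i++) {
        st += s[i] / n[i];
        total_tasks += n[i];
    }
    return st * ((double) total_tasks / k); // multiply by the average number of tasks per category
}

For the example below, meta_score with s = {10, 5}, n = {10, 10}, and k = 2 yields (1.0 + 0.5) * 10 = 15, matching Verifier A.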

Motivation

The goal is to reduce the influence of a verification task in a large category compared to a verification task in a small category, and thus, balance over the categories. Normalizing by score is not an option because we assigned higher positive scores for TRUE and higher negative scores for wrong results. (Normalizing by score would remove those desired differences.)

Example

Category 1 has 10 verification tasks with result TRUE.
Category 2 has 10 verification tasks with result FALSE.

Verifier A correctly solved 5 verification tasks in category 1 -> 10 points, and 5 verification tasks in category 2 -> 5 points.
Overall score of Verifier A: (10/10 + 5/10) * ((10 + 10) / 2) = 1.5 * 10 = 15

Verifier B correctly solved 10 verification tasks in category 1 -> 20 points, and 0 verification tasks in category 2 -> 0 points.
Overall score of Verifier B: (20/10 + 0/10) * ((10 + 10) / 2) = 2.0 * 10 = 20

Verifier C correctly solved 0 verification tasks in category 1 -> 0 points, and 10 verification tasks in category 2 -> 10 points.
Overall score of Verifier C: (0/10 + 10/10) * ((10 + 10) / 2) = 1.0 * 10 = 10

Verifier D correctly solved 8 verification tasks in category 1 -> 16 points, and 8 verification tasks in category 2 -> 8 points.
Overall score of Verifier D: (16/10 + 8/10) * ((10 + 10) / 2) = 2.4 * 10 = 24

Obtained ranking:
1. Verifier D
2. Verifier B
3. Verifier A
4. Verifier C

Verifier D was strong in all categories, and thus, won category Overall.
Verifier B was even stronger than Verifier D in one category, but did not perform as well in the other category, and thus only placed second overall.