Call for Participation — Procedure

The competition will compare state-of-the-art tools for software verification with respect to effectiveness and efficiency. The competition consists of two phases: a training phase, in which benchmark programs are given to the tool developers, and an evaluation phase, in which all participating verifiers are executed on benchmark verification tasks and both the number of solved instances and the runtime are measured. The competition is performed and presented by the TACAS Competition Chair (for TACAS'13 this is Dirk Beyer).

Quick Overview of Changes to Previous SV-COMP

Compared to last year, we made some changes to the rules in response to the community feedback after the first competition.

Publication and Presentation of the Competition Candidates

The jury selects qualified competition candidates to publish (in the LNCS proceedings of TACAS) a contribution paper that gives an overview of the competition candidate. The contribution papers have a page limit of three pages. An overview paper by the competition organizers will describe the competition procedure and present the results. The papers are due on the start date of the evaluation phase.

In addition, every qualified competition candidate is granted a demonstration slot in the TACAS program to present the competition candidate to the TACAS audience.

Definition of Verification Task

A verification task consists of a C program and a property. A verification run is a non-interactive execution of a competition candidate on a single verification task, in order to check if the following statement is correct: "The program satisfies the property."

The result of a verification run is a triple (ANSWER, WITNESS, TIME). ANSWER is one of the following outcomes:

TRUE / SAFE: The property is satisfied (i.e., there is no finite path that violates the property).
FALSE / UNSAFE + Error Path: The property is violated (i.e., there exists a finite path that violates the property) and a counterexample path is produced.
UNKNOWN: The tool cannot decide the problem or terminates due to a tool crash, time-out, or out-of-memory (i.e., the competition candidate does not succeed in computing an answer TRUE or FALSE).

If the answer is FALSE / UNSAFE, then a counterexample path must be produced and provided as WITNESS. There is no particular fixed format for the error path. The path has to be written to a file or printed on the screen in a reasonable format that makes it possible to check its validity.

TIME is the consumed CPU time until the verifier terminates. It includes the consumed CPU time of all processes that the verifier starts. If TIME is equal to or larger than the time limit, then the verifier is terminated and the ANSWER is set to "timeout" (and interpreted as UNKNOWN).
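
For illustration only, the following sketch shows one simple way to approximate this measurement on GNU/Linux (the official competition infrastructure may use a different mechanism): the verifier is started as a child process, and after it terminates, getrusage(RUSAGE_CHILDREN, ...) reports the accumulated CPU time of all waited-for child processes. Note that the CPU time of further descendants is only included if the verifier itself waits for them.

#include <stdio.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical sketch: approximate TIME for one verification run as the
 * CPU time (user + system) of the verifier and its waited-for children. */
int main(int argc, char *argv[]) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <verifier> [args...]\n", argv[0]);
        return 1;
    }
    pid_t pid = fork();
    if (pid == 0) {                      /* child process: run the verifier */
        execvp(argv[1], &argv[1]);
        perror("execvp");
        _exit(127);
    }
    int status;
    waitpid(pid, &status, 0);            /* wait until the verifier terminates */

    struct rusage ru;
    getrusage(RUSAGE_CHILDREN, &ru);     /* CPU time of all waited-for children */
    double cpu = ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6
               + ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
    printf("TIME: %.2f s of CPU time\n", cpu);
    return 0;
}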

The C programs are partitioned into categories, which are defined in category-set files. The categories and the contained programs are explained on the benchmark page. The requirements for the C programs are described below in the next section.

Property for Label Reachability

The property to be verified for all categories except 'MemorySafety' is the reachability property p_error, which is encoded in the program source code (using the error label 'ERROR').

Property      Description
p_error       The C label 'ERROR' is not reachable from the entry of function 'main' on any finite execution of the program.
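
For illustration, a minimal (hypothetical) verification task for this property could look as follows; __VERIFIER_nondet_int() is the convention for nondeterministic input values that is described in the section on benchmark verification tasks below. The property p_error holds if and only if the label 'ERROR' cannot be reached.

extern int __VERIFIER_nondet_int(void);

int main(void) {
  int x = __VERIFIER_nondet_int();   /* arbitrary input value */
  if (x < 0 && x > 0) {              /* unsatisfiable condition: the label is unreachable */
    ERROR: ;                         /* expected answer: TRUE / SAFE */
  }
  return 0;
}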

Property for Memory Safety

The property to be verified for category 'MemorySafety' is the memory safety property p_mem-safety, which consists of three partial properties.

Property            Description
p_mem-safety        p_valid-free and p_valid-deref and p_valid-memtrack
p_valid-free        All memory deallocations are valid (counterexample: invalid free). More precisely: There exists no finite execution of the program from the entry of function 'main' on which an invalid memory deallocation occurs.
p_valid-deref       All pointer dereferences are valid (counterexample: invalid dereference). More precisely: There exists no finite execution of the program from the entry of function 'main' on which an invalid pointer dereference occurs.
p_valid-memtrack    All allocated memory is tracked, i.e., pointed to or deallocated (counterexample: memory leak). More precisely: There exists no finite execution of the program from the entry of function 'main' on which the program lost track of some previously allocated memory.

If a verification run detects that the property p_mem-safety is violated, then the verification result is required to be more specific; the violated (partial) property has to be given in the result:

FALSE(p), with p in {p_valid-free, p_valid-deref, p_valid-memtrack}, means that the (partial) property p is violated.

Agreement: All programs in category 'MemorySafety' violate at most one (partial) property p (p in {p_valid-free, p_valid-deref, p_valid-memtrack}).
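
As a hypothetical illustration (not taken from the benchmark set), the following program violates the partial property p_valid-free because the same memory is deallocated twice; the expected answer is FALSE(p_valid-free).

#include <stdlib.h>

int main(void) {
  int *p = malloc(sizeof(int));
  if (p == NULL) return 0;
  *p = 42;        /* valid dereference */
  free(p);        /* valid deallocation */
  free(p);        /* invalid deallocation (double free): violates p_valid-free */
  return 0;
}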

Benchmark Verification Tasks

The training verification tasks will be provided by a specified date on this web page. A subset of the training verification tasks will be used for the actual competition experiments. The jury agrees on a procedure to define the subset (if not all verification tasks are taken).

The programs are assumed to be written in GNU C (most of them adhere to ANSI C). Some programs are provided in CIL (C Intermediate Language) [1].

Potential competition participants are invited to submit benchmark verification tasks until the specified date. Verification tasks have to fulfill two requirements to be eligible for the competition: (1) the program has to be written in GNU C or ANSI C (note that in contrast to the previous competition, the programs are not necessarily CIL-preprocessed), and (2) the property is instrumented into the program and is violated if the label 'ERROR' is reached or an (implicit) memory-safety property does not hold.

In the following, we list a few conventions that are used in some of the verification tasks, in order to express special information that is difficult to capture with the C language.

A verification tool can assume that a function call __VERIFIER_assume(expression) has the following meaning: If 'expression' is evaluated to '0', then the function loops forever, otherwise the function returns (no side effects). The verification tool can assume the following implementation:
void __VERIFIER_assume(int expression) { if (!expression) { LOOP: goto LOOP; }; return; }

In order to model nondeterministic values, the following functions can be assumed to return an arbitrary value of the indicated type: __VERIFIER_nondet_X() (and nondet_X(), deprecated) with X in {bool, char, int, float, loff_t, long, pchar, pointer, pthread_t, sector_t, short, size_t, u32, uchar, uint, ulong, unsigned, ushort} (no side effects, pointer for void *, etc.). The verification tool can assume that the functions are implemented according to the following template:
X __VERIFIER_nondet_X() { X val; return val; }
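
As a hypothetical illustration of how these conventions are combined in a verification task, the following program restricts a nondeterministic input with __VERIFIER_assume and encodes a reachability property via the label 'ERROR'; the expected answer is TRUE / SAFE.

extern int __VERIFIER_nondet_int(void);
extern void __VERIFIER_assume(int expression);

int main(void) {
  int n = __VERIFIER_nondet_int();       /* arbitrary input value */
  __VERIFIER_assume(n >= 0 && n < 100);  /* consider only executions with 0 <= n < 100 */
  int i, sum = 0;
  for (i = 0; i < n; i++) {
    sum = sum + 1;
  }
  if (sum != n) {                        /* never holds under the assumption */
    ERROR: ;                             /* unreachable: expected answer TRUE / SAFE */
  }
  return 0;
}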

For modeling an atomic execution of a sequence of statements in a multi-threaded run-time environment, the statements can be placed in a function whose name starts with '__VERIFIER_atomic_'. Verifiers are instructed to assume that the execution of such a function is not interrupted.
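
A hypothetical example of this convention: the following function increments a shared counter, and verifiers are expected to treat the read and the write inside it as a single uninterrupted step.

int counter = 0;                       /* shared variable */

void __VERIFIER_atomic_increment(void) {
  counter = counter + 1;               /* executed without interruption by other threads */
}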

Setup

Technical Details

The competition experiments will be run on a dedicated unloaded compute server with a 3.4 GHz 64-bit Quad Core CPU (Intel i7-2600K) and a GNU/Linux operating system (x86_64-linux, Ubuntu 12.04). The machine has 16 GB of RAM, of which exactly 15 GB will be made available to the competition candidate. Every experiment has a runtime limit of 15 min. The runtime is measured in seconds of CPU time.

Results

Every verification run will be started by a batch script that collects statistics and reports, for each competition candidate and verification task, one of the following outcomes: TRUE / SAFE (property holds), FALSE / UNSAFE (property does not hold, bug reported), UNKNOWN (verification result not known, resources exhausted, crash).

Parameters

The competition environment feeds the candidates with parameters when started for a verification run. There is one global set of parameters that can be used to tune the verifier to the benchmark programs. This set of (command-line) parameters has to be defined in the competition contribution paper. There is one category (DeviceDrivers64) that needs to be verified with the information that a 64-bit architecture is used; thus, if the verifier has a parameter to handle this aspect, it needs to be defined. There is one category (MemorySafety) that requires an additional set of parameters to specify the property; this needs to be specified in the competition contribution as well. There are two categories (DeviceDrivers64 and ControlFlowInteger-MemSimple) for which additional parameters are allowed to specify the simple memory model.

Qualification

A verification tool is qualified to participate as a competition candidate if the tool is publicly available on the internet (for the GNU/Linux platform; more specifically, it must run on an x86_64 machine) and succeeds, for more than 50 % of all training programs, in parsing the input and starting the verification process (a tool crash during the verification phase does not disqualify). A person (participant) is qualified as competition contributor for a competition candidate if the person is a contributing designer/developer of the submitted competition candidate (witnessed by the occurrence of the person's name on the tool's project web page, in a tool paper, or in the revision logs). A paper is qualified if the quality of the description of the competition candidate suffices to run the tool in the competition and meets the scientific standards of TACAS as a competition-candidate representation in the TACAS proceedings.

A verification tool can participate several times as different competition candidates, if a significant difference of the conceptual or technological basis of the implementation is justified in the accompanying description paper. This applies to different versions as well as different configurations, in order to avoid forcing developers to create a new tool for every new concept.

Evaluation by Scores and Runtime

The scores will be assigned according to the following table:

Points  Reported result            Description
  0     UNKNOWN                    Failure to compute verification result, out of resources, program crash.
 +1     FALSE / UNSAFE, correct    The error in the program was found and an error path was reported.
 −4     FALSE / UNSAFE, incorrect  An error is reported for a program that fulfills the property (false alarm, incomplete analysis).
 +2     TRUE / SAFE, correct       The program was analyzed to be free of errors.
 −8     TRUE / SAFE, incorrect     The program had an error but the competition candidate did not find it (missed bug, unsound analysis).

The results of type TRUE / SAFE yield higher absolute score values compared to type FALSE / UNSAFE, because it is expected to be 'easier' to detect errors than it is to prove correctness. The absolute score values for incorrect results are higher compared to correct results, because a single correct answer should not be able to compensate for a wrong answer. (An imaginary competition candidate that generates random answers should be ranked with a negative sum of points.)
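
The scoring rule can also be summarized in a few lines of code; the following hypothetical helper (the names are ours, not part of the competition infrastructure) returns the points for a single verification run, given the expected result of the task and the reported answer.

/* Expected result of a verification task and reported answer of a run. */
enum answer { ANSWER_TRUE, ANSWER_FALSE, ANSWER_UNKNOWN };

/* Points for a single verification run according to the table above. */
int score(enum answer expected, enum answer reported) {
  if (reported == ANSWER_UNKNOWN) return 0;        /* no verification result */
  if (reported == ANSWER_FALSE)
    return (expected == ANSWER_FALSE) ? 1 : -4;    /* bug found vs. false alarm */
  /* reported == ANSWER_TRUE */
  return (expected == ANSWER_TRUE) ? 2 : -8;       /* correct proof vs. missed bug */
}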

The participating competition candidates are ranked according to the sum of points. Competition candidates with the same sum of points are sub-ranked according to success-runtime. The success-runtime for a competition candidate is the total CPU time over all benchmarks for which the competition candidate reported a correct verification result.

The participants have the opportunity to check the verification results against their own expected results and discuss inconsistencies with the competition chair.

Opting-out from Categories

Every team can submit, for every category (including meta categories, i.e., categories that consist of verification tasks from other categories), an opt-out statement. In the results table, a dash is entered for that category; no execution results are reported in that category. If a team participates (i.e., does not opt out), *all* verification tasks that belong to that category are executed with the verifier. The obtained results are reported in the results table; the scores for meta categories are weighted according to the established procedure. (This means that a tool can participate in a meta category and at the same time opt out from a sub-category: the real results of the tool are counted towards the meta category, but not reported for the sub-category.)

Computation of Scores for Meta Categories

A meta category is a category that consists of several sub-categories. The score in such a meta category is computed from the normalized scores in its sub-categories.

In SV-COMP'13, there are two meta categories: ControlFlowInteger and Overall.
ControlFlowInteger consists of the two sub-categories ControlFlowInteger-MemSimple and ControlFlowInteger-MemPrecise.
Overall consists of the ten sub-categories BitVectors, Concurrency, ControlFlowInteger, DeviceDrivers64, FeatureChecks, HeapManipulation, Loops, MemorySafety, ProductLines, and SystemC.

Procedure

The score for a (meta) category is computed from the scores of all k contained (sub-)categories, using a normalization by the number of contained verification tasks: the normalized score sn_i of a verifier in category i is obtained by dividing the score s_i by the number of tasks n_i in category i (sn_i = s_i / n_i); then the sum st = sn_1 + ... + sn_k over the normalized scores of the categories is multiplied by the average number of tasks per category, i.e., the score for the (meta) category is st * (n_1 + ... + n_k) / k.
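
A minimal sketch of this computation (our own illustration, not the official evaluation script): the function takes the raw scores s_i and the task counts n_i of the k sub-categories and returns the meta-category score.

/* Meta-category score: sum of the normalized sub-category scores,
 * multiplied by the average number of tasks per sub-category. */
double meta_score(const double s[], const int n[], int k) {
  double st = 0.0;      /* sum of normalized scores sn_i = s_i / n_i */
  int total_tasks = 0;  /* n_1 + ... + n_k */
  for (int i = 0; i < k; i++) {
    st += s[i] / n[i];
    total_tasks += n[i];
  }
  return st * ((double) total_tasks / k);
}

For the numbers of Verifier A in the example below (s = {10, 5}, n = {10, 10}), this yields 1.5 * 10 = 15.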

Motivation

The goal is to reduce the influence of a verification task in a large category compared to a verification task in a small category, and thus, to balance over the categories. Normalizing by score is not an option because we assigned higher positive scores for correct TRUE / SAFE results and higher negative scores for incorrect results. (Normalizing by score would remove those desired differences.)

Example

Category 1 has 10 verification tasks with expected result TRUE / SAFE.
Category 2 has 10 verification tasks with expected result FALSE / UNSAFE.

Verifier A correctly solved 5 verification tasks in category 1 -> 10 points, and 5 verification tasks in category 2 -> 5 points.
Overall score of Verifier A: (10/10 + 5/10) * (10 + 10)/2 = 1.5 * 10 = 15

Verifier B correctly solved 10 verification tasks in category 1 -> 20 points, and 0 verification tasks in category 2 -> 0 points.
Overall score of Verifier B: (20/10 + 0/10) * (10 + 10)/2 = 2.0 * 10 = 20

Verifier C correctly solved 0 verification tasks in category 1 -> 0 points, and 10 verification tasks in category 2 -> 10 points.
Overall score of Verifier C: (0/10 + 10/10) * (10 + 10)/2 = 1.0 * 10 = 10

Verifier D correctly solved 8 verification tasks in category 1 -> 16 points, and 8 verification tasks in category 2 -> 8 points.
Overall score of Verifier D: (16/10 + 8/10) * (10 + 10)/2 = 2.4 * 10 = 24

Obtained ranking:
1. Verifier D
2. Verifier B
3. Verifier A
4. Verifier C

Verifier D was strong in all categories, and thus, won the competition.
Verifier B was even stronger than Verifier D in one category, but did not perform as well in the other category, and thus, only reached second place overall.