ICCV 2005 Computer Vision Contest

Goals

The goal of the contest [competition? challenge?] is to provide a fun framework for vision students to try their hand at solving a challenging vision-related task and to improve their knowledge and understanding by competing against other teams from around the world.  The contest will be launched in the Fall of 2004, so that students can work on the challenge in the context of a Fall or Winter semester/quarter vision course.  The contest finals will be held at ICCV 2005 in Beijing.

Task selection criteria

Selecting a task with an appropriate level of difficulty is a tricky problem.  On the one hand, we don't want the challenge to be so easy that everyone can solve it, making the competition uninteresting.  On the other, we don't want to revisit a well-studied problem (face recognition, stereo matching) for which well-known benchmarks already exist.  We also don't want to make the problem so difficult (describe every object in a picture) that students get discouraged from even trying.

Ideally, the task should be solvable (to some level) by a small team of undergraduate students with the assistance of their professor, or a motivated graduate student with a reasonable knowledge of computer vision.  It should not consume a team of students for several months (as in RoboCup), nor should it be as easy as a simple assignment in an undergraduate class.

Format of the competition

Once the task has been selected, a description of the task (along with any associated training/test data and hints) will be posted on the contest Web site.  Teams (competitors) can then start writing their solutions and self-testing their performance, e.g., with their own sample data, or with hold-some-out (cross-validation) methodologies.  After the first phase of the contest closes (say, end of March 2005), each team will be evaluated based on its performance on the test data, and the results will be posted on the Web site.  A second round of the contest, with refined rules and/or data, would then be run in the Spring quarter.

To evaluate people's solutions, we have two potential approaches.  The first is to provide training data to the competitors, and to withhold the test data until the end of the evaluation period.  This prevents people from "tuning" their algorithm to do better on the test data, e.g., by using human interpretation of the imagery.  The second approach is to have people send in their binaries (pre-compiled and linked executables, or MATLAB files set up to run in batch mode), so that these can be run on a standard platform (we would probably support at least Windows and Linux, running on the same hardware configuration).  The algorithms are then given the labeled training data along with the unlabeled test data and a reasonably generous amount of time to produce their answers.  (One possibility is to have the programs emit answers as they are ready, and to compute the score on-line as well as at the end of the elapsed time.)
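
As a rough illustration of the second approach (a sketch, not a prescription), the following Python fragment runs one hypothetical contestant executable, reads answers as they are emitted, and scores them on-line within a time budget.  The command line, file layout, and scoring rule are all placeholders.

    import subprocess, time

    TIME_BUDGET_S = 3600                            # hypothetical per-entry time budget
    ENTRY_CMD = ["./entry", "test_data/"]           # hypothetical contestant executable

    def score_answer(line):
        """Placeholder: compare one emitted answer line against ground truth."""
        return 1.0 if line.strip() else 0.0

    proc = subprocess.Popen(ENTRY_CMD, stdout=subprocess.PIPE, text=True)
    start, total = time.time(), 0.0
    for line in proc.stdout:                        # answers are scored as they arrive
        if time.time() - start > TIME_BUDGET_S:
            proc.kill()                             # out of time: stop the entry
            break
        total += score_answer(line)
    print("final score:", total)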

I'm in favor of the second approach.  Rahul Sukthankar has mentioned that he could run a farm of VMWare machines that emulate either Windows or Unix in a safe, sandboxed environment.  For initial training/testing, a typical set of data could be supplied.  On the competition date, the binaries are run against the new training/test data, which is then made public after the trial and can be used as the basis of further algorithm tuning.

The final programs (perhaps a subset of the best performing programs?) would be run at the ICCV 2005 conference site on the day of the final competition, which would hopefully last only an hour during one of the poster sessions (just like one of the demo sessions).  We would try to structure the evaluation phase so there is some fun visual feedback and on-line scoring as the algorithms are running, in order to make the final competition exciting to watch.

Proposed tasks

People have suggested many different potential tasks, and all of these would make for a fun competition.  At this point, I've narrowed down the choice to the following two problems.  I'm also listing other potential tasks below for future reference (next time the competition is run).

Where am I?

(Suggested by Cordelia Schmid and Rick Szeliski.)  Contestants are given a collection of photographs taken in a city along with the GPS location for each photograph (for a system that uses such information for browsing, see http://wwmx.org).  At test time, some new images, taken from nearby locations, are provided;  the task is then to label each new image with the best possible guess for its GPS location.  Programs are rated on how accurately the location of each image is computed.  (One possibility is to rank-sort the answers for every query image and to award "points" based on how well everyone did.  This removes, to some extent, the test's dependence on the mapping from distance error to score.)
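
To make the rank-sort idea concrete, here is a minimal sketch (the error values are invented): for each query image, teams are ranked by their distance error and awarded points according to that rank, so the final score no longer depends on how distance errors are mapped to points.

    def rank_points(errors_by_team):
        """errors_by_team: {team: GPS distance error (metres)} for one query image.
        The best team gets len(teams)-1 points, the worst gets 0."""
        ranked = sorted(errors_by_team, key=errors_by_team.get)
        return {team: len(ranked) - 1 - i for i, team in enumerate(ranked)}

    # Hypothetical errors (metres) for a single query image:
    print(rank_points({"A": 12.0, "B": 150.0, "C": 48.0}))   # {'A': 2, 'C': 1, 'B': 0}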

Status/comments:  A sample solution to this problem is provided in [Schaffalitzky, F. and Zisserman, A., Multi-view Matching for Unordered Image Sets, or "How Do I Organize My Holiday Snaps?", ECCV 2002, vol. I, pp. 414-431.].  The simplest possible solution just does image-to-image similarity matching (say, based on color histograms or feature matching) and takes a weighted blend of the GPS data to arrive at the answer.  A more sophisticated system would actually triangulate the location of the unknown image by matching features across several images with known GPS locations.
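
A minimal version of that simple baseline might look like the sketch below.  It assumes the images are already loaded as RGB arrays and that each training image carries a (latitude, longitude) tag; it compares joint color histograms with histogram intersection and blends the GPS tags of the most similar training images.

    import numpy as np

    def color_histogram(image, bins=8):
        """image: HxWx3 uint8 RGB array -> normalized joint color histogram."""
        hist, _ = np.histogramdd(image.reshape(-1, 3), bins=(bins,) * 3,
                                 range=((0, 256),) * 3)
        return hist.ravel() / hist.sum()

    def estimate_gps(query_image, train_images, train_gps, k=5):
        """Blend the GPS tags of the k most similar training images."""
        q = color_histogram(query_image)
        sims = np.array([np.minimum(q, color_histogram(t)).sum()   # histogram intersection
                         for t in train_images])
        top = np.argsort(-sims)[:k]
        weights = sims[top] / sims[top].sum()
        return weights @ np.asarray(train_gps)[top]                 # weighted (lat, lon)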

I Spy Puzzle

(Suggested by Irfan Essa and Jim Rehg.)  Find a number of named objects in a picture with hundreds of jumbled, overlapping objects.  To make this tractable, we would probably provide one or more reference views of each object (unobscured).

Comments: The Scholastic I Spy web site has a Flash-based game that lets you create your own puzzles, using a set of cut-out objects that you can place, scale, and rotate.  This simpler "2D rigid (similarity)" version may be too easy, since it is probably solvable using a direct application of view-invariant features like SIFT.  Moving this to full 3D objects (downside: we would have to buy them) with only one or a small number of views would make it more challenging.
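
As an illustration of why the 2D version may be too easy, the sketch below locates one reference object in the puzzle image with SIFT features (using OpenCV; cv2.SIFT_create needs a reasonably recent build).  It fits a 2D similarity transform to the ratio-test matches with RANSAC; the thresholds are placeholders.

    import cv2
    import numpy as np

    def locate_object(reference, puzzle, min_matches=10):
        """Return a 2x3 similarity transform mapping the reference object into
        the puzzle image, or None if too few feature matches survive."""
        sift = cv2.SIFT_create()
        k1, d1 = sift.detectAndCompute(reference, None)
        k2, d2 = sift.detectAndCompute(puzzle, None)
        matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(d1, d2, k=2)
        good = [m for m, n in matches if m.distance < 0.75 * n.distance]   # ratio test
        if len(good) < min_matches:
            return None
        src = np.float32([k1[m.queryIdx].pt for m in good])
        dst = np.float32([k2[m.trainIdx].pt for m in good])
        transform, _ = cv2.estimateAffinePartial2D(src, dst, method=cv2.RANSAC)
        return transform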

Additional proposed tasks

Jigsaw puzzle

(Suggested by Rick Szeliski.)  The task is to take images of jigsaw puzzle pieces scanned on a flatbed scanner, segment each piece from its neighbors, and compute the location and orientation of each piece in the final puzzle.  Two variants of the task would be to solve it with a low-resolution "box top" image, and to solve it without such an image.  Programs are rated on the accuracy of their solution (the number of pieces placed within a certain tolerance of the correct answer) and the amount of time required to solve the puzzle.

Status/comments:  This challenge is generally considered too easy for the contest, so it is only given as an example of one possibility.
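
Even the segmentation step is straightforward: a crude sketch (using OpenCV, assuming the pieces are scanned face-up on a plain background, with all thresholds as placeholders) could threshold the scan, find each piece's outline, and report a centroid and orientation per piece.

    import cv2

    def extract_pieces(scan, min_area=2000):
        """scan: BGR image from the flatbed scanner.  Returns, for each piece,
        a centroid (x, y) and an orientation angle (degrees) from its bounding box."""
        grey = cv2.cvtColor(scan, cv2.COLOR_BGR2GRAY)
        _, mask = cv2.threshold(grey, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        pieces = []
        for c in contours:
            if cv2.contourArea(c) < min_area:          # skip dust and scanner noise
                continue
            (cx, cy), _, angle = cv2.minAreaRect(c)
            pieces.append({"centroid": (cx, cy), "orientation": angle})
        return pieces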

Object (category) recognition

(Suggested by Andrew Zisserman.)  Contestants are given a database of images that has been labeled according to semantic categories (see, e.g., Pietro Perona's 101 categories), potentially associated with segmented regions of each image.  At test time, a new set of similar images is provided, and programs are required to classify (and possibly segment) each image into the correct category.

Status/comments:  An existing contest on exactly this topic is currently being organized by Luc Van Gool, Chris Williams, and Andrew Zisserman under the aegis of the European PASCAL program.  The ICCV 2005 contest could be just one venue where contest results are demonstrated, or a subset of the images/categories could be used as a "mini-contest" for ICCV.  It is anticipated that labeled images for this contest will be available by the end of the year [Chris Williams, personal communication].
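
Just to fix ideas on what an entry would have to do, here is a trivial baseline sketch, assuming each image has already been reduced to a fixed-length feature vector (e.g., a color histogram); anything competitive would of course need far richer features and a real classifier.

    import numpy as np

    def train_class_means(features, labels):
        """features: NxD array, labels: length-N list -> {category: mean feature}."""
        labels = np.asarray(labels)
        return {c: features[labels == c].mean(axis=0) for c in set(labels.tolist())}

    def classify(feature, class_means):
        """Assign the category whose mean feature vector is closest (Euclidean)."""
        return min(class_means, key=lambda c: np.linalg.norm(feature - class_means[c]))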

Video game

(Suggested by Andrew Fitzgibbon.)  Contestants are provided an on-line stream of images from a navigation/shooting (or treasure hunt) game such as Quake or Doom.  Their task is to navigate the environment (using simple turn / move commands), avoid being injured, and possibly shoot at opponents and/or collect special items.  Scoring is based on whatever scoring system is used in the game being emulated.  Competing algorithms can either play against the game engine, or against each other (TBD).

Status/comments:  To make this happen, a version of a game engine for which the source code is available would have to be modified to generate the required images and to accept the competing program's input.  This would require a fair amount of effort on the part of a motivated contest organizer.  Ideally, each program would play in "real time", i.e., it would have a limited time budget (say, 20 ms) to analyze the image (and sound???) and determine its action, at which point the next clock "tick" advances, even if no input is received.
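
Since the game-engine protocol does not exist yet, the interface below is purely hypothetical, but it shows the shape of what each entry would implement: a per-frame policy that must answer within the 20 ms budget or forfeit that tick.

    import time

    TICK_BUDGET_S = 0.020            # the proposed 20 ms per-frame budget

    def choose_action(frame):
        """Placeholder policy: analyze the rendered frame and pick a command."""
        return "move_forward"        # commands such as turn_left/turn_right/shoot/noop

    def play(get_next_frame, send_command):
        """get_next_frame / send_command stand in for the (hypothetical) engine API."""
        while True:
            frame = get_next_frame()
            if frame is None:        # game over
                break
            start = time.time()
            action = choose_action(frame)
            if time.time() - start <= TICK_BUDGET_S:
                send_command(action) # if too slow, the tick advances with no input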

Video surveillance task (coffee pot)

(Suggested by Jana Kosecka.)  Contestants are provided with a video stream watching some location, e.g., a coffee pot in a communal kitchen.  They have to count how many cups of coffee (or other events) are taken by each person.  This requires a combination of person recognition (which could combine face detection with other person-identification technology, such as matching clothing colors) and activity recognition (when a cup of coffee is taken, and perhaps who makes a new cup of coffee).

Status/comments:  The program should probably run in real time, or perhaps even faster than real time, so that a one-hour test video could be processed in a few minutes.  To make data acquisition easier, we could ask each competing team to provide its own video, so that other teams would be tested on new (unseen) videos.
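
A first cut at the activity-detection half (ignoring person identification) might just count bursts of foreground activity near the coffee pot with background subtraction.  The sketch below uses OpenCV, and the thresholds and cooldown are placeholders.

    import cv2

    def count_events(video_path, min_foreground=5000, cooldown_frames=75):
        """Count bursts of motion in the video as candidate 'cup taken' events."""
        cap = cv2.VideoCapture(video_path)
        bg = cv2.createBackgroundSubtractorMOG2()
        events, cooldown = 0, 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            fg = bg.apply(frame)                          # foreground mask
            active = cv2.countNonZero(fg) > min_foreground
            if active and cooldown == 0:
                events += 1                               # start of a new activity burst
                cooldown = cooldown_frames                # ignore the rest of this burst
            cooldown = max(0, cooldown - 1)
        return events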

Newer suggestions (since the original 12-Jul-04 e-mail):

Capturing a soccer game

(Suggested by Yaron Caspi.)  Tracking 22 players + a ball (+ 3 referees) in a soccer game.

Input: a single video stream or multiple streams from static camera(s).  Output: 22+1 sequences of (x, y)_t coordinates.

Getting ground truth: I believe that there are companies that manually extract such data.

Controlling difficulty: real-time vs. off-line processing; sequence length; illumination changes (yes/no); required accuracy; "outliers", e.g., no substitutions (no player leaves the field of view).

Scoring should reward consistent results (i.e., positional accuracy may be limited, but tracks should not switch between players).
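
One way to make that concrete in the scoring (a sketch only, with invented tolerances and penalties) is to reward frames where a track lies close to some true player and to penalize frames where the track jumps to a different true player.

    import numpy as np

    def score_tracks(estimated, ground_truth, dist_tol=2.0, switch_penalty=10.0):
        """estimated, ground_truth: {track_id: Tx2 array of (x, y) per frame}.
        +1 per frame within dist_tol of a true player; -switch_penalty whenever
        a track's nearest true player changes (an identity switch)."""
        truth = {q: np.asarray(t) for q, t in ground_truth.items()}
        score = 0.0
        for est in estimated.values():
            prev_match = None
            for f, pos in enumerate(np.asarray(est)):
                dists = {q: np.linalg.norm(pos - t[f]) for q, t in truth.items()}
                nearest = min(dists, key=dists.get)
                if dists[nearest] <= dist_tol:
                    score += 1.0
                    if prev_match is not None and nearest != prev_match:
                        score -= switch_penalty           # identity switch
                    prev_match = nearest
        return score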

Predicting Basketball Shots

(Suggested by Vaibhav Vaish and Jan-Michael Frahm.)

Problem:  We're watching a video of a basketball game.  A player takes a shot at the basket.  By observing the trajectory of the ball in the video frames, predict whether the ball will go through the hoop or not, before the result can be seen.

Motivation:  This may be a cool feature to add to NBA broadcasts one day.  (I don't watch those, so I am presuming it is not already implemented.)

Details:
For the contest, we propose the following:
- Use videos of a single person throwing the ball, rather than an actual basketball game; this avoids occlusions.
- Use a fixed, wide-FOV camera (or a fixed stereo rig) that sees the entire trajectory of the ball.
- Provide calibration information (camera intrinsics plus a 3D coordinate system in which the camera pose and basket coordinates are known).  The starting position of the ball is not known.
- One judging criterion should be how many frames must be observed before the correct decision is made (a baseline predictor is sketched below).

Variations:
- Contestants could be asked to recover the calibration themselves; if the radius of the ball is known, this can be done (Motilal Agarwal, ICCV 03), which makes the problem harder.
- We could use a larger hoop, or a specially painted ball, to make things easier.
- Ignore deflections off the board (at least in the first stage)

We could come up with similar problems for air hockey, billiards, golf (putting on the green only) etc.
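
Coming back to the basketball version: assuming the ball has already been detected and triangulated into 3D court coordinates (a separate step, made possible by the provided calibration), a baseline predictor could fit a ballistic trajectory to the first few observations and test where it crosses the hoop plane.  In the sketch below the gravity constant, hoop position, and effective hoop radius are assumptions.

    import numpy as np

    G = 9.81                                    # gravity (m/s^2)
    HOOP_CENTRE = np.array([0.0, 0.0, 3.05])    # assumed hoop position (m)
    HOOP_RADIUS = 0.23                          # assumed effective radius (m)

    def predict_make(ball_xyz, times):
        """ball_xyz: Nx3 observed 3D ball positions; times: matching timestamps (s).
        Fit x(t), y(t) linearly and z(t) as a parabola with known gravity, then
        check whether the descending trajectory passes through the hoop."""
        t, p = np.asarray(times), np.asarray(ball_xyz)
        vx, x0 = np.polyfit(t, p[:, 0], 1)
        vy, y0 = np.polyfit(t, p[:, 1], 1)
        vz, z0 = np.polyfit(t, p[:, 2] + 0.5 * G * t ** 2, 1)   # z = z0 + vz*t - g*t^2/2
        a, b, c = -0.5 * G, vz, z0 - HOOP_CENTRE[2]
        disc = b * b - 4 * a * c
        if disc < 0:
            return False                        # the ball never reaches hoop height
        t_hit = (-b - np.sqrt(disc)) / (2 * a)  # later root: the ball coming down
        hit = np.array([x0 + vx * t_hit, y0 + vy * t_hit])
        return np.linalg.norm(hit - HOOP_CENTRE[:2]) <= HOOP_RADIUS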

Suggestion: To attract more participants, we feel it is essential to offer a free trip to ICCV for a sufficient number of entries (at least 10, I'd say).
 

Real-time video augmentation

(Suggested by Bill Triggs.) Augmentation demos have consistently been very popular in the past (e.g. various kinds of virtual object insertion, the head warping demo at CVPR Puerto Rico, the various head tracking and augmenting demos, etc), and they are of very direct interest to the film/graphics end of the community. One could specify a relatively fixed challenge, but I think it would be more fun to allow full scope for creativity by making it very open ended, e.g. "Augment a live video of the ICCV demo room and the people watching the demos. The camera will roam around the room. You can augment the room and/or the people as you wish, add virtual objects or characters, synthesize new viewpoints or lightings, add surprises, etc. Prizes will be awarded for the best technical solution and the most artistic one." This would be difficult to judge and relatively subjective, but one could bring in an industry personality such as Steve Sullivan. Most of all, it would be fun to do and fun to watch.
 

Shell game

(Suggested by Jim Rehg.)  Watch a video of someone hiding a pea under one of three shells, then moving them around trying to fool the viewer, just like at a carnival.  Predict which shell has the pea at the end.

Art recognition

(Suggested by Jim Rehg.)  Recognize art by known artists.  It is not clear what would be given as training data and what as test data.

People

The following people have participated in discussions to date.  If you would like your name to be added to this list, please e-mail me.  I will probably make the selection of the topic at the same time as the selection of the contest committee, since it is likely that people will be more interested in helping out with one task rather than another.

E-mail archive

Click on the above link (or here) to see recent e-mail related to the contest.

Last updated 12-Jul-2004.  Please e-mail your comments to szeliski@microsoft.com.