Math IR Happening at MIR 2012
The MIR 2012 workshop will feature a friendly competition for the systems presented at
the workshop. Since math information retrieval is still quite young and developing, we
will not make this an official competition, but a happening, where we get together and
test our systems on a common set of problems. We expect the happening to extend beyond
the workshop proper.
The aim of the MIR happening is to jointly gain a better understanding of the
information retrieval needs of mathematicians and of the strengths and weaknesses of
the respective IR approaches and systems. As a tangible result of the happening, the
organizers will compile a survey paper reporting on this newly gained understanding.
In particular, it is not an aim of the MIR happening to determine "winners" of the
competition in any form. That may be an aim of a subsequent competition, when we have a
better grip on the problems and possible evaluation approaches.
MIR Challenges
We plan to conduct the happening via three challenges:
- Formula Search (Automated) in the categories:
  - similarity search for formulae
  - instance search (query formulae with query variables)
The judges select/prepare a formula database and a set of formula queries. The formula
database contains a list of formulae with identifiers; every formula is given in two
encodings: LaTeX and MathML (parallel presentation/content markup). The query formulae
are in the same format, extended by query variables. Participating IR systems obtain
the formula database and the list of formula queries and return for every query an
ordered list of hits (identifiers of formulae claimed to match the query), plus
possible supporting evidence (e.g. a substitution for instance queries; see the
illustrative example after this list). Results will be judged on precision, recall,
result ordering, and search time.
- Full-Text Search (Automated) This is like formula search above, only that we use a
document collection (LaTeX and XHTML+MathML with parallel markup) and a set of
text/formula queries (in the same formats) instead of pure formulae. IR results are
ordered lists of documents (i.e. [XPointer] references into the documents with
highlighted result snippets as supporting evidence; see the sketch after this list)
and will be judged on precision, recall, result ordering, search time, and
presentation of the documents.
- Open Information Retrieval (Semi-Automated) In contrast to the first two challenges,
where the systems are run in batch mode (i.e. without human intervention), in this one
mathematicians will challenge the (human) contestants to find specific information in
a document corpus via human-readable definite descriptions (natural language text),
which the contestants translate into queries for their IR systems. Results to be
delivered are hits in free form, together with a description of how the results were
found.
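To make the instance-search category concrete, here is a hypothetical query/hit pair;
the query-variable syntax (?a, ?b, ...) and the formula identifier are illustrative
only, as the actual formats will be fixed by the judges:

  query (instance):  ?a^2 + ?b^2 = ?c^2        (query variables ?a, ?b, ?c)
  hit:               F0815: x^2 + y^2 = z^2
  evidence:          substitution ?a -> x, ?b -> y, ?c -> z

A similarity query would instead be a plain formula, with hits ranked by how closely
they resemble it.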
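For reference, precision and recall are the standard IR effectiveness measures: if R
is the set of formulae (or documents) relevant to a query and H the set of hits
returned, then

  precision = |R ∩ H| / |H|        recall = |R ∩ H| / |R|

so precision penalizes spurious hits, recall penalizes missed ones, and result
ordering additionally rewards placing relevant hits near the top of the list.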
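A full-text hit could take the following shape; the document name and XPointer
expression are made up for illustration (the xpointer() scheme addresses a document
fragment via an XPath expression):

  D1234: paper42.xhtml#xpointer(//div[@id='sec3']/p[2])
  snippet: "... by the Cauchy-Schwarz inequality we obtain ..."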
MIR Judges Panel
We have invited a panel of mathematicians participating in CICM to act as judges; they
will select/prepare the MIR challenges, judge the solutions of the contestants, and
provide overall feedback.
FAQ: open issues to be discussed
Q: Do we also judge indexing time? A: No, at least not as a main criterion.
Q: Do we give the formula databases or document corpora to the
contestants ahead of time? A: Yes, the formula database and document corpora
will be released on July 5th, in the previously announced formats.
These questions and partial answers, which we adopt for the
MIR happening, are available here.