Querying collections of XML documents: a user
interface
Oreste Signore - Marco Andreini - Cristian
Lucchesi - Silvia Martelli
W3C Italian Office at the C.N.R.
Area della Ricerca di Pisa San Cataldo - Via G. Moruzzi, 1 -
56124 Pisa
Email: oreste@w3.org
XML for Cultural Heritage
Experiences and perspectives on the treatment of structured and
semi-structured data
Scuola Normale Superiore
Pisa, 25 March 2004
Talk layout
- Main goals
- Design issues
- Demo
- Conclusion
This work has been financed by the project
QUESTION-HOW (Quality Engineering Solutions via Tools,
Information and Outreach for the New Highly-enriched Offerings
from W3C: Evolving the Web in Europe), contract
IST-2000-28767
User needs in accessing XML data collections
XML has no semantics per se ...
... hence user needs:
- understanding structure
- understanding semantics
- sharing his/her knowledge base with the
indexer
Main goals
- Supporting the user in formulating a "semantically
correct" query
- understanding element semantics
- entering correct data (list of values, dictionaries,
constraints)
- browsing concepts (access to thesauri, ontologies,
etc.)
- Supporting user interaction:
- multilinguality
- different cultures (format conversions, e.g. for
calendars)
- different expertise levels
- formulating complex queries without learning
XPath
- Adjusting the output (inserting hyperlinks, emphasizing
search terms, processing the text)
- Leaving the existing data collection and its
schema unaffected
- Keeping the architecture open and
distributed
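As an illustration of the "adjusting the output" goal, here is a minimal Python sketch of emphasizing the searched term in a result snippet. The function name, the choice of HTML as the output format and the `<em>` markup are assumptions made for this example; the slides do not say how the system marks up its output.

```python
import html
import re


def emphasize(text: str, term: str) -> str:
    """Wrap case-insensitive occurrences of term in <em>, escaping everything as HTML."""
    if not term:
        return html.escape(text)
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    parts = []
    last = 0
    for match in pattern.finditer(text):
        # Escape the text between matches, then wrap the match itself.
        parts.append(html.escape(text[last:match.start()]))
        parts.append(f"<em>{html.escape(match.group(0))}</em>")
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)


print(emphasize("Pisa & pisa", "pisa"))
```

The term is passed through `re.escape` so that user input is matched literally rather than interpreted as a regular expression.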
Architecture: a rough sketch
Architecture: fully web-based
Architecture: main features
- The XML Schema is externally annotated in RDF
- Metadata are also stored in the system objects.
- RDF descriptions can be imported into the system.
- The system can export the RDF annotation.
- A component allows metadata to be added to the system
objects
- The user can browse the structure and formulate the
query
- For each element/attribute the system knows and can
show to the user:
- The query is prepared in a general format that can be
mapped onto different search engines
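To make the last point concrete, here is a minimal Python sketch of mapping an engine-independent list of conditions onto one possible target, an XPath expression. The tuple representation, the function names and the element names in the usage line are assumptions for illustration; the slides do not describe the actual general format.

```python
from typing import List, Tuple

# Assumed generic form: (child element name, required value).
Condition = Tuple[str, str]


def xpath_literal(value: str) -> str:
    """Quote a value for XPath 1.0, which has no escape mechanism inside literals."""
    if "'" not in value:
        return f"'{value}'"
    if '"' not in value:
        return f'"{value}"'
    raise ValueError("value mixes both quote characters; not handled in this sketch")


def to_xpath(context: str, conditions: List[Condition]) -> str:
    """Render the conditions as predicates on the context element."""
    predicates = "".join(
        f"[{name}={xpath_literal(value)}]" for name, value in conditions
    )
    return f"{context}{predicates}"


# Hypothetical element names.
print(to_xpath("./item", [("title", "A"), ("year", "1500")]))
```

A second backend would take the same list of conditions and render it in the syntax of another engine, which is what keeps the interface independent of the search engine.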
The RDF annotation
- Identification of elements
- Properties of elements
- description
- searchability
- role (structure vs content)
- Constraints
- thesaurus, dictionary, externalList, localList
- web service to invoke
- dependencies among values of elements
- Pre-processing options
- Post-processing options
The Administrator must have in-depth knowledge of the
document collection and the related knowledge
domain
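The annotation properties listed on this slide can be pictured as a set of triples about each schema element. The sketch below uses plain Python rather than an RDF library; the namespace URI, the predicate names and the element path are hypothetical, since the slides do not give the real annotation vocabulary.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple

# Hypothetical namespace; the real vocabulary URI is not given in the slides.
ANN = "http://example.org/annotation#"


@dataclass
class ElementAnnotation:
    """Annotation of one schema element, mirroring the properties on the slide."""
    path: str                     # identification of the element
    description: str
    searchable: bool = True
    role: str = "content"         # "structure" or "content"
    local_list: List[str] = field(default_factory=list)  # a localList constraint

    def triples(self) -> Iterator[Tuple[str, str, str]]:
        """Yield (subject, predicate, object) triples describing the element."""
        subject = f"schema#{self.path}"
        yield (subject, ANN + "description", self.description)
        yield (subject, ANN + "searchable", str(self.searchable).lower())
        yield (subject, ANN + "role", self.role)
        for value in self.local_list:
            yield (subject, ANN + "localListValue", value)

    def accepts(self, value: str) -> bool:
        """Check a user-supplied value against the localList constraint, if any."""
        return not self.local_list or value in self.local_list


# Hypothetical element and values.
material = ElementAnnotation(
    path="/catalogue/item/material",
    description="Material of the object",
    local_list=["bronze", "marble"],
)
for triple in material.triples():
    print(triple)
```

Because the triples live outside the XML Schema, the schema itself stays unchanged, which matches the goal of not affecting the existing collection.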
The query
Frustrating traditional approaches:
- select a field from a pop-up window
- insert parentheses and Boolean operators (selecting from
different options)
- difficult to formulate complex queries
- difficult to reuse elementary conditions
Composition:
- A query is a composition of several queryFragments
(each conceptually a single navigation of the document tree)
- A queryFragment is made up of several queryTokens
- A queryToken is the elementary component (a single
interaction with the queryForm on a single element)
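The three levels of composition described above can be sketched as a small data model. This is only an illustration in Python: the field names and the operator strings are assumptions, and the slides do not say how fragments are combined with one another.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class QueryToken:
    """Elementary condition: one interaction on one element of the query form."""
    path: str       # path of the element in the document tree
    operator: str   # e.g. "=" or "contains" (operator names are assumptions)
    value: str


@dataclass
class QueryFragment:
    """Tokens collected during a single navigation of the document tree."""
    tokens: List[QueryToken] = field(default_factory=list)


@dataclass
class Query:
    """A complete query: a composition of fragments."""
    fragments: List[QueryFragment] = field(default_factory=list)

    def all_tokens(self) -> List[QueryToken]:
        """Flatten the query into its elementary conditions, in order."""
        return [token for fragment in self.fragments for token in fragment.tokens]


# Hypothetical paths; the same token object is reused in two fragments.
by_title = QueryToken("/catalogue/item/title", "contains", "Pisa")
by_year = QueryToken("/catalogue/item/year", "=", "1500")
query = Query([QueryFragment([by_title, by_year]), QueryFragment([by_title])])
print(len(query.all_tokens()))
```

Keeping tokens as separate objects is one way to address the reuse problem noted for traditional interfaces: an elementary condition is defined once and referenced wherever it is needed.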
Normalization:
- identify the common ancestor for clauses on elements
belonging to independent paths
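One plausible reading of this normalization step is that clauses on elements in different branches must be anchored at the deepest element that contains them all. The Python sketch below computes that element from slash-separated paths; the path notation and the element names are assumptions, not the system's actual representation.

```python
from typing import List


def common_ancestor(paths: List[str]) -> str:
    """Return the deepest path shared by all the given slash-separated paths."""
    split = [path.strip("/").split("/") for path in paths]
    shared = []
    # zip stops at the shortest path; stop earlier at the first differing step.
    for steps in zip(*split):
        if len(set(steps)) != 1:
            break
        shared.append(steps[0])
    return "/" + "/".join(shared)


# Hypothetical paths from two independent branches of the same item.
print(common_ancestor([
    "/catalogue/item/author/name",
    "/catalogue/item/title",
]))
```

For the two paths above the result is `/catalogue/item`, so both clauses would be evaluated against the same item rather than against unrelated parts of the collection. If one path is a prefix of another, the function returns that prefix itself.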
The sample document collection ...
... and a (partial) graphical representation of it
Specifying constraints
The XML document collection
- XML Schema is required
- Document collection is:
- compressed
- indexed
- queried
- retrieved
The XCDE search engine
- XCDE library
developed at the Department of Computer Science, University of
Pisa (prof. Paolo Ferragina).
- Best suited for collections made up of a limited number of
large documents
- Query language oriented mainly to the processing and
management of textual data
- Query syntax:
- similar to SQL: SELECT-FROM-RETURN
- the SELECT clause is specified by means of a piece
of well-formed XML text
- implements most of the IR functionalities detailed
in the W3C Working Draft of XQuery and
XPath Full-Text Requirements, as well as other powerful
string-based queries (regular expressions and error
matches).
- the output of the query (the snippet) can be formatted within the
RETURN clause which, again, includes a piece of
well-formed XML text.
- A special attribute named xml_var (pivot) is added
to elements within the SELECT clause in order to identify some
"interesting points" in each document subtree that matches the
query.
Pivots are used in the RETURN clause to indicate the way in which
these points in the matching subtree must be
visualized.
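The notion of a pivot can be illustrated independently of XCDE. The Python sketch below is not XCDE syntax or its API: it only parses a hypothetical well-formed XML fragment in which some elements carry an `xml_var` attribute and reports which element each pivot name points to. The fragment and its element names are invented for the example.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def collect_pivots(select_xml: str) -> Dict[str, str]:
    """Map each xml_var name found in the fragment to the path of its element."""
    pivots: Dict[str, str] = {}

    def walk(elem: ET.Element, parent_path: str) -> None:
        path = f"{parent_path}/{elem.tag}"
        name = elem.get("xml_var")
        if name is not None:
            pivots[name] = path
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(select_xml), "")
    return pivots


# Hypothetical fragment; not taken from the XCDE documentation.
fragment = (
    '<item><title xml_var="t">Pisa</title>'
    '<author><name xml_var="a"/></author></item>'
)
print(collect_pivots(fragment))
```

A formatting step could then refer to the names `t` and `a` to decide how the corresponding points of each matching subtree are displayed, which is the role the slide assigns to pivots in the RETURN clause.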
Demo
Conclusion and future plans
- A consistent framework for intelligent and
controlled access to XML data collections has been
implemented
- RDF allowed the annotation of the (unchanged)
schema
- Administrator can point to any appropriate knowledge
source (accessing thesauri via web services)
- The interface is generalized:
- any document collection
- any search engine
- Some features are not yet implemented (but they are just
"conventional development")
- Future developments can consider:
- accessing heterogeneous document collections
- inserting and updating documents
Thank you for your attention
?
If it isn't on the Web ...
... it doesn't exist
This presentation will be on the Office Web Site (http://www.w3c.it/talks/sns2004/)