An Introductory SQL Tutorial for SSA Users

Bob Mann
rgm@roe.ac.uk
Institute for Astronomy
University of Edinburgh
15 March 2004

1. Introduction

2. Primer

2.1 Relational databases
2.2 The SSA as a relational database
2.3 Structured Query Language (SQL)
2.4 The basic structure of an SSA SQL statement
2.5 Null and default values in the SSA
2.6 The SELECT statements defining the SSA views

3. Reference: additional options in SELECT statements

3.1 Aggregate functions
3.2 Spatial functions
3.3 Mathematical functions
3.4 Operators

4. Examples: 20 queries used in the development of the SSA

1. Introduction

In this document we provide a brief introduction to the use of Structured Query Language (SQL) for accessing data in the SuperCOSMOS Science Archive (SSA). We divide this document into a Primer aimed at users new to both the SSA and SQL, a Reference which should be of use to more experienced and returning users, and an Examples section, which presents a set of 20 realistic queries used in the design of the SSA. Readers wanting a fuller introduction to SQL should consult an online tutorial or one of the legion of SQL books available: O'Reilly's SQL in a Nutshell is a good introduction. Some familiarity with SuperCOSMOS and the set of parameters returned by its image analyser is assumed in what follows, so readers may wish to consult the introduction provided by Hambly et al. (2001), MNRAS, 326, 1295, while those wishing to know the differences between the data presented in the SSA and those made available previously through the SuperCOSMOS Sky Survey web interface should consult the SSA Database Overview page.

The SSA is a large database - more than 1 TB in size - so, for test purposes, we have produced the "Personal SSA" (PSSA), a small subset of the SSA containing only the data in the region of the sky with 184 < RA (deg) < 186 and -1.25 < Dec. (deg) < 1.25. This is the same area of sky as the "Personal SkyServer" produced for the Early Data Release (EDR) of the Sloan Digital Sky Survey (SDSS). The PSSA may be downloaded from here (as a .zip file with installation instructions included) or can be queried using a web interface of the same form as for the full SSA. SSA users are strongly encouraged to use the PSSA for developing and debugging queries that they want to run on the SSA: with a database as large as the full SSA it can take a long time to find out that the query you wrote does not do what you intended!

In particular, queries within this tutorial may be run on the PSSA by simply copying the highlighted text from this document and pasting it into the text box of the PSSA's SQL Query form. After each query displayed against a highlighted background, we provide a link to a copy of the output page obtained by running it with the form's default setting, which returns the first 30 rows of the result set to the user's browser. Note that these output pages may differ in detail from the ones you obtain when running the same query: as discussed below, SQL is a set-based language, and all that is guaranteed is that the same query on the same database returns the same result set, with no guarantee as to the order in which rows appear in that result set.

2. Primer

2.1. Relational databases

The SSA is a relational database, which means that it stores data in tables composed of rows and columns. Each row comprises the information stored for one data entry – i.e. a celestial object in the case of the SSA – and there is one column for each of the attributes recorded for that entry – e.g. RA, Dec, ellipticity, etc, for the SSA. The different tables comprising a database may be linked (or related), if they each have columns representing the same data value, and integrity constraints can be included in the table definitions which ensure consistency between two related tables, e.g. by preventing the deletion of only one of a pair of rows in different tables thus linked. For ease of use, it is possible to define virtual tables - called views - which are subsets of the data in one or more tables and which can be queried using the same syntax as ordinary tables (which are sometimes called base tables, to distinguish them from these virtual tables). In addition to tables and views, the major constituents of a relational database are indexes (the database community prefer that spelling to "indices"), which can speed up the identification of records which satisfy the particular condition expressed in a query, and various stored procedures and functions which extend the range of operations which can be performed on data held in the tables. The collection of definitions of columns, tables, views, indexes, stored procedures and functions in a database is called its schema.
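The constituents described above can be tried out on a toy database. The following is only a sketch, assuming Python's built-in sqlite3 module rather than the SQL Server DBMS used by the SSA; the names ToySource, ToyRoundObjects and idxToySourceDec, and all data values, are hypothetical and are not part of the SSA schema.

```python
import sqlite3

# Toy stand-in for a relational database (SQLite, not the SSA's DBMS).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Base table: one row per object, one column per attribute.
    CREATE TABLE ToySource (
        objID       INTEGER PRIMARY KEY,
        ra          REAL,
        dec         REAL,
        ellipticity REAL
    );
    -- Index: can speed up queries that constrain dec.
    CREATE INDEX idxToySourceDec ON ToySource (dec);
    -- View: a virtual table defined by a SELECT on the base table.
    CREATE VIEW ToyRoundObjects AS
        SELECT objID, ra, dec FROM ToySource WHERE ellipticity < 0.2;
""")
conn.executemany(
    "INSERT INTO ToySource VALUES (?, ?, ?, ?)",
    [(1, 184.5, -0.5, 0.05), (2, 185.0, 0.3, 0.60), (3, 185.7, 1.1, 0.10)],
)

# A view is queried with the same syntax as a base table.
round_ids = [row[0] for row in conn.execute(
    "SELECT objID FROM ToyRoundObjects ORDER BY objID")]

# SQLite records its schema in sqlite_master: tables, indexes and views.
schema = dict(conn.execute(
    "SELECT name, type FROM sqlite_master WHERE name LIKE '%Toy%'"))

print(round_ids)
print(schema)
```

The view here plays the same role as the SSA views described in the next section: the constraint on ellipticity is written once, in the view definition, and users of the view need not repeat it.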

2.2. The SSA as a relational database

The SSA schema is described in detail elsewhere, but we recap here the basic features which we shall use later. The two major tables in the SSA are called Detection and Source. The columns in Detection are basically the attributes derived by running the SuperCOSMOS image analyser over a single plate scan, and these single-plate detections are then merged into multi-epoch, multi-colour records for individual celestial objects, which are stored in Source. In addition to these two major tables, there are also a number of metadata tables, which store ancillary information describing the processes involved in obtaining and reducing SuperCOSMOS data, and which enable the provenance of data values in Source and Detection to be traced all the way back to a given glass plate exposed in an observation of a particular survey field made under known conditions and subsequently processed using a certain set of calibration coefficients.

The SSA uses the same set of spatial access routines as the SDSS SkyServer, based on the Hierarchical Triangular Mesh (HTM) pixelisation of the celestial sphere, which was developed at Johns Hopkins University. To aid spatial matching of objects within the SSA and between the SSA and the SDSS EDR, respectively, there are also "Neighbours" and "CrossNeighboursEDR" tables which record pairs of sources within 10 arcsec of one another.

Three views are defined in v1.0 of the SSA: ReliableStars, CompleteStars and ReliableGalaxies. As their names suggest, these are intended for use when well defined subsamples of stars or galaxies with high completeness or reliability are required, and they are defined in terms of selections on attributes in the Source table. Their advantage is that the user does not need to remember the constraints (detailed in Section 2.6 below) on the attributes required to define the subsample, but can simply query it using the view created to constitute that subsample.

Users should check which attributes in which tables have been indexed in the v1.0 SSA, since the performance of queries that can make use of them should be significantly better than for those which do not: this information is presented in the SSA Browser.

2.3. Structured Query Language (SQL)

SQL is the standard language for accessing and manipulating data stored in a relational database. In fact, several versions of the SQL standard exist, and most database management systems (DBMSs) actually support a subset of standard SQL, with some vendor-specific additions. The SSA is currently implemented in Microsoft's SQL Server 2000 DBMS, so SSA users will employ its SQL dialect, although we have tried to restrict the use of vendor-specific features to a minimum. A fuller reference on this SQL dialect than presented here is available online here.

The first thing to understand about SQL is that it is a set-based language, not a procedural language, like Fortran or C. A user submitting an SQL query to a relational database is defining the set of properties of the records that she wants returned from the database, not specifying the list of operations which will lead to their delivery; this latter is the responsibility of the DBMS engine, which will decide the best way to execute a given query from a set of possible execution plans. Many database vendors are adding procedural capabilities to the SQL dialects they support, and these constitute one of the main areas of difference between those dialects. These extensions will not be discussed here, as we shall concentrate on the basics of standard SQL.
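One practical consequence of the set-based approach is that row order is undefined unless the query asks for one. The sketch below illustrates this on a toy table, assuming Python's built-in sqlite3 module in place of the SSA's DBMS; the three Field rows are invented values. It uses the standard ORDER BY clause to request a particular order, and compares unordered results as sets.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Field (fieldID INTEGER, nominalRA REAL, nominalDec REAL)")
# Hypothetical rows, deliberately inserted out of RA order.
conn.executemany("INSERT INTO Field VALUES (?, ?, ?)",
                 [(3, 190.0, -5.0), (1, 180.0, -5.0), (2, 185.0, 0.0)])

# Without ORDER BY only the *set* of rows is defined, so compare as sets.
unordered = conn.execute("SELECT fieldID FROM Field").fetchall()

# ORDER BY states the required order; how to achieve it is left to the DBMS.
ordered = conn.execute(
    "SELECT fieldID FROM Field ORDER BY nominalRA").fetchall()

print(set(unordered) == set(ordered))
print([row[0] for row in ordered])
```

In both queries the user states what is wanted and the engine chooses the execution plan; the ORDER BY clause simply adds the ordering to the statement of what is wanted.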

2.4. The basic structure of an SSA SQL statement

For security reasons, the SSA does not allow users to execute queries which affect the basic structure and contents of the database, only those which extract data from it. In SQL terms, this means that only SELECT statements are allowed. (N.B. In this tutorial we write all SQL keywords in upper case italics and some column names in mixed case, both for clarity, although the SSA's SQL dialect is case insensitive by default.) There are three basic classes of SELECT statement:

2.4.1 Projections

A projection is the retrieval of a set of full columns from a table. To retrieve the nominal RAs and Decs of the centres of all sky survey fields in the SSA, one would type:

SELECT nominalRA, nominalDec FROM Field

[Link to demo result set]
where Field is the name of the SSA table which records information about sky survey fields, and nominalRA and nominalDec are the names of the relevant columns in that table.

2.4.2 Selections

A selection is the retrieval of the data values in particular columns for those rows in a table which satisfy certain criteria. So, if one were interested only in fields whose nominal centres lie in a 10 degree strip south of the celestial equator, the appropriate SQL query would be:

SELECT nominalRA, nominalDec
FROM Field
WHERE nominalDec BETWEEN -10 AND 0

[Link to demo result set]
In this example the SQL statement has been split into three lines to emphasise the SELECT…FROM…WHERE form of the selection, but this is still one SQL statement. The SQL Query Form in the SSA interface ignores the whitespace at the end of each line of text and generates a single query string from valid multi-line text like this.
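For readers without the PSSA to hand, the difference between a projection and a selection, and the equivalence of multi-line and single-line statements, can be checked on a toy table. This is a sketch only: it assumes Python's built-in sqlite3 module rather than the SSA's SQL Server, and the five field centres are invented values, not SSA data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Field (fieldID INTEGER, nominalRA REAL, nominalDec REAL)")
# Hypothetical field centres, in degrees.
conn.executemany("INSERT INTO Field VALUES (?, ?, ?)",
                 [(1, 0.0, -15.0), (2, 5.0, -10.0), (3, 10.0, -5.0),
                  (4, 15.0, 0.0), (5, 20.0, 5.0)])

multi_line = """SELECT nominalRA, nominalDec
FROM Field
WHERE nominalDec BETWEEN -10 AND 0"""
# Joining the lines with single spaces gives an equivalent one-line statement.
single_line = " ".join(multi_line.split())

# Projection: every row of the chosen columns.
projection = conn.execute(
    "SELECT nominalRA, nominalDec FROM Field").fetchall()
# Selection: only the rows satisfying the WHERE clause.
selection = conn.execute(multi_line).fetchall()
same = sorted(selection) == sorted(conn.execute(single_line).fetchall())

print(len(projection), len(selection), same)
```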
Multiple constraints can be included in the WHERE clause of a selection, so, for example, the query above could be rewritten as:

SELECT nominalRA, nominalDec FROM Field WHERE (nominalDec >= -10) AND (nominalDec <= 0)

[Link to demo result set]
while the field centres of all other fields could be selected using the following statement:

SELECT nominalRA, nominalDec FROM Field WHERE (nominalDec < -10) OR (nominalDec > 0)

[Link to demo result set]
The parentheses in these examples have been included for clarity; they are only required when needed to avoid ambiguity, or to override the standard order of precedence amongst operators, outlined in Section 3.4.9. (Users should note that the accidental omission of the WHERE clause from a selection turns it not into an invalid query, but into the projection of the columns contained in its SELECT clause, which, for tables as large as the Source and Detection tables of the SSA - both of which have in excess of one billion rows - will return a lot of data.)
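Two details of WHERE clauses are easy to get wrong and worth checking on a small table: BETWEEN includes both of its end points, and AND binds more tightly than OR when parentheses are omitted. The sketch below assumes Python's built-in sqlite3 module, which follows the same rules on these two points; the helper ids() and the field centres are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Field (fieldID INTEGER, nominalRA REAL, nominalDec REAL)")
# Hypothetical field centres, in degrees.
conn.executemany("INSERT INTO Field VALUES (?, ?, ?)",
                 [(1, 0.0, -15.0), (2, 5.0, -10.0), (3, 10.0, -5.0),
                  (4, 15.0, 0.0), (5, 20.0, 5.0)])

def ids(where):
    """Return the sorted fieldIDs of the rows satisfying a WHERE clause."""
    sql = "SELECT fieldID FROM Field WHERE " + where
    return sorted(row[0] for row in conn.execute(sql))

# BETWEEN includes both end points ...
inclusive = ids("nominalDec BETWEEN -10 AND 0")
# ... whereas strict inequalities drop the fields at exactly -10 and 0.
strict = ids("(nominalDec > -10) AND (nominalDec < 0)")

# AND is evaluated before OR, so these two clauses select different rows.
default = ids("nominalDec < -10 OR nominalDec > 0 AND nominalRA > 10")
grouped = ids("(nominalDec < -10 OR nominalDec > 0) AND nominalRA > 10")

print(inclusive, strict)
print(default, grouped)
```

When in doubt, adding parentheses costs nothing and makes the intended grouping explicit.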

2.4.3 Joins

A join is the retrieval of data entries from one or more tables in a database matched under some criterion. Extending our example above, a user may be interested in the dates on which SSA exposures in this equatorial strip were taken. The Plate table in the SSA has an attribute called MJD, which records the Modified Julian Date at the midpoint of the exposure of each photographic plate making up the SSA. The Plate and Field tables are linked by having the common attribute fieldID, which is a unique identifier for each sky survey field (e.g. Field 1 in the ESO/SRC field system has a different fieldID value to Field 1 in the Palomar system). The SQL query retrieving the desired dates here would be:

SELECT mjd, nominalRA, nominalDec
FROM field, plate
WHERE (nominalDec BETWEEN -10 AND 0)
AND (field.fieldID = plate.fieldID)
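The effect of the join condition can be seen on a pair of toy tables. This is a minimal sketch, assuming Python's built-in sqlite3 module instead of the SSA's DBMS; the plateID column and all data values are invented for illustration and may not match the real Plate table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Field (fieldID INTEGER PRIMARY KEY,
                        nominalRA REAL, nominalDec REAL);
    CREATE TABLE Plate (plateID INTEGER PRIMARY KEY,
                        fieldID INTEGER, mjd REAL);
""")
# Hypothetical data: field 1 has two plates, field 2 has one.
conn.executemany("INSERT INTO Field VALUES (?, ?, ?)",
                 [(1, 180.0, -5.0), (2, 185.0, 5.0)])
conn.executemany("INSERT INTO Plate VALUES (?, ?, ?)",
                 [(10, 1, 45000.5), (11, 1, 47000.5), (12, 2, 46000.5)])

# The join: each field in the strict is matched to its own plates only.
rows = conn.execute("""
    SELECT mjd, nominalRA, nominalDec
    FROM Field, Plate
    WHERE (nominalDec BETWEEN -10 AND 0)
    AND (Field.fieldID = Plate.fieldID)
    ORDER BY mjd
""").fetchall()

# Dropping the fieldID condition pairs the selected field with every plate.
unmatched = conn.execute(
    "SELECT COUNT(*) FROM Field, Plate WHERE nominalDec BETWEEN -10 AND 0"
).fetchone()[0]

print(rows)
print(unmatched)
```

Only field 1 lies in the strip, so the join returns its two plates, while the query without the fieldID condition counts three rows, one for every plate in the table.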

4. Examples: 20 queries used in the development of the SSA

Q1: Find the positions of all galaxies brighter than magnitude 20 in B with a local B band extinction > 0.75 mag.

Q2: Provide the positions and magnitudes of stars for which the magnitudes from the two R band surveys differ by more than 3 magnitudes.

Q3: Find the positions of all galaxies with a profile statistic > 10 in all detected wavebands and photometric colours consistent with being an elliptical galaxy.

Q4: Provide the mean positions and magnitudes of any stellar objects with colours and proper motions consistent with being a white dwarf.

Q5: Find the positions and (B,R,I) magnitudes of all star-like objects within delta mag of 0.2 of the colours of a quasar of redshift 2.5 < z < 3.5.

Q6: Find unpaired objects.

Q7: Provide a list of star-like objects that are 1% rare in (B-R,R-I)-space.

Q8: Create a gridded count of galaxies with B-R>1.2 and R<19 over 184<RA<186 and -1.25<Dec<1.25, on a grid of 2 arcmin.

Q9: Create a count of galaxies in Level-9 HTM triangles which satisfy a certain colour cut, like 0.7B-0.5R-0.2I<0.8 and R<19.

Q10: Find the positions of all galaxies with a pixel brighter than the highest areal profile threshold in any band within 1 degree of a given point (185.0,0.0) in the sky.

Q11: Find the plate numbers of those plates with nominal centres within 20 degrees of (185,0).

Q12: Find the positions and (B-R,R-I) colours of all galaxies with blue band area between 100 and 200 pixels, -10 < supergalactic latitude (sgb)/degrees < 10, and declination less than zero, and return them in colour order.

Q13: Find the positions, R band magnitudes and B-R colours of all galaxies with an area greater than 100 pixels and a major axis 10 < d/arcsec < 30 in the red band and with an ellipticity>0.5.

Q14: Find galaxies that are blended with a star and output the deblended magnitudes.

Q15: Find all pairs of objects within 10 arcsec of one another that have very similar colours, and return their positions and B band magnitudes.

Q16: Find the positions of stars with Sloan 5-band colours and SSA proper motions which are consistent with their being subdwarfs.

Q17: Provide a list of positions of galaxies whose Sloan and SSA magnitudes are consistent with there having been a supernova in the galaxy at one of its epochs of observation.

Q18: Provide a count of high-quality star-like sources brighter than 16th magnitude which are in either the SSA or SDSS but not both.

Q19: Provide the positions of star-like objects with SDSS colours consistent with being a quasar and positions consistent with not having moved between all the epochs in the SSA.

Q20: Provide a list of SSA objects within a magnitude of their respective nominal plate limit which are unpaired in the SSA and have no SDSS counterpart.