Bob Mann
rgm@roe.ac.uk
Institute for Astronomy
University of Edinburgh
15 March 2004
In this document we provide a brief introduction to the use of Structured Query Language (SQL) for accessing data in the SuperCOSMOS Science Archive (SSA). We divide this document into a Primer aimed at users new both to the SSA and to SQL, a Reference which should be of use more experienced and returning users, and an Examples section, which presents a set of 20 realistic queries used in the design of the SSA. Readers wanting a fuller introduction to SQL should consult an online tutorial or one of the legion of SQL books available: O'Reilly's SQL in a nutshell is a good introduction. Some familiarity with SuperCOSMOS and the set of parameters returned by its image analyser are assumed in what follows, so readers may wish to consult the introduction provided by Hambly et al. (2001), MNRAS, 236, 1295, while those wishing to know the differences between the data presented in the SSA and those made available previously through the SuperCOSMOS Sky Survey web interface should consult the SSA Database Overview page.
The SSA is a large database - more than 1 TB in size - so, for test purposes, we have produced the "Personal SSA" (PSSA), which is a small subset of the SSA, containing solely the data in the region of the sky with 184 < RA (deg) < 186 and -1.25 < Dec. (deg) < 1.25, which is the same area of sky as the "Personal SkyServer" produced for the Early Data Release (EDR)of the Sloan Digital Sky Survey (SDSS). The PSSA may be downloaded from here (as a .zip file with installation instructions included) or can be queried using a web interface of the same form as for the full SSA. SSA users are strongly encouraged to use the PSSA for developing and debugging queries that they want to run on the SSA: with a database as large as the full SSA it can take a long time to find out that the query you wrote does not do what you intended!
In particular, queries within this tutorial may be run on the PSSA by simply copying the highlighted text from this document and pasting it into the text box of the PSSA's SQL Query form, and after each one displayed against a highlighted background we provide a link to a copy of the output page obtained from running the query with the form's default setting of returning to the user's browser the first 30 rows of the result set. Note that these output pages may differ in detail from the ones you may obtain running the same query. As discussed below, SQL is a set-based language, and all that is guaranteed is that the same query on the same database returns the same result set, with no guarantees as to the order in which rows appear in that result set.
The SSA is a relational database, which means that it stores data in tables composed of rows and columns. Each row comprises the information stored for one data entry – i.e. a celestial object in the case of the SSA – and there is one column for each of the attributes recorded for that entry – e.g. RA, Dec, ellipticity, etc, for the SSA. The different tables comprising a database may be linked (or related), if they each have columns representing the same data value, and integrity constraints can be included in the table definitions which ensure consistency between two related tables, e.g. by preventing the deletion of only one of a pair of rows in different tables thus linked. For ease of use, it is possible to define virtual tables - called views - which are subsets of the data in one or more tables and which can be queried using the same syntax as ordinary tables (which are sometimes called base tables, to distinguish them from these virtual tables). In addition to tables and views, the major constituents of a relational database are indexes (the database community prefer that spelling to "indices"), which can speed up the identification of records which satisfy the particular condition expressed in a query, and various stored procedures and functions which extend the range of operations which can be performed on data held in the tables. The collection of definitions of columns, tables, views, indexes, stored procedures and functions in a database is called its schema.
The SSA schema is described in detail elsewhere, but we recap here the basic features which we shall use later. The two major tables in the SSA are called Detection and Source. The columns in Detection are basically the attributes derived by running the SuperCOSMOS image analyser over a single plate scan, and these single-plate detections are then merged into multi-epoch, multi-colour records for individual celestial objects, which are stored in Source. In addition to these two major tables, there are also a number of metadata tables, which store ancillary information describing the processes involved in obtaining and reducing SuperCOSMOS data, and which enable the provenance of data values in Source and Detection to be traced all the way back to a given glass plate exposed in an observation of a particular survey field made under known conditions and subsequently processed using a certain set of calibration coefficients. The SSA uses the same set of spatial access routines as the SDSS SkyServer, based on the Hierarchical Triangular Mesh (HTM) pixelisation of the celestial sphere, which was developed at Johns Hopkins University. To aid spatial matching of objects within the SSA and between the SSA and the SDSS EDR, respectively, there are also "Neighbours" and "CrossNeighboursEDR" tables which record pairs of sources within 10 arcsec of one another. Three views are defined in v1.0 of the SSA: ReliableStars, CompleteStars and ReliableGalaxies. As their names suggest, these are intended for use when well defined subsamples of stars or galaxies with high completeness or reliability are required, and they are defined in terms of selections on attributes in the Source table. Their advantage is that the user does not need to remember the constraints (detailed in Section 2.6 below) on the attributes required to define the subsample, but can simply query it using the view created to constitute that subsample. Users should check which attributes in which tables have been indexed in the v1.0 SSA, since the performance of queries that can make use of them should be significantly better than for those which do not: this information is presented in the SSA Browser.
SQL is the standard language for accessing and manipulating data stored in a relational database. In fact, several versions of the SQL standard exist, and most database management systens (DBMSs) actually support a subset of standard SQL, with some vendor-specific additions. The SSA is currently implemented in Microsoft's SQL Server 2000 DBMS, so SSA users will employ its SQL dialect, although we have tried to restrict the use of vendor-specific features to a minimum. A fuller reference on this SQL dialect than presented here is available on line here.
The first thing to understand about SQL is that it is a set-based language, not a procedural language, like Fortran or C. A user submitting an SQL query to a relational database is defining the set of properties of the records that she wants returned from the database, not specifying the list of operations which will lead to their delivery; this latter is the responsibility of the DBMS engine, which will decide the best way to execute a given query from a set of possible execution plans. Many database vendors are adding procedural capabilities to the SQL dialects they support, and these constitute one of the main areas of difference between those dialects. These extensions will not be discussed here, as we shall concentrate on the basics of standard SQL.
For security reasons, the SSA does not allow users to execute queries which affect the basic structure and contents of the database, only those which can extract data from it. In SQL terms, this means that only SELECT statements are allowed: N.B. in this tutorial we write all SQL keywords in upper case italics and some column names in mixed case, both for clarity, although the SSA's SQL dialect is case insensitive by default. There are three basic classes of SELECT statement:
A projection is the retrieval of a set of full columns from
a table. To retrieve the nominal RAs and Decs of the centres of all sky
survey fields in the SSA, one would type:
A selection is the retrieval of the data values in
particular columns for those rows in a table which satisfy certain critieria.
So, if one were interested only in fields whose nominal centres lie in a 10
degree strip south of the celestial equator, the appropriate SQL query would
be:
A join is the retrieval of data entries from one or more tables
in a database matched under some criterion. Extending our example above, a user may be interested in the
dates on which SSA exposures in this equatorial strip were taken. The Plate
table in the SSA has an attribute called MJD, which records the Modified
Julian Date at the midpoint of the exposure of each photographic plate making
up the SSA. The Plate and Field tables are linked by having the
common attribute fieldID, which is a unique identifier for each sky
survey field (e.g. Field 1 in the ESO/SRC field system has a different fieldID
value to Field 1 in the Palomar system). The SQL query retrieving the desired
dates here would be:
Another way of doing this is to use a prefix which is an alias for
the table name, assigned using the AS keyword in the FROM clause
of the join statement. In this way, the previous query can be rewritten as:
It is also possible to write join queries in a way that makes it more explicit that
they are joining the two tables, i.e. the example above becomes:
This is an inner join, meaning it only returns the
(mjd,nominalRA,nominalDec) triplets for matched rows, but there are other types
of join, which return different combinations of data. Firstly, there are left
and right outer joins and full joins, all of which also
return data from unmatched rows. Replacing the join clause above with
The SQL Server dialect of SQL allows the construction of nested SELECT statements, in which the
WHERE clause includes a subquery which is itself a valid SELECT statement. For
example, the join on Field and Plate in the previous Section could be rewritten in the following way:
This query could also be written using a second logical operator, ANY, as follows:
These last two queries illustrate the equivalence of IN and = ANY, but care must be taken if the logical operator NOT is added to the outer query, so that it seeks matches with the complement of the set for which matches are sought in the original query. The operator which is equivalent to NOT IN is not < > ANY, as one might initially expect, but rather < > ALL - where ALL is another logical operator, which evaluates to TRUE if all of a set of comparisons are TRUE - which is clear when one thinks through the logic of that query, but perhaps not immediately apparent.
The most common use for subqueries is to express complicated joins in a simpler fashion. Up to 32 levels of nesting are allowed, in principle, although memory restrictions may prevent that from being achieved in practice. To evaluate a complicated join as a nested series of queries would often be much slower, but the database engine should recognise the equivalence of the two forms of the query and recast it in the most efficient form as part of its query optimisation process. So, there should be no difference in the performance of queries submitted as complicated joins or as nested subqueries, and the latter are to be prefered if they help ensure that the query that is executed really is the one that the user wanted to run.
Our discussion to this point has implicitly assumed that the values of the attributes corresponding to each column in every row in a database table are known. This need not always be the case, as a simple example from the SSA illustrates. The Source table in the SSA merges information about detections made on B, R and I band plates. A very blue object may well not be detected in an I band plate, as these are shallower than the B and R plates, so what should be written for such an object in the column of the Source table which records, say, the ellipicity of the I band detection? One answer would be a null value, which is a special type of entry to be included in a table if the value of a given attribute is not known (or is indeterminate or is not applicable) for a particular row. In designing v1.0 of the SSA we have decided not to use nulls in these cases, but to define default values for use in these situations instead: e.g. in the example above, we would set the I band ellipticity in the Source table of a celestial object undetected in that band to be -9.999e9. Nulls and default values are semantically different: the query processor in a database management system (DBMS) recognises that a value marked as null is unknown or indeterminate and will not include it in, say, the computation of the mean value of all entries in a column, while, to the query processor, a default value is like any other value, and will include it, unless the user explicitly excludes it - e.g. by computing the mean magnitude only for those objects with magnitudes brighter than -99.0, in this case.
As a result of defining default values for some columns, there have to be dummy rows in some tables (i.e. rows for which every attribute takes the appropriate default value). The reason for this is illustrated by consideration of the Source and Detection tables in the SSA, whose design is detailed here. The Source table includes attributes (called ObjIDB, ObjIDR1, ObjIDR2, ObjIDI) which store the values of the ObjID attribute in the Detection table of the records, respectively, of the B band, first epoch R band, second epoch R band and I band detections which have been merged into the Source record. In our example above, there is no I band detection merged into the Source record of our very blue object, so ObjIDI must take a default value. This, in turn, necessitates the creation of a dummy row in the Detection table, whose ObjID attribute takes the value used as the default for ObjIDB, ObjIDR1, ObjIDR2 or ObjIDI used for a non-detection in a particular band, since otherwise the integrity constraints that ensure that every detection referenced in the Source table is present in the Detection table would be violated. The other columns of this dummy row are then populated with the appropriate default values. The default values used for the attributes in the SSA are listed here and have all been chosen to lie well beyond the range of legitimate values found in the SSA, so it is simple to exclude them.
It is the responsibility of the user to ensure that the presence of default values and dummy rows will not corrupt the results of queries run on the SSA, but our decision to use them, rather than nulls greatly simplifies the logic involved in all queries run on the database. The inclusion of null values for an attribute means that an expression involving it can evaluate to TRUE, FALSE or UNKNOWN, and we believe that the simplification for users in avoiding this three-valued logic greatly outweighs the burden of having to remember that default values exist for some columns in the SSA.
Having discussed the
syntax of a SELECT statement, we can describe the definition of the
views available in v1.0 of the SSA. As outlined above, views are virtual tables
comprising a subset of data drawn from one or more tables. Views are defined
using a SELECT statement and in the SQL Server dialect of SQL this takes
the following form:
The benefits of defining views can be illustrated by the
example of the CompleteStars view in SSA v1.0, which is defined as follows:
The CompleteStars view defines a highly complete selection of point sources from the Source table. It contains all the attributes for all those entries in Source for which there is a reasonable astrometric solution (chi2<5.0) derived from more at least two plates (Nplates>1) and which, in all the bands in which it is detected, is free of strong warnings concerning quality issues (qual<2048) and has image parameters within ranges outwith which it is highly unlikely that point sources will be found; these ranges are defined to be ellipticity <1/3 (i.e. image elongation is less that 50%) and -3.0 < profile statistic < 4.5 (i.e. the areal profile set for the image is close to that of an idealised PSF, where "close" is between a 3sigma lower limit and a 4.5 sigma upper limit on the N(0,1) profile statistic). Use of this view is recommended for analyses requiring high stellar completeness but which are not too sensitive to some contamination, particularly from misclassified sources, poorly parameterised deblends near the limiting magnitude of the plate and poorly parameterised/calibrated images in very crowded regions.
As always, the user must decide what SQL statement defines the set of sources best suited to the particular scientific
problem at hand, and in some cases, reliable star/galaxy separation will be more important than completeness, and, with this
in mind, we have defined ReliableStars and ReliableGalaxies views for SSA v1.0, using the following SQL
statements:
The definition of the ReliableStars view is best understood by contrasting it with that for CompleteStars. The definition of ReliableStars imposes tighter constraints on many parameters than CompleteStars to reduce contamination from non-stellar sources: the maximum ellipticity is reduced from one third to one quarter; the upper limit on the profile statistic is reduced to 3.0sigma; the overall image quality must be better (qual < 128) and only non-deblended (blend=0) detections are considered. Sources must be detected on at least three out of possible four plates (Nplates > 2) and these multi-epoch detections must admit a good astrometric solution (chi2 < 3.0). Finally, a cut in Galactic coordinates is made. Sources must lie at |b| > 10 degrees, and be more than 20 degrees from the Galactic centre, to avoid the most crowded areas of the plane and bulge, where image parameterisation is difficult. Note that the sample defined by this view will be incomplete in the most crowded regions included by this positional cut, due to blended stars being mis-classified as galaxies, so, if completeness is more of a concern that reliability, users may choose not to use ReliableStars, however it is recommended for those analyses where contamination by non-stellar sources must be minimised.
The selection defining the ReliableGalaxies view is simpler. It is a magnitude-limited (Bj < 20.0) subsample of sources with good quality (qual < 128) detections on UKJ, UKR and UKI plates, which are classed as non-stellar and non-blended in the UKJ and UKR surveys (which are deepr than the UKI) and which satisfy a positional constraint (|b| > 35.0 and d > 45.0) designed to excise those regions on the sky where contamination by stars in the bulge and close to the Galactic plane is a significant problem. As with the other views in the v1.0 SSA, use of ReliableGalaxies can greatly simplify the SQL for a great many of the queries to be run against the SSA, but users must be aware of its limitations, notably its incompleteness at faint magnitudes and lower Galactic latitudes.
Previous sections have described the basic SELECT…FROM…WHERE… structure of an SQL query. This is the basic syntax to be employed for querying the SSA, but there are some additional options in the SELECT clause which users may find useful.
SQL offers a number of useful aggregate functions, which can be used for deriving summary information on columns or selections from them.
The meanings of those mathematical aggregate functions which apply only to
numeric columns are very straightforward: AVG, MAX, MIN, SUM,
STDEV, STDEVP, VAR, and VARP return, respectively,
the mean, maximum, minimum, sum, standard deviation, population standard
variation, variance and population variance of all values in the specified
column. They can be used in conjunction with a WHERE clause or not, i.e.
N.B. Strictly speaking, these functions only apply to non-null
values found within the particular column. As discussed above, the v1.0 SSA contains
no null values by design, but uses default values instead. These will not be
ignored in the computation of aggregate functions. For example, the SQL query
There are additional aggregate functions which can be run on columns of all
types. The most useful of these is COUNT, which can be used in a number
of ways, as shown in the following example. The systemID attribute in the Field
table identifies the scheme under which the field centres were defined: since
the SSA combines data from several surveys, it can take several different values. To
find out how many, one can use the following query:
The count(*) function can also be used in conjunction with a WHERE
clause, and it is in this guise that it is most useful. Suppose a user were
interested in selecting a sample of the very large objects from the Detection
table, but didn't know the size of the largest objects. In ignorance, the user might
think that 100 pixels was a pretty large area and would query the SSA to return
a list of attributes for all such objects. Such a query would return about one
tenth of the records in the Detection table, which would probably return far
more data than the user could cope with. A better policy would be to use count(*)
to find out how many objects would be returned by an area cut at 100 pixels,
with the following query:
An interesting function specific to SQL Server’s SQL dialect is TOP, which is illustrated as follows. The query
The GROUP BY clause allows aggregate functions to return more than a
single value. For example, the user running the "SELECT COUNT(DISTINCT systemID) FROM field"
query above might want to know how many of the SSA fields come from the each of
the four systems whose presence is indicated by the result of that query. That
information is returned by the following query:
The majority of queries run on astronomical databases have a spatial component to them, restricting attention to particular region of the celestial sphere, usually defined as some range in either Equatorial or Galactic coordinates. To enhance the performance of the database for these popular spatial queries would require some sort of spatial indexing, but a practical problem is that very few DBMSs support two-dimensional indexing on the surface of a sphere. Indeed, not all DBMSs support two-dimensional indexing on a plane, and SQL Server is one of those DBMSs which supports indexing in only one dimension. Indexing on RA and/or Dec (which is better, not having the complication of wrap-around at the meridian) would aid spatial queries somewhat - and, indeed, indexing on both RA and Dec is implemented in the v1.0 SSA - but what is desired is a prescription under which two-dimensional regions of the celestial sphere can be indexed using a one-dimensional index such that regions which are close on the sky have similar values for the index code, as that will facilitate efficient spatial searches.
A number of approaches can be used to map the two-dimensional surface of a sphere onto a one-dimensional index, but, as mentioned above, the v1.0 SSA uses the same spatial indexing prescription as the SkyServer database of the Sloan Digital Sky Survey (SDSS). This is based on the Hierarchical Triangular Mesh (HTM) pixelisation of the celestial sphere, which was developed at Johns Hopkins University by Peter Kunszt, Alex Szalay and Ani Thakar, and implemented in SQL Server by Jim Gray and Don Slutz of Microsoft Research: we thank them for making available their code for use in the SSA. The HTM scheme is described in much greater detail elsewhere, and we discuss here only those facets which are relevant to use of the SSA.
The HTM scheme starts with the projection of an octahedron onto the celestial sphere, which defines a base set of eight triangular regions of the sky. Each triangular face of the octahedron is then divided into four triangles, defined such that their vertices are the vertices of the original triangle and the midpoints of its sides. The projection of this set of triangles onto the sphere divide it into 32 regions. This process of splitting triangles into four can be repeated to yield an increasingly finely-grained pixelisation of the sphere. The triangles defined at each stage in this procedure can be numbered according to a set procedure, in which an integer code is assigned to each whose most significant bits are inherited from the parent triangle and whose least significant bits record which of the four triangles created from that parent it is. In this way, it is possible to define an integer pixel code - the HTMID - for each triangle in this hierarchical pixelisation of the sphere which records not only the location of the centre of the triangle on the celestial sphere, but also the level in the hierarchical decomposition of the sphere at which it is defined. A 20-level decomposition produces triangles whose average area (note that the areas on the sphere bounded by projections of the edges of the hierarchically decomposed octahedron are not equal) is less than 0.1 square arcsecond, so these are perfectly adequate for labelling individual celestial objects within the SSA unambiguously.
This 20-level HTM decomposition is the basis for the spatial indexing of all positions (locations of celestial objects in the Detection and Source tables, and of field centres in the Field table) in the v1.0 SSA: N.B. all equatorial positions within the v1.0 SSA are quoted in J2000 coordinates. The HTMID for all these defined positions is recorded in the HTMTable table, which includes columns for an ID code (the TableID, itself defined in the HTMTableID table) for the parent table in which the position is defined, together with the ID number within that table of the row in which it is found. Following SkyServer we provide a set of functions for manipulating the entries in HTMTable: these are the key to performing efficient spatial queries in the SSA, so we strongly recommend that users learn how to use them.
The first two functions are concerned with the mapping from regions of the celestial sphere to the HTMID code.
The function fHTMLookupEq takes as arguments an (RA,Dec) pair (with both coordinates in degrees) and returns
the 20-level HTMID code for that location. For example, the HTMID for the point (185.0,0.0) is found
with the following query:
A more powerful function is fHTM_Cover, which returns the HTMID values of the triangles (at a
depth in the hierarchical decomposition specified by the user) which intersect a particular region of the
celestial sphere (defined in J2000 coordinates). This region can be specified in a number of ways, the most useful
being a CIRCLE, defined by the (RA,Dec) pair of its centre (both in degrees) and its radius (in arcminutes),
and a CONVEX, which is a polygon defined by the (RA,Dec) pairs of its vertices. Whereas fHTMLookupEq
returns a single integer (the HTMID code), fHTM_Cover
is a table-valued function, meaning that it returns a table, which must be queried with a SELECT
statement of the same syntax as the static tables and views which are found in the SSA schema. The use of
fHTM_Cover is illustrated in the following query, which probes the coverage of the circular region
of radius 40 arcminutes centred on (185.0, 0.0) by Level-6 HTM triangles:
The next three functions perform proximity searches within HTMTable. The first of these, fGetNearbyObjEq
searches for entries from a particular parent table that lie within a circle of a given radius centred on a location
specified by an (RA,Dec) pair. So, the following query:
The ordering of the results returned by fGetNearbyObjEq is not guaranteed, so, to enable the ready
determination of the closest entry in Source, say, to a given position on the sky without a second
selection on the distance column of the result set, another function, fGetNearestObjEq, is
defined to do just that. So, the nearest of the Source entries to position (185.0,0.0) is returned
by:
One of the main purposes of introducing the HTM pixelisation and its associated set of access functions was
to reduce the number of great circle distance calculations which have to be performed in proximity searches:
the great circle distance formula is a fairly complex piece of spherical trigonometry, which is easy to type
incorrectly in SQL and which is not efficiently executed by standard DBMSs, so it is desirable that its use
be kept to a minimum. It is actually used in proximity searches, such as those specified using the
fGetNearbyObjEq function, but only as a last check on the restricted number of rows which might
match the proximity criterion on the basis of their HTMID value. However, the great circle distance
formula is useful in many situation, so we define a function, fGreatCircleDist, that implements it.
Its use is illustrated by the following query:
The following mathematical functions are supported by SQL Server's SQL dialect.
Arithmetic functions (such as ABS, CEILING, DEGREES, FLOOR, POWER, RADIANS, and SIGN) return a value having the same data type as the input value, while trigonometric functions and others (such as EXP, LOG, LOG10, SQUARE, and SQRT), cast their input values to float and return a float value; this probably is of no concern to the average SSA user. All mathematical functions, except for RAND, are deterministic functions - i.e. they return the same results each time they are called with a specific set of input values - and RAND is deterministic only when a seed parameter is specified
An operator in SQL is a symbol specifying an action that is performed on one or more expressions. For the present purposes, their major use is to provide greater flexibility in the possible forms of WHERE clauses of queries and in the columns of the result sets they can produce, which need not simply be columns drawn from the table(s) being queried. There are several classes of operator to consider.
SQL Server's SQL dialect supports five arithmetic operators. The four basic ones - Addition (+), Subtraction (-), Multiplication (*) and Division (/) - plus the Modulo operation, (%), which returns the remainder of dividing one integer by another, and is used in the format "dividend%divisor". The use of the four basic arithmetic operators is demonstrated in Section 4 below, and use of all five is straightforward from their definitions.
The utility of bitwise operators in SSA v1.0 is limited to the testing of quality flag in the Detection table, which is composed of flags for various quality issues set independently in a bitwise fashion, and which reappears as the set of four quality flags (qualB, qualR1, qualR2 and qualI) for the detections merged into each record in the Source table. The meaning of each of the quality flags set bitwise in these attributes is explained here, and they may be tested using the Bitwise AND operator, &. For example, Q10 of Section 4 probes the setting of the 4th bit of the quality flag (corresponding to presence of a pixel value above the highest areal profile level) by using terms in the WHERE clause of the form "qualb & 16 = 16". In addition, there are also Bitwise OR (|)and Bitwise Exclusive OR (^) operators defined, which could be used in a similar fashion.
The Boolean comparison operators are used most frequently to filter rows via the WHERE clause of a SQL query. The most simple comparison operators ( <, >, =) were used above without introduction, but there a total of nine comparison operators which can be applied to pairs of expressions in the SQL Server dialect of SQL: = (Equal to); > (Greater than); < (Less than); >= (Greater than or equal to); <= (Less than or equal to); <>(Not equal to); != (Not equal to); !> (Not greater than); and !< (Not less than).
In a similar fashion, we have used a number of the logical operators (e.g. AND, BETWEEN, etc)
above without introduction, but the following is the full list of logical operators supported by SQL Server:
The LIKE operator is used for pattern matching. This is most commonly used for string matching. For example,
a user interested in knowing how many plates yielding data included in the SSA had the IIIaF photographic emulsion
could issue the query:
The IN logical operator determines if a given value matches any entries in a subquery or list. The
first of these is discussed in Section 2.4.4. above [add link], while the second is readily illustrated by another
reworking of the query above to count all plates taken
with IIIaF or IIIaJ emulsions. Using IN instead of LIKE it can be rewritten as:
The SOME and ANY operators are strictly equivalent and both compare a scalar expression with a
single column set of values. The count of IIIaF plates could be found (in a rather circuitous manner)
with the following query:
Many of the examples given in this Section have been somewhat contrived, as we have stuggled to find realistic astronomical queries that would make use of some of the logical operators supported by the SQL Server dialect of SQL. This probably indicates that the average SSA user is unlikely to make use of these operators, which we have discussed here solely for the sake of completeness.
The plus sign, +, is used as a string concatenation operator in the SQL Server dialect of SQL. This is most likely to be of use to SSA users in the formatting of result sets - i.e. in the definition of SELECT clauses.
The SQL Server dialect of SQL defines three unary operators - i.e. operators which have only one operand - although
none of these are likely to be of much use to most SSA users. The first
of these is the positive unary operator, +, which is used principally in SQL statements not allowed by the SSA
query interface, so this will not be of use to SSA users. The second, -, the negative unary operator, which
returns the negative value of a numeric expression, as shown in the following query:
The operators described in this subsection have the following descending levels of precedence:
In their design of SkyServer, the SQL Server implementation of the SDSS archive, the Sloan archive team were influenced by a set of 20 queries (see Data Mining the SDSS SkyServer Database by Jim Gray et al.) which were intended to be realistic examples of the kinds of queries that astronomers would want to run on SkyServer. We adopted the same approach in our design of the SSA. We took the 20 queries used for SkyServer, translated a number of them to run on the SSA, and replaced the others by new queries, often performing joins between tables in the SSA and others in the Early Data Release (EDR) version of SkyServer. We discuss these below in turn. Since these were intended to be representative of the queries which SSA users would submit, they have not necessarily been written in particularly elegant SQL, nor has the astronomical content (e.g. colour cuts used) been verified terribly rigorously. In each case we provide a link to the output obtained by running the query on the Personal SSA. To ensure that this matches exactly what a user would obtain we have amended some queries slightly, by the addition of a "TOP 30" to the select statement and an "…ORDER BY…" clause at the end of the query: recall from the discussion of ORDER BY above that the set-based nature of SQL means that the ordering of rows within a result set is not guaranteed, until explicitly dictated through use of ORDER BY. In some of these queries we make use of the fact that SQL Server regards any text on a line following a double hyphen as being a comment, and ignores it: annotating queries through the addition of comments like that is a good way of making complicated queries intelligible.
The Source table of the SSA contains E(B-V) values computed from the Schlegel et al. (1998), ApJ, 500, 525 maps for the position of each entry, so this query is simply a selection on the EBmV attribute recording those values scaled to yield the AB. The scaling factor is 4.09 (Binney and Merrifield 1998) and we use the ReliableGalaxies view of the Source table, for simplicity, as that already makes our magnitude cut at B=20.
This simple query performs a selection on Source to look for
star-like objects detected in both red surveys in the SSA with magnitudes
differing by more than 3 mags. Stars varying this much are rare, so this
query is more likely to detect oddities in the data, rather than extreme variable stars.
Indeed, run on the CompleteStars view in the PSSA it finds no matches at all, so, to
see if it detects any artefacts in the PSSA we can relax the selection criteria for stars from
those used to define the CompleteStars view, as follows:
This query is a translation to the SSA of Query 5 from the SDSS list of “20 Queries”,
which selects sources for which a de Vaucouleurs profile is a better fit than an exponential profile.
The profile information available in the SSA is more limited, so, in adapting this query to the SSA,
we base it around a colour selection on the ReliableGalaxies view together with a cut on the
profile statistic which should only select large galaxies with profiles very disimilar from the assumed
PSF. For our colour cut, we take the original colour-magnitude locus for
low-redshift ellipticals defined in (g,r,i) space by
Eisenstein et al.
(2001), AJ, 122, 2267
and translate it into cuts in SuperCOSMOS (B,R,I) space through the use of the
transformations published by
Fukugita et al. (1996), AJ, 111, 1748 between the Sloan and
Johnson-Cousins passbands, and by
Blair and Gilmore (1982), PASP, 94, 742 for conversion
between them and the photographic filters used in the SSA, i.e.
Putting all these together gives the following query, where we neglect the first epoch R band
survey, since the definition of ReliableGalaxies ensures detection of a non-stellar source
in the other three bands;
This query, which we shall run on the ReliableStars view selects white dwarf candidates on the basis
of their Reduced Proper Motion, Hr=R + 5* log10(μ/arcsec per year) + 5, computed using proper motion (μ)
from the SSA and R band photometry from the second epoch survey. White dwarfs are then selected as satisfying
Hr>3.75*(B-R2)+13.75 and Hr>9.091*(R2-I)+14.09, with the following query:
The corresponding query from the SDSS “20 Queries” list on which this is based sought
quasar candidates in the redshift range 5.5<z<6.5, which are very
unlikely to be found in the SSA. All colour selection of quasars in the SSA
will yield a large number of false positives, due to the lack of U band data,
but we take the (g,r,i) selection criteria for low-z quasars from Richards et
al (2001) – i.e. –0.2 < (g-r) < 0.35 and –0.4 < (r-i) < 0.4 – and translate
that into the SSA photometric bands, using the tranformations from Q3, obtaining the following:
This query implements one of the Usages of the WFCAM Science Archive, itself devised from the UKIDSS proposal: the SSA is a prototype for the WFCAM Science Archive (WSA) in many respects. This query simply selects entries in Source which have a detection in the first or second epoch red survey, but not both.
This query illustrates how SQL can be used to generate rough density
estimates in the space of the attributes present in a table. The query selects stellar
objects on the basis of ellipticity and profile statistics and bins them up into a unit
interval grid in (B-R,R-I) space.
This illustrates how a SQL query might be used in a clustering analysis, by computing counts-in-cells. As in Q7, it bins objects - this time into 2x2 arcmin bins on the sky – and then returns the count of galaxies in each bin.
This is another counting query, but this time, counting galaxies falling
into particular Level-9 HTM triangles (regions about 1/4 square degree in size).
This binning is performed by taking the HTMID for each relevant entry in the ReliableGalaxies
view and dividing it by 222, as required to obtain the HTMID of the
Level-9 triangle in which the galaxy lies, given that the HTMID stored in ReliableGalaxies
is at Level-20.
This query illustrates the use of a join with the result set from the table-valued function fGetNearbyObjEq. It is essentially a self-join of the Source table - i.e. a join of Source with itself - but uses the ReliableGalaxies view to restrict attention to galaxies. The call to fGetNearbyObjEq finds all objects in Source withing 1 degree of (12 20 00, 00 00 00) - i.e. RA=185 degrees, Dec=0.0 degrees - and then joins with it a selection of those entries in ReliableGalaxies which are flagged in at least one band as having been detected with at least one pixel lying above the highest areal profile threshold used by the SuperCOSMOS Image Analyser. This flag is set with the 4th bit in the qualflag, and its value is probed using the Bitwise "AND" operator, &.
The nominal centres of all the survey fields are stored in the Field table, so this requires a
join between that and the Plate table on the fieldID attribute. The spatial selection is
performed by making a call to the table-valued function fGetNearbyObjEq in a subquery and then
joining that HTMID, too.
This repeats one of the SDSS “20 Queries”, but replacing a surface
brightness cut with an image area cut, since surface brightness values are not
stored within the SSA. The Source table - and, hence, the ReliableGalaxies view -
includes positions in Galactic coordinates, so this query just requires a little spherical
trigonometry to implement the selection on supergalactic latitude, and it requires a join with
Detection to retrieve the B Band area. The clause containing the cut on supergalactic latitude
is made more cumbersome by the requirement of converting angles between degrees and radians several times.
The SuperCOSMOS plate scale is 67.14 arcsec/mm, so a conversion factor of 1 micron
=0.067 arcsec is required to recast this query in terms of the physical units
in which the length of the major axis of an image is recorded in the Detection
table. The main constraints in this query are applied to rows in the Detection
table and the join
with the ReliableGalaxies table is made only to remove duplicate entries in the result
set, as a consequence of the two epochs of red survey data, and to extract the colour information. The
contraint that sourceID >0 is required to remove the parents of deblended images, as these are
stored in Detection for completeness, but only their children are merged to form entries in Source.
Entries in the Detection table which were blended together can be recognised
by having a common parent, so this query looks for such pairs of objects by
performing a self-join on Detection constrained to pick galaxies from one copy
of the table and stars from the other, with added checks that they have
sensible magnitudes and that duplicate records in the overlap regions between
plates are excised.
The original SDSS query on which this is based is motivated by a
gravitational lens search. For the SSA, we take “similar colours” to mean B-R
and R-I colours within 0.1 mag. This query uses the SSA Neighbours
table, which records which entries in Source are less than 10 arcsec
from which other ones, and which is designed to speed up queries like this,
which would otherwise require very time-consuming self-joins of Source
with a spatial match predicate. Note that the find clause prevents double-counting.
This query roughly mirrors the work of
Digby et al (2003),
who constructed a
sample of subdwarf stars on the basis of their location in a reduced proper motion
(RPM) diagram computed using SSA proper motions,μ, together with r band
magnitudes and (r-i) colours from the SDSS Early Database Release (EDR) catalogue.
The reduced proper motion, Hr, is defined to be
Hr=r + 5* log10(μ/arcsec per year) + 5
and then the Digby et al (2003) selection criteria for subdwarfs roughly
translates to that region of the RPM diagram bounded by the lines Hr > 13.3
+ 6.5 * (r-i) and Hr<15.3+6.35*(r-i). The query is a three-way join between
the ReliableStars view of the SSA, the PhotoObj table of SkyServer and
the CrossNeighboursEDR table, which records, for each entry in Source,
its ObjID value, together with the ObjID value in PhotoObj of each SDSS object
lying within 10 arcsec of it, as well as their separation (in arcmin): CrossNeighboursEDR
therefore facilitates spatial matching between Source (and, hence, ReliableStars ) and PhotoObj,
by avoiding the need to perform an on-the-fly join between the two full tables
whenever a match is required. This query matches those stars from the SSA with
good proper motions with SDSS stars lying within 1 arcsec, and then uses SDSS
magnitudes and SSA proper motions to determine which lie within the region of
the RPM diagram populated by subdwarfs.
This query was motivated by the idea of using the epoch difference between Sloan
and SuperCOSMOS surveys to estimate the local supernova rate, through
identifying galaxy images which appear to have changed in brightness by more
than five magnitudes; of course, in practice, this identifies more artefacts in
the data than it does real supernovae, but several have been found this way. As
with Q3, and subsequent queries, this requires use of colour equations to
transform between the SDSS and SSA photometric system. This is another query
that uses the CrossNeighboursEDR table to match (good quality) SSA galaxies with
SDSS galaxies lying within 1 arcsec, and then further constraints on colour and
magnitude are applied to find those which have changed in blue or red band
brightness by more than 5 mags.
This speculative query looks for bright optical transients, but, again, is
more likely to turn up artefacts in the data.
This query is motivated by the idea of selecting quasars as stellar-like
objects with zero proper motion. The astrometric accuracy of SuperCOSMOS and
the length of the time baseline between the SSA’s constituent surveys are nearly
adequate for this to be a sensible way to select quasar candidates, but it is
mainly included as another example of an SSA-SDSS join query, using CrossNeighboursEDR
and a colour-magnitude cut in the Sloan bands suitable for low-z quasars,
namely g<=22, -0.27 <= u-g < 0.71, -0.24<= g-r<0.35, -0.27<=r-i<0.57
and –0.35<=i-z<0.70.
This query is designed to detect spurious objects in the SSA, on the
assumption that all real SSA objects should be detected in the (deeper) SDSS.
It works by selecting SSA objects in Sourcewhich are detected in only
one band, and within a magnitude of the respective nominal plate limit for that
band, and then looks up the ObjID values for those sources to see
whether they are included in the list of ObjID values for Source
entries which have no SDSS counterpart with 0.5 arcsec, as revealed by the CrossNeighboursEDR
table.