Example queries from CosmoSim
Tests for the new query parser/reformulator
These example queries are mostly taken from the CosmoSim documentation and should work with the new parser and reformulator as well. In order to extract only the queries from the source of this document, use e.g. following command line:
awk 'BEGIN{writeflag=0}{if ($0 ~ /```/ && $0 !~ /awk/) {writeflag=(writeflag+1)%2; printf "\n";} if (writeflag == 1 && $0 !~ /```/ && $0 !~ /awk/) {print $0;} }' docs/_posts/2016-03-21-example-queries-from-cosmosim.md
For example queries that don't work with current PaQu, see: PaQu Issues.
Using only 1 table
Queries like this really must work.
Simple SELECT and
LIMIT
SELECT x, y, z, mass FROM MDR1.FOF LIMIT 10
Use OFFSET explicitely or LIMIT with 2 arguments:
SELECT x, y, z, mass FROM MDR1.FOF LIMIT 100,200
PaQu-Note 1: Send query to shard nodes as is, apply simple LIMIT also again on head node table.
PaQu-Note 2: For queries including an OFFSET we need to think about how to enable this. Simply applying LIMIT and OFFSET (or LIMIT a,b) on shard nodes AND the head node will often return no results, too many or too few. The correct result (in the sense of complying with the user's expectation) would be returned if the LIMIT OFFSET clause would NOT be given to the shard-nodes and only be applied on the temporaty result table on the head node. But this would mostly come with a lot of overhead. So maybe it's better to not allow OFFSET or LIMIT a,b at all.ORDER BY
-clauseSELECT x, y, z, mass FROM MDR1.FOF ORDER BY mass DESC LIMIT 10
PaQu-Note: Send query to shard nodes as is, apply ORDER BY also again on head node table.
WHERE
-clauseSELECT x, y, z, mass FROM MDR1.FOF WHERE snapnum=85 LIMIT 10
Should work with all versions of WHERE, i.e. <, >, <=, >=, BETWEEN ... AND ...
Extra: allowIN
-keyword, e.g.WHERE snapnum IN(40, 41, 42)
Any combinations of multipleOR
andAND
s should work and return the logically correct subset.PaQu-Note: Need to collect all columns in SELECT on shards, also those in WHERE-clause etc., since the filter-conditions need to be applied on the head node again.
GROUP BY
For example in combination with aggregate functions like COUNT:SELECT COUNT(*) FROM MDR1.FOF3 GROUP BY snapnum ORDER BY snapnum
Functions
Functions that operate only on one row.
Example:SELECT log10(mass) AS logM FROM MDR1.FOF
SELECT log10(mass)/sqrt(x) AS logM FROM MDR1.FOF
SELECT 0.5*MAX(mass) FROM MDR1.FOF
Should support functions in SELECT, WHERE, GROUP BY and ORDER BY. Example functions are:
EXP
,LOG10
,SQRT
,POWER
(there are probably more).PaQu-Note: The last example fails with current PaQu, see issue #32 at http://github.com/adrpar/paqu/issues/32.
Nested functions
Functions can be nested, need to be parsed correctly. Example:SELECT log10(ABS(x)) AS log_x FROM MDR1.FOF
Aggregate functions
Functions that operate on one to many rows.
Supported functions:AVG
,COUNT
,SUM
,MAX
,MIN
SELECT AVG(mass) AS mean_mass, COUNT(*) AS num FROM MDR1.FOF3
More advanced example (mass function):
SELECT 0.25*(0.5+FLOOR(LOG10(Mvir)/0.25)) AS log_mass, COUNT(*) AS num FROM MDR1.BDMV WHERE snapnum=85 GROUP BY FLOOR(LOG10(Mvir)/0.25) ORDER BY log_mass
PaQu-Note: Needs complete reformulation, except for MIN/MAX
Nested functions and aggregate functions
SELECT log10(COUNT(*)), snapnum FROM MDR1.FOF GROUP BY snapnum
The log10(count(*)) part usually occurs as part of the mass function query from above, since usually both, mass and counts, are plotted logarithmically.
PaQu-Note: This query is not reformulated correctly by current PaQu.
Using 2 or more tables, subqueries
Simple subquery (nested query)
In these queries, the subquery only returns exactly one value, otherwise this won't work and will return a syntax error when trying to execute the outer query. (Maybe it can work with WHERE IN()?)SELECT bdmId, Rbin, mass, dens FROM Bolshoi.BDMVProf WHERE bdmId = (SELECT bdmId FROM Bolshoi.BDMV WHERE snapnum=416 ORDER BY Mvir DESC LIMIT 1) OR bdmId = (SELECT bdmId FROM Bolshoi.BDMV WHERE snapnum=416 ORDER BY Mvir DESC LIMIT 1,2) ORDER BY Rbin
Should work with any number of recursions, as many of these conditions combined with AND or OR.
Subquery as table
Probably only needed for more complicated joins later on.SELECT d.snapnum AS snapnum, d.dens AS dens FROM (SELECT snapnum, dens FROM Bolshoi.Dens256_z0) AS d LIMIT 100
PaQu-Note: PaQu needs to first execute the subquery on the shard nodes, send results to head node, then do a join between the other table on the shard nodes and the temporary head nodes table.
Using 2 or more tables, joins
Simple (implicit) join
SELECT d.snapnum AS snapnum, r.zred AS zred, d.dens AS dens FROM Bolshoi.Dens256 AS d, Bolshoi.Redshifts AS r WHERE d.snapnum=r.snapnum AND d.snapnum=36 LIMIT 100
PaQu-Note 1: PaQu needs to decide, which table is to be copied to the head node, so that shards can join with it.
PaQu-Note 2: The Redshifts-table is NOT sharded, but copied on each node (full copy) for convenience. Thus, in order to avoid having replicated results (tenfold, in fact), one needs to add
DISTINCT
after the SELECT for the head node. This is not necessary for the shards, but would not hurt. We can also ask users to provide the DISTINCT keyword explicitely, if they want to avoid duplicated results (that's what we currently do), but users are always confused by this.Simple join, reformulated with one subquery
SELECT d.snapnum AS snapnum, r.zred AS zred, d.dens AS dens FROM Bolshoi.Dens256_z0 AS d, (SELECT snapnum, zred FROM Bolshoi.Redshifts) AS r WHERE d.snapnum=r.snapnum LIMIT 100
PaQu-Note: PaQu can add the DISTINCT directly in the subquery. PaQu should prefer to copy the subquery-result to the head node, not the other table.
Queries with necessary joins,
i.e. for which nested queries are inefficient
Joins for merger trees
If rewriting to nested, then two subqueries needed for the same row to get fofTreeId and mainLeafId.SELECT p.fofTreeId, p.treeSnapnum, p.mass, p.np FROM MDR1.FOFMtree AS p, (SELECT fofTreeId, mainLeafId FROM MDR1.FOFMtree WHERE fofId=85000000000) AS mycl WHERE p.fofTreeId BETWEEN mycl.fofTreeId AND mycl.mainLeafId ORDER BY p.treeSnapnum
Joins with redshift table (multiple nested queries needed, if no joins allowed)
SELECT d.snapnum AS snapnum, r.zred AS zred, r.aexp AS aexp FROM (SELECT DISTINCT snapnum FROM Bolshoi.Dens256_z0) AS d, (SELECT DISTINCT snapnum, zred, aexp FROM Bolshoi.Redshifts) AS r WHERE r.snapnum = d.snapnum ORDER BY snapnum
The first subquery could be replaced by any other query that returns multiple snapshot numbers, e.g. when tracing a halo through time, extracting merger trees, ...
Find halos in low density regions
SELECT d.dens,h.bdmId,h.x,h.y,h.z,h.Mvir,h.Rvir,h.hostFlag FROM MDR1.Dens512_z0 d, MDR1.BDMV h WHERE d.dens<1 AND h.snapnum=85 AND h.Mvir>1.e12 AND h.phkey/8. = d.phkey ORDER BY d.dens
Find halo properties for a given semi-analytical galaxy
SELECT h.Mvir, h.spin, g.diskMassStellar, g.diskMassStellar/h.Mvir AS mass_ratio FROM MDPL2.Rockstar AS h, MDPL2.Galacticus AS g WHERE g.rockstarId = h.rockstarId AND h.snapnum=125 AND g.snapnum=125 AND h.Mvir>1.e10 ORDER BY g.diskMassStellar/h.Mvir