Example queries from CosmoSim

Tests for the new query parser/reformulator

These example queries are mostly taken from the CosmoSim documentation and should work with the new parser and reformulator as well. In order to extract only the queries from the source of this document, use e.g. following command line:

awk 'BEGIN{writeflag=0}{if ($0 ~ /```/  && $0 !~ /awk/) {writeflag=(writeflag+1)%2; printf "\n";} if (writeflag == 1 && $0 !~ /```/ && $0 !~ /awk/) {print $0;}  }' docs/_posts/2016-03-21-example-queries-from-cosmosim.md

For example queries that don't work with current PaQu, see: PaQu Issues.

Using only 1 table

Queries like this really must work.

Simple SELECT and LIMIT

SELECT x, y, z, mass 
FROM MDR1.FOF
LIMIT 10

Use OFFSET explicitely or LIMIT with 2 arguments:
```
SELECT x, y, z, mass
FROM MDR1.FOF
LIMIT 100,200
```
PaQu-Note 1: Send query to shard nodes as is, apply simple LIMIT also again on head node table.
PaQu-Note 2: For queries including an OFFSET we need to think about how to enable this. Simply applying LIMIT and OFFSET (or LIMIT a,b) on shard nodes AND the head node will often return no results, too many or too few. The correct result (in the sense of complying with the user's expectation) would be returned if the LIMIT OFFSET clause would NOT be given to the shard-nodes and only be applied on the temporaty result table on the head node. But this would mostly come with a lot of overhead. So maybe it's better to not allow OFFSET or LIMIT a,b at all.
ORDER BY-clause
```
SELECT x, y, z, mass
FROM MDR1.FOF
ORDER BY mass DESC
LIMIT 10
```
PaQu-Note: Send query to shard nodes as is, apply ORDER BY also again on head node table.
WHERE-clause
```
SELECT x, y, z, mass 
FROM MDR1.FOF
WHERE snapnum=85
LIMIT 10
```
Should work with all versions of WHERE, i.e. <, >, <=, >=, BETWEEN ... AND ...
Extra: allow IN-keyword, e.g. WHERE snapnum IN(40, 41, 42)
Any combinations of multiple OR and ANDs should work and return the logically correct subset.

PaQu-Note: Need to collect all columns in SELECT on shards, also those in WHERE-clause etc., since the filter-conditions need to be applied on the head node again.

GROUP BY
For example in combination with aggregate functions like COUNT:

SELECT COUNT(*) 
FROM MDR1.FOF3 
GROUP BY snapnum
ORDER BY snapnum

Functions
Functions that operate only on one row.
Example:
```
SELECT log10(mass) AS logM 
FROM MDR1.FOF
```
```
SELECT log10(mass)/sqrt(x) AS logM 
FROM MDR1.FOF
```
```
SELECT 0.5*MAX(mass)
FROM MDR1.FOF
```
Should support functions in SELECT, WHERE, GROUP BY and ORDER BY. Example functions are: EXP, LOG10, SQRT, POWER (there are probably more).

PaQu-Note: The last example fails with current PaQu, see issue #32 at http://github.com/adrpar/paqu/issues/32.
Nested functions
Functions can be nested, need to be parsed correctly. Example:
```
SELECT log10(ABS(x)) AS log_x 
FROM MDR1.FOF
```

Aggregate functions
Functions that operate on one to many rows.
Supported functions: AVG, COUNT, SUM, MAX, MIN

SELECT AVG(mass) AS mean_mass, COUNT(*) AS num
FROM MDR1.FOF3

More advanced example (mass function):

SELECT 0.25*(0.5+FLOOR(LOG10(Mvir)/0.25)) AS log_mass, COUNT(*) AS num
FROM MDR1.BDMV
WHERE snapnum=85 
GROUP BY FLOOR(LOG10(Mvir)/0.25)
ORDER BY log_mass

PaQu-Note: Needs complete reformulation, except for MIN/MAX

Nested functions and aggregate functions
```
SELECT log10(COUNT(*)), snapnum
FROM MDR1.FOF 
GROUP BY snapnum
```
The log10(count(*)) part usually occurs as part of the mass function query from above, since usually both, mass and counts, are plotted logarithmically.

PaQu-Note: This query is not reformulated correctly by current PaQu.

Using 2 or more tables, subqueries

Simple subquery (nested query)
In these queries, the subquery only returns exactly one value, otherwise this won't work and will return a syntax error when trying to execute the outer query. (Maybe it can work with WHERE IN()?)

SELECT bdmId, Rbin, mass, dens FROM Bolshoi.BDMVProf
WHERE bdmId =
    (SELECT bdmId FROM Bolshoi.BDMV
     WHERE snapnum=416 ORDER BY Mvir DESC LIMIT 1)
  OR
  bdmId = 
    (SELECT bdmId FROM Bolshoi.BDMV
     WHERE snapnum=416 ORDER BY Mvir DESC LIMIT 1,2)
ORDER BY Rbin

Should work with any number of recursions, as many of these conditions combined with AND or OR.

Subquery as table
Probably only needed for more complicated joins later on.
```
SELECT d.snapnum AS snapnum, d.dens AS dens 
FROM 
(SELECT snapnum, dens FROM Bolshoi.Dens256_z0) AS d
LIMIT 100
```
PaQu-Note: PaQu needs to first execute the subquery on the shard nodes, send results to head node, then do a join between the other table on the shard nodes and the temporary head nodes table.

Using 2 or more tables, joins

Simple (implicit) join
```
SELECT d.snapnum AS snapnum, r.zred AS zred, d.dens AS dens
FROM Bolshoi.Dens256 AS d, 
 Bolshoi.Redshifts AS r
WHERE d.snapnum=r.snapnum
AND d.snapnum=36
LIMIT 100
```
PaQu-Note 1: PaQu needs to decide, which table is to be copied to the head node, so that shards can join with it.

PaQu-Note 2: The Redshifts-table is NOT sharded, but copied on each node (full copy) for convenience. Thus, in order to avoid having replicated results (tenfold, in fact), one needs to add DISTINCT after the SELECT for the head node. This is not necessary for the shards, but would not hurt. We can also ask users to provide the DISTINCT keyword explicitely, if they want to avoid duplicated results (that's what we currently do), but users are always confused by this.

Simple join, reformulated with one subquery

SELECT d.snapnum AS snapnum, r.zred AS zred, d.dens AS dens
FROM Bolshoi.Dens256_z0 AS d, 
 (SELECT snapnum, zred FROM Bolshoi.Redshifts) AS r
WHERE d.snapnum=r.snapnum
LIMIT 100

PaQu-Note: PaQu can add the DISTINCT directly in the subquery. PaQu should prefer to copy the subquery-result to the head node, not the other table.

Queries with necessary joins,

i.e. for which nested queries are inefficient

Joins for merger trees
If rewriting to nested, then two subqueries needed for the same row to get fofTreeId and mainLeafId.

SELECT p.fofTreeId, p.treeSnapnum, p.mass, p.np
FROM MDR1.FOFMtree AS p, 
(SELECT fofTreeId, mainLeafId FROM MDR1.FOFMtree 
WHERE fofId=85000000000) AS mycl
WHERE p.fofTreeId BETWEEN mycl.fofTreeId AND mycl.mainLeafId
ORDER BY p.treeSnapnum

Joins with redshift table (multiple nested queries needed, if no joins allowed)

SELECT d.snapnum AS snapnum, r.zred AS zred, r.aexp AS aexp 
FROM 
(SELECT DISTINCT snapnum FROM Bolshoi.Dens256_z0) AS d,
(SELECT DISTINCT snapnum, zred, aexp FROM Bolshoi.Redshifts) AS r
WHERE r.snapnum = d.snapnum 
ORDER BY snapnum

The first subquery could be replaced by any other query that returns multiple snapshot numbers, e.g. when tracing a halo through time, extracting merger trees, ...

Find halos in low density regions

SELECT d.dens,h.bdmId,h.x,h.y,h.z,h.Mvir,h.Rvir,h.hostFlag 
FROM MDR1.Dens512_z0 d, MDR1.BDMV h
WHERE d.dens<1 AND h.snapnum=85 AND h.Mvir>1.e12
AND h.phkey/8. = d.phkey
ORDER BY d.dens

Find halo properties for a given semi-analytical galaxy