Tests for the new query parser/reformulator

These example queries are mostly taken from the CosmoSim documentation and should work with the new parser and reformulator as well. In order to extract only the queries from the source of this document, use e.g. following command line:

awk 'BEGIN{writeflag=0}{if ($0 ~ /```/  && $0 !~ /awk/) {writeflag=(writeflag+1)%2; printf "\n";} if (writeflag == 1 && $0 !~ /```/ && $0 !~ /awk/) {print $0;}  }' docs/_posts/2016-03-21-example-queries-from-cosmosim.md

For example queries that don't work with current PaQu, see: PaQu Issues.

Using only 1 table

Queries like this really must work.

  1. Simple SELECT and LIMIT

    SELECT x, y, z, mass 
    FROM MDR1.FOF
    LIMIT 10
    
  2. Use OFFSET explicitely or LIMIT with 2 arguments:

    SELECT x, y, z, mass
    FROM MDR1.FOF
    LIMIT 100,200
    

    PaQu-Note 1: Send query to shard nodes as is, apply simple LIMIT also again on head node table.
    PaQu-Note 2: For queries including an OFFSET we need to think about how to enable this. Simply applying LIMIT and OFFSET (or LIMIT a,b) on shard nodes AND the head node will often return no results, too many or too few. The correct result (in the sense of complying with the user's expectation) would be returned if the LIMIT OFFSET clause would NOT be given to the shard-nodes and only be applied on the temporaty result table on the head node. But this would mostly come with a lot of overhead. So maybe it's better to not allow OFFSET or LIMIT a,b at all.

  3. ORDER BY-clause

    SELECT x, y, z, mass
    FROM MDR1.FOF
    ORDER BY mass DESC
    LIMIT 10
    

    PaQu-Note: Send query to shard nodes as is, apply ORDER BY also again on head node table.

  4. WHERE-clause

    SELECT x, y, z, mass 
    FROM MDR1.FOF
    WHERE snapnum=85
    LIMIT 10
    

    Should work with all versions of WHERE, i.e. <, >, <=, >=, BETWEEN ... AND ...
    Extra: allow IN-keyword, e.g. WHERE snapnum IN(40, 41, 42)
    Any combinations of multiple OR and ANDs should work and return the logically correct subset.

    PaQu-Note: Need to collect all columns in SELECT on shards, also those in WHERE-clause etc., since the filter-conditions need to be applied on the head node again.

  5. GROUP BY
    For example in combination with aggregate functions like COUNT:

    SELECT COUNT(*) 
    FROM MDR1.FOF3 
    GROUP BY snapnum
    ORDER BY snapnum
    
  6. Functions
    Functions that operate only on one row.
    Example:

    SELECT log10(mass) AS logM 
    FROM MDR1.FOF
    
    SELECT log10(mass)/sqrt(x) AS logM 
    FROM MDR1.FOF
    
    SELECT 0.5*MAX(mass)
    FROM MDR1.FOF
    

    Should support functions in SELECT, WHERE, GROUP BY and ORDER BY. Example functions are: EXP, LOG10, SQRT, POWER (there are probably more).

    PaQu-Note: The last example fails with current PaQu, see issue #32 at http://github.com/adrpar/paqu/issues/32.

  7. Nested functions
    Functions can be nested, need to be parsed correctly. Example:

    SELECT log10(ABS(x)) AS log_x 
    FROM MDR1.FOF
    
  8. Aggregate functions
    Functions that operate on one to many rows.
    Supported functions: AVG, COUNT, SUM, MAX, MIN

    SELECT AVG(mass) AS mean_mass, COUNT(*) AS num
    FROM MDR1.FOF3
    

    More advanced example (mass function):

    SELECT 0.25*(0.5+FLOOR(LOG10(Mvir)/0.25)) AS log_mass, COUNT(*) AS num
    FROM MDR1.BDMV
    WHERE snapnum=85 
    GROUP BY FLOOR(LOG10(Mvir)/0.25)
    ORDER BY log_mass
    

    PaQu-Note: Needs complete reformulation, except for MIN/MAX

  9. Nested functions and aggregate functions

    SELECT log10(COUNT(*)), snapnum
    FROM MDR1.FOF 
    GROUP BY snapnum
    

    The log10(count(*)) part usually occurs as part of the mass function query from above, since usually both, mass and counts, are plotted logarithmically.

    PaQu-Note: This query is not reformulated correctly by current PaQu.

Using 2 or more tables, subqueries

  1. Simple subquery (nested query)
    In these queries, the subquery only returns exactly one value, otherwise this won't work and will return a syntax error when trying to execute the outer query. (Maybe it can work with WHERE IN()?)

    SELECT bdmId, Rbin, mass, dens FROM Bolshoi.BDMVProf
    WHERE bdmId =
        (SELECT bdmId FROM Bolshoi.BDMV
         WHERE snapnum=416 ORDER BY Mvir DESC LIMIT 1)
      OR
      bdmId = 
        (SELECT bdmId FROM Bolshoi.BDMV
         WHERE snapnum=416 ORDER BY Mvir DESC LIMIT 1,2)
    ORDER BY Rbin 
    

    Should work with any number of recursions, as many of these conditions combined with AND or OR.

  2. Subquery as table
    Probably only needed for more complicated joins later on.

    SELECT d.snapnum AS snapnum, d.dens AS dens 
    FROM 
    (SELECT snapnum, dens FROM Bolshoi.Dens256_z0) AS d
    LIMIT 100
    

    PaQu-Note: PaQu needs to first execute the subquery on the shard nodes, send results to head node, then do a join between the other table on the shard nodes and the temporary head nodes table.

Using 2 or more tables, joins

  1. Simple (implicit) join

    SELECT d.snapnum AS snapnum, r.zred AS zred, d.dens AS dens
    FROM Bolshoi.Dens256 AS d, 
     Bolshoi.Redshifts AS r
    WHERE d.snapnum=r.snapnum
    AND d.snapnum=36
    LIMIT 100
    

    PaQu-Note 1: PaQu needs to decide, which table is to be copied to the head node, so that shards can join with it.

    PaQu-Note 2: The Redshifts-table is NOT sharded, but copied on each node (full copy) for convenience. Thus, in order to avoid having replicated results (tenfold, in fact), one needs to add DISTINCT after the SELECT for the head node. This is not necessary for the shards, but would not hurt. We can also ask users to provide the DISTINCT keyword explicitely, if they want to avoid duplicated results (that's what we currently do), but users are always confused by this.

  2. Simple join, reformulated with one subquery

    SELECT d.snapnum AS snapnum, r.zred AS zred, d.dens AS dens
    FROM Bolshoi.Dens256_z0 AS d, 
     (SELECT snapnum, zred FROM Bolshoi.Redshifts) AS r
    WHERE d.snapnum=r.snapnum
    LIMIT 100
    

    PaQu-Note: PaQu can add the DISTINCT directly in the subquery. PaQu should prefer to copy the subquery-result to the head node, not the other table.

Queries with necessary joins,

i.e. for which nested queries are inefficient

  • Joins for merger trees
    If rewriting to nested, then two subqueries needed for the same row to get fofTreeId and mainLeafId.

    SELECT p.fofTreeId, p.treeSnapnum, p.mass, p.np
    FROM MDR1.FOFMtree AS p, 
    (SELECT fofTreeId, mainLeafId FROM MDR1.FOFMtree 
    WHERE fofId=85000000000) AS mycl
    WHERE p.fofTreeId BETWEEN mycl.fofTreeId AND mycl.mainLeafId
    ORDER BY p.treeSnapnum 
    
  • Joins with redshift table (multiple nested queries needed, if no joins allowed)

    SELECT d.snapnum AS snapnum, r.zred AS zred, r.aexp AS aexp 
    FROM 
    (SELECT DISTINCT snapnum FROM Bolshoi.Dens256_z0) AS d,
    (SELECT DISTINCT snapnum, zred, aexp FROM Bolshoi.Redshifts) AS r
    WHERE r.snapnum = d.snapnum 
    ORDER BY snapnum
    

    The first subquery could be replaced by any other query that returns multiple snapshot numbers, e.g. when tracing a halo through time, extracting merger trees, ...

  • Find halos in low density regions

    SELECT d.dens,h.bdmId,h.x,h.y,h.z,h.Mvir,h.Rvir,h.hostFlag 
    FROM MDR1.Dens512_z0 d, MDR1.BDMV h
    WHERE d.dens<1 AND h.snapnum=85 AND h.Mvir>1.e12
    AND h.phkey/8. = d.phkey
    ORDER BY d.dens
    
  • Find halo properties for a given semi-analytical galaxy

    SELECT h.Mvir, h.spin, g.diskMassStellar, g.diskMassStellar/h.Mvir AS mass_ratio
    FROM MDPL2.Rockstar AS h, MDPL2.Galacticus AS g
    WHERE g.rockstarId = h.rockstarId 
    AND h.snapnum=125 AND g.snapnum=125
    AND h.Mvir>1.e10
    ORDER BY g.diskMassStellar/h.Mvir