Aggregation

Docs

Aggregations pipeline stages

  • $match

  • $sort

  • $skip

  • $limit

  • $count

  • $facet

  • $graphLookup

  • Pipeline stages can appear multiple times in the pipeline with the exception of few like $out, $merge, and $geoNear

  • Tree structure data model

  • $lookup (for joins)

    It's like left outer join of relational database (can be one to one, one to many)

    SYNTAX
    {
    $lookup: {
    from: "otherCollection",
    localField: "x",
    foreignField: "y",
    as: "resultingArray"
    }
    }
    eg:
    db.products.aggregate([
    {
    $lookup:
    {
    from: 'masterData',
    localField:'pid',
    //field from the input document, so can be array or single key
    foreignField:'pid',
    /*field from the documents of the "from" collection,
    so target can be foreign primary key/ normal column*/
    as: "orderdetails"
    }
    }]).pretty()

    //(.) check where localField is in object array (may be product_ids.id)

    First Find then lookup

    db.products.aggregate([
    {
    $match:{
    pid:{
    $exists :true
    },
    qty:{
    $lte :40
    }
    }
    },
    {
    $lookup:
    {
    from: 'masterData',
    localField:'pid',
    foreignField:'pid',
    as: "orderdetails"
    }
    }]).pretty()
    • First Find then lookup with custom condition ($match) & projection ($project)
    • [docs](https://docs.mongodb.com/manual/reference/operator/aggregation/ lookup#specify-multiple-join-conditions-with-lookup) //$lookup-> pipeline-> $match - $expr
    • $lookup - $match can be also used in sub query concept
      db.products.aggregate([
      {
      $match:{
      pid:{
      $exists :true
      },
      qty:{
      $lte :40
      }
      }
      },
      {
      $lookup:
      {
      from: 'masterData',
      let: { cust_pid: "$pid"},
      pipeline: [
      { $match:
      { $expr:
      { $and:
      [
      { $eq: [ "$pid", "$$cust_pid" ] },
      { $eq: [ "$status", 1 ] }
      ]
      }
      }
      },
      { $project: { item: 1, _id: 0 } }
      ],
      as: "orderdetails"
      }
      }]).pretty()
      Note: All except the $out, $merge, and $geoNear stages can appear multiple times in a pipeline.//so $lookup, $match..etc multiple times docs[join caveat if field doesn't exist](https://docs.mongodb.com/manual/reference/operator/aggregation/lookup/ #perform-a-single-equality-join-with-lookup) //also merging objects if we don't need resultingArray for one to one mapping (don't use - bad syntax)
  • $group

    • Syntax & note points

      {
      $group:
      {
      _id: <group by expression>, // single prop OR object with multi prop
      <field1>: { <accumulator1> : <expression1> },
      ...
      }
      }

      Note :

      • If _id = null, accumulation happens for all the input documents as a whole.
      • some accumulators are $min, $max, $sum, $avg, $first, $last ..etc
      • $group stage has a limit of 100 megabytes of RAM
      • To allow for the handling of large datasets, set .aggregate() option [allowDiskUse](https://docs.mongodb.com/manual/ reference/method/db.collection.aggregate/#db.collection.aggregate) to true
    • Count the Number of Documents in a Collection

      Also check collection stats

      use samples
      db.companies.aggregate( [
      {
      $group: {
      _id: null,
      count: { $sum: 1 }
      }
      }
      ] )
      o/p:
      {
      "_id": null,
      "count": 18801
      }

      Note : To just count - we can use $count. Use above if want to perform other accumulators too on all rows.

      db.companies.aggregate( [
      {
      $count: 'total_rows'
      }
      ] )
      { "total_rows" : 18801 }
    • Group by single field (distinct example)

      db.companies.aggregate([{
      $group: {
      _id: "$name"
      }
      }])
      o/p:
      { "_id" : "EnteGreat" }
      { "_id" : "EnteGreat Solutions" }
      { "_id" : "goBookmaker" }
      { "_id" : "AfterLogic" }
      { "_id" : "Embedster" }
      { "_id" : "Geekdive" }
      { "_id" : "Contemporary Computer Services" }
      { "_id" : "Intergy" }
      { "_id" : "Getyoo" }
      { "_id" : "Eazeeloans" }
      ...

      Note : $name = $ defines passed document to the group stage

    • Multi stage aggregation (group by having example)

      db.companies.aggregate(
      [
      // First Stage
      {
      $group :
      {
      _id : "$name",
      totalDayCapacity: { $sum: {
      $multiply: [ "$number_of_employees", 8 ]
      } }
      }
      },
      // Second Stage
      {
      $match: { "totalDayCapacity": { $gte: 100 } }
      }
      ]
      )
      o/p:
      { "_id" : "Eazeeloans", "totalDayCapacity" : 200 }
      { "_id" : "Atlantic BT", "totalDayCapacity" : 240 }
      { "_id" : "UOL (Universo Online)", "totalDayCapacity" : 4000 }
      { "_id" : "Tiny Speck", "totalDayCapacity" : 360 }
      { "_id" : "VSE Corp", "totalDayCapacity" : 15360 }
      { "_id" : "Vertro", "totalDayCapacity" : 400 }
      { "_id" : "SERG", "totalDayCapacity" : 160 }
      ...
    • Multi stage aggregation (group by day/ date formatting example) Check date formatters & convert operators link in Project section

      db.companies.aggregate([
      // First Stage
      {
      $match: {
      "created_at": {
      $gte: new ISODate("2007-01-01"),
      $lt: new ISODate("2015-01-01")
      }
      }
      },
      // Second Stage
      {
      $group: {
      _id: {
      $dateToString: {
      format: "%Y-%m-%d",
      date: "$created_at"
      }
      },
      totalNewEmployeesOnDay: {
      $sum: "$number_of_employees"
      },
      averagePerCompanyEmployees: {
      $avg: "$number_of_employees"
      },
      noOfCompanies: {
      $sum: 1
      }
      }
      },
      // Third Stage
      {
      $sort: {
      noOfCompanies: -1
      }
      }
      ])
      o/p:
      {
      "_id" : "2007-05-25",
      "totalNewEmployeesOnDay" : 647,
      "averagePerCompanyEmployees" : 323.5,
      "noOfCompanies" : 2
      }
      {
      "_id" : "2007-06-01",
      "totalNewEmployeesOnDay" : 0,
      "averagePerCompanyEmployees" : null,
      "noOfCompanies" : 1
      }
      Note: In DB most created_at values are string, so less data
    • Group by Field & add other column values in array

      db.companies.aggregate([{
      $match: {
      "created_at": {
      $gte: new ISODate("2007-01-01"),
      $lt: new ISODate("2015-01-01")
      }
      }
      },
      {
      $group: {
      _id: {
      $dateToString: {
      format: "%Y-%m-%d",
      date: "$created_at"
      }
      },
      names: {
      $push: "$name"
      }
      }
      }
      ])
      o/p:
      { "_id" : "2007-06-01", "names" : [ "StumbleUpon" ] }
      { "_id" : "2007-05-25", "names" : [ "Wetpaint", "AdventNet" ] }
    • Group by Field & add entire row in array

      db.companies.aggregate([{
      $group: {
      _id: '$category_code',
      names: {
      $push: "$$ROOT"
      }
      }
      },
      {
      $limit: 2
      }
      ], {
      allowDiskUse: true
      })
    • Group by Field & add extra fields in next stage

      db.companies.aggregate([{
      $group: {
      _id: '$category_code',
      comps: {
      $push: "$$ROOT"
      }
      }
      },
      {
      $addFields: {
      totalEmp: {
      $sum: "$comps.number_of_employees"
      }
      }
      },
      {
      $limit: 2
      }
      ], {
      allowDiskUse: true
      })

      Check this query in compass aggregation tab

    • Group by Multiple Field

      db.companies.aggregate([{
      $group: {
      _id: {
      category_code: '$category_code',
      founded_year: '$founded_year'
      },
      count: {
      $sum: 1
      }
      }
      },
      {
      $sort: {
      "_id.category_code": -1
      }
      },
      {
      $limit: 10
      }
      ], {
      allowDiskUse: true
      })
      o/p:
      {
      "_id" : {
      "category_code" : "web",
      "founded_year" : 1880
      },
      "count" : 1
      }
      {
      "_id" : {
      "category_code" : "web",
      "founded_year" : 1962
      },
      "count" : 1
      }
      {
      "_id" : {
      "category_code" : "web",
      "founded_year" : 1990
      },
      "count" : 2
      }
      ...
  • $project

    • Syntax & Notes Either specify exactly which has to be included/ which has to be excluded Find projection (MD)

      { $project: { "<field1>": 0, "<field2>": 0, ... } }
      // Return all but the specified fields
    • Conditionally Exclude Fields

      db.books.aggregate( [
      {
      $project: {
      title: 1,
      "author.first": 1,
      "author.last" : 1,
      "author.middle": {
      $cond: {
      if: { $eq: [ "", "$author.middle" ] },
      then: "$$REMOVE",
      else: "$author.middle"
      }
      }
      }
      }
      ] )
    • Computed fields (Project new object example)

      Consider following

      {
      "_id" : 1,
      title: "abc123",
      isbn: "0001122223334",
      author: { last: "zzz", first: "aaa" },
      copies: 5
      }
      db.books.aggregate(
      [
      {
      $project: {
      title: 1,
      isbn: {
      prefix: { $substr: [ "$isbn", 0, 3 ] },
      group: { $substr: [ "$isbn", 3, 2 ] },
      publisher: { $substr: [ "$isbn", 5, 4 ] },
      title: { $substr: [ "$isbn", 9, 3 ] },
      checkDigit: { $substr: [ "$isbn", 12, 1] }
      },
      lastName: "$author.last",
      copiesSold: "$copies"
      }
      }
      ]
      )
      o/p:
      {
      "_id" : 1,
      "title" : "abc123",
      "isbn" : {
      "prefix" : "000",
      "group" : "11",
      "publisher" : "2222",
      "title" : "333",
      "checkDigit" : "4"
      },
      "lastName" : "zzz",
      "copiesSold" : 5
      }
    • Project New Array Fields

      Consider following

      {
      "_id" : 1,
      title: "abc123",
      isbn: "0001122223334",
      author: { last: "zzz", first: "aaa" },
      copies: 5
      }
      db.collection.aggregate( [ { $project: { myArray: [ "$title", "$isbn" ] } } ] )
  • $unwind

    Deconstructs an array field from the input documents to output a document for each element.

    • Normal $push

      consider
      {
      "_id": 3,
      "height": 480,
      "width": 640,
      "tags": ["kittens", "travel"]
      }
      ...
      use sample_pictures
      db.images.aggregate([
      {
      $limit:50
      },
      {
      $group: {
      _id: '$height',
      tags: {
      $push: '$tags'
      }
      }
      }])
      o/p:
      {
      "_id": 480,
      "tags": [ //array of arrays
      [
      "kittens",
      "travel"
      ],
      [
      "dogs",
      "kittens",
      "work"
      ],
      [
      "dogs",
      "work"
      ],
      ]
      }
      ...
    • Just $unwind

      use sample_pictures
      db.images.aggregate([
      {
      $unwind: '$tags'
      }])
      o/p: (single document splitted like cross join)
      { "_id" : 3, "height" : 480, "width" : 640, "tags" : "kittens" }
      { "_id" : 3, "height" : 480, "width" : 640, "tags" : "travel" }
      ...
    • $unwind then $push

      use sample_pictures
      db.images.aggregate([
      {
      $limit:50
      },
      {
      $unwind:'$tags'
      },
      {
      $group: {
      _id: '$height',
      tags: {
      $push: '$tags'
      }
      }
      }])
      o/p:
      {
      "_id": 480,
      "tags": [
      "kittens",
      "travel",
      "dogs",
      "kittens",
      "work",
      "dogs",
      "work"
      ]
      }
      ...
    • $unwind then $addToSet (to avoid duplicate)

      use sample_pictures
      db.images.aggregate([
      {
      $limit:50
      },
      {
      $unwind:'$tags'
      },
      {
      $group: {
      _id: '$height',
      tags: {
      $addToSet: '$tags'
      }
      }
      }])
      o/p:
      {
      "_id": 480,
      "tags": [
      "work",
      "kittens",
      "travel",
      "dogs"
      ]
      }
      ...
    • Check docs for unwinding array with in object array example

      In simple - say $items = objArr && tags = arr in that object
      // First Stage
      { $unwind: "$items" },
      // Second Stage
      { $unwind: "$items.tags" },
  • $bucket && $bucketAuto

    • Categorizes incoming documents into (groups) buckets based on a specified bucket boundaries.

    • $bucket

      • Syntax

        {
        $bucket: {
        groupBy: <expression>,
        boundaries: [ <lowerbound1>, <lowerbound2>, ... ],
        default: <literal>,
        output: {
        <output1>: { <$accumulator expression> },
        ...
        <outputN>: { <$accumulator expression> }
        }
        }
        }
      • Note :

        • An array of [ 0, 5, 10 ] creates two buckets:
        • [0, 5) with inclusive lower bound 0 and exclusive upper bound 5.
        • [5, 10) with inclusive lower bound 5 and exclusive upper bound 10.
        • default property is optional if all values fall under specified bucket boundaries.
      • Example:

        use samples
        db.restaurants.aggregate([{
        $bucket: {
        groupBy: '$rating',
        boundaries: [0, 3, 5, 10],
        default: "Other",
        output: {
        "count": {
        $sum: 1
        },
        "ratings": {
        $addToSet: "$rating"
        }
        }
        }
        }])
        output:
        { "_id" : 0, "count" : 23, "ratings" : [ 2, 1.5, 1, 2.5 ] }
        { "_id" : 3, "count" : 706, "ratings" : [ 3.5, 4, 3, 4.5 ] }
        { "_id" : 5, "count" : 1756, "ratings" : [ 5.5, 5, 6 ] }
        { "_id" : "Other", "count" : 63, "ratings" : [ "Not yet rated" ] }
    • $bucketAuto

      Bucket boundaries are automatically determined in an attempt to evenly distribute the documents.

      db.restaurants.aggregate([{
      $bucketAuto: {
      groupBy: '$rating',
      buckets: 4,
      output: {
      "count": {
      $sum: 1
      },
      "ratings": {
      $addToSet: "$rating"
      }
      }
      }
      }])
      output:
      {
      "_id": {
      "min": 1,
      "max": 5
      },
      "count": 729,
      "ratings": [4.5, 3, 2.5, 3.5, 2, 1.5, 4, 1 ]
      } {
      "_id": {
      "min": 5,
      "max": 5.5
      },
      "count": 1107,
      "ratings": [ 5 ]
      } {
      "_id": {
      "min": 5.5,
      "max": "Not yet rated"
      },
      "count": 649,
      "ratings": [ 6, 5.5 ]
      } {
      "_id": {
      "min": "Not yet rated",
      "max": "Not yet rated"
      },
      "count": 63,
      "ratings": [ "Not yet rated" ]
      }
  • $out (vs) $merge

    (Not practiced) //check for scoped temporary tables (like #localTempTable of SQL, instead ##globalTempTable)

    • $out Documents returned by the aggregation pipeline are written to a specified collection.

    • Use $merge. It is $out + missing features. But MongoDB 4.2 version

      • Syntax

        { $merge: {
        into: <collection> -or- { db: <db>, coll: <collection> },
        on: <identifier field> -or- [ <identifier field1>, ...],//Optional
        let: <variables>, //Optional
        whenMatched: <replace|keepExisting|merge|fail|pipeline>,//Optional
        whenNotMatched: <insert|discard|fail> //Optional
        } }
      • Syntax 1 (If using all default options for $merge )

        { $merge: <collection> } // Output collection is in the same database
      • Example:

        {
        $merge: {
        into: "myOutput",
        on: "_id",
        whenMatched: "replace",
        whenNotMatched: "insert"
        }
        }
      • Check Restrictions

      • For performance complex queries can run in night & create temp tables, let product query temp tables. Have TimeToLiveIndex so that previous day records get deleted.


Main Operators

Docs

  • $substr

  • $toUpper

  • $toLower

  • $trim //$ltrim and $rtrim also exists

  • $toDouble

  • $subtract

  • $round

  • $multiply

  • $strLenCP

  • $concat //can have expresion too (check word doc notes)

  • $dateToString

  • $dateFromString

  • $dateToParts

  • $convert

  • $cond //sql case expression

  • $slice

    Array slice(startIndex, noOfElements)

    Consider

    {
    "_id": 2,
    "height": 480,
    "width": 640,
    "tags": ["dogs", "kittens", "work"]
    }

    Use sample_pictures

    db.images.aggregate([{
    $match: {
    _id: 2
    }
    },
    {
    $project: {
    tags: {
    $slice: ["$tags", 1, 1]
    }
    }
    }
    ]).pretty();
    o/p:
    { "_id" : 2, "tags" : [ "kittens" ] }
  • $filter

    (Array filter)

    consider
    {
    "_id": 2,
    "height": 480,
    "width": 640,
    "tags": ["dogs", "kittens", "work"]
    }
    { $filter: { input: <array>, as: <string>, cond: <expression> } }
    • as = local variable that represents individual element of the input array. default is 'this'

      db.images.aggregate([{
      $match: {
      _id: 2
      }
      },
      {
      $project: {
      tags: {
      $filter: {
      input: "$tags",
      as: "myTag",
      cond: {
      $eq: ['$$myTag', 'dogs']
      } /*two $$ in '' to identify local/
      special variable value. one $ is field value.*/
      }
      }
      }
      }
      ]).pretty();
      o/p:
      { "_id" : 2, "tags" : [ "dogs" ] }
  • $size

    use sample_pictures
    db.images.aggregate([{
    $match: {
    _id: 2
    }
    },
    {
    $project: {
    tagCount: {
    $size: '$tags'
    }
    }
    }
    ]).pretty();
    o/p:
    { "_id" : 2, "tagCount" : 3 }

Accumulators

Docs

  • $first
  • Some accumulators are $min, $max, $sum, $avg, $first, $last, $addToSet, $push ..etc

Geo spatial

Docs

Note : Aggregation returns a cursor