Wednesday, February 14, 2018

Group objects evenly by size in PowerShell

A PowerShell function to group objects evenly by size, using a simple packing algorithm. But with advanced pipeline techniques for handing nested arrays and leveraging the PowerShell parser to protect against code injection. Just for fun.

The initial challenge

Brett Miller and Bob Frankly recently posed the hypothetical question on the PowerShell Slack channel, how can you easily split a collection of objects into 4 groups, based on the size of each object, such that the total size of each group of objects is approximately the same?

I threw together a quick handful of lines of PowerShell code to meet the requirements. And then I started to work on a generic function to solve the problem generally.

And that’s when it got interesting.

The added complexity

I wanted the function to be able to take input either through a parameter or from a pipeline. Techniques for handling that are well known. But as the input could be nested arrays, the standard techniques would not work for both input methods. I had to find out how to determine when the function is part of a pipeline.

I wanted the user not only to be able to specify a property to use to determine an object’s size, but also to be able to specify a nested property. For example, if I have an Active Directory $Group, the size of the group is in $Group.Members.Count. But if I let the user give me a string I am going to execute, I needed a way to confirm the string contains nothing but nested property names, and not a code injection.

I wanted the user to also be able to group objects by simple object count. I realized that this would be default behavior of the functions when it is used with non-collection objects simply by making the default -Property value “Count”, with no extra coding needed.

The code

Let’s define the function.

function Group-Evenly
    {

A good comment-based help block is good. I do this for functions that I’m sharing, or that I’m likely to reuse in other scripts, or when I want to thoroughly document it for myself or the poor soul that has to maintain my code after I’ve moved on.

     <#
    .SYNOPSIS
        Evenly divides input objects into a given number of groups
        optionally weighted by the value of a given property.

    .DESCRIPTION
        Creates specified number of groups (arrays)
        Input object are sorted by value of the specified Property, descending
            (If no property is specified, .Count is used)
        Each object is placed in the group with the smallest totale value of the specified Property

        This algorithm may not always produce an optimal result, but does
        produce a reasonable result quickly compared to the brute force
        required to guarantee an optimal result.

    .OUTPUT
        [array[]]

    .PARAMETER InputObject
        Objects to be grouped
        Accepts pipeline input
        Unlike most commands, accepts Null pipeline input

    .PARAMETER Property
        String - Property to use to determine object size for weighted grouping
        Accepts nested property names, e.g. - Members.Count
        Default to "Count"

    .PARAMETER Number
        Int32 - Number of groups to create
        Defaults to 2

    .EXAMPLE
        $Users = Get-ADUser -Filter *
        $Teams = Group-Evenly -InputObject $Users

        Results in two arrays, each with half of the users.

    .EXAMPLE
        $DataChunks = Get-ChildItem C:\Temp -File |
            Group-Evenly -Property Length -Number 4

        Results in four arrays of files, grouped such that the total file sizes
        of the groups are approximately equal.

    .EXAMPLE
        $Meetings = Get-ADGroup -Filter { Name -like "Dept*" } -Properties Members |
            Group-Evenly -Property Members.Count -Number 6

        Results in six arrays of AD department groups, grouped such that the total
        membership of the grouping are approximately equal

    .EXAMPLE
        $Whatever = Get-ChildItem C:\Temp -File |
            GroupEvenly -Property Directory.Parent.FullName.Length

        Results in two arrays of files, grouped evenly but weighted by the length
        of the full path of the parent of the file's directory. That is, of course,
        completely useless, but I didn't feel like taking the time to come up with
        a better example of using a deeply nested property value.

    .NOTES
        v 1.0 Tim Curwick Created
    #>

[cmdletbinding()] tells PowerShell to automatically do various advanced function things and better parameter handling than otherwise. In PowerShell 4.0 and up, [cmdletbinding()] is not needed if [parameter()] is used, but it doesn’t hurt to add it, and it’s good practice so I don’t forget it in those scripts where I do need it.

    [cmdletbinding()]
    Param (

Parameter $InputObject will hold the objects to group. It will be an array of any type of object that needs to be grouped. We want to be able to take objects from the pipeline.

We are not making it mandatory, because I like the behavior of returning empty groups instead of nothing if $InputObject is empty.

        [parameter( ValueFromPipeline = $True )]
        [array]$InputObject,

Parameter $Property is the string describing the path to the property or nested property to use to determine the size of the objects.

By defaulting to ‘Count’, arrays are automatically grouped according to the number of elements they have, and objects that are not collections are simply split into groups with equal numbers of objects.

        [string]$Property = 'Count',

Parameter $Number is the number of groups to divide the objects into.

        [int]$Number = 2 )

Because we want to act on pipeline objects, but not one at a time, we have to gather them up before starting to work on them. We’ll use a Begin block to create an array to hold the objects, a Process block to add objects to the array as they come in the pipeline or from the parameter, and then do all of the actual work in the End block.

    Begin
        {
        # Initialize array
        $RawItems = @()
        }

Typically, in a Process block such as this, we would simply add any incoming objects to the array.

But atypically here, the individual objects may themselves be arrays. PowerShell’s special handling of arrays requires us to do some special handling to get our desired behavior.

If we get an array from the -InputObject parameter, the element of the array are the objects we want to sort, and what we want to add to $RawItems.

But if we get an array from the pipeline, that means it was itself an element nested within a parent array. In that case, we don’t want to add each element of the array to $RawItems. We want to add the entire array as a single element in $RawItems.

To distinguish between the two, we need to be able to tell the when the function is running in a pipeline. Thanks to Øyvind Kallstad for his blog Quick tip: Determine if input comes from the pipeline or not with the answer.

$PSCmdlet.MyInvocation.ExpectingInput is $True if we’re in a pipepline.

    Process
        {
        # If input is from pipeline
        # Treat an array as a single input item
        If ( $PSCmdlet.MyInvocation.ExpectingInput )
            {

If we are in a pipeline, we want the array to be added as a single element to $RawItems. To do that, we use a unary comma to indicate that it is an element. ,@($x) results in what this looks like: @( @( $x ) ). But @( @( $x ) ) results in @( $x ) by design, so we have to resort to the unary comma.

            $RawItems += ,$InputObject
            }

If we are not in a pipepline, we want to add all of the element of the $InputObject to $RawItems, which is the normal PowerShell behavior when “adding” two arrays.

        # Else (input is from paramter)
        # Treat an array as a collection of input items
        Else
            {
            $RawItems += $InputObject
            }
        }

Once we have collected all of the objects from the pipeline, the End block runs, and we can actually do some work.

    End
        {

First we create a string which, when executed, will get the size of the object based on the $Property string.

        ## Test for code injection

        # Build property string
        $SizeString = "`$_.$Property"

Then we need to check it to confirm that it really will do nothing other than reference a property or nested property. A scripter might take input from an untrusted source and use it to populate the -Property value, and end up with Group-Evenly -Property 'x;Remove-Item C:\*.* -Recurse' which would be a problem if we didn’t do this check.

To run this or any code, the PowerShell parser first needs to cut it up into identified tokens. We can leverage that capability, and ask PowerShell to do so now, and identify the tokens for us to parse.

        # Use PowerShell parser to tokensize the property string
        $TokenErrors = [System.Collections.ObjectModel.Collection[System.Management.Automation.PSParseError]]@()
        $Tokens = [System.Management.Automation.PSParser]::Tokenize( $SizeString, [ref]$TokenErrors )

If there were no errors during tokenizing, we set a validity flag to the $True and continue. If there were errors, the scriptblock is not going to run properly, and we set the flag to $False.

        # If there are errors, it won't work anyway; set to invalid
        $PropertyValid = $TokenErrors.Count -eq 0

Then we examine the tokens. The tokens would look like this if $Property = 'Members.Count'.

Content     Type Start Length StartLine StartColumn EndLine EndColumn
-------     ---- ----- ------ --------- ----------- ------- ---------
_       Variable     0      2         1           1       1         3
.       Operator     2      1         1           3       1         4
Members   Member     3      7         1           4       1        11
.       Operator    10      1         1          11       1        12
Count     Member    11      5         1          12       1        17
...      NewLine    16      2         1          17       2         1

Or like this if $Property = 'x;Remove-Item C:\*.* -Recurse '

Content                   Type Start Length StartLine StartColumn EndLine EndColumn
-------                   ---- ----- ------ --------- ----------- ------- ---------
_                     Variable     0      2         1           1       1         3
.                     Operator     2      1         1           3       1         4
x                       Member     3      1         1           4       1         5
;           StatementSeparator     4      1         1           5       1         6
Remove-Item            Command     5     11         1           6       1        17
C:\*.*         CommandArgument    17      6         1          18       1        24
-Recurse      CommandParameter    24      8         1          25       1        33
...                    NewLine    33      2         1          34       2         1

The $ in $_ is simply an indicator that a variable name follows, and is not included in any of the tokens.

In our script, the first token is always Type “Variable” and Content “_”, and the second token is always Type “Operator” and Content “.”, because we hardcoded $_. at beginning of the scriptblock. So we can ignore those.

In valid code (by our definition), all of the remaining tokens are of either Type “Operator”, “Member”, or “NewLine”. So if any of the tokens are of any other Type, we set the $PropertyValid flag to $False.

The only Operator token we need has the Content “.”. If any other Operators are present, we set the flag to $False.

            # If there are any tokens after the $_ other than .PropertyName.PropertyName.etc
            # (Bad -Property value (or code injection))
            # Set to invalid
            $Tokens[2..($Tokens.Count-1)].
                Where{
                    $_.Type -notin 'Operator', 'Member', 'NewLine' -or
                    ( $_.Type -eq 'Operator' -and $_.Content -ne '.' ) }.
                ForEach{ $PropertyValid = $False }

Otherwise, we are safe to proceed. (You might be concerned that a Member token can be a method rather than a property, but if it was, there would be associated GroupStart and GroupEnd tokens holding the parentheses following the method name, which are not allowed. If there were an extra NewLine token in there, it would have to be followed by something we aren’t allowing in order to be dangerous. Even if a NewLine were followed by a dot Operator that is intended as a call operator rather than as a member operator, the call operator would only be dangerous if it were followed by a String, Variable, or some other type of token that we are not allowing.)

        # If property string is valid
        # continue
        If ( $PropertyValid )
            {

We create an array of the correct number of arrays to hold groups that we will group the input objects into. The simplest way to do this is to create an array with a single empty array as an element, using the unary comma discussed earlier. Then we “multiply” the array by the number of groups we want, which in PowerShell means make X additional elements in the array that are copies of the existing element(s).

            # Initialize array with the desired number of groups
            $Groups = ,@() * $Number

We will want to quickly check the sizes of each group as we go along. Rather than re-measure the groups each time, we’ll store the running totals in an integer array. The index of a size in this array will match the index of the group in the group array that it measures.

Again, the simplest (or prettiest) way to do this is to create an array with a single zero, and then multiply it to get the correct number of zeros.

            # Initialize array to hold group sizes
            $Sizes  = @(0* $Number

We will be frequently referencing the last index of the arrays. Rather than repeating the calculation, we do it once and store it in a variable. It’s faster, and it makes the code prettier.

            # Get highest index number
            $TopIndex = $Number - 1

Then we convert the $SizeString to a [ScriptBlock] for later execution. We didn’t do it until we were sure it was valid so that the conversion doesn’t throw an error. (We’ll throw our own error later if it isn’t valid.) We’re doing it now instead of in the following code so that it is only done once. It’s faster and makes the code prettier.

            # Convert size string to a scriptblock
            $SizeBlock = [ScriptBlock]::Create( $SizeString )

To simplify handling of the objects and their sizes, we’re wrapping them in custom objects along with their sizes, and then sort them by size, biggest ones first.

            # Create an array with the items and their calculated sizes
            # Sort by size descending
            $Items = $RawItems |
                Select-Object -Property @(
                    @{ Label = 'Value'; Expression = { $_ } }
                    @{ Label = 'Size' ; Expression = $SizeBlock } ) |
                Sort-Object -Property Size -Descending

Then we simply go through the items, biggest ones first, and put each in whatever group has the most room.

            # For each item (starting with the largest)
            # Place item in smallest group
            ForEach ( $Item in $Items )
                {

For each possible group index, we sort them by the sizes of the group, and take the first one (the index of the smallest group).

                # Find the index of the smallest group
                $Smallest = 0..$TopIndex | Sort-Object -Property { $Sizes[$_] } | Select-Object -First 1

Add the item to the smallest group.

                # Add the item to the smallest group
                $Groups[$Smallest] += $Item.Value

Add the size of the item to the running total size of the smallest group.

                # Add the size of the item to the group size
                $Sizes[ $Smallest] += $Item.Size
                }

Repeat until all of the items are placed.

Return the results.

            # Return the results
            return $Groups
            }

If the $Property string is invalid, we throw an error. We use Write-Error instead of keyword Throw to make it a non-terminating error whose behavior is dictated by $ErrorActionPreference or by using the common parameter -ErrorAction on our function. (We don’t have to define -ErrorAction; [cmdetbinding()] took care of that for us.)

        # Else (invalid Property value)
        # Throw error (respecting ErrorAction)
        Else
            {
            Write-Error -Message "Invalid Property value."
            }
        }
    }


Usage

Now we can use our new function.

Using default parameters, and no pipeline, we can take all of our users and split them into two groups.

$Users = Get-ADUser -Filter *
$Teams = Group-Evenly -InputObject $Users

Or let’s say the file share where my user’s home drives are stored need to be split up onto four new drives. I don’t care which users go where, but I want the total spaced used on each share the be roughly equal.

$UserFolders = Get-ChildItem $SourceShare -Directory |
    ForEach-Object {
        [pscustomobject]@{
            FullName = $_.FullName
            FolderSize = Get-ChildItem $_.FullName -File -Recurse |
                Measure-Object -Property Length -Sum |
                Select-Object -ExpandProperty Sum } }

$NewShareGroups = $UserFolders |
    Group-Evenly -Property FolderSize -Number 4

HR needs to schedule 6 meetings to talk with all employees. They want all employees in a given department to go to the same meeting, and they want the meetings to be of roughly the same size. There are Active Directory groups that define who is in what department.

$Meetings = Get-ADGroup -Filter { Name -like "Dept*" } -Properties Members |
    Group-Evenly -Property Members.Count -Number 6

Just to test how well we handle nested properties, let’s group some files based on the length of the full name of the parent of the file’s directory. It’s kinda stupid, but I didn’t have the time to come up with a better example.

$Whatever = Get-ChildItem C:\Temp -File -Recurse |
    GroupEvenly -Property Directory.Parent.FullName.Length

Full function

function Group-Evenly
    {
    <#
    .SYNOPSIS
        Evenly divides input objects into a given number of groups
        optionally weighted by the value of a given property.

    .DESCRIPTION
        Creates specified number of groups (arrays)
        Input object are sorted by value of the specified Property, descending
            (If no property is specified, .Count is used)
        Each object is placed in the group with the smallest totale value of the specified Property

        This algorithm may not always produce an optimal result, but does
        produce a reasonable result quickly compared to the brute force
        required to guarantee an optimal result.

    .OUTPUT
        [array[]]

    .PARAMETER InputObject
        Objects to be grouped
        Accepts pipeline input
        Unlike most commands, accepts Null pipeline input

    .PARAMETER Property
        String - Property to use to determine object size for weighted grouping
        Accepts nested property names, e.g. - Members.Count
        Default to "Count"

    .PARAMETER Number
        Int32 - Number of groups to create
        Defaults to 2

    .EXAMPLE
        $Users = Get-ADUser -Filter *
        $Teams = Group-Evenly -InputObject $Users

        Results in two arrays, each with half of the users.

    .EXAMPLE
        $DataChunks = Get-ChildItem C:\Temp -File |
            Group-Evenly -Property Length -Number 4

        Results in four arrays of files, grouped such that the total file sizes
        of the groups are approximately equal.

    .EXAMPLE
        $Meetings = Get-ADGroup -Filter { Name -like "Dept*" } -Properties Members |
            Group-Evenly -Property Members.Count -Number 6

        Results in six arrays of AD department groups, grouped such that the total
        membership of the grouping are approximately equal

    .EXAMPLE
        $Whatever = Get-ChildItem C:\Temp -File |
            GroupEvenly -Property Directory.Parent.FullName.Length

        Results in two arrays of files, grouped evenly but weighted by the length
        of the full path of the parent of the file's directory. That is, of course,
        completely useless, but I didn't feel like taking the time to come up with
        a better example of using a deeply nested property value.

    .NOTES
        v 1.0 Tim Curwick Created
    #>

    [cmdletbinding()]
    Param (
        [parameter( ValueFromPipeline = $True )]
        [array]$InputObject,
        [string]$Property = 'Count',
        [int]$Number = 2 )

    Begin
        {
        # Initialize array
        $RawItems = @()
        }
    Process
        {
        # If input is from pipeline
        # Treat an array as a single input item
        If ( $PSCmdlet.MyInvocation.ExpectingInput )
            {
            $RawItems += ,$InputObject
            }

        # Else (input is from paramter)
        # Treat an array as a collection of input items
        Else
            {
            $RawItems += $InputObject
            }
        }
    End
        {
        ## Test for code injection

        # Build property string
        $SizeString = "`$_.$Property"

        # Use PowerShell parser to tokensize the property string
        $TokenErrors = [System.Collections.ObjectModel.Collection[System.Management.Automation.PSParseError]]@()
        $Tokens = [System.Management.Automation.PSParser]::Tokenize( $SizeString, [ref]$TokenErrors )

        # If there are errors, it won't work anyway; set to invalid
        $PropertyValid = $TokenErrors.Count -eq 0

        # If there are any tokens after the $_ other than .PropertyName.PropertyName.etc
        # (Bad -Property value (or code injection))
        # Set to invalid
        $Tokens[2..($Tokens.Count-1)].
            Where{
                $_.Type -notin 'Operator', 'Member', 'NewLine' -or
                ( $_.Type -eq 'Operator' -and $_.Content -ne '.' ) }.
            ForEach{ $PropertyValid = $False }
       
        # If property string is valid
        # continue
        If ( $PropertyValid )
            {
            # Initialize array with the desired number of groups
            $Groups = ,@() * $Number

            # Initialize array to hold group sizes
            $Sizes  = @(0* $Number

            # Get highest index number
            $TopIndex = $Number - 1

            # Convert size string to a scriptblock
            $SizeBlock = [ScriptBlock]::Create( $SizeString )

            # Create an array with the items and their calculated sizes
            # Sort by size descending
            $Items = $RawItems |
                Select-Object -Property @(
                    @{ Label = 'Value'; Expression = { $_ } }
                    @{ Label = 'Size' ; Expression = $SizeBlock } ) |
                Sort-Object -Property Size -Descending
   
            # For each item (starting with the largest)
            # Place item in smallest group
            ForEach ( $Item in $Items )
                {
                # Find the index of the smallest group
                $Smallest = 0..$TopIndex | Sort-Object -Property { $Sizes[$_] } | Select-Object -First 1

                # Add the item to the smallest group
                $Groups[$Smallest] += $Item.Value

                # Add the size of the item to the group size
                $Sizes[ $Smallest] += $Item.Size
                }
   
            # Return the results
            return $Groups
            }
   
        # Else (invalid Property value)
        # Throw error (respecting ErrorAction)
        Else
            {
            Write-Error -Message "Invalid Property value."
            }
        }
    }

No comments:

Post a Comment