Monday, February 12, 2018

Alternatives to Arrays for enhanced performance in PowerShell

PowerShell enhances Arrays with advanced functionality. The tradeoff is performance. For most scripts, this is acceptable. But sometimes unintended side effects of Array enhancement or the need for extreme speed require alternatives.

Array enhancement in PowerShell

PowerShell is a .Net language. When a script runs, the PowerShell engine compiles it on the fly and feeds it to the .Net engine for execution. But the PowerShell team wanted a way to enhance various aspects of .Net, so the PowerShell engine doesn’t just handle the compiling and interaction with .Net. It also adds a layer of functional enhancements that allow us to simply do things in PowerShell that can’t be done simply in .Net.

One of the most useful enhancements, which has been part of PowerShell from the beginning, is Array handling. Many basic PowerShell techniques, such as working with Where-Object, Select-Object, etc., are built around Arrays. So we need enhancements to make it easy to work with the Arrays we use to work with everything else.

Adding elements to an Array

The most notable Array enhancement, and the one that ultimately leads to the issues discussed in this article, is the ability to add new elements to an Array.

$Names1 = @( 'Tim', 'Joe' )
$Names1 += 'Hannah'

You can’t really add new elements to Arrays, as they are defined in .Net as being of fixed size. When we added string 'Hannah' to Array $Names1, above, PowerShell created an entirely new Array with room for 3 elements, copied 'Tim' and 'Joe' from the old Array to the new Array, set the third element’s value to 'Hannah', and then repointed the variable $Names1 to the new Array. The old Array was abandoned in memory.

Most of the time, this is great. It makes thing much easier. For typical scripts, the performance tradeoff is barely measureable. In the above example, it takes only 20 microseconds to add the new element.

Problem 1 - performance at scale

What works well for a small number of things doesn’t always work well at large scales.

If adding 'Hannah' only takes 20 microseconds, you might thing that adding additional Array elements would only take additional 20-microsecond blocks.

But if we add 'Mike', PowerShell has to copy 3 elements to the new Array before adding 'Mike'. And if we then add 'Derek', it has to copy 4 elements to the new Array, etc. And now we have the original Array with two elements, the next version with 3 elements, and another version with 4 elements, etc., all sitting in memory unused. If there is memory pressure, the .Net engine will look around and find and delete the abandoned Arrays, but that takes time and yet more CPU effort.

Each new element takes longer to add than the one before it. Eventually it becomes noticeable. Eventually it becomes unacceptably slow. Eventually memory handling becomes an issue and the slowing accelerates.

It takes about 1 second to add 1,000 elements to an Array one at a time. Not too bad for a script. And well within the Array size requirements for most scripts

Above 1,000 elements, it becomes unacceptable, and an alternative is needed.

Alternative to generic Array - ArrayList

The alternative for a generic Array is an ArrayList. An ArrayList is very similar to an Array. It is a collection of elements in a particular order.

Arrays and ArrayLists can be converted back and forth by simply casting them as the desired type. The easiest way to create a new ArrayList is to cast an empty Array as an ArrayList.

$ArrayList = [System.Collections.ArrayList]@()
$Array     = [Array]$ArrayList
$ArrayList = [System.Collections.ArrayList]$Array

ArrayLists do not have PowerShell enhancements, so they are not as simple to work with, which is why we don’t use them unless we need them. You cannot add elements to an ArrayList using the PlusEquals operator. This does not work:

# This does not work as expected
$Names2 = [System.Collections.ArrayList]@( 'Tim', 'Joe' )
$Names2 += 'Hannah'
# because PowerShell converts $Names to an [Array] to perform the addition

It appears to work (and it sort of does work), but only because PowerShell “helpfully” converts the ArrayList into an Array before doing the addition, so we didn’t gain anything.

To add an element to an ArrayList, we use its .Add() method. The existing elements are left alone, an additional element slot is added to the ArrayList, and the new element is assigned to it.

$Names3 = [System.Collections.ArrayList]@( 'Tim', 'Joe' )
$Names3.Add( 'Hannah' )

If we want to add multiple elements at one time, we can use the .AddRange() method. Another nice feature of ArrayLists is that we can use the .Remove() and .RemoveAt() methods to remove elements, something we can’t do with Arrays even with enhancements.

Here is what it looks like to use an ArrayList to optimize adding elements one at a time. If we are working against a list of tens of thousands of employees, this is much faster than using an Array. As Arrays are easier to work with, we convert the ArrayList to an Array at the end for any further handling.
$Invites = [System.Collections.ArrayList]@()

ForEach ( $Person in $Employees )
    {
    $Details = Get-PersonDetails $Person
        {
        If ( $Details.IsPersonILike )
            {
            $Invites.Add( $Person )
            }
        }
    }

$Invites = [Array]$Invites

Problem 2 - typed Arrays revert to generic Arrays

Another problem with the process PowerShell uses for adding elements to an array is that when PowerShell creates the new, slightly larger array, it does not respect what type of array we started with. It always creates the new array as a generic array.

Arrays are not just Arrays. All Arrays are a subclass of [Array]. The subclass defines what types of elements are in the array. When a particular subclass isn’t specified, PowerShell by default creates an [Object[]] Array. As all types in .Net are derived from the [Object] class, any type of object can be an element in an [Object[]] array. (In depth article here - An array is not an array: Discovering an abstract class in PowerShell)

We can create an Array that only hold elements of a specific type by explicitly telling PowerShell that is what we want to do. The simplest way to create a typed Array is by casting a generic Array as the desired Array subclass. We reference an Array subclass by referencing the desired element type, with an extra set of square brackets inside the end of it.

This creates and assigns a [String[]] Array to the value of variable $Names4.

$Names4 = [String[]]@( 'Tim', 'Joe' )

But when we add an element to it, PowerShell converts it to a generic [Object[]] Array, even if the new element is a string.

$Names4 = [string[]]'Tim', 'Joe' )
$Names4.GetType().FullName  # yields System.String[]
$Names4 += 'Hannah'
$Names4.GetType().FullName  # yields System.Object[] !!!

Strongly type variable to preserve Array subclass

This can be prevented by strongly typing the variable, rather than just the contents of the variable.

[string[]]$Names5 = ( 'Tim', 'Joe' )
$Names5.GetType().FullName  # yields System.String[]
$Names5 += 'Hannah'
$Names5.GetType().FullName  # yields System.String[]

A strongly typed Array variable is useful because it automatically handles confirming an element is of the right type and, when needed and possible, automatically handles the conversion.

If you try to add an integer to a [String[]] array, PowerShell simply converts the number to a string while adding it.

[string[]]$Names6 = ( 'Tim', 'Joe' )
$Names6 += 3
$Names6.GetType().FullName  # yields System.String[]

Problem 1 + 2 - performance at scale for strongly typed Arrays

The performance at scale problem is even worse for strongly typed Array variables than it is for generic Arrays.

That’s because PowerShell still creates the [Object[]] Array, and then when it tries to assign the new array to the strongly typed variable, it has to perform the additional step of creating yet another Array, this time of the correct subclass. Not only does it have to copy everything over yet again, but it also checks every single element to be sure it is of the correct type, even though only the new elements could possibly need conversion to the specified type.

So the performance for adding elements to strongly typed Array variables is worse than for generic Arrays, and the larger the Array, the worse the worseness gets. Adding 1,000 elements to a strongly typed Array variable takes 1.5 times as long as with a generic Array, and adding 10,000 elements takes 5 times longer. (My test for 100,000 elements is still running. Don’t do that. Find an alternative somewhere.)

But we can’t use an ArrayList in place of a strongly typed Array variable, because ArrayLists don’t let you limit the element type. So we need another alternative.

Alternative to a strongly typed Array variable - List

The typed version of an ArrayList is a List.

Referencing the List type is different from a typed Array. We use the fully qualified name of List type with the desired element type in square brackets within the end of the parent square brackets. As with the typed Arrays and ArrayLists, we can create Lists by casting empty or existing Arrays. As with ArrayLists, we add items to a List using its .Add() method.

$Names7 = [System.Collections.Generic.List[String]]@( 'Tim', 'Joe' )
$Names7.Add( 'Hannah' )

And as with ArrayLists, we can easily convert a List back to an array by casting.

$Invites = [System.Collections.Generic.List[String]]@()

ForEach ( $Person in $Employees )
    {
    $Details = Get-PersonDetails $Person
        {
        If ( $Details.IsPersonILike )

            {
            $Invites.Add( $Person )
            }
        }
    }

$Invites = [String[]]$Invites

Thanks to Dave Wyatt for the tip on using Lists.

No comments:

Post a Comment