Friday, September 1, 2017

Remove comments and whitespace from PowerShell scripts

A PowerShell.Slack.com user asked if it was possible to easily remove the comments and whitespace from a scriptblock to reduce the size. I was intrigued by the challenge, and came up with this function.

I also immediately put it to use. I manage a PowerShell GUI application that I wrap in an executable using Sapien PowerShell Studio. Comments and whitespace don’t serve any function in the final wrapped package, so my build script now runs this function on the code before running the Sapien build. The resulting executable is 31% smaller than it used to be.

If you use this function, test the results thoroughly. I am reasonably sure that this will work fine with most code, but I can’t guarantee that it won’t break your code. This will break any comment-based help if it uses multiple #comments instead of <#multiline comments#>, as only the section headings will be recognized and left untouched.


I realized we can use the Tokenize() method of the PowerShell parser to split up the scriptblock into identified chunks. This also effectively strips out all horizontal whitespace, as the parser just ignores it. Then we can take all of the tokens except the non-functional comments, and put them back together again into a leaner scriptblock. It’s slightly more complicated than that, but not much.
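To see what the tokenizer hands back, here is a minimal sketch run against a made-up one-line script. The sample string and the variable names are assumptions for illustration only, not part of the function. Each token carries a Type and its Content, and the extra spaces in the sample do not show up as tokens at all.

```powershell
# Sample input (hypothetical): extra spaces and a trailing comment
$Sample = 'Get-Item   -Path C:\Temp   # look up the folder'

# Tokenize it; [ref]$Null discards any parse errors
$SampleTokens = [System.Management.Automation.PSParser]::Tokenize( $Sample, [ref]$Null )

# List each token's type and content
$SampleTokens | Select-Object -Property Type, Content
```

The comment comes back as a token of type Comment, which is what makes it easy to filter out later.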

The ::Tokenize() method can work with a scriptblock, a string, or an array of strings, so we will similarly accept pretty much anything as input for our function. It would be convenient to be able to use the function in a pipeline, so let’s turn that on.

function Remove-CommentsAndWhiteSpace
    {
    # We are not restricting scriptblock type as Tokenize() can take several types
    Param (
        [parameter( ValueFromPipeline = $True )]
        $Scriptblock
        )

We want to accept pipeline input, but we need to process the script as a whole, not as individual lines. So we just use the Process block to collect all of the input in a single collection.

    Begin
        {
        # Initialize collection
        $Items = @()
        }

    Process
        {
        # Collect all of the inputs together
        $Items += $Scriptblock
        }

    End
        {
        ## Process the script as a single unit
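The Begin/Process/End collection pattern above can be tried in isolation. This is a minimal sketch using a hypothetical helper, Get-PipelineItemCount (a name invented here for illustration), that collects everything in Process and only does its work in End.

```powershell
# Hypothetical helper, for illustration only
function Get-PipelineItemCount
    {
    Param (
        [parameter( ValueFromPipeline = $True )]
        $InputItem
        )

    # Runs once, before any input arrives
    Begin   { $Items = @() }

    # Runs once per pipeline item (or once in total for parameter input)
    Process { $Items += $InputItem }

    # Runs once, after all input has been collected
    End     { $Items.Count }
    }

# Three items through the pipeline
$FromPipeline  = 'a', 'b', 'c' | Get-PipelineItemCount

# The same three items passed as a parameter
$FromParameter = Get-PipelineItemCount -InputItem 'a', 'b', 'c'
```

Either way the End block sees all three items together, which is what lets the real function treat the script as a single unit.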

And then, despite ::Tokenize()’s ability to handle almost anything, we’re going to turn whatever comes in into a single string anyway, so that we can come back to it later to grab parts of it. We use a new variable and leave the input untouched, so we can later base our output type on the input type. The -join operator converts the collected $Items to strings if needed, and then concatenates them with line breaks in between.

        # Convert input to a single string if needed
        $OldScript = $Items -join [environment]::NewLine
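As a quick illustration of that conversion, here is a small sketch with made-up inputs (the variable names are assumptions, not part of the function). A scriptblock turns into its source text, and an array of lines turns into one multi-line string.

```powershell
# A scriptblock converts to its source text (without the outer braces)
$FromScriptblock = @( { Get-Date } ) -join [environment]::NewLine

# An array of lines becomes one multi-line string
$FromLines = @( '$a = 1', '$b = 2' ) -join [environment]::NewLine
```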

If the input is just white space, there is nothing to do.

        # If no work to do
        # We're done
        If ( -not $OldScript.Trim( " `n`r`t" ) ) { return }

We use ::Tokenize() to parse the script and turn it into “tokens”, identified as commands, comments, strings, variables, etc. The method requires a reference variable for dumping parsing errors. We don’t need those, so we give it the odd construction [ref]$Null to tell it to send those nowhere.

        # Use the PowerShell tokenizer to break the script into identified tokens
        $Tokens = [System.Management.Automation.PSParser]::Tokenize( $OldScript, [ref]$Null )
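If you would rather look at the parsing errors than throw them away, a sketch along these lines should work. The variable names and the deliberately broken sample are assumptions for illustration; the function itself does not do this.

```powershell
# A deliberately broken sample: the string is never closed
$BrokenScript = 'Write-Output "unterminated'

# Supply a real variable instead of $Null to receive the errors
$ParseErrors = $Null
$BrokenTokens = [System.Management.Automation.PSParser]::Tokenize( $BrokenScript, [ref]$ParseErrors )

# Each error has a Message describing the problem
$ParseErrors | Select-Object -Property Message

# A well-formed script leaves the error collection empty
$CleanErrors = $Null
$CleanTokens = [System.Management.Automation.PSParser]::Tokenize( 'Write-Output "terminated"', [ref]$CleanErrors )
```

A build script could use a check like this to refuse to minify a script that does not parse cleanly.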

The resulting $Tokens do not contain any horizontal whitespace, as the parser simply ignores it.
We don’t want any comments, so we strip those out. But not quite all of them. Comment-based help and #requires statements need to stay in to keep that functionality. We’ll identify comments to keep by looking at the first word in the comment, so we define a list of words that identify allowed comments.

        # Define useful, allowed comments
        $AllowedComments = @(
            'requires'
            '.SYNOPSIS'
            '.DESCRIPTION'
            '.PARAMETER'
            '.EXAMPLE'
            '.INPUTS'
            '.OUTPUTS'
            '.NOTES'
            '.LINK'
            '.COMPONENT'
            '.ROLE'
            '.FUNCTIONALITY'
            '.FORWARDHELPCATEGORY'
            '.REMOTEHELPRUNSPACE'
            '.EXTERNALHELP' )

If a token is not a comment, we pass it through to keep. If a token is a comment, we parse the .Content, again leveraging the smarts of ::Tokenize(), to find the first word in the comment. If it’s in the allowed list, we pass it through to keep.

        # Strip out the Comments, but not useful comments
        # (Bug: This will break comment-based help that uses leading # instead of multiline <#,
        # because only the headings will be left behind.)

        $Tokens = $Tokens.ForEach{
            If ( $_.Type -ne 'Comment' )
                {
                $_
                }
            Else
                {
                $CommentText = $_.Content.Substring( $_.Content.IndexOf( '#' ) + 1 )
                $FirstInnerToken = [System.Management.Automation.PSParser]::Tokenize( $CommentText, [ref]$Null ) |
                    Where-Object { $_.Type -ne 'NewLine' } |
                    Select-Object -First 1
                If ( $FirstInnerToken.Content -in $AllowedComments )
                    {
                    $_
                    }
                } }
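Here is the same first-word test pulled out into a standalone sketch, run against a made-up three-line script. The sample script, the one-word allowed list, and the variable names are all assumptions for illustration.

```powershell
# Sample script (hypothetical): one comment worth keeping, one not
$SampleScript = @'
#requires -Version 3
# Just a note to self
Get-Date
'@

# Shortened allowed list, for illustration
$AllowedWords = @( 'requires' )

# Pull out just the comment tokens
$SampleComments = [System.Management.Automation.PSParser]::Tokenize( $SampleScript, [ref]$Null ) |
    Where-Object { $_.Type -eq 'Comment' }

# Keep a comment only if its first word is in the allowed list
$KeptComments = ForEach ( $Comment in $SampleComments )
    {
    # Everything after the first '#'
    $CommentText = $Comment.Content.Substring( $Comment.Content.IndexOf( '#' ) + 1 )

    # Let the tokenizer find the first word
    $FirstWord = [System.Management.Automation.PSParser]::Tokenize( $CommentText, [ref]$Null ) |
        Where-Object { $_.Type -ne 'NewLine' } |
        Select-Object -First 1

    If ( $FirstWord.Content -in $AllowedWords ) { $Comment.Content }
    }
```

Only the #requires line should survive; the note to self is dropped because its first word is not in the list.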

Our new version of the script starts as an empty string.

        # Initialize script string
        $NewScriptText = ''
        $SkipNext = $False

Then we loop through each token except for the last one. We are looping through index numbers instead of the tokens themselves so that we can more easily reference the following token when making decisions. We’ll save the last token for later, as it won’t have a following token to reference, and it is most efficient to handle it separately.

        # If there are at least 2 tokens to process...
        If ( $Tokens.Count -gt 1 )
            {
            # For each token (except the last one)...
            ForEach ( $i in ( 0..($Tokens.Count-2) ) )
                {

If we decided on the previous loop that we should skip this token and not include it in the script, we do so. If the token is a line continuation, we are going to skip it and not include it in the new script. If this token is a new line or a semicolon and the following token is a new line or a semicolon or a close parenthesis or a close curly brace, we are going to skip this one as redundant.

                # If token is not a line continuation and not a repeated new line or semicolon...
                If (    -not $SkipNext -and
                        $Tokens[$i  ].Type -ne 'LineContinuation' -and (
                        $Tokens[$i  ].Type -notin ( 'NewLine', 'StatementSeparator' ) -or
                        $Tokens[$i+1].Type -notin ( 'NewLine', 'StatementSeparator', 'GroupEnd' ) ) )
                    {

Then we add the token to the new script. For most tokens, we just use the .Content of the $Token object, but for variables and strings, we go back to the old script and pull them out of there. The token content does not include $ for variables, because the $ is just an indicator that what follows is a variable name, not part of the variable name itself. And the token content does not include the quotes for strings for a similar reason. For variables, we could simply put the $ back in manually, but for strings we don’t know whether to use single quotes, double quotes, and/or here-string quotes, so we just grab the original and don’t have to think about it.

                    # Add Token to new script
                    # For string and variable, reference old script to include $ and quotes
                    If ( $Tokens[$i].Type -in ( 'String', 'Variable' ) )
                        {
                        $NewScriptText += $OldScript.Substring( $Tokens[$i].Start, $Tokens[$i].Length )
                        }
                    Else
                        {
                        $NewScriptText += $Tokens[$i].Content
                        }
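A small standalone sketch makes the difference visible. The one-line sample and the variable names are assumptions for illustration; the Start and Length properties are the same ones used above.

```powershell
# Sample line (hypothetical) with a variable and a double-quoted string
$SampleLine = '$Name = "World"'

$LineTokens = [System.Management.Automation.PSParser]::Tokenize( $SampleLine, [ref]$Null )

# Compare what the token reports with what the original text contains
$Comparison = ForEach ( $Token in $LineTokens )
    {
    [pscustomobject]@{
        Type     = $Token.Type
        Content  = $Token.Content
        Original = $SampleLine.Substring( $Token.Start, $Token.Length )
        }
    }

$Comparison
```

For the variable, Content is Name while Original is $Name. For the string, Content is World while Original still has its double quotes.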

And then we have to do some serious thinking about what to put between this token and the next one. Some code will break if you add a space. ( $X.Name -> $X .Name ) Other code will break if you take a space out. ( Get-Item -Path -> Get-Item-Path ) So we look at the original and see if there was white space (or comments) between them before. If so, we put in a single space.
…Unless we are before or after a NewLine or a semicolon, or inside of and next to a parenthesis or curly brace, in which case a space is not needed, and we leave it out.

                    # If this token is a type that can be followed by a space
                    # And the next token is a type that can be preceded by a space
                    # And this token and the next are on the same line
                    # And this token and the next had white space between them in the original...
                    If (    $Tokens[$i  ].Type -notin ( 'NewLine', 'GroupStart', 'StatementSeparator' ) -and
                            $Tokens[$i+1].Type -notin ( 'NewLine', 'GroupEnd', 'StatementSeparator' ) -and
                            $Tokens[$i].EndLine -eq $Tokens[$i+1].StartLine -and
                            $Tokens[$i+1].StartColumn - $Tokens[$i].EndColumn -gt 0 )
                        {
                        # Add a space to new script
                        $NewScriptText += ' '
                        }
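The column arithmetic can be checked on its own. This sketch uses a hypothetical helper, Get-TokenGap (a name invented here), that reports the gap between each pair of neighboring tokens; it assumes single-line input, so it skips the same-line check the real function makes.

```powershell
# Hypothetical helper, for illustration only (assumes single-line input)
function Get-TokenGap
    {
    Param ( [string]$Text )

    $GapTokens = @( [System.Management.Automation.PSParser]::Tokenize( $Text, [ref]$Null ) )

    # For each token except the last, output the number of
    # columns between its end and the start of the next token
    For ( $i = 0; $i -lt $GapTokens.Count - 1; $i++ )
        {
        $GapTokens[$i+1].StartColumn - $GapTokens[$i].EndColumn
        }
    }

# Tokens that touch: no space may be added
$MemberGaps  = @( Get-TokenGap '$X.Name' )

# Tokens separated by white space: a space must be kept
$CommandGaps = @( Get-TokenGap 'Get-Item -Path C:\Temp' )
```

A gap of zero means the tokens were touching in the original, so the function adds nothing between them. Anything greater than zero is collapsed to a single space.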

We check to see if the next token should be skipped based on this token. Specifically, if this token is an open parenthesis or an open curly brace and the next token is a new line or a semicolon, we skip the next token. Or if the current token was skipped for the same reason, we check if the next token should also be skipped.

                    # If the next token is a new line or semicolon following
                    # an open parenthesis or curly brace, skip it
                    $SkipNext = $Tokens[$i].Type -eq 'GroupStart' -and $Tokens[$i+1].Type -in ( 'NewLine', 'StatementSeparator' )
                    }

                # Else (Token is a line continuation or a repeated new line or semicolon)...
                Else
                    {
                    # [Do not include it in the new script]

                    # If the next token is a new line or semicolon following
                    # an open parenthesis or curly brace, skip it
                    $SkipNext = $SkipNext -and $Tokens[$i+1].Type -in ( 'NewLine', 'StatementSeparator' )
                    }
                }
            }

Add the last token to the new script, again referencing the old script for a variable or string.

        # If there is a last token to process...
        If ( $Tokens )
            {
            # Add last token to new script
            # For string and variable, reference old script to include $ and quotes
            If ( $Tokens[-1].Type -in ( 'String', 'Variable' ) )
                {
                $NewScriptText += $OldScript.Substring( $Tokens[-1].Start, $Tokens[-1].Length )
                }
            Else
                {
                $NewScriptText += $Tokens[-1].Content
                }
            }

If we ended up with a NewLine or StatementSeparator at the beginning, trim it off. (If we ended up with one at the end, we’ll leave it in as best practice.)

        # Trim any leading new lines from the new script
        $NewScriptText = $NewScriptText.TrimStart( "`n`r;" )

And then we return the result in the same format as it came in.
If it came in as a scriptblock, convert the new script string to a scriptblock and return.

        # Return the new script as the same type as the input
        If ( $Items.Count -eq 1 )
            {
            If ( $Items[0] -is [scriptblock] )
                {
                # Return single scriptblock
                return [scriptblock]::Create( $NewScriptText )
                }

If it came in as a single string (or something we converted to a string), return a single string.

            Else
                {
                # Return single string
                return $NewScriptText
                }
            }

Otherwise, it was an array of strings (or an array of things we converted to strings). Split it at the line breaks and return.

        Else
            {
            # Return array of strings
            return $NewScriptText.Split( "`n`r", [System.StringSplitOptions]::RemoveEmptyEntries )
            }
        }
    }
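The two conversions at the end can be sketched on their own with a made-up script string (the variable names are assumptions for illustration). One caveat, stated as an assumption rather than something tested on every version: newer .NET versions also offer a Split overload that takes a single string separator, so passing a plain string may be treated as one multi-character separator there. Casting the separators to [char[]] makes the intent explicit.

```powershell
# A made-up minified script with a Windows-style line break
$SampleText = "`$a = 1`r`n`$b = 2"

# String to scriptblock
$AsScriptblock = [scriptblock]::Create( $SampleText )

# String to array of lines, splitting on either line break character
$AsLines = $SampleText.Split( [char[]]"`r`n", [System.StringSplitOptions]::RemoveEmptyEntries )
```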


Full function

Here it is all together.

function Remove-CommentsAndWhiteSpace
    {
    # We are not restricting scriptblock type as Tokenize() can take several types
    Param (
        [parameter( ValueFromPipeline = $True )]
        $Scriptblock
        )

    Begin
        {
        # Initialize collection
        $Items = @()
        }

    Process
        {
        # Collect all of the inputs together
        $Items += $Scriptblock
        }

    End
        {
        ## Process the script as a single unit

        # Convert input to a single string if needed
        $OldScript = $Items -join [environment]::NewLine

        # If no work to do
        # We're done
        If ( -not $OldScript.Trim( " `n`r`t" ) ) { return }

        # Use the PowerShell tokenizer to break the script into identified tokens
        $Tokens = [System.Management.Automation.PSParser]::Tokenize( $OldScript, [ref]$Null )

        # Define useful, allowed comments
        $AllowedComments = @(
            'requires'
            '.SYNOPSIS'
            '.DESCRIPTION'
            '.PARAMETER'
            '.EXAMPLE'
            '.INPUTS'
            '.OUTPUTS'
            '.NOTES'
            '.LINK'
            '.COMPONENT'
            '.ROLE'
            '.FUNCTIONALITY'
            '.FORWARDHELPCATEGORY'
            '.REMOTEHELPRUNSPACE'
            '.EXTERNALHELP' )

        # Strip out the Comments, but not useful comments
        # (Bug: This will break comment-based help that uses leading # instead of multiline <#,
        # because only the headings will be left behind.)

        $Tokens = $Tokens.ForEach{
            If ( $_.Type -ne 'Comment' )
                {
                $_
                }
            Else
                {
                $CommentText = $_.Content.Substring( $_.Content.IndexOf( '#' ) + 1 )
                $FirstInnerToken = [System.Management.Automation.PSParser]::Tokenize( $CommentText, [ref]$Null ) |
                    Where-Object { $_.Type -ne 'NewLine' } |
                    Select-Object -First 1
                If ( $FirstInnerToken.Content -in $AllowedComments )
                    {
                    $_
                    }
                } }

        # Initialize script string
        $NewScriptText = ''
        $SkipNext = $False

        # If there are at least 2 tokens to process...
        If ( $Tokens.Count -gt 1 )
            {
            # For each token (except the last one)...
            ForEach ( $i in ( 0..($Tokens.Count-2) ) )
                {
                # If token is not a line continuation and not a repeated new line or semicolon...
                If (    -not $SkipNext -and
                        $Tokens[$i  ].Type -ne 'LineContinuation' -and (
                        $Tokens[$i  ].Type -notin ( 'NewLine', 'StatementSeparator' ) -or
                        $Tokens[$i+1].Type -notin ( 'NewLine', 'StatementSeparator', 'GroupEnd' ) ) )
                    {
                    # Add Token to new script
                    # For string and variable, reference old script to include $ and quotes
                    If ( $Tokens[$i].Type -in ( 'String', 'Variable' ) )
                        {
                        $NewScriptText += $OldScript.Substring( $Tokens[$i].Start, $Tokens[$i].Length )
                        }
                    Else
                        {
                        $NewScriptText += $Tokens[$i].Content
                        }

                    # If this token is a type that can be followed by a space
                    # And the next token is a type that can be preceded by a space
                    # And this token and the next are on the same line
                    # And this token and the next had white space between them in the original...
                    If (    $Tokens[$i  ].Type -notin ( 'NewLine', 'GroupStart', 'StatementSeparator' ) -and
                            $Tokens[$i+1].Type -notin ( 'NewLine', 'GroupEnd', 'StatementSeparator' ) -and
                            $Tokens[$i].EndLine -eq $Tokens[$i+1].StartLine -and
                            $Tokens[$i+1].StartColumn - $Tokens[$i].EndColumn -gt 0 )
                        {
                        # Add a space to new script
                        $NewScriptText += ' '
                        }

                    # If the next token is a new line or semicolon following
                    # an open parenthesis or curly brace, skip it
                    $SkipNext = $Tokens[$i].Type -eq 'GroupStart' -and $Tokens[$i+1].Type -in ( 'NewLine', 'StatementSeparator' )
                    }

                # Else (Token is a line continuation or a repeated new line or semicolon)...
                Else
                    {
                    # [Do not include it in the new script]

                    # If the next token is a new line or semicolon following
                    # an open parenthesis or curly brace, skip it
                    $SkipNext = $SkipNext -and $Tokens[$i+1].Type -in ( 'NewLine', 'StatementSeparator' )
                    }
                }
            }

        # If there is a last token to process...
        If ( $Tokens )
            {
            # Add last token to new script
            # For string and variable, reference old script to include $ and quotes
            If ( $Tokens[-1].Type -in ( 'String', 'Variable' ) )
                {
                $NewScriptText += $OldScript.Substring( $Tokens[-1].Start, $Tokens[-1].Length )
                }
            Else
                {
                $NewScriptText += $Tokens[-1].Content
                }
            }

        # Trim any leading new lines from the new script
        $NewScriptText = $NewScriptText.TrimStart( "`n`r;" )

        # Return the new script as the same type as the input
        If ( $Items.Count -eq 1 )
            {
            If ( $Items[0] -is [scriptblock] )
                {
                # Return single scriptblock
                return [scriptblock]::Create( $NewScriptText )
                }
            Else
                {
                # Return single string
                return $NewScriptText
                }
            }
        Else
            {
            # Return array of strings
            return $NewScriptText.Split( "`n`r", [System.StringSplitOptions]::RemoveEmptyEntries )
            }
        }
    }


Output

And as example output, here is the result of running it against itself.

function Remove-CommentsAndWhiteSpace
{Param ([parameter(ValueFromPipeline = $True)]
$Scriptblock)
Begin
{$Items = @()}
Process
{$Items += $Scriptblock}
End
{$OldScript = $Items -join [environment]::NewLine
If (-not $OldScript.Trim(" `n`r`t")) {return}
$Tokens = [System.Management.Automation.PSParser]::Tokenize($OldScript, [ref]$Null)
$AllowedComments = @('requires'
'.SYNOPSIS'
'.DESCRIPTION'
'.PARAMETER'
'.EXAMPLE'
'.INPUTS'
'.OUTPUTS'
'.NOTES'
'.LINK'
'.COMPONENT'
'.ROLE'
'.FUNCTIONALITY'
'.FORWARDHELPCATEGORY'
'.REMOTEHELPRUNSPACE'
'.EXTERNALHELP')
$Tokens = $Tokens.ForEach{If ($_.Type -ne 'Comment')
{$_}
Else
{$CommentText = $_.Content.Substring($_.Content.IndexOf('#') + 1)
$FirstInnerToken = [System.Management.Automation.PSParser]::Tokenize($CommentText, [ref]$Null) |
Where-Object {$_.Type -ne 'NewLine'} |
Select-Object -First 1
If ($FirstInnerToken.Content -in $AllowedComments)
{$_}}}
$NewScriptText = ''
$SkipNext = $False
If ($Tokens.Count -gt 1)
{ForEach ($i in (0..($Tokens.Count-2)))
{If (-not $SkipNext -and
$Tokens[$i ].Type -ne 'LineContinuation' -and ($Tokens[$i ].Type -notin ('NewLine', 'StatementSeparator') -or
$Tokens[$i+1].Type -notin ('NewLine', 'StatementSeparator', 'GroupEnd')))
{If ($Tokens[$i].Type -in ('String', 'Variable'))
{$NewScriptText += $OldScript.Substring($Tokens[$i].Start, $Tokens[$i].Length)}
Else
{$NewScriptText += $Tokens[$i].Content}
If ($Tokens[$i ].Type -notin ('NewLine', 'GroupStart', 'StatementSeparator') -and
$Tokens[$i+1].Type -notin ('NewLine', 'GroupEnd', 'StatementSeparator') -and
$Tokens[$i].EndLine -eq $Tokens[$i+1].StartLine -and
$Tokens[$i+1].StartColumn - $Tokens[$i].EndColumn -gt 0)
{$NewScriptText += ' '}
$SkipNext = $Tokens[$i].Type -eq 'GroupStart' -and $Tokens[$i+1].Type -in ('NewLine', 'StatementSeparator')}
Else
{$SkipNext = $SkipNext -and $Tokens[$i+1].Type -in ('NewLine', 'StatementSeparator')}}}
If ($Tokens)
{If ($Tokens[-1].Type -in ('String', 'Variable'))
{$NewScriptText += $OldScript.Substring($Tokens[-1].Start, $Tokens[-1].Length)}
Else
{$NewScriptText += $Tokens[-1].Content}}
$NewScriptText = $NewScriptText.TrimStart("`n`r;")
If ($Items.Count -eq 1)
{If ($Items[0] -is [scriptblock])
{return [scriptblock]::Create($NewScriptText)}
Else
{return $NewScriptText}}
Else
{return $NewScriptText.Split("`n`r", [System.StringSplitOptions]::RemoveEmptyEntries)}}}
