Friday 27 September 2013

Search vs. Recursive Looping: Getting a List of Sites (SPWeb's) a User Has Access to in a SharePoint Site Collection

Introduction

A question about returning all the sites (SPWeb's) "the current user" has access to in a given site collection comes up regularly in the TechNet SharePoint forums. The question usually asked is, "is there a method that returns all the sub-webs" of a site collection that a user has access to, or do we need to recursively loop through each web in the site collection, checking if the user has a specific permission to view the web?

The answer is regularly that you need to loop through the collection of webs (recursively), to determine the list of webs the user has access to.

Depending on the size of a site collection, this can be a very expensive and time consuming operation.

There is another way to achieve this requirement, using Search. This article explores using Search to generate a list of webs a user has access to, examines the performance differences between Search and Looping through collections, as well as some potential pros and cons.

Creating a Webpart to Test the Performance of Both Methods.

To compare the difference in performance and the results produced from each method, we are going to create a test webpart. The webpart is very simple, containing two main methods. One method is used for generating the list of webs by looping (calling SPWeb.GetSubwebsForCurrentUser() on each web), and the other method is using the SharePoint Search infrastructure, via the KeywordSearch class. Each of these methods is wrapped in an SPMonitoredScope block, enabling the performance of the each method to be tracked. The results can be seen in the Developer Dashboard.

The method that uses SPWeb.GetSubwebsForCurrentUser() starts at the root web for the site collection, an traverse down, calling GetSubwebsForCurrentUser() on each child web of the current web, until it finishes enumerating all the webs the current user has access to.

The search query used in the search method, queries the search engine for "ALL sites AND webs WHERE the webapplication hostname STARTS WITH the current sites hostname". You can test out the results of this search query using the SharePoint UI, via a standard Enterprise Search site. The search command would look something similar to this, if you were searching for all sub-sites you had access to on the http://corporate site collection:

(contentclass:STS_SITE OR contentclass:STS_Web) AND sitename:http://corporate

WebPart Code for Testing the Performance of Both Methods


using System;
using System.Collections;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Text;
using System.Web.UI.WebControls;
using System.Web.UI.WebControls.WebParts;
using Microsoft.Office.Server.Search.Administration;
using Microsoft.Office.Server.Search.Query;
using Microsoft.SharePoint;
using Microsoft.SharePoint.Utilities;

namespace SearchVerseLoop.GetTheSitesIHavePermissionsToSee
{
    [ToolboxItemAttribute(false)]
    public class GetTheSitesIHavePermissionsToSee : WebPart
    {
        private Label _sitesFromSearch;
        private Label _sitesFromLooping;

        protected override void CreateChildControls()
        {
            _sitesFromSearch = new Label();
            _sitesFromLooping = new Label();
            Controls.Add(_sitesFromSearch);
            Controls.Add(_sitesFromLooping);
        }

        protected override void OnPreRender(EventArgs e)
        {
            base.OnPreRender(e);
            _sitesFromLooping.Text = GetAllWebs();
        }

        private String GetAllWebs()
        {
            try
            {
                var output = new StringBuilder();
                var websFromLooping = new ArrayList();
                using (new SPMonitoredScope("using a loop"))
                {
                    GetListOfWebs(SPContext.Current.Site.RootWeb, SPContext.Current.Site.RootWeb.GetSubwebsForCurrentUser(), websFromLooping);
                    output.Append(String.Format("<p>There are {0} webs I have access to (retrieved from looping through the rootwebs sub-webs)</p>", websFromLooping.Count));
                    foreach (var web in websFromLooping)
                    {
                        output.Append(String.Format("<span>{0}</span><br/>", web));
                    }
                }

                var websFromSearching = new ArrayList();
                using (new SPMonitoredScope("using search"))
                {
                    GetListOfWebsFromSearch(websFromSearching);
                    output.Append(String.Format("<p>There are {0} webs I have access to (retrieved from search, filtering on the current site)</p>", websFromSearching.Count));
                    foreach (var web in websFromSearching)
                    {
                        output.Append(String.Format("<span>{0}</span><br/>", web));
                    }
                }

                return output.ToString();
            }
            catch (Exception e)
            {
                return e.Message;
            }
        }

        private void GetListOfWebs(SPWeb currentWeb, IEnumerable<SPWeb> webCollection, ArrayList webs)
        {
            webs.Add(currentWeb.Url);
            foreach (SPWeb web in webCollection)
            {
                if (web.GetSubwebsForCurrentUser().Count > 0)
                {
                    GetListOfWebs(web, web.GetSubwebsForCurrentUser(), webs);
                }
                else
                {
                    webs.Add(web.Url);
                }
            }
        }

        private void GetListOfWebsFromSearch(ArrayList webs)
        {
            var ssaProxy = (SearchServiceApplicationProxy)SearchServiceApplicationProxy.GetProxy(SPServiceContext.GetContext(SPContext.Current.Site));
            var keywordQuery = new KeywordQuery(ssaProxy)
                {
                    RowLimit = 500,
                    TrimDuplicates = true,
                    ResultsProvider = SearchProvider.Default
                };
            keywordQuery.SelectProperties.Clear();
            keywordQuery.SelectProperties.Add("Path");
            keywordQuery.ResultTypes |= ResultType.RelevantResults;
            keywordQuery.QueryText = String.Format("(contentclass:STS_SITE OR contentclass:STS_Web) AND sitename:{0}", SPContext.Current.Site.HostName);
            ResultTableCollection searchResults;
            try
            {
                searchResults = keywordQuery.Execute();
            }
            catch (Exception)
            {
                //"Your query is malformed. Please rephrase your query."
                return;
            }

            if (!searchResults.Exists(ResultType.RelevantResults)) return;
            var searchResult = searchResults[ResultType.RelevantResults];
            var results = new DataTable { TableName = "SearchResults" };
            results.Load(searchResult, LoadOption.OverwriteChanges);
            foreach (DataRow dataRow in results.Rows)
            {
                webs.Add(dataRow["Path"]);
            }
        }
    }
}


Using SPMonitoredScope

In the code above we have two main functions that get called during the PreRender event. Both of these functions are wrapped in an SPMonitoredScope, which will enable us to track performance information about each method, namely the time each method takes to produce the list webs the current user has access to.

Using SPMonitoredScope also allows us to see other valuable information, such as the number and type of SQL calls, and expensive object allocations, like SPRequest allocations.

Turning on the Developer Dashboard with PowerShell

To see the results, we need to enable the Developer Dashboard. There is no user interface in SharePoint for enabling the Developer Dashboard, but thankfully, it's easily enabled using PowerShell.

To enable the Developer Dashboard, logon to your (test) SharePoint server, and open the SharePoint Management Shell.

Execute the following commands to enable the Developer Dashboard.
$ds = [Microsoft.SharePoint.Administration.SPWebService]::ContentService.DeveloperDashboardSettings;
$ds.DisplayLevel = 'On';
$ds.Update();


Testing the Example Webpart

To test the performance differences, we will run the following tests.
1. User A, with the webpart on a site collection with 9 sites (all webs are indexed)
2. User A, with the webpart on a site collection with 54 sites (some webs are NOT indexed)
3. User B, who has more restricted permissions than User A, with the webpart on a site collection with 54 sites (some webs are NOT indexed)

Each test will be run three times (by refreshing the page), the results (time taken for each method) will be aggregate to produce an average time.

Test 1 User A, on a site collection with 9 sites (all webs are indexed):

The output from the webpart shows both methods return the same number of sites.


This image shows part of the Developer Dashboard output. Using it, we can see the time taken for various parts of the page to load. The highlighted section shows the time taken to execute the two methods we wrapped in the SPMonitoredScope blocks.
You can see immediately that the search method is much faster, even on a small site collection.


The results from the first page refresh.


The results from the second page refresh.


The results from the third page refresh.



From the five screen shots above, we can see that both methods returned the same number of sites, and the differences in the time taken by each method.

Average time for the Looping Method to generate the result set: 84.3ms (93.68, 90.96, 68.20)
Average time for the Search Method to generate the result set: 23.37ms (25.62, 22.04, 22.45)

From this test, we can already see that using search is considerably faster, even though we are dealing with a small site collection.

Test 2 User A, with the Webpart on a Site Collection with 54 Sites (some Webs are not Indexed):

From the output of the webpart you can see, there is a difference in the number of sites returned. The loop method returns 57 webs, but the search method only returns 54 webs.

The difference in the search results is mainly down to a setting on an SPWeb that controls if the site is included in the search index. That setting, "Allow this site to appear in search results?", is set via the site settings page of a site (SPWeb). For example: The "search center" (http://sneakpreview/searchcenter) is not returned in the result set, as this site is excluded from appearing in search results.

This is one of the caveats of using the search method, and could be seen as either a dis-benefit, or a benefit.


Looking at the Developer Dashboard for this test, we can see that the Search method clearly out-performs the loop based method. Test 2 has approximately 6 times the number of sites to retrieve. Using search, the query takes about 3 times longer than it did in the first example. Using looping, the query takes nearly 10 times longer than it did in the first test. Ouch! 


The results from the first page refresh.


 The results from the second page refresh.


The results from the third page refresh.



From the five screen shots above, we can see that both methods returned approximately the same number of sites, and the different time taken by each method. Looping returns all 57 sites the user has access to, while the Search method returns 54 sites (because some sites are excluded from the Search Index).

Average time for the Looping Method to generate the result set: 803.8ms (806.64, 806.96, 786.25, 815.11)
Average time for the Search Method to generate the result set: 64ms (69.83, 67.64, 60.80, 57.91)

In this case, where we are searching a slightly larger site collection, the differences in performance are very noticeable!

Test 3 User B, who has more Restricted Permissions than User A, with the Webpart on a Site Collection with 54 sites (some Webs are not Indexed):

In this test, we focus on some other performance statistics that are highlighted by the Developer Dashboard.

We can see that this user has access to fewer sites than the user used in the previous test (47 sites, as compared with 57 sites for the user in Test 2). While the results are similar to Test 2 (the Search method returns 3 fewer results than the looping method, and the performance time statistics are similar), we want to look at what else is going on behind the scenes.


If we have a closer look at the Developer Dashboard's output, under the Database Queries, we can that the Looping method (calling SPWeb.GetSubwebsForCurrentUser()) makes two calls to the SQL (proc_ListChildWebsFiltered, and proc_GetTpWebMetaDataAndListMetaData) database for each Web that is checked.


Further down the Developer Dashboard page, we have the SPRequest Allocations listed. Here we can see that the Looping method (which calls SPWeb.GetSubwebsForCurrentUser()) creates an SPRequest allocation for each web that is checked.



Test Summary

From the three tests above, it's clear that the Search method out performs the Looping method, and uses less resources in doing so. This makes the search method more scale-able, both in terms of simultaneous users loading the page, and in terms of how large the site collection can be.

The caveat to the Searching method is that the result set might not include all of the sites a user has access to, if one or more sites has been excluded from the Search Index. This may or may not be a problem, depending on why the sites have been excluded from the search index.

The looping method puts more load on the SharePoint infrastructure, and performance issues are bound to occur as the number of users using the code (or webpart) increases and/or the number of sites in a site collection increases.

Quick Summary of Pros and Cons

Search Pros


  • It's fast
  • It can handle a very large site collection, returning results very quickly

Search Cons


  • If a site has the "Allow this site to appear in search results?" set to No, then true to form, the site won't be returned in the search results. This could be a pro (in some scenarios) or a con.
  • There are limited properties that can returned about an SPWeb object using Search. If you need to query additional properties, for example a property from the SPWeb.Properties collection, you would need to use the looping method.

Loop (iteratively calling GetSubwebsForCurrentUser) Pros


  • You can query additional properties of each SPWeb object as you parse the  collection of webs the user has access to. For example, you could query a custom property from the SPWeb.Properties collection.

Loop (iteratively calling GetSubwebsForCurrentUser) Cons


  • As the number of webs in a site collection increases, the performance becomes a big issue, causing the page to load slower.
  • Make calls to SPWeb.GetSubwebsForCurrentUser() increases load on the SQL server. This could cause a performance problem (albeit, depending on the size of your environment, number of webs in the site collection and the frequency in which the code is called).
  • Creates a lot of SPRequest allocations.


See Also

KeywordQuery
SPWeb.GetSubwebsForCurrentUser()
Using the Developer Dashboard
Using SPMonitoredScope
SPMonitoredScope


Friday 20 September 2013

Using PowerShell to Group and Filter SharePoint ListItems by Metadata Fields

You can use PowerShell to group documents (or list items) based on metadata. You might want to do this to find documents that are duplicates, to find the most active authors for a certain metadata combination (i.e. group by author, then by document type), or to create a report on metadata combinations.

Though some of this functionality can be produced using ListViews, there are scenarios where a ListView can't be used. One such example is grouping documents/listitems by certain metadata (e.g. customer id, document type), then displaying a list of  groups that only contain more than one document/listitem for a given combination of customer id/document type.

The PowerShell functions that are required to perform this sort of task are all standard PowerShell cmdlets (out of the box). Most of the following script examples make use of three PowerShell cmdlets, Group-ObjectWhere-Object, Sort-Object.

In the examples below, we are querying a marketing list that contains hundreds of marketing publications and submissions. The document library has a number of metadata fields used to describe those documents, including the following fields:

* Marketing Document Type (the type of document. E.g. Publication, Submission, etc)
* Year (the year the publication or submission relates to)
* Group (the business group)

The examples below demonstrate a number of business cases, an example of the script used and the report that gets created.

The basic structure of the script command is piping (passing) collections between cmdlets, grouping, filtering sorting, and counting the items along the way.

Getting Started


All of the examples below use a DataTable as the source list of objects that get passed to the Group-Object cmdlet, which we can get from the following lines of code:

Get the collection of items from the document library and store it in a variable.
$w = Get-SPWeb "http://corporation/marketing"
$l = $w.Lists["Legal Directories"]
$items = $l.Items;

Get all of the items in the listitem collection, return it as a DataTable and store it in the $dt variable.
$dt = $items.GetDataTable();

Store the static names of the fields we'll be using to group and sort by, into variables (to keep the scripts manageable and make the script commands shorter).
$fAuthor = "Author";
$fYear = "iYear";
$fDocType = "iMarketingDocumentClassification"
$fGroup = "iStrand";

Before we hop into the examples, let's take a quick look at each object returned along the pipeline, what it's type is, what properties are exposed, and most importantly, how we use them.

The example we are going to use is grouping by the Author, then Document Type fields. Using the object returned by the Group-Object cmdlet, we are sorting on the Document Type property, then the Count property. The final collection of grouped and sorted objects is passed into the $ad variable. We then pass the $ad variable to Format-Table, displaying the following columns; Count, Authors Name, Document Type.
$ad = $dt | Group-Object $fAuthor,$fDocType  | Sort-Object -Property {$_.Values[1]},{$_.Count} -Descending

$ad | Format-Table Count,@{Label="Name";Expression={$_.Values[0]}},@{Label="Document Type";Expression={$_.Values[1]}} -Autosize

The first part of the example is piping the DataTable object to the Group-Object cmdlet, and grouping it by the Author field, then by the Document Type field.
$go = $dt | Group-Object $fAuthor,$fDocType

If we have a look at the type of object returned, we see that the Group-Object cmdlet has given us a System.Array full of Objects.

If we list out the contents of $go, we see the three properties; Count (the number of items in the group), Name (the values of each set of grouped fields) and Group (containing the datarows in the group).

When we look at the first object in the collection of objects in $go, we see the type is GroupInfo (Microsoft.PowerShell.Commands.GroupInfo). The documentation for the GroupInfo class can be seen here, GroupInfo. The GroupInfo class actually has a forth property, Values, which contains the values of the elements in the group (seen in the screen shot below, as well as in the MSDN Documentation).



Now that we have grouped our list items, and have them in an Array full of GroupInfo objects, we need to sort them into an order ready for displaying. In this example, we want to sort them by Document Type, and then by the Count of each group.
$ad = $go  | Sort-Object -Property {$_.Values[1]},{$_.Count} -Descending

We do this by piping the results from Group-Object to Sort-Object, and specifying the properties to sort on.

The first property we sort on is $_.Values[1]. $_ represents the curent GroupInfo object in the pipeline. We know the GroupInfo object has the Value property, which contains the values of the elements in the group (from the screen, we can see is the Author and Document Type field values).

The second field we are sorting on, is $_.Count. $_, as just discussed, is a GroupInfo object, so we know it has a Count property that contains the number of elements in the group.

Finally, we add the -Descending switch to Sort-Object, to order the results in descending order. The results from Sort-Object (an Array of GroupInfo objects) are then returned and stored in the $ad variable.

Now that we have our results grouped and sorted the way we want them, we can display the results in a table format, using the Format-Table cmdlet. However, to display the columns we want, we need to use formatting instructions to get the values of Document Type and Author, because these values are not direct properties of the GroupInfo object (the GroupInfo object's properties are Count, Group, Name, and Values). If you're interested in learning more about formatting instructions, there's a great article on the Microsoft TechNet Scripting site that is worth reading, Creating Custom Tables.
$ad | FT Count,@{Label="Name";Expression={$_.Values[0]}},@{Label="Document Type";Expression={$_.Values[1]}} -Autosize

The full script can be condensed a little more, piping the DataTable to Group-Object, piping the Array of GroupInfo (returned from Group-Object) to Sort-Object, and finally piping the results to Format-Table for displaying.
$dt | Group-Object $fAuthor,$fDocType  | Sort-Object -Property {$_.Values[1]},{$_.Count} -Descending | FT Count,@{Label="Name";Expression={$_.Values[0]}},@{Label="Document Type";Expression={$_.Values[1]}} -Autosize


Examples

The following examples build on the script above, using the same DataTable ($dt) variable and fields.

Example One: Listing authors who frequently create documents

Group documents by Author, Then by Document Type. Then get all the groups that contain five or more documents with the same author and document type value, and write them out to screen, including each documents ListItemId.

This report will list the people who regularly add documents of certain types (e.g. publications) to the marketing document library
$ad = $dt | Group-Object $fAuthor,$fDocType  | Where-Object{$_.Count -gt 5}

foreach($d in $ad){$d.Group | Format-Table @{Label="Name";Expression={$_["FileLeafRef"]}},@{Label="Item Id";Expression={$_["ID"]}},@{Label="Document Type";Expression={$_["iMarketingDocumentClassification"]}},@{Label="Author";Expression={$_["Author"]}}}



Example Two: Looking for potential duplicates by author

Group documents by Author, Then by Document Type, then by Business Group. Then get all the groups with the same author, document type value and business group, where the group contains more than one item (e.g. a potential duplicate) and write them out to screen, including each documents ListItemId.

This is an example of something a SharePoint List View can't achieve. A list view can do the grouping, but can't filter out groups that only contain one document.

$ad = $dt | Group-Object $fAuthor,$fDocType,$fGroup | Where-Object{$_.Count -gt 1}

foreach($d in $ad){$d.Group | Format-Table @{Label="Name";Expression={$_["FileLeafRef"]}},@{Label="Item Id";Expression={$_["ID"]}},@{Label="Document Type";Expression={$_["iMarketingDocumentClassification"]}},@{Label="Group";Expression={$_["iStrand"]}},@{Label="Author";Expression={$_["Author"]}}}



Example Three: Looking for duplicate submissions

Filter the documents by the document type, then Group the documents by Year, Then by Group. Then get all the groups that contain more than 1  (submission) documents with the same year for the same business group, and write them out to screen, include each documents ListItemId.

$ad = $dt | ?{$_[$fDocType] -eq "Directory Submission"} | Group-Object $fYear,$fGroup  | Where-Object{$_.Count -gt 1 }

foreach($d in $ad){$d.Group | Format-Table @{Label="Name";Expression={$_["FileLeafRef"]}},@{Label="Item Id";Expression={$_["ID"]}},@{Label="Business Group";Expression={$_["iStrand"]}},@{Label="Year";Expression={$_["iYear"]}}}



Example Four: Looking for potential duplicates.

Group documents by Year, Then by Document Type, Then by Group. Then get all the groups that contain two or more documents with the same year, document type and business group values (identifying the potential duplicates), and write them out to screen, include each documents ListItemId.

$ad = $dt | Group-Object $fYear,$fDocType,$fGroup  | Where-Object{$_.Count -gt 1}

foreach($d in $ad){$d.Group | Format-Table @{Label="Name";Expression={$_["FileLeafRef"]}},@{Label="Item Id";Expression={$_["ID"]}},@{Label="Document Type";Expression={$_["iMarketingDocumentClassification"]}},@{Label="Year";Expression={$_["iYear"]}},@{Label="Group";Expression={$_["iStrand"]}},@{Label="Author";Expression={$_["Author"]}}}


Continuing to build on the previous example where we have identified potential duplicate documents, we will add a new column for tracking the potential duplicate documents, and set the value of that column to true for all the identified documents. This field can then be used to create a List View in the SharePoint UI, that a person from the Marketing team can use to review the suspect documents and take an appropriate action.

First, we create the new field, then refresh the list reference and datatable
$l.Fields.Add("PotentialDuplicate", [Microsoft.SharePoint.SPFieldType]::Boolean , $false)
$l.Update();
$items = $l.Items;
$dt = $items.GetDataTable();

Next, using the items we grouped and sorted into the $ad variable in example four, we loop through the collection of group objects, looping through each group object, finally updating the PotentialDuplicate field of each item.
foreach($group in $ad){foreach($item in $group.Group){$li = $l.Items.GetItemById($item.ID); $li["PotentialDuplicate"] = $true;$li.Update();}}

Well, that's it! There's lots of PowerShell in there, but once you get a feel for the GroupInfo object collection that the Group-Object cmdlet returns, and how to use it with Sort-Object, Where-Object and Format-Table, it's really easy to group, filter, sort and count item collections!