Tuesday, December 6, 2016

[C#] Use HtmlAgilityPack to parse HTML

 C#    Backend    HtmlAgilityPack    HTML Parsing   Sharepoint





Background


I made a year-end-party survey on SharePoint for my company and needed to produce some statistics from the answers.
SharePoint can export the survey results to Excel, so all I would have to do is import the Excel file into a database.

But NO, this is not an AGILE way!! I DON'T want to import the Excel file all over again whenever the results change!
Okay, every survey result's URI looks like this:
http://XXX/sites/MsgCenter/Survey/Lists/20170118/DispForm.aspx?ID=10&Source=XXX

So I decided to use HttpClient and HtmlAgilityPack to fetch and parse the HTML, extract the useful values, and save them to the database.



Environment

Visual Studio 2015 Ent. Update 3
HtmlAgilityPack 1.4.9.5



Implementation


Define the Template Uri

Put a placeholder keyword in the URI so that it can be replaced with each ID in a loop, and define a maximum number that determines how many URIs we will try to fetch.

private readonly int maxNumber = 300;
private string templateUri = "http://XXX/sites/MsgCenter/Survey/Lists/20170118/DispForm.aspx?ID=" + "TEMPLATE_ID" + "&Source=XXX";



Get and parse HTML from the URI


Main program

#region Define the survey uri list
var surveyUris = new List<string>();
for (int i = 0; i < maxNumber; i++)
{
    surveyUris.Add(templateUri.Replace("TEMPLATE_ID", i.ToString()));
}
#endregion
//Start reading the Survey result
surveyUris.ForEach(uri =>
{
     var yearend = getHtmlAsync(uri).Result;
     if (yearend != null)
     {
         //Insert to database…
     }
});
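
Calling .Result blocks the calling thread for every request and can deadlock when a synchronization context is present (for example in a UI application). As a minimal sketch, the same loop can be written with await instead. The names SurveyReader, ReadAllAsync, and the fetch parameter are hypothetical and only stand in for the program above; fetch plays the role of getHtmlAsync.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class SurveyReader
{
    // Hypothetical helper: fetches each URI in order and keeps the non-null results.
    // 'fetch' stands in for a method such as getHtmlAsync.
    public static async Task<List<T>> ReadAllAsync<T>(
        IEnumerable<string> uris, Func<string, Task<T>> fetch) where T : class
    {
        var results = new List<T>();
        foreach (var uri in uris)
        {
            // await frees the thread while the request is in flight
            var item = await fetch(uri);
            if (item != null)
            {
                results.Add(item); // e.g. insert into the database here
            }
        }
        return results;
    }
}
```

From an async method it could be called as `var rows = await SurveyReader.ReadAllAsync<SvYearEnd>(surveyUris, getHtmlAsync);`.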


HttpClient : Get Html

Here we use HttpClient.GetByteArrayAsync to get the HTML from a URI.
Note that to send requests to SharePoint we must be an authorized user; otherwise we get a 401 (Unauthorized) response. In this case we use Windows authentication, configured through the HttpClientHandler that is passed to the HttpClient constructor.

- PreAuthenticate 
  Gets or sets a value that indicates whether the handler sends an Authorization header with the request.
- UseDefaultCredentials
  Gets or sets a value that controls whether default credentials are sent with requests by the handler.
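
The two properties above can be set once and the resulting client reused for every request. This is a minimal sketch, assuming the process runs under a Windows account that SharePoint accepts. SharePointClientFactory, CreateHandler, and Create are hypothetical names, and the explicit NetworkCredential branch is an alternative that the original program does not use.

```csharp
using System.Net;
using System.Net.Http;

public static class SharePointClientFactory
{
    // Builds a handler that authenticates either as the current Windows user
    // (credential == null) or with an explicitly supplied account.
    public static HttpClientHandler CreateHandler(NetworkCredential credential = null)
    {
        var handler = new HttpClientHandler { PreAuthenticate = true };
        if (credential == null)
        {
            handler.UseDefaultCredentials = true;
        }
        else
        {
            handler.Credentials = credential;
        }
        return handler;
    }

    // One HttpClient can be shared by all requests instead of creating one per URI.
    public static HttpClient Create(NetworkCredential credential = null)
    {
        return new HttpClient(CreateHandler(credential));
    }
}
```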


// Requires: using System.Net; using System.Net.Http; using System.Text; using System.Threading.Tasks;
private async Task<SvYearEnd> getHtmlAsync(string uri)
{
    // Authenticate with the current Windows identity
    var handler = new HttpClientHandler()
    {
        PreAuthenticate = true,
        UseDefaultCredentials = true
    };

    using (var httpClient = new HttpClient(handler))
    {
        var response = await httpClient.GetByteArrayAsync(new Uri(uri));

        // Decode the whole response as UTF-8, then decode the HTML entities
        string srcHtml = Encoding.UTF8.GetString(response);
        srcHtml = WebUtility.HtmlDecode(srcHtml);

        // Parse the HTML into the survey model
        var yearend = this.deserializeHtml(srcHtml);

        // Treat EmpNo == 0 as "no survey result on this page"
        return yearend.EmpNo != 0 ? yearend : null;
    }
}
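
Because the loop probes IDs that may not exist, and GetByteArrayAsync throws an HttpRequestException on a non-success status code, it can help to check the status before reading the body. This is a minimal sketch, assuming a missing survey item produces a non-success status (whether SharePoint really does so for this form is not verified here). HtmlFetcher, TryGetHtmlAsync, and FixedResponseHandler are hypothetical names; FixedResponseHandler exists only so the helper can be tried offline.

```csharp
using System.Net;
using System.Net.Http;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

public static class HtmlFetcher
{
    // Returns the decoded HTML, or null when the server answers with a non-success status.
    public static async Task<string> TryGetHtmlAsync(HttpClient client, string uri)
    {
        using (var response = await client.GetAsync(uri))
        {
            if (!response.IsSuccessStatusCode)
            {
                return null;
            }
            var bytes = await response.Content.ReadAsByteArrayAsync();
            // Decode as UTF-8, then resolve HTML entities such as &amp;
            return WebUtility.HtmlDecode(Encoding.UTF8.GetString(bytes));
        }
    }
}

// Test double: always answers with the given status code and body.
public sealed class FixedResponseHandler : HttpMessageHandler
{
    private readonly HttpStatusCode status;
    private readonly string body;

    public FixedResponseHandler(HttpStatusCode status, string body)
    {
        this.status = status;
        this.body = body;
    }

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        var response = new HttpResponseMessage(status)
        {
            Content = new StringContent(body, Encoding.UTF8, "text/html")
        };
        return Task.FromResult(response);
    }
}
```

Against the real server, the client would be created with the Windows-authenticating HttpClientHandler described above instead of FixedResponseHandler.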


HtmlAgilityPack : Parse Html

Load the HTML with HtmlAgilityPack, then use LINQ lambda expressions to select nodes by id, class, etc. and read their inner text.

// Requires: using System.Linq; using HtmlAgilityPack;
private SvYearEnd deserializeHtml(string srcHtml)
{
    var html = new HtmlDocument();
    html.LoadHtml(srcHtml);

    // Get the root
    var root = html.DocumentNode;
    // Remove the comments
    root.Descendants().Where(n => n.NodeType == HtmlNodeType.Comment).ToList().ForEach(n => n.Remove());
    // Search where id == "SPFieldText"
    var textAnswers = root.Descendants().Where(n => n.GetAttributeValue("id", "").Equals("SPFieldText")).ToList();

    // SvYearEnd is the survey model class (its definition is not shown in this post;
    // a parameterless constructor is assumed)
    var yearend = new SvYearEnd();
    textAnswers.ForEach(x =>
    {
        var empId = x.InnerText.Trim();
        //…
    });
    return yearend;
}
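
To try the lookup without a SharePoint server, the same query can be run against an inline HTML string. This is a minimal sketch; FieldTextExtractor and ExtractTexts are hypothetical names, and the sample markup in the usage note is invented for illustration (real SharePoint output is much larger).

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class FieldTextExtractor
{
    // Returns the trimmed inner text of every node whose id attribute equals 'id'.
    public static List<string> ExtractTexts(string srcHtml, string id)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(srcHtml);

        return doc.DocumentNode
            .Descendants()
            // Nodes without an id attribute fall back to "" and are skipped
            .Where(n => n.GetAttributeValue("id", "") == id)
            .Select(n => n.InnerText.Trim())
            .ToList();
    }
}
```

For example, `FieldTextExtractor.ExtractTexts("<table><tr><td id=\"SPFieldText\"> 12345 </td></tr><tr><td id=\"SPFieldText\">Alice</td></tr></table>", "SPFieldText")` yields the two strings "12345" and "Alice".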




