C# Backend HtmlAgilityPack HTML Parsing Sharepoint
▌Background
I made a year-end party survey on SharePoint for my company and needed to produce some statistics from it. SharePoint can export the survey results to Excel, so all I would have to do is import the Excel file into a database.
But NO, this is not an AGILE way!! I DON'T want to import the Excel file again every time the results change!
Every survey response has a URI that looks like this:
http://XXX/sites/MsgCenter/Survey/Lists/20170118/DispForm.aspx?ID=10&Source=XXX
So I decided to use HttpClient and HtmlAgilityPack to fetch and parse the HTML, extract the useful values, and save them to the database.
▌Environment
▋Visual Studio 2015 Ent. Update 3
▋HtmlAgilityPack 1.4.9.5
▌Implementation
▋Define the Template Uri
Put a placeholder keyword in the URI so that it can be replaced with the ID inside a loop, and define a maximum number that limits how many URIs we will try to fetch HTML from.
private readonly int maxNumber = 300;
private string templateUri = "http://XXX/sites/MsgCenter/Survey/Lists/20170118/DispForm.aspx?ID=" + "TEMPLATE_ID" + "&Source=XXX";
▋Get and parse html from the uri
▋Main program
#region Define the survey uri list
var surveyUris = new List<string>();
for (int i = 0; i < maxNumber; i++)
{
    surveyUris.Add(templateUri.Replace("TEMPLATE_ID", i.ToString()));
}
#endregion
//Start reading the Survey result
surveyUris.ForEach(uri =>
{
    var yearend = getHtmlAsync(uri).Result;
    if (yearend != null)
    {
        //Insert to database…
    }
});
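Calling .Result blocks the calling thread. That is fine in a console program, but it can deadlock when a synchronization context is involved (for example in a UI app). Here is a minimal sketch of the same loop with await instead. CollectAsync and the stand-in fetcher are hypothetical names used only for illustration; in the real program the fetcher would be getHtmlAsync.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class SurveyLoop
{
    // Hypothetical helper: awaits the fetcher for each uri in order
    // and keeps only the non-null results.
    public static async Task<List<T>> CollectAsync<T>(
        IEnumerable<string> uris, Func<string, Task<T>> fetch) where T : class
    {
        var results = new List<T>();
        foreach (var uri in uris)
        {
            var item = await fetch(uri);
            if (item != null)
            {
                results.Add(item);
            }
        }
        return results;
    }

    public static void Main()
    {
        var uris = new[] { "ID=1", "ID=2", "ID=3" };
        // Stand-in fetcher (assumption): pretends that ID=2 has no survey result.
        var found = CollectAsync(uris,
            u => Task.FromResult(u.EndsWith("2") ? null : u))
            .GetAwaiter().GetResult(); // blocking once at the top of a console app
        Console.WriteLine(found.Count);
    }
}
```

The requests still run one after another, like the original ForEach, so the load on the SharePoint server stays the same.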
▋HttpClient : Get Html
Here we use HttpClient.GetByteArrayAsync to get the HTML from a URI.
Note that to send requests to SharePoint we have to be an authorized user, or we will get a 401 (Unauthorized) response. In this case we use Windows authentication and set the following HttpClientHandler properties when initializing HttpClient.
- PreAuthenticate
Gets or sets a value that indicates whether the handler sends an Authorization header with the request.
- UseDefaultCredentials
Gets or sets a value that controls whether default credentials are sent with requests by the handler.
private async Task<SvYearEnd> getHtmlAsync(string uri)
{
    //Send the current Windows credentials with the request
    HttpClientHandler handler = new HttpClientHandler() { PreAuthenticate = true, UseDefaultCredentials = true };
    HttpClient httpClient = new HttpClient(handler);
    var response = await httpClient.GetByteArrayAsync(new Uri(uri));
    //Decode the whole response body as UTF-8
    String srcHtml = Encoding.UTF8.GetString(response);
    srcHtml = WebUtility.HtmlDecode(srcHtml);
    var yearend = this.deserializeHtml(srcHtml);
    //Return null when no employee number was parsed from the page
    if (yearend.EmpNo != 0)
    {
        return yearend;
    }
    else
        return null;
}
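GetByteArrayAsync throws an HttpRequestException when the server answers with a non-success status such as 401 or 404, so a single bad ID would stop the whole loop. The following is only a sketch of one way around that, not the code used in this post: it reuses one HttpClient and returns null for failed requests. SharePointClient, CreateHandler and TryGetBytesAsync are hypothetical names.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

public static class SharePointClient
{
    // Hypothetical helper: a handler that sends the current Windows credentials.
    public static HttpClientHandler CreateHandler()
    {
        return new HttpClientHandler
        {
            PreAuthenticate = true,
            UseDefaultCredentials = true
        };
    }

    // One shared client for every request instead of one per uri.
    private static readonly HttpClient Client = new HttpClient(CreateHandler());

    // Returns the response body, or null when the status code is not a success.
    public static async Task<byte[]> TryGetBytesAsync(string uri)
    {
        using (var response = await Client.GetAsync(uri))
        {
            if (response.StatusCode == HttpStatusCode.Unauthorized)
            {
                Console.Error.WriteLine("401 (Unauthorized): " + uri);
                return null;
            }
            if (!response.IsSuccessStatusCode)
            {
                return null;
            }
            return await response.Content.ReadAsByteArrayAsync();
        }
    }

    public static void Main()
    {
        // No network call here; just shows the handler settings.
        var handler = CreateHandler();
        Console.WriteLine(handler.PreAuthenticate && handler.UseDefaultCredentials);
    }
}
```

With this variant the caller checks for null before decoding and parsing the bytes.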
▋HtmlAgilityPack : Parse Html
Load the HTML with HtmlAgilityPack; then we can use LINQ with lambda expressions to find nodes by id, class, etc. and read their inner text.
private object parseHtml(string srcHtml)
{
    var html = new HtmlDocument();
    html.LoadHtml(srcHtml);
    //Get the root
    var root = html.DocumentNode;
    //Remove the comments
    root.Descendants().Where(n => n.NodeType == HtmlAgilityPack.HtmlNodeType.Comment).ToList().ForEach(n => n.Remove());
    //Search where id=='XXX'
    var textAnswers = root.Descendants().Where(n => n.GetAttributeValue("id", "").Equals("SPFieldText")).ToList();
    textAnswers.ForEach(x =>
    {
        var empId = x.InnerText.Trim(); //…
    });
}
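To make the id and class lookups concrete, here is a small self-contained sketch that can be run against a made-up HTML fragment. The markup, the class name ms-formbody, and the helpers TextById and TextByClass are assumptions for illustration only; the real SharePoint page will look different.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class HapSketch
{
    // Trimmed inner text of every node whose id attribute equals the given id.
    public static List<string> TextById(string html, string id)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc.DocumentNode.Descendants()
            .Where(n => n.GetAttributeValue("id", "") == id)
            .Select(n => n.InnerText.Trim())
            .ToList();
    }

    // Trimmed inner text of every node whose class attribute contains the given class name.
    public static List<string> TextByClass(string html, string cssClass)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc.DocumentNode.Descendants()
            .Where(n => n.GetAttributeValue("class", "").Split(' ').Contains(cssClass))
            .Select(n => n.InnerText.Trim())
            .ToList();
    }

    public static void Main()
    {
        // Hypothetical fragment, not copied from a real SharePoint page.
        const string html =
            "<table><tr>" +
            "<td id=\"SPFieldText\"> 1001 </td>" +
            "<td class=\"ms-formbody extra\">Yes</td>" +
            "</tr></table>";

        Console.WriteLine(string.Join(",", TextById(html, "SPFieldText")));
        Console.WriteLine(string.Join(",", TextByClass(html, "ms-formbody")));
    }
}
```

Splitting the class attribute on spaces lets the lookup match an element that carries several class names.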