Using mshtml to parse an HTML document in C# is very slow - why?
-
Parsing an HTML document in C# using the mshtml object model is very slow. Why is this happening? Is it possible to avoid it?
We haven't seen this. Have you run your code through a profiler? Perhaps there's something on your end causing the problem.
-
What do you mean by a profiler? I have written some test code that simply gets all children of each node recursively and runs through every attribute of each node. Even very simple HTML pages take 1-2 seconds. However, the same pages, with the same algorithm, are processed instantly in VB6 or VC++ 6. Can you provide me with a simple C# project that does the same thing and runs instantly on HTML pages of medium complexity?
-
This is the C# code that I used for tests (doc is an HTMLDocument variable):
ArrayList nodes=new ArrayList();
mshtml.IHTMLDOMNode nod;
children=(mshtml.IHTMLDOMChildrenCollection)doc.childNodes;
foreach(mshtml.IHTMLDOMNode node in children) nodes.Add(node);
mshtml.IHTMLAttributeCollection attributes;
for(int a=0;a
-
Go download the ANTS profiler trial, run your project with it, and it'll tell you what's taking time. My guess is one of three things: you're modifying the DOM document (which is not too fast), there's a bug in the interop assembly (microsoft.mshtml.dll), or you're not caching your references to the DOM objects. Run a profiler and see what's taking up time.
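Short of a full profiler, a rough first measurement can be taken with System.Diagnostics.Stopwatch. The sketch below is not a replacement for ANTS: it only reports how long one block of work takes, not where the time goes. The Timing class, the Measure helper, and the placeholder workload are illustrative names; in a real test the delegate would wrap the DOM traversal.

```csharp
using System;
using System.Diagnostics;

static class Timing
{
    // Runs 'work' once and returns the elapsed wall-clock time.
    public static TimeSpan Measure(Action work)
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        work();
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }

    static void Main()
    {
        // Placeholder workload; replace the body with the traversal under test.
        TimeSpan elapsed = Measure(delegate
        {
            long sum = 0;
            for (int i = 0; i < 1000000; i++) sum += i;
        });

        Console.WriteLine("Elapsed: {0} ms", elapsed.TotalMilliseconds);
    }
}
```

Timing the children loop and the attributes loop separately would give a first hint about which of the two dominates before a profiler is brought in.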
-
1. The program definitely doesn't modify the DOM. 2. There is probably a bug in the interop. 3. I am caching the references to DOM objects (i.e. I always access the objects through a local variable). I have made a simple COM component in VB6 that takes an HTMLDocument object as input and runs the same algorithm as the one written in C#. If it is called from a VB6 program, it runs very fast. If it is called from C#, it runs slowly, about as slowly as the C# code. But if the VB6 component receives the URL of the page, loads the page itself and then calls the function that parses it, it runs very fast again. I think this is due to some kind of .NET object wrapper.
-
I have run the profiler. Most of the time is spent accessing or enumerating (with foreach, or with IEnumerator) all children or all attributes of each node.
Try enumerating them with a regular for loop. Also, cache references to these if you're accessing them multiple times. You may also want to profile mshtml.dll when you run this to see if there's anything odd going on there.
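The suggestion above could look roughly like the sketch below: each collection and its length are fetched once per node, and items are read by index instead of through foreach. It assumes a reference to the Microsoft.mshtml interop assembly and that the interop signatures are as used here (item(int) on IHTMLDOMChildrenCollection, item(ref object) on IHTMLAttributeCollection). The DomWalker class and the Walk method are illustrative names, not part of mshtml, and whether this is actually faster for the case in this thread is untested.

```csharp
using System;
using System.Collections.Generic;
using mshtml; // assumes a reference to the Microsoft.mshtml interop assembly

static class DomWalker
{
    // Visits every node below 'root' and reads every specified attribute once.
    // Returns the number of nodes visited.
    public static int Walk(IHTMLDOMNode root)
    {
        int visited = 0;
        Stack<IHTMLDOMNode> pending = new Stack<IHTMLDOMNode>();
        pending.Push(root);

        while (pending.Count > 0)
        {
            IHTMLDOMNode node = pending.Pop();
            visited++;

            // Fetch the attribute collection once per node; the cast yields null
            // for nodes that expose no attribute collection.
            IHTMLAttributeCollection attributes = node.attributes as IHTMLAttributeCollection;
            if (attributes != null)
            {
                int attributeCount = attributes.length; // read length once, not per iteration
                for (int a = 0; a < attributeCount; a++)
                {
                    object index = a; // item() takes its argument by reference
                    IHTMLDOMAttribute attribute = attributes.item(ref index) as IHTMLDOMAttribute;
                    if (attribute != null && attribute.specified)
                    {
                        string name = attribute.nodeName;
                        object value = attribute.nodeValue;
                    }
                }
            }

            // Same pattern for children: one collection reference, one length read.
            IHTMLDOMChildrenCollection children = node.childNodes as IHTMLDOMChildrenCollection;
            if (children != null)
            {
                int childCount = children.length;
                for (int c = 0; c < childCount; c++)
                {
                    IHTMLDOMNode child = children.item(c) as IHTMLDOMNode;
                    if (child != null) pending.Push(child);
                }
            }
        }

        return visited;
    }
}
```

With doc being the HTMLDocument variable from the test code, a call would presumably be DomWalker.Walk((IHTMLDOMNode)doc). An explicit stack is used instead of recursion so that every interop call sits in one method, which makes the profiler output easier to read.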
-
I am sure that if something is wrong, it is wrong with mshtml.dll. As I said before, if I run a VB6 COM component that uses the HTMLDocument from C# (received as a parameter to its function), it runs slowly, but if the VB6 component doesn't take the C# HTMLDocument and loads the page itself, it runs fast. How do I run the profiler on mshtml.dll? As you said at the beginning, you haven't encountered this kind of slowness. Have you written a similar module or code that parses an HTMLDocument object using mshtml? If so, which version of VS.NET did you use, and on which version of Windows? Thanks.
-
When you run the profiler (assuming you're running the ANTS profiler), select "Profile all .NET methods" in the profiler wizard, which should give you some results on MSHTML.
-
I've tried accessing the nodes by index (not through enumeration), and it doesn't help much. I ran the profiler on mshtml.dll too, and it showed that most of the time is spent accessing the attributes and children of nodes. So there are only two possible explanations: either .NET is slow when accessing mshtml elements, or the mshtml interop was generated incorrectly (which I doubt).