Hi, I am trying to use PyQt and Python to get the dynamic content from a web page. The problem is that I still only get the static content. What could be wrong with the code below? The code is based on this link: https://impythonist.wordpress.com/2015/01/06/ultimate-guide-for-scraping-javascript-rendered-web-pages/
import sys
import time
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *
from lxml.html import fromstring, tostring, iterlinks
class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()
        print("inside 1")

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()
        print("inside 2")
    #def userAgentForUrl(self, url):
    #    return 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36 OPR/32.0.1948.25'
url = 'http://www.somepage.com'
r = Render(url)
print("inside 3")
print("Sleeping..")
time.sleep(5)
print("Sleeping done")
result = r.mainFrame().toHtml()
print(result.encode('utf-8'))
I added the sleep(5) to ensure that the dynamic content has time to load, but it does not help. Why doesn't r.mainFrame() contain the dynamically created page contents? Is it not updated after the page-loaded (loadFinished) event? Regards
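
A possible explanation, offered as an assumption rather than a confirmed diagnosis: app.quit() is called as soon as loadFinished fires, so app.exec_() returns and the Qt event loop is no longer running when time.sleep(5) executes. time.sleep() only blocks the Python thread, so the page's JavaScript would get no chance to run during those five seconds. A minimal sketch of a variation that keeps the event loop alive for a fixed delay before quitting is shown below. The function name render_html and the wait_ms parameter are hypothetical, and a fixed delay is only a guess at how long the page's scripts need.

```python
import sys


def render_html(url, wait_ms=5000):
    """Load url in QtWebKit and return the frame's HTML after waiting wait_ms.

    Hypothetical helper: the name and the wait_ms parameter are illustrative.
    """
    # PyQt4 is imported inside the function so that this file can still be
    # imported on a machine where PyQt4 is not installed.
    from PyQt4.QtCore import QTimer, QUrl
    from PyQt4.QtGui import QApplication
    from PyQt4.QtWebKit import QWebPage

    # Reuse an existing QApplication if there is one, otherwise create it.
    app = QApplication.instance() or QApplication(sys.argv)
    page = QWebPage()

    def on_load_finished(ok):
        # Do not quit immediately. Let the event loop keep running for
        # wait_ms so timers and network callbacks of the page can fire.
        QTimer.singleShot(wait_ms, app.quit)

    page.loadFinished.connect(on_load_finished)
    page.mainFrame().load(QUrl(url))
    app.exec_()  # returns only after the delayed app.quit() has run
    return page.mainFrame().toHtml()
```

With this sketch, result = render_html('http://www.somepage.com') would stand in for the Render(url), time.sleep(5) and r.mainFrame().toHtml() lines above. If the page loads its data later than wait_ms, the returned HTML would still be incomplete, so the delay may need tuning.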