Find sub-URLs
-
Hi, I have a URL like http://my.lotro.com/home. Is it possible to find sub-URLs like http://my.lotro.com/home/character/4368675/150026162587528896 (similar to FindFirstFile(), ...)? Web spiders find them, for example. Thanks.
-
Web spiders request documents and parse them. During parsing, all links found in the document are stored to be processed later. If you want to find sub-URLs, you must do something similar. Note, however, that this will not find URLs that are not linked from any page.

For some URLs without a file specification you may get a directory listing containing all files and sub-directories. But most servers will either send you a default page (often index.html) when no file is specified, or deny the request (listing of directories prohibited).

UPDATE: You may use the GNU wget utility (also available for Windows) to perform such scanning. The following command downloads and parses all files, deletes them afterwards, and prints a line for each URL:
wget -r -nd --delete-after http://my.lotro.com/home/
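If you would rather do the same thing in code, here is a minimal sketch of the fetch-and-parse step in Python, using only the standard library. The start URL comes from the question; the function name find_sub_urls and the "keep only links below the start URL" filter are my own illustration, not an established API.

# Minimal link-extraction sketch (illustrative, not production code).
# Fetches one page, parses its anchor tags, and keeps only URLs that
# live below the start URL -- the "sub-URLs" asked about.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href attribute of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def find_sub_urls(start_url):
    """Return links on start_url's page that point below start_url."""
    html = urlopen(start_url).read().decode("utf-8", errors="replace")
    parser = LinkParser()
    parser.feed(html)
    # Resolve relative links, then filter to those under the start URL.
    absolute = (urljoin(start_url, link) for link in parser.links)
    return sorted({u for u in absolute if u.startswith(start_url)})


if __name__ == "__main__":
    for url in find_sub_urls("http://my.lotro.com/home/"):
        print(url)

A real spider repeats this step for every URL it finds, keeping a set of already-visited pages to avoid loops; that recursion is essentially what the wget command above automates with -r.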
-
There is no standard directory listing method in the HTTP protocol. Directory listing is usually disabled (it is a security hole) for most, if not all, of a site, and even where it is enabled you get back an index file if one is present (as with http://my.lotro.com/home). Even if a GET on the directory returns a listing, it is still a non-standard, server-generated HTML page that you have to parse somehow. It is a waste of time trying to solve this problem, because it cannot be solved in general. Web spiders simply follow links found on websites; they do not do directory listings.
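You can check this behaviour yourself with a few lines of Python's standard library (a sketch only; the URL is the one from the question, and what comes back depends entirely on the server's configuration):

# Quick check: a GET on a "directory" URL returns whatever the server
# chooses to serve -- usually an ordinary index page, not a file listing.
from urllib.request import urlopen

with urlopen("http://my.lotro.com/home/") as response:
    print(response.status)                        # e.g. 200
    print(response.headers.get("Content-Type"))   # e.g. text/html
    print(response.read(200))                     # start of an HTML page, not a listing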