How to know whether a string contains a URL?

Waleed Eissa

How can I know whether a string contains a URL? It's very easy if it starts with http://, but I'm talking about URLs that don't start with http://. I don't want to extract the URL; I just want to know whether a string contains one or not. Any ideas?

Waleed Eissa Software Developer Sydney

Manas Bhardwaj

Use regular expressions

Please remember to rate helpful or unhelpful answers, it lets us and people reading the forums know if our answers are any good.

Christian Graus

The answer you were given was good. The only other thing you could do, to see if it's a VALID URL, is to make an HTTP request to it and see what you get back.

Christian Graus No longer a Microsoft MVP, but still happy to answer your questions.

Waleed Eissa

Thanks for your reply. Actually, the problem I find with using a regular expression is that it can become really hard to distinguish normal text from a URL. As I mentioned in my post, I want to find URLs even if they don't start with http://, and this is what makes it really challenging.

Waleed Eissa

Thanks for your answer, but I'm afraid this is not possible. I'm just trying to write a spam filter for my website, so I can't keep users waiting that long. I thought about searching for all TLDs, but I don't think it's a good idea performance-wise. Do you know of any good spam filter that I can call from an ASP.NET application? That is, I send it a string and get back something like a boolean indicating whether it's spam or not. A percentage (how likely the post is to be spam) would be even better than a boolean. Thanks.

Christian Graus

I think I just answered this in the ASP.NET forum. There is no way of knowing whether a string is a *valid* (live) URL without making a request to it. Telling whether a string is a well-formed URL is easy with a regex, though.

Waleed Eissa

OK, now I get your point. Actually, I don't care whether the URLs are valid or not; as I mentioned before, this is just for spam filtering, so it's not important to check whether they are valid. Let me explain from the beginning (hopefully you have the time to read all this :)).

On my website, users will be adding a lot of posts in a short time, and I want the site to be as fast and responsive as possible when they do this. So, basically, I'm looking for a spam filter that runs on my own machine (as opposed to spam filters that call a web service on another site, like Akismet, which can be fine for blogs and sites that don't receive many posts). Unfortunately I haven't been able to find such a thing so far, which is why I'm trying to write it myself, and it seems more complicated than I thought.

I thought of two approaches I can use to detect spam:

- Using naive Bayes (there's an article here on CodeProject that talks about that, see http://www.codeproject.com/KB/recipes/BayesianCS.aspx)
- Using some rules that usually apply to spam, and this is what I'm trying to do.

Naive Bayes is actually very effective in most cases, but I'm going with rules basically because of something related to my app. Read on:

Due to the nature of my website, users wouldn't normally post any text that contains links (and I don't change links that start with http:// into anchor tags). So it's reasonable to assume that posts that contain links will most likely be spam. Spammers can spam your site for two reasons. The first is to get a higher page rank for some website, or more accurately for some web page (which doesn't work in my case, as I don't change links into anchor tags, and even if I did I could use rel="nofollow" as most people do); either way, the point is that the spam contains a URL. The second is to advertise something, and in this case they have to leave a URL, an email address or a phone number (if you can't reach the advertiser then the ad is useless, right?).
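The rule-based idea above (a spam post has to carry a URL, an email address or a phone number) could be sketched roughly as follows. This is Python rather than ASP.NET, purely for illustration; the three patterns and the helper name `contact_signals` are assumptions made up for this sketch, and they are deliberately loose rather than complete.

```python
import re

# Rough, deliberately loose patterns; each one is an illustrative assumption.
RULES = {
    # Anything that starts with a scheme or "www."
    "url": re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE),
    # local-part@domain.tld, without trying to follow the full email grammar
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    # A run of at least nine digits/separators that starts and ends with a digit
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def contact_signals(text: str) -> list:
    """Return the names of the rules that fire for this text."""
    return [name for name, pattern in RULES.items() if pattern.search(text)]
```

A post could then be flagged when the returned list is non-empty. The phone pattern in particular will also fire on long numbers that are not phone numbers, so it would need tuning against real posts.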
You're probably thinking that if I don't change the links into anchor tags they won't spam my site. I can assure you they are dumb enough to do this; I have seen many other websites that don't change links into anchors and are still heavily spammed (but maybe not because they are dumb; it might be because it's rumored that Google detects any links that start with http:// when crawling your site even if they are not in an

Paul Conrad

Use regular expressions, as you have already been told. Instead of checking for something like http://, why not just check for things like .com, .net, .edu, etc.?
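A minimal sketch of that suggestion, written in Python for illustration (the same pattern should carry over to a .NET regex with little change). The TLD list is a tiny sample and `contains_url` is a hypothetical helper name, not something from this thread.

```python
import re

# Illustrative sample only; a real filter would need a much longer list.
SAMPLE_TLDS = ("com", "net", "org", "edu", "info")

# Optional scheme, one or more dot-terminated labels, then a known TLD.
# The trailing lookahead stops "example.community" from matching on "com".
URL_PATTERN = re.compile(
    r"(?:https?://)?"
    r"(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+"
    r"(?:" + "|".join(SAMPLE_TLDS) + r")"
    r"(?![a-z0-9-])",
    re.IGNORECASE,
)

def contains_url(text: str) -> bool:
    """True if the text contains something that looks like host.tld."""
    return URL_PATTERN.search(text) is not None
```

Note that this flags "visit example.com today" but also ordinary text such as "ASP.NET", which shows how hard it is to separate normal text from URLs with a pattern alone.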

"The clue train passed his station without stopping." - John Simmons / outlaw programmer "Real programmers just throw a bunch of 1s and 0s at the computer to see what sticks" - Pete O'Hanlon

Waleed Eissa

Hi Paul, thanks for your answer. The problem with checking for domain names like .com, .net, etc. is that there are too many TLDs to check for (because you have to check for ccTLDs, which are very commonly used by spammers). There are some other problems too; please refer to my last post. Regards
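On the number of TLDs: one possible way around a huge regex alternation is to let a small pattern pick out anything shaped like a host name and then look up its last label in a set, which stays a constant-time check however long the list gets. The sketch below is Python for illustration; `KNOWN_TLDS` is a made-up sample (a real one could be loaded once at startup from a full list such as the one IANA publishes) and `contains_known_tld` is a hypothetical name.

```python
import re

# Hypothetical sample; a real set would hold every gTLD and ccTLD.
KNOWN_TLDS = frozenset({"com", "net", "org", "ru", "cn", "de", "uk", "au"})

# Candidate host names: dot-terminated labels followed by an alphabetic label,
# which is captured so it can be checked against the set.
CANDIDATE = re.compile(
    r"\b(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+([a-z]{2,})\b",
    re.IGNORECASE,
)

def contains_known_tld(text: str) -> bool:
    """True if any host-like token in the text ends in a known TLD."""
    for match in CANDIDATE.finditer(text):
        if match.group(1).lower() in KNOWN_TLDS:
            return True
    return False
```

With this shape, "see file.txt attached" is ignored because "txt" is not in the set, while "shop.example.ru" is caught. Whether it is fast enough for a busy site would still have to be measured.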

Manas Bhardwaj

Waleed Eissa wrote:

Using naive bayesian

But again, the Naive Bayes algorithm doesn't have intelligence of its own. It has to be trained properly to produce results. The more you train it, the better the results it will yield.
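To make the training point concrete, here is a toy word-level naive Bayes sketch in Python. It is not the code from the CodeProject article mentioned earlier; the class name `TinyNaiveBayes`, the "spam"/"ham" labels and the whitespace tokenizer are all assumptions for illustration. It returns a probability rather than a yes/no answer.

```python
import math
from collections import Counter

class TinyNaiveBayes:
    """Word-level naive Bayes with add-one smoothing; illustrative only."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # label must be "spam" or "ham"
        self.word_counts[label].update(text.lower().split())
        self.doc_counts[label] += 1

    def spam_probability(self, text):
        if 0 in self.doc_counts.values():
            raise ValueError("train at least one spam and one ham example first")
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        words = text.lower().split()
        log_scores = {}
        for label in ("spam", "ham"):
            # Log prior plus the summed log likelihood of each word.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in words:
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocab)))
            log_scores[label] = score
        # Normalize the two log scores into a probability of spam.
        top = max(log_scores.values())
        weights = {k: math.exp(v - top) for k, v in log_scores.items()}
        return weights["spam"] / (weights["spam"] + weights["ham"])
```

Until both classes have been trained on a reasonable number of real posts, the number it returns means very little, which is exactly the limitation described above.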

Waleed Eissa

Actually I'm not especially interested in the Naive Bayes algorithm or any other algorithm; I'm just trying to filter out the spam. Can you suggest a better way of doing this? And if you know of a good spam filter that I can use in my application, that would be even better. Regards
