How to check all links / URLs of a website and identify broken links

If you want a list of all the URLs on your website so that you can do something with them, there is a way to get those URLs using “LinkChecker”.

“LinkChecker is a free, GPL licensed website validator. LinkChecker checks links in web documents or full websites. It runs on Python 2 systems, requiring Python 2.7.2 or later.” So it can help us list all the URLs of a website and also report the URLs which are broken and not working.

To install LinkChecker on Ubuntu, follow the steps below,

 wget -c http://ftp.debian.org/debian/pool/main/l/linkchecker/linkchecker_9.3-4_i386.deb 

We downloaded the 32-bit package for Ubuntu; if you need the 64-bit version, you can download it from http://ftp.debian.org/debian/pool/main/l/linkchecker/ instead.

 sudo dpkg -i linkchecker_9.3-4_i386.deb 
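Since the package has to match your machine, a small helper can tell you which architecture suffix to look for in the download directory. This is only a minimal sketch: the function name `deb_arch_for` is hypothetical, and the mapping covers just the 32-bit and 64-bit PC cases mentioned above.

```shell
#!/bin/sh
# Hypothetical helper: map a `uname -m` machine name to the
# architecture suffix used in Debian package file names.
deb_arch_for() {
    case "$1" in
        x86_64)              echo "amd64" ;;   # 64-bit PC
        i386|i486|i586|i686) echo "i386" ;;    # 32-bit PC
        *)                   echo "unknown" ;; # anything else is not handled here
    esac
}

# Print the suffix for the machine this script runs on
deb_arch_for "$(uname -m)"
```

If this prints amd64, packages built for your machine normally have a file name ending in `_amd64.deb` rather than `_i386.deb`.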

Now, let's test it with one of our simple HTML websites. [ You can change the URL to that of your own website ]

 linkchecker http://www.byteslices.com -v -F text/website-urls.txt 
INFO 2017-08-12 12:00:23,486 MainThread Checking intern URLs only; use --check-extern to check extern URLs.
1 thread active, 0 links queued, 0 links in 0 URLs checked, runtime 1 seconds
10 threads active, 53 links queued, 92 links in 12 URLs checked, runtime 6 seconds
10 threads active, 22 links queued, 183 links in 21 URLs checked, runtime 11 seconds
3 threads active, 0 links queued, 212 links in 31 URLs checked, runtime 16 seconds 

The above command prints all URLs because of the “-v” (verbose) option, and “-F text/website-urls.txt” saves the output to the file “website-urls.txt” in “text” format.

You can check other command line parameters of “linkchecker” using,

 linkchecker --help 

Now, let's try to identify the real HTML URLs that exist on this website. The linkchecker text output for a single URL looks something like this,

URL `css/logo-nav.css'
Parent URL http://www.byteslices.com, line 19, col 5
Real URL http://www.byteslices.com/css/logo-nav.css
Check time 1.657 seconds
Size 2KB
Result Valid: 200 OK

URL `index.html'
Name `\n Byteslices Technologies\n '
Parent URL http://www.byteslices.com, line 58, col 17
Real URL http://www.byteslices.com/index.html
Check time 2.154 seconds
D/L time 0.038 seconds
Size 9KB
Result Valid: 200 OK

This section of the output shows just two URLs, one for a CSS file and one for an HTML page. We only want the HTML pages, so we will use the grep command on the log we collected, as below,

 cat website-urls.txt | grep "Real URL" | grep html > only-html-links.txt 

The above command extracts only the HTML links and saves them to another text file, “only-html-links.txt”.
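Keep in mind that `grep html` matches the text html anywhere in the line. A slightly stricter variant, sketched here on a few sample lines copied from the output above, keeps only the “Real URL” lines whose URL ends in `.html`. The function name `html_real_urls` and the file name `sample-urls.txt` are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: keep only "Real URL" lines whose URL ends in .html
# (trailing whitespace after the URL is tolerated).
html_real_urls() {
    grep '^Real URL' "$1" | grep '\.html[[:space:]]*$'
}

# Sample log lines taken from the linkchecker output shown earlier
cat > sample-urls.txt <<'EOF'
URL `css/logo-nav.css'
Real URL http://www.byteslices.com/css/logo-nav.css
Result Valid: 200 OK
URL `index.html'
Real URL http://www.byteslices.com/index.html
Result Valid: 200 OK
EOF

html_real_urls sample-urls.txt
# prints: Real URL http://www.byteslices.com/index.html
```

On the real log you would call it as `html_real_urls website-urls.txt > only-html-links.txt`.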

If you look at this file, you will see some duplicated lines, since the same URL can be reported more than once. Let's remove those duplicated lines using the “sort” and “uniq” commands as below,

 $ sort only-html-links.txt | uniq
Real URL   http://www.byteslices.com/about.html
Real URL   http://www.byteslices.com/contact.html
Real URL   http://www.byteslices.com/embedded-iot.html
Real URL   http://www.byteslices.com/index.html
Real URL   http://www.byteslices.com/web-technologies.html 

So now we have all the unique links with the html extension from this website. Next, let's remove the “Real URL” text from this file. To do this, we first save the output of the above command to a text file,

 sort only-html-links.txt | uniq > only_uniq_links.txt 
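As a side note, `sort` can drop duplicates itself with the `-u` option, so the pipe to `uniq` is optional. A small sketch on sample lines (the file names `dup-sample.txt` and `dup-sample-uniq.txt` are only for illustration):

```shell
#!/bin/sh
# Sample input with one duplicated line (illustrative data only)
printf '%s\n' \
    'Real URL http://www.byteslices.com/index.html' \
    'Real URL http://www.byteslices.com/about.html' \
    'Real URL http://www.byteslices.com/index.html' > dup-sample.txt

# sort -u sorts the lines and removes duplicates in one step
sort -u dup-sample.txt > dup-sample-uniq.txt
cat dup-sample-uniq.txt
```

The result should be the same as with `sort ... | uniq`, just with one command fewer.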

Open the text file “only_uniq_links.txt” in gedit and use Find and Replace: enter “Real URL ” (including the spaces that follow it) in the Find field, leave the Replace field empty, and click “Replace All”.
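If you prefer to stay on the command line, the same find and replace can be done with `sed`. This is a sketch rather than the only way to do it; the function name `strip_real_url` and the file name `label-sample.txt` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: remove the leading "Real URL" label and the
# whitespace after it, leaving only the URL on each line.
strip_real_url() {
    sed 's/^Real URL[[:space:]]*//' "$1"
}

# Two sample lines in the same shape as the sorted output above
printf '%s\n' \
    'Real URL   http://www.byteslices.com/about.html' \
    'Real URL   http://www.byteslices.com/index.html' > label-sample.txt

strip_real_url label-sample.txt
```

For the real file, something like `strip_real_url only_uniq_links.txt > urls-only.txt` writes the cleaned list to a new file (the output name is just an example).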

And done. You get URLs like those below in the “only_uniq_links.txt” text file.

http://www.byteslices.com/about.html
http://www.byteslices.com/contact.html
http://www.byteslices.com/embedded-iot.html
http://www.byteslices.com/index.html
http://www.byteslices.com/web-technologies.html
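The same log can also be used to find broken links. The sketch below assumes that a failed check is written as a “Result” line beginning with “Error” (for example “Error: 404 Not Found”), mirroring the “Result Valid: 200 OK” lines shown earlier. That format, the sample data (including missing.html), and the function name `broken_urls` are assumptions, so please check them against your own log.

```shell
#!/bin/sh
# Hypothetical helper: print the Real URL of every entry whose
# Result line starts with "Error" (assumed format, see the note above).
broken_urls() {
    awk '
        /^Real URL/          { url = $3 }     # remember the most recent URL
        /^Result[ \t]+Error/ { print url }    # print it when the check failed
    ' "$1"
}

# Made-up sample log: one valid entry and one broken entry
cat > result-sample.txt <<'EOF'
URL `index.html'
Real URL http://www.byteslices.com/index.html
Result Valid: 200 OK

URL `missing.html'
Real URL http://www.byteslices.com/missing.html
Result Error: 404 Not Found
EOF

broken_urls result-sample.txt
```

Running `broken_urls website-urls.txt` on the real log would then list only the URLs that failed, assuming the log uses the format described above.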
