How to check all links / URLs of a website and identify broken links

If you want a list of all the URLs on your website so that you can do something with them, you can get it with “LinkChecker”.

“LinkChecker is a free, GPL licensed website validator. LinkChecker checks links in web documents or full websites. It runs on Python 2 systems, requiring Python 2.7.2 or later.” So it can list all the URLs of a website and also report the URLs that are broken.

To install LinkChecker on Ubuntu, follow the steps below:

 wget -c 

We installed the 32-bit Ubuntu package; a 64-bit version can be downloaded instead if necessary.

 sudo dpkg -i linkchecker_9.3-4_i386.deb 

Now, let's test it with another simple HTML website of ours. [ You can change the URL to your own website's address. ]

 linkchecker -v -F text/website-urls.txt 
INFO 2017-08-12 12:00:23,486 MainThread Checking intern URLs only; use --check-extern to check extern URLs.
1 thread active, 0 links queued, 0 links in 0 URLs checked, runtime 1 seconds
10 threads active, 53 links queued, 92 links in 12 URLs checked, runtime 6 seconds
10 threads active, 22 links queued, 183 links in 21 URLs checked, runtime 11 seconds
3 threads active, 0 links queued, 212 links in 31 URLs checked, runtime 16 seconds 

In the above command, “-v” (verbose mode) logs every URL that is checked, not just the problematic ones, and “-F text/website-urls.txt” saves the output to the file “website-urls.txt” in “text” format.

You can see the other command-line parameters of “linkchecker” using:

 linkchecker --help 

Now, let's identify the real URLs of the HTML pages that exist on this website. In the text output, the entry for a single URL looks something like this:


URL `css/logo-nav.css'
Parent URL, line 19, col 5
Real URL
Check time 1.657 seconds
Size 2KB
Result Valid: 200 OK

URL `index.html'
Name `\n Byteslices Technologies\n '
Parent URL, line 58, col 17
Real URL
Check time 2.154 seconds
D/L time 0.038 seconds
Size 9KB
Result Valid: 200 OK

The section above covers two URLs, one CSS file and one HTML page. We only want the HTML pages, so we run grep on the log we collected, as below:

 grep "Real URL" website-urls.txt | grep html > only-html-links.txt 

The above command extracts only the HTML links and saves them to another text file, “only-html-links.txt”.

If you look at this file, you will see some duplicated lines, which came from CSS-related URLs, so let's remove the duplicates using the “sort” and “uniq” commands as below:

 $ sort only-html-links.txt | uniq
Real URL
Real URL
Real URL
Real URL
Real URL 
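As a side note, “sort -u” sorts and removes duplicate lines in one step, so the pipe to “uniq” is optional. Here is a small sketch on made-up data; the example.com addresses and the “sample-” file names are placeholders and do not come from the website checked above.

```shell
# Create a small sample file with one duplicated line (hypothetical URLs).
printf '%s\n' \
  'Real URL   http://example.com/index.html' \
  'Real URL   http://example.com/about.html' \
  'Real URL   http://example.com/index.html' > sample-links.txt

# sort -u sorts the lines and drops duplicates in a single step.
sort -u sample-links.txt > sample-uniq-links.txt
cat sample-uniq-links.txt
```

The three input lines should come out as two, one per distinct URL.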

So now we have all the unique links with the html extension from this website. Next, let's remove the “Real URL” text from this file. To do this, first save the output of the above command to a text file:

 sort only-html-links.txt | uniq > only_uniq_links.txt 

Open the text file “only_uniq_links.txt” in gedit and use Find and Replace: enter “Real URL ” in the Find field, leave the Replace field empty, and click “Replace All”.
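If you would rather stay in the terminal, the same find-and-replace can probably be done with sed. This is only a sketch: it assumes every line starts with the label “Real URL” followed by spaces, and it uses made-up example.com addresses and “sample-” file names instead of the real files.

```shell
# Sample input in the same shape as only_uniq_links.txt (hypothetical URLs).
printf '%s\n' \
  'Real URL   http://example.com/about.html' \
  'Real URL   http://example.com/index.html' > sample-labelled-links.txt

# Delete the leading "Real URL" label and any spaces that follow it.
sed 's/^Real URL *//' sample-labelled-links.txt > sample-urls-only.txt
cat sample-urls-only.txt
```

To use it on the real file, replace the sample file names with “only_uniq_links.txt” and an output file of your choice.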

And that's it. The “only_uniq_links.txt” text file now contains just the URLs.
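The steps above collect the working pages, but the same log can also be used to pick out broken links. The following is only a sketch. It assumes that a failed check is reported with a “Result” line containing “Error”, in the same per-URL layout as the valid entries shown earlier. The log content is invented for illustration (example.com addresses, “sample-log.txt”), so check it against your own “website-urls.txt” before relying on it.

```shell
# A tiny invented log with one valid and one broken entry.
cat > sample-log.txt <<'EOF'
URL        `index.html'
Real URL   http://example.com/index.html
Result     Valid: 200 OK

URL        `missing.html'
Real URL   http://example.com/missing.html
Result     Error: 404 Not Found
EOF

# Remember the most recent "Real URL" value and print it when the
# "Result" line of the same entry reports an error.
awk '/^Real URL/ { url = $3 }
     /^Result/ && /Error/ { print url }' sample-log.txt > sample-broken-links.txt
cat sample-broken-links.txt
```

The awk script relies on the “Real URL” line appearing before the “Result” line within each entry, as it does in the output shown earlier.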
