How to Identify All URLs and Broken Links of a Website


If you want a list of every URL on your website, for example to process those URLs further, you can collect them with “LinkChecker”.

“LinkChecker is a free, GPL licensed website validator. LinkChecker checks links in web documents or full websites. It runs on Python 2 systems, requiring Python 2.7.2 or later.” (This quote describes older releases; current LinkChecker releases run on Python 3.) So it can list all the URLs of a website and also report the URLs that are broken and not working.

To install LinkChecker on Ubuntu, run:

$ sudo apt-get install linkchecker

We installed it on Ubuntu 20.04; packages for other platforms can be downloaded from http://ftp.debian.org/debian/pool/main/l/linkchecker/ if necessary.

Now let's test it on another simple HTML website of ours. [ You can change the URL to your own website. ]

$ linkchecker http://www.byteslices.com -v -F text/website-urls.txt

INFO 2017-08-12 12:00:23,486 MainThread Checking intern URLs only; use --check-extern to check extern URLs.
1 thread active, 0 links queued, 0 links in 0 URLs checked, runtime 1 seconds
10 threads active, 53 links queued, 92 links in 12 URLs checked, runtime 6 seconds
10 threads active, 22 links queued, 183 links in 21 URLs checked, runtime 11 seconds
3 threads active, 0 links queued, 212 links in 31 URLs checked, runtime 16 seconds

With “-v” (verbose mode), the above command logs every URL it checks, not just the ones with problems. The option “-F text/website-urls.txt” also saves the output to the file “website-urls.txt” in “text” format.

You can list the other command-line options of “linkchecker” with:

$ linkchecker --help

Now let's identify the real URLs of the HTML pages on this website. In the text output, the entry for a single URL looks something like this:

URL `css/logo-nav.css'
Parent URL http://www.byteslices.com, line 19, col 5
Real URL http://www.byteslices.com/css/logo-nav.css
Check time 1.657 seconds
Size 2KB
Result Valid: 200 OK

URL `index.html'
Name `\n Byteslices Technologies\n '
Parent URL http://www.byteslices.com, line 58, col 17
Real URL http://www.byteslices.com/index.html
Check time 2.154 seconds
D/L time 0.038 seconds
Size 9KB
Result Valid: 200 OK

This excerpt shows the entries for two URLs, one CSS file and one HTML page. We only want the HTML pages, so we run grep on the log we collected:

$ cat website-urls.txt | grep "Real URL" | grep html > only-html-links.txt

The above command extracts only the lines for HTML links and saves them to another text file, “only-html-links.txt”.
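Note that a plain “grep html” keeps any line that contains “html” anywhere, including in a directory or file name. A slightly stricter filter is sketched below. The sample log and the file names “sample-urls.txt” and “sample-html-links.txt” are made up for illustration, and the pattern assumes that each “Real URL” line ends right after the URL (no trailing spaces or Windows line endings).

```shell
# Hypothetical excerpt of a LinkChecker text log, used only for illustration.
cat > sample-urls.txt <<'EOF'
URL        `css/html-theme.css'
Real URL   http://www.byteslices.com/css/html-theme.css
Result     Valid: 200 OK

URL        `about.html'
Real URL   http://www.byteslices.com/about.html
Result     Valid: 200 OK
EOF

# Keep only "Real URL" lines whose URL ends in ".html".
# A plain "grep html" would also keep the css/html-theme.css line.
grep -E '^Real URL[[:space:]]+.*\.html$' sample-urls.txt > sample-html-links.txt
cat sample-html-links.txt
```

Pages served without an extension, or with a query string after “.html”, would be missed by this pattern, so adjust it to how your own site names its pages.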

If you look at this file, you will see some duplicated lines, typically because the same page is linked from more than one place on the site. Let's remove the duplicates using the “sort” and “uniq” commands:

$ sort only-html-links.txt | uniq
Real URL   http://www.byteslices.com/about.html
Real URL   http://www.byteslices.com/contact.html
Real URL   http://www.byteslices.com/embedded-iot.html
Real URL   http://www.byteslices.com/index.html
Real URL   http://www.byteslices.com/web-technologies.html

So now we have all the unique links with the .html extension from this website. Next, let's remove the “Real URL” text from each line. To do this, first save the output of the above command to a text file:

$ sort only-html-links.txt | uniq > only_uniq_links.txt

Open the text file “only_uniq_links.txt” in gedit and use Find and Replace: enter “Real URL ” (including the spaces that follow it) in the Find field, leave the Replace field empty, and click “Replace All”.
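If you prefer to stay on the command line, the same cleanup can probably be done with sed instead of an editor. This is only a sketch: the sample file names “sample-links.txt” and “sample-clean-links.txt” are made up for illustration, and the input is assumed to look like the “Real URL” lines shown earlier.

```shell
# Illustrative input in the same shape as only-html-links.txt.
cat > sample-links.txt <<'EOF'
Real URL   http://www.byteslices.com/index.html
Real URL   http://www.byteslices.com/about.html
Real URL   http://www.byteslices.com/index.html
EOF

# Strip the leading "Real URL" label and the whitespace after it,
# then sort the URLs and drop duplicates in one step (sort -u).
sed -E 's/^Real URL[[:space:]]+//' sample-links.txt | sort -u > sample-clean-links.txt
cat sample-clean-links.txt
```

Here “sort -u” does the job of “sort | uniq”, so the whole cleanup fits in one pipeline.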

Done. The file “only_uniq_links.txt” now contains just the URLs:

http://www.byteslices.com/about.html
http://www.byteslices.com/contact.html
http://www.byteslices.com/embedded-iot.html
http://www.byteslices.com/index.html
http://www.byteslices.com/web-technologies.html
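The same log can also be used to list broken links. The sketch below assumes that each entry in the text log is separated by a blank line and that a failed check has a “Result” line that does not start with “Valid” (for example “Error: 404 Not Found”). The sample log, the page “old-page.html” and the file names are hypothetical, so check them against your own “website-urls.txt” before relying on this.

```shell
# Hypothetical log excerpt with one working and one broken link.
cat > sample-log.txt <<'EOF'
URL        `about.html'
Real URL   http://www.byteslices.com/about.html
Result     Valid: 200 OK

URL        `old-page.html'
Real URL   http://www.byteslices.com/old-page.html
Result     Error: 404 Not Found
EOF

# RS="" makes awk read one blank-line-separated entry at a time;
# FS="\n" makes every line of the entry a separate field.
awk 'BEGIN { RS = ""; FS = "\n" }
{
    url = ""; bad = 0
    for (i = 1; i <= NF; i++) {
        # Remember the resolved URL of this entry, without its label.
        if ($i ~ /^Real URL/) { url = $i; sub(/^Real URL[ \t]+/, "", url) }
        # Flag the entry if its Result line is anything other than "Valid".
        if ($i ~ /^Result/ && $i !~ /^Result[ \t]+Valid/) bad = 1
    }
    if (bad && url != "") print url
}' sample-log.txt > sample-broken-links.txt
cat sample-broken-links.txt
```

Entries without a “Result” line, such as the header and summary sections of the log, are skipped.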
