Hi guys, I’m glad to announce a new Blackhat SEO series. In this first part we are going to talk about content scraping. Throughout the series I will be using C#, PHP and MySQL. All you need is the basics of these languages (I’m not good at C# either).
Main difference between Blackhat and Whitehat
Content scraping is absolutely necessary for Blackhat SEO projects. If you do Whitehat SEO, you basically create hand-written, unique content. All this handwritten stuff costs money (you have to hire people or write it on your own). Blackhat SEO, on the other hand, is based on content generation and quantity. Most of us won’t be able to write a content generator that produces results as good as a real person’s, so we replace quality with quantity. To generate “new” content we need some kind of “old” content, and that is the true value of content scraping.
C# class for scraping a page
OK, what comes to your mind when you hear “scrape a page”? For me it means: open and read the page, get all links (outgoing / inner links), and get all valuable text (basically paragraphs longer than 200 to 300 characters). So I’ve created a class in C# that can do all of these things.
1. Reading a web page
public string openPage(string paramUrl)
{
    // an empty url yields an empty page
    if (string.IsNullOrEmpty(paramUrl)) return this.privPage = "";
    // edit the page url
    this.privUrl = paramUrl; // save the url into private class variable
    this.privPureUrl = paramUrl.Replace("http://www.", "").Replace("www.", ""); // eg: from http://www.google.com -> google.com
    //if (this.privPureUrl.IndexOf('/') != -1) this.privPureUrl= this.privPureUrl.Substring(0, this.privPureUrl.IndexOf('/')); // google.com/ -> google.com
    try // in case of error (e.g. connection timeout, bad url and other)
    {
        Uri uri = new Uri(paramUrl); // throws on a malformed url, so it belongs inside the try
        WebRequest req = WebRequest.Create(uri);
        // the using blocks close the response and the streams even when reading fails
        using (WebResponse resp = req.GetResponse())
        using (Stream stream = resp.GetResponseStream())
        using (StreamReader sr = new StreamReader(stream))
        {
            this.privPage = sr.ReadToEnd();
        }
    }
    catch // return blank page
    {
        this.privPage = "";
    }
    return this.privPage;
}
The code above takes the link as a parameter, for example http://www.google.com. We strip the “http://www.” or just the “www.” part from the link, so we are left with “google.com”, which we save into a private class variable (this.privPureUrl). Note that a link without “www.”, such as http://google.com, keeps its “http://” prefix.
Then we open the page and read it whole into a string (the code inside the “try” block). If this fails, we end up in the catch block, where we set the page to an empty string. In the end, we return the opened page as a string.
2. Get all links from a webpage
We have to make one assumption: all links start with href=" or href='

So we can split the whole page into an array, divided by href=" or href='. Regular expressions are a good fit for this. The C# code is here:
public List<string> getLinks()
{
    // start from the body tag (it may carry attributes, so match "<body" only)
    int start_pos = this.privPage.IndexOf("<body", StringComparison.OrdinalIgnoreCase);
    if (start_pos == -1) start_pos = 0; // no body tag: scan the whole page
    string page_content = this.privPage.Substring(start_pos);
    // split the content on href=" or href='
    Regex split_href = new Regex("href=['\"]");
    string[] links = split_href.Split(page_content);
    List<string> links_to_return = new List<string>();
    // links[0] is the text before the first href, so start at 1
    for (int i = 1; i < links.Length; i++)
    {
        // the first quote (single or double) ends the href value
        int end = links[i].IndexOfAny(new char[] { '\'', '"' });
        if (end == -1) continue; // unterminated attribute, skip it
        links_to_return.Add(links[i].Substring(0, end));
    }
    return links_to_return;
}
So this function basically divides our content into parts, with href=" or href=' as the divider. We go through this array and search each part for the first quote (single or double), which ends the href attribute. Then we simply take the substring up to that quote, and we have a single link.
You can upgrade this function to return only outgoing links, only inner links, or whatever you want.
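One way such an upgrade might look is sketched below. This is not part of the original class: LinkFilter, HostOf and OutgoingOnly are hypothetical names, and the sketch assumes that a link is outgoing when it is absolute (http://, https:// or //) and its host, with “www.” removed, differs from the pure host of the scraped page.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the original scraper class.
public static class LinkFilter
{
    // Returns the lower-cased host of an absolute link without "www.",
    // or null for relative links and non-http schemes (mailto:, javascript:).
    public static string HostOf(string link)
    {
        string rest;
        if (link.StartsWith("http://", StringComparison.OrdinalIgnoreCase)) rest = link.Substring(7);
        else if (link.StartsWith("https://", StringComparison.OrdinalIgnoreCase)) rest = link.Substring(8);
        else if (link.StartsWith("//", StringComparison.Ordinal)) rest = link.Substring(2); // protocol-relative
        else return null;

        // the host ends at the first path, query or fragment character
        int end = rest.IndexOfAny(new char[] { '/', '?', '#' });
        if (end != -1) rest = rest.Substring(0, end);
        if (rest.StartsWith("www.", StringComparison.OrdinalIgnoreCase)) rest = rest.Substring(4);
        return rest.ToLowerInvariant();
    }

    // Keeps only links whose host differs from pureHost (e.g. "google.com").
    public static List<string> OutgoingOnly(List<string> links, string pureHost)
    {
        string own = pureHost.ToLowerInvariant();
        return links.FindAll(link =>
        {
            string host = HostOf(link);
            return host != null && host != own;
        });
    }
}
```

Inverting the condition gives the inner links instead. Relative links such as /contact have no host of their own, so in this sketch they are never treated as outgoing.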
3. Get all the valuable content from a website
The valuable website content, ideal for scraping, is located in paragraphs: <p> .... </p>. We will go through all paragraphs and remove everything except letters, numbers, spaces and the characters "'!?.
public List<string> getParagraphs()
{
    // split the page at every paragraph opening tag
    string[] pars = Regex.Split(this.privPage, "<p");
    // pars[0] is the text before the first <p, so start at 1
    for (int i = 1; i < pars.Length; i++)
    {
        // save actual browsed paragraph into string
        string par = pars[i];
        // find the end of the opening tag and the closing </p>
        int tag_end = par.IndexOf(">");
        if (tag_end == -1) continue;
        int p_end = tag_end + 1;
        int p_close = par.IndexOf("</p>", p_end);
        if (p_close == -1) continue;
        par = par.Substring(p_end, p_close - p_end);
        if (par.Length < this.privMinLenght) continue;
        // strip nested tags first, while < and > are still present
        par = this.StripTagsCharArray(par); //Regex.Replace(par, "<.*?>","");
        // remove html entities such as &nbsp; (.NET patterns take no /.../ delimiters)
        par = Regex.Replace(par, "&[^;]{1,7};", "");
        // filter all things except a-z, A-Z, 0-9, spaces and "'!?.
        par = Regex.Replace(par, "[^a-zA-Z0-9 \"'!?.]", "");
        //add paragraph into our paragraphs store
        this.pageParagraphs.Add(par);
    }
    return this.pageParagraphs;
}
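The function above calls StripTagsCharArray, which is not listed in this post. Below is a minimal sketch of what such a helper might look like, assuming it simply drops everything between < and >. The method in the downloadable class may differ, and the TagStripper wrapper class is a made-up name for illustration only.

```csharp
using System;

// Hypothetical stand-in for the StripTagsCharArray method used above.
public static class TagStripper
{
    // Copies every character that is outside of <...> into a buffer.
    public static string StripTagsCharArray(string source)
    {
        char[] buffer = new char[source.Length];
        int length = 0;
        bool insideTag = false;
        foreach (char c in source)
        {
            if (c == '<') { insideTag = true; continue; }  // a tag starts
            if (c == '>') { insideTag = false; continue; } // a tag ends
            if (!insideTag) buffer[length++] = c;
        }
        return new string(buffer, 0, length);
    }
}
```

A char array avoids running the regex engine on every paragraph, which is probably why the regex version is commented out in the original code.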
Conclusion
Today we have learned how to open a webpage, scrape valuable content and read all links. In other words, the very basics of Blackhat SEO. In the next episodes of this series, I will write about content generation, MadLib sites, link spamming and other Blackhat SEO techniques.
This series is meant for educational purposes only - if you want to be a Jedi, you also need to know the dark side.
Downloads
C# Content Scraper Class (download and run)