Common Crawl

Common Crawl is a non-profit organization headquartered in California that crawls the internet roughly once a month and makes the collected data publicly available through Amazon S3 (Litwintschik, 2017). Since 2008, the organization has gathered web pages from across the internet, archiving them so that anyone can view and analyze them (Common Crawl, n.d.-d). The Common Crawl website states that it collects these data to democratize access to web information by building and maintaining an open repository of web crawl data that can be accessed and analyzed by anyone (Common Crawl, n.d.-b).

The data are available on the Registry of Open Data on AWS, an open big data portal that provides datasets from a range of companies and organizations. The Common Crawl corpus comprises web crawl data covering more than 50 billion web pages (AWS, n.d.).
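Because the bucket is public, the corpus can be explored with standard AWS tooling. The following is a minimal Python sketch using boto3 with anonymous access, assuming the documented "commoncrawl" bucket and "crawl-data/" prefix layout; the crawl ID shown is only an example.

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous (unsigned) access -- the bucket is public, no credentials needed.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# List a few objects from one monthly crawl (the crawl ID is illustrative).
response = s3.list_objects_v2(
    Bucket="commoncrawl",
    Prefix="crawl-data/CC-MAIN-2022-05/",
    MaxKeys=10,
)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```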

Stakeholders

Common Crawl has a wide range of stakeholders, which fall into two main groups: internal and external. The internal stakeholders are the people who work within the organization, while the external stakeholders are the people who use the data. According to the Common Crawl website, the team is divided into staff and volunteers, a board of directors, and an advisory board (Common Crawl, n.d.-c). The external stakeholders are the most important group for the organization: people who want to use web data to analyze specific questions and solve real-world problems, to name a few. The organization targets people who are curious about the data as well as startups that want to analyze it (Common Crawl, n.d.-a).

Data Privacy and Quality

Offering data to the public requires that the data be free to use and of the highest possible quality. Because Common Crawl operates an open big data platform, its privacy and quality practices need to be clear to the audience so that users understand the kind of data they are dealing with. The organization wants anyone to be able to use the data, free of charge, in any domain. However, there are no specific privacy guarantees, because the dataset contains web pages' metadata and HTML code from the public internet, which anyone can already access. The organization freely provides data and metadata to the general population (Rahman, 2021). It does, however, impose some standard restrictions on the use of the data to protect people's privacy and safety: for example, users must not violate other people's rights, spam or stalk people, or harvest personally identifiable information (Common Crawl, n.d.-e).

Furthermore, the organization aims to provide high-quality data. According to Ronallo (2012), Common Crawl crawls only a subset of the Web and aims to crawl high-quality sites rather than fetching or caching junk. In most cases, therefore, the data will be clean and ready to analyze.

 

Resources

The organization manages big, complex data that come from different sources and in different formats. Data of this kind require special resources to operate, maintain, and manage, including hardware and highly skilled experts who can deal with complex systems.

Team and Technology

The organization needs two kinds of teams: a technical team and a business team. The business team is responsible for setting the organization's policies, standards, and strategies. The technical team is responsible for collecting, cleaning, storing, and publishing the data according to the business team's documents. Currently, Common Crawl has a team with diverse backgrounds and expertise, and the organization's website lists the team members with their qualifications. One of the key roles is the Crawl Engineer & Data Scientist, who is responsible for running and maintaining the crawler. The organization also has more than ten advisors from different domains, cultures, and nationalities (Common Crawl, n.d.-c).

In terms of technology, hosting petabytes of big, complex data requires advanced infrastructure. The data gathered between January 16 and 29, for example, contain 2.95 billion web pages, or 320 terabytes of uncompressed material, including page captures for 1.35 billion URLs (Nagel, 2022). According to the organization's website, the data are hosted on Amazon Web Services (AWS) Public Data Sets and on several academic cloud platforms around the world (Common Crawl, n.d.-f). However, there is no public information about the total storage capacity the data occupy or other technical details.

Build-vs-Buy Solutions for the Project

Common Crawl hosts the data on cloud services: AWS and other cloud platforms around the world (Common Crawl, n.d.-f). The buy approach is the most suitable one for the organization for several reasons. First, it is a non-profit, and AWS stores the data on its servers free of charge: Amazon states that the AWS Open Data Sponsorship Program covers the cost of storing high-value, cloud-optimized datasets that are publicly available (AWS, 2022). Second, the dataset will never stop growing, because the organization crawls the web and gains terabytes of data every month. Finally, AWS and other cloud providers can supply skills that the organization cannot offer in-house to secure, operate, and manage the servers and network.

Metadata (Types of Data in The Dataset)

The corpus contains three types of data: raw web page data, metadata extracts, and text extracts (Common Crawl, n.d.-f). The organization states that the raw data are aggregated in the Web ARChive (WARC) format, which maps directly to the crawling process; the metadata are stored in the WAT format alongside the WARC files, and the extracted text is stored in the WET format (Common Crawl, n.d.-d).
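For illustration, a WARC segment downloaded from the corpus can be iterated with the open-source warcio library. This is a minimal sketch; the file path is a placeholder for any downloaded Common Crawl segment.

```python
from warcio.archiveiterator import ArchiveIterator

with open("example.warc.gz", "rb") as stream:          # placeholder path
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":              # raw page captures
            url = record.rec_headers.get_header("WARC-Target-URI")
            html = record.content_stream().read()      # raw HTML bytes
            print(url, len(html))
```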

Examples of Statistical Analyses and Visualizations 

The Common Crawl organization publishes datasets with statistical information about its content. In this paper, we explore three of these datasets:

  1. Unique URLs, host and domain names, and top-level domains (Common Crawl, 2008–2022) 
  2. Distribution of Languages (Common Crawl, 2021–2022a) 
  3. MIME Types (Common Crawl, 2021–2022b).

In addition, Python was used to explore, clean, and prepare the data, and Tableau was used to create the visualizations. The following sections present each dataset in more depth, with a sample of the dataset, statistical information, and the most important visualizations.
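As an illustration of the exploration step, a statistics file can be loaded and inspected with pandas. The file name and column layout below are assumptions for illustration, not the exact files used in this paper.

```python
import pandas as pd

# Hypothetical export of the crawl statistics.
df = pd.read_csv("crawl_size_by_year.csv")

print(df.shape)    # e.g. (87, 5) for the first dataset
print(df.dtypes)   # check that counts were parsed as numbers
print(df.head())   # eyeball a sample before cleaning
```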

Unique URLs, Host and Domain Names, and Top-Level Domains

The first dataset consists of five columns and 87 rows, showing the number of domains, hosts, TLDs, and URLs from 2008 to 2022. The data also include the crawl file names, grouped by year and week number. Figure 1 shows a sample of the dataset, with the five columns, examples of the data types, and the crawl file names.

Figure 1: Unique URLs, host and domain names, TLD dataset sample

Table 1 shows the statistical information for these data by year. For each year from 2008 to 2022, the table presents the mean, minimum, and maximum of the number of domains, hosts, TLDs, and URLs. Because these three statistics coincide for every year in the dataset, a single value is shown per metric.

| Year | Domains | Hosts   | TLDs   | URLs     |
|------|---------|---------|--------|----------|
| 2008 | 1.5E+07 | 3.2E+07 | 1,496  | 1.79E+09 |
| 2009 | 3.1E+07 | 6.9E+07 | 4,711  | 2.30E+09 |
| 2012 | 4.1E+07 | 1.1E+08 | 4,533  | 3.60E+09 |
| 2013 | 2.7E+07 | 5.2E+07 | 7,144  | 3.75E+09 |
| 2014 | 1.3E+08 | 2.4E+08 | 31,727 | 1.46E+10 |
| 2015 | 1.4E+08 | 2.5E+08 | 40,397 | 1.64E+10 |
| 2016 | 2.1E+08 | 3.6E+08 | 40,669 | 1.65E+10 |
| 2017 | 3.0E+08 | 6.3E+08 | 57,574 | 3.68E+10 |
| 2018 | 3.8E+08 | 7.4E+08 | 61,880 | 3.64E+10 |
| 2019 | 4.0E+08 | 5.7E+08 | 60,558 | 3.20E+10 |
| 2020 | 3.2E+08 | 4.5E+08 | 44,789 | 2.56E+10 |
| 2021 | 3.2E+08 | 4.2E+08 | 44,583 | 2.62E+10 |
| 2022 | 3.5E+07 | 4.4E+07 | 4,908  | 2.97E+09 |

Table 1: Unique URLs, host and domain names, top-level domains statistical information
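For reference, a summary like Table 1 can be produced in pandas with a grouped aggregation. This is a sketch assuming a cleaned dataset with a "year" column and numeric "domain", "host", "tld", and "url" count columns; the file and column names are assumptions.

```python
import pandas as pd

df = pd.read_csv("crawl_size_by_year.csv")   # hypothetical file from above

# Mean, minimum, and maximum of each count metric per year.
table1 = df.groupby("year")[["domain", "host", "tld", "url"]].agg(
    ["mean", "min", "max"]
)
print(table1)
```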

Figure 2: New URLs in Crawl between 2008-2022

The line chart shows the number of new URLs crawled each year from 2008 to 2021. It can be seen that the number of URLs peaked in 2017 at more than 36.8 billion. The number of new URLs then decreased significantly, from about 36.4 billion in 2018 to 26.2 billion by 2021. The 2022 figures were excluded from this chart because they cover only January.
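The charts in this paper were built in Tableau; for completeness, an equivalent line chart can be sketched in Python with matplotlib, dropping 2022 because it covers January only. The file and column names are assumptions.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("crawl_size_by_year.csv")      # hypothetical file

# Exclude the partial 2022 data and total new URLs per year.
yearly = df[df["year"] < 2022].groupby("year")["url"].sum()

yearly.plot(kind="line", marker="o")
plt.title("New URLs in Crawl, 2008-2021")
plt.ylabel("New URLs")
plt.show()
```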

Distribution of Languages

The second dataset describes the distribution of languages across the crawled web pages from 2018 to 2022. It consists of five columns and 5,839 rows, representing the crawl file names, languages, number of pages, number of URLs, and percentage of each language per crawl. In total, the dataset covers 163 different languages. Figure 3 depicts a sample from the language distribution dataset, with the crawled web pages and URL language information for each year.

Figure 3: Language Distribution Dataset Sample

The following table shows the statistical information for the language distribution dataset. Because some columns are categorical, the table also reports the number of unique values and the frequency of the most common value for each categorical column.

| Statistic | crawl | Primary language | pages    | URLs     | pages_crawl |
|-----------|-------|------------------|----------|----------|-------------|
| count     | 5,805 | 5,805            | 5,805    | 5,805    | 5,805       |
| unique    | 5     | 163              | NaN      | NaN      | NaN         |
| top       | 2019  | afr              | NaN      | NaN      | NaN         |
| freq      | 1,933 | 36               | NaN      | NaN      | NaN         |
| mean      | NaN   | NaN              | 1.71E+07 | 1.69E+07 | 0.603398    |
| std       | NaN   | NaN              | 1.02E+08 | 1.02E+08 | 3.60373     |
| min       | NaN   | NaN              | 2        | 2        | 0           |
| max       | NaN   | NaN              | 1.52E+09 | 1.51E+09 | 46.456      |

Table 2: Language Distribution Dataset Statistical Information
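Table 2 resembles the output of pandas' describe(include="all"), which mixes count, unique, top, and freq for categorical columns with numeric summaries elsewhere, leaving NaN where a statistic does not apply. A sketch, assuming the language dataset is loaded from a hypothetical file:

```python
import pandas as pd

langs = pd.read_csv("languages.csv")   # hypothetical file name

# Summarize categorical and numeric columns together.
print(langs.describe(include="all"))
```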

Figure 4: Top 10 languages distribution

The bar chart in Figure 4 shows the distribution of the top 10 languages in the crawl data from 2018 to 2022. It can be seen that English dominates all other languages. In addition, 2019 was the year with the highest number of pages crawled. Note that the 2022 figures cover only January, so they show just the first portion of that year's data.

Figure 5: Language Distribution Treemaps

The treemap diagram in Figure 5 shows the distribution of all languages from 2018 to 2022. English clearly takes the largest share of the chart, followed by Russian, Chinese, and Standard German, respectively. Again, 2019 was the year with the highest number of pages crawled, and the 2022 figures cover only January. This chart also makes it possible to identify and compare the shares of the remaining languages.

MIME Types

The third dataset describes the data formats that make up the crawl data. For each format, it shows the number of pages, the number of URLs, and the share of the data per crawl file. The data cover seven content types per crawl file: other, application, audio, image, message, text, and video.

For example, for the rows that represent image content, the dataset shows the number of pages and URLs that contain images and the percentage of images relative to the other formats in each crawl file. Figure 6 shows a sample of the MIME dataset: the different formats per crawl file, with the number of pages, the number of URLs, and the percentage of each data type.

Figure 6: MIME Types Dataset Sample
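Continuing the image example above, the image rows can be isolated with a simple pandas filter. The file and column names ("type", "pages", "urls", "pages_crawl") are assumptions for illustration.

```python
import pandas as pd

mime = pd.read_csv("mime_types.csv")   # hypothetical file name

# Keep only the rows whose content type is "image".
images = mime[mime["type"] == "image"]
print(images[["crawl", "pages", "urls", "pages_crawl"]].head())
```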

The following table shows the statistical information for the dataset. It can be seen that there are 357 crawl files covering the seven data types.

| Statistic | crawl    | Data type | pages         | urls          | pages_crawl |
|-----------|----------|-----------|---------------|---------------|-------------|
| count     | 357      | 357       | 357           | 357           | 357         |
| unique    | NaN      | 7         | NaN           | NaN           | NaN         |
| top       | NaN      | image     | NaN           | NaN           | NaN         |
| freq      | NaN      | 51        | NaN           | NaN           | NaN         |
| mean      | 2019     | NaN       | 418,458,200   | 414,634,900   | 14.285711   |
| std       | 1.387597 | NaN       | 779,218,200   | 771,846,600   | 26.523468   |
| min       | 2017     | NaN       | 13,318        | 13,305        | 0.0004      |
| 25%       | 2018     | NaN       | 116,473       | 116,288       | 0.0039      |
| 50%       | 2019     | NaN       | 401,230       | 400,620       | 0.0139      |
| 75%       | 2020     | NaN       | 550,265,800   | 546,561,700   | 18.5252     |
| max       | 2022     | NaN       | 2,823,292,000 | 2,808,792,000 | 85.8146     |

Table 3: MIME Dataset Statistical Information

Figure 7 illustrates the different data formats with the number of web pages from 2017 to 2021. It can be seen that text and application data account for the largest amounts of data on the internet. Text data include HTML, plain text, CSV, and vCard files, to name a few, while application files mostly contain interactive or complex data types such as XML, JSON, ZIP, and PDF.

Figure 7: Content Types and Pages Number from 2017 to 2021

Time and Effort Required for The Study

To date, the Common Crawl organization has published more than 80 compressed files containing web data. The organization started crawling web data in 2008 (Common Crawl, n.d.-d) and aims to publish new crawl data monthly.

In addition, the organization has a capable team that manages the crawl operation. However, there is not enough public information about how long the crawl operation takes from the first step to the last, which is publishing the data on Amazon AWS.

Expected Values/Benefits to The Organization

Common Crawl does not aim for profit, because it is a non-profit organization. Its goal is to provide the data to the public for free, enriching the data community with data that was previously available only to big companies. The organization states on its website that small businesses and even individuals may now access high-quality crawl data that was previously available only to giant search engine companies, allowing them to satisfy their curiosity, analyze the world, and pursue bright ideas (Common Crawl, n.d.-a). From the author's perspective, the organization seeks to attract big data talent. It also wants people from different domains to use the data and explore the web to solve problems, find business opportunities, or conduct scientific research.

 

Definitions of Terms

AWS: Amazon Web Services is a cloud computing platform provided by Amazon for companies and organizations. AWS includes more than 100 services that can be used to store, run, maintain, and analyze data.
Crawl Data: Web crawling is used for data extraction and refers to collecting data from the world wide web or, in data crawling cases, from any document, file, etc. (Fatenaite, 2021).
CSV: A Comma-Separated Values file is a plain text file that contains a list of data and may be used to transfer data between apps (Hoffman, 2018).
Domain: The URL that users type into their browser to get to a website (Ricart, 2022).
Host: A web host allocates storage space on a web server for a website's data (Namecheap, n.d.).
HTML: HyperText Markup Language is the most fundamental Web building component, defining the meaning and structure of web content (Mozilla, 2022c).
JSON: JavaScript Object Notation is a serialization format that supports objects, arrays, numbers, strings, Booleans, and null values. It is based on JavaScript syntax, although it differs from it in that not all JavaScript is JSON (Mozilla, 2022a).
MIME: Multipurpose Internet Mail Extensions; a MIME (media) type describes the nature and format of a document, file, or collection of bytes (Mozilla, 2022d).
TLD: A top-level domain is the last section of a domain name, the component that comes after the "dot", such as com, net, or org (Bryant, 2021).
URL: A Uniform Resource Locator is the address of a single Web resource; each valid URL links to a single resource (Mozilla, 2022b).
vCard: A vCard is an electronic business (or personal) card and also the name of the industry standard for the type of communication exchanged on these cards (Contributor, 2005).
WARC: The Web ARChive format provides a mechanism for merging numerous digital resources into a single archival file with associated data (Library of Congress, 2022).
WAT: In the Common Crawl corpus, WAT files store crucial metadata about the WARC-formatted records (Common Crawl, n.d.-d).
WET: In the Common Crawl corpus, WET files store the plain text extracted from the WARC-formatted records (Common Crawl, n.d.-d).
XML: Extensible Markup Language is a markup language that uses tags to define objects. The XML file format was created to store and transport data without depending on specific software or hardware (Iqbal, 2019).
ZIP: ZIP files compress data, allowing more data to be delivered at faster speeds than before (DropBox, n.d.).

Author: Zaid Altukhi
