Common Crawl

Common Crawl is a non-profit organization headquartered in California that crawls the internet roughly once a month and makes the collected data publicly available through Amazon S3 (Litwintschik, 2017). Since 2008, the organization has gathered web pages from across the internet, archiving them so that anyone can view and analyze them (Common Crawl, n.d.-d). The Common Crawl website states that it collects these data to democratize access to web information by building and maintaining an open repository of web crawl data that can be accessed and analyzed by anyone (Common Crawl, n.d.-b).

The data are available on the Registry of Open Data on AWS, an open big data portal that provides datasets from a range of companies and organizations. The Common Crawl corpus comprises web crawl data covering more than 50 billion web pages (AWS, n.d.).
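Because the bucket is public, the corpus can be explored with standard AWS tooling. The following is a minimal Python sketch using boto3 with anonymous access, assuming the documented "commoncrawl" bucket and "crawl-data/" prefix layout; the crawl ID shown is only an example.

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous (unsigned) access -- the bucket is public, no credentials needed.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# List a few objects from one monthly crawl (the crawl ID is illustrative).
response = s3.list_objects_v2(
    Bucket="commoncrawl",
    Prefix="crawl-data/CC-MAIN-2022-05/",
    MaxKeys=10,
)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```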

Stakeholders

Common Crawl has a wide range of stakeholders, which fall into two main groups: internal and external. The internal stakeholders are the people who work within the organization, while the external stakeholders are the people who use the data. According to the Common Crawl website, the team is divided into staff and volunteers, a board of directors, and an advisory board (Common Crawl, n.d.-c). The external stakeholders are the most important group for the organization: people who want to use web data to analyze specific questions and solve real-world problems, to name a few. The organization targets people who are curious about the data as well as startups that want to analyze it (Common Crawl, n.d.-a).

Data Privacy and Quality

Offering data to the public requires that the data be free to use and of the highest possible quality. Because Common Crawl operates an open big data platform, its privacy and quality practices need to be clear to the audience so that users understand the kind of data they are dealing with. The organization wants anyone to be able to use the data, free of charge, in any domain. However, there are no specific privacy guarantees, because the dataset contains web pages' metadata and HTML code from the public internet, which anyone can already access. The organization freely provides data and metadata to the general population (Rahman, 2021). It does, however, impose some standard restrictions on the use of the data to protect people's privacy and safety: for example, users must not violate other people's rights, spam or stalk people, or harvest personally identifiable information (Common Crawl, n.d.-e).

Furthermore, the organization aims to provide high-quality data. According to Ronallo (2012), Common Crawl crawls only a subset of the Web and aims to crawl high-quality sites rather than fetching or caching junk. In most cases, therefore, the data will be clean and ready to analyze.

 

Resources

The organization manages big, complex data that come from different sources and in different formats. Data of this kind require special resources to operate, maintain, and manage, including hardware and highly skilled experts who can deal with complex systems.

Team and Technology

The organization needs two kinds of teams: a technical team and a business team. The business team is responsible for setting the organization's policies, standards, and strategies. The technical team is responsible for collecting, cleaning, storing, and publishing the data according to the business team's documents. Currently, Common Crawl has a team with diverse backgrounds and expertise, and the organization's website lists the team members with their qualifications. One of the key roles is the Crawl Engineer & Data Scientist, who is responsible for running and maintaining the crawler. The organization also has more than ten advisors from different domains, cultures, and nationalities (Common Crawl, n.d.-c).

In terms of technology, hosting petabytes of big, complex data requires advanced infrastructure. The data gathered between January 16 and 29, for example, contain 2.95 billion web pages, or 320 terabytes of uncompressed material, including page captures for 1.35 billion URLs (Nagel, 2022). According to the organization's website, the data are hosted on Amazon Web Services (AWS) Public Data Sets and on several academic cloud platforms around the world (Common Crawl, n.d.-f). However, there is no public information about the total storage capacity the data occupy or other technical details.

Build-vs-Buy Solutions for the Project

Common Crawl hosts the data on cloud services: AWS and other cloud platforms around the world (Common Crawl, n.d.-f). The buy approach is the most suitable one for the organization for several reasons. First, it is a non-profit, and AWS stores the data on its servers free of charge: Amazon states that the AWS Open Data Sponsorship Program covers the cost of storing high-value, cloud-optimized datasets that are publicly available (AWS, 2022). Second, the dataset will never stop growing, because the organization crawls the web and gains terabytes of data every month. Finally, AWS and other cloud providers can supply skills that the organization cannot offer in-house to secure, operate, and manage the servers and network.

Metadata (Types of Data in The Dataset)

The corpus contains three types of data: raw web page data, metadata extracts, and text extracts (Common Crawl, n.d.-f). The organization states that the raw data are aggregated in the Web ARChive (WARC) format, which maps directly to the crawling process; the metadata are stored in the WAT format alongside the WARC files, and the extracted text is stored in the WET format (Common Crawl, n.d.-d).
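For illustration, a WARC segment downloaded from the corpus can be iterated with the open-source warcio library. This is a minimal sketch; the file path is a placeholder for any downloaded Common Crawl segment.

```python
from warcio.archiveiterator import ArchiveIterator

with open("example.warc.gz", "rb") as stream:          # placeholder path
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":              # raw page captures
            url = record.rec_headers.get_header("WARC-Target-URI")
            html = record.content_stream().read()      # raw HTML bytes
            print(url, len(html))
```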

Examples of Statistical Analyses and Visualizations 

The Common Crawl organization publishes datasets with statistical information about its content. In this paper, we explore three of these datasets:

  1. Unique URLs, host and domain names, and top-level domains (Common Crawl, 2008–2022) 
  2. Distribution of Languages (Common Crawl, 2021–2022a) 
  3. MIME Types (Common Crawl, 2021–2022b).

In addition, Python was used to explore, clean, and prepare the data, and Tableau was used to create the visualizations. The following sections present each dataset in more depth, with a sample of the dataset, statistical information, and the most important visualizations.
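As an illustration of the exploration step, a statistics file can be loaded and inspected with pandas. The file name and column layout below are assumptions for illustration, not the exact files used in this paper.

```python
import pandas as pd

# Hypothetical export of the crawl statistics.
df = pd.read_csv("crawl_size_by_year.csv")

print(df.shape)    # e.g. (87, 5) for the first dataset
print(df.dtypes)   # check that counts were parsed as numbers
print(df.head())   # eyeball a sample before cleaning
```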

Unique URLs, Host and Domain Names, and Top-Level Domains

The first dataset consists of five columns and 87 rows, showing the number of domains, hosts, TLDs, and URLs from 2008 to 2022. The data also include the crawl file names, grouped by year and week number. Figure 1 shows a sample of the dataset, with the five columns, examples of the data types, and the crawl file names.

Figure 1: Unique URLs, host and domain names, TLD dataset sample

Table 1 shows the statistical information for these data by year. For each year from 2008 to 2022, the table presents the mean, minimum, and maximum of the number of domains, hosts, TLDs, and URLs. Because these three statistics coincide for every year in the dataset, a single value is shown per metric.

| Year | Domains | Hosts   | TLDs   | URLs     |
|------|---------|---------|--------|----------|
| 2008 | 1.5E+07 | 3.2E+07 | 1,496  | 1.79E+09 |
| 2009 | 3.1E+07 | 6.9E+07 | 4,711  | 2.30E+09 |
| 2012 | 4.1E+07 | 1.1E+08 | 4,533  | 3.60E+09 |
| 2013 | 2.7E+07 | 5.2E+07 | 7,144  | 3.75E+09 |
| 2014 | 1.3E+08 | 2.4E+08 | 31,727 | 1.46E+10 |
| 2015 | 1.4E+08 | 2.5E+08 | 40,397 | 1.64E+10 |
| 2016 | 2.1E+08 | 3.6E+08 | 40,669 | 1.65E+10 |
| 2017 | 3.0E+08 | 6.3E+08 | 57,574 | 3.68E+10 |
| 2018 | 3.8E+08 | 7.4E+08 | 61,880 | 3.64E+10 |
| 2019 | 4.0E+08 | 5.7E+08 | 60,558 | 3.20E+10 |
| 2020 | 3.2E+08 | 4.5E+08 | 44,789 | 2.56E+10 |
| 2021 | 3.2E+08 | 4.2E+08 | 44,583 | 2.62E+10 |
| 2022 | 3.5E+07 | 4.4E+07 | 4,908  | 2.97E+09 |

Table 1: Unique URLs, host and domain names, top-level domains statistical information
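For reference, a summary like Table 1 can be produced in pandas with a grouped aggregation. This is a sketch assuming a cleaned dataset with a "year" column and numeric "domain", "host", "tld", and "url" count columns; the file and column names are assumptions.

```python
import pandas as pd

df = pd.read_csv("crawl_size_by_year.csv")   # hypothetical file from above

# Mean, minimum, and maximum of each count metric per year.
table1 = df.groupby("year")[["domain", "host", "tld", "url"]].agg(
    ["mean", "min", "max"]
)
print(table1)
```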

Figure 2: New URLs in Crawl between 2008-2022

The line chart shows the number of new URLs crawled each year from 2008 to 2021. It can be seen that the number of URLs peaked in 2017 at more than 36.8 billion. The number of new URLs then decreased significantly, from about 36.4 billion in 2018 to 26.2 billion by 2021. The 2022 figures were excluded from this chart because they cover only January.
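The charts in this paper were built in Tableau; for completeness, an equivalent line chart can be sketched in Python with matplotlib, dropping 2022 because it covers January only. The file and column names are assumptions.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("crawl_size_by_year.csv")      # hypothetical file

# Exclude the partial 2022 data and total new URLs per year.
yearly = df[df["year"] < 2022].groupby("year")["url"].sum()

yearly.plot(kind="line", marker="o")
plt.title("New URLs in Crawl, 2008-2021")
plt.ylabel("New URLs")
plt.show()
```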

Distribution of Languages

The second dataset describes the distribution of languages across the crawled web pages from 2018 to 2022. It consists of five columns and 5,839 rows, representing the crawl file names, languages, number of pages, number of URLs, and percentage of each language per crawl. In total, the dataset covers 163 different languages. Figure 3 depicts a sample from the language distribution dataset, with the crawled web pages and URL language information for each year.

Figure 3: Language Distribution Dataset Sample

The following table shows the statistical information for the language distribution dataset. Because some columns are categorical, the table also reports the number of unique values and the frequency of the most common value for each categorical column.

| Statistic | crawl | Primary language | pages    | URLs     | pages_crawl |
|-----------|-------|------------------|----------|----------|-------------|
| count     | 5,805 | 5,805            | 5,805    | 5,805    | 5,805       |
| unique    | 5     | 163              | NaN      | NaN      | NaN         |
| top       | 2019  | afr              | NaN      | NaN      | NaN         |
| freq      | 1,933 | 36               | NaN      | NaN      | NaN         |
| mean      | NaN   | NaN              | 1.71E+07 | 1.69E+07 | 0.603398    |
| std       | NaN   | NaN              | 1.02E+08 | 1.02E+08 | 3.60373     |
| min       | NaN   | NaN              | 2        | 2        | 0           |
| max       | NaN   | NaN              | 1.52E+09 | 1.51E+09 | 46.456      |

Table 2: Language Distribution Dataset Statistical Information
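Table 2 resembles the output of pandas' describe(include="all"), which mixes count, unique, top, and freq for categorical columns with numeric summaries elsewhere, leaving NaN where a statistic does not apply. A sketch, assuming the language dataset is loaded from a hypothetical file:

```python
import pandas as pd

langs = pd.read_csv("languages.csv")   # hypothetical file name

# Summarize categorical and numeric columns together.
print(langs.describe(include="all"))
```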

Figure 4: Top 10 languages distribution

The bar chart in Figure 4 shows the distribution of the top 10 languages in the crawl data from 2018 to 2022. It can be seen that English dominates all other languages. In addition, 2019 was the year with the highest number of pages crawled. Note that the 2022 figures cover only January, so they show just the first portion of that year's data.

Figure 5: Language Distribution Treemaps

The treemap diagram in Figure 5 shows the distribution of all languages from 2018 to 2022. English clearly takes the largest share of the chart, followed by Russian, Chinese, and Standard German, respectively. Again, 2019 was the year with the highest number of pages crawled, and the 2022 figures cover only January. This chart also makes it possible to identify and compare the shares of the remaining languages.

MIME Types

The third dataset describes the data formats that make up the crawl data. For each format, it shows the number of pages, the number of URLs, and the share of the data per crawl file. The data cover seven content types per crawl file: other, application, audio, image, message, text, and video.

For example, for the rows that represent image content, the dataset shows the number of pages and URLs that contain images and the percentage of images relative to the other formats in each crawl file. Figure 6 shows a sample of the MIME dataset: the different formats per crawl file, with the number of pages, the number of URLs, and the percentage of each data type.

Figure 6: MIME Types Dataset Sample
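Continuing the image example above, the image rows can be isolated with a simple pandas filter. The file and column names ("type", "pages", "urls", "pages_crawl") are assumptions for illustration.

```python
import pandas as pd

mime = pd.read_csv("mime_types.csv")   # hypothetical file name

# Keep only the rows whose content type is "image".
images = mime[mime["type"] == "image"]
print(images[["crawl", "pages", "urls", "pages_crawl"]].head())
```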

The following table shows the statistical information for the dataset. It can be seen that there are 357 crawl files covering the seven data types.

| Statistic | crawl    | Data type | pages         | urls          | pages_crawl |
|-----------|----------|-----------|---------------|---------------|-------------|
| count     | 357      | 357       | 357           | 357           | 357         |
| unique    | NaN      | 7         | NaN           | NaN           | NaN         |
| top       | NaN      | image     | NaN           | NaN           | NaN         |
| freq      | NaN      | 51        | NaN           | NaN           | NaN         |
| mean      | 2019     | NaN       | 418,458,200   | 414,634,900   | 14.285711   |
| std       | 1.387597 | NaN       | 779,218,200   | 771,846,600   | 26.523468   |
| min       | 2017     | NaN       | 13,318        | 13,305        | 0.0004      |
| 25%       | 2018     | NaN       | 116,473       | 116,288       | 0.0039      |
| 50%       | 2019     | NaN       | 401,230       | 400,620       | 0.0139      |
| 75%       | 2020     | NaN       | 550,265,800   | 546,561,700   | 18.5252     |
| max       | 2022     | NaN       | 2,823,292,000 | 2,808,792,000 | 85.8146     |

Table 3: MIME Dataset Statistical Information

Figure 7 illustrates the different data formats with the number of web pages from 2017 to 2021. It can be seen that text and application data account for the largest amounts of data on the internet. Text data include HTML, plain text, CSV, and vCard files, to name a few, while application files mostly contain interactive or complex data types such as XML, JSON, ZIP, and PDF.

Figure 7: Content Types and Pages Number from 2017 to 2021

Time and Effort Required for The Study

To date, the Common Crawl organization has published more than 80 compressed files containing web data. The organization started crawling web data in 2008 (Common Crawl, n.d.-d) and aims to publish new crawl data monthly.

In addition, the organization has a capable team that manages the crawl operation. However, there is not enough public information about how long the crawl operation takes from the first step to the last, which is publishing the data on Amazon AWS.

Expected Values/Benefits to The Organization

Common Crawl does not aim for profit, because it is a non-profit organization. Its goal is to provide the data to the public for free, enriching the data community with data that was previously available only to big companies. The organization states on its website that small businesses and even individuals may now access high-quality crawl data that was previously available only to giant search engine companies, allowing them to satisfy their curiosity, analyze the world, and pursue bright ideas (Common Crawl, n.d.-a). From the author's perspective, the organization seeks to attract big data talent. It also wants people from different domains to use the data and explore the web to solve problems, find business opportunities, or conduct scientific research.

 

Definitions of Terms

AWS: Amazon Web Services is a cloud computing platform provided by Amazon for companies and organizations. AWS includes more than 100 services that can be used to store, run, maintain, and analyze data.
Crawl Data: Web crawling is used for data extraction and refers to collecting data from the world wide web or, in data crawling cases, from any document, file, etc. (Fatenaite, 2021).
CSV: A Comma-Separated Values file is a plain text file that contains a list of data and may be used to transfer data between apps (Hoffman, 2018).
Domain: The URL that users type into their browser to get to a website (Ricart, 2022).
Host: A web host allocates storage space on a web server for a website's data (Namecheap, n.d.).
HTML: HyperText Markup Language is the most fundamental Web building component, defining the meaning and structure of web content (Mozilla, 2022c).
JSON: JavaScript Object Notation is a serialization format that supports objects, arrays, numbers, strings, Booleans, and null values. It is based on JavaScript syntax, although it differs from it in that not all JavaScript is JSON (Mozilla, 2022a).
MIME: Multipurpose Internet Mail Extensions; a MIME (media) type describes the nature and format of a document, file, or collection of bytes (Mozilla, 2022d).
TLD: A top-level domain is the last section of a domain name, the component that comes after the "dot", such as com, net, or org (Bryant, 2021).
URL: A Uniform Resource Locator is the address of a single Web resource; each valid URL links to a single resource (Mozilla, 2022b).
vCard: A vCard is an electronic business (or personal) card and also the name of the industry standard for the type of communication exchanged on these cards (Contributor, 2005).
WARC: The Web ARChive format provides a mechanism for merging numerous digital resources into a single archival file with associated data (Library of Congress, 2022).
WAT: In the Common Crawl corpus, WAT files store crucial metadata about the WARC-formatted records (Common Crawl, n.d.-d).
WET: In the Common Crawl corpus, WET files store the plain text extracted from the WARC-formatted records (Common Crawl, n.d.-d).
XML: Extensible Markup Language is a markup language that uses tags to define objects. The XML file format was created to store and transport data without depending on specific software or hardware (Iqbal, 2019).
ZIP: ZIP files compress data, allowing more data to be delivered at faster speeds than before (DropBox, n.d.).

Author: Zaid Altukhi
