Sometimes you need to scrape content from a website and a fancy scraping setup would be overkill.
Maybe you only need to extract a list of items on a single page, for example.
In these cases you can just manipulate the DOM right in the Chrome developer tools.
Extract List Items From a Wikipedia Page
Let's say you need this list of baked goods in a format that's easy to consume: https://en.wikipedia.org/wiki/List_of_baked_goods
Open Chrome DevTools and copy the following into the console:
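A minimal sketch (the #mw-content-text selector targets Wikipedia's article body; it may pick up navigation lists too, so expect to tweak it):

```js
// Grab every list item in the article body, extract its text,
// and serialize the result as JSON so it's easy to copy.
const items = [...document.querySelectorAll('#mw-content-text ul li')]
  .map(li => li.textContent.trim());
console.log(JSON.stringify(items, null, 2));
```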
Now you can select the JSON output and copy it to your clipboard.
A More Complicated Example
Let's try to get a list of companies from AngelList (https://angel.co/companies?company_types[]=Startup&locations[]=1688-United+States).
This case is slightly less straightforward because we need to click 'more' at the bottom of the page to fetch more search results.
Open Chrome DevTools and copy:
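Something along these lines (a sketch: the .startup-link and .more selectors are assumptions about AngelList's markup, and the five-second delay is arbitrary):

```js
const scrape = () => {
  // AngelList appends results to the page, so each pass sees
  // every company loaded so far.
  const arr = [...document.querySelectorAll('.startup-link')]
    .map(el => el.textContent.trim());

  // Persist after every pass so a crash or disconnect loses nothing.
  window.localStorage.setItem('__companies__', JSON.stringify(arr));

  // Click 'more' to fetch the next batch of results, if the button exists.
  const more = document.querySelector('.more');
  if (more) {
    more.click();
    // Wait between requests so we don't overwhelm the site.
    setTimeout(scrape, 5000);
  }
};

scrape();
```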
You can access the results with:
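```js
JSON.parse(window.localStorage.getItem('__companies__'));
```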
Some Notes
- Chrome natively supports ES6, so we can use things like the spread operator.
- We spread `[...document.querySelectorAll(...)]` because `querySelectorAll` returns a NodeList and we want a plain old array.
- We wrap everything in a setTimeout loop so that we don't overwhelm Angel.co with requests
- We save our results in localStorage with `window.localStorage.setItem('__companies__', JSON.stringify(arr))` so that if we disconnect or the browser crashes, we can go back to Angel.co and our results will be saved.
- We must serialize data before saving it to localStorage, because localStorage can only store strings.
Scraping With Node
These examples are fun but what about scraping entire websites?
We can use node-fetch and JSDOM to do something similar.
Just like before, we're not using any fancy scraping API, we're 'just' using the DOM API. But since this is Node, we need JSDOM to emulate a browser.
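A minimal sketch, reusing the Wikipedia example from above (install node-fetch and jsdom from npm; the selector is the same assumption as before):

```js
const fetch = require('node-fetch'); // v2, which supports require()
const { JSDOM } = require('jsdom');

const main = async () => {
  const res = await fetch('https://en.wikipedia.org/wiki/List_of_baked_goods');
  const html = await res.text();

  // JSDOM parses the HTML and hands us a browser-like document.
  const { document } = new JSDOM(html).window;

  const items = [...document.querySelectorAll('#mw-content-text ul li')]
    .map(li => li.textContent.trim());
  console.log(JSON.stringify(items, null, 2));
};

main().catch(console.error);
```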
Scraping With NightmareJs
Nightmare is a browser automation library that uses Electron under the hood.
The idea is that you can spin up an Electron instance, go to a webpage, and use Nightmare methods like type and click to programmatically interact with the page.
For example, you'd do something like the following to log in to a WordPress site programmatically with Nightmare:
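A sketch with a hypothetical site URL and credentials (#user_login, #user_pass, and #wp-submit are the standard WordPress login form selectors):

```js
const Nightmare = require('nightmare');
const nightmare = Nightmare({ show: true });

nightmare
  .goto('https://example.com/wp-login.php') // hypothetical WordPress site
  .type('#user_login', 'my-username')       // hypothetical credentials
  .type('#user_pass', 'my-password')
  .click('#wp-submit')
  .wait('#wpadminbar') // the admin bar appears once we're logged in
  .end()
  .then(() => console.log('logged in'))
  .catch(err => console.error('login failed:', err));
```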
Nightmare is a fun library and might seem like 'magic' at first.
But Nightmare methods like wait, type, and click are just syntactic sugar on DOM (or virtual DOM) manipulation.
For example, here's the source for the nightmare method refresh:
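It looks roughly like this (lightly abridged):

```js
exports.refresh = function (done) {
  this.evaluate_now(function () {
    window.location.reload();
  }, done);
};
```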
In other words, it's window.location.reload wrapped in their evaluate_now method. So with Nightmare, we are spinning up an Electron instance (a browser window) and then manipulating the DOM with client-side JavaScript. Everything is the same as before, except that Nightmare exposes a clean and tidy API that we can work with.
Why Do We Need Electron?
Why is Nightmare built on electron? Why not just use Chrome?
This brings us to an interesting alternative to Nightmare: Chromeless.
Chromeless attempts to duplicate Nightmare's simple browser automation API using Chrome Canary instead of Electron.
This has a few interesting benefits, the most important of which is that Chromeless can be run on AWS Lambda. It turns out that the precompiled electron binaries are just too large to work with Lambda.
Here's the same example we started with (scraping companies from Angel.co), using Chromeless:
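A sketch using the chromeless npm package (the .startup-link selector is the same assumption as in the DevTools version):

```js
const { Chromeless } = require('chromeless');

async function run() {
  const chromeless = new Chromeless();

  const companies = await chromeless
    .goto('https://angel.co/companies?company_types[]=Startup&locations[]=1688-United+States')
    .evaluate(() =>
      // The same DOM scraping as before, now running inside headless Chrome.
      [...document.querySelectorAll('.startup-link')].map(el => el.textContent.trim())
    );

  console.log(companies);
  await chromeless.end();
}

run().catch(console.error);
```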
To run the above example, you'll need to install Chrome Canary locally. Here's the download link.
Next, start Chrome Canary headlessly by running two commands. On macOS they look something like this (the alias path depends on where Canary is installed):
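```sh
# The alias path assumes a default macOS install of Chrome Canary.
alias canary="/Applications/Google Chrome Canary.app/Contents/MacOS/Google Chrome Canary"

# --remote-debugging-port exposes the DevTools protocol that Chromeless talks to.
canary --headless --remote-debugging-port=9222 --disable-gpu
```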
Finally, install the npm package chromeless.
Posted by Soham Kamani on November 24, 2018
Almost all the information on the web exists in the form of HTML pages. The information in these pages is structured as paragraphs, headings, lists, or one of the many other HTML elements. These elements are organized in the browser as a hierarchical tree structure called the DOM (short for Document Object Model). Each element can have multiple child elements, which can also have their own children. This structure makes it convenient to extract specific information from the page.
The process of extracting this information is called 'scraping' the web, and it’s useful for a variety of applications. All search engines, for example, use web scraping to index web pages for their search results. We can also use web scraping in our own applications when we want to automate repetitive information gathering tasks.
Cheerio is a Node.js library that helps developers interpret and analyze web pages using a jQuery-like syntax. In this post, I will explain how to use Cheerio in your tech stack to scrape the web. We will use the API documentation for ButterCMS, a headless CMS, as an example and use Cheerio to extract all the API endpoint URLs from the web page.
Why Cheerio
There are many other web scraping libraries, and they run on most popular programming languages and platforms. What makes Cheerio unique, however, is its jQuery-based API.
jQuery is by far the most popular JavaScript library in use today. It's used in browser-based JavaScript applications to traverse and manipulate the DOM. For example, suppose your document has the following paragraph (a hypothetical snippet, reconstructed for illustration):
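```html
<p id="example">This is an <strong>example</strong> paragraph.</p>
```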
You could use jQuery to get the text of the paragraph:
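```js
// Select the element by its id and read its text content.
const text = $('#example').text();
console.log(text); // "This is an example paragraph."
```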
The above code uses the CSS selector #example to find the paragraph by its id, and the text method to extract the text it contains. Note that text returns the text of child elements like <strong> as well, with the tags themselves stripped out.
The jQuery API is useful because it uses standard CSS selectors to search for elements, and has a readable API to extract information from them. jQuery is, however, usable only inside the browser, and thus cannot be used for web scraping. Cheerio solves this problem by providing jQuery's functionality within the Node.js runtime, so that it can be used in server-side applications as well.
The Cheerio API
Unlike jQuery, Cheerio doesn't have access to the browser's DOM. Instead, we load the source code of the webpage we want to scrape into a Cheerio instance and query it from there.

Let's look at how we can implement the previous example using Cheerio:
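A minimal sketch of the same example with Cheerio; the HTML comes in as a string rather than from a live page:

```js
const cheerio = require('cheerio');

// Load the HTML source directly; no browser or DOM needed.
const $ = cheerio.load('<p id="example">This is an <strong>example</strong> paragraph.</p>');

// The same selector and method as the jQuery version.
console.log($('#example').text()); // "This is an example paragraph."
```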
You can find more information on the Cheerio API in the official documentation.
Scraping the ButterCMS documentation page
The ButterCMS documentation page is filled with useful information on their APIs. For our application, we just want to extract the URLs of the API endpoints.
For example, take the documentation for the API to get a single page. What we want from it is the URL:
https://api.buttercms.com/v2/pages/<page_type_slug>/<page_slug>/?auth_token=api_token_b60a008a
In order to use Cheerio to extract all the URLs documented on the page, we need to:
- Download the source code of the webpage, and load it into a Cheerio instance
- Use the Cheerio API to filter out the HTML elements containing the URLs
To get started, make sure you have Node.js installed on your system. Create an empty folder as your project directory:
mkdir cheerio-example
Next, go inside the directory and start a new node project:
npm init
## follow the instructions, which will create a package.json file in the directory
Finally, create an index.js file in the project directory; this is where our code will live.
Obtaining the website source code
We can use the axios library to download the source code of the documentation page.

While in the project directory, install the axios library:
npm install axios
We can then use axios to download the source code of the webpage:
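A minimal sketch (assuming the documentation lives at https://buttercms.com/docs/api/):

```js
const axios = require('axios');

axios
  .get('https://buttercms.com/docs/api/')
  .then(response => {
    // response.data holds the raw HTML source of the page.
    console.log(response.data);
  })
  .catch(console.error);
```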
Add the above code to index.js and run it with:
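```sh
node index.js
```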
You should then see the HTML source code printed to your console. This can be quite large! Let’s explore the source code to find patterns we can use to extract the information we want. You can use your favorite browser to view the source code. Right-click on any page and click on the 'View Page Source' option in your browser.
Extracting information from the source code
After looking at the code for the ButterCMS documentation page, it looks like all the API URLs are contained in span elements within pre elements.
We can use this pattern to extract the URLs from the source code. To get started, let's install the Cheerio library into our project:
npm install cheerio
Now, we can use the response data from earlier to create a Cheerio instance and scrape the webpage we downloaded:
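A sketch that combines the download and the scrape; the 'pre span' selector follows the pattern we found above, and the startsWith filter keeps only the API URLs:

```js
const axios = require('axios');
const cheerio = require('cheerio');

axios.get('https://buttercms.com/docs/api/').then(response => {
  // Load the downloaded source code into a Cheerio instance.
  const $ = cheerio.load(response.data);

  // Every span inside a pre element, filtered down to API URLs.
  $('pre span').each((i, el) => {
    const text = $(el).text();
    if (text.startsWith('https://api.buttercms.com')) {
      console.log(`'${text}'`);
    }
  });
});
```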
Run the above program with:
node index.js
And you should see the output:
'https://api.buttercms.com/v2/posts/?auth_token=e47fc1e1ee6cb9496247914f7da8be296a09d91b'
'https://api.buttercms.com/v2/pages/<page_type_slug>/<page_slug>/?auth_token=e47fc1e1ee6cb9496247914f7da8be296a09d91b'
'https://api.buttercms.com/v2/pages/<page_type>/?auth_token=e47fc1e1ee6cb9496247914f7da8be296a09d91b'
'https://api.buttercms.com/v2/content/?keys=homepage_headline,homepage_title&auth_token=e47fc1e1ee6cb9496247914f7da8be296a09d91b'
'https://api.buttercms.com/v2/posts/?page=1&page_size=10&auth_token=e47fc1e1ee6cb9496247914f7da8be296a09d91b'
'https://api.buttercms.com/v2/posts/<slug>/?auth_token=e47fc1e1ee6cb9496247914f7da8be296a09d91b'
'https://api.buttercms.com/v2/search/?query=my+favorite+post&auth_token=e47fc1e1ee6cb9496247914f7da8be296a09d91b'
'https://api.buttercms.com/v2/authors/?auth_token=e47fc1e1ee6cb9496247914f7da8be296a09d91b'
'https://api.buttercms.com/v2/authors/jennifer-smith/?auth_token=e47fc1e1ee6cb9496247914f7da8be296a09d91b'
'https://api.buttercms.com/v2/categories/?auth_token=e47fc1e1ee6cb9496247914f7da8be296a09d91b'
'https://api.buttercms.com/v2/categories/product-updates/?auth_token=e47fc1e1ee6cb9496247914f7da8be296a09d91b'
'https://api.buttercms.com/v2/tags/?auth_token=e47fc1e1ee6cb9496247914f7da8be296a09d91b'
'https://api.buttercms.com/v2/tags/product-updates/?auth_token=e47fc1e1ee6cb9496247914f7da8be296a09d91b'
'https://api.buttercms.com/v2/feeds/rss/?auth_token=e47fc1e1ee6cb9496247914f7da8be296a09d91b'
'https://api.buttercms.com/v2/feeds/atom/?auth_token=e47fc1e1ee6cb9496247914f7da8be296a09d91b'
'https://api.buttercms.com/v2/feeds/sitemap/?auth_token=e47fc1e1ee6cb9496247914f7da8be296a09d91b'
Conclusion
Cheerio makes it really easy for us to use the tried and tested jQuery API in a server-based environment. In fact, if you use the code we just wrote, barring the page download and loading, it would work perfectly in the browser as well. You can verify this by going to the ButterCMS documentation page and pasting the following jQuery code in the browser console:
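Something like this, assuming the page still exposes jQuery and the same pre/span structure:

```js
$('pre span').each((i, el) => {
  const text = $(el).text();
  if (text.startsWith('https://api.buttercms.com')) {
    console.log(`'${text}'`);
  }
});
```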
You'll see the same output as in the previous example.
You can even use the browser to play around with the DOM before finally writing your program with Node and Cheerio.
One important aspect to remember while web scraping is to find patterns in the elements you want to extract. For example, they could all be list items under a common ul element, or they could be rows in a table element. Inspecting the source code of a webpage is the best way to find such patterns, after which using Cheerio's API should be a piece of cake!
Soham is a full stack developer with experience in developing web applications at scale in a variety of technologies and frameworks.