Have you ever thought of all the possibilities web scraping provides and how many benefits it can unlock for your business? Surely, you have!
But you have probably also thought about the hurdles: getting blocked, dealing with sophisticated anti-bot systems, extracting JS/AJAX data, scaling up, ongoing maintenance, and the above-average skills it all seems to require. And even if you don’t give up and keep working, your efforts can be completely derailed by a change in the website’s structure. Don’t worry about that! This is a simple beginner’s guide to web scraping. We did our best to put it together so that even if you don’t have a technical background or relevant experience, you can still use it as a handbook, get all the advantages web scraping provides, and bring its juiciest features into your business.
Let’s get started!
What is web scraping?
In short, web scraping lets you extract data from websites and save it in a file on your machine, so you can work with it in a spreadsheet later on.
Usually you can only view a downloaded web page, not extract data from it. Yes, you can copy parts of it manually, but that is time-consuming and doesn’t scale. Web scraping extracts reliable data from the pages you pick, and the process is completely automated. The resulting data can then be used for business intelligence.
In other words, you can work with almost any kind of data, since web scraping handles vast quantities of data as well as many different data types.
Images, text, emails, even phone numbers – all of it can be extracted to fit your business’s needs. Some projects call for specific data – financial data, real estate data, reviews, prices, or competitor data – and web scraping tools make extracting that fast and easy as well. Best of all, you get the extracted data in a format of your choice: plain text, JSON, or CSV.
How does web scraping work?
There are certainly many ways to extract data, but here is the simplest and most reliable one. Here’s how it works.
1. Request-response
The first simple step in any web scraping program (also called a “scraper”) is to request the contents of a specific URL from the target website.
In return, the scraper gets the requested information in HTML format – the markup format used to structure all the textual information on a web page.
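To make this concrete, here is a minimal sketch of the request-response step in Python using the requests library; the URL is just a placeholder:

```python
import requests

# Hypothetical target URL; any public page works the same way.
url = "https://example.com"

# Send an HTTP GET request; the server replies with the page's HTML.
response = requests.get(url, timeout=10)
response.raise_for_status()  # fail loudly on 4xx/5xx responses

html = response.text  # the raw HTML the scraper will parse next
print(html[:200])     # peek at the first 200 characters
```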
2. Parse and extract
HTML is a markup language with a simple and clear structure. Parsing means taking code – which, to the computer, is just a bunch of text – and producing a structure in memory that the computer can understand and work with.
Sounds too difficult? Wait a second. Put simply, HTML parsing takes HTML code, inspects it, and extracts the relevant information: titles, paragraphs, headings, links, and formatting such as bold text.
For simple, flat patterns a regular expression is often all you need: a regex engine matches the pattern and extracts the text. For anything structural, though, a dedicated HTML parser is far more reliable, because HTML’s nested structure is more than a regular language can describe.
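As a sketch of both approaches in Python with BeautifulSoup: `html` could be the string fetched in the previous sketch, and a tiny inline sample keeps this runnable on its own; the price pattern is an illustrative assumption:

```python
import re
from bs4 import BeautifulSoup

# A tiny inline sample; in practice this is the fetched page's HTML.
html = """<html><head><title>Demo shop</title></head>
<body><h1>Deals</h1><a href="/widget">Widget</a> $19.99</body></html>"""

soup = BeautifulSoup(html, "html.parser")

# Structured extraction via the parse tree: title, headings, links.
title = soup.title.string if soup.title else None
headings = [h.get_text(strip=True) for h in soup.find_all(["h1", "h2"])]
links = [a["href"] for a in soup.find_all("a", href=True)]

# A regular expression still works for flat patterns inside the text,
# e.g. price-like strings such as "$19.99".
prices = re.findall(r"\$\d+(?:\.\d{2})?", html)

print(title, headings, links, prices)
```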
3. Download data
The last step is downloading and saving the data in the format of your choice (CSV, JSON, or a database). Once saved, it can be retrieved and used in other programs.
In other words, scraping lets you not just extract data but also store it in a central local database or spreadsheet and use it later whenever you need it.
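A minimal sketch of this step in Python, assuming some hypothetical rows extracted earlier:

```python
import csv
import json

# Hypothetical rows a scraper might have extracted.
rows = [
    {"name": "Widget A", "price": "19.99"},
    {"name": "Widget B", "price": "24.50"},
]

# Save as CSV for spreadsheets...
with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(rows)

# ...or as JSON for other programs.
with open("products.json", "w", encoding="utf-8") as f:
    json.dump(rows, f, indent=2)
```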
Advanced techniques for web scraping using python
Today, computer vision and machine learning are used to distinguish and scrape data from images, much the way a human would.
It works quite straightforwardly. A machine learning system assigns each of its classifications a so-called confidence score: a measure of the statistical likelihood that the classification is correct, i.e., how closely it matches the patterns discerned in the training data.
If the confidence score is too low, the system initiates a new search query to pick the chunk of text most likely to contain the previously requested data.
The system then attempts to scrape the relevant data from the new text and reconciles the result with the data from the initial scraping. If the confidence score is still too low, it moves on to the next pulled text. A minimal sketch of this loop follows.
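For illustration only, here is roughly what that loop could look like in Python; `classify` and `search_for_more_text` are hypothetical stand-ins for a trained model and a search step, not real library calls:

```python
# A minimal sketch of the confidence-score loop described above.
CONFIDENCE_THRESHOLD = 0.8  # assumed cutoff, not from any real system

def extract_with_confidence(text, classify, search_for_more_text):
    # `classify` returns (label, confidence) for a chunk of text.
    label, confidence = classify(text)
    while confidence < CONFIDENCE_THRESHOLD:
        # Confidence too low: pull the next candidate chunk of text
        # and try the classification again.
        text = search_for_more_text(text)
        if text is None:
            return None  # nothing left to try
        label, confidence = classify(text)
    return label
```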
What is web scraping used for?
There are numerous ways web scraping can be used; basically, it can be applied in every known domain. But let’s take a closer look at some areas where web scraping is considered most effective.
Price monitoring
Competitive pricing is the core strategy for e-commerce businesses. The only way to succeed here is to constantly track competitors and their pricing strategies. Parsed data can help you define your own pricing strategy, and it is much faster than manual comparison and analysis. When it comes to price monitoring, web scraping is surprisingly efficient.
Lead generation
Marketing is essential for any business. For a marketing strategy to succeed, you need not just the contact details of the parties involved but a way to reach them. That is the essence of lead generation, and web scraping can make the process more efficient.
Leads are the very first thing needed for marketing campaign acceleration.
To reach your target audience you most likely need plenty of data: phone numbers, emails, and so on. Collecting it manually across thousands of websites is, of course, impossible.
Web scraping is here to help! It extracts the data accurately and quickly, in a fraction of the time.
And since you can pick a format you’re comfortable with, the resulting data can be easily integrated into your sales tools. A minimal illustration of pulling contact data from a page follows.
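As a minimal illustration, here is how email-like strings could be pulled from a single page in Python; the URL is a placeholder, and you should check a site’s terms before collecting contact data:

```python
import re
import requests

# Placeholder URL; always check a site's ToS before collecting contacts.
page = requests.get("https://example.com/contact", timeout=10).text

# A simple (deliberately loose) email pattern.
emails = set(re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", page))
print(emails)
```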
Competitive analysis
Competition has always been the lifeblood of any business, but today it is critically important to know your competitors well. It lets you understand their strengths and weaknesses, their strategies, and how to evaluate risks more efficiently. Of course, that is only possible if you have a lot of relevant data, and web scraping helps here as well.
Any strategy starts with analysis. But how to work with the data spread everywhere? Sometimes it is even impossible to access it manually.
If it is difficult to do manually, use web scraping. You get the required data and can start working with it almost immediately.
A good point here: the faster your scraping tool, the better your competitive analysis will be.
Fetching images and product description
When a customer enters an e-commerce website, the first thing they see is the visual content: pictures, tons and tons of them. But how do you create that volume of product descriptions and pictures overnight? With web scraping, of course!
So when you decide to launch a brand-new e-commerce website, you face a content problem: all those pictures, descriptions, and so on.
The good old way – hiring somebody to copy and paste or write the content from scratch – might work, but it will take forever. Use web scraping instead and see the results.
In other words, web scraping makes your life as an e-commerce website owner much easier, right?
Is data scraping software legal?
Web scraping software works with data – it is, technically, a process of data extraction. But what if the data is protected by law or copyrighted? Naturally, one of the first questions that comes up is “Is it legal?” The issue is tricky: there is no settled opinion on this point even among lawyers. Here are a few points to consider:
- Public data can be scraped without restriction. But if you step into private data, you might land in trouble.
- Scraping in an abusive manner, or using personal data for commercial purposes, is the surest way to end up violating the CFAA, so avoid it.
- Scraping copyrighted data is illegal and, well, unethical.
- To stay on the safe side, follow the robots.txt requirements as well as the Terms of Service (ToS).
- Using an API for scraping, where one is available, is fine as well.
- Keep the crawl rate to about one request every 10-15 seconds; otherwise you can get blocked.
- Don’t hit servers too often, and don’t scrape in an aggressive manner if you want to stay safe. A minimal politeness sketch follows this list.
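Here is the minimal politeness sketch mentioned above, combining a robots.txt check with a conservative crawl rate; the domain and user-agent name are placeholders:

```python
import time
import urllib.robotparser

import requests

# Honor robots.txt before fetching anything.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder domain
rp.read()

urls = ["https://example.com/page1", "https://example.com/page2"]
for url in urls:
    if not rp.can_fetch("my-scraper-bot", url):
        continue  # robots.txt disallows this path; skip it
    response = requests.get(url, timeout=10)
    # ... parse and store response.text here ...
    time.sleep(12)  # roughly one request every 10-15 seconds
```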
Challenges in web scraping
Some aspects of web scraping are challenging, though on the whole it is relatively simple. Below is a short list of the major challenges you may face:
1. Frequent structure changes
Once the scraper is set up, the big game only begins. In other words, setting up the tool is just the first step, and you may face some unexpected challenges:
Websites keep updating their UI and features, which means the website structure changes all the time. Since the crawler is built around the existing structure, any change can upset your plans. The issue is resolved as soon as you update the crawler accordingly.
So to keep getting complete and relevant data, you have to keep adjusting your scraper every time the structure changes.
2. HoneyPot traps
Keep in mind that websites with sensitive data take precautions to protect it in one way or another, and one of those precautions is the honeypot. It means all your web scraping efforts can be quietly thwarted, leaving you surfing the web trying to figure out what went wrong this time.
- Honeypots are links that are accessible to crawlers but designed to detect them and prevent them from extracting data.
- In most cases they are links with the CSS style display:none. Other ways to hide them are to move them out of the visible area or to give them the same color as the background.
- When your crawler gets trapped, your IP gets flagged or even blocked.
- A deep directory tree is another way to detect a crawler, so the number of retrieved pages, or the traversal depth, has to be limited. A sketch of filtering out hidden honeypot links follows this list.
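Here is the filtering sketch mentioned above: it skips links styled with display:none, the classic honeypot pattern. Real sites hide links in more ways (external stylesheets, off-screen positioning), so treat this as a heuristic only:

```python
from bs4 import BeautifulSoup

# Inline sample: one visible link and one hidden honeypot link.
html = """<a href="/real">Products</a>
<a href="/trap" style="display: none">secret</a>"""

soup = BeautifulSoup(html, "html.parser")
safe_links = [
    a["href"]
    for a in soup.find_all("a", href=True)
    # Normalize whitespace so "display: none" is caught as well.
    if "display:none" not in a.get("style", "").replace(" ", "")
]
print(safe_links)  # ['/real']
```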
3. Anti-scraping technologies
Anti-scraping technologies evolve just as web scraping does, since there is plenty of data that shouldn’t be shared, and that’s fine. But if you don’t keep this in mind, you can end up blocked. Here is a short list of the most essential points to know:
- The bigger the website is, the better it protects the data and defines crawlers. For example, LinkedIn, Stubhub and Crunchbase use powerful anti-scraping technologies.
- On such websites, bot access is prevented with dynamic coding algorithms and IP-blocking mechanisms.
- Avoiding blocking is clearly a huge challenge, so a solution that works against all odds tends to become a time-consuming and rather expensive project.
4. Data quality
Getting the data is only half the job. For efficient work, the data must be clean and accurate; if it is incomplete or riddled with mistakes, it is of no use. From a business perspective, data quality is the main criterion, because at the end of the day you need data that is ready to work with.
How can I start web scraping?
We are pretty sure – the question spinning round in your head is something like “How can I start web scraping and enhance my marketing strategy?”
Coding your own
- Prefer DIY-approach? Then go on and code your own scraper.
- Open-source products are an option as well.
- A host is another essential link in the chain: it lets the scraper run around the clock.
- Robust server infrastructure is a must, and you will also need some kind of storage for the data.
- One of the greatest things about the DIY approach of coding your own scraper is that you are in absolute control of every single bit of functionality.
- The weak point is the immense amount of resources it requires.
- Don’t forget that the system needs monitoring and improvement from time to time, which also takes resources.
- Coding your own scraper might be a good option for a small, short-term project.
Web scraping tools & web scraping service
Another way to reach the same result is just to use existing tools for scraping.
- Invest a bit of time and try existing tools to find the one that meets your requirements best.
- You can harness the power of web scraping if you find a reliable, scalable, and affordable tool among those available on the market.
- There are free tools, and others with substantial trial periods; they are worth a try if you need to extract a lot of data.
- For a quick start, try ProWebScraper: it’s free and intuitive, and it lets you scrape your first 1,000 pages at no cost.
Custom solution
There’s another way, something in between the previous two.
It is simple: hire a team of developers to code a scraping tool specifically for your business’s needs.
You get a unique tool without the stress of a full DIY approach, and the total cost can be much lower than subscribing to an existing scraper long term.
Freelance developers can fit the bill too and build a good scraper on request.
To sum up
Web scraping is an extremely powerful tool for extracting data and getting additional advantages over the competitors. The earlier you start exploring, the better for your business.
There are different ways to start exploring the world of web scrapers: you can begin with free tools and move on to unique tools developed to match your needs and requirements.
NextStep 2019 was an exciting event that drew professionals from multiple countries and several sectors. One of our most popular technical sessions was on how to scrape website data. Presented by Miguel Antunes, an OutSystems MVP and Tech Lead at one of our partners, Do iT Lean, this session is available on-demand. But, if you prefer to just quickly read through the highlights…keep reading, we’ve got you covered!
As developers, we all love APIs; they make our lives that much easier. However, there are times when an API isn’t available, making it difficult to access the data we need. Thankfully, there are still ways to get the data required to build great solutions.
What Is Web Scraping?
Web scraping is the act of pulling data directly from a website by parsing the HTML from the web page itself – retrieving, or “scraping,” the data. Instead of the laborious process of extracting data by hand, web scraping uses automation to retrieve countless data points from any number of websites.
If a browser can render a page, and we can parse the HTML in a structured way, it’s safe to say we can perform web scraping to access all the data.
Benefits of Web Scraping and When to Use It
You don’t have to look far to come up with many benefits of web scraping.
- No rate-limits: Unlike with APIs, there aren’t any rate limits to web scraping. With APIs, you need to register an account to receive an API key, limiting the amount of data you’re able to collect based on the limitations of the package you buy.
- Anonymous access: Since there’s no API key, your information can’t be tracked. Only your IP address and cookies can be tracked, and even that can be worked around through spoofing, allowing you to remain perfectly anonymous while accessing the data you need.
- The data is already available: When you visit a website, the data is public and available. There are some legal concerns regarding this, but most of the time, you just need to understand the terms and conditions of the website you’re scraping, and then you can use the data from the site.
How to Web Scrape with OutSystems: Tutorial
Regardless of the language you use, there’s an excellent scraping library that’s perfectly suited to your project:
- Python: BeautifulSoup or Scrapy
- Ruby: Upton, Wombat or Nokogiri
- Node: Scraperjs or X-ray
- Go: Scrape
- Java: Jaunt
OutSystems is no exception. Its Text and HTML Processing component is designed to interpret the text from the HTML file and convert it to an HTML Document (similar to a JSON object). This makes it possible to access all the nodes.
It also extracts information from plain text data with regular expressions, or from HTML with CSS selectors. You’ll be able to manipulate HTML documents with ease while sanitizing user input against HTML injection.
But what does web scraping look like in real life? Let’s take a look at scraping an actual website.
We start with a simple plan:
- Pinpoint your target: a simple HTML website;
- Design your scraping scheme;
- Run and let the magic happen.
Scraping an Example Website
Our example website is www.bank-code.net, a site that lists all the SWIFT codes from the banking industry. There’s a ton of data here, so let’s get scraping.
This is what the website looks like:
If you wanted to collect these SWIFT codes for an internal project, copying them manually would take hours. With scraping, extracting the data takes a fraction of that time.
- Navigate to your OutSystems personal environment and start a new app (if you don't have one yet, sign up for the OutSystems free edition);
- Choose “Reactive App”;
- Fill in your app’s basic information, including its name and a description, to continue;
- Click on “Create Module”;
- Reference the library you’re going to use from the Forge component, which in this case is the “Text and HTML Processing” library;
- Go to the website and copy the URL, for example: https://bank-code.net/country/PORTUGAL-%28PT%29/100. We’re going to use Portugal as a baseline for this tutorial;
- In the OutSystems app, create a REST API to integrate with the website. It’s basically just a GET request, with the copied URL in place;
- As you may have noticed, the pagination offset is already present in the URL: it’s the “/100” part. Change it into a REST input parameter;
- Out of our set of actions, we’ll use the ones designed to work with HTML – in this case, Attributes or Elements. We send the website’s HTML text to these actions, and they return our HTML document, the one mentioned before that looks like a JSON object, where you can access all the nodes of the HTML.
Now we can create our action to scrape the website. Let’s call it “Scrape”, for example.
- Use the endpoint previously created, which will gather the HTML. We’ll parse this HTML text into our document;
- Going back to the website, in Chrome, right-click the content you’d like to scrape. Click “Inspect” and, in the panel that opens, identify the table you’d like to scrape;
- Since the table has its own ID, it will be unique across the HTML text, making it easy to identify in the text;
- Since we now have the table, we want all the rows in it. You can identify the selector for a row by expanding the HTML until you see the rows, then right-clicking one of them – Copy – Copy Selector. That gives you “#tableID > tbody > tr:nth-child(1)” for the first row; since we want all of them, we use “#tableID > tbody > tr”;
- You now have all the table-row elements. It’s time to iterate over the rows and select all the columns;
- Now select each column’s text using the HTML document and the selector from the last action, plus our column selector: “> td:nth-child(2)” selects the second column, which contains the bank name. For the other columns, you just iterate the “child(n)” node.
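For comparison, the same selector logic could be sketched in Python with requests and BeautifulSoup; note that “#tableID” is the placeholder id used in the steps above, not the table’s real id on bank-code.net:

```python
import requests
from bs4 import BeautifulSoup

# Fetch the first page of Portuguese SWIFT codes, as in the steps above.
url = "https://bank-code.net/country/PORTUGAL-%28PT%29/100"
soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")

# "#tableID" is the placeholder from the tutorial; the real id may differ.
for row in soup.select("#tableID > tbody > tr"):
    cells = row.select("td")
    if len(cells) >= 2:
        bank_name = cells[1].get_text(strip=True)  # td:nth-child(2)
        # ... read the remaining columns and upsert into the database ...
        print(bank_name)
```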
Once you have scraped all the information, check whether the code is already in your database. If it is, you just update the data; if not, you create the record. This should give you all the records from the first page of the website when you hit 1-Click Publish.
The process above is basically our tool for parsing the data from the first page. We identify the site, identify the content that we want, and identify how to get the data. This runs all the rows of the table and parses all the text from the columns, storing it in our database.
For the full code used in this example, you can go to the OutSystems Forge and download it from there.
Web Scraping Enterprise Scale: Real-Life Scenario - Frankort & Koning
So, you may think that this was a nice and simple example of scraping a website, but how can you apply it at the enterprise level? To illustrate the tool’s effectiveness at that scale, we’ll use a case study of Frankort & Koning, a company we did this for.
Frankort & Koning is a Netherlands-based fresh fruit and vegetable company. They buy products from producers and sell them on to the market. Because they trade in fresh produce, the industry is heavily regulated, and Frankort & Koning needs to check each product that they buy to resell.
Imagine how taxing it would be to check each product coming into their warehouse to make sure that all the producers and their products are certified by the relevant industry watchdog. This needs to be done multiple times per day per product.
GlobalGap has a very basic database, which it uses to give products a thirteen-digit GGN (GlobalGap Number). This number identifies the producer, allowing all the products to be tracked and their freshness verified. It helps Frankort & Koning certify that the products are suitable to be sold to their customers. Since GlobalGap doesn’t have any API to assist with this, this is where the scraping part comes in.
To work with the database as it is now, you need to enter the GGN number into the website manually. Once the information loads, there will be an expandable table at the bottom of the page. Clicking on the relevant column will provide you with the producer’s information and whether they’re certified to sell their products. Imagine doing this manually for each product that enters the Frankort & Koning warehouse. It would be totally impractical.
How Did We Perform Web Scraping for Frankort & Koning?
We identified the need for some automation here. Selenium was a great tool to set up the automation we required. Selenium automates user interactions on a website. We created an OutSystems extension with Selenium and Chrome driver.
This allowed Selenium to run Chrome instances on the server. We also needed to give Selenium some instructions on how to do the human interaction. After we took care of the human interaction aspect, we needed to parse the HTML to bring the data to our side.
The instructions Selenium needed to automate the human interaction included identifying our base URL and the 'Accept All Cookies' button, as this button popped up when opening the website. We needed to identify that button so that we could program a click on that button.
We also needed instructions for interacting with the collapse icon on the results table and the input field where the GGN number would be entered. We set all of this up to run on an OutSystems timer and ran Chrome in headless mode.
We told Selenium to go to our target website and find the cookie button and input elements. We then sent the keys – the GGN number a user would type – to the input and waited a moment for the page to render. After that, we iterated over all the results and output the HTML back to the OutSystems app.
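In Python terms, the Selenium flow described above might look roughly like this; the element ids (“accept-cookies”, “ggn-input”), the URL path, and the sample GGN are all placeholders, not the real GlobalGap markup:

```python
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")  # run Chrome without a visible window
driver = webdriver.Chrome(options=options)

driver.get("https://database.globalgap.org")  # base URL (placeholder path)

# Dismiss the cookie banner; the element id is a placeholder.
driver.find_element(By.ID, "accept-cookies").click()

# Type a GGN into the search input, as a user would; both are placeholders.
search = driver.find_element(By.ID, "ggn-input")
search.send_keys("4049929999999")  # made-up 13-digit GGN

time.sleep(5)  # crude wait for rendering; real code would use WebDriverWait

html = driver.page_source  # hand the rendered HTML back for parsing
driver.quit()
```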
This is how we tie together automation and user interaction with web scraping.
These are the numbers we worked with at Frankort & Koning:
- 700+ producers supplying products
- 160+ products provided each day
- 900+ certificates - the number of checks they needed to perform daily
- It would’ve taken about 15 hours to process this information manually
- Instead, it took only two hours to process this information automatically
This is just one example of how web scraping can contribute to bottom-line savings in an organization.
Still Got Questions?
Just drop me a line! And in the meantime, if you enjoyed my session, take a look at the NextStep 2020 conference, now available on-demand, with more than 50 sessions presented by thought leaders driving the next generation of innovation.