Writing clean and scalable code is difficult enough when we have control over our data and our inputs. Writing code for web crawlers, which may need to scrape and store a variety of data from diverse sets of websites that the programmer has no control over, often presents unique organizational challenges.
We are often asked to collect custom product attributes from a variety of websites, each with different templates and layouts. One website's h1 tag contains the title of the article, another's h1 tag contains the title of the website itself, and the article title is in <span id="title">.
We may need flexible control over which websites are scraped and how they're scraped, along with a way to add new websites or modify existing ones quickly, without writing more than a few lines of code.
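One common way to get this kind of flexibility is to separate the per-site rules from the scraping logic, so that adding a website means adding a configuration entry rather than new code. A minimal sketch, using BeautifulSoup and two hypothetical sites (the names and selectors here are illustrative, not taken from real websites):

```python
from dataclasses import dataclass
from typing import Optional

from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4


@dataclass
class Website:
    """Holds the per-site scraping rules, not the scraping logic."""
    name: str
    title_selector: str  # CSS selector locating the article title


# Hypothetical site definitions: one keeps the article title in <h1>,
# the other in <span id="title">, as described above
SITES = [
    Website('Example News', 'h1'),
    Website('Other News', 'span#title'),
]


def get_title(site: Website, html: str) -> Optional[str]:
    """Extract the article title using the site's own selector."""
    soup = BeautifulSoup(html, 'html.parser')
    element = soup.select_one(site.title_selector)
    return element.get_text(strip=True) if element else None
```

With this structure, supporting a new website is a one-line addition to `SITES`, and the scraping function itself never changes.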
We are also asked to scrape product prices from different websites, with the ultimate aim of comparing prices for the same product. Perhaps these prices are in different currencies, and perhaps we'll need to combine them with external data from some other, non-web source.
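Comparing prices across currencies means normalizing everything to a single currency first. A minimal sketch, assuming a placeholder rate table (a real crawler would pull current rates from an external source):

```python
from decimal import Decimal

# Hypothetical exchange rates: units of USD per one unit of each
# currency. Placeholder values for illustration only.
RATES_TO_USD = {
    'USD': Decimal('1.00'),
    'EUR': Decimal('1.10'),
    'GBP': Decimal('1.25'),
}


def to_usd(amount: Decimal, currency: str) -> Decimal:
    """Convert a scraped price into USD, rounded to cents."""
    return (amount * RATES_TO_USD[currency]).quantize(Decimal('0.01'))


def cheapest(listings):
    """listings: iterable of (store, amount, currency) tuples.

    Returns the listing with the lowest normalized price.
    """
    return min(listings, key=lambda p: to_usd(p[1], p[2]))
```

Using `Decimal` rather than `float` avoids the rounding surprises that binary floating point introduces when handling money.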
Planning and Defining Objects
One common trap of web scraping is defining the data that we want to collect based entirely on what's available in front of our eyes. For instance, if we want to collect product data, we may first look at a clothing store and decide that each product we scrape needs to have the following fields:
- Product name
- Fabric type
- Customer rating
Looking at another website, we find that it has SKUs (stock keeping units, used to track and order items) listed on the page. We definitely want to collect that data as well, even if it doesn’t appear on the first site! We add this field:
- Item SKU
Although clothing may be a great start, we also want to make sure we can extend this crawler to other types of products. We start perusing product sections of other websites and decide we also need to collect this information:
- Matte/Glossy print
- Number of customer reviews
- Link to manufacturer
Clearly, this is an unsustainable approach. Simply adding attributes to our product type every time we see a new piece of information on a website will lead to far too many fields to keep track of. Not only that, but every time we scrape a new website, we'll be forced to perform a detailed analysis of the fields the website has and the fields we've accumulated so far, and potentially add new fields (modifying our Python object type and our database structure). The result is a messy, difficult-to-read dataset that causes problems for anyone trying to use it.
One of the best things we can do when deciding which data to collect is often to ignore the websites altogether. We don’t start a project that’s designed to be large and scalable by looking at a single website and saying, “What exists?” but by saying, “What do I need?” and then finding ways to seek the information that we need from there.
Perhaps what we really want to do is compare product prices among multiple stores and track those product prices over time. In this case, we need enough information to uniquely identify the product, and that’s it:
- Product title
- Product ID number (if available/relevant)
It’s important to note that none of this information is specific to a particular store. For instance, product reviews, ratings, price, and even description are specific to the instance of that product at a particular store. That can be stored separately.
Other information (colors the product comes in, what it's made of) is specific to the product, but may be sparse—it's not applicable to every product. It's important to take a step back and run through a checklist for each item we consider, asking ourselves the following questions:
- Will this information help with the project goals? Will it be a roadblock if I don’t have it, or is it just “nice to have” but won’t ultimately impact anything?
- If it might help in the future, but I’m unsure, how difficult will it be to go back and collect the data at a later time?
- Is this data redundant to data I’ve already collected?
- Does it make logical sense to store the data within this particular object? (As mentioned before, storing a description in a product doesn’t make sense if that description changes from site to site for the same product.)
If we do decide that we need to collect the data, it’s important to ask a few more questions to then decide how to store and handle it in code:
- Is this data sparse or dense? Will it be relevant and populated in every listing, or just a handful out of the set?
- How large is the data?
- Especially in the case of large data, will I need to regularly retrieve it every time I run my analysis, or only on occasion?
- How variable is this type of data? Will I regularly need to add new attributes, modify types (such as fabric patterns, which may be added frequently), or is it set in stone (shoe sizes)?
Let’s say we plan to do some meta-analysis around product attributes and prices: for example, the number of pages a book has, or the type of fabric a piece of clothing is made of, and potentially other attributes in the future, correlated to price. We run through the questions and realize that this data is sparse (relatively few products have any one of the attributes), and that we may decide to add or remove attributes frequently. In this case, it may make sense to create a product type that looks like this:
- Product title
- Product ID number (if available/relevant)
- Attributes (optional list or dictionary)
And an attribute type that looks like this:
- Attribute name
- Attribute value
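The product/attribute split above can be sketched in Python with dataclasses. The field names here are illustrative, one possible rendering of the types just described:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Attribute:
    """A single sparse product attribute, e.g. ('fabric', 'linen')."""
    name: str
    value: str


@dataclass
class Product:
    title: str
    product_id: Optional[str] = None  # if available/relevant
    attributes: List[Attribute] = field(default_factory=list)

    def add_attribute(self, name: str, value: str) -> None:
        self.attributes.append(Attribute(name, value))
```

Because attributes live in a list rather than as fixed fields, adding a new kind of attribute (say, "page count" for books) requires no change to the `Product` type or the database schema.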
If we’re scraping news articles, we may want basic information such as the following:
- Article title
- Author
- Publication date
- Article body
But say some articles contain a “revision date,” or “related articles,” or a “number of social media shares.” Do we need these? Are they relevant to our project? How do we efficiently and flexibly store the number of social media shares when not all news sites use all forms of social media, and social media sites may wax or wane in popularity over time?
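One flexible answer to the social-media question is the same trick used for sparse product attributes: store share counts as a mapping from platform name to count, rather than as one field per platform. A minimal sketch (the platform names are illustrative):

```python
class Article:
    """An article with sparse, per-platform share counts."""

    def __init__(self, title: str):
        self.title = title
        # platform name -> share count; platforms that a site doesn't
        # use simply never appear, so sparse data stays compact, and
        # a newly popular platform needs no schema change
        self.shares = {}

    def record_shares(self, platform: str, count: int) -> None:
        self.shares[platform] = count

    def total_shares(self) -> int:
        return sum(self.shares.values())
```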
It can be tempting, when faced with a new project, to dive in and start writing Python to scrape websites immediately. The data model, left as an afterthought, often becomes strongly influenced by the availability and format of the data on the first website we scrape.
However, the data model is the underlying foundation of all the code that uses it. A poor decision in our model can easily lead to problems writing and maintaining code down the line, or difficulty in extracting and efficiently using the resulting data. Especially when dealing with a variety of websites—both known and unknown—it becomes vital to give serious thought and planning to what, exactly, we need to collect and how we need to store it.