Build An Amazon Product Wishlist App Part 1/2 — Backend

Create An API That Scrapes Amazon.com With ReactJS & NodeJS

Amazon Product Wishlist App

What Are We Building?

Essentially, we are building a way to query Amazon.com’s website and retrieve the product information we need, so that we can show it on our own frontend.

Use-cases

This is a very simple base for doing some interesting things with their affiliate program, or for looking up specific products in more detail to use in your own internal app.

Some use-cases might include:

  • Creating your own Amazon Product Widget that shows products based on a set of keywords associated with your content
  • Creating a comparison tool to quickly compare the details of one product with another
  • Finding similar products based on existing keywords
  • Gathering review details to make more informed decisions

Why Not Just Use The Amazon Product Advertising API?

Amazon Product Advertising API

This is a good question. Why didn’t I use the Product Advertising API?

In short, I got completely turned off by the idea that I needed to make 3 sales within 180 days to be able to use the API 🤦🏻‍♂️.

Amazon Product Advertising API Request Access

So naturally I ended up staying up all night to see if it was possible to create my own and then write this tutorial on how to do it yourself.

Requirements

As the subtitle of this tutorial might have given away, we’ll be using NodeJS for the backend and ReactJS for the frontend to interact with our API.

The main things you’ll need installed on your computer are:

  • NodeJS v12+
  • Yarn
  • Postman App (for testing our endpoints)

High Level Architecture

App Architecture

There’s quite a bit to do here, so to keep this digestible I’m splitting the backend and the frontend into separate parts.

Init Our Project

mkdir amazon-search-app;
cd amazon-search-app;
yarn init -y;
echo "node_modules/*" > .gitignore;

Installing Our Dependencies

If you haven’t already gathered, the only way that I’m going to be able to retrieve data from Amazon’s website, in the format I need and without their API, is to use a web scraper.

For this I’m going to rely on Puppeteer.

For the endpoints I’m going to use good ol’ Express, plus one other dependency, cors, to allow cross-origin requests from our frontend.

Downloading dependencies:

yarn add express puppeteer cors;
yarn add -D nodemon;

Creating Our Main API File

Next we’ll set up our main source file for NodeJS to run and add some initial endpoints.

mkdir src;
cd src;
touch index.js;

File: /src/index.js

// Imports
// ----------------------------------------
const express = require('express');
const cors = require('cors');
// Constants
// ----------------------------------------
const app = express();
const PORT = process.env.PORT || 5000;
const VERSION = process.env.VERSION || '1.0.0';
// Config
// ----------------------------------------
app.use(cors());
// Endpoints
// ----------------------------------------
app.get('/', (_req, res) => res.send({ version: VERSION }));
// Start Server
// ----------------------------------------
app.listen(PORT, () => console.log(`Listening on port ${PORT}`));

Now we’ll just make a modification to our package.json to add a yarn start script:

File: /package.json

{
  "name": "amazon-search-app",
  "version": "1.0.0",
  "main": "index.js",
  "license": "MIT",
  "scripts": {
    "start": "nodemon src/index.js"
  },
  "dependencies": {
    "cors": "^2.8.5",
    "express": "^4.17.1",
    "puppeteer": "^2.1.1"
  },
  "devDependencies": {
    "nodemon": "^2.0.2"
  }
}

Test out our server to make sure it’s working:

yarn start;

// Expected results
// [nodemon] ...
// ...
// Listening on port 5000
localhost:5000 main endpoint

Adding Search Endpoint

Things are looking good so far: you just made an endpoint that tells you a fake version number. The next step is to build out our search capability. What I want to do here is send a GET request with a query parameter containing the string I want to search for on Amazon.com’s website.

Concept

/search?q=burrito blanket
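As a quick sanity check on the concept, here is a minimal sketch of how a query string like this gets decoded. It uses Node's built-in URL class rather than Express itself, so treat it as an illustration only; Express's req.query should behave much the same for this case, but that is an assumption. The localhost URLs and the readQuery name are just examples.

```javascript
// Decode the `q` parameter with Node's built-in WHATWG URL parser.
const readQuery = (url) => new URL(url).searchParams.get('q');

// A space can be sent as %20 or as +; both decode to a plain space.
console.log(readQuery('http://localhost:5000/search?q=burrito%20blanket')); // burrito blanket
console.log(readQuery('http://localhost:5000/search?q=burrito+blanket'));   // burrito blanket

// A missing parameter comes back as null.
console.log(readQuery('http://localhost:5000/search'));                     // null
```

Either way, the value our endpoint works with is the decoded text, spaces included.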

Creating Endpoint

File: /src/index.js

// Imports
// ----------------------------------------
const express = require('express');
const cors = require('cors');
// Constants
// ----------------------------------------
const app = express();
const PORT = process.env.PORT || 5000;
const VERSION = process.env.VERSION || '1.0.0';
// Config
// ----------------------------------------
app.use(cors());
// Endpoints
// ----------------------------------------
app.get('/', (_req, res) => res.send({ version: VERSION }));
app.get('/search', async (req, res) => {
  const { q } = req.query;

  // Validate if query is empty
  if (!q || q.length === 0) {
    return res.status(422).send({
      message: 'Missing or invalid \'q\' value for search.'
    });
  }

  // [Format Amazon's Query Here]
});
// Start Server
// ----------------------------------------
app.listen(PORT, () => console.log(`Listening on port ${PORT}`));

To get an idea of how Amazon performs a search, I’m just going to go to Amazon.com, type some keywords, and then check the resulting search URL.

Amazon.com’s Search Results

You’ll notice that the words "burrito blanket" were transformed into the following format:

/s?k=burrito+blanket

This is the format we need to convert our requests to.

File: /src/index.js

...
app.get('/search', async (req, res) => {
  const { q } = req.query;

  // Validate if query is empty
  if (!q || q.length === 0) {
    return res.status(422).send({
      message: 'Missing or invalid \'q\' value for search.'
    });
  }

  // Amazon's Query Formatted (replace every run of spaces, not just the first)
  const amazonQuery = q.trim().replace(/\s+/g, '+');
});
...
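One thing to watch for when formatting the query: calling replace with a plain string pattern only swaps the first match, so a search with more than one space needs a global pattern or a split/join. Here is a minimal sketch of a standalone helper; formatAmazonQuery is a hypothetical name, and escaping each word with encodeURIComponent is my own addition for characters like & that would otherwise break the URL.

```javascript
// Hypothetical helper: turn free text into the "word+word" search format.
const formatAmazonQuery = (q) =>
  q
    .trim()
    .split(/\s+/)            // split on any run of whitespace
    .map(encodeURIComponent) // escape characters such as & or #
    .join('+');

console.log(formatAmazonQuery('burrito blanket'));         // burrito+blanket
console.log(formatAmazonQuery('  burrito  blanket  xl ')); // burrito+blanket+xl
console.log(formatAmazonQuery('salt & pepper'));           // salt+%26+pepper
```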

Creating Web Scraping Function

Next we’re going to pass this value to a function that will perform our scraping with Puppeteer.

File: /src/index.js

// Imports
// ----------------------------------------
const express = require('express');
const cors = require('cors');
const puppeteer = require('puppeteer');
// Puppeteer Request
// ----------------------------------------
const SearchAmazon = async (query) => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  await page.setViewport({ width: 1920, height: 1080 });
  await page.goto(`https://www.amazon.com/s?k=${query}`);

  const getData = await page.evaluate(() => {
    const data = [];
    const items = document.querySelector('[FIGURE-OUT]');
  });
};
// Endpoints
// ----------------------------------------
...

Finding DOM Elements In A Haystack

Alright so now we’re at a point where we’re looking through DOM elements on Amazon’s website to be able to traverse it and gather all the details that we need from the search page.

This part is not glamorous and just involves using the browser’s Developer Inspect Tool to find that one element that contains most of the items and repeats.

Almost There
Gotcha!

You’ll notice that the div has 3 class names:

s-result-list s-search-results sg-row

Let’s use this to be able to get the information we want in the browser without using NodeJS first, that way we can get the feedback quickly on what we’re retrieving.

NOTE: Even though we found this, there is a good chance that they will change the class names later on and break our scraping, which would require us to update these class names.

Why Not Xpath?

/html/body/div[1]/div[1]/div[1]/div[1]/div/span[4]/div[1]

Xpath is great, but I have a feeling that if Amazon decides to change something, there is a good chance that they are going to add a div or introduce a new sibling element somewhere, which would mess up the entire path.

I find that if you use the exact class names, or better yet IDs, then you get to the data quicker, and it’s not entirely reliant on the structure.

Finding The Image

In the browser, we’re going to use the class names we just got and then get its children to find the image for each one, starting with just the first one for now.

const items = document.querySelector('.s-result-list.s-search-results.sg-row');
Getting the main DIV that contains all the items to traverse

Next, we’ll use the .children property to get to the child elements of this main div, and then query again to get the img tag within the first child.

items.children[0].querySelector('img');

And then finally when we find the img element, get its attribute for the src of the image itself.

// Image
items.children[0].querySelector('img').getAttribute('src');
Finding The Image

Finding The Name

We’re going to use the same tactic here to find the name of the product, but dive a bit deeper to find the main element that displays the name.

Finding The Title With Dev Tools Inspector

You’ll see there’s an H2 tag that holds an a tag, which holds a span tag, which holds the title. This is pretty much the same in every item.

So we’ll use this code to get its plain text:

// Name
items.children[0].querySelector('h2 > a span').innerText
Getting The Item Name

Additionally because there’s an anchor tag, we can also get the product url with just:

// URL
items.children[0].querySelector('h2 > a').getAttribute('href');
Getting The Item URL

The only thing we’ll need to note is that it’s a relative path so we’ll need to remember to add https://www.amazon.com at the beginning of any links we build.
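As a minimal sketch of that last point, the built-in URL class (available both in Node and in the browser) can resolve a relative path against a base. The toAbsoluteUrl name and the example path below are made up for illustration.

```javascript
const AMAZON_BASE_URL = 'https://www.amazon.com';

// Resolve a relative href against the Amazon origin.
// An href that is already absolute passes through unchanged.
const toAbsoluteUrl = (href) => new URL(href, AMAZON_BASE_URL).href;

console.log(toAbsoluteUrl('/example-product/dp/EXAMPLE123'));
// https://www.amazon.com/example-product/dp/EXAMPLE123
```

Compared with plain string concatenation, this avoids doubling the domain when a link happens to be absolute already.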

Getting The Price

This one is a bit trickier because there is more than one way that Amazon displays the pricing.

Different Pricing Formatting

If you do some digging, you’ll notice that most prices are found under a span tag with the class name of a-price, BUT that’s only one case.

Price Element Containers

You’ll notice this one has a different format for showing the price and .a-price isn’t there 🤦🏻‍♂️. So what do we do?

This is where RegExp is going to help us. We’re going to use it to identify the first price in the HTML as a string.

First we’re going to get the HTML as a string with:

items.children[0].innerHTML;
// "<div class="sg-col-inner"> ...

For the RegExp we need to account for these different types of dollar amounts:

$1.58
$13.44
$10,500.23

The pattern is that it always starts with a $, then one or more digits, sometimes separated by a comma, followed by a . and then two digits. Note that the . has to be escaped as \. in the pattern, otherwise it matches any character.

// Finds all $
.match(/\$/g);
// ["$", "$", "$", "$", "$"]

// Finds all values that start with $ followed by some numbers
.match(/\$[0-9]+/g);
// ["$19", "$24", "$24", "$18"]

// Finds all values that start with $ followed by a decimal number
.match(/\$[0-9]+\.[0-9]+/g);
// ["$19.99", "$24.99", "$24.99", "$18.99"]

// For good measure account for comma values (1,000.00)
.match(/\$[0-9]+(,[0-9]+)*\.[0-9]+/g)
// ["$1,000.00"]

We just need the first value, because we’ll assume that everything after it is either an alternative price or a discounted price. Our code then becomes:

items.children[0].querySelector('.a-price > span').innerHTML.match(/\$[0-9]+(,[0-9]+)*\.[0-9]+/g)[0]

But we also want to remove the $ and any commas that appear in larger numbers. So it becomes:

// Price
items.children[0].querySelector('.a-price > span').innerHTML.match(/\$[0-9]+(,[0-9]+)*\.[0-9]+/g)[0].replace(/[\$\,]/g, '')
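To check the price pattern outside the browser, here is a minimal sketch that runs it against plain strings. The extractFirstPrice helper and the sample HTML snippets are hypothetical, and the pattern is a slightly stricter variant (exactly two decimals, three digits per comma group) than you strictly need.

```javascript
// Matches $ + digits, optional ",ddd" groups, a literal dot, and two decimals.
const PRICE_PATTERN = /\$[0-9]+(?:,[0-9]{3})*\.[0-9]{2}/;

// Returns the first price found as a plain string such as "1000.00", or null.
const extractFirstPrice = (html) => {
  const match = html.match(PRICE_PATTERN);
  return match ? match[0].replace(/[$,]/g, '') : null;
};

console.log(extractFirstPrice('<span>$19.99</span><span>$24.99</span>')); // 19.99
console.log(extractFirstPrice('<span>$10,500.23</span>'));                // 10500.23
console.log(extractFirstPrice('<span>No price here</span>'));             // null
```

Returning null when nothing matches makes the "no price" case explicit instead of throwing on [0].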

Gathering All Data

Now that we have all the data pieces, we’re going to put it together in our function:

File: /src/index.js

// Imports
// ----------------------------------------
const express = require('express');
const cors = require('cors');
const puppeteer = require('puppeteer');
...

// Puppeteer Request
// ----------------------------------------
const SearchAmazon = async (query) => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  await page.setViewport({ width: 1920, height: 1080 });
  await page.goto(`https://www.amazon.com/s?k=${query}`);

  const getData = await page.evaluate(() => {
    const data = [];
    const items = document.querySelector('.s-result-list.s-search-results.sg-row');

    for (let i = 0; i < items.children.length; i++) {
      const name = items.children[i].querySelector('h2 > a span').innerText;
      const url = items.children[i].querySelector('h2 > a').getAttribute('href');
      const image = items.children[i].querySelector('img').getAttribute('src');
      const price = items.children[i].querySelector('.a-price > span').innerHTML.match(/\$[0-9]+(,[0-9]+)*\.[0-9]+/g)[0].replace(/[\$\,]/g, '');

      data.push({
        name,
        url,
        image,
        price
      });
    }

    return data;
  });

  // Close page and browser
  await page.close();
  await browser.close();

  return getData;
};
// Endpoints
// ----------------------------------------
...
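One caveat with closing the browser at the end of the function: if anything throws before those lines run, the browser process stays open. A minimal sketch of one way around that is a try/finally wrapper. The withBrowser helper is hypothetical, and it takes the launch function as a parameter (for example () => puppeteer.launch()) so the demo below can run with a stand-in browser instead of Puppeteer.

```javascript
// Hypothetical helper: `launch` is any function that resolves to a browser,
// for example () => puppeteer.launch().
const withBrowser = async (launch, work) => {
  const browser = await launch();
  try {
    const page = await browser.newPage();
    return await work(page);
  } finally {
    // Runs on success and on failure, so the browser never lingers.
    await browser.close();
  }
};

// Demo with a stand-in browser so this runs without Puppeteer installed.
const fakeBrowser = {
  closed: false,
  newPage: async () => ({}),
  close: async function () { this.closed = true; }
};

const run = withBrowser(async () => fakeBrowser, async () => {
  throw new Error('scrape failed');
}).catch((error) => {
  console.log(error.message, fakeBrowser.closed); // scrape failed true
});
```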

Adding It To Our Search Endpoint

File: /src/index.js

...
app.get('/search', async (req, res) => {
  const { q } = req.query;

  // Validate if query is empty
  if (!q || q.length === 0) {
    return res.status(422).send({
      message: 'Missing or invalid \'q\' value for search.'
    });
  }

  // Amazon's Query Formatted (replace every run of spaces, not just the first)
  const amazonQuery = q.trim().replace(/\s+/g, '+');

  return res.send(await SearchAmazon(amazonQuery));
});
...

Testing With Postman

Let’s start our server up with yarn start, if we haven’t done so already.

Endless Loading

We’re getting an error:

(node:27296) UnhandledPromiseRejectionWarning: Error: Evaluation failed: TypeError: Cannot read property 'innerText' of null
...

This is because, while traversing the DOM, some of the elements we try to pick up may not exist, so we need to account for that.
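The endless loading itself has a second cause worth knowing about: Express 4 does not catch errors thrown inside an async route handler, so when the scrape rejects, no response is ever sent. A minimal sketch of a wrapper that turns such failures into a 500 response follows. The withErrorHandling name and the error message are my own, and the demo uses a stand-in response object so it runs without Express.

```javascript
// Hypothetical wrapper: turns a rejected promise into a 500 response.
const withErrorHandling = (handler) => async (req, res) => {
  try {
    await handler(req, res);
  } catch (error) {
    console.error(error.message);
    res.status(500).send({ message: 'Something went wrong while searching.' });
  }
};

// Demo with a stand-in response object, so this runs without Express.
const fakeRes = {
  status(code) { this.statusCode = code; return this; },
  send(body) { this.body = body; return this; }
};
const failingHandler = async () => { throw new Error('Evaluation failed'); };

const demo = withErrorHandling(failingHandler)({}, fakeRes).then(() => {
  console.log(fakeRes.statusCode, fakeRes.body.message);
  // 500 Something went wrong while searching.
});
```

In the real app this would wrap the search handler, along the lines of app.get('/search', withErrorHandling(async (req, res) => { ... })).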

Handling Errors

In order to handle the errors, we need to wrap our .querySelector in conditionals to validate if those fields exist in the first place.

Refactoring Our For Loop

File: /src/index.js

...

// Name
// const name = items.children[i].querySelector('h2 > a span').innerText;
// Becomes
const name = items.children[i].querySelector('h2 > a span');

// URL
// const url = items.children[i].querySelector('h2 > a').getAttribute('href');
// Becomes
const url = items.children[i].querySelector('h2 > a');

// Image
// const image = items.children[i].querySelector('img').getAttribute('src');
// Becomes
const image = items.children[i].querySelector('img');

// Price
// const price = items.children[i].querySelector('.a-price > span').innerHTML.match(/\$[0-9]+(,[0-9]+)*\.[0-9]+/g)[0].replace(/[\$\,]/g, '')
// Becomes
const price = items.children[i].innerHTML.match(/\$[0-9]+(,[0-9]+)*\.[0-9]+/g);

Modifying Our Array Push With Conditionals

We’ll use the logical operators && and || to check whether the fields are null, and fall back to a default value if they are.

File: /src/index.js

...
data.push({
  name: name && name.innerText || 'Unknown Name',
  url: url && url.getAttribute('href') || 'Unknown URL',
  image: image && image.getAttribute('src') || 'Unknown Image URL',
  price: price && price.length > 0 && price[0].replace(/[\$\,]/g, '') || '0.00'
});
...
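If the && and || chain looks cryptic, here is a minimal sketch of how it evaluates, using plain objects as stand-ins for DOM elements so it can run in Node. The nameOf and priceOf helpers are hypothetical names for illustration.

```javascript
// Stand-ins for what querySelector returns: an element-like object or null.
const foundName = { innerText: 'Burrito Blanket' };
const missingName = null;
const priceMatches = ['$1,000.00', '$1,200.00'];

// `a && b` yields b only when a is truthy; `|| fallback` covers every falsy result.
const nameOf = (el) => (el && el.innerText) || 'Unknown Name';
const priceOf = (matches) =>
  (matches && matches.length > 0 && matches[0].replace(/[$,]/g, '')) || '0.00';

console.log(nameOf(foundName));     // Burrito Blanket
console.log(nameOf(missingName));   // Unknown Name
console.log(priceOf(priceMatches)); // 1000.00
console.log(priceOf(null));         // 0.00
```

One side effect to keep in mind: an element that exists but has empty text also falls through to the default, because an empty string is falsy.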

Testing With Postman Again

Our API Is Working!

But wait a second, there are a few elements that aren’t showing correctly.

Stragglers

This is because there are a few DOM elements with the same class name that don’t contain any data, so we just need to account for these stragglers. In this particular case all of the data is missing, so we’ll wrap the push in an if statement that only appends an item when at least one value was found.

File: /src/index.js

...
if (name || url || image || price) {
  data.push({
    name: name && name.innerText || 'Unknown Name',
    url: url && url.getAttribute('href') || 'Unknown URL',
    image: image && image.getAttribute('src') || 'Unknown Image URL',
    price: price && price.length > 0 && price[0].replace(/[\$\,]/g, '') || '0.00'
  });
}
...
No More Stragglers

Scraping Optimizations

Next, after reading this article on Scrape Hero, I found that you could disable loading the images and the CSS to be able to optimize for load time with Puppeteer.

File: /src/index.js

...
const browser = await puppeteer.launch();
const page = await browser.newPage();

// OPTIMIZATION
await page.setRequestInterception(true);

page.on('request', (req) => {
  if (req.resourceType() === 'stylesheet' || req.resourceType() === 'font' || req.resourceType() === 'image') {
    req.abort();
  } else {
    req.continue();
  }
});
...
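The same check can be written with a Set, which makes it easier to add or remove resource types later. This is only a sketch: handleRequest and BLOCKED_RESOURCE_TYPES are hypothetical names, and the demo uses a stand-in request object so the logic can be checked without launching a browser.

```javascript
// Resource types we do not need when we only read the HTML.
const BLOCKED_RESOURCE_TYPES = new Set(['stylesheet', 'font', 'image']);

// Hypothetical handler, meant to be passed to page.on('request', ...).
const handleRequest = (req) => {
  if (BLOCKED_RESOURCE_TYPES.has(req.resourceType())) {
    return req.abort();
  }
  return req.continue();
};

// Stand-in request object that reports which method was called.
const fakeRequest = (type) => ({
  resourceType: () => type,
  abort: () => 'aborted',
  continue: () => 'continued'
});

console.log(handleRequest(fakeRequest('image')));    // aborted
console.log(handleRequest(fakeRequest('document'))); // continued
```

With Puppeteer this would be registered as page.on('request', handleRequest) after enabling request interception.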

Final Code

File: /src/index.js

// Imports
// ----------------------------------------
const express = require('express');
const cors = require('cors');
const puppeteer = require('puppeteer');

// Constants
// ----------------------------------------
const app = express();
const PORT = process.env.PORT || 5000;
const VERSION = process.env.VERSION || '1.0.0';

// Config
// ----------------------------------------
app.use(cors());
// Puppeteer Request
// ----------------------------------------
const SearchAmazon = async (query) => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Skip stylesheets, fonts, and images to speed up page loads
  await page.setRequestInterception(true);
  page.on('request', (req) => {
    if (req.resourceType() === 'stylesheet' || req.resourceType() === 'font' || req.resourceType() === 'image') {
      req.abort();
    } else {
      req.continue();
    }
  });

  await page.setViewport({ width: 1920, height: 1080 });
  await page.goto(`https://www.amazon.com/s?k=${query}`);

  const getData = await page.evaluate(() => {
    const data = [];
    const items = document.querySelector('.s-result-list.s-search-results.sg-row');

    for (let i = 0; i < items.children.length; i++) {
      const name = items.children[i].querySelector('h2 > a span');
      const url = items.children[i].querySelector('h2 > a');
      const image = items.children[i].querySelector('img');
      const price = items.children[i].innerHTML.match(/\$[0-9]+(,[0-9]+)*\.[0-9]+/g);

      // Skip stragglers that contain no data at all
      if (name || url || image || price) {
        data.push({
          name: name && name.innerText || 'Unknown Name',
          url: url && url.getAttribute('href') || 'Unknown URL',
          image: image && image.getAttribute('src') || 'Unknown Image URL',
          price: price && price.length > 0 && price[0].replace(/[\$\,]/g, '') || '0.00'
        });
      }
    }

    return data;
  });

  // Close page and browser
  await page.close();
  await browser.close();

  return getData;
};

// Endpoints
// ----------------------------------------
app.get('/', (_req, res) => res.send({ version: VERSION }));

app.get('/search', async (req, res) => {
  const { q } = req.query;

  // Validate if query is empty
  if (!q || q.length === 0) {
    return res.status(422).send({
      message: 'Missing or invalid \'q\' value for search.'
    });
  }

  // Amazon's Query Formatted (replace every run of spaces, not just the first)
  const amazonQuery = q.trim().replace(/\s+/g, '+');

  return res.send({ data: await SearchAmazon(amazonQuery) });
});

// Start Server
// ----------------------------------------
app.listen(PORT, () => console.log(`Listening on port ${PORT}`));

Where To Go From Here

You could package this API up in a Docker container by reading my NodeJS Docker Deployment Process, which sets it up in a way that lets you deploy the Docker image on a server.

You could also use a service like Proxy Bonanza to get around rate limiting.

I also recommend reading this article by Hartley Brody on How to Scrape Amazon.com: 19 Lessons I Learned While Crawling 1MM+ Product Listings. It has great insights on scraping a lot of Amazon’s data.

You should be able to find Part 2 of this project here.

If you got value from this, and/or if you think this can be improved, please let me know in the comments.

Please share it on twitter 🐦 or other social media platforms. Thanks again for reading. 🙏

Please also follow me on twitter: @codingwithmanny and instagram at @codingwithmanny.


Web Application / Full Stack JavaScript Developer & Aspiring DevOps