What Are We Building?
Essentially, what we're building is a way to ping Amazon.com's website and retrieve the product information we need so that we can show it on our own frontend.
Use-cases
This is a very simple base for doing some interesting things with Amazon's affiliate program, or for pulling more detailed information about specific products to use in your own internal app.
Some use-cases might include:
- Creating your own Amazon Product Widget that shows products based on a set of keywords associated with your content
- Creating a comparison tool to quickly compare details from one product to another
- Finding similar products based on existing keywords
- Gathering review details to make more informed decisions
Why Not Just Use The Amazon Product Advertising API?
This is a good question. Why didn’t I use the Product Advertising API?
In short, the reason is that I got completely turned off by the idea that I needed to make 3 sales within 180 days to be able to use the API 🤦🏻‍♂️.
So naturally I ended up staying up all night to see if it was possible to create my own and then write this tutorial on how to do it yourself.
Requirements
As the subtitle of this tutorial may have given away, we'll be using NodeJS for the backend and ReactJS for the frontend to interact with our API.
The main things you’ll need installed on your computer are:
- NodeJS v12+
- Yarn
- Postman App (for testing our endpoints)
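If you want to sanity-check your setup first, a quick check from the terminal might look like this (the version numbers shown are just examples):

node -v;
// v12.16.1 (anything v12 or higher works)
yarn -v;
// 1.22.4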
High Level Architecture
There's quite a bit to do here, so for the sake of making this more digestible, I'm breaking the backend and the frontend up into separate parts.
Init Our Project
mkdir amazon-search-app;
cd amazon-search-app;
yarn init -y;
echo "node_modules/*" > .gitignore;
Installing Our Dependencies
If you haven't already gathered, the only way I'm going to be able to retrieve data from Amazon's website, in the format I need and without their API, is to use a web scraper.
For this I'm going to rely on Puppeteer.
For the endpoints I'm going to use good ol' Express, plus one other dependency, cors, to allow cross-origin requests from our frontend.
Downloading dependencies:
yarn add express puppeteer cors;
yarn add -D nodemon;
Creating Our Main API File
Next, we'll set up the main source file for NodeJS to run and stub out some initial endpoints.
mkdir src;
cd src;
touch index.js;
File: /src/index.js
// Imports
// ----------------------------------------
const express = require('express');
const cors = require('cors');

// Constants
// ----------------------------------------
const app = express();
const PORT = process.env.PORT || 5000;
const VERSION = process.env.VERSION || '1.0.0';

// Config
// ----------------------------------------
app.use(cors());

// Endpoints
// ----------------------------------------
app.get('/', (_req, res) => res.send({ version: VERSION }));

// Start Server
// ----------------------------------------
app.listen(PORT, () => console.log(`Listening on port ${PORT}`));
Now we'll just make a modification to our package.json to add a yarn start script:
File: /package.json
{
  "name": "amazon-search-app",
  "version": "1.0.0",
  "main": "index.js",
  "license": "MIT",
  "scripts": {
    "start": "nodemon src/index.js"
  },
  "dependencies": {
    "cors": "^2.8.5",
    "express": "^4.17.1",
    "puppeteer": "^2.1.1"
  },
  "devDependencies": {
    "nodemon": "^2.0.2"
  }
}
Test out our server to make sure it’s working:
yarn start;

// Expected results
// [nodemon] ...
// ...
// Listening on port 5000
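You can also hit the root endpoint directly with curl; given the defaults above, you should get back the fake version number:

curl http://localhost:5000/;
// Expected response
// {"version":"1.0.0"}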
Adding Search Endpoint
Things are looking good so far: yay, you just made an endpoint that tells you a fake version number. Alright, the next step is to try and build out our search capability. What I want to do here is pass a query parameter to a GET request, containing the string I want searched on Amazon.com's website.
Concept
/search?q=burrito blanket
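Once the endpoint exists, that concept maps to a request like the one below (the space has to be URL-encoded when testing from the command line; Express decodes it back for us):

curl "http://localhost:5000/search?q=burrito%20blanket";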
Creating Endpoint
File: /src/index.js
// Imports
// ----------------------------------------
const express = require('express');
const cors = require('cors');

// Constants
// ----------------------------------------
const app = express();
const PORT = process.env.PORT || 5000;
const VERSION = process.env.VERSION || '1.0.0';

// Config
// ----------------------------------------
app.use(cors());

// Endpoints
// ----------------------------------------
app.get('/', (_req, res) => res.send({ version: VERSION }));

app.get('/search', async (req, res) => {
  const { q } = req.query;

  // Validate if query is empty
  if (!q || q.length === 0) {
    return res.status(422).send({
      message: 'Missing or invalid \'q\' value for search.'
    });
  }

  // [Format Amazon's Query Here]
});

// Start Server
// ----------------------------------------
app.listen(PORT, () => console.log(`Listening on port ${PORT}`));
To get an idea of how to perform a search on Amazon, I'm just going to go to Amazon.com, type in some keywords, and then inspect the resulting search URL.
You'll notice that the words "burrito blanket" were transformed into the following format:
/s?k=burrito+blanket
This is the format we need to convert our requests to.
File: /src/index.js
...

app.get('/search', async (req, res) => {
  const { q } = req.query;

  // Validate if query is empty
  if (!q || q.length === 0) {
    return res.status(422).send({
      message: 'Missing or invalid \'q\' value for search.'
    });
  }

  // Amazon's Query Formatted (replace all whitespace runs, not just the first space)
  const amazonQuery = q.replace(/\s+/g, '+');
});

...
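A quick aside on that replace: with a plain string argument, String.prototype.replace only swaps the first match, which is why the code above uses a regex with the g flag instead. A quick check in the Node REPL (with a made-up three-word query) shows the difference:

'red burrito blanket'.replace(' ', '+');
// 'red+burrito blanket' <- only the first space gets replaced
'red burrito blanket'.replace(/\s+/g, '+');
// 'red+burrito+blanket'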
Creating Web Scraping Function
Next we’re going to pass this value to a function that will perform our scraping with Puppeteer.
File: /src/index.js
// Imports
// ----------------------------------------
const express = require('express');
const cors = require('cors');
const puppeteer = require('puppeteer');

// Puppeteer Request
// ----------------------------------------
const SearchAmazon = async (query) => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.setViewport({ width: 1920, height: 1080 });
  await page.goto(`https://www.amazon.com/s?k=${query}`);

  const getData = await page.evaluate(() => {
    const data = [];
    const items = document.querySelector('[FIGURE-OUT]'); // selector still to be figured out below
  });
};

// Endpoints
// ----------------------------------------
...
Finding DOM Elements In A Haystack
Alright, so now we're at the point where we need to look through the DOM elements on Amazon's search page so we can traverse them and gather all the details we need.
This part is not glamorous; it just involves using the browser's developer tools (Inspect Element) to find the one element that contains all of the repeating items.
You'll notice that the div has 3 class names:
s-result-list s-search-results sg-row
Let's use this to get the information we want in the browser first, without NodeJS, so that we can get quick feedback on what we're retrieving.
NOTE: Even though we found this, there is a good chance that Amazon will change these class names later on and break our scraping, which would require us to update the class names again.
Why Not XPath?
/html/body/div[1]/div[1]/div[1]/div[1]/div/span[4]/div[1]
XPath is great, but I have a feeling that if Amazon decides to change something, there's a good chance they'll add a div or introduce a new sibling element somewhere, which would break the entire path.
I find that if you use the exact class names, or better yet IDs, you get to the data quicker, and it's not entirely reliant on the structure.
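To make that tradeoff concrete, here's a sketch of both approaches in the browser console (the XPath is the one above; both assume Amazon's markup at the time of writing):

// Brittle: any new wrapper div or sibling element shifts the indexes
document.evaluate(
  '/html/body/div[1]/div[1]/div[1]/div[1]/div/span[4]/div[1]',
  document,
  null,
  XPathResult.FIRST_ORDERED_NODE_TYPE,
  null
).singleNodeValue;

// More resilient: depends only on the class names, not the nesting
document.querySelector('.s-result-list.s-search-results.sg-row');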
Finding The Image
In the browser, we're going to use the class names we just found and then walk their children to find the image for each item; but first, just one for now.
const items = document.querySelector('.s-result-list.s-search-results.sg-row');
Next, we'll just use the .children attribute to get to the child elements of this main div, and then query again to get the img tag within it.
items.children[0].querySelector('img');
And then finally, once we find the img element, we grab its src attribute for the image itself.
// Image
items.children[0].querySelector('img').getAttribute('src');
Finding The Name
We're going to use the same tactic here to find the name of the product, just diving deeper to find the main element that displays the name.
You'll see there's an h2 tag that holds an a tag, which holds a span tag, which holds the title. This is pretty much the same in every item.
So we’ll use this code to get its plain text:
// Name
items.children[0].querySelector('h2 > a span').innerText
Additionally, because there's an anchor tag, we can also get the product URL with just:
// URL
items.children[0].querySelector('h2 > a').getAttribute('href');
The only thing we'll need to note is that it's a relative path, so we'll need to remember to prepend https://www.amazon.com to any links we build.
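As a small sketch of what that looks like (the href value here is made up for illustration):

// Hypothetical relative href scraped from a result item
const href = '/Burrito-Blanket-Soft-Throw/dp/B000000000';
// Prepend the Amazon origin to get a usable absolute link
const productUrl = `https://www.amazon.com${href}`;
// 'https://www.amazon.com/Burrito-Blanket-Soft-Throw/dp/B000000000'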
Getting The Price
This one is a bit trickier because there is more than one way that Amazon displays pricing.
If you do some digging, you'll notice that most prices are found under a span tag with the class name a-price, BUT that's only one case.
You'll notice this one has a different format for showing the price, and .a-price isn't there 🤦🏻‍♂️. So what do we do?
This is where RegExp is going to help us. We're going to use it to identify the first price in the HTML as a string.
First we’re going to get the HTML as a string with:
items.children[0].innerHTML;
// "<div class="sg-col-inner"> ...
For the RegExp, we need to account for these different types of dollar amounts:
$1.58
$13.44
$10,500.23
The pattern is that a price always starts with a $, then one or more digits, sometimes with a , comma, followed by a . and then two more digits.
// Finds all $
.match(/\$/g);
// ["$", "$", "$", "$", "$"]

// Finds all values that start with $ followed by some numbers
.match(/\$([0-9]+)/g);
// ["$19", "$24", "$24", "$18"]

// Finds all values that start with $ followed by a decimal number
.match(/\$([0-9]+)\.([0-9]+)/g);
// ["$19.99", "$24.99", "$24.99", "$18.99"]

// For good measure, account for comma values (1,000.00)
.match(/\$([0-9]+|[0-9]+,[0-9]+)\.([0-9]+)/g);
// ["$1,000.00"]
We just need the first value, because we'll assume that everything after it is either an alternative price or a discount from the original. Our code then becomes:
items.children[0].querySelector('.a-price > span').innerHTML.match(/\$([0-9]+|[0-9]+,[0-9]+)\.([0-9]+)/g)[0];
But we also want to remove any $ signs, and any commas that appear in larger numbers. So it becomes:
// Price
items.children[0].querySelector('.a-price > span').innerHTML.match(/\$([0-9]+|[0-9]+,[0-9]+)\.([0-9]+)/g)[0].replace(/[\$\,]/g, '');
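If you'd rather hand numbers than strings to whatever consumes this later, the cleaned value casts directly (assuming US-style formatting, which is all the pattern matches anyway):

const cleaned = '$1,000.00'.replace(/[\$\,]/g, '');
// '1000.00'
parseFloat(cleaned);
// 1000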
Gathering All Data
Now that we have all the data pieces, we're going to put them together in our function:
File: /src/index.js
// Imports
// ----------------------------------------
const express = require('express');
const cors = require('cors');
const puppeteer = require('puppeteer');

...

// Puppeteer Request
// ----------------------------------------
const SearchAmazon = async (query) => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.setViewport({ width: 1920, height: 1080 });
  await page.goto(`https://www.amazon.com/s?k=${query}`);

  const getData = await page.evaluate(() => {
    const data = [];
    const items = document.querySelector('.s-result-list.s-search-results.sg-row');
    for (let i = 0; i < items.children.length; i++) {
      const name = items.children[i].querySelector('h2 > a span').innerText;
      const url = items.children[i].querySelector('h2 > a').getAttribute('href');
      const image = items.children[i].querySelector('img').getAttribute('src');
      const price = items.children[i].querySelector('.a-price > span').innerHTML.match(/\$([0-9]+|[0-9]+,[0-9]+)\.([0-9]+)/g)[0].replace(/[\$\,]/g, '');
      data.push({
        name,
        url,
        image,
        price
      });
    }
    return data;
  });

  // Close page and browser
  await page.close();
  await browser.close();
  return getData;
};

// Endpoints
// ----------------------------------------
...
Adding It To Our Search Endpoint
File: /src/index.js
...

app.get('/search', async (req, res) => {
  const { q } = req.query;

  // Validate if query is empty
  if (!q || q.length === 0) {
    return res.status(422).send({
      message: 'Missing or invalid \'q\' value for search.'
    });
  }

  // Amazon's Query Formatted
  const amazonQuery = q.replace(/\s+/g, '+');
  return res.send(await SearchAmazon(amazonQuery));
});

...
Testing With Postman
Let's start our server up, if we haven't already, with yarn start.
We’re getting an error:
(node:27296) UnhandledPromiseRejectionWarning: Error: Evaluation failed: TypeError: Cannot read property 'innerText' of null
...
This is because, as we traverse the DOM, some of the elements we query for may or may not exist, so we need to account for that.
Handling Errors
In order to handle the errors, we need to wrap our .querySelector results in conditionals to validate that those fields exist in the first place.
Refactoring Our For Loop
File: /src/index.js
...

// Name
// const name = items.children[i].querySelector('h2 > a span').innerText;
// Becomes
const name = items.children[i].querySelector('h2 > a span');

// URL
// const url = items.children[i].querySelector('h2 > a').getAttribute('href');
// Becomes
const url = items.children[i].querySelector('h2 > a');

// Image
// const image = items.children[i].querySelector('img').getAttribute('src');
// Becomes
const image = items.children[i].querySelector('img');

// Price
// const price = items.children[i].querySelector('.a-price > span').innerHTML.match(/\$([0-9]+|[0-9]+,[0-9]+)\.([0-9]+)/g)[0].replace(/[\$\,]/g, '');
// Becomes (matched against the whole item's innerHTML, since .a-price isn't always present)
const price = items.children[i].innerHTML.match(/\$([0-9]+|[0-9]+,[0-9]+)\.([0-9]+)/g);
Modifying Our Array Push With Conditionals
We'll use short-circuit logical operators (&& / ||) to check whether the fields are null and fall back to a default when they are.
File: /src/index.js
...

data.push({
  name: name && name.innerText || 'Unknown Name',
  url: url && url.getAttribute('href') || 'Unknown URL',
  image: image && image.getAttribute('src') || 'Unknown Image URL',
  price: price && price.length > 0 && price[0].replace(/[\$\,]/g, '') || '0.00'
});

...
Testing With Postman Again
But wait a second, there are a few elements that aren't showing correctly.
This is because there are a few DOM elements with the same class name that don't contain any data, so we just need to account for these stragglers. In this particular case, all of the data is missing, so we'll wrap the push in an if statement that requires at least one value before appending to the array.
File: /src/index.js
...

if (name || url || image || price) {
  data.push({
    name: name && name.innerText || 'Unknown Name',
    url: url && url.getAttribute('href') || 'Unknown URL',
    image: image && image.getAttribute('src') || 'Unknown Image URL',
    price: price && price.length > 0 && price[0].replace(/[\$\,]/g, '') || '0.00'
  });
}

...
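As an aside, if your Node version supports it (14+, which is newer than this tutorial's v12 minimum, and the Chromium bundled with Puppeteer does too), optional chaining with nullish fallbacks reads a bit cleaner and behaves nearly the same way:

// Sketch only: requires optional-chaining support in both Node and the page's browser
if (name || url || image || price) {
  data.push({
    name: name?.innerText ?? 'Unknown Name',
    url: url?.getAttribute('href') ?? 'Unknown URL',
    image: image?.getAttribute('src') ?? 'Unknown Image URL',
    price: price?.[0]?.replace(/[\$\,]/g, '') ?? '0.00'
  });
}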
Scraping Optimizations
Next, after reading this article on Scrape Hero, I found that you can disable loading images, CSS, and fonts with Puppeteer to optimize load time.
File: /src/index.js
...

const browser = await puppeteer.launch();
const page = await browser.newPage();

// OPTIMIZATION
// Abort requests for stylesheets, fonts, and images to speed up page loads
await page.setRequestInterception(true);
page.on('request', (req) => {
  if (req.resourceType() == 'stylesheet' || req.resourceType() == 'font' || req.resourceType() == 'image') {
    req.abort();
  } else {
    req.continue();
  }
});

...
Final Code
File: /src/index.js
// Imports
// ----------------------------------------
const express = require('express');
const cors = require('cors');
const puppeteer = require('puppeteer');

// Constants
// ----------------------------------------
const app = express();
const PORT = process.env.PORT || 5000;
const VERSION = process.env.VERSION || '1.0.0';

// Config
// ----------------------------------------
app.use(cors());

// Puppeteer Request
// ----------------------------------------
const SearchAmazon = async (query) => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Abort requests for stylesheets, fonts, and images to speed up page loads
  await page.setRequestInterception(true);
  page.on('request', (req) => {
    if (req.resourceType() == 'stylesheet' || req.resourceType() == 'font' || req.resourceType() == 'image') {
      req.abort();
    } else {
      req.continue();
    }
  });

  await page.setViewport({ width: 1920, height: 1080 });
  await page.goto(`https://www.amazon.com/s?k=${query}`);

  const getData = await page.evaluate(() => {
    const data = [];
    const items = document.querySelector('.s-result-list.s-search-results.sg-row');
    for (let i = 0; i < items.children.length; i++) {
      const name = items.children[i].querySelector('h2 > a span');
      const url = items.children[i].querySelector('h2 > a');
      const image = items.children[i].querySelector('img');
      const price = items.children[i].innerHTML.match(/\$([0-9]+|[0-9]+,[0-9]+)\.([0-9]+)/g);

      if (name || url || image || price) {
        data.push({
          name: name && name.innerText || 'Unknown Name',
          url: url && url.getAttribute('href') || 'Unknown URL',
          image: image && image.getAttribute('src') || 'Unknown Image URL',
          price: price && price.length > 0 && price[0].replace(/[\$\,]/g, '') || '0.00'
        });
      }
    }
    return data;
  });

  // Close page and browser
  await page.close();
  await browser.close();
  return getData;
};

// Endpoints
// ----------------------------------------
app.get('/', (_req, res) => res.send({ version: VERSION }));

app.get('/search', async (req, res) => {
  const { q } = req.query;

  // Validate if query is empty
  if (!q || q.length === 0) {
    return res.status(422).send({
      message: 'Missing or invalid \'q\' value for search.'
    });
  }

  // Amazon's Query Formatted
  const amazonQuery = q.replace(/\s+/g, '+');
  return res.send({ data: await SearchAmazon(amazonQuery) });
});

// Start Server
// ----------------------------------------
app.listen(PORT, () => console.log(`Listening on port ${PORT}`));
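With the server running, a quick end-to-end check from the command line would look something like this (the response values are illustrative, not real results):

curl "http://localhost:5000/search?q=burrito%20blanket";
// Expected response shape
// { "data": [ { "name": "...", "url": "/...", "image": "https://...", "price": "19.99" }, ... ] }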
Where To Go From Here
You could package this API up in a Docker container by reading my NodeJS Docker Deployment Process, to set it up for deploying the Docker image on a server.
You could also use a service like Proxy Bonanza to get around rate limiting.
I also recommend reading this article by Hartley Brody on How to Scrape Amazon.com: 19 Lessons I Learned While Crawling 1MM+ Product Listings. It has great insights on scraping a lot of Amazon’s data.
You should be able to find Part 2 of this project here.
If you got value from this, and/or if you think this can be improved, please let me know in the comments.
Please share it on twitter 🐦 or other social media platforms. Thanks again for reading. 🙏
Please also follow me on twitter: @codingwithmanny and instagram at @codingwithmanny.