How I created a search engine - Qwarx.nc

July 1, 2022

Context

If you haven't read my previous posts, you won't get the full picture of how I decided to develop this product.

How it started

It had been about a year since I went from zero to a business directory website open to the public, browsed daily by more than a hundred users, with around 150 active contributors. I was happy to have finally released a web application to the public; news spread fast, and more people got interested every day.

However, the UX study got me thinking:

Why wouldn't we build our own Google?

Our name (Gougle.nc) was very close to our model's, and we were planning to refactor our UI to look more like Google's. We had an "internal" search engine, but we weren't a real World Wide Web search engine...

The UX redesign recommended going from one search category (business pages) to at least three: business pages, classified ads, and websites.

Of course, the websites had to be about New Caledonia only. So a straightforward way to find them was to go through the local domain name registrar domaine.nc, go over the listed domains, and query each URL for all the relevant meta information (title, description, category...). I automated this with a Python script that ran daily. It was pretty simple, but it got me thinking: "What if we could crawl those websites?"
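For the curious, here's a minimal sketch of what that daily script could have looked like, assuming the registrar exposes a plain-text list of domains (the URL and field names below are placeholders, not the actual domaine.nc endpoint):

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical source of the registrar's domain list; the real domaine.nc
# data source and response format differed.
DOMAIN_LIST_URL = "https://example-registrar.nc/domains.txt"

def fetch_domains():
    """Return the list of registered *.nc domains (assumed one per line)."""
    response = requests.get(DOMAIN_LIST_URL, timeout=10)
    response.raise_for_status()
    return [line.strip() for line in response.text.splitlines() if line.strip()]

def fetch_meta(domain):
    """Fetch a domain's home page and extract basic metadata."""
    try:
        response = requests.get(f"https://{domain}", timeout=10)
        response.raise_for_status()
    except requests.RequestException:
        return None
    soup = BeautifulSoup(response.text, "html.parser")
    description = soup.find("meta", attrs={"name": "description"})
    return {
        "domain": domain,
        "title": soup.title.string.strip() if soup.title and soup.title.string else "",
        "description": description["content"].strip() if description and description.get("content") else "",
    }

if __name__ == "__main__":
    for domain in fetch_domains():
        meta = fetch_meta(domain)
        if meta:
            print(meta)
```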

Pivoting

After consulting with my associate, we decided to pivot from our current product: we would no longer be a business directory but a full-fledged search engine. New Caledonia is a French island of 300k people, far away from mainland France and its 60M+ inhabitants; obviously, if you don't tweak the SEO of your local website well enough, it can get lost in search results coming from websites about the mainland.

These are the main reasons we chose to pivot:

  • Give local websites better exposure and a chance to have their sites fully indexed and ranked higher in our search engine, eliminating competition from irrelevant search results from abroad.
  • Create collaborations with local institutions and businesses to integrate custom results, with a code snippet to, for example, view a theater's film listings or read about the administrative procedure to renew a passport.
  • Offer visitors a way to centralize their searches on a single website instead of browsing several. The leading local websites were mostly classified ads, real estate, news, and e-commerce sites, and it wasn't uncommon for someone looking for a used car to buy, an apartment to rent, or simply some information to browse 4-5 websites until they found what they needed.
  • And, of course, showing hundreds of thousands, and potentially more than a million, different results already available on the web unlocks many business opportunities.

Qwarx release ad

How I started

When developing the business directory website Gougle.nc, I was stuck with no way to customize page metadata because I had chosen Create React App, a React toolchain used to create Single Page Applications. I hadn't been exposed to SEO before this project and was oblivious to it. So the first thing I did was switch to a React framework capable of Server Side Rendering, and I chose NextJS.

NextJS is a fantastic tool. It offers much more than Create React App, which isn't in the same category anyway (we're talking a full-stack React framework vs. a React project scaffolding tool). Having a way to generate a page on either the client or the server gave me more freedom to manage the user experience: for example, avoiding any data loading after the interface has rendered, which made the page flicker; and, of course, serving search engines a fully generated HTML page without requiring any JavaScript execution.

For hosting, I gave up AWS (which I still used for a few other things, like Lambda functions) and switched to Zeit (now called Vercel), the company behind NextJS. Things got so much easier from there; my favorite feature was probably atomic deployments. Every deployment you make, production or not, gets its own address: basically your app's address plus a hash. It was really convenient for demos and debugging. Whenever I found a performance regression, I could go over all the past deployments still running and, once I identified the first deployed version showing the problem, track down the related GitHub commit. Also, many optimizations I used to do manually (gzip compression, code minification...) were now done automatically... fantastic.

Then it was time to decide how I was going to crawl and scrape websites. I chose Scrapy, which seemed to be the de facto solution for the job. There was a bit of a learning curve at the beginning, since I didn't know Python that well, let alone Scrapy.
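To give an idea of what this looks like, here is a bare-bones Scrapy spider for a single site; the domain is a placeholder, and the real spiders extracted far more fields:

```python
import scrapy

class NcSiteSpider(scrapy.Spider):
    """Minimal spider: crawl one site and yield basic page metadata."""
    name = "nc_site"
    allowed_domains = ["example.nc"]          # keeps the crawl inside the target site
    start_urls = ["https://example.nc/"]

    def parse(self, response):
        yield {
            "url": response.url,
            "title": response.css("title::text").get(default="").strip(),
            "description": response.css('meta[name="description"]::attr(content)').get(default=""),
        }
        # Follow in-page links; Scrapy drops URLs outside allowed_domains.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```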

Scrapy hosting was done with ScrapingHub, which was a no-brainer to me. I could then schedule crawling and scraping operations at different times to ensure the search engine always had fresh data.

Algolia was still the heart of the system, but our needs were much higher this time, so we had to contact them directly for a custom solution. The previous project required around 20k pages at most, but this one could potentially reach a million pages, so we negotiated a special offer to give us the space we needed. I'm pretty sure we were their biggest client in New Caledonia.
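To show the role Algolia played, here is a hedged sketch of pushing scraped pages into an index with the Algolia Python client; the index name, credentials, and record fields are assumptions, not our actual schema:

```python
from algoliasearch.search_client import SearchClient

# Placeholder credentials and index name.
client = SearchClient.create("YOUR_APP_ID", "YOUR_ADMIN_API_KEY")
index = client.init_index("pages")

records = [
    {
        "objectID": "https://example.nc/",   # the URL doubles as a stable ID
        "url": "https://example.nc/",
        "title": "Example NC",
        "description": "A sample New Caledonian website",
        "score": 42,                         # used for custom ranking
    },
]

index.save_objects(records)
```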

Cloudinary was also still there, but this time it was used much more intensively. Before, I only had around a hundred assets loaded (mainly website assets, plus a collection of cover pictures users could choose from, and the image galleries users uploaded themselves). But now that we were scraping data from websites, we could have far more images and far more transformations.

Firebase wasn't necessary anymore since we had dropped all user-generated content. We used a single Google Sheet with 10 tabs to handle the project configuration, which worked well for my non-technical partner.

A lightweight server was used with NextJS for things like generating a sitemap on the fly, and I used AWS Lambda functions for some edge cases, like custom-made access to a website's back-end to scrape its content.
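For illustration, a hypothetical Lambda handler for that kind of back-end access could look like the Python sketch below; the partner endpoint and field names are made up, and the real functions may have used a different runtime:

```python
import json
import os
import urllib.request

# Hypothetical endpoint exposed by a partner website's back-end.
PARTNER_API_URL = os.environ.get("PARTNER_API_URL", "https://partner.example.nc/api/listings")

def handler(event, context):
    """Fetch structured content from a partner back-end and return it as JSON."""
    with urllib.request.urlopen(PARTNER_API_URL, timeout=10) as response:
        listings = json.loads(response.read())
    # Keep only the fields the search index needs.
    records = [
        {"url": item.get("url"), "title": item.get("title"), "description": item.get("summary", "")}
        for item in listings
    ]
    return {"statusCode": 200, "body": json.dumps(records)}
```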

The crawling pipeline

The search engine wouldn't do anything without a data set to search in, so I had to build an autonomous system capable of scraping the entirety of the web in New Caledonia without going out of bounds. It also had to scrape each website completely and thoroughly, without skipping essential information, but without scraping too much either. I also had to consider the performance impact on each website's servers.

If you are curious about how this was built, I finally uploaded it to my personal GitHub, free for everyone to read. Here's roughly how it was constructed:

  • The pipeline gets its "seed" websites from the *.nc domains listed on the registrar domaine.nc, and from websites manually added to our own database. The latter are usually essential websites related to New Caledonia that aren't registered under *.nc.
  • The websites most important in our eyes had a custom-made crawler written in Scrapy. This was also necessary for websites written in Angular, usually Single Page Apps that Scrapy can't read by default, so we had to put a headless browser in front.
  • Then, according to the website category (which we maintained ourselves), I triggered different types of crawlers that were less specific but not completely generic either. Websites equipped with a dynamically generated sitemap helped tremendously.
  • For any website falling outside those categories, I used a generic crawler that did its best to extract relevant information.
  • The spiders were deployed on ScrapingHub, where priorities and schedules were defined. The most essential websites (usually the main news, real estate, and classified ads sites) were crawled 2 to 3 times a day. The rest were crawled anywhere from once a day to once a week, depending on their update frequency.
  • Page scores were calculated first from Moz SEO scores, which were frequently recalculated and stored, and then from various criteria, like the number of pages on the website, its number of external links, its SEO good practices, etc.
  • Stale pages were a big problem, so there was a separate script dedicated solely to browsing all the individual links stored in Algolia, testing them with a simple HTTP request, and clearing them if they failed several times in a row, since you have to account for temporary downtime (see the sketch after this list).
  • In the early stages of the project, crawlers getting lost in infinite loops was very common, and it took months of trial and error before I could tweak them to dodge all the traps.
  • Also, we had a few cases of websites with a compromised back-end, usually filled with hundreds of thousands of links to an external Japanese e-commerce website. I'm not sure how it happened, but the common denominator among them was generally an outdated CMS. Our crawler would get stuck on them, but so did Google, since those sites weren't even indexed by the main search engines.
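Here is the stale-page sketch mentioned above: a rough approximation of the pruning logic, leaving out how the URLs are browsed from Algolia and how the failure counts are persisted between runs:

```python
import requests

FAILURE_THRESHOLD = 3  # consecutive failures before a page is removed from the index

def is_alive(url):
    """Return True if the page still responds with a non-error status."""
    try:
        response = requests.head(url, timeout=10, allow_redirects=True)
        return response.status_code < 400
    except requests.RequestException:
        return False

def prune_stale_pages(records, failure_counts):
    """records: dicts with a 'url' key (e.g. browsed from the search index).
    failure_counts: persisted mapping of url -> consecutive failures.
    Returns the URLs that should be deleted from the index."""
    to_delete = []
    for record in records:
        url = record["url"]
        if is_alive(url):
            failure_counts[url] = 0
        else:
            failure_counts[url] = failure_counts.get(url, 0) + 1
            if failure_counts[url] >= FAILURE_THRESHOLD:
                to_delete.append(url)
    return to_delete
```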

Qwarx Search Page

The search engine interface

The search engine front-end looked simple: it was only a welcome page, a search page, and a blog. However, many computations happened during a search, and performance was important to monitor. Also, the components changed dynamically according to user input. Here is roughly how it was built:

  • The welcome page loaded extremely fast, for both SEO and user experience purposes. You got a search bar, a quick glance at how many pages were indexed per category, and a news feed updated from a CMS. The search page was preloaded as soon as the welcome page was ready, so that when the user submitted a search, it loaded immediately.
  • The search page is where your search results and other rich data got loaded. Most results were given a preview image, usually from open-graph photos or the first image of interest on the webpage. All photos were served from the Cloudinary CDN: the first time an image was displayed to a user, it was resized, optimized, and added to the CDN, so subsequent loads came straight from the cache. This resulted in hundreds of thousands of images stored at any given time during the lifetime of the search engine.
  • A lot of data was recomputed on screen while typing a new search, so I optimized rendering performance heavily, especially at first, when I decided to copy Baidu and its instant search-as-you-type.
  • If the first 3 results included one with rich information, then another panel of information was displayed on the right.

The cost of SaaS

From the beginning, my philosophy had been to focus my attention on the front-end logic and the crawling logic. For everything else, I did my best to compose different software-as-a-service products so that I didn't have to worry too much about infrastructure.

However, this strategy backfired when we started getting more visits and storing more data.

  • We had a custom plan with Algolia for up to a million entries. At first, it seemed plenty; however, 6 months after our release, we crossed 900,000 records. That was still fine, since we felt we had scraped everything discoverable on the web in New Caledonia. Things got complicated when we had to sort some of our data, like prices or dates... Adding more sort types actually consumed extra entries in Algolia, so it was out of the question, at least without blowing up our budget.
  • Cloudinary was a godsend for serving highly optimized images to users, but its running cost rose dramatically when we started getting more daily users. We were generating tons of different versions for each scraped image, maybe 3-4 file formats, compression levels, sizes... I wanted the CDN to cache many variations in advance so it could always serve the ideal image. But the hosting cost ended up growing exponentially, so I decided to serve fewer versions of each image: one desktop and one mobile version, with a single optimal compression level and file format for everyone (see the sketch after this list).
  • To better understand our users, I decided to use FullStory, a platform that captures user sessions like a video recorder. It was instrumental in pinpointing where users got confused, but it wasn't cheap either and weighed on our budget.
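As a rough idea of that reduced variant set, here is a sketch using the Cloudinary Python SDK's fetch delivery; the widths and transformation parameters are illustrative, not the exact values we used:

```python
import cloudinary

cloudinary.config(cloud_name="your-cloud-name")  # placeholder configuration

def preview_urls(image_url):
    """Build the two derived versions kept per scraped image: desktop and mobile.
    The 'fetch' delivery type pulls the remote image on first request and caches it."""
    image = cloudinary.CloudinaryImage(image_url)
    common = {"type": "fetch", "crop": "limit", "quality": "auto", "fetch_format": "auto"}
    return {
        "desktop": image.build_url(width=800, **common),
        "mobile": image.build_url(width=400, **common),
    }
```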

Those technical limitations turned out to be blockers. One solution was to rewrite the search logic with Elasticsearch and host it ourselves on AWS; for Cloudinary, I could write a custom replacement using Amazon S3 and Lambda functions written in Node with the image library sharp.
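If we had gone down that road, a minimal sketch of the Elasticsearch side with the Python client might look like the following; the index name, fields, and query are assumptions:

```python
from elasticsearch import Elasticsearch

# Placeholder connection; a real deployment would point at a self-hosted cluster on AWS.
es = Elasticsearch("http://localhost:9200")

# Index a scraped page (field names are illustrative).
es.index(index="pages", id="https://example.nc/annonce/123", document={
    "url": "https://example.nc/annonce/123",
    "title": "Voiture d'occasion",
    "description": "Citadine, faible kilométrage",
    "price": 950000,
    "published_at": "2022-06-15",
})

# Full-text search sorted by price: the kind of query that consumed
# extra entries on our Algolia plan.
results = es.search(
    index="pages",
    query={"multi_match": {"query": "voiture occasion", "fields": ["title^2", "description"]}},
    sort=[{"price": {"order": "asc"}}],
)
for hit in results["hits"]["hits"]:
    print(hit["_source"]["title"], hit["_source"]["price"])
```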

Summing up

Here are some numbers about this project :

  • Around 150 unique visitors a day, peaking as high as 500 when articles about us ran in the newspapers or on local radio.
  • Almost a million pages indexed, which had never been done before in New Caledonia.
  • A production cost of around 800 USD per month, most of it going to Algolia and Cloudinary.
  • Two big websites agreed to give us a backdoor into their database, which kept our system from overloading theirs with more resource-intensive web crawling.
  • We got support from an elected official. However, we couldn't get any subsidies.