How Waymark finds all of the image elements on your website

by Ryan Geyer

At Waymark, one of our core features involves generating brand profiles from user-provided website URLs. This process includes importing images from your website so you don’t have to upload content yourself, identifying which image is most likely to be your brand’s logo, writing a summary of your brand that can be used to inform our AI-powered video generation, and more.

There is a lot of interesting stuff going on here, but in this article I am going to focus on the journey we’ve been on to extract as many quality images from a page as possible. It’s not as easy as you might think!

1. The simplest approach

Our image extraction works by opening the provided site in Puppeteer, so the very first approach I took when writing this image extraction logic was by far the most appealing at first glance: Puppeteer allows you to intercept network requests made from the page and access the media type of each request, so it’s possible to get pretty good results by simply intercepting all requests with an “image” media type, adding those URLs to a set, waiting for the page’s network requests to settle, and calling it a day. The code would look something like this:

const imageURLSet = new Set<string>();

await page.setRequestInterception(true);
page.on('request', (request: HTTPRequest) => {
  const resourceType = request.resourceType();

  if (resourceType === 'image') {
    imageURLSet.add(request.url());
  }

  // With request interception enabled, every request must be explicitly
  // continued or the page will hang waiting on it
  request.continue();
});

This is lovely! It allows the browser to do the work for us and we can just kick back and pick up the images as they roll in.

But of course, this approach has downsides:

  1. Data URLs and inline SVG elements will not show up as network requests, so if you care about getting those images, you will need to add some extra handling. SVG elements are relatively easy to query for, but data URLs may prove to be a lot harder to track down (see the sketch after this list).
  2. The image URLs that come in are entirely black-boxed; if you care about getting additional context about the images such as where they are positioned on the page, this approach will not work for you.
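
For example, covering those two cases might look something like this minimal sketch (hypothetical code, assuming inline <svg> elements and data URLs in <img> src attributes are all you care about):

// Inline SVG elements can be queried for directly
const inlineSVGElements = Array.from(document.querySelectorAll('svg'));

// Data URLs on <img> tags can be found by checking each image's src, but
// data URLs buried in CSS or srcset attributes are much harder to track down
const dataURLImages = Array.from(document.querySelectorAll('img')).filter((img) =>
  img.src.startsWith('data:'),
);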

As it turns out, Waymark needs to gather a lot of context about how image elements are displayed on the page to inform how we determine which image is most likely to be the brand’s logo, so that second downside was a big problem. I initially tried rolling with some awkward solutions where we would reverse-engineer an img element selector matching the image URL, like document.querySelector(`img[src*='${imageURLPathname}']`). But what if the image was loaded from a srcset? Or a <picture> element’s <source> tag? Or a CSS background image? The list of edge cases kept growing, and it became clear that there were just too many gaps for us to confidently provide an acceptable experience to as many users as possible.

2. BRUTE FORCE

So, we need to be able to get the elements for as many images on the page as possible. The simplest approach would be to directly query for all <img>/<svg> elements on the page and go from there, but there didn’t seem to be a good way to query for elements with a CSS background image. As such, my next approach was to throw brute force at the problem by traversing every single node in the DOM tree and checking whether it was an <img>, an <svg>, or had a background-image CSS style. This worked… but it was slow, inefficient, and simply overkill. Even worse, the code was hard to read, maintain, and debug because it all had to run in a huge monolithic script executed on the page in Puppeteer, where it becomes a lot harder to reliably log or trace errors.
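
For illustration, a stripped-down sketch of that brute-force traversal (hypothetical, not our actual production code) might look something like this, intended to run inside the page via Puppeteer’s page.evaluate:

const imageElements: Element[] = [];

// Walk every single element in the DOM tree, checking each one individually
const treeWalker = document.createTreeWalker(document.body, NodeFilter.SHOW_ELEMENT);

let currentNode: Node | null = treeWalker.currentNode;
while (currentNode) {
  const element = currentNode as Element;

  const tagName = element.tagName.toLowerCase();
  const isImageTag = tagName === 'img' || tagName === 'svg';

  // Calling getComputedStyle on every single element is a big part of
  // what made this approach so slow
  const hasBackgroundImage =
    window.getComputedStyle(element).backgroundImage !== 'none';

  if (isImageTag || hasBackgroundImage) {
    imageElements.push(element);
  }

  currentNode = treeWalker.nextNode();
}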

Overall, this did close a lot of the gaps that the previous approach had left, but it left me wanting. Eventually, an opportunity came to refactor this code during an initiative to further improve the reliability of our image extraction and logo identification, and I took it.

3. Figure out where the images are and go there directly

So, in the first approach, we tried just letting the images come to us, but this had some drawbacks by limiting the amount of information we could easily access.

In the second approach, we went in the complete opposite direction and tried desperately digging through every nook and cranny of a site looking for images.

But I have finally landed on a third approach which, in my opinion, strikes a really nice middle ground between the two. Earlier I mentioned that there didn’t seem to be a good way to query for elements with CSS background images, but that problem sat with me for a while; at the end of the day, every CSS style is applied via a selector. Theoretically, couldn’t we parse the page’s stylesheets, extract the selector(s) for any background-image styles, and query for those selectors directly? Well, yes we can!

All of the stylesheets loaded on the page, both from inlined <style> tags and external linked CSS files, can be accessed on document.styleSheets.

This will return a list of CSSStyleSheet instances, each of which should include a cssRules property that enumerates every valid style rule parsed from the stylesheet. Unfortunately, depending on how the site in question was implemented, directly accessing a stylesheet’s cssRules may throw a CORS security error if the stylesheet came from a different origin and was loaded without a crossorigin="anonymous" attribute on its <link> tag.
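
A quick, illustrative way to see which stylesheets are affected is to guard each cssRules access with a try/catch; cross-origin stylesheets loaded without that attribute will land in the catch block:

for (const styleSheet of Array.from(document.styleSheets)) {
  try {
    // This property access throws a SecurityError for cross-origin
    // stylesheets loaded without crossorigin="anonymous"
    console.log(`${styleSheet.href ?? 'inline'}: ${styleSheet.cssRules.length} rules`);
  } catch (e) {
    console.warn(`Could not access cssRules for ${styleSheet.href}`, e);
  }
}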

This led me down a ridiculous side quest: writing a script to fetch each stylesheet as a string, manually search it for every instance of a background image style rule, and then construct the selector for that rule, taking into account that nested CSS selectors are now supported in most browsers. I had a blast writing it and got it working well, but it is extremely important to remember that anytime you write code which could be described as “a fun puzzle”, you need to deeply interrogate whether it’s actually a good idea to put it in production.

Of course, there were two much simpler solutions to this problem:

  1. We control the Puppeteer browser, so we can just disable CORS errors if we want.
  2. Pretty much every modern browser supports creating constructed stylesheets with the CSSStyleSheet constructor. This exposes an API to parse a CSS stylesheet string in the exact way that browsers parse all other stylesheets, so you know it should work quickly and reliably.

I tinkered with disabling CORS, but started feeling a little uneasy as I realized that the Chromium flags required to achieve this appear to have shifted over the years; a simple "--disable-web-security" flag used to be all you needed, but more modern answers seem to indicate that you now also need to provide the "--disable-features=IsolateOrigins" and "--disable-site-isolation-trials" flags as well. I don’t think Chromium considers disabling CORS to be an important feature that people are supposed to be using (and for good reason!), so maybe we shouldn’t bank on that API not changing under our feet.
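
For reference, the CORS-disabling launch setup I tinkered with (and ultimately walked away from) would look roughly like this:

import puppeteer from 'puppeteer';

// These flags disable real security protections, and the exact set required
// has shifted across Chromium versions, so lean on them at your own risk
const browser = await puppeteer.launch({
  args: [
    '--disable-web-security',
    '--disable-features=IsolateOrigins',
    '--disable-site-isolation-trials',
  ],
});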

So, the constructed stylesheet approach started looking a lot more appealing. And it worked like a dream! Here’s what the code looks like to make this work:

// Gather the text contents of all of the stylesheets on the page into a big combined string
const combinedStylesheetString = (
  await Promise.all(
    Array.from(document.styleSheets).map(async (styleSheet) => {
      try {
        if (styleSheet.href) {
          // External stylesheets have an href which we can fetch
          return await fetch(styleSheet.href).then((response) => response.text());
        } else {
          // We can directly access the <style> tags for inlined stylesheets via styleSheet.ownerNode
          return styleSheet.ownerNode?.textContent || '';
        }
      } catch (e) {
        return '';
      }
    }),
  )
).join(' ');

const parsedStylesheet = new CSSStyleSheet();
await parsedStylesheet.replace(combinedStylesheetString);

Now that we have a CSSStyleSheet instance to work with, we can search it for any background image style rules and get the selectors for those rules.

const backgroundImages = new Array<{
  selector: string;
  imageURL: string;
}>();

// Regex matches `url("<url-here>")` styles so we can extract the URL from a background image style
const cssURLStyleRegex = /url\(['"](.*?)['"]\)/;

const rules = parsedStylesheet.cssRules;

for (const styleRule of rules) {
  if (!(styleRule instanceof CSSStyleRule)) {
    continue;
  }

  const styleDeclaration = styleRule.style;

  // `match()` returns null when there's no match, so the optional chaining below
  // leaves us with `string | undefined`; check the `background-image` property
  // first, then fall back to the `background` shorthand
  let backgroundImageURL: string | undefined =
    styleDeclaration.backgroundImage?.match(cssURLStyleRegex)?.[1];
  if (!backgroundImageURL) {
    backgroundImageURL = styleDeclaration.background?.match(cssURLStyleRegex)?.[1];
  }

  if (backgroundImageURL) {
    // If we found a background style with an image URL in it, we can get the selector for that style rule
    // on `styleRule.selectorText`
    backgroundImages.push({
      selector: styleRule.selectorText,
      imageURL: backgroundImageURL,
    });
  }
}

So now we have a list of background image URLs and their selectors, parsed from the page’s stylesheets. If we want to find all of the elements on the page with background images, all we have to do is query for those selectors!

const backgroundImagesWithElements = backgroundImages.flatMap(
  ({ selector, imageURL }) => {
    const elementsMatchingSelector = document.querySelectorAll(selector);
    return Array.from(elementsMatchingSelector).map((element) => ({
      element,
      imageURL,
    }));
  },
);

I’m very pleased with this solution. If we’re honest with ourselves, it’s still a little complex, but it’s a major improvement over our previous attempts at gathering background image elements.

At this point, you can also easily query for image tags with document.getElementsByTagName("img") and SVGs with document.getElementsByTagName("svg").
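
Combining those with the background image elements we just gathered, the final collection step can be sketched out like this (allImageElements is just a hypothetical name for the combined result):

const imageTagElements = Array.from(document.getElementsByTagName('img'));
const svgElements = Array.from(document.getElementsByTagName('svg'));

// Merge the directly-queried elements with the elements we resolved
// from the stylesheets' background image selectors
const allImageElements: Element[] = [
  ...imageTagElements,
  ...svgElements,
  ...backgroundImagesWithElements.map(({ element }) => element),
];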

I will not argue that any solution is absolutely bullet-proof, but this feels like the closest we’ve ever gotten. Of course, I have seen enough… “creatively” written websites while working on this feature to know that there will always be some site out there that defeats our image extraction. But that number is shrinking, and we’ll keep working to get it lower and lower if we can.