JavaScript SEO - Olga Zarr - Martin Splitt
===
[00:00:00] Olga: Hi, everyone. It's Olga Zarr from SEOSLY. This is a very special episode. As you can see, I have a very special guest: Martin Splitt from Google. Martin, how are you doing today?
[00:00:16] Intro: SEOSLY, SEO done right. I'm Olga Zarr, an SEO consultant, in SEO since 2012. Don't forget to subscribe to learn SEO for free with me. Now let's get into the show.
[00:00:27] Martin: Hi, everybody. Hi, Olga. I'm doing really fine. Really excited to be here.
[00:00:32] Olga: Yeah. So today I wanted to talk, unsurprisingly, about JavaScript SEO. I have prepared a list of I don't know how many questions.
[00:00:42] Olga: I am pretty sure I won't be able to ask all of them, but I will try. Before we get started with the questions, can you briefly tell us a bit about yourself and your role at Google? What is it exactly that you are doing there?
[00:00:58] Martin: Okey dokey. So, my name is Martin Splitt. I am working for Google in the Google Zurich office, together with people like Gary Illyes, Lizzi Sassman, Daniel Waisberg,
[00:01:11] Martin: and John Mueller; we are all here in Zurich. We also have Cherry Prommawin on our team; she's working out of Singapore. We are the Search Relations team. As such, technically my title is Developer Relations Engineer, but we're also talking to non-developers. Our goal is to make sure that people in the web ecosystem know what they need to do to make sure that their website is as discoverable in search as possible, because we understand that search brings a lot of traffic to websites out there, and we want to help people make the right decisions when they build websites. And then, if something goes wrong on our end, people can talk to us to report the problem. Teams inside Google Search also come to us whenever they want to do something like documentation, launch a new feature, or announce a change on our blog.
[00:02:15] Martin: So we are maintaining all of that, and we also give feedback to product teams.
[00:02:19] Olga: Okay. Perfect. So I'll get started. I have three groups of questions. One is "let's pretend you are Googlebot," where you'll try to answer as if you were. Then facts and myths, which I think is something everyone likes.
[00:02:35] Olga: And a bunch of general JavaScript SEO questions. So let's start with "let's pretend you are Googlebot." Can you walk us through the path that Googlebot goes through when it visits a site? I guess it starts from robots.txt, if I'm correct, and then what happens next?
[00:02:55] Martin: I mean, it starts with a URL; that's where it really starts.
[00:02:59] Martin: So Googlebot gets a URL from somewhere; we're going to discuss a bit what that somewhere is in a moment, but for now it just gets a URL. The very first thing it does is it looks at the host domain. So if it's example.com/products, for instance, it will go to example.com and check if example.com
[00:03:19] Martin: has a robots.txt file, and if it has a robots.txt file, it'll look at it and see if it can make a request to the URL that it has been given. If yes, then it will make an HTTP request, and it will record the response that it gets with all the metadata: timing information, HTTP headers, IP address. All this kind of stuff that comes back will be recorded and then passed on to the next system.
[00:03:53] Olga: Mm hmm. Mm hmm. And so the next system is?
[00:03:56] Martin: Ha ha. So the next system is actually multiple systems, because everything is complex and multiple things are moving at the same time when it comes to Google Search. One system is, how do I put this, not really a scheduler. It's kind of a scheduler, because it looks at the response that we got.
[00:04:21] Martin: And it looks if the response seems to include other URLs that it might want to look at: if it's an HTML page and it happens to have a bunch of links in it, or URLs in general that look like pages on the web. Then that gets passed into a separate process called the dispatcher, and the dispatcher makes decisions as to which of these URLs we might want to look at and crawl, when to crawl them, and with which priority in the general list of URLs that we need to crawl, or want to crawl, each should go.
[00:04:54] Martin: So it passes these URLs into, basically, a big list, and from this list comes the URL for Googlebot to go to next. So that's that cycle, the discovery and crawling cycle. The response that it recorded from your server is passed on to a system called the indexing system. It has a name.
[00:05:20] Martin: I'm not going to discuss the name in detail. People think it's called Caffeine. It's no longer called Caffeine; it hasn't been called Caffeine in many, many years. But it's a system that is, again, made up of lots of smaller systems. And that's called the indexing pipeline, if you want to call it that.
[00:05:39] Olga: Okay. And what happens next?
[00:05:43] Martin: I had a feeling you'd ask what happens next. So, in the indexing system, a lot of things happen, sometimes multiple times, sometimes in parallel, and some of the things depend on each other. For instance, it looks at the HTTP response and goes: huh, is this a 200 OK?
[00:06:01] Martin: Is this a 400 response? Is this a 500 response? Of the 400 status codes, 404 is the most famous one, but there are also 410, 403, 402, 401. These kinds of responses tell us that something has gone wrong and that the URL is not served. So, for instance, it doesn't exist anymore, or it has, I don't know, been hidden behind a password.
[00:06:28] Martin: And we don't enter passwords and usernames, so we wouldn't see it; we are not authorized to see it. In that case, we can't actually go there, and that would abort the indexing process, because if there is an error, then what's the point in going on with indexing? The other thing is 500 responses, which usually mean something has gone wrong on the server.
[00:06:47] Martin: Usually things like 503, 502, 504 are temporary errors, so we'll probably, at this point again, stop indexing and wait for a new crawl to come in at some point later. And then maybe we get a positive response back, and that'd be nice, because again, with an error page, what's the point?
[00:07:07] Martin: By the way, the 300 responses: 301, 302, 303, 307. All of these are redirect responses, and those would actually be handled in crawling directly. So if crawling receives a 301, a 302, a 307, a 303, whatever, it would follow that redirect and then go crawl the URL the redirect points to.
[00:07:35] Martin: So that happens in crawling; that is not even reaching indexing at this point. Okay. At this point we have some HTTP response that is 200, or something in the range of 200 to 299, telling us that this is good to go. And then we look at things like: okay, so is this still an error page? Because sometimes you get a 200 and it just says, oops, this page is no longer here.
[00:07:59] Martin: Or, oops, there's a problem. So error detection looks at more than just HTTP statuses. If it still looks like an error page, we might at this point still cancel indexing and say: okay, this is an error page, there's nothing here for us to do. If the response turns out to be something other than HTML, it could be a PDF, a CSV file, a doc file, an RTF file, or a bunch of other file formats that we happen to support.
[00:08:28] Martin: Then this would be converted into an HTML representation of that page, because everything else in the indexing pipeline relies on whatever it gets being HTML. And we are speaking specifically about the web page index here, so web search. Obviously images don't do that, obviously videos don't do that, obviously all sorts of other things that exist in search don't do that. But we're talking web search here, and we're talking websites that we want to index.
[00:08:56] Martin: Websites that we want to index are converted into HTML, and this HTML representation is then looked at in various different ways. One thing is we want to know which language this is in. We want to know: when was this created? When was this possibly updated? Have we seen this before at another URL?
[00:09:17] Martin: Uh, we're going to render it. So we're going to throw it into a headless Chrome browser and run the JavaScript because that might generate additional content or additional information on the page. Um, and we want to figure out what is this about? Is there structured data in this and all these kinds of things we find out during the indexing process.
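(Editor's sketch: the triage Martin walks through, condensed. The 4xx/5xx/3xx handling restates the interview; the regex standing in for "soft" error detection is a made-up placeholder, since Google's actual error classifiers aren't public.)

```javascript
// Condensed restatement of the status-code triage from the interview.
function triage(response) {
  const { status, body } = response;
  if (status >= 300 && status < 400) return "crawling follows the redirect; indexing never sees it";
  if (status >= 400 && status < 500) return "abort indexing (gone, forbidden, unauthorized, ...)";
  if (status >= 500) return "abort indexing; likely temporary, recrawl later";
  // 2xx: error detection still looks beyond the HTTP status for
  // "soft" error pages that return 200 but apologize in the body.
  if (/(page .*no longer here|oops.*problem)/i.test(body)) {
    return "soft error page despite the 200: abort indexing";
  }
  return "continue: language, dates, dedup, rendering, structured data";
}
```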
[00:09:38] Olga: Okay. And I promised no ranking questions. So how does Google decide whether to index a specific page? Let's say there are no technical blocks; there is this evaluation happening, right? If you could tell me more about this.
[00:09:58] Martin: So we do have a bunch of systems that look at the content and try to figure out: is this useful, high-quality content, or is this less useful, less interesting content?
[00:10:11] Martin: And if it looks like it's high-quality content and it's useful and we don't have it in the index already, then we might, or we will very likely, decide to index it. So we look at a bunch of different things. For instance, if your website basically just says "hello" and that's the content, then that's not very useful.
[00:10:37] Martin: We might not want to index it. It's like, hello world. Okay, cool. If we have a lot of information that looks very, very similar, let's say you have the same product at three different URLs and the content is pretty much exactly the same, maybe one or two words are different for whatever reason,
[00:10:58] Martin: then we might decide: yeah, we already have this, we don't really need this again. So then we might not index it as well, and we will tell you that it's a duplicate of something else and show you that there is a different canonical that we picked for this document. There are also systems looking at predictions, as in: how likely is this to show up in search to begin with, especially if we have historic data for it? And here's a tricky thing.
[00:11:32] Martin: It might be useful, it might be high quality, but if no one cares about it, if it never ever shows up for years on end, we might kick it out of the index. We might not. We might still think: yeah, this still has a good chance of showing up for one or two queries. So then we would probably keep it.
[00:11:48] Martin: But if it really looks like this is not useful enough to keep in the index, then it might fall out of the index. It might get re-indexed next round, though. So if it gets crawled after, like, half a year, we might say: we don't know, we're just going to put it in the index. And then we find out a year later: oh, okay, no, this is not useful.
[00:12:06] Martin: It might fall out of the index again. So things can move in and out of the index, also based not on crawl demand but on demand for the page. Yeah, and then it might be stored in the index. And if it's in the index, that's good, but it's not necessarily your end goal.
[00:12:25] Martin: Your end goal is for it to actually show up in search results.
[00:12:27] Olga: Yeah. And indexing does not equal ranking.
[00:12:31] Martin: Correct. Correct. Because then that just means that we have it in the database and can potentially show it in search results.
[00:12:36] Olga: Okay. And we put a stop here: no ranking questions. Okay. So, recently, in Google Search Console, there appeared a robots.txt report, showing different variations of robots.txt: with HTTP and HTTPS, with and without www, different variations.
[00:12:53] Olga: Why was it added? Did you notice that people keep messing this up, and that it leads to potentially harmful things happening to SEO? Can you tell me more about this?
[00:13:08] Martin: Uh, which, which specifically? I'm not sure I understand the question.
[00:13:12] Olga: In GSC there's this report, the robots.txt report, which shows you different variations of robots.txt.
[00:13:19] Olga: Without www, with www, with HTTP, with HTTPS.
[00:13:26] Martin: I'm not even super sure. Are we sure? Where in Search Console is that?
[00:13:33] Olga: In Settings. I think it is somewhere where the crawl stats reports reside.
[00:13:40] Martin: Okay. Let's see. I'm not sure.
[00:13:42] Martin: Robots.txt... So. Hmm. Okay. I only have websites here, and I was...
[00:13:47] Olga: What I was wondering is: if people mess it up and they have an empty robots.txt for the version without the www, and they have no redirect, can it cause very serious effects, or does Google have some system for kind of correcting that?
[00:14:07] Olga: Because when there are technical issues, Google very often is able to understand what you meant and not cause you problems, in some cases. So, yeah.
[00:14:19] Martin: I'm actually not entirely sure what the specified behavior here is. I would assume that if we are trying to index the www version, we are definitely looking at the www version of the robots.txt, as far as I'm aware. And I wouldn't be surprised if people screw it up; that would not be surprising whatsoever. I'm pretty sure that people are running into these surprises, let's put it that way.
[00:14:49] Martin: Because I'm seeing this with other things, not just with robots.txt. So I am not surprised that we are showing all the different variations of the robots.txt URLs, because very likely people are hosting different versions of their robots.txt in different places. And I know that some websites have subdomains that then have their own robots.txt that is probably controlled by a different team.
[00:15:17] Martin: And then at some point this team might be like: oh, we are having issues. And then other parts of the website might get worried; this way, they can check, I guess. I'm not sure what the UX reasons behind that were, but I think it makes sense.
[00:15:31] Olga: Okay. So another question. Very often, when I see a site having indexing problems or crawl budget problems, I can see a lot of pages in the indexing report under "discovered, currently not indexed" and "crawled, currently not indexed." However, sometimes there is this column, source, which says "Google systems."
[00:15:53] Olga: Does it mean that it is kind of Google's fault? Or should I be looking into my own site and its quality?
[00:16:03] Martin: So if it says source "Google systems," it probably just means that some system found this information somewhere, and that is not exactly our fault, I would say. It means that, at least, it's not coming from, I don't know, your sitemap or something.
[00:16:26] Martin: But in general we figure it out quite quickly if a page shouldn't be crawled. I mean, even if it just shows up as discovered but not crawled, or crawled but not indexed, eventually we'll figure out that this page is not worth our time, and then crawling will probably move on and go elsewhere.
[00:16:50] Martin: So you wouldn't have to worry about that too much, I guess. And if it says source "Google systems," it doesn't mean that something's broken or weird on our end. It's just that we found the URL somewhere inside our systems; it could be, like, discovered internally.
[00:17:12] Martin: I don't think that's necessarily something that you need to fix unless it causes you demonstrable issues.
[00:17:20] Olga: Okay. Perfect. One more GSC question, a very recent issue. I noticed a lot of sites are now getting a lot of 404 errors when there is a /1000 at the end of the URL. And this looks like some spam attack.
[00:17:36] Olga: Basically, someone is creating those URLs of your site with /1000 and linking to you. Most sites do not have that URL and return a 404, so they appear as 404 errors in GSC. Sometimes the numbers of those 404 errors are really, really gigantic. So is this something to worry about, especially in the case of a huge site, a very huge one?
[00:18:01] Olga: No,
[00:18:02] Martin: no. Uh, I don't think that's a big problem because 404s are getting kicked out of the pipeline so early that it doesn't really cause that much of a problem. Uh, if you see crawl speed decline because of it, then that's something that you might want to. look into or you might want to try to find out if you can use the robots rule to kind of Avoid this kind of thing, but I haven't heard of any site experiencing actual issues due to these kind of urls because hypothetically like if someone just like if a million pages linked to a url that no longer exists for whatever reason or has never existed on your site That is something that just happens on the web and it needs to be addressed by us on a web scale, so that shouldn't cause big problems.
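(Editor's note: Martin mentions that a robots.txt rule can keep crawlers away from these URLs if the spurious /1000 hits ever become a real crawl problem. A hypothetical rule could look like the following; Google's robots.txt parsing supports the `*` wildcard and the `$` end-of-URL anchor.)

```
# Hypothetical rule: disallow crawling of URLs that end in /1000.
User-agent: *
Disallow: /*/1000$
```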
[00:18:49] Olga: Okay. Okay. So, talking about crawl budget, by the way: I recently had a site, and the site is relatively small, not one thousand, ten thousand URLs, e-commerce. So I wouldn't say this is a big site. However, they have really messed-up filtering and canonicals. As a result of that, when I crawled it with Screaming Frog, I was able to crawl two or three million URLs.
[00:19:18] Olga: In GSC, in the indexing report, in those different buckets, I was able to find around six million URLs. In the sitemap, everything is correct; only canonical URLs are indicated in the sitemap. So is this potentially a crawl budget issue, even though the site is relatively small?
[00:19:38] Martin: I mean, if the site is small and there's a large amount of non-canonical URLs being crawled, that shouldn't be that much of an issue, because eventually that crawling is going to slow down or die out quickly.
[00:19:53] Martin: We should be pretty good at predicting which URL patterns have more value because they end up being picked as canonicals. So in that case we should adjust crawling accordingly, and I don't think that's going to cause a crawling issue, unless you're a new site that has a million pages that need to update very, very quickly.
[00:20:16] Martin: You will not get overwhelmed by the spurious crawls, I think.
[00:20:21] Olga: Okay. So another question. What is the time between Google crawling the site or the page and rendering it? I know I'm oversimplifying, but this render queue, is it seconds, minutes?
[00:20:36] Olga: Does it depend?
[00:20:37] Martin: That's a lovely question. In general, for most of the pages in search, it is within minutes. It might end up being within hours, and very rarely it might be longer than that. If it's longer than that, that usually hints at us not being super interested in the content in the first place, and then it might be less likely to actually be selected for indexing.
[00:21:02] Olga: Okay. And I heard someone did a test: they changed the content of the site every second.
[00:21:10] Olga: So, let's say the content changes every second until, I don't know, second 35. At what second do you think Googlebot will index it?
[00:21:21] Martin: I know, that's an interesting one. I know that time and dates and a lot of other things don't work as you'd expect in rendering, because it shouldn't matter that much. In general, it doesn't matter at which time we render, specifically. I'm not a hundred percent sure how all the date and time functions work in rendering.
[00:21:46] Martin: I know that some of them work in very mysterious ways. So let's say we might render today, but we crawled yesterday; for some reason it sat in the queue for a day. Then you might actually see the date from yesterday. Sometimes you might see the date from today, if something has fetched a resource recently and has actually cleared the cache in between.
[00:22:06] Martin: So it doesn't always work as predictably as you'd think, because it shouldn't need to. Normally, a normal website, even one with dynamic content, doesn't usually rely on date and time being very accurate. And so things like 10 seconds can actually happen instantaneously, and sometimes things that use timers can happen in non-sequential order.
[00:22:30] Martin: So, relying on these tests: I understand where it's coming from, because I would do tests like this sometimes when I'm not sure how something exactly is implemented in our web rendering service and I'm just curious. Then I ask the team, and I compare that to what I'm seeing. But pretty much, I wouldn't rely on these kinds of tests,
[00:22:50] Martin: because they are not very reliable. I've also heard: oh yeah, Google waits for, I don't know, five seconds until all the content has to be there. And I'm like, no, that's not true. We actually wait longer than that if need be. But for that, we need to be confident that this waiting results in some additional content showing up, for instance, and usually we err on the side of caution.
[00:23:14] Martin: If you're doing weird test setups, then you're messing with our heuristics that try to identify real-world website behaviors, and then you might see weird things. Another thing is, you might see weird results when you're using things like web workers right now, just because very few websites are using web workers and we haven't had the urge to implement that properly.
[00:23:37] Martin: So you're seeing some differences in behavior. Another thing is, if you're asking for random numbers, you're not getting random numbers. You are getting pseudo-random numbers, but you're getting the same pseudo-random numbers on all of the renders, for instance. Or at least that used to be the case a couple of years back.
[00:23:55] Martin: I'm not sure if it has changed now. That's to make sure that the renders are as consistent and as comparable as possible over time, and that we don't accidentally introduce additional signal jitter where there should be none. So, things in rendering don't always work the way that people expect them to. And then they're like: listen, here it shows that it has output ten things with a delay of one second each,
[00:24:20] Martin: so it must be rendering for ten seconds. No, that's not necessarily true. We might just have found that it's pretty pointless to schedule more of these.
[00:24:30] Olga: Yeah. Okay. So a fact or myth question, a follow-up to that. Does it happen for some websites that JavaScript rendering is kind of off for a specific site, and Google only takes into account the source code, and this goes on for weeks or months?
[00:24:49] Olga: I heard such stories.
[00:24:51] Martin: No. No.
[00:24:52] Olga: No. So.
[00:24:52] Martin: In general, everything goes to the render queue. And if things go horribly, horribly wrong, and I will not say that never happens, because people are very creative when it comes to writing JavaScript code, maybe they manage to cause problems that take us a while to get over or get around.
[00:25:10] Martin: And then we might take what we have. It's like: okay, the JavaScript rendering fails for some funky reason. And that's rare, that's really rare. So we take what we have from the server, the HTML, because that's better than having nothing. But in general, everything tries to render.
[00:25:29] Olga: Okay. So maybe now a simpler question:
[00:25:33] Olga: there is a noindex in the source code, but there's no noindex in the rendered HTML.
[00:25:38] Martin: Ha, ha. That will very likely mean that it's not even rendering, because if we are looking at the HTML that comes back from the server and it already tells us it doesn't want to be indexed, then we conclude that, all right, this page doesn't want to be in the index.
[00:25:57] Martin: Um, so we can save on all these expensive processes, including rendering. Like, we don't have to do anything with that. Like, we don't have to convert it into HTML. We don't have to, like, none of the things need to run. If it tells us it doesn't want to be indexed, then we can just, like, take a shortcut and say, like, Okay, bye.
[00:26:14] Martin: So removing that information with JavaScript doesn't work.
[00:26:17] Olga: Okay. And vice versa?
[00:26:19] Martin: That works. That works.
[00:26:21] Olga: That works. And then the JavaScript will always override the source code?
[00:26:28] Martin: Except for cases where the page has so much content, and high-quality content, that we think: oh yeah, this is good to go forward.
[00:26:34] Martin: And then later on we re-index with the rendered version. So it might be that the non-rendered version goes into the index and then gets overwritten with the rendered version. And then, depending on caching, it can take a few hours to days until that is in all the data centers. So these are edge cases.
[00:26:53] Martin: This happens, but it happens very rarely, and you can't rely on it happening that way. You can assume, on more or less the safe side, that it won't be indexed. It might be indexed for a short transitional period of time; this transitional period might be a little longer or a little shorter, depending a little bit on the load on the data centers
[00:27:17] Martin: and where you are located geographically. In general, it won't show up in the index then.
[00:27:25] Olga: But the
[00:27:26] Martin: clearer signals you can send us, the better.
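(Editor's sketch of the two directions just discussed. If the server-sent HTML already carries a noindex, rendering is skipped and removing the tag with JavaScript changes nothing; adding a noindex with JavaScript, the reverse direction, is honored once the page renders.)

```html
<!-- Case 1: noindex in the server HTML. Rendering is skipped entirely,
     so a script that later removes this tag has no effect: -->
<meta name="robots" content="noindex">

<!-- Case 2 (the direction that works): no noindex in the server HTML,
     but a script adds one; it is picked up from the rendered HTML. -->
<script>
  const meta = document.createElement("meta");
  meta.name = "robots";
  meta.content = "noindex";
  document.head.appendChild(meta);
</script>
```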
[00:27:28] Olga: Sure. Okay, so let's say I have a website and I don't want to have users from anywhere but the US. Is it okay if I return a forbidden status code for everyone that is coming from outside the US?
[00:27:45] Martin: Oh boy. In general, the internet is a global place. I hate it when people do that, because, in your example, I might be a US American living in the US who is traveling abroad for a week. I still want to access that website, and now I can't, and I have to wait for a week or use a VPN.
[00:28:07] Martin: So what's the point? Just let me access it. If you don't want me to, because, I don't know, you have support effort that you would like to reduce, then tell me as clearly as possible, but maybe don't block me out entirely. If I know what I'm doing, and I know that I'm doing something that you might not necessarily want, then I should be able to do that.
[00:28:28] Martin: Just let me know that you are not excited about it. I mean, you can do that, and it's probably okay, but it's a bad user experience for people, I think.
[00:28:39] Olga: Yeah. I recently had such a thing: someone came to me, they sent me the URL, and they wanted me to assess it and give them a quote, and then I couldn't access it and had to use a VPN. And this is exactly the setup they had.
[00:28:56] Olga: So that's a funny one. Okay. How do I know if I have crawl budget issues, other than my pages not getting indexed or crawled, or maybe getting 500 errors? Are there any other symptoms?
[00:29:13] Martin: With crawl budget issues: crawl budget is usually made up of two different components.
[00:29:18] Martin: One is crawl demand, and the other one is crawl rate. So depending on what the limiting factor is in your case, you could be looking into one thing or the other. To understand a little bit where that's coming from, think about it from the perspective of Google Search: we would like to index as much of your site as quickly as possible.
[00:29:43] Martin: And if you have a million products, optimally we would discover your homepage and all the URLs for the million products. You can put them in a sitemap, for instance, and then we would know about all the million products. And then we would make a million requests in one go, get a million product pages back, index them, and be done with it.
[00:30:00] Martin: That's fantastic. However, maybe your server can't handle that, and that's where crawl rate comes in. If your server basically crashes when we request more than a thousand products at the same time, then that's not good for you, and that's not good for us, because then we are seeing errors and we have to retry, and that's expensive and time-consuming.
[00:30:19] Martin: And you have to deal with customers not being able to visit your website, because your server is down for those few seconds. That's not good. So we are looking at server timings, for instance. Say we make a hundred requests quickly, or we make a thousand requests in one go.
[00:30:36] Martin: The server starts to slow down, and then we're like: okay, okay, maybe we need to make fewer than a thousand requests at the same time. We would take a little longer until we actually get your whole million products in, because we can't just request them all in one go; we have to split it up into smaller batches.
[00:30:52] Martin: The same if your server starts throwing error codes at us, and by that I specifically mean the 500-somethings, so 502, 503, 504. These usually mean: oops, we are getting overwhelmed here. And then we are backing off as well. We're seeing that the error rate goes up if we make bigger batches,
[00:31:12] Martin: so let's make the batches smaller again and slow things down a little bit. That's one side of the equation where you can see crawl budget issues: crawl rate. If your server can't keep up with the amount of requests that we are trying to make, we will automatically adjust to that.
[00:31:28] Martin: But if you have, let's say, literally a million pages that go up every day or need to change every day, and we can't make that many requests because your server is slow, then that is a problem that you can debug by looking at your server logs and seeing the timings and the error responses
[00:31:44] Martin: going up. The crawl stats report might show that as well. So that's the crawl rate side of things. And then there's crawl demand. Crawl demand is very hard to predict sometimes, because if you are a news portal and you have a breaking news story on some global event, like the eclipse or something like that,
[00:32:06] Martin: no, that's usually not a global event, but anyway, some event is happening, and you happen to be one of the first couple of people who have the story, then this content is probably really relevant and useful. And as it's maybe a developing story and you keep changing it, we might figure out: oh, we actually need to crawl this a lot, because there is a lot of demand in queries right now.
[00:32:31] Martin: The story clearly still changes, so the content gets updated quite frequently, and we want to keep up with that so that we represent your content best in search results. But then maybe you are just one of many people reporting on the exact same event, and there's only so much to be said about that event.
[00:32:49] Martin: And maybe no one cares about the event; maybe it's some really boring local thing that five people care about. Then maybe we don't have to be as eager there as elsewhere. Or maybe you are posting about Christmas shopping ideas for 2026, and it's 2024 right now and it's about to be summer; then the demand for this is probably very low.
[00:33:14] Martin: So we might not be crawling right now, because we're like: okay, this is Christmas content, we understand that this content is very seasonal, the query demand for it right now is not very high, and we already have a version in our index. If this gets updated, it probably gets updated a little later.
[00:33:31] Martin: And it's fine if we just check monthly. Then in December we might actually ramp that up again. The systems understand these kinds of seasonality changes in topics and might actually just not crawl, because they don't care at this point. And that doesn't mean that you have a crawl budget issue.
[00:33:48] Martin: That just means that your content isn't as important and as time-critical as other things right now, so it doesn't get the full prioritization. What you might see, however, is that the server gets a bunch of requests but it's fine, so it's not a crawl rate issue,
[00:34:12] Martin: and there are topics that are evolving, up-to-date, latest things, and there is a lot of traffic on these queries coming to your site, but it's not reflecting the latest changes. If you look at the view crawled page in the URL inspection tool, for instance, and you see that the content hasn't been updated in days, that might hint at it. And then you go to crawl stats and you see: okay, we have a thousand pages that have changed, they are breaking news, they are in demand, they are showing up in search results,
[00:34:48] Martin: they're getting clicks, they're getting traffic from search results as well, but they're not getting updated very frequently. Then that might hint at a crawl budget issue.
[00:34:58] Olga: That
[00:35:00] Martin: might be that we are spending crawls elsewhere. And that, that can also be seen if the crawl rate is quite high, but there's pages in the crawl stats that are not as relevant or, yeah, um, we are spending time on URLs that don't actually make sense, then that might be a moment where like, huh.
[00:35:22] Martin: Maybe we need to like take care of our site structure or a sitemap file to make sure that Google understands where to focus its efforts.
[00:35:31] Olga: Okay. Perfect. So, a few quick fact-or-myth questions. Does Googlebot follow button links? Fact or myth?
[00:35:41] Martin: Does Google follow the button... the Google what button?
[00:35:46] Olga: Googlebot follows button links.
[00:35:49] Martin: Oh no. Well, okay. Ah, okay. This got pointed out last time, at the Search Central Live event in Poland, actually. It's a good point. So, this is complicated. Actually, it's not very complicated, but it's hard to explain with nuance
[00:36:11] Martin: so that it doesn't get taken out of context. Generally, if you want us to recognize something as a link, make it a link. If it's a button, it's not a link; we will not treat it as a link. It's not necessarily page A pointing to page B. However, and there's no guarantee here, we might see in the HTML: ooh, there is something that looks like a URL.
[00:36:40] Martin: So we can queue that for discovery and crawling. We can say: aha, there's a URL, let's try if we can crawl that URL, even if it's a relative URL. So if example.com/a.html has a button with something like /b.html in it, we'd go: ooh, maybe example.com/b.html is actually a URL that we could crawl.
[00:37:02] Martin: And then we might crawl that and discover that there is a page. So even if you're not using links, we might still find the URL, we might still try to crawl it, and we might still index it. It's just not treated the exact same way as a link. It might also not be picked up, or it might be picked up but scheduled with lower priority.
[00:37:23] Martin: So that's why our recommendation and guidance is: if you want us to pick up something as a link, make sure it is an actual link.
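(Editor's sketch of the distinction: the first element is an actual link and gets treated as one; the second is a button whose URL-ish string might, at best, be discovered and crawled at lower priority, with no guarantee.)

```html
<!-- Treated as a link from this page to /b.html: -->
<a href="/b.html">Product B</a>

<!-- Not a link. The "/b.html" string may still be spotted and queued for
     discovery crawling, but there is no guarantee -- don't rely on it: -->
<button onclick="location.href = '/b.html'">Product B</button>
```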
[00:37:31] Olga: Okay, that's a myth. Another one: Googlebot follows JavaScript links. Myth or fact?
[00:37:40] Martin: If by JavaScript links you mean javascript: something-something, then no, it doesn't follow these links.
[00:37:46] Martin: Again, if there is something somewhere in the code that looks like a URL, we might still crawl it, but it's not because we followed or executed that JavaScript link. We don't click on things, we don't interact with things, and javascript: is not something that I can make an HTTP request to, so no,
[00:38:00] Martin: Googlebot is not following these. However, if it looks like a URL, it might still be crawled.
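(Editor's sketch: a `javascript:` URL is nothing an HTTP request can be made to, so it is not followed as a link, though a URL-like string inside it might still be discovered.)

```html
<!-- Not followed as a link; Googlebot doesn't click or execute it: -->
<a href="javascript:openProduct('/b.html')">Product B</a>
<!-- ...although "/b.html" inside the attribute may still be discovered. -->
```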
[00:38:06] Olga: So, a little bit similar to the button treatment, right? Okay. Are JavaScript redirects okay? Or is it better to have regular HTTP redirects?
[00:38:18] Martin: It is better to have HTTP redirects if you can have HTTP redirects, just because it's a more
[00:38:26] Martin: stable, more robust way to do that, and it also works for browsers in any case. Especially if it's a permanent redirect and you do it with an HTTP 301. Then browsers, and this has nothing to do with SEO, remember: aha, this is a permanent redirect, and in the future, even if the user types in the old URL, the browser will immediately make a request to the new URL, because it has an internal directory remembering that this is a permanent redirect.
[00:38:57] Martin: So for users, it saves them additional network round trips. Imagine you're on a slow network; the redirect just takes longer to happen, because if it's a 302 or 303, the browser actually makes the request to the old URL to check if it's still there: yes, it is a redirect. This round trip takes time, especially on a slow network.
[00:39:16] Martin: And then it makes a round trip to the actual URL and comes back. If it's a 301, your browser makes a direct request to the new URL and saves you the additional hassle. With JavaScript, you have to make the request to the old URL, download everything that is at that URL, which you don't have to do if it's an HTTP redirect, run the JavaScript, and then make a new round trip to the new URL and fetch the content there.
[00:39:40] Martin: So, while JavaScript redirects are optimal neither for users nor for Googlebot, Googlebot does, when it renders the JavaScript, see that there's a redirect, follow the redirect, and process it accordingly. It's just that, as we discussed, if an HTTP redirect is in place, Googlebot can handle that redirect immediately, in the crawling stage.
[00:40:02] Martin: It cannot do that if it's a JavaScript redirect; the redirect takes a little longer to take effect, but it still works. When we revamped the structure of our documentation website, which includes our blog, we moved away from Blogger onto our own CMS platform, and to redirect the old blog links to the new blog links, we didn't have any other choice, thanks to Blogger being Blogger, than to implement JavaScript redirects.
[00:40:36] Martin: So Gary, of all people, had to implement a JavaScript redirect, and that works just fine.
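(Editor's sketch contrasting the two options. The routes and URLs are hypothetical; the point is that a 301 is resolved at the crawl stage and remembered by browsers, while a JavaScript redirect first requires downloading and rendering the old page.)

```javascript
// Preferred: a server-side permanent redirect (plain Node, hypothetical paths).
const http = require("http");
http
  .createServer((req, res) => {
    if (req.url.startsWith("/old-blog/")) {
      res.writeHead(301, { Location: req.url.replace("/old-blog/", "/blog/") });
      return res.end();
    }
    res.writeHead(200, { "Content-Type": "text/plain" });
    res.end("new blog home");
  })
  .listen(8080);

// Browser-side fallback for when you can't touch HTTP headers (the Blogger
// situation Martin describes); followed once Google renders the page:
//   location.replace("https://example.com/blog/" + slug);
```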
[00:40:42] Olga: Okay, perfect. So I have another question, maybe a more high-level one. Because you are kind of this person between SEOs and developers, and you are a developer.
[00:40:55] Martin: Yeah.
[00:40:56] Olga: When I, as an SEO, am pretty sure that my recommendation regarding JavaScript is correct, because I can see that Google clearly has problems with seeing the content, and the developers do not really agree with me: do you have any tips or best practices on how to talk to developers?
[00:41:17] Martin: Come with proof. If you have guidance from our side, that is usually really helpful. Show them what you are looking at. Show them that there is a problem or a challenge. Tell them: this is the change I'm expecting to see after your work has been implemented. And then verify with them afterwards: thank you so much,
[00:41:46] Martin: your changes had this impact for us. That usually motivates developers. And if you don't know something, just say so. If you don't understand what they told you, just say so. Don't try to bullshit your way around it or steer away from the bits and pieces that you don't know. If you don't know, that's fine. Developers don't know,
[00:42:04] Martin: 90 percent of the time, what's going on, and then they find out. If you don't know, you can ask them to help you find out, or you can go on your own journey and find out, but come back with facts: measurable things. Show them: look, here's the rendered HTML. Here's Google's documentation saying they look at the rendered HTML. Here's content that we should be seeing; here's content that is not in the rendered HTML. This is a problem because XYZ, and here's the documentation to back me up. Also, don't assume that you know how to do their work. If you don't know what they specifically need to do, just say: I don't know what you have to do,
[00:42:44] Martin: but could you please explain to me what needs to be done, so that I can make sure I advocate for your time being used to fix this properly? So be an ally to them. Especially if you sense that they are not against doing the work, it's just that they don't have priority for doing it, make sure that they are on your side as an ally when you talk to the stakeholders who actually determine where developers' effort and time are spent. That can be
[00:43:13] Martin: a team lead, that can be a project manager, that can be a technical program manager; that depends a little bit on the organization that you're working with. But help them argue the case for why they should be spending time on the thing that you asked them to spend time on.
[00:43:29] Olga: Perfect. Thanks. Okay. So, this question: as part of an SEO audit, I always check the JavaScript reliance of the site.
[00:43:39] Olga: I usually disable JavaScript to see what the site looks like without it. And very often I see a blank page. If I see a blank page, what should be my next move? Usually it is going to inspect the URL in Google Search Console. What are other things I should be doing? Should I be worried?
[00:43:59] Martin: No, no, you shouldn't be worried, because the reality is that a lot of stuff relies on JavaScript these days, and that's generally fine.
[00:44:09] Martin: The easiest way to go about it is, as you say: you go to the URL inspection tool and you look at the rendered HTML. If the content is in the rendered HTML, you will be fine. If the content is missing from the rendered HTML, that's when the journey begins. That's when you need to look deeper.
[00:44:28] Martin: Why is it not there? Where is it coming from? Which piece of JavaScript is doing this? Without JavaScript it might look like there's a problem, but if the content is in the rendered HTML, then there's no cause for alarm. If it's in the rendered HTML, it's fine.
[00:44:43] Olga: Okay. Perfect. So, when executing JavaScript with Screaming Frog and using the user agent Googlebot, which I very often do,
[00:44:52] Olga: I get those screenshots of all the pages, if it is possible to crawl the entire site. Is that reflective of how Google actually sees the site, or could there be differences? I guess there could be, but how significant?
[00:45:09] Martin: Uh, that depends a lot on the site that you're looking at. So I'll give you an example.
[00:45:14] Martin: If you use Screaming Frog, they have their own implementation of rendering. They don't use what we are using, for obvious reasons; they had to build their own version, and they had to make decisions, just as Google engineers had to make decisions on how our rendering specifically works. So it could be that they are using a different Chrome version.
[00:45:32] Martin: It could be that they're using slightly different specifications, slightly different flags, slightly different ways of implementing their own rendering that differ from the way that our rendering works. So that's definitely going to be different.
[00:45:46] Martin: Another thing is that it runs from a different IP address, so it's not coming from Googlebot's real IP addresses; these crawls are coming from wherever. I'm not sure, but Screaming Frog is probably just using your IP address, from wherever you are running it.
[00:46:00] Olga: I think so. Like the desktop for sure.
[00:46:01] Olga: Yeah. Yeah,
[00:46:02] Martin: yeah. And then, uh, it might be that something on the server side of the website is configured. to detect fake google bots and not work with them or work differently with them and then a fake google bot gets a different thing than the real google bot and it might might be as obvious as like oh yeah no this is not a real google bot because the ip addresses don't match the google ip addresses um so this just gets a page that says like you have been blocked That doesn't mean that Google is blocked.
[00:46:33] Martin: That just means that your IP address unfortunately is blocked because it is Behave or it's it's imposing Googlebot when it really is not Googlebot. Um, So that might happen. Uh, it could be that the The, the thing is more subtle that it actually uses like a cached version and doesn't use a cached version for Googlebot.
[00:46:54] Martin: So then Googlebot gets a slightly different version than what Screaming Frog is getting. It could be that they have a glitch in their robots.txt implementation somehow, and then they can actually crawl something that Googlebot would be disallowed from crawling. So there are these small variations, and usually these small variations don't matter.
[00:47:17] Martin: But sometimes they do, and then it's really, really hard to debug the difference. It's easy to spot the difference if what you're seeing in the URL inspection tool and what you're seeing in Screaming Frog don't match. Then there's only one thing that is correct, and that's the real Googlebot doing the fetch, and that's the URL inspection tool.
[00:47:36] Martin: The other thing is just close, but slightly off from the real thing. And then there's not that much you can do; you have to debug with the URL inspection tool, which is not necessarily the easiest thing to do. But depending on what you're trying to do, you might actually get an idea of what's going on and why there's a difference
[00:48:04] Martin: between what you're seeing and what you're seeing in the URL inspection tool.
[00:48:10] Olga: Okay, perfect. Should I recommend... I have an e-commerce client, let's say. They have their own CMS; everything is custom coded using JavaScript, and all the product boxes, links to products, pagination, everything is shown only after JavaScript has been executed.
[00:48:27] Olga: Without JavaScript, the entire page is there, but there are no products and no pagination. They say they could rebuild it in a way that everything will be visible in the source. But should they? Is it worth the time, do you think? Can it improve things?
[00:48:45] Martin: If they have problems now that you can link to it, because some products are not showing up in the rendered HTML, product updates are too slow, there are problems with Google Merchant Center,
[00:48:56] Martin: if there is a problem that you can prove is there and would go away if JavaScript wasn't involved, maybe. It might make sense to make it non-JavaScripty, because it might also be faster for clients. But maybe the difference is small enough for the JavaScript implementation to be just fine. So unless you have hard, real, factual data that it is because of JavaScript loading times and rendering
[00:49:27] Martin: that you are seeing what you're seeing on that site, I wouldn't recommend rebuilding it, because rebuilding has a lot of risk as well. That's what people tend to forget: you have something that works right now.
[00:49:40] Olga: It might
[00:49:41] Martin: not work perfectly, but will it work perfectly when they rebuild it? No, because then they're rebuilding it.
[00:49:47] Martin: most likely with a technology they are less experienced with, and it is inherently more complex to do something like that. If you are working with systems that have to be highly interactive, there's no way around JavaScript, and making that work with a hybrid kind of rendering solution, like a hydration solution, is most likely more complex than a clear path of either server-side rendering or client-side rendering; an in-between solution is technically quite complex sometimes.
[00:50:21] Martin: And then you might introduce new problems, especially if, as part of that, the structure of the content, the presentation of the content, the layout, and all this kind of stuff changes. Then it's like it's a new site, so it won't be comparable either. Effectively you're revamping, you're migrating.
[00:50:42] Martin: And everyone who has done a migration knows that it is a complex, time-consuming, and nerve-wracking process. So unless you have very, very good reasons to do that, I wouldn't do it.
[00:50:58] Olga: Okay, perfect. How is Next.js rehydration on a React-based site seen by Google?
[00:51:07] Olga: The content, links, etc. get repeated twice on the same page. But does doing it for server-side rendering purposes have any side effects? I think this was the question.
[00:51:20] Martin: Yeah, that's a good question. It doesn't really have side effects; it's fine.
[00:51:24] Olga: Okay.
[00:51:25] Martin: Related to that: we just discover the links twice, but that's okay.
[00:51:31] Martin: That doesn't really have any implications. It's okay.
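(Editor's sketch of what "repeated twice" looks like in a server-rendered Next.js response, heavily simplified: once as HTML, once as the serialized props the client uses to hydrate. Per Martin, the links just get discovered twice, which is fine.)

```html
<div id="__next">
  <a href="/products/1">Blue widget</a> <!-- server-rendered markup -->
</div>
<script id="__NEXT_DATA__" type="application/json">
  {"props": {"pageProps": {"products": [{"id": 1, "name": "Blue widget"}]}}}
</script>
```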
[00:51:35] Olga: Okay. And let's say I have an e-commerce site, and products load dynamically when someone is scrolling. So for how long will Google be scrolling and discovering new products? For how long, or how tall?
[00:51:52] Martin: That's really hard to say. In general, it doesn't scroll at all.
[00:51:57] Martin: So if you're relying purely on scrolling, you won't see the content in the rendered HTML. There isn't a clear, set cutoff moment. Basically, what you want to do is check what your rendered HTML looks like, and then make decisions based on that.
[00:52:15] Olga: So would you recommend, for that implementation, having normal, clickable pagination with links in the source code, in the version without JavaScript, and then having that magic in the JavaScript-rendered version?
[00:52:31] Martin: I think that's generally okay, but it sounds like a shaky setup that invites potential problems, so I'm not so sure I'm keen on implementations like that. If you can make it just work without JavaScript, make it just work without JavaScript; if not, then you can do something like that. It just sounds like a shaky thing that can go wrong in many hard-to-debug ways.
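(Editor's sketch of the hybrid Olga describes: crawlable pagination links in the HTML, with infinite scroll layered on top by JavaScript. Martin's caveat applies; this is a pattern to test carefully, not a guarantee.)

```html
<ul id="products">
  <li><a href="/products/1">Product 1</a></li>
  <!-- ... -->
</ul>
<!-- Crawlable fallback: a real link Googlebot can follow without scrolling. -->
<nav id="pagination"><a href="/products?page=2">Next page</a></nav>
<script>
  // Progressive enhancement (hypothetical): for users, hide the pagination
  // and load the next page of products when they scroll near the bottom.
</script>
```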
[00:52:58] Olga: Okay, so two final questions. Is there a situation where you could justify linking internally using URLs with parameters, where those URLs are canonicalized to the version without parameters? Do you think it may work in some cases? I recently had a site where basically all the internal linking was built this way, and I questioned that, and they said: but this is how our site is structured.
[00:53:29] Olga: We cannot really change it. And I was wondering: is there a case for that?
[00:53:36] Martin: I don't think it's much of a problem, because if they are canonicalized to the non-parameterized version, then a link to a parameterized version will link into the same cluster; it'll point to the same thing. So it shouldn't be that much of a problem.
[00:53:52] Martin: But again, it's a case of giving us clear signals if you can, and apparently they can't.
[00:54:00] Martin: If you can have non-parameterized versions that point to canonicals as links, great. That's fantastic, because then that's as clear a signal as it gets. If for some technical reason they can't do that, I don't think it is too much of a problem,
[00:54:15] Martin: especially because internal links, in that case, mostly help us understand the structure and also aid discovery. So if the pages are indexed and rank as you expect them to rank, I don't think it's a big problem.
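(Editor's sketch of the setup in question: internal links carry parameters, and each parameterized URL declares the clean URL as canonical, so the link still points into the same cluster. URLs are hypothetical.)

```html
<!-- Internal link as the site emits it: -->
<a href="/shoes?source=menu">Shoes</a>

<!-- On /shoes?source=menu, the page declares the clean canonical: -->
<link rel="canonical" href="https://example.com/shoes">
```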
[00:54:29] Olga: Okay. And the final question: what would you say are the worst JavaScript SEO mistakes, the ones you wish you'd stop seeing but keep seeing over and over again?
[00:54:43] Martin: Worst JavaScript SEO issues? So, probably: trying to be clever and not using the platform directly. If you have a link, just use the link. Don't try to build something with JavaScript when you have something that just works out of the box in HTML, because there's accessibility built into it, there's performance built into it,
[00:55:08] Martin: there's discoverability built into it, and you have to rebuild all of that if you're trying to be clever. That's the general theme that I'm seeing with JavaScript developers: they're rebuilding something that already exists because they want to make it better, and then they actually end up making it worse, or only just as good as the thing they had without JavaScript in the first place.
[00:55:27] Martin: So then, what's the deal? Why are we doing all this work? That's number one. Number two: trying to be clever SEO-wise, using robots.txt to minimize the URLs that we are potentially crawling, and then getting overly eager and actually blocking URLs that are relevant for rendering the page.
[00:55:50] Martin: The content doesn't show up because you roboted it away. To this day, that's the simplest mistake, and it still happens so many times.
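(Editor's sketch of that "simplest mistake": a robots.txt meant to trim crawling that also blocks the JavaScript the page needs to render, hiding the content. Paths are hypothetical; for Google, the most specific matching rule wins, so the longer `Allow` overrides the broader `Disallow`.)

```
# Over-eager rule: this also blocks the bundle that renders the content.
User-agent: *
Disallow: /assets/

# Fix: let the rendering-critical resource through.
Allow: /assets/app.js
```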
[00:56:00] Olga: Okay. Perfect. So, Martin, thank you so much.
[00:56:04] Martin: Thank you so much, Olga, for having me.
[00:56:05] Olga: I learned so much, and I think everyone learned so much as well. So, again, thanks so much, and I hope to talk to you soon.
[00:56:13] Martin: All right. Have a great day, everybody. Thanks for listening in. Thank you.
[00:56:16] Olga: See you
[00:56:16] Martin: around.
[00:56:17] Olga: Thanks. .