Did Google Just Leak Its Google Search Algorithm Docs?

Summary

A 2,596-file, 178,764-line commit by a GitHub bot may have pulled back the curtain on how Google ranks pages.

 

Quotes

Quote

A trove of leaked Google documents has given us an unprecedented look inside... some of the most important elements Google uses to rank content. Thousands of documents...were released March 13 on Github by an automated bot. What’s inside? The documentation indicates this information is accurate as of March [with] Ranking features [showing] 2,596 modules are represented in the API documentation with 14,014 attributes.
The documents did not specify how any of the ranking features are weighted – just that they exist. [Additionally,] Twiddlers, These are re-ranking functions that “can adjust the information retrieval score of a document or change the ranking of a document,” [and] Demotions: Content can be demoted for a variety of reasons, such as: A link doesn’t match the target site, SERP signals indicate user dissatisfaction, Product reviews, Location, Exact match domains, Porn. [Finally,] Change history, Google apparently keeps a copy of every version of every page it has ever indexed. Meaning, Google can “remember” every change ever made to a page. However, Google only uses the last 20 changes of a URL when analyzing links.

  • Links matter - This should not be a shocker, but if you want to rank well, you need to keep creating great content and user experiences, based on the documents.
  • Brand matters - Brand matters more than anything else
  • Entities matter - Authorship lives. Google stores author information associated with content and tries to determine whether an entity is the author of the document.
  • SiteAuthority - Low quality content on part of a site can impact a site’s ranking as a whole.
  • Chrome data - Google uses data from its Chrome browser for ranking
  • Whitelists - Google whitelists certain domains related to elections and COVID – isElectionAuthority and isCovidLocalAuthority.
  • smallPersonalSite - for a small personal site or blog
  • Freshness matters – Google looks at dates in the byline (bylineDate), URL (syntacticDate) and on-page content (semanticDate)
  • RegistrationInfo - Google stores domain registration information
  • titlematchScore - believed to measure how well a page title matches a query
  • avgTermWeight - Google measures the average weighted font size of terms in documents
  • siteRadius vs siteFocusScore - To determine whether a document is or isn’t a core topic of the website
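
To make the "twiddler" and demotion ideas above a bit more concrete, here is a minimal sketch of what a re-ranking step could look like. This is not Google's code: the leak only says these features exist, not how they are weighted, so every name (Doc, rerank, the flag strings) and every multiplier below is a made-up assumption for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class Doc:
    url: str
    score: float  # base information-retrieval score (invented numbers)
    flags: Set[str] = field(default_factory=set)  # hypothetical demotion signals


# A "twiddler" here is just a function returning a score multiplier for a doc.
Twiddler = Callable[[Doc], float]


def exact_match_domain_demotion(doc: Doc) -> float:
    # Hypothetical weight: the leaked docs do not give any weights.
    return 0.5 if "exact_match_domain" in doc.flags else 1.0


def serp_dissatisfaction_demotion(doc: Doc) -> float:
    # Hypothetical weight for "SERP signals indicate user dissatisfaction".
    return 0.8 if "serp_dissatisfaction" in doc.flags else 1.0


def rerank(docs: List[Doc], twiddlers: List[Twiddler]) -> List[Doc]:
    """Apply every twiddler to each doc's base score, then sort best-first."""
    def adjusted(doc: Doc) -> float:
        s = doc.score
        for twiddle in twiddlers:
            s *= twiddle(doc)
        return s
    return sorted(docs, key=adjusted, reverse=True)


docs = [
    Doc("a.example", 10.0, {"exact_match_domain"}),
    Doc("b.example", 8.0),
    Doc("c.example", 9.0, {"serp_dissatisfaction"}),
]
ranked = rerank(docs, [exact_match_domain_demotion, serp_dissatisfaction_demotion])
print([d.url for d in ranked])
```

The point is only that a document with the best base score can still end up last once demotions are applied afterwards.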

 

My thoughts

Interesting to see that some portion of PageRank still exists in the latest version of Google. I'm sure it's been heavily nerfed since the original, but I'm glad to see it's still alive. These docs also go a long way to show the lies, half-truths, and other things Google has made statements on in the past; see the ipullrank link below for a breakdown of these. I just wonder how this will affect Google's already declining quality now that most people have a peek at Google's playbook.

 

Sources

The GitHub commit

Secrets from the Algorithm (ipullrank.com) (VERY TECHNICAL, VERY INTERESTING READ)

HUGE Google Search document leak (searchengineland.com) (Quote Source)

 


Given that search is basically not usable as is due to the LLM bullshit fountain, I'd say this won't have too much of an impact immediately. However, it will in maybe a year, when Google does another big training data steal from websites that have by then been able to abuse the rankings, potentially filling the LLM with false information that helps them out because they rank well.

 

The udm14 non-AI search will be affected by this sooner, which sucks since the udm14 thing has made Google search the best it's been in a decade.
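
For anyone who hasn't tried it: "udm14" means adding udm=14 to the search URL, which selects Google's plain "Web" results view. A minimal sketch of building such a URL in Python; the function name is made up for illustration, and Google could change or drop the parameter at any time.

```python
from urllib.parse import urlencode


def web_only_search_url(query: str) -> str:
    """Build a Google search URL that requests the plain "Web" results view."""
    # udm=14 is the parameter discussed in this thread; the helper name is hypothetical.
    return "https://www.google.com/search?" + urlencode({"q": query, "udm": "14"})


print(web_only_search_url("google search leak"))
```

Most browsers let you set a URL like this as a custom search engine, with the query placeholder in place of the search terms.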

Link to post
Share on other sites

OK, but does it matter? To whom? Google has been trash for at least 10 years... search results mostly seem random compared to how it was... the YT algorithm is incredibly skewed and easily manipulated...

 

the only way to improve google is to remove it from the internet entirely.  🙂

 

 


56 minutes ago, Caroline said:

trash.vbs

that's about what i got too. 😄

 

1 hour ago, jaslion said:

udm14 thing has made Google search the best it's been in a decade.

You what? The only way to actually use Google now is to add "reddit" to any search if you want any kind of non-skewed, objective results... Reddit! ... That ain't "good", bro.

 


3 hours ago, Mark Kaine said:

that's about what i got too. 😄

 

You what? The only way to actually use Google now is to add "reddit" to any search if you want any kind of non-skewed, objective results... Reddit! ... That ain't "good", bro.

 

Depending on what you're looking for you could make it show results up to 2015 using the 'advanced' search mode.

Or use something else; the Duck is fine unless you look for anti-American content. Otherwise, have primary sources at hand and skip search engines. There's Qwant, but it only works for Europe or via VPN.

Some countries have their own (Dzen, Baidu, etc.) but they're specific to them.

 

The thing with Google is that it's the only option for some regions. At work I tried searching for local news or things specific to my region using several engines and got old or irrelevant results, or at most a Wikipedia article; only Google works because of how massive its index is.


5 hours ago, Caroline said:

Depending on what you're looking for you could make it show results up to 2015 using the 'advanced' search mode.

Or use something else; the Duck is fine unless you look for anti-American content. Otherwise, have primary sources at hand and skip search engines. There's Qwant, but it only works for Europe or via VPN.

Some countries have their own (Dzen, Baidu, etc.) but they're specific to them.

 

The thing with Google is that it's the only option for some regions. At work I tried searching for local news or things specific to my region using several engines and got old or irrelevant results, or at most a Wikipedia article; only Google works because of how massive its index is.

When I have tried different search engines, Google is definitely better than, for example, Bing or DuckDuckGo when it comes to Norwegian content. But probably for some of the same reasons, it's much more problematic to look up Swedish or Danish content when on the Norwegian Google search.

It's because the languages are similar enough and Norwegian content seems to get a big boost in the search algorithm. Switching to Swedish Google isn't as easy as going to Google.se.

 

One example: if I Google "Riksarkivet se" (the Swedish National Archives have the .se domain), sure, the top result is the right website, but every single other result is the Norwegian national archive or related to it.

Sure, Norway also has an archive with the same name, Riksarkivet, but the Norwegian results aren't websites with that name; they're Arkivverket.no and digitalarkivet.no, which are shared by all the archives.

 

Meanwhile, searching for "Riksarkivet se" on Bing shows mostly Swedish results. The boost for results from your own country is def weaker on Bing.


Can they leak the Google search source code now?


9 hours ago, Mihle said:

When I have tried different search engines, Google is definitely better than, for example, Bing or DuckDuckGo when it comes to Norwegian content.

 

Agreed (in Canada myself). While I have certainly noticed search quality decrease over the years in general, I still have the best results with Google over any others I've tried, and it's not even close. I'm talking about regularly doing a simple search, Bing finding absolutely nothing related, while what I'm looking for is the top result on Google, to the point that it's almost bizarre. 

 

And I gave Bing a really good opportunity too (and still do) as I was a big MS Reward Point collector for some time, so essentially had to use Bing, or at least would always try it first.

 

Perhaps some of it comes down to being logged into Google and having a long history with them and their products. Maybe a combination of me being used to how to search on it, along with it having so much of my data, just makes it more useful? I don't know.


I wonder if this will turn up new evidence in the DOJ v Google court case?

https://arstechnica.com/tech-policy/2024/05/google-sends-doj-unexpected-check-in-attempt-to-avoid-monopoly-jury-trial/

Funny how Google can just send money and it's not treated like a bribe, but if I mailed a check to opposing counsel or a prosecutor unannounced, it would be a bribe and not just "trying to pay ahead". Hopefully this means Google is worried they're on shaky ground and the government's case is strong enough to enact some reforms.


2 hours ago, RejZoR said:

People still use Google? WHY?!

There are no other options. Google had the first mover advantage to a search engine that did things right and didn't just reward sites for keyword stuffing.

 

However we are rapidly approaching a point where we need to go back to curated lists (original flavor Yahoo!) of sites because too many sites exist that are just ads with stolen/ai-generated-garbage in between them.

 

The right middle ground is to actually break google up from doing advertisements, because it's pretty clear that their "ad business" conflicts with every single other thing they want to do. Break it up, split it off, and let the ad business sink or swim by itself.

 


5 hours ago, Kisai said:

There are no other options. Google had the first mover advantage to a search engine that did things right and didn't just reward sites for keyword stuffing.

 

However we are rapidly approaching a point where we need to go back to curated lists (original flavor Yahoo!) of sites because too many sites exist that are just ads with stolen/ai-generated-garbage in between them.

 

The right middle ground is to actually break google up from doing advertisements, because it's pretty clear that their "ad business" conflicts with every single other thing they want to do. Break it up, split it off, and let the ad business sink or swim by itself.

 

Because it's now been so long, it's very easy for everyone to have forgotten that the real power of Google Search was that it actually worked quickly and the top results might actually be useful, mostly because the original versions ranked based on user interaction. The problem is that "Bots kill everything on the Internet" happened. It's been a downward spiral for a long time, with Google attempting to combat fake results and selling ads to combat the fake results.

 

As for the longer technical breakdown of the document dump, honestly, most of it isn't really all that surprising. The actual search rankings are stupidly complex these days because that's actually the only way to combat the fake websites. Much of the system is still dictated by user interaction feedback loops, they're just far more complex than in the past.
