Submitted by Officialsparxx t3_zzog7d in explainlikeimfive

By constantly downloading random information from the internet, wouldn’t you be exposing yourself to tons of malicious content? Aren’t there pages that can run malware without you even clicking on anything?

A better example than search engines might be something like “the Wayback Machine”, a site that actually saves the pages themselves, not just links to them.

13

Comments

Most_Engineering_992 t1_j2crux8 wrote

When you click on a link, the browser loads a lot of stuff, including page layout information (HTML and CSS), references to images, and JavaScript code, which is essentially a small program the page asks your browser to run. Normally that code talks to the host site to fetch information from databases and pass along things like passwords and emails, but it can just as easily be written to do bad things.

Search engines don't do that. The contents of the page are downloaded and scanned for text, links, and images, but no code is run. It's like the difference between looking at directions on a map, and actually following those directions.
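
Roughly what that "download and scan, never execute" step looks like - a minimal sketch, assuming Python with the third-party requests library and a placeholder URL. The page is fetched as text and scanned for links and words; anything inside `<script>` tags is skipped, never run:

```python
import requests
from html.parser import HTMLParser

class LinkAndTextExtractor(HTMLParser):
    """Collects link targets and visible text; <script> bodies are skipped."""
    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
        elif tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if not self._in_script:      # JavaScript is never run, only skipped
            self.text.append(data.strip())

html = requests.get("https://example.com", timeout=10).text
parser = LinkAndTextExtractor()
parser.feed(html)                    # parsing, not executing
print(parser.links)
print(" ".join(t for t in parser.text if t))
```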

21

sailor_sega_saturn t1_j2cwu4c wrote

Some search engines do execute JavaScript. In particular, both Bing and Google have "engines" based on Chromium.

For Bing:

> Bing is adopting Microsoft Edge as the Bing engine to run JavaScript and render web pages.

For Google:

> Googlebot now runs the latest Chromium rendering engine

5

Officialsparxx OP t1_j2csqyo wrote

I feel like I messed up by making this a “two parter” question. I know the Wayback Machine isn’t really a search engine or a web crawler, but would it be safe from malware for a similar reason? If not, why?

1

Toke_Ivo t1_j2d0wzf wrote

Code is not self-executing. I know certain articles can make it seem that way, but the reason code appears to be "self-executing" is really that your program is instructed to "download, read, and follow all instructions at <website>".

If you don't want or need that, you can make a program that just downloads the page without running the code, or you can limit what code it's allowed to run.
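
A minimal sketch of such a download-only program, assuming Python's standard urllib (the URL and filename are placeholders):

```python
# Fetch the raw page bytes and save them without ever interpreting them:
# no rendering, no JavaScript engine involved at any point.
from urllib.request import urlopen

with urlopen("https://example.com", timeout=10) as resp:
    raw = resp.read()

with open("page.html", "wb") as f:
    f.write(raw)
```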

Like, imagine a really stupid chef. You hand him a recipe and he makes the food. One day you hand him the recipe for a Molotov cocktail, and he blows up the kitchen. The issue isn't the recipe - it's the chef.

10

sailor_sega_saturn t1_j2cx838 wrote

A crawler is only vulnerable to the input that it tries to parse or execute. The Wayback Machine may archive Windows executables, but if so, it's almost certainly just treating them as binary bytes and wouldn't even know how to execute them.

So I'd expect the Wayback Machine to be immune to downloads of weird executables.

^(The user who downloads and runs the archived file on the other hand...)
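
For concreteness, a sketch of that "opaque binary bytes" treatment, assuming Python with requests (the URL is a placeholder): anything that isn't HTML just gets fingerprinted and stored, never interpreted.

```python
import hashlib
import requests

resp = requests.get("https://example.com/setup.exe", timeout=10)
kind = resp.headers.get("Content-Type", "application/octet-stream")

if kind.startswith("text/html"):
    pass  # hand HTML off to the usual text/link scanning
else:
    digest = hashlib.sha256(resp.content).hexdigest()
    with open(digest + ".bin", "wb") as f:  # stored under its hash, never run
        f.write(resp.content)
```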

3

cafk t1_j2dnu4h wrote

It's like right-clicking a link and choosing "save target as" instead of clicking it - that's how mirroring works: they just download the files. You're not rendering the page, just downloading a file.
It's only when you open the downloaded file in a browser that it gets rendered and any included JavaScript runs, which can exploit some weakness in a specific browser.

It's similar to downloading malware - it doesn't do anything until you run it - and you can even open the executable in a decompiler and look at what it does without ever actually running it.
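
A tiny sketch of that kind of look-without-running inspection, in Python (the filename is a placeholder and the magic-number table only lists a few common formats):

```python
# Identify a downloaded file by its first bytes without ever executing it.
MAGIC = {
    b"MZ": "Windows executable (PE)",
    b"\x7fELF": "Linux executable (ELF)",
    b"%PDF": "PDF document",
    b"<!DO": "probably HTML",
}

with open("downloaded_file", "rb") as f:
    head = f.read(4)

for magic, kind in MAGIC.items():
    if head.startswith(magic):
        print("Looks like:", kind)
        break
else:
    print("Unknown format - still just bytes either way")
```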

2

u202207191655 t1_j2cx1o1 wrote

Because they just read and write, and don't execute. Kinda like copypasta.

You can read a manual on how to hurt yourself physically without being harmed; it's only acting on it that would damage you.

4

OrbitalPete t1_j2cueis wrote

Think of it like the difference between photocopying a book and reading one. Your web browser reads the page code and interprets it. Crawlers and things like the Wayback Machine just copy the page code, or specific bits of it.

3

RSA0 t1_j2dec5o wrote

No, not really. Modern browsers are pretty resilient: they generally don't trust the code on the page and limit what it can do. Loopholes still happen, but they get patched quickly. That's the first line of defense.

Then, they run the crawler code under a restricted user account, so the operating system will refuse any access to system files. That's the second line.

Finally, if the malicious code somehow finds a loophole in the browser, AND THEN a loophole in the OS, it gets to live - up until the next system wipe.
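
A rough sketch of that second line of defense, assuming a POSIX system, Python 3.9+, a hypothetical fetch_and_parse.py script, and an existing unprivileged account named "crawler": the risky fetch-and-parse step runs as the restricted user, so even a hijacked crawler can't touch system files.

```python
import subprocess

# Run the (hypothetical) fetch/parse step as the low-privilege "crawler"
# account; the parent process needs permission to switch users.
subprocess.run(
    ["python3", "fetch_and_parse.py", "https://example.com"],
    user="crawler",   # restricted account: no access to system files
    timeout=60,       # runaway or malicious pages get killed
    check=False,
)
```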

1

DragoonXNucleon t1_j2dx414 wrote

ELI5: You can pretty easily tell if something is a book, right? So you're looking for something to read. Pick it up. Is it a book? No? Toss it. Yes? Read it.

Search engines do the same with everything they process. Malware isn't really embedded in the webpage itself; it's a separate executable the page tries to get downloaded. So any time the crawler reads something: "Is it a webpage?" No? Toss it. Yes? Process it, then find everything it links to, and repeat.
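
A minimal sketch of that loop, assuming Python with requests (the seed URL is a placeholder and the link extraction is deliberately naive):

```python
import re
from collections import deque

import requests

seen, queue = set(), deque(["https://example.com"])

while queue:
    url = queue.popleft()
    if url in seen:
        continue
    seen.add(url)

    resp = requests.get(url, timeout=10)
    if "text/html" not in resp.headers.get("Content-Type", ""):
        continue                 # not a webpage: toss it, never execute it

    for link in re.findall(r'href="(https?://[^"]+)"', resp.text):
        queue.append(link)       # process it, find what it links to, repeat
```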

1

aaaaaaaarrrrrgh t1_j2edgf7 wrote

a) zero-day exploits really aren't that common anymore - most viruses require a human to manually start them; just visiting a website and clicking links won't do it

b) most crawlers aren't actually "looking" at most of the content, so they'd just move around the virus without actually being affected by it

c) any exploit would likely be targeted against common browsers - the environment of the crawler would be different and the exploit/virus likely wouldn't work there, unless specifically targeting the crawler (and targeting the crawler is hard, because unlike the browser, it's not public so you can't easily test your attack)

d) if the operators have any common sense, the crawlers run inside a sandbox, so exploiting the crawler achieves nothing, and the sandbox gets automatically destroyed and recreated from a clean version on a regular basis

e) targeting crawlers specifically would be a dangerous game: due to the sandboxing it's not too valuable, but you're exposing your (valuable) zero-day to an environment that could be tightly monitored. If you get caught, your zero-day will be fixed and become worthless.

0