Comments


turniphat t1_je22fws wrote

You need to know the type of data you are dealing with. For example, if you want to open a .wav file, you find the specification (https://ccrma.stanford.edu/courses/422-winter-2014/projects/WaveFormat/) and then you write your program to the specification.

It says the first 4 bytes are the ID, then the next 4 bytes are the size, then the next 4 are the format... etc. etc. etc.

If somebody just hands you a blob of data and tells you to interpret it, then you are correct to be confused. You'd have no idea what the bytes mean.

Also, if you open a file in the wrong program, it interprets the bytes in the wrong way and you just get nonsense. Open a .exe file in notepad and it's just crazy characters all over the screen.
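Here's a rough sketch of that first step in Python (just the first 12 bytes, per the spec linked above; minimal, with no real error handling):

```python
import struct

def read_wav_header(path):
    with open(path, "rb") as f:
        # The spec says: 4-byte ID, 4-byte size, 4-byte format.
        chunk_id, chunk_size, fmt = struct.unpack("<4sI4s", f.read(12))
    if chunk_id != b"RIFF" or fmt != b"WAVE":
        raise ValueError("this is not a .wav file")
    return chunk_size  # size of the rest of the file, in bytes
```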

90

bulbaquil t1_je2yzza wrote

To summarize the .wav specification u/turniphat mentioned:

  • The first 4 bytes tell the computer "Hi, I'm a multimedia file. Please treat me accordingly."

  • Bytes 5 through 8 tell the computer "Here's how long I am." This is the answer to your question - one of the first things files of any kind will do is tell the computer how big they are, precisely because this is something the computer needs to know.

  • Bytes 9 through 12 tell the computer "Specifically, I'm a .wav file."

  • Bytes 13 through 36 tell the computer "Because I'm a .wav file, here are some things you need to know about me. Like, what's my bitrate, am I stereo or mono (i.e. how many channels do I have), etc."

  • Bytes 37 through 44 tell the computer: "Okay, the actual data's coming now. Just a reminder: this is how big it is."

  • Bytes 45 through whatever number the previous 44 bytes told us are the actual sound itself.

As for why the computer treats 1001 as 9 instead of as a 2 and then a 1: at a very fundamental level, the computer isn't reading the data bit by bit; it's reading it in chunks (sort of like taking steps two at a time). By default, the chunk size is the "X" they're talking about whenever they refer to an "X-bit system" or "X-bit architecture", but if a file is encountered, its directives on How to Read This Kind of File take over. So it isn't seeing a sequence "1-0-0-1" and trying to figure out where to break it; it's seeing a gestalt "1001" (really, "00001001") and treating it as a single unit. If you wanted a 2 and then a 1, you'd need two different units: 00000010 00000001.
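A quick Python illustration of that last point (nothing WAV-specific here, just bytes):

```python
import struct

struct.pack("B", 9)      # one 8-bit unit holding 9:        00001001
struct.pack("BB", 2, 1)  # two 8-bit units holding 2 and 1: 00000010 00000001
```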

Tl;dr: Files share information about themselves to the computer when they're loaded. One of the things they share is how big they are, and another is how many bits of data the computer should read from them at a time.

67

fiatfighter t1_je35lzm wrote

This really made sense to me and I am NOT that technologically literate. And I definitely do not understand coding or this byte structure thing. But when you said-ok this piece is the program or file saying this, and this one is telling it this-that helped me wrap my brain around it. Thank you! Off to submit my resume to Twitter! Oh wait…

17

RelativeApricot1782 t1_je4t9td wrote

>Bytes 37 through 44 tell the computer: “Okay the actual data is coming now. Just a reminder: this is how big it is.”

Why does the computer need to be reminded?

4

mrpenchant t1_je4vys2 wrote

They misstated it a little bit.

The way the format is set up, the first length it gives is for the whole thing, which is defined to have 2 subchunks. The first subchunk is always the same size for a wave file, but it still provides a length for that subchunk; the last data length is just for the data in the 2nd subchunk.

This is all to say, it's not a reminder but a slightly different length, which would be the length of the entire thing minus 36.
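Using the field names from the spec linked above (and assuming the standard 16-byte PCM fmt subchunk), the arithmetic works out like this:

```python
# ChunkSize = 4 + (8 + Subchunk1Size) + (8 + Subchunk2Size)
#           = 4 + (8 + 16)            + (8 + Subchunk2Size)
#           = 36 + Subchunk2Size
subchunk2_size = chunk_size - 36  # chunk_size as read from the header
```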

5

aiusepsi t1_je4u9v5 wrote

A computer doesn't, but software is (at least for now) written by human beings. You could have the size of the actual payload be implicit, and calculated from the information you've already seen, but there's more opportunity for the person writing the code which is reading the file to get the calculation wrong in some subtle way.

If the size is written explicitly just before the data, you can make the code which reads it much simpler and therefore more reliable. Simple and reliable is really good for this kind of code; mistakes can lead to software containing security vulnerabilities. Nobody wants to get a virus because they played a .wav file!
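As a sketch of what the explicit size buys you (hypothetical layout: a 4-byte little-endian size followed by the payload):

```python
import struct

def read_sized_block(f):
    (size,) = struct.unpack("<I", f.read(4))  # read the explicit size field...
    return f.read(size)   # ...then exactly that many bytes; there's no
                          # arithmetic over earlier fields to get wrong
```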

2

nerdguy1138 t1_je5ekl5 wrote

The file utility on Unix-like systems can read the first few bytes of a file as a magic number to determine what kind of file it is.

There is a hacker magazine called PoC||GTFO, short for "Proof of Concept or Get The Fuck Out".

The PDFs of that magazine can also be interpreted in various other ways. Files that you can do this with are called polyglots.
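The magic-number idea, sketched in Python with a few well-known signatures (real tools like file know thousands):

```python
MAGIC = {
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"%PDF": "PDF document",
    b"RIFF": "RIFF container (e.g. .wav, .avi)",
    b"PK\x03\x04": "ZIP archive",
}

def identify(path):
    with open(path, "rb") as f:
        start = f.read(8)  # the longest signature above is 8 bytes
    for magic, kind in MAGIC.items():
        if start.startswith(magic):
            return kind
    return "unknown"
```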

2

pseudopad t1_je248it wrote

This is probably the best explanation so far. There are a few posts talking about CPUs and how many bits they are, but the question was about storage, and this reply describes how a computer (program) figures out what's inside a file.

8

Y34rZer0 t1_je2zatz wrote

The difference between data and information

2

ColdDesert77 t1_je3tg6b wrote

> the specification

What's that?

2

aiusepsi t1_je4oway wrote

A document written by a human being which describes the format of the file.

It's basically an agreement between the person writing software that writes that kind of file and the person writing software that reads it.

4

psycotica0 t1_je4ovsd wrote

Did you click on it? It's the document that describes what makes a wav file a wav file so that programs that can read wav files can read it. It's essentially a description and some instructions for the programmers making the reading program and the writing program so they know they're making the same file that contains the same information.

2

theBarneyBus t1_je1unvu wrote

You're completely correct that it could be either 9 or a 2 then a 1. The issue is that you're assuming there is no context.

In storage, there are conventions (e.g. ASCII) that say that basic text is 8 bits per letter. Similarly, other data is stored in fixed-length intervals.
In RAM, whoever is writing to it determines how it is used. It could be any length. The program (and programmer) using it needs to make sure they’re using it correctly.

There are also ways to compress things like text, where bit length is dynamic. But that’s a bit complex, so let me know if you want that explanation as well.
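A sketch of the fixed-length convention (8 bits per letter, counted off from the front of the stream):

```python
bits = "010010000110100100100001"  # 24 bits, no separators anywhere

# Count off exactly 8 bits per letter, as the convention dictates.
letters = [chr(int(bits[i:i + 8], 2)) for i in range(0, len(bits), 8)]
"".join(letters)  # -> 'Hi!'
```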

26

aiusepsi t1_je4ogqj wrote

ASCII actually only uses 7 bits per letter, but because the smallest block of bits that a typical computer can individually access is 8 bits, the 8th bit goes unused and is always 0.

Which turned out to be very useful; the extra bit can be used for backwards-compatible extensions to ASCII, like UTF-8, which can represent characters not available in ASCII.
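A quick illustration in Python:

```python
"Hi".encode("utf-8")  # b'Hi': plain ASCII, the top bit of every byte is 0
"é".encode("utf-8")   # b'\xc3\xa9': both bytes have the top bit set, which
                      # is how readers know this isn't plain 7-bit ASCII
```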

4

Spiritual_Jaguar4685 t1_je1vjar wrote

Not dumb, a great question.

On the microprocessor level, the hardware is designed to always read a certain number of digits, called "bits"; 4 bits make a "nibble", and 8 bits make a "byte".

So a 16-bit microprocessor would read the value "one" as

0000 0000 0000 0001

and read "ten" as

0000 0000 0000 1010
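Python's binary formatting reproduces those fixed widths:

```python
f"{1:016b}"   # '0000000000000001'
f"{10:016b}"  # '0000000000001010'
```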

In the older days, processor size was a big deal. I played a lot of video games, so I remember that the Nintendo was 8-bit, then came the 16-bit systems (Sega and Super Nintendo), and then 32/64-bit processors with the Nintendo 64, etc.

For the most part we've stuck with 64-bit processors, for many reasons.

20

cmlobue t1_je3064h wrote

I remember capping my gold on the first Dragon Warrior game at 65535 because it used an unsigned 16-bit integer. I was amazed that it didn't generate an overflow error.

3

Jack2883 t1_je6ntv7 wrote

The lack of overflow error is due to good programmers checking the value and refusing to add to it if you hit the max.
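The check is as simple as it sounds; a sketch:

```python
GOLD_MAX = 65535  # the largest value an unsigned 16-bit integer can hold

def add_gold(gold, amount):
    # Clamp at the max instead of letting the value wrap past 16 bits.
    return min(gold + amount, GOLD_MAX)

add_gold(65530, 100)  # -> 65535, not the 94 that wraparound would give
```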

3

maveric_gamer t1_je1uwub wrote

At the absolute lowest level, it's built into the architecture of the system. When we say a "32-bit" or "64-bit" processor or architecture, we're saying that the native instruction set is encoded in that number of bits, a bit being a discrete 1 or 0. For other data that doesn't need that much, we have code that defines the length of each piece of data.

3

sacoPT t1_je2zwlu wrote

That problem is not specific to computers. 123 456 can be either one single number or 123 & 456 taken separately. Heck, negro can be dark if you read in Portuguese or black if you read in Spanish. You will know which one is the right one by using context.

In the digital world it's up to the software to decide what 1001 means, based on context. That's why if you open a .png file with MS Paint you see a picture, but if you open it with Notepad you see gibberish.

2

who_you_are t1_je4o3e6 wrote

You are 100% right about your question.

I've developed on slow CPUs (think microwaves, remote controls, ...) as well as desktops.

There are two parts you need to know.

The first one, which everyone in this thread will repeat: everything works in multiples of 8 bits (8 to 64 nowadays). You can't send 7 or 9 bits; you have to ask for a multiple of 8. Think of it like shipping boxes: you have specific box sizes for your stuff, and in the worst case you fill the leftover space with garbage.

Then comes the part where you are right: the meaning of those numbers depends entirely on the CPU or the software.

You need to read the CPU manual (called a datasheet) to know how those bits will be interpreted, because there could be 3 numbers within those 8 bits (like in your example).

As for the software, somebody (like me) programmed it to read the data in a specific way, interpreting parts of those 8 bits however I chose. So the software knows how to read and interpret it.

For the ELI5, you can also think of the CPU as a kind of software... running human-written software.

For desktop applications, you usually don't bother trying to squeeze multiple numbers into one of those 8-bit multiples, except when size (bandwidth, storage space, ...) could grow big really fast. We prefer readability over space nowadays.

As for CPUs... it's quite common for bits to have different meanings, like in your question. Again, you must read the datasheet (the CPU manual).
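A sketch of that kind of packing (made-up layout: a 2-bit mode, a 3-bit channel, and a 3-bit level squeezed into one byte, the way a datasheet might define them):

```python
def pack(mode, channel, level):
    # mode lives in bits 6-7, channel in bits 3-5, level in bits 0-2
    return (mode << 6) | (channel << 3) | level

def unpack(byte):
    return (byte >> 6) & 0b11, (byte >> 3) & 0b111, byte & 0b111

unpack(pack(2, 5, 3))  # -> (2, 5, 3), three numbers from one 8-bit value
```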

2

distinct_oversight OP t1_je8sgdi wrote

Woah. Thanks a lot. You really turned something complex into a real ELI5. Thanks again :)

1

TheJeeronian t1_je1ujqd wrote

Depends on what it's reading. If it knows in advance to expect ASCII text, then it will count out 8 bits to each letter.

The simplest ruleset which doesn't limit you at all would be that, after ten letters, there is a single bit which says whether or not the message continues. This ruleset is inefficient as hell but shows a simple solution.
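That toy ruleset, sketched in Python (blocks of ten 8-bit letters, each block followed by a single "does the message continue?" bit; purely illustrative):

```python
def encode(text):
    # Split the message into blocks of ten letters.
    blocks = [text[i:i + 10] for i in range(0, len(text), 10)]
    bits = ""
    for i, block in enumerate(blocks):
        block = block.ljust(10)  # pad the final block out to ten letters
        bits += "".join(f"{ord(c):08b}" for c in block)
        bits += "1" if i < len(blocks) - 1 else "0"  # continue flag
    return bits
```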

1

km89 t1_je1v2k6 wrote

It's divided up, and any remaining space is filled with zeroes.

You may have heard the terms "bit," "byte," "megabyte," etc. A bit is one digit; a byte is 8 digits, and multiples of that are named with their SI prefixes ("kilobyte", "gigabyte", etc).

So when the computer reads, it's reading in multiples of 8 digits. In your case, the computer might read one byte that has the binary data "1001" stored in it. To the computer, this would show up as "00001001", but 2 would just be "00000010" and 1 would be "00000001."

Note that I'm talking about bytes for simplicity, but computers generally run off a "word" size (which is itself some multiple of 8 bits), and in signed number formats the first bit is a sign bit, so it can be 1 even if the data doesn't fill the whole space. You can ignore that for now; it's not important for this answer. Specifics aside, the point is that the computer is reading specific numbers of digits at a time, and the data is padded with 0s if it doesn't fill all of the digits the computer's reading.
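You can see that padding directly with Python's binary formatting:

```python
f"{9:08b}"  # '00001001' -> 1001 padded out to a full byte
f"{2:08b}"  # '00000010'
f"{1:08b}"  # '00000001'
```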

1

zachtheperson t1_je2c4n0 wrote

You would write in known lengths such as "each number will be 8 bits," as well as extra numbers here and there that might say things like "the first number X is how long the list is, the next X numbers are the list, the number Y after that is how many letters there are, followed by Y number of letters."

The programmer gets to determine all of these things and make up the rules. It's what makes things like reverse engineering file formats difficult, since the file could be laid out in any format.
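A sketch of a made-up format along those lines (a count, that many 1-byte numbers, then a length-prefixed string; every name here is hypothetical):

```python
import struct

def write_blob(numbers, text):
    data = struct.pack("B", len(numbers))              # X: how long the list is
    data += struct.pack(f"{len(numbers)}B", *numbers)  # the next X numbers
    encoded = text.encode("ascii")
    data += struct.pack("B", len(encoded))             # Y: how many letters
    data += encoded                                    # followed by Y letters
    return data

write_blob([7, 2, 9], "hi")  # -> b'\x03\x07\x02\t\x02hi'
```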

If you want to see this being done in real time, check out the Metroid Prime Modding Discord. They've been reverse engineering the original GameCube game for years, and recently the remaster dropped, so they're currently in the process of tearing that apart and figuring out how the data is laid out so they can read it.

1

Dman1791 t1_je2d1i0 wrote

The most accurate short answer is "it depends."

At the processor level, everything is standard lengths and all the interpretation is physically wired into the chip. As an example, many ARM processors (used mostly in phones and such) operate with 32-bit long instructions. A specific part of those 32 bits contains what's called an opcode, which tells the processor how to interpret the rest of the bits.

At the programming level, you need some way to keep track of what format each piece of data is in. If you're programming in assembly (the lowest-level language), it's up to you and you alone to make sure everything is being read properly. In something like Java, the language makes you choose what type of data a variable is and then keeps track of it for you. In something like Python, the interpreter automatically assigns and keeps track of it without you having to do anything.

At the file level, the program you're feeding the data will try to read the file based on its extension. Most file types also have a "header", which is basically a special part at the start of the file that tells you how to read it. For example, a text file may start with a marker (a byte-order mark) that tells you which encoding it's using, which lets the program know things like how many bits there are per letter and which patterns mean which letters.
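The "it depends" point, made concrete in Python: the same four bytes mean completely different things depending on the type you read them as:

```python
import struct

raw = b"\x48\x69\x21\x00"

struct.unpack("<I", raw)[0]  # as a 32-bit integer: 2189640
struct.unpack("<f", raw)[0]  # as a 32-bit float: a meaningless tiny value
raw[:3].decode("ascii")      # as text: 'Hi!'
```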

1

Any-Growth8158 t1_je2dxol wrote

Bytes are usually organized into words, which can be multiple bytes and are the basic unit handled by computers (usually the primary width of the registers used by the CPU). The computer itself just performs the requested operation on the word, whether that's arithmetic, logic, a store, a rotation, or a shift. The computer does NOT care what the data represents; it just does what it's told.

Interpretation of the data is left up to the software. I (or my compiler) will frequently stuff multiple items within a single word. I do a lot of microcontroller work, and we are very limited in the amount of program and data memory available. My code will know that my data is located in bits 4 through 8 of the word, because I wrote the code and designed it that way. To access this data I need to do extra operations, like shifting the word 4 bits to the right and then masking (setting to zero) all the bits above the field. This leaves me with just the data from bits 4 through 8.

In the example above I've reduced the required data memory by packing the data into just the required bits; however, I've slowed down my code, since it requires extra operations to access the data. On modern computers, memory is essentially limitless and you'd never really bother to pack the data. Speed is more important, so you'd just put your 4 bits of data in their own word and waste the unused bits. (I'm talking simple program data/variables; if you're doing movies or something, you'll likely compress the hell out of it.)
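That shift-and-mask dance in Python (toy example: a 4-bit field stored in bits 4-7 of a byte):

```python
word = 0b1011_0110            # some packed byte

field = (word >> 4) & 0b1111  # shift right 4, mask off everything above
# field == 0b1011 == 11: just the bits this code was designed to find
```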

1

aqhgfhsypytnpaiazh t1_je3081x wrote

Modern computers, in terms of data storage and processing, basically only operate on bytes (groups of 8 binary digits [bits]). So at least in most cases you can assume that 00001001 should be treated as a single value.

Beyond that, it's really up to the software interacting with that data to determine how to process it. This is where file formats come into play. The file format is a specification that clearly defines how to interpret the data in a file. So it will tell you what each byte in a file means.

Sometimes the rules are very strict, like a format will say "Every byte of the file represents a character of the alphabet, here's an ANSI table that maps binary numbers to characters". Or it might be less rigid, like "The first section of the audio file is free text ANSI metadata, which ends when the null byte (00000000) is encountered. The next section..."

Without some context as to what the data represents, it's meaningless. Often this can be conveyed by following the conventions for file extensions - the part of the file name after the last dot (eg .txt is universally recognised as text data encoded with the ANSI or Unicode standards). Often there is also a specific pattern of data at the very beginning of the file (a magic number) that indicates what type of file it is. The file is stored in a file system, which is a particular arrangement of data on a storage device following file system standards. Programs are stored using standard data formats built into the operating system, which in turn send a series of electrical signals to the CPU and other processors following a standard instruction set. It's standards all the way down.

Binary data is ultimately just a series of binary digits - an abstract representation of on/off electrical signals - that the program (by way of the programmer and/or user) has to figure out what to do with. If your friend came to you and blurted out "Eleven! Seventy four! Two! Five thousand, nine hundred and sixty six!" it's not going to mean anything without context.

1

TMax01 t1_je31any wrote

It's predefined. There is a fascinating (or not) history to, and technical justification for, how technology developers settled on the 8-bit byte, and then on the 16-, 32-, 64-, and 128-bit word sizes built from it, but in every case the answer is the same: it's predefined how many digits the reader will consider.

1

urlang t1_je3balf wrote

The actual ELI5 answer: the same reason you understand 4803 as 4803 and not as 48 and 03.

  1. Separators: in this case, spaces or special sequences between words

  2. Conventions: we use byte-sized words, so each 8 bits is a separate word

Of course, these have to be agreed upon by the sender and receiver.

1

Alternative_Effort t1_je4nqdz wrote

I'll just add that during the floppy disk era, you couldn't even easily transfer a text document between systems. Every system had its own encoding scheme, not to mention its own disk formatting scheme. It was annoying.

1

idlebyte t1_je1xm20 wrote

Short Answer: There is an index.

Long Answer: The index is in a fixed location on the disk, so the drive knows where to look for it every time. For variable-width files (images, video, music, text docs), the index gives starting and stopping locations around the disk, since files are rarely contiguous these days, especially since SSDs came out. The drive then knows how to translate those starts, stops, and in-betweens into exact physical locations (positions on platters, cells on chips) where the 0s and 1s are stored, and streams the file as requested.

−1