
InsidiousApe t1_j634sxw wrote

I enjoy that this is the simple questions thread. :)

Let me ask something much simpler, although in three parts. I am a web developer with no ML experience, but with a specific project in mind. I'd like to understand the process a touch better in order to help me find a programmer to work alongside (paid, of course).

(1) Provided the information is easily accessible, via an API for instance, what does the ingestion process look like time-wise for very large amounts of information? I realize that depends on the sheer size of the data, but are there other things going on in that process which take time?

(2) What is the process for programming a system to look for correlations in data where no one may have seen them before? This is what I'm truly looking to do once that information is taken in. For example, a ton of (HIPAA-compliant) medical information is ingested, and I want to build a system that can look for commonalities among people with a thyroid tumor. Obviously the results would need tons of tweaking, but what is the process that allows this to happen?


trnka t1_j6583q3 wrote

If you're ingesting from an API, the limiting factor is typically the number of API calls or network round trips. So if there's a "search" API or anything similar that returns paginated data, that'll speed things up a LOT.

If you instead need to traverse the API to crawl the data, that'll slow things down a lot: say there's a separate "game" endpoint, a "player" endpoint, a "map" endpoint, etc., each requiring its own round trip.
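To make that concrete, here's a rough sketch of the fast path in Python. The base URL, endpoint name, and pagination parameters are all made up; real APIs vary:

```python
import requests

BASE_URL = "https://api.example.com"  # hypothetical API


def fetch_all_records(page_size=100):
    """One network round trip per page of records,
    instead of one round trip per record."""
    records, page = [], 1
    while True:
        resp = requests.get(
            f"{BASE_URL}/search",
            params={"page": page, "per_page": page_size},
            timeout=30,
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        records.extend(batch)
        page += 1
    return records

# The slow path is crawling one record at a time: hitting a
# "player" endpoint once per player means thousands of round
# trips, and network latency dominates the total time.
```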

If you're working with image data, fetching the images is usually a separate step that can be slow.

After that, if you can fit it in RAM you're good. If it fits on one disk, each ML framework has decent libraries to efficiently load from disk in batches, and you can probably optimize the disk loading too.
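If the data doesn't fit in RAM but does fit on disk, reading it in fixed-size chunks is often enough before reaching for a framework's loaders. A minimal sketch with Pandas (the file path and per-batch function are placeholders):

```python
import pandas as pd


def process(chunk):
    """Placeholder for your per-batch work,
    e.g. cleaning or feature extraction."""
    print(len(chunk))


# Stream a CSV too large for RAM in batches of 10,000 rows
# instead of loading the whole file at once.
for chunk in pd.read_csv("records.csv", chunksize=10_000):
    process(chunk)
```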

----

What you're describing is usually called exploratory data analysis, but it depends on the general direction you want to go in. If you're trying to identify people with thyroid cancer earlier, for example, you might compare the data of recently-diagnosed people to similar people who have been tested and found not to have thyroid cancer. Personally, in that situation I like to just train a logistic regression model to predict the diagnosis from various patient properties, then check whether it's predictive on a held-out data sample. If it is, I look at the coefficients of the features to understand what's going on, then work on improving the features.
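As a sketch of that workflow with scikit-learn (assuming a DataFrame `df` with one row per patient, numeric feature columns, and a `has_thyroid_cancer` label; the column names are made up):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# df is assumed to already be loaded (see the ingestion step above).
X = df.drop(columns=["has_thyroid_cancer"])
y = df["has_thyroid_cancer"]

# Hold out 20% of patients to check whether the model is
# predictive on data it hasn't seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("held-out AUC:",
      roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

# If it's predictive, inspect the coefficients to see which
# features drive the prediction (scale the features first if you
# want the magnitudes to be comparable across features).
for name, coef in sorted(zip(X.columns, model.coef_[0]),
                         key=lambda t: -abs(t[1])):
    print(f"{name}: {coef:+.3f}")
```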

Another simple thing you can do, if the data is small enough and tabular rather than text/image/video/audio, is to load it up in Pandas, run `.corr()`, and check the correlations with the column you care about (has_thyroid_cancer).
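For example (placeholder file path; assumes the label column is numeric 0/1):

```python
import pandas as pd

df = pd.read_csv("patients.csv")  # placeholder path

# Correlation of every numeric column with the label, sorted by
# strength while keeping the sign.
corrs = df.corr(numeric_only=True)["has_thyroid_cancer"]
print(corrs.reindex(corrs.abs().sort_values(ascending=False).index))
```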

Hope this helps! Happy to follow up too.


InsidiousApe t1_j658w0e wrote

This was exactly the kind of answer I was hoping for: a great place to start more research. Thanks!
