matth0x01

matth0x01 t1_j7ayc9e wrote

Seems like you're more interested in the crawling and ETL side.

Maybe you should look more into data warehouse or data lake literature, especially the paradigm shift from ETL (extract, transform, load) to ELT (extract, load, transform) and, closely related, schema-on-read.
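
To make the ELT / schema-on-read idea a bit more concrete, here is a minimal sketch. The records, field names, and the `read_with_schema` helper are all made up for illustration; a real data lake would sit on object storage and a query engine, not a Python list.

```python
import json

# Hypothetical raw crawl records, stored as-is with no upfront schema
# (the "load" step happens before any "transform").
raw_store = [
    '{"url": "https://example.com/a", "title": "A", "fetched": "2023-02-01"}',
    '{"url": "https://example.com/b", "body": "text only"}',
]


def read_with_schema(raw_lines, fields):
    """Apply a schema at read time: keep the requested fields, fill gaps with None."""
    for line in raw_lines:
        record = json.loads(line)
        yield {name: record.get(name) for name in fields}


# Each consumer decides which fields it cares about when it reads the data.
rows = list(read_with_schema(raw_store, ["url", "title"]))
for row in rows:
    print(row)
```

The point is that nothing gets thrown away at ingestion time, so a later consumer can ask for `body` or `fetched` without re-crawling.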

2

matth0x01 t1_j76dt6k wrote

Depends a bit on your skill level and what you want to achieve.

I started with the book Introduction to Information Retrieval (2008), which I found quite math-heavy at the time. But I learned a lot, and it was a good starting point.

It covers concepts like decompounding, the inverted index, ranking functions, etc.
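
As a rough sketch of how the index and ranking ideas fit together (not taken from the book; the toy documents and the plain TF-IDF scoring are my own assumptions, and real engines typically use something like BM25 plus proper tokenization):

```python
import math
from collections import Counter, defaultdict

# Toy corpus, made up for illustration.
docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "the quick dog jumps",
}


def build_index(documents):
    """Map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index


def search(index, query, n_docs):
    """Score documents by summed TF-IDF over the query terms."""
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue
        # Terms that occur in every document get an IDF of zero.
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


index = build_index(docs)
results = search(index, "quick dog", len(docs))
print(results)
```

Document 3 contains both query terms, so it ranks first; the lookup only touches the posting lists for "quick" and "dog" instead of scanning every document.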

Newer IR strategies use word2vec-style embeddings for item representation instead of handcrafted features, or learn the search ranking function directly (learning to rank), which is a different beast compared to traditional search engines.
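
A very small sketch of the embedding side of that, assuming you already have vectors for your items. The three-dimensional vectors below are invented by hand just to show the mechanics; real ones would come from a trained model such as word2vec.

```python
import math

# Hand-made toy vectors, NOT the output of a trained model.
embeddings = {
    "laptop": [0.9, 0.1, 0.0],
    "notebook": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_items(query_term, vectors):
    """Rank all other items by similarity to the query item's vector."""
    query_vec = vectors[query_term]
    others = [
        (name, cosine(query_vec, vec))
        for name, vec in vectors.items()
        if name != query_term
    ]
    return sorted(others, key=lambda pair: pair[1], reverse=True)


ranked = rank_items("laptop", embeddings)
print(ranked)
```

Here "notebook" ranks above "banana" even though the strings share no terms, which is exactly what a purely term-based index can't give you.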

1