Submitted by dmart89 t3_119o54q in MachineLearning

I'm thinking about building an open source library to generate structured ML datasets from sources across the internet.

I know that lots of projects use crawlers to get decent datasets. While you might still need to create your own for specific use cases, I'm wondering whether it'd be useful to have an open source library that lets you launch crawlers with predefined schemas for popular sources like LinkedIn, YouTube (I know YT also has an API), Shopify stores, Twitter, Reddit, news sites and more.

Kind of like a unified interface with extendable starter templates.

The lib would dump JSON objects into a location you specify, like your local machine, MongoDB, or S3.

Something like:

{ "title": "some video", "source": "https://youtube.com/jfg78", "views": 245676, "comments": {} }
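As a rough sketch of how a predefined schema plus a pluggable output location could fit together (every name here, such as `VideoRecord`, `Sink`, `LocalJsonSink` and `dump`, is hypothetical and only for illustration; a MongoDB or S3 sink would be another class with the same `write` method):

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Protocol


@dataclass
class VideoRecord:
    """Hypothetical predefined schema for a video source."""
    title: str
    source: str
    views: int
    comments: dict = field(default_factory=dict)


class Sink(Protocol):
    """Anything that can persist one record (local disk, a database, object storage)."""

    def write(self, record_id: str, record: dict) -> None: ...


class LocalJsonSink:
    """Writes one JSON file per record under a local directory."""

    def __init__(self, root: str) -> None:
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def write(self, record_id: str, record: dict) -> None:
        path = self.root / f"{record_id}.json"
        path.write_text(json.dumps(record, indent=2), encoding="utf-8")


def dump(record: VideoRecord, record_id: str, sink: Sink) -> None:
    """Convert the schema object to a plain dict and hand it to the sink."""
    sink.write(record_id, asdict(record))
```

Calling `dump(VideoRecord("some video", "https://youtube.com/jfg78", 245676), "jfg78", LocalJsonSink("./out"))` would write `./out/jfg78.json`; swapping the sink changes the destination without touching the crawler.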

Goal would be to make it easier/faster to get datasets from sources that don't natively have an api.

This might be a useless idea, but would love to hear your thoughts.

6

Comments


KPTN25 t1_j9p8zgp wrote

Good luck crawling LinkedIn. Not saying it's impossible, but you'll definitely be making your life difficult if you try to publish a tool that scrapes LI.

2

dmart89 OP t1_j9po770 wrote

It would just be a library, not a commercial tool. Anyone using it would be scraping themselves, not via a 3rd party service or something.

1

Sal-Hardin t1_j9oqwmt wrote

How do you envision searching?

1

dmart89 OP t1_j9oxk6s wrote

Probably keep it simple to start with and just use filters during the crawl.
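Roughly, filtering during the crawl could mean applying simple predicates to each record as it comes in, so non-matching records are never stored. A minimal sketch, assuming records are plain dicts; `crawl_with_filters`, `min_views` and `title_contains` are made-up names, not an existing API:

```python
from typing import Callable, Iterable, Iterator

Record = dict
Filter = Callable[[Record], bool]


def crawl_with_filters(records: Iterable[Record], filters: list[Filter]) -> Iterator[Record]:
    """Yield only the records that pass every filter, one at a time."""
    for record in records:
        if all(f(record) for f in filters):
            yield record


def min_views(threshold: int) -> Filter:
    """Keep records with at least `threshold` views (missing views count as 0)."""
    return lambda r: r.get("views", 0) >= threshold


def title_contains(keyword: str) -> Filter:
    """Keep records whose title contains `keyword`, case-insensitively."""
    return lambda r: keyword.lower() in r.get("title", "").lower()
```

Because it is a generator, `records` can be the live output of a crawler and nothing has to be held in memory before filtering.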

2

muwnd t1_j9qvxwf wrote

Better to save yourself all the crawling trouble and use data from Common Crawl, so you can focus on the extraction part.
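To give an idea of what "just the extraction part" might look like once you already have a page's HTML (for example from an archived crawl), here is a small sketch using only the Python standard library. It only pulls the page title; `TitleExtractor` and `extract_record` are illustrative names, and reading the archive files themselves is not covered:

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collects the text inside the <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self.title_parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so collect and join later.
        if self._in_title:
            self.title_parts.append(data)


def extract_record(url: str, html: str) -> dict:
    """Turn already-downloaded HTML into a small structured record."""
    parser = TitleExtractor()
    parser.feed(html)
    parser.close()
    return {"title": "".join(parser.title_parts).strip(), "source": url}
```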

1

noxiousmomentum t1_j9nil84 wrote

useless. what can easily be done needs no automation and what is hard to do isn't helped by this approach

−7

dmart89 OP t1_j9nkm2u wrote

Fair. Thanks for your thoughts. I personally find constructing scrapers and parsing data annoyingly tedious, but it's probably just me (:

2

ch9ki7 t1_j9nw6hu wrote

building and maintaining scrapers is tedious! I would also like some better solution. the idea is not bad, just maybe difficult to solve.

3

dmart89 OP t1_j9olr3r wrote

Possibly, yes, I would need to check. I recently built parsing services for TikTok, and it was super annoying to deal with.

1

ch9ki7 t1_j9oqe44 wrote

maybe something like scraperapi but with some kind of DSL one could send as a POST payload.

but another problem is that you often need a scraped result as input for another request
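One way such a DSL could handle chained requests is to let each step name the values it extracts, and let later steps reference them in their URL template. A minimal sketch under those assumptions; the step format and `run_pipeline` are invented for illustration, and the `fetch` function is injected so the example does not depend on any real HTTP client or site:

```python
from typing import Callable, Optional

# A fetcher takes a URL and returns an already-parsed response as a dict.
Fetcher = Callable[[str], dict]


def run_pipeline(steps: list[dict], fetch: Fetcher, initial: Optional[dict] = None) -> dict:
    """Run steps in order; values extracted by one step feed later URL templates."""
    context = dict(initial or {})
    for step in steps:
        # Fill placeholders like {item_id} with values collected so far.
        url = step["url"].format(**context)
        response = fetch(url)
        # "extract" maps a context name to a key in the response.
        for name, key in step.get("extract", {}).items():
            context[name] = response[key]
    return context


# The declarative part is plain data, so it could be sent as a JSON payload.
steps = [
    {"url": "https://example.com/search?q={query}", "extract": {"item_id": "first_id"}},
    {"url": "https://example.com/item/{item_id}", "extract": {"title": "title"}},
]
```

Here the second request cannot be built until the first has returned `first_id`, which is the dependency described above.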

1

step21 t1_j9nwh4u wrote

Also, some of it might give you legal trouble if you, for example, make a public crawler for LinkedIn

2

dmart89 OP t1_j9olf7e wrote

There was a court ruling a year or two ago that concluded that scraping public LinkedIn profiles is legal :) LinkedIn obviously still doesn't want you to scrape their data, so building scrapers for it is extra tedious because you need to navigate their blocking.

2

KPTN25 t1_j9qy2xi wrote

> court ruling a year or two ago that concluded that scraping public linkedin profiles is legal

Forgot about this. I may be dating myself with problems of the past.

Still imagine they're doing their best to make it really hard to do, though.

1