Submitted by dmart89 t3_119o54q in MachineLearning

I'm thinking about building an open source library to generate structured ML datasets from sources across the internet.

I know that lots of projects use crawlers to get decent datasets. While you might still need to write your own for specific use cases, I'm wondering whether it'd be useful to have an open source library that lets you launch crawlers with predefined schemas for popular sources like LinkedIn, YouTube (I know YT also has an API), Shopify stores, Twitter, Reddit, news sites and more.

Kind of like a unified interface with extendable starter templates.

The lib would dump JSON objects into a location you specify, like your local machine, MongoDB, or S3.

Something like:

{ "title": "some video", "source": "https://youtube.com/jfg78", "views": 245676, "comments": {} }
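To make the idea a bit more concrete, here is a minimal sketch of what a predefined schema plus a pluggable output location could look like. Nothing here is an existing library: `VideoRecord`, `LocalJsonlSink`, `run_crawler` and `fake_youtube_crawler` are all hypothetical names, and the crawler is a stub that just returns the example object instead of hitting the network. A MongoDB or S3 sink would only need the same `write` method.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Callable, Iterable, Protocol


@dataclass
class VideoRecord:
    """Hypothetical predefined schema, mirroring the example object."""
    title: str
    source: str
    views: int
    comments: dict = field(default_factory=dict)


class Sink(Protocol):
    """Anything that can persist records (local file, MongoDB, S3, ...)."""
    def write(self, records: Iterable[VideoRecord]) -> int: ...


class LocalJsonlSink:
    """Appends one JSON object per line to a file on the local machine."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)

    def write(self, records: Iterable[VideoRecord]) -> int:
        count = 0
        with self.path.open("a", encoding="utf-8") as fh:
            for record in records:
                fh.write(json.dumps(asdict(record)) + "\n")
                count += 1
        return count


def fake_youtube_crawler() -> Iterable[VideoRecord]:
    """Stub crawler (assumption): a real one would fetch and parse pages."""
    yield VideoRecord(
        title="some video",
        source="https://youtube.com/jfg78",
        views=245676,
    )


def run_crawler(crawl: Callable[[], Iterable[VideoRecord]], sink: Sink) -> int:
    """Runs a crawler and hands its records to the sink; returns the count."""
    return sink.write(crawl())
```

Calling `run_crawler(fake_youtube_crawler, LocalJsonlSink("videos.jsonl"))` would append one line to `videos.jsonl`. Keeping the schema and the sink separate is what would let starter templates be extended without touching the storage code.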

Goal would be to make it easier/faster to get datasets from sources that don't natively have an api.

This might be a useless idea, but would love to hear your thoughts.

6

Comments


noxiousmomentum t1_j9nil84 wrote

Useless. What can easily be done needs no automation, and what is hard to do isn't helped by this approach.

−7

dmart89 OP t1_j9olf7e wrote

There was a court ruling a year or two ago that concluded that scraping public LinkedIn profiles is legal :) LinkedIn obviously still doesn't want you to scrape their data, so building scrapers for it is extra tedious because you need to navigate their blocking.

2

ch9ki7 t1_j9oqe44 wrote

Maybe something like ScraperAPI, but with some kind of DSL you could send as a POST payload.

But another problem is that you often need a scraped result as input for another request.
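A rough sketch of how a DSL could handle that chaining, assuming each step's URL is a template that can reference values saved by earlier steps. The plan format, `run_plan`, and the `example.com` URLs are made up for illustration, and `fetch` stands in for whatever actually downloads a page and extracts the value.

```python
from typing import Callable, Dict, List

# Hypothetical plan: each step has a URL template and the key under
# which its extracted result is stored for later steps to reference.
PLAN: List[Dict[str, str]] = [
    {"url": "https://example.com/channel/{channel_id}", "save_as": "latest_video_id"},
    {"url": "https://example.com/video/{latest_video_id}", "save_as": "title"},
]


def run_plan(
    plan: List[Dict[str, str]],
    fetch: Callable[[str], str],
    context: Dict[str, str],
) -> Dict[str, str]:
    """Runs steps in order, feeding each result into later URL templates."""
    ctx = dict(context)  # copy so the caller's dict is not mutated
    for step in plan:
        url = step["url"].format(**ctx)  # KeyError if a needed value is missing
        ctx[step["save_as"]] = fetch(url)
    return ctx
```

Since the plan is plain data, it could be serialized to JSON and sent as the POST body, with the service running the steps server-side.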

1

KPTN25 t1_j9p8zgp wrote

Good luck crawling LinkedIn. Not saying it's impossible, but you'll definitely be making your life difficult if you try to publish a tool that scrapes from LI.

2

muwnd t1_j9qvxwf wrote

Better to save yourself all the crawling trouble and use data from Common Crawl, so you can focus on the extraction part.

1

KPTN25 t1_j9qy2xi wrote

> court ruling a year or two ago that concluded that scraping public linkedin profiles is legal

Forgot about this. I may be dating myself with problems of the past.

Still imagine they're doing their best to make it really hard to do, though.

1