2016-11-24

A Data Science and Machine Learning Knowledge Map and Resource Collection for Programmers

Table of Contents generated with DocToc

  • DataScience & Machine Learning Reference
  • Introduction & Overview
    • Collections: Resource Roundups
    • Video Courses
    • Blogs & Forums
    • Data Processing
    • Machine Learning
    • Natural Language Processing
    • Deep Learning
    • Recommender Systems
  • CrawlerSE: Crawlers and Search Engines
    • Search Engine
  • Data Visualization
    • Collections: Resource Roundups
      • Cross-Disciplinary Databases and Search Engines
    • Social Networks
    • Driving Data
    • Competition: Machine Learning Competitions

DataScience & Machine Learning Reference

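This article collects the resources the author gathered while learning data science. It focuses on introductory and survey-style material for each area rather than digging into the research frontier; readers who want more can follow the author's Data Science and Machine Learning Handbook for Coders. The article starts with an introduction to and overview of data science and machine learning, then lists resources, books, and tutorials, then gives reference articles for each specific area, and finally introduces a set of practical tools. The author's diagram of the data science and machine learning landscape, which belongs to the author's series on programming worldview and methodology, is shown below:

[Figure: the author's data science and machine learning worldview diagram]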

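This article will be updated as the author's perspective and skills grow through study and practice. The author is not a pure machine learning or data mining researcher, and looks mainly from an engineering angle for the parts that can be applied together with engineering work.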

Introduction & Overview

Introduction

Machine Learning

Deep Learning

Statistics

News: Industry and News

Application: Real-World Use Cases of Data Mining, Machine Learning, and Deep Learning

Resources

Collections: Resource Roundups

Books

Video Courses

Blogs & Forums

Methodology

Data Processing

Machine Learning

Natural Language Processing

Deep Learning

Application

Recommender Systems

CrawlerSE: Crawlers and Search Engines

Crawler

Search Engine

Toolkits

Language

Python

Java

Matlab

R

ClusterComputing

Data Visualization

Books

Video Courses

Toolkits

Data Sets

Collections: Resource Roundups

Single Databases

Cross-Disciplinary Databases and Search Engines

Text

  • 20 Newsgroups: The text from 20,000 messages taken from 20 Usenet newsgroups, for text analysis, classification, etc. 61.6MB
  • Amazon Reviews: Over 142 million product reviews for sentiment analysis, recommender systems, and more. 20GB
  • SMS Spam Collection: A collection of 5,574 SMS (text) messages, some spam, some normal, for spam filtering. 204KB. http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/
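
As a rough illustration of the spam-filtering use mentioned for the SMS Spam Collection, the sketch below trains a tiny naive Bayes classifier with add-one smoothing using only the Python standard library. It is not taken from the original list: the `label<TAB>message` line layout is an assumption about how such a file might be stored, the sample messages are made up rather than drawn from the dataset, and `parse_line`, `train`, and `predict` are hypothetical helper names.

```python
import math
from collections import Counter, defaultdict

# Hypothetical sample lines in an ASSUMED "label<TAB>message" layout.
# These messages are invented for illustration, not taken from the dataset.
SAMPLE_LINES = [
    "ham\tare we still meeting for lunch today",
    "ham\tcall me when you get home",
    "ham\tsee you at the meeting tomorrow",
    "spam\twin a free prize call now",
    "spam\tfree entry claim your prize now",
    "spam\turgent you have won a free phone",
]


def parse_line(line):
    """Split one 'label<TAB>message' line into (label, lowercase tokens)."""
    label, text = line.rstrip("\n").split("\t", 1)
    return label, text.lower().split()


def train(lines):
    """Count messages per label and word occurrences per label."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    for line in lines:
        label, tokens = parse_line(line)
        doc_counts[label] += 1
        word_counts[label].update(tokens)
    vocab = {word for counts in word_counts.values() for word in counts}
    return doc_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest smoothed log-probability."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        # Log prior: share of training messages carrying this label.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocab:
                continue  # skip words never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(SAMPLE_LINES)
print(predict(model, "claim your free prize"))   # spam
print(predict(model, "are you home for lunch"))  # ham
```

With a real copy of the dataset, the same `train` function could be fed the lines of the downloaded file instead of `SAMPLE_LINES`, provided the file actually follows the assumed layout; a proper evaluation would also hold out part of the data for testing.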

Social Networks

Media: Audio, Video, and Images

  • Labeled Faces in the Wild: 13,000 named faces for facial recognition. Multiple training and test sets. 173MB total
  • Mushroom Identification: For hypothetically classifying mushrooms as edible or poisonous based on their characteristics. 3 files, 480KB
  • NORB 3D Object Recognition: Binocular images of 50 toy figurines for 3D object recognition from images. Multiple files, over 5GB total
  • One Million Songs: Audio features and metadata for a subset (10,000) of the one million popular songs dataset, for recognition/classification. 1.8GB
  • Hate Speech Identification: A sampling of Twitter posts that have been judged on whether they are offensive or contain hate speech, as a training set for text analysis. 2.66MB
  • Hidden Beauty of Flickr Pictures: 15,000 Flickr photo IDs that have received ratings based on aesthetics, for image analysis. 138KB; use the Flickr API to get the images

Recognition

  • Human Activity Recognition with Smartphones: Sensor data for recognizing human activity (walking, sitting, etc.). 25MB. https://www.kaggle.com/uciml/human-activity-recognition-with-smartphones

Driving Data

Domain: Domain-Specific Data

Sports

  • Football Strategy: Thousands of scenarios to make the best coaching decisions. 876KB total
  • Horses for Courses: Horse-racing data for predicting race results. 19MB total
  • NBA & MLB Stats: Current and past season stats for teams and players, for fantasy sports predictions.

Medicine

Alien

  • UFO Reports: 80,000 historic reports for classification or regression. This dataset has been standardized from the source data at nuforc.org. 14.6MB total

Food and Drink

  • Wine Quality: Chemical properties of red and white wines (separately) and quality ratings, for classification. 3 files, 343KB total

Finance

Others

Competition: Machine Learning Competitions

Career

 

Source: https://github.com/wxyyxc1992/DataScience-And-MachineLearning-Handbook-For-Coders/blob/master/DataScience-Reference.md

 
