Data Science Tools
Language of Data Science
Open Source vs Free Software
Python is Open Source, R is Free Software.
Similarities:
- Both commonly refer to the same set of licenses
- Both support collaboration
- In many cases these terms can be used interchangeably
Differences:
- The Open Source Initiative (OSI) champions open source, while the Free Software Foundation (FSF) defines free software
- Open Source is more business focused while Free Software is more focused on a set of values
Useful Languages
- Python
- R
- SQL
- Scala
- A general-purpose programming language that provides support for functional programming
- Designed as an extension to Java; it is interoperable with Java because it also runs on the JVM
- The name Scala comes from "Scalable Language"
- Apache Spark is written in Scala
- Java
- Weka (data mining)
- Java-ML (machine learning library)
- Apache Spark MLlib (scalable machine learning library)
- Deeplearning4j
- Hadoop
- C++
- TensorFlow
- MongoDB
- Caffe
- Julia
- Designed at MIT
- Compiled Language
- JuliaDB
- JavaScript
- TensorFlow.js makes machine learning and deep learning possible in the browser and Node.js
- R-js makes linear algebra possible in TypeScript
- PHP
- Go
- Ruby
- Visual Basic
Open Source Tools
Data Management
Relational Databases:
- MySQL
- PostgreSQL
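
As a rough illustration of what a relational database does, the sketch below stores a few of the tools named in these notes in a table and queries it with SQL. It uses Python's built-in sqlite3 module purely as a stand-in, since MySQL and PostgreSQL need a running server. The table name `tools` and its columns are assumptions made for this example.

```python
import sqlite3

# In-memory SQLite database, used here only as a stand-in for a
# server-based relational database such as MySQL or PostgreSQL.
conn = sqlite3.connect(":memory:")

# Hypothetical table: one row per tool, with its category.
conn.execute("CREATE TABLE tools (name TEXT PRIMARY KEY, category TEXT)")
conn.executemany(
    "INSERT INTO tools VALUES (?, ?)",
    [
        ("MySQL", "relational"),
        ("PostgreSQL", "relational"),
        ("MongoDB", "nosql"),
    ],
)

# Query the table with a parameterized SQL statement.
cursor = conn.execute(
    "SELECT name FROM tools WHERE category = ? ORDER BY name",
    ("relational",),
)
relational = [row[0] for row in cursor]
conn.close()

print(relational)
```

The same SQL would look very similar against MySQL or PostgreSQL; mainly the connection step and the placeholder syntax of the driver would differ.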
NoSQL Databases:
- MongoDB
- Apache CouchDB
- Apache Cassandra
File-based Tools:
- Hadoop Distributed File System (HDFS)
- Ceph
Data Integration and Transformation
- Apache Airflow
- Kubeflow
- Apache Kafka
- Apache NiFi
- Apache Spark SQL
- Node-RED
Data Visualization
- Hue
- Kibana
- Apache Superset
Model Deployment
- Apache PredictionIO
- Seldon
- MLeap
- TensorFlow Serving
Model Monitoring and Assessment
- ModelDB
- Prometheus
- IBM Research Trusted AI
- IBM Adversarial Robustness Toolbox
- IBM AI Explainability 360
Code Asset Management
- git
- GitHub
- GitLab
- Bitbucket
Data Asset Management
- Apache Atlas
- ODPi Egeria
- Kylo
Development Environments
- Jupyter Notebook
- JupyterLab
- Apache Zeppelin
- RStudio
- Spyder
Execution Environments
- Apache Spark
- Apache Flink
- Ray
Fully Integrated Visual Tools
- KNIME
- Orange
Commercial Tools
Data Management Tools
- Oracle Database
- Microsoft SQL Server
- IBM Db2
Data Integration and Transformation Tools
- Informatica PowerCenter
- IBM InfoSphere DataStage
- IBM Watson Studio Desktop
Data Visualization Tools
- Tableau
- Microsoft Power BI
- IBM Cognos Analytics
- IBM Watson Studio Desktop
Model Building Tools
- SPSS Modeler
- SAS Enterprise Miner
- IBM Watson Studio Desktop
Data Asset Management Tools
- Informatica
- IBM InfoSphere
Development Environment Tools
- IBM Watson Studio Desktop
Fully Integrated Visual Tools
- Watson Studio
- Watson OpenScale
Cloud Based Tools
Fully Integrated Visual Tools and Platforms
- Watson Studio
- Watson OpenScale
- Azure Machine Learning
- H2O Driverless AI
Data Management
- Amazon DynamoDB
- Cloudant
Data Integration and Transformation Tools
- Informatica
- IBM Data Refinery
Model Building Tools
- IBM Watson Machine Learning
- Google Cloud ML Engine
Libraries
Scientific Computing Libraries in Python
- Pandas
- NumPy
Data Visualization Libraries in Python
- Matplotlib
- Seaborn
- Plotly
Machine Learning and Deep Learning Libraries in Python
- Scikit-learn
- Keras
- TensorFlow
- PyTorch
Scala Libraries
- Vegas
- BigDL
R Libraries
- ggplot2
- Keras
- TensorFlow
Data Sets
What is a Data Set?
- Collection of data
- Data Structures
- Tabular Data
- Hierarchical Data
- Network Data
- Raw files
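
The four kinds of data structure listed above can be sketched with Python's standard library alone. This is a minimal illustration, not a recommended storage layout: every value is toy data invented for the example, and the names `rows`, `tree`, `graph`, `raw` and `summarize` are hypothetical.

```python
import csv
import io
import json

# Tabular data: rows and columns, parsed here from CSV text.
csv_text = "item,score\na,1\nb,2\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Hierarchical data: nested records, parsed here from JSON text.
json_text = '{"name": "root", "children": [{"name": "a"}, {"name": "b"}]}'
tree = json.loads(json_text)

# Network data: nodes joined by edges, stored as an undirected adjacency list.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}

# Raw file: uninterpreted bytes, as an image or audio file would hold.
raw = bytes([0x00, 0x01, 0x02, 0x03])


def summarize():
    """Return a few simple facts about each toy structure."""
    return {
        "tabular_rows": len(rows),
        "tabular_columns": list(rows[0].keys()),
        "hierarchy_children": [child["name"] for child in tree["children"]],
        # Each undirected edge appears twice in the adjacency list.
        "network_edges": sum(len(n) for n in graph.values()) // 2,
        "raw_bytes": len(raw),
    }


print(summarize())
```

Note that csv.DictReader returns every field as a string, so the scores above are "1" and "2" until they are converted explicitly.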
Data Ownership
- Private data
- Confidential
- Private or personal information
- Commercially sensitive
- Open data
- Scientific institutions
- Governments
- Organizations
- Companies
- Publicly available
Where to find open data
- Open data portal list from around the world
- datacatalogs.org
- Governmental, intergovernmental and organization websites
- data.un.org
- data.gov
- europeandataportal.eu
- Kaggle
- kaggle.com/
- Google Dataset Search