Data Science Tools

Language of Data Science

Open Source vs Free Software

Python is Open Source, R is Free Software.

Similarities:

  • Both commonly refer to the same set of licenses
  • Both support collaboration
  • In many cases these terms can be used interchangeably

Differences:

  • The Open Source Initiative (OSI) champions open source, while the Free Software Foundation (FSF) defines free software
  • Open Source is more business focused while Free Software is more focused on a set of values

Useful Languages

  • Python
  • R
  • SQL

  • Scala
      • A general-purpose programming language that provides support for functional programming
      • Designed as an extension to Java; it interoperates with Java because it also runs on the JVM
      • The name Scala comes from "Scalable Language"
      • Apache Spark is written in Scala
  • Java
      • Weka (data mining)
      • Java-ML (machine learning library)
      • Apache MLlib (scalable machine learning library)
      • Deeplearning4j
      • Hadoop
  • C++
      • TensorFlow
      • MongoDB
      • Caffe
  • Julia
      • Designed at MIT
      • Compiled language
      • JuliaDB
  • JavaScript
      • TensorFlow.js makes machine learning and deep learning possible in the browser and in Node.js
      • R-js makes linear algebra possible in TypeScript
  • PHP
  • Go
  • Ruby
  • Visual Basic

Open Source Tools

Data Management

Relational Databases:

  • MySQL
  • PostgreSQL

NoSQL Databases:

  • MongoDB
  • Apache CouchDB
  • Apache Cassandra

File-based Tools:

  • Hadoop Distributed File System (HDFS)
  • Ceph

Data Integration and Transformation

  • Apache Airflow
  • Kubeflow
  • Apache Kafka
  • Apache NiFi
  • Apache Spark SQL
  • Node-RED

Data Visualization

  • Hue
  • Kibana
  • Apache Superset

Model Deployment

  • Apache PredictionIO
  • Seldon
  • MLeap
  • TensorFlow Serving

Model Monitoring and Assessment

  • ModelDB
  • Prometheus
  • IBM Research Trusted AI
  • IBM Adversarial Robustness Toolbox
  • IBM AI Explainability 360

Code Asset Management

  • Git
  • GitHub
  • GitLab
  • Bitbucket

Data Asset Management

  • Apache Atlas
  • ODPi Egeria
  • Kylo

Development Environments

  • Jupyter Notebook
  • JupyterLab
  • Apache Zeppelin
  • RStudio
  • Spyder

Execution Environments

  • Apache Spark
  • Apache Flink
  • Ray

Fully Integrated Visual Tools

  • KNIME
  • Orange

Commercial Tools

Data Management Tools

  • Oracle Database
  • Microsoft SQL Server
  • IBM Db2

Data Integration and Transformation Tools

  • Informatica PowerCenter
  • IBM InfoSphere DataStage
  • IBM Watson Studio Desktop

Data Visualization Tools

  • Tableau
  • Microsoft Power BI
  • IBM Cognos Analytics
  • IBM Watson Studio Desktop

Model Building Tools

  • SPSS Modeler
  • SAS Enterprise Miner
  • IBM Watson Studio Desktop

Data Asset Management Tools

  • Informatica
  • IBM InfoSphere

Development Environment Tools

  • IBM Watson Studio Desktop

Fully Integrated Visual Tools

  • Watson Studio
  • Watson OpenScale

Cloud Based Tools

Fully Integrated Visual Tools and Platforms

  • Watson Studio
  • Watson OpenScale
  • Azure Machine Learning
  • H2O Driverless AI

Data Management

  • Amazon DynamoDB
  • Cloudant

Data Integration and Transformation Tools

  • Informatica
  • IBM Data Refinery

Model Building Tools

  • IBM Watson Machine Learning
  • Google Cloud ML Engine

Libraries

Scientific Computing Libraries in Python

  • Pandas
  • NumPy

Data Visualization Libraries in Python

  • Matplotlib
  • Seaborn
  • Plotly

Machine Learning and Deep Learning Libraries in Python

  • Scikit-learn
  • Keras
  • TensorFlow
  • PyTorch

Scala Libraries

  • Vegas
  • BigDL

R Libraries

  • ggplot2
  • Keras
  • TensorFlow

Data Sets

What is a Data Set?

  • Collection of data
  • Data Structures
      • Tabular Data
      • Hierarchical Data
      • Network Data
  • Raw files
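
To make the tabular, hierarchical and network data structures listed above more concrete, here is a minimal sketch using only built-in Python types. All values and the helper names (column, count_nodes, degree) are made up for illustration; real projects would more likely reach for libraries such as Pandas.

```python
# Tabular data: every row has the same columns (toy values).
tabular = [
    {"id": 1, "name": "alice", "score": 90},
    {"id": 2, "name": "bob", "score": 75},
]

# Hierarchical data: records nest inside one another, forming a tree.
hierarchical = {
    "name": "root",
    "children": [
        {"name": "a", "children": [{"name": "a1", "children": []}]},
        {"name": "b", "children": []},
    ],
}

# Network data: nodes connected by edges, stored here as an adjacency list.
network = {
    "alice": ["bob", "carol"],
    "bob": ["alice"],
    "carol": ["alice"],
}


def column(rows, field):
    """Return one column of a tabular data set as a list."""
    return [row[field] for row in rows]


def count_nodes(tree):
    """Count the nodes of a hierarchical data set, including the root."""
    return 1 + sum(count_nodes(child) for child in tree["children"])


def degree(graph, node):
    """Return how many neighbours a node has in a network data set."""
    return len(graph[node])


print(column(tabular, "score"))
print(count_nodes(hierarchical))
print(degree(network, "alice"))
```

Each shape suggests its own kind of question: columns for tabular data, tree walks for hierarchical data, and neighbour lookups for network data.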

Data Ownership

  • Private data
      • Confidential
      • Private or personal information
      • Commercially sensitive
  • Open data
      • Scientific institutions
      • Governments
      • Organizations
      • Companies
      • Publicly available

Where to find open data

  • Open data portal list from around the world
      • datacatalogs.org
  • Governmental, intergovernmental and organization websites
      • data.un.org
      • data.gov
      • europeandataportal.eu
  • Kaggle
      • kaggle.com
  • Google Dataset Search