Data science includes work in computation, statistics, analytics, data mining, and programming. We hear and use many terms in our industry every day, but what do they actually mean? We have compiled a glossary of common statistics and data science terms that everyone working in the field should know.
Algorithm: A mathematical formula or statistical process used to perform analysis of data.
Application Program Interface (API): A set of programming standards and instructions for accessing or building web-based software applications.
Artificial Intelligence (AI): The ability of a computer program or a machine to think and learn. The term is used to describe machines (or computers) that mimic “cognitive” functions that humans associate with the human mind, such as “learning” and “problem solving”.
Big Data: Extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions.
Cloud Computing: The on-demand availability of computer system resources, especially data storage and computing power, without direct active management by the user.
Data Lake: A centralized repository that stores huge amounts of structured and unstructured (raw) data.
Deep Learning: A more advanced form of machine learning based on neural networks with multiple hidden layers between the input and output layers, as opposed to shallow networks with at most one hidden layer.
Exploratory Data Analysis (EDA): An approach to analyzing data sets to summarize their main characteristics, often with visual methods such as plots and charts.
Fuzzy Logic: An approach to computing meant to mimic human reasoning by working with degrees of partial truth between 0 and 1, rather than the usual strict true or false (1 or 0).
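As a rough illustration of partial truth, the sketch below assigns a temperature a degree of membership between 0 and 1 in two fuzzy sets. The triangular membership shape, the set boundaries, and the names `triangular`, `warm`, and `hot` are assumptions made for this example; taking the minimum for AND and the maximum for OR is one common convention, not the only one.

```python
def triangular(x, low, peak, high):
    """Degree (0.0 to 1.0) to which x belongs to a triangular fuzzy set."""
    if x <= low or x >= high:
        return 0.0
    if x <= peak:
        return (x - low) / (peak - low)
    return (high - x) / (high - peak)


# Hypothetical fuzzy sets for a temperature in degrees Celsius.
def warm(t):
    return triangular(t, 15.0, 25.0, 35.0)


def hot(t):
    return triangular(t, 25.0, 40.0, 55.0)


# One common choice of fuzzy operators (assumed here).
def fuzzy_and(a, b):
    return min(a, b)


def fuzzy_or(a, b):
    return max(a, b)


def fuzzy_not(a):
    return 1.0 - a


t = 30.0
print("warm:", warm(t))                      # partly warm
print("hot:", hot(t))                        # partly hot
print("warm AND hot:", fuzzy_and(warm(t), hot(t)))
print("warm OR hot:", fuzzy_or(warm(t), hot(t)))
```

A temperature of 30 degrees is neither fully warm nor fully hot here; it belongs to both sets to some degree, which is the point of the approach.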
ggplot2: A data visualization package for the statistical programming language R. It implements the Grammar of Graphics, a general scheme for data visualization which breaks up graphs into semantic components such as scales and layers.
Hadoop: An open-source, Java-based software framework for storing and processing massive amounts of data and running applications in distributed computing environments.
Iteration: A single update of an algorithm’s parameters while training a model on a dataset; the iteration count is the number of such updates. For example, each iteration of training a neural network takes a certain number of training examples (a batch) and updates the weights by using gradient descent or some other weight update rule.
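To make the idea concrete, here is a minimal sketch that counts iterations while fitting a single weight with mini-batch gradient descent. The toy data, the batch size, the learning rate, and the function name `train` are all assumptions chosen for illustration, and a one-weight model stands in for a neural network.

```python
# Toy data that follows y = 3x exactly, so the ideal weight is 3.0.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]


def train(data, batch_size=2, epochs=50, learning_rate=0.05):
    """Fit y = w * x and count how many parameter updates were made."""
    w = 0.0
    iterations = 0
    for _ in range(epochs):
        # Each pass over the full data set is one epoch.
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of the mean squared error with respect to w.
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= learning_rate * grad
            iterations += 1  # one parameter update = one iteration
    return w, iterations


w, iterations = train(data)
print("learned weight:", w)
print("iterations:", iterations)
```

With 4 examples and a batch size of 2, each epoch contains 2 iterations, so 50 epochs give 100 iterations in total.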
k-means clustering: A method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.
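The definition above can be sketched with the standard iterative procedure (often called Lloyd's algorithm): assign each observation to its nearest mean, recompute the means, and repeat until nothing changes. The sample points, the starting means, and the function name `kmeans` are assumptions for this example; production code would normally use a library and a more careful initialization.

```python
import math


def kmeans(points, centroids, max_iterations=100):
    """Partition points into len(centroids) clusters around their means."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster with the nearest mean.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each mean to the average of its members.
        new_centroids = []
        for members, old in zip(clusters, centroids):
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's mean
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids, clusters


# Hypothetical 2-D observations forming two visible groups (k = 2).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, [(1.0, 1.0), (9.0, 8.5)])
print(centroids)
print(clusters)
```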
Linear Regression: In statistics, linear regression is a linear approach to modeling the relationship between a scalar response and one or more explanatory variables.
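For the simplest case, one explanatory variable, the least-squares line can be computed directly from means and sums. The sketch below assumes ordinary least squares as the fitting method; the data and the function name `fit_line` are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data generated from y = 2x + 1 with no noise.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]
slope, intercept = fit_line(xs, ys)
print("slope:", slope, "intercept:", intercept)
```

Here x is the explanatory variable and y is the scalar response; with several explanatory variables the same idea is usually solved with matrix methods or a library.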
MapReduce: A programming model for processing and generating big data sets with a parallel, distributed algorithm on a cluster.
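The shape of the model can be shown with the classic word-count example, run here in a single process. This is only a sketch of the map, shuffle, and reduce phases; a real framework distributes these phases across the machines of a cluster. The function names and the sample documents are assumptions for this example.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


# Hypothetical input; in a cluster each document could live on a different node.
documents = ["big data big ideas", "data lake data science"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

Because each document is mapped independently and each key is reduced independently, both phases can run in parallel, which is what makes the model suit large data sets.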
Machine Learning (ML): A subset of AI concerned with the study of computer algorithms that improve automatically through experience without being explicitly programmed.
Metadata: Data that provides context or additional information about other data, such as a document’s title, subject, author, revisions, and file size.
Natural Language Processing (NLP): The automatic manipulation of natural human language using computational linguistics and artificial intelligence.
Natural Language Understanding (NLU): A branch of artificial intelligence (AI) that uses computer software to understand input in the form of sentences, in text or speech format.
Normalizing: In the context of databases, the process of organizing data into tables in a relational database so that data redundancy is reduced.
Ordinal Variable: A variable that takes discrete values with a meaningful order, such as low, medium, and high. It can be considered to lie between categorical and quantitative variables.
Parsing: The process of breaking a block of data, such as a string, into smaller parts by following a set of rules, so that it can be more easily analyzed, managed, or processed by a computer.
Predictive Analytics: The branch of advanced analytics that uses historical data to make predictions about unknown future events.
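As a small illustration, the sketch below parses a string into named fields by following two simple rules: fields are separated by semicolons, and each field is a key and a value joined by an equals sign. The record format and the function name `parse_record` are hypothetical, invented for this example.

```python
def parse_record(line):
    """Parse 'key=value; key=value' text into a dictionary of strings."""
    record = {}
    for field in line.split(";"):
        field = field.strip()
        if not field:
            continue  # tolerate a trailing or doubled semicolon
        key, sep, value = field.partition("=")
        if not sep:
            raise ValueError(f"missing '=' in field: {field!r}")
        record[key.strip()] = value.strip()
    return record


record = parse_record("title=Report; author=Ana; size=2048")
print(record)
```

Once the string is split into a dictionary, a program can look up individual parts such as `record["author"]` instead of searching the raw text.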
Sentiment Analysis: The process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer’s attitude towards a particular topic, product, etc. is positive, negative, or neutral.
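A very simple way to sketch this is a word-list approach: count positive and negative words and compare the totals. The tiny word lists and the function name `sentiment` below are assumptions for illustration only; real systems typically use much larger lexicons or trained models and handle negation, sarcasm, and context.

```python
import re

# Tiny hypothetical word lists, for illustration only.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}


def sentiment(text):
    """Classify text as positive, negative, or neutral by word counts."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for review in ["I love this great product",
               "Terrible battery and poor screen",
               "It arrived on Tuesday"]:
    print(review, "->", sentiment(review))
```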
Telemetry: An automated communication process by which measurements and other data are collected at remote points and transmitted to receiving equipment for monitoring and analysis.
Virtualization: In computing, virtualization refers to the act of creating a virtual version of something, including virtual computer hardware platforms, storage devices, and computer network resources.