25 Data Terms Every Data Scientist Should Know - Coursera Blog

25 Data Terms Every Data Scientist Should Know

Common data science terms your manager expects you to know.

Data science is, among other things, a language, according to Robert Brunner, a professor in the School of Information Sciences at the University of Illinois. This concept might come as a shock to those who associate data science jobs with numbers alone.

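Data scientists increasingly work across entire organizations, so communication skills are as important as technical ability. Data science is booming in every industry as more people and companies invest time in better understanding this constantly expanding field. The ability to communicate effectively is a key talent differentiator.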

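Whether you are pursuing a deeper knowledge of data science by learning a specialty, or simply want to gain a smart overview of the field, mastering the right terms will set you up for success on your educational and professional journey.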

According to Vinod Bakthavachalam, a senior data scientist at Coursera, using the following data science terms accurately will help you stand out from the crowd:

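Business Intelligence (BI). BI is the process of analyzing and reporting historical data to guide future decision-making. BI helps leaders make better strategic decisions by using data, such as sales statistics and operational metrics, to determine what happened in the past.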
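Data Engineering. Data engineers build the infrastructure through which data is gathered, cleaned, stored and prepped for use by data scientists. Good engineers are invaluable, and building a data science team without them is a "cart before the horse" approach.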
Decision Science. Under the umbrella of data science, decision scientists apply math and technology to solve business problems and add in behavioral science and design thinking (a process that aims to better understand the end user).
Artificial Intelligence (AI). AI computer systems can perform tasks that normally require human intelligence, such as speech recognition, decision-making and language translation. This doesn't necessarily mean replicating the human mind, but instead involves using human reasoning as a model to provide better services or create better products.
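Machine Learning. A subset of AI, machine learning refers to the process by which a system learns from input data by identifying patterns in that data and then applying those patterns to new problems or requests. It allows data scientists to teach a computer to carry out tasks, rather than programming it to carry out each task step by step. It is used, for example, to learn a consumer's preferences and buying patterns in order to recommend products on Amazon, or to sift through resumes and identify the highest-potential job candidates based on key words and phrases.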
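Supervised Learning. This is a specific type of machine learning in which the data scientist acts as a guide, teaching the algorithm the desired conclusion. For example, a computer learns to identify animals by being trained on a dataset of images that are correctly labeled with each species and its characteristics.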
Classification is an example of supervised learning in which an algorithm puts a new piece of data under a pre-existing category, based on a set of characteristics for which the category is already known. For example, it can be used to determine if a customer is likely to spend over $20 online, based on their similarity to other customers who have previously spent that amount.
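Cross-validation is a method for validating the stability, or accuracy, of a machine learning model. Although there are several types of cross-validation, the most basic one involves splitting your training set in two and training the algorithm on one subset before applying it to the second subset. Because you know what output you should receive, you can assess the model's validity.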
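Clustering is classification without the supervised learning aspect. With clustering, the algorithm finds similarities in the data itself by grouping together data points that are alike.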
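Deep Learning. A more advanced form of machine learning, deep learning refers to systems with multiple input/output layers, as opposed to shallow systems with a single input/output layer. In deep learning, several rounds of data input/output are required to help computers solve complex, real-world problems. A deep dive can be found here.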
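Linear Regression. Linear regression models the relationship between two variables by fitting a linear equation to the observed data. By doing so, you can predict an unknown variable based on its related known variable. A simple example is the relationship between an individual's height and weight.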
A/B Testing. Generally used in product development, A/B testing is a randomized experiment in which you test two variants to determine the best course of action. For example, Google famously tested various shades of blue to determine which shade earned the most clicks.
Hypothesis Testing. Hypothesis testing is the use of statistics to determine the probability that a given hypothesis is true. It's frequently used in clinical research.
Statistical Power. Statistical power is the probability of making the correct decision to reject the null hypothesis when the null hypothesis is false. In other words, it's the likelihood a study will detect an effect when there is an effect to be detected. A high statistical power means a lower likelihood of concluding incorrectly that a variable has no effect.
Standard Error. Standard error is the measure of the statistical accuracy of an estimate. A larger sample size decreases the standard error.
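Causal inference is a process that tests whether there is a cause-and-effect relationship in a given situation, which is the goal of many data analyses in the social and health sciences. These analyses typically require not only good data and algorithms, but also subject-matter expertise.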
Exploratory Data Analysis (EDA). EDA is often the first step when analyzing datasets. With EDA techniques, data scientists can summarize a dataset's main characteristics and inform the development of more complex models or logical next steps.
Data Visualization. A key component of data science, data visualization is the visual representation of text-based information, used to better detect and recognize patterns, trends and correlations. It helps people understand the significance of data by placing it in a visual context.
R. R is a programming language and software environment for statistical computing. The R language is widely used among statisticians and data miners for developing statistical software and data analysis.
Python is a programming language for general-purpose programming and is one language used to manipulate and store data. Many highly trafficked websites, such as YouTube, are created using Python.
SQL. Structured Query Language, or SQL, is another programming language, used to perform tasks such as updating or retrieving data in a database.
ETL. ETL is a type of data integration that refers to the three steps (extract, transform, load) used to blend data from multiple sources. It's often deployed to build a data warehouse. An important aspect of this data warehousing is that it consolidates data from multiple sources and transforms it into a common, useful format. For example, ETL normalizes data from multiple business departments and processes to make it standardized and consistent.
GitHub. GitHub is a code-sharing and publishing service, as well as a community for developers. It provides access control and several collaboration features, such as bug tracking, feature requests, task management and wikis for every project. GitHub offers both private repositories and free accounts, which are commonly used to host open-source software projects.
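Data Models define how datasets are connected to each other and how they are processed and stored inside a system. Data models show the structure of a database, including its relationships and constraints, which helps data scientists understand how data can best be stored and manipulated.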
Data Warehouse. A data warehouse is a repository where all the data collected by an organization is stored and used as a guide to make management decisions.
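
Three of the statistical terms above (linear regression, the basic split-in-two form of cross-validation, and standard error) can be made concrete with a short sketch. The Python below is only an illustration under stated assumptions: the height and weight numbers are synthetic, and the helper names `fit_line`, `holdout_rmse` and `standard_error` are hypothetical choices for this example, not part of any library or of the original article. It fits a line by ordinary least squares, trains on one half of the data and scores on the other half, and shows how the standard error of a mean shrinks as the sample grows.

```python
import math
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def holdout_rmse(xs, ys):
    """Train on the first half of the data, then score on the second half."""
    half = len(xs) // 2
    slope, intercept = fit_line(xs[:half], ys[:half])
    squared_errors = [
        (slope * x + intercept - y) ** 2 for x, y in zip(xs[half:], ys[half:])
    ]
    return math.sqrt(statistics.fmean(squared_errors))


def standard_error(sample):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


# Synthetic (made-up) data: weight in kg as a noisy linear function of height in cm.
rng = random.Random(0)
heights = [rng.uniform(150, 200) for _ in range(200)]
weights = [0.9 * h - 90 + rng.gauss(0, 5) for h in heights]

slope, intercept = fit_line(heights, weights)
rmse = holdout_rmse(heights, weights)
se_small = standard_error(weights[:20])
se_large = standard_error(weights)

print(f"slope={slope:.2f}, intercept={intercept:.2f}, holdout RMSE={rmse:.2f}")
print(f"standard error of mean weight: n=20 -> {se_small:.2f}, n=200 -> {se_large:.2f}")
```

Because the held-out half was never seen during fitting, its error is a fairer estimate of how the model behaves on new data than the error on the training half. In practice, a library such as scikit-learn would typically handle the splitting and fitting; the standard library is used here only to keep the sketch self-contained.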

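Mastering these terms is a great first step toward a lasting data science career. It is equally important to make sure they are understood throughout the organization, so that data scientists can work more effectively with their non-data-science partners. Like anything, this takes practice, but with these data science building blocks in place, you will have a natural advantage when opportunities arise.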
