Advanced Analytics with PySpark: Patterns for Learning from Data at Scale Using Python and Spark, 1st Edition, by Akash Tandon, Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills – Ebook PDF Instant Download/Delivery (ISBN-10: 1098103653, ISBN-13: 978-1098103651)

Product details:
ISBN-10: 1098103653
ISBN-13: 978-1098103651
Authors: Akash Tandon, Sandy Ryza, Uri Laserson, Sean Owen, Josh Wills
The amount of data being generated today is staggering and growing. Apache Spark has emerged as the de facto tool to analyze big data and is now a critical part of the data science toolbox. Updated for Spark 3.0, this practical guide brings together Spark, statistical methods, and real-world datasets to teach you how to approach analytics problems using PySpark, Spark’s Python API, and other best practices in Spark programming.
Data scientists Akash Tandon, Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills offer an introduction to the Spark ecosystem, then dive into patterns that apply common techniques, including classification, clustering, collaborative filtering, and anomaly detection, to fields such as genomics, security, and finance. This updated edition also covers NLP and image processing.
If you have a basic understanding of machine learning and statistics and you program in Python, this book will get you started with large-scale data analysis.
- Familiarize yourself with Spark’s programming model and ecosystem
- Learn general approaches in data science
- Examine complete implementations that analyze large public datasets
- Discover which machine learning tools make sense for particular problems
- Explore code that can be adapted to many uses (a minimal PySpark sketch in this spirit follows this list)
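
To give a flavor of the DataFrame-style analysis the book teaches, here is a minimal sketch of a PySpark session. It is an illustration only, not an example from the book: the file path data/events.csv and the column name category are hypothetical placeholders, and it assumes a local PySpark installation.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session: the entry point to the DataFrame API.
spark = SparkSession.builder.appName("quick-look").getOrCreate()

# Hypothetical CSV input; any file with a header row is read the same way.
df = spark.read.csv("data/events.csv", header=True, inferSchema=True)

# Basic exploration: inspect the schema, count rows, and aggregate by a column.
df.printSchema()
print(df.count())
(df.groupBy("category")
   .agg(F.count("*").alias("n"))
   .orderBy(F.desc("n"))
   .show(10))

spark.stop()

The same groupBy/agg pattern runs unchanged on a laptop or a cluster, which is the scaling story the book's chapter-length case studies build on.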
Table of contents:
- Analyzing Big Data
Working with Big Data
Introducing Apache Spark and PySpark
Components
PySpark
Ecosystem
Spark 3.0
PySpark Addresses Challenges of Data Science
Where to Go from Here
- Introduction to Data Analysis with PySpark
Spark Architecture
Installing PySpark
Setting Up Our Data
Analyzing Data with the DataFrame API
Fast Summary Statistics for DataFrames
Pivoting and Reshaping DataFrames
Joining DataFrames and Selecting Features
Scoring and Model Evaluation
Where to Go from Here
- Recommending Music and the Audioscrobbler Dataset
Setting Up the Data
Our Requirements for a Recommender System
Alternating Least Squares Algorithm
Preparing the Data
Building a First Model
Spot Checking Recommendations
Evaluating Recommendation Quality
Computing AUC
Hyperparameter Selection
Making Recommendations
Where to Go from Here
- Making Predictions with Decision Trees and Decision Forests
Decision Trees and Forests
Preparing the Data
Our First Decision Tree
Decision Tree Hyperparameters
Tuning Decision Trees
Categorical Features Revisited
Random Forests
Making Predictions
Where to Go from Here
- Anomaly Detection with K-means Clustering
K-means Clustering
Identifying Anomalous Network Traffic
KDD Cup 1999 Dataset
A First Take on Clustering
Choosing k
Visualization with SparkR
Feature Normalization
Categorical Variables
Using Labels with Entropy
Clustering in Action
Where to Go from Here
- Understanding Wikipedia with LDA and Spark NLP
Latent Dirichlet Allocation
LDA in PySpark
Getting the Data
Spark NLP
Setting Up Your Environment
Parsing the Data
Preparing the Data Using Spark NLP
TF-IDF
Computing the TF-IDFs
Creating Our LDA Model
Where to Go from Here
- Geospatial and Temporal Data Analysis on Taxi Trip Data
Preparing the Data
Converting Datetime Strings to Timestamps
Handling Invalid Records
Geospatial Analysis
Intro to GeoJSON
GeoPandas
Sessionization in PySpark
Building Sessions: Secondary Sorts in PySpark
Where to Go from Here
- Estimating Financial Risk
Terminology
Methods for Calculating VaR
Variance-Covariance
Historical Simulation
Monte Carlo Simulation
Our Model
Getting the Data
Preparing the Data
Determining the Factor Weights
Sampling
The Multivariate Normal Distribution
Running the Trials
Visualizing the Distribution of Returns
Where to Go from Here
- Analyzing Genomics Data and the BDG Project
Decoupling Storage from Modeling
Setting Up ADAM
Introduction to Working with Genomics Data Using ADAM
File Format Conversion with the ADAM CLI
Ingesting Genomics Data Using PySpark and ADAM
Predicting Transcription Factor Binding Sites from ENCODE Data
Where to Go from Here
- Image Similarity Detection with Deep Learning and PySpark LSH
PyTorch
Installation
Preparing the Data
Resizing Images Using PyTorch
Deep Learning Model for Vector Representation of Images
Image Embeddings
Import Image Embeddings into PySpark
Image Similarity Search Using PySpark LSH
Nearest Neighbor Search
Where to Go from Here
- Managing the Machine Learning Lifecycle with MLflow
Machine Learning Lifecycle
MLflow
Experiment Tracking
Managing and Serving ML Models
Creating and Using MLflow Projects
Where to Go from Here


