spark MLlib (1)

Scroll Down

Spark 机器学习笔记(1)

Spark 的基本数据类型

Local vector

A local vector has integer-typed and 0-based indices and double-typed values, stored on a single machine. MLlib supports two types of local vectors: dense and sparse. A dense vector is backed by a double array representing its entry values, while a sparse vector is backed by two parallel arrays: indices and values. For example, a vector (1.0, 0.0, 3.0) can be represented in dense format as [1.0, 0.0, 3.0] or in sparse format as (3, [0, 2], [1.0, 3.0]), where 3 is the size of the vector.

Local vector 支持三种类型的值,integer,double,0开始的索引型,vector分为dense(密集的)和sparse(稀疏的)
一个密集向量是一个二维数组代表每个键的值,一个稀疏的向量用的是两个平行的数组:索引和值。
比如说,一个向量(1.0, 0.0, 3.0),作为密集数组可以被表示为[1.0, 0.0, 3.0],但是作为一个稀疏向量,可以被表示为(3, [0, 2], [1.0, 3.0]),其中第一个3表示有三个元素,[0, 2]表示 的是存在的两个非零元素的下标(索引), [1.0, 3.0]即是存储的元素的值

创建一个Local vector

创建一个vector

[1.0, 0.0, 3.0]

import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

Vector dv = Vectors.dense(1,2,4,6,7);
Vector sv = Vectors.sparse(3,new int[]{0,2},new double[]{1,3});

注意这里导入的包
通过查看API文档,还有几种创建向量的方式

Vectors.dense(double[] values)              //由一个数组创建
Vectors.dense(double firstValue, double... otherValues)     //由值创建
Vectors.fromJson(String json)                   //还可以由js创建
Vectors.sparse(int size, int[] indices, double[] values)        //这种创建方式创建稀疏矩阵
//第一个参数表示大小,第二个参数是索引数组,第三个参数的值数组

Labeled point

A labeled point is a local vector, either dense or sparse, associated with a label/response. In MLlib, labeled points are used in supervised learning algorithms. We use a double to store a label, so we can use labeled points in both regression and classification. For binary classification, a label should be either 0 (negative) or 1 (positive). For multiclass classification, labels should be class indices starting from zero: 0, 1, 2, ....

Labeled point是一个本地的结合一个标签的向量,无论是密集还是稀疏。在MLlib中,Labeled point用于监督学习中,我们用一个double类型来存储一个标签。
在二元分类中,标签应该是0或1,对于多元分类中,标签应该从0开始,例如:0,1,2...

创建一个Labeled point

import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;

LabeledPoint pos = new LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0));
LabeledPoint neg = new LabeledPoint(0.0, Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0}));

Labeled Point 的创建可见是基于 Vector 的

打印结果如下

(1.0,[1.0,0.0,3.0])

Local matrix

A local matrix has integer-typed row and column indices and double-typed values, stored on a single machine. MLlib supports dense matrices, whose entry values are stored in a single double array in column-major order, and sparse matrices, whose non-zero entry values are stored in the Compressed Sparse Column (CSC) format in column-major order.

一个本地的矩阵有基于整型的行和列索引,和浮点型类型的值

创建一个Local matrix

import org.apache.spark.mllib.linalg.Matrix;
import org.apache.spark.mllib.linalg.Matrices;

// 创建一个密集矩阵 ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))
// 3 是三行, 2 是两列,后面是线性的值
// 注意这是一个列主序的矩阵,1,3,5都是第一列
Matrix dm = Matrices.dense(3, 2, new double[] {1.0, 3.0, 5.0, 2.0, 4.0, 6.0});

// 创建一个稀疏矩阵 ((9.0, 0.0), (0.0, 8.0), (0.0, 6.0))
// 同3行2列, 第一个int[]是一个新列开始的索引,第二个int[]是键的行数
// 因为是列主序,colPtrs中的1对应着6,从6起开始算第二列,其中0和3不变,因为这规定着到哪里结束
Matrix sm = Matrices.sparse(3, 2, new int[] {0, 1, 3}, new int[] {0, 2, 1}, new double[] {9, 6, 8});

API
public static Matrix sparse(int numRows,
                            int numCols,
                            int[] colPtrs,
                            int[] rowIndices,
                            double[] values)

Distributed matrix

A distributed matrix has long-typed row and column indices and double-typed values, stored distributively in one or more RDDs. It is very important to choose the right format to store large and distributed matrices. Converting a distributed matrix to a different format may require a global shuffle, which is quite expensive. Four types of distributed matrices have been implemented so far.

分布式矩阵,有一个long类型的行,和一个列索引,包括double类型的值,分布式地存储在一个或多个RDD上

RowMatrix

A RowMatrix is a row-oriented distributed matrix without meaningful row indices, backed by an RDD of its rows, where each row is a local vector. Since each row is represented by a local vector, the number of columns is limited by the integer range but it should be much smaller in practice.

一个行矩阵是一个面向行的分布式矩阵,其中没有有意义的行索引,用一个RDD作为他的行,其中每一行都是一个本地的向量

创建一个 RowMatrix

JavaRDD<Vector> rows = sc.parallelize(Arrays.asList(
                Vectors.dense(1, 2, 2),
                Vectors.dense(0, 2, 7),
                Vectors.dense(6, 0, 3),
                Vectors.dense(9, 2, 0)
        ));
RowMatrix mat = new RowMatrix(rows.rdd());
// Get its size.
long m = mat.numRows();
long n = mat.numCols();