# Spark 机器学习笔记(1)

## Spark 的基本数据类型

### Local vector

A local vector has integer-typed and 0-based indices and double-typed values, stored on a single machine. MLlib supports two types of local vectors: dense and sparse. A dense vector is backed by a double array representing its entry values, while a sparse vector is backed by two parallel arrays: indices and values. For example, a vector (1.0, 0.0, 3.0) can be represented in dense format as [1.0, 0.0, 3.0] or in sparse format as (3, [0, 2], [1.0, 3.0]), where 3 is the size of the vector.

Local vector 支持三种类型的值，integer，double，0开始的索引型，vector分为dense(密集的)和sparse(稀疏的)

#### 创建一个Local vector

[1.0, 0.0, 3.0]

import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

Vector dv = Vectors.dense(1,2,4,6,7);
Vector sv = Vectors.sparse(3,new int[]{0,2},new double[]{1,3});


Vectors.dense(double[] values)              //由一个数组创建
Vectors.dense(double firstValue, double... otherValues)     //由值创建
Vectors.fromJson(String json)                   //还可以由js创建
Vectors.sparse(int size, int[] indices, double[] values)        //这种创建方式创建稀疏矩阵
//第一个参数表示大小，第二个参数是索引数组，第三个参数的值数组


### Labeled point

A labeled point is a local vector, either dense or sparse, associated with a label/response. In MLlib, labeled points are used in supervised learning algorithms. We use a double to store a label, so we can use labeled points in both regression and classification. For binary classification, a label should be either 0 (negative) or 1 (positive). For multiclass classification, labels should be class indices starting from zero: 0, 1, 2, ....

Labeled point是一个本地的结合一个标签的向量，无论是密集还是稀疏。在MLlib中，Labeled point用于监督学习中，我们用一个double类型来存储一个标签。

#### 创建一个Labeled point

import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;

LabeledPoint pos = new LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0));
LabeledPoint neg = new LabeledPoint(0.0, Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0}));


Labeled Point 的创建可见是基于 Vector 的

(1.0,[1.0,0.0,3.0])


### Local matrix

A local matrix has integer-typed row and column indices and double-typed values, stored on a single machine. MLlib supports dense matrices, whose entry values are stored in a single double array in column-major order, and sparse matrices, whose non-zero entry values are stored in the Compressed Sparse Column (CSC) format in column-major order.

#### 创建一个Local matrix

import org.apache.spark.mllib.linalg.Matrix;
import org.apache.spark.mllib.linalg.Matrices;

// 创建一个密集矩阵 ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))
// 3 是三行， 2 是两列，后面是线性的值
// 注意这是一个列主序的矩阵，1，3，5都是第一列
Matrix dm = Matrices.dense(3, 2, new double[] {1.0, 3.0, 5.0, 2.0, 4.0, 6.0});

// 创建一个稀疏矩阵 ((9.0, 0.0), (0.0, 8.0), (0.0, 6.0))
// 同3行2列， 第一个int[]是一个新列开始的索引，第二个int[]是键的行数
// 因为是列主序，colPtrs中的1对应着6，从6起开始算第二列，其中0和3不变，因为这规定着到哪里结束
Matrix sm = Matrices.sparse(3, 2, new int[] {0, 1, 3}, new int[] {0, 2, 1}, new double[] {9, 6, 8});

API
public static Matrix sparse(int numRows,
int numCols,
int[] colPtrs,
int[] rowIndices,
double[] values)


### Distributed matrix

A distributed matrix has long-typed row and column indices and double-typed values, stored distributively in one or more RDDs. It is very important to choose the right format to store large and distributed matrices. Converting a distributed matrix to a different format may require a global shuffle, which is quite expensive. Four types of distributed matrices have been implemented so far.

### RowMatrix

A RowMatrix is a row-oriented distributed matrix without meaningful row indices, backed by an RDD of its rows, where each row is a local vector. Since each row is represented by a local vector, the number of columns is limited by the integer range but it should be much smaller in practice.

#### 创建一个 RowMatrix

JavaRDD<Vector> rows = sc.parallelize(Arrays.asList(
Vectors.dense(1, 2, 2),
Vectors.dense(0, 2, 7),
Vectors.dense(6, 0, 3),
Vectors.dense(9, 2, 0)
));
RowMatrix mat = new RowMatrix(rows.rdd());
// Get its size.
long m = mat.numRows();
long n = mat.numCols();