`sklearn.feature_extraction`.DictVectorizer

class sklearn.feature_extraction.DictVectorizer(dtype=<type 'numpy.float64'>, separator='=', sparse=True, sort=True)

Transforms lists of feature-value mappings to vectors.

This transformer turns lists of mappings (dict-like objects) of feature names to feature values into NumPy arrays or scipy.sparse matrices for use with scikit-learn estimators.

When feature values are strings, this transformer will do a binary one-hot (aka one-of-K) coding: one boolean-valued feature is constructed for each of the possible string values that the feature can take on. For instance, a feature “f” that can take on the values “ham” and “spam” will become two features in the output, one signifying “f=ham”, the other “f=spam”.

Features that do not occur in a sample (mapping) will have a zero value in the resulting array/matrix.
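The two rules above (one-of-K coding for string values, zeros for absent features) can be seen together in a minimal sketch. The sample data is invented for illustration, and the column order relies on the default sort=True shown in the class signature.

```python
from sklearn.feature_extraction import DictVectorizer

# Illustrative samples: "f" takes string values, "length" is numeric
# and is missing from the second sample.
samples = [
    {"f": "ham", "length": 3},
    {"f": "spam"},
]

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(samples)

# With sort=True the columns are ordered by constructed feature name:
# "f=ham", "f=spam", "length".
print(X)
```

The string feature "f" becomes two indicator columns, and the second row has a zero in the "length" column because that sample does not contain the feature.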

See also

FeatureHasher: performs vectorization using only a hash function.
sklearn.preprocessing.OneHotEncoder: handles nominal/categorical features encoded as columns of integers.

Examples

>>> from sklearn.feature_extraction import DictVectorizer
>>> v = DictVectorizer(sparse=False)
>>> D = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]
>>> X = v.fit_transform(D)
>>> X
array([[ 2.,  0.,  1.],
       [ 0.,  1.,  3.]])
>>> v.inverse_transform(X) == [{'bar': 2.0, 'foo': 1.0}, {'baz': 1.0, 'foo': 3.0}]
True
>>> v.transform({'foo': 4, 'unseen_feature': 3})
array([[ 0.,  0.,  4.]])

Methods

`fit`(X[, y])	Learn a list of feature name -> indices mappings.
`fit_transform`(X[, y])	Learn a list of feature name -> indices mappings and transform X.
`get_feature_names`()	Returns a list of feature names, ordered by their indices.
`get_params`([deep])	Get parameters for this estimator.
`inverse_transform`(X[, dict_type])	Transform array or sparse matrix X back to feature mappings.
`restrict`(support[, indices])	Restrict the features to those in support using feature selection.
`set_params`(**params)	Set the parameters of this estimator.
`transform`(X[, y])	Transform feature->value dicts to array or sparse matrix.

__init__(dtype=<type 'numpy.float64'>, separator='=', sparse=True, sort=True)

fit(X, y=None)

Learn a list of feature name -> indices mappings.

Parameters:

X : Mapping or iterable over Mappings

Dict(s) or Mapping(s) from feature names (arbitrary Python objects) to feature values (strings or convertible to dtype).

y : (ignored)

Returns:

self :

fit_transform(X, y=None)

Learn a list of feature name -> indices mappings and transform X.

Like fit(X) followed by transform(X), but does not require materializing X in memory.

Parameters:

X : Mapping or iterable over Mappings

Dict(s) or Mapping(s) from feature names (arbitrary Python objects) to feature values (strings or convertible to dtype).

y : (ignored)

Returns:

Xa : {array, sparse matrix}

Feature vectors; always 2-d.
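Because fit_transform makes a single pass over X, it should be possible to feed it a generator instead of a list. The sketch below assumes that behaviour; iter_samples is a hypothetical stand-in for a data source that yields one mapping at a time.

```python
from sklearn.feature_extraction import DictVectorizer


def iter_samples():
    # Hypothetical data source: yields mappings lazily, so the full
    # list of dicts never has to exist in memory at once.
    yield {"foo": 1, "bar": 2}
    yield {"foo": 3, "baz": 1}


vec = DictVectorizer(sparse=False)
X = vec.fit_transform(iter_samples())
print(X)
```

Calling fit and then transform separately would need two passes over the data, which an exhausted generator cannot provide.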

get_feature_names()

Returns a list of feature names, ordered by their indices.

If one-of-K coding is applied to categorical features, this will include the constructed feature names but not the original ones.
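A small sketch of what "constructed feature names" means in practice. It reads the fitted feature_names_ attribute, which is not described in this section; the assumption here is that it holds the same ordered list this method returns. The data is made up for illustration.

```python
from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer()
vec.fit([{"color": "red", "size": 2}, {"color": "blue", "size": 5}])

# Assumption: feature_names_ is the list that get_feature_names() returns.
names = list(vec.feature_names_)
print(names)

# The original categorical name "color" is replaced by one constructed
# name per observed value, joined with the separator ('=' by default).
print("color" in names)
```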

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep : boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params : mapping of string to any

Parameter names mapped to their values.

inverse_transform(X, dict_type=<type 'dict'>)

Transform array or sparse matrix X back to feature mappings.

X must have been produced by this DictVectorizer’s transform or fit_transform method; it may only have passed through transformers that preserve the number of features and their order.

In the case of one-hot/one-of-K coding, the constructed feature names and values are returned rather than the original ones.

Parameters:

X : {array-like, sparse matrix}, shape = [n_samples, n_features]

Sample matrix.

dict_type : callable, optional

Constructor for feature mappings. Must conform to the collections.Mapping API.

Returns:

D : list of dict_type objects, length = n_samples

Feature mappings for the samples in X.
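The following sketch illustrates both points made above: the dict_type argument controls the mapping class of the result, and one-of-K features come back under their constructed names. OrderedDict is used only as an example of a Mapping-compatible constructor.

```python
from collections import OrderedDict

from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer(sparse=False)
X = vec.fit_transform([{"f": "ham", "n": 2}, {"f": "spam"}])

# Each row of X becomes one OrderedDict. Columns holding zero are
# omitted, so the second mapping has no "n" entry.
restored = vec.inverse_transform(X, dict_type=OrderedDict)
for mapping in restored:
    print(dict(mapping))
```

Note that the first mapping contains the key "f=ham" with value 1.0 rather than the original pair "f": "ham".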

restrict(support, indices=False)

Restrict the features to those in support using feature selection.

This function modifies the estimator in-place.

Parameters:

support : array-like

Boolean mask or list of indices (as returned by the get_support member of feature selectors).

indices : boolean, optional

Whether support is a list of indices.

Returns:

self :

Examples

>>> from sklearn.feature_extraction import DictVectorizer
>>> from sklearn.feature_selection import SelectKBest, chi2
>>> v = DictVectorizer()
>>> D = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]
>>> X = v.fit_transform(D)
>>> support = SelectKBest(chi2, k=2).fit(X, [0, 1])
>>> v.get_feature_names()
['bar', 'baz', 'foo']
>>> v.restrict(support.get_support()) 
DictVectorizer(dtype=..., separator='=', sort=True,
        sparse=True)
>>> v.get_feature_names()
['bar', 'foo']
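The example above passes a boolean mask. As a complementary sketch, the same restriction can be expressed with integer indices by setting indices=True; the indices chosen here are picked by hand for illustration rather than produced by a feature selector.

```python
from sklearn.feature_extraction import DictVectorizer

D = [{"foo": 1, "bar": 2}, {"foo": 3, "baz": 1}]

vec = DictVectorizer(sparse=False)
vec.fit(D)  # learned columns, in order: bar, baz, foo

# Keep columns 0 ("bar") and 2 ("foo"); the vectorizer is modified in place.
vec.restrict([0, 2], indices=True)

# After restriction, "baz" is treated like any other unseen feature.
X = vec.transform(D)
print(X)
```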

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter>, so that each component of a nested object can be updated.

Returns:	self :
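A brief sketch of both uses: setting a parameter directly on a DictVectorizer, and setting the same parameter through a nested object. The step name "vec" in the pipeline is an arbitrary label chosen for this illustration.

```python
import numpy as np

from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline

# Simple estimator: parameters are addressed by their plain names.
vec = DictVectorizer()
vec.set_params(sparse=False)
X = vec.fit_transform([{"foo": 1}, {"bar": 2}])
print(type(X))  # dense output now that sparse=False

# Nested object: <component>__<parameter> reaches into the named step.
pipe = Pipeline([("vec", DictVectorizer())])
pipe.set_params(vec__sparse=False)
print(pipe.get_params()["vec__sparse"])
```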

transform(X, y=None)

Transform feature->value dicts to array or sparse matrix.

Named features not encountered during fit or fit_transform will be silently ignored.

Parameters:

X : Mapping or iterable over Mappings, length = n_samples

Dict(s) or Mapping(s) from feature names (arbitrary Python objects) to feature values (strings or convertible to dtype).

y : (ignored)

Returns:

Xa : {array, sparse matrix}

Feature vectors; always 2-d.
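With the default sparse=True, the result is a scipy.sparse matrix rather than an array. The sketch below shows that a single mapping still yields a 2-d result with one row, and that a feature not seen during fitting is dropped; the feature names are illustrative.

```python
from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer()  # sparse=True by default
vec.fit([{"foo": 1, "bar": 2}, {"foo": 3, "baz": 1}])

# A single mapping is treated as one sample.
Xa = vec.transform({"foo": 4, "unseen_feature": 3})
print(Xa.shape)

# Convert to a dense array for inspection; columns are bar, baz, foo.
dense = Xa.toarray()
print(dense)
```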

Examples using `sklearn.feature_extraction.DictVectorizer`

Feature Union with Heterogeneous Data Sources

FeatureHasher and DictVectorizer Comparison
