`sklearn.feature_extraction.text`.HashingVectorizer

class sklearn.feature_extraction.text.HashingVectorizer(input=u'content', encoding=u'utf-8', decode_error=u'strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern=u'(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer=u'word', n_features=1048576, binary=False, norm=u'l2', non_negative=False, dtype=numpy.float64)

Convert a collection of text documents to a matrix of token occurrences.

It turns a collection of text documents into a scipy.sparse matrix holding token occurrence counts (or binary occurrence information), optionally normalized as token frequencies if norm='l1' or projected onto the Euclidean unit sphere if norm='l2'.
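As a minimal usage sketch of the behaviour described above (the sample documents and the small `n_features` value are illustrative assumptions, chosen only to keep the output easy to inspect), the vectorizer can be applied directly with `transform`:

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer

# Hypothetical toy corpus, used only for illustration.
docs = [
    "the quick brown fox",
    "the lazy dog",
    "the quick dog jumps over the lazy fox",
]

# A small n_features keeps the example readable; real use typically needs more.
vectorizer = HashingVectorizer(n_features=2 ** 10, norm="l2")

# No fit is required because the transformer holds no learned state.
X = vectorizer.transform(docs)

# With norm='l2', every non-empty row should have unit Euclidean length.
row_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())
print(X.shape, row_norms)
```

The result is a sparse matrix with one row per document and `n_features` columns.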

This text vectorizer implementation uses the hashing trick to find the token string name to feature integer index mapping.

This strategy has several advantages:

it has a very low memory footprint and scales to large datasets, as there is no need to store a vocabulary dictionary in memory
it is fast to pickle and un-pickle as it holds no state besides the constructor parameters
it can be used in a streaming (partial fit) or parallel pipeline as there is no state computed during fit.

There are also some drawbacks (compared with using a CountVectorizer with an in-memory vocabulary):

there is no way to compute the inverse transform (from feature indices to string feature names), which can be a problem when trying to introspect which features are most important to a model.
there can be collisions: distinct tokens can be mapped to the same feature index. However, in practice this is rarely an issue if n_features is large enough (e.g. 2 ** 18 for text classification problems).
there is no IDF weighting, as this would make the transformer stateful.

The hash function employed is the signed 32-bit version of Murmurhash3.
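The idea behind the hashing trick, and why collisions become more likely as n_features shrinks, can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: `hashed_counts` is a hypothetical helper, and `zlib.crc32` stands in for MurmurHash3 purely because it ships with the standard library.

```python
import zlib
from collections import Counter


def hashed_counts(tokens, n_features):
    """Count tokens per hashed column index; no vocabulary is stored."""
    counts = Counter()
    for token in tokens:
        # Stand-in hash for illustration only (NOT MurmurHash3).
        index = zlib.crc32(token.encode("utf-8")) % n_features
        counts[index] += 1
    return dict(counts)


tokens = "the quick brown fox jumps over the lazy dog".split()

# A large table makes collisions between distinct tokens unlikely.
wide = hashed_counts(tokens, n_features=2 ** 18)

# With only 4 columns, the 8 distinct tokens must share indices.
narrow = hashed_counts(tokens, n_features=4)

print(len(wide), len(narrow))
```

Because the index is computed from the token alone, the mapping cannot be inverted back to token strings, which is the first drawback listed above.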

See also

CountVectorizer, TfidfVectorizer

Methods

`build_analyzer`()	Return a callable that handles preprocessing and tokenization
`build_preprocessor`()	Return a function to preprocess the text before tokenization
`build_tokenizer`()	Return a function that splits a string into a sequence of tokens
`decode`(doc)	Decode the input into a string of unicode symbols
`fit`(X[, y])	Does nothing: this transformer is stateless.
`fit_transform`(X[, y])	Transform a sequence of documents to a document-term matrix.
`get_params`([deep])	Get parameters for this estimator.
`get_stop_words`()	Build or fetch the effective stop words list
`partial_fit`(X[, y])	Does nothing: this transformer is stateless.
`set_params`(**params)	Set the parameters of this estimator.
`transform`(X[, y])	Transform a sequence of documents to a document-term matrix.

__init__(input=u'content', encoding=u'utf-8', decode_error=u'strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern=u'(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer=u'word', n_features=1048576, binary=False, norm=u'l2', non_negative=False, dtype=numpy.float64)

build_analyzer(): Return a callable that handles preprocessing and tokenization

build_preprocessor(): Return a function to preprocess the text before tokenization

build_tokenizer(): Return a function that splits a string into a sequence of tokens
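The three builder methods above can be compared on a single string. The sketch below assumes the default constructor arguments (lowercase=True, analyzer='word', no stop words); the sample sentence is made up for illustration.

```python
from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer()  # default settings assumed

preprocess = vectorizer.build_preprocessor()
tokenize = vectorizer.build_tokenizer()
analyze = vectorizer.build_analyzer()

text = "Hashing is FAST, a trick!"

# Preprocessing alone only lowercases here (strip_accents is None).
lowered = preprocess(text)

# Tokenizing alone keeps the case; the default token_pattern drops
# single-character tokens such as "a".
raw_tokens = tokenize(text)

# The analyzer chains preprocessing and tokenization.
tokens = analyze(text)

print(lowered)
print(raw_tokens)
print(tokens)
```

The analyzer output is what gets hashed into feature indices during `transform`.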

decode(doc)

Decode the input into a string of unicode symbols

The decoding strategy depends on the vectorizer parameters.

fit(X, y=None): Does nothing: this transformer is stateless.

fit_transform(X, y=None)

Transform a sequence of documents to a document-term matrix.

Parameters:

X : iterable over raw text documents, length = n_samples

Samples. Each sample must be a text document (either bytes or unicode strings, file name or file object depending on the constructor argument) which will be tokenized and hashed.

y : (ignored)

Returns:

X : scipy.sparse matrix, shape = (n_samples, self.n_features)

Document-term matrix.

fixed_vocabulary: DEPRECATED: The fixed_vocabulary attribute is deprecated and will be removed in 0.18. Please use fixed_vocabulary_ instead.

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep : boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params : mapping of string to any

Parameter names mapped to their values.

get_stop_words(): Build or fetch the effective stop words list

partial_fit(X, y=None)

Does nothing: this transformer is stateless.

This method exists only to signal that this transformer can be used in a streaming setup.
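A small sketch of the streaming use case follows. The `batches` generator is a hypothetical stand-in for any source that yields documents in chunks; because the transformer is stateless, transforming chunk by chunk should give the same matrix as transforming everything at once.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import HashingVectorizer


def batches():
    """Hypothetical document stream, yielding small batches."""
    yield ["first batch document", "another document"]
    yield ["second batch document"]


vectorizer = HashingVectorizer(n_features=2 ** 10)

chunks = []
for batch in batches():
    vectorizer.partial_fit(batch)  # no-op: nothing is learned
    chunks.append(vectorizer.transform(batch))

# Stack the per-batch matrices into one document-term matrix.
X_stream = sp.vstack(chunks)

# Transform the same documents in a single call for comparison.
all_docs = [doc for batch in batches() for doc in batch]
X_full = HashingVectorizer(n_features=2 ** 10).transform(all_docs)

print(np.allclose(X_stream.toarray(), X_full.toarray()))
```

The call to `partial_fit` could be dropped without changing the result; it is kept here only to mirror the API of estimators that do learn incrementally.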

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
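As a brief sketch of both cases, the example below sets parameters on a standalone vectorizer and on one nested in a pipeline. The step names "vect" and "clf" and the choice of SGDClassifier are arbitrary assumptions made for illustration.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Simple estimator: parameters are addressed by their plain names.
vectorizer = HashingVectorizer()
returned = vectorizer.set_params(n_features=2 ** 16, binary=True)

# Nested object: parameters use the <component>__<parameter> form.
pipeline = Pipeline([
    ("vect", HashingVectorizer()),  # step name chosen for this example
    ("clf", SGDClassifier()),       # any downstream estimator would do
])
pipeline.set_params(vect__n_features=2 ** 12)

print(vectorizer.get_params()["n_features"])
print(pipeline.get_params()["vect__n_features"])
```

Since `set_params` returns the estimator itself, calls can be chained.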

Returns: self

transform(X, y=None)

Transform a sequence of documents to a document-term matrix.

Parameters:

X : iterable over raw text documents, length = n_samples

Samples. Each sample must be a text document (either bytes or unicode strings, file name or file object depending on the constructor argument) which will be tokenized and hashed.

y : (ignored)

Returns:

X : scipy.sparse matrix, shape = (n_samples, self.n_features)