Trouble with Pipeline in sklearn

Username

I am new to sklearn. I am using Pipeline to combine a Vectorizer and a Classifier in a text mining problem. Here is my code:

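from sklearn import cross_validation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline

def create_ngram_model():
    tfidf_ngrams = TfidfVectorizer(ngram_range=(1, 3),
                                   analyzer="word", binary=False)
    clf = GaussianNB()
    pipeline = Pipeline([('vect', tfidf_ngrams), ('clf', clf)])
    return pipeline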


def get_trains():
    data=open('../cleaning data/cleaning the sentences/cleaned_comments.csv','r').readlines()[1:]
    lines=len(data)
    features_train=[]
    labels_train=[]
    for i in range(lines):
        l=data[i].split(',')
        labels_train+=[int(l[0])]
        a=l[2]
        features_train+=[a]
    return features_train,labels_train

def train_model(clf_factory,features_train,labels_train):
    features_train,labels_train=get_trains()
    features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(features_train, labels_train, test_size=0.1, random_state=42)
    clf=clf_factory()
    clf.fit(features_train,labels_train)
    pred = clf.predict(features_test)
    accuracy = accuracy_score(pred,labels_test)
    return accuracy

X,Y=get_trains()
print train_model(create_ngram_model,X,Y)

The features returned from get_trains() are strings. I get this error:

clf.fit(features_train,labels_train)
  File "C:\Python27\lib\site-packages\sklearn\pipeline.py", line 130, in fit
    self.steps[-1][-1].fit(Xt, y, **fit_params)
  File "C:\Python27\lib\site-packages\sklearn\naive_bayes.py", line 149, in fit
    X, y = check_arrays(X, y, sparse_format='dense')
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 263, in check_arrays
    raise TypeError('A sparse matrix was passed, but dense '
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

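I have run into this error many times. Previously I just changed the features to features_transformed.toarray(), but I can't do that here because the pipeline passes the transformed features on to the classifier automatically. I also tried writing a new class that returns features_transformed.toarray(), but it raised the same error. I have searched a lot without finding a solution. Please help!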

Artem Sobolev

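There are 2 options:

  1. Use a classifier that is compatible with sparse data. For example, the docs say that Bernoulli Naive Bayes and Multinomial Naive Bayes support sparse input for fit.

  2. Add a "densifier" to the Pipeline. Apparently you got yours wrong; this one works for me (when I need to densify sparse data along the way):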

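    class Densifier(object):
        """Pipeline step that converts a sparse matrix to a dense array."""
        def fit(self, X, y=None):
            # Nothing to learn; return self, as estimators conventionally do.
            return self
        def fit_transform(self, X, y=None):
            return self.transform(X)
        def transform(self, X, y=None):
            # scipy sparse matrix -> dense numpy array
            return X.toarray()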
    

    Make sure to put it in the pipeline before the classifier.
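
The two options above can be sketched side by side as follows. This is only a minimal illustration written against a recent scikit-learn on Python 3, not the code from the question: the toy `docs` and `labels` and the step name `to_dense` are made up for the example, and this `Densifier` variant subclasses `BaseEstimator` and `TransformerMixin` (an assumption, chosen so the step also works with tools that clone the pipeline).

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.pipeline import Pipeline


class Densifier(BaseEstimator, TransformerMixin):
    """Stateless step that turns a scipy sparse matrix into a dense array."""

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return X.toarray()


# Hypothetical toy data, only for illustration.
docs = ["good movie", "great film", "bad movie", "terrible film"]
labels = [1, 1, 0, 0]

# Option 1: a classifier that accepts sparse input directly.
sparse_pipeline = Pipeline([
    ("vect", TfidfVectorizer(ngram_range=(1, 3))),
    ("clf", MultinomialNB()),
])
sparse_pipeline.fit(docs, labels)

# Option 2: densify between the vectorizer and GaussianNB.
dense_pipeline = Pipeline([
    ("vect", TfidfVectorizer(ngram_range=(1, 3))),
    ("to_dense", Densifier()),
    ("clf", GaussianNB()),
])
dense_pipeline.fit(docs, labels)

sparse_pred = sparse_pipeline.predict(["good film"])
dense_pred = dense_pipeline.predict(["good film"])
print(sparse_pred, dense_pred)
```

Keep in mind that a dense copy of a word n-gram matrix can take a lot of memory on a real corpus, so option 1 is often the more practical choice for text.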
