Bis jetzt versuche ich den Fit-Generator für Sentiment-Analyse zu implementieren, da ich nur eine kleine PGU und große Datenmenge habe. Aber ich erhalte diesen FehlerKeras fit_generator() & Input-Array sollte das selbe wie Target-Samples haben
Using Theano backend.
Can not use cuDNN on context None: cannot compile with cuDNN. We got this error:
b'In file included from C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v8.0\\include/driver_types.h:53:0,\r\n from C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v8.0\\include/cudnn.h:63,\r\n from C:\\Users\\Def\\AppData\\Local\\Temp\\try_flags_p2iwer2o.c:4:\r\nC:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v8.0\\include/host_defines.h:84:0: warning: "__cdecl" redefined\r\n #define __cdecl\r\n ^\r\n<built-in>: note: this is the location of the previous definition\r\nd000029.o:(.idata$5+0x0): multiple definition of `__imp___C_specific_handler\'\r\nd000026.o:(.idata$5+0x0): first defined here\r\nC:/Users/Def/Anaconda3/envs/Final/Library/mingw-w64/bin/../lib/gcc/x86_64-w64-mingw32/5.3.0/../../../../x86_64-w64-mingw32/lib/../lib/crt2.o: In function `__tmainCRTStartup\':\r\nC:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:285: undefined reference to `_set_invalid_parameter_handler\'\r\ncollect2.exe: error: ld returned 1 exit status\r\n'
Mapped name None to device cuda: GeForce GTX 960M (0000:01:00.0)
Epoch 1/10
Traceback (most recent call last):
File "C:/Users/Def/PycharmProjects/KerasUkExpenditure/TweetParsing.py", line 136, in <module>
epochs=10)
File "C:\Users\Def\Anaconda3\envs\Final\lib\site-packages\keras\legacy\interfaces.py", line 88, in wrapper
return func(*args, **kwargs)
File "C:\Users\Def\Anaconda3\envs\Final\lib\site-packages\keras\models.py", line 1097, in fit_generator
initial_epoch=initial_epoch)
File "C:\Users\Def\Anaconda3\envs\Final\lib\site-packages\keras\legacy\interfaces.py", line 88, in wrapper
return func(*args, **kwargs)
File "C:\Users\Def\Anaconda3\envs\Final\lib\site-packages\keras\engine\training.py", line 1876, in fit_generator
class_weight=class_weight)
File "C:\Users\Def\Anaconda3\envs\Final\lib\site-packages\keras\engine\training.py", line 1614, in train_on_batch
check_batch_axis=True)
File "C:\Users\Def\Anaconda3\envs\Final\lib\site-packages\keras\engine\training.py", line 1307, in _standardize_user_data
_check_array_lengths(x, y, sample_weights)
File "C:\Users\Def\Anaconda3\envs\Final\lib\site-packages\keras\engine\training.py", line 229, in _check_array_lengths
'and ' + str(list(set_y)[0]) + ' target samples.')
ValueError: Input arrays should have the same number of samples as target arrays. Found 1000 input samples and 1 target samples.
Ich habe eine Matrix, die 1000 Elemente lang ist, da ich nur maximal Korpus von 1000 Worten hat, die in den Tokenizer() angegeben ist.
Ich habe dann die Stimmung, die entweder eine 0 für negative oder eine 1 für positiv ist.
Meine Frage ist, warum erhalte ich den Fehler? Ich habe versucht, die Transformation sowohl für die Daten als auch für die Beschriftungen zu verwenden und erhalte immer noch denselben Fehler. Hier ist mein Code.
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.preprocessing.text import Tokenizer
import numpy as np
import pandas as pd
import pickle
import matplotlib.pyplot as plt
import re
"""
the amount of samples out to the 1 million to use, my 960m 2GB can only handle
about 30,000ish at the moment depending on a number of neurons in the
deep layer and a number of layers.
"""
maxSamples = 3000
#Load the CSV and get the correct columns
data = pd.read_csv("C:\\Users\\Def\\Desktop\\Sentiment Analysis Dataset1.csv")
dx = pd.DataFrame()
dy = pd.DataFrame()
dy[['Sentiment']] = data[['Sentiment']]
dx[['SentimentText']] = data[['SentimentText']]
dataY = dy.iloc[0:maxSamples]
dataX = dx.iloc[0:maxSamples]
testY = dy.iloc[maxSamples: maxSamples + 1000]
testX = dx.iloc[maxSamples: maxSamples + 1000]
"""
here I filter the data and clean it up by removing @ tags, hyperlinks and
also any characters that are not alpha-numeric.
"""
def removeTagsAndLinks(dataframe):
for x in dataframe.iterrows():
#Removes Hyperlinks
x[1].values[0] = re.sub("(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\[email protected]?^=%&/~+#-])?", "", str(x[1].values[0]))
#Removes @ tags
x[1].values[0] = re.sub("@\\w+", '', str(x[1].values[0]))
#keeps only alpha-numeric chars
x[1].values[0] = re.sub("\W+", ' ', str(x[1].values[0]))
return dataframe
xData = removeTagsAndLinks(dataX)
xTest = removeTagsAndLinks(testX)
"""
This loop looks for any Tweets with characters shorter than 2 and once found write the
index of that Tweet to an array so I can remove from the Dataframe of sentiment and the
list of Tweets later
"""
indexOfBlankStrings = []
for index, string in enumerate(xData):
if len(string) < 2:
indexOfBlankStrings.append(index)
for row in indexOfBlankStrings:
dataY.drop(row, axis=0, inplace=True)
"""
This makes a BOW model out of all the tweets then creates a
vector for each of the tweets containing all the words from
the BOW model, each vector is the same size becuase the
network expects it
"""
def vectorise(tokenizer, list):
return tokenizer.fit_on_texts(list)
#Make BOW model and vectorise it
t = Tokenizer(lower=False, num_words=1000)
t.fit_on_texts(dataX.iloc[:,0].tolist())
t.fit_on_texts(dataX.iloc[:,0].tolist())
"""
Here im experimenting with multiple layers of the total
amount of words in the syllabus divided by ^2 - This
has given me quite accurate results compared to random guess's
of amount of neron's.
"""
l1 = int(xData.shape[0]/4) #To big for my GPU
l2 = int(xData.shape[0]/8) #To big for my GPU
l3 = int(xData.shape[0]/16)
l4 = int(xData.shape[0]/32)
l5 = int(xData.shape[0]/64)
l6 = int(xData.shape[0]/128)
#Make the model
model = Sequential()
model.add(Dense(l1, input_dim=xData.shape[1]))
model.add(Dropout(0.15))
model.add(Dense(l2))
model.add(Dropout(0.2))
model.add(Dense(l3))
model.add(Dropout(0.2))
model.add(Dense(l4))
model.add(Dense(1, activation='relu'))
#Compile the model
model.compile(optimizer='RMSProp', loss='binary_crossentropy', metrics=['acc'])
"""
This here will use multiple batches to train the model.
startIndex:
This is the starting index of the array for which you want to
start training the network from.
dataRange:
The number of elements use to train the network in each batch so
since dataRange = 1000 this mean it goes from
startIndex...dataRange OR 0...1000
amountOfEpochs:
This is kinda self explanitory, the more Epochs the more it
is supposed to learn AKA updates the optimisation algo numbers
"""
amountOfEpochs = 1
dataRange = 1000
startIndex = 0
def generator(tokenizer, data, labels, totalSize=maxSamples, startIndex=0):
l = labels.as_matrix()
while True:
for i in range(startIndex, totalSize):
batch_features = tokenizer.texts_to_matrix(xData.iloc[i])
batch_labels = l[i]
yield batch_features, batch_labels
derp = generator(t, data=xData, labels=dataY)
##This runs the model for batch AKA load a little them process then load a little more
for amountOfData in range(1000, maxSamples, 1000):
#(loss, acc) = model.train_on_batch(x=dim[startIndex:amountOfData], y=np.asarray(dataY.iloc[startIndex:amountOfData]))
history = model.fit_generator(generator=generator(tokenizer=t,
data=xData,
labels=dataY),
steps_per_epoch=1,
epochs=10)
Dank
Das Problem wäre, ist Sie haben 1000 Proben in Ihnen Eingangsmatrix X und 1 in der Ausgabe Y-Matrix – DJK
Aber die 1 in der Y-Matrix ist das Gefühl. Es sollte nur 1 oder 0 für jeden Tweet sein – Definity