OpenNLP's input format is very flexible. If you want to use the MaxEnt classifier in OpenNLP, a few steps are required.
Here is example code with comments:
package example;

import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import opennlp.tools.ml.maxent.GISTrainer;
import opennlp.tools.ml.model.Event;
import opennlp.tools.ml.model.MaxentModel;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import opennlp.tools.util.FilterObjectStream;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class ReadData {
    public static void main(String[] args) throws Exception {
        // This is the data file.
        // The format is <LIST of FEATURES separated by spaces> <outcome>.
        // Change the file to fit your needs.
        File f = new File("football.dat");

        // We need to create an ObjectStream of events for the trainer.
        // First create an InputStreamFactory: given a file it can create an
        // InputStream, which is required for resetting the stream.
        MarkableFileInputStreamFactory factory = new MarkableFileInputStreamFactory(f);

        // Create a PlainTextByLineStream. Note: you can write your own stream
        // to handle binary files or records that span more than one line.
        ObjectStream<String> stream = new PlainTextByLineStream(factory, Charset.defaultCharset());

        // Now that you have a stream of strings, convert it to a stream of events.
        // I use a custom FilterObjectStream that takes a line, breaks it into tokens,
        // and uses all but the last token as the features (context) and the last
        // token as the outcome class.
        ObjectStream<Event> eventStream = new FilterObjectStream<String, Event>(stream) {
            @Override
            public Event read() throws IOException {
                String line = samples.read();
                if (line == null) return null;
                String[] parts = WhitespaceTokenizer.INSTANCE.tokenize(line);
                String[] context = Arrays.copyOf(parts, parts.length - 1);
                System.out.println(parts[parts.length - 1] + " " + Arrays.toString(context));
                return new Event(parts[parts.length - 1], context);
            }
        };

        TrainingParameters parameters = new TrainingParameters();
        // By default OpenNLP uses a cutoff of 5 (a feature has to occur 5 times
        // before it is used). I use 1 for my small dataset.
        parameters.put(GISTrainer.CUTOFF_PARAM, 1);

        GISTrainer trainer = new GISTrainer();
        // The report map is supposed to record when default values are assigned.
        Map<String, String> reportMap = new HashMap<>();
        // Don't forget to initialize the trainer!
        trainer.init(parameters, reportMap);
        MaxentModel model = trainer.train(eventStream);

        // Now we have a model. You should evaluate on a separate test set, but
        // this is a toy example, so I am just resetting the event stream.
        eventStream.reset();
        Event evt = null;
        while ((evt = eventStream.read()) != null) {
            System.out.print(Arrays.toString(evt.getContext()) + ": ");
            // Evaluate the context from the event using our model.
            // In practice you would want to calculate summary statistics here.
            double[] p = model.eval(evt.getContext());
            System.out.print(model.getBestOutcome(p) + " ");
            if (model.getBestOutcome(p).equals(evt.getOutcome())) {
                System.out.println("CORRECT");
            } else {
                System.out.println("INCORRECT");
            }
        }
    }
}
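The comment in the evaluation loop mentions summary statistics without showing any. Below is a minimal sketch of one way to tally them. `AccuracySummary` and its methods are hypothetical helpers written for this illustration, not part of OpenNLP, and the sketch uses only the standard library so it can be run on its own.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper (not an OpenNLP class): tallies predictions against gold outcomes.
class AccuracySummary {
    private int total = 0;
    private int correct = 0;
    // outcome -> {correct count, seen count}; TreeMap keeps the report order stable
    private final Map<String, int[]> perOutcome = new TreeMap<>();

    // Record one prediction for an event whose true outcome is `gold`.
    public void record(String gold, String predicted) {
        int[] counts = perOutcome.computeIfAbsent(gold, k -> new int[2]);
        counts[1]++;
        total++;
        if (gold.equals(predicted)) {
            counts[0]++;
            correct++;
        }
    }

    // Overall fraction of correct predictions; 0.0 if nothing was recorded.
    public double accuracy() {
        return total == 0 ? 0.0 : (double) correct / total;
    }

    // Number of events with this gold outcome that were predicted correctly.
    public int correctFor(String outcome) {
        int[] counts = perOutcome.get(outcome);
        return counts == null ? 0 : counts[0];
    }

    // Number of events seen with this gold outcome.
    public int seenFor(String outcome) {
        int[] counts = perOutcome.get(outcome);
        return counts == null ? 0 : counts[1];
    }

    public static void main(String[] args) {
        AccuracySummary summary = new AccuracySummary();
        summary.record("arsenal", "arsenal");
        summary.record("tie", "arsenal");
        summary.record("man_united", "man_united");
        summary.record("arsenal", "arsenal");
        System.out.println(summary.accuracy()); // prints 0.75
    }
}
```

In the loop above, the call would presumably be `summary.record(evt.getOutcome(), model.getBestOutcome(p))`, with the totals printed after the loop ends.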
football.dat:
home=man_united Beckham=false Scholes=true Neville=true Henry=true Kanu=true Parlour=false Ferguson=confident Wengler=tense arsenal_lost_previous man_united_won_previous arsenal
home=man_united Beckham=true Scholes=false Neville=true Henry=false Kanu=true Parlour=false Ferguson=tense Wengler=confident arsenal_won_previous man_united_lost_previous man_united
home=man_united Beckham=false Scholes=true Neville=true Henry=true Kanu=true Parlour=false Ferguson=tense Wengler=tense arsenal_lost_previous man_united_won_previous tie
home=man_united Beckham=true Scholes=true Neville=false Henry=true Kanu=false Parlour=false Ferguson=confident Wengler=confident arsenal_won_previous man_united_won_previous tie
home=man_united Beckham=false Scholes=true Neville=true Henry=true Kanu=true Parlour=false Ferguson=confident Wengler=tense arsenal_won_previous man_united_won_previous arsenal
home=man_united Beckham=false Scholes=true Neville=true Henry=false Kanu=true Parlour=false Ferguson=confident Wengler=confident arsenal_won_previous man_united_won_previous man_united
home=man_united Beckham=true Scholes=true Neville=false Henry=true Kanu=true Parlour=false Ferguson=confident Wengler=tense arsenal_won_previous man_united_won_previous man_united
home=arsenal Beckham=false Scholes=true Neville=true Henry=true Kanu=true Parlour=false Ferguson=confident Wengler=tense arsenal_lost_previous man_united_won_previous arsenal
home=arsenal Beckham=true Scholes=false Neville=true Henry=false Kanu=true Parlour=false Ferguson=tense Wengler=confident arsenal_won_previous man_united_lost_previous arsenal
home=arsenal Beckham=false Scholes=true Neville=true Henry=true Kanu=true Parlour=false Ferguson=tense Wengler=tense arsenal_lost_previous man_united_won_previous tie
home=arsenal Beckham=true Scholes=true Neville=false Henry=true Kanu=false Parlour=false Ferguson=confident Wengler=confident arsenal_won_previous man_united_won_previous man_united
home=arsenal Beckham=false Scholes=true Neville=true Henry=true Kanu=true Parlour=false Ferguson=confident Wengler=tense arsenal_won_previous man_united_won_previous arsenal
home=arsenal Beckham=false Scholes=true Neville=true Henry=false Kanu=true Parlour=false Ferguson=confident Wengler=confident arsenal_won_previous man_united_won_previous man_united
home=arsenal Beckham=true Scholes=true Neville=false Henry=true Kanu=true Parlour=false Ferguson=confident Wengler=tense arsenal_won_previous man_united_won_previous arsenal
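To make the line format concrete, here is a small standalone sketch of the split that the custom stream performs: every token except the last is a feature, and the last token is the outcome. `DataLine` is a hypothetical class for illustration only. It splits on whitespace with a regex instead of OpenNLP's `WhitespaceTokenizer`, and it adds a guard for blank or single-token lines that the original stream does not have.

```java
import java.util.Arrays;

// Hypothetical holder for one parsed line of football.dat (not an OpenNLP class).
class DataLine {
    final String[] context; // all tokens except the last
    final String outcome;   // the last token

    DataLine(String[] context, String outcome) {
        this.context = context;
        this.outcome = outcome;
    }

    // Parses "<feature> <feature> ... <outcome>".
    // Returns null for a blank line; rejects lines with no feature before the outcome.
    static DataLine parse(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return null;
        }
        String[] parts = trimmed.split("\\s+");
        if (parts.length < 2) {
            throw new IllegalArgumentException(
                "Expected at least one feature and an outcome: " + line);
        }
        String[] context = Arrays.copyOf(parts, parts.length - 1);
        return new DataLine(context, parts[parts.length - 1]);
    }

    public static void main(String[] args) {
        DataLine d = parse("home=arsenal Henry=true arsenal_won_previous arsenal");
        System.out.println(d.outcome + " " + Arrays.toString(d.context));
    }
}
```

Features do not need the `name=value` shape; tokens such as `arsenal_won_previous` are simply treated as present or absent.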
Hope this helps.
I had already solved the problem, but thanks for the detailed answer!
Do you also know how to pass both text features and numeric ones? That is, how do I tell the system to interpret some values as numeric, e.g. when I have a vector of real values as features?
(Sorry, it's been a while ...) I'm not sure whether OpenNLP handles numeric features. Have you considered using logistic regression with categorical and numeric values? – HowYaDoing
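On the numeric-feature question: one workaround that stays within the string-feature format used in this answer is to bin each numeric value into a categorical token before building the context. The sketch below is only an illustration of that idea under stated assumptions: `NumericBinning`, the feature name `goals`, and the thresholds are all made up, and nothing here is an OpenNLP API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not an OpenNLP class): turns numeric values into categorical tokens.
class NumericBinning {
    // Maps a value to a token such as "goals=bin1".
    // thresholds must be sorted ascending: bin i holds values below thresholds[i],
    // and the last bin holds everything at or above the final threshold.
    static String binFeature(String name, double value, double[] thresholds) {
        int bin = 0;
        while (bin < thresholds.length && value >= thresholds[bin]) {
            bin++;
        }
        return name + "=bin" + bin;
    }

    // Builds one context array from text features plus binned numeric features.
    // The same thresholds are applied to every numeric feature to keep the sketch short.
    static String[] buildContext(String[] textFeatures, String[] numericNames,
                                 double[] numericValues, double[] thresholds) {
        List<String> context = new ArrayList<>(Arrays.asList(textFeatures));
        for (int i = 0; i < numericNames.length; i++) {
            context.add(binFeature(numericNames[i], numericValues[i], thresholds));
        }
        return context.toArray(new String[0]);
    }

    public static void main(String[] args) {
        double[] thresholds = {1.0, 2.0};
        String[] context = buildContext(
            new String[] {"home=arsenal", "Henry=true"},
            new String[] {"goals"},
            new double[] {1.5},
            thresholds);
        System.out.println(Arrays.toString(context));
    }
}
```

Binning throws away the ordering within a bin, so the thresholds matter; the same binning has to be applied at training time and at prediction time.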