It is possible to extract meaningful terms and concepts from unstructured information with the help of typed dependency parsers.
Unstructured information can be text files, PDFs, or MS Word documents buried on a hard drive, emails on an Exchange server, or even audio streams after they have been converted to text.
We won't spend much time going over tree structures and the like; let's dive straight into the good stuff.
Stanford Parser
ConceptExtractorImpl.java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

import org.apache.log4j.Logger;

import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.trees.GrammaticalStructure;
import edu.stanford.nlp.trees.GrammaticalStructureFactory;
import edu.stanford.nlp.trees.PennTreebankLanguagePack;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreebankLanguagePack;
import edu.stanford.nlp.trees.TypedDependency;

public class ConceptExtractorImpl {

    private static final Logger logger = Logger.getLogger(ConceptExtractorImpl.class);

    private final LexicalizedParser lexicalizedParser;
    private final GrammaticalStructureFactory grammaticalStructureFactory;

    private String lexicalizedParserFile = "conf/models/standford/englishPCFG.ser.gz";

    public ConceptExtractorImpl() {
        lexicalizedParser = new LexicalizedParser(lexicalizedParserFile);
        lexicalizedParser.setOptionFlags(new String[] { "-maxLength", "80",
                "-retainTmpSubcategories" });
        TreebankLanguagePack treebankLanguagePack = new PennTreebankLanguagePack();
        grammaticalStructureFactory = treebankLanguagePack.grammaticalStructureFactory();
    }

    // Stanford dependency tokens print as "word-index" (e.g. "lung-7");
    // strip the trailing index so only the word remains.
    public String removePosition(String text) {
        int dash = text.lastIndexOf('-');
        return dash >= 0 ? text.substring(0, dash) : text;
    }

    // Keep only the dependency relations that tend to carry concept-bearing word pairs.
    private boolean shouldUse(TypedDependency typedDependency) {
        String relation = typedDependency.reln().getShortName().trim();
        return relation.equalsIgnoreCase("nn")
                || relation.equalsIgnoreCase("prep")
                || relation.equalsIgnoreCase("dep")
                || relation.equalsIgnoreCase("conj_and")
                || relation.equalsIgnoreCase("num")
                || relation.equalsIgnoreCase("amod");
    }

    public List<String> extractConcepts(String sentence) {
        List<String> concepts = new ArrayList<String>();
        Tree tree = (Tree) lexicalizedParser.apply(sentence);
        GrammaticalStructure gs = grammaticalStructureFactory.newGrammaticalStructure(tree);
        Collection<TypedDependency> tdl = gs.typedDependenciesCCprocessed(false);
        for (TypedDependency typedDependency : tdl) {
            if (shouldUse(typedDependency)) {
                // Pair the dependent word with its governor, e.g. "lower" + "lung".
                String phrase = removePosition(typedDependency.dep().toString().toLowerCase())
                        + " "
                        + removePosition(typedDependency.gov().toString().toLowerCase());
                concepts.add(phrase);
            }
        }
        return concepts;
    }
}
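The removePosition helper exists because the Stanford parser prints each dependency token as "word-index" (for example "lung-7"). Here is a minimal, standalone sketch of just that stripping step; the token strings are made up for illustration and no parser dependency is needed:

```java
public class TokenIndexStripper {

    // Strip the trailing "-index" from a dependency token, e.g. "lung-7" -> "lung".
    // Tokens without a dash are returned unchanged.
    static String removePosition(String token) {
        int dash = token.lastIndexOf('-');
        return dash >= 0 ? token.substring(0, dash) : token;
    }

    public static void main(String[] args) {
        // Hypothetical tokens in the form the parser prints.
        System.out.println(removePosition("lung-7"));        // lung
        System.out.println(removePosition("lower-5"));       // lower
        System.out.println(removePosition("well-known-3"));  // well-known (only the last dash is the index)
    }
}
```

Using lastIndexOf matters for hyphenated words like "well-known-3": only the final dash separates the word from its position.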
The code above will extract "meaning" from the example sentence
The woman has cancer in lower left lung.
and produce the following concepts from it:
1. lower lung
2. left lung
3. lung cancer
After obtaining these terms and concepts, one usually maps them to a taxonomy.
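As a loose illustration of that mapping step, here is a sketch that looks extracted concepts up in a hand-built table. The taxonomy entries and category names are invented for the example; a real system would resolve against a published vocabulary such as SNOMED CT or MeSH:

```java
import java.util.HashMap;
import java.util.Map;

public class TaxonomyMapper {

    // Hypothetical taxonomy: concept phrase -> category path.
    private static final Map<String, String> TAXONOMY = new HashMap<String, String>();
    static {
        TAXONOMY.put("lung cancer", "Disease/Oncology");
        TAXONOMY.put("left lung", "Anatomy/Respiratory");
        TAXONOMY.put("lower lung", "Anatomy/Respiratory");
    }

    // Normalize the concept and look it up; unknown concepts fall through to "Unmapped".
    static String mapToTaxonomy(String concept) {
        String category = TAXONOMY.get(concept.toLowerCase());
        return category != null ? category : "Unmapped";
    }

    public static void main(String[] args) {
        System.out.println(mapToTaxonomy("lung cancer"));  // Disease/Oncology
        System.out.println(mapToTaxonomy("blue sky"));     // Unmapped
    }
}
```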
Labels: text analytics, Type Dependency Parsing