/**
 * Morphologically analyzes every token of a sentence.
 * Quote and hyphen characters are normalized before tokenization so that
 * typographic variants do not confuse the tokenizer.
 *
 * @param sentence raw input sentence.
 * @return one {@link WordAnalysis} per token, in token order.
 */
public List<WordAnalysis> analyzeSentence(String sentence) {
  String cleaned = TextUtil.normalizeQuotesHyphens(sentence);
  List<WordAnalysis> analyses = new ArrayList<>();
  // Analyze tokens one by one, preserving the tokenizer's order.
  for (Token t : tokenizer.tokenize(cleaned)) {
    analyses.add(analyze(t));
  }
  return analyses;
}
/**
 * Reads a UTF-8 text file line by line, normalizes quote and hyphen
 * characters, tokenizes each line and re-joins the tokens with single spaces.
 *
 * @param filename path of the file to read.
 * @return one processed (tokenized, space-joined) string per input line.
 * @throws IOException if the file cannot be read.
 */
public List<String> readAll(String filename) throws IOException {
  List<String> processed = new ArrayList<>();
  // NOTE(review): the underlying reader is not explicitly closed here —
  // confirm SimpleTextReader releases it once the iterator is exhausted.
  LineIterator lineIterator =
      SimpleTextReader.trimmingUTF8Reader(new File(filename)).getLineIterator();
  while (lineIterator.hasNext()) {
    String normalized = TextUtil.normalizeQuotesHyphens(lineIterator.next());
    processed.add(Joiner.on(" ").join(lexer.tokenizeToStrings(normalized)));
  }
  return processed;
}
// Normalize each line in place before further processing: apostrophes first,
// then quote/hyphen variants, then spaces and soft hyphens.
// NOTE(review): the loop body continues past this excerpt — its closing brace
// and the use of the normalized `line` are not visible here.
for (String line : lines) { line = TextUtil.normalizeApostrophes(line); line = TextUtil.normalizeQuotesHyphens(line); line = TextUtil.normalizeSpacesAndSoftHyphens(line);
// Normalization chain for a single string: unify quote/hyphen characters,
// collapse spaces and soft hyphens, then remove repeated symbols via the
// local helper. Fragment of an enclosing method not visible in this excerpt.
s = TextUtil.normalizeQuotesHyphens(s); s = TextUtil.normalizeSpacesAndSoftHyphens(s); s = removeMultipleSymbols(s);
/**
 * Runs named-entity recognition over the input file and writes the annotated
 * sentences to {@code <input name>.ne} under the output directory.
 *
 * <p>Pipeline: validate arguments, extract sentences from the input
 * paragraphs, load the morphology and the perceptron NER model, then for each
 * sentence apply the project's normalization chain, tokenize, tag entities
 * and write one training-format line per sentence. Throughput statistics are
 * logged at the end.
 *
 * @throws Exception if argument validation, model loading or file I/O fails.
 */
@Override
public void run() throws Exception {
  initializeOutputDir();
  IOUtil.checkDirectoryArgument(modelRoot, "Model Root");
  IOUtil.checkFileArgument(inputPath, "Input File");
  Path outputPath = outDir.resolve(inputPath.toFile().getName() + ".ne");

  List<String> inputLines = Files.readAllLines(inputPath, StandardCharsets.UTF_8);
  List<String> sentenceList = TurkishSentenceExtractor.DEFAULT.fromParagraphs(inputLines);
  Log.info("There are %d lines and about %d sentences", inputLines.size(), sentenceList.size());

  TurkishMorphology morphology = TurkishMorphology.createWithDefaults();
  PerceptronNer ner = PerceptronNer.loadModel(modelRoot, morphology);

  Stopwatch stopwatch = Stopwatch.createStarted();
  int totalTokens = 0;
  try (PrintWriter writer = new PrintWriter(outputPath.toFile(), "UTF-8")) {
    for (String s : sentenceList) {
      // Same normalization chain used elsewhere in the project:
      // apostrophes, quote/hyphen variants, spaces and soft hyphens.
      s = TextUtil.normalizeApostrophes(s);
      s = TextUtil.normalizeQuotesHyphens(s);
      s = TextUtil.normalizeSpacesAndSoftHyphens(s);
      List<String> tokens = TurkishTokenizer.DEFAULT.tokenizeToStrings(s);
      totalTokens += tokens.size();
      NerSentence tagged = ner.findNamedEntities(s, tokens);
      writer.println(tagged.getAsTrainingSentence(annotationStyle));
    }
  }

  double elapsedSeconds = stopwatch.elapsed(TimeUnit.MILLISECONDS) / 1000d;
  Log.info("Token count = %s", totalTokens);
  Log.info("File processed in %.4f seconds.", elapsedSeconds);
  Log.info("Speed = %.2f tokens/sec", totalTokens / elapsedSeconds);
  Log.info("Result is written in %s", outputPath);
}